<!DOCTYPE html>
<html lang="en-us">
<head><script src="/livereload.js?mindelay=10&v=2&port=1313&path=livereload" data-no-instant defer></script><meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="How I built a pipeline to translate Hindi PDFs to English using OCR and neural machine translation.">
<meta name="author" content="Divyam Ahuja">
<meta property="og:title" content="Building a PDF Translation Pipeline">
<meta property="og:description" content="How I built a pipeline to translate Hindi PDFs to English using OCR and neural machine translation.">
<meta property="og:type" content="article">
<meta property="og:url" content="http://localhost:1313/posts/pdf-translation-pipeline/">
<meta property="og:site_name" content="Divyam Ahuja">
<meta name="twitter:card" content="summary">
<meta name="twitter:title" content="Building a PDF Translation Pipeline">
<meta name="twitter:description" content="How I built a pipeline to translate Hindi PDFs to English using OCR and neural machine translation.">
<title>Building a PDF Translation Pipeline · Divyam Ahuja</title>
<link rel="icon" type="image/svg+xml" href="/favicon.svg">
<link rel="stylesheet" href="http://localhost:1313/css/style.min.86de29e37fd55fb8581ee5569d0e766097f2718b8f8029e4e11c87973d24a5b1.css">
</head>
<body>
<div class="site-wrapper"><header class="site-header">
<div class="site-header-inner">
<a href="/" class="site-title">Divyam Ahuja</a>
<nav class="site-nav"><a href="/">Home</a><span class="nav-separator">|</span><a href="/posts/">Blog</a><span class="nav-separator">|</span><a href="/projects/">Projects</a><span class="nav-separator">|</span><a href="/resume/">Resume</a></nav>
</div>
</header>
<main>
<article>
<div class="post-header">
<h1>Building a PDF Translation Pipeline</h1>
<div class="post-meta">
<span>DIVYAM AHUJA</span>
<span class="separator">·</span>
<time datetime="2026-02-11">11 Feb 2026</time>
<span class="separator">|</span>
<span class="post-tags">
<a href="http://localhost:1313/tags/python/" class="post-tag">#python</a>
<a href="http://localhost:1313/tags/ocr/" class="post-tag">#ocr</a>
<a href="http://localhost:1313/tags/ml/" class="post-tag">#ml</a>
</span>
</div>
</div>
<div class="post-content">
<p>Recently, I needed to translate a PDF document from Hindi to English. Sounds simple enough, right? Turns out, it’s a surprisingly deep rabbit hole.</p>
<h2 id="the-pipeline">The Pipeline</h2>
<p>The approach I settled on follows this flow:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">PDF → Images → OCR → Translation → Rendered Images → PDF
</span></span></code></pre></div><p>Each step has its own set of challenges:</p>
<ol>
<li><strong>PDF to Images</strong>: Convert each page to a high-DPI image for better OCR accuracy</li>
<li><strong>OCR</strong>: Extract text with position data using PaddleOCR</li>
<li><strong>Translation</strong>: Run extracted text through NLLB (No Language Left Behind)</li>
<li><strong>Rendering</strong>: Paint translated text back onto the original image</li>
<li><strong>Assembly</strong>: Combine rendered images back into a PDF</li>
</ol>
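<p>The steps above can be sketched as a small driver that wires the stages together. This is a minimal sketch rather than the code behind this post: <code>TextRegion</code>, <code>translate_page</code>, and <code>run_pipeline</code> are hypothetical names, and the <code>ocr</code>, <code>translate</code>, and <code>render</code> callables are placeholders for real backends such as PaddleOCR and NLLB. The PDF to image and image to PDF steps are left out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from dataclasses import dataclass
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels; the layout is an assumption.
Box = Tuple[int, int, int, int]


@dataclass
class TextRegion:
    """One OCR hit: the recognised text plus where it sits on the page."""
    text: str
    box: Box


def translate_page(
    regions: List[TextRegion],
    translate: Callable[[str], str],
) -> List[TextRegion]:
    """Translate each region's text and keep its bounding box unchanged."""
    return [TextRegion(translate(r.text), r.box) for r in regions]


def run_pipeline(pages, ocr, translate, render):
    """Run OCR -> translation -> rendering on each page image, in order.

    ocr, translate, and render are hypothetical placeholders for real
    backends (for example PaddleOCR, NLLB, and an image library).
    """
    rendered = []
    for page in pages:
        regions = ocr(page)                       # list of TextRegion
        translated = translate_page(regions, translate)
        rendered.append(render(page, translated))
    return rendered
</code></pre></div>
<p>Passing the stages in as plain callables keeps each one swappable, which may help when an OCR or translation backend needs replacing.</p>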
<h2 id="lessons-learned">Lessons Learned</h2>
<ul>
<li><strong>DPI matters a lot</strong>: bumping from 150 to 300 DPI dramatically improved OCR accuracy for Hindi text</li>
<li><strong>Font rendering is hard</strong>: getting translated text to fit in the same bounding boxes required careful font size calculation</li>
<li><strong>Fallback strategies</strong>: TrOCR served as a fallback when PaddleOCR failed on certain text regions</li>
</ul>
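<p>The font size calculation mentioned above can be approached as a search for the largest size whose rendered text still fits the OCR bounding box. The sketch below is one possible approach, not necessarily what this pipeline does: <code>fit_font_size</code> and <code>rough_measure</code> are hypothetical names, the size limits are arbitrary defaults, and <code>rough_measure</code> is a crude stand-in for a real text measurement (with Pillow, something like <code>ImageFont.truetype(path, size).getbbox(text)</code> could fill that role).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from typing import Callable, Tuple

# (text, font size) -> (width, height) in pixels
Measure = Callable[[str, int], Tuple[float, float]]


def fit_font_size(text: str, box_w: float, box_h: float, measure: Measure,
                  min_size: int = 6, max_size: int = 72) -> int:
    """Return the largest size in [min_size, max_size] whose text fits the box.

    Falls back to min_size if nothing fits. Assumes width and height grow
    with font size, which is what makes binary search valid here.
    """
    best = min_size
    lo, hi = min_size, max_size
    while hi >= lo:
        mid = (lo + hi) // 2
        w, h = measure(text, mid)
        if box_w >= w and box_h >= h:
            best = mid       # fits: try a larger size
            lo = mid + 1
        else:
            hi = mid - 1     # too big: try a smaller size
    return best


def rough_measure(text: str, size: int) -> Tuple[float, float]:
    """Crude stand-in: each glyph is 0.6 * size wide, a line is 1.2 * size tall."""
    return 0.6 * size * len(text), 1.2 * size
</code></pre></div>
<p>With the stand-in measure, <code>fit_font_size("Hello world", 200, 40, rough_measure)</code> returns <code>30</code>. English translations are often a different length from the Hindi source, so the result can vary a lot from box to box.</p>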
<p>The code is messy but it works. Sometimes that’s enough.</p>
</div>
<hr>
<a href="/" class="back-link">← back to home</a>
</article>
</main>
</div><footer class="site-footer">
<div class="site-wrapper">
<div class="site-footer-inner">
<div class="footer-copyright">
© 2026 Divyam Ahuja
</div>
<div class="social-icons"><a href="https://github.com/ahujadivyam" target="_blank" rel="noopener noreferrer" title="GitHub"><svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M12 0C5.374 0 0 5.373 0 12c0 5.302 3.438 9.8 8.207 11.387.599.111.793-.261.793-.577v-2.234c-3.338.726-4.033-1.416-4.033-1.416-.546-1.387-1.333-1.756-1.333-1.756-1.089-.745.083-.729.083-.729 1.205.084 1.839 1.237 1.839 1.237 1.07 1.834 2.807 1.304 3.492.997.107-.775.418-1.305.762-1.604-2.665-.305-5.467-1.334-5.467-5.931 0-1.311.469-2.381 1.236-3.221-.124-.303-.535-1.524.117-3.176 0 0 1.008-.322 3.301 1.23A11.509 11.509 0 0112 5.803c1.02.005 2.047.138 3.006.404 2.291-1.552 3.297-1.23 3.297-1.23.653 1.653.242 2.874.118 3.176.77.84 1.235 1.911 1.235 3.221 0 4.609-2.807 5.624-5.479 5.921.43.372.823 1.102.823 2.222v3.293c0 .319.192.694.801.576C20.566 21.797 24 17.3 24 12c0-6.627-5.373-12-12-12z"/></svg></a><a href="https://linkedin.com/in/ahujadivyam" target="_blank" rel="noopener noreferrer" title="LinkedIn"><svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433c-1.144 0-2.063-.926-2.063-2.065 0-1.138.92-2.063 2.063-2.063 1.14 0 2.064.925 2.064 2.063 0 1.139-.925 2.065-2.064 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z"/></svg></a><a href="https://git.divyam.dev" target="_blank" rel="noopener noreferrer" title="Git"><svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M6 3a3 3 0 1 0 0 6 3 3 0 0 0 0-6zM3 6a3 3 0 0 1 5.83-.75h4.34A3.001 3.001 0 0 1 18 6a3 3 0 0 1-4.83.75H8.83A3.001 3.001 0 0 1 3 6zm3 9a3 3 0 1 0 0 6 3 3 0 0 0 0-6zm0 3a3 3 0 0 1 0-6 3 3 0 0 1 0 6zm12-3a3 3 0 1 0 0 6 3 3 0 0 0 0-6z" fill="none"/><path 
d="M6 2a4 4 0 0 0-1 7.874V14.126A4.002 4.002 0 0 0 6 22a4 4 0 0 0 1-7.874V9.874A4.002 4.002 0 0 0 6 2zm0 2a2 2 0 1 1 0 4 2 2 0 0 1 0-4zm0 12a2 2 0 1 1 0 4 2 2 0 0 1 0-4zm12-12a4 4 0 0 0-1 7.874v2.252A4.002 4.002 0 0 0 18 22a4 4 0 0 0 1-7.874V9.874A4.002 4.002 0 0 0 18 2zm0 2a2 2 0 1 1 0 4 2 2 0 0 1 0-4zm0 12a2 2 0 1 1 0 4 2 2 0 0 1 0-4z"/></svg></a><a href="mailto:ahujadivyam@gmail.com" target="_blank" rel="noopener noreferrer" title="Email"><svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M1.5 8.67v8.58a3 3 0 003 3h15a3 3 0 003-3V8.67l-8.928 5.493a3 3 0 01-3.144 0L1.5 8.67z"/><path d="M22.5 6.908V6.75a3 3 0 00-3-3h-15a3 3 0 00-3 3v.158l9.714 5.978a1.5 1.5 0 001.572 0L22.5 6.908z"/></svg></a><a href="/index.xml" title="RSS Feed">
<svg viewBox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"><path d="M19.199 24C19.199 13.467 10.533 4.8 0 4.8V0c13.165 0 24 10.835 24 24h-4.801zM3.291 17.415c1.814 0 3.293 1.479 3.293 3.295 0 1.813-1.485 3.29-3.301 3.29C1.47 24 0 22.526 0 20.71s1.475-3.295 3.291-3.295zM15.909 24h-4.665c0-6.169-5.075-11.245-11.244-11.245V8.09c8.727 0 15.909 7.184 15.909 15.91z"/></svg>
</a>
</div>
</div>
</div>
</footer>
</body>
</html>