divyam.dev/content/posts/pdf-translation-pipeline.md at 05b4d4cad8063ad6644d20b095ca8ba25c39a486

Divyam Ahuja 2e5580024f

2026-05-08 14:55:12 +05:30

The Pipeline

The approach I settled on follows this flow:

PDF → Images → OCR → Translation → Rendered Images → PDF
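A minimal sketch of how the middle of that flow could be wired together, assuming each stage is passed in as a plain function. The names `Region`, `translate_page`, and `translate_document` are hypothetical and not taken from the actual code; the real OCR, translation, and rendering calls would sit behind the three callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels; this layout is an assumption.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    """One piece of recognised text and where it sits on the page image."""
    box: Box
    text: str


def translate_page(
    page_image: object,
    ocr: Callable[[object], List[Region]],
    translate: Callable[[str], str],
    render: Callable[[object, Sequence[Region]], object],
) -> object:
    """Run OCR -> translation -> rendering for a single page image."""
    regions = ocr(page_image)
    # Keep each bounding box, swap in the translated text.
    translated = [Region(r.box, translate(r.text)) for r in regions]
    return render(page_image, translated)


def translate_document(pages, ocr, translate, render):
    """Apply translate_page to every page image, preserving page order.

    Splitting the PDF into images and reassembling the result
    are left to the caller.
    """
    return [translate_page(page, ocr, translate, render) for page in pages]
```

Passing the stages in as arguments keeps each one replaceable, which helps when a single step (OCR in particular) needs to be swapped or wrapped later.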

Each step has its own set of challenges:

PDF to Images: Convert each page to a high-DPI image for better OCR accuracy
OCR: Extract text with position data using PaddleOCR
Translation: Run extracted text through NLLB (No Language Left Behind)
Rendering: Paint translated text back onto the original image
Assembly: Combine rendered images back into a PDF
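For the first step, it helps to see what a DPI setting means in pixels. PDF page sizes are measured in points (1/72 inch by default), so the rendered image size follows directly from the DPI. The sketch below uses a US Letter page (612 x 792 points) purely as an illustrative assumption; the helper names are hypothetical.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def pixel_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image, 3 bytes per pixel."""
    return width_px * height_px * 3


# US Letter page, chosen only for illustration.
low = pixel_size(612, 792, 150)    # (1275, 1650)
high = pixel_size(612, 792, 300)   # (2550, 3300)
print(low, high, rgb_bytes(*high))  # 300 DPI is about 25 MB per page in memory
```

Doubling the DPI doubles each dimension, so the pixel count and memory per page grow fourfold. That is the price paid for giving the OCR model more detail per character.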

DPI matters a lot: bumping from 150 to 300 DPI dramatically improved OCR accuracy for Hindi text
Font rendering is hard: fitting translated text into the original bounding boxes required careful font size calculation
Fallback strategies: TrOCR takes over when PaddleOCR fails on certain text regions
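The last two points can be sketched in a few lines. This is an illustration under stated assumptions, not the code from this project: `fit_font_size` does a binary search over integer font sizes using a caller-supplied `measure(text, size)` function (with Pillow, that could wrap `ImageFont.truetype` and `ImageDraw.textbbox`), and `recognise_with_fallback` tries OCR engines in order. Both function names, the `(text, confidence)` return shape, and the 0.5 threshold are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Measure = Callable[[str, int], Tuple[float, float]]        # -> (width, height) in px
Engine = Callable[[object], Optional[Tuple[str, float]]]   # -> (text, confidence) or None


def fit_font_size(text: str, box_width: float, box_height: float,
                  measure: Measure, min_size: int = 6, max_size: int = 72) -> int:
    """Largest font size in [min_size, max_size] whose text fits the box.

    Assumes measured width and height never shrink as the size grows.
    Returns min_size if even the smallest size overflows.
    """
    lo, hi, best = min_size, max_size, min_size
    while lo <= hi:
        mid = (lo + hi) // 2
        width, height = measure(text, mid)
        if width <= box_width and height <= box_height:
            best = mid       # fits: try something larger
            lo = mid + 1
        else:
            hi = mid - 1     # overflows: try something smaller
    return best


def recognise_with_fallback(crop: object, engines: Sequence[Engine],
                            min_confidence: float = 0.5) -> Optional[str]:
    """Return text from the first engine that gives a confident, non-empty result."""
    for engine in engines:
        result = engine(crop)
        if result is None:
            continue
        text, confidence = result
        if text.strip() and confidence >= min_confidence:
            return text
    return None  # every engine failed or was unsure
```

In this arrangement, a wrapper around PaddleOCR would come first in `engines` and a wrapper around TrOCR second, so the slower fallback only runs on the regions where the primary engine comes back empty or unsure.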

The code is messy but it works. Sometimes that's enough.