---
title: "Building a PDF Translation Pipeline"
date: 2026-02-11
description: "How I built a pipeline to translate Hindi PDFs to English using OCR and neural machine translation."
tags: ["python", "ocr", "ml"]
---

Recently, I needed to translate a PDF document from Hindi to English. Sounds simple enough, right? Turns out, it's a surprisingly deep rabbit hole.

## The Pipeline

The approach I settled on follows this flow:

```
PDF → Images → OCR → Translation → Rendered Images → PDF
```

Each step has its own set of challenges:

1. **PDF to Images**: Convert each page to a high-DPI image for better OCR accuracy
2. **OCR**: Extract text with position data using PaddleOCR
3. **Translation**: Run extracted text through NLLB (No Language Left Behind)
4. **Rendering**: Paint translated text back onto the original image
5. **Assembly**: Combine rendered images back into a PDF
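To make the data flow concrete, here is a minimal Python sketch of how the five steps could hang together. It is not the actual code from this project: `TextRegion`, `run_pipeline`, and the five stage callables (`rasterize`, `ocr`, `translate`, `render`, `assemble`) are hypothetical names, and the real work (PaddleOCR, NLLB, image drawing) would live inside whatever functions you pass in.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# (left, top, right, bottom) in image pixels; an assumed box format.
Box = Tuple[int, int, int, int]


@dataclass
class TextRegion:
    """One piece of recognized text and where it sits on the page."""
    box: Box
    text: str


def run_pipeline(
    pdf_path: str,
    out_path: str,
    *,
    rasterize: Callable[[str], Sequence[Any]],
    ocr: Callable[[Any], List[TextRegion]],
    translate: Callable[[str], str],
    render: Callable[[Any, List[TextRegion]], Any],
    assemble: Callable[[List[Any], str], None],
) -> List[Any]:
    """Wire the five stages together; each stage is supplied by the caller."""
    rendered_pages = []
    for page in rasterize(pdf_path):              # 1. PDF -> page images
        regions = ocr(page)                       # 2. text + bounding boxes
        translated = [                            # 3. translate, keep boxes
            TextRegion(r.box, translate(r.text)) for r in regions
        ]
        rendered_pages.append(render(page, translated))  # 4. paint text back
    assemble(rendered_pages, out_path)            # 5. images -> PDF
    return rendered_pages
```

Keeping each stage behind a plain function makes it easy to swap one out, for example to try a different OCR engine, or to test the wiring with stand-ins before loading any models.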

## Lessons Learned

- **DPI matters a lot**: bumping from 150 to 300 DPI dramatically improved OCR accuracy for Hindi text
- **Font rendering is hard**: getting translated text to fit in the same bounding boxes required careful font size calculation
- **Fallback strategies**: TrOCR serves as a fallback when PaddleOCR fails on certain text regions
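The last two lessons are easier to see in code. The sketch below is one way to approach them, not the code from this project: `fit_font_size` binary-searches for the largest font size whose text still fits a bounding box, and `ocr_with_fallback` tries a primary OCR engine and hands over to a second one when the first comes up short. Both function names, the `measure`, `primary`, and `fallback` callables, and the 0.5 confidence threshold are assumptions made for illustration; treating "fails" as "empty text or low confidence" is a guess as well.

```python
from typing import Callable, Tuple


def fit_font_size(
    text: str,
    box_width: float,
    box_height: float,
    measure: Callable[[str, int], Tuple[float, float]],
    min_size: int = 6,
    max_size: int = 72,
) -> int:
    """Return the largest font size at which `text` fits inside the box.

    `measure(text, size)` must return the rendered (width, height) in pixels
    and is assumed to grow with `size`. If even `min_size` overflows,
    `min_size` is returned and the caller has to wrap or shorten the text.
    """
    best, lo, hi = min_size, min_size, max_size
    while lo <= hi:
        mid = (lo + hi) // 2
        width, height = measure(text, mid)
        if width <= box_width and height <= box_height:
            best, lo = mid, mid + 1   # fits: try something bigger
        else:
            hi = mid - 1              # overflows: try something smaller
    return best


def ocr_with_fallback(
    region,
    primary: Callable[[object], Tuple[str, float]],
    fallback: Callable[[object], str],
    min_confidence: float = 0.5,
) -> str:
    """Use the primary engine unless it returns nothing or looks unsure."""
    text, confidence = primary(region)
    if text.strip() and confidence >= min_confidence:
        return text
    return fallback(region)


# Toy stand-in for real font metrics: every glyph is 0.6 * size wide.
def toy_measure(text: str, size: int) -> Tuple[float, float]:
    return 0.6 * size * len(text), 1.2 * size


print(fit_font_size("Hello world", 200, 40, toy_measure))  # 30
```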

The code is messy but it works. Sometimes that's enough.