OCR (Optical Character Recognition) — PDF Glossary

OCR stands for Optical Character Recognition — the process of looking at an image (a photo of a page, a scanned document, a screenshot) and recognizing the letters and words in it. After OCR, you can search, copy, and edit the text that was previously just pixels.

Why OCR matters for PDFs

PDFs come in two flavors:

Text-based PDFs. Created directly from Word, LaTeX, or a web export. The text lives in the file as actual characters — you can copy it, search it, and tools can process it directly.
Scanned / image-based PDFs. Created from a scanner or photo. Each page is essentially a picture. You can see text on the page, but computers see only an image. Copy-paste doesn't work. Search doesn't work.
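
The difference between the two flavors can be checked programmatically. The sketch below is a heuristic, not a definitive test. It assumes you have already extracted the text of each page with some PDF library (for example, pypdf's `page.extract_text()`), and it flags pages that yield almost no characters. The function names and the `min_chars` threshold of 20 are illustrative assumptions.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is nearly empty.

    page_texts: one string (or None) per page, as returned by a PDF text
    extractor. min_chars is an assumed threshold, not a standard value.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only visible characters so whitespace-only pages are flagged.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr


def classify_pdf(page_texts, min_chars=20):
    """Label a document as 'text-based', 'scanned', 'mixed', or 'empty'."""
    if not page_texts:
        return "empty"
    missing = pages_needing_ocr(page_texts, min_chars)
    if not missing:
        return "text-based"
    if len(missing) == len(page_texts):
        return "scanned"
    return "mixed"
```

A "mixed" result is common in practice: for instance, a text-based report with a few scanned pages appended. Only the flagged pages would then need OCR.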

OCR bridges the gap: it reads the images and adds an invisible text layer on top, so the PDF looks identical but is now searchable.
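
As a rough illustration of this step, the sketch below calls the Tesseract command-line program, which can write a searchable PDF when given the `pdf` output configuration. It assumes Tesseract is installed and on the PATH, and that the scanned page has already been rendered to an image file, since the Tesseract CLI reads images rather than PDFs. The function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang> pdf.

    The trailing "pdf" asks Tesseract to write <outputbase>.pdf containing
    the page image plus an invisible text layer.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_image_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the output PDF path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return f"{output_base}.pdf"
```

For example, `ocr_image_to_searchable_pdf("page-1.png", "page-1")` would produce `page-1.pdf`, assuming `page-1.png` exists. A multi-page document would need one call per page image, with the resulting PDFs merged afterwards.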

Limits of OCR

Modern OCR is very accurate on clean, printed English text (accuracy of 98% or higher is typical). Accuracy drops on:

Handwritten text (very challenging)
Non-Latin scripts (varies by language)
Low-resolution or skewed scans
Unusual fonts or heavy formatting
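
To make an accuracy figure concrete, one common way to score OCR output is character accuracy: compare the OCR text with a hand-checked reference and count the single-character edits needed to turn one into the other. The sketch below is a minimal version of that idea, using Levenshtein edit distance. It is an assumption that any particular quoted figure was measured this way, and the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Return 1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))
```

Under this measure, 98% accuracy on a page of 2,000 characters still means roughly 40 wrong characters, which is why OCR output from difficult scans is worth proofreading.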

Tools

OCR PDF. Runs Tesseract OCR to add a searchable text layer to scanned PDFs.
PDF to Text. Extracts text from PDFs that are already text-based.
