PDF Genie
OCR · How-to · Explainer

OCR Accuracy: What Actually Determines Whether Your Scanned PDF Becomes Searchable

OCR accuracy ranges from 99% to unusable depending on DPI, contrast, font, and language. Here's what changes the number — and what you can control.


By the PDF Genie Editorial Team

9 min read

Reviewed by the PDF Genie editorial team. See our editorial standards.

We run thousands of PDFs through OCR every week, and the accuracy we see on the output ranges from effectively perfect to completely unusable. Same tool, same default settings, wildly different results. The reason is never the OCR engine itself — the engines have been commoditized for years, and the gap between the best and the worst open-source option is small. The variance comes from the input. A scan made at 300 DPI of a clean, typed page on white paper will produce accuracy around 99%. The same page at 150 DPI, skewed ten degrees, with a highlighter mark across it, falls to 70% or worse, and at that point the searchable layer is more misleading than useful.

This post walks through the factors that actually determine how good your OCR output will be, in roughly the order they matter. It is aimed at people who have tried an OCR tool, gotten disappointing results, and want to know whether the problem is their file, their settings, or their expectations. Usually it is the first.

DPI: the biggest single lever

Scan resolution is the factor that swamps everything else. The Tesseract documentation (Tesseract being the open-source OCR engine that powers most browser-based and mid-tier server tools, including ours) is explicit about this: recognition quality degrades sharply below 200 DPI, performs well at 300 DPI, and sees negligible gains above 400 DPI on typed text.

A rough accuracy-versus-DPI curve on clean typed English:

  • 72-100 DPI (screenshots, web images): 40-70% character accuracy. Often unusable.
  • 150 DPI (fast-scan default on cheap scanners): 85-92%. Readable but noisy.
  • 200 DPI: 94-97%. The minimum for serious use.
  • 300 DPI: 97-99%. The industry standard and what our tool targets.
  • 400-600 DPI: 98-99%, maybe. File size quadruples. Accuracy gains are measurement noise.
Anything above 600 DPI costs you speed and bytes without improving the text output. If you control the scan, 300 DPI is the answer. If someone else scanned it and sent it to you at 150 DPI, no amount of clever OCR post-processing fully compensates — the information is just not in the pixels.
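If you are not sure what resolution an existing scan actually has, you can back it out from the image's pixel width and the physical page size. A minimal sketch in plain Python (the 8.5-inch width assumes a US Letter page; the function name is ours, for illustration):

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Approximate scan resolution from image width and physical page width."""
    return pixel_width / page_width_inches

# A US Letter page (8.5 in wide) scanned at 300 DPI is 2550 px across.
print(effective_dpi(2550, 8.5))   # 300.0

# A 1275 px wide scan of the same page is only 150 DPI -- below the
# 200 DPI floor where recognition quality degrades sharply.
print(effective_dpi(1275, 8.5))   # 150.0
```

The same arithmetic works in reverse: if a page image renders narrower than about 1700 px (200 DPI on Letter), expect noisy output no matter which engine you use.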

Source contrast and pre-processing

A page is only as readable as the contrast between the ink and the background. Faded print-out copies, grey-on-white laser output, yellowed archival scans, and photographs taken in dim light all give OCR engines less signal to work with. The accuracy hit is non-linear: a page that is 10% darker than ideal might only lose 1-2 points of accuracy. A page that is 30% darker can drop 10-20 points.

The fix is almost always image pre-processing before OCR:

  • Binarization. Convert the page to pure black-and-white using a threshold (often Otsu's method). This makes character edges crisp for the recognizer. Greyscale input is easier for humans but harder for OCR.
  • Contrast stretching. Re-map the pixel histogram so the darkest ink is true black and the brightest paper is near-white.
  • Despeckling. Remove isolated black pixels smaller than a character stroke. Cheap scans pick these up, and they get misread as punctuation or stray marks.
Most modern OCR pipelines — Tesseract 5, Google Cloud Vision, AWS Textract — do some version of this internally. The quality varies. If you are getting poor results on a low-contrast scan, applying a threshold pass before submitting to OCR can add five or more points of accuracy.
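The first two steps can be sketched in plain Python, operating on a flat list of greyscale values (0 = black, 255 = white). This is an illustration of the technique, not our pipeline's actual implementation:

```python
def contrast_stretch(pixels):
    """Linearly remap so the darkest ink -> 0 and brightest paper -> 255."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0] * len(pixels)
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]

def otsu_threshold(pixels):
    """Otsu's method: pick the threshold that maximizes the between-class
    variance of the two resulting pixel populations (ink vs. paper)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    weight_bg = sum_bg = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Faded grey text (~90) on dirty off-white paper (~180):
faded = [90, 95, 88, 92, 178, 182, 175, 180, 185, 177]
stretched = contrast_stretch(faded)
t = otsu_threshold(stretched)
binary = [0 if p <= t else 255 for p in stretched]
# -> [0, 0, 0, 0, 255, 255, 255, 255, 255, 255]: pure black-and-white
```

On real images you would run this per page with NumPy or OpenCV (whose `cv2.threshold` supports an Otsu flag), but the selection logic is exactly this: try every threshold and keep the one that best separates ink from paper.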

Font

On typed documents, the font matters less than DPI but more than people expect. In rough order of OCR friendliness:

  • Clean sans-serif at reasonable size (Helvetica, Arial, Calibri, 10-12 pt): 98-99% on clean scans.
  • Serif (Times, Georgia): 97-99%. Minor drop from serif artifacts being mistaken for adjacent characters.
  • Monospace (Courier): 98-99%. Often the best case, because uniform character widths make segmentation easy.
  • Decorative and display fonts: 85-95%. Fonts designed to look distinctive rather than legible confuse the classifier.
  • Italic and condensed: 2-5 point accuracy hit across most engines.
  • Handwriting: 60-85% for neat printed handwriting, much lower for cursive. The open-source engines are mostly not competitive here; Google Cloud Vision and specialized handwriting models from Microsoft and AWS do better, and even they fail on low-quality inputs.
If your source is typewritten or inkjet/laser output in a common font, you are in the easy zone. If it is Victorian-era Fraktur or a chef's handwritten recipe book, adjust expectations.

Language pack

Tesseract ships English as the default, and it is easy to forget that you need to explicitly tell it to expect anything else. Running OCR with English-only settings on a French document produces a result that is nominally readable — most Latin-alphabet characters overlap — but every accented character, every French-specific ligature, and every word that disagrees with the English language model adds errors. Typical French-as-English OCR runs 85-92% instead of 98%.

The fix is the language parameter. Tesseract takes a -l flag with a concatenation: -l eng, -l fra, -l eng+fra for mixed-language documents. Tesseract's language pack repository ships packs for roughly 130 languages, including every major European language, Arabic, Chinese, Japanese, Korean, Turkish, Hindi, and many more. Each pack is a few megabytes and dramatically improves accuracy on its target script.

Our OCR PDF tool includes English, Turkish, and about thirty other language packs loaded by default, and we auto-detect the script for common cases. For unusual combinations, explicit language selection gives the best result.
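On the command line, language selection amounts to assembling that `-l` argument. A small helper sketch (the function name is ours, not part of Tesseract; only the `-l` flag and the `+`-joined codes are real Tesseract CLI semantics):

```python
def tesseract_cmd(image_path, output_base, langs=("eng",)):
    """Build a Tesseract CLI invocation with an explicit language pack list.

    Multiple languages are joined with '+', e.g. 'eng+fra' tells the engine
    to expect a mixed English/French document.
    """
    return ["tesseract", image_path, output_base, "-l", "+".join(langs)]

cmd = tesseract_cmd("scan.png", "out", ("eng", "fra"))
# ['tesseract', 'scan.png', 'out', '-l', 'eng+fra']
# Run with subprocess.run(cmd, check=True) if the tesseract binary is installed.
```

The order matters slightly: list the document's dominant language first, since it drives the language model used to break ties on ambiguous characters.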

Skew, rotation, and page geometry

Pages are rarely scanned perfectly straight. Tesseract and most modern OCR engines include a deskew stage that corrects small rotations automatically — typically up to 10-15 degrees without noticeable accuracy loss. Beyond that, accuracy craters, because character recognition is trained on roughly-horizontal baselines, and a 20-degree tilt makes the bounding boxes for adjacent characters overlap in ways the classifier cannot disentangle.

The fix is simple: if your scan is visibly rotated, straighten it before OCR. Our pipeline runs a deskew stage automatically for angles under 15 degrees. For worse rotations, rotate the file first with our Rotate PDF tool and then OCR.
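To see why small rotations are recoverable while large ones are fatal, compute how far the baseline drifts vertically from one end of a text line to the other. Illustrative numbers, assuming a text line roughly 2000 px wide:

```python
import math

def baseline_drift(line_width_px: float, skew_degrees: float) -> float:
    """Vertical drift (px) across a text line at a given skew angle."""
    return line_width_px * math.tan(math.radians(skew_degrees))

print(round(baseline_drift(2000, 2)))    # 70 px
print(round(baseline_drift(2000, 20)))   # 728 px
```

At 2 degrees the drift is on the order of a line height, so the deskew stage can correct it cleanly; at 20 degrees the baseline walks across many neighbouring lines of text, which is why accuracy craters once the engine's deskew limit is exceeded.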

Noise, stamps, highlights, and watermarks

Anything that is not text, but sits on top of text, confuses OCR. Common offenders:

  • Highlighter marks. Yellow highlights are relatively OCR-friendly (most engines treat yellow as white-ish); pink, green, and blue highlights reduce contrast on the text beneath them and cost accuracy.
  • Stamps. "RECEIVED," "DRAFT," "CONFIDENTIAL" stamps overlapping body text are read as additional words or garble the characters they cover.
  • Watermarks. Subtle watermarks are usually ignored; strong watermarks with decorative patterns cost accuracy.
  • Stray marks and speckle. Low-quality photocopies accumulate specks that get mistaken for punctuation — periods, apostrophes, hyphens — and corrupt the text flow.
If you control the source, clean scans without stamps and highlights OCR better. If you do not, accept that the output will need light cleanup.

Tables and multi-column layouts

This is the one place open-source OCR consistently underperforms commercial services. Tesseract reads a page as a sequence of text blocks and, within each block, left to right. It does a competent job of detecting block boundaries on simple layouts but struggles with:

  • Multi-column text. A two-column academic paper often gets read as if the columns were concatenated horizontally, producing "column 1 line 1, column 2 line 1, column 1 line 2, column 2 line 2" — an interleaved mess.
  • Tables. The cell structure is lost. The text of each cell is usually correct; the relationship between cells is not preserved in the output text layer.
  • Complex page geometry. Invoices with pricing tables, sidebars, and multi-block layouts lose structure.
AWS Textract and Google Cloud Document AI solve this with dedicated layout analysis models and return tables as structured objects with rows and columns preserved. This is where paid services earn their pricing — for document-AI use cases where you need structured extraction, not just a searchable text layer. For the typical "make this scan searchable so Ctrl+F works" use case, Tesseract is fine.
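The multi-column failure mode is easy to simulate: given two columns of text, reading each physical line straight across the page interleaves them, producing exactly the garbled order described above. A toy illustration:

```python
col1 = ["The quick brown", "fox jumps over", "the lazy dog."]
col2 = ["Lorem ipsum", "dolor sit amet,", "consectetur."]

# Correct reading order: all of column 1, then all of column 2.
correct = col1 + col2

# What naive left-to-right block reading produces: each physical line is
# read across both columns, interleaving unrelated sentences.
interleaved = [f"{a}   {b}" for a, b in zip(col1, col2)]
# ['The quick brown   Lorem ipsum', 'fox jumps over   dolor sit amet,', ...]
```

Recovering `correct` from `interleaved` requires knowing where the column boundary falls on the page, and that is precisely the layout analysis the commercial document-AI services add on top of plain character recognition.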

Realistic accuracy expectations by document type

Concrete numbers from our own pipeline, measured against human-checked ground truth on representative samples:

  • Clean typed document, 300 DPI, English: 98-99%.
  • Printed book page, 300 DPI, serif body: 97-99%.
  • Receipt (thermal print, faded): 85-95%. Thermal printing fades and has poor contrast.
  • Screenshot of a webpage: 90-97% depending on source DPI.
  • Old newspaper, 300 DPI: 85-93%. Degraded paper and fine-print ink bleed.
  • 19th-century book (pre-offset-printing): 70-90%. Irregular inking and old typefaces.
  • Neat printed handwriting: 60-80%.
  • Cursive handwriting: 30-60%. Rarely worth the effort outside specialized handwriting models.
Those are ranges we have observed, not guarantees. The variance within a category is large.

What PDF Genie's OCR tool does

Our OCR PDF tool runs Tesseract.js entirely in your browser — the file never uploads. Defaults: 300 DPI target resolution for any page rendered below that, automatic deskew up to 15 degrees, binarization with Otsu's threshold, and language packs for roughly thirty of the most-used world languages including English, Turkish, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Polish, Arabic, Chinese (Simplified and Traditional), Japanese, and Korean. Output is a new PDF with the original page images preserved and a selectable, searchable text layer added on top.

The tool is appropriate for: making a scanned document searchable, extracting text from a clean typed scan, preparing an archive of scanned paperwork for full-text indexing. It is not appropriate for: handwriting recognition, structured extraction of tables and forms, OCR of documents where the output needs to be correct to the character. For those, a specialized tool (AWS Textract, Google Document AI, or a commercial handwriting OCR service) is the right call.

Honest caveats

OCR is always probabilistic. Even under ideal conditions, you will find occasional errors — a "0" read as "O," a comma read as a period, a 1 read as an l. For any document where character-level accuracy matters (legal, medical, financial records used verbatim), the output should be proofread by a human or at minimum spot-checked against the source. Running OCR is a convenience feature for search and discovery. It is not a substitute for the original.

Make your scanned PDF searchable now

Try OCR PDF — free, runs in your browser

Try it yourself: free

40+ PDF tools, no signup required. Runs directly in your browser.

Explore PDF Genie →