Version: 1.17

OCR with Pdftools SDK

Pdftools SDK supports OCR through Pdftools OCR Service. Pdftools SDK provides the API to configure and run OCR; the service performs the actual character recognition. The output is a PDF with an invisible, selectable text layer that preserves the document’s visual appearance.

Pdftools OCR Service required

OCR with Pdftools SDK requires a running Pdftools OCR Service instance. For installation details, review Set up Pdftools OCR Service with Pdftools SDK.

How it works

The OCR processor walks each PDF through the following pipeline:

1. Open the document. The OCR processor opens the input PDF for reading.
2. Process pages. The OCR processor processes each page individually:
   1. Analyze the page. The OCR processor reads the page content and evaluates the modes configured in image options, text options, and page options to determine whether to run OCR.
   2. OCR the page (if required). The OCR processor renders the page to an image at the optimal resolution, sends it to the OCR engine (the connection to the service), and receives the results.
   3. Apply results. The OCR processor matches OCR results to the page content and applies them according to each dimension’s mode:
      - The image options mode handles results on images.
      - The text options mode handles results that match existing text.
      - If the page options mode isn’t `None`, the processor adds remaining results as page text.
   4. Write the page. The OCR processor writes the page to the output with OCR enhancements applied.
3. Process embedded files (optional). When you enable `ProcessEmbeddedFiles`, the OCR processor processes embedded PDF files recursively.

The OCR processor sends a page to the OCR engine only when the configured modes require it. Choosing modes carefully avoids unnecessary OCR operations, which improves performance.
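
The control flow of the pipeline can be sketched in a few lines of Python. This is only an illustrative model, not the Pdftools SDK API: `Page`, `Document`, `process`, and `fake_engine` are hypothetical names, and the outcome of the analyze step is stored as a precomputed flag on each page. The sketch shows that the engine is contacted only for pages that need OCR and that embedded files are handled recursively when the option is enabled.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a PDF page; not an SDK type."""
    name: str
    needs_ocr: bool  # outcome of the "analyze" step for this page


@dataclass
class Document:
    """Hypothetical stand-in for a PDF with optional embedded PDFs."""
    pages: List[Page]
    embedded: List["Document"] = field(default_factory=list)


def process(doc: Document,
            engine: Callable[[Page], str],
            process_embedded_files: bool = False) -> List[str]:
    """Return a log of the steps taken, in pipeline order."""
    log = []
    for page in doc.pages:
        # Analyze: in this sketch the decision is precomputed on the page.
        if page.needs_ocr:
            # OCR: the engine is contacted only on this branch.
            result = engine(page)
            # Apply: record that the results were applied to the page.
            log.append(f"{page.name}: applied {result}")
        else:
            log.append(f"{page.name}: skipped OCR")
        # Write: a no-op here; a real processor writes the page to the output.
    if process_embedded_files:
        for child in doc.embedded:
            log.extend(process(child, engine, True))
    return log


calls = []


def fake_engine(page: Page) -> str:
    """Stub engine that records which pages were sent for recognition."""
    calls.append(page.name)
    return "ocr-result"


doc = Document(
    pages=[Page("scan", True), Page("digital", False)],
    embedded=[Document(pages=[Page("attachment", True)])],
)
log = process(doc, fake_engine, process_embedded_files=True)
print(log)
print(calls)  # pages that reached the engine
```

In this model, the "digital" page never reaches the engine, which is the performance benefit that careful mode selection is meant to provide.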

Processing dimensions

Pdftools SDK runs OCR across three independent dimensions. You can use any combination of these to suit your workflow.

Image OCR

Image OCR recognizes text in scanned images within a PDF and adds an invisible text layer that makes the content searchable and selectable.

| Mode | Description |
| --- | --- |
| `UpdateText` | Process only images that don’t already have OCR text. Recommended for most scanned documents. |
| `ReplaceText` | Re-OCR all images, replacing any existing text layer. |
| `RemoveText` | Remove existing OCR text without re-processing. No OCR engine required. |
| `IfNoText` | Process images only if the entire document contains no text. |
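
As a rough illustration of how these four modes differ, the following Python sketch encodes the descriptions above as a predicate that answers whether an image would be sent to the OCR engine. The mode names come from this page; the `ImageMode` enum, the `image_needs_engine` function, and its two boolean inputs are assumptions made for this example and are not part of the Pdftools SDK API.

```python
from enum import Enum


class ImageMode(Enum):
    # Values mirror the mode names on this page; the enum is illustrative.
    UPDATE_TEXT = "UpdateText"
    REPLACE_TEXT = "ReplaceText"
    REMOVE_TEXT = "RemoveText"
    IF_NO_TEXT = "IfNoText"


def image_needs_engine(mode: ImageMode,
                       image_has_ocr_text: bool,
                       document_has_text: bool) -> bool:
    """Return True if, under this model, the image is sent for recognition."""
    if mode is ImageMode.UPDATE_TEXT:
        # Only images without an existing OCR text layer are processed.
        return not image_has_ocr_text
    if mode is ImageMode.REPLACE_TEXT:
        # Every image is re-recognized.
        return True
    if mode is ImageMode.REMOVE_TEXT:
        # Existing OCR text is removed; no recognition takes place.
        return False
    if mode is ImageMode.IF_NO_TEXT:
        # Images are processed only when the whole document has no text.
        return not document_has_text
    raise ValueError(f"unknown image mode: {mode}")


# An image that already carries OCR text, in a document that contains text:
for mode in ImageMode:
    print(mode.value, image_needs_engine(mode, True, True))
```

For an image that already has OCR text, only `ReplaceText` leads to an engine call in this model.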

Text OCR

Text OCR fixes non-extractable text in born-digital PDFs. Some PDF fonts lack proper Unicode mappings, which breaks text copying and search. Text OCR determines the correct Unicode values for these characters.

| Mode | Description |
| --- | --- |
| `Update` | Fix only text with missing or incorrect Unicode mappings. Recommended for most documents. |
| `Replace` | Reprocess all text, even text that already has valid Unicode mappings. |
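
To make the difference between the two modes concrete, the sketch below models a font's character-code-to-Unicode map as a Python dictionary and selects the codes each mode would reprocess. Everything here is a simplified assumption: the `is_extractable` heuristic (a mapping counts as usable when it exists, is non-empty, and avoids U+FFFD and the private-use area) and the `glyphs_to_recognize` function are hypothetical, and this page does not describe how the SDK actually detects bad mappings. In particular, this heuristic can't detect a mapping that points to a plausible but wrong character.

```python
from typing import Dict, List, Optional

REPLACEMENT_CHARACTER = "\ufffd"


def is_extractable(text: Optional[str]) -> bool:
    """Heuristic for this sketch: the mapping exists, is non-empty, and
    contains neither U+FFFD nor private-use code points."""
    if not text:
        return False
    return all(
        ch != REPLACEMENT_CHARACTER and not (0xE000 <= ord(ch) <= 0xF8FF)
        for ch in text
    )


def glyphs_to_recognize(to_unicode: Dict[int, Optional[str]],
                        mode: str) -> List[int]:
    """Return the character codes that the given text mode would reprocess."""
    if mode == "Replace":
        # Reprocess everything, including codes with valid mappings.
        return sorted(to_unicode)
    if mode == "Update":
        # Reprocess only codes whose mapping fails the heuristic.
        return sorted(code for code, text in to_unicode.items()
                      if not is_extractable(text))
    raise ValueError(f"unknown text mode: {mode}")


# Code 2 has no mapping; code 3 maps into the private-use area.
font_map = {1: "A", 2: None, 3: "\ue001", 4: "fi"}
print(glyphs_to_recognize(font_map, "Update"))   # [2, 3]
print(glyphs_to_recognize(font_map, "Replace"))  # [1, 2, 3, 4]
```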

Page OCR

Page OCR processes entire pages and adds the results as OCR text. Page OCR can also add PDF tagging for accessibility compliance.

| Mode | Description |
| --- | --- |
| `All` | Process all non-empty pages. |
| `IfNoText` | Process only pages that have content but no text. |
| `AddResults` | Doesn’t trigger OCR independently, but adds page-level results when image or text processing triggers OCR. |
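
The distinction between triggering OCR and merely adding results is easy to miss, so here is a small Python model of it. The mode names, including the `None` mode mentioned in the pipeline description, come from this page; the `PageMode` enum and both functions are hypothetical helpers written for this example, not Pdftools SDK calls.

```python
from enum import Enum


class PageMode(Enum):
    # Values mirror the mode names on this page; the enum is illustrative.
    NONE = "None"
    ALL = "All"
    IF_NO_TEXT = "IfNoText"
    ADD_RESULTS = "AddResults"


def page_mode_triggers_ocr(mode: PageMode,
                           has_content: bool,
                           has_text: bool) -> bool:
    """Return True if the page mode alone causes the page to be OCRed."""
    if mode is PageMode.ALL:
        return has_content
    if mode is PageMode.IF_NO_TEXT:
        return has_content and not has_text
    # AddResults and None never trigger OCR by themselves.
    return False


def adds_page_text(mode: PageMode, ocr_ran: bool) -> bool:
    """Return True if remaining OCR results are added as page text."""
    return ocr_ran and mode is not PageMode.NONE


# A page with content and text, where image processing already triggered OCR:
mode = PageMode.ADD_RESULTS
print(page_mode_triggers_ocr(mode, has_content=True, has_text=True))  # False
print(adds_page_text(mode, ocr_ran=True))                             # True
```

In this model, `AddResults` costs no extra engine calls: it only reuses results from an OCR run that image or text processing already triggered.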

Accepted formats

| Input format | Output format |
| --- | --- |
| PDF 1.x, PDF 2.0, PDF/A-1, PDF/A-2, PDF/A-3 | PDF (same format preserved, PDF/A conformance maintained) |

Use cases

You can combine the three processing dimensions in different ways depending on your goal:

- Make scanned documents searchable. Use image OCR (`UpdateText`) and text OCR (`Update`) to add a text layer to scanned pages and fix any non-extractable text.
- Fix non-extractable text in born-digital PDFs. Use text OCR (`Update`) to correct Unicode mappings for fonts that don’t provide proper encoding information.
- Tag scans for accessibility. Use image OCR with page OCR tagging to prepare scanned documents for PDF/A level A conformance or PDF/UA accessibility requirements.
- Full document processing. Use all three dimensions together to handle scanned images, non-extractable text, and page-level tagging in a single pass.

Note

Pdftools SDK connects to Pdftools OCR Service for text recognition and produces enhanced PDFs. For comparison, Pdftools OCR Service outputs XML for use with Conversion Service.

Get started

Learn how to OCR a PDF document.
