OCR with the Pdftools SDK

The Pdftools SDK includes a built-in OCR module that enhances PDF documents by making text searchable and extractable. The module takes a PDF as input, analyzes it with an OCR engine, and outputs a PDF with an invisible, selectable text layer. The visual appearance of the document is preserved.

Pdftools OCR Service required

The OCR module requires a running Pdftools OCR Service instance. The SDK connects to the OCR Service for text recognition but processes PDFs directly and produces enhanced PDFs as output. This is different from using the OCR Service standalone, which outputs XML for use with the Conversion Service.

How it works

The OCR module processes a PDF document through the following pipeline:

  1. Open document. The input PDF is opened for reading.
  2. Process pages. Each page is processed individually:
    1. Analyze page. The page content is analyzed. The modes configured in image options, text options, and page options are evaluated to determine whether the OCR engine is required.
    2. OCR page (if required). The page is rendered to an image at the optimal resolution and sent to the OCR engine, which returns the recognition results.
    3. Apply results. OCR results are matched to the page content and applied according to each dimension’s mode:
      • Results on images are processed by the image options mode.
      • Results matching existing text are processed by the text options mode.
      • Remaining results are added as page text if the page options mode is not None.
    4. Copy page. The page is written to the output with OCR enhancements applied.
  3. Process embedded files (optional). If ProcessEmbeddedFiles is enabled, embedded PDF files are processed recursively.

A page is sent to the OCR engine only when required by the configured modes. Choosing modes carefully avoids unnecessary OCR operations, which improves performance.
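
The per-page flow described above can be sketched in a few lines of Python. This is an illustrative model only, not the Pdftools SDK API: `Page`, `process_document`, `fake_engine`, and the predicate are hypothetical names invented for this example. It shows that the engine is called only for pages the configured modes select, while every page is still copied to the output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for a PDF page; not an SDK type."""
    number: int
    has_content: bool
    has_text: bool


def process_document(
    pages: List[Page],
    needs_ocr: Callable[[Page], bool],
    ocr_engine: Callable[[Page], str],
) -> List[str]:
    """Copy every page; call the OCR engine only where the predicate requires it."""
    output = []
    for page in pages:
        note = "copied unchanged"
        if needs_ocr(page):            # analyze page
            note = ocr_engine(page)    # OCR page (if required), then apply results
        output.append(f"page {page.number}: {note}")  # copy page
    return output


engine_calls: List[int] = []


def fake_engine(page: Page) -> str:
    """Records which pages reached the (simulated) OCR engine."""
    engine_calls.append(page.number)
    return "text layer added"


pages = [Page(1, True, False), Page(2, True, True), Page(3, False, False)]
# Assumed predicate: OCR only pages that have content but no text.
result = process_document(pages, lambda p: p.has_content and not p.has_text, fake_engine)
```

In this sketch only page 1 reaches the engine; pages 2 and 3 are copied without an OCR call, which is where the performance gain comes from.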

Processing dimensions

The OCR module processes documents across three independent dimensions. You can use any combination of these to suit your workflow.

Image OCR

Image OCR recognizes text in scanned images within a PDF and adds an invisible text layer that makes the content searchable and selectable.

Mode          Description
UpdateText    Process only images that don’t already have OCR text. Recommended for most scanned documents.
ReplaceText   Re-OCR all images, replacing any existing text layer.
RemoveText    Remove existing OCR text without re-processing. No OCR engine required.
IfNoText      Process images only if the entire document contains no text.
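
One way to read the image modes is as a rule for whether a given image needs the OCR engine. The sketch below encodes that reading in Python. The mode names come from the table above, but the `ImageMode` enum and the `image_needs_engine` function are hypothetical and not part of the SDK; the real module may consider more conditions.

```python
from enum import Enum


class ImageMode(Enum):
    # Mode names from the table above; the enum itself is illustrative.
    UPDATE_TEXT = "UpdateText"
    REPLACE_TEXT = "ReplaceText"
    REMOVE_TEXT = "RemoveText"
    IF_NO_TEXT = "IfNoText"


def image_needs_engine(
    mode: ImageMode, image_has_ocr_text: bool, document_has_text: bool
) -> bool:
    """Return True if, under this reading of the modes, the image is sent to OCR."""
    if mode is ImageMode.UPDATE_TEXT:
        return not image_has_ocr_text   # skip images that already carry OCR text
    if mode is ImageMode.REPLACE_TEXT:
        return True                     # always re-OCR
    if mode is ImageMode.REMOVE_TEXT:
        return False                    # text is stripped, no engine involved
    if mode is ImageMode.IF_NO_TEXT:
        return not document_has_text    # decided per document, not per image
    raise ValueError(f"unknown mode: {mode}")
```

For example, `image_needs_engine(ImageMode.UPDATE_TEXT, True, True)` is `False`, so a document that has already been through OCR costs no further engine calls under UpdateText.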

Text OCR

Text OCR fixes non-extractable text in born-digital PDFs. Some PDF fonts lack proper Unicode mappings, which prevents text from being copied or searched correctly. Text OCR determines the correct Unicode values for these characters.

Mode      Description
Update    Fix only text with missing or incorrect Unicode mappings. Recommended for most documents.
Replace   Reprocess all text, even text that already has valid Unicode mappings.
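
The problem text OCR addresses can be illustrated with a toy model. A PDF font draws glyphs by code, and text extraction depends on a table that maps those codes to Unicode (in PDF this is typically a ToUnicode CMap). The Python sketch below is not SDK code: `extract_text`, `codes_to_fix`, and the sample data are hypothetical, and the model detects only missing mappings, not incorrect ones.

```python
from typing import Dict, List

REPLACEMENT = "\ufffd"  # stand-in for "no Unicode value known" in this toy model


def extract_text(glyph_codes: List[int], to_unicode: Dict[int, str]) -> str:
    """Map glyph codes to characters; unmapped codes cannot be extracted."""
    return "".join(to_unicode.get(code, REPLACEMENT) for code in glyph_codes)


def codes_to_fix(glyph_codes: List[int], to_unicode: Dict[int, str], mode: str) -> List[int]:
    """Pick the glyph codes a given text mode would re-recognize (illustrative)."""
    if mode == "Replace":
        return sorted(set(glyph_codes))   # reprocess everything
    if mode == "Update":
        return sorted({c for c in glyph_codes if c not in to_unicode})
    raise ValueError(f"unknown mode: {mode}")


codes = [1, 2, 3, 3, 4]                   # glyphs that visually read "Hello"
to_unicode = {1: "H", 2: "e", 4: "o"}     # the mapping for glyph 3 is missing

broken = extract_text(codes, to_unicode)
fixed = extract_text(codes, {**to_unicode, 3: "l"})  # mapping supplied by recognition
```

Here the page looks correct on screen, yet `broken` contains two unextractable characters. Under Update only glyph 3 needs recognition, while Replace would reprocess all four glyph codes.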

Page OCR

Page OCR processes entire pages and adds the results as OCR text. It can also add PDF tagging for accessibility compliance.

Mode         Description
All          Process all non-empty pages.
IfNoText     Process only pages that have content but no text.
AddResults   Don’t trigger OCR independently, but add page-level results when OCR is triggered by image or text processing.
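
The page modes separate two questions: whether the mode triggers OCR by itself, and whether page-level results are added once OCR has run. The Python sketch below models that distinction. `PageMode`, `page_mode_triggers_ocr`, and `adds_page_results` are hypothetical names, not SDK identifiers. The None mode appears because the pipeline description mentions it.

```python
from enum import Enum


class PageMode(Enum):
    # Names from the documentation; the enum itself is illustrative.
    NONE = "None"
    ALL = "All"
    IF_NO_TEXT = "IfNoText"
    ADD_RESULTS = "AddResults"


def page_mode_triggers_ocr(mode: PageMode, has_content: bool, has_text: bool) -> bool:
    """True if the page mode alone sends this page to the OCR engine."""
    if mode is PageMode.ALL:
        return has_content                   # all non-empty pages
    if mode is PageMode.IF_NO_TEXT:
        return has_content and not has_text  # content but no text
    return False                             # None and AddResults never trigger OCR


def adds_page_results(mode: PageMode, ocr_was_triggered: bool) -> bool:
    """True if leftover OCR results are added as page text."""
    return ocr_was_triggered and mode is not PageMode.NONE
```

With AddResults, `page_mode_triggers_ocr` is always `False`, but `adds_page_results(PageMode.ADD_RESULTS, True)` is `True`: the page dimension reuses an OCR pass that image or text processing already paid for.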

Accepted formats

Input format                                  Output format
PDF 1.x, PDF 2.0, PDF/A-1, PDF/A-2, PDF/A-3   PDF (same format preserved, PDF/A conformance maintained)

Use cases

The three processing dimensions can be combined in different ways depending on your goal:

  • Make scanned documents searchable. Use image OCR (UpdateText) and text OCR (Update) to add a text layer to scanned pages and fix any non-extractable text.
  • Fix non-extractable text in born-digital PDFs. Use text OCR (Update) to correct Unicode mappings for fonts that don’t provide proper encoding information.
  • Tag scans for accessibility. Use image OCR with page OCR tagging to prepare scanned documents for PDF/A level A conformance or PDF/UA accessibility requirements.
  • Full document processing. Use all three dimensions together to handle scanned images, non-extractable text, and page-level tagging in a single pass.

Get started

Learn how to OCR a PDF document.