Make Documents Searchable with OCR

Transform scanned and digital documents with Optical Character Recognition and get more value out of PDFs that can be searched and edited

Integrate OCR into your document pipeline

PDF SDK

Programmatic OCR processing

Process PDFs programmatically in .NET, Java, Python, or C

Use Pdftools SDK to call the built-in OCR module
Recognize text in scanned images
Fix non-extractable text in born-digital PDFs

Get started with the Pdftools SDK

Conversion Service

Configure OCR for 50+ file types

Implement OCR as part of your document automation pipeline

Designed for automated, high-volume workflows
Configured as a processing step within a workflow
Option to output an XML file for structured data

Get the OCR Service add-on for the Conversion Service

OCR Service features

Detect text

Detects text in scanned images and PDFs, making them searchable and editable

Detect tables

Detects tables, barcodes, engineering drawings, and other complex layout elements

Add text layer

Embeds an invisible Unicode text layer without altering the document's appearance

Automatic correction

Automatic skew correction, rotation, and resolution handling

No unnecessary processes

Detects which elements require OCR and only processes those

180+ languages

Supports over 180 natural and technical languages

Learn more about the OCR Service and its features

See documentation

OCR in document workflows

The SDK takes PDFs as input and outputs PDFs with an invisible text layer. The Conversion Service accepts any of its 50+ supported file formats as input and outputs either a PDF or an XML file.

Recognize text

Detect text in scanned images and run OCR on it

Fix non-extractable text

Fix non-extractable text in born-digital PDFs by adding Unicode mappings

Process entire pages

Process entire pages and add the results as OCR text

Add tagging

Add PDF tagging for accessibility compliance

Extract XML for OCR quality checks and audits

With the Conversion Service, you can check the accuracy of OCR results by extracting an XML file that gives insight into any OCR process that has previously been applied. The workflow extracts OCR-related information from PDF documents, outputs a structured XML file with detailed data, and supplies a confidence score for the OCR process.

This opens up quality control workflows: a low confidence score on a key field in a scanned document is a signal to route that document for human review rather than processing it further. It also has audit and legal value, as the XML creates a structured, timestamped record of the OCR interpretation, not just the final output.
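The review routing described above can be sketched in a few lines of Python. This is a minimal illustration only: the XML layout (`ocr-report`, `field`, and the `name` and `confidence` attributes), the sample values, the 0.85 threshold, and the function names are all assumptions made for the example, not the Conversion Service's actual output schema. Adapt the element and attribute names to the XML your workflow produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR report. The real schema is defined by the Conversion
# Service and will differ from this illustrative layout.
SAMPLE_XML = """\
<ocr-report>
  <page number="1">
    <field name="invoice_number" confidence="0.98">INV-1042</field>
    <field name="total_amount" confidence="0.61">1,250.00</field>
  </page>
</ocr-report>
"""


def low_confidence_fields(xml_text, key_fields, threshold=0.85):
    """Return (name, confidence) pairs for key fields below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for field in root.iter("field"):
        name = field.get("name")
        # Treat a missing confidence attribute as zero so it gets flagged.
        confidence = float(field.get("confidence", "0"))
        if name in key_fields and confidence < threshold:
            flagged.append((name, confidence))
    return flagged


def route(xml_text, key_fields, threshold=0.85):
    """Send the document to human review if any key field scores too low."""
    if low_confidence_fields(xml_text, key_fields, threshold):
        return "human_review"
    return "continue"


if __name__ == "__main__":
    print(route(SAMPLE_XML, {"invoice_number", "total_amount"}))
```

Restricting the check to a set of key fields keeps reviewers focused on the values that matter downstream. The threshold is a tuning parameter to calibrate against your own documents.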
