Make Documents Searchable with OCR
Apply Optical Character Recognition (OCR) to scanned and born-digital documents to get PDFs that can be searched and edited
Integrate OCR into your document pipeline
Programmatic OCR processing
Process PDFs programmatically in .NET, Java, Python, or C
Use the Pdftools SDK to call the built-in OCR module
Recognise text in scanned images
Fix non-extractable text in born-digital PDFs
Get started with the Pdftools SDK
Configure OCR for 50+ file types
Implement OCR as part of your document automation pipeline
Designed for automated, high-volume workflows
Configured as a processing step within a workflow
Option to output an XML file for structured data
Get the OCR Service add-on for the Conversion Service
Learn more about the OCR Service and its features
OCR in document workflows
The SDK takes PDFs as input and outputs PDFs with an invisible text layer. The Conversion Service accepts any of the 50+ file formats it supports and outputs either a PDF or an XML file.
Extract XML for OCR quality checks and audits
With the Conversion Service, you can check the accuracy of OCR results by extracting an XML file that gives insight into any OCR process that has previously been applied. The workflow extracts OCR-related information from PDF documents, outputs a structured XML file with detailed data, and supplies a confidence score for the OCR process.
This opens up quality control workflows: a low confidence score on a key field in a scanned document is a signal to route that document for human review rather than processing it further. It also has audit and legal value, as the XML creates a structured, timestamped record of the OCR interpretation, not just the final output.
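The review-routing idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML layout (an `ocrReport` root with `field` elements carrying `name` and `confidence` attributes), the key field names, and the 0.80 threshold are all hypothetical and are not the schema the Conversion Service actually produces. Adapt the element and attribute names to the real output before using anything like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR report used for illustration only.
# The real XML produced by the Conversion Service may be structured differently.
SAMPLE_REPORT = """\
<ocrReport>
  <field name="invoice_number" confidence="0.97">INV-1042</field>
  <field name="total_amount" confidence="0.61">1,284.00</field>
  <field name="notes" confidence="0.55">Thank you</field>
</ocrReport>
"""

# Assumed business rules: which fields matter and how confident OCR must be.
KEY_FIELDS = {"invoice_number", "total_amount"}
REVIEW_THRESHOLD = 0.80


def low_confidence_key_fields(report_xml, key_fields=KEY_FIELDS,
                              threshold=REVIEW_THRESHOLD):
    """Return the names of key fields whose confidence is below the threshold."""
    root = ET.fromstring(report_xml)
    flagged = []
    for field in root.iter("field"):
        name = field.get("name")
        # A missing confidence attribute is treated as 0 so it gets flagged.
        confidence = float(field.get("confidence", "0"))
        if name in key_fields and confidence < threshold:
            flagged.append(name)
    return flagged


def route(report_xml):
    """Send a document to human review if any key field is low confidence."""
    return "human_review" if low_confidence_key_fields(report_xml) else "continue"


if __name__ == "__main__":
    print(route(SAMPLE_REPORT))
```

Only key fields influence the decision in this sketch, so a low score on a free-text field such as `notes` does not hold up the document. Whether that trade-off is right depends on the workflow.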