OCR a PDF document
Apply OCR to a PDF document to make scanned content searchable and text extractable. The Pdftools SDK analyzes the document with an OCR engine and adds an invisible, selectable text layer while preserving the visual appearance.
Steps to OCR a document:
- Initialize the Pdftools SDK license
- Install and start Pdftools OCR Service
- Create the OCR engine
- Open the input document
- Configure OCR options
- Process the document
1. Initialize the Pdftools SDK license
Before you begin, initialize the Pdftools SDK license.
2. Install and start Pdftools OCR Service
Install and start Pdftools OCR Service. Pdftools SDK connects to Pdftools OCR Service over HTTP for text recognition. The default endpoint is http://localhost:7982/. For installation details, review Set up Pdftools OCR Service with Pdftools SDK.
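Before creating the engine, it can help to confirm that something is listening at the service endpoint. The following is a minimal sketch using only the Python standard library. The function name `service_is_reachable` is hypothetical and not part of the Pdftools SDK. It only checks that a TCP connection can be opened; it does not verify that the listener is actually Pdftools OCR Service.

```python
import socket
from urllib.parse import urlsplit


def service_is_reachable(url: str = "http://localhost:7982/", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper: this is a plain socket check, not an SDK call.
    """
    parts = urlsplit(url)
    host = parts.hostname or "localhost"
    # Fall back to the scheme's well-known port if the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `service_is_reachable()` with no arguments probes the default endpoint mentioned above.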
3. Create the OCR engine
Create an Engine instance by passing the engine name and connection parameters. The only supported engine is service, which connects to a running Pdftools OCR Service instance over HTTP. Specify the engine name followed by @ and the service URL. For example, "service@http://localhost:7982/" connects to an OCR Service at that URL. To connect to multiple instances, separate URLs with a semicolon (for example, "service@http://host1:7982/;http://host2:7982/").
After creating the engine, set the recognition languages as a comma-separated string (for example, "German,English"). You can also set engine-specific parameters using the Parameters property as semicolon-separated key-value pairs (for example, "PredefinedProfile=Default" or "Profile=/path/to/custom-profile.ini").
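The string formats described above are easy to get wrong when assembled by hand. The following sketch builds them from Python lists and dictionaries. The helper names (`build_engine_name`, `build_languages`, `build_parameters`) are hypothetical and not part of the Pdftools SDK; they only produce the strings in the formats described in this step.

```python
def build_engine_name(urls):
    """Join one or more OCR Service URLs into 'service@url1;url2'."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one OCR Service URL is required")
    return "service@" + ";".join(urls)


def build_languages(languages):
    """Join language names into a comma-separated string."""
    return ",".join(languages)


def build_parameters(params):
    """Join key-value pairs into a semicolon-separated 'key=value' string."""
    return ";".join(f"{key}={value}" for key, value in params.items())


# Example values taken from the text above.
engine_name = build_engine_name(["http://host1:7982/", "http://host2:7982/"])
languages = build_languages(["German", "English"])
parameters = build_parameters({"PredefinedProfile": "Default"})
```

The resulting strings can then be passed to the engine in the same way as the literal strings used in this step.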
You can reuse the engine across multiple documents. However, only one thread can use an Engine instance at a time.
- .NET
    // Create the OCR engine
    using var engine = Engine.Create(ocrEngineName);
    // Set the language(s) for OCR recognition (e.g. "German,English")
    engine.Languages = language;
- Java
    // Create the OCR engine
    Engine engine = Engine.create(ocrEngineName);
    // Set the language(s) for OCR recognition (e.g. "German,English")
    engine.setLanguages(language);
- Python
    # Create the OCR engine
    engine = Engine.create(ocr_engine_name)
    # Set the language(s) for OCR recognition (e.g. "German,English")
    engine.languages = language
- C
    // Create the OCR engine
    pEngine = PdfToolsOcr_Engine_Create(szOcrEngineName);
    // Set the language(s) for OCR recognition (e.g. "German,English")
    PdfToolsOcr_Engine_SetLanguages(pEngine, szLanguage);
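Because only one thread can use an Engine instance at a time, code that shares one engine between worker threads needs to serialize access to it. The following is a minimal sketch of one way to do that with a standard lock. `SerializedEngine` and `run` are hypothetical names, not part of the Pdftools SDK, and the wrapped object can be any engine instance.

```python
import threading


class SerializedEngine:
    """Wrap a shared engine so that only one thread uses it at a time.

    Hypothetical wrapper: the SDK does not provide this class.
    """

    def __init__(self, engine):
        self._engine = engine
        self._lock = threading.Lock()

    def run(self, job):
        """Call job(engine) while holding the lock and return its result."""
        with self._lock:
            return job(self._engine)
```

Each worker would pass a function that performs its OCR call with the engine it receives. An alternative design is to create one engine per worker thread, which avoids the lock at the cost of more engine instances.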
4. Open the input document
Load the input PDF from the file system into a read-only Document.
- .NET
    // Open input document
    using var inStr = File.OpenRead(inPath);
    using var inDoc = Document.Open(inStr);
- Java
    // Open input document
    FileStream inStr = new FileStream(inPath, FileStream.Mode.READ_ONLY);
    Document inDoc = Document.open(inStr);
- Python
    # Open input document
    in_stream = io.FileIO(input_path, 'rb')
    input_document = Document.open(in_stream)
- C
    // Open input document
    pInStream = _tfopen(szInPath, _T("rb"));
    TPdfToolsSys_StreamDescriptor inDesc;
    PdfToolsSysCreateFILEStreamDescriptor(&inDesc, pInStream, 0);
    pInDoc = PdfToolsPdf_Document_Open(&inDesc, _T(""));
5. Configure OCR options
Create an OcrOptions object and configure its three sub-objects: image options, text options, and page options. Each sub-object controls a different aspect of OCR processing.
Image options
Image options control how the OCR processor handles scanned images within the PDF. Set the Mode property to determine which images to OCR:
- UpdateText: Process only images without existing OCR text. Recommended for most scanned documents.
- ReplaceText: Re-OCR all images, replacing any existing text layer. Use this when the existing OCR results are poor.
- RemoveText: Remove existing OCR text without re-processing. This mode doesn’t need an OCR engine.
- IfNoText: Process images only if the entire document contains no text at all.
Additional image options:
- RotateScan: Automatically detect and correct page rotation.
- DeskewScan: Straighten skewed scans.
- RemoveOnlyInvisibleOcrText: When using ReplaceText or RemoveText, only affect invisible OCR text (text rendering mode 3). Manually placed visible text remains untouched.
- .NET
    var options = new OcrOptions();
    // Configure image OCR: recognize text from scanned images
    options.ImageOptions.Mode = ImageProcessingMode.UpdateText;
    options.ImageOptions.RemoveOnlyInvisibleOcrText = true;
    options.ImageOptions.DeskewScan = true;
    options.ImageOptions.RotateScan = true;
- Java
    OcrOptions options = new OcrOptions();
    // Configure image OCR: recognize text from scanned images
    options.getImageOptions().setMode(ImageProcessingMode.UPDATE_TEXT);
    options.getImageOptions().setRemoveOnlyInvisibleOcrText(true);
    options.getImageOptions().setDeskewScan(true);
    options.getImageOptions().setRotateScan(true);
- Python
    options = OcrOptions()
    # Configure image OCR: recognize text from scanned images
    options.image_options.mode = ImageProcessingMode.UPDATE_TEXT
    options.image_options.remove_only_invisible_ocr_text = True
    options.image_options.deskew_scan = True
    options.image_options.rotate_scan = True
- C
    pOptions = PdfToolsOcr_OcrOptions_New();
    // Configure image OCR: recognize text from scanned images
    pImageOptions = PdfToolsOcr_OcrOptions_GetImageOptions(pOptions);
    PdfToolsOcr_ImageOptions_SetMode(pImageOptions, ePdfToolsOcr_ImageProcessingMode_UpdateText);
    PdfToolsOcr_ImageOptions_SetRemoveOnlyInvisibleOcrText(pImageOptions, TRUE);
    PdfToolsOcr_ImageOptions_SetDeskewScan(pImageOptions, TRUE);
    PdfToolsOcr_ImageOptions_SetRotateScan(pImageOptions, TRUE);
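To make the difference between the four image modes concrete, the sketch below restates their descriptions as a small decision function. It is an illustration of the documented behavior only. `ImageMode` and `image_needs_ocr` are hypothetical stand-ins rather than SDK types, and the real processor may take more factors into account.

```python
from enum import Enum


class ImageMode(Enum):
    # Stand-in for the SDK's ImageProcessingMode; not the SDK type.
    UPDATE_TEXT = "UpdateText"
    REPLACE_TEXT = "ReplaceText"
    REMOVE_TEXT = "RemoveText"
    IF_NO_TEXT = "IfNoText"


def image_needs_ocr(mode, image_has_ocr_text, document_has_text):
    """Return True if, per the mode descriptions, the image would be sent to OCR."""
    if mode is ImageMode.UPDATE_TEXT:
        # Only images without existing OCR text are processed.
        return not image_has_ocr_text
    if mode is ImageMode.REPLACE_TEXT:
        # Every image is re-processed, replacing any existing text layer.
        return True
    if mode is ImageMode.REMOVE_TEXT:
        # Existing OCR text is removed; no recognition takes place.
        return False
    if mode is ImageMode.IF_NO_TEXT:
        # Images are processed only when the whole document has no text.
        return not document_has_text
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, a scanned page that already carries an OCR layer is skipped under UpdateText but re-processed under ReplaceText.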
Text options
Text options control how the OCR processor handles non-extractable text in the PDF. Some fonts lack proper Unicode mappings, which breaks text copying and search.
Set the Mode property to determine which text to process:
- Update: Fix only text with missing or incorrect Unicode mappings. Recommended for most documents.
- Replace: Reprocess all text, even text that already has valid Unicode mappings.
Additional text options:
- SkipMode: Skip specific font types during text processing. You can combine values. Available flags: KnownSymbolic (skip symbolic fonts such as ZapfDingbats and Wingdings) and PrivateUseArea (skip text with Unicode Private Use Area code points).
- UnicodeSource: Specify additional sources for Unicode mapping. You can combine values. Available flags: InstalledFont (look up Unicode values from system-installed fonts), KnownSymbolicPua (use Private Use Area values for known symbolic fonts), and FallbackAllPua (use Private Use Area values as a fallback for all characters).
- .NET
    // Configure text OCR: update non-extractable text with correct Unicode
    options.TextOptions.Mode = TextProcessingMode.Update;
    options.TextOptions.SkipMode = TextSkipMode.KnownSymbolic;
    options.TextOptions.UnicodeSource = UnicodeSource.InstalledFont;
- Java
    // Configure text OCR: update non-extractable text with correct Unicode
    options.getTextOptions().setMode(TextProcessingMode.UPDATE);
    options.getTextOptions().setSkipMode(EnumSet.of(TextSkipMode.KNOWN_SYMBOLIC));
    options.getTextOptions().setUnicodeSource(EnumSet.of(UnicodeSource.INSTALLED_FONT));
- Python
    # Configure text OCR: update non-extractable text with correct Unicode
    options.text_options.mode = TextProcessingMode.UPDATE
    options.text_options.skip_mode = TextSkipMode.KNOWN_SYMBOLIC
    options.text_options.unicode_source = UnicodeSource.INSTALLED_FONT
- C
    // Configure text OCR: update non-extractable text with correct Unicode
    pTextOptions = PdfToolsOcr_OcrOptions_GetTextOptions(pOptions);
    PdfToolsOcr_TextOptions_SetMode(pTextOptions, ePdfToolsOcr_TextProcessingMode_Update);
    PdfToolsOcr_TextOptions_SetSkipMode(pTextOptions, ePdfToolsOcr_TextSkipMode_KnownSymbolic);
    PdfToolsOcr_TextOptions_SetUnicodeSource(pTextOptions, ePdfToolsOcr_UnicodeSource_InstalledFont);
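The snippets above each set a single flag, while the text notes that SkipMode and UnicodeSource values can be combined. The sketch below shows how combinable flags behave in general, using Python's standard `enum.Flag`. The classes `SkipFlags` and `SourceFlags` are stand-ins with arbitrary values; this page does not confirm that the SDK's Python enums are `Flag` types, so treat the `|` syntax as an assumption to verify against the SDK reference.

```python
from enum import Flag


class SkipFlags(Flag):
    # Stand-in mirroring the SkipMode flag names; values are arbitrary.
    KNOWN_SYMBOLIC = 1
    PRIVATE_USE_AREA = 2


class SourceFlags(Flag):
    # Stand-in mirroring the UnicodeSource flag names; values are arbitrary.
    INSTALLED_FONT = 1
    KNOWN_SYMBOLIC_PUA = 2
    FALLBACK_ALL_PUA = 4


# Combine flags with the bitwise OR operator.
skip = SkipFlags.KNOWN_SYMBOLIC | SkipFlags.PRIVATE_USE_AREA
sources = SourceFlags.INSTALLED_FONT | SourceFlags.KNOWN_SYMBOLIC_PUA

# Membership tests show which flags a combined value contains.
skips_symbolic_fonts = SkipFlags.KNOWN_SYMBOLIC in skip
uses_pua_fallback = SourceFlags.FALLBACK_ALL_PUA in sources
```

In the Java snippet above, the same idea is expressed with `EnumSet.of(...)`, which accepts more than one element.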
Page options
Page options control page-level processing and accessibility tagging.
Set the Mode property to determine which pages to process:
- All: Process all non-empty pages.
- IfNoText: Process only pages that have content but no text.
- AddResults: Doesn’t trigger OCR independently, but adds page-level results when image or text processing triggers OCR.
The Tagging property controls PDF tagging for accessibility:
- Auto: Automatically add tagging for scanned or already-tagged documents. Recommended for most workflows.
- Update: Always add tagging. The OCR processor emits a warning if tagging fails.
- None: Don’t add any tagging.
- .NET
    // Configure page OCR: process all pages and add tagging for accessibility
    options.PageOptions.Mode = PageProcessingMode.All;
    options.PageOptions.Tagging = TaggingMode.Auto;
- Java
    // Configure page OCR: process all pages and add tagging for accessibility
    options.getPageOptions().setMode(PageProcessingMode.ALL);
    options.getPageOptions().setTagging(TaggingMode.AUTO);
- Python
    # Configure page OCR: process all pages and add tagging for accessibility
    options.page_options.mode = PageProcessingMode.ALL
    options.page_options.tagging = TaggingMode.AUTO
- C
    // Configure page OCR: process all pages and add tagging for accessibility
    pPageOptions = PdfToolsOcr_OcrOptions_GetPageOptions(pOptions);
    PdfToolsOcr_PageOptions_SetMode(pPageOptions, ePdfToolsOcr_PageProcessingMode_All);
    PdfToolsOcr_PageOptions_SetTagging(pPageOptions, ePdfToolsOcr_TaggingMode_Auto);
Resolution settings
The OcrOptions object also controls the resolution for OCR processing. The OCR processor determines each page’s optimal OCR resolution automatically. If the optimal resolution falls within the configured range, the processor uses the default resolution. Otherwise, the processor generates a warning.
- Dpi: Default resolution (default: 300).
- MinDpi: Minimum allowed resolution (default: 200).
- MaxDpi: Maximum allowed resolution (default: 400).
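If you already know the scan resolution of your pages, you can estimate in advance which ones are likely to produce a resolution warning. The sketch below compares a resolution against the configured range using the default values listed above. `DpiRange` and `contains` are hypothetical names, not SDK types, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DpiRange:
    """Hypothetical holder for the three resolution settings and their defaults."""

    dpi: int = 300      # Default resolution
    min_dpi: int = 200  # Minimum allowed resolution
    max_dpi: int = 400  # Maximum allowed resolution

    def contains(self, optimal_dpi: float) -> bool:
        """Return True if the resolution lies within [min_dpi, max_dpi].

        Assumption: both bounds are treated as inclusive.
        """
        return self.min_dpi <= optimal_dpi <= self.max_dpi


settings = DpiRange()
# Example page resolutions; the values are made up for illustration.
page_resolutions = {1: 300, 2: 150, 3: 600}
pages_likely_to_warn = [page for page, dpi in page_resolutions.items()
                        if not settings.contains(dpi)]
```

Here pages 2 and 3 fall outside the default 200 to 400 range, so they are the ones to watch for in the warning output.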
Embedded files
Set ProcessEmbeddedFiles to true on the OcrOptions object to recursively process PDF files embedded within the input document. By default, the OCR processor copies embedded files as-is without OCR processing.
6. Process the document
Create a Processor instance and register a warning handler before calling Process. The processor applies the configured OCR options and writes the result to the output stream.
Warnings provide diagnostic information about each page, such as images with resolution outside the configured range or tagging issues.
Warnings are non-critical. The Pdftools SDK completes processing even when warnings occur. However, depending on your use case, you may need to treat certain warning categories as errors.
| Category | Description | When to treat as error |
|---|---|---|
| Ocr | OCR-related issues such as resolution outside the optimal range | Rarely (usually informational) |
| Tagging | Issues adding tagging or structural information | When producing accessible PDFs or preparing for PDF/A level A |
| Text | Issues making text extractable | When text extraction is the primary goal |
| SignedDocument | Processing removed existing digital signatures | When preserving signatures is important |
Processing a signed PDF invalidates and removes all existing digital signatures. The OCR processor emits a SignedDocument warning when this occurs.
- .NET
    // Create the OCR processor and add a warning handler
    var processor = new Processor();
    processor.Warning += (s, e) =>
    {
        Console.WriteLine("- {0}: {1} ({2}{3})",
            e.Category, e.Message, e.Context, e.PageNo > 0 ? " page " + e.PageNo : "");
    };
    // Create stream for output file
    using var outStr = File.Create(outPath);
    // Process the document with OCR
    using var outDoc = processor.Process(inDoc, engine, outStr, options);
- Java
    // Create the OCR processor and add a warning handler
    Processor processor = new Processor();
    processor.addWarningListener(new Processor.WarningListener() {
        @Override
        public void warning(Processor.Warning event) {
            System.out.println(String.format("- %s: %s (%s%s)",
                event.getCategory(), event.getMessage(), event.getContext(),
                event.getPageNo() > 0 ? " page " + event.getPageNo() : ""));
        }
    });
    // Create stream for output file
    FileStream outStr = new FileStream(outPath, FileStream.Mode.READ_WRITE_NEW);
    // Process the document with OCR
    Document outDoc = processor.process(inDoc, engine, outStr, options, null);
- Python
    def warning_handler(message: str, category, page_no: int, context: str):
        if page_no > 0:
            print(f"- {category.name}: {message} ({context} page {page_no})")
        else:
            print(f"- {category.name}: {message} ({context})")

    # Create the OCR processor and add a warning handler
    processor = Processor()
    processor.add_warning_handler(warning_handler)
    # Create stream for output file
    with io.FileIO(output_path, 'wb+') as output_stream:
        # Process the document with OCR
        processor.process(input_document, engine, output_stream, options)
- C
    // Create the OCR processor and add a warning handler
    pProcessor = PdfToolsOcr_Processor_New();
    PdfToolsOcr_Processor_AddWarningHandler(pProcessor, NULL, WarningHandler);
    // Create stream for output file
    pOutStream = _tfopen(szOutPath, _T("wb+"));
    TPdfToolsSys_StreamDescriptor outDesc;
    PdfToolsSysCreateFILEStreamDescriptor(&outDesc, pOutStream, 0);
    // Process the document with OCR
    pOutDoc = PdfToolsOcr_Processor_Process(pProcessor, pInDoc, pEngine, &outDesc, pOptions, NULL);
Handle warnings by category
For workflows where certain warnings are critical, filter warnings by category. This example treats tagging and text warnings as errors:
- .NET
    processor.Warning += (s, e) =>
    {
        if (e.Category == WarningCategory.Tagging || e.Category == WarningCategory.Text)
            throw new Exception($"Critical OCR warning: {e.Message}");
        Console.WriteLine($"Warning: {e.Category}: {e.Message}");
    };
- Java
    processor.addWarningListener(new Processor.WarningListener() {
        @Override
        public void warning(Processor.Warning event) {
            if (event.getCategory() == WarningCategory.TAGGING ||
                event.getCategory() == WarningCategory.TEXT)
                throw new RuntimeException("Critical OCR warning: " + event.getMessage());
            System.out.println(String.format("Warning: %s: %s",
                event.getCategory(), event.getMessage()));
        }
    });
- Python
    def warning_handler(message: str, category, page_no: int, context: str):
        if category in (WarningCategory.TAGGING, WarningCategory.TEXT):
            raise Exception(f"Critical OCR warning: {message}")
        print(f"Warning: {category.name}: {message}")
- C
    void PDFTOOLS_CALL WarningHandler(void* pContext, const TCHAR* szMessage,
        TPdfToolsOcr_WarningCategory iCategory, int iPageNo, const TCHAR* szContext)
    {
        if (iCategory == ePdfToolsOcr_WarningCategory_Tagging ||
            iCategory == ePdfToolsOcr_WarningCategory_Text)
        {
            _tprintf(_T("Critical OCR warning: %s\n"), szMessage);
            // Handle as error (e.g. set error flag)
            return;
        }
        _tprintf(_T("Warning: %d: %s\n"), iCategory, szMessage);
    }
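An alternative to raising inside the handler is to record every warning and decide after processing whether any of them is critical, similar to the error-flag approach hinted at in the C example. The sketch below is one possible shape for that. `WarningCollector` and `Category` are hypothetical: `Category` is a stand-in for the SDK's WarningCategory, and the sketch assumes the SDK accepts any callable with the same four-argument signature as the Python handler above.

```python
from enum import Enum


class Category(Enum):
    # Stand-in for the SDK's WarningCategory; not the SDK type.
    OCR = "Ocr"
    TAGGING = "Tagging"
    TEXT = "Text"
    SIGNED_DOCUMENT = "SignedDocument"


class WarningCollector:
    """Record warnings during processing and report the critical ones afterwards."""

    def __init__(self, critical):
        self.critical = frozenset(critical)
        self.warnings = []  # (category, page_no, context, message) in arrival order

    def __call__(self, message, category, page_no, context):
        # Same parameter order as the Python warning handler shown above.
        self.warnings.append((category, page_no, context, message))

    def critical_warnings(self):
        """Return the recorded warnings whose category was marked critical."""
        return [w for w in self.warnings if w[0] in self.critical]


collector = WarningCollector({Category.TAGGING, Category.TEXT})
# Simulated callbacks; in real use the processor would invoke the collector.
collector("resolution outside range", Category.OCR, 1, "image")
collector("tagging failed", Category.TAGGING, 2, "page")
failed = bool(collector.critical_warnings())
```

With this design the document is always processed to completion, and the caller can log every warning before deciding whether to reject the output.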