Version: 1.17

OCR a PDF document

Apply OCR to a PDF document to make scanned content searchable and text extractable. The Pdftools SDK analyzes the document with an OCR engine and adds an invisible, selectable text layer while preserving the visual appearance.

Quick start with a code sample

Get the full sample on GitHub: C#, Java, and Python.

Steps to OCR a document:

  1. Initialize the Pdftools SDK license
  2. Install and start Pdftools OCR Service
  3. Create the OCR engine
  4. Open the input document
  5. Configure OCR options
  6. Process the document

1. Initialize the Pdftools SDK license

Before you begin, initialize the Pdftools SDK license.

2. Install and start Pdftools OCR Service

Install and start Pdftools OCR Service. Pdftools SDK connects to Pdftools OCR Service over HTTP for text recognition. The default endpoint is http://localhost:7982/. For installation details, review Set up Pdftools OCR Service with Pdftools SDK.

3. Create the OCR engine

Create an Engine instance by passing the engine name and connection parameters. The only supported engine is service, which connects to a running Pdftools OCR Service instance over HTTP. Specify the engine name followed by @ and the service URL. For example, "service@http://localhost:7982/" connects to an OCR Service at that URL. To connect to multiple instances, separate URLs with a semicolon (for example, "service@http://host1:7982/;http://host2:7982/").
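
The engine name format described above can be assembled with a small helper. This is a minimal sketch: `OcrEngineName` and `Build` are hypothetical names and not part of the Pdftools SDK. Accepting `https` URLs is also an assumption, since this guide only mentions HTTP.

```csharp
using System;

// Hypothetical helper (not part of the Pdftools SDK): builds the engine name
// string in the "service@url1;url2" form described above.
public static class OcrEngineName
{
    public static string Build(params string[] serviceUrls)
    {
        if (serviceUrls == null || serviceUrls.Length == 0)
            throw new ArgumentException("At least one service URL is required.", nameof(serviceUrls));

        foreach (var url in serviceUrls)
        {
            // Reject anything that is not an absolute http(s) URL before it reaches the SDK.
            if (!Uri.TryCreate(url, UriKind.Absolute, out var uri) ||
                (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
            {
                throw new ArgumentException($"Not an absolute HTTP URL: {url}", nameof(serviceUrls));
            }
        }

        // Engine name, then "@", then the service URLs separated by semicolons.
        return "service@" + string.Join(";", serviceUrls);
    }
}
```

For example, `OcrEngineName.Build("http://host1:7982/", "http://host2:7982/")` yields `"service@http://host1:7982/;http://host2:7982/"`, which can then be passed to `Engine.Create`.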

After creating the engine, set the recognition languages as a comma-separated string (for example, "German,English"). You can also set engine-specific parameters using the Parameters property as semicolon-separated key-value pairs (for example, "PredefinedProfile=Default" or "Profile=/path/to/custom-profile.ini").
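
If the languages and parameters come from configuration rather than literals, the two string formats above can be produced as follows. This is a sketch only: `OcrEngineSettings` and its methods are hypothetical helpers, not SDK members, and they do no validation of language names or parameter keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the Pdftools SDK).
public static class OcrEngineSettings
{
    // Joins language names into the comma-separated form, e.g. "German,English".
    // Surrounding whitespace is trimmed and empty entries are dropped.
    public static string JoinLanguages(IEnumerable<string> languages) =>
        string.Join(",", languages.Select(l => l.Trim()).Where(l => l.Length > 0));

    // Joins key-value pairs into the semicolon-separated form,
    // e.g. "PredefinedProfile=Default".
    public static string JoinParameters(IEnumerable<(string Key, string Value)> parameters) =>
        string.Join(";", parameters.Select(p => $"{p.Key}={p.Value}"));
}
```

`OcrEngineSettings.JoinLanguages(new[] { "German", "English" })` returns `"German,English"`, which is the form expected by the `Languages` property. The result of `JoinParameters` is intended for the `Parameters` property.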

You can reuse the engine across multiple documents. However, only one thread can use an Engine instance at a time.

// Create the OCR engine
using var engine = Engine.Create(ocrEngineName);

// Set the language(s) for OCR recognition (e.g. "German,English")
engine.Languages = language;
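
Because only one thread can use an Engine instance at a time, a multithreaded application has to either create one engine per worker or serialize access to a shared one. The following sketch shows the second approach with a plain lock. `Exclusive<T>` is a hypothetical, SDK-independent wrapper, not something the Pdftools SDK provides.

```csharp
using System;

// Hypothetical wrapper (not part of the Pdftools SDK): lets several threads share
// one resource while guaranteeing that only one of them uses it at a time.
public sealed class Exclusive<T>
{
    private readonly object _gate = new object();
    private readonly T _resource;

    public Exclusive(T resource) => _resource = resource;

    // Runs the callback while holding the lock, so calls never overlap.
    public TResult Use<TResult>(Func<T, TResult> action)
    {
        lock (_gate)
        {
            return action(_resource);
        }
    }
}
```

Wrapping the engine as `new Exclusive<Engine>(engine)` and running each OCR call inside `Use(...)` keeps the calls from overlapping, at the cost of processing documents one after another. One engine per worker thread avoids that bottleneck if the OCR Service can handle the load.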

4. Open the input document

Load the input PDF from the file system into a read-only Document.

// Open input document
using var inStr = File.OpenRead(inPath);
using var inDoc = Document.Open(inStr);

5. Configure OCR options

Create an OcrOptions object and configure its three sub-objects: image options, text options, and page options. Each sub-object controls a different aspect of OCR processing.

Image options

Image options control how the OCR processor handles scanned images within the PDF. Set the Mode property to determine which images to OCR:

  • UpdateText: Process only images without existing OCR text. Recommended for most scanned documents.
  • ReplaceText: Re-OCR all images, replacing any existing text layer. Use this when the existing OCR results are poor.
  • RemoveText: Remove existing OCR text without re-processing. This mode doesn’t need an OCR engine.
  • IfNoText: Process images only if the entire document contains no text at all.

Additional image options:

  • RotateScan: Automatically detect and correct page rotation.
  • DeskewScan: Straighten skewed scans.
  • RemoveOnlyInvisibleOcrText: When using ReplaceText or RemoveText, only affect invisible OCR text (text rendering mode 3). Manually placed visible text remains untouched.

var options = new OcrOptions();

// Configure image OCR: recognize text from scanned images
options.ImageOptions.Mode = ImageProcessingMode.UpdateText;
options.ImageOptions.RemoveOnlyInvisibleOcrText = true;
options.ImageOptions.DeskewScan = true;
options.ImageOptions.RotateScan = true;
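
The difference between the four image modes can be summarized as a decision rule per image. The sketch below restates the descriptions above in code. It is a reading aid, not the SDK's implementation: `ImageModeSketch` and `ImageModeRules` are hypothetical stand-ins, and the real `ImageProcessingMode` enum may have other members.

```csharp
using System;

// Hypothetical stand-in that mirrors the mode names listed above.
public enum ImageModeSketch { UpdateText, ReplaceText, RemoveText, IfNoText }

public static class ImageModeRules
{
    // Returns true when an image would be sent to the OCR engine under the given mode,
    // following the mode descriptions in this guide.
    public static bool WouldOcrImage(ImageModeSketch mode, bool imageHasOcrText, bool documentHasText) =>
        mode switch
        {
            ImageModeSketch.UpdateText  => !imageHasOcrText,  // only images without existing OCR text
            ImageModeSketch.ReplaceText => true,              // every image is re-OCRed
            ImageModeSketch.RemoveText  => false,             // text is removed, nothing is recognized
            ImageModeSketch.IfNoText    => !documentHasText,  // only when the document has no text at all
            _ => throw new ArgumentOutOfRangeException(nameof(mode))
        };
}
```

For a scanned page that already carries an OCR layer, `UpdateText` skips the image and `ReplaceText` recognizes it again.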

Text options

Text options control how the OCR processor handles non-extractable text in the PDF. Some fonts lack proper Unicode mappings, which breaks text copying and search. Set the Mode property to determine which text to process:

  • Update: Fix only text with missing or incorrect Unicode mappings. Recommended for most documents.
  • Replace: Reprocess all text, even text that already has valid Unicode mappings.

Additional text options:

  • SkipMode: Skip specific font types during text processing. You can combine values. Available flags: KnownSymbolic (skip symbolic fonts such as ZapfDingbats and Wingdings) and PrivateUseArea (skip text with Unicode Private Use Area code points).
  • UnicodeSource: Specify additional sources for Unicode mapping. You can combine values. Available flags: InstalledFont (look up Unicode values from system-installed fonts), KnownSymbolicPua (use Private Use Area values for known symbolic fonts), and FallbackAllPua (use Private Use Area values as a fallback for all characters).

// Configure text OCR: update non-extractable text with correct Unicode
options.TextOptions.Mode = TextProcessingMode.Update;
options.TextOptions.SkipMode = TextSkipMode.KnownSymbolic;
options.TextOptions.UnicodeSource = UnicodeSource.InstalledFont;
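
SkipMode and UnicodeSource accept combined values. In C#, flag values are usually combined with the bitwise OR operator. The sketch below demonstrates the mechanism on a hypothetical stand-in enum: the member names come from the list above, but the numeric values, the `None` member, and the assumption that the SDK enums are `[Flags]` enums are not confirmed by this guide.

```csharp
using System;

// Hypothetical stand-in for the SDK's TextSkipMode.
[Flags]
public enum SkipModeSketch
{
    None = 0,            // assumed member
    KnownSymbolic = 1,   // assumed value
    PrivateUseArea = 2   // assumed value
}

public static class SkipModeDemo
{
    // Combines both flags with bitwise OR.
    public static SkipModeSketch SkipSymbolicAndPua() =>
        SkipModeSketch.KnownSymbolic | SkipModeSketch.PrivateUseArea;

    // Checks whether a combined value contains a given flag.
    public static bool Skips(SkipModeSketch mode, SkipModeSketch flag) => mode.HasFlag(flag);
}
```

Under the same assumption, the SDK call would look like `options.TextOptions.SkipMode = TextSkipMode.KnownSymbolic | TextSkipMode.PrivateUseArea;`.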

Page options

Page options control page-level processing and accessibility tagging. Set the Mode property to determine which pages to process:

  • All: Process all non-empty pages.
  • IfNoText: Process only pages that have content but no text.
  • AddResults: Doesn’t trigger OCR independently, but adds page-level results when image or text processing triggers OCR.

The Tagging property controls PDF tagging for accessibility:

  • Auto: Automatically add tagging for scanned or already-tagged documents. Recommended for most workflows.
  • Update: Always add tagging. The OCR processor emits a warning if tagging fails.
  • None: Don’t add any tagging.

// Configure page OCR: process all pages and add tagging for accessibility
options.PageOptions.Mode = PageProcessingMode.All;
options.PageOptions.Tagging = TaggingMode.Auto;

Resolution settings

The OcrOptions object also controls the resolution for OCR processing. The OCR processor determines each page’s optimal OCR resolution automatically. If the optimal resolution falls within the configured range, the processor uses the default resolution. Otherwise, the processor generates a warning.

  • Dpi: Default resolution (default: 300).
  • MinDpi: Minimum allowed resolution (default: 200).
  • MaxDpi: Maximum allowed resolution (default: 400).
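
The range rule described above can be expressed as a simple check, which is useful for predicting which scans will produce a resolution warning. This is a sketch of the rule as stated in this guide, not the processor's actual algorithm. `OcrResolutionRules` is a hypothetical helper, and treating the bounds as inclusive is an assumption.

```csharp
// Hypothetical helper (not part of the Pdftools SDK) that models the range rule above.
public static class OcrResolutionRules
{
    // True when the optimal resolution lies outside [minDpi, maxDpi], the case in
    // which this guide says a warning is generated. Inclusive bounds are an assumption.
    // The defaults mirror the documented MinDpi and MaxDpi defaults.
    public static bool IsOutsideRange(double optimalDpi, double minDpi = 200, double maxDpi = 400) =>
        optimalDpi < minDpi || optimalDpi > maxDpi;
}
```

With the default range, `IsOutsideRange(150)` is true and `IsOutsideRange(300)` is false. Widening the range through MinDpi and MaxDpi reduces the number of such warnings.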

Embedded files

Set ProcessEmbeddedFiles to true on the OcrOptions object to recursively process PDF files embedded within the input document. By default, the OCR processor copies embedded files as-is without OCR processing.

6. Process the document

Create a Processor instance and register a warning handler before calling Process. The processor applies the configured OCR options and writes the result to the output stream.

Warnings provide diagnostic information about each page, such as images with resolution outside the configured range or tagging issues.

Warnings are non-critical. The Pdftools SDK completes processing even when warnings occur. However, depending on your use case, you may need to treat certain warning categories as errors.

Category | Description | When to treat as error
Ocr | OCR-related issues such as resolution outside the optimal range | Rarely (usually informational)
Tagging | Issues adding tagging or structural information | When producing accessible PDFs or preparing for PDF/A level A
Text | Issues making text extractable | When text extraction is the primary goal
SignedDocument | Processing removed existing digital signatures | When preserving signatures is important

Signed documents

Processing a signed PDF invalidates and removes all existing digital signatures. The OCR processor emits a SignedDocument warning when this occurs.

// Create the OCR processor and add a warning handler
var processor = new Processor();
processor.Warning += (s, e) =>
{
    Console.WriteLine("- {0}: {1} ({2}{3})",
        e.Category, e.Message, e.Context, e.PageNo > 0 ? " page " + e.PageNo : "");
};

// Create stream for output file
using var outStr = File.Create(outPath);

// Process the document with OCR
using var outDoc = processor.Process(inDoc, engine, outStr, options);

Handle warnings by category

For workflows where certain warnings are critical, filter warnings by category. This example treats tagging and text warnings as errors:

processor.Warning += (s, e) =>
{
    if (e.Category == WarningCategory.Tagging || e.Category == WarningCategory.Text)
        throw new Exception($"Critical OCR warning: {e.Message}");

    Console.WriteLine($"Warning: {e.Category}: {e.Message}");
};
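
An alternative to throwing from inside the handler is to record warnings while processing runs and decide afterwards. This guide does not describe how the SDK treats exceptions raised in a warning handler, so collecting first can be the more predictable option. The sketch below is SDK-independent: `WarningCategorySketch` is a hypothetical stand-in for the SDK's `WarningCategory`, and `WarningLog` is an illustrative helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in that mirrors the categories in the table above.
public enum WarningCategorySketch { Ocr, Tagging, Text, SignedDocument }

// Hypothetical helper: records warnings and reports the critical ones afterwards.
public sealed class WarningLog
{
    private readonly HashSet<WarningCategorySketch> _critical;
    private readonly List<(WarningCategorySketch Category, string Message)> _entries =
        new List<(WarningCategorySketch Category, string Message)>();

    public WarningLog(params WarningCategorySketch[] criticalCategories) =>
        _critical = new HashSet<WarningCategorySketch>(criticalCategories);

    // Call this from the warning handler.
    public void Add(WarningCategorySketch category, string message) =>
        _entries.Add((category, message));

    // All recorded warnings whose category was marked critical, in arrival order.
    public IReadOnlyList<string> CriticalMessages =>
        _entries.Where(e => _critical.Contains(e.Category))
                .Select(e => $"{e.Category}: {e.Message}")
                .ToList();

    // Call this after processing has finished.
    public void ThrowIfCritical()
    {
        var critical = CriticalMessages;
        if (critical.Count > 0)
            throw new InvalidOperationException("Critical OCR warnings: " + string.Join("; ", critical));
    }
}
```

In a real application, the handler would forward `e.Category` and `e.Message` to the log, and `ThrowIfCritical()` would run after `Process` returns. This also lets you report every critical warning at once instead of stopping at the first one.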