OCR a PDF document

Apply OCR to a PDF document to make scanned content searchable and text extractable. The Pdftools SDK analyzes the document with an OCR engine and adds an invisible, selectable text layer while preserving the visual appearance.

Steps to OCR a document:

  1. Create the OCR engine
  2. Open the input document
  3. Configure OCR options
  4. Process the document
  5. Full example

Create the OCR engine

Create an Engine instance by passing the engine name and connection parameters. The only supported engine is service, which connects to a running Pdftools OCR Service instance over HTTP. Specify the engine name followed by @ and the service URL. For example, "service@http://localhost:7982/" connects to an OCR Service at that URL. To connect to multiple instances, separate URLs with a semicolon (for example, "service@http://host1:7982/;http://host2:7982/").
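
The engine string format described above can be assembled programmatically when the service URLs come from configuration. The following is a minimal sketch, not part of the SDK: `BuildEngineName` is a hypothetical helper, and accepting HTTPS in addition to HTTP is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the SDK): builds the engine string
// "service@<url1>;<url2>;..." from a list of OCR Service URLs.
static string BuildEngineName(IEnumerable<string> serviceUrls)
{
    var urls = serviceUrls
        .Select(u => u.Trim())
        .Where(u => u.Length > 0)
        .ToList();

    if (urls.Count == 0)
        throw new ArgumentException("At least one OCR Service URL is required.", nameof(serviceUrls));

    // Reject anything that is not an absolute HTTP(S) URL.
    // Accepting HTTPS here is an assumption; the text above only mentions HTTP.
    foreach (var url in urls)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out var uri) ||
            (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps))
        {
            throw new ArgumentException($"Not an absolute HTTP(S) URL: {url}", nameof(serviceUrls));
        }
    }

    // Engine name, then "@", then the URLs separated by semicolons.
    return "service@" + string.Join(";", urls);
}

string single = BuildEngineName(new[] { "http://localhost:7982/" });
string multiple = BuildEngineName(new[] { "http://host1:7982/", "http://host2:7982/" });

Console.WriteLine(single);   // service@http://localhost:7982/
Console.WriteLine(multiple); // service@http://host1:7982/;http://host2:7982/
```

The resulting string would then be passed to Engine.Create in place of a hard-coded value.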

After creating the engine, set the recognition languages as a comma-separated string (for example, "German,English"). You can also set engine-specific parameters using the Parameters property as semicolon-separated key-value pairs (for example, "PredefinedProfile=Default" or "Profile=/path/to/custom-profile.ini").
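
If languages and parameters are held as collections rather than ready-made strings, they can be joined into the formats described above. This is only a sketch: `JoinLanguages` and `JoinParameters` are hypothetical helpers, not SDK functions, and the SDK assignments appear as comments because they need a live engine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: comma-separated language list, e.g. "German,English".
static string JoinLanguages(IEnumerable<string> languages) =>
    string.Join(",", languages.Select(l => l.Trim()).Where(l => l.Length > 0));

// Hypothetical helper: semicolon-separated "key=value" pairs.
static string JoinParameters(IEnumerable<(string Key, string Value)> parameters) =>
    string.Join(";", parameters.Select(p => $"{p.Key}={p.Value}"));

string languages = JoinLanguages(new[] { "German", "English" });
string parameters = JoinParameters(new[] { ("PredefinedProfile", "Default") });

Console.WriteLine(languages);  // German,English
Console.WriteLine(parameters); // PredefinedProfile=Default

// With a real engine instance, the strings would be assigned like this:
// engine.Languages = languages;
// engine.Parameters = parameters;
```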

The engine can be reused across multiple documents. However, each Engine instance must only be used by one thread at a time.
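
One way to respect the one-thread-at-a-time rule when documents are processed in parallel is to guard the shared engine with a lock. The sketch below does not call the SDK: the `Thread.Sleep` stands in for the actual processing call, and `ProcessWithSharedEngine` and the file names are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Lock object that guards the single shared engine instance.
object engineGate = new object();

int active = 0;     // callers currently inside the guarded section
int maxActive = 0;  // highest concurrency observed inside the guarded section
int processed = 0;  // number of documents handled

// Hypothetical wrapper: every use of the shared engine goes through the lock.
void ProcessWithSharedEngine(string path)
{
    lock (engineGate)
    {
        active++;
        maxActive = Math.Max(maxActive, active);

        // Real code would run the OCR processing for 'path' here,
        // using the shared engine instance.
        Thread.Sleep(10);

        processed++;
        active--;
    }
}

var paths = Enumerable.Range(1, 8).Select(i => $"scan-{i}.pdf").ToArray();
Parallel.ForEach(paths, path => ProcessWithSharedEngine(path));

Console.WriteLine($"processed={processed}, max concurrent engine users={maxActive}");
// prints "processed=8, max concurrent engine users=1"
```

The lock serializes all OCR work, so it protects the engine but gives no speed-up. If throughput matters, creating one engine per worker thread is the alternative design.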

// Create the OCR engine
using var engine = Engine.Create(ocrEngineName);

// Set the language(s) for OCR recognition (e.g. "German,English")
engine.Languages = language;

Open the input document

Load the input PDF from the file system into a read-only Document.

// Open input document
using var inStr = File.OpenRead(inPath);
using var inDoc = Document.Open(inStr);

Configure OCR options

Create an OcrOptions object and configure its three sub-objects: image options, text options, and page options. Each sub-object controls a different aspect of OCR processing.

Image options

Image options control how scanned images within the PDF are processed. Set the Mode property to determine which images to OCR:

  • UpdateText: Process only images without existing OCR text. Recommended for most scanned documents.
  • ReplaceText: Re-OCR all images, replacing any existing text layer. Use this when the existing OCR results are poor.
  • RemoveText: Remove existing OCR text without re-processing. No OCR engine is required.
  • IfNoText: Process images only if the entire document contains no text at all.

Additional image options:

  • RotateScan: Automatically detect and correct page rotation.
  • DeskewScan: Straighten skewed scans.
  • RemoveOnlyInvisibleOcrText: When using ReplaceText or RemoveText, only affect invisible OCR text (text rendering mode 3). Visible text that was placed manually is preserved.

var options = new OcrOptions();

// Configure image OCR: recognize text from scanned images
options.ImageOptions.Mode = ImageProcessingMode.UpdateText;
options.ImageOptions.RemoveOnlyInvisibleOcrText = true;
options.ImageOptions.DeskewScan = true;
options.ImageOptions.RotateScan = true;

Text options

Text options control how non-extractable text in the PDF is processed. Some fonts lack proper Unicode mappings, which prevents text from being copied or searched correctly. Set the Mode property to determine which text to process:

  • Update: Fix only text with missing or incorrect Unicode mappings. Recommended for most documents.
  • Replace: Reprocess all text, even text that already has valid Unicode mappings.

Additional text options:

  • SkipMode: Skip specific font types during text processing. Values can be combined. Available flags: KnownSymbolic (skip symbolic fonts such as ZapfDingbats and Wingdings) and PrivateUseArea (skip text with Unicode Private Use Area code points).
  • UnicodeSource: Specify additional sources for Unicode mapping. Values can be combined. Available flags: InstalledFont (look up Unicode values from system-installed fonts), KnownSymbolicPua (use Private Use Area values for known symbolic fonts), and FallbackAllPua (use Private Use Area values as a fallback for all characters).

// Configure text OCR: update non-extractable text with correct Unicode
options.TextOptions.Mode = TextProcessingMode.Update;
options.TextOptions.SkipMode = TextSkipMode.KnownSymbolic;
options.TextOptions.UnicodeSource = UnicodeSource.InstalledFont;
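
SkipMode and UnicodeSource are described as combinable flags. In C#, flag values are normally combined with the bitwise OR operator. The sketch below shows this with a stand-in enum: `SkipFlags` and its numeric values are invented for illustration and are not the SDK's definitions. With the SDK, the equivalent expression would presumably be `TextSkipMode.KnownSymbolic | TextSkipMode.PrivateUseArea`, which is an assumption based on the flag names listed above.

```csharp
using System;

// Combine two flags with bitwise OR.
SkipFlags combined = SkipFlags.KnownSymbolic | SkipFlags.PrivateUseArea;

Console.WriteLine(combined);                                   // KnownSymbolic, PrivateUseArea
Console.WriteLine(combined.HasFlag(SkipFlags.KnownSymbolic));  // True
Console.WriteLine(combined.HasFlag(SkipFlags.PrivateUseArea)); // True

// Clear a flag again with AND NOT.
SkipFlags onlySymbolic = combined & ~SkipFlags.PrivateUseArea;
Console.WriteLine(onlySymbolic);                               // KnownSymbolic

// Stand-in enum for illustration only. The member names mirror the flags
// described above; the numeric values and the None member are made up.
[Flags]
enum SkipFlags
{
    None = 0,
    KnownSymbolic = 1,
    PrivateUseArea = 2,
}
```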

Page options

Page options control page-level processing and accessibility tagging. Set the Mode property to determine which pages to process:

  • All: Process all non-empty pages.
  • IfNoText: Process only pages that have content but no text.
  • AddResults: Don’t trigger OCR independently, but add page-level results when OCR is triggered by image or text processing.

The Tagging property controls PDF tagging for accessibility:

  • Auto: Automatically add tagging for scanned or already-tagged documents. Recommended for most workflows.
  • Update: Always add tagging. A warning is emitted if tagging fails.
  • None: Don’t add any tagging.

// Configure page OCR: process all pages and add tagging for accessibility
options.PageOptions.Mode = PageProcessingMode.All;
options.PageOptions.Tagging = TaggingMode.Auto;

Resolution settings

The OcrOptions object also controls the resolution for OCR processing. Each page’s optimal OCR resolution is determined automatically. If the optimal resolution falls within the configured range, the default resolution is used. A warning is generated if a page’s optimal resolution falls outside the range.

  • Dpi: Default resolution (default: 300).
  • MinDpi: Minimum allowed resolution (default: 200).
  • MaxDpi: Maximum allowed resolution (default: 400).

Embedded files

Set ProcessEmbeddedFiles to true on the OcrOptions object to recursively process PDF files embedded within the input document. By default, embedded files are copied as-is without OCR processing.

Process the document

Create a Processor instance and register a warning handler before calling Process. The processor applies the configured OCR options and writes the result to the output stream.

Warnings provide diagnostic information about each page, such as images with resolution outside the configured range or tagging issues.

Warnings are non-critical. The Pdftools SDK completes processing even when warnings occur. However, depending on your use case, you may need to treat certain warning categories as errors.

Category       | Description                                                     | When to treat as error
Ocr            | OCR-related issues such as resolution outside the optimal range | Rarely (usually informational)
Tagging        | Issues adding tagging or structural information                 | When producing accessible PDFs or preparing for PDF/A level A
Text           | Issues making text extractable                                  | When text extraction is the primary goal
SignedDocument | Processing removed existing digital signatures                  | When preserving signatures is important

Signed documents

Processing a signed PDF invalidates all existing digital signatures, which are removed during processing. The SignedDocument warning is generated when this occurs.

// Create the OCR processor and add a warning handler
var processor = new Processor();
processor.Warning += (s, e) =>
{
    Console.WriteLine("- {0}: {1} ({2}{3})",
        e.Category, e.Message, e.Context, e.PageNo > 0 ? " page " + e.PageNo : "");
};

// Create stream for output file
using var outStr = File.Create(outPath);

// Process the document with OCR
using var outDoc = processor.Process(inDoc, engine, outStr, options);

Handle warnings by category

For workflows where certain warnings are critical, filter warnings by category. This example treats tagging and text warnings as errors:

processor.Warning += (s, e) =>
{
    if (e.Category == WarningCategory.Tagging || e.Category == WarningCategory.Text)
        throw new Exception($"Critical OCR warning: {e.Message}");

    Console.WriteLine($"Warning: {e.Category}: {e.Message}");
};
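
An alternative to throwing inside the handler is to record every warning and decide once processing has finished, so that all critical warnings can be reported together. The sketch below simulates this without the SDK: categories are plain strings taken from the categories listed above, `OnWarning` is a hypothetical function that the real Warning handler would call, and the warning messages are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Categories that this hypothetical workflow treats as errors.
var criticalCategories = new HashSet<string> { "Tagging", "Text" };

// Every warning raised during processing is recorded here.
var collected = new List<(string Category, string Message, int PageNo)>();

// In real code, the processor's Warning handler would call this with
// e.Category.ToString(), e.Message, and e.PageNo.
void OnWarning(string category, string message, int pageNo) =>
    collected.Add((category, message, pageNo));

// Simulated warnings; the messages are invented for illustration.
OnWarning("Ocr", "Resolution outside the configured range", 2);
OnWarning("Tagging", "Tagging could not be added", 0);
OnWarning("Ocr", "Resolution outside the configured range", 5);

// After processing: split the recorded warnings by severity.
var critical = collected.Where(w => criticalCategories.Contains(w.Category)).ToList();

foreach (var w in collected)
{
    string page = w.PageNo > 0 ? $" (page {w.PageNo})" : "";
    Console.WriteLine($"{w.Category}: {w.Message}{page}");
}

bool accepted = critical.Count == 0;
Console.WriteLine(accepted
    ? "Output accepted"
    : $"Output rejected: {critical.Count} critical warning(s)");
// prints "Output rejected: 1 critical warning(s)" as the last line
```

With this approach the output file has already been written when the decision is made, so a rejected result would need to be deleted or quarantined by the caller.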

Full example

// Create the OCR engine
using var engine = Engine.Create(ocrEngineName);

// Set the language(s) for OCR recognition (e.g. "German,English")
engine.Languages = language;

// Open input document
using var inStr = File.OpenRead(inPath);
using var inDoc = Document.Open(inStr);

// Configure OCR options
var options = new OcrOptions();

// Configure image OCR: recognize text from scanned images
options.ImageOptions.Mode = ImageProcessingMode.UpdateText;
options.ImageOptions.RemoveOnlyInvisibleOcrText = true;
options.ImageOptions.DeskewScan = true;
options.ImageOptions.RotateScan = true;

// Configure text OCR: update non-extractable text with correct Unicode
options.TextOptions.Mode = TextProcessingMode.Update;
options.TextOptions.SkipMode = TextSkipMode.KnownSymbolic;
options.TextOptions.UnicodeSource = UnicodeSource.InstalledFont;

// Configure page OCR: process all pages and add tagging for accessibility
options.PageOptions.Mode = PageProcessingMode.All;
options.PageOptions.Tagging = TaggingMode.Auto;

// Create the OCR processor and add a warning handler
var processor = new Processor();
processor.Warning += (s, e) =>
{
    Console.WriteLine("- {0}: {1} ({2}{3})",
        e.Category, e.Message, e.Context, e.PageNo > 0 ? " page " + e.PageNo : "");
};

// Create stream for output file
using var outStr = File.Create(outPath);

// Process the document with OCR
using var outDoc = processor.Process(inDoc, engine, outStr, options);