OCR best practices

Performance and operational guidance for using the OCR module in production environments.

Minimize OCR operations

A page is sent to the OCR engine only when required by the configured modes. Set only the modes you need to avoid unnecessary processing:

  • To OCR scanned images only, set ImageOptions.Mode and leave other modes at None.
  • To fix non-extractable text only, set TextOptions.Mode and leave other modes at None.
  • To process entire pages, set PageOptions.Mode. This triggers OCR for all matching pages regardless of other modes.
  • PageProcessingMode.AddResults doesn’t trigger OCR on its own. It only adds page-level results when OCR is already triggered by another mode.

For details on each mode, refer to Configure OCR options.

Reuse the OCR engine

Create one Engine instance and reuse it across multiple documents. Creating a new engine for each document adds overhead because the engine connection and configuration must be re-established each time.

// Create the engine once
using var engine = Engine.Create("service@http://localhost:7982/");
engine.Languages = "German,English";

// Configure options once
var options = new OcrOptions();
options.ImageOptions.Mode = ImageProcessingMode.UpdateText;

// Create the processor once
var processor = new Processor();

// Process multiple documents with the same engine
foreach (var inputPath in inputFiles)
{
    using var inStr = File.OpenRead(inputPath);
    using var inDoc = Document.Open(inStr);
    using var outStr = File.Create(GetOutputPath(inputPath));
    using var outDoc = processor.Process(inDoc, engine, outStr, options);
}

A single Engine instance can process only one document at a time, so the loop above handles the documents sequentially.

Scale with the OCR Service

The Pdftools OCR Service processes multiple pages concurrently. Performance scales with:

  • The number of running OCR Service instances.
  • The number of parallel processes configured in each instance.

For high-throughput workflows, deploy additional OCR Service instances and list their URLs, separated by semicolons, in the engine connection string (for example, "service@http://host1:7982/;http://host2:7982/").
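
When the list of OCR Service hosts comes from configuration, the connection string can be assembled in code. The following is a minimal sketch, not part of the SDK: the class name OcrServiceConnection and its Build method are hypothetical, and the "service@" prefix and ";" separator are taken only from the example string above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of the SDK).
public static class OcrServiceConnection
{
    // Builds "service@<url1>;<url2>;..." from a list of OCR Service URLs.
    // Assumption: the format follows the example string shown in the text.
    public static string Build(IEnumerable<string> serviceUrls)
    {
        // Ignore blank entries, for example from a trailing separator in a config value.
        var urls = serviceUrls
            .Select(u => u.Trim())
            .Where(u => u.Length > 0)
            .ToList();

        if (urls.Count == 0)
        {
            throw new ArgumentException(
                "At least one OCR Service URL is required.", nameof(serviceUrls));
        }

        return "service@" + string.Join(";", urls);
    }
}
```

Calling OcrServiceConnection.Build(new[] { "http://host1:7982/", "http://host2:7982/" }) returns the example string from the text, which can then be passed to Engine.Create.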

Optimize engine parameters

Use engine-specific parameters to improve performance:

  • Deactivate recognition features you don’t need (for example, barcode detection when you only need text).
  • Use a predefined profile optimized for your use case (for example, "PredefinedProfile=Default").
  • Consult the OCR engine manual for engine-specific tuning options.

Thread safety

The Pdftools SDK OCR module is thread-safe with one rule: an object may only be accessed by one thread at a time.

  • You can process multiple documents concurrently, provided each thread uses its own Engine and Document instances.
  • Some OCR engines must be disposed in the same thread where they were created.
  • The OCR Service is recommended for concurrent processing because it handles thread safety internally.
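
One way to follow these rules is to give each worker thread its own objects and to dispose them on the thread that created them. The sketch below shows that pattern in generic form and assumes nothing about the SDK: PerThreadWorkers and its parameters are hypothetical names, and the resource type is any IDisposable.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Hypothetical helper (not part of the SDK).
public static class PerThreadWorkers
{
    // Processes all items on `workerCount` dedicated threads.
    // Each thread creates its own resource, uses it for every item it takes
    // from the shared queue, and disposes it on that same thread.
    public static void Run<TResource>(
        IEnumerable<string> items,
        int workerCount,
        Func<TResource> createResource,
        Action<TResource, string> process)
        where TResource : IDisposable
    {
        if (workerCount < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(workerCount));
        }

        var queue = new ConcurrentQueue<string>(items);

        var threads = Enumerable.Range(0, workerCount)
            .Select(_ => new Thread(() =>
            {
                // Created and disposed on this thread only.
                using var resource = createResource();
                while (queue.TryDequeue(out var item))
                {
                    process(resource, item);
                }
            }))
            .ToList();

        threads.ForEach(t => t.Start());
        threads.ForEach(t => t.Join());
    }
}
```

With the SDK, createResource would return a new Engine (for example, through Engine.Create), and process would open the input document, call Processor.Process, and dispose the results as in the loop shown earlier. Other objects, such as the Processor, also need to be created per thread rather than shared. The sketch omits error handling: an unhandled exception in a worker thread terminates the process. It also doesn't apply to engines that allow only one instance per process (see Engine instance limits).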

Resource management

Close all OCR objects after use. The Engine, input Document, and output Document returned by Processor.Process are all disposable resources.

  • .NET: Use using statements.
  • Java: Use try-with-resources.
  • Python: Use with context managers.
  • C: Call PdfToolsOcr_Engine_Close and PdfToolsPdf_Document_Close.

Disposal ordering

The output Document returned by Processor.Process must be disposed before the output stream is closed. If you close the stream first, the document cannot flush its remaining data and the output file may be incomplete or corrupt.

In .NET, declare the output document after the output stream. Objects declared with using are disposed in reverse order of declaration, so the document is disposed first and the stream second. The following example shows the key lines from the full OCR example:

// Create stream for output file
using var outStr = File.Create(outPath);

// Process the document with OCR
using var outDoc = processor.Process(inDoc, engine, outStr, options);
// outDoc is disposed first, then outStr — correct order
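
The reverse disposal order is a C# language rule and can be checked without the SDK. The following sketch uses a hypothetical Tracked class as a stand-in for the stream and the document, and records the order in which the two objects are disposed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a disposable SDK object: records its name when disposed.
public sealed class Tracked : IDisposable
{
    private readonly string _name;
    private readonly List<string> _log;

    public Tracked(string name, List<string> log)
    {
        _name = name;
        _log = log;
    }

    public void Dispose() => _log.Add(_name);
}

public static class DisposalOrderDemo
{
    // Returns the names in the order in which the objects were disposed.
    public static List<string> Run()
    {
        var log = new List<string>();
        {
            // Declared in the same order as in the example above.
            using var outStr = new Tracked("outStr", log);
            using var outDoc = new Tracked("outDoc", log);
        } // End of scope: outDoc is disposed first, then outStr.
        return log;
    }
}
```

DisposalOrderDemo.Run() returns the list "outDoc", "outStr". Swapping the two declarations would reverse the result, which corresponds to the ordering this section warns against.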

Engine instance limits

Some OCR engines only allow a single instance per process. Creating a second Engine of the same type fails in that case. This is another reason to create one engine and reuse it across documents.

For complete examples showing proper resource management in each language, refer to OCR a PDF document.