Version: 1.1.0

[ObjectsExtractionParams] INI file section

The [ObjectsExtractionParams] INI file section controls how Pdftools OCR Service extracts, filters, and detects visual objects and text elements from scanned images.


Common settings

FastObjectsExtraction

Key                    Type     Default
FastObjectsExtraction  Boolean  false

If set to true, object extraction runs faster, but extraction quality may deteriorate.


ProhibitColorImage

Key                 Type     Default
ProhibitColorImage  Boolean  false

If set to true, Pdftools OCR Service uses only a black-and-white plane during object extraction. Detection quality for colored tables and images may be reduced.
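
As an illustration of the two common settings above, the following fragment trades extraction quality for speed. It is a sketch only: it assumes the usual `Key = Value` INI syntax, lowercase Boolean literals as shown in the Default columns, and that the parser accepts `;` comment lines (remove the comments if it does not).

```ini
[ObjectsExtractionParams]
; Assumed syntax: Key = Value, with true/false as Boolean literals.
; Faster extraction, possibly at lower quality.
FastObjectsExtraction = true
; Work on the black-and-white plane only; colored tables and images may be detected less reliably.
ProhibitColorImage = true
```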


SourceContentReuseMode

Key                     Type                        Default
SourceContentReuseMode  SourceContentReuseModeEnum  CRM_Auto

Specifies how to use the text and image layers of the source PDF file.

SourceContentReuseModeEnum

  • CRM_Auto: Automatically selects how to reuse source content from PDF files. If the result doesn’t meet expectations, or you know the document type in advance, select the mode manually.
  • CRM_ContentAndPictures: Automatically selects whether to use the source text or rasterized image for each part of a page. If the text from the source file is considered reliable, it’s used; otherwise, the text from the raster is used.
  • CRM_ContentOnly: Uses both the text and image layers of the source PDF file directly.
    caution

    Reusing the text content of the source file speeds up processing, but if you select this mode and the file has no text layer, an error occurs. Use this mode only for source files whose visible text is encoded in Unicode, ASCII, or another character encoding standard, with correct font and size settings. For other file types, use CRM_Auto, CRM_ContentAndPictures, or CRM_DoNotReuse.

  • CRM_DoNotReuse: Rasterizes the pages of the source PDF file and processes them. The contents of the source file are ignored.
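
For example, if the text layer of the incoming PDF files is known to be unreliable, the mode can be set manually instead of relying on CRM_Auto. The fragment below is a sketch that assumes the usual `Key = Value` INI syntax and that the enum value is written by name, as in the Default column; the `;` comment line is an annotation that assumes standard INI comment support.

```ini
[ObjectsExtractionParams]
; Assumed syntax. Ignore the source content and process rasterized pages only.
SourceContentReuseMode = CRM_DoNotReuse
```

To let the service decide again, set the value back to CRM_Auto or remove the key, since CRM_Auto is the default.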

Removing objects

RemoveGarbage

Key            Type     Default
RemoveGarbage  Boolean  false

Specifies whether to remove “garbage” (for example, dots smaller than a certain size) from the image during object extraction.


RemoveTexture

Key            Type     Default
RemoveTexture  Boolean  true

If set to true, Pdftools OCR Service removes background texture noise from a temporary image used for recognition. The source image itself remains unchanged.
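
A possible configuration for noisy scans is sketched below. It turns on garbage removal, which is off by default, and states the RemoveTexture default explicitly. The `Key = Value` syntax and the `;` comment lines are assumptions about the INI parser, not confirmed behavior.

```ini
[ObjectsExtractionParams]
; Assumed syntax. Remove small specks such as isolated dots during object extraction.
RemoveGarbage = true
; Already the default; listed only to make the intent explicit.
RemoveTexture = true
```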


Detecting objects

DetectMatrixPrinter

Key                  Type     Default
DetectMatrixPrinter  Boolean  true

If set to true, text printed on a dot-matrix printer is detected during object extraction.


DetectPorousText

Key               Type     Default
DetectPorousText  Boolean  true

If set to true, regions with porous text are detected during object extraction.


EnableAggressiveTextExtraction

Key                             Type     Default
EnableAggressiveTextExtraction  Boolean  false

If set to true, Pdftools OCR Service attempts to extract as much text as possible, even from low-quality images. Recommended when the input contains degraded or faint text.

warning

Enabling EnableAggressiveTextExtraction may cause pictures to be misinterpreted as text, or horizontal text to be rearranged vertically.
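
For input with degraded or faint text, the setting could be enabled as sketched below. The `Key = Value` syntax and the `;` comment line are assumptions; given the warning above, it is worth checking the output on pages that contain pictures.

```ini
[ObjectsExtractionParams]
; Assumed syntax. Extract as much text as possible, even from low-quality images.
EnableAggressiveTextExtraction = true
```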


ProhibitDottedSeparators

Key                       Type     Default
ProhibitDottedSeparators  Boolean  false

If set to true, Pdftools OCR Service presumes that the document does not contain dotted separators. This is useful if you're certain the document has no dotted separators, or if some content is mistakenly identified as a dotted separator.
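
If ordinary content, such as a row of periods, is being picked up as a dotted separator, the detection can be switched off as in this sketch. As with any fragment for this section, the `Key = Value` syntax and the `;` comment line are assumed rather than confirmed.

```ini
[ObjectsExtractionParams]
; Assumed syntax. Treat the document as containing no dotted separators.
ProhibitDottedSeparators = true
```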