Version: 1.0.0

[RecognizerParams] INI file section

The [RecognizerParams] INI file section includes settings for controlling language, text types, performance, and fine-tuning options used by the Pdftools OCR Service during text recognition.


Main settings

TextLanguage

Key | Type | Default
PerformSynthesis | TextLanguage | English

Specifies the languages used for text recognition as a comma-separated list. The supported languages are listed on the Supported languages page.
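As a minimal sketch, a section that recognizes English and German text might look like the following. German is assumed here to be a supported language spelled this way; check the Supported languages page for the exact names. The `;` comment line is also an assumption about what the INI parser accepts.

```ini
[RecognizerParams]
; Assumption: "German" matches an entry on the Supported languages page.
TextLanguage=English,German
```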


LanguageDetectionMode

Key | Type | Default
LanguageDetectionMode | ThreeStatePropertyValueEnum | TSPV_Auto

Controls automatic language detection. When language autodetection is on, the recognition language is detected for each word in the text and is selected from the list of languages specified in the TextLanguage property. Autodetection is intended for documents whose language is not known in advance. If you know for certain that all the languages you specified are present in the document, autodetection is unnecessary. Turn it off by setting this property to TSPV_No.

ThreeStatePropertyValueEnum

  • TSPV_Auto: Automatically determine if this processing mode should be used, depending on the situation (image characteristics, etc.).
  • TSPV_No: The processing mode in question will not be used.
  • TSPV_Yes: The processing mode in question will be used.
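For a document known to contain exactly the listed languages, the advice above could translate into a fragment like this one. It is a sketch only: the language names are illustrative and assumed to match the Supported languages page, and `;` comments are assumed to be accepted by the parser.

```ini
[RecognizerParams]
; Assumption: both names are valid entries on the Supported languages page.
TextLanguage=English,French
; The languages are known in advance, so per-word autodetection is switched off.
LanguageDetectionMode=TSPV_No
```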

TextTypes

Key | Type | Default
TextTypes | TextTypeEnum | TT_Normal

The value of this property is a bitwise OR combination of the TextTypeEnum enumeration constants that denote the possible text types used for recognition. For example, if it is set to TT_Normal | TT_Index, the Pdftools OCR Service presumes that the text contains only common typographic text and digits written in ZIP-code style, and ignores all other variants. See also Using Text Type Autodetection.

TextTypeEnum

  • TT_Gothic: This value sets the Pdftools OCR Service to presume that the text on the recognized image is printed in the Gothic type.
  • TT_Handprinted: This value corresponds to handprinted text.
    note

    Automatic analysis is not available for handprinted text. The coordinates of blocks containing handprinted text should be set manually.

  • TT_Index: This constant corresponds to a special set of characters, including only digits, which are written in ZIP-code style.
  • TT_Matrix: This value sets the Pdftools OCR Service to presume that the text on the recognized image is printed on a dot-matrix printer.
  • TT_MICR_CMC7: This value corresponds to a special set of characters, which includes only digits and A, B, C, D, E characters, written in MICR barcode font (CMC-7).
  • TT_MICR_E13B: This value corresponds to a special set of characters including only digits and A, B, C, D characters printed in magnetic ink. MICR (Magnetic Ink Character Recognition) characters are found in a variety of places, including personal checks.
  • TT_Normal: This value corresponds to a common typographic type of text.
  • TT_OCR_A: This value corresponds to a monospaced font, designed for Optical Character Recognition. Largely used by banks, credit card companies, and similar businesses.
  • TT_OCR_B: This value corresponds to a font designed for Optical Character Recognition.
  • TT_Receipt: This value corresponds to the text of a receipt. This type of text is designed to recognize sales receipts, invoices, etc. Unlike the other types, it is not concerned with the actual font of the text. Rather, it tells the recognizer that there may be text of low quality, mostly in monospaced or normal font.
  • TT_Typewriter: This value sets the Pdftools OCR Service to presume that the text on the recognized image is typed on a typewriter.
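The TT_Normal | TT_Index combination from the TextTypes description might be written in the INI file as sketched below. This section does not state how several constants are combined inside the file, so the ` | ` separator is an assumption that simply mirrors the notation used in the description, as is the `;` comment syntax.

```ini
[RecognizerParams]
; Assumption: constants are joined with " | " exactly as in the property description.
TextTypes=TT_Normal | TT_Index
```

Listing only the text types that actually occur in the document keeps the recognizer from considering the other variants.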

Recognition speed

BalancedMode

Key | Type | Default
BalancedMode | Boolean | false

If this property is true, the recognition runs in balanced mode. The balanced mode is an intermediate mode between full and fast modes. The fast mode can be activated with the help of the FastMode property. This property is available for machine-printed text only. For handprinted text, the recognition is run in full mode.


FastMode

Key | Type | Default
FastMode | Boolean | false

When this property is set to true, the Pdftools OCR Service recognizes text 2-2.5 times faster at the cost of a moderately increased error rate (1.5-2 times more errors). This property is available for both machine-printed and handprinted text. For handprinted text (text type TT_Handprinted), a special recognition mode is used. On text with good print quality, the Pdftools OCR Service makes an average of 1-2 errors per page, so such a moderate increase in the error rate can be tolerated in many cases, such as full-text indexing with "fuzzy" searches or preliminary recognition.

note

We do not recommend using this mode to recognize small image fragments (for example, fragments that consist of only one line or word) because the time advantage is insignificant.
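A hedged sketch of a speed-oriented configuration, for example for full-text indexing where a somewhat higher error rate is acceptable, is shown below. The `;` comment lines are an assumption about the parser; the keys and values come from the tables above.

```ini
[RecognizerParams]
; Trade accuracy for speed: roughly 2-2.5 times faster, 1.5-2 times more errors.
FastMode=true
; Balanced mode is the intermediate alternative; it stays at its default here.
BalancedMode=false
```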


Fine tuning

LowResolutionMode

Key | Type | Default
LowResolutionMode | Boolean | false

Specifies whether text on an image with low resolution is recognized. This property is useful when recognizing faxes, small prints, images with low resolution, or bad print quality.


OneLinePerBlock

Key | Type | Default
OneLinePerBlock | Boolean | false

When set to true, the Pdftools OCR Service presumes that the text in the block to which the current RecognizerParams object belongs contains no more than one line of text.


OneWordPerLine

Key | Type | Default
OneWordPerLine | Boolean | false

When set to true, the Pdftools OCR Service presumes that no text line contains more than one word, so each line of text is recognized as a single word.
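For illustration, a configuration for low-quality scans in which every block holds a single one-word line (a scenario chosen here as an example, not taken from the original) might combine the three settings above as follows. The `;` comments are an assumption about the parser.

```ini
[RecognizerParams]
; Input is assumed to be faxes or other low-resolution images.
LowResolutionMode=true
; Each block is assumed to contain at most one line, and each line one word.
OneLinePerBlock=true
OneWordPerLine=true
```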


ProhibitItalic

Key | Type | Default
ProhibitItalic | Boolean | false

When set to true, the Pdftools OCR Service does not recognize letters printed in an italic font. This is useful when the text is expected to contain no italic letters, in which case it may speed up recognition. If the image does contain italic letters and this property is true, these letters are recognized incorrectly.


ProhibitSubscript

Key | Type | Default
ProhibitSubscript | Boolean | false

When set to true, the Pdftools OCR Service does not recognize subscript letters. This is useful when the text is expected to contain no subscripts, in which case it may speed up recognition. If the image does contain subscript letters and this property is true, these letters are recognized incorrectly.

ProhibitSuperscript

Key | Type | Default
ProhibitSuperscript | Boolean | false

When set to true, the Pdftools OCR Service does not recognize superscript letters. This is useful when the text is expected to contain no superscripts, in which case it may speed up recognition. If the image does contain superscript letters and this property is true, these letters are recognized incorrectly.


ProhibitHyphenation

Key | Type | Default
ProhibitHyphenation | Boolean | false

When set to true, the Pdftools OCR Service does not recognize hyphenation from line to line. This is useful when the text is expected to contain no hyphenated words, in which case it may speed up recognition. If the recognized block does contain hyphenation and this property is true, the hyphenated words are recognized incorrectly.


ProhibitInterblockHyphenation

Key | Type | Default
ProhibitInterblockHyphenation | Boolean | false

When set to true, the Pdftools OCR Service presumes that text from one block cannot be carried over to the next block.
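The Prohibit* settings can be combined. The sketch below assumes plain body text with no italics, subscripts, or superscripts, but with ordinary line-end hyphenation that must still be recognized. The scenario and the `;` comments are assumptions made for illustration.

```ini
[RecognizerParams]
; Safe only if the text really has none of these; otherwise such letters
; are recognized incorrectly.
ProhibitItalic=true
ProhibitSubscript=true
ProhibitSuperscript=true
; Hyphenated words occur in this text, so hyphenation stays enabled (defaults).
ProhibitHyphenation=false
ProhibitInterblockHyphenation=false
```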


CaseRecognitionMode

Key | Type | Default
CaseRecognitionMode | CaseRecognitionModeEnum | CRM_AutoCase

This property specifies the mode of letter case recognition.

CaseRecognitionModeEnum

  • CRM_AutoCase: Automatically detect the case of letters and keep it in the output text.
  • CRM_CapitalCase: The recognized text will be set in capitals.
  • CRM_SmallCase: The recognized text will be set in lowercase letters.
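As a short example, forcing all recognized text to capitals (which could suit forms filled in with block letters, a use case assumed here) would be a single key. The `;` comment is an assumption about the parser.

```ini
[RecognizerParams]
; Output text is set in capitals regardless of the case on the image.
CaseRecognitionMode=CRM_CapitalCase
```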

Handprint recognition

WritingStyle

Key | Type | Default
WritingStyle | WritingStyleEnum | WS_Auto

Provides additional information about the writing style of handprinted letters. By default, the value of this property is WS_Auto, which means that the writing style is detected automatically.

WritingStyleEnum

  • WS_American: The American writing style.
  • WS_Arabic: The Arabic writing style. This style does not contain any specific characters. There is no need to use this constant.
  • WS_Auto: The writing style is detected automatically.
  • WS_Azerbaijan: The Azerbaijan writing style.
  • WS_Baltic: The Baltic writing style.
  • WS_British: The British writing style.
  • WS_Bulgarian: The Bulgarian writing style.
  • WS_Canadian: The Canadian writing style.
  • WS_Chinese: The Chinese writing style.
  • WS_Common: The Esperanto writing style.
  • WS_Croatian: The Croatian writing style.
  • WS_Czech: The Czech writing style.
  • WS_Default: This constant is deprecated and will be removed in future versions. Please use WS_Auto instead to ensure the best recognition result. If you need to select the writing style corresponding to the current operating system language, use WS_DetectByLocale, which has the same value and behavior.
  • WS_DetectByLocale: The writing style is selected depending on the current language of the operating system.
  • WS_French: The French writing style.
  • WS_German: The German writing style.
  • WS_Greek: The Greek writing style.
  • WS_Hungarian: The Hungarian writing style.
  • WS_Italian: The Italian writing style.
  • WS_Japanese: The Japanese writing style.
  • WS_Kazakh: The Kazakh writing style.
  • WS_Kirgiz: The Kirgiz writing style.
  • WS_Latvian: The Latvian writing style.
  • WS_Polish: The Polish writing style.
  • WS_Romanian: The Romanian writing style.
  • WS_Russian: The Russian writing style.
  • WS_Slovak: The Slovak writing style.
  • WS_Spanish: The Spanish writing style.
  • WS_Thai: The Thai writing style.
  • WS_Turkish: The Turkish writing style. This style does not contain any specific characters. There is no need to use this constant.
  • WS_Ukrainian: The Ukrainian writing style.

FieldMarkingType

Key | Type | Default
FieldMarkingType | FieldMarkingTypeEnum | FMT_SimpleText

This property specifies the type of marking around letters (for example, underline, frame, or box). This property is valid only for handprint recognition.

info

For correct handprint recognition, use the CellsCount property, which lets you set the number of character cells for a recognized block.

FieldMarkingTypeEnum

  • FMT_CharBoxSeries: This value specifies that the field where the text is located is a set of separate boxes.
  • FMT_CombInFrame: This value specifies that the field where the text is located is a comb and that this comb is also the bottom line of a frame.
  • FMT_GrayBoxes: This value specifies that the text is located in white fields on a gray background.
  • FMT_PartitionedFrame: This value specifies that the field where the text is located is a frame, and this frame is split by vertical lines.
  • FMT_SimpleComb: This value specifies that the field where the text is located is a comb.
  • FMT_SimpleText: This value denotes the plain text.
  • FMT_TextInFrame: This value specifies that the text is located in a frame.
  • FMT_UnderlinedText: This value specifies that the text is underlined.

CellsCount

Key | Type | Default
CellsCount | Integer | 1

Specifies the number of character cells for a recognized block. This property is valid only for handprint recognition and only makes sense for the field marking types (the FieldMarkingType property) that imply splitting the text into cells. The default value is 1, but you should set the value that matches the actual field to recognize the text correctly.
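Putting the handprint settings together, a field made of eight separate character boxes could be configured roughly as follows. The cell count of 8 is a hypothetical value for illustration, and the `;` comments are an assumption about the parser.

```ini
[RecognizerParams]
TextTypes=TT_Handprinted
; Letters sit in a row of separate boxes.
FieldMarkingType=FMT_CharBoxSeries
; Hypothetical value: set this to the real number of boxes in the field.
CellsCount=8
; Default; the writing style is detected automatically.
WritingStyle=WS_Auto
```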


User patterns

UseBuiltInPatterns

Key | Type | Default
UseBuiltInPatterns | Boolean | true

When set to true, the Pdftools OCR Service uses its built-in patterns for recognition. Patterns are files that establish a relationship between a character image and the character itself. Set this property to false when you want to use only user patterns instead of the standard Pdftools OCR Service patterns. This can be useful for recognizing text typed in decorative or nonstandard fonts, where it is better to use your own user-defined patterns trained for these fonts than the built-in ones. The path to the user-defined pattern file is stored in the UserPatternsFile property. If the UserPatternsFile property is empty, the UseBuiltInPatterns property is ignored.


UserPatternsFile

Key | Type | Default
UserPatternsFile | String | ""

Contains the full path to a file with the user pattern used for recognition. If the value of this property is not empty, information from the user pattern file will be used during recognition. If the UseBuiltInPatterns property is false, which means that standard Pdftools OCR Service patterns are not used during recognition, this property should contain a path to a user-defined pattern file, as only information stored in it will be used.
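A sketch of a configuration that relies on user patterns only is given below. The file path is a hypothetical placeholder, and neither the expected file extension nor any path escaping rules are stated in this section, so treat both as assumptions. The `;` comments are likewise an assumption about the parser.

```ini
[RecognizerParams]
; Hypothetical path: replace with the full path to your own pattern file.
UserPatternsFile=C:\path\to\user-patterns-file
; Takes effect only because UserPatternsFile is not empty.
UseBuiltInPatterns=false
```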