[RecognizerParams] INI file section
The [RecognizerParams] INI file section includes settings for controlling language, text types, performance, and fine-tuning options used by the Pdftools OCR Service during text recognition.
Main settings
TextLanguage
Key | Type | Default |
---|---|---|
TextLanguage | String | English |
Specifies the languages used for text recognition as a comma-separated list. The supported languages are listed on the Supported languages page.
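As a minimal sketch of the comma-separated form, the fragment below lists two recognition languages. English is the documented default; German is an assumed example name, so check the Supported languages page for the exact spelling. The Key=Value layout, the absence of spaces after the comma, and the semicolon comment lines follow common INI convention and are assumptions, not confirmed syntax.

```ini
[RecognizerParams]
; Recognize English and German text (language names are illustrative).
TextLanguage=English,German
```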
LanguageDetectionMode
Key | Type | Default |
---|---|---|
LanguageDetectionMode | ThreeStatePropertyValueEnum | TSPV_Auto |
Manages automatic language detection.
When language autodetection is on, the recognition language is detected for each word in the text and is selected from the list of languages specified in the TextLanguage property. Autodetection is intended for documents whose language you do not know in advance. If you know for certain that all the languages you specified are present in the document, autodetection brings no benefit; turn it off by setting this property to TSPV_No.
ThreeStatePropertyValueEnum
- TSPV_Auto: Automatically determine if this processing mode should be used, depending on the situation (image characteristics, etc.).
- TSPV_No: The processing mode in question will not be used.
- TSPV_Yes: The processing mode in question will be used.
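A hedged sketch of the case described above, where every listed language is known to appear in the document and per-word detection can be switched off. The language names and the semicolon comment syntax are assumptions for illustration.

```ini
[RecognizerParams]
; Both languages are known to be present in the document,
; so per-word language detection is switched off.
TextLanguage=English,German
LanguageDetectionMode=TSPV_No
```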
TextTypes
Key | Type | Default |
---|---|---|
TextTypes | TextTypeEnum | TT_Normal |
The value of this property is an OR combination of the TextTypeEnum enumeration constants that denote the possible text types used for recognition. For example, if it is set to TT_Normal | TT_Index, the Pdftools OCR Service presumes that the text contains only common typographic text and digits written in ZIP-code style, ignoring all other variants. See also Using Text Type Autodetection.
- If this property is equal to any combination of TT_Matrix, TT_Typewriter, TT_OCR_A, and TT_OCR_B, italic fonts and superscript/subscript will not be recognized, regardless of the values of the ProhibitItalic, ProhibitSubscript, and ProhibitSuperscript properties.
- If this property is TT_Handprinted, the CorrectOrientation property of the [PagePreprocessingParams] section cannot be set to true.
TextTypeEnum
- TT_Gothic: This value sets the Pdftools OCR Service to presume that the text on the recognized image is printed in Gothic type.
- TT_Handprinted: This value corresponds to handprinted text. Note: Automatic analysis is not available for handprinted text. The coordinates of blocks containing handprinted text should be set manually.
- TT_Index: This value corresponds to a special set of characters that includes only digits written in ZIP-code style.
- TT_Matrix: This value sets the Pdftools OCR Service to presume that the text on the recognized image is printed on a dot-matrix printer.
- TT_MICR_CMC7: This value corresponds to a special set of characters that includes only digits and the characters A, B, C, D, and E, written in the MICR barcode font (CMC-7).
- TT_MICR_E13B: This value corresponds to a special set of characters that includes only digits and the characters A, B, C, and D, printed in magnetic ink. MICR (Magnetic Ink Character Recognition) characters are found in a variety of places, including personal checks.
- TT_Normal: This value corresponds to a common typographic type of text.
- TT_OCR_A: This value corresponds to a monospaced font designed for Optical Character Recognition. It is largely used by banks, credit card companies, and similar businesses.
- TT_OCR_B: This value corresponds to a font designed for Optical Character Recognition.
- TT_Receipt: This value corresponds to the text of a receipt. This type of text is designed to recognize sales receipts, invoices, etc. Unlike the other types, it is not concerned with the actual font of the text. Rather, it tells the recognizer that there may be text of low quality, mostly in monospaced or normal font.
- TT_Typewriter: This value sets the Pdftools OCR Service to presume that the text on the recognized image is typed on a typewriter.
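The following fragment is a sketch of selecting a single text type, here for scanned sales receipts. It deliberately uses one constant only: this page does not show how a combination such as TT_Normal | TT_Index is spelled inside the INI file, so that form is left out rather than guessed. The semicolon comment syntax is an assumption based on common INI convention.

```ini
[RecognizerParams]
; Low-quality receipts and invoices, mostly in monospaced or normal font.
TextTypes=TT_Receipt
```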
Recognition speed
BalancedMode
Key | Type | Default |
---|---|---|
BalancedMode | Boolean | false |
If this property is true, the recognition runs in balanced mode.
The balanced mode is an intermediate mode between full and fast modes.
The fast mode can be activated with the FastMode property.
This property is available for machine-printed text only. For handprinted text, the recognition is run in full mode.
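A minimal sketch of enabling balanced mode for machine-printed text. The lowercase Boolean spelling follows the default shown in the table; the semicolon comment syntax is an assumption.

```ini
[RecognizerParams]
; Intermediate trade-off between full and fast recognition.
; Has no effect on handprinted text, which always runs in full mode.
BalancedMode=true
```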
FastMode
Key | Type | Default |
---|---|---|
FastMode | Boolean | false |
When this property is set to true
, the Pdftools OCR Service provides 2-2.5 times faster recognition speed at the cost of a moderately increased error
rate (1.5-2 times more errors). This property is available both for machine and handprinted text. In the case of a
handprinted text (text type TT_Handprinted), a special recognition mode is used. On good print quality text,
Pdftools OCR Service makes an average of 1-2 errors per page, and such a moderate increase in error rate can be
easily tolerated in many cases, such as full text indexing with “fuzzy” searches, preliminary recognition, etc.
We do not recommend using this mode to recognize small image fragments (for example, fragments that consist of only one line or word) because the time advantage is insignificant.
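To make the trade-off concrete: applying the 1.5 to 2 times multiplier to the baseline of 1 to 2 errors per page gives roughly 1.5 to 4 errors per page on good-quality print. The sketch below enables fast mode for a use case where that is acceptable, such as full-text indexing. The semicolon comment syntax is an assumption.

```ini
[RecognizerParams]
; Full-text indexing of whole pages: speed matters more than a few extra errors.
; Not recommended for single-line or single-word fragments.
FastMode=true
```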
Fine tuning
LowResolutionMode
Key | Type | Default |
---|---|---|
LowResolutionMode | Boolean | false |
Specifies whether text on an image with low resolution is recognized. This property is useful when recognizing faxes, small prints, images with low resolution, or bad print quality.
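A short sketch for a fax-processing setup, assuming the input consists of low-resolution scans. The semicolon comment syntax is an assumption.

```ini
[RecognizerParams]
; Input is faxes and other low-resolution or poorly printed images.
LowResolutionMode=true
```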
OneLinePerBlock
Key | Type | Default |
---|---|---|
OneLinePerBlock | Boolean | false |
When set to true, the Pdftools OCR Service presumes that the text in the block to which the current RecognizerParams object belongs contains no more than one line of text.
OneWordPerLine
Key | Type | Default |
---|---|---|
OneWordPerLine | Boolean | false |
When set to true, the Pdftools OCR Service presumes that no text line contains more than one word, so each line of text is recognized as a single word.
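As an illustration, the fragment below combines the two layout hints for blocks that hold a single value each, for example a one-word form field. Whether this combination suits a given document is an assumption to verify against your own input; the semicolon comment syntax is also assumed.

```ini
[RecognizerParams]
; Each block contains at most one line of text ...
OneLinePerBlock=true
; ... and each line is treated as one word.
OneWordPerLine=true
```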
ProhibitItalic
Key | Type | Default |
---|---|---|
ProhibitItalic | Boolean | false |
When set to true, the Pdftools OCR Service does not recognize letters printed in an italic font. This is useful when the text presumably contains no italic letters, in which case it may speed up the recognition.
If the image does contain italic letters and this property is true, these letters will be recognized incorrectly.
ProhibitSubscript
Key | Type | Default |
---|---|---|
ProhibitSubscript | Boolean | false |
When set to true, the Pdftools OCR Service does not recognize subscript letters. This is useful when the text presumably contains no subscripts, in which case it may speed up the recognition. If the image does contain subscript letters and this property is true, these letters will be recognized incorrectly.
ProhibitSuperscript
Key | Type | Default |
---|---|---|
ProhibitSuperscript | Boolean | false |
When set to true, the Pdftools OCR Service does not recognize superscript letters. This is useful when the text presumably contains no superscripts, in which case it may speed up the recognition. If the image does contain superscript letters and this property is true, these letters will be recognized incorrectly.
ProhibitHyphenation
Key | Type | Default |
---|---|---|
ProhibitHyphenation | Boolean | false |
When set to true, this property prohibits recognition of hyphenation from line to line. This is useful when the text presumably contains no hyphenation, in which case it may speed up the recognition.
If the recognized block does contain hyphenation and this property is true, the hyphenated words will be recognized incorrectly.
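The four Prohibit properties described above can be set together when the input is known to be plain. The sketch below assumes documents with no italic, subscript, superscript, or hyphenated text; if that assumption is wrong, the affected characters or words are recognized incorrectly. The semicolon comment syntax is an assumption.

```ini
[RecognizerParams]
; Plain upright text only: skip checks that cannot apply, which may speed up recognition.
ProhibitItalic=true
ProhibitSubscript=true
ProhibitSuperscript=true
; Words are never split across lines in these documents.
ProhibitHyphenation=true
```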
ProhibitInterblockHyphenation
Key | Type | Default |
---|---|---|
ProhibitInterblockHyphenation | Boolean | false |
When set to true, this property tells the Pdftools OCR Service to presume that text from one block cannot be carried over to the next block.
CaseRecognitionMode
Key | Type | Default |
---|---|---|
CaseRecognitionMode | CaseRecognitionModeEnum | CRM_AutoCase |
This property specifies the mode of letter case recognition.
CaseRecognitionModeEnum
- CRM_AutoCase: Automatically detect the case of letters and keep it in the output text.
- CRM_CapitalCase: The recognized text will be set in capitals.
- CRM_SmallCase: The recognized text will be set in lowercase letters.
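A brief sketch of forcing the output case, for instance when downstream processing expects capitals only. The use case and the semicolon comment syntax are assumptions.

```ini
[RecognizerParams]
; Output all recognized text in capital letters.
CaseRecognitionMode=CRM_CapitalCase
```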
Handprint recognition
WritingStyle
Key | Type | Default |
---|---|---|
WritingStyle | WritingStyleEnum | WS_Auto |
Provides additional information about the writing style of handprinted letters.
By default, the value of this property is WS_Auto, which means that the writing style is automatically detected.
WritingStyleEnum
- WS_American: The American writing style.
- WS_Arabic: The Arabic writing style. This style does not contain any specific characters. There is no need to use this constant.
- WS_Auto: The writing style is detected automatically.
- WS_Azerbaijan: The Azerbaijan writing style.
- WS_Baltic: The Baltic writing style.
- WS_British: The British writing style.
- WS_Bulgarian: The Bulgarian writing style.
- WS_Canadian: The Canadian writing style.
- WS_Chinese: The Chinese writing style.
- WS_Common: The Esperanto writing style.
- WS_Croatian: The Croatian writing style.
- WS_Czech: The Czech writing style.
- WS_Default: This constant is deprecated and will be removed in future versions. Please use WS_Auto instead to ensure the best recognition result. If you need to select the writing style corresponding to the current operating system language, use WS_DetectByLocale, which has the same value and behavior.
- WS_DetectByLocale: The writing style is selected depending on the current language of the operating system.
- WS_French: The French writing style.
- WS_German: The German writing style.
- WS_Greek: The Greek writing style.
- WS_Hungarian: The Hungarian writing style.
- WS_Italian: The Italian writing style.
- WS_Japanese: The Japanese writing style.
- WS_Kazakh: The Kazakh writing style.
- WS_Kirgiz: The Kirgiz writing style.
- WS_Latvian: The Latvian writing style.
- WS_Polish: The Polish writing style.
- WS_Romanian: The Romanian writing style.
- WS_Russian: The Russian writing style.
- WS_Slovak: The Slovak writing style.
- WS_Spanish: The Spanish writing style.
- WS_Thai: The Thai writing style.
- WS_Turkish: The Turkish writing style. This style does not contain any specific characters. There is no need to use this constant.
- WS_Ukrainian: The Ukrainian writing style.
FieldMarkingType
Key | Type | Default |
---|---|---|
FieldMarkingType | FieldMarkingTypeEnum | FMT_SimpleText |
This property specifies the type of marking around letters (for example, underline, frame, box, etc.). This property is valid only for the handprint recognition.
::: info
For correct handprint recognition, use the CellsCount property, which allows you to set the number of character cells for a recognized block.
:::
FieldMarkingTypeEnum
- FMT_CharBoxSeries: This value specifies that the field where the text is located is a set of separate boxes.
- FMT_CombInFrame: This value specifies that the field where the text is located is a comb and that this comb is also the bottom line of a frame.
- FMT_GrayBoxes: This value specifies that the text is located in white fields on a gray background.
- FMT_PartitionedFrame: This value specifies that the field where the text is located is a frame, and this frame is split by vertical lines.
- FMT_SimpleComb: This value specifies that the field where the text is located is a comb.
- FMT_SimpleText: This value denotes plain text.
- FMT_TextInFrame: This value denotes plain text enclosed in a frame.
- FMT_UnderlinedText: This value denotes plain text that is underlined.
CellsCount
Key | Type | Default |
---|---|---|
CellsCount | Integer | 1 |
Specifies the number of character cells for a recognized block.
This property is valid only for the handprint recognition.
It only makes sense for the field marking types (the FieldMarkingType property) that imply splitting the text into cells.
The default value for this property is 1, but you should set the appropriate value to recognize the text correctly.
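The handprint settings work together, so a combined sketch may help. It assumes a form field made of ten separate character boxes filled in by hand; the cell count of 10, the choice of WS_American, and the semicolon comment syntax are illustrative assumptions. CorrectOrientation is left at false because it cannot be true when TextTypes is TT_Handprinted. Block coordinates for handprinted text still have to be set manually and are not shown here.

```ini
[RecognizerParams]
; Handprinted text in a row of separate character boxes.
TextTypes=TT_Handprinted
FieldMarkingType=FMT_CharBoxSeries
; Number of character cells in the field (illustrative value).
CellsCount=10
; Optional hint; WS_Auto (the default) detects the style automatically.
WritingStyle=WS_American

[PagePreprocessingParams]
; Must not be true when TextTypes is TT_Handprinted.
CorrectOrientation=false
```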
User patterns
UseBuiltInPatterns
Key | Type | Default |
---|---|---|
UseBuiltInPatterns | Boolean | true |
When this property is set to true, the Pdftools OCR Service uses its own built-in patterns for recognition.
Patterns are files that establish a relationship between a character image and the character itself. Set this property to false when you want recognition to rely on user patterns only instead of the standard Pdftools OCR Service patterns. This may be useful for recognizing text typed in decorative or nonstandard fonts, where your own user-defined patterns trained for these fonts work better than the built-in ones.
The path to a user-defined pattern file is stored in the UserPatternsFile property. If the UserPatternsFile property is empty, the UseBuiltInPatterns property is ignored.
UserPatternsFile
Key | Type | Default |
---|---|---|
UserPatternsFile | String | "" |
Contains the full path to a file with the user pattern used for recognition. If the value of this property is not empty, information from the user pattern file will be used during recognition.
If the UseBuiltInPatterns property is false, which means that standard Pdftools OCR Service patterns are not used during recognition, this property should contain a path to a user-defined pattern file, as only information stored in it will be used.
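A sketch of recognizing with user patterns only. The file path below is hypothetical, including its name and extension, and whether the value needs quoting is not stated on this page; the semicolon comment syntax is likewise an assumption.

```ini
[RecognizerParams]
; Rely on user-defined patterns only, e.g. for a decorative font.
UseBuiltInPatterns=false
; Hypothetical full path to the trained user pattern file.
UserPatternsFile=C:\OCR\Patterns\decorative-font.ptn
```

Leaving UserPatternsFile empty while setting UseBuiltInPatterns to false has no effect, because UseBuiltInPatterns is ignored when no user pattern file is given.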