Pattern recognizers
A pattern recognizer detects entities by matching the document text against one or more regular expressions. AI Smart Redact ships with built-in recognizers for common entity types such as email addresses, credit card numbers, and phone numbers. Refer to Entity types for the full list of pattern-based labels. You can also define your own recognizers per detection request.
How pattern matching works
The following steps run inside the Detect entities stage of the detection pipeline. For each recognizer whose label is in the configured labels list:
- Run each regex pattern against the input text.
- Optionally validate the format of each match. A failed format check rejects the match.
- Optionally validate a checksum. A failed checksum check either rejects the match or reduces its score, depending on checksumValidationMode.
- Apply context boosting. Words appearing near the match raise the confidence score.
- Emit a detected entity with the resulting score.
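The steps above can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the Recognizer class, the detect and simple_context_boost functions, and the dictionary shape of the emitted entity are hypothetical names introduced here, and the context boost is simplified to single words. The 0.30 penalty and the boost limits are the values documented in the sections that follow.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Recognizer:
    """Hypothetical in-memory shape of a recognizer (not the API schema)."""
    label: str
    patterns: list  # list of (regex, base_score) tuples
    format_ok: Optional[Callable[[str], bool]] = None
    checksum_ok: Optional[Callable[[str], bool]] = None
    context_words: list = field(default_factory=list)


def simple_context_boost(text, match, context_words):
    """Simplified boost: single words only, 6 tokens before and 3 after."""
    before = re.findall(r"\w+", text[:match.start()].lower())[-6:]
    after = re.findall(r"\w+", text[match.end():].lower())[:3]
    window = set(before + after)
    hits = sum(1 for w in {c.lower() for c in context_words} if w in window)
    return min(0.15, 0.05 * hits)


def detect(text, recognizers, labels, checksum_mode="strict"):
    entities = []
    for rec in recognizers:
        if rec.label not in labels:
            continue  # only recognizers whose label was requested
        for regex, base_score in rec.patterns:
            for m in re.finditer(regex, text):  # run the pattern
                value, score = m.group(), base_score
                if rec.format_ok and not rec.format_ok(value):
                    continue  # a failed format check always rejects
                if rec.checksum_ok and not rec.checksum_ok(value):
                    if checksum_mode == "strict":
                        continue  # strict: discard the match
                    score -= 0.30  # relaxed: keep with a reduced score
                score += simple_context_boost(text, m, rec.context_words)
                entities.append({
                    "label": rec.label,
                    "text": value,
                    "start": m.start(),
                    "end": m.end(),
                    "score": round(min(1.0, score), 2),  # capped at 1.0
                })
    return entities


project = Recognizer(
    "PROJECT_CODE",
    [(r"\bPRJ-\d{4}-[A-Z]{3}\b", 0.85)],
    context_words=["project", "code", "reference"],
)
print(detect("Ref code PRJ-2024-ABC here", [project], ["PROJECT_CODE"]))
```

In this sketch the word code falls inside the window before the match, so one context match is counted on top of the base score.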
Built-in pattern recognizers
The built-in pattern recognizers cover a broad set of common entity formats and include format and checksum validation where applicable. For example:
- CREDIT_CARD validates the Luhn checksum.
- IBAN validates the ISO 7064 Mod 97-10 checksum.
- BIC_SWIFT validates the country code position against ISO 3166.
- EMAIL_ADDRESS validates the top-level domain against the IANA list.
For the full list of labels and what each one matches, refer to Entity types.
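To make the two checksum rules named above concrete, here is a minimal Python sketch of the standard Luhn and ISO 7064 Mod 97-10 algorithms. The function names luhn_valid and iban_valid are hypothetical, and the sketch shows only the public algorithms; how the built-in recognizers normalize input or combine these checks with format validation is not specified here.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:       # same as adding the two digits of the product
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 7064 Mod 97-10 check as used for IBANs."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]  # move country code and check digits to the end
    # Letters become two-digit numbers: A=10, B=11, ..., Z=35.
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))          # well-known test card number
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # commonly cited example IBAN
```

Both example inputs are widely used test values, and a single changed digit makes either check fail.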
Format and checksum validators
Format and checksum validators reject matches that share the structural shape of the target entity but don’t conform to its actual specification.
- A format validator runs first and always rejects on failure. For example, the email recognizer rejects matches whose top-level domain isn’t on the IANA list.
- A checksum validator runs after the format check. Its behavior on failure depends on the top-level checksumValidationMode field:

| Mode | Behavior on checksum failure |
|---|---|
| strict (default) | The match is discarded. |
| relaxed | The match is kept with its score reduced by 0.30. Context boosting can still raise the reduced score over the threshold. |
Use relaxed when missing an entity is more costly than a few false positives. For example, when redacting documents where credit card numbers may have been transcribed with errors, relaxed mode keeps numbers that fail Luhn validation, at the cost of detecting some random digit sequences as credit cards.
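The effect of the two modes on a single match can be sketched as a small Python function. The name score_after_checksum is hypothetical, and flooring the reduced score at 0.0 and rounding it to two decimals are assumptions made for the sketch; only the 0.30 reduction and the discard behavior come from the table above.

```python
from typing import Optional

RELAXED_PENALTY = 0.30  # documented reduction in relaxed mode


def score_after_checksum(score: float, checksum_passed: bool,
                         mode: str = "strict") -> Optional[float]:
    """Return the score after the checksum step, or None if the match is discarded."""
    if checksum_passed:
        return score
    if mode == "strict":
        return None  # strict (default): the match is discarded
    if mode == "relaxed":
        # Assumption: floor at 0.0 and round for readability.
        return round(max(0.0, score - RELAXED_PENALTY), 2)
    raise ValueError(f"unknown checksumValidationMode: {mode!r}")


print(score_after_checksum(0.85, False, "strict"))
print(score_after_checksum(0.85, False, "relaxed"))
```

A match that starts at 0.85 and fails its checksum drops to 0.55 in relaxed mode, so up to +0.15 of context boosting could still lift it to 0.70.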
Context boosting
Context boosting raises the confidence score of a pattern match when relevant words appear nearby. This rewards matches found in a semantically meaningful context, such as the word email near an email address.
How it works
- The text around the match is tokenized into words.
- A window of 6 tokens before and 3 tokens after the match is collected.
- Each contextWords entry is matched against the window using whole-word, case-insensitive comparison. Multi-word phrases are matched as consecutive tokens. Longer phrases are matched first; a shorter context word that overlaps an already-matched phrase is skipped.
- Each unique match adds 0.05 to the confidence score, capped at +0.15 total.
- The final score is capped at 1.0.
Parameters
| Parameter | Value |
|---|---|
| Window before match | 6 tokens |
| Window after match | 3 tokens |
| Boost per context match | +0.05 |
| Maximum total boost | +0.15 |
| Maximum final score | 1.0 |
Example
For the input "Please send to email: john@example.com for details":
- Entity: john@example.com with a base score of 0.80.
- Tokens preceding the match (up to 6): ["Please", "send", "to", "email"]. Matches email.
- Tokens following the match (up to 3): ["for", "details"]. No match.
- Total boost: 1 × 0.05 = +0.05.
- Final score: 0.85.
These numbers are illustrative. Actual base scores vary by recognizer and pattern variant.
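The example can be reproduced with a short Python sketch of the boosting rules. The function names, the \w+ tokenizer, and the choice to treat the windows before and after the match separately are assumptions; the window sizes, the per-match boost, and the caps are the documented parameters.

```python
import re

WINDOW_BEFORE, WINDOW_AFTER = 6, 3
BOOST_PER_MATCH, MAX_BOOST = 0.05, 0.15


def tokenize(s):
    # Assumed tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", s.lower())


def _claim(phrase, windows, used):
    """Find phrase as consecutive tokens that do not overlap an earlier match."""
    n = len(phrase)
    for w, window in enumerate(windows):
        for i in range(len(window) - n + 1):
            positions = set(range(i, i + n))
            if tuple(window[i:i + n]) == phrase and not positions & used[w]:
                used[w] |= positions
                return True
    return False


def boosted_score(text, start, end, base_score, context_words):
    windows = [tokenize(text[:start])[-WINDOW_BEFORE:],
               tokenize(text[end:])[:WINDOW_AFTER]]
    # Longer phrases first, so shorter overlapping words are skipped.
    phrases = sorted({tuple(tokenize(w)) for w in context_words},
                     key=lambda p: (-len(p), p))
    used = [set(), set()]
    hits = sum(1 for p in phrases if p and _claim(p, windows, used))
    boost = min(MAX_BOOST, BOOST_PER_MATCH * hits)
    return round(min(1.0, base_score + boost), 2)  # rounded for display


text = "Please send to email: john@example.com for details"
start = text.index("john@example.com")
end = start + len("john@example.com")
print(boosted_score(text, start, end, 0.80, ["email", "mail"]))  # 0.85
```

With the context words credit card, card, and number, the phrase credit card is claimed first, so the standalone word card that overlaps it does not count a second time.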
Context boosting applies only to pattern recognizers. Keyword and semantic recognizers don’t perform context boosting.
Language support
The detection request’s languages field controls language-aware behavior in pattern recognizers:
- Context words. Each selected language adds its own context word vocabulary on top of the universal context words. Universal context words such as iban, visa, and dob are always loaded.
- DATE pattern coverage. Verbal date patterns are generated per selected language. For example, January matches in English, Januar in German, and janvier in French. DATE is the only label whose pattern coverage depends on languages.
Supported languages: English (en), German (de), French (fr), Italian (it), Spanish (es), Portuguese (pt), and Dutch (nl).
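As a rough illustration of per-language date coverage, the following Python sketch builds one verbal date regex from the month names of the selected languages. The MONTHS table, the build_date_regex function, and the day-month-year shape of the pattern are all hypothetical; the patterns the service actually generates are not documented here, and only three of the seven languages are included to keep the sketch short.

```python
import re

# Hypothetical month vocabularies for three of the supported languages.
MONTHS = {
    "en": ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"],
    "de": ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
           "August", "September", "Oktober", "November", "Dezember"],
    "fr": ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
           "août", "septembre", "octobre", "novembre", "décembre"],
}


def build_date_regex(languages):
    """Build a 'day month-name year' pattern for the selected languages only."""
    names = {name for lang in languages for name in MONTHS[lang]}
    # Longest names first so a longer name is tried before a shorter one.
    alternation = "|".join(re.escape(n)
                           for n in sorted(names, key=lambda n: (-len(n), n)))
    return re.compile(rf"\b\d{{1,2}}\.?\s+(?:{alternation})\s+\d{{4}}\b",
                      re.IGNORECASE)


english_only = build_date_regex(["en"])
print(bool(english_only.search("due 12 January 2024")))  # True
print(bool(english_only.search("am 3. Januar 2024")))    # False
```

Adding de to the selected languages makes the German date match as well, which mirrors how the languages field widens DATE coverage.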
Custom pattern recognizers
Add a custom pattern recognizer to detect entity types not covered by the built-in set. Custom recognizers are appended to the built-in ones unless you also disable the built-in recognizer for the same label.
{
  "detectionConfiguration": {
    "patternRecognizers": [
      {
        "name": "CustomProjectCode",
        "label": "PROJECT_CODE",
        "patterns": [
          {
            "name": "ProjectCode",
            "regex": "\\bPRJ-\\d{4}-[A-Z]{3}\\b",
            "score": 0.85
          }
        ],
        "contextWords": ["project", "code", "reference"]
      }
    ]
  }
}
For the full schema, refer to PatternRecognizer schema.
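Before sending a custom recognizer, it can help to check the regex locally against sample text. The sketch below parses the request body shown above and tries its pattern with Python's re module. Python is only a stand-in: the regex engine used by the service is not stated here, so syntax and matching behavior may differ, and the helper names first_pattern and find_codes are hypothetical. Note that \\b in the JSON source becomes \b once the JSON is decoded.

```python
import json
import re

# Raw string so the JSON escapes (\\b, \\d) reach json.loads unchanged.
request_body = r'''
{
  "detectionConfiguration": {
    "patternRecognizers": [
      {
        "name": "CustomProjectCode",
        "label": "PROJECT_CODE",
        "patterns": [
          {"name": "ProjectCode", "regex": "\\bPRJ-\\d{4}-[A-Z]{3}\\b", "score": 0.85}
        ],
        "contextWords": ["project", "code", "reference"]
      }
    ]
  }
}
'''


def first_pattern(body):
    """Compile the first pattern of the first custom recognizer in the body."""
    config = json.loads(body)
    recognizer = config["detectionConfiguration"]["patternRecognizers"][0]
    return re.compile(recognizer["patterns"][0]["regex"])


def find_codes(text):
    return [m.group() for m in first_pattern(request_body).finditer(text)]


sample = "Project code PRJ-2024-ABC replaces PRJ-24-AB and prj-2024-abc."
print(find_codes(sample))  # ['PRJ-2024-ABC']
```

The pattern is case-sensitive and requires exactly four digits and three uppercase letters, so the two malformed codes in the sample are not matched.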
Disable a built-in recognizer
To replace a built-in pattern recognizer for a label, set disableBuiltInPatternRecognizersForLabels to disable the built-in one and add a custom recognizer for the same label:
{
  "detectionConfiguration": {
    "disableBuiltInPatternRecognizersForLabels": ["PHONE_NUMBER"],
    "patternRecognizers": [
      {
        "name": "InternalPhone",
        "label": "PHONE_NUMBER",
        "patterns": [
          {
            "name": "InternalExtension",
            "regex": "\\bx\\d{4}\\b",
            "score": 0.85
          }
        ]
      }
    ]
  }
}
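The way built-in and custom recognizers combine can be pictured with a small Python sketch. The effective_recognizers function and the built-in recognizer names are hypothetical and only model the documented rule: custom recognizers are appended to the built-in ones, and built-in recognizers for a disabled label are dropped.

```python
def effective_recognizers(built_in, custom, disabled_labels=()):
    """Return the recognizers that would run: kept built-ins, then custom ones."""
    disabled = set(disabled_labels)
    kept = [r for r in built_in if r["label"] not in disabled]
    return kept + list(custom)


# Hypothetical built-in entries, for illustration only.
built_in = [
    {"name": "BuiltInPhone", "label": "PHONE_NUMBER"},
    {"name": "BuiltInEmail", "label": "EMAIL_ADDRESS"},
]
custom = [{"name": "InternalPhone", "label": "PHONE_NUMBER"}]

replaced = effective_recognizers(built_in, custom, ["PHONE_NUMBER"])
print([r["name"] for r in replaced])  # ['BuiltInEmail', 'InternalPhone']
```

Without the disabled label, both phone recognizers would run, so text that matches the built-in phone patterns would still be detected alongside internal extensions.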
Performance and safety
- Compilation and caching. Each regex pattern compiles on first use and is cached for subsequent matches.
- Backtracking protection. Custom patterns run in non-backtracking mode by default. Set allowBacktracking: true only when your pattern requires it. Backtracking patterns can run slowly on adversarial inputs, and each match runs until it completes or hits regexTimeout.
- Execution timeout. The regexTimeout field caps a single regex execution at the configured number of milliseconds (default 1000). Matches that exceed the timeout are aborted to prevent stalled requests.
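The compile-once behavior described in the first bullet can be sketched in Python with a memoized compile function. The compiled function is a hypothetical name, and this is an analogy rather than the service's mechanism. Python's standard re module has neither a non-backtracking mode nor a per-match timeout, so those two protections are not modeled here.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def compiled(pattern: str) -> re.Pattern:
    """Compile a pattern on first use and reuse the compiled object afterwards."""
    return re.compile(pattern)


first = compiled(r"\bx\d{4}\b")   # compiled now
second = compiled(r"\bx\d{4}\b")  # served from the cache
print(first is second)            # True
print(bool(second.search("call x1234 today")))  # True
```

Keying the cache on the pattern string means that repeated requests with the same custom pattern pay the compilation cost only once in this sketch.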