Semantic recognizer
The semantic recognizer detects entities by understanding the context in which they appear, rather than by matching predefined formats. AI Smart Redact uses a neural Named Entity Recognition (NER) model for this purpose.
Semantic detection is best suited for entities whose form varies too widely for regex matching, such as person names, organization names, and physical addresses. For format-specific entities such as credit card numbers, IBANs, and dates, use Pattern recognizers instead.
Default entity mapping
The semantic model emits its own output labels. Out of the box, these are mapped to four entity labels:
| Label | Source labels |
|---|---|
| PERSON | name, first name, last name |
| ORGANISATION | company name, organization name |
| PHYSICAL_ADDRESS | location address |
| USERNAME | user name |
Customize the entity mapping
The semantic model can recognize a wide range of entity types beyond the defaults. The model itself isn’t user-configurable; only the entityMapping field can be changed. Override semanticRecognizer.entityMapping to add new entity types or change the mapping. Each entry maps a model output label (the key) to the entity label that should be emitted (the value).
The model accepts arbitrary descriptive phrases as output labels. There’s no fixed vocabulary, so you can ask for any concept the model can interpret. Avoid choosing labels that overlap with other semantic mappings (for example, city would compete with the built-in location address mapped to PHYSICAL_ADDRESS) or with deterministic recognizers (for example, date of birth would compete with the DATE pattern recognizer, which detects dates more reliably).
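As a rough illustration of the key-to-value relationship described above, the following Python sketch translates model output labels into emitted entity labels. The `Detection` type and the `apply_entity_mapping` helper are hypothetical names introduced for this example, and dropping unmapped labels is an assumption; none of this is the product's internal code.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """Hypothetical raw detection: the matched text and the model's output label."""
    text: str
    model_label: str


# The default mapping from the table above (model output label -> entity label).
DEFAULT_MAPPING = {
    "name": "PERSON",
    "first name": "PERSON",
    "last name": "PERSON",
    "company name": "ORGANISATION",
    "organization name": "ORGANISATION",
    "location address": "PHYSICAL_ADDRESS",
    "user name": "USERNAME",
}


def apply_entity_mapping(detections, mapping):
    """Return (text, entity label) pairs; detections with unmapped labels are skipped (assumption)."""
    return [
        (d.text, mapping[d.model_label])
        for d in detections
        if d.model_label in mapping
    ]


raw = [
    Detection("Ada Lovelace", "name"),
    Detection("Acme Ltd", "company name"),
    Detection("British", "nationality"),
]

# Adding one entry is enough to surface a new entity label.
custom_mapping = {**DEFAULT_MAPPING, "nationality": "NATIONALITY"}

default_result = apply_entity_mapping(raw, DEFAULT_MAPPING)
custom_result = apply_entity_mapping(raw, custom_mapping)
```

With the default mapping only the person and the organization come through; with the extended mapping, "British" is also emitted as NATIONALITY.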
The following example extends the built-in default mapping to detect nationalities and occupations as separate labels:
{
  "detectionConfiguration": {
    "semanticRecognizer": {
      "name": "GlinerLarge",
      "entityMapping": {
        "name": "PERSON",
        "first name": "PERSON",
        "last name": "PERSON",
        "company name": "ORGANISATION",
        "organization name": "ORGANISATION",
        "location address": "PHYSICAL_ADDRESS",
        "user name": "USERNAME",
        "nationality": "NATIONALITY",
        "occupation": "OCCUPATION"
      }
    }
  }
}
When you customize the mapping, also include the new entity labels in the request’s labels field. Otherwise the detected entities are removed by the label-filtering pipeline stage.
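One way to keep the labels field in sync with a custom mapping is to derive it from the mapping's values. The Python sketch below does this for the mapping used in the example. Placing `labels` at the top level of the request body is an assumption made for illustration; check the request schema for where the field actually belongs.

```python
import json

# Custom mapping: the defaults plus two additional entity types.
entity_mapping = {
    "name": "PERSON",
    "first name": "PERSON",
    "last name": "PERSON",
    "company name": "ORGANISATION",
    "organization name": "ORGANISATION",
    "location address": "PHYSICAL_ADDRESS",
    "user name": "USERNAME",
    "nationality": "NATIONALITY",
    "occupation": "OCCUPATION",
}

# Unique entity labels in first-seen order (dict keys preserve insertion order).
labels = list(dict.fromkeys(entity_mapping.values()))

request = {
    # Assumption: "labels" sits at the top level of the request body.
    "labels": labels,
    "detectionConfiguration": {
        "semanticRecognizer": {
            "name": "GlinerLarge",
            "entityMapping": entity_mapping,
        }
    },
}

print(json.dumps(request, indent=2))
```

Labels produced by other recognizers, such as DATE, are not covered by this derivation and would need to be appended separately if you want them returned.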
For the full schema, refer to SemanticRecognizer schema.
Inference settings
Semantic-model inference is tuned through Worker configuration:
| Setting | Default | Description |
|---|---|---|
| Inference.MaxChunkSize | 256 tokens | Maximum chunk size sent to the model. Long text is split into chunks of this size before inference. |
| Inference.MaxLength | 512 tokens | Model token-length limit. Don’t increase beyond the model’s supported context window. |
| Inference.BatchSize | 1 | Number of chunks processed per inference call. Higher values increase throughput at the cost of memory. |
For where to set these values, refer to Worker configuration.
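To make the interaction between chunk size and batch size concrete, here is a minimal Python sketch. It uses whitespace-separated words as a stand-in for model tokens and fixed-size splitting without overlap. Both are simplifying assumptions, since this page doesn't describe the Worker's tokenizer or exact splitting strategy, and the function names are hypothetical.

```python
MAX_CHUNK_SIZE = 256  # mirrors the Inference.MaxChunkSize default
BATCH_SIZE = 1        # mirrors the Inference.BatchSize default


def split_into_chunks(tokens, max_chunk_size):
    """Split a token list into consecutive chunks of at most max_chunk_size tokens."""
    return [
        tokens[i:i + max_chunk_size]
        for i in range(0, len(tokens), max_chunk_size)
    ]


def batched(chunks, batch_size):
    """Yield groups of up to batch_size chunks, one group per inference call."""
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]


# A 600-token document, using whitespace words as a stand-in for model tokens.
tokens = ("word " * 600).split()

chunks = split_into_chunks(tokens, MAX_CHUNK_SIZE)
chunk_lengths = [len(chunk) for chunk in chunks]      # [256, 256, 88]

calls_default = len(list(batched(chunks, BATCH_SIZE)))  # 3 inference calls
calls_batch_2 = len(list(batched(chunks, 2)))           # 2 inference calls
```

Under these assumptions, raising the batch size from 1 to 2 cuts the number of inference calls for this document from three to two, while each call holds up to twice as many chunks in memory.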
Limitations
Semantic detection can flag entities that the model classifies correctly but that aren’t sensitive in your context. For redaction purposes, these are false positives. Common cases include:
- Well-known public entities such as government bodies or sports teams detected as ORGANISATION.
- Cities, counties, states, or postcodes detected as PHYSICAL_ADDRESS when they appear in address-like context.
- Words that resemble names, such as colors, demonyms, or United States state names, detected as PERSON.
Mitigate these by adding keyword exclusions for the specific terms that recur in your documents. For accuracy figures, refer to Detection accuracy.
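The effect of a keyword exclusion can be pictured as a filter over detected entities. The Python sketch below is only a conceptual illustration: the `exclude_keywords` helper is hypothetical, and case-insensitive whole-string matching is an assumption, not a description of how the product matches exclusions.

```python
def exclude_keywords(entities, exclusions):
    """Drop (text, label) pairs whose text equals an excluded term, ignoring case."""
    excluded = {term.casefold() for term in exclusions}
    return [
        (text, label)
        for text, label in entities
        if text.casefold() not in excluded
    ]


detected = [
    ("Georgia", "PERSON"),          # a state name mistaken for a person
    ("Georgia Smith", "PERSON"),    # a genuine person name
    ("Red Sox", "ORGANISATION"),    # a well-known public entity
]

kept = exclude_keywords(detected, ["georgia", "Red Sox"])
```

In this sketch only the whole string is compared, so "Georgia Smith" is kept even though "georgia" is excluded.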