Semantic recognizer

The semantic recognizer detects entities by understanding the context in which they appear, rather than by matching predefined formats. AI Smart Redact uses a neural Named Entity Recognition (NER) model for this purpose.

Semantic detection is best suited for entities whose form varies too widely for regex matching, such as person names, organization names, and physical addresses. For format-specific entities such as credit card numbers, IBANs, and dates, use Pattern recognizers instead.

Default entity mapping

The semantic model emits its own output labels. Out of the box, these are mapped to four entity labels:

Label            | Source labels
PERSON           | name, first name, last name
ORGANISATION     | company name, organization name
PHYSICAL_ADDRESS | location address
USERNAME         | user name

Customize the entity mapping

The semantic model can recognize a wide range of entity types beyond the defaults. The model itself isn’t user-configurable; only the entityMapping field can be changed. Override semanticRecognizer.entityMapping to add new entity types or change the mapping. Each entry maps a model output label (the key) to the entity label that should be emitted (the value).

The model accepts arbitrary descriptive phrases as output labels. There’s no fixed vocabulary, so you can ask for any concept the model can interpret. Avoid choosing labels that overlap with other semantic mappings (for example, city would compete with the built-in location address mapped to PHYSICAL_ADDRESS) or with deterministic recognizers (for example, date of birth would compete with the DATE pattern recognizer, which detects dates more reliably).
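As a rough illustration of the key-to-value relationship described above, the following Python sketch relabels raw model output with an excerpt of the default mapping. The `Span` record and the `apply_entity_mapping` helper are hypothetical names chosen for this example, not part of the product, and skipping spans that have no mapping entry is an assumption.

```python
from dataclasses import dataclass

# Excerpt of the default mapping: model output label -> emitted entity label.
ENTITY_MAPPING = {
    "first name": "PERSON",
    "last name": "PERSON",
    "company name": "ORGANISATION",
    "location address": "PHYSICAL_ADDRESS",
}


@dataclass(frozen=True)
class Span:
    """Hypothetical detection record; not the product's actual data type."""
    start: int
    end: int
    label: str


def apply_entity_mapping(raw_spans, mapping):
    """Relabel raw model spans with the emitted entity label.

    Spans whose label has no mapping entry are skipped (an assumption).
    """
    return [
        Span(span.start, span.end, mapping[span.label])
        for span in raw_spans
        if span.label in mapping
    ]


raw = [
    Span(0, 4, "first name"),
    Span(5, 10, "last name"),
    Span(20, 28, "company name"),
]
mapped = apply_entity_mapping(raw, ENTITY_MAPPING)
print([span.label for span in mapped])
```

Several model output labels can share one emitted label, which is why both first name and last name surface as PERSON.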

The following example extends the built-in default mapping to detect nationalities and occupations as separate labels:

{
  "detectionConfiguration": {
    "semanticRecognizer": {
      "name": "GlinerLarge",
      "entityMapping": {
        "name": "PERSON",
        "first name": "PERSON",
        "last name": "PERSON",
        "company name": "ORGANISATION",
        "organization name": "ORGANISATION",
        "location address": "PHYSICAL_ADDRESS",
        "user name": "USERNAME",
        "nationality": "NATIONALITY",
        "occupation": "OCCUPATION"
      }
    }
  }
}

When you customize the mapping, also include the new entity labels in the request’s labels field. Otherwise, the label-filtering pipeline stage removes the detected entities.
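One way to avoid forgetting a label is to derive the list from the mapping itself. The Python sketch below collects the distinct emitted labels from the example mapping and reports which ones a given labels list would leave out. The helper names `labels_for_request` and `missing_labels` are hypothetical. The sketch doesn't build the request body, because its exact shape isn't shown in this section.

```python
import json

# entityMapping from the example above: model output label -> emitted entity label.
entity_mapping = {
    "name": "PERSON",
    "first name": "PERSON",
    "last name": "PERSON",
    "company name": "ORGANISATION",
    "organization name": "ORGANISATION",
    "location address": "PHYSICAL_ADDRESS",
    "user name": "USERNAME",
    "nationality": "NATIONALITY",
    "occupation": "OCCUPATION",
}


def labels_for_request(mapping):
    """Return the distinct emitted labels in first-seen order."""
    return list(dict.fromkeys(mapping.values()))


def missing_labels(mapping, requested_labels):
    """Return emitted labels that are absent from the requested labels."""
    requested = set(requested_labels)
    return [label for label in labels_for_request(mapping) if label not in requested]


labels = labels_for_request(entity_mapping)
print(json.dumps(labels))

# A labels list that still contains only the four defaults omits the new labels.
defaults_only = ["PERSON", "ORGANISATION", "PHYSICAL_ADDRESS", "USERNAME"]
print(missing_labels(entity_mapping, defaults_only))
```

If the request also uses pattern recognizers, their labels would need to be added to the same list; this sketch covers only the semantic mapping.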

For the full schema, refer to SemanticRecognizer schema.

Inference settings

Semantic-model inference is tuned through Worker configuration:

Setting                | Default    | Description
Inference.MaxChunkSize | 256 tokens | Maximum chunk size sent to the model. Long text is split into chunks of this size before inference.
Inference.MaxLength    | 512 tokens | Model token-length limit. Don’t increase beyond the model’s supported context window.
Inference.BatchSize    | 1          | Number of chunks processed per inference call. Higher values increase throughput at the cost of memory.

For where to set these values, refer to Worker configuration.
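To make the interaction between chunk size and batch size concrete, here is a minimal Python sketch of fixed-size chunking followed by batching. It is only an approximation under stated assumptions: the token list is a stand-in for real model tokens, the chunks don't overlap, and the function names are hypothetical. The Worker's actual splitting strategy may differ, for example by respecting sentence boundaries.

```python
from typing import Iterator, List, Sequence

MAX_CHUNK_SIZE = 256  # Inference.MaxChunkSize default, in tokens
BATCH_SIZE = 1        # Inference.BatchSize default, in chunks per inference call


def split_into_chunks(tokens: Sequence[str], max_chunk_size: int) -> List[List[str]]:
    """Split tokens into consecutive, non-overlapping chunks (an assumption)."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    return [
        list(tokens[i:i + max_chunk_size])
        for i in range(0, len(tokens), max_chunk_size)
    ]


def batches(chunks: List[List[str]], batch_size: int) -> Iterator[List[List[str]]]:
    """Yield groups of chunks; each group stands for one inference call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]


tokens = [f"tok{i}" for i in range(600)]  # stand-in for real model tokens
chunks = split_into_chunks(tokens, MAX_CHUNK_SIZE)
print([len(chunk) for chunk in chunks])  # [256, 256, 88]

calls_default = sum(1 for _ in batches(chunks, BATCH_SIZE))
calls_batched = sum(1 for _ in batches(chunks, 2))
print(calls_default, calls_batched)  # 3 2
```

In this sketch a 600-token text needs three inference calls at the default batch size and two at a batch size of 2, which reflects the throughput-for-memory trade-off noted in the table.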

Limitations

Semantic detection can produce false positives: entities that the semantic model recognizes correctly but that aren’t sensitive in your context. Common cases include:

  • Well-known public entities such as government bodies or sports teams detected as ORGANISATION.
  • Cities, counties, states, or postcodes detected as PHYSICAL_ADDRESS when they appear in address-like context.
  • Words that resemble names, such as colors, demonyms, or United States state names, detected as PERSON.

Mitigate these by adding keyword exclusions for the specific terms that recur in your documents. For accuracy figures, refer to Detection accuracy.
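Conceptually, a keyword exclusion acts as a filter over detected entities. The Python sketch below illustrates that idea only. It isn't the product's configuration syntax, and the detection dictionaries, the `drop_excluded` helper, and the case-insensitive whole-entity matching rule are all assumptions made for this example.

```python
def drop_excluded(detections, excluded_keywords):
    """Remove detections whose full text equals an excluded keyword, ignoring case."""
    excluded = {keyword.casefold() for keyword in excluded_keywords}
    return [
        detection
        for detection in detections
        if detection["text"].strip().casefold() not in excluded
    ]


detections = [
    {"text": "Jane Doe", "label": "PERSON"},
    {"text": "Georgia", "label": "PERSON"},
    {"text": "Department of Transport", "label": "ORGANISATION"},
]
kept = drop_excluded(detections, ["georgia", "Department of Transport"])
print([detection["text"] for detection in kept])  # ['Jane Doe']
```

Matching on the whole entity text keeps the exclusion narrow: a person whose name merely contains an excluded word is still detected in this sketch.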