Semantic recognizer

The semantic recognizer detects entities by understanding the context in which they appear, rather than by matching predefined formats. AI Smart Redact uses a neural Named Entity Recognition (NER) model for this purpose.

Semantic detection is best suited for entities whose form varies too widely for regex matching, such as person names, organization names, and physical addresses. For format-specific entities such as credit card numbers, IBANs, and dates, use Pattern recognizers instead.

Default entity mapping

The semantic model emits its own output labels. Out of the box, these are mapped to four entity labels:

Entity label	Model output labels
`PERSON`	`name`, `first name`, `last name`
`ORGANISATION`	`company name`, `organization name`
`PHYSICAL_ADDRESS`	`location address`
`USERNAME`	`user name`

Customize the entity mapping

The semantic model can recognize a wide range of entity types beyond the defaults. The model itself isn’t user-configurable, but its entity mapping is: override `semanticRecognizer.entityMapping` to add new entity types or to change how existing ones are mapped. Each entry maps a model output label (the key) to the entity label that should be emitted (the value).

The model accepts arbitrary descriptive phrases as output labels. There’s no fixed vocabulary, so you can ask for any concept the model can interpret. Avoid labels that overlap with other semantic mappings (for example, `city` would compete with the built-in `location address`, which maps to `PHYSICAL_ADDRESS`) or with deterministic recognizers (for example, `date of birth` would compete with the `DATE` pattern recognizer, which detects dates more reliably).

The following example keeps the default entries and adds two mappings so that nationalities and occupations are detected as separate labels:

{
  "detectionConfiguration": {
    "semanticRecognizer": {
      "name": "GlinerLarge",
      "entityMapping": {
        "name": "PERSON",
        "first name": "PERSON",
        "last name": "PERSON",
        "company name": "ORGANISATION",
        "organization name": "ORGANISATION",
        "location address": "PHYSICAL_ADDRESS",
        "user name": "USERNAME",
        "nationality": "NATIONALITY",
        "occupation": "OCCUPATION"
      }
    }
  }
}

When you customize the mapping, also include the new entity labels in the request’s `labels` field. Otherwise, the label-filtering pipeline stage removes the detected entities.
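
One way to keep the mapping and the requested labels in sync is to derive both from a single place. The following Python sketch merges custom entries into the default mapping listed above and collects the distinct entity labels. The helper name `build_request` is hypothetical, and placing `labels` as a top-level sibling of `detectionConfiguration` is an assumption; check the request schema for the actual location.

```python
import json

# Default mapping from the table above: model output label -> entity label.
DEFAULT_ENTITY_MAPPING = {
    "name": "PERSON",
    "first name": "PERSON",
    "last name": "PERSON",
    "company name": "ORGANISATION",
    "organization name": "ORGANISATION",
    "location address": "PHYSICAL_ADDRESS",
    "user name": "USERNAME",
}


def build_request(extra_mapping):
    """Hypothetical helper: merge custom entries and derive the labels list."""
    mapping = {**DEFAULT_ENTITY_MAPPING, **extra_mapping}
    # Deduplicate the emitted entity labels, preserving first-seen order.
    labels = list(dict.fromkeys(mapping.values()))
    return {
        # Assumption: "labels" sits next to "detectionConfiguration".
        "labels": labels,
        "detectionConfiguration": {
            "semanticRecognizer": {
                "name": "GlinerLarge",
                "entityMapping": mapping,
            }
        },
    }


request = build_request({"nationality": "NATIONALITY", "occupation": "OCCUPATION"})
print(json.dumps(request, indent=2))
```

The derived list covers only labels emitted by the semantic recognizer. If the request also relies on pattern recognizers, their labels would need to be added as well.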

For the full schema, refer to SemanticRecognizer schema.

Inference settings

Semantic-model inference is tuned through Worker configuration:

Setting	Default	Description
`Inference.MaxChunkSize`	256 tokens	Maximum chunk size sent to the model. Long text is split into chunks of this size before inference.
`Inference.MaxLength`	512 tokens	Model token-length limit. Don’t increase beyond the model’s supported context window.
`Inference.BatchSize`	1	Number of chunks processed per inference call. Higher values increase throughput at the cost of memory.

For where to set these values, refer to Worker configuration.
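
To make the interaction between chunk size and batch size concrete, here is a simplified Python sketch. It is not the Worker's actual implementation: it assumes the text is already tokenized and splits at fixed offsets, whereas a real pipeline may respect sentence boundaries, overlap chunks, or reserve tokens for the label prompt. The function names are illustrative.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(tokens: Sequence[str], max_chunk_size: int = 256) -> List[List[str]]:
    """Split a token sequence into consecutive chunks of at most max_chunk_size."""
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be at least 1")
    return [
        list(tokens[i:i + max_chunk_size])
        for i in range(0, len(tokens), max_chunk_size)
    ]


def batch_chunks(chunks: List[List[str]], batch_size: int = 1) -> Iterator[List[List[str]]]:
    """Group chunks into batches; each batch stands for one inference call."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(chunks), batch_size):
        yield chunks[i:i + batch_size]


tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_tokens(tokens)                      # default chunk size of 256
batches = list(batch_chunks(chunks, batch_size=2))
print([len(c) for c in chunks], len(batches))
```

With 600 tokens and the default chunk size, the text is split into three chunks. A batch size of 1 would mean three inference calls, while a batch size of 2 reduces that to two calls, each holding more data in memory at once.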

Limitations

Semantic detection can produce false positives: entities that the model recognizes correctly but that aren’t sensitive in your context. Common cases include:

Well-known public entities, such as government bodies or sports teams, detected as `ORGANISATION`.
Cities, counties, states, or postcodes detected as `PHYSICAL_ADDRESS` when they appear in an address-like context.
Words that resemble names, such as colors, demonyms, or United States state names, detected as `PERSON`.

Mitigate these by adding keyword exclusions for the specific terms that recur in your documents. For accuracy figures, refer to Detection accuracy.
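
The effect of a keyword exclusion can be pictured as a filter over the detected entities. The Python sketch below only illustrates that idea and is not the product's configuration syntax. The `Detection` type, the `apply_exclusions` function, and the case-insensitive whole-text matching are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Detection:
    """Hypothetical shape of a detected entity."""
    text: str
    label: str


def apply_exclusions(detections: Iterable[Detection], excluded_terms: Iterable[str]) -> List[Detection]:
    """Drop detections whose full text matches an excluded term, ignoring case."""
    excluded = {term.strip().casefold() for term in excluded_terms}
    return [d for d in detections if d.text.strip().casefold() not in excluded]


detections = [
    Detection("Jane Doe", "PERSON"),
    Detection("Virginia", "PERSON"),
    Detection("European Commission", "ORGANISATION"),
]
kept = apply_exclusions(detections, ["virginia", "European Commission"])
print([d.text for d in kept])
```

Matching on the whole detected text keeps the exclusion narrow: a term such as "Virginia" suppresses the standalone state name without affecting a longer detection such as "Virginia Woolf".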
