Detection

AI Smart Redact detects sensitive entities in PDF documents by combining deterministic pattern matching with semantic understanding. Each detection request runs through a pipeline that extracts text, runs detectors in parallel, resolves overlapping detections, filters by configured labels, and applies any user-defined exclusions before returning results.

Detection methods

Two complementary detection methods work together:

  • Deterministic detection uses regex-based pattern recognizers and keyword matching. It targets entities with predictable structure such as credit card numbers, IBANs, email addresses, and phone numbers. Format and checksum validators reject malformed matches. Refer to Pattern recognizers and Keyword recognizers.
  • Semantic detection uses a neural Named Entity Recognition model to identify entities that depend on context, such as person names, organizations, and physical addresses. Refer to Semantic recognizer.

Out of the box, AI Smart Redact uses these methods in a hybrid-complementary strategy: each entity type is handled by the method best suited to it. Deterministic recognizers handle structured formats, the semantic model handles contextual entities, and the two don’t cover the same entity types. This approach delivers the highest precision because format-specific entities benefit from validation that the semantic model can’t perform. For benchmark figures, refer to Detection accuracy.
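
To illustrate how a deterministic recognizer can pair a pattern with a checksum validator, here is a minimal sketch for credit card numbers. The regular expression, the function names, and the use of the Luhn checksum are assumptions chosen for illustration; they aren't taken from the AI Smart Redact implementation.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by single
# spaces or hyphens. The real recognizer's pattern is not documented here.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for matches that pass the checksum."""
    results = []
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_valid(digits):  # reject well-formed but invalid numbers
            results.append((m.start(), m.end(), m.group()))
    return results
```

With this sketch, a string that looks like a card number but fails the checksum produces no match. A context-based model has no comparable arithmetic check to apply.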

Detection pipeline

Each detection request passes through five stages:

  1. Resolve configuration. The pipeline merges three configuration layers to produce the final request configuration:

    • Built-in defaults shipped with the service.
    • A default configuration set in the Human-in-the-Loop (HITL) web application by an administrator. This applies to every detection request unless overridden.
    • An optional per-request configuration submitted in the Manager API call.

    Each layer overrides the previous one. Refer to Detection configuration.

  2. Detect entities. Pattern recognizers, keyword recognizers, and the semantic recognizer run in parallel against the extracted text. Each pattern recognizer applies its format and checksum validators (refer to Format and checksum validators). The pipeline then filters out any remaining invalid emissions, such as empty labels, out-of-range scores, or negative positions.

  3. Consolidate results. Entities scoring under the configured score threshold are removed, and overlapping detections are merged according to the rules in Overlap resolution.

  4. Filter by labels. Only entities whose label is in the configured labels list are kept.

  5. Apply exclusions. Any entity whose text matches a configured allowlist keyword is removed. Refer to Keyword exclusions.

The result is a list of entities, each with its text, label, position, and final confidence score.

Overlap resolution

When two detected entities overlap, the consolidator applies one of three rule sets depending on how they overlap. Two entities that merely touch (one ends where the next begins) don’t overlap and are both kept.
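
The choice between the three rule sets can be sketched as a small classifier over span positions. The function name and return values are hypothetical, and the sketch assumes half-open spans (the end position is exclusive), which is what makes two touching entities non-overlapping.

```python
def classify_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> str:
    """Classify how two spans relate, assuming half-open [start, end) spans."""
    # Disjoint or merely touching: one ends at or before the other begins.
    if a_end <= b_start or b_end <= a_start:
        return "none"
    if (a_start, a_end) == (b_start, b_end):
        return "same_span"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        return "full_containment"
    return "partial_overlap"
```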

Same span

When two entities cover the exact same start and end positions, the winner is chosen according to the request’s sameSpanStrategy field:

Strategy | Winner
deterministicWins (default) | The deterministic detector wins regardless of score.
semanticWins | The semantic detector wins regardless of score.
higherScoreWins | The entity with the higher score wins.

In all strategies, the higher-scoring entity wins among entities of the same detector type, and the first-detected entity wins on a final tie. With higherScoreWins, the deterministic detector wins over the semantic detector on equal score before the first-detected fallback applies.
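
The same-span rules, including the tie-breaks, can be sketched as one function. The strategy names come from the table above. The `Candidate` type, its fields, and the `order` counter used to model "first-detected" are assumptions made for illustration.

```python
from dataclasses import dataclass

STRATEGIES = ("deterministicWins", "semanticWins", "higherScoreWins")


@dataclass(frozen=True)
class Candidate:
    label: str
    score: float
    detector: str  # "deterministic" or "semantic"
    order: int     # hypothetical detection order; lower means detected earlier


def resolve_same_span(a: Candidate, b: Candidate,
                      strategy: str = "deterministicWins") -> Candidate:
    """Pick the winner between two entities covering the exact same span."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    different_detectors = a.detector != b.detector
    # Detector-preference strategies decide first, regardless of score.
    if different_detectors and strategy != "higherScoreWins":
        preferred = "deterministic" if strategy == "deterministicWins" else "semantic"
        return a if a.detector == preferred else b
    # Same detector type, or higherScoreWins: the higher score wins.
    if a.score != b.score:
        return a if a.score > b.score else b
    # Equal score under higherScoreWins: deterministic beats semantic.
    if different_detectors:
        return a if a.detector == "deterministic" else b
    # Final tie: the first-detected entity wins.
    return a if a.order <= b.order else b
```

For example, a deterministic match with score 0.6 beats a semantic match with score 0.9 under the default strategy, but loses under `higherScoreWins`.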

Full containment

When one entity fully contains another (the longer entity’s span includes all of the shorter entity’s span and they aren’t identical), the longer entity always wins. A longer match is considered more specific and complete.

Partial overlap

When two entities overlap but neither fully contains the other, the higher-scoring entity wins. On a tie, the longer entity wins; on a further tie, the first-detected entity wins.
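
The containment and partial-overlap rules can be sketched as two small functions. The `Span` type and the `order` field that models "first-detected" are hypothetical. Each function assumes the caller has already established which kind of overlap applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    score: float
    order: int  # hypothetical detection order; lower means detected earlier

    @property
    def length(self) -> int:
        return self.end - self.start


def resolve_containment(a: Span, b: Span) -> Span:
    """Full containment: the longer entity always wins, whatever the scores."""
    return a if a.length > b.length else b


def resolve_partial(a: Span, b: Span) -> Span:
    """Partial overlap: higher score, then longer span, then first-detected."""
    if a.score != b.score:
        return a if a.score > b.score else b
    if a.length != b.length:
        return a if a.length > b.length else b
    return a if a.order <= b.order else b
```

Note the asymmetry between the two rule sets: score is ignored under full containment, but it is the first criterion under partial overlap.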
