Detection
AI Smart Redact detects sensitive entities in PDF documents by combining deterministic pattern matching with semantic understanding. Each detection request runs through a pipeline that extracts text, runs detectors in parallel, resolves overlapping detections, filters by configured labels, and applies any user-defined exclusions before returning results.
Detection methods
Two complementary detection methods work together:
- Deterministic detection uses regex-based pattern recognizers and keyword matching. It targets entities with predictable structure such as credit card numbers, IBANs, email addresses, and phone numbers. Format and checksum validators reject malformed matches. Refer to Pattern recognizers and Keyword recognizers.
- Semantic detection uses a neural Named Entity Recognition model to identify entities that depend on context, such as person names, organizations, and physical addresses. Refer to Semantic recognizer.
Out of the box, AI Smart Redact uses these methods in a hybrid-complementary strategy: each entity type is handled by the method best suited to it. Deterministic recognizers handle structured formats, the semantic model handles contextual entities, and the two don’t overlap. This approach delivers the highest precision because format-specific entities benefit from validation that the semantic model can’t perform. For benchmark figures, refer to Detection accuracy.
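As an illustration of why validation matters for structured entities, the following minimal sketch pairs a regex with a checksum check for card numbers. It assumes the Luhn algorithm, the standard checksum for payment card numbers. The pattern, `luhn_valid`, and `find_card_numbers` are hypothetical names for this example and don't describe the service's actual recognizers.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return (start, end, text) for regex matches that also pass the checksum."""
    return [
        (match.start(), match.end(), match.group())
        for match in CARD_PATTERN.finditer(text)
        if luhn_valid(match.group())
    ]


# The first number passes the Luhn check; the second differs only in its
# last digit and is rejected even though it matches the pattern.
print(find_card_numbers("Card 4111 1111 1111 1111 on file"))
print(find_card_numbers("Ref 4111 1111 1111 1112"))
```

A semantic model has no equivalent of this check, which is why a digit sequence that merely looks like a card number is better handled by a deterministic recognizer.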
Detection pipeline
Each detection request passes through five stages:
- Resolve configuration. The pipeline merges three configuration layers to produce the final request configuration:
  - Built-in defaults shipped with the service.
  - A default configuration set in the Human-in-the-Loop (HITL) web application by an administrator. This applies to every detection request unless overridden.
  - An optional per-request configuration submitted in the Manager API call.

  Each layer overrides the previous one. Refer to Detection configuration.
- Detect entities. Pattern recognizers, keyword recognizers, and the semantic recognizer run in parallel against the extracted text. Each pattern recognizer applies its format and checksum validators (refer to Format and checksum validators). The pipeline then filters out any remaining invalid emissions, such as empty labels, out-of-range scores, or negative positions.
- Consolidate results. Entities scoring under the configured score threshold are removed, and overlapping detections are merged according to the rules in Overlap resolution.
- Filter by labels. Only entities whose label is in the configured labels list are kept.
- Apply exclusions. Any entity whose text matches a configured allowlist keyword is removed. Refer to Keyword exclusions.
The result is a list of entities, each with its text, label, position, and final confidence score.
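The stages around detection can be sketched as a few small functions. This is a simplified illustration under stated assumptions, not the service's implementation: the configuration keys `scoreThreshold` and `exclusions` are hypothetical (only the labels list is named above), the merge is shallow, scores are assumed to lie between 0 and 1, exclusions are matched case-insensitively against the whole entity text, and overlap resolution is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int
    end: int
    score: float


def resolve_configuration(defaults, admin_default, per_request=None):
    """Merge the three layers; each later layer overrides the previous one.

    Assumption: a shallow, key-by-key merge.
    """
    return {**defaults, **admin_default, **(per_request or {})}


def is_valid(entity):
    """Reject empty labels, out-of-range scores, and negative positions."""
    return (
        bool(entity.label)
        and 0.0 <= entity.score <= 1.0  # assumed score range
        and 0 <= entity.start <= entity.end
    )


def post_process(entities, config):
    """Apply validity, threshold, label, and exclusion filters in order."""
    valid = [e for e in entities if is_valid(e)]
    # Entities scoring under the threshold are removed.
    kept = [e for e in valid if e.score >= config["scoreThreshold"]]
    # (Overlap resolution would run here; it is left out of this sketch.)
    kept = [e for e in kept if e.label in config["labels"]]
    # Assumption: case-insensitive match on the full entity text.
    excluded = {keyword.casefold() for keyword in config["exclusions"]}
    return [e for e in kept if e.text.casefold() not in excluded]


config = resolve_configuration(
    {"scoreThreshold": 0.5, "labels": ["EMAIL", "PERSON"], "exclusions": []},
    {"scoreThreshold": 0.6},
    {"labels": ["PERSON", "ORGANIZATION"], "exclusions": ["Acme Corp"]},
)
detected = [
    Entity("Jane Doe", "PERSON", 0, 8, 0.9),
    Entity("Acme Corp", "ORGANIZATION", 12, 21, 0.8),  # removed by exclusion
    Entity("a@b.co", "EMAIL", 30, 36, 0.95),           # label not configured
    Entity("Bob", "PERSON", 40, 43, 0.4),              # under the threshold
    Entity("x", "", 50, 51, 0.9),                      # empty label
]
print(post_process(detected, config))
```

In this example only the "Jane Doe" entity survives all filters.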
Overlap resolution
When two detected entities overlap, the consolidator applies one of three rule sets depending on how they overlap. Two entities that merely touch (one ends where the next begins) don’t overlap and are both kept.
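The three cases, plus the touching case, can be told apart with a few comparisons. The sketch below assumes positions are half-open `(start, end)` offsets, which matches the statement that an entity ending where the next begins doesn't overlap it. `classify_overlap` is a hypothetical helper for illustration.

```python
def classify_overlap(a, b):
    """Classify how two (start, end) spans relate.

    Assumption: spans are half-open, so (0, 5) and (5, 9) only touch.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= b_start or b_end <= a_start:
        return "none"  # disjoint or merely touching: both are kept
    if (a_start, a_end) == (b_start, b_end):
        return "same_span"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        return "full_containment"
    return "partial_overlap"


print(classify_overlap((0, 5), (5, 9)))   # touching spans
print(classify_overlap((0, 5), (0, 5)))   # identical spans
print(classify_overlap((0, 10), (2, 5)))  # one inside the other
print(classify_overlap((0, 6), (4, 9)))   # overlapping, neither contains the other
```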
Same span
When two entities cover the exact same start and end positions, the winner is chosen according to the request’s sameSpanStrategy field:
| Strategy | Winner |
|---|---|
| deterministicWins (default) | The deterministic detector wins regardless of score. |
| semanticWins | The semantic detector wins regardless of score. |
| higherScoreWins | The entity with the higher score wins. |
In all strategies, the higher-scoring entity wins among entities of the same detector type, and the first-detected entity wins on a final tie. With higherScoreWins, the deterministic detector wins over the semantic detector on equal score before the first-detected fallback applies.
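The same-span rules, including the tie-breakers, can be expressed as sort keys. This is one possible reading of the rules, sketched for illustration: the `Candidate` type, its `order` field (detection order, lower means earlier), and `resolve_same_span` are hypothetical, and the function accepts any number of candidates on the same span rather than exactly two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str
    detector: str  # "deterministic" or "semantic"
    score: float
    order: int     # detection order; lower means detected earlier


def resolve_same_span(candidates, strategy="deterministicWins"):
    """Pick the winner among candidates that share identical start and end."""

    def best_of(group):
        # Within one detector type: higher score wins, then first-detected.
        return min(group, key=lambda c: (-c.score, c.order))

    deterministic = [c for c in candidates if c.detector == "deterministic"]
    semantic = [c for c in candidates if c.detector == "semantic"]
    if not deterministic or not semantic:
        return best_of(candidates)
    if strategy == "deterministicWins":
        return best_of(deterministic)
    if strategy == "semanticWins":
        return best_of(semantic)
    if strategy == "higherScoreWins":
        # Higher score, then deterministic over semantic, then first-detected.
        return min(
            candidates,
            key=lambda c: (-c.score, c.detector != "deterministic", c.order),
        )
    raise ValueError(f"unknown strategy: {strategy}")


phone = Candidate("PHONE_NUMBER", "deterministic", 0.6, order=1)
person = Candidate("PERSON", "semantic", 0.9, order=0)
print(resolve_same_span([phone, person]).label)
print(resolve_same_span([phone, person], "higherScoreWins").label)
```

With the default strategy the deterministic candidate wins despite its lower score; with higherScoreWins the semantic candidate wins because its score is higher.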
Full containment
When one entity fully contains another (the longer entity’s span includes all of the shorter entity’s span and they aren’t identical), the longer entity always wins. A longer match is considered more specific and complete.
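A short sketch makes the point that score plays no part in this case. Spans are represented as hypothetical `(start, end, score)` tuples, and the function assumes the caller has already established that one span fully contains the other.

```python
def resolve_containment(a, b):
    """Return the longer of two (start, end, score) spans, ignoring score.

    Assumes one span fully contains the other and they aren't identical,
    so their lengths always differ.
    """
    length_a = a[1] - a[0]
    length_b = b[1] - b[0]
    return a if length_a > length_b else b


full_name = (0, 12, 0.55)   # e.g. a span covering "Dr. Jane Doe"
first_name = (4, 8, 0.95)   # e.g. a span covering "Jane"
print(resolve_containment(full_name, first_name))
```

Here the longer span is kept even though the shorter one has a much higher score.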
Partial overlap
When two entities overlap but neither fully contains the other, the higher-scoring entity wins. On a tie, the longer entity wins; on a further tie, the first-detected entity wins.
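The three-level tie-break maps directly onto a composite sort key. The `Span` type and its `order` field (detection order, lower means earlier) are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    score: float
    order: int  # detection order; lower means detected earlier


def resolve_partial_overlap(a, b):
    """Higher score wins; on a tie the longer span; then the first-detected."""
    return min((a, b), key=lambda s: (-s.score, -(s.end - s.start), s.order))


# Different scores: the higher-scoring span wins even though it is shorter.
print(resolve_partial_overlap(Span(0, 10, 0.7, 0), Span(8, 12, 0.9, 1)))
# Equal scores: the longer span wins.
print(resolve_partial_overlap(Span(0, 10, 0.8, 1), Span(8, 12, 0.8, 0)))
# Equal scores and lengths: the first-detected span wins.
print(resolve_partial_overlap(Span(0, 6, 0.8, 1), Span(4, 10, 0.8, 0)))
```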
Reference pages
- Entity types: Learn about the built-in pattern-based and semantic entity types detected by AI Smart Redact.
- Detection configuration: Learn about the detectionConfiguration object schema that controls AI Smart Redact detection requests.
- Pattern recognizers: Learn about built-in and custom pattern recognizers, format and checksum validators, context boosting, and regex safety in AI Smart Redact.
- Keyword recognizers: Learn about keyword recognizers (denylists) in AI Smart Redact detection.
- Keyword exclusions: Learn about keyword exclusions (allowlists) that suppress detected entities in AI Smart Redact.
- Semantic recognizer: Learn about the AI Smart Redact semantic recognizer, including default entity mappings, customization, and limitations.
- Detection accuracy: F1 benchmarks and configuration comparisons for AI Smart Redact detection accuracy.