Entity types
An entity type is a label assigned to a detected piece of sensitive information, such as EMAIL_ADDRESS or PERSON. AI Smart Redact ships with 36 built-in entity types: 32 pattern-based labels detected by regex pattern recognizers, and 4 semantic labels detected by the semantic model. You can also define your own entity types per detection request.
Pattern-based labels
The following 32 labels have built-in pattern recognizers. Each label may match more than one format. For example, DATE covers numeric and verbal date formats across more than one language.
| Label | What it matches |
|---|---|
| ALPHANUMERIC_CODE | Mixed letter-digit codes |
| BARCODE | Product barcodes with checksum validation |
| BIC_SWIFT | Bank Identifier Codes |
| CREDIT_CARD | Credit card numbers with Luhn validation |
| CURRENCY_CODE | ISO 4217 three-letter currency codes |
| DATE | Calendar dates in common international formats |
| DATETIME | Combined date-time strings |
| DECIMAL_NUMBER | Decimal numbers |
| DOMAIN_NAME | Internet domain names |
| DURATION | Time durations |
| EMAIL_ADDRESS | Email addresses |
| FILE_PATH | Windows and Unix file paths |
| GPS_COORDINATE | Geographic coordinates |
| HASHTAG | Social media hashtags |
| HTTP_COOKIE | HTTP cookie strings |
| IBAN | International Bank Account Numbers |
| INTEGER_NUMBER | Integers |
| IP_ADDRESS | IPv4 and IPv6 addresses |
| ISIN | International Securities Identification Numbers |
| LEI | Legal Entity Identifiers |
| MAC_ADDRESS | Hardware MAC addresses |
| MENTION | Social media @mentions |
| MONEY | Monetary amounts |
| NUMERIC_ID | Digit-only identifiers |
| PERCENTAGE | Percentage values |
| PHONE_NUMBER | Phone numbers in common international formats |
| SCIENTIFIC_NUMBER | Scientific notation numbers |
| TIME | Times in common formats |
| UNIQUE_IDENTIFIER | UUID and GUID strings |
| URL | HTTP, HTTPS, and FTP URLs |
| VAT_NUMBER | VAT numbers |
| VIN | Vehicle Identification Numbers |
For details on how pattern recognizers work and how to add custom ones, refer to Pattern recognizers.
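Several labels in the table pair a regex with a validator; CREDIT_CARD, for instance, is listed with Luhn validation. The sketch below shows the standard Luhn checksum as a rough illustration of why such a validator cuts false positives. It is not AI Smart Redact's implementation, and the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Illustrative only: `luhn_valid` is a hypothetical helper, not part of
    the product's API. Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A candidate string that matches the card-number regex but fails this check can be discarded, which is one reason validated labels can be given higher confidence than unvalidated ones.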
Semantic labels
Out of the box, the semantic model detects four entity types:
| Label | What it matches |
|---|---|
| PERSON | Person names |
| ORGANISATION | Organization and company names |
| PHYSICAL_ADDRESS | Street addresses and locations |
| USERNAME | Usernames and handles |
The built-in default mapping is a starting point, not a fixed set. You can extend or replace it to detect any entity type the semantic model can recognize. Refer to Semantic recognizer.
Confidence scores
Every detected entity has a confidence score between 0 and 1. The score reflects how unambiguous the match is. Entities with structurally unique formats and supporting validators score higher; broader patterns that can match many non-PII strings score lower.
| Range | Tier | Description |
|---|---|---|
| 0.90–1.00 | Highest | Structurally unique, almost no false positives |
| 0.80–0.89 | High | Distinctive structure, often with checksum validation |
| 0.70–0.79 | Moderate | Recognizable format with some ambiguity |
| 0.60–0.69 | Lower | Formats that overlap with common text |
| 0.50–0.59 | Low | Broad patterns with higher ambiguity |
| Under 0.50 | Sub-threshold | Filtered out at the built-in default threshold. Refer to Sub-threshold patterns. |
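
When post-processing results, it can help to translate a raw score into the tier names from the table above. The helper below is a minimal sketch of that lookup; `confidence_tier` is a hypothetical name, and the boundaries are copied from the table rather than read from the product.

```python
def confidence_tier(score: float) -> str:
    """Map a confidence score in [0, 1] to the tier names in the table.

    Hypothetical helper for illustration; boundaries mirror the documented
    ranges (for example, 0.80 up to but not including 0.90 is "High").
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.90:
        return "Highest"
    if score >= 0.80:
        return "High"
    if score >= 0.70:
        return "Moderate"
    if score >= 0.60:
        return "Lower"
    if score >= 0.50:
        return "Low"
    return "Sub-threshold"
```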
The built-in default scoreThreshold is 0.5. Entities scoring under the threshold are removed from the results. Raise the threshold to favor precision; lower it to favor recall.
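The effect of the threshold can be pictured as a simple filter over detected entities. The following sketch assumes each entity is a dictionary with `label`, `text`, and `score` keys; that shape and the function name `apply_threshold` are assumptions for illustration, not the documented response format.

```python
from typing import TypedDict


class Entity(TypedDict):
    # Assumed shape for illustration; the real response format may differ.
    label: str
    text: str
    score: float


def apply_threshold(entities: list[Entity], score_threshold: float = 0.5) -> list[Entity]:
    """Keep entities whose score is at or above the threshold."""
    return [entity for entity in entities if entity["score"] >= score_threshold]


detected: list[Entity] = [
    {"label": "EMAIL_ADDRESS", "text": "ana@example.com", "score": 0.95},
    {"label": "NUMERIC_ID", "text": "48213", "score": 0.55},
    {"label": "TIME", "text": "1430", "score": 0.40},
]

default_results = apply_threshold(detected)        # drops only the 0.40 match
strict_results = apply_threshold(detected, 0.9)    # keeps only the email address
```

Raising the threshold to 0.9 trades the NUMERIC_ID match for fewer false positives; lowering it below 0.5 would let the ambiguous TIME match through.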
Sub-threshold patterns
Some patterns are deliberately configured to score under the built-in default threshold to prevent false positives from highly ambiguous matches. For example, a bare four-digit number could be military time, a year, a PIN, or an arbitrary identifier; the military-time pattern scores under 0.5 and surfaces only when context boosting raises the score over the threshold.
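To make the idea of context boosting concrete, here is a toy sketch. Everything specific in it is an assumption chosen for illustration: the regex, the base score of 0.4, the boost of 0.3, the context word list, and the 30-character window are not the product's actual values or algorithm.

```python
import re

# All constants below are hypothetical, chosen only to illustrate the idea.
MILITARY_TIME = re.compile(r"\b(?:[01]\d|2[0-3])[0-5]\d\b")
CONTEXT_WORDS = {"hours", "hrs", "time", "departure", "arrival"}
BASE_SCORE = 0.4   # below the default 0.5 threshold
BOOST = 0.3        # added when a context word appears nearby
WINDOW = 30        # characters inspected on each side of the match


def score_military_time(text: str) -> list[tuple[str, float]]:
    """Return (match, score) pairs for four-digit military-time candidates."""
    results = []
    for match in MILITARY_TIME.finditer(text):
        start = max(0, match.start() - WINDOW)
        nearby = text[start:match.end() + WINDOW].lower()
        nearby_words = set(re.findall(r"[a-z]+", nearby))
        boosted = bool(nearby_words & CONTEXT_WORDS)
        score = BASE_SCORE + (BOOST if boosted else 0.0)
        results.append((match.group(), round(min(score, 1.0), 2)))
    return results
```

Under these assumed values, "1430" in "Departure at 1430 hours" is boosted past 0.5 and would be reported, while "2023" in "Invoice 2023 attached" stays at the base score and would be filtered out.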
Custom entity types
You can extend the set of detected entity types in three ways:
- Add custom regex patterns. Refer to Pattern recognizers.
- Add a denylist of known sensitive terms. Refer to Keyword recognizers.
- Map more semantic-model outputs to new labels. Refer to Semantic recognizer.
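As a conceptual illustration of the denylist option, the sketch below matches a list of known terms case-insensitively on word boundaries and tags each hit with a custom label. The function name, the output shape, and the CODENAME label are hypothetical; refer to Keyword recognizers for the actual configuration.

```python
import re


def find_denylist_terms(text: str, terms: list[str], label: str) -> list[dict]:
    """Find whole-word, case-insensitive occurrences of denylisted terms.

    Hypothetical helper: it mimics what a keyword recognizer does in spirit,
    not how the product implements or configures one.
    """
    if not terms:
        return []
    # Try longer terms first so "Falcon-X" wins over "Falcon".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [
        {"label": label, "text": m.group(), "start": m.start(), "end": m.end()}
        for m in pattern.finditer(text)
    ]


hits = find_denylist_terms(
    "Project Falcon ships with falcon-x.", ["Falcon", "Falcon-X"], "CODENAME"
)
```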