
Entity types

An entity type is a label assigned to a detected piece of sensitive information, such as EMAIL_ADDRESS or PERSON. AI Smart Redact ships with 36 built-in entity types: 32 pattern-based labels detected by regex pattern recognizers, and 4 semantic labels detected by the semantic model. You can also define your own entity types per detection request.

Pattern-based labels

The following 32 labels have built-in pattern recognizers. Each label may match more than one format. For example, DATE covers numeric and verbal date formats across more than one language.

| Label | What it matches |
| --- | --- |
| ALPHANUMERIC_CODE | Mixed letter-digit codes |
| BARCODE | Product barcodes with checksum validation |
| BIC_SWIFT | Bank Identifier Codes |
| CREDIT_CARD | Credit card numbers with Luhn validation |
| CURRENCY_CODE | ISO 4217 three-letter currency codes |
| DATE | Calendar dates in common international formats |
| DATETIME | Combined date-time strings |
| DECIMAL_NUMBER | Decimal numbers |
| DOMAIN_NAME | Internet domain names |
| DURATION | Time durations |
| EMAIL_ADDRESS | Email addresses |
| FILE_PATH | Windows and Unix file paths |
| GPS_COORDINATE | Geographic coordinates |
| HASHTAG | Social media hashtags |
| HTTP_COOKIE | HTTP cookie strings |
| IBAN | International Bank Account Numbers |
| INTEGER_NUMBER | Integers |
| IP_ADDRESS | IPv4 and IPv6 addresses |
| ISIN | International Securities Identification Numbers |
| LEI | Legal Entity Identifiers |
| MAC_ADDRESS | Hardware MAC addresses |
| MENTION | Social media @mentions |
| MONEY | Monetary amounts |
| NUMERIC_ID | Digit-only identifiers |
| PERCENTAGE | Percentage values |
| PHONE_NUMBER | Phone numbers in common international formats |
| SCIENTIFIC_NUMBER | Scientific notation numbers |
| TIME | Times in common formats |
| UNIQUE_IDENTIFIER | UUID and GUID strings |
| URL | HTTP, HTTPS, and FTP URLs |
| VAT_NUMBER | VAT numbers |
| VIN | Vehicle Identification Numbers |

For details on how pattern recognizers work and how to add custom ones, refer to Pattern recognizers.
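The CREDIT_CARD label pairs a pattern match with Luhn validation. As an illustration of what such a validator checks, here is a minimal sketch of the standard Luhn checksum. The function name `luhn_valid` is hypothetical and this is not the product's own implementation, only the general algorithm.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Hypothetical helper for illustration; spaces and dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit

    # A valid number has a checksum divisible by 10.
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # well-known test number
    print(luhn_valid("4111 1111 1111 1112"))  # last digit altered
```

A checksum like this is one reason a validated match can be scored with more confidence than a bare run of digits: most random digit strings fail the check.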

Semantic labels

Out of the box, the semantic model detects four entity types:

| Label | What it matches |
| --- | --- |
| PERSON | Person names |
| ORGANISATION | Organization and company names |
| PHYSICAL_ADDRESS | Street addresses and locations |
| USERNAME | Usernames and handles |

These four labels form the built-in default mapping, which is a starting point, not a fixed set. You can extend or replace it to detect any entity type the semantic model can recognize. For details, refer to Semantic recognizer.

Confidence scores

Every detected entity has a confidence score between 0 and 1. The score reflects how unambiguous the match is. Entities with structurally unique formats and supporting validators score higher; broader patterns that can match many non-PII strings score lower.

| Range | Tier | Description |
| --- | --- | --- |
| 0.90–1.00 | Highest | Structurally unique, almost no false positives |
| 0.80–0.89 | High | Distinctive structure, often with checksum validation |
| 0.70–0.79 | Moderate | Recognizable format with some ambiguity |
| 0.60–0.69 | Lower | Formats that overlap with common text |
| 0.50–0.59 | Low | Broad patterns with higher ambiguity |
| Under 0.50 | Sub-threshold | Filtered out at the built-in default threshold. Refer to Sub-threshold patterns. |
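If you want to bucket scores by tier in your own code, for example when reviewing detection results, the ranges in the table translate directly into a lookup. This is a minimal sketch; `confidence_tier` is a hypothetical helper, not part of the product API.

```python
def confidence_tier(score: float) -> str:
    """Map a confidence score in [0, 1] to the tier names used in the table.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.90:
        return "Highest"
    if score >= 0.80:
        return "High"
    if score >= 0.70:
        return "Moderate"
    if score >= 0.60:
        return "Lower"
    if score >= 0.50:
        return "Low"
    return "Sub-threshold"


if __name__ == "__main__":
    for value in (0.95, 0.85, 0.72, 0.5, 0.3):
        print(value, confidence_tier(value))
```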

The built-in default scoreThreshold is 0.5. Entities scoring under the threshold are removed from the results. Raise the threshold to favor precision; lower it to favor recall.
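The effect of the threshold can be sketched as a simple filter over detected entities. The entity shape (`label`, `text`, `score`) and the function `apply_threshold` are assumptions made for this illustration; only the `scoreThreshold` name and its default of 0.5 come from the text above.

```python
from typing import List, TypedDict


class Entity(TypedDict):
    # Hypothetical result shape, assumed for this example.
    label: str
    text: str
    score: float


def apply_threshold(entities: List[Entity], score_threshold: float = 0.5) -> List[Entity]:
    """Keep entities that score at or above the threshold; drop those under it."""
    return [entity for entity in entities if entity["score"] >= score_threshold]


if __name__ == "__main__":
    detected: List[Entity] = [
        {"label": "EMAIL_ADDRESS", "text": "jane@example.com", "score": 0.95},
        {"label": "NUMERIC_ID", "text": "20417", "score": 0.55},
        {"label": "TIME", "text": "1430", "score": 0.4},
    ]
    # Default threshold: the 0.4 match is removed.
    print([e["label"] for e in apply_threshold(detected)])
    # Higher threshold favors precision: only the strongest match remains.
    print([e["label"] for e in apply_threshold(detected, score_threshold=0.8)])
```

Raising the threshold shrinks the result set to the most distinctive matches, while lowering it lets more ambiguous matches through.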

Sub-threshold patterns

Some patterns are deliberately configured to score under the built-in default threshold to prevent false positives from highly ambiguous matches. For example, a bare four-digit number could be military time, a year, a PIN, or an arbitrary identifier; the military-time pattern scores under 0.5 and surfaces only when context boosting raises the score over the threshold.
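To make the idea concrete, the sketch below shows one way context boosting could work: a sub-threshold base score is raised when a relevant word appears near the match. The boost size, window width, context words, and the function `boosted_score` are all assumptions for illustration; the product's actual boosting logic may differ.

```python
import re
from typing import Set


def boosted_score(
    base_score: float,
    text: str,
    start: int,
    end: int,
    context_words: Set[str],
    boost: float = 0.2,   # assumed boost size
    window: int = 30,     # assumed number of characters inspected on each side
) -> float:
    """Return the base score, raised by `boost` if a context word is nearby."""
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    nearby = set(re.findall(r"[a-z]+", (before + " " + after).lower()))
    if nearby & context_words:
        return min(1.0, base_score + boost)
    return base_score


if __name__ == "__main__":
    threshold = 0.5
    hints = {"hours", "departure", "arrival"}  # hypothetical context words

    with_context = "Departure at 1430 hours from base."
    start = with_context.index("1430")
    print(boosted_score(0.4, with_context, start, start + 4, hints) >= threshold)

    without_context = "Order 1430 shipped."
    start = without_context.index("1430")
    print(boosted_score(0.4, without_context, start, start + 4, hints) >= threshold)
```

In this sketch the same four-digit string surfaces as a time only when the surrounding text supports that reading, and stays filtered out otherwise.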

Custom entity types

You can extend the set of detected entity types in three ways: