Detection accuracy
Detection accuracy is measured against the nvidia/Nemotron-PII dataset, a synthetic dataset of 50,000 documents annotated across 51 PII label types. Across all 51 label types, AI Smart Redact achieves an adjusted F1 score of 0.96. Detection logic is used as shipped; only the semantic recognizer’s `entityMapping` is extended so its outputs align with the dataset’s label set.
Out-of-the-box accuracy
The evaluation targets every label in the dataset (51 types) using AI Smart Redact’s built-in detection configuration. The only change from the shipped defaults is the semantic recognizer’s `entityMapping`, which is extended to cover all 51 dataset label types. No custom pattern recognizers, keyword recognizers, or exclusions were added.
Evaluation on the full 50,000-sample international test split:
| Metric | Raw | Adjusted |
|---|---|---|
| True positives | 399,967 | 399,967 |
| False negatives | 21,979 | 13,016 |
| False positives | 40,143 | 18,161 |
| Recall | 0.9479 | 0.9685 |
| Precision | 0.9088 | 0.9566 |
| F1 | 0.9279 | 0.9625 |
Adjusted scores exclude errors that per-case analysis attributed to the dataset’s annotations rather than to the detector. Refer to Adjusted scores.
Metric definitions
Three standard metrics describe detection quality. They appear throughout this page; the headline F1 score links here.
- Recall: of all actual sensitive entities in the data, how many did the detector find? A recall of 0.95 means the detector found 95% of the entities that should have been detected.
- Precision: of all entities the detector flagged, how many were genuinely sensitive? A precision of 0.95 means 95% of the detector’s flagged entities were genuine sensitive entities.
- F1: the harmonic mean of recall and precision, on a scale from 0 to 1. A high F1 means the detector finds most actual entities (high recall) and most of its detections are correct (high precision).
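The definitions above can be written out directly. The sketch below computes them from the raw and adjusted counts in the evaluation table, reproducing the reported scores:

```python
def precision(tp: int, fp: int) -> float:
    """Of all entities the detector flagged, the fraction that were genuine."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual sensitive entities, the fraction the detector found."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Counts from the evaluation table: true positives are shared;
# false negatives and false positives differ between raw and adjusted.
tp = 399_967
raw_p, raw_r = precision(tp, 40_143), recall(tp, 21_979)
adj_p, adj_r = precision(tp, 18_161), recall(tp, 13_016)

print(round(raw_r, 4), round(raw_p, 4), round(f1(raw_p, raw_r), 4))
# → 0.9479 0.9088 0.9279
print(round(adj_r, 4), round(adj_p, 4), round(f1(adj_p, adj_r), 4))
# → 0.9685 0.9566 0.9625
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, so a detector cannot hide poor recall behind high precision or vice versa.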
What this configuration covers
The evaluation deliberately stops short of fitting AI Smart Redact to the dataset:
- The built-in pattern recognizers are used as shipped, with their format and checksum validators. Recognizers for entity types the dataset doesn’t include (such as `IBAN`, `MONEY`, and `DOMAIN_NAME`) were disabled for the evaluation; their detections would otherwise be counted as false positives against a dataset that has no annotations for those types.
- The built-in semantic recognizer is used, with `entityMapping` extended from the default 4 mappings to cover the dataset labels that aren’t handled by deterministic pattern recognizers.
- No custom pattern recognizers were added for dataset-specific labels such as SSN, license plate, national ID, postcode, account number, or employee ID. These types have predictable formats and are best detected by purpose-built recognizers, but adding them would be fitting the configuration to this specific dataset.
- No keyword denylists or exclusions were added.
Some of the 51 dataset labels are deterministic by nature (region-specific identifiers with their own validation rules, such as SSN, LICENSE_PLATE, and NATIONAL_ID), but in this evaluation they’re detected only through the semantic model. Adding custom pattern recognizers for those identifier formats typically raises F1 toward the levels seen for built-in deterministic recognizers in this evaluation, where CREDIT_CARD reached 0.97, EMAIL_ADDRESS 0.99, MAC_ADDRESS 1.00, and URL 0.99. Actual gains depend on the pattern. To close this gap, use the configuration options described in Improving accuracy in your environment.
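As an illustration of what a purpose-built recognizer for a deterministic format could look like, the sketch below matches US SSN-style strings with a regular expression. This is a hypothetical standalone example, not AI Smart Redact’s actual recognizer API:

```python
import re

# Hypothetical SSN pattern for illustration: three groups separated by
# hyphens, rejecting known-invalid ranges (area 000, 666, or 900-999;
# group 00; serial 0000).
SSN_PATTERN = re.compile(r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

def find_ssns(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, match) spans for each SSN-shaped string in text."""
    return [(m.start(), m.end(), m.group()) for m in SSN_PATTERN.finditer(text)]

print(find_ssns("Employee SSN: 123-45-6789, invalid: 000-12-3456"))
# → [(14, 25, '123-45-6789')]
```

A recognizer like this trades the semantic model’s generality for near-perfect precision on one identifier format, which is why pattern recognizers tend to score highest in the per-label table below.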
Built-in label scores
Per-label raw scores from the same evaluation, for the 17 entity types that AI Smart Redact detects out of the box and that the Nemotron-PII dataset annotates:
| Entity type | Detection | Recall | Precision | F1 |
|---|---|---|---|---|
| MAC_ADDRESS | Deterministic | 0.9991 | 0.9976 | 0.9983 |
| EMAIL_ADDRESS | Deterministic | 0.9893 | 0.9965 | 0.9929 |
| PERSON | Semantic | 0.9945 | 0.9880 | 0.9912 |
| GPS_COORDINATE | Deterministic | 0.9898 | 0.9914 | 0.9906 |
| URL | Deterministic | 0.9894 | 0.9812 | 0.9853 |
| DATETIME | Deterministic | 0.9625 | 0.9793 | 0.9708 |
| PHYSICAL_ADDRESS | Semantic | 0.9728 | 0.9687 | 0.9707 |
| IP_ADDRESS | Deterministic | 0.9671 | 0.9729 | 0.9700 |
| CREDIT_CARD | Deterministic | 0.9443 | 0.9968 | 0.9698 |
| DATE | Deterministic | 0.9423 | 0.9902 | 0.9657 |
| USERNAME | Semantic | 0.9390 | 0.9830 | 0.9605 |
| VIN | Deterministic | 0.9182 | 0.9889 | 0.9523 |
| PHONE_NUMBER | Deterministic | 0.9657 | 0.9326 | 0.9489 |
| ORGANISATION | Semantic | 0.9581 | 0.9168 | 0.9370 |
| HTTP_COOKIE | Deterministic | 0.7136 | 0.9688 | 0.8218 |
| BIC_SWIFT | Deterministic | 0.6701 | 0.9917 | 0.7997 |
| TIME | Deterministic | 0.6777 | 0.8402 | 0.7503 |
The lower raw F1 scores in the table reflect dataset annotation issues rather than detection mistakes. For example, BIC_SWIFT recall rises to near-perfect after the cases described in Adjusted scores are removed.
Adjusted scores
Each detection error was reviewed individually to determine its cause. A meaningful share of false negatives and false positives traced back to issues in the dataset’s annotations. The Nemotron-PII dataset is synthetic, which explains why some annotated values don’t conform to real-world format conventions. Most are clear annotation errors. Examples include:
- Duration expressions like “30 minutes” annotated as `TIME` rather than as a duration.
- BIC/SWIFT codes whose country position doesn’t appear in ISO 3166.
- Bare years annotated as full dates without a month or day.
- Credit card numbers that don’t satisfy the Luhn checksum.
The remainder are edge cases where the detector and the dataset apply different conventions. For example, the dataset annotates a datetime as separate date and time spans while the detector returns a single datetime entity.
Adjusted scores exclude both kinds. The combined impact is significant: about 41% of false negatives and 55% of false positives in the raw count come from these annotation issues rather than from detection mistakes.
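The Luhn checksum mentioned above is the standard validity check for credit card numbers, and is the kind of validation the built-in deterministic recognizers apply. A minimal sketch:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: from the right, double every second digit,
    subtract 9 from any result above 9, and require the total
    to be divisible by 10. Non-digit characters are ignored."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4532 0151 1283 0366"))  # → True: passes the checksum
print(luhn_valid("4532 0151 1283 0367"))  # → False: a checksum-validating
                                          #   recognizer skips this value
```

A synthetic annotation that fails this check cannot be a real card number, which is why such cases are excluded from the adjusted scores rather than counted as detector misses.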
Improving accuracy in your environment
The figures shown measure the built-in defaults without any dataset-specific tuning. Detection is fully customizable per request, and customer-tuned configurations consistently produce higher precision and F1 on production documents. To raise accuracy on your documents:
- Add custom pattern recognizers for region-specific identifiers (SSN, license plates, national IDs) or domain-specific codes. This is the single biggest lever for higher precision on identifier-style entity types.
- Use keyword exclusions to suppress recurring false positives, such as your own company name being detected as `ORGANISATION`.
- Adjust `scoreThreshold` to favor precision (raise it) or recall (lower it).
- Switch `checksumValidationMode` to `relaxed` if missed entities are more costly than occasional false positives in checksum-validated formats.
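Putting the tuning options together, a per-request detection configuration might look like the sketch below. The field names `scoreThreshold` and `checksumValidationMode` come from this page; the surrounding structure and the `exclusions` shape are assumptions for illustration, not the product’s exact schema:

```python
# Hypothetical per-request detection configuration (illustrative only).
detection_config = {
    # Raise toward 1.0 to favor precision; lower it to favor recall.
    "scoreThreshold": 0.7,
    # "relaxed" keeps format matches even when the checksum fails,
    # trading occasional false positives for fewer missed entities.
    "checksumValidationMode": "relaxed",
    # Assumed shape: suppress recurring false positives such as your
    # own company name being flagged as ORGANISATION.
    "exclusions": [
        {"keyword": "Example Corp", "entityType": "ORGANISATION"},
    ],
}

print(detection_config["scoreThreshold"])
```

The key design point is that these settings are applied per request, so different document classes (say, invoices versus support tickets) can run with different precision/recall trade-offs against the same deployment.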
Industry context
No universally agreed benchmark exists for evaluating PII detection systems, which makes direct comparisons across products difficult. The closest established field is Named Entity Recognition (NER), where state-of-the-art models routinely report F1 of 0.93–0.94. Those benchmarks are narrower than PII detection: they typically cover 3 or 4 broad entity types (person, location, organization) on clean newswire-style text.
PII detection in real-world applications is harder. It covers dozens of heterogeneous entity types (emails, phone numbers, credit card details, national IDs, addresses, and other nuanced personal identifiers), often in messy, domain-specific, or conversational text. It also requires balancing two competing risks: missed sensitive data, which creates privacy and compliance exposure, and over-redaction, which reduces data utility.
Reported numbers in the field reflect this difficulty. General-purpose open source PII tools (rule-based and NER hybrids) commonly report F1 in the 0.81–0.85 range on realistic evaluations. Commercial or heavily domain-tuned systems frequently claim 0.91–0.98, though those numbers are often obtained on narrower entity sets, synthetic data, or proprietary benchmarks rather than broad standardized tests.
It’s worth restating that the Nemotron-PII dataset is synthetic. Real-world PDFs introduce additional challenges that synthetic text doesn’t capture: scanned pages where text is extracted through OCR and can include character recognition errors, multi-column or table-heavy layouts that fragment the surrounding context that detection relies on, mixed-language documents, and form-style PDFs where entities appear in field labels and values rather than in flowing prose. The accuracy figures earlier on this page should be read with these factors in mind.