Redaction

AI Smart Redact redacts sensitive entities in PDF documents by removing the marked content and rebuilding the document. Every entity in a redaction request is removed, not just visually masked. Each request takes a source PDF and a list of redaction areas: rectangles on a page that mark where each entity to redact appears.
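As an illustration of the request shape described above, here is a minimal sketch of a redaction area and a request body. The field names (`page_index`, `x`, `y`, `width`, `height`, `document`, `areas`), the zero-based page numbering, and the use of PDF points are all assumptions made for this example; they are not taken from the AI Smart Redact API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RedactionArea:
    """One rectangle on one page. All field names are hypothetical."""

    page_index: int  # zero-based page number (assumed convention)
    x: float         # left edge, in PDF points (assumed unit)
    y: float         # bottom edge, in PDF points (assumed unit)
    width: float
    height: float

    def __post_init__(self):
        # Reject areas that cannot cover any content.
        if self.page_index < 0:
            raise ValueError("page_index must be non-negative")
        if self.width <= 0 or self.height <= 0:
            raise ValueError("a redaction area needs positive width and height")


def build_request(source_pdf: str, areas: list[RedactionArea]) -> str:
    """Serialize a hypothetical request body: a source PDF plus its areas."""
    return json.dumps({"document": source_pdf, "areas": [asdict(a) for a in areas]})
```

Each entity to redact contributes one area per place it appears, so an entity that spans two lines of text would typically need two rectangles.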

True redaction by reconstruction

The output PDF is built from scratch. Only content that the redaction engine fully understands and that isn’t marked for redaction carries over to the new document. Anything else is dropped, including hidden data that wasn’t part of the visible page: document metadata, bookmarks, annotations, and embedded files.

Page count, page sizes, and the position of all unredacted page content are preserved. On the pages themselves, only the content inside the redaction areas is removed.

AI Smart Redact doesn’t just draw a black rectangle over redacted content. Text, images, and vector graphics inside each redaction area are structurally removed from the page content streams. Any OCR text layer inside the area is removed in the same way as the visible content on top of it. As a result, redacted text can’t be copied from the content stream, and no parallel hidden layer preserves the original content.

Performance

Redaction is fast and runs independently of the detection step, with up to four concurrent jobs per Worker by default. Redaction time scales with document size rather than the number of entities to redact.
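When a caller submits many documents, it can make sense to cap the number of in-flight redactions at the same value as the Worker default. The sketch below shows one way to do that with an `asyncio.Semaphore`. The `redact_one` coroutine is a hypothetical stand-in for whatever call submits a single redaction job; only the limit of four comes from the text above.

```python
import asyncio

MAX_CONCURRENT_JOBS = 4  # mirrors the documented default per Worker


async def redact_all(documents, redact_one, limit=MAX_CONCURRENT_JOBS):
    """Run redact_one(doc) for every document, never more than `limit` at once.

    `redact_one` is a hypothetical coroutine function supplied by the caller.
    Results are returned in the same order as `documents`.
    """
    gate = asyncio.Semaphore(limit)

    async def guarded(doc):
        # Wait for a free slot before starting the job.
        async with gate:
            return await redact_one(doc)

    return await asyncio.gather(*(guarded(doc) for doc in documents))
```

Because redaction time scales with document size, a batch of large documents keeps all slots busy for longer than a batch of small ones, regardless of how many entities each document contains.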

Unhandled features

The redaction engine doesn’t yet support every PDF feature. Most page structures are unconditionally rebuilt or removed during redaction, but some features have to stay in the document so the redaction engine doesn’t break the document’s integrity. The following PDF features fall into that group:

  • Marked Content (accessibility tagging and structural marks inside page content streams) isn’t removed. Properties such as ActualText, which holds replacement text for the content it wraps, can leak sensitive information.
  • Optional Content Groups (OCG, also known as “Layers”) aren’t considered separately. They are redacted as if they were normal content.
  • ICC profiles in OutputIntent entries and Font or FontDescriptor metadata aren’t stripped. These are theoretical leak surfaces: the data inside is custom, but it’s usually not sensitive.

OCR text layers and other on-page text are removed only where a redaction area covers them, the same as visible text.
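Because marked-content properties such as ActualText can survive redaction, a check on the output document can help catch leaks. The following is a minimal sketch that scans an already decoded (uncompressed) page content stream for ActualText values. It is not part of AI Smart Redact. The function names are made up for this example, and the pattern is deliberately simple: it handles plain literal strings and hex strings, but not literal strings with escaped or nested parentheses.

```python
import re

# /ActualText followed by a literal string "(...)" or a hex string "<...>".
# Simplification: escaped or nested parentheses in literal strings are not handled.
ACTUAL_TEXT = re.compile(rb"/ActualText\s*(?:\(([^)]*)\)|<([0-9A-Fa-f\s]*)>)")


def _decode(raw: bytes) -> str:
    # A leading FE FF byte-order mark signals UTF-16BE text in PDF strings.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    # Otherwise approximate PDFDocEncoding with Latin-1 (an assumption).
    return raw.decode("latin-1")


def find_actual_text(decoded_stream: bytes) -> list[str]:
    """Return every ActualText value found in a decoded content stream."""
    results = []
    for match in ACTUAL_TEXT.finditer(decoded_stream):
        if match.group(1) is not None:
            raw = match.group(1)
        else:
            digits = b"".join(match.group(2).split())
            if len(digits) % 2:
                digits += b"0"  # PDF pads an odd-length hex string with 0
            raw = bytes.fromhex(digits.decode("ascii"))
        results.append(_decode(raw))
    return results
```

Comparing the returned values against the list of redacted entities shows whether any of them is still present in the tagging. Obtaining the decoded streams requires a PDF library, which this sketch leaves out.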

Invisible content

The redaction engine doesn’t perform additional sanitization passes for content that the reviewer can’t see. Out-of-bounds content falls into this category:

  • Text, images, and vector paths positioned outside the page bounds. A redaction area placed there still removes the content, but out-of-bounds content never appears in the viewer, so the reviewer has no opportunity to select it for redaction during review.
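Out-of-bounds content can be flagged before review with a simple geometric test. The sketch below assumes the bounding boxes of page items have already been extracted by some other tool, and that the visible page is described by a single rectangle (for example the crop box). The `Box` type and both functions are hypothetical helpers, not part of the product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (x0, y0) to (x1, y1)."""

    x0: float
    y0: float
    x1: float
    y1: float


def intersects(a: Box, b: Box) -> bool:
    """True when the two rectangles share some interior area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def out_of_bounds(page: Box, items: dict[str, Box]) -> list[str]:
    """Return the labels of items that lie entirely outside the page box."""
    return [label for label, box in items.items() if not intersects(box, page)]
```

Since a redaction area placed outside the page bounds still removes the content, each flagged box could be submitted as an additional redaction area. Items that only partly overlap the page are not flagged here, because part of them is visible to the reviewer.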

Reporting new findings

If you encounter a PDF where AI Smart Redact leaves content that the redaction engine should have removed, open a support request with a sample. Reports from real-world PDF files guide further development of the product, and the redaction engine’s coverage can expand as you report edge cases.