Why PDFs Break AI KYC Pipelines | Document Normalization

Let’s say your team is evaluating Anthropic’s KYC Screener to automate onboarding document parsing and compliance rule execution. A scanned document with text in two columns enters the pipeline. OCR reads straight across the columns and produces garbled output: dates are pulled out of context, and names are fragmented and mixed with address data. The pipeline doesn’t flag the errors, so the document clears ingestion and the issues go unnoticed. They surface only when a compliance officer reviews the extracted record and realizes that the values don’t match the source file.

When it comes to KYC onboarding and document review, the agent layer can solve many automation problems, but the entire workflow rests on the document layer. Without upstream document normalization to fix the issues PDFs can introduce to a pipeline, the entire workflow will inherit any structural instability your inbound documents have.

Why (and how) PDFs break automated pipelines

PDFs are hard for machines to read and extract data from because of how and why they were created. The format was introduced in the early 1990s with the goal of universal appearance: a document that would look exactly the same on every device it was viewed on. Essentially, PDF was designed as a human-first file format, not a machine-first one, and most of the quirks of working with PDFs come from that initial decision.

In general, PDF issues with automated workflows tend to come from three sources:

Lack of normalization
The way text is formatted and presented
OCR

Lack of normalization

In addition to being human-first documents, PDFs are also highly variable in both their internal and external structure. When it comes to KYC documents, a single customer dossier might contain manually scanned identity documents, computer-generated bank statements, and table-heavy transaction records. Each of these documents arrives with a different internal structure, but the pipeline has to handle all of them.

A large contributor to differences in internal structure is whether the PDF was machine-created or scanned in. For example, the same onboarding form can arrive via two channels: one was filled out digitally, and one was printed, completed in handwriting, and scanned back in. To a human reviewer, they’ll look almost identical. To the document pipeline, however, they’re structurally different documents, and the scanned version has lost all of its original metadata and internal structure.

Even digital-first files vary in internal structure depending on how they were created, how much effort went into giving them a coherent structure and metadata, whether they have other files attached or embedded, and so on. This is precisely why document normalization is so important: inconsistent internal structure across a customer dossier means the pipeline encounters a different kind of document every time, and that variability drives exception queues.

When these failures reach the compliance record undetected, a regulator reviewing KYC records against source documents will find the discrepancy. Here's exactly what can happen when documents are processed without being normalized:

Random line breaks are added: This can scramble field values and split up words. For example, a DOB gets split across two lines and becomes two unrecognizable strings, and the compliance rules engine finds no match against the screening database.
Line breaks are deleted: When line breaks disappear, data points are merged and text becomes jumbled or unreadable. For example, a customer’s name and address suddenly become part of the same line, and the entity extraction model reads a single unstructured string where there should be two discrete values.
Characters are dropped: Characters are often dropped or converted into the wrong Unicode characters. If one digit of an identification code is dropped, document verification fails silently: no error is thrown, the record just doesn’t match.
Image-based content disappears: When extracted text skips over image-based content, a signature block or identity photograph embedded in a scanned form becomes invisible to the pipeline. The document then clears ingestion with fields the pipeline never extracted.
Jumbled text is introduced: Depending on how the original file was created, there might be hidden junk text in the document, invisible until data extraction pulls it out of its hidden layer. Later, a compliance officer flags a record and finds extracted text that doesn’t match the source document, and there’s no clean audit trail explaining where the corruption entered the pipeline.

Pipelines without explicit validation steps have no mechanism to flag extraction failures, which allows problematic documents to pass through. Where a failure is caught early and exception handling exists, garbled documents get escalated and create manual review overhead. Where exception handling isn’t part of the workflow, the failure reaches the compliance record undetected and stays under the radar until a regulator finds it.
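As a rough illustration, here is a minimal sketch of what an explicit validation step could look like, assuming the extraction stage hands over a flat dictionary of field names and string values. The field names (`date_of_birth`, `id_number`, `full_name`), the ISO date format, and the nine-character ID shape are assumptions made up for this example, not the schema of any particular product.

```python
import re
from datetime import date

# Hypothetical field rules; real KYC schemas will differ.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
ID_NUMBER = re.compile(r"[A-Z0-9]{9}")  # assumed 9-character ID


def validate_record(record: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []

    dob = record.get("date_of_birth", "")
    if not ISO_DATE.fullmatch(dob):
        # Catches values split by a stray line break, e.g. "1990-04-\n12".
        problems.append(f"date_of_birth is not an ISO date: {dob!r}")
    else:
        try:
            date.fromisoformat(dob)
        except ValueError:
            problems.append(f"date_of_birth is not a real calendar date: {dob!r}")

    id_number = record.get("id_number", "")
    if not ID_NUMBER.fullmatch(id_number):
        # Catches a dropped or substituted character.
        problems.append(f"id_number has unexpected shape: {id_number!r}")

    name = record.get("full_name", "")
    # Digits in a name often indicate a merged name/address line.
    if not name.strip() or any(ch.isdigit() for ch in name):
        problems.append(f"full_name looks merged or empty: {name!r}")

    return problems
```

A record that returns a non-empty list can be routed to an exception queue instead of clearing ingestion silently.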

Amount and formatting of text

As noted above, documents that rely on images for context can cause problems, but so can text-heavy documents. These problems often crop up around:

Headers and subheaders, as the model doesn’t recognize them as being structurally significant to the document
Text in columns or other places (footnotes, for example), due to OCR quirks
Tables being turned into unorganized data, as the structure of the table isn’t parsed correctly
Forms, for similar reasons as tables, which can cause a problem when it comes to key-value pair extraction

In a KYC use case, structured key-value pair extraction is especially vital. Without it, users are often left to sift through an unstructured jumble of data points and labels, a common result when a filled-out PDF form is processed with OCR or other traditional extraction methods. Structured key-value pair extraction, by contrast, can recognize that 2026-03-25 is a date and label it accordingly, which paves the way for other parts of the document pipeline to be automated and streamlined in turn.
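To make the idea of labeled values concrete, here is a small sketch that attaches a coarse type label to each extracted value. The function names and the three labels (`date`, `number`, `text`) are illustrative assumptions; a production extractor would use far richer rules or a trained model.

```python
import re
from datetime import date


def label_value(raw: str) -> tuple[str, object]:
    """Attach a coarse type label to an extracted value (illustrative rules only)."""
    text = raw.strip()
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        try:
            return "date", date.fromisoformat(text)
        except ValueError:
            pass  # looks like a date but is not a valid one
    if re.fullmatch(r"-?\d+(\.\d+)?", text):
        return "number", float(text)
    return "text", text


def label_pairs(pairs: dict[str, str]) -> dict[str, tuple[str, object]]:
    """Label every value in a dict of extracted key-value pairs."""
    return {key: label_value(value) for key, value in pairs.items()}
```

Downstream rules can then branch on the label rather than re-parsing raw strings, and a value that looks like a date but is not valid falls back to plain text instead of being silently accepted.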

OCR

OCR is still used as a first step in many document pipelines, and it often comes up short when interacting with documents that have formatting quirks or haven’t been normalized yet. In instances of structured text mixed with unstructured text, tables, columns, etc., OCR will likely struggle and create garbled text in its attempt to parse the information. This garbled text is then inserted in a hidden layer of the PDF, which can show up later during extraction and cause issues with the entire downstream document workflow.
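The column problem is easy to reproduce with a toy example. The sketch below assumes the OCR step yields words with page coordinates and that the position of the gutter between columns is already known; both the `Word` type and the `gutter_x` parameter are hypothetical, and real layout analysis has to detect the gutter itself.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge
    y: float  # top edge; larger y is further down the page


def naive_order(words: list[Word]) -> str:
    """Read strictly line by line across the full page width."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.y, w.x)))


def column_order(words: list[Word], gutter_x: float) -> str:
    """Read the left column top to bottom, then the right column."""
    left = sorted((w for w in words if w.x < gutter_x), key=lambda w: (w.y, w.x))
    right = sorted((w for w in words if w.x >= gutter_x), key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in left + right)


# A two-column page: name on the left, address on the right.
page = [
    Word("Name:", 10, 10), Word("Jane", 60, 10),
    Word("Address:", 300, 10), Word("12", 370, 10),
    Word("Doe", 10, 30), Word("High", 300, 30), Word("Street", 340, 30),
]

print(naive_order(page))        # Name: Jane Address: 12 Doe High Street
print(column_order(page, 200))  # Name: Jane Doe Address: 12 High Street
```

The naive reading splits the name around the address, which is the kind of fragmented output described above.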

Preparing your PDFs for the KYC Screener

To improve the performance of an AI KYC Screener and ensure that it works reliably, you need the right document workflow upstream of it. Here’s what that looks like:

Normalize documents

In short, document normalization involves processing all incoming documents to give them the same (or very similar) internal structures. When there’s internal consistency from one document to the next, any automated agents will also function more consistently when interacting with those documents. As a result, errors are reduced and data extraction is both more accurate and less resource-intensive.

Here’s a brief overview of what document normalization typically looks like for someone using our Conversion Service:

Analyze: Conversion Service checks that the document is one of the 62 file types we support. Unsupported file types abort conversion before any other processing occurs. If desired, the user can specify which of the supported file types to process and define what happens to the others (reject the documents, pass them through, etc.).
Validate and repair / Convert to PDF: If the uploaded document is a PDF, it’s validated for structural integrity; any detected file corruption triggers an automatic repair attempt. If the uploaded document isn’t a PDF, it’s converted to a PDF before any further processing takes place, so our core SDK can handle the rest of the process.
OCR: See below for more details on this step.
Optimize: Redundant data is removed, images are compressed, annotations are flattened, and fonts are merged and subset, normalizing the PDF and minimizing its file size.
Convert to PDF/A: This is where structural normalization happens. All documents leaving this step conform to the PDF/A standard of the user’s choice (i.e., PDF/A-1, -2, -3, or -4). Metadata is standardized, external dependencies are removed, non-conformant annotations are removed, etc.

After this workflow, the documents are structurally similar and should perform the same when run through the rest of the document processing pipeline (including interacting with the KYC Screener). For more about document normalization and how to strengthen your normalization workflow, head here.
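The ordering of these steps can be sketched as a small planning function. This is not the Conversion Service API: the stage names are simply labels for the steps in the overview, the three-entry allowlist stands in for the full list of supported types, and the `on_unsupported` switch is a hypothetical way of expressing the reject/pass-through choice.

```python
from pathlib import PurePath

# Illustrative allowlist; the real service supports many more types.
ALLOWED = {".pdf", ".docx", ".tiff"}

# Stage order follows the overview; the stages themselves are placeholders.
STAGES_FOR_PDF = ["validate_and_repair", "ocr", "optimize", "convert_to_pdfa"]
STAGES_FOR_OTHER = ["convert_to_pdf", "ocr", "optimize", "convert_to_pdfa"]


def plan(filename: str, on_unsupported: str = "reject") -> list[str]:
    """Return the ordered stage names for one incoming file.

    on_unsupported is a hypothetical policy switch: "reject" raises,
    "pass_through" returns an empty plan so the file is left untouched.
    """
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED:
        if on_unsupported == "pass_through":
            return []
        raise ValueError(f"unsupported file type: {suffix or filename!r}")
    return list(STAGES_FOR_PDF if suffix == ".pdf" else STAGES_FOR_OTHER)
```

Making the plan explicit keeps the type check ahead of every other stage, so an unsupported file never reaches repair, OCR, or conversion.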

Use OCR as part of normalization

To prevent the OCR problems described above, where text is scrambled or “misread,” OCR runs after “validate and repair” in our Conversion Service workflow and consists of two SDK operations:

Analyze: Image preprocessing (deskew, binarization, noise removal, and resolution correction) runs before text recognition begins. This operation also identifies the pages/regions of the document that contain image-based text, so OCR only runs where it is required.
Synthesize: The OCR engine processes the identified pages/regions and then embeds recognized text back into the PDF aligned with the layout, which resolves any potential issues around hidden text layers or text embedded in images.
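The selection logic in the Analyze step can be approximated with a simple heuristic. The sketch below is an assumption about how such a decision might be made, not a description of the SDK: the `PageStats` fields and both thresholds are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    number: int
    text_chars: int        # characters already present in the text layer
    image_coverage: float  # fraction of the page area covered by images, 0..1


def pages_needing_ocr(pages, min_chars=20, min_coverage=0.5):
    """Pick pages that are mostly image and carry little or no real text."""
    return [
        p.number
        for p in pages
        if p.image_coverage >= min_coverage and p.text_chars < min_chars
    ]
```

For a three-page file where page 1 is digital text and pages 2 and 3 are scans, only the scanned pages are selected, which avoids re-recognizing text that is already reliable.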

Outputting XML for downstream agent pipelines

Converting your PDFs to XML is one of the best ways to create a machine-readable version of the document. XML, or Extensible Markup Language, is something of a machine lingua franca, created to give people a way of encoding documents that’s both human-readable and machine-readable.

Continuing our example of the Conversion Service workflow, after document processing you have the option to output to an XML file containing the recognized text, word-level position data, and OCR confidence scores for each detected character. This is ideal for pipelines that require structured data and gives a downstream agent pipeline stable, addressable text (rather than a visual approximation of data, as a PDF does). For a KYC rules engine, this means the difference between reading a field value reliably and inheriting whatever data (or mistakes) the PDF renderer happened to produce.
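Here is a sketch of how a downstream step might use confidence scores to decide what needs a second look. The XML layout below is a made-up example for illustration (element names, attributes, and word-level `conf` values are all assumptions); the actual output schema will differ and should be taken from the product documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output; not the real Conversion Service schema.
SAMPLE = """\
<document>
  <page number="1">
    <word x="72" y="90" conf="0.98">Jane</word>
    <word x="118" y="90" conf="0.97">Doe</word>
    <word x="72" y="110" conf="0.41">AB12E4567</word>
  </page>
</document>
"""


def low_confidence_words(xml_text: str, threshold: float = 0.8):
    """Return (page, text, confidence) for every word below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for page in root.iter("page"):
        for word in page.iter("word"):
            conf = float(word.get("conf", "0"))
            if conf < threshold:
                flagged.append((int(page.get("number")), word.text, conf))
    return flagged
```

With the sample above, only the identification code is flagged, so a reviewer can check one field against the source instead of re-reading the whole record.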

Institutions operating under GDPR should note that the pipeline architecture described here is easiest to keep compliant when document processing runs within the data perimeter. Sending KYC documents containing PII to an external service for processing can create data residency exposure and, without the right safeguards, may put an organization out of GDPR compliance. Conversion Service is self-hosted, so the document data stays inside your infrastructure when using it.

Future-proof your document workflow

As AI agents become more powerful and more integrated with all aspects of work, it’s important to remember that their functionality relies on the upstream document workflow. Document normalization is a critical part of that — without it, your ability to automate is limited.

By contrast, the workflow we’ve outlined here doesn’t just improve opportunities for automation; it also improves future auditing experiences. A deterministic, normalized pipeline produces a stronger final result (fewer errors for a compliance officer to find in the first place) and records each step along the way. If an error does happen, there’s a clean audit trail that can be traced back to discern exactly where, how, and why it entered the document pipeline, so it can be fixed before future audits occur.
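One way to picture such an audit trail is as a chain of content hashes, where each step records a digest of what it received and what it produced. This is a generic sketch under that assumption, with hypothetical function names; it is not how Conversion Service records its steps.

```python
import hashlib


def record_step(trail: list[dict], step: str, before: bytes, after: bytes) -> None:
    """Append one audit entry linking a step to its input and output digests."""
    trail.append({
        "step": step,
        "input_sha256": hashlib.sha256(before).hexdigest(),
        "output_sha256": hashlib.sha256(after).hexdigest(),
    })


def first_broken_link(trail: list[dict]):
    """Return the index of the first entry whose input is not the previous output."""
    for i in range(1, len(trail)):
        if trail[i]["input_sha256"] != trail[i - 1]["output_sha256"]:
            return i
    return None
```

If every step's input digest matches the previous step's output digest, the chain is intact; the first mismatch points at the stage where the document changed outside the recorded pipeline.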

Start building the necessary foundation to improve and streamline your document normalization processes today with our Conversion Service. It can handle normalization, OCR, optimization, and PDF/A conversion in one deterministic pipeline.

Your KYC agent inherits every upstream document problem