Why Your PDF/A Files Keep Failing Audits

A PDF document with errors flagged in red transforms into a validated PDF/A document with a blue checkmark.

Picture the archive at an insurance company, where employees have been generating PDF/A files for years…but may never have validated a single one of them. The metadata in the files says they’re PDF/A compliant. The auditor, however, disagrees. Without independent validation, a gap can easily open between claimed compliance and verified compliance. Worse yet, that gap can stay invisible for months or years, until a third party conducts a regulatory review. And if a regulator requests records and the archive fails inspection, the failure to validate isn’t just a procedural gap; it’s a compliance gap.

The costs of those invisible failures can be steep. GDPR has strict rules around archival standards: secure storage of personal data, maintaining the integrity of the stored data, and the ability to retrieve it on demand. Additionally, the EU directive MiFID II requires that financial institutions store client communications and transaction records for five to seven years in a tamper-proof and easily retrievable format. Fines for failing to adhere to these regulations can reach 4% to 10% of a company’s total annual revenue (depending on which regulations are violated and how). Your organization can avoid those costs by understanding where issues normally arise in a PDF/A pipeline and how to catch them before they happen.

Why your PDF/A files are failing audits

Understanding why PDF/A validation is important is one thing, but knowing why your PDF/A files are failing audits or aren’t passing validation checks is another. In general, these are the most common PDF/A issues:

  • Embedded font issues: Font issues are one of the most common culprits, with one study showing that out of 150 PDF files flagged as non-conformant, 101 of those files were flagged due to an incomplete font subset. All fonts must be embedded in the PDF, meaning the viewer won’t need to have the font on their computer to view the PDF.

  • Color space profiles: Color must be specified in a device-independent way, using ICC profiles, to guarantee consistent color reproduction regardless of how/where/when the PDF is viewed. One common error is a document that uses a device-dependent color space (such as DeviceRGB or DeviceCMYK) without an embedded output intent ICC profile to define it.

  • Metadata anomalies: All PDF/A documents must contain XMP metadata that identifies the details of the document’s PDF/A compliance, including the pdfaid:part property (which indicates the PDF/A version number) and the pdfaid:conformance property (which indicates the conformance level). 

  • Incremental update history: An incremental update is a method of modifying a PDF by adding new information to the end of the existing file, without modifying the original data in any way. A new xref table and trailer are created for the new version of the PDF, and the older version of the document remains present in the file, but isn’t actually displayed to the viewer. The validation issues come into play in a few different ways: 

    • PDF/A-1 doesn’t permit incremental updates at all; any modification to a PDF/A-1 file has to produce a full save, not an incremental one

    • PDF/A-2 and -3 do permit incremental updates, but with constraints; the update must itself be entirely PDF/A conformant, so you can’t append a revision that includes non-embedded fonts, or the document will fail validation

    • Aside from those specific constraints, another often-overlooked detail is that the document version listed in the document’s header must match the document version listed in the document’s catalog dictionary, or it will fail validation
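
As a rough illustration of the incremental update point, here is a minimal Python sketch that reads the version from a file’s header and counts end-of-file markers, since each appended revision normally ends with its own %%EOF. The function name inspect_revisions is hypothetical, and this is only a heuristic: linearized PDFs can also contain more than one %%EOF, and the marker could appear inside stream data, so treat a positive result as a prompt for proper validation rather than a verdict.

```python
import re


def inspect_revisions(data: bytes) -> dict:
    """Heuristic look at a PDF's raw bytes (hypothetical helper, not a validator)."""
    # The header normally looks like "%PDF-1.4" at the very start of the file.
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # Each saved revision usually ends with its own %%EOF marker.
    eof_count = len(re.findall(rb"%%EOF", data))
    return {
        "header_version": header.group(1).decode("ascii") if header else None,
        "eof_markers": eof_count,
        # More than one marker suggests (but does not prove) appended revisions.
        "likely_incremental": eof_count > 1,
    }
```

Comparing the header version against the catalog dictionary requires parsing the document’s objects, which is better left to a PDF library or a validator.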

PDF/A conversion doesn’t automatically mean PDF/A validation

Now that you know precisely why your files are failing validation checks, it’s worth learning more about the difference between conversion and validation:

During PDF/A conversion, a file is converted to PDF/A format — either from a standard PDF or from another filetype (.doc, .html, etc.). As part of this process, the new file will have specific metadata identifying itself as a PDF/A file (including information on the specific type of PDF/A file it is). It’s vital to note that the metadata is a claim of conformance to PDF/A standards, but doesn’t actually ensure conformance in any way, shape, or form.  

During PDF/A validation, a program checks the file to make sure that it’s actually PDF/A compliant. FOSS validators often operate on heuristics, rather than specification-level checks. They might provide a “best-guess” about compliance and say a file probably conforms, which is not the same as formal validation and can result in compliance violations further down the line. And, while important, a visual review of the document isn’t a sufficient stand-in for actual validation, because a file can look like it meets the requirements, without actually meeting them. 


An additional complicating factor is how many people across multiple different departments are typically involved in document processing workflows. Let’s say a person working in compliance signs off on creating a document archive, and an independent engineering team is building the pipeline for how those documents are processed before archival. If those two people (or teams) have a different understanding of what PDF/A compliance entails, the gap between claimed compliance and verified compliance can be extensive. Even worse, it often takes several years to discover that gap. 

The minimum requirements for all PDF/A files are: 

  • All content (fonts, text, images, etc.) must be embedded in the document without referencing external content

  • The file cannot contain audio/video, JavaScript, or XFA forms

  • The file can’t use LZW compression, encryption, or password protection

  • All interactive form fields must have an appearance dictionary 

  • The file’s metadata must be encoded using Extensible Metadata Platform (XMP) technology

Those are just the minimum requirements that apply to all PDF/A subtypes. Depending on which subtype you’re using (and which conformance level within that subtype), there might be even more requirements. When you skip validation, you risk your archived documents becoming unreadable over time, being rejected by government agencies, and failing future audits.
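
To make a few of the minimum requirements above concrete, here is a small Python sketch of a pre-screen that looks for the names of prohibited features in a file’s raw bytes. The marker list and the function name prescreen are assumptions chosen for illustration. It can’t see anything stored inside compressed object streams and it checks only a handful of rules, so it is a quick smoke test before real validation, not a substitute for it.

```python
from __future__ import annotations

# PDF name tokens that hint at features PDF/A prohibits (illustrative subset).
FORBIDDEN_MARKERS = {
    b"/JavaScript": "JavaScript action",
    b"/XFA": "XFA form",
    b"/LZWDecode": "LZW compression",
    b"/Encrypt": "encryption",
}


def prescreen(data: bytes) -> list[str]:
    """Return labels for prohibited features whose markers appear in the raw bytes."""
    return [label for marker, label in FORBIDDEN_MARKERS.items() if marker in data]
```

An empty result means only that none of these markers were visible in the raw bytes; the file still needs to go through a specification-level validator.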

In most cases, the conformance level claimed in the document’s metadata is the standard that the document will be held to when being reviewed by outside regulators or auditors. This is why our SDK, by default, validates a PDF against the conformance level claimed by the document internally (unless the user specifies otherwise). Otherwise, a document can claim to be PDF/A-2b internally and fail its own claim. 
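
To show what a document’s internal claim looks like in practice, here is a minimal Python sketch that reads the pdfaid:part and pdfaid:conformance properties from an XMP packet. It is not our SDK’s API: the function name read_pdfa_claim is hypothetical, and the sketch assumes you have already extracted the XMP packet from the PDF as a string (typically with a PDF library). It handles both ways XMP can express a property, as a child element or as an attribute.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def read_pdfa_claim(xmp_packet: str):
    """Return the (part, conformance) claimed in an XMP packet, or None for missing values."""
    root = ET.fromstring(xmp_packet)
    claim = {"part": None, "conformance": None}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        for key in claim:
            qname = f"{{{PDFAID_NS}}}{key}"
            # XMP allows a simple property as an attribute or as a child element.
            value = desc.get(qname)
            if value is None:
                child = desc.find(qname)
                value = child.text if child is not None else None
            if value is not None:
                claim[key] = value.strip()
    return claim["part"], claim["conformance"]
```

A result such as ("2", "B") tells you which standard the file says it meets, which is the profile to validate against; it says nothing about whether the file actually meets it.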
 

Why conformance level is an architectural decision

In addition to the above issues, conformance mismatching is another one to watch out for, as it’s both a common error and nuanced in surprising ways. For example, a document might claim to be PDF/A-1a compliant when it actually only meets the requirements for PDF/A-1b. There are two key things to understand about PDF/A conformance levels:

  • They aren’t interchangeable. The compliance standards for PDF/A-1, -2, and -3 were all created at different times and have different requirements. For example, PDF/A-1 doesn’t allow for transparency or layers and only has conformance levels A and B. Meanwhile, PDF/A-2 allows for transparency, layers, and attachments of other PDF/A files, and has conformance levels A, B, and U.  

  • They’re cumulative. Level U (which is only applicable to PDF/A-2 and -3) includes all level B requirements and adds new ones, and level A includes all level U requirements (and then some). The level of conformance you need to aim for will depend on your use case. For example, a claims archive that needs document text to be easily extractable years later requires level U at minimum, because level B conformance doesn’t guarantee that functionality. This is why decisions around conformance level are architectural decisions with long-term retrieval and regulatory consequences.

Conformance level U in particular is specifically about searchability and text extraction. Where level B conformance is largely about the visual integrity of the document, level U conformance guarantees the searchability of the text within the PDF. In level U documents, all text must have a Unicode mapping, which means it can be reliably searched and digitally extracted later if necessary. For regulated industries where archival documents may need to be retrieved and reviewed years after filing, PDF/A conformance levels U and A are the most relevant.
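
Because each level builds on the one below it, a pipeline can encode the ordering B < U < A and reject documents that fall short of what the archive needs. The sketch below is one possible way to express that in Python; the names LEVEL_RANK, LEVELS_BY_PART, and meets_minimum are hypothetical, and the check compares claimed levels only, so it complements validation rather than replacing it.

```python
# Higher rank means a stricter level that includes the requirements below it.
LEVEL_RANK = {"B": 0, "U": 1, "A": 2}

# Level U only exists for PDF/A-2 and PDF/A-3.
LEVELS_BY_PART = {1: {"A", "B"}, 2: {"A", "B", "U"}, 3: {"A", "B", "U"}}


def meets_minimum(part: int, claimed: str, required: str) -> bool:
    """Check whether a claimed conformance level is at least the required level."""
    claimed, required = claimed.upper(), required.upper()
    if claimed not in LEVELS_BY_PART.get(part, set()):
        raise ValueError(f"PDF/A-{part} has no conformance level {claimed!r}")
    return LEVEL_RANK[claimed] >= LEVEL_RANK[required]
```

For the claims archive example, meets_minimum(2, "B", "U") is False, which is the signal to fix the conversion settings before the document reaches the archive.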

Your audit-proof PDF/A workflow

What does a defensible pipeline look like? Here’s what we recommend: 


PDF normalization → Conversion to PDF/A → Independent validation of PDF/A compliance → Document is archived 

As part of that pipeline, you’ll be creating a solid audit trail, including: 

  • Records of who converted files and when they did so

  • Information on what tool (and what version of the tool) they were using to convert files

  • An independent validation report that confirms the specific conformance level claimed — ideally with a timestamp, stored alongside the document(s) it’s reporting on
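
The audit trail items above could be captured in a single record stored next to each document. The following Python sketch shows one possible shape; every field name and the function build_audit_record are assumptions for illustration, not a required or standard format. The SHA-256 hash ties the record to the exact bytes that were validated.

```python
import hashlib
from datetime import datetime, timezone


def build_audit_record(pdf_bytes, operator, converter, converter_version,
                       converted_at, validator, validator_version,
                       claimed_level, passed, validated_at=None):
    """Assemble a JSON-serializable audit entry for one archived document."""
    validated_at = validated_at or datetime.now(timezone.utc)
    return {
        # Hash of the archived file, so the record can be matched to it later.
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        "conversion": {
            "operator": operator,
            "tool": converter,
            "tool_version": converter_version,
            "converted_at": converted_at.isoformat(),
        },
        "validation": {
            "tool": validator,
            "tool_version": validator_version,
            "claimed_level": claimed_level,
            "passed": passed,
            "validated_at": validated_at.isoformat(),
            # True only when a different tool validated than converted.
            "independent": validator != converter,
        },
    }
```

The record contains only strings and booleans, so it can be written out with json.dumps and retained alongside the validator’s own machine-readable report.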

If you’re wondering if independent validation matters, think about it: a library that both converts and validates is checking the end result against its own reading of the specification. And systematic misinterpretations — the kind most likely to surface during regulatory reviews — are also the kind most likely to pass undetected. 

This is one of the risks in using FOSS tools for PDF/A validation: if a tool doesn’t report exactly what failed and where, potential failures are silent…and in a regulated environment, silent compliance failures are the worst kind. Our validation checks, by contrast, surface errors (based on both the ISO spec and any custom profiles the user has set) and show them by category, location, object number, and page number. In the case of an audit failure, your organization will be expected to explain how the failure happened, and skipping independent validation is not a defensible stance to take in such instances.

Auditors expect to see generation and validation performed by separate tools (no self-certification), machine-readable validation reports retained as evidence, and consistency between claimed conformance levels and the actual structure of documents. Remember, under frameworks like GDPR and MiFID II, the burden of proof for document integrity lies entirely with the organization maintaining the archive. 

As you can see, independent validation is a key part of a successful PDF/A pipeline, which is one reason we’ve put so much effort into our SDK’s approach to validation. It can show you exactly what failed validation and where the failures are in the document, giving you the chance to fix the errors while creating a paper trail that you can use in case of future audits. A generation library that silently succeeds can’t give you the same level of detail or security, and can potentially set you up for future audit failures.
