How to fix your claims pipeline with structured key-value extraction


Every time a claim file crosses your desk, you’re likely to deal with a variety of file types and formats. Today’s file contains a scanned ACORD form, a faxed repair estimate, screenshots of emails, and a handwritten loss description. Even after your OCR system catalogs them all, you’ve got a lot of manual cleanup ahead of you: matching the policy number to the events, labeling the numbers as medical costs or repair estimates or serial codes, and so on.

This scenario feels familiar because traditional OCR and RPA processes can handle structured documents with a fixed layout, but their efficacy fizzles out when faced with the variable layouts, mixed formats, and contextual interpretation that claim files demand. Approximately 80% of all enterprise data is unstructured, trapped in sources ranging from emails to medical records to miscellaneous documents, and insurance data is no exception. When you combine that amount of unstructured data with the fact that underwriting operations teams spend up to 40% of their time rekeying data, it’s clear that a new solution is needed. This is where structured key-value processing comes in.

Structured key-value pair extraction makes it significantly easier to build reliable, scalable claims pipelines. It matches data points to the relevant labels, giving them context when pulled from a PDF. For example, it can recognize that 2026-01-13 is a date and label it accordingly when extracted. In this instance, the key is “date” and the value is “2026-01-13,” hence “key-value pair extraction.” This style of structured extraction goes beyond preprocessing in a way that traditional extraction methods can’t, allowing for other parts of the claims processing workflow to be automated and streamlined in turn. 

Where traditional extraction fails in claims environments

One shortcoming of traditional document extraction methods is that they tend to rely exclusively on optical character recognition (OCR). OCR, the process through which images of typed/handwritten text are converted into machine-encoded text, has its place in document-processing workflows, but it also has its limitations. In this instance, the biggest shortcoming is that OCR doesn’t really structure data — it just pulls out the text from a document. Downstream claim processing systems need data to be structured and labeled, so the raw text isn’t immediately usable. It loses every relationship that makes the data relevant and usable: which value belongs to which field label, which line item goes where in which table, which amount is the deductible and which is the claim total, etc.


Other methods used in traditional extraction processes include image recognition, intelligent character recognition (ICR), and native text extraction. Each of these processing methods has its own strengths and shortcomings, and in general, they can handle one format well, but fail at handling other formats. If the tools you’re using can flawlessly extract the relevant data from a picture, but can’t pull out the same relevant information from an Excel spreadsheet, you’ll still have to do a significant amount of manual processing to get through the claim packet.

There’s a big difference between “text on a page” and “data in the correct field,” and that’s where most automation efforts fail. Here are a few ways that traditional extraction methods tend to fall short: 

  • When labels are separated from values, context disappears, and the process is cut short. When a scanned ACORD form is run through OCR, the label “Claimant Name” might appear three lines above the actual name, checkboxes for “Accident” vs. “Theft” lose their boolean meaning, and table columns for billed amount, allowed amount, and paid amount might be merged into a single unordered list. You’re left with a mess of raw text — practically useless for automated routing or adjudication. 

  • When context disappears, you can’t correlate documents across the claim lifecycle. A complete claim requires cross-referencing data from multiple sources at multiple times, ranging from FNOL (First Notice of Loss) to medical documentation to adjuster reports. The claims system needs to match, validate, and reconcile all of this information — the policy number from the FNOL needs to match the policy record, the claimed amount needs to reconcile with the repair estimate line items, etc. — and an automated system will be unable to do that when presented with raw text and no context. 

  • When that lack of context moves downstream, systems struggle to keep up. Whether it’s a claims management platform, fraud detection model, regulatory reporting system, or an AI-driven triage engine, downstream systems need data in specific schemas to be able to process it correctly. Claim amounts must be numeric, dates must be ISO-formatted, and the policy number needs to be the right string in the right field. Raw OCR output doesn’t meet any of these requirements unless you add a structured layer that maps extracted content to the correct schema and validates it against your business’s specific rules.
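To make the schema requirement concrete, here is a minimal validation sketch in Python. The field names and rules are illustrative assumptions, not any particular platform’s schema:

```python
# Illustrative sketch: validate extracted claim fields against a simple schema.
# Field names and rules are assumptions for this example, not a real standard.
from datetime import date

def validate_claim_record(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    # Claim amounts must be numeric.
    for field in ("amount_billed", "amount_allowed", "amount_paid"):
        if not isinstance(record.get(field), (int, float)):
            errors.append(f"{field} must be numeric")
    # Dates must be ISO-formatted (YYYY-MM-DD).
    try:
        date.fromisoformat(record.get("date_of_loss", ""))
    except ValueError:
        errors.append("date_of_loss must be an ISO date")
    # The policy number must be a non-empty string in the right field.
    policy_number = record.get("policy_number")
    if not isinstance(policy_number, str) or not policy_number:
        errors.append("policy_number must be a non-empty string")
    return errors
```

A downstream system can run a check like this at ingestion time and route any record with a non-empty error list to manual review instead of silently processing bad data.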

How structured key-value pair extraction works

After first notice of loss (FNOL), when a policyholder files a claim to report a loss, the next step is for the insurance company to verify and validate information based on the documents provided by the policyholder (and any relevant internal documents, like their policy). Structured key-value pair extraction can streamline this process and make sure that accurate data is available at every stage of the claims process. It solves the above problems with traditional extraction methods by reconstructing the relationships between different data points. It can associate “Policy Number” with the right string of characters, tie “Date of Loss” to an ISO date, link billing codes to their corresponding amounts, and map checkbox selections to boolean fields. 

The resulting output is clean and schema-aligned (available in JSON or XML), and can be used by claims management systems, fraud detection models, and compliance engines, without manual intervention. For example, a scanned ACORD form processed through standard OCR systems could return:

“Claimant Name John Smith Date of Loss 11032024 Accident Theft Policy ABC12345 Amount Billed 4200.00 Allowed 3800.00 Paid 3500.00”


The same document processed via structured key-value pair extraction would produce: 

{
  "claimant_name": "John Smith",
  "date_of_loss": "2024-11-03",
  "claim_type": "accident",
  "policy_number": "ABC-12345",
  "amount_billed": 4200.00,
  "amount_allowed": 3800.00,
  "amount_paid": 3500.00
}

The first version requires additional labor before downstream systems can process it, but the second one is immediately machine-readable and ready for use. This solves two problems simultaneously: 

  1. It lets organizations feed AI/LLM tools with semantically coherent, labeled inputs, enabling intelligent claims triage and decision support

  2. It provides deterministic structured data for automated pipelines that require predictable, validated output at scale 
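To illustrate the second point, a downstream pipeline can consume structured output like the example above directly, with no cleanup step in between. The routing rule below is a hypothetical example, not a real adjudication policy:

```python
# Illustrative sketch: a pipeline consuming structured extraction output.
# The routing threshold is a hypothetical example, not a real business rule.
import json

structured_output = """
{
  "claimant_name": "John Smith",
  "date_of_loss": "2024-11-03",
  "claim_type": "accident",
  "policy_number": "ABC-12345",
  "amount_billed": 4200.00,
  "amount_allowed": 3800.00,
  "amount_paid": 3500.00
}
"""

def route_claim(record: dict) -> str:
    # Low-value accident claims go straight through; everything else is reviewed.
    if record["claim_type"] == "accident" and record["amount_billed"] < 5000:
        return "straight-through"
    return "manual-review"

claim = json.loads(structured_output)
print(route_claim(claim))  # prints "straight-through" for this example record
```

Because the fields are already typed and labeled, the routing logic is a few lines of deterministic code; with raw OCR text, the same decision would first require parsing and relabeling by hand.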

Beyond these two major problems, this type of structured extraction also addresses other shortcomings of traditional extraction methods: 

  • It’s better at tackling layout variability across providers and form versions. ACORD forms alone have dozens of versions, and every medical provider, repair shop, and legal office uses different form layouts for estimates, assessments, etc. Where a template-based approach demands constant reconfiguration, this type of structured extraction interprets layout semantically rather than positionally, making it truly scalable. 

  • It’s more accurate with handwriting and degraded documents. Traditional extraction methods often come up short when it comes to mixed typed/handwritten content on field claims and faxed or older documents with added visual noise. Accuracy on these items is crucial, though, as they often include things like handwritten loss descriptions or adjuster notes approving a claim. 

  • It excels at multi-document correlation across the claim lifecycle. Remember the loss of context we discussed before? Structured key-value pair extraction solves this problem and negates the downstream complications that can come from trying to feed a system raw text, when it needs cross-references and consistency. 

  • It understands and works within predefined schemas. Structured extraction tools interpret each document against an expected schema. For example, they can recognize that “DOL” stands for “Date of Loss,” that the column of numbers under “Amount Billed” are currency values, and that a checked box indicates a boolean claim type. Being able to interpret and work within schemas like this allows for visual documents to be turned into database-ready records. Additionally, it creates an audit trail related to a specific claim and keeps your organization compliant.
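Here is a rough sketch of what that schema-aware interpretation can look like in Python. The alias table and checkbox handling are illustrative assumptions, not any vendor’s actual mapping:

```python
# Illustrative sketch: normalize document labels to canonical schema fields.
# The alias table and checkbox marks are assumptions for this example.
FIELD_ALIASES = {
    "dol": "date_of_loss",          # abbreviation seen on some forms
    "date of loss": "date_of_loss",
    "claimant name": "claimant_name",
    "policy": "policy_number",
    "amount billed": "amount_billed",
}

CHECKBOX_MARKS = {"x", "[x]", "yes"}

def normalize_field(label: str, value: str) -> tuple[str, object]:
    """Map a raw document label/value pair to a canonical (key, typed value)."""
    raw = label.strip().lower()
    key = FIELD_ALIASES.get(raw, raw.replace(" ", "_"))
    # A checked box becomes a boolean rather than a stray character.
    if value.strip().lower() in CHECKBOX_MARKS:
        return key, True
    # Values under "Amount ..." headers are treated as currency.
    if key.startswith("amount_"):
        return key, float(value.replace("$", "").replace(",", ""))
    return key, value.strip()
```

In a real system, the alias table would be derived from the expected schema for each form type, so that “DOL” on one form and “Date of Loss” on another land in the same database column.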

It’s vital to understand that structured extraction is not “just” preprocessing. Rather, it lays the foundation that enables AI-driven claims adjudication, fraud detection, compliance reporting, and straight-through processing (STP). Out of the global respondents for the Gallagher Bassett 2025 Claims Insights report, 47% said they planned to manage evolving market dynamics by strengthening their claims management processes, and structured extraction is one of the best ways to do that.

AI extraction is not generative AI (and the distinction matters)

If you’ve seen generative AI hallucinate, or are even aware of the possibility, you’re right to be cautious about introducing AI tools into claims workflows. It’s a legitimate concern, especially when PII is involved, but in this case it rests on a category confusion. 

AI-assisted extraction isn’t generative, in that it doesn’t create content. It identifies and structures existing content against a predefined schema, with deterministic outputs that can be validated against business rules. Still, it’s imperative to be thoughtful in how you implement AI in workflows: create clear policies, build failsafes in case of errors (including a human in the loop), and match the right kind of AI to the task at hand. 


If your organization uses LLMs for other purposes, AI extraction can also significantly improve the dataset that your LLMs are working with. By presenting clean, ordered data in structured key-value pairs, instead of as a loose assortment of text and images, you can train the model on high-quality data and get higher-quality output as a result.

The downstream effects: what stable extraction enables

Structured key-value pair extraction comes with many immediate benefits: more flexibility in processing different document formats and layouts, higher accuracy rates with handwritten content, and database-ready information, to name a few. Beyond those, there are other compelling reasons to adopt this type of extraction: 

  • The data obtained via extraction enables straight-through processing and lets low-complexity claims be automatically adjudicated end-to-end, freeing up time and energy to use on more complex cases that can’t be fully automated. 

  • Having higher-quality data to train LLMs on decreases the likelihood of errors in LLM-involved processes and output. For example, many insurance companies rely on RAG (Retrieval-Augmented Generation) for their internal information retrieval. Structured key-value pair extraction provides higher-quality data than raw text, which in turn is stored in the same knowledge bases that RAG draws from. Then, because the system is being provided with cleaner, more structured data, it can give more accurate, grounded responses when queried.  

  • Structured output formats improve interoperability. JSON, XML, or direct API payloads make it easy for claims management systems, analytics platforms, compliance reporting tools, and other software to consume extracted data consistently. Where flat text extraction struggles to represent multi-page medical bills, repeating line items, or nested entity relationships, this type of structured output does not. 

The claims process can be time-consuming and tedious, but with the technology currently available, it doesn’t have to be. More often than not, failures in claims automation aren’t failures of AI or technology; they’re ingestion failures. Raw text extracted from claim documents needs heavy manual processing before it’s usable elsewhere in the claims process, a problem that structured key-value pair extraction neatly circumvents. For insurance companies that want to keep pace with competitors and shifting economic conditions without lowering customer service standards (and customer satisfaction to match), it’s a fundamental element in building claims pipelines that are reliable, scalable, and free up internal resources for use elsewhere. 
