Data you can do more with: Conversion Service workflows with OCR can now output XML files
We’ve added a new capability to the Pdftools Conversion Service that uses OCR to extract structured data from over 50 supported file types and write it to an XML file. Customers can then integrate that data downstream, for example, in workflows built on LLMs and RAG.
Up until now, applying OCR (Optical Character Recognition) in a Conversion Service workflow would simply turn unstructured files into PDF and PDF/A files with selectable text. But we didn’t actually expose the structured data that OCR extracts along the way. The new capability lets you output not just the PDFs but also the structure of the input file itself as an XML file. That way, you can use it for whatever you need.
Potential uses lie in data analysis and Retrieval-Augmented Generation (RAG), which allows LLMs to access external knowledge. Or perhaps you need to store and archive the extracted data to remain compliant with company policies or industry standards.
Outputting files that work for both humans and computers
Let’s take a PDF as a starting file, for example. PDFs are a great way to communicate information to humans and to preserve it for the future. The file format preserves text, visuals, and layout really well. However, the data in PDFs lacks a clear structure that can be easily extracted. You might have a visually appealing document that humans can instinctively read in the correct order, but to a computer, that same file is just a collection of pixels and glyphs with no clear reading order.
When we convert images or other PDFs to PDF, we take an input file and use OCR to turn it into a searchable PDF or PDF/A. Text can be selected, copied, highlighted, annotated, and so on. But the PDF itself is still fairly unstructured data that can’t be processed or used further in a data pipeline.
How to turn unstructured into structured data
During the PDF conversion workflow, there’s a lot more happening in the background that we don’t see if the final output is only PDF or PDF/A.
OCR first dismantles the original PDF (or any other supported file type) into structured data and stores it in an XML file. The XML file records the content itself, provides details for every detected word, including the detected characters and their positions, and places each word in context within a text block, paragraph, and line. It can even show how confident the OCR engine is in its interpretation, which matters because the engine might struggle with, for example, uncommon words or proprietary names.
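To make that hierarchy concrete, here is a minimal Python sketch that reads such a file and lists the words the OCR engine was least sure about. The element and attribute names (block, paragraph, line, word, confidence, x, y) and the sample data are assumptions made for illustration only; the actual schema written by the Conversion Service may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output. The real schema may use different
# element and attribute names; adjust the lookups accordingly.
SAMPLE_XML = """
<page>
  <block>
    <paragraph>
      <line>
        <word x="10" y="20" confidence="0.98">Invoice</word>
        <word x="55" y="20" confidence="0.61">Pdftoo1s</word>
      </line>
      <line>
        <word x="10" y="36" confidence="0.95">Total</word>
      </line>
    </paragraph>
  </block>
</page>
"""


def low_confidence_words(xml_text, threshold=0.8):
    """Return (text, confidence, (x, y)) for every word below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    # iter() walks the whole tree, so words are found at any nesting depth.
    for word in root.iter("word"):
        confidence = float(word.get("confidence", "0"))
        if confidence < threshold:
            position = (int(word.get("x")), int(word.get("y")))
            flagged.append((word.text, confidence, position))
    return flagged


if __name__ == "__main__":
    for text, confidence, position in low_confidence_words(SAMPLE_XML):
        print(f"{text!r} at {position}: confidence {confidence:.2f}")
```

A list like this could feed a manual review step, so that only the words the engine doubted need a second look.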
The output file is then constructed from the XML file as a visual representation of the structured data it contains. The final PDF looks identical to the input file, but the text can now be selected and copied. And the data from the XML itself can be used separately and processed further.
Giving our customers more to work with
In the past, we’ve only exposed the final PDF to our customers so they could search in it, archive it, or do whatever else they needed to do with it. But with LLMs and RAG becoming such a huge part of many businesses’ workflows, we’re now giving users the ability to output the XML file itself. That way, they can access and use that structured data in its raw form, giving them more flexibility and business options.
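As one example of that flexibility, a RAG pipeline typically needs text split into retrievable chunks, and the paragraph boundaries recorded by OCR are a natural place to split. The sketch below assumes a hypothetical schema with paragraph and word elements; the names and sample data are illustrative, not the service's documented format.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output; element names are assumptions for illustration.
SAMPLE_XML = """
<page>
  <block>
    <paragraph>
      <line><word>Payment</word><word>is</word><word>due</word></line>
      <line><word>within</word><word>30</word><word>days.</word></line>
    </paragraph>
    <paragraph>
      <line><word>Contact</word><word>support</word></line>
    </paragraph>
  </block>
</page>
"""


def paragraphs_to_chunks(xml_text):
    """Turn each OCR paragraph into one text chunk with a running id."""
    root = ET.fromstring(xml_text)
    chunks = []
    for index, paragraph in enumerate(root.iter("paragraph")):
        # Join words across all lines of the paragraph in document order.
        words = [w.text for w in paragraph.iter("word") if w.text]
        chunks.append({"id": index, "text": " ".join(words)})
    return chunks


if __name__ == "__main__":
    for chunk in paragraphs_to_chunks(SAMPLE_XML):
        print(chunk["id"], chunk["text"])
```

How the chunks are embedded and indexed afterwards depends on the retrieval stack in use; this step only turns the OCR structure into plain text units.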