Pdftools Conversion Service can now output XML files

Up until now, applying OCR (Optical Character Recognition) in a Conversion Service workflow simply turned unstructured files into PDF and PDF/A files with selectable text. The structured data that OCR extracts along the way was never exposed to the customer. The new capability lets you output not just the PDFs but also the structure of the input file itself as an XML file, so you can use it for whatever you need.

Potential uses include data analysis and Retrieval-Augmented Generation (RAG), which allows large language models (LLMs) to access external knowledge. Or perhaps you need to store and archive the extracted data to remain compliant with company policies or industry standards.

Outputting files that work for both humans and computers

Let’s take a PDF as a starting file, for example. PDFs are a great way to communicate information to humans and to preserve it for the future. The file format preserves text, visuals, and layout really well. However, the data in PDFs lacks a clear structure that can be easily extracted. You might have a visually appealing document that humans can instinctively read in the correct order, but to a computer, that same file is just a collection of pixels and glyphs with no clear reading order.

When we convert images or other PDFs to PDF, we take an input file and use OCR to turn it into a searchable PDF or PDF/A. Text can be selected, copied, highlighted, annotated, and so on. But the PDF itself is still fairly unstructured data that can’t easily be processed or used further in a data pipeline.

How to turn unstructured into structured data

During the PDF conversion workflow, there’s a lot more happening in the background that we don’t see if the final output is only PDF or PDF/A.

OCR first dismantles the original PDF, or any other supported file type, into structured data and stores it in an XML file. The XML file records the content itself and provides details for every detected word, including the detected characters and their positions, and it places each word in context within a text block, paragraph, and line. It can even show how confident the OCR engine is in its interpretation, which matters because the engine might struggle with uncommon words or proprietary names.
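To make the description above more concrete, here is a minimal Python sketch of reading per-word details from such a file. The element and attribute names (`page`, `block`, `paragraph`, `line`, `word`, `x`, `y`, `confidence`) and the sample content are assumptions made up for illustration. They are not the Conversion Service’s actual XML schema, so check the product documentation for the real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output. Element and attribute names are illustrative
# assumptions, not the actual schema produced by the Conversion Service.
SAMPLE_XML = """
<document>
  <page number="1">
    <block>
      <paragraph>
        <line>
          <word x="72" y="90" confidence="0.99">Invoice</word>
          <word x="126" y="90" confidence="0.62">Pdftools</word>
        </line>
        <line>
          <word x="72" y="108" confidence="0.97">Total</word>
        </line>
      </paragraph>
    </block>
  </page>
</document>
"""


def extract_words(xml_text):
    """Return one dict per detected word with its text, position, and confidence."""
    root = ET.fromstring(xml_text)
    words = []
    for page in root.iter("page"):
        for word in page.iter("word"):
            words.append({
                "page": int(page.get("number")),
                "text": word.text or "",
                "x": float(word.get("x")),
                "y": float(word.get("y")),
                "confidence": float(word.get("confidence")),
            })
    return words


def low_confidence(words, threshold=0.8):
    """Select words the OCR engine was unsure about, for example for manual review."""
    return [w for w in words if w["confidence"] < threshold]


words = extract_words(SAMPLE_XML)
flagged = low_confidence(words)
for w in flagged:
    print(f'page {w["page"]}: "{w["text"]}" at ({w["x"]}, {w["y"]})')
```

In a real pipeline, a list like `flagged` could feed a review queue, since uncommon words and proprietary names are where recognition is most likely to go wrong. The 0.8 threshold is an arbitrary value chosen for the sketch.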

The output file is then constructed from the XML file as a visual representation of its structured data. The final PDF looks identical to the input file, but the text can now be selected and copied. The data from the XML can also be used separately and processed further.
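As one possible form of further processing, the sketch below rebuilds plain text per paragraph, in reading order, so it could be passed to a search index or a RAG pipeline. The XML layout used here (`page`, `paragraph`, `line`, `word`) is a hypothetical stand-in and not the documented output format of the Conversion Service.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output. The element names are assumptions for illustration only.
SAMPLE_XML = """
<document>
  <page number="1">
    <block>
      <paragraph>
        <line><word>Payment</word><word>is</word><word>due</word></line>
        <line><word>within</word><word>30</word><word>days.</word></line>
      </paragraph>
      <paragraph>
        <line><word>Thank</word><word>you.</word></line>
      </paragraph>
    </block>
  </page>
</document>
"""


def paragraphs_as_text(xml_text):
    """Join the words of each paragraph in document order into one text chunk."""
    root = ET.fromstring(xml_text)
    chunks = []
    for page in root.iter("page"):
        for index, paragraph in enumerate(page.iter("paragraph")):
            # Skip empty word elements so they do not leave double spaces.
            tokens = [w.text.strip() for w in paragraph.iter("word")
                      if w.text and w.text.strip()]
            chunks.append({
                "page": int(page.get("number")),
                "paragraph": index,
                "text": " ".join(tokens),
            })
    return chunks


chunks = paragraphs_as_text(SAMPLE_XML)
for chunk in chunks:
    print(chunk["page"], chunk["paragraph"], chunk["text"])
```

Keeping the page and paragraph index next to each chunk means a retrieved passage can be traced back to its location in the source document.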

Giving our customers more to work with

In the past, we’ve only exposed the final PDF to our customers so they could search in it, archive it, or do whatever else they needed to do with it. But with LLMs and RAG becoming such a huge part of many businesses’ workflows, we’re now giving users the ability to output the XML file itself. That way, they can access and use the structured data in its raw form, giving them more flexibility and business options.

Use the Conversion Service to output XML files

Get a demo or request a license for the Conversion Service in your Pdftools Portal.
