Turn those files into pixels!

Everyone seems to be talking about digitalization these days. As soon as a pixel is involved, someone is sure to utter that ever-so-popular buzzword, usually followed by numerous names of companies, digital philosophers and data gurus. But what exactly do companies gain in the case of “capturing”?

Past and (still) present

Long before the advent of digitalization, mail deliveries were opened, sorted, date-stamped and delivered to the responsible person’s internal mailbox. Later, that person viewed the document, assigned the invoice to an account, for example, and entered it into the accounting system. The document was then forwarded to the person in charge of the particular cost center for signing.

After approval, the invoice came back, the payment was executed and the paperwork was then filed in a thick folder, which later ended up in an archive in the basement. Filing cabinets and folders in basement rooms were filled to the ceiling, and when a document was needed, the search through the shelves, files, and indexes began. Paper is time- and cost-intensive. And woe to anyone who had misplaced a file or a document! Many companies still rely on these old practices, but are currently intensifying their efforts towards digitalization.

Digitalization during capturing?

Digitalization can offer companies added value in a number of processes. Take the buzzword “capturing”, for example. Every day we are flooded with documents, data, emails and other information. We receive business-relevant as well as less relevant information via various channels and in diverse formats. Digital processes can monitor, scan, sort, categorize, and guide this data through the relevant processing steps. The goal of digitalization is to help facilitate the subsequent processing steps in the best possible way. Yet machines and algorithms cannot necessarily handle every scenario.

To set the whole process in motion, incoming documents must first be scanned, and an OCR engine must be used to recognize and extract the necessary information and link it with the document. This is the only way, for example, to read invoices that arrive from suppliers in a wide variety of formats and to support electronic payment processing.

Optical character recognition (OCR) not only speeds up the process from document receipt through to account assignment and payment. The stored text also makes it possible to find documents at any time with a full-text search. The hard copy of the invoice is no longer needed once it has been scanned, all necessary data has been stored, the payment has been approved and completed, and everything has been filed in the electronic archive. The archive does more than save the document: it also protects it by means of access controls, timestamps and electronic signatures.

The very places where we do our work are changing too. This transformation calls for flexibility. More and more employees are working from home but still need access to documents. Only through digitalization can the required information be made available to them. As a side note, this is also a dream come true for every clean desk inspector who expects to find workers’ desks less and less cluttered. Document processes also lead back out of the company. Here too, digitalized documents are in most cases significantly faster, more secure, and more efficient than paper.

But there’s more! During audits, it is no longer necessary to drag crates of files from the basement to a meeting room for the auditor. Instead, the auditor is simply granted the corresponding access rights and reviews the documents remotely in the digital archive. This saves a lot of time and stress for both parties. If remote access is undesirable, it is also possible to save the documents on a secure USB stick and hand it over to the auditor.

The digital butterfly net

The input channels for documents are as diverse as the range of formats and qualities. What can be captured is captured, in both the analogue and the digital world. Even in the age of electronic invoices, online shops and e-commerce, paper has not yet become obsolete: documents such as invoices, tax forms, service reports and contracts are still frequently issued on paper and sent and received by old-fashioned mail.

To achieve the desired electronic form for the document process, paper documents are scanned. Good image quality is essential for ensuring low recognition error rates. Achieving good image quality at a high compression rate requires levels of processing power that local multifunction printers do not usually possess. Traditionally, a scanner generates a TIFF or JPEG image for each page. Some devices are able to directly create PDF files, and newer ones produce files that comply with the PDF/A standard. However, the quality of these files varies greatly depending on the provider and the conversion software.

A central scan server is recommended to solve this quality problem. It performs all tasks that the local scanning stations can delegate. The server receives the scanned image files, analyzes the documents and generates a PDF/A document with all text and image information compressed to the right size. The document can also be tagged with a time stamp or a digital signature. The captured information is then available in a standardized, high-quality format that is suitable for human readers and for automated processing with IT applications.

In contrast, some producers of digital documents deliver PDF files of inadequate quality, which leads to unexpected problems and costs in document processing. PDF conversion is not just about packaging an image in a PDF “envelope”. A PDF document can also contain recognized text and barcode data, embedded metadata and digital signatures.

Rapid technological development means that systems deployed today swiftly become obsolete and need to be replaced. The content of archived documents, however, remains relevant. It should therefore be possible to migrate it to the new systems in an unaltered form. The prerequisite for lossless migration is a stable document format that outlives the life cycle of the systems.

From dot to bit

By enhancing the scanning function with text recognition (OCR), you have the option of enriching all documents with additional information, which has a positive effect on the entire document process. Documents ranging from scanned paper and electronic documents to emails and attachments can be assigned to departments or processes based on text attributes, elements and document structure.

The system recognizes whether the documents are normal correspondence, orders, delivery notes or invoices, for example, and forwards them according to the distribution key. This helps greatly with downstream processes and decision cycles.
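
The classification and forwarding step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the keyword lists, the department names and the functions classify and route are all assumptions made up for this example, and a production system would rely on far more robust classification than keyword counting.

```python
import re

# Hypothetical keyword lists per document type (assumption for illustration).
KEYWORDS = {
    "invoice": ("invoice", "amount due", "vat"),
    "order": ("purchase order", "order number"),
    "delivery_note": ("delivery note", "shipped"),
}

# Hypothetical distribution key: document type -> responsible department.
DISTRIBUTION_KEY = {
    "invoice": "accounting",
    "order": "sales",
    "delivery_note": "warehouse",
    "correspondence": "front_office",
}


def classify(ocr_text: str) -> str:
    """Return the document type whose keywords occur most often in the text."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(len(re.findall(re.escape(word), text)) for word in words)
        for doc_type, words in KEYWORDS.items()
    }
    best_type, best_score = max(scores.items(), key=lambda item: item[1])
    # Anything without a keyword hit is treated as normal correspondence.
    return best_type if best_score > 0 else "correspondence"


def route(ocr_text: str) -> str:
    """Look up the target department for a document in the distribution key."""
    return DISTRIBUTION_KEY[classify(ocr_text)]
```

Calling route on the OCR text of a supplier invoice would return the accounting department in this sketch, while a letter without any of the keywords falls back to normal correspondence.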

In other words, good text recognition with post-processing supports more than capturing as such. It also makes it possible to structure classification and document distribution more efficiently, and it improves the verifiability and findability of the documents for compliance purposes. Once the documents have been enriched with this information, OCR enables keyword searches, reporting functions and the compiling of topic-specific dossiers, and thus generally speeds up response times and processing.
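
To show why recognized text makes documents findable, here is a minimal sketch of a keyword search over OCR output using a simple inverted index. The function names build_index and search and the document ids are assumptions for this example; real archives typically use a dedicated search engine instead.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every lower-cased word to the ids of the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR results keyed by document id.
archive = {
    "inv-1": "Invoice from Acme AG for toner",
    "dn-7": "Delivery note Acme AG",
    "inv-2": "Invoice from Beta GmbH",
}
index = build_index(archive)
hits = search(index, "acme invoice")  # only documents containing both words
```

The same index could feed reporting functions or the compilation of dossiers, since every word points directly to the documents in which it occurs.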

Even though the sharing of electronic documents in business processes has already become a matter of course, document quality is often neglected. Whether documents are scanned from paper or arrive in electronic form, a quality control system for incoming documents has become indispensable. But before you set up what we call a “quality gate”, you have to be clear about the definition of quality. A large number of companies have a document process that works with one main format. This simplifies the entire process, reduces the jungle of different formats in the channels, and therefore makes processes much easier to control.

When PDF is used as the standard format, two types of quality matter: inherent quality and dedicated quality. Inherent quality means conformity with the file format specification (ISO 32000). Checking it is necessary because not every document labeled as PDF is actually a proper PDF, at least not in the form companies want for further processing all the way to the archive.
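
As a rough illustration of the point that a file labeled as PDF is not necessarily a proper PDF, the following sketch performs a very shallow plausibility check. It only looks for the "%PDF-" header at the start of the file and the "%%EOF" marker near the end. The function name looks_like_pdf and the 1024-byte window are assumptions for this example, and passing this check is far from proving conformity with ISO 32000, which requires a real validator.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Shallow plausibility check only; this is not ISO 32000 validation."""
    # A PDF file starts with a header such as "%PDF-1.7".
    has_header = data.startswith(b"%PDF-")
    # The end-of-file marker is expected near the end of the file.
    has_eof = b"%%EOF" in data[-1024:]
    return has_header and has_eof
```

A renamed image file or a truncated download would already fail this test, which is why even a simple check at the entrance of the process can catch some defective files early.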

Dedicated quality focuses on use cases such as scanning, document sharing, printing and archiving. It covers, for example, whether fonts and colors are optimized and, if required, whether the files comply with the PDF/A standard for long-term digital archiving. After capturing, a “quality gate” handles validation, repair, optimization and digital signing and thus protects the documents.
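
The order of the steps in such a quality gate can be sketched as follows. This is a deliberately simplified model under stated assumptions: the Document class, its fields and the quality_gate function are hypothetical, and the flags stand in for the results of real validation, repair, optimization and signing components.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    conforms: bool            # placeholder for a real validator's verdict
    repairable: bool = True   # placeholder: can defects be fixed automatically?
    signed: bool = False
    log: list = field(default_factory=list)


def quality_gate(doc: Document) -> bool:
    """Run validation, repair, optimization and signing; True means accepted."""
    doc.log.append("validate")
    if not doc.conforms:
        if not doc.repairable:
            doc.log.append("reject")
            return False
        doc.log.append("repair")
        doc.conforms = True      # stand-in for an actual repair step
    doc.log.append("optimize")   # stand-in, for example for compression
    doc.signed = True            # stand-in for applying a digital signature
    doc.log.append("sign")
    return True
```

In this sketch a document is only optimized and signed once it conforms, and a document that cannot be repaired is rejected before it reaches the archive.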

No matter how simple or extensive it may be, it is important that a document process be well-planned and tested before it is deployed, in terms of how external and internal documents are handled as well as their checking, enrichment, distribution and archival. This improves response times, conserves resources at all levels and is significantly more transparent for traceability and compliance.


A central scan server with optical character recognition (OCR) and a downstream quality gate are the keys to a smooth digital document capturing process. Text recognition makes it possible to channel the information collected from a document, to design document processes efficiently and to ensure good searchability. The quality gate ensures consistent quality and document security. Automation minimizes sources of error, improves document quality and thus saves time and money – from the moment documents are received through to their long-term archival.

In a world of bits and bytes we shouldn’t forget the human factor, either, whether with regard to user-friendliness or data protection. It should be noted that stellar software by itself can neither guarantee acceptance by users nor transparency in terms of legal regulations and guidelines. This requires a good concept and understanding. Traceability isn’t just relevant for digital files in an archive; it’s also important for people.

Capturing is not magic, but rather the streamlined capture, verification and enrichment of information. But you can achieve a lot with capturing – so transform those files into pixels and say hello to digitalization.

Written by Nadine Schuppisser

Nadine Schuppisser is Head of Marketing and Communication of PDF Tools AG. Pdftools offers PDF & PDF/A components and solutions for digitization, the document process and legally compliant long-term archiving. PDF Tools AG is the Swiss representative on the ISO committee for PDF/A and PDF and a founding member of the PDF Association.
