Original source (German): DOK.magazin
Author: Dr. Hans Bärfuss
Clever. Convenient. Conform. Scan server for digital long-term archiving
These days, most companies no longer wish to waste time and money on filling windowless rooms with paper files or assigning staff to search for paper documents. More and more managers are realizing the benefits of digital archiving, and not just in large enterprises. But how should it be implemented? Some say leave it to the manufacturers of the scanning devices, while others believe it takes more than that.
Is a scanner enough?
In most companies, scanning paper documents has become a routine task when handling incoming mail. Multifunction printers (MFP) or high-performance scanners are used for this purpose, depending on the type and volume of paper documents received.
In most cases, the scanned images are created as black-and-white TIFF files, the typical format used by fax machines. In special cases, such as when scanning checks or ID photos, the file is generated in color. However, color scanning is usually avoided, since the created TIFF files are either too large or the JPEG compression visibly reduces the image quality.
But good image quality is an important requirement for a good text recognition rate. Achieving good image quality at a high compression rate requires a level of processing power that local multifunction printers do not usually possess. Separate scanning software can offer considerable advantages in this respect.
The scanner alone usually cannot perform all of the individual processing stages, such as text recognition, compression, PDF/A generation and digital signing. One reason is that metadata is often added afterwards at an indexing station. If the document has already been signed by then, this later step breaks the seal of the digital signature and makes it worthless. Here, too, separate software can offer a decisive advantage, because it can apply the signature only after the metadata has been embedded.
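Why a later indexing step invalidates an existing signature can be illustrated with a minimal sketch. It only models the digest comparison at the core of signature validation. A real signature additionally uses asymmetric keys and certificates, and the document bytes and the appended field below are hypothetical placeholders.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest that a signature would be computed over."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the bytes of a finished, signed document.
signed_document = b"%PDF-1.4 ... scanned invoice ..."
digest_at_signing = digest(signed_document)

# An indexing station later adds metadata to the same file.
modified_document = signed_document + b"\nInvoiceNumber=4711"

# Validation recomputes the digest and compares it with the signed one.
signature_still_valid = digest(modified_document) == digest_at_signing
print(signature_still_valid)  # False
```

Any change to the signed bytes, however small, produces a different digest, which is why the signature belongs at the end of the processing chain.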
PDF/A – a universal document standard
The PDF/A standard is now widely established in incoming mail applications. Compared with conventional document formats such as TIFF and JPEG, it offers the following important advantages:
- Standardized format: PDF/A is suitable for storing both scanned and digitally created documents.
- High compression rate: The PDF/A standard supports more modern and powerful compression processes and thus achieves small file sizes for color images.
- Text recognition: The created PDF/A documents can be made searchable by embedding text from an OCR engine.
- Embedded metadata: So that the document and its associated metadata form an inseparable whole, PDF/A embeds the metadata in the file. To store it, PDF/A uses the Extensible Metadata Platform (XMP) format, which, like PDF/A, is defined in its own ISO standard.
- Digital signature: To ensure the integrity and authenticity of the created documents, a digital signature can be applied to the PDF/A document in accordance with the PAdES standard. The digital signature is a kind of electronic signature that can serve the same purpose as a handwritten signature, provided that the corresponding legal requirements (national signature laws) are met.
In principle, TIFF documents can offer all these advantages, but only through proprietary extensions, since the TIFF standard itself does not provide solutions for them.
|Criterion|TIFF|PDF/A|
|Data consistency|Proprietary tags for metadata|+|
|Authenticity/Integrity|With detached signatures|+|
|Required storage space|Black/White: + / Color: -|+|
|Searchability|Proprietary tags for OCR text|+|
Advantages of PDF/A over TIFF
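The embedded XMP metadata described in this section can be sketched as follows. This is a minimal illustration using only the Python standard library, not the output of any particular product. The keyword string and the creator tool name are hypothetical sample values, and the function name `build_xmp_packet` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Namespaces of the XMP wrapper, RDF, and three standard XMP schemas.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix: str, name: str) -> str:
    """Build a namespace-qualified tag name for ElementTree."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp_packet(part: int, conformance: str,
                     keywords: str, creator_tool: str) -> str:
    """Assemble a small XMP packet with PDF/A identification properties."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})
    ET.SubElement(desc, q("pdfaid", "part")).text = str(part)
    ET.SubElement(desc, q("pdfaid", "conformance")).text = conformance
    ET.SubElement(desc, q("pdf", "Keywords")).text = keywords
    ET.SubElement(desc, q("xmp", "CreatorTool")).text = creator_tool
    body = ET.tostring(xmpmeta, encoding="unicode")
    # The xpacket processing instructions mark the packet boundaries.
    return ('<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
            + body + '\n<?xpacket end="w"?>')


# Hypothetical sample values for a scanned invoice.
packet = build_xmp_packet(part=2, conformance="B",
                          keywords="invoice; 4711",
                          creator_tool="Hypothetical Scan Server")
print(packet)
```

The sketch uses only properties from predefined schemas. Company-specific index fields would need their own namespace, which PDF/A requires to be declared through an extension schema description; that part is omitted here.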
What can a central scan server do?
A scan server is a central service within a company that converts locally scanned files and their associated index files into the standardized PDF/A file format. To this end, the service performs all tasks that the local scanning station can delegate to it. The solution is particularly suitable for processing stages that require no user interaction or whose CPU-intensive functions (OCR, compression) would otherwise impair the efficiency of the local scanning station.
The main functions of this service are:
Text and barcode recognition: Scanned image files need to be made searchable. The service can use the 3-Heights® OCR Service to identify text in an image file and embed it into the converted file in a way that makes it searchable. The recognized barcodes can be used in several ways: in the text search, as part of the embedded metadata, or to control the processing (name of the output file, page separation, etc.) within the service.
Compression: Color images are broken down into several elements. Using the Mixed Raster Content (MRC) process, they are then heavily compressed with no visible losses.
Embedding of metadata: The PDF/A standard requires metadata to be embedded in the document in the form of XMP packets. The service provides this function.
PDF/A creation: The service creates single or multi-page output documents in accordance with the ISO 19005 series of standards. All published parts of the standard (PDF/A-1, PDF/A-2 and PDF/A-3) are supported.
Digital signature: The signature can be advanced or qualified, suitable for long-term storage or simply for exchange. It may also contain a time stamp. Alternatively, a time stamp alone can be applied in place of the personal signature. The service can use a cryptographic infrastructure (USB token, HSM) via a standard interface (PKCS#11) to create a digital signature.
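The idea behind the MRC compression listed above can be illustrated with a toy decomposition. This is a heavily simplified sketch, not how a production engine works: the page is a small grid of RGB tuples, segmentation is a plain brightness threshold, and the foreground is reduced to a single color. All function names and the threshold value are assumptions made for this example.

```python
def luminance(rgb):
    """Approximate perceived brightness of an RGB pixel (0 to 255)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def average(colors):
    """Average a non-empty list of RGB tuples, channel by channel."""
    n = len(colors)
    return tuple(sum(c[i] for c in colors) // n for i in range(3))


def split_into_mrc_layers(page, threshold=128):
    """Split a page into a binary mask, a foreground color and a background."""
    dark = [p for row in page for p in row if luminance(p) < threshold]
    light = [p for row in page for p in row if luminance(p) >= threshold]
    fg_color = average(dark) if dark else (0, 0, 0)
    bg_fill = average(light) if light else (255, 255, 255)
    # The mask marks text-like (dark) pixels with 1.
    mask = [[1 if luminance(p) < threshold else 0 for p in row] for row in page]
    # Under the mask, the background is filled with the paper color,
    # which leaves a smooth image that compresses well.
    background = [[bg_fill if m else p for p, m in zip(row, mrow)]
                  for row, mrow in zip(page, mask)]
    return mask, fg_color, background


def recompose(mask, fg_color, background):
    """Rebuild the page: the mask selects foreground or background."""
    return [[fg_color if m else p for p, m in zip(row, mrow)]
            for row, mrow in zip(background, mask)]


PAPER, INK = (250, 250, 245), (10, 10, 10)
page = [[PAPER, INK, PAPER],
        [PAPER, INK, INK]]

mask, fg_color, background = split_into_mrc_layers(page)
print(mask)  # [[0, 1, 0], [0, 1, 1]]
print(recompose(mask, fg_color, background) == page)  # True
```

In practice, each layer is then typically compressed with a codec suited to its content, for example a bitonal codec for the mask and a lossy image codec for the background, which is where the size reduction comes from.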
A typical sequence would look as follows:
Image acquisition: The scan operator starts the scanning process and creates a color TIFF file. The scanner usually stores files in a file folder. Facsimile documents are received by the fax machine and stored in a special folder as black-and-white TIFF files.
Manual classification: Depending on the process, the scan operator can perform a manual classification. The operator either controls the scanner so that the images are stored in different folders (e.g. invoices and delivery notes), adds special barcode sheets that help to separate and classify the documents, or creates a minimum set of index files.
Segmentation and compression: The color image of each page is broken down into its different elements, such as background, text and pictures. The size of the individual elements is then reduced by subjecting them to compression processes specifically designed for that type of element. This MRC process makes it possible to achieve competitive file sizes for color documents.
Text and barcode recognition: The images are processed further by an OCR engine. The image is cleaned up and deskewed, and text and barcode recognition then takes place.
Metadata: Information from the manual classification, recognized barcodes and other sources is assembled into standardized XMP metadata.
PDF/A creation: The prepared images of each page, the recognized text and the metadata are assembled into a PDF/A document together with the ICC color profile of the scanner. Optionally, an index file containing only the metadata can be created.
Digital signature: If desired, the PDF/A files can be digitally signed in order to preserve the traceability and revision integrity of the documents.
Validation: As an additional option, the PDF/A conformance of the created document and the validity of the digital signature can be verified. The service also offers a range of additional functions.
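The use of barcode sheets to separate and classify documents in the sequence above can be sketched as follows. The data model and the `SEP:` prefix convention are hypothetical and chosen only for illustration; an actual service may encode separator sheets quite differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical convention: separator sheets carry a barcode "SEP:<class>".
SEPARATOR_PREFIX = "SEP:"


@dataclass
class Page:
    image_name: str
    barcode: Optional[str] = None  # value recognized on the page, if any


@dataclass
class Document:
    doc_class: str
    pages: List[Page] = field(default_factory=list)


def split_by_separator_sheets(pages, default_class="unclassified"):
    """Group a scanned page stream into documents at each separator sheet."""
    documents = []
    current = None
    for page in pages:
        if page.barcode and page.barcode.startswith(SEPARATOR_PREFIX):
            # A separator sheet starts a new document and is itself dropped.
            current = Document(doc_class=page.barcode[len(SEPARATOR_PREFIX):])
            documents.append(current)
            continue
        if current is None:
            # Pages before the first separator sheet go to a default class.
            current = Document(doc_class=default_class)
            documents.append(current)
        current.pages.append(page)
    # Ignore separators that were not followed by any content page.
    return [d for d in documents if d.pages]


batch = [
    Page("p1.tif", "SEP:invoice"),
    Page("p2.tif"),
    Page("p3.tif"),
    Page("p4.tif", "SEP:delivery-note"),
    Page("p5.tif", "4711"),  # an ordinary barcode, kept with its page
]
for doc in split_by_separator_sheets(batch):
    print(doc.doc_class, [p.image_name for p in doc.pages])
```

Barcodes that do not follow the separator convention stay attached to their page, so they remain available for the text search or the embedded metadata.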
Where can the service be used?
A scan server is used for the following purposes:
- Paper capture: Electronic archiving of paper documents received as incoming mail within a company.
- Facsimile capture: Electronic archiving of all fax transactions between the company and its business partners.
- Archive migration: Migration of paper archives to an electronic archive with the standardized PDF/A format.
- Web/mobile capture: Use of the central service in client/server applications via a web service.
- Enterprise application integration: Use of the central service for PDF/A document creation via a programming interface (API) from specialist applications that create TIFF or JPEG files.
Although developing a digital long-term archive has become essential in large companies, it also benefits small and medium-sized businesses by cutting their storage and personnel costs.
A well-designed scanning process can help remove the need for inconvenient paper from the earliest stage in the chain (i.e. incoming mail). At the same time, the validity of the electronic documents is ensured through digital signatures. With a central scan service, businesses can implement a powerful, flexible and future-proof archiving process.
PDF/A, a standardized file format for long-term archiving, is not only suitable for scanned documents but also serves as a universal format for digitally created documents.