Automate your data extraction
Java | .NET | C/C++ | COM | Command Line
3-Heights™ PDF Extract is a component for extracting the contents and properties of PDF documents.
The component extracts information quickly and efficiently, whether it comes from the document content or from the document properties. The results can be stored in a database, for instance, used for evaluations and statistics, or used to secure internal corporate knowledge.
Extract information such as text, images and metadata from PDF
Integrate into data analysis, indexing and output management systems
Extract information to index documents and find them more easily
Different teams in the accounting department can now process PDFs from countries around the world in their original languages. The extracted data feeds further processes, e.g. paying invoices, financial audits and reporting. As a result, Quickcomm benefits from reduced labor costs, more accurate data and fast turnaround.
GoArchive now enables editors working for Oppolis customers to search archives quickly and easily and to find and import PDF documents. Furthermore, the program guarantees that the PDF documents stored in the regional newspaper's archive are available to external users, despite the archive's large volume.
Content from PDF files such as forms or scanned incoming invoices is extracted and processed for characterization or indexing.
PDF documents are used to store important information relating to products, customer data and corporate knowledge. Metadata such as the document’s creator, creation date or modification date is a further integral part of a PDF document. PDF documents are often used as “containers” to transfer text, images, videos and other data to other processes independently of the platforms in use.
PDF documents are restructured in preparation for use by other target groups. The process extracts processing information such as barcodes, address information or page formats, which can then be used to control printing and packaging lines or sorting processes.
Text, or parts of it, is extracted and stored separately as metadata. This allows document indexing to be extended as required.
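To illustrate how extracted text can feed an index, here is a minimal sketch in Python. It does not use the 3-Heights™ API: the `extracted` dictionary stands in for text that an extraction step has already produced, and the function names `tokenize`, `build_index` and `search` are hypothetical helpers introduced only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the sorted list of document names that contain it.

    `documents` maps a document name to its already-extracted text.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


def search(index, query):
    """Return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Hypothetical extracted text; in practice it would come from the extraction step.
extracted = {
    "invoice_001.pdf": "Invoice 4711 for ACME Corp, due 2024-03-01",
    "report_q1.pdf": "Quarterly report for ACME Corp",
}
index = build_index(extracted)
```

With this index, `search(index, "acme corp")` returns both file names, while `search(index, "invoice")` returns only `invoice_001.pdf`. A production system would more likely hand the extracted text to a dedicated search engine; the sketch only shows the data flow.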
When I extract images from a PDF file, I sometimes get a bunch of slices of the original image, mostly consisting of a few image rows per slice or, in extreme cases, just one row. Why is that, and how can I get the entire image in one piece?
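Slicing like this usually originates in how the PDF was produced: the image may have been stored as several separate strip images rather than as one. Independent of any particular extraction API, one generic workaround is to join the slices again after extraction. The sketch below is a minimal Python illustration under stated assumptions: each slice is represented as a hypothetical `(top_row, rows)` pair, where `top_row` is the index of the slice's first pixel row in the full image and `rows` is a list of pixel rows. How these values are obtained depends on the extraction tool and is not shown.

```python
def stitch_strips(strips):
    """Join horizontal image strips into one image (a list of pixel rows).

    Each strip is a (top_row, rows) pair. This representation is an
    assumption made for the example, not the format of any specific tool.
    """
    # Sort the strips from top to bottom by their first row index.
    ordered = sorted(strips, key=lambda strip: strip[0])
    image = []
    width = None
    for top_row, rows in ordered:
        # Each strip must start exactly where the previous one ended.
        if top_row != len(image):
            raise ValueError(f"gap or overlap at row {top_row}")
        for row in rows:
            if width is None:
                width = len(row)
            elif len(row) != width:
                raise ValueError("strips have different widths")
            image.append(list(row))
    return image


# Two strips delivered out of order: rows 0-1 and row 2 of a 2-pixel-wide image.
strips = [(2, [[5, 6]]), (0, [[1, 2], [3, 4]])]
full_image = stitch_strips(strips)
```

The function raises a `ValueError` instead of guessing when slices leave a gap, overlap or differ in width, because silently joining mismatched slices would produce a corrupted image.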