3-Heights® PDF Extract - Parse and extract content, resources and metadata in C#, Java or Batch

3-Heights® PDF Extract is a highly efficient and versatile parser and extractor for PDF content and metadata. It forms the technical foundation of many solutions, from basic PDF to text conversion to complex solutions in business intelligence, big data and reporting. It converts binary data (PDF) precisely and thoroughly into structured information, e.g. Unicode text, images and metadata. The product provides page-wise extraction via the command line, or more complex operations through its API, e.g. with C#, Visual Basic, Java or C/C++.

Extract Information from PDF

Extract information such as text, images and metadata from PDF

Easy Integration

Integrate into data analysis, indexing and output management systems

Intuitive Indexing

Extract information to index documents and find them more easily

Improved “InsureSign” solution by using PDF Tools AG products

InsureSign is the easiest way to get a document signed by the insured instantly, regardless of their location. The provider Advance Management Company aimed to reduce the overall size of the application and intended to increase the speed of their solution.

Bayer CropScience relies on the ISO long-term archiving format PDF/A

More than 20,000 documents that are required by public authorities for regulatory reasons are created yearly within Bayer CropScience. The wide product spectrum and the unique functionality convinced Bayer CropScience to select the products from PDF Tools AG.

PDF Extract - Features

  • Extract text:
    • Word by word with configurable word boundary detection
    • Retrieve text attributes such as position, font and font size
    • Automatically apply correct character decoding and produce Unicode output
    • Extract raw character codes
  • Extract graphics objects (paths):
    • As strings that contain PDF graphics operators
    • Convert extracted paths to images
  • Extract and store images:
    • Retrieve image attributes such as compression format, position and transparency masks
    • Extract and store transparency masks
    • Extract and store alternate images
  • Extract PDF document-level information:
    • Page count
    • PDF version
    • Page labels
    • Creation and modification date
    • Document information such as title, author, subjects, and more
    • Outlines (bookmarks) including destinations
  • Extract page information:
    • Media box, crop box, trim box, bleed box and art box
    • Page rotation
    • Annotations
  • Extract and store embedded font files
  • Retrieve detailed font information
  • Retrieve optional content group (OCG) information and visibility (layers)
  • Retrieve detailed graphic state information for each extracted page content object
  • Extract raw PDF objects
  • Extract document parts for PDF/X or PDF 2.0
  • Retrieve detailed color space information including lookup tables for indexed color spaces
  • Extract and store embedded files
  • Specify a password to decrypt PDF files
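To make "configurable word boundary detection" more concrete, here is a minimal sketch of one common approach: glyphs on a text line are grouped into words whenever the horizontal gap between them exceeds a configurable threshold. This is an illustration only and does not use the 3-Heights® API; the `Glyph` type, the `group_words` function and the `gap_factor` parameter are hypothetical names chosen for this example, and the product's actual algorithm may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its horizontal placement."""
    char: str
    x: float      # left edge, in points
    width: float  # advance width, in points


def group_words(glyphs: List[Glyph], gap_factor: float = 0.3) -> List[str]:
    """Group the glyphs of one text line into words.

    A new word starts when the gap to the previous glyph is larger than
    gap_factor times the previous glyph's width, or at an explicit space.
    """
    words: List[str] = []
    current: List[str] = []
    prev = None
    for g in sorted(glyphs, key=lambda item: item.x):
        if g.char.isspace():
            # An explicit space always ends the current word.
            if current:
                words.append("".join(current))
                current = []
            prev = None
            continue
        if prev is not None and g.x - (prev.x + prev.width) > gap_factor * prev.width:
            # The gap is wide enough to count as a word boundary.
            words.append("".join(current))
            current = []
        current.append(g.char)
        prev = g
    if current:
        words.append("".join(current))
    return words


# Example line: "Hi" followed by "PDF" after a 4 pt gap, with no space glyph.
line = [
    Glyph("H", 0, 5), Glyph("i", 5, 5),
    Glyph("P", 14, 5), Glyph("D", 19, 5), Glyph("F", 24, 5),
]
```

With the default threshold, `group_words(line)` splits the line into two words. A larger value such as `gap_factor=1.0` treats the same gap as ordinary letter spacing and returns a single word, which is why a configurable threshold is useful for documents with unusual spacing.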

Conformance

  • ISO 32000-1 (PDF 1.7)
  • ISO 32000-2 (PDF 2.0)
  • ISO 19005-1 (PDF/A-1)
  • ISO 19005-2 (PDF/A-2)
  • ISO 19005-3 (PDF/A-3)

Supported formats

Input formats

  • PDF 1.0 to 1.7
  • PDF 2.0
  • PDF/A-1, PDF/A-2, PDF/A-3

Areas of use - extract information out of your PDF documents

Incoming mail and document processing

Content from PDF files, such as forms or scanned incoming invoices, is extracted and processed for classification or indexing.

PDF documents are used to store important information relating to products, customer data and corporate knowledge. Metadata such as the document’s creator, creation date or modification date is a further integral part of a PDF document. PDF documents are often used as “containers” to transfer text, images, videos and other data to other processes, independently of the platforms in use.

Outgoing mail

PDF documents are restructured in preparation for use by other target groups. The process reads processing information such as barcodes, address information or page formats, which can then be used to control printing and packaging lines or sorting processes.

Archiving

Texts or their components are extracted for separate storage in metadata. This allows document indexing to be extended as required.

Other areas of use

  • Convert PDF documents into text documents
  • Extract information such as addresses, invoice data and report data from documents for process control purposes
  • Extract information for document classification and document indexing
  • Process data in forms
  • Extract images for further processing (scans, photos, etc.)
  • Analyze and evaluate the content of PDF documents in mass processing
What can I do about sliced images?

When I extract images from a PDF file, I sometimes get a set of slices of the original image, mostly consisting of a few image rows per slice or, in extreme cases, just one row. Why is that, and how can I get the entire image in one piece?
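One possible way to handle such slices after extraction is to put them back together using their position attributes. The following is a minimal sketch under stated assumptions: each slice has already been decoded into rows of pixel values, and its vertical position has been converted to a row index counted from the top of the full image. The `Slice` type and the `stitch` function are hypothetical and are not part of the 3-Heights® API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slice:
    """Hypothetical decoded image slice."""
    top: int                # assumed: index of the slice's first row in the full image
    rows: List[List[int]]   # decoded pixel rows, one gray value per pixel


def stitch(slices: List[Slice]) -> List[List[int]]:
    """Reassemble slices into one image, ordered from top to bottom.

    Raises ValueError if the slices leave a gap, overlap, or differ in width.
    """
    image: List[List[int]] = []
    for s in sorted(slices, key=lambda item: item.top):
        if s.top != len(image):
            raise ValueError(f"slice at row {s.top} does not continue the image")
        for row in s.rows:
            if image and len(row) != len(image[0]):
                raise ValueError("slices have different widths")
            image.append(list(row))
    return image


# Two slices in arbitrary order: a one-row slice and a two-row slice.
parts = [Slice(top=2, rows=[[5, 6]]), Slice(top=0, rows=[[1, 2], [3, 4]])]
full_image = stitch(parts)
```

The check on `top` makes the sketch fail loudly when a slice is missing instead of silently producing a shorter image. Slices that also differ in horizontal position, scaling or color space would need additional handling that is not shown here.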