3-Heights™ PDF Extract – extracting content, resources and metadata

3‑Heights™ PDF Extract is a component for reading the contents and properties of PDF documents.

PDF documents store important information about products, customers and corporate knowledge. Metadata such as the document’s creator, creation date and modification date is also an integral part of a PDF document. PDF documents are often used as “containers” for transferring text, images, videos and other data to other processes, independently of the platforms in use.

This component extracts both document content and document properties quickly and efficiently. The results can be stored in a database, for instance, used for evaluations and statistics, or used to secure internal corporate knowledge.


Properties and benefits

Text extracted with the 3‑Heights™ PDF Extract tool can be used for indexing documents or in search engines, for instance. More generally, the component extracts data and resources from a PDF document for further processing. The extracted information is highly detailed and can be transferred to document management systems (DMS) in various forms.

Performance characteristics

  • Extract text by the character, word or page (including invisible text)
  • Search for keywords and retrieve their position
  • Extract images (including alternative images)
  • Retrieve form fields
  • Extract document information such as version, encryption, linearization and metadata
  • List fonts and color spaces
  • Extract page information and page descriptions (graphic objects, position and other attributes)
  • Extract bookmarks
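
To illustrate what "search for keywords and retrieve their position" can look like once text has been extracted, here is a minimal sketch. The `Word` record and the `find_keyword` function are hypothetical names invented for this example; they are not part of the 3-Heights™ API, and the coordinates are made-up sample data.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical record for one extracted word (not the product's API)."""
    text: str
    page: int     # 1-based page number
    bbox: tuple   # (x0, y0, x1, y1) in points, origin at the bottom left


def find_keyword(words, keyword):
    """Return (page, bbox) for every word equal to keyword, ignoring case."""
    needle = keyword.casefold()
    return [(w.page, w.bbox) for w in words if w.text.casefold() == needle]


# Made-up sample data standing in for the output of a text extraction step.
words = [
    Word("Invoice", 1, (72.0, 700.0, 118.5, 712.0)),
    Word("total", 1, (72.0, 680.0, 98.2, 692.0)),
    Word("INVOICE", 2, (72.0, 700.0, 121.0, 712.0)),
]
hits = find_keyword(words, "invoice")
for page, bbox in hits:
    print(f"page {page}: {bbox}")
```

The returned bounding boxes could then be used for highlighting or for indexing the hit position.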

Information is extracted on the basis of the object type. The component supports the following objects and their respective properties:

Document

  • Query document attributes, including:

    • Author
    • Title
    • Subject
    • Keywords
    • Application
    • PDF Producer
    • Creation date
    • Modification date

  • Is the document encrypted?
  • Is the document linearized (optimized for fast web view)?
  • PDF version, e.g. 1.4, 1.7
  • Read the document from the file or from memory
  • Query the number of pages
  • Properties of bookmarks
  • Query page labels (e.g. “vii”, “IX”)
  • Properties of resources (image, color space, fonts)
  • Destinations
  • List and extract embedded files
  • List and set optional content groups (layers)
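
Creation and modification dates are stored in PDF documents as strings of the form `D:YYYYMMDDHHmmSSOHH'mm'`, where everything after the year is optional. The following sketch shows one way such a string could be converted after extraction. The function name `parse_pdf_date` is an assumption made for this example and is not a 3-Heights™ call; the sketch also assumes the `D:` prefix is present.

```python
import re
from datetime import datetime, timedelta, timezone

# Year is mandatory; month, day, hour, minute, second and offset are optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    m = _PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    defaults = (0, 1, 1, 0, 0, 0)  # missing month and day default to 1
    year, month, day, hour, minute, second = (
        int(g) if g is not None else d for g, d in zip(m.groups()[:6], defaults)
    )
    sign, tz_hours, tz_minutes = m.group(7), m.group(8), m.group(9)
    if sign is None:
        tz = None  # no offset given: relationship to UTC is unknown
    else:
        offset = timedelta(hours=int(tz_hours or 0), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20160315142530+01'00'")
print(created.isoformat())
```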


Page

  • Page size (Media Box) and other dimensions such as visible size (Crop Box) or other dimensions of relevance to printing (Trim Box, Art Box, Bleed Box)
  • Device colorant
  • Viewing rotation
  • Page content
  • Annotations
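
As a rough illustration of how the page boxes and the viewing rotation combine, the sketch below computes the visible page size in millimetres. It assumes boxes are given as `(x0, y0, x1, y1)` in points, that a missing Crop Box falls back to the Media Box, and that the rotation is a multiple of 90 degrees. `visible_size_mm` is a hypothetical helper, not a function of the component.

```python
POINTS_TO_MM = 25.4 / 72  # one point is 1/72 inch


def visible_size_mm(media_box, crop_box=None, rotate=0):
    """Return (width, height) in mm of the visible page after rotation."""
    if rotate % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    x0, y0, x1, y1 = crop_box if crop_box is not None else media_box
    width, height = abs(x1 - x0), abs(y1 - y0)
    if rotate % 180 == 90:  # 90 or 270 degrees swaps width and height
        width, height = height, width
    return (round(width * POINTS_TO_MM, 1), round(height * POINTS_TO_MM, 1))


a4 = (0.0, 0.0, 595.276, 841.89)  # A4 expressed in points
print(visible_size_mm(a4))             # portrait
print(visible_size_mm(a4, rotate=90))  # viewed in landscape
```

For simplicity the sketch does not clip the Crop Box to the Media Box, which a full implementation would need to do.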

Page content

  • Jump to next object (object, image, text, path) and query its attributes (image, text)
  • Query current graphics state


Text

  • Extract text as Unicode by the character, word or page
  • Supports texts that do not contain space characters
  • Coordinates (X, Y)
  • Bounding box
  • Font size in points
  • Length in points
  • Length in characters
  • Rotation
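
PDF text often contains no space characters, so word boundaries have to be inferred from character positions. The following sketch shows one common heuristic: start a new word whenever the horizontal gap between two characters exceeds a fraction of the font size. The data layout, the function name `group_into_words` and the `gap_ratio` threshold are assumptions chosen for illustration; the product's actual word detection is not documented here.

```python
def group_into_words(chars, font_size, gap_ratio=0.25):
    """Group characters on one baseline into words.

    chars: list of (text, x_left, width) tuples sorted by x_left, in points.
    A gap wider than gap_ratio * font_size starts a new word.
    """
    threshold = gap_ratio * font_size
    words, current, prev_right = [], "", None
    for text, x_left, width in chars:
        if prev_right is not None and x_left - prev_right > threshold:
            words.append(current)
            current = ""
        current += text
        prev_right = x_left + width
    if current:
        words.append(current)
    return words


# Made-up glyph positions: "PDF" set tightly, then a 4 pt gap before "ok".
chars = [("P", 0, 6), ("D", 6, 6), ("F", 12, 6), ("o", 22, 5), ("k", 27, 5)]
print(group_into_words(chars, font_size=10))
```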

Font type

  • Glyph widths: all widths, average width, default width for missing glyphs and maximum width
  • Base name
  • Height of uppercase and lowercase letters
  • Available character names of the font subset
  • Encoding
  • Flags
  • Bounding box
  • Datastream of a font program
  • Type (e.g. TrueType, Type1)
  • Tilt angle of italic fonts
  • Recommended distance between base line and following line (leading)
  • Vertical and horizontal width of glyph stems

Color space

  • Base color space
  • Colorant
  • Components per pixel
  • The highest index value for indexed color spaces
  • Color space (colorant, indexed, monochrome)
  • Lookup table
  • Name
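
For indexed color spaces, the highest index value and the lookup table together determine the actual color of a pixel. The sketch below shows how an index could be resolved against a lookup table laid out as consecutive component bytes in the base color space. `lookup_indexed_color` is a hypothetical helper, and clamping out-of-range indices to the valid range is an assumption of this example.

```python
def lookup_indexed_color(index, lookup, components, hival):
    """Resolve an index into a tuple of base color space components.

    lookup:     bytes holding (hival + 1) * components values
    components: components per color in the base color space (3 for RGB)
    hival:      highest valid index
    """
    if len(lookup) < (hival + 1) * components:
        raise ValueError("lookup table is shorter than hival requires")
    index = max(0, min(index, hival))  # assumption: clamp to the valid range
    start = index * components
    return tuple(lookup[start:start + components])


# Three-entry RGB palette: black, red, white.
palette = bytes([0, 0, 0, 255, 0, 0, 255, 255, 255])
print(lookup_indexed_color(1, palette, components=3, hival=2))
```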


Image

  • Height and width in pixels
  • Resolution (DPI)
  • Number of bits per channel
  • Color space (bi‑tonal, monochrome, color)
  • Convert to RGB
  • Alternative image
  • Extract image (from file or memory) and set orientation
  • Set the compression of extracted and stored TIFF images (Flate, CCITT G3, G3‑2D, G4, JPEG, LZW, none)
  • Mask, transparency mask
  • Alternative image and whether it should be used as standard for printing
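
The resolution of an image on a page follows from its pixel dimensions and the size at which it is placed. As a hedged sketch of that arithmetic, assuming the placed size is known in points (1/72 inch), `effective_dpi` below is a hypothetical helper and not part of the component:

```python
def effective_dpi(width_px, height_px, placed_width_pt, placed_height_pt):
    """Return (horizontal, vertical) resolution in DPI of a placed image."""
    if placed_width_pt <= 0 or placed_height_pt <= 0:
        raise ValueError("placed size must be positive")
    dpi_x = width_px * 72.0 / placed_width_pt
    dpi_y = height_px * 72.0 / placed_height_pt
    return (round(dpi_x, 1), round(dpi_y, 1))


# A 1240 x 1754 pixel scan placed at 297.6 x 420.96 points.
print(effective_dpi(1240, 1754, 297.6, 420.96))
```

The horizontal and vertical values differ when an image is stretched unevenly, which is why both are returned.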

Graphics state

  • Blend mode
  • Spacing between characters and words
  • Current transformation matrix
  • Elements and phase of a dash pattern
  • Color space of fill and line colors
  • Fill and line colors as RGB or CMYK value
  • Overprint settings for fill and line colors
  • Alpha constant of fill and line colors
  • Flatness tolerance
  • Font and font size (see Font)
  • Horizontal scaling
  • Text style (leading, line spacing)
  • Line style (line cap, line join, miter limit) and line width
  • Name of the rendering intent
  • Smoothness tolerance
  • Soft mask
  • Text knockout
  • Text rendering mode
  • Text rise (shift up or down relative to the baseline)
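
Fill and line colors can be reported as RGB or CMYK values. For readers who need to convert between the two after extraction, the sketch below uses the simple device conversion described in the PDF specification (red = 1 − min(1, cyan + black), and likewise for green and blue). This ignores ICC profiles, so it is only an approximation, and it is not claimed to be the conversion the component performs. `cmyk_to_rgb` is a hypothetical helper.

```python
def cmyk_to_rgb(c, m, y, k):
    """Convert CMYK components in the range 0..1 to 8-bit RGB (no ICC profile)."""
    for value in (c, m, y, k):
        if not 0.0 <= value <= 1.0:
            raise ValueError("CMYK components must lie between 0 and 1")

    def channel(ink):
        # Simple device conversion: subtract ink plus black from full intensity.
        return round(255 * (1.0 - min(1.0, ink + k)))

    return (channel(c), channel(m), channel(y))


print(cmyk_to_rgb(0.0, 1.0, 1.0, 0.0))  # pure magenta plus yellow
```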

Transformation matrix

  • Transformation values
  • Orientation
  • Rotation
  • Scaling in X and Y direction
  • Positioning in X and Y direction
  • Skewing in X and Y direction
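
A PDF transformation matrix is written as six numbers `[a b c d e f]`, mapping a point (x, y) to (a·x + c·y + e, b·x + d·y + f). The sketch below shows one possible way to derive rotation, scaling, positioning and skew from those values. The decomposition convention (rotation taken from the transformed X axis, skew measured on the Y axis) and the name `decompose_matrix` are choices made for this example, not the component's documented behavior.

```python
import math


def decompose_matrix(a, b, c, d, e, f):
    """Split a PDF matrix [a b c d e f] into readable components."""
    scale_x = math.hypot(a, b)  # length of the transformed X unit vector
    if scale_x == 0:
        raise ValueError("degenerate matrix: X axis collapses to a point")
    det = a * d - b * c
    return {
        "rotation_deg": math.degrees(math.atan2(b, a)),
        "scale_x": scale_x,
        "scale_y": det / scale_x,  # negative when the matrix mirrors
        "skew_deg": math.degrees(math.atan2(a * c + b * d, det)),
        "translate": (e, f),
    }


# Scale by 2 and 3, then move to (10, 20).
print(decompose_matrix(2, 0, 0, 3, 10, 20))
```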


Annotation

  • Annotation type
  • Color
  • Contents
  • Date
  • Destination
  • Flags
  • MarkUp annotation
  • Name
  • Position (rectangle)
  • Subject
  • TextLabel
  • URL
  • Corner points if it is a polygon


Bookmarks

  • Quantity
  • Destination
  • Title


  • Position (coordinates for bottom left and top right)
  • Type
  • Page number

Incoming mail and document processing

Content from PDF files, such as forms or scanned incoming invoices, is extracted and processed for classification or indexing.

Outgoing mail

PDF documents are restructured for use by other target groups. The process reads control information such as barcodes, address information or page formats, which can then be used to control printing and packaging lines or sorting processes.


Texts or their components are extracted for separate storage in metadata. This allows document indexing to be extended as required.

Other areas of use

  • Convert PDF documents into text documents
  • Extract information such as addresses, invoice data and report data from documents for process control purposes
  • Extract information for document classification and document indexing
  • Process data in forms
  • Extract images for further processing (scans, photos, etc.)
  • Analyze and evaluate the content of PDF documents in mass processing
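
As an example of using extracted text for process control, the sketch below pulls an invoice number and date out of plain text with regular expressions. The field labels, the patterns and the sample text are all assumptions invented for illustration; real documents will need patterns adapted to their own layout.

```python
import re

# Hypothetical patterns for one invoice layout; adjust to the real documents.
INVOICE_NO = re.compile(r"Invoice\s+(?:No\.?|Number)\s*:?\s*([A-Z0-9\-]+)", re.IGNORECASE)
INVOICE_DATE = re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def extract_invoice_fields(text):
    """Return the invoice number and date found in text, or None if absent."""
    number = INVOICE_NO.search(text)
    date = INVOICE_DATE.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        "invoice_date": date.group(1) if date else None,
    }


# Made-up text standing in for the output of a PDF text extraction step.
sample = "ACME Ltd.\nInvoice No.: INV-2016-0042\nDate: 2016-03-15\nTotal: 199.00 EUR"
fields = extract_invoice_fields(sample)
print(fields)
```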

Input formats

  • PDF


  • Standards: ISO 32000 (PDF 1.7)

Operating Systems

  • Windows Vista, 7, 8, 8.1, 10 - 32 & 64 bit
  • Windows Server 2008, 2008 R2, 2012, 2012 R2, 2016 - 32 & 64 bit
  • HP‑UX 11i incl. ia64 (Itanium) - 64 bit
  • IBM AIX 6.1 - 64 bit
  • macOS 10.4 - 32 & 64 bit
  • Linux 2.4 & 2.6 - 32 & 64 bit
  • Oracle Solaris 10, SPARC & Intel
  • HP-UX 11, PA-RISC2.0 - 32 bit


  • API: C, Java, .NET, COM

Programming languages

All program libraries are written in efficient, thread‑safe C++. The API offers bindings for the following programming languages:

  • C and C++ via native C
  • C#, VB .NET, J# via .NET
  • Java via JNI
  • MS Visual Basic, Borland Delphi, C++ and MS Office products such as Access via COM

Product Variants

  • Shell tool (command line)
  • API (programming interface)
