3-Heights™ PDF Extract

Overview

3-Heights™ PDF Extract is a component for reading out the contents and properties of PDF documents.

PDF documents are used to store important information relating to products, customer data and corporate knowledge. Metadata such as the document's creator, date of creation or date of modification is also an integral part of a PDF document. PDF documents are often used as "containers" for transferring text, images, videos and other data to other processes, independently of the platforms in use.

This component extracts information quickly and efficiently, whether from document content or from document properties. The results can be stored in a database, for instance, or used for evaluations and statistics or to secure internal corporate knowledge.


Functions

Information is extracted on the basis of the object type. The component supports the following objects and their respective properties:

Document

  • Query document attributes, including:
    • Author
    • Title
    • Subject
    • Keywords
    • Application
    • PDF Producer
    • Creation date
    • Modification date
  • Is the document encrypted?
  • Is the document linearized (optimized for fast web view)?
  • PDF version, e.g. 1.4, 1.7
  • Read the document from the file or from memory
  • Query the number of pages
  • Properties of bookmarks
  • Query page labels (e.g. "vii", "IX")
  • Properties of resources (image, color space, fonts)
  • Destinations
  • List and extract embedded files
  • List and set optional content groups (layers)
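
To illustrate what some of these document-level queries refer to, the following Python sketch inspects the raw bytes of a PDF held in memory. It is not the 3-Heights™ API; the function name `inspect_pdf_bytes` is hypothetical, and the checks are rough heuristics (a real parser would read the trailer and the catalog, whose /Version entry can override the header).

```python
import re

def inspect_pdf_bytes(data: bytes) -> dict:
    """Heuristic look at a PDF held in memory (illustration only, not a parser)."""
    # The file header has the form %PDF-M.N and normally sits at the very start.
    header = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    version = None
    if header:
        version = f"{header.group(1).decode()}.{header.group(2).decode()}"
    return {
        "version": version,
        # A linearized file carries a /Linearized key in its first object.
        "linearized": b"/Linearized" in data[:1024],
        # An encrypted file references an /Encrypt dictionary from its trailer.
        "encrypted": b"/Encrypt" in data,
    }

# Tiny hand-made fragment, just enough to exercise the three checks.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
    b"trailer\n<< /Root 2 0 R >>\n"
)
info = inspect_pdf_bytes(sample)
print(info)
```

Because the function works on a byte string, the same check applies whether the document was read from a file or is already in memory.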

Page

  • Page size (Media Box) and other dimensions such as visible size (Crop Box) or other dimensions of relevance to printing (Trim Box, Art Box, Bleed Box)
  • Device colorant
  • Viewing rotation
  • Page content
  • Annotations
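
Page boxes are rectangles expressed in PDF units. As a small sketch of how such values might be used downstream, the following Python code converts a box to millimetres and accounts for the viewing rotation. The `Box` type and `box_size_mm` function are hypothetical helpers, and the code assumes the default unit of 1/72 inch.

```python
from typing import NamedTuple

POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

class Box(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float

def box_size_mm(box: Box, rotation: int = 0) -> tuple:
    """Return (width, height) in millimetres as seen after the viewing rotation."""
    width_pt = abs(box.right - box.left)
    height_pt = abs(box.top - box.bottom)
    # A rotation of 90 or 270 degrees swaps the visible width and height.
    if rotation % 180 == 90:
        width_pt, height_pt = height_pt, width_pt
    to_mm = MM_PER_INCH / POINTS_PER_INCH
    return round(width_pt * to_mm, 1), round(height_pt * to_mm, 1)

a4_media_box = Box(0, 0, 595.276, 841.89)
print(box_size_mm(a4_media_box))               # (210.0, 297.0)
print(box_size_mm(a4_media_box, rotation=90))  # (297.0, 210.0)
```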

Page content

  • Jump to next object (object, image, text, path) and query its attributes (image, text)
  • Query current graphics state

Text

  • Extract text as Unicode by the character, word or page
  • Supports texts that do not contain space characters
  • Coordinates (X, Y)
  • Bounding box
  • Font size in points
  • Length in points
  • Length in characters
  • Rotation
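
Text that contains no space characters can still be split into words by looking at the horizontal gaps between glyphs. The following Python sketch shows one simple way to do this from character positions. It is an assumption about how such grouping can work, not a description of the component's internal method; the `Glyph` type, `group_into_words` and the `gap_factor` threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float          # left edge in points
    width: float      # advance width in points
    font_size: float  # font size in points

def group_into_words(glyphs, gap_factor=0.25):
    """Start a new word whenever the gap to the previous glyph is large enough."""
    words = []
    current = ""
    previous = None
    for glyph in glyphs:
        if previous is not None:
            gap = glyph.x - (previous.x + previous.width)
            # A gap wider than a fraction of the font size is treated as a space.
            if gap > gap_factor * previous.font_size:
                words.append(current)
                current = ""
        current += glyph.char
        previous = glyph
    if current:
        words.append(current)
    return words

line = [
    Glyph("P", 72.0, 6.0, 10.0),
    Glyph("D", 78.0, 7.0, 10.0),
    Glyph("F", 85.0, 6.0, 10.0),
    Glyph("o", 95.0, 5.5, 10.0),   # 4 pt gap after "F"
    Glyph("k", 100.5, 5.0, 10.0),
]
print(group_into_words(line))  # ['PDF', 'ok']
```

The threshold is a tuning parameter: tightly tracked or justified text may need a different value.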

Font type

  • Widths of all glyphs, plus the average width, the default width for missing glyphs and the maximum width
  • Base name
  • Height of uppercase and lowercase letters
  • Available character names of the font subset
  • Encoding
  • Flags
  • Bounding box
  • Datastream of a font program
  • Type (e.g. TrueType, Type1)
  • Tilt angle of italic fonts
  • Recommended distance between base line and following line (leading)
  • Vertical and horizontal width of glyph stems

Color space

  • Base color space
  • Colorant
  • Components per pixel
  • The highest index value for indexed color spaces
  • Color space (colorant, indexed, monochrome)
  • Lookup table
  • Name

Image

  • Height and width in pixels
  • Resolution (DPI)
  • Number of bits per channel
  • Color space (bi-tonal, monochrome, color)
  • Convert to RGB
  • Alternative image
  • Extract image (from file or memory) and set orientation
  • Set the compression of extracted and stored TIFF image (Flate, CCITT G3, G3-2D, G4, JPEG, LZW, none)
  • Mask, transparency mask
  • Alternative image and whether it should be used as standard for printing
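
The resolution of an image on a page follows from its size in pixels and the size at which it is placed. As a worked example, the following Python sketch computes the effective DPI, assuming the placed size is known in points (1/72 inch). The function name `effective_dpi` is hypothetical and not part of the product API.

```python
def effective_dpi(pixel_width, pixel_height, placed_width_pt, placed_height_pt):
    """Return (horizontal, vertical) resolution in dots per inch."""
    horizontal = pixel_width / (placed_width_pt / 72.0)
    vertical = pixel_height / (placed_height_pt / 72.0)
    return round(horizontal, 1), round(vertical, 1)

# A 2480 x 3508 pixel scan placed at 595.2 x 841.92 points (about A4).
scan_dpi = effective_dpi(2480, 3508, 595.2, 841.92)
print(scan_dpi)  # (300.0, 300.0)
```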

Graphics state

  • Blend mode
  • Spacing between characters and words
  • Current transformation matrix
  • Elements and phase of a dash pattern
  • Color space of fill and line colors
  • Fill and line colors as RGB or CMYK value
  • Overprint settings for fill and line colors
  • Alpha constant of fill and line colors
  • Flatness tolerance
  • Font and font size (see Font)
  • Horizontal scaling
  • Text style (leading, line spacing)
  • Line style (line cap, line join, miter limit) and line width
  • Name of the rendering intent
  • Smoothness tolerance
  • Soft mask
  • Text knockout
  • Text rendering mode
  • Text rise (shift up or down relative to the baseline)
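
Fill and line colors queried as CMYK often have to be shown or stored as RGB. The following Python sketch uses a simple additive device conversion without color management; it is an illustration only, and says nothing about the conversion the component itself performs. The function name `cmyk_to_rgb` is hypothetical, and accurate results require ICC profiles.

```python
def cmyk_to_rgb(c, m, y, k):
    """Convert CMYK components in the range 0..1 to 8-bit RGB (no color management)."""
    # Each RGB channel is reduced by its complementary ink plus black.
    return tuple(round(255 * (1.0 - min(1.0, ink + k))) for ink in (c, m, y))

print(cmyk_to_rgb(0, 1, 1, 0))        # (255, 0, 0)
print(cmyk_to_rgb(0, 0, 0, 1))        # (0, 0, 0)
print(cmyk_to_rgb(0.5, 0, 0, 0.25))   # (64, 191, 191)
```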

Transformation matrix

  • Transformation values
  • Orientation
  • Rotation
  • Scaling in X and Y direction
  • Positioning in X and Y direction
  • Skewing in X and Y direction
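
A PDF transformation matrix is written as six numbers [a b c d e f]. To show how the properties listed above relate to these values, the following Python sketch derives rotation, scaling, skew and position from them. It is one common decomposition, offered as an assumption rather than the component's own definition; `decompose_matrix` is a hypothetical name.

```python
import math

def decompose_matrix(a, b, c, d, e, f):
    """Split a PDF matrix [a b c d e f] into rotation, scale, skew and translation."""
    # The unit X vector maps to (a, b); its length is the X scaling.
    scale_x = math.hypot(a, b)
    if scale_x == 0:
        raise ValueError("degenerate matrix: the X axis maps to a point")
    determinant = a * d - b * c
    return {
        "rotation_deg": math.degrees(math.atan2(b, a)),
        "scale_x": scale_x,
        # A negative value indicates mirroring.
        "scale_y": determinant / scale_x,
        # Angle by which the mapped Y axis deviates from being perpendicular.
        "skew_deg": math.degrees(math.atan2(a * c + b * d, abs(determinant))),
        "translate": (e, f),
    }

# Scale by 2, rotate by 90 degrees, then move to (100, 50).
parts = decompose_matrix(0, 2, -2, 0, 100, 50)
print(parts)
```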

Annotation

  • Annotation type
  • Color
  • Contents
  • Date
  • Destination
  • Flags
  • MarkUp annotation
  • Name
  • Position (rectangle)
  • Subject
  • TextLabel
  • URL
  • Corner points if it is a polygon
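
Dates in PDF documents and annotations are stored as strings of the form D:YYYYMMDDHHmmSSOHH'mm'. As a sketch of how a calling application might turn such a value into a native date, the following Python function parses it. The name `parse_pdf_date` is hypothetical; whether the component returns the raw string or an already parsed date is not stated here.

```python
import re
from datetime import datetime, timedelta, timezone

# Every field after the year is optional; the offset sign is +, - or Z.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Z+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)

def parse_pdf_date(value: str) -> datetime:
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None  # no offset given: the relation to UTC is unknown
    if sign is not None:
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )

created = parse_pdf_date("D:20140312093015+01'00'")
print(created.isoformat())  # 2014-03-12T09:30:15+01:00
```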

Bookmarks

  • Quantity
  • Destination
  • Title

Destination

  • Position (coordinates for bottom left and top right)
  • Type
  • Page number

Benefits

Properties and benefits

Texts extracted using the 3-Heights™ PDF Extract Tool can be used for indexing documents or in search engines, for instance. The component is generally used to extract data and resources from a PDF document for further processing. Highly detailed information is available for the purpose, which can also be transferred to document management systems (DMS) in various forms.

Performance characteristics

  • Extract text by the character, word or page (including invisible text)
  • Search for keywords and retrieve their position
  • Extract images (including alternative images)
  • Retrieve form fields
  • Extract document information such as version, encryption, linearization and metadata
  • List fonts and color spaces
  • Extract page information and page descriptions (graphic objects, position and other attributes)
  • Extract bookmarks
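
Once words have been extracted together with their positions, a keyword search reduces to a scan over that list. The following Python sketch illustrates the idea on hand-made data. The `Word` type and `find_keyword` function are hypothetical and only stand in for whatever structure a calling application builds from the extracted text.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    page: int
    box: tuple  # (left, bottom, right, top) in points

def find_keyword(words, keyword):
    """Return (page, bounding box) for every word containing the keyword."""
    needle = keyword.casefold()
    return [(w.page, w.box) for w in words if needle in w.text.casefold()]

extracted = [
    Word("Invoice", 1, (72.0, 700.0, 118.5, 712.0)),
    Word("number", 1, (122.0, 700.0, 165.0, 712.0)),
    Word("INVOICE", 2, (72.0, 650.0, 125.0, 662.0)),
]
hits = find_keyword(extracted, "invoice")
print(hits)
```

The returned positions can then be used, for example, to read a value printed next to the keyword.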

Areas of use

Incoming mail and document processing

Content from PDF files, such as forms or scanned incoming invoices, is extracted and processed for classification or indexing.

Outgoing mail

PDF documents are restructured in preparation for use by other target groups. The process reads out processing information such as barcodes, address information or page formats that can then be used for controlling printing and packaging lines or sorting processes.

Archiving

Texts or parts of them are extracted and stored separately as metadata. This allows document indexing to be extended as required.

Other areas of use

  • Convert PDF documents into text documents
  • Extract information such as addresses, invoice data and report data from documents for process control purposes
  • Extract information for document classification and document indexing
  • Process data in forms
  • Extract images for further processing (scans, photos, etc.)
  • Analyze and evaluate the content of PDF documents in mass processing

Technical details

Input formats

  • PDF

Compliance

  • Standards: ISO 32000 (PDF 1.7)

Operating systems

  • Windows 2000, XP, Vista, 7, 8
  • Windows Server 2003, 2008, 2008 R2, 2012 – 32 and 64 Bit
  • HP-UX – 32 Bit and Itanium
  • IBM AIX – 32 and 64 Bit
  • Linux (SuSE and Red Hat on Intel)
  • Mac OS X
  • Sun Solaris

Interfaces

  • API: C, Java, .NET, COM

Programming languages

All program libraries are written in efficient and thread-safe C++. The API offers the following bindings to programming languages:

  • C#, VB .NET, J# via .NET
  • Java via JNI
  • MS Visual Basic, Borland Delphi, C++ and MS Office products such as Access via COM
  • C and C++ via native C

Product variants

  • Shell tool (command line)
  • API (programming interface)

Product-specific success stories

Advance Management Company, United States

Bayer CropScience AG, Germany

Metafile, USA

Oppolis, UK

Quickcomm, USA

SSL, United States

StratOz, Germany

Copyright 2001-2014 PDF Tools AG