3-Heights™ PDF Extract
Overview
3-Heights™ PDF Extract is a component for reading out the contents and properties of PDF documents.
PDF documents are used to store important information relating to products, customer data and corporate knowledge. Metadata such as the document's creator, creation date or modification date is also an integral part of a PDF document. PDF documents are often used as "containers" to transfer text, images, videos and other data to other processes, independently of the platforms in use.
This component extracts information quickly and efficiently, whether from the document content or from the document properties. The results can be stored in a database, for instance, or used for evaluations and statistics or to secure internal corporate knowledge.
Functions
Information is extracted on the basis of the object type. The component supports the following objects and their respective properties:
Document
- Query document attributes, including:
- Author
- Title
- Subject
- Keywords
- Application
- PDF Producer
- Creation date
- Modification date
- Is the document encrypted?
- Is the document linearized (optimized for fast web view)?
- PDF version, e.g. 1.4, 1.7
- Read the document from the file or from memory
- Query the number of pages
- Properties of bookmarks
- Query page labels (e.g. "vii", "IX")
- Properties of resources (image, color space, fonts)
- Destinations
- List and extract embedded files
- List and set optional content groups (layers)
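The creation and modification dates in a PDF document information dictionary are stored as strings of the form D:YYYYMMDDHHmmSSOHH'mm' (ISO 32000). Whether the component returns this raw string or an already parsed value is not stated here. Assuming the raw string is available, the following Python sketch shows how it could be converted for storage in a database. The function name parse_pdf_date is illustrative and not part of the 3-Heights API.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings follow the pattern D:YYYYMMDDHHmmSSOHH'mm' (ISO 32000).
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()

    # The time zone part is optional; without it the result is a naive datetime.
    tz = None
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-offset if sign == "-" else offset)

    # Missing month and day default to 1, missing time fields to 0.
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )
```

For example, parse_pdf_date("D:20110315143000+01'00'") yields 15 March 2011, 14:30:00 with a UTC offset of one hour.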
Page
- Page size (Media Box) and other dimensions such as visible size (Crop Box) or other dimensions of relevance to printing (Trim Box, Art Box, Bleed Box)
- Device colorant
- Viewing rotation
- Page content
- Annotations
Page Content
- Jump to next object (object, image, text, path) and query its attributes (image, text)
- Query current graphics state
Text
- Extract text as Unicode by the character, word or page
- Supports texts that do not contain space characters
- Coordinates (X, Y)
- Bounding box
- Font size in points
- Length in points
- Length in characters
- Rotation
Font Type
- Glyph widths: all widths, average width, default (missing) width and maximum width
- Base name
- Height of uppercase and lowercase letters
- Available character names of the font subset
- Encoding
- Flags
- Bounding box
- Datastream of a font program
- Type (e.g. TrueType, Type1)
- Tilt angle of italic fonts
- Recommended distance between base line and following line (leading)
- Vertical and horizontal width of glyph stems
Color Space
- Base color space
- Colorant
- Components per pixel
- The highest index value for indexed color spaces
- Color space (colorant, indexed, monochrome)
- Lookup table
- Name
Image
- Height and width in pixels
- Resolution (DPI)
- Number of bits per channel
- Color space (bi-tonal, monochrome, color)
- Convert to RGB
- Alternative image
- Extract image (from file or memory) and set orientation
- Set the compression of extracted and stored TIFF images (Flate, CCITT G3, G3-2D, G4, JPEG, LZW, none)
- Mask, transparency mask
- Alternative image and whether it should be used as standard for printing
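The resolution of an image follows from its size in pixels and the size at which it is placed on the page. The Python sketch below illustrates the arithmetic only. It assumes the placed width and height are known in points (1 point = 1/72 inch), for example from the transformation matrix. The function name effective_dpi is hypothetical and not part of the 3-Heights API.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # PDF user space unit: 1 point = 1/72 inch


def effective_dpi(
    pixel_width: int,
    pixel_height: int,
    placed_width_pt: float,
    placed_height_pt: float,
) -> Tuple[float, float]:
    """Return the (horizontal, vertical) resolution in dots per inch."""
    if placed_width_pt <= 0 or placed_height_pt <= 0:
        raise ValueError("placed size must be positive")
    # Pixels divided by the placed size in inches gives dots per inch.
    horizontal = pixel_width * POINTS_PER_INCH / placed_width_pt
    vertical = pixel_height * POINTS_PER_INCH / placed_height_pt
    return horizontal, vertical
```

A scan of 2480 x 3508 pixels placed at 595.2 x 841.92 points, roughly an A4 page, works out to about 300 DPI in both directions.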
Graphics State
- Blend mode
- Spacing between characters and words
- Current transformation matrix
- Elements and phase of a dash pattern
- Color space of fill and line colors
- Fill and line colors as RGB or CMYK value
- Overprint settings for fill and line colors
- Alpha constant of fill and line colors
- Flatness tolerance
- Font and font size (see Font)
- Horizontal scaling
- Text style (leading, line spacing)
- Line style (line cap, line join, miter limit) and line width
- Name of the rendering intent
- Smoothness tolerance
- Soft mask
- Text knockout
- Text rendering mode
- Text rise (shift up or down relative to the baseline)
Transformation Matrix
- Transformation values
- Orientation
- Rotation
- Scaling in X and Y direction
- Positioning in X and Y direction
- Skewing in X and Y direction
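The attributes listed above can be related to the six values of a PDF transformation matrix [a b c d e f], which maps a point (x, y) to (a·x + c·y + e, b·x + d·y + f). The following Python sketch shows one way to derive rotation, scaling and positioning from those values. It assumes a matrix without skew or reflection, and the names decompose and Decomposed are illustrative, not part of the 3-Heights API.

```python
import math
from typing import NamedTuple


class Decomposed(NamedTuple):
    rotation_deg: float  # counter-clockwise rotation in degrees
    scale_x: float       # scaling along the X axis
    scale_y: float       # scaling along the Y axis
    translate_x: float   # positioning in X direction
    translate_y: float   # positioning in Y direction


def decompose(a: float, b: float, c: float, d: float,
              e: float, f: float) -> Decomposed:
    """Split a PDF matrix [a b c d e f] into its basic parts.

    Valid only for matrices made of rotation, scaling and translation
    (no skew, no reflection); this is an assumption of the sketch.
    """
    return Decomposed(
        rotation_deg=math.degrees(math.atan2(b, a)),
        scale_x=math.hypot(a, b),
        scale_y=math.hypot(c, d),
        translate_x=e,
        translate_y=f,
    )
```

For example, decompose(0, 2, -2, 0, 10, 20) describes a rotation of 90 degrees, a scaling factor of 2 in both directions and a position of (10, 20).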
Annotation
- Annotation type
- Color
- Contents
- Date
- Destination
- Flags
- MarkUp annotation
- Name
- Position (rectangle)
- Subject
- TextLabel
- URL
- Corner points if it is a polygon
Bookmarks
- Quantity
- Destination
- Title
Destination
- Position (coordinates for bottom left and top right)
- Type
- Page number
Benefits
Properties and Benefits
Texts extracted using the 3-Heights™ PDF Extract Tool can be used for indexing documents or in search engines, for instance. The component is generally used to extract data and resources from a PDF document for further processing. Highly detailed information is available for the purpose, which can also be transferred to document management systems (DMS) in various forms.
Performance Characteristics
- Extract text by the character, word or page (including invisible text)
- Search for keywords and retrieve their position
- Extract images (including alternative images)
- Retrieve form fields
- Extract document information such as version, encryption, linearization and metadata
- List fonts and color spaces
- Extract page information and page descriptions (graphic objects, position and other attributes)
- Extract bookmarks
Areas of Use
Incoming Mail and Document Processing
Content from PDF files such as forms or scanned incoming invoices, for instance, is extracted and processed for characterization or indexing.
Outgoing Mail
PDF documents are restructured in preparation for use by other target groups. The process reads out processing information such as barcodes, address information or page formats that can then be used for controlling printing and packaging lines or sorting processes.
Archiving
Texts or parts of them are extracted and stored separately as metadata. This allows document indexing to be extended as required.
Other Areas of Use
- Convert PDF documents into text documents
- Extract information such as addresses, invoice data and report data from documents for process control purposes
- Extract information for document classification and document indexing
- Process data in forms
- Extract images for further processing (scans, photos, etc.)
- Analyze and evaluate the content of PDF documents in mass processing
Technical Details
Input Formats
Compliance
- Standards: ISO 32000 (PDF 1.7)
Operating Systems
- Windows 2000, XP, Vista, 7
- Windows Server 2003, 2008, 2008 R2 – 32 and 64 Bit
- HP-UX – 32 Bit and Itanium
- IBM AIX – 32 and 64 Bit
- Linux (SuSE and Red Hat on Intel)
- Mac OS X
- Sun Solaris
Interfaces
Programming Languages
All program libraries are written in efficient and thread-safe C++. The API offers connections to the following programming languages:
- C#, VB .NET, J# via .NET
- Java via JNI
- MS Visual Basic, Borland Delphi, C++ and MS Office products such as Access via COM
- C and C++ via native C
Product Variants
- Shell tool (command line)
- API (programming interface)
Documentation / FAQ
- Product Flyer
- Manual: API, Shell
- Samples (API)
- FAQ: API, Shell
We are here to help
Easy ways to get the answers you need.
- Contact via email
- Via phone: +41 43 411 44 51, 08:00-17:00 CET (UTC+1)