3-Heights™ PDF Extract
- Introduction
- Brief Description
- Functions
- Benefits
- Areas of Use
- Technical Details
- Further Product Details
Introduction
The 3-Heights™ PDF Extract Tool extracts and queries attributes and page content from PDF documents. This includes text, images, graphic objects (including paths), metadata and embedded fonts.
It is also possible to query the properties of objects. Intelligent mechanisms significantly increase extraction rates, for instance when extracting text.
Brief Description
|
Performance Characteristics
- Comprehensive support for extracting objects and object attributes from a PDF
- Intelligent support functions for extracting text
- High throughput
- Can be used in all corporate processes
Areas of Use
- Incoming mail
- Document processing
- Outgoing mail
- Archiving
Functions
- Document
  - Query document attributes
  - Is the document encrypted?
  - PDF version (e.g. 1.4, 1.7)
  - Query the number of pages
- Page
  - Query the page size
  - Query page content
  - Query annotations
- Text
  - Extraction per word or per line
  - Font size in points
- (Query) font
- Image
  - Height and width in pixels
  - Resolution in dots per inch (DPI)
  - Image extraction
  - Set the compression of the stored image (Standard, Flate, CCITT G3, CCITT G3-2D, CCITT G4, JBIG2, JPEG, JPEG2000, LZW, none)
- (Query) graphics state
- (Query) transformation matrix
- (Query) annotations
- (Query) color space
- (Query) bookmarks
- (Query) destination
Sectors
- Public sector
- Telecommunications
- Banking and insurance
- Archives and libraries
- Health sector
- Pharmaceutical industry
Functions
The PDF Extract Tool is used to extract text, images and graphic objects (including paths) from PDF documents. Text is extractable as lines and as individual words. It is also possible to query information such as position, color, font and font size. Intelligent functions such as heuristics, word formation support and character set interpretation make it possible to restore text that is lacking essential information. The tool can also collect significant data such as position, color space and size when extracting images such as TIFF or JPEG. Querying document attributes such as PDF version, creator, author, title, subject and creation date is also possible. The tool also supports reading encrypted PDF files.
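The idea behind word formation support for text without space characters can be pictured with a small sketch. The Python below is not the tool's implementation or API. It assumes a hypothetical input of glyphs on one line, each with a character, a left edge and a width in points, and starts a new word wherever the horizontal gap exceeds a fraction of the font size. The `Glyph` type, the `group_words` function and the 0.25 threshold are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """Hypothetical glyph record: character, left edge and advance width in points."""
    char: str
    x: float
    width: float


def group_words(glyphs: List[Glyph], font_size: float, gap_ratio: float = 0.25) -> List[str]:
    """Split one line of glyphs into words using horizontal gaps only.

    A new word starts when the gap between the right edge of the previous
    glyph and the left edge of the next one exceeds gap_ratio * font_size.
    """
    words: List[str] = []
    current = ""
    prev_right: Optional[float] = None
    for g in sorted(glyphs, key=lambda item: item.x):
        if prev_right is not None and g.x - prev_right > gap_ratio * font_size:
            words.append(current)
            current = ""
        current += g.char
        prev_right = g.x + g.width
    if current:
        words.append(current)
    return words
```

A real extractor would likely also consider font changes, rotation and character spacing from the graphics state; the sketch uses gaps alone to keep the principle visible.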
Functions
- Document
  - Query document attributes, including:
    - Author
    - Title
    - Subject
    - Keywords
    - Creator
    - Producer
    - Creation date
    - Modification date
  - Is the document encrypted?
  - Is the document linearized (optimized for fast web display)?
  - PDF version (e.g. 1.4, 1.7)
  - Read the document from a file or from memory
  - Query the number of pages
  - Select page and query attributes -> see "Page"
  - Jump to next bookmark and query its attributes -> see "Bookmarks"
  - Query page labels (e.g. "vii", "IX")
  - Jump to next resource and query its attributes (images, color spaces, fonts)
  - Destinations -> see "Destination"
- Page
  - Query the page size (Media Box) and other dimensions such as the visible size (Crop Box) or dimensions relevant to printing (Trim Box, Art Box, Bleed Box)
  - DeviceColorant
  - Rotation for display
  - Query page content -> see "Page content"
  - Query annotations -> see "Annotations"
- Page content
  - Jump to next object (object, image, text, path) and query its attributes -> see "Image", "Text"
  - Query current graphics state -> see "Graphics state"
- Text
  - Extraction by word or by line (in the same font) in Unicode
  - Supports text that does not contain space characters
  - Coordinates (X, Y)
  - Bounding box
  - Font size in points
  - Length in points
  - Length in characters
  - Rotation (in radians)
- (Query) font
  - Ascent, descent
  - Widths of all glyphs, plus average, default (missing width) and maximum glyph width
  - BaseName
  - Height of uppercase and lowercase letters
  - Character names defined in the font subset
  - Encoding
  - Flags
  - Bounding box
  - Data stream of a Type1 font program
  - Type (e.g. TrueType, Type1)
  - Italic angle (slant of italic fonts)
  - Recommended distance between base line and following line (leading)
  - Vertical and horizontal width of glyph stems
- Image
  - Height and width in pixels
  - Resolution in dots per inch (DPI)
  - Number of bits per channel
  - Color space (bitonal, monochrome, color) -> see "Color space"
  - Convert to RGB
  - Extract image (from file or memory) and set orientation
  - Set the compression of the stored image (Standard, Flate, CCITT G3, CCITT G3-2D, CCITT G4, JBIG2, JPEG, JPEG2000, LZW, none)
  - (Transparency) mask
  - Alternate image and whether it should be used as the default for printing
- (Query) graphics state
  - AlphaIsShape
  - Blend mode
  - Spacing between characters and words (character spacing, word spacing)
  - Current transformation matrix -> see "Transformation matrix"
  - Elements and phase of a dash pattern
  - Color space of fill and line colors -> see "Color space"
  - Fill and line colors as RGB or CMYK value
  - Overprint settings for fill and line colors
  - Alpha constant of fill and line colors
  - Flatness tolerance
  - Font and font size -> see "Font"
  - Horizontal scaling
  - Text leading (line spacing)
  - Line style (line cap, line join, miter limit) and line width
  - Overprint mode
  - Name of the rendering intent
  - Smoothness tolerance
  - Soft mask -> see "Image"
  - Stroke adjustment
  - Text knockout
  - Text rendering mode
  - Text rise (shift up or down from the base line)
- (Query) transformation matrix
  - Transformation values (a, b, c, d, e, f)
  - Orientation (8 standard values or undefined)
  - Rotation
  - Scaling in X and Y direction
  - Positioning in X and Y direction
  - Skewing in X and Y direction
- (Query) annotations
  - Color
  - Contents
  - Date
  - Destination -> see "Destination"
  - Flags
  - Whether the annotation is a markup annotation
  - Name
  - Position (rectangle)
  - Subject
  - Subtype
  - TextLabel
  - URL
  - Corner points if it is a polygon
- (Query) color space
  - Basic color space
  - Colorant
  - Components per pixel
  - The highest index value for indexed color spaces
  - Color space (colorant, indexed, monochrome)
  - Lookup table
  - Name
- (Query) bookmarks
  - Quantity
  - Destination -> see "Destination"
  - Title
- (Query) destination
  - Position (coordinates for bottom left and top right)
  - Type
  - Page number
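How the six transformation values (a, b, c, d, e, f) relate to rotation, scaling, skewing and positioning can be shown with plain arithmetic. The sketch below does not use the product's API. It follows the PDF convention in which a point maps as x' = a·x + c·y + e and y' = b·x + d·y + f, and it uses one common decomposition among several possible ones. The `decompose` function and the `Decomposed` type are illustrative names.

```python
import math
from typing import NamedTuple


class Decomposed(NamedTuple):
    rotation: float     # radians, angle of the transformed x axis
    scale_x: float      # length of the transformed x unit vector
    scale_y: float      # extent perpendicular to the transformed x axis
    skew: float         # radians, deviation of the y axis from perpendicular
    translate_x: float
    translate_y: float


def decompose(a: float, b: float, c: float, d: float, e: float, f: float) -> Decomposed:
    """Decompose a PDF matrix [a b c d e f] into rotation, scale, skew, translation.

    (a, b) is the image of the x unit vector and (c, d) the image of the
    y unit vector. A negative determinant (mirroring) shows up as a negative
    scale_y; the skew value is not meaningful in that case.
    """
    scale_x = math.hypot(a, b)
    det = a * d - b * c
    scale_y = det / scale_x if scale_x else 0.0
    rotation = math.atan2(b, a)
    skew = math.atan2(a * c + b * d, det)
    return Decomposed(rotation, scale_x, scale_y, skew, e, f)
```

For example, `decompose(0, 2, -2, 0, 10, 20)` describes content that is scaled by 2, rotated by a quarter turn counterclockwise and positioned at (10, 20).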
Formats
Input Formats
- PDF 1.x (e.g. PDF 1.4, PDF 1.5)
Compliance
- Standards: ISO 32000 (PDF 1.7)
Benefits
Properties and Benefits
Texts extracted using the 3-Heights™ PDF Extract Tool can be used for indexing documents or in search engines, for instance. The tool is generally used to extract data and resources from a PDF file for further processing. Highly detailed information is available for the purpose, which can also be transferred to third-party systems in various forms.
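As an illustration of the indexing use case, the following sketch builds a simple inverted index from text that has already been extracted. It is independent of the product: the input tuples of document name, page number and page text are an assumed shape for extracted text, and `build_index` is an illustrative name.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Location = Tuple[str, int]  # (document name, page number)


def build_index(pages: Iterable[Tuple[str, int, str]]) -> Dict[str, Set[Location]]:
    """Map each lowercased word to the set of (document, page) pairs containing it."""
    index: Dict[str, Set[Location]] = defaultdict(set)
    for doc_name, page_no, text in pages:
        for word in re.findall(r"\w+", text.lower()):
            index[word].add((doc_name, page_no))
    return dict(index)
```

A search engine would add stemming, stop words and ranking on top; the sketch only shows how per-page text turns into a lookup structure.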
Areas of Use
Incoming Mail and Document Processing
This process digitizes identifiable content components of PDF documents (e.g. standardized forms or scanned incoming invoices) and prepares them for use in ERP systems or for indexing.
Outgoing Mail
Restructures PDF documents in preparation for use by other target groups. The process reads out processing information such as barcodes, address information or page formats that can then be used for controlling printing and packaging lines or sorting processes.
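To illustrate how a queried page format could feed a sorting or printing decision, the sketch below maps a Media Box given in points (1 pt = 1/72 inch) to a named paper size. It is not part of the product. The `classify_page` function, the size table and the 2 pt tolerance are assumptions for illustration, and the page's display rotation is ignored.

```python
from typing import Optional, Tuple

# Nominal paper sizes in points, portrait orientation (short side, long side).
PAPER_SIZES_PT = {
    "A3": (841.89, 1190.55),
    "A4": (595.28, 841.89),
    "A5": (419.53, 595.28),
    "Letter": (612.0, 792.0),
    "Legal": (612.0, 1008.0),
}


def classify_page(media_box: Tuple[float, float, float, float],
                  tolerance: float = 2.0) -> Optional[Tuple[str, str]]:
    """Return (paper name, orientation) for a Media Box, or None if unknown.

    media_box is (lower-left x, lower-left y, upper-right x, upper-right y).
    """
    llx, lly, urx, ury = media_box
    width, height = abs(urx - llx), abs(ury - lly)
    orientation = "portrait" if height >= width else "landscape"
    short, long_side = sorted((width, height))
    for name, (w, h) in PAPER_SIZES_PT.items():
        if abs(short - w) <= tolerance and abs(long_side - h) <= tolerance:
            return name, orientation
    return None
```

A page with Media Box `(0, 0, 595, 842)` is classified as A4 portrait, while an unusual size yields `None` and could be routed to manual handling.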
Archiving
Texts or their components are extracted for separate storage in metadata. This allows document indexing to be extended as required.
Technical Details
Architecture and Application Options
The program is available in two variants:
- as a command-line tool for batch processing
- as a programming interface for integration into existing applications
Variants and Options
Product Variants
- Shell tool (command line)
- API (programming interface)
Platforms
Operating Systems
- Windows 2000, XP, 2003, Vista, 2008, Windows 7 – 32 and 64 bit
- FreeBSD 4.7 for Intel
- HP-UX 11.0 – 32 bit
- IBM AIX (4.3: 32 bit, 5.1: 64 bit)
- Linux (SuSE and Red Hat on Intel)
- Mac OS X
- Sun Solaris (2.7 and higher)
Interfaces and Languages
Interfaces
- Shell tool: Command line for batch processing
- API: C, Java, .NET, COM
Programming Languages
All program libraries are written in efficient, thread-safe C++. The API offers bindings for the following programming languages:
- C#, VB .NET, J# via .NET
- Java via JNI
- MS Visual Basic, Borland Delphi, C++ and MS Office products such as Access via COM
- C and C++ via native C
Product Code
EXP
Further Product Details
The 3-Heights™ PDF Extract Tool is a programmable component for extracting content and querying information from PDF documents, for instance text, fonts and font information, graphic paths and document attributes.