PDF Extract

All features and capabilities at a glance

Linux
macOS
Windows Client
Windows Server
API
Shell tool (command line)
.NET Core
Java
C#
C/C++

Short facts

Conformance

  • ISO 32000-1 (PDF 1.7)

  • ISO 32000-2 (PDF 2.0)

  • ISO 19005-1 (PDF/A-1)

  • ISO 19005-2 (PDF/A-2)

  • ISO 19005-3 (PDF/A-3)

Supported formats

  • PDF 1.0 to 1.7

  • PDF 2.0

  • PDF/A-1, PDF/A-2, PDF/A-3

Features

Extract text

  • Extract text word by word, with configurable word boundary detection

  • Retrieve text attributes such as position, font and font size

  • Automatically apply correct character decoding and produce Unicode output

  • Extract raw character codes
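To illustrate what configurable word boundary detection can mean in practice, here is a minimal sketch of a gap-based heuristic. It is not the product's algorithm or API: the `Glyph` type, the `group_into_words` function, and the `gap_ratio` parameter are hypothetical names introduced for this example. It assumes the glyphs of one text line are already available in reading order with their horizontal positions.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """Hypothetical record for one extracted character on a text line."""
    text: str         # Unicode text of the glyph
    x: float          # left edge, in PDF points
    width: float      # advance width, in PDF points
    font_size: float  # font size, in PDF points


def group_into_words(glyphs: List[Glyph], gap_ratio: float = 0.25) -> List[str]:
    """Split a line of glyphs into words.

    A new word starts at an explicit whitespace glyph, or when the
    horizontal gap to the previous glyph exceeds gap_ratio * font_size.
    """
    words: List[str] = []
    current: List[str] = []
    prev = None
    for g in glyphs:
        if g.text.isspace():
            # Explicit space: close the current word, if any.
            if current:
                words.append("".join(current))
                current = []
            prev = None
            continue
        if prev is not None and current:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * g.font_size:
                words.append("".join(current))
                current = []
        current.append(g.text)
        prev = g
    if current:
        words.append("".join(current))
    return words
```

Raising `gap_ratio` merges loosely spaced glyphs into one word, while lowering it splits more aggressively. This is the kind of trade-off a word boundary setting typically controls.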

Extract graphics objects (paths)

  • Extract as strings that contain PDF graphics operators

  • Convert extracted paths to images

Extract and store images

  • Retrieve image attributes such as compression format, position, and transparency masks

  • Extract and store transparency masks

  • Extract and store alternate images

Extract PDF document-level information

  • Page count

  • PDF version

  • Page labels

  • Creation and modification date

  • Document information such as title, author, subject, and more

  • Outlines (bookmarks), including destinations
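As background for the PDF version and date items above, the sketch below shows how these two values are encoded in a PDF file. It is independent of the product's API: `header_version` and `parse_pdf_date` are hypothetical helper names. It assumes the header comment `%PDF-M.m` appears near the start of the file and that dates use the `D:YYYYMMDDHHmmSS` form with an optional time zone suffix. A `Version` entry in the document catalog can override the header version, which this sketch does not handle.

```python
import re
from datetime import datetime, timedelta, timezone

# Header comment such as "%PDF-1.7" or "%PDF-2.0".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")

# PDF date string: D:YYYY[MM[DD[HH[mm[SS]]]]] plus optional Z or +HH'mm' offset.
_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def header_version(data: bytes):
    """Return (major, minor) from the file header, or None if not found."""
    m = _HEADER.search(data[:1024])
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_pdf_date(value: str):
    """Parse a PDF date string into a datetime, or return None if malformed."""
    m = _DATE.match(value.strip())
    if not m:
        return None
    year = int(m.group(1))
    # Missing fields fall back to their defaults (month/day 1, time 00:00:00).
    defaults = (1, 1, 0, 0, 0)
    month, day, hour, minute, second = (
        int(g) if g else d for g, d in zip(m.groups()[1:6], defaults)
    )
    sign, off_h, off_m = m.group(7), m.group(8), m.group(9)
    if sign is None:
        tz = None  # no time zone information given
    elif sign in "Zz":
        tz = timezone.utc
    else:
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(-delta if sign == "-" else delta)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        return None  # e.g. month 13 or day 32
```

For example, `parse_pdf_date("D:20240131120000+01'00'")` yields a timezone-aware `datetime` for 31 January 2024, 12:00 at UTC+1.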

Extract page information

  • Media box, crop box, trim box, bleed box, and art box

  • Page rotation

  • Annotations
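The page boxes and rotation listed above are usually combined to obtain the size at which a page is displayed. The following is a minimal sketch of that calculation, not the product's API: `PageGeometry` and its methods are hypothetical names. It assumes boxes are given as normalized `(llx, lly, urx, ury)` tuples in PDF points, that a missing crop box defaults to the media box, and that rotation is a multiple of 90 degrees.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (lower-left x, lower-left y, upper-right x, upper-right y) in PDF points
Box = Tuple[float, float, float, float]


def intersect(a: Box, b: Box) -> Box:
    """Return the overlap of two normalized boxes."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


@dataclass
class PageGeometry:
    """Hypothetical container for extracted page information."""
    media_box: Box
    crop_box: Optional[Box] = None
    rotate: int = 0  # clockwise rotation in degrees, multiple of 90

    def effective_crop_box(self) -> Box:
        # The crop box defaults to the media box and is clipped to it.
        if self.crop_box is None:
            return self.media_box
        return intersect(self.crop_box, self.media_box)

    def displayed_size(self) -> Tuple[float, float]:
        # Width and height swap for rotations of 90 or 270 degrees.
        llx, lly, urx, ury = self.effective_crop_box()
        width, height = urx - llx, ury - lly
        if self.rotate % 180 == 90:
            return (height, width)
        return (width, height)
```

For an A4 page with media box `(0, 0, 595, 842)` and a rotation of 90 degrees, `displayed_size()` returns `(842, 595)`, that is, a landscape page.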

Additional features

  • Extract and store embedded font files

  • Retrieve detailed font information

  • Retrieve optional content group (OCG) information and visibility (layers)

  • Retrieve detailed graphics state information for each extracted page content object

  • Extract raw PDF objects

  • Extract document parts for PDF/X or PDF 2.0

  • Retrieve detailed color space information including lookup tables for indexed color spaces

  • Extract and store embedded files

  • Specify a password to decrypt PDF files