PDF Extract

All features and capabilities at a glance

.NET Core

Java

C#

C/C++

Short facts

Conformance

ISO 32000-1 (PDF 1.7)
ISO 32000-2 (PDF 2.0)
ISO 19005-1 (PDF/A-1)
ISO 19005-2 (PDF/A-2)
ISO 19005-3 (PDF/A-3)

Supported formats

PDF 1.0 to 1.7
PDF 2.0
PDF/A-1, PDF/A-2, PDF/A-3

Features

Extract text

Configure word boundary detection for word-by-word text extraction
Retrieve text attributes such as position, font and font size
Automatically apply correct character decoding and produce Unicode output
Extract raw character codes
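
To illustrate what configurable word boundary detection involves, the following is a minimal sketch that groups positioned glyphs on one text line into words using a gap threshold. It is not the product's algorithm or API: the `Glyph` type, the `groupIntoWords` method, and the `gapFactor` parameter are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class WordBoundarySketch {

    /** A positioned glyph on a text line (hypothetical type, for illustration only). */
    static final class Glyph {
        final String text;
        final double x;        // left edge of the glyph
        final double width;    // advance width of the glyph
        final double fontSize; // font size in the same units as x and width

        Glyph(String text, double x, double width, double fontSize) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.fontSize = fontSize;
        }
    }

    /**
     * Groups glyphs (assumed sorted left to right on a single line) into words.
     * A word ends at a whitespace glyph, or when the horizontal gap to the
     * previous glyph exceeds gapFactor * fontSize.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, double gapFactor) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            boolean wideGap = previous != null
                    && g.x - (previous.x + previous.width) > gapFactor * g.fontSize;
            if ((isSpace || wideGap) && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (!isSpace) {
                current.append(g.text);
            }
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

A smaller `gapFactor` splits words more aggressively, which can help with tightly kerned text; a larger one tolerates wide letter spacing at the cost of merging adjacent words.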

Extract graphics objects (paths)

Extract paths as strings containing PDF graphics operators
Convert extracted paths to images

Extract and store images

Retrieve image attributes such as compression format, position, and transparency masks
Extract and store transparency masks
Extract and store alternate images

Extract PDF document-level information

Page count
PDF version
Page labels
Creation and modification dates
Document information such as title, author, subject, and more
Outlines (bookmarks), including destinations
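
Creation and modification dates are stored in PDF files as date strings of the form `D:YYYYMMDDHHmmSSOHH'mm'`, where every field after the year is optional. The sketch below shows one way such a string could be converted to a `java.time.OffsetDateTime`. It is independent of the product's API; the class and method names are hypothetical, and treating a missing time zone as UTC is an assumption of this example (the format itself leaves it unspecified).

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PdfDateSketch {

    // Groups: 1 year, 2 month, 3 day, 4 hour, 5 minute, 6 second,
    // 7 zone marker (Z, + or -), 8 offset hours, 9 offset minutes.
    private static final Pattern PDF_DATE = Pattern.compile(
            "(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
            + "(?:([Zz+-])(?:(\\d{2})'?(\\d{2})?'?)?)?");

    /** Parses a PDF date string; absent fields fall back to their defaults. */
    static OffsetDateTime parse(String value) {
        Matcher m = PDF_DATE.matcher(value.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a PDF date string: " + value);
        }
        // Assumption: no zone information is interpreted as UTC.
        ZoneOffset offset = ZoneOffset.UTC;
        String marker = m.group(7);
        if ("+".equals(marker) || "-".equals(marker)) {
            int minutes = field(m, 8, 0) * 60 + field(m, 9, 0);
            int signed = "-".equals(marker) ? -minutes : minutes;
            offset = ZoneOffset.ofTotalSeconds(signed * 60);
        }
        return OffsetDateTime.of(
                Integer.parseInt(m.group(1)),
                field(m, 2, 1),  // month defaults to 01
                field(m, 3, 1),  // day defaults to 01
                field(m, 4, 0),
                field(m, 5, 0),
                field(m, 6, 0),
                0,
                offset);
    }

    private static int field(Matcher m, int group, int fallback) {
        String v = m.group(group);
        return v == null ? fallback : Integer.parseInt(v);
    }
}
```

For example, `PdfDateSketch.parse("D:20240131120000+01'00'")` yields 31 January 2024, 12:00 at offset +01:00.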

Extract page information

Media box, crop box, trim box, bleed box, and art box
Page rotation
Annotations
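
Page boxes and page rotation are typically combined to find the size at which a page is displayed. The following is a minimal sketch under simplifying assumptions: the crop box falls back to the media box when absent, the intersection of the two boxes is not computed, and rotation is a multiple of 90 degrees. The `Box` type and `displayedSize` method are hypothetical names, not part of the product's API.

```java
class PageGeometrySketch {

    /** A page box in PDF user space units (hypothetical type, for illustration only). */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        // Corners may be given in any order, so take absolute differences.
        double width()  { return Math.abs(urx - llx); }
        double height() { return Math.abs(ury - lly); }
    }

    /**
     * Returns {width, height} of the page as displayed.
     * cropBox may be null, in which case the media box is used.
     */
    static double[] displayedSize(Box mediaBox, Box cropBox, int rotate) {
        if (rotate % 90 != 0) {
            throw new IllegalArgumentException("Rotation must be a multiple of 90: " + rotate);
        }
        Box visible = cropBox != null ? cropBox : mediaBox;
        // Normalize negative or large rotations into 0, 90, 180, or 270.
        int normalized = ((rotate % 360) + 360) % 360;
        boolean swapped = normalized == 90 || normalized == 270;
        return swapped
                ? new double[] { visible.height(), visible.width() }
                : new double[] { visible.width(), visible.height() };
    }
}
```

With an A4 media box of `0 0 595 842`, no crop box, and a rotation of 90, the displayed size is 842 by 595 units.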

Additional features

Extract and store embedded font files
Retrieve detailed font information
Retrieve optional content group (OCG) information and visibility (layers)
Retrieve detailed graphics state information for each extracted page content object
Extract raw PDF objects
Extract document parts for PDF/X or PDF 2.0
Retrieve detailed color space information including lookup tables for indexed color spaces
Extract and store embedded files
Specify a password to decrypt PDF files
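
As an illustration of why lookup tables matter for indexed color spaces: each pixel stores an index, and the table maps that index to a color in the base color space. The sketch below resolves an index against a table for an indexed color space whose base is assumed to be DeviceRGB (three bytes per entry). The class and method names are hypothetical and do not reflect the product's API.

```java
class IndexedColorSketch {

    /**
     * Resolves an index to {r, g, b} values in the range 0..255.
     * table holds (hival + 1) entries of three bytes each; indices outside
     * 0..hival are clamped into that range.
     */
    static int[] lookupRgb(byte[] table, int hival, int index) {
        if (table.length < (hival + 1) * 3) {
            throw new IllegalArgumentException("Lookup table too short for hival " + hival);
        }
        int clamped = Math.max(0, Math.min(hival, index));
        int base = clamped * 3;
        // Java bytes are signed, so mask to obtain unsigned component values.
        return new int[] {
            table[base] & 0xFF,
            table[base + 1] & 0xFF,
            table[base + 2] & 0xFF
        };
    }
}
```

For a two-entry table containing black followed by pure red, index 1 resolves to `{255, 0, 0}`.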