
Extract text from a PDF

Pdftools SDK extracts the text content of a PDF into a UTF-8 plain-text stream. You can extract every page or a selected page range, and choose whether the output follows the order of the text as it’s stored in the PDF or mimics the visual page layout with whitespace.

Quick start with a code sample

Get the full sample on GitHub: C, C#, Java, Python, and Visual Basic.

Input and output

  • Input: a PDF with a text layer whose glyphs have a valid Unicode mapping. Without that mapping, the extractor can’t decode glyphs into characters, and the result is empty or garbled. For PDFs that aren’t born digital (for example, scans or photos), run OCR first to add a text layer; see OCR a PDF document.
  • Output: UTF-8 encoded plain text written to an output stream of your choice (file, memory buffer, or any other writable stream).

The extractor doesn’t change the input PDF or return images, vector graphics, annotations, form fields, or metadata.

Steps to extract text from a PDF:

  1. Initialize the Pdftools SDK license
  2. Open the input document
  3. Choose the extraction format
  4. Configure text extraction options
  5. Run the extractor
  6. Full example

1. Initialize the Pdftools SDK license

Before you begin, initialize the Pdftools SDK license.

2. Open the input document

Load the input PDF from the file system into a read-only Document. Pass this Document to the Extractor in a later step.

// Open input document
using var inStr = File.OpenRead(inPath);
using var inDoc = Document.Open(inStr);

3. Choose the extraction format

Pdftools SDK supports two extraction formats, defined by the TextExtractionFormat enumeration. The format you choose determines which of the other options apply.

  • DocumentOrder (default): The extractor outputs text in the order it’s embedded in the content stream of the PDF. Use this format to feed extracted text into a search index, a large language model, or any downstream system that cares about reading order, not visual position.
  • Monospace: The extracted text mimics the visual layout of each page by padding it with whitespace, so that the output renders correctly in a monospaced font. Use this format to preserve the appearance of tables, forms, or columnar layouts in plain text.

In Monospace mode, the SDK approximates the position of each glyph on the page by mapping PDF units to character columns and lines. The advanceWidth and lineHeight options described in the next step control that mapping and apply only to Monospace. DocumentOrder ignores them. The wordSeparationFactor option applies to both formats.
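To make the mapping from PDF units to character cells concrete, here is a minimal, self-contained sketch. It is an illustrative model only: the helper name ToCell, the use of a distance measured from the top of the page, and the floor rounding are all assumptions, not the SDK's documented algorithm.

```csharp
using System;

// Illustrative model only (not the SDK's implementation): maps a position
// given in PDF points to a character cell in monospaced output.
// ToCell and the floor rounding rule are assumptions made for this sketch.
static (int Column, int Line) ToCell(
    double xPt, double yFromTopPt, double advanceWidthPt, double lineHeightPt)
{
    // One output column covers advanceWidthPt points of horizontal space.
    int column = (int)Math.Floor(xPt / advanceWidthPt);
    // One output line covers lineHeightPt points of vertical space.
    int line = (int)Math.Floor(yFromTopPt / lineHeightPt);
    return (column, line);
}

// A glyph 100 pt from the left edge and 30 pt from the top of the page,
// with 7.2 pt wide columns and 12 pt high lines.
var cell = ToCell(100.0, 30.0, 7.2, 12.0);
Console.WriteLine($"column {cell.Column}, line {cell.Line}"); // column 13, line 2
```

Under this model, raising the advance width to 9.2 pt moves the same glyph from column 13 to column 10, which is why larger values produce denser output.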

4. Configure text extraction options

Create a TextOptions object and configure it. All four properties have defaults, so you only need to set the ones you want to change. Some options apply only to the Monospace format mentioned in the previous section.

| Property | Type | Default | Applicable format | What it controls |
|---|---|---|---|---|
| ExtractionFormat | TextExtractionFormat | DocumentOrder | Both formats | Whether the output follows document order (DocumentOrder) or mimics the page layout (Monospace). |
| AdvanceWidth | Length (.NET, Java); a number of points (float in Python, double in C) | 7.2 pt | Monospace | The horizontal space in the PDF that corresponds to one character column in the output. Smaller values spread glyphs across more columns and increase whitespace between them. Larger values pack glyphs closer together. |
| LineHeight | Length (.NET, Java); a number of points (float in Python, double in C) | unset | Monospace | The vertical space in the PDF that triggers a new line in the output. When unset, the extractor doesn’t insert extra blank lines between lines of source text. Set this to add empty lines that match the vertical spacing of the page. |
| WordSeparationFactor | double | 0.3 | Both formats | A factor multiplied by the width of the space character to determine word boundaries. When the distance between two glyphs exceeds the resulting value, the extractor inserts a word separator. Lower values produce more word breaks. Higher values merge nearby glyphs into the same word. |

AdvanceWidth is the main lever for tuning the “density” of monospaced output. Adjust it when the default 7.2 pt spreads narrow text too far apart, or when wide text overflows on top of itself. The samples linked in the Full example section use 9.2 pt for a typical report.

var options = new TextOptions();
options.ExtractionFormat = TextExtractionFormat.Monospace; // or TextExtractionFormat.DocumentOrder
options.AdvanceWidth = Length.Parse("9.2pt");
// options.LineHeight = Length.Parse("12pt");
// options.WordSeparationFactor = 0.3;
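The effect of WordSeparationFactor is easier to see with numbers. The sketch below models only the rule stated in the table (insert a separator when the gap between two glyphs exceeds the factor times the width of the space character). The JoinGlyphs helper, the glyph boxes, and the 2.5 pt space width are hypothetical values chosen for illustration; the SDK's internal logic may differ.

```csharp
using System;
using System.Text;

// Illustrative model of the documented rule, not the SDK's implementation:
// a separator is inserted when the gap between neighboring glyphs
// exceeds factor * spaceWidthPt.
static string JoinGlyphs(
    (string Text, double Left, double Right)[] glyphs, double spaceWidthPt, double factor)
{
    double threshold = factor * spaceWidthPt;
    var sb = new StringBuilder();
    for (int i = 0; i < glyphs.Length; i++)
    {
        if (i > 0 && glyphs[i].Left - glyphs[i - 1].Right > threshold)
            sb.Append(' ');
        sb.Append(glyphs[i].Text);
    }
    return sb.ToString();
}

// Hypothetical glyph boxes in points: a 0.5 pt gap after "a", a 1.5 pt gap after "b".
var sample = new[]
{
    (Text: "a", Left: 0.0, Right: 5.0),
    (Text: "b", Left: 5.5, Right: 10.5),
    (Text: "c", Left: 12.0, Right: 17.0),
};

// With an assumed 2.5 pt space width, the thresholds are 0.75 pt and 0.125 pt.
Console.WriteLine(JoinGlyphs(sample, 2.5, 0.3));  // ab c
Console.WriteLine(JoinGlyphs(sample, 2.5, 0.05)); // a b c
```

The lower factor turns the small 0.5 pt gap into a word break, which matches the guidance that lower values produce more word breaks.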

5. Run the extractor

Create an Extractor and call ExtractText. The method has four overloads that progressively narrow the scope:

  • ExtractText(inDoc, outStream): Extract every page with default options.
  • ExtractText(inDoc, outStream, options): Extract every page with custom options. Pass null (None in Python) for options to use defaults.
  • ExtractText(inDoc, outStream, options, firstPage): Extract from firstPage to the end of the document.
  • ExtractText(inDoc, outStream, options, firstPage, lastPage): Extract the page range [firstPage, lastPage], one-based and inclusive on both ends.

Page indices must satisfy 1 ≤ firstPage ≤ lastPage ≤ PageCount. An index outside that range raises IllegalArgumentException in Java, ArgumentException in .NET, and ValueError in Python. In C, the function returns FALSE with the error code ePdfTools_Error_IllegalArgument.
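If the page range comes from user input, it can help to check it before calling ExtractText so that the error message names the offending values. The following is a minimal sketch of such a pre-check. CheckPageRange is a hypothetical helper, not part of the SDK, and it only mirrors the rule stated above.

```csharp
using System;

// Hypothetical pre-check (not an SDK method) that mirrors the documented rule
// 1 <= firstPage <= lastPage <= pageCount.
static void CheckPageRange(int firstPage, int lastPage, int pageCount)
{
    if (firstPage < 1 || firstPage > lastPage || lastPage > pageCount)
    {
        throw new ArgumentException(
            $"Page range [{firstPage}, {lastPage}] is outside 1..{pageCount}.");
    }
}

CheckPageRange(2, 5, 10);      // valid range: returns normally
try
{
    CheckPageRange(0, 5, 10);  // invalid: page numbers are one-based
}
catch (ArgumentException ex)
{
    Console.WriteLine(ex.Message);
}
```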

A single Extractor instance can process more than one document. To produce one text file per page, call ExtractText in a loop and pass the same page index as both firstPage and lastPage.

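// Reuse one Extractor for every page
Extractor extractor = new Extractor();
for (int page = 1; page <= inDoc.PageCount; page++)
{
    // Page numbers are one-based; pass the same index as firstPage and lastPage
    using var outStr = File.Create(Path.Combine(outDir, $"page{page}.txt"));
    extractor.ExtractText(inDoc, outStr, options, page, page);
}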
Note: If a PDF page contains no extractable text (for example, a scanned image without a text layer), the extractor writes an empty result for that page. Run OCR first to add a text layer (see OCR a PDF document), then extract.
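When you extract one page at a time into a memory buffer, you can flag pages that produced no text and route them to OCR. The sketch below shows one way to test the buffer. IsBlankPage is a hypothetical helper, and treating whitespace-only output as empty is an assumption made here, not documented SDK behavior.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of the SDK): returns true when a page's
// extracted UTF-8 output contains no visible text. Treating whitespace-only
// output as blank is an assumption of this sketch.
static bool IsBlankPage(MemoryStream pageText)
{
    string text = Encoding.UTF8.GetString(pageText.ToArray());
    return string.IsNullOrWhiteSpace(text);
}

// Stand-ins for per-page output; in real use, pass a MemoryStream as the
// output stream of ExtractText and inspect it after the call.
var emptyPage = new MemoryStream(Encoding.UTF8.GetBytes("  \n\n"));
var textPage = new MemoryStream(Encoding.UTF8.GetBytes("Invoice 42\n"));

Console.WriteLine(IsBlankPage(emptyPage)); // True
Console.WriteLine(IsBlankPage(textPage));  // False
```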

Errors and exceptions

ExtractText can fail for the reasons in this table. Wrap the call in your language’s exception handling and surface or log the failure.

| Failure | Java / .NET | Python | C |
|---|---|---|---|
| Invalid license | LicenseException | LicenseError | PdfTools_GetLastError returns a license error code |
| Extraction failure (corrupted content, decryption failure, internal error) | ProcessingException | ProcessingError | The function returns FALSE and sets a processing error code |
| Output stream write failure | IOException | OSError | The function returns FALSE. Check PdfTools_GetLastError |
| Page index out of range | IllegalArgumentException (Java), ArgumentException (.NET) | ValueError | The function returns FALSE with ePdfTools_Error_IllegalArgument |
| Other internal errors | GenericException | GenericError | PdfTools_GetLastError returns a generic error code |
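One possible shape for the .NET error handling is a small wrapper that runs the extraction call and converts exceptions into a message. The sketch below is deliberately limited to the two .NET base class library types from the table so that it compiles without the SDK. TryExtract is a hypothetical helper, and the final catch clause stands in for the SDK-specific types (LicenseException, ProcessingException, GenericException), which would normally get their own clauses.

```csharp
using System;
using System.IO;

// Hypothetical wrapper (not part of the SDK): runs one extraction call and
// reports failure as a message instead of letting the exception propagate.
static bool TryExtract(Action extract, out string error)
{
    try
    {
        extract();
        error = "";
        return true;
    }
    catch (ArgumentException ex)
    {
        error = $"Page index out of range: {ex.Message}";
    }
    catch (IOException ex)
    {
        error = $"Could not write the output stream: {ex.Message}";
    }
    catch (Exception ex)
    {
        // Stand-in for the SDK-specific exception types listed in the table.
        error = $"Extraction failed ({ex.GetType().Name}): {ex.Message}";
    }
    return false;
}

// Stand-in for a call such as extractor.ExtractText(inDoc, outStr, options)
// that simulates a failed write.
bool ok = TryExtract(() => throw new IOException("disk full"), out string message);
Console.WriteLine(ok ? "done" : message);
```

Returning a Boolean keeps a per-page loop running after one page fails. Rethrow instead if a single failure should abort the whole job.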

Full example

The code snippets in the previous steps are excerpts that illustrate the workflow. They aren’t complete, runnable programs. Before you run Pdftools SDK, install the package for your language and set up a license key. For setup steps, see Getting started with Pdftools SDK.

For a complete, runnable project, clone the sample repository for your language and run the ExtractTextLayout sample: