Extract text from a PDF

Pdftools SDK extracts the text content of a PDF into a UTF-8 plain-text stream. You can extract every page or a selected page range, and choose whether the output follows the order of the text as it’s stored in the PDF or mimics the visual page layout with whitespace.

Quick start with a code sample

Get the full sample on GitHub: C, C#, Java, Python, and Visual Basic.

Input and output

Input: a PDF with a valid text layer. The glyphs in that layer must have a valid Unicode mapping. Without that mapping, the extractor can’t decode glyphs into characters, and the result is empty or garbled. For PDFs that weren’t created digitally (for example, scans or photos), add a text layer first by running OCR, as described in OCR a PDF document.
Output: UTF-8 encoded plain text written to an output stream of your choice (file, memory buffer, or any other writable stream).

The extractor doesn’t modify the input PDF, and it doesn’t return images, vector graphics, annotations, form fields, or metadata.

Steps to extract text from a PDF:

Initialize the Pdftools SDK license
Open the input document
Choose the extraction format
Configure text extraction options
Run the extractor
Full example

1. Initialize the Pdftools SDK license

Before you begin, initialize the Pdftools SDK license.

2. Open the input document

Load the input PDF from the file system into a read-only Document. Pass this Document to the Extractor in a later step.

.NET

// Open input document
using var inStr = File.OpenRead(inPath);
using var inDoc = Document.Open(inStr);

Java

// Open input document
FileStream inStr = new FileStream(inPath, FileStream.Mode.READ_ONLY);
Document inDoc = Document.open(inStr);

Python

# Open input document
in_stream = io.FileIO(input_path, 'rb')
in_doc = Document.open(in_stream)

C

// Open input document
pInStream = _tfopen(szInPath, _T("rb"));
TPdfToolsSys_StreamDescriptor inDesc;
PdfToolsSysCreateFILEStreamDescriptor(&inDesc, pInStream, 0);
pInDoc = PdfToolsPdf_Document_Open(&inDesc, _T(""));

3. Choose the extraction format

Pdftools SDK supports two extraction formats, defined by the TextExtractionFormat enumeration. The format you choose determines which of the other options apply.

DocumentOrder (default): The extractor outputs text in the order it’s embedded in the content stream of the PDF. Use this format to feed extracted text into a search index, a large language model, or any downstream system that cares about reading order, not visual position.
Monospace: The extracted text mimics the visual layout of each page by padding it with whitespace, so that the output renders correctly with a monospaced font. Use this format to preserve the appearance of tables, forms, or columnar layouts in plain text.

In Monospace mode, the SDK approximates the position of each glyph on the page by mapping PDF units to character columns and lines. The advanceWidth and lineHeight options described in the next step control that mapping and apply only to Monospace. DocumentOrder ignores them. The wordSeparationFactor option applies to both formats.
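As a rough illustration of that mapping, the sketch below converts a position in PDF points into an output line and column. It's plain Python that doesn't call the SDK, and it assumes a top-left origin and simple rounding down. `to_grid_cell` is a hypothetical helper, and the 12 pt line height is an arbitrary example value, so treat it as a mental model rather than the SDK's actual placement algorithm.

```python
import math


def to_grid_cell(x_pt: float, y_from_top_pt: float,
                 advance_width_pt: float = 7.2,
                 line_height_pt: float = 12.0) -> tuple[int, int]:
    """Map a position in PDF points to a zero-based (line, column) cell.

    Hypothetical helper for illustration only: the SDK's real placement
    logic may differ.
    """
    if advance_width_pt <= 0 or line_height_pt <= 0:
        raise ValueError("advance width and line height must be positive")
    column = math.floor(x_pt / advance_width_pt)
    line = math.floor(y_from_top_pt / line_height_pt)
    return line, column


# A glyph 100 pt from the left edge and 40 pt below the top edge.
cell_default = to_grid_cell(100.0, 40.0)                      # 7.2 pt columns
cell_wide = to_grid_cell(100.0, 40.0, advance_width_pt=9.2)   # 9.2 pt columns
print(cell_default, cell_wide)
```

With the 7.2 pt default, the glyph lands in column 13. At 9.2 pt it lands in column 10, so the same text occupies fewer columns.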

4. Configure text extraction options

Create a TextOptions object and configure it. All four properties are optional, so you only need to set the ones you want to change. AdvanceWidth and LineHeight apply only to the Monospace format described in the previous step.

Property	Type	Default	Applicable format	What it controls
`ExtractionFormat`	`TextExtractionFormat`	`DocumentOrder`	Both formats	Whether the output follows document order (`DocumentOrder`) or mimics the page layout (`Monospace`).
`AdvanceWidth`	`Length` (.NET, Java); a number of points (`float` in Python, `double` in C)	`7.2pt`	`Monospace`	The horizontal space in the PDF that corresponds to one character column in the output. Smaller values spread glyphs across more columns and increase whitespace between them. Larger values pack glyphs closer together.
`LineHeight`	`Length` (.NET, Java); a number of points (`float` in Python, `double` in C)	unset	`Monospace`	The vertical space in the PDF that triggers a new line in the output. When unset, the extractor doesn’t insert extra blank lines between lines of source text. Set this to add empty lines that match the vertical spacing of the page.
`WordSeparationFactor`	`double`	`0.3`	Both formats	A factor multiplied by the width of the space character to determine word boundaries. When the distance between two glyphs exceeds the resulting value, the extractor inserts a word separator. Lower values produce more word breaks. Higher values merge nearby glyphs into the same word.
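To see how the word separation factor changes word boundaries, here is a minimal sketch in plain Python. It assumes glyphs arrive as (character, start x, end x) tuples for a single line and that the width of the space character is known. `split_words` and its inputs are hypothetical and don't reflect how the SDK represents glyphs internally.

```python
def split_words(glyphs: list[tuple[str, float, float]],
                space_width_pt: float,
                word_separation_factor: float = 0.3) -> list[str]:
    """Group glyphs into words using a gap threshold.

    glyphs: (char, x_start_pt, x_end_pt) tuples for one line, left to right.
    A new word starts when the gap to the previous glyph exceeds
    word_separation_factor * space_width_pt.
    """
    threshold = word_separation_factor * space_width_pt
    words: list[str] = []
    current = ""
    prev_end = None
    for char, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > threshold:
            words.append(current)
            current = ""
        current += char
        prev_end = x_end
    if current:
        words.append(current)
    return words


# Gaps between glyphs: a-b 0.5 pt, b-c 2.0 pt, c-d 0.0 pt.
line = [("a", 0.0, 5.0), ("b", 5.5, 10.0), ("c", 12.0, 17.0), ("d", 17.0, 22.0)]
for factor in (0.05, 0.3, 1.0):
    print(factor, split_words(line, space_width_pt=5.0,
                              word_separation_factor=factor))
```

With a 5 pt space width, the default factor of 0.3 gives a 1.5 pt threshold and yields two words. A factor of 0.05 also splits at the 0.5 pt gap, and a factor of 1.0 merges everything into one word.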

AdvanceWidth is the main lever for tuning the “density” of monospaced output. Adjust it when the default 7.2 pt spreads narrow text too far apart, or when wide text overflows on top of itself. The samples linked in the Full example section use 9.2 pt for a typical report.
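A quick way to reason about a candidate value is to estimate how many character columns a page produces. The sketch below is plain arithmetic, not an SDK call. `columns_for_width` is a hypothetical helper, and 595 pt is the approximate width of an A4 page.

```python
def columns_for_width(page_width_pt: float, advance_width_pt: float) -> int:
    """Return how many whole character columns fit across the page width."""
    if advance_width_pt <= 0:
        raise ValueError("advance width must be positive")
    return int(page_width_pt // advance_width_pt)


A4_WIDTH_PT = 595.0  # approximate A4 width in points (assumption for the example)

for advance in (7.2, 9.2):
    columns = columns_for_width(A4_WIDTH_PT, advance)
    print(f"{advance} pt -> {columns} columns")
```

At 7.2 pt (one tenth of an inch, since there are 72 pt per inch) an A4 page maps to 82 columns. At 9.2 pt it maps to 64, so output lines are shorter.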

.NET

var options = new TextOptions();
options.ExtractionFormat = TextExtractionFormat.Monospace; // or TextExtractionFormat.DocumentOrder
options.AdvanceWidth = Length.Parse("9.2pt");
// options.LineHeight = Length.Parse("12pt");
// options.WordSeparationFactor = 0.3;

Java

TextOptions options = new TextOptions();
options.setExtractionFormat(TextExtractionFormat.MONOSPACE); // or TextExtractionFormat.DOCUMENT_ORDER
options.setAdvanceWidth(Length.parse("9.2pt"));
// options.setLineHeight(Length.parse("12pt"));
// options.setWordSeparationFactor(0.3);

Python

options = TextOptions()
options.extraction_format = TextExtractionFormat.MONOSPACE  # or TextExtractionFormat.DOCUMENT_ORDER
options.advance_width = 9.2  # value in points
# options.line_height = 12.0
# options.word_separation_factor = 0.3

C

pOptions = PdfToolsExtraction_TextOptions_New();
PdfToolsExtraction_TextOptions_SetExtractionFormat(
    pOptions, ePdfToolsExtraction_TextExtractionFormat_Monospace);

double dAdvanceWidth = 9.2;
PdfToolsExtraction_TextOptions_SetAdvanceWidth(pOptions, &dAdvanceWidth);

// double dLineHeight = 12.0;
// PdfToolsExtraction_TextOptions_SetLineHeight(pOptions, &dLineHeight);
// PdfToolsExtraction_TextOptions_SetWordSeparationFactor(pOptions, 0.3);

5. Run the extractor

Create an Extractor and call ExtractText. The method has four overloads that progressively narrow the scope:

ExtractText(inDoc, outStream): Extract every page with default options.
ExtractText(inDoc, outStream, options): Extract every page with custom options. Pass null (None in Python) for options to use defaults.
ExtractText(inDoc, outStream, options, firstPage): Extract from firstPage to the end of the document.
ExtractText(inDoc, outStream, options, firstPage, lastPage): Extract the page range [firstPage, lastPage], one-based and inclusive on both ends.

Page indices must satisfy 1 ≤ firstPage ≤ lastPage ≤ PageCount. An index outside that range raises IllegalArgumentException in Java, ArgumentException in .NET, and ValueError in Python. In C, the function returns FALSE with ePdfTools_Error_IllegalArgument.
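If the page numbers come from user input, you may want to check them before calling the extractor. A minimal sketch of the rule above in plain Python follows. `is_valid_page_range` is a hypothetical helper, not part of the SDK.

```python
def is_valid_page_range(first_page: int, last_page: int, page_count: int) -> bool:
    """Check the one-based, inclusive rule 1 <= first <= last <= page_count."""
    return 1 <= first_page <= last_page <= page_count


PAGE_COUNT = 12  # example value; in real code, read it from the opened document

for first, last in ((1, 12), (3, 3), (0, 5), (8, 4), (10, 13)):
    verdict = "ok" if is_valid_page_range(first, last, PAGE_COUNT) else "rejected"
    print(f"[{first}, {last}] -> {verdict}")
```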

A single Extractor instance can process more than one document. To produce one text file per page, call ExtractText in a loop and pass the same page index as both firstPage and lastPage.

.NET

Extractor extractor = new Extractor();
for (int i = 0; i < inDoc.PageCount; i++)
{
    using var outStr = File.Create(Path.Combine(outDir, $"page{i + 1}.txt"));
    extractor.ExtractText(inDoc, outStr, options, i + 1, i + 1);
}

Java

Extractor extractor = new Extractor();
for (int i = 0; i < inDoc.getPageCount(); i++) {
    try (FileStream outStr = new FileStream(
            outDir + File.separator + "page" + (i + 1) + ".txt",
            FileStream.Mode.READ_WRITE_NEW)) {
        extractor.extractText(inDoc, outStr, options, i + 1, i + 1);
    }
}

Python

extractor = Extractor()
for i in range(in_doc.page_count):
    output_file = os.path.join(output_dir, f"page{i + 1}.txt")
    with open(output_file, 'wb') as out_stream:
        extractor.extract_text(in_doc, out_stream, options, i + 1, i + 1)

C

pExtractor = PdfToolsExtraction_Extractor_New();
int nPageCount = PdfToolsPdf_Document_GetPageCount(pInDoc);
for (int i = 1; i <= nPageCount; i++)
{
    TCHAR szPageFile[512];
    _sntprintf(szPageFile, sizeof(szPageFile) / sizeof(szPageFile[0]),
               _T("%s/page%d.txt"), szOutDir, i);

    pOutStream = _tfopen(szPageFile, _T("wb+"));
    TPdfToolsSys_StreamDescriptor outDesc;
    PdfToolsSysCreateFILEStreamDescriptor(&outDesc, pOutStream, 0);

    int iFirstPage = i;
    int iLastPage  = i;
    PdfToolsExtraction_Extractor_ExtractText(
        pExtractor, pInDoc, &outDesc, pOptions, &iFirstPage, &iLastPage);

    fclose(pOutStream);
    pOutStream = NULL;
}
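The same one-based, inclusive convention applies if you want several pages per output file rather than one. The sketch below only computes the (firstPage, lastPage) pairs to pass to ExtractText. `page_chunks` is a hypothetical helper, and the chunk size of 10 is arbitrary.

```python
def page_chunks(page_count: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..page_count into one-based, inclusive (first, last) ranges."""
    if page_count < 0 or chunk_size < 1:
        raise ValueError("page_count must be >= 0 and chunk_size must be >= 1")
    return [(first, min(first + chunk_size - 1, page_count))
            for first in range(1, page_count + 1, chunk_size)]


# A 25-page document split into files of up to 10 pages each.
for first_page, last_page in page_chunks(25, 10):
    print(f"pages{first_page}-{last_page}.txt")
```

For a 25-page document this yields the ranges [1, 10], [11, 20], and [21, 25]. A chunk size of 1 reproduces the one-file-per-page loop.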

Note

If a PDF page contains no extractable text (for example, a scanned image without a text layer), the extractor writes an empty result for that page. To add a text layer, run OCR first (see OCR a PDF document), then extract.
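One way to detect this case in a pipeline is to extract into a memory buffer and check whether anything other than whitespace came back. The following sketch assumes the extractor has already written its UTF-8 output into an `io.BytesIO` buffer. `has_extractable_text` is a hypothetical helper, and the sample bytes stand in for real extractor output.

```python
import io


def has_extractable_text(buffer: io.BytesIO) -> bool:
    """Return True if the UTF-8 output contains any non-whitespace character."""
    text = buffer.getvalue().decode("utf-8", errors="replace")
    return any(not ch.isspace() for ch in text)


# Stand-ins for extractor output: a scanned page and a page with a text layer.
scanned_page = io.BytesIO(b"\n\n")
text_page = io.BytesIO("Invoice 2024\n".encode("utf-8"))

for name, buf in (("scanned", scanned_page), ("text", text_page)):
    action = "extract as usual" if has_extractable_text(buf) else "run OCR first"
    print(f"{name}: {action}")
```

This check only catches empty output. Garbled output from a broken Unicode mapping still contains characters, so it needs a different heuristic or manual review.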

Errors and exceptions

ExtractText can fail for the reasons in this table. Wrap the call in your language’s exception handling and surface or log the failure.

Failure	Java / .NET	Python	C
Invalid license	`LicenseException`	`LicenseError`	`PdfTools_GetLastError` returns a license error code
Extraction failure (corrupted content, decryption failure, internal error)	`ProcessingException`	`ProcessingError`	The function returns `FALSE` and sets a processing error code
Output stream write failure	`IOException`	`OSError`	The function returns `FALSE`. Check `PdfTools_GetLastError`
Page index out of range	`IllegalArgumentException` (Java), `ArgumentException` (.NET)	`ValueError`	The function returns `FALSE` with `ePdfTools_Error_IllegalArgument`
Other internal errors	`GenericException`	`GenericError`	`PdfTools_GetLastError` returns a generic error code
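As a sketch of that advice in Python, the wrapper below maps each failure in the table to a short status string and logs it. So that it runs without the SDK installed, it defines stand-in classes named after the Python exceptions in the table. In real code, import the classes from the SDK instead; the module path is omitted here because it isn't given above. `run_extraction` and the status strings are hypothetical.

```python
import logging

logger = logging.getLogger("text_extraction")


# Stand-ins so the sketch runs without the SDK. Replace them with the
# exception classes your SDK version provides.
class LicenseError(Exception):
    pass


class ProcessingError(Exception):
    pass


class GenericError(Exception):
    pass


def run_extraction(extract_call) -> str:
    """Run extract_call() and return a status string describing the outcome.

    The order of the except clauses assumes the classes are unrelated;
    adjust it if the real classes inherit from one another.
    """
    try:
        extract_call()
    except LicenseError as exc:
        logger.error("Invalid license: %s", exc)
        return "license"
    except ProcessingError as exc:
        logger.error("Extraction failed: %s", exc)
        return "processing"
    except ValueError as exc:
        logger.error("Page index out of range: %s", exc)
        return "page-range"
    except OSError as exc:
        logger.error("Could not write output: %s", exc)
        return "io"
    except GenericError as exc:
        logger.error("Internal error: %s", exc)
        return "generic"
    return "ok"


def failing_call():
    # Simulates an out-of-range page index.
    raise ValueError("lastPage exceeds page count")


print(run_extraction(lambda: None))
print(run_extraction(failing_call))
```

In an application, `extract_call` would be a small function or lambda that calls `extract_text` with your document, stream, and options.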

Full example

The code snippets in the previous steps are excerpts that illustrate the workflow. They aren’t complete, runnable programs. Before you run Pdftools SDK, install the package for your language and set up a license key. For setup steps, see Getting started with Pdftools SDK.

For a complete, runnable project, clone the sample repository for your language and run the ExtractTextLayout sample:

C
C#
Java
Python
Visual Basic
