Extract text from a PDF
Pdftools SDK extracts the text content of a PDF into a UTF-8 plain-text stream. You can extract every page or a selected page range, and choose whether the output follows the order of the text as it’s stored in the PDF or mimics the visual page layout with whitespace.
Get the full sample on GitHub: C, C#, Java, Python, and Visual Basic.
Input and output
- Input: a PDF with a valid text layer, meaning that its glyphs have a valid Unicode mapping. Without that mapping, the extractor can’t decode glyphs into characters, and the result is empty or garbled. For PDFs that weren’t created digitally (for example, scans or photos), add a text layer first by running OCR as described in OCR a PDF document.
- Output: UTF-8 encoded plain text written to an output stream of your choice (file, memory buffer, or any other writable stream).
The extractor doesn’t change the input PDF or return images, vector graphics, annotations, form fields, or metadata.
Steps to extract text from a PDF:
- Initialize the Pdftools SDK license
- Open the input document
- Choose the extraction format
- Configure text extraction options
- Run the extractor
- Full example
1. Initialize the Pdftools SDK license
Before you begin, complete the steps in Initialize the Pdftools SDK license.
2. Open the input document
Load the input PDF from the file system into a read-only Document. Pass this Document to the Extractor in a later step.
.NET:

// Open input document
using var inStr = File.OpenRead(inPath);
using var inDoc = Document.Open(inStr);

Java:

// Open input document
FileStream inStr = new FileStream(inPath, FileStream.Mode.READ_ONLY);
Document inDoc = Document.open(inStr);

Python:

# Open input document
in_stream = io.FileIO(input_path, 'rb')
in_doc = Document.open(in_stream)

C:

// Open input document
pInStream = _tfopen(szInPath, _T("rb"));
TPdfToolsSys_StreamDescriptor inDesc;
PdfToolsSysCreateFILEStreamDescriptor(&inDesc, pInStream, 0);
pInDoc = PdfToolsPdf_Document_Open(&inDesc, _T(""));
3. Choose the extraction format
Pdftools SDK supports two extraction formats, defined by the TextExtractionFormat enumeration. The format you choose drives whether the rest of the options apply.
- DocumentOrder (default): The extractor outputs text in the order it’s embedded in the content stream of the PDF. Use this format to feed extracted text into a search index, a large language model, or any downstream system that cares about reading order, not visual position.
- Monospace: The extracted text mimics the visual layout of each page by padding it with whitespaces, so that the output renders correctly with a monospaced font. Use this format to preserve the appearance of tables, forms, or columnar layouts in plain text.
In Monospace mode, the SDK approximates the position of each glyph on the page by mapping PDF units to character columns and lines. The advanceWidth and lineHeight options described in the next step control that mapping and apply only to Monospace. DocumentOrder ignores them. The wordSeparationFactor option applies to both formats.
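To make the mapping concrete, the following is a minimal sketch of how a fixed advance width and line height could turn glyph positions into a character grid. It only illustrates the idea and isn't the SDK's actual algorithm. The `place_glyphs` function and its input format (x and y positions in points, measured from the top-left corner of the page) are assumptions made for this example.

```python
def place_glyphs(glyphs, advance_width=7.2, line_height=12.0):
    """Lay out (x, y_from_top, char) glyphs on a monospaced character grid.

    Hypothetical illustration: positions are in points, measured from the
    top-left corner of the page. Not part of the Pdftools SDK.
    """
    grid = {}
    for x, y, ch in glyphs:
        col = round(x / advance_width)  # horizontal points -> character column
        row = round(y / line_height)    # vertical points -> output line
        grid.setdefault(row, {})[col] = ch

    lines = []
    for row in range(max(grid, default=-1) + 1):
        cells = grid.get(row, {})
        width = max(cells, default=-1) + 1
        # Pad every unused column with a space so positions line up
        lines.append("".join(cells.get(c, " ") for c in range(width)))
    return "\n".join(lines)


# Two table cells on the same line: "A" at x = 0 pt and "B" at x = 72 pt
glyphs = [(0.0, 0.0, "A"), (72.0, 0.0, "B")]
print(repr(place_glyphs(glyphs, advance_width=7.2)))   # "B" lands in column 10
print(repr(place_glyphs(glyphs, advance_width=14.4)))  # "B" lands in column 5
```

Halving the advance width doubles the number of columns between the two cells, which is why smaller values produce wider, sparser output.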
4. Configure text extraction options
Create a TextOptions object and configure it. All four properties have defaults, so you only need to set the ones you want to change. Some options apply only to the Monospace format mentioned in the previous section.
| Property | Type | Default | Applicable format | What it controls |
|---|---|---|---|---|
| ExtractionFormat | TextExtractionFormat | DocumentOrder | Both formats | Whether the output follows document order (DocumentOrder) or mimics the page layout (Monospace). |
| AdvanceWidth | Length (.NET, Java); a number of points (float in Python, double in C) | 7.2pt | Monospace | The horizontal space in the PDF that corresponds to one character column in the output. Smaller values spread glyphs across more columns and increase whitespace between them. Larger values pack glyphs closer together. |
| LineHeight | Length (.NET, Java); a number of points (float in Python, double in C) | unset | Monospace | The vertical space in the PDF that triggers a new line in the output. When unset, the extractor doesn’t insert extra blank lines between lines of source text. Set this to add empty lines that match the vertical spacing of the page. |
| WordSeparationFactor | double | 0.3 | Both formats | A factor multiplied by the width of the space character to determine word boundaries. When the distance between two glyphs exceeds the resulting value, the extractor inserts a word separator. Lower values produce more word breaks. Higher values merge nearby glyphs into the same word. |
AdvanceWidth is the main lever for tuning the “density” of monospaced output. Increase it when the default 7.2 pt spreads narrow text too far apart, and decrease it when wide text overlaps itself. The samples linked in the Full example section use 9.2 pt for a typical report.
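The WordSeparationFactor rule from the table can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SDK's implementation: `join_glyphs` and its input format (start and end x positions in points for each glyph on one line) are hypothetical.

```python
def join_glyphs(glyphs, space_width, word_separation_factor=0.3):
    """Join (x_start, x_end, char) glyphs on one line into text.

    A separator is inserted when the gap to the previous glyph exceeds
    word_separation_factor * space_width. Hypothetical illustration only.
    """
    threshold = word_separation_factor * space_width
    out = []
    prev_end = None
    for x_start, x_end, ch in sorted(glyphs):
        if prev_end is not None and x_start - prev_end > threshold:
            out.append(" ")  # gap is wide enough to count as a word boundary
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# Gaps: 0.5 pt between "a" and "b", 2.0 pt between "b" and "c".
# The space character is assumed to be 3 pt wide.
line = [(0.0, 5.0, "a"), (5.5, 10.5, "b"), (12.5, 17.5, "c")]
print(join_glyphs(line, space_width=3.0))                              # threshold 0.9 pt
print(join_glyphs(line, space_width=3.0, word_separation_factor=0.1))  # threshold 0.3 pt
print(join_glyphs(line, space_width=3.0, word_separation_factor=1.0))  # threshold 3.0 pt
```

With the default factor only the 2.0 pt gap becomes a word break. Lowering the factor to 0.1 also splits at the 0.5 pt gap, and raising it to 1.0 merges all three glyphs into one word.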
.NET:

var options = new TextOptions();
options.ExtractionFormat = TextExtractionFormat.Monospace; // or TextExtractionFormat.DocumentOrder
options.AdvanceWidth = Length.Parse("9.2pt");
// options.LineHeight = Length.Parse("12pt");
// options.WordSeparationFactor = 0.3;

Java:

TextOptions options = new TextOptions();
options.setExtractionFormat(TextExtractionFormat.MONOSPACE); // or TextExtractionFormat.DOCUMENT_ORDER
options.setAdvanceWidth(Length.parse("9.2pt"));
// options.setLineHeight(Length.parse("12pt"));
// options.setWordSeparationFactor(0.3);

Python:

options = TextOptions()
options.extraction_format = TextExtractionFormat.MONOSPACE # or TextExtractionFormat.DOCUMENT_ORDER
options.advance_width = 9.2 # value in points
# options.line_height = 12.0
# options.word_separation_factor = 0.3

C:

pOptions = PdfToolsExtraction_TextOptions_New();
PdfToolsExtraction_TextOptions_SetExtractionFormat(
    pOptions, ePdfToolsExtraction_TextExtractionFormat_Monospace);
double dAdvanceWidth = 9.2;
PdfToolsExtraction_TextOptions_SetAdvanceWidth(pOptions, &dAdvanceWidth);
// double dLineHeight = 12.0;
// PdfToolsExtraction_TextOptions_SetLineHeight(pOptions, &dLineHeight);
// PdfToolsExtraction_TextOptions_SetWordSeparationFactor(pOptions, 0.3);
5. Run the extractor
Create an Extractor and call ExtractText. The method has four overloads that progressively narrow the scope:
- ExtractText(inDoc, outStream): Extract every page with default options.
- ExtractText(inDoc, outStream, options): Extract every page with custom options. Pass null (None in Python) for options to use defaults.
- ExtractText(inDoc, outStream, options, firstPage): Extract from firstPage to the end of the document.
- ExtractText(inDoc, outStream, options, firstPage, lastPage): Extract the page range [firstPage, lastPage], one-based and inclusive on both ends.
Page indices must be in the range 1 ≤ firstPage ≤ lastPage ≤ PageCount. An index outside that range raises IllegalArgumentException in Java, ArgumentException in .NET, and ValueError in Python. In C, the function returns FALSE with the error code ePdfTools_Error_IllegalArgument.
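If page numbers come from user input, it can help to validate them before calling the extractor so that you can report a clearer message. The following is a minimal sketch of the range rule stated above. The `check_page_range` helper is hypothetical and not part of the SDK, which performs its own check.

```python
def check_page_range(first_page, last_page, page_count):
    """Raise ValueError unless 1 <= first_page <= last_page <= page_count.

    Hypothetical helper: page numbers are one-based and inclusive.
    """
    if not (1 <= first_page <= last_page <= page_count):
        raise ValueError(
            f"invalid page range [{first_page}, {last_page}] "
            f"for a document with {page_count} pages"
        )
    return first_page, last_page


# Valid: pages 2 to 5 of a 10-page document
first, last = check_page_range(2, 5, page_count=10)

# Invalid: page 0 doesn't exist because indices are one-based
try:
    check_page_range(0, 5, page_count=10)
except ValueError as exc:
    print(exc)
```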
A single Extractor instance can process more than one document. To produce one text file per page, call ExtractText in a loop and pass the same page index as both firstPage and lastPage.
.NET:

Extractor extractor = new Extractor();
for (int i = 0; i < inDoc.PageCount; i++)
{
    using var outStr = File.Create(Path.Combine(outDir, $"page{i + 1}.txt"));
    extractor.ExtractText(inDoc, outStr, options, i + 1, i + 1);
}

Java:

Extractor extractor = new Extractor();
for (int i = 0; i < inDoc.getPageCount(); i++) {
    try (FileStream outStr = new FileStream(
            outDir + File.separator + "page" + (i + 1) + ".txt",
            FileStream.Mode.READ_WRITE_NEW)) {
        extractor.extractText(inDoc, outStr, options, i + 1, i + 1);
    }
}

Python:

extractor = Extractor()
for i in range(in_doc.page_count):
    output_file = os.path.join(output_dir, f"page{i + 1}.txt")
    with open(output_file, 'wb') as out_stream:
        extractor.extract_text(in_doc, out_stream, options, i + 1, i + 1)

C:

pExtractor = PdfToolsExtraction_Extractor_New();
int nPageCount = PdfToolsPdf_Document_GetPageCount(pInDoc);
for (int i = 1; i <= nPageCount; i++)
{
    TCHAR szPageFile[512];
    _sntprintf(szPageFile, sizeof(szPageFile) / sizeof(szPageFile[0]),
               _T("%s/page%d.txt"), szOutDir, i);
    pOutStream = _tfopen(szPageFile, _T("wb+"));
    TPdfToolsSys_StreamDescriptor outDesc;
    PdfToolsSysCreateFILEStreamDescriptor(&outDesc, pOutStream, 0);
    int iFirstPage = i;
    int iLastPage = i;
    PdfToolsExtraction_Extractor_ExtractText(
        pExtractor, pInDoc, &outDesc, pOptions, &iFirstPage, &iLastPage);
    fclose(pOutStream);
    pOutStream = NULL;
}
If a PDF page contains no extractable text (for example, a scanned image without a text layer), the extractor writes an empty result for that page. To add a text layer, run OCR first as described in OCR a PDF document, then extract.
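After a per-page run, you may want to find out which pages produced no text so that you can send them to OCR. The sketch below scans an output directory for files named `pageN.txt`, the naming used in the Python loop above, and returns the page numbers whose files contain only whitespace. The `pages_without_text` helper is hypothetical and uses only the Python standard library.

```python
import os
import re


def pages_without_text(output_dir):
    """Return the page numbers whose pageN.txt file holds no visible text.

    Hypothetical helper: expects the one-file-per-page output written as
    UTF-8 text files named page1.txt, page2.txt, and so on.
    """
    empty = []
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"page(\d+)\.txt", name)
        if not match:
            continue  # ignore unrelated files
        with open(os.path.join(output_dir, name), "rb") as f:
            text = f.read().decode("utf-8", errors="replace")
        if not text.strip():
            empty.append(int(match.group(1)))
    return sorted(empty)
```

A non-empty return value, for example from `pages_without_text(output_dir)`, is a hint that those pages are scans or otherwise lack a usable text layer.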
Errors and exceptions
ExtractText can fail for the reasons in this table. Wrap the call in your language’s exception handling and surface or log the failure.
| Failure | Java / .NET | Python | C |
|---|---|---|---|
| Invalid license | LicenseException | LicenseError | PdfTools_GetLastError returns a license error code |
| Extraction failure (corrupted content, decryption failure, internal error) | ProcessingException | ProcessingError | The function returns FALSE and sets a processing error code |
| Output stream write failure | IOException | OSError | The function returns FALSE. Check PdfTools_GetLastError |
| Page index out of range | IllegalArgumentException (Java), ArgumentException (.NET) | ValueError | The function returns FALSE with ePdfTools_Error_IllegalArgument |
| Other internal errors | GenericException | GenericError | PdfTools_GetLastError returns a generic error code |
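One way to apply the table in Python is a small wrapper that logs each failure category and reports success as a boolean. This is a sketch, not a prescribed pattern: `run_extraction` is a hypothetical helper, and because this page doesn't show the import path for the SDK's LicenseError, ProcessingError, and GenericError classes, the sketch handles them with a generic fallback that you would replace with explicit clauses.

```python
import logging

logger = logging.getLogger("extract_text")


def run_extraction(extract, *args):
    """Call extract(*args) and log any failure instead of propagating it.

    Returns True on success and False on a handled failure.
    Hypothetical wrapper; `extract` is any callable, such as a bound
    extract_text method.
    """
    try:
        extract(*args)
        return True
    except ValueError as exc:
        # Page index out of range
        logger.error("Invalid page range: %s", exc)
    except OSError as exc:
        # Output stream write failure
        logger.error("Could not write output: %s", exc)
    except Exception as exc:
        # Stand-in for the SDK's LicenseError, ProcessingError, and
        # GenericError; catch those classes explicitly once imported.
        logger.error("Extraction failed: %s", exc)
    return False
```

In the per-page loop, the call would look like `run_extraction(extractor.extract_text, in_doc, out_stream, options, i + 1, i + 1)`, which lets the loop continue with the next page after a failure.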
Full example
The code snippets in the previous steps are excerpts that illustrate the workflow. They aren’t complete, runnable programs. Before you run Pdftools SDK, install the package for your language and set up a license key. For setup steps, see Getting started with Pdftools SDK.
For a complete, runnable project, clone the sample repository for your language and run the ExtractTextLayout sample: