Read PDF logical structure

Use the Toolbox add-on to read and traverse the logical structure of a tagged PDF document. This guide covers the technical process of accessing and analyzing existing document structure.

If you need to add logical structure to an existing PDF, see our guide on Adding logical structure to existing PDFs.

This functionality is part of the Toolbox add-on, a separate SDK that you can use with the same license key as the Pdftools SDK. To use and integrate this add-on, review Getting started with the Toolbox add-on and Toolbox add-on code samples.

Quick start

Download the full sample now in C#, Java, and Python.

For background on PDF accessibility concepts and the importance of logical structure, review A primer on PDF accessibility.

Reading logical structure involves accessing and traversing the document’s structure tree to extract information about tagged elements. To read the logical structure of a PDF, follow these steps:

  1. Opening the tagged document
  2. Accessing the structure tree
  3. Traversing the tree recursively
  4. Reading node properties
  5. Full example

Before you begin

Opening the tagged document

Start by opening the PDF document that contains a logical structure. Only tagged PDFs include accessible structure information.

You can then check whether the PDF claims to be PDF/UA conformant using the IsPdfUaConformant property (is_pdf_ua_conformant in Python).

// Open input document
using Stream inStream = new FileStream(inPath, FileMode.Open, FileAccess.Read);
using Document inDoc = Document.Open(inStream, null);

// Report whether the document declares PDF/UA conformance
if (inDoc.IsPdfUaConformant)
{
    Console.WriteLine("This PDF declares PDF/UA conformance.");
}
else
{
    Console.WriteLine("This PDF does not declare PDF/UA conformance.");
}

PDF/UA Declaration

The IsPdfUaConformant flag reflects only the PDF’s metadata declaration. It does not guarantee actual PDF/UA compliance; use a validator to verify true conformance.

Accessing the structure tree

Create a Tree object to access the document’s logical structure. The tree provides access to the root document node and its children.

// Create a structure tree object
var tree = new Tree(inDoc);

// Traverse all top-level structure elements
foreach (var child in tree.Children)
{
    PrintNodeRecursively(child);
}

Traversing the tree recursively

Implement a recursive function to traverse the entire structure tree. Each node can have child nodes, creating a hierarchical structure.

static void PrintNodeRecursively(Node node, int level = 0)
{
    // Print current node information
    PrintProperty(level, "Tag", node.Tag);
    PrintProperty(level, "Alternative text", node.AlternateText);
    PrintProperty(level, "Actual text", node.ActualText);
    PrintProperty(level, "Language", node.Language);

    // Recursively traverse child nodes, indenting one level deeper
    foreach (var child in node.Children)
    {
        PrintNodeRecursively(child, level + 1);
    }
}
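
The same depth-first pattern can be tried without the SDK. The following Python sketch uses a hypothetical stand-in class, StructNode, that holds only a tag and a list of children; it is not the Toolbox add-on's Node type, and the sample tree is made up for illustration. Instead of printing directly, the walk function yields each node together with its depth, so the caller decides whether to print, filter, or count.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class StructNode:
    """Hypothetical stand-in for a structure node (not the SDK's Node class)."""
    tag: str
    children: List["StructNode"] = field(default_factory=list)


def walk(node: StructNode, level: int = 0) -> Iterator[Tuple[int, StructNode]]:
    """Yield (level, node) pairs in depth-first document order."""
    yield level, node
    for child in node.children:
        yield from walk(child, level + 1)


def outline(root: StructNode) -> List[str]:
    """Return one line per node, indented two spaces per level."""
    return [f"{' ' * (level * 2)}{node.tag}" for level, node in walk(root)]


# Made-up sample tree for illustration only
sample = StructNode("Document", [
    StructNode("H1"),
    StructNode("P"),
    StructNode("Table", [StructNode("TR", [StructNode("TD")])]),
])

if __name__ == "__main__":
    print("\n".join(outline(sample)))
```

Separating the traversal from the output keeps the recursion in one place; the same walk function could feed a report, a filter, or a counter.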

Reading node properties

Each structure node contains various properties that provide information about the tagged element:

  • Tag: The structure type (e.g., “H1”, “P”, “Figure”, “Table”)
  • Actual Text: Replacement text that stands in for the element’s content, for example when text is drawn as graphics
  • Alternative Text: Alternative text for images and non-text elements
  • Language: Language specification for the element
  • Abbreviation: Expanded form of abbreviations

static void PrintProperty(int level, string name, string value)
{
    // Indent by two spaces per nesting level, then print the name and value
    Console.Write($"{new string(' ', level * 2)}");
    Console.WriteLine($"{name}: '{value}'");
}

Example output

When you run the structure traversal, the output is similar to the following:

This PDF declares PDF/UA conformance.
Tag: 'Document'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Title'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Text body'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Text body'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Figure'
Alternative text: 'A test image of a document icon'
Actual text: ''
Language: ''
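
Most properties in the output above are empty. If a more compact listing is preferred, empty values can be omitted when formatting. The following Python sketch shows one way to do this; format_properties is a hypothetical helper that works on a plain dictionary of property names and values, not on the SDK's node objects.

```python
from typing import Dict, List


def format_properties(level: int, properties: Dict[str, str]) -> List[str]:
    """Format non-empty properties as indented "name: 'value'" lines."""
    indent = " " * (level * 2)
    return [
        f"{indent}{name}: '{value}'"
        for name, value in properties.items()
        if value  # skip empty strings and missing (None) values
    ]


if __name__ == "__main__":
    # Property values copied from the Figure entry in the example output
    figure = {
        "Tag": "Figure",
        "Alternative text": "A test image of a document icon",
        "Actual text": "",
        "Language": "",
    }
    print("\n".join(format_properties(1, figure)))
```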

Full example

The complete structure traversal example is available for download in C#, Java, and Python.

Use cases

Reading logical structure is useful for:

  • Accessibility auditing: Verify that documents have proper structure
  • Content extraction: Extract structured content while preserving hierarchy
  • Document analysis: Understand document organization and reading order
  • Quality assurance: Validate that remediation or creation processes worked correctly
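
As an illustration of the accessibility auditing use case, the following Python sketch reports every Figure element that has no alternative text. It is a minimal sketch under stated assumptions: StructNode is a hypothetical stand-in with tag, alternate_text, and children fields, not the Toolbox add-on's Node type, and the sample tree is made up. With the SDK, the same check would read the corresponding properties from the real structure nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StructNode:
    """Hypothetical stand-in for a structure node (not the SDK's Node class)."""
    tag: str
    alternate_text: Optional[str] = None
    children: List["StructNode"] = field(default_factory=list)


def figures_missing_alt_text(node: StructNode, path: str = "") -> List[str]:
    """Return the tag paths of all Figure nodes whose alternative text is empty."""
    current = f"{path}/{node.tag}"
    missing = []
    # Treat None, empty, and whitespace-only alternative text as missing
    if node.tag == "Figure" and not (node.alternate_text or "").strip():
        missing.append(current)
    for child in node.children:
        missing.extend(figures_missing_alt_text(child, current))
    return missing


# Made-up sample tree for illustration only
sample = StructNode("Document", children=[
    StructNode("P"),
    StructNode("Figure", alternate_text="A test image of a document icon"),
    StructNode("Sect", children=[StructNode("Figure")]),
])

if __name__ == "__main__":
    for location in figures_missing_alt_text(sample):
        print(f"Missing alternative text: {location}")
```

Reporting the tag path rather than only a count makes it easier to locate the offending element in the structure tree.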

Reading logical structure provides insights into how assistive technologies will interpret your PDF documents, making it an essential tool for accessibility validation.