Read PDF logical structure
Use the Toolbox add-on to read and traverse the logical structure of a tagged PDF document. This guide covers the technical process of accessing and analyzing existing document structure.
If you need to add logical structure to an existing PDF, see our guide on Adding logical structure to existing PDFs.
This functionality is part of the Toolbox add-on, a separate SDK that you can use with the same license key as the Pdftools SDK. To use and integrate this add-on, review Getting started with the Toolbox add-on and Toolbox add-on code samples.
For background on PDF accessibility concepts and the importance of logical structure, review A primer on PDF accessibility.
Reading logical structure involves accessing and traversing the document’s structure tree to extract information about tagged elements. Steps to read PDF logical structure:
- Opening the tagged document
- Accessing the structure tree
- Traversing the tree recursively
- Reading node properties
- Full example
Before you run any of the code in this guide, initialize the library.
Opening the tagged document
Start by opening the PDF document that contains a logical structure. Only tagged PDFs include accessible structure information.
You can then check whether the PDF claims to be PDF/UA conformant using the IsPdfUaConformant property in .NET, the getIsPdfUaConformant() method in Java, or the is_pdf_ua_conformant property in Python.
.NET:
// Open input document
using Stream inStream = new FileStream(inPath, FileMode.Open, FileAccess.Read);
using Document inDoc = Document.Open(inStream, null);
if (inDoc.IsPdfUaConformant)
{
    Console.WriteLine("This PDF declares PDF/UA conformance.");
}
else
{
    Console.WriteLine("This PDF does not declare PDF/UA conformance.");
}
Java:
// Open input document
try (FileStream inStream = new FileStream(inPath, FileStream.Mode.READ_ONLY);
     Document inDoc = Document.open(inStream, null)) {
    if (inDoc.getIsPdfUaConformant()) {
        System.out.println("This PDF declares PDF/UA conformance.");
    } else {
        System.out.println("This PDF does not declare PDF/UA conformance.");
    }
}
Python:
# Open input document
with open(input_file_path, "rb") as in_stream:
    with Document.open(in_stream, None) as in_doc:
        if in_doc.is_pdf_ua_conformant:
            print("This PDF declares PDF/UA conformance.")
        else:
            print("This PDF does not declare PDF/UA conformance.")
The isPdfUaConformant flag only reflects the PDF’s metadata declaration. It does not guarantee actual PDF/UA compliance. Use a validator to verify true conformance.
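To make "metadata declaration" concrete: PDF/UA files announce their conformance through a `pdfuaid:part` entry in the document's XMP metadata. The sketch below is independent of the Toolbox add-on and only assumes that the flag corresponds to this standard XMP identifier. `SAMPLE_XMP` and `declared_pdfua_part` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespaces used by the PDF/UA identification schema and by RDF
PDFUAID_NS = "http://www.aiim.org/pdfua/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical, hand-written XMP packet used only for this illustration
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
      <pdfuaid:part>1</pdfuaid:part>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def declared_pdfua_part(xmp: str) -> Optional[str]:
    """Return the declared PDF/UA part number, or None if nothing is declared."""
    root = ET.fromstring(xmp)
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        # The property may be written as a child element ...
        element = description.find(f"{{{PDFUAID_NS}}}part")
        if element is not None and element.text:
            return element.text.strip()
        # ... or as an attribute on rdf:Description
        attribute = description.get(f"{{{PDFUAID_NS}}}part")
        if attribute:
            return attribute.strip()
    return None


print(declared_pdfua_part(SAMPLE_XMP))
```

Anyone can write this entry into a file, which is why the declaration on its own says nothing about whether the tagging is actually correct.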
Accessing the structure tree
Create a Tree object to access the document’s logical structure. The tree provides access to the root document node and its children.
.NET:
// Create a structure tree object
var tree = new Tree(inDoc);
// Traverse all top-level structure elements
foreach (var child in tree.Children)
{
    PrintNodeRecursively(child);
}
Java:
// Create a structure tree object
Tree tree = new Tree(inDoc);
// Traverse all top-level structure elements
for (Node node : tree.getChildren()) {
    printNodeRecursively(node, 0);
}
Python:
# Create a structure tree object
tree = Tree(in_doc)
# Traverse all top-level structure elements
for node in tree.children:
    print_node_recursive(node, 0)
Traversing the tree recursively
Implement a recursive function to traverse the entire structure tree. Each node can have child nodes, creating a hierarchical structure.
.NET:
static void PrintNodeRecursively(Node node, int level = 0)
{
    // Print current node information
    PrintProperty(level, "Tag", node.Tag);
    PrintProperty(level, "Alternative text", node.AlternateText);
    PrintProperty(level, "Actual text", node.ActualText);
    PrintProperty(level, "Language", node.Language);
    // Recursively traverse child nodes
    foreach (var child in node.Children)
    {
        PrintNodeRecursively(child, level + 1);
    }
}
Java:
static void printNodeRecursively(Node node, int level) throws Exception {
    // Print current node information
    printProperty(level, "Tag", node.getTag());
    printProperty(level, "Alternative text", node.getAlternateText());
    printProperty(level, "Actual text", node.getActualText());
    printProperty(level, "Language", node.getLanguage());
    // Recursively traverse child nodes
    for (Node child : node.getChildren()) {
        printNodeRecursively(child, level + 1);
    }
}
Python:
def print_node_recursive(node: Node, level: int):
    # Print current node information
    print_property(level, "Tag", node.tag)
    print_property(level, "Alternative text", node.alternate_text)
    print_property(level, "Actual text", node.actual_text)
    print_property(level, "Language", node.language)
    # Recursively traverse child nodes
    for child in node.children:
        print_node_recursive(child, level + 1)
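If you want to separate the traversal from what happens at each node, a generator is one option. The following is a minimal sketch of the same depth-first, pre-order walk. `StubNode` is a hypothetical stand-in that only mimics the `tag`, `alternate_text`, and `children` attributes used in this guide, so the snippet runs without the Toolbox add-on. It is not the add-on's Node class.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class StubNode:
    """Hypothetical stand-in for a structure node (not the Toolbox Node class)."""
    tag: str
    alternate_text: Optional[str] = None
    children: List["StubNode"] = field(default_factory=list)


def walk(node: StubNode, level: int = 0) -> Iterator[Tuple[int, StubNode]]:
    """Yield (depth, node) pairs, visiting each parent before its children."""
    yield level, node
    for child in node.children:
        yield from walk(child, level + 1)


# Small hand-built tree used only for this illustration
doc = StubNode("Document", children=[
    StubNode("H1"),
    StubNode("P"),
    StubNode("Figure", alternate_text="A document icon"),
])

for depth, node in walk(doc):
    # Indent two spaces per nesting level
    print("  " * depth + node.tag)
```

Because `walk` only yields nodes, the same function can feed printing, filtering, or counting without changing the traversal itself.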
Reading node properties
Each structure node contains various properties that provide information about the tagged element:
- Tag: The structure type (e.g., “H1”, “P”, “Figure”, “Table”)
- Actual Text: Replacement text that stands in for the element’s content
- Alternative Text: A description of images and other non-text elements
- Language: The natural language of the element’s content
- Abbreviation: The expanded form of an abbreviation or acronym
.NET:
static void PrintProperty(int level, string name, string value)
{
    // Indent two spaces per nesting level
    Console.Write($"{new string(' ', level * 2)}");
    Console.WriteLine($"{name}: '{value}'");
}
Java:
static void printProperty(int level, String name, String value) {
    // Indent two spaces per nesting level
    for (int i = 0; i < level; ++i) {
        System.out.print("  ");
    }
    System.out.println(name + ": '" + value + "'");
}
Python:
def print_property(level: int, label: str, value):
    # Indent two spaces per nesting level
    indent = "  " * level
    value_str = str(value or "")
    print(f"{indent}{label}: '{value_str}'")
Example output
When you run the structure traversal, the output is similar to the following:
This PDF declares PDF/UA conformance.
Tag: 'Document'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Title'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Text body'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Text body'
Alternative text: ''
Actual text: ''
Language: ''
Tag: 'Figure'
Alternative text: 'A test image of a document icon'
Actual text: ''
Language: ''
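Printed lines are convenient for inspection, but a nested data structure is often easier to process further. The sketch below converts a tree into dictionaries and serializes it as JSON, omitting properties that are empty. `StubNode` is a hypothetical stand-in that mirrors the Python property names used in this guide (`tag`, `alternate_text`, `actual_text`, `language`, `children`), and `node_to_dict` is an illustrative helper, not part of the Toolbox add-on.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a structure node (not the Toolbox Node class)."""
    tag: str
    alternate_text: Optional[str] = None
    actual_text: Optional[str] = None
    language: Optional[str] = None
    children: List["StubNode"] = field(default_factory=list)


def node_to_dict(node: StubNode) -> Dict[str, Any]:
    """Convert a node and its descendants to nested dicts, skipping empty properties."""
    result: Dict[str, Any] = {"tag": node.tag}
    for name in ("alternate_text", "actual_text", "language"):
        value = getattr(node, name)
        if value:
            result[name] = value
    if node.children:
        result["children"] = [node_to_dict(child) for child in node.children]
    return result


# Small hand-built tree used only for this illustration
sample = StubNode("Document", language="en", children=[
    StubNode("Title"),
    StubNode("Figure", alternate_text="A test image of a document icon"),
])

print(json.dumps(node_to_dict(sample), indent=2))
```

Skipping empty properties keeps the result compact, at the cost of not distinguishing between a property that is missing and one that is set to an empty string.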
Full example
Download the complete structure traversal example:
Use cases
Reading logical structure is useful for:
- Accessibility auditing: Verify that documents have proper structure
- Content extraction: Extract structured content while preserving hierarchy
- Document analysis: Understand document organization and reading order
- Quality assurance: Validate that remediation or creation processes worked correctly
Reading logical structure provides insights into how assistive technologies will interpret your PDF documents, making it an essential tool for accessibility validation.
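As an illustration of the accessibility auditing use case, the following sketch reports every Figure element that has no alternative text. It assumes figures carry the standard “Figure” tag. `StubNode` and `figures_missing_alt_text` are hypothetical names, and the stand-in class only mimics the `tag`, `alternate_text`, and `children` attributes so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a structure node (not the Toolbox Node class)."""
    tag: str
    alternate_text: Optional[str] = None
    children: List["StubNode"] = field(default_factory=list)


def figures_missing_alt_text(node: StubNode, path: str = "") -> List[str]:
    """Return the tag paths of Figure nodes whose alternative text is unset or blank."""
    current = f"{path}/{node.tag}"
    findings: List[str] = []
    if node.tag == "Figure" and not (node.alternate_text or "").strip():
        findings.append(current)
    for child in node.children:
        findings.extend(figures_missing_alt_text(child, current))
    return findings


# Small hand-built tree used only for this illustration
report = StubNode("Document", children=[
    StubNode("P"),
    StubNode("Figure", alternate_text="Company logo"),
    StubNode("Sect", children=[StubNode("Figure")]),
])

print(figures_missing_alt_text(report))
```

A check like this only confirms that alternative text is present. Whether the text describes the image well still needs human review.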