Add logical structure to an existing PDF

Make PDF documents accessible by adding a logical structure using the Toolbox add-on. Learn about the technical process of PDF remediation, which means adding tags and structure to existing PDF content.

If you're creating a new PDF or you control the document creation process, consider Creating an accessible PDF from scratch instead. Creating accessible documents from scratch is more efficient and reliable than remediation.

info

This functionality is part of the Toolbox add-on, a separate SDK that you can use with the same license key as the Pdftools SDK. To use and integrate this add-on, review Getting started with the Toolbox add-on and Toolbox add-on code samples.

Quick start

Download the full sample now in C#, Java, and Python.

For background on PDF accessibility concepts and the importance of logical structure, review A primer on PDF accessibility.

Adding logical structure to an existing PDF involves analyzing the current content and selectively applying tags to create a hierarchical structure. Steps to add logical structure to an existing PDF:

Opening the existing document
Creating the logical structure tree
Identifying and copying content
Tagging specific content elements
Full example

Before you begin

You need to initialize the library.

Opening the existing document

To begin, open the existing PDF and create a new output document. Since you work with existing content, copy the document-wide metadata before remediating individual pages.

While the main work involves applying a logical structure to the PDF visual elements, specific metadata must conform to the PDF/Universal Accessibility (PDF/UA) standard. These requirements include setting a valid document language and a title. You must also configure the PDF to instruct viewers to display the document title in the title bar. Finally, once the entire PDF has been successfully tagged, you can declare it as PDF/UA compliant.

The following code shows the high-level process for opening the input document, creating the output document, copying document-wide data, and ensuring all required metadata fields have suitable values.

.NET
Java
Python

// Open input document
using Stream inStream = new FileStream(inPath, FileMode.Open, FileAccess.Read);
using Document inDoc = Document.Open(inStream, null);

// Create output document
using Stream outStream = new FileStream(outPath, FileMode.Create, FileAccess.ReadWrite);
using Document outDoc = Document.Create(outStream, inDoc.Conformance, null);

CopyDocumentData(inDoc, outDoc);

// Set required metadata for PDF/UA compliance
outDoc.Language = "en";
outDoc.Metadata.Title = "TaggedPDF";
outDoc.ViewerSettings.DisplayDocumentTitle = true;

// Remediating a PDF manually requires knowledge of the input PDF's layout.
// This example assumes that the input is a single-page PDF.
if (inDoc.Pages.Count != 1)
    throw new InvalidOperationException("Unexpected number of pages: " + inDoc.Pages.Count);

RemediatePage(inDoc, 0, outDoc);

outDoc.SetPdfUaConformant();

try (FileStream inStream = new FileStream(inPath, FileStream.Mode.READ_ONLY);
     Document inDoc = Document.open(inStream, null);
     FileStream outStream = new FileStream(outPath, FileStream.Mode.READ_WRITE_NEW);
     Document outDoc = Document.create(outStream, inDoc.getConformance(), null)) {

    copyDocumentData(inDoc, outDoc);

    // Set required metadata for PDF/UA compliance
    outDoc.setLanguage("en");
    outDoc.getMetadata().setTitle("TaggedPDF");
    outDoc.getViewerSettings().setDisplayDocumentTitle(true);

    // Remediating a PDF manually requires knowledge of the input PDF's layout.
    // This example assumes that the input is a single-page PDF.
    if (inDoc.getPages().size() != 1) {
        throw new IllegalStateException("Unexpected number of pages: " + inDoc.getPages().size());
    }

    remediatePage(inDoc, 0, outDoc);

    outDoc.setPdfUaConformant();
}

# Open input document
with open(in_path, 'rb') as in_stream:
    with Document.open(in_stream, None) as in_doc:
        # Create output document
        with open(out_path, 'wb+') as out_stream:
            with Document.create(out_stream, in_doc.conformance, None) as out_doc:
                copy_document_data(in_doc, out_doc)

                # Set required metadata for PDF/UA compliance
                out_doc.language = "en"
                out_doc.metadata.title = "TaggedPDF"
                out_doc.viewer_settings.display_document_title = True

                # Remediating a PDF manually requires knowledge of the input PDF's layout.
                # This example assumes that the input is a single-page PDF.
                if len(in_doc.pages) != 1:
                    raise ValueError(f"Unexpected number of pages: {len(in_doc.pages)}")

                remediate_page(in_doc, 0, out_doc)

                out_doc.set_pdf_ua_conformant()
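
The three metadata settings described in this section (language, title, and displaying the title in the title bar) can be checked before you declare conformance. The following is a minimal sketch in plain Python that doesn't use the Toolbox add-on: `DocumentMetadata` and `missing_pdf_ua_metadata` are hypothetical names introduced for illustration, standing in for the values you set on the output document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DocumentMetadata:
    """Hypothetical stand-in for the values set on the output document."""
    language: Optional[str]
    title: Optional[str]
    display_document_title: bool


def missing_pdf_ua_metadata(meta: DocumentMetadata) -> List[str]:
    """Return a description of each required setting that is still missing."""
    problems = []
    if not (meta.language or "").strip():
        problems.append("document language is not set")
    if not (meta.title or "").strip():
        problems.append("document title is not set")
    if not meta.display_document_title:
        problems.append("viewer is not told to display the document title")
    return problems


if __name__ == "__main__":
    # Example: the title was left empty, so one problem is reported.
    meta = DocumentMetadata(language="en", title="", display_document_title=True)
    for problem in missing_pdf_ua_metadata(meta):
        print(problem)
```

A check like this only confirms that the values are present. Whether the language and title are correct for the document still needs a human decision.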

Creating the logical structure tree

The RemediatePage method handles the page-level remediation process. It begins by creating a new, empty page in the output document that matches the dimensions of the original page.

.NET
Java
Python

private static void RemediatePage(Document inDoc, int pageIndex, Document outDoc)
{
    Page inPage = inDoc.Pages[pageIndex];
    Page outPage = Page.Create(outDoc, inPage.Size);

    CopyAndTagContent(inPage, outPage, outDoc);

    outDoc.Pages.Add(outPage);
}

private static void remediatePage(Document inDoc, int pageIndex, Document outDoc) throws Exception {
    Page inPage = inDoc.getPages().get(pageIndex);
    Page outPage = Page.create(outDoc, inPage.getSize());

    copyAndTagContent(inPage, outPage, outDoc);

    outDoc.getPages().add(outPage);
}

def remediate_page(in_doc: Document, page_index: int, out_doc: Document):
    in_page = in_doc.pages[page_index]
    out_page = Page.create(out_doc, in_page.size)

    copy_and_tag_content(in_page, out_page, out_doc)

    out_doc.pages.append(out_page)

Instantiating a Tree object creates the logical structure on the first call and returns the existing tree on all later calls. The root node is always of type DocumentNode. For this example, the code wraps the entire page in a section node (Sect) to define the high-level logical structure.

Next, a ContentExtractor retrieves content elements from the source page, and a ContentGenerator re-creates them on the destination page. The for-loop iterates through each element from the source page, identifies it, applies the correct tag, and then copies it to the destination.

.NET
Java
Python

private static void CopyAndTagContent(Page inPage, Page outPage, Document outDoc)
{
    // Create or retrieve the structure tree.
    var structTree = new Tree(outDoc);
    var documentNode = structTree.DocumentNode;

    // Create a section node for the page content and add it to the document structure.
    var section = new Node("Sect", outDoc, outPage);
    documentNode.Children.Add(section);

    ContentExtractor extractor = new ContentExtractor(inPage.Content);
    using ContentGenerator generator = new ContentGenerator(outPage.Content, false);

    foreach (ContentElement element in extractor)
    {
        // The logic to identify, tag, and add the element to the generator goes here.
        // This process is detailed in the following examples 
        // and is specific to the document's layout.
        
        // PDF/UA requires all content to be tagged. 
        // Throw an exception for any unrecognized element to avoid untagged content.
        throw new InvalidOperationException("Unexpected content element found.");
    }
}

private static void copyAndTagContent(Page inPage, Page outPage, Document outDoc) throws Exception {
    // Create or retrieve the structure tree.
    Tree structTree = new Tree(outDoc);
    Node documentNode = structTree.getDocumentNode();

    // Create a section node for the page content and add it to the document structure.
    Node section = new Node("Sect", outDoc, outPage);
    documentNode.getChildren().add(section);

    ContentExtractor extractor = new ContentExtractor(inPage.getContent());
    try (ContentGenerator generator = new ContentGenerator(outPage.getContent(), false)) {
        for (ContentElement element : extractor) {
            // The logic to identify, tag, and add the element to the generator goes here.
            // This process is detailed in the following examples
            // and is specific to the document's layout.
            
            // PDF/UA requires all content to be tagged.
            // Throw an exception for any unrecognized element to avoid untagged content.
            throw new IllegalStateException("Unexpected content element found.");
        }
    }
}

def copy_and_tag_content(in_page: Page, out_page: Page, out_doc: Document):
    # Create or retrieve the structure tree.
    struct_tree = Tree(out_doc)
    document_node = struct_tree.document_node

    # Create a section node for the page content and add it to the document structure.
    section = Node("Sect", out_doc, out_page)
    document_node.children.append(section)

    extractor = ContentExtractor(in_page.content)
    with ContentGenerator(out_page.content, False) as generator:
        for element in extractor:
            # The logic to identify, tag, and add the element to the generator goes here.
            # This process is detailed in the following examples 
            # and is specific to the document's layout.
            
            # PDF/UA requires all content to be tagged. 
            # Throw an exception for any unrecognized element to avoid untagged content.
            raise ValueError("Unexpected content element found.")
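
One way to organize the identification logic inside the loop above is an ordered list of rules, where the first matching rule decides the tag and an element that matches no rule raises an error. The following is a minimal sketch in plain Python, independent of the Toolbox add-on: `TagRule`, `choose_tag`, and `RULES` are hypothetical names, and each element is represented only by its text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TagRule:
    """Hypothetical rule: a predicate on the element text and the tag to apply."""
    matches: Callable[[str], bool]
    tag: str


def choose_tag(text: str, rules: List[TagRule]) -> str:
    """Return the tag of the first matching rule; raise if nothing matches."""
    for rule in rules:
        if rule.matches(text):
            return rule.tag
    # Mirrors the loop above: unrecognized content must not stay untagged.
    raise ValueError(f"Unexpected content element found: {text!r}")


# Sample rules for the texts used in this guide's example document.
RULES = [
    TagRule(lambda t: t == "This is a properly tagged heading", "H1"),
    TagRule(lambda t: t.startswith("This is a properly tagged paragraph."), "P"),
]
```

Keeping the rules in a list makes their order explicit, which matters when more than one rule could match the same element.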

Identifying and copying content

Identifying content elements so that you can apply the correct tag requires specific knowledge of the document’s layout. Text elements can be identified by their text content, while non-textual elements such as images can be identified by their position and size on the page (for example, by their bounding box).

The following code demonstrates these two approaches. This logic goes inside the loop over the extracted content elements shown earlier.

.NET
Java
Python

if (element is TextElement textElement)
{
    if (textElement.Text[0].Text == "This is a properly tagged heading")
    {
        CopyAndTagTextElement(textElement, section, generator, outPage, outDoc, "H1");
        continue;
    }
    if (textElement.Text[0].Text.StartsWith("This is a properly tagged paragraph."))
    {
        CopyAndTagTextElement(textElement, section, generator, outPage, outDoc, "P");
        continue;
    }
    // Add more logic here to identify other text elements.
}
else if (element is ImageElement imageElement)
{
    var bbox = imageElement.Transform.TransformRectangle(element.BoundingBox);
    if (Math.Abs(bbox.BottomLeft.X - 56.7) < 0.5
        && Math.Abs(bbox.BottomLeft.Y - 600.489) < 0.5
        && Math.Abs(bbox.TopRight.X - 152.489) < 0.5
        && Math.Abs(bbox.TopRight.Y - 696.489) < 0.5)
    {
        CopyAndTagImageElement(imageElement, documentNode, generator, outPage, outDoc, "PdfTools AG Logo");
        continue;
    }
    // Add more logic here to identify other image elements.
}
// Remember, if an element is not identified by the logic above, the surrounding
// code will throw an exception to ensure all content is tagged.

if (element instanceof TextElement) {
    TextElement textElement = (TextElement) element;
    if (textElement.getText().get(0).getText().equals("This is a properly tagged heading")) {
        copyAndTagTextElement(textElement, section, generator, outPage, outDoc, "H1");
        continue;
    }
    if (textElement.getText().get(0).getText().startsWith("This is a properly tagged paragraph.")) {
        copyAndTagTextElement(textElement, section, generator, outPage, outDoc, "P");
        continue;
    }
    // Add more logic here to identify other text elements.
} else if (element instanceof ImageElement) {
    ImageElement imageElement = (ImageElement) element;
    Quadrilateral bbox = imageElement.getTransform().transformRectangle(element.getBoundingBox());
    if (Math.abs(bbox.getBottomLeft().getX() - 56.7) < 0.5
        && Math.abs(bbox.getBottomLeft().getY() - 600.489) < 0.5
        && Math.abs(bbox.getTopRight().getX() - 152.489) < 0.5
        && Math.abs(bbox.getTopRight().getY() - 696.489) < 0.5) {
        copyAndTagImageElement(imageElement, documentNode, generator, outPage, outDoc, "PdfTools AG Logo");
        continue;
    }
    // Add more logic here to identify other image elements.
}
// Remember, if an element is not identified by the logic above, the surrounding
// code will throw an exception to ensure all content is tagged.

if isinstance(element, TextElement):
    if element.text[0].text == "This is a properly tagged heading":
        copy_and_tag_text_element(element, section, generator, out_page, out_doc, "H1")
        continue
    if element.text[0].text.startswith("This is a properly tagged paragraph."):
        copy_and_tag_text_element(element, section, generator, out_page, out_doc, "P")
        continue
    # Add more logic here to identify other text elements.
elif isinstance(element, ImageElement):
    bbox = element.transform.transform_rectangle(element.bounding_box)
    if (abs(bbox.bottom_left.x - 56.7) < 0.5
            and abs(bbox.bottom_left.y - 600.489) < 0.5
            and abs(bbox.top_right.x - 152.489) < 0.5
            and abs(bbox.top_right.y - 696.489) < 0.5):
        copy_and_tag_image_element(element, document_node, generator, out_page, out_doc, "PdfTools AG Logo")
        continue
    # Add more logic here to identify other image elements.
# Remember, if an element is not identified by the logic above, the surrounding
# code will throw an exception to ensure all content is tagged.
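
The four coordinate comparisons in the image branch above can be folded into one helper. The sketch below is plain Python with no dependency on the Toolbox add-on: `Box`, `box_matches`, and `LOGO_BOX` are hypothetical names, and the 0.5-point tolerance and the coordinates are taken from the example above.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical axis-aligned box in page coordinates (points)."""
    left: float
    bottom: float
    right: float
    top: float


def box_matches(actual: Box, expected: Box, tolerance: float = 0.5) -> bool:
    """Return True if every edge of `actual` is within `tolerance` of `expected`."""
    return all(abs(a - e) < tolerance for a, e in zip(actual, expected))


# Expected position of the logo image in the example document.
LOGO_BOX = Box(left=56.7, bottom=600.489, right=152.489, top=696.489)
```

Comparing with a tolerance instead of testing for equality keeps the match stable when the transformed coordinates carry small rounding differences.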

Tagging specific content elements

Successful remediation depends on identifying which content elements need structure tags and applying suitable tags to them. The following examples show how to handle different types of content:

Tagging text elements

When tagging text, the main goal is to assign the correct semantic role, such as a heading (H1), a paragraph (P), or a list item (LI). Providing ActualText also helps screen readers and other assistive technologies interpret the content accurately.

.NET
Java
Python

private static void CopyAndTagTextElement(
    TextElement inElement, Node section, ContentGenerator generator,
    Page outPage, Document outDoc, string tag)
{
    // Create a text structure node (e.g., "H1"), set its ActualText
    // for accessibility, and add it to the parent section.
    Node headerElement = new Node(tag, outDoc, outPage);
    headerElement.ActualText = inElement.Text[0].Text;
    headerElement.Language = "en";

    section.Children.Add(headerElement);

    // Wrap the original text element in the new tag and add it to the page content.
    generator.TagAs(headerElement);
    generator.AppendContentElement(ContentElement.Copy(outDoc, inElement));
    generator.StopTagging();
}

private static void copyAndTagTextElement(
    TextElement inElement, Node section, ContentGenerator generator,
    Page outPage, Document outDoc, String tag) throws Exception {
    // Create a text structure node (e.g., "H1"), set its ActualText
    // for accessibility, and add it to the parent section.
    Node headerElement = new Node(tag, outDoc, outPage);
    headerElement.setActualText(inElement.getText().get(0).getText());
    headerElement.setLanguage("en");

    section.getChildren().add(headerElement);

    // Wrap the original text element in the new tag and add it to the page content.
    generator.tagAs(headerElement);
    generator.appendContentElement(ContentElement.copy(outDoc, inElement));
    generator.stopTagging();
}

def copy_and_tag_text_element(
    in_element: TextElement, section: Node, generator: ContentGenerator,
    out_page: Page, out_doc: Document, tag: str
):
    # Create a text structure node (e.g., "H1"), set its ActualText
    # for accessibility, and add it to the parent section.
    header_element = Node(tag, out_doc, out_page)
    header_element.actual_text = in_element.text[0].text
    header_element.language = "en"

    section.children.append(header_element)

    # Wrap the original text element in the new tag and add it to the page content.
    generator.tag_as(header_element)
    generator.append_content_element(ContentElement.copy(out_doc, in_element))
    generator.stop_tagging()

Tagging image elements

For images, the most important accessibility feature is descriptive alternate text. Screen readers read this text aloud for users who can’t see the image. Images are typically tagged as Figure elements, and you need to define their bounding box to associate the tag with the correct location on the page.

.NET
Java
Python

private static void CopyAndTagImageElement(
    ImageElement inElement, Node documentNode, ContentGenerator generator, 
    Page outPage, Document outDoc, string alternateText)
{
    // Create a "Figure" structure node, set its accessibility properties 
    // (Alternate Text, Bounding Box), and add it to the document's structure tree.
    Node imgNode = new Node("Figure", outDoc, outPage);
    imgNode.AlternateText = alternateText;
    imgNode.Language = "en";

    Quadrilateral bbox = inElement.Transform.TransformRectangle(inElement.BoundingBox);
    Rectangle rectangle = new Rectangle();
    rectangle.Left = bbox.BottomLeft.X;
    rectangle.Bottom = bbox.BottomLeft.Y;
    rectangle.Right = bbox.TopRight.X;
    rectangle.Top = bbox.TopRight.Y;
    imgNode.BoundingBox = rectangle;
    imgNode.SetStringAttribute("O", "Layout");

    documentNode.Children.Add(imgNode);

    // Wrap the original image element in the new tag and add it to the page content.
    generator.TagAs(imgNode);
    generator.AppendContentElement(ContentElement.Copy(outDoc, inElement));
    generator.StopTagging();
}

private static void copyAndTagImageElement(
    ImageElement inElement, Node documentNode, ContentGenerator generator,
    Page outPage, Document outDoc, String alternateText) throws Exception {
    // Create a "Figure" structure node, set its accessibility properties
    // (Alternate Text, Bounding Box), and add it to the document's structure tree.
    Node imgNode = new Node("Figure", outDoc, outPage);
    imgNode.setAlternateText(alternateText);
    imgNode.setLanguage("en");

    Quadrilateral bbox = inElement.getTransform().transformRectangle(inElement.getBoundingBox());
    Rectangle rectangle = new Rectangle();
    rectangle.setLeft(bbox.getBottomLeft().getX());
    rectangle.setBottom(bbox.getBottomLeft().getY());
    rectangle.setRight(bbox.getTopRight().getX());
    rectangle.setTop(bbox.getTopRight().getY());
    imgNode.setBoundingBox(rectangle);
    imgNode.setStringAttribute("O", "Layout");

    documentNode.getChildren().add(imgNode);

    // Wrap the original image element in the new tag and add it to the page content.
    generator.tagAs(imgNode);
    generator.appendContentElement(ContentElement.copy(outDoc, inElement));
    generator.stopTagging();
}

def copy_and_tag_image_element(
    in_element: ImageElement, document_node: Node, generator: ContentGenerator,
    out_page: Page, out_doc: Document, alternate_text: str
):
    # Create a "Figure" structure node, set its accessibility properties 
    # (Alternate Text, Bounding Box), and add it to the document's structure tree.
    img_node = Node("Figure", out_doc, out_page)
    img_node.alternate_text = alternate_text
    img_node.language = "en"

    bbox = in_element.transform.transform_rectangle(in_element.bounding_box)
    rectangle = Rectangle(
        left=bbox.bottom_left.x,
        bottom=bbox.bottom_left.y,
        right=bbox.top_right.x,
        top=bbox.top_right.y
    )
    img_node.bounding_box = rectangle
    img_node.set_string_attribute("O", "Layout")

    document_node.children.append(img_node)

    # Wrap the original image element in the new tag and add it to the page content.
    generator.tag_as(img_node)
    generator.append_content_element(ContentElement.copy(out_doc, in_element))
    generator.stop_tagging()
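
In the helpers in this section, text nodes are added under the Sect node, while the Figure node is added directly under the document node. The following sketch models that hierarchy in plain Python to make the nesting visible. `StructNode` and `outline` are hypothetical names unrelated to the add-on's Node class, and the tree shown is the assumed outcome for the single-page example, with the heading extracted before the paragraph.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructNode:
    """Hypothetical structure node: a tag name plus ordered children."""
    tag: str
    children: List["StructNode"] = field(default_factory=list)


def outline(node: StructNode, depth: int = 0) -> List[str]:
    """Return one indented line per node, in depth-first order."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Assumed outcome for the single-page example: heading and paragraph
# under the section, the logo figure directly under the document node.
section = StructNode("Sect", [StructNode("H1"), StructNode("P")])
document = StructNode("Document", [section, StructNode("Figure")])

print("\n".join(outline(document)))
```

Writing the expected tree down in this form before you start coding can help you decide which parent node each helper should receive.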

Best practices for remediation

When adding logical structure to existing PDF documents, consider these best practices:

Content analysis: Use heuristics to identify content types, but always validate results manually. Automated detection of headings, paragraphs, and other elements has inherent limitations.
Quality assurance: Always validate the remediated PDF with accessibility checkers and screen readers to make sure the logical structure works as intended.
Alternative text: Meaningful alternative text for images can’t be automated. Always perform a human review before claiming PDF/UA compliance.

Full example

Download the complete remediation example:

Next steps

After adding logical structure to your PDF:

Validate the structure: Use PDF accessibility checkers to verify the logical structure.
Test with assistive technology: Ensure screen readers can navigate the document correctly.
Review alternative text: Manually verify that all images have meaningful descriptions.
Check reading order: Confirm the logical structure follows the intended reading flow.
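
For a simple single-column page, one rough way to spot reading-order problems is to compare the order in which elements were tagged with their top-to-bottom position. The sketch below is plain Python with hypothetical names (`TaggedItem`, `out_of_order`). It assumes the default PDF coordinate system, where y grows upward, and it doesn't replace testing with a screen reader.

```python
from typing import List, NamedTuple


class TaggedItem(NamedTuple):
    """Hypothetical record: a tag and the top edge of its bounding box."""
    tag: str
    top: float


def out_of_order(items: List[TaggedItem]) -> List[TaggedItem]:
    """Return items that sit higher on the page than the item tagged before them."""
    flagged = []
    for previous, current in zip(items, items[1:]):
        # With y growing upward, a larger `top` means higher on the page.
        if current.top > previous.top:
            flagged.append(current)
    return flagged
```

A flagged item isn't necessarily an error, because multi-column layouts, sidebars, and captions legitimately break a strict top-to-bottom order. Treat the result as a list of places to review by hand.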

info

Remember: adding logical structure to existing PDF documents requires technical skill, but true accessibility requires human understanding and validation. Always perform manual testing to make sure the remediated document serves its intended users effectively.
