Convert to XML workflow

The Convert to XML workflow automatically extracts searchable text from documents, images, and office files and outputs the results as structured XML.

Supported file formats for Convert to XML workflow

The Convert to XML workflow supports the following file formats:

Content type    File type
PDF formats     PDF 1.x, PDF 2.0, PDF/A-1, PDF/A-2, PDF/A-3
Image formats   JPEG, JPEG2000, TIFF, BMP, GIF, JBIG2, PNG, HEIC, HEIF, WebP, JFIF
Email           EML, MSG (without encryption)
Word            DOC, DOT, DOCX, DOCM, DOTX, DOTM, RTF
Excel           XLS, XLT, XLSX, XLSM, XLTX, XLTM
PowerPoint      PPT, PPS, PPTX, PPTM, PPSX, PPSM
OpenOffice      ODT, ODS, ODP
Other           CSV, HTML, HTM (prepared for archiving), TXT, ZIP (without password protection)

Format-specific limitations

Some file formats have specific limitations when used with the Convert to XML workflow:

  • OpenOffice formats: PDF conversion depends on rendering in Microsoft Word, Excel, or PowerPoint. The conversion can result in visual differences in tables and tabs. Visual differences caused by differences in shape rendering are usually unacceptable.
  • HTML: Documents need to be self-contained (layout information and images are either inline or available on the web) and suited for portrait page layout. JavaScript content is disabled during processing.
  • XML input files: The Convert to XML workflow can’t process XML files. If you submit an XML file to this workflow, it generates an error.
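As a rough illustration of the table and limitations above, a client could pre-check file names before submitting them. This is only a sketch: the function name `precheck` and the mapping from format names to file extensions (for example `jp2` for JPEG2000 or `jb2` for JBIG2) are assumptions and not part of the product. An extension check also cannot detect encrypted MSG files or password-protected ZIP archives.

```python
from pathlib import Path

# Extensions derived from the supported-formats table. The mapping from format
# names to extensions (for example "jp2" for JPEG2000) is an assumption.
SUPPORTED_EXTENSIONS = {
    "pdf",
    "jpg", "jpeg", "jp2", "tif", "tiff", "bmp", "gif", "jb2", "png",
    "heic", "heif", "webp", "jfif",
    "eml", "msg",
    "doc", "dot", "docx", "docm", "dotx", "dotm", "rtf",
    "xls", "xlt", "xlsx", "xlsm", "xltx", "xltm",
    "ppt", "pps", "pptx", "pptm", "ppsx", "ppsm",
    "odt", "ods", "odp",
    "csv", "html", "htm", "txt", "zip",
}


def precheck(filename: str) -> str:
    """Classify a file name by extension only (hypothetical client-side helper)."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext == "xml":
        # The workflow generates an error for XML input, so reject it early.
        return "rejected: XML input is not processed by this workflow"
    if ext in SUPPORTED_EXTENSIONS:
        return "ok"
    return "unknown format"


for name in ("report.DOCX", "data.xml", "setup.exe"):
    print(name, "->", precheck(name))
```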

The conversion of most file formats is enabled by default in the profile’s Convert mode for child documents (attachments).

Features and limitations

Compared to the other workflows, this workflow has the following features and limitations:

Resolution settings

You can configure the resolution (in DPI) used when converting PDF pages to TIFF images. Higher resolution values produce more detailed images, which can improve text recognition accuracy but increase processing time.
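To get a feel for the trade-off, the pixel dimensions of a rendered page can be estimated from the page size and the DPI value. The sketch below uses an A4 page (210 × 297 mm) as an example; the function name is hypothetical, and the actual image size produced by the workflow may differ.

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


# A4 page at two common resolutions.
low = page_pixels(210, 297, 150)   # (1240, 1754)
high = page_pixels(210, 297, 300)  # (2480, 3508)

# Doubling the DPI roughly quadruples the number of pixels to process.
ratio = (high[0] * high[1]) / (low[0] * low[1])
print(low, high, round(ratio, 2))
```

Because the pixel count grows with the square of the resolution, a moderate DPI increase can add noticeably to processing time.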

Collect mode

All documents processed in a job are merged together before text extraction. The Merge collect mode is used to combine documents and their child documents.

Configure the workflow

The workflow’s profile provides detailed configuration options for the conversion and text extraction processes. All processing steps can be enabled and customized within the profile configuration. The Convert to XML workflow includes the following configurable features:

Convert mode for child documents (attachments)

Certain child documents, such as attachments to emails or PDF documents, can be skipped (removed) during conversion. The convert mode can be specified based on the type of the child document, its filename, or the type of its parent document.

For example, by default executables attached to an email are removed.

Info: XML files are explicitly excluded from conversion in this workflow. If the workflow encounters an XML file as a child document, it generates an error.
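The rule-based selection described in this section can be pictured as a list of rules that is checked in order. The following is a conceptual sketch only: the `Rule` class, the mode names `convert`, `skip`, and `error`, the type labels, and the first-match-wins evaluation are all assumptions made for illustration and do not reflect the actual profile format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass
class Rule:
    mode: str                            # hypothetical: "convert", "skip", or "error"
    child_type: Optional[str] = None     # None matches any child type
    filename_glob: Optional[str] = None  # None matches any filename
    parent_type: Optional[str] = None    # None matches any parent type


# Hypothetical rules mirroring the two behaviors described in the text.
RULES: List[Rule] = [
    Rule(mode="error", child_type="xml"),
    Rule(mode="skip", filename_glob="*.exe", parent_type="email"),
]


def convert_mode(child_type: str, filename: str, parent_type: str,
                 rules: List[Rule] = RULES, default: str = "convert") -> str:
    """Return the mode of the first matching rule (assumed evaluation order)."""
    for rule in rules:
        if rule.child_type is not None and rule.child_type != child_type:
            continue
        if rule.filename_glob is not None and not fnmatchcase(
                filename.lower(), rule.filename_glob):
            continue
        if rule.parent_type is not None and rule.parent_type != parent_type:
            continue
        return rule.mode
    return default


print(convert_mode("executable", "Setup.EXE", "email"))  # skip
print(convert_mode("xml", "invoice.xml", "pdf"))         # error
print(convert_mode("word", "notes.docx", "email"))       # convert
```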

Collect mode

The collect mode configuration defines how a converted document and its child documents are combined. The collect mode can be configured for each document type and also defines how errors are handled.

For example, emails can be converted by merging their bodies and attachments. When converting Word documents, all embedded files can be merged into the converted PDF.

Job and document options for the Convert to XML workflow

The Convert to XML workflow accepts job options and document options. Use them to pass values that apply to a specific job or a specific document when the workflow processes it.

Note: Job and document options you pass at runtime affect only the current job. For the next job, the workflow reverts to the profile settings (the saved defaults) unless you pass job or document options again.

Document options

Document options apply only to a specific input document. They let you set properties for an individual document rather than globally through the job or the profile. Subsequent jobs processed with the workflow profile use the default profile settings.

Type                Option          Description
Document property   DOC.PASSWORD    Set the password for the document.
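The relationship between profile settings and runtime options can be sketched as a simple layering of dictionaries. This is an illustration under stated assumptions: the function name, the key `EXAMPLE.SETTING`, and the precedence order (document options over job options over profile defaults) are hypothetical; only `DOC.PASSWORD` comes from the table above.

```python
from typing import Dict, Optional

# Saved profile defaults. "EXAMPLE.SETTING" is a made-up key for illustration.
PROFILE_DEFAULTS: Dict[str, str] = {"EXAMPLE.SETTING": "profile-value"}


def effective_settings(profile: Dict[str, str],
                       job_options: Optional[Dict[str, str]] = None,
                       document_options: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Layer runtime options over a copy of the profile (assumed precedence)."""
    settings = dict(profile)                 # copy, so the profile stays unchanged
    settings.update(job_options or {})       # job-wide values for this job only
    settings.update(document_options or {})  # values for one input document only
    return settings


# Current job: one document needs a password.
current = effective_settings(
    PROFILE_DEFAULTS,
    job_options={"EXAMPLE.SETTING": "job-value"},
    document_options={"DOC.PASSWORD": "secret"},
)

# Next job without options: the saved profile defaults apply again.
following = effective_settings(PROFILE_DEFAULTS)

print(current)
print(following)
```

Working on a copy of the profile is what models the "current job only" behavior: nothing passed at runtime is written back to the saved defaults.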