Version: 1.0.0

Scale the Pdftools OCR Service

The Pdftools OCR service uses a master-worker architecture. The central master node, called the Pdftools OCR Service Manager, distributes tasks to multiple worker nodes, called Pdftools OCR Service Workers, which perform the actual processing.

The current setup simplifies initial configuration by letting you install and run a preconfigured Pdftools OCR Service Manager that communicates with a single Pdftools OCR Service Worker. In this guide, you’ll learn how to scale horizontally by configuring the manager to work with multiple worker nodes.

The manager node communicates with worker nodes through a RESTful API.

Scaling worker

  1. Locate the manager configuration file. In a default installation, the file is located at:

    C:\Program Files\Pdftools\Pdftools OCR Service\PdftoolsOcrService\appsettings.json
  2. Point ServiceCommunication to your load balancer. In this example, the load balancer runs on the same host as the manager and listens on port 8080:

    {
      "ServiceCommunication": {
        "ServiceCommunicationType": "Rest",
        "ConnectionString": "http://localhost:8080/"
      }
    }
  3. Install workers on different host machines. You don't need to install the manager again.

    Screenshot of the Pdftools OCR Service Windows MSI installer.

    In this example, there are three workers installed on:

    • 192.168.1.101:7998
    • 192.168.1.102:7998
    • 192.168.1.103:7998
  4. Configure a load balancer (for example, nginx) to distribute requests to your worker nodes. In this example, the load balancer listens on port 8080, matching the ConnectionString set in step 2:

    # nginx requires an events block; the defaults are sufficient here.
    events {}

    http {
        upstream backend_servers {
            # List of backend servers (the worker nodes)
            server 192.168.1.101:7998;
            server 192.168.1.102:7998;
            server 192.168.1.103:7998;
        }

        server {
            # Port the manager's ConnectionString points to
            listen 8080;

            location / {
                proxy_pass http://backend_servers;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            }
        }
    }
  5. Add a license key to each worker:

    {
      "Licensing": {
        "LicenseKey": "<LICENSE_KEY>"
      }
    }

    In a default installation, the worker configuration file is located at:

    C:\Program Files\Pdftools\Pdftools OCR Service\PdftoolsOcrWorker\appsettings.json
  6. Configure a file storage location that all nodes share:

    {
      "FileStorage": {
        "FileStorageType": "HostFileSystem",
        "FilesDirectoryPath": "F:/SharedFolder/ProgramData/Pdftools/OcrService/Files"
      }
    }
    • FileStorage
      • FileStorageType: Storage system type (for example, HostFileSystem).
      • FilesDirectoryPath: Directory path for storing OCR-processed files.
  7. Make sure every manager and worker node:

    • Uses the same FileStorage settings.
    • Has read and write permission for the shared directory.
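The checks in step 7 can be scripted. The following is a minimal sketch, not part of the product: the helper names `settings_match` and `can_read_and_write` are hypothetical, and it assumes each `appsettings.json` is plain JSON without comments. It compares the `FileStorage` section across nodes and probes a directory by writing, reading back, and deleting a small file.

```python
import json
import os
import tempfile
import uuid


def file_storage_settings(appsettings_text):
    """Return the FileStorage section of an appsettings.json document, or None."""
    return json.loads(appsettings_text).get("FileStorage")


def settings_match(appsettings_texts):
    """Return True if every document has an identical FileStorage section."""
    sections = [file_storage_settings(text) for text in appsettings_texts]
    return all(s is not None and s == sections[0] for s in sections)


def can_read_and_write(directory):
    """Write a probe file into the directory, read it back, then delete it."""
    probe = os.path.join(directory, f"probe-{uuid.uuid4().hex}.tmp")
    try:
        with open(probe, "w", encoding="utf-8") as handle:
            handle.write("ok")
        with open(probe, encoding="utf-8") as handle:
            return handle.read() == "ok"
    except OSError:
        # Missing directory or insufficient permissions
        return False
    finally:
        if os.path.exists(probe):
            os.remove(probe)


if __name__ == "__main__":
    # Stand-in for the shared folder; replace with your FilesDirectoryPath.
    with tempfile.TemporaryDirectory() as shared:
        print("read/write ok:", can_read_and_write(shared))
```

Run the directory probe on each node under the account that runs the service, because permissions on a network share usually depend on the user.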

Scaling manager

You can scale the Pdftools OCR Service Manager similarly to the worker nodes, but you first need to switch from SQLite to a production-ready database.

  1. Configure PostgreSQL:
    {
      "Database": {
        "DatabaseType": "PostgreSql",
        "ConnectionString": "User ID=myUser;Password=mySecurePassword;Server=my.database.com;Port=5432;Database=ocr-service-db;",
        "DeleteJobsAfterDays": 2
      }
    }
  2. Install manager nodes on separate hosts, for example:
    • 192.168.1.101:7982
    • 192.168.1.102:7982
    • 192.168.1.103:7982
  3. Configure a load balancer (for example, nginx). The load balancer can listen on port 7982 provided that it runs on a different host than the manager nodes, so that the port does not conflict with a manager:
    # nginx requires an events block; the defaults are sufficient here.
    events {}

    http {
        upstream backend_servers {
            # List of backend servers (the manager nodes)
            server 192.168.1.101:7982;
            server 192.168.1.102:7982;
            server 192.168.1.103:7982;
        }

        server {
            # Port that clients use to reach the managers
            listen 7982;

            location / {
                proxy_pass http://backend_servers;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            }
        }
    }
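
The worker and manager load balancer configurations differ only in the node list and the listening port, so you can generate them from one template. The following is a minimal sketch under that assumption. The function name `render_nginx_conf` is hypothetical, and the `events {}` block is included because nginx does not start without one.

```python
def render_nginx_conf(nodes, listen_port):
    """Render an nginx configuration that balances requests across the nodes.

    nodes: list of "host:port" strings, one per worker or manager node.
    listen_port: port on which the load balancer accepts requests.
    """
    if not nodes:
        raise ValueError("at least one backend node is required")
    servers = "\n".join(f"        server {node};" for node in nodes)
    # Doubled braces are literal braces inside an f-string.
    return f"""events {{}}

http {{
    upstream backend_servers {{
{servers}
    }}

    server {{
        listen {listen_port};

        location / {{
            proxy_pass http://backend_servers;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }}
    }}
}}
"""


if __name__ == "__main__":
    managers = ["192.168.1.101:7982", "192.168.1.102:7982", "192.168.1.103:7982"]
    print(render_nginx_conf(managers, 7982))
```

Calling the function with the three worker addresses and port 8080 produces the equivalent of the worker configuration. Validate the generated file with `nginx -t` before reloading.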