Scale the Pdftools OCR Service
The Pdftools OCR service uses a master-worker architecture. The central master node, called the Pdftools OCR Service Manager, distributes tasks to multiple worker nodes, called Pdftools OCR Service Workers, which perform the actual processing.
The current setup simplifies initial configuration by letting you install and run a preconfigured Pdftools OCR Service Manager that communicates with a single Pdftools OCR Service Worker. In this guide, you’ll learn how to scale horizontally by configuring the manager to work with multiple worker nodes.
The manager node communicates with worker nodes through a RESTful API.
Scaling worker
- Locate the manager configuration file. In a default installation, the file is located at:
C:\Program Files\Pdftools\Pdftools OCR Service\PdftoolsOcrService\appsettings.json
- Point ServiceCommunication to your load balancer:
{
  "ServiceCommunication": {
    "ServiceCommunicationType": "Rest",
    "ConnectionString": "http://localhost:8080/"
  }
}
- Install workers on different host machines. You don't need to install the manager again.
In this example, there are three workers installed on:
192.168.1.101:7998
192.168.1.102:7998
192.168.1.103:7998
- Configure a load balancer (for example, nginx) to distribute requests to your worker nodes. In this example, the load balancer listens on port 8080, the port set in the manager's ConnectionString:
# A standalone nginx.conf also needs an events block; omit this line if your file already has one.
events {}

http {
  upstream backend_servers {
    # List of backend servers (the worker nodes)
    server 192.168.1.101:7998;
    server 192.168.1.102:7998;
    server 192.168.1.103:7998;
  }
  server {
    listen 8080;
    location / {
      # Forward each request to one of the workers listed above
      proxy_pass http://backend_servers;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
  }
}
- Add a license key to each worker:
{
  "Licensing": {
    "LicenseKey": "<LICENSE_KEY>"
  }
}
In a default installation, the worker configuration file is located at:
C:\Program Files\Pdftools\Pdftools OCR Service\PdftoolsOcrWorker\appsettings.json
- Share a common file storage between all nodes:
{
  "FileStorage": {
    "FileStorageType": "HostFileSystem",
    "FilesDirectoryPath": "F:/SharedFolder/ProgramData/Pdftools/OcrService/Files"
  }
}
FileStorage settings:
  - FileStorageType: Storage system type (for example, HostFileSystem).
  - FilesDirectoryPath: Directory path for storing OCR-processed files.
- Make sure every manager and worker node:
  - Uses the same FileStorage settings.
  - Has read and write permission for the shared directory.
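The two checks in the last step can be scripted before you start the services. The Python sketch below is one possible way to do it and is not part of the Pdftools tooling: it compares the FileStorage section across several appsettings.json files and tests whether the current node can write to and read from the shared directory. The function names are made up for this example, and the sketch assumes the configuration files are strict JSON without comments.

```python
import json
import tempfile


def load_file_storage(config_path):
    """Return the "FileStorage" section of an appsettings.json file."""
    # utf-8-sig tolerates the byte order mark some Windows editors add.
    with open(config_path, encoding="utf-8-sig") as handle:
        return json.load(handle)["FileStorage"]


def file_storage_matches(config_paths):
    """Return True if every listed config has an identical FileStorage section."""
    sections = [load_file_storage(path) for path in config_paths]
    return all(section == sections[0] for section in sections)


def can_read_write(directory):
    """Return True if a probe file can be written to and read back from directory."""
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, mode="w+", suffix=".probe") as probe:
            probe.write("ok")
            probe.seek(0)
            return probe.read() == "ok"
    except OSError:
        # Missing directory, missing permission, or unreachable share.
        return False
```

On each node, you could pass the local manager and worker appsettings.json paths to file_storage_matches, then pass the FilesDirectoryPath value to can_read_write while running as the account that runs the service.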
Scaling manager
You can scale the Pdftools OCR Service Manager similarly to the worker nodes, but you first need to switch from SQLite to a production-ready database.
- Configure PostgreSQL:
{
  "Database": {
    "DatabaseType": "PostgreSql",
    "ConnectionString": "User ID=myUser;Password=mySecurePassword;Server=my.database.com;Port=5432;Database=ocr-service-db;",
    "DeleteJobsAfterDays": 2
  }
}
- Install manager nodes on separate hosts, for example:
192.168.1.101:7982
192.168.1.102:7982
192.168.1.103:7982
- Configure a load balancer (for example, nginx). The load balancer can listen on port 7982, provided it runs on a different host than the manager nodes:
# A standalone nginx.conf also needs an events block; omit this line if your file already has one.
events {}

http {
  upstream backend_servers {
    # List of backend servers (the manager nodes)
    server 192.168.1.101:7982;
    server 192.168.1.102:7982;
    server 192.168.1.103:7982;
  }
  server {
    listen 7982;
    location / {
      # Forward each request to one of the managers listed above
      proxy_pass http://backend_servers;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
  }
}
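Both nginx examples in this guide define an upstream group without a balancing directive, in which case nginx uses its default round-robin method. The short Python sketch below only illustrates that rotation for the three example manager addresses. It is a simplified model with equal weights, not nginx's implementation, and the function name is made up for this example.

```python
from itertools import cycle


def round_robin(servers, request_count):
    """Pair request numbers 1..request_count with servers in rotating order."""
    rotation = cycle(servers)
    # zip stops when the request numbers run out, so the endless cycle is safe.
    return list(zip(range(1, request_count + 1), rotation))


managers = ["192.168.1.101:7982", "192.168.1.102:7982", "192.168.1.103:7982"]

for request, server in round_robin(managers, 5):
    print(f"request {request} -> {server}")
```

With three servers, the fourth request returns to the first address. Since consecutive requests can reach different manager nodes, this is consistent with the requirement that all managers use one shared PostgreSQL database instead of a local SQLite file.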