Scale AI Smart Redact

AI Smart Redact uses a Manager-Worker architecture:

  • The Manager handles the client-facing API, job orchestration, and file management.
  • Workers perform entity detection and PDF redaction.
  • The Orchestrator sits on top and powers the Human-in-the-Loop (HITL) web application.

Minimal setup

The minimal deployment uses a single Manager with SQLite and a single Worker, connected through REST. No message broker, no PostgreSQL. This setup is suitable for testing, evaluation, and low-traffic workloads.

# Manager
environment:
  Database__DatabaseType: "SqlLite"
  ServiceCommunication__ServiceCommunicationType: "Rest"
  ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
  FileStorage__FileStorageType: "HostFileSystem"
  FileStorage__FilesDirectoryPath: "/app/storage_folder"
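
For reference, here is a minimal docker-compose.yml sketch of this topology. The Worker image name comes from the Docker image variants section below; the Manager image name and published port are assumptions, so check them against your deployment files:

services:
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest  # assumed image name
    ports:
      - "8080:8080"  # assumed API port
    environment:
      Database__DatabaseType: "SqlLite"
      ServiceCommunication__ServiceCommunicationType: "Rest"
      ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
      FileStorage__FileStorageType: "HostFileSystem"
      FileStorage__FilesDirectoryPath: "/app/storage_folder"
    volumes:
      - ./storage:/app/storage_folder  # persist files on the host
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest  # CPU-only variant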

To scale beyond a single Worker, choose a communication transport.

Communication transports

The Manager and Worker communicate using one of two transports. The transport determines how horizontal scaling works.

Two complementary approaches scale AI Smart Redact:

  • Horizontal scaling: Add more Worker instances to increase throughput linearly. Each Worker loads its own AI model and operates independently.
  • Vertical scaling: Optimize throughput within a single Worker using GPU acceleration and batch inference tuning.

REST transport

In REST mode, the Manager calls the Worker directly over HTTP. No message broker is needed. To scale horizontally with REST, place a load balancer in front of multiple Worker instances.

environment:
  ServiceCommunication__ServiceCommunicationType: "Rest"
  ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
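
As an illustration, a minimal nginx configuration that balances two Worker instances behind the hostname the Manager already points at. The upstream hostnames smart-redact-worker-1 and smart-redact-worker-2 are assumptions; adjust them to your service names:

upstream smart_redact_workers {
    # Port 4885 comes from the ConnectionString above.
    server smart-redact-worker-1:4885;
    server smart-redact-worker-2:4885;
}

server {
    listen 4885;
    location / {
        proxy_pass http://smart_redact_workers;
    }
}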

RabbitMQ transport

In RabbitMQ mode, the Manager sends job commands to queues. Workers consume from these queues as competing consumers. Horizontal scaling is built in: add more Workers and RabbitMQ distributes work automatically without a load balancer.

environment:
  ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
  ServiceCommunication__Host: "rabbitmq"
  ServiceCommunication__Username: "guest"
  ServiceCommunication__Password: "guest"
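
If you don't already run a broker, a minimal Compose service using the official rabbitmq image might look like this. The guest credentials match the example above but should be replaced in production:

services:
  rabbitmq:
    image: rabbitmq:3-management  # includes the management UI
    ports:
      - "5672:5672"    # AMQP
      - "15672:15672"  # management UI
    environment:
      RABBITMQ_DEFAULT_USER: "guest"
      RABBITMQ_DEFAULT_PASS: "guest"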
info

With REST transport, horizontal scaling requires a load balancer in front of the Workers. With RabbitMQ, Workers scale automatically through competing consumers without a load balancer.

Horizontal scaling: Add more Workers

Each Worker instance processes multiple detections per minute (throughput depends on document size, entity density, and hardware). Add more Workers for higher throughput. RabbitMQ distributes work through competing consumers automatically.

Scale with Docker Compose

Start additional Worker replicas with the --scale flag:

docker compose up --scale smart-redact-worker=3

The deploy.replicas field in docker-compose.yml only applies under Docker Swarm. With docker compose up, use --scale instead.

Each Worker instance loads its own copy of the AI model and operates independently. Throughput scales linearly: three Workers produce about three times the throughput of a single Worker.

How competing consumers work

  1. All Worker instances subscribe to the same RabbitMQ queues (worker-detection-queue, worker-redaction-queue).
  2. RabbitMQ distributes messages across Workers as competing consumers based on prefetch availability.
  3. Each Worker processes detection jobs one at a time (configurable with DetectionConcurrencyLimit) and redaction jobs concurrently (default: 4).
  4. If a Worker crashes mid-job, the broker redelivers the message to another available Worker.

Horizontal scaling trade-offs

| Pros | Cons |
| --- | --- |
| Linear throughput scaling | Memory: ~2.9 GB per Worker (model loaded per instance) |
| Independent failure domains | Cold start time per instance (about 10 to 15 seconds for model loading) |
| Straightforward to configure | Network and storage overhead |

Configure admission control for scaled deployments

Scale MaxPendingJobs proportionally to the number of Workers:

MaxPendingJobs = (2 to 3) * num_workers * DetectionConcurrencyLimit

For example, with 3 Workers and a detection concurrency of 1:

environment:
  ServiceCommunication__MaxPendingJobs: 9  # 3 (multiplier) * 3 (Workers) * 1 (concurrency)

Vertical scaling: Optimize a single Worker

Vertical scaling focuses on maximizing throughput within a single Worker instance using GPU acceleration and batch inference tuning.

How the Worker processes documents

When the Worker receives a PDF, it extracts text, splits it into text chunks (sized by MaxChunkSize), runs entity detection on each chunk, and consolidates the results. The AI model processes one inference call at a time per Worker, which is why horizontal scaling is the primary throughput strategy.

For details on MaxChunkSize, BatchSize, and other inference settings, refer to Worker > Inference and Tune chunk size for performance.

CPU vs. GPU inference

The Worker supports both CPU and GPU inference. GPU acceleration significantly reduces processing time per text chunk.

| Aspect | CPU | GPU (CUDA) |
| --- | --- | --- |
| Setup | No additional requirements | Requires NVIDIA drivers and CUDA toolkit |
| Processing speed | Baseline | Significantly faster per chunk |
| Memory | Model loads into RAM (~2.9 GB) | Model loads into RAM + VRAM |
| Cost | Low | Higher (GPU hardware) |
| Best for | Dev/test, low traffic | Production, high throughput |

Configure the execution provider in appsettings.json:

{
  "Inference": {
    "ExecutionProvider": "Auto",
    "GpuDeviceId": 0,
    "CpuUtilizationPercentage": 80
  }
}
| Setting | Default | Description |
| --- | --- | --- |
| ExecutionProvider | Auto | Hardware used for inference. Auto selects the GPU automatically when running the -cuda image, and the CPU otherwise. Set to Cpu to force CPU-only inference. |
| GpuDeviceId | 0 | GPU device ID when using CUDA. |
| CpuUtilizationPercentage | 80 | Percentage of CPU cores used for inference (1-100). Applies only to CPU execution. |
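
The same settings can also be supplied as environment variables in Compose, following the double-underscore convention used elsewhere in this guide:

environment:
  Inference__ExecutionProvider: "Auto"
  Inference__GpuDeviceId: 0
  Inference__CpuUtilizationPercentage: 80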

Docker image variants

Two Worker images are available on DockerHub:

| Image | Use case |
| --- | --- |
| pdftoolsag/smart-redact-worker:latest | CPU-only inference. No GPU requirements. |
| pdftoolsag/smart-redact-worker:latest-cuda | GPU-accelerated inference with NVIDIA CUDA. Best performance. |

NVIDIA GPU deployment

Use the GPU-specific Compose file and make sure the NVIDIA Container Toolkit is installed on the host:

services:
  smart-redact-worker:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      Inference__ExecutionProvider: "Auto"
      Inference__GpuDeviceId: 0
info

Auto is the recommended default. It uses GPU automatically with the -cuda image, and CPU otherwise. Set to Cpu to force CPU-only inference.
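
To confirm that containers can see the GPU before starting the Worker, run the standard NVIDIA Container Toolkit check (the CUDA image tag here is only an example; any recent nvidia/cuda tag works):

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If the command prints the GPU table, the -cuda Worker image can use the device.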

Batch inference tuning

The Worker splits each PDF into text chunks and groups them into batches before sending them to the AI model. Tuning batch size and chunk size directly controls the throughput-accuracy-memory trade-off.

How batching works

  1. The PDF is split into text chunks of up to MaxChunkSize tokens each. A 10-page document might produce 20-40 chunks depending on text density.
  2. Chunks are grouped into batches of BatchSize. Each batch is sent as a single inference call to the model.
  3. Fewer, larger batches mean fewer inference calls and less overhead per document. More, smaller batches use less memory.

These settings live in the Worker's appsettings.json:

{
  "Inference": {
    "BatchSize": 4,
    "MaxChunkSize": 256,
    "MaxLength": 512,
    "MaxWidth": 12
  }
}

Batch size (BatchSize)

Controls how many text chunks the model processes in a single inference call.

| BatchSize | Behavior | Memory impact | Best for |
| --- | --- | --- | --- |
| 1 (default) | One chunk per inference call | Lowest | Development, memory-constrained environments |
| 2-4 | Small batches | Low-moderate | CPU deployments, balanced workloads |
| 4-10 | Medium batches | Moderate | GPU deployments, production |
| 10-100 | Large batches | High | GPU with ample VRAM, maximum throughput |

Example: A 30-chunk document with BatchSize: 1 requires 30 inference calls. With BatchSize: 10, it requires only 3 inference calls, significantly reducing per-document processing time.

Chunk size (MaxChunkSize)

Controls how many tokens (roughly words) each text chunk contains. Larger chunks give the model more context for disambiguation but take longer to process.

| MaxChunkSize | Use case | Trade-off |
| --- | --- | --- |
| 384-512 | High accuracy | Maximum context per chunk, slower processing, fewer chunks per document |
| 256 (default) | Balanced | Good accuracy and speed |
| 128-256 | High throughput | Faster per-chunk processing, less context, more chunks per document |

The model evaluates candidate entity spans within each chunk. The number of candidates scales linearly with chunk size (roughly MaxChunkSize * MaxWidth spans, or about 12 candidates per token at the default MaxWidth of 12):

| MaxChunkSize | Candidates evaluated | Relative speed |
| --- | --- | --- |
| 512 | ~6,144 | 1x (baseline) |
| 384 | ~4,608 | ~1.3x faster |
| 256 | ~3,072 | ~2x faster |
| 128 | ~1,536 | ~4x faster |
tip

Start with MaxChunkSize: 256 and BatchSize: 2. Increase BatchSize first when you need more throughput. Reduce MaxChunkSize below 256 only after verifying that detection accuracy remains acceptable at the smaller context window.

Other model parameters

| Setting | Default | Description |
| --- | --- | --- |
| MaxLength | 512 | Maximum input length the model accepts. Don't exceed this without changing models. |
| MaxWidth | 12 | Maximum entity span width in words. For example, "Bank of America Corporation" is 4 words. |

Graph optimization

The Worker applies graph optimizations at model load time to improve inference speed:

{
  "Inference": {
    "GraphOptimizationLevel": "All",
    "ExecutionMode": "Parallel"
  }
}
| Setting | Default | Options |
| --- | --- | --- |
| GraphOptimizationLevel | All | DisableAll, Basic, Extended, All (recommended). |
| ExecutionMode | Parallel | Sequential (single-threaded), Parallel (multi-threaded, recommended). |
tip

Keep GraphOptimizationLevel set to All and ExecutionMode set to Parallel for best performance. Changing these defaults is not recommended.

Scale Manager nodes

To scale Manager nodes horizontally, switch from SQLite to PostgreSQL so that all Manager instances share the same state.

  1. Configure each Manager instance to use the same PostgreSQL database:

     environment:
       Database__DatabaseType: "PostgreSql"
       Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-manager-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"

  2. Configure a load balancer to distribute requests across Manager instances.
  3. Make sure all Manager instances share the same file storage configuration. Refer to AWS S3 file storage.
info

Each Manager instance runs its own backpressure monitor, but they all query the same PostgreSQL database. All Managers see the same pending job count regardless of which Manager created the jobs.
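
A sketch of the shared-state wiring under Compose, assuming the Manager image name from the minimal setup above and the stock postgres image:

services:
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest  # assumed image name
    deploy:
      replicas: 2  # applies under Swarm; with docker compose up, use --scale
    environment:
      Database__DatabaseType: "PostgreSql"
      Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-manager-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"
  smart-redact-manager-db:
    image: postgres:16
    environment:
      POSTGRES_USER: "smartredact"
      POSTGRES_PASSWORD: "smartredact"
      POSTGRES_DB: "smartredact"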

Scale Orchestrator nodes

To scale the Orchestrator horizontally, switch from SQLite to PostgreSQL and add a shared Redis instance for session and token caching. Place a load balancer in front of the Orchestrator instances.

  1. Configure each Orchestrator instance to use the same PostgreSQL database:

     environment:
       Database__DatabaseType: "PostgreSql"
       Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-orchestrator-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"

  2. Configure a shared Redis instance for session and token caching:

     environment:
       Redis__ConnectionString: "shared-redis:6379"

  3. Place a load balancer in front of the Orchestrator instances (a sketch of the shared backing services follows this list).
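
A minimal sketch of those backing services, using the stock postgres and redis images (credentials mirror the connection strings above):

services:
  smart-redact-orchestrator-db:
    image: postgres:16
    environment:
      POSTGRES_USER: "smartredact"
      POSTGRES_PASSWORD: "smartredact"
      POSTGRES_DB: "smartredact"
  shared-redis:
    image: redis:7  # hostname matches Redis__ConnectionString above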

Memory planning

Each Worker loads the AI model into memory at startup (~2.9 GB steady-state). Plan 4 GB per Worker to leave headroom for batch processing; a host running three Workers, for example, needs at least 12 GB for the Workers alone.

Concurrency tuning

Detection and redaction use separate RabbitMQ queues with independent concurrency limits:

| Queue | Default concurrency | Reason |
| --- | --- | --- |
| worker-detection-queue | 1 | The AI model processes one inference at a time. Concurrent detection adds no throughput, only memory pressure. |
| worker-redaction-queue | 4 | Lighter workload (about 100 ms per redaction) that benefits from parallelism. |

Override per Worker:

environment:
  ServiceCommunication__DetectionConcurrencyLimit: 1
  ServiceCommunication__RedactionConcurrencyLimit: 4
warning

Don’t increase DetectionConcurrencyLimit past 1. The AI model processes one inference at a time. Multiple detection threads queue up without improving throughput, only increasing memory pressure.

Scaling decision matrix

| Scenario | Recommended strategy |
| --- | --- |
| Low traffic, cost-sensitive | Single Worker, CPU, BatchSize: 1 |
| Medium traffic, balanced | 2-3 Worker instances, CPU, BatchSize: 2-4 |
| High traffic, latency-sensitive | Multiple Workers + GPU, BatchSize: 4-10 |
| Highest traffic | Horizontal (N Workers) + GPU + batch tuning |
| Burst traffic | Kubernetes HPA + multiple Workers (see the sketch below) |
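
For the burst-traffic row, a minimal HorizontalPodAutoscaler sketch. It assumes the Workers run as a Kubernetes Deployment named smart-redact-worker; the replica bounds and CPU target are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: smart-redact-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: smart-redact-worker  # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70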

Example configurations

The following examples show resource and environment fields for each scenario. The deploy.replicas field in these YAML blocks applies under Docker Swarm or Kubernetes; with docker compose up, start replicas with --scale smart-redact-worker=N instead.

Small deployment (dev/test):

# 1 Worker, CPU, minimal resources
smart-redact-worker:
  mem_limit: 4g
  cpus: 2.0
  environment:
    Inference__ExecutionProvider: "Cpu"
    Inference__BatchSize: 1
    Inference__MaxChunkSize: 256

Medium deployment (production):

# 3 Workers, CPU, tuned batching
smart-redact-worker:
  deploy:
    replicas: 3
  mem_limit: 4g
  cpus: 2.0
  environment:
    Inference__ExecutionProvider: "Cpu"
    Inference__BatchSize: 4
    Inference__MaxChunkSize: 256
    ServiceCommunication__DetectionConcurrencyLimit: 1
    ServiceCommunication__RedactionConcurrencyLimit: 4

High-throughput deployment (production, GPU):

# 3 Workers with GPU, aggressive batching
smart-redact-worker:
  deploy:
    replicas: 3
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
  mem_limit: 6g
  cpus: 2.0
  environment:
    Inference__ExecutionProvider: "Auto"
    Inference__GpuDeviceId: 0
    Inference__BatchSize: 10
    Inference__MaxChunkSize: 384
    ServiceCommunication__DetectionConcurrencyLimit: 1
    ServiceCommunication__RedactionConcurrencyLimit: 4