Scale AI Smart Redact
AI Smart Redact uses a Manager-Worker architecture:
- The Manager handles the client-facing API, job orchestration, and file management.
- Workers perform entity detection and PDF redaction.
- The Orchestrator powers the Human-in-the-Loop (HITL) web application on top.
Minimal setup
The minimal deployment uses a single Manager with SQLite and a single Worker, connected through REST. No message broker, no PostgreSQL. This setup is suitable for testing, evaluation, and low-traffic workloads.
# Manager
environment:
Database__DatabaseType: "SqlLite"
ServiceCommunication__ServiceCommunicationType: "Rest"
ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
FileStorage__FileStorageType: "HostFileSystem"
FileStorage__FilesDirectoryPath: "/app/storage_folder"
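A minimal docker-compose sketch under these settings might look like the following. The Manager image name and published port are assumptions for illustration, not verified values; the Worker image name matches the DockerHub listing later in this guide:

```yaml
services:
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest   # image name assumed
    ports:
      - "4884:4884"                                 # published API port assumed
    environment:
      Database__DatabaseType: "SqlLite"
      ServiceCommunication__ServiceCommunicationType: "Rest"
      ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
      FileStorage__FileStorageType: "HostFileSystem"
      FileStorage__FilesDirectoryPath: "/app/storage_folder"
    volumes:
      - ./storage_folder:/app/storage_folder
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest
```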
To scale beyond a single Worker, choose a communication transport.
Communication transports
The Manager and Worker communicate using one of two transports. The transport determines how horizontal scaling works.
Two complementary approaches scale AI Smart Redact:
- Horizontal scaling: Add more Worker instances to increase throughput linearly. Each Worker loads its own AI model and operates independently.
- Vertical scaling: Optimize throughput within a single Worker using GPU acceleration and batch inference tuning.
REST transport
In REST mode, the Manager calls the Worker directly over HTTP. No message broker is needed. To scale horizontally with REST, place a load balancer in front of multiple Worker instances.
environment:
ServiceCommunication__ServiceCommunicationType: "Rest"
ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
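One way to sketch REST load balancing in Compose is to put an nginx reverse proxy between the Manager and the Worker replicas. The nginx service, its listen port, and the referenced nginx.conf are illustrative assumptions:

```yaml
services:
  worker-lb:
    image: nginx:alpine
    volumes:
      # nginx.conf defines an upstream over the Worker replicas (not shown)
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest   # image name assumed
    environment:
      ServiceCommunication__ServiceCommunicationType: "Rest"
      # Point the Manager at the load balancer instead of a single Worker:
      ServiceCommunication__ConnectionString: "http://worker-lb:4885/"   # LB port assumed
```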
RabbitMQ transport
In RabbitMQ mode, the Manager sends job commands to queues. Workers consume from these queues as competing consumers. Horizontal scaling is built in: add more Workers and RabbitMQ distributes work automatically without a load balancer.
environment:
ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
ServiceCommunication__Host: "rabbitmq"
ServiceCommunication__Username: "guest"
ServiceCommunication__Password: "guest"
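A Compose sketch that adds the broker might look like this. The rabbitmq service definition is illustrative, and the guest credentials are for local testing only:

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"   # management UI, optional
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest   # image name assumed
    environment:
      ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
      ServiceCommunication__Host: "rabbitmq"
      ServiceCommunication__Username: "guest"
      ServiceCommunication__Password: "guest"
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest
    environment:
      ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
      ServiceCommunication__Host: "rabbitmq"
```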
With REST transport, horizontal scaling requires a load balancer in front of the Workers. With RabbitMQ, Workers scale automatically through competing consumers without a load balancer.
Horizontal scaling: Add more Workers
Each Worker instance processes multiple detections per minute (throughput depends on document size, entity density, and hardware). Add more Workers for higher throughput. RabbitMQ distributes work through competing consumers automatically.
Scale with Docker Compose
Start additional Worker replicas with the --scale flag:
docker compose up --scale smart-redact-worker=3
The deploy.replicas field in docker-compose.yml only applies under Docker Swarm. With docker compose up, use --scale instead.
Each Worker instance loads its own copy of the AI model and operates independently. Throughput scales linearly: three Workers produce about three times the throughput of a single Worker.
How competing consumers work
- All Worker instances subscribe to the same RabbitMQ queues (worker-detection-queue, worker-redaction-queue).
- RabbitMQ distributes messages across Workers as competing consumers based on prefetch availability.
- Each Worker processes detection jobs one at a time (configurable with DetectionConcurrencyLimit) and redaction jobs concurrently (default: 4).
- If a Worker crashes mid-job, the broker redelivers the message to another available Worker.
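The distribution behavior can be illustrated with a small Python simulation. This is not the broker protocol itself, only a toy model: every worker thread pulls from one shared queue, so each job goes to whichever worker is free next:

```python
import queue
import threading

def run_competing_consumers(jobs, num_workers):
    """Toy model of the competing-consumer pattern: all workers pull
    from the same queue, so work spreads across free workers."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    processed = {i: [] for i in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                # Like prefetch=1: take one job at a time, process, repeat.
                job = q.get_nowait()
            except queue.Empty:
                return
            processed[worker_id].append(job)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

result = run_competing_consumers(list(range(10)), 3)
# Every job is processed exactly once, regardless of which worker took it.
assert sorted(j for jobs in result.values() for j in jobs) == list(range(10))
```

The redelivery behavior on Worker crashes is not modeled here; in RabbitMQ it comes from unacknowledged messages being requeued.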
Horizontal scaling trade-offs
| Pros | Cons |
|---|---|
| Linear throughput scaling | Memory: ~2.9 GB per Worker (model loaded per instance) |
| Independent failure domains | Cold start time per instance (about 10 to 15 seconds for model loading) |
| Straightforward to configure | Network and storage overhead |
Configure admission control for scaled deployments
Scale MaxPendingJobs proportionally to the number of Workers:
MaxPendingJobs = (2 to 3) * num_workers * DetectionConcurrencyLimit
For example, with 3 Workers and a detection concurrency of 1:
environment:
ServiceCommunication__MaxPendingJobs: 9 # factor 3 * 3 Workers * concurrency 1
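The sizing rule can be written as a small helper; the function name and the `factor` parameter are illustrative, not part of the product configuration:

```python
def max_pending_jobs(num_workers, detection_concurrency, factor=3):
    """Sizing rule of thumb: 2-3x the cluster's total detection
    concurrency. `factor` selects the multiplier within that range."""
    return factor * num_workers * detection_concurrency

# 3 Workers, DetectionConcurrencyLimit 1, factor 3 -> 9
assert max_pending_jobs(3, 1) == 9
```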
Vertical scaling: Optimize a single Worker
Vertical scaling focuses on maximizing throughput within a single Worker instance using GPU acceleration and batch inference tuning.
How the Worker processes documents
When the Worker receives a PDF, it extracts text, splits it into text chunks (sized by MaxChunkSize), runs entity detection on each chunk, and consolidates the results. The AI model processes one inference call at a time per Worker, which is why horizontal scaling is the primary throughput strategy.
For details on MaxChunkSize, BatchSize, and other inference settings, refer to Worker > Inference and Tune chunk size for performance.
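The stages above can be sketched in Python. Everything here is a simplification: the chunking rule splits on whitespace rather than real tokens, and the detector stub stands in for the model, which in practice runs one inference call at a time:

```python
def split_into_chunks(words, max_chunk_size):
    """Split a word list into chunks of at most max_chunk_size tokens,
    mirroring how the Worker sizes chunks by MaxChunkSize."""
    return [words[i:i + max_chunk_size] for i in range(0, len(words), max_chunk_size)]

def detect_entities(chunk):
    """Stub detector: a toy capitalization rule standing in for the AI model."""
    return [w for w in chunk if w.istitle()]

def process_document(text, max_chunk_size=256):
    words = text.split()
    chunks = split_into_chunks(words, max_chunk_size)
    results = []
    for chunk in chunks:            # sequential: one inference call at a time
        results.extend(detect_entities(chunk))
    return sorted(set(results))     # consolidate results across chunks

entities = process_document("Alice paid Bob in Zurich " * 100, max_chunk_size=32)
assert entities == ["Alice", "Bob", "Zurich"]
```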
CPU vs. GPU inference
The Worker supports both CPU and GPU inference. GPU acceleration significantly reduces processing time per text chunk.
| Aspect | CPU | GPU (CUDA) |
|---|---|---|
| Setup | No additional requirements | Requires NVIDIA drivers and CUDA toolkit |
| Processing speed | Baseline | Significantly faster per chunk |
| Memory | Model loads into RAM (~2.9 GB) | Model loads into RAM + VRAM |
| Cost | Low | Higher (GPU hardware) |
| Best for | Dev/test, low traffic | Production, high throughput |
Configure the execution provider in appsettings.json:
{
"Inference": {
"ExecutionProvider": "Auto",
"GpuDeviceId": 0,
"CpuUtilizationPercentage": 80
}
}
| Setting | Default | Description |
|---|---|---|
| ExecutionProvider | Auto | Hardware for inference. Auto uses GPU automatically when running the -cuda image, and CPU otherwise. Set to Cpu to force CPU-only inference. |
| GpuDeviceId | 0 | GPU device ID when using CUDA. |
| CpuUtilizationPercentage | 80 | Percentage of CPU cores used for inference (1-100). Only applies to CPU execution. |
Docker image variants
Two Worker images are available on DockerHub:
| Image | Use case |
|---|---|
| pdftoolsag/smart-redact-worker:latest | CPU-only inference. No GPU requirements. |
| pdftoolsag/smart-redact-worker:latest-cuda | GPU-accelerated inference with NVIDIA CUDA. Best performance. |
NVIDIA GPU deployment
Use the GPU-specific Compose file and make sure the NVIDIA Container Toolkit is installed on the host:
services:
smart-redact-worker:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
Inference__ExecutionProvider: "Auto"
Inference__GpuDeviceId: 0
Auto is the recommended default. It uses GPU automatically with the -cuda image, and CPU otherwise. Set to Cpu to force CPU-only inference.
Batch inference tuning
The Worker splits each PDF into text chunks and groups them into batches before sending them to the AI model. Tuning batch size and chunk size directly controls the throughput-accuracy-memory trade-off.
How batching works
- The PDF is split into text chunks of up to MaxChunkSize tokens each. A 10-page document might produce 20-40 chunks depending on text density.
- Chunks are grouped into batches of BatchSize. Each batch is sent as a single inference call to the model.
- Fewer, larger batches mean fewer inference calls and less overhead per document. More, smaller batches use less memory.
{
"Inference": {
"BatchSize": 4,
"MaxChunkSize": 256,
"MaxLength": 512,
"MaxWidth": 12
}
}
Batch size (BatchSize)
Controls how many text chunks the model processes in a single inference call.
| BatchSize | Behavior | Memory impact | Best for |
|---|---|---|---|
| 1 (default) | One chunk per inference call | Lowest | Development, memory-constrained environments |
| 2-4 | Small batches | Low-moderate | CPU deployments, balanced workloads |
| 4-10 | Medium batches | Moderate | GPU deployments, production |
| 10-100 | Large batches | High | GPU with ample VRAM, maximum throughput |
Example: A 30-chunk document with BatchSize: 1 requires 30 inference calls. With BatchSize: 10, it requires only 3 inference calls, significantly reducing per-document processing time.
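The arithmetic behind that example is a ceiling division, sketched here as an illustrative helper (the function name is not part of the product):

```python
import math

def inference_calls(num_chunks, batch_size):
    """Number of model calls needed to cover all chunks of a document."""
    return math.ceil(num_chunks / batch_size)

assert inference_calls(30, 1) == 30   # one call per chunk
assert inference_calls(30, 10) == 3   # ten chunks per call
```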
Chunk size (MaxChunkSize)
Controls how many tokens (roughly words) each text chunk contains. Larger chunks give the model more context for disambiguation but take longer to process.
| MaxChunkSize | Use case | Trade-off |
|---|---|---|
| 384-512 | High accuracy | Maximum context per chunk, slower processing, fewer chunks per document |
| 256 (default) | Balanced | Good accuracy and speed |
| 128-256 | High throughput | Faster per-chunk processing, less context, more chunks per document |
The model evaluates candidate entity spans within each chunk. The number of candidates scales linearly with chunk size:
| MaxChunkSize | Candidates evaluated | Relative speed |
|---|---|---|
| 512 | ~6,144 | 1x (baseline) |
| 384 | ~4,608 | ~1.3x faster |
| 256 | ~3,072 | ~2x faster |
| 128 | ~1,536 | ~4x faster |
Start with MaxChunkSize: 256 and BatchSize: 2. Increase BatchSize first for more throughput. Only reduce MaxChunkSize below 256 if detection accuracy is acceptable at smaller context windows.
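The candidate counts in the table above follow from chunk size times the maximum span width (MaxWidth, default 12); the helper below reproduces them, assuming roughly one span start per token:

```python
def candidate_spans(max_chunk_size, max_width=12):
    """Approximate candidate entity spans per chunk: one span start
    per token, times up to max_width span widths."""
    return max_chunk_size * max_width

assert candidate_spans(512) == 6144
assert candidate_spans(256) == 3072
assert candidate_spans(128) == 1536
```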
Other model parameters
| Setting | Default | Description |
|---|---|---|
| MaxLength | 512 | Maximum input length, in tokens, that the model accepts. Don’t exceed this value without changing models. |
| MaxWidth | 12 | Maximum entity span width in words. For example, “Bank of America Corporation” is 4 words. |
Graph optimization
The Worker applies graph optimizations at model load time to improve inference speed:
{
"Inference": {
"GraphOptimizationLevel": "All",
"ExecutionMode": "Parallel"
}
}
| Setting | Default | Options |
|---|---|---|
| GraphOptimizationLevel | All | DisableAll, Basic, Extended, All (recommended). |
| ExecutionMode | Parallel | Sequential (single-threaded), Parallel (multi-threaded, recommended). |
Keep GraphOptimizationLevel set to All and ExecutionMode set to Parallel for best performance. Changing these defaults is not recommended.
Scale Manager nodes
To scale Manager nodes horizontally, switch from SQLite to PostgreSQL so that all Manager instances share the same state.
- Configure each Manager instance to use the same PostgreSQL database:
environment:
  Database__DatabaseType: "PostgreSql"
  Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-manager-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"
- Configure a load balancer to distribute requests across Manager instances.
- Make sure all Manager instances share the same file storage configuration. Refer to AWS S3 file storage.
Each Manager instance runs its own backpressure monitor, but they all query the same PostgreSQL database. All Managers see the same pending job count regardless of which Manager created the jobs.
Scale Orchestrator nodes
To scale the Orchestrator horizontally, switch from SQLite to PostgreSQL and add a shared Redis instance for session and token caching. Place a load balancer in front of the Orchestrator instances.
- Configure each Orchestrator instance to use the same PostgreSQL database:
environment:
  Database__DatabaseType: "PostgreSql"
  Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-orchestrator-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"
- Configure a shared Redis instance for session and token caching:
environment:
  Redis__ConnectionString: "shared-redis:6379"
- Place a load balancer in front of the Orchestrator instances.
Memory planning
Each Worker loads the AI model into memory at startup (~2.9 GB steady-state). Plan 4 GB per Worker to allow headroom for batch processing.
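As a back-of-the-envelope aid, the Worker memory budget scales linearly with the replica count; the helper below is illustrative and does not model Manager, broker, or OS overhead:

```python
def worker_memory_gb(num_workers, per_worker_gb=4.0):
    """Host RAM to reserve for Workers alone. 4 GB per Worker leaves
    headroom above the ~2.9 GB steady-state model footprint."""
    return num_workers * per_worker_gb

assert worker_memory_gb(3) == 12.0   # three Workers need ~12 GB between them
```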
Concurrency tuning
Detection and redaction use separate RabbitMQ queues with independent concurrency limits:
| Queue | Default concurrency | Reason |
|---|---|---|
| worker-detection-queue | 1 | The AI model processes one inference at a time. Concurrent detection adds no throughput, only memory pressure. |
| worker-redaction-queue | 4 | Lighter workload (about 100 ms per redaction); benefits from parallelism. |
Override per Worker:
environment:
ServiceCommunication__DetectionConcurrencyLimit: 1
ServiceCommunication__RedactionConcurrencyLimit: 4
Don’t increase DetectionConcurrencyLimit past 1. The AI model processes one inference at a time. Multiple detection threads queue up without improving throughput, only increasing memory pressure.
Scaling decision matrix
| Scenario | Recommended strategy |
|---|---|
| Low traffic, cost-sensitive | Single Worker, CPU, BatchSize: 1 |
| Medium traffic, balanced | 2-3 Worker instances, CPU, BatchSize: 2-4 |
| High traffic, latency-sensitive | Multiple Workers + GPU, BatchSize: 4-10 |
| Highest traffic | Horizontal (N Workers) + GPU + batch tuning |
| Burst traffic | Kubernetes HPA + multiple Workers |
Example configurations
The following examples show resource and environment fields for each scenario. The deploy.replicas field in these YAML blocks applies under Docker Swarm or Kubernetes; with docker compose up, start replicas with --scale smart-redact-worker=N instead.
Small deployment (dev/test):
# 1 Worker, CPU, minimal resources
smart-redact-worker:
mem_limit: 4g
cpus: 2.0
environment:
Inference__ExecutionProvider: "Cpu"
Inference__BatchSize: 1
Inference__MaxChunkSize: 256
Medium deployment (production):
# 3 Workers, CPU, tuned batching
smart-redact-worker:
deploy:
replicas: 3
mem_limit: 4g
cpus: 2.0
environment:
Inference__ExecutionProvider: "Cpu"
Inference__BatchSize: 4
Inference__MaxChunkSize: 256
ServiceCommunication__DetectionConcurrencyLimit: 1
ServiceCommunication__RedactionConcurrencyLimit: 4
High-throughput deployment (production, GPU):
# 3 Workers with GPU, aggressive batching
smart-redact-worker:
deploy:
replicas: 3
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
mem_limit: 6g
cpus: 2.0
environment:
Inference__ExecutionProvider: "Auto"
Inference__GpuDeviceId: 0
Inference__BatchSize: 10
Inference__MaxChunkSize: 384
ServiceCommunication__DetectionConcurrencyLimit: 1
ServiceCommunication__RedactionConcurrencyLimit: 4