Scale AI Smart Redact
AI Smart Redact uses a Manager-Worker architecture:
- The Manager handles the client-facing API, job orchestration, and file management.
- Workers perform entity detection and PDF redaction.
- The Orchestrator powers the Human-in-the-Loop (HITL) web application on top.
Minimal setup
The minimal deployment uses a single Manager with SQLite and a single Worker, connected through REST. No message broker, no PostgreSQL. This setup is suitable for testing, evaluation, and low-traffic workloads.
# Manager
environment:
Database__DatabaseType: "SqlLite"
ServiceCommunication__ServiceCommunicationType: "Rest"
ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
FileStorage__FileStorageType: "HostFileSystem"
FileStorage__FilesDirectoryPath: "/app/storage_folder"
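A minimal docker-compose sketch under these settings might look like the following. The Manager image name and published port are assumptions for illustration, not verified values; the Worker image name matches the DockerHub listing later in this guide:

```yaml
services:
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest   # image name assumed
    ports:
      - "4884:4884"                                 # published API port assumed
    environment:
      Database__DatabaseType: "SqlLite"
      ServiceCommunication__ServiceCommunicationType: "Rest"
      ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
      FileStorage__FileStorageType: "HostFileSystem"
      FileStorage__FilesDirectoryPath: "/app/storage_folder"
    volumes:
      - ./storage_folder:/app/storage_folder
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest
```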
To scale beyond a single Worker, choose a communication transport.
Communication transports
The Manager and Worker communicate using one of two transports. The transport determines how horizontal scaling works.
Two complementary approaches scale AI Smart Redact:
- Horizontal scaling: Add more Worker instances to increase throughput linearly. Each Worker loads its own AI model and operates independently.
- Vertical scaling: Optimize throughput within a single Worker using GPU acceleration and batch inference tuning.
REST transport
In REST mode, the Manager calls the Worker directly over HTTP. No message broker is needed. To scale horizontally with REST, place a load balancer in front of multiple Worker instances.
environment:
ServiceCommunication__ServiceCommunicationType: "Rest"
ServiceCommunication__ConnectionString: "http://smart-redact-worker:4885/"
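One way to sketch REST load balancing in Compose is to put an nginx reverse proxy between the Manager and the Worker replicas. The nginx service, its listen port, and the referenced nginx.conf are illustrative assumptions:

```yaml
services:
  worker-lb:
    image: nginx:alpine
    volumes:
      # nginx.conf defines an upstream over the Worker replicas (not shown)
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest   # image name assumed
    environment:
      ServiceCommunication__ServiceCommunicationType: "Rest"
      # Point the Manager at the load balancer instead of a single Worker:
      ServiceCommunication__ConnectionString: "http://worker-lb:4885/"   # LB port assumed
```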
RabbitMQ transport
In RabbitMQ mode, the Manager sends job commands to queues. Workers consume from these queues as competing consumers. Horizontal scaling is built in: add more Workers and RabbitMQ distributes work automatically without a load balancer.
environment:
ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
ServiceCommunication__Host: "rabbitmq"
ServiceCommunication__Username: "guest"
ServiceCommunication__Password: "guest"
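A Compose sketch that adds the broker might look like this. The rabbitmq service definition is illustrative, and the guest credentials are for local testing only:

```yaml
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "15672:15672"   # management UI, optional
  smart-redact-manager:
    image: pdftoolsag/smart-redact-manager:latest   # image name assumed
    environment:
      ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
      ServiceCommunication__Host: "rabbitmq"
      ServiceCommunication__Username: "guest"
      ServiceCommunication__Password: "guest"
  smart-redact-worker:
    image: pdftoolsag/smart-redact-worker:latest
    environment:
      ServiceCommunication__ServiceCommunicationType: "RabbitMQ"
      ServiceCommunication__Host: "rabbitmq"
```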
With REST transport, horizontal scaling requires a load balancer in front of the Workers. With RabbitMQ, Workers scale automatically through competing consumers without a load balancer.
Horizontal scaling: Add more Workers
Each Worker instance processes multiple detections per minute (throughput depends on document size, entity density, and hardware). Add more Workers for higher throughput. RabbitMQ distributes work through competing consumers automatically.
Scale with Docker Compose
Start additional Worker replicas with the --scale flag:
docker compose up --scale smart-redact-worker=3
The deploy.replicas field in docker-compose.yml only applies under Docker Swarm. With docker compose up, use --scale instead.
Each Worker instance loads its own copy of the AI model and operates independently. Throughput scales linearly: three Workers produce about three times the throughput of a single Worker.
How competing consumers work
- All Worker instances subscribe to the same RabbitMQ queues (worker-detection-queue, worker-redaction-queue).
- RabbitMQ distributes messages across Workers as competing consumers based on prefetch availability.
- Each Worker processes detection jobs one at a time (configurable with DetectionConcurrencyLimit) and redaction jobs concurrently (default: 4).
- If a Worker crashes mid-job, the broker redelivers the message to another available Worker.
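The distribution behavior can be illustrated with a small Python simulation. This is not the broker protocol itself, only a toy model: every worker thread pulls from one shared queue, so each job goes to whichever worker is free next:

```python
import queue
import threading

def run_competing_consumers(jobs, num_workers):
    """Toy model of the competing-consumer pattern: all workers pull
    from the same queue, so work spreads across free workers."""
    q = queue.Queue()
    for job in jobs:
        q.put(job)
    processed = {i: [] for i in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                # Like prefetch=1: take one job at a time, process, repeat.
                job = q.get_nowait()
            except queue.Empty:
                return
            processed[worker_id].append(job)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed

result = run_competing_consumers(list(range(10)), 3)
# Every job is processed exactly once, regardless of which worker took it.
assert sorted(j for jobs in result.values() for j in jobs) == list(range(10))
```

The redelivery behavior on Worker crashes is not modeled here; in RabbitMQ it comes from unacknowledged messages being requeued.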
Horizontal scaling trade-offs
| Pros | Cons |
|---|---|
| Linear throughput scaling | Memory: ~2.9 GB per Worker (model loaded per instance) |
| Independent failure domains | Cold start time per instance (about 10 to 15 seconds for model loading) |
| Straightforward to configure | Network and storage overhead |
Configure admission control for scaled deployments
Scale MaxPendingJobs proportionally to the number of Workers:
MaxPendingJobs = (2 to 3) * num_workers * DetectionConcurrencyLimit
For example, with 3 Workers and a detection concurrency of 1:
environment:
ServiceCommunication__MaxPendingJobs: 9 # factor 3 * 3 Workers * concurrency 1
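The sizing rule can be written as a small helper; the function name and the `factor` parameter are illustrative, not part of the product configuration:

```python
def max_pending_jobs(num_workers, detection_concurrency, factor=3):
    """Sizing rule of thumb: 2-3x the cluster's total detection
    concurrency. `factor` selects the multiplier within that range."""
    return factor * num_workers * detection_concurrency

# 3 Workers, DetectionConcurrencyLimit 1, factor 3 -> 9
assert max_pending_jobs(3, 1) == 9
```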
Vertical scaling: Optimize a single Worker
Vertical scaling focuses on maximizing throughput within a single Worker instance using GPU acceleration and batch inference tuning.
How the Worker processes documents
When the Worker receives a PDF, it extracts text, splits it into text chunks (sized by MaxChunkSize), runs entity detection on each chunk, and consolidates the results. The AI model processes one inference call at a time per Worker, which is why horizontal scaling is the primary throughput strategy.
For details on MaxChunkSize, BatchSize, and other inference settings, refer to Worker > Inference and Tune chunk size for performance.
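The stages above can be sketched in Python. Everything here is a simplification: the chunking rule splits on whitespace rather than real tokens, and the detector stub stands in for the model, which in practice runs one inference call at a time:

```python
def split_into_chunks(words, max_chunk_size):
    """Split a word list into chunks of at most max_chunk_size tokens,
    mirroring how the Worker sizes chunks by MaxChunkSize."""
    return [words[i:i + max_chunk_size] for i in range(0, len(words), max_chunk_size)]

def detect_entities(chunk):
    """Stub detector: a toy capitalization rule standing in for the AI model."""
    return [w for w in chunk if w.istitle()]

def process_document(text, max_chunk_size=256):
    words = text.split()
    chunks = split_into_chunks(words, max_chunk_size)
    results = []
    for chunk in chunks:            # sequential: one inference call at a time
        results.extend(detect_entities(chunk))
    return sorted(set(results))     # consolidate results across chunks

entities = process_document("Alice paid Bob in Zurich " * 100, max_chunk_size=32)
assert entities == ["Alice", "Bob", "Zurich"]
```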
CPU vs. GPU inference
The Worker supports both CPU and GPU inference. GPU acceleration significantly reduces processing time per text chunk.
| Aspect | CPU | GPU (CUDA) |
|---|---|---|
| Setup | No additional requirements | Requires NVIDIA drivers and CUDA toolkit |
| Processing speed | Baseline | Significantly faster per chunk |
| Memory | Model loads into RAM (~2.9 GB) | Model loads into RAM + VRAM |
| Cost | Low | Higher (GPU hardware) |
| Best for | Dev/test, low traffic | Production, high throughput |
Configure the execution provider in appsettings.json:
{
"Inference": {
"ExecutionProvider": "Auto",
"GpuDeviceId": 0,
"CpuUtilizationPercentage": 80
}
}
| Setting | Default | Description |
|---|---|---|
| ExecutionProvider | Auto | Hardware for inference. Auto uses GPU automatically when running the -cuda image, and CPU otherwise. Set to Cpu to force CPU-only inference. |
| GpuDeviceId | 0 | GPU device ID when using CUDA. |
| CpuUtilizationPercentage | 80 | Percentage of CPU cores used for inference (1-100). Only applies to CPU execution. |
Docker image variants
Two Worker images are available on DockerHub:
| Image | Use case |
|---|---|
| pdftoolsag/smart-redact-worker:latest | CPU-only inference. No GPU requirements. |
| pdftoolsag/smart-redact-worker:latest-cuda | GPU-accelerated inference with NVIDIA CUDA. Best performance. |
NVIDIA GPU deployment
Use the GPU-specific Compose file and make sure the NVIDIA Container Toolkit is installed on the host:
services:
smart-redact-worker:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
environment:
Inference__ExecutionProvider: "Auto"
Inference__GpuDeviceId: 0
Auto is the recommended default. It uses GPU automatically with the -cuda image, and CPU otherwise. Set to Cpu to force CPU-only inference.
Batch inference tuning
The Worker splits each PDF into text chunks and groups them into batches before sending them to the AI model. Tuning batch size and chunk size directly controls the throughput-accuracy-memory trade-off.
How batching works
- The PDF is split into text chunks of up to MaxChunkSize tokens each. A 10-page document might produce 20-40 chunks depending on text density.
- Chunks are grouped into batches of BatchSize. Each batch is sent as a single inference call to the model.
- Fewer, larger batches mean fewer inference calls and less overhead per document. More, smaller batches use less memory.
{
"Inference": {
"BatchSize": 4,
"MaxChunkSize": 256,
"MaxLength": 512,
"MaxWidth": 12
}
}
Batch size (BatchSize)
Controls how many text chunks the model processes in a single inference call.
| BatchSize | Behavior | Memory impact | Best for |
|---|---|---|---|
| 1 (default) | One chunk per inference call | Lowest | Development, memory-constrained environments |
| 2-4 | Small batches | Low-moderate | CPU deployments, balanced workloads |
| 4-10 | Medium batches | Moderate | GPU deployments, production |
| 10-100 | Large batches | High | GPU with ample VRAM, maximum throughput |
Example: A 30-chunk document with BatchSize: 1 requires 30 inference calls. With BatchSize: 10, it requires only 3 inference calls, significantly reducing per-document processing time.
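The arithmetic behind that example is a ceiling division, sketched here as an illustrative helper (the function name is not part of the product):

```python
import math

def inference_calls(num_chunks, batch_size):
    """Number of model calls needed to cover all chunks of a document."""
    return math.ceil(num_chunks / batch_size)

assert inference_calls(30, 1) == 30   # one call per chunk
assert inference_calls(30, 10) == 3   # ten chunks per call
```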
Chunk size (MaxChunkSize)
Controls how many tokens (roughly words) each text chunk contains. Larger chunks give the model more context for disambiguation but take longer to process.
| MaxChunkSize | Use case | Trade-off |
|---|---|---|
| 384-512 | High accuracy | Maximum context per chunk, slower processing, fewer chunks per document |
| 256 (default) | Balanced | Good accuracy and speed |
| 128-256 | High throughput | Faster per-chunk processing, less context, more chunks per document |
The model evaluates candidate entity spans within each chunk. The number of candidates scales linearly with chunk size:
| MaxChunkSize | Candidates evaluated | Relative speed |
|---|---|---|
| 512 | ~6,144 | 1x (baseline) |
| 384 | ~4,608 | ~1.3x faster |
| 256 | ~3,072 | ~2x faster |
| 128 | ~1,536 | ~4x faster |
Start with MaxChunkSize: 256 and BatchSize: 2. Increase BatchSize first for more throughput. Only reduce MaxChunkSize below 256 if detection accuracy is acceptable at smaller context windows.
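The candidate counts in the table above follow from chunk size times the maximum span width (MaxWidth, default 12); the helper below reproduces them, assuming roughly one span start per token:

```python
def candidate_spans(max_chunk_size, max_width=12):
    """Approximate candidate entity spans per chunk: one span start
    per token, times up to max_width span widths."""
    return max_chunk_size * max_width

assert candidate_spans(512) == 6144
assert candidate_spans(256) == 3072
assert candidate_spans(128) == 1536
```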
Other model parameters
| Setting | Default | Description |
|---|---|---|
| MaxLength | 512 | Maximum input length, in tokens, that the model accepts. Don’t exceed this value without changing models. |
| MaxWidth | 12 | Maximum entity span width in words. For example, “Bank of America Corporation” is 4 words. |
Graph optimization
The Worker applies graph optimizations at model load time to improve inference speed:
{
"Inference": {
"GraphOptimizationLevel": "All",
"ExecutionMode": "Parallel"
}
}
| Setting | Default | Options |
|---|---|---|
| GraphOptimizationLevel | All | DisableAll, Basic, Extended, All (recommended). |
| ExecutionMode | Parallel | Sequential (single-threaded), Parallel (multi-threaded, recommended). |
Keep GraphOptimizationLevel set to All and ExecutionMode set to Parallel for best performance. Changing these defaults is not recommended.
Scale Manager nodes
To scale Manager nodes horizontally, switch from SQLite to PostgreSQL so that all Manager instances share the same state.
- Configure each Manager instance to use the same PostgreSQL database:
environment:
  Database__DatabaseType: "PostgreSql"
  Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-manager-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"
- Configure a load balancer to distribute requests across Manager instances.
- Make sure all Manager instances share the same file storage configuration. Refer to AWS S3 file storage.
Each Manager instance runs its own backpressure monitor, but they all query the same PostgreSQL database. All Managers see the same pending job count regardless of which Manager created the jobs.
Scale Orchestrator nodes
To scale the Orchestrator horizontally, switch from SQLite to PostgreSQL and add a shared Redis instance for session and token caching. Place a load balancer in front of the Orchestrator instances.
- Configure each Orchestrator instance to use the same PostgreSQL database:
environment:
  Database__DatabaseType: "PostgreSql"
  Database__ConnectionString: "User ID=smartredact;Password=smartredact;Server=smart-redact-orchestrator-db;Port=5432;Database=smartredact;Maximum Pool Size=50;Timeout=30;"
- Configure a shared Redis instance for session and token caching:
environment:
  Redis__ConnectionString: "shared-redis:6379"
- Place a load balancer in front of the Orchestrator instances.
Memory planning
Each Worker loads the AI model into memory at startup (~2.9 GB steady-state). Plan 4 GB per Worker to allow headroom for batch processing.
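As a back-of-the-envelope aid, the Worker memory budget scales linearly with the replica count; the helper below is illustrative and does not model Manager, broker, or OS overhead:

```python
def worker_memory_gb(num_workers, per_worker_gb=4.0):
    """Host RAM to reserve for Workers alone. 4 GB per Worker leaves
    headroom above the ~2.9 GB steady-state model footprint."""
    return num_workers * per_worker_gb

assert worker_memory_gb(3) == 12.0   # three Workers need ~12 GB between them
```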
Concurrency tuning
Detection and redaction use separate RabbitMQ queues with independent concurrency limits:
| Queue | Default concurrency | Reason |
|---|---|---|
| worker-detection-queue | 1 | The AI model processes one inference at a time. Concurrent detection adds no throughput, only memory pressure. |
| worker-redaction-queue | 4 | Lighter workload (about 100 ms per redaction); benefits from parallelism. |
Override per Worker:
environment:
ServiceCommunication__DetectionConcurrencyLimit: 1
ServiceCommunication__RedactionConcurrencyLimit: 4
Don’t increase DetectionConcurrencyLimit past 1. The AI model processes one inference at a time. Multiple detection threads queue up without improving throughput, only increasing memory pressure.
Scaling decision matrix
| Scenario | Recommended strategy |
|---|---|
| Low traffic, cost-sensitive | Single Worker, CPU, BatchSize: 1 |
| Medium traffic, balanced | 2-3 Worker instances, CPU, BatchSize: 2-4 |
| High traffic, latency-sensitive | Multiple Workers + GPU, BatchSize: 4-10 |
| Highest traffic | Horizontal (N Workers) + GPU + batch tuning |
| Burst traffic | Kubernetes HPA + multiple Workers |
Example configurations
The following examples show resource and environment fields for each scenario. The deploy.replicas field in these YAML blocks applies under Docker Swarm or Kubernetes; with docker compose up, start replicas with --scale smart-redact-worker=N instead.
Small deployment (dev/test):
# 1 Worker, CPU, minimal resources
smart-redact-worker:
mem_limit: 4g
cpus: 2.0
environment:
Inference__ExecutionProvider: "Cpu"
Inference__BatchSize: 1
Inference__MaxChunkSize: 256
Medium deployment (production):
# 3 Workers, CPU, tuned batching
smart-redact-worker:
deploy:
replicas: 3
mem_limit: 4g
cpus: 2.0
environment:
Inference__ExecutionProvider: "Cpu"
Inference__BatchSize: 4
Inference__MaxChunkSize: 256
ServiceCommunication__DetectionConcurrencyLimit: 1
ServiceCommunication__RedactionConcurrencyLimit: 4
High-throughput deployment (production, GPU):
# 3 Workers with GPU, aggressive batching
smart-redact-worker:
deploy:
replicas: 3
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
mem_limit: 6g
cpus: 2.0
environment:
Inference__ExecutionProvider: "Auto"
Inference__GpuDeviceId: 0
Inference__BatchSize: 10
Inference__MaxChunkSize: 384
ServiceCommunication__DetectionConcurrencyLimit: 1
ServiceCommunication__RedactionConcurrencyLimit: 4