Monitor AI Smart Redact
Monitor AI Smart Redact using health check endpoints, log files, and the RabbitMQ management interface. This guide covers the monitoring capabilities of all three services.
Health check endpoints
Each service exposes health check endpoints for monitoring, load balancer probes, and container orchestration.
| Service | Endpoint | Default port |
|---|---|---|
| Manager | http://localhost:9982/healthz/ready | 9982 |
| Worker | http://localhost:4885/healthz/ready (internal) | 4885 |
| Orchestrator | http://localhost:9983/healthz/ready | 9983 |
The Worker port (4885) is internal to the Docker network in the default deployment and isn’t exposed to the host. To check Worker health from the host, use:
docker compose exec smart-redact-worker curl http://localhost:4885/healthz/ready
All services expose three health endpoints:
| Endpoint | Purpose |
|---|---|
| /healthz/startup | Startup probe. Returns 200 once the service has finished initialization. |
| /healthz/live | Liveness probe. Returns 200 if the service process is running and not deadlocked. |
| /healthz/ready | Readiness probe. Returns 200 if the service is ready to accept traffic. |
For Kubernetes deployments, use /healthz/live for liveness probes and /healthz/ready for readiness probes. Use /healthz/startup for startup probes to avoid premature liveness failures during model loading.
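As an example, a probe block for a Worker container might look like the following sketch. It assumes the default container port 4885; the timings are illustrative and should be tuned to how long model loading takes in your environment.
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 4885
  periodSeconds: 10
  failureThreshold: 30   # allow up to ~5 minutes for initialization/model loading
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 4885
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 4885
  periodSeconds: 5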
License validation
Each service validates the license key, but the behavior differs:
| Service | Behavior on invalid or missing license |
|---|---|
| Manager | Starts normally, but rejects API requests with HTTP 403. |
| Orchestrator | Starts normally, but rejects API requests with HTTP 403. |
| Worker | Fails to start. Check Worker logs if the container exits immediately after startup. |
Health states
The Manager’s health endpoint reports three states:
| State | HTTP status | Condition |
|---|---|---|
| Healthy | 200 | Pending job count is below the threshold. |
| Degraded | 200 | Pending job count exceeds the HealthCheckPendingJobThreshold. The service is still accepting requests, but signals saturation to monitoring tools. |
| Unhealthy | 503 | The database is unreachable. |
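To check which state the Manager is currently reporting, inspect the HTTP status code from the host. This assumes the default Manager port 9982 from the table above:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9982/healthz/ready
A 200 means Healthy or Degraded; a 503 means the database is unreachable.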
Configure backpressure-aware health checks
Enable backpressure monitoring on the Manager to signal saturation to load balancers and orchestrators like Kubernetes:
environment:
ServiceCommunication__HealthCheckPendingJobThreshold: 15
ServiceCommunication__BackpressureMonitorIntervalSeconds: 5
| Setting | Default | Description |
|---|---|---|
| HealthCheckPendingJobThreshold | null (disabled) | Pending job count above which /healthz/ready returns Degraded. |
| BackpressureMonitorIntervalSeconds | 5 | How often (in seconds) the service polls the pending job count from the database. |
Admission control
Limit the number of concurrent pending jobs to prevent unbounded queue growth. When the limit is reached, new requests are rejected with HTTP 429 (Too Many Requests).
environment:
ServiceCommunication__MaxPendingJobs: 20
| Setting | Default | Description |
|---|---|---|
| MaxPendingJobs | null (disabled) | Maximum in-progress jobs before rejecting new requests. Must be a positive integer when set. |
Scale MaxPendingJobs with the number of Workers: a good starting point is 2-3 × (number of Workers × DetectionConcurrencyLimit).
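For example, with 4 Workers each running with DetectionConcurrencyLimit: 2, the deployment actively processes up to 8 jobs at a time, so a MaxPendingJobs value between 16 and 24 (such as the 20 shown above) is a reasonable starting point.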
Circuit breaker
The Manager detects repeated Worker failures and fast-fails subsequent requests instead of waiting for timeouts.
environment:
ServiceCommunication__CircuitBreakerFailureThreshold: 3
ServiceCommunication__CircuitBreakerDurationSeconds: 30
| State | Behavior |
|---|---|
| Closed (normal) | Requests flow through to the Worker. |
| Open | After consecutive failures exceed the threshold, all requests fast-fail with HTTP 503 for the configured duration. |
| Half-Open | After the open duration, one probe request is allowed. If it succeeds, the circuit closes. If it fails, it reopens. |
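With the example configuration above, three consecutive Worker failures open the circuit; for the next 30 seconds every request fails fast with HTTP 503, after which a single probe request decides whether the circuit closes again or reopens for another 30 seconds.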
Log files
All services write structured logs using Serilog. In the samples repository Docker Compose files, logs are written to a named volume:
volumes:
- logs:/app/logs
Configure the log file path and retention in appsettings.json:
{
"LogFilePath": "/app/logs/smart-redact-manager-log.txt",
"LogRetentionDays": 7
}
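To follow the Manager log from the host, you can exec into the container. This sketch assumes the Compose service is named smart-redact-manager and uses the log path shown above; adjust both to your deployment:
docker compose exec smart-redact-manager tail -f /app/logs/smart-redact-manager-log.txt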
OpenTelemetry
AI Smart Redact supports OpenTelemetry for exporting traces, logs, and metrics to your monitoring backend. Telemetry is disabled by default and has zero runtime overhead when not configured.
Enable telemetry by setting the OTLP endpoint on the Manager and Worker services:
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL: grpc
| Signal | What it tells you |
|---|---|
| Traces | How long each job took, where time was spent, whether it succeeded or failed. |
| Logs | Structured application logs with trace correlation (TraceId/SpanId). |
| Metrics | API request rates, response latencies, error rates, and job counters. |
The service is compatible with any OTLP-capable backend: Grafana, Seq, Jaeger, Datadog, Elastic APM, or Azure Monitor.
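For a quick local test, you can point the services at a Jaeger all-in-one instance, which accepts OTLP over gRPC on port 4317. The image tag and flag below are illustrative; check the Jaeger documentation for your version:
docker run --rm -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
Then set OTEL_EXPORTER_OTLP_ENDPOINT to the collector address (for example http://localhost:4317 when everything runs on the host network) and open http://localhost:16686 to browse traces.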
Grafana dashboard
AI Smart Redact exports telemetry through OpenTelemetry, which you can connect to Grafana to visualize jobs, detection metrics, HTTP server activity, and recent operations. For span attributes, custom metrics, export defaults, and example queries, refer to Set up observability for AI Smart Redact.
RabbitMQ management UI
When using RabbitMQ for service communication, the management UI provides visibility into queue depths, message rates, and consumer status.
- URL: http://localhost:15672
- Default credentials: guest / guest
Key queues to monitor
| Queue | Purpose |
|---|---|
| worker-detection-queue | Detection jobs sent from the Manager to Workers |
| worker-redaction-queue | Redaction jobs sent from the Manager to Workers |
| manager-queue | Job results sent from Workers back to the Manager |
| worker-detection-queue_error | Failed detection messages (after all retries are exhausted) |
| worker-redaction-queue_error | Failed redaction messages |
| manager-queue_error | Failed result messages from Workers to the Manager |
A growing error queue indicates recurring infrastructure issues. Check Worker logs for the root cause.
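Queue depths can also be checked from the command line without opening the UI. This assumes the broker container is named rabbitmq, as in the troubleshooting commands below:
docker exec rabbitmq rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers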
Troubleshooting
Worker job stuck in InProgress
- Check Worker logs for errors.
- Check Manager logs for S-FAULT or Send timeout entries. These indicate the broker rejected the message or the send timed out.
- Check the RabbitMQ management UI. Look for the message in an error queue.
- If the Worker restarted, the message should have been redelivered automatically.
- If MaxPendingJobs is configured and the limit was reached, the job is deleted (not stuck) and the client receives HTTP 429.
Worker not consuming messages
- Verify ServiceCommunicationType matches on both Manager and Worker (both must use RabbitMQ).
- Check that the broker is healthy:
docker exec rabbitmq rabbitmq-diagnostics -q check_port_connectivity
- Verify the concurrency limit isn’t set to 0:
ServiceCommunication__DetectionConcurrencyLimit: 1
Out-of-memory during file upload
If an upload triggers an out-of-memory condition during encryption, the service returns HTTP 503 with "The server does not have enough resources to process this request. Try reducing file size or concurrent uploads." The service stays healthy and continues processing subsequent requests.
To prevent this, tune MaxFileSizeBytes and MaxConcurrentConnections together so that the worst case (maximum file size × concurrent uploads) stays within the memory available to the service.
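As a sketch, assuming both settings are plain environment variables like the others in this guide (verify the exact key names and configuration section against the configuration reference), you might cap uploads at 100 MB and limit concurrent connections:
environment:
  MaxFileSizeBytes: 104857600        # 100 MB (illustrative value)
  MaxConcurrentConnections: 10       # illustrative value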