Monitor AI Smart Redact
Monitor AI Smart Redact using health check endpoints, log files, and the RabbitMQ management interface. This guide covers the monitoring capabilities of all three services.
Health check endpoints
Each service exposes health check endpoints for monitoring, load balancer probes, and container orchestration.
| Service | Endpoint | Default port |
|---|---|---|
| Manager | http://localhost:9982/healthz/ready | 9982 |
| Worker | http://localhost:4885/healthz/ready (internal) | 4885 |
| Orchestrator | http://localhost:9983/healthz/ready | 9983 |
The Worker port (4885) is internal to the Docker network in the default deployment and isn’t exposed to the host. To check Worker health from the host, use:
docker compose exec smart-redact-worker curl http://localhost:4885/healthz/ready
All services expose three health endpoints:
| Endpoint | Purpose |
|---|---|
| /healthz/startup | Startup probe. Returns 200 once the service has finished initialization. |
| /healthz/live | Liveness probe. Returns 200 if the service process is running and not deadlocked. |
| /healthz/ready | Readiness probe. Returns 200 if the service is ready to accept traffic. |
For Kubernetes deployments, use /healthz/live for liveness probes and /healthz/ready for readiness probes. Use /healthz/startup for startup probes to avoid premature liveness failures during model loading.
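As an example, a probe block for a Worker container might look like the following sketch. It assumes the default container port 4885; the timings are illustrative and should be tuned to how long model loading takes in your environment.
startupProbe:
  httpGet:
    path: /healthz/startup
    port: 4885
  periodSeconds: 10
  failureThreshold: 30   # allow up to ~5 minutes for initialization/model loading
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 4885
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 4885
  periodSeconds: 5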
License validation
Each service validates the license key, but the behavior differs:
| Service | Behavior on invalid or missing license |
|---|---|
| Manager | Starts normally, but rejects API requests with HTTP 403. |
| Orchestrator | Starts normally, but rejects API requests with HTTP 403. |
| Worker | Fails to start. Check Worker logs if the container exits immediately after startup. |
Health states
The Manager’s health endpoint reports three states:
| State | HTTP status | Condition |
|---|---|---|
| Healthy | 200 | Pending job count is below the threshold. |
| Degraded | 200 | Pending job count exceeds the HealthCheckPendingJobThreshold. The service is still accepting requests, but signals saturation to monitoring tools. |
| Unhealthy | 503 | The database is unreachable. |
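To check which state the Manager is currently reporting, inspect the HTTP status code from the host. This assumes the default Manager port 9982 from the table above:
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9982/healthz/ready
A 200 means Healthy or Degraded; a 503 means the database is unreachable.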
Configure backpressure-aware health checks
Enable backpressure monitoring on the Manager to signal saturation to load balancers and orchestrators like Kubernetes:
environment:
ServiceCommunication__HealthCheckPendingJobThreshold: 15
ServiceCommunication__BackpressureMonitorIntervalSeconds: 5
| Setting | Default | Description |
|---|---|---|
| HealthCheckPendingJobThreshold | null (disabled) | Pending job count above which /healthz/ready returns Degraded. |
| BackpressureMonitorIntervalSeconds | 5 | How often (in seconds) the service polls the pending job count from the database. |
Admission control
Limit the number of concurrent pending jobs to prevent unbounded queue growth. When the limit is reached, new requests are rejected with HTTP 429 (Too Many Requests).
environment:
ServiceCommunication__MaxPendingJobs: 20
| Setting | Default | Description |
|---|---|---|
| MaxPendingJobs | null (disabled) | Maximum in-progress jobs before rejecting new requests. Must be a positive integer when set. |
Scale MaxPendingJobs with the number of Workers: a good starting point is 2-3 × (number of Workers × DetectionConcurrencyLimit).
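For example, with 4 Workers each running with DetectionConcurrencyLimit: 2, the deployment actively processes up to 8 jobs at a time, so a MaxPendingJobs value between 16 and 24 (such as the 20 shown above) is a reasonable starting point.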
Circuit breaker
The Manager detects repeated Worker failures and fast-fails subsequent requests instead of waiting for timeouts.
environment:
ServiceCommunication__CircuitBreakerFailureThreshold: 3
ServiceCommunication__CircuitBreakerDurationSeconds: 30
| State | Behavior |
|---|---|
| Closed (normal) | Requests flow through to the Worker. |
| Open | After consecutive failures exceed the threshold, all requests fast-fail with HTTP 503 for the configured duration. |
| Half-Open | After the open duration, one probe request is allowed. If it succeeds, the circuit closes. If it fails, it reopens. |
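With the example configuration above, three consecutive Worker failures open the circuit; for the next 30 seconds every request fails fast with HTTP 503, after which a single probe request decides whether the circuit closes again or reopens for another 30 seconds.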
Log files
All services write structured logs using Serilog. In the samples repository Docker Compose files, logs are written to a named volume:
volumes:
- logs:/app/logs
Configure the log file path and retention in appsettings.json:
{
"LogFilePath": "/app/logs/smart-redact-manager-log.txt",
"LogRetentionDays": 7
}
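To follow the Manager log from the host, you can exec into the container. This sketch assumes the Compose service is named smart-redact-manager and uses the log path shown above; adjust both to your deployment:
docker compose exec smart-redact-manager tail -f /app/logs/smart-redact-manager-log.txt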
OpenTelemetry
AI Smart Redact supports OpenTelemetry for exporting traces, logs, and metrics to your monitoring backend. Telemetry is disabled by default and has zero runtime overhead when not configured.
Enable telemetry by setting the OTLP endpoint on the Manager and Worker services:
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL: grpc
| Signal | What it tells you |
|---|---|
| Traces | How long each job took, where time was spent, whether it succeeded or failed. |
| Logs | Structured application logs with trace correlation (TraceId/SpanId). |
| Metrics | API request rates, response latencies, error rates, and job counters. |
The service is compatible with any OTLP-capable backend: Grafana, Seq, Jaeger, Datadog, Elastic APM, or Azure Monitor.
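For a quick local test, you can point the services at a Jaeger all-in-one instance, which accepts OTLP over gRPC on port 4317. The image tag and flag below are illustrative; check the Jaeger documentation for your version:
docker run --rm -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
Then set OTEL_EXPORTER_OTLP_ENDPOINT to the collector address (for example http://localhost:4317 when everything runs on the host network) and open http://localhost:16686 to browse traces.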
Grafana dashboard
AI Smart Redact exports telemetry through OpenTelemetry, which you can connect to Grafana to visualize jobs, detection metrics, HTTP server activity, and recent operations. For span attributes, custom metrics, export defaults, and example queries, refer to Set up observability for AI Smart Redact.
RabbitMQ management UI
When using RabbitMQ for service communication, the management UI provides visibility into queue depths, message rates, and consumer status.
- URL: http://localhost:15672
- Default credentials: guest / guest
Key queues to monitor
| Queue | Purpose |
|---|---|
| worker-detection-queue | Detection jobs sent from the Manager to Workers |
| worker-redaction-queue | Redaction jobs sent from the Manager to Workers |
| manager-queue | Job results sent from Workers back to the Manager |
| worker-detection-queue_error | Failed detection messages (after all retries are exhausted) |
| worker-redaction-queue_error | Failed redaction messages |
| manager-queue_error | Failed result messages from Workers to the Manager |
A growing error queue indicates recurring infrastructure issues. Check Worker logs for the root cause.
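Queue depths can also be checked from the command line without opening the UI. This assumes the broker container is named rabbitmq, as in the troubleshooting commands below:
docker exec rabbitmq rabbitmqctl list_queues name messages messages_ready messages_unacknowledged consumers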
Troubleshooting
Worker job stuck in InProgress
- Check Worker logs for errors.
- Check Manager logs for S-FAULT or Send timeout entries. These indicate the broker rejected the message or the send timed out.
- Check the RabbitMQ management UI. Look for the message in an error queue.
- If the Worker restarted, the message should have been redelivered automatically.
- If MaxPendingJobs is configured and the limit was reached, the job is deleted (not stuck) and the client receives HTTP 429.
Worker not consuming messages
- Verify ServiceCommunicationType matches on both Manager and Worker (both must use RabbitMQ).
- Check that the broker is healthy:
docker exec rabbitmq rabbitmq-diagnostics -q check_port_connectivity
- Verify the concurrency limit isn’t set to 0:
ServiceCommunication__DetectionConcurrencyLimit: 1
Out-of-memory during file upload
If an upload triggers an out-of-memory condition during encryption, the service returns HTTP 503 with "The server does not have enough resources to process this request. Try reducing file size or concurrent uploads." The service stays healthy and continues processing subsequent requests.
To prevent this, tune MaxFileSizeBytes and MaxConcurrentConnections together so that the worst case (maximum file size × concurrent uploads) stays within the memory available to the service.
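As a sketch, assuming both settings are plain environment variables like the others in this guide (verify the exact key names and configuration section against the configuration reference), you might cap uploads at 100 MB and limit concurrent connections:
environment:
  MaxFileSizeBytes: 104857600        # 100 MB (illustrative value)
  MaxConcurrentConnections: 10       # illustrative value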