
Monitor AI Smart Redact

Monitor AI Smart Redact using health check endpoints, log files, and the RabbitMQ management interface. This guide covers the monitoring capabilities of all three services.

Health check endpoints

Each service exposes health check endpoints for monitoring, load balancer probes, and container orchestration.

| Service | Endpoint | Default port |
| --- | --- | --- |
| Manager | http://localhost:9982/healthz/ready | 9982 |
| Worker | http://localhost:4885/healthz/ready (internal) | 4885 |
| Orchestrator | http://localhost:9983/healthz/ready | 9983 |

The Worker port (4885) is internal to the Docker network in the default deployment and isn’t exposed to the host. To check Worker health from the host, use:

docker compose exec smart-redact-worker curl http://localhost:4885/healthz/ready

In addition to /healthz/ready, each service exposes startup and liveness endpoints:

| Endpoint | Purpose |
| --- | --- |
| /healthz/startup | Startup probe. Returns 200 once the service has finished initialization. |
| /healthz/live | Liveness probe. Returns 200 if the service process is running and not deadlocked. |
| /healthz/ready | Readiness probe. Returns 200 if the service is ready to accept traffic. |

For Kubernetes deployments, use /healthz/live for liveness probes and /healthz/ready for readiness probes. Use /healthz/startup for startup probes to avoid premature liveness failures during model loading.
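The probe mapping above can be sketched as a Kubernetes container spec. This is illustrative only: the container name, image port wiring, and timing values are assumptions, not shipped defaults.

```yaml
# Illustrative probe configuration for the Manager; names and timings are assumptions.
containers:
  - name: smart-redact-manager   # hypothetical container name
    ports:
      - containerPort: 9982
    startupProbe:
      httpGet:
        path: /healthz/startup
        port: 9982
      periodSeconds: 10
      failureThreshold: 30       # allow up to 30 × 10 s before liveness takes over
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 9982
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 9982
      periodSeconds: 5
```

A generous `startupProbe` budget is what prevents the premature liveness failures during model loading mentioned above.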

License validation

Each service validates the license key, but the behavior differs:

| Service | Behavior on invalid or missing license |
| --- | --- |
| Manager | Starts normally, but rejects API requests with HTTP 403. |
| Orchestrator | Starts normally, but rejects API requests with HTTP 403. |
| Worker | Fails to start. Check Worker logs if the container exits immediately after startup. |

Health states

The Manager’s health endpoint reports three states:

| State | HTTP status | Condition |
| --- | --- | --- |
| Healthy | 200 | Pending job count is below the threshold. |
| Degraded | 200 | Pending job count exceeds the HealthCheckPendingJobThreshold. The service is still accepting requests, but signals saturation to monitoring tools. |
| Unhealthy | 503 | The database is unreachable. |
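Because Healthy and Degraded both return 200, a monitoring script has to inspect the response body as well as the status code. A minimal sketch, assuming the body contains the literal state name (the exact body format is an assumption; verify against your deployment):

```shell
# Classify Manager health from an HTTP status code and response body.
# Assumption: the body contains the state name ("Healthy"/"Degraded").
classify_health() {
  code="$1"; body="$2"
  if [ "$code" = "503" ]; then
    echo "Unhealthy"
  elif [ "$code" = "200" ]; then
    case "$body" in
      *Degraded*) echo "Degraded" ;;
      *)          echo "Healthy"  ;;
    esac
  else
    echo "Unknown ($code)"
  fi
}

# Against a live Manager you would feed it curl output, e.g.:
#   curl -s -w '\n%{http_code}' http://localhost:9982/healthz/ready
# Exercised here with sample values:
classify_health 200 "Healthy"    # -> Healthy
classify_health 200 "Degraded"   # -> Degraded
classify_health 503 ""           # -> Unhealthy
```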

Configure backpressure-aware health checks

Enable backpressure monitoring on the Manager to signal saturation to load balancers and orchestrators like Kubernetes:

environment:
  ServiceCommunication__HealthCheckPendingJobThreshold: 15
  ServiceCommunication__BackpressureMonitorIntervalSeconds: 5

| Setting | Default | Description |
| --- | --- | --- |
| HealthCheckPendingJobThreshold | null (disabled) | Pending job count above which /healthz/ready returns Degraded. |
| BackpressureMonitorIntervalSeconds | 5 | How often (in seconds) the service polls the pending job count from the database. |

Admission control

Limit the number of concurrent pending jobs to prevent unbounded queue growth. When the limit is reached, new requests are rejected with HTTP 429 (Too Many Requests).

environment:
  ServiceCommunication__MaxPendingJobs: 20

| Setting | Default | Description |
| --- | --- | --- |
| MaxPendingJobs | null (disabled) | Maximum in-progress jobs before rejecting new requests. Must be a positive integer when set. |
Tip: Scale MaxPendingJobs proportionally to the number of Workers: set it to roughly 2–3 × num_workers × DetectionConcurrencyLimit.
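As a worked example of that rule of thumb (the worker count and concurrency limit below are hypothetical values, not defaults):

```shell
# Hypothetical deployment: 4 Workers, each with DetectionConcurrencyLimit of 2.
num_workers=4
detection_concurrency_limit=2
lower=$((2 * num_workers * detection_concurrency_limit))  # 2 × 4 × 2 = 16
upper=$((3 * num_workers * detection_concurrency_limit))  # 3 × 4 × 2 = 24
echo "Suggested MaxPendingJobs range: $lower-$upper"
# -> Suggested MaxPendingJobs range: 16-24
```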

Circuit breaker

The Manager detects repeated Worker failures and fast-fails subsequent requests instead of waiting for timeouts.

environment:
  ServiceCommunication__CircuitBreakerFailureThreshold: 3
  ServiceCommunication__CircuitBreakerDurationSeconds: 30

| State | Behavior |
| --- | --- |
| Closed (normal) | Requests flow through to the Worker. |
| Open | After consecutive failures exceed the threshold, all requests fast-fail with HTTP 503 for the configured duration. |
| Half-Open | After the open duration, one probe request is allowed. If it succeeds, the circuit closes. If it fails, it reopens. |
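The state machine above can be condensed into a toy sketch. This illustrates the pattern only, not the Manager's implementation, and the time-based Open → Half-Open transition is elided:

```shell
# Toy circuit breaker: trips open after 3 consecutive failures,
# fast-fails while open, and closes again on a successful probe.
# The 30-second Open -> Half-Open timer is deliberately omitted.
threshold=3
failures=0
state="closed"

record_result() {              # $1 = "ok" or "fail" (simulated Worker call)
  if [ "$state" = "open" ]; then
    echo "fast-fail (503)"     # request never reaches the Worker
    return
  fi
  if [ "$1" = "fail" ]; then
    failures=$((failures + 1))
    [ "$failures" -ge "$threshold" ] && state="open"
    echo "forwarded: fail ($failures/$threshold)"
  else
    failures=0
    state="closed"             # a successful half-open probe closes the circuit
    echo "forwarded: ok"
  fi
}

record_result fail   # forwarded: fail (1/3)
record_result fail   # forwarded: fail (2/3)
record_result fail   # forwarded: fail (3/3) -> circuit opens
record_result ok     # fast-fail (503)
```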

Log files

All services write structured logs using Serilog. In the samples repository Docker Compose files, logs are written to a named volume:

volumes:
  - logs:/app/logs

Configure the log file path and retention in appsettings.json:

{
  "LogFilePath": "/app/logs/smart-redact-manager-log.txt",
  "LogRetentionDays": 7
}
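LogRetentionDays caps how long rolled log files are kept; the service handles deletion itself. For illustration only (this is not the service's mechanism, and the directory and file names are made up), an equivalent manual cleanup looks like:

```shell
# Delete log files older than 7 days from a demo directory.
# Illustration of what a 7-day retention window means; the service
# applies LogRetentionDays internally, so you never need to run this.
LOG_DIR=/tmp/demo-logs
mkdir -p "$LOG_DIR"
touch "$LOG_DIR/recent-log.txt"                 # fresh file, kept
touch -d '10 days ago' "$LOG_DIR/old-log.txt"   # stale file, removed

find "$LOG_DIR" -name '*log*.txt' -mtime +7 -delete

ls "$LOG_DIR"   # -> recent-log.txt
```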

OpenTelemetry

AI Smart Redact supports OpenTelemetry for exporting traces, logs, and metrics to your monitoring backend. Telemetry is disabled by default and has zero runtime overhead when not configured.

Enable telemetry by setting the OTLP endpoint on the Manager and Worker services:

environment:
  OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
  OTEL_EXPORTER_OTLP_PROTOCOL: grpc

| Signal | What it tells you |
| --- | --- |
| Traces | How long each job took, where time was spent, whether it succeeded or failed. |
| Logs | Structured application logs with trace correlation (TraceId/SpanId). |
| Metrics | API request rates, response latencies, error rates, and job counters. |

The service is compatible with any OTLP-capable backend: Grafana, Seq, Jaeger, Datadog, Elastic APM, or Azure Monitor.
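If you don't have a backend yet, a minimal OpenTelemetry Collector configuration that accepts OTLP over gRPC on port 4317 and prints what it receives is enough to verify connectivity. This is a sketch for smoke-testing, not a production setup; swap the debug exporter for your real backend's exporter:

```yaml
# otel-collector-config.yaml (minimal connectivity-test sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:   # prints received telemetry to the collector's stdout

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    logs:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
```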

Grafana dashboard

AI Smart Redact exports telemetry through OpenTelemetry, which you can connect to Grafana to visualize jobs, detection metrics, HTTP server activity, and recent operations. For span attributes, custom metrics, export defaults, and example queries, refer to Set up observability for AI Smart Redact.

RabbitMQ management UI

When using RabbitMQ for service communication, the management UI provides visibility into queue depths, message rates, and consumer status.

  • URL: http://localhost:15672
  • Default credentials: guest / guest

Key queues to monitor

| Queue | Purpose |
| --- | --- |
| worker-detection-queue | Detection jobs sent from Manager to Workers |
| worker-redaction-queue | Redaction jobs sent from Manager to Workers |
| manager-queue | Job results sent from Workers back to the Manager |
| worker-detection-queue_error | Failed detection messages (after all retries exhausted) |
| worker-redaction-queue_error | Failed redaction messages |
| manager-queue_error | Failed result messages from Workers to the Manager |

A growing error queue indicates recurring infrastructure issues. Check Worker logs for the root cause.
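Queue depths are also available from the CLI via `rabbitmqctl list_queues name messages`. The sketch below filters that output for non-empty `_error` queues; it is exercised here against hardcoded sample output (the depths are made up), and in practice you would pipe in the live command:

```shell
# Flag non-empty *_error queues in `rabbitmqctl list_queues name messages`
# output. Live usage (hypothetical container name):
#   docker exec rabbitmq rabbitmqctl list_queues name messages | flag_error_queues
flag_error_queues() {
  awk '$1 ~ /_error$/ && $2 > 0 { print $1 " has " $2 " failed messages" }'
}

# Exercised with sample (fabricated) output:
flag_error_queues <<'EOF'
worker-detection-queue	0
worker-detection-queue_error	4
worker-redaction-queue_error	0
manager-queue_error	1
EOF
# -> worker-detection-queue_error has 4 failed messages
# -> manager-queue_error has 1 failed messages
```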

Troubleshooting

Worker job stuck in InProgress

  1. Check Worker logs for errors.
  2. Check Manager logs for S-FAULT or Send timeout. These indicate the broker rejected the message or the send timed out.
  3. Check the RabbitMQ management UI. Look for the message in an error queue.
  4. If the Worker restarted, the message should have been redelivered automatically.
  5. If MaxPendingJobs is configured and the limit was reached, the job is deleted (not stuck) and the client receives HTTP 429.

Worker not consuming messages

  1. Verify ServiceCommunicationType matches on both Manager and Worker (both must use RabbitMQ).
  2. Check the broker is healthy:
    docker exec rabbitmq rabbitmq-diagnostics -q check_port_connectivity
  3. Verify the concurrency limit isn’t set to 0:
    ServiceCommunication__DetectionConcurrencyLimit: 1

Out-of-memory during file upload

If an upload triggers an out-of-memory condition during encryption, the service returns HTTP 503 with "The server does not have enough resources to process this request. Try reducing file size or concurrent uploads." The service stays healthy and continues processing subsequent requests.

To prevent this, tune MaxFileSizeBytes and MaxConcurrentConnections together.
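A rough sizing check for those two settings (the values and the simple multiplication are assumptions for illustration; encryption overhead varies): worst-case upload memory is on the order of MaxFileSizeBytes × MaxConcurrentConnections, so keep that product well under the container's memory limit.

```shell
# Back-of-the-envelope bound on memory held by in-flight uploads.
# Values are hypothetical; adjust to your deployment.
max_file_size_bytes=$((100 * 1024 * 1024))   # 100 MiB per file
max_concurrent_connections=8
peak=$((max_file_size_bytes * max_concurrent_connections))
echo "Worst-case upload memory: $((peak / 1024 / 1024)) MiB"
# -> Worst-case upload memory: 800 MiB
```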