Set up observability for AI Smart Redact
AI Smart Redact includes built-in observability through OpenTelemetry, giving you visibility into job processing, errors, and performance. Telemetry is disabled by default and has zero runtime overhead when not configured.
Overview
The service exports three types of telemetry data through the OpenTelemetry Protocol (OTLP):
| Signal | What it tells you |
|---|---|
| Traces | How long each job took, where time was spent, whether it succeeded or failed. |
| Logs | Structured application logs with trace correlation (TraceId/SpanId). |
| Metrics | API request rates, response latencies, error rates, and job counters. |
The service is compatible with any OTLP-capable backend: Grafana, Seq, Jaeger, Datadog, Elastic APM, Azure Monitor, or a self-hosted OpenTelemetry Collector.
Enable telemetry
Set these environment variables on the Manager and Worker services:
| Variable | Required | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Yes | Your OTLP collector endpoint. Setting this enables telemetry. |
| OTEL_EXPORTER_OTLP_PROTOCOL | No | Transport protocol: grpc (default) or http/protobuf. Use http/protobuf for backends like Seq that require HTTP. |
| OTEL_SERVICE_NAME | No | Overrides the default service name in telemetry data (SmartRedact.Manager or SmartRedact.Worker). Set this if you prefer a different name in your telemetry backend. |
The default service names (SmartRedact.Manager, SmartRedact.Worker) are set by the application. The example queries in this guide use these defaults. If you override OTEL_SERVICE_NAME, adjust the queries accordingly.
Docker Compose example (gRPC backend):

```yaml
services:
  smart-redact-manager:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
  smart-redact-worker:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
```
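For an HTTP-based backend such as Seq, the same setup uses the http/protobuf protocol. A minimal sketch is shown below; the hostname, port, and ingestion path are placeholders (4318 is the conventional OTLP/HTTP port, but Seq and other backends may use a different port or path), so check your backend's OTLP ingestion documentation:

```yaml
services:
  smart-redact-manager:
    environment:
      # Endpoint is illustrative; consult your backend's OTLP docs.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-backend:4318
      OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
  smart-redact-worker:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-backend:4318
      OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
```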
Export defaults
Telemetry data is batched and buffered before export. The values below are the OpenTelemetry SDK defaults; no configuration is needed unless you want to tune them.
Traces (Batch Span Processor)
| Variable | Default | Description |
|---|---|---|
| OTEL_BSP_SCHEDULE_DELAY | 5000 ms | How often batches are flushed. |
| OTEL_BSP_MAX_EXPORT_BATCH_SIZE | 512 | Maximum spans per export. |
| OTEL_BSP_MAX_QUEUE_SIZE | 2048 | Maximum spans queued before dropping. |
| OTEL_BSP_EXPORT_TIMEOUT | 30000 ms | Timeout per export call. |
Logs (Batch Log Record Processor)
| Variable | Default | Description |
|---|---|---|
| OTEL_BLRP_SCHEDULE_DELAY | 1000 ms | How often batches are flushed. |
| OTEL_BLRP_MAX_EXPORT_BATCH_SIZE | 512 | Maximum log records per export. |
| OTEL_BLRP_MAX_QUEUE_SIZE | 2048 | Maximum log records queued before dropping. |
| OTEL_BLRP_EXPORT_TIMEOUT | 30000 ms | Timeout per export call. |
Metrics (Periodic Metric Reader)
| Variable | Default | Description |
|---|---|---|
| OTEL_METRIC_EXPORT_INTERVAL | 60000 ms | How often metrics are exported. |
| OTEL_METRIC_EXPORT_TIMEOUT | 30000 ms | Timeout per export call. |
In practice: spans are sent every 5 seconds (or when 512 accumulate), logs every 1 second, and metrics every 60 seconds.
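If the defaults don't fit your environment, you can override them with the same environment-variable mechanism. The sketch below, with illustrative values, flushes spans and metrics more aggressively, which can be useful in low-traffic or short-lived deployments at the cost of more frequent export calls:

```yaml
services:
  smart-redact-worker:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
      # Flush spans every 1 s instead of the 5 s default.
      OTEL_BSP_SCHEDULE_DELAY: "1000"
      # Export metrics every 15 s instead of the 60 s default.
      OTEL_METRIC_EXPORT_INTERVAL: "15000"
```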
Job processing traces
Every detection and redaction job produces a trace span on the Worker service. Each span captures:
- Job identity: job ID, file ID, job type (detection or redaction).
- Status: finished or error, with failure reason on errors.
- Timing: start time, duration, end time.
- Job metrics: page count, file size, entity counts (depending on job type).
The Manager enriches its consumer spans with job identity tags, so you can trace the full flow from Manager to Worker and back. Application logs include TraceId and SpanId fields, letting you jump from a log entry directly to its trace.
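To illustrate the correlation, a structured log record with TraceId and SpanId might look like the JSON below. Only the TraceId and SpanId fields are confirmed by this guide; the remaining field names and values are hypothetical and will depend on your logging backend's rendering:

```json
{
  "Timestamp": "2024-05-14T09:31:02.114Z",
  "LogLevel": "Information",
  "Message": "DetectionResultEvent published",
  "TraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "SpanId": "00f067aa0ba902b7"
}
```

Searching your trace backend for the TraceId value from such a record takes you directly to the corresponding job span.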
Span attributes reference
These attributes are set on Worker job processing spans:
| Attribute | Type | Description | Example |
|---|---|---|---|
| job.id | string | Unique job identifier | a1b2c3d4-e5f6-... |
| job.type | string | detection or redaction | detection |
| job.file.id | string | Primary input file identifier | e5f6g7h8-... |
| job.status | string | Final status: Finished or Error | Finished |
| failure.reason | string | Exception type on failure (absent on success) | DekTokenValidationException |
| input.file.pages | int | Input PDF page count (detection only) | 12 |
| input.file.size_bytes | long | Input PDF size in bytes (detection only) | 524288 |
| input.entities.count | int | Entities submitted for redaction (redaction only) | 35 |
| output.entities.count | int | Detected entity count (detection only) | 42 |
| output.file.size_bytes | long | Redacted PDF size in bytes (redaction only) | 498000 |
Custom metrics reference
These counters are exported as OpenTelemetry metrics through OTLP and tagged with job.type (detection/redaction) and job.status (Finished/Error):
| Metric name | Extra tags | Description |
|---|---|---|
| jobs.completed | (none) | Jobs completed (detection + redaction). |
| detection.entities.detected | entity.type | Total entities detected, broken down by label. |
| detection.pages.processed | (none) | Total pages processed across detection jobs. |
| licensing.pages.consumed | (none) | Total pages reported for license consumption. |
Useful queries
These examples use TraceQL (Grafana Tempo) for traces and LogQL (Grafana Loki) for logs.
Find all job spans for a specific job:

```
{span.job.id = "your-job-id-here"}
```

Find all failed jobs:

```
{span.job.status = "Error"}
```

Find failed detection jobs:

```
{span.job.status = "Error" && span.job.type = "detection"}
```

List recent detection results (log-based):

```
{service_name="SmartRedact.Worker"} |= "DetectionResultEvent"
```

Find log events for a specific file:

```
{service_name="SmartRedact.Worker"} |= "ResultEvent" | json | FileId = "your-file-id-here"
```
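Since spans carry duration, TraceQL can also surface slow jobs. The 30-second threshold here is an arbitrary example, not a recommended SLO:

```
{span.job.type = "detection" && duration > 30s}
```

For the custom counters, a Prometheus-style query is also possible if your collector forwards metrics to a Prometheus-compatible backend. Note that metric names are typically normalized on ingestion (dots become underscores, and counters may gain a _total suffix), so verify the exact names in your backend before relying on this sketch:

```
sum by (job_status) (rate(jobs_completed_total[5m]))
```

This gives the per-second job completion rate over the last five minutes, split by final status.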