Set up observability for AI Smart Redact
AI Smart Redact includes built-in observability through OpenTelemetry, giving you visibility into job processing, errors, and performance. Telemetry is disabled by default and has zero runtime overhead when not configured.
Overview
The service exports three types of telemetry data through the OpenTelemetry Protocol (OTLP):
| Signal | What it tells you |
|---|---|
| Traces | How long each job took, where time was spent, whether it succeeded or failed. |
| Logs | Structured application logs with trace correlation (TraceId/SpanId). |
| Metrics | API request rates, response latencies, error rates, and job counters. |
The service is compatible with any OTLP-capable backend: Grafana, Seq, Jaeger, Datadog, Elastic APM, Azure Monitor, or a self-hosted OpenTelemetry Collector.
Enable telemetry
Set these environment variables on the Manager and Worker services:
| Variable | Required | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Yes | Your OTLP collector endpoint. Setting this enables telemetry. |
| OTEL_EXPORTER_OTLP_PROTOCOL | No | Transport protocol: grpc (default) or http/protobuf. Use http/protobuf for backends like Seq that require HTTP. |
| OTEL_SERVICE_NAME | No | Overrides the default service name in telemetry data (SmartRedact.Manager or SmartRedact.Worker). Set this if you prefer a different name in your telemetry backend. |
The default service names (SmartRedact.Manager, SmartRedact.Worker) are set by the application. The example queries in this guide use these defaults. If you override OTEL_SERVICE_NAME, adjust the queries accordingly.
Docker Compose example (gRPC backend):

```yaml
services:
  smart-redact-manager:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
  smart-redact-worker:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
```
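For an HTTP-based backend such as Seq, the same setup uses the http/protobuf protocol. A minimal sketch is shown below; the hostname, port, and ingestion path are placeholders (4318 is the conventional OTLP/HTTP port, but Seq and other backends may use a different port or path), so check your backend's OTLP ingestion documentation:

```yaml
services:
  smart-redact-manager:
    environment:
      # Endpoint is illustrative; consult your backend's OTLP docs.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-backend:4318
      OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
  smart-redact-worker:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-backend:4318
      OTEL_EXPORTER_OTLP_PROTOCOL: http/protobuf
```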
Export defaults
Telemetry data is batched and buffered before export. The values below are the OpenTelemetry SDK defaults; no configuration is needed unless you want to tune them.
Traces (Batch Span Processor)
| Variable | Default | Description |
|---|---|---|
| OTEL_BSP_SCHEDULE_DELAY | 5000 ms | How often batches are flushed. |
| OTEL_BSP_MAX_EXPORT_BATCH_SIZE | 512 | Maximum spans per export. |
| OTEL_BSP_MAX_QUEUE_SIZE | 2048 | Maximum spans queued before dropping. |
| OTEL_BSP_EXPORT_TIMEOUT | 30000 ms | Timeout per export call. |
Logs (Batch Log Record Processor)
| Variable | Default | Description |
|---|---|---|
| OTEL_BLRP_SCHEDULE_DELAY | 1000 ms | How often batches are flushed. |
| OTEL_BLRP_MAX_EXPORT_BATCH_SIZE | 512 | Maximum log records per export. |
| OTEL_BLRP_MAX_QUEUE_SIZE | 2048 | Maximum log records queued before dropping. |
| OTEL_BLRP_EXPORT_TIMEOUT | 30000 ms | Timeout per export call. |
Metrics (Periodic Metric Reader)
| Variable | Default | Description |
|---|---|---|
| OTEL_METRIC_EXPORT_INTERVAL | 60000 ms | How often metrics are exported. |
| OTEL_METRIC_EXPORT_TIMEOUT | 30000 ms | Timeout per export call. |
In practice: spans are sent every 5 seconds (or when 512 accumulate), logs every 1 second, and metrics every 60 seconds.
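If the defaults don't fit your environment, you can override them with the same environment-variable mechanism. The sketch below, with illustrative values, flushes spans and metrics more aggressively, which can be useful in low-traffic or short-lived deployments at the cost of more frequent export calls:

```yaml
services:
  smart-redact-worker:
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://your-collector:4317
      # Flush spans every 1 s instead of the 5 s default.
      OTEL_BSP_SCHEDULE_DELAY: "1000"
      # Export metrics every 15 s instead of the 60 s default.
      OTEL_METRIC_EXPORT_INTERVAL: "15000"
```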
Job processing traces
Every detection and redaction job produces a trace span on the Worker service. Each span captures:
- Job identity: job ID, file ID, job type (detection or redaction).
- Status: finished or error, with failure reason on errors.
- Timing: start time, duration, end time.
- Job metrics: page count, file size, entity counts (depending on job type).
The Manager enriches its consumer spans with job identity tags, so you can trace the full flow from Manager to Worker and back. Application logs include TraceId and SpanId fields, letting you jump from a log entry directly to its trace.
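To illustrate the correlation, a structured log record with TraceId and SpanId might look like the JSON below. Only the TraceId and SpanId fields are confirmed by this guide; the remaining field names and values are hypothetical and will depend on your logging backend's rendering:

```json
{
  "Timestamp": "2024-05-14T09:31:02.114Z",
  "LogLevel": "Information",
  "Message": "DetectionResultEvent published",
  "TraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "SpanId": "00f067aa0ba902b7"
}
```

Searching your trace backend for the TraceId value from such a record takes you directly to the corresponding job span.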
Span attributes reference
These attributes are set on Worker job processing spans:
| Attribute | Type | Description | Example |
|---|---|---|---|
| job.id | string | Unique job identifier | a1b2c3d4-e5f6-... |
| job.type | string | detection or redaction | detection |
| job.file.id | string | Primary input file identifier | e5f6g7h8-... |
| job.status | string | Final status: Finished or Error | Finished |
| failure.reason | string | Exception type on failure (absent on success) | DekTokenValidationException |
| input.file.pages | int | Input PDF page count (detection only) | 12 |
| input.file.size_bytes | long | Input PDF size in bytes (detection only) | 524288 |
| input.entities.count | int | Entities submitted for redaction (redaction only) | 35 |
| output.entities.count | int | Detected entity count (detection only) | 42 |
| output.file.size_bytes | long | Redacted PDF size in bytes (redaction only) | 498000 |
Custom metrics reference
These counters are exported as OpenTelemetry metrics through OTLP and tagged with job.type (detection/redaction) and job.status (Finished/Error):
| Metric name | Extra tags | Description |
|---|---|---|
| jobs.completed | (none) | Jobs completed (detection + redaction). |
| detection.entities.detected | entity.type | Total entities detected, broken down by label. |
| detection.pages.processed | (none) | Total pages processed across detection jobs. |
| licensing.pages.consumed | (none) | Total pages reported for license consumption. |
Useful queries
These examples use TraceQL (Grafana Tempo) for traces and LogQL (Grafana Loki) for logs.
Find all job spans for a specific job:

```
{span.job.id = "your-job-id-here"}
```

Find all failed jobs:

```
{span.job.status = "Error"}
```

Find failed detection jobs:

```
{span.job.status = "Error" && span.job.type = "detection"}
```

List recent detection results (log-based):

```
{service_name="SmartRedact.Worker"} |= "DetectionResultEvent"
```

Find log events for a specific file:

```
{service_name="SmartRedact.Worker"} |= "ResultEvent" | json | FileId = "your-file-id-here"
```
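Since spans carry duration, TraceQL can also surface slow jobs. The 30-second threshold here is an arbitrary example, not a recommended SLO:

```
{span.job.type = "detection" && duration > 30s}
```

For the custom counters, a Prometheus-style query is also possible if your collector forwards metrics to a Prometheus-compatible backend. Note that metric names are typically normalized on ingestion (dots become underscores, and counters may gain a _total suffix), so verify the exact names in your backend before relying on this sketch:

```
sum by (job_status) (rate(jobs_completed_total[5m]))
```

This gives the per-second job completion rate over the last five minutes, split by final status.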