OpenTelemetry for Kuadrant Operator¶
This example demonstrates how to enable OpenTelemetry logging, tracing, and metrics export from the Kuadrant Operator.
Features¶
- Dual Logging: Logs to both console (Zap) and remote collector (OTLP) with automatic trace correlation
- Trace Correlation: Logs include `trace_id` and `span_id` for distributed tracing
- Metrics Bridge: Export existing Prometheus metrics via OTLP without code changes
- Flexible Configuration: Enable signals independently via endpoint configuration
- Local Development Stack: Complete observability stack (Loki, Grafana, Tempo, Prometheus) with Docker Compose
Architecture¶
┌─────────────────────────────────────────┐
│ Kuadrant Operator │
│ │
│ ┌──────────────────────────────────┐ │
│ │ Zap Logger (Tee Core) │ │
│ │ ┌─────────────┬──────────────┐ │ │
│ │ │ Console Core│ OTel Core │ │ │
│ │ │ (formatted) │ (otelzap │ │ │
│ │ │ │ bridge) │ │ │
│ │ └──────┬──────┴──────┬───────┘ │ │
│ └─────────┼─────────────┼──────────┘ │
│ stdout OTLP (logs) │
│ │ │
│ ┌──────────────────────┼──────────┐ │
│ │ Prometheus Metrics │ │ │
│ │ • controller_runtime_* │ │
│ │ • kuadrant_dns_policy_ready │ │
│ └──────────┬───────────┼──────────┘ │
│ │ │ │
│ ┌──────────▼───────────┼──────────┐ │
│ │ OTel Prometheus Bridge │ │
│ │ (zero code changes) │ │
│ └──────────┬───────────┼──────────┘ │
│ │ │ │
└─────────────┼───────────┼───────────────┘
│ OTLP (metrics)
│ │
┌─────────▼───────────▼──────────┐
│ OTel Collector │
│ • Logs pipeline │
│ • Traces pipeline │
│ • Metrics pipeline │
└─────────┬──────────────────────┘
│
┌───────┴──────────────────┐
│ │
┌───▼────┐ ┌────▼─────┐ ┌───▼────────┐
│ Loki │ │ Tempo │ │ Prometheus │
│ (Logs) │ │ (Traces) │ │ (Metrics) │
└────────┘ └──────────┘ └────────────┘
│ │ │
└────────────┴──────────────┘
│
┌──────▼──────┐
│ Grafana │
│ (Dashboards)│
└─────────────┘
Quick Start¶
1. Start Observability Stack¶
This starts:
- OTel Collector - Receives OTLP logs, traces, and metrics on ports 4317 (gRPC) and 4318 (HTTP)
- Loki - Stores logs with full-text search and label filtering on port 3100
- Tempo - Distributed tracing backend on port 3200
- Jaeger - Alternative distributed tracing UI on port 16686
- Prometheus - Stores and queries metrics on port 9090
- Grafana - Unified observability UI on port 3000 (admin/admin)
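If the start command is missing from your copy of this page, the stack maps onto a Docker Compose layout along these lines; the service names, images, and file paths are assumptions for illustration, not the example's actual compose file:

```yaml
# docker-compose.yaml (sketch; images and service names are assumptions)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    ports: ["4317:4317", "4318:4318"]
    volumes: ["./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml"]
  loki:
    image: grafana/loki
    ports: ["3100:3100"]
  tempo:
    image: grafana/tempo
    ports: ["3200:3200"]
  jaeger:
    image: jaegertracing/all-in-one
    ports: ["16686:16686"]
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
```

With a file like this in place, `docker compose up -d` brings the stack up.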
2. Run Operator with OpenTelemetry Enabled¶
# Set environment variables
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_EXPORTER_OTLP_INSECURE=true
export OTEL_METRICS_INTERVAL_SECONDS=5
# Alternatively, enable signals individually:
# export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://localhost:4318
# export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=rpc://localhost:4317
# export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://localhost:4318
# Run the operator
make run
3. Verify Logs¶
View logs in Loki via Grafana:
# Open Grafana
open http://localhost:3000 # Login: admin/admin
# Navigate to Explore → Loki
# Query: {service_name="kuadrant-operator"}
View logs in OTel Collector debug output:
4. Verify Metrics¶
Query metrics in Prometheus:
# Open Prometheus UI
open http://localhost:9090
# Or query via API
curl 'http://localhost:9090/api/v1/query?query=controller_runtime_active_workers'
Access operator Prometheus endpoint directly:
5. Verify Traces¶
View traces in Tempo via Grafana:
# Open Grafana
open http://localhost:3000
# Navigate to Explore → Tempo
# Search by trace ID from logs
Or view in Jaeger UI:
6. Unified Observability in Grafana¶
Grafana provides a unified view across all signals:
- Logs (Loki): Full-text search with label filtering
- Traces (Tempo): Distributed request tracing
- Metrics (Prometheus): Time-series metrics and dashboards
- Correlation: Click trace IDs in logs to jump to traces
How It Works¶
Dual Logging with Trace Correlation¶
The operator uses a Tee core architecture powered by the official `go.opentelemetry.io/contrib/bridges/otelzap` library:
Console Core:
- Formats logs for human readability (JSON or console format based on `LOG_MODE`)
- Respects `LOG_LEVEL` for verbosity filtering
- Extracts and displays `trace_id` and `span_id` from context for correlation
- Filters out noisy context objects
OTel Core (otelzap bridge):
- Sends structured logs to OTLP collector
- Automatically extracts trace context from `context.Context` fields
- Preserves all log attributes and severity levels
- Enables correlation with traces in Tempo/Jaeger
Usage in code:
import (
    "context"

    "github.com/kuadrant/policy-machinery/controller"
    "go.opentelemetry.io/otel"
)

func (r *MyReconciler) Reconcile(ctx context.Context) (controller.Result, error) {
    // Start a tracing span (required for trace_id/span_id)
    tracer := otel.Tracer("kuadrant-operator")
    ctx, span := tracer.Start(ctx, "MyReconciler.Reconcile")
    defer span.End()

    // Get logger from context and attach context for trace extraction
    logger := controller.LoggerFromContext(ctx).WithValues("context", ctx)
    logger.Info("reconciling resource") // Automatically includes trace_id and span_id

    return controller.Result{}, nil
}
Important Notes:
- The tracing span must be started before getting the logger for trace IDs to be present
- The `tracer.Start()` call enriches the context with trace context
- The `.WithValues("context", ctx)` call passes the enriched context to the logger for extraction
Both cores receive the same log records from the Tee, ensuring consistent logging across console and remote backends.
Dual Metrics Export¶
The operator exposes metrics in two ways simultaneously:
- Prometheus `/metrics` endpoint (`:8080/metrics`) - Native Prometheus scraping
- OTLP push (when metrics endpoint is configured) - Push to OTel Collector
Both expose the same underlying metrics from the same Prometheus registry. The OTel bridge reads from the Prometheus registry and converts to OTLP format.
Important: Avoid Metric Duplication¶
When configuring Prometheus scraping, choose one of these options:
Option 1 (Recommended): Scrape via OTel Collector
# prometheus.yaml (default in this example)
- job_name: "kuadrant-operator"
  static_configs:
    - targets: ["otel-collector:8889"]
✅ Use when OTLP metrics export is enabled (endpoint configured)
✅ Allows OTel processing/filtering before Prometheus
✅ Consistent with OTel-first approach
Option 2: Scrape operator directly
# prometheus.yaml (alternative)
- job_name: "kuadrant-operator"
  static_configs:
    - targets: ["host.docker.internal:8080"]
✅ Use when OTLP metrics export is disabled (no endpoint)
✅ Traditional Prometheus setup
✅ No OTel Collector needed
❌ Don't scrape both - This creates duplicate time series with different labels.
Environment Variables¶
Shared OpenTelemetry Configuration¶
| Variable | Required | Default | Description |
|---|---|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | No | - (disabled) | OTLP collector endpoint (enables all signals). Supports `http://`, `https://`, `rpc://` schemes |
| `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | No | - | Override endpoint specifically for logs (enables logs if set) |
| `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` | No | - | Override endpoint specifically for traces (enables traces if set) |
| `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` | No | - | Override endpoint specifically for metrics (enables metrics if set) |
| `OTEL_EXPORTER_OTLP_INSECURE` | No | `false` | Disable TLS for OTLP export (required for `rpc://` scheme without TLS) |
| `OTEL_SERVICE_NAME` | No | `kuadrant-operator` | Service name shown in Grafana/Tempo/Jaeger |
| `OTEL_SERVICE_VERSION` | No | Build version | Service version (defaults to version from ldflags) |
Configuration Logic:
- If `OTEL_EXPORTER_OTLP_ENDPOINT` is set, all signals (logs, traces, metrics) are enabled with that endpoint
- Per-signal endpoints override the global endpoint for that specific signal
- If no endpoint is configured (neither global nor per-signal), that signal is disabled
- Endpoint schemes: `http://` (insecure HTTP), `https://` (secure HTTP), `rpc://` (gRPC; use with `OTEL_EXPORTER_OTLP_INSECURE=true` for plaintext)
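The scheme-to-protocol mapping described above can be sketched in Go; the function name and return values are illustrative, not the operator's actual API:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// resolveScheme maps an endpoint scheme to an exporter protocol and a
// plaintext flag, following this example's http:// / https:// / rpc://
// convention. For rpc://, plaintext additionally requires
// OTEL_EXPORTER_OTLP_INSECURE=true.
func resolveScheme(endpoint string) (protocol string, plaintext bool, err error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", false, err
	}
	switch strings.ToLower(u.Scheme) {
	case "http":
		return "http/protobuf", true, nil
	case "https":
		return "http/protobuf", false, nil
	case "rpc":
		return "grpc", false, nil
	default:
		return "", false, fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

func main() {
	protocol, plaintext, _ := resolveScheme("rpc://localhost:4317")
	fmt.Println(protocol, plaintext)
}
```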
Metrics-Specific Configuration¶
| Variable | Default | Description |
|---|---|---|
| `OTEL_METRICS_INTERVAL_SECONDS` | `15` | Export interval in seconds |
Available Metrics¶
Controller-Runtime Metrics¶
All standard controller-runtime metrics are automatically exported:
- `controller_runtime_reconcile_total` - Total reconciliations per controller
- `controller_runtime_reconcile_errors_total` - Reconciliation errors
- `controller_runtime_reconcile_time_seconds` - Reconciliation duration
- `controller_runtime_max_concurrent_reconciles` - Worker count
- `controller_runtime_active_workers` - Active workers
Custom Kuadrant Metrics¶
- `kuadrant_dns_policy_ready` - DNS Policy ready status
  - Labels: `dns_policy_name`, `dns_policy_namespace`, `dns_policy_condition`
Go Runtime Metrics¶
Standard Go metrics from prometheus/client_golang:
- `go_memstats_*` - Memory statistics
- `go_goroutines` - Number of goroutines
- `go_threads` - Number of OS threads
- `process_*` - Process metrics
Kubernetes Deployment¶
Add to your operator deployment:
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otel-collector.observability.svc.cluster.local:4318"
  # Or enable signals individually with different endpoints/protocols:
  # - name: OTEL_EXPORTER_OTLP_LOGS_ENDPOINT
  #   value: "http://loki-gateway.observability.svc.cluster.local:3100"
  # - name: OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
  #   value: "rpc://tempo.observability.svc.cluster.local:4317"
  # - name: OTEL_EXPORTER_OTLP_METRICS_ENDPOINT
  #   value: "http://otel-collector.observability.svc.cluster.local:4318"
  - name: OTEL_EXPORTER_OTLP_INSECURE
    value: "false" # Use TLS in production
  - name: OTEL_SERVICE_NAME
    value: "kuadrant-operator"
  - name: OTEL_METRICS_INTERVAL_SECONDS
    value: "60"
Configuration¶
OTel Collector¶
Edit otel-collector-config.yaml to add remote OTLP export:
exporters:
  debug:
    verbosity: detailed
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Add remote OTLP exporter
  otlphttp:
    endpoint: https://your-observability-backend.com
    headers:
      authorization: "Bearer <your-api-key>"

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [debug, otlphttp] # Export logs to multiple backends
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [debug, otlphttp] # Export traces to multiple backends
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [debug, prometheus, otlphttp] # Export metrics to multiple backends
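The pipelines above reference `batch` and `resource` processors that fall outside the snippet; a minimal `processors` section could look like the following (the resource attribute key and value are placeholders):

```yaml
processors:
  batch: {}
  resource:
    attributes:
      - key: service.namespace
        value: kuadrant
        action: upsert
```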
Implementation Details¶
Logging Architecture¶
The operator uses a sophisticated logging setup that provides:
- Official OTel Integration: Uses `go.opentelemetry.io/contrib/bridges/otelzap` for robust OTel support
- Tee Core Pattern: Single Zap logger with two cores (console + OTel) via `zapcore.NewTee()`
- Trace Context Extraction: Custom `contextFilterCore` extracts `trace_id` and `span_id` for console output
- Clean Console Output: Filters noisy context objects while preserving trace correlation
- Zero Overhead When Disabled: Standard Zap logger when no endpoint is configured
Key Files:
- `internal/log/otel.go` - OTel logging setup with Tee architecture and `contextFilterCore`
- `internal/log/log.go` - Standard logging setup
- `cmd/main.go` - Conditional OTel initialization based on env vars
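The Tee pattern can be illustrated with a stdlib-only analogy: one logger fanning out to two sinks, much as `zapcore.NewTee()` fans structured records out to the console and OTel cores (each of which then applies its own encoding and filtering). This is a sketch, not the operator's implementation:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
)

// tee writes msg through a single logger into two sinks and returns
// what each sink received, mirroring the one-logger/two-cores shape.
func tee(msg string) (console, remote string) {
	var c, r bytes.Buffer
	logger := log.New(io.MultiWriter(&c, &r), "", 0)
	logger.Println(msg)
	return c.String(), r.String()
}

func main() {
	c, r := tee("reconciling resource")
	fmt.Print(c)
	fmt.Println(c == r) // true: both sinks see the same record
}
```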
Trace Context Propagation¶
To enable trace correlation in logs, you must start a tracing span and attach the context to the logger:
import "go.opentelemetry.io/otel"
// Without tracing span - no trace context
logger := controller.LoggerFromContext(ctx).WithValues("context", ctx)
logger.Info("message") // No trace_id (no active span)
// With tracing span - full trace correlation
tracer := otel.Tracer("kuadrant-operator")
ctx, span := tracer.Start(ctx, "MyOperation")
defer span.End()
logger := controller.LoggerFromContext(ctx).WithValues("context", ctx)
logger.Info("message") // Includes trace_id and span_id
The contextFilterCore in internal/log/otel.go handles the context field differently for each core:
- Console core: Extracts `trace_id` and `span_id` as readable strings, filters out the noisy context object
- OTel core: Uses the official otelzap bridge to include full trace context in OTLP records