| name | opentelemetry-observability |
| description | OpenTelemetry specialist for distributed tracing, metrics collection, log correlation, auto-instrumentation, custom spans, trace context propagation, and sampling strategies. Use when implementing observability in microservices, debugging production issues, monitoring performance, or requiring OpenTelemetry best practices. Handles integration with Jaeger/Zipkin/Tempo, Prometheus/Grafana, and cloud-native observability platforms. |
| category | Observability |
| complexity | High |
| triggers | opentelemetry, otel, distributed tracing, observability, metrics, spans, tracing, jaeger, zipkin, tempo, instrumentation |
OpenTelemetry Observability Specialist
Expert guidance on distributed tracing, metrics, and logging with OpenTelemetry for production observability.
Purpose
Comprehensive OpenTelemetry expertise including auto-instrumentation, custom spans, metrics collection, log correlation, trace context propagation, and sampling. Ensures applications are fully observable with actionable telemetry data.
When to Use
- Implementing distributed tracing in microservices
- Monitoring application performance (APM)
- Debugging production issues across services
- Setting up metrics collection and dashboards
- Correlating logs with traces
- Optimizing sampling strategies for cost/performance
- Migrating from proprietary APM to OpenTelemetry
Prerequisites
Required: Understanding of distributed systems, HTTP, and basic observability concepts
Agents: cicd-engineer, perf-analyzer, backend-dev, system-architect
Core Workflows
Workflow 1: Node.js Auto-Instrumentation
Step 1: Install OpenTelemetry Packages
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-metrics-otlp-http
Step 2: Initialize OpenTelemetry
// instrumentation.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'my-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'production',
}),
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4318/v1/traces',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://localhost:4318/v1/metrics',
}),
exportIntervalMillis: 60000,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true },
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown().then(
() => console.log('Tracing terminated'),
(err) => console.log('Error terminating tracing', err)
);
});
Step 3: Start Application with Instrumentation
node --require ./instrumentation.js app.js
Workflow 2: Custom Spans and Attributes
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service', '1.0.0');
async function processOrder(orderId) {
const span = tracer.startSpan('processOrder', {
attributes: {
'order.id': orderId,
'order.priority': 'high',
},
});
try {
// Mark the span successful; the catch block overrides this on failure
span.setStatus({ code: SpanStatusCode.OK });
// Add event to span
span.addEvent('order_validated', {
'validation.result': 'success',
});
// Child span: parent it explicitly via the context API
const childSpan = tracer.startSpan('calculateTotal', undefined, trace.setSpan(context.active(), span));
const total = await calculateTotal(orderId);
childSpan.setAttribute('order.total', total);
childSpan.end();
return total;
} catch (error) {
// Record exception
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
Workflow 3: Custom Metrics
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('my-service', '1.0.0');
// Counter: Monotonically increasing value
const orderCounter = meter.createCounter('orders.processed', {
description: 'Total number of orders processed',
});
orderCounter.add(1, {
'order.type': 'online',
'order.status': 'completed',
});
// Histogram: Statistical distribution
const requestDuration = meter.createHistogram('http.server.duration', {
description: 'HTTP request duration in milliseconds',
unit: 'ms',
});
requestDuration.record(150, {
'http.method': 'POST',
'http.route': '/api/orders',
'http.status_code': 200,
});
// UpDownCounter: Value can go up or down
const activeConnections = meter.createUpDownCounter('db.connections.active', {
description: 'Number of active database connections',
});
activeConnections.add(1); // Connection opened
activeConnections.add(-1); // Connection closed
// ObservableGauge: Current value snapshot
const memoryUsage = meter.createObservableGauge('process.memory.usage', {
description: 'Process memory usage in bytes',
unit: 'bytes',
});
memoryUsage.addCallback((result) => {
result.observe(process.memoryUsage().heapUsed, {
'memory.type': 'heap',
});
});
Workflow 4: Context Propagation (W3C Trace Context)
// Propagate context between services
const { propagation, context, trace } = require('@opentelemetry/api');
// Client-side: Inject trace context into HTTP headers
async function callExternalService(url, data) {
const span = tracer.startSpan('external_api_call');
const headers = {};
// Inject the new span's context into the headers (W3C Trace Context)
propagation.inject(trace.setSpan(context.active(), span), headers);
try {
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
...headers, // traceparent, tracestate headers
},
body: JSON.stringify(data),
});
return response.json();
} finally {
span.end();
}
}
// Server-side: Extract trace context from HTTP headers
app.post('/api/process', (req, res) => {
// Extract context from incoming headers
const extractedContext = propagation.extract(context.active(), req.headers);
context.with(extractedContext, () => {
const span = tracer.startSpan('process_request');
// This span will be a child of the parent trace from the caller
// ...
span.end();
});
res.json({ status: 'ok' });
});
Workflow 5: Sampling Strategies
const { ParentBasedSampler, AlwaysOnSampler, AlwaysOffSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
// Probability-based sampling (10% of traces)
const sampler = new TraceIdRatioBasedSampler(0.1);
// Parent-based sampling: honor the caller's sampling decision
const parentBasedSampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1), // 10% for root spans
remoteParentSampled: new AlwaysOnSampler(), // Always sample if parent sampled
remoteParentNotSampled: new AlwaysOffSampler(), // Never sample if parent not sampled
localParentSampled: new AlwaysOnSampler(),
localParentNotSampled: new AlwaysOffSampler(),
});
const sdk = new NodeSDK({
sampler: parentBasedSampler,
// ... other config
});
Best Practices
1. Use Semantic Conventions
// ✅ GOOD: Standard semantic conventions
const { SemanticAttributes } = require('@opentelemetry/semantic-conventions');
span.setAttributes({
[SemanticAttributes.HTTP_METHOD]: 'POST',
[SemanticAttributes.HTTP_URL]: '/api/users',
[SemanticAttributes.HTTP_STATUS_CODE]: 200,
[SemanticAttributes.DB_SYSTEM]: 'postgresql',
[SemanticAttributes.DB_NAME]: 'mydb',
});
// ❌ BAD: Custom attributes without namespace
span.setAttributes({
method: 'POST',
url: '/api/users',
});
2. Keep Span Name Cardinality Low
// ✅ GOOD: Generic operation name (use attributes for details)
const span = tracer.startSpan('GET /api/users/:id', {
attributes: { 'user.id': userId },
});
// ❌ BAD: High cardinality span names
const span = tracer.startSpan(`GET /api/users/${userId}`);
3. Always End Spans
// ✅ GOOD: Use try/finally to ensure span ends
const span = tracer.startSpan('operation');
try {
await doWork();
} finally {
span.end();
}
// ❌ BAD: Span might never end
const span = tracer.startSpan('operation');
await doWork();
span.end();
4. Use Baggage for Cross-Cutting Concerns
const { propagation, context } = require('@opentelemetry/api');
// Set baggage (propagates across service boundaries)
const baggage = propagation.createBaggage({
'user.id': { value: '12345' },
'request.id': { value: 'req-abc-123' },
});
context.with(propagation.setBaggage(context.active(), baggage), () => {
// Baggage available in all child spans
const userId = propagation.getBaggage(context.active())?.getEntry('user.id')?.value;
});
5. Log Correlation
const { trace } = require('@opentelemetry/api');
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.combine(
winston.format((info) => {
const span = trace.getActiveSpan();
if (span) {
const spanContext = span.spanContext();
info.trace_id = spanContext.traceId;
info.span_id = spanContext.spanId;
}
return info;
})(),
winston.format.json()
),
transports: [new winston.transports.Console()],
});
logger.info('Order processed', { order_id: '123' });
// Output: { "message": "Order processed", "order_id": "123", "trace_id": "...", "span_id": "..." }
Quality Criteria
- ✅ All HTTP requests automatically traced
- ✅ Database queries instrumented
- ✅ Custom business logic spans added
- ✅ Metrics exported every 60 seconds
- ✅ Sampling rate configured (not 100% in production)
- ✅ Trace context propagated across services
- ✅ Logs correlated with traces
Backend Setup (Jaeger)
# Run Jaeger all-in-one (for development)
docker run -d --name jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 16686:16686 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
# Access Jaeger UI: http://localhost:16686
Troubleshooting
- Issue: No traces appearing in Jaeger. Solution: Check exporter URL, ensure OTLP collector is running, verify network connectivity.
- Issue: High memory usage. Solution: Reduce sampling rate, use batch span processor with smaller queue size.
- Issue: Missing trace context between services. Solution: Ensure W3C Trace Context headers (traceparent, tracestate) are propagated.
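For the first two issues, a useful first step is to turn on OpenTelemetry's diagnostic logging and tighten the batch span processor limits. A minimal sketch (the OTLP endpoint matches Workflow 1; how the processor is registered, spanProcessor vs. spanProcessors, depends on your NodeSDK version):
// debug-instrumentation.js: a sketch for diagnosing export failures and memory pressure
const { diag, DiagConsoleLogger, DiagLogLevel } = require('@opentelemetry/api');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// Print SDK internals (exporter errors, retries, dropped spans) to the console.
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.DEBUG);

// Smaller queue and batch sizes trade export throughput for lower memory usage.
const spanProcessor = new BatchSpanProcessor(
  new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  {
    maxQueueSize: 512,       // SDK default: 2048
    maxExportBatchSize: 128, // SDK default: 512
    scheduledDelayMillis: 5000,
  }
);
// Register spanProcessor with your NodeSDK or TracerProvider as appropriate.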
Related Skills
- kubernetes-specialist: Deploying OTel Collector in K8s
- aws-specialist: AWS X-Ray integration
- backend-dev: Application instrumentation
Tools
- Jaeger: Open-source tracing backend
- Zipkin: Distributed tracing system
- Grafana Tempo: High-scale tracing backend
- Prometheus: Metrics collection
- Grafana: Visualization
MCP Tools
- mcp__flow-nexus__execution_stream_subscribe for real-time trace monitoring
- mcp__flow-nexus__realtime_subscribe for live metrics
- mcp__memory-mcp__memory_store for OTel patterns
Success Metrics
- Trace coverage: ≥95% of requests
- Sampling rate: 5-10% (production)
- Metrics export interval: 60 seconds
- Span drop rate: <1%
- Log-trace correlation: 100%
Skill Version: 1.0.0 Last Updated: 2025-11-02
Core Principles
OpenTelemetry Observability operates on 3 fundamental principles:
Principle 1: Context Propagation is Non-Negotiable
Distributed tracing only works if trace context flows across service boundaries. W3C Trace Context headers (traceparent, tracestate) must be propagated through HTTP calls, message queues, and async operations. This principle enables end-to-end visibility.
In practice:
- Use OpenTelemetry auto-instrumentation to inject/extract trace context automatically
- Explicitly propagate context in custom HTTP clients and message queue consumers (see the sketch after this list)
- Validate context propagation in integration tests by checking trace continuity
- Use baggage for cross-cutting concerns like user ID or request ID that need to flow everywhere
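A minimal sketch of the explicit-propagation case from the list above, using a hypothetical message queue client (queueClient.publish, message.headers, and the handler wiring are assumptions; the propagation and context calls are the standard OpenTelemetry API):
const { propagation, context, trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('my-service', '1.0.0');

// Producer: carry the active trace context in the message headers.
async function publishOrderEvent(queueClient, payload) {
  const headers = {};
  propagation.inject(context.active(), headers); // adds traceparent/tracestate
  await queueClient.publish('orders', { headers, body: JSON.stringify(payload) });
}

// Consumer: restore the producer's context so new spans join the same trace.
async function handleOrderEvent(message) {
  const parentContext = propagation.extract(context.active(), message.headers);
  await context.with(parentContext, async () => {
    const span = tracer.startSpan('orders.process');
    try {
      // ... business logic ...
    } finally {
      span.end();
    }
  });
}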
Principle 2: Cardinality Control Prevents Metric Explosions
High-cardinality attributes (user IDs, request IDs) in metric labels cause exponential growth in time series, leading to OOM errors, high storage costs, and query timeouts. This principle ensures sustainable observability costs.
In practice:
- Use low-cardinality labels for metrics (environment, service, endpoint, status code)
- Put high-cardinality data in span attributes, not metric labels (sketched after this list)
- Use histograms for latency distribution instead of individual timers per request
- Implement sampling to reduce trace volume while maintaining statistical significance
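A minimal sketch of the labeling split described above (meter and tracer setup as in the earlier workflows): bounded values go on the metric, unbounded identifiers go on the span.
const { trace, metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('my-service', '1.0.0');
const requestCounter = meter.createCounter('http.server.requests');

function recordRequest(route, statusCode, userId, requestId) {
  // Low-cardinality metric labels: a bounded set of routes and status codes.
  requestCounter.add(1, { 'http.route': route, 'http.status_code': statusCode });

  // High-cardinality identifiers belong on the span, where they help debugging
  // without multiplying time series.
  const span = trace.getActiveSpan();
  if (span) {
    span.setAttributes({ 'user.id': userId, 'request.id': requestId });
  }
}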
Principle 3: Semantic Conventions Enable Tool Interoperability
OpenTelemetry defines semantic conventions for common attributes (http.method, db.system, messaging.destination). Following conventions ensures your telemetry works with all backends (Jaeger, Zipkin, Grafana, Prometheus) without custom transforms.
In practice:
- Use SemanticAttributes constants from OpenTelemetry SDK, not custom strings
- Follow naming patterns (http., db., messaging., rpc.) for standard operations
- Document custom attributes with namespace prefixes such as myapp.order.priority (see the sketch after this list)
- Validate semantic convention compliance in code reviews
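A brief sketch of mixing standard conventions with namespaced custom attributes (span as in the earlier workflows; the myapp.* keys are illustrative):
const { SemanticAttributes } = require('@opentelemetry/semantic-conventions');

span.setAttributes({
  // Standard conventions: recognized by every OTel-compatible backend.
  [SemanticAttributes.HTTP_METHOD]: 'POST',
  [SemanticAttributes.HTTP_ROUTE]: '/api/orders',
  // Custom attributes: prefixed with an application namespace.
  'myapp.order.priority': 'high',
  'myapp.order.channel': 'web',
});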
Common Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| 100% Sampling in Production | Capturing every trace creates massive backend load, storage costs exploding to hundreds of GB/day, and query performance degradation from index bloat. | Use TraceIdRatioBasedSampler (5-10% for production). Implement parent-based sampling to preserve complete traces. Use tail sampling for error-biased retention. |
| High-Cardinality Span Names | Using unique IDs in span names (GET /users/12345 instead of GET /users/:id) creates millions of unique operations, breaking trace aggregation and dashboards. | Use generic operation names with placeholders. Put dynamic values in span attributes. Follow semantic conventions for HTTP routes (http.route: /users/:id). |
| Forgetting to End Spans | Unclosed spans accumulate in memory, causing memory leaks, inaccurate latency measurements, and spans never exported to backend. | Always use try/finally blocks to ensure span.end() is called. Use context managers (Python with) or defer (Go) for automatic cleanup. |
| Logging Without Trace Correlation | Logs and traces live in separate systems with no correlation, forcing manual detective work to connect error logs to slow traces. | Inject trace_id and span_id into structured log fields. Use OpenTelemetry LogRecordProcessor to auto-correlate. Configure backend (Grafana) to link logs to traces. |
| No Metric Export Validation | Metrics are collected but never exported due to misconfigured endpoint, network issues, or authentication failures. Silent failures leave blind spots. | Implement health checks that verify metric export success. Monitor OTLP exporter metrics (exported count, failed count). Test export in staging environments. |
Conclusion
The OpenTelemetry Observability skill provides a comprehensive framework for implementing production-grade distributed tracing, metrics, and log correlation across microservices architectures. By mastering the three core principles of context propagation, cardinality control, and semantic conventions, you ensure that your observability infrastructure is both powerful and sustainable at scale.
The workflows demonstrate the complete lifecycle from auto-instrumentation setup to custom span creation, metrics collection, and advanced sampling strategies. The emphasis on W3C Trace Context propagation ensures trace continuity across polyglot services, while the semantic conventions guarantee interoperability with all major observability backends. The anti-patterns table serves as a critical reference to avoid common pitfalls that lead to metric explosions, memory leaks, and unactionable telemetry.
This skill is particularly valuable when debugging production issues across distributed systems, implementing SLO-based alerting, or migrating from proprietary APM solutions to vendor-neutral OpenTelemetry. Whether you're instrumenting a Node.js microservice, a Python Flask API, or a complex event-driven architecture with message queues, the patterns and best practices documented here provide a solid foundation. Combined with backend setup guides (Jaeger, Prometheus, Grafana Tempo) and troubleshooting references, you have everything needed to build observable systems that provide actionable insights when incidents occur.