Distributed Tracing: Stop Guessing Why Your API Takes 4 Seconds ⏱️
True story: It was a Monday morning. Our checkout API had mysteriously jumped from 200ms to 4,100ms over the weekend. No deploys. No config changes. Nothing in the logs.
Me: "Maybe the database?" Database metrics: π Perfectly fine.
Me: "Maybe the payment service?" Payment service logs: π Also fine.
Me: "Maybe it's... DNS? Aliens? Mercury in retrograde?" Senior Engineer: "Do you have distributed tracing?" Me: "We have... logs? And CloudWatch?" Senior Engineer: "..."
Six hours of console.log archaeology later, we found it: a new product recommendations microservice was doing 87 database queries for a single checkout request. It had been deployed at 2 AM Saturday by a junior dev who "just added one feature."
Without distributed tracing, finding that took 6 hours. With it, it would have taken 6 seconds.
The Problem: Logs Are Lies (Sort Of) 🤥
When a single user request flows through your system like this:
User hits /checkout
      │
      ▼
┌────────────┐      ┌─────────────┐      ┌────────────────┐
│   API GW   │─────▶│  Order Svc  │─────▶│ Inventory Svc  │
└────────────┘      └──────┬──────┘      └────────────────┘
                           │
                    ┌──────┴──────┐
                    ▼             ▼
              ┌───────────┐  ┌────────────┐
              │  Payment  │  │  Product   │   ← here be dragons
              │    Svc    │  │  Rec. Svc  │
              └───────────┘  └────────────┘
Your logs across 5 services look like this:
[order-service] INFO: Processing order for user 42 [time: 12:00:00.100]
[payment-service] INFO: Charging user 42 $99.99 [time: 12:00:00.215]
[inventory-service] INFO: Reserving SKU-777 [time: 12:00:00.220]
[product-rec-svc] INFO: Fetching recommendations [time: 12:00:00.225]
[order-service] INFO: Order 9921 confirmed [time: 12:00:04.180]
Which service caused the 4-second delay? The logs tell you "something happened between 12:00:00.225 and 12:00:04.180", which is basically useless. You'd need to grep across 5 separate log streams, correlate timestamps manually, and pray the clocks are in sync.
This is distributed systems debugging without tracing. It's like trying to solve a murder mystery where each witness only knows their own line.
Enter Distributed Tracing: A Map of Every Request 🗺️
Distributed tracing gives every request a unique Trace ID that follows it through every service. Each unit of work (a service call, a DB query, an HTTP request) becomes a Span. Spans know their parent, so you end up with a complete tree:
Trace ID: abc-123-xyz
│
└─ [Span] API Gateway               0ms ██████████████████████ 4,100ms
   │
   └─ [Span] Order Service          5ms █████████████████████  4,090ms
      │
      ├─ [Span] DB: INSERT orders            5ms █ 12ms
      │
      ├─ [Span] Payment Service             15ms ████ 85ms
      │  └─ [Span] Stripe API call          20ms ███ 80ms
      │
      ├─ [Span] Inventory Service           90ms ███ 110ms
      │  └─ [Span] DB: UPDATE inventory     92ms █ 108ms
      │
      └─ [Span] Product Rec. Service       115ms ████████████████ 4,080ms   ← 🚨
         ├─ [Span] DB: SELECT products     116ms █ 118ms
         ├─ [Span] DB: SELECT products     119ms █ 121ms   ← N+1 begins
         ├─ [Span] DB: SELECT products     122ms █ 124ms
         │     ... (84 more queries) ...
         └─ [Span] DB: SELECT products   3,990ms █ 3,995ms
The culprit is instantly obvious. 87 database queries. 3,965ms wasted. Found in 6 seconds, not 6 hours.
That's distributed tracing.
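Under the hood, a span is just a small record, and the parent pointer is what builds that tree. A rough sketch of what one span might carry (field names simplified for illustration; the exact shape depends on your SDK and backend):

// One span, conceptually - parentSpanId is what links the tree together
const exampleSpan = {
  traceId: 'abc-123-xyz',      // shared by every span in this request
  spanId: 'span-0042',         // unique to this unit of work
  parentSpanId: 'span-0007',   // the span that called us (absent for the root)
  name: 'DB: SELECT products',
  startTime: '12:00:00.116',
  endTime: '12:00:00.118',
  attributes: { 'db.system': 'mysql' },
};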
OpenTelemetry: The One Standard to Rule Them All
For years, every tracing tool had its own SDK. Jaeger SDK, Zipkin SDK, Datadog SDK. Switching vendors meant rewriting all your instrumentation. Then came OpenTelemetry (OTel): a vendor-neutral open standard for traces, metrics, and logs.
Your App (OTel SDK) → OTel Collector → Jaeger / Datadog / Honeycomb / Grafana Tempo
Write your instrumentation once. Change backends without touching your code. Beautiful.
Setting It Up in Node.js 🛠️
Here's the setup I now use in every Node.js service of our e-commerce backend:
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http
// tracing.js - load this FIRST, before everything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const sdk = new NodeSDK({
serviceName: 'order-service', // Shows up in your trace UI!
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instruments: Express, HTTP, MySQL, Redis, etc.
// Zero code changes needed for most frameworks!
'@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
}),
],
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch(err => console.error('Error terminating tracing', err))
.finally(() => process.exit(0));
});
// app.js - load tracing BEFORE express!
require('./tracing'); // ← must be the first line
const express = require('express');
const app = express();
// Your existing routes work as-is. Auto-instrumentation handles the rest.
app.get('/checkout', async (req, res) => {
// This entire handler is now automatically traced!
const order = await orderService.create(req.body);
res.json(order);
});
That's it. Auto-instrumentation captures Express routes, outgoing HTTP calls, and database queries (MySQL, PostgreSQL, MongoDB, Redis), all with zero changes to your existing code.
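Before wiring up a collector, it can help to sanity-check that spans are actually being produced. A quick dev-only variation of tracing.js (my habit, not a requirement) swaps the OTLP exporter for the console exporter so every finished span prints to stdout:

// tracing.dev.js - print spans to stdout instead of shipping them anywhere
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-base');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new ConsoleSpanExporter(), // every finished span is logged locally
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Hit any endpoint and you should see span objects scroll by. If you don't, the instrumentation isn't loading (see Mistake #1 below).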
Adding Custom Spans for Business Logic 🎯
Auto-instrumentation is great for infrastructure. But sometimes you need to trace business logic:
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service');
async function processCheckout(cartId, userId) {
// Create a custom span for this business operation
return tracer.startActiveSpan('checkout.process', async (span) => {
try {
// Add metadata that shows up in your trace explorer!
span.setAttributes({
'checkout.cart_id': cartId,
'checkout.user_id': userId,
'checkout.currency': 'USD',
});
const cart = await cartService.get(cartId);
span.setAttribute('checkout.item_count', cart.items.length);
span.setAttribute('checkout.total', cart.total);
// Child spans are automatically nested
const payment = await processPayment(cart);
span.setAttribute('checkout.payment_id', payment.id);
span.setStatus({ code: SpanStatusCode.OK });
return { orderId: payment.orderId };
} catch (error) {
// Mark span as failed β shows up in red in your UI!
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end(); // Always end your spans!
}
});
}
The result in Jaeger UI:
checkout.process [cart_id=cart_77, user_id=42, item_count=3, total=99.99]
├─ cartService.get           12ms
├─ processPayment            68ms
│  ├─ validateCard            5ms
│  └─ stripe.charge          60ms
└─ orderService.create        8ms
   └─ DB: INSERT INTO orders  6ms
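Attributes describe the whole span, but for things that happen at a point in time (a retry, a cache miss, a validation failure) you can also attach span events: timestamped annotations that land on the span's timeline. A small sketch inside any active span (the event name and fields here are made up for illustration):

// Record a timestamped event on the current span
span.addEvent('payment.retry', {
  'retry.attempt': 2,
  'retry.reason': 'gateway_timeout',
});

In Jaeger or Tempo the event shows up as a marker on the span, so you can see exactly when the retry happened relative to that 60ms stripe.charge call.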
Propagating Context Across Services
The magic of distributed tracing is that the Trace ID follows the request from service to service. OpenTelemetry handles this automatically via HTTP headers:
Service A → HTTP request to Service B:

traceparent: 00-abc123xyz-def456-01
             ^^ ^^^^^^^^^ ^^^^^^ ^^
          version  trace-id  span-id  flags

(IDs shortened for readability; a real trace-id is 32 hex characters and a span-id is 16.)
If you're making HTTP calls between services, the OTel HTTP instrumentation injects these headers automatically. But if you use message queues (SQS, RabbitMQ), you need to propagate manually:
const { propagation, context, trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service');
// Publishing to SQS - inject trace context into message attributes
async function publishOrderEvent(order) {
const carrier = {};
propagation.inject(context.active(), carrier); // Add trace headers to carrier
await sqs.sendMessage({
QueueUrl: process.env.ORDER_QUEUE_URL,
MessageBody: JSON.stringify(order),
MessageAttributes: {
// Propagate tracing context in message attributes!
traceparent: { DataType: 'String', StringValue: carrier.traceparent || '' },
tracestate: { DataType: 'String', StringValue: carrier.tracestate || '' },
},
}).promise();
}
// Consuming from SQS - extract trace context to continue the trace
async function handleOrderMessage(message) {
const carrier = {
traceparent: message.MessageAttributes?.traceparent?.StringValue,
tracestate: message.MessageAttributes?.tracestate?.StringValue,
};
const ctx = propagation.extract(context.active(), carrier);
// Now run handler within the extracted context
return context.with(ctx, async () => {
return tracer.startActiveSpan('order.process', async (span) => {
try {
await processOrder(JSON.parse(message.Body));
span.setStatus({ code: SpanStatusCode.OK });
} finally {
span.end();
}
});
});
}
This is how that 4-second checkout bug appeared as one unified trace across 5 services: the SQS message carried the Trace ID from the order service all the way into the product recommendation service.
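The same inject/extract dance works for RabbitMQ, Kafka, or any transport that lets you attach headers; only the place you stash the carrier changes. A rough sketch for RabbitMQ with amqplib, reusing the propagation and context imports above and assuming channel is an already-open channel (there is also an amqplib auto-instrumentation in the OTel contrib packages that can do this for you):

// Publishing - trace context rides along as AMQP message headers
async function publishOrderEventAmqp(channel, order) {
  const carrier = {};
  propagation.inject(context.active(), carrier);
  channel.sendToQueue('orders', Buffer.from(JSON.stringify(order)), {
    headers: carrier, // traceparent / tracestate end up here
  });
}

// Consuming - extract the context from the headers and continue the trace
channel.consume('orders', (msg) => {
  const ctx = propagation.extract(context.active(), msg.properties.headers || {});
  return context.with(ctx, () => processOrder(JSON.parse(msg.content.toString())));
});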
The Production Setup (Docker Compose) 🐳
Running Jaeger locally for development:
# docker-compose.yml
services:
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686" # Jaeger UI
- "4318:4318" # OTLP HTTP receiver
environment:
COLLECTOR_OTLP_ENABLED: "true"
order-service:
build: ./order-service
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4318/v1/traces"
OTEL_SERVICE_NAME: "order-service"
depends_on:
- jaeger
product-rec-service:
build: ./product-rec-service
environment:
OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4318/v1/traces"
OTEL_SERVICE_NAME: "product-rec-service"
depends_on:
- jaeger
Open http://localhost:16686 and you have a full distributed trace explorer. Click on any slow trace and see exactly which service, which query, which line of code is responsible.
In production, we ship traces to Grafana Tempo (free, open source) fronted by Grafana dashboards. You can also use Honeycomb, Datadog APM, AWS X-Ray, or Jaeger with persistent storage. OTel means you pick the backend without rewriting a thing.
Sampling: Don't Trace Everything 💸
Tracing every single request in high-traffic systems gets expensive fast.
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const sdk = new NodeSDK({
// Trace 10% of requests in production
sampler: new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.10), // 10% sample rate
}),
// ...
});
As a Technical Lead, I've learned: use tail-based sampling when possible. Trace 100% of errors and slow requests (>500ms), and only 1-5% of fast, successful requests. That way you never miss a problem, but you don't pay for tracing every happy-path request. True tail-based sampling, where the decision happens after the whole trace has been collected, is typically done in the OTel Collector's tail sampling processor; the custom sampler below only approximates the idea at the SDK level, since it can't see how the request ended.
// Custom sampler sketch: always trace known errors, sample the rest
// (head sampling only sees attributes available when the span starts,
//  so reliably catching every error really needs tail-based sampling)
const { SamplingDecision } = require('@opentelemetry/sdk-trace-base');

class SmartSampler {
  shouldSample(context, traceId, spanName, spanKind, attributes) {
    // Always sample if the request is already marked as an error
    if (attributes['http.status_code'] >= 500) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED };
    }
    // Otherwise 5% sample rate
    return Math.random() < 0.05
      ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
      : { decision: SamplingDecision.NOT_RECORD };
  }
  toString() {
    return 'SmartSampler';
  }
}
Common Mistakes I Made
Mistake #1: Forgetting require('./tracing') First
// ❌ Bad - Express loads before tracing, gets no instrumentation
const express = require('express');
require('./tracing'); // Too late! Express already loaded.

// ✅ Good - tracing initializes before any other require
require('./tracing');
const express = require('express');
A lesson that cost us a full day: I kept wondering why our traces had no spans, only to realize require('./tracing') sat on line 8 of app.js while require('express') was on line 3. Tracing must come first. Always.
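Another way to make this mistake impossible: don't require tracing from app code at all, and let Node preload it before anything else runs. Either of these works with the tracing.js above:

# Preload the tracing module before app.js (or any of its requires) loads
node --require ./tracing.js app.js

# Or via environment, without touching the start command
NODE_OPTIONS="--require ./tracing.js" node app.js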
Mistake #2: Not Ending Spans
// ❌ Span never ends - memory leak, incomplete trace
const span = tracer.startSpan('my-operation');
await doSomething();
// Forgot span.end()! Span stays "in progress" forever.

// ✅ Always use try/finally
const span = tracer.startSpan('my-operation');
try {
  await doSomething();
} finally {
  span.end(); // Always runs, even on exception!
}
Mistake #3: Adding Too Many Custom Spans
Tracing overhead is real. If you create a span for every function call, you'll end up with a 10,000-node trace tree that's impossible to read and expensive to store.
❌ Bad: trace every internal function
✅ Good: trace service boundaries, external calls, and "things that can be slow"
   (DB queries, HTTP calls, queue operations, 3rd-party APIs)
Trade-Offs ⚖️
Benefits of distributed tracing:
- ✅ Find performance bottlenecks in minutes, not hours
- ✅ Understand the full call chain of any request
- ✅ See which service causes cascading failures
- ✅ Debug race conditions and intermittent slowness
- ✅ Zero changes needed to existing code with auto-instrumentation
Costs:
- ⚠️ Storage for traces (use sampling to control costs)
- ⚠️ Small CPU/memory overhead per request (~1-3%)
- ⚠️ Context propagation across message queues requires manual work
- ⚠️ Learning curve for setting up the collector + backend
When NOT to add distributed tracing:
- Simple monolith with one service (use CloudWatch Logs + X-Ray insights)
- Lambda functions called in isolation (X-Ray is simpler)
- Prototype or early-stage app (instrument when you actually have slowness to debug)
When you MUST add it:
- 3+ services talking to each other
- You've ever said "I don't know which service is slow"
- You have SLA requirements (you can't fix what you can't see)
- Black Friday is coming (trust me on this one)
The Architecture View That Changes Everything 💡
Once you have tracing, you also get a service dependency map for free:
Auto-generated from traces:
[API Gateway]
     │
     ├───▶ [Order Service] ──▶ [PostgreSQL]
     │           │
     │           ├───▶ [Payment Service] ──▶ [Stripe API]
     │           │
     │           └───▶ [Product Rec. Svc] ──▶ [MySQL]   ← 87 queries 🚨
     │
     └───▶ [User Service] ──▶ [Redis Cache]
That map updates automatically as your system evolves. No Confluence wiki needed. The traces are the documentation.
TL;DR
- Distributed tracing = unique Trace ID follows every request through all services
- Spans = individual units of work (DB query, HTTP call, service handler)
- OpenTelemetry = vendor-neutral standard, instrument once, use any backend
- Auto-instrumentation = traces Express, MySQL, Redis, HTTP calls with zero code changes
- Always propagate context in message queues (SQS, Kafka): it doesn't happen automatically
- Sample intelligently: 100% of errors, 5-10% of normal requests
- The checkout API that took 4 seconds? 87 N+1 queries in a service I didn't even know existed. Found in 6 seconds with tracing. Would've cost us another sprint without it.
As a Technical Lead, I've learned: metrics tell you something is wrong, logs tell you what happened, but traces tell you exactly where and why. You need all three, but if I had to pick one to add to a microservices system today, it'd be tracing. Every time.
Your users don't care which of your 8 services is slow. They just know checkout is broken. Tracing is how you care about the right thing at 2 AM.
Dealing with mysterious slow requests? Connect with me on LinkedIn. I've spent enough time staring at CloudWatch logs to fill a book, and I'd rather talk tracing.
Want to see the full OTel setup? Check out my GitHub for the complete production-ready tracing configuration I use across our microservices.
Trace everything. Guess nothing. ✨
P.S. The 87-query bug was fixed in 20 minutes once we found it: one missing with(['recommendations']) eager load in Laravel. Six hours of debugging. Twenty minutes of fixing. Distributed tracing would have made it six minutes of debugging. Add it now, before your next 2 AM incident.