
Distributed Tracing: Stop Guessing Why Your API Takes 4 Seconds πŸ”β±οΈ

12 min read

True story: It was a Monday morning. Our checkout API had mysteriously jumped from 200ms to 4,100ms over the weekend. No deploys. No config changes. Nothing in the logs.

Me: "Maybe the database?" Database metrics: πŸ“Š Perfectly fine.

Me: "Maybe the payment service?" Payment service logs: πŸ“„ Also fine.

Me: "Maybe it's... DNS? Aliens? Mercury in retrograde?" Senior Engineer: "Do you have distributed tracing?" Me: "We have... logs? And CloudWatch?" Senior Engineer: "..."

Six hours of console.log archaeology later, we found it: a new product recommendations microservice was doing 87 database queries for a single checkout request. It had been deployed at 2 AM Saturday by a junior dev who "just added one feature."

Without distributed tracing, finding that took 6 hours. With it, it would have taken 6 seconds.

The Problem: Logs Are Lies (Sort Of) πŸ€₯

When a single user request flows through your system like this:

User hits /checkout
     β”‚
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  API GW    │────▢│  Order Svc  │────▢│ Inventory Svcβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                          β”‚
                    β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”
                    β–Ό            β–Ό
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
             β”‚ Payment β”‚  β”‚  Product  β”‚ ← 😈 here be dragons
             β”‚   Svc   β”‚  β”‚ Rec. Svc  β”‚
             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Your logs across 5 services look like this:

[order-service]    INFO: Processing order for user 42  [time: 12:00:00.100]
[payment-service]  INFO: Charging user 42 $99.99       [time: 12:00:00.215]
[inventory-service] INFO: Reserving SKU-777            [time: 12:00:00.220]
[product-rec-svc]  INFO: Fetching recommendations      [time: 12:00:00.225]
[order-service]    INFO: Order 9921 confirmed           [time: 12:00:04.180]

Which service caused the 4-second delay? The logs tell you "something happened between 12:00:00.225 and 12:00:04.180" β€” which is basically useless. You'd need to grep across 5 separate log streams, correlate timestamps manually, and pray the clocks are in sync.

This is distributed systems debugging without tracing. It's like trying to solve a murder mystery where each witness only knows their own line.

Enter Distributed Tracing: A Map of Every Request πŸ—ΊοΈ

Distributed tracing gives every request a unique Trace ID that follows it through every service. Each unit of work β€” a service call, a DB query, an HTTP request β€” becomes a Span. Spans know their parent, so you end up with a complete tree:

Trace ID: abc-123-xyz
β”‚
β”œβ”€ [Span] API Gateway                       0ms ──────────────────────── 4,100ms
β”‚  β”‚
β”‚  β”œβ”€ [Span] Order Service                  5ms ──── 4,090ms
β”‚  β”‚  β”‚
β”‚  β”‚  β”œβ”€ [Span] DB: INSERT orders           5ms ─ 12ms
β”‚  β”‚  β”‚
β”‚  β”‚  β”œβ”€ [Span] Payment Service             15ms ──── 85ms
β”‚  β”‚  β”‚  └─ [Span] Stripe API call          20ms ─── 80ms
β”‚  β”‚  β”‚
β”‚  β”‚  β”œβ”€ [Span] Inventory Service           90ms ─── 110ms
β”‚  β”‚  β”‚  └─ [Span] DB: UPDATE inventory    92ms ─ 108ms
β”‚  β”‚  β”‚
β”‚  β”‚  └─ [Span] Product Rec. Service        115ms ──────────────── 4,080ms ← 🚨
β”‚  β”‚     β”œβ”€ [Span] DB: SELECT products      116ms ─ 118ms
β”‚  β”‚     β”œβ”€ [Span] DB: SELECT products      119ms ─ 121ms  ← N+1 begins
β”‚  β”‚     β”œβ”€ [Span] DB: SELECT products      122ms ─ 124ms
β”‚  β”‚     β”‚   ... (84 more queries) ...
β”‚  β”‚     └─ [Span] DB: SELECT products      3,990ms ─ 3,995ms

The culprit is obvious instantly. 87 database queries. 3,965ms wasted. Found in 6 seconds, not 6 hours.

That's distributed tracing.
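
Under the hood, a trace is just a flat list of span records; the tree view is stitched together from parent IDs. Here's a toy model (my own sketch, not the OTel wire format) of how a trace UI attributes time to each span, using numbers from the trace above:

```javascript
// Toy span records: each span knows its parent, so a flat list can be
// reassembled into the tree you see in Jaeger.
const spans = [
    { id: 'a', parent: null, name: 'API Gateway',      start: 0,   end: 4100 },
    { id: 'b', parent: 'a',  name: 'Order Service',    start: 5,   end: 4090 },
    { id: 'c', parent: 'b',  name: 'Product Rec. Svc', start: 115, end: 4080 },
];

// "Self time" = a span's duration minus the time covered by its direct
// children; this is how a trace UI shows where time is actually spent.
function selfTime(span) {
    const childTime = spans
        .filter(s => s.parent === span.id)
        .reduce((sum, s) => sum + (s.end - s.start), 0);
    return (span.end - span.start) - childTime;
}

// Product Rec. Svc has no children here, so all 3,965ms is its own
console.log(spans.map(s => `${s.name}: ${selfTime(s)}ms self`));
```

The gateway and order service each spend only milliseconds of their own time; nearly all of the 4.1 seconds belongs to the recommendation service's span.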

OpenTelemetry: The One Standard to Rule Them All πŸ§™

For years, every tracing tool had its own SDK. Jaeger SDK, Zipkin SDK, Datadog SDK β€” switching vendors meant rewriting all your instrumentation. Then came OpenTelemetry (OTel): a vendor-neutral open standard for traces, metrics, and logs.

Your App (OTel SDK)  β†’  OTel Collector  β†’  Jaeger / Datadog / Honeycomb / Grafana Tempo

Write your instrumentation once. Change backends without touching your code. Beautiful.

Setting It Up in Node.js πŸ› οΈ

Here's the setup I now use in every Node.js service of our e-commerce backend:

npm install @opentelemetry/sdk-node \
            @opentelemetry/auto-instrumentations-node \
            @opentelemetry/exporter-trace-otlp-http

// tracing.js β€” load this FIRST before everything else
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
    serviceName: 'order-service',  // Shows up in your trace UI!
    traceExporter: new OTLPTraceExporter({
        url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://otel-collector:4318/v1/traces',
    }),
    instrumentations: [
        getNodeAutoInstrumentations({
            // Auto-instruments: Express, HTTP, MySQL, Redis, etc.
            // Zero code changes needed for most frameworks!
            '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
        }),
    ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
    sdk.shutdown()
        .then(() => console.log('Tracing terminated'))
        .catch(err => console.error('Error terminating tracing', err))
        .finally(() => process.exit(0));
});

// app.js β€” load tracing BEFORE express!
require('./tracing');  // ← must be first line

const express = require('express');
const app = express();

// Your existing routes work as-is. Auto-instrumentation handles the rest.
app.get('/checkout', async (req, res) => {
    // This entire handler is now automatically traced! βœ…
    const order = await orderService.create(req.body);
    res.json(order);
});

That's it. Auto-instrumentation captures Express routes, outgoing HTTP calls, database queries (MySQL, PostgreSQL, MongoDB, Redis) β€” all with zero changes to your existing code.

Adding Custom Spans for Business Logic 🎯

Auto-instrumentation is great for infrastructure. But sometimes you need to trace business logic:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

async function processCheckout(cartId, userId) {
    // Create a custom span for this business operation
    return tracer.startActiveSpan('checkout.process', async (span) => {
        try {
            // Add metadata that shows up in your trace explorer!
            span.setAttributes({
                'checkout.cart_id': cartId,
                'checkout.user_id': userId,
                'checkout.currency': 'USD',
            });

            const cart = await cartService.get(cartId);
            span.setAttribute('checkout.item_count', cart.items.length);
            span.setAttribute('checkout.total', cart.total);

            // Child spans are automatically nested
            const payment = await processPayment(cart);
            span.setAttribute('checkout.payment_id', payment.id);

            span.setStatus({ code: SpanStatusCode.OK });
            return { orderId: payment.orderId };

        } catch (error) {
            // Mark span as failed β€” shows up in red in your UI!
            span.setStatus({
                code: SpanStatusCode.ERROR,
                message: error.message,
            });
            span.recordException(error);
            throw error;
        } finally {
            span.end(); // Always end your spans!
        }
    });
}

The result in Jaeger UI:

checkout.process [cart_id=cart_77, user_id=42, item_count=3, total=99.99]
β”œβ”€ cartService.get                    12ms
β”œβ”€ processPayment                     68ms
β”‚  β”œβ”€ validateCard                    5ms
β”‚  └─ stripe.charge                  60ms
└─ orderService.create               8ms
   └─ DB: INSERT INTO orders         6ms

Propagating Context Across Services πŸ”—

The magic of distributed tracing is that the Trace ID follows the request from service to service. OpenTelemetry handles this automatically via the W3C traceparent HTTP header:

Service A β†’ HTTP request to Service B:
  traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
               β”‚  β”‚                                β”‚                β”‚
               β”‚  └── trace-id (32 hex chars)      └── span-id      └── flags (01 = sampled)
               └── version                             (16 hex chars)
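
That header is just four dash-separated fields. A minimal parser (my own sketch, using the example IDs from the W3C Trace Context spec) makes the layout concrete:

```javascript
// Minimal W3C traceparent parser:
//   version "-" trace-id (32 hex) "-" parent span-id (16 hex) "-" flags (2 hex)
function parseTraceparent(header) {
    const [version, traceId, spanId, flags] = header.split('-');
    if (traceId?.length !== 32 || spanId?.length !== 16) {
        throw new Error(`malformed traceparent: ${header}`);
    }
    // Lowest flag bit means "upstream sampled this trace"
    return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

parseTraceparent('00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
// β†’ sampled: true, traceId: '0af7651916cd43dd8448eb211c80319c'
```

In practice you never parse this yourself; the OTel SDK does it on every incoming request. But it's useful to know the header is plain text you can read straight out of an access log.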

If you're making HTTP calls between services, the OTel HTTP instrumentation injects these headers automatically. But if you use message queues (SQS, RabbitMQ), you need to propagate manually:

const { propagation, context } = require('@opentelemetry/api');

// Publishing to SQS β€” inject trace context into message attributes
async function publishOrderEvent(order) {
    const carrier = {};
    propagation.inject(context.active(), carrier);  // Add trace headers to carrier

    await sqs.sendMessage({
        QueueUrl: process.env.ORDER_QUEUE_URL,
        MessageBody: JSON.stringify(order),
        MessageAttributes: {
            // Propagate tracing context in message attributes!
            traceparent: { DataType: 'String', StringValue: carrier.traceparent || '' },
            tracestate:  { DataType: 'String', StringValue: carrier.tracestate || '' },
        },
    }).promise();
}

// Consuming from SQS β€” extract trace context to continue the trace
async function handleOrderMessage(message) {
    const carrier = {
        traceparent: message.MessageAttributes?.traceparent?.StringValue,
        tracestate: message.MessageAttributes?.tracestate?.StringValue,
    };

    const ctx = propagation.extract(context.active(), carrier);

    // Now run handler within the extracted context
    return context.with(ctx, async () => {
        return tracer.startActiveSpan('order.process', async (span) => {
            try {
                await processOrder(JSON.parse(message.Body));
                span.setStatus({ code: SpanStatusCode.OK });
            } finally {
                span.end();
            }
        });
    });
}

This is how that 4-second checkout bug appeared as one unified trace across 5 services β€” the SQS message carried the Trace ID from the order service all the way into the product recommendation service.

The Production Setup (Docker Compose) 🐳

Running Jaeger locally for development:

# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # Jaeger UI
      - "4318:4318"    # OTLP HTTP receiver
    environment:
      COLLECTOR_OTLP_ENABLED: "true"

  order-service:
    build: ./order-service
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4318/v1/traces"
      OTEL_SERVICE_NAME: "order-service"
    depends_on:
      - jaeger

  product-rec-service:
    build: ./product-rec-service
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: "http://jaeger:4318/v1/traces"
      OTEL_SERVICE_NAME: "product-rec-service"
    depends_on:
      - jaeger

Open http://localhost:16686 and you have a full distributed trace explorer. Click on any slow trace and see exactly which service, which query, which line of code is responsible.

In production, we ship traces to Grafana Tempo (free, open source) fronted by Grafana dashboards. You can also use Honeycomb, Datadog APM, AWS X-Ray, or Jaeger with persistent storage. OTel means you pick the backend without rewriting a thing.

Sampling: Don't Trace Everything πŸ’Έ

Tracing every single request in high-traffic systems gets expensive fast.

const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
    // Trace 10% of requests in production; child spans follow the
    // parent's decision so traces stay complete
    sampler: new ParentBasedSampler({
        root: new TraceIdRatioBasedSampler(0.10), // 10% sample rate
    }),
    // ...
});

The rule I follow now: use tail-based sampling when possible. Trace 100% of errors and slow requests (>500ms), and only 1-5% of fast, successful requests. That way you never miss a problem, but you don't pay for tracing every happy-path request.
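
That policy lives in the collector, not your app. With the contrib collector distribution, the tail_sampling processor expresses it roughly like this (the 10s decision window and policy names here are illustrative):

```yaml
# otel-collector config fragment (the contrib distribution ships tail_sampling)
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans, decide once the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
```

Because the decision happens after the whole trace is buffered, the collector can see the error status and total latency that a head-based sampler in your app never has.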

// Custom sampler sketch: keep all server errors plus a 5% slice of the rest.
// Caveat: head-based samplers run when a span *starts*, so attributes like
// http.status_code are often not set yet; truly catching every error and
// slow request needs tail-based sampling in the collector.
const { SamplingDecision } = require('@opentelemetry/sdk-trace-base');

class SmartSampler {
    shouldSample(context, traceId, spanName, spanKind, attributes) {
        // Keep anything already marked as a server error
        if (attributes['http.status_code'] >= 500) {
            return { decision: SamplingDecision.RECORD_AND_SAMPLED };
        }
        // Otherwise keep a 5% sample
        return Math.random() < 0.05
            ? { decision: SamplingDecision.RECORD_AND_SAMPLED }
            : { decision: SamplingDecision.NOT_RECORD };
    }
}

Common Mistakes I Made πŸ˜…

Mistake #1: Forgetting require('./tracing') First

// ❌ Bad β€” Express loads before tracing, gets no instrumentation
const express = require('express');
require('./tracing');  // Too late! Express already loaded.

// βœ… Good β€” tracing initializes before any other require
require('./tracing');
const express = require('express');

This one cost us a full day: I kept wondering why our traces had no spans, until I noticed require('./tracing') sat on line 8 of app.js while require('express') sat on line 3. Tracing must come first. Always. (You can also sidestep the ordering problem entirely by preloading it: node --require ./tracing.js app.js.)

Mistake #2: Not Ending Spans

// ❌ Span never ends β€” memory leak, incomplete trace
const span = tracer.startSpan('my-operation');
await doSomething();
// Forgot span.end()! Span stays "in progress" forever.

// βœ… Always use try/finally
const span = tracer.startSpan('my-operation');
try {
    await doSomething();
} finally {
    span.end(); // Always runs, even on exception!
}

Mistake #3: Adding Too Many Custom Spans

Tracing overhead is real. If you create a span for every function call, you'll end up with a 10,000-node trace tree that's impossible to read and expensive to store.

❌ Bad: trace every internal function
βœ… Good: trace service boundaries, external calls, and "things that can be slow"
         (DB queries, HTTP calls, queue operations, 3rd-party APIs)

Trade-Offs βš–οΈ

Benefits of distributed tracing:

  • βœ… Find performance bottlenecks in minutes, not hours
  • βœ… Understand the full call chain of any request
  • βœ… See which service causes cascading failures
  • βœ… Debug race conditions and intermittent slowness
  • βœ… Zero changes needed to existing code with auto-instrumentation

Costs:

  • ⚠️ Storage for traces (use sampling to control costs)
  • ⚠️ Small CPU/memory overhead per request (~1-3%)
  • ⚠️ Context propagation across message queues requires manual work
  • ⚠️ Learning curve for setting up the collector + backend

When NOT to add distributed tracing:

  • Simple monolith with one service (use CloudWatch Logs + X-Ray insights)
  • Lambda functions called in isolation (X-Ray is simpler)
  • Prototype or early-stage app (instrument when you actually have slowness to debug)

When you MUST add it:

  • 3+ services talking to each other
  • You've ever said "I don't know which service is slow"
  • You have SLA requirements (you can't fix what you can't see)
  • Black Friday is coming (trust me on this one)

The Architecture View That Changes Everything πŸ’‘

Once you have tracing, you also get a service dependency map for free:

Auto-generated from traces:

[API Gateway]
     β”‚
     β”œβ”€β”€β–Ά [Order Service]  ──▢ [PostgreSQL]
     β”‚         β”‚
     β”‚         β”œβ”€β”€β–Ά [Payment Service]  ──▢ [Stripe API]
     β”‚         β”‚
     β”‚         └──▢ [Product Rec. Svc] ──▢ [MySQL] ← 87 queries 🚨
     β”‚
     └──▢ [User Service]  ──▢ [Redis Cache]

That map updates automatically as your system evolves. No Confluence wiki needed. The traces are the documentation.

TL;DR 🏁

  • Distributed tracing = unique Trace ID follows every request through all services
  • Spans = individual units of work (DB query, HTTP call, service handler)
  • OpenTelemetry = vendor-neutral standard, instrument once, use any backend
  • Auto-instrumentation = traces Express, MySQL, Redis, HTTP calls with zero code changes
  • Always propagate context in message queues (SQS, Kafka) β€” it doesn't happen automatically
  • Sample intelligently β€” 100% of errors, 5-10% of normal requests
  • The checkout API that took 4 seconds? 87 N+1 queries in a service I didn't even know existed. Found in 6 seconds with tracing. Would've cost us another sprint without it.

As a Technical Lead, I've learned: metrics tell you something is wrong, logs tell you what happened, but traces tell you exactly where and why. You need all three β€” but if I had to pick one to add to a microservices system today, it'd be tracing. Every time.

Your users don't care which of your 8 services is slow. They just know checkout is broken. Tracing is how you care about the right thing at 2 AM.


Dealing with mysterious slow requests? Connect with me on LinkedIn β€” I've spent enough time staring at CloudWatch logs to fill a book, and I'd rather talk tracing.

Want to see the full OTel setup? Check out my GitHub for the complete production-ready tracing configuration I use across our microservices.

Trace everything. Guess nothing. πŸ”βœ¨


P.S. The 87-query bug was fixed in 20 minutes once we found it β€” one missing with(['recommendations']) eager load in Laravel. Six hours of debugging. Twenty minutes of fixing. Distributed tracing would have made it six minutes of debugging. Add it now, before your next 2 AM incident. πŸ™