0x55aa
Back to Blog

ðŸŠĶ Dead Letter Queues: Stop Losing Messages Forever

|
6 min read

Somewhere in your production system right now, a message is dying quietly.

No error. No alert. No retry. Just... gone. A job that was supposed to send a confirmation email, process a payment webhook, or update a third-party CRM — vanished into the void because your consumer threw an unhandled exception and your queue said "cool, noted, discarding."

This is the message queue equivalent of a black hole: things go in, nothing comes out, and you only notice when a customer emails asking why their order never shipped.

The fix is embarrassingly simple, and it has a wonderfully morbid name: the Dead Letter Queue.

What Even Is a Dead Letter Queue?

A Dead Letter Queue (DLQ) is a holding area for messages that couldn't be processed successfully. Think of it as a quarantine ward — messages that failed don't get deleted, they get routed to a secondary queue where you can inspect them, fix the root cause, and reprocess them later.

Every serious message broker supports this concept:

  • RabbitMQ calls them dead letter exchanges
  • AWS SQS has a native DLQ feature you configure per queue
  • Kafka typically uses a dedicated error topic (by convention)
  • Azure Service Bus has built-in dead lettering per subscription

The pattern is the same everywhere: fail enough times → move to the DLQ → humans investigate → fix → replay.

Why Your "Just Retry It" Strategy Is Failing You

I've seen this pattern at Cubet more times than I can count. A team wires up a queue, adds a retry loop, and calls it a day:

// The optimistic approach (don't do this alone)
channel.consume('orders', async (msg) => {
  try {
    await processOrder(JSON.parse(msg.content.toString()));
    channel.ack(msg);
  } catch (err) {
    console.error('Failed, retrying...', err);
    channel.nack(msg, false, true); // requeue: true
  }
});

Looks reasonable. But here's what actually happens at 2 AM:

  1. A malformed message arrives — JSON parses fine but has a missing customerId field.
  2. processOrder throws a validation error.
  3. The message gets requeued.
  4. Consumer picks it up again immediately.
  5. Same error. Requeue.
  6. Repeat forever, at full speed, consuming 100% CPU, blocking all legitimate messages behind it.

This is called a poison message, and it will ruin your on-call engineer's weekend. Infinite retry loops without backoff or limits are not a retry strategy — they're a DoS attack you wrote yourself.

Building a Proper Dead Letter Setup

Here's a RabbitMQ setup with dead lettering done right:

const amqp = require('amqplib');

async function setupQueues(channel) {
  // 1. Declare the dead letter exchange and queue first
  await channel.assertExchange('orders.dlx', 'direct', { durable: true });
  await channel.assertQueue('orders.dead', {
    durable: true,
    arguments: {
      'x-message-ttl': 7 * 24 * 60 * 60 * 1000, // keep dead messages 7 days
    },
  });
  await channel.bindQueue('orders.dead', 'orders.dlx', 'orders');

  // 2. Declare the main queue, pointing failures to the DLX
  await channel.assertQueue('orders', {
    durable: true,
    arguments: {
      'x-dead-letter-exchange': 'orders.dlx',
      'x-dead-letter-routing-key': 'orders',
      'x-max-delivery': 3, // max 3 attempts before DLQ
    },
  });
}

async function consume(channel) {
  channel.prefetch(10);
  channel.consume('orders', async (msg) => {
    try {
      const order = JSON.parse(msg.content.toString());
      await processOrder(order);
      channel.ack(msg);
    } catch (err) {
      const deliveryCount = msg.properties.headers['x-death']?.length ?? 0;
      console.error(`Order processing failed (attempt ${deliveryCount + 1}):`, err.message);

      // nack WITHOUT requeue — let RabbitMQ handle retry via DLX policy
      channel.nack(msg, false, false);
    }
  });
}

The key move: nack(msg, false, false) — the third false tells RabbitMQ don't requeue this. RabbitMQ then routes it through the dead letter exchange instead, where your retry policy (set at the broker level, not in application code) decides what happens next.

This separation of concerns matters. Your application code shouldn't be counting retries — that's infrastructure's job.

The Part Everyone Skips: Actually Processing the DLQ

A DLQ you never read is just a slightly fancier /dev/null. The graveyard needs a caretaker.

Set up a separate worker that drains the DLQ on a schedule (or manually, for high-value messages):

async function drainDeadLetterQueue(channel, dryRun = false) {
  let processed = 0;

  while (true) {
    const msg = await channel.get('orders.dead', { noAck: false });
    if (!msg) break;

    const body = JSON.parse(msg.content.toString());
    const deathInfo = msg.properties.headers['x-death']?.[0];

    console.log({
      orderId: body.orderId,
      failedAt: deathInfo?.time,
      reason: deathInfo?.reason,
      originalQueue: deathInfo?.queue,
      attempts: deathInfo?.count,
    });

    if (!dryRun) {
      // Republish to the original queue for reprocessing
      channel.publish('', 'orders', msg.content, {
        persistent: true,
        headers: { 'x-replayed-from-dlq': true },
      });
      channel.ack(msg);
      processed++;
    } else {
      channel.nack(msg, false, true); // put back without processing
    }
  }

  console.log(`Replayed ${processed} messages from DLQ`);
}

The x-death header is RabbitMQ's gift to debugging — it tells you exactly why the message died, how many times it was attempted, and when each failure happened. Log this to your observability stack (Datadog, Grafana, whatever) and you'll have a forensic trail for every failure.

Operational Playbook

A few lessons learned from running this in production:

Set a TTL on your DLQ. Dead messages shouldn't live forever. 7 days gives you a window to investigate and replay; after that, archive or discard. Unbounded DLQs become multi-GB surprises.

Alert on DLQ depth, not just queue depth. If your orders.dead queue starts growing, something in your consumer is broken — you want to know now, not when a customer calls. A simple CloudWatch or Prometheus alert at dlq_depth > 10 is enough to catch most incidents.

Tag replayed messages. Adding x-replayed-from-dlq: true in the headers lets your consumer log or treat replayed messages differently. It's also useful for idempotency checks — if you're worried about duplicate processing, you can short-circuit on this header.

Never retry infinitely at the broker level either. Even with a DLQ in place, set x-max-delivery or an equivalent retry limit. The DLQ should be the final destination for genuinely broken messages, not a second infinite loop.

The Bigger Picture

A Dead Letter Queue is really about epistemic honesty. Without one, your system pretends failures don't exist — messages vanish and silence reads as success. With a DLQ, failures become visible artifacts you can reason about, measure, and fix.

At Cubet, the moment we wired up DLQ monitoring for our event-driven services, we discovered a whole category of silent failures that had been happening for weeks: a downstream CRM that occasionally returned 500s, causing webhook delivery to silently drop. The DLQ surfaced it. A 20-minute fix later, we replayed the backlog and everything caught up.

Before that? We would've only noticed when a client complained.

Your Next Move

  1. Check whether your current queue setup has a DLQ configured. If not, add one today — most brokers make it a 3-line config change.
  2. Add an alert on DLQ depth. Even a simple threshold alert is infinitely better than nothing.
  3. Build a replay script. Investigating dead messages without a way to replay them is just archaeology.

Your messages are trying to tell you something. The least you can do is build them a proper resting place and actually check it.

Thanks for reading!

Back to all posts