AWS SQS Dead Letter Queues: Stop Losing Messages at 3 AM
Storytime: Eighteen months into our serverless e-commerce backend, we deployed a "minor" update to our order processing Lambda. One line changed: an integer field that we started parsing as a string. The Lambda threw an exception, SQS retried the message, Lambda threw again, and with no Dead Letter Queue to catch the failures, SQS eventually... silently discarded the message.
40,000 messages. Gone. Orders never processed. No alert. No error page. Just confused customers emailing support asking where their orders were.
That was the week I got religion about Dead Letter Queues.
What Is SQS, Actually?
Amazon SQS is a fully managed message queue. When your services need to communicate asynchronously (and they should, for anything non-trivial), SQS sits in the middle as a buffer:
[Order Service] → [SQS Queue] → [Processing Lambda]
The beauty: Order Service doesn't wait for Processing Lambda to finish. It drops the message and moves on. Decoupled. Fast. Resilient.
The catch nobody warns you about: maxReceiveCount only exists as part of a redrive policy, and a redrive policy needs a Dead Letter Queue to point at. Without one, a failing message is retried each time its visibility timeout expires until the queue's retention period runs out (4 days by default), and then SQS deletes it. Default behavior: your message just disappears. No log. No alert. It's gone.
What's a Dead Letter Queue?
A Dead Letter Queue (DLQ) is just another SQS queue. When a message fails maxReceiveCount times, instead of being dropped into the void, it gets sent to your DLQ.
[SQS Queue] → [Lambda] → fails
      ↓ retry
   [Lambda] → fails
      ↓ retry
   [Lambda] → fails (maxReceiveCount hit)
      ↓
[Dead Letter Queue] ← message survives here
Now the message is preserved. You can inspect it, fix the bug, and replay it. This is the difference between a bad afternoon and a catastrophic data loss incident.
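To make the retry-then-park flow concrete, here is a toy simulation in plain Python. It is only a sketch of the behavior described above, not the SQS API: `SimulatedQueue`, `Message`, and `parse_order` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    receive_count: int = 0


@dataclass
class SimulatedQueue:
    """Toy model of a source queue with a redrive policy (not the real SQS API)."""
    max_receive_count: int
    messages: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)

    def deliver_all(self, handler) -> None:
        """Deliver until every message is either processed or dead-lettered."""
        while self.messages:
            msg = self.messages.pop(0)
            msg.receive_count += 1
            if msg.receive_count > self.max_receive_count:
                # Receive count exceeded: park the message instead of delivering it.
                self.dead_letters.append(msg)
                continue
            try:
                handler(msg.body)  # success: the message is gone for good
            except Exception:
                self.messages.append(msg)  # failure: it becomes visible again


attempts = []  # records every body the handler was asked to process


def parse_order(body: str) -> int:
    attempts.append(body)
    return int(body)  # raises ValueError for the "poison" message


queue = SimulatedQueue(
    max_receive_count=3,
    messages=[Message("42"), Message("not-a-number")],
)
queue.deliver_all(parse_order)
```

After the run, the good message has been processed once, the poison message has been attempted three times, and it sits in `queue.dead_letters` waiting for you instead of vanishing.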
Setting It Up
Step 1: Create the DLQ first
# CloudFormation
OrderProcessingDLQ:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: order-processing-dlq
    MessageRetentionPeriod: 1209600 # 14 days (maximum)
Retention period matters. 14 days gives you time to notice the problem, fix it, and replay messages before they expire.
Step 2: Attach it to your main queue
OrderProcessingQueue:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: order-processing
    RedrivePolicy:
      deadLetterTargetArn: !GetAtt OrderProcessingDLQ.Arn
      maxReceiveCount: 3 # Fail 3 times → DLQ
maxReceiveCount: 3 means Lambda gets 3 shots at the message. Three failures and it moves to the DLQ. Don't set this too high: if the bug isn't transient, retrying 20 times just delays the inevitable while blocking queue processing.
When architecting on AWS, I learned: maxReceiveCount: 3 is my default for idempotent Lambdas. For Lambdas doing external API calls (payment processors, email providers), I use 5 to account for transient network failures.
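If you manage queues from a script rather than CloudFormation, the same redrive policy can be attached through the SQS API. A minimal sketch, assuming you pass in a boto3 SQS client (`boto3.client("sqs")`); the helper names `build_redrive_attributes` and `attach_dlq` are made up for this example.

```python
import json


def build_redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build the Attributes payload that points a source queue at its DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the redrive policy as a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


def attach_dlq(sqs, queue_url: str, dlq_arn: str, max_receive_count: int = 3) -> None:
    """Apply the redrive policy; `sqs` is assumed to be a boto3 SQS client."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dlq_arn, max_receive_count),
    )
```

Splitting the payload builder from the API call keeps the part you are most likely to get wrong (the JSON-encoded policy) easy to unit test without touching AWS.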
The CloudWatch Alarm You Must Set
Creating the DLQ is step one. Alerting when messages land in it is the actual point.
DLQDepthAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: order-processing-dlq-not-empty
    AlarmDescription: "Messages in DLQ - investigate immediately"
    MetricName: ApproximateNumberOfMessagesVisible
    Namespace: AWS/SQS
    Dimensions:
      - Name: QueueName
        Value: order-processing-dlq
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref OnCallSNSTopic
    TreatMissingData: notBreaching
DLQ depth > 0 = immediate alert. This is non-negotiable. A serverless pattern that saved us countless times: every SQS queue in production has a corresponding DLQ, and every DLQ has this alarm.
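"Every queue has a DLQ" is easy to say and easy to let slip, so it helps to audit it. The sketch below is one possible way to do that, assuming a boto3 SQS client; `queues_missing_dlq` and `fetch_queue_attributes` are hypothetical helper names. Queues that are themselves the target of another queue's redrive policy are treated as DLQs and skipped.

```python
import json


def queues_missing_dlq(attributes_by_url: dict) -> list:
    """Return URLs of queues with no RedrivePolicy that are not DLQs themselves.

    attributes_by_url maps queue URL -> attribute dict containing 'QueueArn'
    and, when a DLQ is configured, 'RedrivePolicy' (a JSON string).
    """
    dlq_arns = set()
    for attrs in attributes_by_url.values():
        if "RedrivePolicy" in attrs:
            policy = json.loads(attrs["RedrivePolicy"])
            dlq_arns.add(policy["deadLetterTargetArn"])
    return sorted(
        url
        for url, attrs in attributes_by_url.items()
        if "RedrivePolicy" not in attrs and attrs.get("QueueArn") not in dlq_arns
    )


def fetch_queue_attributes(sqs) -> dict:
    """Collect QueueArn and RedrivePolicy for every queue visible to the client."""
    result = {}
    for page in sqs.get_paginator("list_queues").paginate():
        for url in page.get("QueueUrls", []):
            response = sqs.get_queue_attributes(
                QueueUrl=url,
                AttributeNames=["QueueArn", "RedrivePolicy"],
            )
            result[url] = response.get("Attributes", {})
    return result
```

Running `queues_missing_dlq(fetch_queue_attributes(boto3.client("sqs")))` in CI or on a schedule gives you a list that should always be empty.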
FIFO vs. Standard: Don't Choose Wrong
SQS has two flavors and choosing the wrong one will wreck your day:
| | Standard Queue | FIFO Queue |
|---|---|---|
| Throughput | Nearly unlimited | 300 msg/sec (or 3,000 with batching) |
| Ordering | Best-effort | Guaranteed within a message group |
| Delivery | At-least-once | Exactly-once processing |
| Use case | Most things | Order-sensitive workflows |
| Cost | Cheaper | ~25% more per request ($0.50 vs. $0.40 per million) |
In production, I've deployed both and the rule I follow: Standard Queue unless you have a specific reason for ordering. "But what if messages arrive out of order?" is the wrong question. Design your consumers to be idempotent instead. It's cheaper and scales better.
When you genuinely need FIFO: payment state transitions (authorized → captured → refunded; order matters), inventory reservations (reserve before deduct; race conditions are bad), and anything where processing message B before message A produces incorrect state.
Visibility Timeout: The Gotcha That Burns Everyone
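For everything that stays on a Standard queue, "design your consumers to be idempotent" roughly means: processing the same message twice must have the same effect as processing it once. A minimal sketch, assuming each message carries a unique `order_id`; the in-memory `processed_ids` set is a stand-in for a durable store, and every name here is hypothetical.

```python
processed_ids = set()  # stand-in for a durable store such as a database table
charges = []           # records the side effect we must not repeat


def handle_order(message: dict) -> bool:
    """Charge an order once, keyed on order_id. Returns True if work was done."""
    order_id = message["order_id"]
    if order_id in processed_ids:
        # Duplicate delivery: acknowledge it without repeating the side effect.
        return False
    charges.append((order_id, message["amount_cents"]))
    processed_ids.add(order_id)
    return True


# At-least-once delivery in action: the same message arrives twice.
first = handle_order({"order_id": "o-1", "amount_cents": 1999})
second = handle_order({"order_id": "o-1", "amount_cents": 1999})
```

In a real system the check and the write need to be a single atomic operation (for example a conditional insert), otherwise two concurrent consumers can both pass the check. The in-memory set only illustrates the shape of the logic.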
This one is subtle and I've watched it trip up senior engineers.
When Lambda picks up a message, SQS hides it from other consumers for the visibility timeout period. If the function succeeds before the timeout, Lambda deletes the message. If it fails or doesn't finish in time, the message becomes visible again and gets reprocessed.
The classic mistake:
Visibility timeout: 30 seconds
Lambda timeout: 60 seconds
Lambda times out at 60s. SQS made the message visible again at 30s. Another Lambda picked it up. Now two Lambdas are processing the same order. Duplicate order. Angry customer charged twice.
The fix:
OrderProcessingQueue:
  Type: AWS::SQS::Queue
  Properties:
    VisibilityTimeout: 360 # 6× your 60-second Lambda timeout
A serverless pattern that saved us: Set visibility timeout to 6× your Lambda timeout. 6× sounds excessive, but it matches AWS's own guidance for SQS-triggered functions (plus any batching window you configure): the extra time gives Lambda room to retry a batch that was throttled. Be generous or face duplicate processing.
Replaying DLQ Messages After You Fix the Bug
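The rule is simple enough to encode as a check you can run against your config. This is a small sketch, not an official formula implementation; the function names are hypothetical, and the only hard limit used is SQS's 12-hour maximum visibility timeout.

```python
MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60  # SQS upper limit: 12 hours


def recommended_visibility_timeout(
    function_timeout_s: int, batch_window_s: int = 0, factor: int = 6
) -> int:
    """Visibility timeout = factor x Lambda timeout, plus any batching window."""
    value = factor * function_timeout_s + batch_window_s
    if value > MAX_VISIBILITY_TIMEOUT_S:
        raise ValueError("result exceeds the 12-hour SQS visibility timeout limit")
    return value


def risks_duplicate_delivery(visibility_timeout_s: int, function_timeout_s: int) -> bool:
    """True when a message can reappear while the function is still running."""
    return visibility_timeout_s < function_timeout_s
```

With the numbers from the classic mistake above, `risks_duplicate_delivery(30, 60)` is true, and `recommended_visibility_timeout(60)` gives 360 seconds.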
You fixed the bug. 300 messages are sitting in the DLQ. Now what?
# Built-in SQS DLQ redrive (the modern way)
# 123456789012 is a placeholder account ID (real ones are 12 digits)
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:123456789012:order-processing-dlq \
  --destination-arn arn:aws:sqs:us-east-1:123456789012:order-processing \
  --max-number-of-messages-per-second 10 # Throttle to avoid Lambda concurrency spike
The max-number-of-messages-per-second parameter is key. Replaying 5,000 messages instantly will spike your Lambda concurrency. Throttle the replay and let it drain gradually.
When architecting on AWS, I learned: Test your DLQ replay procedure before you need it. I've seen teams discover their IAM permissions blocked the redrive operation at the worst possible time. Run a fire drill. Move a test message to the DLQ. Replay it. Confirm it processes correctly.
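For a scripted fire drill, the same redrive can be started from Python. A sketch assuming a boto3 SQS client recent enough to include `start_message_move_task`; the helper names are hypothetical, and the 1 to 500 range check reflects the documented limit on the rate parameter as I understand it.

```python
import math


def build_move_task_request(
    source_arn: str, destination_arn: str, max_per_second: int = 10
) -> dict:
    """Build keyword arguments for the SQS StartMessageMoveTask call."""
    if not 1 <= max_per_second <= 500:
        raise ValueError("max_per_second must be between 1 and 500")
    return {
        "SourceArn": source_arn,
        "DestinationArn": destination_arn,
        "MaxNumberOfMessagesPerSecond": max_per_second,
    }


def estimated_drain_seconds(message_count: int, max_per_second: int) -> int:
    """Rough lower bound on how long a throttled replay takes."""
    return math.ceil(message_count / max_per_second)


def start_redrive(sqs, source_arn: str, destination_arn: str, max_per_second: int = 10) -> str:
    """Start the move task; `sqs` is assumed to be a boto3 SQS client."""
    response = sqs.start_message_move_task(
        **build_move_task_request(source_arn, destination_arn, max_per_second)
    )
    return response["TaskHandle"]
```

At 10 messages per second, the 300 messages from the example drain in about 30 seconds and 5,000 messages take roughly 500 seconds, which is usually a fair trade for not spiking Lambda concurrency.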
Cost Reality Check
SQS pricing is genuinely negligible for most applications:
- First 1M requests/month: Free
- After that: $0.40 per million requests (Standard), $0.50 per million (FIFO)
My production numbers:
E-commerce backend: ~12M SQS messages/month
  order-processing queue: 5M messages
  inventory-updates queue: 4M messages
  notification queue: 3M messages
Monthly SQS cost: ~$4.40
Monthly value of not losing order messages: 💰💰💰
The DLQ itself is just another queue â same pricing. The CloudWatch alarm costs essentially nothing. The visibility timeout is free to configure.
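The arithmetic behind that bill is short enough to sketch. One caveat: SQS bills per request (send, receive, delete, and empty receives all count), so the figure above lines up if the ~12M is read as billable requests. The function name and defaults below are assumptions based on the prices quoted in this post.

```python
def monthly_sqs_cost(
    requests: int,
    price_per_million: float = 0.40,   # Standard queue price quoted above
    free_requests: int = 1_000_000,    # monthly free tier quoted above
) -> float:
    """Estimate the monthly SQS bill in dollars for a given request count."""
    billable = max(0, requests - free_requests)
    return billable / 1_000_000 * price_per_million


standard_bill = monthly_sqs_cost(12_000_000)                       # Standard pricing
fifo_bill = monthly_sqs_cost(12_000_000, price_per_million=0.50)   # FIFO pricing
```

That works out to about $4.40 on Standard and $5.50 on FIFO for the same volume: (12M - 1M free) × the per-million price.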
SQS is one of the cheapest reliability improvements you can make to a serverless architecture.
Common Pitfalls Quick Reference
| Mistake | What Happens | Fix |
|---|---|---|
| No DLQ configured | Failed messages silently disappear | Always configure DLQ on every queue |
| No alarm on DLQ | You never know messages are failing | CloudWatch alarm on DLQ depth > 0 |
| Visibility timeout < Lambda timeout | Duplicate message processing | Set visibility timeout to 6× Lambda timeout |
| maxReceiveCount too high | Bug messages retry forever, clog queue | Use 3-5; high retry counts rarely help |
| Forgetting DLQ retention | Messages expire before you replay them | Always set 14-day retention on DLQs |
| Replaying DLQ without throttling | Lambda concurrency spike, throttles | Use max-number-of-messages-per-second |
TL;DR
Without a DLQ: Message fails → SQS retries → message disappears → you find out from customers.
With a DLQ + alarm: Message fails → SQS retries → message goes to DLQ → alert fires → you fix bug → replay messages → nobody loses data.
Your DLQ checklist:
- Every SQS queue gets a DLQ (no exceptions)
- DLQ retention: 14 days
- maxReceiveCount: 3-5 (not 20)
- Visibility timeout: 6× your Lambda timeout
- CloudWatch alarm on DLQ depth > 0
- Test your replay procedure before production needs it
- Choose FIFO only when ordering is genuinely required
In production, I've deployed SQS + DLQ patterns across 15+ queues in our e-commerce backend. After that 40,000 lost order incident, I added DLQ configuration as a PR checklist item. Every new queue. Every time. Non-negotiable.
Zero lost messages since. Zero. That's the whole point.
Got a DLQ war story? I'd love to hear it on LinkedIn. Message queue incidents are the best kind of "I survived this" stories.
More production serverless patterns? Check GitHub for real-world SQS, EventBridge, and Lambda examples from a production e-commerce backend.
Go configure those DLQs. Before your messages disappear into the void and take your customers with them.
P.S. SQS also has message groups (FIFO only), message attributes for metadata filtering, and long polling to reduce empty receive calls (and costs). Long polling alone can cut your SQS API calls by 95% if your queue has bursty traffic: set ReceiveMessageWaitTimeSeconds: 20 on the queue, or pass WaitTimeSeconds: 20 on each ReceiveMessage call from consumers you run yourself (Lambda's event source mapping already long-polls on your behalf). That $4.40/month bill I mentioned? It was $38 before I turned on long polling. The math is not subtle.
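For consumers you poll yourself, long polling is a single parameter on the receive call. A minimal sketch assuming a boto3 SQS client; `build_receive_request` and `poll_once` are hypothetical helper names, and the range checks mirror the ReceiveMessage limits (wait of 0 to 20 seconds, 1 to 10 messages per call).

```python
def build_receive_request(queue_url: str, wait_seconds: int = 20, max_messages: int = 10) -> dict:
    """Build keyword arguments for a long-polling ReceiveMessage call."""
    if not 0 <= wait_seconds <= 20:
        raise ValueError("wait_seconds must be between 0 and 20")
    if not 1 <= max_messages <= 10:
        raise ValueError("max_messages must be between 1 and 10")
    return {
        "QueueUrl": queue_url,
        "WaitTimeSeconds": wait_seconds,      # 0 means short polling
        "MaxNumberOfMessages": max_messages,  # fetch a batch per request
    }


def poll_once(sqs, queue_url: str) -> list:
    """One long-poll receive; `sqs` is assumed to be a boto3 SQS client."""
    response = sqs.receive_message(**build_receive_request(queue_url))
    return response.get("Messages", [])  # the key is absent when nothing arrived
```

With a 20-second wait, an idle queue costs you about three receive requests per minute per poller instead of one request per loop iteration.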