AWS SQS Dead Letter Queues: Stop Losing Messages in the Void ๐ฌ๐
AWS SQS Dead Letter Queues: Stop Losing Messages in the Void ๐ฌ๐
Honest confession: For the first 6 months of running our serverless e-commerce backend on AWS, I had absolutely no idea that messages were failing silently in SQS. Orders were occasionally "disappearing." Payments processed but confirmation emails never sent. Inventory updates just... vanished.
I thought it was a bug in my code. Turned out I'd built a perfectly designed message black hole. ๐ณ๏ธ
Let me save you from that particular flavor of production panic.
What Even Is SQS? (Quick Primer) ๐ค
Think of Amazon SQS (Simple Queue Service) like a postal service between your microservices:
Order Service โ [SQS Queue] โ Email Service
Payment Service
Inventory Service
Why not just call these services directly?
- Direct calls fail if the downstream service is down
- No retry logic built in
- Tight coupling = cascading failures
- One slow service blocks everything
SQS gives you:
- Guaranteed delivery (at-least-once)
- Built-in retries
- Decoupled services
- Automatic scaling
In production, I've deployed SQS between every major service boundary in our e-commerce stack. Order placed? Queue message. Payment captured? Queue message. Nothing calls anything directly anymore.
The Problem: Messages Just... Die ๐
Here's what happens without Dead Letter Queues:
Order placed โ SQS โ Lambda processes order
โ (Lambda throws error!)
Message goes back to queue
โ (Lambda tries again)
Lambda throws error AGAIN
โ (3 attempts... 4... 5...)
Message finally disappears ๐ป
SQS's default behavior: After your configured number of retries, the message is simply deleted. Gone. No trace. No alert. No record.
Translation: Your customer's order confirmation email got lost and nobody knows. ๐ฌ
A serverless pattern that saved us was treating every failed message as data, not garbage. That's what Dead Letter Queues are for.
Dead Letter Queues: Your Message Safety Net ๐ฅ
A Dead Letter Queue (DLQ) is just another SQS queue where failed messages land instead of being deleted:
Order placed โ [Main Queue] โ Lambda fails 3 times
โ
[Dead Letter Queue] โ Failed messages here!
โ
Alert fires, team investigates
The difference: Now you know something broke. The message is preserved. You can replay it after fixing the bug.
Setting It Up (The Right Way) โ๏ธ
Step 1: Create your DLQ first
aws sqs create-queue \
--queue-name order-processing-dlq \
--attributes '{
"MessageRetentionPeriod": "1209600"
}'
# 1209600 seconds = 14 days to investigate and replay
Step 2: Wire it to your main queue
aws sqs create-queue \
--queue-name order-processing \
--attributes '{
"RedrivePolicy": "{
\"deadLetterTargetArn\": \"arn:aws:sqs:us-east-1:123456789:order-processing-dlq\",
\"maxReceiveCount\": \"3\"
}"
}'
maxReceiveCount: 3 means - try 3 times, then send to DLQ. Not 30 times. Not forever.
Step 3: Via CloudFormation/CDK (what I actually use)
OrderProcessingDLQ:
Type: AWS::SQS::Queue
Properties:
QueueName: order-processing-dlq
MessageRetentionPeriod: 1209600 # 14 days
OrderProcessingQueue:
Type: AWS::SQS::Queue
Properties:
QueueName: order-processing
RedrivePolicy:
deadLetterTargetArn: !GetAtt OrderProcessingDLQ.Arn
maxReceiveCount: 3
VisibilityTimeout: 30
This is infrastructure as code. Deployable, reviewable, repeatable.
The Alarm You Absolutely Need ๐จ
Creating the DLQ is only half the job. You need to know when messages land there:
aws cloudwatch put-metric-alarm \
--alarm-name "OrderDLQ-HasMessages" \
--alarm-description "Messages failing in order processing" \
--metric-name ApproximateNumberOfMessagesVisible \
--namespace AWS/SQS \
--dimensions Name=QueueName,Value=order-processing-dlq \
--statistic Sum \
--period 60 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:sns:us-east-1:123456789:on-call-alert
When architecting on AWS, I learned: An unmonitored DLQ is almost as bad as no DLQ. The messages are preserved but nobody's looking at them. Set the alarm or you'll find out 3 weeks later that 400 order confirmations never sent. ๐
Gotcha #1: The Visibility Timeout Trap โฑ๏ธ
This one burns everyone the first time.
The scenario:
- SQS sends message to Lambda
- Lambda takes 25 seconds to process
- Visibility timeout is 30 seconds
- Lambda finishes at 28 seconds โ
But what if Lambda takes 35 seconds?
- Message becomes visible AGAIN at 30 seconds
- Another Lambda picks it up
- Two Lambdas now processing the SAME order!
maxReceiveCountincrements!- After 3 "failures" (timeouts), message goes to DLQ!
Fix: Set visibility timeout to at least 6x your Lambda timeout:
OrderProcessingQueue:
Type: AWS::SQS::Queue
Properties:
VisibilityTimeout: 180 # Lambda timeout is 30s, so 6x = 180s
And set your Lambda timeout properly:
OrderProcessorFunction:
Type: AWS::Lambda::Function
Properties:
Timeout: 30 # Actual max processing time
I got burned by this on a payment processing Lambda that occasionally hit Stripe rate limits. The timeouts were counting as "failures" and orders were landing in DLQ. Took me two hours to figure out why.
Gotcha #2: The maxReceiveCount Math ๐งฎ
Bad config:
maxReceiveCount: 1
This means: Try ONCE. If it fails, straight to DLQ.
Good for: Critical operations where retrying could cause duplicate side effects (though idempotency is the real fix here).
Bad for: Transient errors like network blips. Your DLQ fills up with messages that would've succeeded on retry #2.
My production defaults:
- Transient operations (emails, notifications):
maxReceiveCount: 5 - Critical operations (payments, orders):
maxReceiveCount: 3 - Idempotent background jobs:
maxReceiveCount: 10
Replaying DLQ Messages After the Fix ๐
You fixed the bug. Now 47 messages are sitting in the DLQ. How do you reprocess them?
Option 1: Manual replay (for small batches)
# Get messages from DLQ
aws sqs receive-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789/order-processing-dlq \
--max-number-of-messages 10
# Send back to main queue
aws sqs send-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789/order-processing \
--message-body '{"orderId": "12345", ...}'
Option 2: AWS Console DLQ Redrive (the easy button)
AWS added a one-click redrive in the console. Go to your DLQ โ "Start DLQ redrive" โ Point it at your main queue. Done!
Option 3: Lambda redrive script (for automation)
import boto3
sqs = boto3.client('sqs')
DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/order-processing-dlq'
MAIN_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/order-processing'
def redrive_messages():
while True:
response = sqs.receive_message(
QueueUrl=DLQ_URL,
MaxNumberOfMessages=10
)
messages = response.get('Messages', [])
if not messages:
break
for msg in messages:
# Send to main queue
sqs.send_message(
QueueUrl=MAIN_URL,
MessageBody=msg['Body']
)
# Delete from DLQ
sqs.delete_message(
QueueUrl=DLQ_URL,
ReceiptHandle=msg['ReceiptHandle']
)
print("Redrive complete!")
A serverless pattern that saved us: Keep this script version-controlled. You WILL need it at 2 AM someday.
Cost: Basically Nothing ๐ฐ
SQS pricing is almost embarrassingly cheap:
- First 1 million requests/month: Free
- After that: $0.40 per million requests
- DLQ storage: Same pricing as regular queues
My real numbers from production:
- E-commerce backend processing ~500K orders/month
- Total SQS cost: ~$8/month
- DLQ storage of failed messages: <$0.50/month
For the peace of mind of never losing a message again? I'd pay 100x that. ๐
Common Pitfalls Quick Reference ๐ชค
| Mistake | Impact | Fix |
|---|---|---|
| No DLQ configured | Lost messages forever | Always configure DLQ |
| DLQ with no alarm | Silent failures | CloudWatch alarm on queue depth |
| Visibility timeout too low | Duplicate processing | Set to 6x Lambda timeout |
| maxReceiveCount too low (1-2) | DLQ fills up on transient errors | Use 3-5 for most cases |
| 14-day retention expiry | Lost messages from old failures | Monitor DLQ weekly |
| Not testing DLQ | Unknown failure modes | Intentionally throw errors in dev |
Testing Your DLQ Setup ๐งช
Don't wait for production failures to test this:
// In your Lambda handler, add a test path
exports.handler = async (event) => {
const body = JSON.parse(event.Records[0].body);
// Test: Simulate failure
if (body.testDLQ === true) {
throw new Error('Intentional test failure for DLQ validation');
}
// Real processing
await processOrder(body);
};
Send a test message, watch it fail 3 times, confirm it lands in DLQ, verify your alarm fires.
When I architected serverless e-commerce backends, I made this part of every service deployment checklist. If your DLQ alarm doesn't fire during testing, your alarm is broken.
The Architecture That Actually Works ๐๏ธ
Here's the full setup I use for every critical service:
[Main Queue]
โ
[Lambda]
/ \
Success Failure (3x)
โ โ
Process order [Dead Letter Queue]
โ
[CloudWatch Alarm]
โ
[SNS Topic]
/ \
Email alert PagerDuty
Every critical message queue. Every service. No exceptions.
The TL;DR โก
AWS SQS without a DLQ = messages disappear and you never know why.
AWS SQS with a DLQ = messages are preserved, you get alerted, you can replay after fixing bugs.
Setup time: 15 minutes. Heartache saved: immeasurable.
Your checklist:
- Create DLQ for every production queue
- Set
maxReceiveCountto 3-5 - Set visibility timeout to 6x Lambda timeout
- Add CloudWatch alarm for DLQ depth > 0
- Keep a redrive script handy
- Test it on purpose before production does it for you
In production, I've deployed this pattern across 12+ queue/Lambda pairs. The DLQ has saved real customer data more times than I can count. Set it up now, before your messages start disappearing.
Had a SQS horror story? Share it on LinkedIn - misery loves company when it's a distributed systems war story! ๐
Want to see my full serverless architecture patterns? Check GitHub for real-world examples.
Go build reliable queues. Your future self will thank you. ๐ฌโ
P.S. FIFO queues in SQS are a thing. They guarantee order and exactly-once processing. They're also 10x more expensive. Use them only when message order genuinely matters - like financial transactions. For most things, standard queues with idempotent processing is the right call. ๐ธ