AWS SQS: Stop Losing Messages When Your System Crashes š¬š
AWS SQS: Stop Losing Messages When Your System Crashes š¬š
Picture this: It's Black Friday. Your e-commerce backend is getting hammered. A customer hits "Place Order." Your Lambda processes the request, fires off a "send confirmation email" function, and... your email service is down. Request times out. Customer never gets the email.
They call support. "Did my order go through?!" š±
I lived this exact nightmare before I got serious about SQS. After that incident, I refactored our entire async communication layer around Amazon SQS. Zero lost messages since. Here's everything I learned the hard way.
What Is SQS and Why Should You Care? š¤
SQS (Simple Queue Service) is AWS's managed message queue. Think of it like a to-do list in the cloud - your app writes tasks to the queue, and workers pick them up and process them. If the worker crashes, the task goes back in the queue and another worker picks it up.
Without SQS:
User ā API ā Email Service (DOWN!) ā š message gone forever
With SQS:
User ā API ā SQS Queue ā Email Service (DOWN!) ā message waits safely
ā (comes back online)
Email Service (UP!) ā picks up message ā ā
The queue holds your messages for up to 14 days. Your service can crash, restart, redeploy - the messages are just sitting there waiting like a patient golden retriever. š
The Production Scenario That Changed Everything š
In production, I've deployed a serverless e-commerce backend that processes thousands of orders daily. The original architecture was fire-and-forget:
// The OLD way - dangerous!
exports.handler = async (event) => {
const order = await saveOrder(event.body);
// If ANY of these fail, the order is partially processed
await sendConfirmationEmail(order); // What if email service is down?
await updateInventory(order); // What if this times out?
await notifyWarehouse(order); // What if THIS fails?
return { statusCode: 200, body: JSON.stringify(order) };
};
One night, our email service had a 20-minute outage. We processed 400 orders with zero confirmation emails sent. Support tickets flooded in. It was not fun.
The SQS fix:
// The NEW way - bulletproof!
exports.handler = async (event) => {
const order = await saveOrder(event.body);
// Just drop messages into queues - return instantly!
await sqs.sendMessage({
QueueUrl: process.env.EMAIL_QUEUE_URL,
MessageBody: JSON.stringify({ type: 'ORDER_CONFIRMATION', order }),
}).promise();
await sqs.sendMessage({
QueueUrl: process.env.INVENTORY_QUEUE_URL,
MessageBody: JSON.stringify({ orderId: order.id, items: order.items }),
}).promise();
// API returns in 50ms - downstream failures don't affect the user!
return { statusCode: 200, body: JSON.stringify(order) };
};
The email service went down again two months later. This time? Zero support tickets. Messages queued up, service came back, emails sent. Beautiful. šÆ
The Gotchas Nobody Tells You About ā ļø
Gotcha #1: Visibility Timeout Is Everything
When a consumer picks up a message, SQS hides it from other consumers for a set period (the visibility timeout). If the consumer doesn't delete the message before the timeout, SQS assumes it failed and makes the message visible again.
The trap:
# Default visibility timeout: 30 seconds
# Your Lambda function runtime: 3 minutes
# Result: MESSAGE GETS PROCESSED TWICE š±
The fix:
# Set visibility timeout to AT LEAST 6x your Lambda timeout
aws sqs set-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456/my-queue \
--attributes VisibilityTimeout=360 # 6 minutes for a 1 min Lambda
When architecting on AWS, I learned: Always set your visibility timeout to 6x the maximum expected processing time. Religiously.
Gotcha #2: Dead Letter Queues Save Your Sanity šŖ£
What happens to messages that keep failing? By default, they retry forever. That's... not ideal.
A Dead Letter Queue (DLQ) is a separate queue where failed messages go to die gracefully after N retries:
# Create a DLQ first
aws sqs create-queue --queue-name my-queue-dlq
# Attach it to your main queue (max 3 retries before DLQ)
aws sqs set-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456/my-queue \
--attributes '{
"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456:my-queue-dlq\",\"maxReceiveCount\":\"3\"}"
}'
Now set up a CloudWatch alarm on your DLQ. When messages pile up there, you get paged and can manually re-process them or investigate.
A serverless pattern that saved us: DLQ + CloudWatch alarm = instant notification when something's genuinely broken, not just temporarily flaky. We caught a bug in our inventory processor that would have silently dropped hundreds of stock updates.
Gotcha #3: Standard vs FIFO Queues
Standard Queue: At-least-once delivery, best-effort ordering. Fast, cheap, scales infinitely.
FIFO Queue: Exactly-once processing, strict ordering. Slower, costs more.
Use Standard for:
ā
Email notifications (duplicate email? Annoying, not catastrophic)
ā
Log processing
ā
Background analytics
Use FIFO for:
ā
Financial transactions (DO NOT process a payment twice!)
ā
Inventory updates (order matters!)
ā
Any workflow where sequence and deduplication matter
I've been burned by using a Standard queue for inventory updates. A product went negative in stock because two messages processed in the wrong order. FIFO was the right call. šø
The Lambda + SQS Pattern That Just Works ā”
This is the setup I use for almost every async workload:
// SQS consumer Lambda
exports.handler = async (event) => {
const results = await Promise.allSettled(
event.Records.map(record => processRecord(record))
);
// Report partial failures back to SQS
const failures = results
.map((result, i) => ({ result, record: event.Records[i] }))
.filter(({ result }) => result.status === 'rejected')
.map(({ record }) => ({ itemIdentifier: record.messageId }));
// SQS will retry ONLY the failed records - not the whole batch!
return { batchItemFailures: failures };
};
async function processRecord(record) {
const message = JSON.parse(record.body);
// Your actual logic here
await sendEmail(message.order);
}
The batchItemFailures trick is huge - without it, if 1 out of 10 messages fails, ALL 10 get retried. With it, only the actual failures get retried. Your DLQ stays clean.
Cost Reality Check š°
SQS is almost criminally cheap:
First 1 million requests/month: FREE
Next 100 billion requests: $0.40 per million
1 million messages = $0.40
10 million messages = $4.00
1 billion messages = $400
For context: I process ~5 million async messages per month on the e-commerce platform. That's $2/month. My coffee costs more than my message queue.
The real cost trap: Don't use SQS for high-frequency polling with Lambda. If you set your Lambda to poll every second and there are no messages, you're still paying for empty polls.
Fix: Use Long Polling - Lambda waits up to 20 seconds for a message before returning:
aws sqs set-queue-attributes \
--queue-url https://sqs.us-east-1.amazonaws.com/123456/my-queue \
--attributes ReceiveMessageWaitTimeSeconds=20
This cut our SQS API costs by 90% for low-traffic queues. š
Common Pitfalls to Avoid šŖ¤
Pitfall #1: Storing large payloads in SQS
SQS has a 256KB message size limit. I've seen engineers try to stuff entire product catalogs into messages.
// BAD - will fail for large payloads
await sqs.sendMessage({
MessageBody: JSON.stringify(entireProductCatalog) // 5MB š
}).promise();
// GOOD - store in S3, reference in SQS
const s3Key = await uploadToS3(entireProductCatalog);
await sqs.sendMessage({
MessageBody: JSON.stringify({ s3Key, type: 'CATALOG_UPDATE' })
}).promise();
Pitfall #2: Not setting message retention
Default retention is 4 days. If your consumer is down for a long weekend, you lose messages. I set everything to 14 days (the max).
Pitfall #3: Ignoring message ordering in Standard queues
If you send messages A ā B ā C, don't assume they arrive in that order with a Standard queue. Design your consumers to be order-independent, or use FIFO.
The Architecture That Handles 10x Traffic Spikes š
This is the pattern I reached for when our platform had viral moments:
User Request ā API Lambda (fast, just enqueues)
ā
SQS Queue (absorbs the spike)
ā
Worker Lambda (scales automatically)
ā
Email / DB / 3rd party (at controlled pace)
During a flash sale, API response times stayed at 60ms because we just dumped messages in the queue and returned. The workers processed at a steady rate. No cascading failures, no timeouts, no angry customers.
The queue acts as a shock absorber. Your downstream services don't get hit with 10x traffic at once - they get a steady, manageable stream.
TL;DR - Your SQS Survival Checklist ā
- Use SQS for anything async - email, notifications, background jobs, anything that doesn't need to be in-line with the user request
- Set visibility timeout to 6x your processing time - prevents double processing
- Always create a DLQ - with a CloudWatch alarm on it
- Use FIFO queues for financial/inventory operations - order matters there
- Enable long polling - 20-second wait time cuts empty API calls by 90%
- Use
batchItemFailuresin Lambda - retry only what actually failed - Store large payloads in S3 - reference them in SQS, not the data itself
- Set message retention to 14 days - give yourself time to fix downstream issues
The Bottom Line š”
When architecting on AWS, I learned that the difference between a fragile system and a resilient one is usually a queue in the middle. SQS is one of those AWS services that costs almost nothing, takes 20 minutes to set up, and saves you from catastrophic data loss.
Before SQS: one downstream outage = data loss + angry customers + 3 AM pagerduty alerts.
After SQS: downstream service crashes ā messages wait patiently ā service recovers ā business continues ā nobody wakes up at 3 AM. š“
Go add a queue to your most critical async operations. Your future self will thank you at 3 AM.
Still losing messages in production? Find me on LinkedIn and let's talk architecture!
Want to see real SQS patterns? Check out my GitHub for production-tested serverless templates.
Now go forth and queue responsibly! š¬š
P.S. If you're choosing between SQS and EventBridge: SQS = point-to-point reliable delivery, EventBridge = pub/sub event routing. Use both. They're complements, not competitors! š¤
P.P.S. Set up that DLQ alarm BEFORE you need it. Finding out about message failures from a support ticket is a very humbling experience. š