The Transactional Outbox Pattern: Stop Losing Events Between Your DB and Queue
Scene: Our e-commerce backend. Order placed successfully. Database says committed. Inventory service never got the event. Warehouse never got the pick request. Customer got a confirmation email. Package never arrived.
Customer support ticket, 3 days later: "Where is my order???"
Me, staring at CloudWatch logs: "The order is definitely in the database... but the queue event... it just... isn't... anywhere."
This is the silent killer of distributed systems. Not the crashes you can see. The events that vanish into the void between your database commit and your message queue publish.
This is the story of the pattern that finally fixed it.
The Classic Two-Write Problem
When you move to event-driven architecture, you immediately hit this wall:
async function createOrder(orderData) {
  // Write 1: Save to database
  const order = await db.orders.create(orderData);

  // ⚠️ DANGER ZONE ⚠️
  // What happens if anything fails HERE?

  // Write 2: Publish event to queue
  await sqs.sendMessage({
    QueueUrl: ORDER_QUEUE_URL,
    MessageBody: JSON.stringify({
      type: 'ORDER_CREATED',
      orderId: order.id,
      userId: order.userId
    })
  });

  return order;
}
Looks innocent. It's a ticking time bomb.
Things that can fail between Write 1 and Write 2:
Database commits ✅
   ↓
Network blip
SQS temporarily down
Lambda cold start timeout
Memory pressure OOM kill
Your server restarts for a deploy
   ↓
SQS never receives the message
   ↓
Downstream services never notified
   ↓
Order exists in DB, no fulfillment
   ↓
Angry customer + confused warehouse
Or worse, the reverse happens. Queue gets the event. Database rolls back. Now you're fulfilling an order that doesn't exist. Good luck explaining that to accounting.
The fundamental problem: Two separate systems, zero atomicity. You can't commit to your database AND publish to your queue in a single atomic operation. They're different systems. They don't share transaction boundaries.
This isn't a bug in your code. It's a fundamental distributed systems problem, and the Transactional Outbox Pattern is the elegant solution.
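To make the failure window concrete, here's a toy simulation with in-memory stand-ins for the database and the queue (all names here are illustrative, not part of any real system):

```javascript
// Toy simulation of the two-write problem.
const db = [];     // "committed" orders
const queue = [];  // messages the queue actually received

function publish(message, { failureRate = 0 } = {}) {
  // Simulate a network blip / queue outage
  if (Math.random() < failureRate) throw new Error('publish failed');
  queue.push(message);
}

function createOrderNaive(order, failureRate) {
  db.push(order); // Write 1: commits immediately
  // Write 2: may fail after the DB write already succeeded
  publish({ type: 'ORDER_CREATED', orderId: order.id }, { failureRate });
}

// Force the failure window: publish always fails
try {
  createOrderNaive({ id: 'ord-1' }, 1.0);
} catch (e) {
  // The order is in the DB, but the event never reached the queue
}
console.log(db.length, queue.length); // 1 0 — the split-brain state
```

The interesting part is that no amount of try/catch around Write 2 fixes this: once Write 1 has committed, you either lose the event or you have to build retry machinery, which is exactly what the outbox pattern gives you for free.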
The Outbox Pattern: Use Your Database's Existing Guarantees
The insight is beautiful in its simplicity:
Instead of writing to two systems, write everything to ONE system (your database) atomically. Then, separately, relay those writes to the queue.
Your database is ACID-compliant. Within a single transaction, you can write an order AND an outbox event โ both committed atomically, or both rolled back. No more split-brain between your DB and queue.
┌────────────────────────────────────────────────┐
│ TRADITIONAL APPROACH                           │
│                                                │
│ API → [Write to DB] → [Write to Queue] → Done  │
│            │                │                  │
│        ACID safe    NOT atomic with DB         │
│                     = events can be lost       │
└────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────┐
│ OUTBOX PATTERN                                       │
│                                                      │
│ API → [Write to DB + Write to outbox_events] → Done  │
│            │                   │                     │
│    Same transaction!   Same transaction!             │
│       Both commit or both rollback = ATOMIC          │
│                                                      │
│ Outbox Relay → Poll outbox_events → Publish to Queue │
│                       │                              │
│             At-least-once delivery                   │
│            guaranteed by retry logic                 │
└──────────────────────────────────────────────────────┘
The magic: your outbox events live in the same database as your business data. One commit, one rollback. Atomic by nature.
How I Built It on Our E-Commerce Backend
Step 1: The Outbox Table
CREATE TABLE outbox_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  event_type VARCHAR(100) NOT NULL,
  payload JSONB NOT NULL,
  status VARCHAR(20) DEFAULT 'PENDING',
  created_at TIMESTAMP DEFAULT NOW(),
  processed_at TIMESTAMP,
  retry_count INT DEFAULT 0,
  error_message TEXT
);

CREATE INDEX idx_outbox_pending ON outbox_events(status, created_at)
WHERE status = 'PENDING';
Simple. Just a table with your events waiting to be shipped.
Step 2: Write Business Data + Outbox Event in One Transaction
// orderService.js
async function createOrder(orderData, db) {
  return await db.transaction(async (trx) => {
    // Write 1: Save the order
    const [order] = await trx('orders')
      .insert({
        user_id: orderData.userId,
        total: orderData.total,
        status: 'PENDING',
        created_at: new Date()
      })
      .returning('*');

    // Write 2: Save outbox event (SAME TRANSACTION!)
    await trx('outbox_events').insert({
      event_type: 'ORDER_CREATED',
      payload: JSON.stringify({
        orderId: order.id,
        userId: order.user_id,
        total: order.total,
        items: orderData.items
      })
    });

    // If ANYTHING fails above, both writes roll back.
    // If both succeed, both are committed atomically.
    return order;
  });
}
No more praying the network holds between your DB write and queue publish. Either both the order and the outbox event exist, or neither does.
Step 3: The Outbox Relay (The Polling Worker)
// outboxRelay.js — runs as a separate Lambda or cron job
const SQS = require('@aws-sdk/client-sqs');

const sqs = new SQS.SQSClient({ region: 'ap-south-1' });

async function processOutboxEvents(db) {
  // The whole batch runs inside one transaction so the row locks from
  // FOR UPDATE SKIP LOCKED are held until we commit. Without the
  // transaction, the locks are released as soon as the SELECT returns.
  await db.transaction(async (trx) => {
    // Grab a batch of pending events
    const events = await trx('outbox_events')
      .where('status', 'PENDING')
      .where('retry_count', '<', 5) // give up after 5 retries
      .orderBy('created_at', 'asc')
      .limit(10) // process in small batches
      .forUpdate() // lock rows to prevent duplicate processing
      .skipLocked(); // skip rows locked by other relay instances

    if (events.length === 0) return;

    for (const event of events) {
      try {
        // Publish to SQS
        await sqs.send(new SQS.SendMessageCommand({
          QueueUrl: process.env.ORDER_QUEUE_URL,
          MessageBody: event.payload,
          MessageAttributes: {
            EventType: {
              DataType: 'String',
              StringValue: event.event_type
            }
          },
          MessageDeduplicationId: event.id, // FIFO queues only
          MessageGroupId: 'orders'          // FIFO queues only
        }));

        // Mark as processed
        await trx('outbox_events')
          .where('id', event.id)
          .update({
            status: 'PROCESSED',
            processed_at: new Date()
          });

        console.log(`✅ Published event: ${event.event_type} [${event.id}]`);
      } catch (error) {
        // Failed to publish: increment retry count, try again later.
        // A fuller version would also schedule the next attempt with
        // exponential backoff instead of retrying on the very next poll.
        await trx('outbox_events')
          .where('id', event.id)
          .update({
            retry_count: trx.raw('retry_count + 1'),
            error_message: error.message
          });

        console.error(`❌ Failed to publish: ${event.id}`, error.message);
      }
    }
  });
}

// Run every 5 seconds
setInterval(() => processOutboxEvents(db), 5000);
Key line: .forUpdate().skipLocked(). This is PostgreSQL's SELECT ... FOR UPDATE SKIP LOCKED row-level locking (not advisory locking), and the locks are only held while the surrounding transaction is open. If you run multiple relay instances (you will, for resilience), they won't step on each other's toes: each instance grabs a unique batch and processes it without overlap.
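The relay above only increments retry_count and waits for the next poll. One common refinement, sketched here as an assumption rather than what our relay actually does, is to derive a capped exponential delay from retry_count and skip events whose next attempt time hasn't arrived yet:

```javascript
// Hypothetical backoff helper: delay doubles per retry, capped at 5 minutes.
const BASE_DELAY_MS = 5_000;   // first retry after 5s
const MAX_DELAY_MS = 300_000;  // cap at 5 minutes

function backoffMs(retryCount) {
  return Math.min(BASE_DELAY_MS * 2 ** retryCount, MAX_DELAY_MS);
}

// The relay would compare this against NOW() and skip events
// whose next attempt is still in the future.
function nextAttemptAt(lastAttempt, retryCount) {
  return new Date(lastAttempt.getTime() + backoffMs(retryCount));
}

console.log(backoffMs(0), backoffMs(3)); // 5000 40000
```

Storing the computed next-attempt timestamp in an extra column (and filtering on it in the poll query) keeps a flapping queue from being hammered every 5 seconds by the same failing event.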
Step 4: Cleanup Old Processed Events
// Run daily: don't let your outbox table grow forever
async function cleanupOldEvents(db) {
  const deletedCount = await db('outbox_events')
    .where('status', 'PROCESSED')
    .where('processed_at', '<', db.raw("NOW() - INTERVAL '7 days'"))
    .delete();

  console.log(`Cleaned up ${deletedCount} old outbox events`);
}
The Scalability Lesson That Cost Us
We initially ran the outbox relay as a single Lambda on a schedule. It worked great, until a spike hit: 2,000 orders in 5 minutes, with the relay running every 60 seconds at a batch size of 10. We fell 200 batches behind. Events were delayed by 3+ hours. Inventory was never notified. We oversold.
The fix was brutally simple: increase concurrency and reduce batch intervals.
// Instead of one relay instance, run multiple in parallel.
// Lambda: increase reserved concurrency.
// Batch size: tune based on your DB connection pool limits.

// For high-volume: don't just poll. Use PostgreSQL LISTEN/NOTIFY
// to trigger the relay instantly on new outbox inserts.
// With knex, grab a dedicated raw pg connection for LISTEN:
const conn = await db.client.acquireConnection();
await conn.query('LISTEN outbox_new_event');

conn.on('notification', (msg) => {
  if (msg.channel === 'outbox_new_event') {
    processOutboxEvents(db); // triggered immediately, no polling delay!
  }
});

// In your create-order transaction, also:
await trx.raw("NOTIFY outbox_new_event, 'new'");
Polling works. Real-time LISTEN/NOTIFY is better for latency-sensitive systems. We use both: NOTIFY for immediate processing, polling as a fallback safety net.
Trade-offs: This Isn't a Free Lunch
✅ WHAT YOU GET:
┌──────────────────────────────────────────────────────┐
│ Atomic DB + event publish: no split-brain ever       │
│ At-least-once delivery guarantee                     │
│ Built-in retry with backoff                          │
│ Full event audit log in your DB                      │
│ Survives queue downtime (events wait in outbox)      │
│ No external dependencies for the write path          │
└──────────────────────────────────────────────────────┘

⚠️ WHAT YOU GIVE UP:
┌──────────────────────────────────────────────────────┐
│ Extra DB table to maintain                           │
│ Eventual consistency (not instant queue publish)     │
│ Relay process = another thing to monitor             │
│ At-LEAST-once = possible duplicates downstream       │
│ DB is now on the critical path for event publishing  │
└──────────────────────────────────────────────────────┘
The duplicate problem: At-least-once delivery means your downstream services WILL receive the same event more than once (the relay retries on transient failures). Design consumers to be idempotent. If the inventory service receives ORDER_CREATED twice for the same order ID, it should only decrement stock once. Use the order ID as a deduplication key.
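Here's a minimal sketch of what an idempotent consumer looks like, using an in-memory set as a stand-in for a durable dedup store (in production you'd use something like a UNIQUE constraint on the order ID in the consumer's own database, so dedup survives restarts):

```javascript
// Idempotent consumer sketch: the side effect runs at most once per order ID,
// no matter how many times the event is delivered.
const processedOrderIds = new Set();
let stockLevel = 100;

function handleOrderCreated(event) {
  if (processedOrderIds.has(event.orderId)) {
    return false; // duplicate delivery: acknowledge and do nothing
  }
  processedOrderIds.add(event.orderId);
  stockLevel -= event.quantity; // side effect runs exactly once per order
  return true;
}

// At-least-once delivery: the same event arrives twice
handleOrderCreated({ orderId: 'ord-42', quantity: 3 });
handleOrderCreated({ orderId: 'ord-42', quantity: 3 });
console.log(stockLevel); // 97 — stock decremented only once
```

The key design choice: record the dedup key and apply the side effect in the same consumer-side transaction, otherwise a crash between the two reintroduces the original problem at the consumer.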
When Designing Our E-Commerce Backend...
When designing our e-commerce backend, we evaluated three approaches:
Option 1: Two-phase commit (2PC)
❌ Distributed transaction across DB + Queue
❌ Works in theory. Complex in practice.
❌ Requires a 2PC-compatible queue (most don't support it).
❌ Major performance hit. We killed this idea fast.
Option 2: Saga Pattern (separate post!)
- For multi-service transactions with compensation
- Overkill if you just need DB + Queue atomicity
- Great for order fulfillment across 5+ services
- Different problem, different tool
Option 3: Transactional Outbox (WE CHOSE THIS)
✅ Uses existing DB guarantees
✅ Works with any queue (SQS, RabbitMQ, Kafka, SNS)
✅ Simple to understand, simple to debug
✅ Battle-tested at scale (Shopify, Airbnb, Netflix use variants)
As a Technical Lead, I've learned: the pattern that uses your existing infrastructure's guarantees beats the pattern that requires new infrastructure every time. Your PostgreSQL is already ACID-compliant. Exploit that.
Common Mistakes to Avoid
Mistake #1: Polling too infrequently
// BAD: 60-second poll = 60-second event delay
setInterval(() => processOutboxEvents(db), 60000);
// BETTER: 5-second poll for most use cases
setInterval(() => processOutboxEvents(db), 5000);
// BEST: LISTEN/NOTIFY + polling fallback
Mistake #2: Batch size too large
// BAD: Massive batch = long-running transaction = DB connection held
.limit(1000) // holds connection for potentially 30+ seconds
// GOOD: Small batches, frequent runs
.limit(10) // fast, returns connection quickly
Mistake #3: Not handling the "stuck" event
// Events with retry_count >= 5 sit in PENDING forever
// Add a dead letter concept:
await db('outbox_events')
  .where('status', 'PENDING')
  .where('retry_count', '>=', 5)
  .update({ status: 'DEAD' }); // Alert on DEAD events!
Mistake #4: No index on the outbox table
-- Without this index, every poll is a full table scan
CREATE INDEX idx_outbox_pending ON outbox_events(status, created_at)
WHERE status = 'PENDING';
We forgot this index for the first week in production. Table hit 50,000 rows. Query went from 2ms to 800ms. Added index. Back to 2ms. Don't be us.
The Bottom Line
The Transactional Outbox Pattern solves a problem that most engineers don't even know they have, until an event goes missing and they spend three days debugging why an order was never fulfilled.
The core idea:
- Write your business data AND your event in ONE database transaction
- A relay process reads pending events and publishes them to your queue
- Downstream services get at-least-once delivery with retry guarantees
When designing our e-commerce backend, this was the pattern that made our event-driven architecture actually reliable. Before it: random event loss, debugging nightmares, angry customers. After it: 99.98% event delivery rate, clear audit trail, confident deploys.
Is it perfect? No. You get eventual consistency instead of instant publish. Your downstream services need to be idempotent. You add operational complexity with the relay process.
But the alternative is lying to yourself: pretending that writing to two systems simultaneously is atomic when it absolutely isn't.
Your database and message queue can finally stop fighting. The Outbox Pattern is the peace treaty.
Your Action Plan
This week:
- Audit your code: find everywhere you write to the DB AND publish to the queue in sequence
- Create the outbox_events table with the proper index
- Pick ONE endpoint and implement the outbox pattern on it
This month:
- Roll out the outbox pattern to all critical event-publishing paths
- Add the LISTEN/NOTIFY optimization if using PostgreSQL
- Set up monitoring: alert on DEAD events and relay lag
This quarter:
- Implement consumer idempotency (dedup on order ID / event ID)
- Write chaos tests: kill the relay mid-processing, verify events eventually deliver
- Add metrics: event delivery latency (time from DB write to queue publish)
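For the relay-lag alert and the delivery-latency metric in the plan above, one possible shape (a sketch; the metric definition and threshold are assumptions, not what we run in production) is the age of the oldest PENDING event:

```javascript
// Hypothetical relay-lag metric: how old is the oldest unpublished event?
// Feed it rows from: SELECT created_at FROM outbox_events WHERE status = 'PENDING'
function relayLagMs(pendingEvents, now = new Date()) {
  if (pendingEvents.length === 0) return 0; // nothing waiting: no lag
  const oldest = Math.min(...pendingEvents.map((e) => e.created_at.getTime()));
  return now.getTime() - oldest;
}

const now = new Date('2024-01-01T00:01:00Z');
const pending = [
  { created_at: new Date('2024-01-01T00:00:00Z') }, // 60s old
  { created_at: new Date('2024-01-01T00:00:50Z') }, // 10s old
];
console.log(relayLagMs(pending, now)); // 60000 — alert if this keeps growing
```

Alerting on the oldest event (rather than the count of pending rows) catches the stuck-event case even when overall volume is low.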
Resources Worth Your Time
- Microservices Patterns by Chris Richardson: the definitive writeup on the outbox pattern
- PostgreSQL LISTEN/NOTIFY docs: for zero-latency relay triggering
- Designing Data-Intensive Applications: Chapter 11 on stream processing covers the theory deeply
Lost events in production? Tell me your war story: connect on LinkedIn!
Want to see the outbox relay in action? Check out GitHub for real implementation examples.
Now go make your event delivery boring and reliable!
P.S. If you're currently writing to your database and then your queue in sequence, thinking "we haven't lost an event yet": you're not lucky. You just haven't had enough traffic or network instability yet. Add the outbox pattern now, before the incident.
P.P.S. Companies like Shopify, Airbnb, and Netflix all use variants of this pattern at massive scale. When the best engineering teams in the world converge on the same solution, it's probably the right one.