The Saga Pattern: Distributed Transactions Without Crying Yourself to Sleep ππ
The Saga Pattern: Distributed Transactions Without Crying Yourself to Sleep ππ
True story: It was 11 PM on a Tuesday when our Slack blew up.
Customer support: "Users are being charged but orders aren't showing up!"
Me: "That's... not supposed to happen."
DevOps: "Payment service is up. Inventory service... crashed 40 minutes ago."
Me: π±
We had successfully charged 23 customers for orders that existed nowhere. Payment had committed. Inventory had failed. And we had absolutely no rollback plan. Welcome to the distributed transaction hell β the problem the Saga Pattern exists to solve.
The Problem: Distributed Transactions Don't Exist π€¦
In a monolith with a single database, life is beautiful:
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE user_id = 42;
INSERT INTO orders (user_id, total) VALUES (42, 100);
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 7;
COMMIT; -- All or nothing! β
-- (or)
ROLLBACK; -- Like it never happened! β
One database. One transaction. Either everything works or nothing does. Pure magic.
In microservices, you get this:
User clicks "Buy Now"
β
βΌ
βββββββββββββββ
β Order Svc β β Creates order (committed to OrderDB)
ββββββββ¬βββββββ
β
βΌ
βββββββββββββββ
β Payment Svc β β Charges card (committed to PaymentDB)
ββββββββ¬βββββββ
β
βΌ
βββββββββββββββ
βInventory Svcβ β CRASHES π₯ (InventoryDB unreachable)
βββββββββββββββ
Result: Order created β
, Payment charged β
, Inventory NOT updated β
There's no magic ROLLBACK across 3 independent databases. You need a strategy. That strategy is the Saga.
What Even Is a Saga? π
A Saga is a sequence of local transactions where each step has a compensating transaction β basically an "undo" action that reverses the previous step if something downstream fails.
Forward steps (happy path):
1. Create Order β Compensate: Cancel Order
2. Charge Payment β Compensate: Refund Payment
3. Reserve Inventory β Compensate: Release Inventory
4. Notify Customer β Compensate: Send Cancellation Email
If step 3 fails:
β Run compensate(step 2): Refund Payment
β Run compensate(step 1): Cancel Order
β Customer gets a refund email instead of a ghost order β
Think of it like a Vegas heist plan. Everyone has their role, and if the vault alarm goes off mid-job, everyone knows the abort sequence. No one stands around saying "well, I already picked the lock, so..." π°
Implementation #1: Choreography-Based Saga (Event-Driven) πΌ
Each service does its job and publishes an event. Other services listen and react. No one's in charge β it's a jazz band, not an orchestra.
βββββββββββββββ
β Order Svc βββpublishesβββΆ order.created
βββββββββββββββ
β
βββββββββββββββββ
βΌ
βββββββββββββββ
β Payment Svc β listens to order.created
β βββpublishesβββΆ payment.completed (or payment.failed)
βββββββββββββββ
β
βββββββββββββββββ
βΌ
βββββββββββββββ
βInventory Svcβ listens to payment.completed
β βββpublishesβββΆ inventory.reserved (or inventory.failed)
βββββββββββββββ
When designing our e-commerce backend, I started with choreography because it looked simpler. It wasn't. Here's what the Node.js code actually looks like:
// order-service/src/handlers/createOrder.js
async function createOrder(orderData) {
// Step 1: Create the order locally
const order = await db.orders.create({
userId: orderData.userId,
productId: orderData.productId,
status: 'PENDING', // NOT 'CONFIRMED' yet!
total: orderData.total
});
// Publish event β don't wait for payment!
await eventBus.publish('order.created', {
orderId: order.id,
userId: order.userId,
total: order.total,
productId: order.productId
});
return order;
}
// Compensating transaction (called if payment fails)
async function cancelOrder(orderId) {
await db.orders.update(orderId, { status: 'CANCELLED' });
await eventBus.publish('order.cancelled', { orderId });
}
// payment-service/src/handlers/paymentHandler.js
eventBus.subscribe('order.created', async (event) => {
try {
const payment = await chargeCustomer({
userId: event.userId,
amount: event.total,
orderId: event.orderId
});
await eventBus.publish('payment.completed', {
orderId: event.orderId,
paymentId: payment.id
});
} catch (error) {
// Compensate! Tell Order Service to cancel.
await eventBus.publish('payment.failed', {
orderId: event.orderId,
reason: error.message
});
}
});
// Order service listens for payment.failed
eventBus.subscribe('payment.failed', async (event) => {
await cancelOrder(event.orderId); // Compensating transaction!
await notifyUser(event.orderId, 'Payment failed, order cancelled');
});
The catch with choreography:
- β Services are loosely coupled
- β No single point of failure
- β Easy to add new services
- β οΈ Hard to visualize the full flow
- β οΈ Debugging is a detective game across 5 services' logs
- β οΈ Cyclic dependencies creep in silently
- β οΈ "Which service handles this edge case?" β nobody knows
A scalability lesson that cost us: We had a bug where inventory.failed triggered payment.refund, which published payment.refunded, which triggered order.cancelled, which published order.cancelled, which triggered... a loop. Events firing forever. At 3 AM. Fun times. ππΈ
Implementation #2: Orchestration-Based Saga (Central Coordinator) π―
One service β the Saga Orchestrator β tells everyone what to do. Think of it as a project manager: it knows every step, decides who does what, and handles rollbacks when things go sideways.
ββββββββββββββββββββββββ
β Order Orchestrator β
ββββββββββββ¬ββββββββββββ
β
βββββββββββββΌββββββββββββ
βΌ βΌ βΌ
[Create Order] [Charge $] [Reserve Stock]
β β β
β if fails: β
β βββββββββ β
β Refund $ β
βββββββββββββββββββββββ
Cancel Order
Here's what the orchestrator looks like in Node.js:
// saga/checkoutSaga.js
class CheckoutSaga {
constructor(sagaId, orderData) {
this.sagaId = sagaId;
this.orderData = orderData;
this.steps = [];
this.currentStep = 0;
}
async execute() {
try {
// Step 1: Create Order
const order = await this.createOrder();
this.steps.push({ name: 'createOrder', data: order });
// Step 2: Process Payment
const payment = await this.processPayment(order);
this.steps.push({ name: 'processPayment', data: payment });
// Step 3: Reserve Inventory
const reservation = await this.reserveInventory(order);
this.steps.push({ name: 'reserveInventory', data: reservation });
// Step 4: Confirm Order (all good!)
await this.confirmOrder(order.id);
return { success: true, orderId: order.id };
} catch (error) {
console.error(`Saga ${this.sagaId} failed at step: ${error.step}`);
await this.compensate(); // Run undo operations
throw error;
}
}
async compensate() {
// Run compensating transactions in REVERSE order
for (const step of this.steps.reverse()) {
try {
await this.runCompensation(step);
} catch (compensationError) {
// Log but continue - compensation must complete!
console.error(`Compensation failed for ${step.name}:`, compensationError);
await this.alertOps(step.name, compensationError);
}
}
}
async runCompensation(step) {
switch (step.name) {
case 'createOrder':
await orderService.cancel(step.data.id);
break;
case 'processPayment':
await paymentService.refund(step.data.paymentId);
break;
case 'reserveInventory':
await inventoryService.release(step.data.reservationId);
break;
}
}
}
// Usage
const saga = new CheckoutSaga(uuid(), { userId, productId, total });
const result = await saga.execute();
As a Technical Lead, I've learned: Orchestration wins for complex flows. It's easier to debug, monitor, and reason about. The orchestrator is the single source of truth for what's happening.
Trade-offs vs Choreography:
| Choreography | Orchestration | |
|---|---|---|
| Coupling | Loose | Tighter |
| Debugging | Nightmare π | Easier π |
| Visibility | Scattered | Centralized |
| Single point of failure | No | Yes (orchestrator) |
| Adding new steps | Complex | Straightforward |
The Compensating Transaction Trap πͺ€
Here's the thing nobody tells you about compensation: it can fail too.
// This can happen:
async function compensate() {
await orderService.cancel(orderId); // β
Works
await paymentService.refund(paymentId); // π₯ Stripe is down!
await inventoryService.release(reservationId); // Never reached
}
// Now you have: Cancelled order + No refund + Locked inventory
// Congrats, you have a new problem!
The fix β Idempotent compensations with retry:
// saga/sagaStore.js β Persist saga state!
async function saveSagaState(sagaId, state) {
await db.sagas.upsert({
id: sagaId,
state: JSON.stringify(state),
updatedAt: new Date()
});
}
// Retry compensation until it succeeds
async function compensateWithRetry(step, maxRetries = 5) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
await runCompensation(step);
return; // Success!
} catch (error) {
if (attempt === maxRetries) {
// Alert a human! Manual intervention needed.
await alertOpsTeam({
sagaId: this.sagaId,
failedStep: step.name,
error: error.message,
priority: 'P0' // Wake someone up!
});
return;
}
// Exponential backoff
await sleep(Math.pow(2, attempt) * 1000);
}
}
}
When designing our e-commerce backend, we stored every saga state in a database. If the orchestrator crashed mid-saga, it could resume from where it left off. Without this, you're flying blind.
Common Saga Mistakes I've Made π
Mistake #1: Treating Compensation Like a Rollback
A database rollback is instant and total. Compensation is not. The payment was already charged. The email was already sent. You're compensating, not time-traveling.
β Wrong mindset: "Compensation undoes everything perfectly"
β
Right mindset: "Compensation corrects the business state"
// Email was sent? Compensation sends another email saying "sorry, cancelled!"
// Money was charged? Compensation issues a refund (takes 3-5 business days)
Mistake #2: Non-Idempotent Compensations
Network retries mean your compensation might run twice. If it charges a refund twice... you're now paying customers to shop with you. π€
// β Bad - charges refund twice if called twice
async function refund(paymentId, amount) {
await stripe.refunds.create({ payment_intent: paymentId, amount });
}
// β
Good - idempotent, safe to call multiple times
async function refund(paymentId, amount, idempotencyKey) {
await stripe.refunds.create({
payment_intent: paymentId,
amount,
idempotency_key: idempotencyKey // Stripe ignores duplicates!
});
}
Mistake #3: Forgetting Saga Timeout
What if a step hangs forever? The saga just... waits. Forever. Locking resources indefinitely.
// Always set timeouts!
const payment = await Promise.race([
paymentService.charge(order),
timeout(30_000, `Payment timed out for saga ${this.sagaId}`)
]);
When to Use the Saga Pattern π€
Use Saga when:
- You have multiple services that need to work together atomically
- You're doing e-commerce checkout, booking systems, bank transfers
- You need "eventually consistent" transactions across services
- You can't use a distributed database (2-phase commit is slow and fragile)
Don't use Saga when:
- You have a monolith with one database (just use a DB transaction!)
- Your operations are read-only (no writes = no saga needed)
- The business can tolerate "partial success" (some use cases can!)
- You're just overcomplicating a simple CRUD app
As a Technical Lead, I've learned: Sagas add complexity. Don't use them because they're cool β use them because your business logic genuinely requires cross-service atomicity.
The Architecture Diagram That Finally Made It Click π‘
Customer clicks "Place Order"
β
βΌ
ββββββββββββββββββββββ
β Checkout Saga ββββββ Persisted to DB (survive crashes!)
β Status: RUNNING β
ββββββββββ¬ββββββββββββ
β
ββββββΌβββββ
β Step 1 β Create Order (PENDING)
β β
Done β
ββββββ¬βββββ
β
ββββββΌβββββ
β Step 2 β Charge Payment
β β
Done β paymentId: pay_abc123
ββββββ¬βββββ
β
ββββββΌβββββ
β Step 3 β Reserve Inventory
β π₯ FAIL β "Out of stock!"
ββββββ¬βββββ
β
ββββββΌβββββββββββββββββββββ
β COMPENSATION PHASE β
β Refund pay_abc123 β
β
β Cancel Order β
β
β Status: COMPENSATED β
βββββββββββββββββββββββββββ
β
βΌ
Customer gets: "Sorry, out of stock. Refund in 3-5 days." π§
(Not a ghost charge!) π
TL;DR π
- Distributed transactions don't exist β each service commits locally
- Saga Pattern = sequence of steps + compensating transactions for rollback
- Choreography = services react to events (loosely coupled, hard to debug)
- Orchestration = central coordinator (easier to reason about, harder to scale)
- Always persist saga state β crashes happen, your saga should survive them
- Idempotent compensations β retries WILL happen, write for it
- Set timeouts β don't let a hanging service lock your resources forever
The night our inventory service crashed and charged 23 customers for nothing? That was 3 years ago. We built the Saga Pattern into our checkout the next week. We've had service crashes since β many times. Not one customer has been incorrectly charged since. π
The Saga Pattern won't make distributed systems simple. Nothing will. But it gives you a structured way to fail gracefully, compensate correctly, and sleep through the night. β
Building distributed systems? Connect with me on LinkedIn β I love swapping war stories about production incidents!
Want to see the full saga implementation? Check out my GitHub for real production patterns!
Now go forth and compensate gracefully! πβ¨
P.S. The hardest part of the Saga Pattern isn't the code β it's explaining to your product manager why the refund takes 3-5 business days but the charge was instant. Some things can't be compensated for with code. π³π