0x55aa
← Back to Blog

The Saga Pattern: Distributed Transactions Without Crying Yourself to Sleep 🎭😭

β€’10 min read

The Saga Pattern: Distributed Transactions Without Crying Yourself to Sleep 🎭😭

True story: It was 11 PM on a Tuesday when our Slack blew up.

Customer support: "Users are being charged but orders aren't showing up!"

Me: "That's... not supposed to happen."

DevOps: "Payment service is up. Inventory service... crashed 40 minutes ago."

Me: 😱

We had successfully charged 23 customers for orders that existed nowhere. Payment had committed. Inventory had failed. And we had absolutely no rollback plan. Welcome to the distributed transaction hell β€” the problem the Saga Pattern exists to solve.

The Problem: Distributed Transactions Don't Exist 🀦

In a monolith with a single database, life is beautiful:

BEGIN TRANSACTION;
  UPDATE accounts SET balance = balance - 100 WHERE user_id = 42;
  INSERT INTO orders (user_id, total) VALUES (42, 100);
  UPDATE inventory SET quantity = quantity - 1 WHERE product_id = 7;
COMMIT;  -- All or nothing! βœ…
-- (or)
ROLLBACK; -- Like it never happened! βœ…

One database. One transaction. Either everything works or nothing does. Pure magic.

In microservices, you get this:

User clicks "Buy Now"
     β”‚
     β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Order Svc  β”‚ ← Creates order (committed to OrderDB)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Payment Svc β”‚ ← Charges card (committed to PaymentDB)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Inventory Svcβ”‚ ← CRASHES πŸ’₯ (InventoryDB unreachable)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Result: Order created βœ…, Payment charged βœ…, Inventory NOT updated ❌

There's no magic ROLLBACK across 3 independent databases. You need a strategy. That strategy is the Saga.

What Even Is a Saga? πŸ“–

A Saga is a sequence of local transactions where each step has a compensating transaction β€” basically an "undo" action that reverses the previous step if something downstream fails.

Forward steps (happy path):
1. Create Order      β†’ Compensate: Cancel Order
2. Charge Payment    β†’ Compensate: Refund Payment
3. Reserve Inventory β†’ Compensate: Release Inventory
4. Notify Customer   β†’ Compensate: Send Cancellation Email

If step 3 fails:
β†’ Run compensate(step 2): Refund Payment
β†’ Run compensate(step 1): Cancel Order
β†’ Customer gets a refund email instead of a ghost order βœ…

Think of it like a Vegas heist plan. Everyone has their role, and if the vault alarm goes off mid-job, everyone knows the abort sequence. No one stands around saying "well, I already picked the lock, so..." 🎰

Implementation #1: Choreography-Based Saga (Event-Driven) 🎼

Each service does its job and publishes an event. Other services listen and react. No one's in charge β€” it's a jazz band, not an orchestra.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Order Svc  │──publishes──▢ order.created
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                    β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Payment Svc β”‚ listens to order.created
β”‚             │──publishes──▢ payment.completed (or payment.failed)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                    β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Inventory Svcβ”‚ listens to payment.completed
β”‚             │──publishes──▢ inventory.reserved (or inventory.failed)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

When designing our e-commerce backend, I started with choreography because it looked simpler. It wasn't. Here's what the Node.js code actually looks like:

// order-service/src/handlers/createOrder.js
async function createOrder(orderData) {
    // Step 1: Create the order locally
    const order = await db.orders.create({
        userId: orderData.userId,
        productId: orderData.productId,
        status: 'PENDING',  // NOT 'CONFIRMED' yet!
        total: orderData.total
    });

    // Publish event β€” don't wait for payment!
    await eventBus.publish('order.created', {
        orderId: order.id,
        userId: order.userId,
        total: order.total,
        productId: order.productId
    });

    return order;
}

// Compensating transaction (called if payment fails)
async function cancelOrder(orderId) {
    await db.orders.update(orderId, { status: 'CANCELLED' });
    await eventBus.publish('order.cancelled', { orderId });
}
// payment-service/src/handlers/paymentHandler.js
eventBus.subscribe('order.created', async (event) => {
    try {
        const payment = await chargeCustomer({
            userId: event.userId,
            amount: event.total,
            orderId: event.orderId
        });

        await eventBus.publish('payment.completed', {
            orderId: event.orderId,
            paymentId: payment.id
        });
    } catch (error) {
        // Compensate! Tell Order Service to cancel.
        await eventBus.publish('payment.failed', {
            orderId: event.orderId,
            reason: error.message
        });
    }
});

// Order service listens for payment.failed
eventBus.subscribe('payment.failed', async (event) => {
    await cancelOrder(event.orderId);  // Compensating transaction!
    await notifyUser(event.orderId, 'Payment failed, order cancelled');
});

The catch with choreography:

  • βœ… Services are loosely coupled
  • βœ… No single point of failure
  • βœ… Easy to add new services
  • ⚠️ Hard to visualize the full flow
  • ⚠️ Debugging is a detective game across 5 services' logs
  • ⚠️ Cyclic dependencies creep in silently
  • ⚠️ "Which service handles this edge case?" β€” nobody knows

A scalability lesson that cost us: We had a bug where inventory.failed triggered payment.refund, which published payment.refunded, which triggered order.cancelled, which published order.cancelled, which triggered... a loop. Events firing forever. At 3 AM. Fun times. πŸ”„πŸ’Έ

Implementation #2: Orchestration-Based Saga (Central Coordinator) 🎯

One service β€” the Saga Orchestrator β€” tells everyone what to do. Think of it as a project manager: it knows every step, decides who does what, and handles rollbacks when things go sideways.

          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚   Order Orchestrator β”‚
          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β–Ό           β–Ό           β–Ό
   [Create Order] [Charge $] [Reserve Stock]
         β”‚           β”‚           β”‚
         β”‚     if fails:         β”‚
         β”‚    ◄────────          β”‚
         β”‚   Refund $            β”‚
         ◄──────────────────────
      Cancel Order

Here's what the orchestrator looks like in Node.js:

// saga/checkoutSaga.js
class CheckoutSaga {
    constructor(sagaId, orderData) {
        this.sagaId = sagaId;
        this.orderData = orderData;
        this.steps = [];
        this.currentStep = 0;
    }

    async execute() {
        try {
            // Step 1: Create Order
            const order = await this.createOrder();
            this.steps.push({ name: 'createOrder', data: order });

            // Step 2: Process Payment
            const payment = await this.processPayment(order);
            this.steps.push({ name: 'processPayment', data: payment });

            // Step 3: Reserve Inventory
            const reservation = await this.reserveInventory(order);
            this.steps.push({ name: 'reserveInventory', data: reservation });

            // Step 4: Confirm Order (all good!)
            await this.confirmOrder(order.id);

            return { success: true, orderId: order.id };

        } catch (error) {
            console.error(`Saga ${this.sagaId} failed at step: ${error.step}`);
            await this.compensate();  // Run undo operations
            throw error;
        }
    }

    async compensate() {
        // Run compensating transactions in REVERSE order
        for (const step of this.steps.reverse()) {
            try {
                await this.runCompensation(step);
            } catch (compensationError) {
                // Log but continue - compensation must complete!
                console.error(`Compensation failed for ${step.name}:`, compensationError);
                await this.alertOps(step.name, compensationError);
            }
        }
    }

    async runCompensation(step) {
        switch (step.name) {
            case 'createOrder':
                await orderService.cancel(step.data.id);
                break;
            case 'processPayment':
                await paymentService.refund(step.data.paymentId);
                break;
            case 'reserveInventory':
                await inventoryService.release(step.data.reservationId);
                break;
        }
    }
}

// Usage
const saga = new CheckoutSaga(uuid(), { userId, productId, total });
const result = await saga.execute();

As a Technical Lead, I've learned: Orchestration wins for complex flows. It's easier to debug, monitor, and reason about. The orchestrator is the single source of truth for what's happening.

Trade-offs vs Choreography:

Choreography Orchestration
Coupling Loose Tighter
Debugging Nightmare 😭 Easier 😌
Visibility Scattered Centralized
Single point of failure No Yes (orchestrator)
Adding new steps Complex Straightforward

The Compensating Transaction Trap πŸͺ€

Here's the thing nobody tells you about compensation: it can fail too.

// This can happen:
async function compensate() {
    await orderService.cancel(orderId);    // βœ… Works
    await paymentService.refund(paymentId); // πŸ’₯ Stripe is down!
    await inventoryService.release(reservationId); // Never reached
}

// Now you have: Cancelled order + No refund + Locked inventory
// Congrats, you have a new problem!

The fix β€” Idempotent compensations with retry:

// saga/sagaStore.js β€” Persist saga state!
async function saveSagaState(sagaId, state) {
    await db.sagas.upsert({
        id: sagaId,
        state: JSON.stringify(state),
        updatedAt: new Date()
    });
}

// Retry compensation until it succeeds
async function compensateWithRetry(step, maxRetries = 5) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            await runCompensation(step);
            return;  // Success!
        } catch (error) {
            if (attempt === maxRetries) {
                // Alert a human! Manual intervention needed.
                await alertOpsTeam({
                    sagaId: this.sagaId,
                    failedStep: step.name,
                    error: error.message,
                    priority: 'P0'  // Wake someone up!
                });
                return;
            }
            // Exponential backoff
            await sleep(Math.pow(2, attempt) * 1000);
        }
    }
}

When designing our e-commerce backend, we stored every saga state in a database. If the orchestrator crashed mid-saga, it could resume from where it left off. Without this, you're flying blind.

Common Saga Mistakes I've Made πŸ˜…

Mistake #1: Treating Compensation Like a Rollback

A database rollback is instant and total. Compensation is not. The payment was already charged. The email was already sent. You're compensating, not time-traveling.

❌ Wrong mindset: "Compensation undoes everything perfectly"
βœ… Right mindset: "Compensation corrects the business state"

// Email was sent? Compensation sends another email saying "sorry, cancelled!"
// Money was charged? Compensation issues a refund (takes 3-5 business days)

Mistake #2: Non-Idempotent Compensations

Network retries mean your compensation might run twice. If it charges a refund twice... you're now paying customers to shop with you. πŸ€‘

// ❌ Bad - charges refund twice if called twice
async function refund(paymentId, amount) {
    await stripe.refunds.create({ payment_intent: paymentId, amount });
}

// βœ… Good - idempotent, safe to call multiple times
async function refund(paymentId, amount, idempotencyKey) {
    await stripe.refunds.create({
        payment_intent: paymentId,
        amount,
        idempotency_key: idempotencyKey  // Stripe ignores duplicates!
    });
}

Mistake #3: Forgetting Saga Timeout

What if a step hangs forever? The saga just... waits. Forever. Locking resources indefinitely.

// Always set timeouts!
const payment = await Promise.race([
    paymentService.charge(order),
    timeout(30_000, `Payment timed out for saga ${this.sagaId}`)
]);

When to Use the Saga Pattern πŸ€”

Use Saga when:

  • You have multiple services that need to work together atomically
  • You're doing e-commerce checkout, booking systems, bank transfers
  • You need "eventually consistent" transactions across services
  • You can't use a distributed database (2-phase commit is slow and fragile)

Don't use Saga when:

  • You have a monolith with one database (just use a DB transaction!)
  • Your operations are read-only (no writes = no saga needed)
  • The business can tolerate "partial success" (some use cases can!)
  • You're just overcomplicating a simple CRUD app

As a Technical Lead, I've learned: Sagas add complexity. Don't use them because they're cool β€” use them because your business logic genuinely requires cross-service atomicity.

The Architecture Diagram That Finally Made It Click πŸ’‘

Customer clicks "Place Order"
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Checkout Saga     │◄──── Persisted to DB (survive crashes!)
β”‚  Status: RUNNING   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
    β”‚ Step 1  β”‚ Create Order (PENDING)
    β”‚ βœ… Done β”‚
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
    β”‚ Step 2  β”‚ Charge Payment
    β”‚ βœ… Done β”‚ paymentId: pay_abc123
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”
    β”‚ Step 3  β”‚ Reserve Inventory
    β”‚ πŸ’₯ FAIL β”‚ "Out of stock!"
    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  COMPENSATION PHASE     β”‚
    β”‚  Refund pay_abc123  βœ…  β”‚
    β”‚  Cancel Order       βœ…  β”‚
    β”‚  Status: COMPENSATED    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
Customer gets: "Sorry, out of stock. Refund in 3-5 days." πŸ“§
(Not a ghost charge!) πŸŽ‰

TL;DR 🏁

  • Distributed transactions don't exist β€” each service commits locally
  • Saga Pattern = sequence of steps + compensating transactions for rollback
  • Choreography = services react to events (loosely coupled, hard to debug)
  • Orchestration = central coordinator (easier to reason about, harder to scale)
  • Always persist saga state β€” crashes happen, your saga should survive them
  • Idempotent compensations β€” retries WILL happen, write for it
  • Set timeouts β€” don't let a hanging service lock your resources forever

The night our inventory service crashed and charged 23 customers for nothing? That was 3 years ago. We built the Saga Pattern into our checkout the next week. We've had service crashes since β€” many times. Not one customer has been incorrectly charged since. πŸŽ‰

The Saga Pattern won't make distributed systems simple. Nothing will. But it gives you a structured way to fail gracefully, compensate correctly, and sleep through the night. β˜•


Building distributed systems? Connect with me on LinkedIn β€” I love swapping war stories about production incidents!

Want to see the full saga implementation? Check out my GitHub for real production patterns!

Now go forth and compensate gracefully! 🎭✨


P.S. The hardest part of the Saga Pattern isn't the code β€” it's explaining to your product manager why the refund takes 3-5 business days but the charge was instant. Some things can't be compensated for with code. πŸ’³πŸ˜