Eventual Consistency: Why Your Distributed System is Always a Little Bit Wrong (And That's Okay) π
Eventual Consistency: Why Your Distributed System is Always a Little Bit Wrong (And That's Okay) π
True story: I spent three days debugging why customers could add items to their cart, see them in the cart, then hit checkout and get "item no longer available." The inventory was updated. The cart was updated. Everything was "correct." They just weren't correct at the same time.
Welcome to distributed systems, where the data is made up and the timestamps don't matter.
This is the post I needed before I confidently told my CTO our e-commerce backend was "strongly consistent." Spoiler: it was not. It was eventually consistent, occasionally inconsistent, and once, magnificently wrong for about 14 seconds during a Black Friday sale.
The CAP Theorem: A Three-Way Love Triangle You Can't Win π
Eric Brewer dropped this bomb in 2000: in a distributed system, you can only guarantee two of these three properties at once.
The CAP Triangle (Pick 2):
Consistency
(Every read returns
the latest write)
β³
/|\
/ | \
/ | \
/ | \
/ | \
β½ββββββββββ½
Availability Partition
(Every request Tolerance
gets a response) (System works
despite network
failures)
The bad news: Network partitions WILL happen.
The good news: At least you get to pick which way to suffer.
The hidden truth nobody tells you: In a real distributed system running across multiple servers, you always have partition tolerance. Networks fail. Packets drop. An AWS availability zone has a bad day. So the real choice isn't CAP β it's CP vs AP.
- CP (Consistency + Partition Tolerance): When a partition happens, refuse to serve stale data. Go offline or return an error. Example: your banking transaction system.
- AP (Availability + Partition Tolerance): When a partition happens, keep serving β but data might be stale. Fix it later. Example: your shopping cart.
When designing our e-commerce backend, this wasn't a theoretical question. It was: "Do we want our checkout to go down during a network blip, or do we want to occasionally oversell by 3 units?"
We picked availability. We picked eventual consistency. And then we had to actually engineer that choice properly β which is where most teams drop the ball.
Eventual Consistency Isn't "We Hope It Gets Fixed Eventually" π
This is the part where engineers misunderstand the concept and then have catastrophic incidents.
Eventual consistency means: given no new updates, all replicas will converge to the same value. It doesn't mean "it'll probably be fine." It means you need to design explicit mechanisms to make convergence happen correctly.
Strong Consistency (CP):
ββββββββββββ Write "stock=5" ββββββββββββ
β Client ββββββββββββββββββββββΊβ Primary β
ββββββββββββ β DB β
β ββββββ¬ββββββ
β Read stock β Sync replication
ββββββββββββββββββββββββββββββββββ€ (wait for ACK)
β Returns 5 β
β
ββββββ½ββββββ
β Replica β
β DB β
ββββββββββββ
Every read sees the latest write.
Cost: Higher latency. Availability drops if replica is slow.
Eventual Consistency (AP):
ββββββββββββ Write "stock=5" ββββββββββββ
β Client ββββββββββββββββββββββΊβ Primary β
ββββββββββββ Gets ACK β
β DB β
β ββββββ¬ββββββ
β Read stock immediately β Async replication
ββββββββββββββββββββ β (happens later)
β Returns 10 β ββββββ½ββββββ
β (stale from replica!) β Replica β
β β DB: 10 β
β Read stock 200ms later β (stale!) β
βββββββββββββββββββββββββββββ€ DB: 5 β
β
ββββββββββββ
Some reads see old data. But system stays available.
Cost: Complexity. Requires handling stale reads explicitly.
A scalability lesson that cost us: We assumed "async replication means maybe 50ms lag." During a traffic spike with our Aurora read replicas under load, replication lag hit 8 seconds. Users were seeing inventory counts from 8 seconds ago. We oversold 47 units of a limited-edition product. Customer support had a very bad Tuesday.
The Consistency Models No One Explains Clearly π
Between "perfectly consistent" and "total chaos," there's a whole spectrum. Here's what actually matters in production:
The Consistency Spectrum:
STRONGEST WEAKEST
β β
βΌ βΌ
Linearizable β Sequential β Causal β Read-Your-Writes β Eventual
What each means in plain English:
Linearizable: "Every operation appears to happen instantly
and atomically. Reality matches one global clock."
β Zookeeper, etcd leader reads
β Cost: SLOW. Every write waits for quorum.
Sequential: "All nodes see operations in same order,
but maybe with some delay."
β Kafka partition ordering
β Cost: Still slow. Still needs coordination.
Causal: "If A causes B, everyone sees A before B.
Unrelated things can be in any order."
β DynamoDB with condition writes
β Sweet spot for many use cases β
Read-Your-Writes: "YOU always see YOUR latest write.
Others might not yet."
β What most users actually expect
β Achievable with sticky sessions or
primary reads for the writer
Eventual: "It'll all match up... eventually.
No promises about when."
β DNS, S3 after PUT
β Fine for most non-critical reads
As a Technical Lead, I've learned: Most systems don't need linearizability. They need read-your-writes consistency β the user sees their own changes reflected immediately. Serving someone else's latest data? A few hundred milliseconds of stale is usually fine.
Real E-Commerce Patterns That Don't Lie π
Let me show you how we actually handle this in production, not in theory.
Pattern 1: Inventory β Never Oversell, Accept Staleness for Display
The Two-Tier Inventory Approach:
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Product Listing Page β
β β Reads from: Redis cache (stale OK) β
β β Shows: "In Stock" / "Low Stock" / "Out" β
β β Consistency: Eventual (5-min cache TTL) β
β β Why: 10k concurrent product page views β
β can't all hammer the DB β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Checkout / Add to Cart β
β β Reads from: Primary DB (strong consistency) β
β β Uses: Optimistic locking / SELECT FOR UPDATE β
β β Consistency: Linearizable for that 1 item β
β β Why: Money is involved. No overselling. β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
// Display inventory (eventual consistency - fine!)
public function getProductStock(int $productId): int
{
return Cache::remember("stock:{$productId}", 300, function () use ($productId) {
return Product::find($productId)->stock;
});
}
// Reserve inventory at checkout (strong consistency - required!)
public function reserveStock(int $productId, int $quantity): bool
{
return DB::transaction(function () use ($productId, $quantity) {
// Pessimistic lock: only ONE checkout can run this at a time
$product = Product::where('id', $productId)
->lockForUpdate() // SELECT ... FOR UPDATE
->first();
if ($product->stock < $quantity) {
return false; // Bail! Don't oversell.
}
$product->decrement('stock', $quantity);
return true;
});
}
The insight: You don't need strong consistency everywhere. You need it at the exact moment money moves. Everything else can be eventual.
Pattern 2: The "Read Your Writes" Problem for User Profiles
The Problem:
User updates their email β gets redirected β sees old email π±
Why it happens:
Write goes to: PRIMARY DB ββββ user sees "[email protected]"
Read comes from: REPLICA DB βββ user sees "[email protected]"
(replication lag = 200ms of confusion)
The Fix: Route the writer to primary for 30 seconds
// Middleware: After any write, set a flag
// Route subsequent reads to primary for this user session
class RouteToReplicaMiddleware
{
public function handle(Request $request, Closure $next)
{
// If user just made a write, use primary for 30s
if (Cache::has("primary_session:{$request->user()?->id}")) {
DB::unprepared('SET SESSION aurora_readonly = 0'); // Force primary
}
return $next($request);
}
}
// After any write operation:
Cache::put("primary_session:{$userId}", true, 30); // 30 second window
This is how you achieve "read-your-writes" without strong consistency everywhere. Your user sees their own changes. Everyone else might see 200ms stale. Nobody cares.
Pattern 3: The Saga Pattern for Distributed Transactions (When You Have No Choice)
The Nightmare Scenario: Checkout spans 3 services
Order Service Payment Service Inventory Service
β β β
β Create Order β β
ββββββββββββββββββΊβ β
β β Charge Card β
β ββββββββββββββββββββΊβ
β β β Reserve Stock
β β ββββββββββββββββ
β β β
β β β Payment fails β
β β β
What do we do with the already-reserved stock?? π€―
Without distributed transactions (which are slow and fragile), you use compensating transactions:
The Saga Compensation Flow:
1. ReserveStock β SUCCESS
2. ChargeCard β FAILED β
Trigger compensation:
3. ReleaseStock β compensating transaction
Each step has a matching "undo" step.
If anything fails, walk backwards.
class CheckoutSaga
{
private array $compensations = [];
public function execute(Order $order): bool
{
try {
// Step 1: Reserve inventory
$this->inventoryService->reserve($order);
$this->compensations[] = fn() => $this->inventoryService->release($order);
// Step 2: Charge payment
$this->paymentService->charge($order);
$this->compensations[] = fn() => $this->paymentService->refund($order);
// Step 3: Confirm order
$this->orderService->confirm($order);
return true;
} catch (\Exception $e) {
// Walk backwards through compensating transactions
foreach (array_reverse($this->compensations) as $compensate) {
rescue(fn() => $compensate()); // Best effort
}
return false;
}
}
}
The key insight: Sagas accept eventual consistency. Between steps 1 and 3, the system is temporarily inconsistent. That's the price you pay for not having a distributed lock across 3 services.
Common Mistakes That Will Age You Quickly π§
Mistake #1: Eventual Consistency Without Monitoring
// This silently builds up replication lag
// until 8 seconds later you're overselling on Black Friday
// What you actually need:
DB::listen(function ($query) {
if ($query->connectionName === 'replica') {
$lag = DB::connection('primary')
->selectOne('SELECT TIMESTAMPDIFF(SECOND,
MIN(last_update), NOW()) as lag
FROM information_schema.replica_status');
if ($lag->lag > 2) {
// Alert! Route reads to primary temporarily
Log::warning("Replica lag: {$lag->lag}s - routing to primary");
}
}
});
Mistake #2: Strong Consistency Where You Don't Need It
// Bad: Reading product descriptions from primary (why??)
$description = DB::connection('primary')
->table('products')
->where('id', $id)
->value('description'); // This is a marketing copy change
// 2s of stale is FINE
// Good: Save primary reads for writes and financial ops
$description = Cache::remember("product_desc:{$id}", 3600, function () use ($id) {
return Product::find($id)->description;
});
Mistake #3: Ignoring the "Eventually" in Eventual
// You set a 24-hour cache TTL on product prices
// Marketing runs a flash sale that changes prices instantly
// Users see old prices for up to 24 hours
// Legal has questions π
// You need cache invalidation on writes:
public function updatePrice(Product $product, float $price): void
{
$product->update(['price' => $price]);
Cache::forget("product_price:{$product->id}"); // Invalidate immediately
Cache::tags(['products'])->flush(); // Or flush by tag
}
When to Demand Strong Consistency πͺ
Don't let "eventual consistency is fine" become your excuse to never think about correctness. These absolutely need strong consistency:
REQUIRE Strong Consistency:
βββ Financial transactions (charge card, transfer money)
βββ Inventory reservation at checkout (overselling = bad)
βββ Seat reservations (double-booking flights = very bad)
βββ Authentication tokens (stale "invalid" token lets hackers in)
βββ Any time two users compete for the same resource
EVENTUAL IS FINE:
βββ Product catalog browsing
βββ "Customers also bought" recommendations
βββ Social activity feeds
βββ Analytics dashboards
βββ Search index updates
βββ Email/notification delivery (within reasonable time)
When designing our e-commerce backend, the rule of thumb became: does money move, or does a user expect their own action reflected? If yes β primary read or strong lock. Everything else β cache and replicas.
The Decision Framework πΊοΈ
Should this operation be strongly consistent?
ββββββββββββββββββββββββββββββββββ
β Does this operation involve β
β money, stock, or seat limits? β
ββββββββββββββββ¬ββββββββββββββββββ
β
βββββββββββ΄βββββββββ
β YES β NO
βΌ βΌ
ββββββββββββββββββββ βββββββββββββββββββββββββββββββ
β Use pessimistic β β Is this the user reading β
β locking / saga β β their own recent write? β
β Strong: YES β
β ββββββββββββββββ¬βββββββββββββββ
ββββββββββββββββββββ β
ββββββββββββ΄βββββββββββ
β YES β NO
βΌ βΌ
ββββββββββββββββββββ βββββββββββββββββββ
β Route to primary β β Eventual is fine β
β for 30 seconds β β Use replica/cacheβ
β Read-your-writes β β Eventual: YES β
β
ββββββββββββββββββββ βββββββββββββββββββ
TL;DR β‘
- CAP Theorem: In distributed systems, pick CP (consistent but may go offline) or AP (available but may serve stale data). Network partitions are inevitable.
- Eventual consistency β "it'll be fine." It means you need explicit convergence mechanisms.
- Most user-facing reads: Eventual is fine. Stale data by 200ms is unnoticeable.
- Financial operations / inventory checkout: Strong consistency. No exceptions.
- Read-your-writes: The consistency model users actually expect. Route writers to primary for a short window.
- Sagas for distributed transactions: Accept temporary inconsistency, compensate on failure.
- Monitor replication lag. Seriously. Set an alert. I learned this the hard way during a Black Friday sale.
The goal isn't perfect consistency everywhere. It's knowing exactly where you need it and engineering the rest to converge reliably.
Built a distributed system that was more "eventual" than "consistent"? I'd love to hear the war story β find me on LinkedIn.
Want to see how I handle saga patterns in production Laravel + AWS? Real code examples on my GitHub.
Now go audit your read replicas. Someone's replication lag is quietly building up right now. π
P.S. The two hardest problems in distributed systems are: 1) Cache invalidation, 2) Naming things, and 3) Exactly-once delivery. Yes, that's three. Counting is also hard in distributed systems.
P.P.S. If a vendor tells you their distributed database is "fully ACID consistent with zero latency," they are lying to you. Physics doesn't allow it. The speed of light is not configurable. π