Distributed Locks: Stop Two Servers Stepping on Each Other's Feet
True story: We had a promo where customers got a "first-time buyer" discount. One Monday morning I got a Slack message: "Hey, why did this customer get the discount TWICE?"
They'd clicked "Apply Discount" really fast. Two Lambda functions spun up simultaneously. Both checked "has this user used the discount?", both answered "No", and both applied it.
Two Lambda functions walked into the same bar and both ordered a round. The customer paid once, got two rounds. We lost $40.
And that was just ONE customer. We had 12,000 orders that day.
Welcome to the world of distributed locking, where you teach multiple servers to take turns!
What's the Problem, Exactly?
In a single-server world, concurrency is handled with mutexes. One thread locks a resource, does its thing, unlocks. Simple.
In a distributed world (multiple servers, multiple Lambda instances, multiple workers), you can't use a regular mutex. Server A and Server B don't share memory. They don't even know each other exists.
This is called a race condition at scale:
Timeline:
t=0ms Server A: checks if order #123 is processed → "No"
t=1ms Server B: checks if order #123 is processed → "No"
t=2ms Server A: starts processing order #123
t=3ms Server B: starts processing order #123
t=100ms Server A: charges card $99.99
t=101ms Server B: charges card $99.99
t=200ms Customer: "why did you charge me twice?!"
The fix? A distributed lock. Before any server does "the dangerous thing", it must acquire a lock. If it can't get the lock, it backs off.
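To make the timeline concrete, here's a minimal single-process sketch of the same check-then-act race. Everything in it (the `processed` set, the `charges` array, the two helper functions) is a hypothetical stand-in for a real database and payment call. The point is only the ordering: both "servers" read the flag before either one writes it.

```javascript
// Hypothetical in-memory stand-ins for the orders table and the payment log.
const processed = new Set();
const charges = [];

// Step 1 of the unsafe flow: read the "already processed?" flag.
function checkProcessed(orderId) {
  return processed.has(orderId);
}

// Step 2: act on whatever was read earlier, even if it is stale by now.
function chargeIfUnprocessed(server, orderId, sawProcessed) {
  if (sawProcessed) return;
  charges.push({ server, orderId, amount: 99.99 });
  processed.add(orderId);
}

// Interleave the steps the way the timeline does: both checks first, then both charges.
const seenByA = checkProcessed(123);
const seenByB = checkProcessed(123);
chargeIfUnprocessed('A', 123, seenByA);
chargeIfUnprocessed('B', 123, seenByB);

console.log(charges.length); // 2
```

Nothing here is specific to Lambda or Redis. Any gap between "check" and "act" gives a second worker room to slip in, and closing that gap is the lock's whole job.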
The Redis Solution (My Go-To)
Redis's SET command accepts two options, NX (set only if the key does not exist) and PX (expire after N milliseconds), and applies them in one atomic step. SET ... NX PX is the foundation of most distributed locks.
The naive version:
const redis = require('redis');
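const client = redis.createClient();
// node-redis v4+ clients must connect before use (run this inside an async context)
await client.connect();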
async function acquireLock(resourceKey, ttlMs = 30000) {
const lockKey = `lock:${resourceKey}`;
const lockValue = `${process.pid}-${Date.now()}`; // Identifies this holder (naive: can collide across hosts)
// SET key value NX PX milliseconds
// NX = Only set if key doesn't exist
// PX = Expire after N milliseconds
const result = await client.set(lockKey, lockValue, {
NX: true,
PX: ttlMs
});
return result === 'OK' ? lockValue : null;
}
async function releaseLock(resourceKey, lockValue) {
const lockKey = `lock:${resourceKey}`;
// CRITICAL: Only release YOUR lock, not someone else's!
const script = `
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
`;
// node-redis v4 takes keys and arguments as an options object
return await client.eval(script, { keys: [lockKey], arguments: [lockValue] });
}
// Usage
async function processOrder(orderId) {
const lockValue = await acquireLock(`order:${orderId}`);
if (!lockValue) {
throw new Error('Another server is already processing this order!');
}
try {
// Do the dangerous thing
await chargeCard(orderId);
await updateInventory(orderId);
await markOrderComplete(orderId);
} finally {
// ALWAYS release the lock
await releaseLock(`order:${orderId}`, lockValue);
}
}
Why the Lua script for release?
If you do a simple DEL lockKey, you might accidentally delete ANOTHER server's lock! The Lua script is atomic: check AND delete happen in one operation. No race condition possible.
A Scalability Lesson That Cost Us
When designing our e-commerce backend, I initially skipped the lockValue check in the release. I just deleted the key.
Here's what happened:
t=0ms Server A: acquires lock, starts processing
t=29s Server A: lock is about to expire (29s elapsed, TTL=30s)
t=30s Lock expires automatically (Server A is still running!)
t=30s Server B: acquires the now-expired lock
t=31s Server A: FINISHES work, calls releaseLock()
t=31s Server A: deletes Server B's lock!
t=32s Server C: acquires the lock (Server B still running!)
t=32s Server B + C: both processing the same order
t=33s Double charge. Again.
The lockValue uniqueness check prevents Server A from releasing Server B's lock. If the lock expired and someone else grabbed it, Server A's release call is a no-op.
Lesson: Your lock TTL must be longer than your worst-case operation time. And you should add fencing tokens for truly critical operations (more on that in a sec).
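Here's a rough way to replay that timeline without a Redis server. `FakeLockStore` is a hypothetical in-memory stand-in with a manual clock. It only imitates the two operations used above (SET with NX and PX, and the compare-and-delete Lua script), so treat it as a sketch of the semantics rather than of Redis itself.

```javascript
// Hypothetical in-memory model of the two lock operations, with a manual clock.
class FakeLockStore {
  constructor() {
    this.now = 0; // simulated time in ms
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  advance(ms) {
    this.now += ms;
  }

  // Models: SET key value NX PX ttlMs
  setNxPx(key, value, ttlMs) {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now) return null; // still held
    this.entries.set(key, { value, expiresAt: this.now + ttlMs });
    return 'OK';
  }

  // Models the Lua script: delete only if the stored value is ours.
  releaseIfOwner(key, value) {
    const entry = this.entries.get(key);
    if (!entry || entry.expiresAt <= this.now || entry.value !== value) return 0;
    this.entries.delete(key);
    return 1;
  }
}

const lockStore = new FakeLockStore();
const key = 'lock:order:123';

const a = lockStore.setNxPx(key, 'server-a', 30000); // t=0s: A acquires
lockStore.advance(30000); // t=30s: A's lock expires while A is still working
const b = lockStore.setNxPx(key, 'server-b', 30000); // B acquires
lockStore.advance(1000); // t=31s: A finishes and tries to release
const releasedByA = lockStore.releaseIfOwner(key, 'server-a');
const c = lockStore.setNxPx(key, 'server-c', 30000); // C is still locked out

console.log(a, b, releasedByA, c); // OK OK 0 null
```

A release that returns 0 is also a useful signal in real code: it means the lock expired before the work finished, which is worth logging so you notice that the TTL is too short.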
Production-Ready Lock with Retry
Real code needs retry logic:
class DistributedLock {
constructor(redisClient) {
this.redis = redisClient;
}
async acquire(resourceKey, options = {}) {
const {
ttl = 30000, // Lock TTL in ms
retryDelay = 100, // Wait between retries
retryCount = 10, // Max retries
} = options;
const lockKey = `lock:${resourceKey}`;
const lockValue = require('crypto').randomBytes(16).toString('hex');
for (let attempt = 0; attempt <= retryCount; attempt++) {
const result = await this.redis.set(lockKey, lockValue, {
NX: true,
PX: ttl
});
if (result === 'OK') {
return { acquired: true, lockValue, lockKey };
}
if (attempt < retryCount) {
// Jitter: add randomness to avoid thundering herd
const jitter = Math.random() * 50;
await sleep(retryDelay + jitter);
}
}
return { acquired: false };
}
async release(lockKey, lockValue) {
const script = `
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
`;
// node-redis v4 takes keys and arguments as an options object
return await this.redis.eval(script, { keys: [lockKey], arguments: [lockValue] });
}
// Convenience: acquire, run, release
async withLock(resourceKey, fn, options = {}) {
const lock = await this.acquire(resourceKey, options);
if (!lock.acquired) {
throw new Error(`Could not acquire lock for: ${resourceKey}`);
}
try {
return await fn();
} finally {
await this.release(lock.lockKey, lock.lockValue);
}
}
}
const sleep = (ms) => new Promise(r => setTimeout(r, ms));
// Usage - clean and simple!
const lock = new DistributedLock(redisClient);
await lock.withLock(`order:${orderId}`, async () => {
await processPayment(orderId);
await updateInventory(orderId);
});
The jitter is important! Without it, 50 servers all waiting 100ms will ALL retry at the exact same time. That's a thundering herd, and it hammers Redis.
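The delay calculation from the retry loop can be pulled out into a tiny helper, which makes it easy to see (and test) the range it produces. `retryDelayWithJitter` and its injectable `rng` parameter are hypothetical names for this sketch. The defaults mirror the 100ms base and 50ms jitter used in the class above.

```javascript
// Base delay plus a random extra in [0, maxJitter) milliseconds.
// rng is injectable so tests can pin the "random" part.
function retryDelayWithJitter(retryDelay = 100, maxJitter = 50, rng = Math.random) {
  return retryDelay + rng() * maxJitter;
}

// Five waiters no longer wake up at exactly the same millisecond.
const delays = Array.from({ length: 5 }, () => retryDelayWithJitter());
console.log(delays.every((d) => d >= 100 && d < 150)); // true
```

A fixed 50ms spread is the simplest option. If contention stays high, growing the base delay on each attempt (exponential backoff) is a common next step, at the cost of slower acquisition.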
Real-World Use Cases
Use Case #1: Preventing Duplicate Orders
app.post('/api/checkout', async (req, res) => {
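const idempotencyKey = req.headers['idempotency-key'];
if (!idempotencyKey) {
// Without a key, every request would share the lock "checkout:undefined"
return res.status(400).json({ error: 'Idempotency-Key header is required' });
}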
await lock.withLock(`checkout:${idempotencyKey}`, async () => {
// Check if already processed
const existing = await db.orders.findOne({ idempotencyKey });
if (existing) {
return res.json({ orderId: existing.id, status: 'already_processed' });
}
// Safe to process!
const order = await createAndChargeOrder(req.body);
await db.orders.insert({ ...order, idempotencyKey });
res.json({ orderId: order.id });
}, { ttl: 60000 }); // 60 second lock
});
Use Case #2: Rate-Limited Background Jobs
// Only run price sync once, even if multiple workers try
async function syncProductPrices() {
const lock = await distributedLock.acquire('job:price-sync', {
ttl: 300000, // 5 minutes
retryCount: 0 // Don't retry: if locked, skip
});
if (!lock.acquired) {
console.log('Price sync already running, skipping');
return;
}
try {
await fetchAndUpdateAllPrices(); // Takes 3-4 minutes
} finally {
await distributedLock.release(lock.lockKey, lock.lockValue);
}
}
// Runs every minute, but only ONE instance runs at a time
setInterval(syncProductPrices, 60 * 1000);
Use Case #3: Flash Sale Inventory
async function purchaseFlashSaleItem(userId, itemId) {
// Lock PER ITEM: multiple items can be purchased simultaneously
return await lock.withLock(`flash-sale:${itemId}`, async () => {
const item = await db.flashSaleItems.findOne({ id: itemId });
if (item.quantity <= 0) {
throw new Error('SOLD OUT');
}
await db.flashSaleItems.decrement(itemId, 'quantity', 1);
await createOrder(userId, itemId);
return { success: true };
}, { ttl: 5000 }); // 5 second lock: flash sale ops are fast
}
As a Technical Lead, I've learned: Lock granularity matters! Lock flash-sale:item-123, not flash-sale:*. Otherwise you serialize ALL purchases when you only need to serialize purchases of the SAME item.
The Fencing Token Pattern (For the Paranoid)
Here's the uncomfortable truth: Redis locks are not perfect. In edge cases (network partitions, clock skew, garbage collection pauses), two servers CAN hold the lock simultaneously.
For truly critical operations (like billing), add a fencing token:
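// Lock returns a monotonically increasing token
async function acquireLockWithToken(resourceKey) {
const lockValue = await acquireLock(resourceKey);
if (!lockValue) return null;
// Take the token only AFTER the lock is held, so token order matches lock order.
// (Incrementing first lets a slower caller win the lock with an older token.)
const token = await client.incr(`fence:${resourceKey}`); // Always increasing!
return { lockValue, fencingToken: token };
}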
// Your datastore rejects writes with old tokens
async function updateDatabase(resourceKey, data, fencingToken) {
// Only accept if this is a newer token than we've seen
const lastToken = await db.getLastFencingToken(resourceKey);
if (fencingToken <= lastToken) {
throw new Error(`Stale write rejected! Token ${fencingToken} <= ${lastToken}`);
}
await db.update(resourceKey, data, fencingToken);
}
The fencing token is a counter that always goes up. Even if an old lock holder wakes up late and tries to write, the database sees "token 7? We already accepted token 8. REJECTED."
In production, I've learned: Most teams don't need fencing tokens. For payment processing? Seriously consider it.
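The rejection rule is easy to try out in memory. `FencedStore` below is a hypothetical stand-in for a datastore that remembers the highest token it has accepted per resource. In a real database the comparison and the write would have to happen in one atomic statement (a conditional update, for example), which this single-threaded sketch gets for free.

```javascript
// Hypothetical in-memory datastore that rejects writes carrying stale fencing tokens.
class FencedStore {
  constructor() {
    this.lastToken = new Map(); // resourceKey -> highest accepted token
    this.data = new Map(); // resourceKey -> current value
  }

  write(resourceKey, value, fencingToken) {
    const last = this.lastToken.get(resourceKey) ?? 0;
    if (fencingToken <= last) {
      return { accepted: false, lastToken: last }; // stale holder, ignore the write
    }
    this.lastToken.set(resourceKey, fencingToken);
    this.data.set(resourceKey, value);
    return { accepted: true, lastToken: fencingToken };
  }
}

const fenced = new FencedStore();

// The newer lock holder (token 8) writes first...
const fresh = fenced.write('order:123', 'charged by new holder', 8);
// ...then the paused old holder (token 7) wakes up and tries to write.
const stale = fenced.write('order:123', 'charged by old holder', 7);

console.log(fresh.accepted, stale.accepted); // true false
```

The lock narrows the window for conflicting writers, and the token check at the datastore is what actually enforces ordering when the lock fails.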
When NOT to Use Distributed Locks
As a Technical Lead, I've seen distributed locks abused. They're not always the answer:
Don't use locks when:
- Idempotency solves it: Unique constraints in your DB are often better than locks. INSERT ... ON CONFLICT DO NOTHING is simpler and more reliable.
- Optimistic locking works: Add a version column. Update only if the version matches. Retry on conflict. No Redis needed.
- You're locking too broadly: Locking all-orders instead of order-123 kills throughput.
- The operation is read-only: Reads don't need locks (usually). Use caching instead.
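For the optimistic locking option, here's a minimal in-memory sketch of the version check. The `rows` map and `updateIfVersionMatches` are hypothetical. In a real database the same idea is usually a single UPDATE whose WHERE clause includes the expected version, so the check and the write cannot be separated.

```javascript
// Hypothetical table: each row carries a version number.
const rows = new Map([['order-123', { status: 'pending', version: 1 }]]);

// Apply the change only if nobody else has bumped the version since we read the row.
function updateIfVersionMatches(id, expectedVersion, changes) {
  const row = rows.get(id);
  if (!row || row.version !== expectedVersion) return false; // conflict: re-read and retry
  rows.set(id, { ...row, ...changes, version: row.version + 1 });
  return true;
}

// Two workers both read version 1, then both try to write.
const firstWrite = updateIfVersionMatches('order-123', 1, { status: 'paid' });
const secondWrite = updateIfVersionMatches('order-123', 1, { status: 'paid' });

console.log(firstWrite, secondWrite, rows.get('order-123').version); // true false 2
```

The losing worker gets a clear "conflict" answer and can re-read the row, see that the order is already paid, and stop. No lock service is involved.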
Use locks when:
- Multiple servers compete to do the same one-time thing
- You're coordinating job scheduling across workers
- You need "at most one" execution semantics
- Flash sales, first-come-first-served, limited slots
Simple decision tree:
Can I use a DB unique constraint? → YES → Use that instead
Can I use optimistic locking? → YES → Use that instead
Is this a leader election? → YES → Use locks
Is this job deduplication? → YES → Use locks
Is this an inventory decrement? → Maybe → Benchmark first
Alternatives to Redis Locks
| Tool | Best For | Complexity |
|---|---|---|
| Redis SET NX | General purpose | Low |
| Postgres advisory locks | Same DB transactions | Low |
| DynamoDB conditional writes | AWS-native, no Redis | Medium |
| Zookeeper / etcd | Leader election, critical infra | High |
| Redlock algorithm | Multi-Redis, high availability | High |
My setup at work: Redis for application-level locks, PostgreSQL advisory locks for database migrations, Redlock for anything that touches billing.
Common Mistakes (I Made All of These)
Mistake #1: Forgetting the TTL
// BAD: Lock forever if server crashes!
await redis.set(`lock:${key}`, 'locked', { NX: true });
// GOOD: Lock expires even if server dies
await redis.set(`lock:${key}`, 'locked', { NX: true, PX: 30000 });
Mistake #2: TTL Too Short
// BAD: Operation takes 10s, lock expires in 5s, so another server grabs it
await lock.acquire(key, { ttl: 5000 });
// GOOD: TTL > worst-case operation time, with buffer
await lock.acquire(key, { ttl: 60000 }); // 60s for a 10s operation
Mistake #3: Not Releasing on Error
// BAD: Exception? Lock stays forever (well, until TTL)
const lockValue = await acquireLock(key);
await doSomethingThatMightThrow(); // throws! lock not released
await releaseLock(key, lockValue);
// GOOD: Always release in finally
const lockValue = await acquireLock(key);
try {
await doSomethingThatMightThrow();
} finally {
await releaseLock(key, lockValue); // Always runs!
}
Mistake #4: Using Lock as a Queue
// BAD: Locking for every single request, creating a bottleneck
app.post('/api/update-user', async (req, res) => {
await lock.withLock('all-users', async () => { // GLOBAL LOCK!
await updateUser(req.body);
});
});
// GOOD: Lock per resource
app.post('/api/update-user', async (req, res) => {
await lock.withLock(`user:${req.userId}`, async () => {
await updateUser(req.body);
});
});
The Bottom Line
Distributed locks are one of those things you don't think about until they burn you. Then they become one of the first things you think about.
The essentials:
- Always set a TTL: locks must auto-expire if servers crash
- Use unique lock values: don't release someone else's lock
- Release in finally: always, no exceptions
- Lock at the right granularity: per-resource, not globally
- Prefer simpler alternatives: DB constraints and optimistic locking first
A scalability lesson that cost us: The double-discount bug cost us about $800 before we caught it. Redis locking costs about $0.01/month in compute. Some of the best ROI in distributed systems engineering.
When designing our e-commerce backend, locking wasn't in the initial design. It was bolted on after our first production incident. Do it right from day one â your future self (and your customers) will thank you!
Fighting race conditions in production? Connect with me on LinkedIn â I've got war stories.
Want to see my distributed systems patterns? Check out my GitHub for production-ready examples!
Now go lock your resources. Not all of them, though. Please read the "when NOT to use" section first!