Consistent Hashing: Stop Breaking Your Entire Cache Every Time You Add a Server
The call came in at 11:47 PM.
"The site is down. Checkout is returning 500s. Database CPU is 100%. What did you deploy?"
I didn't deploy anything. I just added a fourth Redis node to our cache cluster to handle a traffic spike.
Turned out, that one innocent change invalidated 75% of our cached data simultaneously. 75% of cache keys remapped to different servers. 75% of requests fell through to the database at once. The database, suddenly handling 10x its normal read load with zero cache assistance, just… gave up.
The culprit? A math lesson I'd apparently skipped. hash(key) % numServers is a trap, and I walked straight into it.
Let me show you why, and the elegant algorithm that fixes it.
The Naive Hashing Problem
When you have multiple cache servers, you need a way to decide which server stores a given key. The obvious approach:
server_index = hash("user:1234:profile") % num_servers
With 3 servers:
hash("user:1234:profile") = 17,294,822
17,294,822 % 3 = 2 β Server 2 β
Works perfectly! Until you add a 4th server:
hash("user:1234:profile") = 17,294,822
17,294,822 % 4 = 2 β Still Server 2 β
(lucky!)
hash("order:9876:items") = 25,104,517
25,104,517 % 3 = 1 β Was on Server 1
25,104,517 % 4 = 1 β Still Server 1 β
(lucky again!)
hash("product:42:details") = 31,000,005
31,000,005 % 3 = 0 β Was on Server 0
31,000,005 % 4 = 1 β Now on Server 1 π₯ CACHE MISS
Adding one server remaps roughly N/(N+1) of your keys: about 75% going from 3 to 4 servers. Removing a server remaps a similar proportion.
3 servers → 4 servers:
┌──────────────────────────────────────────────┐
│ ~75% of cache keys: REMAPPED = CACHE MISS    │
│ ~25% of cache keys: stayed put               │
│                                              │
│ Database gets hit with 75% of your traffic   │
│ simultaneously. RIP.                         │
└──────────────────────────────────────────────┘
This is exactly what killed our checkout that night. And it's why consistent hashing exists.
The Hash Ring: A Donut That Saves Lives
Consistent hashing's core idea is brilliant and simple:
Instead of mapping keys to servers directly, map both keys AND servers onto the same circular ring of integers (0 to 2^32 - 1). A key is served by the first server clockwise from it on the ring.
                0 / 2^32
                   ·
         S3 ·             · S1
        (230)            (60)
                   ·
               · S2 (130)

       (positions increase clockwise)
Keys route to the next server clockwise:
Key "user:1234" lands at position 40 β routes to S1 (next clockwise)
Key "order:9876" lands at position 140 β routes to S2
Key "product:42" lands at position 220 β routes to S3
Now what happens when you add Server 4 at position 100?
BEFORE:                      AFTER adding S4 at 100:
Keys   0-60  → S1            Keys   0-60  → S1 (unchanged)
Keys  60-130 → S2            Keys  60-100 → S4 (new! S4 handles this range)
Keys 130-230 → S3            Keys 100-130 → S2 (unchanged)
Keys 230-360 → S1            Keys 130-230 → S3 (unchanged)
                             Keys 230-360 → S1 (unchanged)
Only the keys in the range that S4 "claimed" from S2 (positions 60 to 100) need to be remapped: about 11% of this toy ring, and roughly 1/N of the keys on average. Instead of remapping 75% of keys, we remap a single slice.
This is the superpower: adding or removing a server only affects the immediately neighboring slice on the ring. Everything else stays put.
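A quick way to convince yourself: walk every position on the toy 0-360 ring and compare owners before and after S4 joins at 100. Positions and server spots here are the illustrative ones from the table above:

```javascript
// Which positions change owner when S4 joins at 100?
function owner(ring, keyPos) {
  // ring must be sorted by pos; wrap to the first node past position 0
  const hit = ring.find(node => node.pos >= keyPos);
  return (hit || ring[0]).server;
}

const before = [
  { pos: 60,  server: 'S1' },
  { pos: 130, server: 'S2' },
  { pos: 230, server: 'S3' },
];
const after = [
  { pos: 60,  server: 'S1' },
  { pos: 100, server: 'S4' }, // the new node
  { pos: 130, server: 'S2' },
  { pos: 230, server: 'S3' },
];

const movedRanges = [];
for (let p = 0; p < 360; p++) {
  if (owner(before, p) !== owner(after, p)) movedRanges.push(p);
}
console.log(movedRanges[0], movedRanges[movedRanges.length - 1]); // 61 100
console.log(movedRanges.length); // 40 of 360 positions (~11%)
```

Only the arc between S1 and the new node changes hands; every other position keeps its old owner.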
Virtual Nodes: The Load-Balancing Cheat Code
Raw consistent hashing has a problem: if you only have 3 servers on a ring of 2^32 positions, they might land unevenly:
BAD distribution:
Server 1 gets 60% of the ring
Server 2 gets 30% of the ring
Server 3 gets 10% of the ring
Server 1 handles 6x the traffic of Server 3. Oops.
Virtual nodes (vnodes) fix this. Instead of placing each server once on the ring, place it 150 times:
Server 1 appears as: S1-vn1, S1-vn2, ..., S1-vn150
Server 2 appears as: S2-vn1, S2-vn2, ..., S2-vn150
Server 3 appears as: S3-vn1, S3-vn2, ..., S3-vn150
450 points on the ring instead of 3. With enough points, each server statistically captures ~33% of the keyspace.
With 150 vnodes per server:
Server 1: ~33.2% of traffic
Server 2: ~33.1% of traffic
Server 3: ~33.7% of traffic
Roughly even without manual tuning.
Bonus: When you add Server 4, it steals a handful of vnodes from each existing server β spreading the migration cost evenly instead of hammering one neighbor.
The Code Behind the Magic
Here's a production-style consistent hash ring in Node.js:
const crypto = require('crypto');

class ConsistentHashRing {
  constructor(servers, virtualNodes = 150) {
    this.virtualNodes = virtualNodes;
    this.ring = new Map();   // position → server
    this.sortedKeys = [];    // sorted ring positions
    servers.forEach(server => this.addServer(server));
  }

  hash(key) {
    // MD5 for its uniform spread (not for security); the first
    // 8 hex chars become a 32-bit integer position on the ring
    return parseInt(
      crypto.createHash('md5').update(key).digest('hex').slice(0, 8),
      16
    );
  }

  addServer(server) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualKey = `${server}:vn${i}`;
      const position = this.hash(virtualKey);
      this.ring.set(position, server);
      this.sortedKeys.push(position);
    }
    this.sortedKeys.sort((a, b) => a - b);
    console.log(`Added ${server} with ${this.virtualNodes} virtual nodes`);
  }

  removeServer(server) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualKey = `${server}:vn${i}`;
      const position = this.hash(virtualKey);
      this.ring.delete(position);
      const idx = this.sortedKeys.indexOf(position);
      if (idx !== -1) this.sortedKeys.splice(idx, 1);
    }
    console.log(`Removed ${server}`);
  }

  getServer(key) {
    if (this.ring.size === 0) throw new Error('No servers in ring');
    const position = this.hash(key);
    // Find the first position >= hash (clockwise on the ring).
    // A linear scan is fine for a few hundred vnodes; switch to
    // binary search if the ring grows large.
    const idx = this.sortedKeys.findIndex(p => p >= position);
    // If no position is found, wrap around to the first server
    const serverPosition = idx === -1
      ? this.sortedKeys[0]
      : this.sortedKeys[idx];
    return this.ring.get(serverPosition);
  }
}
// Usage: the same pattern we use in our e-commerce caching layer
const ring = new ConsistentHashRing([
'redis-1:6379',
'redis-2:6379',
'redis-3:6379',
]);
console.log(ring.getServer('user:1234:profile')); // redis-2:6379
console.log(ring.getServer('order:9876:items')); // redis-1:6379
console.log(ring.getServer('product:42:details')); // redis-3:6379
// Add a 4th server during a traffic spike
ring.addServer('redis-4:6379');
// The SAME key still routes to the same server (unless it fell in redis-4's new slice)
console.log(ring.getServer('user:1234:profile')); // redis-2:6379 (likely unchanged)
console.log(ring.getServer('order:9876:items'));  // redis-4:6379 (maybe moved)
// Only ~25% of keys moved. Database yawns.
When designing our e-commerce backend, we added a consistent hash layer between our application and Redis. The result: cache hit rate dropped from ~85% to ~80% when we added a node (not to ~10%), and the database absorbed a 5% load bump instead of an OOM crash.
Who Uses This in the Real World?
Consistent hashing isn't academic theory. It's everywhere:
Redis Cluster:
- Splits the keyspace into 16,384 hash slots
- Distributes slots across nodes
- Node addition/removal only migrates the affected slots

Apache Cassandra:
- Uses a token ring with virtual nodes (vnodes)
- Each node claims a set of token ranges
- Data redistribution on scale-up is automatic and partial

Amazon DynamoDB:
- Consistent hashing under the hood for partition routing
- Partition keys map to physical storage nodes
- Seamless resharding without downtime

Akamai CDN (the original use case!):
- Consistent hashing was invented by the researchers who went on to found Akamai
- Routes requests to edge cache servers
- Adding PoPs doesn't invalidate everything else
As a Technical Lead, I've learned: when you see "horizontal scaling without cache invalidation storms" in a system's marketing materials, they almost certainly mean consistent hashing.
Trade-offs: The Honest Version

WHAT YOU GAIN:
┌────────────────────────────────────────────────────────┐
│ Scale-out cache clusters without cache stampedes       │
│ Node failures only affect ~1/N of keys (not all)       │
│ Even load distribution with virtual nodes              │
│ Zero downtime node additions and removals              │
└────────────────────────────────────────────────────────┘

WHAT YOU GIVE UP:
┌────────────────────────────────────────────────────────┐
│ More complex than % modulo (but not much more)         │
│ Ring state must be consistent across all clients       │
│ Hot keys still go to the same server (no spreading)    │
│ Virtual node count needs tuning for your cluster size  │
└────────────────────────────────────────────────────────┘
The hot key problem is real. Consistent hashing assigns a key to one server. If product:iphone-16-pro is your most requested cache key, it always hits the same Redis node. For this, you want replication (read replicas) or a local in-process cache for ultra-hot keys. Consistent hashing routes intelligently β it doesn't replicate.
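One common mitigation is worth a sketch: read a hot key from the first R distinct servers clockwise instead of just one, which is how Dynamo-style systems pick replica sets. The helper below is a hypothetical extension of the ring idea, using the toy positions from earlier:

```javascript
// Return the first `replicas` DISTINCT servers clockwise from a key's
// position, so hot keys can be replicated across several nodes.
function getServers(ring, keyPos, replicas) {
  // ring: [{ pos, server }] sorted ascending by pos
  const result = [];
  const idx = ring.findIndex(node => node.pos >= keyPos);
  const start = idx === -1 ? 0 : idx; // wrap past 0 if needed
  for (let i = 0; i < ring.length && result.length < replicas; i++) {
    const node = ring[(start + i) % ring.length];
    if (!result.includes(node.server)) result.push(node.server);
  }
  return result;
}

const ring = [
  { pos: 60,  server: 'S1' },
  { pos: 130, server: 'S2' },
  { pos: 230, server: 'S3' },
];
console.log(getServers(ring, 90, 2));  // [ 'S2', 'S3' ]
console.log(getServers(ring, 300, 2)); // [ 'S1', 'S2' ] (wraps)
```

With vnodes you must skip duplicate physical servers while walking, which is exactly what the `includes` check does.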
When to Reach for This
Use consistent hashing when:
- Running a distributed cache cluster (Redis, Memcached) you scale dynamically
- Building a sharded database layer where tables need stable routing
- Designing a CDN or proxy layer across multiple origin servers
- Any system where data is partitioned across nodes and node membership changes
Skip it when:
- You have a single cache server (% modulo is fine)
- Your cluster size never changes (consistent hashing's main benefit is elasticity)
- You're using a managed service like ElastiCache Redis Cluster (it does this for you already)
The check I do now before every cache cluster change: Which routing algorithm is my client using? If it's % numServers and I'm changing numServers, I schedule a maintenance window and pre-warm the cache. If it's consistent hashing, I roll the change during peak traffic without a second thought.
TL;DR
Naive hashing (hash(key) % N) remaps roughly N/(N+1) of keys when you add just one server (about 75% going from 3 to 4). Consistent hashing maps keys and servers onto a ring: adding a node only affects the keys in its slice (~1/N, not almost everything).
- Hash ring → stable key routing across node changes
- Virtual nodes → even load distribution without manual balancing
- Used by Redis Cluster, Cassandra, DynamoDB, Akamai CDN
- Key trade-off → solves redistribution cost, not hot key concentration
The lesson that cost me a night of sleep: distributed caching is only as stable as your routing algorithm. Swapping % N for a hash ring is one of those changes that takes an afternoon to implement and saves you years of 2 AM incidents.
Don't let the next server addition be the thing that breaks your database. Let it be boring.
Had a cache invalidation nightmare from scaling? Tell me the story on LinkedIn; I'll trade you mine.
Want the full hash ring implementation with weighted nodes? It's on GitHub: production-tested, complete with metrics hooks.
Now go hash those servers consistently. Your on-call rotation will thank you.