0x55aa
← Back to Blog

Consistent Hashing: Stop Breaking Your Entire Cache Every Time You Add a Server 🎑

β€’9 min read

Consistent Hashing: Stop Breaking Your Entire Cache Every Time You Add a Server 🎑

The call came in at 11:47 PM.

"The site is down. Checkout is returning 500s. Database CPU is 100%. What did you deploy?"

I didn't deploy anything. I just added a fourth Redis node to our cache cluster to handle a traffic spike.

Turned out, that one innocent change invalidated 75% of our cached data simultaneously. 75% of cache keys remapped to different servers. 75% of requests fell through to the database at once. The database, suddenly handling 10x its normal read load with zero cache assistance, just… gave up.

The culprit? A math lesson I'd apparently skipped. hash(key) % numServers is a trap, and I walked straight into it.

Let me show you why, and the elegant algorithm that fixes it.

The Naive Hashing Problem πŸ’£

When you have multiple cache servers, you need a way to decide which server stores a given key. The obvious approach:

server_index = hash("user:1234:profile") % num_servers

With 3 servers:

hash("user:1234:profile") = 17,294,822
17,294,822 % 3 = 2   β†’ Server 2 βœ…

Works perfectly! Until you add a 4th server:

hash("user:1234:profile") = 17,294,822
17,294,822 % 4 = 2   β†’ Still Server 2 βœ… (lucky!)

hash("order:9876:items") = 25,104,517
25,104,517 % 3 = 1   β†’ Was on Server 1
25,104,517 % 4 = 1   β†’ Still Server 1 βœ… (lucky again!)

hash("product:42:details") = 31,000,005
31,000,005 % 3 = 0   β†’ Was on Server 0
31,000,005 % 4 = 1   β†’ Now on Server 1 πŸ’₯ CACHE MISS

Adding one server remaps (N-1)/N of your keys β€” roughly 75% when going from 3 to 4 servers. Removing a server remaps the same proportion.

3 servers β†’ 4 servers:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ~75% of cache keys: REMAPPED = CACHE MISS   β”‚
β”‚ ~25% of cache keys: stayed put              β”‚
β”‚                                             β”‚
β”‚ Database gets hit with 75% of your traffic  β”‚
β”‚ simultaneously. RIP. πŸͺ¦                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This is exactly what killed our checkout that night. And it's why consistent hashing exists.

The Hash Ring: A Donut That Saves Lives 🍩

Consistent hashing's core idea is brilliant and simple:

Instead of mapping keys to servers directly, map both keys AND servers onto the same circular ring of integers (0 to 2^32 - 1). A key is served by the first server clockwise from it on the ring.

           0 / 2^32
               β”‚
    270        0        90
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”
       β”‚       S3      β”‚
  180───       ↑       β”œβ”€β”€0
       β”‚    S2   S1    β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              180

Keys route to the next server clockwise:

Key "user:1234"  lands at position 40   β†’ routes to S1 (next clockwise)
Key "order:9876" lands at position 140  β†’ routes to S2
Key "product:42" lands at position 220  β†’ routes to S3

Now what happens when you add Server 4 at position 100?

BEFORE:                         AFTER adding S4 at 100:
Keys 0-60   β†’ S1                Keys 0-60    β†’ S1 (unchanged βœ…)
Keys 60-130 β†’ S2                Keys 60-100  β†’ S4 (new! S4 handles this range)
Keys 130-230 β†’ S3               Keys 100-130 β†’ S2 (unchanged βœ…)
Keys 230-360 β†’ S1               Keys 130-230 β†’ S3 (unchanged βœ…)
                                Keys 230-360 β†’ S1 (unchanged βœ…)

Only the keys in the range that S4 "claimed" from S2 need to be remapped. Instead of remapping 75% of keys, we only remapped ~25% β€” the slice that S4 took from its predecessor.

This is the superpower: adding or removing a server only affects the immediately neighboring slice on the ring. Everything else stays put.

Virtual Nodes: The Load-Balancing Cheat Code 🎲

Raw consistent hashing has a problem: if you only have 3 servers on a ring of 2^32 positions, they might land unevenly:

BAD distribution:
Server 1 gets 60% of the ring
Server 2 gets 30% of the ring
Server 3 gets 10% of the ring
β†’ Server 1 handles 6x the traffic of Server 3. Oops.

Virtual nodes (vnodes) fix this. Instead of placing each server once on the ring, place it 150 times:

Server 1 appears as: S1-vn1, S1-vn2, ..., S1-vn150
Server 2 appears as: S2-vn1, S2-vn2, ..., S2-vn150
Server 3 appears as: S3-vn1, S3-vn2, ..., S3-vn150

450 points on the ring instead of 3. With enough points, each server statistically captures ~33% of the keyspace.

With 150 vnodes per server:
Server 1: ~33.2% of traffic
Server 2: ~33.1% of traffic
Server 3: ~33.7% of traffic
βœ… Roughly even without manual tuning.

Bonus: When you add Server 4, it steals a handful of vnodes from each existing server β€” spreading the migration cost evenly instead of hammering one neighbor.

The Code Behind the Magic πŸ”§

Here's a production-style consistent hash ring in Node.js:

const crypto = require('crypto');

class ConsistentHashRing {
  constructor(servers, virtualNodes = 150) {
    this.virtualNodes = virtualNodes;
    this.ring = new Map();     // position β†’ server
    this.sortedKeys = [];      // sorted ring positions

    servers.forEach(server => this.addServer(server));
  }

  hash(key) {
    // Use a fast, uniform hash function
    return parseInt(
      crypto.createHash('md5').update(key).digest('hex').slice(0, 8),
      16
    );
  }

  addServer(server) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualKey = `${server}:vn${i}`;
      const position = this.hash(virtualKey);

      this.ring.set(position, server);
      this.sortedKeys.push(position);
    }
    this.sortedKeys.sort((a, b) => a - b);

    console.log(`βœ… Added ${server} with ${this.virtualNodes} virtual nodes`);
  }

  removeServer(server) {
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualKey = `${server}:vn${i}`;
      const position = this.hash(virtualKey);

      this.ring.delete(position);
      const idx = this.sortedKeys.indexOf(position);
      if (idx !== -1) this.sortedKeys.splice(idx, 1);
    }

    console.log(`πŸ—‘οΈ  Removed ${server}`);
  }

  getServer(key) {
    if (this.ring.size === 0) throw new Error('No servers in ring');

    const position = this.hash(key);

    // Find first position >= hash (clockwise on ring)
    const idx = this.sortedKeys.findIndex(p => p >= position);

    // If no position found, wrap around to the first server
    const serverPosition = idx === -1
      ? this.sortedKeys[0]
      : this.sortedKeys[idx];

    return this.ring.get(serverPosition);
  }
}

// Usage β€” same pattern we use in our e-commerce caching layer
const ring = new ConsistentHashRing([
  'redis-1:6379',
  'redis-2:6379',
  'redis-3:6379',
]);

console.log(ring.getServer('user:1234:profile'));  // redis-2:6379
console.log(ring.getServer('order:9876:items'));   // redis-1:6379
console.log(ring.getServer('product:42:details')); // redis-3:6379

// Add a 4th server during a traffic spike
ring.addServer('redis-4:6379');

// SAME key still routes to the same server (unless it fell in redis-4's slice)
console.log(ring.getServer('user:1234:profile'));  // redis-2:6379 βœ… (likely unchanged)
console.log(ring.getServer('order:9876:items'));   // redis-4:6379 (maybe moved)
// Only ~25% of keys moved. Database yawns.

When designing our e-commerce backend, we added a consistent hash layer between our application and Redis. The result: cache hit rate dropped from ~85% to ~80% when we added a node (not to ~10%), and the database absorbed a 5% load bump instead of an OOM crash.

Who Uses This in the Real World? 🌍

Consistent hashing isn't academic theory. It's everywhere:

Redis Cluster:
  β†’ Splits keyspace into 16,384 hash slots
  β†’ Distributes slots across nodes
  β†’ Node addition/removal only migrates affected slots

Apache Cassandra:
  β†’ Uses a token ring with virtual nodes (vnodes)
  β†’ Each node claims a set of token ranges
  β†’ Data redistribution on scale-up is automatic and partial

Amazon DynamoDB:
  β†’ Consistent hashing under the hood for partition routing
  β†’ Partition keys map to physical storage nodes
  β†’ Seamless resharding without downtime

Akamai CDN (original use case!):
  β†’ Consistent hashing was literally invented for Akamai
  β†’ Routes requests to edge cache servers
  β†’ Adding PoPs doesn't invalidate everything else

As a Technical Lead, I've learned: when you see "horizontal scaling without cache invalidation storms" in a system's marketing materials, they almost certainly mean consistent hashing.

Trade-offs: The Honest Version βš–οΈ

βœ… WHAT YOU GAIN:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Scale-out cache clusters without cache stampedes       β”‚
β”‚ Node failures only affect ~1/N of keys (not all)      β”‚
β”‚ Even load distribution with virtual nodes             β”‚
β”‚ Zero downtime node additions and removals             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

⚠️ WHAT YOU GIVE UP:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ More complex than % modulo (but not much more)         β”‚
β”‚ Ring state must be consistent across all clients       β”‚
β”‚ Hot keys still go to the same server (no spreading)   β”‚
β”‚ Virtual node count needs tuning for your cluster size  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The hot key problem is real. Consistent hashing assigns a key to one server. If product:iphone-16-pro is your most requested cache key, it always hits the same Redis node. For this, you want replication (read replicas) or a local in-process cache for ultra-hot keys. Consistent hashing routes intelligently β€” it doesn't replicate.

When to Reach for This 🎯

Use consistent hashing when:

  • Running a distributed cache cluster (Redis, Memcached) you scale dynamically
  • Building a sharded database layer where tables need stable routing
  • Designing a CDN or proxy layer across multiple origin servers
  • Any system where data is partitioned across nodes and node membership changes

Skip it when:

  • You have a single cache server (% modulo is fine)
  • Your cluster size never changes (consistent hashing's main benefit is elasticity)
  • You're using a managed service like ElastiCache Redis Cluster (it does this for you already)

The check I do now before every cache cluster change: Which routing algorithm is my client using? If it's % numServers and I'm changing numServers, I schedule a maintenance window and pre-warm the cache. If it's consistent hashing, I roll the change during peak traffic without a second thought.

TL;DR πŸ’‘

Naive hashing (hash(key) % N) remaps ~75% of keys when you add just one server. Consistent hashing maps keys and servers onto a ring β€” adding a node only affects the keys in its slice (~1/N, not ~(N-1)/N).

  • Hash ring β†’ stable key routing across node changes
  • Virtual nodes β†’ even load distribution without manual balancing
  • Used by Redis Cluster, Cassandra, DynamoDB, Akamai CDN
  • Key trade-off β†’ solves redistribution cost, not hot key concentration

The lesson that cost me a night of sleep: distributed caching is only as stable as your routing algorithm. Swapping % N for a hash ring is one of those changes that takes an afternoon to implement and saves you years of 2 AM incidents.

Don't let the next server addition be the thing that breaks your database. Let it be boring.


Had a cache invalidation nightmare from scaling? Tell me the story on LinkedIn β€” I'll trade you mine.

Want the full hash ring implementation with weighted nodes? It's on GitHub β€” production-tested, complete with metrics hooks.

Now go hash those servers consistently. Your on-call rotation will thank you. πŸŽ‘πŸš€