Your API has rate limiting. You added 429 Too Many Requests responses. You feel good about yourself.
And then some clever user sends 999 requests in the last second of window one and 999 in the first second of window two — effectively hammering you with 1998 requests in two seconds, well over your "1000 per minute" limit.
Congratulations: your rate limiter is a prop.
This is the fixed-window problem, and almost everyone falls for it. Let me show you why it happens and how the token bucket algorithm actually solves it.
The Problem With Fixed Windows
The naive approach looks like this:
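    # Fixed window - looks safe, isn't
    import time

    import redis

    r = redis.Redis()  # assumes a local Redis instance; adjust connection details

    def is_allowed(user_id: str, limit: int = 1000) -> bool:
        # One counter per user per calendar minute
        window_key = f"ratelimit:{user_id}:{int(time.time() // 60)}"
        count = r.incr(window_key)
        if count == 1:
            r.expire(window_key, 60)
        return count <= limit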
The logic is clean. The security isn't. Here's the attack:
- Window resets at :00 every minute
- Attacker sends 1000 requests from :58 to :59 (all allowed)
- Window resets at :00
- Attacker sends 1000 more from :00 to :01 (all allowed)
- You just served 2000 requests in 2 seconds to one client
Your backend just absorbed roughly 1000 requests per second, about 60x the ~16.7 req/s average that a 1000-per-minute limit implies. If your downstream database or third-party service can't absorb that, you have an availability problem dressed up as a rate limiting problem.
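To make the boundary attack concrete, here is a minimal in-memory replay of it. This is a sketch under my own assumptions: `FixedWindowLimiter` and `burst` are illustrative names, the Map stands in for the Redis counter, and the clock is passed in so the run is deterministic.

```typescript
// In-memory stand-in for the Redis counter; names here are illustrative.
class FixedWindowLimiter {
  private counts = new Map<string, number>();
  private readonly limit: number;
  private readonly windowSeconds: number;

  constructor(limit: number, windowSeconds: number) {
    this.limit = limit;
    this.windowSeconds = windowSeconds;
  }

  // nowSeconds is injected so the attack can be replayed deterministically.
  isAllowed(userId: string, nowSeconds: number): boolean {
    const windowIndex = Math.floor(nowSeconds / this.windowSeconds);
    const key = `${userId}:${windowIndex}`;
    const count = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, count);
    return count <= this.limit;
  }
}

// Send n requests spread evenly over [start, start + 1) and count the allowed ones.
function burst(limiter: FixedWindowLimiter, start: number, n: number): number {
  let allowed = 0;
  for (let i = 0; i < n; i++) {
    if (limiter.isAllowed("attacker", start + i / n)) allowed++;
  }
  return allowed;
}

const limiter = new FixedWindowLimiter(1000, 60);
const beforeReset = burst(limiter, 58, 1000); // :58 to :59, first window
const afterReset = burst(limiter, 60, 1000);  // :00 to :01, second window
const servedAcrossBoundary = beforeReset + afterReset;
console.log({ beforeReset, afterReset, servedAcrossBoundary });
```

Both bursts pass in full because each one lands in a window whose counter starts at zero.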
The Token Bucket Algorithm
Think of it as a literal bucket of tokens:
- The bucket holds up to capacity tokens (say, 100)
- Tokens refill at a steady rate, e.g., 10 tokens/second
- Each API request consumes one token
- If the bucket is empty: 429. If there are tokens: request proceeds
The key insight is that bursts are allowed, but bounded. A user can burn through their full 100 tokens in a second if they have them, but then they're throttled to 10 req/s until the bucket refills. No more double-window exploits.
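Put as a bound: over any interval of t seconds, a client can get at most capacity + refillRate × t requests through. The sketch below checks that with an in-memory bucket and an injected clock. It is a simplified stand-in, not the Redis version, and the `TokenBucket` class and `hammer` helper are names I made up for illustration.

```typescript
// In-memory token bucket with an injected clock; a sketch, not the Redis version.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillRate: number;

  constructor(capacity: number, refillRate: number, nowSeconds: number) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity; // new clients start with a full bucket
    this.lastRefill = nowSeconds;
  }

  tryConsume(nowSeconds: number): boolean {
    // Refill based on elapsed time, capped at capacity.
    const elapsed = Math.max(0, nowSeconds - this.lastRefill);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = nowSeconds;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Hammer a fresh 100-token, 10/sec bucket at 1000 req/s and count what gets through.
function hammer(seconds: number): number {
  const bucket = new TokenBucket(100, 10, 0);
  let allowed = 0;
  for (let ms = 0; ms < seconds * 1000; ms++) {
    if (bucket.tryConsume(ms / 1000)) allowed++;
  }
  return allowed;
}

const allowedIn2s = hammer(2);   // bounded by 100 + 10 * 2 = 120
const allowedIn60s = hammer(60); // bounded by 100 + 10 * 60 = 700
console.log({ allowedIn2s, allowedIn60s });
```

Compare the two-second figure with the 2000 requests the fixed window let through in the same span.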
Here's a Redis-based implementation in Node.js that I've shipped at Cubet on API gateways handling client-facing integrations:
    import Redis from 'ioredis';

    interface TokenBucketConfig {
      capacity: number;      // max tokens in bucket
      refillRate: number;    // tokens added per second
      windowSeconds: number; // key TTL (set > capacity/refillRate)
    }

    async function tokenBucket(
      redis: Redis,
      key: string,
      config: TokenBucketConfig
    ): Promise<{ allowed: boolean; remaining: number; retryAfter?: number }> {
      const now = Date.now() / 1000; // seconds with millisecond precision
      const { capacity, refillRate, windowSeconds } = config;

      const luaScript = `
        local key = KEYS[1]
        local capacity = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])
        local ttl = tonumber(ARGV[4])

        local data = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(data[1]) or capacity
        local last_refill = tonumber(data[2]) or now

        -- Refill tokens based on elapsed time
        local elapsed = now - last_refill
        tokens = math.min(capacity, tokens + elapsed * refill_rate)

        local allowed = 0
        if tokens >= 1 then
          tokens = tokens - 1
          allowed = 1
        end

        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, ttl)
        return { allowed, math.floor(tokens) }
      `;

      const result = await redis.eval(
        luaScript, 1, key,
        capacity, refillRate, now, windowSeconds
      ) as [number, number];

      const allowed = result[0] === 1;
      const remaining = result[1];
      const retryAfter = allowed ? undefined : Math.ceil(1 / refillRate);
      return { allowed, remaining, retryAfter };
    }
A few things worth noting here:
The Lua script runs atomically. This is non-negotiable. If you do read-modify-write as three separate Redis commands, you have a race condition under concurrent requests. Lua scripts in Redis execute as a single unit — no interleaving.
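Here is a small sketch of what that race looks like when the read and the write are separate steps. `FakeStore` is a hypothetical in-memory stand-in for Redis round trips (each call yields once, the way a network hop would), and `naiveConsume` is deliberately the wrong approach.

```typescript
// Hypothetical async key-value store standing in for separate Redis round trips.
class FakeStore {
  private data = new Map<string, number>();

  async get(key: string): Promise<number | undefined> {
    await Promise.resolve(); // yield, like a network round trip
    return this.data.get(key);
  }

  async set(key: string, value: number): Promise<void> {
    await Promise.resolve();
    this.data.set(key, value);
  }
}

// Non-atomic: read, decide, write as separate steps.
async function naiveConsume(store: FakeStore, key: string): Promise<boolean> {
  const tokens = (await store.get(key)) ?? 0;
  if (tokens < 1) return false;
  await store.set(key, tokens - 1);
  return true;
}

async function demo(): Promise<{ allowed: number; tokensLeft: number }> {
  const store = new FakeStore();
  await store.set("bucket", 1); // exactly one token available

  // Five concurrent requests all read the bucket before any of them writes it back.
  const results = await Promise.all(
    Array.from({ length: 5 }, () => naiveConsume(store, "bucket"))
  );
  const tokensLeft = (await store.get("bucket")) ?? 0;
  return { allowed: results.filter(Boolean).length, tokensLeft };
}

demo().then((result) => console.log(result));
```

With one token in the bucket, more than one request gets through, because every reader sees the token before anyone has written the decrement. Running the same logic inside a single Lua script closes that gap.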
Tokens are stored as floats. We compute elapsed * refillRate which might be 0.37 tokens. Keeping the precision means the bucket fills smoothly rather than in integer jumps that create micro-bursts at refill boundaries.
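One specific way integer rounding can bite, as a toy illustration of my own rather than anything from the Lua script: if every call floors the refill amount and still advances the last-refill timestamp, a client that polls faster than one token interval never accumulates anything. The 50 ms spacing and the `simulate` helper are arbitrary choices for the demo.

```typescript
// Runs the same request pattern with a pluggable rounding rule for the refill amount.
function simulate(round: (x: number) => number): number {
  const capacity = 100;
  const refillRate = 10; // tokens per second
  let tokens = 0;        // start empty so only refill matters
  let lastRefill = 0;
  let allowed = 0;

  // One request every 50 ms for 10 seconds: each gap is worth about 0.5 tokens.
  for (let ms = 50; ms <= 10000; ms += 50) {
    const now = ms / 1000;
    tokens = Math.min(capacity, tokens + round((now - lastRefill) * refillRate));
    lastRefill = now;
    if (tokens >= 1) {
      tokens -= 1;
      allowed++;
    }
  }
  return allowed;
}

const withFloats = simulate((x) => x);  // partial tokens carry over between calls
const withFloor = simulate(Math.floor); // partial tokens are discarded on every call
console.log({ withFloats, withFloor });
```

With fractional tokens kept, the client gets close to the 10 req/s it is entitled to over the 10 seconds. With flooring, each half token is thrown away before it can add up to a whole one.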
Wiring It Into Express Middleware
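    import type { NextFunction, Request, Response } from 'express';

    function createRateLimiter(redis: Redis, config: TokenBucketConfig) {
      return async (req: Request, res: Response, next: NextFunction) => {
        try {
          // Key by IP for unauthenticated, by user ID for authenticated.
          // req.user is assumed to be set (and typed) by upstream auth middleware.
          const identifier = req.user?.id ?? req.ip;
          // Note: req.path includes route parameters, so /users/1 and /users/2
          // get separate buckets. Drop the path from the key if that's not intended.
          const key = `ratelimit:${req.path}:${identifier}`;

          const { allowed, remaining, retryAfter } = await tokenBucket(
            redis, key, config
          );

          res.setHeader('X-RateLimit-Limit', config.capacity);
          res.setHeader('X-RateLimit-Remaining', remaining);

          if (!allowed) {
            res.setHeader('Retry-After', retryAfter!);
            return res.status(429).json({
              error: 'Too Many Requests',
              retryAfter,
            });
          }
          next();
        } catch (err) {
          // Express 4 does not catch rejected promises from async middleware,
          // so a Redis failure would otherwise leave the request hanging.
          next(err);
        }
      };
    }

    // Usage: 100 tokens max, refill 10/sec, TTL 30s
    app.use('/api/', createRateLimiter(redis, {
      capacity: 100,
      refillRate: 10,
      windowSeconds: 30,
    }));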
Always send X-RateLimit-Remaining and Retry-After. Clients that do the right thing (back off and retry) will thank you. Clients that don't will hit the wall again faster — that's their problem.
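For the client side of that contract, here is a minimal sketch of reading Retry-After. The header can carry either a number of seconds or an HTTP date, so a well-behaved client should handle both. `retryDelayMs` is a hypothetical helper name, and the current time is passed in to keep it testable.

```typescript
// Returns how long to wait in milliseconds, or undefined if the header is
// missing or unparseable. nowMs is injected so the function is deterministic.
function retryDelayMs(
  header: string | null | undefined,
  nowMs: number
): number | undefined {
  if (header == null) return undefined;
  const value = header.trim();

  // Delay-seconds form, e.g. "Retry-After: 7"
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}

const fromSeconds = retryDelayMs("7", Date.now());
const fromDate = retryDelayMs(
  "Wed, 21 Oct 2015 07:28:00 GMT",
  Date.UTC(2015, 9, 21, 7, 27, 30) // 30 seconds before the date in the header
);
console.log({ fromSeconds, fromDate });
```

The middleware above only ever emits the seconds form, but proxies and other services in the path may not, so parsing both is the safer default.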
Tuning Your Bucket
The three numbers you pick (capacity, refillRate, windowSeconds) define your entire rate limiting policy:
| Scenario | Capacity | Refill Rate | Result |
|---|---|---|---|
| Public read API | 60 | 1/sec | 1 req/sec sustained, burst to 60 |
| Authenticated user | 100 | 10/sec | 10 req/sec sustained, burst to 100 |
| Webhook ingest | 500 | 50/sec | 50/sec sustained, absorbs event storms |
| Password reset | 5 | 0.003/sec | ~11/hour sustained, 5 burst max |
For password resets, login endpoints, and anything authentication-adjacent: set capacity low and refill slow. A legitimate user doesn't need 100 password reset attempts. An attacker trying to brute-force OTPs does.
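If you think in terms of "N requests per period with a burst of B", the three numbers can be derived mechanically. The helper below is a hypothetical convenience, not part of the implementation above. It also picks a TTL just over the time a bucket takes to refill from empty, which follows the "set > capacity/refillRate" rule from the config comment.

```typescript
interface TokenBucketConfig {
  capacity: number;      // max tokens in bucket
  refillRate: number;    // tokens added per second
  windowSeconds: number; // key TTL in whole seconds
}

// Hypothetical helper: translate "sustained requests per period, burst of B"
// into token bucket settings.
function bucketFor(
  sustained: number,
  periodSeconds: number,
  burst: number
): TokenBucketConfig {
  if (sustained <= 0 || periodSeconds <= 0 || burst < 1) {
    throw new RangeError("sustained and periodSeconds must be > 0, burst >= 1");
  }
  const refillRate = sustained / periodSeconds;
  // Time to refill from empty to full. A key idle longer than this would be
  // full anyway, so letting it expire loses no information.
  const fullRefillSeconds = (burst * periodSeconds) / sustained;
  return {
    capacity: burst,
    refillRate,
    windowSeconds: Math.ceil(fullRefillSeconds) + 1, // strictly greater, and an integer for EXPIRE
  };
}

const passwordReset = bucketFor(10, 3600, 5); // 10 per hour, burst of 5
const authenticated = bucketFor(10, 1, 100);  // 10 per second, burst of 100
console.log({ passwordReset, authenticated });
```

Note how slow refill rates push the TTL up: the password reset bucket needs its key to live for about half an hour, far longer than the 30 seconds used in the middleware example.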
What This Doesn't Solve
Token bucket stops per-key abuse — but it won't save you if an attacker has 10,000 IP addresses (botnets, residential proxies). For that layer you need:
- Fingerprinting beyond IP (device headers, TLS fingerprinting)
- Aggregate rate limits at the account or org level, not just per-IP
- Anomaly detection when burst patterns look coordinated
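The aggregate-limit idea can be sketched as checking several buckets per request and admitting only when all of them have a token. This is an in-memory illustration with made-up numbers (a 20-token account bucket, 5-token per-IP buckets, 50 addresses), and `allowAcrossLayers` is a hypothetical name. A Redis version would need to do the same check-then-consume inside one script to stay atomic.

```typescript
interface Bucket {
  tokens: number;
  lastRefill: number;
  capacity: number;
  refillRate: number; // tokens per second
}

function refill(bucket: Bucket, now: number): void {
  const elapsed = Math.max(0, now - bucket.lastRefill);
  bucket.tokens = Math.min(bucket.capacity, bucket.tokens + elapsed * bucket.refillRate);
  bucket.lastRefill = now;
}

// Allow only if every layer has a token; consume from all layers or from none,
// so a rejection at one layer does not drain the others.
function allowAcrossLayers(layers: Bucket[], now: number): boolean {
  for (const bucket of layers) refill(bucket, now);
  if (layers.some((bucket) => bucket.tokens < 1)) return false;
  for (const bucket of layers) bucket.tokens -= 1;
  return true;
}

// One shared account bucket, plus a fresh per-IP bucket for each of 50 addresses.
const account: Bucket = { tokens: 20, lastRefill: 0, capacity: 20, refillRate: 1 };
let allowedTotal = 0;
for (let ip = 0; ip < 50; ip++) {
  const perIp: Bucket = { tokens: 5, lastRefill: 0, capacity: 5, refillRate: 1 };
  for (let i = 0; i < 5; i++) {
    if (allowAcrossLayers([perIp, account], 0)) allowedTotal++;
  }
}
console.log(allowedTotal);
```

Per-IP limits alone would have admitted 5 requests from each of the 50 addresses. With the account layer on top, the total is capped by the account bucket no matter how many addresses the requests come from.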
Rate limiting is one defense layer, not a moat. But it's a foundational one, and doing it correctly — with token bucket instead of fixed windows — is the difference between a policy that holds under adversarial conditions and one that crumbles at the seams.
The One-Sentence Summary
Fixed windows let double-window attacks double your burst capacity; token buckets refill continuously so the math holds at every millisecond, not just at window boundaries.
Get this right and you've removed one of the most common "but we had rate limiting!" post-incident surprises from your future.
Have a war story about rate limiting going sideways in production? Hit me up on Twitter/X or connect on LinkedIn — I collect these.