Chaos Engineering: Break Your Own Systems Before Your Users Do
Real talk: One Friday afternoon I told my team "I'm going to randomly terminate half our Lambda functions while real traffic is flowing." The silence in the Slack channel was deafening. Someone typed "are you feeling okay?"
Three months later, our e-commerce backend survived a full AWS availability zone failure during Black Friday. Zero customer complaints. Zero data loss. The on-call engineer slept through it.
That's what chaos engineering does for you.
What Is Chaos Engineering?
Netflix coined it. Their engineers got tired of discovering system failures when customers did. So they built Chaos Monkey, a tool that randomly kills production servers.
Deliberately. In production.
The idea sounds insane until you realize the alternative:
WITHOUT Chaos Engineering:
✗ System has hidden failure modes
✗ Users discover them at 2am Black Friday
✗ You spend 6 hours debugging something "that never happens"
✗ Your boss discovers it on Twitter
✗ You update your LinkedIn profile
WITH Chaos Engineering:
✓ You discover failure modes in controlled conditions
✓ You fix them before they matter
✓ System handles real failures gracefully
✓ Users don't notice
✓ You get a raise (maybe)
The principle: If it hurts to do it, do it more often until it doesn't hurt.
The Incident That Made Me a Believer
When I was building our e-commerce backend, we had a payment service that called an external card processor. It was fine. Until it wasn't.
What happened on a random Tuesday:
- Card processor had a slowdown (not down, just SLOW)
- Our payment service called it synchronously
- Response time went from 200ms → 12 seconds
- Our checkout endpoints timed out
- Connection pool exhausted
- Checkout API went down completely
- Users couldn't buy anything
- 4 hours of downtime
- $80,000 revenue lost (the math that REALLY gets your attention)
What should have happened:
- Card processor slows down
- Our circuit breaker opens after 5 failures
- Checkout returns "payment processing delayed, we'll confirm shortly"
- Users get order confirmed via email when payment resolves
- Nobody notices
We HAD a circuit breaker. It looked fine in code review. We'd just never tested it under real failure conditions. It had a bug. We discovered it in the worst possible way.
As a Technical Lead, I've learned: Code that isn't tested under failure conditions isn't tested at all.
The Chaos Engineering Process
Don't just start randomly breaking things. That's not chaos engineering, that's self-sabotage. The process (sketched in code just after this list):
1. HYPOTHESIZE: "Our system should handle X failure gracefully"
2. BASELINE: Measure normal system behavior
3. EXPERIMENT: Introduce controlled failure
4. OBSERVE: Does system behave as hypothesized?
5. FIX: If not, repair the weak point
6. REPEAT: Run in more realistic conditions
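To make that loop concrete, here's a minimal harness sketch. The measure, injectFailure, and removeFailure hooks are placeholders for whatever tooling you actually use (a load-test script, a toxiproxy call, an FIS experiment):

// Minimal chaos-experiment runner (sketch; all hooks are placeholders you supply)
async function runExperiment({ hypothesis, measure, injectFailure, removeFailure }) {
  console.log(`HYPOTHESIS: ${hypothesis}`);

  const baseline = await measure();              // 2. BASELINE: know what normal looks like
  console.log('BASELINE:', baseline);

  await injectFailure();                         // 3. EXPERIMENT: introduce controlled failure
  try {
    const observed = await measure();            // 4. OBSERVE: how does the system behave?
    console.log('OBSERVED:', observed);
    const passed = observed.errorRate === 0;     // your pass criteria go here
    console.log(passed ? 'PASSED' : 'FAILED: go fix the weak point, then re-run');
  } finally {
    await removeFailure();                       // always clean up, even if measurement throws
  }
}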
Simple example:
Hypothesis: "If our Redis cache dies, the app falls back to database"
Baseline: p99 response time = 45ms (cache hit rate 85%)
Experiment: Kill Redis
Observation:
- Response time spiked to 800ms (expected: DB hit)
- Error rate: 0% ✓ (great, fallback works!)
- Response time recovered when Redis came back ✓
Result: PASSED. Cache is non-critical.
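For reference, here's roughly what the fallback path under test looks like. This is a minimal sketch, not our exact code; the getProduct helper and the ioredis/mysql2 setup are illustrative:

// Cache-aside read with a database fallback: if Redis is down or slow,
// log it and serve from the DB instead of failing the request.
const Redis = require('ioredis');
const mysql = require('mysql2/promise');

const redis = new Redis({ host: 'cache', connectTimeout: 500, maxRetriesPerRequest: 1 });
const db = mysql.createPool({ host: 'db', user: 'app', database: 'shop' });

async function getProduct(id) {
  try {
    const cached = await redis.get(`product:${id}`);
    if (cached) return JSON.parse(cached);
  } catch (err) {
    console.warn('Cache unavailable, falling back to DB:', err.message);
  }
  const [rows] = await db.execute('SELECT * FROM products WHERE id = ?', [id]);
  // Best-effort repopulation; ignore cache write failures entirely
  redis.set(`product:${id}`, JSON.stringify(rows[0]), 'EX', 300).catch(() => {});
  return rows[0];
}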
Easy win. Now let's try a harder one.
The Experiments That Changed Our Architecture
Experiment #1: Database Primary Goes Down
Hypothesis: Read replicas take over, writes queue up, no data loss.
# In staging: Kill the primary database
aws rds failover-db-cluster --db-cluster-identifier staging-cluster
# Observe what happens to your application
What we discovered:
// Our original connection code (BROKEN under failover)
const mysql = require('mysql2/promise');
const db = mysql.createPool({
host: 'primary-db.cluster.us-east-1.rds.amazonaws.com',
// Hardcoded primary! No failover handling!
});
// What we got during the experiment:
// Error: ECONNREFUSED - after 30 seconds of retrying primary
// 30 seconds of 500 errors. In production = thousands of angry users.
// Fixed version:
const db = mysql.createPool({
host: 'cluster-endpoint.cluster.us-east-1.rds.amazonaws.com', // cluster endpoint auto-routes!
connectTimeout: 5000, // Fail fast, don't hang
acquireTimeout: 5000,
waitForConnections: true,
connectionLimit: 20,
queueLimit: 0
});
// With retry logic:
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function queryWithRetry(sql, params, retries = 3) {
for (let i = 0; i < retries; i++) {
try {
return await db.execute(sql, params);
} catch (err) {
if (i === retries - 1) throw err;
const delay = Math.pow(2, i) * 100; // Exponential backoff
console.log(`DB query failed, retry ${i+1} in ${delay}ms`);
await sleep(delay);
}
}
}
Finding this in a 3-hour chaos experiment vs finding it during Black Friday: Not a hard choice.
Experiment #2: Slow External Payment API
This is the one that would have saved us $80,000.
// Chaos experiment: Add artificial latency to payment service calls
// Using a local proxy (toxiproxy is perfect for this)
// toxiproxy config:
{
"name": "payment-api",
"listen": "0.0.0.0:8001",
"upstream": "payment-processor.com:443",
"enabled": true
}
// Add latency toxic via API:
// POST /proxies/payment-api/toxics
{
"name": "payment-latency",
"type": "latency",
"attributes": {
"latency": 8000, // 8 second delay
"jitter": 2000 // ยฑ 2 seconds
}
}
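You can drive the same thing from a test script instead of curl; toxiproxy exposes an HTTP API on port 8474 by default. A rough sketch (the proxy and toxic names match the config above):

// Toggle the latency toxic from a Node chaos-test script via the toxiproxy API
const axios = require('axios');
const TOXIPROXY = 'http://localhost:8474';

async function addPaymentLatency() {
  await axios.post(`${TOXIPROXY}/proxies/payment-api/toxics`, {
    name: 'payment-latency',
    type: 'latency',
    attributes: { latency: 8000, jitter: 2000 }
  });
}

async function removePaymentLatency() {
  await axios.delete(`${TOXIPROXY}/proxies/payment-api/toxics/payment-latency`);
}

Wrap your load test between those two calls and you have a repeatable experiment instead of a one-off.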
What the experiment revealed:
WITHOUT timeout config:
✗ Payment calls hang for 8-10 seconds
✗ Connection pool fills up (20 connections waiting)
✗ All other endpoints start timing out
✗ Entire app is degraded because of ONE slow external service
✗ Classic cascading failure
WITH proper timeouts + circuit breaker:
✓ Payment call fails fast after 3 seconds
✓ Circuit breaker opens after 5 consecutive timeouts
✓ Fallback: Queue payment for async retry
✓ User gets "Order confirmed, payment processing" message
✓ Other endpoints unaffected
The fix was adding one thing:
// Before: Trusting external APIs to be fast
const result = await axios.post('https://payment-api.com/charge', data);
// After: Never trust external APIs
const result = await axios.post('https://payment-api.com/charge', data, {
timeout: 3000, // 3 second max. Period.
signal: AbortSignal.timeout(3000)
});
// Plus circuit breaker (we used opossum):
const CircuitBreaker = require('opossum');

// callPaymentApi wraps the axios call shown above
const callPaymentApi = (data) =>
  axios.post('https://payment-api.com/charge', data, { timeout: 3000 });

const breaker = new CircuitBreaker(callPaymentApi, {
timeout: 3000, // 3 second timeout
errorThresholdPercentage: 50, // Open if 50% of calls fail
resetTimeout: 30000 // Try again after 30 seconds
});
breaker.fallback(() => ({ status: 'queued', message: 'Payment processing' }));
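Calls then go through the breaker instead of hitting axios directly, and opossum's events give you visibility into state changes. A rough usage sketch; the /checkout route shape here is illustrative:

// Route the charge through the breaker: callers get either the real
// response or the 'queued' fallback, never a 10-second hang.
app.post('/checkout', async (req, res) => {
  const result = await breaker.fire({ orderId: req.body.orderId, amount: req.body.amount });
  res.json(result);
});

// Wire breaker state changes into your logs/metrics
breaker.on('open', () => console.warn('Payment circuit OPEN: failing fast'));
breaker.on('halfOpen', () => console.info('Payment circuit half-open: testing recovery'));
breaker.on('close', () => console.info('Payment circuit closed: back to normal'));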
Experiment #3: One Availability Zone Dies
When I said "I'm going to randomly terminate half our Lambda functions," this was it: simulating an AZ outage.
Architecture BEFORE chaos experiment:
┌───────────────────────────────────────┐
│ us-east-1a (ALL our Lambda functions) │
│ RDS Primary                           │
│ ElastiCache Primary                   │
└───────────────────────────────────────┘
Single AZ = Single point of failure
Architecture AFTER:
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ us-east-1a      │  │ us-east-1b      │  │ us-east-1c      │
│ Lambda (33%)    │  │ Lambda (33%)    │  │ Lambda (34%)    │
│ RDS Primary     │  │ RDS Standby     │  │ RDS Standby     │
│ Cache Primary   │  │ Cache Replica   │  │ Cache Replica   │
└─────────────────┘  └─────────────────┘  └─────────────────┘
         └────────────────────┬────────────────────┘
                        Load Balancer
The chaos experiment that revealed this: We terminated all Lambda functions in us-east-1a. 100% of traffic failed. That's when we realized everything was in one AZ.
Fixing it was embarrassingly simple: Multi-AZ was one Terraform config change. We just... never checked.
Tools for Chaos Engineering
For AWS (What I Use)
AWS Fault Injection Simulator (FIS):
{
"description": "Kill 50% of ECS tasks",
"targets": {
"ECSTasksToStop": {
"resourceType": "aws:ecs:task",
"resourceTags": { "Environment": "staging" },
"selectionMode": "PERCENT(50)"
}
},
"actions": {
"StopTasks": {
"actionId": "aws:ecs:stop-task",
"targets": { "Tasks": "ECSTasksToStop" }
}
},
"stopConditions": [
{
"source": "aws:cloudwatch:alarm",
"value": "ErrorRateTooHigh"
}
]
}
The stop condition is KEY. If error rate exceeds your threshold, chaos stops automatically. Safety first.
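One detail that bites people: the stop condition's value field expects the CloudWatch alarm's ARN, so the alarm has to exist before you run the template. A rough sketch of creating it with the AWS SDK v3; the namespace, metric, and threshold are placeholders for whatever "users are hurting" means in your system:

// Guardrail alarm the FIS stop condition points at (values are illustrative)
const { CloudWatchClient, PutMetricAlarmCommand } = require('@aws-sdk/client-cloudwatch');

const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

async function createGuardrailAlarm() {
  await cloudwatch.send(new PutMetricAlarmCommand({
    AlarmName: 'ErrorRateTooHigh',
    Namespace: 'AWS/ApplicationELB',          // swap for your own error-rate metric
    MetricName: 'HTTPCode_Target_5XX_Count',
    Statistic: 'Sum',
    Period: 60,
    EvaluationPeriods: 1,
    Threshold: 50,                            // tune to your own tolerance for breakage
    ComparisonOperator: 'GreaterThanThreshold'
  }));
}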
For Local Development
Toxiproxy, a network condition simulator:
# Install
docker run -d --name toxiproxy -p 8474:8474 -p 8001-8010:8001-8010 shopify/toxiproxy
# Create proxy
curl -X POST http://localhost:8474/proxies \
-d '{"name":"database","listen":"0.0.0.0:5433","upstream":"postgres:5432"}'
# Add latency
curl -X POST http://localhost:8474/proxies/database/toxics \
-d '{"name":"latency","type":"latency","attributes":{"latency":500}}'
# Cut bandwidth to zero (nothing gets through; effectively a dead connection)
curl -X POST http://localhost:8474/proxies/database/toxics \
-d '{"name":"loss","type":"bandwidth","attributes":{"rate":0}}'
Your app talks to localhost:5433 instead of Postgres directly. You control the chaos.
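In a chaos test you point the app's connection config at that proxy port and assert the behavior you expect. A tiny sketch assuming a Postgres app using the pg library; the timeouts are placeholders:

// The pool talks to toxiproxy (5433), not Postgres (5432), so the toxics apply
const { Pool } = require('pg');

const pool = new Pool({
  host: 'localhost',
  port: 5433,                       // the toxiproxy listener created above
  connectionTimeoutMillis: 2000,    // fail fast instead of hanging when latency is injected
  query_timeout: 2000
});

async function probeUnderChaos() {
  const started = Date.now();
  await pool.query('SELECT 1');
  console.log(`Query completed in ${Date.now() - started}ms under induced latency`);
}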
The Chaos Checklist I Run Every Quarter
Network failures:
☐ Service A can't reach Service B
☐ External API returns 500s
☐ External API is slow (8+ seconds)
☐ DNS resolution fails
Infrastructure failures:
☐ Primary database goes down
☐ Cache layer dies (Redis/ElastiCache)
☐ Message queue is unavailable
☐ 50% of application instances terminate
Resource exhaustion:
☐ Database connection pool fills up
☐ Memory pressure (kill -9 when OOM)
☐ CPU spike on dependent service
☐ Disk fills up on logging node
Data failures:
☐ Malformed responses from external APIs
☐ Unexpected null values in database
☐ Schema mismatch between services
Every item on this list that breaks something is a production incident waiting to happen.
Common Chaos Findings (And How to Fix Them)
Finding #1: No Timeouts on External Calls
The symptom: Kill downstream service → entire app hangs
The fix:
// NEVER do this:
const data = await externalApi.fetch(url);
// ALWAYS do this:
const data = await Promise.race([
externalApi.fetch(url),
new Promise((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), 3000)
)
]);
Finding #2: Retry Storms
The symptom: Service comes back up → immediately gets hammered by retries from everyone → goes back down
The fix: Exponential backoff with jitter
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retry(fn, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
if (attempt === maxRetries) throw err;
// Exponential backoff: 100ms, 200ms, 400ms...
// Jitter: Add random 0-100ms so not all clients retry simultaneously
const backoff = Math.pow(2, attempt - 1) * 100;
const jitter = Math.random() * 100;
await sleep(backoff + jitter);
}
}
}
Finding #3: Synchronous Chain of Death
User Request
  → Service A (10ms)
    → Service B (50ms)
      → Service C (100ms)
        → External API (slow: 8000ms!)
→ Entire chain hangs
The fix: Async where possible, timeouts everywhere, fallbacks for non-critical paths
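A rough sketch of what that looks like in a handler: the critical call gets a hard timeout, and non-critical work is fired off without awaiting it. The orders/analytics endpoints here are illustrative:

// Only the critical call is allowed to block the response
app.get('/order/:id', async (req, res) => {
  // Critical path: the order itself, with a hard timeout
  const order = await axios.get(`https://orders.internal/orders/${req.params.id}`, { timeout: 2000 });

  // Non-critical path: analytics is fire-and-forget; a slow analytics
  // service should never make the order page hang
  axios.post('https://analytics.internal/event', { type: 'order_view' }, { timeout: 1000 })
    .catch((err) => console.warn('Non-critical analytics call failed:', err.message));

  res.json(order.data);
});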
Finding #4: No Health Checks
The symptom: Service crashes → load balancer keeps sending traffic to it → 50% of requests fail
The fix:
// Express health check endpoint
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1'); // Check DB
await redis.ping(); // Check cache
res.json({
status: 'healthy',
uptime: process.uptime(),
timestamp: Date.now()
});
} catch (err) {
res.status(503).json({ status: 'unhealthy', error: err.message });
}
});
When NOT to Run Chaos Experiments
As a Technical Lead, I've learned: chaos engineering has rules.
Never chaos in production without:
- Proper monitoring in place (you need to see the impact)
- Automated stop conditions (error rate too high → stop experiment)
- Team awareness (don't chaos at 2pm on a Friday, and TELL your team)
- Rollback plan (can you undo the change in 30 seconds?)
- Starting in staging first
Never chaos:
- During peak traffic (unless that's the point, and even then, carefully)
- When the team is already fighting an incident
- On databases without backup verification
- Without understanding the blast radius
Start here:
- Staging environment, controlled conditions
- Non-critical services first (logging, analytics)
- Work up to critical paths
- Only then consider (carefully) production experiments
The ROI of Breaking Things on Purpose
A resilience lesson that cost us: That $80,000 payment outage I mentioned? We could have prevented it with a 3-hour chaos experiment in staging. Instead, we found the failure mode when it cost us real money and real customer trust.
After we implemented regular chaos engineering:
- Production incidents down 60%
- Mean time to recovery down 75% (we'd practiced)
- Black Friday: Zero SEV1 incidents for the first time ever
- On-call engineer actually slept
The math is simple: 3 hours of controlled chaos > 6 hours of panicked production debugging + $80,000 revenue loss.
TL;DR
Chaos engineering in one sentence: Find your system's breaking points in a controlled experiment before your users find them for you.
The core practice:
- Form a hypothesis (system should handle X failure)
- Baseline your metrics (know what "normal" looks like)
- Inject controlled failure (start small, staging first)
- Observe and fix (most findings are embarrassingly fixable)
- Run regularly (systems change, failure modes change)
Tools: AWS FIS for production, Toxiproxy for local, Chaos Monkey for the brave.
The uncomfortable truth: Every production system has failure modes its owners don't know about. Chaos engineering is just deciding whether you'll discover them on your schedule or your users' schedule.
When designing our e-commerce backend's resilience strategy, chaos engineering wasn't optional; it was the only way to know if our circuit breakers, timeouts, and failover configs actually worked. Reading the docs and writing the code isn't enough. You have to break it to trust it.
Your users are going to find your weaknesses eventually. Better you find them first.
Do you have chaos engineering horror stories? Hit me up on LinkedIn; I've got more where these came from.
Want to see resilience patterns in code? Check my GitHub; I've open-sourced some of the circuit breaker patterns we use in production.
Now go break something. Intentionally. With a plan.
P.S. The first time I ran a chaos experiment in production, I forgot to tell the on-call engineer. At 3pm he got a PagerDuty alert, opened his laptop in a panic, and started debugging something I had deliberately caused. I owe him a beer. Or twelve.