# Kubernetes Health Probes: Stop Routing Traffic to Dead Pods
Let me paint you a picture.
It's Sunday afternoon. Your app is deployed. `kubectl get pods` looks pristine: every pod shows Running. You're sipping coffee, feeling like an infrastructure deity. Then Slack explodes.
"Why are 30% of requests returning 502?" "The app is completely down for some users!" "Did you deploy something??"
You frantically check. Pods: Running. Deployments: Available. Load balancer: green. Everything looks fine. Except... it isn't.
The culprit? Kubernetes was cheerfully routing traffic to pods that were alive in name only: their processes had deadlocked, their database connections had vanished, and they were silently swallowing every request into the void.
The fix was two dozen lines of YAML I should have written on day one. Welcome to Kubernetes health probes, the feature that makes Kubernetes actually smart about your app's health, not just whether the container PID is alive.
## What Even ARE Health Probes?
Kubernetes runs three types of probes:
| Probe | Question it answers | Failure action |
|---|---|---|
| Liveness | "Is this pod still functioning?" | Restart the container |
| Readiness | "Is this pod ready to serve traffic?" | Remove from Service endpoints |
| Startup | "Has the app finished starting up?" | Restart the container (liveness/readiness stay paused until it passes) |
Without probes, Kubernetes only knows if the container process is running. With probes, it knows if your app is actually doing its job. Big difference.
## The Horror Story in Detail
Here's what was happening in our incident. We had a Node.js API that connected to PostgreSQL. Under heavy load, the connection pool would exhaust and the app would enter a zombie state: process alive, but every request timing out waiting for a DB connection.
- Process status: RUNNING ✅
- Container: RUNNING ✅
- Kubernetes: "All good, routing traffic!" ✅
- Actual app: DEAD ❌
Without a liveness probe, Kubernetes had zero idea. It kept routing traffic to the three zombie pods for 47 minutes before an on-call engineer manually deleted them.
47 minutes. Of 502s. On a Sunday.
## Liveness Probes: "Are You Actually Alive?"
A liveness probe tells Kubernetes to restart a container if it fails. Use it for detecting deadlocks, infinite loops, or anything where the app is stuck but the process hasn't died.
### HTTP Liveness Probe (Most Common)
```yaml
# deployment.yaml
spec:
  containers:
    - name: api
      image: myapp:latest
      livenessProbe:
        httpGet:
          path: /health/live
          port: 3000
        initialDelaySeconds: 10   # Wait 10s before first check
        periodSeconds: 15         # Check every 15s
        failureThreshold: 3       # Restart after 3 consecutive failures
        timeoutSeconds: 5         # Fail if no response in 5s
```
And in your Node.js app, the `/health/live` endpoint should be dead simple: just confirm the process is responsive:
```javascript
// routes/health.js
app.get('/health/live', (req, res) => {
  // Don't check the DB here -- that's for readiness!
  // Just confirm the event loop is alive.
  res.status(200).json({ status: 'alive', timestamp: Date.now() });
});
```
The golden rule: your liveness endpoint should NEVER check external dependencies (DB, Redis, external APIs). If your DB goes down, you don't want Kubernetes restarting all your pods; you want them to stop receiving new traffic while they wait for the DB to recover. That's readiness probe territory.
## Readiness Probes: "Are You Ready for Traffic?"
A readiness probe tells Kubernetes to remove the pod from the Service load balancer if it fails, without restarting it. Use it for:
- Waiting for DB connections to establish on startup
- Temporarily pulling a pod from rotation during heavy processing
- Graceful degradation when a dependency is down
```yaml
spec:
  containers:
    - name: api
      image: myapp:latest
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 3000
        initialDelaySeconds: 5   # Check sooner than liveness
        periodSeconds: 10
        failureThreshold: 2      # Faster to pull from rotation
        successThreshold: 1      # One success to re-add
        timeoutSeconds: 3
```
Your readiness endpoint CAN (and should) check dependencies:
```javascript
app.get('/health/ready', async (req, res) => {
  const checks = {};

  // Check DB connectivity
  try {
    await db.raw('SELECT 1');
    checks.database = 'ok';
  } catch (err) {
    checks.database = 'unavailable';
  }

  // Check Redis if you use it
  try {
    await redis.ping();
    checks.cache = 'ok';
  } catch (err) {
    checks.cache = 'unavailable';
  }

  const healthy = Object.values(checks).every(v => v === 'ok');
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ready' : 'not ready',
    checks,
  });
});
```
When the DB goes down: readiness fails → pod removed from load balancer → users hit only healthy pods → no 502s. Beautiful.
## Startup Probes: The Slow-Starting App's Best Friend
Some apps take forever to start: JVM warm-up, database migrations, loading ML models. Without a startup probe, your liveness probe fires during startup and restarts the pod in an infinite loop of shame.
```yaml
spec:
  containers:
    - name: java-api
      image: my-spring-boot-app:latest
      startupProbe:
        httpGet:
          path: /actuator/health
          port: 8080
        failureThreshold: 30   # Allow up to 30 * 10s = 5 minutes to start
        periodSeconds: 10
      livenessProbe:
        httpGet:
          path: /actuator/health/liveness
          port: 8080
        periodSeconds: 15
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /actuator/health/readiness
          port: 8080
        periodSeconds: 10
        failureThreshold: 2
```
How it works: while the startup probe is active, liveness and readiness probes are paused. Once startup succeeds, the other probes take over. Slow app? No problem. Still stuck after 5 minutes? Something's very wrong; restart it.
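The same idea can be mirrored inside the app: keep readiness failing until your own startup work is done. A minimal sketch, where the init steps are placeholders for whatever your app actually does (migrations, cache warm-up, model loading):

```javascript
// readiness-gate.js -- sketch: app-level startup gating for the readiness probe.
let ready = false;

async function initialize(steps) {
  // Run each startup step in order, e.g. migrations, then cache warm-up.
  for (const step of steps) {
    await step();
  }
  ready = true; // only now should /health/ready start returning 200
}

function readinessHandler(req, res) {
  if (ready) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'starting' });
  }
}

module.exports = { initialize, readinessHandler, isReady: () => ready };
```

Wired up as the `/health/ready` handler, this keeps the pod out of rotation until initialization finishes, while the startup probe keeps liveness from firing.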
## Real-World Lessons Learned (The Painful Way)
**Lesson 1: Don't make your liveness probe check the DB.**
We did this. Our DB had a 2-minute maintenance window. Kubernetes restarted ALL 8 pods simultaneously. Cold boot with no warm cache. Latency spiked to 10 seconds. Users thought we were down. We were down.
**Lesson 2: Set `initialDelaySeconds` based on real measurements.**
Too short: pod gets killed before it finishes starting up → restart loop → CrashLoopBackOff → midnight pages. Too long: broken pods serve traffic for too long before detection. Profile your actual startup time and add a 50% buffer.
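The "measure, then add 50%" rule fits in a tiny helper; the round-up policy here is my own assumption:

```javascript
// initial-delay.js -- sketch: size initialDelaySeconds from a measured startup time.
function recommendedInitialDelaySeconds(measuredStartupSeconds) {
  if (!(measuredStartupSeconds > 0)) {
    throw new RangeError('startup time must be a positive number of seconds');
  }
  // Measured time plus a 50% buffer, rounded up so we never undershoot.
  return Math.ceil(measuredStartupSeconds * 1.5);
}

module.exports = { recommendedInitialDelaySeconds };
```

A service measured at 10s startup would get `initialDelaySeconds: 15`.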
**Lesson 3: Use `successThreshold` > 1 for readiness if you have a flaky dependency.**
If Redis glitches for one second, you don't want pods yanked from rotation and re-added every 10 seconds (thundering herd alert!). Setting `successThreshold: 2` means it needs two consecutive successes to come back, which is much smoother.
**Lesson 4: Probe endpoints need to be fast.**
We had a readiness probe that ran 5 database queries. Under load, those queries would slow down, the probe would time out, pods would leave rotation, traffic concentrated on fewer pods, they slowed down more, and the whole thing snowballed. One lightweight `SELECT 1` is enough.
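One way to enforce that: give every dependency check a hard deadline, so a slow database makes the probe fail fast instead of hanging until the probe's own timeout. A sketch; the 500 ms budget is an assumption, and `db.raw` matches the Knex-style client used earlier:

```javascript
// probe-timeout.js -- sketch: cap dependency checks with a hard deadline.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`check timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clean up the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// The 500ms budget is illustrative -- keep it well under the probe's timeoutSeconds.
async function checkDatabase(db) {
  try {
    await withTimeout(db.raw('SELECT 1'), 500);
    return 'ok';
  } catch (err) {
    return 'unavailable';
  }
}

module.exports = { withTimeout, checkDatabase };
```

Dropped into the `/health/ready` handler in place of the bare `await db.raw('SELECT 1')`, this guarantees the endpoint answers within its budget even when the pool is exhausted.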
## The Complete Production Template
Here's the battle-tested probe configuration I use for Node.js APIs today:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 3000
          startupProbe:
            httpGet:
              path: /health/live
              port: 3000
            failureThreshold: 12   # 12 * 5s = 60s max startup time
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 0   # Startup probe handles the wait
            periodSeconds: 15
            failureThreshold: 3
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 0
            periodSeconds: 10
            failureThreshold: 2
            successThreshold: 2
            timeoutSeconds: 3
```
Three probes, zero zombie pods. That's the deal. ✅
## Quick Wins Checklist
**Right now:**
- Check `kubectl get pods` - do any show `Running` but behave oddly? You have no probes.
- Add a `/health/live` endpoint that returns 200 if the process is responsive.
- Add a `/health/ready` endpoint that checks your critical dependencies.
- Wire up all three probes to your deployments.

**This week:**
- Measure actual startup time for each service and set `initialDelaySeconds` accordingly.
- Test probe failure by intentionally killing the DB and watching readiness fail gracefully.
- Add probe metrics to your Grafana dashboard (the kubelet exposes probe results as Prometheus metrics).

The test that matters most: manually `kubectl exec` into a pod and kill your DB connection. Within `failureThreshold * periodSeconds` seconds, that pod should disappear from `kubectl get endpoints`. If it does, your probes work. If it doesn't, back to the YAML!
## Wrapping Up
Health probes are the difference between Kubernetes being a "container restarter" and an actual self-healing platform. Without them, you're flying blind, trusting that a running process equals a working app. With them, Kubernetes becomes genuinely intelligent about your service's health.
The 47-minute Sunday incident? After adding probes, the next DB connection exhaustion event was handled automatically in under 45 seconds. No pages. No 502s. Just Kubernetes quietly rotating traffic around the struggling pod until it recovered.
That's the dream. Go write your health endpoints.
Still debugging mysterious 502s? Find me on LinkedIn; I've seen too many variations of this exact incident to count.
Want to see probe configurations from real projects? My GitHub has production Kubernetes manifests with full health probe setups.
Now go forth and let no zombie pod eat your traffic again!