Kubernetes Health Probes: Stop Sending Traffic to Dead Pods 💀🔍
True story from 3 AM on-call rotation:
My Node.js API had a memory leak. The pod was running (Kubernetes thought everything was fine), but it was returning 500 errors on every request. The process hadn't crashed - it was just completely broken. Kubernetes? Blissfully routing traffic to my zombie pod for 8 minutes until I woke up to 400 Slack alerts.
Me: "Why is Kubernetes sending traffic to a broken pod?!"
Kubernetes: "You never told me it was broken 🤷"
Me: 😡🤦‍♂️
That was the night I finally learned health probes properly. Let me save you the same pain.
The Three Probes You Need to Know 🎯
Kubernetes has three health probe types, and they all do different things:
| Probe | Job | What Happens on Failure |
|---|---|---|
| Liveness | "Is this pod alive?" | Pod gets killed & restarted |
| Readiness | "Is this pod ready for traffic?" | Pod removed from Service endpoints |
| Startup | "Has this pod finished starting?" | Delays liveness/readiness checks |
The key insight: These are completely independent. A pod can be alive but NOT ready. A pod can be ready but about to die. Each probe serves a different purpose!
Liveness Probe: Kill the Zombie 🧟
The liveness probe answers one question: "Is your app still functioning, or is it a zombie?"
If it fails, Kubernetes kills the pod and starts a new one. This is your escape hatch from infinite hangs, deadlocks, and memory corruption.
The classic mistake I see everywhere:
```yaml
# deployment.yaml - THE WRONG WAY
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 3   # 3 seconds is NEVER enough
  periodSeconds: 5         # Checking every 5s is too aggressive
  failureThreshold: 1      # ONE failure = pod restart? RIP your app
```
What happens with this config:
- Pod starts; the app needs 15 seconds to warm up
- Kubernetes checks at 3 seconds → 503 (still starting) → RESTART
- New pod starts, Kubernetes checks at 3 seconds again → 503 → RESTART
- Repeat forever. Your app never starts. 🔄
The right way:
```yaml
# deployment.yaml - THE RIGHT WAY
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  template:
    spec:
      containers:
        - name: api
          image: my-api:1.0.0
          # Liveness: is the app stuck/crashed?
          livenessProbe:
            httpGet:
              path: /healthz/live   # Different endpoint from readiness!
              port: 3000
            initialDelaySeconds: 30 # Give it time to start
            periodSeconds: 10       # Check every 10s
            failureThreshold: 3     # 3 consecutive failures = restart
            timeoutSeconds: 5       # Wait up to 5s for response
            successThreshold: 1     # 1 success = healthy (must be 1 for liveness)
          # Readiness: is the app ready to take traffic?
          readinessProbe:
            httpGet:
              path: /healthz/ready
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
            timeoutSeconds: 3
            successThreshold: 2     # Need 2 successes to be "ready"
```
What changed:
- `initialDelaySeconds: 30` — give the app time to actually start
- `failureThreshold: 3` — 3 strikes before restart (transient errors happen!)
- `timeoutSeconds: 5` — your health endpoint should respond FAST
- Separate paths for liveness vs readiness (different logic!)

In steady state, this config detects and restarts a genuinely stuck pod within roughly 30–45 seconds: 3 failed checks spaced 10 seconds apart, each allowed up to 5 seconds to respond.
Readiness Probe: The Traffic Gatekeeper 🚦
While liveness handles zombie detection, readiness handles a different question: "Should this pod receive traffic right now?"
Readiness failures don't restart the pod — they just quietly remove it from the load balancer. Perfect for:
- Pods still warming up their cache
- Pods temporarily overwhelmed
- Pods waiting for a database connection
- Rolling deployments (new pod isn't ready? Old pod keeps getting traffic!)
The Node.js health endpoint I actually use in production:
```javascript
// healthz.js - Two separate health endpoints

// Liveness: just "is the process responsive?"
app.get('/healthz/live', (req, res) => {
  // Keep this SIMPLE. If you can respond, you're alive.
  // Don't check DB here - a DB outage shouldn't restart every pod!
  res.status(200).json({ status: 'alive', uptime: process.uptime() });
});

// Readiness: "are you actually ready to handle real traffic?"
app.get('/healthz/ready', async (req, res) => {
  const checks = {
    database: false,
    cache: false,
    warmedUp: false,
  };

  try {
    // Check DB connection
    await db.query('SELECT 1');
    checks.database = true;

    // Check Redis
    await redis.ping();
    checks.cache = true;

    // Check if we're warmed up (e.g., cache populated)
    checks.warmedUp = appState.isWarmedUp;

    const allHealthy = Object.values(checks).every(Boolean);
    if (allHealthy) {
      return res.status(200).json({ status: 'ready', checks });
    }

    // Not ready yet - remove from load balancer, but DON'T restart
    return res.status(503).json({ status: 'not ready', checks });
  } catch (err) {
    return res.status(503).json({ status: 'not ready', error: err.message });
  }
});
```
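The `appState.isWarmedUp` flag in the readiness check is app-specific state, not a Kubernetes feature. Here's one hedged sketch of how such a flag might be maintained at boot (all names — `appState`, `warmCache`, `warmUp` — are illustrative, not a library API):

```javascript
// Hypothetical warm-up state referenced by the readiness check.
const appState = { isWarmedUp: false };

async function warmCache() {
  // Stand-in for preloading hot data (reference tables, config, etc.).
  // A real app would hit the DB or Redis here.
  return ['reference-data'];
}

async function warmUp() {
  const data = await warmCache();
  // Flip the readiness flag only once the cache is actually populated,
  // so /healthz/ready keeps returning 503 until this completes.
  appState.isWarmedUp = data.length > 0;
  return appState.isWarmedUp;
}
```

Call `warmUp()` once after the server starts listening; until it resolves, the readiness probe keeps the pod out of the Service endpoints without restarting it.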
The most important lesson here: Your liveness probe should be dead simple. If your liveness probe checks the database and the database goes down, Kubernetes will restart ALL your pods in a loop. That's called a thundering herd and it's how a DB outage becomes a complete service meltdown. 💥
Keep liveness simple. Put the dependency checks in readiness.
Startup Probe: For the Slow Starters 🐢
Some apps are slow to start (Spring Boot, I'm looking at you 👀). You could set `initialDelaySeconds: 120` on the liveness probe... but then a truly crashed pod won't be restarted for 2 minutes.
Enter the startup probe — it runs INSTEAD of liveness/readiness until it succeeds, then hands off:
```yaml
# For slow-starting apps (JVM, large Python services, etc.)
startupProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  # Allow up to 5 minutes to start (30 attempts × 10s each)
  failureThreshold: 30
  periodSeconds: 10

# These only activate AFTER startupProbe succeeds
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```
How it works:
- Pod starts → startup probe kicks in (up to 5 min to succeed)
- Startup probe succeeds → liveness + readiness probes activate
- No more "restarting before it even started" madness!
Before I discovered startup probes, I'd set `initialDelaySeconds: 180` for slow Spring Boot apps. Now deployments that hit a slow start just wait patiently instead of thrashing. My ops team's blood pressure dropped noticeably. ❤️
The Real-World Rolling Deployment Win 🚀
Here's where readiness probes shine brightest. Without them, rolling deployments are a gamble:
```yaml
# Without readiness probes:
#   - Kubernetes starts new pod
#   - Kubernetes sees pod running
#   - Kubernetes routes traffic immediately
#   - Pod is still warming up → 502 errors!
#   - Users are unhappy 😤
#
# With readiness probes:
#   - Kubernetes starts new pod
#   - Kubernetes routes traffic to OLD pods while new pod warms up
#   - New pod's readiness probe passes
#   - Only NOW does Kubernetes shift traffic to new pod
#   - Zero user-visible errors! ✅
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1        # Spin up 1 new pod before killing old
    maxUnavailable: 0  # Never have zero ready pods
```
Combined with readiness probes, `maxUnavailable: 0` means there's ALWAYS a ready pod serving traffic. This is how you do true zero-downtime deployments.
The deployment that changed my career: After a painful 5-minute outage during a routine deploy (no probes, `maxUnavailable: 1`), I added readiness probes and `maxUnavailable: 0`. The next 200 deployments? Zero downtime. Every single one. 🎯
Quick Wins Checklist ✅
Before you close this tab:
```shell
# Check if your existing deployments have probes configured:
kubectl get deployments -n your-namespace -o json | \
  jq '.items[] | {name: .metadata.name,
      liveness: .spec.template.spec.containers[0].livenessProbe,
      readiness: .spec.template.spec.containers[0].readinessProbe}'

# Watch probe failures in real-time:
kubectl describe pod <pod-name> | grep -A 10 "Events:"

# Check which pods are currently ready vs not ready:
kubectl get pods -n your-namespace -o wide
```
If you see `null` for either probe — that deployment is living dangerously. 🎲
Lessons Learned the Hard Way 📚
After running Kubernetes in production for years:
- Never check external dependencies in liveness probes — DB down → all pods restart → worse outage
- Set `failureThreshold` to at least 3 — one slow response shouldn't kill a pod
- Use startup probes for anything that takes > 30 seconds to start
- Your `/healthz/live` endpoint should respond in under 100ms — it's called every 10 seconds
- Test probe failures locally with `kubectl exec` before deploying to prod
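One way to honor the "respond fast" rule: bound each dependency check with its own timeout, so the handler always answers before the probe's `timeoutSeconds` expires. A minimal sketch using the `Promise.race` pattern (the helper names `withTimeout` and `checkDatabase` are illustrative, not from a library):

```javascript
// Bound any dependency check so the probe handler never hangs
// past Kubernetes' own timeout. Illustrative helper, not a library API.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`check timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage inside a readiness check: a hanging DB now fails in 2 seconds
// (comfortably under timeoutSeconds: 3) instead of stalling the probe.
async function checkDatabase(db) {
  try {
    await withTimeout(db.query('SELECT 1'), 2000);
    return true;
  } catch {
    return false;
  }
}
```

A check that would otherwise hang now flips readiness cleanly, rather than making Kubernetes count the whole probe as timed out.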
The configuration takes 20 minutes to add. The downtime it prevents is measured in hours.
Your Action Plan 🎯
This afternoon:
- List all deployments with missing probes (use the `kubectl` command above)
- Add a `/healthz/live` endpoint to your apps (simple, no dependencies)
- Add a `/healthz/ready` endpoint (check actual dependencies)
- Deploy with proper probe config + `maxUnavailable: 0`
This week:
- Add startup probes to any slow-starting services
- Set up alerting on `kubectl get events` for `Liveness probe failed`
- Test your probes actually work by killing the DB and watching readiness fail (without a pod restart!)
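You can also rehearse that "DB down" failure mode locally before touching a real cluster. A hedged sketch: `readyCheck` mirrors the readiness handler's logic as a plain function so it can be exercised against stubbed dependencies (all names here are illustrative):

```javascript
// Framework-free version of the readiness logic, so the "DB down"
// path can be unit-tested. Illustrative names throughout.
async function readyCheck(db) {
  try {
    await db.query('SELECT 1');
    return { statusCode: 200, status: 'ready' };
  } catch (err) {
    // "Not ready" means the pod leaves the load balancer -
    // nothing restarts it.
    return { statusCode: 503, status: 'not ready', error: err.message };
  }
}

// Stubs: one healthy DB, one that refuses connections.
const healthyDb = { query: async () => [{ ok: 1 }] };
const deadDb = {
  query: async () => { throw new Error('connection refused'); },
};
```

Running `readyCheck(deadDb)` should report 503-style "not ready" — exactly the behavior you want to confirm before trusting it in prod.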
This month:
- Document your probe strategy in a runbook
- Add probe health checks to your CI/CD pipeline validation
- Never get paged at 3 AM for zombie pod traffic again 🎉
Still routing traffic to zombie pods? Find me on LinkedIn — I've been there and have the on-call scars to prove it.
More Kubernetes configs and battle-tested patterns on GitHub.
Now go add those probes before your next deploy! 🚀
P.S. The 8-minute zombie pod incident? Turned out to be a memory leak in a third-party library. The probe I added took 10 minutes to configure. Worth it 1000x over. ⚡