Kubernetes Health Probes: Stop Letting Broken Pods Pretend They're Healthy π©Ίπ₯
Kubernetes Health Probes: Stop Letting Broken Pods Pretend They're Healthy π©Ίπ₯
True story: I once spent three hours debugging a production incident where our API was returning 500s to every user. Kubernetes dashboard showed all pods: Running. Zero crashes. Zero restarts. Everything looked perfect.
The culprit? Our Node.js app had a database connection pool that silently exhausted itself and froze. The pod was alive. The app was not. Kubernetes had no idea.
Me at 2am: "Why is everything broken?!"
Kubernetes: "All pods are Running! π’"
Me: π€¬
That incident introduced me to health probes β the feature that would've saved me those three hours (and my sleep schedule).
The Three Probes You Need to Know π
Kubernetes gives you three types of health checks. They're simple, powerful, and criminally underused:
| Probe | Question It Answers | Action on Failure |
|---|---|---|
| Liveness | "Is this pod alive?" | Restart the container |
| Readiness | "Is this pod ready for traffic?" | Remove from load balancer |
| Startup | "Has this pod finished starting up?" | Give it more time before checking |
Think of it like a restaurant:
- Startup probe = "Are you even open yet?"
- Readiness probe = "Are you ready to seat customers?"
- Liveness probe = "Are you still in business?"
The Problem: Kubernetes Thinks "Running" Means "Healthy"
Without probes, Kubernetes only knows your pod is healthy if the main process is still alive. That's it.
What Kubernetes CAN'T detect without probes:
β Your app started but is in an infinite initialization loop
β Your database pool exhausted and all queries are timing out
β Your app started but hasn't finished loading config files yet
β Your Node.js event loop is completely blocked
β Your app is "running" but returning 503 to every request
β Your app deadlocked 30 minutes after startup
The pod keeps running. Traffic keeps flowing in. Users keep screaming. You keep wondering why everything is on fire. π₯
Liveness Probe: Kill It If It's Broken
A liveness probe answers: "Is my app fundamentally broken and needs a restart?"
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-api
spec:
template:
spec:
containers:
- name: api
image: my-api:latest
ports:
- containerPort: 3000
livenessProbe:
httpGet:
path: /health/live # A SIMPLE endpoint, not your full health check
port: 3000
initialDelaySeconds: 10 # Wait 10s after container starts
periodSeconds: 15 # Check every 15 seconds
failureThreshold: 3 # Restart after 3 consecutive failures
timeoutSeconds: 5 # Fail if no response in 5s
The /health/live endpoint should be dead simple:
// Don't check database, cache, or external services here!
// Just verify the process itself is alive and responsive
app.get('/health/live', (req, res) => {
res.json({ status: 'alive', timestamp: Date.now() });
});
Critical rule: Your liveness endpoint must be stupid simple. If it checks the database and the database is down, Kubernetes will restart your pod in a loop forever β even though a restart won't fix the database! π
I learned this the hard way when our liveness probe checked Redis, Redis had a blip, and Kubernetes helpfully restarted all 10 pods simultaneously. Instant outage. My pager went ballistic.
Readiness Probe: "Don't Send Traffic Yet!"
A readiness probe answers: "Is my app ready to handle real requests?"
This is the probe that saved my 2am incident. If readiness fails, Kubernetes removes the pod from the load balancer. Users stop getting errors. The pod stays running and can recover.
readinessProbe:
httpGet:
path: /health/ready # Thorough check - DB, cache, dependencies
port: 3000
initialDelaySeconds: 5 # Start checking after 5s
periodSeconds: 10 # Check every 10 seconds
failureThreshold: 3 # Remove from rotation after 3 failures
successThreshold: 2 # Needs 2 successes to get back in rotation
timeoutSeconds: 3
Your /health/ready endpoint can (and should) be thorough:
app.get('/health/ready', async (req, res) => {
const checks = {
database: false,
redis: false,
timestamp: Date.now()
};
try {
// Check database connection
await db.raw('SELECT 1');
checks.database = true;
} catch (err) {
// Database is down - we're NOT ready
}
try {
// Check Redis
await redis.ping();
checks.redis = true;
} catch (err) {
// Cache is down - we're NOT ready
}
const isReady = checks.database && checks.redis;
const statusCode = isReady ? 200 : 503;
res.status(statusCode).json({
status: isReady ? 'ready' : 'not_ready',
checks
});
});
The key insight: Readiness failure = "take me out of rotation, I'll recover." Liveness failure = "I'm broken, restart me." They answer completely different questions!
Startup Probe: For Slow-Starting Apps
Got a Java Spring app that takes 90 seconds to start? A Laravel app loading 50MB of config? Without a startup probe, your liveness probe will kill the container before it even finishes booting!
startupProbe:
httpGet:
path: /health/live
port: 3000
failureThreshold: 30 # Allow up to 30 failures
periodSeconds: 10 # Check every 10 seconds
# = Up to 300 seconds (5 minutes) to start!
# Liveness only kicks in AFTER startup probe succeeds
livenessProbe:
httpGet:
path: /health/live
port: 3000
periodSeconds: 15
failureThreshold: 3
Think of startup probe as: "Give this container a chance to get its act together before I start judging it."
Without it, this is what happens to slow-starting apps:
T+0s: Container starts
T+10s: Liveness probe: FAIL (app not up yet)
T+25s: Liveness probe: FAIL
T+40s: Liveness probe: FAIL (3 failures = RESTART!)
T+40s: Container restarted...
T+50s: Liveness probe: FAIL
π Infinite restart loop!
With startup probe:
T+0s: Container starts
T+10s: Startup probe: fail (ok, still starting)
T+60s: Startup probe: fail (still ok!)
T+90s: Startup probe: SUCCESS β
T+90s: Liveness probe now takes over
T+105s: Liveness probe: success β
T+120s: Liveness probe: success β
App is healthy and serving traffic!
Real Production Configuration (Battle-Tested)
Here's the complete setup I use for Node.js APIs in production, refined after multiple incidents:
apiVersion: apps/v1
kind: Deployment
metadata:
name: backend-api
spec:
replicas: 3
selector:
matchLabels:
app: backend-api
template:
metadata:
labels:
app: backend-api
spec:
containers:
- name: api
image: backend-api:latest
ports:
- containerPort: 3000
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
# 1. Startup probe: generous time to boot
startupProbe:
httpGet:
path: /health/live
port: 3000
failureThreshold: 12 # 12 * 10s = 2 minutes to start
periodSeconds: 10
# 2. Liveness probe: basic "are you alive" check
livenessProbe:
httpGet:
path: /health/live
port: 3000
initialDelaySeconds: 0 # Startup probe handles the delay
periodSeconds: 20
failureThreshold: 3
timeoutSeconds: 5
# 3. Readiness probe: full dependency check
readinessProbe:
httpGet:
path: /health/ready
port: 3000
initialDelaySeconds: 0
periodSeconds: 10
failureThreshold: 3
successThreshold: 1
timeoutSeconds: 5
Why this works in production:
- Startup probe gives the app 2 minutes to boot (no infinite restart loops)
- Liveness is lightweight (won't accidentally trigger mass restarts)
- Readiness is thorough (catches real dependency failures)
- Resources are set so OOMKills don't mask probe failures
The Mistakes I Made (So You Don't Have To) πͺ€
Mistake #1: Using the same endpoint for liveness and readiness
# WRONG: Both probes checking /health with DB check
livenessProbe:
httpGet:
path: /health # β Checks database!
readinessProbe:
httpGet:
path: /health # β Also checks database!
Database has a 5-second blip β Readiness fails (good!) β Liveness also fails β Pod restarts β Database blip hits all 3 pods β All pods restart simultaneously β Complete outage.
Separate your liveness (dumb) from readiness (smart)!
Mistake #2: Setting initialDelaySeconds too short
livenessProbe:
initialDelaySeconds: 5 # β Node takes 15s to start
# Pod restarts before it even finishes booting!
Use startup probes instead of guessing initialDelaySeconds.
Mistake #3: Setting failureThreshold: 1 on liveness
livenessProbe:
failureThreshold: 1 # β One slow response = restart!
One GC pause, one slow DNS lookup, one CPU spike β and Kubernetes restarts your perfectly healthy pod. Use at least 3.
Does It Actually Work? Testing Probes Locally
# Test your readiness endpoint directly
curl -s http://localhost:3000/health/ready | jq .
# Watch Kubernetes probe events in real-time
kubectl describe pod <pod-name> | grep -A 20 "Events:"
# See liveness/readiness status
kubectl get pods -w
# Force a liveness failure (to test it works)
kubectl exec -it <pod-name> -- kill -STOP 1
# Watch Kubernetes detect it and restart the pod!
# Check probe configuration on a running pod
kubectl get pod <pod-name> -o yaml | grep -A 15 "livenessProbe"
The Bottom Line π―
Health probes are not optional. They're the difference between:
- Kubernetes: "I have no idea your app is broken"
- Kubernetes: "Pod failing readiness checks, removed from load balancer. Users unaffected."
Three minutes to add, hours of incidents saved.
Your users don't care that the process is running. They care that the app actually works. Now Kubernetes can tell the difference too.
The quick setup:
/health/liveβ returns 200 if the process can respond (dead simple)/health/readyβ returns 200 only if all dependencies are healthy- Configure all three probe types in your Deployment YAML
- Test them before going to production
After countless 2am incidents, I can tell you: the 15 minutes it takes to set up probes properly pays back the first time a dependency silently fails in production and your users never notice. π
Had a probe-related production incident? I'd love to hear your war story β find me on LinkedIn!
Want to see real Kubernetes configs? Check my GitHub for production-tested manifests.
Now go add those probes before your next deployment! π©Ίβ