Kubernetes HPA: Stop Waking Up at 3 AM to Scale Pods Manually
True story: It was 2:47 AM. My phone exploded with alerts. Our e-commerce API had 3 pods running and suddenly 10,000 concurrent users decided they wanted to check out simultaneously. Response times went from 80ms to 12 seconds. The on-call rotation landed on me.
Me: scrambles to laptop half-asleep
kubectl scale deployment myapp --replicas=20
Problem solved. But also:
- I was awake for 2 hours
- Our SLA was violated by 4 minutes
- The team was not thrilled at the 3 AM Slack message
Three days later I discovered Kubernetes Horizontal Pod Autoscaler. Kubernetes could have handled ALL of that automatically. In under 30 seconds. Without me losing any sleep.
Let me save you the same pain.
What Is HPA Anyway?
The Kubernetes Horizontal Pod Autoscaler (HPA) watches your pods' CPU/memory usage (or custom metrics) and automatically scales the number of replicas up or down to match demand.
Traffic spike → Pods use more CPU → HPA adds replicas → Load spreads out → Crisis averted
Traffic drops → CPU drops → HPA removes replicas → You save money
It's like having an infinitely patient SRE who checks CPU every 15 seconds and scales your deployment, without complaining about being woken up.
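If you're curious what that scaling decision looks like, here's a minimal sketch of the replica formula described in the Kubernetes docs: desired = ceil(current replicas × current metric ÷ target metric), skipped when the ratio is within a tolerance (10% by default). The function name and sample numbers are mine, purely for illustration, and the real controller also deals with missing metrics and not-yet-ready pods, which this ignores.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Simplified HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 3 pods averaging 140% of their CPU request against a 70% target
print(desired_replicas(3, 140, 70))
# 6 pods averaging 35% against the same target
print(desired_replicas(6, 35, 70))
# 4 pods at 73%: close enough to the target that nothing changes
print(desired_replicas(4, 73, 70))
```

The takeaway: HPA doesn't add "one pod at a time". It jumps straight to the replica count that should bring the average back to the target.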
Your First HPA: The 5-Minute Setup
Step 1: Make sure metrics-server is running
HPA needs the metrics-server to read CPU/memory from pods. Check if it's installed:
kubectl get deployment metrics-server -n kube-system
If it's not there, install it:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
On managed clusters (EKS, GKE, AKS), metrics-server is usually pre-installed. One less thing to worry about!
Step 2: Deploy your app with resource requests set
This is the part everyone forgets. HPA is useless without resource requests. It calculates utilization as:
Current CPU Usage ÷ Requested CPU = Utilization %
If you don't set requests, HPA has nothing to divide by and just... gives up.
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          resources:
            requests:
              cpu: "250m"      # HPA needs this to calculate utilization!
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
The lesson I learned the hard way: HPA without resource requests is like a speedometer without knowing your max speed. Set your requests!
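To make the division concrete, here's a small sketch of the utilization math using the 250m request from the deployment above. The helper names are hypothetical, and it only handles the two CPU quantity forms you'll usually see ("250m" and plain cores like "0.5"). The real HPA also averages across all pods, which I've left out.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_utilization_percent(usage: str, request: str) -> int:
    """Utilization as HPA sees it: usage relative to the REQUEST, not the limit."""
    requested = parse_cpu_millicores(request)
    if requested == 0:
        raise ValueError("no CPU request set, utilization is undefined")
    return parse_cpu_millicores(usage) * 100 // requested


print(cpu_utilization_percent("200m", "250m"))  # a pod using 200m of a 250m request
print(cpu_utilization_percent("0.5", "250m"))   # a pod running at its 500m limit
```

Notice the second case: because the limit is twice the request, a pod can report well over 100% utilization. That's normal, and it's why the target is measured against requests.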
Step 3: Create the HPA
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # Replica boundaries
  minReplicas: 2    # Never go below 2 (HA minimum)
  maxReplicas: 20   # Never exceed 20 (cost guard rail!)
  metrics:
    # Scale on CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale up when avg CPU > 70%
    # Scale on Memory too
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # Scale up when avg Memory > 80%
Apply both:
kubectl apply -f deployment.yaml
kubectl apply -f hpa.yaml
# Watch it work!
kubectl get hpa myapp-hpa --watch
That's it. Your app now auto-scales. Go back to sleep.
The Real-World Tuning That Makes It Actually Work
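One thing worth knowing about the two-metric HPA above: when several metrics are listed, HPA works out a replica count for each and takes the largest. Here's a rough sketch of that rule, using the min/max of 2 and 20 from hpa.yaml. The function names and sample readings are made up for illustration, and I've skipped the tolerance check to keep it short.

```python
import math


def replicas_for_metric(current_replicas: int, current_pct: float, target_pct: float) -> int:
    """Replica count one metric would ask for on its own."""
    return math.ceil(current_replicas * current_pct / target_pct)


def recommend(current_replicas: int, metrics: dict,
              min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Take the largest per-metric proposal, then clamp to the min/max bounds."""
    proposals = [
        replicas_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# CPU is idle (35% vs 70%) but memory is hot (90% vs 80%): memory wins.
print(recommend(4, {"cpu": (35, 70), "memory": (90, 80)}))
# A huge CPU spike gets capped by maxReplicas.
print(recommend(10, {"cpu": (210, 70), "memory": (40, 80)}))
```

In practice this means a memory-hungry app may never scale down on quiet nights, even when CPU is idle. If that surprises you, check which metric is driving the replica count.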
The default HPA works, but it can be either too aggressive (scaling up for a 10-second blip) or too slow (waiting while users see errors). Here's what I actually use in production:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # Wait 30s of sustained load before scaling up
      policies:
        - type: Percent
          value: 100                   # Can DOUBLE replicas per scale event
          periodSeconds: 60
        - type: Pods
          value: 4                     # OR add 4 pods at a time
          periodSeconds: 60
      selectPolicy: Max                # Use whichever adds MORE pods (aggressive scale-up!)
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 MINUTES before scaling down
      policies:
        - type: Percent
          value: 10                    # Remove at most 10% of replicas
          periodSeconds: 60            # Per minute (slow, conservative scale-down)
      selectPolicy: Min                # Use whichever removes FEWER pods (safe scale-down!)
Why this config?
- Scale up FAST: when traffic spikes, you want pods ASAP, not in 5 minutes
- Scale down SLOW: don't kill pods the moment traffic dips; wait to confirm it's real
- Stabilization window: prevents thrashing on a 30-second traffic spike
Production war story: We once had HPA scale up to 40 pods during a spike, then scale back down to 3 in 2 minutes. The scale-down killed pods mid-request and caused 503 errors. After setting scaleDown.stabilizationWindowSeconds: 300, problem gone. (300 seconds is also the Kubernetes default, so if you hit this, check whether something in your config or cluster lowered it.) Scale up fast, scale down slow!
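To see what those policies allow in numbers, here's a simplified model of the per-minute limits from the tuned config: scale-up takes the larger of "+100%" and "+4 pods", scale-down removes at most 10% per minute. This is my own back-of-the-envelope sketch with made-up function names, not the controller's code. It assumes one scaling step per 60-second period and leaves out the stabilization windows entirely.

```python
def scale_up_limit(current: int, percent: int = 100, pods: int = 4) -> int:
    """Highest replica count allowed after one period (selectPolicy: Max)."""
    by_percent = -(-current * (100 + percent) // 100)  # integer ceiling division
    by_pods = current + pods
    return max(by_percent, by_pods)


def scale_down_limit(current: int, percent: int = 10) -> int:
    """Lowest replica count allowed after one period with a Percent policy."""
    return current * (100 - percent) // 100


def minutes_to_scale_down(start: int, floor: int) -> int:
    """Count the one-minute periods needed to walk from start down to floor."""
    minutes = 0
    replicas = start
    while replicas > floor:
        replicas = max(floor, scale_down_limit(replicas))
        minutes += 1
    return minutes


print(scale_up_limit(3))             # small deployment: the "+4 pods" policy wins
print(scale_up_limit(10))            # larger deployment: doubling wins
print(minutes_to_scale_down(40, 3))  # the war-story scenario, 40 pods back to 3
```

Under these assumptions, the 40-to-3 scale-down from the war story takes many minutes instead of two, and that is before the 300-second stabilization window delays the first step.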
The Classic Mistake: Not Setting maxReplicas
I once worked with a team that forgot to set a sane maxReplicas. During a load test that accidentally hit production (classic), HPA spun up 237 pods.
AWS bill that month: ouch.
Always set a maxReplicas that:
- Can handle your actual peak traffic
- Won't bankrupt you if HPA goes wild
maxReplicas: 50 # Not 1000. Never 1000.
And set up billing alerts. Seriously.
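If you're not sure where to put the ceiling, one rough way to reason about it is sketched below. Every number here is hypothetical (peak requests per second, per-pod throughput, headroom factor, hourly pod cost). Plug in figures from your own load tests and your own bill.

```python
import math

HOURS_PER_MONTH = 730  # common approximation for a month


def suggested_max_replicas(peak_rps: float, rps_per_pod: float, headroom: float = 1.5) -> int:
    """Pods needed at peak, padded with a safety factor."""
    return math.ceil(peak_rps / rps_per_pod * headroom)


def worst_case_monthly_cost(max_replicas: int, cost_per_pod_hour: float) -> float:
    """What the bill looks like if HPA sits at the ceiling all month."""
    return max_replicas * cost_per_pod_hour * HOURS_PER_MONTH


ceiling = suggested_max_replicas(peak_rps=2000, rps_per_pod=100)
print(ceiling)
print(worst_case_monthly_cost(ceiling, cost_per_pod_hour=0.05))
```

If the worst-case number makes you wince, that's the signal to lower the ceiling or make each pod more efficient before HPA finds out for you.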
Watching HPA Do Its Thing
# See current state
kubectl get hpa
# NAME        REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# myapp-hpa   Deployment/myapp   45%/65%   3         50        5          2d
# Detailed view
kubectl describe hpa myapp-hpa
# Watch it scale in real time (this is satisfying)
kubectl get hpa myapp-hpa --watch
# See HPA events (scaling history)
kubectl describe hpa myapp-hpa | grep -A 20 "Events:"
When you see SuccessfulRescale events and the ScalingLimited condition reads False (rather than True), you've set your min/max correctly.
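If you'd rather check this from a script than eyeball describe output, here's a small sketch that reads the HPA's status conditions from JSON. The sample document is hand-written for illustration and trimmed to the fields the code touches; the helper names are mine.

```python
import json

# Hypothetical, trimmed-down HPA object for illustration only.
SAMPLE = """
{
  "spec": {"minReplicas": 3, "maxReplicas": 50},
  "status": {
    "currentReplicas": 50,
    "desiredReplicas": 50,
    "conditions": [
      {"type": "AbleToScale", "status": "True", "reason": "ReadyForNewScale"},
      {"type": "ScalingActive", "status": "True", "reason": "ValidMetricFound"},
      {"type": "ScalingLimited", "status": "True", "reason": "TooManyReplicas"}
    ]
  }
}
"""


def is_scaling_limited(hpa: dict) -> bool:
    """True when the ScalingLimited condition is present and set to "True"."""
    for condition in hpa.get("status", {}).get("conditions", []):
        if condition.get("type") == "ScalingLimited":
            return condition.get("status") == "True"
    return False


def pinned_at_max(hpa: dict) -> bool:
    """True when the current replica count has reached maxReplicas."""
    return hpa["status"].get("currentReplicas", 0) >= hpa["spec"]["maxReplicas"]


hpa = json.loads(SAMPLE)
print(is_scaling_limited(hpa), pinned_at_max(hpa))
```

To run it against a real cluster, feed it the output of kubectl get hpa myapp-hpa -o json instead of the sample string.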
The Gotcha Nobody Tells You About: Pod Startup Time
HPA scales fast. But if your pods take 90 seconds to start up, you're still getting errors during those 90 seconds.
The fix? Combine HPA with a Readiness Probe that delays traffic until the pod is warm:
# In your deployment spec, under the app container:
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10   # Give the app time to boot
  periodSeconds: 5
  failureThreshold: 3
# Optional: pre-warm with init containers (under the pod spec, next to containers)
initContainers:
  - name: warm-cache
    image: myapp:latest
    command: ["node", "scripts/warm-cache.js"]
Real lesson: HPA solves the "how many pods" problem. Readiness probes solve the "are the pods ready" problem. You need both!
Your Action Plan
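To get a feel for how startup time eats into your reaction time, here's a deliberately crude estimate. It assumes the 15-second check interval mentioned earlier, the 30-second scale-up stabilization window from the tuned config, and the 90-second startup from this section. Scheduling, image pulls, and metrics lag are ignored, so treat the result as a lower bound. The function name is hypothetical.

```python
def seconds_until_new_pods_serve(pod_startup_s: int,
                                 sync_period_s: int = 15,
                                 scale_up_stabilization_s: int = 30) -> int:
    """Rough time from 'load arrives' to 'new pods pass readiness'."""
    return sync_period_s + scale_up_stabilization_s + pod_startup_s


print(seconds_until_new_pods_serve(90))  # slow-starting app
print(seconds_until_new_pods_serve(10))  # fast-starting app
```

The scaling decision itself is quick. Most of the wait is your app booting, which is why trimming startup time often helps more than tweaking HPA settings.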
Today:
- Check if metrics-server is running in your cluster
- Add resources.requests to every deployment
- Create a basic HPA with minReplicas: 2, maxReplicas: 20, CPU target 70%
This week:
- Tune behavior.scaleDown.stabilizationWindowSeconds to 300
- Load test and watch HPA respond
- Set a maxReplicas guard rail that won't cause a budget emergency
This month:
- Explore custom metrics (queue depth, active connections, request latency)
- Combine HPA with Cluster Autoscaler so nodes scale too
- Set up alerts when replicas hit maxReplicas; that means you need to raise the ceiling!
The Bottom Line
Kubernetes HPA is one of those features where the setup is 30 minutes and the payoff is years of uninterrupted sleep. You will eventually have a traffic spike. The question is: will you handle it automatically in 30 seconds, or manually at 3 AM in a panic?
The choice is yours. But your pillow has an opinion.
Still scaling pods by hand? Find me on LinkedIn. I want to hear your on-call horror story!
More Kubernetes deep dives? Check out my GitHub for production-ready Kubernetes configs!
Now go configure that HPA and sleep soundly.
P.S. The correct number of manual scaling incidents needed before you set up HPA is zero. Learn from my 2:47 AM lesson so you don't have to live it.