
Kubernetes HPA: Stop Waking Up at 3 AM to Scale Pods Manually 📈🤖


True story: It was 2:47 AM. My phone exploded with alerts. Our e-commerce API had 3 pods running and suddenly 10,000 concurrent users decided they wanted to check out simultaneously. Response times went from 80ms to 12 seconds. The on-call rotation landed on me.

Me: scrambles to laptop half-asleep

kubectl scale deployment myapp --replicas=20

Problem solved. But also:

  • I was awake for 2 hours
  • Our SLA was violated by 4 minutes
  • The team was not thrilled at the 3 AM Slack message

Three days later I discovered the Kubernetes Horizontal Pod Autoscaler. Kubernetes could have handled ALL of that automatically. In under 30 seconds. Without me losing any sleep.

Let me save you the same pain. 🛏️

What Is HPA Anyway? 🤔

The Kubernetes Horizontal Pod Autoscaler (HPA) watches your pods' CPU/memory usage (or custom metrics) and automatically scales the number of replicas up or down to match demand.

Traffic spike → Pods use more CPU → HPA adds replicas → Load spreads out → Crisis averted
Traffic drops → CPU drops → HPA removes replicas → You save money

It's like having an infinitely patient SRE who checks CPU every 15 seconds and scales your deployment, without complaining about being woken up. 🤖
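
If you want to see it in action before writing any YAML, the imperative one-liner below creates an equivalent autoscaler. A quick sketch, assuming a Deployment named myapp already exists; we'll build the full YAML version in the steps that follow.

# Quick-and-dirty HPA: scale myapp between 2 and 20 replicas, targeting 70% CPU
kubectl autoscale deployment myapp --cpu-percent=70 --min=2 --max=20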

Your First HPA: The 5-Minute Setup 🚀

Step 1: Make sure metrics-server is running

HPA needs the metrics-server to read CPU/memory from pods. Check if it's installed:

kubectl get deployment metrics-server -n kube-system

If it's not there, install it:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

On managed clusters (EKS, GKE, AKS), metrics-server is usually pre-installed. One less thing to worry about! ✅
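
Quick sanity check that metrics are actually flowing; these commands read from the metrics API that metrics-server backs, so they only work once it's up:

# If these return numbers instead of errors, metrics-server is doing its job
kubectl top nodes
kubectl top pods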

Step 2: Deploy your app with resource requests set

This is the part everyone forgets. HPA is useless without resource requests. It calculates utilization as:

Current CPU Usage ÷ Requested CPU = Utilization %

If you don't set requests, HPA has nothing to divide by, reports the target as <unknown>, and just... gives up.
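
Under the hood, the controller turns that utilization into a replica count. Roughly, simplified from the official algorithm (ignoring tolerances and pod readiness):

desiredReplicas = ceil( currentReplicas × currentUtilization ÷ targetUtilization )

Example: 3 pods averaging 90% CPU against a 70% target → ceil(3 × 90 ÷ 70) = ceil(3.86) = 4 pods.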

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          resources:
            requests:
              cpu: "250m"      # HPA needs this to calculate utilization!
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"

The lesson I learned the hard way: HPA without resource requests is like a percentage with no denominator. There's nothing to measure against. Set your requests! 🎯
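
If you forget, the symptom is hard to miss. You'd typically see something like this (output trimmed; exact messages vary by Kubernetes version):

kubectl get hpa myapp-hpa
# NAME        REFERENCE          TARGETS         MINPODS   MAXPODS   REPLICAS
# myapp-hpa   Deployment/myapp   <unknown>/70%   2         20        2

kubectl describe hpa myapp-hpa
# ...
# Warning  FailedGetResourceMetric  ...  missing request for cpu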

Step 3: Create the HPA

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

  # Replica boundaries
  minReplicas: 2    # Never go below 2 (HA minimum)
  maxReplicas: 20   # Never exceed 20 (cost guard rail!)

  metrics:
    # Scale on CPU
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # Scale up when avg CPU > 70%

    # Scale on Memory too
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # Scale up when avg Memory > 80%

Apply both:

kubectl apply -f deployment.yaml
kubectl apply -f hpa.yaml

# Watch it work!
kubectl get hpa myapp-hpa --watch

That's it. Your app now auto-scales. Go back to sleep. 😴
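
Want to watch it scale without waiting for real traffic? A throwaway load generator works fine. This is a sketch adapted from the official HPA walkthrough, assuming a Service named myapp is fronting the deployment on port 80:

# Hammer the service in a loop; Ctrl+C when you're done
kubectl run load-generator --rm -it --image=busybox:1.28 --restart=Never \
  -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://myapp; done"

# In another terminal, watch the replicas climb
kubectl get hpa myapp-hpa --watch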

The Real-World Tuning That Makes It Actually Work 🔧

The default HPA works, but it can be either too aggressive (scaling up for a 10-second blip) or too slow (waiting while users see errors). Here's what I actually use in production:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

  minReplicas: 3
  maxReplicas: 50

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # Wait 30s of sustained load before scaling up
      policies:
        - type: Percent
          value: 100          # Can DOUBLE replicas per scale event
          periodSeconds: 60
        - type: Pods
          value: 4            # OR add 4 pods at a time
          periodSeconds: 60
      selectPolicy: Max       # Use whichever adds MORE pods (aggressive scale-up!)

    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 MINUTES before scaling down
      policies:
        - type: Percent
          value: 10           # Remove at most 10% of replicas
          periodSeconds: 60   # Per minute (slow, conservative scale-down)
      selectPolicy: Min       # Use whichever removes FEWER pods (safe scale-down!)

Why this config?

  • Scale up FAST: when traffic spikes, you want pods ASAP, not in 5 minutes
  • Scale down SLOW: don't kill pods the moment traffic dips; wait to confirm it's real
  • Stabilization window: prevents thrashing on a 30-second traffic spike

Production war story: We once had HPA scale up to 40 pods during a spike, then scale back down to 3 in 2 minutes. The scale-down killed pods mid-request and caused 503 errors. After setting scaleDown.stabilizationWindowSeconds: 300, problem gone. Scale up fast, scale down slow! ⚡🐢

The Classic Mistake: Not Setting maxReplicas 💸

I once worked with a team that forgot to set a sane maxReplicas. During a load test that accidentally hit production (classic), HPA spun up 237 pods.

AWS bill that month: 💀

Always set a maxReplicas that:

  1. Can handle your actual peak traffic
  2. Won't bankrupt you if HPA goes wild

maxReplicas: 50   # Not 1000. Never 1000.

And set up billing alerts. Seriously. 💳

Watching HPA Do Its Thing 👀

# See current state
kubectl get hpa

# NAME        REFERENCE          TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# myapp-hpa   Deployment/myapp   45%/65%   3         50        5          2d

# Detailed view
kubectl describe hpa myapp-hpa

# Watch it scale in real time (this is satisfying)
kubectl get hpa myapp-hpa --watch

# See HPA events (scaling history)
kubectl describe hpa myapp-hpa | grep -A 20 "Events:"

When you see SuccessfulRescale events instead of a ScalingLimited condition, you've set your min/max correctly. 🎯

The Gotcha Nobody Tells You About: Pod Startup Time ⏱️

HPA scales fast. But if your pods take 90 seconds to start up, you're still getting errors during those 90 seconds.

The fix? Combine HPA with a Readiness Probe that delays traffic until the pod is warm:

# In your deployment spec, under the container definition:
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10   # Give the app time to boot
  periodSeconds: 5
  failureThreshold: 3

# Optional: pre-warm with init containers (these sit at the pod spec level,
# alongside "containers", and run to completion before the app container starts)
initContainers:
  - name: warm-cache
    image: myapp:latest
    command: ["node", "scripts/warm-cache.js"]

Real lesson: HPA solves the "how many pods" problem. Readiness probes solve the "are the pods ready" problem. You need both! 🤝

Your Action Plan 🚀

Today:

  1. Check if metrics-server is running in your cluster
  2. Add resources.requests to every deployment
  3. Create a basic HPA with minReplicas: 2, maxReplicas: 20, CPU target 70%

This week:

  1. Tune behavior.scaleDown.stabilizationWindowSeconds to 300
  2. Load test and watch HPA respond
  3. Set a maxReplicas guard rail that won't cause a budget emergency

This month:

  1. Explore custom metrics (queue depth, active connections, request latency); see the sketch after this list
  2. Combine HPA with Cluster Autoscaler so nodes scale too
  3. Set up alerts for when replicas hit maxReplicas; that means you need to raise the ceiling!
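
For the custom-metrics item above, here's roughly what an HPA scaling on request rate looks like. This is a sketch, not a drop-in config: it assumes you've installed a custom metrics adapter (prometheus-adapter is a common choice) that exposes a per-pod metric; the name http_requests_per_second and the 100 req/s target are placeholders, not something Kubernetes ships by default.

# hpa-custom.yaml -- sketch, requires a custom metrics adapter
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # placeholder: whatever your adapter exposes
        target:
          type: AverageValue
          averageValue: "100"              # aim for ~100 requests/sec per pod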

The Bottom Line 💡

Kubernetes HPA is one of those features where the setup is 30 minutes and the payoff is years of uninterrupted sleep. You will eventually have a traffic spike. The question is: will you handle it automatically in 30 seconds, or manually at 3 AM in a panic?

The choice is yours. But your pillow has an opinion. 🛏️


Still scaling pods by hand? Find me on LinkedIn; I want to hear your on-call horror story!

More Kubernetes deep dives? Check out my GitHub for production-ready Kubernetes configs!

Now go configure that HPA and sleep soundly. 📈🤖✨


P.S. The correct number of manual scaling incidents needed before you set up HPA is zero. Learn from my 2:47 AM lesson so you don't have to live it. 🎯