Kubernetes Resource Limits: Stop Crashing Your Nodes at 3 AM 💥🐳
True story: It was 3:17 AM. My phone was buzzing off the nightstand. PagerDuty. Again.
I stumbled to my laptop, bleary-eyed, to find half our Kubernetes cluster in a death spiral. Pods evicted. Nodes NotReady. The monitoring dashboard looked like a Jackson Pollock painting — red everywhere.
Root cause? Our shiny new data-processing service had no resource limits. A traffic spike caused it to eat every byte of memory on three nodes, which then took down every OTHER service on those nodes too. One greedy pod. Thirty minutes of downtime. My dignity: gone. ☠️
Welcome to Kubernetes resource management — the thing nobody explains properly until you've already wrecked production.
The Problem: Kubernetes Is Sharing Economy Gone Wrong 🏠
Kubernetes nodes are like shared apartments. Multiple pods (tenants) live on each node (apartment building). Without rules, that one loud neighbor will blast music at 3 AM, run the AC at full blast, and hog all the parking spots.
Resource requests and limits are your apartment rules:
- Requests = minimum resources your pod needs (used for scheduling)
- Limits = maximum resources your pod can use (enforced at runtime)
```yaml
resources:
  requests:
    memory: "256Mi"   # "I need at least this much"
    cpu: "250m"       # "Schedule me somewhere with this available"
  limits:
    memory: "512Mi"   # "Cut me off here, no exceptions"
    cpu: "500m"       # "Throttle me if I go over this"
```
Without these? Your pod is that nightmare tenant who just moves in and takes whatever they want. The cluster scheduler is basically flying blind.
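Quick aside on those units, because they confuse everyone at first. Here's a tiny Python sketch (my own illustration, not part of any official client library, and it skips Kubernetes' decimal K/M/G suffixes) of what the binary memory suffixes and CPU millicores actually mean:

```python
# Hypothetical helpers: convert Kubernetes quantity strings to plain numbers.
# Simplified sketch -- real parsers also handle decimal suffixes (K, M, G).

def parse_memory(quantity: str) -> int:
    """Return bytes for quantities like '256Mi', '1Gi', or plain '128974848'."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix = plain bytes

def parse_cpu(quantity: str) -> float:
    """Return cores for quantities like '250m' (millicores) or '2' (cores)."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

print(parse_memory("512Mi"))  # 536870912 (bytes)
print(parse_cpu("250m"))      # 0.25 (cores)
```

So "250m" means a quarter of a core, and "512Mi" is mebibytes (powers of two), not megabytes.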
CPU vs Memory: They're Not the Same Beast 🧠
Here's something that trips up everyone:
CPU is compressible. If your pod hits its CPU limit, Kubernetes throttles it — slows it down. Annoying, but survivable. Your pod keeps running, just slower.
Memory is NOT compressible. If your pod hits its memory limit, Kubernetes sends it to the shadow realm — OOMKilled. Instant death. No warning. Exit code 137. Your logs end mid-sentence.
```bash
# The most dreaded kubectl output
kubectl get pods

# NAME                     READY   STATUS      RESTARTS   AGE
# api-server-7d9f8b-xkp2   0/1     OOMKilled   14         2h

# 14 restarts. Pod is in a death loop.
# Someone's getting paged tonight.
```
Lesson I learned the hard way: Set memory limits conservatively, then tune upward. Set CPU limits generously (or skip them on some workloads). Getting OOMKilled is always worse than CPU throttling.
Setting Requests and Limits: The Real Numbers Game 📊
Here's a battle-tested configuration for a Node.js API service:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server   # must match the selector above
    spec:
      containers:
        - name: api-server
          image: myapp:latest
          resources:
            requests:
              # Scheduler needs this much FREE on a node to place the pod
              memory: "256Mi"
              cpu: "250m"       # 250 millicores = 0.25 CPU core
            limits:
              # Pod gets killed (memory) or throttled (CPU) above this
              memory: "512Mi"   # 2x the request = safe headroom
              cpu: "1000m"      # 1 full CPU core max
          # ALWAYS add liveness and readiness probes!
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
```
The golden ratio that saved me: Set limits.memory to roughly 2x your requests.memory. Enough headroom for traffic spikes, but a hard ceiling before you hurt your neighbors.
The Quality of Service Classes Nobody Tells You About 🎖️
Here's the sneaky part: Kubernetes uses your resource config to assign a QoS class to every pod, which determines who gets evicted first when a node runs low.
Guaranteed → requests == limits (for ALL containers)
Burstable → requests < limits (or only some resources set)
BestEffort → NO resources set at all (naked pod)
Eviction order when the node panics: BestEffort goes first, then Burstable, then Guaranteed (last resort).
```yaml
# BestEffort - First to die 💀
spec:
  containers:
    - name: app
      image: myapp:latest
      # No resources = "please evict me first"
---
# Guaranteed - Last to die 🛡️
spec:
  containers:
    - name: app
      image: myapp:latest
      resources:
        requests:
          memory: "512Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"   # Exact same as request
          cpu: "500m"       # Exact same as request
```
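If the three rules feel abstract, here's roughly how the classification works, as a simplified Python sketch (my own illustration; the real logic in Kubernetes handles more edge cases, but these are the rules described above):

```python
# Simplified sketch of Kubernetes QoS classification. Each container is a
# dict with optional "requests" and "limits" maps of resource -> quantity.

def qos_class(containers: list[dict]) -> str:
    requests = [c.get("requests", {}) for c in containers]
    limits = [c.get("limits", {}) for c in containers]
    # BestEffort: no container sets any requests or limits at all
    if not any(requests) and not any(limits):
        return "BestEffort"
    # Guaranteed: every container sets cpu+memory limits, and requests
    # equal limits (unset requests default to the limit in Kubernetes)
    for req, lim in zip(requests, limits):
        for resource in ("cpu", "memory"):
            if resource not in lim or req.get(resource, lim[resource]) != lim[resource]:
                return "Burstable"
    return "Guaranteed"

print(qos_class([{}]))  # BestEffort
print(qos_class([{"requests": {"cpu": "500m", "memory": "512Mi"},
                  "limits":   {"cpu": "500m", "memory": "512Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"memory": "256Mi"},
                  "limits":   {"memory": "512Mi"}}]))  # Burstable
```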
Production rule: Critical services (databases, auth, payment processors) should be Guaranteed. Background workers and batch jobs? Let them be Burstable or BestEffort — they're fine to evict under pressure.
LimitRange: The Safety Net for Lazy Developers 🥅
What happens when a developer deploys a pod with NO resource spec? (Spoiler: BestEffort class. First to die.)
Enter LimitRange — a namespace-level policy that sets defaults so even naked pods get sensible limits:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:            # Applied when limits not specified
        memory: "512Mi"
        cpu: "500m"
      defaultRequest:     # Applied when requests not specified
        memory: "256Mi"
        cpu: "250m"
      max:                # Nobody can exceed this
        memory: "2Gi"
        cpu: "2000m"
      min:                # Nobody can go below this
        memory: "64Mi"
        cpu: "50m"
```
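Under the hood, the admission machinery just fills in the blanks before the pod is stored. Here's a rough Python sketch of only the defaulting step (my own simplification: it maps the LimitRange's `default`/`defaultRequest` values onto missing fields and skips min/max enforcement entirely):

```python
# Sketch of LimitRange defaulting: "default" fills missing limits,
# "defaultRequest" fills missing requests. Illustration only.
DEFAULTS = {
    "limits":   {"memory": "512Mi", "cpu": "500m"},   # from `default:`
    "requests": {"memory": "256Mi", "cpu": "250m"},   # from `defaultRequest:`
}

def apply_limit_range(container: dict) -> dict:
    resources = container.setdefault("resources", {})
    for field in ("requests", "limits"):
        for resource, value in DEFAULTS[field].items():
            # setdefault never overwrites values the developer DID set
            resources.setdefault(field, {}).setdefault(resource, value)
    return container

naked = apply_limit_range({"name": "naked-pod", "image": "myapp:latest"})
print(naked["resources"]["limits"]["memory"])  # 512Mi
```

Note that explicitly set values survive untouched; the defaults only catch what's missing.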
Deploy this to your namespace and sleep soundly knowing that even the most resource-oblivious developer on your team can't accidentally nuke the cluster. You're welcome. 😴
ResourceQuota: Keeping Teams from Eating the Cluster 🍕
Multi-team clusters are a potluck dinner where someone always takes all the pizza. ResourceQuota is the rule that says each team only gets two slices:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-frontend-quota
  namespace: team-frontend
spec:
  hard:
    # Total resource budget for this namespace
    requests.cpu: "4"        # 4 cores total requests
    requests.memory: "8Gi"   # 8GB total requests
    limits.cpu: "8"          # 8 cores max usage
    limits.memory: "16Gi"    # 16GB max usage
    # Also cap object counts
    pods: "20"               # Max 20 pods
    services: "10"
    persistentvolumeclaims: "5"
```
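The quota math itself is simple bookkeeping. A toy Python sketch (illustration only; real enforcement happens in the API server at admission time) of the CPU-requests budget from the quota above, with CPU in millicores:

```python
# Toy quota check: do these pods' CPU requests fit the namespace budget?
QUOTA_REQUESTS_CPU_M = 4000   # "requests.cpu: 4" -> 4000 millicores
QUOTA_MAX_PODS = 20           # "pods: 20"

def fits_quota(pod_request_cpus_m: list[int]) -> bool:
    return (len(pod_request_cpus_m) <= QUOTA_MAX_PODS
            and sum(pod_request_cpus_m) <= QUOTA_REQUESTS_CPU_M)

print(fits_quota([250] * 16))  # True: 16 pods requesting 4000m total
print(fits_quota([250] * 17))  # False: 4250m blows the CPU budget
```

A pod that would push the namespace past the budget is simply rejected at creation time, so the blast radius stays inside the namespace.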
Real-world win: After our platform team deployed ResourceQuotas per team namespace, a runaway deployment in the data-science namespace could no longer starve the payment service. Compartmentalization saves lives (and SLAs). 🎯
Finding the Right Numbers: Don't Just Guess 🔍
The most common mistake? Setting limits based on vibes. Here's how to actually find the right values:
```bash
# Check what pods are actually consuming RIGHT NOW
kubectl top pods -n production
# NAME                     CPU(cores)   MEMORY(bytes)
# api-server-7d9f8b-xkp2   187m         203Mi
# api-server-7d9f8b-r8n4   201m         198Mi
# worker-5c9f7d-p2q1       45m          89Mi

# Check node pressure
kubectl top nodes
# NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-1   2341m        58%    11Gi            72%   ← getting full!
# node-2   890m         22%    4Gi             26%
# node-3   1120m        28%    5Gi             33%
```
My workflow after an OOMKill incident:
- Check `kubectl top pods` under normal load (get baseline)
- Check again during peak traffic (find the spike)
- Set `requests` = normal load, `limits` = peak load × 1.5
- Enable Vertical Pod Autoscaler (VPA) in recommendation mode to validate
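That workflow is really just arithmetic. A toy Python sketch (the 1.5x multiplier comes from the workflow's step 3, not an official formula, and the rounding is a judgment call):

```python
# Turn observed memory usage (from kubectl top) into suggested values,
# with everything in Mi for simplicity.

def suggest(baseline_mi: int, peak_mi: int, headroom: float = 1.5) -> dict:
    return {
        "requests": f"{baseline_mi}Mi",            # what it needs normally
        "limits": f"{int(peak_mi * headroom)}Mi",  # peak plus spike headroom
    }

# e.g. kubectl top showed ~200Mi normally and ~340Mi at peak:
print(suggest(200, 340))  # {'requests': '200Mi', 'limits': '510Mi'}
```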
Pro tip: Whether you're on EKS, GKE, or AKS, the VPA in recommendation mode will watch your pods for a few days and suggest better values without touching anything. Trust but verify. 📈
The 3 AM Incident That Changed Everything 🌙
Back to that fateful night. After the cluster recovered, we did a post-mortem. The fix was embarrassingly simple:
Before (the chaos configuration):
```yaml
containers:
  - name: data-processor
    image: processor:latest
    # No resources. Just vibes. 🤡
```
After (the production-ready configuration):
```yaml
containers:
  - name: data-processor
    image: processor:latest
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
      limits:
        memory: "1Gi"    # Enough headroom for big batches
        cpu: "2000m"     # CPU-intensive work, be generous
    env:
      - name: NODE_OPTIONS
        value: "--max-old-space-size=900"   # Keep Node.js heap UNDER the memory limit!
```
That last line is critical. If Node.js doesn't know about its memory limit, it'll happily grow past it, get OOMKilled, restart, grow again, restart... death loop forever. Tell your runtime about the limit! Python, JVM, and .NET all have equivalent flags.
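How did I land on 900? Rule of thumb (my own, not an official formula): give the runtime heap roughly 85-90% of the container limit, leaving room for thread stacks, buffers, and allocations that live outside the managed heap.

```python
# Rule-of-thumb sketch: size the runtime heap below the container limit.
# The 0.88 fraction is an assumption, not a Kubernetes or Node.js constant.

def heap_size_mb(container_limit_mi: int, fraction: float = 0.88) -> int:
    return int(container_limit_mi * fraction)

print(heap_size_mb(1024))  # 901 -> roughly --max-old-space-size=900 for a 1Gi limit
```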
Result: No more 3 AM pages from that service. The data processor now gets capped cleanly, other pods on the node survive, and the on-call engineer (me) actually sleeps through the night. 🎉
Your Kubernetes Resource Checklist 📋
Before you ship anything to production:
- Every container has `resources.requests` AND `resources.limits`
- Memory limit is at least 1.5x the request (headroom for spikes)
- Runtime memory flags match your container limit (`--max-old-space-size`, `-Xmx`, etc.)
- LimitRange deployed to every namespace (catches lazy deploys)
- ResourceQuota set per team/namespace (prevents noisy neighbors)
- Liveness and readiness probes configured (so Kubernetes knows when to restart)
- `kubectl top pods` running clean (no pod near its limit)
- Critical services are `Guaranteed` QoS class
The 30-second audit for your cluster:
```bash
# Find pods with NO resource requests set (danger zone!)
kubectl get pods --all-namespaces -o json | \
  jq '.items[] | select(.spec.containers[].resources.requests == null) | .metadata.name'

# Any results? Go add limits. Now. Before 3 AM finds you.
```
The Bottom Line 💡
Resource requests and limits aren't DevOps gatekeeping or premature optimization. They're the difference between a cluster that runs itself and one that pages you at 3 AM.
Every pod without limits is a loaded gun pointed at your uptime. Every namespace without a LimitRange is trusting everyone on your team to always do the right thing. (They won't. We're human.)
The good news: This is a one-time fix. Set it up properly once, add LimitRanges as guardrails, and the cluster basically babysits itself.
Go check your production namespace right now. I'll wait.
```bash
kubectl get pods -n production -o custom-columns=\
"NAME:.metadata.name,\
MEM_REQ:.spec.containers[0].resources.requests.memory,\
MEM_LIM:.spec.containers[0].resources.limits.memory"
```
If you see a lot of `<none>` in those columns, you have homework to do before your pager does it for you. 🚨
Survived a Kubernetes meltdown? Share your war story on LinkedIn — misery loves company, and we can all learn from the 3 AM incidents we'd rather forget.
More Kubernetes content? Check out my GitHub for real production configurations that have been through the fire.
Now go add those resource limits. Your future self at 3 AM will thank you. 💤