🔥 Error Budgets Without Burnout: Your SLO Is Not a Pager Schedule
Error budgets were supposed to be the peacemaker between reliability and velocity. The SRE gospel: agree on an SLO, calculate how much failure you can afford, and use that budget to have rational conversations about risk instead of screaming at each other in post-mortems.
Sounds great. In practice, a lot of teams implement error budgets and end up with the same amount of stress — plus a new dashboard that makes the stress quantitative.
This post is about why that happens and how to wire up error budgets so they actually make your team healthier instead of just better at measuring their own suffering.
What an Error Budget Actually Is
Quick level-set before we get into the weeds.
If your SLO is 99.9% availability over a 30-day window, you have 43.2 minutes of allowed downtime per month. That's your error budget. Spend it wisely on planned maintenance, risky deployments, or canary rollouts. Burn through it unexpectedly, and something has gone wrong — either in reliability or in your SLO calibration.
The budget is expressed as a rate: error_budget_remaining = (1 - SLO) * total_requests. Every failed request eats into it.
Here's where most teams make the first mistake.
The Alert-on-Threshold Trap
Naive error budget alerting fires when the budget drops below some percentage: "Alert when remaining budget < 10%." This sounds reasonable. It isn't.
Problem one: it fires too late. If you're burning at 10x normal rate, you can exhaust the entire month's budget in three hours. An alert that fires at 10% remaining arrives when you have 26 minutes of budget left. You're already on fire.
Problem two: it fires on noise. A single bad minute with high error rate can briefly spike your consumed budget. Threshold alerts don't distinguish between "we had a rough 60 seconds" and "we are structurally broken and will be at zero in four hours."
Both problems generate the same outcome: engineers who distrust their alerts, toggle Do Not Disturb on anyway, and check Slack in the morning to see if anything exploded.
Burn Rate Alerting: The Actually Useful Version
The Google SRE workbook introduced multi-window, multi-burn-rate alerting, and it's the right mental model. Instead of alerting on how much budget is remaining, you alert on how fast it's burning relative to baseline.
A burn rate of 1x means you're consuming budget exactly as fast as your SLO allows — budget hits zero right at the end of the window. Fine.
A burn rate of 14.4x means you'll exhaust a 30-day budget in 50 hours. Time to wake someone up.
A burn rate of 6x means the budget is gone in 5 days. Worth a ticket, probably not a 3am page.
Here's how that looks in Prometheus recording rules and alerts:
# Recording rules — compute error rate per time window
groups:
- name: error_budget_burn
interval: 1m
rules:
- record: job:slo_errors:ratio_rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
- record: job:slo_errors:ratio_rate1h
expr: |
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
- record: job:slo_errors:ratio_rate6h
expr: |
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
# Alerts — two-window confirmation before paging
groups:
- name: error_budget_alerts
rules:
# Page immediately: 2% budget gone in 1 hour at this rate
- alert: ErrorBudgetFastBurn
expr: |
job:slo_errors:ratio_rate5m > (14.4 * 0.001)
and
job:slo_errors:ratio_rate1h > (14.4 * 0.001)
for: 2m
labels:
severity: page
annotations:
summary: "Fast burn — budget exhausted in ~50h at current rate"
# Ticket, don't page: budget gone in ~5 days
- alert: ErrorBudgetSlowBurn
expr: |
job:slo_errors:ratio_rate6h > (6 * 0.001)
for: 15m
labels:
severity: ticket
annotations:
summary: "Slow burn — budget exhausted in ~5 days"
Two windows for the fast burn alert matter: the 5-minute window catches the spike quickly; the 1-hour window confirms it's not a blip. If both are above the burn-rate threshold, you have a real problem. One window alone produces too many false positives.
At Cubet, we tuned the for: 2m duration based on how long transient errors from deploy rollouts typically last. Your number may differ — check your deployment logs before picking a value.
The Part Everyone Skips: Budget as a Conversation Tool
The alerting math is table stakes. The real value of error budgets is organizational, and most teams never get there.
When the budget is healthy, the implicit signal is: reliability is good enough, go ship things. Teams that understand this actually move faster because they're not second-guessing every deploy.
When the budget is burning, the signal flips: freeze feature work, fix reliability first. This is where error budgets give engineering leads a policy-backed, numbers-driven way to push back on product pressure. "We can't ship the payment redesign this week — we've burned 60% of this month's error budget and we don't know why yet" is a much cleaner conversation than "I have a bad feeling about this."
Write this down somewhere everyone can see. Make the budget visible on a dashboard, not buried in Grafana folders.
Avoiding the Emotional Budget Ledger
One failure mode I've seen: teams treat a burned budget as evidence someone did something wrong. The post-mortem becomes a trial. Engineers start sandbagging SLOs so the budget looks healthy on paper. The whole system becomes theater.
Error budgets only work if the team genuinely believes that budget spend is expected and acceptable. Spending budget on a risky migration you chose to do is healthy. Discovering you've been unknowingly spending it on a memory leak for three weeks is the signal to act — not to assign blame.
Keep post-mortems focused on "what would have caught this faster" instead of "who merged this change." The former improves your system; the latter improves nothing and burns your engineers.
One More Calibration Trap
If your SLO is too tight, your budget is always near zero and the team is always on edge. If it's too loose, budget never becomes a meaningful constraint and you stop caring about reliability.
A useful heuristic: set your initial SLO based on what you've actually been delivering over the past 90 days, then tighten it by one nine. If you've been at 99.5% historically, start at 99.5% — not 99.99% because that sounds professional. You can always raise the bar once the team is comfortable with the tooling and culture.
The Burnout-Prevention Summary
- Alert on burn rate, not remaining budget — fast burn needs two windows to confirm, slow burn just needs a ticket
- Make the budget visible and shared, not something only SREs understand
- Use a depleted budget to enforce a reliability freeze — the policy should exist before you need it
- Treat budget spend as a system signal, not a blame signal
- Set SLOs based on historical reality, then improve incrementally
Error budgets aren't magic. They're a forcing function that turns vague reliability anxiety into concrete policy. Do the forcing function right and your on-call rotation gets quieter, your deploys get less scary, and your team stops dreading Monday morning.
Do it wrong and you've just added a countdown timer to your existing stress. The timer doesn't help.
What's your current SLO setup — threshold alerts, burn rate, or something else entirely? I'd genuinely like to know what's working. Find me on GitHub or drop a comment below.