0x55aa
Back to Blog

Graceful Degradation: Serve Something Useful When Everything Is on Fire 🔥

|
6 min read

There's a moment in every on-call rotation where your monitoring dashboard looks like a Christmas tree and three different Slack threads are asking "is the site down?" at once.

The mature response isn't to scramble until everything is green again. The mature response is to have already designed your system so that most users never noticed anything was wrong in the first place.

That's graceful degradation. And it's one of those ideas that sounds obvious in retrospect but gets completely ignored until 2 AM when it's too late.

The 503 Is Not a Strategy

When a dependency fails, the instinct is to propagate that failure up the stack. Your recommendation service is down? Throw a 500. Your inventory API is timing out? Let the product page fail. Your auth service is flapping? Log everyone out.

This is the correct behavior exactly never.

Users don't care about your microservice topology. They came to accomplish a task. Your job is to give them as much of that task as possible, even when your internals are having a bad day.

The question to ask for every dependency in your system: "If this goes away, what can I still serve?"

Fallback Chains: The Real Pattern

Graceful degradation isn't one thing — it's a hierarchy of increasingly less-ideal-but-still-useful responses:

Primary:    live data from the database
  ↓
Fallback 1: data from Redis cache (possibly stale)
  ↓
Fallback 2: data from a local in-memory cache (more stale)
  ↓
Fallback 3: static "default" response embedded in the binary
  ↓
Last resort: a friendly message explaining reduced functionality

Every step down the chain loses freshness or richness, but every step is better than an error. A product page that shows yesterday's price and inventory is infinitely more useful than a 500 page that bounces the user to a competitor.

At Cubet, we built a content recommendation engine for a media client that called three different ML services to personalize the homepage. When we first shipped it, any one of those services going down would crash the entire page render. We redesigned the fallback chain so that each missing signal degrades to something less personalized — regional trends instead of user history, editorial picks instead of regional trends, and a hardcoded "popular this week" list as the absolute floor. The page always loads. The quality just varies.

Stale-While-Revalidate at the Service Level

HTTP has had stale-while-revalidate in the Cache-Control header for years. The idea: serve from cache immediately, refresh in the background. The same pattern works beautifully at the service level.

import redis
import time

CACHE_KEY = "product:{product_id}"
STALE_TTL = 3600       # serve stale up to 1 hour
REFRESH_THRESHOLD = 60  # try to refresh if older than 60s

def get_product(product_id: str) -> dict:
    cache = redis.get(CACHE_KEY.format(product_id=product_id))

    if cache:
        data = json.loads(cache)
        age = time.time() - data["cached_at"]

        if age < REFRESH_THRESHOLD:
            return data["value"]  # fresh enough, return directly

        # Stale but within tolerance — trigger background refresh
        enqueue_background_refresh(product_id)
        return data["value"]  # serve stale now

    # Cache miss — fetch live, populate cache
    try:
        value = fetch_from_db(product_id)
        redis.setex(
            CACHE_KEY.format(product_id=product_id),
            STALE_TTL,
            json.dumps({"value": value, "cached_at": time.time()})
        )
        return value
    except Exception:
        return DEFAULT_PRODUCT_RESPONSE  # absolute fallback

The key move: don't let cache expiry and dependency failure happen at the same time. Extend TTLs during incidents. A 48-hour-old cached response is almost always better than an error page.

Feature Flags as a Degradation Knob

Not all features have equal blast radius. Some are core (add to cart), some are enrichments (recommendations, reviews, social proof). Build your feature flags so you can surgically disable enrichments under load without touching core flows.

# Unleash / LaunchDarkly flag config
features:
  recommendations_panel:
    default: true
    kill_switch: true     # ops can flip this in 30 seconds

  social_proof_widget:
    default: true
    kill_switch: true

  checkout_flow:
    default: true
    kill_switch: false    # never kill this one manually

During a downstream ML service incident, your on-call flips recommendations_panel to false. The page renders without recommendations. Zero code deploys, zero service restarts. Users see a slightly emptier page instead of a broken one.

The discipline here: decide before the incident which features are killable and wire them up. Don't make that decision at 2 AM while your dependency is on fire.

Circuit Breakers: Don't Keep Calling a Corpse

One pattern that complements degradation: circuit breakers. If a downstream service has been failing for 10 consecutive requests, stop calling it entirely for 30 seconds. Return the degraded response immediately instead of hammering a service that's already struggling and adding latency to every user request.

from pybreaker import CircuitBreaker

inventory_breaker = CircuitBreaker(
    fail_max=5,        # open after 5 consecutive failures
    reset_timeout=30,  # try again after 30 seconds
)

@inventory_breaker
def fetch_inventory(sku: str) -> int:
    return inventory_service.get(sku)

def get_inventory_with_fallback(sku: str) -> dict:
    try:
        count = fetch_inventory(sku)
        return {"count": count, "source": "live"}
    except Exception:
        # Circuit open, or service call failed
        return {"count": None, "source": "unavailable", "message": "Availability shown at checkout"}

The circuit breaker keeps your service healthy. The fallback keeps your users informed. Together, they beat a cascade failure that takes everything down with it.

What Not to Degrade

A word of caution: some things should not degrade silently.

  • Security checks: never skip auth validation because the auth service is slow. Block the request instead.
  • Write operations: don't accept a payment or booking if you can't confirm it. A failed write with a clear error is better than a silent partial write.
  • Data that must be accurate: medical dosing, financial balances, legal records. Stale data here isn't "degraded" — it's dangerous.

Graceful degradation is for enrichments and display-layer data. For anything that involves money, safety, or legal compliance, fail loudly and explicitly.

The Test You're Not Running

Most teams test the happy path obsessively and test degraded behavior never. Add chaos to your staging environment: kill the recommendation service, let the cache expire, throttle the inventory API. Does your application serve something useful, or does it fall apart?

The answer will probably surprise you. It surprised us — in a way that was much cheaper to fix in staging than in production.

Design for the fire. Because eventually, something's always on fire.


What's your degradation floor? Have a fallback chain that saved you during an incident, or a story where the lack of one made things worse? I'd be curious — find me on GitHub or drop a comment on this post.

Thanks for reading!

Back to all posts