The Bulkhead Pattern: Stop Letting One Failing Service Sink Your Entire App π’
The Bulkhead Pattern: Stop Letting One Failing Service Sink Your Entire App π’
Fun fact: The Titanic had watertight bulkhead compartments designed to contain flooding and keep the ship afloat. It could survive up to 4 compartments flooding. It hit the iceberg across 5.
Your microservices architecture? Zero compartments. One service starts drowning and it drags everything else down with it. You've basically built an open-plan Titanic.
I learned this lesson expensively. Let me save you the same pain.
The Disaster That Made Me a Believer π₯
When designing our e-commerce backend, I was proud of our microservices setup. Clean separation: Product Service, Inventory Service, Payment Service, Notification Service, User Service.
One Tuesday afternoon, our third-party payment processor had an outage. Their API started responding in 30+ seconds instead of 300ms.
Sounds contained, right? Payment service problem. Other services should be fine.
They weren't fine.
Here's what actually happened:
Payment API slows down to 30s response times
β
Payment Service: "I'll wait for the response..."
β
Our thread pool: "Here, take ALL our threads to wait with!"
β
Thread pool: FULL (all threads stuck waiting on Payment API)
β
New requests arrive: "Hey, I just want to browse products!"
β
Thread pool: "Sorry, we're all busy waiting on payments"
β
Product catalog: OFFLINE
Homepage: OFFLINE
Customer accounts: OFFLINE
β
PagerDuty: π₯π₯π₯
A third-party payment processor's slowdown (not even a crash!) took down our entire application. All because every service shared the same resource pool.
This is the cascading failure problem. And the Bulkhead Pattern is the fix.
What Is the Bulkhead Pattern? ποΈ
Named directly after ship bulkheads β the watertight walls dividing a ship's hull into separate compartments β the idea is simple:
Isolate your resources so that a failure in one partition cannot exhaust resources in another.
Instead of one shared thread pool / connection pool / Lambda concurrency budget for everything, you create isolated resource pools per service or use case.
WITHOUT BULKHEADS:
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Shared Thread Pool (50 threads) β
β β
β [Product] [Inventory] [Payment π΅] [Notif] β
β β β
β Payment API slow = all 50 threads β
β stuck here = EVERYONE is affected β
ββββββββββββββββββββββββββββββββββββββββββββββββββββ
WITH BULKHEADS:
ββββββββββββ ββββββββββββββ βββββββββββββ ββββββββββββ
β Product β β Inventory β β Paymentπ΅ β β Notif β
β Pool β β Pool β β Pool β β Pool β
β 15 thds β β 10 thds β β 15 thds β β 10 thds β
ββββββββββββ ββββββββββββββ βββββββββββββ ββββββββββββ
β
Payment drowns β only its pool is affected.
Product, Inventory, Notif keep sailing! β΅
The Titanic-equivalent: if one compartment floods, the watertight walls contain it. The ship keeps sailing.
Three Ways to Implement Bulkheads π§
1. Semaphore Isolation (Node.js / Backend Services)
In Node.js we don't have traditional threads, but we have concurrent requests and connection limits. A semaphore-based bulkhead limits how many concurrent calls each service can have in flight:
// bulkhead.js
class Bulkhead {
constructor(name, maxConcurrent) {
this.name = name;
this.maxConcurrent = maxConcurrent;
this.current = 0;
}
async execute(fn) {
if (this.current >= this.maxConcurrent) {
// Fail fast β don't queue forever and make everything wait
throw new Error(
`Bulkhead [${this.name}] at capacity (${this.current}/${this.maxConcurrent}). Rejecting.`
);
}
this.current++;
try {
return await fn();
} finally {
this.current--;
}
}
}
// Isolated pool per downstream service
const paymentBulkhead = new Bulkhead('payment-service', 15);
const productBulkhead = new Bulkhead('product-service', 20);
const inventoryBulkhead = new Bulkhead('inventory-service', 10);
module.exports = { paymentBulkhead, productBulkhead, inventoryBulkhead };
Then wire it into your service calls:
// paymentService.js
const { paymentBulkhead } = require('./bulkhead');
async function chargeCustomer(orderId, amount) {
try {
return await paymentBulkhead.execute(async () => {
const response = await axios.post('https://payment-api/charge',
{ orderId, amount },
{ timeout: 5000 }
);
return response.data;
});
} catch (err) {
if (err.message.includes('Bulkhead')) {
// Bulkhead full β gracefully degrade: queue payment for retry
await queuePaymentForLater(orderId, amount);
return { status: 'QUEUED', message: 'Payment processing delayed' };
}
throw err;
}
}
// productService.js β COMPLETELY isolated pool
async function getProduct(productId) {
return await productBulkhead.execute(async () => {
return await db.products.findById(productId);
});
// Payment being slow doesn't affect this AT ALL β
}
The key insight: When paymentBulkhead hits its 15-concurrent limit and starts rejecting, productBulkhead has its own separate 20-concurrent pool. Product browsing keeps working perfectly while payment is struggling.
2. Database Connection Pool Isolation ποΈ
This is the one that bites everyone in e-commerce. You have one DB connection pool (let's say 20 connections). Your reporting queries start running slow and eat 18 of those connections. Your order API gets 2 connections and starts timing out at checkout.
Fix: Separate pools per purpose:
// db.js β bulkheaded connection pools
const knex = require('knex');
// OLTP pool: fast, time-critical, customer-facing
const oltpPool = knex({
client: 'pg',
connection: process.env.DATABASE_URL,
pool: {
min: 2,
max: 10,
acquireTimeoutMillis: 3000 // Fail fast β user is waiting
}
});
// Reporting pool: isolated, slow queries welcome
const reportingPool = knex({
client: 'pg',
connection: process.env.DATABASE_URL,
pool: {
min: 1,
max: 5,
acquireTimeoutMillis: 30000 // Reporting can wait longer
}
});
// Background jobs pool: won't compete with OLTP ever
const backgroundPool = knex({
client: 'pg',
connection: process.env.DATABASE_URL,
pool: { min: 1, max: 5 }
});
// Usage:
// Order placement β oltpPool (customer is waiting!)
// Monthly reports β reportingPool (slow is fine)
// Batch email sync β backgroundPool (async, no rush)
A scalability lesson that cost us: On our e-commerce backend, our hourly inventory sync job would open 15 database connections simultaneously for bulk updates. This starved our checkout flow of DB connections. Result: 503 errors every hour, exactly on the hour. We nicknamed it "the hourly panic." Separate pools fixed it in one deploy.
3. Lambda Reserved Concurrency (AWS Serverless Bulkheads) β‘
This one is AWS-specific and stupidly simple. Every Lambda in your account shares a concurrency limit (default: 1,000). One bursty function can eat all 1,000 and leave zero for other functions.
# Give each critical function its own reserved concurrency "compartment"
aws lambda put-function-concurrency \
--function-name payment-processor \
--reserved-concurrent-executions 100
aws lambda put-function-concurrency \
--function-name product-api \
--reserved-concurrent-executions 200
aws lambda put-function-concurrency \
--function-name order-api \
--reserved-concurrent-executions 150
# payment-processor can spike to 100 β it cannot steal from product-api.
# product-api always has up to 200 reserved. Guaranteed.
The bonus: Reserved concurrency also guarantees minimum capacity. order-api will always have up to 150 concurrent executions available, even if everything else is running hot during a Black Friday spike.
Bulkhead vs. Circuit Breaker: What's the Difference? π
These two patterns get confused constantly. I confused them early on.
CIRCUIT BREAKER:
"If the payment API is failing, stop calling it temporarily."
β Detects failures and opens a switch to prevent further calls
β Protects the downstream service from being hammered
β Like a fuse in your electrical panel β trips when overloaded
BULKHEAD:
"Payment API calls get their own isolated resource pool."
β Limits blast radius when something goes wrong
β Protects YOUR application from resource exhaustion
β Like watertight compartments β contains the damage
THEY COMPLEMENT EACH OTHER:
β Bulkhead limits how many threads can wait on Payment API
β Circuit Breaker stops calling Payment API after N failures
β Together: fast failure + contained blast radius = resilience
As a Technical Lead, I've learned: Circuit Breakers and Bulkheads solve different problems. Circuit Breaker is reactive (detect failure, open circuit). Bulkhead is proactive (partition resources upfront). Use both.
Trade-offs: This Isn't Free Lunch π½οΈ
β
WHAT YOU GAIN:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Failure isolation β one crash doesn't cascade everywhere β
β Resource guarantees β critical paths always have capacity β
β Predictable degradation β you control what fails gracefully β
β Easier capacity planning β size each pool independently β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β οΈ WHAT YOU GIVE UP:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Higher total resources β idle capacity can't be shared β
β Configuration complexity β tune each pool separately β
β Risk of over-partitioning β too many tiny pools = waste β
β Fast rejections β bulkhead full = fail immediately, not waitβ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
The hardest part: Sizing each pool correctly. Too small and you reject legitimate traffic. Too large and you don't get isolation benefits. Start conservative, measure under load, and adjust. Never guess at production numbers β load test.
When to Use (and Skip) Bulkheads π―
Use bulkheads when:
- Multiple critical services share the same process or connection pool
- One slow third-party API could block your whole application
- You mix user-facing (latency-sensitive) and background (tolerant) workloads
- Your AWS Lambdas share account-level concurrency limits
- You've already survived a cascading failure (don't wait for this!)
Skip them if:
- You have a simple monolith with one or two external dependencies
- Traffic is low and resource contention genuinely isn't a concern
- You're in early development β premature optimization applies here
- Services are deployed as completely separate processes (naturally isolated)
My rule of thumb: If a service's failure would make me open PagerDuty at 2 AM, it gets a bulkhead. Everything critical gets its own compartment.
Common Mistakes to Avoid πͺ€
Mistake #1: Making bulkhead limits too tight
// BAD: 3 concurrent calls to payment API?
// Normal traffic will constantly hit the limit!
const paymentBulkhead = new Bulkhead('payment', 3);
// GOOD: measure your actual P99 concurrency in production first
// then set limits ~20% above your observed peak
const paymentBulkhead = new Bulkhead('payment', 20);
Mistake #2: No fallback when bulkhead rejects
// BAD: Let the error bubble up as a 500
return await paymentBulkhead.execute(() => callPaymentAPI());
// GOOD: Degrade gracefully
try {
return await paymentBulkhead.execute(() => callPaymentAPI());
} catch (err) {
if (err.message.includes('Bulkhead')) {
return { status: 'RETRY_LATER', retryAfter: 5 }; // Tell client to retry
}
}
Mistake #3: One giant bulkhead for everything
// BAD: One shared limit for ALL external calls β defeats the purpose
const globalBulkhead = new Bulkhead('everything', 50);
// GOOD: Separate bulkhead per service
// Now one service being slow can't starve another
Mistake #4: Not monitoring bulkhead utilization
// Add metrics so you can tune limits and detect pressure
setInterval(() => {
console.log(JSON.stringify({
payment: { current: paymentBulkhead.current, max: 15 },
product: { current: productBulkhead.current, max: 20 },
}));
// Ship these to CloudWatch / Datadog / your observability tool
}, 10000);
If current / max is consistently above 80%, your bulkhead is too small. If it's consistently below 20%, you might be able to shrink it and free resources elsewhere.
The Real Production Test π₯
After implementing bulkheads on our e-commerce backend, I deliberately chaos-tested it. I introduced a 10-second artificial delay in our payment service.
Before bulkheads: Within 30 seconds, product browsing, account pages, and cart were all returning 503s. Total outage.
After bulkheads: Payment checkout was slow (expected β the delay was artificial). Product browsing? Blazing fast. Cart? Working. Account pages? Fine. Payment service was isolated in its own compartment, flooding alone, while the rest of the app sailed on.
That test β watching the rest of the app stay healthy while payment struggled β was one of the most satisfying moments of my career as a Technical Lead.
TL;DR π‘
The Bulkhead Pattern is simple: partition your resources so one failure can't consume everything.
- Semaphore per service β payment slowdown doesn't block product browsing
- DB connection pool per workload β reporting queries don't starve checkouts
- Lambda reserved concurrency β bursty functions can't steal from critical paths
- Pair with Circuit Breakers β proactive isolation + reactive failure detection
When designing our e-commerce backend, this was the pattern that turned "total outage when payment API slows down" into "payment service is degraded, everything else is nominal." That's the difference between a P0 incident and a footnote in the status page.
The Titanic didn't have enough bulkheads to survive. Your app doesn't have to make the same mistake.
Had a cascading failure take down your whole app? Tell me the war story on LinkedIn β we've all been there.
Want the full bulkhead implementation? Check GitHub for production-ready code.
Now go partition your resources before the next payment API outage! π’πͺ