Database Read Replicas: Stop Choking Your Primary Database at 11 PM ššļø
Database Read Replicas: Stop Choking Your Primary Database at 11 PM ššļø
True story: It was a Tuesday evening, not even a sale day. Our e-commerce platform's primary RDS instance was pegged at 97% CPU. Response times climbed from 200ms to 8 seconds. Checkout was broken.
I checked the slow query log. Someone - and I'm not naming names but it rhymes with "analytics team" - had kicked off a report joining the orders, users, and products tables with a three-year date range. Against the production primary database. During peak hours.
Me: "Who ran a massive analytics query against the primary DB?"
Analytics Lead: "I needed the data for tomorrow's board meeting..."
Me: "It just took down checkout for 12,000 concurrent users."
Analytics Lead: "...oops."
That night taught me the most important database scaling lesson of my career: Your primary database should do one thing - accept writes. Everything else gets a replica.
Welcome to read replicas - the database pattern that prevents analytics reports from killing your checkout flow!
What's a Read Replica Anyway? š¤
A read replica is a copy of your primary database that stays synchronized via replication. Writes go to primary, reads can go anywhere.
Before Read Replicas (our disaster setup):
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā Primary DB (RDS) ā
ā ā
ā ā User checkouts (writes) ā
ā ā Product page loads (reads) ā
ā ā Analytics queries (HUGE reads) ā
ā ā Admin dashboard (reads) ā
ā ā Search queries (reads) ā
ā ā Background reports (ENORMOUS reads) ā
ā ā
ā CPU: 97% š„ Connections: 499/500 š ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
After Read Replicas (sanity restored):
āāāāāāāāāāāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāāāāā
ā Primary DB (RDS) ā āāāŗ ā Read Replica #1 ā
ā ā āāāŗ ā (App reads) ā
ā ā Writes only ā āāāŗ ā ā
ā ā Critical reads ā āāāāāāāāāāāāāāāāāāāāāāā
ā ā
ā CPU: 35% ā
ā āāāāāāāāāāāāāāāāāāāāāāā
ā Connections: 120 ā
ā āāāŗ ā Read Replica #2 ā
āāāāāāāāāāāāāāāāāāāāāāāā ā (Analytics/Reports)ā
ā ā
āāāāāāāāāāāāāāāāāāāāāāā
Translation: Primary database is now a write machine. Replicas handle the read storm. Analytics team can run their cursed three-year reports all they want - on their own replica!
How Replication Actually Works š§
Understanding the mechanism helps you understand the trade-offs:
Primary DB
ā
ā 1. Write happens (INSERT/UPDATE/DELETE)
ā 2. Change written to binary log (binlog/WAL)
ā 3. Replica connects and streams binlog
ā 4. Replica applies changes
ā
ā¼
Replica DB (almost identical to primary, with slight lag)
AWS RDS Aurora Replication:
Primary (writer) ā Aurora storage layer (6 copies across 3 AZs)
ā Replica 1 reads from shared storage (lag: ~10-20ms!)
ā Replica 2 reads from shared storage (lag: ~10-20ms!)
Standard RDS MySQL/PostgreSQL Replication:
Primary ā Binary log ā Network transfer ā Replica applies changes
Lag: 10ms - 30s depending on write load and network!
The replica lag is the critical number you need to understand. Aurora is nearly zero. Standard RDS can be significant under heavy write load.
The Scalability Lesson That Cost Us šø
When I was scaling our e-commerce backend past 100k daily orders, our single primary RDS instance was showing its limits:
// What our app looked like before replicas
// Everything hitting one database:
// Product listing page (100 reads/second peak)
const products = await db.query(
'SELECT * FROM products WHERE category_id = ? AND active = 1 ORDER BY rank DESC',
[categoryId]
);
// Order history (50 reads/second)
const orders = await db.query(
'SELECT o.*, oi.* FROM orders o JOIN order_items oi ON o.id = oi.order_id WHERE o.user_id = ?',
[userId]
);
// Checkout (writes - 30/second peak)
await db.query('INSERT INTO orders (user_id, total, status) VALUES (?, ?, ?)', [...]);
// Analytics dashboard (monster queries, 5/minute but each takes 10+ seconds)
const revenue = await db.query(`
SELECT DATE(created_at) as date, SUM(total) as revenue
FROM orders
WHERE created_at >= ? AND status = 'completed'
GROUP BY DATE(created_at)
`, [startDate]);
// Result: ALL of these competing for the same 500 connections
// Peak hours = checkout timeouts = lost revenue = angry CTO š
After adding read replicas in Laravel:
// config/database.php
'mysql' => [
'read' => [
'host' => [
env('DB_READ_HOST_1', '127.0.0.1'),
env('DB_READ_HOST_2', '127.0.0.1'), // Laravel load balances between these!
],
],
'write' => [
'host' => [env('DB_HOST', '127.0.0.1')],
],
'sticky' => true, // CRITICAL - explained below!
'driver' => 'mysql',
'port' => env('DB_PORT', '3306'),
'database' => env('DB_DATABASE', 'forge'),
'username' => env('DB_USERNAME', 'forge'),
'password' => env('DB_PASSWORD', ''),
// ...
],
That's it for the framework config. Laravel automatically routes SELECT queries to replicas and INSERT/UPDATE/DELETE to primary.
// This goes to READ replica automatically ā
$products = Product::where('category_id', $id)->where('active', 1)->get();
// This goes to PRIMARY (write) ā
Order::create(['user_id' => $userId, 'total' => $total]);
// Force primary for critical read (more on why below!)
$order = Order::on('mysql::write')->find($orderId);
Impact:
- Primary CPU: 97% ā 35%
- Checkout response time: 8s ā 180ms
- Analytics queries no longer affected checkout
- We slept through the night again š
The Replica Lag Problem: Why You'll Get Burned š„
Here's the trap that catches EVERYONE the first time. I fell for it on a Monday morning:
Timeline of "Read-After-Write Consistency" failure:
T=0ms User clicks "Place Order" ā
T=1ms INSERT into orders table on PRIMARY
T=1ms API returns: { orderId: 12345, status: "success" }
T=2ms User browser redirects to /order/12345
T=3ms SELECT from orders WHERE id = 12345
ā Goes to REPLICA
ā Replica lag: 50ms
ā Order doesn't exist yet on replica! š±
T=3ms API returns: 404 Not Found
User: "IT SAYS ORDER NOT FOUND BUT YOU CHARGED MY CARD!!!"
Me: *reaches for the antacids*
The sticky option is your first line of defense:
// With 'sticky' => true in Laravel config:
// If you wrote to primary in this request,
// subsequent reads in the SAME REQUEST go to primary too!
// Request: POST /checkout
Order::create([...]); // ā PRIMARY (write)
$order = Order::find($id); // ā PRIMARY (sticky! same request wrote here)
return response()->json($order); // ā
Order exists!
// Next request: GET /order/12345 (new request, sticky resets)
$order = Order::find($id); // ā REPLICA (normal read)
// Might still be lagging... š¬
The proper solution - explicitly route critical reads to primary:
// Node.js with separate DB pools
const { Pool } = require('pg');
const primaryPool = new Pool({
host: process.env.DB_PRIMARY_HOST,
database: process.env.DB_NAME,
// ...
});
const replicaPool = new Pool({
host: process.env.DB_REPLICA_HOST,
database: process.env.DB_NAME,
// ...
});
// helper to choose the right connection
function db(opts = {}) {
return opts.primary ? primaryPool : replicaPool;
}
// SAFE: User just created an order, read from primary!
router.post('/checkout', async (req, res) => {
const order = await db({ primary: true }).query(
'INSERT INTO orders (user_id, total) VALUES ($1, $2) RETURNING *',
[req.user.id, total]
);
// Read-after-write: MUST use primary here!
const fullOrder = await db({ primary: true }).query(
'SELECT o.*, u.email FROM orders o JOIN users u ON o.user_id = u.id WHERE o.id = $1',
[order.rows[0].id]
);
res.json(fullOrder.rows[0]);
});
// FINE: Browsing products - replica is great here, stale data is OK
router.get('/products', async (req, res) => {
const products = await db().query(
'SELECT * FROM products WHERE active = TRUE ORDER BY rank DESC LIMIT 50'
);
res.json(products.rows);
});
Rule of thumb I use:
PRIMARY reads:
ā
Immediately after a write (read-after-write)
ā
Financial data (balance, payment status)
ā
Inventory checks before purchase
ā
Authentication (is this session valid?)
ā
Anything where stale = bad user experience
REPLICA reads:
ā
Product listings, search results
ā
User profile pages
ā
Analytics dashboards (ideally dedicated replica!)
ā
Admin reports
ā
Anything where 1-10 seconds stale is fine
Setting Up Read Replicas on AWS RDS š ļø
The AWS Console way (don't @ me, sometimes console is fine for one-time setup):
# AWS CLI to create a read replica
aws rds create-db-instance-read-replica \
--db-instance-identifier myapp-replica-1 \
--source-db-instance-identifier myapp-primary \
--db-instance-class db.r6g.xlarge \
--availability-zone us-east-1b \
--publicly-accessible false \
--tags Key=Environment,Value=production Key=Role,Value=replica
Terraform (the right way for production):
# Primary RDS
resource "aws_db_instance" "primary" {
identifier = "myapp-primary"
engine = "mysql"
engine_version = "8.0"
instance_class = "db.r6g.2xlarge"
allocated_storage = 500
backup_retention_period = 7
backup_window = "03:00-04:00"
# Enable automated backups (required for replicas!)
skip_final_snapshot = false
tags = { Role = "primary" }
}
# Read Replica #1 - For app reads
resource "aws_db_instance" "replica_app" {
identifier = "myapp-replica-app"
replicate_source_db = aws_db_instance.primary.identifier
instance_class = "db.r6g.xlarge" # Smaller is fine for reads!
# No backup needed on replica (primary handles it)
backup_retention_period = 0
skip_final_snapshot = true
tags = { Role = "replica-app" }
}
# Read Replica #2 - For analytics (can be different size!)
resource "aws_db_instance" "replica_analytics" {
identifier = "myapp-replica-analytics"
replicate_source_db = aws_db_instance.primary.identifier
instance_class = "db.r6g.4xlarge" # Bigger for heavy analytics queries
backup_retention_period = 0
skip_final_snapshot = true
tags = { Role = "replica-analytics" }
}
# .env
DB_PRIMARY_HOST=myapp-primary.xxxxx.us-east-1.rds.amazonaws.com
DB_REPLICA_APP_HOST=myapp-replica-app.xxxxx.us-east-1.rds.amazonaws.com
DB_ANALYTICS_HOST=myapp-replica-analytics.xxxxx.us-east-1.rds.amazonaws.com
Monitoring Replica Lag: The Number You Must Watch š
# Check replica lag via AWS CLI
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=myapp-replica-app \
--start-time 2026-03-13T00:00:00Z \
--end-time 2026-03-13T23:59:59Z \
--period 300 \
--statistics Average Maximum
Alert thresholds I use in production:
# CloudWatch Alarm (Terraform)
resource "aws_cloudwatch_metric_alarm" "replica_lag_warning" {
alarm_name = "replica-lag-warning"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = "60"
statistic = "Average"
threshold = "10" # 10 seconds lag = warning
alarm_description = "Replica lag exceeding 10 seconds"
dimensions = {
DBInstanceIdentifier = "myapp-replica-app"
}
}
resource "aws_cloudwatch_metric_alarm" "replica_lag_critical" {
alarm_name = "replica-lag-critical"
threshold = "30" # 30 seconds = route ALL reads to primary!
# ... same config
}
When lag spikes, automatically fall back to primary:
// Automatic failover when lag is too high
let replicaLagSeconds = 0;
// Poll replica lag every 30 seconds
setInterval(async () => {
try {
const result = await primaryPool.query(
"SHOW SLAVE STATUS" // or SHOW REPLICA STATUS in MySQL 8.0+
);
replicaLagSeconds = result.rows[0]?.Seconds_Behind_Master ?? 0;
console.log(`Replica lag: ${replicaLagSeconds}s`);
} catch (err) {
console.error('Failed to check replica lag:', err);
}
}, 30000);
function getReadPool() {
// If replica is more than 15 seconds behind, use primary
if (replicaLagSeconds > 15) {
console.warn(`High replica lag (${replicaLagSeconds}s), routing to primary`);
return primaryPool;
}
return replicaPool;
}
Common Read Replica Mistakes (My Personal Hall of Shame) šŖ¤
Mistake #1: Running Transactions That Span Both Primary and Replica
// BAD: Mixing reads and writes in a transaction is confusing
await db.transaction(async (trx) => {
// This write goes to primary
await trx.query('INSERT INTO orders ...');
// But then you try to read from replica... which doesn't have this yet!
const order = await replicaPool.query('SELECT * FROM orders WHERE id = ?', [id]);
// š Order might not exist on replica yet!
});
// GOOD: All reads within a transaction use the same connection (primary)
await db.transaction(async (trx) => {
await trx.query('INSERT INTO orders ...');
const order = await trx.query('SELECT * FROM orders WHERE id = ?', [id]);
// ā
Same connection, guaranteed consistency
});
Mistake #2: Not Separating Analytics Traffic
Before analytics replica:
Analytics query (10 seconds, full table scan) ā PRIMARY
Result: 10,000 users experience slow product pages simultaneously
After analytics replica:
Analytics query ā ANALYTICS REPLICA (dedicated, can be crushed, nobody cares)
Product pages ā APP REPLICA (fast, lightly loaded)
Checkout ā PRIMARY (minimal reads, fast writes)
Mistake #3: Ignoring the Cost Savings Angle
As a Technical Lead, I've learned: read replicas often SAVE money.
Option A: Upgrade primary from db.r6g.2xlarge to db.r6g.8xlarge
Cost: +$800/month, solves current load but next bottleneck is 3 months away
Option B: Add db.r6g.xlarge read replica
Cost: +$200/month, removes 70% of load from primary
Primary now has runway for 12+ months of growth
Plus: built-in HA if primary fails (promote replica!) š”
The Full Read Replica Architecture šļø
What our e-commerce backend looks like today:
āāāāāāāāāāāāāāāāāāāā
User Request āāāāāāāŗā API Gateway ā
āāāāāāāāāā¬āāāāāāāāāā
ā
āāāāāāāāāā¼āāāāāāāāāā
ā Application ā
ā Servers ā
āāāāāāāāāā¬āāāāāāāāāā
ā
āāāāāāāāāāāāāāāā¼āāāāāāāāāāāāāāā
ā ā ā
ā¼ ā¼ ā¼
āāāāāāāāāāāāāāāā āāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāā
ā PRIMARY ā ā REPLICA 1 ā ā REPLICA 2 ā
ā (Writes) ā ā (App Reads) ā ā (Analytics) ā
ā ā ā ā ā ā
ā ⢠Checkouts ā ā ⢠Products ā ā ⢠Reports ā
ā ⢠Order mgmt ā ā ⢠Profiles ā ā ⢠Dashboards ā
ā ⢠Inventory ā ā ⢠Searches ā ā ⢠Exports ā
ā writes ā ā ā ā ⢠BI queries ā
ā ā ā ~20ms lag ā ā ~30ms lag (ok!) ā
āāāāāāāā¬āāāāāāāā āāāāāāāāāāāāāāā āāāāāāāāāāāāāāāāāāāā
ā
ā Replication stream
āāāāāāāāŗ both replicas
Result:
- Primary CPU: 35% (writes only)
- Checkout p99: 145ms (down from 8 seconds during the dark times)
- Analytics reports: Still slow, but isolated on their own replica
- Sleep quality: Dramatically improved š“
When NOT to Use Read Replicas š«
Read replicas aren't magic. Skip them when:
- You're early stage - Premature optimization. One beefy RDS instance handles more than you think. Add replicas when CPU or connection count becomes the bottleneck.
- All your reads need to be consistent - If your app truly can't tolerate any stale reads, replicas are more complexity than they're worth. Consider Aurora with its near-zero lag instead.
- Your bottleneck is writes - Replicas help reads. If 90% of your load is writes, replicas don't solve anything. Consider sharding or write optimization instead.
- Your queries are just poorly written - Before adding replicas, check your slow query log. A missing index might fix 80% of your load.
TL;DR - The Replica Cheat Sheet šÆ
USE REPLICAS FOR:
ā
App reads (products, profiles, listings) ā App Replica
ā
Analytics queries ā Dedicated Analytics Replica
ā
Background report generation ā Analytics Replica
ā
Admin dashboards ā Either replica
KEEP ON PRIMARY:
ā
All writes
ā
Read-after-write scenarios
ā
Financial/inventory reads before purchase
ā
Authentication checks
WATCH OUT FOR:
ā ļø Replica lag (monitor it, alert on it, handle it)
ā ļø Read-after-write consistency (use sticky or explicit primary)
ā ļø Transactions (all reads in a transaction use same connection)
ā ļø Cost vs benefit (one replica pays for itself fast; ten might not)
Your primary database is not a garbage dump for every read query your imagination can conjure. Give it one job. Let your replicas handle the flood.
When designing our e-commerce backend, read replicas went from "nice optimization" to "how were we ever surviving without this." The day we stopped routing analytics queries to primary was the day checkout became reliably fast again.
Your primary database will thank you. Your users will thank you. Your on-call rotation will thank you. š
Scaling a database-heavy system? Connect with me on LinkedIn - always happy to geek out about replica lag and replication topology.
Production configs and more patterns? Check out my GitHub.
Now go create those replicas before the analytics team finds your primary DB endpoint. šāØ
P.S. If your analytics team has direct access to your production primary database, take a deep breath, revoke those credentials, set up a read replica, and give them that endpoint instead. You'll sleep better. Trust me. š¤