Deployment Smoke Tests: Stop Letting Users Tell You Your Deploy Is Broken
Confession time: I once deployed a Node.js API to production at 11 PM on a Friday, went to bed feeling proud, and woke up to 47 Slack messages, 3 emails from the client, and a call from my manager asking why the checkout page had been returning 500 errors for six hours.
The best part? The fix was one line. A missing environment variable. A DATABASE_URL that pointed to staging instead of production.
Sixty seconds of smoke testing would have caught it. Instead, users got six hours of downtime and I lost two hours of my Saturday morning fixing it.
That incident changed how I deploy software forever.
What's a Smoke Test Anyway?
A smoke test isn't a full integration test suite. It's not a unit test. It's a quick sanity check that runs immediately after deployment to verify the app didn't completely explode.
The name comes from hardware testing: you plug in a new circuit board and see if it catches fire. If it smokes, bad. If it doesn't, it's worth testing further.
In software terms:
- ✅ Can the app start and respond to requests?
- ✅ Can it connect to the database?
- ✅ Can it connect to Redis/cache?
- ✅ Do the most critical API endpoints return 200?
- ✅ Does the auth flow work?
That's it. Not 500 test cases. 5-10 checks. Under 60 seconds.
The Incident That Started It All
Let me tell you about Black Friday 2022.
Our e-commerce platform handled payment processing for 12 clients. We'd been running CI/CD for 8 months: GitHub Actions, automated tests, the works. We were confident.
```bash
# Our "deployment pipeline" at the time
npm run test        # ✅ 847 tests pass
docker build .      # ✅ Image built
kubectl apply -f .  # ✅ Pods deployed
# Done! Let's go home!
```
What we didn't check:
- Did the payment gateway ENV variable survive the Kubernetes secret rotation we did that morning?
- Did the new image have the right STRIPE_WEBHOOK_SECRET?
- Could the deployed pods actually talk to the payment service?
The result: For four hours on Black Friday, every payment attempt silently failed. Users got "Processing..." forever. We didn't know until a client called screaming.
After that incident, I spent a weekend building smoke tests. We've never had a silent post-deploy failure since.
Building Your First Smoke Test Suite
The Health Check Endpoint (Start Here)
First, add a proper health check endpoint to your app. Not just GET /ping returning "pong"; that tells you nothing useful.
Node.js/Express:
```javascript
// routes/health.js
const router = require('express').Router();
const db = require('../db');        // e.g. a Knex instance
const redis = require('../redis');

router.get('/health', async (req, res) => {
  const checks = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    version: process.env.APP_VERSION || 'unknown',
    checks: {}
  };

  // Check database connectivity
  try {
    await db.raw('SELECT 1');
    checks.checks.database = 'ok';
  } catch (err) {
    checks.checks.database = 'failed';
    checks.status = 'degraded';
  }

  // Check Redis connectivity
  try {
    await redis.ping();
    checks.checks.cache = 'ok';
  } catch (err) {
    checks.checks.cache = 'failed';
    checks.status = 'degraded';
  }

  // Check critical ENV vars exist
  const requiredEnvVars = ['DATABASE_URL', 'REDIS_URL', 'STRIPE_SECRET_KEY'];
  const missingEnvVars = requiredEnvVars.filter(v => !process.env[v]);
  if (missingEnvVars.length > 0) {
    checks.checks.config = `missing: ${missingEnvVars.join(', ')}`;
    checks.status = 'unhealthy';
  } else {
    checks.checks.config = 'ok';
  }

  const statusCode = checks.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(checks);
});

module.exports = router;
```
Laravel:
```php
// routes/api.php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Redis;
use Illuminate\Support\Facades\Route;

Route::get('/health', function () {
    $checks = [
        'status' => 'ok',
        'version' => config('app.version', 'unknown'),
        'checks' => [],
    ];

    // Database check
    try {
        DB::select('SELECT 1');
        $checks['checks']['database'] = 'ok';
    } catch (\Exception $e) {
        $checks['checks']['database'] = 'failed';
        $checks['status'] = 'degraded';
    }

    // Redis check
    try {
        Redis::ping();
        $checks['checks']['cache'] = 'ok';
    } catch (\Exception $e) {
        $checks['checks']['cache'] = 'failed';
        $checks['status'] = 'degraded';
    }

    // Required ENV vars (note: env() returns null once config is cached,
    // so prefer config() lookups for production health checks)
    $required = ['DB_HOST', 'REDIS_HOST', 'STRIPE_SECRET'];
    $missing = array_filter($required, fn ($v) => empty(env($v)));
    if (!empty($missing)) {
        $checks['checks']['config'] = 'missing: ' . implode(', ', $missing);
        $checks['status'] = 'unhealthy';
    } else {
        $checks['checks']['config'] = 'ok';
    }

    $statusCode = $checks['status'] === 'ok' ? 200 : 503;
    return response()->json($checks, $statusCode);
});
```
What a good health response looks like:
```json
{
  "status": "ok",
  "version": "v2.4.1",
  "timestamp": "2026-03-21T10:30:00Z",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "config": "ok"
  }
}
```
What a broken deploy looks like:
```json
{
  "status": "unhealthy",
  "checks": {
    "database": "ok",
    "cache": "ok",
    "config": "missing: STRIPE_SECRET_KEY, STRIPE_WEBHOOK_SECRET"
  }
}
```
That second response? It would have saved my Black Friday.
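Even before you build a full suite, a one-line gate on that status field pays off. A minimal sketch: the payload is hard-coded here for illustration, where in CI you'd fetch it with `curl -s "$APP_URL/health"`; `grep -q` keeps it dependency-free, though `jq -e '.status == "ok"'` is a nicer check if jq is available.

```shell
# Sample health payload; in CI: response=$(curl -s "$APP_URL/health")
response='{"status":"unhealthy","checks":{"config":"missing: STRIPE_SECRET_KEY"}}'

# grep's exit status becomes the deploy gate
if echo "$response" | grep -q '"status":"ok"'; then
  echo "health: ok"
else
  echo "health gate failed: $response"
  # in a pipeline you would `exit 1` here to trigger the rollback step
fi
```
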
The Smoke Test Script
Now build the script that runs after every deploy:
```bash
#!/bin/bash
# scripts/smoke-test.sh
APP_URL="${1:-https://api.myapp.com}"
MAX_RETRIES=5
RETRY_DELAY=10

echo "Running smoke tests against: $APP_URL"
echo "─────────────────────────────────"

PASS=0
FAIL=0

# Function to test an endpoint
check_endpoint() {
  local name="$1"
  local url="$2"
  local expected_status="${3:-200}"
  local retries=0

  while [ $retries -lt $MAX_RETRIES ]; do
    status=$(curl -s -o /dev/null -w "%{http_code}" \
      --max-time 10 \
      --header "Accept: application/json" \
      "$url")

    if [ "$status" -eq "$expected_status" ]; then
      echo "  ✅ $name → HTTP $status"
      PASS=$((PASS + 1))
      return 0
    fi

    retries=$((retries + 1))
    echo "  ⏳ $name → HTTP $status (retry $retries/$MAX_RETRIES)"
    sleep $RETRY_DELAY
  done

  echo "  ❌ $name → HTTP $status (expected $expected_status)"
  FAIL=$((FAIL + 1))
  return 1
}

# Core health check
check_endpoint "Health Check" "$APP_URL/health"

# Auth endpoints
check_endpoint "Login route exists" "$APP_URL/api/login" 405     # POST-only, so GET returns 405
check_endpoint "Register route exists" "$APP_URL/api/register" 405

# Public API endpoints
check_endpoint "Products list" "$APP_URL/api/products"
check_endpoint "Categories list" "$APP_URL/api/categories"

# Static assets
check_endpoint "API documentation" "$APP_URL/api/docs"

echo ""
echo "─────────────────────────────────"
echo "Results: ✅ $PASS passed | ❌ $FAIL failed"

if [ $FAIL -gt 0 ]; then
  echo "SMOKE TESTS FAILED! Consider rolling back!"
  exit 1
fi

echo "All smoke tests passed! Deploy successful!"
exit 0
```
Run it after every deploy:
```bash
./scripts/smoke-test.sh https://api.myapp.com
```
Wiring It Into Your CI/CD Pipeline
GitHub Actions
Before (naive deploy):
```yaml
# .github/workflows/deploy.yml
- name: Deploy to production
  run: |
    kubectl apply -f k8s/
    echo "Deployed!"
```
After (smoke-tested deploy):
```yaml
# .github/workflows/deploy.yml
name: Deploy & Verify

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: |
          kubectl apply -f k8s/
          kubectl rollout status deployment/api --timeout=120s

      - name: Wait for deployment to stabilize
        run: sleep 15

      - name: Run smoke tests
        run: |
          chmod +x scripts/smoke-test.sh
          ./scripts/smoke-test.sh ${{ secrets.PRODUCTION_URL }}

      - name: Rollback on smoke test failure
        if: failure()
        run: |
          echo "Smoke tests failed! Rolling back..."
          kubectl rollout undo deployment/api
          kubectl rollout status deployment/api --timeout=120s
          echo "Rollback complete! Investigate before redeploying."

      - name: Notify team
        if: failure()
        uses: 8398a7/action-slack@v3
        with:
          status: failure
          text: "Production deploy FAILED smoke tests! Rolled back automatically."
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
```
What this does:
- Deploy the new version
- Wait for pods to be ready
- Run smoke tests
- If tests fail → automatic rollback + Slack alert
- If tests pass → party time
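That deploy-verify-rollback sequence is worth internalizing even outside GitHub Actions. Here's the control flow sketched as plain shell, with the real commands stubbed out; `deploy`, `smoke_test`, and `rollback` are stand-in functions for `kubectl apply`, `./scripts/smoke-test.sh`, and `kubectl rollout undo` respectively.

```shell
# Stand-ins for the real pipeline steps (assumptions, not real commands)
deploy()     { echo "deploying new version"; }
smoke_test() { echo "running smoke tests"; return 1; }  # simulate a failure here
rollback()   { echo "rolling back"; }

deploy
if smoke_test; then
  echo "deploy verified"
else
  rollback
  echo "rolled back; investigate before redeploying"
fi
```

With `smoke_test` returning 1, the sketch takes the rollback branch; flip it to `return 0` and it prints "deploy verified" instead.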
A CI/CD pipeline that saved our team: We once pushed a config change that broke our payment processor connection. The smoke tests caught it in 45 seconds, rolled back automatically, and we fixed it before any user noticed. No incident. No 3 AM calls. Just a GitHub Actions notification.
Testing the Right Things
Not all smoke tests are equal. Here's what actually matters:
Tier 1 - Critical (must pass, rollback if fail):
```
# App is up and healthy
GET /health → 200

# Database is reachable
Health check includes DB check → "database": "ok"

# Most critical business endpoint works
GET /api/products → 200 with data

# Auth isn't broken
POST /api/login → 422 (needs credentials, but route works)
```
Tier 2 - Warning (fail loudly, but don't rollback):
```
# Third-party integrations
GET /api/payment/status → 200

# Background job status
GET /api/queues/health → 200
```
Tier 3 - Informational (log, don't block):
```
# Performance check
Response time < 2000ms

# Response size reasonable
Content-Length > 0
```
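The tier 3 latency check maps neatly onto curl's `-w "%{time_total}"` write-out variable. The value is a float, so compare it with awk rather than shell integer tests. A sketch, with the timing hard-coded for illustration; in CI it would come from the curl call in the comment:

```shell
# In CI: time_total=$(curl -s -o /dev/null -w "%{time_total}" "$APP_URL/api/products")
time_total="1.342"   # sample value for illustration

# awk handles the float comparison; exit 0 means "fast enough"
if awk -v t="$time_total" 'BEGIN { exit !(t < 2.0) }'; then
  echo "latency ok (${time_total}s)"
else
  echo "WARN: slow response (${time_total}s)"  # tier 3: log, don't block
fi
```
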
Before/After: Real Numbers from Our Team
Before smoke tests (6 months, 3-person team):
| Incident Type | Count | Avg Time to Detect | Avg Time to Fix |
|---|---|---|---|
| Broken ENV vars | 4 | 45 min (users reported) | 30 min |
| DB connection lost | 2 | 90 min (on-call alert) | 15 min |
| Missing routes | 3 | 20 min (error monitoring) | 10 min |
| Total user impact | 9 incidents | ~4 hrs avg | n/a |
After smoke tests (next 6 months, same team):
| Incident Type | Count | Avg Time to Detect | Avg Time to Fix |
|---|---|---|---|
| Broken ENV vars | 2 | 45 sec (smoke test) | 5 min (rollback) |
| DB connection lost | 1 | 30 sec (smoke test) | 5 min (rollback) |
| Missing routes | 0 | n/a | n/a |
| Total user impact | 0 user-facing incidents | ~40 sec avg | n/a |
The smoke tests cost me a weekend to build. The ROI was immediate.
Common Pitfalls (Learn from My Mistakes!)
Mistake #1: Testing Too Much
Bad smoke test (takes 5 minutes):

```
# Tests 200 endpoints
# Runs database migrations check
# Downloads test fixtures
# Validates every API response schema
# Full end-to-end user journey
```

Good smoke test (takes 45 seconds):

```
# Tests 8 critical endpoints
# Checks health endpoint (includes DB/cache)
# Verifies auth flow exists
# Done!
```
Smoke tests should be fast and focused. If they take more than 2 minutes, they're not smoke tests; they're integration tests pretending to be smoke tests.
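One way to keep that budget honest is to enforce it mechanically with `timeout` from GNU coreutils, so a hung check fails the pipeline fast instead of stalling it. A sketch using `sleep` as a stand-in for the suite; in the pipeline, the wrapped command would be `./scripts/smoke-test.sh "$APP_URL"` from earlier:

```shell
# timeout kills the command if it overruns; exit code 124 means "timed out"
if timeout 120 sleep 1; then            # stand-in for ./scripts/smoke-test.sh
  echo "suite finished within budget"
else
  [ $? -eq 124 ] && echo "suite exceeded the budget" || echo "suite failed"
fi
```
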
Mistake #2: Not Retrying
Containers take time to start. Load balancers take time to route. Don't fail on the first 503.
Bad:
```bash
status=$(curl -s -o /dev/null -w "%{http_code}" "$URL/health")
if [ "$status" != "200" ]; then exit 1; fi  # Fails if pod isn't ready yet!
```
Good:
```bash
for i in {1..5}; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "$URL/health")
  [ "$status" = "200" ] && break
  echo "Retry $i/5..."
  sleep 10
done
```
Docker taught me the hard way: New containers need 10-30 seconds to be healthy. Give them time.
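"Give them time" generalizes into a small polling helper that retries a probe until it succeeds or a deadline passes. `wait_for` below is a hypothetical helper, and the probe is stubbed with `true`; in CI the probe would be something like `curl -fsS "$APP_URL/health"` or `kubectl rollout status deployment/api`.

```shell
# wait_for SECONDS CMD...: retry CMD once a second until it succeeds,
# or return 1 once SECONDS have elapsed.
wait_for() {
  local deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
}

# Stubbed probe; in CI: wait_for 30 curl -fsS "$APP_URL/health" >/dev/null
wait_for 5 true && echo "service ready"
```
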
Mistake #3: Not Testing the Right URL
```yaml
# Wrong: tests the staging URL (oops!)
- run: ./smoke-test.sh https://staging.myapp.com

# Right: tests what was ACTUALLY deployed
- run: ./smoke-test.sh ${{ vars.PRODUCTION_URL }}
```
I've done this. We tested staging, celebrated, and production was on fire.
Mistake #4: Skipping the Rollback
```yaml
# Bad: tests but just notifies on failure
- name: Smoke test
  run: ./smoke-test.sh $PROD_URL
  continue-on-error: true  # NO! This defeats the purpose!

# Good: roll back on failure!
- name: Rollback on failure
  if: failure()
  run: kubectl rollout undo deployment/api
```
What's the point of knowing your deploy broke if you don't fix it?
The Minimum Viable Smoke Test
If you're starting from zero, here's the simplest possible smoke test:
```bash
#!/bin/bash
# smoke-test-minimal.sh
URL="${1:-https://api.myapp.com}"

echo "Testing $URL..."

for i in {1..5}; do
  status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$URL/health")
  if [ "$status" = "200" ]; then
    echo "✅ Health check passed!"
    exit 0
  fi
  echo "Attempt $i failed (HTTP $status), retrying..."
  sleep 10
done

echo "❌ Health check failed after 5 attempts!"
exit 1
```
Add this to your pipeline today. It takes 10 minutes to set up and will save you hours of pain.
TL;DR
After 7+ years deploying production applications to AWS, this is what I know about smoke tests:
- Smoke tests = fast sanity checks after every deploy (not full test suites)
- 60 seconds max; if longer, trim scope
- Auto-rollback on failure; the whole point is catching it before users do
- Build a real /health endpoint, not just a ping
- Alert your team; even automatic rollbacks need human awareness
The question isn't "should I build smoke tests?" It's "how many more Black Fridays are you willing to ruin without them?"
Still finding out your deploys are broken from angry users? Connect with me on LinkedIn: I've set up these pipelines for Laravel, Node.js, and containerized apps across a dozen production environments.
Want the full smoke test template? Check my GitHub for the full scripts, GitHub Actions workflows, and health check implementations.
Now go protect your deploys before your users become your QA team.
P.S. The Friday night deploy that triggered this post? We still call it "The Incident" two years later. My teammates mention it every time someone wants to skip smoke tests. Some lessons stick.