Production-Ready Dockerfiles: Stop Writing FROM ubuntu and Praying ðģðĨ
Production-Ready Dockerfiles: Stop Writing FROM ubuntu and Praying ðģðĨ
Honest confession: My first production Dockerfile was literally:
FROM ubuntu
RUN apt-get install -y nodejs
COPY . .
CMD ["node", "server.js"]
It worked locally. It worked in staging. In production? Container kept OOM-killing itself every 48 hours. I spent a week thinking we had a Node.js memory leak. Docker taught me the hard way that "it runs in a container" and "it's production-ready" are two very different things.
Seven years of deployments later, here's everything I wish someone had told me.
The Dockerfile That Ruins Your Weekend ð
You know this container. We've all shipped it:
FROM node:latest
WORKDIR /app
COPY . .
RUN npm install
EXPOSE 3000
CMD ["node", "server.js"]
Looks fine, right? Let me count the ways this will ruin your Friday night:
node:latestmeans roulette - next pull might be a different major versionnpm installinstead ofnpm ci- your lockfile? Decorative- Copying the entire directory including
.git,node_modules,.env - No non-root user - your container runs as root like it's 1995
- No health check - your orchestrator assumes it's healthy until it's catastrophically not
- No resource limits - one memory leak and the whole host is toast
Let me show you what a production Dockerfile actually looks like.
The Production-Ready Dockerfile ð
Here's the one I use for Node.js services in production:
# Pin the EXACT version. No surprises on next pull.
FROM node:20.11.0-alpine3.19 AS base
# ââ Dependencies stage ââââââââââââââââââââââââââââââââââââââ
FROM base AS deps
WORKDIR /app
# Copy only what npm needs first (better layer caching!)
COPY package.json package-lock.json ./
RUN npm ci --only=production && npm cache clean --force
# ââ Build stage âââââââââââââââââââââââââââââââââââââââââââââ
FROM base AS builder
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
# ââ Production image âââââââââââââââââââââââââââââââââââââââââ
FROM base AS runner
WORKDIR /app
# Don't run as root. Ever.
RUN addgroup --system --gid 1001 nodejs && \
adduser --system --uid 1001 appuser
# Copy only what you actually need
COPY --from=deps --chown=appuser:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:nodejs /app/dist ./dist
USER appuser
# Tell your orchestrator what port to expose (documentation + tooling)
EXPOSE 3000
# Health check so orchestrators know when you're ACTUALLY ready
HEALTHCHECK --interval=30s --timeout=10s --start-period=15s --retries=3 \
CMD wget -qO- http://localhost:3000/health || exit 1
# Use exec form, not shell form. Signals work correctly.
CMD ["node", "dist/server.js"]
That's a 2-stage build with a dedicated deps layer, non-root user, and a real health check. Let me break down why each part matters.
The Four Changes That Saved My Production Clusters âïļ
1. Pin Your Base Image (Seriously, Stop Using latest) ð
Before:
FROM node:latest
After:
FROM node:20.11.0-alpine3.19
After countless deployments where "we didn't change anything" but the container behaved differently, I traced it back to an upstream image update. latest is a lie - it changes under you silently.
Use alpine variants when possible. node:20-alpine is ~50MB vs ~350MB for node:20. Your pull times and storage costs will thank you.
Lock it with a digest for maximum paranoia:
FROM node:20.11.0-alpine3.19@sha256:your-digest-here
Now no upstream registry change can touch you. Not even a compromised image tag.
2. The Non-Root User That Keeps You Out of Incident Reports ð
# Create system group and user
RUN addgroup --system --gid 1001 nodejs && \
adduser --system --uid 1001 appuser
# Set permissions before switching
COPY --chown=appuser:nodejs /app/dist ./dist
# Switch to non-root
USER appuser
A CI/CD pipeline that saved our team: We had a path traversal vulnerability in an API. Because the container ran as non-root, the attacker couldn't read /etc/passwd or escalate privileges. The blast radius was tiny. If we'd been running as root? Much worse story.
Running containers as root is like giving your delivery driver the master key to every apartment in the building. Just... don't.
3. HEALTHCHECK: The Feature Everyone Skips ðĨ
HEALTHCHECK --interval=30s \
--timeout=10s \
--start-period=15s \
--retries=3 \
CMD wget -qO- http://localhost:3000/health || exit 1
Without a HEALTHCHECK, Docker and Kubernetes think your container is healthy the moment the process starts. Your Node.js app needs 10 seconds to connect to the database? Too bad - traffic is already hitting it.
The flags explained:
--interval=30s- Check every 30 seconds--timeout=10s- If the check takes longer than 10s, it fails--start-period=15s- Give the container 15s to boot before counting failures--retries=3- 3 consecutive failures = unhealthy
Your /health endpoint should check actual dependencies:
// Express health endpoint
app.get('/health', async (req, res) => {
try {
// Don't just return 200. Check what matters.
await db.raw('SELECT 1');
await redis.ping();
res.json({
status: 'healthy',
uptime: process.uptime(),
timestamp: new Date().toISOString()
});
} catch (error) {
res.status(503).json({
status: 'unhealthy',
error: error.message
});
}
});
A health check that just returns 200 is like a restaurant that says "yes, we're open!" but the kitchen burned down.
4. Resource Limits: Because Hope Is Not a Strategy ð
The Dockerfile itself can't set resource limits - that's your orchestration layer. But you need them.
Docker Compose:
services:
api:
image: myapp:latest
deploy:
resources:
limits:
cpus: '0.5'
memory: 512M
reservations:
memory: 256M
restart: unless-stopped
Kubernetes:
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
Without limits, one runaway process can eat the entire host. I've watched a single container with a memory leak take down 6 other services on the same EC2 instance. Setting limits is the container equivalent of having circuit breakers in your electrical panel.
The .dockerignore File You Keep Forgetting ð
Every project needs this. Seriously, check if yours has one right now.
# Version control
.git
.gitignore
# Dependencies (you're installing these fresh)
node_modules
# Environment files (NEVER ship these)
.env
.env.*
*.local
# Development tools
.eslintrc*
.prettierrc*
jest.config.*
*.test.js
*.spec.js
# Build artifacts
dist
build
coverage
# Editor files
.vscode
.idea
*.swp
# Docker files (no infinite loops)
Dockerfile*
docker-compose*
.dockerignore
# Docs (don't need them in the container)
README.md
docs/
Without .dockerignore, you're shipping your entire Git history, your .env file with production secrets, your node_modules (which then gets overwritten anyway), and your test files. I've seen production containers with .git directories inside them. Don't be that person.
Before vs After: The Numbers ð
Here's a real comparison from a Laravel API I containerized:
| Metric | Naive Dockerfile | Production Dockerfile |
|---|---|---|
| Image size | 1.2 GB | 187 MB |
| Build time (cold) | 4m 12s | 1m 48s |
| Build time (cached) | 3m 50s | 23s |
| Startup time | 8s | 8s |
| Unhealthy detection | Never | 90s max |
| Running as root? | Yes ðĻ | No â |
The 6x smaller image isn't just aesthetics. It's faster pulls, faster cold starts in ECS/Fargate, lower ECR storage costs, and a smaller attack surface. Every MB is a liability.
Common Pitfalls to Avoid ðŠĪ
Pitfall #1: Using ADD when you should use COPY
# Bad - ADD has hidden magic (URL fetching, auto-extraction)
ADD ./src /app/src
# Good - COPY is explicit and predictable
COPY ./src /app/src
Pitfall #2: Shell form vs exec form for CMD
# Bad - runs as /bin/sh -c, signals go to shell, not your app
CMD "node server.js"
# Good - signals reach your process directly (graceful shutdown works!)
CMD ["node", "server.js"]
When ECS or Kubernetes sends SIGTERM to stop your container gracefully, shell form eats the signal. Your app never gets it. 15 seconds later, you get SIGKILL and your in-flight requests die hard.
Pitfall #3: Installing dev dependencies in production
# Bad
RUN npm install # Installs devDependencies too
# Good
RUN npm ci --only=production
Pitfall #4: Combining RUN commands unnecessarily
# Bad - each RUN is a layer, but unnecessary splitting hurts nothing here
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*
# Good - combine related commands to minimize layers AND properly clean up
RUN apt-get update && \
apt-get install -y --no-install-recommends curl && \
rm -rf /var/lib/apt/lists/*
Without that cleanup, the apt cache stays in the layer forever.
TL;DR - The Dockerfile Checklist â
Before you ship any container to production:
- Pin exact base image version (no
latest) - Use Alpine or slim variants when possible
- Write a
.dockerignore(check it exists right now) - Use
npm cinotnpm install - Create non-root user and switch to it before CMD
- Add HEALTHCHECK that tests actual dependencies
- Use exec form for CMD/ENTRYPOINT
["node", "server.js"] - Set resource limits at the orchestration layer
- Combine RUN commands for the same concern
- Never COPY
.envfiles into images
A Dockerfile is a deployment contract. Every line either adds security and reliability or takes it away. The containerized vulnerabilities I've seen in production almost all trace back to a "quick" Dockerfile that was never hardened.
Spend 30 minutes on your Dockerfile now. Save yourself a Saturday night incident response later.
Running containerized apps in production? Connect with me on LinkedIn - I love talking shop about deployment patterns that actually hold up under pressure.
Want to see a real production setup? My GitHub has working examples from projects I've shipped.
Now go check your running containers. How many are running as root? I'll wait. ðģ