AWS ECS: Docker in Production Without the Kubernetes Therapy Bills š³
AWS ECS: Docker in Production Without the Kubernetes Therapy Bills š³
Hot take: Not every team needs Kubernetes. There. I said it.
I spent three months trying to set up Kubernetes for a Laravel API that served maybe 50k requests per day. I configured pods, wrote Helm charts, debugged ingress controllers at 2 AM, and questioned every career decision I'd ever made.
Then a senior architect looked at my setup and said: "Why didn't you just use ECS?"
Me: "...what's ECS?"
Her: sighs with the weight of someone who has had this exact conversation 40 times before
Welcome to AWS Elastic Container Service ā the Docker-in-production tool that's been hiding in plain sight, quietly solving the "I want containers but not the existential crisis" problem.
What Even Is ECS? š¤
ECS is Amazon's managed container orchestration service. Think of it as Docker Compose, but production-grade and managed by AWS.
You tell ECS:
- "Here's my Docker image"
- "Run 3 copies of it"
- "Give each one 512MB RAM and 0.5 vCPU"
- "Restart them if they crash"
- "Route traffic through this load balancer"
ECS says: "Got it!" and handles literally everything else.
No YAML manifests for every resource type. No ETCD clusters. No control plane upgrades at the worst possible moment. Just containers, running, doing their job.
The Time ECS Saved My On-Call Rotation ā°
Early in my career, we ran a Laravel application on EC2 instances. Deployment was... artisanal:
# Our "deployment process" circa 3 years ago
ssh [email protected]
cd /var/www/html
git pull origin main
composer install --no-dev
php artisan migrate
sudo systemctl restart php-fpm
sudo systemctl restart nginx
# Repeat for every server
# Pray nothing breaks
# Cry when it does
We had 4 EC2 instances. Every deployment was 4 SSH sessions, 4 sets of commands, hoping each one matched. Instance 3 once ran a different version for six days before anyone noticed. Six days!
When we migrated to ECS, deployment became this:
# Build and push image
docker build -t api:v1.2.3 .
docker push 123456789.dkr.ecr.ap-south-1.amazonaws.com/api:v1.2.3
# Update ECS service (or let GitHub Actions do it)
aws ecs update-service \
--cluster production \
--service api \
--force-new-deployment
# That's it
# All tasks updated
# Rolling deployment
# Same version everywhere
Every container. Same image. Same version. Guaranteed. šÆ
ECS Core Concepts (The Only Ones That Matter) š§©
You need to understand exactly four things:
1. Task Definition ā Your Docker Run Config as YAML
{
"family": "api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123:role/ecsTaskExecutionRole",
"containerDefinitions": [
{
"name": "api",
"image": "123456789.dkr.ecr.ap-south-1.amazonaws.com/api:v1.2.3",
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"environment": [
{ "name": "APP_ENV", "value": "production" }
],
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:ap-south-1:123:secret:prod/db"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/api",
"awslogs-region": "ap-south-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3
}
}
]
}
Think of Task Definition as: The recipe. It describes what container to run and how.
2. Task ā One Running Instance of Your Container
Think of Task as: A single plate of food made from the recipe. ECS can run many tasks from one task definition.
3. Service ā "Keep N Tasks Running Always"
# Create a service that keeps 3 tasks running at all times
aws ecs create-service \
--cluster production \
--service-name api \
--task-definition api:42 \
--desired-count 3 \
--launch-type FARGATE \
--load-balancers "targetGroupArn=arn:aws:...,containerName=api,containerPort=8080"
Think of Service as: The restaurant saying "always have 3 plates ready." If one breaks (task crashes), ECS automatically makes another.
4. Cluster ā The VPC for Your Containers
Think of Cluster as: The restaurant building itself. All your services and tasks live inside it.
That's it. Task Definition ā Task ā Service ā Cluster. Four concepts, and you're running containers in production. š
ECS Fargate vs EC2 Launch Type: The Decision You Actually Need š¤·
ECS has two modes:
Fargate (Serverless Containers)
# You provision: Nothing
# AWS manages: Everything underneath
# You pay for: Exact CPU + Memory your tasks use
# Perfect for:
# - Don't want to manage servers
# - Variable or unpredictable load
# - Team without dedicated infrastructure engineers
# - Starting out with containers
# Not great for:
# - Predictable high-volume workloads (EC2 is cheaper)
# - GPU workloads
# - Windows containers with special requirements
EC2 Launch Type
# You provision: EC2 instances in your cluster
# AWS manages: Container placement
# You pay for: The EC2 instances (even if idle)
# Perfect for:
# - High, predictable traffic (Reserved Instances = huge savings)
# - GPU workloads
# - Need specific instance types
# - Want to squeeze out maximum cost efficiency
# Not great for:
# - Spiky or unpredictable load
# - Small teams without ops experience
My rule: Start with Fargate. Switch to EC2 when your AWS bill starts making you uncomfortable and you actually know your traffic patterns.
After countless deployments on both, Fargate wins for most teams at most stages. The operational overhead of EC2 launch type is real.
The GitHub Actions Pipeline That Changed Everything āļø
Here's the CI/CD pipeline we actually use for ECS deployments. Docker taught me the hard way that a pipeline without proper caching is a pipeline that makes you wait 12 minutes for every change:
# .github/workflows/deploy.yml
name: Deploy to ECS
on:
push:
branches: [main]
env:
AWS_REGION: ap-south-1
ECR_REPOSITORY: api
ECS_CLUSTER: production
ECS_SERVICE: api
CONTAINER_NAME: api
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
- name: Build, tag, and push image to ECR
id: build-image
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
# Build with cache from previous runs (Docker taught me this the hard way)
docker build \
--cache-from $ECR_REGISTRY/$ECR_REPOSITORY:latest \
--tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
--tag $ECR_REGISTRY/$ECR_REPOSITORY:latest \
.
docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest
echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
- name: Download current task definition
run: |
aws ecs describe-task-definition \
--task-definition api \
--query taskDefinition \
> task-definition.json
- name: Update image in task definition
id: task-def
uses: aws-actions/amazon-ecs-render-task-definition@v1
with:
task-definition: task-definition.json
container-name: ${{ env.CONTAINER_NAME }}
image: ${{ steps.build-image.outputs.image }}
- name: Deploy to ECS
uses: aws-actions/amazon-ecs-deploy-task-definition@v1
with:
task-definition: ${{ steps.task-def.outputs.task-definition }}
service: ${{ env.ECS_SERVICE }}
cluster: ${{ env.ECS_CLUSTER }}
wait-for-service-stability: true
# This waits until the new tasks are healthy
# and the old ones are drained
# Zero-downtime rolling deployment!
What happens on every git push main:
- GitHub Actions builds the Docker image
- Pushes it to ECR (Amazon's private Docker registry)
- Updates the ECS task definition with the new image
- Triggers a rolling deployment
- Waits until all new tasks are healthy
- Old tasks get drained gracefully
- New version is live ā
Zero SSH. Zero manual commands. Zero "wait, which server did I deploy to?"
Auto Scaling That Actually Works š
Before ECS, scaling meant: "SSH into the server, run some commands, SSH into another server, run more commands, hope you didn't miss one."
After ECS:
# Register scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/production/api \
--min-capacity 2 \
--max-capacity 20
# Scale up when CPU > 70%
aws application-autoscaling put-scaling-policy \
--policy-name "api-cpu-scale-up" \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/production/api \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
The magic: Under load? ECS spins up more containers. Traffic drops? Scales back down. You pay only for what you use.
A CI/CD pipeline that saved our team real money: before auto-scaling, we over-provisioned for peak load 24/7. After ECS auto-scaling, we cut our container compute bill by 40% while handling 3x the traffic during peaks. Not bad for a config file.
The Mistakes That Cost Me Sleep šØ
Mistake #1: Not Setting Resource Limits
What I did:
{
"cpu": "0",
"memory": "0"
}
// "Let's just let it use what it needs!" ā past me, about to learn a lesson
What happened: A memory leak in a worker task consumed all available memory on the host, crashing other containers alongside it.
What you should do:
{
"cpu": "512",
"memory": "1024",
"memoryReservation": "512"
}
// Hard limit: 1GB. Soft reservation: 512MB.
// Task gets killed before it eats everything.
// Lesson learned the painful way.
Mistake #2: Not Configuring Health Checks
What I did: No health check. Task was "running" but the app inside had crashed. ECS happily kept routing traffic to a dead container.
# Users saw: 502 Bad Gateway
# ECS showed: Task status: RUNNING ā
# Reality: App had been dead for 20 minutes
What you should do:
{
"healthCheck": {
"command": [
"CMD-SHELL",
"curl -sf http://localhost:8080/health || exit 1"
],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
ECS replaces unhealthy tasks automatically. Your health endpoint becomes your production guardian. š”ļø
Mistake #3: Storing Secrets as Environment Variables in Task Definitions
What I did:
{
"environment": [
{ "name": "DB_PASSWORD", "value": "supersecret123" }
]
}
// Visible in AWS Console. Visible in task definition history.
// Visible in CloudTrail. Visible to everyone with console access. š±
What you should do:
{
"secrets": [
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:ap-south-1:123:secret:prod/db-password"
}
]
}
// Secret stays in Secrets Manager
// ECS injects it at runtime
// No plaintext in your config
// Audit trail on who accessed what
Mistake #4: Deploying Without Waiting for Stability
What I did:
aws ecs update-service --cluster prod --service api --force-new-deployment
# Immediately ran smoke tests
# "All green!" ā me, 30 seconds before the old tasks finished draining
# Users were still hitting the old broken version
# "Why are users still seeing the bug?!"
What you should do:
aws ecs update-service \
--cluster prod \
--service api \
--force-new-deployment
# Wait for actual stability!
aws ecs wait services-stable \
--cluster prod \
--services api
echo "Now run your smoke tests"
Or just use the GitHub Actions action with wait-for-service-stability: true.
ECS vs Kubernetes: The Honest Comparison š„
| What You Need | ECS Wins | Kubernetes Wins |
|---|---|---|
| Time to first deployment | ā Hours | ā Days/weeks |
| Small team, no DevOps specialist | ā Yes | ā Hard |
| AWS-only infrastructure | ā Native | Possible |
| Multi-cloud portability | ā Vendor lock-in | ā Portable |
| Complex networking (service mesh) | Limited | ā Istio/Linkerd |
| Custom schedulers | ā No | ā Yes |
| Large scale (1000+ services) | Possible | ā Better |
| Operational simplicity | ā Much simpler | Complexity city |
My honest take after 7 years: If you're on AWS and don't need the advanced features of Kubernetes, ECS gives you 80% of the value with 20% of the complexity. For most product teams, that's a great trade.
TL;DR: When Should You Use ECS? šÆ
Use ECS when:
- You're already on AWS (or going there)
- Your team doesn't have a dedicated Kubernetes expert
- You want managed container orchestration without managing control planes
- You're migrating from EC2/bare metal and want the simplest path
- Your application runs as one to a few dozen services
Skip ECS when:
- You need multi-cloud portability
- You have hundreds of microservices with complex inter-service networking
- You need advanced scheduling features
- You already have Kubernetes expertise on the team
The real lesson I learned: Don't reach for Kubernetes because it sounds impressive. Reach for ECS because it solves your actual problem. After countless deployments, the best infrastructure is the one your team understands and can operate at 3 AM without consulting a 400-page manual.
Your containers deserve a stable home. ECS might just be that home. š
Running ECS in production? I'd love to hear your setup ā connect with me on LinkedIn!
Check out my deployment configs: GitHub
Now go containerize something and let ECS worry about keeping it alive. š³š