AWS ECR + GitHub Actions: The Docker Image Pipeline That Runs Itself š³š
AWS ECR + GitHub Actions: The Docker Image Pipeline That Runs Itself š³š
Hot take: If you're still running docker push from your laptop, you are one coffee spill away from shipping a broken image to production.
I've done it. Pushed the wrong tag. Pushed an image built against main when the hotfix was on a branch. Pushed an unscanned image with a critical CVE and found out at 2 AM. Docker taught me the hard way that local pushes are a liability, not a workflow.
After countless deployments ā and a few memorable disasters ā I built a GitHub Actions pipeline that handles everything: build, scan, tag with a Git SHA, push to AWS ECR, and trigger a rolling deploy. No more manual steps. No more "which version is in prod?" confusion. Just commit and ship.
Here's the full setup.
Why AWS ECR and Not Docker Hub? š¤
Docker Hub is fine until it isn't:
| Docker Hub (Free) | AWS ECR | |
|---|---|---|
| Pull rate limits | 100 pulls/6hr (anonymous) | Unlimited within AWS |
| Private repos | 1 free | Unlimited |
| Image scanning | Manual, paid | Built-in with ECR Enhanced Scanning |
| IAM auth | ā | ā (no long-lived credentials!) |
| Same-region pulls | Slower, external | Free + fast |
| Cost | Free tier limited | ~$0.10/GB/month |
When deploying to ECS or EKS, ECR images pull from the same AWS region. No egress costs. No rate limits. IAM-based auth that rotates automatically. It's the obvious choice once you're on AWS ā yet I kept using Docker Hub out of habit for way too long. š
Step 1: Create Your ECR Repository āļø
First, create the repo. You can click around the console or do it properly:
# Create the repository
aws ecr create-repository \
--repository-name my-api \
--region ap-south-1 \
--image-scanning-configuration scanOnPush=true \
--encryption-configuration encryptionType=AES256
# Note the repositoryUri in the output:
# 123456789.dkr.ecr.ap-south-1.amazonaws.com/my-api
Enable lifecycle policies immediately ā or you'll wake up to a bill for 500 untagged image layers:
{
"rules": [
{
"rulePriority": 1,
"description": "Keep last 10 production images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["prod-"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": { "type": "expire" }
},
{
"rulePriority": 2,
"description": "Expire untagged images after 1 day",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 1
},
"action": { "type": "expire" }
}
]
}
aws ecr put-lifecycle-policy \
--repository-name my-api \
--lifecycle-policy-text file://lifecycle.json
A CI/CD pipeline that saved our team from a $200/month ECR bill: lifecycle policies. Set them before you forget. ā ļø
Step 2: Create an IAM Role for GitHub Actions š
Do NOT use an IAM user with long-lived access keys. I see this everywhere and it terrifies me. One leaked key in a commit = compromised pipeline.
Use OIDC federation instead ā GitHub's identity provider talks directly to AWS:
# Create the OIDC provider (one-time setup)
aws iam create-open-id-connect-provider \
--url https://token.actions.githubusercontent.com \
--client-id-list sts.amazonaws.com \
--thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
Then create an IAM role with a trust policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
},
"StringLike": {
"token.actions.githubusercontent.com:sub": "repo:YOUR_ORG/YOUR_REPO:*"
}
}
}
]
}
And attach a minimal permission policy ā only what the pipeline needs:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ecr:PutImage",
"ecr:InitiateLayerUpload",
"ecr:UploadLayerPart",
"ecr:CompleteLayerUpload",
"ecr:DescribeImages",
"ecr:ListImages"
],
"Resource": "arn:aws:ecr:ap-south-1:123456789:repository/my-api"
}
]
}
Notice that second Resource is scoped to one specific repo. Not *. Minimal blast radius if something goes wrong. š”ļø
Step 3: The GitHub Actions Workflow š¤
Here's the full pipeline. Read through it ā every section has a purpose:
# .github/workflows/deploy.yml
name: Build and Deploy
on:
push:
branches: [main]
pull_request:
branches: [main]
env:
AWS_REGION: ap-south-1
ECR_REPOSITORY: my-api
permissions:
id-token: write # Required for OIDC
contents: read
jobs:
build-and-push:
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
image-digest: ${{ steps.build.outputs.digest }}
steps:
- name: Checkout code
uses: actions/checkout@v4
# Configure AWS credentials via OIDC ā no secrets stored!
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/github-actions-ecr
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v2
# Set up Docker Buildx for layer caching
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
# Generate tags: branch-sha for PRs, latest + sha for main
- name: Docker metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ steps.login-ecr.outputs.registry }}/${{ env.ECR_REPOSITORY }}
tags: |
type=ref,event=branch
type=sha,prefix=sha-
type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
type=raw,value=prod-${{ github.sha }},enable=${{ github.ref == 'refs/heads/main' }}
# Build with layer caching (dramatically speeds up CI)
- name: Build and push Docker image
id: build
uses: docker/build-push-action@v5
with:
context: .
push: ${{ github.ref == 'refs/heads/main' }} # Only push on main!
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gist,scope=${{ github.ref_name }}
cache-to: type=gist,mode=max,scope=${{ github.ref_name }}
build-args: |
BUILD_SHA=${{ github.sha }}
BUILD_TIME=${{ github.event.head_commit.timestamp }}
# Separate job: scan AFTER building, BEFORE deploying
security-scan:
needs: build-and-push
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/github-actions-ecr
aws-region: ${{ env.AWS_REGION }}
- name: Wait for ECR scan to complete
run: |
echo "Waiting for image scan results..."
sleep 30
FINDINGS=$(aws ecr describe-image-scan-findings \
--repository-name ${{ env.ECR_REPOSITORY }} \
--image-id imageTag=prod-${{ github.sha }} \
--query 'imageScanFindings.findingSeverityCounts' \
--output json)
echo "Scan results: $FINDINGS"
CRITICAL=$(echo $FINDINGS | jq -r '.CRITICAL // 0')
HIGH=$(echo $FINDINGS | jq -r '.HIGH // 0')
if [ "$CRITICAL" -gt "0" ]; then
echo "ā Found $CRITICAL CRITICAL vulnerabilities! Blocking deploy."
exit 1
fi
if [ "$HIGH" -gt "5" ]; then
echo "ā ļø Found $HIGH HIGH vulnerabilities. Review before deploying."
exit 1
fi
echo "ā
Security scan passed. Proceeding to deploy."
# Deploy only after scan passes
deploy:
needs: [build-and-push, security-scan]
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
environment: production # Requires manual approval in GitHub settings!
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/github-actions-ecr
aws-region: ${{ env.AWS_REGION }}
- name: Deploy to ECS
run: |
# Force new deployment ā ECS will pull the latest image
aws ecs update-service \
--cluster production \
--service my-api \
--force-new-deployment \
--query 'service.deployments[0].status' \
--output text
- name: Wait for deployment to stabilize
run: |
aws ecs wait services-stable \
--cluster production \
--services my-api
echo "ā
Deployment complete and healthy!"
What makes this pipeline special:
- ā OIDC auth ā zero long-lived credentials stored in GitHub
- ā
Only pushes images on
mainā PRs build but don't pollute ECR - ā Scans BEFORE deploying ā no CVEs sneaking into prod
- ā Uses GitHub Environments with manual approval gate
- ā Layer caching baked in ā builds that were 8 minutes are now 2
The Deployment Horror Story That Prompted All This š
Picture it: December 2022, last Friday before the holiday freeze.
I merged a hotfix, opened my terminal, ran docker build, docker push, kubectl set image... and everything looked fine. Monitoring was green. Users were happy.
Monday morning: Pagerduty fires. 40% of requests erroring. Turns out the image I built had a dependency conflict because I had a different node_modules checkout locally than CI would have produced. My laptop had an older lock file. The image worked fine in my test, failed horribly in prod.
The fix took 3 minutes. The post-mortem took 3 hours.
That Monday, I sat down and built the pipeline above. The rule became: no human being touches docker push in production. Ever. The pipeline is the only entity with permission to push ā and it runs from a clean checkout every time.
No more "works on my machine" because the build doesn't happen on my machine anymore. šÆ
Tagging Strategy: Stop Using latest for Everything š·ļø
latest is a lie. It doesn't mean the latest tested, stable, production-ready image. It means "whatever was pushed most recently." In practice, it means chaos.
My tagging strategy after setting up ECR:
# What gets pushed for every commit to main:
prod-a3f8c21 # Immutable: the specific SHA
sha-a3f8c21 # Same, different prefix for tooling
latest # Updated ā but you DON'T deploy this tag
# What ECS/Kubernetes actually runs:
prod-a3f8c21 # Always deploy the SHA tag, never latest
Why deploy by SHA and not latest?
# Bad: What does this actually run? Nobody knows!
image: 123456789.dkr.ecr.ap-south-1.amazonaws.com/my-api:latest
# Good: Exactly this commit, pinned forever
image: 123456789.dkr.ecr.ap-south-1.amazonaws.com/my-api:prod-a3f8c21
When an incident happens at 2 AM, you want to know exactly what code is running. A SHA tag gives you that. latest gives you a mystery. šµļø
Bonus: Rollback in 60 Seconds āŖ
The best part of SHA-tagged images? Rollback is instant:
# What commit is currently deployed?
aws ecs describe-tasks \
--cluster production \
--tasks $(aws ecs list-tasks --cluster production --service-name my-api --query 'taskArns[0]' --output text) \
--query 'tasks[0].containers[0].image' \
--output text
# Output: 123456789.dkr.ecr.ap-south-1.amazonaws.com/my-api:prod-a3f8c21
# Roll back to last known good SHA:
aws ecs update-service \
--cluster production \
--service my-api \
--task-definition my-api:PREVIOUS_REVISION \
--force-new-deployment
# Wait for it:
aws ecs wait services-stable --cluster production --services my-api
echo "Rolled back! ā
"
No rebuilding. No redeployment from scratch. The old image is still in ECR (lifecycle policy keeps the last 10 prod tags). Pull it, run it, done. A CI/CD pipeline that saved our team from prolonged incidents more than once: immutable image tags + rollback by revision. š
Common Pitfalls to Avoid šŖ¤
Pitfall #1: Pushing on every branch push
# Bad ā pollutes ECR with images from every feature branch
on:
push:
# Good ā only push images you might actually deploy
push: ${{ github.ref == 'refs/heads/main' }}
Pitfall #2: Storing AWS credentials as GitHub secrets
# Bad ā rotates manually, leaks if secrets are exposed
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
# Good ā short-lived token via OIDC, auto-rotates
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::ACCOUNT:role/github-actions-ecr
Pitfall #3: Skipping the scan because "it slows down CI"
Yes, it adds 60 seconds. A production breach costs significantly more. Scan everything. Block on criticals. š”ļø
Pitfall #4: Not setting lifecycle policies immediately
ECR charges $0.10/GB/month. A Node.js image is 200MB. If your CI runs 50 times a day for 6 months, you've got 9,000 images Ć 200MB = 1.8TB in ECR. That's $180/month of forgotten images. Set lifecycle policies on day one. šø
The Before/After Reality Check š
Before the pipeline:
# My Friday deploy ritual:
git pull # Hope nobody pushed since I branched
npm run build # 6 minutes on my MacBook
docker build -t myapp:latest . # 8 minutes
docker push myapp:latest # 4 minutes, upload varies
kubectl set image deployment/myapp myapp=myapp:latest # Deploys "latest"
# TOTAL: 18-25 minutes
# RELIABILITY: Depends on my laptop's mood
# ROLLBACK: Pray the old "latest" is still somewhere
After the pipeline:
git push origin main
# CI takes over:
# Build: 2 min (layer cache)
# Scan: 1 min
# Push to ECR: 45 sec
# ECS deploy: 3 min (rolling update)
# TOTAL: ~7 minutes, unattended
# RELIABILITY: Identical build environment every time
# ROLLBACK: aws ecs update-service --task-definition previous:revision
After countless deployments, I learned: automation isn't laziness. It's the only way to stay sane at scale. The pipeline is more reliable than I am at 8 PM on a Friday.
TL;DR šÆ
- Create ECR repo with
scanOnPush=trueand lifecycle policies - Use OIDC ā never store IAM credentials in GitHub secrets
- Build on every commit, push only on
main - Tag images by Git SHA ā never deploy
latestto production - Scan before deploy ā block on critical/high CVEs
- Use
aws ecs wait services-stableā don't declare victory until the deployment is actually done
Your future self at 2 AM will thank you for building this today. š
Running this pipeline? Compare notes with me on LinkedIn ā I'm always curious how other teams handle image management at scale.
Check out real GitHub Actions configs on GitHub ā production battle-tested pipelines from real projects!
Now go set those lifecycle policies before you forget! š³