Terraform State Management: Stop Letting a JSON File Hold Your Infrastructure Hostage
True story: It was a Friday at 4:45 PM. A well-meaning developer ran terraform apply from their laptop, got confused by a merge conflict in terraform.tfstate, and did the thing you should never, ever do: deleted the state file to "start fresh."
By 5:30 PM, Terraform had no idea that our RDS instance, 3 EC2 boxes, and an ALB existed. It was ready to create new ones. While the originals were still running. Billing for doubles. With no routing to either.
I've never set up remote state so fast in my life.
What Even Is Terraform State?
Terraform state is a JSON file that maps your HCL configuration to real-world infrastructure. It's Terraform's memory. Without it, Terraform doesn't know what it already created; it just sees configuration and asks "should I make this?"
A simplified terraform.tfstate looks like:
{
  "version": 4,
  "terraform_version": "1.7.0",
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "api_server",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "attributes": {
            "id": "i-0abc123def456789",
            "ami": "ami-0c55b159cbfafe1f0",
            "instance_type": "t3.medium"
          }
        }
      ]
    }
  ]
}
That "id": "i-0abc123def456789" is the link between your code and the real EC2 instance. Lose that mapping, and Terraform is flying blind over your entire infrastructure.
The Local State Trap
By default, Terraform writes state to terraform.tfstate in your working directory. And for solo experiments on a personal project, that's fine. For anything else, it's a disaster waiting to happen.
Problems with local state:
- It's not shared: teammate A applies, teammate B applies, their states diverge. Now two people each think they're the source of truth. They're not.
- It's not backed up: disk dies, laptop gets wiped, dev deletes it (see: Friday story above).
- No locking: two people run terraform apply simultaneously. They're both writing to the same state. State corruption incoming.
- It ends up in git: I've reviewed PRs with terraform.tfstate committed. It contains resource IDs, ARNs, sometimes actual secret values. Lovely.
After countless deployments, I've learned: local state is fine for terraform plan learning sessions. It's not for anything touching production.
Remote State: The Fix That Takes 10 Minutes
Terraform's backend block moves your state file to a remote, shared, lockable location. S3 + DynamoDB is the AWS standard:
# backend.tf
terraform {
  backend "s3" {
    bucket  = "mycompany-terraform-state"
    key     = "production/api/terraform.tfstate"
    region  = "ap-south-1"
    encrypt = true
    # DynamoDB table for state locking
    dynamodb_table = "terraform-state-lock"
  }
}
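The S3 bucket holds the state file (encrypted, versioned). The DynamoDB table handles distributed locking: when one terraform apply runs, it puts a lock entry in DynamoDB. Any other apply that starts while the lock is held sees it and, by default, exits with a lock error instead of writing; pass -lock-timeout (for example -lock-timeout=5m) if you'd rather have it wait and retry. No more state corruption.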
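If you're on a newer Terraform release, the DynamoDB table may be optional. As far as I know, Terraform 1.10 added a use_lockfile argument to the S3 backend that keeps the lock as an object next to the state file, and later releases mark dynamodb_table as deprecated. A minimal sketch, assuming Terraform 1.10 or newer (it will not work on the 1.7.0 used elsewhere in this post, so check the S3 backend docs for your version first):

```hcl
# backend.tf (alternative sketch, assumes Terraform >= 1.10)
terraform {
  backend "s3" {
    bucket  = "mycompany-terraform-state"
    key     = "production/api/terraform.tfstate"
    region  = "ap-south-1"
    encrypt = true

    # S3-native locking: the lock lives in the bucket instead of DynamoDB.
    # Assumption: this argument is not available before Terraform 1.10.
    use_lockfile = true
  }
}
```

Either way the idea is the same: one writer at a time. Pick whichever locking mechanism your Terraform version supports and don't mix them casually across the team.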
Bootstrapping the backend itself (the one bit of chicken-and-egg you can't avoid):
# Create these resources FIRST, before adding the backend config
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"
  lifecycle {
    prevent_destroy = true # Do NOT let Terraform delete this
  }
}
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
}
Run terraform apply to create those resources, then add the backend block and run terraform init. Terraform will ask if you want to migrate local state to S3. Say yes.
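Since state can contain secret values, it's worth making sure the bucket can never be made public by accident. A minimal sketch, assuming it lives in the same bootstrap configuration and reuses the aws_s3_bucket.terraform_state resource defined there:

```hcl
# Assumption: added alongside the bootstrap resources, in the same config.
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  # Reject public ACLs and public bucket policies outright
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Your account may already enforce this at the account level, in which case the resource is redundant but harmless.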
State Key Strategy: Don't Put Everything in One File
A pattern that saved our team from a catastrophic mistake: separating state by environment AND by component.
The naive approach (what everyone starts with):
terraform.tfstate # Everything. All environments. All services. One file.
The disaster: One bad apply and you're rolling back infrastructure for your database, your API, your CDN, and your VPC all at once. State locking means nobody can touch anything until you're done.
The sane approach:
s3://mycompany-terraform-state/
  production/
    networking/terraform.tfstate    # VPC, subnets, security groups
    database/terraform.tfstate      # RDS, parameter groups
    application/terraform.tfstate   # ECS, ALB, autoscaling
    monitoring/terraform.tfstate    # CloudWatch, alarms
  staging/
    networking/terraform.tfstate
    database/terraform.tfstate
    application/terraform.tfstate
Now a botched staging deployment can't lock your production networking state. And when your DBA wants to modify the RDS configuration, they don't need to lock the entire application layer.
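In practice, each component directory gets its own backend block that differs only in the key. A sketch for the networking component, assuming a hypothetical networking/backend.tf file and the same bucket and lock table as before:

```hcl
# networking/backend.tf (hypothetical file layout)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate" # one key per component
    region         = "ap-south-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```

Sharing one lock table across components is fine here: locks are taken per state path, so a lock on networking doesn't block an apply against database.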
Use terraform_remote_state data sources to share outputs between state files:
# In application/main.tf - read outputs from networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "ap-south-1"
  }
}
resource "aws_ecs_service" "api" {
  name    = "api"
  cluster = aws_ecs_cluster.main.id
  network_configuration {
    # Pull the subnet IDs from networking state instead of hardcoding
    subnets         = data.terraform_remote_state.networking.outputs.private_subnet_ids
    security_groups = [aws_security_group.api.id]
  }
}
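The other half of this is easy to forget: terraform_remote_state only exposes root-level outputs, so the networking configuration has to declare private_subnet_ids explicitly. A sketch of what that might look like, assuming a hypothetical networking/outputs.tf and subnets defined as aws_subnet.private with count or for_each:

```hcl
# networking/outputs.tf (hypothetical file; aws_subnet.private is an
# assumed resource name, created with count or for_each)
output "private_subnet_ids" {
  description = "Private subnet IDs consumed by the application state"
  value       = [for subnet in aws_subnet.private : subnet.id]
}
```

If the output is missing, the application plan fails when it tries to read it, which is a much better failure mode than a hardcoded subnet ID that quietly goes stale.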
State Commands That Will Save You
Terraform state is not read-only. Sometimes you need to reach in and fix things:
# List all resources in state
terraform state list
# Show the full state of a specific resource
terraform state show aws_instance.api_server
# Remove a resource from state WITHOUT destroying it
# (useful when Terraform should stop managing something that still exists)
terraform state rm aws_instance.orphaned_server
# Import an existing resource INTO state
# (for resources created manually before Terraform was added)
terraform import aws_instance.legacy_server i-0abc123def456789
# Move a resource in state (renaming or restructuring)
terraform state mv aws_instance.server aws_instance.api_server
# Pull the raw state JSON locally (useful for inspection)
terraform state pull > current-state.json
The terraform state rm command is the one I reach for most often. When a resource is deleted outside Terraform (someone clicked "Delete" in the console), Terraform still thinks it exists. state rm tells Terraform to forget it, without trying to destroy the (already-gone) resource.
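Newer Terraform versions also let you express most of these operations in configuration, so they go through plan and code review instead of being typed into a terminal. A sketch using the same illustrative resource names as the commands above; the version numbers in the comments are my understanding of when each block was introduced, so verify them against your Terraform version:

```hcl
# Declarative counterparts to the CLI commands above.
# Each block is independent; use only the ones you need.

# Instead of `terraform import` (assumes Terraform >= 1.5)
import {
  to = aws_instance.legacy_server
  id = "i-0abc123def456789"
}

# Instead of `terraform state mv` (assumes Terraform >= 1.1)
moved {
  from = aws_instance.server
  to   = aws_instance.api_server
}

# Instead of `terraform state rm` (assumes Terraform >= 1.7)
removed {
  from = aws_instance.orphaned_server

  lifecycle {
    destroy = false # forget the resource, leave the real instance alone
  }
}
```

Two caveats: the import block still needs a matching resource "aws_instance" "legacy_server" block in your configuration, and the removed block expects the original resource block to be deleted from it.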
State Drift: When Reality Disagrees With Your State File
After a few months, someone will change something in the AWS console. Or a Lambda will be manually modified. Or an auto-scaling event will change instance counts. Your state file doesn't know. This is drift.
# Detect drift: compare current infrastructure to state
terraform plan -refresh-only
# If the drift is intentional, update state to match reality
terraform apply -refresh-only
Make terraform plan -refresh-only part of your weekly CI job. Drift accumulates silently. You don't want to discover it during an incident response when you're trying to understand why Terraform wants to destroy your load balancer.
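Not all drift is a mistake. When something is supposed to change at runtime, like the instance count of an auto-scaling group, you can tell Terraform to stop treating that attribute as drift. A sketch, assuming a hypothetical aws_launch_template.api defined elsewhere and the networking remote state from earlier:

```hcl
# Sketch only: aws_launch_template.api is assumed to exist elsewhere
# in this configuration; sizes are illustrative.
resource "aws_autoscaling_group" "api" {
  name                = "api"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = data.terraform_remote_state.networking.outputs.private_subnet_ids

  launch_template {
    id      = aws_launch_template.api.id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change this at runtime; don't plan to reset it
    ignore_changes = [desired_capacity]
  }
}
```

Use this sparingly. Every attribute in ignore_changes is one more thing Terraform will never correct for you.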
Protecting State From Yourself
Beyond remote state, add these guardrails:
1. Lifecycle rules to prevent destroying critical resources:
resource "aws_rds_cluster" "main" {
  cluster_identifier = "production-db"
  lifecycle {
    prevent_destroy = true # Terraform will refuse to destroy this
  }
}
2. S3 bucket versioning (you already set this up above) means every state change creates a new version. If you corrupt state, roll it back:
# List state file versions in S3
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix production/api/terraform.tfstate \
  --query 'Versions[*].[VersionId,LastModified]'
# Restore a previous version by copying it over the current object
# (quoted so the shell doesn't treat the ? as a glob character)
aws s3api copy-object \
  --bucket mycompany-terraform-state \
  --copy-source "mycompany-terraform-state/production/api/terraform.tfstate?versionId=PREVIOUS_VERSION_ID" \
  --key production/api/terraform.tfstate
3. Require CI/CD to apply, never local laptops in production:
# .github/workflows/terraform.yml
name: Terraform
on:
  push:
    branches: [main]
    paths: ['infrastructure/**']
# Needed for OIDC role assumption (role-to-assume without static keys)
permissions:
  id-token: write
  contents: read
jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production # Manual approval, if required reviewers are configured
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Placeholder 12-digit account ID
          role-to-assume: arn:aws:iam::123456789012:role/TerraformCIRole
          aws-region: ap-south-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0
      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/production/application
      - name: Terraform Apply
        run: terraform apply -auto-approve
        working-directory: infrastructure/production/application
Once developer roles lose write access to production, nobody can run terraform apply from their Friday-afternoon laptop. The CI pipeline is the only applier. And the CI pipeline has the credentials, not your developers.
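For this to hold, the CI role needs access to the state backend and developer roles don't. A rough sketch of the backend side of that policy, assuming it is managed in the same configuration as the bootstrap resources (so aws_dynamodb_table.terraform_lock is in scope) and that the policy name is a placeholder. The action list follows my reading of the S3 backend documentation, so verify it before relying on it:

```hcl
# Sketch: state-backend permissions for the CI role only.
# Policy name is hypothetical; bucket and table names are the ones
# used earlier in this post.
data "aws_iam_policy_document" "terraform_state_access" {
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::mycompany-terraform-state"]
  }

  statement {
    sid       = "ReadWriteStateObject"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::mycompany-terraform-state/production/application/terraform.tfstate"]
  }

  statement {
    sid = "StateLocking"
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = [aws_dynamodb_table.terraform_lock.arn]
  }
}

resource "aws_iam_policy" "terraform_state_access" {
  name   = "terraform-state-access"
  policy = data.aws_iam_policy_document.terraform_state_access.json
}
```

This covers only the state file and lock table. The role also needs whatever permissions the managed resources themselves require, and scoping the object ARN to a single key keeps the application pipeline from touching the networking or database state.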
Before vs After
| Scenario | Local State | Remote State + Best Practices |
|---|---|---|
| Two devs run apply simultaneously | State corruption | Second apply is blocked by the lock |
| State file accidentally deleted | All resources orphaned | S3 versioning, roll back in 30s |
| Audit: who applied what when? | No idea | S3 object logs + CI history |
| Production apply from laptop | Yes, terrifyingly | Blocked by CI-only credentials |
| .env equivalent for infra | Committed state in git | Encrypted in private S3 |
| Cross-team dependency | Hardcoded ARNs everywhere | terraform_remote_state outputs |
TL;DR
- Never use local state for anything shared or production
- S3 + DynamoDB backend is the AWS standard; set it up before you have a Friday incident
- Split state by environment AND component; don't put everything in one file
- S3 versioning is your undo button when state gets corrupted
- Run terraform plan -refresh-only weekly to catch drift before it bites you
- Only apply from CI/CD in production; revoke developer credentials for terraform apply
- prevent_destroy lifecycle rules on anything you really, really don't want deleted
Terraform state management isn't glamorous. Nobody writes blog posts about it until 4:45 PM on a Friday. But getting it right means that when something does go wrong, you have a recovery path instead of a weekend rebuilding production from scratch.
Set up remote state today. Your future self (the one who would have been on call that Friday) will thank you.
Managing Terraform at scale? Connect with me on LinkedIn. I'm always happy to trade war stories about infrastructure disasters narrowly avoided.
Want to see real Terraform modules? Check out my GitHub for working examples from production setups.
Now go check: is your terraform.tfstate committed to git? I'll wait.