AWS Step Functions: Stop Wiring Lambdas Together with Duct Tape šāļø
AWS Step Functions: Stop Wiring Lambdas Together with Duct Tape šāļø
Hot take: The moment your "simple serverless function" needs to call three other functions, retry on failure, and send a notification if it all goes wrong ā you've accidentally written a distributed workflow. And you probably wrote it terribly. I know, because I did too.
Welcome to AWS Step Functions ā the service that makes you feel like you actually have your life together. āØ
The Problem: Lambda Spaghetti š
When I first started architecting serverless e-commerce backends, I had this brilliant idea. One Lambda triggers another Lambda. That one triggers another. Simple, right?
// Lambda #1: processOrder
exports.handler = async (event) => {
const order = await createOrder(event);
// Manually invoke Lambda #2
await lambda.invoke({
FunctionName: 'validate-inventory',
Payload: JSON.stringify(order)
}).promise();
// What if THIS fails? Do I retry? How many times?
// What if it times out? Is it still running?
// How do I track where in the flow we are?
// 𤔠Good luck debugging this in prod!
};
In production, I've deployed this exact pattern. It worked great ā until it didn't.
An order would get stuck halfway through. Payment was charged. Inventory never deducted. Customer got no email. Me at 2 AM trying to piece together CloudWatch logs from six different Lambda functions. Zero fun. š„
That's when I found AWS Step Functions and everything changed.
What Are Step Functions? š¤
Think of Step Functions as a visual flowchart that actually executes your code.
You define states (steps), transitions, retry logic, error handling, and parallelism ā all in a JSON/YAML file called a State Machine. AWS handles the orchestration, retry, timeout tracking, and audit trail.
Your e-commerce order flow becomes this:
START
ā Validate Order
ā Check Inventory (parallel with Check Fraud)
ā Charge Payment
ā Update Inventory
ā Send Confirmation Email
END
If ANY step fails ā Compensate ā Notify ā Alert team
No manual Lambda invocations. No custom retry loops. No "where did the order disappear?" mysteries at 2 AM.
A Real Example: E-Commerce Order Processing ā”
Here's the Step Functions state machine I actually use in production:
{
"Comment": "E-Commerce Order Processing Flow",
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:validate-order",
"Next": "ParallelChecks",
"Retry": [{
"ErrorEquals": ["Lambda.ServiceException"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2
}],
"Catch": [{
"ErrorEquals": ["InvalidOrderError"],
"Next": "OrderFailed"
}]
},
"ParallelChecks": {
"Type": "Parallel",
"Branches": [
{
"StartAt": "CheckInventory",
"States": {
"CheckInventory": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:check-inventory",
"End": true
}
}
},
{
"StartAt": "FraudCheck",
"States": {
"FraudCheck": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:fraud-detection",
"End": true
}
}
}
],
"Next": "ChargePayment"
},
"ChargePayment": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:process-payment",
"Next": "SendConfirmation",
"Catch": [{
"ErrorEquals": ["PaymentFailedError"],
"Next": "RefundAndNotify"
}]
},
"SendConfirmation": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:send-email",
"End": true
},
"OrderFailed": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123:function:notify-failure",
"End": true
}
}
}
A serverless pattern that saved us: Running inventory check and fraud detection in parallel cut our order processing time from ~4s to ~2s. Step Functions parallel states are stupidly easy to add ā it's just a JSON change. Try doing THAT cleanly with raw Lambda chaining! šŖ
The Features That Actually Matter š
Built-in Retry Logic š
"Retry": [{
"ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
"IntervalSeconds": 1,
"MaxAttempts": 3,
"BackoffRate": 2
}]
That's exponential backoff with zero custom code. No more:
// The sadness I used to write
let retries = 0;
while (retries < 3) {
try {
await callSomeService();
break;
} catch (err) {
retries++;
await sleep(retries * 1000);
}
}
Wait States (For Human Approval Workflows) āøļø
This one blew my mind. Step Functions can pause and wait for a callback ā perfect for order approval flows, refund reviews, or manual verification steps.
"WaitForApproval": {
"Type": "Task",
"Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
"Parameters": {
"QueueUrl": "https://sqs.us-east-1.amazonaws.com/123/approvals",
"MessageBody": {
"TaskToken.$": "$$.Task.Token",
"OrderId.$": "$.orderId"
}
},
"Next": "ProcessApproved"
}
The state machine literally parks itself and waits. A human reviews, clicks "Approve" in your dashboard, your app calls SendTaskSuccess with the token, and the workflow resumes. All tracked. All auditable. No polling required. š¤Æ
The Execution History (Your Debugging BFF) š
Every Step Functions execution stores a complete audit trail:
[ā
] ValidateOrder - Started: 10:00:00, Ended: 10:00:01
[ā
] CheckInventory - Started: 10:00:01, Ended: 10:00:02
[ā
] FraudCheck - Started: 10:00:01, Ended: 10:00:03 (parallel!)
[ā] ChargePayment - Started: 10:00:03, Error: CardDeclined
[ā
] RefundAndNotify - Started: 10:00:04, Ended: 10:00:05
When architecting on AWS, I learned: This execution history alone is worth the price of entry. No more grepping through six CloudWatch log groups at 2 AM. The entire flow ā inputs, outputs, errors ā is RIGHT THERE.
Express vs. Standard Workflows šø
Step Functions has two flavors, and picking wrong will hurt your wallet:
Standard Workflows:
- Max duration: 1 year š®
- Exactly-once execution
- Full execution history
- Cost: $0.025 per 1,000 state transitions
- Use for: Order processing, human approval flows, long-running jobs
Express Workflows:
- Max duration: 5 minutes
- At-least-once execution (idempotency is YOUR problem)
- No persistent history
- Cost: $1.00 per 1 million executions (WAY cheaper!)
- Use for: High-volume event processing, IoT data pipelines, streaming
My production split:
- Order flows ā Standard (can't afford duplicates!)
- Notification pipelines ā Express (10k/day, Standard would cost $$$)
I accidentally used Standard for a high-volume notification pipeline once. The bill was... educational. ššø
The Gotchas Nobody Tells You šŖ¤
Gotcha #1: 256KB payload limit
State machines pass data between steps as JSON. If your Lambda returns a big response (product catalog, bulk data), you'll hit the 256KB limit and get a cryptic error.
Fix: Store large data in S3 and pass the S3 key between states ā not the actual data.
// Bad: passing the entire product list
{ "products": [...500 items...] }
// Good: passing a reference
{ "productsS3Key": "workflows/order-123/products.json" }
Gotcha #2: State transitions cost money
Every time you move from one state to another ā that's a billable transition. A step machine with 10 states, running 1 million times a month = 10 million transitions = $250/month.
Fix: Combine lightweight steps. Don't create a separate state for every tiny operation.
Gotcha #3: Timeouts are not retries
A Lambda that times out doesn't automatically retry unless you configure it in the Retry block. I once had payments silently fail because the Lambda hit its 15-minute limit and Step Functions just moved on. Ouch. š¬
Always add:
"TimeoutSeconds": 30,
"Retry": [{"ErrorEquals": ["States.Timeout"], "MaxAttempts": 2}]
When NOT to Use Step Functions š
Don't reach for Step Functions for everything. It adds complexity.
Skip it if:
- Your "workflow" is two Lambda calls that always succeed ā just call one from the other
- You need sub-100ms latency ā Step Function overhead is 50-100ms per state
- You're doing simple event processing (just use SQS + Lambda directly)
Use it if:
- You have 3+ steps with conditional logic
- You need retry/compensation logic
- You need an audit trail
- A human needs to approve something mid-flow
- You've already written Lambda spaghetti and are ashamed of it
Quick Setup š ļø
# Deploy with AWS SAM
aws cloudformation deploy \
--template-file template.yaml \
--stack-name order-workflow \
--capabilities CAPABILITY_IAM
# Start an execution
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:123:stateMachine:OrderFlow \
--input '{"orderId": "ORD-999", "userId": "user-42"}'
# Check execution status
aws stepfunctions describe-execution \
--execution-arn arn:aws:states:us-east-1:123:execution:OrderFlow:abc123
The AWS Console also gives you a live visual diagram of your execution as it runs. Show that to your non-technical stakeholders and watch their eyes light up! š
TL;DR š”
Stop duct-taping Lambdas together with manual invocations and custom retry loops. AWS Step Functions gives you:
- ā Visual, auditable workflows
- ā Built-in retry with exponential backoff
- ā Parallel execution out of the box
- ā Human-in-the-loop support
- ā Error handling that doesn't make you cry
The trade-offs: Cost per state transition + execution overhead + 256KB data limit.
My rule: If my workflow has more than three steps and any error handling ā Step Functions. Every time. No exceptions.
Building serverless e-commerce taught me that the hardest part isn't the Lambdas themselves. It's the glue between them. Step Functions is really good glue. š
Wrangling serverless workflows? Hit me up on LinkedIn ā I've got opinions about all of this!
Want to see a real state machine? Check out my GitHub for examples.
Now go replace that spaghetti with a proper state machine! š ā š