
💸 FinOps for Engineers: You Don't Need a Finance Degree, You Need a `WHERE` Clause


Every FinOps conversation I've sat in starts the same way: someone from finance shares a screenshot of a cost dashboard with a red arrow pointing up and to the right, and the room full of engineers responds with the collective energy of being asked to explain a stranger's credit card statement. "That's... not really my department" is the vibe, even though every single line item on that bill was generated by code someone in the room wrote.

Here's the reframe that actually helps: cost is a runtime property of your system, exactly like latency or error rate. You wouldn't ship an endpoint without knowing its p99. You already instrument for correctness and performance. Cost is just a third dimension you've been ignoring — until finance shows up with the bill and now it's suddenly everyone's emergency.

FinOps, stripped of the buzzword, is just: attribute spend to the thing that caused it, make that attribution visible to the people who write the code, and put a cost-aware habit into the loop they already run (PRs, dashboards, alerts). None of that requires a finance background. It requires the same instincts you already use for debugging a slow query.

The bill is unreadable because your resources are anonymous

The number one reason engineers get looped into cost conversations too late is that nobody can answer "whose is this?" for half the line items. An untagged EC2 instance, an EBS volume nobody remembers attaching, a Lambda that fires 40,000 times a day for a feature that shipped and got deprecated eighteen months ago — these all show up as one undifferentiated blob called "compute."

The fix isn't a finance tool. It's tagging discipline, enforced the same way you enforce lint rules: at the boundary, automatically, before merge.

# Terraform: reject anything that doesn't declare who owns it and why
variable "required_tags" {
  # An object type with no default: leaving out any attribute is a type
  # error at plan time, so a partial override can't slip through.
  type = object({
    owner       = string
    team        = string
    cost_center = string
    service     = string
  })
}

resource "aws_instance" "app" {
  # ...
  tags = merge(var.required_tags, {
    Name = "app-server"
  })

  lifecycle {
    precondition {
      # Catches tags that are declared but left null or empty.
      condition     = alltrue([for k, v in var.required_tags : v != null && v != ""])
      error_message = "Every resource needs owner, team, cost_center, and service tags. No exceptions, no 'I'll add it later'."
    }
  }
}

At Cubet, we added a policy check in the Terraform CI pipeline that just fails the plan if cost_center is missing — same shape as a security scan, same "you can't merge this" enforcement. Within a month, the mystery-blob line item on our cost report shrank from roughly a third of total spend to under five percent. Not because anyone got smarter about cost — because it became attributable, and attributable spend gets fixed by whoever owns it, fast, without a finance meeting.
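If you want a feel for how small that kind of check can be, here is a minimal sketch in Python. It is an illustration under assumptions, not a description of any particular pipeline: it assumes the plan has been exported with `terraform show -json`, that tag values are known at plan time, and that the required keys are the four used above. The function name `find_untagged` and the sample plan are hypothetical.

```python
REQUIRED_TAGS = ("owner", "team", "cost_center", "service")  # assumed tag keys


def find_untagged(plan: dict, required=REQUIRED_TAGS) -> dict:
    """Map resource address -> missing tag keys, for resources that will exist after apply."""
    problems = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        # Skip deletions (after is null) and resource types with no tags attribute.
        if not after or "tags" not in after:
            continue
        tags = after.get("tags") or {}
        # A tag counts as missing if the key is absent or the value is empty.
        missing = [key for key in required if not tags.get(key)]
        if missing:
            problems[rc["address"]] = missing
    return problems


# Hand-written stand-in for the JSON plan, trimmed to the fields the check reads.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_instance.app",
            "change": {"after": {"tags": {"owner": "ana", "team": "payments", "service": "checkout-api"}}},
        },
        {"address": "aws_instance.old", "change": {"after": None}},
    ]
}

for address, missing in find_untagged(sample_plan).items():
    print(f"{address}: missing tags {', '.join(missing)}")
```

In CI you would load the real plan with `json.load(sys.stdin)` instead of the sample dict and exit non-zero whenever the returned mapping is not empty, which is what turns a report into a merge gate.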

Unit economics beat aggregate dashboards every time

A dashboard that says "you spent $84,000 on compute this month" tells you nothing actionable. A dashboard that says "you spent $0.014 per API request, up from $0.009 last month" tells you exactly where to look, because now it's a regression, not a fact of life.

This is the single biggest mindset shift: stop tracking total spend and start tracking cost per unit of value — per request, per active user, per job processed, whatever your product's actual unit is. Total spend going up is often good news (you're growing). Cost per unit going up while your traffic is flat is a bug, and engineers are very good at finding bugs once they're framed as bugs instead of "the AWS bill is weird again."

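-- Rough cost-per-request, joining CUR data against your own request logs.
-- Assumes daily_request_counts has one row per day for this service.
SELECT
  req.day,
  SUM(cur.unblended_cost) AS daily_cost,
  -- MAX() picks the single per-day count; NULLIF avoids dividing by zero
  SUM(cur.unblended_cost) / NULLIF(MAX(req.request_count), 0) AS cost_per_request
FROM cost_and_usage_report cur
JOIN daily_request_counts req
  -- truncate first so hourly CUR rows still match the daily count
  ON date_trunc('day', cur.usage_date) = req.day
WHERE cur.service = 'checkout-api'
GROUP BY req.day
ORDER BY req.day DESC;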

Wire that number into the same Grafana board that already has your latency and error-rate panels. When cost-per-request jumps, it shows up next to the deploy markers, and half the time you'll spot the correlation yourself: someone shipped a change that traded a cache hit for three downstream calls, and the SLO dashboards didn't catch it because nothing broke — it just got more expensive to serve.
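The "regression, not a fact of life" framing can be made mechanical. The sketch below is one hedged way to do it: compare unit cost between two periods and flag a rise beyond a tolerance. The function names, the 10% tolerance, and the request counts are assumptions chosen so the arithmetic lands on the $0.009 and $0.014 figures used earlier.

```python
from typing import Optional


def cost_per_unit(total_cost: float, units: int) -> Optional[float]:
    """Return cost per unit, or None when there are no units to divide by."""
    return total_cost / units if units else None


def unit_cost_regressed(prev_cost: float, prev_units: int,
                        curr_cost: float, curr_units: int,
                        tolerance: float = 0.10) -> bool:
    """True when cost per unit rose by more than `tolerance` (0.10 means 10%)."""
    prev = cost_per_unit(prev_cost, prev_units)
    curr = cost_per_unit(curr_cost, curr_units)
    if prev is None or curr is None:
        return False  # no traffic in one of the periods: nothing to compare
    return curr > prev * (1 + tolerance)


# Flat traffic, higher bill: unit cost goes from about $0.009 to $0.014.
print(unit_cost_regressed(54_000, 6_000_000, 84_000, 6_000_000))

# Bill doubled because traffic doubled: unit cost is unchanged.
print(unit_cost_regressed(54_000, 6_000_000, 108_000, 12_000_000))
```

The first case is the bug; the second is growth. Total spend went up in both, which is exactly why the aggregate number can't tell them apart.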

Rightsizing is a code review comment, not a quarterly initiative

The other place FinOps quietly becomes an engineering habit instead of a finance project is rightsizing. Treat "is this instance/pod sized correctly?" as a normal code review question, not a once-a-quarter consulting exercise. In the quarterly version, someone downloads a spreadsheet of underutilized instances and nobody acts on it, because by the time the spreadsheet exists, six sprints have passed.

# Kubernetes — requests set from *observed* usage, not a guess copy-pasted from another service
resources:
  requests:
    cpu: "250m"      # p95 actual usage over 30d, not "512m felt safe"
    memory: "384Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

The trick is making "what did this actually use in production" a five-second lookup instead of an archaeology project. If your PR template for a new service has a spot for "expected RPS" and "resource requests," add one line: "checked against actual usage of a comparable service? Y/N." That's it. No dashboard, no new tool, just a habit that catches the classic failure mode of copy-pasting resources.requests from a template and never revisiting it once traffic patterns diverge.
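As a rough illustration of what "set from observed usage" means in numbers, here is a small sketch that takes CPU samples in millicores, picks the p95 with the nearest-rank method, and rounds up to a tidy step. Where the samples come from (your metrics backend), the 50m step, and both function names are assumptions for the example.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer pct keeps pct * n exact before the division.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_cpu_request(samples_millicores, pct: int = 95, step: int = 50) -> int:
    """Round the chosen percentile up to the next multiple of `step` millicores."""
    observed = percentile(samples_millicores, pct)
    return math.ceil(observed / step) * step


# Made-up data: steady usage around 230m with a single 900m spike.
usage = [230] * 19 + [900]
print(f"cpu request: {suggest_cpu_request(usage)}m")
```

The single spike does not drag the request up to 900m, because p95 ignores the top 5% of samples. Whether that spike deserves headroom in the limit is a separate judgment call.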

The part finance actually wants from you

None of this requires you to understand reserved instance amortization schedules or how your company's cloud committed-use discount is structured — that part genuinely is finance's job, and they're usually good at it. What they can't do is know that the checkout-api service started making an extra downstream call last Tuesday, or that a batch job someone forgot about has been running hourly since a migration that finished nine months ago. That's your job, and it's the same skill set you use for any other production incident: notice the anomaly, trace it to a cause, fix it, ship it.

If your org doesn't have per-service cost visibility yet, that's the actual starting point — not a finance tool, not a FinOps certification, just tags enforced at merge time and one cost-per-unit panel next to your existing SLO dashboard. Start there. The rest of FinOps is just applying the debugging instincts you already have to a metric you've been ignoring.

What's the worst "mystery" line item you've ever chased down on a cloud bill? I'd bet it was untagged, unowned, and had been running quietly for way longer than anyone expected.

Thanks for reading!
