Most data engineers optimize for correctness. The expensive ones optimize for cost.
Why cost-awareness is becoming the line between senior and staff data engineers.
Recently I was hired to work on a system where a fraud detection pipeline ran a query on every incoming request, scanning two years of history each time. The architecture worked. The bill scaled linearly with traffic. We rebuilt it and cut operational costs by 80%.
That kind of cost surprise isn’t unusual in cloud data warehouses. A SELECT * with a LIMIT 10 on a 10-petabyte table costs about $62,000 in BigQuery for ten rows. Most data engineers don’t know that, because most data engineers were trained to optimize for correctness.
This post is about the second skill: optimizing for cost. And why the engineers who develop it are worth more than the ones who don’t.
The state of data engineering
When we think of software companies, we mostly think of servers locked in a cooled room and developers running sacred code against them. Data engineering followed the same pattern, until the shift to cloud happened.
That pivot was mainly driven by the fact that teams would no longer need to manage their own servers.
But this shift introduced a new paradigm I call cost-sensitive programming. The technologies and the principles remained the same, but this new dimension had a massive impact on businesses. Everything you build now carries a running cost in production on top of the cost of developing it.
Expensive mistakes
Let’s talk about LIMIT in SQL queries. I see lots of data engineers and data analysts writing queries on BigQuery like this:
SELECT * FROM huge_table LIMIT 10
They hope it will be a cheap query. But the LIMIT here has no impact on cost.
BigQuery pricing is based on bytes scanned, and in this example BigQuery would have already scanned all rows and all columns of the table just to produce 10 rows.
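Because the bill depends only on bytes scanned, the cost of a full scan is plain arithmetic. Here is a minimal sketch, assuming an on-demand list price of $6.25 per TiB (my assumption, so check the current price for your region and edition). The function name and constants are mine, not part of any BigQuery API.

```python
# Assumption: on-demand list price of 6.25 USD per TiB scanned.
# This is an illustrative default, not a value read from any API.
PRICE_PER_TIB_USD = 6.25
TIB = 2**40  # bytes in one tebibyte


def on_demand_cost_usd(bytes_scanned: int,
                       price_per_tib: float = PRICE_PER_TIB_USD) -> float:
    """Estimate the on-demand cost of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TIB * price_per_tib


ten_pb_decimal = 10 * 10**15  # 10 PB counted in powers of ten
ten_pib_binary = 10 * 2**50   # 10 PiB counted in powers of two

print(f"{on_demand_cost_usd(ten_pb_decimal):,.0f} USD")  # 56,843 USD
print(f"{on_demand_cost_usd(ten_pib_binary):,.0f} USD")  # 64,000 USD
```

Depending on whether you count a petabyte in decimal or binary units, a full scan of a 10-petabyte table lands somewhere between roughly $57,000 and $64,000 under this assumed price. Note that the number of rows returned never appears in the formula.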
If this table were 10 petabytes, this query would cost about $62,000 at BigQuery’s on-demand rate of roughly $6.25 per terabyte scanned, just for reading 10 rows. This comes down to the fundamentals of how a SQL query is processed, and not just in BigQuery.
SQL runs in a different order than it’s written: FROM and JOIN first, then WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, and LIMIT last. By the time the LIMIT clause runs, BigQuery has already done all the work, and charged you for it.
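On a standard, non-clustered table, LIMIT reduces neither the bytes BigQuery scans nor the bill. Clustered tables are the exception: there, a LIMIT can let the scan stop early.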
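To make the evaluation order concrete, here is a toy model of a full-scan engine in Python. It is only an illustration of the logical order described above, not how BigQuery actually executes queries, and every name in it is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_full_scan_query(
    rows: Iterable[Dict],
    where: Callable[[Dict], bool],
    columns: List[str],
    limit: int,
) -> Tuple[List[Dict], int]:
    """Toy full-scan engine. Returns (result_rows, rows_scanned)."""
    rows_scanned = 0
    matched = []
    for row in rows:        # FROM: every row is read and counted
        rows_scanned += 1
        if where(row):      # WHERE: filtering happens after the read
            matched.append(row)
    # SELECT: keep only the requested columns
    projected = [{c: r[c] for c in columns} for r in matched]
    # LIMIT: applied last, after all the scanning is already done
    return projected[:limit], rows_scanned


# Hypothetical 1,000-row table
table = [{"id": i, "amount": i * 10} for i in range(1_000)]
result, scanned = run_full_scan_query(
    table, where=lambda r: True, columns=["id", "amount"], limit=10
)
print(len(result), scanned)  # 10 1000
```

Ten rows come back, but all one thousand were read. In a pricing model based on what is scanned, the second number is the one you pay for.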
On-premise, this would affect query performance. In a batch environment, no one cares. In a cloud environment, this query has a cost every single time it runs. So if your data department is huge and everyone is making these kinds of queries, imagine the amount of money lost. Or the amount you could save.
Then there’s SELECT *.
In a row store like Postgres, all the columns of a row are stored together on disk. If you only need one column, the engine still has to read the whole row. The marginal cost of reading more columns was negligible, especially because it ran on a fixed-cost machine. So developers defaulted to SELECT *. The habit was free.
Then the warehouse moved to the cloud, and the storage model changed. In a columnar warehouse, each column is stored separately. Reading one column of a hundred-column table means scanning roughly 1% of the bytes, assuming the columns are similar in size. Combined with BigQuery’s bytes-scanned pricing, this makes SELECT * almost always the wrong default.
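The 1% figure is easy to check with a small model of columnar storage. This is a sketch under simplifying assumptions: the table, its column names, and the equal column sizes are all hypothetical, and real columns vary widely in size, so the actual fraction depends on which columns you pick.

```python
from typing import Dict, List, Optional


def columnar_bytes_scanned(column_sizes: Dict[str, int],
                           selected: Optional[List[str]] = None) -> int:
    """Bytes read by a columnar engine: only the selected columns are touched.

    `selected=None` models SELECT *, which reads every column.
    """
    if selected is None:
        return sum(column_sizes.values())
    return sum(column_sizes[name] for name in selected)


# Hypothetical table: 100 columns of 1 GB each
column_sizes = {f"col_{i}": 10**9 for i in range(100)}

full_scan = columnar_bytes_scanned(column_sizes)
one_column = columnar_bytes_scanned(column_sizes, ["col_0"])

print(one_column / full_scan)  # 0.01
```

A row store would read the full 100 GB in both cases, which is why the old habit carried no penalty there.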
Replacing SELECT * with the seven columns you actually need can drop a query from petabytes to terabytes scanned. That’s the difference between a $5,000 query and a $5 query.
And SELECT * is everywhere. I’ve seen Spark jobs in production pulling the full SELECT * from BigQuery and using ten columns out of a hundred. Every run, every day.
Lastly, there’s the dashboard problem. Most dashboards built on BigQuery inherit patterns from row-store databases: tables in a star schema, queried in full inside a tool like Power BI, then filtered locally. Or worse, every interaction in the dashboard fires a fresh BigQuery query.
The first pattern wastes scans. You pull a hundred million rows so the user can filter to ten thousand client-side. The second pattern is worse: every dashboard open, every filter change, every action is essentially a billable query. A dashboard opened fifty times a day by each of twenty people is a thousand queries a day. Thirty thousand queries a month.
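The query count above, and what it might cost, can be worked out in a few lines. The 10 GiB scanned per query and the $6.25 per TiB price are assumptions I picked for illustration. Plug in your own numbers.

```python
# Assumption: on-demand list price of 6.25 USD per TiB scanned.
PRICE_PER_TIB_USD = 6.25
GIB = 2**30
TIB = 2**40


def dashboard_queries_per_month(opens_per_user_per_day: int, users: int,
                                queries_per_open: int = 1, days: int = 30) -> int:
    """Number of billable queries a live dashboard fires in a month."""
    return opens_per_user_per_day * users * queries_per_open * days


def monthly_cost_usd(queries: int, bytes_per_query: int,
                     price_per_tib: float = PRICE_PER_TIB_USD) -> float:
    """Monthly on-demand cost if every query scans `bytes_per_query` bytes."""
    return queries * bytes_per_query / TIB * price_per_tib


queries = dashboard_queries_per_month(opens_per_user_per_day=50, users=20)
print(queries)  # 30000

# Hypothetical: each dashboard query scans 10 GiB
print(f"{monthly_cost_usd(queries, 10 * GIB):,.2f} USD")  # 1,831.05 USD
```

That is for a single dashboard firing one query per open. Filters and drill-downs multiply `queries_per_open`, and the bill with it.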
You don’t want the dashboard telling you how much revenue you generated to eat your margins.
Dashboards rarely need to be recomputed every time someone looks at them. Materialized views, scheduled aggregations, BI-tool caching, dbt pre-aggregations: there are several solutions. But the core idea is to precompute what doesn’t need to be live, and stop paying for the same scan a thousand times a day.
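The precompute idea fits in a few lines of Python. This is a generic time-to-live cache, not the mechanism any particular BI tool or warehouse uses, and `run_revenue_query` is a hypothetical stand-in for an expensive warehouse query.

```python
import time
from typing import Any, Callable, Optional


class CachedResult:
    """Serve a stored result and recompute it at most once per `ttl_seconds`."""

    def __init__(self, compute: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Optional[float] = None  # None: nothing cached yet

    def get(self) -> Any:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # The only place the billable scan happens
            self._value = self._compute()
            self._expires_at = now + self._ttl
        return self._value


scan_count = 0


def run_revenue_query() -> dict:
    """Hypothetical stand-in for an expensive warehouse query."""
    global scan_count
    scan_count += 1
    return {"revenue": 123_456}


# Refresh at most once a day, no matter how often the dashboard is opened
daily_revenue = CachedResult(run_revenue_query, ttl_seconds=24 * 3600)
for _ in range(1_000):
    daily_revenue.get()

print(scan_count)  # 1
```

A thousand dashboard opens, one scan. The trade-off is freshness: the numbers can be up to one TTL old, which is usually fine for a revenue dashboard and not fine for, say, fraud alerts.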
All these patterns share the same blind spot: they optimize for correctness and ignore what a cloud architecture costs.
Conclusion
On-premise, your mistakes wasted machine time. In the cloud, your mistakes waste money. The difference is that machine time was your team’s problem; money is your CFO’s. That’s why this skill is becoming the line between senior and staff data engineers.