Cutting Your Data Stack Costs: How To Approach It And Common Issues

January 5, 2024

I once had an engineer tell me that they essentially didn’t want to consider cost as they were building a solution. I was baffled. Don’t get me wrong: when you’re building, you iterate and aim to improve your solution’s cost over time.

But from my perspective, completely ignoring costs from day one isn’t a good plan. In fact, many of the data teams that have reached out to our team in the past six months have asked how they can cut their data stack costs.

Cost plays a role in all forms of projects, whether you’re building bridges or writing code. How much budget is allocated to build and maintain a solution is important. In the real world, it can change the materials used, the timeline, or the final product’s design.

In the software and data world, it shapes which features you build and the design decisions you make.

If anything, cost and performance optimization is one of the things I enjoy most as an engineer. Sure, it’s fun to build new solutions and infrastructure. But it’s often when we’re trying to figure out how to run systems more efficiently or cost-effectively that I’ve felt like I was solving a real problem.

It forces you to find ways of storing or processing data that are more efficient yet still easy to maintain, a constraint that can feel limiting but pushes you toward better designs.

There are plenty of common issues that drive up data infrastructure costs, but to find them, you likely first need to approach your cost-saving efforts in an organized way. So let’s go over how you can cut your data stack costs in 2024!

Cutting Data Stack Costs

“The pursuit of cost efficiency is an ongoing journey. Even after deployment, we must revisit systems to incrementally improve optimization. The key is continually questioning and diving deeper.” – The Frugal Architect

Before diving into some of the common culprits behind data infrastructure costs, I’d like to go through how you can approach the problem in the first place.

Step one, in my mind, after deciding that you’d like to cut data stack costs, is to lay out how much things are costing, broken down by product and by job/dashboard/process.

For example, I will generally put together a sheet that looks like the one below (you might add other columns, such as prioritization, data processed, etc.).

Back when I worked more on-prem, this sheet really focused on the amount of time a process took as well as how much RAM it’d take on the server. Now that we are in the cloud, you can add a column for costs per process as well.

With this list of processes and data workflows, you can start to prioritize what to target first.
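
To make that concrete, here is a minimal sketch of what such a sheet might look like in code, with prioritization by monthly cost. All the process names, owners, runtimes, and dollar figures are hypothetical placeholders; in practice you’d pull them from your warehouse’s usage views or your orchestrator’s metadata.

```python
from dataclasses import dataclass

@dataclass
class ProcessCost:
    """One row of the cost sheet: a job, dashboard, or pipeline."""
    name: str
    owner: str
    runtime_minutes: float   # how long the process takes per run
    monthly_cost_usd: float  # cloud spend attributed to it

# Hypothetical entries for illustration.
sheet = [
    ProcessCost("daily_orders_load", "data-eng", 42.0, 1_800.0),
    ProcessCost("exec_revenue_dashboard", "analytics", 15.0, 5_200.0),
    ProcessCost("ml_feature_backfill", "ml-platform", 120.0, 950.0),
]

# Tackle the most expensive processes first.
for row in sorted(sheet, key=lambda r: r.monthly_cost_usd, reverse=True):
    print(f"{row.name:25} ${row.monthly_cost_usd:>8,.0f}/mo  {row.runtime_minutes:>5.0f} min")
```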

Overall, this creates a more methodical process that you can use to approach your cost optimization. If you’re just interested in learning about some of the common issues I have seen, you can read them below.

Realtime Has Gotten Easier But Can Still Be Expensive

Realtime is often thought of as expensive. But in the past, part of that was because, technically, it was a hard problem to solve.

Now there are plenty of solutions that make it considerably easier from a technical standpoint, but depending on how your team implements it, your realtime solution can still be expensive.

For example, given the way many modern data warehouses charge for compute, if you run a process (such as inserting data) every minute of every hour instead of batching it all into a single minute at the top of the hour, the first approach will be considerably more expensive. Your cost can be multiplied several times over.
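Here’s a rough back-of-the-envelope sketch of why. It assumes a warehouse that bills per second while running and auto-suspends when idle, so per-minute inserts effectively keep it on 24/7; the credit and dollar rates are hypothetical placeholders, not any vendor’s actual pricing.

```python
# Hypothetical rates -- substitute your warehouse's actual pricing.
CREDITS_PER_HOUR = 1.0   # e.g., a small warehouse
USD_PER_CREDIT = 3.0
HOURS_PER_MONTH = 730

# Inserting every minute: the warehouse never idles long enough to
# auto-suspend, so you effectively pay for 24/7 uptime.
always_on = CREDITS_PER_HOUR * USD_PER_CREDIT * HOURS_PER_MONTH

# Batching once per hour: roughly one billed minute per hour.
batched = (1 / 60) * CREDITS_PER_HOUR * USD_PER_CREDIT * HOURS_PER_MONTH

print(f"Per-minute inserts: ${always_on:,.0f}/mo")  # ~$2,190/mo
print(f"Hourly batches:     ${batched:,.0f}/mo")    # ~$37/mo
print(f"Multiplier:         {always_on / batched:.0f}x")  # 60x
```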

That’s why when I read articles such as “I accidentally saved my company $500k,” I believe it. Although in that case a different mechanism was behind the $500k, it’s not uncommon to find small mistakes or design decisions costing a company a considerable amount, and some of them can be fixed simply by changing how often you load data into your cloud data platform.

Many companies that provide realtime capabilities also offer an option to delay load times, mainly for this reason.

Picking The Wrong Solution Is Expensive

Another example of an issue I ran into in 2023: a client trying to ingest data from a single source was quoted nearly $200k by one vendor. After talking to a few other vendors, we found Estuary, which provided a far more reasonable price (nearly 90% less than the $200k).

Now don’t get me wrong; if you’re used to some data infrastructure contracts, $200k for an ELT/ETL solution is nothing. But for this company’s size and use case, it was drastically too much. This was the project that led me to work more closely with Estuary. They provided a solution that helped me cut the client’s bill by 90%, and I realized that Estuary would work great for a lot of other use cases (in fact, I have already found another client in a similar situation).

Honestly, at $200k, it might have even been cheaper to pay someone to custom-build a solution (though with tech salaries these days, that might not be true). All in all, sometimes you may not be using the best solution for the job, and that can easily cost your company.

Views Are Great Until They Aren’t

For many clients, it’s not uncommon to find they’ve built a view-on-view-on-view situation that their dashboards rely on.

If you’re on a consumption-based pricing model, this leads to both costly data warehouse bills and snail-like dashboards: the worst of both worlds, slow and expensive.

Yes, it’s nice to build views because you avoid creating tables, which often require more maintenance. Tables also aren’t as easy to change, and at many companies they can only be built by engineers.

So the view-on-view-on-view paradigm is easily created, even if it takes other forms, such as views built inside the BI tool that are then overlaid onto views in the data warehouse, and then multiple layers of that again.
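
If you want to hunt these chains down systematically, here’s a minimal sketch of measuring view nesting depth. It assumes you’ve already exported each view’s direct dependencies into a plain mapping (for example, from your warehouse’s information schema or your BI tool’s metadata) and that the graph has no cycles; the view names are hypothetical.

```python
# Hypothetical view -> direct dependencies, exported from metadata.
view_deps = {
    "dash_revenue": ["v_orders_enriched"],
    "v_orders_enriched": ["v_orders_clean", "v_customers"],
    "v_orders_clean": ["raw.orders"],
    "v_customers": ["raw.customers"],
}

def nesting_depth(view: str) -> int:
    """Count how many layers of views sit beneath a given view.

    Base tables (anything not in view_deps) count as depth 0.
    Assumes the dependency graph is acyclic.
    """
    deps = view_deps.get(view)
    if not deps:
        return 0
    return 1 + max(nesting_depth(dep) for dep in deps)

# Flag the deepest chains first -- prime candidates to materialize
# into real tables.
for view in sorted(view_deps, key=nesting_depth, reverse=True):
    print(f"{view:20} depth={nesting_depth(view)}")
```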

In terms of knowing what to tackle first, you’ll recall that you should have a list of the dashboards and/or processes and their costs. So, maybe you don’t need to fix every instance of a view-on-view-on-view situation. But perhaps there is one that is costing your data team $60k a year, which is 30% of your data infrastructure costs.

That might be a great place to start.

Bad Data Models Will Cost You

Data modeling isn’t just meant to make data easier to work with. Much of its origin came from limited space and compute (along, of course, with usability, reliability, and several other pillars). Now, although we live in a world of seemingly infinite compute and storage, guess what: it comes with a potentially infinite bill.

In other words, to a degree, there is still a limitation, especially on compute.

This means that your choice of data modeling technique or lack thereof can still cost you!

One of the first projects I took on at Facebook was to improve the design of a data model that had slowly started to sprawl. The purpose was less about reducing compute costs and more about usability. By improving usability, we were able to reduce the time data engineers and analysts spent trying to get the right data, as well as reduce the number of questions we got from external teams. We did this by focusing on a few key areas, including:

  • Centralizing IDs so end-users are clear on what to join on
  • Parsing out nested and complex data structures so analysts can easily work with them (see the sketch after this list)
  • Improving the data standardization in terms of naming, data types, etc
  • Creating pre-joined tables when possible for analysts
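
As an illustration of the nested-data point above, here’s a minimal sketch of flattening a nested record into standardized, analyst-friendly columns. The field names and structure are hypothetical, not an actual production schema.

```python
# A hypothetical nested event record, as it might land in a raw table.
raw_event = {
    "user": {"id": 12345, "locale": "en_US"},
    "event": {
        "type": "page_view",
        "ts": "2024-01-05T10:30:00Z",
        "context": {"device": "ios", "app_version": "2.1.0"},
    },
}

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into columns like user_id and event_type."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

print(flatten(raw_event))
# {'user_id': 12345, 'user_locale': 'en_US', 'event_type': 'page_view',
#  'event_ts': '2024-01-05T10:30:00Z', 'event_context_device': 'ios',
#  'event_context_app_version': '2.1.0'}
```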

So not all cost improvements show up as direct infrastructure savings; some come from reclaiming engineering and analyst time.

Costs Won’t Reduce Magically

Your data infrastructure costs play an important role in how your team can operate. In some cases, they can even affect how large your team can be, because infrastructure spending can crowd out the budget for hiring new team members when you need them.

Of course, cost savings and performance optimizations are iterative processes. You’ll likely first develop a solution that works but is perhaps not 100% optimal. Over time, you’ll slowly (or quickly) find what the major cost drivers are. That doesn’t mean you can develop your first iteration sloppily. After all, I’ll once again reiterate the question: “Does the business really need realtime?”

On the other hand, the view-on-view-on-view paradigm tends to creep in slowly, and you might not know which views are truly worth the effort of building a permanent solution until a few months down the line (a trade-off of engineering time now against future costs).

Take the time to review your costs, and what’s driving them, on a regular cadence, and you’ll likely keep your infrastructure spend in check!

Also, if you’re looking to cut your data infrastructure costs in 2024, then please feel free to reach out!

Set up a free consultation today!

Thanks for reading! If you’d like to read more about data engineering, then check out the articles below.

Normalization Vs Denormalization – Taking A Step Back

4 Alternatives to Fivetran: The Evolving Dynamics of the ETL & ELT Tool Market

What Is Change Data Capture – Understanding Data Engineering 101

Why Everyone Cares About Snowflake

What Is Estuary