How To Set-up Your Data Stack For 2026 - Data Infrastructure For AI

research@theseattledataguy.com, April 13, 2026, data engineering

We are several years into the AI Revolution, so to speak, and with that has come an increased demand for data.

With the increased demand for data comes an increased demand for data infrastructure.

Some companies already have reliable data stacks; others are looking to migrate to Snowflake, Databricks, or some other solution (I am sure some people view Snowflake and Databricks as old at this point).

And, of course, others are still just starting to build their data infrastructure.

Now, wherever you might be, the focus of this article is to help you set up a reliable data stack as well as possibly avoid pitfalls on the way to deploying all your infrastructure.

You Shouldn’t Be Thinking About Ingestion

One of the points I try to hammer home with my clients is that ingestion should, for the most part, be a solved problem.

If your data team is struggling just to pull data into your Snowflake or BigQuery instance, you likely need a new solution. Now, don’t get me wrong, there are plenty of one-off ERPs that don’t have out-of-the-box data connectors. But most of the key data sources shouldn’t require a custom script (of course, there are still reasons to write your own when it’s cost-effective).

Even in those cases, out-of-the-box solutions often don’t just ingest data. They can:

Handle schema changes easily
Let you change ingestion cadence easily
Automatically handle retries and failure recovery without you writing orchestration logic
Provide built-in monitoring and alerting so you’re not debugging blind at 2am
Maintain historical syncs and backfills without custom scripts
Normalize and structure semi-structured data (JSON, nested APIs) out of the box
Offer column-level change tracking so downstream models don’t break silently
Reduce the need for engineers to babysit pipelines every time an upstream system changes
Scale ingestion volume without you re-architecting your pipeline every 6 months
Handle API rate limits and pagination gracefully (something custom scripts often fail at)
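
To give a sense of what a custom script has to get right before it matches that list, here is a minimal sketch of just two of those items: pagination and retrying rate-limited calls with exponential backoff. Everything named here (`fetch_all`, `RateLimited`, the `records` and `next_page` fields, the fake source) is hypothetical and made up for illustration; a real API will have its own response shape and error signals.

```python
import time
from typing import Callable, Iterator


class RateLimited(Exception):
    """Raised by the (hypothetical) source client when the API says slow down."""


def fetch_all(
    fetch_page: Callable[[int], dict],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every record from a paginated source, retrying rate-limited calls."""
    page = 1
    while page is not None:
        for attempt in range(max_retries):
            try:
                response = fetch_page(page)
                break
            except RateLimited:
                # Exponential backoff: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt)
        else:
            # The for loop finished without a successful call.
            raise RuntimeError(f"page {page} still rate limited after {max_retries} tries")
        yield from response["records"]
        page = response.get("next_page")  # None means there are no more pages


# Fake source for the demo: two pages, and the first request for page 2 is rate limited.
_calls = {"count": 0}


def fake_fetch_page(page: int) -> dict:
    _calls["count"] += 1
    if page == 2 and _calls["count"] == 2:
        raise RateLimited()
    pages = {
        1: {"records": [{"id": 1}, {"id": 2}], "next_page": 2},
        2: {"records": [{"id": 3}], "next_page": None},
    }
    return pages[page]


delays = []  # record the backoff delays instead of actually sleeping
records = list(fetch_all(fake_fetch_page, sleep=delays.append))
```

And this still leaves out schema changes, backfills, monitoring, and alerting, which is the point: each item on the list is more code you would own.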

There are several solutions you can look into here. I generally recommend a few, including Estuary and dlt (or a cloud-specific ingestion tool like Glue, although the last time I worked with Glue I regretted it).

I personally have used Estuary a lot recently, especially after Fivetran’s most recent price increase, and I’ve developed a deeper relationship with them since I started as one of their advisors.

The point is, don’t make ingestion complex if you don’t have to.

Don’t get me wrong, when it comes to data ingestion, there are plenty of ways it can get complicated. But if you can avoid it, do it.

The real value comes from what you do with your data, not from moving it from source to destination.

Speaking of transforms and value.

Data Transforms

One of the surprising things about it being 2026 is that so many companies still rely on SQL stored procedures for their transforms. There is nothing wrong here; it’s the way we’ve run data transformations forever, and it works.

Overall, SQL still tends to be the most common tool for running transformations. Sure, it can come in a few different forms, like dbt, stored procedures, or perhaps SQL integrated into an Airflow DAG. But it’s SQL.

That’s why I am happy that you can now run dbt directly on Snowflake, which means one fewer tool to manage: you can skip dbt Cloud and just run things on Snowflake. You could also just utilize Databricks and its data pipelines. I am still not a fan that everything is essentially just a Python Notebook, but it is nice that it’s all wrapped in one solution.

Now Snowflake just needs to make its dbt implementation far less complicated, and I’d be fully on board.

This is likely where people are starting to use AI heavily.

Many SQL transformations already exist, or at least, the basis for them does.

Need to build a HubSpot data model?

It probably exists.

Need to figure out how to integrate several commonly used solutions? It’s probably out there.

The challenge here that I see data teams run into is that:

This usually only accounts for more common data models
Their data still might be imperfect. For example, I recently worked for a client where they thought some data was populated, only to figure out it wasn’t, so we had to figure out how to fill in the gap.
Workflows are messy. Data doesn’t always come from an API. Sometimes you’re taking 8 different data sources in different shapes and formats and putting them into a single table.
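
As a rough illustration of the last two points, here is a small sketch that maps two differently shaped sources into one table and surfaces a field that turns out not to be populated. The sources, field names, and mapping functions are all invented for the example; real sources will be messier.

```python
from datetime import date

# Two hypothetical sources describing the same entity (customers) in different shapes.
crm_rows = [{"Id": "C-1", "Email": "A@Example.com", "Created": "2026-01-05"}]
billing_rows = [
    {"customer": {"ref": "C-2", "contact_email": None}, "signup_ts": "2026-02-10T08:30:00"}
]


def from_crm(row: dict) -> dict:
    """Map a flat CRM export row onto the shared customer schema."""
    return {
        "customer_id": row["Id"],
        "email": row["Email"].lower() if row["Email"] else None,
        "created_date": date.fromisoformat(row["Created"]),
        "source": "crm",
    }


def from_billing(row: dict) -> dict:
    """Map a nested billing record onto the same schema."""
    email = row["customer"]["contact_email"]
    return {
        "customer_id": row["customer"]["ref"],
        "email": email.lower() if email else None,
        # Keep only the date part of the timestamp so both sources agree on grain.
        "created_date": date.fromisoformat(row["signup_ts"][:10]),
        "source": "billing",
    }


customers = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]

# Surface the gaps instead of assuming a field is populated.
missing_email = [c["customer_id"] for c in customers if c["email"] is None]
```

One mapping function per source keeps each source’s quirks in one place, and the explicit gap check is what catches the "we thought that field was populated" surprise early.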

Now, all that being said, if you can build a reliable core data model, then comes the more modern take on the data team and data infrastructure.

That is AI of course!

You can’t have a data team without AI.

After all, isn’t that everyone’s goal?

To modernize their data stacks and layer in AI?

AI In Your Data Stack

There are plenty of tools that integrate AI in one form or another into their solution.

There are MCP servers.

Chatbots and conversational analytics.

And more.

I’ve used many of these, and most leave me wanting. They are often too restrictive, or require just as much work to get them functioning as you would have to put into building dashboards.

I am not saying this about every AI option, but I’ve run into a lot of snake oil salespeople whose products shine in POCs and demos and then fail afterward.

Truthfully, what I am finding is many data teams are building some sort of skills file + CLI via Claude or Codex and building tooling around it.

Add in a few well set-up GitHub Actions and you’re automatically deploying your data pipelines to Snowflake or Databricks, with tests smoothly integrated and everything.

This has been the case for several projects our team has been delivering as well as myself personally. I’ve partnered with companies such as Codestrap to build better versions of this.

Of course it’s more than just Claude and skills files. That only gets you so far.

Here are a few other ways I see teams implementing AI.

Initial Dashboard Development – Via AI, many non-technical users will start to utilize tools like Claude to help them design their dashboards and bring them to data teams. Their hope is that by having a POC of what they want premade, the process of getting to a usable dashboard will be faster. This can be somewhat dependent on how well they scoped their ask to Claude. I’ve seen people come back with pages upon pages of metrics, many of which they would surely not use. Yet, they included them.

There is also a question to be asked about the need for dashboards in the future, at least dashboards that are merely metrics. Future dashboards should add further context and provide insights into why one decision should be made over another. Otherwise, we are still building traditional dashboards when we have so much more information and the ability to have AI help as well.

Reverse Engineering Legacy Pipelines – Many large enterprises have custom-built data pipeline systems. These could be written in Python, PowerShell, or a few other languages (and always SQL). Many companies are trying to centralize their tooling. As part of the process, you’ll need to convert this code into whatever solution you’re picking, whether that’s Airflow, Mage, dbt, etc., and this is where teams lean on LLMs to read and translate the old code.

Query Optimization & Cost Reduction – Another common pattern I’ve witnessed is teams placing their query processing metadata into Claude or another LLM and asking where the opportunities are to reduce costs. Maybe you’ll realize that your Snowflake instance is too large or that you should run more of your queries together.

Synthetic Data Generation – This has less to do with business operations. But I’ve used LLMs to generate data sets to provide demos and examples because it can put together several tables and a decent amount of data quickly.
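
For the query optimization idea above, it usually helps to boil the raw query metadata down before handing it to an LLM (or just reading it yourself). Here is a minimal sketch that ranks workloads by total compute time. The `query_history` records and their field names are made up for illustration and are not any specific warehouse’s schema.

```python
from collections import defaultdict

# Hypothetical export of one day of query history; field names are illustrative.
query_history = [
    {"query_tag": "dashboard_sales", "warehouse": "BI_WH", "elapsed_s": 40, "runs": 1440},
    {"query_tag": "nightly_model", "warehouse": "ETL_WH", "elapsed_s": 900, "runs": 1},
    {"query_tag": "adhoc", "warehouse": "BI_WH", "elapsed_s": 15, "runs": 30},
]


def compute_seconds_by_tag(rows):
    """Total elapsed seconds per query tag, most expensive first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["query_tag"]] += row["elapsed_s"] * row["runs"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranked = compute_seconds_by_tag(query_history)

# Share of all compute time taken by the single most expensive workload.
top_share = ranked[0][1] / sum(seconds for _, seconds in ranked)
```

In this toy data, the 40-second dashboard query that runs every minute dwarfs the 15-minute nightly model, which is exactly the kind of thing that is hard to see when you only look at individual slow queries.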

There is plenty more you can do with LLMs and AI. I especially like the approach of mixing traditional templates, LLM skills, and rules to get a more deterministic output.
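
One way to picture the template-plus-rules approach: the LLM only proposes a few parameters, a rule layer validates them against allow-lists, and a fixed template produces the SQL. This is a sketch under that assumption, not a description of any particular tool; the table names, grains, and `render_count_query` are hypothetical.

```python
from string import Template

# Rules: the only values an LLM-proposed request is allowed to use.
ALLOWED_TABLES = {"orders", "customers"}
ALLOWED_GRAINS = {"day", "week", "month"}

# Template: the SQL shape is fixed, so the output is the same every time
# for the same parameters.
COUNT_SQL = Template(
    "SELECT DATE_TRUNC('$grain', created_at) AS period, COUNT(*) AS row_count\n"
    "FROM $table\n"
    "GROUP BY 1\n"
    "ORDER BY 1"
)


def render_count_query(params: dict) -> str:
    """Render SQL from proposed params, rejecting anything off the allow-lists."""
    table, grain = params.get("table"), params.get("grain")
    if table not in ALLOWED_TABLES:
        raise ValueError(f"table not allowed: {table!r}")
    if grain not in ALLOWED_GRAINS:
        raise ValueError(f"grain not allowed: {grain!r}")
    return COUNT_SQL.substitute(table=table, grain=grain)


# Pretend this dict came back from an LLM asked to pick a table and a grain.
sql = render_count_query({"table": "orders", "grain": "week"})
```

The model still does the fuzzy part (figuring out what the user is asking for), but it never writes the SQL itself, so a bad suggestion fails loudly instead of running.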

Quick pause! If you’re looking to improve your data infrastructure and better set things up for AI, then feel free to reach out for a free consultation here!

Just Cause AI Exists, Doesn’t Mean The Basics Don’t Matter

What I’ve learned over the last few years is that AI won’t magically fix all your business problems. Even if you get to a point where you’ve set up the perfect data infrastructure, there is so much more that needs to be in place to ensure your data team will succeed.

The problem is, many data teams forget the basics. They start running towards fancy AI solutions and automations.

All while leaving behind lots of other components of their data infrastructure.

Let’s talk about a few of those key areas.

Data quality and reliability

I know most people get tired of the garbage in garbage out line, at least if you’ve been in the data world for a long time. But I keep seeing bad data get in the way of good results.

I am not saying you need perfect data, because most companies don’t even truly define what that is and that is the problem.

Do you mean perfect in terms of accurate, or timely?

Do you always want all your data to be up to date?

Or just that the data is accurate once it’s loaded?

What about missing data?

And who is responsible for fixing bad data?

The truth is, the business will often complain about bad data without defining what they mean and that causes far more issues.

So start by setting expectations. Where are the known gaps? How often is the data loaded? What data isn’t included, and do you plan to load it at some point?
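
Those expectations are easier to hold people to (including yourself) once they are written down as something checkable. Here is a minimal sketch, assuming you have agreed on a load cadence and a tolerated gap for one table. The thresholds, column names, and `check_table` function are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical expectations agreed with the business for one table.
EXPECTATIONS = {
    "max_staleness": timedelta(hours=24),  # "the data is loaded daily"
    "max_null_rate": {"email": 0.10},      # known gap: up to 10% of emails missing
}


def check_table(rows, loaded_at, now, expectations=EXPECTATIONS):
    """Return a list of human-readable problems; an empty list means all good."""
    if not rows:
        return ["table is empty"]
    problems = []
    if now - loaded_at > expectations["max_staleness"]:
        problems.append("stale")
    for column, limit in expectations["max_null_rate"].items():
        null_rate = sum(r.get(column) is None for r in rows) / len(rows)
        if null_rate > limit:
            problems.append(f"{column} null rate {null_rate:.0%} exceeds {limit:.0%}")
    return problems


rows = [
    {"email": "a@example.com"},
    {"email": None},
    {"email": "c@example.com"},
    {"email": None},
]
now = datetime(2026, 4, 13, 9, 0)
problems = check_table(rows, loaded_at=datetime(2026, 4, 13, 2, 0), now=now)
```

Note that the check encodes the known gap rather than demanding perfection: missing emails are fine up to the agreed limit, and only a breach of that agreement raises a flag.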

Don’t let the business think they can answer every question if they can’t and push back where you need to.

Then, read my two articles on data quality. One gives you an understanding of what data quality is; the other gives you a better understanding of all the ways data quality checks can get noisy.

Storage and compute choices

One of the biggest mistakes I see companies make is choosing storage and compute based on trends instead of their real needs.

Not every company needs the most advanced lakehouse setup.

Think about your data team, its size, and what questions your business is trying to answer.

Maybe you don’t even need any form of data ingestion because, guess what, you can just read things from a replica database.

Whether via an agent or just your own trad SQL skills.

When do you need to start looking into a data warehouse?

Here are the key reasons I start looking into setting up a data warehouse:

Multiple data sources need to be joined (CRM, product, billing, marketing)
Metrics are inconsistent across teams and need a single source of truth
Queries are starting to impact production systems
Data volume is growing beyond what raw systems can handle
You’re starting to support advanced use cases (ML, forecasting, AI)
Your current setup is becoming a collection of one-off scripts and pipelines
Engineering is becoming a bottleneck for basic data questions
You need a centralized place to model and transform data reliably

Now what options do you have?

In my experience, most teams fall into a few different buckets when it comes to storage and compute:

Use a cloud data warehouse (Snowflake, BigQuery, Redshift)
Use a lakehouse setup (Databricks, open table formats like Delta/Iceberg)
Stay on a replica database for as long as possible
Use an embedded/local analytics approach (DuckDB, small-scale setups)
Some combination of the above

Each of these can work. The right choice depends less on the tool and more on your constraints.

Here’s how I generally think about them:

Cloud Data Warehouses

I think a lot of data teams overcomplicate their data stack. Say you’re a team of 5 people managing 100 GB of data. The benefits you’re going to get from using open table formats and having to logistically manage compute and storage separately might not be worth it.

This is where most teams should start.

They are relatively easy to set up, scale well, and support the majority of analytics use cases out of the box.

If your primary goal is BI, reporting, and enabling analysts, this is usually the fastest path to value.

The downside is cost can creep up quickly if you don’t manage compute properly, and you may run into limitations for more complex data science or real-time workloads.

Lakehouse / Databricks-style setups

These are powerful, flexible, and can support a wide range of workloads from analytics to machine learning.

But they also tend to require more engineering maturity.

You’re trading simplicity for flexibility.

If you don’t actually need that flexibility, you may just be adding complexity your team now has to maintain.

Replica database / staying close to source

This works surprisingly well for early-stage data teams.

If your data needs are simple, you can often get pretty far just reading from a replica and avoiding a full data stack altogether.

The limitation shows up when:

You need historical tracking
You need to join multiple systems
You need performance isolation from production

At that point, this approach starts to break down.
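
The historical tracking limitation is the easiest one to show. A replica only ever holds the current state, so the moment the source overwrites a row, the old value is gone unless you captured it. Below is a small sketch using an in-memory SQLite database as a stand-in for the replica; the table names and the `snapshot` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stands in for a table on the replica: current state only.
    CREATE TABLE subscriptions (id INTEGER PRIMARY KEY, status TEXT);
    -- Where we keep dated copies so history survives.
    CREATE TABLE subscriptions_history (snapshot_date TEXT, id INTEGER, status TEXT);
    INSERT INTO subscriptions VALUES (1, 'trial');
""")


def snapshot(conn, snapshot_date):
    """Append a dated copy of the current state to the history table."""
    conn.execute(
        "INSERT INTO subscriptions_history SELECT ?, id, status FROM subscriptions",
        (snapshot_date,),
    )


snapshot(conn, "2026-04-12")
# The source system overwrites the row in place.
conn.execute("UPDATE subscriptions SET status = 'paid' WHERE id = 1")
snapshot(conn, "2026-04-13")

current = conn.execute("SELECT status FROM subscriptions").fetchall()
history = conn.execute(
    "SELECT snapshot_date, status FROM subscriptions_history ORDER BY snapshot_date"
).fetchall()
```

Once you find yourself scheduling snapshots like this and needing somewhere to put them, you are most of the way to needing a warehouse.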

Embedded / local analytics (DuckDB, etc)

This is becoming more popular, especially for smaller teams or specific workflows.

It’s fast, cheap, and simple.

But it’s not always a full replacement for a centralized warehouse, especially when you need shared access, governance, or large-scale processing.

Governance, ownership, and documentation

This is one of those areas people tend to ignore until things start breaking.

A metric changes and no one knows who owns it.

A dashboard is wrong and everyone points fingers at each other.

Etc.

Then the questions start pouring in.

Who owns the definition of revenue?

Who is responsible for fixing broken source data?

What changes occurred in the core models and who approved them?

Who decides what gets added to the warehouse and what does not?

I know, no one gets excited about governance and process.

But if your data team has to keep answering the same questions over and over again, or if every new analyst has to reverse engineer how your warehouse works, you do not have a documentation problem, you have a scale problem.

At a minimum, your team should be documenting core metrics, important models, major data sources, and any known caveats the business should understand before using the data.
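
This does not need a dedicated tool on day one. As a sketch of the bare minimum, here is a tiny metric registry that records an owner, a definition, sources, and caveats, plus a check for metrics that are in use but undocumented. The metric names, owner, and definition are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MetricDoc:
    name: str
    owner: str        # team or person who approves definition changes
    definition: str
    sources: list
    caveats: list = field(default_factory=list)


# Hypothetical registry with a single documented metric.
REGISTRY = {
    "revenue": MetricDoc(
        name="revenue",
        owner="finance-analytics",
        definition="Sum of paid invoice amounts, net of refunds.",
        sources=["billing.invoices", "billing.refunds"],
        caveats=["Refunds can land up to 30 days after the invoice."],
    ),
}


def undocumented(metrics_in_use, registry=REGISTRY):
    """Return metrics that appear in dashboards but have no documented owner."""
    return sorted(m for m in set(metrics_in_use) if m not in registry)


gaps = undocumented(["revenue", "active_users", "churn_rate"])
```

Even a file like this answers "who owns the definition of revenue?" without anyone having to ask in chat.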

Because the truth is, a well-built data stack is not just about storing and transforming data.

It is about making that data usable.

And usable data is not just technically available. It is understandable, trusted, and owned.

That is where a lot of teams fall short. They build the pipelines, stand up the warehouse, maybe even layer in AI, but they never put the operational structure around it that makes the whole thing work long term.

If you’re looking to dig deeper into data governance check this article out.

Cost and maintainability

One thing hasn’t changed in my last decade of working in the data world.

If the technology you’re using or processes you’re implementing cost more than your positive business outcomes, the business will eventually start calling it into question.

Why are we spending so much on Teradata? We should switch to Hadoop.

Why are we spending so much on Hadoop? We should switch to Snowflake.

Etc.

You need to consider costs, at least to a degree. Your overall goal should be that your data strategy and infrastructure drive so much value that people don’t ask about the costs.

But it’s very easy to go from $20,000 a year to $200,000 a year via pay-as-you-go models. So how do you keep your data stack costs under control? Here are a few common cost drivers to watch for if you want to keep your data infrastructure costs reasonable.

Realtime pipelines running more often than needed – Realtime has gotten easier to implement, but that doesn’t mean it’s free. Running jobs every minute instead of batching them hourly can multiply your costs several times over, especially in consumption-based warehouses.
View-on-view-on-view patterns – These are easy to create and hard to notice. Over time, they lead to both slow dashboards and expensive queries, especially when each layer adds more compute to resolve.
Poor data modeling decisions – Data models aren’t just about usability – they directly impact cost. Inefficient joins, unoptimized tables, and lack of structure can quietly drive up compute usage over time.
Lack of visibility into what actually costs money – Many teams don’t know which pipelines or dashboards are driving spend. Without that visibility, it’s almost impossible to prioritize what to fix.
Small decisions compounding over time – Most cost issues don’t come from one big mistake. They come from dozens of small, reasonable decisions – load frequency, query patterns, tool choices – that slowly add up.
Not revisiting systems after they’re built – What made sense six months ago might not make sense today. Costs won’t magically go down unless someone is actively reviewing and optimizing.
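
To put rough numbers on the first item in that list, here is a back-of-the-envelope sketch comparing a job that runs every minute with one that runs hourly. The billed seconds per run and the hourly compute price are assumptions picked for illustration; plug in your own warehouse’s numbers.

```python
def monthly_compute_cost(runs_per_day, billed_seconds_per_run, cost_per_compute_hour, days=30):
    """Estimate monthly spend for one scheduled job on consumption-based compute."""
    billed_hours = runs_per_day * billed_seconds_per_run * days / 3600
    return billed_hours * cost_per_compute_hour


# Assumptions (hypothetical): compute costs $4 per hour.
# The every-minute job bills 60 seconds per run; the hourly batch processes
# more data per run, so assume it bills 300 seconds each time.
every_minute = monthly_compute_cost(runs_per_day=1440, billed_seconds_per_run=60, cost_per_compute_hour=4.0)
hourly = monthly_compute_cost(runs_per_day=24, billed_seconds_per_run=300, cost_per_compute_hour=4.0)

ratio = every_minute / hourly
```

Under these assumptions the every-minute schedule comes out around $2,880 a month versus $240 for the hourly batch, a 12x difference for the same data. The exact ratio depends entirely on your billing model and workload, but the shape of the math is the same.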

If you are really spending far more than you think you should on your data tooling, then reach out here. I’ve helped plenty of data teams cut their data infrastructure costs, by over 50% in some cases!

Final Thoughts

A lot has changed in the last few years and I want to be clear, I am far from anti-AI. In fact, I am working to build tooling right now to get better results for cheaper using AI.

And as your team works to set up its data stack, I do think it’s important to think about where you might want to layer in AI.

But realize, many of the basics still matter.

Maybe you can do more with less.

Great!

You still want to build data infrastructure in 2026 that drives business impact, not just yet another set of useless tools.

I am realizing this is really a part one, and I owe all my readers a deeper dive!

If you’d like to read more about data engineering and data science, check out the articles below!

What Are Data Pipelines And Why Do They Exist

What Is A Data Platform And Why You Should Build One

Throughput vs Latency: Understanding the Key Difference in Data Engineering

How to build a data pipeline using Delta Lake

Intro To Databricks – What Is Databricks