Batch Vs Real-Time Data Pipelines - Do We Still Need To Pick?

research@theseattledataguy.com November 12, 2025 data engineering

One of the questions most data engineers need to answer is whether a given data pipeline should be real-time or batch, sometimes posed as streaming vs batch.

The tools you might use to do that have changed over the past few years, but that was always the question. The business, of course, would always ask for real-time data pipelines, but when pressed on it, often batch was a good enough choice.

As a result, most pipelines defaulted to batch.

Of course, that was the thinking when I first started in the data world, but has it changed? In this article, we’ll talk about batch vs streaming, the challenges you’ll face as you build data pipelines, and how to decide between the two.

Starting With Batch Pipelines

Before diving too deep, let’s talk about why most data teams picked batch over real-time and streaming so much in the past.

First off, it was easier and cheaper. Even to this day, many IT and data teams are just using Cron or Windows Task Scheduler to ingest data into their data warehouse at midnight.

Often it’s just a quick Python or bash script that extracts all your data or calls a list of stored procedures, and that’s it. If something goes wrong during a run, it’s usually not hard to recover: you can come in the next morning, fix the script, and manually rerun it for that single day.
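To make the "rerun for a single day" idea concrete, here is a minimal sketch of a daily load that is safe to rerun. It assumes a delete-then-insert pattern keyed on the run date; the table name `daily_orders`, its columns, and the `load_day` function are hypothetical, and SQLite stands in for whatever warehouse you actually use.

```python
import sqlite3
from datetime import date


def load_day(conn, run_date, rows):
    """Replace everything stored for run_date with the given rows."""
    day = run_date.isoformat()
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_orders (order_date, order_id, amount) "
            "VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


# Hypothetical table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_orders (order_date TEXT, order_id INTEGER, amount REAL)"
)

# The original midnight run.
load_day(conn, date(2025, 11, 12), [(1, 9.99), (2, 25.00)])

# The morning rerun after a fix: old rows are replaced, not duplicated.
load_day(conn, date(2025, 11, 12), [(1, 9.99), (2, 25.00), (3, 5.00)])

row_count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
```

Because each run owns exactly one day of data, rerunning it is a non-event. That property is a big part of why batch feels predictable.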

That’s why, for a long time, batch was almost always the choice, specifically for analytics workflows.

Real-Time Becoming More Accessible

Now, real-time analytics, or at least attempts to get closer to it, have existed for decades. In many cases, you can just run analytics queries on top of your current database, assuming they are simple and won’t bog down a database that has actual transactions to process (and in those cases, many teams choose to use a replica).

That gets most companies to a point. Eventually, either the metrics you want to calculate become too complex or you want to merge other data sets with your data. Most companies start by porting over some tables into their replica and trying to answer questions.

However, cracks eventually start to show (which will have to be pointed out in a future article). That’s when companies start looking for real-time alternatives. Many look to tools like Kafka, Kinesis, or CDC tools.

And to this day that’s how many real-time pipelines are built.

In terms of use cases for this real-time data, it’s usually around fraud detection, logistics, inventory, personalization, real-time auctions, etc. There generally has to be some sort of operational component involved. Otherwise, why do you need the data in real time? If you’re just looking at numbers once a month for a board meeting, real-time is overkill.

Defining Real-Time and Batch Clearly

Before we get too far, let’s clear up what “real-time” even means.

A lot of people (especially in business meetings) say “real-time” when what they really mean is “faster than before.” True real-time systems continuously process data as it’s produced; think sub-second latency. But in practice, most of what’s called “real-time” falls somewhere in the near real-time zone, landing a few seconds if not minutes later. A lot of people in the data world will get up in arms if you say your system is real-time but it’s off by 2 seconds. Thus, everyone started saying “near real-time.”

There’s also micro-batching, which sits right in the middle, at least conceptually. Instead of constantly streaming every event, you process small batches frequently. This still tends to lean more towards batch than streaming.

And that’s really where most modern “real-time” systems live, somewhere between batch and streaming. A lot of companies say they’re doing streaming, but under the hood, they’ve just got a clever micro-batch setup running every few minutes.
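As a rough illustration of what micro-batching means in practice, here is a small sketch that groups events into fixed time windows instead of handling them one at a time. The `(timestamp, payload)` event shape and the `micro_batch_by_window` function are assumptions made for this example, not the API of any particular tool.

```python
from collections import defaultdict


def micro_batch_by_window(events, window_seconds):
    """Group (timestamp, payload) events into fixed-size time windows."""
    windows = defaultdict(list)
    for ts, payload in events:
        # Round the timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        windows[window_start].append(payload)
    # Return the windows in time order so each can be processed as a batch.
    return dict(sorted(windows.items()))


# Hypothetical events: (seconds since start, payload).
events = [(0, "a"), (12, "b"), (61, "c"), (119, "d"), (120, "e")]
batches = micro_batch_by_window(events, window_seconds=60)
```

With a 60-second window, the five events end up in three batches. Shrink the window and it looks more like streaming; widen it and it looks more like a classic batch job.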

So when someone says they need real-time data, it’s worth asking:

“Do you need data right now… or ten minutes before your next board meeting?”

That simple question usually cuts project complexity in half.

Tradeoffs and Challenges

For a long time, the trade-offs between batch and real-time were pretty clear (though I am starting to see some shift, which we will get to).

But here were some of the original challenges.

Complexity – Building a real-time data pipeline used to be very complex. You’ve got to deal with message queues, offsets, checkpoints, retries, and more moving parts than you expected. Batch, on the other hand, is predictable. You know when it runs, what data it processes, and what to do when it fails. Even now, there are tools that make much of the actual event tracking and processing easy, but once you take into consideration backfilling and other maintenance workflows, real-time isn’t always the right time.

Cost – Streaming means you’re paying for continuous compute. That might be fine at scale or for high-value workloads, but for a daily report that no one looks at until tomorrow? Not worth it. Batch is cheaper because it only runs when it needs to.

Data Quality – With batch, you can easily validate a dataset before you load it. With streaming, data is constantly flowing, and if something’s off (duplicate messages, out-of-order events, schema drift), it’s a lot harder to catch and fix. You’ll need guardrails like schema registries, watermarking, and error queues just to maintain some consistency.

Use-Case Fit – And finally, some data just doesn’t need to be real-time. Finance reports, end-of-day dashboards, or weekly metrics don’t suddenly get more useful just because they refresh every second.
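To give a feel for the data quality guardrails mentioned above, here is a simplified sketch of deduplication, a watermark for late events, and an error queue. The event keys `id` and `event_time`, the `allowed_lateness` parameter, and the `process_stream` function are all hypothetical; real streaming frameworks handle this with far more care (for example, the set of seen IDs here grows without bound).

```python
def process_stream(events, allowed_lateness):
    """Split events into accepted ones and an error queue.

    Each event is a dict with hypothetical keys "id" and "event_time".
    """
    seen_ids = set()       # for dedup; unbounded in this sketch
    max_event_time = None  # latest event time seen so far
    accepted, error_queue = [], []

    for event in events:
        # Minimal schema check: route malformed events to the error queue.
        if "id" not in event or "event_time" not in event:
            error_queue.append((event, "missing field"))
            continue

        # Duplicate delivery: drop the repeat.
        if event["id"] in seen_ids:
            continue
        seen_ids.add(event["id"])

        # Watermark: anything older than (latest time - lateness) is too late.
        if (
            max_event_time is not None
            and event["event_time"] < max_event_time - allowed_lateness
        ):
            error_queue.append((event, "too late"))
            continue

        if max_event_time is None or event["event_time"] > max_event_time:
            max_event_time = event["event_time"]
        accepted.append(event)

    return accepted, error_queue


events = [
    {"id": 1, "event_time": 100},
    {"id": 2, "event_time": 105},
    {"id": 2, "event_time": 105},  # duplicate message
    {"id": 3, "event_time": 103},  # out of order, but within lateness
    {"id": 4, "event_time": 90},   # behind the watermark
    {"event_time": 110},           # schema problem: no id
]
accepted, error_queue = process_stream(events, allowed_lateness=5)
accepted_ids = [e["id"] for e in accepted]
reasons = [reason for _, reason in error_queue]
```

None of this is needed in the batch version, where you can simply inspect the whole dataset before loading it. That gap is the extra complexity you sign up for with streaming.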

That’s the key takeaway: real-time isn’t automatically better. It’s just different.

Batch vs Streaming – Which Should You Pick

At the end of the day, the choice between batch and streaming comes down to one thing: what problem are you actually solving? And how does it add value to the business?

If you’re building an analytics workflow (dashboards, metrics, or reports), batch is almost always going to be the simpler, more reliable choice.

You don’t need to process data every second just so someone can look at a chart once a day.

But if your goal is to react to data, trigger alerts, update prices, flag fraud, or personalize a user’s experience, then real-time makes sense. In those cases, being even a few minutes late can cost money or create a bad experience.

Here’s the rule of thumb I use:

If your system needs to react, think real-time.

If it needs to be analyzed, think batch.

That small distinction can save you months of engineering effort and unnecessary infrastructure bills.

The Future: Hybrid Pipelines

What has changed drastically since the beginning of the batch vs real-time debate is that more and more tools are making the choice no longer a choice.

Meaning, if you want to set up a batch, do it.

Real-time, do it.

All in one tool.

For me, the tool I’ve been using for this most recently has been Estuary (which I am an advisor for). I’ve now had several clients who have toggled between real-time and batch while loading into Snowflake as they compared the costs in terms of Snowflake compute and the benefits they were getting.

So in cases where real-time made sense, they kept the pipelines running in real-time; where it didn’t, they simply switched the pipeline to load in batches on a schedule.

Estuary is calling this right-time data: letting their end-users pick between real-time and batch. So the idea of batch vs streaming becomes less of a technology choice and more of a use-case-based choice (as in, does the business benefit from it or not).
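To show the kind of cost comparison described above, here is a back-of-the-envelope sketch of how load frequency drives warehouse compute. Every number in it is made up for illustration: the price per hour, the minutes of compute billed per load, and the schedules are assumptions, not Snowflake or Estuary pricing.

```python
def daily_compute_hours(loads_per_day, minutes_per_load):
    """Hours of warehouse compute per day, capped at 24 (always on)."""
    return min(24.0, loads_per_day * minutes_per_load / 60.0)


# All numbers below are hypothetical, for illustration only.
PRICE_PER_HOUR = 3.0    # made-up price per warehouse hour
MINUTES_PER_LOAD = 2.0  # made-up compute billed per load, including spin-up

# Hypothetical schedules, expressed as loads per day.
schedules = {"daily": 1, "hourly": 24, "every 5 minutes": 288}

daily_cost = {
    name: daily_compute_hours(loads, MINUTES_PER_LOAD) * PRICE_PER_HOUR
    for name, loads in schedules.items()
}

for name, cost in daily_cost.items():
    print(f"{name}: ~{cost:.2f} per day")
```

The exact figures don’t matter. The point is that the cost curve is something you can estimate before picking a schedule, and then weigh against whether anyone actually benefits from fresher data.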

Conclusion

The debate between real-time and batch, or streaming and batch, continues. The difference now is that you don’t always have to switch between solutions. Several modern tools let you pick between the two and focus on your data landing at the right time.

Take a moment and review your current workflows.

Ask yourself, is my data landing at the optimal time for its use?

Could we change how frequently it lands, and what would that do to our data warehouse costs?

Then you can figure out how to better approach your data pipelines.

As always, thanks for reading.
