How To Write Better SQL - Simplifying Complex SQL

How To Write Better SQL – Simplifying Complex SQL

research@theseattledataguy.com May 13, 2025 data engineering 0

Maybe you’re luckier than me.

Maybe you’ve never opened a .sql file or an Airflow DAG only to be greeted by a 5,000+ line query…a true monster of a script that leaves you wondering where to begin.

I’ve seen plenty of these, and every time, I ask myself: Why in the world do these exist? And, more importantly, how can teams avoid them?

With tools like ChatGPT and Cursor making it easier than ever to generate SQL, I have a feeling we’ll see even more of these sprawling queries in the wild.

So, in this article, I’ll explore why massive SQL queries happen, why they’re a problem, and how you can break them down before they become unmanageable.

Why Should You Care?

Before we dive into why 5,000-line queries exist and how to avoid them, let’s talk about why they’re a problem in the first place.

Maintainability: Large queries are a nightmare to read, understand, and update. When a single SQL script stretches across thousands of lines, even the smallest tweak can have unintended consequences, turning debugging and refactoring into a frustrating game of whack-a-mole.
Data Quality: Ensuring data accuracy is already a challenge, but long analytical queries make it even harder. If bad data sneaks in—and if your query isn’t broken up—it might not fail until it’s well into the final 5,000-line query. Maybe you get lucky and catch it before production. If not? Good luck!Where did it go wrong? The first CTE? The second subquery? The fifteenth case statement? If you’ve dealt with SQL error messages before, you know they don’t always tell you where to start looking. And it’s even worse when the query technically runs but spits out bad data.
Evolution Over Time: SQL queries rarely start as behemoths. They grow, new fields, conditions, joins, and business requirements pile on over time. Without proper modularization, a once-simple query can quickly balloon into thousands of lines, becoming an unmanageable, tangled mess.

Why Do 5,000-Line Queries Exist And How Can You Avoid It?

No, really…why do 5,000-line SQL queries exist?

We all know they’re a bad idea. Yet they keep showing up. So, I took a step back to reflect on why these monsters exist in the first place, and, more importantly, how to limit them.

Logic that only exists in the application: One common culprit? Logic that lives exclusively in the application rather than the database itself. For example, at Facebook, someone might use an enum instead of a table to categorize an. event. Since this logic isn’t stored in the database, you’re left with two options: manually build a table or add a case statement.
Messy Source Data: If your source data is messy, your queries are going to get a lot more complicated, because you’d have to clean, filter, and restructure the data just to make it usable. I recall one query where I had to add several lines just because the end-user didn’t properly have some form of “is_test” field. Instead, the only way to remove test data depended on the date of the event and several other factors since the engineering team switched how they managed handling test data.
Complex Joins: Joins are supposed to be simple. Table_a.row = table_b.row. That’s it. But then someone throws in a few case statements, a dozen conditions, and some questionable logic, and suddenly, that straightforward join is a full-blown puzzle. At this point, we’re all scratching our heads wondering why this query exists, but here we are.
Analyst and SME-driven logic: To be clear, I am not saying analyst and SME logic is bad. Analysts and subject matter experts (SMEs) often have to implement tons of custom logic into queries, whether it’s tracking a recruiting funnel or building logic for dental fraud analytics. Maybe their team has specific categories they want to see or needs to develop multiple populations. All of which will take more than a single CTE. Suddenly, your queries ballooned to thousands of lines.

Time: As mentioned, queries don’t start massive. But over the years, as teams request new fields, more logic, and additional conditions, that once-reasonable query grows from just a thousand to several thousand lines. That’s how SQL queries grow their own tree rings.

These are just some of the reasons I’ve seen. Have you encountered (or, gasp, written) one of these monstrosities? I’d love to hear your experiences!

How to Deal With 5,000-Line Queries

1. Break Up Your Queries Via Staging Tables

Massive queries often contain multiple distinct tables built within a single query, whether through CTEs or subqueries. If that’s the case, consider breaking them up.

Of course, this is somewhat dependent on your developer workflows and data systems. If creating and implementing distinct queries is cumbersome, developers may resist breaking them down.

2. Reduce Logic Duplication

One pattern I’ve seen in companies with large queries is redundant logic appearing in multiple places. Suppose you find yourself repeating the same logic using the same CASE statement. over and over again, then it probably belongs in a dedicated table or reusable query.

Pick a single location for that logic to exist, whether it’s an existing query or a new table, and centralize that logic.

3. Build Tests

A big reason queries grow out of control? No one writes tests for them(perhaps this should be a future article).

If you had to test every query, you’d quickly start writing shorter, more modular SQL. Trying to validate a 5,000-line query with endless logic and thousands of lines becomes difficult quickly, so by necessity, you’ll break it down into smaller, testable pieces.

4. Use Tools That Make It Easy to Avoid Large Queries

Long queries can often stem from clunky development environments.

For example, a traditional approach to SQL development involves a master stored procedure or script that calls multiple stored procedures. But this setup makes it:

Hard to manage complex dependencies
Difficult to version control
Generally frustrating to work with and deploy new scripts

So, instead of creating new, modular queries, developers keep stacking logic into the same massive script, leading to SQL bloat. Investing in better tooling can help prevent this.

5. Code Reviews

Sometimes, all it takes is a nudge from a teammate.

Code and design reviews are great ways to catch SQL queries before they spiral out of control. For larger data projects, start with an in-person design review and follow up with async code reviews via pull requests.

A simple comment like, “Hey, this query is getting pretty long, is there any way we can break it up?” can go a long way.

Final Thoughts

It’s easy for a query to get out of hand. One more CTE, a few extra CASE statements—and suddenly, you’re stuck with a 5,000-line script.
The real problem isn’t writing it…it’s maintaining it.

It’s hard to test.

Hard to read.

Hard to update.

Simply put, it’s bad practice.

Are there cases where longer queries make sense? Of course. But before you commit to one, take a step back and ask yourself:

Does this actually need to be this long, or should I break it up?

With that, as always, thanks for reading!

Also! Don’t forget to check the articles below.

6 Real-World ETL Use Cases with Estuary Flow

What Is Snowflake – Breaking Down What Snowflake Is, How Snowflake Credits Work And More

How To Modernize Your Data Strategy And Infrastructure For 2025

NetSuite to Snowflake Integration: Ultimate Guide to 2 Effective Methods

Big Data data engineering data pipelines sql

How To Write Better SQL – Simplifying Complex SQL