What Is Apache Airflow

Apache Airflow is an extremely popular tool that data engineers rely on.

But why?

Why do data engineers like Airflow? Also, what does Apache Airflow even do?

In this article, we will answer questions like:

  • What is Airflow?
  • What is a DAG?
  • Why do people use Apache Airflow?
  • Why do we like Airflow?
  • What are the downsides to Airflow?
  • Who are Apache Airflow’s competitors?

What is Apache Airflow?

Let’s make this short; Apache Airflow is a workflow orchestration tool. To dig a little deeper, Airflow lets end-users set up sequences of tasks that are turned into what are called Directed Acyclic Graphs (DAGs). We’ll talk a little more about DAGs later. Around those DAGs, Airflow adds several other key components, such as its scheduler, web server, and metadata database, that help data engineers actually run and manage their workflows.

Even when working with complex DAGs, changes are easy to make thanks to a robust set of command-line utilities. An advanced user interface makes visualizing pipelines simple, allowing teams to monitor production, progress, and problems efficiently. Because workflows are defined as code, they are easier to maintain, version, test, and collaborate on. That makes Apache Airflow a fantastic platform for workflow orchestration, and managed offerings such as MWAA (Amazon Managed Workflows for Apache Airflow) only make it better.

What Is a DAG?


Before digging deeper into why data engineers like Airflow, it’s essential we understand a core concept: the directed acyclic graph (DAG). A DAG is a collection of tasks that run in a specific order, with dependencies on previous tasks.

For example, if we had three tasks named Foo, Bar, and FooBar, it might be the case that Foo runs first, and Bar and FooBar depend on Foo finishing.

This would create a basic graph: Foo at the top, with Bar and FooBar branching off it. As you can see, there’s a clear path. Now imagine this with tens or hundreds of tasks.
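To make that concrete, here is a minimal sketch of how those dependencies might be declared in an Airflow 2.x DAG file (the DAG id is made up, and EmptyOperator is just a stand-in for real work; older releases use DummyOperator instead):

```python
# Minimal illustrative DAG: Foo runs first; Bar and FooBar only run
# once Foo has finished successfully.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="foo_bar_foobar",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,       # triggered manually for this toy example
    catchup=False,
) as dag:
    foo = EmptyOperator(task_id="foo")
    bar = EmptyOperator(task_id="bar")
    foobar = EmptyOperator(task_id="foobar")

    # The bitshift syntax declares the edges of the graph.
    foo >> [bar, foobar]
```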

Large data organizations have massive DAGs with dependencies on dependencies. Having clear access to the DAG lets companies track where things are going wrong and keeps bad data from entering their ecosystems. If something fails, it will often force the downstream tasks to wait until their dependencies are complete.


What Do People Use Airflow For?

One more point before digging into why you should use Airflow: Airflow is billed as a workflow orchestration tool, which is to say a general automation tool. But when it comes to data engineers, we generally use it as a data pipeline/ETL/ELT solution. Yes, there are discussions about the unbundling of Airflow, but in my anecdotal experience it gets used for everything in a data pipeline: it runs SQL, calls APIs, and even kicks off machine learning models.

Airflow tends to do it all for some teams, and why not? It’ll keep your data stack far simpler and require your engineers to know less. Of course, there are valid reasons to integrate other tools.
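As a rough sketch of that kind of do-it-all pipeline (all of the task ids and callables below are made up), a single DAG can chain an API extract, a SQL-style transform, and a model-scoring step; in a real project the SQL step would more likely use a provider operator for your warehouse:

```python
# Illustrative ETL-style DAG. The callables are placeholders; a real
# pipeline would call an API client, a warehouse, or a trained model.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_api():
    ...  # e.g. pull records from a REST API and stage them somewhere


def run_sql_transform():
    ...  # e.g. execute a transformation query against the warehouse


def score_ml_model():
    ...  # e.g. run batch predictions with a trained model


with DAG(
    dag_id="daily_etl_example",       # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_api)
    transform = PythonOperator(task_id="transform", python_callable=run_sql_transform)
    score = PythonOperator(task_id="score", python_callable=score_ml_model)

    extract >> transform >> score
```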

All in all, that’s more than enough in terms of explaining what Apache Airflow is, so let’s dig into the why.

Why We Like Airflow

Why has Airflow gained such a stronghold in the world of data engineering? To quote a 2015 article written by the creator of Airflow:

As a result of using Airflow, the productivity and enthusiasm of people working with data has been multiplied at Airbnb. – Maxime Beauchemin

Airflow proved to be a solution that could boost productivity during a time when data engineering was constantly bogged down with one-off requests (wait, this hasn’t changed) and constant migrations.

Of course, that was Airbnb; they were always going to adopt their own solution. So why did so many other data engineers pick it up?

Easy To Start

A great thing about Airflow is that building your first toy DAG is very easy.

First, all you need to do is write a few parameterized operators.

After that, you can run Airflow standalone.

Suddenly you’re running Airflow, locally or maybe on an EC2 instance.

From there, you’re kind of done. Of course, we haven’t considered scaling or the fact that your logs will blow up your storage, but for the first few months, this will work all right.

Scheduling

Airflow provides an easy-to-understand scheduler. Using cron-based scheduling, a developer can easily set their DAGs to run daily, hourly, weekly, or just about anything else in between.

From there, Airflow will take care of running the jobs. There is no need to go into cron to make updates on when to run scripts. Instead, you save the schedule as part of your code, which means you never have to hunt for wherever the scheduling agent lives.
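As a quick illustration (the DAG id and cron expression are made up), the schedule sits right in the DAG definition, and both cron strings and presets like @daily or @hourly work:

```python
# The schedule is part of the DAG definition itself, so it gets
# versioned and reviewed along with the rest of the pipeline code.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="nightly_report",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",      # cron syntax: every day at 06:00 (UTC by default)
    # schedule_interval="@daily",       # presets also work
    catchup=False,
) as dag:
    ...                                 # tasks would be declared here
```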

The ability to schedule jobs so easily was a major plus for someone like me who had spent a lot of time in a previous job trying to figure out why the SQL Server Agent wouldn’t work on my instance of SQL Server due to configuration problems.

Well, now my scheduler was bundled into one solution.

Dependency Management


Creating dependency management functionality is one of the trickier requirements to implement in a custom data pipeline solution. In a previous life, data engineers might have handled this simply by setting different tasks to run with a manually chosen interval in between. Of course, this isn’t dependency management; if a previous task failed, the next task would still run.

Personally, this was a major reason I adopted Airflow. It allowed me to describe how tasks should run and to keep dependent tasks from running if a previous task failed.

This is captured in Airflow’s DAG paradigm, which creates a graph where you can easily outline which tasks depend on which. Add to that the ability to track said tasks in a UI and rerun them from the point of failure (rather than rerunning the entire pipeline), and it makes sense why Airflow gained so much traction so quickly.
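As a small illustration (task ids invented), downstream tasks only run when their upstream tasks succeed by default, and trigger rules let you override that for steps like cleanup or alerting:

```python
# Illustrative DAG: "load" will not run if "transform" fails (it is
# marked upstream_failed), while "notify" runs either way thanks to
# its trigger rule.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="dependency_demo",              # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")   # default trigger_rule="all_success"
    notify = EmptyOperator(
        task_id="notify",
        trigger_rule="all_done",           # run once upstream finishes, success or not
    )

    extract >> transform >> load >> notify
```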


Other Reasons I Heard

Of course, there are plenty of other reasons why developers adopt Airflow. Here were a few that were brought up.

  • SQL Templating
  • Strong OSS Community
  • A Good Balance Between Predefined Components And Flexibility To Write Your Own Code

Downsides To Airflow

Despite Airflow being popular, it has a lot of downsides, especially when you try to productionize it by yourself.

Scaling is Hard

One point most of the interviewees could agree on is that scaling Airflow is hard. For many companies using Airflow, the management of the actual infrastructure gets moved to DevOps or data infrastructure teams, because it inevitably ends up being someone’s job to make sure Airflow keeps running.

This is because, like Hadoop back in the day, Airflow requires a lot of extra services just to run successfully, especially at scale. If you really want to learn more about how you could scale Airflow at your company, check out Shopify’s architecture and its corresponding article below.

Shopify’s Airflow Architecture – Read More About It Here

The ease of creating basic DAGs in Airflow can lure users in. But if your team doesn’t plan for scaling the overall architecture in production as more and more jobs are created, you will eventually run into problems.

Passing Data Between Tasks Is Clunky

When I asked Sarah Krasnik what she felt were some limitations of Airflow, she responded with the following:

Passing information between tasks. I believe this has been significantly improved upon in 2.2+, although I haven’t used the particular improvements. The concept of XComs is just so clunky, buggy, and really hard to get right. — Sarah Krasnik

Like Sarah, I learned about Airflow’s struggle to pass data in between tasks, but through Airflow’s cousin, Dataswarm.

Dataswarm is the internal Facebook tool that inspired Airflow, and the two share many of the same limitations, one being that passing data between tasks is quite difficult. For example, whenever I needed to pass data between tasks in Dataswarm, I would have to write the data to a text file and then read it in the next task.

Very clunky.

Airflow does provide XComs to pass data around. However, even here it can feel a bit clunky, and as Sarah pointed out, it’s “really hard to get right.”

To some degree, this is by design, but I do run into the need to pass data occasionally, which is always frustrating.
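For illustration, here is roughly what passing a small value between tasks with XComs looks like (the DAG id, task ids, and values are made up); the newer TaskFlow API in Airflow 2.x hides some of this plumbing by turning task return values into XComs automatically.

```python
# Illustrative XCom usage: the first task pushes a small value, the
# second pulls it. XComs live in the metadata database, so they are
# meant for small bits of data, not whole datasets.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(ti):
    # push a small value for downstream tasks to read
    ti.xcom_push(key="row_count", value=42)


def report(ti):
    row_count = ti.xcom_pull(task_ids="extract", key="row_count")
    print(f"extracted {row_count} rows")


with DAG(
    dag_id="xcom_demo",                 # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    report_task = PythonOperator(task_id="report", python_callable=report)

    extract_task >> report_task
```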

All in all, Airflow is far from perfect, and many of us have merely learned to deal with its limitations.

Who Are Apache Airflow’s Competitors?

While I was interviewing all the various data engineers, many of them brought up Prefect, Dagster, and Mage. All of these options have aimed to improve where Airflow has fallen short. As Sarah put it, Airflow has “Airflow-isms.” Dagster and Prefect improve upon those “-isms”. Of course, I haven’t even included some of the low-code solutions like Azure Data Factory or SSIS.

So if you’re looking for alternatives, these three could be great options.

What Is Prefect

Prefect is a workflow control platform built for the modern data stack. It monitors, coordinates, and orchestrates dataflows between and across your applications. With Prefect, users can also build pipelines, deploy them anywhere, and configure them remotely.

Prefect was launched in 2018, but the new Prefect 2 (2022) focuses on improving functionality in scheduling, retries, logging, caching, notifications, and observability. The founders understood that implementing these features into your dataflows is tedious and time-consuming. That is why the Prefect team has wrapped all this functionality into Prefect 2, allowing you to spend your valuable time writing domain-specific code and building your business.

By creating new features tailored to the needs of the modern data stack, Prefect has fundamentally reworked the way we build and orchestrate dataflows. While newer companies like Dagster work to build on the data orchestration capabilities of Airflow, Prefect tackles similar issues in a different way.
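For a rough sense of the developer experience (the flow and task names below are invented), Prefect 2 lets you wrap ordinary Python functions with decorators and run the resulting flow like a normal script:

```python
# Minimal illustrative Prefect 2 flow: tasks are plain Python functions,
# and the flow is just another function that calls them.
from prefect import flow, task


@task
def extract():
    return [1, 2, 3]                   # placeholder data


@task
def load(rows):
    print(f"loaded {len(rows)} rows")  # placeholder load step


@flow
def etl_flow():
    rows = extract()
    load(rows)


if __name__ == "__main__":
    etl_flow()                         # runs locally like any other script
```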

What Is Dagster

Dagster is a cloud-native orchestrator designed to help you develop and manage the whole lifecycle of your data assets, including tables, machine learning (ML) models, reports, and data sets. It delivers high-quality, modern functionality to local development, unit tests, integration tests, staging environments, and production.

Dagster offers a declarative programming model with integrated lineage and observability. You declare the functions you want to run and the data assets those functions produce or update (Dagster Docs, 2022). Dagster then helps you run those functions at the right time and keep your assets up to date.
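As a hedged sketch of that model (the asset names are invented), a Dagster asset is a function whose return value is the data asset, and dependencies are inferred from the function's parameters:

```python
# Minimal illustrative Dagster assets: "summary_table" depends on
# "raw_events" because it takes it as a parameter.
from dagster import asset


@asset
def raw_events():
    return [{"user": "a"}, {"user": "b"}]   # placeholder data


@asset
def summary_table(raw_events):
    # Dagster wires this asset downstream of raw_events automatically
    return {"event_count": len(raw_events)}
```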

What Is Mage.ai

Mage was initially a tool more focused on helping end-users integrate AI into their workflows but now is more focused on the workflow itself. It’s an open-source data pipeline tool that can help end-users run transforms using a myriad of other tools.

Mage lets you integrate several other third-party solutions, including tools like dbt.

Airflow 2.X

Airflow remains a popular framework for many companies trying to scale out their data pipeline infrastructure. Because it is easy to start and has a strong community, I foresee it will continue to play a major role in the data engineering space. With managed solutions and improved best practices, Airflow will be hard to dethrone.

Is Airflow right for you? Well, as always, it will depend on your team size, skills, and budget.

What has your experience been? Have you used Airflow, or do you prefer another solution?

Now You Should Know What Airflow Is

Hopefully at this point it’s clear what Airflow is designed to do. It’s a popular solution that many data engineers rely on for building their data pipelines. It’s been used to run SQL, machine learning models, and more. Yes, admittedly it’s far from perfect; Airflow has a lot of challenges.

There are also many new competitors that are looking to both grow the pie and take market share. We’ll see where things go regarding which data tools end-users pick moving forward.
