3 Engineers’ Perspectives on the Modern Data Stack

April 5, 2021

There are plenty of clichés about data, whether comparing it to oil or declaring that every company should be data-driven.

The one truth I have learned from consulting over the past few years is that companies of all sizes are trying to figure out how to access and utilize their data.

Traditionally, getting data into a form you can analyze and building data tooling on top of it required large teams and expensive hardware.

However, thanks to a combination of cloud service providers and improved software, companies of all sizes are becoming more data-driven.

In this article, we wanted to interview several people who work on or with modern data stack tools to understand the roles these tools play in the modern world and how they are helping business owners go from data to product fast.

In particular, we asked them three questions:

  1. What is your experience with the traditional approach to data warehousing, ETLs, and/or data visualizations?
  2. What pain points do you think the modern data stack most improves?
  3. What is your favorite use case where a small/medium business was able to use your tool to improve their processes/income?

Let’s see what they had to say. 

Airbyte

Airbyte is a new open source (MIT) EL+T platform that started in July 2020. It has a fast-growing community and distinguishes itself by making several significant choices:

  • Airbyte’s connectors are usable out of the box through a UI and an API, with monitoring, scheduling, and orchestration. Their ambition is to support 50+ connectors by EOY 2020.
  • These connectors run as Docker containers, so they can be built in the language of your choice (see the sketch after this list).
  • Airbyte components are also modular and you can decide to use subsets of the features to better fit in your data infrastructure (e.g. orchestration with Airflow or K8s or Airbyte’s…).
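To make the container model concrete, here is a minimal sketch of the pattern: a source connector is just a program, packaged as a Docker image, that reads a config and writes one JSON message per record to stdout, which is what lets connectors be written in any language. The message shape below is a simplified stand-in for illustration, not Airbyte’s actual protocol.

```python
# toy_source.py: a simplified illustration of the connector pattern.
# Read a JSON config, emit one JSON message per record on stdout.
# (Airbyte's real protocol defines richer message types; this is a sketch.)
import json
import sys
from datetime import datetime, timezone

def read(config: dict):
    # A real connector would call an API or query a database here;
    # two hard-coded rows keep the script runnable on its own.
    rows = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
    for row in rows:
        yield {
            "type": "RECORD",
            "stream": config.get("stream", "users"),
            "data": row,
            "emitted_at": datetime.now(timezone.utc).isoformat(),
        }

if __name__ == "__main__":
    config = json.loads(sys.argv[1]) if len(sys.argv) > 1 else {}
    for message in read(config):
        print(json.dumps(message))  # downstream components consume stdout
```

Because the contract is just JSON over stdout, the same pattern works the same way in Go, Java, or any other language you can put in a container.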

Airbyte’s Co-Founders: Michel Tricot/John Lafleur

We asked our questions to the two co-founders of Airbyte, Michel Tricot and John Lafleur. Here are their answers.

1. What is your experience with the traditional approach to data warehousing, ETLs, and/or data visualizations?

Before starting Airbyte, I (Michel) was director of engineering and head of integration at LiveRamp. We were processing petabytes of data every day using Hadoop/Cascading and Spark.

Back in the day, all the analytics were managed by hardcore data engineers. Needless to say, it was a huge burden for the team, and it prevented us from doing deep ad hoc analysis. The problem when you run an analysis is that you start with one question that then branches out into ten other questions. If there is too much friction in answering those ten questions, you settle for the first answer.

In 2015, we introduced a more modern analytics stack with Redshift, our internal version of DBT, and Looker. It was a game-changer for the whole team and put responsibility for analytics in the hands of PMs, data scientists, and data analysts. We also had to implement our own version of ELT, which was a trigger for us to start Airbyte.

2. What pain points do you think the modern data stack most improves?

With ETL, the main pain points were:

  • Inflexibility: ETL forces data analysts to know beforehand every way they are going to use the data. Any change they make can be costly, as it can affect data consumers downstream of the initial extraction.
  • Lack of visibility: Every transformation performed on the data obscures some of the underlying information. Analysts won’t see all the data in the warehouse — only what was kept during the transformation phase.
  • Lack of autonomy for analysts: Building an ETL-based data pipeline is often beyond the technical capabilities of analysts.

The modern data stack improves those points in the following way:

  • Agile decision-making for analysts: When analysts can load data before transforming it, they don’t have to decide on an exact schema before knowing which insights they want to generate. Instead, the underlying source data is replicated directly to a data warehouse, forming a “single source of truth.”
  • Autonomy for analysts/scientists: With modern ELT tools, analysts and scientists don’t need data engineering help to replicate data wherever they need it. They can even use DBT to normalize the data the way they need before plugging in a BI tool to get the insights (see the sketch after this list).
  • Data literacy across the whole company: Through more autonomy, you can get more people producing dashboards across all departments of the company.
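To make the ELT flow concrete, here is a minimal, self-contained Python sketch. SQLite stands in for the warehouse, and the table and column names are made up for the example: the raw rows are loaded as-is first, and the transformation happens afterwards as SQL inside the warehouse, which is the step tools like DBT manage as versioned SQL.

```python
# Minimal ELT sketch: load raw data first, transform later in the warehouse.
# SQLite stands in for a real warehouse; all names are illustrative only.
import sqlite3

warehouse = sqlite3.connect(":memory:")

# 1) EL: replicate the raw source rows unchanged; no upfront decisions
#    beyond mirroring the source schema.
warehouse.execute(
    "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 880, "refunded"), (3, 4300, "paid")],
)

# 2) T: transform inside the warehouse whenever a new question comes up.
warehouse.execute("""
    CREATE VIEW paid_orders AS
    SELECT id, amount_cents / 100.0 AS amount_dollars
    FROM raw_orders
    WHERE status = 'paid'
""")

for row in warehouse.execute("SELECT * FROM paid_orders"):
    print(row)  # (1, 12.5) and (3, 43.0)
```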

3. What is your favorite use case where a small/medium business was able to use your tool to improve their processes/income?

We made our alpha public at the end of September 2020, and within four months, more than 350 companies had used Airbyte to replicate data. So we’ve seen a lot of use cases, but they usually fall into two categories: analytical or operational.

  1. Analytical use case: A company needs to consolidate its data from multiple sources (SaaS, APIs, or databases) into a single destination to perform analytics. One common stack we see here is Airbyte + DBT + Metabase. These companies might have been using Fivetran or StitchData beforehand while also building and maintaining connectors in-house on the side. Consolidating everything in Airbyte lets them cut both the tooling costs in their budget and the maintenance burden on the team.
  2. Operational use case: A company wants to offer more integrations to its customers and uses Airbyte’s API to provide the connectors while controlling the frontend on its own platform (as in the sketch below). This lets the company offer far more connectors to its customers for better lead conversion and contract values, and thus more revenue.
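As a rough illustration of that operational pattern, here is a hedged sketch of kicking off a sync for an existing connection through Airbyte’s API. The port, endpoint path, and connection ID below are assumptions made for illustration; consult Airbyte’s API documentation for the exact contract.

```python
# Hedged sketch: trigger a sync for an existing Airbyte connection.
# The port, endpoint path, and connection ID are illustrative assumptions;
# check Airbyte's API docs for the exact contract before relying on this.
import requests

AIRBYTE_API = "http://localhost:8001/api/v1"            # assumed local deployment
CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical ID

response = requests.post(
    f"{AIRBYTE_API}/connections/sync",
    json={"connectionId": CONNECTION_ID},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # metadata for the sync job that was kicked off
```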

Panoply

We also had the opportunity to get a response from Panoply’s CTO and co-founder Roi Avinoam, who provided an in-depth look at his background as well as how Panoply is helping companies take advantage of their data.

For those who haven’t heard of Panoply, it is the world’s first automated data warehouse that uses machine learning and natural language processing (NLP) to automatically discover data models and make sense of any data — structured, semi-structured, and even unstructured. Panoply can store all three data types in the data warehouse and facilitate on-the-fly analysis. 

Panoply CTO/Co-Founder Roi Avinoam

Here are Roi’s views on the questions we asked.

1. What is your experience with the traditional approach to data warehousing, ETLs, and/or data visualizations?

I’ve been working with ETL and data warehousing in various roles for 10-15 years now. My first exposure was at a startup where we were striving to be a very data-driven, analytical team. But doing that took an enormous amount of work to set up.

We were a small team and our core job was to build apps, so all our expertise was built around those products: how to build the best applications, the best customer experience. And then in parallel to that, we were trying to build these internal data tools, and it was such a time sink.

The gap for me was always that the tools to solve these problems existed, but only larger organizations could adopt them, because they are expensive and require time-consuming engineering, technical know-how, and maintenance. Startups and smaller organizations aren’t less data-driven than larger companies; they can’t afford to be. But they don’t have the headcount or the technical expertise to do both software engineering and data engineering. That’s the problem we set out to solve with Panoply.

2. What pain points do you think your product addresses?

The one thing that I think we solve most of all is the one-stop shop, the single place where you can access all your data, see all your data, configure the data, and not have to learn the caveats of different tools. We’re a data-eng team in a box with a UI. And as simplistic as that is, I think by far that’s one of the biggest pain points of working with data.

The reality is that for a lot of companies that aren’t experts in data, they’re building apps or other products. They shouldn’t have to know what the latest features of Snowflake are. They shouldn’t have to know that Hadoop is no longer state of the art and now we’ve moved on to data lakes. They should be free to focus on critical tasks — not mundane data chores.

The other pain point we address well is configuring data sources. Compared to a lot of the ETL tools out there, we require the least amount of setup work. We are opinionated, though. You can’t configure everything and be as customized as some people might want. With that said, for Agile teams that want to get things done, we’re a great match.

3. What is your favorite use case where a small/medium business was able to use your tool to improve their processes/income?

One of my favorites is a customer who told us that before Panoply, a data request would take two months roundtrip. Whether adding a data source, changing the structure of a table, or something else, you had to wait two months. And if you realized you were missing something or needed a change? You had to wait another two months.

Part of the problem is that they were capturing a ton of data, but they were sending it to a Hadoop cluster they called “the data graveyard.” They had the data, sure, but it was a mess and not analysis-ready. No SMB or startup or company that just wants to move fast can be remotely successful like that. With Panoply, they were able to drastically cut their time to insight because we handled their data engineering needs end to end.

Another customer found Panoply while they were working to build out their data team. They were already a couple of months into the search and struggling to find people. In the meantime, they weren’t getting any data, and they knew that even after new hires started, it would take another couple of months to get everything running.

They brought us on and later they told us, “Four months later, we have data, but we didn’t have to hire a single person.” Now, I’d like for people to get hired, but that kind of measurable efficiency is something I’m proud of. If we can help teams to be successful — and if we can free up data engineers and analytics engineers to do more interesting and important work — that’s amazing.

Seattle Data Guy — Ben Rogojan, Principal Technology Consultant

Finally, I wanted to provide a different angle.

1. What is your experience with the traditional approach to data warehousing, ETLs, and/or data visualizations?

Over the past five years, I have built a consulting firm that helps design, develop, and deploy end-to-end data solutions for companies in industries like insurance, finance, healthcare, and transportation, as well as for SaaS providers.

This has given me broad experience with the various data sources these companies rely on, as well as with what different types of companies find important.

Through each engagement, I have worked with many traditional concepts like ETLs and data warehousing, as well as evaluated which products from more modern methods and practices could help.

2. What pain points do you think your product addresses?

Unlike the other interviewees in this article, I don’t have a particular product that I have developed to help companies work with data.

Instead, I utilize some tools similar to many of the products referenced above to help companies of all sizes take advantage of their data.

I focus on developing data systems that match companies’ use cases and technical limitations.

Companies vary in their data sources as well as their general technical infrastructure. Some companies have nine, ten, or eleven data sources, whereas others have only one or two.

And there are other variables to consider.

However, in the end, I help take all of these limitations and use cases, and develop data infrastructure that matches a client’s needs.

So the pain point I help address is the actual usage of data.

I help road-map and produce everything from dashboards to machine learning models that improve companies’ bottom lines.

3. What is your favorite use case where a small/medium business was able to use your tool to improve their processes/income?

I have had the chance to help my clients use data in several ways to reduce costs, improve processes, and increase profits.

One simple way I was able to help improve a customer’s bottom line was by detecting service cannibalization.

As businesses grow, they often want to create new services and products. However, these new offerings often overlap with services and products the business is already selling.

Sometimes this is OK because you would rather cannibalize your product vs. allowing a competitor to come in with their new product. Think iPhone vs. iPod. Yes, Apple destroyed the sales of the iPod. But if they hadn’t, someone else would have.

On the other hand, sometimes you are merely providing a duplicate service or product.

Think of a coffee chain opening similar stores too close to one another. Something like this happened in Washington with Krispy Kreme years ago, when they had to pull back after overexpanding and cannibalizing their own business.

One of our clients started doing something similar. It’s not uncommon: your business is doing well, so you figure it’s time to expand. But this client didn’t realize they were cannibalizing their services and getting minimal ROI.

Our team quickly found this out after running an analysis of their services. We saw that the new service added only about 3% of extra income while costing the same as each of the other services, each of which was responsible for about 13%-15% of the income.
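As a rough illustration of that kind of analysis, here is a minimal sketch with hypothetical numbers chosen to mirror those percentages. The service names, revenues, and costs are made up, and the real analysis involved far more than a revenue-share table.

```python
# Minimal revenue-contribution check with made-up numbers chosen to
# mirror the percentages above; all names and figures are hypothetical.

# Seven established services plus the new one; revenue/cost in dollars.
services = {
    f"service_{i}": {"revenue": 140_000, "cost": 50_000} for i in range(1, 8)
}
services["new_service"] = {"revenue": 30_000, "cost": 50_000}

total_revenue = sum(s["revenue"] for s in services.values())

for name, s in services.items():
    share = s["revenue"] / total_revenue          # share of total income
    roi = (s["revenue"] - s["cost"]) / s["cost"]  # return on its own cost
    print(f"{name:12s} share={share:5.1%}  roi={roi:+7.1%}")

# A service that costs as much as the others but contributes a far smaller
# share of income (~3% here vs. ~14%) is a candidate for cannibalizing
# existing offerings rather than adding new revenue.
```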

In the end, the client just needed to steer their customers toward the services they already provided and phase out the duplicate service.

The Modern Data Stack

To me, the modern data stack means a lot more than just using the newest startup’s data warehouse or data pipeline tool.

It means taking a combination of the best practices we have developed over the past few decades and implementing them with the right tools, to build a maintainable, flexible data infrastructure that is easy to work with and gives a wide variety of users access to data.

Data continues to grow in every direction. Companies are adopting more and more applications, each generating data that can provide insight into their daily transactions.

But we need to develop the right data infrastructure to match that.

Hopefully, this article helped you see a few ways some teams are trying to do just that.

Thanks for reading!