How To Set Up Your Data Strategy For 2022 – Part 1

January 16, 2022

Photo by Anika Huizinga on Unsplash

Billions of dollars have been invested in companies that fall under the umbrella of the “Modern Data Stack”. Fivetran has raised nearly a billion dollars, dbt Labs has raised 150 million (and is looking to raise more), Starburst has raised 100 million…and I could really go on and on about all the companies being funded.

So that means every company has a fully decked-out data stack, RIGHT?

Yet, most companies don’t and can’t start with a completely decked-out data stack.

Instead, most companies build their data stack in stages, which is probably the best way to do it.

You don’t suddenly have a flawless source of truth with perfect serviceable data that can all be tracked through your data observability tools.

It takes time.

Teams need to develop processes, scalability, trust and the ability to actually execute on data.

There are three key areas your team will need to consider as you start developing your data infrastructure: how you will ingest the data, how you will store it, and how you will visualize it. And, of course, who is on your team.

In this article, we will discuss why each of these areas is important as you are developing your data strategy.

Data Ingestion

Data ingestion refers to the act of taking data from its sources and loading it into your data warehouse. Sometimes you might also hear the terms data pipeline, ETL, ELT, or real-time data pipeline. All of these play a role in data ingestion. There are multiple options when it comes to data ingestion.

These range from 100% custom code to code frameworks to low-code solutions.

We believe that 100% custom code should always be a last resort. Unless your team has the goal of completely recreating Airflow or Dagster, there is no reason to start from scratch. Frameworks like Airflow manage some of the many components you will need to develop for your data ingestion.

Airflow acts as an orchestrator, and you can then utilize its many Operators, such as the BigQuery operators, to extract data from sources and insert it into different data storage systems.
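To make this concrete, here is a minimal sketch of what a daily Airflow load might look like using the Google provider package. The bucket, dataset, and table names are hypothetical, and this assumes apache-airflow-providers-google is installed:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

# A minimal daily pipeline: load a CSV drop from Cloud Storage into
# BigQuery. Bucket, dataset, and table names are made up for illustration.
with DAG(
    dag_id="load_orders_to_bigquery",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = GCSToBigQueryOperator(
        task_id="load_orders",
        bucket="my-raw-data-bucket",
        source_objects=["orders/{{ ds }}.csv"],  # one file per day
        destination_project_dataset_table="analytics.raw_orders",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
    )
```

The point is that Airflow handles the scheduling, retries, and logging you would otherwise have to build yourself.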

But these frameworks do require some understanding of programming and DevOps. So if you have a very small team that doesn’t have time to program, you might need to pick another solution. That is where tools like Fivetran, Matillion, and Stitch come into play.

These tools are low-code options that your team can use to connect to data sources and pipe the data into your data warehouse or data lake, often without writing any code.

In my experience, this does not mean you shouldn’t have an engineer build these data pipelines. These pipelines are still best built by data engineers who can apply the best practices they developed while learning to code.

The trade-off here is the speed at which the pipelines can be developed. Which tool is right for your team depends on your business goals, expectations, and team skill sets.

Data Storage

Data storage refers to concepts like data warehouses, data lakes, and data lakehouses.

All of these act as an analytical data storage layer. Companies use a wide variety of data storage systems to meet the needs of analysts, data scientists, and general users. Companies might pick Snowflake, BigQuery, Postgres, Redshift, SQL Server, or a whole host of other databases (and that’s just for the data warehouse).

The purpose of this layer is to create a source of truth where analysts, data scientists and end-users can access data from multiple data sources. For example, your data warehouse might have data from your custom application, Workday, Facebook ads, Asana, Braze and so on. Having a centralized reporting layer allows users to avoid having to go to 4-5 different sources to create a basic report.

Now, they can write one query and merge all of these data sources into one data set.
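For example, here is a rough sketch of what that single query might look like using the BigQuery Python client. The table and column names are hypothetical stand-ins for data loaded from Workday and Facebook ads:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: one loaded from Workday, one from Facebook ads,
# both landed in the same warehouse by your ingestion pipelines.
sql = """
    SELECT d.department,
           SUM(f.spend) AS ad_spend
    FROM analytics.workday_departments AS d
    JOIN analytics.facebook_ad_spend AS f
        ON f.department_id = d.department_id
    GROUP BY d.department
"""

for row in client.query(sql).result():
    print(row.department, row.ad_spend)
```

One query, one result set, instead of stitching together exports from four or five different tools.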

In addition, another benefit of having a central location is that you often create data standards, governance, quality checks, and likely a light layer of data cleansing that makes the data easier for analysts to work with.
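As a sketch of what a light quality check could look like (the table and column names here are made up for illustration), you might run a few assertions after each load:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical checks: fail loudly if the load produced no rows
# or introduced null keys. analytics.raw_orders is a made-up table.
row_count_sql = "SELECT COUNT(*) AS n FROM analytics.raw_orders"
null_keys_sql = """
    SELECT COUNT(*) AS n
    FROM analytics.raw_orders
    WHERE order_id IS NULL
"""

row_count = list(client.query(row_count_sql).result())[0].n
null_keys = list(client.query(null_keys_sql).result())[0].n

assert row_count > 0, "raw_orders is empty"
assert null_keys == 0, f"{null_keys} rows are missing order_id"
```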

This is the overall goal: creating a data layer that anyone can access (based on their security roles) and can rely on, without the need for lots of heavy manual processes such as pulling data from data sources into CSVs and then combining it all together.

In the end, once you have the data, you need to put it to work through reporting and data visualizations.

Data Visualization

Finally, data visualization and reporting are the last of the three key pillars in your baseline data strategy. This is because just ingesting and storing your data is not sufficient. Businesses need a purpose for this data. This is often done through data reporting and visualizations.

These might be dashboards, KPIs, Excel reports, or just some form of the final number.

When applied correctly, they can drive business initiatives, help businesses make better decisions, and provide confidence to executives who are looking to understand exactly what is going on in their business.

Of course, as with all of these other sections, there are many choices for data visualization tools. You could pick the classic Tableau or decide on the cloud-based solution Looker. All of these tools have different benefits.

I will mention that Looker in particular does have a steeper learning curve, but when utilized well it can help your analytical teams work from a more consistent layer of data. Overall, finding the right data visualization tool starts with your team figuring out your data goals.

But what about the rest?

All The Rest

There are plenty of other tools, best practices, and solutions that your team will need to implement at some point. For example, a few components I didn’t discuss were data observability, lineage, and quality.

Why?

You won’t have data quality issues or a need for data observability until your data warehouse and data pipelines are set up. In addition, more than likely, your business owners won’t want to pay for these solutions until they are necessary. So although in many cases including some sort of data quality provider like BigEye might be a good idea, you probably won’t be able to convince your business partners of the need for these tools until problems arise.

As for other popular terms like reverse ETL, metrics stores, data catalogs, and machine learning: these concepts can be held off.

First, you want your teams to deliver on the first three layers of data before further increasing the complexity of your team’s processes. Every new tool, infrastructure layer, and responsibility means you will need more employees and governance to manage all of these new tools.

It can be tempting to start running, in terms of your data infrastructure strategy, and to try to utilize machine learning tools before even partially developing a solid base layer of your data.

But, long-term this is not a good data strategy.

When you look at the larger tech companies like Google and Amazon, you will notice that both developed multiple tools to help manage their data. For example, Amazon created Redshift and Google created BigQuery. This is because they realized that in order to continue to manage their data, they would need to improve their data infrastructure.

So you should too.

Setting Up Your Data Infrastructure For 2022

Picking the right data infrastructure components is crucial to ensuring your team can scale quickly and deliver.

Over-promising and committing to every new data fad is a quick way for your team to never truly provide value.

Our team sees this problem occur a lot. Companies are constantly jumping on the next trend instead of delivering on the previous project.

This all starts with first landing data pipelines, creating a core data layer, and then finally, creating tangible insights, KPIs, and metrics that the business can use. Everything else is a distraction.

In 2022, our hope is that your team focuses on simplifying your data stack.
