3 Steps to Ensure Success as a New-Hire Data Engineer at Any Company

April 21, 2021

Photo by ThisisEngineering RAEng on Unsplash

By Iris Huang

I’m four years into a career that has led me to data engineering, and I’m very glad to see the wave rising. Influencers (from YouTube to LinkedIn Top Voices) and companies have started to recognize the importance of onboarding a data engineer before a data scientist if they are serious about scaling up analytics and making data work for them instead of the other way around (*insert budget for storing mysterious data and having to care for it between infrastructure migrations).

As it stands right now, the dust has mostly settled on the go-to data engineering tech stack: Hadoop/Cloud for data storage (and warehousing), Spark/Databricks for data processing, Airflow for data pipeline orchestration (although this can scale up to Kubernetes if your pipeline is encapsulated in microservices that have to play along with other microservices in production), and, last but not least, Python/SQL/Scala for scripting the pipelines and data models themselves.

But hey, who knows, with dbt on the horizon, the paradigm shift to ELT becoming more of a reality every day, and the emerging role of the analytics engineer, maybe all of this will change again in another 5 years?

What do you do then? What’s the point of aspiring to be a data engineer?

 

Image courtesy of Sebastian Ramirez

Throughout those four years, I have seen an over-emphasis on technical requirements combined with unrealistic expectations for years of experience.

While I understand the simplicity of having a checklist of skills to go over when looking for a qualified candidate, such a checklist should be taken more light-heartedly, because even the hiring manager might not know what tool senior management would like to “try out” next quarter, or whether the existing data infrastructure is deteriorating to the point that a migration to a new tech stack is due soon. (On that note, how does anyone accumulate 5+ years with any specific technology?)

In this article, I would like all of us to take a break from the intense technical requirements and the infinite loop of “how-am-I-going-to-learn-all-this”. I enjoy approaching the need for a data engineer from a business perspective rather than diving headfirst into debating which programming language is best for analytics or intensely coding away at a highly performant pipeline that transports data nobody needs. So I’m going to show you three key steps to focus your effort on as a data engineer when you join a company of any size.

Note: “system” mentioned below can refer to a microservice that may or may not be a data pipeline, and yes, spreadsheets, though not technically correct, also count.

It’s all connected — through business keys!

In the world of data modeling, business keys go by a more formal name: the “primary key”. A primary key can be a single field sufficient to identify a unique record in a table, a composite key, which is simply multiple important fields combined to identify a unique record, or a surrogate key, which is a meaningless numeric or hash field generated from a unique combination of those important fields.

You probably guessed it — these important fields are your business keys.

Business keys can be an incrementing integer assigned by the upstream system (e.g., an application or spreadsheet that tracks orders), an order date, a customer ID that follows some sort of convention (e.g., you would be able to tell that CAN001 is a customer from Canada), or a hashed log-in session ID.
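
To make those flavors of keys concrete, here is a minimal sketch of an orders table. The table and column names (orders, order_sk, order_number, customer_id) are made up for illustration, and the identity-column syntax varies by database.

-- Hypothetical orders table, for illustration only.
CREATE TABLE orders (
    order_sk      BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key: meaningless, system-generated
    order_number  INT          NOT NULL,                -- business key assigned by the upstream order system
    customer_id   VARCHAR(10)  NOT NULL,                -- business key with a convention, e.g. 'CAN001'
    order_date    DATE         NOT NULL,
    order_total   DECIMAL(10, 2),
    PRIMARY KEY (order_sk),                             -- the surrogate key serves as the primary key
    UNIQUE (customer_id, order_number)                  -- the composite business key it stands in for
);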

Whatever it is, understanding the business keys is often half of the battle to understanding system integration, report accuracy, and the key components that make the world (company) go round.

Identifying business keys can be a breeze if there is already a subject matter expert who can guide you through the table. If the dataset is completely novel, there is still a sure-fire way to identify them: count the rows in the table (or in a partition, if one exists) and look for the field(s) whose SELECT DISTINCT count matches that row count (or try row_number() if you still want to see the rest of the data). NULL or empty values in these business keys make those records invalid, and if there are many of them, you need to check with the system admins to remedy them.
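
As a sketch of that check, assuming the hypothetical orders table and column names from above:

-- If a field's distinct count equals the total row count, it is a business key candidate.
SELECT
    COUNT(*)                                              AS total_rows,
    COUNT(DISTINCT order_number)                          AS distinct_order_numbers,
    COUNT(DISTINCT customer_id)                           AS distinct_customers,
    SUM(CASE WHEN order_number IS NULL THEN 1 ELSE 0 END) AS null_key_rows  -- invalid records
FROM orders;   -- add a WHERE clause on the partition column if one exists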

Photo by Aniyora J on Unsplash

To bring this up a notch, you can attempt joining dataset A with a dataset B that you know contains the same field, let’s say business key Y. If you manage to join them successfully, in that every value under business key Y in dataset A can be found in dataset B, you know you’ve identified two related tables that may very well be generated by two different systems or reports. If you have not managed to join them successfully (partial to none of the values are found), well, this is why you are hired! It is time to dig deeper into how the values are generated and identify whether there is a method, backed by business logic, to regenerate the values you see in business key Y of dataset A.
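
One way to run that test is an anti-join. The names dataset_a, dataset_b, and y below are placeholders for whatever tables and business key you are comparing:

-- Values of business key y that exist in dataset_a but have no match in dataset_b.
-- Zero rows returned means every value joins successfully.
SELECT a.y, COUNT(*) AS orphan_rows
FROM dataset_a AS a
LEFT JOIN dataset_b AS b
    ON a.y = b.y
WHERE b.y IS NULL
GROUP BY a.y
ORDER BY orphan_rows DESC;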

Perhaps if business keys were called “business glue”, they would be more hype-able, and “glue” is definitely more descriptive of what they do. They identify each true record and are leveraged to purge duplicate records. They enable “join” operations and lookups, so they indirectly show you whether two tables have a 1:1 or 1:N relationship (no crazy N:M relationships, please; that means the table needs some refactoring). Their transformation logic can rarely be meddled with, because meddling might mean making significant changes to the systems or reports generating them.

So, step 1: once your access to the data infrastructure at the new company is granted, start scanning the tables regularly maintained by your team and, most importantly, identify what connects them all. Be ready to take notes, because your brain can probably keep track of no more than three related business keys at a time.

High granularity is great for digestion

Photo by Ben White on Unsplash

I’m sure this is not your first time hearing data compared to food, because Cassie Kozyrkov’s microwave analogy helps us so much in understanding the entire data science discipline.

Microwave analogy in first 5 minutes

How do you know if the dataset in front of you is as raw (most granular level) as it can get?

If you haven’t skipped the previous section, you already have half of the answer. If your dataset has business keys, it is almost always the most granular data you can get, unless there is yet another system that breaks it down further (e.g., a shipment number that rolls up to an order number because an order can be split into multiple shipments).

Having an in-depth understanding of the most granular data, and of how many true records to expect per business key within a given time window, gives you robust control over data quality and the flexibility to change your data strategy or business model. You would know your baselines, such as the metrics currently being tracked or the daily row count behind a metric, and start comparisons from there.

Taking that shipment number vs. order number as an example, imagine your company only partnered with one shipping company (shipping A), and life was very easy while shipment number = order number. However, as the business grows and a pandemic hits, shipping A can no longer fulfill all of the orders, so you start partnering with a second shipping company (shipping B).

Senior management negotiated a deal, and that’s all fine and dandy until you can’t quite accurately report to them exactly how many shipments have been fulfilled by shipping B, because you’re still scrambling to split one company’s shipment numbers from the other’s.

The prudent way is to understand how many orders are normally fulfilled by shipping A within a time window (per hour/day/week), influence decision-makers to agree on a scalable shipment number convention that easily identifies the different shipping companies as an updated business key, and, as shipping B fulfills more orders, track the “new normal” by monitoring the record counts per order number + shipment number for a time window.

Say shipping A usually ships 100K orders per week (your baseline), and after shipping B starts, shipping B now fulfills 40K orders while shipping A fulfills the remaining 60–70K. Now the downstream aggregated report even has a new metric where you can track, per order, how many of the shipments are fulfilled by shipping A versus shipping B (e.g., an order of 5 items, 3 shipped by A and 2 by B), without any structural change to your raw table.
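
As a sketch of that monitoring, assume a hypothetical shipments table where shipment_number embeds the carrier per the agreed convention (say an 'A-' or 'B-' prefix) and rolls up to order_number; the DATE_TRUNC syntax varies slightly by warehouse:

-- Weekly baseline: shipments per carrier and distinct orders fulfilled.
SELECT
    DATE_TRUNC('week', shipped_at)                               AS ship_week,
    SUM(CASE WHEN shipment_number LIKE 'A-%' THEN 1 ELSE 0 END)  AS shipped_by_a,
    SUM(CASE WHEN shipment_number LIKE 'B-%' THEN 1 ELSE 0 END)  AS shipped_by_b,
    COUNT(DISTINCT order_number)                                 AS orders_fulfilled
FROM shipments
GROUP BY 1
ORDER BY 1;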

Photo by Marcus Kauffman on Unsplash

Let’s try throwing one or two “data disasters” in. Imagine shipping A’s server goes down, so they report their fulfillment as 50K orders. If you did not have a deep understanding of the transactional shipment table and of shipping A’s shipment number convention to do a row count, it might not be obvious at first that shipping A’s data is inaccurate, and it might take someone down the chain of command to flag the “data issue”.

Imagine yet another month where your order counts shoot up to 300K per week, and your boss, excited but erring on the side of caution after the last data disaster, asks you to check whether the 300K orders are the real deal. With the baseline number in mind, and being able to query against the business keys (let’s hope the business keys are also all generated correctly), you can quickly confirm with your boss that the 300K figure is not due to duplicate records and could be an effect of the pandemic.
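
A quick duplicate check along those lines might look like this, again using the hypothetical orders table and an assumed reporting week:

-- Any row returned is a business key that appears more than once, i.e. a duplicate record.
SELECT order_number, COUNT(*) AS record_count
FROM orders
WHERE order_date >= DATE '2021-03-29'
  AND order_date <  DATE '2021-04-05'   -- the week that reported 300K orders
GROUP BY order_number
HAVING COUNT(*) > 1;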

Step 2: become familiar with the high-profile BI reports in circulation, but become even more familiar with the underlying raw data that aggregates or rolls up into those reports. You can thank yourself later.

Can I have the data by tomorrow?

Please do comment below if this is your favorite question at work.

How many times have you responded in your head, “well, that doesn’t really depend on me…”?

Indeed, it doesn’t really depend on one single data engineer or analyst to turn around a report, because the underlying raw data can come from various data sources that refresh at different frequencies. The cherry on top is that every single time, before the data person is ready to deliver something, they are bound to spend time checking that the available data covers the requested time window, verifying the data against the aforementioned business keys, and confirming that no recent or upcoming change to the data is going to affect the report.

The sales report might look amazing, but it turns out the product dimension table has been outdated for half a year, so the product launches that happened during that period were not included, resulting in severe under-reporting.

The batch jobs updating the three tables you need to generate a cost report run at different times and, frankly, in the wrong order, so when you join the three tables you see fewer records than you should (two tables give you the full day of records, but one table, because of the wrong update order, gives you only records from the last 2 hours of the day). You can’t notice this even when you know the table’s business keys and its granularity; you have to pay attention to whether the order of the batch jobs makes sense.
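
One way to catch that kind of partial load is to compare how far each source has actually been loaded for the business date in question. The table names (cost_lines, suppliers, exchange_rates) and the loaded_at/business_date columns are assumptions for illustration:

-- If one table's latest record stops hours earlier than the others for the same
-- business date, its batch job likely ran too early or out of order.
SELECT 'cost_lines' AS source_table, COUNT(*) AS row_count, MAX(loaded_at) AS latest_record
FROM cost_lines WHERE business_date = DATE '2021-04-20'
UNION ALL
SELECT 'suppliers', COUNT(*), MAX(loaded_at)
FROM suppliers WHERE business_date = DATE '2021-04-20'
UNION ALL
SELECT 'exchange_rates', COUNT(*), MAX(loaded_at)
FROM exchange_rates WHERE business_date = DATE '2021-04-20';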

I can go on forever with the examples, but what I am trying to home in on here is that “time” is the trickiest variable in data engineering.
https://www.reddit.com/r/perfectloops/comments/9lgtid/a_synchronized_passes/

“Time” as a variable can manifest in many ways: one is the literal definition, such as the time window a job (batch or real-time streaming) runs in; another is the frequency at which a job runs; yet another is the order in which jobs run across different systems (kudos to you if you immediately think of a DAG).

Step 3: become familiar with how quickly data arrives in each table. It enables you to tell data consumers that the data is only as fresh as its underlying update frequency and the last time you queried it, and, truly, that is one expertise that sets you apart from other data professionals.
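
If your pipelines write to some kind of load-audit table, a small freshness summary makes that conversation easy. The pipeline_load_audit table and its columns below are an assumption, and the timestamp arithmetic differs by warehouse:

-- How stale is each source right now? One row per source table.
SELECT
    source_table,
    MAX(loaded_at)                     AS last_loaded_at,
    CURRENT_TIMESTAMP - MAX(loaded_at) AS staleness
FROM pipeline_load_audit
WHERE load_status = 'success'
GROUP BY source_table
ORDER BY staleness DESC;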

Closing Remarks

Four years ago, I barely knew the term “data engineering” but was more interested in getting consistent, trustworthy data than in deriving business insights from “good” data vouched for by word of mouth. Four years later, despite getting full-blown hands-on experience with a state-of-the-art tech stack for handling big data, I still have to work hard to ensure the data accurately reflects business reality. Yet the journey continues, with new supporting software arriving every year to solve a challenge or business gap that the existing set-up can’t address.

What good is technology without its application in helping businesses utilize their data to reach new heights?

Regardless of which “it” technology is trendy for big data handling right now, I deeply believe these three topics will remain, because I’ve seen them stay constant across companies standing solely on spreadsheets, on spreadsheets plus a half-baked in-house relational database, and even on sophisticated data warehousing in on-prem Hadoop or the Cloud.

They’re all trying to solve the same problem: understand their own data and use it. The learnings you gain from these three steps will equip you very well in the face of an operational fire (e.g., a crashed server leading to missing data), a future project (e.g., a pipeline feeding a dashboard with a novel aggregation from previously thought-to-be-unrelated data sources), and changing data strategies (e.g., a new data collection application you think is unnecessary because a similar data source already exists).

TL;DR,

Step 1. Every table has a business key. Identify it (or them), understand what values they contain, and use them to glue tables together.

Step 2. The most granular tables (raw tables) always have business keys. Understand their structure very well, especially the metrics (count, $$, rating). Get a baseline number.

Step 3. Time is of the essence. Data sources can be updated at different times, at different frequencies, or simply in the wrong order. The previous two steps alone won’t keep you on top of data freshness.

Good luck and lead the way!

Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.

Realities Of Being A Data Engineer

Switching Careers to Become a Data Engineer | The Path To Being A Data Engineer

Developing A Data Analytics Strategy For Small Businesses And Start-ups

5 SQL Concepts You Need To Know Before Your Next Data Science Or Data Engineering Interview

How To Improve Your Data-Driven Strategy

What Is A Data Warehouse And Why Use It

Mistakes That Are Ruining Your Data-Driven Strategy

5 Great Libraries To Manage Big Data With Python

What Is A Data Engineer