Good Data Quality Is Key For Great Data Science And Analytics
If you have worked with financial or accounting data, then you know that a $10 discrepancy today can become a $100,000 discrepancy tomorrow.
An error in a data storage system or business intelligence report might seem small at first. It can be tempting for analysts to ignore it or fudge the numbers for the month. Maybe they will tell themselves, "I will fix it later."
However, if a data science or business intelligence team doesn't fully understand why the error occurred, they might end up with an even larger error in the next report or insight they provide to management.
Leaving bad data alone to fix itself never works. It is a great way to ensure your executives make decisions based on bad metrics and eventually lose trust in the data team.
The Harvard Business Review recently published a piece titled Only 3% of Companies' Data Meets Basic Quality Standards, describing a two-year study in which 75 executives performed the Friday Afternoon Measurement (FAM) test.
The test involves taking a sample of the last 100 records created in a specific business department and checking for accuracy.
This could be customer orders, social posts, or engineering drawings. The purpose is to assess the validity of a small sample in order to gauge how good the rest of the data plausibly is.
This method is a great litmus test for starting to assess data quality after it has been inserted into a database.
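As a minimal sketch of the FAM idea in Python: take the most recent records and score what fraction pass an accuracy check. The `is_valid` predicate here is our own placeholder; in practice it would encode your department's own accuracy rules.

```python
def friday_afternoon_measurement(records, is_valid, sample_size=100):
    """Score the most recent `sample_size` records: fraction that are error-free."""
    sample = records[-sample_size:]  # the last N records created
    good = sum(1 for r in sample if is_valid(r))
    return good / len(sample)

# Toy example: an order is "valid" if it has a positive amount and a state.
orders = [{"amount": 25.0, "state": "WA"}] * 90 + [{"amount": -1, "state": None}] * 10
score = friday_afternoon_measurement(orders, lambda r: r["amount"] > 0 and r["state"])
print(score)  # 0.9, only 90% of the sample meets the quality bar
```

In the HBR study, a score like this one (fewer than 97 clean records out of 100) would fail the basic quality standard.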
It should be the goal of a data team to make sure data is clean and accurate before it enters a system.
Besides data science, our team also specializes in data engineering and data management. This is the discipline of ensuring that the way data is stored and structured is conducive to analysts' use cases while keeping a tight lock on data quality. It also requires the occasional data audit to assess ETLs, database objects, scripts, and anything else involved in loading, manipulating, or scanning the data for external use. It helps ensure that managers and directors can trust the company's data.
The concept of data management is not exactly "cool", so to speak. There isn't the promise of the instant payoff that data science and business intelligence teams seem to offer. Add the discipline of quality assurance on top of that, and we can already feel some readers' eyes begin to glaze over. It's not sexy, it's not cheap, and it requires planning and process development.
Data management and quality assurance are similar to system administration: a thankless job that only gets noticed when things go really wrong and executives find out they have been making the wrong decisions because their data was off by a few records.
Compared to data scientists, who currently hold the "Sexiest Job of 2017", data management has fallen by the wayside in discussions at most companies. Companies are attempting to play catch-up with tech giants like Amazon and Google, which have been data-driven from birth. Yet even their data isn't perfect.
Keeping Your Data Clean
How does a company keep their data clean?
Have good data principles from the start!
Thus, our team believes in creating data systems that are engineered to keep out bad data and maintain accurate records. This is the key to any data-driven culture!
Our last few posts have been written on statistics and some implementations of data science algorithms. We love statistics and data science, and we enjoy showing how their techniques can be implemented.
We wanted to take a step back and make sure that those reading understand that data science requires more than just good math and programming skills!
It requires good data!
Which is why our favorite quote is: "Data! Data! Data! I cannot make bricks without clay." – Sherlock Holmes
How do you avoid bad data?
Automated QA System
We can't emphasize enough the importance of an automated QA system for data. As data grows, relying on smart systems designed to manage data throughout the process is a must. These systems need to be automated because of the sheer volume and velocity of big data. Big data is not just a buzzword; it is a real thing companies have to deal with, and it can be either a strength or a weakness depending on how they manage their large data stores.
Data in most systems is too large to check manually. In this case, "manually check" refers to running a list of stored procedures or ad-hoc queries as part of an analyst's routine. That can no longer fly in most corporations: it can take anywhere from 4 to 20 hours of work a week, and even that is often not enough to ensure your data is as close to the truth as possible.
There is a solution! Thankfully, most of the data quality checks a user performs can be written down and replicated by a job scheduler. This can save hundreds to thousands of hours company-wide and increase accuracy.
The rate at which users create data in the 21st century does not allow for humans to manually check and ensure the validity of all the data.
This means a data engineering, or data warehousing team needs to make sure there are checks at every stage of a data flow.
We know plenty of systems tout the ability to magically fix or store data. They like to talk about just throwing in unstructured data with no plan. Yet even these systems require planning and data quality.
As with anything that sounds too good to be true, data isn't magically converted into high-quality insights.
Instead, having a good QA system that checks data for expected results is key, whether you are using a data lake or a relational database!
There are some basic tests a data team can put into place that allow it to create a suite that runs every time new data is inserted.
Some examples of data quality checks are:
- Upper and lower bound tests, interquartile range (IQR) checks, and standard deviations
- Aggregate-level checks (after manipulating data, you should still be able to explain how it aggregates back to the previous data set)
- Tracking the percentage of nulls and dropped columns (define what an acceptable amount is)
- Data type checks and data value constraints (these should also be enforced earlier, at the application level; e.g. WA is a state abbreviation, KZ is not)
- Tracking data inserts: wherever data comes from, whether flat files, IPs, or users, it should all be tracked, especially specific files. If your team finds out the data from a specific file was inaccurate, it will want to remove it, and if you have tracked which file each record came from, that is easy.
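As a sketch of what the first and third kinds of checks could look like in Python, using only the standard library (the function names and thresholds here are our own, not from any particular tool):

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside the IQR fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def null_rate_ok(rows, column, max_rate=0.05):
    """Check that the share of missing values in a column stays under a threshold."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows) <= max_rate

daily_totals = [10, 12, 11, 13, 9, 14, 9000]        # one obviously bad load
print(iqr_outliers(daily_totals))                   # [9000]
rows = [{"amount": v} for v in daily_totals] + [{"amount": None}]
print(null_rate_ok(rows, "amount", max_rate=0.25))  # True (1 null in 8 rows)
```

A job scheduler can run functions like these against every new batch and alert the team when a check fails, instead of an analyst rerunning ad-hoc queries by hand.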
By combining these concepts with metadata tables, we have been able to create systems that allow for easy insertion of new tests and for tracking possibly erroneous records that might need to be removed.
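One hedged sketch of the metadata-table idea: the checks themselves live as data rows rather than code, so adding a new test means inserting a row, not deploying new scripts. The schema of `CHECKS` below is hypothetical; in a real system it would be an actual database table read by the scheduler.

```python
# Hypothetical metadata "table" of checks; a scheduler could load these
# rows from a real database table and run them after every load.
CHECKS = [
    {"table": "orders", "column": "amount", "rule": "not_null"},
    {"table": "orders", "column": "amount", "rule": "min", "value": 0},
]

def run_checks(datasets, checks):
    """Run every metadata-defined check; return (check, row) pairs that failed."""
    failures = []
    for check in checks:
        for row in datasets[check["table"]]:
            value = row.get(check["column"])
            if check["rule"] == "not_null" and value is None:
                failures.append((check, row))
            elif check["rule"] == "min" and value is not None and value < check["value"]:
                failures.append((check, row))
    return failures

data = {"orders": [{"amount": 50}, {"amount": None}, {"amount": -5}]}
for check, row in run_checks(data, CHECKS):
    print(check["rule"], "failed on", row)
```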
We recently spoke with a software engineer at Amazon about their team's focus. Oddly enough, it was a team of machine learning engineers devising systems to detect and fix duplicate data! Imagine that: a whole team focused on managing duplicate data!
Duplicate data can happen at many stages of a system, and it can cost your business as executives and leadership make poor decisions based on inaccurate data.
Duplicate Data on Insertion
Duplicate data can occur upon insertion of a new record. An operational team member might accidentally load the same source data twice, a glitch might cause the same data entry to be inserted twice in a row, etc.
This should typically be mitigated by a combination of factors.
First, a unique key!
This is not always a primary key, and it is never a primary key set on auto-increment. An auto-incrementing key will not keep out duplicates, because it simply increments and will never produce a duplicate value itself.
However, if your data team has a great understanding of the process it is attempting to model (which it should), then it can break down the specific data points that are unique to each transaction and build a unique key from them.
Maybe you know that no customer can have the same combination of customer_id, product_id, and datetimekey. This is a simple toy example, but it would allow you to create a unique key that keeps out duplicate entries. You just have to make sure you have a solid try-catch setup around the insert.
This becomes especially important with data that has constant alterations like medical records which might go back and forth in an insurance system as the payment gets sorted out.
The next, and much more common, cause of duplication is not understanding the data architecture and running complex joins on relational models.
This is especially common with new analysts, whether they are new to the team or new to writing ad-hoc queries.
It isn't always obvious that rows have been duplicated once you aggregate a data set.
Writing a SQL aggregate statement doesn't make analysts aware of what is going on underneath the hood.
In some cases the error is obvious because the final numbers are absurd (like a company that grosses $100 million suddenly having a monthly spend of $1 billion).
However, there are instances in which the errors are much more subtle. Often only a few aggregated categories are off, and even then only by a small percentage that could go unnoticed.
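A small SQLite example of this failure mode: a one-to-many join silently inflates an aggregate, because each order row is repeated once per matching shipment row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- order 1 shipped in two parcels: a one-to-many relationship
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

print(true_total)    # 150.0
print(joined_total)  # 250.0, order 1 is counted once per shipment row
```

An analyst who doesn't know that shipments fan out per order would report $250 of revenue instead of $150, and nothing in the query itself warns them.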
This inaccuracy will typically only be caught once the report is compared to other reports, or once the number finally reaches an absurd value. Once caught, it leads to distrust among executives and leadership.
Thus, it is important that analysts fully understand the data first, before they start developing reports.
This is where good documentation and training come in handy. Without them, each new analyst will stumble through the same pitfalls as the previous one.
Poor Data Constraints
Data constraints can be a little bit of a pain.
When we refer to data constraints, we mean foreign keys, value limitations, data types, and not null clauses.
They can be somewhat frustrating, as they cause multiple errors while developers test new code and attempt to insert data.
This can make it tempting for a developer to leave the database stripped of constraints, since that makes testing easier. It also makes it easy to forget to go back and put the constraints in place.
This is one of the issues that more modern, less restrictive data management systems like MongoDB run into.
There is a balance, of course. However, when data constraints are skipped while building a MongoDB system, the result is bad data.
Data constraints might be painful, but they keep bad data from leaking into a system!
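As a sketch of what these constraints look like in practice, here is a SQLite schema combining a not-null clause, a foreign key, and a value limitation. The tables are toy examples, and note that SQLite only enforces foreign keys when the pragma is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
    CREATE TABLE states (abbr TEXT PRIMARY KEY);
    INSERT INTO states VALUES ('WA'), ('OR');
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,                          -- no nameless rows
        state       TEXT NOT NULL REFERENCES states(abbr),  -- foreign key
        age         INTEGER CHECK (age BETWEEN 0 AND 150)   -- value limitation
    );
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'WA', 36)")  # accepted
for bad_row in [(2, 'Bob', 'KZ', 40),    # KZ is not a state abbreviation
                (3, None, 'OR', 25)]:    # name may not be null
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?, ?)", bad_row)
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
```

Every error the constraints throw during development is one bad record that never made it into a report.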
Data Quality Assurance
Data quality assurance and data management might not be as sexy as data science; they don't have the wow factor or thrill of machine learning. However, without good data, your data science insights will be useless. The machine learning algorithms you develop will be wrong and may cost the company more than they could have saved.
It is sort of like putting dirty, unrefined crude oil into your car engine and expecting it to run perfectly.
Not really a good plan.
Being data-driven is more than just a buzzword. It requires discipline at each step of the process! Otherwise, the end result will not be accurate or worth anything, and executives are better off guessing and using their gut than relying on their team's data.
Remember, even Amazon has a team that uses machine learning to help manage duplicate data. That is the one dirty secret no company really likes to talk about…
It needs to be discussed and fixed!
We do enjoy discussing the applications of data science, but it is always good to look at the entire ecosystem of data.
Other Resources About Data Science And Strategy