Our Data Science Consulting Project Process
As data science consultants, we often get asked: what is your process, what are your steps, or what is your methodology?
Unlike an in-house data scientist, who might get stuck maintaining code, we are typically called in because a client wants a new algorithm or machine learning system integrated into their website or application, or because they would like our team of consultants to provide mediation on how to integrate data science into executive-level strategy.
This means we are not only data scientists, but also project managers who have to ensure we meet our deadlines and that our data science project is a success. So this question is warranted! What is our data science process?
Data Science Project Proposal
As data science consultants, we are actually in a very competitive field. There are a lot of smart data scientists out there and we all love new projects! Each one is just a different puzzle that we get to solve and, at our core, that is what data scientists love! We love puzzles and problems.
However, due to this competition, step one (after the initial meeting with a client, where they map out their needs and wants) is usually a project proposal. What the proposal includes depends on the type of project we are going to be doing.
When we conduct data science workshops and executive mediation, the goal is to find out what the client wants discussed and what their executive team hopes to gain from the workshops. Then we draw up a schedule, design any materials that need to be created, and run the events.
A larger machine learning integration or algorithm project means a much larger task ahead of us! So let’s go over our steps!
One of our first steps is to figure out what data exists! If there is no good data to build a foundation on, we have to know that as close to day one as possible. Hopefully, this isn’t the case!
If the client’s data is in order, then we will work towards finding as much contextual data as possible. We will pull metadata from databases, objects, tables, ETLs, data dictionaries, you name it. The more context and metadata we can gain, the better.
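As a minimal sketch of what pulling metadata can look like, the snippet below lists every table and column in a SQLite database. The database, table names, and column names are made up for illustration, and a real client system would likely use a different engine with its own catalog queries.

```python
import sqlite3


def inventory_schema(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# Hypothetical client database, built in memory purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, signup_dt TEXT, seg_cd TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amt REAL)")

schema = inventory_schema(conn)
for table, columns in schema.items():
    print(table, columns)
```

A listing like this is only a starting point: cryptic names such as the invented seg_cd above are exactly the fields worth taking to a subject matter expert.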
We can then start to track and record the data sources we need from internal databases. This means asking subject matter experts what field names mean and seeking out internal tribal knowledge about how each data source is used. Another set of data sources might come from external APIs, open source projects and aggregators.
This might be social media data, credit card information, public records data, or aggregator data (which often has to be purchased). We only bring in these sources if they seem necessary.
This gives us the bricks to build our machine learning systems, data science algorithms and predictive models. Without first taking the time to look through all the data, check that the building blocks are high quality, and sort out what is what, it would be like reaching in the dark. This step is key; you can’t really get started without it.
Once our data science team has gathered all the data we need, we begin poking, probing, and essentially playing with the data. This is similar to a research scientist combing through hundreds of papers. We want to know what the data is saying.
We want to look for
This helps us give quick feedback to our stakeholders and gives our team better information about how the data is shaped.
We do this by using:
All of these tools allow us to see data from different angles. The more we understand the data, the better we can develop statistical models, or processes that can manage the data to predict the outputs we want.
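One simple way to start seeing how the data is shaped is a per-column profile. The sketch below uses only the Python standard library; the sample rows and the profile_column helper are invented for illustration and are not a specific tool from our stack.

```python
from statistics import mean


def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, numeric range."""
    present = [v for v in values if v is not None]
    summary = {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Only report a numeric range when every non-missing value is a number
    if numeric and len(numeric) == len(present):
        summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return summary


# Made-up sample rows, for illustration only
rows = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "pro"},
    {"age": 51, "plan": "basic"},
    {"age": 29, "plan": None},
]

profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
for col, summary in profile.items():
    print(col, summary)
```

Even a profile this small surfaces missing values and low-cardinality fields early, which is the kind of quick feedback stakeholders can react to.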
We have had some people say things like, “Can’t it all just go into a neural network?”
Sometimes it does boil down to a neural network. However, that isn’t always the case. It is best to test a few methodologies before assuming you know the right answer. That is one misconception we have seen on occasion.
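To make "test a few methodologies" concrete, here is a small sketch that scores two candidate models on held-out data and picks the one with the lower error. The data is invented, and the two candidates (a mean baseline and a least-squares line) are stand-ins for whatever methods a real project would compare.

```python
def fit_mean(xs, ys):
    """Baseline: always predict the average of the training targets."""
    avg = sum(ys) / len(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Ordinary least-squares line through the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error on a data set."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up training and hold-out data, for illustration only
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
test_x = [7, 8]
test_y = [14.1, 15.9]

candidates = {"mean baseline": fit_mean, "linear fit": fit_line}
scores = {
    name: mean_abs_error(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

The point of the pattern is that every candidate is judged on the same held-out data, so the choice rests on measured error rather than on a hunch.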
Someone might think they know the right answer. A data scientist or analyst might make several leaps of logic and misunderstand the data (sometimes we even experience confirmation bias). Thus, it is very important to approach each problem from multiple angles. Otherwise, you might miss the best algorithm options or the best implementation (which are not always the same!). That is where agile development comes in!
Agile Data Science Development
However the data science project is being implemented, whether it is just an algorithm or an entire integrated machine learning system, it requires some form of agile development.
This might mean fine-tuning the algorithm and making sure it consistently produces the right output. It also might mean testing it in real life, whether through A/B testing or another form of experimental testing, with some form of control group while you are testing the application.
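As one hedged example of evaluating an A/B test against a control group, the sketch below runs a two-proportion z-test on conversion counts. The counts are invented, and the z-test is just one common choice; a real experiment might call for a different test or a sequential design.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between groups A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Made-up counts: control group (A) versus users who saw the new algorithm (B)
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the groups is unlikely to be chance alone, though it says nothing about whether the change is large enough to matter to the business.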
Human reactions to algorithms are not always as predicted. Oddly enough, computers are a little too rational. Sometimes this can lead to computers thinking very differently compared to humans.
Even the Go masters stated that the computer that beat them played like no human they had played before. Humans are a huge factor in the final outcome of a data science project.
This applies to both your stakeholders and the actual human users! We have discussed what has occurred in the past when data science algorithms are implemented too well.
Even after millions have been spent on a project, it can lead to huge losses. So implementation of a machine learning algorithm is key! You don’t want your systems to give away too much, or to provide discounts that are too large! Even if the algorithm calculates that somebody should get a 101% discount, perhaps you should not give it.
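A simple safeguard is a business rule that sits between the model and the customer. The sketch below clamps a suggested discount to an allowed range; the function name and the 30% ceiling are hypothetical and would come from the client, not from the model.

```python
def apply_discount_guardrail(suggested, max_discount=0.30):
    """Clamp a model-suggested discount (as a fraction) into [0, max_discount]."""
    return min(max(suggested, 0.0), max_discount)


# The model's raw suggestions may be unreasonable; the guardrail has the final say
for raw in (0.15, 1.01, -0.20):
    print(raw, "->", apply_discount_guardrail(raw))
```

Keeping the cap outside the model means it still holds if the model is retrained or replaced.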
In the end, before releasing the final product, our team has to make sure it has been tested on a sample population. This helps control possible unexpected outputs and keeps the focus on making sure the data science project does not fail!
Final Data Science Project Product
Sometimes, there might not be a real final product when a machine learning system is implemented. Think about it. Even Google has to change its algorithm a few times a year to keep people from gaming its systems.
Similarly, a machine learning system might not stay effective forever. Things change, and people’s thought processes might become slightly altered. We recommend that algorithm effectiveness be monitored.
This can usually be automated to some extent. That reduces hands-on algorithm maintenance to a review every year or two, give or take (unless you just want to add new features or data points, then that is on you!).
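A minimal sketch of automated monitoring, assuming the true outcome eventually becomes known for each prediction: track accuracy over a rolling window and flag the model for review when it drops below a threshold. The class name, window size, and threshold are all hypothetical.

```python
from collections import deque


class EffectivenessMonitor:
    """Track accuracy over the most recent predictions and flag when it drops."""

    def __init__(self, window=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window)  # True where the prediction was correct
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        # Wait for a full window so a few early misses do not raise a false alarm
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Tiny window and made-up labels, for illustration only
monitor = EffectivenessMonitor(window=5, min_accuracy=0.8)
for predicted, actual in [("buy", "buy")] * 5:
    monitor.record(predicted, actual)
healthy_flag = monitor.needs_review()

for predicted, actual in [("buy", "skip")] * 2:
    monitor.record(predicted, actual)
degraded_flag = monitor.needs_review()
print(healthy_flag, degraded_flag, monitor.accuracy())
```

In practice the flag would feed an alert or a dashboard rather than a print statement, and accuracy could be swapped for whatever metric the project actually optimizes.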
Other times the final product might actually just be a report, algorithm or model. This can just be a static product that doesn’t require maintenance and just is.
Data Science Documentation
Documentation is obviously not the most fun part of the job. However, it is oddly one of the most important!
This is especially true for a consultant. Again, unlike an in-house data scientist, you will be leaving after a set period of time. What happens then, even if you left a perfect implementation and a robust software system?
Things happen! Technology changes, company systems change, etc. All of this must be documented. What is your system dependent on, what functions, objects, and scripts did your team create? Why? What do they do?
You know what is great? Diagrams! Having logical diagrams of the models and objects your team has created saves the day. It makes explaining what you have built easier and lets another team edit your code safely.
It is the difference between mediocre and great! Even great programmers (in the sense of actual design) need to leave documentation. Data science and programming are too complex to simply try to remember it all.
Why All the Process? Doesn’t Process Kill Data Science Creativity?
You might be asking us now:
- Why all this process in data science?
- Why create a project plan?
- Why define a scope?
Well, ask anyone who has ever tried to do a project of any kind, whether in data science or construction. There are too many moving pieces to blindly attack each project. Do we deviate from the steps above? Of course! These are merely guidelines that help keep us moving forward. Complex tasks require simplification.
It may seem easy to develop a machine learning algorithm. However, even if that part is easy, implementing said algorithm is far from it! Integrating any piece of software into another technology has never been easy! Companies throw hundreds of thousands to millions at integration projects.
These projects require both our consultants and their engineers to work together to discover all the data touch points and develop a congruent system. It is important that both systems speak the same data language and use the same vocabulary!
That way, when they want to go back and analyze the effectiveness of the algorithm or feed the new data into their current model, it is easy.
Finally, Why We Love Data Science Consulting
Data science consulting isn’t always easy. It is, however, a lot of fun! The challenges and puzzles thrown at you are always different. Data comes in different shapes, industries have different needs, and customers respond differently to different products. This means there is no one-size-fits-all algorithm.
When a problem simply involves picture classification, that is one thing: human behavior is not involved. Add in the elements of human nature, biology, and psychology, and you get another set of problems!
If you’re a data science team out there, we hope this helped you organize your next project. Let us know what your data science team does differently in the comments below!