The Most Important Data Science Happens Before Any Model Is Built – Data Quality
Oh my, Susana is gonna talk about such a boring topic. Bear with me: data quality is crucial and one of the most important parts of the data science process.
Here the main rule is simple.
In modelling, it doesn’t matter how good your model is if your data is garbage. Modelling is ruled by the garbage in, garbage out principle: bad inputs create bad results.
Here lies the problem: no data is perfect.
All data requires some degree of treatment before moving on to modelling, and it’s up to the data analysts and data scientists to do it. But how? How can we know that a dataset is ready to move on to modelling?
The process is not straightforward and requires quite a lot of your analytical and rational mindset. If we had to put it in a sentence, it would be: “Take all you know about statistics, join it with domain knowledge, pick the question you need to answer, and ask yourself: does this data make sense?”
A lot of the time, errors will arise from the simplest of analytical processes. From wrong variables, to a lack of consistency, to downright weird values, many of them can be discovered with the simplest descriptive statistics. Don’t believe me when I say this process is crucial?
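To make this concrete, here is a minimal sketch of how basic descriptive statistics can surface data problems. It assumes pandas and uses a small, made-up customer dataset with hypothetical column names (`age`, `monthly_spend`):

```python
import pandas as pd

# Hypothetical customer dataset with a few typical quality problems
df = pd.DataFrame({
    "age": [34, 29, -5, 41, 233],                      # negative and implausibly large ages
    "monthly_spend": [120.5, 80.0, 95.2, None, 60.1],  # one missing value
})

# Simple descriptive statistics already expose the odd min and max
summary = df.describe()
print(summary)

# Sanity checks driven by domain knowledge: ages must fall between 0 and 120
suspicious_ages = df[(df["age"] < 0) | (df["age"] > 120)]
print(suspicious_ages)

# Missing values per column
print(df.isna().sum())
```

A plain `describe()` plus a handful of domain-driven range checks is often enough to catch the “downright weird values” before any model ever sees them.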
Let’s go for a historical example, this time with NASA, which lost a Mars orbiter because of units of measure.
The story has a lot of interesting facts (read more about it here ), but at the center of the question is the software of two systems created by two different teams. The orbiter was already 4 months into the mission when they realised a crucial problem: the system on the spacecraft used newton-seconds, units of the International System of Units, while the system on Earth used pound-seconds from the Imperial System. Uh-oh…
Paraphrasing the VICE article:
“This means that every impulse measurement done by the Angular Momentum Desaturation software underestimated the effects of the thrusters on the spacecraft by a factor of 4.45 (1 pound of force is equal to 4.45 Newtons).
When you’re traveling tens of millions of kilometers, every little deviation adds up. The measurement errors in the ground-based processing went overlooked by the ground crews; modelers assumed that the data provided to the spacecraft was in the correct metric units.”
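As an illustration of that factor of 4.45, here is a small sketch of the missing conversion. This is not NASA’s actual software, just the arithmetic behind the quote:

```python
# Illustrative sketch of the unit mismatch (not NASA's actual code):
# the ground software produced impulse values in pound-force-seconds,
# while the spacecraft expected newton-seconds.
LBF_TO_NEWTON = 4.448222  # 1 pound of force is approximately 4.45 newtons

def impulse_in_newton_seconds(impulse_lbf_s: float) -> float:
    """Convert an impulse from pound-force-seconds to newton-seconds."""
    return impulse_lbf_s * LBF_TO_NEWTON

# If this conversion is skipped, every impulse is underestimated
# by that same factor of ~4.45
reported = 10.0                                # ground value, in lbf·s
correct = impulse_in_newton_seconds(reported)  # what should have been used
print(correct / reported)                      # the ~4.45 factor
```

A one-line conversion, missed once, compounded over tens of millions of kilometers.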
OK, you probably won’t cause an orbiter to be lost, but this story serves as a reminder of how important data quality is.
You will gain more analytical skill the more data quality practice you have. In sum: stay curious, honest and objective, and be ruthless in the data quality process. It is a question of being real, not perfect.
Do we have guidelines we can follow?
While the process by which data quality is attained may vary between projects, there are a few golden rules that one should try to follow as closely as possible:
● Relevancy – Big Data seems all the rage nowadays, but is the data you have on your hands relevant? Can you answer the question you’ve been posed with that data or not? Is the relevant data drowned in a pool of unnecessary data? Verify that in the end you have only relevant data in your dataset. More data doesn’t necessarily mean better results;
● Precision and Accuracy – How accurate does the project have to be? What kind of precision do you need to attain in your results? Churn rates and medicine require different levels of precision. Assess the costs of the level of precision and accuracy for each project and aim for them;
● Consistency – Are you trying to make predictions on historical data, but you realize that some months have a ton of data while others barely have any? Many datasets require proper consistency to support proper predictions. Make sure you intervene early enough to correct it. Data scientists can and should help create data collection processes that avoid problems with inconsistent data;
● Security and Privacy – Last but not least, assess whether the data you have been given complies with security and privacy measures. Are you seeing private data that you shouldn’t see? Say it! Is the database fragmented, with tons of data spread among the workers? Say it! Where security and privacy are concerned, prevention is always the best measure.
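The consistency rule in particular is easy to check mechanically. Here is a minimal sketch, assuming pandas and a hypothetical event log, that counts records per month and flags sparsely covered months:

```python
import pandas as pd

# Hypothetical event log: some months are well covered, one barely is
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-05", "2023-01-12", "2023-01-20", "2023-01-28",
        "2023-02-03", "2023-02-18", "2023-02-25",
        "2023-03-30",  # March has a single record
    ])
})

# Records per month — large gaps signal inconsistent collection
counts = events["timestamp"].dt.to_period("M").value_counts().sort_index()
print(counts)

# Flag months with fewer than half the median number of records
threshold = counts.median() / 2
sparse_months = counts[counts < threshold]
print(sparse_months)
```

The half-of-median threshold is an arbitrary choice for illustration; the point is that a two-line groupby-style count makes inconsistent coverage visible early, while there is still time to fix the collection process.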
Hope you enjoyed this honest look at the importance of data quality.