Data Analysis, taking a good first step…
Day after day I keep hearing that data analysts are “not so smart data scientists”, and this assumption couldn’t be more wrong. Many courses and tutorials, by focusing on modelling and mathematics rather than on the data first, make the process of data analysis seem much simpler than it usually is. Data analysis is the first step of the big marathon that is data science and, as a first step, it is of the utmost importance, and many things can go wrong in it. What can we do to keep ourselves from falling right at the beginning of the marathon?
What is your question?
Ask a data scientist what question she or he is trying to answer with an analysis and, more often than not, you will hear some stuttering before a timid answer.
What has happened? Probably no roadmap has been prepared for that work and the data scientist is literally lost in the work. There’s no goal, no main objective.
So here is the first goal to attain in data analysis: having a well-defined goal.
This goal by no means needs to be set in stone, and it will probably evolve along the way, but it must be defined with its usefulness in mind. Is it useful for the analyst or scientist performing the data analysis? For the business or the research? For the project? And this goal needs to be set as a team, with everyone aware of it.
With a goal set, it’s time to move to the next phase.
Question the Goal
Say whaaaaaaaaaaaat? You’re probably thinking I’ve gone insane but hear me out.
The goal from the previous phase was set as a team and is rarely defined precisely enough to have an objective, mathematical meaning. It’s your job to examine the question and understand the underlying problem behind the goal.
Let’s check an example. You are working on air pollution and trying to determine whether air pollution is harmful to human health. The question “Is air pollution harmful to human health?” is too vague to be answered directly by data. What can you do?
You check the dataset in your hands and verify that you have pollution levels at several spots in the city, as well as the number of people who entered each city hospital with respiratory symptoms. One good first question for you could be “Is there any relation between the air pollution values and the number of respiratory patients in the nearby hospital?”
This phase can be hard, because some managers and people in senior positions may feel that this kind of deep dive into the agreed goal is questioning their knowledge, but it is up to the data analyst or data scientist to show them the importance of this task.
It will allow the person doing the analysis to set specific goals, to understand the dataset and to understand the business or research area being studied. During this phase do not be afraid to experiment, to do little prototypes and to use your statistics knowledge to understand the data as much as possible.
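A little prototype for the air pollution question could look like the sketch below. It is only an illustration: the numbers are made up, the names `pearson`, `pollution` and `admissions` are my own, and it assumes one pollution reading and one admission count per day for a single sensor and its nearby hospital. It computes the Pearson correlation, which is just one of several ways to check for a relation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values, NOT real measurements:
# daily pollution level at one sensor and respiratory admissions nearby.
pollution = [12.0, 18.5, 25.1, 30.2, 41.7, 55.3]
admissions = [3, 4, 6, 5, 9, 11]

r = pearson(pollution, admissions)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 or -1 suggests a strong linear relation, and a value close to 0 suggests none. Keep in mind that a relation in the data is not yet proof that pollution causes the admissions; it only tells you the specific question is worth pursuing.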
Defining the Data
Along with the question, the analyst should assess the scope of the data required to answer it. This process usually comes down to answering four questions:
- Which data do I have?
- Which data do I need to answer this question?
- How much data do I need to answer this question?
- Which standards of data quality do I need to follow to answer this question?
In this process the analyst should map out all the available data and the actions required to answer the questions above. Keep in mind that the requirements for data quality vary greatly from problem to problem.
The rules for medical data, for example, will be very different from the rules for marketing data. Do not forget the standards of quality we’ve already talked about here.
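A first pass at the four questions above can be as simple as the sketch below. It assumes the data is already loaded as a list of dictionaries, one per record; the function name `audit` and the column names are hypothetical, and what counts as “enough data” or “good enough quality” is left to you, because it depends on the problem.

```python
def audit(rows, required_columns):
    """Summarise a dataset: row count, missing columns, share of empty values."""
    # Collect every column that appears in at least one record.
    present = set()
    for row in rows:
        present.update(row.keys())

    # Share of records where each column is absent, None or an empty string.
    missing_share = {}
    for col in sorted(present):
        empty = sum(1 for row in rows if row.get(col) in (None, ""))
        missing_share[col] = empty / len(rows) if rows else 0.0

    return {
        "n_rows": len(rows),  # how much data do I have?
        "missing_columns": sorted(set(required_columns) - present),  # what do I still need?
        "missing_share": missing_share,  # a first look at quality
    }


# Made-up records for illustration only.
rows = [
    {"sensor": "A", "pm10": 31.0, "admissions": 4},
    {"sensor": "A", "pm10": None, "admissions": 6},
    {"sensor": "B", "pm10": 18.2, "admissions": ""},
    {"sensor": "B", "pm10": 22.4, "admissions": 3},
]

report = audit(rows, required_columns=["sensor", "pm10", "admissions", "date"])
print(report)
```

The report will not tell you whether the data is good enough, but it gives the team something concrete to compare against the quality standards the problem demands.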
Creating the Solution
OK, now that we have a well-defined question and know our data, it’s about time to create the solution. With a specific goal set, the process will remain iterative and experimental, but with a more defined workflow. Do not forget that the analysis has a specific goal.
The analysis should keep in mind the level of exactness the problem demands and the audience for which the solution is being developed. A data analysis made for engineers will have different standards than a data analysis made for doctors.
The golden rule is to keep the quality of the analysis as high as possible and to make sure that all interested parties understand the process and the results.
Never, never forget to document the whole process. It will be valuable for validating results and even for reusing some steps in later analyses. No one wants to reinvent the wheel.
Still think the process is simple? Data analysis is the first step in a data science project and it can easily be the most important one. Data analysis can make or break a whole project.
A good data analyst listens to everyone in the project, and that usually means having to manage a lot of points of view. It is the responsibility of the data analyst to ensure unbiased work that produces reliable results and material to further the project. Managing egos and priorities is no easy task.
Don’t oversimplify data analysis. It’s no easy job.
And never forget: “The longest journey of all begins with just one step”, so be sure to take a good one.