Data science teams strive to transform data into knowledge, or even further, into action. To do this, there should be a classification and specialization of tasks, so that results are achieved in the most efficient way; in other words, within time and budget.
In recent years, some authors have proposed frameworks to tackle data analysis problems in a more structured way, and have even developed open source tools to support them.
These frameworks continue to evolve at a fast pace, although their foundations remain the same.
Data Science process
In a nutshell, this framework proposes four well-defined steps to tackle data analysis problems.
Data collection & import
What can’t be measured can’t be improved. Generating, collecting, buying, or simply documenting what data is available should be the first step of any data analysis effort. Afterwards, the data must be moved into the analysis environment.
Once the data is collected and documented, it must be transformed for consumption by the data scientist. If possible, document the post-processed data sets as well. Surprisingly, these first two steps can consume 50% to 70% of any initiative.
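As a minimal sketch of this import-and-clean step, the snippet below parses a hypothetical CSV extract (the column names and values are invented for illustration), normalizes inconsistent fields, and drops incomplete rows. Real pipelines use richer tooling, but the shape of the work is the same.

```python
import csv
import io

# Hypothetical raw extract: inconsistent casing, stray spaces, a missing value.
raw = """customer,region,amount
Alice, north ,100.5
Bob,SOUTH,
Carol,north,42
"""

def import_and_clean(text):
    """Parse a CSV extract, normalize fields, and drop incomplete rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        region = row["region"].strip().lower()
        amount = row["amount"].strip()
        if not amount:  # drop rows with a missing amount
            continue
        rows.append({
            "customer": row["customer"].strip(),
            "region": region,
            "amount": float(amount),
        })
    return rows

clean = import_and_clean(raw)
print(clean)  # Bob's row is dropped; regions are normalized to lowercase
```

Even in this toy version, the cleaning logic (what counts as missing, how to normalize categories) is where most of the judgment, and most of the 50%–70% of the effort, actually goes.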
Visualization, Transformation & Modeling
Visuals have the power to surprise us, but they inherently don’t scale. Any data analysis effort will at some point include exploratory data analysis and visualization. Some transformations of the data’s structure may be necessary, depending on the tool or model to be used subsequently.
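One common structural transformation is reshaping a “wide” table into the “long” layout that many plotting and modeling tools expect. The sketch below, with invented data and column names, shows the idea in plain Python.

```python
# Wide layout: one row per region, one column per year (hypothetical data).
wide = [
    {"region": "north", "2019": 10, "2020": 12},
    {"region": "south", "2019": 7, "2020": 9},
]

def wide_to_long(rows, id_col, value_name):
    """Melt the year columns into one (id, year, value) record each."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue
            long_rows.append({
                id_col: row[id_col],
                "year": key,
                value_name: value,
            })
    return long_rows

long_table = wide_to_long(wide, "region", "sales")
print(long_table)  # four records, one per (region, year) pair
```

Libraries such as pandas provide this operation directly (often called “melt”), but the point stands: the right structure depends on the downstream tool, not on how the data arrived.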
Models, by contrast, do scale: it is possible to build many of them with a computer, but they cannot surprise us. Models make assumptions and carry different biases (page 17) that, in general, cannot self-correct.
It is necessary to understand whether the final consumer of the data will be more comfortable with a model that is accurate but not interpretable, or one that is less accurate but interpretable.
How will knowledge, actions, and conclusions be transferred to and consumed by the end user? Is a report the final product? Will that report trigger action? Or is a small web app necessary? Will an interactive app help the final user take action? Should it be time-bound or live?
All of these questions, and more, need to be answered.
The data science process can and must effectively support and improve decision making.
American Football and Data Science
I am going to use an American football analogy that I believe captures team dynamics well. To be efficient, a team needs two units: one in charge of defensive phases and another in charge of offensive phases.
Applying this concept to data science, it would be natural for team members to specialize in one of the four areas that compose the data analysis process.
Let’s assume that, for a certain number of projects, some members work only in one of the areas. It seems reasonable to think that, after the initial learning curve, these members will get quite good at it, and hence become more efficient over time.
It is even possible to imagine a push-pull or kanban process to optimize the flow between areas. This is something that has been done in lean manufacturing since the 1950s.
Some of you might think that this is not what you signed up to a data science team for: doing one thing, and one thing only. Well, the good news is that you won’t.
The success or failure of any project is judged by the team’s output, and the whole process is only as strong as its weakest individual step.
If you are a single person, you will touch all of these roles at some point in your project, although your efficiency will probably be lower than that of a team of analysts. This especially holds true for large batches of projects.
So what to do?
Therefore, in a team, rotating roles and responsibilities is necessary to step up the competencies of each individual member, and consequently of the team. To ensure good knowledge transfer between projects, a team can and must train each member continuously.
In a world where it is difficult to keep up with the latest developments, having colleagues who regularly update each other on state-of-the-art techniques is a competitive advantage that any data science team must strive to acquire.
What about reliability engineers?
Personally, I don’t see big differences between data science teams and reliability engineering teams, since the latter also need to gather, clean, analyze, and communicate in any effort. It is true that more specialized sub-steps can and must be added, but the basics are the same.
How would you organize your data science team? Do you believe that one-person projects can scale? And more importantly, are you playing offense or defense today?