
Data Engineer Vs. Data Scientist

I had the opportunity to be present at a discussion about the future of the data professional landscape. There were two clearly defined positions. One held that data science will become a commodity, so data professionals are better off learning some data engineering. On the other side, the argument was that extracting valid knowledge from a given body of data is the critical skill, and that the tooling to put it in production will be the commodity. Is there a right answer?

Data science as a commodity

Let’s start with the first view: data science will be a commodity in the near future. The argument is that the machine learning (modelling) part, for example, will be the first step to get automated. There are already services like DataRobot that will throw a lot of models at a problem and optimise for a chosen indicator. The result might be a single model or an ensemble of them. More accuracy? This is the model. Fewer false positives? This other one here. Under this view, the skills to put models in production and maintain them will be the most critical.
Therefore the workhorse will be building infrastructure, maintaining complex data lakes, model deployment, API micro-services, defensive programming, unit testing, code coverage, data pipelines, as well as continuous integration and continuous deployment.
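To make the "pick the model by indicator" idea concrete, here is a minimal sketch in Python. It is not how DataRobot or any other service works internally; the candidate names and the confusion-matrix counts are made up purely for illustration, and `pick` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Validation results for one already-trained model (hypothetical counts)."""
    name: str
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    @property
    def false_positive_rate(self) -> float:
        return self.fp / (self.fp + self.tn)


# Illustrative numbers only, not results from any real model or service.
CANDIDATES = [
    Candidate("model_a", tp=90, fp=30, tn=70, fn=10),
    Candidate("model_b", tp=60, fp=5, tn=95, fn=40),
    Candidate("model_c", tp=70, fp=20, tn=80, fn=30),
]


def pick(candidates, objective):
    """Return the candidate that is best for the chosen indicator."""
    if objective == "accuracy":
        # Higher accuracy is better.
        return max(candidates, key=lambda c: c.accuracy)
    if objective == "false_positive_rate":
        # Lower false positive rate is better.
        return min(candidates, key=lambda c: c.false_positive_rate)
    raise ValueError(f"unknown objective: {objective}")


if __name__ == "__main__":
    for objective in ("accuracy", "false_positive_rate"):
        best = pick(CANDIDATES, objective)
        print(f"{objective}: {best.name}")
```

With these made-up numbers the two indicators choose different models, which is the point: once the selection is mechanical, the remaining work is deciding which indicator matters and running the chosen model in production.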

Data Engineer/Scientist by DataCamp

Data Engineering as a commodity

On the other hand, companies like H2O.ai are crushing some old paradigms with regard to the quantity of models that can be deployed during the knowledge discovery phase, making them relatively easy to deploy and maintain at scale. Therefore, in this new paradigm the infrastructure gets commoditized and the scarce resource is the ability to extract information for a given problem under a set of constraints. That sounds a lot like what a data scientist does. Accordingly, skills like data munging, model interpretation, collaborating with domain experts, visualisation, and communication within the organisation will be the scarce resource.

“The solution should ideally be a product”

Lab to Factory by Peadar Coyle

The interesting part is that this conversation was raised by two skilled data professionals, and each defended the argument opposite to their own background: the data engineer was vouching for the data scientist, and the data scientist for the data engineering skill set! This can only be good news.

As I explained in the article “Team roles in data science”, I advocate having both roles in the same data team. The two roles are intertwined in any initiative, and splitting them into two different teams can hinder quick iteration, which I believe is the key to any data-related initiative.

“Follow an iterative process, test everything, allow for both experimentation and failure, factor in 50% more time than your best estimates, bring stakeholders along closely with regular demos.”

Delivering Value Throughout the Analytical Process by Jonathan Sedar

Most probably the reality will be somewhere in the middle; for the moment there is an overlap in the skill sets of both roles. To sum up, we are better off learning some of the trade of our data colleagues.

More resources on the topic: