
It’s in the data! – or why data is more powerful than algorithms

It’s been said many times in the machine learning world that the second most important part of the analysis process is having the right data: correctly collected, labelled and clean. Otherwise there is a big risk of suffering a case of garbage in, garbage out.

Initiatives are starting to pop up where the objective is not to predict outcomes. Instead they aim to clean certain data sets and make them easily usable by the public. This is because the huge amount of time this labour-intensive task takes is a high barrier for a lot of people. They also acknowledge that the machine learning world is evolving quickly, and that what might not be usable now might be in a few years, if and only if the raw material (data) is correctly labelled and made usable.

The Product Edge in Machine Learning Startups - a16z Podcast

By the way, the most important part is to have domain-specific knowledge of the problem we would like to solve. This is why it is so important to embed data scientists inside the team that wants to solve the problem. Hopefully it is the same team that is going to use the solution.

I believe this is why big companies such as Google or Facebook open source their algorithms, for everyone to use and develop. You can read more about the same idea in this article by The Guardian.

“By sharing their algorithms, Facebook and Google are merely sharing the recipe. Someone has to provide the eggs and flour and provide the baking facilities (which in Google and Facebook’s case are vast data-computation facilities, often located near hydroelectric power stations for cheaper electricity).”

The Guardian.

If a killer algorithm that gave you a competitive advantage existed, it most probably wouldn’t be open sourced that easily. And since it doesn’t exist yet, it is not a bad idea to open a whole new business like cloud services.

“This is probably why Facebook and Google have so freely shared their methodologies: they know that the real value in their companies is the vast quantities of data they retain about each one of us.”

The Guardian.


Failure Data

The same principle applies to fault data. Without proper labelling of the failure data, it is quite challenging to extract meaningful information.
Ideally we would have information such as the failure initiator, mode and mechanism, as well as the environmental conditions in which the component was designed to operate and the conditions in which it was (and is) operating.

Context is key

Then we could go one layer deeper and gather more contextual information from the previous steps like batch number, time in operation (if it’s a time driven failure mechanism), warehousing conditions, recommended maintenance plan, executed maintenance plan, etc.
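To make the idea of a properly labelled failure record a bit more concrete, here is a minimal Python sketch. It is only an illustration: the class name, the field names and the example values are all hypothetical, chosen to mirror the items listed above, and a real schema would depend on the equipment and the team's domain knowledge.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FailureRecord:
    """One failure event. Every field name here is a hypothetical label."""
    initiator: Optional[str] = None
    mode: Optional[str] = None
    mechanism: Optional[str] = None
    design_conditions: Optional[str] = None
    operating_conditions: Optional[str] = None
    batch_number: Optional[str] = None
    hours_in_operation: Optional[float] = None
    warehousing_conditions: Optional[str] = None
    recommended_maintenance: Optional[str] = None
    executed_maintenance: Optional[str] = None


def missing_labels(record: FailureRecord) -> List[str]:
    """Return the names of the fields that were never filled in."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Made-up example values, only to show how gaps in the context surface.
record = FailureRecord(
    mode="seal leakage",
    batch_number="B-001",
    hours_in_operation=1200.0,
)
print(missing_labels(record))
```

Even a simple check like this shows how much context is usually missing before any algorithm gets involved.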

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

John Tukey

It seems clear to me that gathering this information on every occasion is quite difficult, if not impossible. Therefore there is a long way to go from domain-specific knowledge to automatically feeding an algorithm and extracting meaningful information.

Every now and then I find myself hearing buzzwords for tools presented as something meritorious in themselves. The tools are not the solution, but the means to find a solution!

So I propose something: instead of presenting machine learning as the end purpose, let’s talk about the problem that we are trying to solve and what tools and techniques we could use to try to solve it.

This is a mistake that I see time and time again in professional circles, as well as in the media. Am I being too rigid? Should the problem be tackled from both sides? Maybe with both algorithms and domain-specific knowledge?