June 28, 2017
Adopting machine learning? First rearchitect your data.
By: Matthew O'Kane

I’m looking at the important question of how companies can most effectively scale Machine Learning across their operations. In my previous blog I talked about the need to build a new type of data science team within enterprises to ready the business for Machine Learning at scale. Here, I’m going to look at the second of the three Machine Learning fundamentals: good data processes.

The challenge facing businesses is that data is usually organised, managed and governed for management information (MI) reporting first, and only later adapted for more advanced analytics. As a result, data scientists have had to spend time and effort re-engineering data sets to make them suitable for meaningful model development. If Machine Learning is to achieve scale, then businesses need to start thinking about how to structure their data so that it is fit for purpose from the outset.

Complicating this endeavour is the fact that machines think differently to humans, and learn in different ways. When we analyse something, we usually start with a hypothesis, and then find ways to come up with clean, conformed data to verify the hypothesis. This doesn’t work for machines. Instead, data engineering teams need to build data sets that meet three core principles:

  1. Quantity of data is all. Machines process data much faster than humans, and therefore make it possible to use Big Data to come up with meaningful insights. Whereas human reasoning requires that data sets are standardised, machines can infer correlations from many thousands of structured and unstructured data sets, taken from a variety of sources. So, the first principle of data for Machine Learning is to throw as much as you can into the system—whether that’s operational data, customer data, freely available web data, or even open data sets created by governments. The more data a machine has, the better it can learn, and the more likely it will spot interesting correlations and causality.

  2. Raw data is best. Data can be aggregated and summarised in any number of ways to make it fit for feature engineering. However, data that has been engineered in this way is less likely to be useful for other features. For Machine Learning, it is preferable to keep data raw. This is because Machine Learning algorithms create new features automatically—and are usually better at it than humans.

  3. Focus on prescriptive analytics. Data correlations are one thing, but what’s most valuable to businesses is understanding the causes behind those correlations: proving how one action causes another is how businesses make better decisions. The only way to create this type of data is to run experiments in the business and collect the resulting data. New machine learning algorithms can then help identify causality from that data, thereby engineering the shift from predictive to prescriptive analytics. Whereas predictive analytics focuses on the likelihood of something happening, prescriptive analytics identifies the action or set of actions most likely to optimise a result.
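
To make the third principle a little more concrete, here is a minimal sketch in Python of the step from experimental data to a recommended action. Everything in it is an assumption made for illustration: the `experiment_log` records, the action names (`no_offer`, `discount`, `free_shipping`) and the outcome values are hypothetical, and a real analysis would need proper randomisation, far larger samples and significance testing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical experiment log (illustrative values only): each customer was
# randomly assigned one action, and an outcome such as spend was recorded.
experiment_log = [
    {"action": "no_offer", "outcome": 10.0},
    {"action": "no_offer", "outcome": 12.0},
    {"action": "discount", "outcome": 15.0},
    {"action": "discount", "outcome": 17.0},
    {"action": "free_shipping", "outcome": 13.0},
    {"action": "free_shipping", "outcome": 14.0},
]


def mean_outcome_by_action(log):
    """Average the recorded outcome for each action in the experiment."""
    grouped = defaultdict(list)
    for row in log:
        grouped[row["action"]].append(row["outcome"])
    return {action: mean(values) for action, values in grouped.items()}


def estimated_uplift(log, control="no_offer"):
    """Difference in mean outcome between each action and the control group.

    Because assignment was random, this difference is a simple estimate of
    the effect the action caused, not just a correlation.
    """
    means = mean_outcome_by_action(log)
    baseline = means[control]
    return {a: m - baseline for a, m in means.items() if a != control}


def recommend_action(log):
    """The prescriptive step: pick the action with the best mean outcome."""
    means = mean_outcome_by_action(log)
    return max(means, key=means.get)


uplift = estimated_uplift(experiment_log)
best = recommend_action(experiment_log)
print(uplift, best)
```

The point of the sketch is the shape of the data rather than the arithmetic: the outcomes are only usable for prescribing an action because the business ran the experiment and recorded which action each customer received.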

With the rise of the machines, therefore, comes a responsibility to rebuild data processes to optimise them for Machine Learning. Data is the fuel of a Machine Learning system, but in my next blog I’m going to look at the machine itself. The third fundamental of Machine Learning is, of course, the technology stack that powers it. Drop by again and find out what investments you need to make to deliver effective Machine Learning applications across the enterprise.
