The field of Machine Learning (ML) has been consistently evolving since Data Science started gaining traction in 2012. However, we believe 2018 was a critical inflection point in the ML industry.
original article published by kdnuggets.com
ML has consistently straddled the line between its origins in academic research and the need to serve customers, so it has often been hard to reconcile engineering standards with ML models. As both research and applied teams double down on their engineering and infrastructure needs, the nascent field of ML Engineering will build upon 2018’s foundation and truly blossom in 2019.
To illustrate this, we first look back at three takeaways from 2018, then describe three ways ML Engineering will grow in 2019.
Three takeaways from 2018
In both research and industry, ML is growing at a thrilling pace, which can make it hard to get a sense of the direction of the field. Here are three trends we’ve identified from 2018 that we think will have an impact on ML in industry in 2019 and beyond.
One of the most impressive trends in 2018 has been various models’ growing ability to capture increasingly useful information in dense learned representations. Here are a few examples:
Beyond the quality of the representations they generate, these exciting results have something else in common: they leverage an increasing amount of data and compute, which leads us to our second trend!
In recent years, we have learned that when it comes to performing on well-defined tasks, such as standard datasets or game environments, larger datasets and additional compute will help us push performance further.
In fact, if we look back to the examples in the representation learning section, they all leveraged a larger dataset or more compute:
While the strategy of scaling up has shown promise on academic tasks, this post focuses on practical ML in industry, where there is often no standard definition of a task or dataset, and the data distributions we are trying to understand are constantly shifting. This requires an entirely different set of tools.
In fact, the disparity between the research results we see today and what has been deployed in most products, outside of a few leading companies, points to a wide gap between researchers and well-funded teams on one side and other practitioners and startups on the other.
This past year, more companies have started to publicize the scale of the tooling they have built internally to help support their ML efforts. Here are a few personal favorites:
Seeing how some of the best engineering teams in industry have tackled the challenge of delivering ML to their users is inspiring. At the same time, because building such platforms represents such a herculean effort, many practitioners advise smaller teams to avoid building their own ML platform.
This leaves most small to medium-sized teams in a no man’s land between cobbling together offerings from service providers that do not exactly fit their needs and taking on significant engineering costs. There is a growing gap, pointing to a need for a set of frameworks like TensorFlow and Spark, and for widely shared best practices, covering all the parts of ML that are not purely model training.
More and more people realize that nobody needs yet another library or tutorial to build a 3-layer neural network on MNIST. Consequently, many startups have entered the space of data and model infrastructure, management, and deployment, and educational resources have started to focus more on these aspects. This is why we fundamentally believe that 2019 will be the year of ML Engineering. We’ll explain how we see this unfolding below!
Three ways ML Engineering will grow in 2019
A common warning shared with aspiring Data Scientists is that 90% of the work is about gathering and cleaning data, or validating, deploying, and monitoring models. If that is the case, why are 90% of the frameworks and GitHub repositories focused on model building?
A part of the job that demands so much of a practitioner’s time should have proper tooling support.
Now that many large companies have laid the groundwork for best practices in building ML products, and many teams are being forced to reinvent the wheel to build the majority of their modeling pipeline, we are finally at the right moment for open-source ML Engineering frameworks.
Amazing libraries such as Keras, TensorFlow, PyTorch, and fast.ai have made it easier than ever to define and train custom models. At the same time, many companies have launched hosted services that complement such libraries by helping with data visualization, cleaning, model serving, and experiment tracking.
The problem with many of these services is that ML Engineering needs are very use-case specific, and often require the flexibility of an extensible open-source framework. This is quite similar to Google Cloud offering APIs to call standard computer vision models: it is valuable for a subset of users, but would never be considered a replacement for Keras. The question now is, what will be the Keras of data exploration and cleaning?
There are many parts to ML Engineering work, and we could see frameworks being extended to cover multiple aspects, or separate solutions winning out in each of these categories. Here are a few domains to watch for:
Startups have started to propose solutions for many of these problems, but none have really helped define standards as widely adopted as TensorFlow and PyTorch are for model building. This is part of a larger trend: the lack of best practices for ML Engineering.
More and more developers can train a model to a given level of performance. However, when building a product you usually simply have a goal, with no attached dataset. This requires being able to:
These are crucial skills that usually make or break a data product, and there is a shortage of resources and best practices to help guide practitioners. When surveyed, most experienced ML professionals in our network say this is the type of content they want more of, since it relates to the problems they face every day, which leads us to our last point.
When it comes to recruiting, hiring managers of teams all over the Valley most often complain that while there is no shortage of people able to train models on a dataset, they need engineers who can build data-driven products.
At the same time, most aspiring Data Scientists and ML Engineers are primarily excited about training models on provided datasets. This excitement is usually inspired by blogs and courses that have focused on that part of the work, instead of data gathering, labeling, and cleaning, or model deployment.
This leads to a frustrating disconnect between companies looking to hire and newcomers. We are looking forward to more resources that promise to teach ML focusing on the 90% of the field that is not model building!