The field of Machine Learning (ML) has been consistently evolving since Data Science started gaining traction in 2012. However, we believe 2018 was a critical inflection point in the ML industry.
original article published by kdnuggets.com
ML has long straddled the line between its origins in academic research and the need to serve customers, so it has often been hard to reconcile engineering standards with ML models. As both research and applied teams double down on their engineering and infrastructure needs, the nascent field of ML Engineering will build upon 2018’s foundation and truly blossom in 2019.
To illustrate this, we wanted to both:
- Share three of the key trends we observed in 2018
- Describe three key ways we think they’ll shape 2019
Three takeaways from 2018
Both in research and industry, ML is growing at a thrilling pace, which can make it hard to get a sense of the direction of the field. Here are three trends we identified in 2018 that we think will have an impact on ML in industry in 2019 and beyond.
- Redefining representation learning
One of the most impressive trends in 2018 has been various models’ growing ability to capture increasingly useful information in dense learned representations. Here are a few examples:
- Generative Adversarial Networks (GANs) have existed since 2014 and have grown from an exciting research proposition to a cutting-edge technique. The ability of these models to learn a condensed representation allows them to be applied as photo-realistic editors and to style transfer for dance. Recent research is further improving the quality of these representations.
- Reinforcement Learning has had a few success stories in 2018 as well. OpenAI’s Dota 2 bot showed great promise, beating semi-professional teams. A core part of the work involved reducing a complex game to a smaller learned representation.
- Natural Language Processing (NLP) finally got its ImageNet moment in 2018. This past year has been called the year of transfer learning for NLP because of successful approaches in large-scale language model pre-training, such as ULMFiT and BERT. Again, the key to these is to consume a large corpus of text, such as the entirety of Wikipedia, to learn a more useful representation of text.
Other than the quality of the representations they generate, these exciting results have something else in common: they leverage an increasing amount of data and compute, which leads us to our second trend!
- Scaling up
In recent years, we have learned that when it comes to performing on well-defined tasks, such as standard datasets or game environments, larger datasets and additional compute will help us push performance further.
In fact, if we look back to the examples in the representation learning section, they all leveraged a larger dataset or more compute:
- Many GAN improvements such as BigGAN have relied on massively larger datasets, and training models for longer.
- OpenAI’s Dota 2 bot plays 180 years’ worth of games against itself every day during training.
- NLP models learn better representations by leveraging the entirety of Wikipedia before even starting to train on their actual task!
While the strategy of scaling up has shown promise on academic tasks, this post focuses on practical ML in industry where there is often no standard definition of a task or dataset, and the data distributions we are trying to understand are constantly shifting in nature. This requires an entirely different set of tools.
In fact, the disparity between the research results we see today and what has been deployed in most products, outside of a few leading companies, points to a wide gap between researchers and well-funded teams on one side and other practitioners and startups on the other.
- Building internal platforms
This past year, more companies have started to publicize the scale of the tooling they have built internally to help support their ML efforts. Here are a few personal favorites:
- Uber released two more articles in their series explaining Michelangelo, which we highly recommend reading.
- Airbnb shared a very candid and fascinating account of the trials and tribulations of applying Deep Learning to search rankings.
- Netflix explained the infrastructure they have put in place to use notebooks in production.
Seeing how some of the best engineering teams in industry have tackled the challenge of delivering ML to their users is inspiring. At the same time, because building such platforms represents such a herculean effort, many practitioners advise smaller teams to avoid building their own ML platform.
This leaves most small to medium-sized teams in a no man’s land between cobbling together offerings from service providers that do not exactly fit their needs, and taking on significant engineering costs. There is a growing gap pointing to a need for a set of frameworks like Tensorflow and Spark, and for widely shared best practices, covering all the parts of ML that are not purely model training.
More and more people realize that nobody needs yet another library or tutorial to build a 3-layer neural network on MNIST. Consequently, many startups have entered the space of data and model infrastructure, management, and deployment, and educational resources have started to focus more on these aspects. This is why we fundamentally believe that 2019 will be the year of ML Engineering. We’ll explain how we see this unfolding below!
Three ways ML Engineering will grow in 2019
A common warning shared with aspiring Data Scientists is that 90% of the work is about gathering and cleaning data, or validating, deploying, and monitoring models. If that is the case, why are 90% of the frameworks and GitHub repositories focused on model building?
A part of the job that demands so much of a practitioner’s time should have proper tooling support.
Now that many large companies have laid the groundwork for best practices in building ML products, and that many teams are being forced to reinvent the wheel for the majority of their modeling pipeline, we are finally at the right moment for open-source ML Engineering frameworks.
- Creating tooling for all Data Scientists
Amazing libraries such as Keras, Tensorflow, PyTorch, and fast.ai have made it easier than ever to define and train custom models. At the same time, many companies have launched hosted services that complement such libraries by helping with data visualization, cleaning, model serving, and experiment tracking.
The problem with many of these services is that ML Engineering needs are very use-case specific, and often require the flexibility of an extensible open-source framework. This is quite similar to Google Cloud offering APIs to call standard computer vision models: it is valuable for a subset of users, but would never be considered a replacement for Keras. The question now is, what will be the Keras of data exploration and cleaning?
There are many parts to ML Engineering work, and we could see frameworks being extended to cover multiple aspects, or separate solutions winning out in each of these categories. Here are a few domains to watch for:
- Data exploration, labeling, and versioning.
- Automated model testing, validation, serving, and lifecycle management.
- Experiment tracking, hyperparameter optimization, and experimentation tooling in general.
Startups have started to propose solutions for many of these problems, but none has really helped define standards as widely adopted as Tensorflow and PyTorch are for model building. This is part of a larger trend: the lack of best practices for ML Engineering.
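To make the "versioning" domain listed above a little more concrete, here is a minimal sketch of one possible approach: fingerprinting a dataset by its content so that an experiment can record exactly which data it was trained on. The function name `dataset_version` and the record format (a list of JSON-serializable dicts) are assumptions made for illustration, not the API of any existing tool.

```python
import hashlib
import json


def dataset_version(rows):
    """Return a short, deterministic fingerprint for a list of dict records.

    Hypothetical helper: the name and the 12-character length are
    illustrative choices, not a standard.
    """
    digest = hashlib.sha256()
    # Serialize each row with sorted keys, then sort the rows themselves,
    # so that neither key order nor row order changes the fingerprint.
    for line in sorted(json.dumps(row, sort_keys=True) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    v1 = [{"text": "great product", "label": 1}, {"text": "broken", "label": 0}]
    v2 = [{"text": "great product", "label": 1}, {"text": "broken", "label": 1}]
    # Relabeling a single example yields a different version identifier.
    print(dataset_version(v1), dataset_version(v2))
```

Storing such an identifier next to each trained model is one simple way to tell whether two experiments differ because of the model or because of the data.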
- Defining a clear path to production
More and more developers can train a model to a given level of performance. However, when building a product you usually simply have a goal, with no attached dataset. This requires being able to:
- Go from a product goal to formulating a learning problem that is both feasible and likely to yield a useful model.
- Efficiently gather a dataset, build a simple model, and consistently improve both the model and the dataset (many can improve a model given a standard dataset, but few know how to iterate on a dataset).
- Decide what it means for a model to actually be ready to be put in front of users.
- Monitor deployed models to know when and how to update them.
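As an illustration of the last item in this list, the sketch below shows one very simple way to flag a deployed model for review: compare the mean of a numeric input feature in recent traffic against the training baseline. The function names and the threshold of 3.0 are assumptions chosen for the example; real monitoring setups typically track many features and use more robust statistics.

```python
import math
import statistics


def mean_shift_z(baseline, live):
    """Z-score of the live window's mean relative to the baseline values."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    # Standard error of the mean for a window of len(live) samples.
    standard_error = base_std / math.sqrt(len(live))
    return (statistics.fmean(live) - base_mean) / standard_error


def needs_review(baseline, live, threshold=3.0):
    """Flag the feature when its live mean drifts beyond the threshold.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    return abs(mean_shift_z(baseline, live)) > threshold


if __name__ == "__main__":
    training_values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
    recent_values = [15, 16, 14, 15, 16]
    print(needs_review(training_values, recent_values))
```

A check like this only says that the inputs have moved; deciding whether to retrain, relabel, or roll back still requires looking at the data.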
These are crucial skills that usually make or break a data product, and there is a shortage of resources and best practices to help guide practitioners. When surveyed, this is the type of content that most experienced ML professionals in our network want more of, since it relates to the problems they face every day, which leads us to our last point.
- Promoting resources that best prepare aspiring professionals
When it comes to recruiting, hiring managers on teams all over the Valley most often complain that while there is no shortage of people able to train models on a dataset, they need engineers who can build data-driven products.
At the same time, most aspiring Data Scientists and ML Engineers are most excited about training models on provided datasets. This excitement is usually inspired by blogs and courses that have focused on that part of the work, instead of data gathering/labeling/cleaning and model deployment.
This leads to a frustrating disconnect between companies looking to hire and newcomers. We are looking forward to more resources that promise to teach ML focusing on the 90% of the field that is not model building!