As you’re probably well aware by now, the mid-to-late 2010s ushered in an era of remarkable promise for Machine Learning (ML), building hype that seemed to permeate every industry. Phrases like “super-human performance” and discussions of automation replacing workers quickly cropped up, as ML models finally provided a robust tool to interpret and predict the vast troves of data we generate in our digital lives.
Along with cheaper, faster hardware, part of the reason for the excitement was the arrival of open datasets that let ML researchers and practitioners test new algorithms. Take, for example, the ImageNet dataset, a set of millions of labeled images with a thousand classes, created by Fei-Fei Li. Released in 2009, it enabled Microsoft researchers to surpass human-level image recognition for the first time in 2015. ML has proven a remarkable tool for modeling all kinds of tasks, including image classification, natural language processing, activity classification, and product recommendation, just to name a few.
The barrier to building these models dropped significantly as well: cloud notebooks, such as Google Colab, and welcoming, easy-to-use ML libraries, such as Keras, opened the door for anyone to train ML models. With open datasets such as ImageNet on hand, any software engineer new to ML could quickly build models that were mind-bogglingly accurate (>99%).
Now, let’s say you take that amazing ML model you’ve trained and want to use it in a production environment. According to Algorithmia’s “2020 State of Enterprise Machine Learning,” companies big and small are mainly tackling problems related to reducing costs, generating customer insights, improving customer experiences, and automating internal processes. Remember, that 99% metric is highly tuned to the dataset you trained your model on, which is a historical snapshot of data. Once your model is deployed to production, you might find that its outputs follow a different distribution than they did in training, or that the distribution of inputs has changed, meaning your users’ behavior has shifted. Suddenly, that 99% is meaningless. How do you know what your model doesn’t know?
If you’re relying on ML for any core business logic or sensitive tasks, it’s important to understand the ways your model can degrade, or even completely fail. Whereas a traditional app or website might error out or crash in an obvious way to the user, ML models in production have added complexity and harder-to-diagnose failures.
Despite the demonstrated power of machine learning, the real challenge to emerge over the last few years is how to productionize ML successfully.
Perhaps most importantly, ML practitioners must understand that the historical data they use can be full of racial and gender bias (and many other forms of discrimination), and should avoid encoding that bias into an abstract ML model, which users may blindly trust. Unfortunately, there have been numerous failures on this front, so it’s an essential factor to consider from the beginning. And just a reminder: ML models do not establish causation between input and output.
If you’ve gotten through the challenges of building a generalized model that works for everyone, the next reality you face can be a daunting one: a model’s quality starts to degrade as soon as you finish training it. This is due to two primary reasons: data drift and concept drift.
Data drift (also called population shift or covariate shift) is a slow process where the underlying distribution of your data changes, meaning that your model slowly gets worse at generalizing. An example could be a change in the age demographic of your users. Although the fundamental relationship between input and output stays the same, the model’s accuracy will slowly worsen. The good news is that since this is a slow process, it can be monitored.
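One way to monitor this kind of slow shift is to compare the distribution of a feature in production against its distribution at training time. The sketch below is a minimal illustration, not a prescribed method: it computes the Population Stability Index (PSI), one of several common drift scores, using only the Python standard library. The function name, the simulated age data, and the bin count are all assumptions made for the example.

```python
import math
import random


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """PSI between a reference sample and a newer sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: user ages at training time vs. two later snapshots.
rng = random.Random(0)
training_ages = [rng.gauss(35, 8) for _ in range(5000)]     # reference snapshot
same_population = [rng.gauss(35, 8) for _ in range(5000)]   # no drift
older_population = [rng.gauss(45, 8) for _ in range(5000)]  # simulated drift

psi_stable = population_stability_index(training_ages, same_population)
psi_drifted = population_stability_index(training_ages, older_population)
print(f"stable: {psi_stable:.3f}, drifted: {psi_drifted:.3f}")
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions rather than guarantees, so it is worth calibrating them against your own data.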
Concept drift is a more drastic degradation representing a fundamental shift in the relationship between your input and output. It’s not hard to think of a recent example when the workings of the world were fundamentally upended. When the COVID pandemic led to the shutdown of countries and businesses in March 2020, the accuracy of Instacart's ML model for predicting inventory dropped from 91% to 63% (think toilet paper). User buying patterns totally changed, and the historical data being used was no longer relevant. Instacart adapted by shortening their training window from several weeks to ten days.
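A sudden drop like this only becomes visible if you keep scoring predictions against the outcomes that eventually arrive. The following is a minimal sketch of a rolling accuracy check, assuming ground-truth labels become available shortly after each prediction. The class name, window size, baseline, and tolerance are hypothetical values chosen for illustration, and this is not a description of Instacart's system.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions."""

    def __init__(self, window=200, baseline=0.90, tolerance=0.10):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong
        self.baseline = baseline              # accuracy measured at training time
        self.tolerance = tolerance            # allowed drop before flagging

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        # Only judge once the window is full, to avoid noisy early alerts.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance


monitor = RollingAccuracyMonitor(window=200, baseline=0.90, tolerance=0.10)

# Phase 1 (simulated): the model is right 9 times out of 10.
for i in range(200):
    actual = "in_stock" if i % 10 else "out_of_stock"
    monitor.record(predicted="in_stock", actual=actual)
before = monitor.degraded()

# Phase 2 (simulated): buying patterns change and only 6 in 10 are right.
for i in range(200):
    actual = "in_stock" if i % 10 < 6 else "out_of_stock"
    monitor.record(predicted="in_stock", actual=actual)
after = monitor.degraded()

print(before, after, monitor.accuracy())
```

When the check trips, one possible response is the kind Instacart chose: retrain on a shorter, more recent window so that stale data stops dominating.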
A quick note: not every change over time is concept drift. ML models such as Recurrent Neural Networks can account for trends and seasonal changes, like increased shopping on weekends.
Hopefully, this insight didn’t leave you too daunted about training and utilizing machine learning in a production environment. As long as you’re aware of how and why your machine learning models can degrade and break, and are willing to do the work to monitor and retrain your models, production ML is feasible for teams small and large.