Machine Learning Monitoring Best Practices

Machine learning (ML) models, unlike conventional applications, are susceptible to performance degradation over time due to various factors, including subtle changes in the production environment. Monitoring the health and throughput of a deployed ML service is not enough on its own. A comprehensive monitoring system is crucial to ensure the reliability and accuracy of ML models in production. This article explores best practices for monitoring ML models, focusing on the key challenges and strategies for maintaining optimal performance.

The Need for Model Monitoring

Regardless of the efforts invested in developing, training, and evaluating ML models before deployment, their functionality inevitably degrades over time. Even subtle trends in the production environment can radically alter the model's behavior, especially for advanced models using deep learning and other non-deterministic techniques.

Setting Up a Monitoring System

To monitor a model's performance in production, a system that can analyze both the data being fed to the model and its predictions is required. This system should include a service that sits alongside the prediction service to ingest samples of input data and prediction logs, calculate metrics, and forward these metrics to observability and monitoring platforms for alerting and analysis.

Where possible, the monitoring service should calculate data and prediction drift metrics, as well as backtest metrics that directly evaluate the accuracy of predictions using historical data. Alongside tools for monitoring the data processing pipelines used for feature extraction, this setup enables the identification of key issues affecting model performance.

Key Challenges in ML Model Monitoring

Several challenges can impact the performance of ML models in production, including:


  • Training-Serving Skew: This occurs when there is a significant difference between training and production conditions, causing the model to make incorrect or unexpected predictions.
  • Data Drift: This refers to changes in the input data fed to the model during inference. It can manifest as changes in the overall distribution of possible input values, as well as the emergence of new patterns in data that weren’t present in the training set. Significant data drift can occur when real-world events lead to changes in user preferences, economic figures, or other data sources, as in the COVID-19 pandemic’s effect on financial markets and consumer behavior.
  • Prediction Drift: This refers to changes in the model’s predictions. If the distribution of prediction values shifts significantly over time, this can indicate that the data is changing, that predictions are getting worse, or both. For example, a credit-lending model being used to evaluate users in a neighborhood with lower median income and different average household demographics from the training data might start making significantly different predictions.
  • Concept Drift: This occurs when the relationship between the model’s input and output changes, so the ground truth the model previously learned is no longer valid.
  • Data Processing Pipeline Issues: Problems in the data processing pipelines feeding the production model can lead to errors or incorrect predictions.

To cope with drift, ML models generally have to be retrained at a set cadence.

Strategies for Effective Model Monitoring

Backtesting with Ground Truth Evaluation Metrics

Where possible, you should use backtest metrics to track the quality of your model’s predictions in production. These metrics directly evaluate the model by comparing prediction results with ground truth values collected after inference. Backtesting is only possible when the ground truth can be captured quickly and used authoritatively to label data. This is straightforward for models trained with supervised learning, where the training data is already labeled. In production, the ground truth needs to be obtained from historical data ingested at some feedback delay after inference.

In order to calculate the evaluation metric, each prediction needs to be labeled with its associated ground truth value. Depending on your use case, you’ll either have to create these labels manually or pull them in from existing telemetry, such as RUM data. For example, let’s say you have a classification model that predicts whether a user who was referred to your application’s homepage will purchase a subscription.
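For the subscription example above, labeling can amount to joining prediction logs with later-arriving purchase events. The sketch below shows one minimal way to do this in Python; the record layout, field names, and values are all hypothetical:

```python
# Hypothetical prediction log records for the subscription classifier.
predictions = [
    {"user_id": "u1", "predicted": 1},
    {"user_id": "u2", "predicted": 0},
    {"user_id": "u3", "predicted": 1},
]

# Purchase events ingested after the feedback delay serve as ground truth.
purchases = {"u1", "u3"}

# Attach a ground truth label to each prediction by joining on user_id.
labeled = [
    {**p, "actual": 1 if p["user_id"] in purchases else 0}
    for p in predictions
]
```

In a real pipeline, both sides of this join would come from your log store rather than in-memory literals, but the labeling step itself stays this simple.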

In order to prepare your prediction data for backtesting, it’s ideal to first archive prediction logs in a simple object store (such as an Amazon S3 bucket or Azure Blob Storage instance). This way, you can choose to roll up logs at a desired time interval and then forward them to a processing pipeline that assigns the ground truth labels. Then, the labeled data can be ingested into an analytics tool that calculates the final metric for reporting to your dashboards and monitors.

When generating the evaluation metric, it’s important to choose a rollup frequency, both for the metric calculation and for its ingestion into your observability tools, that balances granularity and frequent reporting against seasonality in the prediction trend. Different evaluation metrics apply to different types of models. For example, classification models have discrete output, so their evaluation metrics work by comparing discrete classes. Classification evaluations like precision, recall, accuracy, and AUROC target different qualities of the model’s performance, so you should pick the ones that matter most for your use case. If, say, your use case places a high cost on false positives (such as a model evaluating loan applicants), you’ll want to optimize for precision. Because precision, recall, and accuracy are calculated with simple algebraic formulas, you can easily use a dashboarding tool to calculate and compare each of them from labeled predictions.
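Because those formulas are simple, the calculation can be sketched directly from labeled prediction records. The records below are hypothetical:

```python
# Hypothetical labeled predictions (predicted vs. ground truth class).
labeled = [
    {"predicted": 1, "actual": 1},
    {"predicted": 1, "actual": 0},
    {"predicted": 0, "actual": 0},
    {"predicted": 0, "actual": 1},
    {"predicted": 1, "actual": 1},
]

# Tally the confusion matrix cells.
tp = sum(1 for r in labeled if r["predicted"] == 1 and r["actual"] == 1)
fp = sum(1 for r in labeled if r["predicted"] == 1 and r["actual"] == 0)
fn = sum(1 for r in labeled if r["predicted"] == 0 and r["actual"] == 1)
tn = sum(1 for r in labeled if r["predicted"] == 0 and r["actual"] == 0)

precision = tp / (tp + fp)            # share of positive calls that were right
recall = tp / (tp + fn)               # share of actual positives that were caught
accuracy = (tp + tn) / len(labeled)   # share of all predictions that were right
```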


For models with continuous decimal outputs (such as a regression model), you should use a metric that can compare continuous prediction and ground truth values. In order to understand not only whether your predictions are accurate relative to the ground truth but also whether or not the model is facilitating the overall performance of your application, you can correlate evaluation metrics with business KPIs.
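Two common choices for continuous outputs are mean absolute error (MAE) and root-mean-square error (RMSE). A minimal sketch, with hypothetical prediction and ground truth values:

```python
import math

# Hypothetical regression outputs and their ground truth values.
predicted = [102.0, 98.5, 110.0, 95.0]
actual = [100.0, 99.0, 108.0, 97.0]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE weights all errors equally; RMSE penalizes large errors more heavily.
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
```

Which of the two you alert on depends on whether occasional large misses matter more to your use case than the typical error.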

Drift Detection

Even when ground truth is unavailable, delayed, or difficult to obtain, you can still use drift metrics to identify trends in your model’s behavior. Prediction drift metrics track the changes in the distributions of model outputs (i.e., prediction values) over time. This is based on the notion that if your model behaves similarly to how it acted in training, it’s probably of a similar quality. Data drift metrics detect changes in the input data distributions for your model. They can be early indicators of concept drift, because if the model is getting significantly different data from what it was trained on, it’s likely that prediction accuracy will suffer.

Managed ML platforms like Vertex AI and SageMaker provide tailored drift detection tools. If you’re deploying your own custom model, you can use a data analytics tool like Evidently to ingest predictions and input data, calculate the metrics, and then report them to your monitoring and visualization tools. For example, let’s say you want to detect data drift for a recommendation model that scores product pages according to the likelihood that a given user will add the product to their shopping cart.

To get a stable metric that isn’t too sensitive to local seasonal variations and that can detect changes with reasonable lag, you must choose an optimal rollup range. Your monitoring service should load each data set into arrays to use as inputs in the calculation of the final metric.
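One widely used drift metric that works this way is the Population Stability Index (PSI), which compares binned distributions between a reference window (such as training scores) and a production rollup window. The sketch below uses hypothetical bin edges and score values; in practice, a drift detection tool would compute this for you:

```python
import math

def psi(reference, production, edges):
    """Population Stability Index between two samples over fixed bins."""
    eps = 1e-6  # floor for empty bins so the log term stays defined

    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                # Last bin is closed on the right so edge values are counted.
                if edges[i] <= v < edges[i + 1] or (
                    i == len(edges) - 2 and v == edges[-1]
                ):
                    counts[i] += 1
                    break
        total = max(len(values), 1)
        return [max(c / total, eps) for c in counts]

    ref, prod = proportions(reference), proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))

# Hypothetical recommendation scores: training vs. a production window.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training_scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.3, 0.5]
production_scores = [0.7, 0.8, 0.9, 0.95, 0.6, 0.85, 0.75, 0.9]

drift = psi(training_scores, production_scores, edges)
# A common rule of thumb treats PSI above 0.2 as significant shift.
```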

Feature and Feature Attribution Drift

Feature drift and feature attribution drift break down data drift by looking at the distributions of individual feature values and their attributions: decimal values that quantify how much each feature contributes to the final prediction. While feature drift is typically caused by changes in the production data, feature attribution drift is typically the result of repeated retrains. Both of these metrics can help you find specific issues that may be related to changes in a subset of the data.


For example, if a model’s predictions are drifting significantly while data drift looks normal, it may be that feature attribution drift is causing the final feature values to deviate. By monitoring feature attribution drift, you can spot these changing attributions that may be affecting predictions even as feature distributions remain consistent. As you break down your model’s prediction trends to detect feature and feature attribution drift, it’s also helpful to consider how features interact with each other, as well as which features have the largest impact on the end prediction. The more features you have in your model, the more difficult it becomes to set up, track, and interpret drift metrics for all of them.
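As a simple illustration of breaking drift down per feature, you could compare summary statistics for each feature between a training snapshot and a production window. The feature names, values, and two-standard-deviation threshold below are hypothetical; production systems typically use full distribution tests (such as PSI or KS) rather than a mean-shift check:

```python
import statistics

# Hypothetical per-feature samples from training and production.
training = {
    "page_views": [3, 4, 5, 4, 3, 5, 4],
    "price_usd": [20.0, 25.0, 22.0, 24.0, 21.0, 23.0, 25.0],
}
production = {
    "page_views": [3, 4, 4, 5, 4, 3, 5],
    "price_usd": [35.0, 40.0, 38.0, 42.0, 37.0, 41.0, 39.0],
}

drifted = []
for name, ref in training.items():
    prod = production[name]
    # Flag a feature when its production mean shifts by more than two
    # reference standard deviations (an illustrative threshold only).
    shift = abs(statistics.mean(prod) - statistics.mean(ref))
    if shift > 2 * statistics.stdev(ref):
        drifted.append(name)
```

The same loop structure applies however many features you track, which is exactly why the metric count grows with the feature count.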

For any of these drift metrics, you should set alerts with thresholds that make sense for your particular use case. To avoid alert fatigue, you need to pick a subset of metrics that report quickly and are easy to interpret.

Retraining Cadence

How quickly drift accumulates will determine how often you retrain your model, and you need to settle on a cadence that accounts for the computational cost and development overhead of training while ensuring that your model meets its SLOs. As you retrain, you should closely monitor for new feature and feature attribution drift and, where possible, evaluate predictions to validate whether retraining has actually improved model performance.

Monitoring Data Processing Pipelines

Failures in the data processing pipelines that convert raw production data into features for your model are a common cause of data drift and degraded model quality. If these pipelines change the data schema or process data differently from what your model was trained on, your model’s accuracy will suffer. These issues often arise when multiple models that are addressing unique use cases all leverage the same data.

You can establish a bellwether for these kinds of data processing issues by alerting on unexpected drops in the quantity of successful predictions. To help prevent them from cropping up in the first place, you can add data validation tests to your processing pipelines that check whether the input data for predictions is valid, and alert on failures. To accommodate the continual retraining and iteration of your production model, it’s best to automate these tests using a workflow manager like Airflow. By tracking database schema changes and other user activity, you can help your team members ensure that their pipelines are updated accordingly before they break. Finally, by using service management tools to centralize knowledge about data sources and data processing, you can help ensure that model owners and other stakeholders are aware of data pipeline changes and their potential impacts on dependencies. For example, one team in your organization might pull data from a feature store to train its recommendation model while another team manages that database and a third team starts pulling from it for business analytics.
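As a sketch of what such a validation test might look like, the snippet below checks a batch of input rows against an expected schema before they reach the model. The field names, types, and rows are hypothetical:

```python
# Hypothetical expected schema for the model's input features.
EXPECTED_SCHEMA = {"user_id": str, "page_views": int, "price_usd": float}

def validate_rows(rows):
    """Return a list of error strings; an empty list means the batch is valid."""
    errors = []
    for i, row in enumerate(rows):
        for field, expected_type in EXPECTED_SCHEMA.items():
            if field not in row:
                errors.append(f"row {i}: missing field {field!r}")
            elif not isinstance(row[field], expected_type):
                errors.append(f"row {i}: {field!r} has wrong type")
    return errors

batch = [
    {"user_id": "u1", "page_views": 3, "price_usd": 19.99},
    {"user_id": "u2", "page_views": "4", "price_usd": 24.99},  # bad type
]
problems = validate_rows(batch)
```

A check like this could run as a task in your workflow manager, with any nonempty result triggering an alert before the bad batch reaches the prediction service.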

Centralized Monitoring and Visualization

Because your model relies on upstream and downstream services managed by disparate teams, it is helpful to have a central place for all stakeholders to report their analysis of the model and learn about its performance. By forwarding your calculated evaluation metrics to the same monitoring or visualization tool you’re using to track KPIs, you can form dashboards that establish a holistic view of your ML application.

Leveraging Tools for ML Observability

Datadog includes products and features such as Log Management, Event Management, custom metrics, alerts, dashboards, and more to help you centralize ML observability data and form more powerful insights. Datadog also helps you achieve full-stack ML platform observability with integrations not only for ML services like Vertex AI and SageMaker but also for the most popular databases, workflow tools, and compute engines. By forwarding your ML model performance metrics to Datadog, you can track and alert on them alongside telemetry from the rest of your system, including RUM data to help interpret your model’s business impact.
