Interpretable Machine Learning: Techniques and Applications
Machine learning (ML) has become an integral part of our lives, influencing products, processes, and research across various industries. However, the opacity of many machine learning models, often referred to as "black boxes," poses significant challenges. These models make predictions without providing clear explanations, leading to trust issues, undetected biases, and difficulties in debugging. Interpretable Machine Learning (IML) emerges as a crucial field, aiming to bridge this gap by making these complex models more transparent and understandable.
The Need for Interpretability
Interpretability in machine learning refers to the degree to which a human can understand the cause of a decision made by a machine learning model. It is about extracting relevant knowledge from a model concerning relationships either contained in the data or learned by the model. While predictive performance is often the primary goal, interpretability becomes essential when understanding why a prediction was made is equally important.
Human Curiosity and Learning: Humans naturally seek explanations for unexpected events to update their understanding of the world. Opaque machine learning models hinder this process, preventing the extraction of meaningful insights.
Bias Detection: Machine learning models can inadvertently learn biases from training data, leading to discriminatory outcomes. Interpretability tools help identify and mitigate these biases.
Debugging and Auditing: Interpretable models are easier to debug and audit, ensuring that they function as intended and do not produce erroneous or harmful results.
Trust and Acceptance: Explanations increase user trust and encourage the adoption of machine learning systems, particularly in high-stakes domains.
Scientific Discovery: In scientific disciplines, interpretability allows researchers to gain knowledge from models, turning them into sources of insight rather than just prediction engines.
Categories of Interpretable Machine Learning Techniques
Interpretable machine learning techniques can be broadly categorized into two main approaches:
Intrinsic Interpretability: This approach involves building models that are inherently interpretable due to their simple structures. Examples include decision trees, linear regression, and rule-based models.
Post-hoc Interpretability: This approach focuses on explaining existing, potentially complex "black box" models after they have been trained. This involves using various techniques to understand how the model makes predictions.
Furthermore, interpretability can be classified as:
Global Interpretability: Understanding how the model works as a whole by examining its structure and parameters.
Local Interpretability: Examining individual predictions of a model to understand why it made a specific decision.
Intrinsic Interpretability: Building Interpretable Models
Intrinsic interpretability is achieved by designing self-explanatory models that incorporate interpretability directly into their structures. These models are either globally interpretable or provide explanations when they make individual predictions.
Globally Interpretable Models
Globally interpretable models can be constructed in two ways: directly trained from data with interpretability constraints or extracted from a complex and opaque model.
Adding Interpretability Constraints
The interpretability of a model can be enhanced by incorporating interpretability constraints during training. Examples include:
- Sparsity: Encouraging the model to use fewer features for prediction.
- Monotonicity: Ensuring that features have monotonic relationships with the prediction.
- Pruning Decision Trees: Replacing subtrees with leaves to encourage simpler trees.
- Semantic Constraints: Adding semantically meaningful constraints to improve interpretability, such as learning disentangled representations in convolutional neural networks (CNNs).
However, there are often trade-offs between prediction accuracy and interpretability when constraints are incorporated directly into a model: the more interpretable model may achieve lower prediction accuracy than a less interpretable one.
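As a small illustration of the sparsity constraint above, the sketch below fits an L1-regularized linear model (lasso) with a hand-rolled proximal gradient loop; the synthetic data, penalty strength, and iteration count are arbitrary choices for the demo. The L1 penalty drives most weights to exactly zero, leaving a short, readable list of active features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: only the first 2 of 10 features actually matter.
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)

def lasso_ista(X, y, lam=0.3, n_iter=500):
    """L1-regularized least squares via proximal gradient (ISTA).
    The L1 penalty pushes most weights to exactly zero, yielding a
    sparse -- and therefore more interpretable -- linear model."""
    n, d = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    lr = 1.0 / L                        # safe step size
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = w - lr * grad
        # Soft-thresholding: the proximal operator of the L1 norm.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

w = lasso_ista(X, y)
print("non-zero weights:", np.flatnonzero(np.abs(w) > 1e-3))
```

The interpretability/accuracy trade-off is visible here too: the surviving weights are shrunk toward zero, so the sparse model is slightly less accurate than an unconstrained fit.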
Interpretable Model Extraction
An alternative is interpretable model extraction, also referred to as mimic learning, which often sacrifices little performance. The idea is to approximate a complex model with an easily interpretable one, such as a decision tree, rule-based model, or linear model. As long as the approximation is sufficiently close, the statistical properties of the complex model are reflected in the interpretable model, yielding a model with comparable predictive performance whose behavior is much easier to understand. For instance, a tree ensemble can be distilled into a single decision tree. Similarly, a deep neural network (DNN) can be used to train a decision tree that mimics the input-output function captured by the network, transferring the knowledge encoded in the DNN to the tree; to avoid overfitting the tree, active learning can be applied during training. These techniques convert the original model into a decision tree with better interpretability while maintaining comparable predictive performance.
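A minimal sketch of the mimic-learning idea, with a random two-layer tanh network standing in for the black-box teacher (a stand-in for the demo, not any specific published setup): we fit a linear student on the teacher's predictions rather than the true labels, then measure how faithfully the student reproduces the teacher on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

# A stand-in "black box": a random two-layer tanh network (hypothetical teacher).
W1 = rng.normal(scale=0.5, size=(5, 8))
w2 = rng.normal(size=8)
def black_box(X):
    return np.tanh(X @ W1) @ w2

X_train = rng.normal(size=(1000, 5))
X_test = rng.normal(size=(200, 5))

# Mimic learning: fit an interpretable linear student on the teacher's
# *predictions* (not ground-truth labels), so the student approximates
# the teacher's behaviour rather than the data directly.
y_teacher = black_box(X_train)
A = np.hstack([X_train, np.ones((len(X_train), 1))])  # add intercept column
coef, *_ = np.linalg.lstsq(A, y_teacher, rcond=None)

# Fidelity: how much of the teacher's behaviour the student reproduces.
A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
y_t, y_s = black_box(X_test), A_test @ coef
fidelity = 1 - np.sum((y_t - y_s) ** 2) / np.sum((y_t - y_t.mean()) ** 2)
print(f"surrogate fidelity (R^2 on teacher outputs): {fidelity:.3f}")
```

A decision-tree student works the same way; only the surrogate class changes.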
Locally Interpretable Models
Locally interpretable models are usually achieved by designing model architectures that can justify why a specific decision is made. Unlike globally interpretable models, which offer a certain degree of transparency about what is going on inside a model, locally interpretable models provide users an understandable rationale for a specific prediction. A representative scheme is the attention mechanism, widely used to explain predictions made by sequential models such as recurrent neural networks (RNNs). Attention is advantageous in that it lets users see which parts of the input the model attends to, by visualizing the attention weight matrix for individual predictions.

Attention has been used for image captioning: a CNN encodes an input image into a vector, and an RNN with attention generates the description. When generating each word, the model shifts its attention to the relevant parts of the image, and the visualized attention weights show what the model is looking at as it produces each word. Similarly, attention has been incorporated into neural machine translation (NMT). At the decoding stage, the attention module assigns different weights to the hidden states of the encoder, allowing the decoder to selectively focus on different parts of the input sentence at each step of output generation. Visualizing the attention scores shows users how words in one language depend on words in the other for a correct translation.
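The attention computation itself is compact; the sketch below shows scaled dot-product attention over a toy three-token input with hand-picked embeddings (hypothetical values chosen so the expected behavior is obvious). The resulting weight vector is exactly the quantity one would visualize for a single prediction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# Hand-picked "embeddings" for a 3-token input (illustrative values only).
keys = np.array([
    [1.0, 0.0, 0.0],   # token 0
    [0.0, 1.0, 0.0],   # token 1
    [0.0, 0.0, 1.0],   # token 2
])
query = np.array([0.1, 0.2, 2.0])  # decoder state aligned with token 2

scores = keys @ query / np.sqrt(keys.shape[1])  # scaled dot-product scores
weights = softmax(scores)                       # one weight per input token
context = weights @ keys                        # attended summary vector

print("attention weights:", np.round(weights, 3))
```

The weights sum to one and concentrate on token 2, which is precisely the "what is the model looking at" signal described above.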
Post-Hoc Interpretability: Explaining Black Box Models
Post-hoc interpretability aims to provide a global understanding of the knowledge a pre-trained model has acquired, presenting its parameters or learned representations in a form intuitive to humans.
Traditional Machine Learning Explanation
Traditional machine learning pipelines rely heavily on feature engineering, which transforms raw data into features that better represent the predictive task. The features are generally interpretable, and the role of machine learning is to map this representation to the output. For such pipelines, a simple yet effective explanation measure applicable to most models is feature importance, which indicates the statistical contribution of each feature to the model's decisions.
Model-Agnostic Explanation
Model-agnostic feature importance is broadly applicable to various machine learning models. It treats a model as a black box and does not inspect internal model parameters. A representative approach is permutation feature importance. The key idea is that the importance of a specific feature to the overall performance of a model can be determined by measuring how the model's prediction accuracy deviates after permuting the values of that feature. More specifically, given a pretrained model with n features and a test set, the average prediction score of the model on the test set is p, the baseline accuracy. We shuffle the values of one feature on the test set and compute the average prediction score of the model on the modified dataset. This process is repeated for each feature, eventually producing n prediction scores for the n features. We then rank the importance of the n features by how far their scores drop relative to the baseline accuracy p. This approach has several advantages. First, the handcrafted feature values do not need to be normalized. Second, it generalizes to nearly any machine learning model that takes handcrafted features as input. Third, the strategy has proven robust and efficient in practice.
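The permutation procedure described above can be sketched in a few lines. Here a least-squares linear model stands in for the pretrained black box and R² serves as the prediction score; both are arbitrary choices for the demo, and the procedure never looks inside the model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Data where feature 0 dominates, feature 1 matters a little, feature 2 is noise.
X = rng.normal(size=(400, 3))
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + 0.2 * rng.normal(size=400)

# "Pretrained model": ordinary least squares fit on a training split.
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def r2(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

baseline = r2(y_te, X_te @ w)  # baseline accuracy p on the untouched test set

importances = []
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])          # shuffle one feature
    importances.append(baseline - r2(y_te, X_perm @ w))   # score drop = importance

print("permutation importances:", np.round(importances, 3))
```

Shuffling the dominant feature destroys most of the model's accuracy, while shuffling the noise feature barely moves the score.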
Model-Specific Explanation
There also exist explanation methods designed for specific models. Model-specific methods usually derive explanations by examining internal model structures and parameters. Here, we describe how to obtain feature importance for two families of machine learning models.
Generalized linear models (GLMs) are a family of models that form a linear combination of input features and model parameters and pass it through a transformation function (often nonlinear). Examples include linear regression and logistic regression. The weights of a GLM directly reflect feature importance, so users can understand how the model works by inspecting and visualizing the weights. However, the weights may not be reliable when features are not appropriately normalized and vary in their scale of measurement. Moreover, interpretability decreases when the feature dimension grows too large for humans to comprehend.
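As a sketch, the snippet below trains a plain logistic regression (a GLM) by gradient descent on synthetic standardized features, then reads feature importance directly off the learned weights; the data-generating weights are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(4)

# Binary labels driven mostly by feature 0; features are standardized,
# so weight magnitudes are comparable across features.
X = rng.normal(size=(500, 3))
logits = 2.0 * X[:, 0] + 0.5 * X[:, 1]          # feature 2 is irrelevant
y = (logits + rng.logistic(size=500) > 0).astype(float)

# Plain logistic regression trained by batch gradient descent.
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))              # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)           # gradient of the log-loss

print("learned weights:", np.round(w, 2))
# The weights ARE the explanation: sign gives direction of effect,
# magnitude gives strength (valid here because features share one scale).
```

With unnormalized features the same weights would be misleading, which is exactly the caveat noted above.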
Tree-based ensemble models, such as gradient boosting machines, random forests, and XGBoost, are typically inscrutable to humans. There are several ways to measure the contribution of each feature. The first is to calculate the accuracy gain produced when a feature is used in tree branches: before a new split on a feature is added to a branch, some elements may be misclassified, while after the split the two resulting branches are each more accurate. The second measures feature coverage, i.e., the relative number of observations routed through splits on that feature. The third counts the number of times a feature is used to split the data.
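The three measures can be illustrated over hand-built tree structures (the splits, gains, and coverage counts below are hypothetical, standing in for a trained ensemble): walk every tree, and for each internal node accumulate its gain, its coverage, and a split count under the feature it splits on.

```python
from collections import defaultdict

# Two toy trees; each internal node records its split feature,
# accuracy gain, and the number of training rows it covers.
trees = [
    {"feature": "age", "gain": 0.30, "cover": 100,
     "left": {"feature": "income", "gain": 0.12, "cover": 60,
              "left": "leaf", "right": "leaf"},
     "right": "leaf"},
    {"feature": "income", "gain": 0.25, "cover": 100,
     "left": "leaf",
     "right": {"feature": "age", "gain": 0.05, "cover": 40,
               "left": "leaf", "right": "leaf"}},
]

gain = defaultdict(float)   # approach 1: total accuracy gain per feature
cover = defaultdict(int)    # approach 2: rows covered by the feature's splits
count = defaultdict(int)    # approach 3: number of splits on the feature

def walk(node):
    if node == "leaf":
        return
    f = node["feature"]
    gain[f] += node["gain"]
    cover[f] += node["cover"]
    count[f] += 1
    walk(node["left"])
    walk(node["right"])

for t in trees:
    walk(t)

print("gain:", dict(gain), "cover:", dict(cover), "count:", dict(count))
```

The three rankings need not agree: a feature split on often near the leaves can score high on count but low on gain.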
DNN Representation Explanation
DNNs, in contrast to traditional models, not only discover the mapping from representation to output but also learn representations from raw data. The learned deep representations are usually not human-interpretable; hence, explanation for DNNs mainly focuses on understanding the representations captured by neurons at the intermediate layers.
Model-Agnostic Methods for Local Explanations
Several model-agnostic methods provide local explanations for individual predictions, offering insights into the factors that influenced a specific outcome. These methods can be applied to any machine learning model, regardless of its internal structure.
LIME: Local Interpretable Model-Agnostic Explanations
LIME focuses on explaining the decisions made by black-box models by approximating their behavior with simpler, more interpretable models locally.
Here's how LIME works:
Select a Data Point: Start by selecting a data point or instance for which you want an explanation.
Generate Perturbed Data: LIME creates a dataset of perturbed instances by making small, random modifications to the selected data point. These modifications typically change or mask individual feature values while keeping the others constant. The goal is to create a diverse set of data points that represent the local neighborhood of the original instance.
Get Model Predictions: For each of the perturbed instances, LIME uses the machine learning model (the one you want to explain) to obtain predictions. These predictions serve as labels for the perturbed instances.
Fit a Local Model: LIME then fits an interpretable model, often a linear model like a logistic regression or a decision tree, to the perturbed data points and their corresponding model predictions. The goal is to create a model that approximates the behavior of the black-box model in the local neighborhood of the selected instance.
Weighted Sampling: The perturbed instances are weighted based on their proximity to the original instance, so that instances more similar to the original data point have a stronger influence on the fitted local model and hence on the explanation.
Generate Explanations: The fitted local model is used to generate feature importance scores. These scores indicate which features had the most influence on the model's prediction for the selected instance. The higher the weight for a feature, the more important that feature was in making the prediction.
Present the Explanation: The feature importance scores can be used to explain why the model made a particular prediction for the selected data point. These explanations can take various forms, such as highlighting important features, showing which features contributed to the prediction, or providing textual explanations.
LIME works by generating a dataset of perturbed or sampled instances from the original data and observing the corresponding predictions of the black-box model. It then fits a simple interpretable model (often a linear or decision tree model) to this generated dataset, effectively modeling the local behavior of the complex model around a specific instance. This local model provides an interpretable explanation for the prediction made by the black-box model for that instance.
The key idea behind LIME is to make model-agnostic explanations, meaning it can be applied to any machine learning model without needing to know its internal structure. LIME is particularly valuable in cases where transparency and trust in model predictions are essential, such as in healthcare, finance, and legal applications.
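The steps above can be reproduced from scratch in a few lines, without the lime library itself; the black-box function, perturbation scale, and proximity kernel below are all illustrative choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(5)

# A nonlinear "black box" we want to explain (hypothetical model).
def black_box(X):
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

x0 = np.array([1.0, 2.0])                # step 1: instance to explain

# Step 2: perturb the instance to sample its local neighbourhood.
Z = x0 + rng.normal(scale=0.3, size=(500, 2))
# Step 3: label the perturbations with the black box's own predictions.
yz = black_box(Z)
# Steps 4-5: proximity kernel -> weighted least-squares linear surrogate.
d2 = np.sum((Z - x0) ** 2, axis=1)
sw = np.exp(-d2 / 0.25)                  # closer samples count more
A = np.hstack([Z, np.ones((len(Z), 1))]) # features plus intercept
W = np.sqrt(sw)[:, None]                 # row weights for weighted LS
coef, *_ = np.linalg.lstsq(A * W, yz * W[:, 0], rcond=None)

# Step 6: the surrogate's coefficients are the local feature importances.
print("local importances:", np.round(coef[:2], 2))
```

Around x0 the black box behaves like a plane with slopes roughly (2, 3), and that is what the surrogate's coefficients recover; a different x0 would yield a different local explanation.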
SHAP: SHapley Additive exPlanations
Shapley Values are a concept borrowed from cooperative game theory and have been adapted for model interpretation. They provide a way to fairly distribute the contribution of each feature to a model's prediction. Shapley Values explain the impact of individual features on a particular prediction by measuring the average contribution of a feature when it is added to all possible subsets of features. They offer a unified and consistent way to attribute predictions to feature values, making them model-agnostic. Shapley Values can reveal not only the importance of features but also how interactions between features influence a prediction. SHAP (SHapley Additive exPlanations) is a Python library that implements Shapley Values for model explanation.
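For a model with only a few features, Shapley values can be computed exactly by enumerating all feature subsets (the SHAP library uses faster approximations). In the sketch below, "removing" a feature means replacing it with a baseline value, one common convention and an assumption of this demo; the toy model includes an interaction term so its attribution is visibly shared.

```python
import numpy as np
from itertools import combinations
from math import factorial

# Toy model with an interaction between features 0 and 2.
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]

x = np.array([1.0, 2.0, 3.0])   # instance to explain
baseline = np.zeros(3)          # "feature absent" stand-in values
n = 3

def value(S):
    """Model output when only features in S take their real values."""
    z = baseline.copy()
    for i in S:
        z[i] = x[i]
    return model(z)

phi = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(others, k):
            # Shapley weight for a coalition of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi[i] += weight * (value(S + (i,)) - value(S))

print("Shapley values:", phi)
# Efficiency property: the attributions sum to f(x) - f(baseline).
```

The x0*x2 interaction (worth 3 here) is split evenly between features 0 and 2, which is exactly the "fair distribution" behavior described above.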
Model-Agnostic Methods for Global Explanations
Partial Dependence Plots (PDP)
Partial Dependence Plot (PDP) is a technique used to visualize the marginal effect of one or two features on the predicted outcome of a machine learning model. It shows how the predicted response changes as a function of the selected features, while averaging out the effects of all other features. This helps to understand the relationship between the chosen features and the model's predictions.
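A from-scratch sketch of the procedure: for each grid value of the chosen feature, fix that feature to the value for every row, predict, and average. The model and data below are synthetic, chosen so the partial dependence of feature 0 has a known slope.

```python
import numpy as np

rng = np.random.default_rng(6)

# Model with a known linear effect in feature 0, so the PDP is easy to check.
def model(X):
    return 2.0 * X[:, 0] + np.sin(X[:, 1])

X = rng.normal(size=(300, 2))

def partial_dependence(model, X, feature, grid):
    """For each grid value v: set the chosen feature to v for EVERY row,
    predict, and average -- marginalizing out the remaining features."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd.append(model(Xv).mean())
    return np.array(pd)

grid = np.linspace(-2, 2, 5)
pd = partial_dependence(model, X, feature=0, grid=grid)
print("PDP of feature 0:", np.round(pd, 2))
```

Plotting `pd` against `grid` gives the usual PDP curve; here it is a straight line of slope 2, with the sin(x1) term averaged out as a constant offset.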
Evaluating Interpretations: The PDR Framework
Evaluating the quality of interpretations is crucial to ensure their trustworthiness and usefulness. The Predictive, Descriptive, Relevant (PDR) framework provides three overarching desiderata for evaluation:
Predictive Accuracy: The model's ability to accurately predict outcomes on unseen data.
Descriptive Accuracy: The degree to which an interpretation method accurately captures the relationships learned by the machine learning model.
Relevancy: The extent to which the interpretation is meaningful and useful to the intended audience. Relevancy is judged relative to a human audience.
Challenges and Future Directions
Despite the advancements in interpretable machine learning, several challenges remain:
Scalability: Many interpretation methods are computationally expensive and do not scale well to large datasets or complex models.
Causality vs. Correlation: Interpretations often reveal correlations between features and predictions, but establishing causal relationships remains a challenge.
Human-Computer Interaction: Designing effective ways to present interpretations to humans and facilitate interaction with machine learning models is an ongoing area of research.
Standardization: The lack of standardized metrics and evaluation protocols makes it difficult to compare different interpretation methods.
Future research directions include developing more scalable and efficient interpretation methods, incorporating causal reasoning into interpretations, and creating more user-friendly interfaces for interacting with interpretable models.

