Understanding ROC Curves with Scikit-Learn: A Comprehensive Tutorial
The Receiver Operating Characteristic (ROC) curve is a vital tool for evaluating the performance of binary classification models. It provides a visual representation of a model's ability to discriminate between positive and negative classes across various threshold settings. This tutorial delves into the intricacies of ROC curves, their interpretation, and their implementation using the scikit-learn library in Python.
Introduction to ROC Curves
ROC curves plot the true positive rate (TPR) against the false positive rate (FPR) at different threshold values, ranging from 0.0 to 1.0. A model's skill is determined by its ability to assign a higher probability to a randomly chosen real positive occurrence than to a negative occurrence, on average. A model with no skill traces the diagonal line from (0, 0) to (1, 1), passing through (0.5, 0.5), while a model with perfect skill is represented by the point (0, 1).
Key Concepts
- True Positive (TP): Correctly predicted positive instances, such as a Covid-19 test that correctly flags an infected patient.
- True Negative (TN): Correctly predicted negative instances.
- False Positive (FP): Incorrectly predicted as positive.
- False Negative (FN): Incorrectly predicted as negative; the model predicts negative, but the instance is actually positive. In a smog prediction system, we may be far more concerned with having low false negatives than low false positives: a false negative would mean failing to warn about a high-smog day, leaving the public unable to take precautions against the resulting health issues.
- True Positive Rate (TPR) / Recall / Sensitivity: TP / (TP + FN). TPR, or sensitivity, is the proportion of actual positives correctly identified by the model, making it a key indicator of how well the model captures the true signals (positives).
- False Positive Rate (FPR): FP / (FP + TN). FPR represents how often our model incorrectly classifies negative class instances as positive. It measures the proportion of actual negative instances that are incorrectly identified as positive by the model, indicating the rate of false alarms.
- Specificity: TN / (TN + FP). Specificity is a metric that, unlike recall, rates the certainty of 'absence', as opposed to the certainty of presence. Specificity (referred to as the true negative rate in ROC curves) tells us, in the medical field, how effective the model is at identifying the absence of a disease. It equals 1 - FPR.
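The relationships among these rates are easy to verify directly. A minimal sketch, using made-up confusion-matrix counts purely for illustration:

```python
# Hypothetical confusion-matrix counts: 100 actual positives, 100 actual negatives
tp, fn = 80, 20   # actual positives split into hits and misses
tn, fp = 90, 10   # actual negatives split into correct rejections and false alarms

tpr = tp / (tp + fn)          # recall / sensitivity
fpr = fp / (fp + tn)          # false-alarm rate
specificity = tn / (tn + fp)  # true negative rate

print(tpr, fpr, specificity)  # 0.8 0.1 0.9
```

Note that `specificity` comes out to exactly `1 - fpr`, since both are computed from the same pair of counts.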
The Power of Thresholds
In binary classification, models typically don't simply output a definitive "True" or "False" prediction. Instead, they assign a probability to each class, and a decision threshold (often 50%) is used to determine the final classification. However, this threshold can be adjusted to fine-tune the model's behavior for a specific problem. This flexibility comes from the way that probabilities may be interpreted using different thresholds that allow the operator of the model to trade-off concerns in the errors made by the model, such as the number of false positives compared to the number of false negatives.
Adjusting the Decision Threshold
Adjusting the decision threshold means, in most cases, determining the class manually from model.predict_proba(), rather than relying on model.predict(). We may want a model with sky-high recall (as in tumour detection), or sky-high precision (as in search engine results). But, in certain cases, we may also want some compromise between the two.
- High Recall, Low Precision: In scenarios like car breakdown prediction, it's preferable to have a high recall, ensuring that all potential issues are flagged, even if it means some false positives. The owner of a False Positive car will face a minor inconvenience of going to the repair shop only to find out that his car is fine, but on the other hand, most cases of cars that might break (and even cause accidents, maybe) are covered. We reduce FN (and raise the recall) but increase FP (and lower the precision).
- Low Recall, High Precision: In situations like stock picking, where resources are limited and risk aversion is high, a high precision is desired, even at the expense of missing some positive instances. By picking only the best ones we reduce the False Positives (and raise the precision) while accepting to increase the False Negatives (and reducing the recall).
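A minimal sketch of moving the threshold by hand with predict_proba(); the synthetic dataset and the 0.2 threshold below are illustrative choices, not recommendations:

```python
# Illustrative example: a synthetic dataset and a hand-picked low threshold
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

# Probability of the positive class for each sample
proba = model.predict_proba(X)[:, 1]

# Default behaviour: threshold at 0.5 (equivalent to model.predict(X))
y_pred_default = (proba >= 0.5).astype(int)

# Lowering the threshold flags more positives: recall rises, precision falls
y_pred_low = (proba >= 0.2).astype(int)

print(int(y_pred_low.sum()), int(y_pred_default.sum()))
```

Lowering the threshold can only add positive predictions, never remove them, which is exactly the high-recall / low-precision trade described above.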
When to Use ROC Curves
ROC curves are particularly useful when:
- The dataset is balanced and the model needs to be evaluated across all thresholds.
- False positives and false negatives are of similar importance.
- Comparing different diagnostic tests, since each test can be evaluated across its full range of operating thresholds (e.g., trading unnecessary treatment against missing a disease).
Limitations of ROC Curves
However, ROC curves can be misleading on imbalanced datasets. If the proportion of positive to negative instances in a test set changes, the ROC curve will not change, because TPR and FPR each draw on only one column of the confusion matrix. Metrics such as accuracy, precision, lift, and F-scores use values from both columns, so they shift as the class distribution shifts, even if the fundamental classifier performance does not. In such cases, the precision-recall curve (PRC) is a more suitable alternative.
Area Under the Curve (AUC)
AUC, or Area Under the Curve, is a single scalar value ranging from 0 to 1 that gives a performance snapshot of the model across all possible classification thresholds. Because it is a single number, AUC-ROC makes it easy to compare multiple models regardless of the thresholds they use.
- AUC close to 1: Excellent performance, strong ability to distinguish between classes. An ideal model would hug the top-left corner of the plot-where TPR is maximized and FPR is minimized.
- AUC around 0.5: No better than random guessing, signaling no discriminatory power.
- AUC close to 0: The model is systematically ranking negatives above positives; its predictions are effectively inverted.
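These three regimes can be sanity-checked with roc_auc_score on hand-made scores (the score values below are purely illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

perfect = [0.1, 0.2, 0.8, 0.9]    # every positive outranks every negative
inverted = [0.9, 0.8, 0.2, 0.1]   # the same ranking, reversed

print(roc_auc_score(y_true, perfect))   # 1.0
print(roc_auc_score(y_true, inverted))  # 0.0
```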
Interpreting the ROC Curve
The intent of the ROC curve is to show how well the model works for every possible threshold, as a relation of TPR vs FPR. So, to plot the curve, we need to calculate these variables for each threshold and plot the results on a plane. On a ROC curve, the false positive rate (1 - specificity) is on the x-axis, and the true positive rate (recall) is on the y-axis.
- Good Classifier: The ROC curve hugs the top-left corner of the plot, with its "elbow" close to the coordinate (0, 1).
- Bad Classifier: The ROC curve stays close to the diagonal line where TPR = FPR, which means that the classifier has the same predictive power as flipping a coin.
Precision-Recall Curve as an Alternative
The precision-recall curve (PRC) is better suited to imbalanced datasets, especially when false positives and false negatives carry different costs for the machine learning model.
Precision and Recall
- Precision: TP / (TP + FP). Precision is a ratio of the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class.
- Recall: TP / (TP + FN). Recall is calculated as the ratio of the number of true positives divided by the sum of the true positives and the false negatives.
F1 Score
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both. It is useful when we need a clear, single-number summary of model performance, especially on imbalanced datasets.
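A tiny worked example on made-up labels: with TP = 3, FP = 1, and FN = 1, precision and recall are both 0.75, so their harmonic mean (F1) is also 0.75.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels chosen so that TP = 3, FP = 1, FN = 1
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```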
When to Use Precision-Recall Curves
Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. The reason is that the large number of class 0 examples typically means we are less interested in the model's skill at predicting class 0 correctly.
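A brief sketch of precision_recall_curve() on toy scores (the values are illustrative); scikit-learn pins the final point of the curve at recall = 0, precision = 1:

```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The returned recall values decrease as the threshold rises,
# ending at the pinned point (recall = 0, precision = 1)
print(recall[-1], precision[-1])  # 0.0 1.0
```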
Multi-Class ROC AUC
For multiclass classification, AUC-ROC is extended using the One-vs-All (OvA) approach. Each class is treated as the positive class once, and the remaining classes are grouped as the negative class. For example, if you have classes A, B, C, D, you will get four ROC curves, one for each class:
- Class A vs. (B, C, D)
- Class B vs. (A, C, D)
- Class C vs. (A, B, D)
- Class D vs. (A, B, C)
Steps to Use AUC-ROC for Multiclass Models
- One-vs-All Conversion: Treat each class as the positive class and all others combined as the negative class.
- Train a Binary Classifier per Class: Fit the model separately for each class-vs-rest combination.
- Compute AUC-ROC for Each Class: Plot the ROC curve for every class. Calculate the AUC value for each curve.
- Compare Performance: A higher AUC score means the model is better at distinguishing that class from the others.
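In scikit-learn, these steps are bundled into roc_auc_score via the multi_class parameter. A minimal sketch on the three-class iris dataset (scored on training data purely for brevity):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_iris(return_X_y=True)  # three classes: 0, 1, 2
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Per-class probabilities, one column per class
proba = clf.predict_proba(X)

# One-vs-rest AUC, averaged across the three class-vs-rest problems
auc_ovr = roc_auc_score(y, proba, multi_class='ovr')
print(auc_ovr > 0.9)  # True: iris is easily separable
```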
Averaging Methods
It is also essential to select an appropriate averaging method (micro vs. macro) for multi-class problems:
- Micro-averaging: Calculate metrics globally by counting the total true positives, false negatives, and false positives. This treats each instance equally, regardless of its class.
- Macro-averaging: Calculate metrics for each class separately and then average them. Giving equal weight to each class can be either an advantage or a disadvantage, depending on the objective.
Implementing ROC Curves in Scikit-Learn
Scikit-learn provides convenient functions for generating and evaluating ROC curves:
- roc_curve(y_true, y_score): Calculates the true positive rate and false positive rate for different probability thresholds. It requires the same input arguments as precision_recall_curve(), which we have already discussed and computed in the last section.
- roc_auc_score(y_true, y_score): Computes the Area Under the Curve (AUC) from prediction scores. We can use sklearn to easily calculate the ROC AUC.
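A short usage sketch of both functions on toy scores (the values are illustrative):

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) pair per threshold, plus the thresholds themselves
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Single-number summary of the same curve
print(roc_auc_score(y_true, y_score))  # 0.75
```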
Plotting the ROC Curve from Scratch
To truly understand ROC curves, it's beneficial to implement the plotting process from scratch:
- Train a classifier model on the dataset.
- Create "n" candidate thresholds between 0 and 1.
- For each threshold, build the confusion matrix, calculate TPR and FPR, and store the pair in a list.
- Plot the stored (FPR, TPR) coordinates to draw the ROC curve.
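The steps above can be sketched as a small function; the name roc_points and the toy inputs are illustrative, and plotting is left to matplotlib (e.g. plt.plot(fprs, tprs)):

```python
import numpy as np

def roc_points(y_true, y_score, n_thresholds=101):
    """Sweep thresholds from 0 to 1 and collect (FPR, TPR) pairs."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_true == 1
    neg = ~pos
    fprs, tprs = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred = y_score >= t  # classify everything at this threshold
        tprs.append((pred & pos).sum() / pos.sum())
        fprs.append((pred & neg).sum() / neg.sum())
    return fprs, tprs

fprs, tprs = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(fprs[0], tprs[0])    # 1.0 1.0  (threshold 0: everything is positive)
print(fprs[-1], tprs[-1])  # 0.0 0.0  (threshold 1: nothing is positive)
```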
Example: Comparing Different Models
Let’s use the same dataset and compare the performance of four classification algorithms using ROC. Plotting all four curves on one set of axes, the RandomForestClassifier performs best with an AUC-ROC of 0.94, while the KNeighborsClassifier is the weakest with an AUC-ROC of 0.85.
tags: #scikit #learn #roc #curve #tutorial

