Understanding Positive Predictive Value (PPV) in Machine Learning
In the realm of machine learning, evaluating the performance of diagnostic tests and statistical measures is critical. The positive predictive value (PPV) is a key metric to consider. It helps determine the accuracy and reliability of machine learning models, especially in binary classification tasks such as medical diagnoses or fraud detection.
Defining Positive and Negative Predictive Values
The positive and negative predictive values (PPV and NPV, respectively) are the proportions of positive and negative test results that are true positives and true negatives, respectively. A high value indicates that the corresponding result is likely to be correct. Although sometimes used synonymously, the positive predictive value generally refers to what is established by control groups, while the post-test probability refers to the probability for an individual.
A "true positive" is the event that the test makes a positive prediction and the subject has a positive result under the gold standard, while a "false positive" is the event that the test makes a positive prediction but the subject has a negative result under the gold standard. A "true negative" is the event that the test makes a negative prediction and the subject has a negative result under the gold standard, while a "false negative" is the event that the test makes a negative prediction but the subject has a positive result under the gold standard. Likewise, the negative predictive value generally refers to what is established by control groups, while the negative post-test probability refers to the probability for an individual.
How PPV Works
PPV indicates the probability that, in case of a positive test, the patient really has the specified disease. However, there may be more than one cause for a disease, and any single potential cause may not always result in the overt disease seen in a patient. An example is the microbiological throat swab used in patients with a sore throat. Publications stating the PPV of a throat swab are usually reporting on the probability that this bacterium is present in the throat, rather than that the patient is ill from the bacteria found. If presence of this bacterium always resulted in a sore throat, then the PPV would be very useful. However, the bacteria may colonise individuals in a harmless way and never result in infection or disease; sore throats occurring in these individuals are caused by other agents, such as a virus. In this situation, the gold standard used in the evaluation study represents only the presence of bacteria (which might be harmless), not a causal bacterial sore throat illness.
Formula for PPV
PPV is calculated using the following formula:
PPV = True Positives / (True Positives + False Positives)
This calculation finds the proportion of positive test results that are correct out of the entire pool of all positives, both true and false.
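As a minimal sketch, the formula translates directly into code (the function name and counts here are illustrative, not from the text):

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: the fraction of positive predictions
    that are correct. Undefined when there are no positive predictions."""
    total_positives = true_positives + false_positives
    if total_positives == 0:
        raise ValueError("PPV is undefined with no positive predictions")
    return true_positives / total_positives

print(ppv(90, 10))  # 0.9: 90% of positive results are true positives
```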
Interpreting PPV
PPV assesses a diagnostic test’s accuracy by calculating the probability that a person who tests positive truly has the condition. PPV focuses on how trustworthy a positive result is in real-world testing scenarios. Hence, it is the best measure for interpreting an individual positive test result.
A high PPV indicates that a positive test result is likely to be accurate, meaning the individual truly has the condition. Conversely, a low PPV suggests that many of the positive results may be false positives.
The Importance of Prevalence
Note that the positive and negative predictive values can only be estimated using data from a cross-sectional study or other population-based study in which valid prevalence estimates may be obtained. The PPV is not intrinsic to the test: it also depends on the prevalence, increasing as the disease or condition becomes more common.
How Prevalence Affects PPV
The PPV incorporates the prevalence of the condition in the population by including the number of false positives in the formula. For example, imagine a disease screening test applied to 1,000 people, where 80 are truly diseased and test positive, while 40 disease-free individuals mistakenly test positive. The PPV is then 80 / (80 + 40) ≈ 67%: one in three positive results is wrong. When a condition is rare, even a test with good sensitivity can produce more false positives than true positives, driving PPV down and making many positive results incorrect. When a condition is common, the number of true positives rises and the number of false positives falls.
Standardizing PPV
Due to the large effect of prevalence upon predictive values, a standardized approach has been proposed, where the PPV is normalized to a prevalence of 50%.
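Fixing the prevalence at 50% leaves only the test's intrinsic properties, sensitivity and specificity. A short sketch of this normalization (the function name is ours; the algebraic reduction assumes the standard expression of PPV in terms of prevalence):

```python
def standardized_ppv(sensitivity: float, specificity: float) -> float:
    """PPV normalized to a prevalence of 50%. With prevalence fixed at 0.5,
    PPV = sens*0.5 / (sens*0.5 + (1-spec)*0.5) = sens / (sens + (1 - spec))."""
    return sensitivity / (sensitivity + (1 - specificity))

# Two tests evaluated in different populations can now be compared directly.
print(standardized_ppv(0.9, 0.95))  # 0.9 / 0.95 ≈ 0.947
```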
Addressing Prevalence Discrepancies
To overcome this problem, NPV and PPV should only be used if the ratio of the number of patients in the disease group and the number of patients in the healthy control group used to establish the NPV and PPV is equivalent to the prevalence of the diseases in the studied population, or, in case two disease groups are compared, if the ratio of the number of patients in disease group 1 and the number of patients in disease group 2 is equivalent to the ratio of the prevalences of the two diseases studied.
When an individual being tested has a different pre-test probability of having a condition than the control groups used to establish the PPV and NPV, the PPV and NPV are generally distinguished from the positive and negative post-test probabilities, with the PPV and NPV referring to the ones established by the control groups, and the post-test probabilities referring to the ones for the tested individual (as estimated, for example, by likelihood ratios).
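The likelihood-ratio update mentioned above can be sketched as follows (function and parameter names are illustrative, not from the text):

```python
def post_test_probability(pre_test_prob: float,
                          sensitivity: float,
                          specificity: float,
                          positive_result: bool = True) -> float:
    """Update an individual's pre-test probability using a likelihood ratio:
    post-test odds = pre-test odds * LR, then convert odds back to probability."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    if positive_result:
        lr = sensitivity / (1 - specificity)   # LR+
    else:
        lr = (1 - sensitivity) / specificity   # LR-
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# When the individual's pre-test probability equals the study prevalence,
# the positive post-test probability coincides with the study's PPV.
print(round(post_test_probability(0.5, 0.9, 0.95), 3))
```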
PPV vs. Sensitivity
Positive predictive value tells you the reliability of a positive test result. Sensitivity tells you how good the test is at detecting disease in those who actually have it; it is test-centered and helps evaluate the test's inherent ability. Positive predictive value helps you understand what a positive test result means for an individual, while sensitivity helps you evaluate and compare tests.
Sensitivity only considers people who truly have the disease, so prevalence does not affect it. Positive predictive value incorporates prevalence by including true positives + false positives in the denominator.
Real-World Example: Mammography
Mammograms typically have a sensitivity in the 80-90% range, depending on the population. They are more accurate in older women and those with less dense breast tissue, and somewhat less sensitive in younger women with denser breasts. This sensitivity means that mammography detects 8 or 9 out of every 10 cancers when they are present. However, the positive predictive value is far more volatile because it depends strongly on prevalence. In population-wide screening where most women do not have breast cancer, PPV can be quite low. For example, some studies report that only about 5-10% of women recalled for further testing after a screening mammogram actually have cancer. By contrast, when mammography occurs in a diagnostic setting where suspicion is already higher (e.g., a woman has a palpable lump), the PPV rises substantially. In such cases, published studies often report PPV values around 60-70%.
The sensitivity of 80-90% reflects mammography's inherent ability to detect most cancers when present. The positive predictive value, low in general screening but much higher in diagnostic contexts, shows what a positive mammogram means for an individual patient. This real-world example highlights why you must consider both sensitivity and PPV.
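To make the screening-versus-diagnostic contrast concrete, here is an illustrative calculation. The sensitivity, specificity, and prevalence values below are assumed for the sketch, not taken from published mammography studies:

```python
def ppv_at(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV as a function of prevalence for a test with fixed accuracy."""
    tp_rate = sensitivity * prevalence
    fp_rate = (1 - specificity) * (1 - prevalence)
    return tp_rate / (tp_rate + fp_rate)

sens, spec = 0.85, 0.90                  # assumed, within the 80-90% range above
screening = ppv_at(sens, spec, 0.005)    # assumed population-screening prevalence
diagnostic = ppv_at(sens, spec, 0.30)    # assumed prevalence with a palpable lump
print(f"screening PPV = {screening:.2f}, diagnostic PPV = {diagnostic:.2f}")
```

The same test yields a PPV of only a few percent in screening but well over 70% in the diagnostic setting, mirroring the published ranges quoted above.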
Practical Implications
For common diseases, a high positive predictive value means clinicians can act with confidence on positive results. For rare diseases, even accurate tests may yield low PPV, so confirmatory testing is often required.
Limitations and Considerations
Bayes' theorem confers inherent limitations on the accuracy of screening tests as a function of disease prevalence or pre-test probability. It has been shown that a testing system can tolerate significant drops in prevalence, up to a certain well-defined point known as the prevalence threshold, below which the reliability of a positive screening test drops precipitously. That said, Balayla et al. showed that sequential testing overcomes the aforementioned Bayesian limitations and thus improves the reliability of screening tests.
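The prevalence threshold admits a closed form in terms of sensitivity and specificity. The sketch below follows Balayla's published formula as we understand it; verify against the original paper before relying on it:

```python
import math

def prevalence_threshold(sensitivity: float, specificity: float) -> float:
    """Prevalence below which the PPV of a positive screen drops off sharply.
    Equivalent forms: (sqrt(sens*(1-spec)) + spec - 1) / (sens + spec - 1)
    and 1 / (1 + sqrt(LR+)), where LR+ = sens / (1 - spec)."""
    lr_plus = sensitivity / (1 - specificity)
    return 1 / (1 + math.sqrt(lr_plus))

print(prevalence_threshold(0.9, 0.9))  # 0.25 for a 90%/90% test
```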
Of note, in Balayla et al.'s sequential-testing analysis, the number of tests required depends on the natural logarithm of the positive likelihood ratio (LR+). Also note that a critical assumption is that the successive tests must be independent.
Alternatives to PPV
Precision is a model performance metric that corresponds to the fraction of values that actually belong to a positive class out of all of the values predicted to belong to that class. In other words, precision is the machine learning name for PPV.
Metrics quantify the performance of an ML model. In this section, we describe metrics for classification and regression tasks. Other tasks (segmentation, generation, detection, …) can use some of these but will often require other metrics that are specific to these tasks. The reader may refer to Chap. 13 for metrics dedicated to segmentation and to Subheading 6 of Chap. 23 for metrics dedicated to segmentation, classification, and detection.
Metrics for Classification
For classification tasks, the results can be summarized in a matrix called the confusion matrix (Fig. 1). For binary classification, the confusion matrix divides the test samples into four categories, depending on their true and predicted labels:
- True Positives (TP): Samples for which the true and predicted labels are both 1. Example: The patient has cancer (1), and the model classifies this sample as cancer (1).
- True Negatives (TN): Samples for which the true and predicted labels are both 0. Example: The patient does not have cancer (0), and the model classifies this sample as non-cancer (0).
- False Positives (FP): Samples for which the true label is 0 and the predicted label is 1. Example: The patient does not have cancer (0), and the model classifies this sample as cancer (1).
- False Negatives (FN): Samples for which the true label is 1 and the predicted label is 0. Example: The patient has cancer (1), and the model classifies this sample as non-cancer (0).
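The four categories can be counted directly from label lists; a minimal sketch with made-up labels:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0]   # gold-standard labels (illustrative)
y_pred = [1, 0, 0, 1, 1, 0]   # model predictions (illustrative)
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```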
F1 Score
The F1 score is another summary metric, built as the harmonic mean of the sensitivity (recall) and PPV (precision). It is popular in machine learning but, as we will see, it also has substantial drawbacks. Note that it is equal to the Dice coefficient used for segmentation. Given that it builds on the PPV rather than the specificity to characterize retrieval, it accounts slightly better for prevalence.
The F1 score can nevertheless be misleading when the prevalence is high. In such a case, one can have high values for sensitivity, specificity, PPV, and the F1 score but a low NPV. A solution is to exchange the two classes, after which the F1 score becomes informative again.
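This pitfall can be demonstrated with an assumed high-prevalence confusion matrix (all counts below are made up for illustration):

```python
# 990 diseased and 10 healthy subjects: prevalence = 99%
tp, fn = 980, 10   # the model finds almost every diseased subject
tn, fp = 9, 1      # and most healthy ones

sensitivity = tp / (tp + fn)               # ~0.99
specificity = tn / (tn + fp)               # 0.90
ppv = tp / (tp + fp)                       # ~0.999
npv = tn / (tn + fn)                       # ~0.47
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)

print(f"F1 = {f1:.3f}, NPV = {npv:.3f}")   # a high F1 masks a poor NPV
```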
Matthews Correlation Coefficient (MCC)
Another option is to use Matthews Correlation Coefficient (MCC). The MCC makes full use of the confusion matrix and can remain informative even when prevalence is very low or very high. However, its interpretation may be less intuitive than that of the other metrics.
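A minimal sketch of the MCC computed from the four confusion-matrix cells (the zero-denominator convention is ours):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient: +1 is perfect prediction,
    0 is no better than chance, -1 is total disagreement."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a marginal total is empty
    return (tp * tn - fp * fn) / denom

print(mcc(50, 50, 0, 0))   # 1.0 for a perfect classifier
print(mcc(45, 5, 45, 5))   # 0.0 when predictions ignore the labels
```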
Markedness
Finally, markedness is a little-known summary metric that deals well with low-prevalence situations, as it is built from the PPV and NPV (markedness = PPV + NPV − 1). Its drawback is that it is as much related to the population under study as to the classifier.
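Markedness combines the two predictive values; a short sketch (PPV + NPV − 1 is the standard definition):

```python
def markedness(tp: int, tn: int, fp: int, fn: int) -> float:
    """Markedness = PPV + NPV - 1, ranging from -1 to +1.
    It is 0 when positive and negative predictions carry no information."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv + npv - 1

print(markedness(50, 50, 0, 0))  # 1.0 for a perfect classifier
```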