Bayesian Inference in Machine Learning: A Comprehensive Tutorial

Bayesian inference is a powerful statistical method that plays a crucial role in various machine learning algorithms, particularly in Bayesian statistics and probabilistic modeling. It is based on Bayes' theorem, which provides a way to update probabilities based on new evidence or information. This article provides a comprehensive tutorial on Bayesian inference in machine learning, covering its theoretical foundations, key concepts, applications, and practical considerations.

Introduction to Bayesian Inference

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available. In a general sense, Bayesian inference is a learning technique that uses probabilities to define and reason about our beliefs. Bayesian methods help machine learning algorithms extract crucial information from small datasets and handle missing data, and they play an important role in a vast range of areas, from game development to drug discovery. They enable the estimation of uncertainty in predictions, which proves vital in fields like medicine, and they can save time and money, for example by allowing deep learning models to be compressed a hundredfold or hyperparameters to be tuned automatically.

Bayes' Theorem: The Foundation of Bayesian Inference

Bayes' theorem is a fundamental result in probability theory that plays a crucial role in many machine learning algorithms, especially in Bayesian statistics and probabilistic modeling. It provides a way to update probabilities based on new evidence or information, and in machine learning it underpins Bayesian inference and probabilistic models. The theorem can be expressed mathematically as:

P(A∣B) = \frac{P(B∣A)⋅P(A)}{P(B)}

where:

  • P(A∣B) is the posterior probability of event A given event B.
  • P(B∣A) is the likelihood of event B given event A.
  • P(A) is the prior probability of event A.
  • P(B) is the total probability of event B.

In the context of modeling hypotheses, Bayes' theorem allows us to infer our belief in a hypothesis based on new data. We start with a prior belief in the hypothesis, represented by P(A), and then update this belief based on how likely the data are to be observed under the hypothesis, represented by P(B∣A). The posterior probability P(A∣B) represents our updated belief in the hypothesis after considering the data.
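As a concrete sketch, the update above can be computed directly. The numbers below (a condition with 1% prevalence, a test with 95% sensitivity, and a 10% false-positive rate) are purely illustrative:

```python
# Updating belief in a hypothesis with Bayes' theorem.
# All numbers are made up for illustration.

prior = 0.01            # P(A): prior probability of the condition
likelihood = 0.95       # P(B|A): P(positive test | condition)
false_positive = 0.10   # P(B|not A): P(positive test | no condition)

# Total probability of a positive test, P(B)
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior P(A|B) via Bayes' theorem
posterior = likelihood * prior / evidence
print(round(posterior, 4))  # 0.0876
```

Even with a positive test, the posterior stays below 9% here, because the low prior (1% prevalence) dominates the update.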

Key Terms Related to Bayes' Theorem

To fully grasp Bayesian inference, it's essential to understand the key terms associated with Bayes' theorem:

  • Likelihood (P(B∣A)): The probability of observing the given evidence (features) under the assumption that a particular class is the true one.
  • Prior Probability (P(A)): In machine learning, this represents the probability of a particular class before considering any features. It is estimated from the training data.
  • Evidence Probability (P(B)): This is the probability of observing the given evidence (features). It serves as a normalization factor and is often calculated as the sum of the joint probabilities over all possible classes.
  • Posterior Probability (P(A∣B)): This is the updated probability of the class given the observed features. It is what we are trying to predict or infer in a classification task.

Applications of Bayes' Theorem in Machine Learning

Bayes' theorem has numerous applications in machine learning, including:

1. Naive Bayes Classifier

The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with a strong (naive) independence assumption between the features. It is widely used for text classification, spam filtering, and other tasks involving high-dimensional data. Despite its simplicity, the Naive Bayes classifier often performs well in practice and is computationally efficient.

How it works

  • Assumption of Independence: The "naive" assumption in Naive Bayes is that, given the class, the presence of a particular feature is independent of the presence of any other feature. This is a strong assumption that rarely holds exactly in real-world data, but it greatly simplifies the calculation and often works well in practice. Naive Bayes is also most naturally formulated for discrete features.
  • Calculating Class Probabilities: Given a set of features x1, x2, …, xn, the Naive Bayes classifier calculates the probability of each class Ck given the features using Bayes' theorem:

P(Ck∣x1,x2,…,xn) = \frac{P(x1,x2,…,xn∣Ck)⋅P(Ck)}{P(x1,x2,…,xn)}

The denominator P(x1,x2,…,xn) is the same for all classes, so it can be ignored when comparing classes.
  • Classification Decision: The classifier selects the class Ck with the highest probability as the predicted class for the given set of features.
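The three steps above can be sketched in a few lines. The class priors and per-class word probabilities below are made-up toy values, not estimates from a real corpus, and log probabilities are used to avoid numerical underflow:

```python
import math

# Minimal Naive Bayes classifier over binary word features.
# priors: P(Ck); cond[c][word]: P(word present | class c).
# All probabilities are illustrative toy numbers.

priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def classify(features):
    """Pick argmax_c of log P(c) + sum_i log P(x_i | c).
    The evidence P(x1,...,xn) is the same for every class,
    so it is dropped, exactly as noted in the text above."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for word, present in features.items():
            p = cond[c][word]
            score += math.log(p if present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({"free": True, "meeting": False}))   # spam
print(classify({"free": False, "meeting": True}))   # ham
```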

2. Bayes Optimal Classifier

The Bayes optimal classifier is a theoretical concept in machine learning that represents the best possible classifier for a given problem. It is based on Bayes' theorem, which describes how to update probabilities based on new evidence.

In the context of classification, the Bayes optimal classifier assigns the class label that has the highest posterior probability given the input features. Mathematically, this can be expressed as:

\widehat{y} = \arg\max_{y} P(y∣x)

where \widehat y is the predicted class label, y is a class label, x is the input feature vector, and P(y∣x) is the posterior probability of class y given the input features.

3. Bayesian Optimization

Bayesian optimization is a powerful technique for the global optimization of expensive-to-evaluate functions. To choose which point to evaluate next, it builds a probabilistic model of the objective function, typically a Gaussian process. By intelligently searching the space and iteratively refining the model, Bayesian optimization converges on good solutions with few evaluations. This makes it especially well suited for tasks such as hyperparameter tuning of machine learning models, where each evaluation can be computationally costly.
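A minimal sketch of that loop, assuming a one-dimensional toy objective, a hand-rolled Gaussian-process surrogate with an RBF kernel, and an upper-confidence-bound (UCB) acquisition; real implementations use dedicated libraries and far more careful numerics:

```python
import numpy as np

# Bayesian optimization sketch: GP surrogate + UCB acquisition.
# The objective, length scale, and UCB weight are illustrative.

def objective(x):
    return -(x - 2.0) ** 2  # unknown in practice; maximum at x = 2

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

grid = np.linspace(0.0, 5.0, 101)                    # candidate points
X, y = [0.0, 5.0], [objective(0.0), objective(5.0)]  # initial evaluations

for _ in range(8):
    Xa, ya = np.array(X), np.array(y)
    K = rbf(Xa, Xa) + 1e-6 * np.eye(len(Xa))   # jitter for stability
    Ks = rbf(grid, Xa)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ ya                        # GP posterior mean
    var = 1.0 - np.einsum("ij,jk,ik->i", Ks, Kinv, Ks)
    sigma = np.sqrt(np.clip(var, 0.0, None))   # GP posterior std dev
    ucb = mu + 2.0 * sigma                     # exploit + explore
    x_next = grid[np.argmax(ucb)]              # next point to evaluate
    X.append(x_next)
    y.append(objective(x_next))

best = X[int(np.argmax(y))]
print(best)  # should land near the true maximum at x = 2
```

Each iteration spends one expensive evaluation where the surrogate predicts either a high mean (exploitation) or high uncertainty (exploration), which is why far fewer evaluations are needed than with grid or random search.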

4. Bayesian Belief Networks

Bayesian Belief Networks (BBNs), also known as Bayesian networks, are probabilistic graphical models that represent a set of random variables and their conditional dependencies using a directed acyclic graph (DAG). Each node represents a random variable, and the graph's edges encode the dependencies between them.

BBNs are employed for modeling uncertainty and generating probabilistic conclusions regarding the network's variables. They may be used to provide answers to queries like "What is the most likely explanation for the observed data?" and "What is the probability of variable A given the evidence of variable B?"

BBNs are extensively utilized in several domains, including risk analysis, diagnostic systems, and decision-making. They are useful tools for reasoning under uncertainty because they give complicated probabilistic connections between variables a graphical and understandable representation.

Bayesian vs. Frequentist Approaches

Classical statistics is said to follow the frequentist approach because it interprets probability as the relative frequency of an event over the long run, that is, after observing many trials. In the context of probabilities, an event is a combination of one or more elementary outcomes of an experiment, such as rolling a given total with two dice or an asset price dropping by 10 percent or more on a given day.

Bayesian statistics, in contrast, views probability as a measure of the confidence or belief in the occurrence of an event. The Bayesian perspective, thus, leaves more room for subjective views and differences in opinions than the frequentist interpretation. This difference is most striking for events that do not happen often enough to arrive at an objective measure of long-term frequency.

Put differently, frequentist statistics assumes that data is a random sample from a population and aims to identify the fixed parameters that generated the data. Bayesian statistics, in turn, takes the data as given and considers the parameters to be random variables with a distribution that can be inferred from data. As a result, frequentist approaches require at least as many data points as there are parameters to be estimated. Bayesian approaches, on the other hand, are compatible with smaller datasets, and well suited for online learning from one sample at a time.

The Bayesian view is very useful for many real-world events that are rare or unique, at least in important respects. Examples include the outcome of the next election or the question of whether the markets will crash within three months. In each case, there is both relevant historical data and a set of unique circumstances that unfold as the event approaches, and this is where Bayesian machine learning can contribute.

Updating Assumptions from Empirical Evidence

Bayes’ theorem updates the beliefs about the parameters of interest by computing the posterior probability distribution from the following inputs:

  • The prior distribution indicates how likely we consider each possible hypothesis.
  • The likelihood function outputs the probability of observing a dataset when given certain values for the parameters θ, that is, for a specific hypothesis.
  • The evidence measures how likely the observed data is, given all possible hypotheses. Hence, it is the same for all parameter values and serves to normalize the numerator.

The posterior is the product of prior and likelihood, divided by the evidence. Thus, it reflects the probability distribution of the hypothesis, updated by taking into account both prior assumptions and the data. Viewed differently, the posterior probability results from applying the chain rule, which, in turn, factorizes the joint distribution of data and parameters.
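The three inputs can be combined on a small discrete example. The hypotheses, the uniform prior, and the data (6 heads in 8 coin flips) are all illustrative:

```python
from math import comb

# Discrete illustration of posterior = prior × likelihood / evidence.
# Three candidate hypotheses for a coin's heads probability θ.

thetas = [0.25, 0.50, 0.75]
prior = {t: 1 / 3 for t in thetas}      # uniform prior over hypotheses
heads, flips = 6, 8                      # observed data (made up)

def likelihood(t):
    # Binomial probability of the data given the parameter θ = t
    return comb(flips, heads) * t**heads * (1 - t)**(flips - heads)

# Evidence: probability of the data averaged over all hypotheses.
# It is the same for every θ and only normalizes the numerator.
evidence = sum(likelihood(t) * prior[t] for t in thetas)

posterior = {t: likelihood(t) * prior[t] / evidence for t in thetas}
for t in thetas:
    print(t, round(posterior[t], 3))   # 0.009, 0.258, 0.733
```

After seeing 6 heads in 8 flips, nearly three quarters of the posterior mass sits on the θ = 0.75 hypothesis, while the posterior still sums to one because of the normalization by the evidence.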

With higher-dimensional, continuous variables, the formulation becomes more complex and involves (multiple) integrals. Also, an alternative formulation uses odds to express the posterior odds as the product of the prior odds, times the likelihood ratio.

Exact Inference - Maximum a Posteriori Estimation

Practical applications of Bayes’ rule to exactly compute posterior probabilities are quite limited. This is because the computation of the evidence term in the denominator is quite challenging. The evidence reflects the probability of the observed data over all possible parameter values. It is also called the marginal likelihood because it requires “marginalizing out” the parameters’ distribution by adding or integrating over their distribution. This is generally only possible in simple cases with a small number of discrete parameters that assume very few values.

Maximum a posteriori probability (MAP) estimation leverages the fact that the evidence is a constant factor that scales the posterior to meet the requirements for a probability distribution. Since the evidence does not depend on θ, the posterior distribution is proportional to the product of the likelihood and the prior. Hence, MAP estimation chooses the value of θ that maximizes the posterior given the observed data and the prior belief, that is, the mode of the posterior.

The MAP approach contrasts with the Maximum Likelihood Estimation (MLE) of parameters that define a probability distribution. MLE picks the parameter value θ that maximizes the likelihood function for the observed training data.

A look at the definitions highlights that MAP differs from MLE by including the prior distribution. In other words, unless the prior is a constant, the MAP estimate will differ from its MLE counterpart.

The MLE solution tends to reflect the frequentist notion that probability estimates should reflect observed ratios. On the other hand, the impact of the prior on the MAP estimate often corresponds to adding data that reflects the prior assumptions to the MLE. For example, a strong prior that a coin is biased can be incorporated in the MLE context by adding skewed trial data.
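This "prior as added data" view is easy to verify numerically. In the sketch below, a Beta(a, b) prior on a coin's heads probability acts exactly like a - 1 extra heads and b - 1 extra tails added to the data; the counts and prior parameters are illustrative:

```python
# MLE vs. MAP for a coin's heads probability.
# Data (7 heads in 10 flips) and the Beta priors are illustrative.

heads, flips = 7, 10

def map_estimate(a, b):
    """Mode of the Beta(a + heads, b + tails) posterior:
    the prior contributes a - 1 pseudo-heads and b - 1 pseudo-tails."""
    return (heads + a - 1) / (flips + a + b - 2)

mle = heads / flips
print(mle)                    # 0.7
print(map_estimate(1, 1))     # flat prior: identical to the MLE, 0.7
print(map_estimate(5, 5))     # strong fair-coin prior pulls toward 0.5
```

With the flat Beta(1, 1) prior the MAP estimate coincides with the MLE, while the Beta(5, 5) prior behaves like eight extra balanced flips and shrinks the estimate toward 0.5.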

Prior distributions are a critical ingredient of Bayesian models.

Selecting Priors

The prior should reflect knowledge about the distribution of the parameters because it influences the MAP estimate. If a prior is not known with certainty, we need to make a choice, often from several reasonable options. In general, it is good practice to justify the prior and check for robustness by testing whether alternatives lead to the same conclusion.

There are several types of priors:

  • Objective priors maximize the impact of the data on the posterior. If the parameter distribution is unknown, we can select an uninformative prior like a uniform distribution, also called a flat prior, over a relevant range of parameter values.
  • In contrast, subjective priors aim to incorporate information external to the model into the estimate.
  • An empirical prior combines Bayesian and frequentist methods and uses historical data to eliminate subjectivity, for example, by estimating various moments to fit a standard distribution. Using some historical average of daily returns rather than a belief about future returns would be an example of a simple empirical prior.

In the context of a machine learning model, the prior can be viewed as a regularizer because it limits the values that the posterior can assume. Parameters that have zero prior probability, for instance, are not part of the posterior distribution. Generally, more good data allows for stronger conclusions and reduces the influence of the prior.

Conjugate Priors

A prior distribution is conjugate with respect to the likelihood when the resulting posterior is of the same class or family of distributions as the prior, except for different parameters. For example, when both the prior and the likelihood are normally distributed, then the posterior is also normally distributed.

The conjugacy of prior and likelihood implies a closed-form solution for the posterior that facilitates the update process and avoids the need to use numerical methods to approximate the posterior. Moreover, the resulting posterior can be used as the prior for the next update step.

Dynamic Probability Estimates of Asset Price Moves

When the data consists of binary Bernoulli random variables with a certain success probability for a positive outcome, the number of successes in repeated trials follows a binomial distribution. The conjugate prior is the beta distribution with support over the interval [0, 1] and two shape parameters to model arbitrary prior distributions over the success probability. Hence, the posterior distribution is also a beta distribution that we can derive by directly updating the parameters.
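A sequential Beta-Binomial update along these lines might look as follows; the daily outcomes are fabricated, and Beta(1, 1) is used as a flat prior:

```python
# Sequential Beta-Binomial updating via conjugacy.
# "Successes" stand for up-days of an asset; the outcomes are made up.

a, b = 1.0, 1.0                      # Beta(1, 1): flat prior
days = [1, 1, 0, 1, 0, 1, 1, 1]     # 1 = price up, 0 = price down

for up in days:
    # Conjugacy: the posterior is Beta(a + successes, b + failures)
    # and immediately serves as the prior for the next observation.
    a += up
    b += 1 - up

posterior_mean = a / (a + b)
print(a, b)             # 7.0 3.0
print(posterior_mean)   # 0.7
```

No numerical approximation is needed: each observation just increments one shape parameter, which is exactly what makes conjugate priors convenient for online updating.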

In practice, the use of conjugate priors is limited to low-dimensional cases. In addition, the simplified MAP approach avoids computing the evidence term but has a key shortcoming, even when the evidence is available: it returns only a point estimate rather than a distribution, so we cannot derive a measure of uncertainty from it or use it as the prior for the next update. Hence, we need to resort to approximate rather than exact inference, using numerical methods and stochastic simulations.

Markov Chain Monte Carlo (MCMC)

The most classic method to sample from a posterior distribution is the Markov chain Monte Carlo method (often called MCMC). The idea of this sampling method is to randomly modify an initial model (a set of parameters here) several times in a specific way so that the model eventually becomes a valid sample from the desired distribution. This procedure can be seen as a random walk in the space of parameters, which is called a Markov chain. The trick is that these modifications are random but biased toward models with higher posterior (quite like an optimization procedure). This means that sometimes a modification is rejected and a copy of the current model is used instead. Note that a modification that increases the posterior distribution is always accepted here. This procedure can be shown to converge toward samples from the desired distribution.
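A minimal random-walk Metropolis sampler following the procedure just described. The target (the posterior of a coin's heads probability after 6 heads in 8 flips under a flat prior, i.e., a Beta(7, 3) distribution with mean 0.7) and all tuning choices are illustrative:

```python
import math
import random

# Random-walk Metropolis sampling from an unnormalized posterior.
# Target: posterior of θ after 6 heads in 8 flips, flat prior,
# i.e., Beta(7, 3). All tuning choices below are illustrative.

random.seed(0)

def log_post(theta):
    """Unnormalized log posterior. The evidence is never needed,
    because Metropolis only uses ratios of posterior values."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 6 * math.log(theta) + 2 * math.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.1)   # random modification
    # Accept with probability min(1, post(proposal) / post(current));
    # a move to higher posterior is therefore always accepted.
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal                        # accept the move
    samples.append(theta)                       # else keep a copy

burned = samples[5000:]                         # discard burn-in
print(sum(burned) / len(burned))                # ≈ 0.7
```

The chain's sample mean approximates the posterior mean, and the retained samples can be used to estimate any other posterior quantity, such as credible intervals.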

Probabilistic Programming

The idea of probabilistic programming is to use a regular programming language to define distributions, which can then be conditioned to perform a Bayesian inference. There are two main ways to define a distribution: either by defining its probability density function (PDF) or by defining a sampling procedure. Simple numeric distributions are usually defined by their PDF. A related way to define distributions is by using an energy function instead of the PDF. This energy function E(u1,u2,…) implicitly defines an unnormalized PDF through the relation PDF(u1,u2,…)∝exp(-E(u1,u2,…)).
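A small sketch of the energy-function formulation. The choice E(u) = u²/2 is purely illustrative: its normalized counterpart is the standard normal density, and the normalization constant cancels in the ratios that samplers such as Metropolis actually use:

```python
import math

# An energy function implicitly defines an unnormalized PDF through
# PDF(u) ∝ exp(-E(u)). Here E(u) = u²/2, whose normalized
# counterpart is the standard normal density.

def energy(u):
    return 0.5 * u * u

def unnormalized_pdf(u):
    return math.exp(-energy(u))

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

# The normalization constant cancels in ratios, which is all that
# ratio-based samplers need:
for a, b in [(0.0, 1.0), (1.0, 2.0)]:
    r_energy = unnormalized_pdf(a) / unnormalized_pdf(b)
    r_normal = normal_pdf(a) / normal_pdf(b)
    print(round(r_energy, 6), round(r_normal, 6))  # the ratios agree
```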

Advantages of Bayesian Inference

  • Incorporates Prior Knowledge: Bayesian inference allows for the incorporation of prior knowledge or beliefs into the modeling process, which can be particularly useful when data is scarce or when expert opinions are available.
  • Provides Uncertainty Quantification: Bayesian methods provide a full probability distribution over the parameters, allowing for the quantification of uncertainty in the estimates.
  • Handles Missing Data: Bayesian methods can handle missing data in a principled way by marginalizing over the missing values.
  • Online Learning: Bayesian methods are well-suited for online learning, where the model is updated sequentially as new data becomes available.

Disadvantages of Bayesian Inference

  • Computational Complexity: Bayesian inference can be computationally expensive, especially for complex models with many parameters.
  • Subjectivity in Prior Selection: The choice of prior distribution can be subjective, and different priors can lead to different results.
  • Model Complexity: Bayesian models can be more complex to specify and interpret than frequentist models.

Bayesian Inference in Practice

Bayesian inference is a powerful tool that can be applied to a wide range of machine learning problems. However, it is important to be aware of the computational challenges and the potential for subjectivity in prior selection. By carefully considering these factors, you can effectively leverage Bayesian inference to build more robust and accurate models.
