Gaussian Processes for Machine Learning: A Comprehensive Tutorial

Gaussian Processes (GPs) represent a powerful and versatile approach in machine learning, offering a probabilistic framework for modeling and predicting complex datasets. Unlike many algorithms that provide a single "best guess," GPs provide a distribution over possible functions, quantifying the uncertainty associated with predictions. This tutorial aims to provide an intuitive and in-depth exploration of Gaussian Processes, suitable for both beginners and experienced practitioners. Since the celebrated book by Rasmussen and Williams, a considerable number of novel contributions have extended the applicability of GPs to problems at an unprecedented scale and to new areas where uncertainty quantification is of fundamental importance.

Introduction to Gaussian Processes

Gaussian Processes (GPs) are defined by a mean function $m(x)$ and a covariance function, or kernel, $k(x, x')$. At its core, a Gaussian Process is a generalization of the Gaussian probability distribution: while a Gaussian distribution describes random variables, a Gaussian Process describes a distribution over functions. A GP requires specifying a kernel that controls how examples relate to each other; specifically, it defines the covariance function of the data. For classification, it also requires a link function that interprets the internal representation and predicts the probability of class membership.

What Makes Gaussian Processes Special?

  • Probabilistic Predictions: GPs don't just give you a single prediction; they provide a range of possible outcomes along with their probabilities.
  • Flexibility: GPs can capture complex patterns and relationships in data without needing a rigid, pre-defined structure.
  • Uncertainty Quantification: GPs inherently provide a measure of uncertainty, allowing you to understand the confidence associated with predictions.

Foundational Concepts for Understanding Gaussian Processes

To fully grasp Gaussian Processes, it is crucial to understand the underlying concepts upon which they are built. This section will delve into these foundational elements, providing the necessary context for understanding the mechanics and applications of GPs.

Gaussian (Normal) Distribution

The Gaussian, or normal, distribution is a fundamental concept in statistics and machine learning. It describes the probability distribution of a continuous random variable. The probability density function (PDF) of a univariate normal (Gaussian) distribution was plotted in Fig. Here, $X$ represents the random variable and $x$ its real-valued argument. The normal distribution of $X$ is usually written $P_X(x) \sim \mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ the variance, with density $P_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
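As a minimal numerical check of this density, using only the standard library (the function name here is ours, for illustration):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """PDF of the univariate normal N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(gaussian_pdf(0.0))  # peak of the standard normal, 1/sqrt(2*pi) ≈ 0.3989
print(gaussian_pdf(1.0))  # one standard deviation out, ≈ 0.2420
```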

Multivariate Normal Distribution (MVN)

It is quite usual, and often necessary, for a system to be described by more than one feature variable $(x_1, x_2, \ldots, x_D)$ that are correlated with each other. If we would like to model these variables all together as one Gaussian model, we need a multivariate Gaussian/normal (MVN) distribution [3]. Here, $D$ is the dimensionality, $x$ denotes the variable, $\mu = \mathbb{E}[x] \in \mathbb{R}^D$ is the mean vector, and $\Sigma = \text{cov}[x]$ is the $D \times D$ covariance matrix.


The multivariate Gaussian distribution, also known as the joint normal distribution, is the generalization of the univariate Gaussian distribution to high-dimensional spaces. Mathematically, $X = (X_1, \ldots, X_k)^T$ has a multivariate Gaussian distribution if every linear combination $Y = a_1 X_1 + a_2 X_2 + \ldots + a_k X_k$ is univariate Gaussian. Note: if all $k$ components are independent Gaussian random variables, then $X$ must be multivariate Gaussian (because the sum of independent Gaussian random variables is always Gaussian).

A bivariate normal (BVN) distribution offers a simpler example for understanding the MVN concept. A BVN distribution can be visualized as a three-dimensional (3-D) bell curve, where the vertical axis (height) represents the probability density, as shown in Fig. 5(a). The ellipse contours on the $x_1, x_2$ plane, illustrated in Fig. 5(a) and 5(b), are the projections of this 3-D curve. The shape of the ellipses shows the degree of correlation between $x_1$ and $x_2$, i.e. how one variable relates to the other. The mean vector is $\mu = \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}$, where $\mu_1$ and $\mu_2$ are the means of $x_1$ and $x_2$, respectively. The covariance matrix is $\Sigma = \begin{bmatrix}\sigma_{11}&\sigma_{12}\\\sigma_{21}&\sigma_{22}\end{bmatrix}$, with the diagonal terms $\sigma_{11}$ and $\sigma_{22}$ being the variances of $x_1$ and $x_2$, respectively, and the off-diagonal terms $\sigma_{12}$ and $\sigma_{21}$ representing the correlations between $x_1$ and $x_2$.

The covariance matrix, denoted $\Sigma$, tells us (1) the variance of each individual random variable (the diagonal entries) and (2) the covariance between the random variables (the off-diagonal entries). The covariance matrix in the example above indicates that $y_1$ and $y_2$ are positively correlated (with $0.7$ covariance), hence the somewhat "stretched" shape of the contours.
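A quick NumPy sketch of this example, drawing samples from a bivariate Gaussian with $0.7$ covariance (unit variances are assumed here) and checking the empirical correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])  # off-diagonal 0.7: y1 and y2 positively correlated

samples = rng.multivariate_normal(mu, Sigma, size=5000)
emp_corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print(round(emp_corr, 2))  # close to 0.7
```

A scatter plot of `samples` would show the "stretched" elliptical contours described above.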

With a multivariate Gaussian, another useful operation is conditioning. We fix the value of $y_1$ and compute the density of $y_2$ along that slice, thereby conditioning on $y_1$. Take the oval contour plot of the 2-D Gaussian and choose a point on it. Under this setting, we can visualize the sampling operation in a new way: take multiple random points and plot $y_1$ and $y_2$ at indices $1$ and $2$ repeatedly. The further apart the indices of two points are, the less correlated they are. We can again condition on $y_1$ and take samples for all the other points.
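The conditioning step can be sketched with the standard MVN conditioning formulas; the $0.7$ covariance below is just the running example:

```python
import numpy as np

# Joint 2-D Gaussian over (y1, y2)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

# Condition on y2 = a using the MVN conditioning formulas:
#   mean = mu1 + Sigma12 * Sigma22^{-1} * (a - mu2)
#   var  = Sigma11 - Sigma12 * Sigma22^{-1} * Sigma21
a = 1.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (a - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

print(cond_mean)  # 0.7  -- linear in the observed value
print(cond_var)   # 0.51 -- prior variance 1.0 minus the reduction from observing y2
```

Note that the conditional mean is linear in the observed value and the conditional variance is strictly smaller than the prior variance: observing $y_2$ reduces our uncertainty about $y_1$.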

Kernels: Measuring Similarity

Kernels are essential to Gaussian processes because they capture the relationships and underlying structure of the data. A kernel, also called a covariance or similarity function, quantifies how similar two input points are to one another. It defines the form and properties of the Gaussian process distribution, and thus how the process behaves across different input configurations. With kernels, Gaussian processes can handle non-linearities, model complex relationships, and generate predictions by interpolating and extrapolating from observed points. The choice of kernel function has a profound impact on the behavior of the GP.


In regression, we desire the predictions to be smooth and logical: similar inputs should yield similar outputs. For example, consider two houses, A and B, with comparable size, location, and features; we expect their market prices to be similar. A natural measure of 'similarity' between two inputs is the dot product $A \cdot B = \lVert A\rVert \lVert B\rVert \cos\theta$, where $\theta$ is the angle between the two input vectors. Imagine a scenario in which we could lift our houses into a 'magical' space, where this dot product becomes more powerful and tells us even more about how similar the houses are. This magical space is called the "feature space". The function that performs this lift and comparison in the feature space is called the "kernel function", denoted $k(x, x')$. We do not actually move our data into this new high-dimensional feature space (which could be computationally expensive); instead, the kernel function gives us the same dot-product result as if we had done so. This is the famous "kernel trick". Formally, the kernel function $k(x, x')$ computes the similarity between data points in a high-dimensional feature space without explicitly transforming the inputs [1].
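As an illustration, here is one common kernel, the RBF (squared-exponential), computed directly; the function name and length-scale are our choices for the sketch:

```python
import numpy as np

def rbf_kernel(x, x_prime, length_scale=1.0):
    """Squared-exponential (RBF) kernel: similarity decays with squared distance."""
    d2 = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-0.5 * d2 / length_scale ** 2)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical inputs -> similarity 1.0
print(rbf_kernel([0.0], [2.0]))            # distance 2 -> exp(-2) ≈ 0.135
```

Inputs that are close get a similarity near 1; distant inputs get a similarity near 0, which is exactly the "similar inputs, similar outputs" behavior we want.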

Prior Distribution: Initial Beliefs

The prior distribution, which represents our first presumptions about the functions we are modeling, serves as the starting point for Gaussian processes. Imagine it as a stretchy canvas on which different functions and their actions are painted according to our preconceived notions. This distribution, which is commonly taken to be Gaussian, has parameters such as mean and covariance that affect the properties of the potential functions. It's similar to preparing the stage before any data even arrives for the function's performance. This prior distribution changes as we see data, taking on a posterior distribution that better fits the seen world. The prior distribution in GPs encapsulates our initial beliefs about the function before observing any data.

More formally, the prior distribution over these infinite functions is MVN, representing the expected outputs $\mathbf{f}$ over inputs $\mathbf{x}$ before observing any data.

Posterior Distribution: Updated Knowledge

The posterior distribution, a fundamental idea in Gaussian processes, expresses our revised beliefs about the functions we are modeling in light of observed data. We begin by assuming certain things about the functions via a prior distribution. As data points are observed, the posterior distribution is obtained by combining the prior distribution with the likelihood of the observed data using Bayes' theorem. The result is a revised, updated distribution that captures our enhanced knowledge of the underlying functions. Thanks to this iterative process, Gaussian processes can continuously adjust and refine their predictions in light of fresh data. After observing the data, the posterior distribution updates our beliefs, incorporating the evidence provided by the data.

When we start to have observations, instead of infinite numbers of functions, we only keep functions that fit the observed data points, forming the posterior distribution. This posterior is the prior updated with observed data.


Non-Parametric Models

This section explains the distinction between parametric and non-parametric models [3]. Parametric models assume that the data distribution can be described by a finite number of parameters. In regression, given some data points, we would like to predict the function value $y = f(x)$ for a new specific $x$. If we assume a linear regression model, $y = \theta_1 + \theta_2 x$, we need to identify the parameters $\theta_1$ and $\theta_2$ to define the function. Often, a linear model is insufficient, and a polynomial model with more parameters, like $y = \theta_1 + \theta_2 x + \theta_3 x^2$, is needed. We use the training dataset $D$ comprising $n$ observed points, $D = \{(x_i, y_i) \mid i = 1, \ldots, n\}$, to train the model, i.e. establish a mapping from $x$ to $y$ through basis functions $f(x)$. After training, all information in the dataset is assumed to be encapsulated by the parameters $\theta$, so predictions are independent of the training dataset $D$. This can be expressed as $P(f_* \mid X_*, \theta, D) = P(f_* \mid X_*, \theta)$, in which $f_*$ are predictions made at unobserved points $X_*$. Thus, when conducting regression with parametric models, the complexity or flexibility of the model is inherently limited by the number of parameters. Conversely, if the number of parameters grows with the size of the observed dataset, the model is non-parametric.
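The parametric case above can be sketched with an ordinary least-squares polynomial fit; the true coefficients and noise level here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
# Synthetic data from y = theta1 + theta2*x + theta3*x^2 plus a little noise
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.05, x.shape)

# Fit the three-parameter model; polyfit returns highest-degree coefficient first
theta3, theta2, theta1 = np.polyfit(x, y, deg=2)
print(theta1, theta2, theta3)  # close to 1.0, 2.0, 0.5
```

Once $\theta_1, \theta_2, \theta_3$ are estimated, the training data can be discarded entirely: predictions depend only on these three numbers, which is exactly what makes the model parametric.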

Note that our Gaussian processes are non-parametric, as opposed to nonlinear regression models which are parametric.

Gaussian Process Regression (GPR)

The objective of regression is to formulate a function that accurately represents observed data points and then utilize this function for predicting new data points. Considering a set of observed data points depicted in Fig. 1(a), an infinite array of potential functions can be fitted to these data points. Fig. 1(b) illustrates five such sample functions. Here, $X$ represents random variables and $x$ the real argument. Randomly generated data points can be expressed as a vector $x_1 = [x_1^1, x_1^2, \ldots, x_1^n]$. By plotting the vector $x_1$ on a new $Y$ axis at $Y = 0$, we project the points $[x_1^1, x_1^2, \ldots, x_1^n]$ into a different space, shown in Fig. 3. We did nothing but vertically plot the points of the vector $x_1$ in a new $(Y, x)$ coordinate space. Similarly, another independent Gaussian vector $x_2 = [x_2^1, x_2^2, \ldots, x_2^n]$ can be plotted at $Y = 1$ within the same coordinate framework, as demonstrated in Fig. 3. It is crucial to remember that both $x_1$ and $x_2$ follow the univariate normal distribution depicted in Fig. Next, we select 10 points at random from vectors $x_1$ and $x_2$ respectively and connect these points in order with lines, as shown in Fig. 4(a). These connected lines look like linear functions spanning the $[0, 1]$ domain. We could use these functions to make predictions for regression tasks if new data points lay on (or proximate to) these linear lines. However, the assumption that new data points will consistently lie on these linear functions often does not hold. If we plot more randomly generated univariate Gaussian vectors, say 20 vectors $x_1, x_2, \ldots, x_{20}$ within the $[0, 1]$ interval, and connect 10 randomly selected sample points of each vector with lines, we get 10 lines that look more like functions within $[0, 1]$, as shown in Fig. 4(b).
Yet, we still cannot use these lines to make predictions for regression tasks because they are too noisy. These functions must be smoother: input points that are close to each other should have similar output values. The "functions" generated by connecting points from independent Gaussian vectors lack the required smoothness for regression tasks.
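The smoothness argument can be made concrete: if, instead of independent Gaussian draws, we sample the whole vector jointly from an MVN whose covariance comes from an RBF kernel, nearby inputs become correlated and the sampled "functions" are smooth. A sketch (the grid size and length-scale below are arbitrary choices):

```python
import numpy as np

def rbf(X1, X2, length_scale=0.3):
    """RBF kernel matrix between two 1-D input grids."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
K = rbf(x, x)

# Small jitter on the diagonal for numerical stability, then sample 3 functions
samples = rng.multivariate_normal(np.zeros_like(x), K + 1e-8 * np.eye(len(x)), size=3)

# Adjacent grid points are strongly correlated, so each sampled function is
# smooth: the jump between neighboring points is small.
max_step = np.max(np.abs(np.diff(samples, axis=1)))
print(max_step)
```

Plotting each row of `samples` against `x` would show smooth curves, unlike the jagged lines obtained from independent Gaussian vectors.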

In summary, GP regression is much like regression with parametric models, except that you put a prior on the set of functions you'd like to consider for the dataset. The character of this "set of functions" is defined by the kernel of choice, $K(x, x')$. Luckily, because $p(y \mid \theta)$ is Gaussian, we can compute its likelihood in closed form.

Definition of Gaussian Processes

A Gaussian process model describes a probability distribution over possible functions that fit a set of points. Because we have the probability distribution over all possible functions, we can compute the mean as the most likely estimate of the function, and the variance as an indicator of prediction confidence.

The Gaussian Process Model

Now it is time to explore the standard Gaussian process model. All parameter definitions align with the classic textbook by Rasmussen and Williams (2006) [1]. Beyond the basic concepts covered here, Appendix A.1 and A.2 of [1] are also recommended reading. The GP prior is written $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ represents the observed data points, $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]$ the function values, $\boldsymbol{\mu} = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]$ the mean function values, and $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ the kernel function, which is positive definite. With no observations, we default the mean function to $m(\mathbf{X}) = 0$, assuming the data are normalized to zero mean. The Gaussian process model is thus a distribution over functions whose shape (smoothness) is defined by $\mathbf{K}$. If points $\mathbf{x}_i$ and $\mathbf{x}_j$ are considered similar by the kernel, their respective function outputs, $f(\mathbf{x}_i)$ and $f(\mathbf{x}_j)$, are expected to be similar too. The regression process using Gaussian processes is illustrated in Fig.

Making Predictions with GPR

The joint distribution of the observed outputs $\mathbf{f}$ and the test outputs $\mathbf{f}_*$ is $\begin{bmatrix}\mathbf{f}\\ \mathbf{f}_*\end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\ \begin{bmatrix}\mathbf{K} & \mathbf{K}_*\\ \mathbf{K}_*^T & \mathbf{K}_{**}\end{bmatrix}\right)$, where $\mathbf{K} = K(\mathbf{X}, \mathbf{X})$, $\mathbf{K}_* = K(\mathbf{X}, \mathbf{X}_*)$ and $\mathbf{K}_{**} = K(\mathbf{X}_*, \mathbf{X}_*)$. While this equation describes the joint probability distribution $P(\mathbf{f}, \mathbf{f}_* \mid \mathbf{X}, \mathbf{X}_*)$ over $\mathbf{f}$ and $\mathbf{f}_*$, in regression we need the conditional distribution $P(\mathbf{f}_* \mid \mathbf{f}, \mathbf{X}, \mathbf{X}_*)$ over $\mathbf{f}_*$ only. The derivation of the conditional distribution $P(\mathbf{f}_* \mid \mathbf{f}, \mathbf{X}, \mathbf{X}_*)$ from the joint distribution $P(\mathbf{f}, \mathbf{f}_* \mid \mathbf{X}, \mathbf{X}_*)$ uses the marginal and conditional distributions of the MVN theorem [5, Sec. 2.3.1].

In realistic scenarios, we typically have access only to noisy versions of the true function values, $y = f(x) + \epsilon$, where $\epsilon$ is additive independent and identically distributed (i.i.d.) Gaussian noise with variance $\sigma_n^2$. The prior on the noisy observations then becomes $\text{cov}(\mathbf{y}) = \mathbf{K} + \sigma_n^2 \mathbf{I}$, and the predictive equations are $\bar{\mathbf{f}}_* = \mathbf{K}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$ and $\text{cov}(\mathbf{f}_*) = \mathbf{K}_{**} - \mathbf{K}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{K}_*$. The variance $\text{cov}(\mathbf{f}_*)$ reveals that the uncertainty in predictions depends solely on the inputs $\mathbf{X}$ and $\mathbf{X}_*$, not on the observed outputs $\mathbf{y}$. The inputs to this algorithm are $\mathbf{X}$ (inputs), $\mathbf{y}$ (targets), $K$ (covariance function), $\sigma_n^2$ (noise level), and $\mathbf{X}_*$ (test inputs).
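These predictive equations fit in a few lines of NumPy; the following is a from-scratch sketch (the helper names and the toy sine data are ours, not from the text):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """RBF kernel matrix between 1-D input vectors A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def gpr_predict(X, y, X_star, ell=1.0, noise=1e-2):
    """Posterior mean and covariance for noisy GP regression."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))  # K + sigma_n^2 I
    K_s = rbf(X, X_star, ell)                    # K_*
    K_ss = rbf(X_star, X_star, ell)              # K_**
    alpha = np.linalg.solve(K, y)
    mean = K_s.T @ alpha                                   # K_*^T (K + s^2 I)^{-1} y
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)           # K_** - K_*^T (...)^{-1} K_*
    return mean, cov

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
mean, cov = gpr_predict(X, y, np.array([1.0]))
print(mean)  # close to sin(1) ≈ 0.841, since x*=1 is a (noisy) training point
```

Note that `cov` never touches `y`, matching the observation above that predictive uncertainty depends only on the inputs.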

Kernels in Detail

One of the most crucial aspects of Gaussian Processes is the selection and design of the kernel function. The kernel defines the similarity between data points and dictates the properties of the learned function.

Combining Kernels for Complex Data

Combining kernels in Gaussian processes is a potent way to improve the model's expressiveness and adaptability. The features and form of functions within a Gaussian process are determined by kernels. We can develop a composite kernel that can recognize different patterns and structures in the data by merging multiple kernels. This is especially helpful in cases where the underlying systems display disparate tendencies. The combination can be made by multiplying or adding distinct kernels, each of which adds to the total function in a different way. As a result, the Gaussian process can adjust to a variety of patterns and offer a more thorough depiction of the relationships found in the data. Kernels can be combined to create a new kernel that captures multiple aspects of the data. Multiplying kernels allows one to model interactions between inputs.
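For instance, with scikit-learn's kernel objects (the particular kernels and settings below are our choices for illustration, e.g. a trend plus a seasonal component plus noise):

```python
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

# A smooth long-term trend, a periodic component riding on it, and observation noise
k_trend = RBF(length_scale=10.0)
k_seasonal = ExpSineSquared(length_scale=1.0, periodicity=1.0)
k_noise = WhiteKernel(noise_level=0.1)

# Sums and products of valid kernels are themselves valid kernels
kernel = k_trend + k_trend * k_seasonal + k_noise
print(kernel)
```

The product `k_trend * k_seasonal` models an interaction: a periodic pattern whose amplitude can drift slowly, which neither factor captures alone.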

Common Kernel Functions

  • Constant Kernel: depends on a single parameter (constant_value). It is typically used as part of a product kernel, where it scales the magnitude of the other factor, or as part of a sum kernel, where it can explain a constant (e.g. noise-like) component of the signal.
  • RBF (Radial Basis Function) Kernel: a stationary kernel, also known as the "squared exponential" kernel. It is parameterized by a characteristic length-scale, which controls the smoothness of the resulting function; RBF sample functions are very smooth.
  • Rational Quadratic Kernel: parameterized by a length-scale and a scale mixture parameter (α > 0). It can be seen as a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales.
  • DotProduct Kernel: obtained from linear regression by putting N(0, 1) priors on the coefficients of x_d (d = 1, …, D) and a prior of N(0, σ₀²) on the bias. It is parameterized by the parameter σ₀². For σ₀ = 0 it is called the homogeneous linear kernel, otherwise it is inhomogeneous. The DotProduct kernel is commonly combined with exponentiation.
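The kernels listed above can be instantiated and evaluated on a few points directly; a small sketch assuming scikit-learn is installed (the inputs are arbitrary):

```python
import numpy as np
from sklearn.gaussian_process.kernels import (
    ConstantKernel, RBF, RationalQuadratic, DotProduct,
)

X = np.array([[0.0], [1.0], [2.0]])

for k in (ConstantKernel(constant_value=1.0),
          RBF(length_scale=1.0),
          RationalQuadratic(length_scale=1.0, alpha=1.0),
          DotProduct(sigma_0=1.0)):
    K = k(X)  # calling a kernel on X returns its Gram (covariance) matrix
    print(type(k).__name__, K.shape)
```

For example, the DotProduct entry $K[i, j] = \sigma_0^2 + x_i \cdot x_j$, so with $\sigma_0 = 1$ the entry for the pair $(2, 2)$ is $1 + 4 = 5$.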

Gaussian Process in Classification and Regression

Gaussian Processes can be applied to both classification and regression problems, although the approach differs slightly.

  • Regression: In regression, GPs predict continuous outcomes.
  • Classification: In classification, GPs are used for predicting discrete labels. The GP’s output is passed through a non-linear function (like the logistic function) to obtain class probabilities.

Gaussian Processes in Scikit-learn

Scikit-learn provides a convenient and powerful implementation of Gaussian Processes. Here's an overview of how to use GPs within the Scikit-learn framework:

  • Gaussian Processes for Regression: In Scikit-learn, a Gaussian Process won't just give you a single forecast (say, of tomorrow's temperature); it provides a range of possible values along with their probabilities, offering a complete picture of future possibilities. Gaussian Processes in sklearn are built on two main concepts: the mean function, which represents the average prediction, and the covariance function, also known as the kernel, which defines how points in the dataset relate to each other. The beauty of GPs lies in their ability to capture complex patterns and relationships in data without needing to predefine a rigid structure, like the number of layers in a neural network.
  • Gaussian Processes Classifier model: We can fit and evaluate a Gaussian Processes Classifier model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. Perhaps the most important hyperparameter is the kernel controlled via the “kernel” argument.
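A sketch of that evaluation setup, using a synthetic dataset purely for illustration (the fold counts and kernel are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic binary classification problem standing in for real data
X, y = make_classification(n_samples=100, n_features=4, random_state=1)

# The "kernel" argument is the key hyperparameter of the classifier
model = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
print(scores.mean())
```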

Hyperparameters and Optimization

Kernels are parameterized by a vector θ of hyperparameters, which are tuned during fitting by maximizing the log-marginal-likelihood via gradient ascent. Bounds on the hyperparameters need to be specified when creating an instance of the kernel; the current (log-transformed) values can be accessed via the theta property of the kernel object, and the search bounds via its bounds property. Each hyperparameter is represented as a Hyperparameter instance in the respective kernel. The abstract base class for all kernels is Kernel, which exposes the methods get_params(), set_params(), and clone(), so kernels can also be tuned with meta-estimators such as GridSearchCV. For composite kernels, parameter names can become relatively complicated; for example, the parameters of the right operand of a sum or product kernel carry the prefix k2__. A kernel can also be cloned but with its hyperparameters set to a given theta. Finally, the PairwiseKernel class exposes the metrics from sklearn.metrics.pairwise as GP kernels; note that it supports only isotropic distances, and its gamma parameter is a hyperparameter that may be optimized.

The implementation of GaussianProcessRegressor is based on Algorithm 2.1 of [RW2006]. The optimizer can be restarted repeatedly by specifying n_restarts_optimizer; each restart draws initial hyperparameters randomly from the range of allowed values. Observation noise can be specified via alpha, either globally as a scalar or per datapoint, which works by adding it to the diagonal of the kernel matrix.
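These knobs can be seen together in a small sketch (toy sine data and arbitrary settings, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel()

# Bounds are fixed at kernel-creation time
kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e2))

gpr = GaussianProcessRegressor(kernel=kernel,
                               alpha=1e-3,              # noise added to the diagonal
                               n_restarts_optimizer=5)  # restart from random thetas
gpr.fit(X, y)

print(gpr.kernel_.theta)   # log-transformed hyperparameters after optimization
print(gpr.kernel_.bounds)  # log-transformed search bounds
```

Note that the fitted, optimized kernel lives on `gpr.kernel_` (with a trailing underscore), while `gpr.kernel` keeps the original specification.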

Example: California House Prices

This code illustrates how to forecast California house prices using Gaussian Process Regression. The California Housing dataset is loaded first, and a subset of the data is chosen for efficiency. Next, the dataset is divided into testing and training sets. A kernel (constant kernel plus RBF) is given when creating a Gaussian Process Regressor. Predictions are produced on the test set once the model has been fitted to the training data. The output "Mean Squared Error: 1.5693510966686535" tells us how close the Gaussian Process model's predictions are to the actual housing prices in the California dataset. The mean squared error (MSE) is a common statistical measure that averages the squares of the errors, i.e. the differences between predicted and actual values. The lower the MSE, the more accurate the model.
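A sketch of the described pipeline follows; the exact subset size, kernel parameters, and random seed used originally are unknown, so these are guesses and the resulting MSE will differ from the quoted value:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the dataset and take a small subset for efficiency (size is our guess)
data = fetch_california_housing()
X, y = data.data[:500], data.target[:500]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Constant kernel plus RBF, as described above
kernel = ConstantKernel(1.0) + RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.1)
gpr.fit(X_train, y_train)

mse = mean_squared_error(y_test, gpr.predict(X_test))
print(f"Mean Squared Error: {mse}")
```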

Practical Considerations and Challenges

While Gaussian Processes offer significant advantages, it’s important to be aware of their limitations and challenges.

Computational Complexity

Note: here we catch a glimpse of the bottleneck of GPs: the analytical solution involves computing the inverse of the covariance matrix of our observations, $C^{-1}$ (the $\mathbf{K} + \sigma_n^2\mathbf{I}$ from earlier), which, given $n$ observations, is an $O(n^3)$ operation. The mean of $p(y_1 \mid y_2)$ is linearly related to $y_2$, and the predictive covariance is the prior uncertainty minus the reduction in uncertainty after seeing the observations.
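In practice the inverse is never formed explicitly; a Cholesky factorization (as in Algorithm 2.1 of Rasmussen & Williams) solves the linear system at the same $O(n^3)$ cost but more stably. A sketch with toy data (the kernel, jitter, and problem size here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 5, n)

# RBF covariance matrix plus jitter on the diagonal for numerical stability
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-4 * np.eye(n)
y = np.sin(X)

L = np.linalg.cholesky(K)  # the O(n^3) step
# Two triangular solves give alpha = K^{-1} y without ever forming K^{-1}
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

print(np.max(np.abs(K @ alpha - y)))  # residual of K alpha = y, near zero
```

This cubic cost in $n$ is precisely why exact GPs become impractical for large datasets, motivating sparse and approximate variants.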

Kernel Selection

I don't have any authoritative advice on selecting kernels for GPs in general, and I believe in practice most people try a few popular kernels and pick the one that fits their data/problem best. So here we only introduce the form of some of the most frequently seen kernels and get a feel for them with some plots, without going into too much detail. (I highly recommend implementing some of them and playing around with them, though!) There are books you can consult for appropriate covariance functions for your particular problem, and rules you can follow to produce more complicated covariance functions (for example, the product of two covariance functions is a valid covariance function). It is tricky to find the appropriate covariance function, but there are also methods in place for model selection. However, fully Bayesian model selection involves a very difficult integral (or sum in the discrete case, as shown above) over the hyperparameters of your GP, which makes it impractical; it is also very sensitive to the prior you put over your hyperparameters.

Scaling Problems

In reality, GPs suffer from scaling problems in large datasets and from being very sensitive to the choice of kernels.

Conclusion

In conclusion, Gaussian Processes (GPs) in Scikit-learn provide a nuanced and sophisticated method for regression tasks, capable of accounting for uncertainties in predictions. They offer a probabilistic approach, which means they give us not just predictions but also a sense of how confident the model is about those predictions. The implementation of GPs we discussed involves using a subset of data and a simplified kernel to ensure the model is less demanding on system memory resources. This makes GPs more accessible for use on systems with limited RAM, although it may trade off some accuracy due to the reduced complexity and dataset size.

Using Gaussian Processes with Scikit-learn is therefore a balance between model complexity, system resources, and prediction accuracy. While more data and a complex model can potentially offer more accurate predictions, they also require more computational resources. Conversely, a simplified model can save resources but at the cost of some prediction precision. The main point I try to show here is the importance of choosing the correct kernel for the job. Here, even a simple dataset required the use of three kernels; for more complex datasets, this could be many more.

Hopefully this has been a helpful guide to Gaussian process. To keep things relatively short and simple here, I did not delve into the complications of using GPs in practice.

