Statistical Learning Theory Explained
Understanding intelligence and replicating it in machines is a significant challenge in science. Learning, with its theoretical underpinnings and computational implementations, is central to intelligence. In recent decades, AI systems have successfully tackled complex tasks previously exclusive to biological organisms, such as computer vision, speech recognition, and natural language processing. These advancements are driven by algorithms trained on examples rather than explicit programming, marking a paradigm shift in computer science. However, a complete theory of learning, especially one that explains the empirical puzzles raised by deep learning, remains elusive.
Statistical Learning Theory (SLT) offers a framework for understanding the behavior and performance of learning algorithms. It provides the foundation for constructing machines that can "learn" from data and make decisions or predictions. SLT draws from statistics and functional analysis and deals with finding a predictive function based on the data presented.
Core Concepts of Statistical Learning Theory
Predictive Power
One of the foundational principles of SLT is its predictive ability. By analyzing patterns in historical data, SLT enables machines to build models that make accurate predictions about unseen events or states. At its core, Statistical Learning Theory is about formulating a hypothesis from a given set of inputs.
Model Complexity
SLT involves the composition of complex models that capture non-linear relationships, logical combinations, and other intricate structures within the data.
Margin of Error
An inherent feature of SLT is the acceptance of a certain degree of uncertainty, or margin of error. Models are designed to learn the underlying patterns in the data rather than to memorize it; memorization is what leads to overfitting.
Key Aspects of Statistical Learning
Data Dependency
SLT is dependent on the quality and quantity of data available.
Efficient Processing
SLT provides room for efficient computation of models.
Handling Complexity
The ability to handle and manage complex relationships and intricate structures within data sets statistical learning methods apart from simpler approaches.
Statistics vs. Machine Learning
Statistics and machine learning, while sharing similar mathematical tools, have distinct goals. Statistics focuses on creating interpretable statistical models to describe data, enabling inferences and predictions. Machine learning, on the other hand, prioritizes predictive power, often treating the underlying statistical model as a "black box" as long as the results are useful.
To avoid overfitting when using machine learning methods, a portion of the observed data (known as the test set) is held out to confirm the strength of the model built from the majority of the data (known as the training set).
Linear Regression
In statistics, a linear statistical model is tested before being used for predictions.
The linear model can be written as y = Xβ + ε, where the vector y = (y₁,…,yᵢ,…,yₙ)⊤ holds the values taken by the response variable, X is the n × p design matrix of predictors, β is the vector of unknown coefficients, and ε is the vector of random error terms. The matrix X is assumed to be non-random and of full rank. The Gauss-Markov theorem states that, for this model with zero-mean, uncorrelated, homoscedastic error terms, the ordinary least squares estimator of β is the best linear unbiased estimator; Normality of the errors is not required for this result. When the errors are additionally assumed Normal, the parameters can be estimated by maximum likelihood, which coincides with ordinary least squares.
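The ordinary least squares estimator has the closed form β̂ = (XᵀX)⁻¹Xᵀy. A minimal sketch in NumPy, using a small made-up design matrix and response (not from any real dataset):

```python
import numpy as np

# Illustrative design matrix: a column of ones for the intercept,
# then one predictor. The response values are made up for the example.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# OLS estimator: beta_hat = (X^T X)^{-1} X^T y, solved without an
# explicit matrix inverse for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # [intercept, slope]
```

Solving the normal equations with `np.linalg.solve` rather than inverting XᵀX directly is the standard numerically stable choice.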
Logistic Regression
For logistic regression with a binomial response, y ∈ {0,1}, the logistic function defines the probability of a successful outcome, π = P(y=1|x) = 1 / (1 + e^(−xᵀβ)), where x is the vector of observed predictive variables, of which there are p. For the sample, πᵢ = P(yᵢ=1|xᵢ), where xᵢ is the ith observed input vector, of which there are n. The logistic function allows us to apply the theory behind linear regression to a probability, between 0 and 1, of predicting a successful outcome.
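A quick sketch of the logistic function mapping a linear predictor xᵀβ to a probability in (0, 1). The coefficient values below are illustrative, not fitted estimates:

```python
import numpy as np

def logistic(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

beta = np.array([-1.0, 0.8])  # illustrative intercept and slope
x = np.array([1.0, 2.5])      # 1 for the intercept, then the predictor value
pi = logistic(x @ beta)       # P(y = 1 | x)
print(pi)
```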
Machine Learning Approach to Linear Regression
From the machine learning point of view, predictive models are often too complicated or computationally intensive to solve in closed form, and the function to be learned is not typically predefined. The objective to be minimized is the ordinary least squares of the residuals, used in the machine learning algorithm as the loss function, L; more generally this is called the cost function, J(θ), where θ represents the parameter values being optimized. To help normalize and compare models, one typically minimizes the mean squared error, which is 1/n times the sum of the squared residuals.
In machine learning, training data comes in pairs (x₁,y₁),…,(xₙ,yₙ), and the updates cycle through each pair multiple times when optimizing θ. The gradient descent update, θ := θ − α∇J(θ), gives the direction in which to move the β values, also known as weights, toward the minimum of the loss function. The learning rate α should be set so that progress toward the minimum is sufficiently fast without overshooting and making the minimum impossible to reach.
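The pairwise update scheme can be sketched as stochastic gradient descent for a one-variable linear regression. The synthetic data, learning rate, and epoch count below are illustrative choices:

```python
import numpy as np

# Synthetic data: true slope 2, true intercept 1, small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=50)

w, b = 0.0, 0.0   # weights (slope and intercept), initialized at zero
alpha = 0.01      # learning rate

# Cycle through each (x_i, y_i) pair multiple times (epochs).
for epoch in range(200):
    for xi, yi in zip(x, y):
        err = (w * xi + b) - yi   # residual for this single pair
        w -= alpha * err * xi     # gradient of 0.5 * err^2 w.r.t. w
        b -= alpha * err          # gradient of 0.5 * err^2 w.r.t. b

print(w, b)  # should approach the true values 2 and 1
```

With too large an α the per-pair updates overshoot and diverge; with too small an α the same number of epochs leaves the weights far from the minimum.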
Neural Networks
The idea of neural networks came from the concept of how neurons work in living animals: a nerve signal is either amplified or dampened by each neuron the signal passes through, and it is the sum of multiple neurons in series and in parallel, each filtering multiple inputs and feeding that signal to additional neurons, that eventually provides the desired output. A single unit computes y = f(Σⱼ wⱼφⱼ(x)), where f(·) is a nonlinear activation function, wⱼ are the weights, and φⱼ(x) is a basis function. The basis function can transform the inputs x before the weights w are determined.
The activation function f(·) is simply the identity for linear regression. With logistic regression, however, a specific activation function is needed to convert the output of the linearly determined weights into a predicted probability of the binomial response, 0 or 1. That activation function is the sigmoid, σ(z) = 1 / (1 + e^(−z)) with z = xβ, which is equivalent to the logistic function defined for logistic regression in statistics. In machine learning more generally, a nonlinear activation function is used even when the output does not need to be a probability.
To train the weights w at each step, the neural network algorithm calculates the error value: the difference between the calculated prediction and the actual outcome. In statistics, the generalized linear model is used to extend linear regression to logistic regression for a binomial response, and a similar transformation can be done for situations where the response is multinomial, i.e., multiclass. The neural network model for multinomial logistic regression works similarly to binary logistic regression, with a softmax output layer in place of the sigmoid.
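A minimal sketch of the forward pass for both cases: a sigmoid unit for a binary response and a softmax layer for a multinomial response. The weights here are illustrative values, not trained estimates:

```python
import numpy as np

def sigmoid(z):
    """Binary case: squashes a linear combination into P(y=1|x)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multiclass case: converts a vector of scores into probabilities."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

x = np.array([1.0, 0.5, -0.2])          # input features
w_binary = np.array([0.4, -0.3, 0.9])   # illustrative weights, one output unit
p_binary = sigmoid(x @ w_binary)        # P(y = 1 | x)

W_multi = np.array([[0.2, -0.1, 0.5],   # illustrative weights,
                    [0.7,  0.3, -0.4],  # one row per class
                    [-0.5, 0.8, 0.1]])
p_multi = softmax(W_multi @ x)          # probabilities over three classes
print(p_binary, p_multi)
```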
Types of Data in Statistical Learning
In statistical learning theory, data is categorized into two main types:
- Dependent Variable (y): The variable whose values depend on the values of other variables, also known as the target variable.
- Independent Variables (x): The variables whose values do not depend on the values of other variables, also known as predictor variables, input variables, explanatory variables, or features.
In statistical learning, the independent variable(s) affect the dependent variable.
- Independent Variable Example: Age is an independent variable.
- Dependent Variable Examples: Weight (dependent on age, diet, activity levels) and Temperature (impacted by altitude, latitude, and distance from the sea).
In graphs, the independent variable is often plotted along the x-axis, while the dependent variable is plotted along the y-axis.
Statistical Models
A statistical model defines the relationships between a dependent and independent variable. For example, a simple model can be represented by the equation y = mx + c, where m is the gradient and c is the intercept, illustrating a linear relationship between the size of a home and its price. More complex models can incorporate multiple independent variables.
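The home-price example can be written down directly; the gradient m and intercept c below are made-up illustrative numbers, not estimates from real housing data:

```python
# Simple linear model y = m*x + c for price as a function of home size.
m = 150.0      # illustrative gradient: price increase per square metre
c = 50_000.0   # illustrative intercept: base price

def predict_price(size_sqm):
    """Predict price from size using the linear model y = m*x + c."""
    return m * size_sqm + c

print(predict_price(100))  # -> 65000.0
```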
Model Generalization
To build an effective model, the available data needs to be used in a way that makes the model generalizable for unseen situations. Model validation involves:
- Splitting the data into training data and testing data (e.g., 80/20 or 70/30 split).
- Using the training data to train the model.
- Using the testing data to test the model.
If the model has learned well from the training data, it will perform well with both the training data and testing data. The accuracy score for each can determine performance. Overfitting occurs if the training data has a significantly higher accuracy score than the testing data.
Supervised Learning
Supervised learning is best understood from the perspective of statistical learning theory. Supervised learning involves learning from a training set of data. Every point in the training set is an input-output pair, where the input maps to an output. Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output. Classification problems are those for which the output will be an element from a discrete set of labels. Classification is very common in machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name.
Let X be the vector space of all possible inputs, and Y the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X × Y, i.e. there exists some unknown p(z) = p(x, y). In this formalism, the inference problem consists of finding a function f : X → Y such that f(x) ≈ y. Let H be a space of functions f : X → Y called the hypothesis space; this is the space of functions the algorithm will search through. Let V(f(x), y) be the loss function, a metric for the difference between the predicted value f(x) and the actual value y. The goal is to minimize the expected risk, I[f] = ∫ V(f(x), y) p(x, y) dx dy. Because the probability distribution p(x, y) is unknown, a proxy measure for the expected risk must be used: the empirical risk, I_S[f] = (1/n) Σᵢ V(f(xᵢ), yᵢ), based on the training set, a sample from this unknown probability distribution. The choice of loss function is a determining factor in the function that will be chosen by the learning algorithm; it also affects the convergence rate of an algorithm. The most common loss function for regression is the square loss (also known as the L2-norm loss), V(f(x), y) = (y − f(x))². This familiar loss function is used in ordinary least squares regression.
In some sense the 0-1 indicator function is the most natural loss function for classification: V(f(x), y) = 0 if the predicted output is the same as the actual output, and V(f(x), y) = 1 if the predicted output is different from the actual output.
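Both losses, averaged over a sample, give the empirical risk. A small sketch with illustrative predictions and targets:

```python
import numpy as np

# Empirical risk = average loss over the training sample.
# Square loss for a regression-style prediction (illustrative values):
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
square_risk = np.mean((y_true - y_pred) ** 2)

# 0-1 loss for a classification-style prediction (illustrative labels):
labels_true = np.array([0, 1, 1, 0, 1])
labels_pred = np.array([0, 1, 0, 0, 1])
zero_one_risk = np.mean(labels_true != labels_pred)  # misclassification rate

print(square_risk, zero_one_risk)
```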
Overfitting and Regularization
In machine learning problems, a major problem that arises is that of overfitting. Because learning is a prediction problem, the goal is not to find a function that most closely fits the (previously observed) data, but to find one that will most accurately predict output from future input. Overfitting is symptomatic of unstable solutions; a small perturbation in the training set data would cause a large variation in the learned function.
Regularization can be accomplished by restricting the hypothesis space H. A common example is restricting H to linear functions, which can be seen as a reduction to the standard problem of linear regression. H could also be restricted to polynomials of degree p, to exponentials, or to bounded functions on L1. One example of regularization is Tikhonov regularization, which minimizes (1/n) Σᵢ V(f(xᵢ), yᵢ) + γ‖f‖²_H, where γ is a fixed and positive parameter, the regularization parameter.
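For linear regression with the square loss, Tikhonov regularization is ridge regression, with the closed form β̂ = (XᵀX + γI)⁻¹Xᵀy. A sketch with illustrative synthetic data and an illustrative γ:

```python
import numpy as np

# Synthetic, illustrative regression data.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(0, 0.1, size=30)

# Ridge (Tikhonov) solution: (X^T X + gamma * I)^{-1} X^T y.
gamma = 1.0  # illustrative regularization parameter
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)
print(beta_ridge)  # coefficients shrunk toward zero relative to OLS
```

Increasing γ shrinks the coefficients further toward zero, trading a little bias for a reduction in variance.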
Consider a binary classifier f : X → {0, 1}. We can apply Hoeffding's inequality to bound the probability that the empirical risk deviates from the true risk, since the deviation is sub-Gaussian: P(|I_S[f] − I[f]| ≥ ε) ≤ 2e^(−2nε²). But generally, when we do empirical risk minimization, we are not given a classifier; we must choose it. Therefore, a more useful result is to bound the probability of the supremum of the difference over the whole class: P(sup_{f∈F} |I_S[f] − I[f]| ≥ ε) ≤ 2 S(F, n) e^(−nε²/8), where S(F, n) is the shattering number and n is the number of samples in the dataset.
Minimization of Expected Prediction Error (EPE)
SLT regression methods differ from classical statistical approaches in a number of ways, summarized in Table 1 (Breiman, 2001). Most, if not all of these differences arise from the SLT priority of maximizing out-of-sample predictive accuracy. Optimizing out-of-sample prediction is often neither the focus nor outcome of traditional null-hypothesis testing methods.
To motivate discussion of SLT methods, consider the following general modeling framework. Let m be an arbitrary statistical model in a sample of n observations, relating a group of p predictors (X1, X2, … Xp) in an n × p design matrix X to an n × 1 outcome vector y. The model m maps the sample values in X to y through some function fm(X), parameterized by a vector of estimated coefficients β̂. The β̂ are (p+1) × 1, since there is an additional parameter for the intercept in most models. The linear predictor for observation i, ηᵢ = xᵢᵀβ̂, is simply the weighted linear combination one would find in any regression context. In a likelihood framework, a probability distribution is chosen for the dependent variable, and (usually) its mean is parameterized as conditional on the predictors via β. A set of estimating equations is then used to find the parameter values that maximize the log likelihood of the sample data. These Maximum Likelihood Estimates (MLEs) are often (but not always) unbiased (Casella & Berger, 2002).
While there are a variety of criteria for evaluating estimators, perhaps the most commonly utilized one is accuracy. An estimator's accuracy is defined as the inverse of its Mean Square Error (MSE), the expectation of the squared difference between the estimator and the population parameter it seeks to estimate (Casella & Berger, 2002): MSE(θ̂) = E[(θ̂ − θ)²] = Bias(θ̂)² + Var(θ̂) (1). Here θ could be any kind of parameter, but in the present context consider a regression parameter β. The importance of (1) lies in the fact that an estimator's bias and variance are independent contributors to its overall accuracy. Unbiasedness simply refers to whether the estimator produces an estimate equal to the parameter in expectation, or on average; the estimator may or may not show great variability around this average. Thus, unbiased estimators are not always the most accurate, because they may have large variance (see Casella & Berger 2002, Chapter 7).
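The bias-variance tradeoff in MSE can be demonstrated with a small simulation: a slightly biased, lower-variance estimator of a mean can beat the unbiased sample mean on MSE. The shrinkage factor 0.9 is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 1.0                  # population parameter being estimated
reps = 20_000                # number of simulated samples
samples = rng.normal(theta, 2.0, size=(reps, 10))  # n = 10 per sample

unbiased = samples.mean(axis=1)  # the unbiased sample mean
shrunk = 0.9 * unbiased          # biased toward zero, but lower variance

# MSE of each estimator, averaged over the simulated samples.
mse_unbiased = np.mean((unbiased - theta) ** 2)
mse_shrunk = np.mean((shrunk - theta) ** 2)
print(mse_unbiased, mse_shrunk)  # the biased estimator can win on MSE
```

Here the unbiased estimator's MSE is pure variance (about 4/10 = 0.4), while the shrunken estimator trades a small squared bias (0.01) for a variance reduction to 0.81 × 0.4, yielding a lower total.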
The independence of an estimator's bias and variance has an important implication when one strives to maximize a model's predictive power (Casella & Berger, 2002). Maximizing prediction is equivalent to increasing variance explained. If one naively wants to increase R2 or similar likelihood-based measures of model fit, one can do so by adding more and more predictors. The only technical restriction on the number of p (independent) parameters usually estimable is p < n−1 (or p < n for models like Cox regression which have no intercept). The reason that adding more and more predictors of even negligible importance can increase "variance explained" in the outcome lies in the fact that maximizing R2-like measures is equivalent to minimizing Var[Y|X1, X2, … Xp]. As long as X is a sufficient statistic (a random variable is a sufficient statistic for itself), an important result known as the Rao-Blackwell theorem guarantees that Var[Y|X] ≤ Var[Y] (see Casella & Berger, 2002, Chapter 7). This means that conditioning on more and more X's almost always drives down the conditional variance of the outcome, driving up variance explained by fm(X). Such a strategy risks overfitting fm(X) to the sample. Such a model has become erroneously complex, and will not predict the outcome very well in other samples. This is commonly understood at an intuitive level, and discouraged in practice by the use of adjusted R2 or information criteria penalizing fit for model complexity (Vrieze, 2012).
Overfitting happens because the more predictors that are added, the greater the chance that one or more parameter estimates lie far away from the true value of the population parameter. Recall that unbiasedness means that the estimator yields estimates equal to the population parameter on average. Within any given sample, an unbiased estimator may produce an estimate that lies far from the true value of the parameter, out at the tails of the sampling distribution in (1). All of the estimates β̂ in a model are subject to this risk. And the more estimates in a model that deviate substantially from the corresponding population parameters, the more error is transmitted into the model's predicted values of the outcome. Developing an fm(X) that explains as much outcome variance as possible without overfitting the sample is thus a difficult task.
A model fm(X) acts as an estimator of the outcome Y. In a criterion-keyed scale context, fm(X) is a scale score based on items relevant to the outcome or criterion Y, arranged in a weighted linear combination. The weights are determined by regression, and depending on the type of model employed, fm(X) may involve a final transformation, such as exponentiation from a logit to an odds ratio if a logistic regression has been used. In any context, since fm(X) acts as an estimator of Y, its accuracy can be evaluated by expected prediction error (EPE; Hastie et al., 2009, Chapter 7). Denote the model estimate of Y at a particular set of inputs x0 as fm(x0). The EPE at x0 decomposes into three terms: EPE(x0) = σε² + [E[fm(x0)] − y0]² + Var[fm(x0)]. The first term, σε², is "irreducible error" that cannot be eliminated. Statistically, this reflects variation of the outcome about its average value that can never be explained; one might consider it an inherent property of the sample that cannot be eliminated by a model. The second term, [E[fm(x0)] − y0]², is the squared bias of the model-based estimate, or the extent to which the model systematically over- or under-estimates the outcome at the particular set of inputs x0. The third term is the variance of the model-based estimate, or the degree of random variation about its average prediction E[fm(x0)] at that set of inputs. Note that in a scale-score context, fm(x0) is merely a particular score on the scale corresponding to a pattern of responses on the scale's items. More than one individual observation may share a particular set of inputs, or have identical values of the X variables; of course, the more numerous the X variables and/or the smaller the sample, the less likely this is to occur. Taken across the entire possible range of inputs on all X variables, EPE is conceptually akin to outcome variation unexplained by the model.
EPE may not be minimized by the parameter estimates that maximize the likelihood of the sample data. This is most likely to be the case when the ratio of predictors p to observations n is large, although “rules of thumb” about p/n are at best tentative since every situation is different. At small p/n ratios, MLEs may perform reasonably well in out-of-sample predictions; we remark on this issue later. SLT methods represent a reaction to the “over-optimism” of MLEs, or their tendency to overfit samples in cases of larger p/n. Hastie and colleagues (2009; Chapter 7) provide a good overview of a variety of model selection strategies based on the minimization of EPE. A classic article-length introduction to cross-validation from a traditional statistical standpoint is (Harrell, Lee, & Mark, 1996), and a technical SLT-oriented overview can be found in (Arlot & Celisse, 2010).
In k-fold cross-validation, a sample is split into k different segments, usually of equal size, with k commonly equal to five or ten. The parameters in fm(X) are estimated from the data in k−1 folds, and then the model is used to predict the outcome in the remaining fold. This process is repeated k times, with each fold serving as the test set once. The EPE is then estimated as the average prediction error across all k folds.
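The k-fold procedure can be sketched from scratch for an OLS regression, with k = 5 and synthetic, illustrative data:

```python
import numpy as np

# Synthetic, illustrative data: linear signal plus noise with sd 0.5.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(0, 0.5, size=100)

# Split shuffled indices into k roughly equal folds.
k = 5
folds = np.array_split(rng.permutation(len(X)), k)

errors = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit OLS on the k-1 training folds.
    beta = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)[0]
    # Predict on the held-out fold and record the squared error.
    pred = X[test_idx] @ beta
    errors.append(np.mean((pred - y[test_idx]) ** 2))

epe_estimate = np.mean(errors)  # average prediction error across folds
print(epe_estimate)
```

Because the test fold is never used in fitting, the averaged error estimates out-of-sample performance rather than in-sample fit; here it should sit near the irreducible noise variance of 0.25.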
Applications of Statistical Learning Theory
Techniques rooted in Statistical Learning Theory are widely adopted across diverse industries for tasks ranging from anomaly detection to advanced prognosis. SLT has vast potential in the technological realm, particularly with the rise of data mining, AI, and machine learning. Its application, however, requires careful consideration and working knowledge to ensure strategic and fruitful use.

