Maximum Likelihood Estimation (MLE) Explained: A Comprehensive Guide

Maximum Likelihood Estimation (MLE) is a fundamental statistical method used to estimate the parameters of a probability distribution based on observed data. It's a cornerstone of many machine learning algorithms and statistical models. This article provides a comprehensive explanation of MLE, covering its underlying principles, mathematical formulation, practical applications, and limitations.

Introduction to Maximum Likelihood Estimation

In essence, MLE seeks to find the parameter values that maximize the likelihood of observing the given data under the assumed probability distribution. The "likelihood" quantifies how plausible different parameter values are, given the actual observed data. It is a function of parameters for fixed data. In other words, MLE attempts to determine the parameters that make the observed data "most probable."

Many conventional machine learning algorithms are built on the principles of MLE. For example, the least-squares best-fit line in linear regression is identical to the MLE solution when the errors are assumed to be Gaussian.
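This equivalence is easy to check numerically. The sketch below (synthetic data, unit-variance Gaussian noise assumed) fits a line by least squares and then confirms that no nearby parameter values achieve a higher Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)

# Least-squares fit (closed form).
slope, intercept = np.polyfit(x, y, 1)

def gaussian_log_likelihood(a, b, sigma=1.0):
    """Log-likelihood of the data under y = a*x + b + N(0, sigma^2) noise."""
    residuals = y - (a * x + b)
    n = x.size
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum(residuals**2) / (2 * sigma**2)

# The least-squares parameters beat every nearby alternative,
# because minimizing squared error is maximizing the Gaussian likelihood.
best = gaussian_log_likelihood(slope, intercept)
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    assert best > gaussian_log_likelihood(slope + da, intercept + db)
print("least-squares fit maximizes the Gaussian likelihood")
```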

What is a Likelihood Function?

The likelihood function helps us find the best parameters for our distribution. It can be defined as:

L(θ | x₁, x₂, …, xₙ) = f(x₁, x₂, …, xₙ | θ)


Where:

  • θ is the parameter (or parameters) we want to estimate.
  • x₁, x₂, …, xₙ are the observed data points.
  • f is the joint probability density function (PDF) of our distribution with the parameter θ.

The pipe ("|") is often replaced by a semicolon, since θ isn't a random variable but an unknown parameter.

Of course, θ could also be a set of parameters. For example, in the case of a normal distribution, we would have θ = (μ,σ), with μ and σ representing the two parameters of our distribution.

Intuition Behind MLE

Likelihood is often used interchangeably with probability, but the two are not the same. The likelihood is not a probability density function: integrating it over an interval of parameter values does not yield a probability. Rather, it measures how plausibly a distribution with particular parameter values explains our data.

Viewed this way, likelihood tells us how well the distribution fits the given data as its parameters vary. So if L(θ₁|x) is greater than L(θ₂|x), the distribution with parameter value θ₁ fits our data better than the one with parameter value θ₂.



The MLE Process: A Step-by-Step Guide

To reiterate, we are looking for the parameter values (one or several) that maximize our likelihood function. How do we do that?

  1. Define the Probability Distribution: Assume that the data follows a specific probability distribution (e.g., normal, binomial, exponential). This assumption is crucial as it dictates the form of the likelihood function.
  2. Formulate the Likelihood Function: Construct the likelihood function based on the chosen probability distribution and the observed data. This function expresses the probability of observing the data given the parameters of the distribution.
  3. Maximize the Likelihood Function: Find the parameter values that maximize the likelihood function. This can be achieved through analytical methods (e.g., calculus) or numerical optimization techniques.
  4. Obtain the Maximum Likelihood Estimate (MLE): The parameter values that maximize the likelihood function are the maximum likelihood estimates (MLEs). These estimates represent the "best fit" parameters for the chosen distribution.
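The four steps can be sketched end to end for a normal distribution with unknown mean and standard deviation (synthetic data; a coarse grid search stands in for a proper optimizer):

```python
import numpy as np

# Step 1: assume the data come from a normal distribution N(mu, sigma^2).
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Step 2: the log-likelihood of (mu, sigma) given the data.
def log_likelihood(mu, sigma):
    n = data.size
    return (-n * np.log(sigma * np.sqrt(2 * np.pi))
            - np.sum((data - mu) ** 2) / (2 * sigma ** 2))

# Step 3: maximize over a grid of candidate parameter values.
mus = np.linspace(4, 6, 121)
sigmas = np.linspace(1, 3, 121)
grid = np.array([[log_likelihood(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmax(grid), grid.shape)

# Step 4: the grid maximizer matches the known closed-form MLEs,
# the sample mean and the (biased) sample standard deviation.
print(mus[i], sigmas[j])   # close to data.mean(), data.std()
```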

The i.i.d Assumption

To simplify our calculations, we assume the data are independent and identically distributed (i.i.d.): each observation is independent of the others, and all observations are drawn from the same distribution.

The i.i.d. assumption lets us write the joint likelihood of all the data points as the product of the individual likelihoods.

Also, many likelihood functions have a single maximum, allowing us to simply set the derivative equal to zero to find the parameter value. If multiple maxima exist, we need to identify the global maximum.


When no closed-form solution exists, more complex numerical methods are required to find the maximum likelihood estimate.

Mathematical Formulation

Let's delve into the mathematical details of MLE.

General Derivation of the MLE

Let’s suppose we have a dataset: x₁, x₂, …, xₙ. We believe these data points are generated from a probability distribution that depends on some unknown parameter θ (theta). For example, if our dataset were about coin flips, θ could be the probability of heads.

The likelihood function measures how likely it is to observe your data for different values of θ.

L(θ) = P(x₁; θ) * P(x₂; θ) * … * P(xₙ; θ)

This product, however, quickly becomes unwieldy to maximize directly.

The Log-Likelihood Trick

Recall that our likelihood function is a product. Working with products gets messy, especially with many data points, and multiplying many small probabilities can underflow floating-point arithmetic.

To simplify the maximization, it is usually more convenient to work with the natural logarithm of the likelihood function, called the log-likelihood. Because the logarithm is a strictly increasing function, the parameter values that maximize the likelihood also maximize the log-likelihood.

The log-likelihood is:

log L(θ) = log [P(x₁; θ) * P(x₂; θ) * … * P(xₙ; θ)] = log P(x₁; θ) + log P(x₂; θ) + … + log P(xₙ; θ)

We can now use calculus to obtain the value of θ: take the derivative of the log-likelihood with respect to θ, set it to zero, and solve for θ. (In machine learning, the same procedure is usually phrased as minimizing the negative log-likelihood, since optimizers conventionally minimize loss functions.)
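Beyond algebraic convenience, the log transform also matters numerically: multiplying thousands of densities below 1 underflows to zero, while summing their logs stays stable. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=2000)

# Per-point densities under a standard normal: each is < 1, and there are many.
densities = np.exp(-data**2 / 2) / np.sqrt(2 * np.pi)

# Multiplying thousands of numbers below 1 underflows to exactly 0.0 ...
print(np.prod(densities))           # 0.0

# ... while summing the logs stays perfectly representable.
print(np.sum(np.log(densities)))    # a finite negative number
```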

Analytical vs. Numerical Solutions

In some cases, the MLE equations can be solved analytically, meaning we can derive an exact formula for the parameter estimates. However, for more complex models, analytical solutions don't exist or are too complicated to derive. In these cases, we use numerical optimization methods - iterative algorithms that search for the parameters that maximize the log-likelihood.
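One case with no analytical solution is the location parameter of a Cauchy distribution. The sketch below finds it with a simple ternary search over the log-likelihood (an assumption here: the likelihood is unimodal over the search interval, which holds for the bulk of a large sample):

```python
import numpy as np

rng = np.random.default_rng(4)
# Cauchy-distributed data: the location MLE has no closed form.
data = rng.standard_cauchy(2000) + 3.0   # true location = 3.0

def log_likelihood(theta):
    # Cauchy log-density summed over the sample, dropping the -n*log(pi) constant.
    return -np.sum(np.log(1.0 + (data - theta) ** 2))

# Ternary search between the sample quartiles.
lo, hi = np.percentile(data, [25, 75])
for _ in range(100):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if log_likelihood(m1) < log_likelihood(m2):
        lo = m1
    else:
        hi = m2

theta_mle = (lo + hi) / 2
print(theta_mle)   # close to 3.0
```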

Examples of MLE in Action

Let's illustrate MLE with some concrete examples.

Example 1: Estimating the Probability of Heads in a Coin Flip

Suppose we flip a coin 13 times and observe the following sequence: Heads, Heads, Tails, Heads, Heads, Heads, Tails, Tails, Heads, Heads, Heads, Tails, Heads. That is 9 heads and 4 tails.

If asked, "What is the probability that a single flip comes up heads?", most people would intuitively answer 9/13 (≈ 0.69), the observed fraction of heads. MLE formalizes exactly that intuition.

We want to estimate the probability θ (theta) of getting heads.

The likelihood function is:

L(θ) = θ^(number of heads) * (1-θ)^(number of tails) = θ⁹ * (1-θ)⁴

To find the MLE, we differentiate the likelihood, set the derivative to zero, and solve for θ. This gives three candidate solutions: θ = 0, θ = 1, and θ = 9/13. The first two assign zero likelihood to our observed sequence, so the maximum is attained at θ = 9/13.

The MLE for the probability of heads is 9/13, which is approximately 0.7.

Note: If we obtain multiple candidate solutions for θ, we also check the second derivative and keep the values where it is negative, confirming a maximum rather than a minimum.
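As a sanity check, the same answer can be found numerically by evaluating the likelihood θ⁹(1−θ)⁴ on a fine grid (a sketch, not part of the derivation above):

```python
import numpy as np

heads, tails = 9, 4

# Evaluate L(theta) = theta^9 * (1 - theta)^4 on a fine grid over [0, 1].
thetas = np.linspace(0.0, 1.0, 100001)
likelihood = thetas**heads * (1 - thetas)**tails

theta_mle = thetas[np.argmax(likelihood)]
print(theta_mle)   # ≈ 9/13 ≈ 0.6923
```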

Example 2: Estimating the Mean of a Normal Distribution

Suppose we have a dataset of n independent observations drawn from a Gaussian distribution with unknown mean μ and known variance 1. For concreteness, take the heights of 5 people: 160, 165, 170, 175, 180 (in cm).

The likelihood function is:

L(μ) = (1/√(2π)) * exp(-(x₁-μ)²/2) * (1/√(2π)) * exp(-(x₂-μ)²/2) * … * (1/√(2π)) * exp(-(xₙ-μ)²/2)

Taking the log-likelihood and maximizing with respect to μ, we find that the MLE for the mean is simply the sample mean:

μ_MLE = (x₁ + x₂ + … + xₙ) / n
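For the five heights above, a quick check (a minimal sketch, with the variance fixed at 1 as assumed) confirms that the sample mean maximizes the log-likelihood:

```python
import math

heights = [160, 165, 170, 175, 180]

# MLE of mu for a Gaussian with known variance 1 is the sample mean.
mu_mle = sum(heights) / len(heights)

def log_likelihood(mu):
    return sum(-0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / 2 for x in heights)

# The sample mean beats nearby candidate values of mu.
assert log_likelihood(mu_mle) > log_likelihood(169.0)
assert log_likelihood(mu_mle) > log_likelihood(171.0)
print(mu_mle)   # 170.0
```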

Example 3: Estimating the Rate Parameter of an Exponential Distribution

To understand the math behind MLE, let's work through a complete example: deriving the value of the exponential distribution's parameter that maximizes the likelihood.

The exponential distribution is a continuous probability distribution commonly used to model inter-event times. It has a single parameter, conventionally called λ, the rate. Its mean and variance are 1/λ and 1/λ², respectively.

The probability density function of the exponential distribution is:

f(x; λ) = λe^(-λx), for x ≥ 0

Let's estimate λ, given n random points x₁ to xₙ.

As discussed earlier, the likelihood for a single point xᵢ is:

L(λ|xᵢ) = λe^(-λxᵢ)

We calculate the likelihood for each of our n points. Since the points are independent and identically distributed, the combined likelihood is simply the product of the individual likelihoods:

L(λ|x₁, x₂, …, xₙ) = Π λe^(-λxᵢ) = λⁿ · e^(-λΣxᵢ)

The log-likelihood is:

log L(λ|x₁, x₂, …, xₙ) = n log λ − λ Σxᵢ

Differentiating with respect to λ and setting the result to zero:

d/dλ [log L] = n/λ − Σxᵢ = 0

Solving for λ gives the maximum likelihood estimate:

λ_MLE = n / Σxᵢ = 1 / x̄

The second derivative, −n/λ², is negative, confirming a maximum. So the MLE of the exponential rate is simply the reciprocal of the sample mean.
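Carrying the algebra through gives λ_MLE = n/Σxᵢ = 1/x̄, the reciprocal of the sample mean. A quick numerical sketch with simulated data (the true rate 2.0 is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
true_rate = 2.0
data = rng.exponential(scale=1.0 / true_rate, size=10000)

# Closed-form MLE from the derivation: lambda_hat = n / sum(x_i) = 1 / mean(x).
rate_mle = 1.0 / data.mean()

def log_likelihood(lam):
    return data.size * np.log(lam) - lam * data.sum()

# The estimate is close to the true rate and beats nearby candidates.
print(rate_mle)   # close to 2.0
assert log_likelihood(rate_mle) > log_likelihood(rate_mle * 0.9)
assert log_likelihood(rate_mle) > log_likelihood(rate_mle * 1.1)
```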
