Maximum Likelihood Estimation
What it is, and why it's important to data science, machine learning, and even generative AI.
If you’re a student of statistics or data science, you’ve probably come across the idea of maximum likelihood estimation, or MLE — and if you’re like me, you probably struggled with it the first time around. The goal of this post is to break it down: what it is, and why it’s important for statisticians, data scientists, and machine learning engineers.
What is maximum likelihood estimation?
MLE was developed in the 1920s by the famous British statistician Sir R. A. Fisher as a method of obtaining a point estimator of a parameter. At a high level, maximum likelihood estimation (MLE) is a method for estimating the parameters of a statistical model from a dataset: it finds the parameter values that maximize the likelihood function of the observed data under the model. Stated differently, the idea behind MLE is to define a function of the parameters that enables us to find a model that fits the data well.
Why is it important?
MLE provides a unified approach for estimating the parameters of a wide range of statistical models. It is a fundamental tool in the field of statistical inference, where the goal is to draw conclusions about a population based on a sample of data. In machine learning, and even generative AI, it provides a foundational approach for building, training, and evaluating models. MLE gives a natural and theoretically grounded way to find the model parameters (θ) that best explain the observed data (𝒟). This principle underlies the training process for many machine learning models.
Additionally, it can be applied to a wide range of models, including those with complex likelihood functions, which makes it useful across many domains — from simple linear regression models to sophisticated machine learning models and even generative models like variational autoencoders (note: with variational autoencoders, the connection to MLE is indirect and involves some key concepts in probabilistic modeling).
What’s the mathematical formulation?
In machine learning, the process of estimating θ from 𝒟 is called model fitting, or training, and it typically boils down to an optimization problem of the form:
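$$\hat{\theta} = \operatorname*{argmin}_{\theta} \, \mathcal{L}(\theta)$$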
where 𝐿(𝜃) is some kind of loss function or objective function and 𝜃-hat is our point estimate.
Treating this as an optimization problem, we can define the MLE as follows:
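$$\hat{\theta}_{\mathrm{mle}} = \operatorname*{argmax}_{\theta} \; p(\mathcal{D} \mid \theta)$$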
In statistics and machine learning, we usually assume the training examples are sampled independently from the same distribution (i.i.d.), so the (conditional) likelihood becomes:
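$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(y_n \mid x_n, \theta)$$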
To get rid of the product term, we typically work with the log likelihood, which is given by:
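$$\ell(\theta) \triangleq \log p(\mathcal{D} \mid \theta)$$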
This log likelihood decomposes into a sum of terms given by:
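$$\ell(\theta) = \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)$$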
Because most optimization algorithms are built to minimize cost functions, we can instead express the objective function as the (conditional) negative log likelihood.
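In symbols:

$$\mathrm{NLL}(\theta) = -\log p(\mathcal{D} \mid \theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)$$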
Minimizing the negative log likelihood yields the MLE. For unconditional (unsupervised) models, the MLE becomes:
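$$\hat{\theta}_{\mathrm{mle}} = \operatorname*{argmin}_{\theta} \; -\sum_{n=1}^{N} \log p(y_n \mid \theta)$$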
Alternatively, if we want to maximize the joint likelihood, the MLE in this case becomes:
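$$\hat{\theta}_{\mathrm{mle}} = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} \log p(y_n, x_n \mid \theta)$$

To make the general recipe concrete, here is a minimal sketch of unconditional MLE for a Bernoulli coin-flip model, using a simple grid search to minimize the negative log likelihood on some toy data (the data and grid resolution here are illustrative choices, not from the original post):

```python
import numpy as np

# Toy unconditional model: y_n ~ Bernoulli(θ), i.i.d.
y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Negative log likelihood: NLL(θ) = -Σ [y_n log θ + (1 − y_n) log(1 − θ)]
def nll(theta):
    return -(y * np.log(theta) + (1 - y) * np.log(1 - theta)).sum()

# Evaluate the NLL on a grid of candidate parameters and take the argmin
grid = np.linspace(0.01, 0.99, 981)  # step size 0.001
theta_hat = grid[np.argmin([nll(t) for t in grid])]

print(theta_hat)  # ≈ 0.7, the sample mean — the closed-form Bernoulli MLE
```

The grid search is deliberately naive; in practice you would use a gradient-based optimizer, but the principle — pick the θ with the smallest NLL — is the same.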
Examples:
MLE for Exponential Distribution
Let X be exponentially distributed with rate parameter λ. The probability density function for the exponential distribution is:
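$$f(x; \lambda) = \lambda e^{-\lambda x}, \qquad x \ge 0,\ \lambda > 0$$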
The likelihood function for a random sample of size n is:
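$$L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} \exp\!\left(-\lambda \sum_{i=1}^{n} x_i\right)$$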
Taking the log of the likelihood function yields the log-likelihood:
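$$\ell(\lambda) = \log L(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i$$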
To solve for λ, we take the derivative of the log-likelihood with respect to λ and set it to zero:
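$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \quad\Longrightarrow\quad \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}$$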
Consequently, the maximum likelihood estimate for the λ parameter is the reciprocal of the sample mean.
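As a quick numerical sanity check, we can minimize the negative log-likelihood directly and compare against the closed-form answer. This is a sketch using NumPy and SciPy on simulated data (the true rate of 2.0 and the sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulate a sample from an exponential distribution with true rate λ = 2.0
rng = np.random.default_rng(42)
x = rng.exponential(scale=1 / 2.0, size=10_000)  # NumPy's scale = 1/λ

# Negative log-likelihood: -ℓ(λ) = -(n·log λ − λ·Σxᵢ)
def nll(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

# Minimize the NLL numerically over λ > 0
result = minimize_scalar(nll, bounds=(1e-6, 100.0), method="bounded")

print(result.x)      # numerical MLE
print(1 / x.mean())  # closed-form MLE — the two should agree closely
```

Both values land near the true rate of 2.0, and the numerical optimum matches the reciprocal of the sample mean, as the derivation predicts.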
MLE for Normal Distribution
The general form of the normal or Gaussian probability density function is:
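$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$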
For this example, we let X be normally distributed with mean μ and variance σ² where both μ and σ² are unknown. The likelihood function for a random sample of size n is:
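$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$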
Taking the log of the likelihood function yields:
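$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$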
To find the parameter estimates for both μ and σ², we take the partial derivatives with respect to μ and σ² and set each to zero:
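$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0, \qquad \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$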
The solutions to these equations yield the maximum likelihood estimators for our normal distribution:
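$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

As with the exponential example, we can check these closed-form estimators against a direct numerical minimization of the negative log-likelihood. This is a sketch using NumPy and SciPy on simulated data (the true parameters μ = 5 and σ² = 4 are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# Simulate a sample from N(μ = 5, σ² = 4)
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Negative log-likelihood of the normal distribution
def nll(params):
    mu, sigma2 = params
    n = len(x)
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(sigma2)
            + ((x - mu) ** 2).sum() / (2 * sigma2))

# Numerical minimization from a rough initial guess, keeping σ² positive
result = minimize(nll, x0=[1.0, 1.0], bounds=[(None, None), (1e-6, None)])
mu_hat, sigma2_hat = result.x

print(mu_hat, x.mean())                       # MLE for μ vs. sample mean
print(sigma2_hat, ((x - x.mean()) ** 2).mean())  # MLE for σ² vs. biased sample variance
```

Note that the MLE for σ² divides by n rather than n − 1, so it is the biased sample variance; the two agree for large n.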
For more information about MLE, check out the following videos:
Josh Starmer: Maximum Likelihood, clearly explained!!!
UMass Amherst: Maximum Likelihood Estimation

