# Maximum Likelihood Estimation

Module by: Clayton Scott, Robert Nowak

Summary: This module introduces the maximum likelihood estimator. We show how the MLE implements the likelihood principle. Methods for computing the MLE are covered. Properties of the MLE are discussed, including asymptotic efficiency and invariance under reparameterization.

The maximum likelihood estimator (MLE) is an alternative to the minimum variance unbiased estimator (MVUE). For many estimation problems, the MVUE does not exist. Moreover, when it does exist, there is no systematic procedure for finding it. In contrast, the MLE does not necessarily satisfy any optimality criterion, but it can almost always be computed, either through exact formulas or numerical techniques. For this reason, the MLE is one of the most common estimation procedures used in practice.

The MLE is an important type of estimator for the following reasons:

1. The MLE implements the likelihood principle.
2. MLEs are often simple and easy to compute.
3. MLEs have asymptotic optimality properties (consistency and efficiency).
4. MLEs are invariant under reparameterization.
5. If an efficient estimator exists, it is the MLE.
6. In signal detection with unknown parameters (composite hypothesis testing), MLEs are used in implementing the generalized likelihood ratio test (GLRT).

This module will discuss these properties in detail, with examples.

## The Likelihood Principle

Suppose the data $X$ is distributed according to the density or mass function $p(x\,|\,\theta)$. The likelihood function for $\theta$ is defined by $$l(\theta\,|\,x) = p(x\,|\,\theta).$$ At first glance, the likelihood function is nothing new: it is simply a way of rewriting the pdf/pmf of $X$. The difference between the likelihood and the pdf or pmf is what is held fixed and what is allowed to vary. When we talk about the likelihood, we view the observation $x$ as fixed and the parameter $\theta$ as freely varying.
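To make this distinction concrete, here is a minimal sketch (an illustration not in the original module, using a single Bernoulli observation) that evaluates the same function $p(x\,|\,\theta) = \theta^x (1-\theta)^{1-x}$ once as a pmf in $x$ and once as a likelihood in $\theta$:

```python
import numpy as np

# p(x | theta) for a single Bernoulli observation.
def p(x, theta):
    return theta**x * (1 - theta)**(1 - x)

# Viewed as a pmf: theta is held fixed, x ranges over the outcomes {0, 1}.
theta_fixed = 0.3
print([p(x, theta_fixed) for x in (0, 1)])  # [0.7, 0.3] -- sums to 1

# Viewed as a likelihood: the observation x = 1 is held fixed, theta varies.
x_obs = 1
for theta in np.linspace(0.1, 0.9, 5):
    print(f"l({theta:.1f} | x={x_obs}) = {p(x_obs, theta):.2f}")
```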

### Note:

It is tempting to view the likelihood function as a probability density for $\theta$, and to think of $l(\theta\,|\,x)$ as the conditional density of $\theta$ given $x$. This approach to parameter estimation is called fiducial inference, and is not accepted by most statisticians. One potential problem, for example, is that in many cases $l(\theta\,|\,x)$ is not integrable ($\int l(\theta\,|\,x)\, d\theta \to \infty$) and thus cannot be normalized. A more fundamental problem is that $\theta$ is viewed as a fixed quantity, as opposed to random. Thus, it doesn't make sense to talk about its density. For the likelihood to be properly thought of as a density, a Bayesian approach is required.

The likelihood principle effectively states that all information we have about the unknown parameter $\theta$ is contained in the likelihood function.

### Principle 1: Likelihood Principle

The information brought by an observation $x$ about $\theta$ is entirely contained in the likelihood function $p(x\,|\,\theta)$. Moreover, if $x_1$ and $x_2$ are two observations depending on the same parameter $\theta$, such that there exists a constant $c$ satisfying $p(x_1\,|\,\theta) = c\, p(x_2\,|\,\theta)$ for every $\theta$, then they bring the same information about $\theta$ and must lead to identical estimators.

In the statement of the likelihood principle, it is not assumed that the two observations $x_1$ and $x_2$ are generated according to the same model, as long as the model is parameterized by $\theta$.

### Example 1

Suppose a public health official conducts a survey to estimate $\theta$, $0 \le \theta \le 1$, the proportion of the population eating pizza at least once per week. As a result, the official found nine people who had eaten pizza in the last week, and three who had not. If no additional information is available regarding how the survey was implemented, then there are at least two probability models we can adopt.

1. The official surveyed 12 people, and 9 of them had eaten pizza in the last week. In this case, we observe $x_1 = 9$, where $x_1 \sim \text{Binomial}(12, \theta)$. The density for $x_1$ is $$f(x_1\,|\,\theta) = \binom{12}{x_1} \theta^{x_1} (1-\theta)^{12 - x_1}.$$
2. Another reasonable model is to assume that the official surveyed people until he found 3 non-pizza eaters. In this case, we observe $x_2 = 12$, where $x_2 \sim \text{NegativeBinomial}(3, 1-\theta)$. The density for $x_2$ is $$g(x_2\,|\,\theta) = \binom{x_2 - 1}{3 - 1} \theta^{x_2 - 3} (1-\theta)^3.$$

The likelihoods for these two models are proportional: $$l(\theta\,|\,x_1) \propto l(\theta\,|\,x_2) \propto \theta^9 (1-\theta)^3.$$ Therefore, any estimator that adheres to the likelihood principle will produce the same estimate for $\theta$, regardless of which of the two data-generation models is assumed.
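As a numerical check (a sketch not in the original module, relying on scipy's binom and nbinom distributions), we can evaluate both likelihoods on a grid and confirm that they are proportional and share the same maximizer:

```python
import numpy as np
from scipy.stats import binom, nbinom

thetas = np.linspace(0.01, 0.99, 99)

# Model 1: x1 = 9 pizza eaters out of 12 surveyed.
lik1 = binom.pmf(9, 12, thetas)

# Model 2: survey until 3 non-pizza eaters; x2 = 12 people in total.
# scipy's nbinom.pmf(k, n, p) gives the probability of k failures before
# the n-th success, so treat "non-pizza eater" (prob 1 - theta) as success:
# k = 9 pizza eaters occur before the 3rd non-pizza eater.
lik2 = nbinom.pmf(9, 3, 1 - thetas)

print("MLE under model 1:", thetas[np.argmax(lik1)])  # 0.75
print("MLE under model 2:", thetas[np.argmax(lik2)])  # 0.75
print("constant ratio:", np.allclose(lik1 / lik2, lik1[0] / lik2[0]))  # True
```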

The likelihood principle is widely accepted among statisticians. In the context of parameter estimation, any reasonable estimator should conform to the likelihood principle. As we will see, the maximum likelihood estimator does.

### Note:

While the likelihood principle itself is a fairly reasonable assumption, it can also be derived from two somewhat more intuitive assumptions known as the sufficiency principle and the conditionality principle. See Casella and Berger, Chapter 6.

## The Maximum Likelihood Estimator

The maximum likelihood estimator $\hat{\theta}(x)$ is defined by $$\hat{\theta} = \arg\max_{\theta}\, l(\theta\,|\,x).$$ Intuitively, we are choosing $\theta$ to maximize the probability of occurrence of the observation $x$.

### Note:

It is possible that multiple parameter values maximize the likelihood for a given $x$. In that case, any of these maximizers can be selected as the MLE. It is also possible that the likelihood may be unbounded, in which case the MLE does not exist.

The MLE rule is an implementation of the likelihood principle. If we have two observations whose likelihoods are proportional (they differ by a constant that does not depend on $\theta$), then the value of $\theta$ that maximizes one likelihood will also maximize the other. In other words, both likelihood functions lead to the same inference about $\theta$, as required by the likelihood principle.

Understand that maximum likelihood is a procedure, not an optimality criterion. From the definition of the MLE, we have no idea how close it comes to the true parameter value relative to other estimators. In contrast, the MVUE is defined as the estimator that satisfies a certain optimality criterion. However, unlike the MLE, we have no clear procedure to follow to compute the MVUE.

## Computing the MLE

If the likelihood function is differentiable, then $\hat{\theta}$ is found by differentiating the likelihood (or log-likelihood), equating with zero, and solving: $$\frac{\partial \log l(\theta\,|\,x)}{\partial \theta} = 0.$$ If multiple solutions exist, then the MLE is the solution that maximizes $\log l(\theta\,|\,x)$, that is, the global maximizer.

In certain cases, such as pdfs or pmfs with an exponential form, the MLE can be easily solved for. That is, $\frac{\partial \log l(\theta\,|\,x)}{\partial \theta} = 0$ can be solved using calculus and standard linear algebra.

### Example 2: DC level in white Gaussian noise

Suppose we observe an unknown amplitude in white Gaussian noise with unknown variance: $$x_n = A + w_n, \quad n = 1, \ldots, N,$$ where the $w_n \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed. We would like to estimate $\theta = (A, \sigma^2)$ by computing the MLE. Differentiating the log-likelihood gives $$\frac{\partial \log p(x\,|\,\theta)}{\partial A} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - A)$$ $$\frac{\partial \log p(x\,|\,\theta)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^{N} (x_n - A)^2.$$ Equating with zero and solving gives us our MLEs: $$\hat{A} = \frac{1}{N} \sum_{n=1}^{N} x_n$$ and $$\widehat{\sigma^2} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{A})^2.$$

#### Note:

$\widehat{\sigma^2}$ is biased! Its expected value is $\frac{N-1}{N}\sigma^2$, not $\sigma^2$.
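As a quick numerical check (a sketch with hypothetical values $A = 5$, $\sigma^2 = 2$, not part of the original module), we can simulate this model and evaluate the closed-form MLEs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DC level in white Gaussian noise.
A_true, var_true, N = 5.0, 2.0, 10_000
x = A_true + rng.normal(0.0, np.sqrt(var_true), N)

A_hat = x.mean()                     # MLE of A: the sample mean
var_hat = np.mean((x - A_hat)**2)    # MLE of sigma^2: divides by N, so biased

print(f"A_hat = {A_hat:.3f} (true {A_true})")
print(f"var_hat = {var_hat:.3f} (true {var_true})")
```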

As an exercise, try the following problem:

### Exercise 1

Suppose we observe a random sample $x = (x_1, \ldots, x_N)^T$ of Poisson measurements with intensity $\lambda$: $$\Pr(x_i = n) = \frac{e^{-\lambda} \lambda^n}{n!}, \quad n = 0, 1, 2, \ldots$$ Find the MLE for $\lambda$.

Unfortunately, this approach is only feasible for the most elementary pdfs and pmfs. In general, we may have to resort to more advanced numerical maximization techniques:
1. Newton-Raphson iteration
2. Iteration by the Scoring Method
3. Expectation-Maximization Algorithm

All of these are iterative techniques which posit some initial guess at the MLE, and then incrementally update that guess. The iteration proceeds until a local maximum of the likelihood is attained, although in the case of the first two methods, such convergence is not guaranteed. The EM algorithm has the advantage that the likelihood is increased at each iteration, and so convergence to at least a local maximum is guaranteed (assuming a bounded likelihood). For each algorithm, the final estimate is highly dependent on the initial guess, and so it is customary to try several different starting values. For details on these algorithms, see Kay, Vol. I.
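As an illustration of the first technique (a sketch not from the original module), here is a Newton-Raphson iteration for the location parameter of a Cauchy sample, a case where the score equation has no closed-form solution:

```python
import numpy as np

def cauchy_location_mle(x, tol=1e-10, max_iter=100):
    """Newton-Raphson for the Cauchy location parameter.

    The log-likelihood l(theta) = -sum(log(1 + (x - theta)^2)) + const has
    no closed-form maximizer, so iterate theta <- theta - l'(theta)/l''(theta),
    starting from the sample median (a robust initial guess).
    """
    theta = np.median(x)
    for _ in range(max_iter):
        u = x - theta
        score = np.sum(2 * u / (1 + u**2))               # l'(theta)
        curv = np.sum((2 * u**2 - 2) / (1 + u**2)**2)    # l''(theta)
        step = score / curv
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(1)
x = rng.standard_cauchy(1000) + 3.0   # hypothetical location parameter of 3
print(cauchy_location_mle(x))         # close to 3
```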

## Asymptotic Properties of the MLE

Let $x = (x_1, \ldots, x_N)^T$ denote an IID sample of size $N$, where each sample is distributed according to $p(x\,|\,\theta)$. Let $\hat{\theta}_N$ denote the MLE based on the sample $x$.

### Theorem 1: Asymptotic Properties of MLE

If the likelihood $l(\theta\,|\,x) = p(x\,|\,\theta)$ satisfies certain "regularity" conditions (see Footnote 1), then the MLE $\hat{\theta}_N$ is consistent, and moreover, $\hat{\theta}_N$ is asymptotically distributed as $$\hat{\theta}_N \sim \mathcal{N}\!\left(\theta, I^{-1}(\theta)\right),$$ where $I(\theta)$ is the Fisher information matrix evaluated at the true value of $\theta$.

Since the mean of the MLE tends to the true parameter value, we say the MLE is asymptotically unbiased. Since the covariance tends to the inverse Fisher information matrix, we say the MLE is asymptotically efficient.

In general, the rate at which the mean-squared error converges to zero is not known. It is possible that for small sample sizes, some other estimator may have a smaller MSE. The proof of consistency is an application of the weak law of large numbers. Derivation of the asymptotic distribution relies on the central limit theorem. The theorem is also true in more general settings (e.g., dependent samples). See Kay, Vol. I, Ch. 7 for further discussion.
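A Monte Carlo experiment illustrates these asymptotics (a hypothetical setup, not from the original module). For exponential data with rate $\lambda$, the MLE is $\hat{\lambda} = 1/\bar{x}$ and the CRLB is $\lambda^2/N$; the empirical mean and variance of the MLE should approach these values:

```python
import numpy as np

rng = np.random.default_rng(2)

lam, N, trials = 2.0, 500, 20_000
# Each row is one sample of size N; rng.exponential takes the scale 1/lam.
samples = rng.exponential(1.0 / lam, size=(trials, N))
lam_hat = 1.0 / samples.mean(axis=1)  # MLE for each trial

print("mean of MLE:", lam_hat.mean())   # ~ 2.0 (asymptotically unbiased)
print("var of MLE: ", lam_hat.var())    # ~ lam^2 / N = 0.008
print("CRLB:       ", lam**2 / N)
```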

## The MLE and Efficiency

In some cases, the MLE is efficient, not just asymptotically efficient. In fact, when an efficient estimator exists, it must be the MLE, as described by the following result:

### Theorem 2

If $\hat{\theta}$ is an efficient estimator, and the Fisher information matrix $I(\theta)$ is positive definite for all $\theta$, then $\hat{\theta}$ maximizes the likelihood.

#### Proof

Recall that $\hat{\theta}$ is efficient (meaning it is unbiased and achieves the Cramer-Rao lower bound) if and only if $$\frac{\partial \ln p(x\,|\,\theta)}{\partial \theta} = I(\theta)\,(\hat{\theta} - \theta)$$ for all $\theta$ and $x$. Since $\hat{\theta}$ is assumed to be efficient, this equation holds, and in particular it holds when $\theta = \hat{\theta}(x)$. But then the derivative of the log-likelihood is zero at $\theta = \hat{\theta}(x)$. Thus, $\hat{\theta}$ is a critical point of the likelihood. Since the Fisher information matrix, which is the negative of the matrix of second-order derivatives of the log-likelihood, is positive definite, $\hat{\theta}$ must be a maximum of the likelihood.

An important case where this happens is described in the following subsection.

### Optimality of MLE for Linear Statistical Model

If the observed data $x$ are described by $$x = H\theta + w$$ where $H$ is $N \times p$ with full rank, $\theta$ is $p \times 1$, and $w \sim \mathcal{N}(0, C)$, then the MLE of $\theta$ is $$\hat{\theta} = (H^T C^{-1} H)^{-1} H^T C^{-1} x.$$ This can be established in two ways. The first is to compute the CRLB for $\theta$. It turns out that the condition for equality in the bound is satisfied, and $\hat{\theta}$ can be read off from that condition.

The second way is to maximize the likelihood directly. Equivalently, we must minimize $$(x - H\theta)^T C^{-1} (x - H\theta)$$ with respect to $\theta$. Since $C^{-1}$ is positive definite, we can write $C^{-1} = U^T \Lambda U = D^T D$, where $D = \Lambda^{1/2} U$, $U$ is an orthogonal matrix whose columns are eigenvectors of $C^{-1}$, and $\Lambda$ is a diagonal matrix with positive diagonal entries. Thus, we must minimize $$(Dx - DH\theta)^T (Dx - DH\theta).$$ But this is a linear least squares problem, so the solution is given by the pseudoinverse of $DH$:

$$\hat{\theta} = \left((DH)^T (DH)\right)^{-1} (DH)^T (Dx) = (H^T C^{-1} H)^{-1} H^T C^{-1} x \qquad (1)$$
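A minimal numerical sketch of Equation 1 (with hypothetical $H$, $\theta$, and $C$; not part of the original module):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical linear model x = H theta + w, w ~ N(0, C), C known.
N, p = 50, 3
H = rng.normal(size=(N, p))
theta_true = np.array([1.0, -2.0, 0.5])
C = np.diag(rng.uniform(0.5, 2.0, N))
x = H @ theta_true + rng.multivariate_normal(np.zeros(N), C)

# theta_hat = (H^T C^-1 H)^-1 H^T C^-1 x, via a linear solve for stability.
Ci = np.linalg.inv(C)
theta_hat = np.linalg.solve(H.T @ Ci @ H, H.T @ Ci @ x)
print(theta_hat)  # close to theta_true
```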

#### Exercise 2

Consider $X_1, \ldots, X_N \sim \mathcal{N}(s, \sigma^2 I)$, where $s$ is a $p \times 1$ unknown signal, and $\sigma^2$ is known. Express the data in the linear model and find the MLE $\hat{s}$ for the signal.

## Invariance of MLE

Suppose we wish to estimate the function $w = W(\theta)$ and not $\theta$ itself. To use the maximum likelihood approach for estimating $w$, we need an expression for the likelihood $l(w\,|\,x) = p(x\,|\,w)$. In other words, we would need to be able to parameterize the distribution of the data by $w$. If $W$ is not a one-to-one function, however, this may not be possible. Therefore, we define the induced likelihood $$\bar{l}(w\,|\,x) = \max_{\theta\,:\,W(\theta) = w} l(\theta\,|\,x).$$ The MLE $\hat{w}$ is defined to be the value of $w$ that maximizes the induced likelihood. With this definition, the following invariance principle is immediate.

### Theorem 3

Let $\hat{\theta}$ denote the MLE of $\theta$. Then $\hat{w} = W(\hat{\theta})$ is the MLE of $w = W(\theta)$.

#### Proof

The proof follows directly from the definitions of $\hat{\theta}$ and $\hat{w}$. As an exercise, work through the logical steps of the proof on your own.

#### Example

Let $x = (x_1, \ldots, x_N)^T$, where $x_i \sim \text{Poisson}(\lambda)$. Given $x$, find the MLE of the probability that a $\text{Poisson}(\lambda)$ random variable exceeds the mean $\lambda$.

$$W(\lambda) = \Pr(x > \lambda) = \sum_{n = \lfloor\lambda\rfloor + 1}^{\infty} \frac{e^{-\lambda} \lambda^n}{n!}$$ where $\lfloor z \rfloor$ denotes the largest integer $\le z$. The MLE of $w$ is $$\hat{w} = \sum_{n = \lfloor\hat{\lambda}\rfloor + 1}^{\infty} \frac{e^{-\hat{\lambda}} \hat{\lambda}^n}{n!}$$ where $\hat{\lambda}$ is the MLE of $\lambda$: $$\hat{\lambda} = \frac{1}{N} \sum_{n=1}^{N} x_n.$$
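A short sketch of this computation (with a hypothetical intensity $\lambda = 4$, not part of the original module). Note that the tail sum equals $1 - F(\lfloor\hat{\lambda}\rfloor)$, where $F$ is the Poisson cdf:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(4)
x = rng.poisson(4.0, size=200)  # hypothetical Poisson sample

lam_hat = x.mean()  # MLE of lambda
# By invariance, the MLE of w = Pr(X > lambda) is W(lam_hat).
w_hat = 1.0 - poisson.cdf(np.floor(lam_hat), lam_hat)
print(lam_hat, w_hat)
```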

Be aware that the MLE of a transformed parameter does not necessarily satisfy the asymptotic properties discussed earlier.

### Exercise 3

Consider observations $x_1, \ldots, x_N$, where each $x_i$ is a $p$-dimensional vector of the form $$x_i = s + w_i$$ where $s$ is an unknown signal and the $w_i$ are independent realizations of white Gaussian noise: $$w_i \sim \mathcal{N}(0, \sigma^2 I_{p \times p}).$$ Find the maximum likelihood estimate of the energy $E = s^T s$ of the unknown signal.

## Summary of MLE

The likelihood principle states that the information brought by an observation $x$ about $\theta$ is entirely contained in the likelihood function $p(x\,|\,\theta)$. The maximum likelihood estimator is one effective implementation of the likelihood principle. In some cases, the MLE can be computed exactly, using calculus and linear algebra, but at other times iterative numerical algorithms are needed. The MLE has several desirable properties:

• It is consistent and asymptotically efficient (as $N \to \infty$, it does as well as the MVUE).
• When an efficient estimator exists, it is the MLE.
• The MLE is invariant to reparameterization.

## Footnotes

1. The regularity conditions are essentially the same as those assumed for the Cramer-Rao lower bound: the log-likelihood must be twice differentiable, and the expected value of the first derivative of the log-likelihood must be zero.

## References

1. Casella and Berger. (1990). Statistical Inference. Belmont, CA: Duxbury Press.
2. Steven Kay. (1993). Fundamentals of Statistical Signal Processing Volume I: Estimation Theory. Prentice Hall.
