The maximum likelihood estimator (MLE) is an
alternative to the minimum variance unbiased estimator (MVUE).
For many estimation problems, the MVUE does not exist. Moreover,
when it does exist, there is no systematic procedure for
finding it. In contrast, the MLE does not necessarily satisfy any
optimality criterion, but it can almost always be computed,
either through exact formulas or numerical techniques. For this reason,
the MLE is one of the most common estimation procedures used in practice.
The MLE is an important
type of estimator for the following reasons:
- The MLE implements the likelihood principle.
- MLEs are often simple and easy to compute.
- MLEs have asymptotic optimality properties
(consistency and efficiency).
- MLEs are invariant under reparameterization.
- If an efficient estimator exists, it is the MLE.
- In signal detection with unknown parameters
(composite hypothesis testing), MLEs are used in implementing the
generalized likelihood ratio test (GLRT).
This module will discuss these properties in detail, with examples.
Suppose the data $X$ is
distributed according to the density or mass function
$p(x|\theta)$. The likelihood function for $\theta$ is defined by
$$l(\theta|x) \equiv p(x|\theta)$$
At first glance, the likelihood function is nothing new - it is
simply a way of rewriting the pdf/pmf of $X$. The difference between the
likelihood and the pdf or pmf is what is held fixed and what
is allowed to vary. When we talk about the likelihood, we view
the observation $x$
as being fixed, and the parameter $\theta$ as freely varying.
It is tempting to view the likelihood function
as a probability density for $\theta$, and to think of
$l(\theta|x)$ as the conditional density of
$\theta$ given $x$. This approach to parameter
estimation is called fiducial inference,
and is not accepted by most statisticians.
One potential problem, for
example, is that in many cases $l(\theta|x)$
is not integrable ($\int l(\theta|x)\,d\theta = \infty$)
and thus cannot be normalized. A more
fundamental problem is that
$\theta$ is viewed as a fixed
quantity, as opposed to random. Thus, it doesn't make sense
to talk about its density. For the likelihood to be properly
thought of as a density, a Bayesian
approach is required.
The likelihood principle effectively states that all information we have
about the unknown parameter $\theta$ is contained in the likelihood function.
The information brought by an observation $x$ about $\theta$ is entirely
contained in the likelihood function $p(x|\theta)$. Moreover, if $x_1$
and $x_2$ are two observations depending
on the same parameter $\theta$, such that there
exists a constant $c$ satisfying
$$p(x_1|\theta) = c\,p(x_2|\theta)$$
for every $\theta$, then they bring
the same information about $\theta$ and must lead to
identical estimators.
In the statement of the likelihood principle, it is
not assumed that the two observations $x_1$ and $x_2$
are generated according to the same
model, as long as the model is parameterized by $\theta$.
Suppose a public health official conducts a survey to
estimate $0 \le \theta \le 1$, the fraction of the population eating pizza
at least once per week. As a result, the official found
nine people who had eaten pizza in the last week, and three
who had not.
If no additional information is available regarding how
the survey was implemented, then there are at least two
probability models we can adopt.
- The official surveyed 12 people, and 9 of them had
eaten pizza in the last week. In this case, we observe
$x_1 = 9$, where $x_1 \sim \text{Binomial}(12, \theta)$.
The density for $x_1$ is
$$f(x_1|\theta) = \binom{12}{x_1}\,\theta^{x_1}(1-\theta)^{12-x_1}$$
- Another reasonable model is to assume that the
official surveyed people until he
found 3 non-pizza eaters. In this case, we observe
$x_2 = 12$, where $x_2 \sim \text{NegativeBinomial}(3, 1-\theta)$.
The density for $x_2$ is
$$g(x_2|\theta) = \binom{x_2-1}{3-1}\,\theta^{x_2-3}(1-\theta)^{3}$$
The likelihoods for these two models are proportional:
$$\ell(\theta|x_1) \propto \ell(\theta|x_2) \propto \theta^{9}(1-\theta)^{3}$$
Therefore, any estimator that adheres to the likelihood
principle will produce the same estimate for
$\theta$, regardless of which
of the two data-generation models is assumed.
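The proportionality claimed in the pizza example can be checked numerically. The sketch below is only an illustration: the function names and the grid over $\theta$ are assumptions made here, and the grid maximizer stands in for "an estimator that adheres to the likelihood principle".

```python
from math import comb

def binomial_pmf(x1, theta, n=12):
    """f(x1 | theta): x1 pizza eaters among n people surveyed."""
    return comb(n, x1) * theta ** x1 * (1 - theta) ** (n - x1)

def negative_binomial_pmf(x2, theta, r=3):
    """g(x2 | theta): x2 people surveyed until the r-th non-pizza eater."""
    return comb(x2 - 1, r - 1) * theta ** (x2 - r) * (1 - theta) ** r

thetas = [k / 100 for k in range(1, 100)]          # grid over (0, 1)
lik1 = [binomial_pmf(9, t) for t in thetas]
lik2 = [negative_binomial_pmf(12, t) for t in thetas]

ratios = [a / b for a, b in zip(lik1, lik2)]       # should not depend on theta
theta1 = thetas[lik1.index(max(lik1))]             # grid maximizer, model 1
theta2 = thetas[lik2.index(max(lik2))]             # grid maximizer, model 2
print(ratios[0], theta1, theta2)
```

The ratio of the two likelihoods is the ratio of the binomial coefficients, 220 / 55 = 4 (up to rounding), at every grid point, so both curves peak at the same value of $\theta$.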
The likelihood principle is widely accepted among
statisticians. In the context of parameter estimation, any
reasonable estimator should conform to the likelihood
principle. As we will see, the maximum likelihood estimator
does.
While the likelihood principle itself is a fairly
reasonable assumption, it can also be derived from two
somewhat more intuitive assumptions known as the
sufficiency principle and the
conditionality principle. See
Casella and Berger, Chapter 6.
The maximum likelihood estimator $\hat{\theta}(x)$
is defined by
$$\hat{\theta} = \arg\max_{\theta}\, l(\theta|x)$$
Intuitively, we are choosing $\theta$ to maximize the
probability of occurrence of the observation $x$.
It is possible that multiple parameter values maximize the
likelihood for a given $x$. In that case, any of
these maximizers can be selected as the MLE. It is also
possible that the likelihood may be
unbounded, in which case the MLE does not exist.
The MLE rule is an implementation of the likelihood
principle. If we have two observations whose likelihoods are
proportional (they differ by a constant that does not depend
on $\theta$),
then the value of $\theta$ that maximizes one likelihood will also maximize the
other. In other words, both likelihood functions lead to the
same inference about $\theta$, as
required by the likelihood principle.
Understand that maximum likelihood is a
procedure, not an optimality criterion.
From the definition of the MLE, we have no idea how close it
comes to the true parameter value relative to other
estimators. In contrast, the MVUE is defined as the estimator
that satisfies a certain optimality criterion. However, unlike
the MLE, we have no clear procedure to follow to compute the
MVUE.
If the likelihood function is differentiable, then
$\hat{\theta}$
is found by differentiating the likelihood (or
log-likelihood), equating with zero, and solving:
$$\frac{\partial}{\partial\theta}\log l(\theta|x) = 0$$
If multiple solutions exist, then the MLE is the
solution that maximizes $\log l(\theta|x)$, that is, the global
maximizer.
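As a short worked illustration of this recipe, consider the pizza survey from the likelihood principle example, where the likelihood is proportional to $\theta^{9}(1-\theta)^{3}$. The steps below are a sketch added for concreteness; the constant factor is dropped because it does not affect the maximizer.

$$\log l(\theta|x) = 9\log\theta + 3\log(1-\theta) + \text{const}$$
$$\frac{\partial}{\partial\theta}\log l(\theta|x) = \frac{9}{\theta} - \frac{3}{1-\theta} = 0 \quad\Longrightarrow\quad 9(1-\theta) = 3\theta \quad\Longrightarrow\quad \hat{\theta} = \frac{9}{12} = \frac{3}{4}$$
$$\frac{\partial^2}{\partial\theta^2}\log l(\theta|x) = -\frac{9}{\theta^2} - \frac{3}{(1-\theta)^2} < 0$$

The second derivative is negative on $(0, 1)$, so the single critical point is the global maximizer.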
In certain cases, such as pdfs or pmfs with an exponential form,
the MLE can be
easily solved for. That is,
$\frac{\partial}{\partial\theta}\log l(\theta|x) = 0$
can be solved using calculus and standard linear
algebra.
Suppose we observe an unknown amplitude in white Gaussian noise
with unknown variance:
$$x_n = A + w_n, \qquad n \in \{1, \dots, N\}$$
where $w_n \sim \mathcal{N}(0, \sigma^2)$
are independent and identically distributed.
We would like to estimate
$\theta = (A, \sigma^2)^T$
by computing the MLE. Differentiating the log-likelihood gives
$$\frac{\partial}{\partial A}\log p(x|\theta) = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - A)$$
$$\frac{\partial}{\partial \sigma^2}\log p(x|\theta) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{n=1}^{N}(x_n - A)^2$$
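Equating with zero and solving gives us our MLEs:
$$\hat{A} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
and
$$\hat{\sigma^2} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \hat{A})^2$$
Note that $\hat{\sigma^2}$ is biased: its expected value is
$\frac{N-1}{N}\sigma^2$ rather than $\sigma^2$.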
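The closed-form estimates above translate directly into code. This is a minimal sketch, assuming NumPy is available; the function name `gaussian_mle` and the simulated data (amplitude 2, noise standard deviation 0.5) are hypothetical choices for illustration.

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form MLEs of the amplitude A and the noise variance sigma^2."""
    x = np.asarray(x, dtype=float)
    a_hat = x.mean()                          # (1/N) * sum of x_n
    sigma2_hat = np.mean((x - a_hat) ** 2)    # divides by N, not N - 1
    return a_hat, sigma2_hat

# Hypothetical data: amplitude A = 2 in white Gaussian noise of variance 0.25.
rng = np.random.default_rng(0)
x = 2.0 + 0.5 * rng.standard_normal(20)

a_hat, sigma2_hat = gaussian_mle(x)
unbiased = x.var(ddof=1)                      # the usual N - 1 estimator
print(a_hat, sigma2_hat, unbiased)
```

The variance MLE is always the unbiased sample variance scaled by $(N-1)/N$, which is where the bias comes from.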
As an exercise, try the following problem:
Suppose we observe a random sample
$x = (x_1, \dots, x_N)^T$
of Poisson measurements with intensity $\lambda$:
$$\Pr[x_i = n] = e^{-\lambda}\frac{\lambda^n}{n!}, \qquad n \in \{0, 1, 2, \dots\}$$
Find the MLE for $\lambda$.
Unfortunately, this approach is only feasible for the most elementary
pdfs and pmfs. In general, we may have to resort to more advanced
numerical maximization techniques:
- Newton-Raphson iteration
- Iteration by the Scoring Method
- Expectation-Maximization Algorithm
All of these are iterative techniques which posit some initial
guess at the MLE, and then incrementally update that
guess. The iteration proceeds until a local maximum of the
likelihood is attained, although in the case of the first two
methods, such convergence is not guaranteed. The EM algorithm
has the advantage that the likelihood is always increased at
each iteration, and so convergence to at least a local maximum
is guaranteed (assuming a bounded likelihood). For each
algorithm, the final estimate is highly dependent on the
initial guess, and so it is customary to try several different
starting values. For details on these algorithms, see
Kay, Vol. I.
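To make the idea of an iterative update concrete, here is a minimal Newton-Raphson sketch. It is an illustration under stated assumptions, not a general-purpose routine: the model (Poisson counts parameterized by the log-intensity $\eta = \log\lambda$), the function name, and the data are all chosen here because the log-likelihood is concave in $\eta$ and the answer can be checked against the sample mean.

```python
import math

def newton_poisson_log_intensity(x, eta0=0.0, tol=1e-10, max_iter=100):
    """Newton-Raphson on the Poisson log-likelihood in eta = log(lambda).

    Up to a constant, log l(eta | x) = sum(x) * eta - N * exp(eta).
    """
    n, total = len(x), sum(x)
    eta = eta0
    for _ in range(max_iter):
        score = total - n * math.exp(eta)      # first derivative in eta
        hessian = -n * math.exp(eta)           # second derivative, always < 0
        step = score / hessian
        eta -= step                            # Newton update
        if abs(step) < tol:
            break
    return eta

counts = [2, 4, 3, 5, 1, 3]                    # hypothetical Poisson counts
eta_hat = newton_poisson_log_intensity(counts)
lambda_hat = math.exp(eta_hat)
print(lambda_hat, sum(counts) / len(counts))
```

For likelihoods that are not concave, the same update can diverge or settle on a local maximum, which is why several starting values are usually tried.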
Let
$x = (x_1, \dots, x_N)^T$
denote an IID sample of size $N$, where each sample is
distributed according to $p(x|\theta)$. Let
$\hat{\theta}_N$
denote the MLE based on the sample $x$.
If the likelihood
$\ell(\theta|x) = p(x|\theta)$
satisfies certain "regularity" conditions, then the MLE
$\hat{\theta}_N$ is
consistent, and moreover, $\hat{\theta}_N$ is asymptotically Gaussian:
for large $N$, its distribution is well approximated by that of
$\hat{\theta}$, where
$$\hat{\theta} \sim \mathcal{N}\left(\theta,\, I(\theta)^{-1}\right)$$
and $I(\theta)$ is the Fisher Information
matrix evaluated at the true value of $\theta$.
Since the mean of the MLE tends to the true parameter value, we say
the MLE is asymptotically unbiased. Since the
covariance tends to the inverse Fisher information matrix, we say
the MLE is asymptotically efficient.
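A small Monte Carlo experiment can make these statements tangible. The sketch below is an assumption-laden illustration: it uses the Poisson model, whose MLE is the sample mean and whose Fisher information for $N$ samples is $N/\lambda$, and the values of `lam`, `n_samples`, and `n_trials` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n_samples, n_trials = 4.0, 50, 20000

# Each row is one IID Poisson sample of size N; the MLE of lambda is the row mean.
samples = rng.poisson(lam, size=(n_trials, n_samples))
lam_hat = samples.mean(axis=1)

inv_fisher = lam / n_samples      # inverse Fisher information for N samples
print("mean of MLE     :", lam_hat.mean())
print("variance of MLE :", lam_hat.var())
print("inverse Fisher  :", inv_fisher)
```

In this particular model the sample mean attains the bound for every $N$, so the agreement is not only asymptotic; for most models the match improves as $N$ grows.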
In general, the rate at which the mean-squared error converges
to zero is not known. It is possible that for small sample
sizes, some other estimator may have a smaller MSE. The proof
of consistency is an application of the weak law of large
numbers. Derivation of the asymptotic distribution relies on
the central limit theorem. The theorem is also true in more
general settings (e.g., dependent samples). See Kay, Vol. I, Ch. 7 for further discussion.
In some cases, the MLE is efficient, not just asymptotically
efficient. In fact, when an efficient estimator exists, it
must be the MLE, as described by the following result:
If $\hat{\theta}$
is an efficient estimator, and the Fisher
information matrix $I(\theta)$
is positive definite for all $\theta$, then
$\hat{\theta}$
maximizes the likelihood.
Recall that $\hat{\theta}$
is efficient (meaning it is unbiased and
achieves the Cramer-Rao lower bound) if and only if
$$\frac{\partial}{\partial\theta}\ln p(x|\theta) = I(\theta)\left(\hat{\theta} - \theta\right)$$
for all $\theta$ and $x$. Since
$\hat{\theta}$
is assumed to be efficient, this equation holds,
and in particular it holds when
$\theta = \hat{\theta}(x)$. But then the derivative of the log-likelihood
is zero at $\theta = \hat{\theta}(x)$. Thus,
$\hat{\theta}$
is a critical point of the likelihood. Differentiating the
equation above once more and evaluating at
$\theta = \hat{\theta}(x)$ shows that the matrix of second order
derivatives of the log-likelihood at that point is the negative of
the Fisher information matrix. Since the Fisher information matrix
is positive definite,
$\hat{\theta}$
must be a maximum of the likelihood.
An important case where this happens is
described in the following subsection.
If the observed data $x$ are described by
$$x = H\theta + w$$
where $H$ is $N \times p$
with full rank, $\theta$ is $p \times 1$, and
$w \sim \mathcal{N}(0, C)$, then the MLE of $\theta$ is
$$\hat{\theta} = \left(H^T C^{-1} H\right)^{-1} H^T C^{-1} x$$
This can be established in two ways. The first is to
compute the CRLB for $\theta$. It turns out that
the condition for equality in the bound is satisfied, and
$\hat{\theta}$
can be read off from that condition.
The second way is to maximize the likelihood
directly. Equivalently, we must minimize
$$(x - H\theta)^T C^{-1} (x - H\theta)$$
with respect to $\theta$. Since $C^{-1}$
is positive definite, we can write
$C^{-1} = U^T \Lambda U = D^T D$, where
$D = \Lambda^{1/2} U$, $U$ is an orthogonal matrix
whose rows are eigenvectors of $C^{-1}$, and
$\Lambda$ is a diagonal matrix with positive diagonal
entries. Thus, we must minimize
$$(Dx - DH\theta)^T (Dx - DH\theta)$$
But this is a linear least squares problem, so the solution
is given by the pseudoinverse of $DH$:
$$\hat{\theta} = \left((DH)^T DH\right)^{-1} (DH)^T D x = \left(H^T C^{-1} H\right)^{-1} H^T C^{-1} x \qquad (1)$$
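Both routes to the linear-model MLE can be compared numerically. The following is a minimal sketch assuming NumPy; the matrix `H`, the covariance `C`, and the true parameter used to simulate `x` are hypothetical values chosen only so the script runs.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 8, 2
H = np.column_stack([np.ones(N), np.arange(N)])   # hypothetical full-rank N x p model
A = rng.standard_normal((N, N))
C = A @ A.T + N * np.eye(N)                        # a positive definite noise covariance
x = H @ np.array([1.0, -0.5]) + rng.multivariate_normal(np.zeros(N), C)

# Direct formula (H^T C^-1 H)^-1 H^T C^-1 x, using solves instead of explicit inverses.
Cinv_H = np.linalg.solve(C, H)
Cinv_x = np.linalg.solve(C, x)
theta_direct = np.linalg.solve(H.T @ Cinv_H, H.T @ Cinv_x)

# Whitening route: C^-1 = V diag(evals) V^T, so U = V^T and D = Lambda^(1/2) U.
evals, V = np.linalg.eigh(np.linalg.inv(C))
D = np.diag(np.sqrt(evals)) @ V.T                  # satisfies D^T D = C^-1
theta_whitened, *_ = np.linalg.lstsq(D @ H, D @ x, rcond=None)

print(theta_direct, theta_whitened)
```

The two estimates agree to numerical precision, reflecting that the weighted problem is an ordinary least squares problem in the whitened data $Dx$.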
Consider
$X_1, \dots, X_N \sim \mathcal{N}(s, \sigma^2 I)$,
where $s$ is a $p \times 1$
unknown signal, and $\sigma^2$
is known. Express the data in the linear model
and find the MLE $\hat{s}$
for the signal.
Suppose we wish to estimate the function
$w = W(\theta)$
and not $\theta$ itself. To use the
maximum likelihood approach for estimating $w$, we need an expression for
the likelihood
$\ell(w|x) = p(x|w)$.
In other words, we would need to be able to parameterize the
distribution of the data by $w$. If
$W$ is not a one-to-one function,
however, this may not be possible. Therefore, we define the
induced likelihood
$$\ell(w|x) = \max_{\{\theta \,:\, W(\theta) = w\}} \ell(\theta|x)$$
The MLE $\hat{w}$
is defined to be the value of $w$ that maximizes the induced
likelihood. With this definition, the following invariance
principle is immediate.
Let $\hat{\theta}$
denote the MLE of $\theta$. Then
$\hat{w} = W(\hat{\theta})$
is the MLE of $w = W(\theta)$.
The proof follows directly from the definitions
of $\hat{\theta}$ and $\hat{w}$. As an exercise, work
through the logical steps of the proof on your own.
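Let
$x = (x_1, \dots, x_N)^T$
where $x_i \sim \text{Poisson}(\lambda)$.
Given $x$, find the MLE of the probability that
$x \sim \text{Poisson}(\lambda)$
exceeds the mean $\lambda$.
$$W(\lambda) = \Pr[x > \lambda] = \sum_{n=\lfloor \lambda + 1 \rfloor}^{\infty} e^{-\lambda}\frac{\lambda^n}{n!}$$
where $\lfloor z \rfloor$ is the largest integer $\le z$. The MLE of $w$
is
$$\hat{w} = \sum_{n=\lfloor \hat{\lambda} + 1 \rfloor}^{\infty} e^{-\hat{\lambda}}\frac{\hat{\lambda}^n}{n!}$$
where $\hat{\lambda}$
is the MLE of $\lambda$:
$$\hat{\lambda} = \frac{1}{N}\sum_{n=1}^{N} x_n$$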
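The invariance principle makes this estimate a two-step computation: estimate $\lambda$, then plug it into $W$. The sketch below assumes only the Python standard library; the function name and the data are hypothetical, and the infinite tail sum is evaluated through its finite complement.

```python
import math

def poisson_exceeds_mean(lam):
    """W(lam) = Pr[x > lam] for x ~ Poisson(lam), via the complementary sum."""
    # Pr[x > lam] = 1 - sum_{n=0}^{floor(lam)} exp(-lam) * lam**n / n!
    cdf = sum(math.exp(-lam) * lam ** n / math.factorial(n)
              for n in range(math.floor(lam) + 1))
    return 1.0 - cdf

counts = [2, 4, 3, 5, 1, 3]                 # hypothetical Poisson counts
lambda_hat = sum(counts) / len(counts)      # MLE of lambda (sample mean)
w_hat = poisson_exceeds_mean(lambda_hat)    # MLE of w by invariance
print(lambda_hat, w_hat)
```

With these counts the sample mean is 3, so the estimate is $1 - 13e^{-3}$, roughly 0.353.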
Be aware that the MLE of a transformed
parameter does not necessarily satisfy the asymptotic
properties discussed earlier.
Consider observations
$x_1, \dots, x_N$, where $x_i$
is a $p$-dimensional
vector of the form
$$x_i = s + w_i$$
where $s$ is an
unknown signal and $w_i$
are independent realizations of white Gaussian noise:
$$w_i \sim \mathcal{N}\left(0, \sigma^2 I_{p \times p}\right)$$
Find the maximum likelihood estimate of the energy
$E = s^T s$
of the unknown signal.
The likelihood principle states that information brought
by an observation $x$ about $\theta$ is entirely
contained in the likelihood function $p(x|\theta)$.
The maximum likelihood estimator is
one effective implementation of the
likelihood principle. In some cases, the MLE can be computed
exactly, using calculus and linear algebra, but at other times
iterative numerical algorithms are needed. The MLE has several
desirable properties:
- It is consistent and asymptotically efficient (as
$N \to \infty$
we are doing as well as the MVUE).
- When an efficient estimator exists, it is the MLE.
- The MLE is invariant to reparameterization.
- Casella and Berger. (1990). Statistical Inference. Belmont, CA: Duxbury Press.
- Steven Kay. (1993). Fundamentals of Statistical Signal Processing Volume I: Estimation Theory. Prentice Hall.