Consider a distribution with p.d.f. $f(x;\theta)$ such that the parameter $\theta$ is not involved in the support of the distribution. We want to find the maximum likelihood estimator $\hat{\theta}$ by solving
$$\frac{\partial[\ln L(\theta)]}{\partial\theta}=0,$$
where the partial derivative is used because $L(\theta)$ also involves $x_1, x_2, \ldots, x_n$.
That is,
$$\frac{\partial[\ln L(\hat{\theta})]}{\partial\theta}=0,$$
where now, with $\hat{\theta}$ in this expression,
$$L(\hat{\theta})=f(X_1;\hat{\theta})\,f(X_2;\hat{\theta})\cdots f(X_n;\hat{\theta}).$$
We can approximate the left-hand member of this latter equation by a linear function found from the first two terms of a Taylor series expanded about $\theta$, namely
$$\frac{\partial[\ln L(\theta)]}{\partial\theta}+(\hat{\theta}-\theta)\,\frac{\partial^2[\ln L(\theta)]}{\partial\theta^2}\approx 0,$$
where
$$L(\theta)=f(X_1;\theta)\,f(X_2;\theta)\cdots f(X_n;\theta).$$
Obviously, this approximation is good enough only if $\hat{\theta}$ is close to $\theta$, and an adequate mathematical proof requires conditions that guarantee this. But a heuristic argument can be made by solving for $\hat{\theta}-\theta$ to obtain
$$\hat{\theta}-\theta=\frac{\partial[\ln L(\theta)]/\partial\theta}{-\,\partial^2[\ln L(\theta)]/\partial\theta^2}. \tag{1}$$
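As a rough numerical illustration of the linearization behind (1), the sketch below compares the approximate gap with the exact gap $\hat{\theta}-\theta$. It assumes an exponential model with mean $\theta$, for which the maximum likelihood estimator is the sample mean. The data values, the expansion point `theta`, and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical data, chosen only for illustration (not taken from the text).
data = [0.8, 3.1, 1.7, 2.9, 2.0]
n = len(data)
total = sum(data)

def score(theta):
    """First derivative of ln L(theta) for the exponential model with mean theta."""
    return -n / theta + total / theta**2

def second_derivative(theta):
    """Second derivative of ln L(theta) for the same model."""
    return n / theta**2 - 2.0 * total / theta**3

theta = 2.0        # assumed expansion point, deliberately close to the MLE
mle = total / n    # for this model the MLE is the sample mean

# Right-hand side of (1): score divided by minus the second derivative.
approx_gap = score(theta) / (-second_derivative(theta))
exact_gap = mle - theta

print(f"approximate gap: {approx_gap:.4f}, exact gap: {exact_gap:.4f}")
```

Moving `theta` farther from the sample mean makes the two gaps disagree more, which is the sense in which the approximation requires $\hat{\theta}$ to be close to $\theta$.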
Recall that
$$\ln L(\theta)=\ln f(X_1;\theta)+\ln f(X_2;\theta)+\cdots+\ln f(X_n;\theta)$$
and
$$\frac{\partial \ln L(\theta)}{\partial\theta}=\sum_{i=1}^{n}\frac{\partial[\ln f(X_i;\theta)]}{\partial\theta}. \tag{2}$$
Expression (2) is the sum of the $n$ independent and identically distributed random variables
$$Y_i=\frac{\partial[\ln f(X_i;\theta)]}{\partial\theta},\quad i=1,2,\ldots,n,$$
and thus, by the Central Limit Theorem, it has an approximate normal distribution. Its mean is zero because, in the continuous case, each $Y_i$ has mean
$$\begin{aligned}
\int_{-\infty}^{\infty}\frac{\partial[\ln f(x;\theta)]}{\partial\theta}\,f(x;\theta)\,dx
&=\int_{-\infty}^{\infty}\frac{\partial[f(x;\theta)]/\partial\theta}{f(x;\theta)}\,f(x;\theta)\,dx
=\int_{-\infty}^{\infty}\frac{\partial[f(x;\theta)]}{\partial\theta}\,dx\\
&=\frac{\partial}{\partial\theta}\left[\int_{-\infty}^{\infty}f(x;\theta)\,dx\right]
=\frac{\partial}{\partial\theta}[1]=0.
\end{aligned} \tag{3}$$
Clearly, this requires the mathematical condition that it is permissible to interchange the operations of integration and differentiation in those last steps. Of course, the integral of $f(x;\theta)$ is equal to one because it is a p.d.f.
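The zero-mean property can be checked numerically for a specific density. The sketch below assumes the exponential p.d.f. $f(x;\theta)=(1/\theta)e^{-x/\theta}$ and uses a plain midpoint rule on a truncated range; the value of `theta`, the truncation point, and the helper names are illustrative assumptions.

```python
import math

def exp_pdf(x, theta):
    """Exponential p.d.f. with mean theta (assumed model for this check)."""
    return math.exp(-x / theta) / theta

def exp_score(x, theta):
    """Derivative of ln f(x; theta) with respect to theta."""
    return -1.0 / theta + x / theta**2

def midpoint_integral(g, lower, upper, steps):
    """Midpoint-rule approximation of the integral of g over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(g(lower + (k + 0.5) * h) for k in range(steps))

theta = 2.0
upper = 50.0 * theta  # truncation point; the mass beyond it is negligible

mean_score = midpoint_integral(
    lambda x: exp_score(x, theta) * exp_pdf(x, theta), 0.0, upper, 100_000
)
total_mass = midpoint_integral(lambda x: exp_pdf(x, theta), 0.0, upper, 100_000)

print(f"integral of the p.d.f.: {total_mass:.6f}")
print(f"mean of the score:      {mean_score:.6f}")
```

The first integral should be close to one and the second close to zero, up to discretization error.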
Since we know that the mean of each $Y_i$ is
$$\int_{-\infty}^{\infty}\frac{\partial[\ln f(x;\theta)]}{\partial\theta}\,f(x;\theta)\,dx=0,$$
let us take derivatives of each member of this equation with respect to $\theta$, obtaining
$$\int_{-\infty}^{\infty}\left\{\frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2}\,f(x;\theta)+\frac{\partial[\ln f(x;\theta)]}{\partial\theta}\,\frac{\partial[f(x;\theta)]}{\partial\theta}\right\}dx=0.$$
However,
$$\frac{\partial[f(x;\theta)]}{\partial\theta}=\frac{\partial[\ln f(x;\theta)]}{\partial\theta}\,f(x;\theta),$$
so
$$\int_{-\infty}^{\infty}\left\{\frac{\partial[\ln f(x;\theta)]}{\partial\theta}\right\}^2 f(x;\theta)\,dx=-\int_{-\infty}^{\infty}\frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2}\,f(x;\theta)\,dx.$$
Since $E(Y)=0$, this last expression provides the variance of $Y=\partial[\ln f(X;\theta)]/\partial\theta$. Then the variance of expression (2) is $n$ times this value, namely
$$-nE\left\{\frac{\partial^2[\ln f(X;\theta)]}{\partial\theta^2}\right\}.$$
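The identity between the expected squared first derivative and minus the expected second derivative can be verified numerically. The sketch below uses the discrete analogue, with sums in place of integrals, for a Poisson p.m.f. with mean `lam`. The choice of the Poisson model, the value `lam = 3.0`, and the truncation at 200 terms are assumptions made for illustration.

```python
import math

def poisson_information_check(lam, terms=200):
    """Return (E[score^2], -E[second derivative]) for a Poisson(lam) p.m.f.

    ln f(x; lam) = x ln(lam) - lam - ln(x!), so the score is x/lam - 1
    and the second derivative is -x/lam^2.
    """
    p = math.exp(-lam)  # P(X = 0); later probabilities follow by recursion
    mean_sq_score = 0.0
    neg_mean_second = 0.0
    for x in range(terms):
        score = x / lam - 1.0
        second = -x / lam**2
        mean_sq_score += score**2 * p
        neg_mean_second += -second * p
        p *= lam / (x + 1)  # P(X = x + 1) from P(X = x)
    return mean_sq_score, neg_mean_second

lhs, rhs = poisson_information_check(3.0)
print(f"E[score^2] = {lhs:.6f}, -E[second derivative] = {rhs:.6f}")
```

Both quantities should agree with $1/\lambda$ for this model, so the variance of the sum of $n$ such terms is $n/\lambda$.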
Let us rewrite (1) as
$$\frac{\sqrt{n}\,(\hat{\theta}-\theta)}{\dfrac{1}{\sqrt{-E\{\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}}
=\frac{\dfrac{\partial[\ln L(\theta)]/\partial\theta}{\sqrt{-nE\{\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}}{\dfrac{-\dfrac{1}{n}\,\dfrac{\partial^2[\ln L(\theta)]}{\partial\theta^2}}{E\{-\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}. \tag{4}$$
The numerator of (4) has an approximate $N(0,1)$ distribution, and those unstated mathematical conditions require, in some sense, that
$$-\frac{1}{n}\,\frac{\partial^2[\ln L(\theta)]}{\partial\theta^2}$$
converge to $E\{-\partial^2[\ln f(X;\theta)]/\partial\theta^2\}$. Accordingly, the ratio given in equation (4) must be approximately $N(0,1)$. That is, $\hat{\theta}$ has an approximate normal distribution with mean $\theta$ and standard deviation
$$\frac{1}{\sqrt{-nE\{\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}.$$
With the underlying exponential p.d.f.
$$f(x;\theta)=\frac{1}{\theta}e^{-x/\theta},\quad 0<x<\infty,\quad \theta\in\Omega=\{\theta: 0<\theta<\infty\},$$
$\bar{X}$ is the maximum likelihood estimator. Since
$$\ln f(x;\theta)=-\ln\theta-\frac{x}{\theta},$$
$$\frac{\partial[\ln f(x;\theta)]}{\partial\theta}=-\frac{1}{\theta}+\frac{x}{\theta^2},$$
and
$$\frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2}=\frac{1}{\theta^2}-\frac{2x}{\theta^3},$$
we have
$$-E\left[\frac{1}{\theta^2}-\frac{2X}{\theta^3}\right]=-\frac{1}{\theta^2}+\frac{2\theta}{\theta^3}=\frac{1}{\theta^2}$$
because $E(X)=\theta$. That is, $\bar{X}$ has an approximate normal distribution with mean $\theta$ and standard deviation $\theta/\sqrt{n}$. Thus the random interval $\bar{X}\pm 1.96\,(\theta/\sqrt{n})$ has an approximate probability of 0.95 of covering $\theta$. Substituting the observed $\bar{x}$ for $\theta$, as well as for $\bar{X}$, we say that $\bar{x}\pm 1.96\,\bar{x}/\sqrt{n}$ is an approximate 95% confidence interval for $\theta$.
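A small simulation gives a feel for how close the actual coverage of this interval is to the nominal 95%. The sketch below draws repeated exponential samples and counts how often $\bar{x}\pm 1.96\,\bar{x}/\sqrt{n}$ contains the true $\theta$. The values of `theta`, `n`, `reps`, and the seed are arbitrary choices, and the function name `coverage` is hypothetical.

```python
import math
import random

def coverage(theta, n, reps, z=1.96, seed=12345):
    """Estimate how often xbar +/- z * xbar / sqrt(n) covers theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # expovariate takes the rate, which is 1 / mean for this model.
        xbar = sum(rng.expovariate(1.0 / theta) for _ in range(n)) / n
        half_width = z * xbar / math.sqrt(n)
        if xbar - half_width <= theta <= xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage(theta=2.0, n=100, reps=2000)
print(f"estimated coverage: {rate:.3f}")
```

Because the interval rests on a large-sample approximation and on substituting $\bar{x}$ for $\theta$, the estimated coverage is expected to be near, but not exactly, 0.95, and it tends to drift farther from that value for small `n`.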
The maximum likelihood estimator for $\lambda$ in
$$f(x;\lambda)=\frac{\lambda^{x}e^{-\lambda}}{x!},\quad x=0,1,2,\ldots;\quad \lambda\in\Omega=\{\lambda: 0<\lambda<\infty\},$$
is $\hat{\lambda}=\bar{X}$. Now
$$\ln f(x;\lambda)=x\ln\lambda-\lambda-\ln x!,$$
$$\frac{\partial[\ln f(x;\lambda)]}{\partial\lambda}=\frac{x}{\lambda}-1,$$
and
$$\frac{\partial^2[\ln f(x;\lambda)]}{\partial\lambda^2}=-\frac{x}{\lambda^2}.$$
Thus
$$-E\left(-\frac{X}{\lambda^2}\right)=\frac{\lambda}{\lambda^2}=\frac{1}{\lambda},$$
and $\hat{\lambda}=\bar{X}$ has an approximate normal distribution with mean $\lambda$ and standard deviation $\sqrt{\lambda/n}$. Finally, $\bar{x}\pm 1.645\sqrt{\bar{x}/n}$ serves as an approximate 90% confidence interval for $\lambda$. With the data from example(…), $\bar{x}=2.225$, and hence this interval is from 1.887 to 2.563.
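The interval formula $\bar{x}\pm z\sqrt{\bar{x}/n}$ is simple enough to wrap in a small helper. The sketch below is only an illustration: the function name `poisson_mean_interval` is hypothetical, and the inputs `xbar=4.0` and `n=100` are made-up values, not the data from the example referred to in the text.

```python
import math

def poisson_mean_interval(xbar, n, z=1.645):
    """Approximate confidence interval xbar +/- z * sqrt(xbar / n) for a Poisson mean.

    z = 1.645 corresponds to an approximate 90% interval.
    """
    half_width = z * math.sqrt(xbar / n)
    return xbar - half_width, xbar + half_width

# Made-up inputs for illustration only.
low, high = poisson_mean_interval(xbar=4.0, n=100)
print(f"approximate 90% interval: ({low:.3f}, {high:.3f})")
```

Passing `z=1.96` instead gives the corresponding approximate 95% interval.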
It is interesting that there is another theorem which is somewhat related to the preceding result, in that the approximate variance of $\hat{\theta}$ serves as a lower bound for the variance of every unbiased estimator of $\theta$. Thus we know that if a certain unbiased estimator has a variance equal to that lower bound, we cannot find a better one, and hence it is the best in the sense of being the unbiased minimum variance estimator. This result is called the Rao-Cramér Inequality.
Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with p.d.f.
$$f(x;\theta),\quad \theta\in\Omega=\{\theta: c<\theta<d\},$$
where the support of $X$ does not depend upon $\theta$, so that we can differentiate, with respect to $\theta$, under integral signs like that in the following integral:
$$\int_{-\infty}^{\infty}f(x;\theta)\,dx=1.$$
If $Y=u(X_1, X_2, \ldots, X_n)$ is an unbiased estimator of $\theta$, then
$$\operatorname{Var}(Y)\ge\frac{1}{n\displaystyle\int_{-\infty}^{\infty}\left\{\partial\ln f(x;\theta)/\partial\theta\right\}^2 f(x;\theta)\,dx}
=\frac{-1}{n\displaystyle\int_{-\infty}^{\infty}\left[\partial^2\ln f(x;\theta)/\partial\theta^2\right]f(x;\theta)\,dx}.$$
Note that the two integrals in the respective denominators are the expectations
$$E\left\{\left[\frac{\partial\ln f(X;\theta)}{\partial\theta}\right]^2\right\}\quad\text{and}\quad E\left[\frac{\partial^2\ln f(X;\theta)}{\partial\theta^2}\right];$$
sometimes one is easier to compute than the other.
Note that the lower bounds for two distributions, the exponential and the Poisson, were computed above. Those respective lower bounds were $\theta^2/n$ and $\lambda/n$. Since in each case the variance of $\bar{X}$ equals the lower bound, $\bar{X}$ is the unbiased minimum variance estimator.
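To spell out the step behind this claim, the short derivation below uses two standard facts that are assumed here rather than derived in the text: the sample mean of $n$ independent observations with variance $\sigma^2$ has variance $\sigma^2/n$, and the exponential distribution with mean $\theta$ has variance $\theta^2$ while the Poisson distribution with mean $\lambda$ has variance $\lambda$.

$$\begin{aligned}
\text{Exponential:}\quad & E(\bar{X})=\theta, & \operatorname{Var}(\bar{X})&=\frac{\sigma^2}{n}=\frac{\theta^2}{n},\\
\text{Poisson:}\quad & E(\bar{X})=\lambda, & \operatorname{Var}(\bar{X})&=\frac{\sigma^2}{n}=\frac{\lambda}{n}.
\end{aligned}$$

In both cases $\bar{X}$ is unbiased and its variance coincides with the corresponding lower bound, so no unbiased estimator can have smaller variance.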
The sample arises from a distribution with p.d.f.
$$f(x;\theta)=\theta x^{\theta-1},\quad 0<x<1,\quad \theta\in\Omega=\{\theta: 0<\theta<\infty\}.$$
We have
$$\ln f(x;\theta)=\ln\theta+(\theta-1)\ln x,\qquad \frac{\partial\ln f(x;\theta)}{\partial\theta}=\frac{1}{\theta}+\ln x,$$
and
$$\frac{\partial^2\ln f(x;\theta)}{\partial\theta^2}=-\frac{1}{\theta^2}.$$
Since $E(-1/\theta^2)=-1/\theta^2$, the lower bound of the variance of every unbiased estimator of $\theta$ is $\theta^2/n$. Moreover, the maximum likelihood estimator
$$\hat{\theta}=-\frac{n}{\ln\prod_{i=1}^{n}X_i}$$
has an approximate normal distribution with mean $\theta$ and variance $\theta^2/n$