We are interested in estimating $\theta$ given the observation $x$. Naturally then, any estimation strategy will be based on the posterior distribution $p(\theta \mid x)$. Furthermore, we need a criterion for assessing the quality of potential estimators.
The quality of an estimate $\hat{\theta}$ is measured by a real-valued loss function $L(\theta, \hat{\theta})$. For example, squared error or quadratic loss is simply

$$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^T (\theta - \hat{\theta})$$
Posterior Expected Loss:

$$E[L(\theta, \hat{\theta}) \mid x] = \int L(\theta, \hat{\theta})\, p(\theta \mid x)\, d\theta$$
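Bayes Risk:

$$E[L(\theta, \hat{\theta})] = \int\!\!\int L(\theta, \hat{\theta})\, p(\theta \mid x)\, p(x)\, d\theta\, dx = \int\!\!\int L(\theta, \hat{\theta})\, p(x \mid \theta)\, p(\theta)\, dx\, d\theta = E\big[E[L(\theta, \hat{\theta}) \mid x]\big] \tag{1}$$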
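The two ways of writing the Bayes risk can be checked numerically on a small discrete model. The sketch below is only an illustration: the prior, the likelihood table, and the decision rule `estimator` are all made-up assumptions, not part of the original example, and the integrals become finite sums.

```python
# Hypothetical discrete model: theta in {0, 1}, x in {0, 1, 2}.
prior = {0: 0.6, 1: 0.4}                       # p(theta), assumed values
likelihood = {                                 # p(x | theta), assumed values
    0: {0: 0.7, 1: 0.2, 2: 0.1},
    1: {0: 0.1, 1: 0.3, 2: 0.6},
}
xs = [0, 1, 2]


def squared_loss(theta, theta_hat):
    return (theta - theta_hat) ** 2


def estimator(x):
    # An arbitrary (not necessarily optimal) rule, for illustration only.
    return 0.0 if x == 0 else 1.0


def marginal(x):
    # p(x) = sum over theta of p(x | theta) p(theta)
    return sum(likelihood[t][x] * prior[t] for t in prior)


def posterior(theta, x):
    # p(theta | x) by Bayes' rule
    return likelihood[theta][x] * prior[theta] / marginal(x)


def posterior_expected_loss(x):
    # E[L(theta, theta_hat) | x]
    return sum(squared_loss(t, estimator(x)) * posterior(t, x) for t in prior)


# Route 1: average the posterior expected loss over p(x).
risk_via_posterior = sum(posterior_expected_loss(x) * marginal(x) for x in xs)

# Route 2: average the loss over the joint p(x | theta) p(theta).
risk_via_likelihood = sum(
    squared_loss(t, estimator(x)) * likelihood[t][x] * prior[t]
    for t in prior
    for x in xs
)

print(risk_via_posterior, risk_via_likelihood)
```

Both routes agree up to floating-point rounding, which is the content of equation (1).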
The "best" or optimal estimator given the data $x$ and under a specified loss is the one that minimizes the posterior expected loss:

$$\hat{\theta} = \arg\min_{\hat{\theta}} E[L(\theta, \hat{\theta}) \mid x]$$
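For scalar $\theta$ under squared error loss, the Bayes risk is the Bayes mean squared error (BMSE):

$$\mathrm{BMSE}(\hat{\theta}) \equiv \int \left[ \int (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta \right] p(x)\, dx$$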
Since $p(x) \ge 0$ for every $x$, minimizing the inner integral

$$\int (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta = E[L(\theta, \hat{\theta}) \mid x]$$

(where $E[L(\theta, \hat{\theta}) \mid x]$ is the posterior expected loss) for each $x$ minimizes the overall BMSE.
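$$\frac{\partial}{\partial \hat{\theta}} \int (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta = \int \frac{\partial}{\partial \hat{\theta}} (\theta - \hat{\theta})^2\, p(\theta \mid x)\, d\theta = -2 \int (\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta \tag{2}$$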
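Equating this to zero and using $\int p(\theta \mid x)\, d\theta = 1$ produces

$$\hat{\theta} = \int \theta\, p(\theta \mid x)\, d\theta \equiv E[\theta \mid x],$$

the conditional mean (also called the posterior mean) of $\theta$ given $x$.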
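A quick numerical check of this result: for a discrete posterior, a brute-force search over candidate estimates should land on the posterior mean. The support points and probabilities below are hypothetical values chosen only for illustration.

```python
# Hypothetical discrete posterior p(theta | x) on a few support points.
support = [0.0, 1.0, 2.0, 5.0]
probs = [0.1, 0.4, 0.3, 0.2]


def expected_squared_loss(theta_hat):
    # Posterior expected squared error for a candidate estimate.
    return sum(p * (t - theta_hat) ** 2 for t, p in zip(support, probs))


# Posterior mean E[theta | x].
posterior_mean = sum(p * t for t, p in zip(support, probs))

# Brute-force search over a fine grid of candidate estimates in [0, 5].
candidates = [i / 1000 for i in range(0, 5001)]
best = min(candidates, key=expected_squared_loss)

print(posterior_mean, best)
```

The grid minimizer coincides with the posterior mean (up to the grid spacing), even though the posterior here is skewed.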
For all $n \in \{1, \dots, N\}$:

$$x_n = A + W_n, \qquad W_n \sim \mathcal{N}(0, \sigma^2)$$
Prior for the unknown parameter $A$:

$$p(a) = \mathcal{U}(-A_0, A_0)$$
$$p(x \mid A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2 \right)$$
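$$p(A \mid x) = \begin{cases} \dfrac{\dfrac{1}{2A_0 (2\pi\sigma^2)^{N/2}} \exp\!\left( -\dfrac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2 \right)}{\displaystyle\int_{-A_0}^{A_0} \dfrac{1}{2A_0 (2\pi\sigma^2)^{N/2}} \exp\!\left( -\dfrac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2 \right) dA} & \text{if } |A| \le A_0 \\[2ex] 0 & \text{if } |A| > A_0 \end{cases}$$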
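Minimum Bayes MSE Estimator:

$$\hat{A} = E[A \mid x] = \int_{-\infty}^{\infty} A\, p(A \mid x)\, dA = \frac{\displaystyle\int_{-A_0}^{A_0} A\, \frac{1}{2A_0 (2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2 \right) dA}{\displaystyle\int_{-A_0}^{A_0} \frac{1}{2A_0 (2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2 \right) dA} \tag{3}$$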
- No closed-form estimator
- As $A_0 \to \infty$, $\hat{A} \to \frac{1}{N} \sum_{n=1}^{N} x_n$
- For smaller $A_0$, the truncated integral produces an $\hat{A}$ that is a function of $x$, $\sigma^2$, and $A_0$
- As $N$ increases, $\sigma^2 / N$ decreases and the posterior $p(A \mid x)$ becomes tightly clustered about $\frac{1}{N} \sum_n x_n$. This implies $\hat{A} \to \frac{1}{N} \sum_n x_n$ as $N \to \infty$ (the data "swamps out" the prior)
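Because equation (3) has no closed form, it can be evaluated numerically. The sketch below approximates the ratio of integrals with a Riemann sum on a uniform grid over $[-A_0, A_0]$. The function name `mmse_uniform_prior`, the grid size, and the simulated data are assumptions made for illustration. It uses the identity $\sum_n (x_n - A)^2 = \sum_n (x_n - \bar{x})^2 + N(\bar{x} - A)^2$, so every factor that does not depend on $A$ cancels in the ratio.

```python
import math
import random


def mmse_uniform_prior(x, sigma2, a0, grid_points=20001):
    """Approximate E[A | x] for x_n = A + W_n, W_n ~ N(0, sigma2), A ~ U(-a0, a0)."""
    n = len(x)
    xbar = sum(x) / n
    step = 2 * a0 / (grid_points - 1)
    grid = [-a0 + i * step for i in range(grid_points)]
    # Log of the A-dependent part of the likelihood: -N (A - xbar)^2 / (2 sigma2).
    log_w = [-n * (a - xbar) ** 2 / (2 * sigma2) for a in grid]
    # Subtract the maximum before exponentiating to avoid underflow to all zeros.
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]
    # The grid spacing cancels between numerator and denominator.
    return sum(a * w for a, w in zip(grid, weights)) / sum(weights)


# Simulated data under assumed values A = 1, sigma2 = 4, N = 25.
random.seed(0)
sigma2 = 4.0
x = [1.0 + random.gauss(0.0, math.sqrt(sigma2)) for _ in range(25)]
xbar = sum(x) / len(x)

wide = mmse_uniform_prior(x, sigma2, a0=50.0)    # nearly flat prior
narrow = mmse_uniform_prior(x, sigma2, a0=0.5)   # tight prior

print("sample mean:", xbar)
print("A0 = 50  ->", wide)
print("A0 = 0.5 ->", narrow)
```

With a wide prior the estimate is essentially the sample mean, while a tight prior confines the estimate to $[-A_0, A_0]$ regardless of the data, in line with the remarks above.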
Absolute error loss (Laplace, 1773):

$$L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$$
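$$E[L(\theta, \hat{\theta}) \mid x] = \int_{-\infty}^{\infty} |\theta - \hat{\theta}|\, p(\theta \mid x)\, d\theta = \int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta \mid x)\, d\theta + \int_{\hat{\theta}}^{\infty} (\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta \tag{4}$$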
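Using integration by parts, it can be shown that

$$\int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta \mid x)\, d\theta = \int_{-\infty}^{\hat{\theta}} P(\theta < y \mid x)\, dy$$

$$\int_{\hat{\theta}}^{\infty} (\theta - \hat{\theta})\, p(\theta \mid x)\, d\theta = \int_{\hat{\theta}}^{\infty} P(\theta > y \mid x)\, dy$$

where $P(\theta < y \mid x)$ and $P(\theta > y \mid x)$ are the posterior cumulative and complementary cumulative distributions, respectively.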
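The integration-by-parts step for the first identity can be written out as follows. This is a sketch that assumes the posterior has a finite first moment, so that the boundary term at $-\infty$ vanishes; the shorthand $F(y)$ is introduced here for brevity and does not appear in the original.

```latex
\begin{aligned}
F(y) &\equiv P(\theta < y \mid x) = \int_{-\infty}^{y} p(\theta \mid x)\, d\theta \\
\int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta \mid x)\, d\theta
  &= \Big[ (\hat{\theta} - \theta)\, F(\theta) \Big]_{-\infty}^{\hat{\theta}}
     + \int_{-\infty}^{\hat{\theta}} F(\theta)\, d\theta \\
  &= 0 + \int_{-\infty}^{\hat{\theta}} P(\theta < y \mid x)\, dy
\end{aligned}
```

The second identity follows in the same way with $u = \theta - \hat{\theta}$ and $v = -P(\theta > y \mid x)$, where the boundary term now vanishes at $+\infty$ under the same assumption.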
So,

$$E[L(\theta, \hat{\theta}) \mid x] = \int_{-\infty}^{\hat{\theta}} P(\theta < y \mid x)\, dy + \int_{\hat{\theta}}^{\infty} P(\theta > y \mid x)\, dy$$
Taking the derivative with respect to $\hat{\theta}$ and setting it to zero gives

$$P(\theta < \hat{\theta} \mid x) = P(\theta > \hat{\theta} \mid x),$$

which implies that the optimal $\hat{\theta}$ under absolute error loss is the posterior median.
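The same conclusion can be checked by brute force. The sketch below treats a handful of hypothetical, equally weighted values as a stand-in for the posterior and searches a grid of candidate estimates for the one with the smallest expected absolute error.

```python
import statistics

# Hypothetical equally weighted posterior samples of theta (skewed on purpose).
samples = [0.2, 0.5, 0.9, 1.4, 2.0, 3.5, 9.0]


def expected_abs_loss(theta_hat):
    # Posterior expected absolute error for a candidate estimate.
    return sum(abs(t - theta_hat) for t in samples) / len(samples)


# Brute-force search over candidates 0.00, 0.01, ..., 10.00.
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=expected_abs_loss)

median = statistics.median(samples)
mean = statistics.mean(samples)

print("grid minimizer:", best)
print("median:", median, "loss:", expected_abs_loss(median))
print("mean:", mean, "loss:", expected_abs_loss(mean))
```

For this skewed set the grid minimizer matches the median rather than the mean, and the mean incurs a strictly larger expected absolute error.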
$$L(\theta, \hat{\theta}) = \begin{cases} 0 & \text{if } \hat{\theta} = \theta \\ 1 & \text{if } \hat{\theta} \neq \theta \end{cases} = I_{\{\hat{\theta} \neq \theta\}}$$
$$E[L(\theta, \hat{\theta}) \mid x] = E\big[I_{\{\hat{\theta} \neq \theta\}} \mid x\big] = \Pr(\hat{\theta} \neq \theta \mid x),$$

which is the probability that $\hat{\theta} \neq \theta$ given $x$.
To minimize '0-1' loss we must choose $\hat{\theta}$ to be the value of $\theta$ with the highest posterior probability, since that choice makes the event $\hat{\theta} \neq \theta$ as improbable as possible.
The optimal estimator $\hat{\theta}$ under '0-1' loss is the maximum a posteriori (MAP) estimator: the value of $\theta$ where $p(\theta \mid x)$ is maximized.
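For a discrete parameter the argument is easy to illustrate. In the sketch below the hypothesis labels and posterior probabilities are made up for the example. The expected '0-1' loss of a candidate is one minus its posterior probability, so minimizing the loss and maximizing the posterior select the same value.

```python
# Hypothetical discrete posterior p(theta | x) over labelled hypotheses.
posterior = {"a": 0.15, "b": 0.50, "c": 0.35}


def expected_zero_one_loss(theta_hat):
    # Pr(theta != theta_hat | x) = 1 - p(theta_hat | x)
    return 1.0 - posterior[theta_hat]


# MAP estimate: the hypothesis with the largest posterior probability.
map_estimate = max(posterior, key=posterior.get)

# Minimizer of the posterior expected '0-1' loss.
min_loss_estimate = min(posterior, key=expected_zero_one_loss)

print(map_estimate, min_loss_estimate, expected_zero_one_loss(map_estimate))
```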