In the last lecture we derived a risk (MSE) bound for regression
problems; i.e., select an f∈Ff∈F so that E[(f(X)-Y)2]-E[(f*(X)-Y)2]E[(f(X)-Y)2]-E[(f*(X)-Y)2] is small, where f*(x)=E[Y|X=x]f*(x)=E[Y|X=x]. The result is
summarized below.
The focus of this lecture is to consider another approach to learning
based on maximum likelihood estimation. Consider the classical signal
plus noise model:
Y
i
=
f
i
n
+
W
i
,
i
=
1
,
⋯
,
n
Y
i
=
f
i
n
+
W
i
,
i
=
1
,
⋯
,
n
(4)
where WiWi are iid zero-mean noises. Furthermore, assume that Wi∼P(w)Wi∼P(w) for some known density P(w)P(w). Then
Y
i
∼
P
y
-
f
i
n
≡
P
f
i
(
y
)
Y
i
∼
P
y
-
f
i
n
≡
P
f
i
(
y
)
(5)
since Yi-fin=WiYi-fin=Wi.
A very common and useful loss function to consider is
R
^
n
(
f
)
=
1
n
∑
i
=
1
n
(
-
log
P
f
i
(
Y
i
)
)
.
R
^
n
(
f
)
=
1
n
∑
i
=
1
n
(
-
log
P
f
i
(
Y
i
)
)
.
(6)
Minimizing
R^nR^n with respect to ff is equivalent to maximizing
1
n
∑
i
=
1
n
log
P
f
i
(
Y
i
)
1
n
∑
i
=
1
n
log
P
f
i
(
Y
i
)
(7)
or
∏
i
=
1
n
P
f
i
(
Y
i
)
.
∏
i
=
1
n
P
f
i
(
Y
i
)
.
(8)
Thus, using the negative log-likelihood as a loss
function leads to maximum likelihood estimation. If the WiWi are iid
zero-mean Gaussian r.v.s then this is just the squared error loss we
considered last time. If the WiWi are Laplacian distributed e.g.
P(w)∝e-|w|P(w)∝e-|w|, then we obtain the absolute error, or L1L1,
loss function. We can also handle non-additive models such as the
Poisson model
Y
i
∼
P
y
|
f
i
/
n
=
e
-
f
(
i
/
n
)
[
f
(
i
/
n
)
]
y
y
!
Y
i
∼
P
y
|
f
i
/
n
=
e
-
f
(
i
/
n
)
[
f
(
i
/
n
)
]
y
y
!
(9)
In this case
-
log
P
Y
i
|
f
i
/
n
=
f
i
/
n
-
Y
i
log
f
i
/
n
+
constant
-
log
P
Y
i
|
f
i
/
n
=
f
i
/
n
-
Y
i
log
f
i
/
n
+
constant
(10)
which is a very different loss function, but quite appropriate for many imaging problems.
Before we investigate maximum likelihood estimation for model
selection, let's review some of the basis concepts. Let ΘΘ
denote a parameter space (e.g., Θ=RΘ=R), and assume we have
observations
Y
i
∼
i
i
d
P
θ
*
(
y
)
,
i
=
1
,
⋯
,
n
Y
i
∼
i
i
d
P
θ
*
(
y
)
,
i
=
1
,
⋯
,
n
(11)
where θ*∈Θθ*∈Θ is a parameter determining the
density of the {YiYi}. The ML estimator of θ*θ* is
θ
^
n
=
arg
max
θ
∈
Θ
∏
i
=
1
n
P
θ
(
Y
i
)
=
arg
max
θ
∈
Θ
∑
i
=
1
n
log
P
θ
(
Y
i
)
=
arg
min
θ
∈
Θ
∑
i
=
1
n
-
log
P
θ
(
Y
i
)
.
θ
^
n
=
arg
max
θ
∈
Θ
∏
i
=
1
n
P
θ
(
Y
i
)
=
arg
max
θ
∈
Θ
∑
i
=
1
n
log
P
θ
(
Y
i
)
=
arg
min
θ
∈
Θ
∑
i
=
1
n
-
log
P
θ
(
Y
i
)
.
(12)
θ^θ^ maximizes the expected log-likelihood. To see this, let's compare the expected log-likelihood of θ*θ* with any other θ∈Θθ∈Θ.
E
[
log
P
θ
*
(
Y
)
-
log
P
θ
(
Y
)
]
=
E
log
P
θ
*
(
Y
)
P
θ
(
Y
)
=
∫
log
P
θ
*
(
y
)
P
θ
(
y
)
P
θ
*
(
y
)
d
y
=
K
(
P
θ
,
P
θ
*
)
the
KL
divergence
≥
0
with
equality
iff
P
θ
*
=
P
θ
.
E
[
log
P
θ
*
(
Y
)
-
log
P
θ
(
Y
)
]
=
E
log
P
θ
*
(
Y
)
P
θ
(
Y
)
=
∫
log
P
θ
*
(
y
)
P
θ
(
y
)
P
θ
*
(
y
)
d
y
=
K
(
P
θ
,
P
θ
*
)
the
KL
divergence
≥
0
with
equality
iff
P
θ
*
=
P
θ
.
(13)
Why?
-
E
log
P
θ
*
(
y
)
P
θ
(
y
)
=
E
log
P
θ
(
y
)
P
θ
*
(
y
)
≤
log
E
P
θ
(
y
)
P
θ
*
(
y
)
=
log
∫
P
θ
(
y
)
d
y
=
0
⇒
K
(
P
θ
,
P
θ
*
)
≥
0
-
E
log
P
θ
*
(
y
)
P
θ
(
y
)
=
E
log
P
θ
(
y
)
P
θ
*
(
y
)
≤
log
E
P
θ
(
y
)
P
θ
*
(
y
)
=
log
∫
P
θ
(
y
)
d
y
=
0
⇒
K
(
P
θ
,
P
θ
*
)
≥
0
(14)
On the other hand, since θ^nθ^n maximizes the likelihood over θ∈Θθ∈Θ, we have
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
=
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
-
log
P
θ
^
n
(
Y
i
)
≤
0
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
=
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
-
log
P
θ
^
n
(
Y
i
)
≤
0
(15)
Therefore,
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
-
K
(
P
θ
^
n
,
P
θ
*
)
+
K
(
P
θ
^
n
,
P
θ
*
)
≤
0
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
-
K
(
P
θ
^
n
,
P
θ
*
)
+
K
(
P
θ
^
n
,
P
θ
*
)
≤
0
(16)
or re-arranging
K
(
P
θ
^
n
,
P
θ
*
)
≤
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
-
K
(
P
θ
^
n
,
P
θ
*
)
K
(
P
θ
^
n
,
P
θ
*
)
≤
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
-
K
(
P
θ
^
n
,
P
θ
*
)
(17)
Notice that the quantity
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
(
Y
i
)
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
(
Y
i
)
(18)
is an empirical average whose mean is K(Pθ,Pθ*)K(Pθ,Pθ*). By the law of large numbers, for each θ∈Θθ∈Θ,
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
(
Y
i
)
-
K
(
P
θ
,
P
θ
*
)
→
a
.
s
.
0
1
n
∑
i
=
1
n
log
P
θ
*
(
Y
i
)
P
θ
(
Y
i
)
-
K
(
P
θ
,
P
θ
*
)
→
a
.
s
.
0
(19)
.
If this also holds for the sequence {θ^n}{θ^n}, then we have
K
(
P
θ
^
n
,
P
θ
*
)
≤
1
n
∑
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
-
K
(
P
θ
^
n
,
P
θ
*
)
→
0
as
n
→
∞
K
(
P
θ
^
n
,
P
θ
*
)
≤
1
n
∑
log
P
θ
*
(
Y
i
)
P
θ
^
n
(
Y
i
)
-
K
(
P
θ
^
n
,
P
θ
*
)
→
0
as
n
→
∞
(20)
which implies that
P
θ
^
n
→
P
θ
*
P
θ
^
n
→
P
θ
*
(21)
which often implies that
θ
^
n
→
θ
*
θ
^
n
→
θ
*