An example of the use of Sieves for complexity regularization in
denoising
Consider the following setting. Let
Y
=
f
*
(
X
)
+
W
,
Y
=
f
*
(
X
)
+
W
,
(1)
where X is a
random variable (r.v.) on X=[0,1]X=[0,1], WW is a r.v. on Y=R,Y=R,
independent of XX and satisfying
E
[
W
]
=
0
and
E
[
W
2
]
=
σ
2
<
∞
.
E
[
W
]
=
0
and
E
[
W
2
]
=
σ
2
<
∞
.
(2)
Finally let f*:[0,1]→Rf*:[0,1]→R be
a function satisfying
|
f
*
(
t
)
-
f
*
(
s
)
|
≤
L
|
t
-
s
|
,
∀
t
,
s
∈
[
0
,
1
]
,
|
f
*
(
t
)
-
f
*
(
s
)
|
≤
L
|
t
-
s
|
,
∀
t
,
s
∈
[
0
,
1
]
,
(3)
where L>0L>0 is a constant. A function satisfying condition
(Equation 3) is said to be Lipschitz on [0,1][0,1]. Notice
that such a function must be continuous, but it is not necessarily
differentiable. An example of such a function is depicted in
Figure 1(a).
Note that
E
[
Y
|
X
=
x
]
=
E
[
f
*
(
X
)
+
W
|
X
=
x
]
=
E
[
f
*
(
x
)
+
W
|
X
=
x
]
=
f
*
(
x
)
+
E
[
W
]
=
f
*
(
x
)
.
E
[
Y
|
X
=
x
]
=
E
[
f
*
(
X
)
+
W
|
X
=
x
]
=
E
[
f
*
(
x
)
+
W
|
X
=
x
]
=
f
*
(
x
)
+
E
[
W
]
=
f
*
(
x
)
.
(4)
Consider our usual setup: Estimate f*f* using nn training
examples
{
X
i
,
Y
i
}
i
=
1
n
∼
i
.
i
.
d
.
P
X
Y
,
Y
i
=
f
*
(
X
i
)
+
W
i
,
i
=
{
1
,
...
,
n
}
,
{
X
i
,
Y
i
}
i
=
1
n
∼
i
.
i
.
d
.
P
X
Y
,
Y
i
=
f
*
(
X
i
)
+
W
i
,
i
=
{
1
,
...
,
n
}
,
(5)
where ∼i.i.d.∼i.i.d. means independently and identically
distributed. Figure 1(a) illustrates this
setup.
In many applications we can sample X=[0,1]X=[0,1] as we like, and not
necessarily at random. For example we can take nn samples
uniformly on [0,1]
x
i
=
i
n
,
i
=
1
,
...
,
n
,
Y
i
=
f
(
x
i
)
+
W
i
=
f
i
n
+
W
i
.
x
i
=
i
n
,
i
=
1
,
...
,
n
,
Y
i
=
f
(
x
i
)
+
W
i
=
f
i
n
+
W
i
.
(6)
We will proceed with this setup (as in
Figure 1(b)) in the rest of the lecture.
Our goal is to find f^nf^n such that
E[∥f*-f^n∥2]→0,asn→0E[∥f*-f^n∥2]→0,asn→0 (here ∥·∥∥·∥
is the usual L2L2-norm; i.e., ∥f*-f^n∥2=∫01|f*(t)-f^n(t)|2dt∥f*-f^n∥2=∫01|f*(t)-f^n(t)|2dt).
Let
F
=
{
f
:
f
is
Lipschitz
with
constant
L
}
.
F
=
{
f
:
f
is
Lipschitz
with
constant
L
}
.
(7)
The
Risk
is defined as
R
(
f
)
=
∥
f
*
-
f
∥
2
=
∫
0
1
|
f
*
(
t
)
-
f
(
t
)
|
2
d
t
.
R
(
f
)
=
∥
f
*
-
f
∥
2
=
∫
0
1
|
f
*
(
t
)
-
f
(
t
)
|
2
d
t
.
(8)
The Expected Risk
(recall that our
estimator f^nf^n is based on {xi,Yi}{xi,Yi} and hence is a r.v.)
is defined as
E
[
R
(
f
^
n
)
]
=
E
[
∥
f
*
-
f
^
n
∥
2
]
.
E
[
R
(
f
^
n
)
]
=
E
[
∥
f
*
-
f
^
n
∥
2
]
.
(9)
Finally the
Empirical Risk
is defined as
R
^
n
(
f
)
=
1
n
∑
i
=
1
n
f
i
n
-
Y
i
2
.
R
^
n
(
f
)
=
1
n
∑
i
=
1
n
f
i
n
-
Y
i
2
.
(10)
Let 0<m1≤m2≤m3≤⋯0<m1≤m2≤m3≤⋯ be a sequence of integers
satisfying mn→∞mn→∞ as n→∞n→∞, and
knmn=nknmn=n for some integer kn>0kn>0. That is, for each value of nn
there is an associated integer value mnmn. Define the Sieve
F1,F2,F3,...,F1,F2,F3,...,
F
n
=
f
:
f
(
t
)
=
∑
j
=
1
m
n
c
j
1
{
j
-
1
m
n
≤
t
<
j
m
n
}
,
c
j
∈
R
.
F
n
=
f
:
f
(
t
)
=
∑
j
=
1
m
n
c
j
1
{
j
-
1
m
n
≤
t
<
j
m
n
}
,
c
j
∈
R
.
(11)
FnFn is the space of
functions that are constant on intervals
I
j
,
m
n
≡
j
-
1
m
n
,
j
m
n
,
j
=
1
,
...
,
m
n
.
I
j
,
m
n
≡
j
-
1
m
n
,
j
m
n
,
j
=
1
,
...
,
m
n
.
(12)
From here on we will use mm and kk instead of mnmn and knkn
(dropping the subscript nn) for notational ease.
Define
f
n
(
t
)
=
∑
j
=
1
m
c
j
*
1
{
t
∈
I
j
,
m
}
,
where
c
j
*
=
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
i
n
f
n
(
t
)
=
∑
j
=
1
m
c
j
*
1
{
t
∈
I
j
,
m
}
,
where
c
j
*
=
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
i
n
(13)
Note that fn∈Fnfn∈Fn.
Exercise 1
Upper bound ∥f*-fn∥2∥f*-fn∥2.
∥
f
*
-
f
∥
2
=
∫
0
1
|
f
*
(
t
)
-
f
n
(
t
)
|
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
|
f
*
(
t
)
-
f
n
(
t
)
|
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
|
f
*
(
t
)
-
c
j
*
|
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
f
*
(
t
)
-
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
i
n
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
(
t
)
-
f
*
i
n
2
d
t
≤
∑
j
=
1
m
∫
I
j
,
m
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
(
t
)
-
f
*
i
n
2
d
t
≤
∑
j
=
1
m
∫
I
j
,
m
1
k
∑
i
:
i
n
∈
I
j
,
m
L
m
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
L
m
2
d
t
=
∑
j
=
1
m
1
m
L
m
2
=
L
m
2
.
∥
f
*
-
f
∥
2
=
∫
0
1
|
f
*
(
t
)
-
f
n
(
t
)
|
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
|
f
*
(
t
)
-
f
n
(
t
)
|
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
|
f
*
(
t
)
-
c
j
*
|
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
f
*
(
t
)
-
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
i
n
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
(
t
)
-
f
*
i
n
2
d
t
≤
∑
j
=
1
m
∫
I
j
,
m
1
k
∑
i
:
i
n
∈
I
j
,
m
f
*
(
t
)
-
f
*
i
n
2
d
t
≤
∑
j
=
1
m
∫
I
j
,
m
1
k
∑
i
:
i
n
∈
I
j
,
m
L
m
2
d
t
=
∑
j
=
1
m
∫
I
j
,
m
L
m
2
d
t
=
∑
j
=
1
m
1
m
L
m
2
=
L
m
2
.
(14)
The above implies that ∥f*-fn∥2→0asn→∞,∥f*-fn∥2→0asn→∞, since m=mn→∞asn→∞m=mn→∞asn→∞. In words, with nn sufficiently large we can
approximate f*f* to arbitrary accuracy using models in FnFn
(even if the functions we are using to approximate f*f* are not
Lipschitz!).
For any f∈Fn,f∈Fn,f=∑j=1mcj1{t∈Ij,m},f=∑j=1mcj1{t∈Ij,m}, we have
R
^
n
(
f
)
=
1
n
∑
j
=
1
m
∑
i
:
i
n
∈
I
j
,
m
(
c
j
-
Y
i
)
2
.
R
^
n
(
f
)
=
1
n
∑
j
=
1
m
∑
i
:
i
n
∈
I
j
,
m
(
c
j
-
Y
i
)
2
.
(15)
Let f^n=argminf∈FnR^n(f).f^n=argminf∈FnR^n(f). Then
f
^
n
(
t
)
=
∑
j
=
1
m
c
^
j
1
{
t
∈
I
j
,
m
}
,
where
c
^
j
=
1
k
∑
i
:
i
n
∈
I
j
,
m
Y
i
f
^
n
(
t
)
=
∑
j
=
1
m
c
^
j
1
{
t
∈
I
j
,
m
}
,
where
c
^
j
=
1
k
∑
i
:
i
n
∈
I
j
,
m
Y
i
(16)
Exercise 2
Show (Equation 16).
Note that E[c^j]=cj*E[c^j]=cj* and therefore
E[f^n(t)]=fn(t)E[f^n(t)]=fn