Let Pj≡∫Qjη(x)pX(x)dx∫QjpX(x)dxPj≡∫Qjη(x)pX(x)dx∫QjpX(x)dx
(the theoretical analog of P^jP^j) and define
η
¯
(
x
)
=
∑
j
=
1
M
P
j
1
{
x
∈
Q
j
}
η
¯
(
x
)
=
∑
j
=
1
M
P
j
1
{
x
∈
Q
j
}
(15)The function η¯η¯ is
the theoretical analog of η^η^ (i.e., the function obtained
by averaging ηη over the partition cells). By the triangle
inequality,
E
|
η
^
n
(
X
)
-
η
(
X
)
|
≤
E
[
|
η
^
n
(
X
)
-
η
¯
(
X
)
|
]
︸
E
s
t
i
m
a
t
i
o
n
E
r
r
o
r
+
E
[
|
η
¯
n
(
X
)
-
η
(
X
)
|
]
︸
A
p
p
r
o
x
i
m
a
t
i
o
n
E
r
r
o
r
E
|
η
^
n
(
X
)
-
η
(
X
)
|
≤
E
[
|
η
^
n
(
X
)
-
η
¯
(
X
)
|
]
︸
E
s
t
i
m
a
t
i
o
n
E
r
r
o
r
+
E
[
|
η
¯
n
(
X
)
-
η
(
X
)
|
]
︸
A
p
p
r
o
x
i
m
a
t
i
o
n
E
r
r
o
r
(16)Let's first bound the estimation error. For any x∈[0,1]dx∈[0,1]d, let
Q(x)Q(x) denote the histogram bin in which xx falls in. Define the
random variable
N
(
x
)
=
∑
i
=
1
n
1
{
X
i
∈
Q
(
x
)
}
N
(
x
)
=
∑
i
=
1
n
1
{
X
i
∈
Q
(
x
)
}
(17)If Q(x)=QjQ(x)=Qj, then this
random variable is simply nP^jnP^j. Note that
η
^
n
(
x
)
=
1
N
(
x
)
B
(
x
)
η
^
n
(
x
)
=
1
N
(
x
)
B
(
x
)
(18)where B(x)==∑i=1n1{Xi∈Q(x),Yi=1}=∑i:Xi∈Q(x)YiB(x)==∑i=1n1{Xi∈Q(x),Yi=1}=∑i:Xi∈Q(x)Yi. B(x)B(x) is simply the number of samples in cell
Q(x)Q(x) labelled 1. Now η^n(x)η^n(x) is a fairly
complicated random variable, but the conditional distribution of B(x)B(x)
given N(x)N(x) is relatively simple. Note that
B
(
x
)
|
N
(
x
)
=
k
∼
Binomial
k
,
η
¯
(
x
)
B
(
x
)
|
N
(
x
)
=
k
∼
Binomial
k
,
η
¯
(
x
)
(19)since η¯(x)η¯(x) is the probability of a
sample in Q(x)Q(x) having the label 1 and we are conditioning on the
event of observing kk samples in Q(x)Q(x).
Now consider the conditional expectation
E
η
^
n
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
≤
E
B
(
x
)
N
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
,
k
>
0
1
,
k
=
0
(
since
0
≤
η
¯
(
x
)
≤
1
)
E
η
^
n
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
≤
E
B
(
x
)
N
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
,
k
>
0
1
,
k
=
0
(
since
0
≤
η
¯
(
x
)
≤
1
)
(20)Next note that
E
B
(
x
)
N
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
=
E
B
(
x
)
k
-
η
¯
(
x
)
∣
N
(
x
)
=
k
=
E
1
k
|
B
(
x
)
-
k
η
¯
(
x
)
︸
E
[
B
(
x
)
]
|
∣
N
(
x
)
=
k
≤
1
k
(
E
|
B
(
x
)
-
k
η
¯
(
x
)
|
2
∣
N
(
x
)
=
k
︸
conditional
variance
of
B
(
x
)
)
1
2
E
B
(
x
)
N
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
=
E
B
(
x
)
k
-
η
¯
(
x
)
∣
N
(
x
)
=
k
=
E
1
k
|
B
(
x
)
-
k
η
¯
(
x
)
︸
E
[
B
(
x
)
]
|
∣
N
(
x
)
=
k
≤
1
k
(
E
|
B
(
x
)
-
k
η
¯
(
x
)
|
2
∣
N
(
x
)
=
k
︸
conditional
variance
of
B
(
x
)
)
1
2
(21)by the Jensen's inequality, E[|Z|]≤(E[|Z|2])12E[|Z|]≤(E[|Z|2])12.
Therefore,
E
B
(
x
)
N
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
≤
1
k
k
η
¯
(
x
)
(
1
-
η
¯
(
x
)
)
1
2
=
η
¯
(
x
)
(
1
-
η
¯
(
x
)
)
k
E
B
(
x
)
N
(
x
)
-
η
¯
(
x
)
∣
N
(
x
)
=
k
≤
1
k
k
η
¯
(
x
)
(
1
-
η
¯
(
x
)
)
1
2
=
η
¯
(
x
)
(
1
-
η
¯
(
x
)
)
k
(22)and
E
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
∣
N
(
x
)
=
k
≤
η
¯
(
x
)
(
1
-
η
¯
(
x
)
k
,
k
>
0
1
,
k
=
0
E
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
∣
N
(
x
)
=
k
≤
η
¯
(
x
)
(
1
-
η
¯
(
x
)
k
,
k
>
0
1
,
k
=
0
(23)or in other words,
E
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
∣
N
(
x
)
=
k
≤
η
¯
(
x
)
(
1
-
η
¯
(
x
)
N
(
x
)
1
{
N
(
x
)
>
0
}
+
1
{
N
(
x
)
=
0
}
E
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
∣
N
(
x
)
=
k
≤
η
¯
(
x
)
(
1
-
η
¯
(
x
)
N
(
x
)
1
{
N
(
x
)
>
0
}
+
1
{
N
(
x
)
=
0
}
(24)Now taking expectation with respect to N(x)N(x)
E
N
E
[
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
N
(
x
)
=
k
]
≤
E
N
η
¯
(
x
)
(
1
-
η
¯
(
x
)
N
(
x
)
1
{
N
(
x
)
>
0
}
+
P
(
N
(
x
)
=
0
)
≤
E
1
2
N
(
x
)
1
{
N
(
x
)
>
0
}
+
P
(
N
(
x
)
=
0
)
≤
1
2
P
(
N
(
x
)
≤
k
)
+
1
2
k
P
(
N
(
x
)
>
k
)
︸
≤
1
+
P
(
N
(
x
)
=
0
)
E
N
E
[
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
N
(
x
)
=
k
]
≤
E
N
η
¯
(
x
)
(
1
-
η
¯
(
x
)
N
(
x
)
1
{
N
(
x
)
>
0
}
+
P
(
N
(
x
)
=
0
)
≤
E
1
2
N
(
x
)
1
{
N
(
x
)
>
0
}
+
P
(
N
(
x
)
=
0
)
≤
1
2
P
(
N
(
x
)
≤
k
)
+
1
2
k
P
(
N
(
x
)
>
k
)
︸
≤
1
+
P
(
N
(
x
)
=
0
)
(25)Now a key fact is that for any k>0k>0, P(N≤k)→0asn→∞P(N≤k)→0asn→∞. This follows from the assumption that
the marginal density pX(x)≥cpX(x)≥c, for some constant c>0c>0, and
nM→∞nM→∞ as n→∞n→∞. This result
is easily verified by contradiction. If P(N≤k)→q>0asn→∞P(N≤k)→q>0asn→∞, then PX(x)>0PX(x)>0 is contradicted.
Thus, for any ϵ>0ϵ>0 there exists a k>0k>0 such that
12k<ϵ12k<ϵ and P(N≤k)<ϵP(N≤k)<ϵ for nn
sufficiently large. Therefore, for nn sufficiently large and every
x∈[0,1]dx∈[0,1]d,
E
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
<
3
ϵ
E
|
η
^
n
(
x
)
-
η
¯
(
x
)
|
<
3
ϵ
(26)where the
expectation is with respect to the distribution of the sample
{Xi,Yi}i=1n{Xi,Yi}i=1n. Thus,
E
|
η
^
n
(
X
)
-
η
¯
(
X
)
|
<
3
ϵ
E
|
η
^
n
(
X
)
-
η
¯
(
X
)
|
<
3
ϵ
(27)where the
expectation is now with respect to the distribution of the sample
and the marginal distribution of XX.
Next consider the approximation error E|η¯n(X)-η(X)|E|η¯n(X)-η(X)|, where the expectation is over XX alone. The
function ηη may not itself be continuous, but there is another
function ηϵηϵ that is uniformly continuous and such that
E|ηϵ(X)-η(X)|<ϵE|ηϵ(X)-η(X)|<ϵ. Recall that
uniformly continuous functions can be well approximated by piecewise
constant functions.
By the triangle inequality,
E
|
η
¯
-
η
|
≤
E
|
η
¯
-
η
¯
ϵ
|
︸
≤
ϵ
+
E
|
η
¯
ϵ
-
η
ϵ
|
+
E
|
η
ϵ
-
η
|
︸
≤
ϵ
by
design
E
|
η
¯
-
η
|
≤
E
|
η
¯
-
η
¯
ϵ
|
︸
≤
ϵ
+
E
|
η
¯
ϵ
-
η
ϵ
|
+
E
|
η
ϵ
-
η
|
︸
≤
ϵ
by
design
(28)where η¯ϵ(x)=∑j=1m∫Qjηϵ(x')pX(x')dx'1{x∈Qj}η¯ϵ(x)=∑j=1m∫Qjηϵ(x')pX(x')dx'1{x∈Qj}.
E
|
η
¯
(
X
)
-
η
¯
ϵ
(
X
)
|
=
∑
j
=
1
m
∫
Q
j
|
η
(
x
)
-
η
ϵ
(
x
)
|
p
X
(
x
)
d
x
1
{
x
∈
Q
j
}
≤
ϵ
E
|
η
¯
(
X
)
-
η
¯
ϵ
(
X
)
|
=
∑
j
=
1
m
∫
Q
j
|
η
(
x
)
-
η
ϵ
(
x
)
|
p
X
(
x
)
d
x
1
{
x
∈
Q
j
}
≤
ϵ
(29)and since ηϵηϵ is uniformly continuous,
E
|
η
¯
ϵ
(
X
)
-
η
ϵ
(
X
)
|
=
∑
j
=
1
M
∫
Q
j
|
η
¯
ϵ
(
x
)
-
η
ϵ
(
x
)
|
1
{
x
∈
Q
j
}
p
X
(
x
)
d
x
≤
∑
j
=
1
M
δ
P
(
x
∈
Q
j
)
,
where
δ
depends
on
M
=
δ
,
since
∑
j
=
1
M
P
(
X
∈
Q
j
)
=
1
E
|
η
¯
ϵ
(
X
)
-
η
ϵ
(
X
)
|
=
∑
j
=
1
M
∫
Q
j
|
η
¯
ϵ
(
x
)
-
η
ϵ
(
x
)
|
1
{
x
∈
Q
j
}
p
X
(
x
)
d
x
≤
∑
j
=
1
M
δ
P
(
x
∈
Q
j
)
,
where
δ
depends
on
M
=
δ
,
since
∑
j
=
1
M
P
(
X
∈
Q
j
)
=
1
(30)By taking MM sufficiently large, δδ can be made arbitrarily
small. So for large MM, δ≤ϵδ≤ϵ.
Thus, we have shown
E
|
η
¯
(
X
)
-
η
(
X
)
|
<
3
ϵ
E
|
η
¯
(
X
)
-
η
(
X
)
|
<
3
ϵ
(31)for sufficiently large MM.
Since ϵ>0ϵ>0 was arbitrary, we have shown that taking
f
^
n
(
x
)
=
1
,
η
^
n
(
x
)
≥
1
/
2
0
,
o
t
h
e
r
w
i
s
e
f
^
n
(
x
)
=
1
,
η
^
n
(
x
)
≥
1
/
2
0
,
o
t
h
e
r
w
i
s
e
(32)satisfies
P
(
f
^
n
(
X
)
≠
Y
)
-
P
(
f
*
(
X
)
≠
Y
)
≤
2
E
|
η
^
n
(
X
)
-
η
(
X
)
|
→
0
P
(
f
^
n
(
X
)
≠
Y
)
-
P
(
f
*
(
X
)
≠
Y
)
≤
2
E
|
η
^
n
(
X
)
-
η
(
X
)
|
→
0
(33)if
M
→
∞
n
M
→
∞
as
n
→
∞
M
→
∞
n
M
→
∞
as
n
→
∞
(34) P(f^n(X)≠Y)=E[1{f^(X)≠Y}]P(f^n(X)≠Y)=E[1{f^(X)≠Y}] is the
expected risk of f^f^, with expectation over the distributions of
(X,Y)(X,Y) and {Xi,Yi}i=1n{Xi,Yi}i=1n.
"A complete course in modern statistical learning theory at the graduate student level."