The proof is a specialization of the basic ingredients of
VC Theory to the case at hand. Here we follow the proof in DGL
'96. First we note that,
R
(
f
^
n
)
-
R
(
f
*
)
=
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
(
f
^
n
)
-
R
(
f
*
)
R
(
f
^
n
)
-
R
(
f
*
)
=
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
(
f
^
n
)
-
R
(
f
*
)
(22)
≤
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
f
*
)
-
R
(
f
*
)
+
d
/
n
≤
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
f
*
)
-
R
(
f
*
)
+
d
/
n
(23)and since
R^n(f^n)≤R^n(f)+d/nR^n(f^n)≤R^n(f)+d/n for any f∈Ff∈F
≤
m
a
x
i
=
1
,
...
,
2
n
d
(
R
(
f
i
)
-
(
^
R
)
n
(
f
i
)
)
+
(
^
R
)
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
.
≤
m
a
x
i
=
1
,
...
,
2
n
d
(
R
(
f
i
)
-
(
^
R
)
n
(
f
i
)
)
+
(
^
R
)
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
.
(24)Therefore, by the union bound:
P
(
R
(
f
^
n
)
-
R
(
f
*
)
>
ϵ
)
P
(
R
(
f
^
n
)
-
R
(
f
*
)
>
ϵ
)
(25)
≤
∑
i
=
1
2
n
d
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
ϵ
/
2
+
P
R
^
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
>
ϵ
/
2
.
≤
∑
i
=
1
2
n
d
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
ϵ
/
2
+
P
R
^
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
>
ϵ
/
2
.
(26)We can bound the second term of the above bound using
Chernoff's/Hoeffding's inequality:
P
R
^
n
(
f
*
)
-
R
(
f
*
)
>
ϵ
/
2
-
d
/
n
P
R
^
n
(
f
*
)
-
R
(
f
*
)
>
ϵ
/
2
-
d
/
n
(27)
≤
e
-
2
n
ϵ
/
2
-
d
/
n
2
≤
e
-
2
n
ϵ
/
2
-
d
/
n
2
(28)
≤
e
2
d
ϵ
e
-
n
ϵ
2
/
2
.
≤
e
2
d
ϵ
e
-
n
ϵ
2
/
2
.
(29)Next, let's bound one of the terms in the summation. For example,
take
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
(
ϵ
/
2
)
.
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
(
ϵ
/
2
)
.
(30)Note that by symmetry all 2nd2nd terms will have
identical bounds. Since the bounds are independent of PxyPxy.
Assume that f1f1 is determined by the first d data points x14,...,xdx14,...,xd. By the smoothing property of expectations we can
write,
P
R
(
f
i
)
-
R
^
n
(
f
)
>
ϵ
/
2
=
E
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
ϵ
/
2
|
x
1
,
...
,
x
d
.
P
R
(
f
i
)
-
R
^
n
(
f
)
>
ϵ
/
2
=
E
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
ϵ
/
2
|
x
1
,
...
,
x
d
.
(31)From here, we will bound the conditional probability inside the
expectation. Let (X1",Y1"),...,(Xd",Yd")(X1",Y1"),...,(Xd",Yd") be d
additional random samples that are independent and identically
distributed as the data (X1,Y1),...,(Xd,Yd)(X1,Y1),...,(Xd,Yd). Xi",Yi"i=1dXi",Yi"i=1d are often called the "ghost sample" since they are
not actually observed. They are a fictious sample leads to a
simple bound on the conditional probability. Define if i≤di≤d
(
X
i
'
,
Y
i
'
)
=
(
X
i
"
,
Y
i
"
)
(
X
i
'
,
Y
i
'
)
=
(
X
i
"
,
Y
i
"
)
(32)or if i>di>d
(
X
i
'
,
Y
i
'
)
=
(
X
i
,
Y
i
)
.
(
X
i
'
,
Y
i
'
)
=
(
X
i
,
Y
i
)
.
(33)That is, Xi',Yi'i=1dXi',Yi'i=1d agrees with our
observed data on i>d, but the first d samples are replaced with
the ghost sample. Then,
P
R
(
f
i
)
-
R
^
n
(
f
1
)
>
ϵ
/
2
|
x
1
,
...
,
x
d
P
R
(
f
i
)
-
R
^
n
(
f
1
)
>
ϵ
/
2
|
x
1
,
...
,
x
d
(34)
≤
P
R
(
f
i
)
-
1
/
n
∑
i
=
d
+
1
n
1
f
1
(
x
i
)
≠
y
i
>
ϵ
/
2
|
x
1
,
...
,
x
d
≤
P
R
(
f
i
)
-
1
/
n
∑
i
=
d
+
1
n
1
f
1
(
x
i
)
≠
y
i
>
ϵ
/
2
|
x
1
,
...
,
x
d
(35)
≤
P
R
(
f
i
)
-
1
/
n
∑
1
n
1
f
1
(
x
i
)
≠
y
i
+
d
/
n
>
ϵ
/
2
|
x
1
,
...
,
x
d
≤
P
R
(
f
i
)
-
1
/
n
∑
1
n
1
f
1
(
x
i
)
≠
y
i
+
d
/
n
>
ϵ
/
2
|
x
1
,
...
,
x
d
(36)
=
P
R
(
f
i
)
-
(
^
R
)
n
'
(
f
1
)
>
t
/
2
-
d
/
n
|
x
1
,
...
,
x
d
=
P
R
(
f
i
)
-
(
^
R
)
n
'
(
f
1
)
>
t
/
2
-
d
/
n
|
x
1
,
...
,
x
d
(37)where,
R
^
n
'
(
f
1
)
=
1
/
n
∑
i
=
1
n
1
{
f
1
(
x
i
'
)
≠
y
i
'
}
.
R
^
n
'
(
f
1
)
=
1
/
n
∑
i
=
1
n
1
{
f
1
(
x
i
'
)
≠
y
i
'
}
.
(38)Note that n(^R)n'(f1)n(^R)n'(f1) is binomially distributed with mean
R(f1)R(f1) and it is independent of x1,...,xdx1,...,xd Therefore,
P
R
(
f
i
)
-
R
^
n
'
(
f
1
)
>
ϵ
/
2
-
d
/
n
|
x
1
,
...
,
x
d
P
R
(
f
i
)
-
R
^
n
'
(
f
1
)
>
ϵ
/
2
-
d
/
n
|
x
1
,
...
,
x
d
(39)
=
P
(
R
(
f
i
)
-
R
^
n
'
(
f
1
)
>
t
/
2
-
d
/
n
|
x
1
,
...
,
x
d
)
=
P
(
R
(
f
i
)
-
R
^
n
'
(
f
1
)
>
t
/
2
-
d
/
n
|
x
1
,
...
,
x
d
)
(40)
≤
e
-
2
n
ϵ
/
2
-
d
/
n
2
≤
e
-
2
n
ϵ
/
2
-
d
/
n
2
(41)
≤
e
2
d
ϵ
e
-
n
ϵ
2
/
2
.
≤
e
2
d
ϵ
e
-
n
ϵ
2
/
2
.
(42)In conclusion,
P
R
(
f
^
n
)
-
R
*
>
ϵ
P
R
(
f
^
n
)
-
R
*
>
ϵ
(43)
≤
∑
i
=
1
2
n
d
P
R
(
f
)
i
-
R
^
n
(
t
i
)
>
ϵ
/
2
+
P
R
^
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
>
ϵ
/
2
≤
∑
i
=
1
2
n
d
P
R
(
f
)
i
-
R
^
n
(
t
i
)
>
ϵ
/
2
+
P
R
^
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
>
ϵ
/
2
(44)
≤
2
n
d
e
2
d
ϵ
e
-
n
ϵ
2
/
2
+
e
2
d
ϵ
e
-
n
ϵ
2
/
2
≤
2
n
d
e
2
d
ϵ
e
-
n
ϵ
2
/
2
+
e
2
d
ϵ
e
-
n
ϵ
2
/
2
(45)
=
e
2
d
ϵ
2
n
d
+
1
e
-
n
ϵ
2
/
2
.
=
e
2
d
ϵ
2
n
d
+
1
e
-
n
ϵ
2
/
2
.
(46)Lastly, Corollary If n≥dn≥d, then
E
R
(
f
^
n
)
-
m
i
n
f
∈
F
R
(
f
)
≤
2
(
d
+
1
)
(
l
o
g
n
+
2
)
/
n
.
E
R
(
f
^
n
)
-
m
i
n
f
∈
F
R
(
f
)
≤
2
(
d
+
1
)
(
l
o
g
n
+
2
)
/
n
.
(47)
"A complete course in modern statistical learning theory at the graduate student level."