Example: to quantize FF
F
q
=
f
,
f
(
x
)
=
1
{
x
≤
1
/
q
i
i
∈
{
0
,
1
,
...
,
q
}
}
F
q
=
f
,
f
(
x
)
=
1
{
x
≤
1
/
q
i
i
∈
{
0
,
1
,
...
,
q
}
}
(3)
q is positive, such that ∀fq∈Fq∀fq∈Fq
∫
|
f
-
f
q
|
≤
c
/
q
∫
|
f
-
f
q
|
≤
c
/
q
(4)
if the density of x is bounded by c>0c>0. q<n1/2q<n1/2
Consider the fact that given only n training data, many of the
classifiers in such a collection may produce identical empirical
errors. Also, many f∈Ff∈F will produce identical label
assignments on the data. We will have at most 2n2n unique
labels.
f is uncountable, its interceptions are countable and bounded by
2n2n. n intervals with 2 classifier per interval.
The number of distinct labeling assignments that a class FF can
produce on a set of n points is denoted
S
(
F
,
n
)
≤
2
n
S
(
F
,
n
)
≤
2
n
(5)
The VC dimension is logS(F,n)logS(F,n). Specifically, VC(F)=kVC(F)=k,
where kk is largest integer such that S(F,k)=2kS(F,k)=2k Ex. 2n=2n2n=2n,
n=2n=2, VC(F)=2VC(F)=2.
Ex. Consider
F
=
f
:
f
(
x
)
=
1
{
x
≥
t
}
o
r
f
(
x
)
=
1
{
x
<
t
}
,
t
∈
[
0
,
1
]
F
=
f
:
f
(
x
)
=
1
{
x
≥
t
}
o
r
f
(
x
)
=
1
{
x
<
t
}
,
t
∈
[
0
,
1
]
(6)
Let q be a positive integer and
F
q
=
f
:
f
(
x
)
=
1
{
x
≥
i
/
q
}
o
r
f
(
x
)
=
1
{
x
<
i
/
q
}
,
i
∈
{
0
,
1
,
...
,
q
}
F
q
=
f
:
f
(
x
)
=
1
{
x
≥
i
/
q
}
o
r
f
(
x
)
=
1
{
x
<
i
/
q
}
,
i
∈
{
0
,
1
,
...
,
q
}
(7)
and,
|
f
q
|
=
2
(
q
+
1
)
|
f
q
|
=
2
(
q
+
1
)
(8)
Moreover, for any f∈Ff∈F there exists
an f1∈Fqf1∈Fq such that
∫
|
f
(
x
)
-
f
q
(
x
)
|
d
x
≤
∫
(
i
-
1
)
/
q
i
/
q
1
d
x
=
1
/
q
∫
|
f
(
x
)
-
f
q
(
x
)
|
d
x
≤
∫
(
i
-
1
)
/
q
i
/
q
1
d
x
=
1
/
q
(9)
Now suppose we have n training data and suppose f*∈Ff*∈F. We
know that in general, the minimum empirical risk classifier will
converge to the Bayes classifier at the rate of n-1/2n-1/2 or
slower. Therefore, it is unnecessary to drive the approximation
error down faster than n-1/2n-1/2 So, we can restrict our
attention of Fn-1/2Fn-1/2 and, provided that the density of x
is bound above. We have
m
i
n
f
∈
F
n
-
1
/
2
R
(
f
)
-
R
(
f
*
)
≤
C
f
q
m
i
n
∫
|
f
*
(
x
)
-
f
(
x
)
|
d
x
≤
c
/
n
1
/
2
m
i
n
f
∈
F
n
-
1
/
2
R
(
f
)
-
R
(
f
*
)
≤
C
f
q
m
i
n
∫
|
f
*
(
x
)
-
f
(
x
)
|
d
x
≤
c
/
n
1
/
2
(10)
Vapnik-Chervonenkis theory is based not on explicitly quantizing
the collection of candidate functions, but rather on recognizing
that the richness of FF is limited in a certain sense by the
number of training data. Indeed, given n i.i.d. training data,
there are at most 2n2n different binary labelings. Therefore, any
collection FF may be divided into 2n2n subsets of classifiers
that are "equilvalent" with respect to the training data. In many
cases a collection may not even be capable of producing 2n2n
different labellings.
Consider X=[0,1]X=[0,1]
F
=
f
:
f
(
x
)
=
1
{
x
≥
t
}
o
r
f
(
x
)
=
1
{
x
<
t
}
t
∈
[
0
,
1
]
F
=
f
:
f
(
x
)
=
1
{
x
≥
t
}
o
r
f
(
x
)
=
1
{
x
<
t
}
t
∈
[
0
,
1
]
(11)
Suppose we have n training data: (x1,...,xn)∈[0,1](x1,...,xn)∈[0,1].
With xsxs denotes the location of each training point in [0,1].
Associated with each x is a label y∈{0,1}y∈{0,1}. Any classifier in
FF will label all points to the left of a number t∈[0,1]t∈[0,1] as
"1" or "0", and points to the right as "0" or "1", respectively.
For t∈[0,x1)t∈[0,x1), all points are either labelled "0" or "1". For
t∈(x1,x2)t∈(x1,x2), x1x1 is labelled "0" or "1" and x2...xnx2...xn are
label "1" or "0" and so on. We see that there are exactly 2n2n
different labellings; far less than 2n2n!
The number of different labellings that a class FF can produce
on a set of n training data is a measure of the "effective size"
of FF. The Vapnik-Chervonenkis (VC) dimension of FF is
proportional to the log of the effective size. Let V(F,n)V(F,n)
denote the VC dimension of FF, typically a constant, independent
of n. The VC inequality states that for all f∈Ff∈F
P
|
R
^
n
(
f
)
-
R
(
f
)
|
>
ϵ
≤
8
e
V
(
F
,
h
)
e
-
n
ϵ
2
/
32
P
|
R
^
n
(
f
)
-
R
(
f
)
|
>
ϵ
≤
8
e
V
(
F
,
h
)
e
-
n
ϵ
2
/
32
(12)
This type of uniform concentration inequality can be used in a
similar fashion to our use of Hoeffding's inequality plus union
bound.
We will go into the details of VC Theory next lecture, and the
remainder of this lecture will introduce the key ideas with an
example Consider the following setup. Let X=[0,1]dX=[0,1]d, Y={0,1}Y={0,1} Let
F
=
f
:
f
(
x
)
=
1
{
w
T
x
+
w
0
>
0
}
F
=
f
:
f
(
x
)
=
1
{
w
T
x
+
w
0
>
0
}
(13)
with w0w0 and ww∈Rd+1∈Rd+1 This is the collection of all hyperplane
classifiers. FF is infinite and uncountable.
Suppose that we have n training data
X
i
,
Y
i
i
=
1
n
X
i
,
Y
i
i
=
1
n
(14)
There are at most 2nd2nd unique classifiers in FF with
respect to these data. To see this, consider d arbitrary data
points x1,...,xidx1,...,xid, and let wTx+w0>0wTx+w0>0 be a
hyperplane containing these points. To e specific, take the
hyperplane with
|
|
w
0
w
|
|
=
1
|
|
w
0
w
|
|
=
1
(15)
. this hyperplane coincides with two
possible classification rules:
f
1
(
x
)
=
1
{
w
T
x
+
w
0
>
0
}
f
1
(
x
)
=
1
{
w
T
x
+
w
0
>
0
}
(16)
f
2
(
x
)
=
1
{
w
T
x
+
w
0
<
0
}
f
2
(
x
)
=
1
{
w
T
x
+
w
0
<
0
}
(17)
Each d-tuple of training data produces two distinct classifiers,
assuming the data are not co-linear. Thus, there are at most 2*nd2*nd unique classifiers in FF with respect to the
training data. (All other f∈Ff∈F produce the same labels and
empirical risk as one of the classifiers.) Let's enumerate the
unique hyperplane classifiers f1,...,f2*ndf1,...,f2*nd,
and let
f
^
n
=
arg
min
f
∈
f
1
,
...
,
f
2
n
d
R
^
n
(
f
)
f
^
n
=
arg
min
f
∈
f
1
,
...
,
f
2
n
d
R
^
n
(
f
)
(18)
and let
R
*
=
i
n
f
f
∈
F
R
(
f
)
R
*
=
i
n
f
f
∈
F
R
(
f
)
(19)
and define
f
*
=
a
r
g
m
i
n
f
∈
F
R
(
f
)
f
*
=
a
r
g
m
i
n
f
∈
F
R
(
f
)
(20)
If multiple f∈Ff∈F achieve R*R*, pick f*f* to be one of them in an
arbitrary fixed number.
Theorem: Assume that PxPx has a
density, but that the distribution of (x,y)(x,y) is other arbitrary.
If n≥dn≥d and 2d/n≤ϵ≤12d/n≤ϵ≤1 then
P
R
(
f
^
n
)
-
R
(
f
)
>
ϵ
≤
e
2
d
ϵ
2
n
d
+
1
e
-
n
ϵ
2
/
2
P
R
(
f
^
n
)
-
R
(
f
)
>
ϵ
≤
e
2
d
ϵ
2
n
d
+
1
e
-
n
ϵ
2
/
2
(21)
NOTE: The assumption that PxPx has a density insures that no d+1
points are co-planar. This in turns, guarantees that there are
exactly 2nd2nd unique classifier and that the 2nd2nd under consideration are fully representative of all
possible classifiers in FF, with respect to the data.
Proof: The proof is a specialization of the basic ingredients of
VC Theory to the case at hand. Here we follow the proof in DGL
'96. First we note that,
R
(
f
^
n
)
-
R
(
f
*
)
=
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
(
f
^
n
)
-
R
(
f
*
)
R
(
f
^
n
)
-
R
(
f
*
)
=
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
(
f
^
n
)
-
R
(
f
*
)
(22)
≤
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
f
*
)
-
R
(
f
*
)
+
d
/
n
≤
R
(
f
^
n
)
-
R
^
n
(
f
^
n
)
+
R
^
n
f
*
)
-
R
(
f
*
)
+
d
/
n
(23)
and since
R^n(f^n)≤R^n(f)+d/nR^n(f^n)≤R^n(f)+d/n for any f∈Ff∈F
≤
m
a
x
i
=
1
,
...
,
2
n
d
(
R
(
f
i
)
-
(
^
R
)
n
(
f
i
)
)
+
(
^
R
)
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
≤
m
a
x
i
=
1
,
...
,
2
n
d
(
R
(
f
i
)
-
(
^
R
)
n
(
f
i
)
)
+
(
^
R
)
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
(24)
Therefore, by the union bound:
P
(
R
(
f
^
n
)
-
R
(
f
*
)
>
ϵ
)
P
(
R
(
f
^
n
)
-
R
(
f
*
)
>
ϵ
)
(25)
≤
∑
i
=
1
2
n
d
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
ϵ
/
2
+
P
R
^
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
>
ϵ
/
2
≤
∑
i
=
1
2
n
d
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
ϵ
/
2
+
P
R
^
n
(
f
*
)
-
R
(
f
*
)
+
d
/
n
>
ϵ
/
2
(26)
We can bound the second term of the above bound using
Chernoff's/Hoeffding's inequality:
P
R
^
n
(
f
*
)
-
R
(
f
*
)
>
ϵ
/
2
-
d
/
n
P
R
^
n
(
f
*
)
-
R
(
f
*
)
>
ϵ
/
2
-
d
/
n
(27)
≤
e
-
2
n
ϵ
/
2
-
d
/
n
2
≤
e
-
2
n
ϵ
/
2
-
d
/
n
2
(28)
≤
e
2
d
ϵ
e
-
n
ϵ
2
/
2
≤
e
2
d
ϵ
e
-
n
ϵ
2
/
2
(29)
Next, let's bound one of the terms in the summation. For example,
take
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
(
ϵ
/
2
)
P
R
(
f
i
)
-
R
^
n
(
f
i
)
>
(
ϵ
/
2
)
(30)
Note that by symmetry all 2nd2