Suppose FF= {linear classifiers in RdRd}, then we have
V
F
=
d
+
1
,
f
^
n
=
arg
min
f
∈
F
R
^
n
(
f
)
V
F
=
d
+
1
,
f
^
n
=
arg
min
f
∈
F
R
^
n
(
f
)
(1)
E
[
R
(
f
^
n
)
]
-
inf
f
∈
F
R
(
f
)
≤
4
(
d
+
1
)
log
(
n
+
1
)
+
log
2
n
E
[
R
(
f
^
n
)
]
-
inf
f
∈
F
R
(
f
)
≤
4
(
d
+
1
)
log
(
n
+
1
)
+
log
2
n
(2)
.
Normally, we have a feature vector X∈RdX∈Rd. A hyperplane in
RdRd provides a linear classifier in RdRd. Nonlinear
classifiers can be obtained by a straightforward generalization.
Let φ1,⋯,φd'φ1,⋯,φd', d'≥dd'≥d be a
collection of functions mapping Rd→RRd→R. These
functions, applied to a feature X∈RdX∈Rd, produce a
generalized set of features, φ=(φ1(X),φ2(X),⋯,φd'(X))'φ=(φ1(X),φ2(X),⋯,φd'(X))'. For example, if X=(x1,x2)'X=(x1,x2)',
then we could consider d'=Sd'=S and φ=(x1,x2,x1x2,x12,x22)'∈R5φ=(x1,x2,x1x2,x12,x22)'∈R5. We can then construct a linear
classifier in the higher dimensional generalized feature space
Rd'Rd'.
The VC bounds immediately extend to this case, and we have for FF'
= { generalized linear classifiers based on maps φ:Rd→Rd'φ:Rd→Rd' },
E
[
R
(
f
^
n
)
]
-
inf
f
∈
F
'
R
(
f
)
≤
4
(
d
'
+
1
)
log
(
n
+
1
)
+
log
2
n
E
[
R
(
f
^
n
)
]
-
inf
f
∈
F
'
R
(
f
)
≤
4
(
d
'
+
1
)
log
(
n
+
1
)
+
log
2
n
(3)
.
Theorem 1 (Steele '75, Dudley '78)
Let GGbe a finite-dimensional vector space of real-valued
functions on RdRd. The class of sets A={{x:g(x)≥0}:g∈G}A={{x:g(x)≥0}:g∈G} has VC dimension ≥≥ dim(GG).
Proof:
It is sufficient to show that no set of n=dim(G)+1n=dim(G)+1 points
can be shattered by AA. Take any nn points and for each g∈Gg∈G, define the vector Vg=(g(x1),⋯,g(xn))Vg=(g(x1),⋯,g(xn)).
The set {Vg:g∈G}{Vg:g∈G} is a linear subspace of RnRn of
dimension ≤≤ dim (GG) = n-1n-1. Therefore, there exists a
non-zero vector α=(α1,⋯,αn)∈Rnα=(α1,⋯,αn)∈Rn
such that ∑i=1nαig(xi)=0∑i=1nαig(xi)=0. We can assume that
at least one of these αiSαiS is negative (if all are
positive, just negate the sum). We can then re-arrange this
expression as ∑i:αi≥0αig(xi)=∑i:αi<0-αig(xi)∑i:αi≥0αig(xi)=∑i:αi<0-αig(xi).
Now suppose that there exists a g∈Gg∈G such that the set {x:g(x)≥0}{x:g(x)≥0} selects precisely the xiSxiS on the left-hand
side above. Then all terms on the left are non-negative and all
the terms on the right are non-positive. Since αα is
non-zero, this is a contradiction. Therefore, x1,⋯,xnx1,⋯,xn
cannot be shattered by sets in {x:g(x)≥0}{x:g(x)≥0}, g∈Gg∈G.
6.375pt0.0pt6.375pt
Example 1
Consider half-spaces in RdRd of the form A={x∈Rd:xi≥b,i∈{1,⋯,d},b∈R}A={x∈Rd:xi≥b,i∈{1,⋯,d},b∈R}. Each half-space
can be described by
g
(
x
)
=
0
,
⋯
,
0
,
1
,
0
,
⋯
,
0
x
1
⋮
x
d
-
b
g
(
x
)
=
0
,
⋯
,
0
,
1
,
0
,
⋯
,
0
x
1
⋮
x
d
-
b
(4)
⟹
d
i
m
(
G
)
=
d
+
1
,
V
A
≤
d
+
1
⟹
d
i
m
(
G
)
=
d
+
1
,
V
A
≤
d
+
1
(5)
.
Let
T
k
=
r
e
c
u
r
s
i
v
e
r
e
c
t
a
n
g
u
l
a
r
p
a
r
t
i
t
i
o
n
s
o
f
R
d
w
i
t
h
k
+
1
c
e
l
l
s
T
k
=
r
e
c
u
r
s
i
v
e
r
e
c
t
a
n
g
u
l
a
r
p
a
r
t
i
t
i
o
n
s
o
f
R
d
w
i
t
h
k
+
1
c
e
l
l
s
(6)
Let T∈TkT∈Tk. Each cell of TT results from splitting a
rectangular region into two smaller rectangles parallel to one of
the coordinate axes.
Example 2
T∈T3T∈T3, d=2d=2.
Each additional split is analogous to a half-space set. Therefore,
each additional split can potentially shatter d+1d+1 points. This
implies that
V
T
k
≤
(
d
+
1
)
k
V
T
k
≤
(
d
+
1
)
k
(7)
.
Example 3
d=1d=1.
k=1k=1 split shatters two points.
k=2k=2 splits shatters three points <4<4.
F
k
=
{
t
r
e
e
c
l
a
s
s
i
f
i
e
r
s
w
i
t
h
k
+
1
l
e
a
f
s
o
n
R
d
}
F
k
=
{
t
r
e
e
c
l
a
s
s
i
f
i
e
r
s
w
i
t
h
k
+
1
l
e
a
f
s
o
n
R
d
}
(8)
E
[
R
(
f
^
n
)
]
-
inf
f
∈
F
k
R
(
f
)
≤
4
(
d
+
1
)
k
log
n
+
log
2
n
E
[
R
(
f
^
n
)
]
-
inf
f
∈
F
k
R
(
f
)
≤
4
(
d
+
1
)
k
log
n
+
log
2
n
(9)
.
How can we decide what dimension to choose for a generalized
linear classifier?
How many leafs should be used for a classification tree?
Answer: Complexity Regularization using VC bounds!
SRM is simply complexity regularization using VC type bounds in
place of Chernoff's bound or other concentration inequalities.
The basic idea is to consider a sequence of sets of classifiers
F1,F2,...,F1,F2,..., of increasing VC dimensions VF1≤VF2≤...VF1≤VF2≤.... Then for each k=1,2,...k=1,2,... we find
the minimum empirical risk classifier
f
^
n
(
k
)
=
arg
min
f
∈
F
k
R
^
n
(
f
)
f
^
n
(
k
)
=
arg
min
f
∈
F
k
R
^
n
(
f
)
(10)
and then select the final classifier according to
k
^
=
arg
min
k
≥
1
R
^
n
(
t
^
n
(
k
)
)
+
32
V
F
k
(
log
n
+
1
)
n
k
^
=
arg
min
k
≥
1
R
^
n
(
t
^
n
(
k
)
)
+
32
V
F
k
(
log
n
+
1
)
n
(11)
and f^n≡f^n(k^)f^n≡f^n(k^) is the
final choice.
The basic rational is that we know
R
n
(
f
^
n
(
k
)
)
-
inf
f
∈
F
k
R
(
f
)
≤
C
'
V
F
k
log
n
n
R
n
(
f
^
n
(
k
)
)
-
inf
f
∈
F
k
R
(
f
)
≤
C
'
V
F
k
log
n
n
(12)
where C'C' is a constant.
The end result is that
E
[
R
(
f
^
n
)
]
≤
min
k
≥
1
min
f
∈
F
k
R
(
f
)
+
16
V
F
k
log
n
+
4
2
n
E
[
R
(
f
^
n
)
]
≤
min
k
≥
1
min
f
∈
F
k
R
(
f
)
+
16
V
F
k
log
n
+
4
2
n
(13)
analogous
to our pervious complexity regularization results, except that
codelengths are replaced by VC dimensions.
In order to prove the result we use the VC probability
concentration bound and assume that △=∑k≥1VFk<∞△=∑k≥1VFk<∞. This enables a union bounding argument and
leads to a risk bound of the form given above. For details see
Lugosi and Zeger '96.
Complexity of classes depends on richness (shattering capability)
relative to a set of nn arbitrary points. This allows us to
effectively “quantize" collections of functions in a slightly
data-dependent manner.
Let
F
k
=
k
l
e
a
f
d
e
c
i
s
i
o
n
t
r
e
e
s
i
n
R
d
,
V
F
k
≤
(
d
+
1
)
(
k
+
1
)
F
k
=
k
l
e
a
f
d
e
c
i
s
i
o
n
t
r
e
e
s
i
n
R
d
,
V
F
k
≤
(
d
+
1
)
(
k
+
1
)
(14)
f
^
n
(
k
)
=
arg
min
f
∈
F
k
R
^
n
(
f
)
f
^
n
(
k
)
=
arg
min
f
∈
F
k
R
^
n
(
f
)
(15)
k
^
=
arg
min
k
≥
1
min
f
∈
F
k
R
(
f
)
+
32
(
d
+
1
)
(
k
-
1
)
(
log
n
+
1
)
n
k
^
=
arg
min
k
≥
1
min
f
∈
F
k
R
(
f
)
+
32
(
d
+
1
)
(
k
-
1
)
(
log
n
+
1
)
n
(16)
Then
f
^
n
=
f
^
n
(
k
^
)
f
^
n
=
f
^
n
(
k
^
)
(17)
satisfies
E
[
R
(
f
^
n
)
]
≤
min
k
≥
1
min
f
∈
F
k
R
(
f
)
+
16
(
d
+
1
)
(
k
-
1
)
log
n
+
4
2
n
E
[
R
(
f
^
n
)
]
≤
min
k
≥
1
min
f
∈
F
k
R
(
f
)
+
16
(
d
+
1
)
(
k
-
1
)
log
n
+
4
2
n
(18)
compare with
E
[
R
(
f
^
n
)
]
≤
min
k
≥
1
min
f
∈
d
y
a
d
i
c
k
l
e
a
f
t
r
e
e
s
R
(
f
)
+
(
3
k
-
1
)
log
2
+
1
2
log
n
2
n
E
[
R
(
f
^
n
)
]
≤
min
k
≥
1
min
f
∈
d
y
a
d
i
c
k
l
e
a
f
t
r
e
e
s
R
(
f
)
+
(
3
k
-
1
)
log
2
+
1
2
log
n
2
n
(19)
from Lecture 11.