Skip to content Skip to navigation

Connexions

You are here: Home » Content » Applications of VC Bound

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

Applications of VC Bound

Module by: Robert Nowak

Linear Classifiers

Suppose FF= {linear classifiers in RdRd}, then we have

V F = d + 1 , f ^ n = arg min f F R ^ n ( f ) V F = d + 1 , f ^ n = arg min f F R ^ n ( f ) (1)
E [ R ( f ^ n ) ] - inf f F R ( f ) 4 ( d + 1 ) log ( n + 1 ) + log 2 n E [ R ( f ^ n ) ] - inf f F R ( f ) 4 ( d + 1 ) log ( n + 1 ) + log 2 n (2)

.

Generalized Linear Classifiers

Normally, we have a feature vector XRdXRd. A hyperplane in RdRd provides a linear classifier in RdRd. Nonlinear classifiers can be obtained by a straightforward generalization.

Let φ1,,φd'φ1,,φd', d'dd'd be a collection of functions mapping RdRRdR. These functions, applied to a feature XRdXRd, produce a generalized set of features, φ=(φ1(X),φ2(X),,φd'(X))'φ=(φ1(X),φ2(X),,φd'(X))'. For example, if X=(x1,x2)'X=(x1,x2)', then we could consider d'=Sd'=S and φ=(x1,x2,x1x2,x12,x22)'R5φ=(x1,x2,x1x2,x12,x22)'R5. We can then construct a linear classifier in the higher dimensional generalized feature space Rd'Rd'.

The VC bounds immediately extend to this case, and we have for FF' = { generalized linear classifiers based on maps φ:RdRd'φ:RdRd' },

E [ R ( f ^ n ) ] - inf f F ' R ( f ) 4 ( d ' + 1 ) log ( n + 1 ) + log 2 n E [ R ( f ^ n ) ] - inf f F ' R ( f ) 4 ( d ' + 1 ) log ( n + 1 ) + log 2 n (3)

.

Half-Space Classifiers

Theorem 1 (Steele '75, Dudley '78) Let GGbe a finite-dimensional vector space of real-valued functions on RdRd. The class of sets A={{x:g(x)0}:gG}A={{x:g(x)0}:gG} has VC dimension dim(GG).

Proof: It is sufficient to show that no set of n=dim(G)+1n=dim(G)+1 points can be shattered by AA. Take any nn points and for each gGgG, define the vector Vg=(g(x1),,g(xn))Vg=(g(x1),,g(xn)).

The set {Vg:gG}{Vg:gG} is a linear subspace of RnRn of dimension dim (GG) = n-1n-1. Therefore, there exists a non-zero vector α=(α1,,αn)Rnα=(α1,,αn)Rn such that i=1nαig(xi)=0i=1nαig(xi)=0. We can assume that at least one of these αiSαiS is negative (if all are positive, just negate the sum). We can then re-arrange this expression as i:αi0αig(xi)=i:αi<0-αig(xi)i:αi0αig(xi)=i:αi<0-αig(xi).

Now suppose that there exists a gGgG such that the set {x:g(x)0}{x:g(x)0} selects precisely the xiSxiS on the left-hand side above. Then all terms on the left are non-negative and all the terms on the right are non-positive. Since αα is non-zero, this is a contradiction. Therefore, x1,,xnx1,,xn cannot be shattered by sets in {x:g(x)0}{x:g(x)0}, gGgG.  6.375pt0.0pt6.375pt

Example 1 Consider half-spaces in RdRd of the form A={xRd:xib,i{1,,d},bR}A={xRd:xib,i{1,,d},bR}. Each half-space can be described by

g ( x ) = 0 , , 0 , 1 , 0 , , 0 x 1 x d - b g ( x ) = 0 , , 0 , 1 , 0 , , 0 x 1 x d - b (4)
d i m ( G ) = d + 1 , V A d + 1 d i m ( G ) = d + 1 , V A d + 1 (5)

.

Tree Classifiers

Let

T k = r e c u r s i v e r e c t a n g u l a r p a r t i t i o n s o f R d w i t h k + 1 c e l l s T k = r e c u r s i v e r e c t a n g u l a r p a r t i t i o n s o f R d w i t h k + 1 c e l l s (6)

Let TTkTTk. Each cell of TT results from splitting a rectangular region into two smaller rectangles parallel to one of the coordinate axes.

Example 2 TT3TT3, d=2d=2.

Each additional split is analogous to a half-space set. Therefore, each additional split can potentially shatter d+1d+1 points. This implies that

V T k ( d + 1 ) k V T k ( d + 1 ) k (7)

.

Example 3 d=1d=1.

k=1k=1 split shatters two points.

k=2k=2 splits shatters three points <4<4.

VC Bound for Tree Classifiers

F k = { t r e e c l a s s i f i e r s w i t h k + 1 l e a f s o n R d } F k = { t r e e c l a s s i f i e r s w i t h k + 1 l e a f s o n R d } (8)
E [ R ( f ^ n ) ] - inf f F k R ( f ) 4 ( d + 1 ) k log n + log 2 n E [ R ( f ^ n ) ] - inf f F k R ( f ) 4 ( d + 1 ) k log n + log 2 n (9)

.

How can we decide what dimension to choose for a generalized linear classifier?

How many leafs should be used for a classification tree?

Answer: Complexity Regularization using VC bounds!

Structural Risk Minimization (SRM)

SRM is simply complexity regularization using VC type bounds in place of Chernoff's bound or other concentration inequalities.

The basic idea is to consider a sequence of sets of classifiers F1,F2,...,F1,F2,..., of increasing VC dimensions VF1VF2...VF1VF2.... Then for each k=1,2,...k=1,2,... we find the minimum empirical risk classifier

f ^ n ( k ) = arg min f F k R ^ n ( f ) f ^ n ( k ) = arg min f F k R ^ n ( f ) (10)

and then select the final classifier according to

k ^ = arg min k 1 R ^ n ( t ^ n ( k ) ) + 32 V F k ( log n + 1 ) n k ^ = arg min k 1 R ^ n ( t ^ n ( k ) ) + 32 V F k ( log n + 1 ) n (11)

and f^nf^n(k^)f^nf^n(k^) is the final choice.

The basic rational is that we know

R n ( f ^ n ( k ) ) - inf f F k R ( f ) C ' V F k log n n R n ( f ^ n ( k ) ) - inf f F k R ( f ) C ' V F k log n n (12)

where C'C' is a constant.

The end result is that

E [ R ( f ^ n ) ] min k 1 min f F k R ( f ) + 16 V F k log n + 4 2 n E [ R ( f ^ n ) ] min k 1 min f F k R ( f ) + 16 V F k log n + 4 2 n (13)

analogous to our pervious complexity regularization results, except that codelengths are replaced by VC dimensions.

In order to prove the result we use the VC probability concentration bound and assume that =k1VFk<=k1VFk<. This enables a union bounding argument and leads to a risk bound of the form given above. For details see Lugosi and Zeger '96.

Key Point of VC Theory

Complexity of classes depends on richness (shattering capability) relative to a set of nn arbitrary points. This allows us to effectively “quantize" collections of functions in a slightly data-dependent manner.

Application to Trees

Let

F k = k l e a f d e c i s i o n t r e e s i n R d , V F k ( d + 1 ) ( k + 1 ) F k = k l e a f d e c i s i o n t r e e s i n R d , V F k ( d + 1 ) ( k + 1 ) (14)
f ^ n ( k ) = arg min f F k R ^ n ( f ) f ^ n ( k ) = arg min f F k R ^ n ( f ) (15)
k ^ = arg min k 1 min f F k R ( f ) + 32 ( d + 1 ) ( k - 1 ) ( log n + 1 ) n k ^ = arg min k 1 min f F k R ( f ) + 32 ( d + 1 ) ( k - 1 ) ( log n + 1 ) n (16)

Then

f ^ n = f ^ n ( k ^ ) f ^ n = f ^ n ( k ^ ) (17)

satisfies

E [ R ( f ^ n ) ] min k 1 min f F k R ( f ) + 16 ( d + 1 ) ( k - 1 ) log n + 4 2 n E [ R ( f ^ n ) ] min k 1 min f F k R ( f ) + 16 ( d + 1 ) ( k - 1 ) log n + 4 2 n (18)

compare with

E [ R ( f ^ n ) ] min k 1 min f d y a d i c k l e a f t r e e s R ( f ) + ( 3 k - 1 ) log 2 + 1 2 log n 2 n E [ R ( f ^ n ) ] min k 1 min f d y a d i c k l e a f t r e e s R ( f ) + ( 3 k - 1 ) log 2 + 1 2 log n 2 n (19)

from Lecture 11.

Comments, questions, feedback, criticisms?

Send feedback