Skip to content Skip to navigation

Connexions

You are here: Home » Content » Vapnik-Chervonenkis Theory

Navigation

Lenses

What is a lens?

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

This content is ...

Endorsed by Endorsed (What does "Endorsed by" mean?)

This content has been endorsed by the organizations listed. Click each link for a list of all content endorsed by the organization.

Recently Viewed

This feature requires Javascript to be enabled.
 

Vapnik-Chervonenkis Theory

Module by: Robert Nowak. E-mail the author

Review of Past Lecture

In our past lectures we considered collections of candidate function FF that were either finite or enumerable. We then constructed penalties, usually codelengths, for each candidate c(f)c(f), fFfF, such that fF2c(f)1fF2c(f)1 This allowed us to derive uniform concentration inequalities over the entire set FF using the union bound. However, in many cases the collections FF may be uncountably infinite. A simple example is the collection FF of a single threshold classifier in 1-d having the form

f t ( x ) = 1 { x t } f t ( x ) = 1 { x t }
(1)

and their complements

f s ( t ) = 1 { x < s } . f s ( t ) = 1 { x < s } .
(2)

Thus, FF contains an uncountable number of classifiers, and we cannot apply the union bound argument in such cases.

Two Ways to Proceed

Discretize or Quantize the Collection

Example 1

To quantize FF

F q = f , f ( x ) = 1 { x 1 / q i i { 0 , 1 , ... , q } } F q = f , f ( x ) = 1 { x 1 / q i i { 0 , 1 , ... , q } }
(3)

q is positive, such that fqFqfqFq

| f - f q | c / q | f - f q | c / q
(4)

if the density of x is bounded by c>0c>0. q<n1/2q<n1/2.

Identical Empirical Errors

Consider the fact that given only n training data, many of the classifiers in such a collection may produce identical empirical errors. Also, many fFfF will produce identical label assignments on the data. We will have at most 2n2n unique labels.

f is uncountable, its interceptions are countable and bounded by 2n2n. n intervals with 2 classifier per interval.

The number of distinct labeling assignments that a class FF can produce on a set of n points is denoted

S ( F , n ) 2 n S ( F , n ) 2 n
(5)

The VC dimension is logS(F,n)logS(F,n). Specifically, VC(F)=kVC(F)=k, where kk is largest integer such that S(F,k)=2kS(F,k)=2k Ex. 2n=2n2n=2n, n=2n=2, VC(F)=2VC(F)=2.

Ex. Consider

F = f : f ( x ) = 1 { x t } o r f ( x ) = 1 { x < t } , t [ 0 , 1 ] F = f : f ( x ) = 1 { x t } o r f ( x ) = 1 { x < t } , t [ 0 , 1 ]
(6)

Let q be a positive integer and

F q = f : f ( x ) = 1 { x i / q } o r f ( x ) = 1 { x < i / q } , i { 0 , 1 , ... , q } F q = f : f ( x ) = 1 { x i / q } o r f ( x ) = 1 { x < i / q } , i { 0 , 1 , ... , q }
(7)

and,

| f q | = 2 ( q + 1 ) . | f q | = 2 ( q + 1 ) .
(8)

Moreover, for any fFfF there exists an f1Fqf1Fq such that

| f ( x ) - f q ( x ) | d x ( i - 1 ) / q i / q 1 d x = 1 / q . | f ( x ) - f q ( x ) | d x ( i - 1 ) / q i / q 1 d x = 1 / q .
(9)

Now suppose we have n training data and suppose f*Ff*F. We know that in general, the minimum empirical risk classifier will converge to the Bayes classifier at the rate of n-1/2n-1/2 or slower. Therefore, it is unnecessary to drive the approximation error down faster than n-1/2n-1/2 So, we can restrict our attention of Fn-1/2Fn-1/2 and, provided that the density of x is bound above. We have

m i n f F n - 1 / 2 R ( f ) - R ( f * ) C f q m i n | f * ( x ) - f ( x ) | d x c / n 1 / 2 . m i n f F n - 1 / 2 R ( f ) - R ( f * ) C f q m i n | f * ( x ) - f ( x ) | d x c / n 1 / 2 .
(10)

Vapnik-Chervonenkis theory is based not on explicitly quantizing the collection of candidate functions, but rather on recognizing that the richness of FF is limited in a certain sense by the number of training data. Indeed, given n i.i.d. training data, there are at most 2n2n different binary labelings. Therefore, any collection FF may be divided into 2n2n subsets of classifiers that are "equilvalent" with respect to the training data. In many cases a collection may not even be capable of producing 2n2n different labellings.

Example

Consider X=[0,1]X=[0,1].

F = f : f ( x ) = 1 { x t } o r f ( x ) = 1 { x < t } t [ 0 , 1 ] F = f : f ( x ) = 1 { x t } o r f ( x ) = 1 { x < t } t [ 0 , 1 ]
(11)

Suppose we have n training data: (x1,...,xn)[0,1](x1,...,xn)[0,1]. With xsxs denotes the location of each training point in [0,1]. Associated with each x is a label y{0,1}y{0,1}. Any classifier in FF will label all points to the left of a number t[0,1]t[0,1] as "1" or "0", and points to the right as "0" or "1", respectively. For t[0,x1)t[0,x1), all points are either labelled "0" or "1". For t(x1,x2)t(x1,x2), x1x1 is labelled "0" or "1" and x2...xnx2...xn are label "1" or "0" and so on. We see that there are exactly 2n2n different labellings; far less than 2n2n!

The number of different labellings that a class FF can produce on a set of n training data is a measure of the "effective size" of FF. The Vapnik-Chervonenkis (VC) dimension of FF is proportional to the log of the effective size. Let V(F,n)V(F,n) denote the VC dimension of FF, typically a constant, independent of n. The VC inequality states that for all fFfF

P | R ^ n ( f ) - R ( f ) | > ϵ 8 e V ( F , h ) e - n ϵ 2 / 32 . P | R ^ n ( f ) - R ( f ) | > ϵ 8 e V ( F , h ) e - n ϵ 2 / 32 .
(12)

This type of uniform concentration inequality can be used in a similar fashion to our use of Hoeffding's inequality plus union bound.

Hyperplane Classifiers

We will go into the details of VC Theory next lecture, and the remainder of this lecture will introduce the key ideas with an example Consider the following setup. Let X=[0,1]dX=[0,1]d, Y={0,1}Y={0,1} Let

F = f : f ( x ) = 1 { w T x + w 0 > 0 } F = f : f ( x ) = 1 { w T x + w 0 > 0 }
(13)

with w0w0 and wwRd+1Rd+1 This is the collection of all hyperplane classifiers. FF is infinite and uncountable.

Suppose that we have n training data

X i , Y i i = 1 n . X i , Y i i = 1 n .
(14)

There are at most 2nd2nd unique classifiers in FF with respect to these data. To see this, consider d arbitrary data points x1,...,xidx1,...,xid, and let wTx+w0>0wTx+w0>0 be a hyperplane containing these points. To be specific, take the hyperplane with

| | w 0 w | | = 1 . | | w 0 w | | = 1 .
(15)

this hyperplane coincides with two possible classification rules:

f 1 ( x ) = 1 { w T x + w 0 > 0 } f 1 ( x ) = 1 { w T x + w 0 > 0 }
(16)
f 2 ( x ) = 1 { w T x + w 0 < 0 } f 2 ( x ) = 1 { w T x + w 0 < 0 }
(17)

Each d-tuple of training data produces two distinct classifiers, assuming the data are not co-linear. Thus, there are at most 2*nd2*nd unique classifiers in FF with respect to the training data. (All other fFfF produce the same labels and empirical risk as one of the classifiers.) Let's enumerate the unique hyperplane classifiers f1,...,f2*ndf1,...,f2*nd, and let

f ^ n = arg min f f 1 , ... , f 2 n d R ^ n ( f ) f ^ n = arg min f f 1 , ... , f 2 n d R ^ n ( f )
(18)

and let

R * = i n f f F R ( f ) R * = i n f f F R ( f )
(19)

and define

f * = a r g m i n f F R ( f ) f * = a r g m i n f F R ( f )
(20)

If multiple fFfF achieve R*R*, pick f*f* to be one of them in an arbitrary fixed number.

Theorem 1

Assume that PxPx has a density, but that the distribution of (x,y)(x,y) is other arbitrary. If ndnd and 2d/nϵ12d/nϵ1 then

P R ( f ^ n ) - R ( f ) > ϵ e 2 d ϵ 2 n d + 1 e - n ϵ 2 / 2 . P R ( f ^ n ) - R ( f ) > ϵ e 2 d ϵ 2 n d + 1 e - n ϵ 2 / 2 .
(21)

Note:

The assumption that PxPx has a density insures that no d+1 points are co-planar. This in turn, guarantees that there are exactly 2nd2nd unique classifier and that the 2nd2nd under consideration are fully representative of all possible classifiers in FF, with respect to the data.

Proof

The proof is a specialization of the basic ingredients of VC Theory to the case at hand. Here we follow the proof in DGL '96. First we note that,

R ( f ^ n ) - R ( f * ) = R ( f ^ n ) - R ^ n ( f ^ n ) + R ^ n ( f ^ n ) - R ( f * ) R ( f ^ n ) - R ( f * ) = R ( f ^ n ) - R ^ n ( f ^ n ) + R ^ n ( f ^ n ) - R ( f * )
(22)
R ( f ^ n ) - R ^ n ( f ^ n ) + R ^ n f * ) - R ( f * ) + d / n R ( f ^ n ) - R ^ n ( f ^ n ) + R ^ n f * ) - R ( f * ) + d / n
(23)

and since R^n(f^n)R^n(f)+d/nR^n(f^n)R^n(f)+d/n for any fFfF

m a x i = 1 , ... , 2 n d ( R ( f i ) - ( ^ R ) n ( f i ) ) + ( ^ R ) n ( f * ) - R ( f * ) + d / n . m a x i = 1 , ... , 2 n d ( R ( f i ) - ( ^ R ) n ( f i ) ) + ( ^ R ) n ( f * ) - R ( f * ) + d / n .
(24)

Therefore, by the union bound:

P ( R ( f ^ n ) - R ( f * ) > ϵ ) P ( R ( f ^ n ) - R ( f * ) > ϵ )
(25)
i = 1 2 n d P R ( f i ) - R ^ n ( f i ) > ϵ / 2 + P R ^ n ( f * ) - R ( f * ) + d / n > ϵ / 2 . i = 1 2 n d P R ( f i ) - R ^ n ( f i ) > ϵ / 2 + P R ^ n ( f * ) - R ( f * ) + d / n > ϵ / 2 .
(26)

We can bound the second term of the above bound using Chernoff's/Hoeffding's inequality:

P R ^ n ( f * ) - R ( f * ) > ϵ / 2 - d / n P R ^ n ( f * ) - R ( f * ) > ϵ / 2 - d / n
(27)
e - 2 n ϵ / 2 - d / n 2 e - 2 n ϵ / 2 - d / n 2
(28)
e 2 d ϵ e - n ϵ 2 / 2 . e 2 d ϵ e - n ϵ 2 / 2 .
(29)

Next, let's bound one of the terms in the summation. For example, take

P R ( f i ) - R ^ n ( f i ) > ( ϵ / 2 ) . P R ( f i ) - R ^ n ( f i ) > ( ϵ / 2 ) .
(30)

Note that by symmetry all 2nd2nd terms will have identical bounds. Since the bounds are independent of PxyPxy.

Assume that f1f1 is determined by the first d data points x14,...,xdx14,...,xd. By the smoothing property of expectations we can write,

P R ( f i ) - R ^ n ( f ) > ϵ / 2 = E P R ( f i ) - R ^ n ( f i ) > ϵ / 2 | x 1 , ... , x d . P R ( f i ) - R ^ n ( f ) > ϵ / 2 = E P R ( f i ) - R ^ n ( f i ) > ϵ / 2 | x 1 , ... , x d .
(31)

From here, we will bound the conditional probability inside the expectation. Let (X1",Y1"),...,(Xd",Yd")(X1",Y1"),...,(Xd",Yd") be d additional random samples that are independent and identically distributed as the data (X1,Y1),...,(Xd,Yd)(X1,Y1),...,(Xd,Yd). Xi",Yi"i=1dXi",Yi"i=1d are often called the "ghost sample" since they are not actually observed. They are a fictious sample leads to a simple bound on the conditional probability. Define if idid

( X i ' , Y i ' ) = ( X i " , Y i " ) ( X i ' , Y i ' ) = ( X i " , Y i " )
(32)

or if i>di>d

( X i ' , Y i ' ) = ( X i , Y i ) . ( X i ' , Y i ' ) = ( X i , Y i ) .
(33)

That is, Xi',Yi'i=1dXi',Yi'i=1d agrees with our observed data on i>d, but the first d samples are replaced with the ghost sample. Then,

P R ( f i ) - R ^ n ( f 1 ) > ϵ / 2 | x 1 , ... , x d P R ( f i ) - R ^ n ( f 1 ) > ϵ / 2 | x 1 , ... , x d
(34)
P R ( f i ) - 1 / n i = d + 1 n 1 f 1 ( x i ) y i > ϵ / 2 | x 1 , ... , x d P R ( f i ) - 1 / n i = d + 1 n 1 f 1 ( x i ) y i > ϵ / 2 | x 1 , ... , x d
(35)
P R ( f i ) - 1 / n 1 n 1 f 1 ( x i ) y i + d / n > ϵ / 2 | x 1 , ... , x d P R ( f i ) - 1 / n 1 n 1 f 1 ( x i ) y i + d / n > ϵ / 2 | x 1 , ... , x d
(36)
= P R ( f i ) - ( ^ R ) n ' ( f 1 ) > t / 2 - d / n | x 1 , ... , x d = P R ( f i ) - ( ^ R ) n ' ( f 1 ) > t / 2 - d / n | x 1 , ... , x d
(37)

where,

R ^ n ' ( f 1 ) = 1 / n i = 1 n 1 { f 1 ( x i ' ) y i ' } . R ^ n ' ( f 1 ) = 1 / n i = 1 n 1 { f 1 ( x i ' ) y i ' } .
(38)

Note that n(^R)n'(f1)n(^R)n'(f1) is binomially distributed with mean R(f1)R(f1) and it is independent of x1,...,xdx1,...,xd Therefore,

P R ( f i ) - R ^ n ' ( f 1 ) > ϵ / 2 - d / n | x 1 , ... , x d P R ( f i ) - R ^ n ' ( f 1 ) > ϵ / 2 - d / n | x 1 , ... , x d
(39)
= P ( R ( f i ) - R ^ n ' ( f 1 ) > t / 2 - d / n | x 1 , ... , x d ) = P ( R ( f i ) - R ^ n ' ( f 1 ) > t / 2 - d / n | x 1 , ... , x d )
(40)
e - 2 n ϵ / 2 - d / n 2 e - 2 n ϵ / 2 - d / n 2
(41)
e 2 d ϵ e - n ϵ 2 / 2 . e 2 d ϵ e - n ϵ 2 / 2 .
(42)

In conclusion,

P R ( f ^ n ) - R * > ϵ P R ( f ^ n ) - R * > ϵ
(43)
i = 1 2 n d P R ( f ) i - R ^ n ( t i ) > ϵ / 2 + P R ^ n ( f * ) - R ( f * ) + d / n > ϵ / 2 i = 1 2 n d P R ( f ) i - R ^ n ( t i ) > ϵ / 2 + P R ^ n ( f * ) - R ( f * ) + d / n > ϵ / 2
(44)
2 n d e 2 d ϵ e - n ϵ 2 / 2 + e 2 d ϵ e - n ϵ 2 / 2 2 n d e 2 d ϵ e - n ϵ 2 / 2 + e 2 d ϵ e - n ϵ 2 / 2
(45)
= e 2 d ϵ 2 n d + 1 e - n ϵ 2 / 2 . = e 2 d ϵ 2 n d + 1 e - n ϵ 2 / 2 .
(46)

Lastly, Corollary If ndnd, then

E R ( f ^ n ) - m i n f F R ( f ) 2 ( d + 1 ) ( l o g n + 2 ) / n . E R ( f ^ n ) - m i n f F R ( f ) 2 ( d + 1 ) ( l o g n + 2 ) / n .
(47)

Content actions

Download module as:

PDF | EPUB (?)

What is an EPUB file?

EPUB is an electronic book format that can be read on a variety of mobile devices.

Downloading to a reading device

For detailed instructions on how to download this content's EPUB to your specific device, click the "(?)" link.

| More downloads ...

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks