Skip to content Skip to navigation

Connexions

You are here: Home » Content » Chernoff's Bound and Hoeffding's Inequality

Navigation

Lenses

What is a lens?

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

This content is ...

Endorsed by Endorsed (What does "Endorsed by" mean?)

This content has been endorsed by the organizations listed. Click each link for a list of all content endorsed by the organization.

Recently Viewed

This feature requires Javascript to be enabled.
Download
x

Download module as:

  • PDF
  • EPUB (what's this?)

    What is an EPUB file?

    EPUB is an electronic book format that can be read on a variety of mobile devices.

    Downloading to a reading device

    For detailed instructions on how to download this content's EPUB to your specific device, click the "(what's this?)" link.

  • More downloads ...
Reuse / Edit
x

Module:

Add to a lens
x

Add module to:

Add to Favorites
x

Add module to:

 

Chernoff's Bound and Hoeffding's Inequality

Module by: Robert Nowak. E-mail the author

Introduction

Motivation

In the last lecture we consider a learning problem in which the optimal function belonged to a finite class of functions. Specifically, for some collection of functions FFwith finite cardinality |F||F|, we have

min f F R ( f ) = 0 f * F . min f F R ( f ) = 0 f * F . (1)

This is almost always not the situation in the real-world learning problems. Let us suppose we have a finite collection of candidate functions FF. Furthermore, we do not assume that the optimal function f*f*, which satisfies

R ( f * ) = inf f R ( f ) R ( f * ) = inf f R ( f ) (2)

where the infinf is taken over all measurable functions, is a member of FF. That is, we make few, if any, assumptions about f*f*. This situation is sometimes termed as Agnostic Learning. The root of the word agnostic literally means not known. The term agnostic learning is used to emphasize the fact that often, perhaps usually, we may have no prior knowledge about f*f*. The question then arises about how we can reasonably select an fFfF in this setting.

The Problem

The PAC style bounds discussed in the previous lecture, offer some help. Since we are selecting a function based on the empirical risk, the question is how close is R^n(f)R^n(f) to R(f)R(f)fFfF. In other words, we wish that the empirical risk is a good indicator of the true risk for every function in FF. If this is case, the selection of ff that minimizes the empirical risk

f n ^ = arg min f F n R ^ n ( f ) f n ^ = arg min f F n R ^ n ( f ) (3)

should also yield a small true risk, that is, R(fn^)R(fn^) should be close to minfFR(f)minfFR(f). Finally, we can thus state our desired situation as

P max f F n | R n ^ ( f ) - R ( f ) | > ϵ < δ , P max f F n | R n ^ ( f ) - R ( f ) | > ϵ < δ , (4)

for small values of ϵϵ and δδ. In other words, with probability at least 1-δ1-δ, |Rn^(f)-R(f)|>ϵ|Rn^(f)-R(f)|>ϵ, fFfF. In this lecture, we will start to develop bounds of this form. First we will focus on bounding P(|Rn^(f)-R(f)|>ϵ)P(|Rn^(f)-R(f)|>ϵ) for one fixed fFfF.

Developing Initial Bounds

To begin, let us recall the definition of empirical risk for {Xi,Yi}i=1n{Xi,Yi}i=1n be a collection of training data. Then the empirical risk is defined as

R ^ n ( f ) = 1 n i = 1 n ( f ( X i ) , Y i ) . R ^ n ( f ) = 1 n i = 1 n ( f ( X i ) , Y i ) . (5)

Note that since the training data {Xi,Yi}i=1n{Xi,Yi}i=1n are assumed to be i.i.d. pairs, the terms in the sum are i.i.d random variables.

Let

L i = ( f ( X i ) , Y i ) . L i = ( f ( X i ) , Y i ) . (6)

The collection of losses {Li}i=1n{Li}i=1n is i.i.d according to some unknown distribution (depending on the unknown joint distribution of (X,Y) and the loss function). The expectation of LiLi is E[(f(Xi),Yi)]=E[(f(X),Y)]=R(f)E[(f(Xi),Yi)]=E[(f(X),Y)]=R(f), the true risk of ff. For now, let's assume that ff is fixed.

E [ R n ^ ( f ) ] = 1 n i = 1 n E [ ( f ( X i ) , Y i ) ] = 1 n i = 1 n E [ L i ] = R ( f ) E [ R n ^ ( f ) ] = 1 n i = 1 n E [ ( f ( X i ) , Y i ) ] = 1 n i = 1 n E [ L i ] = R ( f ) (7)

We know from the strong law of large numbers that the average (or empirical mean) Rn^(f)Rn^(f) converges almost surely to the true mean R(f).R(f). That is, Rn^(f)R(f)Rn^(f)R(f) almost surely as nn. The question is how fast.

Concentration of Measure Inequalities

Concentration inequalities are upper bounds on how fast empirical means converge to their ensemble counterparts, in probability. The area of the shaded tail regions in Figure 1 is P(|Rn^(f)-R(f)|>ϵ)P(|Rn^(f)-R(f)|>ϵ). We are interested in finding out how fast this probability tends to zero as nn.

Figure 1: Distribution of Rn^(f)Rn^(f)
Figure 1 (distr.png)

At this stage, we recall Markov's Inequality. Let ZZ be a nonnegative random variable.

E [ Z ] = 0 z p ( z ) d z = 0 t z p ( z ) d z + u z p ( z ) d z 0 + t t z p ( z ) d z = t P ( Z t ) P ( Z t ) E [ Z ] t P ( Z 2 t 2 ) E [ Z 2 ] t 2 E [ Z ] = 0 z p ( z ) d z = 0 t z p ( z ) d z + u z p ( z ) d z 0 + t t z p ( z ) d z = t P ( Z t ) P ( Z t ) E [ Z ] t P ( Z 2 t 2 ) E [ Z 2 ] t 2 (8)

Take

Z = | R n ( f ) ^ - R ( f ) | and t = ϵ Z = | R n ( f ) ^ - R ( f ) | and t = ϵ (9)
P ( | R n ^ ( f ) - R ( f ) | ϵ ) E [ | R n ( f ) ^ - R ( f ) | 2 ] ϵ 2 var ( R ^ n ( f ) ) ϵ 2 = i = 1 n var ( L i n ) ϵ 2 = var ( ( X ) , Y ) n ϵ 2 = σ L 2 n ϵ 2 . P ( | R n ^ ( f ) - R ( f ) | ϵ ) E [ | R n ( f ) ^ - R ( f ) | 2 ] ϵ 2 var ( R ^ n ( f ) ) ϵ 2 = i = 1 n var ( L i n ) ϵ 2 = var ( ( X ) , Y ) n ϵ 2 = σ L 2 n ϵ 2 . (10)

So, the probability goes to zero at a rate of at least n-1n-1. However, it turns out that this is an extremely loose bound. According to the Central Limit Theorem

R n ^ ( f ) = 1 n i = 1 n L i N R ( f ) , σ L 2 n as n R n ^ ( f ) = 1 n i = 1 n L i N R ( f ) , σ L 2 n as n (11)

in distribution. This suggests that for large values of n,

P ( | R n ^ ( f ) - R ( f ) | ϵ ) O e - n ϵ 2 2 σ L 2 . P ( | R n ^ ( f ) - R ( f ) | ϵ ) O e - n ϵ 2 2 σ L 2 . (12)

That is, the Gaussian tail probability is tending to zero exponentially fast.

Chernoff's Bound

Note that for any nonnegative random variable ZZ and t>0t>0,

P ( Z t ) = P ( e s Z e s t ) E [ e s Z ] e s t , s > 0 by Markov's inequality . P ( Z t ) = P ( e s Z e s t ) E [ e s Z ] e s t , s > 0 by Markov's inequality . (13)

Chernoff's bound is based on finding the value of ss that minimizes the upper bound. If ZZ is a sum of independent random variables. For example, say

Z = i = 1 n ( f ( X i ) , Y i ) - R ( f ) = n R ^ n ( f ) - R ( f ) Z = i = 1 n ( f ( X i ) , Y i ) - R ( f ) = n R ^ n ( f ) - R ( f ) (14)

then the bound becomes

P i = 1 n ( L i - E [ L i ] ) t e - s t E [ e s i = 1 n ( L i - E [ L i ] ) ] e - s t i = 1 n E [ e s ( L i - E [ L i ] ) ] , from independence. P i = 1 n ( L i - E [ L i ] ) t e - s t E [ e s i = 1 n ( L i - E [ L i ] ) ] e - s t i = 1 n E [ e s ( L i - E [ L i ] ) ] , from independence. (15)

Thus, the problem of finding a tight bound boils down to finding a good bound for E[ss(Li-E[Li])]E[ss(Li-E[Li])]. Chernoff ('52), first studied this situation for binary random variables. Then, Hoeffding ('63) derived a more general result for arbitrary bounded random variables.

Hoeffding's Indequality

Theorem 1: Hoeffding's Inequality

Let Z1,Z2,...,ZnZ1,Z2,...,Zn be independent bounded random variables such that Zi[ai,bi]Zi[ai,bi] with probability 1. Let Sn=i=1nZiSn=i=1nZi. Then for any t>0t>0, we have

P ( | S n - E [ S n ] | t ) 2 e - 2 t 2 i = 1 n ( b i - a i ) 2 . P ( | S n - E [ S n ] | t ) 2 e - 2 t 2 i = 1 n ( b i - a i ) 2 . (16)

Proof

The key to proving Hoeffding's inequality is the following upper bound: if ZZ is a random variable with E[Z]=0E[Z]=0 and aZb,aZb, then

E [ e s Z ] e s 2 ( b - a ) 2 8 . E [ e s Z ] e s 2 ( b - a ) 2 8 . (17)

This upper bound is derived as follows. By the convexity of the exponential function,

e s z z - a b - a e s b + b - z b - a e s a , for a z b . e s z z - a b - a e s b + b - z b - a e s a , for a z b . (18)
Figure 2: Convexity of exponential function.
Figure 2 (convxx.png)

Thus,

E [ e s Z ] E Z - a b - a e s b + E b - Z b - a e s a = b b - a e s a - a b - a e s b , since E [ Z ] = 0 = ( 1 - θ + θ e s ( b - a ) ) e - θ s ( b - a ) , where θ = - a b - a . E [ e s Z ] E Z - a b - a e s b + E b - Z b - a e s a = b b - a e s a - a b - a e s b , since E [ Z ] = 0 = ( 1 - θ + θ e s ( b - a ) ) e - θ s ( b - a ) , where θ = - a b - a . (19)

Now let

u = s ( b - a ) and define φ ( u ) - θ u + log ( 1 - θ + θ e u ) . u = s ( b - a ) and define φ ( u ) - θ u + log ( 1 - θ + θ e u ) . (20)

Then we have

E [ e s Z ] ( 1 - θ + θ e s ( b - a ) ) e - θ s ( b - a ) = e φ ( u ) . E [ e s Z ] ( 1 - θ + θ e s ( b - a ) ) e - θ s ( b - a ) = e φ ( u ) . (21)

To minimize the upper bound let's express φ(u)φ(u) in a Taylor's series with remainder :

φ ( u ) = φ ( 0 ) + u φ ' ( 0 ) + u 2 2 φ ' ' ( v ) for some v [ 0 , u ] φ ( u ) = φ ( 0 ) + u φ ' ( 0 ) + u 2 2 φ ' ' ( v ) for some v [ 0 , u ] (22)
φ ' ( u ) = - θ + θ e u 1 - θ + θ e u φ ' ( u ) = 0 φ ' ' ( u ) = θ e u 1 - θ + θ e u - θ e u ( 1 - θ + θ e u ) 2 = θ e u 1 - θ + θ e u ( 1 - θ e u 1 - θ + θ e u ) = ρ ( 1 - ρ ) . φ ' ( u ) = - θ + θ e u 1 - θ + θ e u φ ' ( u ) = 0 φ ' ' ( u ) = θ e u 1 - θ + θ e u - θ e u ( 1 - θ + θ e u ) 2 = θ e u 1 - θ + θ e u ( 1 - θ e u 1 - θ + θ e u ) = ρ ( 1 - ρ ) . (23)

Now, φ''(u)φ''(u) is maximized by

ρ = θ e u 1 - θ + θ e u = 1 2 φ ' ' ( u ) 1 4 . ρ = θ e u 1 - θ + θ e u = 1 2 φ ' ' ( u ) 1 4 . (24)

So,

φ ( u ) u 2 8 = s 2 ( b - a ) 2 8 φ ( u ) u 2 8 = s 2 ( b - a ) 2 8 (25)
E [ e s Z ] e s 2 ( b - a ) 2 8 . E [ e s Z ] e s 2 ( b - a ) 2 8 . (26)

Now, we can apply this upper bound to derive Hoeffding's inequality.

P ( S n - E [ S n ] t ) e - s t i = 1 n E [ e s ( L i - E [ L i ] ) ] e - s t i = 1 n e s 2 ( b i - a i ) 2 8 = e - s t e s 2 i = 1 n ( b i - a i ) 2 8 = e - 2 t 2 i = 1 n ( b i - a i ) 2 by choosing s = 4 t i = 1 n ( b i - a i ) 2 P ( S n - E [ S n ] t ) e - s t i = 1 n E [ e s ( L i - E [ L i ] ) ] e - s t i = 1 n e s 2 ( b i - a i ) 2 8 = e - s t e s 2 i = 1 n ( b i - a i ) 2 8 = e - 2 t 2 i = 1 n ( b i - a i ) 2 by choosing s = 4 t i = 1 n ( b i - a i ) 2 (27)

Similarly, P(E[Sn]-Snt)e-2t2i=1n(bi-ai)2P(E[Sn]-Snt)e-2t2i=1n(bi-ai)2. This completes the proof of the Hoeffding's theorem.

Example

Application

Let Zi=1f(Xi)Yi-R(f),Zi=1f(Xi)Yi-R(f), as in the classification problem. Then for a fixed f, it follows from Hoeffding's inequality (i.e., Chernoff's bound in this special case) that

P ( | R n ^ ( f ) - R ( f ) | ϵ ) = P 1 n | S n - E [ S n ] | ϵ = P ( | S n - E [ S n ] | n ϵ ) 2 e - 2 ( n ϵ ) 2 n = 2 e - 2 n ϵ 2 . P ( | R n ^ ( f ) - R ( f ) | ϵ ) = P 1 n | S n - E [ S n ] | ϵ = P ( | S n - E [ S n ] | n ϵ ) 2 e - 2 ( n ϵ ) 2 n = 2 e - 2 n ϵ 2 . (28)

Now, we want a bound like this to hold uniformly for all fFfF. Assume that FF is a finite collection of models and let |F||F| denote its cardinality. We would like to bound the probability that maxfF|Rn^(f)-R(f)|ϵmaxfF|Rn^(f)-R(f)|ϵ. Note that the event

max f F | R n ^ ( f ) - R ( f ) | ϵ f F | R n ^ ( f ) - R ( f ) | ϵ . max f F | R n ^ ( f ) - R ( f ) | ϵ f F | R n ^ ( f ) - R ( f ) | ϵ . (29)

Therefore

P max f F | R n ^ ( f ) - R ( f ) | ϵ = P f F | R n ^ ( f ) - R ( f ) | ϵ f F P ( | R n ^ ( f ) - R ( f ) | ϵ ) , the `` union of events '' bound 2 | F | e - 2 n ϵ 2 , by Hoeffding's inequality. P max f F | R n ^ ( f ) - R ( f ) | ϵ = P f F | R n ^ ( f ) - R ( f ) | ϵ f F P ( | R n ^ ( f ) - R ( f ) | ϵ ) , the `` union of events '' bound 2 | F | e - 2 n ϵ 2 , by Hoeffding's inequality. (30)

Thus, we have shown that with probability at least 1-2|F|e-2nϵ21-2|F|e-2nϵ2, fFfF

| R n ^ ( f ) - R ( f ) | < ϵ . | R n ^ ( f ) - R ( f ) | < ϵ . (31)

And accordingly, we can be reasonably confident in selecting ff from FF based on the empirical risk function R^nR^n.

Content actions

Download module as:

PDF | EPUB (?)

What is an EPUB file?

EPUB is an electronic book format that can be read on a variety of mobile devices.

Downloading to a reading device

For detailed instructions on how to download this content's EPUB to your specific device, click the "(?)" link.

| More downloads ...

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks

Reuse / Edit:

Reuse or edit module (?)

Check out and edit

If you have permission to edit this content, using the "Reuse / Edit" action will allow you to check the content out into your Personal Workspace or a shared Workgroup and then make your edits.

Derive a copy

If you don't have permission to edit the content, you can still use "Reuse / Edit" to adapt the content by creating a derived copy of it and then editing and publishing the copy.