Skip to content Skip to navigation

Connexions

You are here: Home » Content » Error Bounds in Countably Infinite Spaces

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

This feature requires Javascript to be enabled.

Error Bounds in Countably Infinite Spaces

Module by: Robert Nowak

Introduction

In the last lecture, we studied bounds of the following form: for any δ>0δ>0, with probability at least 1-δ1-δ,

R ( f ) R ^ n ( f ) + log | F | + log 1 δ 2 n , f F R ( f ) R ^ n ( f ) + log | F | + log 1 δ 2 n , f F (1)

which led to upper bounds on the estimation error of the form

E [ R ( f ^ n ) ] - min f F R ( f ) log | F | + log ( n ) + 2 n E [ R ( f ^ n ) ] - min f F R ( f ) log | F | + log ( n ) + 2 n (2)

The key assumptions made in deriving the error bounds were:

(i) bounded loss function

(ii) finite collection of candidate functions

The bounds are valid for every PXYPXY and are called distribution-free .

Deriving Bounds for Countably Infinite Spaces

In this lecture we will generalize the previous results in a powerful way by developing bounds applicable to possibly infinite collections of candidates. To start let us suppose that FF is a countable, possibly infinite, collection of candidate functions. Assign a positive number c(ff) to each fFfF, such that

f F e - c ( f ) < f F e - c ( f ) < (3)

The numbers c(ff) can be interpreted as

(i) measures of complexity

(ii) -log of prior probabilities

(iii) codelengths

(4)

In particular, if P(ff) is the prior probability of ff then

e - - log p ( f ) = p ( f ) e - - log p ( f ) = p ( f ) (5)

so c(f)-logp(f)c(f)-logp(f) produces

f F e - c ( f ) = f F p ( f ) = 1 f F e - c ( f ) = f F p ( f ) = 1 (6)

Now recall Hoeffding's inequality. For each ff and every ϵ>0ϵ>0

P R ( f ) - R ^ n ( f ) ϵ e - 2 n ϵ 2 P R ( f ) - R ^ n ( f ) ϵ e - 2 n ϵ 2 (7)

or for every δ>0δ>0

P R ( f ) - R ^ n ( f ) log 1 δ 2 n δ P R ( f ) - R ^ n ( f ) log 1 δ 2 n δ (8)

Suppose δ>0δ>0 is specified. Using the values c(ff) for fFfF, define

δ ( f ) = e - c ( f ) δ δ ( f ) = e - c ( f ) δ (9)

Then we have

P R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n δ ( f ) P R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n δ ( f ) (10)
(11)

Furthermore we can apply the union bound as follows

P sup f F R ( f ) - R ^ n ( f ) - log ( 1 / δ ( f ) ) 2 n 0 P f F R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F P R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F δ ( f ) = f F e - c ( f ) δ = δ P sup f F R ( f ) - R ^ n ( f ) - log ( 1 / δ ( f ) ) 2 n 0 P f F R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F P R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F δ ( f ) = f F e - c ( f ) δ = δ (12)

So for any δ>0δ>0 with probability at least 1-δ1-δ, we have that fFfF

R ( f ) R ^ n ( f ) + log 1 δ ( f ) 2 n = R ^ n ( f ) + c ( f ) + log 1 δ 2 n R ( f ) R ^ n ( f ) + log 1 δ ( f ) 2 n = R ^ n ( f ) + c ( f ) + log 1 δ 2 n (13)

Special Case

Suppose FF is finite and c(f)=log|F|fFc(f)=log|F|fF. Then

f F e - c ( f ) = f F e - log | F | = f F 1 | F | = 1 f F e - c ( f ) = f F e - log | F | = f F 1 | F | = 1 (14)

and

δ ( f ) = δ | F | δ ( f ) = δ | F | (15)

which implies that for any δ>0δ>0 with probability at least 1-δ1-δ, we have

R ( f ) R ^ n ( f ) + log | F | + log 1 δ ( f ) 2 n , f F R ( f ) R ^ n ( f ) + log | F | + log 1 δ ( f ) 2 n , f F (16)

Note that this is precisely the bound we derived in the last lecture.

Choosing c(ff)

The generalized bounds allow us to handle countably infinite collections of candidate functions, but we require that

f F e - c ( f ) < f F e - c ( f ) < (17)

Of course, if c(f)=-logp(f)c(f)=-logp(f) where p(f)p(f) is a proper prior probability distribution then we have

f F e - c ( f ) = 1 f F e - c ( f ) = 1 (18)

However, it may be difficult to design a probability distribution over an infinite class of candidates. The coding perspective provides a very practical means to this end.

Assume that we have assigned a uniquely decodable binary code to each fFfF, and let c(ff) denote the codelength for ff. That is, the code for ff is c(ff) bits long. A very useful class of uniquely decodable codes are called prefix codes .

Definition 1 A code is called a prefix code if no codeword is a prefix of any other codeword.

Example 1 (From Cover && Thomas '91)

Consider an alphabet of symbols, say A,B,CA,B,C, and DD and the codebooks below

Figure 1
Figure1.png

In the singular codebook we assign the same codeword to each symbol - a system that is obviously flawed! In the second case, the codes are not singular but the codeword 010 could represent B or CA or AD. Hence it is not a uniquely decodable codebook.

The third and fourth cases are both examples of uniquely decodable codebooks, but the fourth has the added feature that no codeword is a prefix of another. Prefix codes can be decoded from left to right since each codeword is “self-punctuating" - in this case with a zero to indicate the end of each word.

To design a uniquely decodable codebook in general is as challenging as the problem of selecting c(ff) to satisfy

f F e - c ( f ) < f F e - c ( f ) < (19)

However, prefix codes can often be easily designed or specified and they are inherently decodable. Moreover, prefix codes satisfy an important inequality called the Kraft Inequality .

The Kraft Inequality

For any binary prefix code, the codeword lengths c1c1, c2c2, ... satisfy

i = 1 2 - c i 1 i = 1 2 - c i 1 (20)

Conversely, given any c1c1, c2c2, ... satisfying the inequality above we can construct a prefix code with these codeword lengths. We will prove this result a bit later, but now let's see how this is useful in our learning problem.

Assume that we have assigned a binary prefix codeword to each fFfF, and let c(ff) denote the bit-length of the codeword for ff. Set δ(f)=2-c(f)δδ(f)=2-c(f)δ. Then

P f F R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F P R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F δ ( f ) = f F 2 - c ( f ) δ = δ P f F R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F P R ( f ) - R ^ n ( f ) log 1 δ ( f ) 2 n f F δ ( f ) = f F 2 - c ( f ) δ = δ (21)

This implies that for any δ>0δ>0 with probability at least 1-δ1-δ we have fFfF

R ( f ) R ^ n ( f ) + log 1 δ ( f ) 2 n = R ^ n ( f ) + c ( f ) log 2 + log 1 δ 2 n R ( f ) R ^ n ( f ) + log 1 δ ( f ) 2 n = R ^ n ( f ) + c ( f ) log 2 + log 1 δ 2 n (22)
(23)

Application

Let F1F1, F1F1, ... be a sequence of finite sets of candidate functions with |F1|<|F1|<...|F1|<|F1|<... We can design prefix codes as follows. Use the codes 0, 10, 110, 1110, ... to encode the subscript ii in |Fi||Fi|. For each class |Fi||Fi|, construct a set of binary codewords of length log2|F|log2|F| to uniquely encode each function in FiFi. Then, encode any given function ff by first using the code for ii corresponding to the smallest FiFi that ff belongs to, followed by the length log2|F|log2|F| codeword for fFifFi. This is a prefix code.

Example 2 Histogram Classifiers

X=[0,1]dd, Y={0,1}. Let FkFk, k=1, 2, ... denote the collection of histogram classification rules with k equal volume bins. We can use the following codebook for the index k

Figure 2
Figure2.png

And follow this codeword with k=log2k=log2|Fk||Fk| bits to indicate which of the 2kk possible histogram rules is under consideration. Thus for any fFkfFk for some k 1 there is a prefix code of length

c ( f ) = k + k = 2 k b i t s c ( f ) = k + k = 2 k b i t s (24)

It follows that for any δ>0δ>0 with probability at least 1-δ1-δ we have fk1Fkfk1Fk

R ( f ) R ^ n ( f ) + 2 k f log 2 + log 1 δ 2 n R ( f ) R ^ n ( f ) + 2 k f log 2 + log 1 δ 2 n (25)

where kfkf is the number of bins in histogram corresponding to ff. Contrast with the bound we had for the class of m bin histograms alone: with probability 1-δ1-δ, fFmfFm

R ( f ) R ^ n ( f ) + m log 2 + log 1 δ ( f ) 2 n R ( f ) R ^ n ( f ) +