In the last lecture, we studied bounds of the following form: for any δ>0δ>0, with probability at least 1-δ1-δ,
R
(
f
)
≤
R
^
n
(
f
)
+
log
|
F
|
+
log
1
δ
2
n
,
∀
f
∈
F
R
(
f
)
≤
R
^
n
(
f
)
+
log
|
F
|
+
log
1
δ
2
n
,
∀
f
∈
F
(1)
which led to upper bounds on the
estimation error of the form
E
[
R
(
f
^
n
)
]
-
min
f
∈
F
R
(
f
)
≤
log
|
F
|
+
log
(
n
)
+
2
n
E
[
R
(
f
^
n
)
]
-
min
f
∈
F
R
(
f
)
≤
log
|
F
|
+
log
(
n
)
+
2
n
(2)
The key assumptions made
in deriving the error bounds were:
(i) bounded loss function
(ii) finite collection of candidate functions
The bounds are valid for every PXYPXY and are called
distribution-free
.
In this lecture we will generalize the previous results in a
powerful way by developing bounds applicable to possibly infinite
collections of candidates. To start let us suppose that FF is a
countable, possibly infinite, collection of candidate functions.
Assign a positive number c(ff) to each f∈Ff∈F, such that
∑
f
∈
F
e
-
c
(
f
)
<
∞
∑
f
∈
F
e
-
c
(
f
)
<
∞
(3)
The numbers c(ff) can be interpreted as
(i) measures of complexity
(ii) -log of prior probabilities
(iii) codelengths
(4)
In particular, if P(ff) is the prior
probability of ff then
e
-
-
log
p
(
f
)
=
p
(
f
)
e
-
-
log
p
(
f
)
=
p
(
f
)
(5)
so c(f)≡-logp(f)c(f)≡-logp(f) produces
∑
f
∈
F
e
-
c
(
f
)
=
∑
f
∈
F
p
(
f
)
=
1
∑
f
∈
F
e
-
c
(
f
)
=
∑
f
∈
F
p
(
f
)
=
1
(6)
Now recall Hoeffding's inequality. For each ff and every
ϵ>0ϵ>0
P
R
(
f
)
-
R
^
n
(
f
)
≥
ϵ
≤
e
-
2
n
ϵ
2
P
R
(
f
)
-
R
^
n
(
f
)
≥
ϵ
≤
e
-
2
n
ϵ
2
(7)
or for every δ>0δ>0
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
2
n
≤
δ
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
2
n
≤
δ
(8)
Suppose δ>0δ>0 is specified. Using the values c(ff) for
f∈Ff∈F, define
δ
(
f
)
=
e
-
c
(
f
)
δ
δ
(
f
)
=
e
-
c
(
f
)
δ
(9)
Then we have
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
δ
(
f
)
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
δ
(
f
)
(10)
(11)
Furthermore we can apply the union bound as
follows
P
sup
f
∈
F
R
(
f
)
-
R
^
n
(
f
)
-
log
(
1
/
δ
(
f
)
)
2
n
≥
0
≤
P
⋃
f
∈
F
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
δ
(
f
)
=
∑
f
∈
F
e
-
c
(
f
)
δ
=
δ
P
sup
f
∈
F
R
(
f
)
-
R
^
n
(
f
)
-
log
(
1
/
δ
(
f
)
)
2
n
≥
0
≤
P
⋃
f
∈
F
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
δ
(
f
)
=
∑
f
∈
F
e
-
c
(
f
)
δ
=
δ
(12)
So for any δ>0δ>0 with probability at least 1-δ1-δ, we have that ∀f∈F∀f∈F
R
(
f
)
≤
R
^
n
(
f
)
+
log
1
δ
(
f
)
2
n
=
R
^
n
(
f
)
+
c
(
f
)
+
log
1
δ
2
n
R
(
f
)
≤
R
^
n
(
f
)
+
log
1
δ
(
f
)
2
n
=
R
^
n
(
f
)
+
c
(
f
)
+
log
1
δ
2
n
(13)
Special Case
Suppose FF is finite and c(f)=log|F|∀f∈Fc(f)=log|F|∀f∈F. Then
∑
f
∈
F
e
-
c
(
f
)
=
∑
f
∈
F
e
-
log
|
F
|
=
∑
f
∈
F
1
|
F
|
=
1
∑
f
∈
F
e
-
c
(
f
)
=
∑
f
∈
F
e
-
log
|
F
|
=
∑
f
∈
F
1
|
F
|
=
1
(14)
and
δ
(
f
)
=
δ
|
F
|
δ
(
f
)
=
δ
|
F
|
(15)
which implies that for any δ>0δ>0 with probability
at least 1-δ1-δ, we have
R
(
f
)
≤
R
^
n
(
f
)
+
log
|
F
|
+
log
1
δ
(
f
)
2
n
,
∀
f
∈
F
R
(
f
)
≤
R
^
n
(
f
)
+
log
|
F
|
+
log
1
δ
(
f
)
2
n
,
∀
f
∈
F
(16)
Note that this is precisely the bound we derived in the last
lecture.
Choosing c(ff)
The generalized bounds allow us to handle countably infinite
collections of candidate functions, but we require that
∑
f
∈
F
e
-
c
(
f
)
<
∞
∑
f
∈
F
e
-
c
(
f
)
<
∞
(17)
Of course, if c(f)=-logp(f)c(f)=-logp(f) where p(f)p(f) is a proper prior
probability distribution then we have
∑
f
∈
F
e
-
c
(
f
)
=
1
∑
f
∈
F
e
-
c
(
f
)
=
1
(18)
However, it may be difficult to design a probability distribution
over an infinite class of candidates. The coding perspective
provides a very practical means to this end.
Assume that we have assigned a uniquely decodable binary code to
each f∈Ff∈F, and let c(ff) denote the codelength for ff. That
is, the code for ff is c(ff) bits long. A very useful class of
uniquely decodable codes are called prefix codes
.
Definition 1
A code is called a prefix code
if no codeword is a
prefix of any other codeword.
Example 1
(From Cover && Thomas '91)
Consider an alphabet of symbols, say A,B,CA,B,C, and DD and the
codebooks below
In the singular codebook we assign the same codeword to each
symbol - a system that is obviously flawed! In the second case,
the codes are not singular but the codeword 010 could represent B
or CA or AD. Hence it is not a uniquely decodable codebook.
The third and fourth cases are both examples of uniquely decodable
codebooks, but the fourth has the added feature that no codeword
is a prefix of another. Prefix codes can be decoded from left to
right since each codeword is “self-punctuating" - in this case
with a zero to indicate the end of each word.
To design a uniquely decodable codebook in general is as
challenging as the problem of selecting c(ff) to satisfy
∑
f
∈
F
e
-
c
(
f
)
<
∞
∑
f
∈
F
e
-
c
(
f
)
<
∞
(19)
However, prefix codes can often be
easily designed or specified and they are inherently decodable.
Moreover, prefix codes satisfy an important inequality called the Kraft Inequality
.
For any binary prefix code, the codeword lengths c1c1, c2c2, ...
satisfy
∑
i
=
1
∞
2
-
c
i
≤
1
∑
i
=
1
∞
2
-
c
i
≤
1
(20)
Conversely, given any c1c1, c2c2, ... satisfying the inequality
above we can construct a prefix code with these codeword lengths.
We will prove this result a bit later, but now let's see how this
is useful in our learning problem.
Assume that we have assigned a binary prefix codeword to each
f∈Ff∈F, and let c(ff) denote the bit-length of the codeword for
ff. Set δ(f)=2-c(f)δδ(f)=2-c(f)δ. Then
P
⋃
f
∈
F
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
δ
(
f
)
=
∑
f
∈
F
2
-
c
(
f
)
δ
=
δ
P
⋃
f
∈
F
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
P
R
(
f
)
-
R
^
n
(
f
)
≥
log
1
δ
(
f
)
2
n
≤
∑
f
∈
F
δ
(
f
)
=
∑
f
∈
F
2
-
c
(
f
)
δ
=
δ
(21)
This implies that for any δ>0δ>0 with probability at least 1-δ1-δ
we have ∀f∈F∀f∈F
R
(
f
)
≤
R
^
n
(
f
)
+
log
1
δ
(
f
)
2
n
=
R
^
n
(
f
)
+
c
(
f
)
log
2
+
log
1
δ
2
n
R
(
f
)
≤
R
^
n
(
f
)
+
log
1
δ
(
f
)
2
n
=
R
^
n
(
f
)
+
c
(
f
)
log
2
+
log
1
δ
2
n
(22)
(23)
Application
Let F1F1, F1F1, ... be a sequence of finite sets of candidate
functions with |F1|<|F1|<...|F1|<|F1|<... We can design prefix codes
as follows. Use the codes 0, 10, 110, 1110, ... to encode the
subscript ii in |Fi||Fi|. For each class |Fi||Fi|, construct a
set of binary codewords of length ⌈log2|F|⌉⌈log2|F|⌉ to
uniquely encode each function in FiFi. Then, encode any given
function ff by first using the code for ii corresponding to the
smallest FiFi that ff belongs to, followed by the length
⌈log2|F|⌉⌈log2|F|⌉ codeword for f∈Fif∈Fi. This is a
prefix code.
Example 2
Histogram Classifiers
X=[0,1]dd, Y={0,1}. Let FkFk, k=1, 2, ... denote the
collection of histogram classification rules with k equal volume
bins. We can use the following codebook for the index k
And follow this codeword with k=log2k=log2|Fk||Fk| bits to indicate
which of the 2kk possible histogram rules is under
consideration. Thus for any f∈Fkf∈Fk for some k ≥≥ 1 there
is a prefix code of length
c
(
f
)
=
k
+
k
=
2
k
b
i
t
s
c
(
f
)
=
k
+
k
=
2
k
b
i
t
s
(24)
It follows that for any δ>0δ>0 with probability at least 1-δ1-δ
we have ∀f∈⋃k≥1Fk∀f∈⋃k≥1Fk
R
(
f
)
≤
R
^
n
(
f
)
+
2
k
f
log
2
+
log
1
δ
2
n
R
(
f
)
≤
R
^
n
(
f
)
+
2
k
f
log
2
+
log
1
δ
2
n
(25)
where kfkf is the
number of bins in histogram corresponding to ff. Contrast with the
bound we had for the class of m bin histograms alone: with probability
≥1-δ≥1-δ, ∀f∈Fm∀f∈Fm
R
(
f
)
≤
R
^
n
(
f
)
+
m
log
2
+
log
1
δ
(
f
)
2
n
R
(
f
)
≤
R
^
n
(
f
)
+