Skip to content Skip to navigation

Connexions

You are here: Home » Content » Plug-In Classifier and Histogram Classifier

Navigation

Lenses

What is a lens?

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

This content is ...

Endorsed by Endorsed (What does "Endorsed by" mean?)

This content has been endorsed by the organizations listed. Click each link for a list of all content endorsed by the organization.

Recently Viewed

This feature requires Javascript to be enabled.
 

Plug-In Classifier and Histogram Classifier

Module by: Robert Nowak. E-mail the author

We return to the topic of classification, and we assume an input (feature) space XX and a binary output (label) space Y={0,1}Y={0,1}. Recall that the Bayes classifier (which minimizes the probability of misclassification) is defined by

f * ( x ) = 1 , P ( Y = 1 | X = x ) 1 / 2 0 , o t h e r w i s e . f * ( x ) = 1 , P ( Y = 1 | X = x ) 1 / 2 0 , o t h e r w i s e .
(1)

Throughout this section, we will denote the conditional probability function by

η ( x ) P ( Y = 1 | X = x ) . η ( x ) P ( Y = 1 | X = x ) .
(2)

Plug-in Classifiers

One way to construct a classifier using the training data {Xi,Yi}i=1n{Xi,Yi}i=1n is to estimate η(x)η(x) and then plug-it into the form of the Bayes classifier. That is obtain an estimate,

η ^ n ( x ) = η ( x ; { X i , Y i } i = 1 n ) η ^ n ( x ) = η ( x ; { X i , Y i } i = 1 n )
(3)

and then form the “plug-in" classification rule

f ^ ( x ) = 1 , η ^ ( x ) 1 / 2 0 , o t h e r w i s e . f ^ ( x ) = 1 , η ^ ( x ) 1 / 2 0 , o t h e r w i s e .
(4)

Remark:

The function η(x)η(x) is generally more complicated than the ultimate classification rule (binary-valued), as we can see
η : X [ 0 , 1 ] f : X { 0 , 1 } . η : X [ 0 , 1 ] f : X { 0 , 1 } .
(5)

Therefore, in this sense plug-in methods are solving a more complicated problem than necessary. However, plug-in methods can perform well, as demonstrated by the next result.

Theorem 1: Plug-in Classifier

Let η˜η˜ be an approximation to ηη, and consider the plug-in rule

f ( x ) = 1 , η ˜ ( x ) 1 / 2 0 , o t h e r w i s e . f ( x ) = 1 , η ˜ ( x ) 1 / 2 0 , o t h e r w i s e .
(6)

Then,

R ( f ) - R * 2 E [ | η ( x ) - η ˜ ( x ) | ] R ( f ) - R * 2 E [ | η ( x ) - η ˜ ( x ) | ]
(7)

where

R ( f ) = P ( f ( X ) Y ) R * = R ( f * ) = inf f R ( f ) . R ( f ) = P ( f ( X ) Y ) R * = R ( f * ) = inf f R ( f ) .
(8)

Proof

Consider any xRdxRd. In proving the optimality of the Bayes classifier f*f* in Lecture 2, we showed that

P f ( x ) Y | X = x - P f * ( x ) Y | X = x = 2 η ( x ) - 1 1 { f * ( x ) = 1 } - 1 { f ( x ) = 1 } , P f ( x ) Y | X = x - P f * ( x ) Y | X = x = 2 η ( x ) - 1 1 { f * ( x ) = 1 } - 1 { f ( x ) = 1 } ,
(9)

which is equivalent to

P f ( x ) Y | X = x - P f * ( x ) Y | X = x = 2 η ( x ) - 1 1 { f * ( x ) f ( x ) } , P f ( x ) Y | X = x - P f * ( x ) Y | X = x = 2 η ( x ) - 1 1 { f * ( x ) f ( x ) } ,
(10)

since f*(x)=1f*(x)=1 whenever 2η(x)-1>02η(x)-1>0. Thus,

P ( f ( X ) Y ) - R * = R d 2 | η ( x ) - 1 / 2 | 1 { f * ( x ) f ( x ) } p X ( x ) d x where p X ( x ) is the marginal density of X R d 2 | η ( x ) - η ˜ ( x ) | 1 { f * ( x ) f ( x ) } p X ( x ) d x R d 2 | η ( x ) - η ˜ ( x ) | p X ( x ) d x = 2 E [ | η ( X ) - η ˜ ( X ) | ] P ( f ( X ) Y ) - R * = R d 2 | η ( x ) - 1 / 2 | 1 { f * ( x ) f ( x ) } p X ( x ) d x where p X ( x ) is the marginal density of X R d 2 | η ( x ) - η ˜ ( x ) | 1 { f * ( x ) f ( x ) } p X ( x ) d x R d 2 | η ( x ) - η ˜ ( x ) | p X ( x ) d x = 2 E [ | η ( X ) - η ˜ ( X ) | ]
(11)

where the first inequality follows from the fact

f ( x ) f * ( x ) | η ( x ) - η ˜ ( x ) | | η ( x ) - 1 / 2 | f ( x ) f * ( x ) | η ( x ) - η ˜ ( x ) | | η ( x ) - 1 / 2 |
(12)

and the second inequality is simply a result of the fact that 1{f*(x)f(x)}1{f*(x)f(x)} is either 0 or 1.

Figure 1: Pictorial illustration of |η(x)-η˜(x)||η(x)-1/2||η(x)-η˜(x)||η(x)-1/2| when f(x)f*(x)f(x)f*(x). Note that the inequality P(f(X)Y)-R*Rd2|η(x)-η˜(x)|1{f*(x)f(x)}pX(x)dxP(f(X)Y)-R*Rd2|η(x)-η˜(x)|1{f*(x)f(x)}pX(x)dx shows that the excess risk is at most twice the integral over the set where f*(x)f(x)f*(x)f(x). The difference |η(x)-η˜(x)||η(x)-η˜(x)| may be arbitrarily large away from this set without effecting the error rate of the classifier. This illustrates the fact that estimating ηη well everywhere (i.e., regression) is unnecessary for the design of a good classifier (we only need to determine where ηη crosses the 1/21/2-level). In other words, “classification is easier than regression.”
Figure 1 (eta_diff.png)

The theorem shows us that a good estimate of ηη can produce a good plug-in classification rule. By “good" estimate, we mean an estimator η˜η˜ that is close to ηη in expected L1-normL1-norm.

The Histogram Classifier

Let's assume that the (input) features are randomly distributed over the unit hypercube X=[0,1]dX=[0,1]d (note that by scaling and shifting any set of bounded features we can satisfy this assumption), and assume that the (output) labels are binary, i.e., Y={0,1}Y={0,1}. A histogram classifier is based on a partition the hypercube [0,1]d[0,1]d into MM smaller cubes of equal size.

Example 1: Partition of hypercube in 2 dimensions

Consider the unit square [0,1]2[0,1]2 and partition it into MM subsquares of equal area (assuming MM is a squared integer). Let the subsquares be denoted by {Qi},i=1,...,M{Qi},i=1,...,M.

Figure 2: Example of hypercube [0,1]2[0,1]2 in MM equally sized partition
Figure 2 (2dcubes.png)

Define the following piecewise-constant estimator of η(x)η(x):

η ^ n ( x ) = j = 1 M P ^ j 1 { x Q j } η ^ n ( x ) = j = 1 M P ^ j 1 { x Q j }
(13)

where

P ^ j = i = 1 n 1 { X i Q j , Y i = 1 } i = 1 n 1 { X i Q j } . P ^ j = i = 1 n 1 { X i Q j , Y i = 1 } i = 1 n 1 { X i Q j } .
(14)

Like our previous denoising examples, we expect that the bias of η^nη^n will decrease as MM increases, but the variance will increase as MM increases.

Theorem 2: Consistency of Histogram Classifiers

If MM and nMnM as nn, then the histogram classifier risk converges to the Bayes risk for every distribution PXYPXY with marginal density pX(x)cpX(x)c, for some constant c>0c>0.1.

What the theorem tells us is that we need the number of partition cells to tend to infinity (to insure that the bias tends to zero), but they can't grow faster than the number of samples (i.e., we want the number of samples per box tending to infinity to drive the variance to zero).

Proof

Let PjQjη(x)pX(x)dxQjpX(x)dxPjQjη(x)pX(x)dxQjpX(x)dx (the theoretical analog of P^jP^j) and define

η ¯ ( x ) = j = 1 M P j 1 { x Q j } η ¯ ( x ) = j = 1 M P j 1 { x Q j }
(15)

The function η¯η¯ is the theoretical analog of η^η^ (i.e., the function obtained by averaging ηη over the partition cells). By the triangle inequality,

E | η ^ n ( X ) - η ( X ) | E [ | η ^ n ( X ) - η ¯ ( X ) | ] E s t i m a t i o n E r r o r + E [ | η ¯ n ( X ) - η ( X ) | ] A p p r o x i m a t i o n E r r o r E | η ^ n ( X ) - η ( X ) | E [ | η ^ n ( X ) - η ¯ ( X ) | ] E s t i m a t i o n E r r o r + E [ | η ¯ n ( X ) - η ( X ) | ] A p p r o x i m a t i o n E r r o r
(16)

Let's first bound the estimation error. For any x[0,1]dx[0,1]d, let Q(x)Q(x) denote the histogram bin in which xx falls in. Define the random variable

N ( x ) = i = 1 n 1 { X i Q ( x ) } N ( x ) = i = 1 n 1 { X i Q ( x ) }
(17)

If Q(x)=QjQ(x)=Qj, then this random variable is simply nP^jnP^j. Note that

η ^ n ( x ) = 1 N ( x ) B ( x ) η ^ n ( x ) = 1 N ( x ) B ( x )
(18)

where B(x)==i=1n1{XiQ(x),Yi=1}=i:XiQ(x)YiB(x)==i=1n1{XiQ(x),Yi=1}=i:XiQ(x)Yi. B(x)B(x) is simply the number of samples in cell Q(x)Q(x) labelled 1. Now η^n(x)η^n(x) is a fairly complicated random variable, but the conditional distribution of B(x)B(x) given N(x)N(x) is relatively simple. Note that

B ( x ) | N ( x ) = k Binomial k , η ¯ ( x ) B ( x ) | N ( x ) = k Binomial k , η ¯ ( x )
(19)

since η¯(x)η¯(x) is the probability of a sample in Q(x)Q(x) having the label 1 and we are conditioning on the event of observing kk samples in Q(x)Q(x).

Now consider the conditional expectation

E η ^ n ( x ) - η ¯ ( x ) N ( x ) = k E B ( x ) N ( x ) - η ¯ ( x ) N ( x ) = k , k > 0 1 , k = 0 ( since 0 η ¯ ( x ) 1 ) E η ^ n ( x ) - η ¯ ( x ) N ( x ) = k E B ( x ) N ( x ) - η ¯ ( x ) N ( x ) = k , k > 0 1 , k = 0 ( since 0 η ¯ ( x ) 1 )
(20)

Next note that

E B ( x ) N ( x ) - η ¯ ( x ) N ( x ) = k = E B ( x ) k - η ¯ ( x ) N ( x ) = k = E 1 k | B ( x ) - k η ¯ ( x ) E [ B ( x ) ] | N ( x ) = k 1 k ( E | B ( x ) - k η ¯ ( x ) | 2 N ( x ) = k conditional variance of B ( x ) ) 1 2 E B ( x ) N ( x ) - η ¯ ( x ) N ( x ) = k = E B ( x ) k - η ¯ ( x ) N ( x ) = k = E 1 k | B ( x ) - k η ¯ ( x ) E [ B ( x ) ] | N ( x ) = k 1 k ( E | B ( x ) - k η ¯ ( x ) | 2 N ( x ) = k conditional variance of B ( x ) ) 1 2
(21)

by the Jensen's inequality, E[|Z|](E[|Z|2])12E[|Z|](E[|Z|2])12.

Therefore,

E B ( x ) N ( x ) - η ¯ ( x ) N ( x ) = k 1 k k η ¯ ( x ) ( 1 - η ¯ ( x ) ) 1 2 = η ¯ ( x ) ( 1 - η ¯ ( x ) ) k E B ( x ) N ( x ) - η ¯ ( x ) N ( x ) = k 1 k k η ¯ ( x ) ( 1 - η ¯ ( x ) ) 1 2 = η ¯ ( x ) ( 1 - η ¯ ( x ) ) k
(22)

and

E | η ^ n ( x ) - η ¯ ( x ) | N ( x ) = k η ¯ ( x ) ( 1 - η ¯ ( x ) k , k > 0 1 , k = 0 E | η ^ n ( x ) - η ¯ ( x ) | N ( x ) = k η ¯ ( x ) ( 1 - η ¯ ( x ) k , k > 0 1 , k = 0
(23)

or in other words,

E | η ^ n ( x ) - η ¯ ( x ) | N ( x ) = k η ¯ ( x ) ( 1 - η ¯ ( x ) N ( x ) 1 { N ( x ) > 0 } + 1 { N ( x ) = 0 } E | η ^ n ( x ) - η ¯ ( x ) | N ( x ) = k η ¯ ( x ) ( 1 - η ¯ ( x ) N ( x ) 1 { N ( x ) > 0 } + 1 { N ( x ) = 0 }
(24)

Now taking expectation with respect to N(x)N(x)

E N E [ | η ^ n ( x ) - η ¯ ( x ) | N ( x ) = k ] E N η ¯ ( x ) ( 1 - η ¯ ( x ) N ( x ) 1 { N ( x ) > 0 } + P ( N ( x ) = 0 ) E 1 2 N ( x ) 1 { N ( x ) > 0 } + P ( N ( x ) = 0 ) 1 2 P ( N ( x ) k ) + 1 2 k P ( N ( x ) > k ) 1 + P ( N ( x ) = 0 ) E N E [ | η ^ n ( x ) - η ¯ ( x ) | N ( x ) = k ] E N η ¯ ( x ) ( 1 - η ¯ ( x ) N ( x ) 1 { N ( x ) > 0 } + P ( N ( x ) = 0 ) E 1 2 N ( x ) 1 { N ( x ) > 0 } + P ( N ( x ) = 0 ) 1 2 P ( N ( x ) k ) + 1 2 k P ( N ( x ) > k ) 1 + P ( N ( x ) = 0 )
(25)

Now a key fact is that for any k>0k>0, P(Nk)0asnP(Nk)0asn. This follows from the assumption that the marginal density pX(x)cpX(x)c, for some constant c>0c>0, and nMnM as nn. This result is easily verified by contradiction. If P(Nk)q>0asnP(Nk)q>0asn, then PX(x)>0PX(x)>0 is contradicted. Thus, for any ϵ>0ϵ>0 there exists a k>0k>0 such that 12k<ϵ12k<ϵ and P(Nk)<ϵP(Nk)<ϵ for nn sufficiently large. Therefore, for nn sufficiently large and every x[0,1]dx[0,1]d,

E | η ^ n ( x ) - η ¯ ( x ) | < 3 ϵ E | η ^ n ( x ) - η ¯ ( x ) | < 3 ϵ
(26)

where the expectation is with respect to the distribution of the sample {Xi,Yi}i=1n{Xi,Yi}i=1n. Thus,

E | η ^ n ( X ) - η ¯ ( X ) | < 3 ϵ E | η ^ n ( X ) - η ¯ ( X ) | < 3 ϵ
(27)

where the expectation is now with respect to the distribution of the sample and the marginal distribution of XX.

Next consider the approximation error E|η¯n(X)-η(X)|E|η¯n(X)-η(X)|, where the expectation is over XX alone. The function ηη may not itself be continuous, but there is another function ηϵηϵ that is uniformly continuous and such that E|ηϵ(X)-η(X)|<ϵE|ηϵ(X)-η(X)|<ϵ. Recall that uniformly continuous functions can be well approximated by piecewise constant functions.

By the triangle inequality,

E | η ¯ - η | E | η ¯ - η ¯ ϵ | ϵ + E | η ¯ ϵ - η ϵ | + E | η ϵ - η | ϵ by design E | η ¯ - η | E | η ¯ - η ¯ ϵ | ϵ + E | η ¯ ϵ - η ϵ | + E | η ϵ - η | ϵ by design
(28)

where η¯ϵ(x)=j=1mQjηϵ(x')pX(x')dx'1{xQj}η¯ϵ(x)=j=1mQjηϵ(x')pX(x')dx'1{xQj}.

E | η ¯ ( X ) - η ¯ ϵ ( X ) | = j = 1 m Q j | η ( x ) - η ϵ ( x ) | p X ( x ) d x 1 { x Q j } ϵ E | η ¯ ( X ) - η ¯ ϵ ( X ) | = j = 1 m Q j | η ( x ) - η ϵ ( x ) | p X ( x ) d x 1 { x Q j } ϵ
(29)

and since ηϵηϵ is uniformly continuous,

E | η ¯ ϵ ( X ) - η ϵ ( X ) | = j = 1 M Q j | η ¯ ϵ ( x ) - η ϵ ( x ) | 1 { x Q j } p X ( x ) d x j = 1 M δ P ( x Q j ) , where δ depends on M = δ , since j = 1 M P ( X Q j ) = 1 E | η ¯ ϵ ( X ) - η ϵ ( X ) | = j = 1 M Q j | η ¯ ϵ ( x ) - η ϵ ( x ) | 1 { x Q j } p X ( x ) d x j = 1 M δ P ( x Q j ) , where δ depends on M = δ , since j = 1 M P ( X Q j ) = 1
(30)

By taking MM sufficiently large, δδ can be made arbitrarily small. So for large MM, δϵδϵ.

Thus, we have shown

E | η ¯ ( X ) - η ( X ) | < 3 ϵ E | η ¯ ( X ) - η ( X ) | < 3 ϵ
(31)

for sufficiently large MM. Since ϵ>0ϵ>0 was arbitrary, we have shown that taking

f ^ n ( x ) = 1 , η ^ n ( x ) 1 / 2 0 , o t h e r w i s e f ^ n ( x ) = 1 , η ^ n ( x ) 1 / 2 0 , o t h e r w i s e
(32)

satisfies

P ( f ^ n ( X ) Y ) - P ( f * ( X ) Y ) 2 E | η ^ n ( X ) - η ( X ) | 0 P ( f ^ n ( X ) Y ) - P ( f * ( X ) Y ) 2 E | η ^ n ( X ) - η ( X ) | 0
(33)

if

M n M as n M n M as n
(34)
Note:
P(f^n(X)Y)=E[1{f^(X)Y}]P(f^n(X)Y)=E[1{f^(X)Y}] is the expected risk of f^f^, with expectation over the distributions of (X,Y)(X,Y) and {Xi,Yi}i=1n{Xi,Yi}i=1n.

Footnotes

  1. Actually, the result holds for every distribution PXYPXY. For the more general theorem, refer to Theorem 6.16.1 in A probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi and Gábor Lugosi.

Content actions

Download module as:

PDF | EPUB (?)

What is an EPUB file?

EPUB is an electronic book format that can be read on a variety of mobile devices.

Downloading to a reading device

For detailed instructions on how to download this content's EPUB to your specific device, click the "(?)" link.

| More downloads ...

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks