Skip to content Skip to navigation

Connexions

You are here: Home » Content » Basic Elements of Statistical Decision Theory and Statistical Learning Theory

Navigation

Lenses

What is a lens?

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

This content is ...

Endorsed by Endorsed (What does "Endorsed by" mean?)

This content has been endorsed by the organizations listed. Click each link for a list of all content endorsed by the organization.

Recently Viewed

This feature requires Javascript to be enabled.
 

Basic Elements of Statistical Decision Theory and Statistical Learning Theory

Module by: Robert Nowak. E-mail the author

Summary: This paper reviews and contrasts the basic elements of statistical decision theory and statistical learning theory. It is not intended to be a comprehensive treatment of either subject, but rather just enough to draw comparisons between the two.

Throughout this module, let XX denote the input to a decision-making process and YY denote the correct response or output (e.g., the value of a parameter, the label of a class, the signal of interest). We assume that XX and YY are random variables or random vectors with joint distribution PX,Y(x,y)PX,Y(x,y), where xx and yy denote specific values that may be taken by the random variables XX and YY, respectively. The observation XX is used to make decisions pertaining to the quantity of interest. For the purposes of illustration, we will focus on the task of determining the value of the quantity of interest. A decision rule for this task is a function ff that takes the observation XX as input and outputs a prediction of the quantity YY. We denote a decision rule by Y^Y^ or f(X)f(X), when we wish to indicate explicitly the dependence of the decision rule on the observation. We will examine techniques for designing decision rules and for analyzing their performance.

Measuring Decision Accuracy: Loss and Risk Functions

The accuracy of a decision is measured with a loss function. For example, if our goal is to determine the value of YY, then a loss function takes as inputs the true value YY and the predicted value (the decision) Y^=f(X)Y^=f(X) and outputs a non-negative real number (the “loss”) reflective of the accuracy of the decision. Two of the most commonly encountered loss functions include:

  1. 0/1 loss: 0/1(Y^,Y)=IY^Y0/1(Y^,Y)=IY^Y, which is the indicator function taking the value of 1 when Y^YY^Y and taking the value 0 when Y^(X)=YY^(X)=Y.
  2. squared error loss: 2(Y^,Y)=Y^-Y222(Y^,Y)=Y^-Y22, which is simply the sum of squared differences between the elements of Y^Y^ and YY.

The 0/1 loss is commonly used in detection and classification problems, and the squared error loss is more appropriate for problems involving the estimation of a continuous parameter. Note that since the inputs to the loss function may be random variables, so is the loss.

A risk R(f)R(f) is a function of the decision rule ff, and is defined to be the expectation of a loss with respect to the joint distribution PX,Y(x,y)PX,Y(x,y). For example, the expected 0/1 loss produces the probability of error risk function; i.e., a simply calculation shows that R0/1(f)=E[(If(X)Y]=Pr(f(X)Y)R0/1(f)=E[(If(X)Y]=Pr(f(X)Y). The expected squared error loss produces the mean squared error MSE risk function, R2(f)=E[f(X)-Y22]R2(f)=E[f(X)-Y22].

Optimal decisions are obtained by choosing a decision rule ff that minimizes the desired risk function. Given complete knowledge of the probability distributions involved (e.g., PX,Y(x,y)PX,Y(x,y)) one can explicitly or numerically design an optimal decision rule, denoted f*f*, that minimizes the risk function.

The Maximum Likelihood Principle

The conditional distribution of the observation XX given the quantity of interest YY is denoted by PX|Y(x|y)PX|Y(x|y). The conditional distribution PX|Y(x|y)PX|Y(x|y) can be viewed as a generative model, probabilistically describing the observations resulting from a given value, yy, of the quantity of interest. For example, if yy is the value of a parameter, the PX|Y(x|y)PX|Y(x|y) is the probability distribution of the observation XX when the parameter value is set to yy. If XX is a continuous random variable with conditional density pX|Y(x|y)pX|Y(x|y) or a discrete random variable with conditional probability mass function (pmf) pX|Y(x|y)pX|Y(x|y), then given a value yy we can assess the probability of a particular measurment value yy by the magnitude of either the conditional density or pmf.

In decision making problems, we know the value of the observation, but do not know the value yy. Therefore, it is appealing to consider the conditional density or pmf as a function of the unknown values yy, with XX fixed at its observed value. The resulting function is called the likelihood function. As the name suggests, values of yy where the likelihood function is largest are intuitively reasonable indicators of the true value of the unknown quantity, which we will denote by y*y*. The rationale for this is that these values would produce conditional densities or pmfs that place high probability on the observation X=xX=x.

The Maximum Likelihood Estimator (MLE) is defined to be the value of yy that maximizes the likelihood function; i.e., in the continuous case

y ^ ( X ) = arg max y p X | Y ( X | y ) y ^ ( X ) = arg max y p X | Y ( X | y )
(1)

with an analogous definition for the discrete case by replacing the conditional density with the conditional pmf. The decision rule y^(X)y^(X) is called an “estimator,” which is common in decision problems involving a continuous parameter. Note that maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function (since the logarithm is a monotonic transformation). Now let y*y* denote the true value of YY. Then we can view the negative log-likelihood as a loss function

L ( y , y * ) = - log p X | Y ( X | y ) L ( y , y * ) = - log p X | Y ( X | y )
(2)

where the dependence on y*y* on the right hand side is embodied in the observation XX on the left. An interesting special case of the MLE results when the conditional density PX|Y(X|y)PX|Y(X|y) is a Gaussian, in which case the negative log-likelihood corresponds to a squared error loss function.

Now let us consider the expectation of this loss, with respect to the conditional distribution PX|Y(X|y*)PX|Y(X|y*):

- E [ log p X | Y ( X | y ) ] = log 1 p X | Y ( x | y ) p X | Y ( x | y * ) d x - E [ log p X | Y ( X | y ) ] = log 1 p X | Y ( x | y ) p X | Y ( x | y * ) d x
(3)

The true value y*y* minimizes the expected negative log-likelihood (or, equivalently, maximizes the expected log-likelihood ). To see this, compare the expected log-likelihood of y*y* with that of any other value yy:

E [ log p X | Y ( X | y * ) - log p X | Y ( X | y ) ] = E log p X | Y ( X | y * ) p X | Y ( X | y ) = log p X | Y ( x | y * ) p X | Y ( x | y ) p X | Y ( x | y * ) d x = KL ( p X | Y ( x | y * ) , p X | Y ( x | y ) ) . E [ log p X | Y ( X | y * ) - log p X | Y ( X | y ) ] = E log p X | Y ( X | y * ) p X | Y ( X | y ) = log p X | Y ( x | y * ) p X | Y ( x | y ) p X | Y ( x | y * ) d x = KL ( p X | Y ( x | y * ) , p X | Y ( x | y ) ) .
(4)

The quantity KL(pX|Y(x|y*),pX|Y(x|y))KL(pX|Y(x|y*),pX|Y(x|y)) is called the Kullback-Leibler (KL) divergence between the conditional density function pX|Y(x|y*)pX|Y(x|y*) and pX|Y(x|y)pX|Y(x|y). The KL divergence is non-negative, and zero if and only if the two densities are equal [1]. So, we see that the KL divergence acts as a sort of risk function in the context of Maximum Likelihood Estimation.

The Cramer-Rao Lower Bound

The MLE is based on finding the value for YY that maximizes the likelihood function. Intuitively, if the maximum point is very distinct, say a well isolated peak in the likelihood function, then the easier it will be to distinguish the MLE from alternative decisions. Consider the case in which YY is a scalar quantity. The “peakiness” of the log-likelihood function can be gauged by examining its curvature, -2logpX|Y(x|y)y2-2logpX|Y(x|y)y2, at the point of maximum likelihood. The higher the curvature, the more peaky is the behavior of the likelihood function at the maximum point. Of course, we hope that the MLE will be a good predictor (decision) for the unknown true value y*y*. So, rather than looking at the curvature of the log-likelihood function at the maximum likelihood point, a more appropriate measure of how easily it will be to distinguish y*y* from the alternatives is the expected curvature of the log-likelihood function evaluated at the value y*y*. The expectation taken over all possible observations with respect to the conditional density pX|Y(x|y*)pX|Y(x|y*). This quantity, denoted I(y*)=E[-2logpX|Y(x|y)y2]|y=y*I(y*)=E[-2logpX|Y(x|y)y2]|y=y*, is called the Fisher Information (FI). In fact, the FI provides us with an important performance bound known as the Cramer-Rao Lower Bound (CRLB).

The CRLB states that under some mild regularity assumptions about the conditional density function pX|Y(x|y)pX|Y(x|y), the variance of any unbiased estimator is bounded from below by the inverse of the I(y*)I(y*)[5], [4], [3]. Recall that an unbiased estimator is any estimator Y^Y^ that satisfies E[Y^]=y*E[Y^]=y*. The CRLB tells us is that

var ( Y ^ ) 1 I ( y * ) . var ( Y ^ ) 1 I ( y * ) .
(5)

If YY is a vector-valued quantity, then the expected negative Hessian matrix (matrix of partial second derivatives) of the log-likelihood function is called the Fisher Information Matrix (FIM), and a similar inequality tells us that the variance of each component of any unbiased estimator of y*y* is bounded below by the corresponding diagonal element of the inverse of the FIM. Since the MSE of an unbiased estimator is equal to its variance, we see that the CRLB provides a very useful lower bound on the best MSE performance that we can hope to achieve. Thus, the CRLB is often used as a comparison point for evaluating estimators. It may or may not be possible to achieve the CRLB, but if we find a decision rule that does, we know that it also minimizes the MSE risk among all possible unbiased estimators. In general, it may be difficult to compute the CRLB, but in certain important cases it is possible to find closed-form or computational solutions.

Bayesian Decision Theory

Bayesian Decision Theory provides a formal system for integrating prior knowledge and observed observations. For the purposes of illustration we will focus on problems involving continuous variables and observations, but extensions to discrete cases are straightforward (simple replace probability densities with probability mass functions, and integrals with summations). The key elements of Bayesian methods are:

  1. a prior probability density function pY(y)pY(y) describing a priori knowledge of probable states for the quantity YY;
  2. the likelihood function pX|Y(x|y)pX|Y(x|y), as described above;
  3. the posterior density function pY|X(y|x)pY|X(y|x).

The posterior density is a function of the prior and likelihood, obtained according to Bayes rule:

p Y | X ( y | x ) = p X | Y ( x | y ) p Y ( y ) p X | Y ( x | y ) p Y ( y ) d y . p Y | X ( y | x ) = p X | Y ( x | y ) p Y ( y ) p X | Y ( x | y ) p Y ( y ) d y .
(6)

The posterior is an indicator of probable values for YY, based on the prior knowledge and the observation. Several options exist for deriving a specific estimate of YY using the posterior. The mean value of the posterior density is one common choice (commonly called the posterior mean). The posterior mean is the decision rule that minimizes the expected squared error loss (MSE risk) function. The value yy where the posterior density is maximized is another popular estimator (commonly called the Maximum A Posteriori (MAP) estimator). Note that the denominator of the posterior is independent of yy, so the MAP estimator is simply the maximizer of the product of the likelihood and the prior. Therefore, if the prior is a constant function, the MAP estimator and MLE coincide.

Statistical Learning

In all of the methods described above, we assumed some amount of knowledge about the distributions of the observation XX and quantity of interest YY. Such knowledge can come from a careful analysis of the physical characteristics of the problem at hand, or it can be gleaned from previous experience. However, there are situations where it is difficult to model the physics of the problem and we may not have enough experience to develop complete and accurate probability models. In such cases, it is natural to adopt a statistical learning approach [2], [7].

Statistical learning methods are based on developing decision rules or estimators based only on a collection of training examples, rather than predetermined probability models. Statistical learning methods are often said to be distribution-free, since they do not assume particular probability models. The canonical set-up for statistical learning is as follows. We begin with a collection of training examples, {(Xi,Yi)}i=1n{(Xi,Yi)}i=1n, which are assumed to be independently and identically distributed according to an unknown probability distribution PX,Y(x,y)PX,Y(x,y). If we knew PX,Y(x,y)PX,Y(x,y), then we could compute a desired risk function and design an optimal decision rule using the methods described above. In essence, the training examples give us a glimpse at the underlying distribution, but our knowledge of it is far from complete. We cannot exactly compute a risk function, and therefore we cannot derive a corresponding optimal decision rule.

There are at least two ways to proceed at this point. One possibility is to use the training examples to estimate the joint probability distribution, and then use this estimate to derive an decision rule. Unfortunately, the (general-purpose) problem of estimating a distribution is often more difficult from a limited pool of data than is the problem of designing a specific-purpose decision rule. For this reason, a second possibility is more commonly favored in practice. Rather than estimating the complete distribution, one can use the training examples to directly design a decision rule. More precisely, perhaps the most common approach is to use the training examples to compute an estimate of the desired risk function.

Suppose that we are interested in minimizing a particular risk function. Recall that the risk is the expected value of a chosen loss function. Let (Y^,Y)(Y^,Y) denote the loss, and let f(X)f(X) denote a candidate decision function, mapping observations to predictions about YY (i.e., Y^=f(X)Y^=f(X)). The empirical risk function is constructed from the training examples as follows:

R ^ ( f ) = 1 n i = 1 n ( f ( X i ) , Y i ) . R ^ ( f ) = 1 n i = 1 n ( f ( X i ) , Y i ) .
(7)

This is simply the average loss of the decision rule ff over the set of training examples. Note that since the training examples are independent and identically distributed, the expected value of the empirical risk is equal to the true risk R(f)=E[(f(X),Y)]R(f)=E[(f(X),Y)]. Moreover, we known (according to the law of large numbers) that the empirical risk tends to the true risk as the size of the training sample increases. These facts lend support to the idea of choosing a decision rule to minimize the empirical risk.

Empirical risk minimization (ERM) is just this process. Given a collection of possible decision rules, say FF, ERM selects a decision rule according to

f ^ n = arg min f F R ^ ( f ) . f ^ n = arg min f F R ^ ( f ) .
(8)

The selected rule, f^nf^n, obviously depends on the given set of training examples, and therefore it is itself a random quantity. The theoretically optimal counterpart to f^nf^n is the decision rule that minimizes the true risk

f * = arg min f F R ( f ) . f * = arg min f F R ( f ) .
(9)

The central problem in statistical learning is to quantify how close f^nf^n performs relative to f*f*. Note that R(f*)R(f^n)R(f*)R(f^n), since f*f* minimizes the true risk. Thus, one way to gauge the performance of f^nf^n relative to f*f* is to show that there exists small positive values ϵϵ and δδ such that with probability at least 1-δ1-δ we have

R ( f ^ n ) R ( f * ) + ϵ . R ( f ^ n ) R ( f * ) + ϵ .
(10)

If an inequality of this form holds, then we say that f^nf^n is a Probability Approximately Correct (PAC) decision rule [6].

To show that the empirical risk minimizer is a PAC decision rule, we first must understand how closely the empirical risk matches the true risk. First, let us consider the empirical and true risk of the decision rule ff. Assume that the loss function is bounded between 0 and 1 (possibly after a suitable normalization). Then the empirical risk function is a sum of independent random variables bounded between 0 and 1. Hoeffding's inequality is a bound on the deviations of such random sums from their corresponding mean values [2]. In this case, the mean value is the true risk of ff, and Hoeffding's inequality states that

P ( | R ^ ( f ) - R ( f ) | > ϵ ) 2 e - 2 n ϵ 2 . P ( | R ^ ( f ) - R ( f ) | > ϵ ) 2 e - 2 n ϵ 2 .
(11)

Another equivalent statement is that the inequality |R^(f)-R(f)|ϵ|R^(f)-R(f)|ϵ holds with probability at least 1-2e-2nϵ21-2e-2nϵ2. Thus, the two risks are probably close together, and the greater the number of training examples, nn, the closer they are.

Now we would like a similar condition to hold for all fFfF, since ERM optimizes over the entire collection FF. Suppose that FF is a finite collection of decision rules. Let |F||F| denote the number of rules in FF. The probability that the difference between the true and empirical risks, of one or more of the decision rules, exceeds ϵϵ is bounded by the sum of the probabilities of each individual event of the form |R^(f)-R(f)|>ϵ|R^(f)-R(f)|>ϵ, the so-called Union of Events bound. Therefore, with probability at least 1-|F|2e-2nϵ21-|F|2e-2nϵ2 we have that

| R ^ ( f ) - R ( f ) | ϵ | R ^ ( f ) - R ( f ) | ϵ
(12)

for all fFfF. Equivalently, setting δ=2|F|e-2nϵ2δ=2|F|e-2nϵ2, we have that with probability at least 1-δ1-δ and for all fFfF

| R ^ ( f ) - R ( f ) | log | F | + log ( 2 / δ ) 2 n . | R ^ ( f ) - R ( f ) | log | F | + log ( 2 / δ ) 2 n .
(13)

Notice that the two risks are uniformly close together, and the closeness indicated by the bound increases as nn increases and decreases as the number of decision rules in FF increases. In fact, the bound scales with log|F|log|F|, and so it is reasonable to interpret the logarithm of the number of decision rules under consideration as a measure of the complexity of the class.

Now using this bound, we can show that f^nf^n is a PAC decision rule as follows. Note that with probability at least 1-δ1-δ

R ( f ^ n ) R ^ ( f ^ n ) + log | F | + log ( 2 / δ ) 2 n R ^ ( f * ) + log | F | + log ( 2 / δ ) 2 n R ( f * ) + 2 log | F | + log ( 2 / δ ) 2 n R ( f ^ n ) R ^ ( f ^ n ) + log | F | + log ( 2 / δ ) 2 n R ^ ( f * ) + log | F | + log ( 2 / δ ) 2 n R ( f * ) + 2 log | F | + log ( 2 / δ ) 2 n
(14)

where the first inequality follows since the true and empirical risks are close for all fFfF, and in particular for f^nf^n, the second inequality holds since by definition f^nf^n minimizes the empirical risk, and the third inequality holds again since the empirical risk is close to the true risk for all ff, in this case for f*f* in particular. So, we have shown that f^nf^n is PAC.

PAC bounds of this form can be extended in many directions, for example to infinitely large or uncountable classes of decision rules, but the basic ingredients of the theory are essentially like those demonstrated above. The bottom line is that empirical risk minimization is a reasonable approach, provided one has access to a sufficient number of training examples and the number, or more generally the complexity, of the class of decision rules under consideration is not too great.

Further reading

Excellent treatments of classical decision and estimation theory can be found in a number of textbooks [5], [4], [3], [1]. For references on statistical learning theory, outstanding textbooks are also available [2], [7], [6] for further reading.

References

  1. Cover, T. and Thomas, J. A. (1991). Elements of Information Theory. Wiley.
  2. Devroye, L. and Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Spring, New York.
  3. Kay, S. M. (1993). Fundamentals of Statistical Signal Processing. Prentice Hall.
  4. Lehmann, E. L. (1983). Theory of Point Estimation. Wiley, New York.
  5. Trees, H. L. Van. (1968). Detection, Estimation, and Modulation Theory, Part I. Wiley, New York.
  6. Valiant, L. G. (1984). A Theory of the Learnable. Communications of the ACM, 27, 1134-1142.
  7. Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.

Content actions

Download module as:

PDF | EPUB (?)

What is an EPUB file?

EPUB is an electronic book format that can be read on a variety of mobile devices.

Downloading to a reading device

For detailed instructions on how to download this content's EPUB to your specific device, click the "(?)" link.

| More downloads ...

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks