Throughout this module, let XX denote the input to a
decision-making process and YY denote the correct response or output (e.g., the value of a parameter, the label of a class, the
signal of interest). We assume that XX and YY are random variables
or random vectors with joint distribution PX,Y(x,y)PX,Y(x,y), where xx
and yy denote specific values that may be taken by the random
variables XX and YY, respectively. The observation XX is used to
make decisions pertaining to the quantity of interest. For the
purposes of illustration, we will focus on the task of determining the
value of the quantity of interest. A decision rule for this task is a function ff that takes the observation XX
as input and outputs a prediction of the quantity YY. We denote a
decision rule by Y^Y^ or f(X)f(X), when we wish to indicate
explicitly the dependence of the decision rule on the observation. We
will examine techniques for designing decision rules and for analyzing
their performance.
The accuracy of a decision is measured with a loss function. For example, if our goal is to determine the value of
YY, then a loss function takes as inputs the true value YY and
the predicted value (the decision) Y^=f(X)Y^=f(X) and
outputs a non-negative real number (the “loss”) reflective of the
accuracy of the decision. Two of the most commonly encountered loss
functions include:
- 0/1 loss: ℓ0/1(Y^,Y)=IY^≠Yℓ0/1(Y^,Y)=IY^≠Y, which is the indicator function
taking the value of 1 when Y^≠YY^≠Y and taking the
value 0 when Y^(X)=YY^(X)=Y.
- squared error loss: ℓ2(Y^,Y)=∥Y^-Y∥22ℓ2(Y^,Y)=∥Y^-Y∥22, which is simply the sum of squared
differences between the elements of Y^Y^ and YY.
The 0/1 loss is commonly used in detection and classification
problems, and the squared error loss is more appropriate for problems
involving the estimation of a continuous parameter. Note that since the
inputs to the loss function may be random variables, so is the loss.
A risk R(f)R(f) is a function of the decision rule ff, and
is defined to be the expectation of a loss with respect to the joint
distribution PX,Y(x,y)PX,Y(x,y). For example, the expected 0/1 loss
produces the probability of error risk function; i.e., a simply
calculation shows that R0/1(f)=E[(If(X)≠Y]=Pr(f(X)≠Y)R0/1(f)=E[(If(X)≠Y]=Pr(f(X)≠Y). The expected squared error loss
produces the mean squared error MSE risk function, R2(f)=E[∥f(X)-Y∥22]R2(f)=E[∥f(X)-Y∥22].
Optimal decisions are obtained by choosing a decision rule ff that
minimizes the desired risk function. Given complete knowledge of the
probability distributions involved (e.g., PX,Y(x,y)PX,Y(x,y)) one can
explicitly or numerically design an optimal decision rule, denoted
f*f*, that minimizes the risk function.
The conditional distribution of the observation XX given the quantity
of interest YY is denoted by PX|Y(x|y)PX|Y(x|y). The conditional
distribution PX|Y(x|y)PX|Y(x|y) can be viewed as a generative model,
probabilistically describing the observations resulting from a given
value, yy, of the quantity of interest. For example, if yy is the
value of a parameter, the PX|Y(x|y)PX|Y(x|y) is the probability
distribution of the observation XX when the parameter value is set to
yy. If XX is a continuous random variable with conditional density
pX|Y(x|y)pX|Y(x|y) or a discrete random variable with conditional
probability mass function (pmf) pX|Y(x|y)pX|Y(x|y), then given a value yy
we can assess the probability of a particular measurment value yy by
the magnitude of either the conditional density or pmf.
In decision making problems, we know the value of the observation, but
do not know the value yy. Therefore, it is appealing to consider the
conditional density or pmf as a function of the unknown values yy,
with XX fixed at its observed value. The resulting function is
called the likelihood function. As the name suggests, values of yy
where the likelihood function is largest are intuitively reasonable
indicators of the true value of the unknown quantity, which we will denote
by y*y*. The rationale for this is that these values would produce
conditional densities or pmfs that place high probability on the
observation X=xX=x.
The Maximum Likelihood Estimator
(MLE) is defined to be the value of yy that maximizes the
likelihood function; i.e., in the continuous case
y
^
(
X
)
=
arg
max
y
p
X
|
Y
(
X
|
y
)
y
^
(
X
)
=
arg
max
y
p
X
|
Y
(
X
|
y
)
(1)
with an analogous
definition for the discrete case by replacing the conditional density
with the conditional pmf. The decision rule y^(X)y^(X) is
called an “estimator,” which is common in decision problems
involving a continuous parameter. Note that maximizing the likelihood
function is equivalent to minimizing the negative log-likelihood
function (since the logarithm is a monotonic transformation). Now let
y*y* denote the true value of YY. Then we can view the negative
log-likelihood as a loss function
ℓ
L
(
y
,
y
*
)
=
-
log
p
X
|
Y
(
X
|
y
)
ℓ
L
(
y
,
y
*
)
=
-
log
p
X
|
Y
(
X
|
y
)
(2)
where the
dependence on y*y* on the right hand side is embodied in the
observation XX on the left. An interesting special case of the MLE
results when the conditional density PX|Y(X|y)PX|Y(X|y) is a Gaussian, in
which case the negative log-likelihood corresponds to a squared error
loss function.
Now let us consider the expectation of this
loss, with respect to the conditional distribution
PX|Y(X|y*)PX|Y(X|y*):
-
E
[
log
p
X
|
Y
(
X
|
y
)
]
=
∫
log
1
p
X
|
Y
(
x
|
y
)
p
X
|
Y
(
x
|
y
*
)
d
x
-
E
[
log
p
X
|
Y
(
X
|
y
)
]
=
∫
log
1
p
X
|
Y
(
x
|
y
)
p
X
|
Y
(
x
|
y
*
)
d
x
(3)
The true value y*y* minimizes the expected negative log-likelihood
(or, equivalently, maximizes the expected log-likelihood ). To see
this, compare the expected log-likelihood of y*y* with that of any
other value yy:
E
[
log
p
X
|
Y
(
X
|
y
*
)
-
log
p
X
|
Y
(
X
|
y
)
]
=
E
log
p
X
|
Y
(
X
|
y
*
)
p
X
|
Y
(
X
|
y
)
=
∫
log
p
X
|
Y
(
x
|
y
*
)
p
X
|
Y
(
x
|
y
)
p
X
|
Y
(
x
|
y
*
)
d
x
=
KL
(
p
X
|
Y
(
x
|
y
*
)
,
p
X
|
Y
(
x
|
y
)
)
.
E
[
log
p
X
|
Y
(
X
|
y
*
)
-
log
p
X
|
Y
(
X
|
y
)
]
=
E
log
p
X
|
Y
(
X
|
y
*
)
p
X
|
Y
(
X
|
y
)
=
∫
log
p
X
|
Y
(
x
|
y
*
)
p
X
|
Y
(
x
|
y
)
p
X
|
Y
(
x
|
y
*
)
d
x
=
KL
(
p
X
|
Y
(
x
|
y
*
)
,
p
X
|
Y
(
x
|
y
)
)
.
(4)
The quantity KL(pX|Y(x|y*),pX|Y(x|y))KL(pX|Y(x|y*),pX|Y(x|y)) is called the
Kullback-Leibler (KL) divergence between the conditional density
function pX|Y(x|y*)pX|Y(x|y*) and pX|Y(x|y)pX|Y(x|y). The KL divergence is
non-negative, and zero if and only if the two densities are equal
[1]. So, we see that the KL divergence acts as a sort of
risk function in the context of Maximum Likelihood Estimation.
The MLE is based on finding the value for YY that maximizes the
likelihood function. Intuitively, if the maximum point is very
distinct, say a well isolated peak in the likelihood function, then
the easier it will be to distinguish the MLE from alternative
decisions. Consider the case in which YY is a scalar quantity.
The “peakiness” of the log-likelihood function can be gauged by
examining its curvature, -∂2logpX|Y(x|y)∂y2-∂2logpX|Y(x|y)∂y2, at the point of maximum likelihood. The higher the
curvature, the more peaky is the behavior of the likelihood function
at the maximum point. Of course, we hope that the MLE will be a good
predictor (decision) for the unknown true value y*y*. So, rather
than looking at the curvature of the log-likelihood function at the
maximum likelihood point, a more appropriate measure of how easily it
will be to distinguish y*y* from the alternatives is the expected
curvature of the log-likelihood function evaluated at the value y*y*.
The expectation taken over all possible observations with respect to
the conditional density pX|Y(x|y*)pX|Y(x|y*). This quantity, denoted
I(y*)=E[-∂2logpX|Y(x|y)∂y2]|y=y*I(y*)=E[-∂2logpX|Y(x|y)∂y2]|y=y*, is called the Fisher Information (FI). In
fact, the FI provides us with an important performance bound known as
the Cramer-Rao Lower Bound (CRLB).
The CRLB states that under some mild regularity assumptions about the
conditional density function pX|Y(x|y)pX|Y(x|y), the variance of any
unbiased estimator is bounded from below by the inverse of the
I(y*)I(y*)[5], [4], [3]. Recall that an unbiased
estimator is any estimator Y^Y^ that satisfies
E[Y^]=y*E[Y^]=y*. The CRLB tells us is that
var
(
Y
^
)
≥
1
I
(
y
*
)
.
var
(
Y
^
)
≥
1
I
(
y
*
)
.
(5)
If YY is a
vector-valued quantity, then the expected negative Hessian matrix
(matrix of partial second derivatives) of the log-likelihood function
is called the Fisher Information Matrix (FIM), and a similar inequality tells us that the variance
of each component of any unbiased estimator of y*y* is bounded below
by the corresponding diagonal element of the inverse of the FIM.
Since the MSE of an unbiased estimator is equal to its variance, we
see that the CRLB provides a very useful lower bound on the best MSE
performance that we can hope to achieve. Thus, the CRLB is often used
as a comparison point for evaluating estimators. It may or may not be
possible to achieve the CRLB, but if we find a decision rule that
does, we know that it also minimizes the MSE risk among all possible
unbiased estimators. In general, it may be difficult to compute the
CRLB, but in certain important cases it is possible to find
closed-form or computational solutions.
Bayesian Decision Theory provides a formal system for integrating
prior knowledge and observed observations. For
the purposes of illustration we will focus on problems involving
continuous variables and observations, but extensions to discrete
cases are straightforward (simple replace probability densities with
probability mass functions, and integrals with summations). The key
elements of Bayesian methods are:
- a prior probability density function pY(y)pY(y) describing a
priori knowledge of probable states for the quantity YY;
- the likelihood function pX|Y(x|y)pX|Y(x|y), as described above;
- the posterior density function pY|X(y|x)pY|X(y|x).
The posterior density is a function of the prior and likelihood, obtained
according to Bayes rule:
p
Y
|
X
(
y
|
x
)
=
p
X
|
Y
(
x
|
y
)
p
Y
(
y
)
∫
p
X
|
Y
(
x
|
y
)
p
Y
(
y
)
d
y
.
p
Y
|
X
(
y
|
x
)
=
p
X
|
Y
(
x
|
y
)
p
Y
(
y
)
∫
p
X
|
Y
(
x
|
y
)
p
Y
(
y
)
d
y
.
(6)
The posterior is an indicator of
probable values for YY, based on the prior knowledge and the
observation. Several options exist for deriving a specific estimate
of YY using the posterior. The mean value of the posterior density
is one common choice (commonly called the posterior mean). The
posterior mean is the decision rule that minimizes the expected
squared error loss (MSE risk) function. The value yy where the
posterior density is maximized is another popular estimator (commonly
called the Maximum A Posteriori (MAP) estimator). Note that the
denominator of the posterior is independent of yy, so the MAP
estimator is simply the maximizer of the product of the likelihood and
the prior. Therefore, if the prior is a constant function, the MAP
estimator and MLE coincide.
In all of the methods described above, we assumed some amount of
knowledge about the distributions of the observation XX and quantity
of interest YY. Such knowledge can come from a careful analysis of
the physical characteristics of the problem at hand, or it can be
gleaned from previous experience. However, there are situations where
it is difficult to model the physics of the problem and we may not
have enough experience to develop complete and accurate probability
models. In such cases, it is natural to adopt a statistical
learning approach [2], [7].
Statistical learning methods are based on developing decision rules or
estimators based only on a collection of training examples, rather
than predetermined probability models. Statistical learning methods
are often said to be distribution-free, since they do not assume
particular probability models. The canonical set-up for statistical
learning is as follows. We begin with a collection of training
examples, {(Xi,Yi)}i=1n{(Xi,Yi)}i=1n, which are assumed to be
independently and identically distributed according to an unknown probability distribution PX,Y(x,y)PX,Y(x,y). If we knew
PX,Y(x,y)PX,Y(x,y), then we could compute a desired risk function and
design an optimal decision rule using the methods described above. In
essence, the training examples give us a glimpse at the underlying
distribution, but our knowledge of it is far from complete. We cannot
exactly compute a risk function, and therefore we cannot derive a
corresponding optimal decision rule.
There are at least two ways to proceed at this point. One possibility
is to use the training examples to estimate the joint probability
distribution, and then use this estimate to derive an decision rule.
Unfortunately, the (general-purpose) problem of estimating a
distribution is often more difficult from a limited pool of data than
is the problem of designing a specific-purpose decision rule. For
this reason, a second possibility is more commonly favored in
practice. Rather than estimating the complete distribution, one can
use the training examples to directly design a decision rule. More
precisely, perhaps the most common approach is to use the training
examples to compute an estimate of the desired risk function.
Suppose that we are interested in minimizing a particular risk
function. Recall that the risk is the expected value of a chosen loss
function. Let ℓ(Y^,Y)ℓ(Y^,Y) denote the loss, and let
f(X)f(X) denote a candidate decision function, mapping observations to
predictions about YY (i.e., Y^=f(X)Y^=f(X)). The
empirical risk function is constructed from the training examples
as follows:
R
^
(
f
)
=
1
n
∑
i
=
1
n
ℓ
(
f
(
X
i
)
,
Y
i
)
.
R
^
(
f
)
=
1
n
∑
i
=
1
n
ℓ
(
f
(
X
i
)
,
Y
i
)
.
(7)
This is simply the average loss of the decision rule ff over the set
of training examples. Note that since the training examples are
independent and identically distributed, the expected value of the
empirical risk is equal to the true risk R(f)=E[ℓ(f(X),Y)]R(f)=E[ℓ(f(X),Y)].
Moreover, we known (according to the law of large numbers) that the
empirical risk tends to the true risk as the size of the training
sample increases. These facts lend support to the idea of choosing a
decision rule to minimize the empirical risk.
Empirical risk minimization (ERM) is just this process. Given a
collection of possible decision rules, say FF, ERM selects a
decision rule according to
f
^
n
=
arg
min
f
∈
F
R
^
(
f
)
.
f
^
n
=
arg
min
f
∈
F
R
^
(
f
)
.
(8)
The
selected rule, f^nf^n, obviously depends on the given set of
training examples, and therefore it is itself a random quantity. The
theoretically optimal counterpart to f^nf^n is the decision
rule that minimizes the true risk
f
*
=
arg
min
f
∈
F
R
(
f
)
.
f
*
=
arg
min
f
∈
F
R
(
f
)
.
(9)
The central problem in
statistical learning is to quantify how close f^nf^n performs
relative to f*f*. Note that R(f*)≤R(f^n)R(f*)≤R(f^n), since
f*f* minimizes the true risk. Thus, one way to gauge the performance
of f^nf^n relative to f*f* is to show that there exists
small positive values ϵϵ and δδ such that with
probability at least 1-δ1-δ we have
R
(
f
^
n
)
≤
R
(
f
*
)
+
ϵ
.
R
(
f
^
n
)
≤
R
(
f
*
)
+
ϵ
.
(10)
If an inequality of
this form holds, then we say that f^nf^n is a Probability Approximately Correct (PAC) decision rule [6].
To show that the empirical risk minimizer is a PAC decision rule, we
first must understand how closely the empirical risk matches the true
risk. First, let us consider the empirical and true risk of the
decision rule ff. Assume that the loss function is bounded between
0 and 1 (possibly after a suitable normalization). Then the
empirical risk function is a sum of independent random variables
bounded between 0 and 1. Hoeffding's inequality is a bound on the
deviations of such random sums from their corresponding mean values
[2]. In this case, the mean value is the true risk of
ff, and Hoeffding's inequality states that
P
(
|
R
^
(
f
)
-
R
(
f
)
|
>
ϵ
)
≤
2
e
-
2
n
ϵ
2
.
P
(
|
R
^
(
f
)
-
R
(
f
)
|
>
ϵ
)
≤
2
e
-
2
n
ϵ
2
.
(11)
Another equivalent statement is that the inequality
|R^(f)-R(f)|≤ϵ|R^(f)-R(f)|≤ϵ holds with probability at least
1-2e-2nϵ21-2e-2nϵ2. Thus, the two risks are probably close
together, and the greater the number of training examples, nn, the
closer they are.
Now we would like a similar condition to hold for all f∈Ff∈F, since ERM optimizes over the entire collection FF.
Suppose that FF is a finite collection of decision rules. Let
|F||F| denote the number of rules in FF. The
probability that the difference between the true and empirical risks, of
one or more of the decision rules, exceeds ϵϵ is bounded by
the sum of the probabilities of each individual event of the form
|R^(f)-R(f)|>ϵ|R^(f)-R(f)|>ϵ, the so-called Union of Events
bound. Therefore, with probability at least 1-|F|2e-2nϵ21-|F|2e-2nϵ2 we have that
|
R
^
(
f
)
-
R
(
f
)
|
≤
ϵ
|
R
^
(
f
)
-
R
(
f
)
|
≤
ϵ
(12)
for all f∈Ff∈F.
Equivalently, setting δ=2|F|e-2nϵ2δ=2|F|e-2nϵ2, we have
that with probability at least 1-δ1-δ and for all f∈Ff∈F
|
R
^
(
f
)
-
R
(
f
)
|
≤
log
|
F
|
+
log
(
2
/
δ
)
2
n
.
|
R
^
(
f
)
-
R
(
f
)
|
≤
log
|
F
|
+
log
(
2
/
δ
)
2
n
.
(13)
Notice that the two risks are
uniformly close together, and the closeness indicated by the bound
increases as nn increases and decreases as the number of decision
rules in FF increases. In fact, the bound scales with
log|F|log|F|, and so it is reasonable to interpret the logarithm
of the number of decision rules under consideration as a measure of
the complexity of the class.
Now using this bound, we can show that f^nf^n is a PAC
decision rule as follows. Note that with probability at least 1-δ1-δ
R
(
f
^
n
)
≤
R
^
(
f
^
n
)
+
log
|
F
|
+
log
(
2
/
δ
)
2
n
≤
R
^
(
f
*
)
+
log
|
F
|
+
log
(
2
/
δ
)
2
n
≤
R
(
f
*
)
+
2
log
|
F
|
+
log
(
2
/
δ
)
2
n
R
(
f
^
n
)
≤
R
^
(
f
^
n
)
+
log
|
F
|
+
log
(
2
/
δ
)
2
n
≤
R
^
(
f
*
)
+
log
|
F
|
+
log
(
2
/
δ
)
2
n
≤
R
(
f
*
)
+
2
log
|
F
|
+
log
(
2
/
δ
)
2
n
(14)
where the first inequality follows since the true and empirical risks
are close for all f∈Ff∈F, and in particular for
f^nf^n, the second inequality holds since by definition
f^nf^n minimizes the empirical risk, and the third inequality
holds again since the empirical risk is close to the true risk for all
ff, in this case for f*f* in particular. So, we have shown that
f^nf^n is PAC.
PAC bounds of this form can be extended in many directions, for
example to infinitely large or uncountable classes of decision rules,
but the basic ingredients of the theory are essentially like those
demonstrated above. The bottom line is that empirical risk
minimization is a reasonable approach, provided one has access to a
sufficient number of training examples and the number, or more
generally the complexity, of the class of decision rules under
consideration is not too great.
Excellent treatments of classical decision and estimation theory can
be found in a number of textbooks
[5], [4], [3], [1]. For references on
statistical learning theory, outstanding textbooks are also available
[2], [7], [6] for further reading.
-
Cover, T. and Thomas, J. A. (1991). Elements of Information Theory. Wiley.
-
Devroye, L. and Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Spring, New York.
-
Kay, S. M. (1993). Fundamentals of Statistical Signal Processing. Prentice Hall.
-
Lehmann, E. L. (1983). Theory of Point Estimation. Wiley, New York.
-
Trees, H. L. Van. (1968). Detection, Estimation, and Modulation Theory, Part I. Wiley, New York.
-
Valiant, L. G. (1984). A Theory of the Learnable. Communications of the ACM, 27, 1134-1142.
-
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.