The criterion used in the previous section - minimize the
average cost of an incorrect decision - may seem to be a
contrived way of quantifying decisions. Well, often it is. For
example, the Bayesian decision rule depends explicitly on the
a priori probabilities; expecting a rational method of
assigning values to these - either by experiment or through true
knowledge of the relative likelihood of each model - may be
unreasonable. In this section, we develop alternative decision
rules that try to answer such objections. One essential point
will emerge from these considerations: the fundamental
nature of the decision rule does not change with choice of
optimization criterion. Even criteria remote from
error measures can result in the likelihood ratio test (see
this problem).
Such results do not occur often in signal processing and
underline the likelihood ratio test's significance.
As only one model can describe any given set of data (the
models are mutually exclusive), the probability of being
correct
$P_c$ for distinguishing two models is given by
\[
P_c = \Pr[\text{say } \mathcal{M}_0 \text{ when } \mathcal{M}_0 \text{ true}] + \Pr[\text{say } \mathcal{M}_1 \text{ when } \mathcal{M}_1 \text{ true}]
\]
We wish to determine the optimum decision region
placement. Expressing the probability correct in terms of the
likelihood functions $p_{\mathbf{r}|\mathcal{M}_i}(\mathbf{r})$, the a priori probabilities, and
the decision regions,
\[
P_c = \int_{\Re_0} \pi_0\, p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})\,d\mathbf{r} + \int_{\Re_1} \pi_1\, p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r})\,d\mathbf{r}
\]
We want to maximize $P_c$ by selecting the decision regions
$\Re_0$ and $\Re_1$. The probability correct is maximized by
associating each value of $\mathbf{r}$ with the largest term in the
expression for $P_c$. Decision region $\Re_0$, for example, is defined
by the collection of values of $\mathbf{r}$ for which the first term is
largest. As all of the quantities involved are non-negative, the
decision rule maximizing the probability of a correct decision is

Given $\mathbf{r}$, choose $\mathcal{M}_i$ for which the product $\pi_i\, p_{\mathbf{r}|\mathcal{M}_i}(\mathbf{r})$ is largest.

Simple manipulations lead to the likelihood ratio test.
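\[
\frac{p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r})}{p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})} \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \frac{\pi_0}{\pi_1}
\]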
Note that if the Bayes' costs were chosen so that $C_{ii} = 0$ and
$C_{ij} = C$ ($i \neq j$), we would have the same threshold as in the
previous section.
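
The equivalence between "choose the model with the largest product of prior and likelihood" and the likelihood ratio test with threshold $\pi_0/\pi_1$ can be checked numerically. The following is a minimal sketch, not part of the original development: the scalar Gaussian models (means 0 and 1, unit variance), the prior values, and the function names `map_decision` and `lrt_decision` are all assumptions chosen for illustration.

```python
import math

def gaussian_pdf(r, mean, sigma):
    """Density of a scalar Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((r - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def map_decision(r, priors, likelihoods):
    """Return the index i that maximizes priors[i] * likelihoods[i](r)."""
    scores = [p * f(r) for p, f in zip(priors, likelihoods)]
    return max(range(len(scores)), key=scores.__getitem__)

def lrt_decision(r, priors, likelihoods):
    """Two-model likelihood ratio test with threshold pi_0 / pi_1."""
    ratio = likelihoods[1](r) / likelihoods[0](r)
    return 1 if ratio > priors[0] / priors[1] else 0

# Hypothetical example (assumed values): M0 has mean 0, M1 has mean 1, unit variance.
priors = [0.7, 0.3]
likelihoods = [
    lambda r: gaussian_pdf(r, 0.0, 1.0),
    lambda r: gaussian_pdf(r, 1.0, 1.0),
]

observations = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.5]
map_choices = [map_decision(r, priors, likelihoods) for r in observations]
lrt_choices = [lrt_decision(r, priors, likelihoods) for r in observations]
print(map_choices, lrt_choices)
```

For these assumed models the likelihood ratio is $e^{r - 1/2}$, so both rules switch from $\mathcal{M}_0$ to $\mathcal{M}_1$ at $r = 1/2 + \ln(7/3)$, and the two lists of decisions agree.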
To evaluate the quality of the decision rule, we usually
compute the probability of error $P_e$ rather than the probability
of being correct. This quantity can be expressed in terms of the
observations, the likelihood ratio, and the sufficient statistic.
\[
\begin{aligned}
P_e &= \pi_0 \int_{\Re_1} p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})\,d\mathbf{r} + \pi_1 \int_{\Re_0} p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r})\,d\mathbf{r} \\
&= \pi_0 \int_{\Lambda > \eta} p_{\Lambda|\mathcal{M}_0}(\Lambda)\,d\Lambda + \pi_1 \int_{\Lambda < \eta} p_{\Lambda|\mathcal{M}_1}(\Lambda)\,d\Lambda \\
&= \pi_0 \int_{\Upsilon > \gamma} p_{\Upsilon|\mathcal{M}_0}(\Upsilon)\,d\Upsilon + \pi_1 \int_{\Upsilon < \gamma} p_{\Upsilon|\mathcal{M}_1}(\Upsilon)\,d\Upsilon
\end{aligned}
\tag{1}
\]
When the likelihood ratio is non-monotonic, the
first expression is most difficult to evaluate. When it is
monotonic, the middle expression proves the most difficult.
Furthermore, these expressions point out that the likelihood
ratio and the sufficient statistic can each be considered a
function of the observations $\mathbf{r}$; hence, they are random
variables and have probability densities for each model. Another
aspect of the resulting probability of error is that no other
decision rule can yield a lower probability of
error. This statement is obvious as we minimized
the probability of error in deriving the likelihood ratio
test. The point is that these expressions represent a lower
bound on performance (as assessed by the probability of
error). This probability will be non-zero if the conditional
densities overlap over some range of values of $\mathbf{r}$, such as
occurred in the previous example. In this
region of overlap, the observed values are ambiguous: either
model is consistent with the observations. Our "optimum"
decision rule operates in such regions by selecting the model
that is most likely to have generated the observed value.
Situations occur frequently where assigning or measuring the
a priori probabilities $\pi_i$ is unreasonable. For example, just
what is the a priori probability of a supernova
occurring in any particular region of the sky? We clearly
need a model evaluation procedure that can function without
a priori probabilities. This kind of test
results when the so-called Neyman-Pearson criterion is used to
derive the decision rule. The ideas behind the Neyman-Pearson
criterion, and the decision rules derived with it (Neyman and Pearson),
will serve us well in the sequel; their result is important!
Using nomenclature from radar, where model $\mathcal{M}_1$
represents the presence of a target and $\mathcal{M}_0$ its absence,
the various types of correct and
incorrect decisions have the following names (Woodward, pp. 127-129).
- Detection -
we say it's there when it is;
$P_D = \Pr[\text{say } \mathcal{M}_1 \mid \mathcal{M}_1 \text{ true}]$
- False-alarm -
we say it's there when it's not;
$P_F = \Pr[\text{say } \mathcal{M}_1 \mid \mathcal{M}_0 \text{ true}]$
- Miss -
we say it's not there when it is;
$P_M = \Pr[\text{say } \mathcal{M}_0 \mid \mathcal{M}_1 \text{ true}]$
The remaining probability
$\Pr[\text{say } \mathcal{M}_0 \mid \mathcal{M}_0 \text{ true}]$
has historically been left nameless and equals $1 - P_F$. We should
also note that the detection and miss probabilities are related by
$P_M = 1 - P_D$. As these are conditional probabilities, they do
not depend on the a priori probabilities, and the two probabilities
$P_F$ and $P_D$ characterize the errors when
any decision rule is used.
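
Because each of these quantities is conditioned on which model is true, it can be estimated by simulating one model at a time, with no a priori probabilities involved. The sketch below illustrates this under assumed conditions: the scalar models (observation $\mathcal{N}(0,1)$ under $\mathcal{M}_0$, $\mathcal{N}(1,1)$ under $\mathcal{M}_1$), the threshold rule, and the name `estimate_rates` are hypothetical choices, not taken from the text.

```python
import random

def estimate_rates(threshold, trials, rng):
    """Monte Carlo estimates of (P_D, P_F, P_M) for the rule 'say M1 when r > threshold'."""
    # Assumed models: r ~ N(0, 1) when M0 is true, r ~ N(1, 1) when M1 is true.
    false_alarms = sum(rng.gauss(0.0, 1.0) > threshold for _ in range(trials))
    detections = sum(rng.gauss(1.0, 1.0) > threshold for _ in range(trials))
    p_f = false_alarms / trials             # said M1 while M0 was true
    p_d = detections / trials               # said M1 while M1 was true
    p_m = (trials - detections) / trials    # said M0 while M1 was true
    return p_d, p_f, p_m

rng = random.Random(1)  # fixed seed so the run is repeatable
p_d, p_f, p_m = estimate_rates(threshold=0.5, trials=20000, rng=rng)
print(f"P_D ~ {p_d:.3f}, P_F ~ {p_f:.3f}, P_M ~ {p_m:.3f}")
```

The estimates satisfy $P_M = 1 - P_D$ by construction, and no prior appears anywhere in the computation.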
These two probabilities are related to each other in an
interesting way. Expressing these quantities in terms of the
decision regions and the likelihood functions, we have
\[
P_F = \int_{\Re_1} p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})\,d\mathbf{r}
\]
\[
P_D = \int_{\Re_1} p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r})\,d\mathbf{r}
\]
As the region $\Re_1$ shrinks, both of these
probabilities tend toward zero; as $\Re_1$
expands to engulf the entire range of observation
values, they both tend toward unity. This rather direct
relationship between $P_D$ and $P_F$
does not mean that they equal each other;
in most cases, as $\Re_1$ expands, $P_D$
increases more rapidly than $P_F$
(we had better be right more often than we are
wrong!). However, the "ultimate" situation where a rule is
always right and never wrong
($P_D = 1$, $P_F = 0$) cannot occur when the conditional distributions
overlap. Thus, to increase the detection probability we must
also allow the false-alarm probability to increase. This
behavior represents the fundamental tradeoff in hypothesis
testing and detection theory.
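
The tradeoff can be made concrete by sweeping a one-sided decision region and tabulating the two probabilities. This is only a sketch under assumed conditions: the scalar models $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$, the region $\Re_1 = (t, \infty)$, and the name `operating_point` are illustrative assumptions rather than anything specified in the text.

```python
from statistics import NormalDist

# Assumed scalar problem: the observation is N(0, 1) under M0 and N(1, 1) under M1.
under_m0 = NormalDist(mu=0.0, sigma=1.0)
under_m1 = NormalDist(mu=1.0, sigma=1.0)

def operating_point(t):
    """Return (P_F, P_D) when the decision region for M1 is (t, infinity)."""
    p_f = 1.0 - under_m0.cdf(t)
    p_d = 1.0 - under_m1.cdf(t)
    return p_f, p_d

# Lowering the threshold t expands the region in which we say M1.
thresholds = [4.0, 2.0, 1.0, 0.5, 0.0, -2.0, -4.0]
roc = [operating_point(t) for t in thresholds]

for t, (p_f, p_d) in zip(thresholds, roc):
    print(f"t = {t:5.1f}   P_F = {p_f:.4f}   P_D = {p_d:.4f}")
```

In this assumed example both probabilities rise together from near zero to near one as the region expands, with $P_D$ staying above $P_F$ throughout.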
One can attempt to impose a performance criterion that depends
only on these probabilities with the consequent decision rule
not depending on the a priori
probabilities. The Neyman-Pearson criterion assumes that the
false-alarm probability is constrained to be less than or
equal to a specified value
$\alpha$ while we attempt to maximize the detection
probability $P_D$.
\[
\max_{\Re_1} P_D \quad \text{subject to} \quad P_F \le \alpha
\]
A subtlety of the succeeding solution is that the
underlying probability distribution functions may not be
continuous, with the result that
$P_F$ can never equal the constraining value $\alpha$.
Furthermore, an (unlikely) possibility is that the
optimum value for the false-alarm probability is somewhat less
than the criterion value. Assume, therefore, that we rephrase
the optimization problem by requiring that the false-alarm
probability equal a value $\alpha'$ that is less than or equal to
$\alpha$.
This optimization problem can be solved using Lagrange
multipliers (see Constrained
Optimization); we seek to find the decision rule that
maximizes
\[
F = P_D + \lambda \left( P_F - \alpha' \right)
\]
where $\lambda$ is the Lagrange multiplier. This optimization
technique amounts to finding the decision rule that maximizes
$F$, then finding the value of the multiplier that
allows the criterion to be satisfied. As is usual in the
derivation of optimum decision rules, we maximize these
quantities with respect to the decision regions. Expressing
$P_D$ and $P_F$ in terms of them, we have
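\[
\begin{aligned}
F &= \int_{\Re_1} p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r})\,d\mathbf{r} + \lambda \left( \int_{\Re_1} p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})\,d\mathbf{r} - \alpha' \right) \\
&= -\lambda \alpha' + \int_{\Re_1} \left[ p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r}) + \lambda\, p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r}) \right] d\mathbf{r}
\end{aligned}
\tag{2}
\]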
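To maximize this quantity with respect to $\Re_1$, we need only
to integrate over those regions of $\mathbf{r}$ where the integrand
is positive. The region $\Re_1$ thus corresponds to those values of
$\mathbf{r}$ where
\[
p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r}) > -\lambda\, p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})
\]
and the resulting decision rule is
\[
\frac{p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r})}{p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r})} \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} -\lambda
\]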
The ubiquitous likelihood ratio test again appears;
it is indeed the fundamental quantity in
hypothesis testing. Using the logarithm of the likelihood
ratio or the sufficient statistic, this result can be
expressed as either
\[
\ln \Lambda(\mathbf{r}) \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \ln(-\lambda)
\]
or
\[
\Upsilon(\mathbf{r}) \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \gamma
\]
We have not as yet found a value for the threshold. The
false-alarm probability can be expressed in terms of the
Neyman-Pearson threshold in two (useful) ways.
\[
P_F = \int_{-\lambda}^{\infty} p_{\Lambda|\mathcal{M}_0}(\Lambda)\,d\Lambda = \int_{\gamma}^{\infty} p_{\Upsilon|\mathcal{M}_0}(\Upsilon)\,d\Upsilon
\tag{3}
\]
One of these implicit equations must be solved for
the threshold by setting $P_F$ equal to $\alpha'$. The selection
of which to use is usually based on
pragmatic considerations: the easiest to compute. From the
previous discussion of the relationship between the detection
and false-alarm probabilities, we find that to maximize
$P_D$ we must allow $\alpha'$ to be as large as possible while
remaining less than or equal to $\alpha$. Thus, we want to find the
smallest value of $-\lambda$ (note the minus sign) consistent with the
constraint. Computation of the threshold is
problem-dependent, but a solution always exists.
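
When the decision statistic is discrete, the false-alarm probability moves in jumps, so the constraint value may not be attainable exactly; the threshold search then settles on the largest achievable false-alarm probability that does not exceed the constraint. The sketch below illustrates this with an assumed example that is not from the text: the statistic is taken to be a count that is Binomial(10, 0.5) under the null model, the rule says "target present" when the count exceeds an integer threshold `k`, and the function names are hypothetical.

```python
from math import comb

def tail_probability(k, n, p):
    """Pr[X > k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1, n + 1))

def neyman_pearson_threshold(alpha, n, p0):
    """Smallest integer threshold k with false-alarm probability Pr[X > k | M0] <= alpha.

    Returns the threshold and the false-alarm probability actually achieved.
    """
    for k in range(-1, n + 1):
        p_f = tail_probability(k, n, p0)
        if p_f <= alpha:
            return k, p_f
    # Not reachable for alpha >= 0, because Pr[X > n] = 0.
    raise ValueError("no threshold satisfies the constraint")

# Assumed null model: count ~ Binomial(10, 0.5); constraint on the false-alarm probability: 0.05.
k, alpha_prime = neyman_pearson_threshold(alpha=0.05, n=10, p0=0.5)
print(k, alpha_prime)
```

Here the threshold 7 would give a false-alarm probability of 56/1024, just above the constraint, so the search moves to 8 and achieves 11/1024, strictly below it.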
An important application of the likelihood ratio test occurs
when
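$\mathbf{r}$ is a Gaussian random vector for each model.
Suppose the models correspond to Gaussian random vectors
having different mean values but sharing the same
covariance matrix $\sigma^2 \mathbf{I}$.
- $\mathcal{M}_0$: $\mathbf{r} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$
- $\mathcal{M}_1$: $\mathbf{r} \sim \mathcal{N}(\mathbf{m}, \sigma^2 \mathbf{I})$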
Thus, $\mathbf{r}$ is of dimension $L$ and has statistically
independent, equal variance components. The vector of means
$\mathbf{m} = (m_0, \ldots, m_{L-1})^T$
distinguishes the two models. The likelihood
functions associated with this problem are
\[
p_{\mathbf{r}|\mathcal{M}_0}(\mathbf{r}) = \prod_{l=0}^{L-1} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{r_l}{\sigma}\right)^2}
\]
\[
p_{\mathbf{r}|\mathcal{M}_1}(\mathbf{r}) = \prod_{l=0}^{L-1} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{r_l - m_l}{\sigma}\right)^2}
\]
The likelihood ratio $\Lambda(\mathbf{r})$ becomes
\[
\Lambda(\mathbf{r}) = \frac{\prod_{l=0}^{L-1} e^{-\frac{1}{2}\left(\frac{r_l - m_l}{\sigma}\right)^2}}{\prod_{l=0}^{L-1} e^{-\frac{1}{2}\left(\frac{r_l}{\sigma}\right)^2}}
\]
This expression for the likelihood ratio is
complicated. In the Gaussian case (and many others), we use
the logarithm to reduce the complexity of the likelihood
ratio and form a sufficient statistic.
\[
\begin{aligned}
\ln \Lambda(\mathbf{r}) &= \sum_{l=0}^{L-1} \left[ -\frac{1}{2} \frac{(r_l - m_l)^2}{\sigma^2} + \frac{1}{2} \frac{r_l^2}{\sigma^2} \right] \\
&= \frac{1}{\sigma^2} \sum_{l=0}^{L-1} m_l r_l - \frac{1}{2\sigma^2} \sum_{l=0}^{L-1} m_l^2
\end{aligned}
\tag{4}
\]
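
The simplification in equation (4) follows from expanding the square, after which the $r_l^2$ terms cancel. As a quick numerical check, the sketch below evaluates both forms for one randomly drawn observation; the particular values of the standard deviation and mean vector, and the function names, are assumptions made only for this illustration.

```python
import math
import random

def log_likelihood_ratio_direct(r, m, sigma):
    """ln Lambda(r) summed term by term from the two Gaussian exponents."""
    return sum(
        -0.5 * ((rl - ml) / sigma) ** 2 + 0.5 * (rl / sigma) ** 2
        for rl, ml in zip(r, m)
    )

def log_likelihood_ratio_simplified(r, m, sigma):
    """ln Lambda(r) as (1/sigma^2) * sum(m_l r_l) - (1/(2 sigma^2)) * sum(m_l^2)."""
    dot = sum(ml * rl for rl, ml in zip(r, m))
    energy = sum(ml**2 for ml in m)
    return dot / sigma**2 - energy / (2 * sigma**2)

rng = random.Random(0)                   # fixed seed so the check is repeatable
sigma = 1.5                              # assumed noise standard deviation
m = [0.5, -1.0, 2.0, 0.25]               # assumed mean vector under M1
r = [rng.gauss(ml, sigma) for ml in m]   # one observation drawn under M1

direct = log_likelihood_ratio_direct(r, m, sigma)
simplified = log_likelihood_ratio_simplified(r, m, sigma)
print(direct, simplified, math.isclose(direct, simplified, rel_tol=1e-9, abs_tol=1e-12))
```

The two values agree to floating-point precision, and only the weighted sum of the observations depends on the data.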
The likelihood ratio test then has the much
simpler, but equivalent form
\[
\sum_{l=0}^{L-1} m_l r_l \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \sigma^2 \ln \eta + \frac{1}{2} \sum_{l=0}^{L-1} m_l^2
\]
To focus on the model evaluation aspects of this
problem, let's assume the means are all equal to a positive constant:
$m_l = m$ ($m > 0$).
\[
\sum_{l=0}^{L-1} r_l \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \frac{\sigma^2}{m} \ln \eta + \frac{Lm}{2}
\]
Note that all that need be known about the observations
$r_l$ is their sum. This quantity is the sufficient
statistic for the Gaussian problem:
$\Upsilon(\mathbf{r}) = \sum_{l=0}^{L-1} r_l$
and
$\gamma = \frac{\sigma^2 \ln \eta}{m} + \frac{Lm}{2}$.
When trying to compute the probability of error or the
threshold in the Neyman-Pearson criterion, we must find the
conditional probability density of one of the decision
statistics: the likelihood ratio, the log-likelihood, or the
sufficient statistic. The log-likelihood and the sufficient
statistic are quite similar in this problem, but clearly we
should use the latter. One practical property of the
sufficient statistic is that it usually simplifies
computations. For this Gaussian example, the sufficient
statistic is a Gaussian random variable under each model.
- $\mathcal{M}_0$: $\Upsilon(\mathbf{r}) \sim \mathcal{N}(0, L\sigma^2)$
- $\mathcal{M}_1$: $\Upsilon(\mathbf{r}) \sim \mathcal{N}(Lm, L\sigma^2)$
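
Since the sufficient statistic is Gaussian under each model, the Neyman-Pearson threshold for this example can be found by inverting a Gaussian tail probability, and the detection probability then follows from the density under the other model. The sketch below is one way to carry this out, assuming a continuous statistic so that the false-alarm probability can equal the constraint exactly; the numerical values and the name `np_threshold_and_detection` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def np_threshold_and_detection(alpha, L, m, sigma):
    """Threshold gamma on the sum statistic giving P_F = alpha, and the resulting P_D."""
    under_m0 = NormalDist(mu=0.0, sigma=sqrt(L) * sigma)    # sum ~ N(0, L sigma^2)
    under_m1 = NormalDist(mu=L * m, sigma=sqrt(L) * sigma)  # sum ~ N(L m, L sigma^2)
    gamma = under_m0.inv_cdf(1.0 - alpha)   # solves Pr[sum > gamma | M0] = alpha
    p_d = 1.0 - under_m1.cdf(gamma)         # Pr[sum > gamma | M1]
    return gamma, p_d

# Assumed values for illustration: 16 observations, mean 0.5, unit variance, P_F = 0.01.
gamma, p_d = np_threshold_and_detection(alpha=0.01, L=16, m=0.5, sigma=1.0)
print(f"gamma = {gamma:.4f}, P_D = {p_d:.4f}")
```

Note that the threshold depends only on the density under $\mathcal{M}_0$ and the constraint, while the mean $m$ enters only through the detection probability.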