In statistics, hypothesis testing is sometimes known as
decision theory or simply testing. The key result around which
all decision theory revolves is the likelihood ratio test.
In a binary hypothesis testing problem, four possible outcomes
can result. Model $\mathcal{M}_0$ did in fact represent the best model for the data
and the decision rule said it was (a correct decision) or said
it wasn't (an erroneous decision). The other two outcomes
arise when model $\mathcal{M}_1$ was in fact true with either a correct or incorrect
decision made. The decision process operates by segmenting
the range of observation values into two disjoint
decision regions $\Re_0$ and $\Re_1$. All values of $r$ fall into either
$\Re_0$ or $\Re_1$. If a given $r$ lies in $\Re_0$, for example, we will
announce our decision "model $\mathcal{M}_0$ was true"; if in $\Re_1$, model
$\mathcal{M}_1$ would be proclaimed. To derive a rational method of
deciding which model best describes the observations, we need
a criterion to assess the quality of the decision process.
Optimizing this criterion will specify the decision regions.
The Bayes' decision criterion seeks to minimize a
cost function associated with making a decision. Let $C_{ij}$
be the cost of mistaking model $j$ for model $i$ ($i \neq j$) and
$C_{ii}$ the presumably smaller cost of correctly choosing
model $i$: $C_{ij} > C_{ii}$, $i \neq j$. Let $\pi_i$
be the a priori probability of model $i$. The so-called Bayes' cost $\bar{C}$
is the average cost of making a decision.
$$\bar{C} = \sum_{i,j} C_{ij}\,\pi_j \Pr[\text{say } \mathcal{M}_i \text{ when } \mathcal{M}_j \text{ true}] = \sum_{i,j} C_{ij}\,\pi_j \Pr[\text{say } \mathcal{M}_i \mid \mathcal{M}_j \text{ true}] \tag{1}$$
The Bayes' cost can be expressed as
$$\begin{aligned}
\bar{C} &= \sum_{i,j} C_{ij}\,\pi_j \Pr[r \in \Re_i \mid \mathcal{M}_j \text{ true}] \\
&= \sum_{i,j} C_{ij}\,\pi_j \int_{\Re_i} p_{r|\mathcal{M}_j}(r)\,dr \\
&= \int_{\Re_0} \left[ C_{00}\pi_0\, p_{r|\mathcal{M}_0}(r) + C_{01}\pi_1\, p_{r|\mathcal{M}_1}(r) \right] dr + \int_{\Re_1} \left[ C_{10}\pi_0\, p_{r|\mathcal{M}_0}(r) + C_{11}\pi_1\, p_{r|\mathcal{M}_1}(r) \right] dr
\end{aligned} \tag{2}$$
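To make the Bayes' cost concrete, the following minimal sketch evaluates it for a scalar observation when the decision regions are split at a single point `gamma`, and scans for the split that minimizes the cost. The Gaussian models (unit variance, means 0 and 1), the priors, the costs, and all function names are illustrative assumptions, not part of the text.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bayes_cost(gamma, pi0, pi1, C, m0=0.0, m1=1.0):
    """Bayes' cost when R_0 = {r < gamma} and R_1 = {r >= gamma}.

    C[i][j] is the cost of saying model i when model j is true.
    Assumed (hypothetical) models: r ~ N(m0, 1) under M_0, r ~ N(m1, 1) under M_1.
    """
    # p_in[i][j] = probability that r falls in R_i given that model j is true.
    p_in = [
        [phi(gamma - m0), phi(gamma - m1)],
        [1.0 - phi(gamma - m0), 1.0 - phi(gamma - m1)],
    ]
    prior = [pi0, pi1]
    return sum(C[i][j] * prior[j] * p_in[i][j]
               for i in range(2) for j in range(2))

pi0, pi1 = 0.25, 0.75          # assumed priors
C = [[0.0, 1.0], [1.0, 0.0]]   # assumed costs: errors cost 1, correct decisions 0

# Brute-force scan over candidate split points in [-3, 3].
grid = [k / 1000.0 for k in range(-3000, 3001)]
best_gamma = min(grid, key=lambda g: bayes_cost(g, pi0, pi1, C))
print(best_gamma, bayes_cost(best_gamma, pi0, pi1, C))
```

With these assumed values the scan should settle near $\gamma \approx -0.6$, the point at which the two bracketed integrands in (2) are equal.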
Here, $p_{r|\mathcal{M}_i}(r)$ is the conditional probability density function of
the observed data $r$ given that model $\mathcal{M}_i$
was true. To minimize this expression with respect
to the decision regions $\Re_0$ and $\Re_1$, ponder which integral would yield the smallest
value if its integration domain included a specific
observation vector. This selection process defines the
decision regions; for example, we choose $\mathcal{M}_0$
for those values of $r$ which yield a smaller value for the first integral.
$$\pi_0 C_{00}\, p_{r|\mathcal{M}_0}(r) + \pi_1 C_{01}\, p_{r|\mathcal{M}_1}(r) < \pi_0 C_{10}\, p_{r|\mathcal{M}_0}(r) + \pi_1 C_{11}\, p_{r|\mathcal{M}_1}(r)$$
We choose $\mathcal{M}_1$ when the inequality is reversed. This expression is
easily manipulated to obtain the decision rule known as the
likelihood ratio test.
$$\frac{p_{r|\mathcal{M}_1}(r)}{p_{r|\mathcal{M}_0}(r)} \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \frac{\pi_0 (C_{10} - C_{00})}{\pi_1 (C_{01} - C_{11})} \tag{3}$$
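The manipulation leading to (3) can be spelled out as follows. The steps assume, as stated earlier, that each error costs more than the corresponding correct decision ($C_{10} > C_{00}$ and $C_{01} > C_{11}$), so that dividing by the cost differences preserves the direction of the inequality, and that $p_{r|\mathcal{M}_0}(r) > 0$. Model $\mathcal{M}_0$ is chosen when

```latex
\begin{align*}
\pi_0 C_{00}\, p_{r|\mathcal{M}_0}(r) + \pi_1 C_{01}\, p_{r|\mathcal{M}_1}(r)
  &< \pi_0 C_{10}\, p_{r|\mathcal{M}_0}(r) + \pi_1 C_{11}\, p_{r|\mathcal{M}_1}(r) \\
\pi_1 (C_{01} - C_{11})\, p_{r|\mathcal{M}_1}(r)
  &< \pi_0 (C_{10} - C_{00})\, p_{r|\mathcal{M}_0}(r) \\
\frac{p_{r|\mathcal{M}_1}(r)}{p_{r|\mathcal{M}_0}(r)}
  &< \frac{\pi_0 (C_{10} - C_{00})}{\pi_1 (C_{01} - C_{11})}
\end{align*}
```

Reversing the final inequality gives the condition for choosing $\mathcal{M}_1$, which is the upper branch of the comparison in (3).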
The comparison relation means selecting model $\mathcal{M}_1$
if the left-hand ratio exceeds the value on the right;
otherwise, $\mathcal{M}_0$ is selected. Thus, the likelihood ratio
$p_{r|\mathcal{M}_1}(r) / p_{r|\mathcal{M}_0}(r)$, symbolically represented by $\Lambda(r)$,
is computed from the observed value of $r$ and then compared with a
threshold $\eta$ equaling
$\frac{\pi_0 (C_{10} - C_{00})}{\pi_1 (C_{01} - C_{11})}$. Hence, when two models are hypothesized, the
likelihood ratio test can be succinctly expressed as the
comparison of the likelihood ratio with a threshold.
$$\Lambda(r) \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \eta \tag{4}$$
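As a minimal sketch of how the test in (4) might be coded, the functions below compute the threshold from the priors and costs and then compare the likelihood ratio with it. The function names and the two exponential densities are hypothetical choices made only for illustration.

```python
import math

def bayes_threshold(pi0, pi1, C):
    """Threshold eta; C[i][j] is the cost of saying model i when model j is true."""
    return pi0 * (C[1][0] - C[0][0]) / (pi1 * (C[0][1] - C[1][1]))

def likelihood_ratio_test(r, p0, p1, eta):
    """Return 1 (choose M_1) if Lambda(r) = p1(r) / p0(r) exceeds eta, else 0.

    The comparison is written as p1(r) > eta * p0(r) so that a zero value
    of p0(r) does not cause a division by zero.
    """
    return 1 if p1(r) > eta * p0(r) else 0

# Hypothetical densities for r >= 0 (illustrative assumption only).
def p0(r):
    return math.exp(-r)              # exponential density, mean 1

def p1(r):
    return 0.5 * math.exp(-r / 2.0)  # exponential density, mean 2

# Equal priors and unit error costs give eta = 1.
eta = bayes_threshold(0.5, 0.5, [[0.0, 1.0], [1.0, 0.0]])
decisions = [likelihood_ratio_test(r, p0, p1, eta) for r in (1.0, 2.0)]
print(eta, decisions)
```

For these assumed densities the likelihood ratio is $\frac{1}{2} e^{r/2}$, so the rule switches from $\mathcal{M}_0$ to $\mathcal{M}_1$ at $r = 2\ln 2 \approx 1.39$.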
The data processing operations are captured entirely by the
likelihood ratio $p_{r|\mathcal{M}_1}(r) / p_{r|\mathcal{M}_0}(r)$.
Furthermore, note that only the value of the
likelihood ratio relative to the
threshold matters; to simplify the computation of the
likelihood ratio, we can perform any
positively monotonic operations simultaneously on the
likelihood ratio and the threshold without affecting the
comparison. We can multiply both by a positive constant,
add any constant to both, or apply to both a monotonically increasing function
which simplifies the expressions. We single out one such
function, the logarithm, because it simplifies likelihood
ratios that commonly occur in signal processing
applications. Using the logarithm of the likelihood ratio, known as the
log-likelihood, we explicitly express the likelihood ratio test as
$$\ln \Lambda(r) \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \ln \eta \tag{5}$$
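The equivalence of the logarithmic form and the direct form can be checked numerically. The sketch below assumes independent Gaussian observations with a common variance and means `m0` and `m1`; the models, the data values, and the function names are all hypothetical.

```python
import math

def log_likelihood_ratio(rs, m0=0.0, m1=1.0, sigma=1.0):
    """ln Lambda(r) for independent Gaussian observations (assumed models).

    With equal variances the normalizing constants cancel, leaving a sum
    of differences of squared deviations.
    """
    return sum(((r - m0) ** 2 - (r - m1) ** 2) / (2.0 * sigma ** 2) for r in rs)

def decide_log(rs, eta):
    """Compare the log-likelihood ratio with ln(eta)."""
    return 1 if log_likelihood_ratio(rs) > math.log(eta) else 0

def decide_direct(rs, eta, m0=0.0, m1=1.0, sigma=1.0):
    """Compare the likelihood ratio itself, built as a product, with eta."""
    ratio = 1.0
    for r in rs:
        num = math.exp(-(r - m1) ** 2 / (2.0 * sigma ** 2))
        den = math.exp(-(r - m0) ** 2 / (2.0 * sigma ** 2))
        ratio *= num / den
    return 1 if ratio > eta else 0

eta = 1.0 / 3.0                      # assumed threshold
for rs in ([0.9, 1.2, 0.4, 1.1], [-2.0, -1.5]):
    print(log_likelihood_ratio(rs), decide_log(rs, eta), decide_direct(rs, eta))
```

Both forms give the same decision because the logarithm is monotonically increasing. The logarithmic form replaces a product of exponentials with a sum, which is the simplification the text refers to.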
Useful simplifying transformations are problem-dependent; by
laying bare that aspect of the observations essential to the
model testing problem, we reveal the
sufficient statistic $\Upsilon(r)$: the scalar quantity which best summarizes the data
(Lehmann, pp. 18-22). The
likelihood ratio test is best expressed in terms of the
sufficient statistic.
$$\Upsilon(r) \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \gamma \tag{6}$$
We will denote the threshold value by $\gamma$
when the sufficient statistic is used or by $\eta$
when the likelihood ratio appears prior to its
reduction to a sufficient statistic.
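As a worked illustration of this reduction, suppose (purely as an assumption for the example) that the data consist of $N$ independent Gaussian observations $r_1, \dots, r_N$ with variance $\sigma^2$, having mean $0$ under $\mathcal{M}_0$ and mean $m > 0$ under $\mathcal{M}_1$. The symbols $N$, $m$, and $\sigma$ are introduced here and do not appear in the text.

```latex
\begin{align*}
\ln \Lambda(r)
  &= \sum_{n=1}^{N} \frac{r_n^2 - (r_n - m)^2}{2\sigma^2}
   = \frac{m}{\sigma^2} \sum_{n=1}^{N} r_n - \frac{N m^2}{2\sigma^2}
   \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \ln \eta \\
\Upsilon(r) = \sum_{n=1}^{N} r_n
  &\underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}}
   \frac{\sigma^2}{m} \ln \eta + \frac{N m}{2} = \gamma
\end{align*}
```

Under these assumptions the sum of the observations serves as the sufficient statistic, and every constant that does not depend on the data has been absorbed into $\gamma$. Dividing by $m/\sigma^2$ preserves the direction of the comparison only because $m > 0$.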
As we shall see, if we use a criterion other than
the Bayes' criterion, the decision rule often still involves the
likelihood ratio. The likelihood ratio is composed of the
quantities $p_{r|\mathcal{M}_i}(r)$, termed the likelihood function, which
is also important in estimation theory. It is this
conditional density that portrays the probabilistic model
describing data generation. The likelihood function
completely characterizes the kind of "world" assumed by each
model; for each model, we must specify the likelihood function
so that we can solve the hypothesis testing problem.
A complication, which arises in some cases, is that the
sufficient statistic may not be monotonic. If monotonic, the
decision regions $\Re_0$ and $\Re_1$ are simply connected (all portions of a region can
be reached without crossing into the other region). If not,
the regions are not simply connected and decision region
islands are created (see this problem). Such regions usually
complicate calculations of decision performance. Monotonic or
not, the decision rule proceeds as described: the sufficient
statistic is computed for each observation vector and compared
to a threshold.
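A small sketch of how decision region islands can arise: assume, hypothetically, that both models are zero-mean Gaussian and differ only in standard deviation (`s0 = 1` under $\mathcal{M}_0$, `s1 = 2` under $\mathcal{M}_1$). The log-likelihood ratio then depends on the observation only through $r^2$, which is not monotonic in $r$. The names and values below are illustrative assumptions.

```python
import math

def gamma_threshold(eta, s0=1.0, s1=2.0):
    """Threshold on the statistic r**2 for zero-mean Gaussians with s0 < s1.

    Derived from ln Lambda(r) = ln(s0/s1) + (r**2 / 2) * (1/s0**2 - 1/s1**2).
    """
    return 2.0 * (math.log(eta) + math.log(s1 / s0)) / (1.0 / s0**2 - 1.0 / s1**2)

def decide(r, eta=1.0):
    """Return 1 (choose M_1) if the statistic r**2 exceeds the threshold."""
    return 1 if r * r > gamma_threshold(eta) else 0

decisions = [decide(r) for r in (-2.0, 0.0, 2.0)]
print(decisions)  # [1, 0, 1]
```

Here the region assigned to $\mathcal{M}_1$ consists of two separate pieces, one on each side of the region assigned to $\mathcal{M}_0$, even though the rule is still a single comparison of the statistic with a threshold.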
An instructor in a course in detection theory wants to
determine whether a particular student studied for the last test.
The observed quantity is the student's grade, which we
denote by $r$. Failure does not necessarily indicate a lack of study:
conscientious students may fail the test. Define the models
as
- $\mathcal{M}_0$: did not study
- $\mathcal{M}_1$: did study
The conditional densities of the grade are shown in
Figure 1.
Based on knowledge of student behavior, the instructor
assigns a priori probabilities of $\pi_0 = 1/4$ and $\pi_1 = 3/4$. The costs
$C_{ij}$ are chosen to reflect the instructor's sensitivity
to student feelings: $C_{01} = 1 = C_{10}$
(an erroneous decision either way is given the
same cost) and $C_{00} = 0 = C_{11}$. The likelihood ratio is plotted in
Figure 1 and the threshold value $\eta$, which is computed from the
a priori probabilities and the costs to be $1/3$, is indicated. The calculations of this
comparison can be simplified in an obvious way.
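$$\frac{r}{50} \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \frac{1}{3} \quad \text{or} \quad r \underset{\mathcal{M}_0}{\overset{\mathcal{M}_1}{\gtrless}} \frac{50}{3} \approx 16.7$$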
The multiplication by the factor of 50 is a simple
illustration of the reduction of the likelihood ratio to a
sufficient statistic. Based on the assigned costs and
a priori probabilities, the optimum
decision rule says the instructor must assume that the
student did not study if the student's grade is less than
16.7; if greater, the student is assumed to have studied
despite receiving an abysmally low grade such as 20. Note
that, because the densities given by each model overlap entirely,
the possibility of making the wrong interpretation
always haunts the instructor. However,
no other procedure will be better!
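The arithmetic of this example can be reproduced in a few lines. The sketch uses the priors, costs, and likelihood ratio $r/50$ given above; the variable and function names are hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

# Priors and costs from the example; C[i][j] is the cost of saying
# model i when model j is true.
pi0, pi1 = Fraction(1, 4), Fraction(3, 4)
C = [[0, 1], [1, 0]]

# Threshold on the likelihood ratio.
eta = pi0 * (C[1][0] - C[0][0]) / (pi1 * (C[0][1] - C[1][1]))

# The likelihood ratio is r / 50, so multiplying by 50 gives the
# threshold on the grade itself (the sufficient statistic).
gamma = 50 * eta

def studied(grade):
    """Return True if the rule decides M_1 (did study) for this grade."""
    return grade > gamma

print(eta, gamma, round(float(gamma), 1))
print(studied(20), studied(10))
```

The computed values are $\eta = 1/3$ and $\gamma = 50/3$, so a grade of 20 leads to the decision "did study" while a grade of 10 leads to "did not study".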