Consider the binary hypothesis test
ℋ
0
:
x∼
f
0
x
ℋ
0
:
x
f
0
x
ℋ
1
:
x∼
f
1
x
ℋ
1
:
x
f
1
x
Let
π
i
π
i
, denote the a priori probability of
hypothesis
ℋ
i
ℋ
i
. Suppose our decision rule declares
"
ℋ
0
ℋ
0
is the true model" when
x∈
R
0
x
R
0
, and it selects
ℋ
1
ℋ
1
when
x∈
R
1
x
R
1
, where
R
1
=
R
0
′
R
1
R
0
. The probability of making an error, denoted
P
e
P
e
, is
P
e
=Pr
declare
ℋ
0
and
ℋ
1
true
+Pr
declare
ℋ
1
and
ℋ
0
true
=Pr
ℋ
1
Pr
ℋ
0
|
ℋ
1
+Pr
ℋ
0
Pr
ℋ
1
|
ℋ
0
=∫
R
0
π
1
f
1
xdx+∫
R
1
π
0
f
0
xdx
P
e
declare
ℋ
0
and
ℋ
1
true
declare
ℋ
1
and
ℋ
0
true
ℋ
1
ℋ
1
ℋ
0
ℋ
0
ℋ
0
ℋ
1
x
R
0
π
1
f
1
x
x
R
1
π
0
f
0
x
(1)
In this module, we study the minimum probability of error
decision rule, which selects
R
0
R
0
and
R
1
R
1
so as to minimize the above expression.
Since an observation xx falls into one and only one of the
decision regions
R
i
R
i
, in order to minimize
P
e
P
e
,
we assign xx to the
region for which the corresponding integrand in Equation 1
is smaller. Thus, we select
x∈
R
0
x
R
0
if
π
1
f
1
x<
π
0
f
0
x
π
1
f
1
x
π
0
f
0
x
, and
x∈
R
1
x
R
1
if the inequality is reversed. This decision rule may
be summarized concisely as
Λx≡
f
1
x
f
0
x
≷
ℋ
0
ℋ
1
π
0
π
1
≡η
Λ
x
f
1
x
f
0
x
≷
ℋ
0
ℋ
1
π
0
π
1
η
Here,
Λx
Λ
x
is called the likelihood ratio,
ηη is called a
threshold, and the overall decision rule is called
the Likelihood Ratio Test.
Consider the binary hypothesis test of a scalar
xx
ℋ
0
:
x∼0σ2
ℋ
0
:
x
0
σ
2
ℋ
1
:
x∼μσ2
ℋ
1
:
x
μ
σ
2
where μμ and
σ2
σ
2
are known, positive quantities. Suppose we observe
a single measurement xx. The
likelihood ratio is
Λx=12πσ2ⅇ-x−μ22σ212πσ2ⅇ-x22σ2=ⅇ1σ2μx−μ22
Λ
x
1
2
σ
2
x
μ
2
2
σ
2
1
2
σ
2
x
2
2
σ
2
1
σ
2
μ
x
μ
2
2
(2)
and so the minimum probability of error decision rule is
ⅇ1σ2μx−μ22
≷
ℋ
0
ℋ
1
π
0
π
1
=η
1
σ
2
μ
x
μ
2
2
≷
ℋ
0
ℋ
1
π
0
π
1
η
The expression for
Λx
Λ
x
is somewhat complicated. By applying a sequence of
monotonically increasing functions to both sides, we can
obtain a simplified expression for the optimal decision rule
without changing the rule. In this example, we apply the
natural logarithm and rearrange terms to arrive at
x
≷
ℋ
0
ℋ
1
σ2μlnη+μ2≡γ
x
≷
ℋ
0
ℋ
1
σ
2
μ
η
μ
2
γ
Here we have used the assumption
μ>0
μ
0
. If
μ<0
μ
0
, then dividing by
μ
μ would reverse the inequalities.
This form of the decision rule is much
simpler: we just compare the observed value
xx to a threshold
γγ. Figure 1
depicts the two candidate densities and a possible value of
γγ. If each hypothesis is
a priori equally likely
(
π
0
=
π
1
=12
π
0
π
1
1
2
), then
γ=μ2
γ
μ
2
. Figure 1 illustrates the case where
π
0
>
π
1
π
0
π
1
(
γ>μ2
γ
μ
2
).
If we plot the two densities so that each is weighted by its
a priori probability of occuring, the two
curves will intersect at the threshold
γγ (see
Figure 2). (Can you explain why this is? Think back
to our derivation of the LRT). This plot also offers a way
to visualize the probability of error. Recall
P
e
=∫
R
0
π
1
f
1
xdx+∫
R
1
π
0
f
0
xdx=∫x<γ
π
1
f
1
xdx+∫x>γ
π
0
f
0
xdx=
π
1
P
M
+
π
0
P
F
P
e
x
R
0
π
1
f
1
x
x
R
1
π
0
f
0
x
x
x
γ
π
1
f
1
x
x
x
γ
π
0
f
0
x
π
1
P
M
π
0
P
F
(3)
where
P
M
P
M
and
P
F
P
F
denote the miss and false alarm probabilities,
respectively. These quantities are depicted in
Figure 2.
We can express
P
M
P
M
and
P
F
P
F
in terms of the
Q-function as
P
e
=
π
1
Qμ−γσ+
π
0
Qγσ
P
e
π
1
Q
μ
γ
σ
π
0
Q
γ
σ
When
π
0
=
π
1
=12
π
0
π
1
1
2
, we have
γ=μ2
γ
μ
2
, and the error probability is
P
e
=Qμ2σ
P
e
Q
μ
2
σ
Since
Qx
Q
x
is monotonically decreasing, this says that the
"difficulty" of the detection problem decreases with
decreasing
σσ and
increasing
μμ.
In the preceding example, computation of the
probability of error involved a one-dimensional integral. If we
had multiple observations, or vector-valued data, generalizing
this procedure would involve multi-dimensional integrals over
potentially complicated decision regions. Fortunately, in many
cases, we can avoid this problem through the use of sufficient statistics.
Suppose we have the same test as in the
previous example, but now we have
NN independent observations:
ℋ
0
:
x
n
∼0σ2
,
n
=
1
,
…
,
N
ℋ
0
:
x
n
0
σ
2
,
n
=
1
,
…
,
N
ℋ
1
:
x
n
∼μσ2
,
n
=
1
,
…
,
N
ℋ
1
:
x
n
μ
σ
2
,
n
=
1
,
…
,
N
where
μ>0
μ
0
and
σ2>0
σ
2
0
and both are known. The likelihood ratio is
Λx=∏n=1N12πσ2ⅇ-
x
n
−μ22σ2∏n=1N12πσ2ⅇ-
x
n
22σ2=ⅇ-12σ2∑n=1N
x
n
−μ2ⅇ-12σ2∑n=1N
x
n
2=ⅇ12σ2∑n=1N2
x
n
μ−μ2=ⅇ1σ2μ∑n=1N
x
n
−Nμ22
Λ
x
n
1
N
1
2
σ
2
x
n
μ
2
2
σ
2
n
1
N
1
2
σ
2
x
n
2
2
σ
2
-1
2
σ
2
n
1
N
x
n
μ
2
-1
2
σ
2
n
1
N
x
n
2
1
2
σ
2
n
1
N
2
x
n
μ
μ
2
1
σ
2
μ
n
1
N
x
n
N
μ
2
2
(4)
As in the
previous example,
we may apply the natural logarithm and rearrange terms to
obtain an equivalent form of the LRT:
t≡∑n=1N
x
n
≷
ℋ
0
ℋ
1
σ2μlnη+Nμ2≡γ
t
n
1
N
x
n
≷
ℋ
0
ℋ
1
σ
2
μ
η
N
μ
2
γ
The scalar quantity
tt is a
sufficient statistic for the mean. In order to evaluate the
probability of error without resorting to a multi-dimensional
integral, we can express
P
e
P
e
in terms of
tt as
P
e
=
π
1
Prt<γ|
ℋ
1
true
+
π
0
Prt>γ|
ℋ
0
true
P
e
π
1
ℋ
1
true
t
γ
π
0
ℋ
0
true
t
γ
Now
tt is a linear combination of
normal variates, so it is itself normal. In particular, we
have
t=Ax
t
A
x
, where
1…1
1
…
1
is an
NN-dimensional
row vector of 1's, and
xx is multivariate normal with
mean
00 or
μ=μ…μT
μ
μ
…
μ
, and covariance
σ2I
σ
2
I
. Thus we have
t
|
ℋ
0
∼A0Aσ2IAT=0Nσ2
t
|
ℋ
0
A
0
A
σ
2
I
A
0
N
σ
2
t
|
ℋ
1
∼AμAσ2IAT=NμNσ2
t
|
ℋ
1
A
μ
A
σ
2
I
A
N
μ
N
σ
2
Therefore, we may write
P
e
P
e
in terms of the
Q-function as
P
e
=
π
1
QNμ−γNσ+
π
0
QγNσ
P
e
π
1
Q
N
μ
γ
N
σ
π
0
Q
γ
N
σ
In the special case
π
0
=
π
1
=12
π
0
π
1
1
2
,
P
e
=QNμσ
P
e
Q
N
μ
σ
Since
QQ is monotonically
decreasing, this result provides mathematical support for
something that is intuitively obvious: The performance of our
decision rule improves with increasing
NN and
μμ, and decreasing
σσ.
In the context of signal processing, the
foregoing problem may be viewed as the problem of detecting a
constant (DC) signal in
additive white
Gaussian noise:
ℋ
0
:
x
n
=
w
n
,
n
=
1
,
…
,
N
ℋ
0
:
x
n
w
n
,
n
=
1
,
…
,
N
ℋ
1
:
x
n
=A+
w
n
,
n
=
1
,
…
,
N
ℋ
1
:
x
n
A
w
n
,
n
=
1
,
…
,
N
where
AA is a known, fixed
amplitude, and
w
n
∼0σ2
w
n
0
σ
2
. Here
AA corresponds
to the mean
μμ in the
example.
The next example explores the minimum
probability of error decision rule in a
discrete setting.
Suppose we have a friend who is trying to
transmit a bit (0 or 1) to us over a noisy channel. The
channel causes an error in the transmission (that is, the bit
is flipped) with probability pp,
where
0≤p<12
0
p
1
2
, and pp is known. In
order to increase the chance of a successful transmission,
our friend sends the same bit
NN times. Assume the
NN transmissions are
statistically independent. Under these assumptions, the bits
you receive are Bernoulli random variables:
x
n
∼Bernoulliθ
x
n
Bernoulli
θ
. We are faced with the following hypothesis test:
Table 1
|
ℋ
0
ℋ
0
|
θ=p
θ
p
|
0 sent |
|
ℋ
1
ℋ
1
|
θ=1−p
θ
1
p
|
1 sent |
We decide to decode the received sequence
x=
x
1
…
x
N
T
x
x
1
…
x
N
by minimizing the probability of error. The likelihood ratio is
Λx=∏n=1N1−p
x
n
p1−
x
n
∏n=1Np
x
n
1−p1−
x
n
=1−pkpN−kpk1−pN−k=1−pp2k−N
Λ
x
n
1
N
1
p
x
n
p
1
x
n
n
1
N
p
x
n
1
p
1
x
n
1
p
k
p
N
k
p
k
1
p
N
k
1
p
p
2
k
N
(5)
where
k=∑n=1N
x
n
k
n
1
N
x
n
is the number of 1s received.
The LRT is
1−pp2k−N
≷
ℋ
0
ℋ
1
π
0
π
1
=η
1
p
p
2
k
N
≷
ℋ
0
ℋ
1
π
0
π
1
η
Taking the natural logarithm of both sides and rearranging,
we have
k
≷
ℋ
0
ℋ
1
N2+12lnηln1−pp=γ
k
≷
ℋ
0
ℋ
1
N
2
1
2
η
1
p
p
γ
In the case that both hypotheses are equally likely, the
minimum probability of error decision is the "majority-vote"
rule: Declare
ℋ
1
ℋ
1
if there are more 1s than 0s, declare
ℋ
0
ℋ
0
otherwise. In the event
k=γ
k
γ
, we may decide arbitrarily; the probability of
error is the same either way. Let's adopt the convention
that
ℋ
0
ℋ
0
is declared in this case.
To compute the probability of error of the
optimal rule, write
P
e
=
π
0
Pr
declare
ℋ
1
|
ℋ
0
true
+
π
1
Pr
declare
ℋ
0
|
ℋ
1
true
=
π
0
Prk>γ|
ℋ
0
true
+
π
1
Prk≤γ|
ℋ
1
true
P
e
π
0
ℋ
0
true
declare
ℋ
1
π
1
ℋ
1
true
declare
ℋ
0
π
0
ℋ
0
true
k
γ
π
1
ℋ
1
true
k
γ
(6)
Now
kk is a binomial random variable,
k∼BinomialNθ
k
Binomial
N
θ
, where
θθ
depends on which hypothesis is true. We have
Prk>γ|
ℋ
0
=∑k=⌊γ⌋+1N
f
0
k=∑k=⌊γ⌋+1NNkpk1−pN−k
ℋ
0
k
γ
k
γ
1
N
f
0
k
k
γ
1
N
N
k
p
k
1
p
N
k
(7)
and
Prk≤γ|
ℋ
1
=∑k=0⌊γ⌋Nk1−pkpN−k
ℋ
1
k
γ
k
0
γ
N
k
1
p
k
p
N
k
Using these formulae, we may compute
P
e
P
e
explicitly for given values of
NN,
pp,
π
0
π
0
and
π
1
π
1
.
The likelihood ratio test is one way of
expressing the minimum probability of error decision
rule. Another way is
Declare hypothesis
ii such that
π
i
f
i
x
π
i
f
i
x
is maximal.
This rule is referred to as the
maximum a
posteriori, or
MAP rule, because
the quantity
π
i
f
i
x
π
i
f
i
x
is proportional to the posterior probability of
hypothesis
ii. This becomes clear
when we write
π
i
=Pr
ℋ
i
π
i
ℋ
i
and
f
i
x=f
x
|
ℋ
i
f
i
x
f
x
|
ℋ
i
.
Then, by
Bayes rule, the
posterior probability of
ℋ
i
ℋ
i
given the data is
Pr
ℋ
i
|x=Pr
ℋ
i
f
x
|
ℋ
i
fx
x
ℋ
i
ℋ
i
f
x
|
ℋ
i
f
x
Here
fx
f
x
is the unconditional density or mass function for
xx, which is
effectively a constant when trying to maximiaze with respect to
ii.
According to the MAP interpretation, the optimal
decision boundary is the locus of points where the weighted
densities (in the continuous case)
π
i
f
i
x
π
i
f
i
x
intersect one another. This idea is illustrated in
Example 2.
One advantage the MAP formulation of the minimum
probability of error decision rule has over the LRT is that it
generalizes easily to MM-ary
hypothesis testing. If we are to choose between hypotheses
ℋ
i
ℋ
i
,
i=1…M
i
1
…
M
, the optimal rule is still the MAP rule
The
Bayes risk criterion for
constructing decision rules assigns a cost
C
i
j
C
i
j
to the outcome of declaring
ℋ
i
ℋ
i
when
ℋ
j
ℋ
j
is in effect. The probability of error is simply a special
case of the Bayes risk corresponding to
C
00
=
C
11
=0
C
00
C
11
0
and
C
01
=
C
10
=1
C
01
C
10
1
. Therefore, the form of the minimum probability of
error decision rule is a specialization of the minimum Bayes
risk decision rule: both are likelihood ratio tests. The
different costs in the Bayes risk formulation simply shift the
threshold to favor one hypothesis over the other.
Generally speaking, when is the probability of
error zero for the optimal rule? Phrase
your answer in terms of the distributions underlying each
hypothesis. Does the LRT agree with your answer in this case?
Suppose we measure
NN independent values
x
1
,
…
,
x
N
x
1
,
…
,
x
N
. We know the variance of our measurements
(
σ2=1
σ
2
1
), but are unsure whether the data obeys a
Laplacian or Gaussian probability law:
ℋ
0
:
f
0
x=12ⅇ-2|r|
ℋ
0
:
f
0
x
1
2
2
r
ℋ
1
:
f
1
x=12πⅇ-r22
ℋ
1
:
f
1
x
1
2
r
2
2
Show that the two densities have the same
mean and variance, and plot the densities on the same
graph.
Find the likelihood ratio.
Determine the decision regions for
different values of the threshold
ηη. Consider all possible
values of
η>0
η
0
There are three distinct cases.
Draw the decision regions and decision
boundaries for
η=1212
η
1
2
1
2
.
Assuming the two hypotheses are equally
likely, compute the probability of error. Your answer should
be a number.
Consider the hypothesis testing problem
ℋ
0
:
x∼μ0
Σ
0
ℋ
0
:
x
μ
0
Σ
0
ℋ
1
:
x∼μ1
Σ
1
ℋ
1
:
x
μ
1
Σ
1
where
μ0∈ℝd
μ
0
d
and
μ0∈ℝd
μ
0
d
, and
Σ
0
Σ
0
,
Σ
1
Σ
1
are positive definite, symmetric
dd×
dd matrices. Write down the
likelihood ratio test, and simplify, for the following
cases. In each case, provide a geometric description
of the decision boundary.
Σ
0
=
Σ
1
Σ
0
Σ
1
, but
μ0≠μ1
μ
0
μ
1
.
μ0=μ1
μ
0
μ
1
, but
Σ
0
≠
Σ
1
Σ
0
Σ
1
.
μ0≠μ1
μ
0
μ
1
and
Σ
0
≠
Σ
1
Σ
0
Σ
1
.
Suppose we observe
NN independent realizations of a
Poisson random variable kk with
intensity parameter λλ:
fk=ⅇ-λλkk!
f
k
λ
λ
k
k
We must decide which of two intensities is in effect:
ℋ
0
:
λ=
λ
0
ℋ
0
:
λ
λ
0
ℋ
1
:
λ=
λ
1
ℋ
1
:
λ
λ
1
where
λ
0
<
λ
1
λ
0
λ
1
.
Give the minimum probability of error
decision rule.
Simplify the LRT to a test statistic
involving only a sufficient statistic. Apply a monotonically
increasing transformation to simplify further.
Determine the distribution of the sufficient
statistic under both hypotheses.
Use the
characteristic function to show that a sum of IID Poisson
variates is again Poisson distributed.
Derive an expression for the probability of
error.
Assuming the two hypotheses are equally likely, and
λ
0
=5
λ
0
5
and
λ
1
=6
λ
1
6
, what is the minimum number
NN of observations needed to
attain a probability of error no greater than 0.01?
If you have numerical trouble, try rewriting
the log-factorial so as to avoid evaluating the factorial
of large integers.
In Example 3, suppose
π
0
=
π
1
=12
π
0
π
1
1
2
, and
p=0.1
p
0.1
. What is the smallest value of
NN needed to ensure
P
e
≤0.01
P
e
0.01
?