Perhaps the ultimate non-parametric detector makes
no assumptions about the observations'
probability distribution under either model. Here, we assume
that data representative of each model are available to train
the detection algorithm. One approach uses artificial neural
networks, which are difficult to analyze in terms of both
optimality and performance. When the observations are
discrete-valued, a provable optimal detection algorithm (Gutman) can be derived using the theory of types.
For a two-model evaluation problem, let
r
˜
r
˜
(length
L
˜
L
˜
) denote training data representative of some unknown
probability distribution PP. We
assume that the data have statistically independent
components. To derive a non-parametric detector, form a
generalized likelihood ratio to distinguish whether a set of
observations rr
(length LL) has the same
distribution as the training data or a different one
QQ.
logΛr=logmaxP,Q{PrQ
r
˜
}maxP{PrP
r
˜
}
Λ
r
P
Q
P
r
Q
r
˜
P
P
r
P
r
˜
Because a type is the maximum likelihood estimate of
the probability distribution (see Histogram Estimators), we
simply substitute types for the training data and observations
probability distributions into the likelihood ratio. The
probability of a set of observations having a probability
distribution identical to its type equals
ⅇ-Lℋ
P
̂
L
ℋ
P
. Thus, the log likelihood ratio becomes
logΛr=logⅇ-LℋP̂rⅇ-
L
˜
ℋP̂
r
˜
ⅇ-L+
L
˜
ℋP̂
r
,
r
˜
Λ
r
L
ℋ
P
r
L
˜
ℋ
P
r
˜
L
L
˜
ℋ
P
r
,
r
˜
The denominator term means that the training and observed data
are lumped together to form a type. This type equals the linear
combination of the types for the training and observed data
weighted by their relative lengths.
P̂
r
,
r
˜
=LP̂r+
L
˜
P̂
r
˜
L+
L
˜
P
r
,
r
˜
L
P
r
L
˜
P
r
˜
L
L
˜
Returning to the log likelihood ratio, we have that
logΛr=-LℋP̂r-
L
˜
ℋP̂
r
˜
+L+
L
˜
ℋLP̂r+
L
˜
P̂
r
˜
L+
L
˜
Λ
r
L
ℋ
P
r
L
˜
ℋ
P
r
˜
L
L
˜
ℋ
L
P
r
L
˜
P
r
˜
L
L
˜
Note that the last term equals
L+
L
˜
ℋLP̂r+
L
˜
P̂
r
˜
L+
L
˜
=-∑aL
P
r
a
̂+
L
˜
P
r
˜
a
̂logL
P
r
a
̂+
L
˜
P
r
˜
a
̂L+
L
˜
L
L
˜
ℋ
L
P
r
L
˜
P
r
˜
L
L
˜
a
a
L
P
r
a
L
˜
P
r
˜
a
L
P
r
a
L
˜
P
r
˜
a
L
L
˜
which means it can be combined with the other terms to yield the
simple expression for the log likelihood ratio.
logΛr=LP̂r∥P̂
r
,
r
˜
+
L
˜
P̂
r
˜
∥P̂
r
,
r
˜
Λ
r
L
P
r
P
r
,
r
˜
L
˜
P
r
˜
P
r
,
r
˜
(1)
When the training data and the observed data are drawn from the
same distribution, the Kullback-Leibler distances will be
small. When the distributions differ, the distances will be
larger. Defining
ℳ
0
ℳ
0
to be the model that the training data and
observations have the same distribution and
ℳ
1
ℳ
1
that they don't, Gutman
showed that when we use the decision rule
1LlogΛr
≷
ℳ
0
ℳ
1
γ
1
L
Λ
r
≷
ℳ
0
ℳ
1
γ
its false-alarm probability has an exponential rate at
least as large as the threshold and the miss probability is the
smallest among all decision rules based on
training data.
limL→∞1L
P
F
≤-γ
L
1
L
P
F
γ
and
P
M
P
M
minimum.
We can extend these results to the
KK-model case if we have training
data
r
˜
i
r
˜
i
(each of length
L
˜
L
˜
) that represent model
ℳ
i
ℳ
i
,
i∈0…K-1
i
0
…
K
1
.
Given observed data r
r (length LL), we calculate
the log likelihood function given above for each model to
determine whether the observations closely resemble the tested
training data or not. More precisely, define the sufficient
statistics
ϒ
i
ϒ
i
according to
ϒ
i
=P̂r∥P̂
r
,
r
˜
i
+
L
˜
LP̂
r
˜
∥P̂
r
,
r
˜
i
-γ
ϒ
i
P
r
P
r
,
r
˜
i
L
˜
L
P
r
˜
P
r
,
r
˜
i
γ
Ideally, this statistic would be negative for one of the
training sets (matching it) and positive for all of the others
(not matching them). However, we could also have the observation
matching more than one training set. In all such cases, we
define a rejection region
ℜ
?
ℜ
?
similar to what we defined in sequential
model evaluation. Thus, we define the
i
th
i
th
decision region
ℜ
i
ℜ
i
according to
ϒ
i
<0
ϒ
i
0
and
ϒ
j
>0
ϒ
j
0
,
j≠i
j
i
and the rejection region as the complement of
⋃
i
=
0
K
−
1
ℜ
i
⋃
i
=
0
K
−
1
ℜ
i
. Note that all decision regions depend on the value of
γγ, a number we must
choose. Regardless of the value chosen, the probability of
confusing models - choosing some model other than the true one
- has an exponential rate that is at least
γγ for all models. Because of
the presence of a rejection region, another kind of "error" is
to not choose any model. This decision rule is optimal in the
sense that no other training-data-based decision rule has a
smaller rejection region than the type-based one.
Because it controls the exponential rate of confusing models, we
would like γγ to be as large
as possible. However, the rejection region grows as
γγ increases; choosing too
large a value could make virtually all decisions
rejections. What we want to ensure is that
limL→∞Pr
ℜ
?
|
ℳ
i
=0
L
ℳ
i
ℜ
?
0
. Obtaining this behavior requires that
limL→∞
L
˜
L>0
L
L
˜
L
0
: As the length of the observations increases, so must
the size of the training set. In summary,
for some
i
:
limL→∞
L
˜
L=0∨γ>
γ
0
⇒limL→∞Pr
ℜ
?
|
ℳ
i
=1
for some
i
:
L
L
˜
L
0
γ
γ
0
L
ℳ
i
ℜ
?
1
∀i:limL→∞
L
˜
L>0∧γ<
γ
0
⇒limL→∞1LPr
ℜ
?
|
ℳ
i
≤-β<0
i
L
L
˜
L
0
γ
γ
0
L
1
L
ℳ
i
ℜ
?
β
0
The critical value
γ
0
γ
0
depends on the true distributions underlying the
models. The exponential rate of the rejection probability
ββ also depends on the true
distributions. These results mean that if sufficient training
data are available and the decision threshold is not too large,
then we can perform optimal detection based entirely on data! As
the number of observations increases (and the amount of training
data as well), the critical threshold
γ
0
γ
0
becomes the Kullback-Liebler distance between the
unknown models. In other words, the type-based detector becomes
optimal!
-
M. Gutman. (1989). Asymptotically optimal classification for multiple test with empirically observed statistics. IEEE Trans. Info. Theory, 35, 401-408.