Many signal processing algorithms, implicitly or explicitly,
assume that the signal and the observation noise are each well
described as Gaussian random sequences. Virtually all linear
estimation and prediction filters minimize the mean-squared
error while not explicitly assuming any form for the amplitude
distribution of the signal or noise. In many formal waveform
estimation theories where probability density is, for better
or worse, specified, the mean-squared error arises from
Gaussian assumptions. A similar situation occurs explicitly in
detection theory. The matched filter is probably the optimum
detection rule only when the observation
noise is Gaussian. When the noise is non-Gaussian, the
detector assumes some other form. Much of what has been
presented in this chapter is based
implicitly on a Gaussian model for both
the signal and the noise. When non-Gaussian distributions are
assumed, the quantities upon which optimal linear filtering
theory are based, covariance functions, no longer suffice to
characterize the observations. While the joint amplitude
distribution of any zero-mean, stationary Gaussian stochastic
process is entirely characterized by its covariance function;
non-Gaussian processes require more. Optimal linear filtering
results can be applied in non-Gaussian problems, but we should
realize that other informative aspects of the process are
being ignored.
This discussion would seem to be leading to a formulation of
optimal filtering in a non-Gaussian setting. Would that such
theories were easy to use; virtually all of them require
knowledge of process characteristics that are difficult to
measure and the resulting filters are typically nonlinear
[Lipster and Shiryayev: Chapter
8] Rather than present preliminary results, we take the
tack that knowledge is better than ignorance: At least the
first-order amplitude distribution of the observed signals
should be considered during the signal processing design. If
the signal is found to be Gaussian, then linear filtering
results can be applied with the knowledge than no other
filtering strategy will yield better results. If
non-Gaussian, the linear filtering can still be used and the
engineer must be aware that future systems might yield
"better" results.
When the observations are discrete-valued or made so by
digital-to-analog converters, estimating the probability mass
function is straightforward: Count the relative number of
times each value occurs. Let
r0…rL-1
r
0
…
r
L
1
denote a sequence of observations, each of which takes on
𝒜=
a
1
…
a
N
𝒜
a
1
…
a
N
. This set is known as an alphabet and each
a
n
a
n
is a letter in that alphabet. We estimate the probability
that an observation equals one of the letters according to
P
r
a
n
̂=1L∑l=0L-1Irl=
a
n
P
r
a
n
1
L
l
0
L
1
I
r
l
a
n
where
I·
I
·
is the indicator function, equaling one if its argument is
true and zero otherwise. This kind of estimate is known in
information theory as a type [Cover and Thomas: Chapter 12], and
types have remarkable properties. For example, if the
observations are statistically independent, the probability
that a given sequence occurs equals
Prr=r0…rL-1=∏l=0L-1
P
r
rl
r
r
0
…
r
L
1
l
0
L1
P
r
r
l
Evaluating the logarithm, we find that
logPrr=log
P
r
rl
r
P
r
r
l
Converting to a sum over letters reveals
logPrr=∑n=0N-1L
P
r
a
n
̂log
P
r
a
n
=L∑n=0N-1
P
r
a
n
̂log
P
r
a
n
̂-log
P
r
a
n
̂
P
r
a
n
=-LℋP̂r+P̂r∥
P
r
r
n
0
N
1
L
P
r
a
n
P
r
a
n
L
n
0
N
1
P
r
a
n
P
r
a
n
P
r
a
n
P
r
a
n
L
ℋ
P
r
P
r
P
r
(1)
which yields
Prr=ⅇ-LℋP̂r+P̂r∥
P
r
r
L
ℋ
P
r
P
r
P
r
(2)
We introduce the
entropy
[Cover and Thomas: §2.1] and
Kullback-Leibler distance [See
Stein's Lemma].
ℋP=-∑n=0N-1P
a
n
logP
a
n
ℋ
P
n
0
N
1
P
a
n
P
a
n
P
1
∥
P
0
=∑n=0N-1
P
1
a
n
log
P
1
a
n
P
0
a
n
P
1
P
0
n
0
N
1
P
1
a
n
P
1
a
n
P
0
a
n
Because the Kullback-Leibler distance is non-negative,
equaling zero
only when the two
probability distributions equal each other, we maximize
Equation 2 with respect to
P
P
by choosing
P=
P
̂
P
P
:
The type estimator is the maximum likelihood estimator of
P
r
P
r
.
The number of length-L
L observation sequences having a given type
P
̂
P
approximately equals
ⅇ-Lℋ
P
̂
L
ℋ
P
. The probability that a given sequence has a given type
approximately equals
ⅇ-L
P
̂∥P
L
P
P
, which means that the probability a given sequence has a type
not equal to the true distribution decays
exponentially with the number of observations. Thus, while
the coin flip sequences
HHHHH
H
H
H
H
H
and
TTHHT
T
T
H
H
T
are equally likely (assuming a fair coin), the second is more
typical because its type is closer to the
true distribution.
By far the most used technique for estimating the probability
distribution of a continuous-valued random variable is the
histogram; more sophisticated techniques are
discussed in Silverman. For
real-valued data, subdivide the real line into
NN intervals
r
i
r
i
+
1
r
i
r
i
+
1
having widths
δ
i
=
r
i
+
1
-
r
i
δ
i
r
i
+
1
r
i
,
i=1…N
i
1
…
N
.
These regions are called bins and they should
encompass the range of values assumed by the data. For large
values, the "edge bins" can extend to infinity to catch the
overflows. Given LL observations
of a stationary random sequence
rl
r
l
,
l=0…L-1
l
0
…
L
1
, the histogram estimate
hi
h
i
is formed by simply forming a type from the number
L
i
L
i
of these observations that fall into the
i
th
i
th
bin and dividing by the binwidth
δ
i
δ
i
.
prr
̂=h1=
L
1
L
δ
1
if
r
1
<r≤
r
2
h2=
L
2
L
δ
2
if
r
2
<r≤
r
3
⋮ifhN=
L
N
L
δ
N
if
r
N
<r≤
r
N
+
1
p
r
r
h
1
L
1
L
δ
1
r
1
r
r
2
h
2
L
2
L
δ
2
r
2
r
r
3
⋮
h
N
L
N
L
δ
N
r
N
r
r
N
+
1
The histogram estimate resembles a rectangular approximation
to the density. Unless the underlying density has the same
form (a rare event), the histogram estimate does
not converge to the true density as the
number LL of observations grows.
Presumably, the value of the histogram at each bin converges
to the probability that the observations lie in that bin.
limL→∞
L
i
L=∫
r
i
r
i
+
1
prrdr
L
L
i
L
r
r
i
r
i
+
1
p
r
r
To demonstrate this intuitive feeling, we compactly denote the
histogram estimate by using indicator functions.
An indicator function
I
i
rl
I
i
r
l
for the
i
th
i
th
bin equals one if the observation
rl
r
l
lies in the bin and is zero otherwise. The estimate is simply
the average of the indicator functions across the
observations.
hi=1L
δ
i
∑l=0L-1
I
i
rl
h
i
1
L
δ
i
l
0
L
1
I
i
r
l
The expected value of
I
i
rl
I
i
r
l
is simply the probability
P
i
P
i
that the observation lies in the
i
th
i
th
bin. Thus, the expected value of each histogram value equals
the integral of the actual density over the bin, showing that
the histogram is an unbiased estimate of this integral.
Convergence can be tested by computing the variance of the
estimate. The variance of one bin in the histogram is given
by
σhi2=
P
i
-
P
i
2L
δ
i
2+1L2
δ
i
2∑k≠lE
I
i
rk
I
i
rl-
P
i
2
h
i
P
i
P
i
2
L
δ
i
2
1
L
2
δ
i
2
k
k
l
I
i
r
k
I
i
r
l
P
i
2
To simplify this expression, the correlation between the
observations must be specified. If the values are
statistically independent (we have white noise), each term in
the sum becomes zero and the variance is given by
σhi2=P
i
-P
i
2L
δ
i
2
h
i
P
i
P
i
2
L
δ
i
2
. Thus, the variance tends to zero as
L→∞
L
and the histogram estimate is consistent, converging to
P
i
δ
i
P
i
δ
i
. If the observations are not white, convergence becomes
problematical. Assume, for example, that
I
i
rk
I
i
r
k
and
I
i
rl
I
i
r
l
are correlated in a first-order, geometric fashion.
E
I
i
rk
I
i
rl-
P
i
2=
P
i
2ρ|k-l|
I
i
r
k
I
i
r
l
P
i
2
P
i
2
ρ
k
l
The variance does increase with this presumed correlation
until, at the extreme (
ρ=1
ρ
1
), the variance is a constant independent of
LL! In summary, if the
observations are mutually correlated and the histogram
estimate converges, the estimate converges to the proper value
but more slowly than if the observations were white. The
estimate may not converge if the observations are heavily
dependent from index to index. This type of dependence
structure occurs when the power spectrum of the observations
is lowpass with an extremely low cutoff frequency.
Convergence to the density rather than its integral over a
region can occur if, as the amount of data grows, we reduce
the binwidth
δ
i
δ
i
and increase NN, the number of
bins. However, if we choose the binwidth too small for the
amount of available data, few bins contain data and the
estimate is inaccurate. Letting
r
′
r
′
denote the midpoint of a bin, using a Taylor expansion about
this point reveals that the mean-squared error between the
histogram and the density at that point is [Thompson and Tapia:44-59]
Epr
r
′
-hi2=pr
r
′
2L
δ
i
+
δ
i
436d2dr2prr|
r=
r
′
2+O1L+O
δ
i
5
p
r
r
′
h
i
2
p
r
r
′
2
L
δ
i
δ
i
4
36
r
r
′
r
2
p
r
r
2
O
1
L
O
δ
i
5
This mean-squared error becomes zero only
if
L→∞
L
,
L
δ
i
→∞
L
δ
i
, and
δ
i
→0
δ
i
0
. Thus, the binwidth must decrease more
slowly than the rate of increase of the number of
observations. We find the "optimum" compromise between the
decreasing binwidth and the increasing amount of data to
be
δ
i
=9pr
r
′
2d2dr2prr|
r=
r
′
215L-15
δ
i
9
p
r
r
′
2
r
r
′
r
2
p
r
r
2
1
5
L
1
5
Using this binwidth, we find the the mean-squared error to be
proportional to
L-45
L
4
5
. We have thus discovered the famous "4/5" rule of density
estimation; this is one of the few cases where the variance of
a convergent statistic decreases more slowly than the
reciprocal of the number of observations. In practice, this
optimal binwidth cannot be used because the proportionality
constant depends on the unknown density being estimated.
Roughly speaking, wider bins should be employed where the
density is changing slowly. How the optimal binwidth varies
with LL can be used to adjust the
histogram estimate as more data becomes available.
Once a density estimate is produced, the class of density that best
coincides with the estimate remains an issue:
Is the density just estimated statistically similar to a Gaussian?
The histogram estimate can be used directly in a hypothesis test to
determine similarity with any proposed density.
Assume that the observations are obtained from a white,
stationary, stochastic sequence.
Let
ℳ
0
ℳ
0
denote the hypothesis that the data has an amplitude
distribution equal to the presumed density and
ℳ
1
ℳ
1
the dissimilarity hypothesis. If
ℳ
0
ℳ
0
is true, the estimate for each bin should not deviate greatly
from the probability of a randomly chosen datum lying in the
bin. We determine this probability from the presumed density
by integrating over the bin. Summing these deviations over
the entire estimate, the result should not exceed a threshold.
The theory of standard hypothesis testing requires us to
produce a specific density for the alternative hypothesis
ℳ
1
ℳ
1
. We cannot rationally assign such a density; consistency is
being tested, not whether either of two densities provides the
best fit. However, taking inspiration from the Neyman-Pearson
approach to hypothesis testing [See Neyman-Pearson Criterion], we can
develop a test statistic and require its statistical
characteristics only under
ℳ
0
ℳ
0
. The typically used, but ad hoc test statistic
SLN
S
L
N
is related to the histogram estimate's mean-squared error
[Cramer:416-41].
SLN=∑i=1N
L
i
-L
P
i
2L
P
i
=∑i=1N
L
i
2L
P
i
-L
S
L
N
i
1
N
L
i
L
P
i
2
L
P
i
i
1
N
L
i
2
L
P
i
L
This statistic sums over the various bins the squared error of
the number of observations relative to the expected number.
For large LL,
SLN
S
L
N
has a
χ
2
χ
2
probability distribution with
N-1
N
1
degrees of freedom [Cramer:417].
Thus, for a given number of observations
LL we establish a threshold
η
N
η
N
by picking a false-alarm probability
P
F
P
F
and using tables to solve
Pr
χ
N
-
1
2
>
η
N
=
P
F
χ
N
-
1
2
η
N
P
F
. To enhance the validity of this approximation, statisticians
recommend selecting the binwidth so that each bin contains at
least ten observations. In practice, we fulfill this
criterion by merging adjacent bins until a sufficient number
of observations occur in the new bin and defining its binwidth
as the sum of the merged bins' widths. Thus, the number of
bins is reduced to some number
N
′
N
′
, which determines the degrees of freedom in the hypothesis
test. The similarity test between the histogram estimate of a
probability density function and an assumed ideal form becomes
SL
N
′
≷
ℳ
0
ℳ
1
η
N
′
S
L
N
′
≷
ℳ
0
ℳ
1
η
N
′
In many circumstances, the formula for the density is known
but not some of its parameters. In the Gaussian case, for
example, the mean or variance are usually unknown. These
parameters must be determined from the same data used in the
consistency test before the test can be used. Doesn't the
fact that we use estimates rather than actual values affect
the similarity test? The answer is "yes," but in an
interesting way: The similarity test changes only in that the
number of degrees of freedom of the
χ 2
χ 2
random variable used to establish
the threshold is reduced by one for each estimated parameter.
If a Gaussian density is being tested, for example, the mean
and variance usually need to be found. The threshold should
then be determined according to the distribution of a
χ
N
′
-
3
2
χ
N
′
-
3
2
random variable.
Three sets of observations are considered: Two are drawn from
a Gaussian distribution and the other not. The first Gaussian
example is white noise, a signal whose characteristics match
the assumptions of this section. The second is non-Gaussian,
which should not pass the test. Finally, the last test
consists of colored Gaussian noise that, because of dependent
samples, does not have as many degrees of freedom as would be
expected. The number of data available in each case is 2000.
The histogram estimator uses fixed-width bins and the
χ
2
χ
2
test demands at least ten observations per merged bin. The
mean and variance estimates are used in constructing the
nominal Gaussian density. The histogram estimates and their
approximation by the nominal density whose mean and variance
were computed from the data are shown in the Figure 1.
The chi-squared test
P
F
=0.1
P
F
0.1
yielded the following results.
| Density |
N
′
N
′
|
χ
N
′
-
3
2
χ
N
′
-
3
2
|
S2000
N
′
S
2000
N
′
|
| White Gaussian |
70 |
82.2 |
78.4 |
| White Sech |
65 |
76.6 |
232.6 |
| Colored Gaussian |
65 |
76.6 |
77.8 |
The white Gaussian noise example clearly passes the
χ
2
χ
2
test. The test correctly evaluated the non-Gaussian example,
but declared the colored Gaussian data to be non-Gaussian,
yielding a value near the threshold. Failing in the latter
case to correctly determine the data's Gaussianity, we see
that the
χ 2
χ 2
test is sensitive to the statistical independence of the
observations.
-
R.S. Lipster, A.N.Shiryayev. (1977). Statistics of Random Processes I: General Theory. New York: Springer-Verlag.
-
T.M.Cover, J.A.Thomas. (1991). Elements of Information Theory. John Wiley and Sons, Inc.
-
B.W.Silverman. (1986). Density Estimation. London: Chapman and Hall.
-
J.R.Thompson, R.A.Tapia. (1990). Nonparametric Function Estimation, Modeling and Simulation. Philadelphia, PA: SIAM.
-
H.Cramer. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.