Skip to content Skip to navigation

Connexions

You are here: Home » Content » Probability Density Estimation

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

This feature requires Javascript to be enabled.

Probability Density Estimation

Module by: Don Johnson

Summary: This module covers probability density estimation of signals assuming a knowledge of the first order amplitude distribution of observed signals. It describes Type and Histogram Estimators as well as density verification of the estimates.

Probability Density Estimation

Many signal processing algorithms, implicitly or explicitly, assume that the signal and the observation noise are each well described as Gaussian random sequences. Virtually all linear estimation and prediction filters minimize the mean-squared error while not explicitly assuming any form for the amplitude distribution of the signal or noise. In many formal waveform estimation theories where probability density is, for better or worse, specified, the mean-squared error arises from Gaussian assumptions. A similar situation occurs explicitly in detection theory. The matched filter is probably the optimum detection rule only when the observation noise is Gaussian. When the noise is non-Gaussian, the detector assumes some other form. Much of what has been presented in this chapter is based implicitly on a Gaussian model for both the signal and the noise. When non-Gaussian distributions are assumed, the quantities upon which optimal linear filtering theory are based, covariance functions, no longer suffice to characterize the observations. While the joint amplitude distribution of any zero-mean, stationary Gaussian stochastic process is entirely characterized by its covariance function; non-Gaussian processes require more. Optimal linear filtering results can be applied in non-Gaussian problems, but we should realize that other informative aspects of the process are being ignored.

This discussion would seem to be leading to a formulation of optimal filtering in a non-Gaussian setting. Would that such theories were easy to use; virtually all of them require knowledge of process characteristics that are difficult to measure and the resulting filters are typically nonlinear [Lipster and Shiryayev: Chapter 8] Rather than present preliminary results, we take the tack that knowledge is better than ignorance: At least the first-order amplitude distribution of the observed signals should be considered during the signal processing design. If the signal is found to be Gaussian, then linear filtering results can be applied with the knowledge than no other filtering strategy will yield better results. If non-Gaussian, the linear filtering can still be used and the engineer must be aware that future systems might yield "better" results. 1

Types

When the observations are discrete-valued or made so by digital-to-analog converters, estimating the probability mass function is straightforward: Count the relative number of times each value occurs. Let r0rL-1 r 0 r L 1 denote a sequence of observations, each of which takes on 𝒜= a 1 a N 𝒜 a 1 a N . This set is known as an alphabet and each a n a n is a letter in that alphabet. We estimate the probability that an observation equals one of the letters according to P r a n ̂=1Ll=0L-1Irl= a n P r a n 1 L l 0 L 1 I r l a n where I· I · is the indicator function, equaling one if its argument is true and zero otherwise. This kind of estimate is known in information theory as a type [Cover and Thomas: Chapter 12], and types have remarkable properties. For example, if the observations are statistically independent, the probability that a given sequence occurs equals Prr=r0rL-1=l=0L-1 P r rl r r 0 r L 1 l 0 L1 P r r l Evaluating the logarithm, we find that logPrr=log P r rl r P r r l Converting to a sum over letters reveals

logPrr=n=0N-1L P r a n ̂log P r a n =Ln=0N-1 P r a n ̂log P r a n ̂-log P r a n ̂ P r a n =-LP̂r+P̂r P r r n 0 N 1 L P r a n P r a n L n 0 N 1 P r a n P r a n P r a n P r a n L P r P r P r (1)
which yields
Prr=-LP̂r+P̂r P r r L P r P r P r (2)
We introduce the entropy [Cover and Thomas: §2.1] and Kullback-Leibler distance [See Stein's Lemma]. P=-n=0N-1P a n logP a n P n 0 N 1 P a n P a n P 1 P 0 =n=0N-1 P 1 a n log P 1 a n P 0 a n P 1 P 0 n 0 N 1 P 1 a n P 1 a n P 0 a n Because the Kullback-Leibler distance is non-negative, equaling zero only when the two probability distributions equal each other, we maximize Equation 2 with respect to P P by choosing P= P ̂ P P : The type estimator is the maximum likelihood estimator of P r P r .

The number of length-L L observation sequences having a given type P ̂ P approximately equals -L P ̂ L P . The probability that a given sequence has a given type approximately equals -L P ̂P L P P , which means that the probability a given sequence has a type not equal to the true distribution decays exponentially with the number of observations. Thus, while the coin flip sequences HHHHH H H H H H and TTHHT T T H H T are equally likely (assuming a fair coin), the second is more typical because its type is closer to the true distribution.

Histogram Estimators

By far the most used technique for estimating the probability distribution of a continuous-valued random variable is the histogram; more sophisticated techniques are discussed in Silverman. For real-valued data, subdivide the real line into NN intervals r i r i + 1 r i r i + 1 having widths δ i = r i + 1 - r i δ i r i + 1 r i , i=1N i 1 N . These regions are called bins and they should encompass the range of values assumed by the data. For large values, the "edge bins" can extend to infinity to catch the overflows. Given LL observations of a stationary random sequence rl r l , l=0L-1 l 0 L 1 , the histogram estimate hi h i is formed by simply forming a type from the number L i L i of these observations that fall into the i th i th bin and dividing by the binwidth δ i δ i . prr ̂=h1= L 1 L δ 1 if r 1 <r r 2 h2= L 2 L δ 2 if r 2 <r r 3 ifhN= L N L δ N if r N <r r N + 1 p r r h 1 L 1 L δ 1 r 1 r r 2 h 2 L 2 L δ 2 r 2 r r 3 h N L N L δ N r N r r N + 1 The histogram estimate resembles a rectangular approximation to the density. Unless the underlying density has the same form (a rare event), the histogram estimate does not converge to the true density as the number LL of observations grows. Presumably, the value of the histogram at each bin converges to the probability that the observations lie in that bin. limL L i L= r i r i + 1 prrdr L L i L r r i r i + 1 p r r To demonstrate this intuitive feeling, we compactly denote the histogram estimate by using indicator functions. An indicator function I i rl I i r l for the i th i th bin equals one if the observation rl r l lies in the bin and is zero otherwise. The estimate is simply the average of the indicator functions across the observations. hi=1L δ i l=0L-1 I i rl h i 1 L δ i l 0 L 1 I i r l The expected value of I i rl I i r l is simply the probability P i P i that the observation lies in the i th i th bin. Thus, the expected value of each histogram value equals the integral of the actual density over the bin, showing that the histogram is an unbiased estimate of this integral. Convergence can be tested by computing the variance of the estimate. The variance of one bin in the histogram is given by σhi2= P i - P i 2L δ i 2+1L2 δ i 2klE I i rk I i rl- P i 2 h i P i P i 2 L δ i 2 1 L 2 δ i 2 k k l I i r k I i r l P i 2 To simplify this expression, the correlation between the observations must be specified. If the values are statistically independent (we have white noise), each term in the sum becomes zero and the variance is given by σhi2=P i -P i 2L δ i 2 h i P i P i 2 L δ i 2 . Thus, the variance tends to zero as L L and the histogram estimate is consistent, converging to P i δ i P i δ i . If the observations are not white, convergence becomes problematical. Assume, for example, that I i rk I i r k and I i rl I i r l are correlated in a first-order, geometric fashion. E I i rk I i rl- P i 2= P i 2ρ|k-l| I i r k I i r l P i 2 P i 2 ρ k l The variance does increase with this presumed correlation until, at the extreme ( ρ=1 ρ 1 ), the variance is a constant independent of LL! In summary, if the observations are mutually correlated and the histogram estimate converges, the estimate converges to the proper value but more slowly than if the observations were white. The estimate may not converge if the observations are heavily dependent from index to index. This type of dependence structure occurs when the power spectrum of the observations is lowpass with an extremely low cutoff frequency.

Convergence to the density rather than its integral over a region can occur if, as the amount of data grows, we reduce the binwidth δ i δ i and increase NN, the number of bins. However, if we choose the binwidth too small for the amount of available data, few bins contain data and the estimate is inaccurate. Letting r r denote the midpoint of a bin, using a Taylor expansion about this point reveals that the mean-squared error between the histogram and the density at that point is [Thompson and Tapia:44-59] Epr r -hi2=pr r 2L δ i + δ i 436d2dr2prr| r= r 2+O1L+O δ i 5 p r r h i 2 p r r 2 L δ i δ i 4 36 r r r 2 p r r 2 O 1 L O δ i 5 This mean-squared error becomes zero only if L L , L δ i L δ i , and δ i 0 δ i 0 . Thus, the binwidth must decrease more slowly than the rate of increase of the number of observations. We find the "optimum" compromise between the decreasing binwidth and the increasing amount of data to be 2 δ i =9pr r 2d2dr2prr| r= r 215L-15 δ i 9 p r r 2 r r r 2 p r r 2 1 5 L 1 5 Using this binwidth, we find the the mean-squared error to be proportional to L-45 L 4 5 . We have thus discovered the famous "4/5" rule of density estimation; this is one of the few cases where the variance of a convergent statistic decreases more slowly than the reciprocal of the number of observations. In practice, this optimal binwidth cannot be used because the proportionality constant depends on the unknown density being estimated. Roughly speaking, wider bins should be employed where the density is changing slowly. How the optimal binwidth varies with LL can be used to adjust the histogram estimate as more data becomes available.

Density Verification

Once a density estimate is produced, the class of density that best coincides with the estimate remains an issue: Is the density just estimated statistically similar to a Gaussian? The histogram estimate can be used directly in a hypothesis test to determine similarity with any proposed density. Assume that the observations are obtained from a white, stationary, stochastic sequence. Let 0 0 denote the hypothesis that the data has an amplitude distribution equal to the presumed density and 1 1 the dissimilarity hypothesis. If 0 0 is true, the estimate for each bin should not deviate greatly from the probability of a randomly chosen datum lying in the bin. We determine this probability from the presumed density by integrating over the bin. Summing these deviations over the entire estimate, the result should not exceed a threshold. The theory of standard hypothesis testing requires us to produce a specific density for the alternative hypothesis 1 1 . We cannot rationally assign such a density; consistency is being tested, not whether either of two densities provides the best fit. However, taking inspiration from the Neyman-Pearson approach to hypothesis testing [See Neyman-Pearson Criterion], we can develop a test statistic and require its statistical characteristics only under 0 0 . The typically used, but ad hoc test statistic SLN S L N is related to the histogram estimate's mean-squared error [Cramer:416-41]. SLN=i=1N L i -L P i 2L P i =i=1N L i 2L P i -L S L N i 1 N L i L P i 2 L P i i 1 N L i 2 L P i L This statistic sums over the various bins the squared error of the number of observations relative to the expected number. For large LL, SLN S L N has a χ 2 χ 2 probability distribution with N-1 N 1 degrees of freedom [Cramer:417]. Thus, for a given number of observations LL we establish a threshold η N η N by picking a false-alarm probability P F P F and using tables to solve Pr χ N - 1 2 > η N = P F χ N - 1 2 η N P F . To enhance the validity of this approximation, statisticians recommend selecting the binwidth so that each bin contains at least ten observations. In practice, we fulfill this criterion by merging adjacent bins until a sufficient number of observations occur in the new bin and defining its binwidth as the sum of the merged bins' widths. Thus, the number of bins is reduced to some number N N , which determines the degrees of freedom in the hypothesis test. The similarity test between the histogram estimate of a probability density function and an assumed ideal form becomes SL N 0 1 η N S L N 0 1 η N

In many circumstances, the formula for the density is known but not some of its parameters. In the Gaussian case, for example, the mean or variance are usually unknown. These parameters must be determined from the same data used in the consistency test before the test can be used. Doesn't the fact that we use estimates rather than actual values affect the similarity test? The answer is "yes," but in an interesting way: The similarity test changes only in that the number of degrees of freedom of the χ 2 χ 2 random variable used to establish the threshold is reduced by one for each estimated parameter. If a Gaussian density is being tested, for example, the mean and variance usually need to be found. The threshold should then be determined according to the distribution of a χ N - 3 2 χ N - 3 2 random variable.

Example 1

Three sets of observations are considered: Two are drawn from a Gaussian distribution and the other not. The first Gaussian example is white noise, a signal whose characteristics match the assumptions of this section. The second is non-Gaussian, which should not pass the test. Finally, the last test consists of colored Gaussian noise that, because of dependent samples, does not have as many degrees of freedom as would be expected. The number of data available in each case is 2000. The histogram estimator uses fixed-width bins and the χ 2 χ 2 test demands at least ten observations per merged bin. The mean and variance estimates are used in constructing the nominal Gaussian density. The histogram estimates and their approximation by the nominal density whose mean and variance were computed from the data are shown in the Figure 1.

Figure 1: Three histogram density estimates are shown and compared with Gaussian densities having the same mean and variance. The histogram on the top is obtained from Gaussian data that are presumed to be white. The middle one is obtained from a non-Gaussian distribution related to the hyperbolic secant prr=12σsech2πr2σ p r r 1 2 σ r 2 σ 2 . This density resembles a Gaussian about the origin but decreases exponentially in the tails. The bottom histogram is taken from a first-order autoregressive Gaussian signal. Thus, these data are correlated, but yield a histogram resembling the true amplitude distribution. In each case, 2000 data points were used and the histogram contained 100 bins.
Figure 1 (densityest.png)
The chi-squared test P F =0.1 P F 0.1 yielded the following results.
Density N N χ N - 3 2 χ N - 3 2 S2000 N S 2000 N
White Gaussian 70 82.2 78.4
White Sech 65 76.6 232.6
Colored Gaussian 65 76.6 77.8
The white Gaussian noise example clearly passes the χ 2 χ 2 test. The test correctly evaluated the non-Gaussian example, but declared the colored Gaussian data to be non-Gaussian, yielding a value near the threshold. Failing in the latter case to correctly determine the data's Gaussianity, we see that the χ 2 χ 2 test is sensitive to the statistical independence of the observations.

Footnotes

  1. Note that linear filtering optimizes the mean-squared error whether the signals involved are Gaussian or not. Other error criteria might better capture unexpected changes in signal characteristics and non-Gaussian processes contain internal statistical structure beyond that described by the covariance function.
  2. This result assumes that the second derivative of the density is nonzero. If it is not, either the Taylor series expansion brings higher order terms into play or, if all the derivatives are zero, no optimum binwidth can be defined for minimizing the mean-squared error.

References

  1. R.S. Lipster, A.N.Shiryayev. (1977). Statistics of Random Processes I: General Theory. New York: Springer-Verlag.
  2. T.M.Cover, J.A.Thomas. (1991). Elements of Information Theory. John Wiley and Sons, Inc.
  3. B.W.Silverman. (1986). Density Estimation. London: Chapman and Hall.
  4. J.R.Thompson, R.A.Tapia. (1990). Nonparametric Function Estimation, Modeling and Simulation. Philadelphia, PA: SIAM.
  5. H.Cramer. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.

Comments, questions, feedback, criticisms?

Send feedback