# Connexions

Inside Collection (Course): Statistical Signal Processing. Course by: Clayton Scott.

# The Bayesian Paradigm

Module by: Clayton Scott, Robert Nowak.

Statistical analysis is fundamentally an inversion process. The objective is to infer the "causes"--parameters of the probabilistic data generation model--from the "effects"--observations. This can be seen in our interpretation of the likelihood function.

Given a parameter $\theta$, observations are generated according to

$$p(x \mid \theta)$$

The likelihood function has the same form as the conditional density function above,

$$l(\theta \mid x) \equiv p(x \mid \theta)$$

except now $x$ is given (we take measurements) and $\theta$ is the variable. The likelihood function essentially inverts the role of observation (effect) and parameter (cause).

Unfortunately, the likelihood function does not provide a formal framework for the desired inversion.

One problem is that the parameter $\theta$ is supposed to be a fixed and deterministic quantity while the observation $x$ is the realization of a random process. So their roles aren't really interchangeable in this setting.

Moreover, while it is tempting to interpret the likelihood $l(\theta \mid x)$ as a density function for $\theta$, this is not always possible; for example, often the integral $\int l(\theta \mid x)\, d\theta$ is not equal to one, and it may not even be finite.

Another problematic issue is the mathematical formalization of statements like: "Based on the measurements $x$, I am 95% confident that $\theta$ falls in a certain range."

## Example 1

Suppose you toss a coin 10 times and each time it comes up "heads." It might be reasonable to say that we are 99% sure that the coin is unfair, biased towards heads.

Formally, let $\theta \equiv \Pr(\text{heads})$ and consider the hypothesis $H_0 : \theta > 0.5$. The number of heads $x$ in $N$ tosses has density

$$p(x \mid \theta) = \binom{N}{x} \theta^{x} (1-\theta)^{N-x}$$

which is the binomial likelihood. We would like to evaluate

$$p(\theta > 0.5 \mid x) = \; ?$$

The problem with this is that $p(\theta \in H_0 \mid x)$ implies that $\theta$ is a random, not deterministic, quantity. So, while "confidence" statements are very reasonable and in fact a normal part of "everyday thinking," this idea cannot be supported from the classical perspective.

All of these "deficiencies" can be circumvented by a change in how we view the parameter $\theta$.

If we view $\theta$ as the realization of a random variable with density $p(\theta)$, then Bayes Rule (Bayes, 1763) shows that

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}$$

Thus, from this perspective we obtain a well-defined inversion: given $x$, the parameter $\theta$ is distributed according to $p(\theta \mid x)$.

From here, confidence measures such as $p(\theta \in H_0 \mid x)$ are perfectly legitimate quantities to ask for.
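As a minimal sketch of this inversion for the coin example, the snippet below evaluates Bayes Rule numerically on a grid. The uniform prior on $\theta$, the midpoint-rule integration, and the function names are assumptions made for illustration; the text does not specify a prior for Example 1.

```python
import math

def binomial_likelihood(theta, heads, tosses):
    """p(x | theta) for x = heads successes in the given number of tosses."""
    return math.comb(tosses, heads) * theta**heads * (1.0 - theta) ** (tosses - heads)

def posterior_prob_above(threshold, heads, tosses, prior=lambda t: 1.0, cells=100_000):
    """Approximate P(theta > threshold | x) with a midpoint rule on [0, 1].

    The default prior is uniform on [0, 1] (an illustrative assumption).
    """
    width = 1.0 / cells
    evidence = 0.0  # approximates the integral of p(x | theta) p(theta) over [0, 1]
    above = 0.0     # the same integral restricted to theta > threshold
    for i in range(cells):
        theta = (i + 0.5) * width
        weight = binomial_likelihood(theta, heads, tosses) * prior(theta) * width
        evidence += weight
        if theta > threshold:
            above += weight
    return above / evidence

# Ten heads in ten tosses with a uniform prior: the posterior is proportional to theta**10.
prob_biased = posterior_prob_above(0.5, heads=10, tosses=10)
print(f"P(theta > 0.5 | x) = {prob_biased:.4f}")
```

Under the assumed uniform prior the exact value is $1 - 0.5^{11}$, so a statement such as "we are 99% sure the coin is biased towards heads" becomes a well-defined posterior probability. A different prior would give a different number.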

Definition 1: Bayesian statistical model
A statistical model composed of a data generation model, $p(x \mid \theta)$, and a prior distribution on the parameters, $p(\theta)$.

The prior distribution (or prior for short) models the uncertainty in the parameter. More specifically, $p(\theta)$ models our knowledge--or lack thereof--prior to collecting data.

Notice that

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \propto p(x \mid \theta)\, p(\theta)$$

since the data $x$ are known, so $p(x)$ is just a constant. Hence, $p(\theta \mid x)$ is proportional to the likelihood function multiplied by the prior.

Bayesian analysis has some significant advantages over classical statistical analysis:

1. properly inverts the relationship between causes and effects
2. permits meaningful assessments in confidence regions
3. enables the incorporation of prior knowledge into the analysis (which could come from previous experiments, for example)
4. leads to more accurate estimators (provided the prior knowledge is accurate)
5. obeys the Likelihood and Sufficiency principles

## Example 2

$$x_n = A + W_n, \quad n = 1, \dots, N$$

where $W_n \sim \mathcal{N}(0, \sigma^2)$ iid. The sample mean

$$\hat{A} = \frac{1}{N} \sum_{n=1}^{N} x_n$$

is the MVUB and MLE estimator. Now suppose that we have prior knowledge that $-A_0 \le A \le A_0$. We might incorporate this by forming a new estimator

$$\tilde{A} = \begin{cases} -A_0 & \text{if } \hat{A} < -A_0 \\ \hat{A} & \text{if } -A_0 \le \hat{A} \le A_0 \\ A_0 & \text{if } \hat{A} > A_0 \end{cases} \tag{1}$$

This is called a truncated sample mean estimator of $A$. Is $\tilde{A}$ a better estimator of $A$ than the sample mean $\hat{A}$?

Let $p(a)$ denote the density of $\hat{A}$. Since $\hat{A} = \frac{1}{N} \sum x_n$, $p(a) = \mathcal{N}\left(A, \frac{\sigma^2}{N}\right)$. The density of $\tilde{A}$ is given by

$$\tilde{p}(a) = \Pr(\hat{A} \le -A_0)\, \delta(a + A_0) + p(a)\, I_{\{-A_0 \le a \le A_0\}} + \Pr(\hat{A} \ge A_0)\, \delta(a - A_0) \tag{2}$$

Now consider the MSE of the sample mean $\hat{A}$.

$$\text{MSE}(\hat{A}) = \int (a - A)^2\, p(a)\, da \tag{3}$$

## Note

1. $\tilde{A}$ is biased (Figure 2).
2. Although $\hat{A}$ is MVUB, $\tilde{A}$ is better in the MSE sense.
3. Prior information is aptly described by regarding $A$ as a random variable with a prior distribution $U[-A_0, A_0]$, which implies that we know $-A_0 \le A \le A_0$, but otherwise $A$ is arbitrary.
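The MSE comparison in the note can be checked by simulation. The following is a rough Monte Carlo sketch, not part of the original module; the values of $A$, $A_0$, $\sigma$, $N$, the number of trials, and the function name are hypothetical choices.

```python
import random

def simulate_mse(A, A0, sigma, N, trials=20_000, seed=0):
    """Monte Carlo estimate of the MSE of the sample mean and of its truncated version."""
    rng = random.Random(seed)
    se_mean = 0.0
    se_trunc = 0.0
    for _ in range(trials):
        # Sample mean of N observations x_n = A + W_n with W_n ~ N(0, sigma^2).
        a_hat = sum(rng.gauss(A, sigma) for _ in range(N)) / N
        # Truncate the sample mean to the known interval [-A0, A0].
        a_tilde = min(max(a_hat, -A0), A0)
        se_mean += (a_hat - A) ** 2
        se_trunc += (a_tilde - A) ** 2
    return se_mean / trials, se_trunc / trials

# Hypothetical setting: the true A lies inside the prior interval [-A0, A0].
mse_mean, mse_trunc = simulate_mse(A=0.8, A0=1.0, sigma=1.0, N=5)
print(f"sample mean MSE: {mse_mean:.4f}, truncated MSE: {mse_trunc:.4f}")
```

Whenever the true $A$ lies in $[-A_0, A_0]$, clipping can only move an estimate closer to $A$, so the truncated estimator's squared error is never larger on any trial. The sample mean's MSE should come out near $\sigma^2 / N$.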

## The Bayesian Approach to Statistical Modeling

### Example 3

$$x_n = A + W_n, \quad n = 1, \dots, N$$

A prior distribution allows us to incorporate prior information regarding the unknown parameter: probable values of the parameter are supported by the prior. Basically, the prior reflects what we believe "Nature" will probably throw at us.

## Elements of Bayesian Analysis

• (a): joint distribution $p(x, \theta) = p(x \mid \theta)\, p(\theta)$
• (b): marginal distributions $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$ and $p(\theta) = \int p(x \mid \theta)\, p(\theta)\, dx$, where $p(\theta)$ is a prior.
• (c): posterior distribution $p(\theta \mid x) = \dfrac{p(x, \theta)}{p(x)} = \dfrac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}$

### Example 4

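$$p(x \mid \theta) = \binom{n}{x} \theta^{x} (1-\theta)^{n-x}, \quad \theta \in [0, 1]$$

which is the Binomial likelihood.

$$p(\theta) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}$$

which is the Beta prior distribution, and

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha + \beta)}$$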

#### Joint Density

$$p(x, \theta) = \frac{\binom{n}{x}}{B(\alpha, \beta)}\, \theta^{\alpha+x-1} (1-\theta)^{n-x+\beta-1}$$

#### Marginal Density

$$p(x) = \binom{n}{x} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\, \Gamma(\beta)} \cdot \frac{\Gamma(\alpha+x)\, \Gamma(n-x+\beta)}{\Gamma(\alpha+\beta+n)}$$

#### Posterior Density

$$p(\theta \mid x) = \frac{\theta^{\alpha+x-1} (1-\theta)^{\beta+n-x-1}}{B(\alpha+x,\, \beta+n-x)}$$

which is the Beta density with parameters $\alpha' = \alpha + x$ and $\beta' = \beta + n - x$.
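The Beta-Binomial update above reduces to simple parameter arithmetic. A short sketch follows; the function names and the numbers in the usage lines are hypothetical, chosen only to illustrate the formulas for the marginal and posterior densities.

```python
import math

def beta_fn(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_binomial_update(alpha, beta, x, n):
    """Posterior Beta parameters after observing x successes in n trials."""
    if not 0 <= x <= n:
        raise ValueError("x must satisfy 0 <= x <= n")
    return alpha + x, beta + n - x

def marginal(x, n, alpha, beta):
    """Marginal density p(x), rewritten as C(n, x) B(alpha + x, n - x + beta) / B(alpha, beta)."""
    return math.comb(n, x) * beta_fn(alpha + x, n - x + beta) / beta_fn(alpha, beta)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) density."""
    return alpha / (alpha + beta)

# Hypothetical numbers: a Beta(2, 2) prior and 7 successes in 10 trials.
post_alpha, post_beta = beta_binomial_update(2.0, 2.0, x=7, n=10)
total_mass = sum(marginal(x, 10, 2.0, 2.0) for x in range(11))
print(post_alpha, post_beta, beta_mean(post_alpha, post_beta), total_mass)
```

Summing the marginal over $x = 0, \dots, n$ is a quick sanity check that it is a valid probability mass function.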

## Selecting an Informative Prior

Clearly, the most important objective is to choose the prior pθ p θ that best reflects the prior knowledge available to us. In general, however, our prior knowledge is imprecise and any number of prior densities may aptly capture this information. Moreover, usually the optimal estimator can't be obtained in closed-form.

Therefore, sometimes it is desirable to choose a prior density that models prior knowledge and is nicely matched in functional form to $p(x \mid \theta)$ so that the optimal estimator (and posterior density) can be expressed in a simple fashion.

## Choosing a Prior

### 1. Informative Priors

• design/choose priors that are compatible with prior knowledge of unknown parameters

### 2. Non-informative Priors

• attempt to remove subjectiveness from Bayesian procedures
• designs are often based on invariance arguments

### Example 5

Suppose we want to estimate the variance of a process, incorporating a prior that is amplitude-scale invariant (so that we are invariant to arbitrary amplitude rescaling of data). The density $p(s) = \frac{1}{s}$ satisfies this condition:

$$\sigma^2 \sim p(s) \;\Rightarrow\; A\sigma^2 \sim p(s)$$

where $p(s)$ is non-informative since it is invariant to amplitude-scale.
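The invariance claim can be checked with a change of variables. In the sketch below, $c > 0$ denotes an arbitrary rescaling constant and $t = c\,s$ the rescaled variable; both symbols are introduced here for illustration.

```latex
% Assumption: c > 0 is an arbitrary rescaling constant, t = c s, and p_s(s) = 1/s.
p_t(t) = p_s\!\left(\frac{t}{c}\right) \left|\frac{ds}{dt}\right|
       = \frac{c}{t} \cdot \frac{1}{c}
       = \frac{1}{t}
```

The rescaled variable has the same functional form of density as the original. Note that $1/s$ does not integrate to a finite value over $(0, \infty)$, so it is not a proper probability density.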

## Conjugate Priors

### Idea

Given $p(x \mid \theta)$, choose $p(\theta)$ so that $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$ has a simple functional form.

### Conjugate Priors

Choose $p(\theta) \in \mathcal{P}$, where $\mathcal{P}$ is a family of densities (e.g., Gaussian family) so that the posterior density also belongs to that family.

Definition 2: conjugate prior
$p(\theta)$ is a conjugate prior for $p(x \mid \theta)$ if $p(\theta) \in \mathcal{P} \Rightarrow p(\theta \mid x) \in \mathcal{P}$.

### Example 6

$$x_n = A + W_n, \quad n = 1, \dots, N$$

where $W_n \sim \mathcal{N}(0, \sigma^2)$ iid. Rather than modeling $A \sim U[-A_0, A_0]$ (which did not yield a closed-form estimator), consider

$$p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}}\, e^{-\frac{1}{2\sigma_A^2}(A - \mu)^2}$$

With $\mu = 0$ and $\sigma_A = \frac{1}{3} A_0$ this Gaussian prior also reflects prior knowledge that it is unlikely for $|A| \ge A_0$.
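To see how unlikely, the tail probability of this prior can be evaluated directly. This is a small sketch using the Gaussian CDF written with the error function; the value of $A_0$ and the function name are hypothetical.

```python
import math

def prob_outside(A0, mu, sigma_A):
    """P(|A| >= A0) when A ~ N(mu, sigma_A^2)."""
    def cdf(a):
        # Gaussian CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf((a - mu) / (sigma_A * math.sqrt(2.0))))
    return cdf(-A0) + (1.0 - cdf(A0))

A0 = 1.0  # hypothetical bound; with mu = 0 and sigma_A = A0 / 3 the result does not depend on it
p_out = prob_outside(A0, mu=0.0, sigma_A=A0 / 3.0)
print(f"P(|A| >= A0) = {p_out:.4f}")
```

With $\sigma_A = A_0 / 3$ the bound $A_0$ sits three standard deviations from the prior mean, so the prior puts roughly 0.27% of its mass outside $[-A_0, A_0]$.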

The Gaussian prior is also conjugate to the Gaussian likelihood

$$p(x \mid A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2}$$

so that the resulting posterior density is also a simple Gaussian, as shown next.

First note that

$$p(x \mid A) = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} x_n^2}\, e^{-\frac{1}{2\sigma^2}\left(N A^2 - 2 N A \bar{x}\right)}$$

where $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$.

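$$p(A \mid x) = \frac{p(x \mid A)\, p(A)}{\int p(x \mid A)\, p(A)\, dA} = \frac{e^{-\frac{1}{2}\left(\frac{1}{\sigma^2}\left(N A^2 - 2 N A \bar{x}\right) + \frac{1}{\sigma_A^2}(A - \mu)^2\right)}}{\int e^{-\frac{1}{2}\left(\frac{1}{\sigma^2}\left(N A^2 - 2 N A \bar{x}\right) + \frac{1}{\sigma_A^2}(A - \mu)^2\right)}\, dA} = \frac{e^{-\frac{1}{2} Q(A)}}{\int e^{-\frac{1}{2} Q(A)}\, dA} \tag{4}$$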
where

$$Q(A) = \frac{N}{\sigma^2} A^2 - \frac{2 N A \bar{x}}{\sigma^2} + \frac{A^2}{\sigma_A^2} - \frac{2 \mu A}{\sigma_A^2} + \frac{\mu^2}{\sigma_A^2}$$

Now let

$$\sigma_{A|x}^2 \equiv \frac{1}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}}, \qquad \mu_{A|x} \equiv \left(\frac{N}{\sigma^2} \bar{x} + \frac{\mu}{\sigma_A^2}\right) \sigma_{A|x}^2$$

Then by "completing the square" we have

$$Q(A) = \frac{1}{\sigma_{A|x}^2}\left(A^2 - 2 \mu_{A|x} A + \mu_{A|x}^2\right) - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2} + \frac{\mu^2}{\sigma_A^2} = \frac{1}{\sigma_{A|x}^2}\left(A - \mu_{A|x}\right)^2 - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2} + \frac{\mu^2}{\sigma_A^2} \tag{5}$$
Hence,

$$p(A \mid x) = \frac{e^{-\frac{1}{2\sigma_{A|x}^2}\left(A - \mu_{A|x}\right)^2}\, e^{-\frac{1}{2}\left(\frac{\mu^2}{\sigma_A^2} - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2}\right)}}{\int e^{-\frac{1}{2\sigma_{A|x}^2}\left(A - \mu_{A|x}\right)^2}\, e^{-\frac{1}{2}\left(\frac{\mu^2}{\sigma_A^2} - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2}\right)}\, dA}$$

where $e^{-\frac{1}{2\sigma_{A|x}^2}\left(A - \mu_{A|x}\right)^2}$ is the "unnormalized" Gaussian density and $e^{-\frac{1}{2}\left(\frac{\mu^2}{\sigma_A^2} - \frac{\mu_{A|x}^2}{\sigma_{A|x}^2}\right)}$ is a constant, independent of $A$, which cancels between numerator and denominator. This implies that

$$p(A \mid x) = \frac{1}{\sqrt{2\pi\sigma_{A|x}^2}}\, e^{-\frac{1}{2\sigma_{A|x}^2}\left(A - \mu_{A|x}\right)^2}$$

that is, $A \mid x \sim \mathcal{N}\left(\mu_{A|x}, \sigma_{A|x}^2\right)$. Now

$$\hat{A} = E[A \mid x] = \int A\, p(A \mid x)\, dA = \mu_{A|x} = \frac{\frac{N}{\sigma^2} \bar{x} + \frac{\mu}{\sigma_A^2}}{\frac{N}{\sigma^2} + \frac{1}{\sigma_A^2}} = \frac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}}\, \bar{x} + \frac{\frac{\sigma^2}{N}}{\sigma_A^2 + \frac{\sigma^2}{N}}\, \mu = \alpha \bar{x} + (1 - \alpha) \mu \tag{6}$$

where $0 < \alpha = \dfrac{\sigma_A^2}{\sigma_A^2 + \frac{\sigma^2}{N}} < 1$.

### Interpretation

1. When there is little data, $\sigma_A^2 \ll \frac{\sigma^2}{N}$, so $\alpha$ is small and $\hat{A} \approx \mu$.
2. When there is a lot of data, $\sigma_A^2 \gg \frac{\sigma^2}{N}$, so $\alpha \approx 1$ and $\hat{A} \approx \bar{x}$.
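The two regimes can be illustrated numerically. The sketch below implements the posterior mean, posterior variance, and weight $\alpha$ from the Gaussian example; the function name, the random seed, and the specific parameter values are hypothetical.

```python
import random

def gaussian_posterior(x, sigma2, mu, sigma2_A):
    """Posterior mean, posterior variance, and weight alpha for x_n = A + W_n
    with noise variance sigma2 and a N(mu, sigma2_A) prior on A."""
    N = len(x)
    x_bar = sum(x) / N
    post_var = 1.0 / (N / sigma2 + 1.0 / sigma2_A)
    post_mean = (N / sigma2 * x_bar + mu / sigma2_A) * post_var
    alpha = sigma2_A / (sigma2_A + sigma2 / N)
    return post_mean, post_var, alpha

rng = random.Random(1)
sigma2, mu, sigma2_A = 1.0, 0.0, 1.0 / 9.0  # hypothetical noise and prior settings
true_A = 0.5                                # hypothetical true parameter
alphas = []
for N in (1, 10, 1000):
    x = [rng.gauss(true_A, sigma2 ** 0.5) for _ in range(N)]
    post_mean, post_var, alpha = gaussian_posterior(x, sigma2, mu, sigma2_A)
    alphas.append(alpha)
    print(f"N={N:5d}  alpha={alpha:.3f}  posterior mean={post_mean:.3f}")
```

As $N$ grows, $\alpha$ moves from near 0 towards 1 and the posterior mean shifts from the prior mean $\mu$ towards the sample mean $\bar{x}$.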

## Interplay Between Data and Prior Knowledge

Small $N$ $\Rightarrow$ $\hat{A}$ favors the prior.

Large $N$ $\Rightarrow$ $\hat{A}$ favors the data.

## The Multivariate Gaussian Model

The multivariate Gaussian model is the most important Bayesian tool in signal processing. It leads directly to the celebrated Wiener and Kalman filters.

Assume that we are dealing with random vectors $x$ and $y$. We will regard $y$ as a signal vector that is to be estimated from an observation vector $x$.

$y$ plays the same role as $\theta$ did in earlier discussions. We will assume that $y$ is $p \times 1$ and $x$ is $N \times 1$. Furthermore, assume that $x$ and $y$ are jointly Gaussian distributed:

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{pmatrix} \right)$$

with $E[x] = 0$, $E[y] = 0$, $E[x x^T] = R_{xx}$, $E[x y^T] = R_{xy}$, $E[y x^T] = R_{yx}$, $E[y y^T] = R_{yy}$, and

$$R \equiv \begin{pmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{pmatrix}$$

### Example 7

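$$x = y + W, \quad W \sim \mathcal{N}(0, \sigma^2 I)$$

with prior $p(y) = \mathcal{N}(0, R_{yy})$, where $y$ is independent of $W$. Then

$$E[x] = E[y] + E[W] = 0$$

$$E[x x^T] = E[y y^T] + E[y W^T] + E[W y^T] + E[W W^T] = R_{yy} + \sigma^2 I$$

$$E[x y^T] = E[y y^T] + E[W y^T] = R_{yy} = E[y x^T]$$

so that

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} R_{yy} + \sigma^2 I & R_{yy} \\ R_{yy} & R_{yy} \end{pmatrix} \right)$$

From our Bayesian perspective, we are interested in $p(y \mid x)$.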

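$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{(2\pi)^{-\frac{N}{2}} (2\pi)^{-\frac{p}{2}} (\det R)^{-\frac{1}{2}}\, e^{-\frac{1}{2} \begin{pmatrix} x^T & y^T \end{pmatrix} R^{-1} \begin{pmatrix} x \\ y \end{pmatrix}}}{(2\pi)^{-\frac{N}{2}} (\det R_{xx})^{-\frac{1}{2}}\, e^{-\frac{1}{2} x^T R_{xx}^{-1} x}} \tag{7}$$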
In this formula we are faced with

$$R^{-1} = \begin{pmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{pmatrix}^{-1}$$

The inverse of this covariance matrix can be written as

$$\begin{pmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{pmatrix}^{-1} = \begin{pmatrix} R_{xx}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} -R_{xx}^{-1} R_{xy} \\ I \end{pmatrix} Q^{-1} \begin{pmatrix} -R_{yx} R_{xx}^{-1} & I \end{pmatrix}$$

where $Q \equiv R_{yy} - R_{yx} R_{xx}^{-1} R_{xy}$. (Verify this formula by applying the right hand side above to $R$ to get $I$.)
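The suggested verification can also be done numerically. The sketch below builds the right hand side of the partitioned inverse for a small case with $N = 2$ and $p = 1$ and multiplies it by $R$. The covariance values and the helper functions are hypothetical, and the $2 \times 2$ inverse is hard-coded for this size only.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical covariance blocks with N = 2 and p = 1.
R_xx = [[2.0, 0.5], [0.5, 1.0]]
R_xy = [[0.3], [0.2]]
R_yx = [[0.3, 0.2]]
R_yy = [[1.5]]

R_xx_inv = inv2(R_xx)
# Q = R_yy - R_yx R_xx^{-1} R_xy, which is 1 x 1 here and so handled as a scalar.
Q = R_yy[0][0] - matmul(matmul(R_yx, R_xx_inv), R_xy)[0][0]

# Column [-R_xx^{-1} R_xy; I] and row [-R_yx R_xx^{-1}, I].
col = [[-v[0]] for v in matmul(R_xx_inv, R_xy)] + [[1.0]]
row = [[-v for v in matmul(R_yx, R_xx_inv)[0]] + [1.0]]

# Right hand side: block-diag(R_xx^{-1}, 0) + col Q^{-1} row.
padded = [r + [0.0] for r in R_xx_inv] + [[0.0, 0.0, 0.0]]
outer = matmul(col, row)
R_inv = [[padded[i][j] + outer[i][j] / Q for j in range(3)] for i in range(3)]

# Assemble R from its blocks and check that R_inv R is the identity.
R = [R_xx[0] + R_xy[0], R_xx[1] + R_xy[1], R_yx[0] + R_yy[0]]
product = matmul(R_inv, R)
max_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                for i in range(3) for j in range(3))
print(f"largest deviation from identity: {max_error:.2e}")
```

The deviation should be at the level of floating-point rounding. For larger blocks the same check applies with a general matrix inverse in place of `inv2`.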
