The Bayesian Paradigm

Module by: Clayton Scott, Robert Nowak

Statistical analysis is fundamentally an inversion process. The objective is to infer the "causes"--parameters of the probabilistic data generation model--from the "effects"--observations. This can be seen in our interpretation of the likelihood function.

Given a parameter $\theta$, observations are generated according to $p(x|\theta)$. The likelihood function has the same form as the conditional density function above, $l(\theta|x) \equiv p(x|\theta)$, except now $x$ is given (we take measurements) and $\theta$ is the variable. The likelihood function essentially inverts the role of observation (effect) and parameter (cause).

Unfortunately, the likelihood function does not provide a formal framework for the desired inversion.

One problem is that the parameter $\theta$ is supposed to be a fixed and deterministic quantity while the observation $x$ is the realization of a random process. So their roles aren't really interchangeable in this setting.

Moreover, while it is tempting to interpret the likelihood $l(\theta|x)$ as a density function for $\theta$, this is not always possible; for example, often the integral $\int l(\theta|x)\,d\theta$ does not equal 1 and may even diverge.
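As a quick check of this point, the sketch below integrates a binomial likelihood over $\theta$ numerically. The values n = 10 and x = 10 and the helper names are illustrative assumptions, not part of the module. For this likelihood the integral works out to 1/(n + 1) rather than 1, so it cannot be read directly as a density in $\theta$.

```python
import math

def binomial_likelihood(theta, x, n):
    """l(theta | x) = C(n, x) * theta^x * (1 - theta)^(n - x)."""
    return math.comb(n, x) * theta**x * (1.0 - theta)**(n - x)

def integrate_over_theta(x, n, steps=100_000):
    """Midpoint-rule approximation of the integral of l(theta | x) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(binomial_likelihood((k + 0.5) * h, x, n) for k in range(steps))

# Assumed example values: 10 heads in 10 tosses.
area = integrate_over_theta(x=10, n=10)
print(area)  # close to 1 / (n + 1), not 1
```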

Another problematic issue is the mathematical formalization of statements like: "Based on the measurements $x$, I am 95% confident that $\theta$ falls in a certain range."

Example 1

Suppose you toss a coin 10 times and each time it comes up "heads." It might be reasonable to say that we are 99% sure that the coin is unfair, biased towards heads.

Formally: $H_0: \theta \equiv \Pr(\text{heads}) > 0.5$, with $x \sim \binom{N}{x}\theta^{x}(1-\theta)^{N-x}$, which is the binomial likelihood. We would like to evaluate $p(\theta > 0.5 \,|\, x) = \;?$ The problem with this is that $p(\theta \in H_0 \,|\, x)$ implies that $\theta$ is a random, not deterministic, quantity. So, while "confidence" statements are very reasonable and in fact a normal part of "everyday thinking," this idea cannot be supported from the classical perspective.

All of these "deficiencies" can be circumvented by a change in how we view the parameter $\theta$.

If we view $\theta$ as the realization of a random variable with density $p(\theta)$, then Bayes' rule (Bayes, 1763) shows that $$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{\int p(x|\theta)\,p(\theta)\,d\theta}.$$ Thus, from this perspective we obtain a well-defined inversion: given $x$, the parameter $\theta$ is distributed according to $p(\theta|x)$.

From here, confidence measures such as $p(\theta \in H_0 \,|\, x)$ are perfectly legitimate quantities to ask for.
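To make this concrete for the coin of Example 1, the sketch below applies Bayes' rule on a grid of $\theta$ values. It assumes a flat prior on [0, 1], which the example does not specify, and the function name is hypothetical. Under that assumption the posterior is Beta(11, 1), so the probability that $\theta > 0.5$ is $1 - 0.5^{11} \approx 0.9995$.

```python
def posterior_on_grid(x, n, prior, steps=20_000):
    """Return grid points and normalized posterior weights via Bayes' rule."""
    h = 1.0 / steps
    thetas = [(k + 0.5) * h for k in range(steps)]
    # Unnormalized posterior: likelihood times prior.
    # The binomial coefficient is constant in theta and cancels on normalization.
    unnorm = [t**x * (1.0 - t)**(n - x) * prior(t) for t in thetas]
    total = sum(unnorm)
    return thetas, [u / total for u in unnorm]

# Assumption: a flat prior on [0, 1]; 10 heads observed in 10 tosses.
thetas, weights = posterior_on_grid(x=10, n=10, prior=lambda t: 1.0)
prob_biased = sum(w for t, w in zip(thetas, weights) if t > 0.5)
print(prob_biased)
```

A different prior would give a different number; the point is only that the question has a well-defined answer once a prior is chosen.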

Definition 1: Bayesian statistical model
A statistical model composed of a data generation model, $p(x|\theta)$, and a prior distribution on the parameters, $p(\theta)$.

The prior distribution (or prior for short) models the uncertainty in the parameter. More specifically, $p(\theta)$ models our knowledge--or lack thereof--prior to collecting data.

Notice that $$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)} \propto p(x|\theta)\,p(\theta),$$ since the data $x$ are known and $p(x)$ is therefore just a constant. Hence, $p(\theta|x)$ is proportional to the likelihood function multiplied by the prior.

Bayesian analysis has some significant advantages over classical statistical analysis:

  1. properly inverts the relationship between causes and effects
  2. permits meaningful assessments in confidence regions
  3. enables the incorporation of prior knowledge into the analysis (which could come from previous experiments, for example)
  4. leads to more accurate estimators (provided the prior knowledge is accurate)
  5. obeys the Likelihood and Sufficiency principles

Example 2

$x_n = A + W_n$, $n = 1, \dots, N$, with $W_n \sim \mathcal{N}(0, \sigma^2)$ iid. The sample mean $\hat{A} = \frac{1}{N}\sum_{n=1}^{N} x_n$ is the MVUB and MLE estimator. Now suppose that we have prior knowledge that $-A_0 \le A \le A_0$. We might incorporate this by forming a new estimator

$$\tilde{A} = \begin{cases} -A_0 & \text{if } \hat{A} < -A_0 \\ \hat{A} & \text{if } -A_0 \le \hat{A} \le A_0 \\ A_0 & \text{if } \hat{A} > A_0 \end{cases} \tag{1}$$
This is called a truncated sample mean estimator of $A$. Is $\tilde{A}$ a better estimator of $A$ than the sample mean $\hat{A}$?

Let $p(a)$ denote the density of $\hat{A}$. Since $\hat{A} = \frac{1}{N}\sum_{n} x_n$, $p(a) = \mathcal{N}\!\left(A, \frac{\sigma^2}{N}\right)$. The density of $\tilde{A}$ is given by

$$\tilde{p}(a) = \Pr[\hat{A} \le -A_0]\,\delta(a + A_0) + p(a)\,I_{\{-A_0 \le a \le A_0\}} + \Pr[\hat{A} \ge A_0]\,\delta(a - A_0) \tag{2}$$
Figure 1: Subfigure 1.1 (density1.png) and Subfigure 1.2 (density2.png)
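The weights of the two point masses in the density of the truncated estimator can be computed from the Gaussian CDF. The sketch below does this with the standard library; the parameter values (A = 0.5, A_0 = 1, sigma = 1, N = 4) and function names are illustrative assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_mean_masses(A, A0, sigma, N):
    """Probabilities that the sample mean falls below -A0, inside [-A0, A0], or above A0."""
    std = sigma / math.sqrt(N)                 # standard deviation of the sample mean
    lower = normal_cdf((-A0 - A) / std)        # weight of the delta at -A0
    upper = 1.0 - normal_cdf((A0 - A) / std)   # weight of the delta at +A0
    inside = 1.0 - lower - upper               # mass of the continuous part
    return lower, inside, upper

# Assumed example values, chosen only for illustration.
lower, inside, upper = truncated_mean_masses(A=0.5, A0=1.0, sigma=1.0, N=4)
print(lower, inside, upper)
```

The three numbers sum to one, which confirms that the mixed density (point masses plus a clipped Gaussian) is properly normalized.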
Now consider the MSE of the sample mean $\hat{A}$.
$$\mathrm{MSE}(\hat{A}) = \int_{-\infty}^{\infty} (a - A)^2\, p(a)\, da \tag{3}$$

Note

  1. $\tilde{A}$ is biased (Figure 2).
  2. Although $\hat{A}$ is MVUB, $\tilde{A}$ is better in the MSE sense.
  3. Prior information is aptly described by regarding $A$ as a random variable with a prior distribution $\mathcal{U}(-A_0, A_0)$, which implies that we know $-A_0 \le A \le A_0$, but otherwise $A$ is arbitrary.
Figure 2: Subfigure 2.1 (biased1.png): mean of $\hat{A}$ equals $A$. Subfigure 2.2 (biased2.png): mean of $\tilde{A}$ differs from $A$.
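The MSE comparison in Example 2 can be checked by simulation. The following is a minimal Monte Carlo sketch using only the standard library; the parameter values, seed, and function name are assumptions made for illustration. Because the true $A$ lies inside $[-A_0, A_0]$, clipping never moves an estimate further from $A$, so the truncated estimator's squared error is never larger on any trial.

```python
import random

def simulate_mse(A, A0, sigma, N, trials=20_000, seed=0):
    """Monte Carlo estimate of the MSE of the sample mean and of its truncated version."""
    rng = random.Random(seed)
    se_mean = 0.0
    se_trunc = 0.0
    for _ in range(trials):
        # Sample mean of N noisy observations x_n = A + W_n.
        a_hat = sum(A + rng.gauss(0.0, sigma) for _ in range(N)) / N
        # Truncate the sample mean to the interval [-A0, A0].
        a_trunc = max(-A0, min(A0, a_hat))
        se_mean += (a_hat - A) ** 2
        se_trunc += (a_trunc - A) ** 2
    return se_mean / trials, se_trunc / trials

# Assumed example values: true A = 0.5 with known bound A0 = 1.
mse_mean, mse_trunc = simulate_mse(A=0.5, A0=1.0, sigma=1.0, N=4)
print(mse_mean, mse_trunc)
```

The sample-mean MSE should land near $\sigma^2/N = 0.25$ for these values, with the truncated estimator somewhat below it.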

The Bayesian Approach to Statistical Modeling

Figure 3 (block1.png): $w$ is the noise and $x$ is the observation.

Example 3

$x_n = A + W_n$, $n = 1, \dots, N$

Figure 4 (block2.png)
The prior distribution allows us to incorporate prior information regarding the unknown parameter: probable values of the parameter are supported by the prior. Basically, the prior reflects what we believe "Nature" will probably throw at us.

Elements of Bayesian Analysis

  • (a) - joint distribution: $p(x, \theta) = p(x|\theta)\,p(\theta)$
  • (b) - marginal distributions: $p(x) = \int p(x|\theta)\,p(\theta)\,d\theta$ and $p(\theta) = \int p(x|\theta)\,p(\theta)\,dx$, where $p(\theta)$ is a prior.
  • (c) - posterior distribution: $p(\theta|x) = \frac{p(x, \theta)}{p(x)} = \frac{p(x|\theta)\,p(\theta)}{\int p(x|\theta)\,p(\theta)\,d\theta}$

Example 4

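For $\theta \in [0, 1]$: $p(x|\theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}$, which is the binomial likelihood, and $p(\theta) = \frac{1}{B(\alpha, \beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}$, which is the Beta prior distribution, with $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.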

Figure 5 (betaPrior.png): This reflects prior knowledge that the most probable values of $\theta$ are close to $\frac{\alpha}{\alpha+\beta}$.

Joint Density

$$p(x, \theta) = \frac{\binom{n}{x}}{B(\alpha, \beta)}\,\theta^{\alpha+x-1}(1-\theta)^{n-x+\beta-1}$$

marginal density

$$p(x) = \binom{n}{x}\,\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\frac{\Gamma(\alpha+x)\,\Gamma(n-x+\beta)}{\Gamma(\alpha+\beta+n)}$$

posterior density

$$p(\theta|x) = \frac{\theta^{\alpha+x-1}(1-\theta)^{\beta+n-x-1}}{B(\alpha+x, \beta+n-x)}$$ which is the Beta density with parameters $\alpha' = \alpha + x$ and $\beta' = \beta + n - x$.
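The formulas of Example 4 are easy to sanity-check numerically. The sketch below evaluates the marginal $p(x)$ through log-gamma functions and returns the posterior parameters; the prior values (alpha = beta = 2), the data (7 successes in 10 trials), and the function names are assumptions chosen for illustration. It relies on the standard fact that a Beta(a, b) density has mean a / (a + b).

```python
import math

def log_beta(a, b):
    """log B(a, b) computed from log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal(x, n, alpha, beta):
    """p(x) = C(n, x) * B(alpha + x, beta + n - x) / B(alpha, beta)."""
    log_ratio = log_beta(alpha + x, beta + n - x) - log_beta(alpha, beta)
    return math.comb(n, x) * math.exp(log_ratio)

def posterior_params(x, n, alpha, beta):
    """Parameters of the Beta posterior after x successes in n trials."""
    return alpha + x, beta + n - x

# Assumed example values.
n, alpha, beta = 10, 2.0, 2.0
total = sum(marginal(x, n, alpha, beta) for x in range(n + 1))  # should be 1
a_post, b_post = posterior_params(7, n, alpha, beta)
post_mean = a_post / (a_post + b_post)
print(total, (a_post, b_post), post_mean)
```

The marginal sums to one over x = 0, ..., n, as any probability mass function must, and the posterior mean sits between the prior mean (0.5) and the observed frequency (0.7).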

Selecting an Informative Prior

Clearly, the most important objective is to choose the prior pθ p θ that best reflects the prior knowledge available to us. In general, however, our prior knowledge is imprecise and any number of prior densities may aptly capture this information. Moreover, usually the optimal estimator can't be obtained in closed-form.

Therefore, sometimes it is desirable to choose a prior density that models prior knowledge and is nicely matched in functional form to $p(x|\theta)$ so that the optimal estimator (and posterior density) can be expressed in a simple fashion.

Choosing a Prior

1. Informative Priors

  • design/choose priors that are compatible with prior knowledge of unknown parameters

2. Non-informative Priors

  • attempt to remove subjectiveness from Bayesian procedures
  • designs are often based on invariance arguments

Example 5

Suppose we want to estimate the variance of a process, incorporating a prior that is amplitude-scale invariant (so that we are invariant to arbitrary amplitude rescaling of the data). The prior $p(s) = \frac{1}{s}$ satisfies this condition: $\sigma^2 \sim p(s) \Rightarrow A\sigma^2 \sim p(s)$, where $p(s)$ is non-informative since it is invariant to amplitude scale.
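The invariance claim follows from a one-line change of variables. In the sketch below, $c > 0$ stands for the rescaling factor (written $A$ in the example) and $p_s$, $p_t$ are labels introduced here for the densities before and after rescaling.

```latex
\begin{align*}
  t &= c\,s, \qquad c > 0, \\
  p_t(t) &= p_s\!\left(\tfrac{t}{c}\right)\left|\tfrac{ds}{dt}\right|
          = \frac{c}{t}\cdot\frac{1}{c}
          = \frac{1}{t}.
\end{align*}
```

The rescaled variable has the same density form as the original. Note that $1/s$ does not integrate to a finite value over $(0, \infty)$, so it is an improper prior.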

Conjugate Priors

Idea

Given $p(x|\theta)$, choose $p(\theta)$ so that $p(\theta|x) \propto p(x|\theta)\,p(\theta)$ has a simple functional form.

Conjugate Priors

Choose $p(\theta) \in \mathcal{P}$, where $\mathcal{P}$ is a family of densities (e.g., Gaussian family) so that the posterior density also belongs to that family.

Definition 2: conjugate prior
$p(\theta)$ is a conjugate prior for $p(x|\theta)$ if $p(\theta) \in \mathcal{P} \Rightarrow p(\theta|x) \in \mathcal{P}$.
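One practical consequence of conjugacy is that updating reduces to bookkeeping on the family's parameters. The sketch below uses the Beta prior and binomial likelihood of Example 4; the function name and the numbers are hypothetical. It shows that processing the data in two batches gives the same posterior as processing it all at once.

```python
def update_beta(alpha, beta, x, n):
    """Beta(alpha, beta) prior plus x successes in n binomial trials -> Beta posterior."""
    return alpha + x, beta + (n - x)

# Assumed prior parameters for illustration.
prior = (2.0, 2.0)

# Update with two batches of data, one after the other ...
step1 = update_beta(*prior, x=3, n=5)
sequential = update_beta(*step1, x=4, n=5)

# ... or with all the data at once.
batch = update_beta(*prior, x=7, n=10)

print(sequential, batch)
```

Since the posterior stays in the Beta family, yesterday's posterior can serve directly as today's prior.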

Example 6

$x_n = A + W_n$, $n = 1, \dots, N$, with $W_n \sim \mathcal{N}(0, \sigma^2)$ iid. Rather than modeling $A \sim \mathcal{U}(-A_0, A_0)$ (which did not yield a closed-form estimator), consider $$p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \exp\!\left(-\frac{1}{2\sigma_A^2}(A - \mu)^2\right)$$

Figure 6 (gaussPrior.png)
With $\mu = 0$ and $\sigma_A = \frac{1}{3}A_0$ this Gaussian prior also reflects prior knowledge that it is unlikely for $|A| > A_0$.

The Gaussian prior is also conjugate to the Gaussian likelihood $$p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - A)^2\right)$$ so that the resulting posterior density is also a simple Gaussian, as shown next.

First note that $$p(x|A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N} x_n^2\right) \exp\!\left(-\frac{1}{2\sigma^2}\left(NA^2 - 2NA\bar{x}\right)\right)$$ where $\bar{x}$ =1<