Statistical analysis is fundamentally an
inversion process. The objective is to infer the
"causes" (the parameters of the probabilistic data generation
model) from the "effects" (the observations). This can be seen in our
interpretation of the likelihood function.
Given a parameter $\theta$, observations are generated according to
$$p(x \mid \theta)$$
The likelihood function has the same form as the conditional
density function above,
$$l(\theta \mid x) \equiv p(x \mid \theta),$$
except now $x$ is
given (we take measurements) and $\theta$ is the variable. The
likelihood function essentially inverts the role of observation
(effect) and parameter (cause).
Unfortunately, the likelihood function does not
provide a formal framework for the desired inversion.
One problem is that the parameter $\theta$ is supposed to be a fixed
and deterministic quantity, while the observation $x$ is the realization of a random
process. So their roles aren't really interchangeable in this
setting.
Moreover, while it is tempting to interpret the
likelihood $l(\theta \mid x)$ as a density function for $\theta$, this is not always
possible; for example, often
$$\int l(\theta \mid x) \, d\theta \to \infty.$$
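As one concrete illustration (an example assumed here, not taken from the text), let a single observation be $x \sim \mathcal{N}(0, \theta)$ with unknown variance $\theta > 0$. The likelihood is then
$$l(\theta \mid x) = \frac{1}{\sqrt{2\pi\theta}} \, e^{-x^2/(2\theta)}.$$
Since $e^{-x^2/(2\theta)} \ge e^{-x^2/2}$ for $\theta \ge 1$,
$$\int_0^\infty l(\theta \mid x) \, d\theta \;\ge\; \frac{e^{-x^2/2}}{\sqrt{2\pi}} \int_1^\infty \theta^{-1/2} \, d\theta = \infty,$$
so in this case the likelihood cannot be normalized into a density for $\theta$.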
Another problematic issue is the mathematical
formalization of statements like: "Based on the measurements
$x$, I am 95% confident
that $\theta$ falls in
a certain range."
Suppose you toss a coin 10 times and each time
it comes up "heads." It might be reasonable to say that we are
99% sure that the coin is unfair, biased towards heads.
Formally:
$$H_0 : \theta \equiv \operatorname{prob}(\text{heads}) > 0.5$$
$$x \sim \binom{N}{\sum x} \, \theta^{\sum x} (1-\theta)^{N - \sum x},$$
which is the binomial likelihood.
$$p(\theta > 0.5 \mid x) = \, ?$$
The problem with this is that
$p(\theta \in H_0 \mid x)$
implies that $\theta$ is a
random, not deterministic, quantity. So,
while "confidence" statements are very reasonable and in fact
a normal part of "everyday thinking," this idea cannot be
supported from the classical perspective.
All of these "deficiencies" can be circumvented
by a change in how we view the parameter $\theta$.
If we view $\theta$ as the realization of a
random variable with density $p(\theta)$, then Bayes Rule (Bayes, 1763) shows that
$$p(\theta \mid x) = \frac{p(x \mid \theta) \, p(\theta)}{\int p(x \mid \theta) \, p(\theta) \, d\theta}.$$
Thus, from this perspective we obtain a well-defined inversion:
given $x$, the
parameter $\theta$ is
distributed according to $p(\theta \mid x)$.
From here, confidence measures such as
$p(\theta \in H_0 \mid x)$
are perfectly legitimate quantities to ask for.
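To make the coin example concrete, the following is a minimal sketch that applies Bayes' rule on a grid to approximate the posterior probability that theta exceeds 0.5 after 10 heads in 10 tosses. The uniform prior on [0, 1], the function name `posterior_prob_biased`, and the grid size are assumptions made for illustration; they are not part of the original text.

```python
# Minimal sketch: grid approximation of Bayes' rule for the coin example.
# Assumption (not in the text): a uniform prior on theta over [0, 1].

def posterior_prob_biased(heads, tosses, grid_size=100_000):
    """Approximate P(theta > 0.5 | x) with a midpoint Riemann sum."""
    step = 1.0 / grid_size
    evidence = 0.0    # approximates the integral of p(x|theta) p(theta) over [0, 1]
    mass_above = 0.0  # the same integral restricted to theta > 0.5
    for i in range(grid_size):
        theta = (i + 0.5) * step
        # Likelihood times the (constant) uniform prior; the binomial
        # coefficient is omitted because it cancels in the ratio.
        weight = theta**heads * (1.0 - theta) ** (tosses - heads) * step
        evidence += weight
        if theta > 0.5:
            mass_above += weight
    return mass_above / evidence

prob = posterior_prob_biased(heads=10, tosses=10)
print(f"P(theta > 0.5 | x) ~= {prob:.4f}")
```

Under this assumed uniform prior the posterior is a Beta(11, 1) density, so the exact value is 1 - 0.5^11, roughly 0.9995. A different prior would give a different number, which is exactly the sense in which the prior enters the analysis.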
- Definition 1:
Bayesian statistical model
A statistical model composed of a data generation model,
$p(x \mid \theta)$, and a prior distribution on the parameters,
$p(\theta)$.
The prior distribution (or
prior for short) models the uncertainty in the
parameter. More specifically,
$p(\theta)$
models our knowledge, or lack thereof, prior to
collecting data.
Notice that
$$p(\theta \mid x) = \frac{p(x \mid \theta) \, p(\theta)}{p(x)} \propto p(x \mid \theta) \, p(\theta);$$
since the data $x$ are
known, $p(x)$ is just a constant. Hence,
$p(\theta \mid x)$
is proportional to the likelihood function multiplied
by the prior.
Bayesian analysis has some significant
advantages over classical statistical analysis:
- properly inverts the relationship between causes and
effects
- permits meaningful assessments in confidence
regions
- enables the incorporation of prior knowledge into the
analysis (which could come from previous experiments, for
example)
- leads to more accurate estimators (provided the prior
knowledge is accurate)
- obeys the Likelihood and Sufficiency principles
$$x_n = A + W_n, \quad n = 1, \dots, N,$$
where the $W_n \sim \mathcal{N}(0, \sigma^2)$ are iid. The sample mean
$$\hat{A} = \frac{1}{N} \sum_{n=1}^{N} x_n$$
is the MVUB and MLE estimator. Now suppose that we have prior knowledge that
$-A_0 \le A \le A_0$. We might incorporate this by forming a new estimator
$$\tilde{A} = \begin{cases} -A_0 & \text{if } \hat{A} < -A_0 \\ \hat{A} & \text{if } -A_0 \le \hat{A} \le A_0 \\ A_0 & \text{if } \hat{A} > A_0 \end{cases} \tag{1}$$
This is called a
truncated sample mean
estimator of $A$. Is $\tilde{A}$
a better estimator of
$A$ than the sample mean $\hat{A}$?
Let $p(a)$ denote the density of $\hat{A}$. Since
$\hat{A} = \frac{1}{N} \sum x_n$,
$p(a) = \mathcal{N}\!\left(A, \frac{\sigma^2}{N}\right)$. The density of
$\tilde{A}$ is given by
$$\tilde{p}(a) = \Pr[\hat{A} \le -A_0] \, \delta(a + A_0) + p(a) \, I_{\{-A_0 \le a \le A_0\}} + \Pr[\hat{A} \ge A_0] \, \delta(a - A_0) \tag{2}$$
Now consider the MSE of the sample mean $\hat{A}$.
$$\operatorname{MSE}(\hat{A}) = \int_{-\infty}^{\infty} (a - A)^2 \, p(a) \, da \tag{3}$$
- $\tilde{A}$ is biased (Figure 2).
- Although $\hat{A}$ is MVUB, $\tilde{A}$ is better in the MSE sense.
- Prior information is aptly described by regarding
$A$ as a random variable with a
prior distribution $U(-A_0, A_0)$, which implies that we know
$-A_0 \le A \le A_0$, but otherwise $A$ is
arbitrary.
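The MSE comparison between the sample mean and the truncated sample mean can be checked empirically. The sketch below is a small Monte Carlo simulation under assumed, purely illustrative values (A = 0.9, A_0 = 1, sigma = 1, N = 10); the function name `simulate_mse` and the trial count are likewise hypothetical choices, not taken from the text.

```python
import random
import statistics

def simulate_mse(A, A0, sigma, N, trials=20_000, seed=0):
    """Estimate the MSE of the sample mean and of its truncation to [-A0, A0]."""
    rng = random.Random(seed)
    se_mean = 0.0   # accumulated squared error of the sample mean
    se_trunc = 0.0  # accumulated squared error of the truncated sample mean
    for _ in range(trials):
        # Draw x_n = A + W_n with W_n ~ N(0, sigma^2), iid, and average.
        a_hat = statistics.fmean(rng.gauss(A, sigma) for _ in range(N))
        # Truncated sample mean: clip the sample mean to [-A0, A0].
        a_tilde = min(max(a_hat, -A0), A0)
        se_mean += (a_hat - A) ** 2
        se_trunc += (a_tilde - A) ** 2
    return se_mean / trials, se_trunc / trials

# Illustrative values only; A is chosen near the edge of [-A0, A0].
mse_mean, mse_trunc = simulate_mse(A=0.9, A0=1.0, sigma=1.0, N=10)
print(f"MSE(sample mean)    ~= {mse_mean:.4f}")
print(f"MSE(truncated mean) ~= {mse_trunc:.4f}")
```

Because the true A lies inside [-A_0, A_0], clipping can only move an estimate closer to A, so the truncated estimator's squared error is never larger on any single trial. The sample-mean MSE should come out near sigma^2 / N.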
$$x_n = A + W_n, \quad n = 1, \dots, N$$
The prior distribution allows us to incorporate prior information
regarding the unknown parameter: probable values of the parameter are
supported by the prior. Basically, the prior reflects what we
believe "Nature" will probably throw at us.
- (a) -
joint distribution
$$p(x, \theta) = p(x \mid \theta) \, p(\theta)$$
- (b) -
marginal distributions
$$p(x) = \int p(x \mid \theta) \, p(\theta) \, d\theta$$
$$p(\theta) = \int p(x \mid \theta) \, p(\theta) \, dx$$
where $p(\theta)$ is a prior.
- (c) -
posterior distribution
$$p(\theta \mid x) = \frac{p(x, \theta)}{p(x)} = \frac{p(x \mid \theta) \, p(\theta)}{\int p(x \mid \theta) \, p(\theta) \, d\theta}$$
$$p(x \mid \theta) = \binom{n}{x} \theta^{x} (1-\theta)^{n-x}, \quad \theta \in [0, 1],$$
which is the Binomial likelihood.
$$p(\theta) = \frac{1}{B(\alpha, \beta)} \, \theta^{\alpha-1} (1-\theta)^{\beta-1},$$
which is the Beta prior distribution, and
$$B(\alpha, \beta) = \frac{\Gamma(\alpha) \, \Gamma(\beta)}{\Gamma(\alpha+\beta)}$$
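$$p(x, \theta) = \frac{\binom{n}{x}}{B(\alpha, \beta)} \, \theta^{\alpha+x-1} (1-\theta)^{n-x+\beta-1}$$
$$p(x) = \binom{n}{x} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \, \Gamma(\beta)} \, \frac{\Gamma(\alpha+x) \, \Gamma(n-x+\beta)}{\Gamma(\alpha+\beta+n)}$$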
$$p(\theta \mid x) = \frac{\theta^{\alpha+x-1} (1-\theta)^{\beta+n-x-1}}{B(\alpha+x, \, \beta+n-x)},$$
which is the Beta density with parameters
$\alpha' = \alpha + x$
and
$\beta' = \beta + n - x$.
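The Beta-Binomial update reduces to simple parameter arithmetic, which the following minimal sketch illustrates. The function names `beta_binomial_update` and `beta_pdf` and the numbers used (alpha = beta = 2, x = 7 successes in n = 10 trials) are hypothetical choices for illustration only.

```python
from math import exp, lgamma, log

def beta_binomial_update(alpha, beta, x, n):
    """Posterior Beta parameters after observing x successes in n trials."""
    return alpha + x, beta + n - x

def beta_pdf(theta, a, b):
    """Beta(a, b) density at 0 < theta < 1, using log-gamma for stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)  # equals -log B(a, b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))

# Illustrative prior and data (assumed values, not from the text).
a_post, b_post = beta_binomial_update(alpha=2.0, beta=2.0, x=7, n=10)
post_mean = a_post / (a_post + b_post)  # mean of a Beta(a, b) density is a / (a + b)
print(f"posterior: Beta({a_post}, {b_post}), mean = {post_mean:.4f}")
print(f"posterior density at theta = 0.5: {beta_pdf(0.5, a_post, b_post):.4f}")
```

No integration is needed: the data enter only through the counts x and n - x, which are added to the prior parameters.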
Clearly, the most important objective is to choose
the prior $p(\theta)$
that best reflects the prior knowledge available to
us. In general, however, our prior knowledge is imprecise and
any number of prior densities may aptly capture this
information. Moreover, usually the optimal estimator can't be
obtained in closed form.
Therefore, sometimes it is desirable to choose a
prior density that models prior knowledge and
is nicely matched in functional form to
$p(x \mid \theta)$
so that the optimal estimator (and posterior density)
can be expressed in a simple fashion.
- design/choose priors that are compatible with prior
knowledge of unknown parameters
- attempt to remove subjectiveness from Bayesian
procedures
- designs are often based on invariance arguments
Suppose we want to estimate the variance
of a process, incorporating a prior that is amplitude-scale
invariant (so that we are invariant to arbitrary amplitude
rescaling of data).
$$p(s) = \frac{1}{s}$$
satisfies this condition:
$$\sigma^2 \sim p(s) \Rightarrow A\sigma^2 \sim p(s),$$
where $p(s)$
is non-informative since it is invariant to
amplitude-scale.
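The invariance can be verified with a short change of variables, sketched here under the assumption that $A > 0$ is a fixed rescaling constant. Let $s$ have density $p_s(s) = 1/s$ and define $t = A s$. Then
$$p_t(t) = p_s\!\left(\frac{t}{A}\right) \left|\frac{ds}{dt}\right| = \frac{A}{t} \cdot \frac{1}{A} = \frac{1}{t},$$
so the rescaled variable has the same functional form of density as the original. Note that $1/s$ does not integrate to a finite value over $(0, \infty)$, so this prior is improper: it can be used in the proportionality $p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta)$ only when the resulting posterior is normalizable.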
Given $p(x \mid \theta)$, choose $p(\theta)$
so that
$p(\theta \mid x) \propto p(x \mid \theta) \, p(\theta)$
has a simple functional form.
Choose $p(\theta) \in \mathcal{P}$, where
$\mathcal{P}$ is a family of densities (e.g.,
Gaussian family) so that the posterior density also belongs
to that family.
- Definition 2:
conjugate prior
$p(\theta)$ is a conjugate prior for
$p(x \mid \theta)$ if
$p(\theta) \in \mathcal{P} \Rightarrow p(\theta \mid x) \in \mathcal{P}$
$$x_n = A + W_n, \quad n = 1, \dots, N,$$
where the $W_n \sim \mathcal{N}(0, \sigma^2)$ are iid. Rather than modeling
$A \sim U(-A_0, A_0)$
(which did not yield a closed-form estimator) consider
$$p(A) = \frac{1}{\sqrt{2\pi\sigma_A^2}} \, e^{-\frac{1}{2\sigma_A^2}(A - \mu)^2}$$
With $\mu = 0$ and $\sigma_A = \frac{1}{3} A_0$
this Gaussian prior also reflects prior knowledge that it is
unlikely for $|A| \ge A_0$.
The Gaussian prior is also conjugate to the
Gaussian likelihood
$$p(x \mid A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \, e^{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - A)^2}$$
so that the resulting posterior density is also a simple
Gaussian, as shown next.
First note that
$$p(x \mid A) = \frac{1}{(2\pi\sigma^2)^{N/2}} \, e^{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} x_n^2} \, e^{-\frac{1}{2\sigma^2}\left(N A^2 - 2 N A \bar{x}\right)}$$
where
x
-
=1<