Sufficient statistics arise in nearly every
aspect of statistical inference. It is important to understand
them before progressing to areas such as hypothesis testing and
parameter estimation.
Suppose we observe an $N$-dimensional random vector $X$, characterized by the density or mass function $f_\theta(x)$, where $\theta$ is a $p$-dimensional vector of parameters to be estimated. The functional form of $f(x)$ is assumed known. The parameter $\theta$ completely determines the distribution of $X$. Conversely, a measurement $x$ of $X$ provides information about $\theta$ through the probability law $f_\theta(x)$.
Suppose $X = (X_1, X_2)^T$, where $X_i \sim \mathcal{N}(\theta, 1)$ are IID. Here $\theta$ is a scalar parameter specifying the mean. The distribution of $X$ is determined by $\theta$ through the density
$$f_\theta(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_1 - \theta)^2}{2}} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_2 - \theta)^2}{2}}$$
On the other hand, if we observe $x = (100, 102)^T$, then we may safely assume $\theta = 0$ is highly unlikely.
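To make the last claim concrete, the following sketch compares the log-density of the observation $x = (100, 102)^T$ at $\theta = 0$ and at $\theta = 101$ (the sample mean, chosen here only as a convenient comparison point). The function name `log_density` is an illustrative assumption, and the computation is done in the log domain because the density at $\theta = 0$ is too small to represent in double precision.

```python
import math

def log_density(x, theta):
    """Log of the joint density of IID N(theta, 1) samples x."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - theta) ** 2 for xi in x)

x = (100.0, 102.0)

log_f_at_0 = log_density(x, 0.0)      # theta = 0
log_f_at_101 = log_density(x, 101.0)  # theta = sample mean of x

# Difference of log-densities: how much better theta = 101 explains x.
log_ratio = log_f_at_101 - log_f_at_0
print(log_f_at_0, log_f_at_101, log_ratio)
```

The log-density at $\theta = 101$ exceeds the one at $\theta = 0$ by $10201$, so the observation is overwhelmingly better explained by a mean near 101 than by a mean of 0.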
The $N$-dimensional observation $X$ carries information about the $p$-dimensional parameter vector $\theta$. If $p < N$, one may ask the following question: Can we compress $x$ into a low-dimensional statistic without any loss of information? Does there exist some function $t = T(x)$, where the dimension of $t$ is $M < N$, such that $t$ carries all the useful information about $\theta$?
If so, for the purpose of studying $\theta$ we could discard the raw measurements $x$ and retain only the low-dimensional statistic $t$. We call $t$ a sufficient statistic. The following definition captures this notion precisely:
- Definition 1: Let $X_1, \ldots, X_M$ be a random sample, governed by the density or probability mass function $f(x \mid \theta)$. The statistic $T(x)$ is sufficient for $\theta$ if the conditional distribution of $x$, given $T(x) = t$, is independent of $\theta$. Equivalently, the functional form of $f_{\theta|t}(x)$ does not involve $\theta$.
How should we interpret this definition? Here are some
possibilities:
1. Let $f_\theta(x, t)$ denote the joint density or probability mass function on $(X, T(X))$. If $T(X)$ is a sufficient statistic for $\theta$, then
$$f_\theta(x) = f_\theta(x, T(x)) = f_{\theta|t}(x)\, f_\theta(t) = f(x \mid t)\, f_\theta(t) \tag{1}$$
Therefore, the parametrization of the probability law for the measurement $x$ is manifested in the parametrization of the probability law for the statistic $T(x)$.
2. Given $t = T(x)$, full knowledge of the measurement $x$ brings no additional information about $\theta$. Thus, we may discard $x$ and retain only the compressed statistic $t$.
3. Any inference strategy based on $f_\theta(x)$ may be replaced by a strategy based on $f_\theta(t)$.
(Scharf, p. 78)
Suppose a binary information source emits a sequence of binary (0 or 1) valued, independent variables $x_1, \ldots, x_N$. Each binary symbol may be viewed as a realization of a Bernoulli trial: $x_n \sim \text{Bernoulli}(\theta)$, iid. The parameter $\theta \in [0, 1]$ is to be estimated.
The probability mass function for the random sample $x = (x_1, \ldots, x_N)^T$ is
$$f_\theta(x) = \prod_{n=1}^{N} f_\theta(x_n) = \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n} = \theta^{k} (1 - \theta)^{N - k} \tag{2}$$
where $k = \sum_{n=1}^{N} x_n$ is the number of 1's in the sample.
We will show that $k$ is a sufficient statistic for $\theta$. This will entail showing that the conditional probability mass function $f_{\theta|k}(x)$ does not depend on $\theta$.
The distribution of the number of ones in $N$ independent Bernoulli trials is binomial:
$$f_\theta(k) = \binom{N}{k} \theta^{k} (1 - \theta)^{N - k}$$
Next, consider the joint distribution of $(x, \sum x_n)$. We have
$$f_\theta(x) = f_\theta\left(x, \sum x_n\right)$$
Thus, the conditional probability may be written
$$f_{\theta|k}(x) = \frac{f_\theta(x, k)}{f_\theta(k)} = \frac{f_\theta(x)}{f_\theta(k)} = \frac{\theta^{k} (1 - \theta)^{N - k}}{\binom{N}{k} \theta^{k} (1 - \theta)^{N - k}} = \frac{1}{\binom{N}{k}} \tag{3}$$
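As a numerical sanity check on this conditional probability, the sketch below computes $P(X = x \mid \sum x_n = k)$ by brute-force enumeration for a short sequence and several values of $\theta$. The helper name `conditional_pmf`, the particular sequence, and the chosen values of $\theta$ are illustrative assumptions, and enumerating all $2^N$ sequences is practical only for small $N$.

```python
from itertools import product
from math import comb

def conditional_pmf(x, theta):
    """P(X = x | sum(X) = k) for IID Bernoulli(theta) components, by enumeration."""
    n, k = len(x), sum(x)

    def joint(seq):
        # Joint pmf of one binary sequence: theta^s * (1 - theta)^(n - s).
        s = sum(seq)
        return theta ** s * (1 - theta) ** (n - s)

    # Marginal probability of k: sum the joint pmf over all sequences with k ones.
    p_k = sum(joint(seq) for seq in product((0, 1), repeat=n) if sum(seq) == k)
    return joint(x) / p_k

x = (1, 0, 1, 1, 0)  # N = 5, k = 3
values = [conditional_pmf(x, theta) for theta in (0.1, 0.5, 0.9)]
expected = 1 / comb(len(x), sum(x))  # 1 / C(5, 3)
print(values, expected)
```

For every $\theta$ tried, the conditional probability agrees with $1 / \binom{5}{3} = 0.1$ up to floating-point rounding, as the derivation predicts.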
This shows that $k$ is indeed a sufficient statistic for $\theta$. The $N$ values $x_1, \ldots, x_N$ can be replaced by the quantity $k$ without losing information about $\theta$.
In the previous example, suppose we wish to store in memory the information we possess about $\theta$. Compare the savings, in terms of bits, we gain by storing the sufficient statistic $k$ instead of the full sample $x_1, \ldots, x_N$.
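One way to approach this comparison is sketched below, under two assumptions that are not part of the original exercise: the sample is stored verbatim at one bit per symbol, and $k$ is stored with a fixed-length binary code. Since $k$ takes one of the $N + 1$ values $0, \ldots, N$, such a code needs $\lceil \log_2 (N + 1) \rceil$ bits. The function names are illustrative.

```python
def bits_for_sample(n):
    """Bits to store n binary symbols verbatim: one bit per symbol."""
    return n

def bits_for_count(n):
    """Bits for a fixed-length code of k in {0, ..., n}.

    For a non-negative integer n, n.bit_length() equals ceil(log2(n + 1)).
    """
    return n.bit_length()

for n in (8, 100, 1000):
    print(n, bits_for_sample(n), bits_for_count(n))
```

Under these assumptions the storage cost drops from linear in $N$ to logarithmic in $N$; for $N = 1000$, the sample needs 1000 bits while $k$ needs 10.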
In the example above,
we had to guess the sufficient statistic, and work out the
conditional probability by hand. In general, this will be a
tedious way to go about finding sufficient
statistics. Fortunately, spotting sufficient statistics can be
made easier by the Fisher-Neyman
Factorization Theorem.
Minimal sufficient statistics are,
roughly speaking, sufficient statistics that cannot be
compressed any more without losing information about the unknown
parameter. Completeness is a technical
characterization of sufficient statistics that allows one to
prove minimality. These topics are covered in detail in
this
module.
Further examples of sufficient statistics may be
found in the module on the Fisher-Neyman
Factorization Theorem.
- L. Scharf. (1991). Statistical Signal Processing. Addison-Wesley.