Simple Random Samples and Statistics

Module by: Paul E. Pfeiffer

Summary: The (simple) random sample is basic to much of classical statistics. Once formulated, we may apply probability theory to exhibit several basic ideas of statistical analysis. A population may be most any collection of individuals or entities. Associated with each member is a quantity or a feature that can be assigned a number. The population distribution is the distribution of that quantity among the members of the population. To obtain information about the population distribution, we select “at random” a subset of the population and observe how the quantity varies over the sample. Hopefully, the distribution in the sample will give a useful approximation to the population distribution. We obtain values of such quantities as the mean and variance in the sample (which are random quantities) and use these as estimators for corresponding population parameters (which are fixed). Probability analysis provides estimates of the variation of the sample parameters about the corresponding population parameters.

Simple Random Samples and Statistics

We formulate the notion of a (simple) random sample, which is basic to much of classical statistics. Once formulated, we may apply probability theory to exhibit several basic ideas of statistical analysis.

We begin with the notion of a population distribution. A population may be most any collection of individuals or entities. Associated with each member is a quantity or a feature that can be assigned a number. The quantity varies throughout the population. The population distribution is the distribution of that quantity among the members of the population.

If each member could be observed, the population distribution could be determined completely. However, that is not always feasible. In order to obtain information about the population distribution, we select “at random” a subset of the population and observe how the quantity varies over the sample. Hopefully, the sample distribution will give a useful approximation to the population distribution.

The sampling process

We take a sample of size n, which means we select n members of the population and observe the quantity associated with each. The selection is done in such a manner that on any trial each member is equally likely to be selected. Also, the sampling is done in such a way that the result of any one selection does not affect, and is not affected by, the others. It appears that we are describing a composite trial. We model the sampling process as follows:

  • Let X_i, 1 ≤ i ≤ n, be the random variable for the ith component trial. Then the class {X_i : 1 ≤ i ≤ n} is iid, with each member having the population distribution.

This provides a model for sampling either from a very large population (often referred to as an infinite population) or sampling with replacement from a small population.

The goal is to determine as much as possible about the character of the population. Two important parameters are the population mean and the population variance. If the sample is representative of the population, then the sample mean and the sample variance should approximate these population quantities.

  • The sampling process is the iid class {X_i : 1 ≤ i ≤ n}.
  • A random sample is an observation, or realization, (t_1, t_2, ..., t_n) of the sampling process; a brief simulation sketch follows.
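
This sketch is not part of the original module; it uses only base MATLAB and a hypothetical eight-member population. Each component trial selects one member with equal probability, independently of the others (sampling with replacement), and the sketch forms one random sample together with its sample average.

pop = [2 3 3 5 7 7 8 10];               % hypothetical population values
n = 25;                                 % sample size
idx = ceil(length(pop)*rand(1,n));      % n independent, equally likely selections
t = pop(idx);                           % a random sample (t_1, ..., t_n)
xbar = sum(t)/n;                        % observed sample average
disp([sum(pop)/length(pop)  xbar])      % population mean vs. sample average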

The sample average and the population mean

Consider the numerical average of the values in the sample, x̄ = (t_1 + t_2 + ... + t_n)/n. This is an observation of the sample average

\[
A_n = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} S_n
\]
(1)

The sample sum S_n and the sample average A_n are random variables. If another observation were made (another sample taken), the observed values of these quantities would probably be different. Now S_n and A_n are functions of the random variables {X_i : 1 ≤ i ≤ n} in the sampling process. As such, they have distributions related to the population distribution (the common distribution of the X_i). According to the central limit theorem, for any reasonably sized sample they should be approximately normally distributed. As the examples demonstrating the central limit theorem show, the sample size need not be large in many cases. Now if the population mean E[X] is μ and the population variance Var[X] is σ², then

\[
E[S_n] = \sum_{i=1}^{n} E[X_i] = n E[X] = n\mu \quad \text{and} \quad \text{Var}[S_n] = \sum_{i=1}^{n} \text{Var}[X_i] = n \text{Var}[X] = n\sigma^2
\]
(2)

so that

\[
E[A_n] = \frac{1}{n} E[S_n] = \mu \quad \text{and} \quad \text{Var}[A_n] = \frac{1}{n^2} \text{Var}[S_n] = \sigma^2 / n
\]
(3)

Herein lies the key to the usefulness of a large sample. The mean of the sample average A_n is the same as the population mean, but the variance of the sample average is 1/n times the population variance. Thus, for a large enough sample, the probability is high that the observed value of the sample average will be close to the population mean. The population standard deviation, as a measure of the variation, is reduced by a factor 1/√n.
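
The 1/n reduction is easy to check by simulation. The following sketch is not in the original module; it uses only base MATLAB, with the sample size n and repetition count m chosen arbitrarily. It repeats the sampling many times for a population uniform on [-1, 1], where σ² = 1/3, and compares the empirical variance of the observed sample averages with σ²/n.

m = 10000;                              % number of repeated samples
n = 100;                                % sample size
X = 2*rand(m,n) - 1;                    % m samples of size n, uniform on [-1,1]
A = sum(X,2)/n;                         % the m observed sample averages
vA = sum((A - sum(A)/m).^2)/m;          % empirical variance of the sample average
disp([vA  (1/3)/n])                     % compare with sigma^2/n = 1/300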

Example 1: Sample size

Suppose a population has mean μ and variance σ². A sample of size n is to be taken. There are two complementary questions:

  1. If n is given, what is the probability the sample average lies within distance a from the population mean?
  2. What value of n is required to ensure a probability of at least p that the sample average lies within distance a from the population mean?

SOLUTION

Suppose the sample variance is known or can be approximated reasonably. If the sample size n is reasonably large, depending on the population distribution (as seen in the previous demonstrations), then A_n is approximately N(μ, σ²/n).

  1. Sample size given, probability to be determined.
    \[
    p = P(|A_n - \mu| \le a) = P\left( \frac{|A_n - \mu|}{\sigma/\sqrt{n}} \le \frac{a\sqrt{n}}{\sigma} \right) = 2\Phi\left( \frac{a\sqrt{n}}{\sigma} \right) - 1
    \]
    (4)
  2. Sample size to be determined, probability specified.
    \[
    2\Phi\left( \frac{a\sqrt{n}}{\sigma} \right) - 1 \ge p \quad \text{iff} \quad \Phi\left( \frac{a\sqrt{n}}{\sigma} \right) \ge \frac{p+1}{2}
    \]
    (5)
    Find from a table or by use of the inverse normal function the value of x = a√n/σ required to make Φ(x) at least (p + 1)/2. Then
    \[
    n \ge \sigma^2 (x/a)^2 = \left( \frac{\sigma}{a} \right)^2 x^2
    \]
    (6)

We may use the MATLAB Statistics Toolbox function norminv to calculate values of x for various p.

p = [0.8 0.9 0.95 0.98 0.99];
x = norminv((1+p)/2,0,1);     % norminv(P,MU,SIGMA): standard normal quantiles with Phi(x) = (p+1)/2
disp([p;x;x.^2]')
    0.8000    1.2816    1.6424
    0.9000    1.6449    2.7055
    0.9500    1.9600    3.8415
    0.9800    2.3263    5.4119
    0.9900    2.5758    6.6349

For p = 0.95, σ = 2, a = 0.2, we get n ≥ (2/0.2)² · 3.8415 = 384.15. Use at least 385, or perhaps 400, because of uncertainty about the actual σ².
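
The sample-size calculation in part 2 packages naturally into a small helper function. The sketch below is not from the original module: the name samplesize is hypothetical (saved in a file samplesize.m), and it assumes the Statistics Toolbox norminv is available.

function n = samplesize(p, a, sigma)
% SAMPLESIZE  Hypothetical helper: smallest n with P(|A_n - mu| <= a) >= p,
% assuming A_n is approximately normal.
x = norminv((1 + p)/2, 0, 1);           % quantile with Phi(x) >= (p+1)/2
n = ceil((sigma/a)^2 * x^2);            % round up to the next whole sample size
end

For the case above, samplesize(0.95, 0.2, 2) gives 385, matching the hand calculation.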

The idea of a statistic

As a function of the random variables in the sampling process, the sample average is an example of a statistic.

Definition. A statistic is a function of the class {X_i : 1 ≤ i ≤ n} which does not explicitly use any unknown parameters of the population.

Example 2: Statistics as functions of the sampling process

The random variable

\[
W = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 , \quad \text{where } \mu = E[X]
\]
(7)

is not a statistic, since it uses the unknown parameter μ. However, the following is a statistic.

\[
V_n^{*} = \frac{1}{n} \sum_{i=1}^{n} (X_i - A_n)^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - A_n^2
\]
(8)

It would appear that V_n* might be a reasonable estimate of the population variance. However, the following result shows that a slight modification is desirable.

Example 3: An estimator for the population variance

The statistic

\[
V_n = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - A_n)^2
\]
(9)

is an estimator for the population variance.

VERIFICATION

Consider the statistic

\[
V_n^{*} = \frac{1}{n} \sum_{i=1}^{n} (X_i - A_n)^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - A_n^2
\]
(10)

Noting that E[X²] = σ² + μ², we use the last expression to show

\[
E[V_n^{*}] = \frac{1}{n}\, n(\sigma^2 + \mu^2) - \left( \frac{\sigma^2}{n} + \mu^2 \right) = \frac{n-1}{n}\, \sigma^2
\]
(11)

Thus the quantity V_n* is biased: its average value is (n − 1)/n times σ² rather than σ². If we consider

\[
V_n = \frac{n}{n-1} V_n^{*} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - A_n)^2 , \quad \text{then} \quad E[V_n] = \frac{n}{n-1} \cdot \frac{n-1}{n}\, \sigma^2 = \sigma^2
\]
(12)

The quantity V_n, with 1/(n − 1) rather than 1/n, is often called the sample variance to distinguish it from the population variance. If the set of numbers

\[
(t_1, t_2, \ldots, t_N)
\]
(13)

represents the complete set of values in a population of N members, the variance for the population would be given by

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} t_i^2 - \left( \frac{1}{N} \sum_{i=1}^{N} t_i \right)^2
\]
(14)

Here we use 1/N rather than 1/(N − 1).
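
As an aside (not in the original module), MATLAB's built-in var function supports both weightings: var(t) divides by N − 1, as in the sample variance V_n, while var(t,1) divides by N, as in the population formula above.

t = [2 3 3 5 7 7 8 10];                 % a complete hypothetical population, N = 8
N = length(t);
vpop = sum(t.^2)/N - (sum(t)/N)^2;      % population variance with 1/N, as in (14)
disp([vpop  var(t,1)  var(t)])          % var(t,1) matches vpop; var(t) uses 1/(N-1)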

Since the statistic V_n has mean value σ², it seems a reasonable candidate for an estimator of the population variance. To ask how good it is, we must consider its variance, for as a random variable V_n has a variance of its own. An evaluation similar to that for the mean, but more complicated in detail, shows that

\[
\text{Var}[V_n] = \frac{1}{n} \left( \mu_4 - \frac{n-3}{n-1}\, \sigma^4 \right) \quad \text{where } \mu_4 = E[(X - \mu)^4]
\]
(15)

For large n, Var[V_n] is small, so that V_n is a good large-sample estimator for σ².
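
A quick simulation (a sketch, not in the original module; base MATLAB only) illustrates both the bias of V_n* and the unbiasedness of V_n: averaging the two statistics over many samples from a population with σ² = 1/3 should give roughly (n − 1)σ²/n and σ², respectively.

m = 20000;  n = 10;                     % many samples of a small size
X = 2*rand(m,n) - 1;                    % uniform on [-1,1], so sigma^2 = 1/3
A = sum(X,2)/n;                         % sample averages
Vstar = sum((X - A*ones(1,n)).^2,2)/n;  % V_n* for each sample (1/n weighting)
V = Vstar*n/(n-1);                      % V_n for each sample (1/(n-1) weighting)
disp([sum(Vstar)/m  sum(V)/m  1/3])     % approx (n-1)/(3n), approx 1/3, and 1/3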

Example 4: A sampling demonstration of the CLT

Consider a population random variable X, uniform on [-1, 1]. Then E[X] = 0 and Var[X] = 1/3. We take 100 samples of size 100, and determine the sample sums. This gives a sample of size 100 of the sample sum random variable S_100, which has mean zero and variance 100/3. For each observed value of the sample sum random variable, we plot the fraction of observed sums less than or equal to that value. This yields an experimental distribution function for S_100, which is compared with the distribution function of a random variable Y ~ N(0, 100/3).

rand('seed',0)    % Seeds random number generator for later comparison
tappr                                         % Approximation setup
Enter matrix [a b] of x-range endpoints  [-1 1]
Enter number of x approximation points  100
Enter density as a function of t  0.5*(t<=1)
Use row matrices X and PX as in the simple case
qsample                                 % Creates sample
Enter row matrix of VALUES  X
Enter row matrix of PROBABILITIES  PX
Sample size n =  10000                  % Master sample size 10,000
Sample average ex = 0.003746
Approximate population mean E(X) = 1.561e-17
Sample variance vx = 0.3344
Approximate population variance V(X) = 0.3333
m = 100;
a = reshape(T,m,m);                     % Forms 100 samples of size 100
A = sum(a);                             % Matrix A of sample sums
[t,f] = csort(A,ones(1,m));             % Sorts A and determines cumulative
p = cumsum(f)/m;                        % fraction of elements <= each value
pg = gaussian(0,100/3,t);               % Gaussian dbn for sample sum values
plot(t,p,'k-',t,pg,'k-.')               % Comparative plot
% Plotting details                      (see Figure 1)
Figure 1: The central limit theorem for sample sums. X is uniform on [-1, 1], with E[X] = 0 and Var[X] = 1/3; horizontal axis: sample sum values; vertical axis: cumulative fraction. The dash-dot Gaussian distribution function and the jagged solid experimental distribution function nearly coincide.
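
The demonstration above uses the m-procedures tappr, qsample, and csort that accompany the text. For readers without those files, the following sketch (a reconstruction under that assumption, not the author's code) draws the 100 samples of size 100 directly with base MATLAB and makes the same comparison, writing the normal distribution function in terms of erf.

m = 100;  n = 100;                      % 100 samples of size 100
X = 2*rand(m,n) - 1;                    % uniform on [-1,1]; E[X] = 0, Var[X] = 1/3
S = sum(X,2);                           % the 100 observed sample sums
t = sort(S);                            % sorted sum values
p = (1:m)'/m;                           % empirical cumulative fractions
v = n/3;                                % Var[S_n] = n*sigma^2 = 100/3
pg = 0.5*(1 + erf(t/sqrt(2*v)));        % N(0,100/3) distribution function via erf
plot(t,p,'k-',t,pg,'k-.')               % experimental vs. Gaussian
xlabel('Sample sum values'), ylabel('Cumulative fraction')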
