Sampling

Module by: Ananda Mahto. E-mail the author

Based on: Sampling and Data: Sampling by Susan Dean, Barbara Illowsky, Ph.D.

Summary: This module introduces the concept of statistical sampling. Students are taught the difference between a simple random sample, stratified sample, cluster sample, systematic sample, and convenience sample. Example problems are provided, including an optional classroom activity.

Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing. Most statisticians use various methods of random sampling in an attempt to achieve this goal. This section will describe a few of the most common methods.

There are several different methods of random sampling. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample. Each method has pros and cons. The easiest method to describe is called a simple random sample. In a simple random sample, any group of n individuals is as likely to be chosen as any other group of n individuals. In other words, each sample of the same size has an equal chance of being selected. For example, suppose Lisa wants to form a four-person study group (herself and three other people) from her pre-calculus class, which has 33 members including Lisa. To choose a simple random sample of size 3 from the other members of her class, Lisa could put all the other 32 names in a hat, shake the hat, close her eyes, and pick out 3 names. A more technological way is for Lisa to first list the names of her classmates together with a two-digit number as shown below.

Table 1: Class Roster
ID  Name        ID  Name        ID  Name
01  Anselmo     12  Larry       23  Rowell
02  Bayani      13  Lizzy       24  Salangsang
03  Cheng       14  Macierz     25  Slade
04  Cuarismo    15  Motogawa    26  Stracher
05  Cuningham   16  Okimoto     27  Tallai
06  Fontecha    17  Patel       28  Tran
07  Hong        18  Price       29  Wai
08  Hoobler     19  Quizon      30  Wood
09  Jiao        20  Reyes       31  Yogi
10  Khan        21  Roquero     32  Zoe
11  King        22  Roth

Lisa can either use a table of random numbers (found in many statistics books as well as mathematical handbooks) or a calculator or computer to generate random numbers. For this example, suppose Lisa chooses to generate random numbers by using R. She enters the statement sample(32, 3), where 32 is the number of students in the class (excluding herself) and 3 is the number of students she wants to select. If she wants her random sample to be reproducible, she needs to set a seed by calling set.seed() before sample(), as demonstrated in the second example. When you try this exercise, you should expect different results if you are not using a seed value or if you are using a different seed value from the one in the example code. Results can also differ between R versions, because the default sampling algorithm changed in R 3.6.0.

Tip:

Use set.seed() whenever you want to be able to reproduce your results. You can, for instance, set the seed to the date on which you first run your experiment. For example, if your first experiment was done on August 1, 2010, you might write 20100801 for your seed. Every time you re-run your experiment, use the same date from your original experiment as your seed, and your output will be the same.

# A random sample of 3 from 32
sample(32, 3)
## [1]  7 10 32
# Using set.seed() to get a reproducible sample
# The seed can be any number you want
set.seed(123)
sample(32, 3)
## [1] 10 25 13

Using this information, Lisa will select the students with the ID numbers generated by R.
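To turn the generated ID numbers into names, the roster can be stored as a character vector and indexed by the sampled IDs. The following is a minimal sketch, assuming the names from Table 1 are entered in ID order; the object names roster, ids, and study.group are illustrative and not part of the original example.

```r
# Class roster from Table 1; a name's position in the vector is its ID
roster <- c("Anselmo", "Bayani", "Cheng", "Cuarismo", "Cuningham", "Fontecha",
            "Hong", "Hoobler", "Jiao", "Khan", "King", "Larry", "Lizzy",
            "Macierz", "Motogawa", "Okimoto", "Patel", "Price", "Quizon",
            "Reyes", "Roquero", "Roth", "Rowell", "Salangsang", "Slade",
            "Stracher", "Tallai", "Tran", "Wai", "Wood", "Yogi", "Zoe")

set.seed(123)
ids <- sample(length(roster), 3)  # three distinct IDs between 1 and 32
study.group <- roster[ids]        # look up the names for those IDs
study.group
```

Because sample() draws without replacement by default, the three IDs are always different students.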

Sometimes, it is difficult or impossible to obtain a simple random sample because populations are too large. Then we choose other forms of sampling methods that involve a chance process for getting the sample. Other well-known random sampling methods are the stratified sample, the cluster sample, and the systematic sample.

To choose a stratified sample, divide the population into groups called strata and then take a sample from each stratum. For example, you could stratify (group) your college population by department and then choose a simple random sample from each stratum (each department) to get a stratified random sample. To choose a simple random sample from each department, number each member of the first department, number each member of the second department and do the same for the remaining departments. Then use simple random sampling to choose numbers from the first department and do the same for each of the remaining departments. Those numbers picked from the first department, picked from the second department and so on represent the members who make up the stratified sample.
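The stratified procedure above might be sketched in R as follows. The college data frame, its three department names, and the choice of 5 members per department are made-up assumptions for illustration only.

```r
# Hypothetical college population: 3 departments of different sizes
college <- data.frame(
  id = 1:60,
  department = rep(c("Biology", "History", "Math"), times = c(30, 20, 10))
)

set.seed(123)
# Split the member IDs by stratum (department)
strata <- split(college$id, college$department)
# Draw a simple random sample of 5 members from each stratum
picked <- lapply(strata, function(ids) ids[sample(length(ids), 5)])
stratified.sample <- college[college$id %in% unlist(picked), ]

# Every department is represented by exactly 5 members
table(stratified.sample$department)
```

Indexing with ids[sample(length(ids), 5)] rather than sample(ids, 5) avoids a known surprise in sample(): when given a single number, it samples from 1 up to that number instead of from the one-element vector.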

To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. All the members from these clusters are in the cluster sample. For example, if you randomly sample four departments from your college population, the four departments make up the cluster sample. You could do this by numbering the different departments and then choosing four different numbers using simple random sampling. All members of the four departments with those numbers are the cluster sample.
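A minimal sketch of the department example in R is shown here. The college data frame, with 10 numbered departments of 8 members each, is a hypothetical population invented for illustration.

```r
# Hypothetical college: 10 numbered departments with 8 members each
college <- data.frame(
  id = 1:80,
  department = rep(1:10, each = 8)
)

set.seed(123)
# Simple random sample of 4 department numbers
chosen.departments <- sample(10, 4)
# Every member of a chosen department is in the cluster sample
cluster.sample <- college[college$department %in% chosen.departments, ]
nrow(cluster.sample)
```

Note the contrast with stratified sampling: here the random selection is applied to whole groups, and no further sampling happens inside a selected group.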

To choose a systematic sample, randomly select a starting point and take every nth piece of data from a listing of the population. For example, suppose you have to do a phone survey. Your phone book contains 20,000 residence listings. You must choose 400 names for the sample. Number the population 1 - 20,000 and then use a simple random sample to pick a number that represents the first name of the sample. Then choose every 50th name thereafter (20,000 / 400 = 50) until you have a total of 400 names (you might have to go back to the beginning of your phone list). Systematic sampling is frequently chosen because it is a simple method.
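The phone book example could be sketched in R as follows. The variable names N, n, k, start, and listing are illustrative, and the modulo step is one possible way to "go back to the beginning" of the list.

```r
N <- 20000   # listings in the phone book
n <- 400     # names needed for the sample
k <- N / n   # sampling interval: every 50th name

set.seed(123)
start <- sample(N, 1)  # random starting listing between 1 and 20,000

# Step through the list by k, wrapping around to the beginning when
# the end of the phone book is reached
listing <- ((start - 1 + k * (0:(n - 1))) %% N) + 1

length(unique(listing))  # number of distinct listings selected
```

Because k times n equals N exactly, the wrapped sequence never lands on the same listing twice.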

A type of sampling that is nonrandom is convenience sampling. Convenience sampling involves using results that are readily available. For example, a computer software store conducts a marketing study by interviewing potential customers who happen to be in the store browsing through the available software. The results of convenience sampling may be very good in some cases and highly biased (favoring certain outcomes) in others.

Sampling data should be done very carefully. Collecting data carelessly can have devastating results. Surveys mailed to households and then returned may be very biased (for example, they may favor a certain group). It is better for the person conducting the survey to select the sample respondents.

In reality, simple random sampling should be done with replacement. That is, once a member is picked, that member goes back into the population and thus may be chosen more than once. This is true random sampling. However, for practical reasons, in most populations, simple random sampling is done without replacement. That is, a member of the population may be chosen only once. Most samples are taken from large populations, and the sample tends to be small in comparison to the population. Since this is the case, sampling without replacement is approximately the same as sampling with replacement, because the chance of picking the same individual more than once when sampling with replacement is very low.

For example, in a college population of 10,000 people, suppose you want to pick a sample of 1000 for a survey. For any particular sample of 1000, if you are sampling with replacement,

  • the chance of picking the first person is 1000 out of 10,000 (0.1000);
  • the chance of picking a different second person for this sample is 999 out of 10,000 (0.0999);
  • the chance of picking the same person again is 1 out of 10,000 (very low).

If you are sampling without replacement,

  • the chance of picking the first person for any particular sample is 1000 out of 10,000 (0.1000);
  • the chance of picking a different second person is 999 out of 9,999 (0.0999);
  • you do not replace the first person before picking the next person.

Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the decimal answers to 4 decimal places. To 4 decimal places, these numbers are equivalent (0.0999).

Sampling without replacement instead of sampling with replacement only becomes a mathematical issue when the population is small, which is not that common. For example, if the population is 25 people, the sample is 10, and you are sampling with replacement, then for any particular sample,

  • the chance of picking the first person is 10 out of 25 and a different second person is 9 out of 25 (you replace the first person).

If you sample without replacement,

  • the chance of picking the first person is 10 out of 25 and then the second person (which is different) is 9 out of 24 (you do not replace the first person).

Compare the fractions 9/25 and 9/24. To 4 decimal places, 9/25 = 0.3600 and 9/24 = 0.3750. To 4 decimal places, these numbers are not equivalent.
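The two comparisons above can be checked directly in R. This is a small sketch using only the fractions from the text; the object names large and small are illustrative.

```r
# Large population (10,000 people, sample of 1000):
# chance of picking a different second person
large <- round(c(with.replacement = 999 / 10000,
                 without.replacement = 999 / 9999), 4)

# Small population (25 people, sample of 10):
# chance of picking a different second person
small <- round(c(with.replacement = 9 / 25,
                 without.replacement = 9 / 24), 4)

large  # both values round to 0.0999
small  # 0.36 versus 0.375
```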

You can also use R to sample with replacement by adding replace = TRUE to the sample() function. Imagine, for instance, that you want to simulate flipping a coin 30 times. Since there is only one heads and one tails in our population, we use replacement to get our sample. In the example below, we are again using the set.seed() function so that you can confirm that you are getting the same results.


# Simulating coin flipping
coin.flips = c("H", "T")
set.seed(123)
sample(coin.flips, 30, replace = TRUE)
##  [1] "H" "T" "H" "T" "T" "H" "T" "T" "T" "H" "T"
## [12] "H" "T" "T" "H" "T" "H" "H" "H" "T" "T" "T"
## [23] "T" "T" "T" "T" "T" "T" "H" "H"
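Once the flips are stored in an object, they can be tallied. The sketch below assumes the flips are saved under the illustrative name flips; table() and mean() are base R functions.

```r
coin.flips <- c("H", "T")
set.seed(123)
flips <- sample(coin.flips, 30, replace = TRUE)

table(flips)        # counts of heads and tails
mean(flips == "H")  # proportion of flips that came up heads
```

With only 30 flips, the proportion of heads will usually not be exactly 0.5.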

When you analyze data, it is important to be aware of sampling errors and nonsampling errors. The actual process of sampling causes sampling errors. For example, the sample may not be large enough or representative of the population. Factors not related to the sampling process cause nonsampling errors. A defective counting device can cause a nonsampling error.

If we were to examine two samples representing the same population, they would, more than likely, not be the same. Just as there is variation in data, there is variation in samples. As you become accustomed to sampling, the variability will seem natural.
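To see this variability, one can draw two samples from the same population and compare their means. The sketch below uses a made-up population of 10,000 ages; the values and the object names population, sample.a, and sample.b are assumptions for illustration.

```r
# Hypothetical population: ages of 10,000 people (made-up values)
set.seed(123)
population <- sample(18:65, 10000, replace = TRUE)

# Two simple random samples of size 50 from the same population
sample.a <- sample(population, 50)
sample.b <- sample(population, 50)

# The sample means will typically differ from each other
# and from the population mean
c(population = mean(population), a = mean(sample.a), b = mean(sample.b))
```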

Optional Collaborative Classroom Exercise

Exercise 1

As a class, determine whether or not the following samples are representative. If they are not, discuss the reasons.

  1. To find the average GPA of all students in a university, use all honor students at the university as the sample.
  2. To find out the most popular cereal among young people under the age of 10, stand outside a large supermarket for three hours and speak to every 20th child under age 10 who enters the supermarket.
  3. To find the average annual income of all adults in the United States, sample U.S. congressmen. Create a cluster sample by considering each state as a cluster (group). By using simple random sampling, select states to be part of the sample. Then survey every U.S. congressman in the selected states.
  4. To determine the proportion of people taking public transportation to work, survey 20 people in New York City. Conduct the survey by sitting in Central Park on a bench and interviewing every person who sits next to you.
  5. To determine the average cost of a two day stay in a hospital in Massachusetts, survey 100 hospitals across the state using simple random sampling.
