Box plots or box-whisker plots give a good graphical image of the concentration of the data. They also show how far from most of the data the extreme values are. The box plot is constructed from five values: the smallest value, the first quartile, the median, the third quartile, and the largest value. The median, the first quartile, and the third quartile will be discussed here, and then again in the section on measuring data in this chapter. We use these values to compare how close other data values are to them.
The median, a number, is a way of measuring the "center" of the data. You can think of the median as the "middle value," although it does not actually have to be one of the observed values. It is a number that separates ordered data into halves. Half the values are the same number or smaller than the median and half the values are the same number or larger. For example, consider the following data:
1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1
Ordered from smallest to largest:
1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5
The median is between the 7th value, 6.8, and the 8th value 7.2. To find the median, add the two values together and divide by 2.
6.8+7.22=7
6.8
7.2
2
7
(1)
The median is 7. Half of the values are smaller than 7 and half of the values are larger than 7.
Quartiles are numbers that separate the data into quarters. Quartiles may or may not be part of the data. To find the quartiles, first find the median or second quartile. The first quartile is the middle value of the lower half of the data and the third quartile is the middle value of the upper half of the data. To get the idea, consider the same data set shown above.
The median or second quartile is 7. The lower half of the data is 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is 2.
1, 1, 2, 2, 4, 6, 6.8
The number 2, which is part of the data, is the first quartile. One-fourth of the values are the same or less than 2 and three-fourths of the values are more than 2.
The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the upper half is 9.
7.2, 8, 8.3, 9, 10, 10, 11.5
The number 9, which is part of the data, is the third quartile. Three-fourths of the values are less than 9 and one-fourth of the values are more than 9.
To construct a box plot, use a horizontal number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. The middle fifty percent of the data fall inside the box. The "whiskers" extend from the ends of the box to the smallest and largest data values. The box plot gives a good quick picture of the data.
Using the data from the start of this section, recall that the first quartile is 2, the median is 7, and the third quartile is 9. The smallest value is 1 and the largest value is 11.5. The box plot is constructed as follows:
The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown with a dashed line.
As with histograms, box plots are very easy to make in R. They are made using the boxplot() function. By default, R orients the box plot vertically; to change this, simply add the argument horizontal = TRUE to the function. For most simple plots, no additional arguments are required. You may also want to change the size of the R graph window to view the box plot in a more aesthetic scale.
Here is how you would create the box plot in R using the same data from the start of this section.
a = c(1, 11.5, 6, 7.2, 4, 8, 9,
10, 6.8, 8.3, 2, 2, 10, 1)
boxplot(a, horizontal = TRUE)
The following data are the heights of 40 students in a statistics class.
59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 77
Construct a box plot with the following properties:
- Smallest value = 59
- Largest value = 77
- Q1: First quartile = 64.5
- Q2: Second quartile or median= 66
- Q3: Third quartile = 70
- a. Each quarter has 25% of the data.
- b. The spreads of the four quarters are 64.5 - 59 = 5.5 (first quarter), 66 - 64.5 = 1.5 (second quarter), 70 - 66 = 4 (3rd quarter), and 77 - 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.
- c. Interquartile Range: IQR = Q3 - Q1 = 70 - 64.5 = 5.5IQR =Q3 -Q1 =70- 64.5= 5.5.
- d. The interval 59 through 65 has more than 25% of the data so it has more data in it than the interval 66 through 70 which has 25% of the data.
For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both the third quartile and the median. For example, if the smallest value and the first quartile were both 1, the median and the third quartile were both 5, and the largest value was 7, the box plot would look as follows:
Test scores for a college statistics class held during the day are:
99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90
Test scores for a college statistics class held during the evening are:
98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5
- What are the smallest and largest data values for each data set?
- What is the median, the first quartile, and the third quartile for each data set?
- Create a boxplot for each set of data.
- Which boxplot has the widest spread for the middle 50% of the data (the data between
the first and third quartiles)? What does this mean for that set of data in comparison to the other set of data?
- For each data set, what percent of the data is between the smallest value and the first quartile? (Answer: 25%) the first quartile and the median? (Answer: 25%) the median and the third quartile? the third quartile and the largest value? What percent of the data is between the first quartile and the largest value? (Answer: 75%)
- Xmin = 32Xmin = 32
- Q1 = 56Q1 = 56
- M = 74.5M = 74.5
- Q3 = 82.5Q3 = 82.5
- Xmax = 99Xmax = 99
- Xmin = 25.5Xmin = 25.5
- Q1 = 78Q1 = 78
- M = 81M = 81
- Q3 = 89Q3 = 89
- Xmax = 98Xmax = 98
The first data set (the top box plot) has the widest spread for the middle 50% of the data. IQR = Q3 - Q1IQR = Q3 - Q1
is
82.5 - 56 =
26.582.5 - 56 = 26.5
for the first data set and
89 - 78 =
1189 - 78 = 11
for the second data set. So, the first set of data has its middle 50% of scores more spread out.
25% of the data is between MM and Q3Q3 and 25% is between Q3Q3 and XmaxXmax.
- Median:
A number that separates ordered data into halves: half the values are the same number or smaller than the median and half the values are the same number or larger than the median. The median may or may not be part of the data.
- Quartiles:
The numbers that separate the data into quarters. Quartiles may or may not be part of the data. The second quartile is the median of the data.