Comparing Measures of Central Tendency
How do the various measures of central tendency compare with
each other? For
symmetric
distributions, the mean, median, trimean, and
trimmed mean are equal, as is the mode except in
bimodal
distributions. Differences among the measures
occur with
skewed
distributions.
Figure 1 shows the distribution of
642 scores on an introductory psychology test. Notice this
distribution has a slight positive skew.
Measures of central tendency are shown in
Table 1. Notice they do not differ greatly, with the
exception that the mode is lower than the other measures. When
distributions have a
positive skew, the mean is
higher than the median. For these data, the mean of 91.58 is
higher than the median of 90. Typically the
trimean and
trimmed mean will fall
between the
median and
the
mean, although in
this case, the trimmed mean is slightly lower than the
median. The
geomtric
mean is the lower than all measures except the
mode.
Measures of central tendency for the test scores.
|
Measure
|
Value
|
|
Mode
|
84.00
|
|
Median
|
90.00
|
|
Geometric
Mean
|
89.70
|
|
Trimean
|
90.25
|
|
Mean
trimmed
50%
|
89.81
|
|
Mean
|
91.58
|
The distribution of baseball salaries (in 1994) shown in
Figure 2 has a much more pronounced skew than the
distribution in
Figure 1.
Table 2 shows the measures of
central tendency for these data. The
large skew results in very different values for these
measures. No single measure of central tendency is sufficient
for data such as these. If you were asked the very general
question:"So, what do baseball players make?"
and answered with the mean of $1,183,000, you would have not
told the whole story since only about one third of baseball
players make that much. If you answered with the mode of
$250,000 or the median of $500,000, you would not be giving
any indication that some players make many millions of
dollars. Fortunately, there is no need to summarize a
distribution with a single number. When the various measurs
differ, our opinion is that you should report the mean,
median, and either the trimean or a the mean trimmed
50%. Sometimes it is worth reporting the mode as well. In the
media, the median is usually reported to summarize the center
of skewed distributions. You will hear about median salaries
and median prices of houses sold, etc. This is better than
reporting only the mean, but it would be informative to hear
more statistics.
Measures of central tendency for baseball salaries (in thousands of dollars).
|
Measure
|
Value
|
|
Mode
|
250
|
|
Median
|
500
|
|
Geometric
Mean
|
555
|
|
Trimean
|
792
|
|
Mean
trimmed
50%
|
619
|
|
Mean
|
1,183
|
Glossary
Bimodal Distribution:
A distribution with two distinct peaks. An example is
shown below.
Bar Chart:
A graphical method of presenting
data
from a
discrete variable. A bar
is drawn for each value of the variable. The height of each
bar contains the number or percentage of observations with
that value of the variable. An exmple is shown below. See
also:
histogram,
line graph,
pie
chart,
box plot. See
Figure 4 for an example.
Box Plot:
One of the more effective graphical summaries of a data set,
the box plot generally shows
mean,
median, 25th and 75th
percentiles, and outliers. A standard
box plot is composed of the
median,
upper hinge,
lower hinge,
higher
adjacent value,
lower adjacent
value,
outside values, and
far out values. An example is shown
below.
Parallel box plots are
very useful for comparing distributions. See
Figure 5 for an example. See also:
step,
H-spread.
Center (of a Distribution):
Class Interval:
Bin Width: Also known as bin width, the class
interval is a division of
data for
use in a
histogram. For
instance, it is possible to partition scores on a 100 point
test into class intervals of 1-25, 26-49, 50-74 and 75-100.
Continuous Variables:
Variables that can take on any
value in a certain range. Time and distance are continuous;
gender, SAT score and "time rounded to the nearest second" are
not. Variables that are not continuous are known as
discrete variables. No measured
variable is truly continuous; however, discrete variables
measured with enough precision can often be considered
continuous for practical purposes.
Data:
A collection of values to be used for statistical
analysis. See also:
variable.
Discrete:
Variables that can only take on a finite number of values are
called "discrete variables." All
qualitative variables are
discrete. Some
quantitative
variables are discrete, such as performance rated as
1, 2, 3, 4, or 5, or temperature rounded to the nearest
degree. Sometimes, a variable that takes on enough discrete
values can be considered to be continuous for practical
purposes. One example is time to the nearest millisecond.
Variables that can take on an infinite number of possible
values are called
continuous
variables.
Distribution:
Frequency Distribution: The distribution of
empirical data is called a frequency distribution and consists
of a count of the number of occurences of each value. If the
data are continuous, then a
grouped
frequency distribution is used. Typically, a
distribution is portrayed using a
frequency polygon or a
histogram. Mathematical distributions
are often used to define distributions. The normal
distribution is, perhaps, the best known example. Many
empirical distributions are approximated well by mathematical
distributions such as the normal distribution.
Far Out Value:
One of the components of a
box plot,
far out values are those that are more than 2
steps from the nearest
hinge. They are beyond the outer fences.
Geometric Mean:
The geometric mean of
n n numbers
is obtained by multiplying all of them together, and then
taking the nth root of them. It is one of the rarer measures
of
central tendency, and not to be
confused with the much, much more common
arithmetic mean.
Grouped Frequency Distribution:
A grouped frequency distribution is a
frequency distribution in which
frequencies are displayed for ranges of data rather than for
individual values. For example, the distribution of heights
might be calculated by defining one-inch ranges. The frequency
of indivuals with various heights rounded off to the nearest
inch would be then be tabulated. See also:
histogram.
Higher Adjacent Value:
One of the components of a
box plot,
the higher adjacent value is the largest value in the
data below the 75th
percentile.
Levels of Measurement:
Measurement scales differ in their level of measurement. There
are four common levels of measurement:
-
Nominal scales are only
labels.
-
Ordinal Scales are ordered but
are not truly quantitative. Equal intervals on the ordinal
scale do not imply equal intervals on the underlying
trait.
-
Interval scales are are ordered and equal
intervals equal intervals on the underlying
trait. However, interval scales do not have a true zero
point.
-
Ratio scales are
interval scales that do have a true zero point. With ratio
scales, it is sensible to talk about one value being twice
as large as another, for example.
Line Graph:
Essentially a
bar graph in which the
height of each par is represented by a single point, with each
of these points connected by a line. Line graphs are best used
to show change over time, and should never be used if your
X-axis is not an ordered variable.
Lower Hinge:
A component of a
box plot, the lower
hinge is the 25th
percentile. The
upper hinge is the 75th percentile.
Lower Adjacent Value:
A component of a
box plot, the lower
adjacent value is smallest value in the data above the inner
lower fence.
Mean:
Arithmetic Mean:
Also known as the arithmetic mean, the mean is typically what
is meant by the word
average. The
mean is perhaps the most common measure of
central tendency. The mean of a
variable is given by (the sum of all
its values)/(the number of values). For example, the mean of
4, 8, and 9 is 7. The sample mean is written as M, and the
population mean as the Greek letter mu (μ). Despite its
popularity, the mean may not be an appropriate measure of
central tendency for
skewed
distributions, or in situations with outliers.
Mode:
The mode is a measure of
central
tendency. It is the most common value in a
distribution: the mode of 3, 4, 4,
5, 5, 5, 8 is 5. Note that the mode may be very different from
the
mean and the
median: 1, 1, 1, 3, 8, 10 has mode 1, but
mean 6 and median 2.
Nominal Scale:
A nominal scale is one of four Levels of Measurement. No
ordering is implied, and addition/subtraction and
multiplication/division would be inappropriate for a variable
on a nominal scale.
FemaleMale
Female
Male
and
BuddhistChristianHinduMuslim
Buddhist
Christian
Hindu
Muslim
have no natural ordering (except
alphabetic). Occasionally, numeric values are nominal: for
instance, if a variable was coded as
Female=1
Female
1
,
Male=2
Male
2
, the set
12
1
2
is still nominal.
Ordinal Scale:
One of four
levels of
measurement, an ordinal scale is a set of ordered
values. However, there is no set distance between scale
values. For instance, for the scale: (Very Poor, Poor,
Average, Good, Very Good) is an ordinal scale. You can assign
numerical values to an ordinal scale: rating performance such
as 1 for "Very Poor," 2 for "Poor," etc, but there is no
assurance that the difference between a score of 1 and 2 means
the same thing as the difference between a score of and 3.
Parallel Box Plots:
Two or more
box plots drawn on the
same Y-axis. These are often useful in comparing features of
distributions. An example
portraying the times it took samples of women and men to do a
task is shown below. See
Figure 8 for an
example.
Percentile:
1.
There is no universally accepted definition of a
percentile. Using the 65th percentile as an example, some
statisticians define the 65th percentile as the lowest score
that is larger than 65% of the scores. Others have defined the
65th percentile as the smallest score that is greater than or
equal to 65% of the scores. A more sophisticated definition is
given below.
2.
The first step is to compute the rank (
R
R) of the percentile in question. This is done using
the following formula:
R=P100N+1
R
P
100
N
1
where
P P is the desired
percentile and
N N is the number
of numbers. If
R R is an integer,
then the
Pth
Pth
percentile is the number with rank
R R. When
R
R is not an integer, we compute the
Pth
Pth
perentile by interpolation as
follows:
-
Define IR
IR
as the integer portion of
R R (the number to the
left of the decimal point).
-
Define FR
FR
as the fractional portion
or R R.
-
Find the scores with Rank IR
IR
and with Rank
I
R
+1
I
R
1
.
-
Interpolate by multiplying the difference between the
scores by
FR
FR
and add the result to the lower
score.
Pie Chart:
A graphical representation of
data,
the pie chart shows
relative
frequencies of classes of data. It is a circle cut into
a number of wedges, one for each class, with the area of each
wedge proportional to its relative frequency. Pie charts are
only effective for a small number of classes, and are one of
the less effective graphical representations.
Qualitative Variables:
Categorical Variable: Also known as categorical
variables, qualitative variables are
variables with no natural sense of
ordering. For instance, hair color (Black, Brown, Gray, Red,
Yellow) is a qualitative variable, as is name (Adam, Becky,
Christina, Dave . . .). Qualitative variables can be coded to
appear numeric but their numbers are meaningless, as in
male=1, female=2. Variables that are not qualitative are known
as
quantitative variables.
Quantitative Variables:
Variables that have are measured
on a numeric or quantitative scale.
Ordinal,
interval and
ratio scales are quantitative. A country's
population, a person's shoe size, or a car's speed are all
quantitative variables. Variables that are not quantitative
are known as
qualitative
variables.
Ratio Scale:
One of the four basic
levels of
measurement, a ratio scale is a numerical scale with a
true zero point and in which a given size interval has the
same interpretation for the entire scale. Weight is a ratio
scale, Therefore it is meaningful to say that a 200 pound
person weighs twice as much as a 100 pound person.
Relative Frequency:
The proportion of observations falling into a given class. For
example, if a bag of 55 M&M's has 11 green M&M's,
then the frequency of green M&M's is 11 and the relative
frequency is
11/55=0.20
1155
0.20
. Relative frequencies arise in the creation of
histograms and
pie
charts, and sometimes in
bar graphs.
Skew:
A distribution is skewed if one tail extends out further than
the other. A distribution has positive skew (is skewed to the
right) if the tail to the right is longer. See
Figure 9 for an example.
A distribution has a negative skew (is skewed to the left) if
the tail to the left is longer. See
Figure 10
for an example.
Sturgis's Rule:
One method of determining the number of
classes for a
histogram, Sturgis's Rule is to take
1+log2N
1
2
N
classes, rounded to the nearest integer.
Symmetric Distribution:
In a symmetric
distribution,
the upper and lower halfs of the distribution are mirror
images of each other. For example, in the distribution shown
below, the portions above and below 50 are mirror images of
each other. In a symmetric distribution, the
mean is equal to the
median. See
Figure 11 for
an example.
Trimean:
The trimean is a measure of
central
tendency; it is a weighted average of the 25th, 50th,
and 75th
percentiles. Specifically
it is computed as follows:
Trimean=0.25
25
th
+0.5
50
th
+0.25
75
th
Trimean
0.25
25
th
0.5
50
th
0.25
75
th
Trimmed Mean:
The trimmed mean is a measure of
central
tendency generally falling between the
mean and the
median. As in the computation of the
median, all observations are ordered. Next, the highest and
lowest alpha percent of the data are removed, where alpha
ranges from 0 to 50. Finally, the mean of the remaining
observations is taken. The trimmed mean has advantages over
both the mean and median, but is computationally more
difficult and analytically more intractable.
Variables:
Something that can take on different values. For example,
different subjects in an experiment weight different
amounts. Therefore "weight" is a variable in the
experiment. Or, subjects may be given different doses of a
drug. This would make "dosage" a variable. Variables can be
dependent or
independent,
qualitative or
quantitative, and
continuous or
discrete.
Dependent Variable:
A
variable that measures the
experimental outcome. In most experiments, the effects of the
independent variable on the
dependent variables are observed. For example, if a study
investigated the effectiveness of an experimental treatment
for depression, then the measure of depression would be the
dependent variable. Synonym: dependent measure
Independent Variables:
Variables that are manipulated
by the experimenter, as opposed to
dependent variables. Most experiments
consist of observing the effect of the independent variable on
the dependent variable(s).