<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/cnxml/0.5/DTD/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="m10215">

  <name>Box Plots</name>

  <metadata>
  <md:version>2.8</md:version>
  <md:created>2001/07/18</md:created>
  <md:revised>2008/04/20 15:09:15.794 GMT-5</md:revised>
  <md:authorlist>
      <md:author id="dmlane">
      <md:firstname>David</md:firstname>
      
      <md:surname>Lane</md:surname>
      <md:email>lane@rice.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist>
    <md:maintainer id="jago">
      <md:firstname>Adan</md:firstname>
      
      <md:surname>Galvan</md:surname>
      <md:email>agalvan@gmail.com</md:email>
    </md:maintainer>
    <md:maintainer id="dmlane">
      <md:firstname>David</md:firstname>
      
      <md:surname>Lane</md:surname>
      <md:email>lane@rice.edu</md:email>
    </md:maintainer>
    <md:maintainer id="meyer">
      <md:firstname>Eileen</md:firstname>
      
      <md:surname>Meyer</md:surname>
      <md:email>meyer@rice.edu</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist>
    <md:keyword>box plots</md:keyword>
    <md:keyword>statistics</md:keyword>
  </md:keywordlist>

  <md:abstract>Introduction to box plots.</md:abstract>
</metadata>

  <content>
    <para id="intro">
      We have already discussed techniques for visually representing
      data (see <cnxn document="m10160" strength="8">histograms</cnxn>
      and <cnxn document="m10214" strength="8">frequency
      polygons</cnxn>).  In this section we present another important
      method, called <term>box plots</term>.  (We encountered a
      simplified form of box plots in the <cnxn document="m10892" strength="8">introduction to this chapter</cnxn>.)  Box plots
      are useful for identifying outliers and for comparing
      distributions.  We will explain box plots with the help of data
      from an in-class experiment.  Students in Introductory
      Statistics were presented with a page containing 30 colored
      rectangles.  Their task was to name the colors as quickly as
      possible, and their times were recorded.  We'll compare the
      scores for the 16 men and 31 women who participated in the
      experiment by making separate box plots for each gender.  (Such
      a display is said to involve <term>parallel box plots</term>.)
    </para>

    <para id="first">
      There are several steps in constructing a box plot.  The first
      relies on the 25th, 50th, and 75th percentiles in the
      distribution of scores.  <cnxn target="figure1" strength="9"/>
      shows how these three statistics are used.  For each gender we
      draw a box extending from the 25th percentile to the 75th
      percentile.  The 50th percentile is drawn inside the box.
      Therefore, the bottom of each box is the 25th percentile, the
      top is the 75th percentile, and the line in the middle is the
      50th percentile.
    </para>

    <para id="para1b">
      The data for the women in our sample are shown in <cnxn target="table1" strength="9"/>.
    </para>

    <table id="table1" frame="all">
      <name>Times (in seconds) for women to name the colors.</name>
      <tgroup cols="7">
	<tbody>
	  <row>
	    <entry>14</entry>
	    <entry>17</entry>
	    <entry>18</entry>
	    <entry>19</entry>
	    <entry>20</entry>
	    <entry>21</entry>
	    <entry>29</entry>
	  </row>
	  <row>
	    <entry>15</entry>
	    <entry>17</entry>
	    <entry>18</entry>
	    <entry>19</entry>
	    <entry>20</entry>
	    <entry>21</entry>
	    <entry/>
	  </row>
	  <row>
	    <entry>16</entry>
	    <entry>17</entry>
	    <entry>18</entry>
	    <entry>19</entry>
	    <entry>20</entry>
	    <entry>23</entry>
	    <entry/>
	  </row>
	  <row>
	    <entry>16</entry>
	    <entry>17</entry>
	    <entry>18</entry>
	    <entry>20</entry>
	    <entry>20</entry>
	    <entry>24</entry>
	    <entry/>
	  </row>
	  <row>
	    <entry>17</entry>
	    <entry>18</entry>
	    <entry>18</entry>
	    <entry>20</entry>
	    <entry>21</entry>
	    <entry>24</entry>
	    <entry/>
	  </row>
	</tbody>
      </tgroup>
    </table>

    <para id="para1c">
      For these data, the 25th percentile is 17, the 50th percentile
      is 19, and the 75th percentile is 20.  For the men (whose data
      are not shown), the 25th percentile is 19, the 50th percentile
      is 22.5, and the 75th percentile is 25.5.
    </para>
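
    <para id="sketch-percentiles">
      As a check on the women's values, the three percentiles can be
      computed with a short Python sketch.  This module does not say
      which percentile definition it uses, so the sketch assumes the
      default "exclusive" method of <code>quantiles</code> in Python's
      standard <code>statistics</code> module; other definitions can
      give slightly different answers on other data sets.  The
      variable names are illustrative only.
    </para>

    <code type="block"><![CDATA[
```python
from statistics import quantiles

# Women's color-naming times (in seconds), copied from the table above.
women = [
    14, 17, 18, 19, 20, 21, 29,
    15, 17, 18, 19, 20, 21,
    16, 17, 18, 19, 20, 23,
    16, 17, 18, 20, 20, 24,
    17, 18, 18, 20, 21, 24,
]

# n=4 requests the three cut points that split the data into quarters.
# The default "exclusive" method is an assumption of this sketch.
p25, p50, p75 = quantiles(women, n=4)

print(p25, p50, p75)  # 17.0 19.0 20.0
```
]]></code>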

    <figure id="figure1">
      <media type="image/gif" src="image001.gif"/>
      <caption>The first step in creating box plots.</caption>
    </figure>

    <para id="para1d">
      Before proceeding, we need the terminology defined in <cnxn target="table2" strength="9"/>.
    </para>

    <table id="table2" frame="all">
      <name>Terminology</name>
      <tgroup cols="3">
	<thead>
	  <row>
	    <entry>Name</entry>
	    <entry>Formula</entry>
	    <entry>Value for Women's Data</entry>
	  </row>
	</thead>
	<tbody>
	  <row>
	    <entry>Upper Hinge</entry>
	    <entry>75th percentile</entry>
	    <entry>20</entry>
	  </row>
	  <row>
	    <entry>Lower Hinge</entry>
	    <entry>25th percentile</entry>
	    <entry>17</entry>
	  </row>
	  <row>
	    <entry>H-Spread</entry>
	    <entry>
	      <m:math>
		<m:apply>
		  <m:minus/>
		  <m:ci>Upper Hinge</m:ci>
		  <m:ci>Lower Hinge</m:ci>
		</m:apply>
	      </m:math>
	    </entry>
	    <entry>3</entry>
	  </row>
	  <row>
	    <entry>Step</entry>
	    <entry>
	      <m:math>
		<m:apply>
		  <m:times/>
		  <m:cn>1.5</m:cn>
		  <m:ci>H-Spread</m:ci>
		</m:apply>
	      </m:math>
	    </entry>
	    <entry>4.5</entry>
	  </row>
	  <row>
	    <entry>Upper Inner Fence</entry>
	    <entry>
	      <m:math>
		<m:apply>
		  <m:plus/>
		  <m:ci>Upper Hinge</m:ci>
		  <m:ci>1 Step</m:ci>
		</m:apply>
	      </m:math>
	    </entry>
	    <entry>24.5</entry>
	  </row>
	  <row>
	    <entry>Lower Inner Fence</entry>
	    <entry>
	      <m:math>
		<m:apply>
		  <m:minus/>
		  <m:ci>Lower Hinge</m:ci>
		  <m:ci>1 Step</m:ci>
		</m:apply>
	      </m:math>
	    </entry>
	    <entry>12.5</entry>
	  </row>
	  <row>
	    <entry>Upper Outer Fence</entry>
	    <entry>
	      <m:math>
		<m:apply>
		  <m:plus/>
		  <m:ci>Upper Hinge</m:ci>
		  <m:ci>2 Steps</m:ci>
		</m:apply>
	      </m:math>
	    </entry>
	    <entry>29</entry>
	  </row>
	  <row>
	    <entry>Lower Outer Fence</entry>
	    <entry>
	      <m:math>
		<m:apply>
		  <m:minus/>
		  <m:ci>Lower Hinge</m:ci>
		  <m:ci>2 Steps</m:ci>
		</m:apply>
	      </m:math>
	    </entry>
	    <entry>8</entry>
	  </row>
	  <row>
	    <entry>Upper Adjacent</entry>
	    <entry>Largest value below Upper Inner Fence</entry>
	    <entry>24</entry>
	  </row>
	  <row>
	    <entry>Lower Adjacent</entry>
	    <entry>Smallest value above Lower Inner Fence</entry>
	    <entry>14</entry>
	  </row>
	  <row>
	    <entry>Outside Value</entry>
	    <entry>
	      A value beyond an Inner Fence but not beyond an Outer
	      Fence
	    </entry>
	    <entry>29 (this value lies on the Upper Outer Fence, but not beyond it)</entry>
	  </row>
	  <row>
	    <entry>Far Out Value</entry>
	    <entry>A value beyond an Outer Fence</entry>
	    <entry>None in these data</entry>
	  </row>
	</tbody>
      </tgroup>
    </table>
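
    <para id="sketch-terminology">
      The quantities in the terminology table follow from the hinges
      by simple arithmetic.  The sketch below reproduces the values
      for the women's data.  It assumes the hinges are obtained with
      the default method of <code>quantiles</code> in Python's
      <code>statistics</code> module, and all variable names are
      illustrative.
    </para>

    <code type="block"><![CDATA[
```python
from statistics import quantiles

# Women's color-naming times (in seconds).
women = [
    14, 17, 18, 19, 20, 21, 29,
    15, 17, 18, 19, 20, 21,
    16, 17, 18, 19, 20, 23,
    16, 17, 18, 20, 20, 24,
    17, 18, 18, 20, 21, 24,
]

# Hinges are the 25th and 75th percentiles (assumed percentile method).
lower_hinge, _median, upper_hinge = quantiles(women, n=4)

h_spread = upper_hinge - lower_hinge        # 20 - 17 = 3
step = 1.5 * h_spread                       # 4.5

upper_inner_fence = upper_hinge + step      # 24.5
lower_inner_fence = lower_hinge - step      # 12.5
upper_outer_fence = upper_hinge + 2 * step  # 29
lower_outer_fence = lower_hinge - 2 * step  # 8

# Adjacent values: the most extreme scores still inside the inner fences.
upper_adjacent = max(x for x in women if x < upper_inner_fence)  # 24
lower_adjacent = min(x for x in women if x > lower_inner_fence)  # 14
```
]]></code>

    <para id="sketch-terminology-note">
      The whiskers described next run from each hinge to the
      corresponding adjacent value.
    </para>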

    <para id="second">
      Continuing with the box plots, we put "whiskers" above and below
      each box, to give additional information about the spread of
      data (<cnxn target="figure2" strength="9"/>).  Whiskers are
      vertical lines that end in a horizontal stroke (the purpose of
      the stroke is just to make the vertical lines more visible).
      Whiskers are drawn from the upper and lower hinges to the upper
      and lower adjacent values (24 and 14 for the women's data).
    </para>
    
    <figure id="figure2">
      <media type="image/gif" src="image002.gif"/>
      <caption>The box plots with the whiskers drawn.</caption>  
    </figure>
    
    <para id="fourth">
      Although we don't draw whiskers all the way to outside or far
      out values, we still wish to represent these outliers in our box
      plots.  This is achieved by adding marks beyond the
      whiskers.  Specifically, outside values are indicated by small
      circles, and far out values are indicated by asterisks.  In our
      data, there are no far out values, and just one outside value.
      The outside value of 29 is for the women, and is shown in <cnxn target="figure3" strength="9"/>.
    </para>
    
    <figure id="figure3">
      <media type="image/gif" src="image003.gif"/>
      <caption>The box plots with the outlier shown.</caption>
    </figure>
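
    <para id="sketch-outliers">
      The rule for choosing a mark can be written as a small
      function.  This is only a sketch: the function name
      <code>classify</code> and the strings it returns are
      hypothetical, and the default fence values are the ones listed
      for the women's data in the terminology table.
    </para>

    <code type="block"><![CDATA[
```python
def classify(value, inner=(12.5, 24.5), outer=(8.0, 29.0)):
    """Return the mark for one score: None, 'circle', or 'asterisk'."""
    lower_inner, upper_inner = inner
    lower_outer, upper_outer = outer
    if value < lower_outer or value > upper_outer:
        return "asterisk"  # far out value: beyond an outer fence
    if value < lower_inner or value > upper_inner:
        return "circle"    # outside value: beyond an inner fence only
    return None            # no extra mark: inside the inner fences


# Women's color-naming times (in seconds).
women = [
    14, 17, 18, 19, 20, 21, 29,
    15, 17, 18, 19, 20, 21,
    16, 17, 18, 19, 20, 23,
    16, 17, 18, 20, 20, 24,
    17, 18, 18, 20, 21, 24,
]

# Keep only the scores that need a mark beyond the whiskers.
marks = {x: classify(x) for x in women if classify(x) is not None}
print(marks)  # {29: 'circle'}
```
]]></code>

    <para id="sketch-outliers-note">
      The score of 29 sits exactly on the upper outer fence, so it is
      not beyond that fence and counts as an outside value rather than
      a far out value.
    </para>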

    <para id="fifth">
      There is one more mark to include in box plots (although
      sometimes it is omitted).  We indicate the mean score for a
      group by inserting a plus sign.  <cnxn target="figure4" strength="9"/> shows the result of adding means to our box
      plots.
    </para>
    
    <figure id="figure4">
      <media type="image/gif" src="image004.gif"/>
      <caption>The completed box plots.</caption>
    </figure>
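
    <para id="sketch-mean">
      For reference, the position of the plus sign for the women can
      be computed directly from the table of times.  The sketch below
      uses Python's standard <code>statistics</code> module, and the
      variable names are illustrative.
    </para>

    <code type="block"><![CDATA[
```python
from statistics import mean

# Women's color-naming times (in seconds).
women = [
    14, 17, 18, 19, 20, 21, 29,
    15, 17, 18, 19, 20, 21,
    16, 17, 18, 19, 20, 23,
    16, 17, 18, 20, 20, 24,
    17, 18, 18, 20, 21, 24,
]

women_mean = mean(women)     # sum of 594 divided by 31 scores
print(round(women_mean, 2))  # 19.16
```
]]></code>

    <para id="sketch-mean-note">
      The mean lies slightly above the 50th percentile of 19.
    </para>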
    
    <para id="fifth2">
      <cnxn target="figure4" strength="9"/> provides a revealing
      summary of the data.  Since half the scores in a distribution
      are between the hinges (recall that the hinges are the 25th and
      75th percentiles), we see that half the women's times are
      between 17 and 20, whereas half the men's times are between 19
      and 25.5.  We also see that women generally named the colors
      faster than the men did, although one woman was slower than
      almost all of the men.  <cnxn target="figure5" strength="9"/>
      shows the box plot for the women's data with detailed labels.
    </para>
    
    <figure id="figure5">
      <media type="image/gif" src="boxplot_labeled.gif"/>
      <caption>The box plot for the women's data.</caption>
    </figure>
      
    <para id="fifth3">
      Here are some other examples of box plots.
      <list id="list1" type="bulleted">
	<item>
	  <link src="http://psych.rice.edu/online_stat/chapter2/boxplots_files/target_boxplot.html">Time
	    to move the mouse over a target.</link>
	</item>
	<item>
	  <link src="http://psych.rice.edu/online_stat/chapter2/boxplots_files/draft.html">Draft
	    Lottery</link>
	</item>
      </list>
    </para>
    
    <section id="variations">
      <name>Variations on box plots</name>
      <para id="sectpara1">
	Statistical analysis programs may offer options on how box
	plots are created.  For example, the box plot in <cnxn target="figure6" strength="9"/> is constructed from our data
	but differs from the previous box plot in several ways.
	<list id="list2" type="enumerated">
	  <item>It does not mark outliers.</item>
	  <item>
	    The means are indicated by green lines rather than
	    plus signs.
	  </item>
	  <item>
	    The mean of all scores is indicated by a grey line.
	  </item>
	  <item>
	    Individual scores are represented by dots.  Since the
	    scores have been rounded to the nearest second, any given
	    dot might represent more than one score.
	  </item>
	  <item>
	    The box for the women is wider than the box for the men
	    because the widths of the boxes are proportional to the
	    number of subjects of each gender (31 women and 16 men).
	  </item>
	</list>
      </para>

      <figure id="figure6">
	<media type="image/gif" src="image006.gif"/>
	<caption>
	  Box plots showing the individual scores and the means.
	</caption>
      </figure>
      
      <para id="sectpara1a">
	Each dot in <cnxn target="figure6" strength="9"/> represents a
	group of subjects with the same score (rounded to the nearest
	second).  An alternative graphing technique is to
	<emphasis>jitter</emphasis> the points.  This means spreading
	out horizontally the dots that share a score, drawing one dot
	for each subject.  The exact horizontal position of each dot is
	determined randomly (under the constraint that different dots
	don't overlap).  Spreading out the dots allows you to see
	multiple occurrences of a given score.  <cnxn target="figure7" strength="9"/> shows what jittering looks like.
      </para>
      
      <figure id="figure7">
	<media type="image/gif" src="image007.gif"/>
	<caption>
	  Box plots with the individual scores jittered.
	</caption>
      </figure>
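
      <para id="sketch-jitter">
	Jittering can be sketched in a few lines of Python.  This
	module does not say how the random positions are chosen, so
	everything here is an assumption: the function name
	<code>jitter</code>, its parameters (<code>center</code>,
	<code>half_width</code>, <code>min_gap</code>,
	<code>seed</code>), and the use of simple rejection sampling
	to keep dots with the same score from overlapping.
      </para>

      <code type="block"><![CDATA[
```python
import random


def jitter(scores, center=1.0, half_width=0.2, min_gap=0.02, seed=0):
    """Return one (x, score) pair per subject with a random x offset.

    Dots that share a score are kept at least min_gap apart in x.
    """
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    placed = {}                # score -> x positions already used
    points = []
    for score in scores:
        used = placed.setdefault(score, [])
        for _ in range(1000):  # retry cap for the rejection sampling
            x = center + rng.uniform(-half_width, half_width)
            if all(abs(x - other) >= min_gap for other in used):
                break
        else:
            raise ValueError("could not place a dot without overlap")
        used.append(x)
        points.append((x, score))
    return points


# Women's color-naming times (in seconds).
women = [
    14, 17, 18, 19, 20, 21, 29,
    15, 17, 18, 19, 20, 21,
    16, 17, 18, 19, 20, 23,
    16, 17, 18, 20, 20, 24,
    17, 18, 18, 20, 21, 24,
]

points = jitter(women)  # x positions scatter around center=1.0
```
]]></code>

      <para id="sketch-jitter-note">
	Each score keeps its exact value, and only the horizontal
	position is randomized, so the six women who scored 20 appear
	as six separate dots at the same height.
      </para>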
      
      <para id="last">
	Different styles of box plots are best for different
	situations, and there are no firm rules for which to use.
	When exploring your data you should try several ways of
	visualizing them.  Which graph you include in your report
	should depend on how well different graphs reveal the aspects
	of the data you consider most important.
      </para>
    </section>
    
  </content>  
</document>
