<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/cnxml/0.5/DTD/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="m11132">
  
  <name>Sampling Distribution of Difference Between Means</name>
  
  <metadata>
  <md:version>2.3</md:version>
  <md:created>2003/04/28</md:created>
  <md:revised>2003/06/19 12:05:49.001 GMT-5</md:revised>
  <md:authorlist>
    <md:author id="dmlane">
      <md:firstname>David</md:firstname>
      
      <md:surname>Lane</md:surname>
      <md:email>lane@rice.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist>
    <md:maintainer id="dmlane">
      <md:firstname>David</md:firstname>
      
      <md:surname>Lane</md:surname>
      <md:email>lane@rice.edu</md:email>
    </md:maintainer>
    <md:maintainer id="liqun">
      <md:firstname>Liqun</md:firstname>
      
      <md:surname>Wang</md:surname>
      <md:email>liqun@rice.edu</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist>
    <md:keyword>sampling distribution</md:keyword>
    <md:keyword>difference between means</md:keyword>
  </md:keywordlist>

  <md:abstract>This module discusses the sampling distribution of the difference between means.</md:abstract>
</metadata>
  

  <content>
    <para id="para1">
      Statistical analyses are very often concerned with the
      difference between means. A typical example is an experiment
      designed to compare the mean of a control group with the mean of
      an experimental group. <term>Inferential statistics</term> used
      in the analysis of this type of experiment depend on the
      sampling distribution of the difference between means.   
    </para>

    <para id="para2">
      The sampling distribution of the difference between means can be
      thought of as the distribution that would result if we repeated
      the following three steps over and over again:  

      <list id="list1" type="enumerated">
	<item>
	  Sample 
	  <m:math>
	    <m:ci>
	      <m:msub><m:mi>n</m:mi><m:mn>1</m:mn></m:msub>
	    </m:ci>
	  </m:math> scores from Population 1 and 
	  <m:math>
	    <m:ci>
	      <m:msub><m:mi>n</m:mi><m:mn>2</m:mn></m:msub>
	    </m:ci>
	  </m:math> scores from Population 2;
	</item>

	<item>Compute the means of the two samples (
	  <m:math>
	    <m:ci>
	      <m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
	    </m:ci>
	  </m:math> and 
	  <m:math>
	    <m:ci>
	      <m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	    </m:ci>
	  </m:math>);
	</item>

	<item>Compute the difference between means 
	  <m:math>
	    <m:apply>
	      <m:minus/>
	      <m:ci>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
	      </m:ci>
	      <m:ci>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:ci>
	    </m:apply>
	  </m:math>. The distribution of the differences between means
	  is the sampling distribution of the difference between
	  means.
	</item>
      </list>
    </para>
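
    <para id="para2a">
      The three steps above can be sketched as a small simulation.
      This is only an illustration and is not part of the original
      module: it assumes both populations are normally distributed,
      and the function name <code>simulate_differences</code>, the
      population means (50 and 45), the standard deviations (5), the
      sample sizes (10 each), the number of repetitions, and the
      random seed are all hypothetical choices made for the example.
      <code type="block"><![CDATA[
```python
import random
import statistics


def simulate_differences(mu1, sigma1, n1, mu2, sigma2, n2, reps=20000, seed=0):
    """Repeat the three steps `reps` times and return the list of M1 - M2 values.

    Assumption: both populations are normal; the text does not require this.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Step 1: sample n1 scores from Population 1 and n2 from Population 2.
        sample1 = [rng.gauss(mu1, sigma1) for _ in range(n1)]
        sample2 = [rng.gauss(mu2, sigma2) for _ in range(n2)]
        # Step 2: compute the means of the two samples.
        m1 = statistics.fmean(sample1)
        m2 = statistics.fmean(sample2)
        # Step 3: compute the difference between means.
        diffs.append(m1 - m2)
    return diffs


# Hypothetical populations chosen only for illustration.
diffs = simulate_differences(mu1=50, sigma1=5, n1=10, mu2=45, sigma2=5, n2=10)
print(statistics.fmean(diffs))   # should be close to 50 - 45 = 5
print(statistics.pstdev(diffs))  # spread of the simulated differences
```
]]></code>
      The list <code>diffs</code> approximates the sampling
      distribution of the difference between means; a histogram of it
      would approximate the distribution's shape.
    </para>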

    <para id="para3">
      As you might expect, the mean of the sampling distribution of
      the difference between means is: 
      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:ci>
	    <m:msub><m:mi>μ</m:mi>
	      <m:mrow>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		<m:mo>-</m:mo>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:mrow>
	    </m:msub>
	  </m:ci>
	  <m:apply>
	    <m:minus/>
	    <m:ci>
	      <m:msub><m:mi>μ</m:mi><m:mn>1</m:mn></m:msub>
	    </m:ci>
	    <m:ci>
	      <m:msub><m:mi>μ</m:mi><m:mn>2</m:mn></m:msub>
	    </m:ci>
	  </m:apply>
	</m:apply>
      </m:math>

      which says that the mean of the distribution of differences
      between sample means is equal to the difference between
      population means. For example, say that the mean test score of
      all 12-year-olds in a population is 34 and the mean of
      10-year-olds
      is 25. If numerous samples were taken from each age group and
      the mean difference computed each time, the mean of these
      numerous differences between sample means would be 34 - 25 = 9. 
    </para>

    <para id="para4">
      From the variance sum law, we know that:

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:apply>
	    <m:power/>
	    <m:ci>
	      <m:msub><m:mi>σ</m:mi>
		<m:mrow>
		  <m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		  <m:mo>-</m:mo>
		  <m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
		</m:mrow>
	      </m:msub>
	    </m:ci>
	    <m:cn>2</m:cn>
	  </m:apply>
	  <m:apply>
	    <m:plus/>
	    <m:apply>
	      <m:power/>
	      <m:ci>
		<m:msub><m:mi>σ</m:mi>
		  <m:mrow>
		    <m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		  </m:mrow>
		</m:msub>
	      </m:ci>
	      <m:cn>2</m:cn>
	    </m:apply>
	    <m:apply>
	      <m:power/>
	      <m:ci>
		<m:msub><m:mi>σ</m:mi>
		  <m:mrow>
		    <m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
		  </m:mrow>
		</m:msub>
	      </m:ci>
	      <m:cn>2</m:cn>
	    </m:apply>
	  </m:apply>
	</m:apply>
      </m:math>

      which says that the variance of the sampling distribution of the
      difference between means is equal to the variance of the
      sampling distribution of the mean for Population 1 plus the
      variance of the sampling distribution of the mean for Population
      2. Recall the formula for the variance of the sampling
      distribution of the mean: 

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:apply>
	    <m:power/>
	    <m:ci>
	      <m:msub><m:mi>σ</m:mi><m:mi>M</m:mi></m:msub>
	    </m:ci>
	    <m:cn>2</m:cn>
	  </m:apply>
	  <m:apply>
	    <m:divide/>
	    <m:apply>
	      <m:power/>
	      <m:ci>σ</m:ci>
	      <m:cn>2</m:cn>
	    </m:apply>
	    <m:ci>N</m:ci>
	  </m:apply>
	</m:apply>
      </m:math>

      Since we have two populations and two sample sizes, we need to
      distinguish between the two variances and sample sizes. We do
      this using the subscripts 1 and 2. Using this convention we can
      write the formula for the variance of the sampling distribution
      of the difference between means as: 

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:apply>
	    <m:power/>
	    <m:ci>
	      <m:msub><m:mi>σ</m:mi>
		<m:mrow>
		  <m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		  <m:mo>-</m:mo>
		  <m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
		</m:mrow>
	      </m:msub>
	    </m:ci>
	    <m:cn>2</m:cn>
	  </m:apply>
	  <m:apply>
	    <m:plus/>
	    <m:apply>
	      <m:divide/>
	      <m:apply>
		<m:power/>
		<m:ci>
		  <m:msub><m:mi>σ</m:mi><m:mn>1</m:mn></m:msub>
		</m:ci>
		<m:cn>2</m:cn>
	      </m:apply>
	      <m:ci>
		<m:msub><m:mi>n</m:mi><m:mn>1</m:mn></m:msub>
	      </m:ci>
	    </m:apply>
	    <m:apply>
	      <m:divide/>
	      <m:apply>
		<m:power/>
		<m:ci>
		  <m:msub><m:mi>σ</m:mi><m:mn>2</m:mn></m:msub>
		</m:ci>
		<m:cn>2</m:cn>
	      </m:apply>
	      <m:ci>
		<m:msub><m:mi>n</m:mi><m:mn>2</m:mn></m:msub>
	      </m:ci>
	    </m:apply>
	  </m:apply>
	</m:apply>
      </m:math>

      Since the standard error of a sampling distribution is the
      standard deviation of the sampling distribution, the standard
      error of the difference between means is: 

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:ci>
	    <m:msub><m:mi>σ</m:mi>
	      <m:mrow>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		<m:mo>-</m:mo>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:mrow>
	    </m:msub>
	  </m:ci>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:plus/>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:power/>
		  <m:ci>
		    <m:msub><m:mi>σ</m:mi><m:mn>1</m:mn></m:msub>
		  </m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
		<m:ci>
		  <m:msub><m:mi>n</m:mi><m:mn>1</m:mn></m:msub>
		</m:ci>
	      </m:apply>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:power/>
		  <m:ci>
		    <m:msub><m:mi>σ</m:mi><m:mn>2</m:mn></m:msub>
		  </m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
		<m:ci>
		  <m:msub><m:mi>n</m:mi><m:mn>2</m:mn></m:msub>
		</m:ci>
	      </m:apply>
	    </m:apply>
	  </m:apply>
	</m:apply>
      </m:math>

      Just to review the notation, the symbol on the left contains a
      sigma (<m:math><m:ci>σ</m:ci></m:math>) which means it is
      a standard deviation. The subscripts  
      <m:math>
	<m:apply>
	  <m:minus/>
	  <m:ci>
	    <m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
	  </m:ci>
	  <m:ci>
	    <m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	  </m:ci>
	</m:apply>
      </m:math> indicate that it is the standard deviation of the
      sampling distribution of  
      <m:math>
	<m:apply>
	  <m:minus/>
	  <m:ci>
	    <m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
	  </m:ci>
	  <m:ci>
	    <m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	  </m:ci>
	</m:apply>
      </m:math>.
    </para>
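
    <para id="para4a">
      The standard error formula above translates directly into a
      short function. The sketch below is illustrative only: the
      function name <code>standard_error_of_difference</code> and the
      example variances and sample sizes are hypothetical and do not
      come from the module.
      <code type="block"><![CDATA[
```python
from math import sqrt


def standard_error_of_difference(var1, n1, var2, n2):
    """Return sqrt(var1/n1 + var2/n2), the standard error of M1 - M2.

    var1 and var2 are the population variances; n1 and n2 are the sample sizes.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    # Variance sum law: the variances of the two sample means add.
    variance_of_difference = var1 / n1 + var2 / n2
    return sqrt(variance_of_difference)


# Hypothetical values: 36/4 + 64/4 = 9 + 16 = 25, whose square root is 5.
print(standard_error_of_difference(var1=36, n1=4, var2=64, n2=4))  # 5.0
```
]]></code>
      Note that the function takes variances, not standard
      deviations, to match the way the formula is written.
    </para>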
        
    <para id="para5">
      Now let's look at an application of this formula. Assume there
      are two species of green beings on Mars. The mean height of
      Species 1 is 32 while the mean height of Species 2 is 22. The
      variances of the two species are 60 and 70 respectively and the
      heights of both species are normally distributed. You randomly
      sample 10 members of Species 1 and 14 members of Species 2. What
      is the probability that the mean of the 10 members of Species 1
      will exceed the mean of the 14 members of Species 2 by 5 or
      more? Without doing any calculations, you probably know that the
      probability is pretty high since the difference in population
      means is 10. But what exactly is the probability? 
    </para>

    <para id="para6">
      First, let's determine the sampling distribution of the
      difference between means. Using the formulas above, the mean is 

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:ci>
	    <m:msub><m:mi>μ</m:mi>
	      <m:mrow>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		<m:mo>-</m:mo>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:mrow>
	    </m:msub>
	  </m:ci>
	  <m:apply>
	    <m:minus/>
	    <m:cn>32</m:cn>
	    <m:cn>22</m:cn>
	  </m:apply>
	  <m:cn>10</m:cn>
	</m:apply>
      </m:math>

      The standard error is:

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:ci>
	    <m:msub><m:mi>σ</m:mi>
	      <m:mrow>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		<m:mo>-</m:mo>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:mrow>
	    </m:msub>
	  </m:ci>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:plus/>
	      <m:apply>
		<m:divide/>
		<m:cn>60</m:cn>
		<m:cn>10</m:cn>
	      </m:apply>
	      <m:apply>
		<m:divide/>
		<m:cn>70</m:cn>
		<m:cn>14</m:cn>
	      </m:apply>
	    </m:apply>
	  </m:apply>
	  <m:cn>3.317</m:cn>
	</m:apply>
      </m:math>

      The sampling distribution is shown in <cnxn target="figure1" strength="7"/>. Notice that it is normally distributed with a
      mean of 10 and a standard deviation of 3.317. The area above 5
      is shaded blue. 
    </para>

    <figure id="figure1">
      <media type="image/gif" src="figure1.gif"/>
      <caption>
	The sampling distribution of the difference between means.
      </caption>
    </figure>

    <para id="para7">
      The last step is to determine the area that is shaded
      blue. Using either a Z table or the <cnxn document="m11328">normal calculator</cnxn>, the area can be
      determined to be 0.934. Thus the probability that the mean of
      the sample from Species 1 will exceed the mean of the sample
      from Species 2 by 5 or more is 0.934.
    </para>
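
    <para id="para7a">
      As a check on the table lookup, the same area can be computed
      with Python's standard library. This is a sketch rather than
      part of the original module; the variable names are
      illustrative, and the normal shape of the sampling distribution
      follows from the stated assumption that heights are normally
      distributed.
      <code type="block"><![CDATA[
```python
from math import sqrt
from statistics import NormalDist

# Values from the example: Species 1 minus Species 2.
mean_diff = 32 - 22                # mean of the sampling distribution
se_diff = sqrt(60 / 10 + 70 / 14)  # standard error of the difference

sampling_dist = NormalDist(mu=mean_diff, sigma=se_diff)

# Area above 5: probability that M1 - M2 is 5 or more.
p_at_least_5 = 1 - sampling_dist.cdf(5)

print(round(se_diff, 3))       # 3.317
print(round(p_at_least_5, 3))  # 0.934
```
]]></code>
    </para>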

    <para id="para8">
      As shown below, the formula for the standard error of the
      difference between means is much simpler if the sample sizes and
      the population variances are equal. Since the variances and
      sample sizes are the same, there is no need to use the
      subscripts 1 and 2 to differentiate these terms. 

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:ci>
	    <m:msub><m:mi>σ</m:mi>
	      <m:mrow>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		<m:mo>-</m:mo>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:mrow>
	    </m:msub>
	  </m:ci>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:plus/>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:power/>
		  <m:ci>
		    <m:msub><m:mi>σ</m:mi><m:mn>1</m:mn></m:msub>
		  </m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
		<m:ci>
		  <m:msub><m:mi>n</m:mi><m:mn>1</m:mn></m:msub>
		</m:ci>
	      </m:apply>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:power/>
		  <m:ci>
		    <m:msub><m:mi>σ</m:mi><m:mn>2</m:mn></m:msub>
		  </m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
		<m:ci>
		  <m:msub><m:mi>n</m:mi><m:mn>2</m:mn></m:msub>
		</m:ci>
	      </m:apply>
	    </m:apply>
	  </m:apply>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:plus/>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:power/>
		  <m:ci>σ</m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
		<m:ci>n</m:ci>
	      </m:apply>
	      <m:apply>
		<m:divide/>
		<m:apply>
		  <m:power/>
		  <m:ci>σ</m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
		<m:ci>n</m:ci>
	      </m:apply>
	    </m:apply>
	  </m:apply>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:divide/>
	      <m:apply>
		<m:times/>
		<m:cn>2</m:cn>
		<m:apply>
		  <m:power/>
		  <m:ci>σ</m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
	      </m:apply>
	      <m:ci>n</m:ci>
	    </m:apply>
	  </m:apply>
	</m:apply>
      </m:math>

      This simplified version of the formula can be used for the
      following problem: The mean height of 15-year-old boys (in cm)
      is 175 and the variance is 64. For girls, the mean is 165 and
      the variance is 64. If eight boys and eight girls were sampled,
      what is the probability that the mean height of the sample of
      girls would be higher than the mean height of the boys? In other
      words, what is the probability that the mean height of girls
      minus the mean height of boys is greater than 0? 
    </para>

    <para id="para9">
      As before, the problem can be solved in terms of the sampling
      distribution of the difference between means (girls - boys). The
      mean of the distribution is 165 - 175 = -10. The standard
      deviation of the distribution is: 

      <m:math display="block">
	<m:apply>
	  <m:eq/>
	  <m:ci>
	    <m:msub><m:mi>σ</m:mi>
	      <m:mrow>
		<m:msub><m:mi>M</m:mi><m:mn>1</m:mn></m:msub>
		<m:mo>-</m:mo>
		<m:msub><m:mi>M</m:mi><m:mn>2</m:mn></m:msub>
	      </m:mrow>
	    </m:msub>
	  </m:ci>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:divide/>
	      <m:apply>
		<m:times/>
		<m:cn>2</m:cn>
		<m:apply>
		  <m:power/>
		  <m:ci>σ</m:ci>
		  <m:cn>2</m:cn>
		</m:apply>
	      </m:apply>
	      <m:ci>n</m:ci>
	    </m:apply>
	  </m:apply>
	  <m:apply>
	    <m:root/>
	    <m:apply>
	      <m:divide/>
	      <m:apply>
		<m:times/>
		<m:cn>2</m:cn>
		<m:cn>64</m:cn>
	      </m:apply>
	      <m:cn>8</m:cn>
	    </m:apply>
	  </m:apply>
	  <m:cn>4</m:cn>
	</m:apply>
      </m:math> 

      A graph of the distribution is shown in <cnxn target="figure2" strength="7"/>. It is clear that it is unlikely that the mean
      height for girls would be higher than the mean height for boys
      since in the population boys are quite a bit taller. Nonetheless
      it is not inconceivable that the girls' mean could be higher
      than the boys' mean. 
    </para>

    <figure id="figure2">
      <media type="image/gif" src="figure2.gif"/>
      <caption>
	Sampling distribution of the difference between mean heights.
      </caption>
    </figure>

    <para id="para10">
      A difference between means of 0 or higher is a difference of
      <m:math>
	<m:apply>
	  <m:eq/>
	  <m:apply>
	    <m:divide/>
	    <m:cn>10</m:cn>
	    <m:cn>4</m:cn>
	  </m:apply>
	  <m:cn>2.5</m:cn>
	</m:apply>
      </m:math> standard deviations above the mean of -10. The
      probability of a score 2.5 or more standard deviations above the
      mean is 0.0062. 
    </para>
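
    <para id="para10a">
      The height example can be verified the same way using the
      simplified formula for equal variances and equal sample sizes.
      This is an illustrative sketch, not part of the original
      module. The variable names are hypothetical, and treating the
      sampling distribution as normal is an assumption the example
      makes implicitly.
      <code type="block"><![CDATA[
```python
from math import sqrt
from statistics import NormalDist

variance = 64  # common population variance of heights
n = 8          # common sample size

mean_diff = 165 - 175            # girls minus boys
se_diff = sqrt(2 * variance / n)  # simplified standard error

# Number of standard errors by which 0 lies above the mean of -10.
z = (0 - mean_diff) / se_diff

# Probability that the girls' sample mean exceeds the boys' sample mean.
p_girls_taller = 1 - NormalDist(mu=mean_diff, sigma=se_diff).cdf(0)

print(se_diff)                   # 4.0
print(z)                         # 2.5
print(round(p_girls_taller, 4))  # 0.0062
```
]]></code>
    </para>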

    
  </content>
  
</document>
