<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/technology/cnxml/schema/dtd/0.5/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:bib="http://bibtexml.sf.net/" xmlns:m="http://www.w3.org/1998/Math/MathML" id="new">
  <name>Linear Regression and Correlation: The Regression Equation</name>
  <metadata>
  <md:version>1.4</md:version>
  <md:created>2008/06/23 14:44:47 GMT-5</md:created>
  <md:revised>2008/07/15 13:30:15.729 GMT-5</md:revised>
  <md:authorlist>
      <md:author id="billowsky">
      <md:firstname>Barbara</md:firstname>
      
      <md:surname>Illowsky</md:surname>
      <md:email>illowskybarbara@deanza.edu</md:email>
    </md:author>
      <md:author id="sdean">
      <md:firstname>Susan</md:firstname>
      
      <md:surname>Dean</md:surname>
      <md:email>deansusan@deanza.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist>
    <md:maintainer id="cnxorg">
      <md:firstname/>
      
      <md:surname>Connexions</md:surname>
      <md:email>cnx@cnx.org</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist>
    <md:keyword>elementary</md:keyword>
    <md:keyword>statistics</md:keyword>
  </md:keywordlist>

  <md:abstract>This module provides an overview of Linear Regression and Correlation: The Regression Equation as a part of Collaborative Statistics collection (col10522) by Barbara Illowsky and Susan Dean.</md:abstract>
</metadata>
  <content>
    <para id="delete_me">Data rarely fit a straight line exactly, so you usually must be satisfied with rough
predictions. Typically, you have a set of data whose scatter plot appears to <emphasis>"fit"</emphasis> a
straight line. The line that describes such data most closely is called a <emphasis>Line of Best Fit or Least Squares Line</emphasis>.</para><section id="element-748"><name>Optional Collaborative Classroom Activity</name>
<para id="element-900">
If you know a person's pinky (smallest) finger length, do you think you could predict that
person's height? Collect data from your class (pinky finger length and height, in inches). The
independent variable, <m:math><m:mi>x</m:mi></m:math>, is pinky finger length and the dependent variable, <m:math><m:mi>y</m:mi></m:math>, is height.
</para><para id="element-657">For each set of data, plot the points on graph paper. Make your graph big enough and
<emphasis>use a ruler</emphasis>. Then "by eye" draw a line that appears to "fit" the data. For your line, pick
two convenient points and use them to find the slope of the line. Find the y-intercept of
the line by extending your line so it crosses the y-axis. Using the slope and the
y-intercept, write your equation of "best fit". Do you think everyone will have the same
equation? Why or why not?</para><para id="element-598">Using your equation, what is the predicted height for a pinky length of 2.5 inches?</para></section><example id="element-22"><para id="element-998">
A random sample of 11 statistics students produced the following data
where <m:math><m:mi>x</m:mi></m:math> is the third exam score, out of 80, and <m:math><m:mi>y</m:mi></m:math> is the final exam score, out of 200.
Can you predict the final exam score of a random student if you know the third exam score?
</para>

<figure id="linrgs_regeq1"><subfigure>

<table id="element-50">
<tgroup cols="2"><thead>
  <row>
    <entry>x (third exam score)</entry>
    <entry>y (final exam score)</entry>
  </row>
</thead>
<tbody>
  <row>
    <entry>65</entry>
    <entry>175</entry>
  </row>
  <row>
    <entry>67</entry>
    <entry>133</entry>
  </row>
  <row>
    <entry>71</entry>
    <entry>185</entry>
  </row>
  <row>
    <entry>71</entry>
    <entry>163</entry>
  </row>
  <row>
    <entry>66</entry>
    <entry>126</entry>
  </row>
  <row>
    <entry>75</entry>
    <entry>198</entry>
  </row>
  <row>
    <entry>67</entry>
    <entry>153</entry>
  </row>
  <row>
    <entry>70</entry>
    <entry>163</entry>
  </row>
  <row>
    <entry>71</entry>
    <entry>159</entry>
  </row>
  <row>
    <entry>69</entry>
    <entry>151</entry>
  </row>
  <row>
    <entry>69</entry>
    <entry>159</entry>
  </row>
</tbody>

</tgroup>
</table>
<caption>Table showing the scores on the final exam based on scores from the third exam.</caption>
</subfigure>
<subfigure>
<media type="image/png" src="linrgs_regeq1.png">
<param name="alt" value="Scatterplot of exam scores with the third exam score on the x-axis and the final exam score on the y-axis."/>

<param name="print-width" value="3in"/>
</media>
<caption>Scatter plot showing the scores on the final exam based on scores from the third exam.</caption>
</subfigure></figure>

<para id="element-303">The third exam score, <m:math><m:mi>x</m:mi></m:math>, is the independent variable and the final exam score, <m:math><m:mi>y</m:mi></m:math>, is the
dependent variable. We will plot a regression line that best "fits" the data. If each of you
were to fit a line "by eye", you would draw different lines. We can use what is called a
<emphasis>least-squares regression line</emphasis> to obtain the best fit line.</para>
</example><para id="element-644">Consider the diagram shown below. Each data point has the form <m:math><m:mo>(</m:mo><m:mi>x</m:mi><m:mo>,</m:mo><m:mi>y</m:mi><m:mo>)</m:mo></m:math> and each point on
the line of best fit using least-squares linear regression has the form

<m:math>
<m:mo>(</m:mo>
<m:mi>x</m:mi>
<m:mo>,</m:mo>
<m:mover>
<m:mi>y</m:mi>
<m:mo>^</m:mo>
</m:mover>
<m:mo>)</m:mo>
</m:math>.
</para><para id="element-51">The <m:math><m:mover><m:mi>y</m:mi><m:mo>^</m:mo></m:mover></m:math> is read <emphasis>"y hat"</emphasis> and is the <emphasis>estimated value of <m:math><m:mi>y</m:mi></m:math></emphasis>. It is the value of <m:math><m:mi>y</m:mi></m:math> obtained using the
regression line. It is generally not equal to the <m:math><m:mi>y</m:mi></m:math> value observed in the data.</para><para id="element-530"><figure id="linrgs_regeq2"><media type="image/png" src="linrgs_regeq2.png">
<param name="alt" value="Scatterplot of the exam scores with a line of best fit tying in the relationship between the third exam and final exam scores. A specific point on the line, specific data point, and the distance between these two points are used in order to show an example of how to compute the sum of squared errors in order to find the points on the line of best fit."/>

<param name="print-width" value="5in"/>
</media></figure></para><para id="element-621">The term <m:math><m:mo>|</m:mo><m:msub><m:mi>y</m:mi><m:mn>0</m:mn></m:msub><m:mo>-</m:mo><m:msub><m:mover><m:mi>y</m:mi><m:mo>^</m:mo></m:mover><m:mn>0</m:mn></m:msub><m:mo>|</m:mo><m:mo>=</m:mo><m:msub><m:mi>ε</m:mi><m:mn>0</m:mn></m:msub></m:math> is called the <emphasis>"error" or residual</emphasis>. It is not an error in the
sense of a mistake, but measures the vertical distance between the actual value of <m:math><m:mi>y</m:mi></m:math> and the
estimated value of <m:math><m:mi>y</m:mi></m:math>.</para><para id="element-756"><m:math><m:mi>ε</m:mi></m:math> = the Greek letter <emphasis>epsilon</emphasis></para><para id="element-15">For each data point, you can calculate the residual <m:math><m:mo>|</m:mo><m:msub><m:mi>y</m:mi><m:mi>i</m:mi></m:msub><m:mo>-</m:mo><m:msub><m:mover><m:mi>y</m:mi><m:mo>^</m:mo></m:mover><m:mi>i</m:mi></m:msub><m:mo>|</m:mo><m:mo>=</m:mo><m:msub><m:mi>ε</m:mi><m:mi>i</m:mi></m:msub></m:math> for <m:math><m:mi>i</m:mi><m:mo>=</m:mo><m:mtext>1, 2, 3, ..., 11</m:mtext></m:math>.</para><para id="element-670">Each <m:math><m:mi>ε</m:mi></m:math> is a vertical distance.</para><para id="element-610">For the example about the third exam scores and the final exam scores for the 11
statistics students, there are 11 data points. Therefore, there are 11 <m:math><m:mi>ε</m:mi></m:math> values. If you
square each <m:math><m:mi>ε</m:mi></m:math> and add the results, you get

</para><para id="element-575"><m:math>
<m:mo>(</m:mo>
<m:msub>
<m:mi>ε</m:mi>
<m:mn>1</m:mn>
</m:msub>
<m:msup>
<m:mo>)</m:mo>
<m:mn>2</m:mn>
</m:msup>
<m:mo>+</m:mo>
<m:mo>(</m:mo>
<m:msub>
<m:mi>ε</m:mi>
<m:mn>2</m:mn>
</m:msub>
<m:msup>
<m:mo>)</m:mo>
<m:mn>2</m:mn>
</m:msup>
<m:mo>+</m:mo>
<m:mtext>...</m:mtext>
<m:mo>+</m:mo>
<m:mo>(</m:mo>
<m:msub>
<m:mi>ε</m:mi>
<m:mn>11</m:mn>
</m:msub>
<m:msup>
<m:mo>)</m:mo>
<m:mn>2</m:mn>
</m:msup>
<m:mo>=</m:mo>
<m:munderover>
<m:mi>Σ</m:mi>
<m:mrow><m:mi>i</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow>
<m:mn>11</m:mn>
</m:munderover>
<m:msubsup>
<m:mi>ε</m:mi>
<m:mi>i</m:mi>
<m:mn>2</m:mn>
</m:msubsup>
</m:math></para><para id="element-215">This is called the <emphasis>Sum of Squared Errors (SSE)</emphasis>.</para>
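<para id="element-sse-sketch">As a minimal sketch of this calculation in Python, the code below computes the SSE of the 11 exam-score pairs for two candidate lines. The function name sum_squared_errors and both candidate lines are hypothetical choices made only for illustration, such as two students might draw "by eye". The line with the smaller SSE follows the data more closely.</para>
<code type="block" id="element-sse-sketch-code"><![CDATA[
```python
# Third exam scores (x) and final exam scores (y) from the example table.
x_values = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y_values = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def sum_squared_errors(xs, ys, a, b):
    """Return the SSE of the line y-hat = a + b*x for the given data."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = a + b * x        # estimated value of y on the line
        residual = y - y_hat     # squaring makes the sign irrelevant
        total += residual ** 2
    return total


# Two hypothetical "by eye" lines, chosen only for illustration.
sse_line_1 = sum_squared_errors(x_values, y_values, a=-150.0, b=4.5)
sse_line_2 = sum_squared_errors(x_values, y_values, a=-100.0, b=3.8)

print("SSE for y-hat = -150 + 4.5x:", round(sse_line_1, 2))
print("SSE for y-hat = -100 + 3.8x:", round(sse_line_2, 2))
```
]]></code>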

<para id="element-640">Using calculus, you can find the slope and y-intercept that make the <emphasis>SSE</emphasis> a
minimum. The line with that slope and y-intercept is the line of best fit. It turns out that
the line of best fit has the equation:
</para>

<equation id="element-710"><m:math>
<m:mover>
<m:mi>y</m:mi>
<m:mo>^</m:mo>
</m:mover>
<m:mo>=</m:mo>
<m:mi>a</m:mi>
<m:mo>+</m:mo>
<m:mi>b</m:mi>
<m:mi>x</m:mi>
</m:math>
</equation>

<para id="element-716">where <m:math><m:mi>a</m:mi><m:mo>=</m:mo><m:mover><m:mi>y</m:mi><m:mi>¯</m:mi></m:mover><m:mo>-</m:mo><m:mi>b</m:mi><m:mo>⋅</m:mo><m:mover><m:mi>x</m:mi><m:mi>¯</m:mi></m:mover></m:math>
and <m:math><m:mi>b</m:mi><m:mo>=</m:mo><m:mfrac><m:mrow><m:mi>Σ</m:mi><m:mo>(</m:mo><m:mi>x</m:mi><m:mo>-</m:mo><m:mover><m:mi>x</m:mi><m:mi>¯</m:mi></m:mover><m:mo>)</m:mo><m:mo>⋅</m:mo><m:mo>(</m:mo><m:mi>y</m:mi><m:mo>-</m:mo><m:mover><m:mi>y</m:mi><m:mi>¯</m:mi></m:mover><m:mo>)</m:mo></m:mrow>
<m:mrow><m:mi>Σ</m:mi><m:mo>(</m:mo><m:mi>x</m:mi><m:mo>-</m:mo><m:mover><m:mi>x</m:mi><m:mi>¯</m:mi></m:mover><m:msup><m:mo>)</m:mo><m:mn>2</m:mn></m:msup></m:mrow></m:mfrac></m:math>.</para>

<para id="element-153"><m:math><m:mover><m:mi>x</m:mi><m:mi>¯</m:mi></m:mover></m:math> and <m:math><m:mover><m:mi>y</m:mi><m:mi>¯</m:mi></m:mover></m:math> are the averages of the <m:math><m:mi>x</m:mi></m:math> values and the <m:math><m:mi>y</m:mi></m:math> values, respectively. The best fit line always passes through the point
<m:math><m:mo>(</m:mo><m:mover><m:mi>x</m:mi><m:mi>¯</m:mi></m:mover><m:mo>,</m:mo><m:mover><m:mi>y</m:mi><m:mi>¯</m:mi></m:mover><m:mo>)</m:mo></m:math>.</para>
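<para id="element-slope-sketch">The formulas for <m:math><m:mi>a</m:mi></m:math> and <m:math><m:mi>b</m:mi></m:math> can be checked directly on the exam-score data. The following Python sketch (the variable names are illustrative assumptions, not part of the original module) computes the slope and the y-intercept and then compares the line's estimate at <m:math><m:mover><m:mi>x</m:mi><m:mi>¯</m:mi></m:mover></m:math> with <m:math><m:mover><m:mi>y</m:mi><m:mi>¯</m:mi></m:mover></m:math>.</para>
<code type="block" id="element-slope-sketch-code"><![CDATA[
```python
# Third exam scores (x) and final exam scores (y) from the example table.
x_values = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y_values = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x_values)
x_bar = sum(x_values) / n   # average of the x values
y_bar = sum(y_values) / n   # average of the y values

# Slope: sum of (x - x_bar)(y - y_bar) divided by sum of (x - x_bar)^2.
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
denominator = sum((x - x_bar) ** 2 for x in x_values)
b = numerator / denominator

# Intercept: a = y_bar - b * x_bar.
a = y_bar - b * x_bar

print("b =", round(b, 2))
print("a =", round(a, 2))

# The line's estimate at x_bar should equal y_bar (up to rounding error).
print("y-hat at x_bar:", a + b * x_bar, " y_bar:", y_bar)
```
]]></code>
<para id="element-slope-sketch-note">Rounded to two decimal places, the sketch gives a slope of 4.83 and a y-intercept of -173.51, which agree with the best fit equation stated for this example later in the module.</para>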

<para id="element-414">The slope <m:math><m:mi>b</m:mi></m:math> can be written as 
<m:math><m:mi>b</m:mi><m:mo>=</m:mo><m:mi>r</m:mi><m:mo>⋅</m:mo><m:mo>(</m:mo>
<m:mfrac><m:mrow>
<m:msub><m:mi>s</m:mi><m:mi>y</m:mi></m:msub></m:mrow>
<m:mrow>
<m:msub><m:mi>s</m:mi><m:mi>x</m:mi></m:msub></m:mrow></m:mfrac><m:mo>)</m:mo></m:math> where <m:math><m:msub><m:mi>s</m:mi><m:mi>y</m:mi></m:msub></m:math>
= the standard deviation of the
<m:math><m:mi>y</m:mi></m:math> values and <m:math><m:msub><m:mi>s</m:mi><m:mi>x</m:mi></m:msub></m:math> = the standard deviation of the <m:math><m:mi>x</m:mi></m:math> values. <m:math><m:mi>r</m:mi></m:math> is the correlation
coefficient, which is discussed in the next section.</para><note>Many calculators and any linear regression and correlation computer program can
calculate the best fit line. The calculations tend to be tedious if done by hand. <emphasis>In the
Collaborative Statistics Workbook, there are instructions for calculating the best fit
line.</emphasis></note><para id="element-27">The graph of the line of best fit for the third exam/final exam example is shown below:</para><figure id="linrgs_regeq3"><media type="image/png" src="linrgs_regeq3.png">
<param name="alt" value="Scatterplot of the third exam scores by final exam scores and its line of best fit."/>

<param name="print-width" value="4in"/>
</media></figure><para id="element-689">Remember, the best fit line is called the <emphasis>least-squares regression line</emphasis> (sometimes abbreviated <emphasis>LSL</emphasis>, for least squares line). The best fit line for the third exam/final exam example has the equation:
</para><equation id="element-643"><m:math>
<m:mover>
<m:mi>y</m:mi>
<m:mo>^</m:mo>
</m:mover>
<m:mo>=</m:mo>
<m:mo>-</m:mo>
<m:mn>173.51</m:mn>
<m:mo>+</m:mo>
<m:mn>4.83</m:mn>
<m:mi>x</m:mi>
<m:mspace width="20pt"/>
</m:math></equation><para id="element-39">The idea behind finding the best fit line is based on the assumption that the data are
actually scattered about a straight line. Remember, it is always important to plot a
scatter diagram first (which many calculators and computer programs can do) to see if it
is worth calculating the line of best fit.</para><note>If the scatter plot indicates that there is a linear relationship between
the variables, then it is reasonable to use a best fit line to make predictions for
<m:math><m:mi>y</m:mi></m:math> given <m:math><m:mi>x</m:mi></m:math> within the domain of x-values in the sample data, <emphasis>but not necessarily
for x-values outside that domain.</emphasis></note>   
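<para id="element-predict-sketch">A minimal Python sketch of this caution, using the best fit line for the third exam/final exam example, follows. The function name predict_final_score and the decision to raise an error for out-of-range inputs are assumptions made for illustration; the limits 65 and 75 are the smallest and largest third exam scores in the sample data.</para>
<code type="block" id="element-predict-sketch-code"><![CDATA[
```python
# Best fit line from the third exam/final exam example: y-hat = -173.51 + 4.83x
A = -173.51
B = 4.83

# Smallest and largest third exam scores in the sample data.
X_MIN = 65
X_MAX = 75


def predict_final_score(third_exam_score):
    """Estimate the final exam score, but only inside the sampled x range."""
    if not X_MIN <= third_exam_score <= X_MAX:
        raise ValueError(
            "x is outside the sample data (65 to 75); "
            "the line may not apply there"
        )
    return A + B * third_exam_score


# A third exam score of 73 lies inside the sampled range.
print(round(predict_final_score(73), 2))

# A score of 90 lies outside the sampled range, so no prediction is made.
try:
    predict_final_score(90)
except ValueError as error:
    print("No prediction:", error)
```
]]></code>
<para id="element-predict-sketch-note">For a third exam score of 73, the sketch estimates a final exam score of about 179.08. Refusing to predict outside the sampled range is one possible design choice; a program could instead return the estimate with a warning.</para>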
  </content>
  
</document>
