<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/cnxml/0.5/DTD/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:bib="http://bibtexml.sf.net/" id="m11171">

  <name>Base Rates</name>

  <metadata>
  <md:version>2.3</md:version>
  <md:created>2003/05/16</md:created>
  <md:revised>2003/07/14 14:28:19.758 GMT-5</md:revised>
  <md:authorlist>
    <md:author id="dmlane">
      <md:firstname>David</md:firstname>
      
      <md:surname>Lane</md:surname>
      <md:email>lane@rice.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist>
    <md:maintainer id="dmlane">
      <md:firstname>David</md:firstname>
      
      <md:surname>Lane</md:surname>
      <md:email>lane@rice.edu</md:email>
    </md:maintainer>
    <md:maintainer id="mjeanes">
      <md:firstname>Matthew</md:firstname>
      
      <md:surname>Jeanes</md:surname>
      <md:email>mjeanes@rice.edu</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist>
    <md:keyword>base</md:keyword>
    <md:keyword>rate</md:keyword>
    <md:keyword>misses</md:keyword>
    <md:keyword>false</md:keyword>
    <md:keyword>positive</md:keyword>
    <md:keyword>bayes</md:keyword>
    <md:keyword>prior</md:keyword>
    <md:keyword>posterior</md:keyword>
    <md:keyword>probability</md:keyword>
  </md:keywordlist>

  <md:abstract>Finding the true probability of an event taking into account misses, false positives, and base rates.  Also, using Bayes' Theorem.</md:abstract>
</metadata>



  <content>
    <para id="intro">
      Suppose that at your regular physical exam you test positive for
      Disease X. Although Disease X has only mild symptoms, you are
      concerned and ask your doctor about the accuracy of the test. It
      turns out that the test is 95% accurate. It would appear that the
      probability that you have Disease X is therefore 0.95. However,
      the situation is not that simple.  
    </para>

    <para id="para2">
      For one thing, more information about the accuracy of the test
      is needed because there are two kinds of errors the test can make:
      <term>misses</term> and <term>false positives</term>.
      
      If you actually had Disease X and the test
      failed to detect it, that would be a
      <emphasis>miss</emphasis>. If you did not have Disease X and the
      test indicated you did, that would be a <emphasis>false
      positive</emphasis>. The miss and false positive rates are not
      necessarily the same.
    </para>
    <example id="missfp">
      <para id="expara">
	Let's say that the test accurately indicates the disease in
	99% of the people who have it and accurately indicates no
	disease in 91% of the people who do not have it. This would mean
	that the test has a miss rate of 0.01 and a false positive rate
	of 0.09. This might lead you to revise your judgment and
	conclude that your chance of having the disease is 0.91 rather
	than 0.95. This would be roughly right (about 0.92) if half the
	people in your situation (people who show up for a regular
	physical exam) had Disease X.
      </para>
    </example>

    <para id="baserate">
      The analysis becomes complicated if more or less than half the
      people in your situation have Disease X. The proportion of the
      people having the disease is called the <term>base rate</term>.  
      
      It is very important to consider the base rate when classifying
      people. As the saying goes, "if you hear hoofs, think horse,
      not zebra," since you are more likely to encounter a horse
      than a zebra (at least in most places).
    </para>

    <para id="diseaseX">
      Assume that Disease X is a rare disease, and only 2% of people
      in your situation have it. How does that affect the probability
      that you have it? Or, more generally, what is the probability
      that someone who tests positive actually has the disease? Let's
      consider what would happen if one million people were
      tested. Out of these one million people, 2% or 20,000 people
      would have the disease. Of these 20,000 with the disease, the
      test would accurately detect it in 99% of them. This means that
      19,800 cases would be accurately identified. Now let's consider
      the 98% of the one million people (980,000) who do not have the
      disease. Since the false positive rate is 0.09, 9% of these
      980,000 people will test positive for the disease. This is a
      total of 88,200 people incorrectly diagnosed.
    </para>

    <para id="sumup">
      To sum up, 19,800 people who tested positive would actually have
      the disease and 88,200 people who tested positive would not have
      the disease. This means that of all those who tested positive,
      only
	<m:math display="block">
	  <m:apply>
	    <m:eq/>
	    <m:apply>
	      <m:divide/>
	      <m:cn>19800</m:cn>
	      <m:apply>
		<m:plus/>
		<m:cn>19800</m:cn>
		<m:cn>88200</m:cn>
	      </m:apply>
	    </m:apply>
	    <m:cn>0.1833</m:cn>
	  </m:apply>
	</m:math>
      of them would actually have the disease. So the probability that
      you have the disease is not 0.95, or 0.91, but only 0.1833.
    </para>
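    <para id="sketch-counts">
      The counting argument above can be checked with a short Python
      sketch. It is only an illustration: the function name
      <code type="inline">positive_counts</code> and its parameter
      names are hypothetical and do not come from this module, and
      exact fractions are used simply to keep the intermediate counts
      free of rounding.
      <code type="block">

```python
# Natural-frequency sketch of the Disease X example.
# All names here are illustrative; only the rates come from the text.
from fractions import Fraction


def positive_counts(population, base_rate, hit_rate, false_positive_rate):
    """Return (true positives, false positives) among those tested."""
    with_disease = population * base_rate
    without_disease = population - with_disease
    true_positives = with_disease * hit_rate
    false_positives = without_disease * false_positive_rate
    return true_positives, false_positives


true_pos, false_pos = positive_counts(
    population=1_000_000,
    base_rate=Fraction(2, 100),            # 2% have Disease X
    hit_rate=Fraction(99, 100),            # detected in 99% of true cases
    false_positive_rate=Fraction(9, 100),  # 9% of healthy people test positive
)

# Proportion of positive tests that belong to people with the disease.
share_with_disease = true_pos / (true_pos + false_pos)
print(int(true_pos), int(false_pos), round(float(share_with_disease), 4))
```

      </code>
      With the rates used in this module, the sketch reproduces the
      19,800 and 88,200 counts and the proportion 0.1833.
    </para>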

    <para id="results">
      These results are summarized in <cnxn target="diagnose"/>. The numbers of people
      diagnosed with the disease are shown in
      <emphasis>italics</emphasis>. Of the one million people tested,
      the test was correct for 891,800 of those without the disease
      and for 19,800 of those with the disease; that is, the test was
      correct about 91% of the time. However, if you look only at the
      people testing positive
      (shown in <emphasis>italics</emphasis>), only 19,800 (0.1833) of
      the 108,000 testing positive actually have the disease.
    </para>
    
    <table frame="all" id="diagnose">
      <name>Table 1.  Diagnosing Disease X</name>
      <tgroup cols="4" align="left" colsep="1" rowsep="1">
	<colspec colnum="1" colname="c1"/>
	<colspec colnum="2" colname="c2"/>
	<colspec colnum="3" colname="c3"/>
	<colspec colnum="4" colname="c4"/>
	<thead valign="top">
	  <row>
	    <entry namest="c1" nameend="c4" align="center">
	      True Condition
	    </entry>
	  </row>
	</thead>
	<tbody valign="top">
	  <row>
	    <entry namest="c1" nameend="c2" align="center">
	      No Disease -
	      980,000
	    </entry>
	    <entry namest="c3" nameend="c4" align="center">
	      Disease -
	      20,000
	    </entry>
	  </row>
	  <row>
	    <entry namest="c1" nameend="c2" align="center">
	      <m:math>
		<m:mtext mathvariant="bold">
		  Test Results
		</m:mtext>
	      </m:math>
	    </entry>
	    <entry namest="c3" nameend="c4" align="center">
	      <m:math>
		<m:mtext mathvariant="bold">
		  Test Results
		</m:mtext>
	      </m:math>
	    </entry>
	  </row>
	  <row>
	    <entry>
	      <emphasis>
		Positive - 88,200
	      </emphasis>
	    </entry>
	    <entry>
	      Negative - 891,800
	    </entry>
	    <entry>
	      <emphasis>
		Positive - 19,800
	      </emphasis>
	    </entry>
	    <entry>
	      Negative - 200
	    </entry>
	  </row>
	</tbody>
      </tgroup>
    </table>
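    <para id="sketch-table">
      As a rough check on the table, the four cells can be rebuilt
      with integer arithmetic. The sketch below assumes the same
      rates as the text; the dictionary layout and variable names are
      illustrative choices, not part of the original module. It also
      makes the contrast explicit between how often the test is
      correct overall and how often a positive result is correct.
      <code type="block">

```python
# Rebuild the four cells of the table with exact integer arithmetic.
# Variable names and the dictionary layout are illustrative only.
population = 1_000_000
with_disease = population * 2 // 100          # 2% base rate
without_disease = population - with_disease

table = {
    ("no disease", "positive"): without_disease * 9 // 100,   # false positives
    ("no disease", "negative"): without_disease * 91 // 100,  # correct negatives
    ("disease", "positive"): with_disease * 99 // 100,        # correct positives
    ("disease", "negative"): with_disease * 1 // 100,         # misses
}

correct = table[("no disease", "negative")] + table[("disease", "positive")]
positives = table[("no disease", "positive")] + table[("disease", "positive")]

# Overall accuracy versus the share of positive results that are correct.
overall_accuracy = correct / population
share_of_positives_with_disease = table[("disease", "positive")] / positives

print(table)
print(round(overall_accuracy, 4), round(share_of_positives_with_disease, 4))
```

      </code>
      The two printed proportions differ sharply because most of the
      people tested do not have the disease, so even a modest false
      positive rate produces many positive results.
    </para>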
    
    <section id="bayes">
      <name>Bayes' Theorem</name>
      <para id="bt1">
	This same result can be obtained using <term>Bayes'
	theorem</term>. Bayes' theorem considers both the prior
	probability of an event and the diagnostic value of a test to
	determine the posterior probability of the event.
	
	For the
	current example, the event is that you have Disease X. Let's
	call this Event
	<m:math>
	  <m:ci>D</m:ci>
	</m:math>.  
	Since only 2% of people in your situation have
	Disease X, the prior probability of Event 
	<m:math>
	    <m:ci>D</m:ci>
	</m:math>
	is 0.02. Or, more formally,
	<m:math>
	  <m:apply>
	    <m:eq/>
	    <m:apply>
	      <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		<m:ci>D</m:ci>
	    </m:apply>
	    <m:cn>0.02</m:cn>
	  </m:apply>
	</m:math>.

	If 
	<m:math>
	  <m:apply>
	    <m:diff/>
	      <m:ci>D</m:ci>
	  </m:apply>
	</m:math>
	denotes the complementary event, in which Event 
	<m:math>
	  <m:ci>D</m:ci>
	</m:math> 
	is false, then 
	<m:math>
	  <m:apply>
	    <m:eq/>
	    <m:apply>
	      <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
	      <m:apply>
		<m:diff/>
		<m:ci>D</m:ci>
	      </m:apply>
	      
	    </m:apply>
	    <m:apply>
	      <m:minus/>
	      <m:cn>1</m:cn>
	      <m:apply>
		<m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		<m:ci>D</m:ci>
	      </m:apply>
	    </m:apply>
	    <m:cn>0.98</m:cn>
	  </m:apply>
	</m:math>.
      </para>
      
      <para id="bt2">
	To define the diagnostic value of the test, we need to define
	another event: that you test positive for Disease X. Let's
	call this event 
	<m:math>
	  <m:ci>T</m:ci> 
	</m:math>.  
	The diagnostic value of the test
	depends on the probability you will test positive given that
	you actually have the disease, written as
	<m:math>
	  <m:apply>
	    <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
	    <m:condition>
	      <m:ci>D</m:ci>
	    </m:condition>
	    <m:ci>T</m:ci>
	  </m:apply>
	</m:math>,
	and the probability you test positive given that you do not
	have the disease, written as 
	<m:math>
	  <m:apply>
	    <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
	    <m:condition>
	      <m:apply>
		<m:diff/>
		<m:ci>D</m:ci>
	      </m:apply>
	    </m:condition>
	    <m:ci>T</m:ci>
	  </m:apply>
	</m:math>.  
	Bayes' theorem, shown below, allows you to calculate
	<m:math>
	  <m:apply>
	    <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
	    <m:condition>
	      <m:ci>T</m:ci>
	    </m:condition>
	    <m:ci>D</m:ci>
	  </m:apply>
	</m:math>,
	the probability that you have the disease given that you
	test positive for it.
      </para>

      <rule id="bt" type="theorem"> 
	  <name>Bayes' Theorem</name>
	  <statement>
	  
	  <equation id="bayeseq">
	    <m:math>
	      <m:apply>
		<m:eq/>
		<m:apply>
		  <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		  <m:condition>
		    <m:ci>T</m:ci>
		  </m:condition>
		  <m:ci>D</m:ci>
		</m:apply>
		<m:apply>
		  <m:divide/>
		  <m:apply>
		    <m:times/>
		    <m:apply>
		      <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		      <m:condition>
			<m:ci>D</m:ci>
		      </m:condition>
		      <m:ci>T</m:ci>
		    </m:apply>
		    <m:apply>
		      <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		      <m:ci>D</m:ci>
		    </m:apply>
		  </m:apply>
		  <m:apply>
		    <m:plus/>
		    <m:apply>
		      <m:times/>
		      <m:apply>
			<m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
			<m:condition>
			  <m:ci>D</m:ci>
			</m:condition>
			<m:ci>T</m:ci>
		      </m:apply>
		      <m:apply>
			<m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
			<m:ci>D</m:ci>
		      </m:apply>
		    </m:apply>
		    <m:apply>
		      <m:times/>
		      <m:apply>
			<m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
			<m:condition>
			  <m:apply>
			    <m:diff/>
			    <m:ci>D</m:ci>
			  </m:apply>
			</m:condition>
			<m:ci>T</m:ci>
		      </m:apply>
		      <m:apply>
			<m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
			<m:apply>
			  <m:diff/>
			  <m:ci>D</m:ci>
			</m:apply>
		      </m:apply>
		    </m:apply>
		  </m:apply>
		</m:apply>
	      </m:apply>
	    </m:math>
	  </equation>
	  
	  </statement>
	 
      </rule>

      <para id="list">
	<list id="nums">
	  <name>The various terms are:</name>
	  <item>
	    <m:math>
	      <m:apply>
		<m:eq/>
		<m:apply>
		  <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		  <m:condition>
		    <m:ci>D</m:ci>
		  </m:condition>
		  <m:ci>T</m:ci>
		</m:apply>
		<m:cn>0.99</m:cn>
	      </m:apply>
	    </m:math>
	  </item>
	  <item>
	    <m:math>
	      <m:apply>
		<m:eq/>
		<m:apply>
		  <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		  <m:condition>
		    <m:apply>
		      <m:diff/>
		      <m:ci>D</m:ci>
		    </m:apply>
		  </m:condition>
		  <m:ci>T</m:ci>
		</m:apply>
		<m:cn>0.09</m:cn>
	      </m:apply>
	    </m:math>
	  </item>
	  <item>
	    <m:math>
	      <m:apply>
		<m:eq/>
		<m:apply>
		  <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		  <m:ci>D</m:ci>
		</m:apply>
		<m:cn>0.02</m:cn>
	      </m:apply>
	      
	    </m:math>
	  </item>
	  <item>
	    <m:math>
	      <m:apply>
		<m:eq/>
		
		<m:apply>
		  <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
		  <m:apply>
		    <m:diff/>
		    <m:ci>D</m:ci>
		  </m:apply>
		</m:apply>
		<m:cn>0.98</m:cn>
	      </m:apply>
	    </m:math>
	  </item>
	</list>
      </para>
      
      <para id="therefore">
	Therefore,
	<m:math display="block">
	  <m:apply>
	    <m:eq/>
	    <m:apply>
	      <m:csymbol definitionURL="http://cnx.rice.edu/cd/cnxmath.ocd#probability"/>
	      <m:condition>
		<m:ci>T</m:ci>
	      </m:condition>
	      <m:ci>D</m:ci>
	    </m:apply>
	    <m:apply>
	      <m:divide/>
	      <m:apply>
		<m:times/>
		<m:cn>0.99</m:cn>
		<m:cn>0.02</m:cn>
	      </m:apply>
	      <m:apply>
		<m:plus/>
		<m:apply>
		  <m:times/>
		  <m:cn>0.99</m:cn>
		  <m:cn>0.02</m:cn>
		</m:apply>
		<m:apply>
		  <m:times/>
		  <m:cn>0.09</m:cn>
		  <m:cn>0.98</m:cn>
		</m:apply>
	      </m:apply>
	    </m:apply>
	    <m:cn>0.1833</m:cn>
	  </m:apply>
	</m:math>
	
	which is the same value computed previously.
      </para>
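      <para id="sketch-bayes">
	The same calculation can be written as a small Python
	function. This is a minimal sketch, not a definitive
	implementation: the name <code type="inline">posterior</code>
	and its parameter names are hypothetical, and the base rates
	0.001 and 0.5 in the loop are illustrative values chosen here
	to show how strongly the answer depends on the prior
	probability.
	<code type="block">

```python
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """P(D | T) from Bayes' theorem for a positive test result."""
    numerator = p_pos_given_d * prior
    denominator = numerator + p_pos_given_not_d * (1 - prior)
    return numerator / denominator


# Values from the text: P(T|D) = 0.99, P(T|D') = 0.09, P(D) = 0.02.
print(round(posterior(0.02, 0.99, 0.09), 4))

# Same test, different base rates (0.001 and 0.5 are illustrative).
for base_rate in (0.001, 0.02, 0.5):
    print(base_rate, round(posterior(base_rate, 0.99, 0.09), 4))
```

	</code>
	With a prior of 0.02 the function returns about 0.1833; with
	a prior of 0.5 it returns about 0.9167, which is the case in
	which the accuracy figures alone give roughly the right
	answer.
      </para>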
	  
    </section>
  </content>
  <glossary id="glossary1">
    <definition id="missesdef">
      <term>Misses</term>
      <meaning>
	Occur when a diagnostic test returns a negative result, but
	the true state of the subject is positive. For example, if a
	person has strep throat and the diagnostic test
	fails to indicate it, then a miss has occurred. The concept
	is similar to a Type II error in significance testing.
      </meaning>  
    </definition>
    <definition id="fpdef">
      <term>False Positive</term>
	<meaning>
	Occurs when a diagnostic procedure returns a positive result
	while the true state of the subject is negative. For example, if
	a test for strep says the patient has strep when in fact he or
	she does not, then the error in diagnosis would be called a
	false positive. In some contexts, a false positive is called a
	false alarm. The concept is similar to a Type I error in
	significance testing.
      </meaning>
    </definition>
    <definition id="baseratedef">
      <term>Base Rate</term> 
	<meaning>
	The true proportion of a population
	having some condition, attribute or disease. For example,
	the proportion of people with schizophrenia is about
	0.01.
      </meaning>
    </definition>
    <definition id="priorpdef">
      <term>Prior Probability</term>
      <meaning>
	The prior probability of an event is the probability of
	the event computed before the collection of new data. One
	begins with a prior probability of an event and revises it
	in the light of new data. For example, if 0.01 of a
	population has schizophrenia then the probability that a
	person drawn at random would have schizophrenia is
	0.01. This is the prior probability. If you then learn
	that this person's score on a personality test suggests that
	he or she is schizophrenic, you would adjust your probability
	accordingly. The adjusted probability is the posterior
	probability.
      </meaning>
    </definition>
    <definition id="postpdef">
      <term>Posterior Probability</term>
      <meaning>
	The posterior probability of an event is the probability
	of the event computed following the collection of new
	data. One begins with a prior probability of an event and
	revises it in the light of new data. For example, if 0.01
	of a population has schizophrenia then the probability
	that a person drawn at random would have schizophrenia is
	0.01. This is the prior probability. If you then learn
	that this person's score on a personality test suggests that
	he or she is schizophrenic, you would adjust your probability
	accordingly. The adjusted probability is the posterior
	probability.
      </meaning>
    </definition>
  </glossary>

      
</document>
