<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE document PUBLIC "-//CNX//DTD CNXML 0.5 plus MathML//EN" "http://cnx.rice.edu/technology/cnxml/schema/dtd/0.5/cnxml_mathml.dtd">
<document xmlns="http://cnx.rice.edu/cnxml" xmlns:md="http://cnx.rice.edu/mdml/0.4" xmlns:bib="http://bibtexml.sf.net/" xmlns:m="http://www.w3.org/1998/Math/MathML" id="e10180501">
<name>Riemann Area Optimization and Histogram Construction</name>
<metadata>
  <md:version>1.4</md:version>
  <md:created>2005/10/18 20:05:28 GMT-5</md:created>
  <md:revised>2005/11/30 19:01:59.062 US/Central</md:revised>
  <md:authorlist>
      <md:author id="lfanders">
      <md:firstname>Leif</md:firstname>
      <md:othername>Faure</md:othername>
      <md:surname>Anderson</md:surname>
      <md:email>lfanders@mail.uh.edu</md:email>
    </md:author>
  </md:authorlist>

  <md:maintainerlist>
    <md:maintainer id="lfanders">
      <md:firstname>Leif</md:firstname>
      <md:othername>Faure</md:othername>
      <md:surname>Anderson</md:surname>
      <md:email>lfanders@mail.uh.edu</md:email>
    </md:maintainer>
  </md:maintainerlist>
  
  <md:keywordlist>
    <md:keyword>Area</md:keyword>
    <md:keyword>Frequency Distribution</md:keyword>
    <md:keyword>Histogram</md:keyword>
    <md:keyword>Optimization</md:keyword>
    <md:keyword>Riemann</md:keyword>
    <md:keyword>Statistical Analysis</md:keyword>
  </md:keywordlist>

  <md:abstract>The purpose of this module is to assist with histogram design. The module presents an integral, in terms of the Riemann sum equation, for finding the area of a data-set. It builds on an associative technique (from a discrete to a continuous function) suggested by the sub-interval in the delta-x and the bin number in the histogram. The resulting area spectrum contains the possible areas and the actual area for a data-set. The module concludes with statements on the possible optimization of the integral by calculating the actual area. Building a histogram in this way may imply compression qualities for variance in the data-set frequency distribution.</md:abstract>
</metadata>
<content>
<section id="id5427406">
<name>Graphical Representation of a Function and Area
Spectrum</name>
<para id="id3771030">The limits of a definite integral define the
boundaries of an area. As the width of the sub-interval
decreases, the number of sub-intervals, or ‘bins’, increases and the
resulting Riemann area approaches the actual area.</para>
<para id="id7328584">For a data-set described by the function
f(x) = x, where x takes values from 0 to 1 in increments of 0.1
without replication, the Riemann area approaches the actual area
of 0.5 as the number of bins increases:</para>
<table id="id5928324">
<tgroup cols="3">
<thead>
<row>
<entry>Bins</entry>
<entry>∆x</entry>
<entry>Riemann area</entry>
</row>
</thead>
<tbody>
<row>
<entry>1</entry>
<entry>1</entry>
<entry>1</entry>
</row>
<row>
<entry>2</entry>
<entry>0.5</entry>
<entry>0.75</entry>
</row>
<row>
<entry>5</entry>
<entry>0.2</entry>
<entry>0.6</entry>
</row>
<row>
<entry>10</entry>
<entry>0.1</entry>
<entry>0.55</entry>
</row>
</tbody>
</tgroup>
</table>
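<para id="riemann-sketch-table">The values in the table can be checked with a short sketch. It assumes a right-endpoint Riemann sum, which is an inference from the table values rather than a rule stated in this module, and the function names are hypothetical. Exact fractions are used so that rounding does not obscure the comparison.</para>
<code type="block" id="riemann-sketch-table-code"><![CDATA[
```python
from fractions import Fraction


def right_riemann_area(f, a, b, bins):
    """Right-endpoint Riemann sum of f over [a, b] with `bins` equal sub-intervals.

    Assumes a and b are integers or Fractions so the arithmetic stays exact.
    """
    dx = Fraction(b - a, bins)  # width of each sub-interval
    return sum(f(a + i * dx) * dx for i in range(1, bins + 1))


def identity(x):
    """The example function f(x) = x."""
    return x


for bins in (1, 2, 5, 10):
    dx = Fraction(1, bins)
    area = right_riemann_area(identity, 0, 1, bins)
    print(bins, float(dx), float(area))
# 1 1.0 1.0
# 2 0.5 0.75
# 5 0.2 0.6
# 10 0.1 0.55
```
]]></code>
<para id="riemann-sketch-table-note">Under this assumption the area for k bins is (k + 1) / (2k), which tends to 0.5 as k grows.</para>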
<para id="id7351239">Two histograms that result from the
sub-intervals in this example (Figure 1) illustrate a property of
the line from 0 to 1: the area under the function is the area of a
right triangle, one half of the base times the height. The
histogram with 1 bin portrays the idea clearly because it maps the
base and height.</para>
<figure id="element-87">
<media type="image/jpg" src="NewRiemannAreaFig1.jpg"/>
<caption>Increasing the number of sub-intervals brings the Riemann area closer to the actual area under the function associated with a data-set.</caption></figure>
<para id="id7351280">Infinitely many sub-intervals would bring the Riemann area closest to the actual area, but the result is an inappropriate
histogram for analysis. Such a construction, where the number of bins is equal to or greater than n, the number of observations, can be thought of as the design of a 'micro'-histogram. Much statistical inference, however, is made through the alternative, with fewer bins than observations, and the research underlying this module is limited to that case.</para>
<para id="id1114056">In the example, the rightmost graph, with ten sub-intervals,
corresponds to a histogram with an equal number of bins and data
points. This shows that the width of the sub-interval is tied to the number of bins and in turn affects the frequency of data points within each bin. Departing from the example, it is assumed that a data-set with a nonlinear frequency distribution has an unknown area.</para>
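<para id="riemann-sketch-bins">The effect of sub-interval width on bin frequency can be sketched as follows. The sketch assumes ten data points, 0.1 through 1.0, and bins that are closed on the right; both are assumptions made for illustration, and the function name is hypothetical.</para>
<code type="block" id="riemann-sketch-bins-code"><![CDATA[
```python
import math
from fractions import Fraction


def bin_counts(values, low, high, bins):
    """Count values per equal-width bin on (low, high], each bin closed on the right.

    A value equal to `low` is placed in the first bin.
    """
    counts = [0] * bins
    width = Fraction(high - low, bins)  # sub-interval width
    for v in values:
        index = max(math.ceil((v - low) / width) - 1, 0)
        counts[index] += 1
    return counts


# Assumed data-set: 0.1, 0.2, ..., 1.0 held as exact fractions.
data = [Fraction(k, 10) for k in range(1, 11)]

for bins in (1, 2, 5, 10):
    print(bins, bin_counts(data, 0, 1, bins))
# 1 [10]
# 2 [5, 5]
# 5 [2, 2, 2, 2, 2]
# 10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
]]></code>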
<para id="pr11120500">Associating discrete values, through sub-intervals, with a continuous function enables a graphical representation of potential histograms in which the x-axis presents the number of bins in terms of sub-interval width (Figure 2).</para>
<para id="id7351315">The Riemann equation finds the actual area but also implies a range of possible areas above and below it. The important distinction is the move from discrete intervals for the data points to
a range over all real numbers for the set, which enables a
broader association. The result is what may be called an area
spectrum.</para>
</section>
<section id="id17562975">
<name>Riemann Area Optimization and Histogram Construction</name>
<section id="id17562995">
<name>Notes on an Associative Technique for Statistical
Analysis.</name>
<para id="id17544172">The Riemann sum equation can be applied to
selecting the number of bins for a histogram. A <link src="http://cnx.rice.edu/content/m13110/latest/">reiteration</link> of the ∆x
function over the full set of sums gives an equation for all possible areas of a data-set:</para>
<equation id="id17544194">
<m:math>
 <m:apply>
  <m:sum/>
  <m:bvar>
   <m:ci>x</m:ci>
  </m:bvar>
  <m:lowlimit>
   <m:cn>1</m:cn>
  </m:lowlimit>
  <m:uplimit>
   <m:ci>n</m:ci>
  </m:uplimit>
  <m:apply>
   <m:times/>
   <m:ci>n</m:ci>
   <m:apply>
    <m:plus/>
    <m:ci>b</m:ci>
    <m:apply>
     <m:times/>
     <m:cn>-1</m:cn>
     <m:ci>a</m:ci>
    </m:apply>
   </m:apply>
   <m:apply>
    <m:power/>
    <m:ci>x</m:ci>
    <m:cn>-1</m:cn>
   </m:apply>
  </m:apply>
 </m:apply>
</m:math>
</equation>
<para id="id17572199">Because ∆x in the Riemann equation is
continuous, the integral of (1) can represent a discrete
data-set as a nonlinear curve through the differences between a
range of areas.</para>

<para id="id17540259">The independent variable x in equation (1) indexes the observations of a data-set. It is defined for positive integers, <m:math>
 <m:apply>
   <m:ms>1 ≤ x ≤ n</m:ms>
 </m:apply>
</m:math>.</para>
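<para id="riemann-sketch-spectrum">The terms of equation (1) can be listed with a short sketch. It evaluates the summand n(b - a)/x exactly as the equation is written, for x from 1 to n. The function names and the values n = 4, a = 0, b = 1 are hypothetical and chosen only to keep the output short; they are not the example data-set of this module.</para>
<code type="block" id="riemann-sketch-spectrum-code"><![CDATA[
```python
from fractions import Fraction


def area_spectrum(n, a, b):
    """Terms n * (b - a) / x of equation (1) for x = 1, ..., n."""
    width = Fraction(b) - Fraction(a)
    return [n * width / x for x in range(1, n + 1)]


def spectrum_sum(n, a, b):
    """Sum of the terms of equation (1) over x = 1, ..., n."""
    return sum(area_spectrum(n, a, b))


# Hypothetical values, for illustration only.
terms = area_spectrum(4, 0, 1)
print([str(t) for t in terms])  # ['4', '2', '4/3', '1']
print(spectrum_sum(4, 0, 1))    # 25/3
```
]]></code>
<para id="riemann-sketch-spectrum-note">The terms decrease as x increases, so the first term bounds the listed values from above and the last bounds them from below.</para>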
<para id="id17540282">
<figure id="id17562937">
<media type="image/jpg" src="Absract1fig1_new2.jpg"/>
<caption>Derived function illustrating the significant range of
areas for a data-set. Data source: NIST, Information
Technology Laboratory, Statistical Engineering Division
&lt;http://www.itl.nist.gov/div898/strd/univ/data/Lew.dat&gt;</caption>
</figure>
</para>
<para id="id17443515">The variable n in (1) represents the number
of observations in a data-set, as opposed to the number of sub-intervals, as is
the case for ∆x. In Figure 2, the actual area lies between f(1) and f(200). The integral substitutes for ∆x when regarded as
an intermediate step for bin selection that precedes the frequency
distribution.</para>
<para id="id17443539">For the previous example (Table 1), where the actual area is
known, applying the function associated with the data to
(1) produces an area spectrum that ranges from the largest
approximated area, 0.9, to the smallest, 0.09, and contains the
actual area, 0.5.</para>
<para id="id17443551">An optimization technique is therefore available,
based on targeting the actual area within the area spectrum
resulting from (1), and it may be useful for statistical analysis.
Optimizing the integral suggests a way to qualify the number of bins. Finding the least distance from the curve to 0 is one option for optimization.</para>
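<para id="riemann-sketch-target">One reading of this optimization is to take the difference between each area in the spectrum and a known actual area, then pick the value of x at which that difference is closest to 0. The sketch below follows that reading, which is an interpretation rather than a procedure stated in this module. The function name, the values n = 10, a = 0, b = 1, and the target area 2.4 are all hypothetical.</para>
<code type="block" id="riemann-sketch-target-code"><![CDATA[
```python
def nearest_to_target(spectrum, actual_area):
    """Return (x, area) for the spectrum entry closest to actual_area.

    x is 1-based, matching the index of equation (1). Ties go to the smaller x.
    """
    best = min(range(len(spectrum)), key=lambda i: abs(spectrum[i] - actual_area))
    return best + 1, spectrum[best]


# Hypothetical inputs, for illustration only.
n, a, b = 10, 0.0, 1.0
spectrum = [n * (b - a) / x for x in range(1, n + 1)]  # terms of equation (1)

x, area = nearest_to_target(spectrum, 2.4)
print(x, area)  # 4 2.5
```
]]></code>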
<para id="id11150502">The four variables that make up this formula (a, b, n, and x) offer a standard for histogram construction. The technique allows graphical representation to occur independently of the perceived frequency distribution of the sample data points. The area spectrum represents a compression of the ways points within a data-set distribute given the four variables. This quality may prove beneficial for inferences made from the resulting histogram. It may also, however, prove limited to specific sources of data. Limitations of the technique are currently being investigated.</para>
</section>
</section>
</content>
</document>
