
Riemann Area Optimization and Histogram Construction

Module by: Leif Anderson

Summary: This module is intended to assist in histogram design. It presents an integral, expressed in terms of the Riemann sum equation, for finding the area of a data-set. It builds on an associative technique (from a discrete to a continuous function) suggested by the correspondence between the sub-interval ∆x and the bin number in a histogram. The resulting area spectrum contains the possible areas and the actual area for a data-set. The module concludes with statements on possible optimization of the integral by calculating the actual area. Building a histogram in this way may imply compression qualities for variance in the data-set frequency distribution.

Graphical Representation of a Function and Area Spectrum

The finite limits of an integral define the boundaries of an area. As the width of the sub-interval decreases, the number of sub-intervals, or 'bins', increases and the resulting Riemann area approaches the actual area.

For a data-set with the function f(x) = x, where x takes the values from 0 to 1 in increments of 0.1 without replication, the Riemann area approaches the actual value as the number of bins increases:

Table 1: Riemann area of f(x) = x on the interval 0 to 1, by number of bins

Bins    ∆x     Area
1       1      1
2       0.5    0.75
5       0.2    0.6
10      0.1    0.55
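
The tabulated areas can be reproduced with a short script. This is a minimal sketch, assuming a right-endpoint Riemann sum (the module does not name the rule, but this one matches the values listed); the function name `riemann_area` is hypothetical and introduced only for illustration. Exact fractions are used so that the results are not blurred by floating-point rounding.

```python
from fractions import Fraction


def riemann_area(f, a, b, bins):
    """Right-endpoint Riemann sum of f over [a, b] with `bins` equal sub-intervals.

    Assumption: the right-endpoint rule, chosen because it matches the
    areas listed for f(x) = x on [0, 1].
    """
    dx = Fraction(b - a, bins)  # width of each sub-interval
    return sum(f(a + i * dx) * dx for i in range(1, bins + 1))


# Print bins, sub-interval width, and Riemann area for f(x) = x on [0, 1].
for bins in (1, 2, 5, 10):
    dx = Fraction(1, bins)
    area = riemann_area(lambda x: x, 0, 1, bins)
    print(bins, float(dx), float(area))
```

Each printed row gives the number of bins, the sub-interval width, and the Riemann area, in the same order as the columns above.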

Two histograms that would result from the sub-intervals in this example (Figure 1) show a characteristic of the line from 0 to 1: the area under the function is given by the area equation for a right triangle. The histogram with 1 bin portrays the idea clearly because it maps the base and height.

Figure 1: As the number of sub-intervals increases, the Riemann area approaches the actual area for the function associated with a data set.
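
The approach toward the triangle's area can also be written in closed form. The derivation below is a sketch that assumes a right-endpoint Riemann sum with k equal bins on the interval from 0 to 1; the symbols k and A_k are introduced here for illustration and do not appear in the module.

```latex
A_k = \sum_{i=1}^{k} f\!\left(\tfrac{i}{k}\right)\Delta x
    = \frac{1}{k}\sum_{i=1}^{k}\frac{i}{k}
    = \frac{k+1}{2k},
\qquad
\lim_{k\to\infty} A_k = \frac{1}{2}
    = \tfrac{1}{2}\cdot\text{base}\cdot\text{height}
    \quad (\text{base} = \text{height} = 1).
```

For k = 1, 2, 5 and 10 this gives 1, 0.75, 0.6 and 0.55, the values tabulated for this example.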
Figure 1 (NewRiemannAreaFig1.jpg)

An infinite number of sub-intervals would give the Riemann area closest to the actual area, but it would make an inappropriate histogram for analysis. Such a construction, where the number of bins is equal to or greater than n, can be thought of as the design of a 'micro'-histogram. However, much statistical inference is made through the alternative, with fewer bins than n, and that case is the limit of the research underlying this module.

In the example, the rightmost graph, with ten sub-intervals, would present a histogram with an equal number of bins and data points. This shows that the width of the sub-interval correlates with the number of bins and will, in turn, affect the frequency of data points within each bin. Departing from the example, it is assumed that a data set resulting in a nonlinear frequency will have an unknown area.

Associating discrete values with a continuous function, through sub-intervals, enables a graphical representation of potential histograms in which the x-axis presents the number of bins as an element of sub-interval width (Figure 2).

The Riemann equation finds the actual area but implies a range of possibilities above and below the actual area. The important distinction is the move from discrete intervals for data points to a range covering all real numbers for the set, which enables a broader association. It results in what may be called an area spectrum.
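
The effect of bin count on the frequency within each bin can be illustrated with a small sketch. It assumes the ten data points 0.1, 0.2, ..., 1.0 (an assumption, chosen so that ten bins match ten points as described) and equal-width bins that are closed on the right; the function name `bin_frequencies` is hypothetical. Values are stored as integer tenths so that comparisons against bin edges are exact.

```python
def bin_frequencies(tenths, bins):
    """Count points per bin for right-closed, equal-width bins over (0, 1].

    `tenths` holds the data as integer tenths (1 means 0.1) so that
    bin edges are compared exactly, without floating-point rounding.
    """
    counts = [0] * bins
    for v in tenths:
        # v/10 falls in bin i when i/bins < v/10 <= (i + 1)/bins,
        # which for positive integers is i = (v * bins - 1) // 10.
        index = (v * bins - 1) // 10
        counts[index] += 1
    return counts


data = list(range(1, 11))  # assumed data set: 0.1, 0.2, ..., 1.0
for bins in (1, 2, 5, 10):
    print(bins, bin_frequencies(data, bins))
```

With ten bins each bin holds a single point, while fewer bins pool several points per bin, which is the relationship between sub-interval width and frequency described above.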

Riemann Area Optimization and Histogram Construction

Notes on an Associative Technique for Statistical Analysis.

The Riemann sum equation can be applied to selecting the number of bins for a histogram. Reiterating the ∆x function to find the full set of sums gives an equation for all possible areas of a data-set:

$$\int_{x=1}^{n} n\,(b - a)\,x^{-1}$$ (1)

The continuous characteristic for ∆x of the Riemann equation allows the integral of (1) to represent a discrete data-set in a nonlinear curve through the differentials between a range of areas.

The independent variable x in equation (1) is the instance variable for the number of observations of a data set. It is defined for positive integers, 1 ≤ x ≤ n.

Figure 2: Derived function, illustrating the significant range for the areas of a data-set. Data Source: NIST, Information Technology Laboratory, Statistical Engineering Division <http://www.itl.nist.gov/div898/strd/univ/data/Lew.dat>
Figure 2 (Absract1fig1_new2.jpg)

The n variable in (1) represents the number of observations in a data-set, as opposed to the number of sub-intervals, as is the case for ∆x. In Figure 2, the actual area is found between f(1) and f(200). The integral is a substitute for ∆x if regarded as an intermediary step for bin selection preceding the frequency distribution.

For the previous example (Table 1), where the actual area is known, the function associated with the data and then applied to (1) produces an area spectrum spanning the largest approximated area, 0.9, and the smallest, 0.09, where the actual area is 0.5.

Therefore, an optimization technique is available, based on targeting the actual area within the area spectrum resulting from (1), and it may be of use in statistical analysis. Optimizing the integral suggests a way to qualify the number of bins. Finding the least distance from the curve to 0 is one option for optimization.

The four variables comprising this formula convey standardization for histogram construction. The technique allows graphical representation to occur independently of the perceived frequency distribution of sample data points. The area spectrum represents a compression of the ways points within a data set distribute, given the four variables. This quality may prove beneficial for inferences made from the resulting histogram. It may also, however, prove limited to specific sources of data. Limitations of the technique are currently being investigated.
