Summary: This module is intended to assist histogram design. It presents an integral, expressed through the Riemann sum, for finding the area of a data set. It builds on an association between discrete and continuous functions suggested by the correspondence between the sub-interval width ∆x and the number of bins in a histogram. The resulting area spectrum contains the possible areas and the actual area for a data set. The module concludes with remarks on optimizing the integral by targeting the actual area. Building a histogram in this way may imply compression qualities for the variance in the data set's frequency distribution.
The finite interval of a definite integral defines the spatial boundaries of an area. As the width of the sub-intervals decreases, the number of sub-intervals, or 'bins', increases and the Riemann sum approaches the actual area.
For a data set described by the function f(x) = x, where x takes the values from 0 to 1 in increments of 0.1 without replication, the Riemann area, evaluated at the right endpoint of each sub-interval, approaches the actual area as the number of bins increases (Table 1):
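The statement above can be written compactly. This is the standard right-endpoint form of the Riemann sum; the symbols a, b, k and x_i are notation assumed here for illustration, with k denoting the number of sub-intervals (bins):

$$
\int_a^b f(x)\,dx \;=\; \lim_{k \to \infty} \sum_{i=1}^{k} f(x_i)\,\Delta x,
\qquad \Delta x = \frac{b-a}{k},
\qquad x_i = a + i\,\Delta x
$$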
| Bins | ∆x | Area |
|------|-----|------|
| 1 | 1 | 1 |
| 2 | 0.5 | 0.75 |
| 5 | 0.2 | 0.6 |
| 10 | 0.1 | 0.55 |
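The values in the table can be reproduced with a short calculation. The following is a minimal sketch, assuming the sum is taken at the right endpoint of each sub-interval (which is what the tabulated areas correspond to); the function names `right_riemann_sum` and `identity` are hypothetical and chosen only for illustration.

```python
def right_riemann_sum(f, a, b, bins):
    """Right-endpoint Riemann sum of f on [a, b] using `bins` equal sub-intervals."""
    dx = (b - a) / bins
    return sum(f(a + i * dx) for i in range(1, bins + 1)) * dx


def identity(x):
    # The example function f(x) = x from the text.
    return x


# Print one row per bin count used in the table: bins, sub-interval width, area.
for bins in (1, 2, 5, 10):
    dx = 1 / bins
    area = right_riemann_sum(identity, 0.0, 1.0, bins)
    print(f"{bins:>4} | {dx:.2f} | {area:.2f}")
```

Each printed row should agree with the corresponding row of the table, up to floating-point rounding.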
Two histograms that would result from the sub-intervals (Figure 1) for this example illustrate a characteristic of the line from 0 to 1: the area under the function is the area of a right triangle, ½ × base × height = 0.5. The histogram with one bin portrays the idea clearly because its single bar maps the base and the height.
Infinitely many sub-intervals would bring the Riemann area closest to the actual area, but the result is an inappropriate histogram for analysis. Such a construction, where the number of bins is equal to or greater than the number of observations n, can be thought of as the design of a 'micro'-histogram. Much statistical inference, however, is made through the alternative, in which there are fewer bins than observations, and the research underlying this module is limited to that case.
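For the example f(x) = x on the interval from 0 to 1, the limiting behaviour can be made explicit. The derivation below is a worked illustration, assuming right-endpoint sums and writing k for the number of sub-intervals (a symbol introduced here, not taken from the module):

$$
S_k \;=\; \sum_{i=1}^{k} \frac{i}{k}\cdot\frac{1}{k}
\;=\; \frac{1}{k^{2}}\cdot\frac{k(k+1)}{2}
\;=\; \frac{1}{2} + \frac{1}{2k}
\;\longrightarrow\; \frac{1}{2} \quad (k \to \infty)
$$

Substituting k = 1, 2, 5 and 10 gives 1, 0.75, 0.6 and 0.55, and the excess over the actual area, 1/(2k), shrinks as bins are added.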
In the example, the rightmost graph, with ten sub-intervals, would present a histogram with an equal number of bins and data points. This shows that the width of the sub-interval determines the number of bins and, in turn, affects the frequency of data points within each bin. Departing from the example, it is assumed that a data set with a nonlinear frequency distribution will have an unknown area.
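The effect of sub-interval width on the frequency in each bin can be sketched as follows. This is an illustration under stated assumptions: the data are taken to be the ten points 0.1, 0.2, ..., 1.0, bins are treated as closed on the right, and the helper `bin_counts` is a hypothetical name rather than part of any library.

```python
from fractions import Fraction
from math import ceil


def bin_counts(data, bins, lo=0, hi=1):
    """Count data points per equal-width bin, with bins closed on the right: (a, b]."""
    width = Fraction(hi - lo) / bins
    counts = [0] * bins
    for value in data:
        # Exact rational arithmetic avoids floating-point effects at bin boundaries.
        index = ceil((Fraction(value) - lo) / width) - 1
        counts[max(index, 0)] += 1  # a point exactly at `lo` goes in the first bin
    return counts


# Assumed data: the ten points 0.1, 0.2, ..., 1.0 from the example.
data = [Fraction(i, 10) for i in range(1, 11)]

for bins in (1, 2, 5, 10):
    print(bins, bin_counts(data, bins))
```

With these assumptions the counts per bin are 10 for one bin, 5 for two bins, 2 for five bins, and 1 for ten bins, the last being the case where bins and data points are equal in number.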
Associating discrete values, through sub-intervals, with a continuous function enables a graphical representation of potential histograms in which the x-axis presents the number of bins in terms of sub-interval width (Figure 2).
The Riemann sum finds the actual area in the limit but implies a range of possible areas above and below it. Moving from discrete intervals for the data points to a range covering all real numbers in the set enables a broader association, which results in what may be called an area spectrum.
The Riemann sum can be applied to selecting the number of bins for a histogram. Iterating over ∆x to find the full set of sums gives the equation for all possible areas of a data set:
Because ∆x in the Riemann sum is continuous, the integral in (1) can represent a discrete data set as a nonlinear curve through the differences between a range of areas.
The independent variable x in equation (1) is the instance variable for the number of observations in a data set. It is defined for non-negative integers.
The variable n in (1) represents the number of observations in the data set, as opposed to the number of sub-intervals, as is the case in ∆x. In Figure 2, the actual area is found between f(1) and f(200). The integral substitutes for ∆x if it is regarded as an intermediary step for bin selection that precedes the frequency distribution.
For the previous example (Table 1), where the actual area is known, applying the function associated with the data to (1) produces an area spectrum that ranges from the largest approximated area, 0.9, to the smallest, 0.09, and contains the actual area of 0.5.
An optimization technique is therefore available, based on targeting the actual area within the area spectrum resulting from (1), and it may be of use in statistical analysis. Optimizing the integral suggests a way to qualify the number of bins. Finding the least distance from the curve to 0 is one option for optimization.
The four variables that make up this formula offer a standardization for histogram construction. The technique allows graphical representation to occur independently of the perceived frequency distribution of the sample data points. The area spectrum represents a compression of the ways in which points within a data set can distribute, given the four variables. This quality may prove beneficial for inferences made from the resulting histogram. It may also, however, prove limited to specific sources of data. Limitations of the technique are currently being investigated.