Fast Fourier transforms (FFTs) can be succinctly expressed as depth-first recursive microprocessor algorithms. For example, the Cooley-Tukey FFT divides a size N transform into two size N/2 transforms, which in turn are divided into size N/4 transforms, and so on, until the base case of two size 1 transforms is reached. The two smaller subtransforms are then combined into a size 2 subtransform, two completed size 2 transforms are combined into a size 4 transform, and so on, until the size N transform is complete.
Computing the FFT with such a depth-first traversal has an important advantage in terms of memory locality: at any point during the traversal, the two completed subtransforms that compose a larger subtransform will still be in the closest level of the memory hierarchy in which they fit (see, i.a., [5] and [3]). In contrast, a breadth-first traversal of a sufficiently large transform could force data out of cache during every pass (ibid.).
Many implementations of the FFT require a bit-reversal permutation of either the input or the output data, but a depth-first recursive algorithm implicitly performs the permutation during recursion. The bit-reversal permutation is an expensive computation, and despite being the subject of hundreds of research papers over the years, it can easily account for a large fraction of the FFT's runtime, even more so for the conjugate-pair algorithm with the rotated bit-reversal permutation. Such permutations will be encountered in later sections, but for the time being it should be noted that the algorithms in this chapter do not require bit-reversal permutations: the input and output are in natural order.
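For contrast, the index mapping that an explicit bit-reversal pass would have to perform can be sketched as follows (a naive formulation for illustration; the function name is hypothetical and does not appear in this chapter's listings):

```c
#include <stddef.h>

/* Map index x (0 <= x < 2^bits) to its bit-reversed counterpart.
 * An explicit bit-reversal pass swaps element x with element
 * reverse_bits(x, bits) for every x; the depth-first recursive
 * algorithms in this chapter avoid this pass entirely. */
static unsigned reverse_bits(unsigned x, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1U);  /* shift out low bit, append to result */
        x >>= 1;
    }
    return r;
}
```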

Radix-2
A recursive depth-first implementation of the Cooley-Tukey radix-2 decimation-in-time (DIT) FFT is described with pseudocode in Code 1, and an implementation coded in C with only the most basic optimization (avoiding multiply operations where the twiddle factor is unity) is included in Appendix 1.
Implementation               Machine           Runtime
Danielson-Lanczos, 1942 [2]  Human             140 minutes
Cooley-Tukey, 1965 [1]       IBM 7094
Listing 1, Appendix 1, 2011  Macbook Air 4,2
However, it is worth noting that when considered from a historical perspective, the performance does seem impressive, as shown in Table 1. The runtimes in Table 1 are approximate: the Cooley-Tukey figure is roughly extrapolated from the floating-point operations per second (FLOPS) count of a size 2048 complex transform given in their 1965 paper [1], and the speed of the reference implementation is derived from the runtime of a size 64 complex FFT (again, based on the FLOPS count). Furthermore, the precision differs between the results: Danielson and Lanczos computed the DFT to 3–5 significant figures (possibly with the aid of slide rules or adding machines), while the other results were computed with the host machines' implementations of single-precision floating-point arithmetic.
The runtime performance of the FFT has improved by about seven orders of magnitude in 70 years, and this can largely be attributed to general improvements in the speed of computing machines. The following sections and chapters will show that performance can be further improved, by over two orders of magnitude, if the algorithm is enhanced with optimizations that are amenable to the underlying machine.
Split-radix

As was the case with the radix-2 FFT in the previous section, the split-radix FFT neatly maps from the system of linear functions to the pseudocode of Code 2, and then to the C implementation included in Appendix 1.
Code 2 explicitly handles the base case for
Also note that two twiddle factors, viz.
Conjugate-pair
From a pseudocode perspective, there is little difference between the ordinary split-radix algorithm and the conjugate-pair algorithm (see Code 3). In line 10, the

The source code has a few subtle differences that are not revealed in the
pseudocode. The pseudocode in Code 3 requires an array of complex numbers


Tangent
The tangent FFT is divided into two functions, described with pseudocode in Code 4 and Code 5. If the tangent FFT is computed prior to convolution in the frequency domain, the convolution kernel can absorb the final scaling and only Code 5 is required. Otherwise Code 4 is used as a wrapper around Code 5 to perform the rescaling, and the result

Code 4 is similar to Code 3, except that

Code 5 is almost a 1:1 mapping of the system of linear equations, except that the base cases of
Putting it all together
The simple implementations covered in this section were benchmarked for sizes of transforms
It can be seen from Figure 1 that although the conjugate-pair and split-radix algorithms have exactly the same FLOP count, the conjugate-pair algorithm is substantially faster. The difference in speed can be attributed to the fact that the conjugate-pair algorithm requires only one twiddle factor per size 4 subtransform, whereas the ordinary split-radix algorithm requires two.
Although the tangent FFT requires the same number of twiddle factors as the conjugate-pair algorithm while using fewer FLOPs, its performance is worse than the radix-2 FFT for most sizes of transform; this can be attributed to the cost of computing the scaling factors.
A simple analysis with a profiling tool reveals that each implementation's runtime is dominated by the time taken to compute the coefficients. Even in the case of the conjugate-pair algorithm, over