Fast Fourier transforms (FFTs) can be succinctly expressed as microprocessor algorithms that are depth first recursive. For example, the Cooley-Tukey FFT divides a size N transform into two size N/2 transforms, which in turn are divided into size N/4 transforms. This recursion continues until the base case of two size 1 transforms is reached, where the two smaller sub-transforms are then combined into a size 2 sub-transform, and then two completed size 2 transforms are combined into a size 4 transform, and so on, until the size N transform is complete.
Computing the FFT with such a depth first traversal has an important advantage in terms of memory locality: at any point during the traversal, the two completed sub-transforms that compose a larger sub-transform will still be in the closest level of the memory hierarchy in which they fit (see, i.a., [5] and [3]). In contrast, a breadth first traversal of a sufficiently large transform could force data out of cache during every pass (ibid.).
Many implementations of the FFT require a bit-reversal permutation of either the input or the output data, but a depth first recursive algorithm implicitly performs the permutation during recursion. The bit-reversal permutation is an expensive computation, and despite being the subject of hundreds of research papers over the years, it can easily account for a large fraction of the FFTs runtime – more so for the conjugate-pair algorithm with the rotated bit-reversal permutation. Such permutations will be encountered in later sections, but for the mean time it should be noted that the algorithms in this chapter do not require bit-reversal permutations – the input and output are in natural order.
|
Radix-2
A recursive depth first implementation of the Cooley-Tukey radix-2 decimation-in-time (DIT)
FFT is described with pseudocode in Code 1, and an
implementation coded in C with only the most basic optimization – avoiding multiply operations where
| Implementation | Machine | Runtime |
| Danielson-Lanczos, 1942 [2] | Human | 140 minutes |
| Cooley-Tukey, 1965 [1] | IBM 7094 | |
| Listing 1, Appendix 1, 2011 | Macbook Air 4,2 |
However it is worth noting that when considered from a historical perspective, the performance does seem impressive – as shown in Table 1. The runtimes in Table 1 are approximate; the Cooley-Tukey figure is roughly extrapolated from the floating point operations per second (FLOPS) count of a size 2048 complex transform given in their 1965 paper [1]; and the speed of the reference implementation is derived from the runtime of a size 64 complex FFT (again, based on the FLOPS count). Furthermore, the precision differs between the results; Danielson and Lanczos computed the DFT to 3–5 significant figures (possibly with the aid of slide rules or adding machines), while the other results were computed with the host machines' implementation of single precision floating point arithmetic.
The runtime performance of the FFT has improved by about seven orders of magnitude in 70 years, and this can largely be attributed to the computing machines of the day being generally faster. The following sections and chapters will show that the performance can be further improved by over two orders of magnitude if the algorithm is enhanced with optimizations that are amenable to the underlying machine.
Split-radix
|
As was the case with the radix-2 FFT in the previous section, the split-radix FFT neatly maps from the system of linear functions to the pseudocode of Code 2, and then to the C implementation included in Appendix 1.
Code 2 explicitly handles the base case for
Also note that two twiddle factors, viz.
Conjugate-pair
From a pseudocode perspective, there is little difference between the ordinary split-radix algorithm
and the conjugate-pair algorithm (see Code 3). In line 10, the
|
The source code has a few subtle differences that are not revealed in the
pseudocode. The pseudocode in Code 3 requires an array of complex numbers
|
|
Tangent
The tangent FFT is divided into two functions, described with pseudocode
in Code 4 and Code 5. If the
tangent FFT is computed prior to convolution in the frequency domain, the
convolution kernel can absorb the final scaling and only Code 5 is required. Otherwise Code 4 is
used as a wrapper around Code 5 to perform the
rescaling, and the result
Code 4 is similar to Code 3,
except that
Code 5 is almost a 1:1 mapping of the system of linear equations, except that the base cases of
Putting it all together
![]() |
The simple implementations covered in this section were benchmarked for sizes
of transforms
It can be seen from Figure 1 that although the conjugate-pair and split-radix algorithms have exactly the same FLOP count, the conjugate-pair algorithm is substantially faster. The difference in speed can be attributed to the fact that the conjugate-pair algorithm requires only one twiddle factor per size 4 sub-transform, whereas the ordinary split-radix algorithm requires two.
Though the tangent FFT requires the same number of twiddle factors but uses fewer FLOPs compared to the conjugate-pair algorithm, its performance is worse than the radix-2 FFT for most sizes of transform, and this can be attributed to the cost of computing the scaling factors.
A simple analysis with a profiling tool reveals that each implementations'
runtime is dominated by the time taken to compute the coefficients. Even in the
case of the conjugate-pair algorithm, over









