The publication by Cooley and Tukey Entry 5 in 1965 of an
efficient algorithm for the calculation of the DFT was a major
turning point in the development of digital signal processing.
During the five or so years that followed, various extensions and
modifications were made to the original algorithm Entry 4. By the early
1970's the practical programs were basically in the form used today.
The standard development presented in Entry 21, Entry 22, Entry 2 shows
how the DFT of a length-N sequence can be simply calculated from the
two length-N/2 DFT's of the even index terms and the odd index
terms. This is then applied to the two half-length DFT's to give
four quarter-length DFT's, and repeated until N scalars are left
which are the DFT values. Because of alternately taking the even and
odd index terms, two forms of the resulting programs are called
decimation-in-time and decimation-in-frequency. For a length of
Although the decimation methods are straightforward and easy to understand, they do not generalize well. For that reason it will be assumed that the reader is familiar with that description and this chapter will develop the FFT using the index map from Multidimensional Index Mapping.
The Cooley-Tukey FFT always uses the Type 2 index map from
Multidimensional Index Mapping: Equation 11. This is necessary for the most popular forms that
have
Type-2 conditions Multidimensional Index Mapping: Equation 8 and Multidimensional Index Mapping: Equation 11 become
and
The row and column calculations in Multidimensional Index Mapping: Equation 15 are uncoupled by Multidimensional Index Mapping: Equation 16 which for this case are
To make each short sum a DFT, the
In order to have the smallest values for
which makes the index maps of Equation 1 become
These index maps are all evaluated modulo
This map of Equation 8 and the form of the DFT in Equation 10 are the fundamentals of the Cooley-Tukey FFT.
The order of the summations using the Type 2 map in Equation 10
cannot be reversed as it can with the Type-1 map. This is because
of the
Turning Equation 10 into an efficient program requires some care.
From Multidimensional Index Mapping: Efficiencies Resulting from Index Mapping with the DFT we know that all the factors should be
equal. If
The twiddle factor array will always have unity in the first row and first column.
To complete Equation 10 at this point, after the row DFT's are
multiplied by the TF array, the
![]() |
![]() |
This flow-graph, the twiddle factor map of Equation 11, and the basic equation Equation 10 should be completely understood before going further.
A very efficient indexing scheme has evolved over the years that results in a compact and efficient computer program. A FORTRAN program is given below that implements the radix-2 FFT. It should be studied Entry 1 to see how it implements Equation 10 and the flow-graph representation.
|
This discussion, the flow graph of Winograd’s Short DFT Algorithms: Figure 2 and the program of Figure 3 are all based on the input index map of Multidimensional Index Mapping: Equation 6 and Equation 1 and the calculations are performed in-place. According to Multidimensional Index Mapping: In-Place Calculation of the DFT and Scrambling, this means the output is scrambled in bit-reversed order and should be followed by an unscrambler to give the DFT in proper order. This formulation is called a decimation-in-frequency FFT Entry 21, Entry 22, Entry 2. A very similar algorithm based on the output index map can be derived which is called a decimation-in-time FFT. Examples of FFT programs are found in Entry 1 and in the Appendix of this book.
Soon after the paper by Cooley and Tukey, there were improvements and extensions made. One very important discovery was the improvement in efficiency by using a larger radix of 4, 8 or even 16. For example, just as for the radix-2 butterfly, there are no multiplications required for a length-4 DFT, and therefore, a radix-4 FFT would have only twiddle factor multiplications. Because there are half as many stages in a radix-4 FFT, there would be half as many multiplications as in a radix-2 FFT. In practice, because some of the multiplications are by unity, the improvement is not by a factor of two, but it is significant. A radix-4 FFT is easily developed from the basic radix-2 structure by replacing the length-2 butterfly by a length-4 butterfly and making a few other modifications. Programs can be found in Entry 1 and operation counts will be given in "Evaluation of the Cooley-Tukey FFT Algorithms".
Increasing the radix to 8 gives some improvement but not as much as from 2 to 4. Increasing it to 16 is theoretically promising but the small decrease in multiplications is somewhat offset by an increase in additions and the program becomes rather long. Other radices are not attractive because they generally require a substantial number of multiplications and additions in the butterflies.
The second method of reducing arithmetic is to remove the
unnecessary TF multiplications by plus or minus unity or by plus or
minus the square root of minus one. This occurs when the exponent of
In addition to removing multiplications by one or
A time-consuming and unnecessary part of the execution of a FFT program is the calculation of the sine and cosine terms which are the real and imaginary parts of the TFs. There are basically three approaches to obtaining the sine and cosine values. They can be calculated as needed which is what is done in the sample program above. One value per stage can be calculated and the others recursively calculated from those. That method is fast but suffers from accumulated round-off errors. The fastest method is to fetch precalculated values from a stored table. This has the disadvantage of requiring considerable memory space.
If all the N DFT values are not needed, special forms of the FFT can be developed using a process called pruning Entry 17 which removes the operations concerned with the unneeded outputs.
Special algorithms are possible for cases with real data or with symmetric data Entry 6. The decimation-in-time algorithm can be easily modified to transform real data and save half the arithmetic required for complex data Entry 27. There are numerous other modifications to deal with special hardware considerations such as an array processor or a special microprocessor such as the Texas Instruments TMS320. Examples of programs that deal with some of these items can be found in Entry 22, Entry 1, Entry 6.
Recently several papers Entry 18, Entry 7, Entry 28, Entry 23, Entry 8
have been published on algorithms to calculate a length-
The basic idea behind the split-radix FFT (SRFFT) as derived by Duhamel and Hollmann Entry 7, Entry 8 is the application of a radix-2 index map to the even-indexed terms and a radix-4 map to the odd- indexed terms. The basic definition of the DFT
with
for the even index terms, and
and
for the odd index terms. This results in
an L-shaped “butterfly" shown in Figure 4 which relates a length-N
DFT to one length-N/2 DFT and two length-N/4 DFT's with twiddle
factors. Repeating this process for the half and quarter length
DFT's until scalars result gives the SRFFT algorithm in much the
same way the decimation-in-frequency radix-2 Cooley-Tukey FFT is
derived Entry 21, Entry 22, Entry 2. The resulting flow graph for the
algorithm calculated in place looks like a radix-2 FFT except for
the location of the twiddle factors. Indeed, it is the location of
the twiddle factors that makes this algorithm use less arithmetic.
The L- shaped SRFFT butterfly Figure 4 advances the calculation of the top
half by one of the
![]() |
![]() |
Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey FFT or even the prime factor algorithm or Winograd Fourier transform algorithm , the Split-Radix FFT does not progress completely stage by stage, or, in terms of indices, does not complete each nested sum in order. This is perhaps better seen from the polynomial formulation of Martens Entry 18. Because of this, the indexing is somewhat more complicated than the conventional Cooley-Tukey program.
A FORTRAN program is given below which implements the basic decimation-in-frequency split-radix FFT algorithm. The indexing scheme Entry 23 of this program gives a structure very similar to the Cooley-Tukey programs in Entry 1 and allows the same modifications and improvements such as decimation-in-time, multiple butterflies, table look-up of sine and cosine values, three real per complex multiply methods, and real data versions Entry 8, Entry 27.
|
As was done for the other decimation-in-frequency algorithms, the input index map is used and the calculations are done in place resulting in the output being in bit-reversed order. It is the three statements following label 30 that do the special indexing required by the SRFFT. The last stage is length- 2 and, therefore, inappropriate for the standard L-shaped butterfly, so it is calculated separately in the DO 60 loop. This program is considered a one-butterfly version. A second butterfly can be added just before statement 40 to remove the unnecessary multiplications by unity. A third butterfly can be added to reduce the number of real multiplications from four to two for the complex multiplication when W has equal real and imaginary parts. It is also possible to reduce the arithmetic for the two- butterfly case and to reduce the data transfers by directly programming a length-4 and length-8 butterfly to replace the last three stages. This is called a two-butterfly-plus version. Operation counts for the one, two, two-plus and three butterfly SRFFT programs are given in the next section. Some details can be found in Entry 23.
The special case of a SRFFT for real data and symmetric data is discussed in Entry 8. An application of the decimation-in-time SRFFT to real data is given in Entry 27. Application to convolution is made in Entry 9, to the discrete Hartley transform in Entry 26, Entry 9, to calculating the discrete cosine transform in Entry 28, and could be made to calculating number theoretic transforms.
An improvement in operation count has been reported by Johnson and Frigo Entry 15 which involves a scaling of multiplying factors. The improvement is small but until this result, it was generally thought the Split-Radix FFT was optimal for total floating point operation count.
The evaluation of any FFT algorithm starts with a count of the real (or floating point) arithmetic. Table 1 gives the number of real multiplications and additions required to calculate a length-N FFT of complex data. Results of programs with one, two, three and five butterflies are given to show the improvement that can be expected from removing unnecessary multiplications and additions. Results of radices two, four, eight and sixteen for the Cooley-Tukey FFT as well as of the split-radix FFT are given to show the relative merits of the various structures. Comparisons of these data should be made with the table of counts for the PFA and WFTA programs in The Prime Factor and Winograd Fourier Transform Algorithms: Evaluation of the PFA and WFTA. All programs use the four-multiply-two-add complex multiply algorithm. A similar table can be developed for the three-multiply-three-add algorithm, but the relative results are the same.
From the table it is seen that a greater improvement is obtained going from radix-2 to 4 than from 4 to 8 or 16. This is partly because length 2 and 4 butterflies have no multiplications while length 8, 16 and higher do. It is also seen that going from one to two butterflies gives more improvement than going from two to higher values. From an operation count point of view and from practical experience, a three butterfly radix-4 or a two butterfly radix-8 FFT is a good compromise. The radix-8 and 16 programs become long, especially with multiple butterflies, and they give a limited choice of transform length unless combined with some length 2 and 4 butterflies.
| N | M1 | M2 | M3 | M5 | A1 | A2 | A3 | A5 |
| 2 | 4 | 0 | 0 | 0 | 6 | 4 | 4 | 4 |
| 4 | 16 | 4 | 0 | 0 | 24 | 18 | 16 | 16 |
| 8 | 48 | 20 | 8 | 4 | 72 | 58 | 52 | 52 |
| 16 | 128 | 68 | 40 | 28 | 192 | 162 | 148 | 148 |
| 32 | 320 | 196 | 136 | 108 | 480 | 418 | 388 | 388 |
| 64 | 768 | 516 | 392 | 332 | 1152 | 1026 | 964 | 964 |
| 128 | 1792 | 1284 | 1032 | 908 | 2688 | 2434 | 2308 | 2308 |
| 256 | 4096 | 3076 | 2568 | 2316 | 6144 | 5634 | 5380 | 5380 |
| 512 | 9216 | 7172 | 6152 | 5644 | 13824 | 12802 | 12292 | 12292 |
| 1024 | 20480 | 16388 | 14344 | 13324 | 30720 | 28674 | 27652 | 27652 |
| 2048 | 45056 | 36868 | 32776 | 30732 | 67584 | 63490 | 61444 | 61444 |
| 4096 | 98304 | 81924 | 73736 | 69644 | 147456 | 139266 | 135172 | 135172 |
| 4 | 12 | 0 | 0 | 0 | 22 | 16 | 16 | 16 |
| 16 | 96 | 36 | 28 | 24 | 176 | 146 | 144 | 144 |
| 64 | 576 | 324 | 284 | 264 | 1056 | 930 | 920 | 920 |
| 256 | 3072 | 2052 | 1884 | 1800 | 5632 | 5122 | 5080 | 5080 |
| 1024 | 15360 | 11268 | 10588 | 10248 | 28160 | 26114 | 25944 | 25944 |
| 4096 | 73728 | 57348 | 54620 | 53256 | 135168 | 126978 | 126296 | 126296 |
| 8 | 32 | 4 | 4 | 4 | 66 | 52 | 52 | 52 |
| 64 | 512 | 260 | 252 | 248 | 1056 | 930 | 928 | 928 |
| 512 | 6144 | 4100 |