Skip to content Skip to navigation

Connexions

You are here: Home » Content » The Cooley-Tukey Fast Fourier Transform Algorithm

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the author

Recently Viewed

The Cooley-Tukey Fast Fourier Transform Algorithm

Module by: C. Sidney Burrus

The publication by Cooley and Tukey Entry 5 in 1965 of an efficient algorithm for the calculation of the DFT was a major turning point in the development of digital signal processing. During the five or so years that followed, various extensions and modifications were made to the original algorithm Entry 4. By the early 1970's the practical programs were basically in the form used today. The standard development presented in Entry 21, Entry 22, Entry 2 shows how the DFT of a length-N sequence can be simply calculated from the two length-N/2 DFT's of the even index terms and the odd index terms. This is then applied to the two half-length DFT's to give four quarter-length DFT's, and repeated until N scalars are left which are the DFT values. Because of alternately taking the even and odd index terms, two forms of the resulting programs are called decimation-in-time and decimation-in-frequency. For a length of 2M2M, the dividing process is repeated M=log2NM=log2N times and requires NN multiplications each time. This gives the famous formula for the computational complexity of the FFT of Nlog2NNlog2N which was derived in Multidimensional Index Mapping: Equation 34.

Although the decimation methods are straightforward and easy to understand, they do not generalize well. For that reason it will be assumed that the reader is familiar with that description and this chapter will develop the FFT using the index map from Multidimensional Index Mapping.

The Cooley-Tukey FFT always uses the Type 2 index map from Multidimensional Index Mapping: Equation 11. This is necessary for the most popular forms that have N=RMN=RM, but is also used even when the factors are relatively prime and a Type 1 map could be used. The time and frequency maps from Multidimensional Index Mapping: Equation 6 and Multidimensional Index Mapping: Equation 12 are

n = ( ( K 1 n 1 + K 2 n 2 ) ) N n = ( ( K 1 n 1 + K 2 n 2 ) ) N (1)
k = ( ( K 3 k 1 + K 4 k 2 ) ) N k = ( ( K 3 k 1 + K 4 k 2 ) ) N (2)

Type-2 conditions Multidimensional Index Mapping: Equation 8 and Multidimensional Index Mapping: Equation 11 become

K 1 = a N 2 or K 2 = b N 1 but not both K 1 = a N 2 or K 2 = b N 1 but not both (3)

and

K 3 = c N 2 or K 4 = d N 1 but not both K 3 = c N 2 or K 4 = d N 1 but not both (4)

The row and column calculations in Multidimensional Index Mapping: Equation 15 are uncoupled by Multidimensional Index Mapping: Equation 16 which for this case are

( ( K 1 K 4 ) ) N = 0 or ( ( K 2 K 3 ) ) N = 0 but not both ( ( K 1 K 4 ) ) N = 0 or ( ( K 2 K 3 ) ) N = 0 but not both (5)

To make each short sum a DFT, the KiKi must satisfy

( ( K 1 K 3 ) ) N = N 2 and ( ( K 2 K 4 ) ) N = N 1 ( ( K 1 K 3 ) ) N = N 2 and ( ( K 2 K 4 ) ) N = N 1 (6)

In order to have the smallest values for KiKi the constants in Equation 3 are chosen to be

a = d = K 2 = K 3 = 1 a = d = K 2 = K 3 = 1 (7)

which makes the index maps of Equation 1 become

n = N 2 n 1 + n 2 n = N 2 n 1 + n 2 (8)
k = k 1 + N 1 k 2 k = k 1 + N 1 k 2 (9)

These index maps are all evaluated modulo NN as indicated in (80), but in Equation 8, explicit reduction is not necessary since n never exceeds NN. The reduction notation will be omitted for clarity. From Multidimensional Index Mapping: Equation 15 and example Multidimensional Index Mapping: Equation 19, the DFT is

X = n 2 = 0 N 2 - 1 n 1 = 0 N 1 - 1 x W N 1 n 1 k 1 W N n 2 k 1 W N 2 n 2 k 2 X = n 2 = 0 N 2 - 1 n 1 = 0 N 1 - 1 x W N 1 n 1 k 1 W N n 2 k 1 W N 2 n 2 k 2 (10)

This map of Equation 8 and the form of the DFT in Equation 10 are the fundamentals of the Cooley-Tukey FFT.

The order of the summations using the Type 2 map in Equation 10 cannot be reversed as it can with the Type-1 map. This is because of the WNWN terms, the twiddle factors.

Turning Equation 10 into an efficient program requires some care. From Multidimensional Index Mapping: Efficiencies Resulting from Index Mapping with the DFT we know that all the factors should be equal. If N=RMN=RM , with R called the radix, N1N1 is first set equal to RR and N2N2 is then necessarily RM-1M-1. Consider n1n1 to be the index along the rows and n2n2 along the columns. The inner sum of Equation 10 over n1n1 represents a length-N1N1 DFT for each value of n2n2. These N2N2 length-N1N1 DFT's are the DFT's of the rows of the x(n1,n2)x(n1,n2) array. The resulting array of row DFT's is multiplied by an array of twiddle factors which are the WNWN terms in Equation 10. The twiddle-factor array for a length-8 radix-2 FFT is

T F : W 8 n 2 k 1 = W 0 W 0 W 0 W 1 W 0 W 2 W 0 W 3 = 1 1 1 W 1 - j 1 - j W T F : W 8 n 2 k 1 = W 0 W 0 W 0 W 1 W 0 W 2 W 0 W 3 = 1 1 1 W 1 - j 1 - j W (11)

The twiddle factor array will always have unity in the first row and first column.

To complete Equation 10 at this point, after the row DFT's are multiplied by the TF array, the N1N1 length-N2N2 DFT's of the columns are calculated. However, since the columns DFT's are of length RM-1RM-1, they can be posed as a RM-2RM-2 by RR array and the process repeated, again using length-RR DFT's. After MM stages of length-RR DFT's with TF multiplications interleaved, the DFT is complete. The flow graph of a length-2 DFT is given in Winograd’s Short DFT Algorithms: Figure 1 and is called a butterfly because of its shape. The flow graph of the complete length-8 radix-2 FFT is shown in Winograd’s Short DFT Algorithms: Figure 2 .

Figure 1: A Radix-2 Butterfly
Figure 1 (R2BF.png)
Figure 2: Length-8 Radix-2 FFT Flow Graph
Figure 2 (File0026.png)

This flow-graph, the twiddle factor map of Equation 11, and the basic equation Equation 10 should be completely understood before going further.

A very efficient indexing scheme has evolved over the years that results in a compact and efficient computer program. A FORTRAN program is given below that implements the radix-2 FFT. It should be studied Entry 1 to see how it implements Equation 10 and the flow-graph representation.

Figure 3: A Radix-2 Cooley-Tukey FFT Program
N2 = N
        DO 10 K = 1, M
            N1 = N2
            N2 = N2/2
            E  = 6.28318/N1
            A  = 0
            DO 20 J = 1, N2
                C = COS (A)
                S =-SIN (A)
                A = J*E
                DO 30 I = J, N, N1
                    L = I + N2
                    XT   = X(I) - X(L)
                    X(I) = X(I) + X(L)
                    YT   = Y(I) - Y(L)
                    Y(I) = Y(I) + Y(L)
                    X(L) = XT*C - YT*S
                    Y(L) = XT*S + YT*C
   30           CONTINUE
   20       CONTINUE
   10   CONTINUE
 
 

This discussion, the flow graph of Winograd’s Short DFT Algorithms: Figure 2 and the program of Figure 3 are all based on the input index map of Multidimensional Index Mapping: Equation 6 and Equation 1 and the calculations are performed in-place. According to Multidimensional Index Mapping: In-Place Calculation of the DFT and Scrambling, this means the output is scrambled in bit-reversed order and should be followed by an unscrambler to give the DFT in proper order. This formulation is called a decimation-in-frequency FFT Entry 21, Entry 22, Entry 2. A very similar algorithm based on the output index map can be derived which is called a decimation-in-time FFT. Examples of FFT programs are found in Entry 1 and in the Appendix of this book.

Modifications to the Basic Cooley-Tukey FFT

Soon after the paper by Cooley and Tukey, there were improvements and extensions made. One very important discovery was the improvement in efficiency by using a larger radix of 4, 8 or even 16. For example, just as for the radix-2 butterfly, there are no multiplications required for a length-4 DFT, and therefore, a radix-4 FFT would have only twiddle factor multiplications. Because there are half as many stages in a radix-4 FFT, there would be half as many multiplications as in a radix-2 FFT. In practice, because some of the multiplications are by unity, the improvement is not by a factor of two, but it is significant. A radix-4 FFT is easily developed from the basic radix-2 structure by replacing the length-2 butterfly by a length-4 butterfly and making a few other modifications. Programs can be found in Entry 1 and operation counts will be given in "Evaluation of the Cooley-Tukey FFT Algorithms".

Increasing the radix to 8 gives some improvement but not as much as from 2 to 4. Increasing it to 16 is theoretically promising but the small decrease in multiplications is somewhat offset by an increase in additions and the program becomes rather long. Other radices are not attractive because they generally require a substantial number of multiplications and additions in the butterflies.

The second method of reducing arithmetic is to remove the unnecessary TF multiplications by plus or minus unity or by plus or minus the square root of minus one. This occurs when the exponent of WNWN is zero or a multiple of N/4N/4. A reduction of additions as well as multiplications is achieved by removing these extraneous complex multiplications since a complex multiplication requires at least two real additions. In a program, this reduction is usually achieved by having special butterflies for the cases where the TF is one or jj. As many as four special butterflies may be necessary to remove all unnecessary arithmetic, but in many cases there will be no practical improvement above two or three.

In addition to removing multiplications by one or jj, there can be a reduction in multiplications by using a special butterfly for TFs with WN/8WN/8, which have equal real and imaginary parts. Also, for computers or hardware with multiplication considerably slower than addition, it is desirable to use an algorithm for complex multiplication that requires three multiplications and three additions rather than the conventional four multiplications and two additions. Note that this gives no reduction in the total number of arithmetic operations, but does give a trade of multiplications for additions. This is one reason not to use complex data types in programs but to explicitly program complex arithmetic.

A time-consuming and unnecessary part of the execution of a FFT program is the calculation of the sine and cosine terms which are the real and imaginary parts of the TFs. There are basically three approaches to obtaining the sine and cosine values. They can be calculated as needed which is what is done in the sample program above. One value per stage can be calculated and the others recursively calculated from those. That method is fast but suffers from accumulated round-off errors. The fastest method is to fetch precalculated values from a stored table. This has the disadvantage of requiring considerable memory space.

If all the N DFT values are not needed, special forms of the FFT can be developed using a process called pruning Entry 17 which removes the operations concerned with the unneeded outputs.

Special algorithms are possible for cases with real data or with symmetric data Entry 6. The decimation-in-time algorithm can be easily modified to transform real data and save half the arithmetic required for complex data Entry 27. There are numerous other modifications to deal with special hardware considerations such as an array processor or a special microprocessor such as the Texas Instruments TMS320. Examples of programs that deal with some of these items can be found in Entry 22, Entry 1, Entry 6.

The Split-Radix FFT Algorithm

Recently several papers Entry 18, Entry 7, Entry 28, Entry 23, Entry 8 have been published on algorithms to calculate a length-2M2M DFT more efficiently than a Cooley-Tukey FFT of any radix. They all have the same computational complexity and are optimal for lengths up through 16 and until recently was thought to give the best total add-multiply count possible for any power-of-two length. Yavne published an algorithm with the same computational complexity in 1968 Entry 29, but it went largely unnoticed. Johnson and Frigo have recently reported the first improvement in almost 40 years Entry 15. The reduction in total operations is only a few percent, but it is a reduction.

The basic idea behind the split-radix FFT (SRFFT) as derived by Duhamel and Hollmann Entry 7, Entry 8 is the application of a radix-2 index map to the even-indexed terms and a radix-4 map to the odd- indexed terms. The basic definition of the DFT

C k = n = 0 N - 1 x n W n k C k = n = 0 N - 1 x n W n k (12)

with W=e-j2π/NW=e-j2π/N gives

C 2 k = n = 0 N / 2 - 1 [ x n + x n + N / 2 ] W 2 n k C 2 k = n = 0 N / 2 - 1 [ x n + x n + N / 2 ] W 2 n k (13)

for the even index terms, and

C 4 k + 1 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) - j ( x n + N / 4 - x n + 3 N / 4 ) ] W n W 4 n k C 4 k + 1 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) - j ( x n + N / 4 - x n + 3 N / 4 ) ] W n W 4 n k (14)

and

C 4 k + 3 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) + j ( x n + N / 4 - x n + 3 N / 4 ) ] W 3 n W 4 n k C 4 k + 3 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) + j ( x n + N / 4 - x n + 3 N / 4 ) ] W 3 n W 4 n k (15)

for the odd index terms. This results in an L-shaped “butterfly" shown in Figure 4 which relates a length-N DFT to one length-N/2 DFT and two length-N/4 DFT's with twiddle factors. Repeating this process for the half and quarter length DFT's until scalars result gives the SRFFT algorithm in much the same way the decimation-in-frequency radix-2 Cooley-Tukey FFT is derived Entry 21, Entry 22, Entry 2. The resulting flow graph for the algorithm calculated in place looks like a radix-2 FFT except for the location of the twiddle factors. Indeed, it is the location of the twiddle factors that makes this algorithm use less arithmetic. The L- shaped SRFFT butterfly Figure 4 advances the calculation of the top half by one of the MM stages while the lower half, like a radix-4 butterfly, calculates two stages at once. This is illustrated for N=8N=8 in Figure 5.

Figure 4: SRFFT Butterfly
Figure 4 (File0029.png)
Figure 5: Length-8 SRFFT
Figure 5 (File0030.png)

Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey FFT or even the prime factor algorithm or Winograd Fourier transform algorithm , the Split-Radix FFT does not progress completely stage by stage, or, in terms of indices, does not complete each nested sum in order. This is perhaps better seen from the polynomial formulation of Martens Entry 18. Because of this, the indexing is somewhat more complicated than the conventional Cooley-Tukey program.

A FORTRAN program is given below which implements the basic decimation-in-frequency split-radix FFT algorithm. The indexing scheme Entry 23 of this program gives a structure very similar to the Cooley-Tukey programs in Entry 1 and allows the same modifications and improvements such as decimation-in-time, multiple butterflies, table look-up of sine and cosine values, three real per complex multiply methods, and real data versions Entry 8, Entry 27.

Figure 6: Split-Radix FFT FORTRAN Subroutine
SUBROUTINE FFT(X,Y,N,M)
        N2 = 2*N
        DO  10 K = 1, M-1
            N2 = N2/2
            N4 = N2/4
            E  = 6.283185307179586/N2
            A = 0
            DO  20 J = 1, N4
                A3  = 3*A
                CC1 = COS(A)
                SS1 = SIN(A)
                CC3 = COS(A3)
                SS3 = SIN(A3)
                A   = J*E
                IS  = J
                ID  = 2*N2
 40             DO 30 I0 = IS, N-1, ID
                    I1 = I0 + N4
                    I2 = I1 + N4
                    I3 = I2 + N4
                    R1    = X(I0) - X(I2)
                    X(I0) = X(I0) + X(I2)
                    R2    = X(I1) - X(I3)
                    X(I1) = X(I1) + X(I3)
                    S1    = Y(I0) - Y(I2)
                    Y(I0) = Y(I0) + Y(I2)
                    S2    = Y(I1) - Y(I3)
                    Y(I1) = Y(I1) + Y(I3)
                    S3    = R1 - S2
                    R1    = R1 + S2
                    S2    = R2 - S1
                    R2    = R2 + S1
                    X(I2) = R1*CC1 - S2*SS1
                    Y(I2) =-S2*CC1 - R1*SS1
                    X(I3) = S3*CC3 + R2*SS3
                    Y(I3) = R2*CC3 - S3*SS3
 30             CONTINUE
                IS = 2*ID - N2 + J
                ID = 4*ID
                IF (IS.LT.N) GOTO 40
 20         CONTINUE
 10     CONTINUE
        IS = 1
        ID = 4
 50     DO 60 I0 = IS, N, ID
            I1    = I0 + 1
            R1    = X(I0)
            X(I0) = R1 + X(I1)
            X(I1) = R1 - X(I1)
            R1    = Y(I0)
            Y(I0) = R1 + Y(I1)
  60    Y(I1) = R1 - Y(I1)
            IS = 2*ID - 1
            ID = 4*ID
        IF (IS.LT.N) GOTO 50
 
 

As was done for the other decimation-in-frequency algorithms, the input index map is used and the calculations are done in place resulting in the output being in bit-reversed order. It is the three statements following label 30 that do the special indexing required by the SRFFT. The last stage is length- 2 and, therefore, inappropriate for the standard L-shaped butterfly, so it is calculated separately in the DO 60 loop. This program is considered a one-butterfly version. A second butterfly can be added just before statement 40 to remove the unnecessary multiplications by unity. A third butterfly can be added to reduce the number of real multiplications from four to two for the complex multiplication when W has equal real and imaginary parts. It is also possible to reduce the arithmetic for the two- butterfly case and to reduce the data transfers by directly programming a length-4 and length-8 butterfly to replace the last three stages. This is called a two-butterfly-plus version. Operation counts for the one, two, two-plus and three butterfly SRFFT programs are given in the next section. Some details can be found in Entry 23.

The special case of a SRFFT for real data and symmetric data is discussed in Entry 8. An application of the decimation-in-time SRFFT to real data is given in Entry 27. Application to convolution is made in Entry 9, to the discrete Hartley transform in Entry 26, Entry 9, to calculating the discrete cosine transform in Entry 28, and could be made to calculating number theoretic transforms.

An improvement in operation count has been reported by Johnson and Frigo Entry 15 which involves a scaling of multiplying factors. The improvement is small but until this result, it was generally thought the Split-Radix FFT was optimal for total floating point operation count.

Evaluation of the Cooley-Tukey FFT Algorithms

The evaluation of any FFT algorithm starts with a count of the real (or floating point) arithmetic. Table 1 gives the number of real multiplications and additions required to calculate a length-N FFT of complex data. Results of programs with one, two, three and five butterflies are given to show the improvement that can be expected from removing unnecessary multiplications and additions. Results of radices two, four, eight and sixteen for the Cooley-Tukey FFT as well as of the split-radix FFT are given to show the relative merits of the various structures. Comparisons of these data should be made with the table of counts for the PFA and WFTA programs in The Prime Factor and Winograd Fourier Transform Algorithms: Evaluation of the PFA and WFTA. All programs use the four-multiply-two-add complex multiply algorithm. A similar table can be developed for the three-multiply-three-add algorithm, but the relative results are the same.

From the table it is seen that a greater improvement is obtained going from radix-2 to 4 than from 4 to 8 or 16. This is partly because length 2 and 4 butterflies have no multiplications while length 8, 16 and higher do. It is also seen that going from one to two butterflies gives more improvement than going from two to higher values. From an operation count point of view and from practical experience, a three butterfly radix-4 or a two butterfly radix-8 FFT is a good compromise. The radix-8 and 16 programs become long, especially with multiple butterflies, and they give a limited choice of transform length unless combined with some length 2 and 4 butterflies.

Number of Real Multiplications and Additions for Complex Single Radix FFTs
N M1 M2 M3 M5 A1 A2 A3 A5
2 4 0 0 0 6 4 4 4
4 16 4 0 0 24 18 16 16
8 48 20 8 4 72 58 52 52
16 128 68 40 28 192 162 148 148
32 320 196 136 108 480 418 388 388
64 768 516 392 332 1152 1026 964 964
128 1792 1284 1032 908 2688 2434 2308 2308
256 4096 3076 2568 2316 6144 5634 5380 5380
512 9216 7172 6152 5644 13824 12802 12292 12292
1024 20480 16388 14344 13324 30720 28674 27652 27652
2048 45056 36868 32776 30732 67584 63490 61444 61444
4096 98304 81924 73736 69644 147456 139266 135172 135172
4 12 0 0 0 22 16 16 16
16 96 36 28 24 176 146 144 144
64 576 324 284 264 1056 930 920 920
256 3072 2052 1884 1800 5632 5122 5080 5080
1024 15360 11268 10588 10248 28160 26114 25944 25944
4096 73728 57348 54620 53256 135168 126978 126296 126296
8 32 4 4 4 66 52 52 52
64 512 260 252 248 1056 930 928 928
512 6144 4100