Skip to content Skip to navigation Skip to collection information

OpenStax-CNX

You are here: Home » Content » ECE 454 and ECE 554 Supplemental reading » The Cooley-Tukey Fast Fourier Transform Algorithm

Navigation

Table of Contents

Lenses

What is a lens?

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

This content is ...

Affiliated with (What does "Affiliated with" mean?)

This content is either by members of the organizations listed or about topics related to the organizations listed. Click each link to see a list of all content affiliated with the organization.
  • Rice Digital Scholarship

    This module is included in aLens by: Digital Scholarship at Rice UniversityAs a part of collections: "Fast Fourier Transforms", "Fast Fourier Transforms (6x9 Version)"

    Click the "Rice Digital Scholarship" link to see all content affiliated with them.

  • NSF Partnership display tagshide tags

    This module is included inLens: NSF Partnership in Signal Processing
    By: Sidney BurrusAs a part of collection: "Fast Fourier Transforms"

    Click the "NSF Partnership" link to see all content affiliated with them.

    Click the tag icon tag icon to display tags associated with this content.

  • Featured Content display tagshide tags

    This module is included inLens: Connexions Featured Content
    By: ConnexionsAs a part of collection: "Fast Fourier Transforms"

    Comments:

    "The Fast Fourier Transform (FFT) is a landmark algorithm used in fields ranging from signal processing to high-performance computing. First popularized by two American scientists in 1965, the […]"

    Click the "Featured Content" link to see all content affiliated with them.

    Click the tag icon tag icon to display tags associated with this content.

Also in these lenses

  • UniqU content

    This module is included inLens: UniqU's lens
    By: UniqU, LLCAs a part of collection: "Fast Fourier Transforms"

    Click the "UniqU content" link to see all content selected in this lens.

  • Lens for Engineering

    This module is included inLens: Lens for Engineering
    By: Sidney Burrus

    Click the "Lens for Engineering" link to see all content selected in this lens.

Recently Viewed

This feature requires Javascript to be enabled.

Tags

(What is a tag?)

These tags come from the endorsement, affiliation, and other lenses that include this content.
 

The Cooley-Tukey Fast Fourier Transform Algorithm

Module by: C. Sidney Burrus. E-mail the author

The publication by Cooley and Tukey [5] in 1965 of an efficient algorithm for the calculation of the DFT was a major turning point in the development of digital signal processing. During the five or so years that followed, various extensions and modifications were made to the original algorithm [4]. By the early 1970's the practical programs were basically in the form used today. The standard development presented in [21], [22], [2] shows how the DFT of a length-N sequence can be simply calculated from the two length-N/2 DFT's of the even index terms and the odd index terms. This is then applied to the two half-length DFT's to give four quarter-length DFT's, and repeated until N scalars are left which are the DFT values. Because of alternately taking the even and odd index terms, two forms of the resulting programs are called decimation-in-time and decimation-in-frequency. For a length of 2M2M, the dividing process is repeated M=log2NM=log2N times and requires NN multiplications each time. This gives the famous formula for the computational complexity of the FFT of Nlog2NNlog2N which was derived in Multidimensional Index Mapping: Equation 34.

Although the decimation methods are straightforward and easy to understand, they do not generalize well. For that reason it will be assumed that the reader is familiar with that description and this chapter will develop the FFT using the index map from Multidimensional Index Mapping.

The Cooley-Tukey FFT always uses the Type 2 index map from Multidimensional Index Mapping: Equation 11. This is necessary for the most popular forms that have N=RMN=RM, but is also used even when the factors are relatively prime and a Type 1 map could be used. The time and frequency maps from Multidimensional Index Mapping: Equation 6 and Multidimensional Index Mapping: Equation 12 are

n = ( ( K 1 n 1 + K 2 n 2 ) ) N n = ( ( K 1 n 1 + K 2 n 2 ) ) N
(1)
k = ( ( K 3 k 1 + K 4 k 2 ) ) N k = ( ( K 3 k 1 + K 4 k 2 ) ) N
(2)

Type-2 conditions Multidimensional Index Mapping: Equation 8 and Multidimensional Index Mapping: Equation 11 become

K 1 = a N 2 or K 2 = b N 1 but not both K 1 = a N 2 or K 2 = b N 1 but not both
(3)

and

K 3 = c N 2 or K 4 = d N 1 but not both K 3 = c N 2 or K 4 = d N 1 but not both
(4)

The row and column calculations in Multidimensional Index Mapping: Equation 15 are uncoupled by Multidimensional Index Mapping: Equation 16 which for this case are

( ( K 1 K 4 ) ) N = 0 or ( ( K 2 K 3 ) ) N = 0 but not both ( ( K 1 K 4 ) ) N = 0 or ( ( K 2 K 3 ) ) N = 0 but not both
(5)

To make each short sum a DFT, the KiKi must satisfy

( ( K 1 K 3 ) ) N = N 2 and ( ( K 2 K 4 ) ) N = N 1 ( ( K 1 K 3 ) ) N = N 2 and ( ( K 2 K 4 ) ) N = N 1
(6)

In order to have the smallest values for KiKi the constants in Equation 3 are chosen to be

a = d = K 2 = K 3 = 1 a = d = K 2 = K 3 = 1
(7)

which makes the index maps of Equation 1 become

n = N 2 n 1 + n 2 n = N 2 n 1 + n 2
(8)
k = k 1 + N 1 k 2 k = k 1 + N 1 k 2
(9)

These index maps are all evaluated modulo NN, but in Equation 8, explicit reduction is not necessary since nn never exceeds NN. The reduction notation will be omitted for clarity. From Multidimensional Index Mapping: Equation 15 and example Multidimensional Index Mapping: Equation 19, the DFT is

X = n 2 = 0 N 2 - 1 n 1 = 0 N 1 - 1 x W N 1 n 1 k 1 W N n 2 k 1 W N 2 n 2 k 2 X = n 2 = 0 N 2 - 1 n 1 = 0 N 1 - 1 x W N 1 n 1 k 1 W N n 2 k 1 W N 2 n 2 k 2
(10)

This map of Equation 8 and the form of the DFT in Equation 10 are the fundamentals of the Cooley-Tukey FFT.

The order of the summations using the Type 2 map in Equation 10 cannot be reversed as it can with the Type-1 map. This is because of the WNWN terms, the twiddle factors.

Turning Equation 10 into an efficient program requires some care. From Multidimensional Index Mapping: Efficiencies Resulting from Index Mapping with the DFT we know that all the factors should be equal. If N=RMN=RM , with R called the radix, N1N1 is first set equal to RR and N2N2 is then necessarily RM-1 RM-1. Consider n1n1 to be the index along the rows and n2n2 along the columns. The inner sum of Equation 10 over n1n1 represents a length-N1N1 DFT for each value of n2n2. These N2N2 length-N1N1 DFT's are the DFT's of the rows of the x(n1,n2)x(n1,n2) array. The resulting array of row DFT's is multiplied by an array of twiddle factors which are the WNWN terms in Equation 10. The twiddle-factor array for a length-8 radix-2 FFT is

T F : W 8 n 2 k 1 = W 0 W 0 W 0 W 1 W 0 W 2 W 0 W 3 = 1 1 1 W 1 - j 1 - j W T F : W 8 n 2 k 1 = W 0 W 0 W 0 W 1 W 0 W 2 W 0 W 3 = 1 1 1 W 1 - j 1 - j W
(11)

The twiddle factor array will always have unity in the first row and first column.

To complete Equation 10 at this point, after the row DFT's are multiplied by the TF array, the N1N1 length-N2N2 DFT's of the columns are calculated. However, since the columns DFT's are of length R M-1R M-1, they can be posed as a R M-2R M-2 by RR array and the process repeated, again using length-RR DFT's. After MM stages of length-RR DFT's with TF multiplications interleaved, the DFT is complete. The flow graph of a length-2 DFT is given in Figure 1 and is called a butterfly because of its shape. The flow graph of the complete length-8 radix-2 FFT is shown in Figure 2 .

Figure 1: A Radix-2 Butterfly
This figure contains four items with arrows arranged evenly and rectangularly, and arrows pointing from the leftmost items to the rightmost items. The first two items on the left read x(0) on the top and x(1) on the bottom, and are accompanied by large dots next to them, indicating a starting point for four arrows. From each dot are two arrows, one pointing directly across to the right at an item, and one pointing diagonally to the right at the other rightmost item. The top-right item that is pointed to by two arrows reads X(0)  = x(0) + x(1), and the bottom-right item that is pointed to by two arrows reads X(0)  = x(0) - x(1). The horizontal line pointing from the bototm-left item to the bottom-right item is labeled with a dash. The entire figure is labeled, Radix-2 butterfly.
Figure 2: Length-8 Radix-2 FFT Flow Graph
This figure is a flow graph with eight lines crossing in different directions at three points along the graph from left to right, with each beginning and end point marked with a black dot. The eight horizontal lines that flow consistently across the graph are labeled on the left in order as x_0 to x_7. At its initial point, x_0 through x_3 connect diagonally downward four rows, so that they meet x_4 through x_7 at the same horizontal level. Conversely, x_4 through x_7 connect diagonally upward to the original locations of x_0 to x_3, respectively. Directly before the first four lines connect to the last four lines, on the fifth through eighth horizontal parallels where they will connect, are small arrows indicating movement to the right, and these arrows are labeled with a dash. Directly after these lines connect on the fifth through eighth parallels are small arrows pointing to the right again, this time labeled with w^0, w^1, w^2, w^3 respectively. At this point, the figure is described about halfway across the page from left to right. The next section of crossed lines is exactly the same for the top four parallels and the bottom four parallels. The top two lines in each section connect diagonally downward two spaces, and conversely the bottom two lines connect diagonally upward the same amount. The second half of each section (the third, fourth, seventh and eight) again have arrows labeled with a dash pointing to the right, before the connected segments, and arrows pointing to the right labeled w^0 and w^2 respectively after the connections. Finally, one more set of crossed connected lines occurs, this time in an identical pattern for four groups of two lines. The lines cross this other and connect to the adjacent parallel, with odd-numbered parallels connecting downward and their even-numbered partners connecting upward. Before the connection point are again arrows pointing to the right, labeled with a dash, that occur in the second, fourth, sixth, and eighth parallels. This concludes the lines, and the labels to these lines on the right side of the figure read x_0, x_4, x_2, x_6, x_1, x_5, x_3, x_7 down the page.

This flow-graph, the twiddle factor map of Equation 11, and the basic equation Equation 10 should be completely understood before going further.

A very efficient indexing scheme has evolved over the years that results in a compact and efficient computer program. A FORTRAN program is given below that implements the radix-2 FFT. It should be studied [1] to see how it implements Equation 10 and the flow-graph representation.

Listing 1: A Radix-2 Cooley-Tukey FFT Program
N2 = N
        DO 10 K = 1, M
            N1 = N2
            N2 = N2/2
            E  = 6.28318/N1
            A  = 0
            DO 20 J = 1, N2
                C = COS (A)
                S =-SIN (A)
                A = J*E
                DO 30 I = J, N, N1
                    L = I + N2
                    XT   = X(I) - X(L)
                    X(I) = X(I) + X(L)
                    YT   = Y(I) - Y(L)
                    Y(I) = Y(I) + Y(L)
                    X(L) = XT*C - YT*S
                    Y(L) = XT*S + YT*C
   30           CONTINUE
   20       CONTINUE
   10   CONTINUE
 
 

This discussion, the flow graph of Winograd’s Short DFT Algorithms: Figure 2 and the program of Code 1 are all based on the input index map of Multidimensional Index Mapping: Equation 6 and Equation 1 and the calculations are performed in-place. According to Multidimensional Index Mapping: In-Place Calculation of the DFT and Scrambling, this means the output is scrambled in bit-reversed order and should be followed by an unscrambler to give the DFT in proper order. This formulation is called a decimation-in-frequency FFT [21], [22], [2]. A very similar algorithm based on the output index map can be derived which is called a decimation-in-time FFT. Examples of FFT programs are found in [1] and in the Appendix of this book.

Modifications to the Basic Cooley-Tukey FFT

Soon after the paper by Cooley and Tukey, there were improvements and extensions made. One very important discovery was the improvement in efficiency by using a larger radix of 4, 8 or even 16. For example, just as for the radix-2 butterfly, there are no multiplications required for a length-4 DFT, and therefore, a radix-4 FFT would have only twiddle factor multiplications. Because there are half as many stages in a radix-4 FFT, there would be half as many multiplications as in a radix-2 FFT. In practice, because some of the multiplications are by unity, the improvement is not by a factor of two, but it is significant. A radix-4 FFT is easily developed from the basic radix-2 structure by replacing the length-2 butterfly by a length-4 butterfly and making a few other modifications. Programs can be found in [1] and operation counts will be given in "Evaluation of the Cooley-Tukey FFT Algorithms".

Increasing the radix to 8 gives some improvement but not as much as from 2 to 4. Increasing it to 16 is theoretically promising but the small decrease in multiplications is somewhat offset by an increase in additions and the program becomes rather long. Other radices are not attractive because they generally require a substantial number of multiplications and additions in the butterflies.

The second method of reducing arithmetic is to remove the unnecessary TF multiplications by plus or minus unity or by plus or minus the square root of minus one. This occurs when the exponent of WNWN is zero or a multiple of N/4N/4. A reduction of additions as well as multiplications is achieved by removing these extraneous complex multiplications since a complex multiplication requires at least two real additions. In a program, this reduction is usually achieved by having special butterflies for the cases where the TF is one or jj. As many as four special butterflies may be necessary to remove all unnecessary arithmetic, but in many cases there will be no practical improvement above two or three.

In addition to removing multiplications by one or jj, there can be a reduction in multiplications by using a special butterfly for TFs with WN/8WN/8, which have equal real and imaginary parts. Also, for computers or hardware with multiplication considerably slower than addition, it is desirable to use an algorithm for complex multiplication that requires three multiplications and three additions rather than the conventional four multiplications and two additions. Note that this gives no reduction in the total number of arithmetic operations, but does give a trade of multiplications for additions. This is one reason not to use complex data types in programs but to explicitly program complex arithmetic.

A time-consuming and unnecessary part of the execution of a FFT program is the calculation of the sine and cosine terms which are the real and imaginary parts of the TFs. There are basically three approaches to obtaining the sine and cosine values. They can be calculated as needed which is what is done in the sample program above. One value per stage can be calculated and the others recursively calculated from those. That method is fast but suffers from accumulated round-off errors. The fastest method is to fetch precalculated values from a stored table. This has the disadvantage of requiring considerable memory space.

If all the N DFT values are not needed, special forms of the FFT can be developed using a process called pruning [17] which removes the operations concerned with the unneeded outputs.

Special algorithms are possible for cases with real data or with symmetric data [6]. The decimation-in-time algorithm can be easily modified to transform real data and save half the arithmetic required for complex data [27]. There are numerous other modifications to deal with special hardware considerations such as an array processor or a special microprocessor such as the Texas Instruments TMS320. Examples of programs that deal with some of these items can be found in [22], [1], [6].

The Split-Radix FFT Algorithm

Recently several papers [18], [7], [28], [23], [8] have been published on algorithms to calculate a length-2M2M DFT more efficiently than a Cooley-Tukey FFT of any radix. They all have the same computational complexity and are optimal for lengths up through 16 and until recently was thought to give the best total add-multiply count possible for any power-of-two length. Yavne published an algorithm with the same computational complexity in 1968 [29], but it went largely unnoticed. Johnson and Frigo have recently reported the first improvement in almost 40 years [15]. The reduction in total operations is only a few percent, but it is a reduction.

The basic idea behind the split-radix FFT (SRFFT) as derived by Duhamel and Hollmann [7], [8] is the application of a radix-2 index map to the even-indexed terms and a radix-4 map to the odd- indexed terms. The basic definition of the DFT

C k = n = 0 N - 1 x n W n k C k = n = 0 N - 1 x n W n k
(12)

with W=e-j2π/NW=e-j2π/N gives

C 2 k = n = 0 N / 2 - 1 [ x n + x n + N / 2 ] W 2 n k C 2 k = n = 0 N / 2 - 1 [ x n + x n + N / 2 ] W 2 n k
(13)

for the even index terms, and

C 4 k + 1 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) - j ( x n + N / 4 - x n + 3 N / 4 ) ] W n W 4 n k C 4 k + 1 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) - j ( x n + N / 4 - x n + 3 N / 4 ) ] W n W 4 n k
(14)

and

C 4 k + 3 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) + j ( x n + N / 4 - x n + 3 N / 4 ) ] W 3 n W 4 n k C 4 k + 3 = n = 0 N / 4 - 1 [ ( x n - x n + N / 2 ) + j ( x n + N / 4 - x n + 3 N / 4 ) ] W 3 n W 4 n k
(15)

for the odd index terms. This results in an L-shaped “butterfly" shown in Figure 3 which relates a length-N DFT to one length-N/2 DFT and two length-N/4 DFT's with twiddle factors. Repeating this process for the half and quarter length DFT's until scalars result gives the SRFFT algorithm in much the same way the decimation-in-frequency radix-2 Cooley-Tukey FFT is derived [21], [22], [2]. The resulting flow graph for the algorithm calculated in place looks like a radix-2 FFT except for the location of the twiddle factors. Indeed, it is the location of the twiddle factors that makes this algorithm use less arithmetic. The L- shaped SRFFT butterfly Figure 3 advances the calculation of the top half by one of the MM stages while the lower half, like a radix-4 butterfly, calculates two stages at once. This is illustrated for N=8N=8 in Figure 4.

Figure 3: SRFFT Butterfly
This figure is an array of dots connected with diagonal and horizontal lines. There are two vertical columns of four dots followed by a third column with two dots. The first column of dots each has one horizontal line pointing across, and one diagonal line pointing downward for the first two rows and upward for the second two rows. The diagonal lines connect to dots on the second column either two rows above or two rows below their original position. To the left of the third and fourth dots on the second column, next to the horizontal lines that connect to them, are two dashes. In between the third and fourth dots of the second column is a large label, j. Further to the right, the third and fourth dots of the second column connect to the two dots in the third column. The third dot in the second column connects to the second dot in the third column, which is horizontally level with the fourth dot in the second column. The fourth dot in the second column connects to the first dot in the third column, which is horizontally level with the third dot in the second column. Along this line segment is a dash near the first dot of the third column. There are also two horizontal lines connecting the third dot in the second column with the first dot in the third column, and the fourth dot in the second column with the second dot in the third column.
Figure 4: Length-8 SRFFT
This figure is a flow graph with eight lines crossing in different directions at three points along the graph from left to right. The eight horizontal lines that flow consistently across the graph begin with black dots. At these initial points, the first four connect diagonally downward four rows, and the last four connect diagonally upward to the first four rows. For each of the eight initial dots, there is also a horizontal line connecting these initial dots to eight vertical dots at the same point that the aforementioned diagonal lines terminate. At this point in the figure, the graph separates between a lower half and upper half. At this point in the lower half in between the dots are a label, j. The lower half continues directly from the lower four dots that were connected in the large first section, and to the right they behave in a similar fashion, with the upper two moving two spaces down diagonally, and the lower two moving two spaces up, accompanied by horizontal lines connecting the dots directly across. There is a break in horizontal movement for the lower half of the figure at this point, where the four dots are disconnected and followed by a new set of four adjacent dots. These dots are visually grouped in two sections, with more diagonal lines simply connecting each dot to one adjacent dot diagonally to the right, up or down, and two horizontal lines connecting the dots directly across. In between these two sections and the larger section to the left are two labels, one above labeled w, and the lower labeled w^3. The upper portion of the right half of the figure is not continued or connected to the large initial section. Its first portion mimics the shape of its lower neighbor, with four rows of dots, four horizontal lines to four adjacent dots, and diagonal lines moving upward and downward two spaces up or down. Connected to the lower portion of this upper section is another set of two dots, both vertically and diagonally connected to the larger section to its left. In the middle of the leftmost dots of this section is a label, j. Above and in the upper-right corner of the figure is a final disconnected set of four dots, diagonally and horizontally connected. For the entire figure, there is a dash labeling each dot that to its left is connected by a down-sloping diagonal line.

Unlike the fixed radix, mixed radix or variable radix Cooley-Tukey FFT or even the prime factor algorithm or Winograd Fourier transform algorithm , the Split-Radix FFT does not progress completely stage by stage, or, in terms of indices, does not complete each nested sum in order. This is perhaps better seen from the polynomial formulation of Martens [18]. Because of this, the indexing is somewhat more complicated than the conventional Cooley-Tukey program.

A FORTRAN program is given below which implements the basic decimation-in-frequency split-radix FFT algorithm. The indexing scheme [23] of this program gives a structure very similar to the Cooley-Tukey programs in [1] and allows the same modifications and improvements such as decimation-in-time, multiple butterflies, table look-up of sine and cosine values, three real per complex multiply methods, and real data versions [8], [27].

Listing 2: Split-Radix FFT FORTRAN Subroutine
SUBROUTINE FFT(X,Y,N,M)
        N2 = 2*N
        DO  10 K = 1, M-1
            N2 = N2/2
            N4 = N2/4
            E  = 6.283185307179586/N2
            A = 0
            DO  20 J = 1, N4
                A3  = 3*A
                CC1 = COS(A)
                SS1 = SIN(A)
                CC3 = COS(A3)
                SS3 = SIN(A3)
                A   = J*E
                IS  = J
                ID  = 2*N2
 40             DO 30 I0 = IS, N-1, ID
                    I1 = I0 + N4
                    I2 = I1 + N4
                    I3 = I2 + N4
                    R1    = X(I0) - X(I2)
                    X(I0) = X(I0) + X(I2)
                    R2    = X(I1) - X(I3)
                    X(I1) = X(I1) + X(I3)
                    S1    = Y(I0) - Y(I2)
                    Y(I0) = Y(I0) + Y(I2)
                    S2    = Y(I1) - Y(I3)
                    Y(I1) = Y(I1) + Y(I3)
                    S3    = R1 - S2
                    R1    = R1 + S2
                    S2    = R2 - S1
                    R2    = R2 + S1
                    X(I2) = R1*CC1 - S2*SS1
                    Y(I2) =-S2*CC1 - R1*SS1
                    X(I3) = S3*CC3 + R2*SS3
                    Y(I3) = R2*CC3 - S3*SS3
 30             CONTINUE
                IS = 2*ID - N2 + J
                ID = 4*ID
                IF (IS.LT.N) GOTO 40
 20         CONTINUE
 10     CONTINUE
        IS = 1
        ID = 4
 50     DO 60 I0 = IS, N, ID
            I1    = I0 + 1
            R1    = X(I0)
            X(I0) = R1 + X(I1)
            X(I1) = R1 - X(I1)
            R1    = Y(I0)
            Y(I0) = R1 + Y(I1)
  60    Y(I1) = R1 - Y(I1)
            IS = 2*ID - 1
            ID = 4*ID
        IF (IS.LT.N) GOTO 50
 
 

As was done for the other decimation-in-frequency algorithms, the input index map is used and the calculations are done in place resulting in the output being in bit-reversed order. It is the three statements following label 30 that do the special indexing required by the SRFFT. The last stage is length- 2 and, therefore, inappropriate for the standard L-shaped butterfly, so it is calculated separately in the DO 60 loop. This program is considered a one-butterfly version. A second butterfly can be added just before statement 40 to remove the unnecessary multiplications by unity. A third butterfly can be added to reduce the number of real multiplications from four to two for the complex multiplication when W has equal real and imaginary parts. It is also possible to reduce the arithmetic for the two- butterfly case and to reduce the data transfers by directly programming a length-4 and length-8 butterfly to replace the last three stages. This is called a two-butterfly-plus version. Operation counts for the one, two, two-plus and three butterfly SRFFT programs are given in the next section. Some details can be found in [23].

The special case of a SRFFT for real data and symmetric data is discussed in [8]. An application of the decimation-in-time SRFFT to real data is given in [27]. Application to convolution is made in [9], to the discrete Hartley transform in [26], [9], to calculating the discrete cosine transform in [28], and could be made to calculating number theoretic transforms.

An improvement in operation count has been reported by Johnson and Frigo [15] which involves a scaling of multiplying factors. The improvement is small but until this result, it was generally thought the Split-Radix FFT was optimal for total floating point operation count.

Evaluation of the Cooley-Tukey FFT Algorithms

The evaluation of any FFT algorithm starts with a count of the real (or floating point) arithmetic. Table 1 gives the number of real multiplications and additions required to calculate a length-N FFT of complex data. Results of programs with one, two, three and five butterflies are given to show the improvement that can be expected from removing unnecessary multiplications and additions. Results of radices two, four, eight and sixteen for the Cooley-Tukey FFT as well as of the split-radix FFT are given to show the relative merits of the various structures. Comparisons of these data should be made with the table of counts for the PFA and WFTA programs in The Prime Factor and Winograd Fourier Transform Algorithms: Evaluation of the PFA and WFTA. All programs use the four-multiply-two-add complex multiply algorithm. A similar table can be developed for the three-multiply-three-add algorithm, but the relative results are the same.

From the table it is seen that a greater improvement is obtained going from radix-2 to 4 than from 4 to 8 or 16. This is partly because length 2 and 4 butterflies have no multiplications while length 8, 16 and higher do. It is also seen that going from one to two butterflies gives more improvement than going from two to higher values. From an operation count point of view and from practical experience, a three butterfly radix-4 or a two butterfly radix-8 FFT is a good compromise. The radix-8 and 16 programs become long, especially with multiple butterflies, and they give a limited choice of transform length unless combined with some length 2 and 4 butterflies.

Table 1: Number of Real Multiplications and Additions for Complex Single Radix FFTs
N M1 M2 M3 M5 A1 A2 A3 A5
2 4 0 0 0 6 4 4 4
4 16 4 0 0 24 18 16 16
8 48 20 8 4 72 58 52 52
16 128 68 40 28 192 162 148 148
32 320 196 136 108 480 418 388 388
64 768 516 392 332 1152 1026 964 964
128 1792 1284 1032 908 2688 2434 2308 2308
256 4096 3076 2568 2316 6144 5634 5380 5380
512 9216 7172 6152 5644 13824 12802 12292 12292
1024 20480 16388 14344 13324 30720 28674 27652 27652
2048 45056 36868 32776 30732 67584 63490 61444 61444
4096 98304 81924 73736 69644 147456 139266 135172 135172
4 12 0 0 0 22 16 16 16
16 96 36 28 24 176 146 144 144
64 576 324 284 264 1056 930 920 920
256 3072 2052 1884 1800 5632 5122 5080 5080
1024 15360 11268 10588 10248 28160 26114 25944 25944
4096 73728 57348 54620 53256 135168 126978 126296 126296
8 32 4 4 4 66 52 52 52
64 512 260 252 248 1056 930 928 928
512 6144 4100 4028 3992 12672 11650 11632 11632
4096 65536 49156 48572 48280 135168 126978 126832 126832
16 80 20 20 20 178 148 148 148
256 2560 1540 1532 1528 5696 5186 5184 5184
4096 61440 45060 44924 44856 136704 128514 128480 128480
2 0 0 0 0 4 4 4 4
4 8 0 0 0 20 16 16 16
8 24 8 4 4 60 52 52 52
16 72 32 28 24 164 144 144 144
32 184 104 92 84 412 372 372 372
64 456 288 268 248 996 912 912 912
128 1080 744 700 660 2332 2164 2164 2164
256 2504 1824 1740 1656 5348 5008 5008 5008
512 5688 4328 4156 3988 12060 11380 11380 11380
1024 12744 10016 9676 9336 26852 25488 25488 25488
2048 28216 22760 22076 21396 59164 56436 56436 56436
4096 61896 50976 49612 48248 129252 123792 123792 123792

In Table 1 Mi and Ai refer to the number of real multiplications and real additions used by an FFT with i separately written butterflies. The first block has the counts for Radix-2, the second for Radix-4, the third for Radix-8, the fourth for Radix-16, and the last for the Split-Radix FFT. For the split-radix FFT, M3 and A3 refer to the two- butterfly-plus program and M5 and A5 refer to the three-butterfly program.

The first evaluations of FFT algorithms were in terms of the number of real multiplications required as that was the slowest operation on the computer and, therefore, controlled the execution speed. Later with hardware arithmetic both the number of multiplications and additions became important. Modern systems have arithmetic speeds such that indexing and data transfer times become important factors. Morris [19] has looked at some of these problems and has developed a procedure called autogen to write partially straight-line program code to significantly reduce overhead and speed up FFT run times. Some hardware, such as the TMS320 signal processing chip, has the multiply and add operations combined. Some machines have vector instructions or have parallel processors. Because the execution speed of an FFT depends not only on the algorithm, but also on the hardware architecture and compiler, experiments must be run on the system to be used.

In many cases the unscrambler or bit-reverse-counter requires 10% of the execution time, therefore, if possible, it should be eliminated. In high-speed convolution where the convolution is done by multiplication of DFT's, a decimation-in-frequency FFT can be combined with a decimation-in-time inverse FFT to require no unscrambler. It is also possible for a radix-2 FFT to do the unscrambling inside the FFT but the structure is not very regular [22], [14]. Special structures can be found in [22] and programs for data that are real or have special symmetries are in [6], [8], [27].

Although there can be significant differences in the efficiencies of the various Cooley-Tukey and Split-Radix FFTs, the number of multiplications and additions for all of them is on the order of NlogNNlogN. That is fundamental to the class of algorithms.

The Quick Fourier Transform, An FFT based on Symmetries

The development of fast algorithms usually consists of using special properties of the algorithm of interest to remove redundant or unnecessary operations of a direct implementation. The discrete Fourier transform (DFT) defined by

C ( k ) = n = 0 N - 1 x ( n ) W N n k C ( k ) = n = 0 N - 1 x ( n ) W N n k
(16)

where

W N = e - j 2 π / N W N = e - j 2 π / N
(17)

has enormous capacity for improvement of its arithmetic efficiency. Most fast algorithms use the periodic and symmetric properties of its basis functions. The classical Cooley-Tukey FFT and prime factor FFT [1] exploit the periodic properties of the cosine and sine functions. Their use of the periodicities to share and, therefore, reduce arithmetic operations depends on the factorability of the length of the data to be transformed. For highly composite lengths, the number of floating-point operation is of order Nlog(N)Nlog(N) and for prime lengths it is of order N2N2.

This section will look at an approach using the symmetric properties to remove redundancies. This possibility has long been recognized [13], [16], [25], [20] but has not been developed in any systematic way in the open literature. We will develop an algorithm, called the quick Fourier transform (QFT) [16], that will reduce the number of floating point operations necessary to compute the DFT by a factor of two to four over direct methods or Goertzel's method for prime lengths. Indeed, it seems the best general algorithm available for prime length DFTs. One can always do better by using Winograd type algorithms but they must be individually designed for each length. The Chirp Z-transform can be used for longer lengths.

Input and Output Symmetries

We use the fact that the cosine is an even function and the sine is an odd function. The kernel of the DFT or the basis functions of the expansion is given by

W N n k = e - j 2 π n k / N = cos ( 2 π n k / N ) + j sin ( 2 π n k / N ) W N n k = e - j 2 π n k / N = cos ( 2 π n k / N ) + j sin ( 2 π n k / N )
(18)

which has an even real part and odd imaginary part. If the data x(n)x(n) are decomposed into their real and imaginary parts and those into their even and odd parts, we have

x ( n ) = u ( n ) + j v ( n ) = [ u e ( n ) + u o ( n ) ] + j [ v e ( n ) + v o ( n ) ] x ( n ) = u ( n ) + j v ( n ) = [ u e ( n ) + u o ( n ) ] + j [ v e ( n ) + v o ( n ) ]
(19)

where the even part of the real part of x(n)x(n) is given by

u e ( n ) = ( u ( n ) + u ( - n ) ) / 2 u e ( n ) = ( u ( n ) + u ( - n ) ) / 2
(20)

and the odd part of the real part is

u o ( n ) = ( u ( n ) - u ( - n ) ) / 2 u o ( n ) = ( u ( n ) - u ( - n ) ) / 2
(21)

with corresponding definitions of ve(n)ve(n) and vo(n)vo(n). Using Convolution Algorithms: Equation 32 with a simpler notation, the DFT of Convolution Algorithms: Equation 29 becomes

C ( k ) = n = 0 N - 1 ( u + j v ) ( cos - j sin ) . C ( k ) = n = 0 N - 1 ( u + j v ) ( cos - j sin ) .
(22)

The sum over an integral number of periods of an odd function is zero and the sum of an even function over half of the period is one half the sum over the whole period. This causes Equation 16 and Equation 22 to become

C ( k ) = n = 0 N / 2 - 1 [ u e cos + v o sin ] + j [ v e cos - v o sin ] . C ( k ) = n = 0 N / 2 - 1 [ u e cos + v o sin ] + j [ v e cos - v o sin ] .
(23)

for k=0,1,2,,N-1k=0,1,2,,N-1.

The evaluation of the DFT using equation Equation 23 requires half as many real multiplication and half as many real additions as evaluating it using Equation 16 or Equation 22. We have exploited the symmetries of the sine and cosine as functions of the time index nn. This is independent of whether the length is composite or not. Another view of this formulation is that we have used the property of associatively of multiplication and addition. In other words, rather than multiply two data points by the same value of a sine or cosine then add the results, one should add the data points first then multiply the sum by the sine or cosine which requires one rather than two multiplications.

Next we take advantage of the symmetries of the sine and cosine as functions of the frequency index kk. Using these symmetries on Equation 23 gives

C ( k ) = n = 0 N / 2 - 1 [ u e cos + v o sin ] + j [ v e cos - v o sin ] C ( k ) = n = 0 N / 2 - 1 [ u e cos + v o sin ] + j [ v e cos - v o sin ]
(24)
C ( N - k ) = n = 0 N / 2 - 1 [ u e cos - v o sin ] + j [ v e cos + v o sin ] . C ( N - k ) = n = 0 N / 2 - 1 [ u e cos - v o sin ] + j [ v e cos + v o sin ] .
(25)

for k=0,1,2,,N/2-1k=0,1,2,,N/2-1. This again reduces the number of operations by a factor of two, this time because it calculates two output values at a time. The first reduction by a factor of two is always available. The second is possible only if both DFT values are needed. It is not available if you are calculating only one DFT value. The above development has not dealt with the details that arise with the difference between an even and an odd length. That is straightforward.

Further Reductions if the Length is Even

If the length of the sequence to be transformed is even, there are further symmetries that can be exploited. There will be four data values that are all multiplied by plus or minus the same sine or cosine value. This means a more complicated pre-addition process which is a generalization of the simple calculation of the even and odd parts in Equation 20 and Equation 21 will reduce the size of the order N2N2 part of the algorithm by still another factor of two or four. It the length is divisible by 4, the process can be repeated. Indeed, it the length is a power of 2, one can show this process is equivalent to calculating the DFT in terms of discrete cosine and sine transforms [10], [11] with a resulting arithmetic complexity of order Nlog(N)Nlog(N) and with a structure that is well suited to real data calculations and pruning.

If the flow-graph of the Cooley-Tukey FFT is compared to the flow-graph of the QFT, one notices both similarities and differences. Both progress in stages as the length is continually divided by two. The Cooley-Tukey algorithm uses the periodic properties of the sine and cosine to give the familiar horizontal tree of butterflies. The parallel diagonal lines in this graph represent the parallel stepping through the data in synchronism with the periodic basis functions. The QFT has diagonal lines that connect the first data point with the last, then the second with the next to last, and so on to give a “star" like picture. This is interesting in that one can look at the flow graph of an algorithm developed by some completely different strategy and often find section with the parallel structures and other parts with the star structure. These must be using some underlying periodic and symmetric properties of the basis functions.

Arithmetic Complexity and Timings

A careful analysis of the QFT shows that 2N2N additions are necessary to compute the even and odd parts of the input data. This is followed by the length N/2N/2 inner product that requires 4(N/2)2=N24(N/2)2=N2 real multiplications and an equal number of additions. This is followed by the calculations necessary for the simultaneous calculations of the first half and last half of C(k)C(k) which requires 4(N/2)=2N4(N/2)=2N real additions. This means the total QFT algorithm requires M2M2 real multiplications and N2+4NN2+4N real additions. These numbers along with those for the Goertzel algorithm [3], [1], [20] and the direct calculation of the DFT are included in the following table. Of the various order-N2N2 DFT algorithms, the QFT seems to be the most efficient general method for an arbitrary length NN.

Table 2
Algorithm Real Mults. Real Adds Trig Eval.
       
Direct DFT 4 N 2 4 N 2 4 N 2 4 N 2 2 N 2 2 N 2
Mod. 2nd Order Goertzel N 2 + N N 2 + N 2 N 2 + N 2 N 2 + N N N
QFT N 2 N 2 N 2 + 4 N N 2 + 4 N 2 N 2 N
 

Timings of the algorithms on a PC in milliseconds are given in the following table.

Table 3
Algorithm N = 125 N = 125 N = 256 N = 256
     
Direct DFT 4.90 19.83
Mod. 2O. Goertzel 1.32 5.55
QFT 1.09 4.50
Chirp + FFT 1.70 3.52
 

These timings track the floating point operation counts fairly well.

Conclusions

The QFT is a straight-forward DFT algorithm that uses all of the possible symmetries of the DFT basis function with no requirements on the length being composite. These ideas have been proposed before, but have not been published or clearly developed by [16], [25], [24], [12]. It seems that the basic QFT is practical and useful as a general algorithm for lengths up to a hundred or so. Above that, the chirp z-transform [1] or other filter based methods will be superior. For special cases and shorter lengths, methods based on Winograd's theories will always be superior. Nevertheless, the QFT has a definite place in the array of DFT algorithms and is not well known. A Fortran program is included in the appendix.

It is possible, but unlikely, that further arithmetic reduction could be achieved using the fact that WNWN has unity magnitude as was done in second-order Goertzel algorithm. It is also possible that some way of combining the Goertzel and QFT algorithm would have some advantages. A development of a complete QFT decomposition of a DFT of length-2M2M shows interesting structure [10], [11] and arithmetic complexity comparable to average Cooley-Tukey FFTs. It does seem better suited to real data calculations with pruning.

References

  1. Burrus, C. S. and Parks, T. W. (1985). DFT/FFT and Convolution Algorithms. New York: John Wiley & Sons.
  2. Brigham, E. Oran. (1988). The Fast Fourier Transform and Its Applications. [Expansion of the 1974 book]. Englewood Cliffs, NJ: Prentice-Hall.
  3. Burrus, C. S. (1992). Goertzel's Algorithm. Unpublished Notes. ECE Dept., Rice University.
  4. Cooley, James W. and Lewis, Peter A. W. and Welch, Peter D. (1967, June). Historical Notes on the Fast Fourier Transform. IEEE Transactions on Audio and Electroacoustics, 15, 260–262.
  5. Cooley, J. W. and Tukey, J. W. (1965). An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Computat., 19, 297–301.
  6. DSP Committee, (Ed.). (1979). Digital Signal Processing II, selected reprints. New York: IEEE Press.
  7. Duhamel, P. and Hollmann, H. (1984, January 5). Split Radix FFT Algorithm. Electronic Letters, 20(1), 14–16.
  8. Duhamel, P. (1986, April). Implementation of `Split-Radix' FFT Algorithms for Complex, Real, and Real-Symmetric Data. [A shorter version appeared in the ICASSP-85 Proceedings, p. 20.6, March 1985]. IEEE Trans. on ASSP, 34, 285–295.
  9. Duhamel, P. and Vetterli, M. (1986, April). Cyclic Convolution of Real Sequences: Hartley versus Fourier and New Schemes. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-86), 6.5.
  10. Guo, H. and Sitton, G. A. and Burrus, C. S. (1994, April 19–22). The Quick Discrete Fourier Transform. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (p. III:445–448). IEEE ICASSP-94, Adelaide, Australia
  11. Guo, H. and Sitton, G. A. and Burrus, C. S. (1998, February). The Quick Fourier Transform: an FFT based on Symmetries. IEEE Transactions on Signal Processing, 46(2), 335–341.
  12. Heideman, M. T. (1985). private communication.
  13. Heideman, M. T. and Johnson, D. H. and Burrus, C. S. (1985). Gauss and the History of the FFT. Archive for History of Exact Sciences, 34(3), 265–277.
  14. Johnson, H. W. and Burrus, C. S. (1984, March). An In-Order, In-Place Radix-2 FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (p. 28A.2.1–4). San Diego, CA
  15. Johnson, Steven G. and Frigo, Matteo. (2007, January). A Modified Split-Radix FFT with Fewer Arithmetic Operations. IEEE Transactions on Signal Processing, 55(1), 111–119.
  16. Kohne, John F. (1980, July). A Quick Fourier Transform Algorithm. (TR-1723). Technical report. Naval Electronics Laboratory Center.
  17. Markel, J. D. (1971, June). FFT Pruning. IEEE Trans on Audio and Electroacoustics, 19(4), 305–311.
  18. Martens, J. B. (1984, August). Recursive Cyclotomic Factorization – A New Algorithm for Calculating the Discrete Fourier Transform. IEEE Trans. on ASSP, 32(4), 750–762.
  19. Morris, L. R. (1982, 1983). Digital Signal Processing Software. Toronto, Canada: DSPSW, Inc.
  20. Oppenheim, A. V. and Schafer, R. W. (1989). Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
  21. Oppenheim, A. V. and Schafer, R. W. (1999). Discrete-Time Signal Processing. (Second). [Earlier editions in 1975 and 1989]. Englewood Cliffs, NJ: Prentice-Hall.
  22. Rabiner, L. R. and Gold, B. (1975). Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
  23. Sorensen, H. V. and Heideman, M. T. and Burrus, C. S. (1986, February). On Computing the Split–Radix FFT. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 152–156.
  24. Sitton, G. A. (1985). The QFT Algorithm.
  25. Sitton, G. A. (1991, June). The QDFT: An Efficient Method for Short Prime Length DFTs.
  26. Sorensen, H. V. and Jones, D. L. and Burrus, C. S. and Heideman, M. T. (1985, October). On Computing the Discrete Hartley Transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(5), 1231–1238.
  27. Sorensen, H. V. and Jones, D. L. and Heideman, M. T. and Burrus, C. S. (1987, June). Real Valued Fast Fourier Transform Algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6), 849–863.
  28. Vetterli, Martin and Nussbaumer, H. J. (1984, August). Simple FFT and DCT Algorithms with Reduced Number of Operations. Signal Processing, 6(4), 267–278.
  29. Yavne, R. (1968). An Economical Method for Calculating the Discrete Fourier Transform. In Fall Joint Computer Conference, AFIPS Conference Proceedings. (Vol. 33 part 1, p. 115–125).

Collection Navigation

Content actions

Download:

Collection as:

EPUB (?)

What is an EPUB file?

EPUB is an electronic book format that can be read on a variety of mobile devices.

Downloading to a reading device

For detailed instructions on how to download this content's EPUB to your specific device, click the "(?)" link.

| More downloads ...

Module as:

PDF | More downloads ...

Add:

Collection to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks

Module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens

Lenses

A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks