The publication by Cooley and Tukey [5] in 1965 of an
efficient algorithm for the calculation of the DFT was a major
turning point in the development of digital signal processing.
During the five or so years that followed, various extensions and
modifications were made to the original algorithm [4]. By the early
1970's the practical programs were basically in the form used today.
The standard development presented in [21], [22], [2] shows
how the DFT of a length-N sequence can be simply calculated from the
two length-N/2 DFT's of the even index terms and the odd index
terms. This is then applied to the two half-length DFT's to give
four quarter-length DFT's, and repeated until N scalars are left
which are the DFT values. Because of alternately taking the even and
odd index terms, two forms of the resulting programs are called
decimation-in-time and decimation-in-frequency. For a length of
2M2M, the dividing process is repeated M=log2NM=log2N times and
requires NN multiplications each time. This gives the famous
formula for the computational complexity of the FFT of Nlog2NNlog2N
which was derived in Multidimensional Index Mapping: Equation 34.
Although the decimation methods are straightforward and easy
to understand, they do not generalize well. For that reason it will
be assumed that the reader is familiar with that description and
this chapter will develop the FFT using the index map from Multidimensional Index Mapping.
The Cooley-Tukey FFT always uses the Type 2 index map from
Multidimensional Index Mapping: Equation 11. This is necessary for the most popular forms that
have N=RMN=RM, but is also used even when the factors are
relatively prime and a Type 1 map could be used. The time and
frequency maps from Multidimensional Index Mapping: Equation 6 and Multidimensional Index Mapping: Equation 12 are
n
=
(
(
K
1
n
1
+
K
2
n
2
)
)
N
n
=
(
(
K
1
n
1
+
K
2
n
2
)
)
N
(1)
k
=
(
(
K
3
k
1
+
K
4
k
2
)
)
N
k
=
(
(
K
3
k
1
+
K
4
k
2
)
)
N
(2)
Type-2 conditions Multidimensional Index Mapping: Equation 8 and Multidimensional Index Mapping: Equation 11 become
K
1
=
a
N
2
or
K
2
=
b
N
1
but
not
both
K
1
=
a
N
2
or
K
2
=
b
N
1
but
not
both
(3)
and
K
3
=
c
N
2
or
K
4
=
d
N
1
but
not
both
K
3
=
c
N
2
or
K
4
=
d
N
1
but
not
both
(4)
The row and column calculations in Multidimensional Index Mapping: Equation 15 are uncoupled by
Multidimensional Index Mapping: Equation 16 which for this case are
(
(
K
1
K
4
)
)
N
=
0
or
(
(
K
2
K
3
)
)
N
=
0
but
not
both
(
(
K
1
K
4
)
)
N
=
0
or
(
(
K
2
K
3
)
)
N
=
0
but
not
both
(5)
To make each short sum a DFT, the KiKi must satisfy
(
(
K
1
K
3
)
)
N
=
N
2
and
(
(
K
2
K
4
)
)
N
=
N
1
(
(
K
1
K
3
)
)
N
=
N
2
and
(
(
K
2
K
4
)
)
N
=
N
1
(6)
In order to have the smallest values for
KiKi the constants in Equation 3 are chosen to be
a
=
d
=
K
2
=
K
3
=
1
a
=
d
=
K
2
=
K
3
=
1
(7)
which makes the index maps of Equation 1 become
n
=
N
2
n
1
+
n
2
n
=
N
2
n
1
+
n
2
(8)
k
=
k
1
+
N
1
k
2
k
=
k
1
+
N
1
k
2
(9)
These index maps are all evaluated modulo NN, but in Equation 8, explicit reduction is not necessary since
nn never exceeds NN. The reduction notation will be omitted for
clarity. From Multidimensional Index Mapping: Equation 15 and example Multidimensional Index Mapping: Equation 19, the DFT is
X
=
∑
n
2
=
0
N
2
-
1
∑
n
1
=
0
N
1
-
1
x
W
N
1
n
1
k
1
W
N
n
2
k
1
W
N
2
n
2
k
2
X
=
∑
n
2
=
0
N
2
-
1
∑
n
1
=
0
N
1
-
1
x
W
N
1
n
1
k
1
W
N
n
2
k
1
W
N
2
n
2
k
2
(10)
This map of Equation 8 and the form of the DFT in Equation 10 are
the fundamentals of the Cooley-Tukey FFT.
The order of the summations using the Type 2 map in Equation 10
cannot be reversed as it can with the Type-1 map. This is because
of the WNWN terms, the twiddle factors.
Turning Equation 10 into an efficient program requires some care.
From Multidimensional Index Mapping: Efficiencies Resulting from Index Mapping with the DFT we know that all the factors should be
equal. If
N=RMN=RM ,
with R called the radix,
N1N1
is first set
equal to
RR
and
N2N2
is then necessarily RM-1
RM-1.
Consider
n1n1
to be the index along the rows and n2n2 along the columns. The inner
sum of Equation 10 over n1n1 represents a length-N1N1 DFT for each value
of n2n2. These N2N2 length-N1N1 DFT's are the DFT's of the rows
of the x(n1,n2)x(n1,n2) array. The resulting array of row DFT's is
multiplied by an array of twiddle factors which are the WNWN terms
in Equation 10. The twiddle-factor array for a length-8 radix-2 FFT
is
T
F
:
W
8
n
2
k
1
=
W
0
W
0
W
0
W
1
W
0
W
2
W
0
W
3
=
1
1
1
W
1
-
j
1
-
j
W
T
F
:
W
8
n
2
k
1
=
W
0
W
0
W
0
W
1
W
0
W
2
W
0
W
3
=
1
1
1
W
1
-
j
1
-
j
W
(11)
The twiddle factor array will always have unity in the first row and
first column.
To complete Equation 10 at this point, after the row DFT's are
multiplied by the TF array, the N1N1 length-N2N2 DFT's of the
columns are calculated. However, since the columns DFT's are of
length
R
M-1R
M-1,
they can be posed as a
R
M-2R
M-2 by RR array and the
process repeated, again using length-RR DFT's. After MM stages of
length-RR DFT's with TF multiplications interleaved, the DFT is
complete. The flow graph of a length-2 DFT is given in Figure 1 and
is called a butterfly because of its shape. The flow graph of the
complete length-8 radix-2 FFT is shown in Figure 2 .
This flow-graph, the twiddle factor map of Equation 11, and the
basic equation Equation 10 should be completely understood before
going further.
A very efficient indexing scheme has evolved over the years
that results in a compact and efficient computer program. A FORTRAN
program is given below that implements the radix-2 FFT. It should be
studied [1] to see how it implements Equation 10 and the
flow-graph representation.
Listing 1: A Radix-2 Cooley-Tukey FFT ProgramN2 = N
DO 10 K = 1, M
N1 = N2
N2 = N2/2
E = 6.28318/N1
A = 0
DO 20 J = 1, N2
C = COS (A)
S =-SIN (A)
A = J*E
DO 30 I = J, N, N1
L = I + N2
XT = X(I) - X(L)
X(I) = X(I) + X(L)
YT = Y(I) - Y(L)
Y(I) = Y(I) + Y(L)
X(L) = XT*C - YT*S
Y(L) = XT*S + YT*C
30 CONTINUE
20 CONTINUE
10 CONTINUE
|
This discussion, the flow graph of Winograd’s Short DFT Algorithms: Figure 2 and the program
of Code 1 are all based on the input index map of
Multidimensional Index Mapping: Equation 6 and Equation 1 and the calculations are performed
in-place. According to Multidimensional Index Mapping: In-Place Calculation of the DFT and Scrambling, this means the output is
scrambled in bit-reversed order and should be followed by an
unscrambler to give the DFT in proper order. This formulation is
called a decimation-in-frequency FFT [21], [22], [2]. A very
similar algorithm based on the output index map can be derived which
is called a decimation-in-time FFT. Examples of FFT programs are
found in [1] and in the Appendix of this book.
Soon after the paper by Cooley and Tukey, there were
improvements and extensions made. One very important discovery was
the improvement in efficiency by using a larger radix of 4, 8 or
even 16. For example, just as for the radix-2 butterfly, there are
no multiplications required for a length-4 DFT, and therefore, a
radix-4 FFT would have only twiddle factor multiplications. Because
there are half as many stages in a radix-4 FFT, there would be half
as many multiplications as in a radix-2 FFT. In practice, because
some of the multiplications are by unity, the improvement is not by
a factor of two, but it is significant. A radix-4 FFT is easily
developed from the basic radix-2 structure by replacing the length-2
butterfly by a length-4 butterfly and making a few other
modifications. Programs can be found in [1] and operation
counts will be given in "Evaluation of the Cooley-Tukey FFT Algorithms".
Increasing the radix to 8 gives some improvement but not as
much as from 2 to 4. Increasing it to 16 is theoretically promising
but the small decrease in multiplications is somewhat offset by an
increase in additions and the program becomes rather long. Other
radices are not attractive because they generally require a
substantial number of multiplications and additions in the
butterflies.
The second method of reducing arithmetic is to remove the
unnecessary TF multiplications by plus or minus unity or by plus or
minus the square root of minus one. This occurs when the exponent of
WNWN is zero or a multiple of N/4N/4. A reduction of additions as
well as multiplications is achieved by removing these extraneous
complex multiplications since a complex multiplication requires at
least two real additions. In a program, this reduction is usually
achieved by having special butterflies for the cases where the TF is
one or jj. As many as four special butterflies may be necessary to
remove all unnecessary arithmetic, but in many cases there will be
no practical improvement above two or three.
In addition to removing multiplications by one or jj, there
can be a reduction in multiplications by using a special butterfly
for TFs with WN/8WN/8, which have equal real and imaginary parts.
Also, for computers or hardware with multiplication considerably
slower than addition, it is desirable to use an algorithm for
complex multiplication that requires three multiplications and three
additions rather than the conventional four multiplications and two
additions. Note that this gives no reduction in the total number of
arithmetic operations, but does give a trade of multiplications for
additions. This is one reason not to use complex data types in
programs but to explicitly program complex arithmetic.
A time-consuming and unnecessary part of the execution of a
FFT program is the calculation of the sine and cosine terms which
are the real and imaginary parts of the TFs. There are basically
three approaches to obtaining the sine and cosine values. They can
be calculated as needed which is what is done in the sample program
above. One value per stage can be calculated and the others
recursively calculated from those. That method is fast but suffers
from accumulated round-off errors. The fastest method is to fetch
precalculated values from a stored table. This has the disadvantage
of requiring considerable memory space.
If all the N DFT values are not needed, special forms of the
FFT can be developed using a process called pruning [17]
which removes the operations concerned with the unneeded outputs.
Special algorithms are possible for cases with real data or
with symmetric data [6]. The decimation-in-time algorithm
can be easily modified to transform real data and save half the
arithmetic required for complex data [27]. There are
numerous other modifications to deal with special hardware
considerations such as an array processor or a special
microprocessor such as the Texas Instruments TMS320. Examples of
programs that deal with some of these items can be found in
[22], [1], [6].
Recently several papers [18], [7], [28], [23], [8]
have been published on algorithms to calculate a length-2M2M DFT
more efficiently than a Cooley-Tukey FFT of any radix. They all have
the same computational complexity and are optimal for lengths up
through 16 and until recently was thought to give the best total add-multiply count
possible for any power-of-two length. Yavne published an algorithm
with the same computational complexity in 1968 [29], but it
went largely unnoticed. Johnson and Frigo have recently reported the first
improvement in almost 40 years [15]. The reduction
in total operations is only a few percent, but it is a reduction.
The basic idea behind the split-radix FFT (SRFFT) as derived
by Duhamel and Hollmann [7], [8] is the application of a
radix-2 index map to the even-indexed terms and a radix-4 map to the
odd- indexed terms. The basic definition of the DFT
C
k
=
∑
n
=
0
N
-
1
x
n
W
n
k
C
k
=
∑
n
=
0
N
-
1
x
n
W
n
k
(12)with W=e-j2π/NW=e-j2π/N gives
C
2
k
=
∑
n
=
0
N
/
2
-
1
[
x
n
+
x
n
+
N
/
2
]
W
2
n
k
C
2
k
=
∑
n
=
0
N
/
2
-
1
[
x
n
+
x
n
+
N
/
2
]
W
2
n
k
(13)for the even index terms, and
C
4
k
+
1
=
∑
n
=
0
N
/
4
-
1
[
(
x
n
-
x
n
+
N
/
2
)
-
j
(
x
n
+
N
/
4
-
x
n
+
3
N
/
4
)
]
W
n
W
4
n
k
C
4
k
+
1
=
∑
n
=
0
N
/
4
-
1
[
(
x
n
-
x
n
+
N
/
2
)
-
j
(
x
n
+
N
/
4
-
x
n
+
3
N
/
4
)
]
W
n
W
4
n
k
(14)and
C
4
k
+
3
=
∑
n
=
0
N
/
4
-
1
[
(
x
n
-
x
n
+
N
/
2
)
+
j
(
x
n
+
N
/
4
-
x
n
+
3
N
/
4
)
]
W
3
n
W
4
n
k
C
4
k
+
3
=
∑
n
=
0
N
/
4
-
1
[
(
x
n
-
x
n
+
N
/
2
)
+
j
(
x
n
+
N
/
4
-
x
n
+
3
N
/
4
)
]
W
3
n
W
4
n
k
(15)for the odd index terms. This results in
an L-shaped “butterfly" shown in Figure 3 which relates a length-N
DFT to one length-N/2 DFT and two length-N/4 DFT's with twiddle
factors. Repeating this process for the half and quarter length
DFT's until scalars result gives the SRFFT algorithm in much the
same way the decimation-in-frequency radix-2 Cooley-Tukey FFT is
derived [21], [22], [2]. The resulting flow graph for the
algorithm calculated in place looks like a radix-2 FFT except for
the location of the twiddle factors. Indeed, it is the location of
the twiddle factors that makes this algorithm use less arithmetic.
The L- shaped SRFFT butterfly Figure 3 advances the calculation of the top
half by one of the MM stages while the lower half, like a radix-4
butterfly, calculates two stages at once. This is illustrated for N=8N=8 in Figure 4.
Unlike the fixed radix, mixed radix or variable radix
Cooley-Tukey FFT or even the prime factor algorithm or Winograd
Fourier transform algorithm , the Split-Radix FFT does not progress
completely stage by stage, or, in terms of indices, does not
complete each nested sum in order. This is perhaps better seen from
the polynomial formulation of Martens [18]. Because of
this, the indexing is somewhat more complicated than the
conventional Cooley-Tukey program.
A FORTRAN program is given below which implements the basic
decimation-in-frequency split-radix FFT algorithm. The indexing
scheme [23] of this program gives a structure very similar
to the Cooley-Tukey programs in [1] and allows the same
modifications and improvements such as decimation-in-time, multiple
butterflies, table look-up of sine and cosine values, three real per
complex multiply methods, and real data versions
[8], [27].
Listing 2: Split-Radix FFT FORTRAN SubroutineSUBROUTINE FFT(X,Y,N,M)
N2 = 2*N
DO 10 K = 1, M-1
N2 = N2/2
N4 = N2/4
E = 6.283185307179586/N2
A = 0
DO 20 J = 1, N4
A3 = 3*A
CC1 = COS(A)
SS1 = SIN(A)
CC3 = COS(A3)
SS3 = SIN(A3)
A = J*E
IS = J
ID = 2*N2
40 DO 30 I0 = IS, N-1, ID
I1 = I0 + N4
I2 = I1 + N4
I3 = I2 + N4
R1 = X(I0) - X(I2)
X(I0) = X(I0) + X(I2)
R2 = X(I1) - X(I3)
X(I1) = X(I1) + X(I3)
S1 = Y(I0) - Y(I2)
Y(I0) = Y(I0) + Y(I2)
S2 = Y(I1) - Y(I3)
Y(I1) = Y(I1) + Y(I3)
S3 = R1 - S2
R1 = R1 + S2
S2 = R2 - S1
R2 = R2 + S1
X(I2) = R1*CC1 - S2*SS1
Y(I2) =-S2*CC1 - R1*SS1
X(I3) = S3*CC3 + R2*SS3
Y(I3) = R2*CC3 - S3*SS3
30 CONTINUE
IS = 2*ID - N2 + J
ID = 4*ID
IF (IS.LT.N) GOTO 40
20 CONTINUE
10 CONTINUE
IS = 1
ID = 4
50 DO 60 I0 = IS, N, ID
I1 = I0 + 1
R1 = X(I0)
X(I0) = R1 + X(I1)
X(I1) = R1 - X(I1)
R1 = Y(I0)
Y(I0) = R1 + Y(I1)
60 Y(I1) = R1 - Y(I1)
IS = 2*ID - 1
ID = 4*ID
IF (IS.LT.N) GOTO 50
|
As was done for the other decimation-in-frequency
algorithms, the input index map is used and the calculations are
done in place resulting in the output being in bit-reversed order.
It is the three statements following label 30 that do the special
indexing required by the SRFFT. The last stage is length- 2 and,
therefore, inappropriate for the standard L-shaped butterfly, so it
is calculated separately in the DO 60 loop. This program is
considered a one-butterfly version. A second butterfly can be added
just before statement 40 to remove the unnecessary multiplications
by unity. A third butterfly can be added to reduce the number of
real multiplications from four to two for the complex multiplication
when W has equal real and imaginary parts. It is also possible to
reduce the arithmetic for the two- butterfly case and to reduce the
data transfers by directly programming a length-4 and length-8
butterfly to replace the last three stages. This is called a
two-butterfly-plus version. Operation counts for the one, two,
two-plus and three butterfly SRFFT programs are given in the next
section. Some details can be found in [23].
The special case of a SRFFT for real data and symmetric data
is discussed in [8]. An application of the
decimation-in-time SRFFT to real data is given in [27].
Application to convolution is made in [9], to the discrete
Hartley transform in [26], [9], to calculating the discrete
cosine transform in [28], and could be made to calculating
number theoretic transforms.
An improvement in operation count has been reported by Johnson
and Frigo [15] which involves a scaling of multiplying
factors. The improvement is small but until this result, it was
generally thought the Split-Radix FFT was optimal for total floating
point operation count.
The evaluation of any FFT algorithm starts with a count of
the real (or floating point) arithmetic. Table 1 gives the number of
real multiplications and additions required to calculate a length-N
FFT of complex data. Results of programs with one, two, three and
five butterflies are given to show the improvement that can be
expected from removing unnecessary multiplications and additions.
Results of radices two, four, eight and sixteen for the Cooley-Tukey
FFT as well as of the split-radix FFT are given to show the relative
merits of the various structures. Comparisons of these data should
be made with the table of counts for the PFA and WFTA programs in
The Prime Factor and Winograd Fourier Transform Algorithms: Evaluation of the PFA and WFTA. All programs use the four-multiply-two-add
complex multiply algorithm. A similar table can be developed for the
three-multiply-three-add algorithm, but the relative results are the
same.
From the table it is seen that a greater improvement is
obtained going from radix-2 to 4 than from 4 to 8 or 16. This is
partly because length 2 and 4 butterflies have no multiplications
while length 8, 16 and higher do. It is also seen that going from
one to two butterflies gives more improvement than going from two to
higher values. From an operation count point of view and from
practical experience, a three butterfly radix-4 or a two butterfly
radix-8 FFT is a good compromise. The radix-8 and 16 programs become
long, especially with multiple butterflies, and they give a limited
choice of transform length unless combined with some length 2 and 4
butterflies.
Table 1: Number of Real Multiplications and Additions for Complex
Single Radix FFTs
| N |
M1 |
M2 |
M3 |
M5 |
A1 |
A2 |
A3 |
A5 |
| 2 |
4 |
0 |
0 |
0 |
6 |
4 |
4 |
4 |
| 4 |
16 |
4 |
0 |
0 |
24 |
18 |
16 |
16 |
| 8 |
48 |
20 |
8 |
4 |
72 |
58 |
52 |
52 |
| 16 |
128 |
68 |
40 |
28 |
192 |
162 |
148 |
148 |
| 32 |
320 |
196 |
136 |
108 |
480 |
418 |
388 |
388 |
| 64 |
768 |
516 |
392 |
332 |
1152 |
1026 |
964 |
964 |
| 128 |
1792 |
1284 |
1032 |
908 |
2688 |
2434 |
2308 |
2308 |
| 256 |
4096 |
3076 |
2568 |
2316 |
6144 |
5634 |
5380 |
5380 |
| 512 |
9216 |
7172 |
6152 |
5644 |
13824 |
12802 |
12292 |
12292 |
| 1024 |
20480 |
16388 |
14344 |
13324 |
30720 |
28674 |
27652 |
27652 |
| 2048 |
45056 |
36868 |
32776 |
30732 |
67584 |
63490 |
61444 |
61444 |
| 4096 |
98304 |
81924 |
73736 |
69644 |
147456 |
139266 |
135172 |
135172 |
| 4 |
12 |
0 |
0 |
0 |
22 |
16 |
16 |
16 |
| 16 |
96 |
36 |
28 |
24 |
176 |
146 |
144 |
144 |
| 64 |
576 |
324 |
284 |
264 |
1056 |
930 |
920 |
920 |
| 256 |
3072 |
2052 |
1884 |
1800 |
5632 |
5122 |
5080 |
5080 |
| 1024 |
15360 |
11268 |
10588 |
10248 |
28160 |
26114 |
25944 |
25944 |
| 4096 |
73728 |
57348 |
54620 |
53256 |
135168 |
126978 |
126296 |
126296 |
| 8 |
32 |
4 |
4 |
4 |
66 |
52 |
52 |
52 |
| 64 |
512 |
260 |
252 |
248 |
1056 |
930 |
928 |
928 |
| 512 |
6144 |
4100 |
4028 |
3992 |
12672 |
11650 |
11632 |
11632 |
| 4096 |
65536 |
49156 |
48572 |
48280 |
135168 |
126978 |
126832 |
126832 |
| 16 |
80 |
20 |
20 |
20 |
178 |
148 |
148 |
148 |
| 256 |
2560 |
1540 |
1532 |
1528 |
5696 |
5186 |
5184 |
5184 |
| 4096 |
61440 |
45060 |
44924 |
44856 |
136704 |
128514 |
128480 |
128480 |
| 2 |
0 |
0 |
0 |
0 |
4 |
4 |
4 |
4 |
| 4 |
8 |
0 |
0 |
0 |
20 |
16 |
16 |
16 |
| 8 |
24 |
8 |
4 |
4 |
60 |
52 |
52 |
52 |
| 16 |
72 |
32 |
28 |
24 |
164 |
144 |
144 |
144 |
| 32 |
184 |
104 |
92 |
84 |
412 |
372 |
372 |
372 |
| 64 |
456 |
288 |
268 |
248 |
996 |
912 |
912 |
912 |
| 128 |
1080 |
744 |
700 |
660 |
2332 |
2164 |
2164 |
2164 |
| 256 |
2504 |
1824 |
1740 |
1656 |
5348 |
5008 |
5008 |
5008 |
| 512 |
5688 |
4328 |
4156 |
3988 |
12060 |
11380 |
11380 |
11380 |
| 1024 |
12744 |
10016 |
9676 |
9336 |
26852 |
25488 |
25488 |
25488 |
| 2048 |
28216 |
22760 |
22076 |
21396 |
59164 |
56436 |
56436 |
56436 |
| 4096 |
61896 |
50976 |
49612 |
48248 |
129252 |
123792 |
123792 |
123792 |
In Table 1 Mi and Ai refer to the number of real multiplications
and real additions used by an FFT with i separately written butterflies.
The first block has
the counts for Radix-2, the second for Radix-4, the third for Radix-8, the
fourth for Radix-16, and the last for the Split-Radix FFT. For the
split-radix FFT, M3 and A3 refer to the two- butterfly-plus program
and M5 and A5 refer to the three-butterfly program.
The first evaluations of FFT algorithms were in terms of the
number of real multiplications required as that was the slowest
operation on the computer and, therefore, controlled the execution
speed. Later with hardware arithmetic both the number of
multiplications and additions became important. Modern systems have
arithmetic speeds such that indexing and data transfer times become
important factors. Morris [19] has looked at some of these
problems and has developed a procedure called autogen to write
partially straight-line program code to significantly reduce
overhead and speed up FFT run times. Some hardware, such as the
TMS320 signal processing chip, has the multiply and add operations
combined. Some machines have vector instructions or have parallel
processors. Because the execution speed of an FFT depends not only
on the algorithm, but also on the hardware architecture and
compiler, experiments must be run on the system to be used.
In many cases the unscrambler or bit-reverse-counter requires
10% of the execution time, therefore, if possible, it should be
eliminated. In high-speed convolution where the convolution is done
by multiplication of DFT's, a decimation-in-frequency FFT can be
combined with a decimation-in-time inverse FFT to require no
unscrambler. It is also possible for a radix-2 FFT to do the
unscrambling inside the FFT but the structure is not very regular
[22], [14]. Special structures can be found in [22] and
programs for data that are real or have special symmetries are in
[6], [8], [27].
Although there can be significant differences in the
efficiencies of the various Cooley-Tukey and Split-Radix FFTs, the
number of multiplications and additions for all of them is on the
order of NlogNNlogN. That is fundamental to the class of algorithms.
The development of fast algorithms usually consists of using special
properties of the algorithm of interest to remove redundant or unnecessary
operations of a direct implementation. The discrete Fourier transform
(DFT) defined by
C
(
k
)
=
∑
n
=
0
N
-
1
x
(
n
)
W
N
n
k
C
(
k
)
=
∑
n
=
0
N
-
1
x
(
n
)
W
N
n
k
(16)where
W
N
=
e
-
j
2
π
/
N
W
N
=
e
-
j
2
π
/
N
(17)has enormous capacity for improvement of its arithmetic efficiency. Most
fast algorithms use the periodic and symmetric properties of its basis
functions. The classical Cooley-Tukey FFT and prime factor FFT [1]
exploit the periodic properties of the cosine and sine functions. Their
use of the periodicities to share and, therefore, reduce arithmetic
operations depends on the factorability of the length of the data to be
transformed. For highly composite lengths, the number of floating-point
operation is of order Nlog(N)Nlog(N) and for prime lengths it is of order
N2N2.
This section will look at an approach using the symmetric properties to
remove redundancies. This possibility has long been recognized
[13], [16], [25], [20] but has not been developed in any
systematic way in the open literature. We will develop an algorithm,
called the quick Fourier transform (QFT) [16], that will reduce
the number of floating point operations necessary to compute the DFT by a
factor of two to four over direct methods or Goertzel's method for prime
lengths. Indeed, it seems the best general algorithm available for prime
length DFTs. One can always do better by using Winograd type algorithms
but they must be individually designed for each length. The Chirp Z-transform can be used for longer lengths.
We use the fact that the cosine is an even function and the sine is an odd
function. The kernel of the DFT or the basis functions of the expansion is
given by
W
N
n
k
=
e
-
j
2
π
n
k
/
N
=
cos
(
2
π
n
k
/
N
)
+
j
sin
(
2
π
n
k
/
N
)
W
N
n
k
=
e
-
j
2
π
n
k
/
N
=
cos
(
2
π
n
k
/
N
)
+
j
sin
(
2
π
n
k
/
N
)
(18)which has an even real part and odd imaginary part. If the data x(n)x(n) are
decomposed into their real and imaginary parts and those into their even and
odd parts, we have
x
(
n
)
=
u
(
n
)
+
j
v
(
n
)
=
[
u
e
(
n
)
+
u
o
(
n
)
]
+
j
[
v
e
(
n
)
+
v
o
(
n
)
]
x
(
n
)
=
u
(
n
)
+
j
v
(
n
)
=
[
u
e
(
n
)
+
u
o
(
n
)
]
+
j
[
v
e
(
n
)
+
v
o
(
n
)
]
(19)where the even part of the real part of x(n)x(n) is given by
u
e
(
n
)
=
(
u
(
n
)
+
u
(
-
n
)
)
/
2
u
e
(
n
)
=
(
u
(
n
)
+
u
(
-
n
)
)
/
2
(20)and the odd part of the real part is
u
o
(
n
)
=
(
u
(
n
)
-
u
(
-
n
)
)
/
2
u
o
(
n
)
=
(
u
(
n
)
-
u
(
-
n
)
)
/
2
(21)with corresponding definitions of ve(n)ve(n) and vo(n)vo(n).
Using Convolution Algorithms: Equation 32 with a simpler notation, the DFT of Convolution Algorithms: Equation 29 becomes
C
(
k
)
=
∑
n
=
0
N
-
1
(
u
+
j
v
)
(
cos
-
j
sin
)
.
C
(
k
)
=
∑
n
=
0
N
-
1
(
u
+
j
v
)
(
cos
-
j
sin
)
.
(22)The sum over an integral number of periods of an odd function is zero and
the sum of an even function over half of the period is one half the sum
over the whole period. This causes Equation 16 and Equation 22 to become
C
(
k
)
=
∑
n
=
0
N
/
2
-
1
[
u
e
cos
+
v
o
sin
]
+
j
[
v
e
cos
-
v
o
sin
]
.
C
(
k
)
=
∑
n
=
0
N
/
2
-
1
[
u
e
cos
+
v
o
sin
]
+
j
[
v
e
cos
-
v
o
sin
]
.
(23)for k=0,1,2,⋯,N-1k=0,1,2,⋯,N-1.
The evaluation of the DFT using equation Equation 23 requires half as many
real multiplication and half as many real additions as evaluating it using
Equation 16 or Equation 22. We have exploited the symmetries of the sine and
cosine as functions of the time index nn. This is independent of whether
the length is composite or not. Another view of this formulation is that
we have used the property of associatively of multiplication and addition.
In other words, rather than multiply two data points by the same value of
a sine or cosine then add the results, one should add the data points
first then multiply the sum by the sine or cosine which requires one
rather than two multiplications.
Next we take advantage of the symmetries of the sine and cosine as
functions of the frequency index kk. Using these symmetries on Equation 23
gives
C
(
k
)
=
∑
n
=
0
N
/
2
-
1
[
u
e
cos
+
v
o
sin
]
+
j
[
v
e
cos
-
v
o
sin
]
C
(
k
)
=
∑
n
=
0
N
/
2
-
1
[
u
e
cos
+
v
o
sin
]
+
j
[
v
e
cos
-
v
o
sin
]
(24)
C
(
N
-
k
)
=
∑
n
=
0
N
/
2
-
1
[
u
e
cos
-
v
o
sin
]
+
j
[
v
e
cos
+
v
o
sin
]
.
C
(
N
-
k
)
=
∑
n
=
0
N
/
2
-
1
[
u
e
cos
-
v
o
sin
]
+
j
[
v
e
cos
+
v
o
sin
]
.
(25)for k=0,1,2,⋯,N/2-1k=0,1,2,⋯,N/2-1. This again reduces the number of
operations by a factor of two, this time because it calculates two output
values at a time. The first reduction by a factor of two is always
available. The second is possible only if both DFT values are needed. It
is not available if you are calculating only one DFT value. The above
development has not dealt with the details that arise with the difference
between an even and an odd length. That is straightforward.
If the length of the sequence to be transformed is even, there are further
symmetries that can be exploited. There will be four data values that are
all multiplied by plus or minus the same sine or cosine value. This means
a more complicated pre-addition process which is a generalization of the
simple calculation of the even and odd parts in Equation 20 and Equation 21
will reduce the size of the order N2N2 part of the algorithm by still
another factor of two or four. It the length is divisible by 4, the
process can be repeated. Indeed, it the length is a power of 2, one can
show this process is equivalent to calculating the DFT in terms of
discrete cosine and sine transforms [10], [11] with a resulting
arithmetic complexity of order Nlog(N)Nlog(N) and with a structure that is
well suited to real data calculations and pruning.
If the flow-graph of the Cooley-Tukey FFT is compared to the flow-graph of
the QFT, one notices both similarities and differences. Both progress in
stages as the length is continually divided by two. The Cooley-Tukey
algorithm uses the periodic properties of the sine and cosine to give the
familiar horizontal tree of butterflies. The parallel diagonal lines in
this graph represent the parallel stepping through the data in synchronism
with the periodic basis functions. The QFT has diagonal lines that
connect the first data point with the last, then the second with the next
to last, and so on to give a “star" like picture. This is interesting in
that one can look at the flow graph of an algorithm developed by some
completely different strategy and often find section with the parallel
structures and other parts with the star structure. These must be using
some underlying periodic and symmetric properties of the basis functions.
A careful analysis of the QFT shows that 2N2N additions are necessary to
compute the even and odd parts of the input data. This is followed by the
length N/2N/2 inner product that requires 4(N/2)2=N24(N/2)2=N2 real
multiplications and an equal number of additions. This is followed by the
calculations necessary for the simultaneous calculations of the first half
and last half of C(k)C(k) which requires 4(N/2)=2N4(N/2)=2N real additions. This
means the total QFT algorithm requires M2M2 real multiplications and N2+4NN2+4N real additions. These numbers along with those for the Goertzel
algorithm [3], [1], [20] and the direct calculation of the DFT are
included in the following table. Of the various order-N2N2 DFT
algorithms, the QFT seems to be the most efficient general method for an
arbitrary length NN.
Table 2
| Algorithm |
Real Mults. |
Real Adds |
Trig Eval. |
| |
|
|
|
| Direct DFT |
4
N
2
4
N
2
|
4
N
2
4
N
2
|
2
N
2
2
N
2
|
| Mod. 2nd Order Goertzel |
N
2
+
N
N
2
+
N
|
2
N
2
+
N
2
N
2
+
N
|
N
N
|
| QFT |
N
2
N
2
|
N
2
+
4
N
N
2
+
4
N
|
2
N
2
N
|
| |
Timings of the algorithms on a PC in milliseconds are given in the
following table.
Table 3
| Algorithm |
N
=
125
N
=
125
|
N
=
256
N
=
256
|
| |
|
|
| Direct DFT |
4.90 |
19.83 |
| Mod. 2O. Goertzel |
1.32 |
5.55 |
| QFT |
1.09 |
4.50 |
| Chirp + FFT |
1.70 |
3.52 |
| |
These timings track the floating point operation counts fairly well.
The QFT is a straight-forward DFT algorithm that uses all of the possible
symmetries of the DFT basis function with no requirements on the length
being composite. These ideas have been proposed before, but have not been
published or clearly developed by [16], [25], [24], [12].
It seems that the basic QFT is practical and useful as a general algorithm
for lengths up to a hundred or so. Above that, the chirp z-transform
[1] or other filter based methods will be superior. For special
cases and shorter lengths, methods based on Winograd's theories will
always be superior. Nevertheless, the QFT has a definite place in the
array of DFT algorithms and is not well known. A Fortran program is
included in the appendix.
It is possible, but unlikely, that further arithmetic reduction could be
achieved using the fact that WNWN has unity magnitude as was done in
second-order Goertzel algorithm. It is also possible that some way of
combining the Goertzel and QFT algorithm would have some advantages. A
development of a complete QFT decomposition of a DFT of length-2M2M
shows interesting structure [10], [11] and arithmetic complexity
comparable to average Cooley-Tukey FFTs. It does seem better suited to
real data calculations with pruning.
-
Burrus, C. S. and Parks, T. W. (1985). DFT/FFT and Convolution Algorithms. New York: John Wiley & Sons.
-
Brigham, E. Oran. (1988). The Fast Fourier Transform and Its Applications. [Expansion of the 1974 book]. Englewood Cliffs, NJ: Prentice-Hall.
-
Burrus, C. S. (1992). Goertzel's Algorithm. Unpublished Notes. ECE Dept., Rice University.
-
Cooley, James W. and Lewis, Peter A. W. and Welch, Peter D. (1967, June). Historical Notes on the Fast Fourier Transform. IEEE Transactions on Audio and Electroacoustics, 15, 260–262.
-
Cooley, J. W. and Tukey, J. W. (1965). An Algorithm for the Machine Calculation of Complex Fourier Series. Math. Computat., 19, 297–301.
-
DSP Committee, (Ed.). (1979). Digital Signal Processing II, selected reprints. New York: IEEE Press.
-
Duhamel, P. and Hollmann, H. (1984, January 5). Split Radix FFT Algorithm. Electronic Letters, 20(1), 14–16.
-
Duhamel, P. (1986, April). Implementation of `Split-Radix' FFT Algorithms for Complex, Real, and Real-Symmetric Data. [A shorter version appeared in the ICASSP-85 Proceedings, p. 20.6, March 1985]. IEEE Trans. on ASSP, 34, 285–295.
-
Duhamel, P. and Vetterli, M. (1986, April). Cyclic Convolution of Real Sequences: Hartley versus Fourier and New Schemes. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-86), 6.5.
-
Guo, H. and Sitton, G. A. and Burrus, C. S. (1994, April 19–22). The Quick Discrete Fourier Transform. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (p. III:445–448). IEEE ICASSP-94, Adelaide, Australia
-
Guo, H. and Sitton, G. A. and Burrus, C. S. (1998, February). The Quick Fourier Transform: an FFT based on Symmetries. IEEE Transactions on Signal Processing, 46(2), 335–341.
-
Heideman, M. T. (1985). private communication.
-
Heideman, M. T. and Johnson, D. H. and Burrus, C. S. (1985). Gauss and the History of the FFT. Archive for History of Exact Sciences, 34(3), 265–277.
-
Johnson, H. W. and Burrus, C. S. (1984, March). An In-Order, In-Place Radix-2 FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (p. 28A.2.1–4). San Diego, CA
-
Johnson, Steven G. and Frigo, Matteo. (2007, January). A Modified Split-Radix FFT with Fewer Arithmetic Operations. IEEE Transactions on Signal Processing, 55(1), 111–119.
-
Kohne, John F. (1980, July). A Quick Fourier Transform Algorithm. (TR-1723). Technical report. Naval Electronics Laboratory Center.
-
Markel, J. D. (1971, June). FFT Pruning. IEEE Trans on Audio and Electroacoustics, 19(4), 305–311.
-
Martens, J. B. (1984, August). Recursive Cyclotomic Factorization – A New Algorithm for Calculating the Discrete Fourier Transform. IEEE Trans. on ASSP, 32(4), 750–762.
-
Morris, L. R. (1982, 1983). Digital Signal Processing Software. Toronto, Canada: DSPSW, Inc.
-
Oppenheim, A. V. and Schafer, R. W. (1989). Discrete-Time Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
-
Oppenheim, A. V. and Schafer, R. W. (1999). Discrete-Time Signal Processing. (Second). [Earlier editions in 1975 and 1989]. Englewood Cliffs, NJ: Prentice-Hall.
-
Rabiner, L. R. and Gold, B. (1975). Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
-
Sorensen, H. V. and Heideman, M. T. and Burrus, C. S. (1986, February). On Computing the Split–Radix FFT. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 152–156.
-
Sitton, G. A. (1985). The QFT Algorithm.
-
Sitton, G. A. (1991, June). The QDFT: An Efficient Method for Short Prime Length DFTs.
-
Sorensen, H. V. and Jones, D. L. and Burrus, C. S. and Heideman, M. T. (1985, October). On Computing the Discrete Hartley Transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(5), 1231–1238.
-
Sorensen, H. V. and Jones, D. L. and Heideman, M. T. and Burrus, C. S. (1987, June). Real Valued Fast Fourier Transform Algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6), 849–863.
-
Vetterli, Martin and Nussbaumer, H. J. (1984, August). Simple FFT and DCT Algorithms with Reduced Number of Operations. Signal Processing, 6(4), 267–278.
-
Yavne, R. (1968). An Economical Method for Calculating the Discrete Fourier Transform. In Fall Joint Computer Conference, AFIPS Conference Proceedings. (Vol. 33 part 1, p. 115–125).
"The Fast Fourier Transform (FFT) is a landmark algorithm used in fields ranging from signal processing to high-performance computing. First popularized by two American scientists in 1965, the […]"