Precompute twiddle factors
The
twiddle factor, or
W
N
k
=ⅇ-ⅈ2πkN
W
N
k
2
k
N
,
terms
that multiply the intermediate data in the
FFT algorithms consist of cosines and sines
that each take the equivalent of several multiplies to compute.
However, at most
NN unique
twiddle factors can appear in any FFT or DFT algorithm.
(For example, in the
radix-2 decimation-in-time FFT,
only
N2
N
2
twiddle factors
∀k,k=012…N2-1:
W
N
k
k
k
0
1
2
…
N
2
1
W
N
k
are used.)
These twiddle factors can be precomputed once and stored
in an array in computer memory, and accessed in the FFT
algorithm by
table lookup.
This simple technique yields very substantial savings and
is almost always used in practice.
Compiler-friendly programming
On most computers, only some of the total computation time
of an FFT is spent performing the FFT butterfly computations;
determining indices, loading and storing data, computing
loop parameters and other operations consume the majority
of cycles.
Careful programming that allows the compiler to generate
efficient code can make a several-fold improvement in the
run-time of an FFT.
The best choice of radix in terms of program speed may depend more on characteristics
of the hardware (such as the number of CPU registers) or
compiler than on the exact number of computations.
Very often the manufacturer's library codes are carefully
crafted by experts who know intimately both the hardware
and compiler architecture and how to get the most performance
out of them, so use of well-written FFT libraries is
generally recommended.
Certain freely available programs and libraries are also
very good.
Perhaps the best current general-purpose library is the
FFTW package;
information can be found at
http://www.fftw.org.
A paper by
Frigo and Johnson describes many of the key issues in developing
compiler-friendly code.
Program in assembly language
While compilers continue to improve, FFT programs written directly
in the assembly language of a specific machine are often
several times faster than the best compiled code.
This is particularly true for DSP microprocessors, which have
special instructions for accelerating FFTs that compilers don't use.
(I have myself seen differences of up to 26 to 1 in favor of assembly!)
Very often, FFTs in the manufacturer's or high-performance third-party libraries
are hand-coded in assembly.
For DSP microprocessors, the codes developed by
Meyer, Schuessler, and Schwarz
are perhaps the best ever developed;
while the particular processors are now obsolete, the techniques
remain equally relevant today.
Most DSP processors provide special instructions and a
hardware design favoring the radix-2 decimation-in-time algorithm,
which is thus generally fastest on these machines.
Special hardware
Some processors have special hardware accelerators or co-processors
specifically designed to accelerate FFT computations.
For example,
AMI Semiconductor's Toccata ultra-low-power DSP microprocessor family,
which is widely used in digital hearing aids, have on-chip FFT accelerators; it is always faster and more power-efficient to use such accelerators and whatever radix they prefer.
In a surprising number of applications, almost all of the computations
are FFTs.
A number of special-purpose chips are designed to specifically
compute FFTs, and are used in specialized high-performance
applications such as radar systems.
Other systems, such as
OFDM-based
communications receivers, have special FFT hardware built
into the digital receiver circuit.
Such hardware can run many times faster, with much less power
consumption, than FFT programs on general-purpose processors.
Effective memory management
Cache misses or excessive data movement between registers and memory
can greatly slow down an FFT computation.
Efficient programs such as the
FFTW package
are carefully designed to minimize these inefficiences.
In-place algorithms reuse the data
memory throughout the transform, which can reduce cache misses for
longer lengths.
Real-valued FFTs
FFTs of real-valued signals require only half as many computations as with complex-valued data.
There are several methods for reducing the computation,
which are described in more detail in
Sorensen et al.
- Use DFT symmetry properties to do two real-valued DFTs at once with one FFT program
- Perform one stage of the radix-2 decimation-in-time decomposition
and compute the two length-
N2
N
2
DFTs using the above approach.
- Use a direct real-valued FFT algorithm; see H.V. Sorensen
et.al.
Special cases
Occasionally only certain DFT frequencies are needed,
the input signal values are mostly zero, the signal
is real-valued (as discussed above), or other special
conditions exist for which faster algorithms can be
developed.
Sorensen and Burrus
describe slightly faster algorithms for
pruned
or
zero-padded data.
Goertzel's algorithm is useful
when only a few DFT outputs are needed.
The
running FFT can be faster
when DFTs of highly overlapped blocks of data are needed,
as in a
spectrogram.
Higher-radix algorithms
Higher-radix algorithms, such as the
radix-4, radix-8, or
split-radix FFTs,
require fewer computations and can produce modest but worthwhile savings.
Even the
split-radix FFT
reduces the multiplications by only 33% and the additions
by a much lesser amount relative to the
radix-2 FFTs;
significant improvements in program speed are often
due to implicit
loop-unrolling or other compiler benefits
than from the computational reduction itself!
Fast bit-reversal
Bit-reversing the input
or output data can consume several percent of the total
run-time of an FFT program.
Several fast bit-reversal algorithms have been developed
that can reduce this to two percent or less, including
the method published by
D.M.W. Evans.
Trade additions for multiplications
When FFTs first became widely used, hardware multipliers
were relatively rare on digital computers, and multiplications
generally required many more cycles than additions.
Methods to reduce multiplications, even at the expense
of a substantial increase in additions, were often beneficial.
The
prime factor algorithms
and the
Winograd Fourier transform algorithms,
which required fewer multiplies and considerably more additions
than the
power-of-two-length algorithms,
were developed during this period.
Current processors generally have high-speed pipelined
hardware multipliers, so trading multiplies for additions
is often no longer beneficial.
In particular, most machines now support single-cycle
multiply-accumulate (MAC) operations, so balancing the
number of multiplies and adds and combining them into
single-cycle MACs generally results in the fastest code.
Thus, the prime-factor and Winograd FFTs are rarely used
today unless the application requires FFTs of a specific length.
It is possible to implement a complex multiply with
3 real multiplies and 5 real adds rather than the usual
4 real multiplies and 2 real adds:
C+ⅈSX+ⅈY=CX-SY+ⅈCY+SX
C
S
X
Y
C
X
S
Y
C
Y
S
X
but alernatively
Z=CX-Y
Z
C
X
Y
D=C+S
D
C
S
E=C-S
E
C
S
CX-SY=EY+Z
C
X
S
Y
E
Y
Z
CY+SX=DX-Z
C
Y
S
X
D
X
Z
In an FFT,
DD and
EE
come entirely from the twiddle factors,
so they can be precomputed and stored in a look-up table.
This reduces the cost of the complex twiddle-factor multiply
to 3 real multiplies and 3 real adds, or one less and one more,
respectively, than the conventional 4/2 computation.
Special butterflies
Certain twiddle factors,
namely
W
N
0
=1
W
N
0
1
,
W
N
N
2
W
N
N
2
,
W
N
N
4
W
N
N
4
,
W
N
N
8
W
N
N
8
,
W
N
3
N
8
W
N
3
N
8
, etc., can be implemented
with no additional operations, or with fewer real operations than
a general complex multiply.
Programs that specially implement such butterflies in the most
efficient manner throughout the algorithm can reduce the
computational cost by up to several NN
multiplies and additions in a length-NN
FFT.
Practical Perspective
When optimizing FFTs for speed, it can be important to maintain
perspective on the benefits that can be expected from
any given optimization.
The following list categorizes the various techniques by potential
benefit;
these will be somewhat situation- and machine-dependent, but clearly
one should begin with the most significant and put the most effort
where the pay-off is likely to be largest.
Methods to speed up computation of DFTs- Tremendous Savings -
-
FFT
(
Nlog2N
N
2
N
savings)
- Substantial Savings -
(
≥2
2
)
- Table lookup of cosine/sine
- Compiler tricks/good programming
- Assembly-language programming
- Special-purpose hardware
- Real-data FFT for real data (factor of 2)
- Special cases
-
Minor Savings -
- radix-4, split-radix (-10% - +30%)
- special butterflies
-
3-real-multiplication complex multiply
- Fast bit-reversal (up to 6%)
fact: On general-purpose machines, computation is only part
of the total run time. Address generation, indexing, data
shuffling, and memory access take up much or most of the cycles.
fact: A well-written
radix-2 program will run much faster than a poorly written
split-radix program!
References-
D.M.W. Evans. (1987, August). An improved digit-reversal permutation algorithm for the fast Fourier and Hartley transforms. IEEE Transactions on Signal Processing, 35(8), 1120-1125.
-
M. Frigo and S.G. Johnson. (2005, February). The design and implementation of FFTW3. Proceedings of the IEEE, 93(2), 216-231.
-
R. Meyer, H.W. Schuessler, and K. Schwarz. (1990). FFT Implmentation on DSP chips - Theory and Practice. IEEE International Conference on Acoustics, Speech, and Signal Processing.
-
H.V. Sorensen, D.L Jones, M.T. Heideman, and C.S. Burrus. (1987, June). Real-valued fast Fourier transform algorithms. IEEE Transactions on Signal Processing, 35(6), 849-863.
-
H.V. Sorensen and C.S. Burrus. (1993, March). Efficient computation of the DFT with only a subset of input or output points. IEEE Transactions on Signal Processing, 41(3), 1184-1200.