by Steven G. Johnson (Department of Mathematics, Massachusetts Institute of Technology) and Matteo Frigo (Cilk Arts, Inc.)
Although there are a wide range of fast Fourier transform (FFT)
algorithms, involving a wealth of mathematics from number theory to
polynomial algebras, the vast majority of FFT implementations in
practice employ some variation on the Cooley-Tukey
algorithm Entry 9. The Cooley-Tukey algorithm can be
derived in two or three lines of elementary algebra. It can be
implemented almost as easily, especially if only power-of-two sizes
are desired; numerous popular textbooks list short FFT subroutines
for power-of-two sizes, written in the language du jour. The
implementation of the Cooley-Tukey algorithm, at least, would
therefore seem to be a long-solved problem. In this chapter, however,
we will argue that matters are not as straightforward as they might
appear.
For many years, the primary route to improving upon the Cooley-Tukey
FFT seemed to be reductions in the count of arithmetic operations,
which often dominated the execution time prior to the ubiquity of fast
floating-point hardware (at least on non-embedded processors). Therefore, great effort was expended towards
finding new algorithms with reduced arithmetic
counts Entry 13, from Winograd's method to achieve
Θ(n)Θ(n) multiplications (at the cost of many more
additions) Entry 54, Entry 22, Entry 12, Entry 13 to the
split-radix variant on Cooley-Tukey that long achieved the lowest
known total count of additions and multiplications for power-of-two
sizes Entry 56, Entry 10, Entry 53, Entry 32, Entry 13 (but
was recently improved upon Entry 27, Entry 31). The question of
the minimum possible arithmetic count continues to be of fundamental
theoretical interest—it is not even known whether better than
Θ(nlogn)Θ(nlogn) complexity is possible, since Ω(nlogn)Ω(nlogn)
lower bounds on the count of additions have only been proven subject
to restrictive assumptions about the
algorithms Entry 33, Entry 36, Entry 37. Nevertheless,
the difference in the number of arithmetic operations, for
power-of-two sizes nn, between the 1965 radix-2 Cooley-Tukey
algorithm (∼5nlog2n∼5nlog2n Entry 9) and the currently
lowest-known arithmetic count (∼349nlog2n∼349nlog2n Entry 27, Entry 31) remains only about 25%.
And yet there is a vast gap between this basic mathematical theory and
the actual practice—highly optimized FFT packages are often an
order of magnitude faster than the textbook subroutines, and the
internal structure to achieve this performance is radically different
from the typical textbook presentation of the “same” Cooley-Tukey
algorithm. For example, Figure 1 plots the ratio of benchmark
speeds between a highly optimized FFT Entry 15, Entry 16 and a
typical textbook radix-2 implementation Entry 38, and the
former is faster by a factor of 5–40 (with a larger ratio as nn
grows). Here, we will consider some of the reasons for this
discrepancy, and some techniques that can be used to address the
difficulties faced by a practical high-performance FFT
implementation.
In particular, in this chapter we will discuss some of the lessons
learned and the strategies adopted in the FFTW library. FFTW Entry 15, Entry 16 is a widely used free-software library that computes the
discrete Fourier transform (DFT) and its various special cases.
Its performance is competitive even with manufacturer-optimized programs Entry 16,
and this performance is portable thanks the structure of the
algorithms employed, self-optimization techniques, and highly
optimized kernels (FFTW's codelets) generated by a
special-purpose compiler.
This chapter is structured as follows. First "Review of the Cooley-Tukey FFT", we
briefly review the basic ideas behind the Cooley-Tukey algorithm and
define some common terminology, especially focusing on the many
degrees of freedom that the abstract algorithm allows to
implementations. Next, in "Goals and Background of the FFTW Project", we provide some context for FFTW's development and stress that performance, while it receives the most publicity, is not necessarily the most important consideration in the implementation of a library of this sort. Third, in "FFTs and the Memory Hierarchy", we consider a basic
theoretical model of the computer memory hierarchy and its impact on
FFT algorithm choices: quite general considerations push
implementations towards large radices and explicitly recursive
structure. Unfortunately, general considerations are not sufficient
in themselves, so we will explain in "Adaptive Composition of FFT Algorithms" how FFTW
self-optimizes for particular machines by selecting its algorithm at
runtime from a composition of simple algorithmic steps. Furthermore,
"Generating Small FFT Kernels" describes the utility and the principles of automatic
code generation used to produce the highly optimized building blocks
of this composition, FFTW's codelets. Finally, we will
briefly consider an important non-performance issue, in
"Numerical Accuracy in FFTs".
The (forward, one-dimensional)
discrete Fourier transform (DFT) of an array XX of nn complex numbers is
the array YY given by
Y
[
k
]
=
∑
ℓ
=
0
n
-
1
X
[
ℓ
]
ω
n
ℓ
k
,
Y
[
k
]
=
∑
ℓ
=
0
n
-
1
X
[
ℓ
]
ω
n
ℓ
k
,
(1)
where 0≤k<n0≤k<n and ωn=exp(-2πi/n)ωn=exp(-2πi/n) is a
primitive root of unity. Implemented directly, Equation 1 would
require Θ(n2)Θ(n2) operations; fast Fourier transforms are O(nlogn)O(nlogn) algorithms to compute the same result. The most important
FFT (and the one primarily used in FFTW) is known as the
“Cooley-Tukey” algorithm, after the two authors who rediscovered and
popularized it in 1965 Entry 9, although it had been
previously known as early as 1805 by Gauss as well as by later
re-inventors Entry 24. The basic idea behind this FFT is
that a DFT of a composite size n=n1n2n=n1n2 can be re-expressed in
terms of smaller DFTs of sizes n1n1 and n2n2—essentially, as a
two-dimensional DFT of size n1×n2n1×n2 where the output is
transposed. The choices of factorizations of nn, combined
with the many different ways to implement the data re-orderings of the
transpositions, have led to numerous implementation strategies for the
Cooley-Tukey FFT, with many variants distinguished by their own
names Entry 13, Entry 52. FFTW implements a space of
many such variants, as described in "Adaptive Composition of FFT Algorithms", but here
we derive the basic algorithm, identify its key features, and outline
some important historical variations and their relation to FFTW.
The Cooley-Tukey algorithm can be derived as follows. If nn can be
factored into n=n1n2n=n1n2, Equation 1 can be rewritten by
letting ℓ=ℓ1n2+ℓ2ℓ=ℓ1n2+ℓ2 and k=k1+k2n1k=k1+k2n1. We then have:
Y
[
k
1
+
k
2
n
1
]
=
∑
ℓ
2
=
0
n
2
-
1
∑
ℓ
1
=
0
n
1
-
1
X
[
ℓ
1
n
2
+
ℓ
2
]
ω
n
1
ℓ
1
k
1
ω
n
ℓ
2
k
1
ω
n
2
ℓ
2
k
2
,
Y
[
k
1
+
k
2
n
1
]
=
∑
ℓ
2
=
0
n
2
-
1
∑
ℓ
1
=
0
n
1
-
1
X
[
ℓ
1
n
2
+
ℓ
2
]
ω
n
1
ℓ
1
k
1
ω
n
ℓ
2
k
1
ω
n
2
ℓ
2
k
2
,
(2)
where k1,2=0,...,n1,2-1k1,2=0,...,n1,2-1. Thus, the algorithm computes
n2n2 DFTs of size n1n1 (the inner sum), multiplies the result by
the so-called Entry 21 twiddle factors ωnℓ2k1ωnℓ2k1, and finally computes n1n1 DFTs of size
n2n2 (the outer sum). This decomposition is then continued
recursively. The literature uses the term radix to describe
an n1n1 or n2n2 that is bounded (often constant); the small DFT
of the radix is traditionally called a butterfly.
Many well-known variations are distinguished by the radix alone. A
decimation in time (DIT) algorithm uses n2n2 as
the radix, while a decimation in frequency (DIF)
algorithm uses n1n1 as the radix. If multiple radices are used,
e.g. for nn composite but not a prime power, the algorithm is called
mixed radix. A peculiar blending of radix 2 and 4 is called
split radix, which was proposed to minimize the count of
arithmetic
operations Entry 56, Entry 10, Entry 53, Entry 32, Entry 13
although it has been superseded in this
regard Entry 27, Entry 31. FFTW implements both DIT and
DIF, is mixed-radix with radices that are adapted to the
hardware, and often uses much larger radices (e.g. radix 32) than were
once common. On the other end of the scale, a “radix” of roughly
nn has been called a four-step FFT algorithm (or
six-step, depending on how many transposes one
performs) Entry 2; see "FFTs and the Memory Hierarchy" for some theoretical and
practical discussion of this algorithm.
A key difficulty in implementing the Cooley-Tukey FFT is that the
n1n1 dimension corresponds to discontiguous inputs ℓ1ℓ1 in XX but
contiguous outputs k1k1 in YY, and vice-versa for n2n2. This is a
matrix transpose for a single decomposition stage, and the composition
of all such transpositions is a (mixed-base) digit-reversal
permutation (or bit-reversal, for radix 2). The resulting
necessity of discontiguous memory access and data re-ordering hinders
efficient use of hierarchical memory architectures (e.g., caches), so
that the optimal execution order of an FFT for given hardware is
non-obvious, and various approaches have been proposed.
One ordering distinction is between recursion and iteration. As
expressed above, the Cooley-Tukey algorithm could be thought of as
defining a tree of smaller and smaller DFTs, as depicted in Figure 2;
for example, a textbook radix-2 algorithm would divide size nn into
two transforms of size n/2n/2, which are divided into four transforms
of size n/4n/4, and so on until a base case is reached (in principle,
size 1). This might naturally suggest a recursive implementation in
which the tree is traversed “depth-first” as in Figure 2(right) and the
algorithm of Figure 3—one size n/2n/2 transform is solved completely
before processing the other one, and so on. However, most traditional
FFT implementations are non-recursive (with rare
exceptions Entry 46) and traverse the tree
“breadth-first” Entry 52 as in Figure 2(left)—in the radix-2
example, they would perform nn (trivial) size-1 transforms, then
n/2n/2 combinations into size-2 transforms, then n/4n/4 combinations
into size-4 transforms, and so on, thus making log2nlog2n passes over
the whole array. In contrast, as we discuss in
"Discussion", FFTW employs an explicitly
recursive strategy that encompasses both depth-first and
breadth-first styles, favoring the former since it has some
theoretical and practical advantages as discussed in "FFTs and the Memory Hierarchy".
A second ordering distinction lies in how the digit-reversal is
performed. The classic approach is a single, separate digit-reversal
pass following or preceding the arithmetic computations; this approach
is so common and so deeply embedded into FFT lore that many
practitioners find it difficult to imagine an FFT without an
explicit bit-reversal stage. Although this pass requires only O(n)O(n)
time Entry 29, it can still be non-negligible, especially if the
data is out-of-cache; moreover, it neglects the possibility that data
reordering during the transform may improve memory locality. Perhaps
the oldest alternative is the Stockham auto-sort
FFT Entry 47, Entry 52, which transforms back and forth
between two arrays with each butterfly, transposing one digit each
time, and was popular to improve contiguity of access for vector
computers Entry 48. Alternatively, an explicitly
recursive style, as in FFTW, performs the digit-reversal implicitly
at the “leaves” of its computation when operating out-of-place
(see section "Discussion"). A simple example of this
style, which computes in-order output using an out-of-place radix-2
FFT without explicit bit-reversal, is shown in the algorithm of Figure 3
[corresponding to Figure 2(right)]. To operate in-place
with O(1)O(1) scratch storage, one can interleave small matrix
transpositions with the
butterflies Entry 26, Entry 49, Entry 40, Entry 23, and a
related strategy in FFTW Entry 16 is briefly described by
"Discussion".
Finally, we should mention that there are many FFTs entirely
distinct from Cooley-Tukey. Three notable such algorithms are the
prime-factor algorithm for gcd(n1,n2)=1gcd(n1,n2)=1 Entry 35, along with Rader's Entry 41 and
Bluestein's Entry 4, Entry 43, Entry 35 algorithms
for prime nn. FFTW implements the first two in its codelet
generator for hard-coded nn "Generating Small FFT Kernels" and the latter two for
general prime nn (sections "Plans for prime sizes" and "Goals and Background of the FFTW Project"). There is also the Winograd
FFT Entry 54, Entry 22, Entry 12, Entry 13, which
minimizes the number of multiplications at the expense of a large
number of additions; this trade-off is not beneficial on current
processors that have specialized hardware multipliers.
The FFTW project, begun in 1997 as a side project of the authors Frigo and Johnson as graduate students at MIT, has gone through several major revisions, and as of 2008 consists of more than 40,000 lines of code. It is difficult to measure the popularity of a free-software package, but (as of 2008) FFTW has been cited in over 500 academic papers, is used in hundreds of shipping free and proprietary software packages, and the authors have received over 10,000 emails from users of the software. Most of this chapter focuses on performance of FFT implementations, but FFTW would probably not be where it is today if that were the only consideration in its design. One of the key factors in FFTW's success seems to have been its flexibility in addition to its performance. In fact, FFTW is probably
the most flexible DFT library available:
- FFTW is written in portable C and runs well on many
architectures and operating systems.
- FFTW computes DFTs in O(nlogn)O(nlogn) time for any
length nn. (Most other DFT implementations are either
restricted to a subset of sizes or they become Θ(n2)Θ(n2) for
certain values of nn, for example when nn is prime.)
- FFTW imposes no restrictions on the rank (dimensionality) of
multi-dimensional transforms. (Most other implementations are
limited to one-dimensional, or at most two- and three-dimensional
data.)
- FFTW supports multiple and/or strided DFTs; for example,
to transform a 3-component vector field or a portion of a
multi-dimensional array. (Most implementations support only a
single DFT of contiguous data.)
- FFTW supports DFTs of real data, as well as of real
symmetric/anti-symmetric data (also called discrete cosine/sine
transforms).
Our design philosophy has been to first define the most general
reasonable functionality, and then to obtain the highest possible
performance without sacrificing this generality. In this section, we
offer a few thoughts about why such flexibility has proved important,
and how it came about that FFTW was designed in this way.
FFTW's generality is partly a consequence of the fact the FFTW
project was started in response to the needs of a real application for
one of the authors (a spectral solver for Maxwell's
equations Entry 28), which from the beginning had to run
on heterogeneous hardware. Our initial application required
multi-dimensional DFTs of three-component vector fields (magnetic
fields in electromagnetism), and so right away this meant: (i)
multi-dimensional FFTs; (ii) user-accessible loops of FFTs of
discontiguous data; (iii) efficient support for non-power-of-two sizes
(the factor of eight difference between n×n×nn×n×n and
2n×2n×2n2n×2n×2n was too much to tolerate); and (iv) saving a
factor of two for the common real-input case was desirable. That is,
the initial requirements already encompassed most of the features
above, and nothing about this application is particularly unusual.
Even for one-dimensional DFTs, there is a common misperception that
one should always choose power-of-two sizes if one cares about
efficiency. Thanks to FFTW's code generator (described in "Generating Small FFT Kernels"), we could afford to
devote equal optimization effort to any nn with small factors (2, 3,
5, and 7 are good), instead of mostly optimizing powers of two like
many high-performance FFTs. As a result, to pick a typical example
on the 3 GHz Core Duo processor of Figure 1, n=3600=24·32·52n=3600=24·32·52 and n=3840=28·3·5n=3840=28·3·5 both execute faster
than n=4096=212n=4096=212. (And if there are factors one particularly
cares about, one can generate code for them too.)
One initially missing feature was efficient support for large prime
sizes; the conventional wisdom was that large-prime algorithms were
mainly of academic interest, since in real applications (including
ours) one has enough freedom to choose a highly composite transform
size. However, the prime-size algorithms are fascinating, so we
implemented Rader's O(nlogn)O(nlogn) prime-nn algorithm Entry 41 purely for fun,
including it in FFTW 2.0 (released in 1998) as a bonus feature. The response was
astonishingly positive—even though users are (probably) never
forced by their application to compute a prime-size DFT, it
is rather inconvenient to always worry that collecting an unlucky
number of data points will slow down one's analysis by a factor of a
million. The prime-size algorithms are certainly slower than
algorithms for nearby composite sizes, but in interactive
data-analysis situations the difference between 1 ms and 10 ms means little,
while educating users to avoid large prime factors is hard.
Another form of flexibility that deserves comment has to do with a
purely technical aspect of computer software. FFTW's
implementation involves some unusual language choices internally
(the FFT-kernel generator, described in "Generating Small FFT Kernels",
is written in Objective Caml, a functional language
especially suited for compiler-like programs), but its user-callable
interface is purely in C with lowest-common-denominator datatypes
(arrays of floating-point values). The advantage of this is that
FFTW can be (and has been) called from almost any other programming
language, from Java to Perl to Fortran 77. Similar
lowest-common-denominator interfaces are apparent in many other
popular numerical libraries, such as LAPACK Entry 1. Language
preferences arouse strong feelings, but this technical constraint means
that modern programming dialects are best hidden from view for a
numerical library.
Ultimately, very few scientific-computing applications should have
performance as their top priority. Flexibility is often far more important,
because one wants to be limited only by one's imagination, rather than
by one's software, in the kinds of problems that can be studied.
There are many complexities of computer architectures that impact the
optimization of FFT implementations, but one of the most pervasive
is the memory hierarchy. On any modern general-purpose computer,
memory is arranged into a hierarchy of storage devices with increasing
size and decreasing speed: the fastest and smallest memory being the
CPU registers, then two or three levels of cache, then the main-memory
RAM, then external storage such as hard disks. Most of these levels are managed automatically by
the hardware to hold the most-recently-used data from the next level
in the hierarchy.
There are many complications, however, such as limited cache
associativity (which means that certain locations in memory cannot be
cached simultaneously) and cache lines (which optimize the cache for
contiguous memory access), which are reviewed in numerous textbooks on
computer architectures. In this section, we focus on the
simplest abstract principles of memory hierarchies in order to grasp
their fundamental impact on FFTs.
Because access to memory is in many cases the slowest part of the
computer, especially compared to arithmetic, one wishes to load as
much data as possible in to the faster levels of the hierarchy, and
then perform as much computation as possible before going back to the
slower memory devices. This is called temporal locality: if a given datum is used more than once, we arrange the
computation so that these usages occur as close together as possible
in time.
To understand temporal-locality strategies at a basic level, in this
section we will employ an idealized model of a cache in a two-level
memory hierarchy, as defined in Entry 19.
This
ideal cache stores ZZ data items from main memory
(e.g. complex numbers for our purposes): when the processor loads a
datum from memory, the access is quick if the datum is already in the
cache (a cache hit) and slow otherwise (a cache
miss, which requires the datum to be fetched into the cache). When a
datum is loaded into the cache, it must replace
some other datum, and the ideal-cache model assumes that the optimal
replacement strategy is used Entry 3: the new datum replaces
the datum that will not be needed for the longest time in the future;
in practice, this can be simulated to within a factor of two by
replacing the least-recently used datum Entry 19, but ideal
replacement is much simpler to analyze. Armed with this ideal-cache
model, we can now understand some basic features of FFT
implementations that remain essentially true even on real cache
architectures. In particular, we want to know the cache
complexity, the number Q(n;Z)Q(n;Z) of cache misses for an FFT of size
nn with an ideal cache of size ZZ, and what algorithm choices reduce
this complexity.
First, consider a textbook radix-2 algorithm, which divides nn by 2
at each stage and operates breadth-first as in
Figure 2(left), performing all butterflies of a given
size at a time. If n>Zn>Z, then each pass over the array incurs
Θ(n)Θ(n) cache misses to reload the data, and there are log2nlog2n
passes, for Θ(nlog2n)Θ(nlog2n) cache misses in total—no temporal
locality at all is exploited!
One traditional solution to this problem is blocking: the
computation is divided into maximal blocks that fit into the cache,
and the computations for each block are completed before moving on to
the next block. Here, a block of ZZ numbers can fit into the
cache (not
including storage for twiddle factors and so on), and thus the natural
unit of computation is a sub-FFT of size ZZ. Since each of these
blocks involves Θ(ZlogZ)Θ(ZlogZ) arithmetic operations, and there
are Θ(nlogn)Θ(nlogn) operations overall, there must be
Θ(nZlogZn)Θ(nZlogZn) such blocks. More explicitly, one
could use a radix-ZZ Cooley-Tukey algorithm, breaking nn down by
factors of ZZ [or Θ(Z)Θ(Z)] until a size ZZ is reached: each
stage requires n/Zn/Z blocks, and there are logZnlogZn stages, again
giving Θ(nZlogZn)Θ(nZlogZn) blocks overall. Since each
block requires ZZ cache misses to load it into cache, the cache
complexity QbQb of such a blocked algorithm is
Q
b
(
n
;
Z
)
=
Θ
(
n
log
Z
n
)
.
Q
b
(
n
;
Z
)
=
Θ
(
n
log
Z
n
)
.
(3)
In fact, this complexity is rigorously optimal for Cooley-Tukey
FFT algorithms Entry 25, and immediately points us towards
large radices (not radix 2!) to exploit caches effectively in
FFTs.
However, there is one shortcoming of any blocked FFT algorithm: it
is cache aware, meaning that the implementation depends
explicitly on the cache size ZZ. The implementation must be modified
(e.g. changing the radix) to adapt to different machines as the cache
size changes. Worse, as mentioned above, actual machines have
multiple levels of cache, and to exploit these one must perform
multiple levels of blocking, each parameterized by the corresponding
cache size. In the above example, if there were a smaller and faster
cache of size z<Zz<Z, the size-ZZ sub-FFTs should themselves be
performed via radix-zz Cooley-Tukey using blocks of size zz. And so
on. There are two paths out of these difficulties: one is
self-optimization, where the implementation automatically adapts
itself to the hardware (implicitly including any cache sizes), as
described in "Adaptive Composition of FFT Algorithms"; the other is to exploit
cache-oblivious algorithms. FFTW employs both of these
techniques.
The goal of cache-obliviousness is to structure the algorithm so that
it exploits the cache without having the cache size as a parameter:
the same code achieves the same asymptotic cache complexity regardless
of the cache size ZZ. An optimal cache-oblivious algorithm
achieves the optimal cache complexity (that is, in an
asymptotic sense, ignoring constant factors). Remarkably, optimal
cache-oblivious algorithms exist for many problems, such as matrix
multiplication, sorting, transposition, and FFTs Entry 19.
Not all cache-oblivious algorithms are optimal, of course—for
example, the textbook radix-2 algorithm discussed above is
“pessimal” cache-oblivious (its cache complexity is independent of
ZZ because it always achieves the worst case!).
For instance, Figure 2(right) and the algorithm of Figure 3 shows a way to
obliviously exploit the cache with a radix-2 Cooley-Tukey algorithm,
by ordering the computation depth-first rather than breadth-first.
That is, the DFT of size nn is divided into two DFTs of size
n/2n/2, and one DFT of size n/2n/2 is completely finished
before doing any computations for the second DFT of size
n/2n/2. The two subtransforms are then combined using n/2n/2 radix-2
butterflies, which requires a pass over the array and (hence nn cache
misses if n>Zn>Z). This process is repeated recursively until a
base-case (e.g. size 2) is reached. The cache complexity Q2(n;Z)Q2(n;Z)
of this algorithm satisfies the recurrence
Q
2
(
n
;
Z
)
=
n
n
≤
Z
2
Q
2
(
n
/
2
;
Z
)
+
Θ
(
n
)
otherwise
.
Q
2
(
n
;
Z
)
=
n
n
≤
Z
2
Q
2
(
n
/
2
;
Z
)
+
Θ
(
n
)
otherwise
.
(4)
The key property is this: once the recursion reaches a size n≤Zn≤Z, the subtransform fits into the cache and no further misses are
incurred. The algorithm does not “know” this and continues
subdividing the problem, of course, but all of those further
subdivisions are in-cache because they are performed in the same
depth-first branch of the tree. The solution of
Equation 4 is
Q
2
(
n
;
Z
)
=
Θ
(
n
log
[
n
/
Z
]
)
.
Q
2
(
n
;
Z
)
=
Θ
(
n
log
[
n
/
Z
]
)
.
(5)
This is worse than the theoretical optimum Qb(n;Z)Qb(n;Z) from
Equation 3, but it is cache-oblivious (ZZ never entered the
algorithm) and exploits at least some temporal
locality. On the other hand, when it is combined
with FFTW's self-optimization and larger radices in
"Adaptive Composition of FFT Algorithms", this algorithm actually performs very well until nn
becomes extremely large. By itself, however, the algorithm of Figure 3 must
be modified to attain adequate performance for reasons that have
nothing to do with the cache. These practical issues are discussed
further in "Cache-obliviousness in practice".
There exists a different recursive FFT that is optimal
cache-oblivious, however, and that is the radix-nn
“four-step” Cooley-Tukey algorithm (again executed recursively,
depth-first) Entry 19. The cache complexity QoQo of this algorithm
satisfies the recurrence:
Q
o
(
n
;
Z
)
=
n
n
≤
Z
2
n
Q
o
(
n
;
Z
)
+
Θ
(
n
)
otherwise
.
Q
o
(
n
;
Z
)
=
n
n
≤
Z
2
n
Q
o
(
n
;
Z
)
+
Θ
(
n
)
otherwise
.
(6)
That is, at each stage one performs nn DFTs of size
nn (recursively), then multiplies by the Θ(n)Θ(n) twiddle
factors (and does a matrix transposition to obtain in-order output),
then finally performs another nn DFTs of size nn. The
solution of Equation 6 is Qo(n;Z)=Θ(nlogZn)Qo(n;Z)=Θ(nlogZn), the
same as the optimal cache complexity (Equation 3)!
These algorithms illustrate the basic features of most optimal
cache-oblivious algorithms: they employ a recursive divide-and-conquer
strategy to subdivide the problem until it fits into cache, at which
point the subdivision continues but no further cache misses are
required. Moreover, a cache-oblivious algorithm exploits all levels
of the cache in the same way, so an optimal cache-oblivious algorithm
exploits a multi-level cache optimally as well as a two-level
cache Entry 19: the multi-level “blocking” is implicit in
the recursion.
Even though the radix-nn algorithm is optimal cache-oblivious,
it does not follow that FFT implementation is a solved problem. The
optimality is only in an asymptotic sense, ignoring constant factors,
O(n)O(n) terms, etcetera, all of which can matter a great deal in
practice. For small or moderate nn, quite different algorithms may
be superior, as discussed in "Memory strategies in FFTW". Moreover, real
caches are inferior to an ideal cache in several ways. The
unsurprising consequence of all this is that cache-obliviousness, like
any complexity-based algorithm property, does not absolve one from the
ordinary process of software optimization. At best, it reduces the
amount of memory/cache tuning that one needs to perform, structuring
the implementation to make further optimization easier and more
portable.
Perhaps most importantly, one needs to perform an optimization that
has almost nothing to do with the caches: the recursion must be
“coarsened” to amortize the function-call overhead and to enable
compiler optimization. For example, the simple pedagogical code of the
algorithm in Figure 3 recurses all the way down to n=1n=1, and hence
there are ≈2n≈2n function calls in total, so that every data
point incurs a two-function-call overhead on average. Moreover, the
compiler cannot fully exploit the large register sets and instruction-level
parallelism of modern processors with an n=1n=1 function
body.
These problems can be effectively erased, however, simply by making
the base cases larger, e.g. the recursion could stop when n=32n=32 is
reached, at which point a highly optimized hard-coded FFT of that
size would be executed. In FFTW, we produced this sort of large
base-case using a specialized code-generation program described in
"Generating Small FFT Kernels".
One might get the impression that there is a strict dichotomy that
divides cache-aware and cache-oblivious algorithms, but the two are
not mutually exclusive in practice. Given an implementation of a
cache-oblivious strategy, one can further optimize it for the cache
characteristics of a particular machine in order to improve the
constant factors. For example, one can tune the radices used, the
transition point between the radix-nn algorithm and the
bounded-radix algorithm, or other algorithmic choices as described
in "Memory strategies in FFTW". The advantage of starting cache-aware tuning with a
cache-oblivious approach is that the starting point already exploits
all levels of the cache to some extent, and one has reason to hope
that good performance on one machine will be more portable to other
architectures than for a purely cache-aware “blocking” approach. In
practice, we have found this combination to be very successful with
FFTW.
The recursive cache-oblivious strategies described above form a useful
starting point, but FFTW supplements them with a number of
additional tricks, and also exploits cache-obliviousness in
less-obvious forms.
We currently find that the general radix-nn algorithm is
beneficial only when nn becomes very large, on the order of 220≈106220≈106. In practice, this means that we use at most a single
step of radix-nn (two steps would only be used for n≳240n≳240). The reason for this is that the implementation of
radix nn is less efficient than for a bounded radix: the
latter has the advantage that an entire radix butterfly can be
performed in hard-coded lo