Skip to content Skip to navigation

Connexions

You are here: Home » Content » Implementing FFTs in Practice

Navigation

Content Actions

  • Download module PDF
  • Add to ...
    Add the module to:
    • My Favorites
    • A lens
    • An external social bookmarking service
    • My Favorites (What is 'My Favorites'?)
      'My Favorites' is a special kind of lens which you can use to bookmark modules and collections directly in Connexions. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need a Connexions account to use 'My Favorites'.
    • A lens (What is a lens?)

      Definition of a lens

      Lenses

      A lens is a custom view of Connexions content. You can think of it as a fancy kind of list that will let you see Connexions through the eyes of organizations and people you trust.

      What is in a lens?

      Lens makers point to Connexions materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

      Who can create a lens?

      Any individual Connexions member, a community, or a respected organization.

    • External bookmarks
  • E-mail the authors

Recently Viewed

This feature requires Javascript to be enabled.

Implementing FFTs in Practice

Module by: Steven G. Johnson, Matteo Frigo

Summary: Discussion of the considerations involved in high-performance FFT implementations, which center largely on memory access and other non-arithmetic concerns, as illustrated by a case study of the FFTW library.

by Steven G. Johnson (Department of Mathematics, Massachusetts Institute of Technology) and Matteo Frigo (Cilk Arts, Inc.)

Introduction

Although there are a wide range of fast Fourier transform (FFT) algorithms, involving a wealth of mathematics from number theory to polynomial algebras, the vast majority of FFT implementations in practice employ some variation on the Cooley-Tukey algorithm Entry 9. The Cooley-Tukey algorithm can be derived in two or three lines of elementary algebra. It can be implemented almost as easily, especially if only power-of-two sizes are desired; numerous popular textbooks list short FFT subroutines for power-of-two sizes, written in the language du jour. The implementation of the Cooley-Tukey algorithm, at least, would therefore seem to be a long-solved problem. In this chapter, however, we will argue that matters are not as straightforward as they might appear.

For many years, the primary route to improving upon the Cooley-Tukey FFT seemed to be reductions in the count of arithmetic operations, which often dominated the execution time prior to the ubiquity of fast floating-point hardware (at least on non-embedded processors). Therefore, great effort was expended towards finding new algorithms with reduced arithmetic counts Entry 13, from Winograd's method to achieve Θ(n)Θ(n) multiplications1 (at the cost of many more additions) Entry 54, Entry 22, Entry 12, Entry 13 to the split-radix variant on Cooley-Tukey that long achieved the lowest known total count of additions and multiplications for power-of-two sizes Entry 56, Entry 10, Entry 53, Entry 32, Entry 13 (but was recently improved upon Entry 27, Entry 31). The question of the minimum possible arithmetic count continues to be of fundamental theoretical interest—it is not even known whether better than Θ(nlogn)Θ(nlogn) complexity is possible, since Ω(nlogn)Ω(nlogn) lower bounds on the count of additions have only been proven subject to restrictive assumptions about the algorithms Entry 33, Entry 36, Entry 37. Nevertheless, the difference in the number of arithmetic operations, for power-of-two sizes nn, between the 1965 radix-2 Cooley-Tukey algorithm (5nlog2n5nlog2n Entry 9) and the currently lowest-known arithmetic count (349nlog2n349nlog2n Entry 27, Entry 31) remains only about 25%.

Figure 1: The ratio of speed (1/time) between a highly optimized FFT (FFTW 3.1.2 Entry 15, Entry 16) and a typical textbook radix-2 implementation (Numerical Recipes in C Entry 38) on a 3 GHz Intel Core Duo with the Intel C compiler 9.1.043, for single-precision complex-data DFTs of size nn, plotted versus log2nlog2n. Top line (squares) shows FFTW with SSE SIMD instructions enabled, which perform multiple arithmetic operations at once (see section ); bottom line (circles) shows FFTW with SSE disabled, which thus requires a similar number of arithmetic instructions to the textbook code. (This is not intended as a criticism of Numerical Recipes—simple radix-2 implementations are reasonable for pedagogy—but it illustrates the radical differences between straightforward and optimized implementations of FFT algorithms, even with similar arithmetic costs.) For n219n219, the ratio increases because the textbook code becomes much slower (this happens when the DFT size exceeds the level-2 cache).
Figure 1 (fftwnr.png)

And yet there is a vast gap between this basic mathematical theory and the actual practice—highly optimized FFT packages are often an order of magnitude faster than the textbook subroutines, and the internal structure to achieve this performance is radically different from the typical textbook presentation of the “same” Cooley-Tukey algorithm. For example, Figure 1 plots the ratio of benchmark speeds between a highly optimized FFT Entry 15, Entry 16 and a typical textbook radix-2 implementation Entry 38, and the former is faster by a factor of 5–40 (with a larger ratio as nn grows). Here, we will consider some of the reasons for this discrepancy, and some techniques that can be used to address the difficulties faced by a practical high-performance FFT implementation.2

In particular, in this chapter we will discuss some of the lessons learned and the strategies adopted in the FFTW library. FFTW Entry 15, Entry 16 is a widely used free-software library that computes the discrete Fourier transform (DFT) and its various special cases. Its performance is competitive even with manufacturer-optimized programs Entry 16, and this performance is portable thanks the structure of the algorithms employed, self-optimization techniques, and highly optimized kernels (FFTW's codelets) generated by a special-purpose compiler.

This chapter is structured as follows. First "Review of the Cooley-Tukey FFT", we briefly review the basic ideas behind the Cooley-Tukey algorithm and define some common terminology, especially focusing on the many degrees of freedom that the abstract algorithm allows to implementations. Next, in "Goals and Background of the FFTW Project", we provide some context for FFTW's development and stress that performance, while it receives the most publicity, is not necessarily the most important consideration in the implementation of a library of this sort. Third, in "FFTs and the Memory Hierarchy", we consider a basic theoretical model of the computer memory hierarchy and its impact on FFT algorithm choices: quite general considerations push implementations towards large radices and explicitly recursive structure. Unfortunately, general considerations are not sufficient in themselves, so we will explain in "Adaptive Composition of FFT Algorithms" how FFTW self-optimizes for particular machines by selecting its algorithm at runtime from a composition of simple algorithmic steps. Furthermore, "Generating Small FFT Kernels" describes the utility and the principles of automatic code generation used to produce the highly optimized building blocks of this composition, FFTW's codelets. Finally, we will briefly consider an important non-performance issue, in "Numerical Accuracy in FFTs".

Review of the Cooley-Tukey FFT

The (forward, one-dimensional) discrete Fourier transform (DFT) of an array XX of nn complex numbers is the array YY given by

Y [ k ] = = 0 n - 1 X [ ] ω n k , Y [ k ] = = 0 n - 1 X [ ] ω n k , (1)

where 0k<n0k<n and ωn=exp(-2πi/n)ωn=exp(-2πi/n) is a primitive root of unity. Implemented directly, Equation 1 would require Θ(n2)Θ(n2) operations; fast Fourier transforms are O(nlogn)O(nlogn) algorithms to compute the same result. The most important FFT (and the one primarily used in FFTW) is known as the “Cooley-Tukey” algorithm, after the two authors who rediscovered and popularized it in 1965 Entry 9, although it had been previously known as early as 1805 by Gauss as well as by later re-inventors Entry 24. The basic idea behind this FFT is that a DFT of a composite size n=n1n2n=n1n2 can be re-expressed in terms of smaller DFTs of sizes n1n1 and n2n2—essentially, as a two-dimensional DFT of size n1×n2n1×n2 where the output is transposed. The choices of factorizations of nn, combined with the many different ways to implement the data re-orderings of the transpositions, have led to numerous implementation strategies for the Cooley-Tukey FFT, with many variants distinguished by their own names Entry 13, Entry 52. FFTW implements a space of many such variants, as described in "Adaptive Composition of FFT Algorithms", but here we derive the basic algorithm, identify its key features, and outline some important historical variations and their relation to FFTW.

The Cooley-Tukey algorithm can be derived as follows. If nn can be factored into n=n1n2n=n1n2, Equation 1 can be rewritten by letting =1n2+2=1n2+2 and k=k1+k2n1k=k1+k2n1. We then have:

Y [ k 1 + k 2 n 1 ] = 2 = 0 n 2 - 1 1 = 0 n 1 - 1 X [ 1 n 2 + 2 ] ω n 1 1 k 1 ω n 2 k 1 ω n 2 2 k 2 , Y [ k 1 + k 2 n 1 ] = 2 = 0 n 2 - 1 1 = 0 n 1 - 1 X [ 1 n 2 + 2 ] ω n 1 1 k 1 ω n 2 k 1 ω n 2 2 k 2 , (2)

where k1,2=0,...,n1,2-1k1,2=0,...,n1,2-1. Thus, the algorithm computes n2n2 DFTs of size n1n1 (the inner sum), multiplies the result by the so-called Entry 21 twiddle factors ωn2k1ωn2k1, and finally computes n1n1 DFTs of size n2n2 (the outer sum). This decomposition is then continued recursively. The literature uses the term radix to describe an n1n1 or n2n2 that is bounded (often constant); the small DFT of the radix is traditionally called a butterfly.

Many well-known variations are distinguished by the radix alone. A decimation in time (DIT) algorithm uses n2n2 as the radix, while a decimation in frequency (DIF) algorithm uses n1n1 as the radix. If multiple radices are used, e.g. for nn composite but not a prime power, the algorithm is called mixed radix. A peculiar blending of radix 2 and 4 is called split radix, which was proposed to minimize the count of arithmetic operations Entry 56, Entry 10, Entry 53, Entry 32, Entry 13 although it has been superseded in this regard Entry 27, Entry 31. FFTW implements both DIT and DIF, is mixed-radix with radices that are adapted to the hardware, and often uses much larger radices (e.g. radix 32) than were once common. On the other end of the scale, a “radix” of roughly nn has been called a four-step FFT algorithm (or six-step, depending on how many transposes one performs) Entry 2; see "FFTs and the Memory Hierarchy" for some theoretical and practical discussion of this algorithm.

A key difficulty in implementing the Cooley-Tukey FFT is that the n1n1 dimension corresponds to discontiguous inputs 11 in XX but contiguous outputs k1k1 in YY, and vice-versa for n2n2. This is a matrix transpose for a single decomposition stage, and the composition of all such transpositions is a (mixed-base) digit-reversal permutation (or bit-reversal, for radix 2). The resulting necessity of discontiguous memory access and data re-ordering hinders efficient use of hierarchical memory architectures (e.g., caches), so that the optimal execution order of an FFT for given hardware is non-obvious, and various approaches have been proposed.

Figure 2: Schematic of traditional breadth-first (left) vs. recursive depth-first (right) ordering for radix-2 FFT of size 8: the computations for each nested box are completed before doing anything else in the surrounding box. Breadth-first computation performs all butterflies of a given size at once, while depth-first computation completes one subtransform entirely before moving on to the next (as in the algorithm below).
Figure 2 (breadthdepth.png)

One ordering distinction is between recursion and iteration. As expressed above, the Cooley-Tukey algorithm could be thought of as defining a tree of smaller and smaller DFTs, as depicted in Figure 2; for example, a textbook radix-2 algorithm would divide size nn into two transforms of size n/2n/2, which are divided into four transforms of size n/4n/4, and so on until a base case is reached (in principle, size 1). This might naturally suggest a recursive implementation in which the tree is traversed “depth-first” as in Figure 2(right) and the algorithm of Figure 3—one size n/2n/2 transform is solved completely before processing the other one, and so on. However, most traditional FFT implementations are non-recursive (with rare exceptions Entry 46) and traverse the tree “breadth-first” Entry 52 as in Figure 2(left)—in the radix-2 example, they would perform nn (trivial) size-1 transforms, then n/2n/2 combinations into size-2 transforms, then n/4n/4 combinations into size-4 transforms, and so on, thus making log2nlog2n passes over the whole array. In contrast, as we discuss in "Discussion", FFTW employs an explicitly recursive strategy that encompasses both depth-first and breadth-first styles, favoring the former since it has some theoretical and practical advantages as discussed in "FFTs and the Memory Hierarchy".

Figure 3: A depth-first recursive radix-2 DIT Cooley-Tukey FFT to compute a DFT of a power-of-two size n=2mn=2m. The input is an array XX of length nn with stride ιι (i.e., the inputs are X[ι]X[ι] for =0,...,n-1=0,...,n-1) and the output is an array YY of length nn (with stride 1), containing the DFT of XX [Equation 1]. X+ιX+ι denotes the array beginning with X[ι]X[ι]. This algorithm operates out-of-place, produces in-order output, and does not require a separate bit-reversal stage.
Y[0,...,n-1]recfft 2(n,X,ι)Y[0,...,n-1]recfft 2(n,X,ι):
IF n=1 THEN
     Y[0]X[0]Y[0]X[0]
ELSE
     Y[0,...,n/2-1]recfft2(n/2,X,2ι)Y[0,...,n/2-1]recfft2(n/2,X,2ι)
     Y[n/2,...,n-1]recfft2(n/2,X+ι,2ι)Y[n/2,...,n-1]recfft2(n/2,X+ι,2ι)
     FOR k1=0k1=0 TO (n/2)-1(n/2)-1 DO
          tY[k1]tY[k1]
          Y[k1]t+ωnk1Y[k1+n/2]Y[k1]t+ωnk1Y[k1+n/2]
          Y[k1+n/2]t-ωnk1Y[k1+n/2]Y[k1+n/2]t-ωnk1Y[k1+n/2]
     END FOR
END IF

A second ordering distinction lies in how the digit-reversal is performed. The classic approach is a single, separate digit-reversal pass following or preceding the arithmetic computations; this approach is so common and so deeply embedded into FFT lore that many practitioners find it difficult to imagine an FFT without an explicit bit-reversal stage. Although this pass requires only O(n)O(n) time Entry 29, it can still be non-negligible, especially if the data is out-of-cache; moreover, it neglects the possibility that data reordering during the transform may improve memory locality. Perhaps the oldest alternative is the Stockham auto-sort FFT Entry 47, Entry 52, which transforms back and forth between two arrays with each butterfly, transposing one digit each time, and was popular to improve contiguity of access for vector computers Entry 48. Alternatively, an explicitly recursive style, as in FFTW, performs the digit-reversal implicitly at the “leaves” of its computation when operating out-of-place (see section "Discussion"). A simple example of this style, which computes in-order output using an out-of-place radix-2 FFT without explicit bit-reversal, is shown in the algorithm of Figure 3 [corresponding to Figure 2(right)]. To operate in-place with O(1)O(1) scratch storage, one can interleave small matrix transpositions with the butterflies Entry 26, Entry 49, Entry 40, Entry 23, and a related strategy in FFTW Entry 16 is briefly described by "Discussion".

Finally, we should mention that there are many FFTs entirely distinct from Cooley-Tukey. Three notable such algorithms are the prime-factor algorithm for gcd(n1,n2)=1gcd(n1,n2)=1 Entry 35, along with Rader's Entry 41 and Bluestein's Entry 4, Entry 43, Entry 35 algorithms for prime nn. FFTW implements the first two in its codelet generator for hard-coded nn "Generating Small FFT Kernels" and the latter two for general prime nn (sections "Plans for prime sizes" and "Goals and Background of the FFTW Project"). There is also the Winograd FFT Entry 54, Entry 22, Entry 12, Entry 13, which minimizes the number of multiplications at the expense of a large number of additions; this trade-off is not beneficial on current processors that have specialized hardware multipliers.

Goals and Background of the FFTW Project

The FFTW project, begun in 1997 as a side project of the authors Frigo and Johnson as graduate students at MIT, has gone through several major revisions, and as of 2008 consists of more than 40,000 lines of code. It is difficult to measure the popularity of a free-software package, but (as of 2008) FFTW has been cited in over 500 academic papers, is used in hundreds of shipping free and proprietary software packages, and the authors have received over 10,000 emails from users of the software. Most of this chapter focuses on performance of FFT implementations, but FFTW would probably not be where it is today if that were the only consideration in its design. One of the key factors in FFTW's success seems to have been its flexibility in addition to its performance. In fact, FFTW is probably the most flexible DFT library available:

  • FFTW is written in portable C and runs well on many architectures and operating systems.
  • FFTW computes DFTs in O(nlogn)O(nlogn) time for any length nn. (Most other DFT implementations are either restricted to a subset of sizes or they become Θ(n2)Θ(n2) for certain values of nn, for example when nn is prime.)
  • FFTW imposes no restrictions on the rank (dimensionality) of multi-dimensional transforms. (Most other implementations are limited to one-dimensional, or at most two- and three-dimensional data.)
  • FFTW supports multiple and/or strided DFTs; for example, to transform a 3-component vector field or a portion of a multi-dimensional array. (Most implementations support only a single DFT of contiguous data.)
  • FFTW supports DFTs of real data, as well as of real symmetric/anti-symmetric data (also called discrete cosine/sine transforms).

Our design philosophy has been to first define the most general reasonable functionality, and then to obtain the highest possible performance without sacrificing this generality. In this section, we offer a few thoughts about why such flexibility has proved important, and how it came about that FFTW was designed in this way.

FFTW's generality is partly a consequence of the fact the FFTW project was started in response to the needs of a real application for one of the authors (a spectral solver for Maxwell's equations Entry 28), which from the beginning had to run on heterogeneous hardware. Our initial application required multi-dimensional DFTs of three-component vector fields (magnetic fields in electromagnetism), and so right away this meant: (i) multi-dimensional FFTs; (ii) user-accessible loops of FFTs of discontiguous data; (iii) efficient support for non-power-of-two sizes (the factor of eight difference between n×n×nn×n×n and 2n×2n×2n2n×2n×2n was too much to tolerate); and (iv) saving a factor of two for the common real-input case was desirable. That is, the initial requirements already encompassed most of the features above, and nothing about this application is particularly unusual.

Even for one-dimensional DFTs, there is a common misperception that one should always choose power-of-two sizes if one cares about efficiency. Thanks to FFTW's code generator (described in "Generating Small FFT Kernels"), we could afford to devote equal optimization effort to any nn with small factors (2, 3, 5, and 7 are good), instead of mostly optimizing powers of two like many high-performance FFTs. As a result, to pick a typical example on the 3 GHz Core Duo processor of Figure 1, n=3600=24·32·52n=3600=24·32·52 and n=3840=28·3·5n=3840=28·3·5 both execute faster than n=4096=212n=4096=212. (And if there are factors one particularly cares about, one can generate code for them too.)

One initially missing feature was efficient support for large prime sizes; the conventional wisdom was that large-prime algorithms were mainly of academic interest, since in real applications (including ours) one has enough freedom to choose a highly composite transform size. However, the prime-size algorithms are fascinating, so we implemented Rader's O(nlogn)O(nlogn) prime-nn algorithm Entry 41 purely for fun, including it in FFTW 2.0 (released in 1998) as a bonus feature. The response was astonishingly positive—even though users are (probably) never forced by their application to compute a prime-size DFT, it is rather inconvenient to always worry that collecting an unlucky number of data points will slow down one's analysis by a factor of a million. The prime-size algorithms are certainly slower than algorithms for nearby composite sizes, but in interactive data-analysis situations the difference between 1 ms and 10 ms means little, while educating users to avoid large prime factors is hard.

Another form of flexibility that deserves comment has to do with a purely technical aspect of computer software. FFTW's implementation involves some unusual language choices internally (the FFT-kernel generator, described in "Generating Small FFT Kernels", is written in Objective Caml, a functional language especially suited for compiler-like programs), but its user-callable interface is purely in C with lowest-common-denominator datatypes (arrays of floating-point values). The advantage of this is that FFTW can be (and has been) called from almost any other programming language, from Java to Perl to Fortran 77. Similar lowest-common-denominator interfaces are apparent in many other popular numerical libraries, such as LAPACK Entry 1. Language preferences arouse strong feelings, but this technical constraint means that modern programming dialects are best hidden from view for a numerical library.

Ultimately, very few scientific-computing applications should have performance as their top priority. Flexibility is often far more important, because one wants to be limited only by one's imagination, rather than by one's software, in the kinds of problems that can be studied.

FFTs and the Memory Hierarchy

There are many complexities of computer architectures that impact the optimization of FFT implementations, but one of the most pervasive is the memory hierarchy. On any modern general-purpose computer, memory is arranged into a hierarchy of storage devices with increasing size and decreasing speed: the fastest and smallest memory being the CPU registers, then two or three levels of cache, then the main-memory RAM, then external storage such as hard disks.3 Most of these levels are managed automatically by the hardware to hold the most-recently-used data from the next level in the hierarchy.4 There are many complications, however, such as limited cache associativity (which means that certain locations in memory cannot be cached simultaneously) and cache lines (which optimize the cache for contiguous memory access), which are reviewed in numerous textbooks on computer architectures. In this section, we focus on the simplest abstract principles of memory hierarchies in order to grasp their fundamental impact on FFTs.

Because access to memory is in many cases the slowest part of the computer, especially compared to arithmetic, one wishes to load as much data as possible in to the faster levels of the hierarchy, and then perform as much computation as possible before going back to the slower memory devices. This is called temporal locality: if a given datum is used more than once, we arrange the computation so that these usages occur as close together as possible in time.

Understanding FFTs with an ideal cache

To understand temporal-locality strategies at a basic level, in this section we will employ an idealized model of a cache in a two-level memory hierarchy, as defined in Entry 19. This ideal cache stores ZZ data items from main memory (e.g. complex numbers for our purposes): when the processor loads a datum from memory, the access is quick if the datum is already in the cache (a cache hit) and slow otherwise (a cache miss, which requires the datum to be fetched into the cache). When a datum is loaded into the cache,5 it must replace some other datum, and the ideal-cache model assumes that the optimal replacement strategy is used Entry 3: the new datum replaces the datum that will not be needed for the longest time in the future; in practice, this can be simulated to within a factor of two by replacing the least-recently used datum Entry 19, but ideal replacement is much simpler to analyze. Armed with this ideal-cache model, we can now understand some basic features of FFT implementations that remain essentially true even on real cache architectures. In particular, we want to know the cache complexity, the number Q(n;Z)Q(n;Z) of cache misses for an FFT of size nn with an ideal cache of size ZZ, and what algorithm choices reduce this complexity.

First, consider a textbook radix-2 algorithm, which divides nn by 2 at each stage and operates breadth-first as in Figure 2(left), performing all butterflies of a given size at a time. If n>Zn>Z, then each pass over the array incurs Θ(n)Θ(n) cache misses to reload the data, and there are log2nlog2n passes, for Θ(nlog2n)Θ(nlog2n) cache misses in total—no temporal locality at all is exploited!

One traditional solution to this problem is blocking: the computation is divided into maximal blocks that fit into the cache, and the computations for each block are completed before moving on to the next block. Here, a block of ZZ numbers can fit into the cache6 (not including storage for twiddle factors and so on), and thus the natural unit of computation is a sub-FFT of size ZZ. Since each of these blocks involves Θ(ZlogZ)Θ(ZlogZ) arithmetic operations, and there are Θ(nlogn)Θ(nlogn) operations overall, there must be Θ(nZlogZn)Θ(nZlogZn) such blocks. More explicitly, one could use a radix-ZZ Cooley-Tukey algorithm, breaking nn down by factors of ZZ [or Θ(Z)Θ(Z)] until a size ZZ is reached: each stage requires n/Zn/Z blocks, and there are logZnlogZn stages, again giving Θ(nZlogZn)Θ(nZlogZn) blocks overall. Since each block requires ZZ cache misses to load it into cache, the cache complexity QbQb of such a blocked algorithm is

Q b ( n ; Z ) = Θ ( n log Z n ) . Q b ( n ; Z ) = Θ ( n log Z n ) . (3)

In fact, this complexity is rigorously optimal for Cooley-Tukey FFT algorithms Entry 25, and immediately points us towards large radices (not radix 2!) to exploit caches effectively in FFTs.

However, there is one shortcoming of any blocked FFT algorithm: it is cache aware, meaning that the implementation depends explicitly on the cache size ZZ. The implementation must be modified (e.g. changing the radix) to adapt to different machines as the cache size changes. Worse, as mentioned above, actual machines have multiple levels of cache, and to exploit these one must perform multiple levels of blocking, each parameterized by the corresponding cache size. In the above example, if there were a smaller and faster cache of size z<Zz<Z, the size-ZZ sub-FFTs should themselves be performed via radix-zz Cooley-Tukey using blocks of size zz. And so on. There are two paths out of these difficulties: one is self-optimization, where the implementation automatically adapts itself to the hardware (implicitly including any cache sizes), as described in "Adaptive Composition of FFT Algorithms"; the other is to exploit cache-oblivious algorithms. FFTW employs both of these techniques.

The goal of cache-obliviousness is to structure the algorithm so that it exploits the cache without having the cache size as a parameter: the same code achieves the same asymptotic cache complexity regardless of the cache size ZZ. An optimal cache-oblivious algorithm achieves the optimal cache complexity (that is, in an asymptotic sense, ignoring constant factors). Remarkably, optimal cache-oblivious algorithms exist for many problems, such as matrix multiplication, sorting, transposition, and FFTs Entry 19. Not all cache-oblivious algorithms are optimal, of course—for example, the textbook radix-2 algorithm discussed above is “pessimal” cache-oblivious (its cache complexity is independent of ZZ because it always achieves the worst case!).

For instance, Figure 2(right) and the algorithm of Figure 3 shows a way to obliviously exploit the cache with a radix-2 Cooley-Tukey algorithm, by ordering the computation depth-first rather than breadth-first. That is, the DFT of size nn is divided into two DFTs of size n/2n/2, and one DFT of size n/2n/2 is completely finished before doing any computations for the second DFT of size n/2n/2. The two subtransforms are then combined using n/2n/2 radix-2 butterflies, which requires a pass over the array and (hence nn cache misses if n>Zn>Z). This process is repeated recursively until a base-case (e.g. size 2) is reached. The cache complexity Q2(n;Z)Q2(n;Z) of this algorithm satisfies the recurrence

Q 2 ( n ; Z ) = n n Z 2 Q 2 ( n / 2 ; Z ) + Θ ( n ) otherwise . Q 2 ( n ; Z ) = n n Z 2 Q 2 ( n / 2 ; Z ) + Θ ( n ) otherwise . (4)

The key property is this: once the recursion reaches a size nZnZ, the subtransform fits into the cache and no further misses are incurred. The algorithm does not “know” this and continues subdividing the problem, of course, but all of those further subdivisions are in-cache because they are performed in the same depth-first branch of the tree. The solution of Equation 4 is

Q 2 ( n ; Z ) = Θ ( n log [ n / Z ] ) . Q 2 ( n ; Z ) = Θ ( n log [ n / Z ] ) . (5)

This is worse than the theoretical optimum Qb(n;Z)Qb(n;Z) from Equation 3, but it is cache-oblivious (ZZ never entered the algorithm) and exploits at least some temporal locality.7 On the other hand, when it is combined with FFTW's self-optimization and larger radices in "Adaptive Composition of FFT Algorithms", this algorithm actually performs very well until nn becomes extremely large. By itself, however, the algorithm of Figure 3 must be modified to attain adequate performance for reasons that have nothing to do with the cache. These practical issues are discussed further in "Cache-obliviousness in practice".

There exists a different recursive FFT that is optimal cache-oblivious, however, and that is the radix-nn “four-step” Cooley-Tukey algorithm (again executed recursively, depth-first) Entry 19. The cache complexity QoQo of this algorithm satisfies the recurrence:

Q o ( n ; Z ) = n n Z 2 n Q o ( n ; Z ) + Θ ( n ) otherwise . Q o ( n ; Z ) = n n Z 2 n Q o ( n ; Z ) + Θ ( n ) otherwise . (6)

That is, at each stage one performs nn DFTs of size nn (recursively), then multiplies by the Θ(n)Θ(n) twiddle factors (and does a matrix transposition to obtain in-order output), then finally performs another nn DFTs of size nn. The solution of Equation 6 is Qo(n;Z)=Θ(nlogZn)Qo(n;Z)=Θ(nlogZn), the same as the optimal cache complexity (Equation 3)!

These algorithms illustrate the basic features of most optimal cache-oblivious algorithms: they employ a recursive divide-and-conquer strategy to subdivide the problem until it fits into cache, at which point the subdivision continues but no further cache misses are required. Moreover, a cache-oblivious algorithm exploits all levels of the cache in the same way, so an optimal cache-oblivious algorithm exploits a multi-level cache optimally as well as a two-level cache Entry 19: the multi-level “blocking” is implicit in the recursion.

Cache-obliviousness in practice

Even though the radix-nn algorithm is optimal cache-oblivious, it does not follow that FFT implementation is a solved problem. The optimality is only in an asymptotic sense, ignoring constant factors, O(n)O(n) terms, etcetera, all of which can matter a great deal in practice. For small or moderate nn, quite different algorithms may be superior, as discussed in "Memory strategies in FFTW". Moreover, real caches are inferior to an ideal cache in several ways. The unsurprising consequence of all this is that cache-obliviousness, like any complexity-based algorithm property, does not absolve one from the ordinary process of software optimization. At best, it reduces the amount of memory/cache tuning that one needs to perform, structuring the implementation to make further optimization easier and more portable.

Perhaps most importantly, one needs to perform an optimization that has almost nothing to do with the caches: the recursion must be “coarsened” to amortize the function-call overhead and to enable compiler optimization. For example, the simple pedagogical code of the algorithm in Figure 3 recurses all the way down to n=1n=1, and hence there are 2n2n function calls in total, so that every data point incurs a two-function-call overhead on average. Moreover, the compiler cannot fully exploit the large register sets and instruction-level parallelism of modern processors with an n=1n=1 function body.8 These problems can be effectively erased, however, simply by making the base cases larger, e.g. the recursion could stop when n=32n=32 is reached, at which point a highly optimized hard-coded FFT of that size would be executed. In FFTW, we produced this sort of large base-case using a specialized code-generation program described in "Generating Small FFT Kernels".

One might get the impression that there is a strict dichotomy that divides cache-aware and cache-oblivious algorithms, but the two are not mutually exclusive in practice. Given an implementation of a cache-oblivious strategy, one can further optimize it for the cache characteristics of a particular machine in order to improve the constant factors. For example, one can tune the radices used, the transition point between the radix-nn algorithm and the bounded-radix algorithm, or other algorithmic choices as described in "Memory strategies in FFTW". The advantage of starting cache-aware tuning with a cache-oblivious approach is that the starting point already exploits all levels of the cache to some extent, and one has reason to hope that good performance on one machine will be more portable to other architectures than for a purely cache-aware “blocking” approach. In practice, we have found this combination to be very successful with FFTW.

Memory strategies in FFTW

The recursive cache-oblivious strategies described above form a useful starting point, but FFTW supplements them with a number of additional tricks, and also exploits cache-obliviousness in less-obvious forms.

We currently find that the general radix-nn algorithm is beneficial only when nn becomes very large, on the order of 220106220106. In practice, this means that we use at most a single step of radix-nn (two steps would only be used for n240n240). The reason for this is that the implementation of radix nn is less efficient than for a bounded radix: the latter has the advantage that an entire radix butterfly can be performed in hard-coded lo