To understand temporal-locality strategies at a basic level, in this
section we will employ an idealized model of a cache in a two-level
memory hierarchy, as defined in [19].
This
ideal cache stores ZZ data items from main memory
(e.g. complex numbers for our purposes): when the processor loads a
datum from memory, the access is quick if the datum is already in the
cache (a cache hit) and slow otherwise (a cache
miss, which requires the datum to be fetched into the cache). When a
datum is loaded into the cache, it must replace
some other datum, and the ideal-cache model assumes that the optimal
replacement strategy is used [3]: the new datum replaces
the datum that will not be needed for the longest time in the future;
in practice, this can be simulated to within a factor of two by
replacing the least-recently used datum [19], but ideal
replacement is much simpler to analyze. Armed with this ideal-cache
model, we can now understand some basic features of FFT
implementations that remain essentially true even on real cache
architectures. In particular, we want to know the cache
complexity, the number Q(n;Z)Q(n;Z) of cache misses for an FFT of size
nn with an ideal cache of size ZZ, and what algorithm choices reduce
this complexity.
First, consider a textbook radix-2 algorithm, which divides nn by 2
at each stage and operates breadth-first as in
Figure 2(left), performing all butterflies of a given
size at a time. If n>Zn>Z, then each pass over the array incurs
Θ(n)Θ(n) cache misses to reload the data, and there are log2nlog2n
passes, for Θ(nlog2n)Θ(nlog2n) cache misses in total—no temporal
locality at all is exploited!
One traditional solution to this problem is blocking: the
computation is divided into maximal blocks that fit into the cache,
and the computations for each block are completed before moving on to
the next block. Here, a block of ZZ numbers can fit into the
cache (not
including storage for twiddle factors and so on), and thus the natural
unit of computation is a sub-FFT of size ZZ. Since each of these
blocks involves Θ(ZlogZ)Θ(ZlogZ) arithmetic operations, and there
are Θ(nlogn)Θ(nlogn) operations overall, there must be
Θ(nZlogZn)Θ(nZlogZn) such blocks. More explicitly, one
could use a radix-ZZ Cooley-Tukey algorithm, breaking nn down by
factors of ZZ [or Θ(Z)Θ(Z)] until a size ZZ is reached: each
stage requires n/Zn/Z blocks, and there are logZnlogZn stages, again
giving Θ(nZlogZn)Θ(nZlogZn) blocks overall. Since each
block requires ZZ cache misses to load it into cache, the cache
complexity QbQb of such a blocked algorithm is
Q
b
(
n
;
Z
)
=
Θ
(
n
log
Z
n
)
.
Q
b
(
n
;
Z
)
=
Θ
(
n
log
Z
n
)
.
(3)In fact, this complexity is rigorously optimal for Cooley-Tukey
FFT algorithms [25], and immediately points us towards
large radices (not radix 2!) to exploit caches effectively in
FFTs.
However, there is one shortcoming of any blocked FFT algorithm: it
is cache aware, meaning that the implementation depends
explicitly on the cache size ZZ. The implementation must be modified
(e.g. changing the radix) to adapt to different machines as the cache
size changes. Worse, as mentioned above, actual machines have
multiple levels of cache, and to exploit these one must perform
multiple levels of blocking, each parameterized by the corresponding
cache size. In the above example, if there were a smaller and faster
cache of size z<Zz<Z, the size-ZZ sub-FFTs should themselves be
performed via radix-zz Cooley-Tukey using blocks of size zz. And so
on. There are two paths out of these difficulties: one is
self-optimization, where the implementation automatically adapts
itself to the hardware (implicitly including any cache sizes), as
described in "Adaptive Composition of FFT Algorithms"; the other is to exploit
cache-oblivious algorithms. FFTW employs both of these
techniques.
The goal of cache-obliviousness is to structure the algorithm so that
it exploits the cache without having the cache size as a parameter:
the same code achieves the same asymptotic cache complexity regardless
of the cache size ZZ. An optimal cache-oblivious algorithm
achieves the optimal cache complexity (that is, in an
asymptotic sense, ignoring constant factors). Remarkably, optimal
cache-oblivious algorithms exist for many problems, such as matrix
multiplication, sorting, transposition, and FFTs [19].
Not all cache-oblivious algorithms are optimal, of course—for
example, the textbook radix-2 algorithm discussed above is
“pessimal” cache-oblivious (its cache complexity is independent of
ZZ because it always achieves the worst case!).
For instance, Figure 2(right) and algorithm Figure 3 shows a way to
obliviously exploit the cache with a radix-2 Cooley-Tukey algorithm,
by ordering the computation depth-first rather than breadth-first.
That is, the DFT of size nn is divided into two DFTs of size
n/2n/2, and one DFT of size n/2n/2 is completely finished
before doing any computations for the second DFT of size
n/2n/2. The two subtransforms are then combined using n/2n/2 radix-2
butterflies, which requires a pass over the array and (hence nn cache
misses if n>Zn>Z). This process is repeated recursively until a
base-case (e.g. size 2) is reached. The cache complexity Q2(n;Z)Q2(n;Z)
of this algorithm satisfies the recurrence
Q
2
(
n
;
Z
)
=
n
n
≤
Z
2
Q
2
(
n
/
2
;
Z
)
+
Θ
(
n
)
otherwise
.
Q
2
(
n
;
Z
)
=
n
n
≤
Z
2
Q
2
(
n
/
2
;
Z
)
+
Θ
(
n
)
otherwise
.
(4)The key property is this: once the recursion reaches a size n≤Zn≤Z, the subtransform fits into the cache and no further misses are
incurred. The algorithm does not “know” this and continues
subdividing the problem, of course, but all of those further
subdivisions are in-cache because they are performed in the same
depth-first branch of the tree. The solution of
eq. (Equation 4) is
Q
2
(
n
;
Z
)
=
Θ
(
n
log
[
n
/
Z
]
)
.
Q
2
(
n
;
Z
)
=
Θ
(
n
log
[
n
/
Z
]
)
.
(5)This is worse than the theoretical optimum Qb(n;Z)Qb(n;Z) from
eq. (Equation 3), but it is cache-oblivious (ZZ never entered the
algorithm) and exploits at least some temporal
locality. On the other hand, when it is combined
with FFTW's self-optimization and larger radices in
"Adaptive Composition of FFT Algorithms", this algorithm actually performs very well until nn
becomes extremely large. By itself, however, algorithm Figure 3 must
be modified to attain adequate performance for reasons that have
nothing to do with the cache. These practical issues are discussed
further in "Cache-obliviousness in practice".
There exists a different recursive FFT that is optimal
cache-oblivious, however, and that is the radix-nn
“four-step” Cooley-Tukey algorithm (again executed recursively,
depth-first) [19]. The cache complexity QoQo of this algorithm
satisfies the recurrence:
Q
o
(
n
;
Z
)
=
n
n
≤
Z
2
n
Q
o
(
n
;
Z
)
+
Θ
(
n
)
otherwise
.
Q
o
(
n
;
Z
)
=
n
n
≤
Z
2
n
Q
o
(
n
;
Z
)
+
Θ
(
n
)
otherwise
.
(6)That is, at each stage one performs nn DFTs of size
nn (recursively), then multiplies by the Θ(n)Θ(n) twiddle
factors (and does a matrix transposition to obtain in-order output),
then finally performs another nn DFTs of size nn. The
solution of eq. (Equation 6) is Qo(n;Z)=Θ(nlogZn)Qo(n;Z)=Θ(nlogZn), the
same as the optimal cache complexity eq. (Equation 3)!
These algorithms illustrate the basic features of most optimal
cache-oblivious algorithms: they employ a recursive divide-and-conquer
strategy to subdivide the problem until it fits into cache, at which
point the subdivision continues but no further cache misses are
required. Moreover, a cache-oblivious algorithm exploits all levels
of the cache in the same way, so an optimal cache-oblivious algorithm
exploits a multi-level cache optimally as well as a two-level
cache [19]: the multi-level “blocking” is implicit in
the recursion.