In 1990, it was estimated that Cray Research's installed base of approximately 200 machines spent 40% of all CPU cycles computing the fast Fourier transform (FFT) [20]. With each machine worth about USD$25 million, the performance of the FFT was of prime importance.
Today, use of the FFT is even more pervasive, and it is counted among the
10 algorithms that have had the greatest influence on the development and
practice of science and engineering in the 20th
century [7]. Huge numbers of mobile smartphones, tablets
and PCs [24], [11], most of which are
equipped with single-instruction, multiple-data (SIMD) [9], [13] microprocessors,
compute the FFT on a large scale for a plethora of sound, video and image
processing applications. In the space of a few years, mobile applications have
become a part of many people's everyday lives [17].
This thesis shows that the key to optimizing the performance of the split-radix
FFT algorithms on SIMD microprocessors is latency and spatial
locality optimizations, and in some cases, a variant of the split-radix
FFT called the conjugate-pair
algorithm [16], [21], [22], [30].
It is also shown that extensive machine specific calibration may be
superfluous.
FFTW [14], [12], [19],
SPIRAL [15], [29], [28] and
UHFFT [2], [4], [1], [26], [25]
are state of the art FFT libraries that employ automatic empirical
optimization. SPIRAL automatically performs machine-specific optimizations at
compile time, and FFTW and UHFFT automatically adapt to a machine at run-time.
Aside from the use of automatic optimization, a common denominator among these
libraries is the use of large straight line blocks of code and optimized memory
locality.
The hypotheses outlined below test whether good heuristics and model-based optimization can be used in the place of automatic empirical optimization.
Large FFT exhibit poor temporal locality, and when computing these
transforms on microprocessor based systems that feature a cache, best
performance is typically achieved when “streaming” sequential data through
the CPU. Hypothesis 1 is tested in Implementation Details with
replicated coefficient lookup tables that trade-off increased memory size for
better spatial locality, and in Streaming FFT by topologically sorting
a directed acyclic graph (DAG) of sub-transforms to again improve spatial locality.
Hypothesis 2 is based on the idea that memory bandwidth is a bottleneck, and on
the fact that the conjugate-pair algorithm requires only half the number of
twiddle factor loads. This hypothesis is tested in
Split-radix vs. conjugate-pair, where a highly optimized
implementation of the conjugate-pair algorithm is benchmarked against an
equally highly optimized implementation of the ordinary split-radix algorithm.
Exploratory experiments suggest that good results can be obtained without
empirical techniques, and that certain parameters can be predicted based on the
characteristics of the underlying machine and the compiler used. Hypothesis 3
is tested in Results and Discussion by building a model that predicts
performance, and by benchmarking FFTW against an implementation that does not
require extensive calibration, on 18 different machines.
In investigating the hypotheses, the scope of this work has been limited in several ways:
- It is limited to single-threaded complex 1D FFTs, because
multi-dimensional, multi-threaded or multi-processor FFTs (or any
combination thereof) are ultimately decomposed into 1D components running on
a single core, and all other things being equal, it is the performance of
these 1D components running on a single microprocessor core that determines
the overall performance of a given multi-threaded implementation;
- It is limited to transforms that operate on vectors of length 2m2m
where m∈N0m∈N0, because these are the easiest to compute on
machines, and consequently the most often used by applications. This excludes
the prime-factor algorithm [27], [33],
and the Radar [31] and
Bluestein [5], [27], [32]
algorithms for prime sizes;
- It is limited to the
split-radix [6], [10], [23], [34], [36]
and
conjugate-pair [16], [21], [22], [30]
algorithms. The Winograd
algorithm [8], [10], [18], [35]
is excluded because of its low performance on systems where multiplication
costs about the same as addition;
- It is limited to out-of-place transforms, because they are generally
faster than in-place transforms, except at the boundaries of the
cache [3];
- The benchmark experiments are limited to the Intel x86 and ARM
machines, because it is estimated that 92% of the microprocessors in the
rapidly expanding mobile market are ARM devices [11], while
Intel's share of the worldwide PC and mobile PC microprocessors markets is
estimated to be 79.3% and 84.4%,
respectively [24].
The contributions of this work are summarized as follows:
- Three methods of computing the conjugate-pair algorithm on SIMD
microprocessors are described in Streaming FFT;
- The source code for the high-performance SIMD FFT library
developed in this thesis is publicly available under a permissive open source
licence on github.
This work is divided into two parts. The first part,
Chapters 1-4, encompasses the
relevant background, while the second part,
Chapters 5-8, is concerned with
contributions that challenge the state of the art.
A brief overview of the contents of each chapter:
- Algorithms provides an overview of FFT algorithms from the mathematical perspective;
- Implementation details complements the mathematical perspective
of the previous chapter with a more focused view of the low level details
that are relevant to efficient implementation on SIMD microprocessors;
- Existing libraries reviews existing state of the art libraries,
with reference to algorithms and implementation details of the previous
chapters;
- Streaming FFT describes SFFT, a library for
SIMD microprocessors that is, in many cases, faster than the state of
the art FFT libraries reviewed in Existing libraries;
- Benchmark methods describes the benchmarking methods used to
evaluate performance and accuracy of various FFT implementations throughout
this work;
- Results and discussion presents the results of benchmarks on 18
different machines, as well as the results of model-based optimization
experiments, with reference to earlier chapters and other related work;
- Conclusions and future work concludes the work with a review of
the hypotheses, a summary of the contributions, and some idea for directions
that future work might take.
-
Ali, A. and Johnsson, L. (2006). UHFFT: A High Performance DFT Framework.
-
Ali, A. and Johnsson, L. and Mirkovic, D. (2007). Empirical Auto-tuning Code Generator for FFT and Trignometric Transforms. In ODES: 5th Workshop on Optimizations for DSP and Embedded Systems, in conjunction with International Symposium on Code Generation and Optimization (CGO).
-
Ali, A. and Johnsson, L. and Subhlok, J. (2007). Adaptive computation of self sorting in-place FFTs on hierarchical memory architectures. High Performance Computing and Communications, 372–383.
-
Ali, A. and Johnsson, L. and Subhlok, J. (2007). Scheduling FFT computation on SMP and multicore systems. In Proceedings of the 21st annual international conference on Supercomputing. (p. 293–301). ACM.
-
Bluestein, L. (1970). A linear filtering approach to the computation of discrete Fourier transform. Audio and Electroacoustics, IEEE Transactions on, 18(4), 451–455.
-
Duhamel, P. and Hollmann, H. (1984). Split radix FFT algorithm. Electronics Letters, 20(1), 14–16.
-
Dongarra, J. and Sullivan, F. (2000). Guest Editors' Introduction: The Top 10 Algorithms. Computing in Science & Engineering, 22–23.
-
Duhamel, P. (1990). Algorithms meeting the lower bounds on the multiplicative complexity of length-2n DFTs and their connection with practical algorithms. Acoustics, Speech and Signal Processing, IEEE Transactions on, 38(9), 1504–1511.
-
Duncan, R. (1990). A survey of parallel computer architectures. Computer, 23(2), 5–16.
-
Duhamel, P. and Vetterli, M. (1990). Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19(4), 259–299.
-
Fitzpatrick, Jason. (2011). An Interview with Steve Furber. Communications of the ACM, 54(5), 34-39.
-
Frigo, M. and Johnson, S.G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, 93(2), 216–231.
-
Flynn, M.J. (1972). Some computer organizations and their effectiveness. Computers, IEEE Transactions on, 100(9), 948–960.
-
Frigo, M. (1999). A fast Fourier transform compiler. In ACM SIGPLAN Notices. (Vol. 34, p. 169–180). ACM.
-
Franchetti, F. and Voronenko, Y. and Puschel, M. (2006). FFT program generation for shared memory: SMP and multicore. In SC 2006 Conference, Proceedings of the ACM/IEEE. (p. 51–51). IEEE.
-
Gopinath, R. A. (1989). Conjugate pair fast Fourier transform. Electronics Letters, 25, 1084.
-
Garg, R. and Telang, R. (2011). Estimating App Demand from Publicly Available Data.
-
Heideman, M. and Burrus, C. (1986). On the number of multiplications necessary to compute a length-2n DFT. Acoustics, Speech and Signal Processing, IEEE Transactions on, 34(1), 91–95.
-
Johnson, S. G. and Frigo, M. (2008, September). Implementing FFTs in practice. In Connexions: Fast Fourier Transforms. Houston TX: Rice University.
-
Johnson, JR and Johnson, RW. (1997). Challenges of computing the fast Fourier transform. In DARPA Conference. Citeseer.
-
Kamar, I. and Elcherif, Y. (1989). Conjugate pair fast Fourier transform. Electronics Letters, 25, 324.
-
Krot, A. M. and Minervina, H. B. (1992). Conjugate pair fast Fourier transform. Electronics Letters, 28, 1143.
-
Martens, J.B. (1984). Recursive cyclotomic factorization–A new algorithm for calculating the discrete Fourier transform. Acoustics, Speech and Signal Processing, IEEE Transactions on, 32(4), 750–761.
-
McGrath, Dylan. (2011, September). IDC cuts PC microprocessor forecast. EETimes.
-
Mirkovic, D. and Johnsson, S.L. (2001). Automatic performance tuning in the UHFFT library. Lecture notes in computer science, I–71.
-
Mirkovic, D. and Mahasoom, R. and Johnsson, L. (2000). Adaptive software library for fast Fourier transforms. In 2000 International Conference on Supercomputing. (p. 215–224).
-
Oppenheim, A.V. and Schafer, R.W. and Buck, J.R. (1989). Discrete-time signal processing. (Vol. 2). Prentice hall Upper Saddle River eN. JNJ.
-
Puschel, M. and Moura, J.M.F. and Johnson, J.R. and Padua, D. and Veloso, M.M. and Singer, B.W. and Xiong, J. and Franchetti, F. and Gacic, A. and Voronenko, Y. (2005). SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2), 232–275.
-
Puschel, M. and Moura, J.M.F. and Singer, B. and Xiong, J. and Johnson, J. and Padua, D. and Veloso, M. and Johnson, R.W. (2004). SPIRAL: A generator for platform-adapted libraries of signal processing alogorithms. International Journal of High Performance Computing Applications, 18(1), 21.
-
Qian, H-S. and Zhao, Z-J. (1990). Conjugate pair fast Fourier transform. Electronics Letters, 26, 541.
-
Rader, C.M. (1968). Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 56(6), 1107–1108.
-
Rabiner, L. and Schafer, R. and Rader, C. (1969). The chirp z-transform algorithm. Audio and Electroacoustics, IEEE Transactions on, 17(2), 86–92.
-
Temperton, C. (1983). Note on prime factor FFT algorithms. J. Comput. Phys., 1,
-
Vetterli, M. and Nussbaumer, H.J. (1984). Simple FFT and DCT algorithms with reduced number of operations. Signal Processing, 6(4), 267–278.
-
Winograd, S. (1978). On computing the discrete Fourier transform. Mathematics of Computation, 32(141), 175–199.
-
Yavne, R. (1968). An economical method for calculating the discrete Fourier transform. In Proceedings of the December 9-11, 1968, fall joint computer conference, part I. (p. 115–125). ACM.