Skip to content Skip to navigation


You are here: Home » Content » Introduction


Recently Viewed

This feature requires Javascript to be enabled.


Module by: Anthony Blake. E-mail the author

In 1990, it was estimated that Cray Research's installed base of approximately 200 machines spent 40% of all CPU cycles computing the fast Fourier transform (FFT) [20]. With each machine worth about USD$25 million, the performance of the FFT was of prime importance.

Today, use of the FFT is even more pervasive, and it is counted among the 10 algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century [7]. Huge numbers of mobile smartphones, tablets and PCs [24], [11], most of which are equipped with single-instruction, multiple-data (SIMD) [9], [13] microprocessors, compute the FFT on a large scale for a plethora of sound, video and image processing applications. In the space of a few years, mobile applications have become a part of many people's everyday lives [17].

This thesis shows that the key to optimizing the performance of the split-radix FFT algorithms on SIMD microprocessors is latency and spatial locality optimizations, and in some cases, a variant of the split-radix FFT called the conjugate-pair algorithm [16], [21], [22], [30]. It is also shown that extensive machine specific calibration may be superfluous.


FFTW [14], [12], [19], SPIRAL [15], [29], [28] and UHFFT [2], [4], [1], [26], [25] are state of the art FFT libraries that employ automatic empirical optimization. SPIRAL automatically performs machine-specific optimizations at compile time, and FFTW and UHFFT automatically adapt to a machine at run-time. Aside from the use of automatic optimization, a common denominator among these libraries is the use of large straight line blocks of code and optimized memory locality.

The hypotheses outlined below test whether good heuristics and model-based optimization can be used in the place of automatic empirical optimization.

Hypothesis 1: Accessing memory in sequential “streams” is critical for best performance

Large FFT exhibit poor temporal locality, and when computing these transforms on microprocessor based systems that feature a cache, best performance is typically achieved when “streaming” sequential data through the CPU. Hypothesis 1 is tested in Implementation Details with replicated coefficient lookup tables that trade-off increased memory size for better spatial locality, and in Streaming FFT by topologically sorting a directed acyclic graph (DAG) of sub-transforms to again improve spatial locality.

Hypothesis 2: The conjugate-pair algorithm is faster than the ordinary split-radix algorithm

Hypothesis 2 is based on the idea that memory bandwidth is a bottleneck, and on the fact that the conjugate-pair algorithm requires only half the number of twiddle factor loads. This hypothesis is tested in Split-radix vs. conjugate-pair, where a highly optimized implementation of the conjugate-pair algorithm is benchmarked against an equally highly optimized implementation of the ordinary split-radix algorithm.

Hypothesis 3: The performance of an FFT can be predicted based on characteristics of the underlying machine and the compiler

Exploratory experiments suggest that good results can be obtained without empirical techniques, and that certain parameters can be predicted based on the characteristics of the underlying machine and the compiler used. Hypothesis 3 is tested in Results and Discussion by building a model that predicts performance, and by benchmarking FFTW against an implementation that does not require extensive calibration, on 18 different machines.


In investigating the hypotheses, the scope of this work has been limited in several ways:

  1. It is limited to single-threaded complex 1D FFTs, because multi-dimensional, multi-threaded or multi-processor FFTs (or any combination thereof) are ultimately decomposed into 1D components running on a single core, and all other things being equal, it is the performance of these 1D components running on a single microprocessor core that determines the overall performance of a given multi-threaded implementation;
  2. It is limited to transforms that operate on vectors of length 2m2m where mN0mN0, because these are the easiest to compute on machines, and consequently the most often used by applications. This excludes the prime-factor algorithm [27], [33], and the Radar [31] and Bluestein [5], [27], [32] algorithms for prime sizes;
  3. It is limited to the split-radix [6], [10], [23], [34], [36] and conjugate-pair [16], [21], [22], [30] algorithms. The Winograd algorithm [8], [10], [18], [35] is excluded because of its low performance on systems where multiplication costs about the same as addition;
  4. It is limited to out-of-place transforms, because they are generally faster than in-place transforms, except at the boundaries of the cache [3];
  5. The benchmark experiments are limited to the Intel x86 and ARM machines, because it is estimated that 92% of the microprocessors in the rapidly expanding mobile market are ARM devices [11], while Intel's share of the worldwide PC and mobile PC microprocessors markets is estimated to be 79.3% and 84.4%, respectively [24].


The contributions of this work are summarized as follows:

  1. Three methods of computing the conjugate-pair algorithm on SIMD microprocessors are described in Streaming FFT;
  2. The source code for the high-performance SIMD FFT library developed in this thesis is publicly available under a permissive open source licence on github.


This work is divided into two parts. The first part, Chapters 1-4, encompasses the relevant background, while the second part, Chapters 5-8, is concerned with contributions that challenge the state of the art.

A brief overview of the contents of each chapter:

  1. Algorithms provides an overview of FFT algorithms from the mathematical perspective;
  2. Implementation details complements the mathematical perspective of the previous chapter with a more focused view of the low level details that are relevant to efficient implementation on SIMD microprocessors;
  3. Existing libraries reviews existing state of the art libraries, with reference to algorithms and implementation details of the previous chapters;
  4. Streaming FFT describes SFFT, a library for SIMD microprocessors that is, in many cases, faster than the state of the art FFT libraries reviewed in Existing libraries;
  5. Benchmark methods describes the benchmarking methods used to evaluate performance and accuracy of various FFT implementations throughout this work;
  6. Results and discussion presents the results of benchmarks on 18 different machines, as well as the results of model-based optimization experiments, with reference to earlier chapters and other related work;
  7. Conclusions and future work concludes the work with a review of the hypotheses, a summary of the contributions, and some idea for directions that future work might take.


  1. Ali, A. and Johnsson, L. (2006). UHFFT: A High Performance DFT Framework.
  2. Ali, A. and Johnsson, L. and Mirkovic, D. (2007). Empirical Auto-tuning Code Generator for FFT and Trignometric Transforms. In ODES: 5th Workshop on Optimizations for DSP and Embedded Systems, in conjunction with International Symposium on Code Generation and Optimization (CGO).
  3. Ali, A. and Johnsson, L. and Subhlok, J. (2007). Adaptive computation of self sorting in-place FFTs on hierarchical memory architectures. High Performance Computing and Communications, 372–383.
  4. Ali, A. and Johnsson, L. and Subhlok, J. (2007). Scheduling FFT computation on SMP and multicore systems. In Proceedings of the 21st annual international conference on Supercomputing. (p. 293–301). ACM.
  5. Bluestein, L. (1970). A linear filtering approach to the computation of discrete Fourier transform. Audio and Electroacoustics, IEEE Transactions on, 18(4), 451–455.
  6. Duhamel, P. and Hollmann, H. (1984). Split radix FFT algorithm. Electronics Letters, 20(1), 14–16.
  7. Dongarra, J. and Sullivan, F. (2000). Guest Editors' Introduction: The Top 10 Algorithms. Computing in Science & Engineering, 22–23.
  8. Duhamel, P. (1990). Algorithms meeting the lower bounds on the multiplicative complexity of length-2n DFTs and their connection with practical algorithms. Acoustics, Speech and Signal Processing, IEEE Transactions on, 38(9), 1504–1511.
  9. Duncan, R. (1990). A survey of parallel computer architectures. Computer, 23(2), 5–16.
  10. Duhamel, P. and Vetterli, M. (1990). Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19(4), 259–299.
  11. Fitzpatrick, Jason. (2011). An Interview with Steve Furber. Communications of the ACM, 54(5), 34-39.
  12. Frigo, M. and Johnson, S.G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, 93(2), 216–231.
  13. Flynn, M.J. (1972). Some computer organizations and their effectiveness. Computers, IEEE Transactions on, 100(9), 948–960.
  14. Frigo, M. (1999). A fast Fourier transform compiler. In ACM SIGPLAN Notices. (Vol. 34, p. 169–180). ACM.
  15. Franchetti, F. and Voronenko, Y. and Puschel, M. (2006). FFT program generation for shared memory: SMP and multicore. In SC 2006 Conference, Proceedings of the ACM/IEEE. (p. 51–51). IEEE.
  16. Gopinath, R. A. (1989). Conjugate pair fast Fourier transform. Electronics Letters, 25, 1084.
  17. Garg, R. and Telang, R. (2011). Estimating App Demand from Publicly Available Data.
  18. Heideman, M. and Burrus, C. (1986). On the number of multiplications necessary to compute a length-2n DFT. Acoustics, Speech and Signal Processing, IEEE Transactions on, 34(1), 91–95.
  19. Johnson, S. G. and Frigo, M. (2008, September). Implementing FFTs in practice. In Connexions: Fast Fourier Transforms. Houston TX: Rice University.
  20. Johnson, JR and Johnson, RW. (1997). Challenges of computing the fast Fourier transform. In DARPA Conference. Citeseer.
  21. Kamar, I. and Elcherif, Y. (1989). Conjugate pair fast Fourier transform. Electronics Letters, 25, 324.
  22. Krot, A. M. and Minervina, H. B. (1992). Conjugate pair fast Fourier transform. Electronics Letters, 28, 1143.
  23. Martens, J.B. (1984). Recursive cyclotomic factorization–A new algorithm for calculating the discrete Fourier transform. Acoustics, Speech and Signal Processing, IEEE Transactions on, 32(4), 750–761.
  24. McGrath, Dylan. (2011, September). IDC cuts PC microprocessor forecast. EETimes.
  25. Mirkovic, D. and Johnsson, S.L. (2001). Automatic performance tuning in the UHFFT library. Lecture notes in computer science, I–71.
  26. Mirkovic, D. and Mahasoom, R. and Johnsson, L. (2000). Adaptive software library for fast Fourier transforms. In 2000 International Conference on Supercomputing. (p. 215–224).
  27. Oppenheim, A.V. and Schafer, R.W. and Buck, J.R. (1989). Discrete-time signal processing. (Vol. 2). Prentice hall Upper Saddle River eN. JNJ.
  28. Puschel, M. and Moura, J.M.F. and Johnson, J.R. and Padua, D. and Veloso, M.M. and Singer, B.W. and Xiong, J. and Franchetti, F. and Gacic, A. and Voronenko, Y. (2005). SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 93(2), 232–275.
  29. Puschel, M. and Moura, J.M.F. and Singer, B. and Xiong, J. and Johnson, J. and Padua, D. and Veloso, M. and Johnson, R.W. (2004). SPIRAL: A generator for platform-adapted libraries of signal processing alogorithms. International Journal of High Performance Computing Applications, 18(1), 21.
  30. Qian, H-S. and Zhao, Z-J. (1990). Conjugate pair fast Fourier transform. Electronics Letters, 26, 541.
  31. Rader, C.M. (1968). Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 56(6), 1107–1108.
  32. Rabiner, L. and Schafer, R. and Rader, C. (1969). The chirp z-transform algorithm. Audio and Electroacoustics, IEEE Transactions on, 17(2), 86–92.
  33. Temperton, C. (1983). Note on prime factor FFT algorithms. J. Comput. Phys., 1,
  34. Vetterli, M. and Nussbaumer, H.J. (1984). Simple FFT and DCT algorithms with reduced number of operations. Signal Processing, 6(4), 267–278.
  35. Winograd, S. (1978). On computing the discrete Fourier transform. Mathematics of Computation, 32(141), 175–199.
  36. Yavne, R. (1968). An economical method for calculating the discrete Fourier transform. In Proceedings of the December 9-11, 1968, fall joint computer conference, part I. (p. 115–125). ACM.

Content actions

Download module as:

Add module to:

My Favorites (?)

'My Favorites' is a special kind of lens which you can use to bookmark modules and collections. 'My Favorites' can only be seen by you, and collections saved in 'My Favorites' can remember the last module you were on. You need an account to use 'My Favorites'.

| A lens I own (?)

Definition of a lens


A lens is a custom view of the content in the repository. You can think of it as a fancy kind of list that will let you see content through the eyes of organizations and people you trust.

What is in a lens?

Lens makers point to materials (modules and collections), creating a guide that includes their own comments and descriptive tags about the content.

Who can create a lens?

Any individual member, a community, or a respected organization.

What are tags? tag icon

Tags are descriptors added by lens makers to help label content, attaching a vocabulary that is meaningful in the context of the lens.

| External bookmarks