This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than or very close to the speed of state of the art libraries such as FFTW (``Fastest Fourier Transform in the West''), SPIRAL and Intel Integrated Performance Primitives (IPP).
The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL.
The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (``Streaming Fast Fourier Transform''), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state of the art libraries, but without having to perform extensive machine specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
University of Waikato
This collection contains: