FFTW [14], [12], [19], SPIRAL [15], [29], [28] and UHFFT [2], [4], [1], [26], [25] are state-of-the-art FFT libraries that employ automatic empirical optimization. SPIRAL automatically performs machine-specific optimizations at compile time, while FFTW and UHFFT automatically adapt to a machine at run time. Aside from the use of automatic optimization, a common denominator among these libraries is the use of large straight-line blocks of code and optimized memory locality.

The hypotheses outlined below test whether good heuristics and model-based optimization can be used in place of automatic empirical optimization.

**Hypothesis 1: Accessing memory in sequential “streams” is critical for best performance**

Large FFTs exhibit poor temporal locality, and when computing these transforms on microprocessor-based systems that feature a cache, best performance is typically achieved when “streaming” sequential data through the CPU. Hypothesis 1 is tested in Implementation Details with replicated coefficient lookup tables that trade increased memory size for better spatial locality, and in Streaming FFT by topologically sorting a directed acyclic graph (DAG) of sub-transforms to again improve spatial locality.
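The DAG-scheduling idea can be sketched with Kahn's topological sort, using a priority queue so that, among the sub-transforms whose dependencies are satisfied, the one reading from the lowest memory offset is emitted first. The node/offset representation and the lowest-offset tie-break below are illustrative assumptions, not the thesis's actual data structures:

```python
import heapq
from collections import defaultdict

def schedule_subtransforms(nodes, edges):
    """Topologically sort a DAG of sub-transforms (Kahn's algorithm).

    `nodes` maps a node id to the memory offset of its input; among the
    nodes that are ready, the one with the lowest offset is emitted
    first, so sub-transforms that read adjacent memory run back-to-back.
    An edge (u, v) means v depends on the result of u.
    """
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # min-heap keyed on input offset: prefer the "lowest" ready node
    ready = [(nodes[n], n) for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                heapq.heappush(ready, (nodes[m], m))
    return order
```

For three leaf sub-transforms feeding one parent, the schedule visits the leaves in ascending memory order before the parent, which is the spatial-locality property the hypothesis relies on.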

**Hypothesis 2: The conjugate-pair algorithm is faster than the ordinary split-radix algorithm**

Hypothesis 2 is based on the idea that memory bandwidth is a bottleneck, and on the fact that the conjugate-pair algorithm requires only half the number of twiddle-factor loads. This hypothesis is tested in Split-radix vs. conjugate-pair, where a highly optimized implementation of the conjugate-pair algorithm is benchmarked against an equally optimized implementation of the ordinary split-radix algorithm.
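The halved load count follows from a symmetry of the twiddle factors: where ordinary split-radix multiplies its two odd sub-transforms by ω_N^k and ω_N^{3k} (two distinct table entries), the conjugate-pair algorithm shifts one sub-transform so the factors become ω_N^k and ω_N^{−k} = conj(ω_N^k), and the conjugate costs only a sign flip on the imaginary part. A minimal sketch (function names are illustrative):

```python
import cmath

def twiddles_ordinary(N, k):
    # Ordinary split-radix: two distinct twiddle factors per butterfly,
    # i.e. two table loads.
    w = cmath.exp(-2j * cmath.pi * k / N)
    w3 = cmath.exp(-2j * cmath.pi * 3 * k / N)
    return w, w3

def twiddles_conjugate_pair(N, k):
    # Conjugate-pair: one table load; the second factor is w^{-k},
    # which is the complex conjugate of w^k and needs no load.
    w = cmath.exp(-2j * cmath.pi * k / N)
    return w, w.conjugate()
```

Both schemes compute the same transform; the conjugate-pair variant simply halves the traffic from the coefficient table, which matters when memory bandwidth, rather than arithmetic, is the limiting resource.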

**Hypothesis 3: The performance of an FFT can be predicted based on characteristics of the underlying machine and the compiler**

Exploratory experiments suggest that good results can be obtained without empirical techniques, and that certain parameters can be predicted based on the characteristics of the underlying machine and the compiler used. Hypothesis 3 is tested in Results and Discussion by building a model that predicts performance, and by benchmarking FFTW against an implementation that does not require extensive calibration, on 18 different machines.
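As a toy illustration of model-based parameter prediction, one such parameter is the size of the straight-line leaf transforms: a simple heuristic picks the largest power-of-two leaf whose working set (data plus twiddle factors) fits in the L1 data cache. The specific cost model below is a hypothetical sketch, not the model built in Results and Discussion:

```python
def predict_leaf_size(l1_data_cache_bytes, bytes_per_complex=16):
    """Largest power-of-two leaf size n such that n complex data points
    plus n twiddle factors fit in the L1 data cache.

    Assumes double-precision complex values (16 bytes each); the
    2 * n accounts for data and twiddles together.
    """
    n = 1
    while 2 * (2 * n) * bytes_per_complex <= l1_data_cache_bytes:
        n *= 2
    return n
```

On a machine with a 32 KiB L1 data cache this predicts a leaf size of 1024; halving the cache halves the predicted leaf. If such predictions track measured performance across machines, the extensive run-time calibration performed by libraries like FFTW becomes unnecessary.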