
# Implementation Details

Module by: Anthony Blake.

This chapter complements the mathematical perspective of Algorithms with a more focused view of the low-level details that are relevant to efficient implementation on SIMD microprocessors. These techniques are widely practised by today's state-of-the-art implementations, and form the basis for the more advanced techniques presented in later chapters.

## Simple programs

Fast Fourier transforms (FFTs) can be succinctly expressed as depth-first recursive algorithms. For example, the Cooley-Tukey FFT divides a size N transform into two size N/2 transforms, which in turn are divided into size N/4 transforms. The recursion continues until the base case of two size 1 transforms is reached; the two size 1 sub-transforms are then combined into a size 2 sub-transform, two completed size 2 sub-transforms are combined into a size 4 sub-transform, and so on, until the size N transform is complete.

Computing the FFT with such a depth first traversal has an important advantage in terms of memory locality: at any point during the traversal, the two completed sub-transforms that compose a larger sub-transform will still be in the closest level of the memory hierarchy in which they fit (see, i.a., [5] and [3]). In contrast, a breadth first traversal of a sufficiently large transform could force data out of cache during every pass (ibid.).

Many implementations of the FFT require a bit-reversal permutation of either the input or the output data, but a depth-first recursive algorithm implicitly performs the permutation during recursion. The bit-reversal permutation is an expensive computation, and despite being the subject of hundreds of research papers over the years, it can easily account for a large fraction of the FFT's runtime – more so for the conjugate-pair algorithm with the rotated bit-reversal permutation. Such permutations will be encountered in later sections, but for the time being it should be noted that the algorithms in this chapter do not require bit-reversal permutations – the input and output are in natural order.

```
DITFFT2_N(x_n):
  IF N = 1
    RETURN x_0
  ELSE
    E_{k=0,…,N/2-1} ← DITFFT2_{N/2}(x_{2n})
    O_{k=0,…,N/2-1} ← DITFFT2_{N/2}(x_{2n+1})
    FOR k = 0 to N/2-1
      X_k       ← E_k + ω_N^k O_k
      X_{k+N/2} ← E_k - ω_N^k O_k
    END FOR
    RETURN X_k
  ENDIF
```

A recursive depth-first implementation of the Cooley-Tukey radix-2 decimation-in-time (DIT) FFT is described with pseudocode in Code 1, and an implementation coded in C with only the most basic optimization – avoiding multiply operations where ω_N^0 is unity in the first iteration of the loop – is included in Appendix 1. Even when compiled with a state-of-the-art auto-vectorizing compiler,1 the code achieves poor performance on modern microprocessors, and is useful only as a baseline reference.2

| Implementation | Machine | Runtime |
| --- | --- | --- |
| Danielson-Lanczos, 1942 [2] | Human | 140 minutes |
| Cooley-Tukey, 1965 [1] | IBM 7094 | ∼10.5 ms |
| Listing 1, Appendix 1, 2011 | Macbook Air 4,2 | ∼440 μs |

However, it is worth noting that, considered from a historical perspective, the performance does seem impressive – as shown in Table 1. The runtimes in Table 1 are approximate: the Cooley-Tukey figure is roughly extrapolated from the floating point operations per second (FLOPS) count of a size 2048 complex transform given in their 1965 paper [1], and the speed of the reference implementation is derived from the runtime of a size 64 complex FFT (again, based on the FLOPS count). Furthermore, the precision differs between the results; Danielson and Lanczos computed the DFT to 3–5 significant figures (possibly with the aid of slide rules or adding machines), while the other results were computed with the host machines' implementation of single precision floating point arithmetic.

The runtime performance of the FFT has improved by about seven orders of magnitude in 70 years, and this can largely be attributed to the computing machines of each era becoming generally faster. The following sections and chapters will show that the performance can be further improved by over two orders of magnitude if the algorithm is enhanced with optimizations that are amenable to the underlying machine.

```
SPLITFFT_N(x_n):
  IF N = 1
    RETURN x_0
  ELSIF N = 2
    X_0 ← x_0 + x_1
    X_1 ← x_0 - x_1
  ELSE
    U_{k=0,…,N/2-1}  ← SPLITFFT_{N/2}(x_{2n})
    Z_{k=0,…,N/4-1}  ← SPLITFFT_{N/4}(x_{4n+1})
    Z'_{k=0,…,N/4-1} ← SPLITFFT_{N/4}(x_{4n+3})
    FOR k = 0 to N/4-1
      X_k        ← U_k + (ω_N^k Z_k + ω_N^{3k} Z'_k)
      X_{k+N/2}  ← U_k - (ω_N^k Z_k + ω_N^{3k} Z'_k)
      X_{k+N/4}  ← U_{k+N/4} - i(ω_N^k Z_k - ω_N^{3k} Z'_k)
      X_{k+3N/4} ← U_{k+N/4} + i(ω_N^k Z_k - ω_N^{3k} Z'_k)
    END FOR
  ENDIF
  RETURN X_k
```

As was the case with the radix-2 FFT in the previous section, the split-radix FFT neatly maps from the system of linear equations to the pseudocode of Code 2, and then to the C implementation included in Appendix 1.

Code 2 explicitly handles the base case for N = 2, to accommodate not only size 2 transforms, but also size 4 and size 8 transforms (and all larger transforms that are ultimately composed of these smaller transforms). A size 4 transform is divided into two size 1 sub-transforms and one size 2 transform, which cannot be further divided by the split-radix algorithm, and so it must be handled as a base case. Likewise with the size 8 transform, which divides into one size 4 sub-transform and two size 2 sub-transforms: the size 2 sub-transforms cannot be further decomposed with the split-radix algorithm.

Also note that two twiddle factors, viz. ω_N^k and ω_N^{3k}, are required for the split-radix decomposition; this is an advantage compared to the radix-2 decomposition, which would require four twiddle factors for the same size 4 transform.

### Conjugate-pair

From a pseudocode perspective, there is little difference between the ordinary split-radix algorithm and the conjugate-pair algorithm (see Code 3). In line 10, the x_{4n+3} terms have been cyclically shifted by -4 to x_{4n-1}, and in lines 12–15, the coefficient of Z'_k has been changed from ω_N^{3k} to ω_N^{-k}.

```
CONJFFT_N(x_n):
  IF N = 1
    RETURN x_0
  ELSIF N = 2
    X_0 ← x_0 + x_1
    X_1 ← x_0 - x_1
  ELSE
    U_{k=0,…,N/2-1}  ← CONJFFT_{N/2}(x_{2n})
    Z_{k=0,…,N/4-1}  ← CONJFFT_{N/4}(x_{4n+1})
    Z'_{k=0,…,N/4-1} ← CONJFFT_{N/4}(x_{4n-1})
    FOR k = 0 to N/4-1
      X_k        ← U_k + (ω_N^k Z_k + ω_N^{-k} Z'_k)
      X_{k+N/2}  ← U_k - (ω_N^k Z_k + ω_N^{-k} Z'_k)
      X_{k+N/4}  ← U_{k+N/4} - i(ω_N^k Z_k - ω_N^{-k} Z'_k)
      X_{k+3N/4} ← U_{k+N/4} + i(ω_N^k Z_k - ω_N^{-k} Z'_k)
    END FOR
  ENDIF
  RETURN X_k
```

The source code has a few subtle differences that are not revealed in the pseudocode. The pseudocode in Code 3 requires an array of complex numbers x_n for input, but the source code requires a reference to an array of complex numbers with a stride3 – this avoids copying x_n into three separate arrays, viz. x_{2n}, x_{4n+1} and x_{4n-1}, with every invocation of Code 3. A subtle complication arises from the cyclic shifting of the x_{4n-1} term: the negative shift results in pointers that reference data before the start of the array. Rather than immediately wrapping these references around to the end of the array so that they always point to valid data, the recursion proceeds until the base cases are reached before any adjustment is performed. Once at the leaves of the recursion, any pointers that reference data lying before the start of the input array are incremented by N elements,4 so as to point to the correct data.
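The wrap-around at the leaves amounts to a modulo-N correction of a term index that has gone negative. A minimal sketch, with an invented helper name (an index is used here rather than a raw pointer, since comparing a pointer that precedes its array is not strictly portable C):

```c
#include <stddef.h>

/* Illustrative sketch (not the author's code): at a leaf of the
   conjugate-pair recursion, a term index that has drifted below zero
   (e.g. 4n-1 with n = 0) is brought back into range by adding N, the
   size of the outermost transform: x_{-1} aliases x_{N-1}, and so on. */
static size_t adjust_leaf_index(ptrdiff_t idx, size_t N)
{
    return (idx < 0) ? (size_t)(idx + (ptrdiff_t)N) : (size_t)idx;
}
```

Deferring this check to the leaves keeps the recursive calls free of conditionals on every level of the tree.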

```
TANGENTFFT4_N(x_n):
  IF N = 1
    RETURN x_0
  ELSIF N = 2
    X_0 ← x_0 + x_1
    X_1 ← x_0 - x_1
  ELSE
    U_{k=0,…,N/2-1}  ← TANGENTFFT4_{N/2}(x_{2n})
    Z_{k=0,…,N/4-1}  ← TANGENTFFT8_{N/4}(x_{4n+1})
    Z'_{k=0,…,N/4-1} ← TANGENTFFT8_{N/4}(x_{4n-1})
    FOR k = 0 to N/4-1
      X_k        ← U_k + (ω_N^k s_{N/4,k} Z_k + ω_N^{-k} s_{N/4,k} Z'_k)
      X_{k+N/2}  ← U_k - (ω_N^k s_{N/4,k} Z_k + ω_N^{-k} s_{N/4,k} Z'_k)
      X_{k+N/4}  ← U_{k+N/4} - i(ω_N^k s_{N/4,k} Z_k - ω_N^{-k} s_{N/4,k} Z'_k)
      X_{k+3N/4} ← U_{k+N/4} + i(ω_N^k s_{N/4,k} Z_k - ω_N^{-k} s_{N/4,k} Z'_k)
    END FOR
  ENDIF
  RETURN X_k
```
```
TANGENTFFT8_N(x_n):
  IF N = 1
    RETURN x_0
  ELSIF N = 2
    X_0 ← x_0 + x_1
    X_1 ← x_0 - x_1
  ELSIF N = 4
    T_{k=0,1}  ← TANGENTFFT8_2(x_{2n})
    T'_{k=0,1} ← TANGENTFFT8_2(x_{2n+1})
    X_0 ← T_0 + T'_0
    X_2 ← T_0 - T'_0
    X_1 ← T_1 + T'_1
    X_3 ← T_1 - T'_1
  ELSE
    U_{k=0,…,N/4-1}  ← TANGENTFFT8_{N/4}(x_{4n})
    Y_{k=0,…,N/8-1}  ← TANGENTFFT8_{N/8}(x_{8n+2})
    Y'_{k=0,…,N/8-1} ← TANGENTFFT8_{N/8}(x_{8n-2})
    Z_{k=0,…,N/4-1}  ← TANGENTFFT8_{N/4}(x_{4n+1})
    Z'_{k=0,…,N/4-1} ← TANGENTFFT8_{N/4}(x_{4n-1})
    FOR k = 0 to N/8-1
      α_{N,k} ← s_{N/4,k} / s_{N,k}
      β_{N,k} ← s_{N/8,k} / s_{N/2,k}
      γ_{N,k} ← s_{N/4,k+N/8} / s_{N,k+N/8}
      δ_{N,k} ← s_{N/2,k} / s_{N,k}
      ε_{N,k} ← s_{N/2,k+N/8} / s_{N,k+N/8}
      Ω_0 ← ω_N^k α_{N,k}
      Ω_1 ← ω_N^{k+N/8} γ_{N,k}
      Ω_2 ← ω_N^{2k} β_{N,k}
      T_0 ← (Ω_2 Y_k + Ω̄_2 Y'_k) δ_{N,k}
      T_1 ← i(Ω_2 Y_k - Ω̄_2 Y'_k) ε_{N,k}
      X_k        ← U_k α_{N,k} + T_0 + (Ω_0 Z_k + Ω̄_0 Z'_k)
      X_{k+N/2}  ← U_k α_{N,k} + T_0 - (Ω_0 Z_k + Ω̄_0 Z'_k)
      X_{k+N/4}  ← U_k α_{N,k} - T_0 - i(Ω_0 Z_k - Ω̄_0 Z'_k)
      X_{k+3N/4} ← U_k α_{N,k} - T_0 + i(Ω_0 Z_k - Ω̄_0 Z'_k)
      X_{k+N/8}  ← U_{k+N/8} γ_{N,k} - T_1 + (Ω_1 Z_{k+N/8} + Ω̄_1 Z'_{k+N/8})
      X_{k+3N/8} ← U_{k+N/8} γ_{N,k} + T_1 - i(Ω_1 Z_{k+N/8} - Ω̄_1 Z'_{k+N/8})
      X_{k+5N/8} ← U_{k+N/8} γ_{N,k} - T_1 - (Ω_1 Z_{k+N/8} + Ω̄_1 Z'_{k+N/8})
      X_{k+7N/8} ← U_{k+N/8} γ_{N,k} + T_1 + i(Ω_1 Z_{k+N/8} - Ω̄_1 Z'_{k+N/8})
    END FOR
  ENDIF
  RETURN X_k
```

### Tangent

The tangent FFT is divided into two functions, described with pseudocode in Code 4 and Code 5. If the tangent FFT is computed prior to convolution in the frequency domain, the convolution kernel can absorb the final scaling and only Code 5 is required. Otherwise Code 4 is used as a wrapper around Code 5 to perform the rescaling, and the result X_k is in the correct basis.

Code 4 is similar to Code 3, except that Z_k and Z'_k are computed with Code 5, and are thus scaled by 1/s_{N/4,k}. Because Z_k and Z'_k are respectively multiplied by the coefficients ω_N^k and ω_N^{-k}, the results are scaled into the correct basis by absorbing s_{N/4,k} into the coefficients.

Code 5 is almost a 1:1 mapping of the system of linear equations, except that the base cases of N = 1, 2, 4 are handled explicitly. In Code 5, the case of N = 4 is handled with two size 2 base cases, which are combined into a size 4 FFT.

### Putting it all together

The simple implementations covered in this section were benchmarked for transform sizes 2^2 through to 2^18, running on a Macbook Air 4,2, and the results are plotted in Figure 1. The speed of each transform is measured in Cooley-Tukey gigaflops (CTGs), where a higher measurement indicates a faster transform.5

It can be seen from Figure 1 that although the conjugate-pair and split-radix algorithms have exactly the same FLOP count, the conjugate-pair algorithm is substantially faster. The difference in speed can be attributed to the fact that the conjugate-pair algorithm requires only one twiddle factor per size 4 sub-transform, whereas the ordinary split-radix algorithm requires two.

Though the tangent FFT requires the same number of twiddle factors as the conjugate-pair algorithm and uses fewer FLOPs, its performance is worse than the radix-2 FFT for most sizes of transform; this can be attributed to the cost of computing the scaling factors.

A simple analysis with a profiling tool reveals that each implementation's runtime is dominated by the time taken to compute the coefficients. Even in the case of the conjugate-pair algorithm, over 55% of the runtime is spent calculating the complex exponential function. Eliminating this performance bottleneck is the topic of the next section.

## Precomputed coefficients

The speed of Code 1 through Code 5 may be dramatically improved if the coefficients are precomputed and stored in a lookup table (LUT).

When computing an FFT of size N, Code 1 requires N/2 different twiddle factors that correspond to N/2 samples of a half rotation around the complex plane. Rather than storing N/2 complex numbers, the symmetries of the sine and cosine waves that compose ω_N^k may be exploited to reduce the storage to N/4 real numbers – a 75% reduction in memory – by storing only one quadrant of a sine or cosine wave, from which the real and imaginary parts of any twiddle factor can be constructed. Such a scheme has advantages in hardware implementations where LUT memory is a costly resource [4], but for modern microprocessor implementations of the FFT, it is more advantageous to have a less complex indexing scheme and better memory locality rather than a smaller LUT.

As already mentioned, each transform of size N that is computed with Code 1 requires N/2 twiddle factors, from ω_N^0 through to ω_N^{N/2-1}, while the two sub-transforms of Code 1 require twiddle factors ranging from ω_{N/2}^0 through to ω_{N/2}^{N/4-1}. The twiddle factors of the sub-transforms can be obtained by downsampling the parent transform's twiddle factors by a factor of 2, and because the downsampling factors are all powers of 2, simple shift operations can be used to index any twiddle factor anywhere in the transform from one LUT.

Appendix 2 contains listings of source code that augment each of the simple implementations from the previous section with LUTs of precomputed coefficients. The modifications are fairly minor: each implementation now has an initialization function that populates the LUT(s) based on the size of the transform to be computed, and each transform function now has a log2(stride) parameter, so as to economically index the twiddle factors with little computation.

As Figure 2 shows, the speedup resulting from the precomputed twiddle LUT is dramatic – sometimes more than a factor of 6 (cf. Figure 1). Interestingly, the ordinary split-radix algorithm is now faster than the conjugate-pair algorithm; inspection of the compiler output shows that this is due to the more complicated addressing scheme at the leaves of the computation, and to the compiler lacking good heuristics for complex multiplication by a conjugate. The performance of the tangent FFT is hampered by the same problem, yet the tangent FFT is faster, which can be attributed to the tangent FFT having larger straight line blocks of code at the leaves of the computation (the tangent FFT has leaves of size 4, while the split-radix and conjugate-pair FFTs have leaves of size 2).

## Single instruction, multiple data

The performance of the programs in the previous section may be further improved by explicitly describing the computation with SIMD intrinsics. Auto-vectorizing compilers, such as the Intel C compiler used to compile the previous examples, can extract some data-level parallelism and generate SIMD code from a scalar description of a computation, but better results can be obtained when using vector intrinsics to explicitly specify the parallel computation.

Intrinsics are an alternative to inline assembly code when the compiler fails to meet performance constraints. In most cases an intrinsic function directly maps to a single instruction on the underlying machine, and so intrinsics provide many of the advantages of inline assembler code. But in contrast to inline assembler code, the compiler uses its detailed knowledge of the intrinsic semantics to provide better optimizations and handle tasks such as register allocation.

Almost all desktop and handheld machines now have processors that implement some sort of SIMD extension to the instruction set. All major Intel processors since the Pentium III have implemented SSE, an extension to the x86 architecture that introduced 4-way single precision floating point computation with a new register file consisting of eight 128-bit SIMD registers – known as XMM registers. The AMD64 architecture doubled the number of XMM registers to 16, and Intel followed by implementing 16 XMM registers in the Intel 64 architecture. SSE has since been expanded with support for other data types and new instructions with the introduction of SSE2, SSE3, SSSE3 and SSE4. Most notably, SSE2 introduced support for double precision floating point arithmetic, and thus Intel's first attempt at SIMD extensions, MMX, was effectively deprecated. Intel's recent introduction of the Sandy Bridge micro-architecture heralded the first implementation of AVX – a major upgrade to SSE that doubled the size of XMM registers to 256 bits (and renamed them YMM registers), enabling 8-way single precision and 4-way double precision floating point arithmetic.

Another notable example of SIMD extensions implemented in commodity microprocessors is the NEON extension to the ARMv7 architecture. The Cortex family of processors that implement ARMv7 are widely used in mobile, handheld and tablet computing devices such as the iPad, iPhone and Canon PowerShot A470, and the NEON extensions provide these embedded devices with the performance required for processing audio and video codecs as well as graphics and gaming workloads.

Compared to SSE and AVX, NEON has some subtle differences that can greatly improve performance if used properly. First, it has dual length SIMD vectors that are aliased over the same registers; a pair of 64-bit registers refers to the lower and upper half of one 128-bit register – in contrast, the AVX extension increases the size of SSE registers to 256-bit, but the SSE registers are only aliased over the lower half of the AVX registers. Second, NEON can interleave and de-interleave data during vector load or store operations, for up to four vectors of four elements interleaved together. In the context of FFTs, the interleaving/de-interleaving instructions can be used to reduce or eliminate vector permutations or shuffles.

### Split format vs. interleaved format

In the previous examples, the data was stored in interleaved format (i.e., the real and imaginary parts composing each element of complex data are stored adjacently in memory), but operating on the data in split format (i.e., the real parts of each element are stored in one contiguous array, while the imaginary parts of each element are stored contiguously in another array) can simplify the computation when using SIMD. The case of complex multiplication illustrates this point.

```c
static inline __m128 MUL_INTERLEAVED(__m128 a, __m128 b) {
    __m128 re, im;
    re = _mm_shuffle_ps(a,a,_MM_SHUFFLE(2,2,0,0));
    re = _mm_mul_ps(re, b);
    im = _mm_shuffle_ps(a,a,_MM_SHUFFLE(3,3,1,1));
    b = _mm_shuffle_ps(b,b,_MM_SHUFFLE(2,3,0,1));
    im = _mm_mul_ps(im, b);
    im = _mm_xor_ps(im, _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f));
    return _mm_add_ps(re, im);
}
```

#### Interleaved format complex multiplication

The function in Code 6 takes complex data in two 4-way single precision SSE registers (a and b) and performs complex multiplication, returning the result in a single precision SSE register. The SSE intrinsic functions are prefixed with `_mm_`, and the SSE data type corresponding to a single 128-bit single precision register is `__m128`.

When operating with interleaved data, each SSE register contains two complex numbers. Two shuffle operations at lines 3 and 5 are used to replicate the real and imaginary parts (respectively) of the two complex numbers in input a. At line 4, the real and imaginary parts of the two complex numbers in b are each multiplied with the real parts of the complex numbers in a. A third shuffle is used to swap the real and imaginary parts of the complex numbers in b, before being multiplied with the imaginary parts of the complex numbers in a – and the exclusive or operation at line 8 is used to selectively negate the sign of the real parts in this result. Finally, the two intermediate results stored in the re and im registers are added. In total, seven SSE instructions are used to multiply two pairs of single precision complex numbers.

#### Split format complex multiplication

```c
typedef struct _reg_t {
    __m128 re, im;
} reg_t;

static inline reg_t MUL_SPLIT(reg_t a, reg_t b) {
    reg_t r;
    r.re = _mm_sub_ps(_mm_mul_ps(a.re,b.re), _mm_mul_ps(a.im,b.im));
    r.im = _mm_add_ps(_mm_mul_ps(a.re,b.im), _mm_mul_ps(a.im,b.re));
    return r;
}
```

The function in Code 9 takes complex data in two structs of SSE registers, performs the complex multiplication of each element of the vectors, and returns the result in a struct of SSE registers. Each struct is composed of a register containing the real parts of four complex numbers, and another register containing the imaginary parts – so the function in Code 9 is effectively operating on vectors twice as long as the function in Code 6. The benefit of operating in split format is obvious: the shuffle operations that were required in Code 6 are avoided because the real and imaginary parts can be implicitly swapped at the instruction level, rather than by awkwardly manipulating SIMD registers at the data level of abstraction. Thus, Code 9 computes complex multiplication for vectors twice as long while using one less SSE instruction – not to mention other advantages such as reducing chains of dependent instructions. The only disadvantage to the split format approach is that twice as many registers are needed to compute a given operation – this might preclude the use of a larger radix or force register paging for some kernels of computation.

#### Fast interleaved format complex multiplication

Code 10 is a fast method of interleaved complex multiplication that may be used in situations where one of the operands can be unpacked prior to multiplication – in such cases the instruction count is reduced from 7 instructions to 4 instructions (cf. Code 6). This method of complex multiplication lends itself especially well to the conjugate-pair algorithm, where the same twiddle factor is used twice – by doubling the size of the twiddle factor LUT, the multiplication instruction count is reduced from 14 instructions to 8 instructions. Furthermore, large chains of dependent instructions are reduced, and in practice the actual performance gain can be quite impressive.

Operand a in Code 6 has been replaced with two operands in Code 10: re and im – these operands have been unpacked, as was done in lines 3 and 5 of Code 6. Furthermore, line 8 of Code 6 is also avoided by performing the selective negation during unpacking.
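A scalar model makes the unpacking concrete. Assuming the twiddle a = a_re + i·a_im has been unpacked into re = {a_re, a_re} and im = {a_im, -a_im} per complex element (with the selective negation folded in during unpacking), the multiplication reduces to two multiplies, a swap and an add. This is an illustrative model of what Code 10 does, not code from the appendices:

```c
/* Scalar model of MUL_UNPACKED_INTERLEAVED for one complex pair.
   b = {b_re, b_im} in interleaved order; the unpacked twiddle is
   re = {a_re, a_re} and im = {a_im, -a_im}. */
static void mul_unpacked(const float re2[2], const float im2[2],
                         const float b[2], float out[2])
{
    float p0 = re2[0] * b[0]; /*  a_re * b_re */
    float p1 = re2[1] * b[1]; /*  a_re * b_im */
    float q0 = im2[0] * b[0]; /*  a_im * b_re */
    float q1 = im2[1] * b[1]; /* -a_im * b_im */
    /* swap q within the pair (the shuffle in Code 10), then add */
    out[0] = p0 + q1;         /* a_re*b_re - a_im*b_im */
    out[1] = p1 + q0;         /* a_re*b_im + a_im*b_re */
}
```

For the conjugate twiddle needed by the conjugate-pair algorithm, the same model applies with im = {-a_im, a_im} – which is why doubling the LUT (one unpacked entry per sign pattern) removes the extra instructions.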

```c
static inline __m128
MUL_UNPACKED_INTERLEAVED(__m128 re, __m128 im, __m128 b) {
    re = _mm_mul_ps(re, b);
    im = _mm_mul_ps(im, b);
    im = _mm_shuffle_ps(im,im,_MM_SHUFFLE(2,3,0,1));
    return _mm_add_ps(re, im);
}
```

### Vectorized loops

The performance of the FFTs in the previous sections can be increased by explicitly vectorizing the loops. The Macbook Air 4,2 used to compile and run the previous examples has a CPU that implements SSE and AVX, but for the purposes of simplicity, SSE intrinsics are used in the following examples. The loop of the radix-2 implementation is used as an example in Code 11.

```c
for(k=0;k<N/2;k++) {
    float complex E = out[k];
    float complex O = out[k+N/2];
    float complex w = LUT[k << log2stride]; /* twiddle factor from the LUT */
    out[k]     = E + w*O;
    out[k+N/2] = E - w*O;
}
```

Each iteration of the loop in Code 11 accesses two elements of complex data in the array out, and one complex element from the twiddle factor LUT. Over multiple iterations of the loop, out is accessed contiguously in two places, but the LUT is accessed with a non-unit stride in all sub-transforms except the outer transform. Some vector machines can perform what are known as vector scatter or gather memory operations – where a vector gather could be used in this case to gather elements from the LUT that are separated by a stride. But SSE only supports contiguous or streaming access to memory. Thus, to efficiently compute multiple iterations of the loop in parallel, the twiddle factor LUT is replaced with an array of LUTs – each corresponding to a sub-transform of a particular size. In this way, all memory accesses for the parallelized loop are contiguous and no memory bandwidth is wasted.

```c
for(k=0;k<N/2;k+=4) {
    /* split-format data and a contiguous per-size twiddle LUT */
    reg_t E, O, w, t;
    E.re = _mm_load_ps(out_re + k);
    E.im = _mm_load_ps(out_im + k);
    O.re = _mm_load_ps(out_re + k + N/2);
    O.im = _mm_load_ps(out_im + k + N/2);
    w.re = _mm_load_ps(lut_re + k);
    w.im = _mm_load_ps(lut_im + k);
    t = MUL_SPLIT(w, O);
    _mm_store_ps(out_re + k,       _mm_add_ps(E.re, t.re));
    _mm_store_ps(out_im + k,       _mm_add_ps(E.im, t.im));
    _mm_store_ps(out_re + k + N/2, _mm_sub_ps(E.re, t.re));
    _mm_store_ps(out_im + k + N/2, _mm_sub_ps(E.im, t.im));
}
```

Code 12 computes the loop of Code 11 using split format data and a vector length of four (i.e., it computes four iterations at once). Note that the vector load and store operations used in Code 12 require that the memory accesses are 16-byte aligned – this is a fairly standard proviso for vector memory operations, and use of the correct memory alignment attributes and/or memory allocation routines ensures that memory is always correctly aligned.

Some FFT libraries require the input to be in split format (i.e., the real parts of each element are stored in one contiguous array, while the imaginary parts are stored contiguously in another array) for the purposes of simplifying the computation, but this conflicts with many other libraries and use cases of the FFT – for example, Apple's vDSP library operates in split format, but many examples require the use of un-zip/zip functions on the input/output data (see Usage Case 2: Fast Fourier Transforms in (Reference)). A compromise is to convert interleaved format data to split format on the first pass of the FFT, computing most of the FFT with split format sub-transforms, and converting the data back to interleaved format as it is processed on the last pass.
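The conversion on the first pass costs only two shuffles per eight floats. This fragment (an illustrative sketch, not code from Appendix 3) de-interleaves four complex numbers into split format:

```c
#include <xmmintrin.h>

/* De-interleave four complex numbers {r0,i0,r1,i1,r2,i2,r3,i3} into
   re = {r0,r1,r2,r3} and im = {i0,i1,i2,i3}. */
static void deinterleave4(const float *src, __m128 *re, __m128 *im)
{
    __m128 a = _mm_loadu_ps(src);     /* r0 i0 r1 i1 */
    __m128 b = _mm_loadu_ps(src + 4); /* r2 i2 r3 i3 */
    *re = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* even lanes */
    *im = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* odd lanes  */
}
```

The reverse conversion on the last pass uses the complementary unpack shuffles, so the split-format inner loops never see interleaved data.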

Appendix 3 contains listings of FFTs with vectorized loops. The input and output of the FFTs is in interleaved format, but the computation of the inner loops is performed on split format data. At the leaves of the transform there are no loops, so the computation falls back to scalar arithmetic.

Figure 3 summarizes the performance of the listings in Appendix 3. Interestingly, the radix-2 FFT is faster than both the conjugate-pair and ordinary split-radix algorithms until size 4096 transforms, and this is due to the conjugate-pair and split-radix algorithms being more complicated at the leaves of the computation. The radix-2 algorithm only has to deal with one size of sub-transform at the leaves, but the split-radix algorithms have to handle special cases for two sizes, and furthermore, a larger proportion of the computation takes place at the leaves with the split-radix algorithms. The conjugate-pair algorithm is again slower than the ordinary split-radix algorithm, which can (again) be attributed to the compiler's relatively degenerate code output when computing complex multiplication with a conjugate.

Overall, performance improves with the use of explicit vector parallelism, but still falls short of the state of the art. The next section characterizes the remaining performance bottlenecks.

## The performance bottleneck

The memory access patterns of an FFT are the biggest obstacle to performance on modern microprocessors. To illustrate this point, Figure 4 visualizes the memory accesses of each straight line block of code in a size 64 radix-2 DIT FFT (the source code of which is provided in Appendix 3).

The vertical axis of Figure 4 is memory. Because the diagram depicts a size 64 transform, there are 64 rows, each corresponding to a complex word in memory. Because the transform is out-of-place, there are input and output arrays for the data. The input array contains the data "in time", while the output array contains the result "in frequency". Rather than show 128 rows – 64 for the input and 64 for the output – the input array's address space has been aliased over the output array's address space, where the orange code indicates an access to the input array, and the green and blue codes indicate accesses to the output array.

Each column along the horizontal axis represents the memory accesses sampled at each kernel (i.e., butterfly) of the computation, which are all straight line blocks of code. The first column shows two orange and one blue memory operations, and these correspond to a radix-2 computation at the leaves reading two elements from the input data, and writing two elements into the output array. The second column shows a similar radix-2 computation at the leaves: two elements of data are read from the input at addresses 18 and 48, the size 2 DFT computed, and the results written to the output array at addresses 2 and 3.

There are columns that do not indicate accesses to the input array, and these are the blocks that are not at the leaves of the computation. These blocks load data from locations in the output array, perform the computation, and store the data back to the same locations.

There are two problems that Figure 4 illustrates. The first is that the accesses to the input array – the samples "in time" – are indeed very decimated, as might be expected with a decimation-in-time algorithm. Second, it can be observed that the leaves of the computation are rather inefficient, because there are large numbers of straight line blocks of code performing scalar memory accesses, and no loops of more than a few iterations (i.e., the leaves of the computation are not taking advantage of the machine's SIMD capability).

Figure 3 in the previous section showed that the vectorized radix-2 FFT was faster than the split-radix algorithms up to size 4096 transforms; a comparison between Figure 4 and Figure 5 helps explain this phenomenon. The split-radix algorithm spends more time computing the leaves of the computation (blue), so despite the split-radix algorithms being more efficient in the inner loops of SIMD computation, their performance is held back by a higher proportion of very small straight line blocks of code (corresponding to sub-transforms smaller than size 4) performing scalar memory accesses at the leaves of the computation.

Because the addresses of memory operations at the leaves are a function of variables passed on the stack, it is very difficult for a hardware prefetch unit to keep these leaves supplied with data, and thus memory latency becomes an issue. In later chapters, it is shown that increasing the size of the base cases at the leaves improves performance.

## Footnotes

1. Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.0.038 Build 20110811.
2. Benchmark methods contains a full account of the benchmark methods.
3. A stride of n would indicate that only every nth term is being referred to.
4. In this case, N refers to the size of the outermost transform rather than the size of the sub-transform.
5. CTGs are an inverse time measurement. See Benchmark methods for a full explanation of the benchmarking methods.

## References

1. Cooley, J.W. and Tukey, J.W. (1965). An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, 19(90), 297–301.
2. Danielson, G.C. and Lanczos, C. (1942). Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids. Journal of the Franklin Institute, 233(5), 435–452.
3. Johnson, S. G. and Frigo, M. (2008, September). Implementing FFTs in practice. In Connexions: Fast Fourier Transforms. Houston TX: Rice University.
4. Pitkanen, T. and Makinen, R. and Heikkinen, J. and Partanen, T. and Takala, J. (2006). Low-power, high-performance TTA processor for 1024-point fast Fourier transform. Embedded Computer Systems: Architectures, Modeling, and Simulation, 227–236.
5. Singleton, Richard C. (1967, October). On computing the fast Fourier transform. Commun. ACM, 10, 647–654.
