Statically elaborating a DAG that represents a depth-first recursive FFT is much like computing a depth-first recursive FFT: instead of performing computation at the leaves of the recursion and where smaller DFTs are combined into one, a node representing the computation is appended to the end of a list, and the list of nodes, i.e., a topological ordering of the DAG, is later translated into a program that can be compiled and executed.
Emitting code with a vector length of 1 (i.e., scalar code or vector code where only one complex element fits in a vector register) is relatively simple and is described in "Vector length 1". For vector lengths above 1, vectorizing the topological ordering of nodes poses some subtle challenges, and these details are described in "Other vector lengths". The fully hard-coded FFTs described in this section are generally only practical for smaller sizes of transforms, typically where
Vector length 1
A VL of 1 implies that the computation is essentially scalar, and only one complex element can fit in a vector register. An example of such a scenario is when using interleaved double-precision floating-point arithmetic on an SSE2 machine: one 128-bit XMM register is used to store two 64-bit floats that represent the real and imaginary parts of a complex number.
When
- Elaborate a topological ordering of nodes, where each node represents either a computation at the leaves of the transform, or a computation in the body of the transform (i.e., where smaller sub-transforms are combined into a larger transform);
- Write the program header to output, including a list of variables that correspond to registers used by the nodes;
- Traverse the list of nodes in order, and for each node, emit a statement that performs the computation represented by the given node. If a node is the last node to use a variable, a statement storing the variable to its corresponding location in memory is also emitted;
- Write the program footer to output.
Elaborate
Code 3 is a function, written in C++, that performs the first task in the process.
As mentioned earlier, elaborating a topological ordering of nodes with a depth-first recursive structure is much like
actually computing an FFT with a depth-first recursive program (cf. Listing 3 in Appendix 2).
Table 1 lists the nodes contained in the list `ns' after elaborating a size-8
transform by invoking elaborate(8, 0, 0, 0).
|
A transform is divided into sub-transforms with recursive calls at lines 4, 6 and 7, until the base cases of size 2 or size 4 are reached at the leaves of the elaboration. As well as the size-2 and size-4 base cases, which are handled at lines 20-21 and 17-18 (respectively), there is a special case where two size-2 base cases are handled in parallel at lines 9-10. This special case of handling two size-2 base cases as a larger size-4 node ensures that larger
transforms are composed of nodes that are homogeneous in size – this is of little utility when emitting
The elaborate function modifies the class member variable `ns' at lines 10, 14, 18 and 21, where it appends a new node to the back of the list. After the function returns, the ns list represents a topological ordering of the computation with CNodeLoad and CNodeBfly nodes. The nodes of type CNodeLoad represent leaf computations: these computations load elements from the input array and perform a small amount of leaf computation, leaving the result in a set of registers. The CNodeBfly nodes represent computations in the body of the transform: these use a twiddle factor to perform a butterfly computation on a vector of registers, leaving the result in the same registers.
| Type | Size | Addresses | Registers | Twiddle |
CNodeLoad
|
4 | {0,4,2,6} | {0,1,2,3} | |
CNodeLoad
|
2(x2) | {1,5,7,3} | {4,5,6,7} | |
CNodeBfly
|
4 | {0,2,4,6} |
|
|
CNodeBfly
|
4 | {1,3,5,7} |
|
The constructor for a CNodeLoad object computes input array addresses for the load operations using the input array offset (ioffset), the input array stride, the size of the node (the nodes instantiated at lines 9 and 17 are size-4, and the node instantiated at line 20 is size-2) and a final parameter that is non-zero if the node is a single node (the nodes instantiated at lines 17 and 20 are single nodes, while the node instantiated at line 9 is composed of two size-2 nodes).
As the newly instantiated CNodeLoad objects are appended to the back of ns at lines 10, 14 and 21, the assign_leaf_registers function assigns registers to the outputs of each instance. Registers are identified with integers beginning at zero, and when each register is created it is assigned an identifier from an auto-incrementing counter (rmap, where the node for a given register is the last node to reference that register.
The constructor for a CNodeBfly object uses k and stride to compute a twiddle factor for the new instance of a butterfly computation node. When the new instance of CNodeBfly is appended to the end of ns at line 14, the assign_body_registers function assigns registers
where assign_body_registers functions also updates the map of registers to node pointers by
setting CNodeBfly.
Emitting code
|
Given a list of nodes, it is a simple process to emit C code that can be compiled to actually compute the transform.
The example in Code 30 would be emitted from the list of four nodes in Table 1. Lines 1–4 are emitted from a function that generates a header, and line 13 is emitted from a function that generates a footer. Lines 6–11 are generated based on the list of nodes.
Code 30 contains references to several types, functions and macros that use upper-case identifiers – these are primitive functions or types that have been predefined as inline functions or macros. A benefit of using primitives in this way is that the details specific to numerical representation and the underlying machine have been abstracted away; thus, the same function can be compiled for a variety of types and machines by simply including a different header file with different primitives. Code 30, for example, could be compiled for double-precision arithmetic on an SSE2 machine by including sse_double.h, or it could be compiled with much slower scalar arithmetic by including scalar.h. The same code can even be used, without modification, to compute forward and backwards transforms, by using C preprocessor directives to conditionally alter the macros.
In order to accommodate mixed numerical representations, the signature of the outermost function references data with void pointers. In the case of the double-precision example in Code 30, SFFT_D would be defined to be double in the appropriate header file, and the void pointers are then cast to SFFT_D pointers.
The size-8 transform in Table 1 uses 8 registers, and thus a declaration of 8 registers of type SFFT_R has been emitted at line 4 in Code 30. In the case of double-precision arithmetic on a SSE2 machine, SFFT_R is defined as __m128d in sse_double.h.
The first two rows of Table 1 correspond to lines 6 and 7 of Code 30, respectively. The L_4 primitive is used to compute the size-4 leaf node in the first row of the table. The second row is a load/leaf node of size 2(x2), indicating two size-2 nodes in parallel, which is computed with the L_2 primitive. The input addresses in the table are the addresses of complex words, while the addresses in the generated code refer to the real and imaginary parts of a complex word, and thus the addresses from Table 1 are multiplied by a factor of 2 to obtain the addresses in Code 30.
The final two CNodeBfly nodes of Table 1 correspond to the K_0 and K_N sub-transform (a.k.a. butterfly) primitives at lines 8 and 10, respectively. Because the node in the third row of Table 1 has a twiddle factor of K_0 primitive is used for this special case. The K_N primitive at line 10 does require a twiddle factor, which is passed to K_N as two vector literals that represent the twiddle factor in unpacked form.1 Fast interleaved complex multiplication describes how interleaved complex multiplication is faster if one operand is pre-unpacked.
After each node is processed, the registers that have been used by it are checked in a map (rmap) that maps each register to the last node to have used that register. If the current node is the last node to have used a register, the register is stored to memory. In the case of the transform in Code 30, four registers are stored with an instance of the S_4 primitive at lines 9 and 11. In contrast to the load operations at the leaves, which are decimated-in-time and thus effectively pseudo-random memory accesses, the store operations are to linear regions of memory, the addresses of which can be determined from each register's integer identifier. The store address offset for data in register
Other vector lengths
If elaborate function in Code 3 is vectorized. Broadly speaking, CNodeLoad objects that operate on adjacent
memory locations are collected together and computed in parallel. After each such computation, each position in a vector register contains an element that belongs to a different node. Transposes are then used to transform sets of vector registers such that each register contains elements from one node. Finally, the CNodeBfly objects can be easily computed in parallel, as they were with VL-1 because the elements in each vector register correspond to one node.
Overview
Table 2 lists the nodes that represent a VL-1 size-16 transform. A VL of 2 implies that each vector register contains 2 complex words, and load operations on each of the 4 addresses in the first row of Table 2 will also load the complex words in the adjacent memory locations. Note that the complex words that would be incidentally loaded in the upper half of the VL-2 registers are the complex words that the third CNodeLoad object at row 5 would have loaded. This is exploited to load and compute the first and third CNodeLoad objects in parallel.
| Type | Size | Addresses | Registers | Twiddle |
CNodeLoad
|
4 | {0,8,4,12} | {0,1,2,3} | |
CNodeLoad
|
2(x2) | {2,10,14,6} | {4,5,6,7} | |
CNodeBfly
|
4 | {0,2,4,6} |
|
|
CNodeBfly
|
4 | {1,3,5,7} |
|
|
CNodeLoad
|
4 | {1,9,5,13} | {8,9,10,11} | |
CNodeLoad
|
4 | {15,7,3,11} | {12,13,14,15} | |
CNodeBfly
|
4 | {0,4,8,12} |
|
|
CNodeBfly
|
4 | {1,5,9,13} |
|
|
CNodeBfly
|
4 | {2,6,10,14} |
|
|
CNodeBfly
|
4 | {3,7,11,15} |
|
| Type | Sizes | Addresses | Registers | Twiddles |
| Load | {4,4} | {{0,1},{8,9},{4,5},{12,13}} | {{0,1},{2,3},{8,9},{10,11}} | |
| Load | {2(x2),4} | {{2,3},{10,11},{14,15},{6,7}} | {{4,5},{6,7},{14,15},{12,13}} | |
| Bfly | {4,4} | {{0,1},{2,3},{4,5},{6,7}} | {
|
|
| Bfly | {4,4} | {{0,1},{4,5},{8,9},{12,13}} |
{
|
|
| Bfly | {4,4} | {{2,3},{6,7},{10,11},{14,15}} |
{
|
The second CNodeLoad object computes two size-2 leaf transforms in parallel, while the last CNodeLoad object computes a size-4 leaf transform. Because the size-4 transform is composed of two size-2 transforms, and memory addresses of the fourth CNodeLoad are adjacent (although permuted), some of the computation can be computed in parallel.
If the CNodeLoad objects at rows 1 and 5 are computed in parallel, the output will be four VL-2 registers: {{0,8}, {1,9}, {2,10}, {3,11}} – i.e., the first register contains what would have been register 0 in the lower half, and what would have been register 8 in the top half etc. Similarly, computing rows 2 and 6 in parallel would yield four VL-2 registers: {{4,14}, {5,15}, {6,12}, {7,13}} – note the permutation of the upper halves in this case. These registers are transposed to {{0,1}, {2,3}, {8,9}, {10,11}} and {{4,5}, {6,7}, {14,15}, {12,13}}, as in row 1 and 2 of Table 3.
With the transposed VL-2 registers, it is now possible to compute CNodeBfly nodes in parallel. For example, rows 2 and 3 of Table 2 can be computed in parallel on four VL-2 registers represented by {{0,1}, {2,3}, {4,5}, {6,7}}, as in row 3 of Table 3.
Implementation
Code 82 is a C++ implementation of the vectorize_loads function. This
function modifies a topological ordering of nodes (the class member variable ns) and uses two other functions: find_parallel_loads, which searches forward from the current node to find another CNodeLoad that shares adjacent memory addresses; and merge_loads(a,b), which adds the addresses, registers and type of b to a. Type introspection is used at lines 7 and 36 (and in other Listings), to differentiate between the two types of object.
Code 81 is a C++ implementation of the vectorize_ks function. For each CNodeBfly node, the function searches forward for another CNodeBfly that does not have a register dependence. Once found, the registers of the latter node are added to the former node, and the latter node erased. Finally, at line 19, the registers of the vectorized CNodeBfly node are merged using a perfect shuffle, which is then recursively applied on each half of the list. The effect is a merge that works for any power of 2 vector length.
|
|
If vectorize_loads and vectorize_ks are invoked with
- The register identifiers in line 4 of Code 89 consist of a list of two integers delimited with an underscore. The integers listed in each register's name are the VL-1 registers that were subsumed to create the larger register (cf. VL-1 code in Code 30);
- The leaf primitives (lines 6 and 7 in Code 89) have a list of underscore delimited
integers in the name, where each integer corresponds to the type of sub-transform to be computed on that position in the vector registers. For example, the
L_4_4primitive is named to indicate a size-4 leaf operation on the lower and upper halves of the vector registers, while theL_2_4primitive performs two size-2 leaf operations on the lower half of the registers and a size-4 leaf operation on the upper halves; - The body node primitives (
K_N) and store primitives (S_4) are unchanged because they perform the same operation on each element of the vector registers. This is as a result of the register transposes that were previously performed on the outputs of the leaf primitives.
|
Scalability
So far, hard-coded transforms of vector length 1 and 2 have been presented. On Intel machines, VL-1 can be used to compute double-precision transforms with SSE2, while VL-2 can be used to compute double-precision transforms with AVX and single-precision transforms with SSE. The method of vectorization presented in this chapter scales above VL-2, and has been successfully used to compute VL-4 single-precision transforms with AVX.
The leaf primitives were coded by hand in all cases; VL-1 required L_2 and L_4, while
VL-2 required L_2_2, L_2_4, L_4_2 and L_4_4. In the case of VL-4, not all permutations
of possible leaf primitive were required – only 11 out of 16 were needed for the transforms that were generated.
It is an easy exercise to code the leaf primitives for
Constraints
For a transform of size
If this constraint is not satisfied, the size of either VL or
Performance
|
Figure 1 shows the results of a benchmark for transforms of size 4 through to 1024 running on a Macbook Air 4,2. The speed of FFTW 3.3 running in estimate and patient modes is also shown for comparison.
FFTW running in patient mode evaluates a huge configuration space of parameters, while the hard-coded FFT required no calibration.
A variety of vector lengths are represented, and the hard-coded FFTs have good performance while



















