Statically elaborating a DAG that represents a depthfirst recursive FFT is much like computing a depthfirst recursive FFT: instead of performing computation at the leaves of the recursion and where smaller DFTs are combined into one, a node representing the computation is appended to the end of a list, and the list of nodes, i.e., a topological ordering of the DAG, is later translated into a program that can be compiled and executed.
Emitting code with a vector length of 1 (i.e., scalar code or vector code where only one complex element fits in a vector register) is relatively simple and is described in "Vector length 1". For vector lengths above 1, vectorizing the topological ordering of nodes poses some subtle challenges, and these details are described in "Other vector lengths". The fully hardcoded FFTs described in this section are generally only practical for smaller sizes of transforms, typically where
Vector length 1
A VL of 1 implies that the computation is essentially scalar, and only one complex element can fit in a vector register. An example of such a scenario is when using interleaved doubleprecision floatingpoint arithmetic on an SSE2 machine: one 128bit XMM register is used to store two 64bit floats that represent the real and imaginary parts of a complex number.
When
 Elaborate a topological ordering of nodes, where each node represents either a computation at the leaves of the transform, or a computation in the body of the transform (i.e., where smaller subtransforms are combined into a larger transform);
 Write the program header to output, including a list of variables that correspond to registers used by the nodes;
 Traverse the list of nodes in order, and for each node, emit a statement that performs the computation represented by the given node. If a node is the last node to use a variable, a statement storing the variable to its corresponding location in memory is also emitted;
 Write the program footer to output.
Elaborate
Code 3 is a function, written in C++, that performs the first task in the process.
As mentioned earlier, elaborating a topological ordering of nodes with a depthfirst recursive structure is much like
actually computing an FFT with a depthfirst recursive program (cf. Listing 3 in Appendix 2).
Table 1 lists the nodes contained in the list `ns
' after elaborating a size8
transform by invoking elaborate(8, 0, 0, 0)
.

A transform is divided into subtransforms with recursive calls at lines 4, 6 and 7, until the base cases of size 2 or size 4 are reached at the leaves of the elaboration. As well as the size2 and size4 base cases, which are handled at lines 2021 and 1718 (respectively), there is a special case where two size2 base cases are handled in parallel at lines 910. This special case of handling two size2 base cases as a larger size4 node ensures that larger
transforms are composed of nodes that are homogeneous in size – this is of little utility when emitting
The elaborate
function modifies the class member variable `ns
' at lines 10, 14, 18 and 21, where it appends a new node to the back of the list. After the function returns, the ns
list represents a topological ordering of the computation with CNodeLoad
and CNodeBfly
nodes. The nodes of type CNodeLoad
represent leaf computations: these computations load elements from the input array and perform a small amount of leaf computation, leaving the result in a set of registers. The CNodeBfly
nodes represent computations in the body of the transform: these use a twiddle factor to perform a butterfly computation on a vector of registers, leaving the result in the same registers.
Type  Size  Addresses  Registers  Twiddle 
CNodeLoad

4  {0,4,2,6}  {0,1,2,3}  
CNodeLoad

2(x2)  {1,5,7,3}  {4,5,6,7}  
CNodeBfly

4  {0,2,4,6} 


CNodeBfly

4  {1,3,5,7} 

The constructor for a CNodeLoad
object computes input array addresses for the load operations using the input array offset (ioffset
), the input array stride
, the size of the node (the nodes instantiated at lines 9 and 17 are size4, and the node instantiated at line 20 is size2) and a final parameter that is nonzero if the node is a single node (the nodes instantiated at lines 17 and 20 are single nodes, while the node instantiated at line 9 is composed of two size2 nodes).
As the newly instantiated CNodeLoad
objects are appended to the back of ns
at lines 10, 14 and 21, the assign_leaf_registers
function assigns registers to the outputs of each instance. Registers are identified with integers beginning at zero, and when each register is created it is assigned an identifier from an autoincrementing counter (rmap
, where the node for a given register is the last node to reference that register.
The constructor for a CNodeBfly
object uses k
and stride
to compute a twiddle factor for the new instance of a butterfly computation node. When the new instance of CNodeBfly
is appended to the end of ns
at line 14, the assign_body_registers
function assigns registers
where assign_body_registers
functions also updates the map of registers to node pointers by
setting CNodeBfly
.
Emitting code

Given a list of nodes, it is a simple process to emit C code that can be compiled to actually compute the transform.
The example in Code 30 would be emitted from the list of four nodes in Table 1. Lines 1–4 are emitted from a function that generates a header, and line 13 is emitted from a function that generates a footer. Lines 6–11 are generated based on the list of nodes.
Code 30 contains references to several types, functions and macros that use uppercase identifiers – these are primitive functions or types that have been predefined as inline functions or macros. A benefit of using primitives in this way is that the details specific to numerical representation and the underlying machine have been abstracted away; thus, the same function can be compiled for a variety of types and machines by simply including a different header file with different primitives. Code 30, for example, could be compiled for doubleprecision arithmetic on an SSE2 machine by including sse_double.h
, or it could be compiled with much slower scalar arithmetic by including scalar.h
. The same code can even be used, without modification, to compute forward and backwards transforms, by using C preprocessor directives to conditionally alter the macros.
In order to accommodate mixed numerical representations, the signature of the outermost function references data with void pointers. In the case of the doubleprecision example in Code 30, SFFT_D
would be defined to be double
in the appropriate header file, and the void pointers are then cast to SFFT_D
pointers.
The size8 transform in Table 1 uses 8 registers, and thus a declaration of 8 registers of type SFFT_R
has been emitted at line 4 in Code 30. In the case of doubleprecision arithmetic on a SSE2 machine, SFFT_R
is defined as __m128d
in sse_double.h
.
The first two rows of Table 1 correspond to lines 6 and 7 of Code 30, respectively. The L_4
primitive is used to compute the size4 leaf node in the first row of the table. The second row is a load/leaf node of size 2(x2), indicating two size2 nodes in parallel, which is computed with the L_2
primitive. The input addresses in the table are the addresses of complex words, while the addresses in the generated code refer to the real and imaginary parts of a complex word, and thus the addresses from Table 1 are multiplied by a factor of 2 to obtain the addresses in Code 30.
The final two CNodeBfly
nodes of Table 1 correspond to the K_0
and K_N
subtransform (a.k.a. butterfly) primitives at lines 8 and 10, respectively. Because the node in the third row of Table 1 has a twiddle factor of K_0
primitive is used for this special case. The K_N
primitive at line 10 does require a twiddle factor, which is passed to K_N
as two vector literals that represent the twiddle factor in unpacked form.1 Fast interleaved complex multiplication describes how interleaved complex multiplication is faster if one operand is preunpacked.
After each node is processed, the registers that have been used by it are checked in a map (rmap
) that maps each register to the last node to have used that register. If the current node is the last node to have used a register, the register is stored to memory. In the case of the transform in Code 30, four registers are stored with an instance of the S_4
primitive at lines 9 and 11. In contrast to the load operations at the leaves, which are decimatedintime and thus effectively pseudorandom memory accesses, the store operations are to linear regions of memory, the addresses of which can be determined from each register's integer identifier. The store address offset for data in register
Other vector lengths
If elaborate
function in Code 3 is vectorized. Broadly speaking, CNodeLoad
objects that operate on adjacent
memory locations are collected together and computed in parallel. After each such computation, each position in a vector register contains an element that belongs to a different node. Transposes are then used to transform sets of vector registers such that each register contains elements from one node. Finally, the CNodeBfly
objects can be easily computed in parallel, as they were with VL1 because the elements in each vector register correspond to one node.
Overview
Table 2 lists the nodes that represent a VL1 size16 transform. A VL of 2 implies that each vector register contains 2 complex words, and load operations on each of the 4 addresses in the first row of Table 2 will also load the complex words in the adjacent memory locations. Note that the complex words that would be incidentally loaded in the upper half of the VL2 registers are the complex words that the third CNodeLoad
object at row 5 would have loaded. This is exploited to load and compute the first and third CNodeLoad
objects in parallel.
Type  Size  Addresses  Registers  Twiddle 
CNodeLoad

4  {0,8,4,12}  {0,1,2,3}  
CNodeLoad

2(x2)  {2,10,14,6}  {4,5,6,7}  
CNodeBfly

4  {0,2,4,6} 


CNodeBfly

4  {1,3,5,7} 


CNodeLoad

4  {1,9,5,13}  {8,9,10,11}  
CNodeLoad

4  {15,7,3,11}  {12,13,14,15}  
CNodeBfly

4  {0,4,8,12} 


CNodeBfly

4  {1,5,9,13} 


CNodeBfly

4  {2,6,10,14} 


CNodeBfly

4  {3,7,11,15} 

Type  Sizes  Addresses  Registers  Twiddles 
Load  {4,4}  {{0,1},{8,9},{4,5},{12,13}}  {{0,1},{2,3},{8,9},{10,11}}  
Load  {2(x2),4}  {{2,3},{10,11},{14,15},{6,7}}  {{4,5},{6,7},{14,15},{12,13}}  
Bfly  {4,4}  {{0,1},{2,3},{4,5},{6,7}}  {


Bfly  {4,4}  {{0,1},{4,5},{8,9},{12,13}} 
{


Bfly  {4,4}  {{2,3},{6,7},{10,11},{14,15}} 
{

The second CNodeLoad
object computes two size2 leaf transforms in parallel, while the last CNodeLoad
object computes a size4 leaf transform. Because the size4 transform is composed of two size2 transforms, and memory addresses of the fourth CNodeLoad
are adjacent (although permuted), some of the computation can be computed in parallel.
If the CNodeLoad
objects at rows 1 and 5 are computed in parallel, the output will be four VL2 registers: {{0,8}, {1,9}, {2,10}, {3,11}} – i.e., the first register contains what would have been register 0 in the lower half, and what would have been register 8 in the top half etc. Similarly, computing rows 2 and 6 in parallel would yield four VL2 registers: {{4,14}, {5,15}, {6,12}, {7,13}} – note the permutation of the upper halves in this case. These registers are transposed to {{0,1}, {2,3}, {8,9}, {10,11}} and {{4,5}, {6,7}, {14,15}, {12,13}}, as in row 1 and 2 of Table 3.
With the transposed VL2 registers, it is now possible to compute CNodeBfly
nodes in parallel. For example, rows 2 and 3 of Table 2 can be computed in parallel on four VL2 registers represented by {{0,1}, {2,3}, {4,5}, {6,7}}, as in row 3 of Table 3.
Implementation
Code 82 is a C++ implementation of the vectorize_loads
function. This
function modifies a topological ordering of nodes (the class member variable ns
) and uses two other functions: find_parallel_loads
, which searches forward from the current node to find another CNodeLoad
that shares adjacent memory addresses; and merge_loads(a,b)
, which adds the addresses, registers and type of b
to a
. Type introspection is used at lines 7 and 36 (and in other Listings), to differentiate between the two types of object.
Code 81 is a C++ implementation of the vectorize_ks
function. For each CNodeBfly
node, the function searches forward for another CNodeBfly
that does not have a register dependence. Once found, the registers of the latter node are added to the former node, and the latter node erased. Finally, at line 19, the registers of the vectorized CNodeBfly
node are merged using a perfect shuffle, which is then recursively applied on each half of the list. The effect is a merge that works for any power of 2 vector length.


If vectorize_loads
and vectorize_ks
are invoked with
 The register identifiers in line 4 of Code 89 consist of a list of two integers delimited with an underscore. The integers listed in each register's name are the VL1 registers that were subsumed to create the larger register (cf. VL1 code in Code 30);
 The leaf primitives (lines 6 and 7 in Code 89) have a list of underscore delimited
integers in the name, where each integer corresponds to the type of subtransform to be computed on that position in the vector registers. For example, the
L_4_4
primitive is named to indicate a size4 leaf operation on the lower and upper halves of the vector registers, while theL_2_4
primitive performs two size2 leaf operations on the lower half of the registers and a size4 leaf operation on the upper halves;  The body node primitives (
K_N
) and store primitives (S_4
) are unchanged because they perform the same operation on each element of the vector registers. This is as a result of the register transposes that were previously performed on the outputs of the leaf primitives.

Scalability
So far, hardcoded transforms of vector length 1 and 2 have been presented. On Intel machines, VL1 can be used to compute doubleprecision transforms with SSE2, while VL2 can be used to compute doubleprecision transforms with AVX and singleprecision transforms with SSE. The method of vectorization presented in this chapter scales above VL2, and has been successfully used to compute VL4 singleprecision transforms with AVX.
The leaf primitives were coded by hand in all cases; VL1 required L_2
and L_4
, while
VL2 required L_2_2
, L_2_4
, L_4_2
and L_4_4
. In the case of VL4, not all permutations
of possible leaf primitive were required – only 11 out of 16 were needed for the transforms that were generated.
It is an easy exercise to code the leaf primitives for
Constraints
For a transform of size
If this constraint is not satisfied, the size of either VL or
Performance

Figure 1 shows the results of a benchmark for transforms of size 4 through to 1024 running on a Macbook Air 4,2. The speed of FFTW 3.3 running in estimate and patient modes is also shown for comparison.
FFTW running in patient mode evaluates a huge configuration space of parameters, while the hardcoded FFT required no calibration.
A variety of vector lengths are represented, and the hardcoded FFTs have good performance while