Inside Collection (Course): Course by Janko Calic.

# Efficient FFT Algorithm and Programming Tricks

Module by: Douglas L. Jones

Summary: Many tricks and techniques have been developed to speed up the computation of FFTs. Significant reductions in computation time result from table lookup of twiddle factors, compiler-friendly or assembly-language programming, special hardware, and FFT algorithms for real-valued data. Higher-radix algorithms, fast bit-reversal, and special butterflies yield more modest but worthwhile savings.

The use of FFT algorithms such as the radix-2 decimation-in-time or decimation-in-frequency methods results in tremendous savings in computation when computing the discrete Fourier transform. While most of the speed-up of FFTs comes from this, careful implementation can provide additional savings ranging from a few percent to several-fold increases in program speed.

## Precompute twiddle factors

The twiddle factors, $W_N^k = e^{-j2\pi k/N}$, that multiply the intermediate data in the FFT algorithms consist of cosines and sines that each take the equivalent of several multiplies to compute. However, at most $N$ unique twiddle factors can appear in any FFT or DFT algorithm. (For example, in the radix-2 decimation-in-time FFT, only the $N/2$ twiddle factors $W_N^k$, $k = 0, 1, 2, \dots, N/2 - 1$, are used.) These twiddle factors can be precomputed once, stored in an array in computer memory, and accessed in the FFT algorithm by table lookup. This simple technique yields very substantial savings and is almost always used in practice.
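As a sketch of this technique (in Python, with function names of my choosing), the twiddle-factor table for a radix-2 decimation-in-time FFT can be built once and then indexed by the butterfly loops:

```python
import cmath

def twiddle_table(N):
    # Precompute the N/2 twiddle factors W_N^k = e^{-j 2*pi*k/N}
    # needed by a length-N radix-2 decimation-in-time FFT.
    return [cmath.exp(-2j * cmath.pi * k / N) for k in range(N // 2)]

W = twiddle_table(8)
# W[0] = 1 and W[2] = -j (to within rounding); the FFT butterflies
# then read W[k] from the table instead of calling exp() per butterfly.
```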

## Compiler-friendly programming

On most computers, only some of the total computation time of an FFT is spent performing the FFT butterfly computations; determining indices, loading and storing data, computing loop parameters and other operations consume the majority of cycles. Careful programming that allows the compiler to generate efficient code can make a several-fold improvement in the run-time of an FFT. The best choice of radix in terms of program speed may depend more on characteristics of the hardware (such as the number of CPU registers) or compiler than on the exact number of computations. Very often the manufacturer's library codes are carefully crafted by experts who know intimately both the hardware and compiler architecture and how to get the most performance out of them, so use of well-written FFT libraries is generally recommended. Certain freely available programs and libraries are also very good. Perhaps the best current general-purpose library is the FFTW package; information can be found at http://www.fftw.org. A paper by Frigo and Johnson describes many of the key issues in developing compiler-friendly code.

## Program in assembly language

While compilers continue to improve, FFT programs written directly in the assembly language of a specific machine are often several times faster than the best compiled code. This is particularly true for DSP microprocessors, which have special instructions for accelerating FFTs that compilers don't use. (I have myself seen differences of up to 26 to 1 in favor of assembly!) Very often, FFTs in the manufacturer's or high-performance third-party libraries are hand-coded in assembly. For DSP microprocessors, the codes developed by Meyer, Schuessler, and Schwarz are perhaps the best ever developed; while the particular processors are now obsolete, the techniques remain equally relevant today. Most DSP processors provide special instructions and a hardware design favoring the radix-2 decimation-in-time algorithm, which is thus generally fastest on these machines.

## Special hardware

Some processors have special hardware accelerators or co-processors specifically designed to accelerate FFT computations. For example, AMI Semiconductor's Toccata ultra-low-power DSP microprocessor family, which is widely used in digital hearing aids, has on-chip FFT accelerators; it is always faster and more power-efficient to use such accelerators and whatever radix they prefer.

In a surprising number of applications, almost all of the computations are FFTs. A number of special-purpose chips are designed to specifically compute FFTs, and are used in specialized high-performance applications such as radar systems. Other systems, such as OFDM-based communications receivers, have special FFT hardware built into the digital receiver circuit. Such hardware can run many times faster, with much less power consumption, than FFT programs on general-purpose processors.

## Effective memory management

Cache misses or excessive data movement between registers and memory can greatly slow down an FFT computation. Efficient programs such as the FFTW package are carefully designed to minimize these inefficiencies. In-place algorithms reuse the data memory throughout the transform, which can reduce cache misses for longer lengths.

## Real-valued FFTs

FFTs of real-valued signals require only half as many computations as FFTs of complex-valued data. There are several methods for reducing the computation, which are described in more detail in Sorensen et al.

1. Use DFT symmetry properties to do two real-valued DFTs at once with one FFT program
2. Perform one stage of the radix-2 decimation-in-time decomposition and compute the two length-$N/2$ DFTs using the above approach.
3. Use a direct real-valued FFT algorithm; see H.V. Sorensen et al.
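Method 1 above can be sketched as follows (a Python illustration with names of my choosing; the direct `dft` stands in for whatever FFT routine is actually used): pack the two real sequences into one complex sequence, transform once, then separate the results using the conjugate symmetry of real-data DFTs.

```python
import cmath

def dft(x):
    # Direct DFT, used here only as a stand-in for an FFT routine.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def two_real_dfts(x, y):
    # Pack two real sequences into one complex sequence z = x + jy,
    # run a single complex transform, and separate: for real data,
    # X[N-k] = conj(X[k]), so conj(Z[N-k]) = X[k] - jY[k].
    N = len(x)
    Z = dft([xv + 1j * yv for xv, yv in zip(x, y)])
    X = [(Z[k] + Z[-k % N].conjugate()) / 2 for k in range(N)]
    Y = [(Z[k] - Z[-k % N].conjugate()) / 2j for k in range(N)]
    return X, Y
```

Both length-N real DFTs are thus obtained for the price of one complex transform plus O(N) separation arithmetic.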

## Special cases

Occasionally only certain DFT frequencies are needed, the input signal values are mostly zero, the signal is real-valued (as discussed above), or other special conditions exist for which faster algorithms can be developed. Sorensen and Burrus describe slightly faster algorithms for pruned or zero-padded data. Goertzel's algorithm is useful when only a few DFT outputs are needed. The running FFT can be faster when DFTs of highly overlapped blocks of data are needed, as in a spectrogram.
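Goertzel's algorithm, for instance, computes a single DFT bin with one second-order recursion over the data; a minimal Python sketch (function name is illustrative):

```python
import math

def goertzel(x, k):
    # Compute the single DFT bin X[k] of the length-N sequence x
    # via the recursion s[n] = x[n] + 2cos(w)s[n-1] - s[n-2],
    # which needs only one real multiply per input sample.
    N = len(x)
    w = 2 * math.pi * k / N
    coeff = 2 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # One final complex step recovers X[k] from the last two states.
    return complex(math.cos(w), math.sin(w)) * s_prev - s_prev2
```

When only a handful of the N bins are needed, running this per bin is cheaper than a full FFT.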

Higher-radix algorithms, such as the radix-4, radix-8, or split-radix FFTs, require fewer computations and can produce modest but worthwhile savings. Even the split-radix FFT reduces the multiplications by only 33% and the additions by a much lesser amount relative to the radix-2 FFTs; significant improvements in program speed are often due more to implicit loop-unrolling or other compiler benefits than to the computational reduction itself!

## Fast bit-reversal

Bit-reversing the input or output data can consume several percent of the total run-time of an FFT program. Several fast bit-reversal algorithms have been developed that can reduce this to two percent or less, including the method published by D.M.W. Evans.
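For contrast with such optimized methods, the classic swap-based reordering loop looks like this in Python (this is the straightforward version, not Evans' faster digit-reversal algorithm):

```python
def bit_reverse_permute(x):
    # In-place bit-reversal reordering of a length-2^m list:
    # element i is swapped with the element at the bit-reversed index j.
    N = len(x)
    j = 0
    for i in range(1, N):
        # Increment j as a bit-reversed counter.
        bit = N >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:          # swap each pair exactly once
            x[i], x[j] = x[j], x[i]
    return x
```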

When FFTs first became widely used, hardware multipliers were relatively rare on digital computers, and multiplications generally required many more cycles than additions. Methods to reduce multiplications, even at the expense of a substantial increase in additions, were often beneficial. The prime factor algorithms and the Winograd Fourier transform algorithms, which required fewer multiplies and considerably more additions than the power-of-two-length algorithms, were developed during this period. Current processors generally have high-speed pipelined hardware multipliers, so trading multiplies for additions is often no longer beneficial. In particular, most machines now support single-cycle multiply-accumulate (MAC) operations, so balancing the number of multiplies and adds and combining them into single-cycle MACs generally results in the fastest code. Thus, the prime-factor and Winograd FFTs are rarely used today unless the application requires FFTs of a specific length.

It is possible to implement a complex multiply with 3 real multiplies and 5 real adds rather than the usual 4 real multiplies and 2 real adds:

$$(C + jS)(X + jY) = (CX - SY) + j(CY + SX)$$

but alternatively, letting

$$Z = C(X - Y), \quad D = C + S, \quad E = C - S,$$

we have

$$CX - SY = EY + Z, \quad CY + SX = DX - Z.$$

In an FFT, $D$ and $E$ come entirely from the twiddle factors, so they can be precomputed and stored in a look-up table. This reduces the cost of the complex twiddle-factor multiply to 3 real multiplies and 3 real adds, or one less and one more, respectively, than the conventional 4/2 computation.
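The trade can be checked numerically; a small Python sketch (function name is mine):

```python
def cmul3(C, S, X, Y):
    # Complex multiply (C + jS)(X + jY) with 3 real multiplies, 5 real adds.
    # In an FFT, D = C + S and E = C - S depend only on the twiddle factor
    # and would be precomputed, leaving 3 multiplies and 3 adds per use.
    Z = C * (X - Y)
    D = C + S
    E = C - S
    real = E * Y + Z   # = CX - SY
    imag = D * X - Z   # = CY + SX
    return real, imag
```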

## Special butterflies

Certain twiddle factors, namely $W_N^0 = 1$, $W_N^{N/2}$, $W_N^{N/4}$, $W_N^{N/8}$, $W_N^{3N/8}$, etc., can be implemented with no additional operations, or with fewer real operations than a general complex multiply. Programs that specially implement such butterflies in the most efficient manner throughout the algorithm can reduce the computational cost by up to several $N$ multiplies and additions in a length-$N$ FFT.
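For example, the multiply by $W_N^{N/4} = -j$ reduces to a swap and a sign change, with no real multiplies at all; as a trivial Python illustration:

```python
def mul_by_minus_j(X, Y):
    # -j(X + jY) = Y - jX: multiplying by the twiddle factor
    # W_N^{N/4} = -j is just a component swap and one sign flip.
    return Y, -X
```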

## Practical Perspective

When optimizing FFTs for speed, it can be important to maintain perspective on the benefits that can be expected from any given optimization. The following list categorizes the various techniques by potential benefit; these will be somewhat situation- and machine-dependent, but clearly one should begin with the most significant and put the most effort where the pay-off is likely to be largest.

### Methods to speed up computation of DFTs

• Tremendous Savings:
1. FFT (a factor of $N/\log_2 N$ savings)
• Substantial Savings (factor of 2 or more):
1. Table lookup of cosine/sine
2. Compiler tricks/good programming
3. Assembly-language programming
4. Special-purpose hardware
5. Real-data FFT for real data (factor of 2)
6. Special cases
• Minor Savings:
1. Higher-radix and split-radix algorithms
2. Special butterflies
3. 3-real-multiplication complex multiply
4. Fast bit-reversal (up to 6%)

> **Fact:** On general-purpose machines, computation is only part of the total run time. Address generation, indexing, data shuffling, and memory access take up much or most of the cycles.

> **Fact:** A well-written radix-2 program will run much faster than a poorly written split-radix program!

## References

1. D.M.W. Evans. (1987, August). An improved digit-reversal permutation algorithm for the fast Fourier and Hartley transforms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(8), 1120-1125.
2. M. Frigo and S.G. Johnson. (2005, February). The design and implementation of FFTW3. Proceedings of the IEEE, 93(2), 216-231.
3. R. Meyer, H.W. Schuessler, and K. Schwarz. (1990). FFT Implementation on DSP chips - Theory and Practice. IEEE International Conference on Acoustics, Speech, and Signal Processing.
4. H.V. Sorensen, D.L. Jones, M.T. Heideman, and C.S. Burrus. (1987, June). Real-valued fast Fourier transform algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6), 849-863.
5. H.V. Sorensen and C.S. Burrus. (1993, March). Efficient computation of the DFT with only a subset of input or output points. IEEE Transactions on Signal Processing, 41(3), 1184-1200.
