The prime factor algorithm (PFA) and the Winograd Fourier transform algorithm (WFTA) are methods for efficiently calculating the DFT which use, and in fact, depend on the Type-1 index map from Multidimensional Index Mapping: Equation 10 and Multidimensional Index Mapping: Equation 6. The use of this index map preceded Cooley and Tukey's paper Entry 6, Entry 18 but its full potential was not realized until it was combined with Winograd's short DFT algorithms. The modern PFA was first presented in Entry 12 and a program given in Entry 1. The WFTA was first presented in Entry 20 and programs given in Entry 15, Entry 5.
The number theoretic basis for the indexing in these algorithms may, at first, seem more complicated than in the Cooley-Tukey FFT; however, if approached from the general index mapping point of view of Multidimensional Index Mapping, it is straightforward, and part of a common approach to breaking large problems into smaller ones. The development in this section will parallel that in The Cooley-Tukey Fast Fourier Transform Algorithm.
The general index maps of Multidimensional Index Mapping: Equation 6 and Multidimensional Index Mapping: Equation 12 must satisfy the Type-1 conditions of Multidimensional Index Mapping: Equation 7 and Multidimensional Index Mapping: Equation 10 which are
The row and column calculations in Multidimensional Index Mapping: Equation 15 are uncoupled by Multidimensional Index Mapping: Equation 16 which for this case are
In addition, to make each short sum a DFT, the
In order to have the smallest values for
which gives for the index maps in Equation 1
The frequency index map is a form of the Chinese remainder theorem. Using these index maps, the DFT in Multidimensional Index Mapping: Equation 15 becomes
which is a pure two-dimensional DFT with no twiddle factors and the
summations can be done in either order. Choices other than
Equation 5 could be used. For example,
An important feature of the short Winograd DFT's described
in Winograd’s Short DFT Algorithms that is useful for both the PFA and WFTA is the
fact that the multiplier constants in Winograd’s Short DFT Algorithms: Equation 5 or Winograd’s Short DFT Algorithms: Equation 8 are
either real or imaginary, never a general complex number. For that
reason, multiplication by complex data requires only two real
multiplications, not four. That is a very significant feature. It is
also true that the
If the DFT is calculated directly using Equation 8, the algorithm is called a prime factor algorithm Entry 6, Entry 18 and was discussed in Winograd’s Short DFT Algorithms and Multidimensional Index Mapping: In-Place Calculation of the DFT and Scrambling. When the short DFT's are calculated by the very efficient algorithms of Winograd discussed in Factoring the Signal Processing Operators, the PFA becomes a very powerful method that is as fast or faster than the best Cooley-Tukey FFT's Entry 1, Entry 12.
A flow graph is not as helpful with the PFA as it was with the Cooley-Tukey FFT, however, the following representation in Figure 1 which combines Figures Multidimensional Index Mapping: Figure 1 and Winograd’s Short DFT Algorithms: Figure 2 gives a good picture of the algorithm with the example of Multidimensional Index Mapping: Equation 25
|
If
|
As in the Cooley-Tukey program, the DO 10 loop steps through the M stages (factors of N) and the DO 20 loop calculates the N2 length-N1 DFT's. The input index map of Equation 6 is implemented in the DO 30 loop and the statement just before label 20. In the PFA, each stage or factor requires a separately programmed module or butterfly. This lengthens the PFA program but an efficient Cooley-Tukey program will also require three or more butterflies.
Because the PFA is calculated in-place using the input index map, the output is scrambled. There are five approaches to dealing with this scrambled output. First, there are some applications where the output does not have to be unscrambled as in the case of high-speed convolution. Second, an unscrambler can be added after the PFA to give the output in correct order just as the bit-reversed-counter is used for the Cooley-Tukey FFT. A simple unscrambler is given in Entry 3, Entry 1 but it is not in place. The third method does the unscrambling in the modules while they are being calculated. This is probably the fastest method but the program must be written for a specific length Entry 3, Entry 1. A fourth method is similar and achieves the unscrambling by choosing the multiplier constants in the modules properly Entry 11. The fifth method uses a separate indexing method for the input and output of each module Entry 3, Entry 17.
The Winograd Fourier transform algorithm (WFTA) uses a very
powerful property of the Type-1 index map and the DFT to give a
further reduction of the number of multiplications in the PFA. Using
an operator notation where
It has been shown Entry 21, Entry 9 that if
each operator represents identical operations on each row or column,
they commute. Since
If each short DFT in
where A represents the set of additions
done on each row or column that performs the residue reduction as
Winograd’s Short DFT Algorithms: Equation 30. Because of the appearance of the flow graph of
This is the PFA of Equation 8 and Figure 1 where
or
but the two adjacent multiplication
operators can be premultiplied and the result represented by one
operator
This is the basic idea of the Winograd Fourier transform algorithm. The commuting of the multiplication operators together in the center of the algorithm is called nesting and it results in a significant decrease in the number of multiplications that must be done at the execution of the algorithm. Pictorially, the PFA of Figure 1 becomes Entry 12 the WFTA in Figure 3.
|
The rectangular structure of the preweave addition operators causes an expansion of the data in the center of the algorithm. The 15 data points in Figure 3 become 18 intermediate values. This expansion is a major problem in programming the WFTA because it prevents a straightforward in-place calculation and causes an increase in the number of required additions and in the number of multiplier constants that must be precalculated and stored.
From Figure 3 and the idea of premultiplying the individual
multiplication operators, it can be seen why the multiplications by
unity had to be considered in Winograd’s Short DFT Algorithms: Table 1. Even if a multiplier in
The number of additions depends on the order of the pre- and postweave operators. For example in the length-15 WFTA in Figure 3, if the length-5 had been done first and last, there would have been six row addition preweaves in the preweave operator rather than the five shown. It is difficult to illustrate the algorithm for three or more factors of N, but the ideas apply to any number of factors. Each length has an optimal ordering of the pre- and postweave operators that will minimize the number of additions.
A program for the WFTA is not as simple as for the FFT or PFA because of the very characteristic that reduces the number of multiplications, the nesting. A simple two-factor example program is given in Entry 3 and a general program can be found in Entry 15, Entry 5. The same lengths are possible with the PFA and WFTA and the same short DFT modules can be used, however, the multiplies in the modules must occur in one place for use in the WFTA.
In the previous section it was seen how using the
permutation property of the elementary operators in the PFA allowed
the nesting of the multiplications to reduce their number. It was
also seen that a proper ordering of the operators could minimize the
number of additions. These ideas have been extended in formulating a
more general algorithm optimizing problem. If the DFT operator
The DFT in Equation 10 becomes
The operator notation is very helpful in understanding the central
ideas, but may hide some important facts. It has been shown
Entry 21, Entry 11 that operators in different
This formulation allows a very large set of possible orderings, in fact, the number is so large that some automatic technique must be used to find the “best". It is possible to set up a criterion of optimality that not only includes the number of multiplications but the number of additions as well. The effects of relative multiply-add times, data transfer times, CPU register and memory sizes, and other hardware characteristics can be included in the criterion. Dynamic programming can then be applied to derive an optimal algorithm for a particular application Entry 9. This is a very interesting idea as there is no longer a single algorithm, but a class and an optimizing procedure. The challenge is to generate a broad enough class to result in a solution that is close to a global optimum and to have a practical scheme for finding the solution.
Results obtained applying the dynamic programming method to the design of fairly long DFT algorithms gave algorithms that had fewer multiplications and additions than either a pure PFA or WFTA Entry 9. It seems that some nesting is desirable but not total nesting for four or more factors. There are also some interesting possibilities in mixing the Cooley-Tukey with this formulation. Unfortunately, the twiddle factors are not the same for all rows and columns, therefore, operations cannot commute past a twiddle factor operator. There are ways of breaking the total algorithm into horizontal paths and using different orderings along the different paths Entry 16, Entry 11. In a sense, this is what the split-radix FFT does with its twiddle factors when compared to a conventional Cooley-Tukey FFT.
There are other modifications of the basic structure of the Type-1 index map DFT algorithm. One is to use the same index structure and conversion of the short DFT's to convolution as the PFA but to use some other method for the high-speed convolution. Table look-up of partial products based on distributed arithmetic to eliminate all multiplications Entry 4 looks promising for certain very specific applications, perhaps for specialized VLSI implementation. Another possibility is to calculate the short convolutions using number-theoretic transforms Entry 2, Entry 15, Entry 16. This would also require special hardware. Direct calculation of short convolutions is faster on certain pipelined processor such as the TMS-320 microprocessor Entry 13.
As for the Cooley-Tukey FFT's, the first evaluation of these
algorithms will be on the number of multiplications and additions
required. The number of multiplications to compute the PFA in
Equation 8 is given by Multidimensional Index Mapping: Equation 3. Using the notation that
The count of multiplies and adds in Table 1 are calculated from (105) with the counts of the factors taken from Winograd’s Short DFT Algorithms: Table 1. The list of lengths are those possible with modules in the program of length 2, 3, 4, 5, 7, 8, 9 and 16 as is true for the PFA in Entry 3, Entry 1 and the WFTA in Entry 15, Entry 5. A maximum of four relatively prime lengths can be used from this group giving 59 different lengths over the range from 2 to 5040. The radix-2 or split-radix FFT allows 12 different lengths over the same range. If modules of length 11 and 13 from Entry 8 are added, the maximum length becomes 720720 and the number of different lengths becomes 239. Adding modules for 17, 19 and 25 from Entry 8 gives a maximum length of 1163962800 and a very large and dense number of possible lengths. The length of the code for the longer modules becomes excessive and should not be included unless needed.
The number of multiplications necessary for the WFTA is simply the product of those necessary for the required modules, including multiplications by unity. The total number may contain some unity multipliers but it is difficult to remove them in a practical program. Table 1 contains both the total number (MULTS) and the number with the unity multiplies removed (RMULTS).
Calculating the number of additions for the WFTA is more complicated than for the PFA because of the expansion of the data moving through the algorithm. For example the number of additions, TA, for the length-15 example in Figure 3 is given by
where
| Length | PFA | PFA | WFTA | WFTA | WFTA |
| N | Mults | Adds | Mults | RMults | Adds |
| 10 | 20 | 88 | 24 | 20 | 88 |
| 12 | 16 | 96 | 24 | 16 | 96 |
| 14 | 32 | 172 | 36 | 32 | 172 |
| 15 | 50 | 162 | 36 | 34 | 162 |
| 18 | 40 | 204 | 44 | 40 | 208 |
| 20 | 40 | 216 | 48 | 40 | 216 |
| 21 | 76 | 300 | 54 | 52 | 300 |
| 24 | 44 | 252 | 48 | 36 | 252 |
| 28 | 64 | 400 | 72 | 64 | 400 |
| 30 | 100 | 384 | 72 | 68 | 384 |
| 35 | 150 | 598 | 108 | 106 | 666 |
| 36 | 80 | 480 | 88 | 80 | 488 |
| 40 | 100 | 532 | 96 | 84 | 532 |
| 42 | 152 | 684 | 108 | 104 | 684 |
| 45 | 190 | 726 | 132 | 130 | 804 |
| 48 | 124 | 636 | 108 | 92 | 660 |
| 56 | 156 | 940 | 144 | 132 | 940 |
| 60 | 200 | 888 | 144 | 136 | 888 |
| 63 | 284 | 1236 | 198 | 196 | 1394 |
| 70 | 300 | 1336 | 216 | 212 | 1472 |
| 72 | 196 | 1140 | 176 | 164 | 1156 |
| 80 | 260 | 1284 | 216 | 200 | 1352 |
| 84 | 304 | 1536 | 216 | 208 | 1536 |
| 90 | 380 | 1632 | 264 | 260 | 1788 |
| 105 | 590 | 2214 | 324 | 322 | 2418 |
| 112 | 396 | 2188 | 324 | 308 | 2332 |
| 120 | 460 | 2076 | 288 | 276 | 2076 |
| 126 | 568 | 2724 | 396 | 392 |