This report describes three large DFT modules (17,19,25) which were developed by the first author, Howard Johnson, in June of 1981, and two previously undocumented modules (11,13) which were originally generated at Stanford in 1978 Entry 8.
The length 17 and 19 modules were created in the style of Winograd's convolutional DFT programs with strict adherence to three additional module development principles. First, as much code as possible was automatically generated. This included use of FORTRAN programs to generate the input and output mapping statements and the multiplication statements, and heavy use of EDIT commands to copy redundant sections of code. The code for imaginary data manipulation was copied directly from a working listing of code for the real part. All discussion below therefore centers on producing code only for the real part of the input data array. Even the EDIT commands for copying sections of code and substituting variable names were themselves listed in a command file. In this way, the programmer was prevented from introducing occasional typographical errors which are the bane of the DFT module debugger. Errors which did occur tended to be very large and obvious. Test routines were written to test particularly difficult sections of code before they were inserted into the DFT module (such as the modulo z8+1z8+1 convolution subsection).
Once the reduction, or PRE-WEAVE, section was written, the reconstruction, or POST-WEAVE, section was arranged to be the transpose of the reduction equations, according to the method of 'transposing the tensor' Entry 6. Although the problem of minimizing the number of additions in a module is not necessarily solved by transposing the tensor, due to the inordinate difficulty of finding suitable substitutions which would abate the addition count, and the high probability of error involved in making such substitutions, it was decided to use this method. This method also provides a convenient way to check the correctness of the reconstruction procedure by computing the matrices of the reduction and reconstruction subroutines and testing to see that they are indeed a transpose pair.
Intrinsic to the method of transposing the tensor is the fact that the matrix B used to compute the algorithm's multiplication coefficients from the Nth roots of unity is generally more complicated than either the reduction matrix or its transpose, the reconstruction matrix. This result is a consequence of B having been generated from Toom-Cook polynomial reconstruction procedures and also CRT polynomial reconstructions, which are both known to be more complicated than their associated reduction procedures. The problem of finding B in order to compute a set of multipliers may be neatly circumvented by directly solving a set of linear equations to find a coefficient vector which makes the algorithm work. The details of this trick are not reported here, but may be found in Entry 3. Suffice to say that given working FORTRAN subroutines for the reduction and reconstruction procedures, a FORTRAN program exists which will solve for the correct coefficients.
The length 25 module does not follow the traditional Winograd approach. This module is an in-line code version of a common-factor 5x5 DFT. Each length 5 DFT is a prime-length convolutional module. The output unscrambling is included in the assignment statements at the end of the program. Some of the length 5 modules used in this program are implemented as scaled versions of conventional length 5 modules in order to save some multiplies by 1/4. The scaling factors are then compensated for by adjusting the twiddle factors. This module has three multiply sections, one for the row DFT's with a data expansion factor of 6/5, one for the twiddle factors (expansion=33/25) and on for the column DFT's (expansion=6/5).
Modules for lengths 11 and 13 are very similar in spirit to the length 19 and 17 modules. Derivations are presented for both the 11 and 13 length modules which are consistent with the listings, although these interpretations may not agree with the original intentions of the designer Entry 8 they are correct in the sense that the algorithms could have been derived in the stated manner. Both the modules are of prime length and they are implemented in Winograd's convolutional style.
FORTRAN listings for all five modules are included with this report in a subroutine form suitable for use in Burrus' PFA program Entry 1. Addition and multiplication counts given are for complex input data.
This module closely follows the traditional Winograd prime-length approach.
- Use the index map x¯(n)=x(<3n>mod17)x¯(n)=x(<3n>mod17) to convert the DFT into a length 16 convolution, plus a correction term for the DC component.
- Reduce the length 16 convolution modulo all the irreducible factors of z16-1z16-1. (Irreducible over the rationals).
modz8+1:r108-r115modz8-1:r100-r107modz8+1:r108-r115modz8-1:r100-r107(1)
From z8-1z8-1 data
modz4+1:r31-r34modz4-1:r200-r203modz4+1:r31-r34modz4-1:r200-r203(2)
From z4-1z4-1 data
modz2+1:r35-r36modz2-1:r204-r205modz2+1:r35-r36modz2-1:r204-r205(3)
From z2-1z2-1 data
modz+1:r38modz-1:r37modz+1:r38modz-1:r37(4)
- Reduce the convolution modulo z2+1z2+1 using Toom-Cook factors of zz, 1/z1/z and z+1z+1. This creates variables r35, r36, and r314.
- Reduce the modulo z4+1z4+1 convolution with an iterated Toom-Cook reduction using the factors zz, 1/z1/z and z-1z-1 for the first step, and the factors zz, 1/z1/z and z+1z+1 for the second step. The first step produces r310 and r39, and the second step computes r313, r312 and r311. This is exactly the reduction procedure used in Nussbaumer's z4+1z4+1 convolution algorithm.
- Patch up the DC term by adding the z-1z-1 reduction result to x(i(1))x(i(1)).
- Use Nussbaumer's z8+1z8+1 convolution algorithm Entry 5 on r108-r115. This is the only exception to the strict use of transposing the tensor, as his algorithm saves two additions by computing the transposed reconstruction procedure in an obscure fashion. The result, however, is an exact calculation of the transpose. This reduction computes twenty-one values, r315-r335, which must be weighted by coefficients to produce the reconstructed z8+1z8+1 output, t115-t135.
- Weight the variables r31-r39, r310-r314 by coefficients to produce t11-t19, t110-t114.
- The reconstruction procedure for the z8-1z8-1 terms is a straightforward transpose of the reduction procedure.
- The z16-1z16-1 convolution result is reconstructed from the z8-1z8-1 (real)
and z8+1z8+1 (imaginary) vectors and mapped back to the outputs using the
reverse of the input map.
- All coefficients were computed using the author's QR decomposition
linear equation solver and are accurate to at least 14 places.
This module closely follows the traditional Winograd prime-length approach.
- Use the index map x¯(n)=x(<2n>mod19)x¯(n)=x(<2n>mod19) to convert the DFT into a length 18 convolution plus a correction term for the DC component.
- Reduce the length 16 convolution modulo z9+1z9+1 and z9-1z9-1.
modz9-1:r100-r108modz9+9:r109-r117modz9-1:r100-r108modz9+9:r109-r117(5)
- Use Nussbaumer's z9-1z9-1 convolution algorithm on r100-r108. This is a transposed tensor method, however it again uses an obscure reconstruction procedure. This algorithm computes nineteen intermediate quantities, r31-r319, which are then weighted against nineteen coefficients to produce t11-t119. This data is then partially reconstructed to yield the final result of the modz9-1modz9-1 convolution, t32-t310.
- In the course of the z9-1z9-1 convolution algorithm the z9-1z9-1 data is reduced modulo z-1z-1 and stored in r31. This quantity is added to x(i(1))x(i(1)) to patch up the DC term.
- An algebraic trick is used to compute the z9+1z9+1 convolution using the z9-1z9-1 algorithm. Suppose there exists a ring homomorphism H which maps elements of the ring of real polynomials modulo z9+1z9+1 into the ring of polynomials modulo z9-1z9-1. Then HH could be used on the z9+1z9+1 data, the resulting polynomial could be convolved in the modulo z9-1z9-1 domain using the existing procedure, and the output of that procedure could be mapped back through H-1H-1 into the modulo z9+1z9+1 domain. Such a homomorphism does exist, and moreover it happens to be its own inverse. H(p)H(p) where pp is a polynomial (in either R[x]/z9-1R[x]/z9-1 or R[x]/z9+1R[x]/z9+1) may be formed from pp by negating the sign on all odd-numbered coefficients, that is, H(p)(z)=p(-z)H(p)(z)=p(-z). The alternate negation of data values going into and coming out of the modz9-1modz9-1 convolution algorithm is accomplished without an increase in computing time by appropriate placement of negative signs. The nineteen intermediate values formed are r320-r338 which are then weighted by the (purely imaginary) coefficients to produce t120-t138. A partial reconstruction yields the z9+1z9+1 convolution result, t311-t319.
- The z18-1z18-1 convolution result is reconstructed from the z9-1z9-1 (real)
and z9+1z9+1 (imaginary) vectors and mapped back to the outputs using the
reverse of the input map.
- All coefficients were computed using the author's QR decomposition
linear equation solver and are accurate to at least 14 places.
This module is a common factor type module which uses length 5 convolutional DFT submodules. The length 5 submodules are implemented in a transposed tensor configuration using an index map x(n¯)=x(<2n>mod5)x(n¯)=x(<2n>mod5) followed by a reduction modulo all the irreducible factors of z4-1z4-1. The z2+1z2+1 convolution is implemented using Toom-Cook factors of zz, 1/z1/z and z-1z-1. The reconstruction matrix is exactly the transpose of the reduction procedure. The coefficients for the length 5 submodules were found using the author's QR procedure, and the twiddle factors were generated in a special FORTRAN program. The details of saving multiplies by scaling some of the prime length submodules in a common factor algorithm are discussed below in Section 5. This length 25 module has a total of 132 multiplies and 420 adds. Using Winograd's decomposition of the length 25 OFT into two length 5 DFT's and a length 20 convolution the best operation count generated by this author was 108 multiplies and 604 adds.
Scaling short length DFT algorithms can sometimes save multiplies. The prime length modules (p>2)(p>2) generally include one constant equal to 1/(p-1)1/(p-1), corresponding to convolution modulo x-1x-1 This convenient constant can in some cases be exploited. One particularly nice example is the length 25 DFT.
Use length 5 DFT modules to put together a length 25 DFT with Singleton's algorithm. This results in an algorithm which uses the length 5 module ten times, and has sixteen non-trivial twiddle factors. Counting a twiddle factor as 3/2 multiplies, and using DFT modules with 5 multiplies, the full length 25 algorithm will have 74 multiplies.
In order to exploit the constant 1/4 which appears in each length 5 module the basic length 5 module must be modified to create alternate modules A and B (Figure I). The regular length 5 DFT is represented as R. Algorithm A computes the same DFT, but with outputs 1 through 4 scaled up by a factor of 4. Algorithm B expects inputs 1 through 4 to be scaled down by a factor of 1/4. Algorithms A and B have each traded 1 multiply for 2 additions. The additions are used to implement the -factor of 4 which appears in both algorithms.
To implement a scaled algorithm:
- i - Assume the input data has been appropriately mapped into a 5 by 5 array.
- ii - Use RR on the first column of data and A on all other columns. This will scale the data in the twiddle area up by a factor of 4.
- iii - Scale down all twiddle factors by a factor of 1/16. This leaves data in the twiddle area scaled down by a composite factor of 1/4 when compared to a normal length 25 DFT.
- iv - Use RR on the first row of data and use B on all other rows. BB is modified to expect the scaled down data in the twiddle area.
Since 4 A's and 4 B's were used, a total trade has been made of 8 multiplies for 16 adds. Such a trade may in many instances be a reasonable exchange. The great thing about this scaling is that the D.C. terms did not have to be scaled, which would have generated more adds in modification A and multiplies in modification B. No additional counter-scaling multiplies were needed in this special example because the twiddle factor were available to absorb the scaling mismatches. Similar approaches should be possible for lengths 9, 49, and 121.
The PFA case is similar in spirit, but is lacking the twiddle factors to perform counter-scaling. One of the modules will have to be modified to perform the counter-scaling function.
Two basic facts will be needed. First, any Winograd-type prime length DFT module contains one constant equal to 1/(p-1)1/(p-1) and can be modified like algorithm A to scale up all of its outputs except the DC term. This modification trades one multiply for the number of adds needed to implement a multiply by (p-1)(p-1). Secondly, any Winograd-type prime length DFT module can be modified to scale all of its outputs by an arbitrary constant at the expense of only one multiply. This is accomplished by nesting the scaling constant with the multiplies in the middle of the Winograd module. Since only one of the module's original constants is trivial (that is the unity constant on the DC term) only one extra multiply is generated. This procedure assumes the module has first been re-arranged to eliminate the "cross" computation as illustrated in Figure II. Such a rearrangement can always be accomplished in prime length modules.
Now, suppose we combine length p and q modules with Good's prime factor algorithm (not using twiddles). The following scaling procedure will work:
- i - Assume the input data has been appropriately loaded into a pxqpxq data array
- ii - Scale the non-DC outputs of the length pp module and apply the modified module to all columns of the data array.
- iii - Now all the rows are scaled by (p-1)(p-1) except the zeroeth row, corresponding to the DC outputs of the length pp modules. Apply a normal length qq module to the zeroeth row. Modify the length qq module to scale by 1/(p-1)1/(p-1) and apply the modified version to all the other rows. The DFT is now complete.
As an example, consider the 3x7 DFT. In the length 3 module scaling the non-DC outputs trades one multiply for one add. When the scaled DFT is constructed, the modified length 3 module is used 7 times. But two rows must be scaled by modified length 7 modules, which brings the total multiply savings to 5 at a cost of 7 adds. This looks like a nice tradeoff. The total number of multiplies in a normal 3x7 PFA is 38.
These ideas can be expanded to multidimensional cases, although it quickly becomes difficult to keep track of which rows and columns need to be counter-scaled.
- Use the index map x¯(n)=x(<8n>mod11)x¯(n)=x(<8n>mod11) to convert the DFT into a length 10 convolution, plus a correction term for the DC components.
- Reduce the length 10 convolution modulo all the irreducible factors of z10-1z10-1
modz5-1:T1,T3,T2,T5,T4modz5+1:T6,-T8,-T7,-T10,T9modz5-1:T1,T3,T2,T5,T4modz5+1:T6,-T8,-T7,-T10,T9(6)
from z5-1z5-1 data
modz-1:T13modz5-1/z-1:AM4,AM7,AM3,AM6(afterweighting)modz-1:T13modz5-1/z-1:AM4,AM7,AM3,AM6(afterweighting)(7)
from z5+1z5+1 data
modz+1:AM2(afterweighting)modz5+1/z+1:S9,S11,S10,S12(appearsin)modz+1:AM2(afterweighting)modz5+1/z+1:S9,S11,S10,S12(appearsin)(8)
- Patch up the DC terms by adding the z-1z-1 reduction result to X(I(1))X(I(1)) and store the result in AMO.
- The z5-1z5-1 convolution proceeds in four steps. First, do the irreducible factor reductions, then reduce further with an iterated Toom-Cook procedure, weight all remaining variables, and apply the transpose of the complete reduction stage to the weighted results. The first Toom-Cook reduction uses the factors zz, 1/z1/z and z+1z+1 on the vectors AM4,AM3 and AM7,AM6 which generates the new vector AM4-AM7,AM3-AM6. Each of the original two vectors is then individually reduced using factors of zz, 1/z1/z and z+1z+1, while the new vector is reduced by AA, 1/z1/z and z-1z-1. This procedure generates nine variables: AM4,AM3,AM5; AM7,AM6,AM8; S7,S8,AM11. (The expressions for S6 and S8 contain the variables of interest).
- The nine variables from 4) are weighted along with T13.
- An exact transpose of the reduction algorithm is applied to the weighted variables (and AMO).
- The result S16,S15,S18,S17,S19 is the real part of the answer and is mapped back to the output using the map x¯(n)=x(<8n+1>mod11x¯(n)=x(<8n+1>mod11. This is an unusual map, but it is perfectly acceptable.
- A in the length 19 transform the z5+1z5+1 convolution is computed with a variation of the z5-1z5-1 algorithm. First the inputs T6,-T8,-T7,-T1O,T9 are alternately negated, then the z5-1z5-1 algorithm is applied and the outputs alternately negated.
- The result S21,S20,S23,S22,S24, representing the imaginary part of the answer, is mapped back to the output using the map x¯(n)=x(<8n+1>mod11)x¯(n)=x(<8n+1>mod11).
- In both this algorithm and the length 13 DFT plus and minus signs have been freely altered to force all constants to be positive. Also, many shortcut computations were used to save adds, obscuring in some places the logical flow of the algorithm.
- All coefficients were computed using the author's QR decomposition linear equation solver and are accurate to at least 14 places.
- Use the index map x¯(n)=x(<2n>mod13)x¯(n)=x(<2n>mod13) to convert the DFT into a length 12 convolution, plus a correction term for the DC components.
- Reduce the length 12 convolution modulo all the irreducible factors of z12-1z12-1
modz6+1:A7,A8,A9,A10,A11,A12modz6-1:A1,A2,A3,A4,A5,A6modz6+1:A7,A8,A9,A10,A11,A12modz6-1:A1,A2,A3,A4,A5,A6(9)
from z6-1z6-1 data
modz2-1:A14,A13modz2-z+1:A23,A22modz2+z+1:A25,A24modz2-1:A14,A13modz2-z+1:A23,A22modz2+z+1:A25,A24(10)
from z2-1z2-1 data
modz-1:A15modz+1:implicit(A13-A14)modz-1:A15modz+1:implicit(A13-A14)(11)
from z6+1z6+1 data
modz2+1:A17,A16modz4-z2+1:A27,A26,-A31,-A30modz2+1:A17,A16modz4-z2+1:A27,A26,-A31,-A30(12)
- Patch up the DC terms by adding the z-1z-1 reduction result to X(I(1))X(I(1)) and store the result in AMO.
- The z2-z+1z2-z+1 and z2+z+1z2+z+1 convolutions are reduced using Toom-cook factors of zz, 1/z1/z and z+1z+1 in one case and zz, 1/z1/z and z-1z-1 in the other case, and then all the reduced quantities are weighted by constants generating new variables:
from z2-z+1z2-z+1
zAM71/zAM6z-1AM8zAM71/zAM6z-1AM8(13)
from z2+z+1z2+z+1zAM101/zAM9z+1AM11zAM101/zAM9z+1AM11(14)
- The original modz+1modz+1 reduction quantity is weighted and passed, along with AMO and the above six variables, to a reconstruction procedure which first combines the z-1z-1 and z2+z+1z2+z+1 data to compute the convolution mod z3-1z3-1 (CC4,CC5,CC6), and then combines the z+1z+1 and z2-z+1z2-z+1 data to compute the convolution mod z3+1z3+1 (CC1,CC2,CC3). These two vectors are combined to compute the complete z6-1z6-1 output, which appears in permuted form in CC15 through CC20.
- The z2+1z2+1 vector is decomposed with Toom-Cook factors of zz, 1/z1/z and z+1z+1 yielding A17,A16 and the implicit term (A16+A17).
- The z4-z2+1z4-z2+1 vector is decomposed with a double iterated Toom-Cook scheme. First the vector is broken into two length two pieces: A27,A26 and A31,A30. Then the vectors are reduced by the factors of zz, 1/z1/z and z+1z+1 operating on whole vectors to produce a set of three length two vectors:
Ā27,A26
A31,A30
A29,A28 = (A27+A31), (A26+A30)
These vectors are not calculated in a straightforward manner. Each length two vector is further reduced, in the second iteration, by the factors zz, 1/z1/z and z+1z+1 to create three new implicit variables
(A27+A26)(A27+A26), (A31+A30)(A31+A30) and (A29+A28)(A29+A28).
- The nine variables from Section 6 and the three variables from Section 5 are weighted by constants and the modz6+1modz6+1 reconstruction proceeds in an ad-hoc fashion which closely resembles a transposed tensor method, but has some differences. The add count for the reconstruction would have been the same if the transposed tensor method had been applied. The z6+1z6+1 result appears in permuted form in variables CC21 through CC26.
- The final result is reconstructed from the z6-1z6-1 and z6-1z6-1 vectors. The DC term, x(i(1))x(i(1)) is set equal' to AMO.
- All coefficients were computed using the author's QR decomposition linear
equation solver and are accurate to at least 14 places.
-
Burrus, C. S. and Eschenbacker, P. W. (1981, August). An In-Place, In-Order Prime Factor FFT Algorithm. IEEE Trans. ASSP, 29,
-
Burrus, C.S. (1977, June). Index Mappings for Multidimensional Formulation of the DFT and Convolution. IEEE Trans. ASSP.
-
Johnson, H. W. (1982). An Approach to the Design of Large DFT's. Ph. D. Thesis. Rice University.
-
McClellan, J. H., and Rader, C. M. (1979). Number Theory in Digital Signal Processing. Prentice-Hall.
-
Nussbaumer, H. J. (1981). FFT and Convolution Algorithms. Springer-Verlag.
-
Rice, B. (1979, August). Winograd Convolution Algorithms over Finite Fields. Congressus Numeratium, 29, 827.
-
Silverman, H. F. (1977, April). An Introduction to Programming the Winograd Fourier Transform Algorithm. IEEE Trans. ASSP.
-
S.S. Narayan, M.H. Narasimha and Peterson, A.M. (1978, May). DFT Algorithms Analysis and Implementation. [There are several errors in the algoritms in this report.]. Report No. 3606-12. Standford University.
-
Zohar, S. (1979, August). A Prescription of Winograd's Discrete Fourier Transform Algorithm. IEEE Trans. ASSP.