Winograd's Short DFT Algorithms

Module by: C. Sidney Burrus, Ivan Selesnick

In 1976, S. Winograd [20] presented a new DFT algorithm which had significantly fewer multiplications than the Cooley-Tukey FFT which had been published eleven years earlier. This new Winograd Fourier Transform Algorithm (WFTA) is based on the type-one index map from Multidimensional Index Mapping, with each of the relatively prime length short DFT's calculated by very efficient special algorithms. It is these short algorithms that this section will develop. They use the index permutation of Rader described in another module to convert the prime length short DFT's into cyclic convolutions. Winograd developed a method for calculating digital convolution with the minimum number of multiplications. These optimal algorithms are based on the polynomial residue reduction techniques of Polynomial Description of Signals: Equation 1 to break the convolution into multiple small ones [2], [12], [14], [23], [21], [9].

The operation of discrete convolution defined by

y(n) = \sum_k h(n - k) \, x(k)
(1)

is called a bilinear operation because, for a fixed h(n), y(n) is a linear function of x(n), and for a fixed x(n) it is a linear function of h(n). The operation of cyclic convolution is the same but with all indices evaluated modulo N.

Recall from Polynomial Description of Signals: Equation 3 that length-N cyclic convolution of x(n) and h(n) can be represented by polynomial multiplication

Y(s) = X(s) \, H(s) \mod (s^N - 1)
(2)
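
The equivalence of Equation 1 and Equation 2 is easy to check numerically. The following NumPy sketch (ours, not part of the original module) computes a cyclic convolution both directly from Equation 1 and by multiplying the two polynomials and folding the product back modulo s^N - 1.

```python
import numpy as np

def cyclic_convolution(h, x):
    """Direct evaluation of Equation 1 with all indices taken modulo N."""
    N = len(x)
    return np.array([sum(h[(n - k) % N] * x[k] for k in range(N))
                     for n in range(N)])

def cyclic_convolution_poly(h, x):
    """Equation 2: ordinary polynomial product, then reduction mod s^N - 1."""
    N = len(x)
    full = np.convolve(h, x)        # linear convolution = polynomial product
    y = full[:N].copy()
    y[:len(full) - N] += full[N:]   # fold back: s^(N+j) = s^j (mod s^N - 1)
    return y

h = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([1.0, 0.0, -1.0, 2.0])
assert np.allclose(cyclic_convolution(h, x), cyclic_convolution_poly(h, x))
```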

This bilinear operation of Equation 1 and Equation 2 can also be expressed in terms of linear matrix operators and a simpler bilinear operator denoted by o, which may be only a simple element-by-element multiplication of the two vectors [12], [9], [10]. This matrix formulation is

Y = C \, [ A X \, o \, B H ]
(3)

where X, H and Y are length-N vectors with elements of x(n), h(n) and y(n) respectively. The matrices A and B have dimension M x N, and C is N x M, with M \ge N. The elements of A, B, and C are constrained to be simple; typically small integers or rational numbers. It will be these matrix operators that do the equivalent of the residue reduction on the polynomials in Equation 2.

In order to derive a useful algorithm of the form Equation 3 to calculate Equation 1, consider the polynomial formulation Equation 2 again. To use the residue reduction scheme, the modulus is factored into relatively prime factors. Fortunately the factoring of this particular polynomial, s^N - 1, has been extensively studied and it has considerable structure. When factored over the rationals, which means that the only coefficients allowed are rational numbers, the factors are called cyclotomic polynomials [2], [12], [14]. The most interesting property for our purposes is that most of the coefficients of cyclotomic polynomials are zero and the others are plus or minus unity for degrees up to over one hundred. This means the residue reduction will generally require no multiplications.
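
A minimal sketch of this factorization, written by us in NumPy rather than taken from the module: each cyclotomic polynomial C_n(s) can be obtained by dividing s^n - 1 by the cyclotomic polynomials of the proper divisors of n, and for small n the resulting coefficients are indeed only 0 and plus or minus one.

```python
import numpy as np

def cyclotomic(n, _cache={}):
    """Coefficients (highest degree first) of the n-th cyclotomic polynomial,
    from s^n - 1 = prod over d | n of C_d(s), by repeated polynomial division."""
    if n not in _cache:
        num = np.zeros(n + 1)
        num[0], num[-1] = 1.0, -1.0          # s^n - 1
        for d in range(1, n):
            if n % d == 0:
                num, rem = np.polydiv(num, cyclotomic(d))
                assert np.allclose(rem, 0)   # division is always exact
        _cache[n] = np.round(num).astype(int)
    return _cache[n]

# s^4 - 1 = C1 C2 C4 = (s - 1)(s + 1)(s^2 + 1), as used below in Equation 13.
print(cyclotomic(1), cyclotomic(2), cyclotomic(4))   # [ 1 -1] [1 1] [1 0 1]
```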

The operations of reducing X(s) and H(s) in Equation 2 are carried out by the matrices A and B in Equation 3. The convolution of the residue polynomials is carried out by the o operator and the recombination by the CRT is done by the C matrix. More details are in [2], [12], [14], [9], [10], but the important fact is that the A and B matrices usually contain only zero and plus or minus unity entries and the C matrix only contains rational numbers. The only general multiplications are those represented by o. Indeed, in the theoretical results from computational complexity theory, these real or complex multiplications are usually the only ones counted. In practical algorithms, the rational multiplications represented by C could be a limiting factor.

The h(n) terms are fixed for a digital filter, or they represent the W terms from Multidimensional Index Mapping: Equation 1 if the convolution is being used to calculate a DFT. Because of this, d = BH in Equation 3 can be precalculated, and only the A and C operators represent the mathematics done at execution of the algorithm. In order to exploit this feature, it was shown [23], [9] that the properties of Equation 3 allow the exchange of the more complicated operator C with the simpler operator B. Specifically this is given by

Y = C \, [ A X \, o \, B H ]
(4)
Y' = B^T [ A X \, o \, C^T H' ]
(5)

where H' has the same elements as H, but in a permuted order, and likewise Y' and Y. This very important property allows precomputing the more complicated C^T H' in Equation 5 rather than BH as in Equation 3.

Because BH or C^T H' can be precomputed, the bilinear form of Equation 3 and Equation 5 can be written as a linear form. If an M x M diagonal matrix D is formed from d = C^T H', or in the case of Equation 3, d = BH, assuming a commutative property for o, Equation 5 becomes

Y' = B^T D A X
(6)

and Equation 3 becomes

Y = C D A X
(7)

In most cases there is no reason not to use the same reduction operations on X and H; therefore, B can be the same as A and Equation 6 then becomes

Y' = A^T D A X
(8)

In order to illustrate how the residue reduction is carried out and how the A matrix is obtained, the length-5 DFT algorithm started in The DFT as Convolution or Filtering: Matrix 1 will be continued. The DFT is first converted to a length-4 cyclic convolution by the index permutation from The DFT as Convolution or Filtering: Equation 3 to give the cyclic convolution in The DFT as Convolution or Filtering. To avoid confusion from the permuted order of the data x(n) in The DFT as Convolution or Filtering, the cyclic convolution will first be developed without the permutation, using the polynomial U(s)

U(s) = x(1) + x(3) s + x(4) s^2 + x(2) s^3
(9)
U(s) = u(0) + u(1) s + u(2) s^2 + u(3) s^3
(10)

and then the results will be converted back to the permuted x(n). The length-4 cyclic convolution in terms of polynomials is

Y(s) = U(s) \, H(s) \mod (s^4 - 1)
(11)

and the modulus factors into three cyclotomic polynomials

s^4 - 1 = (s^2 - 1)(s^2 + 1)
(12)
= (s - 1)(s + 1)(s^2 + 1)
(13)
= P_1 \, P_2 \, P_3
(14)

Both U(s) and H(s) are reduced modulo these three polynomials. The reduction modulo P_1 and P_2 is done in two stages. First it is done modulo (s^2 - 1), then that residue is further reduced modulo (s - 1) and (s + 1).

U(s) = u_0 + u_1 s + u_2 s^2 + u_3 s^3
(15)
U'(s) = ((U(s)))_{s^2 - 1} = (u_0 + u_2) + (u_1 + u_3) s
(16)
U_1(s) = ((U'(s)))_{P_1} = u_0 + u_1 + u_2 + u_3
(17)
U_2(s) = ((U'(s)))_{P_2} = u_0 - u_1 + u_2 - u_3
(18)
U_3(s) = ((U(s)))_{P_3} = (u_0 - u_2) + (u_1 - u_3) s
(19)
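
These residues can be checked against ordinary polynomial division. A short NumPy sketch of ours (np.polydiv stores coefficients highest degree first):

```python
import numpy as np

u = np.array([2.0, -1.0, 3.0, 5.0])    # u0, u1, u2, u3, chosen arbitrarily
U = u[::-1]                            # U(s) = u3 s^3 + u2 s^2 + u1 s + u0

cases = [
    ([1.0, 0, -1], [u[1] + u[3], u[0] + u[2]]),   # mod s^2 - 1, Equation 16
    ([1.0, -1],    [u[0] + u[1] + u[2] + u[3]]),  # mod s - 1,   Equation 17
    ([1.0, 1],     [u[0] - u[1] + u[2] - u[3]]),  # mod s + 1,   Equation 18
    ([1.0, 0, 1],  [u[1] - u[3], u[0] - u[2]]),   # mod s^2 + 1, Equation 19
]
for P, expected in cases:
    _, residue = np.polydiv(U, np.array(P))
    assert np.allclose(residue, expected)
```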

The reduction in Equation 16 of the data polynomial Equation 15 can be denoted by a matrix operation on a vector which has the data as entries.

\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
= \begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \end{bmatrix}
(20)

and the reduction in Equation 19 is

\begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
= \begin{bmatrix} u_0 - u_2 \\ u_1 - u_3 \end{bmatrix}
(21)

Combining Equation 20 and Equation 21 gives one operator

\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
= \begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \\ u_0 - u_2 \\ u_1 - u_3 \end{bmatrix}
= \begin{bmatrix} w_0 \\ w_1 \\ v_0 \\ v_1 \end{bmatrix}
(22)

Further reduction of v_0 + v_1 s is not possible because P_3 = s^2 + 1 cannot be factored over the rationals. However s^2 - 1 can be factored into P_1 P_2 = (s - 1)(s + 1) and, therefore, w_0 + w_1 s can be further reduced as was done in Equation 17 and Equation 18 by

\begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 + w_1 = u_0 + u_2 + u_1 + u_3
(23)
\begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} = w_0 - w_1 = u_0 + u_2 - u_1 - u_3
(24)

Combining Equation 22, Equation 23 and Equation 24 gives

\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
= \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \end{bmatrix}
(25)

The same reduction is done to H(s), and then the convolution of Equation 11 is done by multiplying each residue polynomial of U(s) and H(s) modulo each corresponding cyclotomic factor of P(s), and finally a recombination using the polynomial Chinese Remainder Theorem (CRT) as in Polynomial Description of Signals: Equation 9 and Polynomial Description of Signals: Equation 13.

Y(s) = K_1(s) U_1(s) H_1(s) + K_2(s) U_2(s) H_2(s) + K_3(s) U_3(s) H_3(s)
(26)

mod (s^4 - 1)

where U_1(s) = r_0 and U_2(s) = r_1 are constants and U_3(s) = v_0 + v_1 s is a first degree polynomial. U_1 times H_1 and U_2 times H_2 are easy, but multiplying U_3 times H_3 modulo (s^2 + 1) is more difficult.

The multiplication of U_3(s) times H_3(s) can be done by the Toom-Cook algorithm [2], [12], [14], which can be viewed as Lagrange interpolation or polynomial multiplication modulo a special polynomial with three arbitrary coefficients. To simplify the arithmetic, the constants are chosen to be plus and minus one and zero. The details of this can be found in [2], [12], [14]. For this example it can be verified that

(( (v_0 + v_1 s)(h_0 + h_1 s) ))_{s^2 + 1} = (v_0 h_0 - v_1 h_1) + (v_0 h_1 + v_1 h_0) s
(27)

which by the Toom-Cook algorithm or inspection is

\begin{bmatrix} 1 & -1 & 0 \\ -1 & -1 & 1 \end{bmatrix}
\left[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} v_0 \\ v_1 \end{bmatrix}
\; o \;
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} h_0 \\ h_1 \end{bmatrix} \right]
= \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
(28)

where o signifies point-by-point multiplication. The total A matrix in Equation 3 is a combination of Equation 25 and Equation 28, giving

A X = A_1 A_2 A_3 X
(29)
= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix}
= \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \\ v_0 + v_1 \end{bmatrix}
(30)

where the matrix A_3 gives the residue reduction modulo s^2 - 1 and s^2 + 1, the upper left-hand part of A_2 gives the reduction modulo s - 1 and s + 1, and the lower right-hand part of A_1 carries out the Toom-Cook algorithm modulo s^2 + 1, with the multiplication in Equation 5. Notice that by calculating Equation 30 in the three stages, seven additions are required. Also notice that A_1 is not square. It is this “expansion" that causes more than N multiplications to be required in o in Equation 5 or D in Equation 6. This staged reduction derives the A operator for Equation 5.

The method described above is very straightforward for the shorter DFT lengths. For N = 3, both of the residue polynomials are constants and the multiplication given by o in Equation 3 is trivial. For N = 5, which is the example used here, there is one first degree polynomial multiplication required, but the Toom-Cook algorithm uses simple constants and, therefore, works well as indicated in Equation 28. For N = 7, there are two first degree residue polynomials which can each be multiplied by the same techniques used in the N = 5 example. Unfortunately, for any longer lengths, the residue polynomials have an order of three or greater, which causes the Toom-Cook algorithm to require constants of plus and minus two and worse. For that reason, the Toom-Cook method is not used, and other techniques such as index mapping are used that require more than the minimum number of multiplications but do not require an excessive number of additions. The resulting algorithms still have the structure of Equation 8. Blahut [2] and Nussbaumer [14] have a good collection of algorithms for polynomial multiplication that can be used with the techniques discussed here to construct a wide variety of DFT algorithms.

The constants in the diagonal matrix D can be found from the CRT matrix C in Equation 5 using d = C^T H' for the diagonal terms in D. As mentioned above, for the smaller prime lengths of 3, 5, and 7 this works well, but for longer lengths the CRT becomes very complicated. An alternate method for finding D uses the fact that since the linear form Equation 6 or Equation 8 calculates the DFT, it is possible to calculate a known DFT of a given x(n) from the definition of the DFT in Multidimensional Index Mapping: Equation 1 and, given the A matrix in Equation 8, solve for D by solving a set of simultaneous equations. The details of this procedure are described in [9].
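
The following NumPy sketch of ours illustrates the same numerical idea on the length-4 cyclic convolution above: it builds A = A_1 A_2 A_3 from Equation 30 and then solves for the recombination operator (here the C of Equation 3 with B = A, rather than D) by least squares over random input/output pairs; it is a demonstration of the approach, not the procedure of [9] itself.

```python
import numpy as np

# The three stages of Equation 30.
A1 = np.array([[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1],[0,0,1,1]])
A2 = np.array([[1,1,0,0],[1,-1,0,0],[0,0,1,0],[0,0,0,1]])
A3 = np.array([[1,0,1,0],[0,1,0,1],[1,0,-1,0],[0,1,0,-1]])
A = A1 @ A2 @ A3                        # the 5 x 4 reduction operator

def cyclic_convolution(h, x):
    N = len(x)
    return np.array([sum(h[(n - k) % N] * x[k] for k in range(N))
                     for n in range(N)])

# Collect equations y = C m, where m = Au o Ah, then solve for the 4 x 5 C.
rng = np.random.default_rng(0)
M, Y = [], []
for _ in range(40):
    u, h = rng.standard_normal(4), rng.standard_normal(4)
    M.append((A @ u) * (A @ h))         # the five general multiplications
    Y.append(cyclic_convolution(h, u))
C = np.linalg.lstsq(np.array(M), np.array(Y), rcond=None)[0].T

# The recovered bilinear algorithm computes cyclic convolution exactly.
u, h = rng.standard_normal(4), rng.standard_normal(4)
assert np.allclose(C @ ((A @ u) * (A @ h)), cyclic_convolution(h, u))
```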

A modification of this approach also works for a length which is an odd prime raised to some power: N = P^M. This is a bit more complicated [12], [23] but has been done for lengths of 9 and 25. For longer lengths, the conventional Cooley-Tukey type-two index map algorithm seems to be more efficient. For powers of two, there is no primitive root, and therefore, no simple conversion of the DFT into convolution. It is possible to use two generators [12], [14], [21] to make the conversion, and there exists a set of length 4, 8, and 16 DFT algorithms of the form in Equation 8 in [12].

In Table 1 an operation count of several short DFT algorithms is presented. These are practical algorithms that can be used alone or in conjunction with the index mapping to give longer DFT's as shown in The Prime Factor and Winograd Fourier Transform Algorithms. Most are optimized in having either the theoretical minimum number of multiplications or the minimum number of multiplications without requiring a very large number of additions. Some allow other reasonable trade-offs between numbers of multiplications and additions. There are two lists of the number of multiplications. The first is the number of actual floating point multiplications that must be done for that length DFT. Some of these (one or two in most cases) will be by rational constants and the others will be by irrational constants. The second list is the total number of multiplications given in the diagonal matrix D in Equation 8. At least one of these will be unity (the one associated with X(0)) and in some cases several will be unity (for N = 2^m). The second list is important in programming the WFTA in The Prime Factor and Winograd Fourier Transform Algorithm: The Winograd Fourier Transform Algorithm.

Table 1: Number of Real Multiplications and Additions for a Length-N DFT of Complex Data
Length N   Non-one Mult   Total Mult   Adds
2 0 4 4
3 4 6 12
4 0 8 16
5 10 12 34
7 16 18 72
8 4 16 52
9 20 22 84
11 40 42 168
13 40 42 188
16 20 36 148
17 70 72 314
19 76 78 372
25 132 134 420
32 68 - 388

Because of the structure of the short DFTs, the number of real multiplications required for the DFT of real data is exactly half that required for complex data. The number of real additions required is slightly less than half that required for complex data because (N - 1) of the additions needed when N is prime add a real to an imaginary, and that is not actually performed. When N = 2^m, there are (N - 2) of these pseudo additions. The special case for real data is discussed in [5], [7], [19].

The structure of these algorithms is in the form of X' = C D A X or B^T D A X or A^T D A X from Equations 6, 7 and 8. The A and B matrices are generally M by N with M \ge N and have elements that are integers, generally 0 or ±1. A pictorial description is given in Figure 1.

Figure 1: Flow Graph for the Length-5 DFT
(In the flow graph, the six multiplier values are 1.000, -1.250, 0.559, -j1.539, j0.362, and j0.588, and the graph is symmetric top to bottom.)
Figure 2: Block Diagram of a Winograd Short DFT
(The block diagram shows a central multiplication block flanked on each side by trapezoidal addition blocks that expand the N inputs to M values and contract them back.)

The flow graph in Figure 1 should be compared with the matrix description of Equation 8 and Equation 30, and with the programs in [2], [12], [3], [14] and the appendices. The shape in Figure 2 illustrates the expansion of the data by A. That is to say, AX has more entries than X because M > N. The A operator consists of additions, the D operator gives the M multiplications (some by one), and A^T contracts the data back to N values with additions only. M is one half the second list of multiplies in Table 1.

An important characteristic of the D operator in the calculation of the DFT is that its entries are either purely real or purely imaginary. The reduction of the W vector by (s^{(N-1)/2} - 1) and (s^{(N-1)/2} + 1) separates the real and the imaginary constants. This is discussed in [23], [9]. The number of multiplications for complex data is only twice those necessary for real data, not four times.

Although this discussion has been on the calculation of the DFT, very similar results are true for the calculation of convolution and correlation, and these will be further developed in Algorithms for Data with Restrictions. The A^T D A structure and the picture in Figure 2 are the same for convolution. Algorithms and operation counts can be found in [2], [14], [1].

The Bilinear Structure

The bilinear form introduced in Equation 3 and the related linear form in Equation 6 are very powerful descriptions of both the DFT and convolution.

Bilinear: Y = C \, [ A X \, o \, B H ]
(31)
Linear: Y = C D A X
(32)

Since Equation 31 is a bilinear operation defined in terms of a second bilinear operator o, this formulation can be nested. For example, if o is itself defined in terms of a second bilinear operator @ by

X \, o \, H = C' [ A' X \, @ \, B' H ]
(33)

then Equation 31 becomes

Y = C C' [ A' A X \, @ \, B' B H ]
(34)

For convolution, if A represents the polynomial residue reduction modulo the cyclotomic polynomials, then A is square (e.g., Equation 25) and o represents multiplication of the residue polynomials modulo the cyclotomic polynomials. If A represents the reduction modulo the cyclotomic polynomials plus the Toom-Cook reduction, as was the case in the example of Equation 30, then A is M x N and o is term-by-term simple scalar multiplication. In this case AX can be thought of as a transform of X and C is the inverse transform. This is called a rectangular transform [1] because A is rectangular. The transform requires only additions and convolution is done with M multiplications. The other extreme is when A represents reduction over the N complex roots of s^N - 1. In this case A is the DFT itself, o is point by point complex multiplication, and C is the inverse DFT. A trivial case is where A, B and C are identity operators and o is the cyclic convolution.
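
The DFT extreme of this family is the familiar FFT-based cyclic convolution, easy to confirm numerically (a sketch of ours using NumPy's FFT):

```python
import numpy as np

# A = DFT, o = pointwise complex multiplication, C = inverse DFT.
rng = np.random.default_rng(1)
x, h = rng.standard_normal(8), rng.standard_normal(8)
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

direct = np.array([sum(h[(n - k) % 8] * x[k] for k in range(8))
                   for n in range(8)])
assert np.allclose(y, direct)
```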

This very general and flexible bilinear formulation coupled with the idea of nesting in Equation 34 gives a description of most forms of convolution.

Winograd's Complexity Theorems

Because Winograd's work [2], [12], [23], [21], [22], [24] has been the foundation of the modern results in efficient convolution and DFT algorithms, it is worthwhile to look at his theoretical conclusions on optimal algorithms. Most of his results are stated in terms of polynomial multiplication as in Polynomial Description of Signals: Equation 3 or Equation 11. The measure of computational complexity is usually the number of multiplications, and only certain multiplications are counted. This must be understood in order not to misinterpret the results.

This section will simply give a statement of the pertinent results and will not attempt to derive or prove anything. A short interpretation of each theorem will be given to relate the result to the algorithms developed in this chapter. The indicated references should be consulted for background and detail.

Theorem 1 [23] Given two polynomials, x(s) and h(s), of degree N and M respectively, each with indeterminate coefficients that are elements of a field H, N + M + 1 multiplications are necessary to compute the coefficients of the product polynomial x(s)h(s). Multiplications by elements of the field G (the field of constants), which is contained in H, are not counted, and G contains at least N + M distinct elements.

The upper bound in this theorem can be realized by choosing an arbitrary modulus polynomial P(s) of degree N + M + 1 composed of N + M + 1 distinct linear polynomial factors with coefficients in G which, since its degree is greater than that of the product x(s)h(s), has no effect on the product, and by reducing x(s) and h(s) to N + M + 1 residues modulo the N + M + 1 factors of P(s). These residues are multiplied by each other, requiring N + M + 1 multiplications, and the results recombined using the Chinese remainder theorem (CRT). The operations required in the reduction and recombination are not counted, while the residue multiplications are. Since the modulus P(s) is arbitrary, its factors are chosen to be simple so as to make the reduction and CRT simple. Factors of zero, plus and minus unity, and infinity are the simplest. Plus and minus two and other factors complicate the actual calculations considerably, but the theorem does not take that into account. This algorithm is a form of the Toom-Cook algorithm and of Lagrange interpolation [2], [12], [14], [23]. For our applications, H is the field of reals and G the field of rationals.
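
For the smallest nontrivial case, N = M = 1, the construction reduces to evaluating at s = 0, 1, -1 and interpolating; only the three residue products are general multiplications, and the recombination uses only rational constants. A sketch of ours:

```python
import numpy as np

def toom_mul_deg1(x, h):
    """(x0 + x1 s)(h0 + h1 s) with N + M + 1 = 3 general multiplications."""
    m0 = x[0] * h[0]                        # residue at s = 0
    m1 = (x[0] + x[1]) * (h[0] + h[1])      # residue at s = 1
    m2 = (x[0] - x[1]) * (h[0] - h[1])      # residue at s = -1
    # CRT / Lagrange recombination: only rational constants appear.
    return np.array([m0, (m1 - m2) / 2, (m1 + m2) / 2 - m0])

x, h = np.array([3.0, 2.0]), np.array([-1.0, 4.0])
assert np.allclose(toom_mul_deg1(x, h), np.convolve(x, h))
```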

Theorem 2 [23] If an algorithm exists which computes x(s)h(s) in N + M + 1 multiplications, all but one of its multiplication steps must necessarily be of the form

m_k = (g_k' + x(g_k)) \, (g_k'' + h(g_k)) \quad \text{for } k = 0, 1, \ldots, N + M
(35)

where g_k are distinct elements of G, and g_k' and g_k'' are arbitrary elements of G.

This theorem states that the structure of an optimal algorithm is essentially unique although the factors of P(s)P(s) may be chosen arbitrarily.

Theorem 3 [23] Let P(s) be a polynomial of degree N and be of the form P(s) = Q(s)^k, where Q(s) is an irreducible polynomial with coefficients in G and k is a positive integer. Let x(s) and h(s) be two polynomials of degree at least N - 1 with coefficients from H. Then 2N - 1 multiplications are required to compute the product x(s)h(s) modulo P(s).

This theorem is similar to Theorem 1, with the operations of the reduction of the product modulo P(s) not being counted.

Theorem 4 [23] Any algorithm that computes the product x(s)h(s) modulo P(s) according to the conditions stated in Theorem 3 and requires 2N - 1 multiplications will necessarily be of one of three structures, each of which has the form of Theorem 2 internally.

As in Theorem 2, this theorem states that only a limited number of possible structures exist for optimal algorithms.

Theorem 5 [23] If the modulus polynomial P(s) has degree N and is not irreducible, it can be written in a unique factored form P(s) = P_1^{m_1}(s) P_2^{m_2}(s) \cdots P_k^{m_k}(s), where each of the P_i(s) are irreducible over the allowed coefficient field G. 2N - k multiplications are necessary to compute the product x(s)h(s) modulo P(s), where x(s) and h(s) have coefficients in H and are of degree at least N - 1. All algorithms that calculate this product in 2N - k multiplications must be of a form where each of the k residue polynomials of x(s) and h(s) are separately multiplied modulo the factors of P(s) and the results recombined via the CRT.

Corollary: If the modulus polynomial is P(s) = s^N - 1, then 2N - t(N) multiplications are necessary to compute x(s)h(s) modulo P(s), where t(N) is the number of positive divisors of N.

Theorem 5 is very general since it allows a general modulus polynomial. The proof of the upper bound involves reducing x(s) and h(s) modulo the k factors of P(s). Each of the k irreducible residue polynomials is then multiplied using the method of Theorem 4, requiring 2N_i - 1 multiplies, and the products are combined using the CRT. The total number of multiplies from the k parts is 2N - k. The theorem also states that the structure of these optimal algorithms is essentially unique. The special case of P(s) = s^N - 1 is interesting since it corresponds to cyclic convolution and, as stated in the corollary, k is easily determined. The factors of s^N - 1 are called cyclotomic polynomials and have interesting properties [2], [12], [14].
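
The corollary is easy to evaluate. For example, t(4) = 3, so length-4 cyclic convolution needs 2(4) - 3 = 5 multiplications, which is exactly the number of products produced by the 5 x 4 operator of Equation 30. A small sketch of ours:

```python
def t(N):
    """Number of positive divisors of N (= number of cyclotomic factors of s^N - 1)."""
    return sum(1 for d in range(1, N + 1) if N % d == 0)

# Minimum multiplications for length-N cyclic convolution, 2N - t(N):
for N in [2, 3, 4, 5, 8]:
    print(N, 2 * N - t(N))   # N = 4 gives 5, matching Equation 30
```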

Theorem 6 [23], [21] Consider calculating the DFT of a prime length real-valued number sequence. If G is chosen as the field of rational numbers, the number of real multiplications necessary to calculate a length-P DFT is u(DFT(P)) = 2P - 3 - t(P - 1), where t(P - 1) is the number of divisors of P - 1.

This theorem not only gives a lower limit on any practical prime length DFT algorithm, it also gives practical algorithms for N = 3, 5, and 7. Consider the operation counts in Table 1 to understand this theorem. In addition to the real multiplications counted by complexity theory, each optimal prime-length algorithm will have one multiplication by a rational constant. That constant corresponds to the residue modulo (s - 1), which always exists for the modulus P(s) = s^{N-1} - 1. In a practical algorithm, this multiplication must be carried out, and that accounts for the difference between the prediction of Theorem 6 and the count in Table 1. In addition, there is another operation that for certain applications must be counted as a multiplication. That is the calculation of the zero frequency term X(0) in the first row of the example in The DFT as Convolution or Filtering: Matrix 1. For applications to the WFTA discussed in The Prime Factor and Winograd Fourier Transform Algorithms: The Winograd Fourier Transform Algorithm, that operation must be counted as a multiply. For lengths longer than 7, optimal algorithms require too many additions, so compromise structures are used.
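
For P = 3, 5, and 7 this accounting can be checked directly against Table 1: adding the one rational multiply described above to u(DFT(P)) and doubling for complex data reproduces the first multiplication column. A sketch of ours:

```python
def t(N):
    return sum(1 for d in range(1, N + 1) if N % d == 0)

# Theorem 6 count, plus the one rational multiply, doubled for complex data.
for P, table_mults in [(3, 4), (5, 10), (7, 16)]:
    u = 2 * P - 3 - t(P - 1)            # irrational multiplies, real data
    assert 2 * (u + 1) == table_mults   # matches Table 1
```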

Theorem 7 [24], [6] If G is chosen as the field of rational numbers, the number of real multiplications necessary to calculate a length-N DFT, where N is a prime number raised to an integer power, N = P^m, is given by

u(DFT(N)) = 2N - ((m^2 + m)/2) \, t(P - 1) - m - 1
(36)

where t(P - 1) is the number of divisors of (P - 1).

This result seems to be practically achievable only for N = 9, or perhaps 25. In the case of N = 9, there are two rational multiplies that must be carried out and are counted in Table 1 but are not predicted by Theorem 7. Experience [8] indicates that even for N = 25, an algorithm based on a Cooley-Tukey FFT using a type 2 index map gives an over-all more balanced result.

Theorem 8 [6] If G is chosen as the field of rational numbers, the number of real multiplications necessary to calculate a length-N DFT where N = 2^m is given by

u(DFT(N)) = 2N - m^2 - m - 2
(37)

This result is not practically useful because the number of additions necessary to realize this minimum of multiplications becomes very large for lengths greater than 16. Nevertheless, it proves the minimum number of multiplications required of an optimal algorithm is a linear function of N rather than of N log N, which is what practical algorithms require. The best practical power-of-two algorithm seems to be the Split-Radix [4] FFT discussed in The Cooley-Tukey Fast Fourier Transform Algorithm: The Split-Radix FFT Algorithm.
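
Evaluating Equation 37 for N = 8 and N = 16 gives 2 and 10, which matches half of the complex-data counts in Table 1 (4 and 20; recall from above that real-data multiplications are half the complex-data count); for N = 32 the table entry exceeds the bound, reflecting the addition-count compromise just mentioned. A one-line check of ours:

```python
for m in [3, 4, 5]:
    N = 2 ** m
    print(N, 2 * N - m * m - m - 2)   # 8 -> 2, 16 -> 10, 32 -> 32
```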

All of these theorems use ideas based on residue reduction, multiplication of the residues, and then combination by the CRT. It is remarkable that this approach finds the minimum number of required multiplications by a constructive proof which generates an algorithm that achieves this minimum; and the structure of the optimal algorithm is, within certain variations, unique. For shorter lengths, the optimal algorithms give practical programs. For longer lengths the uncounted operations involved with the multiplication of the higher degree residue polynomials become very large and impractical. In those cases, efficient suboptimal algorithms can be generated by using the same residue reduction as for the optimal case, but by using methods other than the Toom-Cook algorithm of Theorem 1 to multiply the residue polynomials.

Practical long DFT algorithms are produced by combining short prime length optimal DFT's with the Type 1 index map from Multidimensional Index Mapping to give the Prime Factor Algorithm (PFA) and the Winograd Fourier Transform Algorithm (WFTA) discussed in The Prime Factor and Winograd Fourier Transform Algorithms. It is interesting to note that the index mapping technique is useful inside the short DFT algorithms to replace the Toom-Cook algorithm and outside to combine the short DFT's to calculate long DFT's.

The Automatic Generation of Winograd's Short DFTs

by Ivan Selesnick, Polytechnic Institute of New York University

Introduction

Efficient prime length DFTs are important for two reasons. First, a particular application may require a prime length DFT; second, the maximum length and the variety of lengths of a PFA or WFTA algorithm depend upon the availability of prime length modules.

This section [15], [18], [16], [17] discusses automation of the process Winograd used for constructing prime length FFTs [2], [8] for N < 7 and that Johnson and Burrus [9] extended to N < 19. It also describes a program that can, in principle, design any prime length FFT, automatically generate the algorithm as a C program, and draw the corresponding flow graph.

Winograd's approach uses Rader's method to convert a prime length DFT into a (P - 1) length cyclic convolution, polynomial residue reduction to decompose the problem into smaller convolutions [2], [14], and the Toom-Cook algorithm [2], [13]. The Chinese Remainder Theorem (CRT) for polynomials is then used to recombine the shorter convolutions. Unfortunately, the design procedure derived directly from Winograd's theory becomes cumbersome for longer length DFTs, and this has often prevented the design of DFT programs for lengths greater than 19.
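
Rader's conversion itself is compact enough to demonstrate. The sketch below (ours; it uses P = 5 with primitive root g = 2) shows that the nonzero-frequency outputs of a length-5 DFT are a length-4 cyclic convolution of a permuted input with a permuted kernel of powers of W = e^{-j 2 pi / 5}:

```python
import numpy as np

P, g = 5, 2                                    # prime length and a primitive root
perm  = [pow(g, k, P) for k in range(P - 1)]   # g^k  mod P -> [1, 2, 4, 3]
iperm = [pow(g, -k, P) for k in range(P - 1)]  # g^-k mod P

rng = np.random.default_rng(2)
x = rng.standard_normal(P)
W = np.exp(-2j * np.pi / P)

u = x[iperm]                                   # permuted data
h = np.array([W ** q for q in perm])           # permuted DFT kernel
conv = np.array([sum(h[(a - b) % (P - 1)] * u[b] for b in range(P - 1))
                 for a in range(P - 1)])       # length-4 cyclic convolution

X = np.fft.fft(x)
for a in range(P - 1):
    assert np.isclose(conv[a] + x[0], X[perm[a]])
```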

Here we use three methods to facilitate the construction of prime length FFT modules. First, the matrix exchange property [2], [9], [11] is used so that the transpose of the reduction operator can be used rather than the more complicated CRT reconstruction operator. This is then combined with the numerical method [9] for obtaining the multiplication coefficients rather than the direct use of the CRT. We also deviate from the Toom-Cook algorithm, because it requires too many additions for the lengths in which we are interested. Instead we use an iterated polynomial multiplication algorithm [2]. We have incorporated these three ideas into a single structural procedure that automates the design of prime length FFTs.

Matrix Description

It is important that each step in the Winograd FFT can be described using matrices. By expressing cyclic convolution as a bilinear form, a compact form of prime length DFTs can be obtained.

If y is the cyclic convolution of h and x, then y can be expressed as

y = C [ A x \; .\!\!* \; B h ]
(38)

where, using the Matlab convention, .* represents point by point multiplication. When A, B, and C are allowed to be complex, A and B are seen to be the DFT operator and C the inverse DFT. When only real numbers are allowed, A, B, and C will be rectangular. This form of convolution is presented with many examples in [2]. Using the matrix exchange property explained in [2] and [9], this form can be written as

y = R B^T [ C^T R h \; .\!\!* \; A x ]
(39)

where R is the permutation matrix that reverses order.

When h is fixed, as it is when considering prime length DFTs, the term C^T R h can be precomputed and a diagonal matrix D formed by D = diag{C^T R h}. This is advantageous because in general, C is more complicated than B, so the ability to “hide" C saves computation. Now y = R B^T D A x, or y = R A^T D A x since A and B can be the same; they implement a polynomial reduction. This form can also be used for the prime length DFTs; it is only necessary to permute the entries of x and to ensure that the DC term is computed correctly. The computation of the DC term is simple, for the residue of a polynomial modulo s - 1 is always the sum of the coefficients. After adding the x_0 term of the original input sequence to the s - 1 residue, the DC term is obtained. Now DFT{x} = R A^T D A x. In [9] Johnson observes that by permuting the elements on the diagonal of D, the output can be permuted, so that the R matrix can be hidden in D, and DFT{x} = A^T D A x. From the knowledge of this form, once A is found, D can be found numerically [9].

Programming the Design Procedure

Because each of the above steps can be described by matrices, the development of prime length FFTs is made convenient with the use of a matrix oriented programming language such as Matlab. After specifying the appropriate matrices that describe the desired FFT algorithm, generating code involves compiling the matrices into the desired code for execution.

Each matrix is a section of one stage of the flow graph that corresponds to the DFT program. The four stages are:

  1. Permutation Stage: Permutes input and output sequence.
  2. Reduction Stage: Reduces the cyclic convolution to smaller polynomial products.
  3. Polynomial Product Stage: Performs the polynomial multiplications.
  4. Multiplication Stage: Implements the point-by-point multiplication in the bilinear form.

Each of the stages can be clearly seen in the flow graphs for the DFTs. Figure 3 shows the flow graph for a length 17 DFT algorithm that was automatically drawn by the program.

Figure 3: Flowgraph of length-17 DFT
(The automatically drawn flow graph shows the four stages: the 17 input lines are permuted, expand to 41 lines through the reduction and polynomial product stages, and the right half of the graph mirrors the left.)

The programs that accomplish this process are written in Matlab and C. Those that compute the appropriate matrices are written in Matlab. These matrices are then stored as two ASCII files, with the dimensions in one and the matrix elements in the second. A C program then reads the files and compiles them to produce the final FFT program in C [18].

The Reduction Stage

The reduction of an Nth degree polynomial, X(s), modulo the cyclotomic polynomial factors of (s^N - 1) requires only additions for many N; however, the actual number of additions depends upon the way in which the reduction proceeds. The reduction is most efficiently performed in steps. For example, if N = 4 and ((X(s)))_{s-1}, ((X(s)))_{s+1} and ((X(s)))_{s^2+1} are needed, where the double parentheses denote polynomial reduction modulo (s - 1), (s + 1), and (s^2 + 1), then in the first step ((X(s)))_{s^2-1} and ((X(s)))_{s^2+1} should be computed. In the second step, ((X(s)))_{s-1} and ((X(s)))_{s+1} can be found by reducing ((X(s)))_{s^2-1}. This process is described by the diagram in Figure 4.

Figure 4: Factorization of s^4 - 1 in steps

s^4 - 1
├── s^2 - 1
│   ├── s - 1
│   └── s + 1
└── s^2 + 1

When N is even, the appropriate first factorization is (s^{N/2} - 1)(s^{N/2} + 1); however, the next appropriate factorization is frequently less obvious. The following procedure has been found to generate a factorization in steps that coincides with the factorization that minimizes the cumulative number of additions incurred by the steps. The prime factors of N are the basis of this procedure and their importance is clear from the useful well-known equation s^N - 1 = \prod_{n|N} C_n(s), where C_n(s) is the nth cyclotomic polynomial.

We first introduce the following two functions defined on the positive integers,

\psi(N) = \text{the smallest prime factor of } N, \quad N > 1
(40)

and \psi(1) = 1. The second function, \nu(k, t), denotes the largest power of t that divides k; this definition is implied by its use in Equation 43 and in the example below.

Suppose P(s) is equal to either (s^N - 1) or an intermediate noncyclotomic polynomial appearing in the factorization process, for example, (s^2 - 1) above. Write P(s) in terms of its cyclotomic factors,

P(s) = C_{k_1}(s) \, C_{k_2}(s) \cdots C_{k_L}(s)
(41)

define the two sets, G and G', by

G = \{ k_1, \ldots, k_L \} \quad \text{and} \quad G' = \{ k / \gcd(G) : k \in G \}
(42)

and define the two integers, tt and TT, by

t = \min \{ \psi(k) : k \in G', \; k > 1 \} \quad \text{and} \quad T = \max \{ \nu(k, t) : k \in G' \}
(43)

Then form two new sets,

A = \{ k \in G : T \nmid k \} \quad \text{and} \quad B = \{ k \in G : T \mid k \}
(44)

The factorization of P(s),

P(s) = \Big( \prod_{k \in A} C_k(s) \Big) \Big( \prod_{k \in B} C_k(s) \Big)
(45)

has been found useful in the procedure for factoring (s^N - 1). This is best illustrated with an example.

Example: N = 36

Step 1. Let P(s) = s^{36} - 1. Since P(s) = C_1 C_2 C_3 C_4 C_6 C_9 C_{12} C_{18} C_{36},

G = G' = \{ 1, 2, 3, 4, 6, 9, 12, 18, 36 \}
(46)
t = \min \{ 2, 3 \} = 2, \quad T = \max \{ \nu(k, 2) : k \in G' \} = 4
(47)
A = \{ k \in G : 4 \nmid k \} = \{ 1, 2, 3, 6, 9, 18 \}
(48)
B = \{ k \in G : 4 \mid k \} = \{ 4, 12, 36 \}
(49)

Hence the factorization of s^{36} - 1 into two intermediate polynomials is, as expected,

\prod_{k \in A} C_k(s) = s^{18} - 1, \quad \prod_{k \in B} C_k(s) = s^{18} + 1
(50)

If the polynomial X(s), of degree 35, is represented by a vector of coefficients, X = (x_{35}, \ldots, x_0)', then ((X(s)))_{s^{18}-1} (represented by X') and ((X(s)))_{s^{18}+1} (represented by X'') are given by

\begin{bmatrix} X' \\ X'' \end{bmatrix} = \begin{bmatrix} I_{18} & I_{18} \\ I_{18} & -I_{18} \end{bmatrix} X
(51)

which entails 36 additions.

Step 2. This procedure is repeated with P(s) = s^{18} - 1 and P(s) = s^{18} + 1. We will just show it for the latter. Let P(s) = s^{18} + 1. Since P(s) = C_4 C_{12} C_{36},

G = \{ 4, 12, 36 \}, \quad G' = \{ 1, 3, 9 \}
(52)
t = \min \{ 3 \} = 3
(53)
T = \max \{ \nu(k, 3) : k \in G' \} = \max \{ 1, 3, 9 \} = 9
(54)
A = \{ k \in G : 9 \nmid k \} = \{ 4, 12 \}
(55)
B = \{ k \in G : 9 \mid k \} = \{ 36 \}
(56)

This yields the two intermediate polynomials,

s^6 + 1 \quad \text{and} \quad s^{12} - s^6 + 1
(57)

In the notation used above,

\begin{bmatrix} X' \\ X'' \end{bmatrix} = \begin{bmatrix} I_6 & -I_6 & I_6 \\ I_6 & 0 & -I_6 \\ 0 & I_6 & I_6 \end{bmatrix} X
(58)

entailing 24 additions. Continuing this process results in a complete factorization in steps.
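
The splitting rule of Equations 40-45 is mechanical, so it is easy to sketch in code. The helper below (ours, with \nu(k, t) taken as the largest power of t dividing k, the reading assumed above) reproduces both steps of the N = 36 example:

```python
from math import gcd
from functools import reduce

def psi(N):
    """Smallest prime factor of N, with psi(1) = 1 (Equation 40)."""
    if N == 1:
        return 1
    d = 2
    while d * d <= N:
        if N % d == 0:
            return d
        d += 1
    return N

def nu(k, t):
    """Largest power of t dividing k (assumed reading of Equation 43)."""
    p = 1
    while k % (p * t) == 0:
        p *= t
    return p

def split(G):
    """One factorization step (Equations 42-44): split indices G into A and B."""
    g = reduce(gcd, G)
    Gp = [k // g for k in G]                      # G' of Equation 42
    t = min(psi(k) for k in Gp if k > 1)          # Equation 43
    T = max(nu(k, t) for k in Gp)
    return ([k for k in G if k % T != 0],         # A of Equation 44
            [k for k in G if k % T == 0])         # B of Equation 44

print(split([1, 2, 3, 4, 6, 9, 12, 18, 36]))  # ([1, 2, 3, 6, 9, 18], [4, 12, 36])
print(split([4, 12, 36]))                     # ([4, 12], [36])
```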

In order to see the number of additions this scheme uses for numbers of the form N = P - 1 (which is relevant to prime length FFT algorithms), Figure 5 shows the number of additions the reduction process uses when the polynomial X(s) is real.

Figure 5: Number of Additions for Reduction Stage

The Polynomial Product Stage

The iterated convolution algorithm can be used to construct an N point linear convolution algorithm from shorter linear convolution algorithms [2]. Suppose the linear convolution y, of the n point vectors x and h (h known), is described by

y = E_n^T D E_n x
(59)

where E_n is an “expansion" matrix the elements of which are ±1's and 0's, and D is an appropriate diagonal matrix. Because the only multiplications in this expression are by the elements of D, the number of multiplications required, M(n), is equal to the number of rows of E_n. The number of additions is denoted by A(n).

Given a matrix E_{n_1} and a matrix E_{n_2}, the iterated algorithm gives a method for combining E_{n_1} and E_{n_2} to construct a valid expansion matrix, E_n, for N \le n_1 n_2. Specifically,

E_{n_1, n_2} = ( I_{M(n_2)} \otimes E_{n_1} ) ( E_{n_2} \otimes I_{n_1} )
(60)

The product n_1 n_2 may be greater than N, for zeros can be (conceptually) appended to x. The operation count associated with E_{n_1, n_2} is

A(n_1, n_2) = n_1 A(n_2) + A(n_1) M(n_2)
(61)
M(n_1, n_2) = M(n_1) M(n_2)
(62)

Although E_{n_1, n_2} and E_{n_2, n_1} are both valid expansion matrices, in general A(n_1, n_2) \ne A(n_2, n_1). Because M(n_1, n_2) = M(n_2, n_1), it is desirable to choose an ordering of factors that minimizes the additions incurred by the expansion matrix. The following property [1], [14] follows from the discussion above.

Multiple Factors

Note that a valid expansion matrix, E_N, can be constructed from E_{n_1, n_2} and E_{n_3}, for N \le n_1 n_2 n_3. In general, any number of factors can be used to create larger expansion matrices. The operation count associated with E_{n_1, n_2, n_3} is

A(n_1, n_2, n_3) = n_1 n_2 A(n_3) + n_1 A(n_2) M(n_3) + A(n_1) M(n_2) M(n_3)
(63)
M(n_1, n_2, n_3) = M(n_1) M(n_2) M(n_3)
(64)

These equations generalize in the predicted way when more factors are considered. Because the ordering of the factors is relevant in the equation for A(\cdot) but not for M(\cdot), it is again desirable to order the factors to minimize the number of additions. By exploiting the following property of the expressions for A(\cdot) and M(\cdot), the optimal ordering can be found [1].

Preservation of Optimal Ordering. Suppose A(n_1, n_2, n_3) \le \min \{ A(n_{k_1}, n_{k_2}, n_{k_3}) : k_1, k_2, k_3 \in \{1, 2, 3\} \text{ and distinct} \}; then

  1. A(n_1, n_2) \le A(n_2, n_1)
    (65)
  2. A(n_2, n_3) \le A(n_3, n_2)
    (66)
  3. A(n_1, n_3) \le A(n_3, n_1)
    (67)

The generalization of this property to more than two factors reveals that an optimal ordering of \{n_1, \ldots, n_{L-1}\} is preserved in an optimal ordering of \{n_1, \ldots, n_L\}. Therefore, if (n_1, \ldots, n_L) is an optimal ordering of \{n_1, \ldots, n_L\}, then (n_k, n_{k+1}) is an optimal ordering of \{n_k, n_{k+1}\} and consequently

\frac{A(n_k)}{M(n_k) - n_k} \le \frac{A(n_{k+1})}{M(n_{k+1}) - n_{k+1}}
(68)

for all k = 1, 2, \ldots, L - 1.

This immediately suggests that an optimal ordering of \{n_1, \ldots, n_L\} is one for which the sequence

\frac{A(n_1)}{M(n_1) - n_1}, \; \ldots, \; \frac{A(n_L)}{M(n_L) - n_L}
(69)

is nondecreasing. Hence, ordering the factors \{n_1, \ldots, n_L\} to minimize the number of additions incurred by E_{n_1, \ldots, n_L} simply involves computing the appropriate ratios.
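
As a sketch (ours, with illustrative operation counts rather than values from the text), the ordering rule is a one-line sort on the ratio of Equation 69:

```python
def order_factors(factors, M, A):
    """Order factors so A(n) / (M(n) - n) is nondecreasing (Equation 69)."""
    return sorted(factors, key=lambda n: A[n] / (M[n] - n))

# Hypothetical counts M(n), A(n) for short linear convolution modules:
M = {2: 3, 3: 5, 4: 7}
A = {2: 3, 3: 20, 4: 40}
print(order_factors([4, 3, 2], M, A))   # [2, 3, 4] for these values
```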

Discussion and Conclusion

We have designed prime length FFTs up to length 53 that are as good as the previous designs that only went up to 19. Table 2 gives the operation counts for the new and previously designed modules, assuming complex inputs.

It is interesting to note that the operation counts depend on the factorability of P - 1. The primes 11, 23, and 47 are all of the form 1 + 2P_1 with P_1 prime, making the design of efficient FFTs for these lengths more difficult.

Further deviations from the original Winograd approach than we have made could prove useful for longer lengths. We investigated, for example, the use of twiddle factors at appropriate points in the decomposition stage; these can sometimes be used to divide the cyclic convolution into smaller convolutions. Their use means, however, that the 'center' multiplications would no longer be by purely real or imaginary numbers.

Table 2: Operation counts for prime length DFTs
N Mult Adds
7 16 72
11 40 168
13 40 188
17 82 274
19 88 360
23 174 672
29 190 766
31 160 984
37 220 920
41 282 1140
43 304 1416
47 640 2088
53 556 2038

The approach of writing a program that writes another program is a valuable one for several reasons. Programming the design process for the design of prime length FFTs has the advantages of being practical, error-free, and flexible. The flexibility is important because it allows for modification and experimentation with different algorithmic ideas. Above all, it has allowed longer DFTs to be reliably designed.

More details on the generation of programs for prime length FFTs can be found in the 1993 Technical Report.

References

  1. Agarwal, R. C. and Cooley, J. W. (1977, October). New Algorithms for Digital Convolution. IEEE Trans. on ASSP, 25(2), 392–410.
  2. Blahut, Richard E. (1985). Fast Algorithms for Digital Signal Processing. Reading, Mass.: Addison-Wesley.
  3. Burrus, C. S. and Parks, T. W. (1985). DFT/FFT and Convolution Algorithms. New York: John Wiley & Sons.
  4. Duhamel, P. and Hollmann, H. (1984, January 5). Split Radix FFT Algorithm. Electronic Letters, 20(1), 14–16.
  5. Duhamel, P. (1986, April). Implementation of `Split-Radix' FFT Algorithms for Complex, Real, and Real-Symmetric Data. [A shorter version appeared in the ICASSP-85 Proceedings, p. 20.6, March 1985]. IEEE Trans. on ASSP, 34, 285–295.
  6. Heideman, M. T. and Burrus, C. S. (1986, February). On the Number of Multiplications Necessary to Compute a Length-2^n DFT. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 91–95.
  7. Heideman, M. T. and Johnson, H. W. and Burrus, C. S. (1984, March). Prime Factor FFT Algorithms for Real–Valued Series. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (p. 28A.7.1–4). San Diego, CA
  8. Johnson, H. W. and Burrus, C. S. (1981). Large DFT Modules: N = 11, 13, 17, 19, and 25. (8105). Technical report. Department of Electrical Engineering, Rice University, Houston, TX 77251–1892.
  9. Johnson, Howard W. and Burrus, C. S. (1985, February). On the Structure of Efficient DFT Algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(1), 248–254.
  10. Kolba, D. P. and Parks, T. W. (1977, August). A Prime Factor FFT Algorithm Using High Speed Convolution. [also in]. IEEE Trans. on ASSP, 25, 281–294.
  11. Lim, J. S. and Oppenheim, A. V. (1988). Advanced Topics in Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
  12. McClellan, J. H. and Rader, C. M. (1979). Number Theory in Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
  13. Myers, Douglas G. (1990). Digital Signal Processing, Efficient Convolution and Fourier Transform Techniques. Sydney, Australia: Prentice-Hall.
  14. Nussbaumer, H. J. (1982). Fast Fourier Transform and Convolution Algorithms (2nd ed.). Heidelberg, Germany: Springer-Verlag.
  15. Selesnick, I. W. and Burrus, C. S. (1992, May). Automating the Design of Prime Length FFT Programs. In Proceedings of the IEEE International Symposium on Circuits and Systems. (Vol. 1, p. 133–136). ISCAS-92, San Diego, CA
  16. Selesnick, I. W. and Burrus, C. S. (1993, April). Multidimensional Mapping Techniques for Convolution. In Proceedings of the IEEE International Conference on Signal Processing. (Vol. III, pp. III-288–291). IEEE ICASSP-93, Minneapolis
  17. Selesnick, I. W. and Burrus, C. S. (1994, June 30). Extending Winograd's Small Convolution Algorithm to Longer Lengths. In Proceedings of the IEEE International Symposium on Circuits and Systems. (Vol. 2, p. 2.449–452). IEEE ISCAS-94, London
  18. Selesnick, Ivan W. and Burrus, C. Sidney. (1996, January). Automatic Generation of Prime Length FFT Programs. IEEE Transactions on Signal Processing, 44(1), 14–24.
  19. Sorensen, H. V. and Jones, D. L. and Heideman, M. T. and Burrus, C. S. (1987, June). Real Valued Fast Fourier Transform Algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6), 849–863.
  20. Winograd, S. (1976, April). On Computing the Discrete Fourier Transform. Proc. National Academy of Sciences, USA, 73, 1005–1006.
  21. Winograd, S. (1978, January). On Computing the Discrete Fourier Transform. Mathematics of Computation, 32, 175–199.
  22. Winograd, S. (1979, May). On the Multiplicative Complexity of the Discrete Fourier Transform. [also in]. Advances in Mathematics, 32(2), 83–117.
  23. Winograd, S. (1980). SIAM CBMS-NSF Series, No. 33: Arithmetic Complexity of Computation. Philadelphia: SIAM.
  24. Winograd, S. (1980, April). Signal Processing and Complexity of Computation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (p. 1381–1384). ICASSP-80, Denver
