In 1976, S. Winograd [20] presented a new DFT algorithm
with significantly fewer multiplications than the Cooley-Tukey
FFT published eleven years earlier. This new Winograd
Fourier Transform Algorithm (WFTA) is based on the type-one index
map from Multidimensional Index Mapping, with each of the relatively prime length
short DFT's calculated by very efficient special algorithms. It is
these short algorithms that this section will develop. They use the
index permutation of Rader, described in another module, to
convert the prime length short DFT's into cyclic convolutions.
Winograd developed a method for calculating digital convolution with
the minimum number of multiplications. These optimal algorithms are
based on the polynomial residue reduction techniques of Polynomial Description of Signals: Equation 1 to break the convolution into multiple small ones
[2], [12], [14], [23], [21], [9].
The operation of discrete convolution defined by

$$ y(n) = \sum_k h(n-k)\, x(k) \qquad (1) $$
is called a bilinear operation because, for a fixed h(n),
y(n) is a linear function of x(n), and for a fixed x(n) it is a
linear function of h(n). The operation of cyclic convolution is
the same but with all indices evaluated modulo N.
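As a concrete check of the definition above, the following sketch computes a cyclic convolution directly from the index-mod-N formula; the function name and the length-3 test values are illustrative, not from the text.

```python
def cyclic_convolve(h, x):
    """Cyclic convolution y(n) = sum_k h((n - k) mod N) x(k)."""
    N = len(x)
    return [sum(h[(n - k) % N] * x[k] for k in range(N)) for n in range(N)]

# Example with N = 3.  The outputs sum to sum(h) * sum(x), which is the
# residue of Y(s) = H(s) X(s) modulo (s - 1).
y = cyclic_convolve([1, 2, 3], [4, 5, 6])
```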
Recall from Polynomial Description of Signals: Equation 3 that length-N cyclic convolution of
x(n) and h(n) can be represented by polynomial multiplication
$$ Y(s) = X(s)\,H(s) \mod (s^N - 1) \qquad (2) $$
This bilinear operation of Equation 1 and Equation 2 can also be
expressed in terms of linear matrix operators and a simpler bilinear
operator denoted by o, which may be only a simple
element-by-element multiplication of the two vectors
[12], [9], [10]. This matrix formulation is
$$ Y = C \left[ A X \,o\, B H \right] \qquad (3) $$
where X, H and Y are length-N vectors with elements of
x(n), h(n) and y(n) respectively. The matrices A and B
have dimension M x N, and C is N x M with M ≥ N.
The elements of A, B, and C are constrained to be simple;
typically small integers or rational numbers. It will be these
matrix operators that do the equivalent of the residue reduction on
the polynomials in Equation 2.
In order to derive a useful algorithm of the form Equation 3 to
calculate Equation 1, consider the polynomial formulation
Equation 2 again. To use the residue reduction scheme, the modulus
is factored into relatively prime factors. Fortunately the factoring
of this particular polynomial, s^N − 1, has been extensively studied
and it has considerable structure. When factored over the rationals,
which means that the only coefficients allowed are rational numbers,
the factors are called cyclotomic polynomials
[2], [12], [14]. The most interesting property for our
purposes is that most of the coefficients of cyclotomic polynomials
are zero and the others are plus or minus unity for degrees up to
over one hundred. This means the residue reduction will generally
require no multiplications.
The operations of reducing X(s) and H(s) in Equation 2 are carried
out by the matrices A and B in Equation 3. The convolution of
the residue polynomials is carried out by the o operator, and the
recombination by the CRT is done by the C matrix. More details are
in [2], [12], [14], [9], [10], but the important fact is
that the A and B matrices usually contain only zero and plus or minus
unity entries and the C matrix contains only rational numbers. The
only general multiplications are those represented by o. Indeed,
in the theoretical results from computational complexity theory,
these real or complex multiplications are usually the only ones
counted. In practical algorithms, the rational multiplications
represented by C could be a limiting factor.
The h(n) terms are fixed for a digital filter, or they
represent the W terms from Multidimensional Index Mapping: Equation 1 if the convolution is being
used to calculate a DFT. Because of this, d = BH in Equation 3
can be precalculated, and only the A and C operators represent
the mathematics done at execution of the algorithm. In order to
exploit this feature, it was shown [23], [9] that the
properties of Equation 3 allow the exchange of the more complicated
operator C with the simpler operator B. Specifically this is
given by
$$ Y = C \left[ A X \,o\, B H \right] \qquad (4) $$
$$ Y' = B^T \left[ A X \,o\, C^T H' \right] \qquad (5) $$
where H' has the same elements as H, but in a permuted order,
and likewise Y' and Y. This very important property allows
precomputing the more complicated C^T H' in Equation 5 rather than
BH as in Equation 3.
Because BH or C^T H' can be precomputed, the bilinear form of
Equation 3 and Equation 5 can be written as a linear form. If an
M x M diagonal matrix D is formed from d = C^T H', or in the
case of Equation 3, d = BH, assuming a commutative property for
o, Equation 5 becomes
$$ Y' = B^T D A X \qquad (6) $$
and Equation 3 becomes
$$ Y = C D A X \qquad (7) $$
In most cases there is no reason not to use the same reduction
operations on X and H; therefore, B can be the same as A, and
Equation 6 then becomes
$$ Y' = A^T D A X \qquad (8) $$
In order to illustrate how the residue reduction is carried
out and how the A matrix is obtained, the length-5 DFT algorithm
started in The DFT as Convolution or Filtering: Matrix 1 will be continued. The DFT is first converted
to a length-4 cyclic convolution by the index permutation from
The DFT as Convolution or Filtering: Equation 3 to give the cyclic convolution in The DFT as Convolution or Filtering. To avoid
confusion from the permuted order of the data x(n) in The DFT as Convolution or Filtering,
the cyclic convolution will first be developed without the
permutation, using the polynomial U(s)
$$ U(s) = x(1) + x(3)\,s + x(4)\,s^2 + x(2)\,s^3 \qquad (9) $$
$$ U(s) = u(0) + u(1)\,s + u(2)\,s^2 + u(3)\,s^3 \qquad (10) $$
and then the results will be converted back to the permuted x(n).
The length-4 cyclic convolution in terms of polynomials is
$$ Y(s) = U(s)\,H(s) \mod (s^4 - 1) \qquad (11) $$
and the modulus factors into three cyclotomic polynomials
$$ s^4 - 1 = (s^2 - 1)(s^2 + 1) \qquad (12) $$

$$ = (s - 1)(s + 1)(s^2 + 1) \qquad (13) $$

$$ = P_1\, P_2\, P_3 \qquad (14) $$
Both U(s) and H(s) are reduced modulo these three polynomials.
The reduction modulo P_1 and P_2 is done in two stages. First it
is done modulo (s^2 − 1), then that residue is further reduced
modulo (s − 1) and (s + 1).
$$ U(s) = u_0 + u_1 s + u_2 s^2 + u_3 s^3 \qquad (15) $$

$$ U'(s) = ((U(s)))_{(s^2-1)} = (u_0 + u_2) + (u_1 + u_3)\,s \qquad (16) $$

$$ U_1(s) = ((U'(s)))_{P_1} = (u_0 + u_1 + u_2 + u_3) \qquad (17) $$

$$ U_2(s) = ((U'(s)))_{P_2} = (u_0 - u_1 + u_2 - u_3) \qquad (18) $$

$$ U_3(s) = ((U(s)))_{P_3} = (u_0 - u_2) + (u_1 - u_3)\,s \qquad (19) $$
The reduction in Equation 16 of the data polynomial Equation 15 can
be denoted by a matrix operation on a vector which has the data as
entries.
$$ \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} =
\begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \end{bmatrix} \qquad (20) $$
and the reduction in Equation 19 is
$$ \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} =
\begin{bmatrix} u_0 - u_2 \\ u_1 - u_3 \end{bmatrix} \qquad (21) $$
Combining Equation 20 and Equation 21 gives one operator
$$ \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} =
\begin{bmatrix} u_0 + u_2 \\ u_1 + u_3 \\ u_0 - u_2 \\ u_1 - u_3 \end{bmatrix} =
\begin{bmatrix} w_0 \\ w_1 \\ v_0 \\ v_1 \end{bmatrix} \qquad (22) $$
Further reduction of v_0 + v_1 s is not possible because P_3 = s^2 + 1 cannot be factored over the rationals. However s^2 − 1 can be
factored into P_1 P_2 = (s − 1)(s + 1) and, therefore, w_0 + w_1 s can
be further reduced as was done in Equation 17 and Equation 18 by
$$ \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
= w_0 + w_1 = u_0 + u_2 + u_1 + u_3 \qquad (23) $$

$$ \begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}
= w_0 - w_1 = u_0 + u_2 - u_1 - u_3 \qquad (24) $$
Combining Equation 22, Equation 23 and Equation 24 gives
$$ \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} =
\begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \end{bmatrix} \qquad (25) $$
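The reduction operator of Equation 25 can be checked numerically: applying the combined matrices to a data vector must give the same values as taking the polynomial residues modulo s − 1, s + 1, and s² + 1. A small sketch in plain Python (the helper name is illustrative):

```python
def reduce_u(u):
    """Apply the combined reduction of Equation 25 to u = (u0, u1, u2, u3).
    Returns [r0, r1, v0, v1]."""
    u0, u1, u2, u3 = u
    w0, w1 = u0 + u2, u1 + u3        # reduction modulo s^2 - 1
    v0, v1 = u0 - u2, u1 - u3        # reduction modulo s^2 + 1
    return [w0 + w1, w0 - w1, v0, v1]  # further reduce w's mod (s-1), (s+1)

# Cross-check against direct residues of U(s) = u0 + u1 s + u2 s^2 + u3 s^3:
#   mod (s - 1): substitute s = 1   -> u0 + u1 + u2 + u3
#   mod (s + 1): substitute s = -1  -> u0 - u1 + u2 - u3
#   mod (s^2 + 1): s^2 -> -1, s^3 -> -s -> (u0 - u2) + (u1 - u3) s
r0, r1, v0, v1 = reduce_u([1, 2, 3, 4])
```

Only additions and subtractions appear, which is the point of reducing by cyclotomic factors.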
The same reduction is done to H(s), and then the convolution of
Equation 11 is done by multiplying each residue polynomial of X(s)
and H(s) modulo each corresponding cyclotomic factor of P(s), and
finally a recombination using the polynomial Chinese Remainder
Theorem (CRT) as in Polynomial Description of Signals: Equation 9 and Polynomial Description of Signals: Equation 13.
$$ Y(s) = K_1(s)\,U_1(s)\,H_1(s) + K_2(s)\,U_2(s)\,H_2(s) + K_3(s)\,U_3(s)\,H_3(s) \mod (s^4 - 1) \qquad (26) $$
where U_1(s) = r_0 and U_2(s) = r_1 are constants and U_3(s) = v_0 + v_1 s is a first degree polynomial. U_1 times H_1 and
U_2 times H_2 are easy, but multiplying U_3 times H_3 modulo
(s^2 + 1) is more difficult.
The multiplication of U_3(s) times H_3(s) can be done by the
Toom-Cook algorithm [2], [12], [14], which can be viewed as
Lagrange interpolation or polynomial multiplication modulo a special
polynomial with three arbitrary coefficients. To simplify the
arithmetic, the constants are chosen to be plus and minus one and
zero. The details of this can be found in [2], [12], [14].
For this example it can be verified that
$$ ((v_0 + v_1 s)(h_0 + h_1 s))_{s^2 + 1} = (v_0 h_0 - v_1 h_1) + (v_0 h_1 + v_1 h_0)\,s \qquad (27) $$
which by the ToomCook algorithm or inspection is
$$ \begin{bmatrix} 1 & -1 & 0 \\ -1 & -1 & 1 \end{bmatrix}
\left[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} v_0 \\ v_1 \end{bmatrix} \,o\,
\begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} h_0 \\ h_1 \end{bmatrix} \right] =
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \qquad (28) $$
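Equation 28 can be verified by executing it: three general multiplications produce the product modulo s² + 1. A sketch of that bilinear form (the function and variable names are illustrative):

```python
def mult_mod_s2p1(v, h):
    """Multiply (v0 + v1 s)(h0 + h1 s) mod (s^2 + 1) with 3 multiplications,
    following the bilinear form of Equation 28."""
    av = [v[0], v[1], v[0] + v[1]]          # A v  (additions only)
    ah = [h[0], h[1], h[0] + h[1]]          # A h  (precomputable when h is fixed)
    m = [av[i] * ah[i] for i in range(3)]   # the 3 general multiplications
    return [m[0] - m[1], -m[0] - m[1] + m[2]]   # C m

# Direct check against Equation 27: (v0 h0 - v1 h1) + (v0 h1 + v1 h0) s
y = mult_mod_s2p1([2, 3], [5, 7])
```

Note that a general product of two linear polynomials would need four multiplications; the structure of s² + 1 reduces this to three.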
where o signifies point-by-point multiplication. The total A
matrix in Equation 3 is a combination of Equation 25 and
Equation 28, giving
$$ A X = A_1 A_2 A_3 X \qquad (29) $$
$$ = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ u_3 \end{bmatrix} =
A_1 \begin{bmatrix} r_0 \\ r_1 \\ v_0 \\ v_1 \end{bmatrix} \qquad (30) $$
where the matrix A_3 gives the residue reduction modulo s^2 − 1 and
s^2 + 1, the upper left-hand part of A_2 gives the reduction
modulo s − 1 and s + 1, and the lower right-hand part of A_1
carries out the Toom-Cook algorithm modulo s^2 + 1, with the
multiplication in Equation 5. Notice that by calculating
Equation 30 in three stages, seven additions are required. Also
notice that A_1 is not square. It is this "expansion" that causes
more than N multiplications to be required in o in Equation 5
or D in Equation 6. This staged reduction is what derives the A
operator for Equation 5.
The method described above is very straightforward for the
shorter DFT lengths. For N = 3, both of the residue polynomials
are constants and the multiplication given by o in Equation 3 is
trivial. For N = 5, which is the example used here, there is one
first degree polynomial multiplication required, but the Toom-Cook
algorithm uses simple constants and, therefore, works well, as
indicated in Equation 28. For N = 7, there are two first degree
residue polynomials which can each be multiplied by the same
techniques used in the N = 5 example. Unfortunately, for any
longer lengths, the residue polynomials have an order of three or
greater, which causes the Toom-Cook algorithm to require constants of
plus and minus two and worse. For that reason, the Toom-Cook method
is not used, and other techniques such as index mapping are used
that require more than the minimum number of multiplications but do
not require an excessive number of additions. The resulting
algorithms still have the structure of Equation 8. Blahut
[2] and Nussbaumer [14] have a good collection of
algorithms for polynomial multiplication that can be used with the
techniques discussed here to construct a wide variety of DFT
algorithms.
The constants in the diagonal matrix D can be found from the
CRT matrix C in Equation 5 using d = C^T H' for the diagonal
terms in D. As mentioned above, for the smaller prime lengths of
3, 5, and 7 this works well, but for longer lengths the CRT becomes
very complicated. An alternate method for finding D uses the fact
that since the linear form Equation 6 or Equation 8 calculates the
DFT, it is possible to calculate a known DFT of a given x(n) from
the definition of the DFT in Multidimensional Index Mapping: Equation 1 and, given the A matrix in
Equation 8, solve for D by solving a set of simultaneous
equations. The details of this procedure are described in
[9].
A modification of this approach also works for a length
which is an odd prime raised to some power: N = P^M. This is a
bit more complicated [12], [23] but has been done for lengths
of 9 and 25. For longer lengths, the conventional Cooley-Tukey type
two-index-map algorithm seems to be more efficient. For powers of
two, there is no primitive root, and, therefore, no simple conversion
of the DFT into convolution. It is possible to use two generators
[12], [14], [21] to make the conversion, and there exists a
set of length 4, 8, and 16 DFT algorithms of the form in Equation 8
in [12].
In Table 1 an operation count of several short DFT
algorithms is presented. These are practical algorithms that can be
used alone or in conjunction with the index mapping to give longer
DFT's as shown in The Prime Factor and Winograd Fourier Transform Algorithms. Most are optimized in having
either the theoretical minimum number of multiplications or the
minimum number of multiplications without requiring a very large
number of additions. Some allow other reasonable trade-offs between
numbers of multiplications and additions. There are two lists of the
number of multiplications. The first is the number of actual
floating point multiplications that must be done for that length
DFT. Some of these (one or two in most cases) will be by rational
constants and the others will be by irrational constants. The second
list is the total number of multiplications given in the diagonal
matrix D in Equation 8. At least one of these will be unity (the
one associated with X(0)) and in some cases several will be unity
(for N = 2^M). The second list is important in programming the
WFTA in The Prime Factor and Winograd Fourier Transform Algorithm: The Winograd Fourier Transform Algorithm.
Table 1: Number of Real Multiplications and Additions for a Length-N
DFT of Complex Data

Length N   Mult Non-one   Mult Total   Adds
 2              0              4          4
 3              4              6         12
 4              0              8         16
 5             10             12         34
 7             16             18         72
 8              4             16         52
 9             20             22         84
11             40             42        168
13             40             42        188
16             20             36        148
17             70             72        314
19             76             78        372
25            132            134        420
32             68              -        388
Because of the structure of the short DFTs, the number of real multiplications required for the DFT of
real data is exactly half that required for complex data. The number
of real additions required is slightly less than half that required
for complex data because (N − 1) of the additions needed when N is
prime add a real to an imaginary, and that is not actually
performed. When N = 2^m, there are (N − 2) of these pseudo
additions. The special case for real data is discussed in
[5], [7], [19].
The structure of these algorithms is in the form of X' = CDAX or B^T DAX or A^T DAX from Equation 5 and Equation 8. The
A and B matrices are generally M by N with M ≥ N and
have elements that are integers, generally 0 or ±1. A
pictorial description is given in Figure 1.
The flow graph in Figure 1 should be compared with the
matrix description of Equation 8 and Equation 30, and with the
programs in [2], [12], [3], [14] and the appendices. The shape in Figure 2 illustrates the expansion of the data by A. That is to
say, AX has more entries than X because M > N. The A operator
consists of additions, the D operator gives the M
multiplications (some by one), and A^T contracts the data back to
N values with additions only. M is one half the second list of
multiplies in Table 1.
An important characteristic of the D operator in the
calculation of the DFT is that its entries are either purely real or
purely imaginary. The reduction of the W vector by (s^{(N-1)/2} − 1) and (s^{(N-1)/2} + 1) separates the real and the imaginary
constants. This is discussed in [23], [9]. The number of
multiplications for complex data is only twice those necessary for
real data, not four times.
Although this discussion has been on the calculation of the
DFT, very similar results are true for the calculation of
convolution and correlation, and these will be further developed in
Algorithms for Data with Restrictions. The A^T DA structure and the picture in Figure 2
are the same for convolution. Algorithms and operation counts can be
found in [2], [14], [1].
The bilinear form introduced in Equation 3 and the related linear
form in Equation 6 are very powerful descriptions of both the DFT
and convolution.
Bilinear:

$$ Y = C \left[ A X \,o\, B H \right] \qquad (31) $$
Linear:

$$ Y = C D A X \qquad (32) $$

Since Equation 31 is a bilinear operation defined in terms of a
second bilinear operator o, this formulation can be nested. For
example, if o is itself defined in terms of a second bilinear
operator @, by
$$ X \,o\, H = C' \left[ A' X \,@\, B' H \right] \qquad (33) $$

then Equation 31 becomes
$$ Y = C C' \left[ A' A X \,@\, B' B H \right] \qquad (34) $$

For convolution, if A represents the polynomial residue reduction
modulo the cyclotomic polynomials, then A is square (e.g.,
Equation 25) and o represents multiplication of the residue
polynomials modulo the cyclotomic polynomials. If A represents the
reduction modulo the cyclotomic polynomials plus the Toom-Cook
reduction, as was the case in the example of Equation 30, then A is
M x N and o is term-by-term simple scalar multiplication. In this
case AX can be thought of as a transform of X, and C is the
inverse transform. This is called a rectangular transform
[1] because A is rectangular. The transform requires
only additions, and convolution is done with M multiplications. The
other extreme is when A represents reduction over the N complex
roots of s^N − 1. In this case A is the DFT itself, as in the
example of (43), and o is point-by-point complex multiplication
and C is the inverse DFT. A trivial case is where A, B and C
are identity operators and o is the cyclic convolution.
This very general and flexible bilinear formulation coupled
with the idea of nesting in Equation 34 gives a description of most
forms of convolution.
Because Winograd's work [2], [12], [23], [21], [22], [24]
has been the foundation of the modern results in efficient
convolution and DFT algorithms, it is worthwhile to look at his
theoretical conclusions on optimal algorithms. Most of his results
are stated in terms of polynomial multiplication as Polynomial Description of Signals: Equation 3 or
Equation 11. The measure of computational complexity is usually the
number of multiplications, and only certain multiplications are
counted. This must be understood in order not to misinterpret the
results.
This section will simply give a statement of the pertinent
results and will not attempt to derive or prove anything. A short
interpretation of each theorem will be given to relate the result to
the algorithms developed in this chapter. The indicated references
should be consulted for background and detail.
Theorem 1
[23] Given two polynomials, x(s) and h(s), of
degree N and M respectively, each with indeterminate
coefficients that are elements of a field H, N + M + 1
multiplications are necessary to compute the
coefficients of the product polynomial x(s)h(s).
Multiplications by elements of the field G (the field of
constants), which is contained in H, are not counted, and
G contains at least N + M distinct elements.
The upper bound in this theorem can be
realized by choosing an arbitrary modulus polynomial P(s) of
degree N + M + 1, composed of N + M + 1 distinct linear polynomial
factors with coefficients in G which, since its degree is greater
than that of the product x(s)h(s), has no effect on the product, and by
reducing x(s) and h(s) to N + M + 1 residues modulo the N + M + 1
factors of P(s). These residues are multiplied by each other,
requiring N + M + 1 multiplications, and the results recombined using
the Chinese remainder theorem (CRT). The operations required in the
reduction and recombination are not counted, while the residue
multiplications are. Since the modulus P(s) is arbitrary, its
factors are chosen to be simple so as to make the reduction and CRT
simple. Factors of zero, plus and minus unity, and infinity are the
simplest. Plus and minus two and other factors complicate the actual
calculations considerably, but the theorem does not take that into
account. This algorithm is a form of the Toom-Cook algorithm and of
Lagrange interpolation [2], [12], [14], [23]. For our
applications, H is the field of reals and G the field of
rationals.
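The construction in Theorem 1 can be illustrated for N = M = 1: the degree-2 product of two linear polynomials is recovered from N + M + 1 = 3 residue multiplications at the simple points 0, 1, and −1, with the uncounted recombination done by Lagrange interpolation. A sketch under those assumptions (the function name is illustrative):

```python
def toom_cook_linear(x, h):
    """Multiply x0 + x1 s by h0 + h1 s using 3 multiplications
    (evaluation at s = 0, 1, -1, then CRT/Lagrange recombination)."""
    m0 = x[0] * h[0]                      # evaluation at s = 0
    m1 = (x[0] + x[1]) * (h[0] + h[1])    # evaluation at s = 1
    m2 = (x[0] - x[1]) * (h[0] - h[1])    # evaluation at s = -1
    y0 = m0
    y1 = (m1 - m2) // 2                   # recombination uses no general mults
    y2 = (m1 + m2) // 2 - m0
    return [y0, y1, y2]

# (1 + 2s)(3 + 4s) = 3 + 10 s + 8 s^2
y = toom_cook_linear([1, 2], [3, 4])
```

The divisions by 2 are by constants in G and so, like the additions, are not counted by the theorem; a direct product would need four multiplications.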
Theorem 2
[23] If an algorithm exists which computes
x(s)h(s) in N + M + 1 multiplications, all but one of its
multiplication steps must necessarily be of the form

$$ m_k = (g_k' + x(g_k))\,(g_k'' + h(g_k)) \quad \text{for } k = 0, 1, ..., N + M \qquad (35) $$

where g_k are distinct elements of G, and g_k' and g_k''
are arbitrary elements of G.
This theorem states that the structure of
an optimal algorithm is essentially unique, although the factors of
P(s) may be chosen arbitrarily.
Theorem 3
[23] Let P(s) be a polynomial of degree N and
be of the form P(s) = Q(s)^k, where Q(s) is an
irreducible polynomial with coefficients in G and k is a
positive integer. Let x(s) and h(s) be two polynomials
of degree at least N − 1 with coefficients from H; then
2N − 1 multiplications are required to compute the product
x(s)h(s) modulo P(s).
This theorem is similar to
Theorem 1, with the operations of the reduction of the product modulo
P(s) not being counted.
Theorem 4
[23] Any algorithm that computes the product
x(s)h(s) modulo P(s) according to the conditions stated
in Theorem 3 and requires 2N − 1 multiplications will
necessarily be of one of three structures, each of which
has the form of Theorem 2 internally.
As in Theorem 2, this theorem
states that only a limited number of possible structures exist for
optimal algorithms.
Theorem 5
[23] If the modulus polynomial P(s) has
degree N and is not irreducible, it can be written in a
unique factored form P(s) = P_1^{m_1}(s) P_2^{m_2}(s) ... P_k^{m_k}(s), where each of the P_i(s) are
irreducible over the allowed coefficient field G. 2N − k
multiplications are necessary to compute the product
x(s)h(s) modulo P(s), where x(s) and h(s) have
coefficients in H and are of degree at least N − 1. All
algorithms that calculate this product in 2N − k
multiplications must be of a form where each of the k
residue polynomials of x(s) and h(s) are separately
multiplied modulo the factors of P(s) and recombined via the CRT.
Corollary: If the modulus polynomial is P(s) = s^N − 1,
then 2N − t(N) multiplications are necessary to compute
x(s)h(s) modulo P(s), where t(N) is the number of
positive divisors of N.
Theorem 5 is very general since it
allows a general modulus polynomial. The proof of the upper bound
involves reducing x(s) and h(s) modulo the k factors of
P(s). Each of the k irreducible residue polynomials is then
multiplied using the method of Theorem 4, requiring 2N_i − 1
multiplies, and the products are combined using the CRT. The total
number of multiplies from the k parts is 2N − k. The theorem also
states that the structure of these optimal algorithms is essentially
unique. The special case of P(s) = s^N − 1 is interesting since it
corresponds to cyclic convolution and, as stated in the corollary,
k is easily determined. The factors of s^N − 1 are called
cyclotomic polynomials and have interesting properties
[2], [12], [14].
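The corollary is easy to check numerically, since over the rationals s^N − 1 has one cyclotomic factor per positive divisor of N, so k = t(N). A small sketch (the helper names are illustrative):

```python
def t(N):
    """Number of positive divisors of N
    (= number of cyclotomic factors of s^N - 1 over the rationals)."""
    return sum(1 for d in range(1, N + 1) if N % d == 0)

def min_mults_cyclic(N):
    """Winograd lower bound 2N - t(N) on multiplications for
    length-N cyclic convolution."""
    return 2 * N - t(N)

# For N = 4: s^4 - 1 = (s - 1)(s + 1)(s^2 + 1), so t(4) = 3 and the bound
# is 2*4 - 3 = 5, matching the five products produced by the expansion in
# Equation 30, where A1 is 5 x 4.
bound = min_mults_cyclic(4)
```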
Theorem 6
[23], [21] Consider calculating the DFT of a
prime length real-valued number sequence. If G is chosen
as the field of rational numbers, the number of real
multiplications necessary to calculate a length-P DFT is
u(DFT(N)) = 2P − 3 − t(P − 1), where t(P − 1) is the number of
divisors of P − 1.
This theorem not only gives a lower limit
on any practical prime length DFT algorithm, it also gives practical
algorithms for N = 3, 5, and 7. Consider the operation counts in
Table 1 to understand this theorem. In addition to the real
multiplications counted by complexity theory, each optimal
prime-length algorithm will have one multiplication by a rational
constant. That constant corresponds to the residue modulo (s − 1),
which always exists for the modulus P(s) = s^{N−1} − 1. In a
practical algorithm, this multiplication must be carried out, and
that accounts for the difference between the prediction of Theorem 6 and the count in Table 1. In addition, there is another
operation that for certain applications must be counted as a
multiplication. That is the calculation of the zero frequency term
X(0) in the first row of the example in The DFT as Convolution or Filtering: Matrix 1. For
applications to the WFTA discussed in The Prime Factor and Winograd Fourier Transform Algorithms: The Winograd Fourier Transform Algorithm, that
operation must be counted as a multiply. For lengths longer than 7,
optimal algorithms require too many additions, so compromise
structures are used.
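The count of Theorem 6 plus the single uncounted rational multiplication can be compared against Table 1, recalling that the real-data counts are half the complex-data counts listed there. A sketch of that consistency check, with the relevant Table 1 values hard-coded:

```python
def divisors(n):
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def winograd_dft_mults(P):
    """Theorem 6: real multiplications for a length-P DFT of real data,
    not counting the one rational multiplication."""
    return 2 * P - 3 - divisors(P - 1)

# Table 1 lists non-one multiplications for complex data; real data needs
# exactly half.  Adding the one rational multiply should close the gap.
table1_nonone = {3: 4, 5: 10, 7: 16}
checks = {P: winograd_dft_mults(P) + 1 == table1_nonone[P] // 2
          for P in table1_nonone}
```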
Theorem 7
[24], [6] If G is chosen as the field of
rational numbers, the number of real multiplications
necessary to calculate a length-N DFT, where N is a prime
number raised to an integer power, N = P^m, is given by

$$ u(DFT(N)) = 2N - \frac{m^2 + m}{2}\, t(P - 1) - m - 1 \qquad (36) $$

where t(P − 1) is the number of divisors of (P − 1).
This result seems to be practically
achievable only for N = 9, or perhaps 25. In the case of N = 9,
there are two rational multiplies that must be carried out and are
counted in Table 1 but are not predicted by Theorem 7.
Experience [8] indicates that even for N = 25, an
algorithm based on a Cooley-Tukey FFT using a type-2 index map gives
an overall more balanced result.
Theorem 8
[6] If G is chosen as the field of rational
numbers, the number of real multiplications necessary to
calculate a length-N DFT where N = 2^m is given by

$$ u(DFT(N)) = 2N - m^2 - m - 2 \qquad (37) $$

This result is not practically useful because the number of
additions necessary to realize this minimum of multiplications
becomes very large for lengths greater than 16. Nevertheless, it
proves the minimum number of multiplications required of an optimal
algorithm is a linear function of N rather than of N log N,
which is required of practical algorithms. The best practical
power-of-two algorithm seems to be the Split-Radix [4] FFT
discussed in The Cooley-Tukey Fast Fourier Transform Algorithm: The Split-Radix FFT Algorithm.
All of these theorems use ideas based on residue reduction,
multiplication of the residues, and then combination by the CRT. It
is remarkable that this approach finds the minimum number of
required multiplications by a constructive proof which generates an
algorithm that achieves this minimum; and the structure of the
optimal algorithm is, within certain variations, unique. For shorter
lengths, the optimal algorithms give practical programs. For longer
lengths the uncounted operations involved with the multiplication of
the higher degree residue polynomials become very large and
impractical. In those cases, efficient suboptimal algorithms can be
generated by using the same residue reduction as for the optimal
case, but by using methods other than the ToomCook algorithm of
Theorem 1 to multiply the residue polynomials.
Practical long DFT algorithms are produced by combining
short prime length optimal DFT's with the Type 1 index map from
Multidimensional Index Mapping to give the Prime Factor Algorithm (PFA) and the
Winograd Fourier Transform Algorithm (WFTA) discussed in The Prime Factor and Winograd Fourier Transform Algorithms. It is interesting to note that the index mapping technique
is useful inside the short DFT algorithms to replace the ToomCook
algorithm and outside to combine the short DFT's to calculate long
DFT's.
by Ivan Selesnick, Polytechnic Institute of New York University
Efficient prime length DFTs are important for two reasons. A particular
application may require a prime length DFT, and the maximum length
and the variety of lengths of a PFA or WFTA algorithm depend upon the
availability of prime length modules.
This section [15], [18], [16], [17] discusses automation of the process Winograd
used for constructing prime length FFTs [2], [8] for N < 7,
which Johnson and Burrus [9] extended to N < 19.
It also describes a program that will design any prime length FFT in principle,
and will also automatically generate the algorithm as a C program and draw
the corresponding flow graph.
Winograd's approach uses Rader's method to convert a prime length DFT into
a length P − 1 cyclic convolution, polynomial residue reduction to decompose
the problem into smaller convolutions [2], [14], and the Toom-Cook
algorithm [2], [13].
The Chinese Remainder Theorem (CRT) for polynomials is then used to recombine the
shorter convolutions. Unfortunately, the design procedure derived directly from
Winograd's theory becomes cumbersome for longer length DFTs, and this has often
prevented the design of DFT programs for lengths greater than 19.
Here we use three methods to facilitate the construction of prime
length FFT modules. First, the matrix exchange property
[2], [9], [11] is used so
that the transpose of the reduction operator can be used rather than the more
complicated CRT reconstruction operator. This is then combined with the
numerical method [9] for obtaining the multiplication coefficients, rather
than the direct use of the CRT. We also deviate from the Toom-Cook algorithm,
because it requires too many additions for the lengths in which we are interested.
Instead we use an iterated polynomial multiplication algorithm [2]. We have
incorporated these three ideas into a single structural procedure that automates
the design of prime length FFTs.
It is important that each step in the Winograd FFT can be described using matrices.
By expressing cyclic convolution as a bilinear form, a compact form of prime length
DFTs can be obtained.
If y is the cyclic convolution of h and x, then y can be expressed as

$$ y = C \left[ A x \;.\!*\; B h \right] \qquad (38) $$

where, using the Matlab convention, .* represents point by point multiplication.
When A, B, and C are allowed to be complex, A and B are seen to be the DFT operator
and C the inverse DFT. When only real numbers are allowed, A, B, and C will be
rectangular. This form of convolution is presented with many examples in [2].
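That complex special case can be demonstrated directly: with A = B = the DFT matrix and C = the inverse DFT, Equation 38 computes cyclic convolution. A self-contained sketch using explicit DFT matrices rather than an FFT library (helper names are illustrative; the rounding assumes integer inputs):

```python
import cmath

def dft_matrix(N, inverse=False):
    """Explicit (I)DFT matrix built from complex exponentials."""
    sign = 1 if inverse else -1
    scale = 1 / N if inverse else 1
    return [[scale * cmath.exp(sign * 2j * cmath.pi * n * k / N)
             for k in range(N)] for n in range(N)]

def matvec(M, v):
    return [sum(M[n][k] * v[k] for k in range(len(v))) for n in range(len(M))]

def cyclic_conv_via_dft(x, h):
    """y = C[Ax .* Bh] with A = B = DFT and C = inverse DFT."""
    N = len(x)
    A = dft_matrix(N)
    C = dft_matrix(N, inverse=True)
    prod = [a * b for a, b in zip(matvec(A, x), matvec(A, h))]
    return [round(y.real) for y in matvec(C, prod)]  # integer inputs assumed

y = cyclic_conv_via_dft([1, 2, 3, 4], [1, 0, 0, 1])
```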
Using the matrix exchange property explained in [2] and [9] this form can be
written as

y = R B^T [ C^T R h .* A x ]     (39)

where R is the permutation matrix that reverses order.
When h is fixed, as it is when considering prime length DFTs, the term C^T R h can be
precomputed and a diagonal matrix D formed by D = diag{C^T R h}. This is advantageous
because, in general, C is more complicated than B, so the ability to "hide" C saves
computation. Now y = R B^T D A x, or y = R A^T D A x since A and B can be the same; they
implement a polynomial reduction. The form y = R A^T D A x can also be used for the prime
length DFTs; it is only necessary to permute the entries of x and to ensure that the
DC term is computed correctly. The computation of the DC term is simple, for the residue
of a polynomial modulo s − 1 is always the sum of its coefficients. After adding the x0
term of the original input sequence to the s − 1 residue, the DC term is obtained.
Now DFT{x} = R A^T D A x. In [9] Johnson observes that by permuting the elements on the
diagonal of D, the output can be permuted, so that the R matrix can be hidden in D,
and DFT{x} = A^T D A x. From the knowledge of this form, once A is found, D can be found
numerically [9].
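The exchanged form with a precomputed diagonal D can likewise be verified numerically. In this sketch (Python/numpy), R is taken to be the index reversal n → (−n) mod N, which is the sense in which it "reverses order" for cyclic convolution:

```python
import numpy as np

N = 8
F = np.exp(-2j * np.pi * np.outer(np.arange(N), np.arange(N)) / N)
A = B = F
C = np.conj(F) / N                              # inverse DFT matrix
R = np.zeros((N, N))
R[np.arange(N), -np.arange(N) % N] = 1          # index reversal n -> -n mod N

rng = np.random.default_rng(1)
x, h = rng.standard_normal(N), rng.standard_normal(N)

D = np.diag(C.T @ R @ h)                        # precomputed once for fixed h
y = np.real(R @ B.T @ D @ (A @ x))              # exchanged form y = R B^T D A x

y_direct = np.array([sum(h[(n - k) % N] * x[k] for k in range(N))
                     for n in range(N)])
assert np.allclose(y, y_direct)
```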
Because each of the above steps can be described by matrices, the development of
prime length FFTs is made convenient with the use of a matrix-oriented programming
language such as Matlab. After specifying the appropriate matrices that describe the
desired FFT algorithm, generating code involves compiling the matrices into the desired
code for execution.
Each matrix is a section of one stage of the flow graph that corresponds to the DFT
program. The four stages are:
 Permutation Stage: Permutes input and output sequence.
 Reduction Stage: Reduces the cyclic convolution to smaller polynomial products.
 Polynomial Product Stage: Performs the polynomial multiplications.
 Multiplication Stage: Implements the point-by-point multiplication in the bilinear form.
Each of the stages can be clearly seen in the flow graphs for the DFTs. Figure 3
shows the flow graph for a length 17 DFT algorithm that was automatically drawn by the program.
The programs that accomplish this process are written in Matlab and C. Those that compute
the appropriate matrices are written in Matlab. These matrices are then stored as two ASCII
files, with the dimensions in one and the matrix elements in the second. A C program then
reads the files and compiles them to produce the final FFT program in C [18].
The reduction of an Nth degree polynomial, X(s), modulo the cyclotomic polynomial factors
of (s^N − 1) requires only additions for many N; however, the actual number of additions
depends upon the way in which the reduction proceeds. The reduction is most efficiently
performed in steps. For example, if N = 4 and the required residues are
((X(s)))_{s−1}, ((X(s)))_{s+1}, and ((X(s)))_{s^2+1}, where the double parentheses denote
polynomial reduction modulo (s − 1), (s + 1), and (s^2 + 1), then in the first step
((X(s)))_{s^2−1} and ((X(s)))_{s^2+1} should be computed. In the second step,
((X(s)))_{s−1} and ((X(s)))_{s+1} can be found by reducing ((X(s)))_{s^2−1}.
This process is described by the diagram in Figure 4.
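The stepwise reduction for N = 4 can be checked directly with polynomial division (a Python/numpy sketch; the coefficient values are arbitrary):

```python
import numpy as np

def polyrem(a, b):
    """Remainder of polynomial a modulo b (numpy convention:
    highest degree coefficient first)."""
    return np.polydiv(a, b)[1]

# X(s) = s^3 + 2 s^2 + 3 s + 4, an arbitrary example
X = np.array([1.0, 2.0, 3.0, 4.0])

# Step 1: reduce modulo s^2 - 1 and s^2 + 1
X1 = polyrem(X, [1.0, 0.0, -1.0])       # ((X(s)))_{s^2 - 1}
X2 = polyrem(X, [1.0, 0.0, 1.0])        # ((X(s)))_{s^2 + 1}

# Step 2: reduce the (s^2 - 1) residue modulo s - 1 and s + 1
X11 = polyrem(X1, [1.0, -1.0])          # ((X(s)))_{s - 1}
X12 = polyrem(X1, [1.0, 1.0])           # ((X(s)))_{s + 1}

# the stepwise residues agree with direct reduction of X(s)
assert np.allclose(X11, polyrem(X, [1.0, -1.0]))
assert np.allclose(X12, polyrem(X, [1.0, 1.0]))
```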
When N is even, the appropriate first factorization is (s^{N/2} − 1)(s^{N/2} + 1); however,
the next appropriate factorization is frequently less obvious. The following procedure
has been found to generate a factorization in steps that coincides with the
factorization that minimizes the cumulative number of additions incurred by the steps.
The prime factors of N are the basis of this procedure and their importance is clear
from the useful well-known equation s^N − 1 = ∏_{n|N} C_n(s), where C_n(s) is the
nth cyclotomic polynomial.
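The cyclotomic factorization itself is easy to verify numerically. The sketch below computes C_n(s) by the recursion C_n(s) = (s^n − 1) / ∏_{d|n, d<n} C_d(s) and checks the product for N = 36 (Python/numpy; a direct, unoptimized implementation):

```python
import numpy as np

def cyclotomic(n):
    """nth cyclotomic polynomial C_n(s) via the recursion
    C_n(s) = (s^n - 1) / prod_{d|n, d<n} C_d(s);
    coefficients are returned highest degree first."""
    num = np.zeros(n + 1)
    num[0], num[-1] = 1.0, -1.0                 # s^n - 1
    den = np.array([1.0])
    for d in range(1, n):
        if n % d == 0:
            den = np.polymul(den, cyclotomic(d))
    q, _ = np.polydiv(num, den)
    return np.round(q)                          # coefficients are integers

# check s^N - 1 = prod over n | N of C_n(s) for N = 36
N = 36
prod = np.array([1.0])
for n in range(1, N + 1):
    if N % n == 0:
        prod = np.polymul(prod, cyclotomic(n))
target = np.zeros(N + 1)
target[0], target[-1] = 1.0, -1.0
assert np.allclose(prod, target)
```

For instance, `cyclotomic(36)` yields the coefficients of s^12 − s^6 + 1.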
We first introduce the following two functions defined on the positive integers:

ψ(N) = the smallest prime factor of N, for N > 1     (40)

with ψ(1) = 1, and ν(N, t), the largest power of t that divides N.
Suppose P(s) is equal to either (s^N − 1) or an intermediate noncyclotomic polynomial
appearing in the factorization process, for example, (s^2 − 1), above. Write P(s) in terms
of its cyclotomic factors,
P(s) = C_{k1}(s) C_{k2}(s) ⋯ C_{kL}(s)     (41)

and define the two sets, G and G', by
G = {k1, ⋯, kL}  and  G' = {k / gcd(G) : k ∈ G}     (42)

and define the two integers, t and T, by
t = min{ψ(k) : k ∈ G', k > 1}  and  T = max{ν(k, t) : k ∈ G'}     (43)

Then form two new sets,
A = {k ∈ G : T ∤ k}  and  B = {k ∈ G : T | k}     (44)

The factorization of P(s),
P(s) = ( ∏_{k∈A} C_k(s) ) ( ∏_{k∈B} C_k(s) )     (45)

has been found useful in the procedure for factoring (s^N − 1).
This is best illustrated with an example.
Example: N = 36

Step 1. Let P(s) = s^36 − 1. Since P = C1 C2 C3 C4 C6 C9 C12 C18 C36,
G = G' = {1, 2, 3, 4, 6, 9, 12, 18, 36}     (46)
t = min{2, 3} = 2  and  T = max{ν(k, 2) : k ∈ G'} = 4     (47)
A = {k ∈ G : 4 ∤ k} = {1, 2, 3, 6, 9, 18}     (48)
B = {k ∈ G : 4 | k} = {4, 12, 36}     (49)

Hence the factorization of s^36 − 1 into two intermediate polynomials is, as expected,
∏_{k∈A} C_k(s) = s^18 − 1 ,  ∏_{k∈B} C_k(s) = s^18 + 1     (50)

If a 35th degree polynomial, X(s), is represented by a vector of coefficients,
X = (x35, ⋯, x0)', then ((X(s)))_{s^18−1} (represented by X') and
((X(s)))_{s^18+1} (represented by X'') are given by

[X'; X''] = [ I18  I18 ; −I18  I18 ] X     (51)

which entails 36 additions.
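The reduction matrix for this first step can be checked against direct polynomial division. The sketch below (Python/numpy) assumes the reduction takes the block form [I18, I18; −I18, I18] acting on the coefficient vector X = (x35, ⋯, x0)':

```python
import numpy as np

I18 = np.eye(18)
# assumed reduction matrix: X' = X mod (s^18 - 1), X'' = X mod (s^18 + 1)
T = np.block([[I18, I18], [-I18, I18]])

def rem(a, b):
    """Polynomial remainder of a mod b, zero-padded to degree len(b) - 2."""
    r = np.polydiv(a, b)[1]
    return np.concatenate([np.zeros(len(b) - 1 - len(r)), r])

rng = np.random.default_rng(2)
X = rng.standard_normal(36)                  # coefficients (x35, ..., x0)
Xp, Xpp = np.split(T @ X, 2)                 # the two residues, 36 additions

p_minus = np.zeros(19); p_minus[0], p_minus[-1] = 1.0, -1.0   # s^18 - 1
p_plus = np.zeros(19); p_plus[0], p_plus[-1] = 1.0, 1.0       # s^18 + 1

assert np.allclose(Xp, rem(X, p_minus))
assert np.allclose(Xpp, rem(X, p_plus))
```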
Step 2. This procedure is repeated with P(s) = s^18 − 1 and P(s) = s^18 + 1.
We will just show it for the latter. Let P(s) = s^18 + 1. Since P = C4 C12 C36,
G = {4, 12, 36} ,  G' = {1, 3, 9}     (52)
t = min{3} = 3     (53)
T = max{ν(k, 3) : k ∈ G'} = max{1, 3, 9} = 9     (54)
A = {k ∈ G : 9 ∤ k} = {4, 12}     (55)
B = {k ∈ G : 9 | k} = {36}     (56)

This yields the two intermediate polynomials,
s^6 + 1  and  s^12 − s^6 + 1     (57)

In the notation used above,
[X'; X''] = [ I6  −I6  I6 ; I6  I6  0 ; −I6  0  I6 ] X     (58)

entailing 24 additions. Continuing this process results in a complete
factorization carried out in steps.
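The splitting step of Equations 42-44 is short to implement. The sketch below (plain Python; ν(k, t) is taken to be the largest power of t dividing k) reproduces both steps of the N = 36 example:

```python
from functools import reduce
from math import gcd

def psi(n):
    """Smallest prime factor of n, with psi(1) = 1."""
    for p in range(2, n + 1):
        if n % p == 0:
            return p
    return 1

def nu(k, t):
    """Largest power of t that divides k (the assumed meaning of nu)."""
    T = 1
    while k % (T * t) == 0:
        T *= t
    return T

def split(G):
    """One factorization step: split the cyclotomic index set G
    into the sets A and B of Equations 42-44."""
    g = reduce(gcd, G)
    Gp = [k // g for k in G]                    # G' = {k / gcd(G)}
    t = min(psi(k) for k in Gp if k > 1)
    T = max(nu(k, t) for k in Gp)
    A = [k for k in G if k % T != 0]            # T does not divide k
    B = [k for k in G if k % T == 0]            # T divides k
    return A, B

# Step 1 and Step 2 of the N = 36 example
assert split([1, 2, 3, 4, 6, 9, 12, 18, 36]) == ([1, 2, 3, 6, 9, 18], [4, 12, 36])
assert split([4, 12, 36]) == ([4, 12], [36])
```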
In order to see the number of additions this scheme uses for numbers of the
form N = P − 1 (which is relevant to prime length FFT algorithms), Figure 4
shows the number of additions the reduction process uses when the polynomial
X(s) is real.
Figure 4: Number of Additions for Reduction Stage
The iterated convolution algorithm can be used to construct an N point linear
convolution algorithm from shorter linear convolution algorithms [2].
Suppose the linear convolution y, of the n point vectors x and h (h known),
is described by
y = E_n^T D E_n x     (59)

where E_n is an "expansion" matrix, the elements of which are ±1's and 0's, and
D is an appropriate diagonal matrix. Because the only multiplications in this
expression are by the elements of D, the number of multiplications required,
M(n), is equal to the number of rows of E_n. The number of additions is denoted
by A(n).
Given a matrix E_{n1} and a matrix E_{n2}, the iterated algorithm gives a method
for combining E_{n1} and E_{n2} to construct a valid expansion matrix, E_N, for
N ≤ n1 n2. Specifically,
E_{n1,n2} = ( I_{M(n2)} ⊗ E_{n1} ) ( E_{n2} ⊗ I_{n1} )     (60)

The product n1 n2 may be greater than N, for zeros can be (conceptually)
appended to x. The operation count associated with E_{n1,n2} is
A(n1, n2) = n1 A(n2) + A(n1) M(n2)     (61)
M(n1, n2) = M(n1) M(n2)     (62)

Although they are both valid expansion matrices, E_{n1,n2} ≠ E_{n2,n1} and, in
general, A(n1, n2) ≠ A(n2, n1). Because M(n1, n2) = M(n2, n1), it is desirable
to choose an ordering of factors to minimize the additions incurred by the
expansion matrix.
The following [1], [14] follows from the above.
Note that a valid expansion matrix, E_N, can be constructed from E_{n1,n2}
and E_{n3}, for N ≤ n1 n2 n3. In general, any number of factors can be used to
create larger expansion matrices.
The operation count associated with E_{n1,n2,n3} is
A(n1, n2, n3) = n1 n2 A(n3) + n1 A(n2) M(n3) + A(n1) M(n2) M(n3)     (63)
M(n1, n2, n3) = M(n1) M(n2) M(n3)     (64)

These equations generalize in the predicted way when more factors are considered.
Because the ordering of the factors is relevant in the equation for A(.)
but not for M(.), it is again desirable to order the factors to minimize
the number of additions. By exploiting the following property of the expressions
for A(.) and M(.), the optimal ordering can be found [1].
Preservation of Optimal Ordering.
Suppose A(n1, n2, n3) ≤ min{A(n_{k1}, n_{k2}, n_{k3}) : k1, k2, k3 ∈ {1, 2, 3} and distinct}, then
A(n1, n2) ≤ A(n2, n1)     (65)

A(n2, n3) ≤ A(n3, n2)     (66)

A(n1, n3) ≤ A(n3, n1)     (67)
The generalization of this property to more than two factors reveals that
an optimal ordering of {n1, ⋯, n_{L−1}} is preserved in an optimal ordering
of {n1, ⋯, nL}. Therefore, if (n1, ⋯, nL) is an optimal ordering
of {n1, ⋯, nL}, then (n_k, n_{k+1}) is an optimal ordering of {n_k, n_{k+1}}
and consequently
A(n_k) / ( M(n_k) − n_k )  ≤  A(n_{k+1}) / ( M(n_{k+1}) − n_{k+1} )     (68)

for all k = 1, 2, ⋯, L − 1.
This immediately suggests that an optimal ordering of {n1, ⋯, nL} is one
for which the sequence

A(n1) / ( M(n1) − n1 ) , ⋯ , A(nL) / ( M(nL) − nL )     (69)

is nondecreasing. Hence, ordering the factors, {n1, ⋯, nL},
to minimize the number of additions incurred by E_{n1,⋯,nL}
simply involves computing the appropriate ratios.
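The ordering rule can be checked by brute force. In the sketch below, the A(n) and M(n) values are purely hypothetical illustrative counts; the formula for the additions generalizes Equations 61 and 63, and sorting by A(n)/(M(n) − n) matches exhaustive search:

```python
from itertools import permutations

# hypothetical operation counts for three short convolution modules
M = {2: 3, 3: 5, 4: 7}     # multiplications M(n)
A = {2: 3, 3: 12, 4: 15}   # additions A(n)

def iterated_adds(order):
    """Additions for the iterated algorithm with the factors applied
    in the given order (generalizing Equations 61 and 63)."""
    total = 0
    for i, n in enumerate(order):
        pre = 1
        for m in order[:i]:
            pre *= m                   # product of earlier factors n_j
        post = 1
        for m in order[i + 1:]:
            post *= M[m]               # product of later counts M(n_j)
        total += pre * A[n] * post
    return total

# ordering by nondecreasing A(n)/(M(n) - n) minimizes the additions
best_by_ratio = tuple(sorted(M, key=lambda n: A[n] / (M[n] - n)))
best_by_search = min(permutations(M), key=iterated_adds)
assert iterated_adds(best_by_ratio) == iterated_adds(best_by_search)
```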
We have designed prime length FFTs up to length 53 that are as good
as the previous designs that only went up to 19. Table 2 gives the
operation counts for the new and previously designed modules,
assuming complex inputs.
It is interesting to note that the operation counts depend on the
factorability of P − 1. The primes 11, 23, and 47 are all of the
form 1 + 2P, making the design of efficient FFTs for these lengths
more difficult.
Further deviations from the original Winograd approach than we have
made could prove useful for longer lengths. We investigated, for example,
the use of twiddle factors at appropriate points in the decomposition
stage; these can sometimes be used to divide the cyclic convolution
into smaller convolutions. Their use means, however, that the 'center'
multiplications would no longer be by purely real or imaginary numbers.
Table 2: Operation counts for prime length DFTs
N     Mult   Adds
7       16     72
11      40    168
13      40    188
17      82    274
19      88    360
23     174    672
29     190    766
31     160    984
37     220    920
41     282   1140
43     304   1416
47     640   2088
53     556   2038
The approach in writing a program that writes another program is a valuable
one for several reasons. Programming the design process for the design of
prime length FFTs has the advantages of being practical, error-free, and
flexible. The flexibility is important because it allows for modification
and experimentation with different algorithmic ideas. Above all, it has
allowed longer DFTs to be reliably designed.
More details on the generation of programs for prime length FFTs can be found in the 1993 Technical Report.
[1] Agarwal, R. C. and Cooley, J. W. (1977, October). New Algorithms for Digital Convolution. IEEE Trans. on ASSP, 25(2), 392–410.
[2] Blahut, Richard E. (1985). Fast Algorithms for Digital Signal Processing. Reading, Mass.: Addison-Wesley.
[3] Burrus, C. S. and Parks, T. W. (1985). DFT/FFT and Convolution Algorithms. New York: John Wiley & Sons.
[4] Duhamel, P. and Hollmann, H. (1984, January 5). Split Radix FFT Algorithm. Electronic Letters, 20(1), 14–16.
[5] Duhamel, P. (1986, April). Implementation of `Split-Radix' FFT Algorithms for Complex, Real, and Real-Symmetric Data. [A shorter version appeared in the ICASSP-85 Proceedings, p. 20.6, March 1985]. IEEE Trans. on ASSP, 34, 285–295.
[6] Heideman, M. T. and Burrus, C. S. (1986, February). On the Number of Multiplications Necessary to Compute a Length-2^n DFT. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1), 91–95.
[7] Heideman, M. T., Johnson, H. W., and Burrus, C. S. (1984, March). Prime Factor FFT Algorithms for Real-Valued Series. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (pp. 28A.7.1–4). San Diego, CA.
[8] Johnson, H. W. and Burrus, C. S. (1981). Large DFT Modules: N = 11, 13, 17, 19, and 25. Technical report 8105. Department of Electrical Engineering, Rice University, Houston, TX 77251–1892.
[9] Johnson, Howard W. and Burrus, C. S. (1985, February). On the Structure of Efficient DFT Algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(1), 248–254.
[10] Kolba, D. P. and Parks, T. W. (1977, August). A Prime Factor FFT Algorithm Using High Speed Convolution. IEEE Trans. on ASSP, 25, 281–294.
[11] Lim, J. S. and Oppenheim, A. V. (1988). Advanced Topics in Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
[12] McClellan, J. H. and Rader, C. M. (1979). Number Theory in Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
[13] Myers, Douglas G. (1990). Digital Signal Processing, Efficient Convolution and Fourier Transform Techniques. Sydney, Australia: Prentice-Hall.
[14] Nussbaumer, H. J. (1982). Fast Fourier Transform and Convolution Algorithms. (Second edition). Heidelberg, Germany: Springer-Verlag.
[15] Selesnick, I. W. and Burrus, C. S. (1992, May). Automating the Design of Prime Length FFT Programs. In Proceedings of the IEEE International Symposium on Circuits and Systems. (Vol. 1, pp. 133–136). ISCAS-92, San Diego, CA.
[16] Selesnick, I. W. and Burrus, C. S. (1993, April). Multidimensional Mapping Techniques for Convolution. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (Vol. III, pp. 288–291). ICASSP-93, Minneapolis.
[17] Selesnick, I. W. and Burrus, C. S. (1994, June). Extending Winograd's Small Convolution Algorithm to Longer Lengths. In Proceedings of the IEEE International Symposium on Circuits and Systems. (Vol. 2, pp. 449–452). ISCAS-94, London.
[18] Selesnick, Ivan W. and Burrus, C. Sidney. (1996, January). Automatic Generation of Prime Length FFT Programs. IEEE Transactions on Signal Processing, 44(1), 14–24.
[19] Sorensen, H. V., Jones, D. L., Heideman, M. T., and Burrus, C. S. (1987, June). Real Valued Fast Fourier Transform Algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6), 849–863.
[20] Winograd, S. (1976, April). On Computing the Discrete Fourier Transform. Proc. National Academy of Sciences, USA, 73, 1005–1006.
[21] Winograd, S. (1978, January). On Computing the Discrete Fourier Transform. Mathematics of Computation, 32, 175–199.
[22] Winograd, S. (1979, May). On the Multiplicative Complexity of the Discrete Fourier Transform. Advances in Mathematics, 32(2), 83–117.
[23] Winograd, S. (1980). Arithmetic Complexity of Computation. SIAM CBMS-NSF Series, No. 33. Philadelphia: SIAM.
[24] Winograd, S. (1980, April). Signal Processing and Complexity of Computation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. (pp. 1381–1384). ICASSP-80, Denver.
"The Fast Fourier Transform (FFT) is a landmark algorithm used in fields ranging from signal processing to highperformance computing. First popularized by two American scientists in 1965, the […]"