The Chinese Remainder Theorem for polynomials can be used to
decompose a convolution of two sequences
(the polynomial product of two polynomials evaluated
modulo a third polynomial)
into smaller convolutions (smaller polynomial products) [6].
The Winograd n-point circular convolution algorithm
requires that polynomials are reduced modulo the
cyclotomic polynomial factors of s^n - 1, that is,
modulo Φ_d(s) for each d dividing n.
When n has several prime divisors the reduction operations
become quite complicated and writing a program to implement them
is difficult.
However, when n is a prime power, the reduction
operations are very structured and can be done in a straightforward manner.
Therefore, by converting a one-dimensional convolution
to a multidimensional one, in which the length is a prime power
along each dimension,
the split nesting algorithm avoids the need for
complicated reduction operations.
This is one advantage the split nesting algorithm has over
the Winograd algorithm.
By applying the reduction operations appropriately to the circular
shift matrix, we are able to obtain a block diagonal form,
just as in the Winograd convolution algorithm.
However, in the split nesting algorithm, each diagonal block
represents a multidimensional cyclotomic convolution
rather than a one-dimensional one.
By forming multidimensional convolutions out of one-dimensional ones,
it is possible to combine algorithms for smaller convolutions
(see the next section).
This is a second advantage the split nesting algorithm has over
the Winograd algorithm.
The split nesting algorithm, however, generally uses more than
the minimum number of multiplications.
Below we give an explicit matrix description of the required
reduction operations, give a program that implements them,
and give a formula for the number of additions required.
(No multiplications are needed.)
First, consider n = p, a prime.
Let

    X(s) = x_0 + x_1 s + ⋯ + x_{p-1} s^{p-1}    (19)

and recall s^p - 1 = (s - 1)(s^{p-1} + s^{p-2} + ⋯ + s + 1) = Φ_1(s) Φ_p(s).
The residue ⟨X(s)⟩_{Φ_1(s)}
is found by summing the coefficients of X(s).
The residue ⟨X(s)⟩_{Φ_p(s)}
is given by ∑_{k=0}^{p-2} (x_k - x_{p-1}) s^k.
Define R_p to be the matrix that
reduces X(s) modulo Φ_1(s) and Φ_p(s),
such that if X_0(s) = ⟨X(s)⟩_{Φ_1(s)} and X_1(s) = ⟨X(s)⟩_{Φ_p(s)}, then

    [X_0; X_1] = R_p X    (20)

where X, X_0 and X_1 are vectors formed
from the coefficients of X(s), X_0(s) and X_1(s).
That is,

    R_p = [ 1  1  ⋯  1  1 ]
          [ 1          -1 ]
          [    1       -1 ]
          [      ⋱      ⋮ ]
          [         1  -1 ]    (21)

so that R_p = [ 1 ⋯ 1 ; G_p ],
where G_p is the Φ_p(s) reduction matrix
of size (p-1) × p.
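As a concrete check of Equation 21, the following Python/numpy fragment (the helper name Rp is ours, for illustration only) builds R_p and verifies the two residue formulas above:

```python
import numpy as np

def Rp(p):
    """Reduction matrix of (21): the first row sums the coefficients
    (residue mod s - 1); the remaining rows form G_p = [I_{p-1} | -1],
    giving the residue mod Phi_p(s) = s^{p-1} + ... + s + 1."""
    top = np.ones((1, p), dtype=int)
    Gp = np.hstack([np.eye(p - 1, dtype=int), -np.ones((p - 1, 1), dtype=int)])
    return np.vstack([top, Gp])

x = np.array([3, 1, 4, 1, 5])          # coefficients of X(s), p = 5
y = Rp(5) @ x
# first entry: <X(s)> mod (s - 1) = sum of the coefficients
assert y[0] == x.sum()
# remaining entries: x_k - x_{p-1}, k = 0, ..., p-2
assert np.array_equal(y[1:], x[:-1] - x[-1])
```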
Similarly,
let X(s) = x_0 + x_1 s + ⋯ + x_{p^e-1} s^{p^e-1} and
define R_{p^e}
to be the matrix that reduces X(s) modulo
Φ_1(s), Φ_p(s), ..., Φ_{p^e}(s),
such that
    [X_0; X_1; ⋯; X_e] = R_{p^e} X    (22)

where, as above, X, X_0, ..., X_e are the vectors of coefficients
of X(s), ⟨X(s)⟩_{Φ_1(s)}, ..., ⟨X(s)⟩_{Φ_{p^e}(s)}.
It turns out that R_{p^e} can be written in terms of R_p.
Consider the reduction of X(s) = x_0 + ⋯ + x_8 s^8
by Φ_1(s) = s - 1, Φ_3(s) = s^2 + s + 1, and Φ_9(s) = s^6 + s^3 + 1.
This is most efficiently performed by reducing X(s) in two steps.
That is, calculate X'(s) = ⟨X(s)⟩_{s^3-1} and X_2(s) = ⟨X(s)⟩_{s^6+s^3+1}.
Then calculate X_0(s) = ⟨X'(s)⟩_{s-1} and X_1(s) = ⟨X'(s)⟩_{s^2+s+1}.
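These two reduction steps are easy to check numerically; the sketch below (the helper polyrem is ours, built on numpy's polynomial division) confirms that reducing modulo s^3 - 1 first and then modulo s - 1 or s^2 + s + 1 agrees with reducing X(s) directly:

```python
import numpy as np

def polyrem(x, m):
    """Remainder of polynomial x modulo m; coefficients are stored in
    ascending order (x[0] is the constant term), so we reverse for
    np.polydiv, which expects the highest-degree coefficient first."""
    r = np.polydiv(x[::-1], m[::-1])[1][::-1]
    r = np.round(r).astype(int)
    return np.pad(r, (0, len(m) - 1 - len(r)))   # pad to full residue length

x = np.arange(9)                 # X(s) = 0 + 1s + ... + 8s^8
s3m1 = np.array([-1, 0, 0, 1])   # s^3 - 1
phi1 = np.array([-1, 1])         # Phi_1(s) = s - 1
phi3 = np.array([1, 1, 1])       # Phi_3(s) = s^2 + s + 1
xp = polyrem(x, s3m1)            # X'(s) = <X(s)> mod (s^3 - 1)
# the two-step reduction agrees with the direct reduction
assert np.array_equal(polyrem(xp, phi1), polyrem(x, phi1))
assert np.array_equal(polyrem(xp, phi3), polyrem(x, phi3))
```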
In matrix notation this becomes

    [X'; X_2] = [ I_3 I_3 I_3 ; I_3 0 -I_3 ; 0 I_3 -I_3 ] X

and

    [X_0; X_1] = [ 1 1 1 ; 1 0 -1 ; 0 1 -1 ] X'.    (23)

Together these become
    [X_0; X_1; X_2] = [ R_3 0 ; 0 I_6 ] [ I_3 I_3 I_3 ; I_3 0 -I_3 ; 0 I_3 -I_3 ] X    (24)

or

    [X_0; X_1; X_2] = (R_3 ⊕ I_6)(R_3 ⊗ I_3) X    (25)

so that R_9 = (R_3 ⊕ I_6)(R_3 ⊗ I_3),
where ⊕ denotes the matrix direct sum.
Similarly, one finds that
R_27 = (R_3 ⊕ I_24)((R_3 ⊗ I_3) ⊕ I_18)(R_3 ⊗ I_9).
In general, one has the following:
R_{p^e} is a p^e × p^e matrix given by

    R_{p^e} = ∏_{k=0}^{e-1} ((R_p ⊗ I_{p^k}) ⊕ I_{p^e - p^{k+1}})

and can be implemented with 2(p^e - 1) additions.
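The product formula can be exercised directly; this numpy sketch (function names ours) assembles R_{p^e} from copies of R_p and checks the p = 3, e = 2 case against the residues of X(s) = 0 + 1s + ⋯ + 8s^8:

```python
import numpy as np

def Rp(p):
    """R_p of (21): a row of ones over G_p = [I_{p-1} | -1]."""
    return np.vstack([np.ones((1, p), dtype=int),
                      np.hstack([np.eye(p - 1, dtype=int),
                                 -np.ones((p - 1, 1), dtype=int)])])

def Rpe(p, e):
    """R_{p^e} = prod_{k=0}^{e-1} ((R_p kron I_{p^k}) directsum I_{p^e - p^{k+1}}),
    with the k = 0 factor leftmost (applied last)."""
    n = p ** e
    R = np.eye(n, dtype=int)
    for k in range(e):
        m = p ** (k + 1)
        term = np.eye(n, dtype=int)
        term[:m, :m] = np.kron(Rp(p), np.eye(p ** k, dtype=int))
        R = R @ term
    return R

y = Rpe(3, 2) @ np.arange(9)
# y = [<X> mod (s-1), <X> mod Phi_3(s), <X> mod Phi_9(s)]
assert list(y) == [36, -6, -3, -6, -6, -6, -3, -3, -3]
```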
The following formula gives the decomposition of a circular
convolution into disjoint noncircular convolutions when
the number of points is a prime power.
    R_{p^e} S_{p^e} R_{p^e}^{-1} = diag(1, C_{Φ_p}, ..., C_{Φ_{p^e}}) = ⊕_{i=0}^{e} C_{Φ_{p^i}}    (26)
    R_9 S_9 R_9^{-1} = diag(1, C_{Φ_3}, C_{Φ_9})    (27)

It turns out that when n is not a prime power,
the reduction of polynomials modulo the cyclotomic polynomial
Φ_n(s) becomes complicated,
and with an increasing number of prime factors, the complication increases.
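Equation 27 can be verified numerically. In the sketch below (helper names ours), S_9 is taken to be the companion matrix of s^9 - 1, i.e. the circular shift matrix, and C_Φ the companion matrix of Φ:

```python
import numpy as np

def companion(c):
    """Companion matrix of the monic polynomial
    s^d + c[d-1] s^{d-1} + ... + c[0]; c lists the low-order coefficients."""
    d = len(c)
    C = np.zeros((d, d), dtype=int)
    C[1:, :-1] = np.eye(d - 1, dtype=int)
    C[:, -1] = -np.asarray(c)
    return C

# R_9 = (R_3 (+) I_6)(R_3 (x) I_3), as in the text
R3 = np.array([[1, 1, 1], [1, 0, -1], [0, 1, -1]])
R9 = (np.block([[R3, np.zeros((3, 6), dtype=int)],
                [np.zeros((6, 3), dtype=int), np.eye(6, dtype=int)]])
      @ np.kron(R3, np.eye(3, dtype=int)))

S9 = companion([-1] + [0] * 8)   # circular shift: companion matrix of s^9 - 1
A = R9 @ S9 @ np.linalg.inv(R9)

expected = np.zeros((9, 9))
expected[0, 0] = 1
expected[1:3, 1:3] = companion([1, 1])              # C_{Phi_3}, s^2 + s + 1
expected[3:, 3:] = companion([1, 0, 0, 1, 0, 0])    # C_{Phi_9}, s^6 + s^3 + 1
assert np.allclose(A, expected)                     # block diagonal, as in (27)
```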
Recall, however, that a circular convolution of length
p_1^{e_1} ⋯ p_k^{e_k} can be converted (by an appropriate
permutation) into a k-dimensional circular convolution
of length p_i^{e_i} along dimension i.
By employing this one-dimensional to multidimensional
mapping technique, one can avoid having to perform
polynomial reductions modulo Φ_n(s)
for non-prime-power n.
The prime factor permutation discussed previously
is the permutation that converts a one-dimensional circular
convolution into a multidimensional one.
Again, we can use the Kronecker product to
represent this.
In this case,
the reduction operations are applied to each matrix in the
following way:
    T (S_{p_1^{e_1}} ⊗ ⋯ ⊗ S_{p_k^{e_k}}) T^{-1} = (⊕_{i=0}^{e_1} C_{Φ_{p_1^i}}) ⊗ ⋯ ⊗ (⊕_{i=0}^{e_k} C_{Φ_{p_k^i}})    (28)

where
    T = R_{p_1^{e_1}} ⊗ ⋯ ⊗ R_{p_k^{e_k}}    (29)
    T (S_9 ⊗ S_5) T^{-1} = diag(1, C_{Φ_3}, C_{Φ_9}) ⊗ diag(1, C_{Φ_5})    (30)

where T = R_9 ⊗ R_5.
The matrix R_{p_1^{e_1}} ⊗ ⋯ ⊗ R_{p_k^{e_k}}
can be factored using a property of the Kronecker product.
Notice that A ⊗ B = (A ⊗ I)(I ⊗ B),
and A ⊗ B ⊗ C = (A ⊗ I)(I ⊗ B ⊗ I)(I ⊗ C)
(with appropriate dimensions), so that one gets
    ⊗_{i=1}^{k} R_{p_i^{e_i}} = ∏_{i=1}^{k} (I_{m_i} ⊗ R_{p_i^{e_i}} ⊗ I_{n_i})    (31)

where m_i = ∏_{j=1}^{i-1} p_j^{e_j},
n_i = ∏_{j=i+1}^{k} p_j^{e_j},
and where the empty product is taken to be 1.
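This Kronecker identity is easy to confirm with random matrices; a small numpy check:

```python
import numpy as np

# (A kron B kron C) = (A kron I)(I kron B kron I)(I kron C), as in (31)
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((5, 5))
lhs = np.kron(np.kron(A, B), C)
rhs = (np.kron(A, np.eye(15))                         # m_1 = 1,  n_1 = 15
       @ np.kron(np.kron(np.eye(2), B), np.eye(5))    # m_2 = 2,  n_2 = 5
       @ np.kron(np.eye(6), C))                       # m_3 = 6,  n_3 = 1
assert np.allclose(lhs, rhs)
```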
This factorization shows that T can be implemented basically
by implementing copies of R_{p^e}.
There are many variations on this factorization,
as explained in [4].
That the various factorizations can be interpreted as vector
or parallel implementations is also explained in
[4].
    R_9 ⊗ R_5 = (R_9 ⊗ I_5)(I_9 ⊗ R_5)    (32)
and

    R_9 ⊗ R_25 ⊗ R_7 = (R_9 ⊗ I_175)(I_9 ⊗ R_25 ⊗ I_7)(I_225 ⊗ R_7)    (33)

When this factored form of ⊗_i R_{n_i},
or any of the variations alluded to above,
is used, the number of additions incurred is given by
    ∑_{i=1}^{k} (N/p_i^{e_i}) A(R_{p_i^{e_i}}) = ∑_{i=1}^{k} (N/p_i^{e_i}) 2(p_i^{e_i} - 1) = 2N ∑_{i=1}^{k} (1 - 1/p_i^{e_i}) = 2N (k - ∑_{i=1}^{k} 1/p_i^{e_i})    (34)

where N = p_1^{e_1} ⋯ p_k^{e_k} and A(·) denotes the number of
additions a matrix requires.
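For instance, for N = 45 = 9 · 5, Equation 34 gives 5 · 2(9 - 1) + 9 · 2(5 - 1) = 80 + 72 = 152 additions; a two-line computation confirms that the two forms of Equation 34 agree:

```python
# additions for T = R_9 kron R_5 (N = 45), evaluating (34) both ways
N, pe = 45, [9, 5]
adds = sum((N // q) * 2 * (q - 1) for q in pe)          # sum of per-factor costs
closed = 2 * N * (len(pe) - sum(1.0 / q for q in pe))   # 2N(k - sum 1/p_i^{e_i})
assert adds == 152 and abs(closed - adds) < 1e-9
```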
Although the use of operations of the form
R_{p_1^{e_1}} ⊗ ⋯ ⊗ R_{p_k^{e_k}}
is simple, it does not exactly separate the circular convolution
into smaller disjoint convolutions.
In other words, its use does not give rise in
Equation 28 and Equation 30 to
block diagonal matrices whose diagonal
blocks are of the form ⊗_i C_{Φ_{p_i}}.
However, by reorganizing the arrangement of the
operations we can obtain the block diagonal
form we seek.
First, suppose A, B and C are matrices of sizes
a × a, b × b and c × c,
respectively.
If

    T B T^{-1} = B_1 ⊕ B_2    (35)

where B_1 and B_2 are
matrices of sizes b_1 × b_1 and b_2 × b_2,
then
    Q (A ⊗ B ⊗ C) Q^{-1} = (A ⊗ B_1 ⊗ C) ⊕ (A ⊗ B_2 ⊗ C)    (36)

where
    Q = [ I_a ⊗ T(1:b_1, :) ⊗ I_c ; I_a ⊗ T(b_1+1:b, :) ⊗ I_c ]    (37)

Here T(1:b_1, :) denotes the first b_1 rows
and all the columns of T, and similarly for
T(b_1+1:b, :).
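Equations 35-37 can be checked numerically. The sketch below takes T = R_3 and B = S_3, so that T B T^{-1} = 1 ⊕ C_{Φ_3} (b_1 = 1), and verifies Equation 36 for arbitrary A and C (helper names ours):

```python
import numpy as np

rng = np.random.default_rng(1)
T = np.array([[1, 1, 1], [1, 0, -1], [0, 1, -1]])   # T = R_3
B = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]])     # B = S_3
b1 = 1                                              # size of the first block B_1
A = rng.standard_normal((2, 2))                     # arbitrary A and C
C = rng.standard_normal((2, 2))
a, c = A.shape[0], C.shape[0]

# Q of (37): stack the first b1 rows of T, then the rest
Q = np.vstack([np.kron(np.kron(np.eye(a), T[:b1, :]), np.eye(c)),
               np.kron(np.kron(np.eye(a), T[b1:, :]), np.eye(c))])
M = Q @ np.kron(np.kron(A, B), C) @ np.linalg.inv(Q)

TBTi = T @ B @ np.linalg.inv(T)                     # = B_1 directsum B_2
B1, B2 = TBTi[:b1, :b1], TBTi[b1:, b1:]
# (36): block diagonal with blocks A kron B1 kron C and A kron B2 kron C
assert np.allclose(M[:4, :4], np.kron(np.kron(A, B1), C))
assert np.allclose(M[4:, 4:], np.kron(np.kron(A, B2), C))
assert np.allclose(M[:4, 4:], 0) and np.allclose(M[4:, :4], 0)
```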
Note that

    (A ⊗ B_1 ⊗ C) ⊕ (A ⊗ B_2 ⊗ C) ≠ A ⊗ (B_1 ⊕ B_2) ⊗ C.    (38)

That these two expressions are not equal explains why
the arrangement of operations must be reorganized
in order to obtain the desired block diagonal form.
The appropriate reorganization is described by the
expression in Equation 37.
Therefore, we must modify the transformation of
Equation 28 appropriately.
It should be noted that this reorganization of
operations does not change their computational cost.
It is still given by Equation 34.
For example, we can use this observation
and the expression in Equation 37 to arrive at the following
similarity transformation:
    Q (S_{p_1} ⊗ S_{p_2}) Q^{-1} = diag(1, C_{Φ_{p_1}}, C_{Φ_{p_2}}, C_{Φ_{p_1}} ⊗ C_{Φ_{p_2}})    (39)

where
    Q = [ I_{p_1} ⊗ 1_{p_2}^t ; I_{p_1} ⊗ G_{p_2} ] (R_{p_1} ⊗ I_{p_2})    (40)

Here 1_p is a column vector of p 1's,
    1_p = [ 1 1 ⋯ 1 ]^t    (41)

and G_p is the (p-1) × p matrix:
    G_p = [ 1          -1 ]
          [    1       -1 ]
          [      ⋱      ⋮ ]
          [         1  -1 ]  =  [ I_{p-1} | -1_{p-1} ].    (42)

In general we have
    R (S_{p_1^{e_1}} ⊗ ⋯ ⊗ S_{p_k^{e_k}}) R^{-1} = ⊕_{d|n} Ψ(d)    (43)

where R = R_{p_1^{e_1},⋯,p_k^{e_k}} is given by
    R_{p_1^{e_1},⋯,p_k^{e_k}} = ∏_{i=k}^{1} Q(m_i, p_i^{e_i}, n_i)    (44)

with m_i = ∏_{j=1}^{i-1} p_j^{e_j},
n_i = ∏_{j=i+1}^{k} p_j^{e_j},
and
    Q(a, p^e, c) = ∏_{j=0}^{e-1} ( [ I_a ⊗ 1_p^t ⊗ I_{c p^j} ; I_a ⊗ G_p ⊗ I_{c p^j} ] ⊕ I_{ac(p^e - p^{j+1})} )    (45)

1_p and G_p are as given in
Equation 41 and Equation 42.
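Before turning to the prime-power example, Equations 39 and 40 can be verified for p_1 = 3, p_2 = 5 with a short numpy sketch (helper names ours):

```python
import numpy as np

def companion(c):
    """Companion matrix of s^d + c[d-1] s^{d-1} + ... + c[0]."""
    d = len(c)
    C = np.zeros((d, d), dtype=int)
    C[1:, :-1] = np.eye(d - 1, dtype=int)
    C[:, -1] = -np.asarray(c)
    return C

p1, p2 = 3, 5
R3 = np.array([[1, 1, 1], [1, 0, -1], [0, 1, -1]])           # R_{p1}
G5 = np.hstack([np.eye(p2 - 1, dtype=int),
                -np.ones((p2 - 1, 1), dtype=int)])           # G_{p2}
# Q of (40): [I kron 1^t ; I kron G] times (R_{p1} kron I_{p2})
Q = np.vstack([np.kron(np.eye(p1, dtype=int), np.ones((1, p2), dtype=int)),
               np.kron(np.eye(p1, dtype=int), G5)]) \
    @ np.kron(R3, np.eye(p2, dtype=int))

S = lambda n: companion([-1] + [0] * (n - 1))                # shift matrix S_n
M = Q @ np.kron(S(p1), S(p2)) @ np.linalg.inv(Q)

C3 = companion([1, 1])              # C_{Phi_3}
C5 = companion([1, 1, 1, 1])        # C_{Phi_5}
expected = np.zeros((15, 15))
expected[0, 0] = 1                  # blocks ordered as in (39)
expected[1:3, 1:3] = C3
expected[3:7, 3:7] = C5
expected[7:, 7:] = np.kron(C3, C5)
assert np.allclose(M, expected)
```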
    R (S_9 ⊗ S_5) R^{-1} = diag(1, C_{Φ_3}, C_{Φ_9}, C_{Φ_5}, C_{Φ_3} ⊗ C_{Φ_5}, C_{Φ_9} ⊗ C_{Φ_5})    (46)
where

    R = R_{9,5} = Q(9, 5, 1) Q(1, 9, 5)    (47)

and R can be implemented
with 152 additions.
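As a numerical check of Equations 45-47, the following numpy sketch implements Q(a, p^e, c) (the signature Q(a, p, e, c), with p and e passed separately, and the helper names are ours) and confirms that R = Q(9,5,1) Q(1,9,5) block-diagonalizes S_9 ⊗ S_5 as in Equation 46:

```python
import numpy as np

def companion(c):
    d = len(c)
    C = np.zeros((d, d), dtype=int)
    C[1:, :-1] = np.eye(d - 1, dtype=int)
    C[:, -1] = -np.asarray(c)
    return C

def stack_red(a, p, c):
    """[I_a kron 1_p^t kron I_c ; I_a kron G_p kron I_c], one factor of (45)."""
    one_t = np.ones((1, p), dtype=int)
    G = np.hstack([np.eye(p - 1, dtype=int), -np.ones((p - 1, 1), dtype=int)])
    Ia, Ic = np.eye(a, dtype=int), np.eye(c, dtype=int)
    return np.vstack([np.kron(np.kron(Ia, one_t), Ic),
                      np.kron(np.kron(Ia, G), Ic)])

def Q(a, p, e, c):
    """Q(a, p^e, c) of (45); the j = 0 factor is leftmost (applied last)."""
    n = a * c * p ** e
    M = np.eye(n, dtype=int)
    for j in range(e):
        m = a * c * p ** (j + 1)
        T = np.eye(n, dtype=int)
        T[:m, :m] = stack_red(a, p, c * p ** j)
        M = M @ T
    return M

R = Q(9, 5, 1, 1) @ Q(1, 3, 2, 5)    # R_{9,5} = Q(9,5,1) Q(1,9,5) of (47)
S = lambda n: companion([-1] + [0] * (n - 1))
M = R @ np.kron(S(9), S(5)) @ np.linalg.inv(R)

C3, C9 = companion([1, 1]), companion([1, 0, 0, 1, 0, 0])
C5 = companion([1, 1, 1, 1])
expected = np.zeros((45, 45))
start = 0                            # blocks in the order of (46)
for blk in [np.eye(1), C3, C9, C5, np.kron(C3, C5), np.kron(C9, C5)]:
    d = blk.shape[0]
    expected[start:start + d, start:start + d] = blk
    start += d
assert np.allclose(M, expected)
```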
Notice the distinction between this example and
example "Reduction Operations".
In example "Reduction Operations" we obtained from
S_9 ⊗ S_5 a Kronecker product of two
block diagonal matrices, but here we obtained
a block diagonal matrix whose diagonal blocks
are Kronecker products of cyclotomic companion
matrices.
Each block in Equation 46 represents a multidimensional
cyclotomic convolution.
A Matlab program that carries out the operation
R_{p_1^{e_1},⋯,p_k^{e_k}}
in Equation 43 is KRED().
function x = KRED(P,E,K,x)
% x = KRED(P,E,K,x);
% P : P = [P(1),...,P(K)];
% E : E = [E(1),...,E(K)];
for i = 1:K
   a = prod(P(1:i-1).^E(1:i-1));
   c = prod(P(i+1:K).^E(i+1:K));
   p = P(i);
   e = E(i);
   for j = e-1:-1:0
      x(1:a*c*(p^(j+1))) = RED(p,a,c*(p^j),x(1:a*c*(p^(j+1))));
   end
end
It calls the Matlab program RED().
function y = RED(p,a,c,x)
% y = RED(p,a,c,x);
y = zeros(a*c*p,1);
for i = 0:c:(a-1)*c
   for j = 0:c-1
      y(i+j+1) = x(i*p+j+1);
      for k = 0:c:c*(p-2)
         y(i+j+1) = y(i+j+1) + x(i*p+j+k+c+1);
         y(i*(p-1)+j+k+a*c+1) = x(i*p+j+k+1) - x(i*p+j+c*(p-1)+1);
      end
   end
end
These two Matlab programs are not written to run as fast as
they could be.
They are a `naive' coding of R_{p_1^{e_1},⋯,p_k^{e_k}}
and are meant to serve as a basis for more efficient programs.
In particular, the indexing and the loop counters can
be modified to improve the efficiency.
However, the modifications that minimize the overhead
incurred by indexing operations depend on the
programming language, the compiler and the computer used.
These two programs are written with simple
loop counters and complicated indexing operations
so that appropriate modifications can be easily made.
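For reference, RED() applies the stacked reduction [ I_a ⊗ 1_p^t ⊗ I_c ; I_a ⊗ G_p ⊗ I_c ]; the same operation can be written compactly in a vectorized language. A Python/numpy sketch (names ours), checked against the explicit Kronecker form:

```python
import numpy as np

def red(p, a, c, x):
    """Apply [I_a kron 1_p^t kron I_c ; I_a kron G_p kron I_c]
    to a vector x of length a*c*p, as RED() does."""
    x = np.asarray(x).reshape(a, p, c)
    top = x.sum(axis=1, keepdims=True)        # 1_p^t part: sum over the p axis
    bot = x[:, :p - 1, :] - x[:, p - 1:, :]   # G_p part: x_k - x_{p-1}
    return np.concatenate([top.ravel(), bot.ravel()])

# cross-check against the explicit matrix for a = 2, p = 3, c = 2
G3 = np.array([[1, 0, -1], [0, 1, -1]])
stack = np.vstack([np.kron(np.kron(np.eye(2), np.ones((1, 3))), np.eye(2)),
                   np.kron(np.kron(np.eye(2), G3), np.eye(2))])
x = np.arange(12)
assert np.allclose(red(3, 2, 2, x), stack @ x)
```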
The inverse of R_p has the form

    R_p^{-1} = (1/p) [ 1  p-1  -1   ⋯  -1  ]
                     [ 1  -1   p-1  ⋯  -1  ]
                     [ ⋮            ⋱   ⋮  ]
                     [ 1  -1   -1   ⋯  p-1 ]
                     [ 1  -1   -1   ⋯  -1  ]    (48)

and
    R_{p^e}^{-1} = ∏_{k=0}^{e-1} ((R_p^{-1} ⊗ I_{p^{e-1-k}}) ⊕ I_{p^e - p^{e-k}}).    (49)
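Both Equation 48 and Equation 49 are easy to confirm numerically (helper names ours):

```python
import numpy as np

def Rp(p):
    """R_p of (21)."""
    return np.vstack([np.ones((1, p), dtype=int),
                      np.hstack([np.eye(p - 1, dtype=int),
                                 -np.ones((p - 1, 1), dtype=int)])])

def Rp_inv(p):
    """R_p^{-1} of (48): ones in the first column, p-1 on the shifted
    diagonal, -1 elsewhere, the whole matrix scaled by 1/p."""
    M = -np.ones((p, p))
    M[:, 0] = 1
    M[:p - 1, 1:] += p * np.eye(p - 1)
    return M / p

for p in (3, 5, 7):
    assert np.allclose(Rp(p) @ Rp_inv(p), np.eye(p))

# (49) for p = 3, e = 2: R_9^{-1} = (R_3^{-1} kron I_3)(R_3^{-1} directsum I_6)
R9 = np.block([[Rp(3), np.zeros((3, 6), dtype=int)],
               [np.zeros((6, 3), dtype=int), np.eye(6, dtype=int)]]) \
     @ np.kron(Rp(3), np.eye(3, dtype=int))
R9i = np.kron(Rp_inv(3), np.eye(3)) @ \
      np.block([[Rp_inv(3), np.zeros((3, 6))], [np.zeros((6, 3)), np.eye(6)]])
assert np.allclose(R9 @ R9i, np.eye(9))
```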
Because the inverse of Q in Equation 37 is given by

    Q^{-1} = [ I_a ⊗ T^{-1}(:, 1:b_1) ⊗ I_c    I_a ⊗ T^{-1}(:, b_1+1:b) ⊗ I_c ]    (50)

the inverse of the matrix R described by
Equation 43, Equation 44 and Equation 45
is given by
    R^{-1} = ∏_{i=1}^{k} Q(m_i, p_i^{e_i}, n_i)^{-1}    (51)

with m_i = ∏_{j=1}^{i-1} p_j^{e_j},
n_i = ∏_{j=i+1}^{k} p_j^{e_j},
and
    Q(a, p^e, c)^{-1} = ∏_{j=e-1}^{0} ( [ I_a ⊗ (1/p) 1_p ⊗ I_{c p^j}    I_a ⊗ V_p ⊗ I_{c p^j} ] ⊕ I_{ac(p^e - p^{j+1})} )    (52)

where V_p denotes the matrix in Equation 48
without the first column.
A Matlab program for R^t is tKRED();
it calls the Matlab program tRED().
A Matlab program for R^{-t} is itKRED();
it calls the Matlab program itRED().
These programs all appear in one of the appendices.