In the algorithm described above
we encountered expressions
of the form A1⊗A2⊗⋯⊗AnA1⊗A2⊗⋯⊗An
which we denote by
⊗i=1nAi.⊗i=1nAi.
To calculate the product ⊗iAix⊗iAix
it is computationally advantageous to factor
⊗iAi⊗iAi into terms of the form
I⊗Ai⊗II⊗Ai⊗I[1].
Then each term represents a set of copies of AiAi.
First, recall the following property of Kronecker products
A
B
⊗
C
D
=
(
A
⊗
C
)
(
B
⊗
D
)
.
A
B
⊗
C
D
=
(
A
⊗
C
)
(
B
⊗
D
)
.
(1)This property can be used to factor ⊗iAi⊗iAi
in the following way.
Let the number of rows and columns of AiAi be denoted
by riri and cici respectively.
Then
A
1
⊗
A
2
=
A
1
I
c
1
⊗
I
r
2
A
2
=
(
A
1
⊗
I
r
2
)
(
I
c
1
⊗
A
2
)
.
A
1
⊗
A
2
=
A
1
I
c
1
⊗
I
r
2
A
2
=
(
A
1
⊗
I
r
2
)
(
I
c
1
⊗
A
2
)
.
(2)But we can also write
A
1
⊗
A
2
=
I
r
1
A
1
⊗
A
2
I
c
2
=
(
I
r
1
⊗
A
2
)
(
A
1
⊗
I
c
2
)
.
A
1
⊗
A
2
=
I
r
1
A
1
⊗
A
2
I
c
2
=
(
I
r
1
⊗
A
2
)
(
A
1
⊗
I
c
2
)
.
(3)Note that in factorization Equation 2, copies of
A2A2 are applied to the data vector xx first,
followed by copies of A1A1.
On the other hand, in factorization Equation 3, copies of
A1A1 are applied to the data vector xx first,
followed by copies of A2A2.
These two factorizations can be distinguished by the
sequence in which A1A1 and A2A2 are ordered.
Lets compare the computational complexity of
factorizations
Equation 2 and Equation 3.
Notice that Equation 2 consists of r2r2 copies of
A1A1 and c1c1 copies of A2A2,
therefore
Equation 2 has a computational cost of
r2Q1+c1Q2r2Q1+c1Q2 where
QiQi is the computational cost of AiAi.
On the other hand,
the computational cost of Equation 3 is
c2Q1+r1Q2c2Q1+r1Q2.
That is, the factorizations Equation 2
and Equation 3 have in general different
computational costs
when AiAi are not square.
Further, observe that Equation 2 is the more
efficient factorization exactly when
r
2
Q
1
+
c
1
Q
2
<
c
2
Q
1
+
r
1
Q
2
r
2
Q
1
+
c
1
Q
2
<
c
2
Q
1
+
r
1
Q
2
(4)or equivalently, when
r
1
-
c
1
Q
1
>
r
2
-
c
2
Q
2
.
r
1
-
c
1
Q
1
>
r
2
-
c
2
Q
2
.
(5)Consequently,
in the more efficient factorization,
the operation AiAi applied to the data vector xx first is the one
for which the ratio (ri-ci)/Qi(ri-ci)/Qi is the
more negative.
If r1>c1r1>c1 and r2<c2r2<c2
then Equation 4 is always true (QiQi is always positive).
Therefore, in the most computationally efficient
factorization of A1⊗A2A1⊗A2, matrices with fewer rows than
columns are always applied to the data vector xx before
matrices with more rows than columns.
If both matrices are square, then their ordering does not
affect the computational efficiency, because
in that case
each ordering has the same computation cost.
We now consider the Kronecker product of more than two matrices.
For the Kronecker product ⊗i=1nAi⊗i=1nAi there
are n!n! possible different ways in which to order
the operations AiAi.
For example
A
1
⊗
A
2
⊗
A
3
=
(
A
1
⊗
I
r
2
r
3
)
(
I
c
1
⊗
A
2
⊗
I
r
3
)
(
I
c
1
c
2
⊗
A
3
)
=
(
A
1
⊗
I
r
2
r
3
)
(
I
c
1
r
2
⊗
A
3
)
(
I
c
1
⊗
A
2
⊗
I
c
3
)
=
(
I
r
1
⊗
A
2
⊗
I
r
3
)
(
A
1
⊗
I
c
2
r
3
)
(
I
c
1
c
2
⊗
A
3
)
=
(
I
r
1
⊗
A
2
⊗
I
r
3
)
(
I
r
1
c
2
⊗
A
3
)
(
A
1
⊗
I
c
2
c
3
)
=
(
I
r
1
r
2
⊗
A
3
)
(
A
1
⊗
I
r
2
c
3
)
(
I
c
1
⊗
A
2
⊗
I
c
3
)
=
(
I
r
1
r
2
⊗
A
3
)
(
I
r
1
⊗
A
2
⊗
I
c
3
)
(
A
1
⊗
I
c
2
c
3
)
A
1
⊗
A
2
⊗
A
3
=
(
A
1
⊗
I
r
2
r
3
)
(
I
c
1
⊗
A
2
⊗
I
r
3
)
(
I
c
1
c
2
⊗
A
3
)
=
(
A
1
⊗
I
r
2
r
3
)
(
I
c
1
r
2
⊗
A
3
)
(
I
c
1
⊗
A
2
⊗
I
c
3
)
=
(
I
r
1
⊗
A
2
⊗
I
r
3
)
(
A
1
⊗
I
c
2
r
3
)
(
I
c
1
c
2
⊗
A
3
)
=
(
I
r
1
⊗
A
2
⊗
I
r
3
)
(
I
r
1
c
2
⊗
A
3
)
(
A
1
⊗
I
c
2
c
3
)
=
(
I
r
1
r
2
⊗
A
3
)
(
A
1
⊗
I
r
2
c
3
)
(
I
c
1
⊗
A
2
⊗
I
c
3
)
=
(
I
r
1
r
2
⊗
A
3
)
(
I
r
1
⊗
A
2
⊗
I
c
3
)
(
A
1
⊗
I
c
2
c
3
)
(6)Each factorization of ⊗iAi⊗iAi can be described by
a permutation g(·)g(·) of {1,...,n}{1,...,n} which gives
the order in which AiAi is applied to the data vector xx.
Ag(1)Ag(1) is the first operation applied to the
data vector xx, Ag(2)Ag(2) is the second, and so on.
For example, the factorization Equation 6 is described
by the permutation g(1)=3g(1)=3, g(2)=1g(2)=1, g(3)=2g(3)=2.
For n=3n=3,
the computational cost of each factorization can be written
as
C
(
g
)
=
Q
g
(
1
)
c
g
(
2
)
c
g
(
3
)
+
r
g
(
1
)
Q
g
(
2
)
c
g
(
3
)
+
r
g
(
1
)
r
g
(
2
)
Q
g
(
3
)
C
(
g
)
=
Q
g
(
1
)
c
g
(
2
)
c
g
(
3
)
+
r
g
(
1
)
Q
g
(
2
)
c
g
(
3
)
+
r
g
(
1
)
r
g
(
2
)
Q
g
(
3
)
(7)In general
C
(
g
)
=
∑
i
=
1
n
∏
j
=
1
i
-
1
r
g
(
j
)
Q
g
(
i
)
∏
j
=
i
+
1
n
c
g
(
j
)
.
C
(
g
)
=
∑
i
=
1
n
∏
j
=
1
i
-
1
r
g
(
j
)
Q
g
(
i
)
∏
j
=
i
+
1
n
c
g
(
j
)
.
(8)Therefore, the most efficient factorization of ⊗iAi⊗iAi
is described by the permutation g(·)g(·) that minimizes CC.
It turns out that for the Kronecker product of more than two
matrices, the ordering of operations that describes
the most efficient factorization of ⊗iAi⊗iAi
also depends only on the ratios (ri-ci)/Qi(ri-ci)/Qi.
To show that this is the case, suppose
u(·)u(·) is the permutation that
minimizes CC,
then u(·)u(·) has the property that
r
u
(
k
)
-
c
u
(
k
)
Q
u
(
k
)
≤
r
u
(
k
+
1
)
-
c
u
(
k
+
1
)
Q
u
(
k
+
1
)
r
u
(
k
)
-
c
u
(
k
)
Q
u
(
k
)
≤
r
u
(
k
+
1
)
-
c
u
(
k
+
1
)
Q
u
(
k
+
1
)
(9)for k=1,⋯,n-1k=1,⋯,n-1.
To support this, note that since u(·)u(·)
is the permutation that minimizes CC,
we have in particular
C
(
u
)
≤
C
(
v
)
C
(
u
)
≤
C
(
v
)
(10)where v(·)v(·) is
the permutation defined by the following:
v
(
i
)
=
{
u
(
i
)
i
<
k
,
i
>
k
+
1
u
(
k
+
1
)
i
=
k
u
(
k
)
i
=
k
+
1
.
v
(
i
)
=
{
u
(
i
)
i
<
k
,
i
>
k
+
1
u
(
k
+
1
)
i
=
k
u
(
k
)
i
=
k
+
1
.
(11)Because only two terms in Equation 8 are different,
we have from Equation 10
∑
i
=
k
k
+
1
∏
j
=
1
i
-
1
r
u
(
j
)
Q
u
(
i
)
∏
j
=
i
+
1
n
c
u
(
j
)
≤
∑
i
=
k
k
+
1
∏
j
=
1
i
-
1
r
v
(
j
)
Q
v
(
i
)
∏
j
=
i
+
1
n
c
v
(
j
)
∑
i
=
k
k
+
1
∏
j
=
1
i
-
1
r
u
(
j
)
Q
u
(
i
)
∏
j
=
i
+
1
n
c
u
(
j
)
≤
∑
i
=
k
k
+
1
∏
j
=
1
i
-
1
r
v
(
j
)
Q
v
(
i
)
∏
j
=
i
+
1
n
c
v
(
j
)
(12)which, after canceling common terms from each side, gives
Q
u
(
k
)
c
u
(
k
+
1
)
+
r
u
(
k
)
Q
u
(
k
+
1
)
≤
Q
v
(
k
)
c
v
(
k
+
1
)
+
r
v
(
k
)
Q
v
(
k
+
1
)
.
Q
u
(
k
)
c
u
(
k
+
1
)
+
r
u
(
k
)
Q
u
(
k
+
1
)
≤
Q
v
(
k
)
c
v
(
k
+
1
)
+
r
v
(
k
)
Q
v
(
k
+
1
)
.
(13)Since v(k)=u(k+1)v(k)=u(k+1) and v(k+1)=u(k)v(k+1)=u(k)
this becomes
Q
u
(
k
)
c
u
(
k
+
1
)
+
r
u
(
k
)
Q
u
(
k
+
1
)
≤
Q
u
(
k
+
1
)
c
u
(
k
)
+
r
u
(
k
+
1
)
Q
u
(
k
)
Q
u
(
k
)
c
u
(
k
+
1
)
+
r
u
(
k
)
Q
u
(
k
+
1
)
≤
Q
u
(
k
+
1
)
c
u
(
k
)
+
r
u
(
k
+
1
)
Q
u
(
k
)
(14)which is equivalent to Equation 9.
Therefore, to find the best factorization of ⊗iAi⊗iAi
it is necessary only to compute the ratios
(ri-ci)/Qi(ri-ci)/Qi and to order them in an non-decreasing order.
The operation AiAi whose index appears first in this list
is applied to the data vector xx first, and so on
As above, if ru(k+1)>cu(k+1)ru(k+1)>cu(k+1) and ru(k)<cu(k)ru(k)<cu(k)
then Equation 14 is always true.
Therefore, in the most computationally efficient
factorization of ⊗iAi⊗iAi, all matrices with fewer rows than columns are
always applied to the data vector xx before
any matrices with more rows than columns.
If some matrices are square, then their ordering does not
affect the computational efficiency as long
as they are applied after all matrices
with fewer rows than columns and before
all matrices with more rows than columns.
Once the permutation g(·)g(·) that minimizes CC
is determined by ordering the ratios
(ri-ci)/Qi(ri-ci)/Qi,
⊗iAi⊗iAi can be written as
⊗
i
=
1
n
A
i
=
∏
i
=
n
1
I
a
(
i
)
⊗
A
g
(
i
)
⊗
I
b
(
i
)
⊗
i
=
1
n
A
i
=
∏
i
=
n
1
I
a
(
i
)
⊗
A
g
(
i
)
⊗
I
b
(
i
)
(15)where
a
(
i
)
=
∏
k
=
1
g
(
i
)
-
1
γ
(
i
,
k
)
a
(
i
)
=
∏
k
=
1
g
(
i
)
-
1
γ
(
i
,
k
)
(16)
b
(
i
)
=
∏
k
=
g
(
i
)
+
1
n
γ
(
i
,
k
)
b
(
i
)
=
∏
k
=
g
(
i
)
+
1
n
γ
(
i
,
k
)
(17)and where γ(·)γ(·) is defined by
γ
(
i
,
k
)
=
{
r
k
if
g
(
i
)
>
g
(
k
)
c
k
if
g
(
i
)
<
g
(
k
)
.
γ
(
i
,
k
)
=
{
r
k
if
g
(
i
)
>
g
(
k
)
c
k
if
g
(
i
)
<
g
(
k
)
.
(18)A Matlab program that computes the permutation
that describes the computationally most efficient
factorization of ⊗i=1nAi⊗i=1nAi is cgc()
.
It also gives the resulting computational cost.
It requires the computational cost of each of the
matrices AiAi and the number of rows and columns
of each.
function [g,C] = cgc(Q,r,c,n)
% [g,C] = cgc(Q,r,c,n);
% Compute g and C
% g : permutation that minimizes C
% C : computational cost of Kronecker product of A(1),...,A(n)
% Q : computation cost of A(i)
% r : rows of A(i)
% c : columns of A(i)
% n : number of terms
f = find(Q==0);
Q(f) = eps * ones(size(Q(f)));
Q = Q(:);
r = r(:);
c = c(:);
[s,g] = sort((r-c)./Q);
C = 0;
for i = 1:n
C = C + prod(r(g(1:i-1)))*Q(g(i))*prod(c(g(i+1:n)));
end
C = round(C);
The Matlab program kpi()
implements the Kronecker
product ⊗i=1nAi⊗i=1nAi.
function y = kpi(d,g,r,c,n,x)
% y = kpi(d,g,r,c,n,x);
% Kronecker Product : A(d(1)) kron ... kron A(d(n))
% g : permutation of 1,...,n
% r : [r(1),...,r(n)]
% c : [c(1),..,c(n)]
% r(i) : rows of A(d(i))
% c(i) : columns of A(d(i))
% n : number of terms
for i = 1:n
a = 1;
for k = 1:(g(i)-1)
if i > find(g==k)
a = a * r(k);
else
a = a * c(k);
end
end
b = 1;
for k = (g(i)+1):n
if i > find(g==k)
b = b * r(k);
else
b = b * c(k);
end
end
% y = (I(a) kron A(d(g(i))) kron I(b)) * x;
y = IAI(d(g(i)),a,b,x);
end
where the last line of code
calls a function that implements (Ia⊗Ad(g(i))⊗Ib)x(Ia⊗Ad(g(i))⊗Ib)x.
That is, the program IAI(i,a,b,x)
implements
(Ia⊗A(i)⊗Ib)x(Ia⊗A(i)⊗Ib)x.
The Matlab program IAI
implements y=(Im⊗A⊗In)xy=(Im⊗A⊗In)x
function y = IAI(A,r,c,m,n,x)
% y = (I(m) kron A kron I(n))x
% r : number of rows of A
% c : number of columns of A
v = 0:n:n*(r-1);
u = 0:n:n*(c-1);
for i = 0:m-1
for j = 0:n-1
y(v+i*r*n+j+1) = A * x(u+i*c*n+j+1);
end
end
It simply uses two loops to implement the mnmn copies
of AA. Each copy of AA is applied to a different
subset of the elements of xx.
The command I⊗A⊗II⊗A⊗I where ⊗⊗ is the
Kronecker (or Tensor) product can be interpreted as a
vector/parallel command [2], [3].
In these references, the implementation of these
commands is discussed in detail and
they have found that the Tensor product is
“an extremely useful tool for matching algorithms
to computer architectures [2].”
The expression I⊗AI⊗A can easily be seen to represent a
parallel command:
I
⊗
A
=
A
A
⋱
A
.
I
⊗
A
=
A
A
⋱
A
.
(19)Each block along the diagonal acts on non-overlapping
sections of the data vector - so that each section
can be performed in parallel. Since each section
represents exactly the same operation, this form is
amenable to implementation on a computer with a
parallel architectural configuration.
The expression A⊗IA⊗I can be similarly seen to represent
a vector command, see [2].
It should also be noted that by employing `stride' permutations,
the command (I⊗A⊗I)x(I⊗A⊗I)x can be replaced
by either (I⊗A)x(I⊗A)x or (A⊗I)x(A⊗I)x[2], [3].
It is only necessary to permute the input and output.
It is also the case that these stride permutations
are natural loading and storing commands for some architectures.
In the programs we have written in conjunction with this paper
we implement the commands y=(I⊗A⊗I)xy=(I⊗A⊗I)x
with loops in a set of subroutines.
The circular convolution and prime length FFT programs
we present, however, explicitly use the form
I⊗A⊗II⊗A⊗I to make clear the
structure of the algorithm, to make them
more modular and simpler, and to make them
amenable to implementation on special architectures.
In fact, in [2] it is suggested that it might be practical
to develop tensor product compilers.
The FFT programs we have generated
will be well suited for such compilers.