In the matrix multiplication code, we encountered a non-unit stride and were able to eliminate it with a quick interchange of the loops. Unfortunately, life is rarely this simple. Often you find some mix of variables with unit and non-unit strides, in which case interchanging the loops moves the damage around, but doesn’t make it go away.
The loop to perform a matrix transpose represents a simple example of this dilemma:
DO I=1,N                      DO J=1,M
  DO J=1,M                      DO I=1,N
    A(J,I) = B(I,J)               A(J,I) = B(I,J)
  ENDDO                         ENDDO
ENDDO                         ENDDO
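To experiment with the two orderings above, a minimal free-form Fortran sketch along the following lines may help. The program name, the array sizes N and M, and the use of CPU_TIME are assumptions made for illustration, not part of the original text. An optimizing compiler may interchange the loops itself or discard the first nest because A is overwritten, so the timings are only meaningful at a low optimization level.

```fortran
! Sketch only: runs both loop orderings of the transpose and times each.
! N, M, and the timing approach are illustrative assumptions.
PROGRAM TRANSPOSE_ORDER
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 2000, M = 2000
  REAL, ALLOCATABLE :: A(:,:), B(:,:)
  INTEGER :: I, J
  REAL :: T0, T1, T2

  ALLOCATE (A(M,N), B(N,M))
  CALL RANDOM_NUMBER(B)
  A = 0.0          ! touch A once so page faults are not charged to the first nest

  CALL CPU_TIME(T0)
  ! I outer, J inner: A(J,I) is stored with unit stride,
  ! B(I,J) is loaded with a stride of N elements.
  DO I = 1, N
    DO J = 1, M
      A(J,I) = B(I,J)
    ENDDO
  ENDDO
  CALL CPU_TIME(T1)
  ! J outer, I inner: B(I,J) is loaded with unit stride,
  ! A(J,I) is stored with a stride of M elements.
  DO J = 1, M
    DO I = 1, N
      A(J,I) = B(I,J)
    ENDDO
  ENDDO
  CALL CPU_TIME(T2)

  PRINT *, 'I outer, J inner (seconds):', T1 - T0
  PRINT *, 'J outer, I inner (seconds):', T2 - T1

  ! Either ordering must produce the same transpose.
  IF (ANY(A /= TRANSPOSE(B))) ERROR STOP 'transpose mismatch'
END PROGRAM TRANSPOSE_ORDER
```

Which ordering runs faster, and by how much, depends on the machine's cache and on the array sizes, so no particular result is claimed here.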
Whichever way you order the loops, you break the memory access pattern for either A or B. Even more interesting, you have to choose between strided loads and strided stores: which will it be? We really need a general method for improving the memory access patterns for both A and B, not one or the other. We’ll show you such a method in (Reference).
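The trade-off can be made concrete by listing the element offsets that the inner loop touches under each ordering. The sketch below assumes Fortran's column-major storage, with A dimensioned M by N and B dimensioned N by M. The small sizes and the helper function OFFSET are hypothetical, introduced only for this illustration.

```fortran
! Sketch only: prints the element offsets visited by the inner loop.
! Assumes column-major storage; N, M, and OFFSET are illustrative.
PROGRAM STRIDE_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 4, M = 3
  INTEGER :: I, J

  PRINT *, 'J inner (I = 1): offset in A, offset in B'
  DO J = 1, M
    PRINT *, OFFSET(J, 1, M), OFFSET(1, J, N)
  ENDDO

  PRINT *, 'I inner (J = 1): offset in A, offset in B'
  DO I = 1, N
    PRINT *, OFFSET(1, I, M), OFFSET(I, 1, N)
  ENDDO

CONTAINS

  ! Offset, in elements, of element (ROW, COL) in a column-major
  ! array whose leading dimension is LD.
  INTEGER FUNCTION OFFSET(ROW, COL, LD)
    INTEGER, INTENT(IN) :: ROW, COL, LD
    OFFSET = (ROW - 1) + (COL - 1) * LD
  END FUNCTION OFFSET

END PROGRAM STRIDE_DEMO
```

With these sizes, the J-inner loop advances through A one element at a time while stepping through B by N = 4 elements. The I-inner loop does the reverse: B advances by one element and A by M = 3. Interchanging the loops only moves the non-unit stride from the loads to the stores.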
"The purpose of Chuck Severence's book, High Performance Computing has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners […]"