To be effective, loop unrolling requires a fairly large number of iterations in the original loop. To understand why, picture what happens if the total iteration count is low, perhaps less than 10, or even less than 4. With a trip count this low, the preconditioning loop is doing a proportionately large amount of the work. It’s not supposed to be that way. The preconditioning loop is supposed to catch the few leftover iterations missed by the unrolled, main loop. However, when the trip count is low, you make one or two passes through the unrolled loop, plus one or two passes through the preconditioning loop. In other words, you have more clutter; the loop shouldn’t have been unrolled in the first place.
Probably the only time it makes sense to unroll a loop with a low trip count is when the number of iterations is constant and known at compile time. For instance, suppose you had the following loop:
PARAMETER (NITER = 3)
DO I=1,NITER
A(I) = B(I) * C
ENDDO
Because NITER is hardwired to 3, you can safely unroll to a depth of 3 without worrying about a preconditioning loop. In fact, you can throw out the loop structure altogether and leave just the unrolled loop innards:
PARAMETER (NITER = 3)
A(1) = B(1) * C
A(2) = B(2) * C
A(3) = A(3) * C
Of course, if a loop’s trip count is low, it probably won’t contribute significantly to the overall runtime, unless you find such a loop at the center of a larger loop. Then you either want to unroll it completely or leave it alone.









"The purpose of Chuck Severence's book, High Performance Computing has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners […]"