In this section, we examine another classic CISC processor, the Motorola MC68020, which was used to build Macintosh computers and Sun workstations. We happened to run this code on a BBN GP-1000 Butterfly parallel processing system made up of 96 MC68020 processors.
The Motorola architecture is relatively easy to program in assembly language. It has plenty of 32-bit registers, and they are relatively easy to use. It has a CISC instruction set that keeps assembly language programming quite simple. Many instructions can perform multiple operations in a single instruction.
We use this example to show a progression of optimization levels, using a f77 compiler on a floating-point version of the loop. Our first example is with no optimization:
! Note d0 contains the value I
L5:
movl d0,L13 ! Store I to memory if loop ends
lea a1@(-4),a0 ! a1 = address of B
fmoves a0@(0,d0:l:4),fp0 ! Load of B(I)
lea a3@(-4),a0 ! a3 = address of C
fadds a0@(0,d0:l:4),fp0 ! Load of C(I) (And Add)
lea a2@(-4),a0 ! a2 = address of A
fmoves fp0,a0@(0,d0:l:4) ! Store of A(I)
addql #1,d0 ! Increment I
subql #1,d1 ! Decrement "N"
tstl d1
bnes L5
The value for I is stored in the d0 register. Each time through the loop, it’s incremented by 1. At the same time, register d1 is initialized to the value for N and decremented each time through the loop. Each time through the loop, I is stored into memory, so the proper value for I ends up in memory when the loop terminates. Registers a1, a2, and a3 are preloaded to be the first address of the arrays B, A, and C respectively. However, since FORTRAN arrays begin at 1, we must subtract 4 from each of these addresses before we can use I as the offset. The lea instructions are effectively subtracting 4 from one address register and storing it in another.
The following instruction performs an address computation that is almost a one-to- one translation of an array reference:
fmoves a0@(0,d0:l:4),fp0 ! Load of B(I)
This instruction retrieves a floating-point value from the memory. The address is computed by first multiplying d0 by 4 (because these are 32-bit floating-point numbers) and adding that value to a0. As a matter of fact, the lea and fmoves instructions could have been combined as follows:
fmoves a1@(-4,d0:l:4),fp0 ! Load of B(I)
To compute its memory address, this instruction multiplies d0 by 4, adds the contents of a1, and then subtracts 4. The resulting address is used to load 4 bytes into floating-point register fp0. This is almost a literal translation of fetching B(I). You can see how the assembly is set up to track high-level constructs.
It is almost as if the compiler were “trying” to show off and make use of the nifty assembly language instructions.
Like the Intel, this is not a load-store architecture. The fadds instruction adds a value from memory to a value in a register (fp0) and leaves the result of the addition in the register. Unlike the Intel 8088, we have enough registers to store quite a few of the values used throughout the loop (I, N, the address of A, B, and C) in registers to save memory operations.
In the next example, we compiled the C version of the loop with the normal optimization (-O) turned on. We see the C perspective on arrays in this code. C views arrays as extensions to pointers in C; the loop index advances as an offset from a pointer to the beginning of the array:
! d3 = I
! d1 = Address of A
! d2 = Address of B
! d0 = Address of C
! a6@(20) = N
moveq #0,d3 ! Initialize I
bras L5 ! Jump to End of the loop
L1: movl d3,a1 ! Make copy of I
movl a1,d4 ! Again
asll #2,d4 ! Multiply by 4 (word size)
movl d4,a1 ! Put back in an address register
fmoves a1@(0,d2:l),fp0 ! Load B(I)
movl a6@(16),d0 ! Get address of C
fadds a1@(0,d0:l),fp0 ! Add C(I)
fmoves fp0,a1@(0,d1:l) ! Store into A(I)
addql #1,d3 ! Increment I
L5:
cmpl a6@(20),d3
bits L1
We first see the value of I being copied into several registers and multiplied by 4 (using a left shift of 2, strength reduction). Interestingly, the value in register a1 is I multiplied by 4. Registers d0, d1, and d2 are the addresses of C, B, and A respectively. In the load, add, and store, a1 is the base of the address computation and d0, d1, and d2 are added as an offset to a1 to compute each address.
This is a simplistic optimization that is primarily trying to maximize the values that are kept in registers during loop execution. Overall, it’s a relatively literal translation of the C language semantics from C to assembly. In many ways, C was designed to generate relatively efficient code without requiring a highly sophisticated optimizer.
In this example, we are back to the FORTRAN version on the MC68020. We have compiled it with the highest level of optimization (-OLM) available on this compiler. Now we see a much more aggressive approach to the loop:
! a0 = Address of C(I)
! a1 = Address of B(I)
! a2 = Address of A(I)
L3:
fmoves a1@,fp0 ! Load B(I)
fadds a0@,fp0 ! Add C(I)
fmoves fp0,a2@ ! Store A(I)
addql #4,a0 ! Advance by 4
addql #4,a1 ! Advance by 4
addql #4,a2 ! Advance by 4
subql #1,d0 ! Decrement I
tstl d0
bnes L3
First off, the compiler is smart enough to do all of its address adjustment outside the loop and store the adjusted addresses of A, B, and C in registers. We do the load, add, and store in quick succession. Then we advance the array addresses by 4 and perform the subtraction to determine when the loop is complete.
This is very tight code and bears little resemblance to the original FORTRAN code.
"The purpose of Chuck Severence's book, High Performance Computing has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners […]"