Summary: You will learn more assembly instructions in this lab.
Quite often you need to load a register with a constant.
The C62x instructions you can use for this task are
MVK, MVKL, and
MVKH. Each of these instructions can
load a 16-bit constant to a register. Read and understand
the description of these instructions in the manual.
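The effect of these three instructions on a 32-bit register can be sketched in C. This is only a host-side model for checking your understanding, not TI code: the function names `mvkl`, `mvkh`, and `load_const32` are made up for this illustration. It assumes that MVK and MVKL sign-extend the low 16 bits of the constant and that MVKH replaces only the upper 16 bits of the register.

```c
#include <assert.h>
#include <stdint.h>

/* MVK and MVKL: keep the low 16 bits of the constant, sign-extend to 32 bits. */
uint32_t mvkl(uint32_t constant)
{
    uint32_t low = constant & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* MVKH: overwrite the upper 16 bits of reg, leave the lower 16 bits alone. */
uint32_t mvkh(uint32_t constant, uint32_t reg)
{
    return (constant & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* Hypothetical helper: the MVKL-then-MVKH pair applied to one register. */
uint32_t load_const32(uint32_t constant)
{
    uint32_t reg = mvkl(constant);   /* upper half may be 0xFFFF here */
    reg = mvkh(constant, reg);       /* upper half is now correct */
    assert(reg == constant);         /* the pair always yields the full value */
    return reg;
}
```

Note that in this model `mvkl(0x8001)` yields 0xFFFF8001 because of the sign extension, which is why a 16-bit constant with its top bit set may need a following MVKH if the upper half of the register must be zero.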
(Loading constants): Write assembly instructions to do the following:
1. Load the constant 0xff12 to A1.
2. Load the constant 0xabcd45ef to B0.
Contents of one register can be copied to another register
by using the MV instruction. There is
also the ZERO instruction to set a
register to zero. Learn how to use these instructions by
reading the appropriate TI manual pages.
Because the C62x processor has the so-called load/store
architecture, you must first load up the content of memory
to a register to be able to manipulate it. The basic
assembly instructions you use for loading are
LDB, LDH, and
LDW for loading up 8-, 16-, and 32-bit
data from memory. (There are some variations to these
instructions for different handling of the signs of the
loaded values.) Read and understand how these instructions
work.
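The different handling of signs mentioned above can be sketched in C. This is a minimal model, assuming the signed loads (LDB, LDH) sign-extend the loaded value to 32 bits while the unsigned variants (LDBU, LDHU in the TI instruction set) zero-extend it. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned byte/halfword load: the upper bits of the register become 0. */
uint32_t zero_extend(uint32_t value, unsigned bits)
{
    assert(bits == 8 || bits == 16);
    return value & ((1u << bits) - 1u);
}

/* Signed byte/halfword load: the upper bits copy the sign bit of the data. */
uint32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t v = zero_extend(value, bits);
    uint32_t sign = 1u << (bits - 1u);
    return (v ^ sign) - sign;   /* two's-complement trick, wraps modulo 2^32 */
}
```

For instance, the byte 0x9C becomes 0xFFFFFF9C under `sign_extend(0x9C, 8)` but stays 0x0000009C under `zero_extend(0x9C, 8)`.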
However, to specify the address of the memory location to load from, you need to load another register (used as an address pointer), and you can use various addressing modes to specify the memory location in many different ways. An addressing mode is the method by which an instruction calculates the location of an object in memory. The table below lists all the ways to handle the address pointer in the C62x CPU. Note the similarity to C pointer manipulation.
| Syntax | Memory address accessed | Pointer modification |
|---|---|---|
| *R | R | None |
| *++R | R + 1 element (the updated R) | Preincrement |
| *--R | R - 1 element (the updated R) | Predecrement |
| *R++ | R | Postincrement |
| *R-- | R | Postdecrement |
| *+R[disp] | R + disp | None |
| *-R[disp] | R - disp | None |
| *++R[disp] | R + disp (the updated R) | Preincrement |
| *--R[disp] | R - disp (the updated R) | Predecrement |
| *R++[disp] | R | Postincrement |
| *R--[disp] | R | Postdecrement |
The [disp] specifies a number of elements (words, halfwords, or bytes, depending on the instruction type), and it can be either a 5-bit constant or a register. The increment/decrement of the pointer register is scaled in the same way: by 4 bytes for a word access, 2 bytes for a halfword access, and 1 byte for a byte access. The addressing modes with displacements are useful when a block of memory locations is accessed. Those with automatic increment/decrement are useful when a block is accessed consecutively to implement a buffer, for example, to store signal samples to implement a digital filter.
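The address arithmetic described above can be written as a small C function. This is a sketch under stated assumptions, not part of any TI tool: the name `effective_address`, the `UpdateMode` enum, and the parameter layout are invented for illustration. It assumes the displacement is scaled by the element size and that pre-modification updates the pointer before the access, while post-modification updates it afterwards.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { NO_MODIFY, PRE_MODIFY, POST_MODIFY } UpdateMode;

/*
 * Model one load/store address calculation.
 *   r          pointer register (updated in place for pre/post modes)
 *   disp       displacement in elements (0 for plain *R, 1 for *++R, *R++, ...)
 *   negative   nonzero for the '-' and '--' forms
 *   elem_size  1 for LDB/STB, 2 for LDH/STH, 4 for LDW/STW
 * Returns the byte address that is accessed.
 */
uint32_t effective_address(uint32_t *r, uint32_t disp, int negative,
                           UpdateMode update, uint32_t elem_size)
{
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4);
    uint32_t offset = disp * elem_size;               /* scale to bytes */
    uint32_t moved = negative ? *r - offset : *r + offset;
    uint32_t accessed;

    switch (update) {
    case PRE_MODIFY:                  /* *++R[disp], *--R[disp] */
        *r = moved;
        accessed = moved;
        break;
    case POST_MODIFY:                 /* *R++[disp], *R--[disp] */
        accessed = *r;
        *r = moved;
        break;
    default:                          /* *+R[disp], *-R[disp], *R */
        accessed = moved;
        break;
    }
    return accessed;
}
```

With `r = 0x200`, a word access with `disp = 3` and `NO_MODIFY` touches 0x20C and leaves `r` unchanged, while a `POST_MODIFY` word access with `disp = 2` touches 0x200 and leaves `r = 0x208`.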
(Load from memory): Assume the following values are stored in memory addresses:
100h fe54 7834h
104h 3459 f34dh
108h 2ef5 7ee4h
10ch 2345 6789h
110h ffff eeddh
114h 3456 787eh
118h 3f4d 7ab3h
Suppose A10 = 0000 0108h. Find the contents of A1 and A10 after executing each of the following instructions.
LDW .D1 *A10, A1
LDH .D1 *A10, A1
LDB .D1 *A10, A1
LDW .D1 *-A10[1], A1
LDW .D1 *+A10[1], A1
LDW .D1 *+A10[2], A1
LDB .D1 *+A10[2], A1
LDW .D1 *++A10[1], A1
LDW .D1 *--A10[1], A1
LDB .D1 *++A10[1], A1
LDB .D1 *--A10[1], A1
LDW .D1 *A10++[1], A1
LDW .D1 *A10--[1], A1
Storing the register contents uses the same addressing
modes. The assembly instructions used for storing are
STB, STH, and
STW. Read and understand these
instructions in the TI manual.
(Storing to memory): Write assembly instructions to store the 32-bit constant 53fe 23e4h to memory address 0000 0123h.
Sometimes it becomes necessary to access only part of the data stored in memory. For example, if you store the 32-bit word 0x11223344 at memory location 0x8000, the four bytes at addresses 0x8000, 0x8001, 0x8002, and 0x8003 together hold the value 0x11223344. If you then read the byte at memory location 0x8000, which byte value will you get?
The answer depends on the endian mode of the memory system. In the little endian mode, the lower memory addresses contain the LSB part of the data. Thus, the bytes stored in the four byte addresses will be as shown in Table 2.
| Address | Byte value |
|---|---|
| 0x8000 | 0x44 |
| 0x8001 | 0x33 |
| 0x8002 | 0x22 |
| 0x8003 | 0x11 |
In the big endian mode, the lower memory addresses contain the MSB part of the data. Thus, we have
| Address | Byte value |
|---|---|
| 0x8000 | 0x11 |
| 0x8001 | 0x22 |
| 0x8002 | 0x33 |
| 0x8003 | 0x44 |
In this course, we use the little endian mode by default and all the lab programming must assume the little endian mode.
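The two byte orders can be reproduced on any host with a byte array standing in for memory. The sketch below is an illustration only: `store_word` and `load_byte` are hypothetical names, the `big_endian` flag is an assumption of this model (on the real device the mode is a system setting, not a per-instruction argument), and addresses are offsets into a small array rather than 0x8000.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit word into byte-addressed memory, as a word store would. */
void store_word(uint8_t *mem, uint32_t addr, uint32_t value, int big_endian)
{
    assert(addr % 4u == 0u);    /* this model only handles aligned words */
    for (uint32_t i = 0; i < 4u; i++) {
        /* little endian: byte 0 is the LSB; big endian: byte 0 is the MSB */
        uint32_t shift = big_endian ? 8u * (3u - i) : 8u * i;
        mem[addr + i] = (uint8_t)(value >> shift);
    }
}

/* Read one byte back, as an unsigned byte load would. */
uint8_t load_byte(const uint8_t *mem, uint32_t addr)
{
    return mem[addr];
}
```

Storing 0x11223344 at offset 8 in little endian mode and reading back offset 8 gives 0x44; repeating the store in big endian mode gives 0x11 at the same offset, matching the two tables above.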
(Little endian mode): What will be the value in
A0 after executing the following
assembly instructions? (functional unit specifications
were omitted.)
MVKL 0x80000000, A10
MVKH 0x80000000, A10
MVKL 0x12345678, A9
MVKH 0x12345678, A9
STW A9, *A10
LDB *+A10[2],A0
What will be the value in A0 if the system uses the big endian mode?
In fact, the above addressing method describes the so-called linear addressing mode (the default upon reset), where the offset or increment/decrement of pointers occurs without bound. There is also a circular addressing mode that can handle a finite-size buffer efficiently. You will implement circular buffers for the FIR filtering algorithm in the FIR filtering experiments later.
In the C62x CPU, it takes exactly one CPU clock cycle to execute each instruction. However, instructions such as LDW need to access the slow external memory, so the result of the load is not available immediately at the end of the execution cycle. The extra cycles you must wait before the result becomes available are called delay slots.
For example, consider loading the memory content at the address pointed to by A10 into A1 and then moving the loaded data to A2. You might be tempted to write simple 2-line assembly code as follows:
    LDW .D1 *A10, A1
    MV .D1 A1,A2
What is wrong with the above code? The result of the LDW instruction is not available immediately after LDW is executed. As a consequence, the MV instruction does not copy the desired value of A1 to A2. To prevent this, we need to make the CPU wait until the result of the LDW instruction is correctly loaded into A1 before executing the MV instruction. For load instructions, we need 4 extra clock cycles until the load result is valid. To make the CPU wait for 4 clock cycles, we insert 4 NOP (no operation) instructions between LDW and MV. Each NOP instruction makes the CPU idle for one clock cycle. The resulting code looks like this:
    LDW .D1 *A10, A1
    NOP
    NOP
    NOP
    NOP
    MV .D1 A1,A2
or, more simply, you can write
    LDW .D1 *A10, A1
    NOP 4
    MV .D1 A1,A2
Then why didn't the designers of the CPU simply make the LDW instruction take 5 clock cycles to begin with, rather than letting the programmer insert 4 NOPs? The answer is that you can insert instructions other than NOPs, as long as those instructions do not use the result of the LDW instruction above. By doing this, the CPU can execute additional instructions while waiting for the result of the LDW instruction to become valid, greatly reducing the total execution time of the entire program.
Table 3-5 in TI's instruction set description shows the execution of the instructions with delay slots in more detail. The instructions with delay slots are the multiply (MPY, 1 delay slot), load (LDB, LDW, etc., 4 delay slots), and branch (B, 5 delay slots) instructions.
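The timing of delay slots can be explored with a toy register-file model in C. This is a deliberately simplified sketch, not a cycle-accurate C62x simulator: the `Cpu` and `Pending` types and the `issue` and `tick` functions are invented for this illustration. It assumes a result is written at the end of the last delay slot, so it is first visible to the instruction issued in the following cycle.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS      16
#define MAX_IN_FLIGHT 8

typedef struct {
    int      reg;          /* destination register index */
    uint32_t value;        /* result waiting to be written */
    int      cycles_left;  /* 0 means this slot is free */
} Pending;

typedef struct {
    uint32_t regs[NUM_REGS];
    Pending  in_flight[MAX_IN_FLIGHT];
} Cpu;

/* Issue an instruction in the current cycle; its result lands later. */
void issue(Cpu *cpu, int reg, uint32_t value, int delay_slots)
{
    assert(reg >= 0 && reg < NUM_REGS && delay_slots >= 0);
    for (int i = 0; i < MAX_IN_FLIGHT; i++) {
        if (cpu->in_flight[i].cycles_left == 0) {
            /* issue cycle itself plus the delay slots */
            cpu->in_flight[i] = (Pending){ reg, value, delay_slots + 1 };
            return;
        }
    }
    assert(0 && "too many results in flight");
}

/* End of one clock cycle: write every result whose wait has elapsed. */
void tick(Cpu *cpu)
{
    for (int i = 0; i < MAX_IN_FLIGHT; i++) {
        Pending *p = &cpu->in_flight[i];
        if (p->cycles_left > 0 && --p->cycles_left == 0) {
            cpu->regs[p->reg] = p->value;
        }
    }
}
```

In this model, after `issue(&cpu, 1, v, 4)` (a load into register 1) the old value of register 1 is still read after one, two, three, or four calls to `tick`, and the new value appears only after the fifth, which corresponds to the instruction placed after `NOP 4`.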
The functional unit latency indicates for how many clock cycles each instruction actually uses a functional unit. All C62x instructions have a functional unit latency of 1, meaning that each functional unit is ready to execute the next instruction after 1 clock cycle, regardless of the delay slots of the instruction. Therefore, the following instructions are valid:
    LDW .D1 *A10, A4
    ADD .D1 A1,A2,A3
Although the LDW instruction has not yet loaded the A4 register when the ADD is executed, the .D1 functional unit becomes available in the clock cycle right after the one in which LDW is executed.
To clarify the execution of instructions with delay slots, let's consider the following example of the LDW instruction. Let's assume A10 = 0x0100 and A2 = 1, and your intent is to load A9 with the 32-bit word at the address 0x0100 (the postincrement then advances A10 by one word to 0x0104). The 3 MV instructions are not related to the LDW instruction. They do something else.
1 LDW .D1 *A10++[A2], A9
2 MV .L1 A10, A8
3 MV .L1 A1, A10
4 MV .L1 A1, A2
5 ...
We can ask several interesting questions at this point:
1. What is the value loaded to A8? That is, in which clock cycle is the address pointer updated?
2. Can we change the value of A2 before the LDW instruction finishes the actual loading?
3. Can we load a new value to A10 before the first LDW finishes loading the memory content to A9? That is, can we change the address pointer before the 4 delay slots elapse?
The answers follow from when LDW reads its operands. Although the LDW instruction needs 4 delay slots to load the memory content to A9, the address pointer and offset registers (A10 and A2) are read and updated in the clock cycle in which the LDW instruction is issued. Therefore, in line 2, A8 is loaded with the updated A10, that is, A10 = A8 = 0x104. Because LDW reads the A10 and A2 registers in the first clock cycle, you are free to change these registers afterward (questions 2 and 3) without affecting the operation of the first LDW.
A similar principle holds for the MPY and B (when using a register as the branch address) instructions. MPY reads its source values in the first clock cycle and writes the multiplication result after the 2nd clock cycle. For B, the address pointer is read in the first clock cycle, and the actual branching occurs after the 5 delay slots have elapsed. Thus, after the first clock cycle, you are free to modify the source or the address pointer registers. For more details, refer to Table 3-5 in the instruction set description or read the description of the individual instruction.
There are several instructions for addition, subtraction, and multiplication on the C62x CPU. The basic instructions are ADD, SUB, and MPY. Learn about these instructions in the TI manual. ADD and SUB have 0 delay slots (meaning the results of the operation are immediately available), but MPY has 1 delay slot (the result of the multiplication is valid after one additional clock cycle).
(Add, subtract, and multiply): Write an assembly program
to compute ( 0000 ef35h + 0000 33dch - 0000
1234h ) * 0000 0007h
Often you need to control the flow of the program execution by branching to another block of code. The B instruction does the job in the C62x CPU. The address of the branch can be specified either as a displacement or in a register used by the B instruction. Read and understand the B instruction in the manual. The B instruction has 5 delay slots, meaning that the actual branch occurs in the 5th clock cycle after the instruction is executed.
In many cases, you execute the branch instruction conditionally, depending on the result of previous operations. For example, to implement a loop, you decrement the loop counter by 1 each time you run a set of instructions, and as long as the loop counter is not zero, you branch back to the beginning of the code block to repeat the loop operations. In the C62x CPU, this conditional branching is implemented using conditional operations. Although B may be the instruction most often made conditional, all instructions in the C62x can be conditional.
Conditional instructions are written in code with square brackets, [ ], surrounding the condition register name. For example, the following B instruction is executed only if B0 is nonzero:
    [B0] B .S2x A0
To execute an instruction conditionally when the condition register is zero, we put ! in front of the register. For example, the following B instruction is executed only when B0 is zero:
    [!B0] B .S2x A0
Not all registers can be used as condition registers. In the C62x CPU, the registers that can be tested in conditional operations are B0, B1, B2, A1, and A2.
(Simple loop): Write an assembly program computing the
summation
The logical operations and bit manipulations are
accomplished by the AND,
OR, XOR,
CLR, SET,
SHL, and SHR
instructions. Read and understand the operations of these
instructions.
Other useful instructions include IDLE
and compare instructions such as CMPEQ
etc. Read and understand the operations of these
instructions.
The set of instructions that can be performed in each functional unit is as follows (See Table 4, Table 5, Table 6 and Table 7). Please refer to TMS320C62x/C67x CPU and Instruction Set Reference Guide for detailed description of each instruction.
Table 4: .S unit instructions

| Instruction | Description |
|---|---|
| ADD(U) | signed or unsigned integer addition without saturation |
| ADDK | integer addition using signed 16-bit constant |
| ADD2 | two 16-bit integer adds on upper and lower register halves |
| B | branch using a register |
| CLR | clear a bit field |
| EXT | extract and sign-extend a bit field |
| MV | move from register to register |
| MVC | move between the control file and the register file |
| MVK | move a 16-bit constant into a register and sign extend |
| MVKH | move 16-bit constant into the upper bits of a register |
| NEG | negate (pseudo-operation) |
| NOT | bitwise NOT |
| OR | bitwise OR |
| SET | set a bit field |
| SHL | arithmetic shift left |
| SHR | arithmetic shift right |
| SSHL | shift left with saturation |
| SUB(U) | signed or unsigned integer subtraction without saturation |
| SUB2 | two 16-bit integer subtractions on upper and lower register halves |
| XOR | exclusive OR |
| ZERO | zero a register (pseudo-operation) |
Table 5: .L unit instructions

| Instruction | Description |
|---|---|
| ABS | integer absolute value with saturation |
| ADD(U) | signed or unsigned integer addition without saturation |
| AND | bitwise AND |
| CMPEQ | integer compare for equality |
| CMPGT(U) | signed or unsigned integer compare for greater than |
| CMPLT(U) | signed or unsigned integer compare for less than |
| LMBD | leftmost bit detection |
| MV | move from register to register |
| NEG | negate (pseudo-operation) |
| NORM | normalize integer |
| NOT | bitwise NOT |
| OR | bitwise OR |
| SADD | integer addition with saturation to result size |
| SAT | saturate a 40-bit integer to a 32-bit integer |
| SSUB | integer subtraction with saturation to result size |
| SUBC | conditional integer subtraction and shift - used for division |
| XOR | exclusive OR |
| ZERO | zero a register (pseudo-operation) |
Table 6: .D unit instructions

| Instruction | Description |
|---|---|
| ADD(U) | signed or unsigned integer addition without saturation |
| ADDA(B/H/W) | integer addition using addressing mode |
| LD(B/H/W) | load from memory with a 15-bit constant offset |
| MV | move from register to register |
| ST(B/H/W) | store to memory with a register offset or 5-bit unsigned constant offset |
| SUB(U) | signed or unsigned integer subtraction without saturation |
| SUBA(B/H/W) | integer subtraction using addressing mode |
| ZERO | zero a register (pseudo-operation) |
Table 7: .M unit instructions

| Instruction | Description |
|---|---|
| MPY (U/US/SU) | signed or unsigned integer multiply 16lsb*16lsb |
| MPYH (U/US/SU) | signed or unsigned integer multiply 16msb*16msb |
| MPYLH | signed or unsigned integer multiply 16lsb*16msb |
| MPYHL | signed or unsigned integer multiply 16msb*16lsb |
| SMPY (HL/LH/H) | integer multiply with left shift and saturation |
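The "16lsb" and "16msb" notation in the multiply descriptions means that only one 16-bit half of each 32-bit source register takes part in the multiplication. The C sketch below models the signed forms and the unsigned MPYU under that reading. The function names simply mirror the mnemonics and the helpers `low16` and `high16` are hypothetical; treat it as a way to predict results, not as a definition of the instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Signed value of the 16 least significant bits of a register. */
static int32_t low16(uint32_t r)
{
    int32_t v = (int32_t)(r & 0xFFFFu);
    return (v >= 0x8000) ? v - 0x10000 : v;
}

/* Signed value of the 16 most significant bits of a register. */
static int32_t high16(uint32_t r)
{
    return low16(r >> 16);
}

int32_t  mpy(uint32_t a, uint32_t b)   { return low16(a)  * low16(b);  }
int32_t  mpyh(uint32_t a, uint32_t b)  { return high16(a) * high16(b); }
int32_t  mpylh(uint32_t a, uint32_t b) { return low16(a)  * high16(b); }
int32_t  mpyhl(uint32_t a, uint32_t b) { return high16(a) * low16(b);  }
uint32_t mpyu(uint32_t a, uint32_t b)  { return (a & 0xFFFFu) * (b & 0xFFFFu); }

/* Example: the upper half of the first operand is ignored by MPY. */
void mpy_truncation_example(void)
{
    assert(mpy(0x00010002u, 0x00000003u) == 6);
}
```

A practical consequence is that a full 32-bit by 32-bit product cannot be obtained from a single MPY in this model; operands must fit in 16 bits or the product must be assembled from several partial products.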
Besides the CPU instruction set, there are special commands that direct the assembler to do various jobs when assembling the code. You should learn about some of these assembler directives to be able to write an assembly program. Useful assembler directives include .set, .macro, .endm, .ref, .align, .word, .byte, and .include.
The .set directive defines a symbolic name. For example, you can have
    count .set 40
Then, the assembler replaces each occurrence of count with 40.
You have already seen how the .ref
directive is used to declare symbolic names defined in another
file. It is similar to the extern
declaration in C.
The .space directive reserves a block of memory with the specified number of bytes. For example, you can have
    buffer .space 128
to define a buffer of 128 bytes. The symbol buffer has the address of the first byte reserved by .space. The .bes directive is similar to .space, but the label has the address of the last byte reserved.
To put a constant value in memory, you can use .byte, .word, etc. If you have
    const1 .word 0x1234
the assembler places the word constant 0x1234 at a memory location, and const1 has the address of that memory location. .byte etc. work similarly.
Sometimes you need to place your data or code at a specific memory address boundary, such as a word or halfword boundary. You can use the .align directive to do this. For example, if you have
    .align 4
    buffer .space 128
    ...
then the first address of the reserved 128 bytes is at a word boundary in memory, that is, the 2 LSBs of the address (in binary) are 0. Similarly, for halfword alignment, you can write
    .align 2
    buffer .space 128
    ...
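The statement about the LSBs of an aligned address can be checked with a few lines of C. This is a generic sketch of power-of-two alignment arithmetic, assuming the assembler advances to the next boundary when the current address is not already aligned. `align_up` and `is_aligned` are illustrative names, not assembler features.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of boundary (a power of two). */
uint32_t align_up(uint32_t addr, uint32_t boundary)
{
    assert(boundary != 0u && (boundary & (boundary - 1u)) == 0u);
    return (addr + boundary - 1u) & ~(boundary - 1u);
}

/* An address is aligned when its low log2(boundary) bits are all zero. */
int is_aligned(uint32_t addr, uint32_t boundary)
{
    return (addr & (boundary - 1u)) == 0u;
}
```

For example, `align_up(0x1001, 4)` gives 0x1004, whose 2 LSBs are 0, while `align_up(0x1001, 2)` gives 0x1002, whose single LSB is 0.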
The .include directive is used to read the source lines from another file. The line
    .include "other.asm"
inputs the lines in other.asm at this location. This is useful when working with multiple files. Instead of making a project with multiple files, you can simply include the different files in one file.
Other assembler directives include .end, etc. You will learn about the macro directives .macro and .endm later.
How do you write comments in your assembly program? Anything that follows ; is considered a comment and is ignored by the assembler. For example,
    ; this is a comment
    ADD .L1 A1,A2,A3 ;add a1 and a2
Each instruction has particular functional units that can execute it. For a complete list of the instructions that can be executed in each functional unit, see Table 3-2 in the instruction set manual. Note that some instructions can be executed by several different functional units.
The C62x data path diagram shows how data and addresses can be transferred between the registers, functional units, and the external memory. If you observe carefully, the destination paths (marked as dst) going out of the .L1, .S1, .M1, and .D1 units are connected to register file A, so these units write their results to registers in A. Similarly, if the destination register is in register file B, one of the .L2, .S2, .M2, and .D2 units should be used.
Therefore if you know the instruction and the destination register, you should be able to assign the functional unit to it.
(Functional units): List all the functional units you can assign to each of these instructions:
ADD .?? A0,A1,A2
B .?? A1
MVKL .?? 000023feh, B0
LDW .?? *A10, A3
If you look at the data path diagram again, each functional unit must receive one of its source operands from the corresponding register file. For example, look at the following assembly instruction:
    ADD .L1 A0,B0,A1
The .L1 unit gets data from A0 (this is natural) and B0 (this is not) and stores the result in A1 (this is a must). The data path through which the content of B0 is conveyed to the .L1 unit is called the 1X cross path. When this happens, we add x to the functional unit to designate the cross path:
    ADD .L1x A0,B0,A1
Similarly, the data path from register file A to the .M2, .S2, and .L2 units is called the 2X cross path.
(Cross path): List all the functional units that can be assigned to each of the following instructions:
ADD .??? B0,A1,B2
MPY .??? A1,B2,A4
In fact, when you write an assembly program, you can omit the functional unit assignment altogether. The assembler figures out the available functional units and properly assigns them. However, manually assigned functional units help you to figure out where the actual execution takes place and how the data move around between register files and functional units. This is particularly useful when you put multiple instructions in parallel. We will learn about the parallel instructions later on.
Now you should know enough about C62x assembly to implement the inner product algorithm.
(Inner product): Write the complete inner product assembly program to compute the inner product of the following two vectors:
a[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, a }
x[] = { f, e, d, c, b, a, 9, 8, 7, 6 }
When an instruction is executed, it goes through several steps: fetching, decoding, and execution. If these steps are done one at a time for each instruction, the CPU resources are not fully utilized. To increase throughput, CPUs are pipelined, meaning that these steps are carried out at the same time for different instructions.
On the C6x processor, the instruction fetch consists of 4 phases: generate fetch address (F1), send address to memory (F2), wait for data (F3), and read opcode from memory (F4). Decoding consists of 2 phases: dispatching to functional units (D1) and decoding (D2). The execution step may consist of up to 6 phases (E1 to E6), depending on the instruction. For example, the multiply (MPY) instruction has 1 delay slot, resulting in 2 execution phases. Similarly, load (LDx) and branch (B) instructions have 4 and 5 delay slots, respectively.
When the outcome of an instruction is used by the next instruction, an appropriate number of NOPs (no operation, or delay) must be added after multiply (one NOP), load (four NOPs, or NOP 4), and branch (five NOPs, or NOP 5) instructions in order to allow the pipeline to operate properly. Otherwise, the following instructions are executed by the pipeline before the outcome of the current instruction is available, generating undesired results. The following code is an example of pipelined code with NOPs inserted:
1 MVK 40,A2
2 loop: LDH *A5++,A0
3 LDH *A6++,A1
4 NOP 4
5 MPY A0,A1,A3
6 NOP
7 ADD A3,A4,A4
8 SUB A2,1,A2
9 [A2] B loop
10 NOP 5
11 STH A4,*A7
In line 4, we need 4 NOPs because A1 is loaded by the LDH instruction in line 3 with 4 delay slots. After the 4 delay slots, the value of A1 is available to be used by MPY A0,A1,A3 in line 5. Similarly, we need 5 delay slots after the [A2] B loop instruction in line 9 to prevent the execution of STH A4,*A7 before the branch occurs.
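When debugging a loop like the one above, it can help to have a plain C reference to compare results against. The following is a sketch only. It assumes the accumulator register (A4 in the listing) starts at zero, that the two pointers address arrays of signed 16-bit samples, and that the final STH keeps just the low halfword of the accumulator. The names `dot_product` and `stored_halfword` are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reference for the loop: multiply 16-bit samples pairwise and accumulate
 * the products in a 32-bit register that wraps on overflow.
 */
uint32_t dot_product(const int16_t *a, const int16_t *x, size_t count)
{
    assert(count > 0);   /* the assembly loop body always runs at least once */
    uint32_t acc = 0;    /* assumed initial value of the accumulator */
    for (size_t n = 0; n < count; n++) {
        int32_t product = (int32_t)a[n] * (int32_t)x[n];   /* MPY */
        acc += (uint32_t)product;                          /* ADD */
    }
    return acc;
}

/* The halfword that a final STH would write to memory. */
uint16_t stored_halfword(uint32_t acc)
{
    return (uint16_t)(acc & 0xFFFFu);
}
```

Comparing the value stored by the assembly loop with `stored_halfword(dot_product(a, x, 40))` on the same input data is one way to confirm that the NOPs have been placed correctly.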
In the C6x Very Long Instruction Word (VLIW) architecture, several instructions are captured and processed simultaneously. The group of instructions fetched together is referred to as a Fetch Packet (FP). The Fetch Packet allows the C6x to fetch eight instructions simultaneously from on-chip memory. Among the 8 instructions fetched at the same time, several can be executed at the same time if they do not use the same CPU resources. Because the CPU has 8 separate functional units, a maximum of 8 instructions can be executed in parallel, although the combinations of parallel instructions are limited because they must not conflict with each other in using CPU resources. In an assembly listing, parallel instructions are indicated by the double pipe symbol (||). When writing assembly code, you can reduce the execution cycles of the code by designing it to maximize parallel execution of instructions (through proper functional unit assignments, etc.).
We have seen that the C62x CPU has 8 functional units. Each assembly instruction is executed in one of these 8 functional units, and it takes exactly one clock cycle for the execution. Then, while one instruction is being executed in one of the functional units, what are the other 7 functional units doing? Can other functional units execute other instructions at the same time?
The answer is YES. Thus, the CPU can execute a maximum of 8 instructions in each clock cycle. The instructions executed in the same clock cycle are called parallel instructions. Then, what instructions can be executed in parallel? A short answer is: as long as the parallel instructions do not use the same resource of the CPU, they can be put in parallel. For example, the following two instructions do not use the same CPU resource and they can be executed in parallel.
    ADD .L1 A0,A1,A2
    || ADD .L2 B0,B1,B2
Then, what are the constraints on the parallel instructions? Let's look at the resource constraints in more detail.
The functional unit constraint is simple: each functional unit can execute only one instruction per clock cycle. In other words, instructions using the same functional unit cannot be put in parallel.
If you look at the data path diagram of the C62x CPU, there exists only one cross path from the B register file to the .L1, .M1, and .S1 functional units. This means the cross path can be used only once per clock cycle. Thus, the following parallel instructions are invalid because the 1x cross path is used by both instructions.
    ADD .L1x A0,B1,A2
    || MPY .M1x A5,B0,A3
The same rule holds for the 2x cross path from the A register file to the .L2, .M2, and .S2 functional units.
The .D units are used for load and store instructions. If you examine the C62x data path diagram, the addresses for load/store can be obtained from either the A or B side, using the crisscrossing multiplexers to generate the addresses DA1 and DA2. Thus, an instruction such as
    LDW .D2 *B0, A1
is valid. The functional unit must be on the same side as the address source register (the address pointer is in B0, hence .D2 above), because the .D1 and .D2 units must receive their addresses from the A and B sides, respectively.
Another constraint is that while loading a register in one register file from memory, you cannot simultaneously store a register in the same register file to memory. For example, the following parallel instructions are invalid:
    LDW .D1 *A0, A1
    || STW .D2 A2, *B0
You cannot have more than four reads from the same register in each clock cycle. Thus, the following is invalid:
    ADD .L1 A1, A1, A2
    || MPY .M1 A1, A1, A3
    || SUB .D1 A1, A4, A5
A register cannot be written to more than once in a single clock cycle. However, note that the actual write to a register may not occur in the same clock cycle in which the instruction is issued. For example, the MPY instruction writes to the destination register in the next clock cycle. Thus, the following is valid:
    ADD .L1 A1, A1, A2
    || MPY .M1 A1, A1, A2
The following two instructions (not parallel) are invalid (why?):
    MPY .M1 A1, A1, A2
    ADD .L1 A3, A4, A2
Some of these write conflicts are very hard to detect and not detected by the assembler. Extra caution should be exercised with the instructions having nonzero delay slots.
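One way to reason about such write conflicts is to compute, for each instruction, the clock cycle in which its result is written. The C sketch below does exactly that under a simplifying assumption: the write happens in the issue cycle plus the number of delay slots. The `Instr` type and both function names are hypothetical, and the check ignores every other resource constraint.

```c
#include <assert.h>

typedef struct {
    int issue_cycle;   /* clock cycle in which the instruction is issued */
    int delay_slots;   /* 0 for ADD, 1 for MPY, 4 for loads */
    int dst;           /* destination register index */
} Instr;

/* Cycle in which the result is written to the destination register. */
int write_cycle(Instr i)
{
    return i.issue_cycle + i.delay_slots;
}

/* Two instructions conflict when they write the same register in the same cycle. */
int write_conflict(Instr a, Instr b)
{
    assert(a.delay_slots >= 0 && b.delay_slots >= 0);
    return a.dst == b.dst && write_cycle(a) == write_cycle(b);
}
```

Describing the parallel ADD and MPY above as `{0, 0, 2}` and `{0, 1, 2}` reports no conflict, whereas an MPY issued in cycle 0 followed by an ADD issued in cycle 1, both writing register 2, is flagged, which may help in checking your own answer to the question above.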
At this point, you might wonder why the C62x CPU allows parallel instructions and creates so much headache with the resource constraints, especially for the instructions with delay slots. And why not just make the MPY instruction take 2 clock cycles to execute, so that we can always use the multiplied result right after issuing it?
The reason is that by executing instructions in parallel, we can reduce the total execution time of the program. A well-written assembly program executes as many instructions as possible in each clock cycle to implement the desired algorithm.
The reason for allowing delay slots is that although it takes 2 clock cycles for an MPY instruction to generate the result, we can execute another instruction while waiting for the result. This way, you can reduce the clock cycles wasted while waiting for the result from slow instructions, thus increasing the overall execution speed.
However, how can we put instructions in parallel? Although there is a systematic way of doing it (which we will learn a bit later), at this point you can try to restructure your assembly code to execute as many instructions as possible in parallel. You should also try to execute other instructions in the delay slots of instructions such as MPY and LDW, instead of inserting NOPs to wait for those instructions to produce their results.
(Parallel instructions): Modify your assembly program for the inner product computation in the previous exercise to use parallel instructions as much as possible. Also, try to fill the delay slots as much as possible. Using Code Composer's profiling, compare the clock cycles necessary for executing the original and the modified programs. How many clock cycles could you save?