C62x Assembly Primer II

Module by: Hyeokho Choi.

Summary: You will learn more assembly instructions in this lab.

Typical Assembly Operations

Loading constants to registers

Quite often you need to load a register with a constant. The C62x instructions for this task are MVK, MVKL, and MVKH. Each of them loads a 16-bit constant into a register, so a full 32-bit constant takes two instructions: MVKL for the lower half followed by MVKH for the upper half. Read and understand the description of these instructions in the manual.
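The way a 32-bit constant is assembled from two 16-bit halves can be sketched with a small Python model. This is only an illustration, not C62x code: the function names `mvkl` and `mvkh` and the constants are chosen here for the example, and the model assumes that MVKL writes the sign-extended lower 16 bits while MVKH replaces only the upper 16 bits.

```python
MASK32 = 0xFFFFFFFF

def mvkl(cst):
    """Model of MVKL: take the low 16 bits of cst and sign-extend to 32 bits."""
    low = cst & 0xFFFF
    if low & 0x8000:              # bit 15 set: the upper half becomes all ones
        low |= 0xFFFF0000
    return low & MASK32

def mvkh(cst, reg):
    """Model of MVKH: replace the upper 16 bits of reg, keep the lower 16."""
    return ((cst & 0xFFFF0000) | (reg & 0x0000FFFF)) & MASK32

a9 = mvkl(0x12345678)             # A9 = 0x00005678
a9 = mvkh(0x12345678, a9)         # A9 = 0x12345678

b1 = mvkl(0x0000C001)             # bit 15 is set, so the upper half fills with ones
```

In this model the order matters: MVKL overwrites the upper half through sign extension, so MVKH has to come second.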

Exercise 1

(Loading constants): Write assembly instructions to do the following:

  1. Load the 16-bit constant 0xff12 to A1.
  2. Load the 32-bit constant 0xabcd45ef to B0.

Solution

Intentionally left blank.

Register moves, zeroing

The contents of one register can be copied to another register with the MV instruction. The ZERO instruction sets a register to zero. Learn how to use these instructions by reading the appropriate TI manual pages.

Loading from memory to registers

Because the C62x processor has a load/store architecture, you must first load the content of memory into a register before you can manipulate it. The basic assembly instructions for loading are LDB, LDH, and LDW, which load 8-, 16-, and 32-bit data from memory, respectively. (There are some variations of these instructions that handle the sign of the loaded value differently.) Read and understand how these instructions work.

To specify the memory location to load from, you first load its address into another register (used as an address pointer). Various addressing modes then let you specify the memory location in different ways. An addressing mode is the method by which an instruction calculates the location of an object in memory. The table below lists the ways to handle the address pointers in the C62x CPU. Note the similarity with C pointer manipulation.

Table 1
Syntax        Memory address accessed    Pointer modification
*R            R                          None
*++R          R+1                        Preincrement
*--R          R-1                        Predecrement
*R++          R                          Postincrement
*R--          R                          Postdecrement
*+R[disp]     R+disp                     None
*-R[disp]     R-disp                     None
*++R[disp]    R+disp                     Preincrement
*--R[disp]    R-disp                     Predecrement
*R++[disp]    R                          Postincrement
*R--[disp]    R                          Postdecrement

Here R is the pointer value before the instruction executes, and the offsets 1 and disp count elements rather than bytes.

The [disp] field specifies an offset counted in elements (words, halfwords, or bytes, depending on the instruction type), and it can be either a 5-bit constant or a register. The increment or decrement of the pointer register is likewise counted in elements, so LDW moves the pointer in steps of 4 bytes, LDH in steps of 2 bytes, and LDB in steps of 1 byte. The addressing modes with displacements are useful when a block of memory locations is accessed. Those with automatic increment/decrement are useful when a block is accessed consecutively, for example a buffer that stores signal samples for a digital filter.
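The scaling of the displacement by the element size can be sketched as a small Python model. The function name `effective_address`, the mode strings, and the base address 0x200 are made up for this illustration. The model assumes that pre-modification changes the pointer before the access and post-modification changes it afterwards.

```python
ELEMENT_SIZE = {"LDB": 1, "LDH": 2, "LDW": 4}    # bytes per element

def effective_address(instr, mode, base, disp=1):
    """Return (address accessed, new pointer value).

    mode is one of "*R", "*+R", "*-R", "*++R", "*--R", "*R++", "*R--".
    disp counts elements, so it is scaled by the element size.
    """
    offset = disp * ELEMENT_SIZE[instr]
    if mode == "*R":
        return base, base
    if mode == "*+R":
        return base + offset, base
    if mode == "*-R":
        return base - offset, base
    if mode == "*++R":
        return base + offset, base + offset
    if mode == "*--R":
        return base - offset, base - offset
    if mode == "*R++":
        return base, base + offset
    if mode == "*R--":
        return base, base - offset
    raise ValueError("unknown addressing mode: " + mode)

# The same displacement of 3 elements moves by 12, 6, or 3 bytes.
word_access = effective_address("LDW", "*+R", 0x200, 3)     # (0x20C, 0x200)
half_access = effective_address("LDH", "*++R", 0x200, 3)    # (0x206, 0x206)
byte_access = effective_address("LDB", "*R--", 0x200, 3)    # (0x200, 0x1FD)
```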

Exercise 2

(Load from memory): Assume the following values are stored in memory addresses:


	      100h  fe54  7834h
	      104h  3459  f34dh
	      108h  2ef5  7ee4h
	      10ch  2345  6789h
	      110h  ffff  eeddh
	      114h  3456  787eh
	      118h  3f4d  7ab3h
	    

Suppose A10 = 0000 0108h. Find the contents of A1 and A10 after executing each of the following instructions.

  1. LDW .D1 *A10, A1
  2. LDH .D1 *A10, A1
  3. LDB .D1 *A10, A1
  4. LDW .D1 *-A10[1], A1
  5. LDW .D1 *+A10[1], A1
  6. LDW .D1 *+A10[2], A1
  7. LDB .D1 *+A10[2], A1
  8. LDW .D1 *++A10[1], A1
  9. LDW .D1 *--A10[1], A1
  10. LDB .D1 *++A10[1], A1
  11. LDB .D1 *--A10[1], A1
  12. LDW .D1 *A10++[1], A1
  13. LDW .D1 *A10--[1], A1

Solution

Intentionally left blank.

Storing data to memory

Storing the register contents uses the same addressing modes. The assembly instructions used for storing are STB, STH, and STW. Read and understand these instructions in the TI manual.

Exercise 3

(Storing to memory): Write assembly instructions to store the 32-bit constant 53fe 23e4h to memory address 0000 0123h.

Solution

Intentionally left blank.

Sometimes it becomes necessary to access only part of the data stored in memory. For example, if you store the 32-bit word 0x11223344 at memory location 0x8000, the four bytes at addresses 0x8000, 0x8001, 0x8002, and 0x8003 together hold the value 0x11223344. If you then read the byte at memory location 0x8000, what value do you get?

The answer depends on the endian mode of the memory system. In the little endian mode, the lower memory addresses contain the least significant bytes of the data. Thus, the bytes stored at the four byte addresses are as shown in Table 2.

Table 2
0x8000 0x44
0x8001 0x33
0x8002 0x22
0x8003 0x11

In the big endian mode, the lower memory addresses contain the MSB part of the data. Thus, we have

Table 3
0x8000 0x11
0x8001 0x22
0x8002 0x33
0x8003 0x44

In this course, we use the little endian mode by default and all the lab programming must assume the little endian mode.
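The two byte layouts in Table 2 and Table 3 can be reproduced with a few lines of Python. This is only a sketch that uses Python's built-in `int.to_bytes`; the variable names are chosen for the example.

```python
word = 0x11223344
base = 0x8000

little = word.to_bytes(4, byteorder="little")   # lowest address holds the LSB
big = word.to_bytes(4, byteorder="big")         # lowest address holds the MSB

for offset in range(4):
    print(f"0x{base + offset:04X}  little: 0x{little[offset]:02X}  big: 0x{big[offset]:02X}")
```

Reading one byte at address 0x8000 therefore returns `little[0]` in the little endian mode and `big[0]` in the big endian mode.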

Exercise 4

(Little endian mode): What will be the value in A0 after executing the following assembly instructions? (Functional unit specifications are omitted.)

  1. MVKL 0x80000000, A10
  2. MVKH 0x80000000, A10
  3. MVKL 0x12345678, A9
  4. MVKH 0x12345678, A9
  5. STW A9, *A10
  6. LDB *+A10[2],A0

What will be the value in A0 if the system uses the big endian mode?

Solution

Intentionally left blank.

The addressing described above is the so-called linear addressing mode (the default upon reset), in which pointer offsets and increments/decrements are applied without bound. There is also a circular addressing mode that handles a finite-size buffer efficiently. You will implement circular buffers for the FIR filtering algorithm in the FIR filtering experiments later.

In the C62x CPU, it takes exactly one CPU clock cycle to execute each instruction. However, instructions such as LDW need to access the slow external memory, so the result of the load is not available immediately at the end of the execution. The extra clock cycles that pass before the result becomes available are called delay slots.

Example 1

For example, consider loading the content of the memory location pointed to by A10 into A1 and then moving the loaded data to A2. You might be tempted to write simple two-line assembly code as follows:


	LDW   .D1    *A10, A1
	MV    .D1    A1,A2
	  

What is wrong with the above code? The result of the LDW instruction is not available immediately after LDW is executed. As a consequence, the MV instruction copies the old content of A1 to A2 instead of the loaded value. To prevent this, we need to make the CPU wait until the result of the LDW instruction has been loaded into A1 before executing the MV instruction. For load instructions, 4 extra clock cycles are needed until the load result is valid. To make the CPU wait for 4 clock cycles, we insert 4 NOP (no operation) instructions between LDW and MV. Each NOP instruction makes the CPU idle for one clock cycle. The resulting code looks like this:


	LDW    .D1    *A10, A1
	NOP
	NOP
	NOP
	NOP
	MV     .D1    A1,A2
	  

or, more simply, you can write


	LDW    .D1    *A10, A1
	NOP    4
	MV     .D1    A1,A2
	  

Then why didn't the designers of the CPU simply make the LDW instruction take 5 clock cycles, rather than letting the programmer insert 4 NOPs? The answer is that you can insert instructions other than NOPs, as long as those instructions do not use the result of the LDW instruction. This way the CPU can execute additional instructions while waiting for the result of the LDW instruction to become valid, which can greatly reduce the total execution time of the program.
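The effect of the 4 load delay slots can be sketched with a toy Python model that issues one instruction per cycle and makes a load result visible only after its delay slots have elapsed. The function `run`, the tuple format of the instructions, and the register and memory values are all invented for this illustration. The model ignores everything else about the real pipeline.

```python
LOAD_DELAY_SLOTS = 4

def run(program, regs, memory):
    """Run a list of instruction tuples and return the final registers.

    ("LDW", pointer, dst)  dst receives memory[pointer] after 4 delay slots
    ("MV", src, dst)       dst = src, no delay slots
    ("NOP", n)             idle for n cycles
    """
    regs = dict(regs)
    pending = []          # (cycle in which the value becomes visible, register, value)
    cycle = 0

    def commit(now):
        # Make every load result whose delay slots have elapsed visible.
        for entry in [p for p in pending if p[0] <= now]:
            regs[entry[1]] = entry[2]
            pending.remove(entry)

    for instr in program:
        commit(cycle)
        op = instr[0]
        if op == "LDW":
            _, pointer, dst = instr
            value = memory[regs[pointer]]
            pending.append((cycle + 1 + LOAD_DELAY_SLOTS, dst, value))
            cycle += 1
        elif op == "MV":
            _, src, dst = instr
            regs[dst] = regs[src]
            cycle += 1
        elif op == "NOP":
            cycle += instr[1]
        else:
            raise ValueError("unsupported instruction: " + op)

    commit(float("inf"))  # let outstanding loads finish
    return regs

start = {"A10": 0x100, "A1": 0xDEAD, "A2": 0}
memory = {0x100: 0x1234}

too_early = run([("LDW", "A10", "A1"), ("MV", "A1", "A2")], start, memory)
waited = run([("LDW", "A10", "A1"), ("NOP", 4), ("MV", "A1", "A2")], start, memory)
```

In this model `too_early["A2"]` still holds the stale value of A1, while `waited["A2"]` holds the loaded word. Replacing the NOP with four unrelated instructions would give the same result for A2.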

More on instructions with delay slots

Table 3-5 in TI's instruction set description shows the execution of the instructions with delay slots in more detail. The instructions with delay slots are the multiply instruction (MPY, 1 delay slot), the load instructions (LDB, LDW, etc., 4 delay slots), and the branch instruction (B, 5 delay slots).

The functional unit latency indicates for how many clock cycles an instruction actually occupies a functional unit. All C62x instructions have a functional unit latency of 1, meaning that each functional unit is ready to execute the next instruction after 1 clock cycle regardless of the delay slots of the instruction. Therefore, the following instructions are valid:


	LDW    .D1    *A10, A4
	ADD    .D1    A1,A2,A3
	

Although the LDW instruction has not yet loaded the A4 register when the ADD is executed, the .D1 functional unit becomes available in the clock cycle right after the one in which LDW is executed.

To clarify the execution of instructions with delay slots, consider the following example of an LDW instruction. Assume A10 = 0x0100 and A2 = 1. With the post-increment addressing mode, the LDW loads A9 with the 32-bit word at address 0x0100 and advances A10 to 0x0104. The 3 MV instructions are not related to the LDW instruction. They do something else.


	  1     LDW    .D1    *A10++[A2], A9
	  2     MV     .L1    A10, A8
	  3     MV     .L1    A1, A10
	  4     MV     .L1    A1, A2
	  5     ...
	

We can ask several interesting questions at this point:

  1. What is the value loaded to A8? That is, in which clock cycle is the address pointer updated?
  2. Can we load the address offset register A2 before the LDW instruction finishes the actual loading?
  3. Is it legal to load A10 before the first LDW finishes loading the memory content to A9? That is, can we change the address pointer before the 4 delay slots elapse?

Here are the answers:

  1. Although it takes 4 extra clock cycles for the LDW instruction to load the memory content to A9, the address pointer and offset registers (A10 and A2) are read and updated in the clock cycle in which the LDW instruction is issued. Therefore, in line 2, A8 is loaded with the updated A10, that is, A10 = A8 = 0x0104.
  2. Because the LDW reads the A10 and A2 registers in the first clock cycle, you are free to change these registers afterwards without affecting the operation of the first LDW.
  3. This was already answered above.

A similar rule holds for the MPY and B (when using a register as a branch address) instructions. The MPY reads its source values in the first clock cycle and writes the multiplication result after the 2nd clock cycle. For B, the address pointer is read in the first clock cycle, and the actual branching occurs after the 5 delay slots have elapsed. Thus, after the first clock cycle, you are free to modify the source or the address pointer registers. For more details, refer to Table 3-5 in the instruction set description or read the description of the individual instruction.
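The point that source registers are read when the instruction is issued can be sketched for MPY with a short cycle-by-cycle Python trace. The register values and the helper `s16` are made up for this illustration. The model assumes the signed 16 LSB by 16 LSB multiply and treats each snapshot as the register state seen at the end of that step.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

regs = {"A1": 6, "A2": 7, "A3": 0, "A5": 100}
trace = []

# Cycle 0: MPY A1,A2,A3 is issued and captures its source operands now.
captured = s16(regs["A1"]) * s16(regs["A2"])
trace.append(dict(regs))

# Cycle 1 (the delay slot): MV A5,A1 overwrites a source register.
# A3 still holds its old value at this point.
regs["A1"] = regs["A5"]
trace.append(dict(regs))

# Cycle 2: the product of the captured operands is now visible in A3.
regs["A3"] = captured
trace.append(dict(regs))
```

Overwriting A1 in the delay slot does not disturb the product, because the multiplier already read the old A1 in cycle 0.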

Addition, Subtraction and Multiplication

There are several instructions for addition, subtraction, and multiplication on the C62x CPU. The basic instructions are ADD, SUB, and MPY. Learn about these instructions in the TI manual. ADD and SUB have 0 delay slots (the result of the operation is available to the very next instruction), but MPY has 1 delay slot (the result of the multiplication is valid after 1 additional clock cycle).

Exercise 5

(Add, subtract, and multiply): Write an assembly program to compute ( 0000 ef35h + 0000 33dch - 0000 1234h ) * 0000 0007h

Solution

Intentionally left blank.

Branching and conditional operations

Often you need to control the flow of program execution by branching to another block of code. The B instruction does this job in the C62x CPU. The branch address can be specified either as a displacement or in a register. Read and understand the B instruction in the manual. The B instruction has 5 delay slots, meaning that the instructions in the 5 clock cycles following the B are still executed before the branch takes effect.

In many cases, you execute the branch instruction conditionally, depending on the result of previous operations. For example, to implement a loop, you decrement the loop counter by 1 each time you run a set of instructions, and as long as the loop counter is not zero, you branch back to the beginning of the code block to repeat the loop. In the C62x CPU, this conditional branching is implemented using conditional operations. Although B is probably the instruction that is made conditional most often, all instructions in the C62x can be conditional.

Conditional instructions are written in code with square brackets, [ ], surrounding the condition register name. For example, the following B instruction is executed only if B0 is nonzero:


	[B0]    B     .S2x    A0
	

To execute an instruction conditionally when the condition register is zero, we put ! in front of the register name. For example, the following B instruction is executed only when B0 is zero:


	[!B0]    B     .S2x    A0
	

Not all registers can be used as condition registers. In the C62x CPU, the registers that can be tested in conditional operations are B0, B1, B2, A1, and A2.
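The meaning of the [reg] and [!reg] conditions, and their use for a countdown loop, can be sketched in Python. The function name `executes` and the counter value are chosen for this example. The model covers only the decision whether an instruction runs, not the branch delay slots.

```python
CONDITION_REGISTERS = {"A1", "A2", "B0", "B1", "B2"}

def executes(condition, regs):
    """Return True if an instruction guarded by condition would execute.

    condition is "" (unconditional), "B0" (run if B0 != 0),
    or "!B0" (run if B0 == 0).
    """
    if condition == "":
        return True
    negate = condition.startswith("!")
    name = condition.lstrip("!")
    if name not in CONDITION_REGISTERS:
        raise ValueError(name + " cannot be used as a condition register")
    is_zero = regs[name] == 0
    return is_zero if negate else not is_zero

# Countdown loop: run the body, decrement the counter,
# and take the guarded branch while the counter is still nonzero.
regs = {"B0": 3}
passes = 0
while True:
    passes += 1                     # loop body would go here
    regs["B0"] -= 1                 # decrement the loop counter
    if not executes("B0", regs):    # [B0] B loop: branch only while B0 != 0
        break
```

With the counter initialized to 3, the body runs 3 times and the loop falls through when B0 reaches zero.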

Exercise 6

(Simple loop): Write an assembly program that computes the summation $\sum_{n=1}^{100} n$ by implementing a simple loop.

Solution

Intentionally left blank.

Logical operations and bit manipulation

The logical operations and bit manipulations are accomplished by the AND, OR, XOR, CLR, SET, SHL, and SHR instructions. Read and understand the operations of these instructions.

Other assembly instructions

Other useful instructions include IDLE and compare instructions such as CMPEQ etc. Read and understand the operations of these instructions.

C62x instruction set summary

The set of instructions that can be performed in each functional unit is as follows (See Table 4, Table 5, Table 6 and Table 7). Please refer to TMS320C62x/C67x CPU and Instruction Set Reference Guide for detailed description of each instruction.

Table 4: .S Unit
Instruction Description
ADD(U) signed or unsigned integer addition without saturation
ADDK integer addition using signed 16-bit constant
ADD2 two 16-bit integer adds on upper and lower register halves
B branch using a register
CLR clear a bit field
EXT extract and sign-extend a bit field
MV move from register to register
MVC move between the control file and the register file
MVK move a 16-bit constant into a register and sign extend
MVKH move 16-bit constant into the upper bits of a register
NEG negate (pseudo-operation)
NOT bitwise NOT
OR bitwise OR
SET set a bit field
SHL arithmetic shift left
SHR arithmetic shift right
SSHL shift left with saturation
SUB(U) signed or unsigned integer subtraction without saturation
SUB2 two 16-bit integer subtractions on upper and lower register halves
XOR exclusive OR
ZERO zero a register (pseudo-operation)

Table 5: .L Unit
Instruction Description
ABS integer absolute value with saturation
ADD(U) signed or unsigned integer addition without saturation
AND bitwise AND
CMPEQ integer compare for equality
CMPGT(U) signed or unsigned integer compare for greater than
CMPLT(U) signed or unsigned integer compare for less than
LMBD leftmost bit detection
MV move from register to register
NEG negate (pseudo-operation)
NORM normalize integer
NOT bitwise NOT
OR bitwise OR
SADD integer addition with saturation to result size
SAT saturate a 40-bit integer to a 32-bit integer
SSUB integer subtraction with saturation to result size
SUBC conditional integer subtraction and shift - used for division
XOR exclusive OR
ZERO zero a register (pseudo-operation)

Table 6: .D Unit
Instruction Description
ADD(U) signed or unsigned integer addition without saturation
ADDA(B/H/W) integer addition using addressing mode
LD(B/H/W) load from memory with a 15-bit constant offset
MV move from register to register
ST(B/H/W) store to memory with a register offset or 5-bit unsigned constant offset
SUB(U) signed or unsigned integer subtraction without saturation
SUBA(B/H/W) integer subtraction using addressing mode
ZERO zero a register (pseudo-operation)

Table 7: .M Unit
Instruction Description
MPY (U/US/SU) signed or unsigned integer multiply 16lsb*16lsb
MPYH (U/US/SU) signed or unsigned integer multiply 16msb*16msb
MPYLH signed or unsigned integer multiply 16lsb*16msb
MPYHL signed or unsigned integer multiply 16msb*16lsb
SMPY (HL/LH/H) integer multiply with left shift and saturation

Useful assembler directives

Besides the CPU instruction set, there are special commands, called assembler directives, that tell the assembler to do various jobs when assembling the code. You should learn about some of these directives to be able to write an assembly program. Useful directives include .set, .macro, .endm, .ref, .align, .word, .byte, and .include.

The .set directive defines a symbolic name. For example, you can have


	count    .set    40
      

Then, the assembler replaces each occurrence of count with 40.

You have already seen how the .ref directive is used to declare symbolic names defined in another file. It is similar to the extern declaration in C.

The .space directive reserves a block of memory with the specified number of bytes. For example, you can write


	buffer    .space    128
      

to define a buffer of 128 bytes. The symbol buffer has the address of the first byte reserved by .space. The .bes directive is similar to .space, but the label has the address of the last byte reserved.

To put a constant value in memory, you can use .byte, .word, etc. If you write


	const1    .word    0x1234
      

the assembler places the word constant 0x1234 at a memory location, and const1 has the address of that memory location. The .byte directive and its relatives work similarly.

Sometimes you need to place your data or code at a specific memory address boundary, such as a word or halfword boundary. You can use the .align directive to do this. For example, if you write


	          .align    4
	buffer    .space    128
	          ...
      

then the first address of the reserved 128 bytes is at a word boundary in memory, that is, the 2 LSBs of the address (in binary) are 0. Similarly, for halfword alignment, you write


	          .align    2
	buffer    .space    128
	          ...
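The effect of alignment on an address can be sketched with a one-line Python helper. The name `align_up` and the sample address 0x8001 are made up for this illustration. The helper assumes that the assembler advances to the next multiple of the boundary, which must be a power of two.

```python
def align_up(address, boundary):
    """Round address up to the next multiple of boundary (a power of two)."""
    return (address + boundary - 1) & ~(boundary - 1)

word_aligned = align_up(0x8001, 4)      # 0x8004, the 2 LSBs are 0
half_aligned = align_up(0x8001, 2)      # 0x8002, the LSB is 0
```

An address that already lies on the boundary is left unchanged.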
      

The .include directive is used to read source lines from another file. The line


	          .include    "other.asm"
      

inserts the lines of other.asm at this location. This is useful when working with multiple files. Instead of making a project with multiple files, you can simply include the different files in one file.

Other assembler directives include .end, etc. You will learn about the macro directives .macro and .endm later.

How do you write comments in your assembly program? Anything that follows ; is treated as a comment and ignored by the assembler. For example,


	; this is a comment
	        ADD     .L1     A1,A2,A3      ; add A1 and A2
      

Assigning functional units

Each instruction can be executed only by particular functional units. For a complete list of the instructions that can be executed in each functional unit, see Table 3-2 in the instruction set manual. Note that some instructions can be executed by several different functional units.

The C62x data path diagram shows how data and addresses can be transferred between the registers, the functional units, and the external memory. If you look carefully, the destination paths (marked as dst) going out of the .L1, .S1, .M1, and .D1 units are connected to register file A.

note:

This means that any instruction with one of the A registers as its destination (the result of the operation is stored in one of the A registers) should be executed in one of these 4 functional units.
For the same reason, if an instruction has a B register as its destination, the .L2, .S2, .M2, or .D2 unit should be used.

Therefore, if you know the instruction and the destination register, you should be able to assign the functional unit to it.

Exercise 7

(Functional units): List all the functional units you can assign to each of these instructions:

  1. ADD .?? A0,A1,A2
  2. B .?? A1
  3. MVKL .?? 000023feh, B0
  4. LDW .?? *A10, A3

Solution

Intentionally left blank.

If you look at the data path diagram again, each functional unit must receive one of its source operands from the corresponding register file. For example, look at the following assembly instruction:


	ADD   .L1    A0,B0,A1
      

The .L1 unit gets data from A0 (this is natural) and B0 (this is not) and stores the result in A1 (this is a must). The data path through which the content of B0 is conveyed to the .L1 unit is called the 1X cross path. When it is used, we add x to the functional unit to designate the cross path:


	ADD    .L1x    A0,B0,A1
      

Similarly, the data path from register file A to the .M2, .S2, and .L2 units is called the 2X cross path.
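The rule for choosing the side and the cross path marker can be sketched as a small Python helper. The function name `unit_suffix` is made up for this illustration. It covers only arithmetic-style instructions with register operands, and it assumes that at most one source operand may come through the cross path.

```python
def unit_suffix(sources, dest):
    """Return "1", "1x", "2" or "2x" for an instruction with register operands.

    The side follows the destination register file (A -> 1, B -> 2).
    The x marker is added when one source comes from the other register file.
    """
    home = dest[0]                       # "A" or "B"
    side = "1" if home == "A" else "2"
    crossing = [reg for reg in sources if reg[0] != home]
    if len(crossing) > 1:
        raise ValueError("only one source operand can use the cross path")
    return side + ("x" if crossing else "")

same_side = unit_suffix(["A0", "A1"], "A2")     # "1"
with_cross = unit_suffix(["A0", "B0"], "A1")    # "1x", as in ADD .L1x A0,B0,A1
```

The helper returns only the side and the cross path marker. Which unit letters (.L, .S, .M, .D) are allowed still depends on the instruction.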

Exercise 8

(Cross path): List all the functional units that can be assigned to each of the following instructions:

  1. ADD .??? B0,A1,B2
  2. MPY .??? A1,B2,A4

Solution

Intentionally left blank.

In fact, when you write an assembly program, you can omit the functional unit assignment altogether. The assembler figures out the available functional units and properly assigns them. However, manually assigned functional units help you to figure out where the actual execution takes place and how the data move around between register files and functional units. This is particularly useful when you put multiple instructions in parallel. We will learn about the parallel instructions later on.

Writing the inner product program

Now you should know enough about C62x assembly to implement the inner product algorithm to compute $y = \sum_{n=1}^{10} a_n \times x_n$.

Exercise 9

(Inner product): Write the complete inner product assembly program to compute $y = \sum_{n=1}^{10} a_n \times x_n$, where $a_n$ and $x_n$ take the following values:


	    a[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, a }
	    x[] = { f, e, d, c, b, a, 9, 8, 7, 6 }
	  

The $a_n$ and $x_n$ values must be stored in memory and the inner product is computed by reading the memory contents.

Solution

Intentionally left blank.

Pipeline, Delay slots and Parallel instructions

Executing an instruction takes several steps: fetching, decoding, and execution. If these steps are done one at a time for each instruction, the CPU resources are not fully utilized. To increase the throughput, CPUs are pipelined, meaning that the steps of successive instructions overlap: while one instruction is being executed, the next is being decoded and a later one is being fetched.

On the C6x processor, the instruction fetch consists of 4 phases: generate fetch address (F1), send address to memory (F2), wait for data (F3), and read opcode from memory (F4). Decoding consists of 2 phases: dispatching to functional units (D1) and decoding (D2). The execution step may consist of up to 6 phases (E1 to E6), depending on the instruction. For example, the multiply (MPY) instruction has 1 delay slot, resulting in 2 execution phases. Similarly, the load (LDx) and branch (B) instructions have 4 and 5 delay slots, respectively.

When the outcome of an instruction is used by the next instruction, an appropriate number of NOPs (no operation or delay) must be added after multiply (one NOP), load (four NOPs, or NOP 4), and branch (five NOPs, or NOP 5) instructions to allow the pipeline to operate properly. Otherwise, the following instructions are executed before the outcome of the current instruction is available, generating undesired results. The following code is an example of pipelined code with NOPs inserted:


	 1             MVK    40,A2
	 2     loop:   LDH    *A5++,A0
	 3             LDH    *A6++,A1
	 4             NOP    4
	 5             MPY    A0,A1,A3
	 6             NOP
	 7             ADD    A3,A4,A4
	 8             SUB    A2,1,A2
	 9     [A2]    B      loop
	10             NOP    5
	11             STH    A4,*A7
      

In line 4, we need 4 NOPs because the A1 is loaded by the LDH instruction in line 3 with 4 delays. After 4 delays, the value of A1 is available to be used in the MPY A0,A1,A3 in line 5. Similarly, we need 5 delays after the [A2] B loop instruction in line 9 to prevent the execution of STH A4,*A7 before branching occurs.
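The cost of the NOPs in this loop can be estimated with a short Python calculation. It is a rough count that assumes one clock cycle per instruction, n cycles for NOP n, and no memory stalls. The list below simply transcribes lines 2 to 10 of the listing.

```python
# (mnemonic, cycles) for one pass through the loop body, lines 2 to 10
loop_body = [
    ("LDH", 1), ("LDH", 1), ("NOP 4", 4),
    ("MPY", 1), ("NOP", 1),
    ("ADD", 1), ("SUB", 1),
    ("B", 1), ("NOP 5", 5),
]
iterations = 40                       # loop counter loaded by MVK 40,A2

cycles_per_iteration = sum(cycles for _, cycles in loop_body)
nop_cycles = sum(cycles for name, cycles in loop_body if name.startswith("NOP"))
total_cycles = 1 + iterations * cycles_per_iteration + 1    # MVK + loop + STH

print(cycles_per_iteration, nop_cycles, total_cycles)
```

Under these assumptions, 10 of the 16 cycles in each iteration are spent in NOPs, which is the waste that filling delay slots and using parallel instructions aims to remove.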

In the C6x Very Long Instruction Word (VLIW) architecture, several instructions are fetched and processed simultaneously. A group of eight instructions fetched together from on-chip memory is called a Fetch Packet (FP). Among the 8 instructions fetched at the same time, several can be executed at the same time if they do not use the same CPU resources. Because the CPU has 8 separate functional units, at most 8 instructions can be executed in parallel, although the combinations of parallel instructions are limited because they must not conflict with each other in using CPU resources. In an assembly listing, parallel instructions are indicated by the double pipe symbol (||). When writing assembly code, you can reduce the execution cycles of the code by designing it to maximize parallel execution of instructions (through proper functional unit assignments, etc.).

Parallel instructions and constraints

We have seen that the C62x CPU has 8 functional units. Each assembly instruction is executed in one of these 8 functional units, and it takes exactly one clock cycle for the execution. Then, while one instruction is being executed in one of the functional units, what are the other 7 functional units doing? Can they execute other instructions at the same time?

The answer is YES. Thus, the CPU can execute at most 8 instructions in each clock cycle. The instructions executed in the same clock cycle are called parallel instructions. Then, which instructions can be executed in parallel? The short answer is: as long as the parallel instructions do not use the same resource of the CPU, they can be put in parallel. For example, the following two instructions do not use the same CPU resource, so they can be executed in parallel.


	       ADD    .L1    A0,A1,A2
	||     ADD    .L2    B0,B1,B2
      

Resource constraints

Then, what are the constraints on the parallel instructions? Let's look at the resource constraints in more detail.

Functional unit constraints

This is simple. Each functional unit can execute only one instruction per clock cycle. In other words, instructions using the same functional unit cannot be put in parallel.

Cross paths constraints

If you look at the data path diagram of the C62x CPU, there is only one cross path from the B register file to the .L1, .M1, and .S1 functional units. This means the cross path can be used only once per clock cycle. Thus, the following parallel instructions are invalid because the 1x cross path is used by both instructions.


	       ADD     .L1x    A0,B1,A2
	||     MPY     .M1x    A5,B0,A3
	  

The same rule holds for the 2x cross path from the A register file to the .L2, .M2, and .S2 functional units.

Loads and Stores constraints

The .D units are used for load and store instructions. If you examine the C62x data path diagram, the addresses for load/store can be obtained from either the A or the B side, using the crisscrossing multiplexers to generate the addresses DA1 and DA2. Thus, an instruction such as


	       LDW     .D2     *B0, A1
	  

is valid. The functional unit must be on the same side as the address source register (the address pointer is B0, and therefore .D2 is used above), because the .D1 and .D2 units must receive their addresses from the A and B sides, respectively.

Another constraint is that while loading a register in one register file from memory, you cannot simultaneously store a register in the same register file to memory. For example, the following parallel instructions are invalid:


	       LDW     .D1     *A0, A1
	||     STW     .D2     A2, *B0
	  

Constraints on register reads

You cannot have more than four reads from the same register in each clock cycle. Thus, the following is invalid because A1 is read five times:


	       ADD     .L1     A1, A1, A2
	||     MPY     .M1     A1, A1, A3
	||     SUB     .D1     A1, A4, A5
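Three of the constraints described so far (functional units, cross paths, and register reads) can be checked mechanically. The following Python sketch is a hypothetical helper, not part of any TI tool: the function name `check_packet` and the tuple format are chosen for this example, and the load/store constraints are not modeled.

```python
from collections import Counter

def check_packet(instructions):
    """Return a list of constraint violations for one set of parallel instructions.

    Each instruction is (unit, sources, dest), for example
    (".L1x", ["A0", "B1"], "A2"). Units are written in lowercase x notation.
    """
    problems = []

    # Functional unit constraint: each unit executes one instruction per cycle.
    units = Counter(unit.rstrip("x") for unit, _, _ in instructions)
    for unit, count in units.items():
        if count > 1:
            problems.append(f"unit {unit} used {count} times")

    # Cross path constraint: 1x and 2x can each be used once per cycle.
    cross = Counter(unit[2] for unit, _, _ in instructions if unit.endswith("x"))
    for side, count in cross.items():
        if count > 1:
            problems.append(f"cross path {side}x used {count} times")

    # Register read constraint: at most four reads of one register per cycle.
    reads = Counter(reg for _, sources, _ in instructions for reg in sources)
    for reg, count in reads.items():
        if count > 4:
            problems.append(f"register {reg} read {count} times")

    return problems

ok = check_packet([(".L1", ["A0", "A1"], "A2"), (".L2", ["B0", "B1"], "B2")])
cross_clash = check_packet([(".L1x", ["A0", "B1"], "A2"), (".M1x", ["A5", "B0"], "A3")])
too_many_reads = check_packet([
    (".L1", ["A1", "A1"], "A2"),
    (".M1", ["A1", "A1"], "A3"),
    (".D1", ["A1", "A4"], "A5"),
])
```

The three calls correspond to the valid pair of ADD instructions, the cross path example, and the register read example from the text.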
	  

Constraints on register writes

A register cannot be written to more than once in a single clock cycle. However, note that the actual writing to registers may not occur in the same clock cycle during which the instruction is executed. For example, the MPY instruction writes to the destination register in the next clock cycle. Thus, the following is valid:


	       ADD     .L1     A1, A1, A2
	||     MPY     .M1     A1, A1, A2
	  

The following two instructions (not parallel) are invalid (why?):


	       MPY     .M1     A1, A1, A2
	       ADD     .L1     A3, A4, A2
	  

Some of these write conflicts are very hard to detect and not detected by the assembler. Extra caution should be exercised with the instructions having nonzero delay slots.
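One way to reason about such write conflicts is to compute the clock cycle in which each instruction writes its destination: the issue cycle plus the number of delay slots. The following Python sketch does this for a hand-written schedule. The function name `write_conflicts` and the schedule format are made up for this illustration, and the delay slot values are the ones given in the text.

```python
DELAY_SLOTS = {"ADD": 0, "SUB": 0, "MPY": 1, "LDW": 4}

def write_conflicts(schedule):
    """Find registers that are written more than once in the same cycle.

    schedule is a list of (issue_cycle, mnemonic, dest).
    The result is written in cycle issue_cycle + delay slots.
    """
    writes = {}
    for issue_cycle, mnemonic, dest in schedule:
        key = (issue_cycle + DELAY_SLOTS[mnemonic], dest)
        writes.setdefault(key, []).append(mnemonic)
    return {key: names for key, names in writes.items() if len(names) > 1}

# ADD || MPY issued together: A2 is written in cycle 0 and again in cycle 1.
parallel_pair = write_conflicts([(0, "ADD", "A2"), (0, "MPY", "A2")])

# MPY followed by ADD: both results reach A2 in cycle 1.
sequential_pair = write_conflicts([(0, "MPY", "A2"), (1, "ADD", "A2")])
```

In this model the parallel pair produces no conflict, while the sequential pair has two writes to A2 landing in the same cycle.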

Ad-Hoc software pipelining

At this point, you might wonder why the C62x CPU allows parallel instructions and causes so much headache with the resource constraints, especially for the instructions with delay slots. And why not just make the MPY instruction take 2 clock cycles to execute, so that the multiplication result can always be used right after issuing it?

The reason is that by executing instructions in parallel, we can reduce the total execution time of the program. A well-written assembly program executes as many instructions as possible in each clock cycle to implement the desired algorithm.

The reason for allowing delay slots is that although it takes 2 clock cycles for an MPY instruction to generate its result, we can execute another instruction while waiting for the result. This way, you can reduce the clock cycles wasted while waiting for the results of slow instructions, thus increasing the overall execution speed.

However, how can we put instructions in parallel? Although there is a systematic way of doing it (which we will learn a bit later), at this point you can try to restructure your assembly code to execute as many instructions as possible in parallel. You should also try to execute other instructions in the delay slots of instructions such as MPY and LDW, instead of inserting NOPs to wait for the results.

Exercise 10

(Parallel instructions): Modify your assembly program for the inner product computation in the previous exercise to use parallel instructions as much as possible. Also, try to fill the delay slots as much as possible. Using the Code Composer profiler, compare the clock cycles needed to execute the original and the modified programs. How many clock cycles could you save?

Solution

Intentionally left blank.
