C6x Assembly Programming

Module by: David Waldo.

Based on: C62x Assembly Primer II by Hyeokho Choi

Summary: This module has details of assembly programming on the TI C6000 family of processors.

Introduction

This module contains details on how to program the TI C6000 family of processors in assembly. The C6000 family of processors has many variants. Therefore, it would not be possible to describe how to program all the processors here. However, the basic architecture and instructions are similar from one processor to another. They differ by the number of registers, the size of the registers, peripherals on the device, etc. This module will assume a device that has 32 general-purpose 32-bit registers and eight functional units, like the C6713 processor.

References

  • SPRU198: TMS320C6000 Programmer's Guide
  • SPRU186: TMS320C6000 Assembly Language Tools User's Guide 
  • SPRU733: TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide

Overview of C6000 Architecture

The C6000 consists of internal memory, peripherals (serial port, external memory interface, etc.), and most importantly, the CPU that has the registers and the functional units for execution of instructions. Although you don't need to care about the internal architecture of the CPU for compiling and running programs, you do need to understand how the CPU fetches and executes assembly instructions in order to write a highly optimized assembly program.

Core DSP Operation

In many DSP algorithms, the Sum of Products or Multiply-Accumulate (MAC) operations are very common. A DSP CPU is designed to handle the math-intensive calculations necessary for common DSP algorithms. For efficient implementation of the MAC operation, the C6000 CPU has two multipliers and each of them can perform a 16-bit multiplication in each clock cycle. For example, if we want to compute the dot product of two length-40 vectors a[n] and x[n], we need to compute:

$$y = \sum_{n=1}^{N} a[n]\,x[n] \qquad (1)$$

(For example, the FIR filtering algorithm is exactly the same as this dot product operation.) When a[n] and x[n] are stored in memory, we start from n=1, compute a[n]x[n], add it to y (y is initially 0), and repeat this up to n=40. In the C6000 assembly, this MAC operation can be written as:

    MPY .M a,x,prod
    ADD .L y,prod,y
    

Ignore .M and .L for now. Here, a, x, prod and y stand for the numbers being operated on (where they are actually kept is discussed under Register Files). The MPY instruction multiplies the two numbers a and x together and stores the result in prod. The ADD instruction adds the two numbers y and prod together and stores the result back in y.

Instructions

Below is the structure of a line of assembly code.

Table 1
Label:    Parallel bars (||)    [Condition]    Instruction    Unit    Operands    ;Comments

Labels identify a line of code or a variable and represent a memory address that contains either an instruction or data. A label must begin in the first column, and its first character must be a letter or an underscore (_) followed by a letter. Labels can include up to 32 alphanumeric characters.

An instruction that executes in parallel with the previous instruction signifies this with parallel bars (||). This field is left blank for an instruction that does not execute in parallel with the previous instruction.

Every instruction in the C6x can execute conditionally. There are five registers available for conditions (on the C67x processors): A1, A2, B0, B1, and B2. If the condition field is blank, the instruction always executes. Conditions take a form such as [A1], in which case the instruction executes only if A1 is not zero. This is handy for making loops where the counter is put in a register like A1 and counted down to zero. The condition is put on the branch instruction that branches back to the beginning of the loop.
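As a small illustration of the line structure in Table 1, the fragment below shows a label, parallel bars, and a condition in use. It is only a sketch: the label name start, the register choices, and the constant are assumptions made for the example.

```asm
start:      MVK     .S1     10, A1          ; label, instruction, unit, operands, comment
            ZERO    .L1     A4              ; no label, no condition
    ||      ZERO    .L2     B4              ; parallel bars: issued in the same cycle as the line above
    [A1]    MV      .S1     A1, A5          ; condition: executes only if A1 is nonzero
```

Note that the label starts in the first column, while the parallel bars and the condition are placed before the instruction mnemonic.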

Register Files

Where are the numbers stored in the CPU? In the C6000, the numbers used in operations are stored in registers. Because the registers are directly accessible through the data path of the CPU, accessing the registers is much faster than accessing data in the external memory.

The C6000 CPU has two register files (A and B). Each of these files consists of sixteen 32-bit registers (A0-A15 for file A and B0-B15 for file B). The general-purpose registers can be used for data, data address pointers, or condition registers. The following figure shows a block diagram of the C67x processors. This basic structure is similar to other processors in the C6000 family.

Figure 1: TMS320C67x DSP Block Diagram taken from SPRU733: TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide

The general-purpose register files support data ranging in size from 16-bit data through 40-bit fixed-point. Values larger than 32 bits, such as 40-bit long quantities, are stored in register pairs. In a register pair, the 32 LSBs of data are placed in an even-numbered register and the remaining 8 MSBs in the next upper register (which is always an odd-numbered register). In assembly language syntax, a colon between two register names denotes a register pair, and the odd-numbered register is specified first. For example, A1:A0 represents the register pair consisting of A0 and A1.

For now, let's focus on file A only. The registers in register file A are named A0 to A15. Each register can store a 32-bit binary number. Numbers such as a, x, prod and y above are stored in these registers. For example, register A0 stores a. For now, let's assume we interpret all 32-bit numbers stored in registers as unsigned integers. The range of values we can represent is therefore 0 to $2^{32}-1$. Let's assume the numbers a, x, prod and y are in the registers A0, A1, A3, A4, respectively. Then the above assembly instructions can be written specifically as:

    MPY .M1 A0,A1,A3
    ADD .L1 A4,A3,A4
    

The TI C6000 CPU has a load/store architecture. This means that all the numbers must be stored in the registers before being used as operands of the operations for instructions such as MPY and ADD. The numbers can be read from a memory location to a register (using, for example, LDW, LDB instructions) or a register can be loaded with a constant value. The content of a register can be stored to a memory location (using, for example, STW, STB instructions).

In addition to the general-purpose register files, the CPU has a separate register file for the control registers. The control registers are used to control various CPU functions such as addressing mode, interrupts, etc.

Functional Units

Where do the actual operations such as multiplication and addition take place? The C6000 CPU has several functional units that perform the actual operations. Each register file has 4 functional units named .M, .L, .S, and .D. The 4 functional units connected to register file A are named .M1, .L1, .S1, and .D1. Those connected to register file B are named .M2, .L2, .S2, and .D2. For example, the functional unit .M1 performs multiplication on operands that are in register file A. When the CPU executes the MPY .M1 A0, A1, A3 above, the functional unit .M1 takes the values stored in A0 and A1, multiplies them together, and stores the result in A3. The .M1 in MPY .M1 A0, A1, A3 indicates that this operation is performed in the .M1 unit. The .M1 unit has a 16-bit multiplier, and all multiplications are performed by the .M1 (or .M2) unit. The following diagram shows the basic architecture of the C6000 family and its functional units.

Figure 2: Functional Units of the 'C6x taken from SPRU198: TMS320C6000 Programmer's Guide

Similarly, the ADD operation can be executed by the .L1 unit. The .L1 can perform all the logical operations such as bitwise AND operation (AND instruction) as well as basic addition (ADD instruction) and subtraction (SUB instruction).

Exercise 1

Read the description of the ADD and MPY instructions in SPRU733 or similar document for the processor you are using. Write an assembly program that computes A0*(A1+A2)+A3.

Typical Assembly Operations

Loading constants to registers

Quite often you need to load a register with a constant. The C6x instructions you can use for this task are MVK, MVKL, and MVKH. Each of these instructions can load a 16-bit constant to a register. The MVKL instruction loads the LOWER 16-bits and the MVKH instruction loads the HIGH 16-bits into the register. In order to load 32-bit values into a register, both instructions are needed.
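The following sketch shows the usual pattern for these three instructions. The constants and destination registers are arbitrary values chosen for illustration. Because MVK and MVKL sign-extend the 16-bit value into the upper half of the register, MVKL is issued first and MVKH second.

```asm
        MVK     .S1     0x0123, A2          ; A2 = 0x00000123 (16-bit constant, sign-extended)
        MVKL    .S1     0x12345678, A3      ; lower 16 bits:  A3 = 0x00005678
        MVKH    .S1     0x12345678, A3      ; upper 16 bits:  A3 = 0x12345678
```

Writing the full 32-bit constant in both MVKL and MVKH lets the assembler pick out the relevant half, which avoids splitting the value by hand.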

Exercise 2

(Loading constants): Write assembly instructions to do the following:

  1. Load the 16-bit constant 0xff12 to A1.
  2. Load the 32-bit constant 0xabcd45ef to B0.

Register moves, zeroing

Contents of one register can be copied to another register by using the MV instruction. There is also the ZERO instruction to set a register to zero.
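A minimal sketch of both operations, with registers chosen arbitrarily for the example:

```asm
        MV      .L1     A0, A1          ; A1 = A0 (A0 is left unchanged)
        ZERO    .L1     A2              ; A2 = 0
```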

Loading from memory to registers

Because the C6x processor has the so-called load/store architecture, you must first load up the content of memory to a register to be able to manipulate it. The basic assembly instructions you use for loading are LDB, LDH, and LDW for loading up 8-, 16-, and 32-bit data from memory. (There are some variations to these instructions for different handling of the signs of the loaded values.)

However, to specify the address of the memory location to load from, you need to load up another register (used as an address index) and you can use various addressing modes to specify the memory locations in many different ways. The addressing mode is the method by which an instruction calculates the location of an object in memory. The table below lists all the possible different ways to handle the address pointers in the C6x CPU. Note the similarity with the C pointer manipulation.

Table 2: C6x addressing modes.
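Offsets are counted in elements (words, halfwords, or bytes, depending on the instruction); without [disp], the offset is 1.

Syntax        Memory address accessed    Pointer modification
*R            R                          None
*++R          R+1                        Preincrement (R = R+1 before the access)
*--R          R-1                        Predecrement (R = R-1 before the access)
*R++          R                          Postincrement (R = R+1 after the access)
*R--          R                          Postdecrement (R = R-1 after the access)
*+R[disp]     R+disp                     None
*-R[disp]     R-disp                     None
*++R[disp]    R+disp                     Preincrement (R = R+disp before the access)
*--R[disp]    R-disp                     Predecrement (R = R-disp before the access)
*R++[disp]    R                          Postincrement (R = R+disp after the access)
*R--[disp]    R                          Postdecrement (R = R-disp after the access)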

The [disp] specifies a number of elements (words, halfwords, or bytes, depending on the instruction type), and it can be either a 5-bit constant or a register. The increment/decrement of the index register is scaled by the element size in the same way, so a displacement of 1 moves the pointer by 4 bytes for LDW, 2 bytes for LDH, and 1 byte for LDB. The addressing modes with displacements are useful when a block of memory locations is accessed. Those with automatic increment/decrement are useful when a block is accessed consecutively to implement a buffer, for example, to store signal samples to implement a digital filter.
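To make the element scaling concrete, here is a short sketch. The starting address 0x0200 and the destination registers are assumptions made for the example and are unrelated to the exercise that follows.

```asm
        ; assume A10 = 0x0200 on entry (a hypothetical, word-aligned address)
        LDW     .D1     *+A10[2], A1    ; reads the word at 0x0208 (0x0200 + 2*4); A10 stays 0x0200
        LDH     .D1     *A10++[1], A2   ; reads the halfword at 0x0200; afterwards A10 = 0x0202
        LDB     .D1     *++A10[3], A3   ; A10 becomes 0x0205 first; reads the byte at 0x0205
        NOP     4                       ; loaded values are not usable immediately (see Delay slots)
```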

Exercise 3

(Load from memory): Assume the following values are stored in memory addresses:

Loc   32-bit value
100h  fe54  7834h
104h  3459  f34dh
108h  2ef5  7ee4h
10ch  2345  6789h
110h  ffff  eeddh
114h  3456  787eh
118h  3f4d  7ab3h
	    

Suppose A10 = 0000 0108h. Find the contents of A1 and A10 after executing each of the following instructions.

  1. LDW .D1 *A10, A1
  2. LDH .D1 *A10, A1
  3. LDB .D1 *A10, A1
  4. LDW .D1 *-A10[1], A1
  5. LDW .D1 *+A10[1], A1
  6. LDW .D1 *+A10[2], A1
  7. LDB .D1 *+A10[2], A1
  8. LDW .D1 *++A10[1], A1
  9. LDW .D1 *--A10[1], A1
  10. LDB .D1 *++A10[1], A1
  11. LDB .D1 *--A10[1], A1
  12. LDW .D1 *A10++[1], A1
  13. LDW .D1 *A10--[1], A1

Storing data to memory

Storing the register contents uses the same addressing modes. The assembly instructions used for storing are STB, STH, and STW.
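As a sketch of a store sequence, the fragment below builds an address in a register and writes part of another register to memory. The address 0x00008000 and the value 0x1234 are hypothetical and chosen only for illustration.

```asm
        MVKL    .S1     0x00008000, A5  ; A5 = 0xFFFF8000 for now (MVKL sign-extends)
        MVKH    .S1     0x00008000, A5  ; A5 = 0x00008000, the address to store to
        MVK     .S1     0x1234, A6      ; A6 = 0x00001234, the value to store
        STH     .D1     A6, *A5++       ; write the low halfword of A6 (0x1234), then A5 = A5 + 2
        STB     .D1     A6, *A5         ; write the low byte of A6 (0x34) at the new A5
```

The data register comes first and the address expression second, which is the reverse of the operand order used by the load instructions.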

Exercise 4

(Storing to memory): Write assembly instructions to store 32-bit constant 53fe 23e4h to memory address 0000 0123h.

Sometimes, it becomes necessary to access part of the data stored in memory. For example, if you store the 32-bit word 0x11223344 at memory location 0x8000, the four bytes at addresses 0x8000, 0x8001, 0x8002, and 0x8003 together contain the value 0x11223344. If you then read a single byte from memory location 0x8000, which byte value do you get?

The answer depends on the endian mode of the memory system. In the little endian mode, the lower memory addresses contain the LSB part of the data. Thus, the bytes stored in the four byte addresses will be as shown in Table 3.

Table 3: Little endian storage mode.
0x8000 0x44
0x8001 0x33
0x8002 0x22
0x8003 0x11

In the big endian mode, the lower memory addresses contain the MSB part of the data. Thus, we have

Table 4: Big endian storage mode.
0x8000 0x11
0x8001 0x22
0x8002 0x33
0x8003 0x44
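The difference can be observed with byte loads. The sketch below assumes that A4 already holds 0x8000 and that the word 0x11223344 was stored there with STW; LDBU is the unsigned variant of LDB.

```asm
        ; assume A4 = 0x8000 and the word 0x11223344 is stored at that address
        LDBU    .D1     *A4, A0         ; little endian: A0 = 0x44    big endian: A0 = 0x11
        LDBU    .D1     *+A4[3], A1     ; little endian: A1 = 0x11    big endian: A1 = 0x44
        NOP     4                       ; wait until both loaded values are valid
```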

Delay slots

In the C6x CPU, the instructions discussed here occupy a functional unit for exactly one CPU clock cycle. However, instructions such as LDW need to access memory, which is slower than the registers, so the result of the load is not available immediately at the end of that cycle. The extra cycles that must pass before the result can be used are called delay slots.

Example 1

For example, consider loading the content of the memory location pointed to by A10 into A1 and then moving the loaded data to A2. You might be tempted to write the following simple two-line assembly code:

	LDW    .D1    *A10, A1
	MV     .D1    A1,A2

What is wrong with the above code? The result of the LDW instruction is not available immediately after LDW is executed. As a consequence, the MV instruction does not copy the desired value of A1 to A2. To prevent this, we need to make the CPU wait until the result of the LDW instruction is correctly loaded into A1 before executing the MV instruction. Load instructions need 4 extra clock cycles before the load results are valid. To make the CPU wait for 4 clock cycles, we insert 4 NOP (no operation) instructions between LDW and MV. Each NOP instruction makes the CPU idle for one clock cycle. The resulting code looks like this:

	LDW    .D1    *A10, A1
	NOP
	NOP
	NOP
	NOP
	MV     .D1    A1,A2

or you can simply write

	LDW    .D1    *A10, A1
	NOP    4
	MV     .D1    A1,A2

Why didn't the designers of the CPU make the LDW instruction take 5 clock cycles to begin with, rather than let the programmer insert 4 NOPs? The answer is that you can insert instructions other than NOPs, as long as those instructions do not use the result of the LDW instruction above. By doing this, the CPU can execute additional instructions while waiting for the result of the LDW instruction to become valid, greatly reducing the total execution time of the entire program.
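As a rough sketch of this idea, the fragment below replaces two of the four NOPs with independent instructions. The ADD and SUB and the registers A3 to A8 are illustrative assumptions; any instructions that do not read A1 would serve.

```asm
        LDW     .D1     *A10, A1        ; A1 becomes valid after 4 delay slots
        ADD     .L1     A3, A4, A5      ; delay slot 1: independent work, does not read A1
        SUB     .L1     A6, A7, A8      ; delay slot 2: independent work
        NOP     2                       ; delay slots 3 and 4: nothing useful left to do
        MV      .L1     A1, A2          ; A1 now holds the loaded word
```

The sequence takes the same number of cycles as the version with NOP 4, but two of those cycles now do useful work.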

Table 5: Delay slots
Description     Instructions                         Delay slots
Single cycle    All instructions except following    0
Multiply        MPY, SMPY, etc.                      1
Load            LDB, LDH, LDW                        4
Branch          B                                    5

The functional unit latency indicates for how many clock cycles an instruction actually occupies a functional unit. All the instructions discussed in this module have a functional unit latency of 1, meaning that each functional unit is ready to execute the next instruction after 1 clock cycle, regardless of the delay slots of the instruction. Therefore, the following instructions are valid:


	LDW    .D1    *A10, A4
	ADD    .D1    A1,A2,A3
	

Although the LDW instruction has not yet loaded the A4 register when the ADD is executed, the .D1 functional unit becomes available in the clock cycle right after the one in which LDW is issued.

To clarify the execution of instructions with delay slots, consider the following example of the LDW instruction. Assume A10 = 0x0100 and A2 = 1, and that your intent is to load A9 with the 32-bit word at address 0x0100 and then advance the pointer A10 to 0x0104. The 3 MV instructions are not related to the LDW instruction. They do something else.


	  1     LDW    .D1    *A10++[A2], A9
	  2     MV     .L1    A10, A8
	  3     MV     .L1    A1, A10
	  4     MV     .L1    A1, A2
	  5     ...
	

We can ask several interesting questions at this point:

  1. What is the value loaded to A8? That is, in which clock cycle is the address pointer updated?
  2. Can we load the address offset register A2 before the LDW instruction finishes the actual loading?
  3. Is it legal to load to A10 before the first LDW finishes loading the memory content to A9? That is, can we change the address pointer before the 4 delay slots elapse?

Here are the answers:

  1. Although it takes an extra 4 clock cycles for the LDW instruction to load the memory content to A9, the address pointer and offset registers (A10 and A2) are read and updated in the clock cycle in which the LDW instruction is issued. Therefore, in line 2, A8 is loaded with the updated A10, that is, A10 = A8 = 0x0104.
  2. Yes. Because the LDW reads the A10 and A2 registers in the first clock cycle, you are free to change these registers afterwards without affecting the operation of the LDW.
  3. Yes, for the same reason as in answer 2: A10 was already read and updated in the clock cycle in which the LDW was issued.

The same reasoning holds for the MPY and B (when using a register as a branch address) instructions. The MPY reads its source values in the first clock cycle and writes the multiplication result after the 2nd clock cycle. For B, the address pointer is read in the first clock cycle, and the actual branching occurs after the 5 delay slots have elapsed. Thus, after the first clock cycle, you are free to modify the source or the address pointer registers. For more details, refer to Table 3-5 in the instruction set description or read the description of the individual instruction.

Addition, Subtraction and Multiplication

There are several instructions for addition, subtraction and multiplication on the C6x CPU. The basic instructions are ADD, SUB, and MPY. ADD and SUB have 0 delay slots (meaning the results of the operation are immediately available), but the MPY has 1 delay slot (the result of the multiplication is valid after an additional 1 clock cycle).
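The single delay slot of MPY can be seen in the following sketch. The operand values 6 and 7 and the registers used are arbitrary choices for illustration.

```asm
        MVK     .S1     6, A0           ; A0 = 6
        MVK     .S1     7, A1           ; A1 = 7
        MPY     .M1     A0, A1, A2      ; A2 = 42, valid after 1 delay slot
        NOP                             ; fill the MPY delay slot
        ADD     .L1     A2, A2, A3      ; A3 = 84, available to the very next instruction
```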

Exercise 5

(Add, subtract, and multiply): Write an assembly program to compute ( 0000 ef35h + 0000 33dch - 0000 1234h ) * 0000 0007h

Branching and conditional operations

Often you need to control the flow of the program execution by branching to another block of code. The B instruction does the job in the C6x CPU. The branch target can be specified either as a displacement or as an address stored in a register. The B instruction has 5 delay slots, meaning that the instructions issued in the 5 clock cycles following the B still execute before the branch actually takes effect.

In many cases, you execute the branch instruction conditionally, depending on the result of previous operations. For example, to implement a loop, you decrement the loop counter by 1 each time you run a set of instructions, and as long as the loop counter is not zero, you branch back to the beginning of the code block to repeat the loop operations. In the C6x CPU, this conditional branching is implemented using conditional operations. Although B is probably the instruction that is made conditional most often, all instructions in the C6x can be conditional.

Conditional instructions are represented in code by using square brackets, [ ], surrounding the condition register name. For example, the following B instruction is executed only if B0 is nonzero:


	[B0]    B     .S2X    A0    ; a register branch executes on .S2; X marks the cross path used to read A0
	

To execute an instruction conditionally when the condition register is zero, we use ! in front of the register. For example, the following B instruction is executed only if B0 is zero:


	[!B0]    B     .S2X    A0    ; a register branch executes on .S2; X marks the cross path used to read A0
	

Not all registers can be used as the condition registers. In the C62x and C67x devices, the registers that can be tested in conditional operations are B0, B1, B2, A1, A2.
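Putting the condition register and the branch delay slots together, a countdown loop can be sketched as follows. The label, the registers, the iteration count of 5, and the loop body are all assumptions made for the example. The functional unit of the SUB is left for the assembler to choose.

```asm
            MVK     .S1     5, A1           ; A1 = loop counter (A1 can be used as a condition register)
            MVK     .S1     3, A5           ; value added on every pass
            ZERO    .L1     A4              ; A4 = running total, starts at 0
loop:       ADD     .L1     A4, A5, A4      ; loop body: total = total + 3
            SUB     A1, 1, A1               ; decrement the counter
    [A1]    B       .S1     loop            ; branch back while A1 is nonzero
            NOP     5                       ; fill the 5 branch delay slots
            ; execution continues here once A1 reaches 0, with A4 = 15
```

The SUB has no delay slots, so the updated A1 is already visible when the conditional branch tests it in the next cycle.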

Exercise 6

(Simple loop): Write an assembly program computing the summation $\sum_{n=1}^{100} n$ by implementing a simple loop.

Logical operations and bit manipulation

The logical operations and bit manipulations are accomplished by the AND, OR, XOR, CLR, SET, SHL, and SHR instructions.
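A few of these instructions are sketched below on an arbitrary test pattern. The values and registers are chosen only for illustration, and the comments give the result of each operation.

```asm
        MVKL    .S1     0x0000F0F0, A0  ; A0 = 0xFFFFF0F0 for now (MVKL sign-extends)
        MVKH    .S1     0x0000F0F0, A0  ; A0 = 0x0000F0F0
        MVK     .S1     0x00FF, A1      ; A1 = 0x000000FF
        AND     .L1     A0, A1, A2      ; A2 = 0x000000F0
        OR      .L1     A0, A1, A3      ; A3 = 0x0000F0FF
        XOR     .L1     A0, A1, A4      ; A4 = 0x0000F00F
        SHL     .S1     A0, 4, A5       ; A5 = 0x000F0F00
        SHR     .S1     A0, 4, A6       ; A6 = 0x00000F0F (arithmetic shift; the sign bit is 0 here)
```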

Other assembly instructions

Other useful instructions include IDLE and compare instructions such as CMPEQ, etc.

C62x instruction set summary

The set of instructions that can be performed in each functional unit is as follows (See Table 6, Table 7, Table 8 and Table 9). Please refer to TMS320C62x/C67x CPU and Instruction Set Reference Guide for detailed description of each instruction.

Table 6: .S Unit
Instruction       Description
ADD(U)            signed or unsigned integer addition without saturation
ADDK              integer addition using signed 16-bit constant
ADD2              two 16-bit integer adds on upper and lower register halves
B                 branch using a register
CLR               clear a bit field
EXT               extract and sign-extend a bit field
MV                move from register to register
MVC               move between the control file and the register file
MVK               move a 16-bit constant into a register and sign extend
MVKH              move 16-bit constant into the upper bits of a register
NEG               negate (pseudo-operation)
NOT               bitwise NOT
OR                bitwise OR
SET               set a bit field
SHL               arithmetic shift left
SHR               arithmetic shift right
SSHL              shift left with saturation
SUB(U)            signed or unsigned integer subtraction without saturation
SUB2              two 16-bit integer subs on upper and lower register halves
XOR               exclusive OR
ZERO              zero a register (pseudo-operation)

Table 7: .L Unit
Instruction       Description
ABS               integer absolute value with saturation
ADD(U)            signed or unsigned integer addition without saturation
AND               bitwise AND
CMPEQ             integer compare for equality
CMPGT(U)          signed or unsigned integer compare for greater than
CMPLT(U)          signed or unsigned integer compare for less than
LMBD              leftmost bit detection
MV                move from register to register
NEG               negate (pseudo-operation)
NORM              normalize integer
NOT               bitwise NOT
OR                bitwise OR
SADD              integer addition with saturation to result size
SAT               saturate a 40-bit integer to a 32-bit integer
SSUB              integer subtraction with saturation to result size
SUBC              conditional integer subtraction and shift - used for division
XOR               exclusive OR
ZERO              zero a register (pseudo-operation)

Table 8: .D Unit
Instruction       Description
ADD(U)            signed or unsigned integer addition without saturation
ADDA(B/H/W)       integer addition using addressing mode
LD(B/H/W)         load from memory with a 15-bit constant offset
MV                move from register to register
ST(B/H/W)         store to memory with a register offset or 5-bit unsigned constant offset
SUB(U)            signed or unsigned integer subtraction without saturation
SUBA(B/H/W)       integer subtraction using addressing mode
ZERO              zero a register (pseudo-operation)

Table 9: .M Unit
Instruction       Description
MPY (U/US/SU)     signed or unsigned integer multiply 16lsb*16lsb
MPYH (U/US/SU)    signed or unsigned integer multiply 16msb*16msb
MPYLH             signed or unsigned integer multiply 16lsb*16msb
MPYHL             signed or unsigned integer multiply 16msb*16lsb
SMPY (HL/LH/H)    integer multiply with left shift and saturation
SMPY (HL/LH/H) integer multiply with left shift and saturation

Useful assembler directives

Besides the CPU instruction set, there are special commands, called assembler directives, that direct the assembler to do various jobs when assembling the code. Useful directives for passing various settings to the assembler include .set, .macro, .endm, .ref, .align, .word, .byte, and .include.

The .set directive defines a symbolic name. For example, you can have


	count    .set    40
      

The assembler replaces each occurrence of count with 40.

The .ref directive is used to declare symbolic names defined in another file. It is similar to the extern declaration in C.

The .space directive reserves a memory space with specified number of bytes. For example, you can have


	buffer    .space    128
      

to define a buffer of size 128 bytes. The symbol buffer has the address of the first byte reserved by .space. The .bes directive is similar to .space, but the label has the address of the last byte reserved.

To put a constant value in the memory, you can use .byte, .word, etc. If you have


	const1    .word    0x1234
      

the assembler places the word constant 0x1234 at a memory location, and const1 has the address of that memory location. The .byte directive and its relatives work similarly.

Sometimes you need to place your data or code at specific memory address boundaries such as word, halfword, etc. You can use the .align directive to do this. For example, if you have


	          .align    4
	buffer    .space    128
	          ...
      

the first address of the reserved 128 bytes is at a word boundary in memory, that is, the 2 LSBs of the address (in binary) are 0. Similarly, for half-word alignment, you can write


	          .align    2
	buffer    .space    128
	          ...
      

so that the first address of the reserved 128 bytes is at a half-word boundary, that is, the LSB of the address is 0.

The .include directive is used to read the source lines from another file. The instruction


	          .include    "other.asm"
      

will input the lines in other.asm at this location. This is useful when working with multiple files. Instead of making a project having multiple files, you can simply include these different files in one file.
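Several of the directives described above are often used together when laying out data. The sketch below is only an illustration: the symbol names coeffs and buffer, the values, and the use of an expression as the .space size are assumptions, so check the assembler user's guide for the exact rules.

```asm
count   .set    8                   ; symbolic constant
        .align  4                   ; start the following data on a word boundary
coeffs  .word   1, 2, 3, 4          ; four initialized 32-bit words; coeffs is the address of the first
buffer  .space  count*4             ; reserve room for 8 words (32 bytes)
```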

How do you write comments in your assembly program? Anything that follows ; is considered a comment and ignored by the assembler. For example,


	; this is a comment
	        ADD     .L1     A1,A2,A3      ;add a1 and a2
      

Assigning functional units

Each instruction has particular functional units that can execute it. Note that some instructions can be executed by several different functional units.

The following figure shows how data and addresses can be transferred between the registers, functional units and the external memory. If you observe carefully, the destination paths (marked as dst) going out of the .L1, .S1, .M1 and .D1 units are connected to register file A.

Note:

This means that any instruction with one of the A registers as destination (the result of operation is stored in one of A registers) should be executed in one of these 4 functional units.
For the same reason, if the instructions have B registers as destination, the .L2, .S2, .M2 and .D2 units should be used.

Figure 3: TMS320C67x DSP Block Diagram taken from SPRU733: TMS320C67x/C67x+ DSP CPU and Instruction Set Reference Guide

Therefore if you know the instruction and the destination register, you should be able to assign the functional unit to it.

Exercise 7

(Functional units): List all the functional units you can assign to each of these instructions:

  1. ADD .?? A0,A1,A2
  2. B .?? A1
  3. MVKL .?? 000023feh, B0
  4. LDW .?? *A10, A3

If you look at the figure again, each functional unit must receive one of the source data from the corresponding register file. For example, look at the following assembly instruction:


	ADD   .L1    A0,B0,A1
      

The .L1 unit gets data from A0 (this is natural) and B0 (this is not) and stores the result in A1 (this is a must). The data path through which the content of B0 is conveyed to the .L1 unit is called the 1X cross path. When this happens, we add x to the functional unit to designate the cross path:


	ADD    .L1x    A0,B0,A1
      

Similarly, the data path from register file A to the .M2, .S2 and .L2 units is called the 2X cross path.
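Both cross paths are sketched below. The registers are chosen arbitrarily for illustration. In each instruction only one source operand comes from the opposite register file, and the destination stays on the same side as the functional unit.

```asm
        ADD     .L1X    A0, B0, A1      ; A-side unit: B0 arrives over the 1X cross path
        SUB     .L2X    B4, A3, B5      ; B-side unit: A3 arrives over the 2X cross path, B5 = B4 - A3
```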

Exercise 8

(Cross path): List all the functional units that can be assigned to each of the instructions:

  1. ADD .??? B0,A1,B2
  2. MPY .??? A1,B2,A4

In fact, when you write an assembly program, you can omit the functional unit assignment altogether. The assembler figures out the available functional units and properly assigns them. However, manually assigned functional units help you to figure out where the actual execution takes place and how the data move around between register files and functional units. This is particularly useful when you put multiple instructions in parallel. We will learn about the parallel instructions later on.

Writing the inner product program

Now you should know enough about C6x assembly to implement the inner product algorithm to compute $y = \sum_{n=1}^{10} a_n \times x_n$.

Exercise 9

(Inner product): Write the complete inner product assembly program to compute $y = \sum_{n=1}^{10} a_n \times x_n$, where $a_n$ and $x_n$ take the following values:


	    a[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, a }
	    x[] = { f, e, d, c, b, a, 9, 8, 7, 6 }
	  

The $a_n$ and $x_n$ values must be stored in memory and the inner product is computed by reading the memory contents.

Pipeline, Delay slots and Parallel instructions

When an instruction is executed, it goes through several steps: fetching, decoding, and execution. If these steps are done one at a time for each instruction, the CPU resources are not fully utilized. To increase the throughput, CPUs are designed to be pipelined, meaning that these steps are carried out at the same time for different instructions.

On the C6x processor, the instruction fetch consists of 4 phases: generate fetch address (F1), send address to memory (F2), wait for data (F3), and read opcode from memory (F4). Decoding consists of 2 phases: dispatching to functional units (D1) and decoding (D2). The execution step may consist of up to 6 phases (E1 to E6), depending on the instruction. For example, the multiply (MPY) instruction has 1 delay slot, resulting in 2 execution phases. Similarly, load (LDx) and branch (B) instructions have 4 and 5 delay slots, respectively.

When the outcome of an instruction is used by the next instruction, an appropriate number of NOPs (no operation or delay) must be added after multiply (one NOP), load (four NOPs, or NOP 4), and branch (five NOPs, or NOP 5) instructions in order to allow the pipeline to operate properly. Otherwise, before the outcome of the current instruction is available (which is to be used by the next instruction), the next instructions are executed by the pipeline, generating undesired results. The following code is an example of pipelined code with NOPs inserted:


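	 1             MVK    40,A2        ; loop counter = 40
	 2     loop:   LDH    *A5++,A0     ; load a 16-bit operand via pointer A5, then advance A5
	 3             LDH    *A6++,A1     ; load a 16-bit operand via pointer A6, then advance A6
	 4             NOP    4            ; wait for the load delay slots
	 5             MPY    A0,A1,A3     ; A3 = A0 * A1
	 6             NOP                 ; wait for the multiply delay slot
	 7             ADD    A3,A4,A4     ; accumulate into A4 (assumed to be 0 on entry)
	 8             SUB    A2,1,A2      ; decrement the loop counter
	 9     [A2]    B      loop         ; branch back while A2 is nonzero
	10             NOP    5            ; fill the branch delay slots
	11             STH    A4,*A7       ; store the result via pointer A7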
      

In line 4, we need 4 NOPs because A1 is loaded by the LDH instruction in line 3, which has 4 delay slots. After these 4 delay slots, the value of A1 is available for use by MPY A0,A1,A3 in line 5. The single NOP in line 6 covers the one delay slot of MPY before ADD reads A3 in line 7. Similarly, we need 5 delay slots after the [A2] B loop instruction in line 9 to prevent STH A4,*A7 from executing before the branch takes effect.
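
To get a feel for what the NOPs cost, the cycles of the listing above can be tallied. The sketch below is a small C model used only for illustration. It assumes that one instruction issues per cycle, that NOP n occupies n cycles, and that there are no memory stalls. The function and constant names are hypothetical.

```c
#include <assert.h>

/* NOP cycles in one pass through the loop body (lines 2 to 10). */
enum {
    LOAD_DELAY   = 4,  /* NOP 4 after the two LDH instructions */
    MPY_DELAY    = 1,  /* NOP after MPY */
    BRANCH_DELAY = 5   /* NOP 5 after B */
};

/* Cycles for one loop iteration under the stated assumptions. */
int cycles_per_iteration(void)
{
    int issue = 6;  /* LDH, LDH, MPY, ADD, SUB, B: one cycle each */
    return issue + LOAD_DELAY + MPY_DELAY + BRANCH_DELAY;
}

/* Cycles for the whole listing: MVK, the loop, then STH. */
int total_cycles(int iterations)
{
    assert(iterations > 0);
    return 1 + iterations * cycles_per_iteration() + 1;
}
```

Under these assumptions each iteration takes 16 cycles, 10 of which are NOPs, and the 40 iterations of the listing take 642 cycles in total.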

In the C6x Very Long Instruction Word (VLIW) architecture, several instructions are fetched and processed simultaneously. The group of instructions fetched together is called a Fetch Packet (FP); it allows the C6x to fetch eight instructions simultaneously from on-chip memory. Of the 8 instructions fetched together, several can be executed at the same time, provided they do not use the same CPU resources. Because the CPU has 8 separate functional units, a maximum of 8 instructions can be executed in parallel, although the combinations of parallel instructions are limited because they must not conflict with each other in their use of CPU resources. In an assembly listing, parallel instructions are indicated by the double pipe symbol (||). By designing code to maximize parallel execution of instructions (through proper functional unit assignments, etc.), the number of execution cycles of the code can be reduced.

Parallel instructions and constraints

We have seen that the C62x CPU has 8 functional units. Each assembly instruction is executed in one of these 8 functional units, and it occupies that unit for exactly one clock cycle. Then, while one instruction is being executed in one of the functional units, what are the other 7 functional units doing? Can the other functional units execute other instructions at the same time?

The answer is YES. Thus, the CPU can execute a maximum of 8 instructions in each clock cycle. The instructions executed in the same clock cycle are called parallel instructions. Then, what instructions can be executed in parallel? The short answer is: as long as the parallel instructions do not use the same CPU resource, they can be put in parallel. For example, the following two instructions do not use the same CPU resource, so they can be executed in parallel.


	            ADD    .L1    A0,A1,A2
	     ||     ADD    .L2    B0,B1,B2
      

Resource constraints

Then, what are the constraints on the parallel instructions? Let's look at the resource constraints in more detail.

Functional unit constraints

This is simple. Each functional unit can execute only one instruction per clock cycle. In other words, instructions using the same functional unit cannot be put in parallel.

Cross paths constraints

If you look at the data path diagram of the C62x CPU, there is only one cross path from the B register file to the L1, M1 and S1 functional units. This means the cross path can be used only once per clock cycle. Thus, the following parallel instructions are invalid because both of them use the 1x cross path.


	               ADD     .L1x    A0,B1,A2
	        ||     MPY     .M1x    A5,B0,A3
	  

The same rule holds for the 2x cross path from the A register file to the L2, M2 and S2 functional units.

Loads and Stores constraints

The D units are used for load and store instructions. If you examine the C62x data path diagram, the addresses for load/store can be obtained from either the A or the B side using the multiplexers connecting crisscross to generate the addresses DA1 and DA2. Thus, an instruction such as


	               LDW     .D2     *B0, A1
	  

is valid. The functional unit must be on the same side as the address source register (address index in B0 and therefore D2 above), because D1 and D2 units must receive the addresses from A and B sides, respectively.

Another constraint is that while loading a register in one register file from memory, you cannot simultaneously store a register in the same register file to memory. For example, the following parallel instructions are invalid:


	               LDW     .D1     *A0, A1
	        ||     STW     .D2     A2, *B0
	  

Constraints on register reads

You cannot have more than four reads from the same register in each clock cycle. Thus, the following is invalid because A1 is read five times in the same cycle:


	               ADD     .L1     A1, A1, A2
	        ||     MPY     .M1     A1, A1, A3
	        ||     SUB     .D1     A1, A4, A5
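
The functional unit, cross path and register read constraints described so far can be sketched as a small checker. The C code below is only an illustrative model, not a TI tool: the struct layout, the register numbering (A0 to A15 as 0 to 15, B0 to B15 as 16 to 31) and the function name packet_is_valid are all assumptions, and the load/store and write constraints are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one instruction in a group of
 * parallel instructions. */
enum unit { UNIT_L1, UNIT_S1, UNIT_M1, UNIT_D1,
            UNIT_L2, UNIT_S2, UNIT_M2, UNIT_D2, NUM_UNITS };

#define REG_A(n) (n)         /* A0..A15 -> 0..15  */
#define REG_B(n) (16 + (n))  /* B0..B15 -> 16..31 */
#define NO_REG   (-1)        /* operand is not a register */
#define NUM_REGS 32
#define MAX_READS 4

struct insn {
    enum unit unit;  /* functional unit, e.g. UNIT_L1 for .L1 */
    bool cross;      /* true if the x suffix (cross path) is used */
    int src[2];      /* source register ids, or NO_REG */
};

/* Returns true if the n parallel instructions satisfy the three
 * modeled constraints. */
bool packet_is_valid(const struct insn *p, size_t n)
{
    int unit_used[NUM_UNITS] = {0};
    int cross_used[2] = {0};   /* [0] = 1x path, [1] = 2x path */
    int reads[NUM_REGS] = {0};

    for (size_t i = 0; i < n; i++) {
        assert(p[i].unit < NUM_UNITS);

        /* Each functional unit executes one instruction per cycle. */
        if (++unit_used[p[i].unit] > 1) return false;

        /* Each cross path can be used once per cycle. */
        int side = (p[i].unit >= UNIT_L2) ? 1 : 0;
        if (p[i].cross && ++cross_used[side] > 1) return false;

        /* At most four reads of the same register per cycle. */
        for (int s = 0; s < 2; s++) {
            int r = p[i].src[s];
            if (r == NO_REG) continue;
            assert(r >= 0 && r < NUM_REGS);
            if (++reads[r] > MAX_READS) return false;
        }
    }
    return true;
}
```

Encoding the three ADD, MPY and SUB instructions above as three entries that all read REG_A(1) makes the checker return false, while the earlier ADD .L1 and ADD .L2 pair is accepted.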
	  

Constraints on register writes

A register cannot be written to more than once in a single clock cycle. However, note that the actual writing to registers may not occur in the same clock cycle during which the instruction is executed. For example, the MPY instruction writes to the destination register in the next clock cycle. Thus, the following is valid:


	               ADD     .L1     A1, A1, A2
	        ||     MPY     .M1     A1, A1, A2
	  

The following two instructions (not parallel) are invalid (why?):


	               MPY     .M1     A1, A1, A2
	               ADD     .L1     A3, A4, A2
	  

Some of these write conflicts are very hard to detect and are not detected by the assembler. Extra caution should be exercised with instructions that have nonzero delay slots.
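
The timing rule behind these write conflicts can be sketched in a few lines. The C model below is only illustrative and assumes that an instruction issued in cycle c with d delay slots writes its destination register in cycle c + d (so ADD has d = 0 and MPY has d = 1). The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Cycle in which the destination register is written. */
int write_cycle(int issue_cycle, int delay_slots)
{
    assert(issue_cycle >= 0 && delay_slots >= 0);
    return issue_cycle + delay_slots;
}

/* Two writes to the same register conflict when they land
 * in the same clock cycle. */
bool writes_conflict(int issue_a, int delay_a, int issue_b, int delay_b)
{
    return write_cycle(issue_a, delay_a) == write_cycle(issue_b, delay_b);
}
```

For the parallel ADD and MPY pair, both issue in cycle 0, so the writes to A2 land in cycles 0 and 1 and do not conflict. For the sequential pair, MPY issues in cycle 0 and ADD in cycle 1, so both writes to A2 land in cycle 1.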

Ad-Hoc software pipelining

At this point, you might be wondering why the C6x CPU allows parallel instructions and causes so much headache with the resource constraints, especially for instructions with delay slots. Why not just make the MPY instruction take 2 clock cycles to execute, so that the multiplied result can always be used right after issuing it?

The reason is that by executing instructions in parallel, we can reduce the total execution time of the program. A well-written assembly program executes as many instructions as possible in each clock cycle to implement the desired algorithm.

The reason for allowing delay slots is that although it takes 2 clock cycles for an MPY instruction to generate its result, we can execute another instruction while waiting for that result. This way, you can reduce the clock cycles wasted while waiting for results from slow instructions, thus increasing the overall execution speed.

However, how can we put instructions in parallel? Although there is a systematic way of doing it (which we will learn a bit later), at this point you can try to restructure your assembly code to execute as many instructions as possible in parallel. You should also try to execute other instructions in the delay slots of instructions such as MPY and LDW, instead of inserting NOPs to wait for them to produce their results.
