TMS320C6000 Architectural and Programming Overview
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002, Chapter 2

Overview
– Interface between assembly language and architecture
– Architecture of the TMS320C6000
– Linear assembly
– Code optimization

General DSP System Block Diagram
[Figure: block diagram showing the central processing unit, internal memory, internal buses, external memory and peripherals.]

Implementation of Sum of Products (SOP)
SOP is the key element of most DSP algorithms, so let us write the code for this algorithm and, at the same time, discover the C6000 architecture.

$$Y = \sum_{n=1}^{N} a_n x_n = a_1 x_1 + a_2 x_2 + \dots + a_N x_N$$

Two basic operations are required for this algorithm:
(1) Multiplication
(2) Addition
Therefore two basic instructions are required. The implementation in this module will be done in assembly.
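
Before moving to assembly, it can help to have a plain C reference for the sum of products. The sketch below is only an illustration: the function name `sum_of_products` and the choice of `short` operands with a `long` accumulator are assumptions, not something the slides prescribe.

```c
#include <stddef.h>

/* Reference sum of products: y = a[0]*x[0] + ... + a[n-1]*x[n-1].
 * Hypothetical helper; 16-bit operands are assumed. */
long sum_of_products(const short *a, const short *x, size_t n)
{
    long y = 0;                     /* accumulator */
    for (size_t i = 0; i < n; i++) {
        y += (long)a[i] * x[i];     /* one multiply and one add per term */
    }
    return y;
}
```

Each pass through the loop performs exactly the two operations named above, which is why the assembly version needs a multiply instruction and an add instruction.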

Multiply (MPY)
The multiplication of a1 by x1 is done in assembly by the following instruction:
        MPY       a1, x1, Y
This instruction is performed by a multiplier unit that is called “.M”.

Multiply (.M unit)
$$Y = \sum_{n=1}^{40} a_n x_n$$
The .M unit performs multiplications in hardware:
        MPY  .M   a1, x1, Y
Note: a 16-bit by 16-bit multiplication provides a 32-bit result; a 32-bit by 32-bit multiplication provides a 64-bit result.
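
The note on result widths can be checked with ordinary C integer types. This is a sketch under the assumption that the operands are signed; `mpy16` and `mpy32` are hypothetical names, not C6000 instructions or intrinsics.

```c
#include <stdint.h>

/* 16-bit x 16-bit signed multiply: the product always fits in 32 bits. */
int32_t mpy16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* 32-bit x 32-bit signed multiply: the product needs up to 64 bits. */
int64_t mpy32(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
```

The largest magnitudes occur when both operands are the most negative value, which is the case worth testing when choosing the result width.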

Addition (.L unit)
Which unit performs the addition?
        MPY  .M   a1, x1, prod
        ADD  .?   Y, prod, Y
The addition is performed by the .L unit:
        MPY  .M   a1, x1, prod
        ADD  .L   Y, prod, Y
RISC processors such as the C6000 use registers to hold the operands, so let's change this code.

Register File A
Let us correct this by replacing a1, x1, prod and Y by registers:

Register   Holds
A0         a1
A1         x1
A3         prod
A4         Y

Specifying Register Names
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
The registers A0, A1, A3 and A4 contain the values to be used by the instructions. Register File A contains 16 registers (A0 to A15), each 32 bits wide.

Data loading
Q: How do we load the operands into the registers?
A: The operands are loaded from memory into the registers using the .D unit.
It is worth noting at this stage that the only way to access memory is through the .D unit.
[Figure: units .M, .L and .D connected to Register File A; the .D unit links the registers to data memory.]

Load Instructions (LDB, LDH, LDW, LDDW)
Q: Which instruction(s) can be used for loading operands from the memory to the registers?
A: The load instructions.

Using the Load Instructions
Before using the load unit you have to be aware that this processor is byte addressable, which means that each byte has a unique address. Addresses are 32 bits wide, so the highest address is FFFFFFFF.
The syntax for the load instruction is:
        LD        *Rn, Rm
where Rn is a register that contains the address of the operand to be loaded and Rm is the destination register.
How many bytes are loaded into the destination register? That depends on the instruction you choose:
– LDB: loads one byte (8 bits)
– LDH: loads a half word (16 bits)
– LDW: loads a word (32 bits)
– LDDW: loads a double word (64 bits)
Note: LD on its own does not exist.

Using the Load Instructions: Example
[Figure: data memory drawn 16 bits wide, with addresses up to FFFFFFFF, holding the byte values 0xB 0xA, 0xD 0xC, 0x1 0x2, 0x3 0x4, 0x5 0x6, 0x7 0x8.]
If we assume that A5 = 0x4 then:
(1) LDB   *A5, A7       ; gives A7 = 0x
(2) LDH   *A5, A7       ; gives A7 = 0x
(3) LDW   *A5, A7       ; gives A7 = 0x
(4) LDDW  *A5, A7:A6    ; gives A7:A6 = 0x
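
The effect of the four load widths on a byte-addressable memory can be sketched in C. Everything here is an assumption made for illustration: the memory contents in `mem`, the little-endian byte order, the helper names, and the decision to return raw bits without the sign extension that a signed load would apply.

```c
#include <stdint.h>

/* Hypothetical memory image; the contents are illustrative only. */
static const uint8_t mem[16] = {
    0x0A, 0x0B, 0x0C, 0x0D, 0x01, 0x02, 0x03, 0x04,
    0x05, 0x06, 0x07, 0x08, 0x00, 0x00, 0x00, 0x00
};

/* Assemble nbytes bytes starting at addr, least significant byte
 * first (little-endian order is assumed). */
static uint64_t load_le(uint32_t addr, unsigned nbytes)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < nbytes; i++) {
        v |= (uint64_t)mem[addr + i] << (8 * i);
    }
    return v;
}

uint64_t load8(uint32_t addr)  { return load_le(addr, 1); } /* like LDB  */
uint64_t load16(uint32_t addr) { return load_le(addr, 2); } /* like LDH  */
uint64_t load32(uint32_t addr) { return load_le(addr, 4); } /* like LDW  */
uint64_t load64(uint32_t addr) { return load_le(addr, 8); } /* like LDDW */
```

With the pointer set to address 4, the same starting address yields one, two, four or eight bytes depending only on which load is chosen.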

Question: If data can only be accessed by the load instruction and the .D unit, how can we load the register pointer Rn in the first place?

Loading the Pointer Rn
The instruction MVKL moves a 16-bit constant into a register as shown below:
        MVKL .?   a, A5         ; 'a' is a constant or label
How many bits represent a full address? 32 bits.
So why does the instruction not allow a 32-bit move? All instructions are 32 bits wide (see instruction opcode), so a 32-bit constant cannot fit in the instruction alongside the opcode.
To solve this problem another instruction is available, MVKH, which loads the upper 16 bits of the register:
        MVKH .?   a, A5         ; 'a' is a constant or label
Finally, to move the 32-bit address to a register we can use:
        MVKL      a, A5         ; A5 = lower half of a (al), sign extended
        MVKH      a, A5         ; A5 = ah:al

Loading the Pointer Rn: Examples
Always use MVKL then MVKH. Look at the following examples.

Example 1 (MVKL then MVKH), starting with A5 = 0x
        MVKL      0x1234FABC, A5    ; A5 = 0xFFFFFABC (sign extension)
        MVKH      0x1234FABC, A5    ; A5 = 0x1234FABC ; OK

Example 2 (MVKH then MVKL)
        MVKH      0x1234FABC, A5    ; A5 = 0x
        MVKL      0x1234FABC, A5    ; A5 = 0xFFFFFABC ; Wrong
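
Why the order matters can be modelled in a few lines of C. This is a sketch, not the instruction definition: `mvkl` and `mvkh` are hypothetical functions that mimic the behaviour described on the slides (MVKL writes the sign-extended lower half, MVKH replaces only the upper half).

```c
#include <stdint.h>

/* Model of MVKL: the register receives the lower 16 bits of k,
 * sign extended to 32 bits. The previous register value is lost. */
uint32_t mvkl(uint32_t k)
{
    uint32_t low = k & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* Model of MVKH: the upper 16 bits come from k, the lower 16 bits
 * of the register are kept. */
uint32_t mvkh(uint32_t reg, uint32_t k)
{
    return (k & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}
```

Applying `mvkl` first and `mvkh` second gives the full constant, while the reverse order lets the sign extension of `mvkl` overwrite the upper half that `mvkh` had just written.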

LDH, MVKL and MVKH
[Figure: units .M, .L and .D connected to Register File A (a, x, prod, Y; 32 bits) and data memory.]
        MVKL      pt1, A5
        MVKH      pt1, A5
        MVKL      pt2, A6
        MVKH      pt2, A6
        LDH  .D   *A5, A0
        LDH  .D   *A6, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4

Creating a loop
So far we have implemented the SOP for one tap only, i.e. Y = a1 * x1. So let's create a loop so that we can implement the SOP for N taps.
The C6000 processors have no dedicated looping instructions such as block repeat. The loop is created using the B (branch) instruction.
The steps for creating a loop are:
1. Create a label to branch to.
2. Add a branch instruction, B.
3. Create a loop counter.
4. Add an instruction to decrement the loop counter.
5. Make the branch conditional based on the value in the loop counter.

1. Create a label to branch to
2. Add a branch instruction, B
        MVKL      pt1, A5
        MVKH      pt1, A5
        MVKL      pt2, A6
        MVKH      pt2, A6
loop    LDH  .D   *A5, A0
        LDH  .D   *A6, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        B    .?   loop

Which unit is used by the B instruction?
The B instruction uses the .S unit, and so do MVKL and MVKH.

3. Create a loop counter
4. Decrement the loop counter
        MVKL .S   pt1, A5
        MVKH .S   pt1, A5
        MVKL .S   pt2, A6
        MVKH .S   pt2, A6
        MVKL .S   count, B0     ; B registers will be introduced later
loop    LDH  .D   *A5, A0
        LDH  .D   *A6, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .S   B0, 1, B0
        B    .S   loop

5. Make the branch conditional based on the value in the loop counter
What is the syntax for making an instruction conditional?
        [condition]   Instruction   Label
e.g.
        [B1]    B    loop
(1) The condition can be one of the following registers: A1, A2, B0, B1, B2.
(2) Any instruction can be conditional.
The condition can be inverted by adding the exclamation symbol “!” as follows:
        [!condition]  Instruction   Label
e.g.
        [!B0]   B    loop       ; branch if B0 = 0
        [B0]    B    loop       ; branch if B0 != 0

5. Make the branch conditional
        MVKL .S1  pt1, A5
        MVKH .S1  pt1, A5
        MVKL .S1  pt2, A6
        MVKH .S1  pt2, A6
        MVKL .S2  count, B0
loop    LDH  .D   *A5, A0
        LDH  .D   *A6, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .S   B0, 1, B0
 [B0]   B    .S   loop
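
The counter, decrement and conditional branch map closely onto a C `goto`. The sketch below is only a model of the control flow: `run_loop` is a hypothetical name, the loop body is replaced by an iteration counter, and `count` is assumed to be at least 1 because the body runs before the counter is tested.

```c
/* Model of the loop structure: returns how many times the body ran.
 * Assumes count >= 1. */
int run_loop(int count)
{
    int iterations = 0;
    int b0 = count;             /* MVKL count, B0 */
loop:
    iterations++;               /* loop body (loads, MPY, ADD) */
    b0 = b0 - 1;                /* SUB B0, 1, B0 */
    if (b0 != 0) goto loop;     /* [B0] B loop: branch while B0 != 0 */
    return iterations;
}
```

Because the test sits at the bottom of the loop, a counter loaded with 40 gives exactly 40 passes.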

More on the Branch Instruction (1)
With this processor every instruction is encoded in 32 bits. Because the opcode of the B instruction must fit in the same 32 bits, the label has to be encoded in fewer than 32 bits.
Case 1: B .S1 label
Relative branch. The label is limited to a signed 21-bit relative address.
[Figure: 32-bit B instruction word containing a 21-bit relative address field.]

More on the Branch Instruction (2)
By specifying a register as an operand instead of a label, it is possible to have an absolute branch. Because the target address is held in a 32-bit register, it is not limited by the width of an immediate field.
Case 2: B .S2 register
Absolute branch. Operates on .S2 ONLY!
[Figure: 32-bit B instruction word containing a 5-bit register code.]
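
The limit imposed by a signed immediate field can be made concrete with a small range check. This is a generic sketch: `fits_signed_field` is a hypothetical helper, and the slides do not say in which units the 21-bit offset is counted, so the example only tests whether a value fits in the field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if value can be represented in a two's complement
 * field of the given width (1 <= bits <= 63). */
bool fits_signed_field(int64_t value, unsigned bits)
{
    int64_t min = -((int64_t)1 << (bits - 1));
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    return value >= min && value <= max;
}
```

For a 21-bit field the representable range is -1048576 to 1048575; a target further away than that would need the register form of the branch.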

Testing the code
This code performs the following operations:
        a0*x0 + a0*x0 + a0*x0 + … + a0*x0
However, we would like to perform:
        a0*x0 + a1*x1 + a2*x2 + … + aN*xN

Modifying the pointers
The same two operands are loaded on every pass because A5 and A6 never change. The solution is to modify the pointers A5 and A6.

Indexing Pointers

Description        Syntax          Pointer modified
Pointer            *R              No
+ Pre-offset       *+R[disp]       No
- Pre-offset       *-R[disp]       No
Pre-increment      *++R[disp]      Yes
Pre-decrement      *--R[disp]      Yes
Post-increment     *R++[disp]      Yes
Post-decrement     *R--[disp]      Yes

– Pointer: the pointer is used but not modified.
– Pre-offset: the pointer is modified BEFORE being used and RESTORED to its previous value.
– Pre-increment and pre-decrement: the pointer is modified BEFORE being used and NOT RESTORED to its previous value.
– Post-increment and post-decrement: the pointer is modified AFTER being used and NOT RESTORED to its previous value.

[disp] specifies the number of elements; the element size is DW (64-bit), W (32-bit), H (16-bit) or B (8-bit). disp is a register or a 5-bit constant. R can be any register.
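
The difference between the addressing modes is easiest to see by tracking the pointer value before and after each access. The C sketch below models three of the modes for half-word (16-bit) elements; the function names are hypothetical and the pointer is passed by address so that the caller can observe whether it changed.

```c
#include <stddef.h>

/* *+R[disp]: the offset is applied for this access only;
 * the pointer keeps its value. */
short load_pre_offset(const short **r, ptrdiff_t disp)
{
    return *(*r + disp);
}

/* *++R[disp]: the pointer is advanced first, then used. */
short load_pre_increment(const short **r, ptrdiff_t disp)
{
    *r += disp;
    return **r;
}

/* *R++[disp]: the pointer is used first, then advanced. */
short load_post_increment(const short **r, ptrdiff_t disp)
{
    short value = **r;
    *r += disp;
    return value;
}
```

Pointer arithmetic in C already scales by the element size, which mirrors the way [disp] counts elements rather than bytes.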

Modifying and testing the code
With post-incremented pointers, this code now performs the following operations:
        a0*x0 + a1*x1 + a2*x2 + … + aN*xN

        MVKL .S1  pt1, A5
        MVKH .S1  pt1, A5
        MVKL .S1  pt2, A6
        MVKH .S1  pt2, A6
        MVKL .S2  count, B0
loop    LDH  .D   *A5++, A0
        LDH  .D   *A6++, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .S   B0, 1, B0
 [B0]   B    .S   loop

Store the final result
The result is written to memory with a store instruction, STH, through the pointer A7. The pointer A7 must be initialised before it is used.
What is the initial value of A4? A4 is used as an accumulator, so it needs to be reset to zero.

        MVKL .S1  pt1, A5
        MVKH .S1  pt1, A5
        MVKL .S1  pt2, A6
        MVKH .S1  pt2, A6
        MVKL .S1  pt3, A7
        MVKH .S1  pt3, A7
        MVKL .S2  count, B0
        ZERO .L   A4
loop    LDH  .D   *A5++, A0
        LDH  .D   *A6++, A1
        MPY  .M   A0, A1, A3
        ADD  .L   A4, A3, A4
        SUB  .S   B0, 1, B0
 [B0]   B    .S   loop
        STH  .D   A4, *A7
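
As a cross-check, the finished routine can be mirrored line by line in C. This is a sketch under assumptions: `sop_store` is a hypothetical name, `count` is assumed to be at least 1, and the final store keeps only the low 16 bits of the accumulator because STH writes a half word.

```c
#include <stdint.h>

/* C model of the final assembly routine. Assumes count >= 1. */
void sop_store(const int16_t *pt1, const int16_t *pt2, int count,
               uint16_t *pt3)
{
    int32_t acc = 0;                        /* ZERO A4 */
    do {
        int16_t a = *pt1++;                 /* LDH *A5++, A0 */
        int16_t x = *pt2++;                 /* LDH *A6++, A1 */
        int32_t prod = (int32_t)a * x;      /* MPY A0, A1, A3 */
        acc += prod;                        /* ADD A4, A3, A4 */
    } while (--count != 0);                 /* SUB, then [B0] B loop */
    *pt3 = (uint16_t)((uint32_t)acc & 0xFFFFu); /* STH A4, *A7 */
}
```

The model makes one consequence of the half-word store visible: a sum that no longer fits in 16 bits is truncated when it is written to memory.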

Increasing the processing power!
How can we add more processing power to this processor?
(1) Increase the clock frequency.
(2) Increase the number of processing units.
[Figure: units .S1, .M1, .L1 and .D1 connected to Register File A (A0 to A15, 32 bits) and data memory.]

To increase the processing power, this processor has two sides (A and B, or 1 and 2).
[Figure: side A with units .S1, .M1, .L1, .D1 and Register File A (A0 to A15, 32 bits); side B with units .S2, .M2, .L2, .D2 and Register File B (B0 to B15, 32 bits); both sides connected to data memory.]
Can the two sides exchange operands in order to increase performance?

The answer is YES, but there are limitations. To exchange operands between the two sides, some cross paths or links are required.
What is a cross path? A cross path links one side of the CPU to the other. There are two types of cross paths:
– Data cross paths.
– Address cross paths.

Data Cross Paths
Data cross paths can also be referred to as register file cross paths. These cross paths allow operands from one side to be used by the other side. There are only two cross paths:
– 1X, which conveys data from side B to side A.
– 2X, which conveys data from side A to side B.

[Figure: TMS320C67x data path]
[Figure: Architecture]
[Figure: Data path details]
[Figure: Functional units]
[Figure: Instruction packing]
[Figure: Data types]
[Figure: DSP instructions]
[Figure: Pipeline benefits]
[Figure: Pipeline phases]
[Figure: Delay slots]

Linear Assembly
– Comparison of programming techniques.
– How to write Linear Assembly.
– Interfacing Linear Assembly with C.
– Assembly optimiser tool.

Introduction
With the assembly optimiser, optimisation of loops can be made very simple. Code written in linear assembly is passed to the assembly optimiser, which takes care of the pipeline structure and generates highly parallel assembly code automatically. The performance of the generated code can approach that of hand-written assembly code.

Comparison of Programming Techniques

Source        Tool                   Efficiency*   Effort
C, C++        Optimising compiler    %             Low
Linear ASM    Assembly optimiser     %             Med
ASM           Hand optimised         %             High

* Typical efficiency vs. hand optimized assembly.

Writing in Linear Assembly
Linear assembly is similar to hand assembly, except that:
– It does not require NOPs to fill empty delay slots.
– The functional units do not need to be specified.
– Grouping of instructions in parallel is performed automatically.
– It accepts symbolic variable names.

        ZERO      sum
loop    LDH       *p_to_a, a
        LDH       *p_to_b, b
        MPY       a, b, prod
        ADD       sum, prod, sum
        SUB       B0, 1, B0
        B         loop

How to Write Code in Linear Assembly
File extension: use the “.sa” extension to specify that the file is written in linear assembly.
How to write code:

_sa_Function  .cproc
              ZERO    sum
loop          LDH     *pm++, m
              LDH     *pn++, n
              MPY     m, n, prod
              ADD     sum, prod, sum
    [count]   SUB     count, 1, count
    [count]   B       loop
              .return sum
              .endproc

– .cproc defines the beginning of the code.
– .return specifies the return value.
– .endproc defines the end of the linear assembly code.
– NO NOPs required.
– NO parallel instructions required.
– NO functional units specified.
– NO registers required.

Passing and Returning Arguments
“pm” and “pn” are two pointers declared in the C code that calls the linear assembly function. The C code declares the linear assembly function with the following prototype and calls it like any other function:
        int dotp(short *a, short *x, int count);
        y = dotp(a, x, count);
The linear assembly function receives the arguments using .cproc:

_dotp   .cproc  pm, pn, count
        ...
        .return y
        .endproc

Declaring the Symbolic Variables
All the symbolic registers except those used as arguments are declared as follows:
        .reg    m, n, prod, sum
The assembly optimiser will attempt to assign all these values to registers.

Complete Linear Assembly Code

_dotp         .cproc  pm, pn, count
              .reg    m, n, prod, sum
              ZERO    sum
loop          LDH     *pm++, m
              LDH     *pn++, n
              MPY     m, n, prod
              ADD     sum, prod, sum
    [count]   SUB     count, 1, count
    [count]   B       loop
              .return sum
              .endproc

Note: Linear assembly performs an automatic return to the calling function.

Function calls in Linear Assembly
In linear assembly you can call other functions written in C, linear assembly or assembly. To do this the .call directive is used:

Function1.sa
_function1      .cproc  a, b
                .reg    y, float1
                MPY     a, b, y
                .call   float1 = _fix_to_float(y)
                .return
                .endproc

Fix_to_float.sa
_fix_to_float   .cproc  fix
                .reg    float1
                INTSP   fix, float1
                .return float1
                .endproc

Note: Branching out of a linear assembly routine is not allowed.

Invoking the Assembly Optimiser
The development tools recognise linear assembly code by the file extension “.sa”. The assembly optimiser uses the same options as the optimising C compiler.
Note: In CCS you can change the options of each file individually by right-clicking on the file in the project view and selecting “File Specific Options…”.

Code Optimization
– Introduction to optimisation and optimisation procedure.
– Optimisation of C code using the code generation tools.
– Optimisation of assembly code.

Introduction
Software optimisation is the process of manipulating software code to achieve two main goals:
– Faster execution time.
– Smaller code size.
Note: It will be shown that in general there is a trade-off between faster execution time and smaller code size.

To implement efficient software, the programmer must be familiar with:
– The processor architecture.
– The programming language (C, assembly or linear assembly).
– The code generation tools (compiler, assembler and linker).

[Figure: Code optimisation procedure]

Optimising C Compiler Options
The ‘C6x optimising C compiler takes ANSI C source code and can currently achieve up to about 80% of the performance of hand-scheduled assembly. However, to achieve this level of optimisation, knowledge of the different levels of optimisation is essential. Optimisation is performed at different stages and levels.

Optimization levels

-O0
– Performs control-flow-graph simplification
– Allocates variables to registers
– Performs loop rotation
– Eliminates unused code
– Simplifies expressions and statements
– Expands calls to functions declared inline

-O1 performs all -O0 optimizations, plus:
– Removes unused assignments
– Eliminates local common expressions

-O2 performs all -O1 optimizations, plus:
– Performs software pipelining
– Performs loop optimizations
– Eliminates global common subexpressions
– Eliminates global unused assignments
– Converts array references in loops to incremented pointer form
– Performs loop unrolling

-O3 performs all -O2 optimizations, plus:
– Removes all functions that are never called
– Simplifies functions with return values that are never used
– Inlines calls to small functions
– Reorders function declarations so that the attributes of called functions are known when the caller is optimized
– Propagates arguments into function bodies when all calls pass the same value in the same argument position
– Identifies file-level variable characteristics

Intrinsic C functions
Intrinsics allow you to express the meaning of certain assembly statements that would otherwise be cumbersome or inexpressible in C/C++. Intrinsics are used like functions; you can use C/C++ variables with these intrinsics, just as you would with any normal function. For example:
        int _sadd(int src1, int src2);
adds src1 to src2 and saturates the result. It is called as follows:
        int x1, x2, y;
        y = _sadd(x1, x2);
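
To show what "saturates the result" means, here is a portable C model of a saturating addition. It is not the TI intrinsic: `sadd_model` is a hypothetical name, and the sketch assumes that `int` has the same width on the host as on the target.

```c
#include <limits.h>

/* Saturating addition: instead of wrapping on overflow, the result
 * is clamped to INT_MAX or INT_MIN. */
int sadd_model(int src1, int src2)
{
    if (src2 > 0 && src1 > INT_MAX - src2) {
        return INT_MAX;             /* positive overflow */
    }
    if (src2 < 0 && src1 < INT_MIN - src2) {
        return INT_MIN;             /* negative overflow */
    }
    return src1 + src2;             /* no overflow: ordinary sum */
}
```

Written in plain C the operation needs two comparisons and a branch on each path, which illustrates why a single-instruction intrinsic is attractive.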

Assembly Optimisation
To develop an appreciation of how to optimise code, let us optimise an FIR filter. For simplicity we write:

$$y = \sum_{i=0}^{N-1} h[i]\, x[i] \qquad [1]$$

To implement Equation 1, we need to perform the following steps:
(1) Load the sample x[i].
(2) Load the coefficient h[i].
(3) Multiply x[i] and h[i].
(4) Add (x[i] * h[i]) to the content of an accumulator.
(5) Repeat steps 1 to 4 N-1 times.
(6) Store the value in the accumulator to y.

Steps 1 to 6 can be translated into the following ‘C6x assembly code:

        MVK  .S2  N,B0          ; Initialise the loop counter (N taps)
        MVK  .S1  0,A5          ; Initialise the accumulator
loop    LDH  .D1  *A8++,A2      ; Load the sample x[i]
        LDH  .D1  *A9++,A3      ; Load the coefficient h[i]
        NOP  4                  ; Add "NOP 4" because LDH has a latency of 5 cycles
        MPY  .M1  A2,A3,A4      ; Multiply x[i] and h[i]
        NOP                     ; Multiply has a latency of 2 cycles
        ADD  .L1  A4,A5,A5      ; Add x[i]*h[i] to the accumulator
 [B0]   SUB  .L2  B0,1,B0       ; Loop overhead
 [B0]   B    .S1  loop          ; Loop overhead
        NOP  5                  ; The branch has a latency of 6 cycles
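
A rough cycle count for one pass of this loop shows how much of the time is spent in NOPs. The C sketch below is only bookkeeping under stated assumptions: every instruction issues in its own cycle (nothing runs in parallel), `NOP n` costs n cycles, and the two MVK setup instructions are ignored. The table and function names are hypothetical.

```c
#include <stddef.h>

struct step {
    const char *text;   /* instruction as written in the loop */
    int cycles;         /* cycles it occupies when issued alone */
};

static const struct step loop_body[] = {
    {"LDH  *A8++,A2", 1},
    {"LDH  *A9++,A3", 1},
    {"NOP  4",        4},
    {"MPY  A2,A3,A4", 1},
    {"NOP",           1},
    {"ADD  A4,A5,A5", 1},
    {"SUB  B0,1,B0",  1},
    {"B    loop",     1},
    {"NOP  5",        5},
};

/* Cycles for one pass through the loop body. */
int cycles_per_iteration(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof loop_body / sizeof loop_body[0]; i++) {
        total += loop_body[i].cycles;
    }
    return total;
}

/* Cycles for the whole loop, given the number of taps. */
long total_cycles(long taps)
{
    return taps * cycles_per_iteration();
}
```

Under these assumptions each tap costs 16 cycles, of which 10 are NOPs, so removing or filling the delay slots is where most of the gain lies.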
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 97 Assembly Optimisation
In order to optimise the code, we need to:
(1) Use instructions in parallel.
(2) Remove the NOPs.
(3) Remove the loop overhead (remove SUB and B: loop unrolling).
(4) Use word access or double-word access instead of byte or half-word access.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 99 Step 1 - Using Parallel Instructions
[Figure: schedule of the loop instructions (ldh, ldh, mpy, add, sub, b and the intervening nops).]
Note: Not all instructions can be put in parallel, since the result of one unit is used as an input to the following unit.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 100 Step 2 - Removing the NOPs
loop    LDH  .D1  *A8++,A2
        LDH  .D1  *A9++,A3
 [B0]   SUB  .L2  B0,1,B0    ; SUB and B fill two of the load delay slots
 [B0]   B    .S1  loop
        NOP  2               ; Remaining load delay slots
        MPY  .M1  A2,A3,A4   ; MPY, NOP and ADD execute in the branch delay slots
        NOP
        ADD  .L1  A4,A5,A5
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 101 Step 3 - Loop Unrolling
The SUB and B instructions consume at least two extra cycles per iteration (this is known as branch overhead).
Looped code:
loop    LDH  .D1  *A8++,A2
        LDH  .D1  *A9++,A3
 [B0]   SUB  .L2  B0,1,B0
 [B0]   B    .S1  loop
        NOP  2
        MPY  .M1  A2,A3,A4
        NOP
        ADD  .L1  A4,A5,A5
Unrolled code:
        LDH  .D1  *A8++,A2   ; Start of iteration 1
     || LDH  .D2  *B9++,B3
        NOP  4
        MPY  .M1X A2,B3,A4   ; Use of cross path
        NOP
        ADD  .L1  A4,A5,A5
        LDH  .D1  *A8++,A2   ; Start of iteration 2
     || LDH  .D2  *B9++,B3
        NOP  4
        MPY  .M1X A2,B3,A4
        NOP
        ADD  .L1  A4,A5,A5
        ; :
        LDH  .D1  *A8++,A2   ; Start of iteration n
     || LDH  .D2  *B9++,B3
        NOP  4
        MPY  .M1X A2,B3,A4
        NOP
        ADD  .L1  A4,A5,A5
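The same idea can be sketched at the C level. The version below unrolls by a factor of four rather than fully, so the loop test and branch are paid once per four products; the function name sop_unrolled4 and the requirement that n be a multiple of four are assumptions made to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of products with the loop body unrolled four times.
   sop_unrolled4 is a hypothetical name; n must be a multiple of 4. */
int32_t sop_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    assert(n % 4 == 0);

    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        /* Four multiply-accumulates per loop test and branch. */
        acc += (int32_t)x[i]     * h[i];
        acc += (int32_t)x[i + 1] * h[i + 1];
        acc += (int32_t)x[i + 2] * h[i + 2];
        acc += (int32_t)x[i + 3] * h[i + 3];
    }
    return acc;
}
```

The cost is visible in the source as well as in the generated code: the loop body is four times longer, which is the code-size penalty of unrolling.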
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 102 Step 4 - Word or Double Word Access
The ‘C6711 has two 64-bit data buses for data memory access, and therefore up to two 64-bit values can be loaded into the registers at any time (see Chapter 2). In addition, the ‘C6711 devices have variants of the multiplication instruction to support different operations (see Chapter 2).
Note: A store can only be up to 32 bits wide.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 103 Step 4 - Word or Double Word Access
Using word access, MPY and MPYH, the previous code can be written as:
loop    LDW  .D1  *A9++,A3   ; 32-bit word is loaded in a single cycle
     || LDW  .D2  *B6++,B1
        NOP  4
 [B0]   SUB  .L2  B0,1,B0
 [B0]   B    .S1  loop
        NOP  2
        MPY  .M1X A3,B1,A4   ; Product of the lower 16-bit halves
     || MPYH .M2X B1,A3,B3   ; Product of the upper 16-bit halves
        NOP
        ADD  .L1X A4,B3,A5   ; Sum of the two products (running accumulation not shown)
Note: By loading words and using the MPY and MPYH instructions, the execution time has been halved, since two 16x16-bit multiplications are performed in each iteration.
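The effect of word access can be modelled in portable C. In this sketch each 32-bit word is assumed to hold two 16-bit values, the lower half being the first of the pair; the names pack16, lo16, hi16 and sop_packed are hypothetical. The casts back to int16_t rely on the usual two's complement conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (lower half first). */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Extract the signed lower and upper halves of a word. */
int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Sum of products over nwords packed words: two 16x16-bit
   multiplications are performed per word loaded. */
int32_t sop_packed(const uint32_t *xw, const uint32_t *hw, size_t nwords)
{
    assert(xw != NULL && hw != NULL);

    int32_t acc = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t x = xw[i];                            /* one word load  */
        uint32_t h = hw[i];                            /* one word load  */
        int32_t p_lo = (int32_t)lo16(x) * lo16(h);     /* models MPY     */
        int32_t p_hi = (int32_t)hi16(x) * hi16(h);     /* models MPYH    */
        acc += p_lo + p_hi;
    }
    return acc;
}
```

Unlike the assembly on the slide, this sketch keeps a running total in acc, so it returns the full sum of products for all words.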
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 104 Optimisation Summary
It has been shown that there are four complementary methods for code optimisation:
- Using instructions in parallel.
- Filling the delay slots with useful code.
- Using word or double-word loads.
- Loop unrolling.
The first three increase performance and reduce code size. Loop unrolling increases performance but increases code size.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 106 Software Optimisation Part 2 - Software Pipelining
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 107 Objectives
- Explain why software pipelining (SP) is used.
- Understand software pipelining concepts.
- Use the software pipelining procedure.
- Code the word-wide software pipelined dot-product routine.
- Determine if your pipelined code is more efficient with or without prolog and epilog.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 108 Why use Software Pipelining (SP)?
SP creates highly optimized loop code by:
- Putting several instructions in parallel.
- Filling delay slots with useful code.
- Maximizing the use of the functional units.
SP is implemented by simply using the tools:
- Compiler options -o2 or -o3.
- The assembly optimizer, if the source is a .sa file.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 109 Software Pipeline Concept
To explain the concept of software pipelining, we will assume that all instructions execute in one cycle.
        LDH
     || LDH
        MPY
        ADD
How many cycles would it take to perform this loop 5 times? (Disregard delay slots.)
    5 x 3 = 15 cycles
Let’s examine hardware (functional unit) usage...
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 111 Non-Pipelined Code
Cycle   .M1   .M2   .L1   .L2   .S1   .S2   .D1   .D2
  1                                         ldh   ldh
  2     mpy
  3                 add
  4                                         ldh   ldh
  5     mpy
  6                 add
  7                                         ldh   ldh
  8     mpy
  9                 add
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 112 Pipelining Code
Cycle   .M1   .M2   .L1   .L2   .S1   .S2   .D1   .D2
  1                                         ldh   ldh
  2     mpy                                 ldh   ldh
  3     mpy         add                     ldh   ldh
  4     mpy         add                     ldh   ldh
  5     mpy         add                     ldh   ldh
  6     mpy         add
  7                 add
Pipelining these instructions takes only 7 cycles, about half the 15 cycles of the non-pipelined code.
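The two cycle counts can be checked with a small calculation. The sketch below assumes, as the slides do, that every stage takes one cycle and that a new iteration can start every cycle; the function names are hypothetical.

```c
#include <assert.h>

/* Cycles when each iteration runs to completion before the next starts. */
unsigned cycles_sequential(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Cycles when a new iteration starts every cycle: the first iteration
   needs `stages` cycles, each further iteration adds one more. */
unsigned cycles_pipelined(unsigned iterations, unsigned stages)
{
    assert(iterations > 0 && stages > 0);
    return iterations + stages - 1;
}
```

For the loop on the slides (5 iterations, 3 stages: load, multiply, add) these give 15 and 7 cycles respectively.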
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 114 Pipelining Code
Cycle   .M1   .L1   .D1   .D2
  1                 ldh   ldh     Prolog
  2     mpy         ldh   ldh     Prolog
  3     mpy   add   ldh   ldh     Loop kernel
  4     mpy   add   ldh   ldh     Loop kernel
  5     mpy   add   ldh   ldh     Loop kernel
  6     mpy   add                 Epilog
  7           add                 Epilog
Prolog: staging for the loop.
Loop kernel: single-cycle “loop” iterated three times.
Epilog: completing the final operations.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 115 Pipelined Code
prolog: LDH          ; load 1
     || LDH
        MPY          ; mpy 1
     || LDH          ; load 2
     || LDH
loop:   ADD          ; add 1
     || MPY          ; mpy 2
     || LDH          ; load 3
     || LDH
        ADD          ; add 2
     || MPY          ; mpy 3
     || LDH          ; load 4
     || LDH
        ...
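The prolog, kernel and epilog structure can be modelled in C to see which data each stage works on. This is only a sequential model of the schedule, not code that runs in parallel; the name sop_pipelined and the requirement of at least two elements are assumptions. Within each modelled cycle the statements are ordered add, multiply, load so that every stage consumes the value produced in the previous cycle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequential model of the three-stage software pipeline
   (load -> multiply -> add). Requires n >= 2. */
int32_t sop_pipelined(const int16_t *x, const int16_t *h, size_t n)
{
    assert(x != NULL && h != NULL);
    assert(n >= 2);

    int32_t acc = 0;
    int16_t xa, ha;   /* operands loaded, waiting for the multiply */
    int32_t prod;     /* product waiting for the add               */

    /* Prolog, cycle 1: load 1 */
    xa = x[0]; ha = h[0];
    /* Prolog, cycle 2: mpy 1 || load 2 */
    prod = (int32_t)xa * ha;
    xa = x[1]; ha = h[1];

    /* Kernel: add || mpy || load, one pass per remaining element */
    for (size_t i = 2; i < n; i++) {
        acc += prod;
        prod = (int32_t)xa * ha;
        xa = x[i]; ha = h[i];
    }

    /* Epilog: add || mpy, then the final add */
    acc += prod;
    prod = (int32_t)xa * ha;
    acc += prod;

    return acc;
}
```

With n = 5 the kernel loop runs three times, matching the three kernel cycles in the scheduling table.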
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 116 Software Pipelining Procedure
1. Write algorithm in C code & verify.
2. Write ‘C6x Linear Assembly code.
3. Create dependency graph.
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to ‘C6x code.
Dr. Naim Dahnoun, Bristol University, (c) Texas Instruments 2002 Chapter 2, Slide 117 Assignment
- Write C code for an FIR filter using 16-bit coefficients and inputs.
- Is this code optimal for implementation on the TMS320C6x?
- Rewrite the C code so that two multiplication units are used in each iteration.