Intro to the “c6x” VLIW processor
  Texas Instruments TMS320C6000 series
  TMS320C6700 subseries: includes floating point
  VLIW = Very Long Instruction Word
Operations in Parallel
  [Figure: registers feeding the function units, with bypassing]
Non-orthogonal
  [Figure: two register files, A and B, each with a bypass path to its own function units. A side: L1, S1, M1, D1. B side: L2, S2, M2, D2. See TI's picture.]
Specialized Function Units
  L units: arithmetic, compare, and logical ops
  S units: arithmetic, logical, branches, constant generation
  M units: multiplies
  D units: address generation / memory accesses
Complicated hardware
  [Figure: the two register files]
Explicit parallelism
  [Figure: the two register files]
Simple VLIW encoding
  Slots that cannot be utilized are filled with no-ops
  Bad for code density, cache utilization, energy, ...
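To put a number on the cost of no-op padding, here is a minimal sketch. The schedule is hypothetical (the operation names and their grouping are made up for illustration), and the 8-slot width is taken from the 8-wide machine described in these slides.

```python
# Hypothetical schedule: one inner list per cycle, holding the operations
# that issue together in that cycle (names are illustrative only).
WIDTH = 8  # issue slots per cycle, assuming an 8-wide VLIW

def pad_with_nops(schedule, width=WIDTH):
    """Encode each cycle as exactly `width` slots, filling gaps with no-ops."""
    return [cycle + ["nop"] * (width - len(cycle)) for cycle in schedule]

schedule = [
    ["ldw", "ldw"],
    ["mpy"],
    ["add", "sub", "stw"],
]

padded = pad_with_nops(schedule)
useful = sum(len(cycle) for cycle in schedule)   # real operations
total = sum(len(cycle) for cycle in padded)      # slots actually encoded
print(f"{useful} useful operations occupy {total} slots")
```

In this made-up schedule three quarters of the encoded slots are no-ops, which is the density problem a compact encoding tries to avoid.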
C6X: Packets
  One bit of each instruction indicates whether the next instruction can be executed in parallel (0 = “EOP”)
  Any slot can go to any function unit
  [Figure: rows of instruction slots with their parallel bits set to 1]
  Packet cannot cross an 8-word boundary
  Resources constrain which instructions can be combined in the same packet
  You can branch into the middle of a packet!
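The grouping rule can be sketched in a few lines. This is an illustration under assumptions, not TI's decoder: it assumes the parallel bit is the least significant bit of each 32-bit word, and the function name and sample words are hypothetical.

```python
def split_execute_packets(fetch_packet):
    """Group the 8 instruction words of one fetch packet into execute packets.

    Assumption: the parallel bit (p-bit) is the least significant bit of
    each 32-bit word. p = 1 chains the next instruction into the same
    execute packet; p = 0 ends the packet (EOP).
    """
    assert len(fetch_packet) == 8, "a fetch packet holds 8 words"
    packets, current = [], []
    for slot, word in enumerate(fetch_packet):
        current.append(slot)
        p_bit = word & 1
        # An execute packet may not cross the 8-word boundary, so the last
        # slot always closes the current packet regardless of its p-bit.
        if p_bit == 0 or slot == 7:
            packets.append(current)
            current = []
    return packets

# Illustrative words: only the low bit matters here, the upper bits are made up.
p_bits = [1, 1, 0, 0, 1, 1, 1, 0]
words = [0xABCD0000 | p for p in p_bits]
print(split_execute_packets(words))  # [[0, 1, 2], [3], [4, 5, 6, 7]]
```

Each inner list is a set of slot indices that issue in the same cycle, so this fetch packet takes three cycles to issue and needs no padding no-ops.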
Explicit scheduling
  Delay slots must be respected: no HW interlocks or scoreboarding
    Multiply: 1 delay slot
    Load: 4 delay slots
    Branch: 5 delay slots
  Right:
    B5 := B3 * B2
    <nop>
    B7 := B5 + B1
  Wrong (the add reads the old value of B5):
    B5 := B3 * B2
    B7 := B5 + B1
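The effect of ignoring a delay slot can be reproduced with a toy model. This sketch is an assumption-laden simplification: one instruction per cycle, operands read at issue, results written back only after the delay slots have elapsed, and no stalls. The `run` function, the tuple format, and the register values are hypothetical.

```python
import operator

DELAY_SLOTS = {"mpy": 1, "add": 0}  # from the slide; loads and branches not modeled
OPS = {"mpy": operator.mul, "add": operator.add}

def run(program, regs):
    """Run one instruction per cycle with no interlocks.

    Each instruction is (op, src1, src2, dst), or None for a no-op.
    Operands are read at issue; the result becomes visible only after the
    op's delay slots have elapsed, and nothing stalls to wait for it.
    """
    regs = dict(regs)
    pending = []  # (cycle at which the value becomes visible, dst, value)
    for cycle, instr in enumerate(program):
        # Commit every result whose latency has elapsed.
        still_pending = []
        for ready, dst, value in pending:
            if ready <= cycle:
                regs[dst] = value
            else:
                still_pending.append((ready, dst, value))
        pending = still_pending
        if instr is not None:
            op, src1, src2, dst = instr
            value = OPS[op](regs[src1], regs[src2])
            pending.append((cycle + 1 + DELAY_SLOTS[op], dst, value))
    # Drain results still in flight after the last instruction.
    for _, dst, value in sorted(pending, key=lambda p: p[0]):
        regs[dst] = value
    return regs

start = {"B1": 1, "B2": 3, "B3": 4, "B5": 100, "B7": 0}  # B5 holds a stale value
mpy = ("mpy", "B3", "B2", "B5")
add = ("add", "B5", "B1", "B7")

right = run([mpy, None, add], start)  # no-op fills the multiply delay slot
wrong = run([mpy, add], start)        # add issues too early

print(right["B7"], wrong["B7"])  # 13 101
```

In the wrong schedule the add silently uses the stale B5 (100) rather than the product (12); the hardware gives no warning, which is why the compiler or programmer must do the scheduling.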
Predicated execution
  Why? To get rid of branches (5 delay slots * 8 wide ....)
  Basic idea: a comparison result is stored to a condition register; this register is then used as an operand of other instructions, and its value causes those operations to be selectively enabled or squashed.
  [Condition registers: A1, A2, B0, B1, B2]
  Example: if (B3 < B4) B3++ else B4++
Predicated execution
  With branches:
          cmp B3, B4
          bge L2
          <nop>
          B3 := B3+1
          b DONE
    L2:   B4 := B4+1
    DONE:
  With predicates:
          cmplt B3, B4, B0
    [B0]  B3 := B3+1
    [!B0] B4 := B4+1
  ...and the last two can be issued in parallel!
  Control dependency has been converted to data dependency...
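The equivalence of the two versions can be checked with a small model. This is only a sketch of the semantics: the function names are hypothetical, and the `guarded` helper stands in for the hardware squashing a write whose predicate is false.

```python
def with_branches(b3, b4):
    """Reference semantics: if (B3 < B4) B3++ else B4++."""
    if b3 < b4:
        b3 += 1
    else:
        b4 += 1
    return b3, b4

def guarded(pred, old, new):
    """Model a predicated write: keep `new` if the guard holds, else squash it."""
    return new if pred else old

def with_predicates(b3, b4):
    """If-converted form: both operations issue, the guard picks which one lands."""
    b0 = int(b3 < b4)                        # cmplt B3, B4, B0
    # Both right-hand sides read the original B3 and B4, so the two
    # guarded operations are independent and could issue in parallel.
    new_b3 = guarded(b0, b3, b3 + 1)         # [B0]  B3 := B3+1
    new_b4 = guarded(not b0, b4, b4 + 1)     # [!B0] B4 := B4+1
    return new_b3, new_b4

print(with_predicates(2, 5), with_predicates(5, 2))  # (3, 5) (5, 3)
```

Both guarded operations depend only on B0 and on the original register values, which is the data dependency that replaces the branch.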
Assembly details
          .text
          .align 32
          .global proc
  proc:
          mvk     4, b3
          cmpgt   b3, b4, b0
    [ b0] mvk.S2  9, b5
 || [!b0] mvk.S1  8, a5
          stw     a5, *-a15[4]
          .....
Fetch/execute pipeline
  PG  generate program address
  PS  program address send
  PW  program memory access
  PR  fetch reaches CPU boundary
  DP  instruction dispatch
  DC  instruction decode
  E1  execute 1
  E2  execute 2
  E3  execute 3
  E4  execute 4
  E5  execute 5
Addressing Modes
  Assembly          C equivalent
  *R                (*R)
  *+R[ucst5]        (R[ucst5])
  *+R[offsetR]      (R[offsetR])
  *-R[offsetR]      (R[-offsetR])
  Special case, 15-bit offsets:
  *+B15[ucst15]
  *+B14[ucst15]
Addressing Modes
  Pre/post increment/decrement
  *++R,           *R++
  *++R[ucst5],    *R++[ucst5]
  *--R[ucst5],    *R--[ucst5]
  *++R[offsetR],  *R++[offsetR]
  *--R[offsetR],  *R--[offsetR]
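The difference between plain offset, pre-modify, and post-modify addressing can be sketched with a small model. The `Pointer` class is hypothetical, and it assumes offsets count elements (matching the C-style equivalents such as R[ucst5]); byte scaling and access sizes are not modeled.

```python
class Pointer:
    """A base register modeled as an index into a list of words.

    Assumption: offsets are in elements, as in the C-style R[k] equivalents.
    """

    def __init__(self, memory, index):
        self.memory = memory
        self.index = index

    def offset(self, k=0):
        """*+R[k] (or *-R[k] with negative k): access at R+k, R unchanged."""
        return self.memory[self.index + k]

    def pre_modify(self, k=1):
        """*++R[k] (or *--R[k] with negative k): update R first, then access."""
        self.index += k
        return self.memory[self.index]

    def post_modify(self, k=1):
        """*R++[k] (or *R--[k] with negative k): access first, then update R."""
        value = self.memory[self.index]
        self.index += k
        return value

memory = [10, 20, 30, 40, 50, 60]
r = Pointer(memory, 2)

a = r.offset(1)        # reads memory[3]; r stays at 2
b = r.pre_modify(2)    # r moves to 4, then reads memory[4]
c = r.post_modify(-1)  # reads memory[4], then r moves to 3
print(a, b, c, r.index)  # 40 50 50 3
```

Only the pre- and post-modify forms change the base register, which is what makes them useful for walking through arrays without a separate address update.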
Resources http://www.cs.cmu.edu/~tcal/15745/