CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP So far we have explored dynamic hardware techniques for ILP exploitation: BTB and branch prediction Dynamic scheduling Scoreboard Tomasulo’s algorithm Speculation Multiple issue How can compilers help?
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 2 Loop Unrolling Let’s look at the code: for (i=1000;i>0;i=i-1) x[i] = x[i] + s ADD R2,R0,R0 Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 3 Scheduling On A Simple 5 Stage MIPS Loop: L.D F0,0(R1) stall, wait for F0 value to propagate ADD.D F4, F0, F2 stall, wait for FP add to be completed S.D F4, 0(R1) DADDUI R1, R1, #-8 stall, wait for R1 value to propagate BNE R1, R2, Loop stall one cycle, branch penalty 10 cycles
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 4 We Could Rearrange The Instructions Loop: L.D F0,0(R1) stall, wait for F0 value to propagate ADD.D F4, F0, F2 stall, wait for FP add to be completed S.D F4, 0(R1) DADDUI R1, R1, #-8 stall, wait for R1 value to propagate BNE R1, R2, Loop stall one cycle, branch penalty Interleave these inst. with some independent inst. Best we can achieve is 6 6 cycles Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop 8
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 5 Loop Unrolling Getting into the loop more useful instructions and reducing overhead Step 1: Put several iterations together Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Assume taken
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 6 Loop Unrolling Step 2: Take out control instructions, adjust offsets Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F0,-8(R1) ADD.D F4, F0, F2 S.D F4, -8(R1) L.D F0,-16(R1) ADD.D F4, F0, F2 S.D F4, -16(R1) L.D F0,-24(R1) ADD.D F4, F0, F2 S.D F4, -24(R1) DADDUI R1, R1, #-32 BNE R1, R2, Loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 7 Loop Unrolling Step 3: Rename registers Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F0,-8(R1) ADD.D F4, F0, F2 S.D F4, -8(R1) L.D F0,-16(R1) ADD.D F4, F0, F2 S.D F4, -16(R1) L.D F0,-24(R1) ADD.D F4, F0, F2 S.D F4, -24(R1) DADDUI R1, R1, #-32 BNE R1, R2, Loop Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6,-8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14,-24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1, R2, Loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 8 Loop Unrolling Current loop still has stalls due to RAW dependencies Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6,-8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14,-24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1, R2, Loop Loop: L.D F0,0(R1) stall, wait for F0 value to propagate ADD.D F4, F0, F2 stall, wait for FP add to be completed S.D F4, 0(R1) DADDUI R1, R1, #-8 stall, wait for R1 value to propagate BNE R1, R2, Loop stall one cycle, branch penalty 28 cycles = 7 per it.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 9 Loop Unrolling Step 4: Interleave iterations Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6,-8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14,-24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) DADDUI R1, R1, #-32 BNE R1, R2, Loop 14 cycles = 3.5 per it. Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 S.D F4, 0(R1) S.D F8, -8(R1) DADDUI R1, R1, #-32 S.D F12, 16(R1) BNE R1, R2, Loop S.D F16, 8(R1)
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 10 Loop Unrolling + Multiple Issue Let’s unroll the loop 5 times, mark int. and FP operations Loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F6,-8(R1) ADD.D F8, F6, F2 S.D F8, -8(R1) L.D F10,-16(R1) ADD.D F12, F10, F2 S.D F12, -16(R1) L.D F14,-24(R1) ADD.D F16, F14, F2 S.D F16, -24(R1) L.D F18,-32(R1) ADD.D F20, F18, F2 S.D F20, -32(R1) DADDUI R1, R1, #-40 BNE R1, R2, Loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 11 Loop Unrolling + Multiple Issue Move all loads first, then ADD.D then S.D Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-40 BNE R1, R2, Loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 12 Loop Unrolling + Multiple Issue Rearrange instructions to handle delay for DADDUI and BNE Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-40 BNE R1, R2, Loop Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) DADDUI R1, R1, #-40 S.D F16, -24(R1) BNE R1, R2, Loop S.D F20, -32(R1)
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 13 Loop Unrolling + Multiple Issue Fix immediate displacement values Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) DADDUI R1, R1, #-40 S.D F16, 16(R1) BNE R1, R2, Loop S.D F20, 8(R1)
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 14 Loop Unrolling + Multiple Issue Now imagine we can issue 2 instructions per cycle, one integer and one FP Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) DADDUI R1, R1, #-40 S.D F16, 16(R1) BNE R1, R2, Loop S.D F20, 8(R1) cycles = 2.4 per it.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 15 Static Branch Prediction Analyze the code, figure out which outcome of a branch is likely Always predict taken Predict backward branches as taken, forward as not taken Predict based on the profile of previous runs Static branch prediction can help us schedule delayed branch slots
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 16 Static Multiple Issue: VLIW Hardware checking for dependencies in issue packets may be expensive and complex Compiler can examine instructions and decide which ones can be scheduled in parallel – group instructions into instruction packets – VLIW Hardware can then be simplified Processor has multiple functional units and each field of the VLIW is assigned to one unit For example, VLIW could contain 5 fields and one has to contain ALU instruction or branch, two have to contain FP instructions and two have to be memory references
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 17 Example Assume VLIW contains 5 fields: ALU instruction or branch, two FP instructions and two memory references Ignore branch delay slot Memory reference FP instruction ALU instruction Loop: L.D F0,0(R1) stall, wait for F0 value to propagate ADD.D F4, F0, F2 stall, wait for FP add to be completed S.D F4, 0(R1) DADDUI R1, R1, #-8 stall, wait for R1 value to propagate BNE R1, R2, Loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 18 Example Unroll seven times and rearrange Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 1 ALU /branch FP mem 3
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 19 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 2 ALU /branch FP mem 3 4
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 20 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 3 3 ALU /branch FP mem 4 6 5
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 21 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 4 4 ALU /branch FP mem
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 22 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 5 ALU /branch FP mem
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 23 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, -32(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 6 6 ALU /branch FP mem 7 9 8
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 24 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, 24(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 7 7 ALU /branch FP mem 9 8
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 25 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, 24(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 8 8 ALU /branch FP mem 9
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 26 Example Loop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) L.D F14,-24(R1) L.D F18,-32(R1) L.D F22,-40(R1) L.D F26,-48(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 ADD.D F20, F18, F2 ADD.D F24, F22, F2 ADD.D F28, F26, F2 S.D F4, 0(R1) S.D F8, -8(R1) S.D F12, -16(R1) S.D F16, -24(R1) S.D F20, 24(R1) DADDUI R1, R1, #-56 S.D F24, 16(R1) BNE R1, R2, Loop S.D F28, 8(R1) 9 Overall 9 cycles for 7 iterations 1.29 per iteration But VLIW was always half-full ALU /branch FP mem
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 27 Detecting and Enhancing Loop Level Parallelism Determine whether data in later iterations depends on data in earlier iterations – loop-carried dependence Easier detected at source code level than at machine code for(i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i];/* S1 */ B[i+1] = B[i] + A[i+1] /* S2 */ } S1 calculates a value A[i+1] which will be used in next iteration of S1 S2 calculates a value B[i+1] which will be used in next iteration of S2 This is a loop-carried dependence and prevents parallelism S1 calculates a value A[i+1] which will be used in the current iteration of S2 This is dependence within the loop
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 28 Detecting and Enhancing Loop Level Parallelism for(i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i] /* S2 */ } S1 calculates a value A[i] which is not used in the future S2 calculates a value B[i+1] which will be used in next iteration of S1 This is a loop-carried dependence but S1 depends on S2 not on itself and S2 does not depend on S1 This loop can be made parallel if we transform it so that there is no loop-carried dependence A[1] = A[1] + B[1]; for(i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i] /* S2 */ A[i+1] = A[i+1] + B[i+1]; /* S1 */ } B[101] = C[100]+D[100]
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 29 Detecting and Enhancing Loop Level Parallelism Recursion creates loop-carried dependence But sometimes it may parallelizable if distance between dependent elements is >1 for(i=1; i<=100; i=i+1) { A[i] = A[i-1] + B[i]; } for(i=1; i<=100; i=i+1) { A[i] = A[i-5] + B[i]; }
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 30 Detecting and Enhancing Loop Level Parallelism Find all dependencies in the following loop (5) and eliminate as many as you can: for(i=1; i<=100; i=i+1) { Y[i] = X[i] / c; /* S1 */ X[i] = X[i] + c; /* S2 */ Z[i] = Y[i] + c; /* S3 */ Y[i] = c – Y[i]; /* S4 */ } Solution at page 325
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 31 Code Transformation Eliminating dependent computations Copy propagation Tree height reduction DADDUI R1, R2, #4 DADDUI R1, R1, #4 DADDUI R1, R2, #8 ADD R1, R2, R3 ADD R4, R1, R6 ADD R8, R4, R7 ADD R1, R2, R3 ADD R4, R6, R7 ADD R8, R1, R4 Can be done in parallel sum=sum+x /* suppose this is in a loop and we unroll it 5 times */ sum=sum+x1+x2+x3+x4+x5 sum=(sum+x1)+(x2+x3)+(x4+x5) Can be done in parallel Must be done sequentially
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 32 Software Pipelining Combining instructions from different loop iterations to separate dependent instructions within an iteration
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 33 Software Pipelining Apply software pipelining technique to the following loop: L.D F0,0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) R1+16 R1+8R S.D F0,16(R1) ADD.D F4, F0, F2 L.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop Startup code Cleanup code
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 34 Software Pipelining vs. Loop Unrolling Loop unrolling eliminates loop maintenance overhead exposing parallelism between iterations Creates larger code Software pipelining enables some loop iterations to run at top speed by eliminating RAW hazards that create latencies within iteration Requires more complex transformations
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 35 Homework #8 Due Tuesday, November 16 by the end of the class Submit either in class (paper) or by (PS or PDF only) or bring the paper copy to my office Do exercises 4.2, 4.6, 4.9 (skip parts d. and e.), 4.11