CPSC614 Lec 6.1 Exploiting Instruction-Level Parallelism with Software Approach #1 E. J. Kim
CPSC614 Lec 6.2 To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. Goal: to keep a pipeline full.
CPSC614 Lec 6.3 Latencies
Inst. producing result    Inst. using result    Latency in cycles
FP ALU op                 Another FP op         3
FP ALU op                 Store double          2
Load double               FP ALU op             1
Load double               Store double          0
Other latencies (in cycles): branch: 1; integer ALU op to branch: 1; integer load: 1; integer ALU op to integer ALU op: 1.
CPSC614 Lec 6.4 Example
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDIU  R1, R1, #-8
        BNE     R1, R2, Loop
CPSC614 Lec 6.5 Without any Scheduling
                                  Clock cycle issued
Loop:   L.D     F0, 0(R1)         1
        stall                     2
        ADD.D   F4, F0, F2        3
        stall                     4
        stall                     5
        S.D     F4, 0(R1)         6
        DADDIU  R1, R1, #-8       7
        stall                     8
        BNE     R1, R2, Loop      9
        stall                     10
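CPSC614 Lec 6.6 With Scheduling
                                  Clock cycle issued
Loop:   L.D     F0, 0(R1)         1
        DADDIU  R1, R1, #-8       2
        ADD.D   F4, F0, F2        3
        stall                     4
        BNE     R1, R2, Loop      5
        S.D     F4, 8(R1)         6
Not trivial: the S.D now follows the DADDIU, so its offset changes from 0(R1) to 8(R1).
Delayed branch: the S.D fills the branch delay slot.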
CPSC614 Lec 6.7 The actual work of operating on the array element takes 3 of the 6 clock cycles (load, add, store). The remaining 3 cycles are:
–Loop overhead (DADDIU, BNE)
–Stall
To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.
CPSC614 Lec 6.8 Reducing Loop Overhead
Loop Unrolling
–Simple scheme for increasing the number of instructions relative to the branch and overhead instructions
–Simply replicates the loop body multiple times, adjusting the loop termination code.
–Improves scheduling
  »It allows instructions from different iterations to be scheduled together.
–Uses different registers for each iteration.
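The idea can be sketched at the source level. The following C fragment is a minimal illustration, not the slides' code: the function names `add_scalar_rolled` and `add_scalar_unrolled4` are hypothetical, and the array is indexed from 0 (the slides index x[1..1000]). The unrolled version replicates the body four times and adds a cleanup loop for element counts that are not a multiple of 4.

```c
#include <stddef.h>

/* Rolled loop: one element per iteration, counting down like the slide's loop. */
void add_scalar_rolled(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i - 1] += s;
}

/* Unrolled by 4: one counter update and one branch per four elements. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i = n;
    for (; i >= 4; i -= 4) {
        /* The four statements touch different elements, so a scheduler
           is free to interleave their loads, adds, and stores. */
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
        x[i - 4] += s;
    }
    /* Cleanup loop for the leftover 0 to 3 elements. */
    for (; i > 0; i--)
        x[i - 1] += s;
}
```

With n = 1000, as in the slides, the cleanup loop never runs because 1000 is a multiple of 4.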
CPSC614 Lec 6.9 Unrolled Loop (No Scheduling)
                                  Clock cycle issued    Stall cycles
Loop:   L.D     F0, 0(R1)         1                     2
        ADD.D   F4, F0, F2        3                     4, 5
        S.D     F4, 0(R1)         6
        L.D     F6, -8(R1)        7                     8
        ADD.D   F8, F6, F2        9                     10, 11
        S.D     F8, -8(R1)        12
        L.D     F10, -16(R1)      13                    14
        ADD.D   F12, F10, F2      15                    16, 17
        S.D     F12, -16(R1)      18
        L.D     F14, -24(R1)      19                    20
        ADD.D   F16, F14, F2      21                    22, 23
        S.D     F16, -24(R1)      24
        DADDIU  R1, R1, #-32      25                    26
        BNE     R1, R2, Loop      27                    28
CPSC614 Lec 6.10 Loop Unrolling Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer. Unrolling improves the performance of the loop by eliminating overhead instructions.
CPSC614 Lec 6.11 Loop Unrolling (Scheduling)
                                  Clock cycle issued
Loop:   L.D     F0, 0(R1)         1
        L.D     F6, -8(R1)        2
        L.D     F10, -16(R1)      3
        L.D     F14, -24(R1)      4
        ADD.D   F4, F0, F2        5
        ADD.D   F8, F6, F2        6
        ADD.D   F12, F10, F2      7
        ADD.D   F16, F14, F2      8
        S.D     F4, 0(R1)         9
        S.D     F8, -8(R1)        10
        DADDIU  R1, R1, #-32      11
        S.D     F12, 16(R1)       12
        BNE     R1, R2, Loop      13
        S.D     F16, 8(R1)        14
CPSC614 Lec 6.12 Summary Goal: To know when and how the ordering among instructions may be changed. This process must be performed in a methodical fashion either by a compiler or by hardware.
CPSC614 Lec 6.13 To obtain the final unrolled code,
–Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount to adjust the S.D offset.
–Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
–Use different registers to avoid unnecessary constraints.
–Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
CPSC614 Lec 6.14
–Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
–Schedule the code, preserving any dependences needed to yield the same result as the original code.
Loop Unrolling I (No Delayed Branch)
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F0, -8(R1)
        ADD.D   F4, F0, F2
        S.D     F4, -8(R1)
        L.D     F0, -16(R1)
        ADD.D   F4, F0, F2
        S.D     F4, -16(R1)
        L.D     F0, -24(R1)
        ADD.D   F4, F0, F2
        S.D     F4, -24(R1)
        DADDIU  R1, R1, #-32
        BNE     R1, R2, Loop
True dependences: within each copy of the body, L.D to ADD.D through F0 and ADD.D to S.D through F4.
Name dependences: every copy of the body reuses F0 and F4.
Loop Unrolling II (Register Renaming)
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        L.D     F6, -8(R1)
        ADD.D   F8, F6, F2
        S.D     F8, -8(R1)
        L.D     F10, -16(R1)
        ADD.D   F12, F10, F2
        S.D     F12, -16(R1)
        L.D     F14, -24(R1)
        ADD.D   F16, F14, F2
        S.D     F16, -24(R1)
        DADDIU  R1, R1, #-32
        BNE     R1, R2, Loop
Only the true dependences remain: within each copy of the body, L.D to ADD.D and ADD.D to S.D.
CPSC614 Lec 6.17 With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
–Potential shortfall in registers
Register pressure
–It arises because scheduling code to increase ILP causes the number of live values to increase. It may not be possible to allocate all the live values to registers.
–The combination of unrolling and aggressive scheduling can cause this problem.
CPSC614 Lec 6.18 Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
CPSC614 Lec 6.19 Unrolling with Two-Issue
        Integer instruction       FP instruction            Clock cycle
Loop:   L.D     F0, 0(R1)                                   1
        L.D     F6, -8(R1)                                  2
        L.D     F10, -16(R1)      ADD.D   F4, F0, F2        3
        L.D     F14, -24(R1)      ADD.D   F8, F6, F2        4
        L.D     F18, -32(R1)      ADD.D   F12, F10, F2      5
        S.D     F4, 0(R1)         ADD.D   F16, F14, F2      6
        S.D     F8, -8(R1)        ADD.D   F20, F18, F2      7
        S.D     F12, -16(R1)                                8
        DADDIU  R1, R1, #-40                                9
        S.D     F16, 16(R1)                                 10
        BNE     R1, R2, Loop                                11
        S.D     F20, 8(R1)                                  12
CPSC614 Lec 6.20 Static Branch Prediction Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.
CPSC614 Lec 6.21 Static Branch Prediction
Predict a branch taken
–Simplest
–Average misprediction rate for SPEC: 34% (9% ~ 59%)
Predict on the basis of branch direction
–backward-going branches: taken
–forward-going branches: not taken
–Unlikely to generate an overall misprediction rate of less than 30% ~ 40%.
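The direction-based scheme can be written down in a few lines. This is a minimal sketch under assumptions that are not in the slides: the `branch_event` record, the function names, and the idea of replaying a recorded trace are all hypothetical, and "backward" is taken to mean that the target address is lower than the branch address.

```c
#include <stddef.h>

/* One executed branch: its address, its target, and what it actually did. */
typedef struct {
    unsigned pc;
    unsigned target;
    int taken;      /* 1 if the branch was taken, 0 otherwise */
} branch_event;

/* Backward-going branches are predicted taken, forward-going not taken. */
int predict_by_direction(unsigned pc, unsigned target)
{
    return target < pc;
}

/* Count how many branches in a recorded trace the static rule gets wrong. */
size_t count_mispredictions(const branch_event *trace, size_t n)
{
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (predict_by_direction(trace[i].pc, trace[i].target) != trace[i].taken)
            misses++;
    }
    return misses;
}
```

A loop-closing branch is mispredicted only on loop exit under this rule, which is why the direction heuristic works well for loops and poorly for irregular forward branches.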
CPSC614 Lec 6.22 Static Branch Prediction Predict branches on the basis of profile information collected from earlier runs. –An individual branch is often highly biased toward taken or untaken. (bimodally distributed) –Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.
CPSC614 Lec 6.23 VLIW Very Long Instruction Word: –Rely on compiler technology to minimize the potential data hazard stalls. –Actually format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences. –Wide instructions with multiple operations per instruction. (64, 128 bits or more) –Intel IA-64 architecture
CPSC614 Lec 6.24 Basic VLIW Approach VLIWs use multiple, independent functional units. A VLIW packages the multiple operations into one very long instruction. The hardware in a superscalar for multiple issue is unnecessary. Uses loop unrolling, scheduling …
CPSC614 Lec 6.25
Local Scheduling: scheduling the code within a single basic block.
Global Scheduling: scheduling code across branches
–much more complex
Trace Scheduling: Section 4.5
Figure 4.5 VLIW instructions
CPSC614 Lec 6.26 Problems Increase in code size Wasted functional units –In the previous example, only about 60% of the functional units were used.
CPSC614 Lec 6.27 Detecting and Enhancing Loop-level Parallelism
Loop-level parallelism: source level
ILP: machine-level code after compilation
for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;
CPSC614 Lec 6.28 Advanced Compiler Support for Exposing and Exploiting ILP
for (i = 1; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];        /* S1 */
    B[i + 1] = B[i] + A[i + 1];    /* S2 */
}
CPSC614 Lec 6.29 Loop-Carried Dependence
Data accesses in later iterations are dependent on data values produced in earlier iterations.
for (i = 1; i <= 100; i++) {
    A[i + 1] = A[i] + C[i];        /* S1 */
    B[i + 1] = B[i] + A[i + 1];    /* S2 */
}
Loop-carried dependences: S1 uses the A[i] computed by S1 in the previous iteration, and S2 uses the B[i] computed by S2 in the previous iteration.
These dependences force successive iterations of this loop to execute in series.
CPSC614 Lec 6.30 Does a loop-carried dependence mean there is no parallelism?
Consider:
for (i = 0; i < 8; i = i + 1) {
    A = A + C[i];    /* S1 */
}
Could compute (assuming A is initially 0):
"Cycle 1": temp0 = C[0] + C[1]; temp1 = C[2] + C[3]; temp2 = C[4] + C[5]; temp3 = C[6] + C[7];
"Cycle 2": temp4 = temp0 + temp1; temp5 = temp2 + temp3;
"Cycle 3": A = temp4 + temp5;
Relies on the associative nature of "+".
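The three-level reduction can be compared against the serial loop directly. The sketch below is illustrative only: the function names `sum8_serial` and `sum8_tree` are hypothetical, and the accumulator is assumed to start at 0. Note that floating-point addition is not exactly associative, so for general inputs the two versions may differ in the last bits.

```c
/* Serial version: each addition depends on the one before it (8 dependent adds). */
double sum8_serial(const double c[8])
{
    double a = 0.0;
    for (int i = 0; i < 8; i++)
        a = a + c[i];
    return a;
}

/* Tree version: 7 additions arranged in 3 dependent levels. */
double sum8_tree(const double c[8])
{
    /* Level 1: four independent additions. */
    double temp0 = c[0] + c[1];
    double temp1 = c[2] + c[3];
    double temp2 = c[4] + c[5];
    double temp3 = c[6] + c[7];
    /* Level 2: two independent additions. */
    double temp4 = temp0 + temp1;
    double temp5 = temp2 + temp3;
    /* Level 3: final addition. */
    return temp4 + temp5;
}
```

With enough functional units, the additions within each level can issue in the same cycle, which is where the parallelism comes from.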
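CPSC614 Lec 6.31
for (i = 1; i <= 100; i++) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i + 1] = C[i] + D[i];    /* S2 */
}
Loop-carried dependence: S1 uses the B[i] assigned by S2 in the previous iteration.
Despite this loop-carried dependence, this loop can be made parallel, because the dependence is not circular: S2 does not depend on S1.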
CPSC614 Lec 6.32
A[1] = A[1] + B[1];
for (i = 1; i <= 99; i++) {
    B[i + 1] = C[i] + D[i];
    A[i + 1] = A[i + 1] + B[i + 1];
}
B[101] = C[100] + D[100];
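One way to convince yourself that the transformed loop computes the same values is to run both forms on the same data. The sketch below wraps the two loops in hypothetical functions `loop_original` and `loop_transformed`; the macro `N` and the array size of N + 2 (so that indices 1 through 101 are valid) are assumptions made for the example.

```c
#define N 100

/* Original loop: S1 in iteration i reads the B[i] written by S2 in iteration i-1. */
void loop_original(double A[N + 2], double B[N + 2],
                   const double C[N + 2], const double D[N + 2])
{
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the dependence from S2 to S1 now stays inside one
   iteration, so different iterations no longer depend on each other. */
void loop_transformed(double A[N + 2], double B[N + 2],
                      const double C[N + 2], const double D[N + 2])
{
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```

Both functions perform exactly the same additions on the same operands, only grouped into iterations differently, so the resulting arrays match element for element.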
CPSC614 Lec 6.33 Recurrence
A recurrence is when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.
Detecting a recurrence can be important
–Some architectures (especially vector computers) have special support for executing recurrences.
–Some recurrences can be the source of a reasonable amount of parallelism.
CPSC614 Lec 6.34
for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i - 1] + Y[i];
Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i - 5] + Y[i];
Dependence distance: 5

The larger the distance, the more potential parallelism can be obtained by unrolling the loop.
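For the distance-5 loop, unrolling by 5 makes the available parallelism visible. The following sketch is an illustration under assumptions: the function names are hypothetical, and Y is assumed to have 101 elements so that indices 1 through 100 are valid. The loop runs 95 iterations, which divides evenly by 5, so no cleanup loop is needed.

```c
/* Original form: each Y[i] depends on Y[i-5], five iterations back. */
void recur5_rolled(double Y[101])
{
    for (int i = 6; i <= 100; i++)
        Y[i] = Y[i - 5] + Y[i];
}

/* Unrolled by 5: the five statements in the body read Y[i-5..i-1], which
   were written in the previous trip, and write Y[i..i+4]. None of them
   reads a value written by another statement in the same trip, so they
   are independent of one another. */
void recur5_unrolled(double Y[101])
{
    for (int i = 6; i <= 100; i += 5) {
        Y[i]     = Y[i - 5] + Y[i];
        Y[i + 1] = Y[i - 4] + Y[i + 1];
        Y[i + 2] = Y[i - 3] + Y[i + 2];
        Y[i + 3] = Y[i - 2] + Y[i + 3];
        Y[i + 4] = Y[i - 1] + Y[i + 4];
    }
}
```

With a dependence distance of 1, the same unrolling would leave every statement dependent on the one before it, which is the contrast the slide is drawing.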
CPSC614 Lec 6.35 Finding Dependences
Determining whether a dependence actually exists => NP-Complete
Dependence Analysis
–Basic tool for detecting loop-level parallelism
–Applies only under a limited set of circumstances.
–Greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, …
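The GCD test is simple enough to sketch. The slide only names the test, so the formulation here is an assumption stated for the example: a loop stores to X[a*i + b] and loads from X[c*i + d], and a loop-carried dependence can exist only if GCD(c, a) divides (d - b). The function names are hypothetical. The test is conservative: it ignores loop bounds, so a result of 1 means "a dependence is possible", not "a dependence exists".

```c
#include <stdlib.h>

/* Greatest common divisor of |a| and |b| by Euclid's algorithm. */
int gcd_int(int a, int b)
{
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Store index is a*i + b, load index is c*i + d.
   Returns 0 when the GCD test rules out a dependence, 1 when one is possible. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd_int(a, c);
    if (g == 0)                 /* both indices are constants */
        return b == d;
    return (d - b) % g == 0;
}
```

For example, for a statement of the form X[2*i + 3] = X[2*i] * 5.0 we have a = 2, b = 3, c = 2, d = 0; GCD(2, 2) = 2 does not divide -3, so `gcd_test_may_depend(2, 3, 2, 0)` returns 0 and no dependence is possible.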
CPSC614 Lec 6.36 Eliminating Dependent Computation
Algebraic Simplifications of Expressions
Copy propagation
–Eliminates operations that copy values.
Before:
        DADDIU  R1, R2, #4
        DADDIU  R1, R1, #4
After:
        DADDIU  R1, R2, #8
CPSC614 Lec 6.37 Eliminating Dependent Computation
Tree Height Reduction
–Reduces the height of the tree structure representing a computation.
Before:
        ADD     R1, R2, R3
        ADD     R4, R1, R6
        ADD     R8, R4, R7
After:
        ADD     R1, R2, R3
        ADD     R4, R6, R7
        ADD     R8, R1, R4
CPSC614 Lec 6.38 Eliminating Dependent Computation
Recurrences
Before: sum = sum + x1 + x2 + x3 + x4 + x5
After:  sum = (sum + x1) + (x2 + x3) + (x4 + x5)
CPSC614 Lec 6.39 Software Pipelining Technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop. By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.
CPSC614 Lec 6.40 Software Pipelining
Counterpart to what Tomasulo's algorithm does in hardware.
Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.
Start-up code before the loop and finish-up code after the loop are required.
CPSC614 Lec 6.41 Software Pipelining
CPSC614 Lec 6.42 Software Pipelining - Example
Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.
Loop:   L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDIU  R1, R1, #-8
        BNE     R1, R2, Loop
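The structure of a software-pipelined version can be sketched in C, this time including the start-up and finish-up code that the exercise omits. This is an illustration under assumptions, not the slides' answer: the function name `add_scalar_swp`, the 0-based indexing, and the temporaries `loaded` and `summed` (standing in for F0 and F4) are all hypothetical. Each trip of the steady-state loop stores the result for element i, adds for element i-1, and loads element i-2.

```c
#include <stddef.h>

void add_scalar_swp(double *x, size_t n, double s)
{
    if (n < 3) {
        /* Too few elements to fill the pipeline: use the plain loop. */
        for (size_t i = 0; i < n; i++)
            x[i] += s;
        return;
    }

    /* Start-up code: fill the pipeline. */
    double loaded = x[n - 1];       /* load for element n-1 */
    double summed = loaded + s;     /* add for element n-1  */
    loaded = x[n - 2];              /* load for element n-2 */

    /* Steady state: store, add, and load belong to three different
       iterations of the original loop. */
    for (size_t i = n - 1; i >= 2; i--) {
        x[i] = summed;              /* store for element i   */
        summed = loaded + s;        /* add for element i-1   */
        loaded = x[i - 2];          /* load for element i-2  */
    }

    /* Finish-up code: drain the pipeline. */
    x[1] = summed;                  /* store for element 1          */
    x[0] = loaded + s;              /* add and store for element 0  */
}
```

Within one trip of the steady-state loop, the store, the add, and the load do not depend on one another, so the load-to-add and add-to-store latencies are covered by a whole loop body.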
CPSC614 Lec 6.43 Software Pipelining
Software pipelining consumes less code space than loop unrolling.
Loop unrolling reduces the overhead of the loop (branch, counter update code).
Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and end.
CPSC614 Lec 6.45 Hardware support for more parallelism at compile time
Conditional Instructions
Predicated instructions
Extension of the instruction set
Conditional instruction: an instruction that refers to a condition, which is evaluated as part of the instruction execution
–Condition is true: executed normally
–Condition is false: no-op
–e.g., conditional move
CPSC614 Lec 6.46 Example
if (A == 0) { S = T; }
R1 = A, R2 = S, R3 = T

With a branch:
        BNEZ    R1, L
        ADDU    R2, R3, R0
L:

With a conditional move:
        CMOVZ   R2, R3, R1
The conditional move is performed only if the third operand is equal to zero.
CPSC614 Lec 6.47 Conditional moves are used to change a control dependence into a data dependence. Handling multiple branches per cycle is complex. => Conditional moves provide a way of reducing branch pressure. A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.
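The change from a control dependence to a data dependence can be mimicked in C. The sketch below is illustrative: the function names are hypothetical, and whether a compiler actually emits a conditional-move instruction for either form depends on the compiler and target. The second function makes the data dependence explicit by selecting between the two values with a mask instead of a branch.

```c
#include <stdint.h>

/* Branching form of: if (a == 0) s = t; */
uint32_t select_with_branch(uint32_t a, uint32_t s, uint32_t t)
{
    if (a == 0)
        s = t;
    return s;
}

/* Branch-free form: the result depends on a, s, and t as data only. */
uint32_t select_with_mask(uint32_t a, uint32_t s, uint32_t t)
{
    /* mask is all ones when a == 0, and all zeros otherwise. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(a == 0);
    return (t & mask) | (s & ~mask);
}
```

Both functions return t when a is zero and s otherwise; the second has no branch for the hardware to predict.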