CS 211: Computer Architecture
Lecture 6: Exploiting Instruction Level Parallelism with Software Approaches
Instructor: Morris Lancaster

Basic Compiler Techniques for Exposing ILP
- Crucial for processors that use static issue
- Important for processors that make dynamic issue decisions but use static scheduling
1/29/2010 CS 211 Lecture 6

Basic Pipeline Scheduling and Loop Unrolling
- Exploiting parallelism among instructions
  - Find sequences of unrelated instructions that can be overlapped in the pipeline
  - To avoid a stall, separate a dependent instruction from its source instruction by a distance in clock cycles equal to the pipeline latency of the source instruction
- The compiler works with knowledge of the amount of ILP available in the program and of the latencies of the functional units in the pipeline
  - This couples the compiler to the hardware, sometimes to the specific chip version, or at least requires setting appropriate compiler flags

Assumed Latencies

    Instruction producing result   Instruction using result   Latency in clock cycles (needed to avoid stall)
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0

- The result of a load can be bypassed to a store without stalling the store

Basic Pipeline Scheduling and Loop Unrolling (cont)
- Assume the standard 5-stage integer pipeline
  - Branches have a delay of one clock cycle
- Functional units are fully pipelined or replicated (as many times as the pipeline depth)
  - An operation of any type can be issued on every clock cycle, and there are no structural hazards

Basic Pipeline Scheduling and Loop Unrolling (cont)
Sample code
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
MIPS code
    Loop: L.D    F0,0(R1)      ;F0 = array element
          ADD.D  F4,F0,F2      ;add scalar in F2
          S.D    F4,0(R1)      ;store result back
          DADDUI R1,R1,#-8     ;decrement pointer by 8 bytes (one double)
          BNE    R1,R2,Loop    ;R2 is precomputed so that 8(R2) is
                               ;the last element to be computed

Basic Pipeline Scheduling and Loop Unrolling (cont)
MIPS code, unscheduled
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          stall                2
          ADD.D  F4,F0,F2      3
          stall                4
          stall                5
          S.D    F4,0(R1)      6
          DADDUI R1,R1,#-8     7
          stall                8
          BNE    R1,R2,Loop    9

Rescheduling Gives
Sample code
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
MIPS code
                               Clock cycle issued
    Loop: L.D    F0,0(R1)      1
          DADDUI R1,R1,#-8     2
          ADD.D  F4,F0,F2      3
          stall                4
          BNE    R1,R2,Loop    5
          S.D    F4,8(R1)      6
- The S.D now sits in the branch delay slot; its offset becomes 8 because R1 has already been decremented

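
The stall counts on the two preceding slides can be reproduced with a small in-order issue model. The sketch below is only an illustration under stated assumptions: the instruction encoding (`kind`, destination, sources) and the function name `issue_cycles` are hypothetical, the load-to-store latency is taken as 0, and the one-cycle integer-to-branch latency is inferred from the stall shown between DADDUI and BNE.

```python
# Stall cycles required between a producer and a dependent consumer.
# FP/load entries follow the "Assumed Latencies" slide (load -> store assumed 0);
# ("int", "branch") = 1 is inferred from the stall between DADDUI and BNE.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "branch"): 1,
}


def issue_cycles(program):
    """Return the clock cycle in which each instruction issues (in order, one per cycle)."""
    last_write = {}  # register -> (kind of producer, cycle in which it issued)
    cycle = 0
    cycles = []
    for kind, dest, sources in program:
        earliest = cycle + 1  # single issue, program order
        for reg in sources:
            if reg in last_write:
                producer_kind, producer_cycle = last_write[reg]
                stall = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, producer_cycle + stall + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_write[dest] = (kind, cycle)
    return cycles


# Each entry: (kind, destination register, source registers)
UNSCHEDULED = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),   # S.D    F4,0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]

SCHEDULED = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,#-8
    ("fpalu", "F4", ["F0", "F2"]),   # ADD.D  F4,F0,F2
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("store", None, ["F4", "R1"]),   # S.D    F4,8(R1)
]

print(issue_cycles(UNSCHEDULED))
print(issue_cycles(SCHEDULED))
```

The model issues each instruction as early as its operands allow, so in the scheduled loop it places the idle cycle after the BNE instead of before it; the loop body still finishes in 6 cycles, against 9 for the unscheduled version.
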
Unrolling Gives
MIPS code
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop

Unrolling and Removing Hazards Gives
MIPS code (total of 14 clock cycles)
    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)
          BNE    R1,R2,Loop
          S.D    F16,8(R1)

Unrolling Summary for Above
- Determine that it was legal to move the S.D after the DADDUI and BNE, and find the amount by which to adjust the S.D offset
- Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers
- Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
- Determine that the loads and stores can be interchanged by determining that the loads and stores from different iterations are independent
- Schedule the code, preserving any dependences

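
At the source level, the same unrolling steps can be sketched as follows. This is a minimal illustration, not the compiler's actual output: the function names are hypothetical, and the cleanup loop is an added assumption for array lengths that are not a multiple of 4 (the slide's 1000-element loop does not need it).

```python
def add_scalar(x, s):
    """Original loop: x[i] = x[i] + s, walking from the last element down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """Same computation with the loop body replicated four times."""
    i = len(x) - 1
    # Main loop: four independent copies of the body, one index update
    # and one loop test per four elements.
    while i >= 3:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    # Cleanup loop (assumption): handles the 0 to 3 leftover elements.
    while i >= 0:
        x[i] = x[i] + s
        i -= 1


a = list(range(10))
b = list(range(10))
add_scalar(a, 5)
add_scalar_unrolled4(b, 5)
print(a == b)
```

The four statements in the main loop touch different elements, which is what makes it legal to reorder and interleave them when scheduling.
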
Unrolling Summary (continued)
Limits to the Impact of Unrolling Loops
- Diminishing returns: each additional unroll amortizes less of the remaining loop overhead
- Growth in code size
  - Not good for embedded computers
  - Large code size may increase the cache miss rate
- Shortfall in available registers (register pressure)
  - Scheduling the code to increase ILP increases the number of live values
  - This can cause a shortage of registers and negate part of the benefit of the optimization
- Unrolling is nonetheless useful on a variety of processors today

Unrolling and Pipeline Scheduling with Static Multiple Issue
Assume a two-issue, statically scheduled superscalar MIPS pipeline. The loop is unrolled 5 times, which would take 17 cycles on the single-issue pipeline of the previous example.

          Integer instruction     FP instruction        Clock
    Loop: L.D    F0,0(R1)                               1
          L.D    F6,-8(R1)                              2
          L.D    F10,-16(R1)      ADD.D  F4,F0,F2       3
          L.D    F14,-24(R1)      ADD.D  F8,F6,F2       4
          L.D    F18,-32(R1)      ADD.D  F12,F10,F2     5
          S.D    F4,0(R1)         ADD.D  F16,F14,F2     6
          S.D    F8,-8(R1)        ADD.D  F20,F18,F2     7
          S.D    F12,-16(R1)                            8
          DADDUI R1,R1,#-40                             9
          S.D    F16,16(R1)                             10
          BNE    R1,R2,Loop                             11
          S.D    F20,8(R1)                              12

Unrolling and Pipeline Scheduling with Static Multiple Issue
- The unrolled loop now takes 12 cycles, or 2.4 cycles per element, versus 3.5 cycles per element for the scheduled and unrolled loop on the single-issue pipeline

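
The cycles-per-element figures quoted so far follow directly from the cycle counts on the slides. A short sketch of the arithmetic (the dictionary `SCHEDULES` and the function name are illustrative choices, not part of the lecture):

```python
# (clock cycles per loop body, array elements processed per loop body),
# taken from the cycle counts shown on the slides.
SCHEDULES = {
    "original, unscheduled": (9, 1),
    "original, scheduled": (6, 1),
    "unrolled 4x, scheduled, single issue": (14, 4),
    "unrolled 5x, scheduled, two issue": (12, 5),
}


def cycles_per_element(cycles, elements):
    """Average clock cycles spent per array element."""
    return cycles / elements


for name, (cycles, elements) in SCHEDULES.items():
    print(f"{name}: {cycles_per_element(cycles, elements):.2f} cycles per element")
```
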
Static Branch Prediction
- The expectation is that branch behavior is highly predictable at compile time (static prediction can also be used to help dynamic predictors)
- In practice misprediction rates vary widely across benchmarks, from 9% to 59%
- Example: the DSUBU and BEQZ depend on the LD, so a stall is needed after the LD
        LD     R1,0(R2)
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9

Static Branch Prediction
Suppose this branch is almost always taken and the value of R7 is not needed on the fall-through path. The DADDU at L can then be moved up to just after the LD, where it fills the stall cycle before the dependent DSUBU.
Original
        LD     R1,0(R2)
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9
After moving the DADDU
        LD     R1,0(R2)
        DADDU  R7,R8,R9
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:                       ;DADDU R7,R8,R9 was here

Static Branch Prediction
Suppose this branch is rarely taken and the value of R4 is not needed on the taken path. The OR can then be moved up to just after the LD, where it fills the stall cycle before the dependent DSUBU.
Original
        LD     R1,0(R2)
        DSUBU  R1,R1,R3
        BEQZ   R1,L
        OR     R4,R5,R6
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9
After moving the OR
        LD     R1,0(R2)
        OR     R4,R5,R6
        DSUBU  R1,R1,R3
        BEQZ   R1,L
                             ;OR R4,R5,R6 was here
        DADDU  R10,R4,R3
    L:  DADDU  R7,R8,R9

Static Branch Prediction
Prediction Schemes
- Predict every branch as taken
  - The average misprediction rate equals the untaken branch frequency, which is about 34% for the SPEC programs
  - For some programs the frequency of forward taken branches may be significantly less than 50%
- Predict on branch direction
  - Backward branches are predicted as taken
  - Forward branches are predicted as not taken
  - The misprediction rate is still around 30% to 40%
- Profile-based scheme
  - Based on information collected from earlier runs
  - Relies on the finding that an individual branch is often highly biased toward taken or untaken

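
The three schemes can be compared on a toy branch trace. Everything in the sketch below is hypothetical: the addresses, the outcome counts, and the function names are made up for illustration and are not SPEC measurements. For brevity the profile is also collected on the same trace it is evaluated on, which is more optimistic than profiling an earlier run.

```python
from collections import Counter


def make_trace():
    """Hypothetical trace of (branch address, target address, taken?) records."""
    trace = []
    # Backward loop-closing branch: taken 9 times, then falls through once.
    trace += [(0x40, 0x10, True)] * 9 + [(0x40, 0x10, False)]
    # Forward branch that is rarely taken.
    trace += [(0x20, 0x60, False)] * 8 + [(0x20, 0x60, True)] * 2
    # Forward branch that is usually taken.
    trace += [(0x30, 0x80, True)] * 7 + [(0x30, 0x80, False)] * 3
    return trace


def mispredict_rate(trace, predict):
    """Fraction of dynamic branches for which predict(pc, target) is wrong."""
    wrong = sum(1 for pc, target, taken in trace if predict(pc, target) != taken)
    return wrong / len(trace)


def predict_taken(pc, target):
    """Scheme 1: always predict taken."""
    return True


def predict_by_direction(pc, target):
    """Scheme 2: backward branches taken, forward branches not taken."""
    return target < pc


def profile_predictor(profile_trace):
    """Scheme 3: predict each branch's majority direction seen in a profile."""
    taken = Counter()
    total = Counter()
    for pc, _target, was_taken in profile_trace:
        total[pc] += 1
        taken[pc] += int(was_taken)
    majority = {pc: 2 * taken[pc] >= total[pc] for pc in total}
    # Branches absent from the profile default to "taken" (an assumption).
    return lambda pc, target: majority.get(pc, True)


trace = make_trace()
print("always taken:", mispredict_rate(trace, predict_taken))
print("by direction:", mispredict_rate(trace, predict_by_direction))
print("profile     :", mispredict_rate(trace, profile_predictor(trace)))
```

On this made-up trace the profile-based predictor does best because the usually-taken forward branch defeats the direction heuristic while the rarely-taken one defeats always-taken.
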
Figure: Misprediction Rate on SPEC 92 for Profile Based Prediction

Figure: Instructions Between Mispredictions

Static Multiple Issue: The VLIW Approach
- An alternative to the superscalar approach
  - Superscalars decide dynamically how many instructions to issue
- Relies on compiler technology to
  - Minimize potential data hazards and stalls
  - Format the instructions in a potential issue packet so that the hardware does not need to check for dependences
    - The compiler either ensures that no dependences are present within an issue packet, or
    - Indicates when a dependence may be present
- Simpler hardware
- Good performance through extensive compiler optimization

VLIW Processors
- The first multiple-issue processors that required the instruction stream to be explicitly organized used wide instructions with multiple operations per instruction
  - Very Long Instruction Word (VLIW)
  - 64, 128, or more bits wide
- Early VLIW processors had rigid instruction formats
- Newer, less rigid architectures are being pursued for modern desktops

VLIW Approach
- Multiple, independent functional units
- Multiple operations are packaged into one very long instruction, or the instructions in an issue packet are constrained
  - For this discussion we assume multiple operations are placed in one VLIW instruction
- No hardware is needed to make instruction issue decisions
  - In a superscalar, the hardware that makes these decisions becomes significantly more complex as the maximum issue rate grows

VLIW Approach (continued)
Example
- Instructions might contain five operations: one integer operation (which could be a branch), two floating point operations, and two memory references
- There is a set of fields for each functional unit, on the order of 16 to 24 bits per unit
  - The quoted instruction length of 112 to 168 bits corresponds to seven such fields; five fields would give 80 to 120 bits

VLIW Approach (continued)
Scheduling
- Local scheduling: used where unrolling generates straight-line code
  - Operates on a single basic block
- Global scheduling: scheduling occurs across branches
  - More complex
  - An example is trace scheduling

VLIW Example, Local Scheduling on Single Block
1.29 Cycles Per Result (Equivalent of 7 unrolls)

    Memory Reference 1   Memory Reference 2   FP Operation 1      FP Operation 2      Integer Operation/Branch
    L.D F0,0(R1)         L.D F6,-8(R1)
    L.D F10,-16(R1)      L.D F14,-24(R1)
    L.D F18,-32(R1)      L.D F22,-40(R1)      ADD.D F4,F0,F2      ADD.D F8,F6,F2
    L.D F26,-48(R1)                           ADD.D F12,F10,F2    ADD.D F16,F14,F2
                                              ADD.D F20,F18,F2    ADD.D F24,F22,F2
    S.D F4,0(R1)         S.D F8,-8(R1)        ADD.D F28,F26,F2
    S.D F12,-16(R1)      S.D F16,-24(R1)                                              DADDUI R1,R1,#-56
    S.D F20,24(R1)       S.D F24,16(R1)
    S.D F28,8(R1)                                                                     BNE R1,R2,Loop

- 7 results in 9 clocks
- 23 operations in 9 clocks

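
The summary figures for this schedule (23 operations, 9 clocks, 1.29 cycles per result) can be checked by counting the filled slots. The sketch below encodes the schedule as one tuple per VLIW instruction; the names `VLIW_SCHEDULE` and `schedule_stats` are illustrative, and the slot-utilization figure is an extra derived number, not one quoted on the slide.

```python
# One row per clock cycle. Columns: memory reference 1, memory reference 2,
# FP operation 1, FP operation 2, integer operation/branch. None = empty slot.
VLIW_SCHEDULE = [
    ("L.D F0,0(R1)", "L.D F6,-8(R1)", None, None, None),
    ("L.D F10,-16(R1)", "L.D F14,-24(R1)", None, None, None),
    ("L.D F18,-32(R1)", "L.D F22,-40(R1)", "ADD.D F4,F0,F2", "ADD.D F8,F6,F2", None),
    ("L.D F26,-48(R1)", None, "ADD.D F12,F10,F2", "ADD.D F16,F14,F2", None),
    (None, None, "ADD.D F20,F18,F2", "ADD.D F24,F22,F2", None),
    ("S.D F4,0(R1)", "S.D F8,-8(R1)", "ADD.D F28,F26,F2", None, None),
    ("S.D F12,-16(R1)", "S.D F16,-24(R1)", None, None, "DADDUI R1,R1,#-56"),
    ("S.D F20,24(R1)", "S.D F24,16(R1)", None, None, None),
    ("S.D F28,8(R1)", None, None, None, "BNE R1,R2,Loop"),
]


def schedule_stats(schedule, results):
    """Count operations and derive throughput figures for a VLIW schedule."""
    clocks = len(schedule)
    slots = sum(len(row) for row in schedule)
    operations = sum(1 for row in schedule for op in row if op is not None)
    return {
        "clocks": clocks,
        "operations": operations,
        "ops_per_clock": operations / clocks,
        "slot_utilization": operations / slots,
        "cycles_per_result": clocks / results,
    }


stats = schedule_stats(VLIW_SCHEDULE, results=7)
for key, value in stats.items():
    print(key, round(value, 2))
```

The utilization number makes the "wasted bits" concern concrete: roughly half of the operation slots in this schedule are empty.
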
VLIW Original Model Issues
- Code size increase
  - Extreme loop unrolling
  - Wasted bits in the instruction encoding
- Limitations of lockstep operation
  - Early VLIWs operated in lockstep with no hazard detection
  - A stall in any functional unit pipeline caused the entire processor to stall (all functional units were kept synchronized)
  - Predicting which data accesses will encounter a cache stall is difficult, and any miss affects all instructions in the word
  - For large numbers of memory references the lockstep restriction becomes unacceptable
  - In more recent processors, functional units operate independently

VLIW Original Model Issues
- Binary code compatibility
  - In the VLIW approach, the code sequence (words) makes use of both the instruction set definition and the detailed pipeline structure, including both the functional units and their latencies
  - Different processor implementations therefore require different versions of the code
    - A new processor design for the old instruction set requires recompiling the code
    - Object code translation is an alternative
- Loops
  - Where the parallelism comes from unrolling loops, the individual loop iterations were independent, and the loop most likely would have run well on a vector processor
  - The multiple issue approach, however, is still preferred

Advanced Compiler Support: Detecting and Enhancing Loop Level Parallelism
- Loop-level parallelism is normally analyzed at or near the source code level
  - The question is which dependences exist among the operations in a loop across the iterations of that loop
  - Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations
- Most examples considered so far have no loop-carried dependence
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
  - There is a dependence between the two uses of x[i], but it is not carried across loop iterations
  - There is a dependence between successive uses of i in different iterations, which is loop carried, but this dependence involves an induction variable

Advanced Compiler Support: Detecting and Enhancing Loop Level Parallelism
    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];      // S1
        B[i+1] = B[i] + A[i+1];    // S2
    }
- Assume A, B, and C are distinct, non-overlapping arrays
  - Establishing this can be tricky and requires sophisticated analysis of the program
- Dependences
  - S1 uses a value computed by S1 in a previous iteration (A[i]), and S2 uses a value computed by S2 in a previous iteration (B[i])
    - These are loop-carried dependences; they force successive iterations to execute in series
  - S2 uses a value computed by S1 in the same iteration (A[i+1])
    - Not loop carried
    - If this were the only dependence, multiple iterations could execute in parallel as long as the dependent statements within each iteration were kept in order

Advanced Compiler Support: Detecting and Enhancing Loop Level Parallelism
Another loop
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];        // S1
        B[i+1] = C[i] + D[i];      // S2
    }
- Dependences
  - S1 uses a value computed by S2 in the previous iteration: iteration i computes B[i+1], which S1 reads in iteration i+1
  - The dependence is not circular; neither statement depends on itself
  - S2 does not depend on S1
- A loop is parallel if it can be written without a cycle in the dependences

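
For small loops, the dependences listed on the last two slides can be found by brute force: record which statement last wrote each array element and check every read against it. The sketch below is an illustrative assumption, not how a production compiler analyzes loops (compilers reason symbolically about index expressions); the function names and the tuple encoding of each statement are hypothetical. It reports true (read-after-write) dependences only, each with its distance in iterations, where a distance greater than 0 means loop carried.

```python
def true_dependences(body, iterations):
    """Find read-after-write dependences by simulating the loop's accesses.

    body(i) returns the statements of iteration i in program order, each as
    (statement name, list of locations read, location written), where a
    location is (array name, index). Returns a set of
    (writer statement, reader statement, array name, distance in iterations).
    """
    last_writer = {}  # location -> (statement, iteration) of the latest write
    found = set()
    for i in iterations:
        for stmt, reads, write in body(i):
            for loc in reads:  # a statement reads its operands before writing
                if loc in last_writer:
                    writer, when = last_writer[loc]
                    found.add((writer, stmt, loc[0], i - when))
            last_writer[write] = (stmt, i)
    return found


def loop_a(i):
    """A[i+1] = A[i] + C[i] (S1);  B[i+1] = B[i] + A[i+1] (S2)."""
    return [
        ("S1", [("A", i), ("C", i)], ("A", i + 1)),
        ("S2", [("B", i), ("A", i + 1)], ("B", i + 1)),
    ]


def loop_b(i):
    """A[i] = A[i] + B[i] (S1);  B[i+1] = C[i] + D[i] (S2)."""
    return [
        ("S1", [("A", i), ("B", i)], ("A", i)),
        ("S2", [("C", i), ("D", i)], ("B", i + 1)),
    ]


def loop_carried(dependences):
    """Keep only the dependences that cross an iteration boundary."""
    return {d for d in dependences if d[3] > 0}


print(sorted(true_dependences(loop_a, range(1, 101))))
print(sorted(true_dependences(loop_b, range(1, 101))))
```

In the first loop S1 and S2 each depend on themselves across iterations, which forms cycles; in the second loop the only dependence runs from S2 to S1, so there is no cycle.
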
Detecting and Enhancing Loop Level Parallelism: Transforming the Loop
Original loop
    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];        // S1
        B[i+1] = C[i] + D[i];      // S2
    }
Transformation
- There is no dependence from S1 to S2, so interchanging the two statements does not affect the outcome
- On the first iteration of the loop, statement S1 depends on the value of B[1] computed before the loop starts
The transformed loop
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

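
A quick way to gain confidence in this transformation is to run both versions on the same data and compare the arrays. The sketch below does that in Python; the helper `make_arrays`, the array size of 102 (so that indices 1 to 101 are valid), and the integer test data are assumptions chosen for the check.

```python
import random


def original(A, B, C, D):
    for i in range(1, 101):
        A[i] = A[i] + B[i]        # S1
        B[i + 1] = C[i] + D[i]    # S2


def transformed(A, B, C, D):
    A[1] = A[1] + B[1]            # first S1, peeled off in front of the loop
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]              # S2 of iteration i
        A[i + 1] = A[i + 1] + B[i + 1]      # S1 of iteration i+1
    B[101] = C[100] + D[100]      # last S2, peeled off after the loop


def make_arrays(seed=0):
    """Four arrays of 102 integers; index 0 is unused padding."""
    rng = random.Random(seed)
    return [[rng.randint(-50, 50) for _ in range(102)] for _ in range(4)]


A1, B1, C1, D1 = make_arrays()
A2, B2, C2, D2 = make_arrays()
original(A1, B1, C1, D1)
transformed(A2, B2, C2, D2)
print(A1 == A2 and B1 == B2)
```

Inside the transformed loop the dependence from S2 to S1 now stays within a single iteration, so no dependence is carried between iterations.
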
Finding Dependencies: Array Oriented Dependence Difficulties
Situations in which array-oriented dependence analysis cannot give all the information needed:
- When objects are referenced via pointers rather than array indices
- When array indexing is indirect through another array (many sparse array representations use this)
- When a dependence may exist for some values of the inputs but does not exist when the code is run, because the inputs never take on those values
- When an optimization depends on knowing more than just the possibility of a dependence and needs to know which write of a variable a given read of that variable depends on

Finding Dependencies: Points-To Analysis
- Deals with analyzing programs that use pointers
- Three major sources of analysis information
  - Type information, which restricts what a pointer can point to
    - This is a problem in loosely typed languages
  - Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to
    - If p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object
  - Information derived from pointer assignments
    - If p may be assigned the value of q, then p may point to anything q points to

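
The last two sources of information can be combined in a very small points-to computation. The sketch below is a simplified, flow-insensitive illustration under stated assumptions: pointers are identified by name, objects by hypothetical allocation-site labels such as "line10", and the only statements modeled are allocations and plain pointer copies. The function names are made up for this example.

```python
def points_to(allocations, copies):
    """Compute may-point-to sets.

    allocations: dict mapping pointer name -> set of allocation-site labels
                 the pointer is directly assigned.
    copies:      list of (dst, src) pairs, meaning "dst = src" may execute.
    Propagates until nothing changes: if dst may be assigned the value of src,
    dst may point to anything src points to.
    """
    pts = {p: set(sites) for p, sites in allocations.items()}
    changed = True
    while changed:
        changed = False
        for dst, src in copies:
            target = pts.setdefault(dst, set())
            before = len(target)
            target |= pts.get(src, set())
            if len(target) != before:
                changed = True
    return pts


def may_alias(pts, p, q):
    """True if p and q may point to the same object."""
    return bool(pts.get(p, set()) & pts.get(q, set()))


# Hypothetical program: p = malloc() at line 10, q = malloc() at line 20,
# then s = r and r = q somewhere in the code.
pts = points_to(
    allocations={"p": {"line10"}, "q": {"line20"}},
    copies=[("s", "r"), ("r", "q")],
)
print(pts)
print(may_alias(pts, "p", "q"), may_alias(pts, "s", "q"))
```

Because p only ever points to the object from line 10 and q never does, the analysis can conclude that accesses through p and q are independent, while s and q may alias.
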
Eliminating Dependent Computations
Example 1: change the instruction stream
        DADDUI R1,R2,#4
        DADDUI R1,R1,#4
    to
        DADDUI R1,R2,#8
Example 2: reorder the additions
        ADD    R1,R2,R3
        ADD    R4,R1,R6
        ADD    R8,R4,R7
    to
        ADD    R1,R2,R3
        ADD    R4,R6,R7
        ADD    R8,R1,R4
- After reordering, the first two ADDs are independent of each other

Eliminating Dependent Computations
Example 3: change how operations are performed when unrolling sum = sum + x
    From
        sum = sum + x1 + x2 + x3 + x4 + x5;
    To
        sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);
- In the first form the sum is computed left to right, so the five additions form a single dependence chain
- In the second form the longest chain has only 3 dependent additions

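
The effect of reassociating the sum can be made concrete by measuring the longest chain of dependent additions in each form. In the sketch below an expression is a nested pair of operands; this encoding and the function names are illustrative assumptions. Integer inputs are used for the equality check because reassociating floating-point additions can change rounding.

```python
def depth(expr):
    """Longest chain of dependent additions in a nested (left, right) expression."""
    if not isinstance(expr, tuple):
        return 0  # a plain variable needs no addition
    left, right = expr
    return 1 + max(depth(left), depth(right))


def evaluate(expr, env):
    """Evaluate the expression, looking variable names up in env."""
    if not isinstance(expr, tuple):
        return env[expr]
    left, right = expr
    return evaluate(left, env) + evaluate(right, env)


# sum + x1 + x2 + x3 + x4 + x5, evaluated left to right
LEFT_TO_RIGHT = ((((("sum", "x1"), "x2"), "x3"), "x4"), "x5")
# ((sum + x1) + (x2 + x3)) + (x4 + x5)
REASSOCIATED = ((("sum", "x1"), ("x2", "x3")), ("x4", "x5"))

env = {"sum": 10, "x1": 1, "x2": 2, "x3": 3, "x4": 4, "x5": 5}
print(depth(LEFT_TO_RIGHT), depth(REASSOCIATED))
print(evaluate(LEFT_TO_RIGHT, env) == evaluate(REASSOCIATED, env))
```

Both forms perform five additions; only the length of the critical path changes, which is what lets a multiple-issue machine overlap them.
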
Software Pipelining
- Observation: if the iterations of a loop are independent, more ILP can be obtained by taking instructions from different iterations
- Software pipelining (symbolic loop unrolling) reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop
  - The idea is to separate the dependent instructions of the original loop body
  - Register management can be tricky, but the idea is to turn the code into a single loop body
  - In practice both unrolling and software pipelining will be necessary because of register limitations

Software Pipelining
Original loop
    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop
Software pipelined version: unroll the loop and select instructions from different iterations
    i:    L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
    i+1:  L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
    i+2:  L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)
Instructions by original iteration (each row executes together)
    It0      It1      It2
    L.D
    ADD.D    L.D
    S.D      ADD.D    L.D
             S.D      ADD.D
                      S.D
Resulting loop
    Loop: S.D    F4,16(R1)     ;store for iteration i
          ADD.D  F4,F0,F2      ;add for iteration i+1
          L.D    F0,0(R1)      ;load for iteration i+2
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop
Only 3 iterations are shown here; more could be overlapped.

Software Pipelining
Figure: the software-pipelined loop traced on an array holding 6.0, 7.0, 8.0, 9.0, with F2 = 2.0 and R1, R2 pointing into the array.
Start-up code
          L.D    F0,16(R1)
          ADD.D  F4,F0,F2
          L.D    F0,8(R1)
Loop
    Loop: S.D    F4,16(R1)
          ADD.D  F4,F0,F2
          L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          BNE    R1,R2,Loop
Wind-down code
          S.D    F4,16(R1)
          ADD.D  F4,F0,F2
          S.D    F4,8(R1)

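
The start-up, loop, and wind-down structure of a software-pipelined loop can be sketched at the source level as follows. This is an illustration under stated assumptions rather than a translation of the MIPS code: the function names are hypothetical, the array is walked in ascending order for readability (the MIPS loop walks downward), and the variables `loaded` and `summed` stand in for registers F0 and F4.

```python
def add_scalar(x, s):
    """Reference version: one load, add, and store per iteration."""
    return [value + s for value in x]


def add_scalar_software_pipelined(x, s):
    """Same result, with each loop pass mixing three original iterations."""
    n = len(x)
    if n < 2:
        return add_scalar(x, s)  # too short to pipeline
    out = list(x)

    # Start-up code: fill the pipeline.
    loaded = x[0]            # load for iteration 0
    summed = loaded + s      # add for iteration 0
    loaded = x[1]            # load for iteration 1

    # Steady state: store iteration i, add iteration i+1, load iteration i+2.
    for i in range(n - 2):
        out[i] = summed
        summed = loaded + s
        loaded = x[i + 2]

    # Wind-down code: drain the two iterations still in flight.
    out[n - 2] = summed
    summed = loaded + s
    out[n - 1] = summed
    return out


print(add_scalar_software_pipelined([6.0, 7.0, 8.0, 9.0], 2.0))
```

Within one pass of the steady-state loop, the store, add, and load belong to three different original iterations, so none of them has to wait on another instruction issued in the same pass.
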
Figure: Cross Section View