CS 211: Computer Architecture
Lecture 6: Exploiting Instruction Level Parallelism with Software Approaches
Instructor: Morris Lancaster

Basic Compiler Techniques for Exposing ILP
Crucial for processors that use static issue, and important for processors that make dynamic issue decisions but use static scheduling.

Basic Pipeline Scheduling and Loop Unrolling
Exploiting parallelism among instructions:
Finding sequences of unrelated instructions that can be overlapped in the pipeline
Separating a dependent instruction from its source instruction by a distance in clock cycles equal to the pipeline latency of the source instruction, thereby avoiding the stall
The compiler works with knowledge of the amount of available ILP in the program and of the latencies of the functional units within the pipeline
This couples the compiler to the specific chip version, or at least requires setting the appropriate compiler flags

Assumed Latencies
Instruction producing result    Instruction using result    Latency in clock cycles (needed to avoid stall)
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
The result of the load can be bypassed without stalling the store.
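As a minimal sketch (not part of the original slides) of how a compiler's scheduler might consume this table, the latencies could be encoded as a small lookup; the enum names and array layout are assumptions for illustration:

/* Hypothetical encoding of the assumed-latency table for a list scheduler. */
enum op_class { FP_ALU, LOAD_D, STORE_D, OTHER };

/* latency[producer][consumer]: cycles the consumer must wait after the producer issues. */
static const int latency[4][4] = {
    /*             FP_ALU  LOAD_D  STORE_D  OTHER */
    /* FP_ALU  */ {   3,     0,      2,      0 },
    /* LOAD_D  */ {   1,     0,      0,      0 },  /* load result bypassed to store: 0 */
    /* STORE_D */ {   0,     0,      0,      0 },
    /* OTHER   */ {   0,     0,      0,      0 },
};

int stall_free_distance(enum op_class producer, enum op_class consumer) {
    return latency[producer][consumer];  /* schedule the consumer at least this far away */
}

A list scheduler would consult such a table when deciding how far apart to place a producer and its consumer.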

Basic Pipeline Scheduling and Loop Unrolling (cont.)
Assume the standard 5-stage integer pipeline
Branches have a delay of one clock cycle
Functional units are fully pipelined or replicated (as many times as the pipeline depth)
An operation of any type can be issued on every clock cycle and there are no structural hazards

Basic Pipeline Scheduling and Loop Unrolling (cont.)
Sample code:
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

MIPS code:
Loop: L.D    F0,0(R1)     ;F0 = array element
      ADD.D  F4,F0,F2     ;add scalar in F2
      S.D    F4,0(R1)     ;store back
      DADDUI R1,R1,#-8    ;decrement index
      BNE    R1,R2,Loop   ;R2 is precomputed so that 8(R2) is the last element to be computed

Basic Pipeline Scheduling and Loop Unrolling (cont.)
MIPS code, with the clock cycle in which each instruction issues shown as a comment:
Loop: L.D    F0,0(R1)    ;1
      stall              ;2
      ADD.D  F4,F0,F2    ;3
      stall              ;4
      stall              ;5
      S.D    F4,0(R1)    ;6
      DADDUI R1,R1,#-8   ;7
      stall              ;8
      BNE    R1,R2,Loop  ;9

Rescheduling Gives
Sample code:
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

Scheduled MIPS code:
Loop: L.D    F0,0(R1)    ;1
      DADDUI R1,R1,#-8   ;2
      ADD.D  F4,F0,F2    ;3
      stall              ;4
      BNE    R1,R2,Loop  ;5
      S.D    F4,8(R1)    ;6  store moved below DADDUI and BNE; offset adjusted to 8(R1)

Unrolling Gives
MIPS code (unrolled four times, not yet scheduled):
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop

Unrolling and Removing Hazards Gives
MIPS code (total of 14 clock cycles):
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      BNE    R1,R2,Loop
      S.D    F16,8(R1)

Unrolling Summary for Above
Determine that it was legal to move the S.D after the DADDUI and BNE, and find the amount by which to adjust the S.D offset
Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code
Use different registers to avoid unnecessary constraints that would be forced by reusing the same registers
Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
Determine that the loads and stores can be interchanged by showing that the loads and stores from different iterations are independent
Schedule the code, preserving any dependences
(A source-level sketch of the same transformation follows.)
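For readers who prefer to see the transformation at the source level, here is a hedged C sketch (not from the original slides) of the unroll-by-four version; the temporary names are hypothetical, and because 1000 is a multiple of four no cleanup loop is needed:

/* Hypothetical C-level view of the unroll-by-four transformation of the loop
   from the slides; distinct temporaries play the role of the distinct
   F registers in the MIPS version. */
void add_scalar_unrolled(double x[], double s) {
    for (int i = 1000; i > 0; i = i - 4) {
        double t0 = x[i]     + s;   /* corresponds to F4  */
        double t1 = x[i - 1] + s;   /* corresponds to F8  */
        double t2 = x[i - 2] + s;   /* corresponds to F12 */
        double t3 = x[i - 3] + s;   /* corresponds to F16 */
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}

Three of the four compare-and-branch pairs of the original loop disappear, which is the "eliminate the extra test and branch instructions" step listed above.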

Unrolling Summary (continued)
Limits to the impact of unrolling loops:
Diminishing returns – each additional unroll yields a smaller improvement in amortizing the loop overhead
Growth in code size
Not good for embedded computers
Large code size may increase the cache miss rate
Shortfall in available registers (register pressure)
Scheduling the code to increase ILP causes the number of live values to increase
This can create a shortage of registers and negatively impact the optimization
Despite these limits, unrolling is useful in a variety of processors today

Unrolling and Pipeline Scheduling with Static Multiple Issue
Assume a two-issue, statically scheduled superscalar MIPS pipeline. The loop is unrolled five times; on the single-issue pipeline of the previous example, five unrolls would take 17 cycles.

      Integer instruction          FP instruction           Clock
Loop: L.D    F0,0(R1)                                         1
      L.D    F6,-8(R1)                                         2
      L.D    F10,-16(R1)          ADD.D F4,F0,F2               3
      L.D    F14,-24(R1)          ADD.D F8,F6,F2               4
      L.D    F18,-32(R1)          ADD.D F12,F10,F2             5
      S.D    F4,0(R1)             ADD.D F16,F14,F2             6
      S.D    F8,-8(R1)            ADD.D F20,F18,F2             7
      S.D    F12,-16(R1)                                       8
      DADDUI R1,R1,#-40                                        9
      S.D    F16,16(R1)                                       10
      S.D    F20,8(R1)                                        11
      BNE    R1,R2,Loop                                       12

Unrolling and Pipeline Scheduling with Static Multiple Issue
The unrolled loop now takes 12 cycles for five elements, or 2.4 cycles per element, versus 3.5 cycles per element (14 cycles for four elements) for the scheduled and unrolled loop on the single-issue pipeline.

Static Branch Prediction
The expectation is that branch behavior is highly predictable at compile time (static prediction can also be used to help dynamic predictors)
In practice the variance is large: misprediction rates range from about 9% to 59% across benchmarks
Consider this example, in which the dependence of DSUBU and BEQZ on the LD means a stall is needed after the load:
    LD     R1,0(R2)
    DSUBU  R1,R1,R3
    BEQZ   R1,L
    OR     R4,R5,R6
    DADDU  R10,R4,R3
L:  DADDU  R7,R8,R9

Static Branch Prediction
Suppose the branch is almost always taken and the value of R7 is not needed on the fall-through path. The DADDU R7,R8,R9 from the taken target can then be hoisted to fill the load delay:

After scheduling:
    LD     R1,0(R2)
    DADDU  R7,R8,R9
    DSUBU  R1,R1,R3
    BEQZ   R1,L
    OR     R4,R5,R6
    DADDU  R10,R4,R3
L:  ; the DADDU R7,R8,R9 was here

Original:
    LD     R1,0(R2)
    DSUBU  R1,R1,R3
    BEQZ   R1,L
    OR     R4,R5,R6
    DADDU  R10,R4,R3
L:  DADDU  R7,R8,R9

Static Branch Prediction
Suppose instead the branch is rarely taken and the value of R4 is not needed on the taken path. The OR can then be hoisted to fill the load delay:

After scheduling:
    LD     R1,0(R2)
    OR     R4,R5,R6
    DSUBU  R1,R1,R3
    BEQZ   R1,L        ; the OR was here
    DADDU  R10,R4,R3
L:  DADDU  R7,R8,R9

Original:
    LD     R1,0(R2)
    DSUBU  R1,R1,R3
    BEQZ   R1,L
    OR     R4,R5,R6
    DADDU  R10,R4,R3
L:  DADDU  R7,R8,R9

Static Branch Prediction – Prediction Schemes
Predict every branch as taken
Average misprediction rate equals the untaken branch frequency, about 34% for the SPEC programs
For some programs the frequency of taken forward branches may be significantly less than 50%
Predict on branch direction (a sketch follows below)
Backward branches predicted as taken
Forward branches predicted as not taken
Misprediction rate still around 30% to 40%
Profile-based scheme
Based on information collected from earlier runs
Works because an individual branch is often highly biased toward taken or untaken
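As a small, hedged illustration (not in the original slides) of the predict-on-direction rule, a compiler or simulator might implement it roughly as follows; the function name is hypothetical:

#include <stdbool.h>
#include <stdint.h>

/* Backward branches (target at or before the branch) are predicted taken,
   since they usually close loops; forward branches are predicted not taken. */
bool predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc <= branch_pc;
}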

Misprediction Rate on SPEC92 for Profile-Based Prediction
[Figure: per-benchmark misprediction rates; chart not reproduced in this transcript.]

Instructions Between Mispredictions
[Figure: per-benchmark instruction counts between mispredictions; chart not reproduced in this transcript.]

Static Multiple Issue: The VLIW Approach
An alternative to the superscalar approach (superscalars decide dynamically how many instructions to issue)
Relies on compiler technology to:
Minimize potential data hazards and stalls
Format the instructions in a potential issue packet so that the hardware does not need to check for dependences
The compiler either ensures that dependences cannot be present within an issue packet, or indicates when a dependence may be present
The result is simpler hardware, with good performance achieved through extensive compiler optimization

VLIW Processors
The first multiple-issue processors that required the instruction stream to be explicitly organized used wide instructions with multiple operations per instruction
Very Long Instruction Word (VLIW): 64, 128, or more bits wide
Early VLIW processors were rigid in their formats
Newer, less rigid architectures are being pursued for modern desktops

VLIW Approach
Multiple, independent functional units
Multiple operations are packaged into one very long instruction, or issue packets are constrained; for this discussion we assume multiple operations in one VLIW instruction
No hardware is needed to make instruction issue decisions
(By contrast, as the maximum issue rate of a superscalar grows, the hardware needed to make issue decisions becomes significantly more complex)

VLIW Approach (continued)
Example: an instruction might contain five operations, including one integer operation (which could be a branch), two floating point operations, and two memory references
There is a set of fields for each functional unit, on the order of 16 to 24 bits per unit, yielding an instruction length of between 112 and 168 bits
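As a hedged, illustrative sketch (not from the slides), such a five-slot instruction word could be modeled in C roughly as follows; the field names and the uniform 24-bit slot width are assumptions chosen to land in the 112-168 bit range:

/* Hypothetical layout of a five-slot VLIW word with 24-bit slots
   (5 x 24 = 120 bits). */
struct vliw_word {
    unsigned int int_or_branch : 24;  /* one integer operation or branch */
    unsigned int fp_op1        : 24;  /* first floating point operation  */
    unsigned int fp_op2        : 24;  /* second floating point operation */
    unsigned int mem_ref1      : 24;  /* first memory reference          */
    unsigned int mem_ref2      : 24;  /* second memory reference         */
};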

VLIW Approach (continued)
Scheduling:
Local scheduling – unrolling generates straight-line code; operates on a single basic block
Global scheduling – scheduling occurs across branches; more complex; trace scheduling is an example

VLIW Example: Local Scheduling on a Single Block
1.29 cycles per result (equivalent of 7 unrolls): 7 results in 9 clocks, 23 operations in 9 clocks

Memory reference 1    Memory reference 2    FP operation 1      FP operation 2      Integer operation/branch
L.D F0,0(R1)          L.D F6,-8(R1)
L.D F10,-16(R1)       L.D F14,-24(R1)
L.D F18,-32(R1)       L.D F22,-40(R1)       ADD.D F4,F0,F2      ADD.D F8,F6,F2
L.D F26,-48(R1)                             ADD.D F12,F10,F2    ADD.D F16,F14,F2
                                            ADD.D F20,F18,F2    ADD.D F24,F22,F2
S.D F4,0(R1)          S.D F8,-8(R1)         ADD.D F28,F26,F2
S.D F12,-16(R1)       S.D F16,-24(R1)                                               DADDUI R1,R1,#-56
S.D F20,24(R1)        S.D F24,16(R1)
S.D F28,8(R1)                                                                        BNE R1,R2,Loop

VLIW Original Model Issues
Code size increase
Extreme loop unrolling
Wasted bits in the instruction encoding
Limitations of lockstep operation
Early VLIWs operated in lockstep with no hazard detection
A stall in any functional unit of the pipeline caused the entire processor to stall (all functional units were kept synchronized)
Predicting which data accesses will encounter a cache stall is difficult – any miss stalls all instructions in the word
For large numbers of memory references the lockstep restriction becomes unacceptable
In more recent processors, the functional units operate more independently

VLIW Original Model Issues (continued)
Binary code compatibility
In the VLIW approach, the code sequence (the instruction words) depends on both the instruction set definition and the detailed pipeline structure, including the functional units and their latencies
Different implementations therefore require different versions of the code
A new processor design on the old instruction set requires recompilation of the code (or object code translation)
Loops
Where loops can be unrolled effectively, the individual loop iterations are independent and most likely would have run well on a vector processor; the multiple-issue approach, however, is still preferred

Advanced Compiler Support – Detecting and Enhancing Loop Level Parallelism
Loop-level parallelism is normally analyzed at or near the source code level: what dependences exist among the operations in a loop across the iterations of that loop?
Loop-carried dependence – data accesses in later iterations depend on data values produced in earlier iterations
Most examples considered so far have had no loop-carried dependence:
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
There is a dependence between the two uses of x[i], but it does not carry across iterations
There is a dependence between successive uses of i in different iterations, which is loop-carried, but this dependence involves an induction variable and can be recognized and eliminated (a short sketch follows)
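As a hedged illustration (not in the original slides), the induction-variable dependence disappears once each iteration's index is written as a pure function of the iteration number, which is how a parallelizing compiler views the loop; the function name is hypothetical:

/* The closed-form index removes the only cross-iteration dependence, so the
   body is independent across iterations and the iterations could, in
   principle, be reordered or run in parallel. */
void add_scalar(double *x, double s) {
    for (int iter = 0; iter < 1000; iter++) {
        int i = 1000 - iter;      /* closed form of the induction variable */
        x[i] = x[i] + s;          /* no value flows between iterations */
    }
}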

Advanced Compiler Support – Detecting and Enhancing Loop Level Parallelism
for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    //S1
    B[i+1] = B[i] + A[i+1];  //S2
}
Assume A, B, and C are distinct, non-overlapping arrays (establishing this can be tricky and requires sophisticated analysis of the program)
Dependences:
S1 uses a value computed by S1 in a previous iteration (A[i]), and S2 uses a value computed by S2 in a previous iteration (B[i]) – these are loop-carried dependences and force the iterations to execute in order
S2 uses a value computed by S1 in the same iteration (A[i+1]) – not loop-carried
If the same-iteration dependence were the only one, multiple iterations could execute in parallel as long as the dependent statements within each iteration were kept in order

Advanced Compiler Support – Detecting and Enhancing Loop Level Parallelism
Another loop:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      //S1
    B[i+1] = C[i] + D[i];    //S2
}
Dependences:
S1 uses the value B[i] computed by S2 in the previous iteration, so this dependence is loop-carried
The dependence is not circular: neither statement depends on itself, and S2 does not depend on S1
A loop is parallel if it can be written without a cycle in its dependences

Detecting and Enhancing Loop Level Parallelism – Transforming the Loop
Original loop:
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      //S1
    B[i+1] = C[i] + D[i];    //S2
}
Transformation:
There is no dependence from S1 to S2, so interchanging the two statements will not affect the outcome
On the first iteration of the loop, statement S1 depends on the value B[1] computed prior to entering the loop
The transformed loop:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

Finding Dependences – Array-Oriented Dependence Difficulties
Situations in which array-oriented dependence analysis cannot give all the information needed:
When objects are referenced via pointers rather than array indices
When array indexing is indirect through another array, as in many sparse array representations (sketched below)
When a dependence may exist for some values of the inputs, but does not exist when the code is run because the inputs never take on those values
When an optimization depends on knowing more than just the possibility of a dependence, and needs to know on which write of a variable a read of that variable depends
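As a hedged illustration (not from the slides), indirect indexing of the kind used by sparse representations defeats static analysis because the contents of the index array are unknown at compile time; the names below are hypothetical:

/* Whether iterations are independent depends on whether K[] contains
   duplicate values, which the compiler generally cannot prove statically:
   if K[i] == K[j] for i != j, there is a loop-carried dependence. */
void sparse_update(double *A, const double *B, const int *K, int n) {
    for (int i = 0; i < n; i++)
        A[K[i]] = A[K[i]] + B[i];
}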

Finding Dependences – Points-To Analysis
Deals with analyzing programs that use pointers
Three major sources of analysis information:
Type information, which restricts what a pointer can point to (a problem in loosely typed languages)
Information derived when an object is allocated or when the address of an object is taken, which can be used to restrict what a pointer can point to: if p always points to an object allocated in a given source line and q never points to that object, then p and q can never point to the same object
Information derived from pointer assignments: if p may be assigned the value of q, then p may point to anything q points to
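As a hedged illustration (not in the slides) of why this matters, the payoff of points-to analysis is the freedom to reorder accesses made through pointers; the function below is hypothetical:

/* If analysis proves dst and src can never point into the same object, the
   loads and stores of different iterations are independent and the loop can
   be unrolled and scheduled freely; if they may alias, the compiler must
   assume a possible dependence and keep the original order. */
void copy_add(double *dst, const double *src, double s, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + s;
}

In C, the programmer can also make the no-alias promise explicit with the restrict qualifier (for example, double * restrict dst), which is another route by which this information reaches the compiler.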

Eliminating Dependent Computations
Example 1 – change the instruction stream from:
    DADDUI R1,R2,#4
    DADDUI R1,R1,#4
to:
    DADDUI R1,R2,#8
Example 2 – reorder so that the first two adds are independent, from:
    ADD R1,R2,R3
    ADD R4,R1,R6
    ADD R8,R4,R7
to:
    ADD R1,R2,R3
    ADD R4,R6,R7
    ADD R8,R1,R4

Eliminating Dependent Computations
Example 3 – change how the operations are performed when unrolling sum = sum + x, from:
    sum = sum + x1 + x2 + x3 + x4 + x5;
to:
    sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);
In the first form, the sum is computed strictly left to right, giving a chain of five dependent additions. In the second form, the dependence chain is only three additions deep, so more of the additions can execute in parallel. (A loop-level version of this idea is sketched below.)
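As a hedged illustration (not from the slides), the same reassociation idea is commonly applied to an unrolled reduction loop by keeping several partial sums; this changes the floating-point rounding order, so it is only valid when reassociation is permitted. The function name and unroll factor are assumptions:

/* Four accumulators break the single serial dependence chain on the sum;
   assumes n is a multiple of 4 and FP reassociation is acceptable. */
double sum_array(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}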

Software Pipelining
Observation: if the iterations of a loop are independent, we can get more ILP by taking instructions from different iterations
Software pipelining (symbolic loop unrolling) reorganizes a loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop
The idea is to separate the dependences within the original loop body
Register management can be tricky, but the goal is to keep the code as a single loop body
In practice, both unrolling and software pipelining will be necessary because of register limitations
(A source-level sketch of the prologue/kernel/epilogue structure follows.)
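As a hedged, source-level sketch (not from the slides), software pipelining the x[i] = x[i] + s loop separates the load, add, and store of one original iteration across different iterations of the new loop; the start-up and wind-down code are the usual prologue and epilogue, and the function name is hypothetical:

/* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s;
   In each kernel iteration the store, add, and load belong to three
   different iterations of the original loop (assumes n >= 2). */
void add_scalar_swp(double *x, double s, int n) {
    double loaded, added;

    /* Prologue: fill the pipeline. */
    loaded = x[n];            /* load for iteration n   */
    added  = loaded + s;      /* add for iteration n    */
    loaded = x[n - 1];        /* load for iteration n-1 */

    /* Kernel: one store, one add, one load per pass, each from a different
       original iteration, so the body has no serial load-add-store chain. */
    for (int i = n; i > 2; i--) {
        x[i]   = added;       /* store for iteration i   */
        added  = loaded + s;  /* add for iteration i-1   */
        loaded = x[i - 2];    /* load for iteration i-2  */
    }

    /* Epilogue: drain the pipeline. */
    x[2] = added;
    x[1] = loaded + s;
}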

Software Pipelining
Original loop:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop

Software pipelined version – unroll the loop and select instructions from different iterations:
i:    L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
i+1:  L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
i+2:  L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)

Overlap of iterations:
it0: L.D   ADD.D  S.D
it1:       L.D    ADD.D  S.D
it2:              L.D    ADD.D  S.D

Software pipelined loop body (one instruction from each of three different iterations):
Loop: S.D    F4,16(R1)   ;store for iteration i
      ADD.D  F4,F0,F2    ;add for iteration i-1
      L.D    F0,0(R1)    ;load for iteration i-2
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop

Only three iterations are shown above; we could do more.

Software Pipelining (worked example)
[Figure: step-by-step animation of the software-pipelined loop running over a small array, showing register and memory contents at each step; only the code is reproduced here.]

Start-up (prologue) code:
      L.D    F0,16(R1)
      ADD.D  F4,F0,F2
      L.D    F0,8(R1)

Pipelined loop:
Loop: S.D    F4,16(R1)
      ADD.D  F4,F0,F2
      L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop

Cross Section View
[Figure not reproduced in this transcript.]