EEL 5708: Advanced Pipelining and Instruction Level Parallelism. Lotzi Bölöni.

Acknowledgements
All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright University of California, Berkeley.

Review: Summary of Pipelining Basics
Hazards limit performance:
– Structural: need more HW resources
– Data: need forwarding, compiler scheduling
– Control: early branch evaluation & PC update, delayed branch, prediction
Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency.
Interrupts, the instruction set, and FP make pipelining harder.
Compilers reduce the cost of data and control hazards:
– Load delay slots
– Branch delay slots
– Branch prediction
Today: longer pipelines => better branch prediction, more instruction parallelism?

The CPI for pipelined processors
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
There are a variety of techniques for improving the various components; what we have seen so far is just the beginning. Techniques can be:
– Hardware (dynamic)
– Software (static)
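
A quick worked instance of the formula (the stall counts here are hypothetical, not from the slides): with an ideal pipeline CPI of 1.0 and per-instruction averages of 0.05 structural, 0.15 data hazard, and 0.10 control stall cycles,

    Pipeline CPI = 1.0 + 0.05 + 0.15 + 0.10 = 1.30

so such a pipeline would sustain about 1/1.30 ≈ 0.77 instructions per clock instead of 1.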

Techniques for ILP

Technique                                  | Reduces
-------------------------------------------+---------------------------------------------------------
Forwarding and bypassing                   | Potential data hazard stalls
Delayed branches & branch scheduling       | Control hazard stalls
Basic dynamic scheduling (scoreboarding)   | Data hazards from true dependences
Dynamic scheduling with renaming           | Data hazards from antidependences and output dependences
Dynamic branch prediction                  | Control stalls
Issuing multiple instructions per cycle    | Ideal CPI
Speculation                                | Data and control hazard stalls
Dynamic memory disambiguation              | Data hazard stalls with memory

Techniques for ILP (cont'd)

Technique                                  | Reduces
-------------------------------------------+---------------------------------------------------------
Loop unrolling                             | Control hazard stalls
Basic compiler pipeline scheduling         | Data hazard stalls
Compiler dependence analysis               | Ideal CPI, data hazard stalls
Software pipelining, trace scheduling      | Ideal CPI, data hazard stalls
Compiler speculation                       | Ideal CPI, data and control stalls

Loop unrolling (software, static method)

FP Loop: Where are the Hazards?

Loop:  LD    F0,0(R1)   ;F0 = vector element
       ADDD  F4,F0,F2   ;add scalar from F2
       SD    0(R1),F4   ;store result
       SUBI  R1,R1,8    ;decrement pointer by 8B (DW)
       BNEZ  R1,Loop    ;branch if R1 != zero
       NOP              ;delayed branch slot

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0

Where are the stalls?
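
For reference, a minimal C sketch of what this assembly computes (the function name and the identifiers x, n, and s are assumptions, not from the slides):

    /* Add the scalar s to every element of x, walking the index
       down just as the assembly walks the pointer in R1. */
    void add_scalar(double *x, long n, double s) {
        for (long i = n - 1; i >= 0; i--)
            x[i] = x[i] + s;    /* one LD / ADDD / SD per iteration */
    }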

FP Loop Showing Stalls

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1

1  Loop:  LD    F0,0(R1)   ;F0 = vector element
2         stall
3         ADDD  F4,F0,F2   ;add scalar in F2
4         stall
5         stall
6         SD    0(R1),F4   ;store result
7         SUBI  R1,R1,8    ;decrement pointer by 8B (DW)
8         BNEZ  R1,Loop    ;branch if R1 != zero
9         stall            ;delayed branch slot

9 clock cycles per iteration. Can we rewrite the code to minimize the stalls?

Revised FP Loop Minimizing Stalls

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1

1  Loop:  LD    F0,0(R1)
2         stall
3         ADDD  F4,F0,F2
4         SUBI  R1,R1,8
5         BNEZ  R1,Loop    ;delayed branch
6         SD    8(R1),F4   ;address altered when moved past SUBI

The SD is swapped with BNEZ (into the delay slot) by changing the address of the SD.
6 clock cycles per iteration. Can we unroll the loop 4 times to make it faster?

Unroll Loop Four Times (straightforward way)

1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4     ;drop SUBI & BNEZ
4          LD    F6,-8(R1)
5          ADDD  F8,F6,F2
6          SD    -8(R1),F8    ;drop SUBI & BNEZ
7          LD    F10,-16(R1)
8          ADDD  F12,F10,F2
9          SD    -16(R1),F12  ;drop SUBI & BNEZ
10         LD    F14,-24(R1)
11         ADDD  F16,F14,F2
12         SD    -24(R1),F16
13         SUBI  R1,R1,#32    ;altered to 4*8
14         BNEZ  R1,LOOP
15         NOP

15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration (1 stall after each LD, 2 after each ADDD).
Assumes the number of iterations is a multiple of 4.
Can we rewrite the loop to minimize the stalls?
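
A minimal C sketch of the same straightforward 4x unrolling (names are assumptions; like the slide, it assumes the element count n is a multiple of 4):

    /* Four copies of the loop body, one index update and one
       branch per four elements instead of per element. */
    void add_scalar_unrolled4(double *x, long n, double s) {
        for (long i = n - 1; i >= 3; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }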

Unrolled Loop That Minimizes Stalls

1   Loop:  LD    F0,0(R1)
2          LD    F6,-8(R1)
3          LD    F10,-16(R1)
4          LD    F14,-24(R1)
5          ADDD  F4,F0,F2
6          ADDD  F8,F6,F2
7          ADDD  F12,F10,F2
8          ADDD  F16,F14,F2
9          SD    0(R1),F4
10         SD    -8(R1),F8
11         SD    -16(R1),F12
12         SUBI  R1,R1,#32
13         BNEZ  R1,LOOP
14         SD    8(R1),F16   ;8 - 32 = -24

14 clock cycles, or 3.5 per iteration.
What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes the register used in the store's address
– OK to move the loads before the stores: do we still get the right data?
– When is it safe for the compiler to make such changes?
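
The same scheduling idea sketched at the C level (names are assumptions; a real compiler performs this reordering on the instructions, not on the source):

    /* All four loads first, then the four adds, then the four
       stores, mirroring the stall-free schedule above. */
    void add_scalar_scheduled4(double *x, long n, double s) {
        for (long i = n - 1; i >= 3; i -= 4) {
            double t0 = x[i],     t1 = x[i - 1];
            double t2 = x[i - 2], t3 = x[i - 3];
            t0 += s; t1 += s; t2 += s; t3 += s;
            x[i]     = t0; x[i - 1] = t1;
            x[i - 2] = t2; x[i - 3] = t3;
        }
    }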

Compiler Perspectives on Code Movement

Definitions: the compiler is concerned with dependences in the program; whether a dependence turns into a HW hazard depends on the given pipeline. The compiler tries to schedule code to avoid hazards.

(True) data dependence (RAW if a hazard for HW):
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a dependence chain).

If two instructions are dependent, they can't execute in parallel.
Dependences are easy to determine for registers (fixed names), hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
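
A small C sketch of why memory dependences are hard (the function and names are assumptions): the compiler may not reorder the store and the loads below unless it can prove the pointers never alias; C99's restrict qualifier is one way the programmer can assert that.

    /* Without restrict, dst and src may overlap, so the compiler
       must assume a possible memory dependence between the store
       to dst[i] and later loads from src. */
    void scale(double *restrict dst, const double *restrict src,
               long n, double s) {
        for (long i = 0; i < n; i++)
            dst[i] = src[i] * s;   /* with restrict: free to reorder/unroll */
    }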

Where are the data dependencies?

1  Loop:  LD    F0,0(R1)
2         ADDD  F4,F0,F2
3         SUBI  R1,R1,8
4         BNEZ  R1,Loop    ;delayed branch
5         SD    8(R1),F4   ;address altered when moved past SUBI

Compiler Perspectives on Code Movement

Another kind of dependence is the name dependence: two instructions use the same name (register or memory location) but don't exchange data.

Antidependence (WAR if a hazard for HW):
– Instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.

Output dependence (WAW if a hazard for HW):
– Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
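
A tiny C sketch showing both name dependences (an assumed example, not from the slides):

    void name_deps(double *x) {
        double a = x[0];     /* i: reads x[0]                           */
        x[0] = 3.0;          /* j: writes x[0] -> antidependence (WAR)  */
        double t = a * 2.0;  /* first write of t                        */
        t = a + 1.0;         /* second write of t -> output dep. (WAW); */
        x[1] = t;            /* renaming the second t would remove it   */
    }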

Where are the name dependencies?

1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4     ;drop SUBI & BNEZ
4          LD    F0,-8(R1)
5          ADDD  F4,F0,F2
6          SD    -8(R1),F4    ;drop SUBI & BNEZ
7          LD    F0,-16(R1)
8          ADDD  F4,F0,F2
9          SD    -16(R1),F4   ;drop SUBI & BNEZ
10         LD    F0,-24(R1)
11         ADDD  F4,F0,F2
12         SD    -24(R1),F4
13         SUBI  R1,R1,#32    ;altered to 4*8
14         BNEZ  R1,LOOP
15         NOP

How can we remove them?

Where are the name dependencies?

1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4     ;drop SUBI & BNEZ
4          LD    F6,-8(R1)
5          ADDD  F8,F6,F2
6          SD    -8(R1),F8    ;drop SUBI & BNEZ
7          LD    F10,-16(R1)
8          ADDD  F12,F10,F2
9          SD    -16(R1),F12  ;drop SUBI & BNEZ
10         LD    F14,-24(R1)
11         ADDD  F16,F14,F2
12         SD    -24(R1),F16
13         SUBI  R1,R1,#32    ;altered to 4*8
14         BNEZ  R1,LOOP
15         NOP

Giving each copy of the body its own registers is called "register renaming".
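
The renaming idea sketched in C (an assumed example): reusing a single temporary creates name dependences between the copies; distinct temporaries remove them.

    /* Name-dependent: both copies reuse t, so the second copy
       has WAR/WAW dependences on the first and must follow it. */
    void reuse_temp(double *x, double s) {
        double t;
        t = x[0]; x[0] = t + s;
        t = x[1]; x[1] = t + s;
    }

    /* Renamed: t0 and t1 are independent, so the two copies can
       be freely reordered or interleaved. */
    void renamed_temps(double *x, double s) {
        double t0 = x[0], t1 = x[1];
        x[0] = t0 + s;
        x[1] = t1 + s;
    }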

Compiler Perspectives on Code Movement

Again, name dependences are hard to determine for memory accesses:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Our example required the compiler to know that if R1 doesn't change, then
0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1).
Then there were no dependences between some of the loads and stores, so they could be moved past each other.

Compiler Perspectives on Code Movement

The final kind of dependence is the control dependence. Example:

if p1 {S1;};
if p2 {S2;};

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Compiler Perspectives on Code Movement

Two (obvious) constraints on control dependences:
– An instruction that is control dependent on a branch cannot be moved before the branch, where its execution would no longer be controlled by the branch.
– An instruction that is not control dependent on a branch cannot be moved after the branch, where its execution would become controlled by the branch.

Control dependences can be relaxed to get more parallelism: we get the same effect as preserving them if we preserve the order of exceptions (e.g., an address in a register is checked by a branch before use) and the data flow (whether the value in a register depends on the branch).

Where are the control dependencies?

1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          SUBI  R1,R1,8
5          BEQZ  R1,exit
6          LD    F0,0(R1)
7          ADDD  F4,F0,F2
8          SD    0(R1),F4
9          SUBI  R1,R1,8
10         BEQZ  R1,exit
11         LD    F0,0(R1)
12         ADDD  F4,F0,F2
13         SD    0(R1),F4
14         SUBI  R1,R1,8
15         BEQZ  R1,exit
....

When Safe to Unroll Loop?

Example: where are the data dependences? (A, B, C are distinct and non-overlapping)

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration: iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

This is a "loop-carried dependence": a dependence between iterations. It implies that the iterations are dependent and cannot be executed in parallel. That was not the case in our prior example, where each iteration was independent.
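
For contrast, a sketch (an assumed companion example, not from these slides) of a loop-carried dependence that does not force serial execution, because the dependence is not circular:

    /* S1 of iteration i+1 reads B[i+1] written by S2 of iteration i,
       but nothing in a later iteration depends on S1. */
    void original(double A[], double B[], double C[], double D[]) {
        for (int i = 1; i <= 100; i++) {
            A[i]     = A[i] + B[i];    /* S1 */
            B[i + 1] = C[i] + D[i];    /* S2 */
        }
    }

    /* Equivalent loop after peeling off the first S1 and the last S2:
       the new body has no dependence between iterations, so the
       iterations can run in parallel. */
    void transformed(double A[], double B[], double C[], double D[]) {
        A[1] = A[1] + B[1];
        for (int i = 1; i <= 99; i++) {
            B[i + 1] = C[i] + D[i];
            A[i + 1] = A[i + 1] + B[i + 1];
        }
        B[101] = C[100] + D[100];
    }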

Summary

Instruction Level Parallelism (ILP) can be exploited in SW or HW.
Loop-level parallelism is the easiest to see.
SW dependences are defined for the program; they become hazards only if the HW cannot resolve them.
The program's dependences and the compiler's sophistication determine whether the compiler can unroll loops.
– Memory dependences are the hardest to determine.