1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 9, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
2 Multiple Issue
 Eliminating data and control stalls can achieve a CPI of 1
 Can we decrease CPI below 1?
   Not if we issue only one instruction per clock cycle
 Multiple-issue processors allow multiple instructions to issue in a clock cycle
   Superscalar: issue a varying number of instructions per clock (dynamic issue)
    Statically scheduled by compiler
    Dynamically scheduled by hardware
   VLIW: issue a fixed number of instructions per clock (static issue)
    Statically scheduled by compiler
 Examples
   Superscalar: IBM PowerPC, Sun SuperSPARC, DEC Alpha, HP 8000
   VLIW: Intel/HP Itanium

3 A Superscalar Version of MIPS
 Two instructions can be issued per clock cycle
   One can be a load/store/branch/integer operation
   The other can be any FP operation
 Need to fetch and decode 64 bits per cycle
   Instructions are paired and aligned on a 64-bit boundary
   The integer instruction appears first
 Dynamic issue
   The first instruction issues if it is independent and satisfies other criteria
   The second instruction issues only if the first one does, and it is itself independent and satisfies similar criteria
 Limitation
   The one-cycle delay for loads and branches now becomes a three-instruction delay!
    … because instructions are now squeezed closer together
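The pairing rule above can be sketched in a few lines. This is a minimal illustration, not the processor's actual issue logic; the tuple encoding of instructions and the `can_dual_issue` helper are hypothetical.

```python
# Sketch of the dual-issue rule for the two-issue MIPS described above:
# slot 1 takes a load/store/branch/integer op, slot 2 takes an FP op,
# and the second instruction must not depend on the first.
# Hypothetical instruction encoding: (kind, dest_reg, source_regs).

INT_KINDS = {"load", "store", "branch", "int"}

def can_dual_issue(first, second):
    """Return True if the two instructions may issue in the same cycle."""
    kind1, dest1, _ = first
    kind2, _, srcs2 = second
    if kind1 not in INT_KINDS:                 # slot 1: integer/memory/branch only
        return False
    if kind2 != "fp":                          # slot 2: FP operation only
        return False
    if dest1 is not None and dest1 in srcs2:   # RAW dependence within the pair
        return False
    return True

# LD F0,0(R1) paired with ADDD F4,F0,F2: the ADDD reads F0, so no dual issue.
ld   = ("load", "F0", ["R1"])
addd = ("fp",   "F4", ["F0", "F2"])
sub  = ("int",  "R1", ["R1"])
print(can_dual_issue(ld, addd))    # False: dependent pair
print(can_dual_issue(sub, addd))   # True: independent int/FP pair
```

Note that the real hardware also checks structural constraints (alignment, free functional units) that this sketch omits.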

4 Performance of Static Superscalar

Original scheduled loop:

LOOP:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4
       SUBI  R1, R1, 8
       BNEZ  R1, LOOP

Unrolled five times and scheduled for dual issue (integer/memory slot left, FP slot right):

LOOP:  LD   F0, 0(R1)
       LD   F6, -8(R1)
       LD   F10, -16(R1)    ADDD F4, F0, F2
       LD   F14, -24(R1)    ADDD F8, F6, F2
       LD   F18, -32(R1)    ADDD F12, F10, F2
       SD   0(R1), F4       ADDD F16, F14, F2
       SD   -8(R1), F8      ADDD F20, F18, F2
       SD   -16(R1), F12
       SUBI R1, R1, 40
       SD   16(R1), F16
       BNEZ R1, LOOP
       SD   8(R1), F20

 Loop unrolled five times and scheduled statically
   6 cycles per element in the original scheduled code
   2.4 cycles per element in the superscalar code (2.5x)
   Loop unrolling alone gets us from 6 to 3.5 cycles per element (1.7x)
   Superscalar execution gets us from 3.5 to 2.4 cycles per element (1.5x)
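The cycles-per-element figures above can be checked directly: the unrolled schedule occupies 12 issue cycles and produces 5 elements per trip around the loop.

```python
# Checking the slide's cycles-per-element arithmetic.
cycles_original   = 6.0        # scheduled single-issue loop, per element
cycles_unrolled   = 3.5        # after unrolling alone (slide's figure)
cycles_superscalar = 12 / 5    # 12-cycle dual-issue schedule, 5 elements

print(cycles_superscalar)                              # 2.4
print(round(cycles_original / cycles_unrolled, 1))     # 1.7x from unrolling
print(round(cycles_unrolled / cycles_superscalar, 1))  # 1.5x from dual issue
print(round(cycles_original / cycles_superscalar, 1))  # 2.5x overall
```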

5 Multiple-Issue with Dynamic Scheduling
 Extend Tomasulo's algorithm to support issuing two instructions per cycle, one integer and one FP
 Simple approach
   Separate Tomasulo control for the integer and FP units
    One set of reservation stations for the integer unit and one for the FP unit
 How do we issue two instructions per cycle and still keep Tomasulo's in-order issue?
   Run the issue logic in one-half clock cycle
   Then two in-order issues fit in one clock cycle

6 Performance of Dynamic Superscalar

Iter.  Instruction       Issues  Executes  Writes result
                             (clock-cycle number)
1      LD   F0,0(R1)        1       2         4
1      ADDD F4,F0,F2        1       5         8
1      SD   0(R1),F4        2       9
1      SUBI R1,R1,8         3       4         5
1      BNEZ R1,LOOP         4       5
2      LD   F0,0(R1)        5       6         8
2      ADDD F4,F0,F2        5       9        12
2      SD   0(R1),F4        6      13
2      SUBI R1,R1,8         7       8         9
2      BNEZ R1,LOOP         8       9

 4 clocks per iteration
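The "4 clocks per iteration" claim can be read straight off the table: using the issue cycles that appear for both iterations, every iteration-2 event happens exactly 4 cycles after its iteration-1 counterpart.

```python
# Issue cycles from the table above, for instructions whose numbers
# survive in both iterations.
iter1_issue = {"LD": 1, "SUBI": 3, "BNEZ": 4}
iter2_issue = {"LD": 5, "SUBI": 7, "BNEZ": 8}

deltas = {op: iter2_issue[op] - iter1_issue[op] for op in iter1_issue}
print(deltas)   # every delta is 4: a steady state of 4 clocks per iteration
```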

7 Limits of Superscalar
 While the integer/FP split is simple for the hardware, we get a CPI of 0.5 only for programs with:
   Exactly 50% FP operations
   No hazards
 If more instructions issue at the same time, decode and issue become harder
   Even a 2-scalar machine has to do a lot of work
    Examine 2 opcodes
    Examine 6 register specifiers
    Decide whether 1 or 2 instructions can issue

8 Very Long Instruction Word (VLIW)
 VLIW: trade instruction space for simple decoding
   The long instruction word has room for many operations
   By definition, all the operations the compiler puts in the long instruction word can execute in parallel
   E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
    16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
   Needs a very sophisticated compiling technique …
    … one that schedules across several branches
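The width arithmetic above is just the field count times the bits per field; a one-liner makes the 112/168-bit range explicit.

```python
# VLIW word width for the example format above:
# 2 integer + 2 FP + 2 memory + 1 branch = 7 operation fields.
fields = 2 + 2 + 2 + 1

print(fields * 16)   # 112-bit word at 16 bits per field
print(fields * 24)   # 168-bit word at 24 bits per field
```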

9 Loop Unrolling in VLIW

Memory           Memory           FP               FP               Int. op/         Clock
reference 1      reference 2      operation 1      operation 2      branch
LD F0,0(R1)      LD F6,-8(R1)                                                          1
LD F10,-16(R1)   LD F14,-24(R1)                                                        2
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                       3
LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                     4
                                  ADDD F20,F18,F2  ADDD F24,F22,F2                     5
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                      6
SD -16(R1),F12   SD -24(R1),F16                                                        7
SD -32(R1),F20   SD -40(R1),F24                                     SUBI R1,R1,#48     8
SD -0(R1),F28                                                       BNEZ R1,LOOP       9

 Unrolled 7 times to avoid delays
 7 results in 9 clocks, or 1.3 clocks per iteration
 Need more registers in VLIW
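Tallying the schedule above also shows the "wasted fields" cost mentioned on the next slide: most slots in most words are empty. The per-cycle operation counts below are read off the table (5 slots per word, as the table has 5 operation columns).

```python
# Slot utilization in the 9-cycle VLIW schedule above.
schedule = [2, 2, 4, 3, 2, 3, 2, 3, 2]   # operations per word, cycles 1-9
slots_per_word = 5                        # 2 mem + 2 FP + 1 int/branch columns

used  = sum(schedule)
total = slots_per_word * len(schedule)
print(used, "of", total, "slots used")         # 23 of 45
print(round(9 / 7, 1), "clocks per result")    # 1.3, as quoted on the slide
print(round(100 * used / total), "% utilization: empty fields inflate code size")
```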

10 Limits to Multi-Issue Machines
 Inherent limitations of ILP
   1 branch in 5 instructions: how do we keep a 5-way VLIW busy?
   Latencies of units: many operations must be scheduled
   # of independent instructions needed = Pipeline Depth x Number of FUs
    ~15-20 independent instructions!
 Difficulties in building the underlying hardware
   Large increase in memory and register-file bandwidth
 Limitations specific to either superscalar or VLIW implementation
   Decode/issue complexity in superscalar
   VLIW code size: unrolled loops + wasted fields
   VLIW lock step => 1 hazard stalls all instructions
   VLIW binary compatibility is a practical weakness
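The "~15-20 independent instructions" figure follows from the depth-times-width estimate above. The depth and functional-unit counts below are illustrative values, not numbers from the slide.

```python
# Rough estimate of in-flight independent instructions needed to keep a
# multi-issue machine busy: pipeline depth x number of functional units.
def independent_instructions_needed(pipeline_depth, num_fus):
    return pipeline_depth * num_fus

print(independent_instructions_needed(3, 5))   # 15
print(independent_instructions_needed(4, 5))   # 20, the top of the quoted range
```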

11 Trace Scheduling
 Exploits parallelism across IF branches, not just LOOP branches
 Two steps
   Trace selection
    Find a likely sequence of basic blocks (a trace) forming a statically predicted long stretch of straight-line code
   Trace compaction
    Squeeze the trace into a few VLIW instructions
    Need bookkeeping code in case the prediction is wrong

12 Trace Scheduling (figure slide)

13 Hardware Support for More ILP
 Avoid branch prediction by turning branches into conditionally executed instructions:
   if (x) then A = B op C else NOP
   If the condition is false, do nothing at all
    Neither store the result nor cause an exception
   Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move
   PA-RISC: any register-register instruction can nullify the following instruction, making it conditional
 Drawbacks of conditional instructions
   Still takes a clock cycle even if "annulled"
   Stalls if the condition is evaluated late
   Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline
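The if-conversion idea above can be mimicked in software with a branch-free select, which is essentially what a conditional-move instruction commits in hardware. This is a sketch for integer values only; the `cmov_int` helper is hypothetical.

```python
# Branch-free select, mirroring a CMOV-style instruction: both values are
# already computed, and the condition only picks which one is written back.
def cmov_int(cond, new_value, old_value):
    mask = -int(bool(cond))          # True -> -1 (all ones), False -> 0
    return (new_value & mask) | (old_value & ~mask)

a, b, c = 10, 3, 4
x = 1
a = cmov_int(x, b + c, a)    # if (x) a = b + c
print(a)                     # 7: condition true, new value committed
x = 0
a = cmov_int(x, b * c, a)    # condition false: a keeps its old value
print(a)                     # 7
```

Note the drawback listed above shows up here too: `b + c` (or `b * c`) is computed even when its result is thrown away.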

14 Hardware Support for More ILP
 Speculation: allow an instruction to issue that depends on a branch predicted taken, with no consequences (including exceptions) if the branch is not actually taken
   Hardware needs to provide an "undo" operation (squash)
 Often combined with dynamic scheduling
 Tomasulo extension: separate speculative bypassing of results from real bypassing of results
   When an instruction is no longer speculative, write its results (instruction commit)
   Execute out of order, but commit in order
   Examples: PowerPC 620, MIPS R10000, Intel P6, AMD K5, …

15 Hardware Support for More ILP
 Need a hardware buffer for the results of uncommitted instructions: the reorder buffer
   The reorder buffer can be an operand source
   Once an instruction commits, its result is found in the register file
   3 fields per entry: instruction type, destination, value
   Tag results with the reorder-buffer number instead of the reservation-station number
   Instructions commit in order
   As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
 (Figure 3.29, page 228: FP Op Queue, reservation stations, FP adder, reorder buffer, FP registers)
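A toy model makes the three-field entry and the in-order-commit property concrete. This is a sketch under the slide's assumptions (type/destination/value fields, commit from the head, squash drops younger entries), not the textbook's implementation.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: entries are [instr_type, dest, value]."""
    def __init__(self):
        self.entries = deque()

    def allocate(self, instr_type, dest):
        entry = [instr_type, dest, None]   # value filled in at write-result
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        entry[2] = value                   # entry can now feed later operands

    def commit(self, regfile):
        """Retire the head entry if its result is ready (in-order commit)."""
        if self.entries and self.entries[0][2] is not None:
            _instr_type, dest, value = self.entries.popleft()
            regfile[dest] = value
            return True
        return False

    def squash_after(self, entry):
        """Undo speculation: drop every entry younger than `entry`."""
        while self.entries and self.entries[-1] is not entry:
            self.entries.pop()

rob, regs = ReorderBuffer(), {}
e1 = rob.allocate("fp", "F4")
e2 = rob.allocate("fp", "F8")        # younger, possibly speculative
rob.write_result(e2, 2.0)            # finishes first, out of order
print(rob.commit(regs))              # False: head (e1) not ready yet
rob.write_result(e1, 1.0)
print(rob.commit(regs), regs)        # True {'F4': 1.0}
```

Because commit only ever happens at the head, a misprediction just pops the younger entries; no register has been written for them yet.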

16 Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from the FP op queue
   If a reservation station and a reorder-buffer slot are free, issue the instruction and send the operands and the reorder-buffer number for the destination
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available
4. Commit: update the register with the reorder-buffer result
   When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer
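The four steps above can be sketched as a per-instruction state machine. The state names and the `advance` helper are hypothetical simplifications; real hardware overlaps these stages across many instructions every cycle.

```python
# One instruction's path through the four speculative-Tomasulo steps.
def advance(state, rs_free=True, rob_free=True, operands_ready=True,
            at_rob_head=True):
    """Move to the next step only when that step's condition holds."""
    if state == "waiting":
        return "issue" if (rs_free and rob_free) else state    # step 1
    if state == "issue":
        return "execute" if operands_ready else state          # step 2
    if state == "execute":
        return "write_result"                                  # step 3: CDB broadcast
    if state == "write_result":
        return "commit" if at_rob_head else state              # step 4: in order
    return state

s = "waiting"
for _ in range(4):
    s = advance(s)
print(s)   # commit
```

The guards capture the slide's conditions: no free reservation station or ROB slot stalls issue, missing operands stall execute, and an instruction that is not at the ROB head cannot commit.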