Caltech CS184b Winter2001 -- DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day7:

Slides:



Advertisements
Similar presentations
1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.
Advertisements

ILP: IntroductionCSCE430/830 Instruction-level parallelism: Introduction CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 4: April 7, 2003 Instruction Level Parallelism ILP.
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
Instruction-Level Parallelism (ILP)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Chapter 8. Pipelining. Instruction Hazards Overview Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline.
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Review: MIPS Pipeline Data and Control Paths
1 Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)
EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day6:
EECS 470 Pipeline Hazards Lecture 4 Coverage: Appendix A.
EENG449b/Savvides Lec /20/04 February 12, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.
EECC551 - Shaaban #1 Fall 2002 lec# Floating Point/Multicycle Pipelining in MIPS Completion of MIPS EX stage floating point arithmetic operations.
1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.
1 Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)
EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.
Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day3:
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day15:
Pipeline Data Hazards: Detection and Circumvention Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly.
1 Appendix A Pipeline implementation Pipeline hazards, detection and forwarding Multiple-cycle operations MIPS R4000 CDA5155 Spring, 2007, Peir / University.
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day5:
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.
Spring 2003CSE P5481 Precise Interrupts Precise interrupts preserve the model that instructions execute in program-generated order, one at a time If an.
1 Lecture 5: Dependence Analysis and Superscalar Techniques Overview Instruction dependences, correctness, inst scheduling examples, renaming, speculation,
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
CSE431 L07 Overcoming Data Hazards.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 07: Overcoming Data Hazards Mary Jane Irwin (
CSIE30300 Computer Architecture Unit 05: Overcoming Data Hazards Hsin-Chou Chi [Adapted from material by and
Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day8:
Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.
Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng
CSC 4250 Computer Architectures September 22, 2006 Appendix A. Pipelining.
Caltech CS184 Spring DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 3: April 4, 2003 Pipelined ISA.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
CSE 340 Computer Architecture Spring 2016 Overcoming Data Hazards.
CS203 – Advanced Computer Architecture ILP and Speculation.
Use of Pipelining to Achieve CPI < 1
CS 352H: Computer Systems Architecture
CDA3101 Recitation Section 8
/ Computer Architecture and Design
CIS-550 Advanced Computer Architecture Lecture 10: Precise Exceptions
CS203 – Advanced Computer Architecture
Pipelining: Advanced ILP
Review: MIPS Pipeline Data and Control Paths
Morgan Kaufmann Publishers The Processor
Morgan Kaufmann Publishers The Processor
Lecture 6: Advanced Pipelines
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Adapted from the slides of Prof
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
How to improve (decrease) CPI
Control unit extension for data hazards
Instruction Level Parallelism (ILP)
Instruction Execution Cycle
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
pipelining: data hazards Prof. Eric Rotenberg
Control unit extension for data hazards
CSC3050 – Computer Architecture
Control unit extension for data hazards
Loop-Level Parallelism
Conceptual execution on a processor which exploits ILP
CS184b: Computer Architecture (Abstractions and Optimizations)
Presentation transcript:

Caltech CS184b Winter DeHon 1 CS184b: Computer Architecture [Single Threaded Architecture: abstractions, quantification, and optimizations] Day7: January 25, 2000 Precise Exceptions ILP intro

Caltech CS184b Winter DeHon 2 Today Handling Exceptions ILP –where? –scoreboard –tomasulo

Caltech CS184b Winter DeHon 3 Exceptions Problem: Maintain sequentially consistent view, while relaxing strict, sequential dependence ordering Sequential stream from ISA Data/control dependence less strict Relaxed dependence accelerates execution

Caltech CS184b Winter DeHon 4 In-Pipe MPY R1,R2,R3 IF ID MPY1 MPY2 MPY3 WB LW R4,16(R6) IF ID EX MEM ---- WB Fault for later instruction should not be visible before earlier.

Caltech CS184b Winter DeHon 5 Out-of-Order Completion MPY R1,R2,R3 IF ID EX MPY1 MPY2 MPY3 MPY4 WB LW R7,(R4) IF ID ALU MEM WB ADD R4,R5,R6 IF ID ALU --- WB State changes from later operations should not be visible if earlier operations fail.

Caltech CS184b Winter DeHon 6 Solutions Stall side-effects as hazards –limit concurrency Imprecise exceptions –? Recoverable / restartable Expose Pipeline –limit scalability, weaken abstraction Save list of PCs –cumberson Precise Exception support

Caltech CS184b Winter DeHon 7 In-Order Completion Stall like data hazards Save up faults in pipeline until commit point –(faults, like WB occur in set place when know predecessors haven’t faulted)

Caltech CS184b Winter DeHon 8 In-Order MPY R1,R2,R3 IF ID MPY1 MPY2 MPY3 WB LW R4,16(R6) IF ID EX MEM ---- WB Commit fault with write back.

Caltech CS184b Winter DeHon 9 In-Order Completion MPY R1,R2,R3 IF ID EX MPY1 MPY2 MPY3 MPY4 WB LW R7,(R4) IF ID ALU MEM WB ADD R4,R5,R6 IF ID ALU --- WB MPY R1,R2,R3 IF ID EX MPY1 MPY2 MPY3 MPY4 WB LW R7,(R4) IF ID ALU MEM WB ADD R4,R5,R6 IF ID ALU WB IO OO

Caltech CS184b Winter DeHon 10 Re-Order Buffer Continue to execute Write-back to register file in-order Buffer results between completion and WB Bypass with newer results

Caltech CS184b Winter DeHon 11 Re-Order IFID Reorder Bypass EX ALU MPY LD/ST RF Complex (big) bypass logic.

Caltech CS184b Winter DeHon 12 History Buffer Keep track of values overwritten in register file Can restore old state from there

Caltech CS184b Winter DeHon 13 History IF ID EX ALU MPY LD/ST RF History History Buffer contain: PC Reg. # prev. reg value Use history to “rollback” state of computation to consistent/committed point.

Caltech CS184b Winter DeHon 14 Future File Keep two copies of register file –committed / visible set –working set

Caltech CS184b Winter DeHon 15 Future IF IDEX ALU MPY LD/ST RF “Architecture” Register File “Future” Future RF contains working state Architecture RF contains only committed (seq. order) state. Reorder

Caltech CS184b Winter DeHon 16 Memory Note: may need to do re-order/bypass to memory as well –same issue as RF –not want to make visible state change –may want to run ahead (avoid adding dep.) Bigger issue as we go to longer latencies, OO-issue, etc.

Caltech CS184b Winter DeHon 17 Instruction Level Parallelism

Caltech CS184b Winter DeHon 18 Real Issue Sequential ISA Model adds an artificial constraint to the computational problem. Original problem (real computation) is not sequentially dependent as a long critical path. –Path Length != # of instructions

Caltech CS184b Winter DeHon 19 Dataflow Graph Real problem is a graph

Caltech CS184b Winter DeHon 20 Task Has Parallelism MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4

Caltech CS184b Winter DeHon 21 More when pipelined Working on stream (loop) may be able to perform all ops at once –…appropriately staggered in time.

Caltech CS184b Winter DeHon 22 Problem For sequential ISA: –must linearize graph –create false dependencies MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 MPY R3,R2,R2 MPY R3,R6,R3 MPY R4,R2,R5 ADD R4,R4,R7 ADD R4,R3,R4

Caltech CS184b Winter DeHon 23 ILP The original problem had parallelism Can we exploit it? Can we rediscover it after? –linearizing –scheduling –assigning resources

Caltech CS184b Winter DeHon 24 If we can find the parallelism... …and will spend the silicon area can execute multiple instructions simultaneously MPY R3,R2,R2; MPY R4,R2,R5 MPY R3,R6,R3; ADD R4,R4,R7 ADD R4,R3,R4 IF ID EX RF ALU1 ALU2

Caltech CS184b Winter DeHon 25 First Challenge: Multi-issue, maintain depend Like Pipelining Let instructions go if no hazard Detect (potential hazards) –stall for data available

Caltech CS184b Winter DeHon 26 Scoreboarding Easy conceptual model: –Each Register has a valid bit –At issue, read registers –If all registers have valid data mark result register invalid (stale) forward into execute –else stall until all valid –When done write to register set result to valid

Caltech CS184b Winter DeHon 27 Scoreboard MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 2: 1 3: 1 4: 1 5: 1 6: 1 7: 1 IF ID EX RF ALU1 ALU2 R2.valid=1 issue Set R3.valid=0 2: 1 3: 0 4: 1 5: 1 6: 1 7: 1

Caltech CS184b Winter DeHon 28 Scoreboard MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 2: 1 3: 0 4: 1 5: 1 6: 1 7: 1 IF ID EX RF ALU1 ALU2 R2.valid=1 R5.valid=1 issue Set R4.valid=0 2: 1 3: 0 4: 0 5: 1 6: 1 7: 1

Caltech CS184b Winter DeHon 29 Scoreboard MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 2: 1 3: 0 4: 0 5: 1 6: 1 7: 1 IF ID EX RF ALU1 ALU2 R3.valid=0 R6.valid=1 stall

Caltech CS184b Winter DeHon 30 Scoreboard MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 2: 1 3: 0 4: 0 5: 1 6: 1 7: 1 IF ID EX RF ALU1 ALU2 MPY R3 complete Set R3.valid=1 2: 1 3: 1 4: 0 5: 1 6: 1 7: 1

Caltech CS184b Winter DeHon 31 Scoreboard MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 2: 1 3: 1 4: 0 5: 1 6: 1 7: 1 IF ID EX RF ALU1 ALU2 R3.valid=1 R6.valid=1 Set R3.valid=0 2: 1 3: 0 4: 0 5: 1 6: 1 7: 1 issue

Caltech CS184b Winter DeHon 32 Scoreboard Of course, bypass –bypass as we did in pipeline –incorporate into stall checks so can continue as soon as result shows up Also, careful not to issue –when result register invalid (WAW)

Caltech CS184b Winter DeHon 33 Ordering As shown –issue instructions in order –stall on first dependent instruction get head-of-line-blocking Alternative –Out of order issue

Caltech CS184b Winter DeHon 34 Example MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4 MPY R3,R2,R2 MPY R3,R6,R3 MPY R4,R2,R5 ADD R4,R4,R7 ADD R4,R3,R4 MPY R3,R2,R2 MPY R4,R2,R5 MPY R3,R6,R3 ADD R4,R4,R7 ADD R4,R3,R4

Caltech CS184b Winter DeHon 35 Example This sequence block on in-order issue –second instruction depend on first But 3rd instruction not depend on first 2. MPY R3,R2,R2 MPY R3,R6,R3 MPY R4,R2,R5 ADD R4,R4,R7 ADD R4,R3,R4

Caltech CS184b Winter DeHon 36 Example Out of Order –look beyond head pointer for enabled instructions –issue and scoreboard next found MPY R3,R2,R2 MPY R3,R6,R3 MPY R4,R2,R5 ADD R4,R4,R7 ADD R4,R3,R4 MPY R3,R6,R3 stalls for R3 to be computed MPR4,R2,R5 can be issued while R3 waiting

Caltech CS184b Winter DeHon 37 False Sequentialization on Register Names Problem: reuse of small set of register names may introduce false sequentialization ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1)

Caltech CS184b Winter DeHon 38 False Sequentialization Recognize: –register names are just a way of describing local dataflow ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) This says: the result of adding R5 and R6 gets stored into the address pointed to by R1 R2 only describes the dataflow.

Caltech CS184b Winter DeHon 39 Renaming Trick: –separate ISA (“architectural”) register names from functional/physical registers –allocate a new register on definitions (compare def-use chains in cs134b?) –keep track of all uses (until next definition) –assign all uses the new register name at issue –use new register name to track dependencies, bypass, scoreboarding...

Caltech CS184b Winter DeHon 40 Example ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) Rename Table R1: P2 R2: P6 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P1 P3 P4 P11

Caltech CS184b Winter DeHon 41 Example ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) Rename Table R1: P2 R2: P6 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P1 P3 P4 P11 Rename Table R1: P2 R2: P1 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P3 P4 P11 Issue: ADD P1,P7,P8 Allocate P1 for R2

Caltech CS184b Winter DeHon 42 Example ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) Rename Table R1: P2 R2: P1 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P3 P4 P11 Rename Table R1: P2 R2: P1 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P3 P4 P11 Issue: SW P1,(P2)

Caltech CS184b Winter DeHon 43 Example ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) Rename Table R1: P2 R2: P1 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P3 P4 P11 Rename Table R1: P3 R2: P1 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P2 P4 P11 Issue: ADD P3,1,P2 Allocate P3 for P1

Caltech CS184b Winter DeHon 44 Example ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) Rename Table R1: P3 R2: P1 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P2 P4 P11 Rename Table R1: P3 R2: P4 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P2 P11 Issue: ADD P4,P9,P10 Allocate P4 for R2

Caltech CS184b Winter DeHon 45 Example ADD R2,R3,R4 SW R2,(R1) ADD R1,1,R1 ADD R2,R5,R6 SW R2,(R1) Rename Table R1: P3 R2: P4 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P2 P11 Rename Table R1: P3 R2: P4 R3: P7 R4: P8 R5: P9 R6: P10 Free Table: P2 P11 Issue: SW P4,(P3)

Caltech CS184b Winter DeHon 46 Free Physical Register Free after complete last use Identify last use by next def? Or, allocate in order (LRU) –interlock if re-assignment conflict –(should correspond to having no free physical registers)

Caltech CS184b Winter DeHon 47 Tomasulo Register renaming Scoreboarding Bypassing IBM 1967 …what’s keeping x86 ISA alive today –compensate for small number of arch. Registers –dusty deck code

Caltech CS184b Winter DeHon 48 Today Seen can turn a basic block –(code between branches) Into executing dataflow graph –I.e. once issues, only dataflow dependencies limit parallelism …all the more reason to want large basic blocks (minimize branch, branch effects)

Caltech CS184b Winter DeHon 49 Reading Note Today: HP4.1-2, Tomasulo Next Week: – rest of HP4 –Fisher/predict relevant probably touch on Tuesday –Subbarao Quantifying… probably Thursday Following Week: VLIW and EPIC –Fisher, IA-64...

Caltech CS184b Winter DeHon 50 Big Ideas Data Versioning –keep old copies, until commit –working versus finalized Parallelism does exist in the problem –obscured by ISA linearization Dataflow Interpretation –preserve dependencies, not control flow sequence –rediscover non-linear “graph”