Sequential Execution Semantics

Slides:



Advertisements
Similar presentations
CSE 502: Computer Architecture
Advertisements

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.
A. Moshovos ©ECE Spring ‘02 ECE Toronto Instruction Level Parallel Processing Sequential Execution Semantics Superscalar Execution –Interruptions.
A. Moshovos ©ECE Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.
Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
Advanced Pipelining Optimally Scheduling Code Optimally Programming Code Scheduling for Superscalars (6.9) Exceptions (5.6, 6.8)
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
ECE 2162 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
© A. Moshovos (ECE, Toronto) ECE1773 – Spring 2002 ILP, cont. Maintaining Sequential Appearance –Precise Interrupts –RUU approach to OoO Scheduling.
Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.
Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.
1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )
1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )
Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.
1 Lecture 7: Speculative Execution and Recovery using Reorder Buffer Branch prediction and speculative execution, precise interrupt, reorder buffer.
© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture Instruction Execution: Dynamic Scheduling.
1 Lecture 5: Dependence Analysis and Superscalar Techniques Overview Instruction dependences, correctness, inst scheduling examples, renaming, speculation,
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.
Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.
Samira Khan University of Virginia Feb 9, 2016 COMPUTER ARCHITECTURE CS 6354 Precise Exception The content and concept of this course are adapted from.
Instruction-Level Parallelism and Its Dynamic Exploitation
Lecture: Out-of-order Processors
CS 352H: Computer Systems Architecture
Dynamic Scheduling Why go out of style?
CSL718 : Superscalar Processors
Precise Exceptions and Out-of-Order Execution
/ Computer Architecture and Design
Smruti R. Sarangi IIT Delhi
PowerPC 604 Superscalar Microprocessor
CIS-550 Advanced Computer Architecture Lecture 10: Precise Exceptions
CSE 502: Computer Architecture
CS203 – Advanced Computer Architecture
Lecture: Out-of-order Processors
CS5100 Advanced Computer Architecture Hardware-Based Speculation
Computer Architecture
High-level view Out-of-order pipeline
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Out of Order Processors
Lecture 10: Out-of-order Processors
Lecture 11: Out-of-order Processors
Lecture: Out-of-order Processors
Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
ECE 2162 Reorder Buffer.
Lecture 11: Memory Data Flow Techniques
Out-of-Order Execution Scheduling
Lecture: Out-of-order Processors
Lecture 8: Dynamic ILP Topics: out-of-order processors
Adapted from the slides of Prof
15-740/ Computer Architecture Lecture 5: Precise Exceptions
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
Adapted from the slides of Prof
Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011
Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005
Instruction-Level Parallelism (ILP)
15-740/ Computer Architecture Lecture 10: Out-of-Order Execution
September 20, 2000 Prof. John Kubiatowicz
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 11 – Out-of-Order Execution Krste Asanovic Electrical Engineering.
Patrick Akl and Andreas Moshovos AENAO Research Group
High-level view Out-of-order pipeline
Lecture 9: Dynamic ILP Topics: out-of-order processors
Conceptual execution on a processor which exploits ILP
Sizing Structures Fixed relations Empirical (simulation-based)
Presentation transcript:

Sequential Execution Semantics Contract: How the machine appears to behave A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics - Review Instructions appear as if they executed: In “program order” As if they executed one after the other Program Order Pipelining Superscalar Out-of-Order A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Execution Order? addi r2, r1, 10 addi r3, r2, 20 add r5, r4, 30 A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Execution Order? Dependencies addi r2, r1, 10 addi r3, r2, 20 add r5, r4, 30 add r6, r5, 40 A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Execution Order? Pipelined and Superscalar addi r2, r1, 10 addi r3, r2, 20 Pipelining add r5, r4, 30 add r6, r5, 40 addi r2, r1, 10 addi r3, r2, 20 Superscalar add r5, r4, 30 add r6, r5, 40 A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Execution Order? Out-of-Order addi r2, r1, 10 addi r3, r2, 20 Superscalar add r5, r4, 30 add r6, r5, 40 addi r2, r1, 10 addi r3, r2, 20 Out-of-Order add r5, r4, 30 add r6, r5, 40 A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-of-Order Execution loop: add r4, r4, 1 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop do { sum += a[++m]; i--; } while (i != 0); Superscalar fetch decode sub bne add ld out-of-order fetch decode add fetch decode ld fetch decode add fetch decode sub fetch decode bne A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? Execution does NOT adhere to sequential semantics To be precise: Eventually it may Simplest solution: Define problem away IBM 360 did this Not acceptable today: e.g., Virtual Memory Three-phase Instruction execution In-Progress, Completed and Committed inconsistent fetch decode sub bne add ld consistent A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire empty empty empty empty empty empty Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire I1 F I2 F I3 F I4 F I5 F empty Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire I1 D I2 D I3 D I4 D I5 D empty (destReg, old value) Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire completed I1 c I2 D I3 D I4 c I5 D empty Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire completed empty I2 c I3 D I4 c I5 c empty committed Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire completed empty empty empty I4 c I5 c empty committed Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? inconsistent consistent Enter Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire empty empty empty empty empty empty Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? WHAT IF SOMEONE STOPS US HERE? inconsistent Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter Retire completed empty I2 c I3 D I4 c I5 c empty committed Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Sequential Semantics? consistent Enter Enter Retire Retire Fetch I4 decode Fetch I5 sub bne Fetch I1 add Fetch I2 ld Fetch I3 consistent Enter empty Reorder buffer Program order Retire Enter Retire empty I2 c I3 D I4 c I5 c empty Undo changes Program order Reorder buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Preserving Sequential Semantics: Three phases Instr. exec. in 3 phases: In-progress, Completed, Committed OOO for in-progress and Completed In-order Commits Completed - out-of-order: ”Visible only inside” Results visible to subsequent instructions Results not visible to outsiders On interrupts completed results are discarded Committed - in-order: ”Visible to all” Results visible to outsiders On interrupt committed results are preserved A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

out-of-order completes Commit vs. Complete in-order completes out-of-order completes in-order commits DIV R3, _, _ ADD R1, _, _ ADD _, R1, _ Time In-order commits fetch decode sub bne add ld commit commit commit commit commit complete A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Implementing Completes/Commits Key idea: Maintain sufficient state around to be able to roll-back when necessary Roll-back: Discard (aka Squash) all not committed One solution (conceptual): History File Upon Complete instruction records previous value of target register Upon Discard, instruction restores target value Upon Commit, nothing to do Focus on scheduling mechanisms A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Solution #2: Future File Two states: Architectural Updated only on commit Speculative Updated on complete Normally use Speculative On exception: Flush Speculative Use Architectural Implementing precise interrupts in pipelined processors J. E. Smith and A. Pleszkun, http://www.eecg.toronto.edu/~moshovos/ACA07/readings/smith-interrupts.pdf A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-of-Order Execution Overview Program Form Processing Phase Static program dynamic inst. Stream (trace) execution window completed instructions In-Progress Dispatch/ dependences inst. Issue inst execution inst. Reorder & commit Completed Committed A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-Of-Order Exec. Example loop: add r4, r4, 4 ld r2, 10(r4) 4 cycles lat add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop op src1 src2 tgt status Reservation station A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-Of-Order Exec. Example loop: add r4, r4, 4 ld r2, 10(r4) 4 cycles lat add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 1 Register Availability Vector AKA Scoreboard A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-Of-Order Exec. Example loop: add r4, r4, 4 ld r2, 10(r4) 4 cycles lat add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 1 Cycle 0 A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-Of-Order Exec. Example: Cycle 0 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Ready to be executed RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4/0 Rdy 1 Cycle 0 A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 1 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Notify those waiting for R4 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Exec 1 ld r4/1 NA/1 r2 Rdy R4 gets produced now A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 2 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Result available @ cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 3 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Result available @ cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait No dependences sub r1/1 NA/1 r1 Rdy A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 4 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Result available @ cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait r1 produced now Notify consumers sub r1/1 NA/1 r1 Exec bne r1/1 r0/1 NA Rdy r1 will be available next cycle A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 5 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Result available @ cycle 6 RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/0 r3 Wait Completed sub r1/1 NA/1 r1 Compl bne r1/1 r0/1 NA Exec executing A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 6 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Result available @ cycle 6 Notify consumers RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Exec Wait for r2 add r3/1 r2/1 r3 Rdy Completed sub r1/1 NA/1 r1 Compl bne r1/1 r0/1 NA Exec executing A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 7 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop Notify consumers RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Cmtd Executing add r3/1 r2/1 r3 Exec sub r1/1 NA/1 r1 Compl bne r1/1 r0/1 NA Compl Completed A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Cycle 8 loop: add r4, r4, 4 ld r2, 10(r4) add r3, r3, r2 sub r1, r1, 1 bne r1, r0, loop RAV op src1 src2 tgt status r1 r2 r3 r4 add r4/1 NA/1 r4 Cmtd 1 ld r4/1 NA/1 r2 Cmtd add r3/1 r2/1 r3 Cmtd sub r1/1 NA/1 r1 Cmtd bne r1/1 r0/1 NA Cmtd A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Window vs. Scheduler Window Distance between oldest and youngest instruction that can co-exist inside the CPU Larger window  Potential for more ILP Scheduler Number of instructions that are waiting to be issued Instructions enter at Fetch Exit at Commit Instructions enter at Decode Leave at writeback/complete Window >= Scheduler Can be the same structure In window but not in scheduler  completed instructions A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Speculative Execution Why? 1 every 5 insts is a branch (control flow) Today’s processors have a window of 128 insts Can’t wait until a branch is resolved to fetch next set of insts Solution: Speculate Wrong squash Fetch branch Resolve branch A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Beyond Simple OoO E will wait for B, C and D. WAR w/ C and D WAW w/ B A: LF F6, 34(R2) B: LF F2, 45(R3) C: MULF F0, F2, F4 D: SUBF F8, F2, F6 E: ADDF F2, F7, F4 E will wait for B, C and D. WAR w/ C and D WAW w/ B Can we do better? A B C D E A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

What if we had infinite registers A: LF F6, 34(R2) B: LF F2, 45(R3) C: MULF F0, F2, F4 D: SUBF F8, F2, F6 E: ADDF F2, F7, F4 E: ADDF F9, F7, F4 No false dependences anymore Since we do not reuse a name we can’t have WAW and WAR A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register Renaming Register Version Every Write creates a new version Uses read the last version Need to keep a version until all uses have read it. Register Renaming: Architectural vs. Physical Registers more phys. than arch. Maintain a map of arch. to phys. regs. Use in-order decoding to properly identify dependences. Instructions wait only for input op. availability. Only last version is written to reg. file. A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register renaming example Original Code Lg(# arch. regs) RAT A add r1, r2, 100 B add r3, r1, r1 C sub r1, r2, r2 p4 p1 p5 p5 p1 Architectural Register p2 p3 # arch. regs Renamed Code A add p4, p2, 100 B add p5, p4, p4 C sub p6 p2, p2 Physical Register A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register renaming example First rename input operands: Original Code Lg(# arch. regs) RAT A add r1, r2, 100 B add r3, r1, r1 C sub r1, r2, r2 p4 p1 p5 p5 p1 Architectural Register p2 p3 # arch. regs Renamed Code A add p4, p2, 100 B add p5, p4, p4 C sub p6 p2, p2 Physical Register A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register renaming example Then rename destination Original Code Lg(# arch. regs) From free list RAT A add r1, r2, 100 B add r3, r1, r1 C sub r1, r2, r2 p4 p1 p5 p5 p4 Architectural Register p2 p3 # arch. regs Renamed Code A add p4, p2, 100 B add p5, p4, p4 C sub p6 p2, p2 ROB r1 p1 Physical Register For recovery purposes A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register renaming example Rename inputs Original Code Lg(# arch. regs) From free list RAT A add r1, r2, 100 B add r3, r1, r1 C sub r1, r2, r2 p4 p1 p5 p5 p4 Architectural Register p2 p3 # arch. regs Renamed Code A add p4, p2, 100 B add p5, p4, p4 C sub p6 p2, p2 ROB r1 p1 Physical Register For recovery purposes A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Register renaming example Rename destination Original Code Lg(# arch. regs) From free list RAT A add r1, r2, 100 B add r3, r1, r1 C sub r1, r2, r2 p4 p1 p5 p5 p4 Architectural Register p2 p5 # arch. regs Renamed Code A add p4, p2, 100 B add p5, p4, p4 C sub p6 p2, p2 ROB r1 p1 Physical Register r3 p3 For recovery purposes A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Bird’s Eye View of a Modern CPU Fetch Decode & Rename Dispatch Execution Execution Units I Cache D Cache In order retirement Misspeculation handling Reorder Buffer A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto