1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.

Presentation transcript:

1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution

2 Goal:
– Performance (IPC > 1)
How?
– Wide machine
– Speculation (branch prediction)
– Out-of-order execution
  » Essentially a data-flow execution model: operations execute as soon as their operands are available
– Eliminate name dependencies (false dependencies): WAW (output) and WAR (anti)
  » Via register renaming
But we still want a precise interrupt model
– In-order commit
  » Via Reorder Buffer (ROB)

3 MIPS FPU with Tomasulo and ROB

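4 Tomasulo With Reorder buffer:
[Figure: block diagram. FP Op Queue feeds the Reservation Stations in front of the FP adders and FP multipliers; load buffers ("from Memory") and a path "To Memory"; Reorder Buffer with entries ROB7 (Newest) down to ROB1 (Oldest) and columns Dest, Value, Instruction, Done?; Registers F0, F2, F4, F10; RAT with one slot per register F0, F2, F4, F10.]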

5 4 Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instr. and send the operands and the reorder buffer no. for the destination (this stage is sometimes called "dispatch")
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting FUs and the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result
   When the instr. is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instr. from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")
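The four steps above can be sketched as a toy model. This is a minimal illustration, not the slides' hardware: the class name `SpecTomasulo`, its methods, the register values, and the use of Python callables in place of functional units are all assumptions made for the example. Execution and write-result are merged into one call, and branch misprediction is not modeled.

```python
from collections import deque

class SpecTomasulo:
    """Toy model of issue / execute+write / commit (all names are hypothetical)."""

    def __init__(self, regs, rob_size=7):
        self.regs = dict(regs)    # committed (architectural) register values
        self.rat = {}             # register -> ROB tag of newest in-flight writer
        self.rob = deque()        # in-flight entries, oldest on the left
        self.by_tag = {}          # ROB tag -> entry
        self.rob_size = rob_size
        self.next_tag = 1

    def issue(self, fn, dest, srcs):
        """Step 1: take a ROB slot, capture operand values or producer tags."""
        if len(self.rob) >= self.rob_size:
            return None           # no free ROB slot: issue stalls
        tag = self.next_tag
        self.next_tag += 1
        vals, waits = [], {}
        for i, src in enumerate(srcs):
            producer = self.rat.get(src)
            if producer is None:
                vals.append(self.regs[src])                  # committed value
            elif self.by_tag[producer]["done"]:
                vals.append(self.by_tag[producer]["value"])  # value waiting in ROB
            else:
                vals.append(None)
                waits[i] = producer                          # watch the CDB
        entry = {"tag": tag, "fn": fn, "dest": dest, "vals": vals,
                 "waits": waits, "done": False, "value": None}
        self.rob.append(entry)
        self.by_tag[tag] = entry
        self.rat[dest] = tag
        return tag

    def execute_and_write(self, tag):
        """Steps 2-3: run when all operands are ready, then broadcast the result."""
        entry = self.by_tag[tag]
        if entry["waits"]:
            return False          # RAW hazard: an operand is still outstanding
        entry["value"] = entry["fn"](*entry["vals"])
        entry["done"] = True
        for other in self.rob:    # CDB broadcast to waiting entries
            for i, producer in list(other["waits"].items()):
                if producer == tag:
                    other["vals"][i] = entry["value"]
                    del other["waits"][i]
        return True

    def commit(self):
        """Step 4: retire the oldest entry only if its result is present."""
        if not self.rob or not self.rob[0]["done"]:
            return None
        entry = self.rob.popleft()
        self.regs[entry["dest"]] = entry["value"]
        if self.rat.get(entry["dest"]) == entry["tag"]:
            del self.rat[entry["dest"]]
        del self.by_tag[entry["tag"]]
        return entry["tag"]

# Register values below are made up for the example.
m = SpecTomasulo({"F0": 0.0, "F2": 0.0, "F4": 3.0, "F6": 2.0, "F10": 0.0})
t1 = m.issue(lambda: 5.0, "F0", [])                       # stands in for a load
t2 = m.issue(lambda a, b: a + b, "F10", ["F4", "F0"])     # ADDD F10,F4,F0
t3 = m.issue(lambda a, b: a / b, "F2", ["F10", "F6"])     # DIVD F2,F10,F6
blocked = m.execute_and_write(t2)   # False: still waiting for t1 on the CDB
m.execute_and_write(t1)
m.execute_and_write(t2)
m.execute_and_write(t3)
committed = [m.commit(), m.commit(), m.commit()]
```

In this sketch the RAT entry for a register is cleared at commit only if the committing instruction is still its newest writer, which matches how the RAT empties out at the end of the walkthrough slides.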

6 Code Example
1. LD F0, 10(R2)
2. ADDD F10, F4, F0
3. DIVD F2, F10, F6
4. BNE F2,
5. LD F4, 0(R3)
6. ADDD F0, F4, F6
7. ADDD F0, F4, F6
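As a rough sketch of how the RAT renames this code example, the snippet below issues the seven instructions in program order, gives each one the next ROB entry, and replaces every source register with either a register read `R(Fx)` or the ROB tag of its newest in-flight producer. The tuple encoding and the `rename` helper are assumptions made for illustration; the branch target is left out because the transcript does not give it.

```python
# (opcode, destination register or None, source registers)
program = [
    ("LD",   "F0",  ["R2"]),          # 1. LD   F0, 10(R2)
    ("ADDD", "F10", ["F4", "F0"]),    # 2. ADDD F10, F4, F0
    ("DIVD", "F2",  ["F10", "F6"]),   # 3. DIVD F2, F10, F6
    ("BNE",  None,  ["F2"]),          # 4. BNE  F2, (no destination register)
    ("LD",   "F4",  ["R3"]),          # 5. LD   F4, 0(R3)
    ("ADDD", "F0",  ["F4", "F6"]),    # 6. ADDD F0, F4, F6
    ("ADDD", "F0",  ["F4", "F6"]),    # 7. ADDD F0, F4, F6
]

def rename(program):
    """Return the renamed operands per instruction and the final RAT."""
    rat = {}        # register -> ROB tag of its newest in-flight writer
    renamed = []
    for rob_id, (op, dest, srcs) in enumerate(program, start=1):
        # A source with no in-flight writer is read from the register file.
        operands = [rat.get(src, f"R({src})") for src in srcs]
        renamed.append((f"ROB{rob_id}", op, operands))
        if dest is not None:
            rat[dest] = f"ROB{rob_id}"   # later readers now wait on this entry
    return renamed, rat

renamed, rat = rename(program)
for entry in renamed:
    print(entry)
print(rat)
```

Instructions 6 and 7 both write F0, yet each gets its own ROB entry, so the WAW dependence disappears; the RAT simply ends up pointing at ROB7, the newest writer.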

7 Tomasulo With Reorder buffer:
[Figure: the same block diagram in its initial state. Reservation Stations, load buffers, Reorder Buffer entries ROB7 to ROB1, and the RAT slots for F0, F2, F4, F10 are all empty.]

8 Tomasulo With Reorder buffer:
[Figure state]
Load buffer (from Memory): 1 10+R2
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
RAT: F0 -> ROB1

9 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),ROB1
Load buffer (from Memory): 1 10+R2
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
RAT: F0 -> ROB1, F10 -> ROB2

10 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
RAT: F0 -> ROB1, F2 -> ROB3, F10 -> ROB2

11 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6); 6 ADDD ROB5,R(F6)
Load buffers (from Memory): 1 10+R2; 5 0+R3
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, LD F4,0(R3), Done? N
  ROB6: Dest F0, ADDD F0,F4,F6, Done? N
RAT: F0 -> ROB6, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2

12 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6); 6 ADDD ROB5,R(F6); 7 ADDD ROB5,R(F6)
Load buffers (from Memory): 1 10+R2; 5 0+R3
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, LD F4,0(R3), Done? N
  ROB6: Dest F0, ADDD F0,F4,F6, Done? N
  ROB7: Dest F0, ADDD F0,F4,F6, Done? N
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2

13 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6); 6 ADDD M[10],R(F6); 7 ADDD M[10],R(F6)
Load buffer (from Memory): 1 10+R2
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? N
  ROB7: Dest F0, ADDD F0,F4,F6, Done? N
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2

14 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),ROB1; 3 DIVD ROB2,R(F6)
Load buffer (from Memory): 1 10+R2
Reorder Buffer (Oldest first):
  ROB1: Dest F0, LD F0,10(R2), Done? N
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2

15 Tomasulo With Reorder buffer:
[Figure state]
Reservation Stations: 2 ADDD R(F4),M[2]; 3 DIVD ROB2,R(F6)
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? Y
  ROB2: Dest F10, ADDD F10,F4,F0, Done? N
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2

16 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reservation Stations: 3 DIVD,R(F6)
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? Y
  ROB3: Dest F2, DIVD F2,F10,F6, Done? N
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
Registers: F0 = M[2]
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5, F10 -> ROB2

17 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? C
  ROB3: Dest F2, DIVD F2,F10,F6, Done? Y
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
Registers: F0 = M[2]; F10 written (value not shown)
RAT: F0 -> ROB7, F2 -> ROB3, F4 -> ROB5

18 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? C
  ROB3: Dest F2, DIVD F2,F10,F6, Done? C
  ROB4: Dest --, BNE F2, Done? N
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
Registers: F0 = M[2]; F2 and F10 written (values not shown)
RAT: F0 -> ROB7, F4 -> ROB5

19 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? C
  ROB3: Dest F2, DIVD F2,F10,F6, Done? C
  ROB4: Dest --, BNE F2, Done? C
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? Y
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
Registers: F0 = M[2]; F2 and F10 written (values not shown)
RAT: F0 -> ROB7, F4 -> ROB5

20 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? C
  ROB3: Dest F2, DIVD F2,F10,F6, Done? C
  ROB4: Dest --, BNE F2, Done? C
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? C
  ROB6: Dest F0, ADDD F0,F4,F6, Done? Y
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
Registers: F0 = M[2]; F4 = M[10]; F2 and F10 written (values not shown)
RAT: F0 -> ROB7

21 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? C
  ROB3: Dest F2, DIVD F2,F10,F6, Done? C
  ROB4: Dest --, BNE F2, Done? C
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? C
  ROB6: Dest F0, ADDD F0,F4,F6, Done? C
  ROB7: Dest F0, ADDD F0,F4,F6, Done? Y
Registers: F4 = M[10]; F0, F2 and F10 written (values not shown)
RAT: F0 -> ROB7

22 Tomasulo With Reorder buffer:
[Figure state; C = committed]
Reorder Buffer (Oldest first):
  ROB1: Dest F0, Value M[2], LD F0,10(R2), Done? C
  ROB2: Dest F10, ADDD F10,F4,F0, Done? C
  ROB3: Dest F2, DIVD F2,F10,F6, Done? C
  ROB4: Dest --, BNE F2, Done? C
  ROB5: Dest F4, Value M[10], LD F4,0(R3), Done? C
  ROB6: Dest F0, ADDD F0,F4,F6, Done? C
  ROB7: Dest F0, ADDD F0,F4,F6, Done? C
Registers: F4 = M[10]; F0, F2 and F10 written (values not shown)
RAT: empty

23 Remarks
What about timing?
– What happens on what cycle? The figures show no cycle counts
  » How many fetches/commits in a cycle?
  » How many execution units?
– Homework assignment
Preserving the precise interrupt model
– When an interrupt occurs, we can flush everything
  » Instructions that were not committed have no effect
– Commit happens in order
– Exceptions are taken on commit
What happens if the ROB is full?
– Fetch is stopped until some instruction commits
  » A committed instruction frees its ROB entry
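The remarks on a full ROB, in-order commit, and exceptions taken at commit can be illustrated with a small circular buffer. This is a sketch under assumptions: the `ReorderBuffer` class, its method names, and the two-entry size are invented for the example, and "flush" here simply discards every uncommitted entry.

```python
class ReorderBuffer:
    """Circular ROB sketch: allocate at the tail, commit from the head."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0      # index of the oldest in-flight instruction
        self.tail = 0      # index of the next free slot
        self.count = 0

    def allocate(self, instr):
        if self.count == len(self.slots):
            return None    # ROB full: fetch/issue must stall
        idx = self.tail
        self.slots[idx] = {"instr": instr, "done": False, "exception": False}
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def mark_done(self, idx, exception=False):
        self.slots[idx]["done"] = True
        self.slots[idx]["exception"] = exception

    def flush(self):
        # Uncommitted instructions have had no architectural effect,
        # so dropping them all keeps the machine state precise.
        self.slots = [None] * len(self.slots)
        self.head = self.tail = self.count = 0

    def commit(self):
        """Retire the head if finished; exceptions are raised only here."""
        if self.count == 0 or not self.slots[self.head]["done"]:
            return None    # the oldest instruction is not finished yet
        entry = self.slots[self.head]
        if entry["exception"]:
            self.flush()
            return ("exception", entry["instr"])
        self.slots[self.head] = None       # committed entry frees its slot
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return ("commit", entry["instr"])

rob = ReorderBuffer(size=2)
a = rob.allocate("LD F0,10(R2)")
b = rob.allocate("ADDD F10,F4,F0")
stalled = rob.allocate("DIVD F2,F10,F6")   # None: ROB is full
rob.mark_done(b)                           # younger instruction finishes first
blocked = rob.commit()                     # None: head (the LD) is not done
rob.mark_done(a)
first = rob.commit()                       # the LD commits and frees a slot
c = rob.allocate("DIVD F2,F10,F6")         # now succeeds
rob.mark_done(c, exception=True)
second = rob.commit()                      # the ADDD commits normally
third = rob.commit()                       # the DIVD reaches the head: exception
```

Deferring the exception until the faulting instruction reaches the head is what guarantees that every older instruction has already committed and no younger one has.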

24 Memory Hazards
When is memory updated?
– On commit (or later)
  » Relevant for store instructions only
WAR/WAW hazards?
– Handled by the ROB
RAW hazards?
– Must ensure that no in-flight store is targeting the same address
– What about memory disambiguation?
  » Simple answer: before starting the load, we must know the addresses of all other in-flight stores
  » In real life we speculate on this
Examples:
  ST 0(R2), F1
  LD F2, 0(R2)

  ST 0(R2), F1
  LD F2, 0(R4) //What if R4=R2?
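The "simple answer" for memory disambiguation can be sketched as a conservative check: a load may start only if every older in-flight store has a known address and none of them matches the load address. The function name `load_may_issue`, the use of `None` for a not-yet-computed store address, and the register values are assumptions for illustration; real designs may instead forward the store data or speculate, as the slide notes.

```python
def load_may_issue(load_addr, older_store_addrs):
    """Conservative disambiguation against older in-flight stores.

    older_store_addrs holds one entry per older uncommitted store;
    None means that store's address has not been computed yet.
    """
    for addr in older_store_addrs:
        if addr is None:
            return False    # unknown address: it might alias, so wait
        if addr == load_addr:
            return False    # RAW through memory: wait for the store
    return True

# ST 0(R2),F1 followed by LD F2,0(R4), with made-up register values.
regs = {"R2": 0x100, "R4": 0x200}
different = load_may_issue(0 + regs["R4"], [0 + regs["R2"]])   # R4 != R2

regs["R4"] = regs["R2"]                                        # what if R4 = R2?
aliased = load_may_issue(0 + regs["R4"], [0 + regs["R2"]])

unknown = load_may_issue(0x200, [None])   # store address not yet known
```

With distinct addresses the load can bypass the store; once R4 equals R2, or while the store address is still unknown, this conservative policy makes the load wait.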