Copyright 2001 UCB & Morgan Kaufmann ECE668.1 Adapted from Patterson, Katz and Kubiatowicz © UCB
Csaba Andras Moritz
UNIVERSITY OF MASSACHUSETTS, Dept. of Electrical & Computer Engineering
Computer Architecture ECE 668
Exceptions, Reorder Buffer (ROB), Speculative Tomasulo

Exceptions - Basics
• Exception = unprogrammed control transfer
  • system takes action to handle the exception
    » must record the address of the offending instruction
    » record any other information necessary to return afterwards
  • returns control to user
  • must save & restore user state
• Normal control flow: sequential, jumps, branches, calls, returns
[Figure: the user program transfers to the system exception handler on an exception and resumes on return from exception]

Two Types of Exceptions
• Interrupts
  • caused by external events:
    » Network, Keyboard, Disk I/O, Timer
  • asynchronous to program execution
    » Most interrupts can be disabled for brief periods of time
  • may be handled between instructions
  • simply suspend and resume user program
• Traps
  • caused by internal events
    » exceptional conditions (overflow)
    » errors (parity)
    » page faults (non-resident page)
  • synchronous to program execution
  • condition must be remedied by the handler
  • instruction may be retried and program continued, or program may be aborted

Exceptions - Examples

Exceptions in MIPS pipeline

Stage  Possible exceptions
IF     Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic exception
MEM    Page fault on data fetch; misaligned memory access; memory-protection violation; memory error

• How do we stop the pipeline? How do we restart it?
• Do we interrupt immediately or wait?
• 5 instructions, executing in 5 different pipeline stages!
• Who caused the interrupt?

Multiple exceptions
[Figure: two pipeline timing diagrams over clock cycles 1-5 (stages Ifetch, Reg, ALU, DMem, Reg) for a Load followed by an Add. First diagram: data page fault and arithmetic exception. Second diagram: data page fault and instruction page fault.]

Precise Interrupts/Exceptions
• Exceptions should be precise (clean), i.e., the outcome should be exactly the same as in a non-pipelined machine
• Precise: state of the machine is preserved as if the program executed up to the offending instruction
  • All previous instructions completed
  • Offending instruction and all following instructions act as if they have not even started
  • Same code will work on different processor implementations
  • Difficult in the presence of pipelining, out-of-order execution, ...
• Imprecise: system software has to figure out what is where and put it all back together
• Modern techniques for out-of-order execution and branch prediction help implement precise interrupts

Relationship between precise interrupts and speculation
• Speculation: guess and check
• Important for branch prediction:
  • Need to "take our best shot" at predicting branch direction
• If we speculate and are wrong, we need to back up and restart execution at the point where we predicted incorrectly:
  • This is exactly the same as precise exceptions!
• Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

Handling Exceptions
• Exceptions are handled by not recognizing the exception until the instruction that caused it is ready to commit in the ROB
• If a speculated instruction raises an exception, the exception is recorded in the ROB
• This is why reorder buffers are in all new processors
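The deferred recognition described above can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the slides' design: the ROB is modeled as a `deque` in program order, and the names `RobEntry` and `commit_ready` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    # Hypothetical, simplified ROB entry: only the fields this sketch needs.
    name: str
    done: bool = False
    exception: Optional[str] = None  # recorded at execution, not yet recognized


def commit_ready(rob):
    """Commit finished instructions in program order from the ROB head.

    Returns (names committed, recognized exception or None). A recorded
    exception is recognized only when its entry is the oldest uncommitted
    instruction, so every earlier instruction has already committed and
    no later instruction has changed architectural state.
    """
    committed = []
    while rob and rob[0].done:
        head = rob[0]
        if head.exception is not None:
            fault = f"{head.name}: {head.exception}"
            rob.clear()  # discard the faulting instruction and all younger ones
            return committed, fault
        committed.append(rob.popleft().name)
    return committed, None
```

If a younger instruction faults while the oldest one is still executing, `commit_ready` returns no exception; the fault surfaces only after the older instruction has committed.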

Reorder Buffer (HW support for precise interrupts)
• ROB = buffer for results of uncommitted instructions
• An instruction commits when it completes its execution and all its predecessors have already committed
• Once an instruction commits, its result is put into the register
  » Therefore, easy to undo speculated instructions on mispredicted branches or exceptions
• Supplies operands between execution complete & commit
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder, FP Multiplier]

More on Reorder Buffer operation
• Holds instructions in FIFO order, exactly as issued
• When instructions complete, results are placed into the ROB
  • Supplies operands to other instructions between execution complete & commit
  • Tag results with ROB buffer number instead of reservation station
• Instructions commit: values at head of ROB are placed in registers
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder, and the commit path from the Reorder Buffer to the registers]
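As a rough illustration of tagging results with ROB numbers, the Python sketch below broadcasts a completed result to the ROB and to any reservation station waiting on that tag. The dictionary layout, the function names, and the numeric values are assumptions made for the example only.

```python
def write_result(tag, value, rob, stations):
    """Broadcast (ROB tag, value) to the ROB entry and to waiting stations."""
    rob[tag]["value"] = value
    rob[tag]["done"] = True  # result is now buffered in the ROB until commit
    for rs in stations:
        rs["operands"] = [
            ("val", value) if operand == ("tag", tag) else operand
            for operand in rs["operands"]
        ]


def operands_ready(rs):
    """A station may start executing once no operand is still a tag."""
    return all(kind == "val" for kind, _ in rs["operands"])


# ROB1 holds LD F0,10(R2); ROB2 holds ADDD F10,F4,F0, which waits on ROB1.
rob = {1: {"dest": "F0", "value": None, "done": False},
       2: {"dest": "F10", "value": None, "done": False}}
adder = {"dest_tag": 2, "operands": [("val", 1.5), ("tag", 1)]}  # made-up F4 value

write_result(1, 4.0, rob, [adder])  # the load returns a made-up value 4.0
```

After the call, the ADDD station holds two values and can execute, while F0 itself is not updated until ROB1 commits.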

Another Perspective on Reorder Buffer
• If instructions write results in program order, reg/memory always get the correct values
• Role of ROB: to reorder out-of-order instructions into program order at the time of writing register/memory (commit)
• An instruction cannot write reg/memory immediately after execution, so the ROB also buffers the results
  • No such place in original Tomasulo
[Figure: pipeline with Fetch Unit, IM, Decode, Rename, Regfile, ReSt, FU1, FU2, L-buf, S-buf, DM, and Reorder Buffer]

ROB: Circular Buffer with Head/Tail Pointers
[Figure: three snapshots of the circular buffer, each with head and tail pointers, showing a freed ROB entry and a ROB entry allocated when an instruction is issued]
• Entries between head and tail are valid
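A circular buffer with head and tail pointers might be sketched as follows. The class name `CircularRob` and the extra `count` field (used to tell a full buffer from an empty one when head equals tail) are assumptions of this sketch, not details from the slide.

```python
class CircularRob:
    """Fixed-size ROB modeled as a circular buffer (illustrative sketch)."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0   # oldest valid entry, next to commit
        self.tail = 0   # next free slot, handed out at issue
        self.count = 0  # number of valid entries

    def is_full(self):
        return self.count == len(self.entries)

    def allocate(self, instr):
        """Give the entry at the tail to an issuing instruction; return its tag."""
        if self.is_full():
            return None  # no free entry: issue must stall
        tag = self.tail
        self.entries[tag] = instr
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def free_head(self):
        """Free the entry at the head (done at commit); return its instruction."""
        if self.count == 0:
            return None
        instr = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return instr
```

Because both pointers wrap around, a slot freed at the head becomes available to the tail again without moving any other entry.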

Reorder Buffer Entry Details
Fields of each Reorder Buffer entry:
• Dest reg
• Result
• Exceptions?
• Program Counter
• Branch or L/W?
• Ready?

Organization with ROB and Associated Result Shift Register (from Smith et al. 1988)
[Figure: REGISTER FILE, REORDER BUFFER, and Result Shift Register on a Common Result Bus, with bypass logic/comparators, control, source data to functional units, and data written to the register file upon commit]
• The Result Shift Register (RSR) controls the result bus
• Stages are labeled 1 through n, where n is the length of the longest FU pipeline
• An instruction taking i clocks reserves stage i in the RSR when it issues
• If stage i already holds a valid instruction, the issuing instruction waits until the next clock
• The issuing instruction places its control information into the RSR
• Each clock, the control information moves one stage toward stage 1; once it reaches stage 1, it is used in the next cycle
• The ROB tag guides the result to the correct ROB entry
For more details read: J. Smith & A. Pleszkun, IEEE Transactions on Computers, May 1988
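The reservation and shifting behavior of the RSR can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class and method names are hypothetical, and each stage stores only a (functional unit, ROB tag) pair as its control information.

```python
class ResultShiftRegister:
    """Result shift register with stages 1..n (illustrative sketch)."""

    def __init__(self, n):
        # Index 0 is unused so that list indices match stage numbers 1..n.
        self.stages = [None] * (n + 1)

    def reserve(self, latency, unit, rob_tag):
        """Try to reserve stage `latency` for an issuing instruction.

        Returns False if that stage already holds a valid entry, in which
        case the instruction has to retry on the next clock.
        """
        if self.stages[latency] is not None:
            return False
        self.stages[latency] = (unit, rob_tag)
        return True

    def clock(self):
        """Advance one clock.

        Returns the control information that was in stage 1 (it now steers
        the result bus) and shifts every other entry one stage toward 1.
        """
        out = self.stages[1]
        self.stages[1:] = self.stages[2:] + [None]
        return out
```

With this model, an instruction that reserves stage i sees its control information returned by the i-th subsequent call to `clock()`, and the returned ROB tag says which ROB entry receives the result.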

Example of RSR use (see Smith et al.)

PC  Instruction      Ex_Time (in FU)
6   ADDF F10,F1,F3   6
7   ADD R9,R2,R5     2

State in RSR (control info plus ROB tag) after the ADD issues:

Stage  Functional unit source  Valid instr.  Tag
1                              0
2      Integer ADD
...    Flt. Pt. ADD            1             4
N                              0

[Figure: Result Shift Register with its direction of movement, and the reorder (circular) buffer with Head and Tail pointers]
• ROB entry at Tail is given to the issuing instruction; Tail++

Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station, reorder buffer slot, and result shift register slot are free, issue instr & send operands & reorder buffer no. for destination (this stage is sometimes called "dispatch")
   Actions summary: (1) decode the instruction; (2) allocate an RS, RSR and ROB entry; (3) do source register renaming; (4) do dest register renaming; (5) read register file; (6) dispatch the decoded and renamed instruction to the RS and ROB
2. Execution: operate on operands (EX)
   Action: when both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; this takes care of RAW (this stage is sometimes called "issue")
3. Write result: finish execution (WB)
   Action: write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available
4. Commit: update register with result from reorder buffer
   Action: when instr. is at head of ROB & result is present, update register with result (or store to memory) and remove instr from ROB. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation")
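The source and destination renaming in the Issue step can be illustrated with a small Python sketch. It makes simplifying assumptions that are not in the slides: a hypothetical `rename` dictionary maps each architectural register to the ROB tag of its newest in-flight producer, entries are never removed at commit, and an already completed ROB result is still reported as a tag rather than a value.

```python
def issue(instr, rename, rob_tag):
    """Rename one instruction at issue and return its operand sources.

    instr is (op, dest, sources). Each source becomes "ROBn" when an
    in-flight instruction will produce it, or "R(reg)" when the register
    file already holds the value.
    """
    op, dest, sources = instr
    # Sources are renamed before the destination, so an instruction that
    # reads and writes the same register sees the previous producer.
    operands = [f"ROB{rename[s]}" if s in rename else f"R({s})" for s in sources]
    if dest is not None:
        rename[dest] = rob_tag  # later readers of dest now wait on this ROB entry
    return operands


program = [
    ("LD", "F0", ["R2"]),           # LD   F0,10(R2)
    ("ADDD", "F10", ["F4", "F0"]),  # ADDD F10,F4,F0
    ("DIVD", "F2", ["F10", "F6"]),  # DIVD F2,F10,F6
]
rename = {}
renamed = [issue(instr, rename, tag) for tag, instr in enumerate(program, start=1)]
```

For this three-instruction sequence the sketch yields `R(F4), ROB1` for the ADDD and `ROB2, R(F6)` for the DIVD.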

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest; ROB2-ROB7 empty):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N

Reservation stations (FP adders, FP multipliers): empty
Load buffer (from memory): Dest 1, address 10+R2

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest; ROB3-ROB7 empty):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N

Reservation stations: Dest 2, ADDD R(F4),ROB1
Load buffer (from memory): Dest 1, address 10+R2

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest; ROB4-ROB7 empty):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N
ROB3   F2           DIVD F2,F10,F6   N

Reservation stations: Dest 2, ADDD R(F4),ROB1; Dest 3, DIVD ROB2,R(F6)
Load buffer (from memory): Dest 1, address 10+R2

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest; ROB7 empty):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N
ROB3   F2           DIVD F2,F10,F6   N
ROB4   --           BNE F2,          N
ROB5   F4           LD F4,0(R3)      N
ROB6   F0           ADDD F0,F4,F6    N

Reservation stations: Dest 2, ADDD R(F4),ROB1; Dest 6, ADDD ROB5,R(F6); Dest 3, DIVD ROB2,R(F6)
Load buffer (from memory): Dest 1, address 10+R2; Dest 5, address 0+R3

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest, ROB7 = newest):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N
ROB3   F2           DIVD F2,F10,F6   N
ROB4   --           BNE F2,          N
ROB5   F4           LD F4,0(R3)      N
ROB6   F0           ADDD F0,F4,F6    N
ROB7   --    ROB5   ST 0(R3),F4      N

Reservation stations: Dest 2, ADDD R(F4),ROB1; Dest 6, ADDD ROB5,R(F6); Dest 3, DIVD ROB2,R(F6)
Load buffer (from memory): Dest 1, address 10+R2; Dest 5, address 0+R3

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest, ROB7 = newest):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N
ROB3   F2           DIVD F2,F10,F6   N
ROB4   --           BNE F2,          N
ROB5   F4    M[10]  LD F4,0(R3)      Y
ROB6   F0           ADDD F0,F4,F6    N
ROB7   --    M[10]  ST 0(R3),F4      Y

Reservation stations: Dest 2, ADDD R(F4),ROB1; Dest 6, ADDD M[10],R(F6); Dest 3, DIVD ROB2,R(F6)
Load buffer (from memory): Dest 1, address 10+R2

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest, ROB7 = newest):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N
ROB3   F2           DIVD F2,F10,F6   N
ROB4   --           BNE F2,          N
ROB5   F4    M[10]  LD F4,0(R3)      Y
ROB6   F0           ADDD F0,F4,F6    Ex
ROB7   --    M[10]  ST 0(R3),F4      Y

Reservation stations: Dest 2, ADDD R(F4),ROB1; Dest 3, DIVD ROB2,R(F6)
Load buffer (from memory): Dest 1, address 10+R2

Tomasulo With Reorder buffer:

Reorder buffer (ROB1 = oldest, ROB7 = newest):
Entry  Dest  Value  Instruction      Done?
ROB1   F0           LD F0,10(R2)     N
ROB2   F10          ADDD F10,F4,F0   N
ROB3   F2           DIVD F2,F10,F6   N
ROB4   --           BNE F2,          N
ROB5   F4    M[10]  LD F4,0(R3)      Y
ROB6   F0           ADDD F0,F4,F6    Ex
ROB7   --    M[10]  ST 0(R3),F4      Y

Reservation stations: Dest 2, ADDD R(F4),ROB1; Dest 3, DIVD ROB2,R(F6)
Load buffer (from memory): Dest 1, address 10+R2

What about memory hazards???

Avoiding Memory Hazards
• WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at the head of the ROB, and hence no earlier loads or stores can still be pending
• RAW hazards through memory are maintained by two restrictions:
  1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
  2. maintaining the program order for the computation of an effective address of a load with respect to all earlier stores.
• These restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data
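The first restriction can be sketched as an address comparison against older stores that are still in the ROB. The sketch below is illustrative only: `MemOp` and `load_may_access_memory` are hypothetical names, the list holds just the uncommitted memory operations in program order, and the second restriction is assumed to hold, so every older store already has a known address.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    kind: str     # "load" or "store"
    address: int  # A field of a load, Destination field of a store


def load_may_access_memory(rob, load_index):
    """Return True if the load at rob[load_index] may read memory now.

    The load must wait while any older, still-active store in the ROB
    targets the same address.
    """
    load = rob[load_index]
    for older in rob[:load_index]:
        if older.kind == "store" and older.address == load.address:
            return False  # RAW through memory: wait for the store to commit
    return True


# Made-up addresses: a store to 100 followed by loads from 100 and 200.
rob = [MemOp("store", 100), MemOp("load", 100), MemOp("load", 200)]
```

Once the store commits and leaves the ROB, the check for the load from the same address succeeds.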

Getting CPI below 1
• CPI ≥ 1 if only 1 instruction issues every clock cycle
• Multiple-issue processors come in many flavors, e.g.:
  1. dynamically scheduled superscalar processors (out-of-order execution), and
  2. VLIW (very long instruction word) processors
    » VLIW processors, in contrast, issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction (Intel/HP Itanium)
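The bound on CPI can be written out explicitly. In the expression below, the issue width w (the maximum number of instructions issued per clock) is a symbol introduced here for illustration and does not appear in the slides.

```latex
\mathrm{CPI} = \frac{\text{clock cycles}}{\text{instructions executed}},
\qquad
\mathrm{CPI} \;\ge\; \frac{1}{w}
```

With w = 1 this gives CPI ≥ 1, as stated above; a machine that can issue up to w = 4 instructions per clock can reach a CPI of 0.25 at best.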