1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

2 Hardware Support for More ILP
 Speculation: allow an instruction that depends on a branch predicted taken to issue, with no consequences (including exceptions) if the branch is not actually taken
   Hardware needs to provide an “undo” operation = squash
 Often combined with dynamic scheduling
 Tomasulo: separate speculative bypassing of results from real bypassing of results
   When an instruction is no longer speculative, write its results (instruction commit)
   Execute out of order, but commit in order
   Examples: PowerPC 620, MIPS R10000, Intel P6, AMD K5, …

3 Hardware Support for More ILP
Need a HW buffer for the results of uncommitted instructions: the reorder buffer (ROB)
 – The reorder buffer can be an operand source
 – Once an instruction commits, its result is found in the register file
 – Three fields per entry: instruction type, destination, value
 – Tag results with the reorder buffer number instead of the reservation station number
 – Instructions commit in order
 – As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions
[Block diagram: Instr Queue feeds the Reorder Buffer and FP Regs; Res Stations feed the FP Adder and FP Mult (Figure 3.29, page 228)]
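The bullets above can be sketched in a few lines of code. This is a minimal illustrative model, not the lecture's design: class and method names are invented, and only the commit-in-order and undo-on-misprediction behavior from the slide is modeled.

```python
from collections import deque

class ROBEntry:
    def __init__(self, instr_type, dest):
        self.instr_type = instr_type   # the slide's "instr. type" field
        self.dest = dest               # destination register (None for branches)
        self.value = None              # filled in when the result is produced
        self.ready = False

class ReorderBuffer:
    def __init__(self, size=16):
        self.entries = deque()
        self.size = size

    def issue(self, instr_type, dest=None):
        if len(self.entries) == self.size:
            return None                # ROB full: stall issue
        e = ROBEntry(instr_type, dest)
        self.entries.append(e)         # enter at the tail, in program order
        return e

    def write_result(self, entry, value):
        entry.value = value            # result lands in the ROB, not the registers
        entry.ready = True

    def commit(self, regfile):
        # Only the head may commit, and only when its result is present:
        # execution was out of order, but commit is in order.
        if self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            if head.dest is not None:
                regfile[head.dest] = head.value
            return head
        return None

    def squash_after(self, branch):
        # Undo speculation: discard every entry younger than the
        # mispredicted branch. Nothing has reached the register file yet,
        # so no architectural state needs rolling back.
        while self.entries and self.entries[-1] is not branch:
            self.entries.pop()

rob = ReorderBuffer()
regs = {}
add  = rob.issue("ALU", "F0")
br   = rob.issue("branch")
spec = rob.issue("ALU", "F2")      # issued under a branch prediction
rob.write_result(spec, 9.0)
rob.squash_after(br)               # branch mispredicted: F2 is never written
rob.write_result(add, 1.0)
rob.write_result(br, None)
rob.commit(regs)
rob.commit(regs)
assert regs == {"F0": 1.0}         # the speculative result was discarded
```

Note how the speculative write to F2 disappears without any repair work: because results wait in the ROB until commit, squashing is just dropping tail entries.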

4 Four Steps of the Speculative Tomasulo Algorithm
1. Issue: get instruction from the FP Op Queue
   If a reservation station and a reorder buffer slot are free, issue the instruction and send its operands and the reorder buffer number for the destination; each RS now also has a field for the ROB#.
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result; once both are in the reservation station, execute.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting RSs and the ROB; mark the RS available.
4. Commit: update the register with the reorder buffer result
   When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer.
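The interesting part of step 1 is where an operand comes from: the register file, a completed-but-uncommitted ROB entry, or a still-pending ROB tag that the RS must watch for on the CDB. A small sketch of that lookup, with invented names (the lecture only says each RS gains a field for the ROB#):

```python
def issue_operand(reg, regfile, rob_tag_for_reg, rob_values):
    """Return (value, tag): a value if one is available, else the ROB# to wait on."""
    tag = rob_tag_for_reg.get(reg)     # is a pending instruction writing reg?
    if tag is None:
        return regfile[reg], None      # committed value: read the register file
    if tag in rob_values:
        return rob_values[tag], None   # completed but uncommitted: read the ROB
    return None, tag                   # still executing: watch the CDB for tag

regfile = {"F2": 3.0, "F4": 5.0}
rob_tag_for_reg = {"F0": 7}    # ROB entry #7 will produce F0
rob_values = {}                # entry 7 has not written its result yet

v, t = issue_operand("F4", regfile, rob_tag_for_reg, rob_values)
assert (v, t) == (5.0, None)   # no pending writer: use the register file

v, t = issue_operand("F0", regfile, rob_tag_for_reg, rob_values)
assert (v, t) == (None, 7)     # must wait for ROB entry 7 on the CDB

rob_values[7] = 15.0           # step 3: entry 7 broadcasts its result
v, t = issue_operand("F0", regfile, rob_tag_for_reg, rob_values)
assert (v, t) == (15.0, None)  # speculative bypass straight from the ROB
```

This is the “speculative bypassing of results” from slide 2: consumers read values out of the ROB long before those values are architecturally committed.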

5 Result Shift Register and Reorder Buffer
 General solution to three problems
   Precise exceptions
   Speculative execution
   Register renaming
 Solution in three steps
   In-order initiation, out-of-order termination (using RSRa)
   In-order initiation, in-order termination (using RSRb)
   In-order initiation, in-order termination, with renaming (using ROB)
 Architectural model
   Essentially the MIPS FP pipeline
   Add takes 2 clock cycles, multiplication 5, division 10
   Memory accesses take 1 clock cycle
   Integer instructions take 1 clock cycle
   1 branch delay slot, delayed branches

6 I-O Initiation, O-O Termination (RSRa)
LOOP:  LD    F6, 32(R2)
       LD    F2, 48(R3)
       MULTD F0, F2, F4
       ADDI  R2, R2, 8
       ADDI  R3, R3, 8
       SUBD  F8, F6, F2
       DIVD  F10, F10, F0
       ADDD  F6, F8, F6
       BLEZ  R4, LOOP
       ADDI  R4, R4, 1
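Using the latencies from the architectural model on slide 5, the longest RAW dependence chain in this loop body can be found with a simple longest-path pass over the instructions. This ignores issue restrictions, structural hazards, and loop-carried dependences, so it is only a lower bound on the iteration time, not the schedule any of the designs actually achieves:

```python
# Latencies from slide 5: add 2, multiply 5, divide 10, memory and integer 1.
latency = {"LD": 1, "MULTD": 5, "ADDI": 1, "SUBD": 2,
           "DIVD": 10, "ADDD": 2, "BLEZ": 1}

# (op, destination, sources) for one loop iteration, in program order.
loop = [
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("ADDI",  "R2",  ["R2"]),
    ("ADDI",  "R3",  ["R3"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F10", "F0"]),
    ("ADDD",  "F6",  ["F8", "F6"]),
    ("BLEZ",  None,  ["R4"]),
    ("ADDI",  "R4",  ["R4"]),
]

# Longest path over RAW dependences: an instruction finishes no earlier
# than its slowest producer, plus its own latency. WAR/WAW hazards are
# ignored, as renaming would remove them.
finish = {}        # register -> earliest cycle its value is ready
critical = 0
for op, dest, srcs in loop:
    start = max([finish.get(s, 0) for s in srcs], default=0)
    done = start + latency[op]
    if dest is not None:
        finish[dest] = done
    critical = max(critical, done)

assert critical == 16   # LD F2 (1) -> MULTD F0 (5) -> DIVD F10 (10)
```

The critical path runs through the divide, which is why DIVD sits in the middle of the loop: everything after it is independent of its result and can proceed under dynamic scheduling.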

7 I-O Initiation, I-O Termination (RSRb)

8 Idea Behind the ROB
 Combine the benefits of early issue and in-order update of state
 Obtained from RSRa by adding a renaming mechanism to it
 Add a FIFO to RSRa (implemented as a circular buffer)
 When RSRa allows issuing of a new instruction
   Enter the instruction at the tail of the circular buffer
   Each buffer entry has multiple fields
     [Result; Valid Bit; Destination Register Name; PC value; Exceptions]
 Termination happens when the result is produced, broadcast on the CDB, and written into the circular buffer (replace M with T)
   A written ROB entry can serve as a source of operands from then on
 Commit happens when the value is moved from the circular buffer to the register (replace W with C)
   This happens when the instruction reaches the head of the circular buffer and has completed execution with no exceptions
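The circular-buffer implementation with head and tail indices can be sketched as below. The entry fields follow the slide's layout; the class name and the dict representation are illustrative assumptions, not the lecture's design:

```python
class CircularROB:
    def __init__(self, size):
        self.buf = [None] * size
        self.size = size
        self.head = 0        # oldest entry: next to commit
        self.tail = 0        # next free slot: next instruction issues here
        self.count = 0

    def issue(self, dest, pc):
        if self.count == self.size:
            return None      # buffer full: stall issue
        self.buf[self.tail] = {"result": None, "valid": False,
                               "dest": dest, "pc": pc, "exceptions": None}
        slot = self.tail
        self.tail = (self.tail + 1) % self.size   # wrap around
        self.count += 1
        return slot

    def terminate(self, slot, result):
        # Termination: the result is broadcast on the CDB and written
        # into the circular buffer (the slide's M -> T transition).
        self.buf[slot]["result"] = result
        self.buf[slot]["valid"] = True

    def commit(self, regfile):
        # Commit: move the value from the buffer to the register file
        # (W -> C), only when the head entry has a valid result.
        e = self.buf[self.head]
        if e is None or not e["valid"]:
            return False
        regfile[e["dest"]] = e["result"]
        self.buf[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return True

rob = CircularROB(4)
regs = {}
s0 = rob.issue("F6", pc=0)
s1 = rob.issue("F2", pc=4)
rob.terminate(s1, 2.0)           # younger instruction finishes first...
assert not rob.commit(regs)      # ...but cannot commit past the head
rob.terminate(s0, 1.0)
assert rob.commit(regs) and rob.commit(regs)
assert regs == {"F6": 1.0, "F2": 2.0}
```

The wrap-around of `head` and `tail` is what makes a fixed array behave as the FIFO the slide describes, at the cost of the full/empty check that makes ROB capacity a structural hazard.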

9 I-O Initiation, I-O Termination (ROB)
LOOP:  LD    F6, 32(R2)
       LD    F2, 48(R3)
       MULTD F0, F2, F4
       ADDI  R2, R2, 8
       ADDI  R3, R3, 8
       SUBD  F8, F6, F2
       DIVD  F10, F10, F0
       ADDD  F6, F8, F6
       BLEZ  R4, LOOP
       ADDI  R4, R4, 1

10 States of the Circular Buffer
LOOP:  LD    F6, 32(R2)
       LD    F2, 48(R3)
       MULTD F0, F2, F4
       ADDI  R2, R2, 8
       ADDI  R3, R3, 8
       SUBD  F8, F6, F2
       DIVD  F10, F10, F0
       ADDD  F6, F8, F6
       BLEZ  R4, LOOP
       ADDI  R4, R4, 1
Entry in yellow is at the head of the buffer
Entry in green is the tail of the buffer, i.e., the next instruction goes here
Greyed instructions have committed

11 Complexity of the ROB
 Assume a dual-issue superscalar load/store machine with
   Three-operand instructions
   64 registers
   A 16-entry circular buffer
 Hardware support needed for the ROB
   For each buffer entry
     One write port
     Four read ports (two source operands of two instructions)
     Four 6-bit comparators for associative lookup
   For each read port
     A 16-way “priority” encoder with wrap-around (to get the latest value)
 The limited capacity of the ROB is a structural hazard
 Repeated writes to the same register actually happen
   This is not the case in “classical” Tomasulo
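These counts follow directly from the machine parameters: 64 registers need 6-bit names, dual issue of three-operand instructions means four source lookups per cycle, and every lookup must compare against all 16 entries. A quick arithmetic check:

```python
import math

registers     = 64
rob_entries   = 16
issue_width   = 2
src_per_instr = 2    # three-operand format: two sources, one destination

reg_name_bits = math.ceil(math.log2(registers))
assert reg_name_bits == 6            # hence 6-bit comparators

read_ports = issue_width * src_per_instr
assert read_ports == 4               # four read ports per entry

# Associative lookup: each entry compares its destination register name
# against every read port's source name, every cycle.
comparators_per_entry = read_ports
total_comparators = rob_entries * comparators_per_entry
assert total_comparators == 64

# Each read port also needs a 16-way priority encoder with wrap-around,
# so that the latest matching entry (the one closest to the tail) wins
# when several in-flight instructions write the same register.
priority_encoders = read_ports
```

The wrap-around in the priority encoders is the price of allowing repeated writes to the same register inside the window, which classical Tomasulo avoids by tagging each register with at most one producer.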

12 Example: System Interactions
Memory access time is 200 ns
Which design is fastest?