CSCE 614 Fall 2009: Hardware-Based Speculation


Hardware-Based Speculation
As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing burden. => Speculate on the outcome of branches and execute the program as if the guesses were correct.

Key Ideas of Hardware Speculation
– Dynamic branch prediction: choose which instructions to execute.
– Speculation: allow instructions to execute before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence).
– Dynamic scheduling: deal with the scheduling of different combinations of basic blocks.

Examples
– PowerPC 603/604/G3/G4
– MIPS R10000/12000
– Intel Pentium II/III/4
– Alpha
– AMD K5/K6/Athlon

Hardware Speculation
Extends Tomasulo's algorithm with an additional required step, instruction commit. Instructions are allowed to execute out of order but are forced to commit in order. Any irrevocable action (updating state or taking an exception) is delayed until the instruction commits.

Reorder Buffer (ROB)
Holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits. It also serves as a source of operands for later instructions. With speculation, the register file (or memory) is not updated until the instruction commits.

ROB Fields
– Instruction type: indicates whether the instruction is a branch, a store, or a register operation (ALU or load).
– Destination: supplies the register number (for loads and ALU operations) or the memory address (for stores).
– Value: holds the value of the instruction result until the instruction commits.
– Ready: indicates that the instruction has completed execution and the value is ready.
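These four fields can be sketched as a small data structure. This is a minimal illustrative model (the class and field names are assumptions, not from any real design):

```python
from dataclasses import dataclass
from typing import Optional, Union

# A toy sketch of one ROB entry mirroring the four fields above.
@dataclass
class ROBEntry:
    instr_type: str                         # "branch", "store", or "register" (ALU/load)
    destination: Optional[Union[int, str]]  # register name, or memory address for a store
    value: Optional[int] = None             # result, held here until commit
    ready: bool = False                     # True once execution has completed

# Example: a register operation that has finished executing but not yet committed.
entry = ROBEntry(instr_type="register", destination="F6", value=42, ready=True)
```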

[Figure: hardware speculation datapath. Instructions flow Issue → Execute → Write Result (to ROB) → Commit (write to register file/memory), through the reservation stations and the reorder buffer (ROB).]

Basic Structure of the MIPS FP Unit with Speculation
The ROB completely replaces the store buffer, and the renaming function of the reservation stations is taken over by the ROB.

Steps of Execution
1. Issue (also called "dispatch")
– Get an instruction from the instruction queue.
– Issue the instruction if there is an empty reservation station and an empty slot in the ROB.
– If all reservation stations of the needed type are full or the ROB is full, instruction issue stalls.
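The issue-stage stall condition can be sketched as a toy model (structure names and capacities are illustrative assumptions):

```python
from collections import deque

# Toy model of the issue check above: an instruction issues only when a
# reservation station of the right type AND a ROB slot are both free.
class IssueStage:
    def __init__(self, rs_slots, rob_capacity):
        self.rs_free = dict(rs_slots)   # free reservation-station count per type
        self.rob_capacity = rob_capacity
        self.rob = deque()              # allocated ROB entries, head = oldest

    def issue(self, rs_type, instr):
        if self.rs_free.get(rs_type, 0) == 0 or len(self.rob) == self.rob_capacity:
            return False                # stall: RS of this type full, or ROB full
        self.rs_free[rs_type] -= 1
        self.rob.append(instr)          # allocate a ROB entry at the tail
        return True

stage = IssueStage({"fp_mul": 1, "load": 2}, rob_capacity=2)
```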

2. Execute
– If one or more operands is not yet available, monitor the CDB while waiting for the register value to be computed; this is also how RAW hazards are checked.
– When both operands are available at a reservation station, execute the operation.
– Loads require two steps (effective address calculation, then memory access); stores need only one step at this stage (effective address calculation).

3. Write Result
– When the result is available, write it on the CDB, and from the CDB into the ROB as well as into any reservation stations waiting for this result.
– For stores, if the value to be stored is available, it is written into the Value field of the ROB entry for the store.
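The broadcast can be sketched as follows; the result is tagged with its ROB entry, recorded there, and captured by every reservation station waiting on that tag (dict-based structures are illustrative assumptions):

```python
# Toy model of the Write Result step: broadcast a completed result on the CDB.
def write_result(rob, reservation_stations, rob_tag, result):
    rob[rob_tag]["value"] = result         # record the result in the ROB entry
    rob[rob_tag]["ready"] = True
    for rs in reservation_stations:
        if rs.get("src_tag") == rob_tag:   # this operand was waiting on the result
            rs["src_value"] = result       # captured from the CDB
            rs["src_tag"] = None           # no longer waiting

rob = {3: {"value": None, "ready": False}}
stations = [{"src_tag": 3, "src_value": None}, {"src_tag": 7, "src_value": None}]
write_result(rob, stations, rob_tag=3, result=8)
```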

4. Commit (also called "completion" or "graduation")
– Normal commit: when an instruction reaches the head of the ROB and its result is present in the buffer, the processor updates the register with the result and removes the instruction from the ROB.
– Store commit: similar, except that memory is updated instead of a register.
– Branch with an incorrect prediction: the speculation was wrong; the ROB is flushed and execution restarts at the correct successor of the branch.
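The three commit cases can be sketched as one toy function (the dict-of-fields representation is an illustrative assumption; a real ROB is a circular hardware buffer):

```python
# Toy commit step for the head-of-ROB instruction, covering the three cases above.
def commit_head(rob, regfile, memory, branch_was_correct=True):
    head = rob[0]
    if not head["ready"]:
        return "wait"                          # result not yet in the buffer
    if head["type"] == "branch" and not branch_was_correct:
        rob.clear()                            # flush all speculative work
        return "flush"                         # restart at the correct successor
    if head["type"] == "store":
        memory[head["dest"]] = head["value"]   # memory is updated only at commit
    elif head["type"] == "register":
        regfile[head["dest"]] = head["value"]  # register file updated only at commit
    rob.pop(0)
    return "commit"

regs, mem = {}, {}
rob = [{"type": "register", "dest": "F0", "value": 5, "ready": True},
       {"type": "branch", "dest": None, "value": None, "ready": True}]
```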

Example
L.D   F6, 34(R2)
L.D   F2, 45(R3)
MUL.D F0, F2, F4
SUB.D F8, F6, F2
DIV.D F10, F0, F6
ADD.D F6, F8, F2
Consider the state of the hardware when the MUL.D is ready to commit.

Example (p.109)
Loop: L.D    F0, 0(R1)
      MUL.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
Assume that we have issued all the instructions in the loop twice, that the L.D and MUL.D from the first iteration have committed, and that all other instructions have completed execution. Show the contents of the ROB and the FP registers.

Hardware Speculation
Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted. Exceptions are handled by not recognizing the exception until the faulting instruction is ready to commit.

Hardware Speculation
See Figure 2.17 (p.113).

Multiple-Issue Processors
Allow multiple instructions to issue in a clock cycle (ideal CPI < 1). Three flavors:
– Statically scheduled superscalar
– Dynamically scheduled superscalar
– VLIW (Very Long Instruction Word)

Superscalar Processors
Issue varying numbers of instructions per clock:
– statically scheduled: compiler techniques, in-order execution
– dynamically scheduled: Tomasulo's algorithm, out-of-order execution

VLIW Processors
Issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (EPIC: Explicitly Parallel Instruction Computers). Statically scheduled by the compiler.

Name                      | Issue structure | Hazard detection   | Scheduling               | Distinguishing characteristic             | Examples
Superscalar (static)      | dynamic         | hardware           | static                   | in-order execution                        | MIPS and ARM (embedded)
Superscalar (dynamic)     | dynamic         | hardware           | dynamic                  | some out-of-order execution               | none
Superscalar (speculative) | dynamic         | hardware           | dynamic with speculation | out-of-order execution with speculation   | Pentium 4, MIPS R12K, Alpha 21264, IBM Power5
VLIW/LIW                  | static          | primarily software | static                   | all hazards determined by compiler        | TI C6x (embedded)
EPIC                      | mostly static   | mostly software    | mostly static            | all hazards determined by compiler        | Itanium

Multiple Instruction Issue with Dynamic Scheduling
Two-issue dynamically scheduled processor:
– It can issue any pair of instructions if there are reservation stations of the right type available.
– Extends Tomasulo's algorithm.
Note that Tomasulo's algorithm (and hardware speculation) is used for both integer operations and FP operations.

Two approaches to implementation:
– Issue one instruction in half a clock cycle, so that two instructions can be processed in one clock cycle.
– Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions.
Modern superscalar processors that issue four or more instructions per clock often include both approaches.
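The second approach can be sketched for a dependent pair: if the second instruction reads the register the first one writes, the issue logic renames that operand to the first instruction's freshly allocated ROB tag rather than reading a stale register value. This is a hypothetical sketch, not any specific machine's issue logic:

```python
# Hypothetical sketch of same-cycle dual issue with renaming: a RAW
# dependence inside the pair is resolved by pointing the second
# instruction's operand at the first one's newly allocated ROB tag.
def issue_pair(i1, i2, next_rob_tag):
    i1_tag, i2_tag = next_rob_tag, next_rob_tag + 1
    i2_operands = []
    for src in i2["srcs"]:
        if src == i1["dest"]:
            i2_operands.append(("tag", i1_tag))   # wait for i1's result on the CDB
        else:
            i2_operands.append(("reg", src))      # read from the register file
    return i1_tag, i2_tag, i2_operands

# L.D F0,0(R1) followed by MUL.D F4,F0,F2: the MUL.D must wait for F0.
pair = issue_pair({"dest": "F0", "srcs": ["R1"]},
                  {"dest": "F4", "srcs": ["F0", "F2"]}, next_rob_tag=10)
```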

How to Handle Branches?
– Dynamically scheduled processors: only allow instructions to be fetched and issued (but not actually executed) until the branch has completed (e.g., IBM 360/91).
– Processors with hardware speculation: can actually execute instructions based on branch prediction.

Note that we consider loads and stores, including those to FP registers, as integer operations. Assume that FP adds take 3 execution cycles. [Table: latencies from Execute to Write CDB.]

The throughput improvement versus a single-issue pipeline is small:
– There is only one FP operation per iteration.
– There is only one integer ALU for both integer ALU operations and effective address calculations.
Larger improvements would be possible if the processor could execute more integer operations per cycle.

Multiple Issue with Speculation
We process multiple instructions per clock, assigning reservation stations and reorder-buffer entries to the instructions. To sustain a throughput of more than one instruction per cycle, a speculative processor must also be able to commit multiple instructions per clock cycle.
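Multi-commit can be sketched as a retire loop that commits up to the machine's commit width each cycle, in order, stopping at the first entry whose result is not ready (the list-of-dicts structure and the width of 2 are illustrative assumptions):

```python
# Toy retire loop: commit up to `width` ready instructions from the ROB
# head each cycle, in order; stop at the first not-yet-ready entry.
def commit_cycle(rob, regfile, width=2):
    committed = 0
    while committed < width and rob and rob[0]["ready"]:
        head = rob.pop(0)
        regfile[head["dest"]] = head["value"]
        committed += 1
    return committed

regs = {}
rob = [{"ready": True, "dest": "R2", "value": 1},
       {"ready": True, "dest": "R1", "value": 8},
       {"ready": False, "dest": "R2", "value": None}]
```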

Example (p.119)
Loop: LD     R2, 0(R1)
      DADDIU R2, R2, #1
      SD     R2, 0(R1)
      DADDIU R1, R1, #8
      BNE    R2, R3, Loop
Consider the execution of this loop on a two-issue processor, once without speculation (dynamic scheduling with Tomasulo's algorithm) and once with speculation. Assume separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation; two CDBs; and that up to two instructions of any type can commit per clock on the processor with speculation. Show the execution timing of the first three iterations of the loop.

High-Performance Instruction Delivery
For multiple-issue processors (delivering 4-8 instructions per clock cycle):
– Branch-target buffers
– Integrated instruction fetch unit
– Return address prediction

Branch-Target Buffers
To reduce the branch penalty for the classic 5-stage pipeline, we want to know what address to fetch by the end of IF. A branch-target buffer is a branch-prediction cache that stores the predicted address for the next instruction after a branch. We access the buffer during the IF stage using the instruction address, before we even know whether the instruction is a branch.

[Figure: branch-target buffer (branch-target cache) organization; an optional field may be used for extra prediction state bits.]

Branch-Target Buffers
We only need to store predicted-taken branches in the branch-target buffer: if a branch is predicted not taken, the sequential fetch at the next PC is already correct, so no entry is required. There is no branch delay if a branch-prediction entry is found and the prediction is correct.
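The fetch-stage lookup can be sketched as follows (a dict-based buffer and 4-byte instructions are illustrative assumptions):

```python
# Toy branch-target buffer probed with the fetch PC during IF. Only
# predicted-taken branches are stored: a hit redirects fetch to the
# predicted target; a miss means "fetch sequentially", which is also
# the correct behavior for a predicted-not-taken branch.
def next_fetch_pc(btb, pc, instr_bytes=4):
    if pc in btb:
        return btb[pc]             # predicted taken: use the buffered target
    return pc + instr_bytes        # no entry: fall through to the next instruction

btb = {0x400: 0x480}               # one predicted-taken branch at 0x400
```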

Return Address Predictors
Predicting indirect jumps (destination address varies at run time):
– Procedure returns, procedure calls, case/select statements, etc.
– SPEC89: 85% of indirect jumps are procedure returns.
A small buffer of return addresses operating as a stack:
– Caches the most recent return addresses
– Push a return address on the stack at a call
– Pop one off at a return
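The push-at-call / pop-at-return behavior can be sketched as a small stack (the class name and a depth of 2 are illustrative assumptions; real predictors typically hold on the order of 8-16 entries):

```python
# Toy return-address stack: push the return address at a call, pop the
# prediction at a return; the oldest entry is overwritten when the small
# buffer is full.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_pc):
        if len(self.entries) == self.depth:
            self.entries.pop(0)        # overwrite the oldest return address
        self.entries.append(return_pc)

    def on_return(self):
        # Predicted return target; None models an empty-stack misprediction.
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack(depth=2)
ras.on_call(0x100)
ras.on_call(0x200)
ras.on_call(0x300)                     # depth 2: the 0x100 entry is overwritten
```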

Integrated Instruction Fetch Units
A separate autonomous unit that feeds instructions to the rest of the pipeline in multiple-issue processors. It has several functions:
– Integrated branch prediction
– Instruction prefetch
– Instruction memory access and buffering