ECE562/468 Advanced Computer Architecture Prof. Honggang Wang

ECE562/468 Advanced Computer Architecture, Prof. Honggang Wang. Ch2. Instruction-Level Parallelism & Its Exploitation: 2. Dynamic Scheduling. ECE Department, University of Massachusetts Dartmouth, 285 Old Westport Rd., North Dartmouth, MA 02747-2300. Slides based on the PowerPoint presentations created by David Patterson as part of the Instructor Resources for the textbook by Hennessy & Patterson; updated by Honggang Wang.

Administrative Issues (02/25/2016)
- Project proposal is due Tuesday, March 1
- 5-10 minute project proposal presentation on March 1
- Project proposal guidelines can be found on my teaching website: www.faculty.umassd.edu/honggang.wang/teaching.html
- Background review lecture

Review of Last Lecture
- ILP: concepts and challenges
- Compiler techniques to increase ILP
- Loop unrolling
- Branch prediction: static branch prediction, dynamic branch prediction
- Overcoming data hazards with dynamic scheduling

Outline
- Dynamic Scheduling
- Speculation
- Memory Aliases
- Exceptions
- Tomasulo Algorithm
- Speculation
- Speculative Tomasulo Example
- Memory Aliases
- Exceptions
- Increasing instruction bandwidth
- Register Renaming vs. Reorder Buffer
- Value Prediction
Notes: The idea, using the sequence
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
In an in-order pipeline, SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall. With dynamic scheduling, instructions are still issued in program order but execute out of order, which implies out-of-order completion.
Tracking instruction dependences allows an instruction to execute as soon as its operands are available, and renaming registers avoids WAR and WAW hazards. RAW hazards are avoided by executing an instruction only when its operands are available. WAR and WAW hazards, which arise from name dependences, are eliminated by register renaming. A compiler cannot handle register renaming across branches.
Register renaming is provided by reservation stations, which buffer the operands of instructions waiting to issue. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input. Finally, when successive writes to a register overlap in execution, only the last one is actually used to update the register. As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, which provides register renaming. Because there can be more reservation stations than real registers, the technique can eliminate hazards arising from name dependences that a compiler could not remove.
The use of reservation stations, rather than a centralized register file, leads to two important properties. First, hazard detection and execution control are distributed: the information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This is done over the common data bus, which allows all units waiting for an operand to be loaded simultaneously.

Advantages of Dynamic Scheduling
- Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
- It handles cases where dependences are unknown at compile time
- It allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code that was compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling
Notes: The cache is a smaller, faster memory that stores copies of the data from the most frequently used main memory locations. A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency; dynamic scheduling helps reduce the resulting delay. Exception behavior example:
DADDU R2, R3, R4
BEQZ R2, L1
LW R1, 0(R2)
Switching BEQZ and LW could cause a memory protection exception.

HW Schemes: Instruction Parallelism
- Key idea: allow instructions behind a stall to proceed
  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
- In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- We distinguish when an instruction begins execution from when it completes execution; between the two times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder
Notes: Explain what WAR and WAW are: switching the order of instructions causes the hazard. In the sequence above, the dependence of ADDD on DIVD causes an in-order pipeline to stall, yet SUBD is not data dependent on anything in the pipeline. Out-of-order execution does not keep the program order. In the five-stage pipeline, both structural and data hazards are checked in a single stage; with out-of-order execution, data hazards and structural hazards still need to be checked. Example:
DIV.D F0, F2, F4
ADD.D F6, F0, F8
SUB.D F8, F10, F14
MUL.D F6, F10, F8
Here SUB.D writes F8, which ADD.D reads (a potential WAR hazard), and MUL.D writes F6, which ADD.D also writes (a potential WAW hazard).
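The WAR and WAW hazards in the four-instruction sequence above (DIV.D, ADD.D, SUB.D, MUL.D) come from reusing the names F6 and F8, so giving each result a fresh name removes them. The following is a minimal Python sketch of that idea, not the hardware mechanism itself: the `rename` helper and the temporary names T0, T1, ... are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str, str]  # (opcode, destination, source 1, source 2)

def rename(program: List[Instr]) -> List[Instr]:
    """Give every destination a fresh temporary and redirect later reads to it."""
    latest: Dict[str, str] = {}  # architectural register -> newest temporary holding it
    renamed: List[Instr] = []
    for n, (op, dest, s1, s2) in enumerate(program):
        # Sources read the newest name of each register, which preserves RAW dependences.
        s1 = latest.get(s1, s1)
        s2 = latest.get(s2, s2)
        # A brand-new destination name cannot clash with an earlier
        # reader (WAR) or an earlier writer (WAW).
        fresh = f"T{n}"  # hypothetical temporary name
        latest[dest] = fresh
        renamed.append((op, fresh, s1, s2))
    return renamed

program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]
for instr in rename(program):
    print(instr)
```

After renaming, ADD.D still reads the DIV.D result (through T0) and MUL.D still reads the SUB.D result (through T2), but SUB.D and MUL.D no longer write names that ADD.D reads or writes.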

Dynamic Scheduling Step 1
- The simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

A Dynamic Algorithm: Tomasulo’s
- For IBM 360/91 (before caches!), hence long memory latency
- Goal: high performance without special compilers
- Small number of floating-point registers (4 in 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
- Why study a 1966 computer?
- The descendants of this have flourished: Alpha 21264, Pentium 4, AMD Opteron, Power 5, …

Tomasulo Algorithm
- Control & buffers distributed with Function Units (FU)
  - FU buffers called “reservation stations”; they hold pending operands
- Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  - Renaming avoids WAR, WAW hazards
  - More reservation stations than registers, so can do optimizations compilers can’t
- Results go to FUs from RSs, not through registers, over the Common Data Bus (CDB) that broadcasts results to all FUs
- Avoids RAW hazards by executing an instruction only when its operands are available
- Loads and stores treated as FUs with RSs as well
Notes: The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo of IBM. It allows sequential instructions that would normally be stalled due to certain dependencies to execute non-sequentially (out-of-order execution). Algorithm: http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo.htm
The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. Register renaming is provided by the reservation stations, which works better than a centralized register file: hazard detection and execution control are distributed, and results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This is done over the CDB.

Tomasulo Organization
[Figure: block diagram showing the FP Op Queue, FP Registers, Load Buffers (Load1 to Load6, from memory), Store Buffers (to memory), Reservation Stations (Add1 to Add3 for the FP adders, Mult1 and Mult2 for the FP multipliers), and the Common Data Bus (CDB).]

Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -)
- Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: reservation stations producing the source registers (values to be written)
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates that the reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
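The fields listed above map naturally onto a small record type. The sketch below is one possible Python rendering, assuming that `None` stands for "blank" and that station names such as "Mult1" and "Load2" serve as the tags; the class name, the `ready` helper, and the operand value 2.5 are illustrative choices, not part of the slides.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ReservationStation:
    name: str                    # e.g. "Add1", "Mult2"
    busy: bool = False           # Busy: station currently holds an instruction
    op: Optional[str] = None     # Op: operation to perform
    vj: Optional[float] = None   # Vj: value of the first source operand
    vk: Optional[float] = None   # Vk: value of the second source operand
    qj: Optional[str] = None     # Qj: station that will produce Vj (None = Vj is available)
    qk: Optional[str] = None     # Qk: station that will produce Vk (None = Vk is available)

    def ready(self) -> bool:
        """An instruction may begin execution once no operand is still pending."""
        return self.busy and self.qj is None and self.qk is None

# Register result status: register -> station that will write it.
# A missing key plays the role of "blank" (no pending writer).
register_status: Dict[str, str] = {}

# A multiply whose second operand is known (2.5, made up) and whose first comes from Load2.
mult1 = ReservationStation("Mult1", busy=True, op="MUL", vk=2.5, qj="Load2")
register_status["F0"] = "Mult1"
print(mult1.ready())  # False: still waiting on Load2
```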

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If a reservation station is free (no structural hazard), control issues the instruction and sends operands (renames registers).
2. Execute: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
- Common data bus: data + source (“come from” bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if it matches the expected Functional Unit (the one that produces the result)
  - Does the broadcast
- Example speed: 2 clocks for +, -; 10 clocks for *; 40 clocks for /
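The three stages can be sketched in a few lines of Python. This is a simplified illustration rather than a cycle-accurate model: it ignores latencies, loads, and stores, the caller picks the station by name, and the `Machine` class and the register values are assumptions made for the example. The instruction sequence is the DIVD/ADDD/SUBD example from the earlier slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class RS:
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None   # tag of the station producing Vj, None if Vj is ready
    qk: Optional[str] = None

OPS = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b,
       "MUL": lambda a, b: a * b, "DIV": lambda a, b: a / b}

class Machine:
    def __init__(self, regs: Dict[str, float], station_names: List[str]):
        self.regs = dict(regs)
        self.status: Dict[str, str] = {}   # register -> station that will write it
        self.rs = {n: RS() for n in station_names}

    def issue(self, station: str, op: str, dest: str, src1: str, src2: str) -> bool:
        """Stage 1: take a free station and rename the source operands."""
        rs = self.rs[station]
        if rs.busy:
            return False                   # structural hazard: the instruction must wait
        rs.busy, rs.op = True, op
        rs.vj = rs.vk = rs.qj = rs.qk = None
        for src, v, q in ((src1, "vj", "qj"), (src2, "vk", "qk")):
            if src in self.status:         # value still pending: record the producer's tag
                setattr(rs, q, self.status[src])
            else:                          # value available: copy it into the station
                setattr(rs, v, self.regs[src])
        self.status[dest] = station        # later readers of dest will wait on this station
        return True

    def can_execute(self, station: str) -> bool:
        """Stage 2 may start only when both operands are present."""
        rs = self.rs[station]
        return rs.busy and rs.qj is None and rs.qk is None

    def write_result(self, station: str) -> float:
        """Stage 3: compute, broadcast (tag, value) on the CDB, free the station."""
        rs = self.rs[station]
        assert self.can_execute(station), "operands not ready"
        result = OPS[rs.op](rs.vj, rs.vk)
        for other in self.rs.values():     # every waiting station snoops the bus
            if other.qj == station:
                other.vj, other.qj = result, None
            if other.qk == station:
                other.vk, other.qk = result, None
        for reg, producer in list(self.status.items()):
            if producer == station:        # only the latest writer updates the register
                self.regs[reg] = result
                del self.status[reg]
        rs.busy = False
        return result

# Register values below are made up for the demonstration.
m = Machine({"F0": 0.0, "F2": 6.0, "F4": 3.0, "F8": 1.0,
             "F10": 0.0, "F12": 0.0, "F14": 4.0}, ["Add1", "Add2", "Mult1"])
m.issue("Mult1", "DIV", "F0", "F2", "F4")
m.issue("Add1", "ADD", "F10", "F0", "F8")    # must wait for Mult1
m.issue("Add2", "SUB", "F12", "F8", "F14")   # independent of the divide
print(m.can_execute("Add1"), m.can_execute("Add2"))  # False True
m.write_result("Add2")                       # completes before the older divide
m.write_result("Mult1")
m.write_result("Add1")
print(m.regs["F0"], m.regs["F10"], m.regs["F12"])    # 2.0 3.0 -3.0
```

The subtract finishes first even though it was issued last, which is the in-order issue, out-of-order completion behavior the slides describe.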

Tomasulo Example
[Figure labels: instruction stream; 3 load buffers; 3 FP adder reservation stations; 2 FP multiplier reservation stations; FU countdown; clock cycle counter.]

Tomasulo Example Cycle 1

Tomasulo Example Cycle 2 Note: Can have multiple loads outstanding

Tomasulo Example Cycle 3 Note: register names are removed (“renamed”) in the reservation stations; MULT issued. Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4 Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5 Timer starts down for Add1, Mult1

Tomasulo Example Cycle 6 Issue ADDD here despite name dependency on F6?
Notes: There are three situations in which a data hazard can occur: read after write (RAW), a true dependency; write after read (WAR); and write after write (WAW). Consider two instructions i and j, with i occurring before j in program order.
Read After Write (RAW): j tries to read a source before i writes to it. A RAW data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved. This can occur because, even though an instruction is executed after a previous instruction, the previous instruction has not been completely processed through the pipeline. For example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The first instruction calculates a value to be saved in register 2, and the second uses this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the second operation, the result of the first will not yet have been saved, and hence we have a data dependency. We say that instruction 2 has a data dependency, as it depends on the completion of instruction 1.
Write After Read (WAR): j tries to write a destination before it is read by i. A WAR data hazard represents a problem with concurrent execution.
i1. R4 <- R1 + R3
i2. R3 <- R1 + R2
If there is a chance that i2 may complete before i1 (i.e., with concurrent execution), we must ensure that we do not store the result into register 3 before i1 has had a chance to fetch its operands.
Write After Write (WAW): j tries to write an operand before it is written by i. A WAW data hazard may occur in a concurrent execution environment.
i1. R2 <- R1 + R2
i2. R2 <- R4 + R7
We must delay the WB (Write Back) of i2 until i1 has finished executing.
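The three hazard definitions in the notes above can be checked mechanically by comparing the destination and source registers of an earlier instruction i and a later instruction j. The helper below is a small Python sketch; the function name `hazards` and the (destination, sources) encoding are assumptions made for illustration.

```python
from typing import List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (destination register, source registers)

def hazards(i: Instr, j: Instr) -> List[str]:
    """Classify potential data hazards between i and a later instruction j."""
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    found = []
    if i_dest in j_srcs:
        found.append("RAW")  # j reads a register that i writes
    if j_dest in i_srcs:
        found.append("WAR")  # j writes a register that i reads
    if j_dest == i_dest:
        found.append("WAW")  # i and j write the same register
    return found

# The three examples from the notes, encoded as (dest, (src1, src2)).
print(hazards(("R2", ("R1", "R3")), ("R4", ("R2", "R3"))))  # ['RAW']
print(hazards(("R4", ("R1", "R3")), ("R3", ("R1", "R2"))))  # ['WAR']
print(hazards(("R2", ("R1", "R2")), ("R2", ("R4", "R7"))))  # ['WAR', 'WAW']
```

The third pair is reported as both WAR and WAW because i1 also reads R2, the register that i2 writes.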

Tomasulo Example Cycle 7 Add1 (SUBD) completing; what is waiting for it?

Tomasulo Example Cycle 8

Tomasulo Example Cycle 9

Tomasulo Example Cycle 10 Add2 (ADDD) completing; what is waiting for it?

Tomasulo Example Cycle 11 Write result of ADDD here? All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Tomasulo Example Cycle 13

Tomasulo Example Cycle 14

Tomasulo Example Cycle 15 Mult1 (MULTD) completing; what is waiting for it?

Tomasulo Example Cycle 16 Just waiting for Mult2 (DIVD) to complete

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55

Tomasulo Example Cycle 56 Mult2 (DIVD) is completing; what is waiting for it?

Tomasulo Example Cycle 57 Once again: In-order issue, out-of-order execution and out-of-order completion.

Why can Tomasulo overlap iterations of loops?
- Reservation stations: renaming to a larger set of registers + buffering source operands
  - Prevents registers from becoming a bottleneck
  - Avoids WAR hazards (by buffering old values of registers) and avoids WAW hazards
  - Allows loop unrolling in HW: “dynamic loop unrolling” (register renaming: multiple iterations use different physical destinations for registers)
- Not limited to basic blocks
- Other perspective: Tomasulo builds the data flow dependency graph on the fly

Tomasulo’s scheme offers 2 major advantages
1. Distribution of the hazard detection logic
   - Distributed reservation stations and the CDB
   - If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
   - If a centralized register file were used, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards
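The first advantage, simultaneous release by broadcast, can be illustrated with a single function that models one CDB transfer. This is a hedged Python sketch: stations are plain dictionaries, and the station names, tags, and operand values are invented for the example.

```python
from typing import Dict, List, Optional

Station = Dict[str, Optional[object]]  # keys: "vj", "vk", "qj", "qk"

def broadcast(stations: Dict[str, Station], tag: str, value: float) -> List[str]:
    """Model one CDB transfer and return the stations that became ready because of it."""
    released = []
    for name, rs in stations.items():
        waiting = tag in (rs["qj"], rs["qk"])   # was this station snooping for `tag`?
        if rs["qj"] == tag:
            rs["vj"], rs["qj"] = value, None
        if rs["qk"] == tag:
            rs["vk"], rs["qk"] = value, None
        if waiting and rs["qj"] is None and rs["qk"] is None:
            released.append(name)
    return released

# Three stations wait on Mult1; two of them already hold their other operand.
stations = {
    "Add1":  {"vj": None, "vk": 1.0,  "qj": "Mult1", "qk": None},
    "Add2":  {"vj": 4.0,  "vk": None, "qj": None,    "qk": "Mult1"},
    "Mult2": {"vj": None, "vk": None, "qj": "Mult1", "qk": "Load1"},
}
print(broadcast(stations, "Mult1", 2.0))  # ['Add1', 'Add2']
```

Add1 and Add2 are released by the same broadcast, while Mult2 captures the value but keeps waiting for Load1.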

Tomasulo Drawbacks
- Complexity
  - Delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
- Performance limited by Common Data Bus
  - Each CDB must go to multiple functional units, hence high capacitance and high wiring density
  - Number of functional units that can complete per cycle limited to one!
  - Multiple CDBs mean more FU logic for parallel associative stores
- Non-precise interrupts!
  - We will address this later

Next Topics
- Ch2. Dynamic Scheduling
Things To Do
- Project proposal is due Tuesday, March 1
- 5-10 minute project proposal presentation on March 1
- Check out the class website for lecture notes and reading assignments