ECE562/468 Advanced Computer Architecture Prof. Honggang Wang

Ch2. Instruction-Level Parallelism & Its Exploitation
2. Dynamic Scheduling
ECE Department, University of Massachusetts Dartmouth, 285 Old Westport Rd., North Dartmouth, MA
Slides based on the PowerPoint presentations created by David Patterson as part of the Instructor Resources for the textbook by Hennessy & Patterson. Updated by Honggang Wang.

Administrative Issues (02/25/2016)
Project proposal is due Tuesday, March 1
5-10 min project proposal presentation on March 1
Project proposal guideline can be found on my teaching website
Background Review Lecture

Review of Last Lecture
ILP: Concepts and Challenges
Compiler techniques to increase ILP
  Loop Unrolling
Branch Prediction
  Static Branch Prediction
  Dynamic Branch Prediction
Overcoming Data Hazards with Dynamic Scheduling

Outline
Dynamic Scheduling
  Tomasulo Algorithm
Speculation
  Speculative Tomasulo Example
Memory Aliases
Exceptions
Increasing instruction bandwidth
Register Renaming vs. Reorder Buffer
Value Prediction

Notes:
The idea:
  DIV.D F0,F2,F4
  ADD.D F10,F0,F8
  SUB.D F12,F8,F14
In an in-order pipeline, SUB.D cannot execute because the dependence of ADD.D on DIV.D causes the pipeline to stall, even though SUB.D depends on neither instruction.
Instructions are issued in program order but execute out of order, which implies out-of-order completion.
Tracking instruction dependences allows an instruction to execute as soon as its operands are available; renaming registers avoids WAR and WAW hazards.
RAW hazards are avoided by executing an instruction only when its operands are available.
WAR and WAW hazards, which arise from name dependences, are eliminated by register renaming. A compiler cannot handle register renaming across branches.
Register renaming is provided by reservation stations, which buffer the operands of instructions waiting to issue. The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register. In addition, pending instructions designate the reservation station that will provide their input.
When successive writes to a register overlap in execution, only the last one is actually used to update the register.
As instructions are issued, the register specifiers for pending operands are renamed to the names of the reservation stations, which provides register renaming.
Because there can be more reservation stations than real registers, the technique can eliminate hazards arising from name dependences that a compiler could not eliminate.
The use of reservation stations, rather than a centralized register file, leads to two important properties. First, hazard detection and execution control are distributed: the information held in the reservation stations at each functional unit determines when an instruction can begin execution at that unit. Second, results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This is done with a common data bus (CDB), which allows all units waiting for an operand to be loaded simultaneously.

Advantages of Dynamic Scheduling
Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
It handles cases where dependences are unknown at compile time
It allows the processor to tolerate unpredictable delays such as cache misses, by executing other code while waiting for the miss to resolve
It allows code that was compiled for one pipeline to run efficiently on a different pipeline
It simplifies the compiler
Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

Notes:
The cache is a smaller, faster memory that stores copies of the data from the most frequently used main memory locations. A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. Dynamic scheduling reduces the cost of that delay by doing other work in the meantime.
Exception behavior example:
  DADDU R2,R3,R4
  BEQZ R2,L1
  LW R1,0(R2)
Switching BEQZ and LW could cause a memory protection exception.

HW Schemes: Instruction Parallelism
Key idea: Allow instructions behind a stall to proceed
  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14
Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
Will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
Note: Dynamic execution creates WAR and WAW hazards and makes exceptions harder

Notes:
Explain what WAR and WAW are: switching the order of execution causes the hazard.
In an in-order pipeline the stalled ADDD also stalls SUBD, yet SUBD is not data dependent on anything in the pipeline. Letting SUBD proceed means execution no longer keeps program order.
In the five-stage pipeline, both structural and data hazards could be checked in one stage. Out-of-order execution still needs to check for data hazards and structural hazards.
  DIV.D F0,F2,F4
  ADD.D F6,F0,F8
  SUB.D F8,F10,F14
  MUL.D F6,F10,F8
Here SUB.D writes F8, which ADD.D reads (WAR if SUB.D writes first), and MUL.D writes F6, which ADD.D also writes (WAW if MUL.D writes first).
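
The hazard types named in the notes can be read off any ordered pair of instructions by comparing destination and source registers. The following is a minimal Python sketch, not part of the original slides; `Instr`, `hazards`, and `all_hazards` are hypothetical names, and the pairwise check ignores intervening writes to the same register, which is sufficient for the four-instruction sequence above.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str

def hazards(earlier, later):
    """Return the hazard types between two instructions given in program order."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")   # later reads what earlier writes (true dependence)
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")   # later overwrites a register earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found

def all_hazards(program):
    """List (kind, earlier op, later op) for every ordered pair in the program."""
    result = []
    for i, earlier in enumerate(program):
        for later in program[i + 1:]:
            for kind in sorted(hazards(earlier, later)):
                result.append((kind, earlier.op, later.op))
    return result

PROGRAM = [
    Instr("DIV.D", "F0", "F2", "F4"),
    Instr("ADD.D", "F6", "F0", "F8"),
    Instr("SUB.D", "F8", "F10", "F14"),
    Instr("MUL.D", "F6", "F10", "F8"),
]

if __name__ == "__main__":
    for kind, first, second in all_hazards(PROGRAM):
        print(kind, first, "->", second)
```

For this sequence the sketch reports RAW between DIV.D and ADD.D (F0), WAR between ADD.D and SUB.D (F8), WAW between ADD.D and MUL.D (F6), and RAW between SUB.D and MUL.D (F8).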

Dynamic Scheduling Step 1
Simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  Issue: Decode instructions, check for structural hazards
  Read operands: Wait until no data hazards, then read operands

A Dynamic Algorithm: Tomasulo’s
For IBM 360/91 (before caches!)
  => Long memory latency
Goal: High performance without special compilers
Small number of floating point registers (4 in 360) prevented interesting compiler scheduling of operations
  This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
Why study a 1966 computer?
The descendants of this have flourished!
  Alpha 21264, Pentium 4, AMD Opteron, Power 5, …

Tomasulo Algorithm
Control & buffers distributed with Function Units (FU)
  FU buffers called “reservation stations”; have pending operands
Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming
  Renaming avoids WAR, WAW hazards
  More reservation stations than registers, so can do optimizations compilers can’t
Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs
  Avoids RAW hazards by executing an instruction only when its operands are available
Loads and Stores treated as FUs with RSs as well

Notes:
The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo at IBM. It allows sequential instructions that would normally be stalled due to certain dependences to execute non-sequentially (out-of-order execution).
The basic idea is that a reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from a register.
Register renaming is provided by the reservation stations, which work better than a centralized register file: hazard detection and execution control are distributed, and results are passed directly to functional units from the reservation stations where they are buffered, rather than going through the registers. This is done by the CDB.
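
To make the renaming step concrete, here is a small Python sketch (an illustration, not from the slides). It assumes a single pool of stations named RS1..RS4 that are never freed, which is a simplification: a real design has separate pools per functional unit. `rename` and `PROGRAM` are hypothetical names. A source register is replaced by the tag of the station that will produce it; if no write is pending, the register name stands for a value read at issue time.

```python
def rename(program, stations):
    """Rename (op, dest, src1, src2) tuples using reservation-station tags.

    Returns the renamed instructions and the final register-status table.
    """
    reg_status = {}          # register -> tag of the station that will write it
    free = iter(stations)    # assumption: enough stations, handed out in order
    renamed = []
    for op, dest, src1, src2 in program:
        tag = next(free)
        # A pending producer turns the source into a tag; otherwise the
        # current register value is copied into the station at issue.
        s1 = reg_status.get(src1, src1)
        s2 = reg_status.get(src2, src2)
        # Only the most recent writer of a register stays in the table,
        # which is what removes WAW hazards.
        reg_status[dest] = tag
        renamed.append((tag, op, s1, s2))
    return renamed, reg_status

PROGRAM = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),
    ("SUB.D", "F8", "F10", "F14"),
    ("MUL.D", "F6", "F10", "F8"),
]

if __name__ == "__main__":
    renamed, status = rename(PROGRAM, ["RS1", "RS2", "RS3", "RS4"])
    for entry in renamed:
        print(entry)
    print(status)
```

In the result, ADD.D waits on RS1 instead of F0 and has already captured the old F8, so SUB.D may overwrite F8 at any time (no WAR). MUL.D waits on RS3 for the new F8, and F6 is mapped only to RS4, so the ADD.D result never reaches the register file (no WAW).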

Tomasulo Organization
[Figure: Tomasulo organization. Labels: FP Op Queue; FP Registers; Load Buffers (Load1 to Load6), from memory; Store Buffers, to memory; Reservation Stations (Add1 to Add3, Mult1 and Mult2); FP adders; FP multipliers; Common Data Bus (CDB)]

Reservation Station Components
Op: Operation to perform in the unit (e.g., + or -)
Vj, Vk: Values of source operands
  Store buffers have a V field: the result to be stored
Qj, Qk: Reservation stations producing the source operands (values to be written)
  Store buffers only have Qi, for the RS producing the result
Busy: Indicates reservation station or FU is busy
Register result status: Indicates which functional unit will write each register, if one exists. Blank when there are no pending instructions that will write that register.
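
The fields above map naturally onto a small record. The sketch below is one possible Python rendering, offered as an illustration only: the class name, the use of `None` for an empty Q field, and the `ready`/`capture` helpers are assumptions, not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str                      # tag used on the CDB, e.g. "Add1"
    busy: bool = False
    op: Optional[str] = None       # operation to perform
    vj: Optional[float] = None     # value of first source operand
    vk: Optional[float] = None     # value of second source operand
    qj: Optional[str] = None       # station producing vj (None = value present)
    qk: Optional[str] = None       # station producing vk (None = value present)

    def ready(self) -> bool:
        """An instruction may begin execution once no operand is pending."""
        return self.busy and self.qj is None and self.qk is None

    def capture(self, tag: str, value: float) -> None:
        """Take a broadcast result if this station is waiting for that tag."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

if __name__ == "__main__":
    add1 = ReservationStation("Add1", busy=True, op="SUBD", vk=2.5, qj="Load2")
    print(add1.ready())            # waiting on Load2
    add1.capture("Load2", 6.0)
    print(add1.ready(), add1.vj)
```

Keeping either a value (V) or a tag (Q) per operand, never both, is what lets each station decide locally when it can execute.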

Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
   If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. Execute: operate on operands (EX)
   When both operands ready then execute; if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting units; mark reservation station available
Common data bus: data + source (“come from” bus)
   64 bits of data + 4 bits of Functional Unit source address
   Write if matches expected Functional Unit (produces result)
   Does the broadcast
Example speed: 2 clocks for +,-; 10 for *; 40 clks for /
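
The three stages can be turned into a rough timing model. The Python sketch below is an illustration under stated assumptions, not the slides' exact cycle accounting: one instruction issues per cycle in program order; execution starts the cycle after both operands are available; the result is written the cycle after execution ends; CDB contention and loads/stores are ignored; a station becomes reusable the cycle after its write. `schedule`, `LATENCY`, and `UNIT` are hypothetical names; the latencies are the example speeds given above.

```python
LATENCY = {"ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}   # clocks per operation
UNIT = {"ADDD": "Add", "SUBD": "Add", "MULTD": "Mult", "DIVD": "Mult"}

def schedule(program, stations=None):
    """Return (op, issue, exec_done, write) cycles for each instruction."""
    stations = stations or {"Add": 3, "Mult": 2}
    # First cycle in which each reservation station is free.
    free_at = {unit: [1] * count for unit, count in stations.items()}
    reg_ready = {}     # register -> cycle its pending result is broadcast
    timeline = []
    last_issue = 0
    for op, dest, src1, src2 in program:
        pool = free_at[UNIT[op]]
        slot = min(range(len(pool)), key=pool.__getitem__)
        # 1. Issue: in order, and only when a station is free.
        issue = max(last_issue + 1, pool[slot])
        # 2. Execute: wait until both operands have been broadcast.
        operands = max(issue, reg_ready.get(src1, 0), reg_ready.get(src2, 0))
        exec_done = operands + LATENCY[op]
        # 3. Write result: broadcast on the CDB and free the station.
        write = exec_done + 1
        reg_ready[dest] = write
        pool[slot] = write + 1
        last_issue = issue
        timeline.append((op, issue, exec_done, write))
    return timeline

PROGRAM = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F10", "F0", "F8"),
    ("SUBD", "F12", "F8", "F14"),
]

if __name__ == "__main__":
    for row in schedule(PROGRAM):
        print(row)
```

Under these assumptions SUBD issues third but writes its result in cycle 6, while DIVD writes in cycle 42 and the dependent ADDD in cycle 45: in-order issue, out-of-order completion.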

Tomasulo Example
[Figure: example status tables, annotated with: instruction stream; 3 load buffers; FU count down; 3 FP adder R.S.; 2 FP mult R.S.; clock cycle counter]

Tomasulo Example Cycle 1

Note: Can have multiple loads outstanding

Note: register names are removed (“renamed”) in Reservation Stations; MULT issued
Load1 completing; what is waiting for Load1?

Load2 completing; what is waiting for Load2?

Timer starts counting down for Add1, Mult1

Notes:
There are three situations in which a data hazard can occur:
  read after write (RAW), a true dependency
  write after read (WAR)
  write after write (WAW)
Consider two instructions i and j, with i occurring before j in program order.

Read After Write (RAW): j tries to read a source before i writes to it
A RAW data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved. This can occur because, even though an instruction is executed after a previous instruction, the previous instruction has not been completely processed through the pipeline.
Example:
  i1. R2 <- R1 + R3
  i2. R4 <- R2 + R3
The first instruction calculates a value to be saved in register 2, and the second uses this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the second operation, the result of the first will not yet have been saved, and hence we have a data dependency: instruction 2 depends on the completion of instruction 1.

Write After Read (WAR): j tries to write a destination before it is read by i
A WAR data hazard represents a problem with concurrent execution.
  i1. R4 <- R1 + R3
  i2. R3 <- R1 + R2
If there is a chance that i2 may complete before i1 (i.e., with concurrent execution), we must ensure that the result is not stored into register 3 before i1 has had a chance to fetch its operands.

Write After Write (WAW): j tries to write an operand before it is written by i
A WAW data hazard may occur in a concurrent execution environment.
  i1. R2 <- R1 + R2
  i2. R2 <- R4 + R7
We must delay the WB (Write Back) of i2 until i1 has finished executing.

Issue ADDD here despite name dependency on F6?

Add1 (SUBD) completing; what is waiting for it?

Add2 (ADDD) completing; what is waiting for it?

Write result of ADDD here? All quick instructions complete in this cycle!

Mult1 (MULTD) completing; what is waiting for it?

Just waiting for Mult2 (DIVD) to complete

Faster than light computation (skip a couple of cycles)

Mult2 (DIVD) is completing; what is waiting for it?

Once again: In-order issue, out-of-order execution and out-of-order completion.

Why can Tomasulo overlap iterations of loops?
Reservation stations: renaming to larger set of registers + buffering source operands
  Prevents registers from becoming a bottleneck
  Avoids WAR hazards (by buffering old values of registers) and avoids WAW hazards
  Allows loop unrolling in HW: “dynamic loop unrolling”
  (Register renaming: multiple iterations use different physical destinations for registers)
Not limited to basic blocks
Other perspective: Tomasulo builds the data flow dependency graph on the fly

Tomasulo’s scheme offers 2 major advantages
1. Distribution of the hazard detection logic
   Distributed reservation stations and the CDB
   If multiple instructions are waiting on a single result, & each instruction has its other operand, then the instructions can be released simultaneously by broadcast on the CDB
   If a centralized register file were used, the units would have to read their results from the registers when register buses are available
2. Elimination of stalls for WAW and WAR hazards

Tomasulo Drawbacks
Complexity
  Delays of 360/91, MIPS 10000, Alpha 21264, IBM PPC 620 in CA:AQA 2/e, but not in silicon!
Performance limited by Common Data Bus
  Each CDB must go to multiple functional units => high capacitance, high wiring density
  Number of functional units that can complete per cycle limited to one!
    Multiple CDBs => more FU logic for parallel assoc stores
Non-precise interrupts!
  We will address this later
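
The one-completion-per-cycle limit can be illustrated with a toy bus arbiter. This Python sketch is an assumption-laden illustration, not a description of the 360/91: `arbitrate` is a hypothetical helper, a result may be broadcast no earlier than the cycle after execution finishes, and ties are broken by earliest finish and then by tag name (an arbitrary policy).

```python
def arbitrate(finish_cycles, n_buses=1):
    """Assign a broadcast cycle to each finished unit.

    finish_cycles maps a station tag to the cycle in which it finishes
    executing; the return value maps each tag to its CDB write cycle.
    """
    used = {}     # cycle -> number of buses already claimed in that cycle
    grant = {}
    order = sorted(finish_cycles.items(), key=lambda item: (item[1], item[0]))
    for tag, done in order:
        cycle = done + 1
        # Wait for the first cycle that still has a free bus.
        while used.get(cycle, 0) >= n_buses:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        grant[tag] = cycle
    return grant

if __name__ == "__main__":
    finished = {"Add1": 5, "Add2": 5, "Mult1": 6}
    print(arbitrate(finished))             # single CDB
    print(arbitrate(finished, n_buses=2))  # two CDBs
```

With a single bus, Add2 and Mult1 each slip by one or more cycles behind Add1; with two buses both adders write in cycle 6 and Mult1 in cycle 7, at the cost of every station having to compare against two tags per cycle.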

Next Topics
  Ch2. Dynamic Scheduling
Things To Do
  Project proposal is due Tuesday, March 1
  5-10 min project proposal presentation on March 1
  Check out the class website for lecture notes and reading assignments
