CSCE 614 Fall 2009

Slide 1: Hardware-Based Speculation
- As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing burden.
- => Hardware speculation: speculate on the outcome of branches and execute the program as if the guesses were correct.

Slide 2: 3 Key Ideas of Hardware Speculation
- Dynamic branch prediction: choose which instruction to execute.
- Speculation: allow the execution of instructions before the control dependences are resolved (with the ability to undo the effect of an incorrectly speculated sequence).
- Dynamic scheduling: deal with the scheduling of different combinations of basic blocks.

Slide 3: Examples
- PowerPC 603/604/G3/G4
- MIPS R10000/12000
- Intel Pentium II/III/4
- Alpha 21264
- AMD K5/K6/Athlon

Slide 4: Hardware Speculation
- Extends Tomasulo’s algorithm.
- Requires an additional step: instruction commit.
- Instructions are allowed to execute out of order but are forced to commit in order.
- Any irrevocable action (updating state or taking an exception) is prevented until an instruction commits.

Slide 5: Reorder Buffer (ROB)
- Holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits.
- Serves as a source of operands for later instructions.
- With speculation, the register file (or memory) is not updated until the instruction commits.

Slide 6: ROB Fields
- Instruction type: indicates whether the instruction is a branch, a store, or a register operation (ALU or load).
- Destination: supplies the register number (for loads and ALU operations) or the memory address (for stores).
- Value: holds the value of the instruction result until the instruction commits.
- Ready: indicates that the instruction has completed execution, and the value is ready.
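The four fields above map naturally onto a small record type. The following is a minimal sketch, not the hardware layout: the names `InstrType` and `ROBEntry` and the Python types are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"  # ALU operation or load


@dataclass
class ROBEntry:
    instr_type: InstrType
    # Register name for loads/ALU operations, memory address for stores.
    destination: Optional[Union[str, int]] = None
    # Result value, held here until the instruction commits.
    value: Optional[float] = None
    # True once execution has completed and the value is valid.
    ready: bool = False


# A load to F6 that has been issued but has not produced its value yet.
load_entry = ROBEntry(InstrType.REGISTER, destination="F6")

# Later the load result arrives and the entry becomes eligible to commit.
load_entry.value = 3.5
load_entry.ready = True
```

A branch entry needs no destination or value, which is why both fields default to `None` in this sketch.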

Slide 7: Hardware Speculation
[Figure: the steps Issue, Execute, Write result (to ROB), and Commit (write to RF, MEM), shown with the reservation stations and the reorder buffer (ROB).]

Slide 8: Basic Structure of MIPS FP Unit
- The ROB completely replaces the store buffer.
- The renaming function of the reservation stations is replaced by the ROB.

Slide 9: 4 Steps of Execution
1. Issue (also called “dispatch”)
- Get an instruction from the instruction queue.
- Issue the instruction if there is an empty reservation station and an empty slot in the ROB.
- If either all reservation stations are full or the ROB is full, instruction issue is stalled.
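The issue rule above can be sketched as a single check. This is a simplified illustration under stated assumptions: the function name `try_issue` is hypothetical, reservation stations are modeled as one shared pool rather than per functional unit, and instructions are plain strings.

```python
from collections import deque


def try_issue(instr_queue, stations, num_stations, rob, rob_size):
    """Issue the next instruction, or return None if issue must stall."""
    if not instr_queue:
        return None
    # Stall if all reservation stations are full or the ROB is full.
    if len(stations) >= num_stations or len(rob) >= rob_size:
        return None
    instr = instr_queue.popleft()
    stations.append(instr)  # occupy a reservation station
    rob.append(instr)       # allocate the next ROB slot, in program order
    return instr


queue = deque(["L.D F0, 0(R1)", "MUL.D F4, F0, F2", "S.D F4, 0(R1)"])
stations = []
rob = deque()

# With only 2 reservation stations, the third issue attempt stalls.
issued = [try_issue(queue, stations, 2, rob, 4) for _ in range(3)]
```

In this run the store stays in the instruction queue until a reservation station frees up, even though the ROB still has room.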

Slide 10: 2. Execute
- If one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed.
- RAW hazards are checked in this step.
- When both operands are available at a reservation station, execute the operation.
- Loads require two steps (effective address calculation, then the memory access).
- Stores need one step (effective address calculation).

Slide 11: 3. Write Result
- When the result is available, write it on the CDB and from the CDB into the ROB, as well as to any reservation stations waiting for this result.
- For stores, if the value to be stored is available, it is written into the Value field of the ROB entry for the store.
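The broadcast in the write-result step can be sketched as follows. This is an illustration only: the function name `write_result`, the dictionary layout, and the field names `vj`, `vk`, `qj`, `qk` (operand values and the ROB tags they are waiting on) are assumptions, not taken from the slides.

```python
def write_result(tag, value, rob, stations):
    """Broadcast (tag, value) on the CDB: update the ROB and waiting stations."""
    # The ROB entry that produced the result records it and becomes ready.
    rob[tag]["value"] = value
    rob[tag]["ready"] = True
    # Every reservation station waiting on this tag captures the value.
    for rs in stations:
        for operand in ("j", "k"):
            if rs["q" + operand] == tag:
                rs["v" + operand] = value
                rs["q" + operand] = None  # operand is no longer pending


# ROB entry 1 is a load whose result is still pending.
rob = {1: {"value": None, "ready": False}}
# One station waits on ROB entry 1 for its first operand.
stations = [{"vj": None, "qj": 1, "vk": 2.0, "qk": None}]

write_result(1, 6.0, rob, stations)
```

After the call, the station holds both operand values and could begin execution in the next cycle.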

Slide 12: 4. Commit (also called “completion” or “graduation”)
- Normal commit: when an instruction reaches the head of the ROB and its result is present in the buffer, the processor updates the register with the result and removes the instruction from the ROB.
- Store commit: similar, except that memory is updated.
- Branch with an incorrect prediction: the speculation was wrong. The ROB is flushed and execution is restarted at the correct successor of the branch.
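The three commit cases above can be sketched in one function. This is a minimal model under stated assumptions: the name `commit_one`, the dictionary keys, and the returned status strings are hypothetical, and restarting fetch at the correct branch successor is left out.

```python
from collections import deque


def commit_one(rob, registers, memory):
    """Try to commit the instruction at the head of the ROB.

    Returns "committed", "flushed", or "waiting".
    """
    if not rob or not rob[0]["ready"]:
        return "waiting"  # commit is in order: only the head may commit
    head = rob.popleft()
    if head["type"] == "branch":
        if head.get("mispredicted", False):
            rob.clear()  # discard every speculated instruction after the branch
            return "flushed"
    elif head["type"] == "store":
        memory[head["dest"]] = head["value"]      # memory updated only now
    else:
        registers[head["dest"]] = head["value"]   # register updated only now
    return "committed"


rob = deque([
    {"type": "register", "dest": "F0", "value": 2.0, "ready": True},
    {"type": "branch", "ready": True, "mispredicted": True},
    {"type": "register", "dest": "F4", "value": 9.0, "ready": True},
])
registers, memory = {}, {}
results = [commit_one(rob, registers, memory) for _ in range(3)]
```

The instruction after the mispredicted branch had already computed a value for F4, but because it never reached commit, the register file is left untouched.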

Slide 13: Example
    L.D    F6, 34(R2)
    L.D    F2, 45(R3)
    MUL.D  F0, F2, F4
    SUB.D  F8, F6, F2
    DIV.D  F10, F0, F6
    ADD.D  F6, F8, F2
Status when the MUL.D is ready to commit.

Slide 14: Example (p.109)
    Loop: L.D     F0, 0(R1)
          MUL.D   F4, F0, F2
          S.D     F4, 0(R1)
          DADDIU  R1, R1, #-8
          BNE     R1, R2, Loop
- Assume that we have issued all the instructions in the loop twice.
- Assume that the L.D and MUL.D from the first iteration have committed and all other instructions have completed execution.
- Show the contents of the ROB and the FP registers.

Slide 15: Hardware Speculation
- Because neither the register values nor any memory values are actually written until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted.
- Exceptions are handled by not recognizing an exception until the instruction that raised it is ready to commit.

Slide 16: Hardware Speculation
Figure 2.17 (p.113)

Slide 17: Multiple-Issue Processors
- Allow multiple instructions to issue in a clock cycle.
- Ideal CPI < 1
- 3 flavors
  - Statically scheduled superscalar
  - Dynamically scheduled superscalar
  - VLIW (Very Long Instruction Word)

Slide 18: Superscalar Processors
- Issue varying numbers of instructions per clock.
  - Statically scheduled: using compiler techniques, in-order execution
  - Dynamically scheduled: Tomasulo’s algorithm, out-of-order execution

Slide 19: VLIW Processors
- Issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction (EPIC: Explicitly Parallel Instruction Computers).
- Statically scheduled by the compiler.

Slide 20:
Name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | dynamic | hardware | static | in-order execution | MIPS and ARM (embedded)
Superscalar (dynamic) | dynamic | hardware | dynamic | some out-of-order execution | None
Superscalar (speculative) | dynamic | hardware | dynamic with speculation | out-of-order execution with speculation | Pentium 4, MIPS R12K, Alpha 21264, IBM Power5
VLIW/LIW | static | primarily software | static | all hazards determined by compiler | TI C6x (embedded)
EPIC | mostly static | mostly software | mostly static | all hazards determined by compiler | Itanium

Slide 21: Multiple Instruction Issue with Dynamic Scheduling
- Two-issue dynamically scheduled processor
  - It can issue any pair of instructions if there are reservation stations of the right type available.
  - Extended Tomasulo’s algorithm
- Note that Tomasulo’s algorithm (and hardware speculation) is used for both integer operations and FP operations.

Slide 22: Two approaches to implementing two-instruction issue
- Issue one instruction in half a clock cycle, so that two instructions can be processed in one clock cycle.
- Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions.
- Modern superscalar processors that issue 4 or more instructions per clock often include both approaches.

Slide 23: How to Handle Branches?
- Dynamically scheduled processors (without speculation)
  - Instructions after a branch may be fetched and issued, but not actually executed, until the branch has completed.
  - Example: IBM 360/91
- Processors with hardware speculation
  - Can actually execute instructions based on branch prediction.

Slide 24:
- Note that we consider loads and stores, including those to FP registers, as integer operations.
- Assume that FP adds take 3 execution cycles.
[Table: latency between Execute and Write CDB]

Slide 25:
- The throughput improvement versus a single-issue pipeline is small.
  - There is only one FP operation per iteration.
  - There is only one integer ALU for both integer ALU operations and effective address calculations.
- Larger improvements would be possible if the processor could execute more integer operations per cycle.

Slide 26: Multiple Issue with Speculation
- We process multiple instructions per clock, assigning reservation stations and reorder buffer entries to the instructions.
- To maintain a throughput of more than one instruction per cycle, a speculative processor must be able to handle multiple instruction commits per clock cycle.
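The need for multiple commits per clock follows from simple counting: every instruction has to leave through commit, so sustained throughput cannot exceed the commit width. The sketch below illustrates this with a hypothetical helper, `cycles_to_drain`, assuming every ROB entry is already ready to commit.

```python
from collections import deque


def cycles_to_drain(num_instructions, commit_width):
    """Count cycles needed to commit a ROB full of ready instructions."""
    rob = deque(range(num_instructions))  # assumption: all entries are ready
    cycles = 0
    while rob:
        # Commit in order, at most commit_width instructions this cycle.
        for _ in range(min(commit_width, len(rob))):
            rob.popleft()
        cycles += 1
    return cycles


single = cycles_to_drain(12, commit_width=1)  # one commit per clock
dual = cycles_to_drain(12, commit_width=2)    # two commits per clock

ipc_single = 12 / single
ipc_dual = 12 / dual
```

With a commit width of 1, the machine is capped at one instruction per cycle no matter how many instructions it can issue.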

Slide 27: Example (p.119)
    Loop: LD      R2, 0(R1)
          DADDIU  R2, R2, #1
          SD      R2, 0(R1)
          DADDIU  R1, R1, #8
          BNE     R2, R3, Loop
- Consider the execution of the loop on a two-issue processor, once without speculation (dynamic scheduling/Tomasulo’s algorithm) and once with speculation.
- Assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation.
- Assume that there are 2 CDBs.
- Assume that up to two instructions of any type can commit per clock for a processor with speculation.
- Show the execution timing of the first three iterations of the loop.

Slide 28: High-Performance Instruction Delivery
- For multiple-issue processors (delivering 4 to 8 instructions per clock cycle)
  - Branch-target buffers
  - Integrated instruction fetch unit
  - Return address prediction

Slide 29: Branch-Target Buffers
- To reduce the branch penalty for the classic 5-stage pipeline, we want to know what address to fetch by the end of IF.
- Branch-target buffer: a branch-prediction cache that stores the predicted address for the next instruction after a branch.
- We access the buffer during the IF stage using the instruction address. (At that point we don’t yet know what the instruction is.)

Slide 30: Branch-Target Buffers
[Figure: branch-target cache. One field is marked optional; it may be used for extra prediction state bits.]

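Slide 31: Branch-Target Buffers
- We only need to store the predicted-taken branches in the branch-target buffer.
  - Why? A branch predicted not taken simply fetches the next sequential instruction, just as if it were not a branch.
- No branch delay if a branch-prediction entry is found and is correct.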
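The lookup described on the branch-target buffer slides can be sketched as follows. This is a simplified illustration: the class name `BranchTargetBuffer`, the 4-byte instruction size, and the unbounded dictionary (a real buffer has a fixed number of entries and tag matching) are all assumptions.

```python
class BranchTargetBuffer:
    """Maps the address of a predicted-taken branch to its predicted target."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted target PC

    def predict_next_pc(self, pc, instr_size=4):
        # Looked up during IF using only the instruction address.
        # A miss means "not a taken branch": fetch the sequential instruction.
        return self.entries.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        # Only predicted-taken branches are kept in the buffer.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first = btb.predict_next_pc(0x40)   # miss: fall through to 0x44
btb.update(0x40, taken=True, target=0x10)
second = btb.predict_next_pc(0x40)  # hit: predicted target 0x10
btb.update(0x40, taken=False, target=0x10)
third = btb.predict_next_pc(0x40)   # entry removed: fall through again
```

Because a miss already yields the fall-through address, storing not-taken branches would add entries without changing any prediction.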

Slide 33: Return Address Predictors
- Predicting indirect jumps (destination address varies at run time)
  - Procedure returns, procedure calls, case, select, etc.
  - SPEC89: 85% of indirect jumps are procedure returns.
- A small buffer of return addresses operating as a stack
  - Caches the most recent return addresses
  - Push a return address on the stack at a call
  - Pop one off at a return
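The push-at-call, pop-at-return behavior above can be sketched with a small bounded stack. The class name `ReturnAddressStack`, the default depth of 8, and the policy of dropping the oldest entry on overflow are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Small buffer of the most recent return addresses, used as a stack."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        # When the buffer is full, the oldest return address is lost.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(return_address)

    def on_return(self):
        # Predict the target of a return; None means no prediction available.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
ras.on_call(0x104)
ras.on_call(0x208)
ras.on_call(0x30C)  # overflows the 2-entry buffer: 0x104 is dropped

predictions = [ras.on_return() for _ in range(3)]
```

The third return gets no prediction here because the call depth exceeded the buffer size, which is the case where such a predictor loses accuracy.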

Slide 34: Integrated Instruction Fetch Units
- A separate autonomous unit that feeds instructions to the rest of the pipeline for multiple-issue processors.
- It has several functions:
  - Integrated branch prediction
  - Instruction prefetch
  - Instruction memory access and buffering

