1 Chapter 3 (and Appendix C): Instruction-Level Parallelism and Its Exploitation (Cont.)
Computer Architecture: A Quantitative Approach, Fifth Edition
Copyright © 2012, Elsevier Inc. All rights reserved.
2 Dynamic Scheduling
- Rearrange the order of instructions to reduce stalls while maintaining data flow
- Instructions initiate and/or complete out of order
- Advantages:
  - Compiler doesn’t need to have knowledge of the microarchitecture
  - Handles cases where dependencies are complex or unknown at compile time
- Disadvantages:
  - Substantial increase in hardware complexity
  - New types of data hazards
3 Name Dependencies
- A name dependence occurs when two instructions use the same register name but no data flows between them
- Not an issue if instructions are executed in their original order
- Antidependence: instruction i must read register R before a later instruction j writes R, so that i reads the correct data
- Output dependence: instruction i must write register R before a later instruction j writes R, so that R ends up holding the correct (final) value
4 Data Hazards
- RAW (read after write)
  - An instruction tries to read data before a previous instruction writes it (i.e., the data is not yet ready)
  - Solution: stall the pipeline until the data is ready, or forward the data
- WAR (write after read)
  - Due to reordering, an instruction reads incorrect data because a later instruction has already written to the register (antidependence)
  - Solution: maintain original order or rename registers
- WAW (write after write)
  - Due to reordering, an instruction leaves incorrect data in a register because a later instruction has already written to the register (output dependence)
  - Solution: maintain original order or rename registers
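The three hazard definitions can be checked mechanically for a pair of instructions. The following is a minimal Python sketch, not part of the original slides: the `Instr` record and the `hazards` function are hypothetical names, and memory operands are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

@dataclass(frozen=True)
class Instr:
    dest: Optional[str]        # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]      # registers read

def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Hazards that can arise if `later` is allowed to overtake `earlier`."""
    found = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        found.add("RAW")       # later reads what earlier writes (true dependence)
    if later.dest is not None and later.dest in earlier.srcs:
        found.add("WAR")       # later overwrites what earlier still has to read
    if earlier.dest is not None and earlier.dest == later.dest:
        found.add("WAW")       # both write the same register
    return found

# Register operands taken from the deck's register renaming example.
div = Instr("F0", ("F2", "F4"))     # DIV.D F0,F2,F4
add = Instr("F6", ("F0", "F8"))     # ADD.D F6,F0,F8
sub = Instr("F8", ("F10", "F14"))   # SUB.D F8,F10,F14
mul = Instr("F6", ("F10", "F8"))    # MUL.D F6,F10,F8

print(hazards(div, add), hazards(add, sub), hazards(add, mul))
```

Only the RAW case reflects a real flow of data; the WAR and WAW cases come from reusing a register name, which is why renaming can remove them.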
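5 Register Renaming
Example:
    DIV.D  F0,F2,F4
    ADD.D  F6,F0,F8
    S.D    F6,0(R1)
    SUB.D  F8,F10,F14
    MUL.D  F6,F10,F8
- Name dependences on F6 and F8:
  - Antidependence on F8: ADD.D reads F8 before SUB.D writes it
  - Antidependence on F6: S.D reads F6 before MUL.D writes it
  - Output dependence on F6: ADD.D and MUL.D both write F6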
6 Register Renaming
Example, with S and T as new temporary register names:
    DIV.D  F0,F2,F4
    ADD.D  S,F0,F8
    S.D    S,0(R1)
    SUB.D  T,F10,F14
    MUL.D  F6,F10,T
- Now only RAW hazards remain, which can be strictly ordered
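The renaming step can be sketched as a single pass over the instruction sequence. This is an illustrative Python sketch under stated assumptions, not the slides' method: it gives every destination a fresh name (`P0`, `P1`, ... are hypothetical), whereas the slide renames only the two conflicting destinations to S and T.

```python
from itertools import count

def rename(program):
    """program: list of (op, dest, srcs); dest is None for a store.
    Returns the program with each destination mapped to a fresh name."""
    table = {}                                   # architectural register -> current name
    fresh = (f"P{i}" for i in count())           # hypothetical pool of new names
    out = []
    for op, dest, srcs in program:
        new_srcs = tuple(table.get(s, s) for s in srcs)   # read the latest mapping
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)               # never reuse a name: no WAR/WAW possible
            table[dest] = new_dest
        out.append((op, new_dest, new_srcs))
    return out

program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
for row in rename(program):
    print(row)
```

After the pass, ADD.D still reads the result of DIV.D and MUL.D still reads the result of SUB.D, so the true (RAW) dependences are preserved while no two instructions share a destination name.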
7 Tomasulo’s Approach
Key components:
- Register renaming
  - Allows multiple copies of register contents
  - Data is buffered per instruction instead of per register
  - Eliminates WAR hazards
- Register status
  - Tracks which instruction will write each register, to enforce write order
  - Eliminates WAW hazards
- Common Data Bus (CDB)
  - Broadcast medium for distribution of results
  - Data is forwarded as soon as it's ready
  - No need to wait for registers
8 Reservation Stations
- Register renaming is provided by reservation stations (RS)
- Each RS contains:
  - The instruction (Op)
  - Buffered operand values (Vj, Vk), when available
  - Reservation station numbers of the instructions that will provide the operand values (Qj, Qk)
- As instructions are issued, operand values already in registers are buffered
- If an operand value is not yet in a register, find and store the number of the RS that contains the source instruction
  - Listen for the needed operand value on the CDB
  - Resolves RAW hazards: the instruction waits until its operands arrive
- There may be more reservation stations than registers
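A reservation station entry and its CDB listening behavior can be sketched as a small record. This is a hedged Python illustration: the field names follow the slide (Op, Vj, Vk, Qj, Qk), while the `capture` and `ready` methods and the tag strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # tag of the RS that will produce the operand
    qk: Optional[str] = None

    def capture(self, tag: str, value: float) -> None:
        """Snoop one CDB broadcast (tag, value) and grab matching operands."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

    def ready(self) -> bool:
        """Execution may begin once no operand is still pending."""
        return self.qj is None and self.qk is None

# A multiply whose first operand is still being loaded by the station tagged "Load2".
rs = ReservationStation("MUL", vk=2.5, qj="Load2")
rs.capture("Load1", 1.0)   # unrelated broadcast: ignored
rs.capture("Load2", 4.0)   # matching tag: value captured, Qj cleared
print(rs.ready(), rs.vj)
```

Because the station waits on a tag and not on a register name, a later write to the same architectural register cannot disturb the pending operand.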
9 Tomasulo’s Algorithm
- Load and store buffers
  - Contain data and addresses; act like reservation stations
- Top-level design: [figure]
10 Tomasulo’s Algorithm
Three steps:
- Issue
  - Get the next instruction from the FIFO queue
  - If an RS is available, issue the instruction to it
  - Stall if no RS is available
- Execute
  - When all operands are ready, begin execution
  - Must also wait until all preceding branches have completed (no branch prediction)
- Write result
  - Write the result on the CDB into reservation stations, store buffers, and registers
11 Tomasulo Example
- Example uses an FP instruction sequence in which instructions require multiple execution cycles
- HW assumptions:
  - One dedicated integer addressing unit
  - Three memory loaders
  - Three DP adders (also handle subtraction)
  - Two DP multipliers (also handle division)
- Latency assumptions:
  - Memory operation takes 2 cycles
  - FP add takes 2 cycles
  - FP multiply takes 10 cycles
  - FP divide takes 40 cycles
12-31 Tomasulo Example: cycle-by-cycle status tables [figures], one slide per cycle for cycles 0-16 (slides 12-28) and cycles 55-57 (slides 29-31). Slide notes:
- Cycle 3: register names are removed (“renamed”) in the reservation stations; MULT issued; Load1 completing. What is waiting for Load1?
- Cycle 4: Load2 completing. What is waiting for it?
- Cycle 6: issue ADDD here
- Cycle 7: Add1 (SUBD) completing. What is waiting for it?
- Cycle 10: Add2 completing. What is waiting for it?
- Cycle 11: write result of ADDD here
- Cycle 12: all quick instructions have already completed
- Cycle 15: Mult1 completing. What is waiting for it?
- Cycle 16: just waiting for the divide
- Cycle 56: Mult2 completing. What is waiting for it?
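The timing in the example can be approximated with a short Python sketch of the issue, execute, and write-result steps. Several things here are assumptions and not taken from the slides: the six-instruction sequence (the status tables are not reproduced in this text, so the commonly used sequence for this example is assumed), the function name `schedule`, and the simplifications that a free reservation station and functional unit are always available and that the CDB never conflicts. The latencies are the ones stated in the example's assumptions.

```python
# Execute-step latencies from the example's assumptions (in cycles).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

def schedule(program):
    """program: list of (op, dest, srcs). Returns [(issue, exec_done, write), ...]."""
    producer = {}   # register status: register -> index of its latest writer
    rows = []
    for i, (op, dest, srcs) in enumerate(program):
        issue = i + 1                        # one instruction issued per cycle, in order
        ready = issue
        for s in srcs:
            if s in producer:                # operand arrives with the producer's CDB write
                ready = max(ready, rows[producer[s]][2])
        exec_done = ready + LATENCY[op]      # execution starts the cycle after operands are ready
        write = exec_done + 1                # result is broadcast on the CDB one cycle later
        rows.append((issue, exec_done, write))
        producer[dest] = i
    return rows

# Assumed instruction sequence (register operands only; addresses omitted).
PROGRAM = [
    ("LD",    "F6",  ("R2",)),
    ("LD",    "F2",  ("R3",)),
    ("MULTD", "F0",  ("F2", "F4")),
    ("SUBD",  "F8",  ("F6", "F2")),
    ("DIVD",  "F10", ("F0", "F6")),
    ("ADDD",  "F6",  ("F8", "F2")),
]
for ins, row in zip(PROGRAM, schedule(PROGRAM)):
    print(ins[0], row)
```

Under these assumptions the sketch has the multiply finishing execution in cycle 15 and the divide in cycle 56, which agrees with the slide notes for those cycles. Note that ADDD writes its result long before the earlier DIVD does: instructions complete out of order.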
32 Tomasulo’s Algorithm Summary
- Reservation stations: renaming to a larger set of registers + buffering of source operands
  - Prevents registers from becoming a bottleneck
  - Avoids WAR and WAW hazards
  - Allows loop unrolling in HW
- Not limited to basic blocks
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming
33 Hardware-Based Speculation
- Branch prediction
  - Predict the branch outcome to allow fetching and decoding of subsequent instructions
- Branch speculation
  - Predict the branch outcome to allow fetching, decoding, and execution of subsequent instructions
- Relatively easy in a simple pipeline
  - Just turn the instructions into no-ops before they complete if the prediction was incorrect
- More difficult in out-of-order processors
  - Requires treating speculative instructions as a transaction
34 Hardware-Based Speculation
- Execute instructions along predicted execution paths, but only commit the results if the prediction was correct
- Add an instruction commit phase to Tomasulo’s design
  - Only allow an instruction to update the register file when the instruction is no longer speculative
- Need an additional piece of hardware to buffer instructions until they commit
35 Reorder Buffer
- Reorder buffer (ROB): an ordered buffer that holds the result of an instruction between completion and commit
- Processes instructions in issue order as they complete
- Four fields:
  - Instruction type: branch/store/register
  - Destination field: register number
  - Value field: output value
  - Ready field: completed execution?
- Modify reservation stations:
  - Get operand values from the ROB instead of from other RSs
  - Store the ROB identifier for the instruction
36 Reorder Buffer
- When an instruction reaches the head of the ROB, handle it based on its type
- Register/store types
  - Commit the instruction by writing its value to its destination (the register file, or memory for a store) and removing it from the ROB
- Branch type
  - If the prediction was correct, commit normally
  - If the prediction was incorrect, flush the ROB
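In-order commit from the head of the ROB can be sketched in a few lines of Python. This is an illustration under assumptions, not the slides' design: `Entry`, `commit`, and the `mispredicted` flag are hypothetical names, the register file and memory are plain dictionaries, and commit is assumed to update architectural state (register file or memory) directly.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    kind: str                     # "register", "store", or "branch"
    dest: Optional[str] = None    # register name, or address for a store
    value: Optional[int] = None
    ready: bool = False           # has the instruction completed execution?
    mispredicted: bool = False    # only meaningful for branches

def commit(rob: deque, regs: dict, mem: dict) -> None:
    """Retire entries from the head of the ROB, in order, while they are ready."""
    while rob and rob[0].ready:
        head = rob.popleft()
        if head.kind == "register":
            regs[head.dest] = head.value
        elif head.kind == "store":
            mem[head.dest] = head.value
        elif head.kind == "branch" and head.mispredicted:
            rob.clear()           # squash every younger (speculative) instruction

regs, mem = {}, {}
rob = deque([
    Entry("register", "F0", 1, ready=True),
    Entry("branch", ready=True, mispredicted=True),
    Entry("register", "F2", 9, ready=True),   # speculative: must never reach regs
])
commit(rob, regs, mem)
print(regs, len(rob))
```

The instruction after the mispredicted branch had already computed a value, but because it never reached the head of the ROB its result is discarded without touching the register file.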
37 Reorder Buffer [figure]
38 Extending Speculation
- Branch-Target Buffer (BTB)
  - Stores the target PC with each prediction
  - Allows the predicted next instruction to be fetched immediately following the prediction
- Return-Address Predictor
  - Predicts the next PC after a function return
  - Functions may be called from many different addresses, so organize the buffer as a stack
  - The buffer then imitates the call stack!
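The stack organization of the return-address predictor can be sketched as follows. This is a minimal Python illustration: the class name, the default depth of 8, and the policy of dropping the oldest entry on overflow are assumptions, not details from the slides.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return PCs."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # full: lose the oldest return address
        self.stack.append(return_pc)

    def predict_return(self):
        """Predicted PC for the next return, or None if nothing is recorded."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x104)     # outer call: return to 0x104
ras.on_call(0x204)     # nested call: return to 0x204
print(hex(ras.predict_return()), hex(ras.predict_return()))
```

Returns are predicted in the reverse order of the calls, which is exactly how the program's own call stack unwinds. Calls nested deeper than the buffer depth lose their oldest entries and those returns are mispredicted.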
39 Multiple Issue and Static Scheduling
- To achieve CPI < 1, need to issue multiple instructions per clock cycle
- Solutions:
  - Statically scheduled superscalar processors: in-order execution
  - VLIW (Very Long Instruction Word) processors: in-order execution
  - Dynamically scheduled superscalar processors: out-of-order execution
40 Multiple Issue [figure]
41 Multiple Issue, Static Scheduling
- Modern energy-efficient microarchitectures: static scheduling + multiple issue
- Compiler performs scheduling
- Processor issues instructions in order
  - Up to a fixed number of instructions can be issued simultaneously if there are no interdependencies
  - Always issue at least one instruction
- Issue logic is relatively simple
  - Just need to detect interdependencies
42 Multiple Issue, Dynamic Scheduling
- Modern high-performance microarchitectures: dynamic scheduling + multiple issue + speculation
- Two approaches:
  - Assign reservation stations and update the pipeline control table in half clock cycles
    - Only supports 2 instructions/clock
  - Design logic to handle any possible dependencies between the instructions
- Hybrid approaches
43 Multiple Issue, Dynamic Scheduling
- Limit the number of instructions of a given class that can be issued in a “bundle”
  - E.g., one FP, one integer, one load, one store
  - Usually one per reservation station
- Examine all the interdependencies among the instructions in the bundle
- If interdependencies exist in the bundle, encode them in the reservation stations
- Issue logic is a major bottleneck
  - Grows quadratically with the number of instructions!
- Also need multiple completion/commit
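The quadratic growth of the issue logic can be made concrete by counting register comparisons within a bundle. The Python sketch below is an illustration with assumptions: `intra_bundle_deps` is a hypothetical name, each source register of an instruction is compared against the destination of every earlier instruction in the same bundle, and real hardware would perform these comparisons in parallel.

```python
def intra_bundle_deps(bundle):
    """bundle: list of (dest, srcs) in program order.
    Returns (dependences, comparisons); a dependence is (producer, consumer, register)."""
    deps, comparisons = [], 0
    for j, (_, srcs) in enumerate(bundle):
        for s in srcs:
            producer = None
            for i in range(j):             # compare against every earlier destination
                comparisons += 1
                if bundle[i][0] == s:
                    producer = i           # keep the nearest earlier writer
            if producer is not None:
                deps.append((producer, j, s))
    return deps, comparisons

# First four instructions of the loop used in the deck's example (register operands only).
bundle = [
    ("R2", ("R1",)),        # LD     R2,0(R1)
    ("R2", ("R2",)),        # DADDIU R2,R2,#1
    (None, ("R2", "R1")),   # SD     R2,0(R1)
    ("R1", ("R1",)),        # DADDIU R1,R1,#8
]
print(intra_bundle_deps(bundle))

# With two source registers per instruction, comparisons grow as n*(n-1).
for n in (2, 4, 8):
    print(n, intra_bundle_deps([("X", ("A", "B"))] * n)[1])
```

Doubling the bundle width roughly quadruples the comparison count, which is the quadratic growth the slide warns about.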
44 Multiple Issue, Dynamic Scheduling [figure]
45 Example
    Loop: LD     R2,0(R1)    ; R2 = array element
          DADDIU R2,R2,#1    ; increment R2
          SD     R2,0(R1)    ; store result
          DADDIU R1,R1,#8    ; increment pointer
          BNE    R2,R3,Loop  ; branch if not last element
46 Example (No Speculation) [figure]
47 Example [figure]
48 Limits of ILP
- Get an idea of ILP limits by finding the available parallelism in SPEC benchmarks
- Assume an ideal architecture:
  - Infinite register renaming: no register pressure
  - Perfect branch prediction: always know the branch outcome immediately
  - Perfect caches: no cache misses of any kind
  - Infinite window size: issue logic can examine the entire program
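Under these ideal assumptions only true (RAW) dependences limit parallelism, so the available ILP of a trace is the instruction count divided by the length of its longest dependence chain. The Python sketch below illustrates that idea on a toy trace. It is a simplification and not the methodology behind the benchmark measurements: the function name is hypothetical, every instruction is assumed to take one cycle, and memory dependences are ignored.

```python
def dataflow_ilp(trace):
    """trace: list of (dest, srcs) in program order.
    Returns instructions / critical-path length for an ideal machine."""
    finish = {}     # register -> cycle in which its latest value becomes available
    depth = 0
    for dest, srcs in trace:
        # An instruction can run one cycle after its last operand is produced.
        cycle = 1 + max((finish.get(s, 0) for s in srcs), default=0)
        if dest is not None:
            finish[dest] = cycle    # renaming: only the latest writer matters to later readers
        depth = max(depth, cycle)
    return len(trace) / depth if trace else 0.0

# Two independent three-instruction chains: 6 instructions, critical path of 3.
two_chains = [
    ("a", ()), ("b", ("a",)), ("c", ("b",)),
    ("x", ()), ("y", ("x",)), ("z", ("y",)),
]
print(dataflow_ilp(two_chains))
```

A fully serial chain yields an ILP of 1 no matter how wide the machine is, which is why even an ideal processor is bounded by the program's data flow.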
49 Limits of ILP [figure]
50 Limits of ILP
- Now assume a more realistic architecture:
  - Up to 64 issues per clock cycle (more than 10x that of current processors)
  - Tournament branch predictor with 1K entries (reasonable)
  - Register renaming with 64 integer and 64 FP registers in reservation stations, and 128 reorder-buffer entries (reasonable)
- Vary the window size
51 Limits of ILP [figure]
52 A8 Processor
- Dual issue, statically scheduled
  - In-order issue
  - In-order execution
  - Up to two instructions per cycle
- One core, no FP
- Two-level cache hierarchy
- 1 GHz clock rate
- 2W power design
- ARM ISA (RISC)
53 A8 Pipeline [figure]
54 A8 Pipeline: Decode [figure]
55 A8 Pipeline: Execute [figure]
56 A8 CPI [figure]
57 A9 vs. A8 [figure]
58 i7 920 Processor
- Multiple issue, dynamically scheduled
  - In-order issue
  - Out-of-order execution
  - Up to four instructions per cycle (plus fusion)
- Four cores, each with FP
- Three-level cache hierarchy
- 2.66 GHz clock rate
- 130W power design
- x86-64 ISA (CISC)
  - 1-17 byte instructions are decoded into RISC microinstructions
59 i7 Pipeline [figure]
60 i7 CPI [figure]
61 Atom 230 Processor
- Multiple issue, dynamically scheduled
  - In-order issue
  - In-order execution
  - Up to two instructions per cycle
- One core (dual core available), with FP
- Two-level cache hierarchy
- 1.66 GHz clock rate
- 4W power design
- x86-64 ISA (CISC)
  - Decodes to RISC microinstructions
62 i7 vs. Atom [figure]
63 Fallacies
- Processors with lower CPIs will always be faster
- Processors with higher clock rates will always be faster
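Both fallacies follow from the standard relation CPU time = instruction count × CPI / clock rate: neither CPI nor clock rate determines performance on its own. The Python sketch below uses made-up numbers for three hypothetical processors (A, B, and C are not measurements of any real chip) running the same one-billion-instruction program.

```python
def cpu_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Execution time in seconds: instruction count x CPI / clock rate."""
    return instructions * cpi / clock_hz

N = 1e9                               # hypothetical instruction count
time_a = cpu_time(N, 1.0, 1e9)        # A: lowest CPI, lowest clock rate
time_b = cpu_time(N, 1.5, 2e9)        # B: higher CPI than A, yet faster
time_c = cpu_time(N, 4.0, 3e9)        # C: highest clock rate, yet slowest

print(time_a, time_b, time_c)
```

In this made-up comparison B beats A despite its higher CPI, and C loses to both despite its higher clock rate. Across different ISAs the instruction count also differs, which adds a third variable to the comparison.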