1
Computer Architecture and Parallel Computing (体系结构与并行计算)
Lecture 5 – Complex Pipeline
Peng Liu, Dept. of Info. Sci. & Elec. Engg., Zhejiang University
May 23, 2011
2
Complex Pipelining: Motivation
Pipelining becomes complex when we want high performance in the presence of:
- Long-latency or partially pipelined floating-point units
- Memory systems with variable access time
- Multiple arithmetic and memory units
3
Floating-Point Unit (FPU)
An FPU requires much more hardware than an integer unit. A single-cycle FPU is a bad idea - why?
- It is common to have several FPUs
- It is common to have different types of FPUs: Fadd, Fmul, Fdiv, ...
- An FPU may be fully pipelined, partially pipelined, or not pipelined
- To operate several FPUs concurrently, the FP register file needs more read and write ports
4
Functional Unit Characteristics
[Figure: functional-unit timing, fully pipelined (1 cycle per issue) vs. partially pipelined (2 cycles per issue)]
- Functional units have internal pipeline registers
- Operands are latched when an instruction enters a functional unit
- Following instructions are able to write the register file during a long-latency operation
5
Floating-Point ISA
The interaction between the floating-point datapath and the integer datapath is determined largely by the ISA. The MIPS ISA has:
- Separate register files for FP and integer instructions; the only interaction is via a set of move instructions (some ISAs don't even permit this)
- Separate load/store for FPRs and GPRs, but both use GPRs for address calculation
- Separate conditions for branches; FP branches are defined in terms of condition codes
Why not have a transfer between FP and GPR? Why have it?
6
Realistic Memory Systems
Common approaches to improving memory performance:
- Caches: single cycle except in case of a miss => stall
- Interleaved memory: multiple memory accesses => bank conflicts
- Split-phase memory operations (separate the memory request from the response) => out-of-order responses
Latency of access to main memory is usually much greater than one cycle and often unpredictable; solving this problem is a central issue in computer architecture.
7
Issues in Complex Pipeline Control
- Structural conflicts at the execution stage if some FPU or memory unit is not pipelined and takes more than one cycle
- Structural conflicts at the write-back stage due to variable latencies of different functional units
- Out-of-order write hazards due to variable latencies of different functional units
- How to handle exceptions?
[Figure: complex pipeline with IF, ID/Issue, ALU, Mem, Fadd, Fmul, and Fdiv units, GPRs/FPRs, and WB]
8
Complex In-Order Pipeline
[Figure: complex in-order pipeline with PC, Inst. Mem, Decode, X1, X2/Data Mem, X3, and W stages; GPRs/FPRs; FAdd, FMul, and an unpipelined FDiv; commit point at W]
- Delay writeback so all operations have the same latency to the W stage
- Write ports are never oversubscribed (one instruction in and one instruction out every cycle)
- Stall the pipeline on long-latency operations, e.g., divides and cache misses
- Handle exceptions in order at the commit point
How to prevent the increased writeback latency from slowing down single-cycle integer operations? Bypassing.
9
In-Order Superscalar Pipeline
[Figure: dual-issue in-order pipeline with PC, Inst. Mem, Dual Decode, X1, X2/Data Mem, X3, and W stages; GPRs/FPRs; FAdd, FMul, and an unpipelined FDiv; commit point at W]
- Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and the other is floating point
- Inexpensive way of increasing throughput; examples include the Alpha (1992) and the MIPS R5000 series (1996)
- The same idea can be extended to wider issue by duplicating functional units (e.g., 4-issue UltraSPARC and Alpha 21164), but register file ports and bypassing costs grow quickly
10
Types of Data Hazards
Consider executing a sequence of instructions of the form rk ← ri op rj:
- Data dependence: r3 ← r1 op r2 followed by r5 ← r3 op r4 => Read-after-Write (RAW) hazard
- Anti-dependence: r3 ← r1 op r2 followed by r1 ← r4 op r5 => Write-after-Read (WAR) hazard
- Output dependence: r3 ← r1 op r2 followed by r3 ← r6 op r7 => Write-after-Write (WAW) hazard
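As a hedged illustration (not from the slides), the three cases can be expressed as a small check over each instruction's destination and source registers; the function name and the set-based register encoding below are assumptions made for the sketch.

```python
# Minimal sketch: classify the hazard a later instruction creates with respect to
# an earlier one, given each instruction's destination and source registers.

def classify_hazards(earlier_dst, earlier_srcs, later_dst, later_srcs):
    """Return the set of hazard types between the two instructions (program order)."""
    hazards = set()
    if earlier_dst in later_srcs:      # true (data) dependence
        hazards.add("RAW")
    if later_dst in earlier_srcs:      # anti-dependence
        hazards.add("WAR")
    if later_dst == earlier_dst:       # output dependence
        hazards.add("WAW")
    return hazards

# r3 <- r1 op r2  followed by  r5 <- r3 op r4  -> {'RAW'}
print(classify_hazards("r3", {"r1", "r2"}, "r5", {"r3", "r4"}))
# r3 <- r1 op r2  followed by  r1 <- r4 op r5  -> {'WAR'}
print(classify_hazards("r3", {"r1", "r2"}, "r1", {"r4", "r5"}))
# r3 <- r1 op r2  followed by  r3 <- r6 op r7  -> {'WAW'}
print(classify_hazards("r3", {"r1", "r2"}, "r3", {"r6", "r7"}))
```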
11
Register vs. Memory Dependence
Data hazards due to register operands can be determined at the decode stage, but data hazards due to memory operands can be determined only after computing the effective address:
store M[r1 + disp1] ← r2
load r3 ← M[r4 + disp2]
Does (r1 + disp1) = (r4 + disp2)?
12
Data Hazards: An Example
I1 DIVD f6, f6, f4
I2 LD f2, 45(r3)
I3 MULTD f0, f2, f4
I4 DIVD f8, f6, f2
I5 SUBD f10, f0, f6
I6 ADDD f6, f8, f2
[Figure: the RAW, WAR, and WAW hazards among these instructions]
13
Instruction Scheduling
I1 DIVD f6, f6, f4
I2 LD f2, 45(r3)
I3 MULTD f0, f2, f4
I4 DIVD f8, f6, f2
I5 SUBD f10, f0, f6
I6 ADDD f6, f8, f2
[Figure: dependence graph over I1-I6]
Valid orderings:
in-order: I1 I2 I3 I4 I5 I6
out-of-order: I2 I1 I3 I4 I5 I6
out-of-order: I1 I2 I3 I5 I4 I6
14
Out-of-order Completion, In-order Issue
Instruction (latency):
I1 DIVD f6, f6, f4 (4)
I2 LD f2, 45(r3) (1)
I3 MULTD f0, f2, f4 (3)
I4 DIVD f8, f6, f2 (4)
I5 SUBD f10, f0, f6 (1)
I6 ADDD f6, f8, f2 (1)
[Figure: completion cycles under in-order vs. out-of-order completion]
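A toy calculation may help here. The model below is an assumption, not the slide's exact timing: one instruction issues per cycle in program order, every result is written `latency` cycles after issue, and data dependences and structural hazards are deliberately ignored; it only illustrates how variable latencies reorder completion.

```python
# Simplified latency-only model: issue in order, complete issue_cycle + latency.
instrs = [("DIVD f6,f6,f4", 4), ("LD f2,45(r3)", 1), ("MULTD f0,f2,f4", 3),
          ("DIVD f8,f6,f2", 4), ("SUBD f10,f0,f6", 1), ("ADDD f6,f8,f2", 1)]

completions = []
for issue_cycle, (name, latency) in enumerate(instrs, start=1):
    completions.append((issue_cycle + latency, name))   # cycle when the result is written

for done, name in sorted(completions):
    print(f"cycle {done:2d}: {name} completes")
# Even with in-order issue, the short-latency LD finishes before the earlier
# long-latency DIVD: completion is out of order.
```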
15
Complex Pipeline
[Figure: complex pipeline with IF, ID, Issue, ALU, Mem, Fadd, Fmul, and Fdiv units, GPRs, FPRs, and WB]
Can we solve write hazards without equalizing all pipeline depths and without bypassing?
16
When is it Safe to Issue an Instruction?
Suppose a data structure keeps track of all the instructions in all the functional units. The following checks need to be made before the issue stage can dispatch an instruction:
- Is the required functional unit available?
- Is the input data available? (RAW)
- Is it safe to write the destination? (WAR, WAW)
- Is there a structural conflict at the WB stage?
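A hedged sketch of these checks against a list of in-flight instructions follows; the dictionary layout and the `fu_free`/`wb_slot_free` callbacks are illustrative assumptions, not the actual issue logic.

```python
# Sketch of the issue-stage checks. 'instr' and each 'in_flight' entry are dicts
# with 'dst' (destination register), 'srcs' (source registers), 'fu', and 'latency'.

def can_issue(instr, in_flight, fu_free, wb_slot_free):
    if not fu_free(instr["fu"]):                  # structural hazard at the execute stage
        return False
    for older in in_flight:
        if older["dst"] in instr["srcs"]:         # RAW: an operand has not been produced yet
            return False
        if instr["dst"] in older["srcs"]:         # WAR: would overwrite a value still needed
            return False
        if instr["dst"] == older["dst"]:          # WAW: writes could complete out of order
            return False
    if not wb_slot_free(instr["latency"]):        # structural conflict at the WB stage
        return False
    return True
```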
17
Data Dependence and Hazards
InstrJ is data dependent on InstrI if:
- InstrJ tries to read an operand before InstrI writes it, e.g.
  I: add r1,r2,r3
  J: sub r4,r1,r3
- or InstrJ is data dependent on InstrK, which is dependent on InstrI
This is caused by a "True Dependence" (compiler term). If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.
18
Data Dependence and Hazards
Dependences are a property of programs. The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. Importance of data dependences:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
Today we look at HW schemes to avoid hazards.
19
Name Dependence #1: Anti-dependence
Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are two versions of name dependence.
InstrJ writes an operand before InstrI reads it:
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
This is called an "anti-dependence" in compiler work; it results from reuse of the name "r1". If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
20
Name Dependence #2: Output dependence
InstrJ writes an operand before InstrI writes it:
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
This is called an "output dependence" by compiler writers; it also results from reuse of the name "r1". If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
21
ILP and Data Hazards
Program order: the order in which instructions would execute if executed sequentially, one at a time, as determined by the original source program.
HW/SW goal: exploit parallelism while preserving the appearance of program order; modify the order only in a manner that cannot be observed by the program and that does not affect its outcome.
Example: instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict. Register renaming resolves name dependences for registers, either by the compiler or by HW (see the sketch below):
add r1, r2, r3
sub r2, r4, r5
and r3, r2, 1
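A minimal sketch of such renaming follows, assuming a simple sequential pass that gives every destination a fresh physical name; the `p`-register names and the tuple encoding are hypothetical, for illustration only.

```python
# Rename each destination to a fresh physical register so WAR/WAW name dependences
# disappear while true (RAW) dependences are preserved through the mapping table.

def rename(code):
    mapping = {}                 # architectural register -> latest physical name
    fresh = 0
    renamed = []
    for dst, srcs in code:
        srcs = [mapping.get(r, r) for r in srcs]   # read the latest version of each source
        fresh += 1
        new_dst = f"p{fresh}"                      # allocate a new physical register
        mapping[dst] = new_dst
        renamed.append((new_dst, srcs))
    return renamed

code = [("r1", ["r2", "r3"]),    # add r1, r2, r3
        ("r2", ["r4", "r5"]),    # sub r2, r4, r5  (WAR on r2 w.r.t. the add)
        ("r3", ["r2", "1"])]     # and r3, r2, 1   (RAW on the new r2, WAR on r3)
for dst, srcs in rename(code):
    print(dst, "<-", srcs)       # the 'and' now reads p2, the sub's renamed result
```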
22
Control Dependencies
Every instruction is control dependent on some set of branches and, in general, these control dependencies must be preserved to preserve program order.
if p1 { S1; };
if p2 { S2; }
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
23
Control Dependence Ignored
Control dependence need not always be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program. Instead, the two properties critical to program correctness are exception behavior and data flow.
24
Exception Behavior
Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions). Example:
DADDU R2,R3,R4
BEQZ R2,L1
LW R1,0(R2)
L1:
Problem with moving LW before BEQZ?
25
Data Flow
Data flow: the actual flow of data values among instructions that produce results and those that consume them. Branches make the flow dynamic; they determine which instruction is the supplier of data. Example:
DADDU R1,R2,R3
BEQZ R4,L
DSUBU R1,R5,R6
L: …
OR R7,R1,R8
Does OR depend on DADDU or DSUBU? We must preserve data flow during execution.
26
Advantages of Dynamic Scheduling
- Handles cases when dependences are unknown at compile time (e.g., because they may involve a memory reference)
- Simplifies the compiler
- Allows code compiled for one pipeline to run efficiently on a different pipeline
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling
27
HW Schemes: Instruction Parallelism
Key idea: allow instructions behind a stall to proceed:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
This enables out-of-order execution and allows out-of-order completion. We distinguish when an instruction begins execution and when it completes execution; between those two times, the instruction is in execution. In a dynamically scheduled pipeline, all instructions pass through the issue stage in order (in-order issue).
28
Dynamic Scheduling Step 1
The simple pipeline had one stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue. Split the ID stage of the simple 5-stage pipeline into two stages:
- Issue: decode instructions, check for structural hazards
- Read operands: wait until there are no data hazards, then read the operands
29
A Dynamic Algorithm: Tomasulo’s Algorithm
- For the IBM 360/91 (before caches!)
- Goal: high performance without special compilers
- The small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations; this led Tomasulo to figure out how to get more effective registers: renaming in hardware!
- Why study a 1966 computer? Its descendants have flourished: Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, …
30
Tomasulo Algorithm
- Control & buffers are distributed with the functional units (FUs); the FU buffers are called "reservation stations" and hold pending operands
- Registers in instructions are replaced by values or pointers to reservation stations (RS); this is a form of register renaming and avoids WAR and WAW hazards
- There are more reservation stations than registers, so optimizations compilers can't do become possible
- Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
31
Tomasulo Organization
[Figure: Tomasulo organization. The FP op queue and FP registers feed the reservation stations (Add1-Add3 for the FP adders, Mult1-Mult2 for the FP multipliers); load buffers (Load1-Load6) come from memory and store buffers go to memory; results are broadcast on the Common Data Bus (CDB). RAW memory conflicts are resolved using the addresses held in the memory buffers; the integer unit executes in parallel.]
32
Reservation Station Components
- Op: the operation to perform in the unit (e.g., + or -)
- Vj, Vk: the values of the source operands (store buffers have a V field holding the result to be stored)
- Qj, Qk: the reservation stations producing the source operands; Qj, Qk = 0 means the operand is ready (store buffers only have a Q field for the RS producing the value)
- Busy: indicates that the reservation station and its FU are busy
- Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
Compare this with the CDC 6600 scoreboard you might have expected: four stages of instruction execution; FU status tracking the usual things for RAW and structural (busy) hazards, namely Fi (the destination register from the instruction format), Rj/Rk (source-register status, where "yes" means ready), and Qj/Qk (which FU a not-ready source is waiting on; the add unit can do Add or Sub); and register result status (for WAW and WAR), i.e., which FU is going to write each register. The scoreboard size on the 6600 scales with the number of FUs. Example FU latencies: Add 2, Mult 10, Div 40 clocks.
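A minimal data-structure sketch of these fields follows; the field names come from the slide, while the dataclass, types, and the `ready()` helper are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    busy: bool = False            # station (and its FU) currently holds an instruction
    op: Optional[str] = None      # operation to perform, e.g. '+' or '-'
    Vj: Optional[float] = None    # value of the first source operand, if already available
    Vk: Optional[float] = None    # value of the second source operand
    Qj: int = 0                   # station that will produce Vj (0 => Vj is ready)
    Qk: int = 0                   # station that will produce Vk (0 => Vk is ready)

    def ready(self) -> bool:
        # An instruction may begin execution once both operands are values, not tags.
        return self.busy and self.Qj == 0 and self.Qk == 0

# Register result status: which reservation station will write each register
# (0 means no pending write, so the register file holds the architectural value).
reg_status = {"F0": 0, "F2": 0, "F4": 0}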
33
Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP op queue. If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming the registers).
2. Execute: operate on the operands (EX). When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result: finish execution (WB). Write the result on the Common Data Bus to all awaiting units and mark the reservation station available.
A normal data bus carries data + destination (a "go to" bus); the Common Data Bus carries data + source (a "come from" bus): 64 bits of data + 4 bits of functional-unit source address. A unit latches the value if the tag matches the functional unit it is waiting on; the bus does the broadcast.
Example speeds: 3 clocks for FP +/-, 10 for *, 40 clocks for /.
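The following sketch shows the issue and write-result steps under the simplifying assumption of dictionary-based reservation stations and a register-status table (the execute step is omitted: once Qj = Qk = 0 the FU computes the value that is then broadcast). Station names and register values are made up for the example.

```python
rs = {name: {"busy": False, "op": None, "Vj": None, "Vk": None, "Qj": 0, "Qk": 0}
      for name in ("Add1", "Add2", "Mult1")}
regs = {"F2": 1.5, "F4": 2.0, "F6": 0.0}
reg_status = {r: 0 for r in regs}          # which station will write each register

def issue(station, op, dst, src1, src2):
    """Step 1: rename sources to values or tags, claim the destination register."""
    r = rs[station]
    r.update(busy=True, op=op)
    for field_v, field_q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        if reg_status[src] == 0:           # operand already in the register file
            r[field_v], r[field_q] = regs[src], 0
        else:                              # wait on the producing station's tag
            r[field_v], r[field_q] = None, reg_status[src]
    reg_status[dst] = station              # later readers of dst will wait on this tag

def write_result(station, value):
    """Step 3: broadcast <tag, value> on the CDB; waiting stations and registers grab it."""
    for r in rs.values():
        if r["Qj"] == station: r["Vj"], r["Qj"] = value, 0
        if r["Qk"] == station: r["Vk"], r["Qk"] = value, 0
    for reg, tag in reg_status.items():
        if tag == station:
            regs[reg], reg_status[reg] = value, 0
    rs[station]["busy"] = False            # the station becomes free for reuse

issue("Mult1", "*", "F6", "F2", "F4")      # MULTD F6, F2, F4
issue("Add1", "+", "F6", "F6", "F2")       # ADDD F6, F6, F2 waits on Mult1's tag
write_result("Mult1", regs["F2"] * regs["F4"])
print(rs["Add1"])                          # Vj now holds 3.0 and Qj is cleared;
                                           # F6's tag is Add1, so the WAW write is avoided
```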
34
Tomasulo Example
[Table: the instruction stream with 3 load buffers, 3 FP-adder reservation stations, and 2 FP-multiplier reservation stations; the FU counts tick down and a clock-cycle counter is shown.]
35
Tomasulo Example Cycle 1
36
Tomasulo Example Cycle 2
Note: Can have multiple loads outstanding
37
Tomasulo Example Cycle 3
Note: register names are removed ("renamed") in the reservation stations; MULT issued. Load1 is completing; what is waiting for Load1?
38
Tomasulo Example Cycle 4
Load2 completing; what is waiting for Load2?
39
Tomasulo Example Cycle 5
Timers start counting down for Add1 and Mult1
40
Tomasulo Example Cycle 6
Issue ADDD here despite name dependency on F6?
41
Tomasulo Example Cycle 7
Add1 (SUBD) completing; what is waiting for it?
42
Tomasulo Example Cycle 8
43
Tomasulo Example Cycle 9
44
Tomasulo Example Cycle 10
Add2 (ADDD) completing; what is waiting for it?
45
Tomasulo Example Cycle 11
Write result of ADDD here? All quick instructions complete in this cycle!
46
Tomasulo Example Cycle 12
47
Tomasulo Example Cycle 13
48
Tomasulo Example Cycle 14
49
Tomasulo Example Cycle 15
Mult1 (MULTD) completing; what is waiting for it?
50
Tomasulo Example Cycle 16
Just waiting for Mult2 (DIVD) to complete
51
Faster than light computation (skip a couple of cycles)
52
Tomasulo Example Cycle 55
53
Tomasulo Example Cycle 56
Mult2 (DIVD) is completing; what is waiting for it?
54
Tomasulo Example Cycle 57
Once again: In-order issue, out-of-order execution and out-of-order completion.
55
Tomasulo Drawbacks
- Complexity: delays of the 360/91, MIPS 10000, Alpha 21264, and IBM PPC 620 (in CA:AQA 2/e, but not in silicon!); many associative stores (CDB) at high speed
- Performance limited by the Common Data Bus: each CDB must go to multiple functional units => high capacitance, high wiring density; the number of functional units that can complete per cycle is limited to one! Multiple CDBs => more FU logic for parallel associative stores
- Non-precise interrupts! We will address this later.
56
Tomasulo Loop Example
Loop: LD F0, 0(R1)
      MULTD F4, F0, F2
      SD F4, 0(R1)
      SUBI R1, R1, #8
      BNEZ R1, Loop
- This time assume Multiply takes 4 clocks
- Assume the 1st load takes 8 clocks (L1 cache miss) and the 2nd load takes 1 clock (hit)
- To be clear, clocks are shown for SUBI and BNEZ; in reality, integer instructions run ahead of the FP instructions
- Show 2 iterations
57
Loop Example
[Table: iteration count, added store buffers, the instruction loop, and the value of the register used for the address and iteration control.]
58
Loop Example Cycle 1
59
Loop Example Cycle 2
60
Loop Example Cycle 3 Implicit renaming sets up data flow graph
61
Loop Example Cycle 4 Dispatching SUBI Instruction (not in FP queue)
62
Loop Example Cycle 5 And, BNEZ instruction (not in FP queue)
63
Loop Example Cycle 6 Notice that F0 never sees Load from location 80
64
Loop Example Cycle 7 Register file completely detached from computation First and Second iteration completely overlapped
65
Loop Example Cycle 8
66
Loop Example Cycle 9 Load1 completing: who is waiting?
Note: Dispatching SUBI
67
Loop Example Cycle 10 Load2 completing: who is waiting?
Note: Dispatching BNEZ
68
Loop Example Cycle 11 Next load in sequence
69
Loop Example Cycle 12 Why not issue third multiply?
70
Loop Example Cycle 13 Why not issue third store?
71
Loop Example Cycle 14 Mult1 completing. Who is waiting?
72
Loop Example Cycle 15 Mult2 completing. Who is waiting?
73
Loop Example Cycle 16
74
Loop Example Cycle 17
75
Loop Example Cycle 18
76
Loop Example Cycle 19
77
Loop Example Cycle 20 Once again: In-order issue, out-of-order execution and out-of-order completion.
78
Why can Tomasulo overlap iterations of loops?
- Register renaming: multiple iterations use different physical destinations for their registers (dynamic loop unrolling)
- Reservation stations: permit instruction issue to advance past integer control-flow operations; they also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
- Another perspective: Tomasulo builds the data-flow dependency graph on the fly
79
Tomasulo’s scheme offers 2 major advantages
(1) The distribution of the hazard detection logic: distributed reservation stations and the CDB. If multiple instructions are waiting on a single result, and each already has its other operand, then the instructions can be released simultaneously by the broadcast on the CDB. If a centralized register file were used, the units would have to read their results from the registers when the register buses are available.
(2) The elimination of stalls for WAW and WAR hazards.
80
What about Precise Interrupts?
- The state of the machine must look as if no instruction beyond the faulting instruction has issued
- Tomasulo had in-order issue, out-of-order execution, and out-of-order completion
- We need to "fix" the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream
81
Relationship between precise interrupts and speculation:
Speculation: guess and check. This is important for branch prediction: we need to "take our best shot" at predicting the branch direction. If we speculate and are wrong, we need to back up and restart execution at the point at which we predicted incorrectly: this is exactly the same as precise exceptions! The technique for both precise interrupts/exceptions and speculation: in-order completion or commit.
82
HW support for precise interrupts
- Need a HW buffer for the results of uncommitted instructions: the reorder buffer
  - 3 fields: instruction, destination, value
  - Use the reorder buffer number instead of the reservation station number when execution completes
  - Supplies operands between execution completion and commit (the reorder buffer can be an operand source => more registers, like the RSs)
- Instructions commit in order; once an instruction commits, its result is put into the register file
- As a result, it is easy to undo speculated instructions on mispredicted branches or exceptions
[Figure: the FP op queue and FP registers feed the reservation stations and FP adders; results flow through the reorder buffer.]
83
Four Steps of Speculative Tomasulo Algorithm
1. Issue: get an instruction from the FP op queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution: operate on the operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; this checks RAW (sometimes called "issue").
3. Write result: finish execution (WB). Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available.
4. Commit: update the register with the reorder-buffer result. When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation").
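A hedged sketch of the commit step follows, assuming a deque-based reorder buffer whose entries record a destination, a value, a completion flag, and a misprediction flag; everything here is illustrative, not the actual hardware.

```python
from collections import deque

class ROBEntry:
    def __init__(self, dest):
        self.dest = dest                  # architectural register (None for e.g. branches)
        self.value = None                 # filled in at the write-result step
        self.done = False
        self.mispredicted_branch = False

def commit(rob: deque, regs: dict):
    """Retire completed instructions from the head of the ROB, in program order."""
    while rob and rob[0].done:
        entry = rob.popleft()
        if entry.mispredicted_branch:
            rob.clear()                   # squash every younger (speculative) instruction
            break
        if entry.dest is not None:
            regs[entry.dest] = entry.value   # architectural state changes only here

rob = deque([ROBEntry("F2"), ROBEntry("F6")])
rob[1].value, rob[1].done = 7.0, True     # a younger instruction finished first...
regs = {"F2": 0.0, "F6": 0.0}
commit(rob, regs)
print(regs)  # ...but nothing commits until the older entry at the head is done
```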
84
What are the hardware complexities with reorder buffer (ROB)?
[Figure: FP op queue, reservation stations, FP adder, FP registers, and a reorder table with fields Dest Reg, Result, Exceptions?, Valid, and Program Counter, searched through a comparison network.]
How do you find the latest version of a register? (As specified by Smith's paper) you need an associative comparison network. You could instead use a future file, or simply use the register result status buffer to track which specific reorder buffer entry has received the value. You need as many ports on the ROB as on the register file.
85
Summary
- Reservation stations: implicit register renaming to a larger set of registers, plus buffering of source operands
  - Prevents registers from being a bottleneck
  - Avoids the WAR and WAW hazards of the scoreboard
  - Allows loop unrolling in HW
- Not limited to basic blocks (the integer unit gets ahead, beyond branches)
- Today, helps with cache misses as well: don't stall for an L1 data-cache miss (insufficient ILP for an L2 miss?)
- Lasting contributions: dynamic scheduling, register renaming, load/store disambiguation
- The 360/91's descendants include the Pentium III, PowerPC 604, MIPS R10000, HP PA-8000, and Alpha 21264
86
How many instructions can be in the pipeline?
Which features of an ISA limit the number of instructions in the pipeline? The number of registers. This illustrates how one feature alone may not help; the same happens today when people study a single new idea in a very detailed model: out-of-order dispatch by itself does not provide any significant performance improvement!
87
Overcoming the Lack of Register Names
Floating-point pipelines often cannot be kept filled with a small number of registers; the IBM 360 had only 4 floating-point registers. Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility? Robert Tomasulo of IBM suggested an ingenious solution in 1967 using on-the-fly register renaming.
88
Instruction-level Parallelism via Renaming
Instruction (latency):
1 LD F2, 34(R2) (1)
2 LD F4, 45(R3) (long)
3 MULTD F6, F4, F2 (3)
4 SUBD F8, F2, F2 (1)
5 DIVD F4', F2, F8 (4)
6 ADDD F10, F6, F4' (1)
[Figure: issue schedules for in-order vs. out-of-order execution of instructions 1-6.]
Any anti-dependence can be eliminated by renaming (renaming requires additional storage); here F4 has been renamed to F4'. Can it be done in hardware? Yes!
89
Register Renaming
[Figure: pipeline with IF, ID, Issue, ALU, Mem, Fadd, and Fmul units and WB.]
- Decode does register renaming and adds instructions to the issue-stage instruction reorder buffer (ROB); renaming makes WAR and WAW hazards impossible
- Any instruction in the ROB whose RAW hazards have been satisfied can be dispatched: out-of-order, or dataflow, execution
90
Renaming Structures
[Figure: renaming table & regfile; reorder buffer with slots t1, t2, …, tn and fields Ins#, use, exec, op, p1, src1, p2, src2; load unit, store unit, and FUs broadcasting <t, result>.]
- Replacing a tag by its value is an expensive operation
- An instruction template (i.e., tag t) is allocated by the Decode stage, which also associates the tag with a register in the regfile
- When an instruction completes, its tag is deallocated
91
Reorder Buffer Management
[Figure: reorder buffer slots t1…tn with fields Ins#, use, exec, op, p1, src1, p2, src2; ptr1 = next available, ptr2 = next to deallocate.]
- Destination registers are renamed to the instruction's slot tag
- The ROB is managed circularly
- The "exec" bit is set when the instruction begins execution
- When an instruction completes, its "use" bit is marked free
- ptr2 is incremented only if the "use" bit is marked free
An instruction slot is a candidate for execution when it holds a valid instruction (the "use" bit is set), it has not already started execution (the "exec" bit is clear), and both operands are available (p1 and p2 are set). A software sketch of this management follows.
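The sketch below assumes the field names from the figure; allocation, dispatch readiness, completion, and pointer advance are separate helpers, and the buffer size is an arbitrary choice.

```python
N = 8
rob = [{"use": False, "exec": False, "op": None, "p1": False, "p2": False} for _ in range(N)]
ptr1 = 0   # next available slot, allocated by decode
ptr2 = 0   # next slot to deallocate, advanced in order

def allocate(op):
    global ptr1
    slot = ptr1
    rob[slot].update({"use": True, "exec": False, "op": op, "p1": False, "p2": False})
    ptr1 = (ptr1 + 1) % N          # destination registers are renamed to this slot's tag
    return slot

def can_dispatch(slot):
    e = rob[slot]                   # valid, not yet started, and both operands present
    return e["use"] and not e["exec"] and e["p1"] and e["p2"]

def complete(slot):
    rob[slot]["use"] = False        # mark free; ptr2 may now move past it

def advance_deallocate_pointer():
    global ptr2
    while ptr2 != ptr1 and not rob[ptr2]["use"]:
        ptr2 = (ptr2 + 1) % N       # only freed slots are reclaimed, in order

slot = allocate("MULTD")
rob[slot]["p1"] = rob[slot]["p2"] = True   # operands arrive (e.g., via a CDB broadcast)
print(can_dispatch(slot))                  # True
```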
92
Renaming & Out-of-order Issue: An Example
1 LD F2, 34(R2)
2 LD F4, 45(R3)
3 MULTD F6, F4, F2
4 SUBD F8, F2, F2
5 DIVD F4, F2, F8
6 ADDD F10, F6, F4
[Figure: successive snapshots of the renaming table (F1-F8 mapping to data or tags ti) and the reorder buffer (Ins#, use, exec, op, p1, src1, p2, src2) as these instructions issue and complete.]
When are tags in the sources replaced by data? Whenever an FU produces data. When can a name be reused? Whenever an instruction completes.
93
IBM 360/91 Floating-Point Unit R. M. Tomasulo, 1967
[Figure: the 360/91 FP unit. Load buffers (from memory), store buffers (to memory), the floating-point register file, and adder/multiplier reservation stations, each entry holding a presence bit "p" plus tag/data; instruction templates are distributed by functional unit; results broadcast <tag, result> on the common bus.]
The common bus ensures that data is made available immediately to all the instructions waiting for it: match the tag and, if equal, copy the value and set the presence bit "p".
94
Precise Interrupts
It must appear as if an interrupt is taken between two instructions (say Ii and Ii+1):
- the effect of all instructions up to and including Ii is totally complete
- no effect of any instruction after Ii has taken place
The interrupt handler either aborts the program or restarts it at Ii+1.
95
Effect on Interrupts: Out-of-order Completion
I1 DIVD f6, f6, f4
I2 LD f2, 45(r3)
I3 MULTD f0, f2, f4
I4 DIVD f8, f6, f2
I5 SUBD f10, f0, f6
I6 ADDD f6, f8, f2
[Figure: the out-of-order completion order, with "restore f2" and "restore f10" marked.]
Consider interrupts: precise interrupts are difficult to implement at high speed, because we want to start executing later instructions before the exception checks are finished on earlier instructions.
96
Exception Handling (In-Order Five-Stage Pipeline)
[Figure: five-stage pipeline (PC, Inst. Mem, Decode, Execute, Data Mem, Writeback) with exception hardware: Cause and EPC registers, kill signals for the F, D, E, and writeback stages, PC address exceptions, illegal opcode, overflow, data address exceptions, asynchronous interrupts, and handler-PC selection at the commit point (M stage).]
- Hold exception flags in the pipeline until the commit point (M stage)
- Exceptions in earlier pipe stages override later exceptions
- Inject external interrupts at the commit point (they override the others)
- If there is an exception at commit: update the Cause and EPC registers, kill all stages, and inject the handler PC into the fetch stage
97
Phases of Instruction Execution
- Fetch: instruction bits are retrieved from the cache (PC, I-cache, fetch buffer)
- Decode: instructions are placed in the appropriate issue (aka "dispatch") stage buffer (issue buffer)
- Execute: instructions and operands are sent to the execution units; when execution completes, all results and exception flags are available (functional units, result buffer)
- Commit: the instruction irrevocably updates the architectural state (aka "graduation" or "completion") (architectural state)
98
In-Order Commit for Precise Exceptions
[Figure: fetch and decode in order into a reorder buffer; out-of-order execute; in-order commit, with kill signals and handler-PC injection on an exception.]
- Instructions are fetched and decoded into the instruction reorder buffer in order
- Execution is out of order (=> out-of-order completion)
- Commit (write-back to architectural state, i.e., regfile & memory) is in order
- Temporary storage is needed to hold results before commit (shadow registers and store buffers)
99
Extensions for Precise Exceptions
[Figure: reorder buffer with fields Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause; ptr2 = next to commit, ptr1 = next available.]
- Add <pd, dest, data, cause> fields to the instruction template
- Commit instructions to the register file and memory in program order; the buffers can be maintained circularly
- On an exception, clear the reorder buffer by resetting ptr1 = ptr2 (stores must wait for commit before updating memory)
100
Rollback and Renaming
[Figure: register file (now holding only committed state), reorder buffer with fields Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data; load unit, FU, store unit, commit path, and the <t, result> broadcast.]
The register file does not contain renaming tags any more. How does the decode stage find the tag of a source register? Search the "dest" field in the reorder buffer.
101
Renaming Table
[Figure: rename table (tag + valid bit) in front of the register file and reorder buffer, with load unit, FU, store unit, commit path, and the <t, result> broadcast.]
The renaming table is a cache to speed up register-name lookup. It needs to be cleared after each exception is taken. When else are the valid bits cleared? On control transfers.
102
Control Flow Penalty
[Figure: fetch, decode, issue buffer, functional units, result buffer, commit; the branch executes far downstream of where the next fetch is started.]
Modern processors may have more than 10 pipeline stages between next-PC calculation and branch resolution! How much work is lost if the pipeline doesn't follow the correct instruction flow? Roughly: loop length x pipeline width.
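As a rough illustrative calculation (the numbers are assumptions, not a specific processor): with about 10 stages between PC generation and branch resolution and a 4-wide pipeline, each mispredicted branch can throw away on the order of 10 x 4 = 40 instruction slots of work.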
103
MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction: 1) Is the preceding instruction a taken branch? 2) If so, what is the target address?

Instruction | Taken known?       | Target known?
J           | After Inst. Decode | After Inst. Decode
JR          | After Inst. Decode | After Reg. Fetch
BEQZ/BNEZ   | After Reg. Fetch*  | After Inst. Decode

*Assuming zero detect on register read.
104
Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750 MHz, 2000):
A: PC Generation/Mux
P: Instruction Fetch Stage 1
F: Instruction Fetch Stage 2
B: Branch Address Calc / Begin Decode
I: Complete Decode
J: Steer Instructions to Functional Units
R: Register File Read
E: Integer Execute
followed by the remainder of the execute pipeline (another 6 stages).
The branch target address is known after stage B; the branch direction and jump-register target are known after stage E.
105
Reducing Control Flow Penalty
Software solutions:
- Eliminate branches: loop unrolling (increases the run length)
- Reduce resolution time: instruction scheduling; compute the branch condition as early as possible (of limited value)
Hardware solutions:
- Find something else to do: delay slots; replaces pipeline bubbles with useful work (requires software cooperation)
- Speculate: branch prediction; speculative execution of instructions beyond the branch
Just as the compiler can be used to avoid WAW hazards, the compiler can be used to reduce the control-flow penalty.
106
Branch Prediction
Motivation: branch penalties limit the performance of deeply pipelined processors. Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly.
Required hardware support:
- Prediction structures: branch history tables, branch target buffers, etc.
- Mispredict recovery mechanisms: keep result computation separate from commit; kill instructions following the branch in the pipeline; restore state to the state following the branch
107
Static Branch Prediction
The overall probability that a branch is taken is ~60-70%, but backward branches are taken ~90% of the time and forward branches only ~50% of the time.
- An ISA can attach preferred-direction semantics to branches, e.g., the Motorola MC88110's bne0 (preferred taken) and beq0 (not taken)
- An ISA can allow an arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA; this is typically reported as ~80% accurate
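These numbers motivate the classic backward-taken / forward-not-taken static heuristic. A minimal sketch follows; the function name and the example PCs are made up for illustration.

```python
def predict_taken_static(branch_pc: int, target_pc: int) -> bool:
    """Predict taken for backward branches (likely loops), not taken for forward branches."""
    return target_pc < branch_pc

print(predict_taken_static(0x400100, 0x4000C0))   # backward branch -> True (predict taken)
print(predict_taken_static(0x400100, 0x400140))   # forward branch  -> False
```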
108
Dynamic Branch Prediction: learning based on past behavior
- Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at its next execution
- Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution)
109
Branch Prediction Bits
Assume 2 BP bits per instruction; change the prediction only after two consecutive mistakes.
[Figure: four-state machine over (predict taken / predict not-taken) x (last prediction right / wrong).]
BP state: (predict take/¬take) x (last prediction right/wrong)
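A direct sketch of this state machine follows, assuming a per-branch object that stores the current prediction and whether the last prediction was wrong; the class and method names are made up for the example.

```python
class TwoBitPredictor:
    def __init__(self, predict_taken=True):
        self.predict_taken = predict_taken
        self.last_wrong = False

    def predict(self) -> bool:
        return self.predict_taken

    def update(self, actually_taken: bool) -> None:
        wrong = (self.predict_taken != actually_taken)
        if wrong and self.last_wrong:          # two mistakes in a row: flip the prediction
            self.predict_taken = not self.predict_taken
            self.last_wrong = False
        else:
            self.last_wrong = wrong            # remember a single mistake, keep the prediction

bp = TwoBitPredictor(predict_taken=True)
for outcome in [True, False, True, False, False, False]:
    print(bp.predict(), outcome)               # prediction flips only after the 2nd miss in a row
    bp.update(outcome)
```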
110
Branch History Table
[Figure: the low-order k bits of the fetch PC index a 2^k-entry BHT with 2 bits per entry, producing a taken/not-taken prediction; the I-cache supplies the instruction (opcode, offset), from which "is it a branch?" and the target PC (PC + offset) are determined.]
A 4K-entry BHT with 2 bits per entry gives ~80-90% correct predictions.
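The sketch below models a 2^k-entry BHT with one 2-bit counter per entry, indexed by low-order bits of the (word-aligned) fetch PC. A saturating up/down counter is one common encoding of the 2 bits; the slide's exact state machine may differ, and the example PC is arbitrary.

```python
class BranchHistoryTable:
    def __init__(self, k: int):
        self.k = k
        self.table = [1] * (1 << k)        # counters 0..3; >= 2 means predict taken

    def _index(self, pc: int) -> int:
        return (pc >> 2) & ((1 << self.k) - 1)   # drop the byte offset, keep k index bits

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bht = BranchHistoryTable(k=12)             # 4K entries, as in the slide
pc = 0x004001A8
for outcome in [True, True, False, True]:
    print(bht.predict(pc), outcome)
    bht.update(pc, outcome)
```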
111
Acknowledgements
These slides contain material developed and copyrighted by John Kubiatowicz (UCB) and David Patterson (UCB). UCB material derived from courses CS252 and CS152.