Published by Britton Arnold. Modified over 9 years ago.
1
Data Hazard Solution 2: Data Forwarding
Our naïve pipeline would experience many data stalls
  Register isn’t written until completion of write-back stage
  Source operands are read from register file in decode stage
  Values need to be in register file at start of stage
  Leads to many more stall cycles than necessary
Key observation
  The value we want is generated in execute or memory stage
  It is “available” 1-2 cycles before write-back
Trick: go get it!
  Pass value directly from stage of generating instruction to decode stage
  Must be available before end of decode stage to avoid stall
2
Detecting Stall Condition
# demo-h2.ys
0x000: irmovl $10,%edx    cycles 1-5:   F D E M W
0x006: irmovl $3,%eax     cycles 2-6:   F D E M W
0x00c: nop                cycles 3-7:   F D E M W
0x00d: nop                cycles 4-8:   F D E M W
       bubble             cycles 7-9:   E M W
0x00e: addl %edx,%eax     cycles 5-10:  F D D E M W
0x010: halt               cycles 6-11:  F F D E M W
Cycle 6
  W: W_dstE = %eax, W_valE = 3
  D: srcA = %edx, srcB = %eax
3
Data Forwarding Example irmovl in write-back stage Destination value in W pipeline register “Forward” as valB for decode stage addl instruction can proceed without stalling When do we actually know the values of %eax and %edx?
4
Data Forwarding Example #2 Register %edx Generated by ALU during previous cycle Forward from memory as valA Register %eax Value just generated by ALU Forward from execute as valB
5
Forwarding Hardware
Feedback paths from E, M, and W registers to decode stage
Logic blocks to select source for valA and valB in decode stage
Note: we either do forwarding or stall on data hazards
  Forwarding has better performance, higher cost
[Figure: PIPE]
6
Forwarding Control
## Actions of “Sel+Fwd A” block
## Pick the correct A value
## Order is important!
int new_E_valA = [
    # Use incremented PC
    D_icode in { ICALL, IJXX } : D_valP;
    # Forward valE from execute
    d_srcA == E_dstE : e_valE;
    # Forward valM from memory
    d_srcA == M_dstM : m_valM;
    # Forward valE from memory
    d_srcA == M_dstE : M_valE;
    # Forward valM from write back
    d_srcA == W_dstM : W_valM;
    # Forward valE from write back
    d_srcA == W_dstE : W_valE;
    # Use value read from register file
    1 : d_rvalA;
];
[Figure: PIPE]
7
Forwarding Example
At clock cycle 4
  d_srcA =        d_srcB =
  E_dstE =        e_valE =
  M_dstE =        M_valE =
What are the forwarding conditions?
Highlight the data path in D, E, M stages.
[Figure: PIPE]
8
Forwarding Example
At clock cycle 4
  d_srcA = ecx    d_srcB = edx
  E_dstE = ecx    e_valE = 3
  M_dstE = edx    M_valE = 128
What are the forwarding conditions?
[Figure: PIPE]
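The priority selection done by the “Sel+Fwd A” block can be sketched in Python. This is a minimal model, not the PIPE implementation: the function name sel_fwd, the numeric register IDs, and the explicit RNONE guard are assumptions made for illustration, and the special case that substitutes D_valP for call and jXX is left out.

```python
# Illustrative register IDs (assumed for this sketch).
ECX, EDX = 1, 2
RNONE = 0xF  # "no register"

def sel_fwd(src, d_rval, fwd_sources):
    """Pick the value for one decode-stage operand.

    fwd_sources is a list of (dst_reg, value) pairs ordered from highest
    to lowest priority: (E_dstE, e_valE), (M_dstM, m_valM),
    (M_dstE, M_valE), (W_dstM, W_valM), (W_dstE, W_valE).
    """
    if src != RNONE:  # guard added for this sketch
        for dst, val in fwd_sources:
            if dst == src:
                return val  # first match wins: youngest producer
    return d_rval  # no pending write: use the register file value

# Values from the clock cycle 4 example above.
fwd = [
    (ECX, 3),      # E_dstE, e_valE
    (RNONE, 0),    # M_dstM, m_valM
    (EDX, 128),    # M_dstE, M_valE
    (RNONE, 0),    # W_dstM, W_valM
    (RNONE, 0),    # W_dstE, W_valE
]
valA = sel_fwd(ECX, 0, fwd)  # forwarded from execute
valB = sel_fwd(EDX, 0, fwd)  # forwarded from memory
```

The list order mirrors the “Order is important!” note: when two in-flight instructions write the same register, the one closest to decode is the most recent in program order, so it must win.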
9
Limitation of Forwarding
Load-use dependency
  Value needed by end of decode stage in cycle 7
  Value read from memory in memory stage of cycle 8
Terminology
  This is a load hazard
  Only solution is a load stall
  Preferred: compiler avoids it
10
Load/Use Hazard: Desired Behavior Best we can do in hardware Stall reading instruction for one cycle Then forward value from memory stage Better yet Have compiler avoid in code it generates
11
Addressing Load/Use Hazard
Detection
  Previous instr. is loading from memory to src register
  dstM in E register matches srcA or srcB (and not 0xF)
Action
  Stall instruction in decode
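The detection rule above can be written as a small predicate. This is a sketch under assumptions: the function name is made up, and the icode constants follow the usual Y86 encoding (mrmovl = 5, popl = 0xB).

```python
# Y86 instruction codes and the "no register" ID (assumed encoding).
IMRMOVL, IPOPL = 0x5, 0xB
RNONE = 0xF

def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """True when the instruction in execute is a load whose destination
    is a source of the instruction currently in decode."""
    return (E_icode in (IMRMOVL, IPOPL)
            and E_dstM != RNONE            # the "and not 0xF" condition
            and E_dstM in (d_srcA, d_srcB))
```

When the predicate holds, the decode-stage instruction is held for one cycle; after that the loaded value can be forwarded from the memory stage.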
12
Interrupts and Exceptions
Basic interrupt mechanism
  CPU running current process: … instr i, instr i+1, instr i+2, instr i+3, instr i+4, instr i+5, instr i+6 …
  Event occurs that needs attention (e.g., Disc read finishes)
  HW asserts CPU interrupt line
  Control transferred to interrupt handler (think HW-induced function call)
  Handler: instr 1, instr 2, instr 3 …
How is state of interrupted process saved?
How is location of handler determined?
13
Interrupt Handling
Calling handler
  Save return address (PC) on stack
    Address of next instruction to be executed for this process
    Depending on event, either current or next instruction
    PC usually passed through pipeline along with instruction
    Precise exception: all instructions to PC executed, none past (“Clean Break”)
  Jump to handler address
    Usually obtained from table stored at fixed memory address
    Index to table entry determined by exception type/interrupt priority level
    Interrupt vector table written by software, accessed by hardware
Implementation
  Critical for real hardware
  Seldom implemented in simulators: no OS running to pass control to!
14
Exceptions
Events occurring within processor under which pipeline cannot continue normal operation
Possible causes
  Halt instruction executed (Current)
  Bad address for instruction or data (Previous)
  Invalid instruction (Previous)
  Pipeline control error (Previous)
  System calls, page faults, math errors (not in Y86)
Desired action
  Complete instructions to specific point
    Either current or previous (depends on exception type)
  Discard instructions that follow
  Transfer control to exception handler in OS
    Save return address, get handler address from table
15
Exception Examples
Detect in fetch stage
  jmp $-1                      # Invalid jump target
  .byte 0xFF                   # Invalid instruction code
  halt                         # Halt instruction
Detect in memory stage
  irmovl $100,%eax
  rmmovl %eax,0x10000(%eax)    # Invalid address (for Y86 tools)
16
Exceptions in Pipeline Processor (#1)
Desired behavior
  rmmovl should cause exception (1st in sequential machine)
  Tricky because invalid instruction code detected first
# demo-exc1.ys
irmovl $100,%eax
rmmovl %eax,0x10000(%eax)    # Invalid address
nop
.byte 0xFF                   # Invalid instruction code
0x000: irmovl $100,%eax             cycles 1-5:  F D E M W
0x006: rmmovl %eax,0x10000(%eax)    cycles 2-5:  F D E M    (exception detected in M, cycle 5)
0x00c: nop                          cycles 3-5:  F D E
0x00d: .byte 0xFF                   cycles 4-5:  F D        (exception detected in F, cycle 4)
17
Exceptions in Pipeline Processor (#2)
Desired behavior
  No exception should occur
  Must match behavior and results of sequential execution
# demo-exc2.ys
0x000: xorl %eax,%eax    # Set condition codes
0x002: jne t             # Not taken
0x007: irmovl $1,%eax
0x00d: irmovl $2,%edx
0x013: halt
0x014: t: .byte 0xFF     # Target
0x000: xorl %eax,%eax     cycles 1-5:  F D E M W
0x002: jne t              cycles 2-6:  F D E M W
0x014: t: .byte 0xFF      cycles 3-7:  F D E M W    (exception detected in F, cycle 3)
0x???: (I’m lost!)        cycles 4-8:  F D E M W
0x007: irmovl $1,%eax     cycles 5-9:  F D E M W
18
Correct Exception Handling
Challenges: respond to exceptions in program order, and only those that “really” occur
Motivation for exception status field (stat) in pipeline registers
  Fetch stage sets to either “AOK,” “ADR” (bad fetch address), or “INS” (illegal instruction)
  Decode & execute stages pass values through
  Memory stage either passes through or sets to “ADR”
  CPU responds to exception only when instruction reaches write back
Pipeline register fields
  W: stat, icode, valE, valM, dstE, dstM
  M: stat, icode, Cnd, valE, valA, dstE, dstM
  E: stat, icode, ifun, valC, valA, valB, dstE, dstM, srcA, srcB
  D: stat, icode, ifun, rA, rB, valC, valP
  F: predPC
19
Avoiding Side Effects
Desired behavior
  rmmovl should cause exception
  No following instruction should change any state
  Note special challenge of condition codes!
# demo-exc3.ys
irmovl $100,%eax
rmmovl %eax,0x10000(%eax)    # Invalid address
addl %eax,%eax               # Sets condition codes
0x000: irmovl $100,%eax             cycles 1-5:  F D E M W
0x006: rmmovl %eax,0x10000(%eax)    cycles 2-5:  F D E M    (exception detected in M, cycle 5)
0x00c: addl %eax,%eax               cycles 3-5:  F D E      (condition code set in E, cycle 5)
20
Avoiding Side Effects
Exception should disable state update for following instructions
When exception detected in memory stage
  Disable condition code setting in execute
  Must happen in same clock cycle
When exception passes to write-back stage
  Disable memory write in memory stage
  Disable condition code setting in execute stage
Let’s see how these are handled in PIPE processor
21
PIPE: Fetch Details
Main points
  Branch prediction
  Branch misprediction recovery
  Return handling
  Stat initialization
[Figure: PIPE fetch stage. Blocks: Select PC, Instruction memory, Split (byte 0, bytes 1-5), Align, Instr valid, Need regids, Need valC, PC incr., Predict PC, Stat. Inputs: predPC, M_icode, M_Cnd, M_valA, W_icode, W_valM. Signals: f_pc, imem_error. D register fields: stat, icode, ifun, rA, rB, valC, valP.]
22
PIPE: Decode and Write-back Main points Forwarding logic, paths Forwarding priority valA and valP merged
23
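PIPE: Execute
Main points
  CC update inhibited by prior exceptions
  Values for forwarding
  Special handling for dstE (?)
[Figure: PIPE execute stage. Blocks: ALU A, ALU B, ALU fun., ALU, Set CC, CC, cond, dstE. Inputs: E register fields (stat, icode, ifun, valC, valA, valB, dstE, dstM, srcA, srcB), m_stat, W_stat. Outputs: e_valE, e_Cnd, e_dstE, and the M register fields (stat, icode, Cnd, valE, valA, dstE, dstM).]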
24
PIPE: Memory and Write-back
Main points
  Values for forwarding
  Stat update logic
  Feedback for branch misprediction recovery
[Figure: PIPE memory and write-back stages. Blocks: Addr, Mem. read, Mem. write, Data memory (data in, data out), Stat. M register fields: stat, icode, Cnd, valE, valA, dstE, dstM. W register fields: stat, icode, valE, valM, dstE, dstM. Signals: M_icode, M_Cnd, M_valA, M_valE, M_dstE, M_dstM, m_valM, m_stat, dmem_error, W_icode, W_valE, W_valM, W_dstE, W_dstM.]
25
Pipeline Control: Register Modes
Register state = x, input = y; output after the rising clock edge:
  Mode     stall  bubble  Output
  Normal   0      0       y
  Stall    1      0       x
  Bubble   0      1       nop
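The three register modes can be modeled as one clocking function. This is only a behavioral sketch; the function name and the NOP placeholder value are assumptions, and raising an error when both controls are asserted is a choice made here to surface that case.

```python
NOP = "nop"  # placeholder for the bubble (nop) state

def clock_register(state, inp, stall=False, bubble=False):
    """Return the register contents after a rising clock edge."""
    if stall and bubble:
        # Both controls asserted at once is a pipeline control error.
        raise ValueError("stall and bubble asserted together")
    if stall:
        return state  # hold the old value
    if bubble:
        return NOP    # replace the contents with a nop
    return inp        # normal: load the input
```

For example, clock_register("x", "y") returns "y", while the same call with stall=True returns "x".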
26
PIPE Control Logic
Handles special cases
  Handles ret, load/use hazards, misprediction recovery, exceptions
  Existing PIPE logic handles forwarding, branch prediction
[Figure: Pipe control logic block attached to pipeline registers F, D, E, M, W. Inputs: D_icode, d_srcA, d_srcB, E_icode, E_dstM, e_Cnd, M_icode, m_stat, W_stat. Outputs: F_stall, D_stall, D_bubble, E_bubble, M_bubble, W_stall, set_cc.]
27
PIPE: Actual ret handling
# prog7
Simplified view
  0x000: irmovl Stack,%edx                  cycles 1-5:   F D E M W
  0x006: call proc                          cycles 2-6:   F D E M W
  0x020: ret                                cycles 3-7:   F D E M W
         bubble (three in a row, following ret)
  0x00b: irmovl $10,%edx  # Return point    cycles 7-11:  F D E M W
What hardware actually does
  0x000: irmovl Stack,%edx                  cycles 1-5:   F D E M W
  0x006: call proc                          cycles 2-6:   F D E M W
  0x020: ret                                cycles 3-7:   F D E M W
  0x021: rrmovl %edx,%ebx  # Not executed   fetched in cycle 4, replaced by bubble in decode
  0x021: rrmovl %edx,%ebx  # Not executed   fetched in cycle 5, replaced by bubble in decode
  0x021: rrmovl %edx,%ebx  # Not executed   fetched in cycle 6, replaced by bubble in decode
  0x00b: irmovl $10,%edx  # Return point    cycles 7-11:  F D E M W
28
PIPE: Actual exception handling
Scenario: pushl uses bad memory address
Actions: disable CC, inject bubbles into memory stage, stall write-back
# prog10
0x000: irmovl $1,%eax
0x006: xorl %esp,%esp    # CC = 100
0x008: pushl %eax
0x00a: addl %eax,%eax
0x00c: irmovl $2,%eax
Cycle 6
  M (pushl): mem_error = 1
  E (addl): new CC = 000, but set_cc is forced to 0
[Pipeline diagram: pushl is held in write-back over repeated cycles; the instructions after it are replaced by bubbles in the memory stage]
29
Special Control Cases: Exceptions
Detection
  Condition   Trigger
  Exception   m_stat in { SADR, SINS, SHLT } || W_stat in { SADR, SINS, SHLT }
Action (on next cycle)
  Condition   F       D       E       M       W
  Exception   normal  normal  normal  bubble  stall
Also: disable setting of condition codes in execute in current cycle
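The trigger and action tables above can be expressed as a single function. This sketch follows the table as given (one trigger driving all three actions); the string status codes and the dictionary keys are hypothetical names chosen for illustration.

```python
# Status codes as plain strings (hypothetical encoding for this sketch).
SAOK, SADR, SINS, SHLT = "AOK", "ADR", "INS", "HLT"
EXCEPTION_CODES = {SADR, SINS, SHLT}

def exception_control(m_stat, W_stat):
    """Map the stat values seen in memory and write-back to control actions."""
    exc = m_stat in EXCEPTION_CODES or W_stat in EXCEPTION_CODES
    return {
        "set_cc_allowed": not exc,  # applies in the current cycle
        "M_bubble": exc,            # applies on the next cycle
        "W_stall": exc,             # applies on the next cycle
    }
```

Blocking the condition-code update in the same cycle matters because the instruction after the excepting one is already in execute when the exception is detected in the memory stage.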
30
Special Control Cases: Non-exceptions
Detection
  Condition             Trigger
  Processing ret        IRET in { D_icode, E_icode, M_icode }
  Load/Use Hazard       E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
  Mispredicted Branch   E_icode == IJXX && !e_Cnd
Action (on next cycle)
  Condition             F       D       E       M       W
  Processing ret        stall   bubble  normal  normal  normal
  Load/Use Hazard       stall   stall   bubble  normal  normal
  Mispredicted Branch   normal  bubble  bubble  normal  normal
31
Pipeline Control, rev. 1.0
bool F_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };
bool D_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };
bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };
bool E_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };
How do we know this works?
32
Analysis: Control Combinations
Special cases that can arise on same clock cycle
Combination A
  Not-taken branch
  ret instruction at branch target
Combination B
  Instruction that reads from memory to %esp
  Followed by ret instruction
33
Control Combination A
[Figure: Combination A. Mispredicted JXX in E with ret in D.]
  Condition             F       D       E       M       W
  Processing ret        stall   bubble  normal  normal  normal
  Mispredicted branch   normal  bubble  bubble  normal  normal
  Combination           stall   bubble  bubble  normal  normal
Should be handled as mispredicted branch
Combination will also stall F pipeline register
But PC selection logic will be using M_valA anyway
Correct action taken!
34
Control Combination B
[Figure: Combination B. Load in E with ret (the use) in D.]
  Condition         F      D               E       M       W
  Processing ret    stall  bubble          normal  normal  normal
  Load/use hazard   stall  stall           bubble  normal  normal
  Combination       stall  bubble + stall  bubble  normal  normal
Would assert both bubble and stall for pipeline register D
Should be signaled by processor as pipeline error
Combination not handled correctly in control code 1.0
But it passed many simulation tests; caught only with systematic analysis
35
Control Combination B: Correct Handling
[Figure: Combination B. Load in E with ret (the use) in D.]
  Condition         F      D       E       M       W
  Processing ret    stall  bubble  normal  normal  normal
  Load/use hazard   stall  stall   bubble  normal  normal
  Combination       stall  stall   bubble  normal  normal
Load/use hazard should get priority
ret instruction should be held in decode stage for additional cycle
36
Corrected Pipeline Control Logic
Load/use hazard should get priority
ret instruction should be held in decode stage for additional cycle
  Condition         F      D       E       M       W
  Processing ret    stall  bubble  normal  normal  normal
  Load/use hazard   stall  stall   bubble  normal  normal
  Combination       stall  stall   bubble  normal  normal
bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode }
    # but not condition for a load/use hazard (new)
    && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });
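The kind of systematic analysis that exposed Combination B can be sketched by enumerating the special conditions and checking that no register is told to stall and bubble at once. This is an illustrative model, not the simulator's code: the three conditions are abstracted to booleans, and the function and key names are assumptions.

```python
from itertools import product

def pipe_control(ret, load_use, mispredict, corrected=True):
    """Control signals for the three special cases (booleans in, dict out)."""
    d_bubble_ret = ret and not (corrected and load_use)
    return {
        "F_stall": load_use or ret,
        "D_stall": load_use,
        "D_bubble": mispredict or d_bubble_ret,
        "E_bubble": mispredict or load_use,
    }

def conflicts(corrected):
    """List the condition combinations that stall and bubble D together."""
    found = []
    for ret, load_use, mispredict in product([False, True], repeat=3):
        # A load and a jump cannot both occupy the execute stage.
        if load_use and mispredict:
            continue
        c = pipe_control(ret, load_use, mispredict, corrected)
        if c["D_stall"] and c["D_bubble"]:
            found.append((ret, load_use, mispredict))
    return found

rev1_conflicts = conflicts(corrected=False)   # Combination B shows up here
fixed_conflicts = conflicts(corrected=True)   # empty after the fix
```

Here rev1_conflicts contains exactly the ret plus load/use case, which is the combination that random simulation tests missed.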
37
Lesson Learned Extensive and thorough testing is good, but it can’t prove a design correct Formal verification important, but field not mature enough for large-scale designs Important and active research area
38
Performance Metrics
Clock rate
  Measured in megahertz or gigahertz
  Function of stage partitioning and circuit design
  To increase: keep amount of work per stage small
Rate at which instructions are executed
  CPI: cycles per instruction
  On average, how many clock cycles does each instruction require (after completion of previous instruction)?
  CPI is a function of pipeline design and the program
    How frequently are branches mispredicted?
    How frequent are load stalls?
    How frequent are ret instructions?
39
CPI for PIPE
Ideal CPI = 1.0
  Fetch instruction each clock cycle
  Process new instruction every cycle
  Although each individual instruction has latency of 5 cycles
Actual CPI > 1.0
  Due to pipeline stalls, branch mispredictions
Computing CPI
  C clock cycles
  I instructions executed to completion
  B bubbles injected (C = I + B)
  CPI = C/I = (I+B)/I = 1.0 + B/I
  B/I represents average penalty (per instruction) due to bubbles
40
CPI for PIPE (Cont.)
B/I = LP + MP + RP
                                                    Typical values
LP: Penalty due to load/use hazard stalling
  Fraction of instructions that are loads           0.25
  Fraction of load instructions requiring stall     0.20
  Number of bubbles injected each time              1
  LP = 0.25 * 0.20 * 1 = 0.05
MP: Penalty due to mispredicted branches
  Fraction of instructions that are cond. jumps     0.20
  Fraction of cond. jumps mispredicted              0.40
  Number of bubbles injected each time              2
  MP = 0.20 * 0.40 * 2 = 0.16
RP: Penalty due to ret instructions
  Fraction of instructions that are returns         0.02
  Number of bubbles injected each time              3
  RP = 0.02 * 3 = 0.06
Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
CPI = 1.27 (Not bad! Assumes perfect memories.)
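The penalty model above is easy to parameterize, which helps when exploring how a different instruction mix or predictor accuracy changes CPI. The function and parameter names below are assumptions for this sketch; the default bubble counts are the ones quoted above.

```python
def pipe_cpi(load_frac, load_stall_frac, jump_frac, mispredict_frac, ret_frac,
             load_bubbles=1, mispredict_bubbles=2, ret_bubbles=3):
    """CPI = 1.0 + B/I, with B/I = LP + MP + RP."""
    lp = load_frac * load_stall_frac * load_bubbles        # load/use penalty
    mp = jump_frac * mispredict_frac * mispredict_bubbles  # misprediction penalty
    rp = ret_frac * ret_bubbles                            # ret penalty
    return 1.0 + lp + mp + rp

# Typical values from the table above.
typical = pipe_cpi(0.25, 0.20, 0.20, 0.40, 0.02)
# Same mix, but with the misprediction rate halved to 0.20.
better_prediction = pipe_cpi(0.25, 0.20, 0.20, 0.20, 0.02)
```

With the typical values this reproduces CPI = 1.27; halving the misprediction rate removes 0.08 of penalty, which shows that branches dominate the overhead in this mix.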
41
State-of-the-Art Pipelining What have we ignored in our Y86 implementation? Balancing delay in each stage Which stage is longest, how might we speed it up? Multicycle instructions Realistic memory systems
42
Fetch Logic Revisited
During fetch cycle
  1. Select PC
  2. Read bytes from instruction memory
  3. Examine icode to determine instruction length
  4. Increment PC
Timing
  Steps 2 & 4 require significant amount of time
[Figure: PIPE fetch stage, as in “PIPE: Fetch Details”]
43
Standard Fetch Timing
Must perform everything in sequence
Can’t compute incremented PC until we know value to increment it with
Why is increment slow? How could we speed this up?
[Timing diagram, 1 clock cycle: Select PC, then Mem. Read, then need_regids / need_valC, then Increment]
44
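A Fast PC Increment Circuit
[Circuit diagram: PC is split into high-order 29 bits and low-order 3 bits. A 3-bit adder (fast) adds the increment, derived from need_regids and need_ValC, to the low-order 3 bits and produces a carry. A 29-bit incrementer (slow) adds 1 to the high-order 29 bits. A MUX controlled by the carry selects the unchanged or the incremented high-order bits, and the result is combined with the adder output to form incrPC.]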
45
Modified Fetch Timing
29-bit incrementer
  Acts as soon as PC selected
  Output not needed until final MUX
  Works in parallel with memory read
[Timing diagram, 1 clock cycle: Select PC, then Mem. Read and the incrementer in parallel, then need_regids / need_valC, 3-bit add, and MUX; shown against the standard cycle]
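The split-increment idea can be checked in software. The sketch below assumes a 32-bit PC and an increment of 1 + need_regids + 4*need_valC (at most 6, so the low-order 3-bit add carries out at most 1); the function name is made up for illustration.

```python
def fast_pc_incr(pc, need_regids, need_valC):
    """Increment a 32-bit PC using a 3-bit add plus a precomputed high+1."""
    high, low = pc >> 3, pc & 0x7
    # 29-bit incrementer: depends only on the PC, so in hardware it can
    # run in parallel with the instruction memory read.
    high_plus_1 = (high + 1) & ((1 << 29) - 1)
    # 3-bit adder: needs need_regids/need_valC, which arrive late.
    amount = 1 + int(need_regids) + 4 * int(need_valC)
    total = low + amount
    carry, low_out = total >> 3, total & 0x7
    # Final MUX: the carry selects which high-order half to use.
    high_out = high_plus_1 if carry else high
    return (high_out << 3) | low_out
```

Only the short 3-bit add and the MUX sit after the memory read on the critical path; the slow 29-bit carry chain has already finished by then.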
46
State-of-the-Art Pipelining
Other issues to consider
  More complex instructions: consider FP divide, sqrt
    Take many cycles to execute
    Forwarding can’t resolve hazards: more data stalls
    Important for compiler to schedule code
  Deeper pipelines to allow faster cycle times
    Increased penalty from misprediction, data hazard stalls, etc.
    Increased emphasis on branch prediction
  Actual memory hierarchy issues (will increase CPI)
    Difficult to complete memory access in one cycle!
    Possibility of cache misses, TLB misses, page faults
  Superscalar/VLIW: process multiple instructions/cycle
  Dynamic scheduling (discussed in Chapter 5)
    Scheduling = determining instruction execution order
    Hardware decides based on data dependencies, resources
47
Pentium 4 Pipeline
Very deep pipeline
  Enables very high clock rates, but 20+ cycle branch penalty
  Slower than Pentium III for a given clock rate
Stages
  1-2    TC Nxt IP
  3-4    TC Fetch
  5      Drive
  6      Alloc
  7-8    Rename
  9      Que
  10-12  Sch
  13-14  Disp
  15-16  RF
  17     Ex
  18     Flgs
  19     Br Ck
  20     Drive
48
Multicycle FP Operations
Multiple functional units: one approach
  Special-purpose hardware for FP operations
  Increased latency causes more frequent stalls
  Can cause “structural” hazards
[Figure: F and D stages feed parallel functional units (single-cycle integer unit, fully pipelined multiplier, non-pipelined divider, fully pipelined FP adder), followed by M and W stages]
49
Dynamic scheduling
Out-of-order execution engine: one view (Pentium 4)
[Image from www.xbitlabs.com. Labels: Fetching, decoding, translation of x86 instrs to uops; to support precise exceptions and recovery from mispredicted branches]
50
Branch Prediction: Simplistic
Branch history
  Encode information about the prior history of each individual branch instruction; store as hash table on instr. address
  Use history to predict branch outcome
State machine stores history
  States, left to right: Yes!, Yes?, No?, No!
  Each time branch taken (T), move to left
  Each time branch not taken (NT), move to right
  In state Yes*, predict taken; in state No*, predict not taken
  Can be encoded using 2 bits per table entry
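The four-state machine can be sketched as a table of 2-bit saturating counters. This is a minimal illustration under assumptions: the class name, the table size, indexing by the low-order address bits, and the initial state (“Yes?”) are all choices made here, not details given above.

```python
# State encoding, left to right: 0 = Yes!, 1 = Yes?, 2 = No?, 3 = No!
STATES = ["Yes!", "Yes?", "No?", "No!"]

class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        # One 2-bit state per entry; start every entry in "Yes?" (assumed).
        self.table = [1] * (1 << table_bits)

    def predict(self, addr):
        """Predict taken in the Yes* states, not taken in the No* states."""
        return self.table[addr & self.mask] <= 1

    def update(self, addr, taken):
        """Move left on taken, right on not taken, saturating at the ends."""
        i = addr & self.mask
        state = self.table[i]
        self.table[i] = max(state - 1, 0) if taken else min(state + 1, 3)

p = TwoBitPredictor()
history = []
for outcome in [False, False, True, True]:
    history.append(p.predict(0x400))  # prediction made before the outcome
    p.update(0x400, outcome)
final_state = STATES[p.table[0x400 & p.mask]]
```

The two-step hysteresis means a single atypical outcome, such as a loop exit, does not flip the prediction for a strongly biased branch.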
51
Branch Prediction: Realistic
Alpha 21264 “tournament” predictor
  Input: addr. of branch instr
  Selector: 12 bits index 4K entries, each 2 bits, to select predictor (8K bits)
  Global predictor: 12-bit shift register of last branch outcomes globally; 4K entries, standard 2-bit branch predictors (8K bits)
  Local predictor: 10 bits of branch address; 1K entries, each 10 bits, history of behavior of this branch (10K bits); 1K entries, each a 3-bit saturation counter (3K bits)
Predictor size: 8K + 8K + 10K + 3K = 29K bits!
52
Discussion Questions
Data hazards on register values can be dealt with by stalling or by forwarding.
  Can hazards occur on condition codes?
  Can hazards occur on data memory accesses?
Can software be responsible for pipeline correctness?
  Schedule instructions, use nops
  DSP chips historically have had exposed pipelines
Relationship of hazards and dependencies
  Does every data dependence cause a data hazard?
  Is every data hazard caused by a data dependence?