Data Hazard Solution 2: Data Forwarding
Our naïve pipeline would experience many data stalls
- Register isn't written until completion of the write-back stage
- Source operands are read from the register file in the decode stage
- Values need to be in the register file at the start of that stage
- Leads to many more stall cycles than necessary
Key observation
- The value we want is generated in the execute or memory stage
- It is "available" 1-2 cycles before write-back
Trick: go get it!
- Pass the value directly from the stage of the generating instruction to the decode stage
- Must be available before the end of the decode stage to avoid a stall

Detecting Stall Condition (# demo-h2.ys)
  0x000: irmovl $10,%edx
  0x006: irmovl $3,%eax
  0x00c: nop
  0x00d: nop
  0x00e: addl %edx,%eax
  0x010: halt
In cycle 6, addl is in decode (srcA = %edx, srcB = %eax) while irmovl $3,%eax is still in write-back (W_dstE = %eax, W_valE = 3), so addl must repeat decode and a bubble is injected into execute.

Data Forwarding Example
- irmovl is in the write-back stage
- Destination value is in the W pipeline register
- "Forward" it as valB for the decode stage
- addl instruction can proceed without stalling
- When do we actually know the values of %eax and %edx?

Data Forwarding Example #2
Register %edx
- Generated by the ALU during the previous cycle
- Forward from the memory stage as valA
Register %eax
- Value just generated by the ALU
- Forward from the execute stage as valB

Forwarding Hardware (PIPE)
- Feedback paths from the E, M, and W pipeline registers back to the decode stage
- Logic blocks select the source for valA and valB in the decode stage
- Note: on a data hazard we either forward or stall
- Forwarding gives better performance at a higher hardware cost

Forwarding Control (PIPE)

## Actions of "Sel+Fwd A" block
## Pick the correct A value
## Order is important!
int new_E_valA = [
    # Use incremented PC
    D_icode in { ICALL, IJXX } : D_valP;
    # Forward valE from execute
    d_srcA == E_dstE : e_valE;
    # Forward valM from memory
    d_srcA == M_dstM : m_valM;
    # Forward valE from memory
    d_srcA == M_dstE : M_valE;
    # Forward valM from write back
    d_srcA == W_dstM : W_valM;
    # Forward valE from write back
    d_srcA == W_dstE : W_valE;
    # Use value read from register file
    1 : d_rvalA;
];
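To make the priority order concrete, the following Python sketch walks the same case list top to bottom and returns the first match. The names PipeState and select_valA, and the use of None for "no destination register", are assumptions made for this illustration; the HCL above is the actual specification.

# Illustrative model of the "Sel+Fwd A" priority chain from the HCL above;
# this is not the CS:APP simulator's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipeState:
    D_icode: str                  # icode of instruction in decode ("CALL", "JXX", ...)
    D_valP: int                   # incremented PC carried with that instruction
    d_srcA: Optional[str]         # register ID decode wants to read (None if unused)
    d_rvalA: int                  # value read from the register file
    E_dstE: Optional[str] = None  # pending destinations/values, newest first ...
    e_valE: int = 0
    M_dstM: Optional[str] = None
    m_valM: int = 0
    M_dstE: Optional[str] = None
    M_valE: int = 0
    W_dstM: Optional[str] = None
    W_valM: int = 0
    W_dstE: Optional[str] = None  # ... oldest
    W_valE: int = 0

def select_valA(s: PipeState) -> int:
    """First matching case wins, so the newest pending value takes priority."""
    if s.D_icode in ("CALL", "JXX"):
        return s.D_valP                       # use incremented PC instead
    forwards = [(s.E_dstE, s.e_valE), (s.M_dstM, s.m_valM), (s.M_dstE, s.M_valE),
                (s.W_dstM, s.W_valM), (s.W_dstE, s.W_valE)]
    for dst, val in forwards:
        if s.d_srcA is not None and s.d_srcA == dst:
            return val                        # forward from the nearest stage
    return s.d_rvalA                          # otherwise use the register-file value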

Forwarding Example (PIPE)
At clock cycle 4, fill in:
- d_srcA = ?    d_srcB = ?
- E_dstE = ?    e_valE = ?
- M_dstE = ?    M_valE = ?
What are the forwarding conditions? Highlight the data path in the D, E, and M stages.

Forwarding Example (PIPE): answers
At clock cycle 4:
- d_srcA = %ecx    d_srcB = %edx
- E_dstE = %ecx    e_valE = 3
- M_dstE = %edx    M_valE = 128
Forwarding conditions: d_srcA == E_dstE, so e_valE is forwarded from execute as valA; d_srcB == M_dstE, so M_valE is forwarded from memory as valB.

Limitation of Forwarding
Load-use dependency
- Value is needed by the end of the decode stage in cycle 7
- Value is not read from memory until the memory stage in cycle 8
Terminology
- This is a load/use hazard
- The only hardware solution is a load stall
- Preferred: have the compiler avoid it

Load/Use Hazard: Desired Behavior
Best we can do in hardware
- Stall the reading instruction for one cycle
- Then forward the value from the memory stage
Better yet
- Have the compiler avoid the hazard in the code it generates

Addressing the Load/Use Hazard
Detection
- Previous instruction is loading from memory into a source register
- dstM in the E pipeline register matches srcA or srcB (and is not 0xF, the "no register" ID)
Action
- Stall the instruction in decode for one cycle (see the sketch below)
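A minimal Python sketch of this detection condition; the function name and the string instruction codes are assumptions for this illustration, mirroring the HCL signals used later in these slides.

# Illustrative load/use hazard check; mirrors the HCL condition
#   E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
RNONE = 0xF   # Y86 "no register" ID

def load_use_hazard(E_icode: str, E_dstM: int, d_srcA: int, d_srcB: int) -> bool:
    """True when the instruction in execute is a load (mrmovl/popl) whose
    destination register is a source of the instruction now in decode."""
    is_load = E_icode in ("IMRMOVL", "IPOPL")
    return is_load and E_dstM != RNONE and E_dstM in (d_srcA, d_srcB)

# Example: mrmovl into %eax (register 0) followed by addl %eax,%ebx
assert load_use_hazard("IMRMOVL", 0x0, 0x0, 0x3)      # stall needed
assert not load_use_hazard("IRMOVL", 0x0, 0x0, 0x3)   # immediate move: forwarding suffices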

Interrupts and Exceptions
Basic interrupt mechanism
- CPU is running the current process (... instr i, instr i+1, instr i+2, ...)
- An event occurs that needs attention (e.g., a disk read finishes)
- Hardware asserts the CPU interrupt line
- Control is transferred to an interrupt handler, which runs its own instruction sequence (think of it as a hardware-induced function call)
Questions
- How is the state of the interrupted process saved?
- How is the location of the handler determined?

Interrupt Handling
Calling the handler
- Save the return address (PC) on the stack
  - Address of the next instruction to be executed for this process (depending on the event, either the current or the next instruction)
  - The PC is usually passed through the pipeline along with the instruction
  - Precise exception: all instructions up to the PC are executed, none past it (a "clean break")
- Jump to the handler address
  - Usually obtained from a table stored at a fixed memory address
  - The table index is determined by the exception type / interrupt priority level
  - The interrupt vector table is written by software and accessed by hardware
Implementation
- Critical for real hardware
- Seldom implemented in simulators: there is no OS running to pass control to!
A sketch of the table-driven dispatch follows.
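As a rough illustration of the dispatch described above, here is a hedged Python sketch. The table contents, event names, addresses, and the dispatch function are invented for this example and do not correspond to any particular hardware or to the Y86 tools.

# Illustrative interrupt dispatch: save a return PC, then index a vector table
# by exception/interrupt type. All names and addresses here are made up.
VECTOR_TABLE = {            # type -> handler address (normally at a fixed memory location)
    "timer":      0x1000,
    "disk_done":  0x1040,
    "page_fault": 0x1080,
    "illegal_op": 0x10C0,
}

def dispatch(event: str, current_pc: int, next_pc: int, stack: list) -> int:
    """Push the return address and return the handler address to jump to.
    Faulting events re-execute the current instruction; others resume at next_pc."""
    return_pc = current_pc if event == "page_fault" else next_pc
    stack.append(return_pc)            # save the return address on the stack
    return VECTOR_TABLE[event]         # handler address from the vector table

stack: list = []
handler = dispatch("disk_done", current_pc=0x0200, next_pc=0x0206, stack=stack)
print(hex(handler), [hex(a) for a in stack])   # 0x1040 ['0x206']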

Exceptions
Events occurring within the processor under which the pipeline cannot continue normal operation
Possible causes
- Halt instruction executed (Current)
- Bad address for instruction or data (Previous)
- Invalid instruction (Previous)
- Pipeline control error (Previous)
- System calls, page faults, math errors (not in Y86)
Desired action
- Complete instructions up to a specific point, either the current or the previous instruction (depends on exception type)
- Discard the instructions that follow
- Transfer control to the exception handler in the OS
- Save the return address; get the handler address from the table

Exception Examples
Detect in fetch stage
  jmp $-1      # Invalid jump target
  .byte 0xFF   # Invalid instruction code
  halt         # Halt instruction
Detect in memory stage
  irmovl $100,%eax
  rmmovl %eax,0x10000(%eax)   # Invalid address (for the Y86 tools)

Exceptions in Pipeline Processor (#1)
# demo-exc1.ys
  0x000: irmovl $100,%eax
  0x006: rmmovl %eax,0x10000(%eax)   # Invalid address
  0x00c: nop
  0x00d: .byte 0xFF                  # Invalid instruction code
Desired behavior
- rmmovl should cause the exception (it comes first in a sequential machine)
- Tricky because the invalid instruction code is detected first: .byte 0xFF is flagged in fetch (cycle 4), while the bad address of rmmovl is not detected until its memory stage (cycle 5)

Exceptions in Pipeline Processor (#2)
# demo-exc2.ys
  0x000: xorl %eax,%eax      # Set condition codes
  0x002: jne t               # Not taken
  0x007: irmovl $1,%eax
  0x00d: irmovl $2,%edx
  0x013: halt
  0x014: t: .byte 0xFF       # Target
Desired behavior
- No exception should occur: the branch is not taken, so the invalid byte at the target is never "really" executed
- Must match the behavior and results of sequential execution
- But with predict-taken branch prediction, .byte 0xFF is fetched and an exception is detected before the misprediction is discovered

Correct Exception Handling
Challenges: respond to exceptions in program order, and only to those that "really" occur
- Motivation for the exception status field (stat) carried in every pipeline register (D, E, M, and W all hold stat alongside icode, valE, valM, dstE, dstM, ...)
- Fetch stage sets stat to "AOK", "ADR" (bad fetch address), or "INS" (illegal instruction)
- Decode and execute stages pass the value through
- Memory stage either passes it through or sets it to "ADR"
- CPU responds to an exception only when the instruction reaches write-back
A sketch of this stat propagation follows.
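A minimal Python sketch of how a stat value might propagate stage to stage, using made-up helper names; in the real PIPE design this is simply wiring plus the memory-stage override.

# Illustrative stat propagation through the pipeline registers.
# Status codes follow the slide: AOK, ADR (bad address), INS (illegal instruction).
def fetch_stat(imem_error: bool, instr_valid: bool) -> str:
    if imem_error:
        return "ADR"
    if not instr_valid:
        return "INS"
    return "AOK"

def memory_stat(incoming_stat: str, dmem_error: bool) -> str:
    # Decode/execute pass stat through; memory may override with ADR.
    return "ADR" if (incoming_stat == "AOK" and dmem_error) else incoming_stat

def writeback_should_trap(W_stat: str) -> bool:
    # The CPU responds to the exception only at write-back.
    return W_stat != "AOK"

stat = fetch_stat(imem_error=False, instr_valid=True)   # "AOK"
stat = memory_stat(stat, dmem_error=True)                # becomes "ADR"
print(stat, writeback_should_trap(stat))                 # ADR True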

Avoiding Side Effects
# demo-exc3.ys
  0x000: irmovl $100,%eax
  0x006: rmmovl %eax,0x10000(%eax)   # Invalid address
  0x00c: addl %eax,%eax              # Sets condition codes
Desired behavior
- rmmovl should cause the exception
- No following instruction should change any state
- Special challenge: the addl would set the condition codes in the same cycle the exception is detected in the memory stage

Avoiding Side Effects (cont.)
An exception should disable state updates by the following instructions
- When the exception is detected in the memory stage
  - Disable condition code setting in execute
  - Must happen in the same clock cycle
- When the exception passes to the write-back stage
  - Disable the memory write in the memory stage
  - Disable condition code setting in the execute stage
Let's see how these are handled in the PIPE processor

PIPE: Fetch Details
Main points
- Branch prediction
- Branch misprediction recovery
- Return handling
- Stat initialization
(Fetch-stage datapath: Select PC, instruction memory, split/align, PC increment, predPC; inputs include M_Cnd, M_valA, W_icode, W_valM; outputs include stat and imem_error.)

PIPE: Decode and Write-back
Main points
- Forwarding logic and paths
- Forwarding priority
- valA and valP merged into a single pipeline value

PIPE: Execute
Main points
- Condition code update inhibited by prior exceptions
- Values for forwarding (e_valE, e_Cnd, e_dstE)
- Special handling for dstE (?)
(Execute-stage datapath: ALU A/B selection, ALU function, ALU, Set CC, cond logic.)

PIPE: Memory and Write-back
Main points
- Values for forwarding (m_valM, W_valE, W_valM)
- Stat update logic (dmem_error can raise m_stat)
- Feedback for branch misprediction recovery (M_Cnd, M_valA)

Pipeline Control: Register Modes
- Normal (stall = 0, bubble = 0): on the rising clock, the register output takes on the input value
- Stall (stall = 1, bubble = 0): on the rising clock, the register keeps its current output; the input is ignored
- Bubble (stall = 0, bubble = 1): on the rising clock, the register is loaded with a nop state
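The following Python sketch models these three modes for a single pipeline register; the class and method names are invented for this illustration.

# Illustrative model of a pipeline register with normal/stall/bubble modes.
class PipeRegister:
    def __init__(self, nop_state: dict):
        self.nop_state = dict(nop_state)   # state to load when bubbling
        self.output = dict(nop_state)      # current output (starts as a nop)

    def rising_clock(self, inp: dict, stall: bool = False, bubble: bool = False):
        if stall and bubble:
            raise RuntimeError("stall and bubble asserted together: pipeline error")
        if bubble:
            self.output = dict(self.nop_state)   # inject a bubble
        elif not stall:
            self.output = dict(inp)              # normal: latch the input
        # stall: keep the current output, ignore the input

reg = PipeRegister(nop_state={"icode": "NOP"})
reg.rising_clock({"icode": "ADDL"})                 # normal -> ADDL
reg.rising_clock({"icode": "HALT"}, stall=True)     # stall  -> still ADDL
reg.rising_clock({"icode": "HALT"}, bubble=True)    # bubble -> NOP
print(reg.output["icode"])                          # NOP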

PIPE Control Logic
Handles special cases
- Handles ret, load/use hazards, misprediction recovery, and exceptions
- The existing PIPE logic already handles forwarding and branch prediction
(Control inputs: D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Cnd, m_stat, W_stat. Control outputs: F_stall, D_stall, D_bubble, E_bubble, M_bubble, W_stall, set_cc.)

PIPE: Actual ret handling (# prog7)
  0x000: irmovl Stack,%edx
  0x006: call proc
  0x020: ret
  0x00b: irmovl $10,%edx        # Return point
Simplified view
- While ret moves through decode, execute, and memory, three bubbles flow behind it; the return point is fetched once ret reaches write-back and supplies W_valM
What the hardware actually does
- It keeps fetching the instruction after ret (0x021: rrmovl %edx,%ebx, never executed) and replaces it with a bubble in decode on each of the three cycles
- The net effect is identical to the simplified view

PIPE: Actual exception handling (# prog10)
  0x000: irmovl $1,%eax
  0x006: xorl %esp,%esp       # CC = 100
  0x008: pushl %eax           # Uses a bad memory address
  0x00a: addl %eax,%eax
  0x00c: irmovl $2,%eax
Scenario: pushl uses a bad memory address
Actions
- Disable condition code setting: in cycle 6, mem_error = 1 for pushl in memory while addl is in execute, so set_cc is forced to 0 and CC stays 100
- Inject bubbles into the memory stage
- Stall the excepting instruction in write-back

Special Control Cases: Exceptions
Detection
  Condition     Trigger
  Exception     m_stat in {SADR, SINS, SHLT} || W_stat in {SADR, SINS, SHLT}
Action (on next cycle)
  Condition     F        D        E        M        W
  Exception     normal   normal   normal   bubble   stall
Also: disable setting of the condition codes in execute in the current cycle

Special Control Cases: Non-exceptions
Detection
  Condition             Trigger
  Processing ret        IRET in { D_icode, E_icode, M_icode }
  Load/use hazard       E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
  Mispredicted branch   E_icode == IJXX && !e_Cnd
Action (on next cycle)
  Condition             F        D        E        M        W
  Processing ret        stall    bubble   normal   normal   normal
  Load/use hazard       stall    stall    bubble   normal   normal
  Mispredicted branch   normal   bubble   bubble   normal   normal

Pipeline Control, rev. 1.0

bool F_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };

bool D_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };

bool E_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

How do we know this works?

Analysis: Control Combinations
Special cases that can arise on the same clock cycle
Combination A
- Not-taken branch
- ret instruction at the branch target
Combination B
- Instruction that reads from memory into %esp
- Followed by a ret instruction

Control Combination A (mispredicted JXX in execute, ret in decode)
- Should be handled as a mispredicted branch
- The combination will also stall the F pipeline register
- But the PC selection logic will be using M_valA anyway
- Correct action taken!

  Condition             F        D        E        M        W
  Processing ret        stall    bubble   normal   normal   normal
  Mispredicted branch   normal   bubble   bubble   normal   normal
  Combination           stall    bubble   bubble   normal   normal

Control Combination B (load/use hazard with ret in decode)
- Would assert both bubble and stall for pipeline register D
- Should be signaled by the processor as a pipeline error
- The combination is not handled correctly by control code rev. 1.0
- Yet it passed many simulation tests; it was caught only by systematic analysis

  Condition          F        D                E        M        W
  Processing ret     stall    bubble           normal   normal   normal
  Load/use hazard    stall    stall            bubble   normal   normal
  Combination        stall    bubble + stall   bubble   normal   normal

Control Combination B: Correct Handling
- The load/use hazard should get priority
- The ret instruction should be held in the decode stage for an additional cycle

  Condition          F        D        E        M        W
  Processing ret     stall    bubble   normal   normal   normal
  Load/use hazard    stall    stall    bubble   normal   normal
  Combination        stall    stall    bubble   normal   normal

Corrected Pipeline Control Logic
- The load/use hazard gets priority
- The ret instruction is held in the decode stage for an additional cycle

  Condition          F        D        E        M        W
  Processing ret     stall    bubble   normal   normal   normal
  Load/use hazard    stall    stall    bubble   normal   normal
  Combination        stall    stall    bubble   normal   normal

bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Cnd) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode }
    # New: but not when this is also a load/use hazard
    && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });
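A hedged Python sketch of the corrected per-stage control decisions; the function name and string instruction codes are invented for this illustration, and the HCL above remains the actual specification.

# Illustrative recomputation of the pipeline control signals, including the fix:
# ret in decode does NOT bubble D when a load/use hazard is stalling it.
def control_signals(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Cnd):
    load_use = E_icode in ("IMRMOVL", "IPOPL") and E_dstM in (d_srcA, d_srcB)
    ret_in_pipe = "IRET" in (D_icode, E_icode, M_icode)
    mispredict = (E_icode == "IJXX") and not e_Cnd

    F_stall = load_use or ret_in_pipe
    D_stall = load_use
    D_bubble = mispredict or (ret_in_pipe and not load_use)   # the corrected clause
    E_bubble = mispredict or load_use
    return {"F_stall": F_stall, "D_stall": D_stall,
            "D_bubble": D_bubble, "E_bubble": E_bubble}

# Combination B: a load in execute writes %esp (register 4) while ret in decode reads it.
sig = control_signals(D_icode="IRET", E_icode="IMRMOVL", M_icode="INOP",
                      E_dstM=4, d_srcA=4, d_srcB=4, e_Cnd=True)
assert sig["D_stall"] and not sig["D_bubble"]   # stall register D, don't also bubble it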

Lesson Learned
- Extensive and thorough testing is good, but it can't prove a design correct
- Formal verification is important, but the field is not yet mature enough for large-scale designs; it remains an important and active research area

Performance Metrics
Clock rate
- Measured in megahertz or gigahertz
- A function of stage partitioning and circuit design
- To increase it: keep the amount of work per stage small
Rate at which instructions are executed
- CPI: cycles per instruction
- On average, how many clock cycles does each instruction require (after completion of the previous instruction)?
- CPI is a function of both the pipeline design and the program
  - How frequently are branches mispredicted?
  - How frequent are load stalls?
  - How frequent are ret instructions?

CPI for PIPE
Ideal CPI = 1.0
- Fetch an instruction each clock cycle
- Process a new instruction every cycle
- Although each individual instruction has a latency of 5 cycles
Actual CPI > 1.0
- Due to pipeline stalls and branch mispredictions
Computing CPI
- C clock cycles
- I instructions executed to completion
- B bubbles injected (C = I + B)
- CPI = C/I = (I + B)/I = 1.0 + B/I
- B/I represents the average penalty per instruction due to bubbles

CPI for PIPE (Cont.)
B/I = LP + MP + RP (typical values shown)
LP: penalty due to load/use hazard stalls
- Fraction of instructions that are loads: 0.25
- Fraction of load instructions requiring a stall: 0.20
- Number of bubbles injected each time: 1
- LP = 0.25 * 0.20 * 1 = 0.05
MP: penalty due to mispredicted branches
- Fraction of instructions that are conditional jumps: 0.20
- Fraction of conditional jumps mispredicted: 0.40
- Number of bubbles injected each time: 2
- MP = 0.20 * 0.40 * 2 = 0.16
RP: penalty due to ret instructions
- Fraction of instructions that are returns: 0.02
- Number of bubbles injected each time: 3
- RP = 0.02 * 3 = 0.06
Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27, so CPI = 1.27 (not bad! assumes perfect memories)
A short calculation of these numbers appears below.
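A small Python check of the arithmetic above; the frequencies are the slide's typical values, not measurements, and the helper name is invented for this sketch.

# Recompute the CPI estimate from the slide's typical frequencies.
def penalty(freq_instr, freq_event, bubbles):
    """Average bubbles per instruction contributed by one hazard type."""
    return freq_instr * freq_event * bubbles

LP = penalty(0.25, 0.20, 1)   # load/use stalls
MP = penalty(0.20, 0.40, 2)   # mispredicted branches
RP = penalty(0.02, 1.00, 3)   # ret instructions (every ret pays 3 bubbles)
cpi = 1.0 + LP + MP + RP
print(round(LP, 2), round(MP, 2), round(RP, 2), round(cpi, 2))   # 0.05 0.16 0.06 1.27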

State-of-the-Art Pipelining
What have we ignored in our Y86 implementation?
- Balancing the delay in each stage
  - Which stage is longest, and how might we speed it up?
- Multicycle instructions
- Realistic memory systems

Fetch Logic Revisited
During the fetch cycle:
1. Select PC
2. Read bytes from instruction memory
3. Examine icode to determine instruction length
4. Increment PC
Timing
- Steps 2 and 4 require a significant amount of time

Standard Fetch Timing
- Must perform everything in sequence within one clock cycle: Select PC, then memory read, then increment
- Can't compute the incremented PC until we know the amount to increment it by (need_regids, need_valC)
- Why is the increment slow? How could we speed this up?

A Fast PC Increment Circuit
- Split the PC into high-order 29 bits and low-order 3 bits
- A fast 3-bit adder computes the new low-order bits from need_regids and need_valC
- A slow 29-bit incrementer computes "high bits + 1" in parallel; a MUX driven by the adder's carry selects between the original and incremented high bits
- A sketch of this split is shown below
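A minimal Python sketch of the split increment, assuming the Y86-32 encoding where an instruction is 1 byte plus an optional register byte plus an optional 4-byte constant; the function name is illustrative.

# Model of the fast PC increment: a 3-bit add on the low bits, with a
# pre-computed 29-bit increment selected only when the low add carries out.
def fast_pc_increment(pc: int, need_regids: bool, need_valC: bool) -> int:
    amount = 1 + (1 if need_regids else 0) + (4 if need_valC else 0)  # 1..6
    high, low = pc >> 3, pc & 0x7
    low_sum = low + amount                 # fast 3-bit adder
    carry = low_sum >> 3                   # 0 or 1
    high = high + 1 if carry else high     # 29-bit incrementer + MUX, computed in parallel
    return ((high << 3) | (low_sum & 0x7)) & 0xFFFFFFFF

# Matches a plain 32-bit add for a 6-byte irmovl at address 0x006:
assert fast_pc_increment(0x006, need_regids=True, need_valC=True) == 0x00C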

Modified Fetch Timing
29-bit incrementer
- Acts as soon as the PC is selected
- Its output is not needed until the final MUX
- Works in parallel with the memory read, shortening the critical path relative to the standard cycle

State-of-the-Art Pipelining
Other issues to consider
- More complex instructions (e.g., FP divide, sqrt)
  - Take many cycles to execute
  - Forwarding can't resolve the hazards: more data stalls
  - Important for the compiler to schedule code
- Deeper pipelines to allow faster cycle times
  - Increased penalty from mispredictions, data hazard stalls, etc.
  - Increased emphasis on branch prediction
- Actual memory hierarchy issues (will increase CPI)
  - Difficult to complete a memory access in one cycle!
  - Possibility of cache misses, TLB misses, page faults
- Superscalar/VLIW: process multiple instructions per cycle
- Dynamic scheduling (discussed in Chapter 5)
  - Scheduling = determining instruction execution order
  - Hardware decides based on data dependencies and resources

Pentium 4 Pipeline
Very deep pipeline (stages include TC Nxt IP, TC Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, RF, Ex, Flags, Br Ck, Drive)
- Enables very high clock rates, but a 20+ cycle branch penalty
- Slower than a Pentium III at a given clock rate

Multicycle FP Operations
Multiple functional units: one approach
- Special-purpose hardware for FP operations alongside the single-cycle integer unit: a fully pipelined multiplier, a fully pipelined FP adder, and a non-pipelined divider
- Increased latency causes more frequent stalls and can cause "structural" hazards

Dynamic scheduling
Out-of-order execution engine: one view (Pentium 4)
- Fetching, decoding, and translation of x86 instructions into uops supports precise exceptions and recovery from mispredicted branches

Branch Prediction: Simplistic
Branch history
- Record the prior behavior of each individual branch instruction; store it in a table hashed on the instruction address
- Use the history to predict the branch outcome
State machine stores the history (states Yes!, Yes?, No?, No!)
- Each time the branch is taken, move one state toward Yes!
- Each time the branch is not taken, move one state toward No!
- In a Yes* state, predict taken; in a No* state, predict not taken
- Can be encoded using 2 bits per table entry (see the sketch below)
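A minimal Python sketch of such a 2-bit saturating predictor; the class name and table size are illustrative assumptions.

# Illustrative 2-bit saturating branch predictor indexed by instruction address.
class TwoBitPredictor:
    # Counter values 0..3 correspond to the states No!, No?, Yes?, Yes!
    def __init__(self, table_bits: int = 10):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)   # start weakly not-taken ("No?")

    def predict(self, addr: int) -> bool:
        return self.table[addr & self.mask] >= 2   # Yes?/Yes! -> predict taken

    def update(self, addr: int, taken: bool) -> None:
        i = addr & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # move toward Yes!
        else:
            self.table[i] = max(0, self.table[i] - 1)   # move toward No!

bp = TwoBitPredictor()
for outcome in [True, True, False, True]:   # loop-like branch
    print(bp.predict(0x4000), end=" ")      # prints: False True True True
    bp.update(0x4000, outcome)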

Branch Prediction: Realistic
Alpha "tournament" predictor
- Selector: 4K entries, each 2 bits, to choose which predictor to trust (8K bits)
- Global predictor: a 12-bit shift register of the last branch outcomes (globally) indexes 4K entries of standard 2-bit branch predictors (8K bits)
- Local predictor: 10 bits of the branch address index 1K entries of 10-bit per-branch histories (10K bits), each history indexing 1K entries of 3-bit saturating counters (3K bits)
- Predictor size: 8K + 8K + 10K + 3K = 29K bits!
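A quick Python check of the size arithmetic; the entry counts and widths are taken from the slide, and the dictionary labels are just for readability.

# Verify the 29K-bit total for the tournament predictor's tables.
K = 1024
tables = {
    "selector (4K x 2 bits)":         4 * K * 2,
    "global counters (4K x 2 bits)":  4 * K * 2,
    "local histories (1K x 10 bits)": 1 * K * 10,
    "local counters (1K x 3 bits)":   1 * K * 3,
}
total_bits = sum(tables.values())
print(total_bits, total_bits // K)   # 29696 bits = 29 Kbits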

Discussion Questions
Data hazards on register values can be dealt with by stalling or by forwarding
- Can hazards occur on condition codes?
- Can hazards occur on data memory accesses?
Can software be responsible for pipeline correctness?
- Schedule instructions, insert nops
- DSP chips historically have had exposed pipelines
Relationship between hazards and dependencies
- Does every data dependence cause a data hazard?
- Is every data hazard caused by a data dependence?