Real-World Pipelines: Car Washes

Slides:



Advertisements
Similar presentations
Randal E. Bryant Carnegie Mellon University CS:APP CS:APP Chapter 4 Computer Architecture PipelinedImplementation Part II CS:APP Chapter 4 Computer Architecture.
Advertisements

University of Amsterdam Computer Systems – the processor architecture Arnoud Visser 1 Computer Systems The processor architecture.
Lecture 4: CPU Performance
Pipelining I Topics Pipelining principles Pipeline overheads Pipeline registers and stages Systems I.
Chapter 8. Pipelining.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
1 Seoul National University Wrap-Up. 2 Overview Seoul National University Wrap-Up of PIPE Design  Exception conditions  Performance analysis Modern.
Real-World Pipelines: Car Washes Idea  Divide process into independent stages  Move objects through stages in sequence  At any instant, multiple objects.
PipelinedImplementation Part I PipelinedImplementation.
Instructor: Erol Sahin
– 1 – Chapter 4 Processor Architecture Pipelined Implementation Chapter 4 Processor Architecture Pipelined Implementation Instructor: Dr. Hyunyoung Lee.
Randal E. Bryant Carnegie Mellon University CS:APP2e CS:APP Chapter 4 Computer Architecture SequentialImplementation CS:APP Chapter 4 Computer Architecture.
PipelinedImplementation Part I CSC 333. – 2 – Overview General Principles of Pipelining Goal Difficulties Creating a Pipelined Y86 Processor Rearranging.
Appendix A Pipelining: Basic and Intermediate Concepts
Randal E. Bryant CS:APP Chapter 4 Computer Architecture SequentialImplementation CS:APP Chapter 4 Computer Architecture SequentialImplementation Slides.
Datapath Design II Topics Control flow instructions Hardware for sequential machine (SEQ) Systems I.
Pipelining III Topics Hazard mitigation through pipeline forwarding Hardware support for forwarding Forwarding to mitigate control (branch) hazards Systems.
David O’Hallaron Carnegie Mellon University Processor Architecture PIPE: Pipelined Implementation Part I Processor Architecture PIPE: Pipelined Implementation.
Data Hazard Solution 2: Data Forwarding Our naïve pipeline would experience many data stalls  Register isn’t written until completion of write-back stage.
1 Seoul National University Pipelined Implementation : Part I.
1 Naïve Pipelined Implementation. 2 Outline General Principles of Pipelining –Goal –Difficulties Naïve PIPE Implementation Suggested Reading 4.4, 4.5.
Pipelining (I). Pipelining Example  Laundry Example  Four students have one load of clothes each to wash, dry, fold, and put away  Washer takes 30.
Randal E. Bryant Carnegie Mellon University CS:APP CS:APP Chapter 4 Computer Architecture SequentialImplementation CS:APP Chapter 4 Computer Architecture.
Randal E. Bryant adapted by Jason Fritts CS:APP2e CS:APP Chapter 4 Computer Architecture SequentialImplementation CS:APP Chapter 4 Computer Architecture.
Datapath Design I Topics Sequential instruction execution cycle Instruction mapping to hardware Instruction decoding Systems I.
1 Sequential CPU Implementation. 2 Outline Logic design Organizing Processing into Stages SEQ timing Suggested Reading 4.2,4.3.1 ~
Pipeline Architecture I Slides from: Bryant & O’ Hallaron
PipelinedImplementation Part II PipelinedImplementation.
Computer Architecture: Pipelined Implementation - I
Computer Architecture adapted by Jason Fritts
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
1 SEQ CPU Implementation. 2 Outline SEQ Implementation Suggested Reading 4.3.1,
Randal E. Bryant Carnegie Mellon University CS:APP2e CS:APP Chapter 4 Computer Architecture PipelinedImplementation Part II CS:APP Chapter 4 Computer Architecture.
Sequential Hardware “God created the integers, all else is the work of man” Leopold Kronecker (He believed in the reduction of all mathematics to arguments.
Real-World Pipelines Idea –Divide process into independent stages –Move objects through stages in sequence –At any given times, multiple objects being.
1 Pipelined Implementation. 2 Outline Handle Control Hazard Handle Exception Performance Analysis Suggested Reading 4.5.
1 Seoul National University Sequential Implementation.
CPSC 121: Models of Computation
Real-World Pipelines Idea Divide process into independent stages
CPSC 121: Models of Computation
Lecture 13 Y86-64: SEQ – sequential implementation
Systems I Pipelining IV
Lecture 14 Y86-64: PIPE – pipelined implementation
Sequential Implementation
Administrivia Midterm to be posted on Tuesday after class
Samira Khan University of Virginia Feb 14, 2017
Computer Architecture adapted by Jason Fritts then by David Ferry
Y86 Processor State Program Registers
Pipelined Implementation : Part I
Seoul National University
Seoul National University
Instruction Decoding Optional icode ifun valC Instruction Format
Pipelined Implementation : Part II
Systems I Pipelining III
Computer Architecture adapted by Jason Fritts
Systems I Pipelining II
Pipelined Implementation : Part I
Seoul National University
Systems I Pipelining I Topics Pipelining principles Pipeline overheads
Pipeline Architecture I Slides from: Bryant & O’ Hallaron
Pipelined Implementation : Part I
Pipelined Implementation
Pipelined Implementation
Chapter 8. Pipelining.
Systems I Pipelining II
Chapter 4 Processor Architecture
Systems I Pipelining II
Pipelined Implementation
Sequential CPU Implementation
Sequential Design תרגול 10.
Presentation transcript:

Real-World Pipelines: Car Washes Sequential Parallel Pipelined Idea Divide process into independent stages Move objects through stages in sequence At any instant, multiple objects being processed

Administrivia HW5 due today Lab5 this week Next Tuesday is Monday HW6 this week Lab5 this week Next Tuesday is Monday No lab next week (I will fix the website later) Quiz5 due next Thursday in class Midterm starts after class on Tuesday Feb 28 Open book/notes Download PDF from blackboard, work on paper Due 5PM, Mar 5. Drop off at CTB 265 to the secretary.

Computational Example Combinational logic R e g 300 ps 20 ps Clock Delay/Latency = 320 ps Throughput = 3.12 GOPS System Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Clock cycle must be at least 320 ps 1 operation 1000 ps Throughput = x  3.12 GOPS 320 ps 1 ns

3-Stage Pipelined Version Clock Comb. logic A B C 100 ps 20 ps Delay/Latency = 360 ps Throughput = ??? System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps

3-Stage Pipelined Version Clock Comb. logic A B C 100 ps 20 ps Delay/Latency = 360 ps Throughput = 8.33 GOPS System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Performance impact Latency increases: 360 ps from start to finish (was 320) Throughput Speedup is less than 3x – why?

Pipelining Unpipelined 3-way pipelined Cannot start new operation until previous one completes 3-way pipelined Up to 3 operations in process simultaneously Time OP1 OP2 OP3 Time A B C OP1 OP2 OP3

Operating a Pipeline Clock Comb. logic A B C 100 ps 20 ps 359 100 ps 300 R e g Clock Comb. logic A B C 100 ps 20 ps 239 R e g Clock Comb. logic A B C 100 ps 20 ps 241 Time OP1 OP2 OP3 A B C 120 240 360 480 640 Clock

Real-World Pipelines: Laundry Assuming you’ve got: One washer (takes 30 minutes) One drier (takes 40 minutes) One “folder” (takes 20 minutes) It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long does 4 loads take?

The slow way If each load is done sequentially it takes 6 hours 6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 If each load is done sequentially it takes 6 hours

Laundry Pipelining Start each load as soon as possible: overlap loads Pipelined laundry takes 3.5 hours (WHY NOT 3X?) 6 PM 7 8 9 10 11 Midnight Time 30 40 20

In a Perfect World Would like to divide the logic to uniform stages. R Clock Comb. logic A B C 100 ps 20 ps Delay/Latency = 360 ps Throughput = 8.33 GOPS Would like to divide the logic to uniform stages.

Limitations: Nonuniform Delays g Clock Comb. logic B C 50 ps 20 ps 150 ps 100 ps Delay = ??? Throughput = ??? A Time OP1 OP2 OP3 A B C What is the delay or time to execute one instruction? What is the throughput?

Limitations: Nonuniform Delays g Clock Comb. logic B C 50 ps 20 ps 150 ps 100 ps Delay = 510 ps Throughput = 5.88 GOPS A Time OP1 OP2 OP3 A B C Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages

Limitations: Register Overhead Delay = 420 ps, Throughput = 14.29 GOPS Clock R e g Comb. logic 50 ps 20 ps As pipeline deepens, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57% High speeds of modern processor designs obtained through deep pipelining: high overhead

Data Dependencies Each operation depends on result from preceding one Clock Combinational logic R e g Time OP1 OP2 OP3 Each operation depends on result from preceding one Not a problem for SEQ implementation

Limitations: Data Hazards e g Clock Comb. logic A B C Time OP1 OP2 OP3 A B C OP4 Result does not feed back around in time for next operation Unless special action is taken, results will be incorrect!

Review: Arithmetic and Logical Operations Instruction Code Function Code Add addl rA, rB 6 rA rB Refer to generically as “OPl” Encodings differ only by “function code” Low-order 4 bits in first instruction word All set condition codes as side effect Subtract (rA from rB) subl rA, rB 6 1 rA rB And andl rA, rB 6 2 rA rB Exclusive-Or xorl rA, rB 6 3 rA rB

Review: Move Operations Register --> Register rrmovl rA, rB 2 rA rB Immediate --> Register irmovl V, rB 3 F rB V Register --> Memory rmmovl rA, D(rB) 4 rA rB D Memory --> Register mrmovl D(rB), rA 5 rA rB D Similar to the IA32 movl instruction Simpler format for memory addresses Separated into different instructions to simplify hardware implementation

Data Dependencies in Y86 Code Time OP1 OP2 OP3 A B C OP4 1 irmovl $50, %eax 2 addl %eax , %ebx 3 mrmovl 100( %ebx ), %edx Result from one instruction used as operand for another Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly Essential: results must be correct Desirable: performance impact should be minimized

SEQ Hardware Stages occur in sequence Instruction memory PC increment CC ALU Data New rB dstE dstM A B Addr srcA srcB read fun. Fetch Decode Execute Memory Write back data out Register file M E Cnd icode ifun rA valC valP valB valA valE valM newPC Stat dmem_error instr_valid imem_error stat SEQ Hardware Stages occur in sequence One operation in process at a time Mem. control write

Adding Pipeline Registers PC increment CC ALU Data memory Fetch Decode Execute Memory Write back Register file A B M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstM W_icode, W_valM icode, ifun, rA, rB, valC W F D valP f_pc predPC Instruction M_icode, M_Cnd, M_valA Instruction memory PC increment CC ALU Data Fetch Decode Execute Memory Write back icode, ifun rA, rB valC Register file A B M E valP srcA, srcB dstE, dstM valA, valB aluA, aluB Cnd valE Addr, Data valM valE, valM newPC

Pipeline Stages Fetch Decode Execute Memory Write back PC increment CC ALU Data memory Fetch Decode Execute Memory Write back Register file A B M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstM W_icode, W_valM icode, ifun, rA, rB, valC W F D valP f_pc predPC Instruction M_icode, M_Cnd, M_valA Fetch Select current PC Read instruction Compute incremented PC Decode Read program registers Execute Perform ALU operation Memory Read or write data memory Write back Update register file

PIPE- Hardware Forward (upward) paths M W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat PIPE- Hardware Pipeline registers hold intermediate values from instruction execution Note naming conventions Forward (upward) paths Values flow from one stage to next Values cannot skip stages; look at valC in decode dstE and dstM in execute, memory

PIPE- Example d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat PIPE- Example 1: mrmovl 4(%ebx), %ecx 2: addl %eax, %edx 3: addl %ecx, %edx d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM

PIPE- Example At clock cycle 4, what are the values of d_srcA d_srcB Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat PIPE- Example 1: mrmovl 4(%ebx), %ecx 2: addl %eax, %edx 3: addl %ecx, %edx At clock cycle 4, what are the values of d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM Highlight the datapath in D, E, M stages.

Problems: Feedback Paths W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat Problems: Feedback Paths Predicted PC Guess value of next PC Branch information Jump taken/not-taken Fall-through or target address Return address Read from memory Register updates To register file write ports

Predicting the PC PC increment CC ALU Data memory Fetch Decode Execute Memory Write back Register file A B M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstM W_icode, W_valM icode, ifun, rA, rB, valC W F D valP f_pc predPC Instruction M_icode, M_Cnd, M_valA Pipeline timing requires next instruction fetch to begin one cycle after current instruction is fetched Not enough time to determine next instruction in all cases Can be done for all but conditional jumps, return Solution for conditional jumps: guess outcome + continue Recover later if prediction was incorrect

Simple Prediction Strategy Instructions that don’t transfer control Next PC is always valP (no guess required) Call and unconditional jumps Next PC is always valC (no guess required) Conditional jumps Could predict next PC to be valC (branch target) Correct only if branch is taken (~60% of time) Could predict next PC to be valP (sequential successor) Correct only if branch not taken (~40% of time) Could predict taken if forward, not-taken if backward Correct ~65% of time Return instruction Don’t try to predict (Why? What would be required?)

Branch Misprediction Example 0x000: xorl %eax,%eax 0x002: jne t # Not taken 0x007: irmovl $1, %eax # Fall through 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $2, %edx # Target (Should not execute) 0x017: irmovl $3, %ebx # Should not execute 0x01d: irmovl $4, %edx # Should not execute Should execute only first 7 instructions Our predictor will guess the branch will be taken What must be done to recover?

Handling Misprediction: Desired Behavior Detection Branch is in execute, and it was not taken Two incorrect instructions follow branch in pipe Action On next cycle, replace mispredicted instructions with “bubbles”

Return Example 0x000: irmovl Stack,%esp # Initialize stack pointer 0x006: call p # Procedure call 0x00b: irmovl $5,%esi # Return point 0x011: halt 0x020: .pos 0x20 0x020: p: irmovl $-1,%edi # procedure 0x026: ret 0x027: irmovl $1,%eax # Should not be executed 0x02d: irmovl $2,%ecx # Should not be executed 0x033: irmovl $3,%edx # Should not be executed 0x039: irmovl $4,%ebx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Stack: Stack pointer Without special handling we would fetch three more instructions after ret before we know return address Instead, we’ll freeze the pipe when ret is fetched

Handling Return: Desired Behavior # demo - retb 0x026: ret F D E M W bubble F D E M W bubble F D E M W bubble F D E M W 0x00b: irmovl $5,% esi # Return F F D D E E M M W W Detection ret is in decode, execute or memory Address of next instruction unavailable Action Fill pipe with bubbles until ret’s memory access completes W valM = 0x0b • F F valC valC f f 5 5 rB rB f f % % esi esi

Pipeline Operation F D E M W Cycle 5 I1 I2 I3 I4 I5 irmovl $1,%eax #I1 6 7 8 9 F D E M W irmovl $2,%ecx #I2 irmovl $3,%edx #I3 irmovl $4,%ebx #I4 halt #I5 Cycle 5 I1 I2 I3 I4 I5

Problems with PIPE-: Example with 3 Nops irmovl $10,% edx 1 2 3 4 5 6 7 8 9 F D E M W 0x006: $3,% eax 0x00c: nop 0x00d: 0x00e: 0x00f: addl % ,% 10 R[ ] f valA = valB # demo-h3.ys Cycle 6 11 0x011: halt Cycle 7

Example: 2 Nops F D E M W • Cycle 6 0x000: irmovl $10,% edx 0x006: 3 4 5 6 7 8 9 F D E M W 0x006: $3,% eax 0x00c: nop 0x00d: 0x00e: addl % ,% 0x010: halt 10 # demo-h2.ys R[ ] f valA = valB • Cycle 6 Error

Example: 1 Nop F D E M W • Cycle 5 0x000: irmovl $10,% edx 0x006: $3,% 2 3 4 5 6 7 8 9 F D E M W 0x006: $3,% eax 0x00c: nop 0x00d: addl % ,% 0x00f: halt # demo-h1.ys R[ ] f 10 valA = valB • Cycle 5 Error M_ valE = 3 dstE

Example: 0 Nops F D E M W Cycle 4 0x000: irmovl $10,% edx 0x006: $3,% 2 3 4 5 6 7 8 F D E M W 0x006: $3,% eax 0x00c: addl % ,% 0x00e: halt # demo-h0.ys valA f R[ ] = valB Cycle 4 Error M_ valE = 10 dstE e_ 0 + 3 = 3 E_

Hazard Solution 1: Stalling 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble F E M W 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W If instruction that reads register follows too closely after instruction that writes same register, delay reading instruction Delay or stall always takes place in decode stage Following instruction also delayed in fetch, but just one “stall cycle” tallied Stall = dynamically injecting nop (bubble) into execute stage Hazard  we get wrong answer without special action

Stalls Detection Action Write is pending for a src register M W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat Stalls Detection Write is pending for a src register srcA or srcB (not 0xF) matches dstE or dstM in execute, memory, or write-back stages Action Stall instruction in decode 1: addl %eax, %edx 2: mrmovl 4(%ebx), %ecx 3: addl %ecx, %edx

Stalls Detection Action Write is pending for a src register M W F D Instruction memory PC increment Register file ALU Data Select rB dstE dstM A B Mem. control Addr srcA srcB read write fun. Fetch Decode Execute Memory icode data out data in M_valA W_valM W_valE d_rvalA f_pc Predict valE valM Cnd valA ifun valC valB valP rA predPC CC d_srcB d_srcA e_Cnd M_Cnd stat imem_error instr_valid dmem_error m_stat W_stat M_stat E_stat D_stat f_stat Write back Stat Stalls Detection Write is pending for a src register srcA or srcB (not 0xF) matches dstE or dstM in execute, memory, or write-back stages Action Stall instruction in decode 1: addl %eax, %edx 2: mrmovl 4(%ebx), %ecx 3: addl %ecx, %edx

Detecting Stall Condition 1 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W 0x00e: addl %edx,%eax F D E M W 0x010: halt F D E M W Cycle 6 W D • W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax

Detecting Stall Condition 1 2 3 4 5 6 7 8 9 10 11 # demo-h2.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W 0x00c: nop F D E M W 0x00d: nop F D E M W bubble F E M W 0x00e: addl %edx,%eax D D E M W 0x010: halt F F D E M W Cycle 6 W D • W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 4 D srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 4 E E_dstE = %eax D srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 5 Cycle 4 E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 5 M M_dstE = %eax Cycle 4 E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 6 Cycle 5 M M_dstE = %eax Cycle 4 • E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

Multi-cycle Stalls F D E M W F D E M W E M W E M W E M W F D D D D E M 1 2 3 4 5 6 7 8 9 10 11 # demo-h0.ys 0x000: irmovl $10,%edx F D E M W 0x006: irmovl $3,%eax F D E M W bubble E M W bubble E M W bubble E M W 0x00c: addl %edx,%eax F D D D D E M W 0x00e: halt F F F F D E M W Cycle 6 W W_dstE = %eax Cycle 5 M M_dstE = %eax Cycle 4 • E E_dstE = %eax • D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax D srcA = %edx srcB = %eax

Stalling: Another View 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax Cycle 6 0x00e: halt bubble Cycle 8 0x00c: addl %edx,%eax 0x00e: halt bubble 0x00c: addl %edx,%eax Cycle 7 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax Cycle 5 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax Cycle 4 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax # demo-h0.ys 0x00e: halt Write Back Memory Execute Decode Fetch Operational rules: Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through stage-by-stage as regular instructions

Summary Pipeline Hazards Data Hazards Solution #1: Stalling Control Hazard jmp, ret Solution #1: Simple Prediction