1 Naïve Pipelined Implementation

2 Outline
General Principles of Pipelining
–Goal
–Difficulties
Naïve PIPE Implementation
Suggested Reading: 4.4, 4.5

3 Pipeline

4 Problem of SEQ
Too slow
–Too many tasks must be completed within one clock cycle
–Signals need a long time to propagate through all of the stages
–The clock must run slowly enough to allow for this propagation
Does not make good use of hardware units
–Every unit is active for only part of the total clock cycle

5 Real-World Pipelines: Car Washes
Idea
–Divide process into independent stages
–Move objects through stages in sequence
–At any given time, multiple objects are being processed
[Figure: sequential, parallel, and pipelined car washes]

6 Computational Example
System
–Computation requires a total of 300 picoseconds
–Additional 20 picoseconds to save the result in a register
–Must have clock cycle of at least 320 ps
[Figure: 300 ps of combinational logic followed by a 20 ps register]
Delay = 320 ps, Throughput = 3.12 GOPS

7 3-Way Pipelined Version
System
–Divide combinational logic into 3 blocks of 100 ps each
–Can begin a new operation as soon as the previous one passes through stage A; a new operation begins every 120 ps
–Overall latency increases to 360 ps from start to finish
[Figure: three 100 ps logic blocks, each followed by a 20 ps register]
Delay = 360 ps, Throughput = 8.33 GOPS
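
The delay and throughput figures on these two slides follow directly from the stage delays: the clock period is the slowest stage's logic delay plus the 20 ps register delay, latency is the period times the number of stages, and throughput is one operation per period. A small C sketch (hard-coding the example's values; not part of the course materials) that reproduces the numbers:

    #include <stdio.h>

    int main(void)
    {
        const double logic = 300.0, reg = 20.0;   /* delays in ps, from the example */

        /* Unpipelined: one 300 ps block followed by one 20 ps register. */
        double period1 = logic + reg;                   /* 320 ps */
        printf("unpipelined: delay %.0f ps, throughput %.2f GOPS\n",
               period1, 1000.0 / period1);              /* 3.12 GOPS */

        /* 3-way pipelined: three 100 ps blocks, each followed by a register. */
        const int k = 3;
        double periodk = logic / k + reg;               /* 120 ps */
        printf("3-stage: delay %.0f ps, throughput %.2f GOPS\n",
               k * periodk, 1000.0 / periodk);          /* 360 ps, 8.33 GOPS */
        return 0;
    }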

8 Pipeline Diagrams
Unpipelined
–Cannot start a new operation until the previous one completes
3-Way Pipelined
–Up to 3 operations in process simultaneously
[Figure: pipeline diagrams for OP1–OP3, unpipelined vs. staggered through stages A, B, and C]

9 Operating a Pipeline
[Figure: snapshots of the 3-stage pipeline at 239, 241, 300, and 359 ps, showing OP1–OP3 flowing through the 100 ps logic blocks A, B, and C and their 20 ps registers]

10 Computational Example
System
–Computation requires a total of 300 picoseconds
–Additional 20 picoseconds to save the result in a register
–Must have clock cycle of at least 320 ps
[Figure: 300 ps of combinational logic followed by a 20 ps register]
Delay = 320 ps, Throughput = 3.12 GOPS

11 3-Way Pipelined Version
System
–Divide combinational logic into 3 blocks of 100 ps each
–Can begin a new operation as soon as the previous one passes through stage A; a new operation begins every 120 ps
–Overall latency increases to 360 ps from start to finish
[Figure: three 100 ps logic blocks, each followed by a 20 ps register]
Delay = 360 ps, Throughput = 8.33 GOPS

12 Pipeline Diagrams
Unpipelined
–Cannot start a new operation until the previous one completes
3-Way Pipelined
–Up to 3 operations in process simultaneously
[Figure: pipeline diagrams for OP1–OP3, unpipelined vs. staggered through stages A, B, and C]

13 Limitations: Nonuniform Delays
[Figure: pipeline with stage A = 50 ps, stage B = 150 ps, stage C = 100 ps, each followed by a 20 ps register]
Delay = 510 ps, Throughput = 5.88 GOPS

14 Limitations: Nonuniform Delays
Throughput is limited by the slowest stage
Other stages sit idle for much of the time
Challenging to partition the system into balanced stages

15 Limitations: Register Overhead
[Figure: 6-stage pipeline, each stage 50 ps of logic followed by a 20 ps register]
Delay = 420 ps, Throughput = 14.29 GOPS

16 Limitations: Register Overhead
As we try to deepen the pipeline, the overhead of loading the registers becomes more significant
Percentage of clock cycle spent loading register:
–1-stage pipeline: 6.25%
–3-stage pipeline: 16.67%
–6-stage pipeline: 28.57%
High speeds of modern processor designs are obtained through very deep pipelining
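
The percentages above are just the 20 ps register delay divided by each design's clock period, assuming the 300 ps of logic is split evenly across the stages. A quick C sketch of the arithmetic:

    #include <stdio.h>

    int main(void)
    {
        const double logic = 300.0, reg = 20.0;   /* ps */
        const int depths[] = { 1, 3, 6 };

        for (int i = 0; i < 3; i++) {
            int k = depths[i];
            double period = logic / k + reg;      /* per-stage clock period */
            printf("%d-stage: %.2f%% of the cycle spent loading the register\n",
                   k, 100.0 * reg / period);      /* 6.25%, 16.67%, 28.57% */
        }
        return 0;
    }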

17 Data Dependencies in Processors
–Result from one instruction used as operand for another: a read-after-write (RAW) dependency
–Very common in actual programs
–Must make sure our pipeline handles these properly
Get correct results
Minimize performance impact

irmovl $50, %eax
addl %eax, %ebx
mrmovl 100(%ebx), %edx

18 Data Dependencies
System
–Each operation depends on the result from the preceding one
[Figure: unpipelined system in which each operation's result feeds back as an input to the next]

19 Data Hazards
[Figure: 3-stage pipelined version of the dependent computation — OP1 through OP4 overlap in stages A, B, and C]

20 Data Hazards
Result does not feed back around in time for the next operation
Pipelining has changed the behavior of the system

21 Naïve PIPE Implementation

22 SEQ Hardware
Stages occur in sequence
One operation in process at a time

23 SEQ+ Hardware
–Still a sequential implementation
–Reorder the PC stage to put it at the beginning
PC Stage
–Task is to select the PC for the current instruction
–Based on results computed by the previous instruction
Processor State
–PC is no longer stored in a register
–But, can determine PC based on other stored information

24 [Figure: SEQ+ hardware — the Fetch, Decode, Execute, Memory, and Write Back stages built around the PC increment, instruction memory, register file, CC, ALU, and data memory, with the intermediate values (icode, ifun, rA, rB, valC, valP, srcA, srcB, valA, valB, Cnd, valE, valM) passed between stages]

25 [Figure: abstract view of the stages — Fetch, Decode, Execute, Memory]

26 Pipeline Stages
Fetch
–Select current PC
–Read instruction
–Compute incremented PC
Decode
–Read program registers
Execute
–Operate ALU
Memory
–Read or write data memory
Write Back
–Update register file

27 PIPE- Hardware
Pipeline registers hold intermediate values from instruction execution
Forward (Upward) Paths
–Values passed from one stage to the next
–Cannot jump past stages; e.g., valC passes through decode
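
As a reading aid for the hardware diagrams that follow, here is a rough C sketch of the values each PIPE- pipeline register carries between stages. The field names are the ones used on the slides; the struct grouping and the types are only for illustration and are not the book's simulator code.

    /* Pipeline registers of PIPE-, named for the stage they feed. */
    struct F_reg { unsigned long predPC; };

    struct D_reg {                        /* feeds decode */
        int icode, ifun;
        unsigned rA, rB;
        unsigned long valC, valP;
    };

    struct E_reg {                        /* feeds execute */
        int icode, ifun;
        unsigned long valC, valA, valB;
        unsigned dstE, dstM, srcA, srcB;
    };

    struct M_reg {                        /* feeds memory */
        int icode, Cnd;
        unsigned long valE, valA;
        unsigned dstE, dstM;
    };

    struct W_reg {                        /* feeds write-back */
        int icode;
        unsigned long valE, valM;
        unsigned dstE, dstM;
    };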

28 [Figure: PIPE- hardware structure — pipeline registers F, D, E, M, and W between the Fetch, Decode, Execute, Memory, and Write-back stages, with Select PC and Predict PC logic, instruction memory, PC increment, register file, CC, ALU, and data memory]

29 [Figure: PIPE- hardware structure, repeated, with the signals carried in each pipeline register labeled]

30 Feedback Paths
Predicted PC
–Guess value of next PC
Branch information
–Jump taken / not-taken
–Fall-through or target address
Return point
–Read from memory
Register updates
–To register file write ports

31 Predicting the PC
Start fetch of a new instruction after the current one has completed the fetch stage
–Not enough time to reliably determine the next instruction
Guess which instruction will follow
–Recover if the prediction was incorrect

32 Predicting the PC
[Figure: fetch-stage hardware — Select PC, instruction memory, Split (Byte 0 / Bytes 1–5), Align, the instr valid / need regids / need valC logic, PC increment, and the Predict PC logic writing predPC into the F pipeline register]

33 Our Prediction Strategy
Instructions that Don't Transfer Control
–Predict next PC to be valP
–Always reliable
Call and Unconditional Jumps
–Predict next PC to be valC (destination)
–Always reliable

34 Our Prediction Strategy
Conditional Jumps
–Predict next PC to be valC (destination)
–Only correct if branch is taken; typically right about 60% of the time
Return Instruction
–Don't try to predict

35 Fetch
[Figure: fetch-stage hardware with the Select PC and Predict PC blocks]
Predict PC:
    int f_predPC = [
        f_icode in { IJXX, ICALL } : f_valC;
        1 : f_valP;
    ];

36 Recovering from PC Misprediction
Mispredicted Jump
–Will see branch flag once the instruction reaches the memory stage
–Can get fall-through PC from valA
Return Instruction
–Will get return PC when ret reaches the write-back stage

37 Select PC
    int f_PC = [
        # Mispredicted branch. Fetch at incremented PC
        M_icode == IJXX && !M_Cnd : M_valA;
        # Completion of RET instruction
        W_icode == IRET : W_valM;
        # Default: use predicted value of PC
        1 : F_predPC;
    ];
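
To make the two HCL fragments on slides 35 and 37 concrete, here is a minimal C sketch of the same decisions. The parameter names mirror the signal names on the slides; the enum, the types, and the fact that these are plain functions are illustrative assumptions, not the book's code.

    #include <stdbool.h>

    /* Hypothetical instruction codes; only the ones this logic tests are listed. */
    enum icode { INOP, IJXX, ICALL, IRET };

    /* Predict PC (slide 35): taken target for jumps and calls, fall-through otherwise. */
    unsigned long predict_pc(enum icode f_icode,
                             unsigned long f_valC, unsigned long f_valP)
    {
        if (f_icode == IJXX || f_icode == ICALL)
            return f_valC;          /* target address */
        return f_valP;              /* incremented PC */
    }

    /* Select PC (slide 37): override the prediction when a conditional jump
     * turns out not taken (memory stage) or a ret reaches write-back. */
    unsigned long select_pc(enum icode M_icode, bool M_Cnd, unsigned long M_valA,
                            enum icode W_icode, unsigned long W_valM,
                            unsigned long F_predPC)
    {
        if (M_icode == IJXX && !M_Cnd)
            return M_valA;          /* fall-through PC of the mispredicted jump */
        if (W_icode == IRET)
            return W_valM;          /* return address read from memory */
        return F_predPC;            /* default: predicted PC */
    }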

38 Pipeline Demonstration
irmovl $1,%eax    # I1
irmovl $2,%ecx    # I2
irmovl $3,%edx    # I3
irmovl $4,%ebx    # I4
halt              # I5
[Figure: pipeline diagram — in cycle 5, I1 is in write-back (W), I2 in memory (M), I3 in execute (E), I4 in decode (D), and I5 in fetch (F)]

39 Data Dependencies in Processors
Result from one instruction used as operand for another
–Read-after-write (RAW) dependency
Very common in actual programs
Must make sure our pipeline handles these properly
–Get correct results
–Minimize performance impact

irmovl $50, %eax
addl %eax, %ebx
mrmovl 100(%ebx), %edx

40 Data Dependencies: No Nop
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
0x00e: halt
[Figure: pipeline diagram — in cycle 4 the addl decodes valA ← R[%edx] = 0 and valB ← R[%eax] = 0 (Error); the first irmovl is still in memory (M_valE = 10, M_dstE = %edx) and the second in execute (e_valE = 3, E_dstE = %eax)]

41 Data Dependencies: 1 Nop
# demo-h1.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: addl %edx,%eax
0x00f: halt
[Figure: pipeline diagram — in cycle 5 the addl decodes valA ← R[%edx] = 0 and valB ← R[%eax] = 0 (Error); the write-back R[%edx] ← 10 happens in the same cycle, too late for the read, and the second irmovl is still in memory (M_valE = 3, M_dstE = %eax)]

42 Data Dependencies: 2 Nop's
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline diagram — in cycle 6 the addl decodes valA ← R[%edx] = 10 but valB ← R[%eax] = 0 (Error); the write-back R[%eax] ← 3 occurs in the same cycle, too late for the read]

43 Data Dependencies: 3 Nop's
# demo-h3.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: nop
0x00f: addl %edx,%eax
0x011: halt
[Figure: pipeline diagram — R[%eax] ← 3 is written back in cycle 6, so in cycle 7 the addl correctly decodes valA ← R[%edx] = 10 and valB ← R[%eax] = 3; no error]

44 Classes of Data Hazards
Hazards can potentially occur when one instruction updates part of the program state that is read by a later instruction

45 Classes of Data Hazards
Program state:
–Program registers
The hazards already identified
–Condition codes
Both written and read in the execute stage; no hazards can arise
–Program counter
Conflicts between updating and reading the PC cause control hazards
–Memory
Both written and read in the memory stage; without self-modifying code, no hazards

46 Data Dependencies: 2 Nop's
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline diagram — in cycle 6 the addl decodes valA ← R[%edx] = 10 but valB ← R[%eax] = 0 (Error); the write-back R[%eax] ← 3 occurs in the same cycle, too late for the read]

47 Data Dependencies: 2 Nop's
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
       bubble
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline diagram — the addl is held in decode for one extra cycle while a bubble passes through execute in its place]
If an instruction follows too closely after one that writes a register, slow it down
–Hold the instruction in decode
–Dynamically inject a nop into the execute stage

48 [Figure: PIPE- hardware structure, with the decode-stage source register signals (d_srcA, d_srcB) and the dstE/dstM fields carried in the E, M, and W pipeline registers]

49 Stall Condition
Source Registers
–srcA and srcB of current instruction in decode stage
Destination Registers
–dstE and dstM fields
–Instructions in execute, memory, and write-back stages
Condition
–srcA == dstE or srcA == dstM
–srcB == dstE or srcB == dstM
Special Case
–Don't stall for register ID 0xF, which indicates the absence of a register operand
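
A compact C sketch of this stall condition (REG_NONE, the struct, and the function names are hypothetical, for illustration; in the hardware the dstE/dstM values come from the E, M, and W pipeline registers):

    #include <stdbool.h>

    #define REG_NONE 0xF            /* register ID 0xF: no register operand */

    /* Destination fields of the instruction in one of the later stages. */
    struct dsts { unsigned dstE, dstM; };

    /* Does source register src match a pending destination in stage d? */
    static bool needs(unsigned src, struct dsts d)
    {
        return src != REG_NONE && (src == d.dstE || src == d.dstM);
    }

    /* Stall the instruction in decode (srcA, srcB) if any instruction still
     * in execute, memory, or write-back will write a register it reads. */
    bool stall_condition(unsigned srcA, unsigned srcB,
                         struct dsts E, struct dsts M, struct dsts W)
    {
        return needs(srcA, E) || needs(srcA, M) || needs(srcA, W)
            || needs(srcB, E) || needs(srcB, M) || needs(srcB, W);
    }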

50 Data Dependencies: 2 Nop's
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
       bubble
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline diagram — in cycle 6 the addl is held in decode (srcA = %edx, srcB = %eax) because the second irmovl is still in write-back (W_dstE = %eax, W_valE = 3)]

51 Stalling
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
       bubble
0x00c: addl %edx,%eax
0x00e: halt
[Figure: pipeline diagram — the addl is held in decode for three cycles while bubbles pass through execute: in cycle 4 the irmovl writing %eax is in execute (E_dstE = %eax), in cycle 5 in memory (M_dstE = %eax), and in cycle 6 in write-back (W_dstE = %eax), while decode keeps seeing srcA = %edx, srcB = %eax]

52 What Happens When Stalling?
Stalling instruction held back in decode stage
Following instruction stays in fetch stage
Bubbles injected into execute stage
–Like dynamically generated nop's
–Move through later stages
[Figure: cycle-by-cycle pipeline contents for demo-h0.ys, cycles 4–8 — the addl sits in decode and the halt sits in fetch while bubbles move through execute, memory, and write-back]

53 Implementing Stalling
Pipeline Control
–Combinational logic detects the stall condition
–Sets mode signals for how the pipeline registers should update

54 Implementing Stalling
[Figure: PIPE hardware with the pipe control logic — it examines D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, and e_Cnd and produces the control signals F_stall, D_stall, D_bubble, E_bubble, M_bubble, W_stall, and set_cc]

55 Pipeline Register Modes
Stall: stall = 1, bubble = 0
–On the rising clock the register keeps its current state (output stays x; input y is ignored)

56 Pipeline Register Modes
Bubble: stall = 0, bubble = 1
–On the rising clock the register is loaded with a nop state (output becomes nop; input y is ignored)

57 Pipeline Register Modes
Normal: stall = 0, bubble = 0
–On the rising clock the register loads its input (output becomes y)
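
The three register modes can be summarized in a few lines of C. This is only a behavioral sketch (the struct contents and the nop encoding are placeholder assumptions), but it shows how the stall and bubble signals from the pipe control logic determine what a pipeline register does on each rising clock edge:

    /* State carried by one pipeline register (fields vary per register). */
    struct preg {
        int icode;
        /* ... other stage-specific fields ... */
    };

    enum { INOP = 1 };                        /* assumed Y86 icode for nop */
    static const struct preg NOP_STATE = { INOP };

    /* Called once per rising clock edge for each pipeline register. */
    void update_preg(struct preg *out, const struct preg *in,
                     int stall, int bubble)
    {
        if (stall)                /* stall = 1: keep the current state */
            return;
        if (bubble) {             /* bubble = 1: inject a nop */
            *out = NOP_STATE;
            return;
        }
        *out = *in;               /* normal: load the new input */
    }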

Next
Data Forwarding
Handling Control Hazards
Performance Analysis
Suggested Reading