1 1 Naïve Pipelined Implementation

2 2 Outline General Principles of Pipelining –Goal –Difficulties Naïve PIPE Implementation Suggested Reading 4.4, 4.5

3 3 Pipeline

4 4 Problem of SEQ Too slow –Too many tasks must finish in one clock cycle –Signals need a long time to propagate through all of the stages –The clock must run slowly enough to allow this Does not make good use of hardware units –Every unit is active for only part of the total clock cycle

5 5 Real-World Pipelines: Car Washes Idea –Divide process into independent stages –Move objects through stages in sequence –At any given time, multiple objects are being processed [Figure: sequential, parallel, and pipelined car washes]

6 6 Computational Example System –Computation requires total of 300 picoseconds –Additional 20 picoseconds to save result in register –Must have clock cycle of at least 320 ps [Figure: combinational logic (300 ps) feeding a register (20 ps), driven by the clock] Delay = 320 ps, Throughput = 3.12 GOPS

7 7 3-Way Pipelined Version System –Divide combinational logic into 3 blocks of 100 ps each –Can begin new operation as soon as previous one passes through stage A, so a new operation begins every 120 ps –Overall latency increases to 360 ps from start to finish [Figure: comb. logic A, B, and C (100 ps each), each followed by a register (20 ps), all driven by the clock] Delay = 360 ps, Throughput = 8.33 GOPS
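
The delay and throughput figures on the two slides above follow directly from the stage delays. The following is a minimal Python sketch, assuming the clock period equals the slowest stage's logic delay plus the register delay and that one operation completes per clock; the function name `pipeline_timing` is illustrative and does not come from the slides.

```python
def pipeline_timing(stage_logic_ps, reg_ps=20):
    """Return (clock_ps, latency_ps, throughput_gops) for a pipeline.

    stage_logic_ps: combinational delay of each stage, in picoseconds.
    reg_ps: time to load a pipeline register, in picoseconds.
    """
    clock_ps = max(stage_logic_ps) + reg_ps      # slowest stage sets the clock
    latency_ps = clock_ps * len(stage_logic_ps)  # one clock period per stage
    throughput_gops = 1000.0 / clock_ps          # one op per clock; 1000 ps = 1 ns
    return clock_ps, latency_ps, throughput_gops


print(pipeline_timing([300]))            # unpipelined: 320 ps clock and delay
print(pipeline_timing([100, 100, 100]))  # 3-way: 120 ps clock, 360 ps delay
```

Under these assumptions the unpipelined system reaches roughly 3.12 GOPS and the 3-way version roughly 8.33 GOPS, matching the slides.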

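8 8 Pipeline Diagrams Unpipelined –Cannot start new operation until previous one completes 3-Way Pipelined –Up to 3 operations in process simultaneously [Figure: two timing diagrams. Unpipelined: OP1, OP2, OP3 run back to back. 3-way pipelined: OP1, OP2, OP3 each pass through stages A, B, C, each operation starting one stage after the previous one]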

9 9 Operating a Pipeline [Figure: timing diagram of OP1, OP2, OP3 passing through stages A, B, C, with the time axis labeled 0, 120, 240, 360, 480, 640; below it, snapshots of the three-stage circuit (comb. logic A, B, C at 100 ps each, each followed by a 20 ps register) at times 239, 241, 300, and 359]

13 13 Limitations: Nonuniform Delays [Figure: three-stage pipeline with comb. logic A (50 ps), B (150 ps), and C (100 ps), each followed by a 20 ps register; timing diagram of OP1, OP2, OP3 passing through stages A, B, C] Delay = 510 ps, Throughput = 5.88 GOPS

14 14 Limitations: Nonuniform Delays Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages
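
The idle time mentioned above can be quantified. The sketch below assumes the 50/150/100 ps stage split from the previous slide and a 20 ps register delay; `stage_utilization` is a hypothetical helper name, and "busy fraction" is defined here as the share of each clock period a stage actually needs.

```python
def stage_utilization(stage_logic_ps, reg_ps=20):
    """Return the clock period and the busy fraction of each stage."""
    clock_ps = max(stage_logic_ps) + reg_ps  # slowest stage limits the clock
    busy = [(d + reg_ps) / clock_ps for d in stage_logic_ps]
    return clock_ps, busy


clock_ps, busy = stage_utilization([50, 150, 100])
print(clock_ps)                        # clock period in ps
print(3 * clock_ps)                    # total delay through three stages
print([round(b, 2) for b in busy])     # busy fraction of stages A, B, C
```

With these numbers the clock period is 170 ps, so stage A is busy for well under half of every cycle while stage B is busy for all of it.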

15 15 Limitations: Register Overhead [Figure: six-stage pipeline, each stage 50 ps of comb. logic followed by a 20 ps register, all driven by the clock] Delay = 420 ps, Throughput = 14.29 GOPS

16 16 Limitations: Register Overhead As we try to deepen the pipeline, the overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: –1-stage pipeline: 6.25% –3-stage pipeline: 16.67% –6-stage pipeline: 28.57% High speeds of modern processor designs are obtained through very deep pipelining
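
The percentages above can be reproduced with a short calculation. This sketch assumes 300 ps of total logic split evenly across the stages and a fixed 20 ps register delay, as in the earlier examples; the function name is illustrative.

```python
def register_overhead(total_logic_ps, n_stages, reg_ps=20):
    """Fraction of each clock period spent loading the pipeline register."""
    stage_ps = total_logic_ps / n_stages  # evenly balanced stages (assumption)
    return reg_ps / (stage_ps + reg_ps)


for n in (1, 3, 6):
    print(n, f"{100 * register_overhead(300, n):.2f}%")
# prints 6.25%, 16.67%, and 28.57% for 1, 3, and 6 stages
```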

17 17 Data Dependencies in Processors –Result from one instruction used as operand for another Read-after-write (RAW) dependency –Very common in actual programs –Must make sure our pipeline handles these properly Get correct results Minimize performance impact
    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx

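18 18 Data Dependencies System –Each operation depends on result from preceding one [Figure: unpipelined system, combinational logic feeding a register whose output feeds back to the logic; timing diagram of OP1, OP2, OP3 run back to back]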

19 19 Data Hazards [Figure: three-stage pipeline (comb. logic A, B, C, each followed by a register) with the result fed back to the input; timing diagram of OP1 through OP4, each passing through stages A, B, C]

20 20 Data Hazards Result does not feed back around in time for next operation Pipelining has changed behavior of system

21 21 Naïve PIPE Implementation

22 SEQ Hardware Stages occur in sequence One operation in process at a time

23 SEQ+ Hardware –Still sequential implementation –Reorder PC stage to put at beginning PC Stage –Task is to select PC for current instruction –Based on results computed by previous instruction Processor State –PC is no longer stored in register –But, can determine PC based on other stored information

24 24 [Figure: hardware structure by stage. PC stage: PC. Fetch: instruction memory, PC increment; icode, ifun, rA, rB, valC, valP. Decode: register file (ports A, B, M, E); srcA, srcB, dstA, dstB, valA, valB. Execute: ALU, CC; aluA, aluB, Cnd, valE. Memory: data memory; Addr, Data, valM. Write Back: valE, valM. Processor state: pState]

25 25 [Figure: hardware diagram with the stages Fetch, Decode, Execute, Memory]

26 26 Pipeline Stages Fetch –Select current PC –Read instruction –Compute incremented PC Decode –Read program registers Execute –Operate ALU Memory –Read or write data memory Write Back –Update register file

27 27 PIPE- Hardware Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths –Values passed from one stage to next –Cannot jump past stages e.g., valC passes through decode

28 28 [Figure: pipelined hardware diagram. Pipeline registers F, D, E, M, and W precede the Fetch, Decode, Execute, Memory, and Write back stages. Fetch: Select PC (f_PC), instruction memory, PC increment, Predict PC (predPC). Decode: register file (ports A, B, M, E), Select A, d_srcA, d_srcB, d_rvalA. Execute: ALU with inputs ALU A and ALU B, ALU fun, CC, e_Cnd. Memory: data memory, Mem. control, Addr, data in, data out, M_valA, M_Bch. Write back: W_valE and W_valM returned to the register file]

29 29 [Figure: the pipelined hardware diagram of slide 28 shown again, with pipeline registers F, D, E, M, W and the signals f_PC, predPC, d_srcA, d_srcB, d_rvalA, e_Cnd, M_Bch, M_valA, W_valE, W_valM]

30 30 Feedback Paths Predicted PC –Guess value of next PC –Branch information Jump taken/not-taken Fall-through or target address –Return point Read from memory Register updates To register file write ports

31 31 Predicting the PC Start fetch of new instruction after current one has completed fetch stage –Not enough time to reliably determine next instruction Guess which instruction will follow –Recover if prediction was incorrect

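32 32 Predicting the PC [Figure: fetch stage logic. Select PC chooses among predPC, M_valA, and W_valM using M_icode, M_Cnd, and W_icode. The instruction memory output is split into byte 0 and bytes 1-5 and aligned; Instr valid, Need regids, and Need valC are derived from it. PC increment produces valP, and Predict PC produces predPC for register F. Register D receives icode, ifun, rA, rB, valC, valP]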

33 33 Our Prediction Strategy Instructions that Don’t Transfer Control –Predict next PC to be valP –Always reliable Call and Unconditional Jumps –Predict next PC to be valC (destination) –Always reliable

34 34 Our Prediction Strategy Conditional Jumps –Predict next PC to be valC (destination) –Only correct if branch is taken Typically right 60% of time Return Instruction –Don’t try to predict

35 35 Fetch [Figure: fetch stage with register F (predPC), Select PC (inputs M_icode, M_Cnd, M_valA, W_icode, W_valM; output f_PC), instruction memory, PC increment, Predict PC, and register D (icode, ifun, rA, rB, valC, valP)]
    int F_predPC = [
        f_icode in { IJXX, ICALL } : f_valC;
        1 : f_valP;
    ];
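
The prediction rule above can be mirrored in a few lines of Python. This is a sketch rather than the simulator's code: the numeric values assigned to `IJXX` and `ICALL` are the Y86 instruction codes as given in the suggested reading and are stated here as an assumption, and `predict_pc` is an illustrative name.

```python
IJXX, ICALL = 0x7, 0x8  # Y86 icodes for jXX and call (assumed encoding)


def predict_pc(f_icode, f_valC, f_valP):
    """Predict the next PC from values available in the fetch stage."""
    if f_icode in (IJXX, ICALL):
        return f_valC  # jumps and calls: predict the target address
    return f_valP      # everything else: predict the incremented PC


print(hex(predict_pc(IJXX, 0x100, 0x005)))  # conditional jump, predicted taken
print(hex(predict_pc(0x6, 0x000, 0x002)))   # non-control instruction
```

Note that `ret` falls through to `f_valP` here, which is only a placeholder value; the slides handle `ret` separately because its target cannot be predicted.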

36 36 Recovering from PC Misprediction Mispredicted Jump –Will see branch flag once instruction reaches memory stage –Can get fall-through PC from valA Return Instruction –Will get return PC when ret reaches write-back stage

37 37 Select PC
    int f_PC = [
        # Mispredicted branch. Fetch at incremented PC
        M_icode == IJXX && !M_Cnd : M_valA;
        # Completion of RET instruction
        W_icode == IRET : W_valM;
        # Default: Use predicted value of PC
        1 : F_predPC;
    ];
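
The priority order of the three cases above is easy to check in ordinary code. The sketch below is a Python rendering of the same selection, assuming the Y86 instruction codes 0x7 for jXX and 0x9 for ret; the function name `select_pc` is illustrative.

```python
IJXX, IRET = 0x7, 0x9  # Y86 icodes for jXX and ret (assumed encoding)


def select_pc(M_icode, M_Cnd, M_valA, W_icode, W_valM, F_predPC):
    """Choose the address to fetch from in the current cycle."""
    if M_icode == IJXX and not M_Cnd:
        return M_valA    # mispredicted branch: fall-through PC carried in valA
    if W_icode == IRET:
        return W_valM    # ret reached write-back: return address read from memory
    return F_predPC      # default: use the predicted PC


# A not-taken jump in the memory stage overrides the prediction.
print(hex(select_pc(IJXX, False, 0x01B, 0x1, 0x0, 0x100)))
```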

38 38 Pipeline Demonstration
Cycle                      1  2  3  4  5  6  7  8  9
irmovl $1,%eax  # I1       F  D  E  M  W
irmovl $2,%ecx  # I2          F  D  E  M  W
irmovl $3,%edx  # I3             F  D  E  M  W
irmovl $4,%ebx  # I4                F  D  E  M  W
halt            # I5                   F  D  E  M  W
Cycle 5: W holds I1, M holds I2, E holds I3, D holds I4, F holds I5
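
The staircase pattern above is regular enough to generate. The following Python sketch assumes an ideal five-stage pipeline with no stalls, where instruction i (counting from 1) enters fetch in cycle i; both function names are illustrative.

```python
STAGES = "FDEMW"


def pipeline_diagram(n_instructions):
    """One string per instruction; character c is the stage held in cycle c + 1."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [" "] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i reaches stage s in cycle i + s + 1
        rows.append("".join(row))
    return rows


def stage_contents(n_instructions, cycle):
    """Map each stage letter to the 1-based instruction it holds in a cycle."""
    contents = {}
    for s, stage in enumerate(STAGES):
        i = cycle - 1 - s
        if 0 <= i < n_instructions:
            contents[stage] = i + 1
    return contents


for row in pipeline_diagram(5):
    print(row)
print(stage_contents(5, 5))  # cycle 5: every stage is occupied
```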

39 39 Data Dependencies in Processors Result from one instruction used as operand for another –Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly –Get correct results –Minimize performance impact
    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx

40 40 Data Dependencies: No Nop
# demo-h0.ys               1  2  3  4  5  6  7  8
0x000: irmovl $10,%edx     F  D  E  M  W
0x006: irmovl $3,%eax         F  D  E  M  W
0x00c: addl %edx,%eax            F  D  E  M  W
0x00e: halt                         F  D  E  M  W
Cycle 4: M: M_valE = 10, M_dstE = %edx. E: e_valE ← 0 + 3 = 3, E_dstE = %eax. D: valA ← R[%edx] = 0, valB ← R[%eax] = 0 (Error)

41 41 Data Dependencies: 1 Nop
# demo-h1.ys               1  2  3  4  5  6  7  8  9
0x000: irmovl $10,%edx     F  D  E  M  W
0x006: irmovl $3,%eax         F  D  E  M  W
0x00c: nop                       F  D  E  M  W
0x00d: addl %edx,%eax               F  D  E  M  W
0x00f: halt                            F  D  E  M  W
Cycle 5: W: R[%edx] ← 10. M: M_valE = 3, M_dstE = %eax. D: valA ← R[%edx] = 0, valB ← R[%eax] = 0 (Error)

42 42 Data Dependencies: 2 Nop’s
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: addl %edx,%eax
0x010: halt
Cycle 6: W: R[%eax] ← 3. D: valA ← R[%edx] = 10, valB ← R[%eax] = 0 (Error)

43 43 Data Dependencies: 3 Nop’s
# demo-h3.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: nop
0x00f: addl %edx,%eax
0x011: halt
Cycle 6: W: R[%eax] ← 3
Cycle 7: D: valA ← R[%edx] = 10, valB ← R[%eax] = 3
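
The four demos above are consistent with a simple timing rule, sketched below in Python. The model is an assumption made for illustration: the instruction at position k (counting from 0) is in decode in cycle k + 2 and in write-back in cycle k + 5, the register update only becomes visible in the following cycle, and all registers start at 0. The function and variable names are hypothetical.

```python
def values_seen_in_decode(writes, reader_pos, sources):
    """Return the register values a reading instruction sees in decode.

    writes: {register: (position, value)} for the earlier writing instructions.
    reader_pos: 0-based program position of the reading instruction.
    sources: registers the reading instruction uses as operands.
    """
    decode_cycle = reader_pos + 2
    seen = {}
    for reg in sources:
        pos, value = writes[reg]
        visible_from = pos + 6  # write-back is cycle pos + 5; visible one cycle later
        seen[reg] = value if decode_cycle >= visible_from else 0
    return seen


# irmovl $10,%edx at position 0, irmovl $3,%eax at position 1,
# then addl %edx,%eax after 0 to 3 nops.
writes = {"%edx": (0, 10), "%eax": (1, 3)}
for nops in range(4):
    print(nops, values_seen_in_decode(writes, 2 + nops, ["%edx", "%eax"]))
```

Under this model only the three-nop version reads both 10 and 3, which agrees with the slides: a reader must sit at least four positions after the instruction that writes its operand.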

44 44 Classes of Data Hazards Hazards can potentially occur when one instruction updates part of the program state that is read by a later instruction

45 45 Classes of Data Hazards Program states: –Program registers: the hazards already identified –Condition codes: both written and read in the execute stage, so no hazards can arise –Program counter: conflicts between updating and reading the PC cause control hazards –Memory: both written and read in the memory stage, so no hazards arise unless the code is self-modifying

47 47 Data Dependencies: 2 Nop’s
# demo-h2.ys               1  2  3  4  5  6  7  8  9 10 11
0x000: irmovl $10,%edx     F  D  E  M  W
0x006: irmovl $3,%eax         F  D  E  M  W
0x00c: nop                       F  D  E  M  W
0x00d: nop                          F  D  E  M  W
       bubble                                E  M  W
0x00e: addl %edx,%eax                  F  D  D  E  M  W
0x010: halt                               F  F  D  E  M  W
If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage

48 48 [Figure: the pipelined hardware diagram shown again, with pipeline registers F, D, E, M, W, the register file's dstE, dstM, srcA, and srcB ports, and the branch condition signals labeled Bch, e_Bch, and M_Bch]

49 49 Stall Condition Source Registers –srcA and srcB of current instruction in decode stage Destination Registers –dstE and dstM fields –Instructions in execute, memory, and write-back stages Condition –srcA==dstE or srcA==dstM –srcB==dstE or srcB==dstM Special Case –Don’t stall for register ID 0xF –Indicates absence of register operand
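
The stall condition above amounts to a membership test. The sketch below writes it in Python under a few assumptions: register IDs are small integers, 0xF marks "no register" as on the slide, and the destination fields of the execute, memory, and write-back stages are passed in as (dstE, dstM) pairs. The name `must_stall` is illustrative.

```python
RNONE = 0xF  # register ID meaning "no register operand"


def must_stall(d_srcA, d_srcB, pending_dsts):
    """Return True if the instruction in decode must be held back.

    pending_dsts: (dstE, dstM) pairs for the instructions currently in the
    execute, memory, and write-back stages.
    """
    sources = {r for r in (d_srcA, d_srcB) if r != RNONE}  # ignore absent operands
    for dstE, dstM in pending_dsts:
        if dstE in sources or dstM in sources:
            return True
    return False


EAX, EDX = 0x0, 0x2  # Y86 register IDs (assumed encoding)
# addl %edx,%eax in decode while irmovl $3,%eax is still in execute.
print(must_stall(EDX, EAX, [(EAX, RNONE), (EDX, RNONE), (RNONE, RNONE)]))
```

Filtering `RNONE` out of the source set is what implements the special case: two instructions that both lack a register operand never match each other.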

50 50 Data Dependencies: 2 Nop’s
# demo-h2.ys               1  2  3  4  5  6  7  8  9 10 11
0x000: irmovl $10,%edx     F  D  E  M  W
0x006: irmovl $3,%eax         F  D  E  M  W
0x00c: nop                       F  D  E  M  W
0x00d: nop                          F  D  E  M  W
       bubble                                E  M  W
0x00e: addl %edx,%eax                  F  D  D  E  M  W
0x010: halt                               F  F  D  E  M  W
Cycle 6: W: W_dstE = %eax, W_valE = 3. D: srcA = %edx, srcB = %eax

51 51 Stalling X3
# demo-h0.ys               1  2  3  4  5  6  7  8  9 10 11
0x000: irmovl $10,%edx     F  D  E  M  W
0x006: irmovl $3,%eax         F  D  E  M  W
       bubble                          E  M  W
       bubble                             E  M  W
       bubble                                E  M  W
0x00c: addl %edx,%eax            F  D  D  D  D  E  M  W
0x00e: halt                         F  F  F  F  D  E  M  W
Cycle 4: E: E_dstE = %eax. D: srcA = %edx, srcB = %eax
Cycle 5: M: M_dstE = %eax. D: srcA = %edx, srcB = %eax
Cycle 6: W: W_dstE = %eax. D: srcA = %edx, srcB = %eax

52 52 What Happens When Stalling? Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage –Like dynamically generated nop’s –Move through later stages
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
0x00e: halt
Stage contents by cycle:
Cycle 4: Memory: 0x000 irmovl $10,%edx. Execute: 0x006 irmovl $3,%eax. Decode: 0x00c addl %edx,%eax. Fetch: 0x00e halt
Cycle 5: Write Back: 0x000 irmovl $10,%edx. Memory: 0x006 irmovl $3,%eax. Execute: bubble. Decode: 0x00c addl %edx,%eax. Fetch: 0x00e halt
Cycle 6: Write Back: 0x006 irmovl $3,%eax. Memory: bubble. Execute: bubble. Decode: 0x00c addl %edx,%eax. Fetch: 0x00e halt
Cycle 7: Write Back: bubble. Memory: bubble. Execute: bubble. Decode: 0x00c addl %edx,%eax. Fetch: 0x00e halt
Cycle 8: Write Back: bubble. Memory: bubble. Execute: 0x00c addl %edx,%eax. Decode: 0x00e halt
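
The cycle-by-cycle behavior above can be reproduced with a small simulation. This Python sketch is a simplified model, not the course simulator: each instruction is reduced to a (text, destination, sources) tuple, the hazard check compares the decode-stage sources against the destinations in the execute, memory, and write-back stages, and all names are illustrative.

```python
BUBBLE = "bubble"


def hazard(program, d, later_stages):
    """True if the instruction in decode reads a register still being written."""
    if d is None or d == BUBBLE:
        return False
    sources = program[d][2]
    for s in later_stages:
        if s not in (None, BUBBLE) and program[s][1] in sources:
            return True
    return False


def simulate(program, n_cycles):
    """Return the stage contents (instruction index, BUBBLE, or None) per cycle."""
    F, D, E, M, W = 0, None, None, None, None
    history = []
    for _ in range(n_cycles):
        history.append({"F": F, "D": D, "E": E, "M": M, "W": W})
        stall = hazard(program, D, (E, M, W))
        W, M = M, E                      # later stages always advance
        if stall:
            E = BUBBLE                   # F and D keep their instructions
        else:
            E, D = D, F
            F = F + 1 if F is not None and F + 1 < len(program) else None
    return history


demo_h0 = [
    ("irmovl $10,%edx", "%edx", ()),
    ("irmovl $3,%eax", "%eax", ()),
    ("addl %edx,%eax", "%eax", ("%edx", "%eax")),
    ("halt", None, ()),
]
for cycle, stages in enumerate(simulate(demo_h0, 8), start=1):
    print(cycle, stages)
```

In this model `addl` stays in decode for cycles 4 through 7 and reaches execute in cycle 8, with three bubbles ahead of it, which matches the table above.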

53 53 Implementing Stalling Pipeline Control –Combinational logic detects stall condition –Sets mode signals for how pipeline registers should update

54 Implementing Stalling [Figure: pipe control logic attached to pipeline registers F, D, E, M, W. Control signals produced: F_stall, D_stall, D_bubble, E_bubble, M_bubble, W_stall, set_cc. Other signals shown: d_srcA, d_srcB, D_icode, E_icode, E_dstM, e_Cnd, M_icode, m_stat, W_stat, stat]

55 55 Pipeline Register Modes Stall (stall = 1, bubble = 0): before the rising clock edge, Output = x and Input = y; after the rising clock edge, Output = x (the state is held)

56 56 Pipeline Register Modes Bubble (stall = 0, bubble = 1): before the rising clock edge, Output = x and Input = y; after the rising clock edge, Output = nop

57 57 Pipeline Register Modes Normal (stall = 0, bubble = 0): before the rising clock edge, Output = x and Input = y; after the rising clock edge, Output = y
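
The three modes can be summarized as one update rule applied at each rising clock edge. The class below is a minimal Python sketch; the class name, the `clock` method, and the choice to reject stall and bubble being asserted together are assumptions made for illustration.

```python
NOP = "nop"


class PipelineRegister:
    """A pipeline register with stall and bubble control inputs."""

    def __init__(self, state=NOP):
        self.output = state

    def clock(self, value, stall=False, bubble=False):
        """Apply one rising clock edge and return the new output."""
        if stall and bubble:
            raise ValueError("stall and bubble must not both be set")
        if stall:
            pass                  # Stall: keep the current state
        elif bubble:
            self.output = NOP     # Bubble: replace the state with a nop
        else:
            self.output = value   # Normal: load the input
        return self.output


reg = PipelineRegister("x")
print(reg.clock("y", stall=True))   # state held
print(reg.clock("y", bubble=True))  # state replaced by nop
print(reg.clock("y"))               # input loaded
```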

58 58 Next Data Forwarding Handle Control Hazard Performance metrics Suggested Reading 4.5

