1 Naïve Pipelined Implementation
2 Outline General Principles of Pipelining –Goal –Difficulties Naïve PIPE Implementation Suggested Reading 4.4, 4.5
3 Pipeline
4 Problem of SEQ Too slow –Too many tasks needed to finish in one clock cycle –Signals need long time to propagate through all of the stages –The clock must run slowly enough Does not make good use of hardware units –Every unit is active for part of the total clock cycle
5 Real-World Pipelines: Car Washes Idea –Divide process into independent stages –Move objects through stages in sequence –At any given times, multiple objects being processed Sequential Parallel Pipelined
6 Computational Example System –Computation requires total of 300 picoseconds –Additional 20 picoseconds to save result in register –Can must have clock cycle of at least 320 ps Combinational logic RegReg 300 ps20 ps Clock Delay = 320 ps Throughput = 3.12 GOPS
7 3-Way Pipelined Version System –Divide combinational logic into 3 blocks of 100 ps each –Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps –Overall latency increases 360 ps from start to finish RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps Delay = 360 ps Throughput = 8.33 GOPS
8 Pipeline Diagrams Unpipelined –Cannot start new operation until previous one completes 3-Way Pipelined –Up to 3 operations in process simultaneously Time OP1 OP2 OP3 Time ABC ABC ABC OP1 OP2 OP3
9 Operating a Pipeline Time OP1 OP2 OP3 ABC ABC ABC Clock RegReg Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps 239 RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps 241 RegReg RegReg RegReg 100 ps20 ps100 ps20 ps100 ps20 ps Comb. logic A Comb. logic B Comb. logic C Clock 300 RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps 359
10 Computational Example System –Computation requires total of 300 picoseconds –Additional 20 picoseconds to save result in register –Can must have clock cycle of at least 320 ps Combinational logic RegReg 300 ps20 ps Clock Delay = 320 ps Throughput = 3.12 GOPS
11 3-Way Pipelined Version System –Divide combinational logic into 3 blocks of 100 ps each –Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps –Overall latency increases 360 ps from start to finish RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps Delay = 360 ps Throughput = 8.33 GOPS
12 Pipeline Diagrams Unpipelined –Cannot start new operation until previous one completes 3-Way Pipelined –Up to 3 operations in process simultaneously Time OP1 OP2 OP3 Time ABC ABC ABC OP1 OP2 OP3
13 Limitations: Nonuniform Delays Time OP1 OP2 OP3 ABC ABC ABC RegReg Clock RegReg Comb. logic B RegReg Comb. logic C 50 ps20 ps150 ps20 ps100 ps20 ps Delay = 510 ps Throughput = 5.88 GIPS Comb. logic A
14 Limitations: Nonuniform Delays Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages
15 Limitations: Register Overhead Delay = 420 ps, Throughput = GOPS Clock RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps
16 Limitations: Register Overhead As try to deepen pipeline, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: –1-stage pipeline: 6.25% –3-stage pipeline: 16.67% –6-stage pipeline: 28.57% High speeds of modern processor designs obtained through very deep pipelining
Data Dependencies in Processors –Result from one instruction used as operand for another Read-after-write (RAW) dependency –Very common in actual programs –Must make sure our pipeline handles these properly Get correct results Minimize performance impact 1 irmovl $50, %eax 2 addl %eax, %ebx 3 mrmovl 100( %ebx ), %edx 17
18 Data Dependencies System –Each operation depends on result from preceding one Clock Combinational logic RegReg Time OP1 OP2 OP3
19 Data Hazards RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C Time OP1 OP2 OP3 ABC ABC ABC OP4 ABC
20 Data Hazards Result does not feed back around in time for next operation Pipelining has changed behavior of system
21 Naïve PIPE Implementation
SEQ Hardware Stages occur in sequence One operation in process at a time
SEQ+ Hardware –Still sequential implementation –Reorder PC stage to put at beginning PC Stage –Task is to select PC for current instruction –Based on results computed by previous instruction Processor State –PC is no longer stored in register –But, can determine PC based on other stored information
24 Memory Execute Decode Fetch Instruction memory PC increment CC ALU Data Memory icode,ifunrA,rB valC Register File ABM E pState valP srcA,srcB dstA, dstB valA, valB aluA, aluB Cnd valE Addr, Data valM valE, valM icode, valC valP PC Write Back
25 Memory Execute Decode Fetch
26 Pipeline Stages Fetch –Select current PC –Read instruction –Compute incremented PC Decode –Read program registers Execute –Operate ALU Memory –Read or write data memory Write Back –Update register file
27 PIPE- Hardware Pipeline registers hold intermediate values from instruction execution Forward (Upward) Paths –Values passed from one stage to next –Cannot jump past stages e.g., valC passes through decode
28 E M W F D PC ALU Data Memory Select PC rB Select A ALUAALUB Mem. control Addr read ALU fun Fetch Decode Execute Memory Write back data out data in M_valA W_valM W_valE M_valA W_valM d_rvalA f_PC Predict PC valEvalM dstEdstM Cnd valE valA dstE dstM icodeifun valC valAvalBdstE srcB valC valP icode ifunrA predPC CC e_Cnd M_Bch write dstMsrcA Register File ABM E dstEdstMsrcAsrcB d_srcA d_srcB Instruction Instruction Memory increment icode
29 E M W F D PC ALU Data Memory Select PC rB Select A ALUAALUB Mem. control Addr read ALU fun Fetch Decode Execute Memory Write back icode data out data in M_valA W_valM W_valE M_valA W_valM d_rvalA f_PC Predict PC valEvalM dstEdstM Cnd icode valE valA dstE dstM icodeifun valC valAvalBdstE srcB valC valP icode ifunrA predPC CC e_Cnd M_Bch write dstMsrcA Register File ABM E srcB srcA dstM dstE d_srcA d_srcB Instruction Instruction Memory increment
30 Feedback Paths Predicted PC –Guess value of next PC –Branch information Jump taken/not-taken Fall-through or target address –Return point Read from memory Register updates To register file write ports
31 Start fetch of new instruction after current one has completed fetch stage –Not enough time to reliably determine next instruction Guess which instruction will follow –Recover if prediction was incorrect Predicting the PC
32 Predicting the PC F D M_icode Predict PC Instruction Instruction Memory PC increment predPC Need regids Need valC Instr valid Align Split Bytes1-5 Byte 0 Select PC M_Cnd M_valA W_icode W_valM D rB valC valP icode ifunrA
33 Our Prediction Strategy Instructions that Don’t Transfer Control –Predict next PC to be valP –Always reliable Call and Unconditional Jumps –Predict next PC to be valC (destination) –Always reliable
34 Our Prediction Strategy Conditional Jumps –Predict next PC to be valC (destination) –Only correct if branch is taken Typically right 60% of time Return Instruction –Don’t try to predict
35 Fetch Predict PC D rB valC valP icode ifunrA F predPC Instruction Instruction Memory PC increment Select PC M_valA W_valM f_PC M_Cnd M_icode W_icode Select PC Int F_predPC = [ f_icode in {IJXX, ICALL} : f_valC; 1: f_valP; ];
36 Recovering from PC Misprediction Mispredicted Jump –Will see branch flag once instruction reaches memory stage –Can get fall-through PC from valA Return Instruction –Will get return PC when ret reaches write-back stage
37 Select PC int f_PC = [ #mispredicted branch. Fetch at incremented PC M_icode == IJXX && !M_Cnd : M_valA; #completion of RET instruciton W_icode == IRET : W_valM; #default: Use predicted value of PC 1: F_predPC ];
38 Pipeline Demonstration FDEM WFDEM W FDEMW FDEMWFDEMW Cycle 5 W I1 M I2 E I3 D I4 F I5 irmovl $1,%eax#I1 irmovl $2,%ecx#I2 irmovl $3,%edx#I3 irmovl $4,%ebx #I4 halt#I5
39 Data Dependencies in Processors Result from one instruction used as operand for another –Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly –Get correct results –Minimize performance impact 1 irmovl $50, %eax 2 addl %eax, %ebx 3 mrmovl 100(%ebx),%edx
40 Data Dependencies: No Nop FDEM WFDEM W FDEMW FDEMW E D valA R[ %edx ]=0 valB R[ %eax ]=0 D valA R[ %edx ]=0 valB R[ %eax ]=0 Cycle 4 Error M M_valE= 10 M_dstE= %edx e_valE = 3 E_dstE= %eax # demo-h0.ys 0x000:irmovl $10,% edx 0x006:irmovl $3,%eax 0x00c:addl %edx,%eax 0x0e:halt
41 Data Dependencies: 1 Nop FDEM WFDEM W FDEMWFDEMW FDEMWFDEMW FDEMWFDEMW W R[ %edx ] 10 W R[ %edx ] 10 D valA R[ %edx ]=0 valB R[ %eax ]=0 D valA R[ %edx ]=0 valB R[ %eax ]=0 Cycle 5 Error M M_valE= 3 M_dstE= %eax # demo-h1.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax 0x00c:nop 0x00d:addl %edx,%eax 0x0f:halt
42 Data Dependencies: 2 Nop’s W R[ %eax ] 3 D valA R[ %edx ]=10 valB R[ %eax ]=0 W R[ %eax ] 3 W R[ %eax ] 3 D valA R[ %edx ]=10 valB R[ %eax ]=0 D valA R[ %edx ]=10 valB R[ %eax ]=0 Cycle 6 Error # demo-h2.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax 0x00c:nop 0x00d:nop 0x00e:addl %edx,%eax 0x010:halt
43 Data Dependencies: 3 Nop’s D D val ← R[%edx]=10 val←R[%eax]=3 WW R[%eax] ←3 Cycle 6 Cycle 7 # demo-h3.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax 0x00c:nop 0x00d:nop 0x00e:nop 0x00f:addl %edx,%eax 0x011:halt
44 Classes of Data Hazards Hazards can potentially occur when one instruction updates part of the program state that read by a later instruction
45 Classes of Data Hazards Program states: –Program registers The hazards already identified. –Condition codes Both written and read in the execute stage. No hazards can arise –Program counter Conflicts between updating and reading PC cause control hazards –Memory Both written and read in the memory stage. Without self-modified code, no hazards.
46 Data Dependencies: 2 Nop’s W R[ %eax ] 3 D valA R[ %edx ]=10 valB R[ %eax ]=0 W R[ %eax ] 3 W R[ %eax ] 3 D valA R[ %edx ]=10 valB R[ %eax ]=0 D valA R[ %edx ]=10 valB R[ %eax ]=0 Cycle 6 Error # demo-h2.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax 0x00c:nop 0x00d:nop 0x00e:addl %edx,%eax 0x010:halt
47 Data Dependencies: 2 Nop’s # demo-h2.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax 0x00c:nop 0x00d:nop bubble 0x00e:addl %edx,%eax 0x010:halt FDEMW FDEMW FDEMW F EMW DDEMW FDEMW 10 F FDEMW 11 If instruction follows too closely after one that writes register, slow it down Hold instruction in decode Dynamically inject nop into execute stage
dstM dstE 48 E M W F D PC ALU Data Memory Select PC rB Select A ALUAALUB Mem. control Addr read ALU fun Fetch Decode Execute Memory Write back data out data in M_valA W_valM W_valE M_valA W_valM d_rvalA f_PC Predict PC valEvalM dstEdstM Bch valE valA dstE dstM icodeifun valC valAvalB dstE srcB valC valP icode ifunrA predPC CC e_Bch M_Bch write dstM srcA Register File ABM E srcAsrcB d_srcA d_srcB Instruction Instruction Memory increment icode
49 Stall Condition Source Registers –srcA and srcB of current instruction in decode stage Destination Registers –dstE and dstM fields –Instructions in execute, memory, and write-back stages Condition –srcA==dstE or srcA==dstM –srcB==dstE or srcB==dstM Special Case –Don’t stall for register ID F Indicates absence of register operand
50 Data Dependencies: 2 Nop’s # demo-h2.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax 0x00c:nop 0x00d:nop bubble 0x00e:addl %edx,%eax 0x010:halt FDEMW FDEMW FDEMW F EMW DDEMW FDEMW 10 F FDEMW 11 Cycle 6 W D W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax
51 Stalling X FDEMW FDEMW F EMW D EMW DDEMW FDEMW 10 FF D F EMW 11 Cycle 4 W W_dstE = %eax D srcA = %edx srcB = %eax M M_dstE = %eax D srcA = %edx srcB = %eax E E_dstE = %eax D srcA = %edx srcB = %eax Cycle 5 Cycle 6 # demo-h0.ys 0x000:irmovl $10,%edx 0x006:irmovl $3,%eax bubble 0x00c:addl %edx,%eax 0x0e:halt
52 What Happens When Stalling? Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage –Like dynamically generated nop’s –Move through later stages 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax Cycle 4 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax # demo-h0.ys 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax Cycle 5 0x00e: halt 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax bubble Cycle 6 0x00e: halt bubble 0x00c: addl %edx,%eax bubble Cycle 7 0x00e: halt bubble Cycle 8 0x00c: addl %edx,%eax 0x00e: halt Write Back Memory Execute Decode Fetch
53 Implementing Stalling Pipeline Control –Combinational logic detects stall condition –Sets mode signals for how pipeline registers should update
Implementing Stalling E M W F D CC rB srcA srcB icodevalEvalMdstEdstM CndicodevalEvalAdstEdstM icodeifunvalCvalAvalBdstEdstMsrcAsrcB valCvalPicodeifunrA predPC d_srcB d_srcA e_Cnd D_icode E_icode M_icode E_dstM Pipe control logic D_bubble D_stall E_bubble F_stall M_bubble W_stall set_cc stat W_stat stat m_stat
55 Pipeline Register Modes Rising clock Rising clock Output = x xx Input = y stall = 1 bubble = 0 xx Stall
56 Pipeline Register Modes n o p Rising clock Rising clock Output = nop xx Output = xInput = y stall = 0 bubble = 1 Bubble
57 Pipeline Register Modes Rising clock Rising clock Output = y yy Output = xInput = y stall = 0 bubble = 0 xx Normal
Next Data Forwarding Handle Control Hazard Performance matrix Suggested Reading