Download presentation
Presentation is loading. Please wait.
Published byBailey Iversen Modified over 9 years ago
1
Real-World Pipelines: Car Washes Idea Divide process into independent stages Move objects through stages in sequence At any instant, multiple objects being processed SequentialParallel Pipelined
2
Computational Example System Computation requires total of 300 picoseconds Additional 20 picoseconds to save result in register Clock cycle must be at least 320 ps Combinational logic RegReg 300 ps20 ps Clock Delay = 320 ps Throughput = 3.12 GOPS Throughput = 1 operation 320 ps1 ns 1000 ps 3.12 GOPS x
3
3-Stage Pipelined Version System Divide combinational logic into 3 blocks of 100 ps each Can begin new operation as soon as previous one passes through stage A. Begin new operation every 120 ps Performance impact Latency increases: 360 ps from start to finish (was 320) Speedup is less than 3x – why? RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps Delay = 360 ps Throughput = 8.33 GOPS
4
Pipelining Unpipelined Cannot start new operation until previous one completes 3-way pipelined Up to 3 operations in process simultaneously Time OP1 OP2 OP3 Time ABC ABC ABC OP1 OP2 OP3
5
Operating a Pipeline Time OP1 OP2 OP3 ABC ABC ABC 0120240360480640 Clock RegReg Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps 239 RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps 241 RegReg RegReg RegReg 100 ps20 ps100 ps20 ps100 ps20 ps Comb. logic A Comb. logic B Comb. logic C Clock 300 RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C 100 ps20 ps100 ps20 ps100 ps20 ps 359
6
Real-World Pipelines: Laundry Assuming you’ve got: One washer (takes 30 minutes) One drier (takes 40 minutes) One “folder” (takes 20 minutes) It takes 90 minutes to wash, dry, and fold 1 load of laundry. How long does 4 loads take?
7
The slow way If each load is done sequentially it takes 6 hours 304020304020304020304020 6 PM 789 10 11 Midnight Time
8
8 Laundry Pipelining Start each load as soon as possible: overlap loads Pipelined laundry takes 3.5 hours 6 PM 789 10 11 Midnight Time 3040 20
9
Limitations: Nonuniform Delays Throughput limited by slowest stage Other stages sit idle for much of the time Challenging to partition system into balanced stages RegReg Clock RegReg Comb. logic B RegReg Comb. logic C 50 ps20 ps150 ps20 ps100 ps20 ps Delay = 510 ps Throughput = 5.88 GOPS Comb. logic A Time OP1 OP2 OP3 ABC ABC ABC
10
Limitations: Register Overhead As pipeline deepens, overhead of loading registers becomes more significant Percentage of clock cycle spent loading register: 1-stage pipeline: 6.25% 3-stage pipeline: 16.67% 6-stage pipeline: 28.57% High speeds of modern processor designs obtained through deep pipelining: high overhead Delay = 420 ps, Throughput = 14.29 GOPSClock RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps RegReg Comb. logic 50 ps20 ps
11
Data Dependencies Each operation depends on result from preceding one Not a problem for SEQ implementation Clock Combinational logic RegReg Time OP1 OP2 OP3
12
Limitations: Data Hazards Result does not feed back around in time for next operation Unless special action is taken, results will be incorrect! RegReg Clock Comb. logic A RegReg Comb. logic B RegReg Comb. logic C Time OP1 OP2 OP3 ABC ABC ABC OP4 ABC
13
Review: Arithmetic and Logical Operations Refer to generically as “ OPl ” Encodings differ only by “function code” Low-order 4 bits in first instruction word All set condition codes as side effect addl rA, rB 60 rArB subl rA, rB 61 rArB andl rA, rB 62 rArB xorl rA, rB 63 rArB Add Subtract (rA from rB) And Exclusive-Or Instruction CodeFunction Code
14
Review: Move Operations Similar to the IA32 movl instruction Simpler format for memory addresses Separated into different instructions to simplify hardware implementation rrmovl rA, rB 20 rArB Register --> Register Immediate --> Register irmovl V, rB 30F rB V Register --> Memory rmmovl rA, D ( rB) 40 rArB D Memory --> Register mrmovl D ( rB), rA 50 rArB D
15
Data Dependencies in Y86 Code Result from one instruction used as operand for another Read-after-write (RAW) dependency Very common in actual programs Must make sure our pipeline handles these properly Essential: results must be correct Desirable: performance impact should be minimized 1 irmovl $50, %eax 2 addl %eax, %ebx 3 mrmovl 100( %ebx ), %edx
16
SEQ Hardware Stages occur in sequence One operation in process at a time write Mem. control
17
SEQ+ Hardware Still sequential implementation Reorder PC stage to put at beginning PC stage Task is to select PC for current instruction Based on results computed by previous instruction Processor state PC is no longer stored in register PC is determined from other stored information Instruction memory Instruction memory PC increment PC increment CC ALU Data memory Data memory PC rB dstEdstM ALU A ALU B Mem. control Addr srcAsrcB read write ALU fun. Fetch Decode Execute Memory Write back data out Register file Register file AB M E Cnd dstEdstMsrcAsrcB icodeifunrA pCndpValMpValCpValP pIcode PC valCvalP valBvalA Data valE valM PC Stat dmem_error stat instr_valid imem_error
18
Adding Pipeline Registers Instruction memory Instruction memory PC increment PC increment CC ALU Data memory Data memory Fetch Decode Execute Memory Write back icode, ifun rA, rB valC Register file Register file AB M E PC valP srcA, srcB dstE, dstM valA, valB aluA, aluB Cnd valE Addr, Data valM PC valE, valM newPC PC increment PC increment CC ALU Data memory Data memory Fetch Decode Execute Memory Write back Register file Register file AB M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM icode, ifun, rA, rB, valC E M W F D valP f_pc predPC Instruction memory Instruction memory M_icode, M_Cnd, M_valA
19
Pipeline Stages Fetch Select current PC Read instruction Compute incremented PC Decode Read program registers Execute Perform ALU operation Memory Read or write data memory Write back Update register file PC increment PC increment CC ALU Data memory Data memory Fetch Decode Execute Memory Write back Register file Register file AB M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM icode, ifun, rA, rB, valC E M W F D valP f_pc predPC Instruction memory Instruction memory M_icode, M_Cnd, M_valA
20
PIPE- Hardware Pipeline registers hold intermediate values from instruction execution Note naming conventions Forward (upward) paths Values flow from one stage to next Values cannot skip stages; look at valC in decode dstE and dstM in execute, memory
21
PIPE- Example At clock cycle 4, what are the values of d_srcA d_srcB E_dstE e_valE M_dstE M_valE m_valM Highlight the datapath in D, E, M stages. 1: mrmovl 4(%ebx), %ecx 2: addl %eax, %edx 3: addl %ecx, %edx
22
Problems: Feedback Paths Predicted PC Guess value of next PC Branch information Jump taken/not-taken Fall-through or target address Return address Read from memory Register updates To register file write ports
23
Predicting the PC Pipeline timing requires next instruction fetch to begin one cycle after current instruction is fetched Not enough time to determine next instruction in all cases Can be done for all but conditional jumps, return Solution for conditional jumps: guess outcome + continue Recover later if prediction was incorrect PC increment PC increment CC ALU Data memory Data memory Fetch Decode Execute Memory Write back Register file Register file AB M E d_srcA, d_srcB valA, valB aluA, aluB Cnd valE Addr, Data valM W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM icode, ifun, rA, rB, valC E M W F D valP f_pc predPC Instruction memory Instruction memory M_icode, M_Cnd, M_valA
24
Simple Prediction Strategy Instructions that don’t transfer control Next PC is always valP (no guess required) Call and unconditional jumps Next PC is always valC (no guess required) Conditional jumps Could predict next PC to be valC (branch target) Correct only if branch is taken (~60% of time) Could predict next PC to be valP (sequential successor) Correct only if branch not taken (~40% of time) Could predict taken if forward, not-taken if backward Correct ~65% of time Return instruction Don’t try to predict (Why? What would be required?)
25
Branch Misprediction Example Should execute only first 7 instructions Our predictor will guess the branch will be taken What must be done to recover? 0x000: xorl %eax,%eax 0x002: jne t # Not taken 0x007: irmovl $1, %eax # Fall through 0x00d: nop 0x00e: nop 0x00f: nop 0x010: halt 0x011: t: irmovl $2, %edx # Target (Should not execute) 0x017: irmovl $3, %ebx # Should not execute 0x01d: irmovl $4, %edx # Should not execute
26
Handling Misprediction: Desired Behavior Detection Branch is in execute, and it was not taken Two incorrect instructions follow branch in pipe Action On next cycle, replace mispredicted instructions with “bubbles”
27
0x000: irmovl Stack,%esp # Initialize stack pointer 0x006: call p # Procedure call 0x00b: irmovl $5,%esi # Return point 0x011: halt 0x020:.pos 0x20 0x020: p: irmovl $-1,%edi # procedure 0x026: ret 0x027: irmovl $1,%eax # Should not be executed 0x02d: irmovl $2,%ecx # Should not be executed 0x033: irmovl $3,%edx # Should not be executed 0x039: irmovl $4,%ebx # Should not be executed 0x100:.pos 0x100 0x100: Stack: # Stack: Stack pointer Return Example Without special handling we would fetch three more instructions after ret before we know return address Instead, we’ll freeze the pipe when ret is fetched
28
0x026: ret FDEM W bubble FDEM W FDEMW FDEMW 0x00b:irmovl$5,%esi# Return FDEMW # demo-retb FDEMW F valC 5 rB %esi F valC 5 rB %esi W valM= 0x0b W valM= 0x0b Handling Return: Desired Behavior Detection ret is in decode, execute or memory Address of next instruction unavailable Action Fill pipe with bubbles until ret ’s memory access completes
29
Pipeline Operation irmovl $1,%eax #I1 123456789 FDEM W irmovl $2,%ecx #I2 FDEM W irmovl $3,%edx #I3 FDEMW irmovl $4,%ebx #I4 FDEMW halt #I5 FDEMW Cycle 5 W I1 M I2 E I3 D I4 F I5
30
Problems with PIPE-: Example with 3 Nops
31
Example: 2 Nops
32
Example: 1 Nop
33
Example: 0 Nops 0x000:irmovl$10,%edx 12345678 FDEM W 0x006:irmovl$3,%eax FDEM W FDEMW 0x00c:addl%edx,%eax FDEMW 0x00e: halt # demo-h0.ys E D valA R[ %edx ]=0 valB R[ %eax ]=0 D valA R[ %edx ]=0 valB R[ %eax ]=0 Cycle 4 Error M M_valE= 10 M_dstE= %edx e_valE 0 + 3 = 3 E_dstE= %eax
34
Hazard Solution 1: Stalling If instruction that reads register follows too closely after instruction that writes same register, delay reading instruction Delay or stall always takes place in decode stage Following instruction also delayed in fetch, but just one “stall cycle” tallied Stall = dynamically injecting nop (bubble) into execute stage Hazard we get wrong answer without special action 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW 0x00c: nop FDEMW bubble F EMW 0x00e: addl %edx,%eax DDEMW 0x010: halt FDEMW 10 # demo-h2.ys F FDEMW 0x00d: nop 11
35
Stalls Detection Write is pending for a src register srcA or srcB (not 0xF) matches dstE or dstM in execute, memory, or write- back stages Action Stall instruction in decode 1: addl %eax, %edx 2: mrmovl 4(%ebx), %ecx 3: addl %ecx, %edx
36
Stalls Detection Write is pending for a src register srcA or srcB (not 0xF) matches dstE or dstM in execute, memory, or write- back stages Action Stall instruction in decode 1: addl %eax, %edx 2: mrmovl 4(%ebx), %ecx 3: addl %ecx, %edx
37
Detecting Stall Condition 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW 0x00c: nop FDEMW F 0x00e: addl %edx,%eax DEMW 0x010: halt DEMW 10 # demo-h2.ys F FDEMW 0x00d: nop 11 Cycle 6 W D W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax
38
Detecting Stall Condition 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW 0x00c: nop FDEMW bubble F EMW 0x00e: addl %edx,%eax DDEMW 0x010: halt FDEMW 10 # demo-h2.ys F FDEMW 0x00d: nop 11 Cycle 6 W D W_dstE = %eax W_valE = 3 srcA = %edx srcB = %eax
39
Multi-cycle Stalls 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW bubble F EMW D EMW 0x00c: addl %edx,%eax DDEMW 0x00e: halt FDEMW 10 # demo-h0.ys FF D F EMW bubble 11 Cycle 4 D srcA = %edx srcB = %eax
40
Multi-cycle Stalls 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW bubble F EMW D EMW 0x00c: addl %edx,%eax DDEMW 0x00e: halt FDEMW 10 # demo-h0.ys FF D F EMW bubble 11 Cycle 4 E E_dstE = %eax D srcA = %edx srcB = %eax
41
Multi-cycle Stalls 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW bubble F EMW D EMW 0x00c: addl %edx,%eax DDEMW 0x00e: halt FDEMW 10 # demo-h0.ys FF D F EMW bubble 11 Cycle 4 D srcA = %edx srcB = %eax E E_dstE = %eax D srcA = %edx srcB = %eax Cycle 5
42
Multi-cycle Stalls 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW bubble F EMW D EMW 0x00c: addl %edx,%eax DDEMW 0x00e: halt FDEMW 10 # demo-h0.ys FF D F EMW bubble 11 Cycle 4 M M_dstE = %eax D srcA = %edx srcB = %eax E E_dstE = %eax D srcA = %edx srcB = %eax Cycle 5
43
Multi-cycle Stalls 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW bubble F EMW D EMW 0x00c: addl %edx,%eax DDEMW 0x00e: halt FDEMW 10 # demo-h0.ys FF D F EMW bubble 11 Cycle 4 D srcA = %edx srcB = %eax M M_dstE = %eax D srcA = %edx srcB = %eax E E_dstE = %eax D srcA = %edx srcB = %eax Cycle 5 Cycle 6
44
Multi-cycle Stalls 0x000: irmovl $10,%edx 123456789 FDEMW 0x006: irmovl $3,%eax FDEMW bubble F EMW D EMW 0x00c: addl %edx,%eax DDEMW 0x00e: halt FDEMW 10 # demo-h0.ys FF D F EMW bubble 11 Cycle 4 W W_dstE = %eax D srcA = %edx srcB = %eax M M_dstE = %eax D srcA = %edx srcB = %eax E E_dstE = %eax D srcA = %edx srcB = %eax Cycle 5 Cycle 6
45
Stalling: Another View Operational rules: Stalling instruction held back in decode stage Following instruction stays in fetch stage Bubbles injected into execute stage Like dynamically generated nop’s Move through stage-by-stage as regular instructions 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax Cycle 4 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax 0x00c: addl %edx,%eax # demo-h0.ys 0x00e: halt 0x000: irmovl $10,%edx 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax Cycle 5 0x00e: halt 0x006: irmovl $3,%eax bubble 0x00c: addl %edx,%eax bubble Cycle 6 0x00e: halt bubble 0x00c: addl %edx,%eax bubble Cycle 7 0x00e: halt bubble Cycle 8 0x00c: addl %edx,%eax 0x00e: halt Write Back Memory Execute Decode Fetch
46
Summary Pipeline Hazards Data Hazards Solution #1: Stalling Control Hazard jmp, ret Solution #1: Simple Prediction
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.