Chapter Six (© 1998 Morgan Kaufmann Publishers)
Pipelining
Improve performance by increasing instruction throughput.
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
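A rough illustration (the numbers are ours, not from the slide): if each of the 5 stages takes 200 ps, a single instruction still needs about 1000 ps, but n instructions finish in roughly (n + 4) × 200 ps instead of n × 1000 ps. The speedup 5n / (n + 4) approaches 5, the number of stages, only for long instruction sequences and only if the pipeline never stalls.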
Pipelining
What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
We will build a simple pipeline and look at these issues.
We will talk about modern processors and what really makes it hard:
– exception handling
– trying to improve performance with out-of-order execution, etc.
Basic Idea
What do we need to add to actually split the datapath into stages?
Pipelined Datapath
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Corrected Datapath
Instruction fetch stage of lw
Instruction decode stage of lw
Execution stage of lw
Memory stage of lw
Write back stage of lw
Execution stage of sw
Memory stage of sw
Write back stage of sw
Instructions flowing in the pipeline (continued over the following figures)
Graphically Representing Pipelines
Can help with answering questions like:
– how many cycles does it take to execute this code?
– what is the ALU doing during cycle 4?
– use this representation to help understand datapaths
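As a small illustration (a sketch assuming an ideal pipeline with no stalls, not taken from the slides), the cycle count these diagrams answer can also be computed directly:

    /* Ideal pipeline timing: the first instruction needs one cycle per stage
       to drain through, and each later instruction finishes one cycle after
       the previous one (no stalls assumed). */
    unsigned pipeline_cycles(unsigned n_instructions, unsigned n_stages)
    {
        if (n_instructions == 0)
            return 0;
        return n_stages + (n_instructions - 1);
    }

    /* Example: 5 instructions in a 5-stage pipeline -> 9 cycles,
       matching the multiple-clock-cycle pipeline diagrams. */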
Pipeline Control
Can these operations be completed one stage earlier? Think about the critical path!
Why does the instruction memory (IM) have no control signal at all?
Why does the register file (RF) have only a write control?
Can the ALU result be written back in MEM? Consider the cost of an extra RF write port!
Pipeline design considerations
Simplification of the control mechanism
– units that are active in every clock cycle can be always enabled (no control signal)
– Ex: the instruction memory and the pipeline registers have no control signals
Minimization of power consumption
– explicit control for infrequent operations
– Ex: both read and write controls for the data memory
Cost considerations
– Ex: alternatives for writing the ALU result back: in MEM or in WB
Pipeline control
We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Write Back
How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
– centralized or distributed?
Pipeline Control
Pass control signals along just like the data
– generate all control signals in the decode stage, similar to the single-cycle implementation
– pass the generated control signals along the pipeline and consume each one at its corresponding stage, similar to the multicycle implementation
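A minimal sketch of "pass control signals along just like the data" (the field names are illustrative, not the book's exact signal list): the control bits produced in ID simply ride in the pipeline registers and are peeled off stage by stage.

    /* Control bits generated in ID and carried forward with the data. */
    struct ex_ctrl  { unsigned reg_dst:1, alu_src:1, alu_op:2; };
    struct mem_ctrl { unsigned branch:1, mem_read:1, mem_write:1; };
    struct wb_ctrl  { unsigned reg_write:1, mem_to_reg:1; };

    struct id_ex_reg  { struct ex_ctrl ex; struct mem_ctrl m; struct wb_ctrl wb; /* ...data fields... */ };
    struct ex_mem_reg { struct mem_ctrl m; struct wb_ctrl wb; /* ...data fields... */ };
    struct mem_wb_reg { struct wb_ctrl wb; /* ...data fields... */ };

    /* Each stage uses its own group and passes the rest along:
       EX consumes id_ex.ex and copies m and wb into ex_mem, and so on. */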
Datapath with Control
Dependencies
Problem with starting the next instruction before the first has finished
– dependencies that point backward in time are data hazards
Software Solution
Have the compiler guarantee no hazards. Where do we insert the nops?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Insert nops after the sub:
    sub $2, $1, $3
    nop
    nop
    nop
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Problem: this really slows us down!
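A rough sketch of how a compiler or assembler could decide where nops are required (illustrative only; it assumes a 5-stage pipeline with no forwarding of any kind, which is why the sub above needs three nops):

    /* Returns how many nops must precede an instruction that reads rs and rt,
       given the destination registers of the previous three instructions
       (index 0 = most recent; register 0 means "writes nothing"). */
    int nops_needed(int rs, int rt, const int prev_dest[3])
    {
        for (int back = 0; back < 3; back++)        /* 1, 2 or 3 instructions back */
            if (prev_dest[back] != 0 &&
                (prev_dest[back] == rs || prev_dest[back] == rt))
                return 3 - back;                    /* closer producer -> more nops */
        return 0;
    }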
Forwarding
Use temporary results; don't wait for them to be written back
– ALU forwarding
– register file forwarding (latch-based register file) to handle a read and a write of the same register in the same cycle: read what you just wrote, through a transparent latch
What if this $2 were $13?
No forwarding datapath
With forwarding datapath
The control values for the forwarding multiplexors

Mux control   | Source | Explanation
ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file
ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result
ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file
ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result
ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result

MEM-hazard forwarding conditions:
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRs)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) then ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRt)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) then ForwardB = 01
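To make the selection logic concrete, here is a small C sketch of the whole forwarding unit for one operand (the EX-hazard case plus the MEM-hazard case above). The 00/10/01 encoding follows the table; the function and parameter names are ours, not the book's.

    enum { FWD_REG = 0, FWD_EXMEM = 2, FWD_MEMWB = 1 };   /* 00, 10, 01 */

    /* Returns the mux control for one ALU operand whose source register is rs. */
    int forward_select(int rs,
                       int exmem_regwrite, int exmem_rd,
                       int memwb_regwrite, int memwb_rd)
    {
        /* EX hazard: forward the result sitting in EX/MEM (the most recent one). */
        if (exmem_regwrite && exmem_rd != 0 && exmem_rd == rs)
            return FWD_EXMEM;
        /* MEM hazard: forward from MEM/WB only if EX/MEM is not already forwarding. */
        if (memwb_regwrite && memwb_rd != 0 && memwb_rd == rs)
            return FWD_MEMWB;
        /* No hazard: take the value read from the register file. */
        return FWD_REG;
    }
    /* ForwardA = forward_select(ID/EX.RegisterRs, ...); ForwardB uses RegisterRt. */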
The modified datapath resolves hazards via forwarding
Register file forwarding of $2 (through the transparent latch)
Forwarding and register write
The $4 operand matches simultaneously in the MEM stage and the WB stage.
Forward from the nearest stage, MEM (this follows the sequential programming model).
The WB stage still writes its value to the register file (RF).
Can't always forward
Load word can still cause a hazard:
– an instruction tries to read a register immediately after a load instruction that writes the same register
Thus we need a hazard detection unit to stall the dependent instruction.
(Reminder: the latch-based RF only lets an instruction read what is written in the same cycle — write, then read.)
Stalling
We can stall the pipeline by keeping an instruction in the same stage
Hazard Detection Unit
Stall by letting an instruction that won't write anything go forward
– on a stall, insert a NOP (bubble) while the load continues to the next stage
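The condition the hazard detection unit checks can be written down directly (a sketch; the field names mirror the pipeline-register notation used above):

    /* Load-use hazard: the instruction in ID needs a register that the
       load currently in EX will not produce until MEM.  Stall one cycle. */
    int load_use_stall(int idex_memread, int idex_rt,
                       int ifid_rs, int ifid_rt)
    {
        return idex_memread &&
               (idex_rt == ifid_rs || idex_rt == ifid_rt);
    }
    /* On a stall: keep the PC and IF/ID unchanged, and zero the control
       signals going into ID/EX so a bubble (nop) flows down the pipe. */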
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting branch not taken
– need to add hardware for flushing instructions if we are wrong
Flushing Instructions
Original datapath vs. a datapath optimized for branch performance: the branch is resolved earlier, reducing the branch delay from 3 cycles to 1.
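A rough sketch of predict-not-taken with flushing (illustrative names, not the slide's exact design): fetch falls through, and only when the branch turns out to be taken is the wrongly fetched instruction squashed.

    /* With the branch decided in ID only IF/ID is flushed (1-cycle penalty);
       deciding in MEM would flush the 3 instructions behind the branch. */
    void resolve_branch(int taken, unsigned target,
                        unsigned *pc, unsigned *ifid_instr)
    {
        if (taken) {
            *pc = target;        /* redirect fetch to the branch target        */
            *ifid_instr = 0;     /* 0x00000000 is a MIPS nop (sll $0, $0, 0)   */
        }
    }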
Instruction Flushing for Branch
Instruction Flushing for Branch (the and instruction is flushed)
Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
    Before:               After:
    lw  $t0, 0($t1)       lw  $t0, 0($t1)
    lw  $t2, 4($t1)       lw  $t2, 4($t1)
    sw  $t2, 0($t1)       sw  $t0, 4($t1)
    sw  $t0, 4($t1)       sw  $t2, 0($t1)
Add a branch delay slot (delayed branch)
– the instruction immediately after a branch is always executed
– rely on the compiler to fill the slot with something useful
    Before:               After:
    add $2, $3, $4        beq $9, $10, 400
    beq $9, $10, 400      add $2, $3, $4      ; always executed
                          sub $11, $12, $13
Superscalar: start more than one instruction in the same cycle
Utilizing the branch delay slot (the compiler's task)
Final data/control path for hazard handling
Dynamic Branch Prediction
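Dynamic prediction guesses each branch's outcome from its recent history. A common scheme is a 2-bit saturating counter per branch-history-table entry; this sketch is illustrative (table size and indexing are our assumptions), not the specific design on the slide.

    /* 2-bit saturating counter: 0,1 predict not taken; 2,3 predict taken. */
    static unsigned char bht[1024];                 /* branch history table */

    int predict_taken(unsigned pc) { return bht[(pc >> 2) & 1023] >= 2; }

    void update_predictor(unsigned pc, int taken)
    {
        unsigned idx = (pc >> 2) & 1023;
        if (taken && bht[idx] < 3)        bht[idx]++;
        else if (!taken && bht[idx] > 0)  bht[idx]--;
    }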
Final data/control path for exception handling
1. flush the instructions in the pipeline
2. save the restart PC (PC + 4) and record the Cause
3. set the new PC (the exception handler address)
4. the overflowed instruction (in EX) becomes a NOP
Overflow!
Flushing (NOPs) … execution continues
Fig. 6.56: Complete datapath and control for Chapter 6
Superscalar Execution

Instruction type          | Pipe stages
ALU or branch instruction | IF ID EX MEM WB
Load or store instruction | IF ID EX MEM WB
ALU or branch instruction |    IF ID EX MEM WB
Load or store instruction |    IF ID EX MEM WB
ALU or branch instruction |       IF ID EX MEM WB
Load or store instruction |       IF ID EX MEM WB
ALU or branch instruction |          IF ID EX MEM WB
Load or store instruction |          IF ID EX MEM WB

One ALU/branch instruction and one load/store instruction are issued together each clock cycle.
Simple Superscalar Code Scheduling

Loop: lw   $t0, 0($s1)        # $t0 = array element ($s1 is I)
      addu $t0, $t0, $s2      # add scalar in $s2 (B)
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

C equivalent: do { *I = *I + B; I = I - 4; } while (I != 0);

Scheduled for dual issue:
      ALU or branch instruction | Data transfer instruction | Clock cycle
Loop:                           | lw   $t0, 0($s1)          | 1
      addi $s1, $s1, -4         |                           | 2
      addu $t0, $t0, $s2        |                           | 3
      bne  $s1, $zero, Loop     | sw   $t0, 4($s1)          | 4

(The store offset becomes 4($s1) because the addi now executes before the sw.)
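A quick check (ours, not on the original slide): the five instructions of the loop body complete in 4 clock cycles, a CPI of 4/5 = 0.8. The two-issue machine's ideal CPI would be 0.5, so the dependences leave several issue slots empty.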
Loop Unrolling for Superscalar Pipelines

      ALU or branch instruction | Data transfer instruction | Clock cycle
Loop: addi $s1, $s1, -16        | lw   $t0, 0($s1)          | 1
                                | lw   $t1, 12($s1)         | 2
      addu $t0, $t0, $s2        | lw   $t2, 8($s1)          | 3
      addu $t1, $t1, $s2        | lw   $t3, 4($s1)          | 4
      addu $t2, $t2, $s2        | sw   $t0, 16($s1)         | 5
      addu $t3, $t3, $s2        | sw   $t1, 12($s1)         | 6
                                | sw   $t2, 8($s1)          | 7
      bne  $s1, $zero, Loop     | sw   $t3, 4($s1)          | 8

C equivalent (loop unrolled four times):
do { I = I - 16; *(I+16) = *(I+16) + B; *(I+12) = *(I+12) + B;
     *(I+8) = *(I+8) + B; *(I+4) = *(I+4) + B; } while (I != 0);
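A quick check (ours, not on the original slide): the unrolled body has 14 instructions (1 addi, 4 lw, 4 addu, 4 sw, 1 bne) and completes in 8 clock cycles, a CPI of 8/14 ≈ 0.57, noticeably better than the 0.8 of the rolled loop because more independent operations are available to pair up.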
Loop Unrolling
A superscalar has the hardware to perform calculations in parallel.
For the C source code:
    for (i = 100; i != 0; i--) { A[i] = A[i] + 1; }
the compiler can unroll the loop:
    for (i = 100; i != 0; i = i - 4) {
        A[i]   = A[i]   + 1;
        A[i-1] = A[i-1] + 1;
        A[i-2] = A[i-2] + 1;
        A[i-3] = A[i-3] + 1;
    }
On a single-issue processor the two versions behave the same, but on a superscalar the larger pool of independent operations provides a richer opportunity for parallel execution.
Dynamic Scheduling: dispatch add
Dynamic Scheduling: dispatch subi
Dynamic Scheduling: execute add
Dynamic Scheduling: execute subi
Dynamic Scheduling: write back add
Dynamic Scheduling: write back subi
Dynamic Scheduling
The hardware performs the scheduling
– hardware tries to find instructions to execute
– out-of-order execution is possible
– speculative execution and dynamic branch prediction
All modern processors are very complicated
– DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
– PowerPC and Pentium: branch history table
– compiler technology is important
This class has given you the background you need to learn more.
Video: An Overview of Intel Pentium Processor (available from University Video Communications)
Figure 6.52: The performance consequences of single-cycle, multiple-cycle, and pipelined implementations
Figure 6.53: Basic relationship between the datapaths in Figure 6.52