Morgan Kaufmann Publishers, 17 April 2017
Pipeline Performance
- Assume the time for stages is
  - 100 ps for register read or write
  - 200 ps for other stages
- Compare the pipelined datapath with the single-cycle datapath
Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200 ps       100 ps         200 ps  200 ps         100 ps          800 ps
sw        200 ps       100 ps         200 ps  200 ps                         700 ps
R-format  200 ps       100 ps         200 ps                 100 ps          600 ps
beq       200 ps       100 ps         200 ps                                 500 ps
Chapter 4 — The Processor
Pipeline Performance
[Figure: instruction timing, single-cycle vs. pipelined]
- Single-cycle (Tc = 800 ps)
- Pipelined (Tc = 200 ps)
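The two clock periods can be derived from the stage times. The sketch below is only an illustration: the names STAGE_PS and USES are made up here, and the mapping of "register read" to ID and "register write" to WB is an assumption based on the five-stage naming used later in these slides.

```python
# Stage latencies in picoseconds, as assumed on the slide
# (ID = register read, WB = register write: naming assumed here).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually needs.
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}

def total_time(instr):
    """Total latency of one instruction class in picoseconds."""
    return sum(STAGE_PS[stage] for stage in USES[instr])

totals = {instr: total_time(instr) for instr in USES}

# Single-cycle: the clock must fit the slowest instruction.
single_cycle_tc = max(totals.values())
# Pipelined: the clock must fit the slowest stage.
pipelined_tc = max(STAGE_PS.values())

print(totals)
print("single-cycle Tc =", single_cycle_tc, "ps; pipelined Tc =", pipelined_tc, "ps")
```

The single-cycle clock is set by lw, the slowest instruction, while the pipelined clock is set by the slowest stage.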
Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - $\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
- If not balanced, speedup is less
- Speedup is due to increased throughput
  - Latency (time for each instruction) does not decrease
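As a rough check of this rule, the sketch below plugs in the stage times assumed earlier in these slides (200 ps for fetch, ALU, and memory access; 100 ps for register read and write). The function names are illustrative only.

```python
def ideal_time_between_instructions(nonpipelined_ps, num_stages):
    """Balanced case: nonpipelined time divided by the number of stages."""
    return nonpipelined_ps / num_stages

def actual_time_between_instructions(stage_ps):
    """Unbalanced case: the clock is limited by the slowest stage."""
    return max(stage_ps)

stage_ps = [200, 100, 200, 200, 100]  # IF, ID, EX, MEM, WB
nonpipelined_ps = 800                 # single-cycle clock period

ideal = ideal_time_between_instructions(nonpipelined_ps, len(stage_ps))
actual = actual_time_between_instructions(stage_ps)

print("ideal:", ideal, "ps; actual:", actual, "ps")
print("ideal speedup:", nonpipelined_ps / ideal)
print("actual speedup:", nonpipelined_ps / actual)
```

Because the stages are not balanced, the achievable speedup is 4 rather than the ideal 5.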
Pipelining and ISA Design
- MIPS ISA designed for pipelining
  - All instructions are 32 bits
    - Easier to fetch and decode in one cycle
    - cf. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3rd stage, access memory in 4th stage
  - Alignment of memory operands
    - Memory access takes only one cycle
Improving performance
- Two ideas for improving performance:
  - Split each instruction into multiple steps, each taking 1 cycle
    - steps: IF (instruction fetch), ID (instruction decode), EX (execute ALU operation), MEM (memory access), WB (register write-back)
    - slow instructions take more cycles than fast instructions
    - known as a multi-cycle implementation
  - Crucial observation: each instruction uses only a portion of the datapath in each step
    - can overlap instructions; each uses one portion of the datapath
    - known as a pipelined implementation
- Examples of pipelining: any assembly process (cars, sandwiches), multiple loads of laundry (washer + dryer can be pipelined), etc.
Pipelining not just Multiprocessing
- Pipelining does involve parallel processing, but in a specific way
- Both multiprocessing and pipelining relate to the processing of multiple “things” using multiple “functional units”
  - In multiprocessing, each thing is processed entirely by a single functional unit
    - e.g. multiple lanes at the supermarket
  - In pipelining, each thing is broken into a sequence of pieces, where each piece is handled by a different (specialized) functional unit
    - e.g. checker vs. bagger
- Pipelining and multiprocessing are not mutually exclusive
  - Modern processors do both, with multiple pipelines (e.g. superscalar)
- Pipelining is a general-purpose efficiency technique, used elsewhere in CS:
  - Networking, I/O devices, server software architecture
Instruction Fetch (IF)
While IF is executing, the rest of the datapath is sitting idle…
[Figure: single-cycle datapath (instruction memory, registers, ALU, data memory, and control signals)]
Instruction Decode (ID)
Then while ID is executing, the IF-related portion becomes idle…
[Figure: single-cycle datapath (instruction memory, registers, ALU, data memory, and control signals)]
Execute (EX)
…and so on for the EX portion…
[Figure: single-cycle datapath (instruction memory, registers, ALU, data memory, and control signals)]
Memory (MEM)
…the MEM portion…
[Figure: single-cycle datapath (instruction memory, registers, ALU, data memory, and control signals)]
Write back (WB)
…and the WB portion
[Figure: single-cycle datapath (instruction memory, registers, ALU, data memory, and control signals)]
Decoding and fetching together
Why don’t we go ahead and fetch the next instruction while we’re decoding the first one?
[Figure: datapath with the labels "Fetch 2nd instruction" and "Decode 1st instruction"]
Executing, decoding and fetching
- Similarly, once the first instruction enters its Execute stage, we can go ahead and decode the second instruction.
- But now the instruction memory is free again, so we can fetch the third instruction!
[Figure: datapath with the labels "Fetch 3rd", "Decode 2nd", and "Execute 1st"]
Break datapath into 5 stages
- Each stage has its own functional units
- Full pipeline: the datapath is simultaneously working on 5 instructions!
[Figure: datapath divided into the stages IF, ID, EXE, MEM, WB; the newest instruction is in IF and the oldest in WB]
A pipeline diagram
Clock cycle         1    2    3    4    5    6    7    8    9
lw   $t0, 4($sp)    IF   ID   EX   MEM  WB
sub  $v0, $a0, $a1       IF   ID   EX   MEM  WB
and  $t1, $t2, $t3            IF   ID   EX   MEM  WB
or   $s0, $s1, $s2                 IF   ID   EX   MEM  WB
addi $sp, $sp, -4                       IF   ID   EX   MEM  WB
- A pipeline diagram shows the execution of a series of instructions
  - The instruction sequence is shown vertically, from top to bottom
  - Clock cycles are shown horizontally, from left to right
  - Each instruction is divided into its component stages
- This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above.
  - The “lw” instruction is in its Execute stage.
  - Simultaneously, the “sub” is in its Instruction Decode stage.
  - Also, the “and” instruction is just being fetched.
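A diagram like this can be generated mechanically, since instruction i (counting from 0) enters IF in cycle i + 1 and advances one stage per cycle. The sketch below assumes an ideal pipeline with no stalls; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stages_in_cycle(instructions, cycle):
    """Map each active instruction to its stage in a 1-based clock cycle."""
    active = {}
    for i, instr in enumerate(instructions):
        stage_index = cycle - 1 - i  # instruction i enters IF in cycle i + 1
        if 0 <= stage_index < len(STAGES):
            active[instr] = STAGES[stage_index]
    return active

def render_diagram(instructions):
    """Return the pipeline diagram as a list of text rows."""
    n_cycles = len(instructions) + len(STAGES) - 1
    header = f"{'Clock cycle':<20}" + "".join(f"{c:<5}" for c in range(1, n_cycles + 1))
    rows = [header.rstrip()]
    for i, instr in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # one stage per cycle, shifted by i cycles
        rows.append((f"{instr:<20}" + "".join(f"{c:<5}" for c in cells)).rstrip())
    return rows

PROGRAM = [
    "lw $t0, 4($sp)",
    "sub $v0, $a0, $a1",
    "and $t1, $t2, $t3",
    "or $s0, $s1, $s2",
    "addi $sp, $sp, -4",
]

for row in render_diagram(PROGRAM):
    print(row)
print(stages_in_cycle(PROGRAM, 3))
```

Asking for cycle 3 reproduces the observation on the slide: lw is in EX, sub is in ID, and the "and" instruction is in IF.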
Pipeline terminology
Clock cycle         1    2    3    4    5    6    7    8    9
lw   $t0, 4($sp)    IF   ID   EX   MEM  WB
sub  $v0, $a0, $a1       IF   ID   EX   MEM  WB
and  $t1, $t2, $t3            IF   ID   EX   MEM  WB
or   $s0, $s1, $s2                 IF   ID   EX   MEM  WB
addi $sp, $sp, -4                       IF   ID   EX   MEM  WB
                    <---- filling ----> full <--- emptying --->
- The pipeline depth is the number of stages, in this case five
- In the first four cycles here, the pipeline is filling, since there are unused functional units
- In cycle 5, the pipeline is full. Five instructions are being executed simultaneously, so all hardware units are in use
- In cycles 6-9, the pipeline is emptying
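The three phases can be told apart by counting how many instructions are in flight in each cycle. The following is a minimal sketch for an ideal pipeline with no stalls; the function names and the rule "filling while new instructions are still being fetched" are assumptions made for illustration.

```python
def active_instructions(cycle, n_instructions, depth=5):
    """Number of instructions in flight in a 1-based clock cycle."""
    first = max(0, cycle - depth)              # oldest instruction still in the pipeline
    last = min(n_instructions - 1, cycle - 1)  # newest instruction already fetched
    return max(0, last - first + 1)

def phase(cycle, n_instructions, depth=5):
    """Classify a cycle as 'filling', 'full', or 'emptying'."""
    if active_instructions(cycle, n_instructions, depth) == depth:
        return "full"
    # Not full: still filling while instructions are being fetched, else draining.
    return "filling" if cycle <= n_instructions else "emptying"

phases = {cycle: phase(cycle, n_instructions=5) for cycle in range(1, 10)}
for cycle, name in phases.items():
    print(cycle, active_instructions(cycle, 5), name)
```

For five instructions on a five-stage pipeline this yields filling for cycles 1-4, full for cycle 5, and emptying for cycles 6-9.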
Pipelining Performance
Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
lw $t1, 8($sp)           IF   ID   EX   MEM  WB
lw $t2, 12($sp)               IF   ID   EX   MEM  WB
lw $t3, 16($sp)                    IF   ID   EX   MEM  WB
lw $t4, 20($sp)                         IF   ID   EX   MEM  WB
                    <---- filling ---->
- Execution time on an ideal pipeline: time to fill the pipeline + one cycle per instruction
  - How long for N instructions? k - 1 + N cycles, where k = pipeline depth
- Alternate way of arriving at this formula: k cycles for the first instruction, plus 1 for each of the remaining N - 1 instructions
- Compare this pipelined implementation (2 ns clock period) vs. a single-cycle implementation (8 ns clock period). How much faster is pipelining for N = 1000?
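One way to work the comparison is to evaluate both execution times directly. The sketch below uses the slide's numbers (depth 5, 2 ns pipelined clock, 8 ns single-cycle clock) and assumes an ideal pipeline with no stalls; the function names are illustrative.

```python
def pipelined_cycles(n_instructions, depth=5):
    """k - 1 cycles to fill the pipeline, then one cycle per instruction."""
    return depth - 1 + n_instructions

def pipelined_time_ns(n_instructions, depth=5, clock_ns=2):
    return pipelined_cycles(n_instructions, depth) * clock_ns

def single_cycle_time_ns(n_instructions, clock_ns=8):
    """One (long) cycle per instruction."""
    return n_instructions * clock_ns

n = 1000
speedup = single_cycle_time_ns(n) / pipelined_time_ns(n)

print("pipelined:", pipelined_time_ns(n), "ns")
print("single-cycle:", single_cycle_time_ns(n), "ns")
print(f"speedup: {speedup:.2f}x")
```

For N = 1000 the pipelined version needs 1004 cycles (2008 ns) against 8000 ns, a speedup of about 3.98, which approaches the clock-period ratio of 4 as N grows.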
Pipeline Datapath: Resource Requirements
Clock cycle         1    2    3    4    5    6    7    8    9
lw $t0, 4($sp)      IF   ID   EX   MEM  WB
lw $t1, 8($sp)           IF   ID   EX   MEM  WB
lw $t2, 12($sp)               IF   ID   EX   MEM  WB
lw $t3, 16($sp)                    IF   ID   EX   MEM  WB
lw $t4, 20($sp)                         IF   ID   EX   MEM  WB
- We need to perform several operations in the same cycle.
  - Increment the PC and add registers at the same time.
  - Fetch one instruction while another one reads or writes data.
- What does that mean for our hardware?
  - Separate adder and ALU
  - Two memories (instruction memory and data memory)
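The hardware requirement can be checked by listing what each stage needs and looking for anything demanded twice when all five stages are busy. This is a sketch under stated assumptions: the resource names are invented for illustration, and the register file is assumed to have separate read and write ports.

```python
from collections import Counter

# Resources used by each stage with a separate PC adder and two memories.
SPLIT = {
    "IF": ["instruction memory", "PC adder"],
    "ID": ["register read ports"],
    "EX": ["ALU"],
    "MEM": ["data memory"],
    "WB": ["register write port"],
}

# Hypothetical alternative: one memory, and the ALU also computes PC + 4.
SHARED = {
    "IF": ["memory", "ALU"],
    "ID": ["register read ports"],
    "EX": ["ALU"],
    "MEM": ["memory"],
    "WB": ["register write port"],
}

def conflicts(stage_resources):
    """Resources needed by more than one stage when every stage is busy."""
    demand = Counter(r for needs in stage_resources.values() for r in needs)
    return sorted(r for r, count in demand.items() if count > 1)

print("split hardware conflicts:", conflicts(SPLIT))
print("shared hardware conflicts:", conflicts(SHARED))
```

With the split hardware nothing is demanded twice in a cycle, while the shared design collides on both the ALU and the memory, which is why the pipelined datapath keeps them separate.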
Single-cycle datapath, slightly rearranged
[Figure: single-cycle datapath with PC, adders, instruction memory, register file, sign extend, ALU, data memory, and the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemToReg]
Pipeline registers
- In pipelining, we divide instruction execution into multiple cycles: IF, ID, EX, MEM, WB
- Information computed during one cycle may be needed in a later cycle:
  - The instruction read in the IF stage determines which registers are fetched in the ID stage, what immediate is used in the EX stage, and what the destination register is for WB
  - Register values read in ID are used in the EX and/or MEM stages
  - The ALU output produced in EX is an effective address for MEM or a result for WB
- A lot of information to save!
  - Saved in intermediate registers called pipeline registers
- The registers are named for the stages they connect: IF/ID, ID/EX, EX/MEM, MEM/WB
- No register is needed after the WB stage, because after WB the instruction is done
Pipelined datapath
[Figure: the rearranged datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]
Propagating values forward
- Data values required later are propagated through the pipeline registers
- The most extreme example is the destination register (rd or rt)
  - It is retrieved in IF, but isn’t updated until WB
  - Thus, it must be passed through all pipeline stages, as shown in red on the next slide
- Notice that we can’t keep a single “instruction register,” because the pipelined machine needs to fetch a new instruction every clock cycle
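The propagation can be pictured as a shift register: on every clock edge each pipeline register hands its destination-register number to the next one. The sketch below models only that one field and assumes an ideal pipeline with no stalls; the function name is illustrative, and the register numbers are those of the example sequence used later in these slides (lw $8, sub $2, and $9, or $16, add $13).

```python
def propagate_destinations(dest_regs, depth=5):
    """Return {cycle: destination register written in WB during that cycle}."""
    # One slot per pipeline register: IF/ID, ID/EX, EX/MEM, MEM/WB.
    latches = [None] * (depth - 1)
    writes = {}
    # After the last fetch, keep clocking until the pipeline drains.
    stream = list(dest_regs) + [None] * (depth - 1)
    for cycle, fetched in enumerate(stream, start=1):
        # The instruction in WB this cycle is the one held in MEM/WB.
        in_wb = latches[-1]
        if in_wb is not None:
            writes[cycle] = in_wb
        # Clock edge: every value moves one pipeline register to the right.
        latches = [fetched] + latches[:-1]
    return writes

writes = propagate_destinations([8, 2, 9, 16, 13])
for cycle, reg in writes.items():
    print(f"cycle {cycle}: WB writes ${reg}")
```

The first write happens in cycle 5, four pipeline registers after the instruction was fetched, and a different destination is in flight in each pipeline register at any moment, which is why a single instruction register would not work.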
The destination register
[Figure: pipelined datapath with the destination register number (Instr [20 - 16] or Instr [15 - 11], selected by RegDst) carried through ID/EX, EX/MEM, and MEM/WB back to the Write register input]
What about control signals?
- Control signals are generated much as in the single-cycle processor
  - In the ID stage, the processor decodes the instruction fetched in IF and produces the appropriate control values
- Some of the control signals will not be needed until later stages
  - These signals must be propagated through the pipeline until they reach the appropriate stage
  - We just pass them in the pipeline registers, along with the data
- Control signals can be categorized by the pipeline stage that uses them
Stage  Control signals needed
EX     ALUSrc, ALUOp, RegDst
MEM    MemRead, MemWrite, PCSrc
WB     RegWrite, MemToReg
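The grouping in the table suggests how the control values shrink as they travel: each pipeline register only needs the groups for the stages still ahead. The sketch below is illustrative only. The values for lw and sub are the ones shown in the cycle-by-cycle example later in these slides, PCSrc is left out because it also depends on the ALU's Zero output, and the names CONTROL and control_carried are made up here.

```python
# Control values produced in ID, grouped by the stage that consumes them.
CONTROL = {
    "lw": {
        "EX": {"ALUSrc": 1, "ALUOp": "add", "RegDst": 0},
        "MEM": {"MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemToReg": 1},
    },
    "sub": {
        "EX": {"ALUSrc": 0, "ALUOp": "sub", "RegDst": 1},
        "MEM": {"MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemToReg": 0},
    },
}

def control_carried(op):
    """Control groups held in each pipeline register for one instruction."""
    groups = CONTROL[op]
    return {
        "ID/EX": {stage: groups[stage] for stage in ("EX", "MEM", "WB")},
        "EX/MEM": {stage: groups[stage] for stage in ("MEM", "WB")},
        "MEM/WB": {"WB": groups["WB"]},
    }

for register, groups in control_carried("lw").items():
    print(register, groups)
```

Each stage consumes its own group and forwards the rest, so MEM/WB ends up carrying only RegWrite and MemToReg.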
Pipelined data path and control
[Figure: pipelined datapath with a Control unit in ID; the EX, M, and WB control groups are stored in ID/EX, then M and WB in EX/MEM, then WB in MEM/WB]
An example execution sequence
- Here’s a sample sequence of instructions to execute (addresses in decimal)
    1000: lw  $8, 4($29)
    1004: sub $2, $4, $5
    1008: and $9, $10, $11
    1012: or  $16, $17, $18
    1016: add $13, $14, $0
- We’ll make some assumptions, just so we can show actual data values:
  - Each register contains its number plus 100. For instance, register $8 contains 108, register $29 contains 129, etc.
  - Every data memory location contains 99
- Our pipeline diagrams will follow some conventions:
  - An X indicates values that aren’t important, like the constant field of an R-type instruction
  - Question marks ??? indicate values we don’t know, usually resulting from instructions coming before and after the ones in our example
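The data values that appear in the following cycle diagrams can be worked out ahead of time from these assumptions. The sketch below is not a pipeline simulator: it evaluates each instruction against the initial register contents, which is valid here only because no instruction in this sequence reads a register written by an earlier one. It also assumes that $0 reads as 0, as is standard in MIPS, rather than 100; the helper names are illustrative.

```python
MEMORY_VALUE = 99  # every data memory location holds 99

def reg(n):
    """Initial register contents: number + 100, with $0 fixed at 0 (assumed)."""
    return 0 if n == 0 else n + 100

# (address, op, destination, source 1, source 2 or immediate offset)
PROGRAM = [
    (1000, "lw", 8, 29, 4),
    (1004, "sub", 2, 4, 5),
    (1008, "and", 9, 10, 11),
    (1012, "or", 16, 17, 18),
    (1016, "add", 13, 14, 0),
]

ALU = {
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "add": lambda a, b: a + b,
}

def execute(op, src1, src2):
    """Return (ALU result, value written back to the destination register)."""
    if op == "lw":
        address = reg(src1) + src2  # base register plus immediate offset
        return address, MEMORY_VALUE
    result = ALU[op](reg(src1), reg(src2))
    return result, result

results = {}
for addr, op, dest, s1, s2 in PROGRAM:
    alu_out, written = execute(op, s1, s2)
    results[addr] = (alu_out, written)
    print(f"{addr}: {op:<3} ALU = {alu_out:>4}, ${dest} <- {written}")
```

The lw computes address 133 and loads 99, and the ALU results for sub, and, or, and add are -1, 110, 119, and 114.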
Cycle 1 (filling)
Stages:  IF: lw $8, 4($29) | ID: ??? | EX: ??? | MEM: ??? | WB: ???
Control: EX: ALUSrc (?), ALUOp (???), RegDst (?) | MEM: MemRead (?), MemWrite (?) | WB: RegWrite (?), MemToReg (?)
[Figure: pipelined datapath annotated with this cycle's values: 1000, 1004, ???]
Cycle 2
Stages:  IF: sub $2, $4, $5 | ID: lw $8, 4($29) | EX: ??? | MEM: ??? | WB: ???
Control: EX: ALUSrc (?), ALUOp (???), RegDst (?) | MEM: MemRead (?), MemWrite (?) | WB: RegWrite (?), MemToReg (?)
[Figure: pipelined datapath annotated with this cycle's values: X, 8, 1004, 29, 1008, 129, ???]
Cycle 3
Stages:  IF: and $9, $10, $11 | ID: sub $2, $4, $5 | EX: lw $8, 4($29) | MEM: ??? | WB: ???
Control: EX: ALUSrc (1), ALUOp (add), RegDst (0) | MEM: MemRead (?), MemWrite (?) | WB: RegWrite (?), MemToReg (?)
[Figure: pipelined datapath annotated with this cycle's values: X, 2, 1008, 5, 1012, 104, 105, 129, 8, 133, ???]
Cycle 4
Stages:  IF: or $16, $17, $18 | ID: and $9, $10, $11 | EX: sub $2, $4, $5 | MEM: lw $8, 4($29) | WB: ???
Control: EX: ALUSrc (0), ALUOp (sub), RegDst (1) | MEM: MemRead (1), MemWrite (0) | WB: RegWrite (?), MemToReg (?)
[Figure: pipelined datapath annotated with this cycle's values: X, 9, 1012, 10, 11, 1016, 110, 111, 104, 105, 2, -1, 133, 99, 8, ???]
Cycle 5 (full)
Stages:  IF: add $13, $14, $0 | ID: or $16, $17, $18 | EX: and $9, $10, $11 | MEM: sub $2, $4, $5 | WB: lw $8, 4($29)
Control: EX: ALUSrc (0), ALUOp (and), RegDst (1) | MEM: MemRead (0), MemWrite (0) | WB: RegWrite (1), MemToReg (1)
[Figure: pipelined datapath annotated with this cycle's values: X, 16, 1016, 17, 18, 1020, 117, 118, 110, 111, 9, -1, 105, 2, 99, 133, 8]
Cycle 6 (emptying)
Stages:  IF: ??? | ID: add $13, $14, $0 | EX: or $16, $17, $18 | MEM: and $9, $10, $11 | WB: sub $2, $4, $5
Control: EX: ALUSrc (0), ALUOp (or), RegDst (1) | MEM: MemRead (0), MemWrite (0) | WB: RegWrite (1), MemToReg (0)
[Figure: pipelined datapath annotated with this cycle's values: X, 13, 1020, 14, ???, 114, 117, 118, 16, 119, 110, 111, 9, -1, 2]
Cycle 7
Stages:  IF: ??? | ID: ??? | EX: add $13, $14, $0 | MEM: or $16, $17, $18 | WB: and $9, $10, $11
Control: EX: ALUSrc (0), ALUOp (add), RegDst (1) | MEM: MemRead (0), MemWrite (0) | WB: RegWrite (1), MemToReg (0)
[Figure: pipelined datapath annotated with this cycle's values: ???, 114, X, 13, 119, 118, 16, 110, 9]
Cycle 8
Stages:  IF: ??? | ID: ??? | EX: ??? | MEM: add $13, $14, $0 | WB: or $16, $17, $18
Control: EX: ALUSrc (?), ALUOp (???), RegDst (?) | MEM: MemRead (0), MemWrite (0) | WB: RegWrite (1), MemToReg (0)
[Figure: pipelined datapath annotated with this cycle's values: ???, 114, X, 13, 119, 16]
Cycle 9
Stages:  IF: ??? | ID: ??? | EX: ??? | MEM: ??? | WB: add $13, $14, $0
Control: EX: ALUSrc (?), ALUOp (???), RegDst (?) | MEM: MemRead (?), MemWrite (?) | WB: RegWrite (1), MemToReg (0)
[Figure: pipelined datapath annotated with this cycle's values: ???, ?, X, 114, 13]
That’s a lot of diagrams there
- Compare the last few slides with the pipeline diagram above
  - You can see how instruction executions are overlapped
  - Each functional unit is used by a different instruction in each cycle
  - The pipeline registers save control and data values generated in previous clock cycles for later use
  - When the pipeline is full in clock cycle 5, all of the hardware units are utilized. This is the ideal situation, and what makes pipelined processors so fast
Clock cycle         1    2    3    4    5    6    7    8    9
lw   $t0, 4($sp)    IF   ID   EX   MEM  WB
sub  $v0, $a0, $a1       IF   ID   EX   MEM  WB
and  $t1, $t2, $t3            IF   ID   EX   MEM  WB
or   $s0, $s1, $s2                 IF   ID   EX   MEM  WB
add  $t5, $t6, $0                       IF   ID   EX   MEM  WB
Note how everything goes left to right, except
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]