CSC 4250 Computer Architectures September 15, 2006 Appendix A. Pipelining
What is Pipelining? An implementation technique whereby multiple instructions are overlapped in execution. Pipelining exploits parallelism among the instructions in a sequential instruction stream. Recall the formula: CPU time = IC × CPI × cct (instruction count × cycles per instruction × clock cycle time). Pipelining yields a reduction in the average execution time per instruction; i.e., relative to an unpipelined processor that takes several cycles per instruction, it decreases the CPI.
RISC Architectures: RISC stands for Reduced Instruction Set Computer. All operations on data apply to data in registers. The only operations that affect memory are loads and stores, which move data from memory to a register or from a register to memory, respectively. Instruction formats are few in number, with all instructions typically the same size.
Three Classes of Instructions: We consider (1) ALU instructions; (2) load and store instructions; (3) branches (no jumps).
ALU Instructions: Take either two registers or a register and a sign-extended immediate, operate on them, and store the result into a third register. Register-register example: DADD R1,R2,R3, with fields Opcode | rs = R2 | rt = R3 | rd = R1 | shamt | opx, performs Reg[R1] ← Reg[R2] + Reg[R3]. Register-immediate example: DADDI R1,R2,#3, with fields Opcode | rs = R2 | rt = R1 | Immediate, performs Reg[R1] ← Reg[R2] + 3.
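The two field layouts above can be sketched as bit-packing functions. This is a minimal illustration that assumes the usual 32-bit MIPS field widths (6/5/5/5/5/6 for register-register, 6/5/5/16 for register-immediate), which the slide does not state; the function names and the opcode values passed in are likewise assumptions for illustration only.

```python
def encode_r_type(opcode, rs, rt, rd, shamt, opx):
    """Pack a register-register word: opcode(6) rs(5) rt(5) rd(5) shamt(5) opx(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | opx


def encode_i_type(opcode, rs, rt, imm):
    """Pack a register-immediate word: opcode(6) rs(5) rt(5) immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


# DADD R1,R2,R3: rs = R2, rt = R3, rd = R1 (opcode 0 and opx 0x2C are assumed values)
dadd_word = encode_r_type(0, 2, 3, 1, 0, 0x2C)

# DADDI R1,R2,#3: rs = R2, rt = R1 (opcode 0x18 is an assumed value)
daddi_word = encode_i_type(0x18, 2, 1, 3)

print(hex(dadd_word), hex(daddi_word))
```

Note that the destination sits in the rd field for the register-register form but in the rt field for the register-immediate form, which is why the write-back stage distinguishes the two cases.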
Load and Store Instructions: Take a register source (the base register) and an immediate field (the offset). Their sum (the effective address) is the memory address. A second register is the destination (load) or source (store) of the data. LD R2,30(R1), with fields Opcode | rs = R1 | rt = R2 | Immediate, performs Reg[R2] ← Mem[30+Reg[R1]]. SD R2,30(R1), with the same field layout, performs Mem[30+Reg[R1]] ← Reg[R2].
Branches: Branches are conditional transfers of control. The branch destination is obtained by adding a sign-extended offset to the current PC. We consider only comparison against zero: BEQZ R1,name. BEQZ is a pseudo-instruction for BEQ with R0: BEQ R1,R0,name, with fields Opcode | R1 | R0 | Immediate.
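A small sketch of the branch-destination arithmetic may help. It assumes a 16-bit immediate counted in instruction words and added to the incremented PC, matching the NPC + (Imm << 2) operation this deck gives for the EX stage; the function names are hypothetical.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, imm16):
    """Destination of a taken branch located at address pc."""
    npc = pc + 4                              # next sequential instruction
    return npc + (sign_extend16(imm16) << 2)  # offset is scaled to bytes


forward = branch_target(100, 3)        # three instructions past NPC
backward = branch_target(100, 0xFFFF)  # offset -1: branches back to itself
print(forward, backward)
```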
RISC Instruction Set: Each instruction takes at most five clock cycles: 1. Instruction fetch cycle (IF); 2. Instruction decode/register fetch cycle (ID); 3. Execution/effective address cycle (EX); 4. Memory access/branch completion cycle (MEM); 5. Write-back cycle (WB).
Instruction Fetch (IF): Send the program counter (PC) to memory and fetch the current instruction from memory. Update the PC by adding 4 (why 4?). Operations: IR ← Mem[PC]; NPC ← PC + 4.
Instruction Decode/Register Fetch (ID): Decode the instruction and read the registers. Decoding is done in parallel with reading registers (fixed-field decoding). Sign-extend the offset field. Operations: A ← Reg[rs]; B ← Reg[rt]; Imm ← sign-extended immediate field of IR (A and B are temporary registers).
Execution/Effective Address (EX): The ALU operates on the operands prepared in ID, performing one of four possible functions. Memory reference (add base register and offset): ALUOutput ← A + Imm. Register-register ALU instruction: ALUOutput ← A func B. Register-immediate ALU instruction: ALUOutput ← A op Imm. Branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0).
Memory Access/Branch Completion (MEM): The PC is updated: PC ← NPC. Memory is accessed if needed (LMD = Load Memory Data register): LMD ← Mem[ALUOutput] for a load, or Mem[ALUOutput] ← B for a store. Branch: if (Cond) PC ← ALUOutput.
Write Back (WB): Register-register ALU: Reg[rd] ← ALUOutput. Register-immediate ALU: Reg[rt] ← ALUOutput. Load: Reg[rt] ← LMD.
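The five sets of register-transfer operations can be strung together into a toy, unpipelined interpreter. This is only a sketch under stated assumptions: instructions are pre-decoded dictionaries (a hypothetical format, not a real encoding), addition is the only ALU function modelled, and the names step and state are made up for the example.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def step(state):
    """Execute one instruction through IF, ID, EX, MEM and WB in sequence."""
    # IF: fetch the instruction and compute the next sequential PC.
    ir = state["imem"][state["pc"]]
    npc = state["pc"] + 4

    # ID: read both source registers and sign-extend the immediate.
    a = state["reg"][ir.get("rs", 0)]
    b = state["reg"][ir.get("rt", 0)]
    imm = sign_extend16(ir.get("imm", 0))

    # EX: one of the four ALU functions, selected by instruction kind.
    kind = ir["kind"]
    cond = False
    if kind in ("load", "store", "alu_ri"):
        alu_output = a + imm           # add is the only op modelled here
    elif kind == "alu_rr":
        alu_output = a + b
    elif kind == "branch":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

    # MEM: update the PC, then access data memory or complete the branch.
    state["pc"] = npc
    lmd = None
    if kind == "load":
        lmd = state["dmem"][alu_output]
    elif kind == "store":
        state["dmem"][alu_output] = b
    elif kind == "branch" and cond:
        state["pc"] = alu_output

    # WB: write the result into the register file.
    if kind == "alu_rr":
        state["reg"][ir["rd"]] = alu_output
    elif kind == "alu_ri":
        state["reg"][ir["rt"]] = alu_output
    elif kind == "load":
        state["reg"][ir["rt"]] = lmd
    return state


# LD R2,30(R1) followed by DADD R1,R2,R3
state = {
    "pc": 0,
    "reg": [0] * 32,
    "imem": {
        0: {"kind": "load", "rs": 1, "rt": 2, "imm": 30},
        4: {"kind": "alu_rr", "rs": 2, "rt": 3, "rd": 1},
    },
    "dmem": {40: 99},
}
state["reg"][1] = 10
state["reg"][3] = 1
step(state)
step(state)
print(state["reg"][1], state["reg"][2], state["pc"])
```

Stores and branches write nothing in WB, and ALU instructions do nothing in MEM beyond the PC update, which is why not every instruction needs all five cycles.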
Simple RISC Pipeline
                Clock number
Instr. #        1     2     3     4     5     6     7     8     9
Instr. i        IF    ID    EX    MEM   WB
Instr. i+1            IF    ID    EX    MEM   WB
Instr. i+2                  IF    ID    EX    MEM   WB
Instr. i+3                        IF    ID    EX    MEM   WB
Instr. i+4                              IF    ID    EX    MEM   WB
What are the stages needed for an ALU instruction? What are the stages needed for a Store instruction? What are the stages needed for a Branch instruction? Which stage is expected to take the most time?
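The staggered pattern in the pipeline chart can be generated mechanically: instruction number k (counting from 0) occupies stage number s in clock cycle k + s + 1. The sketch below assumes an ideal pipeline with no stalls; the function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; each row has one cell per clock cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for k in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[k + s] = name          # stage s of instruction k runs in cycle k + s + 1
        rows.append(row)
    return rows


rows = pipeline_diagram(5)
for k, row in enumerate(rows):
    print(f"i+{k}: " + " ".join(f"{cell:<3}" for cell in row))
```

With five instructions and five stages the chart spans 5 + 5 - 1 = 9 clock cycles, and once the pipeline is full one instruction completes every cycle.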
Figure A.2. Pipeline
Three Observations on Overlapping Execution: 1. Use separate instruction and data memories, typically implemented as separate instruction and data caches. Separate caches eliminate the conflict over a single memory that would otherwise arise between instruction fetch and data memory access.
2. The register file is used in two stages: for reading in ID and for writing in WB. These uses are distinct. Hence, we need to perform two reads and one write every clock cycle (why two reads?). To handle reads and a write to the same register (and for another reason that will arise), we perform the register write in the first half of the clock cycle and the reads in the second half.
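The split-cycle register file can be modelled in a few lines. This is a behavioural sketch only, with a hypothetical class and method name; it shows why an instruction in ID sees the value being written by an instruction in WB during the same clock cycle.

```python
class RegisterFile:
    """Register file that writes in the first half-cycle and reads in the second."""

    def __init__(self, size=32):
        self.regs = [0] * size

    def clock(self, write=None, reads=()):
        # First half of the cycle: the WB stage writes its result, if any.
        if write is not None:
            index, value = write
            self.regs[index] = value
        # Second half of the cycle: the ID stage reads its source registers.
        return tuple(self.regs[r] for r in reads)


rf = RegisterFile()
# In one cycle, WB writes 42 to R1 while ID reads R1 and R2.
a, b = rf.clock(write=(1, 42), reads=(1, 2))
print(a, b)
```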
3. To start a new instruction every clock cycle, we must increment and store the PC every clock cycle, and this must be done during the IF stage in preparation for the next instruction. Another problem is that a branch does not change the PC until the MEM stage (this problem will be handled soon).
Pipeline Registers: Prevent interference between two different instructions in adjacent stages of the pipeline. Carry the data of a given instruction from one stage to the next. The registers are triggered by the clock edge: values change instantaneously on the clock edge. They add pipelining overhead.
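One way to picture the pipeline registers is as four latches (IF/ID, ID/EX, EX/MEM, MEM/WB) that all update together on the clock edge. The sketch below is a deliberately simplified model, assuming each latch holds only an instruction label rather than real datapath values; the function name tick is made up.

```python
def tick(latches, fetched):
    """Advance one clock edge; latches = [IF/ID, ID/EX, EX/MEM, MEM/WB]."""
    # Every latch takes the value held by its predecessor before the edge,
    # so no instruction can overwrite data still needed by the one ahead of it.
    return [fetched] + latches[:-1]


latches = [None] * 4
for instr in ["i", "i+1", "i+2"]:
    latches = tick(latches, instr)
print(latches)
```

Building a new list instead of updating in place mirrors the edge-triggered behaviour: all new values are computed from the old contents and then appear at once.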
Figure A.3. Pipeline Registers
Example: Consider an unpipelined processor. Assume a 1 ns clock cycle, 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Suppose the relative frequencies of ALU operations, branches, and memory operations are 40%, 20%, and 40%, respectively. The pipelining overhead is 0.2 ns. What is the speedup from pipelining?
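Answer: Average instruction execution time on the unpipelined processor = clock cycle × average CPI = 1 ns × ((40% + 20%) × 4 + 40% × 5) = 4.4 ns. On the pipelined processor, the clock runs at the speed of the slowest stage plus the overhead, so the average instruction execution time is 1 ns + 0.2 ns = 1.2 ns. Speedup from pipelining = 4.4 ns / 1.2 ns ≈ 3.7.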
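The arithmetic in this answer can be checked with a short script. The variable and function names are illustrative; the numbers are the ones given in the example, and the pipelined machine is assumed to complete one instruction per clock cycle with no stalls.

```python
def average_cpi(mix):
    """Weighted average of cycles per instruction over (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# (relative frequency, cycles): ALU operations, branches, memory operations
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

clock_ns = 1.0
overhead_ns = 0.2

unpipelined_ns = clock_ns * average_cpi(mix)  # average time per instruction
pipelined_ns = clock_ns + overhead_ns         # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns

print(f"{unpipelined_ns:.1f} ns / {pipelined_ns:.1f} ns = {speedup:.2f}")
```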