COMP541 Multicycle MIPS Montek Singh Apr 8, 2015
Topics Challenges w/ single-cycle MIPS implementation Multicycle MIPS State elements Now add registers between stages How to control Performance
Review: Processor Performance Program execution time Execution Time = (# instructions) (cycles/instruction)(seconds/cycle) = IC x CPI x Tc Definitions: IC = instruction count Cycles/instruction = CPI Seconds/cycle = clock period = Tc 1/CPI = Instructions/cycle = IPC Challenge is to satisfy constraints of: Cost Power Performance
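The execution-time formula above can be sketched in a few lines of Python. The function names `execution_time` and `ipc`, and the sample workload, are illustrative assumptions rather than anything from the course material.

```python
def execution_time(instruction_count, cpi, clock_period_s):
    """Return program execution time in seconds: IC x CPI x Tc."""
    return instruction_count * cpi * clock_period_s


def ipc(cpi):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi


# Hypothetical workload: 1 billion instructions, CPI of 2, 1 ns clock period.
t = execution_time(1e9, 2, 1e-9)
print(f"{t:.2f} s, IPC = {ipc(2)}")
```

Any of the three factors can be traded against the others, which is why a design with a shorter clock period is not automatically faster.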
Single-Cycle Performance (textbook version) Tc is limited by the critical path, which is set by lw, typically the longest instruction
Single-Cycle Performance (textbook version) Single-cycle critical path: Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup In most implementations, limiting paths are: memory, ALU, register file. Tc = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
Single-Cycle Performance Example Tc = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup = [30 + 2(250) + 150 + 200 + 25 + 20] ps = 925 ps What’s the max clock frequency?
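One way to check the arithmetic and answer the frequency question is the short sketch below. The delay values come from the example above; the dictionary keys and the function name `single_cycle_tc` are assumed names chosen for illustration.

```python
# Component delays in picoseconds, taken from the example above.
delays_ps = {
    "tpcq_PC": 30,
    "tmem": 250,
    "tRFread": 150,
    "tALU": 200,
    "tmux": 25,
    "tRFsetup": 20,
}


def single_cycle_tc(d):
    """Simplified single-cycle critical path (memory is traversed twice)."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tALU"] + d["tmux"] + d["tRFsetup"])


tc_ps = single_cycle_tc(delays_ps)

# A period of Tc picoseconds corresponds to 1000 / Tc gigahertz.
f_max_ghz = 1000.0 / tc_ps
print(tc_ps, "ps ->", round(f_max_ghz, 3), "GHz")
```

With these delays the clock cannot run faster than roughly 1.08 GHz.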
Single-Cycle Performance Example For a program with 100 billion instructions executing on a single-cycle MIPS processor, Execution Time = # instructions x CPI x Tc = (100 × 10^9)(1)(925 × 10^-12 s) = 92.5 seconds
Multicycle MIPS Key idea: Break instruction execution into multiple clock cycles
Multicycle MIPS Processor Single-cycle microarchitecture: + simple - cycle time limited by longest instruction (lw) - two adders/ALUs and two memories Multicycle microarchitecture: + higher clock speed + simpler instructions run faster + reuse expensive hardware on multiple cycles - sequencing overhead Same design steps: datapath & control
Multicycle State Elements Replace Instruction and Data memories with a single unified memory More realistic (buy one big RAM!) Was not possible in single-cycle implementation both instruction and data accesses needed within same clock cycle Now: Use same memory twice if needed instruction fetch and data access are in distinct clock cycles
Multicycle Datapath: lw instr fetch First consider executing lw STEP 1: Fetch instruction introduce Instruction Register to buffer this instruction a “non-architectural register” not accessible to programmer
Multicycle Datapath: lw register read Read register $rs insert another non-architectural register, A buffers the value of $rs read from register file
Multicycle Datapath: lw immediate Immediate field is sign-extended for consistency, could insert another non-architectural register to buffer SignImm skipped in this version because SignImm is a simple combinational function of Instr, which is already being held in Instruction Register
Multicycle Datapath: lw address ALU computes memory address insert another register to buffer ALUOut
Multicycle Datapath: lw memory read The same memory is now read again, this time for data access insert a multiplexer in front of memory’s address input choose either PC or ALUOut as address i.e., either instruction fetch or data access controlled by new control signal IorD
Multicycle Datapath: lw write register Data from memory is written into register file
Multicycle Datapath: increment PC PC incremented by re-using the ALU to do PC + 4 in single-cycle, we had to introduce a dedicated +4 adder in multi-cycle, same ALU used twice, in distinct cycles! Now using main ALU when it is not busy (instead of dedicated adder)
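The lw steps above can be walked through in software. The following is a behavioral sketch, not the lab design: memory is modeled as a hypothetical dictionary keyed by byte address, and `encode_lw`, `sign_extend16`, and `run_lw` are assumed helper names. The local variables mirror the non-architectural registers (Instr, A, ALUOut, Data).

```python
LW_OPCODE = 0x23  # MIPS opcode for lw


def encode_lw(rt, rs, imm):
    """Encode `lw rt, imm(rs)` as a 32-bit I-type instruction word."""
    return (LW_OPCODE << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    return value - 0x10000 if value & 0x8000 else value


def run_lw(mem, regs, pc):
    """Step one lw through five cycles; return the updated PC."""
    # Cycle 1 (fetch): IorD = 0 selects PC as the memory address.
    # The ALU is idle, so it is reused to compute PC + 4.
    instr = mem[pc]
    pc = pc + 4
    if instr >> 26 != LW_OPCODE:
        raise ValueError("this sketch only handles lw")

    # Cycle 2 (decode): read $rs into A; SignImm is derived from Instr.
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    a = regs[rs]
    sign_imm = sign_extend16(instr & 0xFFFF)

    # Cycle 3 (address): the ALU adds base and offset into ALUOut.
    alu_out = (a + sign_imm) & 0xFFFFFFFF

    # Cycle 4 (memory read): IorD = 1 selects ALUOut as the address.
    data = mem[alu_out]

    # Cycle 5 (write-back): the loaded word is written to $rt.
    regs[rt] = data
    return pc


# One unified memory holds both the instruction and the data word.
mem = {0x0: encode_lw(rt=8, rs=16, imm=8), 0x108: 0xDEADBEEF}
regs = [0] * 32
regs[16] = 0x100
new_pc = run_lw(mem, regs, 0)
print(hex(new_pc), hex(regs[8]))
```

The same `mem` object is read in cycle 1 and again in cycle 4, which is the point of the unified memory: the two accesses never fall in the same clock cycle.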
Multicycle Datapath: sw Compared to lw address computation is identical to lw write data in $rt to memory MemWrite will be 1 during the appropriate clock cycle $rt is buffered using nonarchitectural register B
Multicycle Datapath: R-type Instrs. Read from $rs and $rt multiplexers in front of ALU choose $rs and $rt as operands Write ALUResult to register file Write to $rd (instead of $rt) multiplexers in front of write address/data to register file
Multicycle Datapath: beq 2 tasks Determine whether values in rs and rt are equal Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC+4) ALU reused!
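The two beq tasks can be expressed as a small sketch. The function names `branch_target` and `beq_next_pc` are assumptions for illustration, and addresses are wrapped to 32 bits.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int."""
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, imm16):
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    return ((sign_extend16(imm16) << 2) + (pc + 4)) & 0xFFFFFFFF


def beq_next_pc(pc, imm16, rs_val, rt_val):
    """Take the branch only when the two register values are equal."""
    if rs_val == rt_val:
        return branch_target(pc, imm16)
    return pc + 4


print(hex(branch_target(0x1000, 3)))       # forward branch
print(hex(branch_target(0x1000, 0xFFFF)))  # immediate of -1: branch to itself
```

In hardware both the addition and the equality comparison go through the same ALU, in different cycles.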
Complete Multicycle Processor Caveat: Same differences in functionality w.r.t. our lab version as single-cycle MIPS
Control Unit
Main Controller FSM: Fetch
Main Controller FSM: Fetch Fetch instruction Also increment PC (because ALU not in use) Note: signals only shown when needed and enables only when asserted.
Main Controller FSM: Decode No control signals needed for decode Register values are also read in this cycle, although they may not be used
Main Controller FSM: Address Calculation Now change states depending on instr
Main Controller FSM: Address Calculation For lw or sw, need to compute addr
Main Controller FSM: lw For lw now need to read from memory Then write to register
Main Controller FSM: sw sw just writes to memory One step shorter
Main Controller FSM: R-Type The r-type instructions have two steps: compute result in ALU and write to reg
Main Controller FSM: beq beq needs to use the ALU twice, which would consume two cycles: one to compute the branch target address (BTA), another to compare rs and rt for equality Can take advantage of the Decode cycle, when the ALU is otherwise unused, to compute the BTA (no harm if the BTA is not used)
Complete Multicycle Controller FSM
Main Controller FSM: addi Similar to r-type Add Write back
Main Controller FSM: addi
Extended Functionality: j
Control FSM: j
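The controller's state sequencing, including the addi and j extensions, can be sketched as a next-state function. The state names below (MemAdr, MemRead, and so on) are assumed labels in the style of common textbook diagrams, since the FSM figures are not reproduced here, and only state order is modeled, not the control signal values.

```python
# First state after Decode, chosen by instruction type.
AFTER_DECODE = {
    "lw": "MemAdr",
    "sw": "MemAdr",
    "rtype": "Execute",
    "beq": "Branch",
    "addi": "AddiExecute",
    "j": "Jump",
}

# States that complete an instruction and return to Fetch.
FINAL_STATES = {"MemWriteback", "MemWrite", "AluWriteback",
                "Branch", "AddiWriteback", "Jump"}


def next_state(state, op):
    """Return the state that follows `state` for instruction type `op`."""
    if state == "Fetch":
        return "Decode"
    if state == "Decode":
        return AFTER_DECODE[op]
    if state == "MemAdr":
        return "MemRead" if op == "lw" else "MemWrite"
    if state == "MemRead":
        return "MemWriteback"
    if state == "Execute":
        return "AluWriteback"
    if state == "AddiExecute":
        return "AddiWriteback"
    if state in FINAL_STATES:
        return "Fetch"
    raise ValueError(f"unknown state: {state}")


def trace(op):
    """List the states one instruction passes through, starting at Fetch."""
    states = ["Fetch"]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == "Fetch":
            return states
        states.append(nxt)


for op in AFTER_DECODE:
    print(op, len(trace(op)), trace(op))
```

The length of each trace is the number of cycles that instruction type takes.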
Multicycle Performance Instructions take different numbers of cycles: 3 cycles: beq, j 4 cycles: R-Type, sw, addi 5 cycles: lw CPI is weighted average SPECINT2000 benchmark: 25% loads 10% stores 11% branches 2% jumps 52% R-type Average CPI = (0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
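The weighted average can be reproduced directly from the instruction mix and the per-instruction cycle counts listed above. The dictionary layout and the name `average_cpi` are illustrative assumptions.

```python
# Instruction mix (fraction of dynamic instructions) from the benchmark above.
mix = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "rtype": 0.52}

# Cycles per instruction type in the multicycle design.
cycles = {"lw": 5, "sw": 4, "beq": 3, "j": 3, "rtype": 4}


def average_cpi(mix, cycles):
    """Weighted average CPI: sum of (fraction x cycles) over all types."""
    return sum(fraction * cycles[op] for op, fraction in mix.items())


cpi = average_cpi(mix, cycles)
print(round(cpi, 2))
```

Because the fractions sum to 1, the result is a true average; changing the mix (for example, a load-heavy program) shifts the CPI toward 5.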
Multicycle Performance Multicycle critical path: Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup
Multicycle Performance Example Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup = tpcq_PC + tmux + tmem + tsetup = [30 + 25 + 250 + 20] ps = 325 ps
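The same calculation as a short sketch, using the delays from the example. The key names and `multicycle_tc` are assumed names for illustration.

```python
# Component delays in picoseconds, taken from the example above.
delays_ps = {"tpcq_PC": 30, "tmux": 25, "tALU": 200, "tmem": 250, "tsetup": 20}


def multicycle_tc(d):
    """Multicycle critical path: the slower of the ALU path and memory."""
    slowest_step = max(d["tALU"] + d["tmux"], d["tmem"])
    return d["tpcq_PC"] + d["tmux"] + slowest_step + d["tsetup"]


tc_ps = multicycle_tc(delays_ps)

# A period of Tc picoseconds corresponds to 1000 / Tc gigahertz.
f_max_ghz = 1000.0 / tc_ps
print(tc_ps, "ps ->", round(f_max_ghz, 3), "GHz")
```

Here memory (250 ps) is slower than the ALU plus its output mux (225 ps), so memory sets the cycle time.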
Multicycle Performance Example For a program with 100 billion instructions executing on a multicycle MIPS processor CPI = 4.12 Tc = 325 ps Execution Time = (# instructions) × CPI × Tc = (100 × 10^9)(4.12)(325 × 10^-12 s) = 133.9 seconds This is slower than the single-cycle processor (92.5 seconds). Why? Not all steps the same length Sequencing overhead for each step (tpcq + tsetup = 50 ps)
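A side-by-side comparison makes the slowdown concrete. This sketch reuses the numbers from the two examples; the function and variable names are assumptions.

```python
def execution_time(instruction_count, cpi, clock_period_s):
    """Execution time in seconds: IC x CPI x Tc."""
    return instruction_count * cpi * clock_period_s


IC = 100e9  # 100 billion instructions

single_cycle_s = execution_time(IC, 1, 925e-12)
multicycle_s = execution_time(IC, 4.12, 325e-12)
slowdown = multicycle_s / single_cycle_s

# Average time per instruction, in picoseconds.
single_ps_per_instr = 1 * 925
multi_ps_per_instr = 4.12 * 325

print(f"single-cycle: {single_cycle_s:.1f} s")
print(f"multicycle:   {multicycle_s:.1f} s")
print(f"slowdown:     {slowdown:.2f}x")
```

Per instruction, the multicycle design averages about 1339 ps against 925 ps for single-cycle: every cycle lasts as long as the slowest step and pays the 50 ps sequencing overhead, and the average instruction needs a little over four such cycles.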
Review: Single-Cycle MIPS Processor
Review: Multicycle MIPS Processor
Next Time Next topic: We’ll look at pipelined MIPS Improving throughput (and adding complexity!) by trying to use all of the hardware every cycle