Published by Nicholas Powell. Modified over 9 years ago.
1
COMP541
Multicycle MIPS
Montek Singh
Apr 8, 2015
2
Topics
Challenges with single-cycle MIPS implementation
Multicycle MIPS
State elements
Now add registers between stages
How to control
Performance
3
Review: Processor Performance
Program execution time
Execution Time = (# instructions) × (cycles/instruction) × (seconds/cycle) = IC × CPI × Tc
Definitions:
  IC = instruction count
  CPI = cycles/instruction
  Tc = seconds/cycle = clock period
  IPC = instructions/cycle = 1/CPI
Challenge is to satisfy constraints of:
  Cost
  Power
  Performance
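As a minimal sketch, the execution-time product above can be written as a small Python helper. The function names and the sample workload (1 billion instructions, CPI of 2, 1 ns clock) are illustrative assumptions, not values from the slides.

```python
def execution_time(instruction_count, cpi, clock_period_s):
    """Execution time in seconds: IC x CPI x Tc."""
    return instruction_count * cpi * clock_period_s


def ipc(cpi):
    """Instructions per cycle is the reciprocal of CPI."""
    return 1.0 / cpi


# Hypothetical workload, chosen only for illustration.
t = execution_time(instruction_count=1e9, cpi=2, clock_period_s=1e-9)
print(f"{t:.2f} s at IPC = {ipc(2):.2f}")
```

Halving any one of the three factors halves the execution time, which is why the later slides look at CPI and Tc together rather than separately.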
4
Single-Cycle Performance (textbook version)
Tc is limited by the critical path (lw)
lw is typically the longest instruction
5
Single-Cycle Performance (textbook version)
Single-cycle critical path:
  Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
In most implementations, the limiting paths are memory, ALU, and register file, so:
  Tc = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
6
Single-Cycle Performance Example
Tc = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
   = [30 + 2(250) + 150 + 200 + 25 + 20] ps
   = 925 ps
What’s the max clock frequency?
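The sum above, and the clock-frequency question, can be checked with a short Python sketch. The individual element delays are assumptions: only 30 ps and 250 ps survive in the slide text, and the remaining values are the ones usually paired with this textbook example, which are consistent with the 925 ps total.

```python
# Assumed element delays in picoseconds (not all appear in the slide text).
DELAYS_PS = {
    "tpcq_PC": 30,    # register clock-to-Q
    "tmem": 250,      # memory read
    "tRFread": 150,   # register file read
    "tALU": 200,      # ALU
    "tmux": 25,       # multiplexer
    "tRFsetup": 20,   # register file setup
}


def single_cycle_tc_ps(d):
    """Simplified single-cycle critical path (the lw path)."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tALU"] + d["tmux"] + d["tRFsetup"])


tc_ps = single_cycle_tc_ps(DELAYS_PS)
f_max_ghz = 1000.0 / tc_ps  # a 1 ps period corresponds to 1000 GHz
print(f"Tc = {tc_ps} ps, f_max = {f_max_ghz:.2f} GHz")
```

Under these assumed delays the maximum clock frequency is 1/Tc, roughly 1.08 GHz.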
7
Single-Cycle Performance Example
For a program with 100 billion instructions executing on a single-cycle MIPS processor:
Execution Time = (# instructions) × CPI × Tc
  = (100 × 10^9)(1)(925 × 10^-12 s)
  = 92.5 seconds
8
Multicycle MIPS
Key idea: Break instruction execution into multiple clock cycles
9
Multicycle MIPS Processor
Single-cycle microarchitecture:
  + simple
  - cycle time limited by longest instruction (lw)
  - two adders/ALUs and two memories
Multicycle microarchitecture:
  + higher clock speed
  + simpler instructions run faster
  + reuse expensive hardware on multiple cycles
  - sequencing overhead
Same design steps: datapath & control
10
Multicycle State Elements
Replace Instruction and Data memories with a single unified memory
  More realistic (buy one big RAM!)
Was not possible in the single-cycle implementation
  Both instruction and data accesses were needed within the same clock cycle
Now: use the same memory twice if needed
  Instruction fetch and data access are in distinct clock cycles
11
Multicycle Datapath: lw instr fetch
First consider executing lw
STEP 1: Fetch instruction
  Introduce an Instruction Register to buffer this instruction
  It is a “non-architectural register”: not accessible to the programmer
12
Multicycle Datapath: lw register read
Read register $rs
  Insert another non-architectural register, A
  A buffers the value of $rs read from the register file
13
Multicycle Datapath: lw immediate
Immediate field is sign-extended
  For consistency, could insert another non-architectural register to buffer SignImm
  Skipped in this version because SignImm is a simple combinational function of Instr, which is already held in the Instruction Register
14
Multicycle Datapath: lw address
ALU computes memory address
  Insert another register to buffer ALUOut
15
Multicycle Datapath: lw memory read
The same memory is now read for the data access
  Insert a multiplexer in front of the memory’s address input
  Choose either PC or ALUOut as the address, i.e., either instruction fetch or data access
  Controlled by the new control signal IorD
16
Multicycle Datapath: lw write register
Data from memory is written into register file
17
Multicycle Datapath: increment PC
PC is incremented by reusing the ALU to compute PC + 4
  In single-cycle, we had to introduce a dedicated +4 adder
  In multicycle, the same ALU is used twice, in distinct cycles!
  Now using the main ALU when it is not busy (instead of a dedicated adder)
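The lw steps walked through on the preceding slides can be sketched in Python as one function with one comment per clock cycle. This is a behavioral illustration only, not the lab hardware: the function name, the dictionary-based memory, and the register list are assumptions made for the example, and the instruction word uses the standard MIPS I-type field layout (rs in bits 25:21, rt in bits 20:16, immediate in bits 15:0).

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def run_lw(pc, regs, mem):
    """Step one lw through five cycles; update regs and return the new PC."""
    # Cycle 1: fetch into the Instruction Register; the idle ALU computes PC + 4.
    instr = mem[pc]
    pc = (pc + 4) & 0xFFFFFFFF
    # Cycle 2: read $rs into non-architectural register A; sign-extend the immediate.
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    a = regs[rs]
    sign_imm = sign_extend16(instr & 0xFFFF)
    # Cycle 3: the ALU computes the address, buffered in ALUOut.
    alu_out = (a + sign_imm) & 0xFFFFFFFF
    # Cycle 4: the same memory is read again, now addressed by ALUOut.
    data = mem[alu_out]
    # Cycle 5: the loaded word is written into $rt.
    regs[rt] = data
    return pc


# lw $t1, 8($t0) encodes as 0x8D090008 ($t0 is register 8, $t1 is register 9).
regs = [0] * 32
regs[8] = 0x1000
mem = {0x0: 0x8D090008, 0x1008: 42}
new_pc = run_lw(0x0, regs, mem)
print(hex(new_pc), regs[9])
```

The single `mem` dictionary serves both the instruction fetch and the data read, mirroring the unified memory, and the two ALU uses (PC + 4 and the address) fall in different cycles.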
18
Multicycle Datapath: sw
Compared to lw:
  Address computation is identical to lw
  Write data in $rt to memory
  MemWrite will be 1 during the appropriate clock cycle
  $rt is buffered using non-architectural register B
19
Multicycle Datapath: R-type Instrs.
Read from $rs and $rt
  Multiplexers in front of the ALU choose $rs and $rt as operands
Write ALUResult to register file
  Write to $rd (instead of $rt)
  Multiplexers in front of the write address/data inputs of the register file
20
Multicycle Datapath: beq
2 tasks:
  Determine whether the values in rs and rt are equal
  Calculate branch target address: BTA = (sign-extended immediate << 2) + (PC + 4)
ALU reused!
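The branch target formula above can be sketched in Python as follows. The function names are hypothetical, and `pc` is assumed to be the address of the beq instruction itself, so that `pc + 4` is the address of the next instruction.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    return ((sign_extend16(imm16) << 2) + pc + 4) & 0xFFFFFFFF


# Forward branch by 3 instructions, and a branch back to the beq itself.
print(hex(branch_target(0x00400000, 0x0003)))
print(hex(branch_target(0x00400000, 0xFFFF)))
```

The shift by 2 converts the immediate from a word offset to a byte offset, and an immediate of -1 (0xFFFF) targets the branch instruction itself.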
21
Complete Multicycle Processor
Caveat: this design differs in functionality from our lab version in the same ways as the single-cycle MIPS
22
Control Unit
23
Main Controller FSM: Fetch
24
Main Controller FSM: Fetch
Fetch instruction
Also increment PC (because the ALU is not in use)
Note: signals are shown only when needed, and enables only when asserted.
25
Main Controller FSM: Decode
No control signals needed for decode
Register values are also fetched
  They may end up not being used
26
Main Controller FSM: Address Calculation
Now change states depending on instr
27
Main Controller FSM: Address Calculation
For lw or sw, need to compute addr
28
Main Controller FSM: lw
For lw, now need to read from memory
Then write to register
29
Main Controller FSM: sw
sw just writes to memory
One step shorter
30
Main Controller FSM: R-Type
The R-type instructions have two steps: compute the result in the ALU, then write it to the register file
31
Main Controller FSM: beq
beq needs to use the ALU twice, so consumes two cycles
  One to compute the address
  Another to decide on equality
Can take advantage of decode, when the ALU is not otherwise used, to compute BTA (no harm if BTA is not used)
32
Complete Multicycle Controller FSM
33
Main Controller FSM: addi
Similar to R-type
  Add
  Write back
34
Main Controller FSM: addi
35
Extended Functionality: j
36
Control FSM: j
37
Control FSM: j
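The controller walk-through on the preceding slides can be summarized as a next-state function. This is a minimal Python sketch: the state names (Fetch, Decode, MemAdr, and so on) and the instruction-class strings are assumed labels for illustration, since the FSM diagrams themselves are not reproduced in this text.

```python
# Which state follows Decode for each instruction class (assumed names).
AFTER_DECODE = {
    "lw": "MemAdr", "sw": "MemAdr", "R-type": "Execute",
    "beq": "Branch", "addi": "ADDIExecute", "j": "Jump",
}
# States with exactly one successor that is not Fetch.
FIXED_NEXT = {
    "MemRead": "MemWriteback",
    "Execute": "ALUWriteback",
    "ADDIExecute": "ADDIWriteback",
}


def next_state(state, op):
    """Return the state that follows `state` while executing instruction class `op`."""
    if state == "Fetch":
        return "Decode"
    if state == "Decode":
        return AFTER_DECODE[op]
    if state == "MemAdr":
        return "MemRead" if op == "lw" else "MemWrite"
    # Every remaining state either has a fixed successor or returns to Fetch.
    return FIXED_NEXT.get(state, "Fetch")


def trace(op):
    """List the states visited by one instruction, from Fetch until the FSM returns to Fetch."""
    states = ["Fetch"]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == "Fetch":
            return states
        states.append(nxt)


for op in ("lw", "sw", "R-type", "beq", "addi", "j"):
    print(f"{op}: {len(trace(op))} cycles via {' -> '.join(trace(op))}")
```

Each state occupies one clock cycle, so the length of a trace is the cycle count for that instruction class: lw is the longest path and beq and j are the shortest.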
38
Multicycle Performance
Instructions take different numbers of cycles:
  3 cycles: beq, j
  4 cycles: R-type, sw, addi
  5 cycles: lw
CPI is a weighted average
SPECINT2000 benchmark:
  25% loads
  10% stores
  11% branches
  2% jumps
  52% R-type
Average CPI = (0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
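The weighted average can be reproduced with a few lines of Python. The cycle counts and the instruction mix are the ones listed on the slide; the dictionary keys and the function name are illustrative.

```python
# Cycles per instruction class, from the slide.
CYCLES = {"load": 5, "store": 4, "branch": 3, "jump": 3, "r_type": 4}
# SPECINT2000 instruction mix, from the slide (fractions sum to 1).
MIX = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "r_type": 0.52}


def average_cpi(mix, cycles):
    """Weight each class's cycle count by its share of executed instructions."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(fraction * cycles[name] for name, fraction in mix.items())


print(f"Average CPI = {average_cpi(MIX, CYCLES):.2f}")
```

Because loads are a quarter of the mix and take five cycles, they contribute 1.25 of the 4.12 total, more than any class except R-type.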
39
Multicycle Performance
Multicycle critical path: Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup
40
Multicycle Performance Example
Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup
   = tpcq_PC + tmux + tmem + tsetup
   = [30 + 25 + 250 + 20] ps
   = 325 ps
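The multicycle clock period can be checked the same way. In this Python sketch the element delays are assumptions, since the slide text does not list them; they are the values usually paired with this textbook example and are consistent with the 325 ps result and with the 50 ps sequencing overhead (tpcq + tsetup) quoted later.

```python
# Assumed element delays in picoseconds (not listed in the slide text).
DELAYS_PS = {"tpcq_PC": 30, "tsetup": 20, "tmux": 25, "tALU": 200, "tmem": 250}


def multicycle_tc_ps(d):
    """Longest single step: register -> mux -> (ALU + mux, or memory) -> register."""
    slowest_stage = max(d["tALU"] + d["tmux"], d["tmem"])
    return d["tpcq_PC"] + d["tmux"] + slowest_stage + d["tsetup"]


print(f"Tc = {multicycle_tc_ps(DELAYS_PS)} ps")
```

With these delays the memory access (250 ps) is slower than the ALU followed by a multiplexer (225 ps), so memory sets the clock period.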
41
Multicycle Performance Example
For a program with 100 billion instructions executing on a multicycle MIPS processor:
  CPI = 4.12
  Tc = 325 ps
Execution Time = (# instructions) × CPI × Tc
  = (100 × 10^9)(4.12)(325 × 10^-12 s)
  = 133.9 seconds
This is slower than the single-cycle processor (92.5 seconds). Why?
  Not all steps are the same length
  Sequencing overhead for each step (tpcq + tsetup = 50 ps)
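A short Python sketch makes the comparison concrete, using only figures from the slides (100 billion instructions, CPI of 1 at 925 ps, CPI of 4.12 at 325 ps). The function and variable names are illustrative.

```python
def execution_time(instruction_count, cpi, clock_period_s):
    """Execution time in seconds: IC x CPI x Tc."""
    return instruction_count * cpi * clock_period_s


IC = 100e9  # 100 billion instructions

single = execution_time(IC, cpi=1, clock_period_s=925e-12)
multi = execution_time(IC, cpi=4.12, clock_period_s=325e-12)

print(f"single-cycle: {single:.1f} s")
print(f"multicycle:   {multi:.1f} s")
print(f"slowdown:     {multi / single:.2f}x")
```

Per instruction, the multicycle design spends on average 4.12 × 325 ps = 1339 ps against 925 ps for the single-cycle design, so the shorter clock period does not make up for the higher CPI.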
42
Review: Single-Cycle MIPS Processor
43
Review: Multicycle MIPS Processor
44
Next Time
Next topic: We’ll look at pipelined MIPS
Improving throughput (and adding complexity!) by trying to use all of the hardware every cycle