COMP541 Multicycle MIPS Montek Singh Apr 8, 2015.

Name: COMP541 Multicycle MIPS Montek Singh Apr 8, 2015.
Uploaded: 2017-08-25T13:00:15+00:00
Duration: PTM12S7
Channel: Nicholas Powell
Description: COMP541 Multicycle MIPS Montek Singh Apr 8, 2015.

Topics
Challenges with the single-cycle MIPS implementation. Multicycle MIPS: state elements (now add registers between stages) and how to control it. Performance.

Review: Processor Performance
Program execution time: Execution Time = (# instructions) × (cycles/instruction) × (seconds/cycle) = IC × CPI × Tc. Definitions: IC = instruction count; CPI = cycles/instruction; Tc = seconds/cycle = clock period; IPC = instructions/cycle = 1/CPI. The challenge is to satisfy constraints of cost, power, and performance.

Single-Cycle Performance (textbook version)
Tc is limited by the critical path (lw). lw is typically the longest instruction.

Single-Cycle Performance (textbook version)
Single-cycle critical path: Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup. In most implementations, the limiting paths are memory, ALU, and register file, so Tc = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup.

Single-Cycle Performance Example
Tc = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup = [30 + 2(250) + 150 + 200 + 25 + 20] ps = 925 ps. What’s the max clock frequency?
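As a rough check of this arithmetic, the sketch below sums the delays along the lw critical path and answers the question about the maximum clock frequency. Only 30 ps and 250 ps appear on the slide itself; the remaining delay values are assumed textbook-style figures chosen to be consistent with the 925 ps total, and all names are illustrative.

```python
# Element delays in picoseconds. Only t_pcq_PC (30) and t_mem (250) appear on
# the slide; the other values are ASSUMED figures consistent with 925 ps.
DELAYS_PS = {
    "t_pcq_PC": 30,
    "t_mem": 250,
    "t_RFread": 150,   # assumed
    "t_ALU": 200,      # assumed
    "t_mux": 25,       # assumed
    "t_RFsetup": 20,   # assumed
}

def single_cycle_tc_ps(d):
    """Tc = tpcq_PC + 2*tmem + tRFread + tALU + tmux + tRFsetup (in ps)."""
    return (d["t_pcq_PC"] + 2 * d["t_mem"] + d["t_RFread"]
            + d["t_ALU"] + d["t_mux"] + d["t_RFsetup"])

def max_frequency_ghz(tc_ps):
    """Maximum clock frequency in GHz for a clock period given in ps."""
    return 1000.0 / tc_ps

tc = single_cycle_tc_ps(DELAYS_PS)
print(tc, "ps")                                # 925 ps
print(round(max_frequency_ghz(tc), 2), "GHz")  # 1.08 GHz
```

Under these assumptions the clock can run at roughly 1.08 GHz at best.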

Single-Cycle Performance Example
For a program with 100 billion instructions executing on a single-cycle MIPS processor: Execution Time = # instructions × CPI × Tc = (100 × 10^9)(1)(925 × 10^-12 s) = 92.5 seconds.

Multicycle MIPS
Key idea: Break instruction execution into multiple clock cycles.

Multicycle MIPS Processor
Single-cycle microarchitecture: + simple; - cycle time limited by longest instruction (lw); - two adders/ALUs and two memories. Multicycle microarchitecture: + higher clock speed; + simpler instructions run faster; + reuse expensive hardware on multiple cycles; - sequencing overhead. Same design steps: datapath & control.

Multicycle State Elements
Replace the instruction and data memories with a single unified memory. This is more realistic (buy one big RAM!). It was not possible in the single-cycle implementation, because both instruction and data accesses were needed within the same clock cycle. Now the same memory is used twice if needed: instruction fetch and data access are in distinct clock cycles.

Multicycle Datapath: lw instr fetch
First consider executing lw. STEP 1: Fetch the instruction. Introduce an Instruction Register to buffer this instruction. It is a “non-architectural register”: not accessible to the programmer.

Multicycle Datapath: lw register read
Read register $rs. Insert another non-architectural register, A, which buffers the value of $rs read from the register file.

Multicycle Datapath: lw immediate
The immediate field is sign-extended. For consistency, we could insert another non-architectural register to buffer SignImm. This is skipped in this version because SignImm is a simple combinational function of Instr, which is already being held in the Instruction Register.

Multicycle Datapath: lw address
The ALU computes the memory address. Insert another register to buffer ALUOut.

Multicycle Datapath: lw memory read
The same memory is now read for the data access. Insert a multiplexer in front of the memory’s address input to choose either PC or ALUOut as the address, i.e., either instruction fetch or data access. The multiplexer is controlled by a new control signal, IorD.

Multicycle Datapath: lw write register
Data from memory is written into register file

Multicycle Datapath: increment PC
The PC is incremented by reusing the ALU to compute PC + 4. In single-cycle, we had to introduce a dedicated +4 adder; in multicycle, the same ALU is used twice, in distinct cycles! We now use the main ALU when it is not busy (instead of a dedicated adder).
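The lw steps described so far can be traced in software. The following is only a behavioral sketch, not the hardware design: it assumes a unified memory modeled as a dictionary keyed by byte address, the standard MIPS I-type field layout, and a hypothetical helper name `run_lw`. The PC increment is placed in the fetch cycle, where the ALU is otherwise idle.

```python
def sign_extend16(value):
    """Sign-extend the low 16 bits of value to a Python int."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def run_lw(mem, regs, pc):
    """Step one lw through the multicycle datapath; return the new PC.

    mem is one unified memory (dict: byte address -> word) holding both
    instructions and data; regs is a 32-entry register file list.
    """
    # Cycle 1 (fetch): IorD = 0 selects PC as the memory address.
    instr = mem[pc]                  # held in the Instruction Register
    pc = pc + 4                      # ALU is idle here, so reuse it for PC + 4
    # Cycle 2 (register read): buffer $rs in non-architectural register A.
    rs = (instr >> 21) & 0x1F
    rt = (instr >> 16) & 0x1F
    a = regs[rs]
    sign_imm = sign_extend16(instr)  # combinational function of Instr
    # Cycle 3 (address): ALU adds base and offset; result buffered in ALUOut.
    alu_out = a + sign_imm
    # Cycle 4 (memory read): IorD = 1 selects ALUOut as the memory address.
    data = mem[alu_out]
    # Cycle 5 (write back): the loaded word is written to $rt.
    regs[rt] = data
    return pc

# Usage: lw $8, 8($16) stored at address 0, with $16 = 0x100.
lw_instr = (0x23 << 26) | (16 << 21) | (8 << 16) | 8
memory = {0: lw_instr, 0x108: 42}
registers = [0] * 32
registers[16] = 0x100
new_pc = run_lw(memory, registers, 0)
print(new_pc, registers[8])  # 4 42
```

Note that the same `mem` object serves both the fetch in cycle 1 and the data read in cycle 4, which is what the unified memory makes possible.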

Multicycle Datapath: sw
Compared to lw: the address computation is identical to lw, but the data in $rt is written to memory. MemWrite will be 1 during the appropriate clock cycle. $rt is buffered using non-architectural register B.

Multicycle Datapath: R-type Instrs.
Read from $rs and $rt: multiplexers in front of the ALU choose $rs and $rt as operands. Write ALUResult to the register file. Write to $rd (instead of $rt): multiplexers in front of the write address/data inputs of the register file.

Multicycle Datapath: beq
2 tasks: (1) Determine whether the values in rs and rt are equal. (2) Calculate the branch target address: BTA = (sign-extended immediate << 2) + (PC + 4). The ALU is reused!
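The two beq tasks can be written out directly. This is a minimal sketch with illustrative function names; it assumes a 16-bit immediate and 32-bit addresses.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """BTA = (sign-extended immediate << 2) + (PC + 4), wrapped to 32 bits."""
    return ((sign_extend16(imm16) << 2) + (pc + 4)) & 0xFFFFFFFF

def beq_next_pc(pc, rs_val, rt_val, imm16):
    """Next PC for beq: the BTA if the operands are equal, else PC + 4."""
    return branch_target(pc, imm16) if rs_val == rt_val else pc + 4

print(hex(branch_target(0x1000, 3)))       # 0x1010 (forward by 3 words)
print(hex(branch_target(0x1000, 0xFFFF)))  # 0x1000 (immediate is -1)
print(hex(beq_next_pc(0x1000, 5, 6, 3)))   # 0x1004 (not taken)
```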

Complete Multicycle Processor
Caveat: The same differences in functionality w.r.t. our lab version apply here as for single-cycle MIPS.

Control Unit

Main Controller FSM: Fetch

Main Controller FSM: Fetch
Fetch the instruction. Also increment the PC (because the ALU is not in use). Note: signals are only shown when needed, and enables only when asserted.

Main Controller FSM: Decode
No signals are needed for decode. Register values are also fetched, though perhaps they will not be used.

Main Controller FSM: Address Calculation
Now change states depending on instr

Main Controller FSM: Address Calculation
For lw or sw, need to compute addr

Main Controller FSM: lw
For lw, we now need to read from memory, then write to the register.

Main Controller FSM: sw
sw just writes to memory One step shorter

Main Controller FSM: R-Type
The R-type instructions have two steps: compute the result in the ALU and write it to the register file.

Main Controller FSM: beq
beq needs to use the ALU twice, so it consumes two cycles: one to compute the address, another to decide on equality. We can take advantage of decode, when the ALU is not in use, to compute the BTA (no harm if the BTA is not used).
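The state sequences described for lw, sw, R-type, and beq can be captured in a small next-state table. This is a sketch only: the state names are illustrative labels rather than names taken from the slides, and no control-signal outputs are modeled.

```python
# Next-state table for lw, sw, R-type and beq. State names are hypothetical.
NEXT_STATE = {
    "Fetch": lambda op: "Decode",
    "Decode": lambda op: {"lw": "MemAdr", "sw": "MemAdr",
                          "rtype": "Execute", "beq": "Branch"}[op],
    "MemAdr": lambda op: "MemRead" if op == "lw" else "MemWrite",
    "MemRead": lambda op: "MemWriteback",
    "MemWriteback": lambda op: "Fetch",
    "MemWrite": lambda op: "Fetch",
    "Execute": lambda op: "ALUWriteback",
    "ALUWriteback": lambda op: "Fetch",
    "Branch": lambda op: "Fetch",
}

def trace(op):
    """Return the states one instruction of class op passes through."""
    states = ["Fetch"]
    while True:
        nxt = NEXT_STATE[states[-1]](op)
        if nxt == "Fetch":       # back at Fetch: the instruction is done
            return states
        states.append(nxt)

for op in ("lw", "sw", "rtype", "beq"):
    print(op, len(trace(op)), trace(op))
```

Each state takes one clock cycle, so the length of a trace is that instruction's cycle count: 5 for lw, 4 for sw and R-type, and 3 for beq.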

Complete Multicycle Controller FSM

Main Controller FSM: addi
Similar to R-type: add, then write back.

Main Controller FSM: addi

Extended Functionality: j

Control FSM: j

Multicycle Performance
Instructions take different numbers of cycles: 3 cycles: beq, j; 4 cycles: R-type, sw, addi; 5 cycles: lw. CPI is a weighted average. SPECINT2000 benchmark: 25% loads, 10% stores, 11% branches, 2% jumps, 52% R-type. Average CPI = (0.11 + 0.02)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12.
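The weighted average can be reproduced in a few lines. The instruction mix and cycle counts come from the slide; the dictionary keys and function name are illustrative.

```python
# Instruction mix (fraction of dynamic instructions) and cycles per class.
MIX = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "rtype": 0.52}
CYCLES = {"load": 5, "store": 4, "branch": 3, "jump": 3, "rtype": 4}

def average_cpi(mix, cycles):
    """Weighted-average CPI: sum over classes of fraction * cycles."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction mix must sum to 1")
    return sum(mix[k] * cycles[k] for k in mix)

print(round(average_cpi(MIX, CYCLES), 2))  # 4.12
```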

Multicycle Performance
Multicycle critical path: Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup

Multicycle Performance Example
Tc = tpcq_PC + tmux + max(tALU + tmux, tmem) + tsetup = tpcq_PC + tmux + tmem + tsetup = [30 + 25 + 250 + 20] ps = 325 ps
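The role of the max() term is easier to see when evaluated. In this sketch the memory delay (250 ps) and the 325 ps total come from the slides; the ALU delay of 200 ps is an assumption, and any value with tALU + tmux below tmem gives the same result.

```python
def multicycle_tc_ps(t_pcq, t_mux, t_alu, t_mem, t_setup):
    """Tc = tpcq + tmux + max(tALU + tmux, tmem) + tsetup, all in ps."""
    return t_pcq + t_mux + max(t_alu + t_mux, t_mem) + t_setup

# t_alu = 200 is ASSUMED; memory (250 ps) dominates ALU + mux (225 ps).
tc = multicycle_tc_ps(t_pcq=30, t_mux=25, t_alu=200, t_mem=250, t_setup=20)
print(tc)  # 325
```

Because memory is the slower of the two alternatives, the cycle time is set by the memory access plus the register and multiplexer overhead around it.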

Multicycle Performance Example
For a program with 100 billion instructions executing on a multicycle MIPS processor: CPI = 4.12, Tc = 325 ps. Execution Time = (# instructions) × CPI × Tc = (100 × 10^9)(4.12)(325 × 10^-12 s) = 133.9 seconds. This is slower than the single-cycle processor (92.5 seconds). Why? Not all steps are the same length, and there is sequencing overhead for each step (tpcq + tsetup = 50 ps).
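A short sketch comparing the two designs with the figures from the slides (IC = 100 billion; CPI and Tc as given). The function and variable names are illustrative, and the per-instruction overhead figure is simply CPI times the 50 ps sequencing overhead per cycle.

```python
def execution_time_s(instruction_count, cpi, tc_ps):
    """Execution time = IC * CPI * Tc, with Tc given in picoseconds."""
    return instruction_count * cpi * tc_ps * 1e-12

IC = 100e9
single = execution_time_s(IC, 1, 925)     # single-cycle: CPI = 1
multi = execution_time_s(IC, 4.12, 325)   # multicycle: CPI = 4.12

# Average sequencing overhead paid per instruction in the multicycle design.
overhead_ps = 4.12 * 50

print(round(single, 1), "s")              # 92.5 s
print(round(multi, 1), "s")               # 133.9 s
print(round(multi / single, 2), "x")      # 1.45 x
print(round(overhead_ps), "ps")           # 206 ps
```

On these numbers the multicycle design is about 1.45 times slower, and roughly 206 ps of each instruction's average 1339 ps goes to register overhead alone.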

Review: Single-Cycle MIPS Processor

Review: Multicycle MIPS Processor

Next Time
Next topic: We’ll look at pipelined MIPS, improving throughput (and adding complexity!) by trying to use all of the hardware every cycle.
