CS/COE0447 Computer Organization & Assembly Language

Chapter 5 Part 3

Single-cycle Implementation of MIPS
Our first implementation of MIPS used a single long clock cycle for every instruction. Every instruction began on one rising (or falling) clock edge and ended on the next rising (or falling) clock edge. This approach is not practical, as it is much slower than a multicycle implementation, where different instruction classes can take different numbers of cycles. In a single-cycle implementation, every instruction must take the same amount of time as the slowest instruction. In a multicycle implementation, this problem is avoided by allowing quicker instructions to use fewer cycles. Even though the single-cycle approach is not practical, it is simpler and was useful to understand first. Now we are covering a multicycle implementation of MIPS.

A Multi-cycle Datapath
A single memory unit for both instructions and data.
A single ALU, rather than an ALU and two adders.
Registers added after every major functional unit to hold the output until it is used in a subsequent clock cycle.

Multi-Cycle Control What we need to cover
Adding registers after every functional unit
  Need to modify the "instruction execution" slides to reflect this
Breaking instruction execution down into cycles
  What can be done during the same cycle? What requires a cycle?
  Need to modify the "instruction execution" slides again
Timing
Control signal values
  What they are per cycle, per instruction
Finite state machine, which determines signals based on instruction type + which cycle it is
Putting it all together

Execution: single-cycle (reminder)
add
Example: add $t2,$t1,$t0
Fetch instruction and add 4 to PC.
Read two source registers: $t1 and $t0.
Add two values: $t1 + $t0.
Store result to the destination register: $t1 + $t0 -> $t2.

For add:
Instruction is stored in the instruction register (IR).
Values read from rs and rt are stored in A and B.
Result of ALU is stored in ALUOut.

Multi-Cycle Execution: R-type
Instruction fetch (example: sub $t0,$t1,$t2)
    IR <= Memory[PC];
    PC <= PC + 4;
Decode instruction/register read
    A <= Reg[IR[25:21]];    (rs)
    B <= Reg[IR[20:16]];    (rt)
    ALUOut <= PC + (sign-extend(IR[15:0])<<2);    (discussed later)
Execution
    ALUOut <= A op B;    (op = add, sub, and, or, ...)
Completion
    Reg[IR[15:11]] <= ALUOut;    ($t0 <= ALU result)
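As a minimal sketch of the four R-type steps above, the following Python models the internal registers (IR, A, B, ALUOut) as local variables. The function name `run_rtype`, the dictionary-based memory, and the register contents are assumptions made for illustration; the funct codes are the standard MIPS encodings for add, sub, and, or.

```python
MASK32 = 0xFFFFFFFF

# Standard MIPS funct codes for the four ALU operations named on the slide.
ALU_OPS = {
    0x20: lambda a, b: a + b,  # add
    0x22: lambda a, b: a - b,  # sub
    0x24: lambda a, b: a & b,  # and
    0x25: lambda a, b: a | b,  # or
}

def run_rtype(pc, regs, memory):
    """Run one R-type instruction in four steps; returns the updated PC."""
    # Step 1: instruction fetch
    ir = memory[pc]
    pc = (pc + 4) & MASK32
    # Step 2: decode / register read
    a = regs[(ir >> 21) & 0x1F]  # rs
    b = regs[(ir >> 16) & 0x1F]  # rt
    # Step 3: execution
    alu_out = ALU_OPS[ir & 0x3F](a, b) & MASK32
    # Step 4: completion, write rd
    regs[(ir >> 11) & 0x1F] = alu_out
    return pc

T0, T1, T2 = 8, 9, 10
# sub $t0,$t1,$t2: opcode 0, rs=$t1, rt=$t2, rd=$t0, funct 0x22
sub_instr = (T1 << 21) | (T2 << 16) | (T0 << 11) | 0x22
regs = [0] * 32
regs[T1], regs[T2] = 7, 10            # hypothetical register contents
memory = {0x00400000: sub_instr}      # hypothetical instruction address
new_pc = run_rtype(0x00400000, regs, memory)
print(hex(new_pc), hex(regs[T0]))
```

The branch-target computation of step 2 is left out here because an R-type instruction never uses it.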

lw (load word)
Example: lw $t0,-12($t1)
Fetch instruction and add 4 to PC.
Read the base register: $t1.
Sign-extend the immediate offset: fff4 -> fffffff4.
Add two values to get the address: X = fffffff4 + $t1.
Access data memory with the computed address: M[X].
Store the memory data to the destination register: $t0.
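The sign-extension and address arithmetic in the lw steps above can be checked with a short Python sketch. The helper names `sign_extend16` and `effective_address` and the value chosen for $t1 are assumptions for illustration only.

```python
def sign_extend16(imm16: int) -> int:
    """Sign-extend a 16-bit immediate to 32 bits (returned as an unsigned int)."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16

def effective_address(base: int, imm16: int) -> int:
    """Base register plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF

t1 = 0x10010020                       # hypothetical contents of $t1
offset = sign_extend16(0xFFF4)        # -12 as a 32-bit pattern
address = effective_address(t1, 0xFFF4)
print(hex(offset), hex(address))
```

Adding 0xfffffff4 and discarding the carry out of bit 31 is the same as subtracting 12, which is why the wrapped sum lands 12 bytes below the base address.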

For lw (example: lw $t0, -12($t1)):
Instruction is stored in the IR.
Contents of rs ($t1) are stored in A.
Output of ALU (address of memory location to be read) is stored in ALUOut.
Value read from memory is stored in the memory data register (MDR).

Multi-cycle Execution: lw
Instruction fetch (example: lw $t0,-12($t1))
    IR <= Memory[PC];
    PC <= PC + 4;
Instruction decode/register read
    A <= Reg[IR[25:21]];    (rs)
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0])<<2);
Execution
    ALUOut <= A + sign-extend(IR[15:0]);    ($t1 + (-12), sign-extended)
Memory access
    MDR <= Memory[ALUOut];    (M[$t1 + (-12)])
Write-back
    Load: Reg[IR[20:16]] <= MDR;    ($t0 <= M[$t1 + (-12)])
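A minimal Python sketch of the five lw steps, with IR, A, B, ALUOut, and MDR modeled as local variables. The function name `run_lw`, the dictionary-based memory, and all addresses and data values are assumptions for illustration; the opcode 0x23 is the standard MIPS encoding for lw.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value: int) -> int:
    """Sign-extend the low 16 bits of value to 32 bits (unsigned result)."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if (value & 0x8000) else value

def run_lw(pc, regs, memory):
    """Run one lw in five steps; returns (updated PC, address accessed)."""
    # Step 1: instruction fetch
    ir = memory[pc]
    pc = (pc + 4) & MASK32
    # Step 2: decode / register read. B and the branch target are computed
    # for every instruction, although lw does not use them.
    a = regs[(ir >> 21) & 0x1F]
    b = regs[(ir >> 16) & 0x1F]
    alu_out = (pc + (sign_extend16(ir) << 2)) & MASK32
    # Step 3: execution, compute the memory address
    alu_out = (a + sign_extend16(ir)) & MASK32
    # Step 4: memory access
    mdr = memory[alu_out]
    # Step 5: write-back to rt
    regs[(ir >> 16) & 0x1F] = mdr
    return pc, alu_out

T0, T1 = 8, 9
# lw $t0,-12($t1): opcode 0x23, rs=$t1, rt=$t0, immediate 0xfff4
lw_instr = (0x23 << 26) | (T1 << 21) | (T0 << 16) | 0xFFF4
regs = [0] * 32
regs[T1] = 0x10010020                                  # hypothetical base address
memory = {0x00400000: lw_instr, 0x10010014: 0xDEADBEEF}  # hypothetical contents
new_pc, addr = run_lw(0x00400000, regs, memory)
print(hex(new_pc), hex(addr), hex(regs[T0]))
```

ALUOut is written twice, once in step 2 and once in step 3. That is harmless because the branch target from step 2 is only needed by beq.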

sw
Example: sw $t0,-4($t1)
Fetch instruction and add 4 to PC.
Read the base register: $t1.
Read the source register: $t0.
Sign-extend the immediate offset: fffc -> fffffffc.
Add two values to get the address: X = fffffffc + $t1.
Store the contents of the source register to the computed address: $t0 -> Memory[X].

For sw (example: sw $t0, -12($t1)):
Instruction is stored in the IR.
Contents of rs ($t1) are stored in A.
Output of ALU (address of memory location to be written) is stored in ALUOut.

Multi-cycle Execution: sw
Instruction fetch (example: sw $t0,-12($t1))
    IR <= Memory[PC];
    PC <= PC + 4;
Decode/register read
    A <= Reg[IR[25:21]];    (rs)
    B <= Reg[IR[20:16]];    (rt)
    ALUOut <= PC + (sign-extend(IR[15:0])<<2);
Execution
    ALUOut <= A + sign-extend(IR[15:0]);    ($t1 + (-12), sign-extended)
Memory access
    Memory[ALUOut] <= B;    (M[$t1 + (-12)] <= $t0)

beq
Example: beq $t0,$t1,L (assume that L is +3 instructions away)
Fetch instruction and add 4 to PC.
Read two source registers: $t0, $t1.
Sign-extend the immediate and shift it left by 2: 0x0003 -> 0x0000000c.
Perform the test, and update the PC if it is true: if $t0 == $t1, then PC = PC + 0x0000000c.
[We will follow what MARS does, so this is not Immediate == 0x0002; PC = PC ...]
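The offset arithmetic for beq can be sketched in Python as follows. The helper names and the PC values are assumptions for illustration. The function adds the shifted offset to whatever base PC it is given, so it takes no position on which PC value the offset is measured from.

```python
def sign_extend16(imm16: int) -> int:
    """Sign-extend a 16-bit immediate to 32 bits (unsigned result)."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16

def branch_target(base_pc: int, imm16: int) -> int:
    """base_pc + (sign-extend(imm16) << 2), wrapped to 32 bits."""
    return (base_pc + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

byte_offset = sign_extend16(0x0003) << 2          # 3 instructions = 12 bytes
forward = branch_target(0x00400000, 0x0003)       # hypothetical base PC
backward = branch_target(0x00400010, 0xFFFE)      # immediate -2: 8 bytes back
print(hex(byte_offset), hex(forward), hex(backward))
```

The left shift by 2 converts an instruction count into a byte count, since every MIPS instruction is 4 bytes long.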

For beq (example: beq $t0,$t1,label):
Instruction is stored in the IR.
Registers rs and rt are stored in A and B.
Result of ALU (rs - rt) is stored in ALUOut.

Multi-cycle execution: beq
Instruction fetch (example: beq $t0,$t1,label)
    IR <= Memory[PC];
    PC <= PC + 4;
Decode/register read
    A <= Reg[IR[25:21]];    (rs)
    B <= Reg[IR[20:16]];    (rt)
    ALUOut <= PC + (sign-extend(IR[15:0])<<2);    (PC + number of bytes away the label is: negative for backward branches, positive for forward branches)
Execution
    if (A == B) then PC <= ALUOut;    (if $t0 == $t1, perform branch)
Note: the ALU is used to evaluate A == B; we'll see that this does not clash with the use of the ALU above.

j
Fetch instruction and add 4 to PC.
Take the 26-bit immediate field.
Shift left by 2 (to make a 28-bit immediate).
Get 4 bits from the current PC and attach them to the left of the immediate.
Assign the value to PC.
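The jump-address construction described above can be sketched in a few lines of Python. The function name `jump_target` and the example addresses are assumptions for illustration; opcode 0x02 is the standard MIPS encoding for j.

```python
def jump_target(pc: int, instr: int) -> int:
    """Concatenate PC[31:28], the 26-bit immediate, and two zero bits."""
    upper4 = pc & 0xF0000000            # top 4 bits of the current PC
    imm28 = (instr & 0x03FFFFFF) << 2   # 26-bit field shifted left by 2
    return upper4 | imm28

# Hypothetical example: a jump whose destination is 0x00400020.
j_instr = (0x02 << 26) | (0x00400020 >> 2)
target = jump_target(0x00400004, j_instr)   # the PC has already been incremented
print(hex(j_instr), hex(target))
```

Because only 28 bits come from the instruction, a jump can only reach addresses that share the top 4 bits of the current PC.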

For j:
No accesses to registers or memory; no need for the ALU.

Multi-cycle execution: j
Instruction fetch (example: j label)
    IR <= Memory[PC];
    PC <= PC + 4;
Decode/register read
    A <= Reg[IR[25:21]];
    B <= Reg[IR[20:16]];
    ALUOut <= PC + (sign-extend(IR[15:0])<<2);
Execution
    PC <= {PC[31:28],IR[25:0],"00"};


Multicycle Approach
Break up the instructions into steps:
  each step takes one clock cycle
  balance the amount of work to be done in each step/cycle so that they are about equal
  restrict each cycle to use each major functional unit at most once, so that such units do not have to be replicated
  functional units can be shared between different cycles within one instruction
Between steps/cycles:
  at the end of one cycle, store data to be used in later cycles of the same instruction
  need to introduce additional internal (programmer-invisible) registers for this purpose
  data to be used in later instructions are stored in programmer-visible state elements: the register file, PC, memory

Operations
These take time: memory (read/write); register file (read/write); ALU operations.
The other connections and logical elements have no latency (for our purposes).

Operations
Before: we had separate memories for instructions and data, and we had extra adders for incrementing the PC and calculating the branch address.
Now: we have just one memory and just one ALU.

Five Execution Steps
Each takes one cycle.
In one cycle, there can be at most one memory access, at most one register access, and at most one ALU operation.
But you can have a memory access, an ALU op, and/or a register access in the same cycle, as long as there is no contention for resources.
Changes to registers are made at the end of the clock cycle.
PC, ALUOut, A, B, etc. save information for the next clock cycle.

Step 1: Instruction Fetch
Access memory with the PC to fetch the instruction, and store it in the Instruction Register (IR).
Increment PC by 4. We can do this because the ALU is not being used for something else this cycle.

Step 2: Decode and Reg. Read
Read registers rs and rt. We read both of them regardless of necessity.
Compute the branch address in case the instruction is a branch. We can do this because the ALU is not busy.
ALUOut will keep the target address.

Step 3: Various Actions
ALU performs one of three functions based on instruction type:
  Memory reference: ALUOut <= A + sign-extend(IR[15:0]);
  R-type: ALUOut <= A op B;
  Branch: if (A==B) PC <= ALUOut;
Jump: PC <= {PC[31:28],IR[25:0],2'b00};

Step 4: Memory Access...
If the instruction is a memory reference:
    MDR <= Memory[ALUOut];    // if it is a load
    Memory[ALUOut] <= B;    // if it is a store
  Store is complete!
If the instruction is R-type:
    Reg[IR[15:11]] <= ALUOut;
  Now the instruction is complete!

Step 5: Register Write Back
Only the lw instruction reaches this step:
    Reg[IR[20:16]] <= MDR;

Summary of Instruction Execution
Step 1: IF; Step 2: ID; Step 3: EX; Step 4: MEM; Step 5: WB
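As a rough illustration, the step breakdown can be written as a lookup table from instruction class to the steps it passes through. The step labels are descriptive names chosen for this sketch, not signal or state names from the slides.

```python
# Steps each instruction class passes through in the multicycle design.
# Labels are illustrative; only the order and count follow the slides.
STEPS = {
    "R-type": ["IF", "ID", "EX", "R-type completion"],
    "lw":     ["IF", "ID", "EX (address)", "MEM read", "WB"],
    "sw":     ["IF", "ID", "EX (address)", "MEM write"],
    "beq":    ["IF", "ID", "EX (branch completion)"],
    "j":      ["IF", "ID", "EX (jump completion)"],
}

def cycles_for(instruction_class: str) -> int:
    """One clock cycle per step."""
    return len(STEPS[instruction_class])

for name, steps in STEPS.items():
    print(f"{name:7s} {cycles_for(name)} cycles: {' -> '.join(steps)}")
```

Every class shares the first two steps, which is why the decode step can read registers and compute a branch target before the instruction type has been acted on.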

Multicycle Execution Step (1): Instruction Fetch
    IR = Memory[PC];
    PC = PC + 4;
[Figure: datapath for this step; labeled values: 4, PC + 4]

Multicycle Execution Step (2): Instruction Decode & Register Fetch
    A = Reg[IR[25-21]];    (A = Reg[rs])
    B = Reg[IR[20-16]];    (B = Reg[rt])
    ALUOut = PC + (sign-extend(IR[15-0]) << 2);
[Figure: datapath for this step; labeled values: PC + 4, Branch Target Address, Reg[rs], Reg[rt]]

Multicycle Execution Step (3): Memory Reference Instructions
    ALUOut = A + sign-extend(IR[15-0]);
[Figure: datapath for this step; labeled values: Reg[rs], Reg[rt], PC + 4, Mem. Address]

Multicycle Execution Step (4): Memory Access - Write (sw)
    Memory[ALUOut] = B;
[Figure: datapath for this step; labeled values: PC + 4, Reg[rs], Reg[rt]]

Multicycle Execution Step (4): Memory Access - Read (lw)
    MDR = Memory[ALUOut];
[Figure: datapath for this step; labeled values: PC + 4, Reg[rs], Reg[rt], Mem. Address, Mem. Data]

Multicycle Execution Step (5): Memory Read Completion (lw)
    Reg[IR[20-16]] = MDR;
[Figure: datapath for this step; labeled values: PC + 4, Reg[rs], Reg[rt], Mem. Data, Mem. Address]

Multicycle Execution Step (3): ALU Instruction (R-Type)
    ALUOut = A op B;
[Figure: datapath for this step; labeled values: Reg[rs], Reg[rt], PC + 4, R-Type Result]

Multicycle Execution Step (4): ALU Instruction (R-Type)
    Reg[IR[15:11]] = ALUOut;
[Figure: datapath for this step; labeled values: R-Type Result, Reg[rs], Reg[rt], PC + 4]

Multicycle Execution Step (3): Branch Instructions
    if (A == B) PC = ALUOut;
[Figure: datapath for this step; labeled values: Reg[rs], Reg[rt], Branch Target Address]

Multicycle Execution Step (3): Jump Instruction
    PC = PC[31-28] concat (IR[25-0] << 2);
[Figure: datapath for this step; labeled values: Reg[rs], Reg[rt], Branch Target Address, Jump Address]

Example: CPI in a multicycle CPU
Assume the control design of the previous slides and an instruction mix of 22% loads, 11% stores, 49% R-type operations, 16% branches, and 2% jumps. What is the CPI, assuming each step requires 1 clock cycle?
Solution: The number of clock cycles for each instruction class, from the previous slide: loads 5, stores 4, R-type instructions 4, branches 3, jumps 3.
$$\mathrm{CPI} = \frac{\text{CPU clock cycles}}{\text{instruction count}} = \frac{\sum_i \left(\text{instruction count}_i \times \mathrm{CPI}_i\right)}{\text{instruction count}} = \sum_i \frac{\text{instruction count}_i}{\text{instruction count}} \times \mathrm{CPI}_i$$
$$\mathrm{CPI} = 0.22 \times 5 + 0.11 \times 4 + 0.49 \times 4 + 0.16 \times 3 + 0.02 \times 3 = 4.04$$
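The weighted sum in this example can be checked with a few lines of Python. The dictionary keys are names chosen for this sketch; the percentages and cycle counts are the ones given in the example.

```python
# Instruction mix (fraction of executed instructions) and cycles per class.
mix = {"load": 0.22, "store": 0.11, "rtype": 0.49, "branch": 0.16, "jump": 0.02}
cycles = {"load": 5, "store": 4, "rtype": 4, "branch": 3, "jump": 3}

# CPI is the mix-weighted average of the per-class cycle counts.
cpi = sum(mix[c] * cycles[c] for c in mix)

for c in mix:
    print(f"{c:7s} {mix[c]:.2f} x {cycles[c]} = {mix[c] * cycles[c]:.2f}")
print(f"CPI = {cpi:.2f}")
```

Loads are the only class needing all five cycles, but R-type instructions contribute the most to the total because they make up almost half of the mix.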


A (Refined) Datapath fig 5.26

Datapath w/ Control Signals Fig 5.27

Final Version w/ Control Fig 5.28

Multicycle Control Step (1): Fetch
    IR = Memory[PC];
    PC = PC + 4;
[Figure: datapath annotated with the control signal values for this step]

Multicycle Control Step (2): Instruction Decode & Register Fetch
    A = Reg[IR[25-21]];    (A = Reg[rs])
    B = Reg[IR[20-16]];    (B = Reg[rt])
    ALUOut = PC + (sign-extend(IR[15-0]) << 2);
[Figure: datapath annotated with the control signal values for this step]

Multicycle Control Step (3): Memory Reference Instructions
    ALUOut = A + sign-extend(IR[15-0]);
[Figure: datapath annotated with the control signal values for this step]

Multicycle Control Step (3): ALU Instruction (R-Type)
    ALUOut = A op B;
[Figure: datapath annotated with the control signal values for this step; the ALU operation is shown as ???]

Multicycle Control Step (3): Branch Instructions
    if (A == B) PC = ALUOut;
[Figure: datapath annotated with the control signal values for this step; the PC write signal is 1 if Zero = 1]

Multicycle Control Step (3): Jump Instruction
    PC = PC[31-28] concat (IR[25-0] << 2);
[Figure: datapath annotated with the control signal values for this step]

Multicycle Control Step (4): Memory Access - Read (lw)
    MDR = Memory[ALUOut];
[Figure: datapath annotated with the control signal values for this step]

Multicycle Control Step (4): Memory Access - Write (sw)
    Memory[ALUOut] = B;
[Figure: datapath annotated with the control signal values for this step]

Multicycle Control Step (4): ALU Instruction (R-Type)
    Reg[IR[15:11]] = ALUOut;    (Reg[rd] = ALUOut)
[Figure: datapath with control signals IRWrite, PCWr*, IorD, MemRead, MemWrite, MemtoReg, RegDst, RegWrite, ALUSrcA, ALUSrcB, Operation, and PCSource, annotated with their values for this step]

Multicycle Control Step (5): Memory Read Completion (lw)
    Reg[IR[20-16]] = MDR;
[Figure: datapath with control signals IRWrite, PCWr*, IorD, MemRead, MemWrite, MemtoReg, RegDst, RegWrite, ALUSrcA, ALUSrcB, Operation, and PCSource, annotated with their values for this step]

Adding registers after every functional unit
  Need to modify the "instruction execution" slides to reflect this
Breaking instruction execution down into cycles
  What can be done during the same cycle? What requires a cycle?
  Need to modify the "instruction execution" slides again
Timing: registers/memory are updated at the beginning of the next clock cycle
Control signal values
  What they are per cycle, per instruction
Finite state machine, which determines signals based on instruction type + which cycle it is
Putting it all together

Fig 5.28, for reference.
Note: In the previous diagrams, the values for the MemtoReg MUX are backward. The values shown in those slides match the pictures.

An FSM State Diagram
Note: this one is wrong; RegDst = 0; MemToReg = 1.

State Diagram, Big Picture

Handling Memory Instructions

R-type Instruction

Branch and Jump

FSM Implementation

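Example: Load (1)
[Figure: control signal values for this step]

Example: Load (2)
[Figure: control signal values for this step; rs and rt are read]

Example: Load (3)
[Figure: control signal values for this step]

Example: Load (4)
[Figure: control signal values for this step]

Example: Load (5)
[Figure: control signal values for this step]

Example: Jump (1)
[Figure: control signal values for this step]

Example: Jump (2)
[Figure: control signal values for this step]

Example: Jump (3)
[Figure: control signal values for this step]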

To Summarize...
From several building blocks, we constructed a datapath for a subset of the MIPS instruction set.
First, we analyzed instructions for functional requirements.
Second, we connected building blocks in a way that accommodates the instructions.
Third, we refined the datapath and added controls.

To Summarize...
We looked at how an instruction is executed on the datapath in a pictorial way.
We looked at control signals connected to functional blocks in our datapath.
We analyzed how execution steps of an instruction change the control signals.

To Summarize...
We compared a single-cycle implementation and a multi-cycle implementation of our datapath.
We analyzed multi-cycle execution of instructions.
We refined the multi-cycle datapath.
We designed multi-cycle control.

To Summarize...
We looked at the multi-cycle control scheme in detail.
Multi-cycle control can be implemented using an FSM.
An FSM is composed of some combinational logic and a memory element.
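A minimal Python sketch of such an FSM follows: the `state` variable plays the role of the memory element, and `next_state` plays the role of the combinational next-state logic. The state names and the use of mnemonic strings in place of opcode bits are assumptions made for this sketch, not the state numbering of the slides' diagrams.

```python
def next_state(state: str, op: str) -> str:
    """Combinational next-state logic: depends only on current state and opcode."""
    if state == "FETCH":
        return "DECODE"
    if state == "DECODE":
        return {"lw": "MEM_ADDR", "sw": "MEM_ADDR", "rtype": "EXECUTE",
                "beq": "BRANCH", "j": "JUMP"}[op]
    if state == "MEM_ADDR":
        return "MEM_READ" if op == "lw" else "MEM_WRITE"
    if state == "MEM_READ":
        return "WRITE_BACK"
    if state == "EXECUTE":
        return "RTYPE_COMPLETE"
    # MEM_WRITE, WRITE_BACK, RTYPE_COMPLETE, BRANCH, JUMP: instruction done
    return "FETCH"

def cycles(op: str) -> int:
    """Count clock cycles until the FSM returns to FETCH for the next instruction."""
    state, n = "FETCH", 0   # `state` is the memory element
    while True:
        n += 1              # one clock cycle is spent in the current state
        state = next_state(state, op)
        if state == "FETCH":
            return n

for op in ("lw", "sw", "rtype", "beq", "j"):
    print(op, cycles(op))
```

In hardware, each state would also drive a fixed set of control signal values. That output logic is omitted here to keep the sketch focused on sequencing.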

Summary
Techniques described in this chapter to design datapaths and control are at the core of all modern computer architecture.
Multicycle datapaths offer two great advantages over single-cycle:
  functional units can be reused within a single instruction if they are accessed in different cycles, reducing the need to replicate expensive logic
  instructions with shorter execution paths can complete quicker by consuming fewer cycles
Modern computers, in fact, take the multicycle paradigm to a higher level to achieve greater instruction throughput:
  pipelining (later class), where multiple instructions execute simultaneously by having cycles of different instructions overlap in the datapath
  the MIPS architecture was designed to be pipelined
