EECS 361 Computer Architecture Lecture 10: Designing a Multiple Cycle Processor Start X:40.

Recap: A Single Cycle Datapath
We have everything except control signals. Today’s lecture will show you how to generate the control signals.

[Figure: single cycle datapath. Instruction Fetch Unit (nPC_sel, Clk) producing Instruction<31:0> with fields Rs <21:25>, Rt <16:20>, Rd <11:15>, Imm16 <0:15>; 32 32-bit Registers (Ra, Rb, Rw, busA, busB, busW, RegWr, RegDst mux); Extender (imm16, ExtOp); ALUSrc mux; ALU (ALUctr, Zero); Data Memory (Adr, Data In, WrEn, MemWr); MemtoReg mux.]

The result of the last lecture is this single-cycle datapath.

Recap: PLA Implementation of the Main Control
[Figure: PLA for the main control. Inputs: op<5> ... op<0>. Product terms (AND array): R-type, ori, lw, sw, beq, jump. Outputs (OR array): RegWrite, ALUSrc, RegDst, MemtoReg, MemWrite, Branch, Jump, ExtOp, ALUop<2>, ALUop<1>, ALUop<0>.]

Similarly, for ALUSrc, we need to OR the ori, load, and store terms together because we need to assert the ALUSrc signal whenever we have the ori, load, or store instructions. The RegDst, MemtoReg, MemWrite, Branch, and Jump signals are very simple. They don’t need to OR any product terms together because each is asserted for only one instruction. For example, RegDst is asserted ONLY for the R-type instruction and MemtoReg is asserted ONLY for the load instruction. ExtOp, on the other hand, needs to be set to 1 for both the load and store instructions so the immediate field is sign extended properly. Therefore, we need to OR the load and store terms together to form the signal ExtOp. Finally, we have the ALUop signals. By clever encoding of the ALUop field, we are able to keep them simple so that no OR gates are needed.

If you don’t already know, this regular structure with an array of AND gates followed by another array of OR gates is called a Programmable Logic Array, or PLA for short. It is one of the most common ways to implement logic functions and there are a lot of CAD tools available to simplify them.
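The AND-array / OR-array structure described in these notes can be sketched in Python. This is only an illustrative model under stated assumptions: the opcode values are the standard MIPS encodings, the function names are hypothetical, and only the outputs whose product terms are spelled out in the notes (ALUSrc, ExtOp, RegDst, MemtoReg) are modeled.

```python
# Standard MIPS opcodes (6-bit op field) for the six product terms.
OPCODES = {
    "R-type": 0b000000,
    "ori":    0b001101,
    "lw":     0b100011,
    "sw":     0b101011,
    "beq":    0b000100,
    "jump":   0b000010,
}


def and_array(op):
    """AND array: one product term per instruction, true when op matches."""
    return {name: op == code for name, code in OPCODES.items()}


def or_array(terms):
    """OR array: each control signal ORs the product terms that assert it."""
    return {
        "ALUSrc":   int(terms["ori"] or terms["lw"] or terms["sw"]),
        "ExtOp":    int(terms["lw"] or terms["sw"]),
        "RegDst":   int(terms["R-type"]),  # single term, no OR gate needed
        "MemtoReg": int(terms["lw"]),      # single term, no OR gate needed
    }


def main_control(op):
    """Hypothetical helper: decode a 6-bit opcode into control signals."""
    return or_array(and_array(op))
```

For instance, `main_control(OPCODES["sw"])` asserts ALUSrc and ExtOp but neither RegDst nor MemtoReg, matching the description above.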

The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output.
Today’s Topic: Designing the Datapath for the Multiple Clock Cycle Datapath

So where are we in the overall scheme of things? Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath.

Outline of Today’s Lecture
Recap and Introduction
Introduction to the Concept of Multiple Cycle Processor
Multiple Cycle Implementation of R-type Instructions
What is a Multiple Cycle Delay Path and Why is it Bad?
Multiple Cycle Implementation of Or Immediate
Multiple Cycle Implementation of Load and Store
Putting it all Together

In the next fifteen minutes or so, I will give you an introduction to the concept of multiple cycle processor implementation. Then I will show you how to build a datapath for a multiple cycle processor by looking at what needs to be done for the different MIPS instructions. I will also show you a phenomenon called a multiple cycle delay path and how to avoid it. Finally, I will summarize what we learned. The list of things we want to cover today is long so let’s get on with today’s lecture.

Abstract View of our single cycle processor
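[Figure: abstract view of the single cycle processor. Stages: Next PC, PC, Instruction Fetch, Register Fetch, Ext, ALU, Mem Access (Data Mem), Reg. Wrt (Result Store). Main Control (input: op) and ALU control (input: fun) drive nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemRd, and MemWr; Equal is fed back from the datapath.]

Looks like an FSM with the PC as state.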

What’s wrong with our CPI=1 processor?
[Figure: timing bars for each instruction class, listing the units on each path.]
Arithmetic & Logical: PC, Inst Memory, Reg File, ALU, mux, mux, setup
Load (Critical Path): PC, Inst Memory, Reg File, ALU, Data Mem, mux, mux, setup
Store: PC, Inst Memory, Reg File, ALU, Data Mem, mux
Branch: PC, Inst Memory, Reg File, cmp, mux

Long Cycle Time
All instructions take as much time as the slowest
Real memory is not as nice as our idealized memory: it cannot always get the job done in one (short) cycle

Drawbacks of this Single Cycle Processor
Long cycle time:
  Cycle time must be long enough for the load instruction:
    PC’s Clock-to-Q +
    Instruction Memory Access Time +
    Register File Access Time +
    ALU Delay (address calculation) +
    Data Memory Access Time +
    Register File Setup Time +
    Clock Skew
Cycle time is much longer than needed for all other instructions. Examples:
  R-type instructions do not require data memory access
  Jump does not require ALU operation nor data memory access

One of the biggest disadvantages of the single cycle implementation is that it has a long cycle time. More specifically, the cycle time must be long enough for the load instruction, which has the following seven components: (1) Clock-to-Q time of the Program Counter. (2) Instruction Memory Access Time. (3) Register File Access Time. (4) ALU delay to perform a 32-bit address calculation. (5) Data Memory Access Time. (6) The Setup time for Register File Write. (7) Potential Clock Skew.

Having a long cycle time is a big problem but not the only problem. Another problem is that this cycle time (point to the list), which is long enough for the load instruction, is too long for all other instructions. For example: (1) The R-type instruction does not need a cycle time as long as the load instruction because R-type instructions do not require any data memory access. (2) Similarly, the Jump instruction does not need a cycle time this long because Jump does not require any data memory access or ALU operation. Consequently, for the R-type and Jump instructions, the processor is actually doing nothing during the last part of the clock cycle.
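The cycle-time sum above can be made concrete with a small Python sketch. The delay values are hypothetical numbers chosen only for illustration (the lecture gives none), and the names are made up for this example. The seven components of the load path and the fact that R-type skips the data memory access come from the slide.

```python
# Hypothetical component delays in nanoseconds (assumed, not from the lecture).
DELAYS_NS = {
    "pc_clk_to_q": 1,
    "instr_mem_access": 10,
    "reg_file_access": 5,
    "alu": 8,
    "data_mem_access": 10,
    "reg_file_setup": 1,
    "clock_skew": 1,
}

# The load instruction exercises all seven components.
LOAD_PATH = list(DELAYS_NS)

# An R-type instruction skips the data memory access.
RTYPE_PATH = [c for c in LOAD_PATH if c != "data_mem_access"]


def path_delay_ns(path):
    """Sum the delays of every component on a path."""
    return sum(DELAYS_NS[component] for component in path)


# The single cycle clock must fit the slowest instruction (the load).
single_cycle_time_ns = path_delay_ns(LOAD_PATH)

# Time an R-type instruction spends idle at the end of each cycle.
rtype_idle_ns = single_cycle_time_ns - path_delay_ns(RTYPE_PATH)
```

With these assumed numbers the clock period is 36 ns, and every R-type instruction wastes the 10 ns that only a load needs.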

Overview of a Multiple Cycle Implementation
The root of the single cycle processor’s problems:
  The cycle time has to be long enough for the slowest instruction
Solution:
  Break the instruction into smaller steps
  Execute each step (instead of the entire instruction) in one cycle
    Cycle time: time it takes to execute the longest step
    Keep all the steps to have similar length
  This is the essence of the multiple cycle processor
The advantages of the multiple cycle processor:
  Cycle time is much shorter
  Different instructions take different number of cycles to complete
    Load takes five cycles
    Jump only takes three cycles
  Allows a functional unit to be used more than once per instruction

Well, the root of these problems of course is the fact that the Single Cycle Processor’s cycle time has to be long enough for the slowest instruction. The solution is simple. Just break the instruction into smaller steps and instead of executing an entire instruction in one cycle, we will execute each of these steps in one cycle. Since the cycle time in this case will be the time it takes to execute the longest step, our goal should be to keep all the steps at a similar length when we break up the instruction. Well, the last two bullets pretty much summarize what a multiple cycle processor is all about.

The first advantage of the multiple cycle processor is of course shorter cycle time. The cycle time now only has to be long enough to execute the longest step. But maybe more importantly, now different instructions can take different numbers of cycles to complete. For example: (1) The load instruction will take five cycles to complete. (2) But the Jump instruction will only take three cycles. This feature greatly reduces the idle time inside the processor. Finally, the multiple cycle implementation allows a functional unit to be used more than once per instruction as long as it is used on different clock cycles. For example, as I will show you later in today’s lecture, we can use the ALU to increment the Program Counter as well as to do address calculation.
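A rough Python sketch of the trade-off just described: the cycle time is set by the longest step, and each instruction pays only for the cycles it uses. The step delays and the instruction mix are hypothetical values picked for illustration; only the cycle counts for load (five) and jump (three) come from the slide.

```python
# Hypothetical per-step delays in nanoseconds (assumed for illustration).
STEP_DELAYS_NS = {
    "instr_fetch": 10,
    "decode_reg_fetch": 5,
    "alu": 8,
    "data_mem_access": 10,
    "reg_write": 5,
}

# Cycle counts stated in the lecture.
CYCLES_PER_INSTRUCTION = {"load": 5, "jump": 3}


def cycle_time_ns(step_delays=STEP_DELAYS_NS):
    """The clock must be long enough for the longest single step."""
    return max(step_delays.values())


def instruction_time_ns(kind):
    """Execution time = number of cycles used * cycle time."""
    return CYCLES_PER_INSTRUCTION[kind] * cycle_time_ns()


def average_cpi(mix_percent):
    """Weighted average cycles per instruction for a given mix (in percent)."""
    total = sum(mix_percent.values())
    weighted = sum(CYCLES_PER_INSTRUCTION[k] * pct for k, pct in mix_percent.items())
    return weighted / total
```

Under these assumptions a load takes 50 ns and a jump 30 ns. A made-up mix of 25% loads and 75% jumps averages 3.5 cycles per instruction.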

The Five Steps of a Load Instruction
[Figure: timing diagram of the load instruction, divided into five steps: Instruction Fetch; Instr Decode / Reg. Fetch; Address; Data Memory; Reg Wr. Signals shown (each changing from Old Value to New Value): Clk, PC, Rs/Rt/Rd/Op/Func, ALUctr, ExtOp, ALUSrc, RegWr, busA, busB, Address, busW. Annotated delays: Clk-to-Q, Instruction Memory Access Time, Delay through Control Logic, Register File Access Time, Delay through Extender & Mux, ALU Delay, Data Memory Access Time, Register File Write Time.]

Well, let’s take a look at the Load instruction’s timing diagram and see how we can break it up into smaller steps. The biggest contributors to the cycle time appear to be: (1) Instruction Memory Access Time. (2) Delay through the Control Logic, which happens in parallel with Register File Access. (3) ALU Delay. (4) Data Memory Access Time. (5) Register File Write Time.

Therefore, it makes sense to break up the Load instruction into these five steps: (1) Instruction Fetch. (2) Instruction Decode “slash” Register Fetch. (3) Memory Address Calculation. (4) Data Memory Access. (5) Register File Write.

Notice that here I have used the term Register File Write time instead of Register File Write Setup time. The reason is that in a “real” register file, there is no such thing as setup time.

Register File & Memory Write Timing: Ideal vs. Reality
In previous lectures, register file and memory are simplified:
  Write happens at the clock tick
  Address, data, and write enable must be stable one “set-up” time before the clock tick
In real life:
  Neither the register file nor the ideal memory has a clock input
  The write path is a combinational logic delay path:
    Write enable goes to 1 and Din settles down
    Memory write access delay
    Din is written into mem[address]
  Important: Address and Data must be stable BEFORE Write Enable goes to 1

[Figure: two Ideal Memory blocks with ports Adr, Din (32), WrEn, Dout (32); the top one has a Clk input, the bottom one does not.]

Because in a real register file, there is NO clock input (use the bottom picture). In previous lectures, I tried to simplify things by giving both the register file and data memory a clock input such that all writes happen at the clock tick, that is, the H to L transition of the clock. Consequently, the address bus, the Data In bus, and the Write Enable signal must ALL be stable at least ONE setup time before the clock tick.

But in real life, neither the register file nor the ideal data memory has a clock input. The write path is purely combinational. That is: (1) After the control signal Write Enable has gone to 1 and the Data In bus has settled down to a given value, (2) it will take a delay equal to the Memory Write Access Delay (3) BEFORE the value on the Data In bus is written into the memory location specified by the address bus.

It is very VERY important that the address bus is stable BEFORE the control signal Write Enable is set to 1. Otherwise, you may end up destroying data already in memory by writing to the wrong address location if there are any glitches on the address bus when Write Enable is asserted.

Race Condition Between Address and Write Enable
This “real” (no clock input) register file may not work reliably in the single cycle processor because:
  We cannot guarantee Rw will be stable BEFORE RegWr = 1
  There is a “race” between Rw (address) and RegWr (write enable)
The “real” (no clock input) memory may not work reliably in the single cycle processor because:
  We cannot guarantee Address will be stable BEFORE WrEn = 1
  There is a race between Adr and WrEn

[Figure: Reg File with ports Ra, Rb, Rw (5 bits each), busA, busB, busW (32 bits each), RegWr; Ideal Memory with ports Adr, Din, WrEn, Dout (32 bits).]

Notice that this real register file, which does not have a clock input, may not work reliably in our single cycle processor because if you look at the timing diagram, you will notice that: (1) We cannot guarantee Rw, which specifies the register to be written, will be stable BEFORE the control signal RegWr goes to 1. (2) In other words, we have a race between the setting of Rw and the assertion of RegWr. On a good day, if Rw does settle down before RegWr goes to 1, everything works. But once in a while, if RegWr happens to go to 1 before Rw settles down, we have a problem. Race conditions like this are what cause machines to crash mysteriously during initial testing.

Similarly, I did not use this data memory in our single cycle processor design because we cannot guarantee the address bus to be stable BEFORE Write Enable is set to 1. Once again, we have a race condition between the Address and the Write Enable signal. How can we avoid these two race conditions in our multiple cycle implementation?

How to Avoid this Race Condition?
Solution for the multiple cycle implementation:
  Make sure Address is stable by the end of Cycle N
  Assert Write Enable signal ONE cycle later at Cycle (N + 1)
  Address cannot change until Write Enable is deasserted

Well, for the multiple cycle implementation, we can avoid this race condition by: (1) Making sure the address bus is stable by the end of Cycle N. (2) Then we can assert the write enable signal ONE cycle later at Cycle N + 1. (3) Finally, we have to make sure the address bus does not change until the Write Enable signal is deasserted.
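The three rules above can be expressed as a simple check over a cycle-by-cycle trace. This is a minimal sketch, assuming each cycle is summarized by the settled address value and the write enable level; the function name and trace format are hypothetical.

```python
def write_is_race_free(trace):
    """Check a per-cycle trace of (address, write_enable) pairs.

    A write in cycle N + 1 is safe only if the address was already stable
    at the end of cycle N, i.e. it matches the previous cycle's address.
    Applying this to every cycle with write_enable = 1 also enforces that
    the address does not change while write enable stays asserted.
    """
    for cycle, (addr, wr_en) in enumerate(trace):
        if not wr_en:
            continue
        if cycle == 0:
            return False  # no earlier cycle in which the address could settle
        prev_addr, _ = trace[cycle - 1]
        if addr != prev_addr:
            return False  # address changed in the same cycle as the write
    return True
```

For example, `[(5, 0), (5, 1), (7, 0)]` passes (address 5 settles first, then the write happens), while `[(5, 0), (7, 1)]` fails because the address changes in the same cycle that write enable is asserted.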

Dual-Port Ideal Memory
Independent Read (RAdr, Dout) and Write (WrAdr, Din) ports
Read and write (to different locations) can occur in the same cycle
Read Port is a combinational path:
  Read Address Valid --> Memory Read Access Delay --> Data Out Valid
Write Port is also a combinational path:
  MemWrite = 1 --> Memory Write Access Delay --> Data In is written into location[WrAdr]

[Figure: Ideal Memory with ports RAdr, WrAdr, Din (32), Dout (32), MemWr; the upper 30 address bits <31:2> are used and the low bits <1:0> are 00.]

One important feature of the multiple cycle implementation is that a functional unit can be used more than once per instruction as long as it is used on different clock cycles. The Ideal Memory is one such unit which our multiple cycle processor will use more than once per instruction. More specifically ... Unlike the single cycle processor, which has separate Instruction and Data memories, the Multiple Cycle Implementation will only have 1 memory unit where both instructions and data are stored.

In order to keep things conceptually simple, the Ideal Memory will have an independent Read port (Read Address and Data Out) and Write port (Write Address and Data In). That is, if we put an address on the Read Address inputs, the Data Out bus will be valid after the Memory Read Access Delay time. On the other hand, if you put an address on the Write Address inputs and THEN assert the Write Enable signal, the value on the Data In bus will be written into the memory location specified by the address after the Memory Write Access Delay Time.
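A behavioral sketch of this dual-port memory in Python may help fix the idea. It ignores the access delays entirely and models only what each port does. The class name, the word-aligned addressing check, and the choice to return 0 for unwritten locations are assumptions made for this example.

```python
class IdealMemory:
    """Behavioral model: independent read port and write port, no clock."""

    def __init__(self):
        self.words = {}  # word index (address bits <31:2>) -> 32-bit value

    @staticmethod
    def _word_index(byte_addr):
        # Assumption: accesses are word aligned, so address bits <1:0> are 00.
        if byte_addr & 0b11:
            raise ValueError("address must be word aligned")
        return byte_addr >> 2

    def read(self, r_adr):
        """Read port: Dout follows RAdr (unwritten words read as 0 here)."""
        return self.words.get(self._word_index(r_adr), 0)

    def write(self, wr_adr, d_in, mem_wr):
        """Write port: Din is stored at WrAdr only while MemWr is 1."""
        if mem_wr:
            self.words[self._word_index(wr_adr)] = d_in & 0xFFFFFFFF
```

Because the two ports take separate addresses, the same object can serve an instruction read from one location and a data write to another location in the same cycle.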

Instruction Fetch Cycle: In the Beginning
Every cycle begins right AFTER the clock tick:
  mem[PC]
  PC<31:0> + 4

[Figure: clock waveform marking One “Logic” Clock Cycle, with “You are here!” at the start of the cycle. Datapath: PC drives the Ideal Memory read address (RAdr) and the ALU; the ALU’s other input is the constant 4; memory Dout feeds the Instruction Reg. Control signals PCWr=?, MemWr=?, IRWr=?, ALUop=?.]

As far as LOGIC is concerned, I think the easiest way to think about a clock cycle is that a clock cycle begins right AFTER a clock tick and ends at the next clock tick. I have intentionally shown the L time to be much longer than the H time to emphasize a point: the H and L times do not affect your design as long as you use the simple clocking methodology where all storage elements are triggered at the same clock tick. The only important thing here is the time between the two clock ticks, the cycle time. Most of the time, however, the high and low times are the same because it is much easier to generate a clock whose high and low times have the same length.

Well, enough about clock ticks. Let’s see what happens at the beginning (You are Here) of the Ifetch cycle: (a) We need to fetch the instruction from Memory so we send the address to the memory. (b) We also need to update the PC so we had better send the address to the ALU as well.

***** What values do you think the control signals PCWr and ALUop have at this point? Well, since we are only at the beginning of the cycle (You are Here), these two signals will still have the old values from the last cycle of the previous instruction.

Instruction Fetch Cycle: The End
Every cycle ends AT the next clock tick (storage element updates):
  IR <-- mem[PC]
  PC<31:0> <-- PC<31:0> + 4

[Figure: clock waveform marking One “Logic” Clock Cycle, with “You are here!” at the end of the cycle. Same datapath as the previous slide, now with PCWr=1, MemWr=0, IRWr=1, ALUOp = Add.]

As time goes by, the output of the memory will become valid and the ALU, with ALUOp set to Add, will finish the 32-bit add. Hopefully, we are smart enough to set the cycle time so the time between the clock ticks is long enough to allow these (output of Memory and ALU) to stabilize. So at the end of the cycle, the clock tick will trigger the Instruction Register to save the current instruction word (output of the memory). Similarly, the Program Counter register is triggered (point to the clock input) to save the next instruction’s address (output of ALU).

Unlike the single cycle processor where a 30-bit PC can reduce the length of two adders by two bits, here we are using the 32-bit ALU to do the PC update anyway. So the only saving we could get from using a 30-bit PC is two register bits. That’s why we didn’t bother to do it and kept a 32-bit Program Counter.

The Memory Unit here is also used to store data and the ALU here is also used for instruction execution. Therefore, we know we will need some MUXes in front of them.

Instruction Fetch Cycle: Overall Picture
Ifetch state: 1: PCWr, IRWr; ALUOp=Add; x: PCWrCond, RegDst, Mem2R; Others: 0s
Datapath settings: PCWr=1, PCWrCond=x, PCSrc=0, BrWr=0, IorD=0, MemWr=0, IRWr=1, ALUSelA=0, ALUSelB=00, ALUOp=Add

[Figure: multiple cycle datapath at the end of the Instruction Fetch cycle: PC, PCSrc mux, Target register, IorD mux, Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, ALUSelA and ALUSelB muxes (busA, busB, constant 4), ALU (Zero), ALU Control.]

For example, the Memory can get its read address from the PC for instruction fetch but it can also get the read address from another part of the datapath for data fetch. Similarly, the ALU can get its operands from the PC and a constant 4 as I showed you on the last slide, but we know the ALU can also get its operands from the register file. We will fill in the details here (Hole) later but for now, we know we need to set the control signals IorD, MemWr, ALUSelA, ALUSelB to zeros and IRWr and PCWr to 1s.

Notice that I have added a MUX in the PC feedback path because we know for the branch instruction, the next PC will have a value OTHER than PC plus 4 (ALU inputs). We will worry about how we get this “other” value (Target) later. For this cycle, we have to set the MUX control (PCSrc) to zero to select the PC plus 4 value.

The settings of all the control signals are summarized in this circle. Due to space limitations, I have only shown the signals that have values other than zeros. I want to emphasize that this is the picture at the END of the Instruction Fetch Cycle where evaluation is completed and control signals are settled. This is the interesting part. The start of the cycle is boring by comparison. Consequently, all the datapath pictures I show from now on are the pictures at the end of a cycle.
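The Instruction Fetch cycle can be sketched as one function: a combinational phase during the cycle, then the register updates at the clock tick. This is a simplified model, not the lecture’s circuit. The state dictionary, the `memory` dictionary keyed by byte address, and the function name are hypothetical, and the mux selections are hard-coded to the Ifetch settings (IorD=0, ALUSelA=0, ALUSelB=00, ALUOp=Add, PCSrc=0).

```python
# Control settings for the Ifetch state, taken from the slide.
IFETCH_CONTROL = {
    "PCWr": 1, "IRWr": 1, "PCSrc": 0, "BrWr": 0, "IorD": 0,
    "MemWr": 0, "ALUSelA": 0, "ALUSelB": 0b00, "ALUOp": "Add",
}


def ifetch_cycle(state, memory, control=IFETCH_CONTROL):
    """Return the register state after the clock tick that ends Ifetch."""
    # During the cycle (combinational): memory read and ALU add settle.
    dout = memory[state["PC"]]                # IorD=0: read address is the PC
    alu_out = (state["PC"] + 4) & 0xFFFFFFFF  # ALU adds PC and the constant 4

    # At the clock tick: storage elements with write enable = 1 update.
    new_state = dict(state)
    if control["IRWr"]:
        new_state["IR"] = dout
    if control["PCWr"]:
        new_state["PC"] = alu_out             # PCSrc=0 selects the ALU output
    return new_state
```

Starting from `{"PC": 0x400, "IR": 0}`, one call leaves the fetched word in IR and 0x404 in PC, which is the `IR <-- mem[PC]; PC <-- PC + 4` transfer shown on the slide.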

Register Fetch / Instruction Decode
busA <- RegFile[rs] ; busB <- RegFile[rt] ; ALU is not being used: ALUctr = xx
Datapath settings: PCWr=0, PCWrCond=0, PCSrc=x, IorD=x, MemWr=0, IRWr=0, RegDst=x, RegWr=0, ALUSelA=x, ALUSelB=xx, ALUOp=xx

[Figure: multiple cycle datapath during Register Fetch / Instruction Decode: the Instruction Reg supplies Rs and Rt (5 bits each) to the Reg File ports Ra and Rb, and the Op (6), Func (6), and Imm (16) fields go to the Control.]

Now that we have the instruction word saved in the IR, the next thing we can do is decode the instruction (Go to the Control) and fetch the registers from the Register file (Rs, Rt). I want to point out that at this point, we do not know what instruction we have yet because we are still in the process of decoding the Op and Func fields. Therefore we are “jumping the gun” in fetching the registers Rs and Rt from the register file. The Rt field may not even be a source register if we have an I-type instruction. But this is OK because if, after we decode the instruction, we realize we don’t need the registers, we just don’t use them. No big deal.

Notice that the ALU is not being used in this cycle. That is not good. Instead of just letting the ALU sit idle, we may just as well let it do something.

***** Can we think of any way we can use this ALU in this cycle? We cannot use the ALU to do anything that involves the registers because we are still in the process of reading them; we do not have the register values yet.

Register Fetch / Instruction Decode (Continue)
Rfetch/Decode state: 1: BrWr, ExtOp; ALUOp=Add; ALUSelB=10; x: RegDst, PCSrc, IorD, MemtoReg; Others: 0s
busA <- Reg[rs] ; busB <- Reg[rt] ; Target <- PC + SignExt(Imm16)*4
Datapath settings: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=1, IorD=x, MemWr=0, IRWr=0, RegDst=x, RegWr=0, ALUSelA=0, ALUSelB=10, ALUOp=Add, ExtOp=1

[Figure: multiple cycle datapath at the end of Register Fetch / Instruction Decode: the Imm field goes through Extend and << 2 into the ALUSelB mux; the ALU output is written to the Target register; the Control decodes Op and Func and selects the next state (Beq, Rtype, Ori, Memory).]

What we can do is use this ALU to calculate the branch address in advance. (1) We will set ALUSelA to 0 such that the PC is fed to the ALU input. (2) The other ALU input will come from (ALUSelB=10) the Sign Extended (ExtOp=1) version of the 16-bit immediate field. Once we have added (ALUOp = Add) these two numbers together, we will save the result in the Target register (BrWr = 1).

We cannot write the ALU output to the PC yet (PCWr = 0). The Op and Func fields are still being decoded in this cycle (Control) so we cannot update the PC to this value unless we are SURE we have a branch and the branch condition is met (AND-OR).

Once again, I have summarized all the control signal settings inside this circle. So far this and the Instruction Fetch cycles are shared by all instructions. But by the end of this cycle, we will know exactly what instruction we have (Control output). Let’s say we have a branch; what do we do?
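The transfer `Target <- PC + SignExt(Imm16)*4` can be written out as a short Python sketch. The function names are hypothetical, and the 32-bit wraparound mask is an assumption about how the adder behaves. Note that the PC used here has already been incremented during Instruction Fetch.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's complement signed value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """Target <- PC + SignExt(Imm16) * 4, wrapped to 32 bits."""
    return (pc + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
```

For example, with `pc = 0x1004`, an immediate of 3 gives 0x1010, and an immediate of 0xFFFF (that is, -1) gives 0x1000.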

Branch Completion
BrComplete state: 1: PCWrCond, ALUSelA, PCSrc; ALUOp=Sub; ALUSelB=01; x: IorD, Mem2Reg, RegDst, ExtOp
if (busA == busB) PC <- Target
Datapath settings: PCWr=0, PCWrCond=1, PCSrc=1, BrWr=0, IorD=x, MemWr=0, IRWr=0, RegDst=x, RegWr=0, ALUSelA=1, ALUSelB=01, ALUOp=Sub, ExtOp=x

[Figure: multiple cycle datapath at the end of the Branch Completion cycle: busA and busB feed the ALU, and the Zero output together with PCWrCond gates the write of Target into the PC.]

We already have the values of registers Rs and Rt on busA and busB from the last cycle; all we have to do is perform a Subtract (ALUOp) to compare them (ALUSelA, B). If they are equal, the ALU’s Zero output will be asserted, and with PCSrc and PCWrCond set to one, the Branch Target will get written into the Program Counter. The Branch is taken. If Rs and Rt are not equal, Zero will not be asserted and the Target value will NOT be written into the Program Counter. That is, the Branch is NOT taken.

Since I am running out of space in this circle, I did not say it explicitly, but all control signals not specified in this circle default to zeros (point to the datapath for examples).
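A minimal Python sketch of the Branch Completion step follows. It assumes the PC write is gated as `PCWr OR (PCWrCond AND Zero)`, which is the AND-OR structure the notes allude to but do not spell out. The function and parameter names are hypothetical.

```python
def branch_completion(pc, target, bus_a, bus_b, pc_wr=0, pc_wr_cond=1):
    """Return the PC value after the clock tick that ends BrComplete."""
    # ALUOp=Sub: the ALU subtracts busB from busA; Zero flags equality.
    zero = int(((bus_a - bus_b) & 0xFFFFFFFF) == 0)

    # Assumed AND-OR gating of the PC write enable.
    write_pc = bool(pc_wr or (pc_wr_cond and zero))

    # PCSrc=1 selects the Target register as the new PC value.
    return target if write_pc else pc
```

When the two register values match, the function returns `target` (branch taken); otherwise the PC keeps the PC + 4 value written during Instruction Fetch.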

Instruction Decode: We have a R-type!
Next Cycle: R-type Execution
Datapath settings (end of Rfetch/Decode): PCWr=0, PCWrCond=0, PCSrc=x, BrWr=1, IorD=x, MemWr=0, IRWr=0, RegDst=x, RegWr=0, ALUSelA=0, ALUSelB=10, ALUOp=Add, ExtOp=1

[Figure: multiple cycle datapath at the end of Register Fetch / Instruction Decode, with the Control’s next-state choices Beq, Rtype, Ori, Memory; Rtype is selected.]

Let’s go back to the end of the Register Fetch slash Instruction Decode Cycle. Assume the result of the Instruction Decode indicates we have an R-type instruction; what do we do then? Well, simple enough: we just go to the R-type execution cycle.

R-type Execution
RExec state: 1: RegDst, ALUSelA; ALUOp=Rtype; ALUSelB=01; x: PCSrc, IorD, MemtoReg, ExtOp
ALU Output <- busA op busB
Datapath settings: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=x, MemWr=0, IRWr=0, RegDst=1, RegWr=0, ALUSelA=1, ALUSelB=01, ALUOp=Rtype, ExtOp=x, MemtoReg=x

[Figure: multiple cycle datapath at the end of the R-type Execution cycle: busA and busB feed the ALU; the RegDst mux selects Rd as the Reg File’s Rw input.]

Once again, fetching the registers Rs and Rt in the previous cycle pays off. We need these two registers now and they are already on busA and busB, respectively. So all we need to do is set ALUSelA and ALUSelB to feed busA and busB into the ALU and tell the ALU local control we have an R-type instruction (ALUOp). The ALU will then generate the correct result (ALU output) at the end of this cycle.

Notice that I have set RegDst to 1 here even though we are not writing the register file (RegWr is zero). The register file is not written until the next cycle. You would think RegDst should be don’t care at this point.

****** Anybody want to guess why I set RegDst to 1 at this point? Remember: for this real memory and register file that do not have a clock input, the address (Rw) MUST be stable BEFORE we set Write Enable (RegWr) to one. Here, by setting RegDst to one, I can guarantee the Rw specifier will be stable by the next clock cycle, where I will perform the write by setting RegWr to 1.

R-type Completion
Rfinish state: 1: RegDst, RegWr, ALUSelA; ALUOp=Rtype; ALUSelB=01; x: IorD, PCSrc, ExtOp
R[rd] <- ALU Output
Datapath settings: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=x, MemWr=0, IRWr=0, RegDst=1, RegWr=1, ALUSelA=1, ALUSelB=01, ALUOp=Rtype, ExtOp=x, MemtoReg=0

[Figure: multiple cycle datapath at the end of the R-type Completion cycle: the ALU output passes through the MemtoReg mux onto busW and is written into the Reg File at Rw = Rd.]

So here is the picture where we finish off the R-type instruction by writing the ALU output back to the register file (MemtoReg=0 and RegWr = 1). Notice that in order to keep the ALU output from changing, the ALUSelA, ALUSelB, and ALUOp control signals must remain the same as in the previous cycle. This brings us to a side topic I want to cover.
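The two R-type cycles can be sketched as two small Python functions: one for the ALU computation in R-type Execution and one for the write-back in R-type Completion. This is an illustrative model only. The funct codes are the standard MIPS encodings for add, sub, and, and or; the register file is a plain dictionary; the function names are hypothetical; details such as register 0 being hardwired to zero are ignored.

```python
# Standard MIPS funct codes for a few R-type operations.
ALU_FUNCS = {
    0x20: lambda a, b: a + b,  # add
    0x22: lambda a, b: a - b,  # sub
    0x24: lambda a, b: a & b,  # and
    0x25: lambda a, b: a | b,  # or
}


def rtype_execute(regs, rs, rt, funct):
    """RExec: ALU Output <- busA op busB (RegWr stays 0 in this cycle)."""
    bus_a, bus_b = regs[rs], regs[rt]
    return ALU_FUNCS[funct](bus_a, bus_b) & 0xFFFFFFFF


def rtype_finish(regs, rd, alu_output, reg_wr=1):
    """Rfinish: R[rd] <- ALU Output, performed only when RegWr is 1."""
    if reg_wr:
        regs[rd] = alu_output
```

Usage mirrors the state sequence: call `rtype_execute` first, keep its result unchanged, then pass it to `rtype_finish` in the following cycle.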

A Multiple Cycle Delay Path
There is no register to save the results between:
  Register Fetch: busA <- Reg[rs] ; busB <- Reg[rt]
  R-type Execution: ALU output <- busA op busB
  R-type Completion: Reg[rd] <- ALU output

[Figure: Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), ALUSelA and ALUSelB muxes, ALU (Zero), ALU Control (ALUOp), with two callouts: “Register here to save outputs of Rfetch?” on busA and busB, and “Register here to save outputs of RExec?” at the ALU output.]

The side topic I want to cover in the next few minutes is called the Multiple Cycle Delay Path. Looking back at the Register Fetch, R-type Execution, and R-type Completion cycles, you will notice that there are no registers in between the cycles to save the results. That is, at the end of the Register Fetch cycle, registers Rs and Rt are placed on busA and busB but we do not have registers on busA and busB to save them. They just sit on the bus and wait until the next cycle, the R-type Execution cycle, when they propagate through the MUXes and into the ALU inputs. By the end of the R-type Execution cycle, the ALU output is valid and once again, the value is not saved in a register. It just sits there and waits until the R-type Completion cycle begins.

Registers are not needed to save the values of busA and busB because IRWr is zero, so busA and busB will not change after the Register Fetch cycle. Similarly, a register is not needed to save the ALU output at the end of the R-type Execution cycle because: (1) We already established that busA and busB will not change their values. (2) With the control signals ALUSelA, ALUSelB, and ALUOp remaining constant, (3) the ALU output will not change its value during the R-type Completion stage.

A Multiple Cycle Delay Path (Continue)
Register is NOT needed to save the outputs of Register Fetch:
  IRWr = 0: busA and busB will not change after Register Fetch
Register is NOT needed to save the outputs of R-type Execution:
  busA and busB will not change after Register Fetch
  Control signals ALUSelA, ALUSelB, and ALUOp will not change after R-type Execution
  Consequently ALU output will not change after R-type Execution
In theory (P. 316, P&H), you need a register to hold a signal value if:
  (1) The signal is computed in one clock cycle and used in another.
  (2) AND the inputs to the functional block that computes this signal can change before the signal is written into a state element.
You can save a register if Cond 1 is true BUT Cond 2 is false:
  But in practice, this will introduce a multiple cycle delay path:
    A logic delay path that takes multiple cycles to propagate from one storage element to the next storage element

In theory, you ONLY need a register to hold a signal value if: (1) The signal is computed in one clock cycle and used in another. (2) AND the inputs to the functional block that computes this signal can change before this signal is written into a state element. In other words, you do not need a register even if Cond 1 is true as long as Cond 2 is not true. That is, as long as the input to the functional element that computes the signal does not change, we do not need a register to save its output. This is the case in our example here.

However, in practice, if you do not use a register to save a signal’s value when Cond 1 is true, you will introduce a multiple cycle delay path into your design. By definition, a Multiple Cycle Delay Path is a COMBINATIONAL logic delay path that takes multiple cycles to propagate from one storage element to the next storage element. Let me show you what I mean by this.

Pros and Cons of a Multiple Cycle Delay Path
A 3-cycle path example:
  IR (storage) -> Reg File Read -> ALU -> Reg File Write (storage)
Advantages:
  Register savings
  We can share time among cycles:
    If the ALU takes longer than one cycle, still “a OK” as long as the entire path takes less than 3 cycles to finish

[Figure: Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), ALUSelA and ALUSelB muxes, ALU (Zero), ALU Control, with the ALU output fed back to the Reg File write port.]

For example, the datapath activities during the Register Fetch (Reg File), R-type Execution (ALU), and R-type Completion (Feedback) cycles form a 3-cycle delay path. The 2 storage elements in this path are the Instruction Register and the Register File’s WRITE port. Remember, the Register File’s Read port acts like a combinational logic path. So in three cycles, we need to propagate the signals from the Instruction Register, through the Register File’s Read port, through the ALU, and finally into the Register File’s Write port.

So what are the advantages of a multiple cycle delay path? Well, the obvious one is register savings. Here we save 32 register bits at busA, 32 register bits at busB, and another 32 register bits at the ALU output: a total saving of 96 register bits. But register bits are cheap. Another advantage of a multiple cycle delay path is that it allows us to share time among different cycles. For example, here, if we have a slow ALU that has a delay longer than one clock cycle, we will still meet the timing requirement AS LONG AS the total delay is less than three cycles.
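The time-sharing argument can be checked with a few lines of Python. This is a sketch under assumed numbers: the stage delays and the 10 ns cycle time are hypothetical, and the function name is made up for this example.

```python
def meets_timing(stage_delays_ns, cycle_time_ns, allowed_cycles):
    """True if the whole combinational path fits in the allowed cycles."""
    return sum(stage_delays_ns) < allowed_cycles * cycle_time_ns


# Hypothetical delays: Reg File read, a slow ALU, Reg File write.
PATH_DELAYS_NS = [4, 12, 5]
CYCLE_TIME_NS = 10

# Treated as a 3-cycle path, the slow ALU is tolerated (21 ns < 30 ns).
ok_as_three_cycle_path = meets_timing(PATH_DELAYS_NS, CYCLE_TIME_NS, 3)

# Checked against a single cycle, the same path fails (21 ns >= 10 ns).
ok_as_single_cycle_path = meets_timing(PATH_DELAYS_NS, CYCLE_TIME_NS, 1)
```

The ALU alone (12 ns) exceeds one cycle, yet the path as a whole still fits in three cycles. The single-cycle check is the one a tool would apply if it only knew the delay between two storage elements.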

Pros and Cons of a Multiple Cycle Delay Path (Continued)
Disadvantage: A static timing analyzer, which ONLY looks at the delay between two storage elements, will report this as a timing violation. You have to ignore the static timing analyzer’s warnings.
[Figure: datapath fragment with the Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), ALU, ALU Control, and the ALUSelB mux.]
The main disadvantage of having multiple cycle delay paths in your design is that static timing analyzers will report them as timing violations. A static timing analyzer is a CAD tool that looks at every combinational logic delay path between any two storage elements and reports the delay of that path. Consequently, the static timing analyzer will report this path as a violation because it will think, and I must say correctly, that this path takes three times the cycle time to finish. It is OK if there is ONLY one multiple cycle delay path in your design, because all you have to do is look at the timing analyzer output, say “YEAH, I know this takes 3 times the cycle time to finish, but it is OK,” and ignore the timing analyzer's warning. However, if you have hundreds of them in your design, then you have to go through the timing analyzer's violation report one by one and decide whether to ignore each violation on a case-by-case basis. Needless to say, this is a very tedious and error-prone task, and you may end up ignoring some real timing violations that you mistake for legitimate multiple cycle delay paths. Therefore I try to avoid multiple cycle delay paths in my designs as much as possible by using a register to save a signal’s value whenever I generate the signal in one cycle and do not use it until a later cycle. For example, here I would put in registers to save the values of busA and busB between the Register Fetch and R-type Execution cycles, and a register to save the value of the ALU output between the R-type Execution and R-type Completion cycles. Because of the limited space on each slide, and because I like to keep this datapath as similar as possible to the one in your textbook, I will not put these registers in the lecture slides.
However, I do recommend that you avoid multiple cycle delay paths in your design work. +3 = 50 min. (Y:30)
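To make the timing analyzer's complaint concrete, here is a minimal Python sketch of the check described above. It is not a real CAD tool: the function name check_paths, the path names, and the delay numbers are all hypothetical and chosen only for illustration. Each path between two storage elements is compared against the clock period, unless the designer has explicitly declared how many cycles the path is allowed to take.

```python
def check_paths(paths, clock_period_ns, multicycle=None):
    """Toy static timing check (illustrative only, not a real tool).

    paths: dict mapping a path name to its combinational delay in ns.
    multicycle: optional dict mapping a path name to the number of
        clock cycles the designer says that path may take.
    Returns a list of (name, delay_ns, allowed_cycles, ok) tuples.
    """
    multicycle = multicycle or {}
    report = []
    for name, delay_ns in paths.items():
        allowed_cycles = multicycle.get(name, 1)  # default: one cycle
        ok = delay_ns <= allowed_cycles * clock_period_ns
        report.append((name, delay_ns, allowed_cycles, ok))
    return report


# Hypothetical delays: one ordinary path and one three-cycle path.
paths = {
    "PC -> Instruction Reg": 9.0,
    "Instruction Reg -> Reg File write port": 27.0,
}

# Without any exception, the long path is flagged as a violation.
for row in check_paths(paths, clock_period_ns=10.0):
    print(row)

# Declaring the path as a three-cycle path silences the warning, but
# every such declaration has to be reviewed by hand.
exceptions = {"Instruction Reg -> Reg File write port": 3}
for row in check_paths(paths, clock_period_ns=10.0, multicycle=exceptions):
    print(row)
```

The second call shows why hundreds of such exceptions are risky: a real violation that is wrongly listed as an exception would pass the check just as quietly.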

Instruction Decode: We have an Ori!
Next Cycle: Ori Execution
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=1, IorD=x, MemWr=0, IRWr=0, RegDst=x, RegWr=0, ALUSelA=0, ALUSelB=10, ALUOp=Add, ExtOp=1
[Figure: multiple cycle datapath with the Control block, driven by the Op and Func fields, selecting among Beq, Rtype, Ori, and Memory.]
Let’s go back to the end of the Register Fetch slash Instruction Decode cycle. Assume the result of the Instruction Decode indicates we have an OR immediate instruction. What do we do then? Well, we go to the OR immediate execution cycle. +1 = 56 min. (Y:36)

Ori Execution
OriExec: ALU output <- busA or ZeroExt[Imm16]
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=x, MemWr=0, IRWr=0, RegDst=0, RegWr=0, ALUSelA=1, ALUSelB=11, ALUOp=Or, ExtOp=0, MemtoReg=x
The first operand of OR immediate comes from register Rs. It is already on busA, so we just set ALUSelA to 1. The second operand, on the other hand, does NOT come from Rt. It comes from the zero-extended (ExtOp = 0) version of the immediate field (ALUSelB = 11). Once we have the operands, all we have to do is ask the ALU to OR them together (ALUOp = Or), and the ALU output will have the correct result at the end of this cycle. Notice that I have set RegDst to zero so the Rt field of the instruction word will be stable at the Register File’s Rw address port before the next cycle. What do we do in the next cycle? +2 = 58 min. (Y:38)
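The OriExec step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the lecture: the helper names extend and ori_exec are made up here, and 32-bit values are modeled as non-negative Python integers.

```python
MASK32 = 0xFFFFFFFF


def extend(imm16, ext_op):
    """Model the Extender: ExtOp=0 zero-extends, ExtOp=1 sign-extends.

    Returns the 32-bit pattern as a non-negative integer.
    """
    imm16 &= 0xFFFF
    if ext_op == 1 and (imm16 & 0x8000):
        return imm16 | 0xFFFF0000  # copy the sign bit into the upper half
    return imm16


def ori_exec(bus_a, imm16):
    """ALU output <- busA or ZeroExt[Imm16] (ExtOp=0, ALUOp=Or)."""
    return (bus_a | extend(imm16, ext_op=0)) & MASK32


# With the top bit of the immediate set, the two extension modes differ.
print(hex(extend(0x8001, ext_op=0)))
print(hex(extend(0x8001, ext_op=1)))
print(hex(ori_exec(0x12340000, 0x8001)))
```

If ExtOp were left at 1 for this instruction, the upper 16 bits of busA would be overwritten with ones whenever the immediate's top bit is set, which is why the state sets ExtOp to 0.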

Ori Completion
OriFinish: Reg[rt] <- ALU output
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=x, MemWr=0, IRWr=0, RegDst=0, RegWr=1, ALUSelA=1, ALUSelB=11, ALUOp=Or, ExtOp=0, MemtoReg=0
Well, we do a register write (RegWr = 1). Once again, I set up the register write address Rw in advance (RegDst = 0) during the previous cycle, so I can guarantee Rw is stable when I assert RegWr in this cycle. Also, remember we have a multiple cycle delay path from the Instruction Register to the Register File write port. Therefore, IRWr must be 0, and ALUSelA, ALUSelB, and ALUOp must remain the same as in the previous cycle in order to guarantee that the ALU output is stable during the register write. +1 = 59 min. (Y:39)

Memory Address Calculation
AdrCal: ALU output <- busA + SignExt[Imm16]
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=x, MemWr=0, IRWr=0, RegDst=x, RegWr=0, ALUSelA=1, ALUSelB=11, ALUOp=Add, ExtOp=1, MemtoReg=x
How do we calculate the memory address? Simple: we add the contents of register Rs (busA) to the sign-extended (ExtOp = 1) version of the immediate field (ALUSelB = 11). With ALUOp set to Add, the memory address will be valid at the ALU output by the end of this cycle. Let’s say we have a store instruction and see what happens next. +1 = 61 min. (Y:41)
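As a rough illustration of the AdrCal step, the following Python sketch computes busA + SignExt[Imm16] with 32-bit wraparound. The function names sign_ext16 and adr_cal and the register values are assumptions made for this example only.

```python
MASK32 = 0xFFFFFFFF


def sign_ext16(imm16):
    """Interpret a 16-bit field as a signed offset (ExtOp=1)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def adr_cal(bus_a, imm16):
    """ALU output <- busA + SignExt[Imm16], truncated to 32 bits."""
    return (bus_a + sign_ext16(imm16)) & MASK32


# A positive offset moves forward from the base address in Rs,
# a negative offset (0xFFFC is -4) moves backward.
print(hex(adr_cal(0x1000, 0x0008)))
print(hex(adr_cal(0x1000, 0xFFFC)))
```

Sign extension is what lets a 16-bit immediate express a negative displacement from the base register.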

Memory Access for Store
SWmem: mem[ALU output] <- busB
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=x, MemWr=1, IRWr=0, RegDst=x, RegWr=0, ALUSelA=1, ALUSelB=11, ALUOp=Add, ExtOp=1, MemtoReg=x
Well, the address is already set up at the Memory’s write address port. The data is also already available at the Memory’s data port via busB. Therefore, all we have to do is set MemWr to 1. Notice that it is very important to keep ALUSelA, ALUSelB, and ALUOp the same as in the previous cycle, the Memory Address Calculation cycle. If any of these control signals changes during the memory write, the address will also change, because we do not have a register to save the ALU output. Any change in the address during this cycle, with MemWr = 1, will have catastrophic results: we will end up destroying data stored in memory by writing to the wrong address. +2 = 63 min. (Y:43)
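The danger described above can be shown with a small Python sketch. It models the store cycle with the address computed combinationally from the ALU controls, as in the lecture's datapath without an ALU output register. The names alu and store_cycle, the dictionary used as memory, and the operand values are all hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_ext16(imm16):
    """Interpret a 16-bit field as a signed offset (ExtOp=1)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def alu(op, a, b):
    """Tiny ALU model supporting only the two operations needed here."""
    if op == "add":
        return (a + b) & MASK32
    if op == "or":
        return (a | b) & MASK32
    raise ValueError("unsupported ALU operation: " + op)


def store_cycle(mem, bus_a, bus_b, imm16, alu_op):
    """mem[ALU output] <- busB, with the address taken straight from
    the ALU (no register holds it), so it follows alu_op directly."""
    addr = alu(alu_op, bus_a, sign_ext16(imm16))
    mem[addr] = bus_b
    return addr


# Correct store: ALUOp is still Add, as in the AdrCal cycle.
good = {}
print(hex(store_cycle(good, 0x1004, 0xABCD, 0x000C, "add")))

# Same store, but ALUOp was (wrongly) changed during the write cycle:
# the data lands at a different address.
bad = {}
print(hex(store_cycle(bad, 0x1004, 0xABCD, 0x000C, "or")))
```

A register that latches the ALU output at the end of AdrCal would make the write address independent of the ALU controls during the write cycle, which is the fix the lecture recommends for real designs.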

Memory Access for Load
LWmem: Mem Dout <- mem[ALU output]
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=1, MemWr=0, IRWr=0, RegDst=0, RegWr=0, ALUSelA=1, ALUSelB=11, ALUOp=Add, ExtOp=1, MemtoReg=x
If, after the Memory Address Calculation cycle, we realize we have a load, we enter the Load Memory Access cycle. All we have to do is set the control signal IorD to 1; then, after the memory read access delay, the data we want will be available at the output of the Ideal Memory (Dout). Once again, we need to set RegDst to zero in this cycle so Rt will be stable at the Register File’s write address port (Rw) before the next cycle. +2 = 45 min. (Y:45)

Write Back for Load
LWwr: Reg[rt] <- Mem Dout
Control signals: PCWr=0, PCWrCond=0, PCSrc=x, BrWr=0, IorD=1, MemWr=0, IRWr=0, RegDst=0, RegWr=1, ALUSelA=1, ALUSelB=11, ALUOp=Add, ExtOp=1, MemtoReg=1
In this cycle, the Write Back cycle, we write the data from memory (MemtoReg = 1) into the register specified by the Rt field of the instruction (RegWr = 1). That is why RegDst was set to zero in the previous cycle. +1 = 66 min. (Y:46)
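The two muxes that feed the Register File’s write port can be modeled in Python as below. This is a sketch under stated assumptions: the function write_back and its parameter names are invented for illustration, RegDst=0 is taken to select Rt and RegDst=1 to select Rd, and MemtoReg=1 is taken to select the memory output, as the control settings on these slides indicate.

```python
def write_back(regs, rt, rd, alu_out, mem_dout, reg_dst, mem_to_reg, reg_wr):
    """Model the register write port.

    reg_dst selects the write address Rw (0: Rt, 1: Rd).
    mem_to_reg selects the write data busW (0: ALU output, 1: Mem Dout).
    The register file is only updated when reg_wr is 1.
    Returns the selected (rw, bus_w) pair.
    """
    rw = rd if reg_dst else rt
    bus_w = mem_dout if mem_to_reg else alu_out
    if reg_wr:
        regs[rw] = bus_w
    return rw, bus_w


regs = [0] * 32

# LWwr: RegDst=0, MemtoReg=1, RegWr=1 writes the loaded word into Rt.
write_back(regs, rt=9, rd=0, alu_out=0x1008, mem_dout=0xDEADBEEF,
           reg_dst=0, mem_to_reg=1, reg_wr=1)
print(hex(regs[9]))

# OriFinish: RegDst=0, MemtoReg=0, RegWr=1 writes the ALU output into Rt.
write_back(regs, rt=10, rd=0, alu_out=0x12348001, mem_dout=0,
           reg_dst=0, mem_to_reg=0, reg_wr=1)
print(hex(regs[10]))
```

Setting RegDst one cycle early, as the notes describe, corresponds to making sure rw has settled before reg_wr is asserted.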

Putting it all together: Multiple Cycle Datapath
Control signals: PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg
[Figure: the complete multiple cycle datapath with the PC, Target register, Ideal Memory, Instruction Reg, Reg File, Extender, ALU, and ALU Control.]
Putting it all together, here it is: the multiple cycle datapath we set out to build. +1 = 47 min. (Y:47)

Putting it all together: Control State Diagram
Ifetch: 1: PCWr, IRWr; ALUOp=Add; x: PCWrCond, RegDst, MemtoReg; Others: 0s. Next: Rfetch/Decode.
Rfetch/Decode: 1: BrWr, ExtOp; ALUOp=Add; ALUSelB=10; x: RegDst, PCSrc, IorD, MemtoReg; Others: 0s. Next: BrComplete (beq), RExec (Rtype), OriExec (Ori), AdrCal (lw or sw).
BrComplete: 1: PCWrCond, ALUSelA, PCSrc; ALUOp=Sub; ALUSelB=01; x: IorD, MemtoReg, RegDst, ExtOp.
RExec: 1: RegDst, ALUSelA; ALUOp=Rtype; ALUSelB=01; x: PCSrc, IorD, MemtoReg, ExtOp. Next: Rfinish.
Rfinish: 1: RegDst, RegWr, ALUSelA; ALUOp=Rtype; ALUSelB=01; x: IorD, PCSrc, ExtOp.
OriExec: 1: ALUSelA; ALUOp=Or; ALUSelB=11; x: MemtoReg, IorD, PCSrc. Next: OriFinish.
OriFinish: 1: ALUSelA, RegWr; ALUOp=Or; ALUSelB=11; x: IorD, PCSrc.
AdrCal: 1: ExtOp, ALUSelA; ALUOp=Add; ALUSelB=11; x: MemtoReg, PCSrc. Next: LWmem (lw), SWmem (sw).
SWmem: 1: ExtOp, MemWr, ALUSelA; ALUOp=Add; ALUSelB=11; x: PCSrc, RegDst, MemtoReg.
LWmem: 1: ExtOp, ALUSelA, IorD; ALUOp=Add; ALUSelB=11; x: MemtoReg, PCSrc. Next: LWwr.
LWwr: 1: ALUSelA, MemtoReg, RegWr, ExtOp, IorD; ALUOp=Add; ALUSelB=11; x: PCSrc.
Well, we pretty much concentrated on the multiple cycle datapath today. But if you think about it, by summarizing all the control signals in circles along the way, we have pretty much specified the control in a state diagram. All instructions start out at the Instruction Fetch cycle and continue to the Instruction Decode slash Register Fetch cycle. Once the instruction is decoded, we either go to the Branch Complete cycle to complete the branch or go to one of the following: (1) R-type execution or OR immediate execution, for R-type or OR immediate instructions. (2) The memory address calculation cycle, for load and store instructions. The rest is pretty straightforward.
Before I go any further, I would like to tell you a short story. Once there was a famous professor who traveled all over the country to give talks on a very special technical topic. He was afraid of flying, so he hired a driver to drive him around. Whenever the professor was giving the talk, his driver would sit at the back of the lecture hall listening. After several years, the driver told the professor: “Hey, you know what, I bet I can give the same talk as well as you do.” The professor was doubtful, but he also wanted to teach his “arrogant” driver a lesson, so he said: “OK, since the people in the town we are going to visit next have never met us before, we can try this: You pretend to be me and give the lecture, and I will just sit at the back and pretend to be your driver.” +5 = 75 min. (Y:55)
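Returning to the control state diagram on this slide, its transitions can be sketched as a small next-state function in Python. This is a sketch, not the lecture's control implementation: the instruction class names ("rtype", "ori", "lw", "sw", "beq") and the helper names are chosen for this example, and it assumes that every final state returns to Ifetch for the next instruction.

```python
# Where Rfetch/Decode goes for each instruction class.
DECODE = {
    "beq": "BrComplete",
    "rtype": "RExec",
    "ori": "OriExec",
    "lw": "AdrCal",
    "sw": "AdrCal",
}

# States with exactly one successor inside an instruction.
FOLLOW = {"RExec": "Rfinish", "OriExec": "OriFinish", "LWmem": "LWwr"}


def next_state(state, op):
    """Return the state that follows `state` for instruction class `op`."""
    if state == "Ifetch":
        return "Rfetch/Decode"
    if state == "Rfetch/Decode":
        return DECODE[op]
    if state == "AdrCal":
        return "LWmem" if op == "lw" else "SWmem"
    # Final states (assumed) go back to fetch the next instruction.
    return FOLLOW.get(state, "Ifetch")


def trace(op):
    """List the states one instruction passes through, one per cycle."""
    states = ["Ifetch"]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == "Ifetch":
            return states
        states.append(nxt)


for op in ("beq", "rtype", "ori", "sw", "lw"):
    path = trace(op)
    print(op, len(path), "cycles:", " -> ".join(path))
```

Counting the states in each trace gives the number of clock cycles per instruction class under this diagram: the branch finishes soonest and the load takes the most cycles.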

Summary
Disadvantages of the Single Cycle Processor: Long cycle time. Cycle time is too long for all instructions except the Load.
Multiple Cycle Processor: Divide the instructions into smaller steps. Execute each step (instead of the entire instruction) in one cycle.
Do NOT confuse Multiple Cycle Processor with Multiple Cycle Delay Path. Multiple Cycle Processor: executes each instruction in multiple clock cycles. Multiple Cycle Delay Path: a combinational logic path between two storage elements that takes more than one clock cycle to complete.
It is possible (desirable) to build a MC Processor without MCDP: Use a register to save a signal’s value whenever a signal is generated in one clock cycle and used in another cycle later.
Let me summarize what we learned today. First we looked at the single cycle processor we built and pointed out its two disadvantages: (1) First of all, it has a long cycle time. (2) And maybe more importantly, this long cycle time is too long for all instructions except ... I then showed you how to design a multiple cycle processor by: (1) Dividing the instructions into smaller steps. (2) Then executing each step (instead of the entire instruction) in one clock cycle. We also covered a side topic called the multiple cycle delay path. Do NOT confuse the multiple cycle processor with the multiple cycle delay path. They are two different things: (1) A multiple cycle processor executes each instruction in multiple clock cycles. (2) A multiple cycle delay path refers to a combinational logic delay path between two storage elements that takes more than one clock cycle to complete. It is possible, actually it is desirable, to build a multiple cycle processor that does NOT have any multiple cycle delay paths in it. All you have to do is use a register to save a signal’s value whenever a signal is generated in one clock cycle and used in a later cycle or cycles. +3 = 70 min. (Y:50)
