1
Single Cycle datapath
2
How to Design a Processor: step-by-step
1. Analyze the instruction set => datapath requirements
   The meaning of each instruction is given by its register transfers.
   The datapath must include storage elements for the ISA registers (possibly more).
   The datapath must support each register transfer.
2. Select a set of datapath components and establish a clocking methodology.
3. Assemble a datapath that meets the requirements.
4. Analyze the implementation of each instruction to determine the setting of the control points that effect the register transfer.
5. Assemble the control logic.
3
The MIPS Instruction Formats
All MIPS instructions are 32 bits long. The three instruction formats:

R-type:  op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | rd [15:11], 5 bits | shamt [10:6], 5 bits | funct [5:0], 6 bits
I-type:  op [31:26], 6 bits | rs [25:21], 5 bits | rt [20:16], 5 bits | immediate [15:0], 16 bits
J-type:  op [31:26], 6 bits | target address [25:0], 26 bits

The different fields are:
op: operation of the instruction
rs, rt, rd: the source and destination register specifiers
shamt: shift amount
funct: selects the variant of the operation in the "op" field
address / immediate: address offset or immediate value
target address: target address of the jump instruction

One of the most important things you need to know before you start designing a processor is what the instructions look like. In more technical terms, you need to know the instruction format. One good thing about the MIPS instruction set is that it is very simple. All MIPS instructions are 32 bits long and there are only three instruction formats: (a) R-type, (b) I-type, and (c) J-type. The fields of the R-type instructions are: (a) Op specifies the operation of the instruction. (b) Rs, Rt, and Rd are the source and destination register specifiers. (c) Shamt specifies the shift amount for the shift instructions. (d) Funct selects the variant of the operation specified in the "op" field. For the I-type instructions, bits 0 to 15 are used as an immediate field. I will show you how this immediate field is used differently by different instructions. Finally, for the J-type instructions, bits 0 to 25 become the target address of the jump.
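The field layout above can be sketched as a small decoder. This is only an illustration: the function name `decode` and the dictionary keys are assumptions, and the function extracts every field of every format without checking which format the opcode actually uses.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its named fields."""
    return {
        "op":     (word >> 26) & 0x3F,       # bits 31-26, all formats
        "rs":     (word >> 21) & 0x1F,       # bits 25-21, R-type and I-type
        "rt":     (word >> 16) & 0x1F,       # bits 20-16, R-type and I-type
        "rd":     (word >> 11) & 0x1F,       # bits 15-11, R-type only
        "shamt":  (word >> 6) & 0x1F,        # bits 10-6,  R-type only
        "funct":  word & 0x3F,               # bits 5-0,   R-type only
        "imm16":  word & 0xFFFF,             # bits 15-0,  I-type only
        "target": word & 0x3FFFFFF,          # bits 25-0,  J-type only
    }

# Example word: addu $3, $1, $2 (op = 0, rs = 1, rt = 2, rd = 3, funct = 0x21)
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x21
fields = decode(word)
```

Because the formats overlap, the same bits show up under several keys; for an R-type word the `imm16` value is simply the rd, shamt, and funct bits read together.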
4
Step 1a: The MIPS-lite Subset for today
ADD and SUB
  addU rd, rs, rt
  subU rd, rs, rt
  R-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
OR Immediate
  ori rt, rs, imm16
  I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
LOAD and STORE Word
  lw rt, rs, imm16
  sw rt, rs, imm16
  I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
BRANCH
  beq rs, rt, imm16
  I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

In today's lecture, I will show you how to implement the following subset of MIPS instructions: add, subtract, or immediate, load, store, branch, and the jump instruction. The Add and Subtract instructions use the R format. The Op and Funct fields together specify all the different kinds of add and subtract instructions. Rs and Rt specify the source registers, and the Rd field specifies the destination register. The Or immediate instruction uses the I format. It uses only one source register, Rs. The other operand comes from the immediate field. The Rt field is used to specify the destination register. (Note that the destination is the Rt field!) Both the load and store instructions use the I format, and both add Rs and the immediate field together to form the memory address. The difference is that the load instruction loads the data from memory into Rt, while the store instruction stores the data in Rt into memory. The branch on equal instruction also uses the I format. Here Rs and Rt specify the registers we need to compare. If these two registers are equal, we branch to a location offset by the immediate field. Finally, the jump instruction uses the J format and always causes the program to jump to the memory location specified in the address field. I know I went over this rather quickly and you may have missed something. But don't worry, this is just an overview. You will keep seeing these (point to the format) all day today.
5
Logical Register Transfers
RTL gives the meaning of the instructions.
All start by fetching the instruction:
  op | rs | rt | rd | shamt | funct = MEM[ PC ]
  op | rs | rt | Imm16 = MEM[ PC ]

inst    Register Transfers
ADDU    R[rd] <- R[rs] + R[rt]; PC <- PC + 4
SUBU    R[rd] <- R[rs] - R[rt]; PC <- PC + 4
ORi     R[rt] <- R[rs] | zero_ext(Imm16); PC <- PC + 4
LOAD    R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ]; PC <- PC + 4
STORE   MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt]; PC <- PC + 4
BEQ     if ( R[rs] == R[rt] ) then PC <- PC + 4 + ( sign_ext(Imm16) || 00 )
        else PC <- PC + 4
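To make the register transfers concrete, here is a minimal sketch that applies them to a register list and a memory dictionary. The names `step`, `R`, and `MEM`, and the tuple form of an instruction, are assumptions made for illustration. The sketch uses a bitwise OR for `ori` and the standard MIPS taken-branch address PC + 4 + 4 x sign_ext(Imm16), and it does not model register 0 being hard-wired to zero.

```python
MASK32 = 0xFFFFFFFF

def zero_ext(imm16):
    return imm16 & 0xFFFF

def sign_ext(imm16):
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def step(R, MEM, pc, inst):
    """Apply one instruction's register transfers and return the next PC."""
    name, rs, rt, rd, imm16 = inst
    if name == "addu":
        R[rd] = (R[rs] + R[rt]) & MASK32
    elif name == "subu":
        R[rd] = (R[rs] - R[rt]) & MASK32
    elif name == "ori":
        R[rt] = R[rs] | zero_ext(imm16)
    elif name == "lw":
        R[rt] = MEM.get((R[rs] + sign_ext(imm16)) & MASK32, 0)
    elif name == "sw":
        MEM[(R[rs] + sign_ext(imm16)) & MASK32] = R[rt]
    elif name == "beq":
        if R[rs] == R[rt]:
            return (pc + 4 + (sign_ext(imm16) << 2)) & MASK32
    else:
        raise ValueError("unsupported instruction: " + name)
    return (pc + 4) & MASK32

R, MEM, pc = [0] * 32, {}, 0
pc = step(R, MEM, pc, ("ori", 0, 1, 0, 5))       # R[1] <- 5
pc = step(R, MEM, pc, ("ori", 0, 2, 0, 7))       # R[2] <- 7
pc = step(R, MEM, pc, ("addu", 1, 2, 3, 0))      # R[3] <- R[1] + R[2]
pc = step(R, MEM, pc, ("sw", 0, 3, 0, 16))       # MEM[16] <- R[3]
pc = step(R, MEM, pc, ("lw", 0, 4, 0, 16))       # R[4] <- MEM[16]
pc = step(R, MEM, pc, ("beq", 3, 4, 0, 0xFFFF))  # taken, offset of -1 word
```

The last instruction branches back to itself: it sits at address 20, and 20 + 4 + (-1 x 4) is 20 again.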
6
Step 1: Requirements of the Instruction Set
Memory: instruction & data
Registers (32 x 32): read RS, read RT, write RT or RD
PC
Extender
Add and Sub: register or extended immediate
Add 4 or extended immediate to PC
7
Step 2: Components of the Datapath
Combinational Elements
Storage Elements
Clocking methodology
8
Combinational Logic Elements (Basic Building Blocks)
Adder: inputs A and B (32 bits each) and CarryIn; outputs Sum (32 bits) and Carry.
MUX: inputs A and B (32 bits each) and Select; output Y (32 bits).
ALU: inputs A and B (32 bits each) and OP; output Result (32 bits).

Based on the Register Transfer Language examples we have seen so far, we know we will need the following combinational logic elements: an adder to update the program counter, a MUX to select the results, and an ALU to perform the various arithmetic and logic operations.
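The three combinational blocks can be modeled as pure functions, since their outputs depend only on their current inputs. This is a rough sketch: the function names, the string encoding of the ALU operation, and the convention that Select = 0 picks input A are all assumptions.

```python
MASK32 = 0xFFFFFFFF

def adder(a, b, carry_in=0):
    """Return (Sum, Carry) for two 32-bit inputs."""
    total = a + b + carry_in
    return total & MASK32, total >> 32

def mux(select, a, b):
    """2-to-1 multiplexer: select = 0 passes A, select = 1 passes B."""
    return b if select else a

def alu(op, a, b):
    """32-bit ALU supporting the three operations this datapath needs."""
    if op == "add":
        return (a + b) & MASK32
    if op == "sub":
        return (a - b) & MASK32
    if op == "or":
        return (a | b) & MASK32
    raise ValueError("unsupported ALU operation: " + op)
```

For example, `adder(0xFFFFFFFF, 1)` wraps around to a sum of 0 with a carry out of 1, and `alu("sub", x, x)` is 0 for any `x`, which is what the branch comparison relies on.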
9
Storage Element: Register (Basic Building Block)
Similar to the D Flip Flop except:
  N-bit input and output
  Write Enable input
Write Enable:
  negated (0): Data Out will not change
  asserted (1): Data Out will become Data In
[Diagram: a register with N-bit Data In, N-bit Data Out, a Write Enable input, and a Clk input.]

As far as storage elements are concerned, we will need an N-bit register that is similar to the D flip-flop I showed you in class. The significant difference is that the register has a Write Enable input. That is, the content of the register will NOT be updated if Write Enable is not asserted (0). The content is updated at the clock tick ONLY if the Write Enable signal is asserted (1).
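The write-enable behavior can be sketched with a small class in which a method call stands in for the clock edge. The class name `Register` and the method name `clock_tick` are assumptions.

```python
class Register:
    """N-bit register whose content changes only on a clock tick with Write Enable = 1."""

    def __init__(self, width=32):
        self.mask = (1 << width) - 1
        self.data_out = 0

    def clock_tick(self, data_in, write_enable):
        # Write Enable negated (0): Data Out keeps its old value.
        # Write Enable asserted (1): Data Out becomes Data In.
        if write_enable:
            self.data_out = data_in & self.mask
        return self.data_out

r = Register()
first = r.clock_tick(42, write_enable=0)    # ignored, register still holds 0
second = r.clock_tick(42, write_enable=1)   # captured
third = r.clock_tick(7, write_enable=0)     # ignored, register still holds 42
```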
10
Storage Element: Register File
Register File consists of 32 registers:
  Two 32-bit output busses: busA and busB
  One 32-bit input bus: busW
Register is selected by:
  RA (number) selects the register to put on busA (data)
  RB (number) selects the register to put on busB (data)
  RW (number) selects the register to be written via busW (data) when Write Enable is 1
Clock input (CLK):
  The CLK input is a factor ONLY during write operation
  During read operation, behaves as a combinational logic block: RA or RB valid => busA or busB valid after "access time."
[Diagram: a block of 32 32-bit registers with 5-bit inputs RW, RA, and RB, a Write Enable input, a Clk input, the 32-bit input bus busW, and the 32-bit output busses busA and busB.]

We will also need a register file that consists of 32 32-bit registers with two output busses (busA and busB) and one input bus. The register specifiers Ra and Rb select the registers to put on busA and busB respectively. When Write Enable is 1, the register specifier Rw selects the register to be written via busW. In our simplified version of the register file, the write operation occurs at the clock tick. Keep in mind that the clock input is a factor ONLY during the write operation. During a read operation, the register file behaves as a combinational logic block. That is, if you put a valid value on Ra, then bus A will become valid after the register file's access time. Similarly, if you put a valid value on Rb, bus B will become valid after the register file's access time. In both cases (Ra and Rb), the clock input is not a factor.
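A minimal sketch of this register file separates the combinational read from the clocked write. The class and method names are assumptions, and access time is not modeled.

```python
class RegisterFile:
    """32 x 32-bit register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, ra, rb):
        """Combinational read: returns (busA, busB) with no clock involved."""
        return self.regs[ra], self.regs[rb]

    def clock_tick(self, rw, bus_w, write_enable):
        """Clocked write: register rw takes busW only when Write Enable is 1."""
        if write_enable:
            self.regs[rw] = bus_w & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_tick(rw=5, bus_w=123, write_enable=1)
rf.clock_tick(rw=6, bus_w=456, write_enable=0)   # no write happens
bus_a, bus_b = rf.read(ra=5, rb=6)
```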
11
Storage Element: Idealized Memory
Memory (idealized):
  One input bus: Data In
  One output bus: Data Out
Memory word is selected by:
  Address selects the word to put on Data Out
  Write Enable = 1: address selects the memory word to be written via the Data In bus
Clock input (CLK):
  The CLK input is a factor ONLY during write operation
  During read operation, behaves as a combinational logic block: Address valid => Data Out valid after "access time."
[Diagram: a memory block with Write Enable, Address, and Clk inputs, a 32-bit Data In bus, and a 32-bit Data Out bus.]

The last storage element you will need for the datapath is the idealized memory that stores your data and instructions. This idealized memory block has just one input bus (Data In) and one output bus (Data Out). When Write Enable is 0, the address selects the memory word to put on the Data Out bus. When Write Enable is 1, the address selects the memory word to be written via the Data In bus at the next clock tick. Once again, the clock input is a factor ONLY during the write operation. During a read operation, the memory behaves as a combinational logic block. That is, if you put a valid value on the address lines, the output bus Data Out will become valid after the access time of the memory.
12
Clocking Methodology
[Timing diagram: the Clk waveform with Setup and Hold windows around each clock edge; the data is "don't care" outside those windows.]
All storage elements are clocked by the same clock edge
Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew
(CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time

Remember, we will be using a clocking methodology where all storage elements are clocked by the same clock edge. Consequently, our cycle time will be the sum of: (a) the Clock-to-Q time of the input registers, (b) the longest delay path through the combinational logic block, (c) the setup time of the output register, and (d) the clock skew. In order to avoid a hold time violation, you have to make sure the second inequality is fulfilled.
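The two timing constraints are simple arithmetic, sketched below. The function names are assumptions and the delay values in the example are made-up numbers in picoseconds, not figures from the lecture.

```python
def min_cycle_time(clk_to_q, longest_delay, setup, skew):
    """Cycle Time = CLK-to-Q + Longest Delay Path + Setup + Clock Skew."""
    return clk_to_q + longest_delay + setup + skew

def hold_ok(clk_to_q, shortest_delay, skew, hold):
    """True when (CLK-to-Q + Shortest Delay Path - Clock Skew) > Hold Time."""
    return (clk_to_q + shortest_delay - skew) > hold

# Hypothetical delays, all in picoseconds
cycle = min_cycle_time(clk_to_q=50, longest_delay=800, setup=40, skew=20)
safe = hold_ok(clk_to_q=50, shortest_delay=30, skew=20, hold=40)
unsafe = hold_ok(clk_to_q=50, shortest_delay=5, skew=20, hold=40)
```

The longest path sets how slow the clock must be, while the shortest path decides whether new data can race through and corrupt a value that is still being captured.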
13
Step 3 Register Transfer Requirements –> Datapath Assembly
Instruction Fetch
Read Operands and Execute Operation
14
3a: Overview of the Instruction Fetch Unit
The common RTL operations:
  Fetch the Instruction: mem[PC]
  Update the program counter:
    Sequential Code: PC <- PC + 4
    Branch and Jump: PC <- "something else"
[Diagram: the PC, clocked by Clk, drives the Address input of the Instruction Memory, which outputs the 32-bit Instruction Word; the Next Address Logic feeds the PC.]

Now let's take a look at the first major component of the datapath: the instruction fetch unit. The common RTL operations for all instructions are: (a) Fetch the instruction using the Program Counter (PC) at the beginning of an instruction's execution (PC -> Instruction Memory -> Instruction Word). (b) Then, at the end of the instruction's execution, update the Program Counter (PC -> Next Address Logic -> PC). More specifically, you need to increment the PC by 4 if you are executing sequential code. For Branch and Jump instructions, you need to update the program counter to "something else" other than plus 4. I will show you what is inside this Next Address Logic block when we talk about the Branch and Jump instructions. For now, let's focus our attention on the Add and Subtract instructions.
15
3b: Add & Subtract R[rd] <- R[rs] op R[rt] Example: addU rd, rs, rt
Ra, Rb, and Rw come from the instruction's rs, rt, and rd fields
ALUctr and RegWr: control logic after decoding the instruction
R-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
[Diagram: the Rd, Rs, and Rt fields (5 bits each) drive the Rw, Ra, and Rb inputs of the block of 32 32-bit registers. busA and busB (32 bits each) feed the ALU, which is controlled by ALUctr. The 32-bit ALU Result returns to the register file on busW, which is written under control of RegWr and Clk.]

And here is the datapath that can do the trick. First of all, we connect the register file's Ra, Rb, and Rw inputs to the Rs, Rt, and Rd fields of the instruction bus (points to the format diagram). Then we connect busA and busB of the register file to the ALU. Finally, we connect the output of the ALU to the input bus of the register file. Conceptually, this is how it works. The instruction bus coming out of the Instruction Memory sets Ra and Rb to the register specifiers Rs and Rt. This causes the register file to put the value of register Rs onto busA and the value of register Rt onto busB, respectively. By setting ALUctr appropriately, the ALU will perform either the Add or the Subtract for us. The result is then fed back to the register file, where the register specifier Rw should already be set to the instruction bus's Rd field. Since the control, which we will design in our next lecture, should have already set the RegWr signal to 1, the result will be written back to the register file at the next clock tick (points to the Clk input).
16
Register-Register Timing
[Timing diagram. After the clock edge, each signal changes from its old value to its new value in this order: PC (after Clk-to-Q); Rs, Rt, Rd, Op, Func (after the Instruction Memory Access Time); ALUctr and RegWr (after the Delay through Control Logic); busA and busB (after the Register File Access Time); busW (after the ALU Delay). The register write occurs at the next clock edge.]

Let's take a more quantitative look at what is happening. At each clock tick, the Program Counter presents its latest value to the Instruction Memory after the Clk-to-Q time. After a delay of the Instruction Memory Access Time, the Opcode, Rd, Rs, Rt, and Function fields become valid on the instruction bus. Once we have the new instruction, that is the Add or Subtract instruction, on the instruction bus, two things happen in parallel. First of all, the control unit decodes the Opcode and Func fields and sets the control signals ALUctr and RegWr accordingly. We will cover this in the next lecture. While this is happening (points to Control Delay), we are also reading the register file (Register File Access Time). Once the data is valid on busA and busB, the ALU performs the Add or Subtract operation based on the ALUctr signal. Hopefully, the ALU is fast enough to finish the operation (ALU Delay) before the next clock tick. At the next clock tick, the output of the ALU is written into the register file because the RegWr signal is equal to 1.
17
3c: Logical Operations with Immediate
R[rt] <- R[rs] op ZeroExt[imm16]
I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Diagram: a RegDst Mux selects between Rd and Rt for the register file's Rw input; Rs drives Ra and Rt drives Rb. A ZeroExt block widens imm16 from 16 to 32 bits, with the immediate in bits 15-0. A second Mux, controlled by ALUSrc, selects between busB and the extended immediate as the second ALU input. The ALU Result returns to the register file on busW.]

Here is the datapath for the Or immediate instructions. We cannot use the Rd field here (Rw) because this instruction format has no Rd field. The bits that hold the Rd field in the R-type are used here as part of the immediate field. For this instruction type, the Rw input of the register file, that is, the address of the register to be written, comes from the Rt field of the instruction. Recall from the earlier slide that for R-type instructions, Rw comes from the Rd field. That's why we need a MUX here to put Rd onto Rw for R-type instructions and to put Rt onto Rw for the I-type instruction. Since the second operand of this instruction will be the immediate field zero extended to 32 bits, we also need a MUX here to block off bus B from the register file. Since bus B is blocked off by the MUX, the value on bus B is a don't care. Therefore we do not have to worry about what ends up on the register file's Rb register specifier. To keep things simple, we may just as well keep it the same as the R-type instruction and put the Rt field here. So to summarize, this is how this datapath works. With Rs on the Register File's Ra input, bus A gets the value of Rs as the first ALU operand. The second operand comes from the immediate field of the instruction. Once the ALU completes the OR operation, the result is written into the register specified by the instruction's Rt field.
18
3d: Load Operations
R[rt] <- Mem[R[rs] + SignExt[imm16]]
Example: lw rt, rs, imm16
I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Diagram: the OR-immediate datapath extended with a Data Memory (WrEn, Adr, Data In, Clk) and a W_Src Mux. The Extender, controlled by ExtOp, widens imm16 from 16 to 32 bits. The ALU output drives both the Data Memory's Adr input and one input of the W_Src Mux; the Data Memory output drives the other input, and the Mux output is busW. Control signals shown: RegDst, RegWr, ALUctr, ALUSrc, ExtOp, MemWr, W_Src.]

Once again we cannot use the instruction's Rd field for the Register File's Rw input, because load is an I-type instruction and there is no Rd field in the I format. So instead of Rd, the Rt field is used to specify the destination register through this two-to-one multiplexer. The first operand of the ALU comes from busA of the register file, which contains the value of Register Rs (points to the Ra input of the register file). The second operand, on the other hand, comes from the immediate field of the instruction. Instead of the Zero Extender I used in the datapath for the or immediate instruction, I have to use a more general purpose Extender that can do both Sign Extend and Zero Extend. The ALU then adds these two operands together to form the memory address. Consequently, the output of the ALU has to go to two places: (a) first, to the address input of the data memory, and (b) second, to the input of this two-to-one multiplexer. The other input of this multiplexer comes from the output of the data memory, so we can place the output of the data memory onto the register file's input bus for the load instruction. For the Add, Subtract, and Or immediate instructions, the output of the ALU is selected to be placed on the input bus of the register file. In either case, the control signal RegWr should be asserted so the register file will be written at the end of the cycle.
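The general purpose Extender and the address calculation can be sketched as follows. The function names are assumptions; the encoding ExtOp = 0 for zero extension and ExtOp = 1 for sign extension follows the speaker notes on the Or Immediate and Load slides later in the deck.

```python
MASK32 = 0xFFFFFFFF

def extender(imm16, ext_op):
    """Widen a 16-bit immediate to 32 bits: ext_op 0 = zero extend, 1 = sign extend."""
    imm16 &= 0xFFFF
    if ext_op and (imm16 & 0x8000):
        return imm16 | 0xFFFF0000   # copy the sign bit into bits 31-16
    return imm16

def load_address(base, imm16):
    """Memory address for lw/sw: R[rs] + SignExt[imm16], wrapped to 32 bits."""
    return (base + extender(imm16, ext_op=1)) & MASK32

zero_extended = extender(0x8000, ext_op=0)
sign_extended = extender(0x8000, ext_op=1)
addr = load_address(0x1000, 0xFFFC)   # offset of -4 from the base register
```

Sign extension is what lets a load or store reach addresses below the base register: the immediate 0xFFFC acts as -4, so the address lands at 0x0FFC.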
19
3e: Store Operations
Mem[ R[rs] + SignExt[imm16] ] <- R[rt]
Example: sw rt, rs, imm16
I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Diagram: the load datapath with busB additionally connected to the Data Memory's Data In input. Control signals shown: RegDst, RegWr, ALUctr, ALUSrc, ExtOp, MemWr, W_Src.]

And here is the datapath for the store instruction. The Register File, the ALU, and the Extender are the same as in the datapath for the load instruction because the memory address has to be calculated in exactly the same way: (a) Put the register selected by Rs onto bus A and sign extend the 16-bit immediate field. (b) Then make the ALU (ALUctr) add these two (busA and the output of the Extender) together. The new thing we added here is the busB extension (Data In). More specifically, in order to send the register selected by the Rt field (Rb of the register file) to data memory, we need to connect bus B to the data memory's Data In bus. Finally, the store instruction is the first instruction we have encountered that does not do any register write at the end. Therefore the control unit must make sure RegWr is zero for this instruction.
20
3f: The Branch Instruction
beq rs, rt, imm16
I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)

mem[PC]                    Fetch the instruction from memory
Equal <- R[rs] == R[rt]    Calculate the branch condition
if (Equal)                 Calculate the next instruction's address
    PC <- PC + 4 + ( SignExt(imm16) x 4 )
else
    PC <- PC + 4

How does the branch on equal instruction work? It calculates the branch condition by subtracting the register selected by the Rt field from the register selected by the Rs field. If the result of the subtraction is zero, then these two registers are equal and we take the branch. Otherwise, we keep going down the sequential path (PC <- PC + 4).
21
Datapath for Branch Operations
beq rs, rt, imm16
Datapath generates condition (equal)
I-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
[Diagram: Rs and Rt drive the Ra and Rb inputs of the register file; busA and busB feed an "Equal?" comparator that produces Cond. In the fetch logic, one Adder adds 4 to the PC and a second Adder adds the extended imm16 (PC Ext); a Mux controlled by nPC_sel selects the next PC, whose two low-order bits are 00. The PC drives the instruction address (Inst Address).]

The datapath for calculating the branch condition is rather simple. All we have to do is feed the Rs and Rt fields of the instruction into the Ra and Rb inputs of the register file. Bus A will then contain the value from the register selected by Rs, and bus B will contain the value from the register selected by Rt. The next thing to do is to ask the ALU to perform a subtract operation and feed the output Zero to the next address logic. What does the next address logic block look like? Well, before I show you that, let's take a look at the binary arithmetic behind the program counter (PC).
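The next address selection can be sketched as one function. This is an assumption-laden illustration: the name `next_pc`, the string values `"+4"` and `"br"` for nPC_sel, and the use of the standard MIPS target PC + 4 + 4 x SignExt(imm16) are choices made here, not details confirmed by the slide.

```python
MASK32 = 0xFFFFFFFF

def next_pc(pc, imm16, n_pc_sel, equal):
    """Choose between PC + 4 and the branch target."""
    pc_plus_4 = (pc + 4) & MASK32                 # first adder
    imm16 &= 0xFFFF
    offset = (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16   # sign extend
    target = (pc_plus_4 + ((offset << 2) & MASK32)) & MASK32       # second adder
    # The mux takes the branch target only for a branch whose condition holds.
    return target if (n_pc_sel == "br" and equal) else pc_plus_4

taken = next_pc(0x100, 3, "br", equal=True)           # 0x100 + 4 + 12
not_taken = next_pc(0x100, 3, "br", equal=False)      # falls through
backward = next_pc(0x100, 0xFFFF, "br", equal=True)   # offset of -1 word
sequential = next_pc(0x100, 3, "+4", equal=True)      # not a branch
```

Shifting the offset left by two bits is the "x 4" (or "|| 00") step: instructions are word aligned, so the immediate counts instructions rather than bytes.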
22
Putting it All Together: A Single Cycle Datapath
[Diagram: the complete single cycle datapath. The Inst Memory, addressed by the PC, outputs Instruction<31:0>, from which the fields Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15> are taken. The fetch logic contains the PC, an Adder that adds 4, a second Adder for the extended imm16 (PC Ext), and a Mux controlled by nPC_sel. The main datapath contains the RegDst Mux, the 32 32-bit registers (Rw, Ra, Rb, busA, busB, busW), the Extender, the ALUSrc Mux, the ALU with its Equal output, the Data Memory (WrEn, Adr, Data In), and the MemtoReg Mux. Control signals shown: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg.]

So here is the single cycle datapath we just built. If you push into the Instruction Fetch Unit, you will see the last slide showing the PC, the next address logic, and the Instruction Memory. Here I have shown how we can get the Rt, Rs, Rd, and Imm16 fields out of the 32-bit instruction word. The Rt, Rs, and Rd fields go to the register file as register specifiers, while the Imm16 field goes to the Extender, where it is either Zero or Sign extended to 32 bits. The signals ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, RegDst, RegWr, Branch, and Jump are control signals, and I will show you how to generate them on Friday.
23
Step 4: Given Datapath: RTL -> Control
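[Diagram: the Inst Memory, addressed by Adr, outputs Instruction<31:0>. The fields Op, Fun, Rt, Rs, Rd, and Imm16 are taken from the instruction word. The Control block drives nPC_sel, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the DATA PATH, and the datapath returns Equal.]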
24
Meaning of the Control Signals
MemWr: write memory
MemtoReg: 1 => Mem
RegDst: 0 => "rt"; 1 => "rd"
RegWr: write dest register
ExtOp: "zero", "sign"
ALUsrc: 0 => regB; 1 => immed
ALUctr: "add", "sub", "or"
[Diagram: the main datapath with the RegDst Mux, the 32 32-bit registers, the Extender, the ALUSrc Mux, the ALU with its Equal output, the Data Memory, and the MemtoReg Mux, annotated with the control signals above.]
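One way to see what each control signal does is to model a single cycle of the main datapath as a function of those signals. This is a sketch under several assumptions: the function name `datapath_cycle`, the dictionary of signals, and the dictionary used as memory are all illustrative, and the `lw` settings in the example are the ones listed on the load slide later in the deck.

```python
MASK32 = 0xFFFFFFFF

def datapath_cycle(regs, mem, rs, rt, rd, imm16, c):
    """Run one clock cycle of the main datapath; c maps control names to values."""
    bus_a, bus_b = regs[rs], regs[rt]             # register file read
    imm16 &= 0xFFFF
    sign = c["ExtOp"] and (imm16 & 0x8000)        # ExtOp: 0 = zero, 1 = sign
    ext = (imm16 | 0xFFFF0000) if sign else imm16
    alu_b = ext if c["ALUSrc"] else bus_b         # ALUSrc: 0 = regB, 1 = immed
    if c["ALUctr"] == "add":
        result = (bus_a + alu_b) & MASK32
    elif c["ALUctr"] == "sub":
        result = (bus_a - alu_b) & MASK32
    else:                                         # "or"
        result = bus_a | alu_b
    equal = int(bus_a == bus_b)
    bus_w = mem.get(result, 0) if c["MemtoReg"] else result   # MemtoReg: 1 = Mem
    # State changes, as if at the clock edge that ends the cycle
    if c["MemWr"]:
        mem[result] = bus_b
    if c["RegWr"]:
        regs[rd if c["RegDst"] else rt] = bus_w   # RegDst: 0 = rt, 1 = rd
    return equal

lw = dict(RegDst=0, ALUSrc=1, ALUctr="add", ExtOp=1, MemWr=0, MemtoReg=1, RegWr=1)
regs = [0] * 32
regs[1] = 0x100
mem = {0xFC: 99}
datapath_cycle(regs, mem, rs=1, rt=2, rd=0, imm16=0xFFFC, c=lw)   # lw $2, -4($1)
```

A don't-care signal can be passed as `None` here, because every mux in the sketch treats any falsy value as 0.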
25
Example: Load Instruction
[Diagram: the complete single cycle datapath annotated for the load instruction, with nPC_sel = +4, RegDst = rt, ALUctr = add, and ExtOp = sign ext. The instruction memory output supplies Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15>.]
26
An Abstract View of the Implementation
[Diagram: an Ideal Instruction Memory, addressed by the PC (Instruction Address), delivers the Instruction to the Control block and the Rd, Rs, and Rt specifiers (5 bits each) to the 32 32-bit registers (Rw, Ra, Rb). The register outputs A and B feed the ALU; the ALU output is the Data Address of the Ideal Data Memory, which has Data In and Data Out busses. The Control block sends Control Signals to the Datapath and receives Conditions from it. The Next Address logic updates the PC.]
Logical vs. Physical Structure

One thing you may have noticed from our last slide is that almost all instructions, except Jump, require reading some registers, doing some computation, and then doing something else. Therefore our datapath will look something like this. For example, if we have an add instruction (points to the output of Instruction Memory), we read the registers from the register file (Ra, Rb and then busA and busB), add the two numbers together (ALU), and then write the result back to the register file. On the other hand, if we have a load instruction, we first use the ALU to calculate the memory address. Once the address is ready, we use it to access the Data Memory. And once the data is available on the Data Memory's output bus, we write the data to the register file. Well, this is simple enough. But if it were this simple, you probably wouldn't need to take this class. So in today's lecture, I will show you how to turn this abstract datapath into a real datapath by making it slightly (JUST slightly) more complicated so it can do real work for you. But before we do that, let's do a quick review of the clocking methodology.
27
Summary
5 steps to design a processor
1. Analyze the instruction set => datapath requirements
2. Select a set of datapath components & establish a clocking methodology
3. Assemble a datapath that meets the requirements
4. Analyze the implementation of each instruction to determine the setting of the control points that effect the register transfer.
5. Assemble the control logic
MIPS makes it easier
  Instructions same size
  Source registers always in same place
  Immediates same size, location
  Operations always on registers/immediates
Single cycle datapath => CPI = 1, clock cycle time (CCT) => long
Next time: implementing control
28
Recap: A Single Cycle Datapath
We have everything except the control signals (underlined)
Today's lecture will show you how to generate the control signals
[Diagram: the single cycle datapath with the Instruction Fetch Unit drawn as one block that takes nPC_sel and Zero and outputs Instruction<31:0>, from which Rs <21:25>, Rt <16:20>, Rd <11:15>, and Imm16 <0:15> are taken. The main datapath contains the RegDst Mux, the 32 32-bit registers, the Extender, the ALUSrc Mux, the ALU, the Data Memory, and the MemtoReg Mux. Control signals shown: nPC_sel, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg.]

The result of the last lecture is this single-cycle datapath.
29
RTL: The Add Instruction
add rd, rs, rt
R-type format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)

mem[PC]                   Fetch the instruction from memory
R[rd] <- R[rs] + R[rt]    The actual operation
PC <- PC + 4              Calculate the next instruction's address

OK, let's get on with today's lecture by looking at the simple add instruction. In terms of Register Transfer Language, this is what the Add instruction needs to do. First, you need to fetch the instruction from memory. Then you perform the actual add operation. More specifically: (a) You add the contents of the registers specified by the Rs and Rt fields of the instruction. (b) Then you write the result to the register specified by the Rd field. And finally, you need to update the program counter to point to the next instruction. Now, let's take a detailed look at the datapath during the various phases of this instruction.
30
Instruction Fetch Unit at the Beginning of Add
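Fetch the instruction from Instruction memory: Instruction <- mem[PC]
This is the same for all instructions
[Diagram: the PC, clocked by Clk, drives the Adr input of the Inst Memory, which outputs Instruction<31:0>. One Adder adds 4 to the PC and a second Adder adds the extended imm16 (PC Ext); a Mux controlled by nPC_sel selects the next PC, whose two low-order bits are 00.]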
31
The Single Cycle Datapath during Add
R[rd] <- R[rs] + R[rt]
R-type format: op | rs | rt | rd | shamt | funct
Control settings: nPC_sel = +4, RegDst = 1, RegWr = 1, ALUctr = Add, ALUSrc = 0, ExtOp = x, MemWr = 0, MemtoReg = 0
[Diagram: the single cycle datapath with the paths used by Add highlighted.]

This picture shows the activities in the main datapath during the execution of the Add or Subtract instructions. The active parts of the datapath are shown in a different color as well as with thicker lines. First of all, the Rs and Rt fields of the instruction are fed to the Ra and Rb address ports of the register file and cause the contents of the registers specified by the Rs and Rt fields to be placed on busA and busB, respectively. With the ALUctr signals set to either Add or Subtract, the ALU performs the proper operation, and with MemtoReg set to 0, the ALU output is placed onto busW. The control we are going to design will also set RegWr to 1 so that the result is written to the register file at the end of the cycle. Notice that ExtOp is a don't care: the Extender in this case can do either a SignExt or a ZeroExt. We DON'T care because ALUSrc will be equal to 0, since we are using busB. The other control signals we need to worry about are: (a) MemWr has to be set to zero because we do not want to write the memory. (b) Branch and Jump have to be set to zero. Let me show you why.
32
Instruction Fetch Unit at the End of Add
PC <- PC + 4
This is the same for all instructions except: Branch and Jump
[Diagram: the Instruction Fetch Unit with the PC, the Inst Memory, the Adder that adds 4, the Adder for the extended imm16, and the Mux controlled by nPC_sel.]

This picture shows the control signal settings for the Instruction Fetch Unit at the end of the Add or Subtract instruction. Both the Branch and Jump signals are set to 0. Consequently, the output of the first adder, which implements PC plus 4, is selected through the two 2-to-1 muxes and placed on the input of the Program Counter register. The Program Counter is updated to this new value at the next clock tick. Notice that the Program Counter is updated at every cycle. Therefore it does not have a Write Enable signal to control the write. Also, this picture is the same for all instructions other than Branch and Jump. Therefore I will only show this picture again for the Branch and Jump instructions and will not repeat it for all the other instructions.
33
The Single Cycle Datapath during Or Immediate
R[rt] <- R[rs] or ZeroExt[Imm16]
I-type format: op | rs | rt | immediate
Control settings (left blank on the slide, filled in from the notes below): nPC_sel = +4, RegDst = 0, RegWr = 1, ALUctr = Or, ALUSrc = 1, ExtOp = 0, MemWr = 0, MemtoReg = 0
[Diagram: the single cycle datapath with the control signal values left blank.]

Now let's look at the control signal settings for the Or immediate instruction. The OR immediate instruction ORs the content of the register specified by the Rs field with the Zero Extended Immediate field and writes the result to the register specified in Rt. This is how it works in the datapath. The Rs field is fed to the Ra address port to cause the contents of register Rs to be placed on busA. The other operand for the ALU comes from the immediate field. In order to do this, the controller needs to set ExtOp to 0 to instruct the extender to perform a Zero Extend operation. Furthermore, ALUSrc must be set to 1 so that the MUX blocks off bus B from the register file and sends the zero extended version of the immediate field to the ALU. Of course, ALUctr has to be set to OR so the ALU can perform an OR operation. The rest of the control signals (MemWr, MemtoReg, Branch, and Jump) are the same as for the Add and Subtract instructions. One big difference is the RegDst signal. In this case, the destination register is specified by the instruction's Rt field, NOT the Rd field, because we do not have an Rd field here. Consequently, RegDst must be set to 0 to place Rt onto the Register File's Rw address port. Finally, in order to accomplish the register write, RegWr must be set to 1.
34
The Single Cycle Datapath during Load
R[rt] <- Data Memory {R[rs] + SignExt[imm16]}
I-type format: op | rs | rt | immediate
Control settings: nPC_sel = +4, RegDst = 0, RegWr = 1, ALUctr = Add, ALUSrc = 1, ExtOp = 1, MemWr = 0, MemtoReg = 1
[Diagram: the single cycle datapath with the paths used by Load highlighted.]

Let's continue our lecture with the load instruction. What does the load instruction do? It first adds the contents of the register specified by the Rs field to the Sign Extended version of the Immediate field to form the memory address. Then it uses this memory address to access the memory and writes the data back to the register specified by the Rt field of the instruction. Here is how the datapath works: first the Rs field is fed to the Register File's Ra address port to place the register onto bus A. Then the ExtOp signal is set to 1 so that the immediate field is Sign Extended, and we place this value (output of the Extender) onto the ALU input by setting ALUSrc to 1. The ALU then adds (ALUctr = add) the two together to form the memory address, which is then placed onto the Data Memory's address port. In order to place the Data Memory's output bus onto the Register File's input bus (busW), the control needs to set MemtoReg to 1. Similar to the OR immediate instruction I showed you earlier, the destination register here is specified by the Rt field. Therefore RegDst must be set to 0. Finally, RegWr must be set to 1 to complete the register write operation. Well, it should be obvious to you by now that we need to set Branch and Jump to 0 to make sure the Instruction Fetch Unit updates the Program Counter correctly.
36
The Single Cycle Datapath during Store
I-type format: op | rs | rt | immediate
Data Memory {R[rs] + SignExt[imm16]} <- R[rt]
Control settings: nPC_sel = +4, RegDst = x, RegWr = 0, ExtOp = 1, ALUSrc = 1, ALUctr = Add, MemWr = 1, MemtoReg = x
[Figure: single cycle datapath (Instruction Fetch Unit, 32 32-bit registers, Extender, ALU, Data Memory) annotated with the settings above]
The store instruction performs the inverse function of the load. Instead of loading data from memory, the store instruction sends the contents of the register specified by Rt to data memory. Similar to the load instruction, the store instruction needs to read the contents of register Rs (fed to the Ra port) and add it to the sign-extended version of the immediate field (Imm16, ExtOp = 1, ALUSrc = 1) to form the data memory address (ALUctr = Add). However, unlike the load instruction where busB is not used, the store instruction uses busB to send the data to the Data Memory. Consequently, the Rt field of the instruction has to be fed to the Rb port of the register file. In order to write the Data Memory properly, the MemWr signal has to be set to 1. Notice that the store instruction does not update the register file. Therefore, RegWr must be set to zero, and consequently the control signals RegDst and MemtoReg are don't cares. And once again we need to set the control signals Branch and Jump to zero to ensure proper Program Counter updating. By now you are probably tired of this boring stuff where Branch and Jump are zero, so let's look at something different: the branch instruction.
37
The Single Cycle Datapath during Branch
I-type format: op | rs | rt | immediate
if (R[rs] - R[rt] == 0) then Zero <- 1 ; else Zero <- 0
Control settings: nPC_sel = “Br”, RegDst = x, RegWr = 0, ExtOp = x, ALUSrc = 0, ALUctr = Subtract, MemWr = 0, MemtoReg = x
[Figure: single cycle datapath (Instruction Fetch Unit, 32 32-bit registers, Extender, ALU, Data Memory) annotated with the settings above]
So how does the branch instruction work? As far as the main datapath is concerned, it needs to calculate the branch condition. That is, it subtracts the register specified in the Rt field from the register specified in the Rs field and sets the condition Zero accordingly. In order to place the register values on busA and busB, we need to feed the Rs and Rt fields of the instruction to the Ra and Rb ports of the register file and set ALUSrc to 0. Then we have to instruct the ALU to perform the subtract (ALUctr = Subtract) operation and set the Zero bit accordingly. The Zero bit is sent to the Instruction Fetch Unit. I will show you the internals of the Instruction Fetch Unit in a second. But before we leave this slide, I want you to notice that ExtOp, MemtoReg, and RegDst are don't cares, but RegWr and MemWr have to be zero to prevent any write from occurring. And finally, the controller needs to set the Branch signal to 1 so the Instruction Fetch Unit knows what to do. So now let's take a look at the Instruction Fetch Unit.
38
Instruction Fetch Unit at the End of Branch
I-type format: op | rs | rt | immediate
if (Zero == 1) then PC = PC + 4 + SignExt[imm16]*4 ; else PC = PC + 4
[Figure: Instruction Fetch Unit: PC, instruction memory, a first adder for PC + 4, a second adder for the branch target (imm16 with “00” appended), and a mux controlled by nPC_sel]
Let's look at the interesting case where the branch condition Zero is true (Zero = 1). If Zero is not asserted, we have our boring case where PC + 4 is selected. With Branch = 1 and Zero = 1, the output of the second adder is selected. That is, we add the sequential address (the output of the first adder) to the sign-extended version of the immediate field, times 4, to form the branch target address (the output of the second adder). With the control signal Jump set to zero, this branch target address is written into the Program Counter register (PC) at the end of the clock cycle.
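The next-PC selection described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sign_extend16` and `next_pc` are hypothetical, the Jump signal is taken to be 0, and the branch target is computed relative to PC + 4, as the speaker notes describe the two adders.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc: int, imm16: int, branch: int, zero: int) -> int:
    """Value written into PC at the end of the cycle (Jump assumed 0)."""
    sequential = (pc + 4) & 0xFFFFFFFF  # output of the first adder
    # Second adder: sequential address plus the sign-extended offset times 4.
    target = (sequential + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
    return target if (branch and zero) else sequential
```

For example, a taken branch at address 0x1000 with imm16 = 3 lands at 0x1010, while the same branch with Zero = 0 simply falls through to 0x1004.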
39
Step 4: Given Datapath: RTL -> Control
Instruction<31:0> Inst Memory <21:25> <21:25> <16:20> <11:15> <0:15> Adr Op Fun Rt Rs Rd Imm16 Control nPC_sel RegWr RegDst ExtOp ALUSrc ALUctr MemWr MemtoReg Equal DATA PATH
40
A Summary of Control Signals
inst    Register Transfer
ADD     R[rd] <- R[rs] + R[rt]; PC <- PC + 4
        ALUsrc = RegB, ALUctr = “add”, RegDst = rd, RegWr, nPC_sel = “+4”
SUB     R[rd] <- R[rs] - R[rt]; PC <- PC + 4
        ALUsrc = RegB, ALUctr = “sub”, RegDst = rd, RegWr, nPC_sel = “+4”
ORi     R[rt] <- R[rs] or zero_ext(Imm16); PC <- PC + 4
        ALUsrc = Im, Extop = “Z”, ALUctr = “or”, RegDst = rt, RegWr, nPC_sel = “+4”
LOAD    R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ]; PC <- PC + 4
        ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemtoReg, RegDst = rt, RegWr, nPC_sel = “+4”
STORE   MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt]; PC <- PC + 4
        ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemWr, nPC_sel = “+4”
BEQ     if ( R[rs] == R[rt] ) then PC <- PC + sign_ext(Imm16) || 00 else PC <- PC + 4
        nPC_sel = “Br”, ALUctr = “sub”
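To make the register transfers in this summary concrete, here is a minimal Python sketch that applies them to a small machine state. Everything about the representation is an assumption made for illustration: the `step` and `demo` names, the dict-based instruction format, and the word-addressed `MEM` dict are hypothetical, register 0 is not treated specially, and the branch target is taken relative to PC + 4 as on the Instruction Fetch Unit slide.

```python
MASK = 0xFFFFFFFF  # keep all values within 32 bits


def sign_ext(imm16):
    """Sign-extend a 16-bit immediate."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(state, inst):
    """Apply one instruction's register transfers to the state in place."""
    R, MEM = state["R"], state["MEM"]
    op = inst["op"]
    rs, rt, rd = inst.get("rs", 0), inst.get("rt", 0), inst.get("rd", 0)
    imm = inst.get("imm", 0)
    pc = state["pc"] + 4                      # default: PC <- PC + 4
    if op == "add":
        R[rd] = (R[rs] + R[rt]) & MASK
    elif op == "sub":
        R[rd] = (R[rs] - R[rt]) & MASK
    elif op == "ori":
        R[rt] = R[rs] | (imm & 0xFFFF)        # zero-extended immediate
    elif op == "lw":
        R[rt] = MEM.get((R[rs] + sign_ext(imm)) & MASK, 0)
    elif op == "sw":
        MEM[(R[rs] + sign_ext(imm)) & MASK] = R[rt]
    elif op == "beq":
        if R[rs] == R[rt]:
            pc += sign_ext(imm) << 2          # offset with "00" appended
    else:
        raise ValueError(f"unsupported instruction: {op}")
    state["pc"] = pc & MASK
    return state


def demo():
    """Run a short straight-line sequence (instructions are not fetched by PC)."""
    state = {"pc": 0, "R": [0] * 32, "MEM": {}}
    program = [
        {"op": "ori", "rs": 0, "rt": 1, "imm": 5},
        {"op": "ori", "rs": 0, "rt": 2, "imm": 7},
        {"op": "add", "rs": 1, "rt": 2, "rd": 3},
        {"op": "sw", "rs": 0, "rt": 3, "imm": 8},
        {"op": "lw", "rs": 0, "rt": 4, "imm": 8},
        {"op": "beq", "rs": 3, "rt": 4, "imm": 2},
    ]
    for inst in program:
        step(state, inst)
    return state
```

The sketch models only the architectural effect of each instruction; it says nothing about which control signals produce that effect, which is what the following slides work out.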
41
A Summary of the Control Signals
op and func encodings: see Appendix A (func is “We Don't Care :-)” for the non-R-type instructions)

                add   sub        ori   lw    sw    beq        jump
RegDst          1     1          0     0     x     x          x
ALUSrc          0     0          1     1     1     0          x
MemtoReg        0     0          0     1     x     x          x
RegWrite        1     1          1     1     0     0          0
MemWrite        0     0          0     0     1     0          0
nPCsel          0     0          0     0     0     1          0
Jump            0     0          0     0     0     0          1
ExtOp           x     x          0     1     1     x          x
ALUctr<2:0>     Add   Subtract   Or    Add   Add   Subtract   xxx

Formats: R-type (op, rs, rt, rd, shamt, funct): add, sub. I-type (op, rs, rt, immediate): ori, lw, sw, beq. J-type (op, target address): jump.
Here is a table summarizing the control signal settings for the seven instructions (add, sub, ...) we have looked at. Instead of showing you the exact bit values for the ALU control (ALUctr), I have used the symbolic values here. The first two columns are unique in the sense that they are R-type instructions, and in order to uniquely identify them we need to look at BOTH the op field and the func field. Ori, lw, sw, and branch on equal are I-type instructions and Jump is J-type. They can all be uniquely identified by looking at the opcode field alone. Now let's take a more careful look at the first two columns. Notice that they are identical except for the last row. So we can combine these two columns if we can “delay” the generation of the ALUctr signals. This leads us to something called “local decoding.”
42
The Concept of Local Decoding
                R-type     ori   lw    sw    beq        jump
RegDst          1          0     0     x     x          x
ALUSrc          0          1     1     1     0          x
MemtoReg        0          0     1     x     x          x
RegWrite        1          1     1     0     0          0
MemWrite        0          0     0     1     0          0
Branch          0          0     0     0     1          0
Jump            0          0     0     0     0          1
ExtOp           x          0     1     1     x          x
ALUop<N:0>      “R-type”   Or    Add   Add   Subtract   xxx

[Figure: op (6 bits) feeds Main Control, which produces ALUop (N bits); ALUop and func (6 bits) feed the local ALU Control, which produces ALUctr (3 bits) for the ALU]
That is, instead of asking the Main Control to generate the ALUctr signals directly, the Main Control generates a set of signals called ALUop. For all I-type and J-type instructions, ALUop tells the ALU Control exactly what the ALU needs to do (Add, Subtract, ...). But whenever the Main Control sees an R-type instruction, it simply throws its hands up and says: “Wow, I don't know what the ALU has to do, but I know it is an R-type instruction” and lets the local control block, ALU Control, take care of the rest. Notice that this saves us one column from the table we had on the last slide. But let's be honest, if one column were the ONLY thing we saved, we probably would not do it. When you have to design for the entire MIPS instruction set, however, this column will be used for ALL R-type instructions, which is more than just the Add and Subtract I showed you here. Another advantage of this table over the last one, besides being smaller, is that we can uniquely identify each column by looking at the op field only. Therefore, as I will show you later, the Main Control ONLY needs to look at the opcode field. How many bits do we need for ALUop?
43
The Encoding of ALUop
[Figure: op (6 bits) feeds Main Control, which produces ALUop (N bits); ALUop and func feed the local ALU Control, which produces ALUctr (3 bits)]
In this exercise, ALUop has to be 2 bits wide to represent: (1) “R-type” instructions, and the “I-type” instructions that require the ALU to perform (2) Or, (3) Add, and (4) Subtract.
To implement the full MIPS ISA, ALUop has to be 3 bits to represent: (1) “R-type” instructions, and the “I-type” instructions that require the ALU to perform (2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi).

                    R-type     ori   lw    sw    beq        jump
ALUop (Symbolic)    “R-type”   Or    Add   Add   Subtract   xxx
ALUop<2:0>          1 00       0 10  0 00  0 00  0 01       xxx

The answer is 2, because we only need to represent 4 things: “R-type,” the Or operation, the Add operation, and the Subtract operation. If you are implementing the entire MIPS instruction set, then ALUop has to be 3 bits wide because we need to represent 5 things: R-type, Or, Add, Subtract, and And. Here I show you the bit assignment I made for the 3-bit ALUop. With this bit assignment in mind, let's figure out what the local control, ALU Control, has to do.
44
The Decoding of the “func” Field
[Figure: op (6 bits) feeds Main Control, which produces ALUop (N bits); ALUop and func feed the local ALU Control, which produces ALUctr (3 bits)]

                    R-type     ori   lw    sw    beq
ALUop (Symbolic)    “R-type”   Or    Add   Add   Subtract
ALUop<2:0>          1 00       0 10  0 00  0 00  0 01

R-type format: op | rs | rt | rd | shamt | funct

funct<5:0>   Instruction Operation
10 0000      add
10 0010      subtract
10 0100      and
10 0101      or
10 1010      set-on-less-than

Recall ALU Homework (also P. 286 text):
ALUctr<2:0>  ALU Operation
000          And
001          Or
010          Add
110          Subtract
111          Set-on-less-than

What this table and diagram imply is that if the ALU Control receives ALUop = 100, it has to decode the instruction's “func” field to figure out what the ALU needs to do. Based on the MIPS encoding in Appendix A (or Fig 3.18, page 153 of 2/e) of your text book, we know we have an Add instruction if the func field is 10 0000. If the func field is 10 0010, we know we have a subtract operation, and so on. Notice that bit 5 and bit 4 of this field are the same for all these operations, so as far as the ALU Control is concerned these bits are don't cares. Now recall from your ALU homework that the ALUctr signals have the following meaning: 010 means Add, 110 means Subtract, and so on. Based on these three tables and the fact that bit 5 and bit 4 of the “func” field are don't cares, we can derive the following truth table for ALUctr.
45
The Truth Table for ALUctr
funct<3:0>   Instruction Op.
0000         add
0010         subtract
0100         and
0101         or
1010         set-on-less-than

                    R-type     ori   lw    sw    beq
ALUop (Symbolic)    “R-type”   Or    Add   Add   Subtract
ALUop<2:0>          1 00       0 10  0 00  0 00  0 01

ALUop                     func                               ALU         ALUctr
bit<2> bit<1> bit<0>      bit<3> bit<2> bit<1> bit<0>        Operation   bit<2> bit<1> bit<0>
0      0      0           x      x      x      x             Add         0      1      0
0      x      1           x      x      x      x             Subtract    1      1      0
0      1      x           x      x      x      x             Or          0      0      1
1      x      x           0      0      0      0             Add         0      1      0
1      x      x           0      0      1      0             Subtract    1      1      0
1      x      x           0      1      0      0             And         0      0      0
1      x      x           0      1      0      1             Or          0      0      1
1      x      x           1      0      1      0             Set on <    1      1      1

That is, whenever ALUop is 000, we don't care about the func field because we know we need the ALU to do an Add operation. Whenever ALUop bit<2> is 0 and bit<0> is 1, we know we want the ALU to perform a Subtract regardless of what the func field is. Bit<1> is a don't care because, for our encoding here, ALUop<1> will never be equal to 1 whenever bit<0> is 1 and bit<2> is 0. Similarly, whenever ALUop bit<2> is 0 and bit<1> is 1, we need the ALU to perform Or. The tricky part occurs when ALUop bit<2> equals 1. In that case, we have an R-type instruction and we need to look at the func field. In any case, once we have the symbolic column, we can get the actual bit columns by referring to our ALU table on the last slide.
46
The Logic Equation for ALUctr<2>
ALUop                     func
bit<2> bit<1> bit<0>      bit<3> bit<2> bit<1> bit<0>      ALUctr<2>
0      x      1           x      x      x      x           1
1      x      x           0      0      1      0           1
1      x      x           1      0      1      0           1
(the last two rows together make func<3> a don't care)

ALUctr<2> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>

From the truth table we had before the break, we can derive the logic equation for ALUctr bit 2 by collecting all the rows that have ALUctr bit 2 equal to 1, and this table is the result. Each row becomes a product term and we need to OR the product terms together. Notice that the last two rows are identical except for bit<3> of the func field. One is zero and the other is one. Together, they make bit<3> a don't care term. With all these don't care terms, the logic equation is rather simple. The first product term is: not ALUop<2> and ALUop<0>. The second product term, after we make func<3> a don't care, becomes: ALUop<2> and not func<2> and func<1> and not func<0>.
47
The Logic Equation for ALUctr<1>
ALUop                     func
bit<2> bit<1> bit<0>      bit<3> bit<2> bit<1> bit<0>      ALUctr<1>
0      0      0           x      x      x      x           1
0      x      1           x      x      x      x           1
1      x      x           0      0      0      0           1
1      x      x           0      0      1      0           1
1      x      x           1      0      1      0           1

ALUctr<1> = !ALUop<2> & !ALUop<1> + ALUop<2> & !func<2> & !func<0>

Here is the truth table when we collect all the rows where ALUctr bit<1> equals 1. Once again, we can simplify the table by noticing that the first two rows differ only at the ALUop bit<0> position, so we can make ALUop bit<0> a don't care. Similarly, the last three rows can be combined to make func bit<3> and bit<1> don't cares. Consequently, the logic equation for ALUctr bit<1> becomes the one shown above.
48
The Logic Equation for ALUctr<0>
ALUop                     func
bit<2> bit<1> bit<0>      bit<3> bit<2> bit<1> bit<0>      ALUctr<0>
0      1      x           x      x      x      x           1
1      x      x           0      1      0      1           1
1      x      x           1      0      1      0           1

ALUctr<0> = !ALUop<2> & ALUop<1> + ALUop<2> & !func<3> & func<2> & !func<1> & func<0> + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>

Finally, after we gather all the rows where ALUctr bit 0 is 1, we have this truth table. Well, we are out of luck here. I don't see any simple way to simplify these product terms by just looking at them. There may be some if you draw out the 7-dimension K-map, but I am not going to try it. So I just write down the logic equation as it is.
49
The ALU Control Block
[Figure: ALU Control (Local) block with inputs ALUop (3 bits) and func (6 bits) and output ALUctr (3 bits)]
ALUctr<2> = !ALUop<2> & ALUop<0> + ALUop<2> & !func<2> & func<1> & !func<0>
ALUctr<1> = !ALUop<2> & !ALUop<1> + ALUop<2> & !func<2> & !func<0>
ALUctr<0> = !ALUop<2> & ALUop<1> + ALUop<2> & !func<3> & func<2> & !func<1> & func<0> + ALUop<2> & func<3> & !func<2> & func<1> & !func<0>
With all the logic equations available, you should be able to implement this logic block without any problem. In your next homework assignment, all your control logic will be done in VHDL: you just describe your control logic as if you are writing a C program. It will be much easier and less error prone than what I show you here. Your TA will have a VHDL tutorial ready for you and it is very easy to learn.
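The three sum-of-products equations can be checked mechanically. The Python sketch below is an illustration, not the course's VHDL: the function name `alu_control` is hypothetical, and it assumes the ALUctr encoding that the equations' func terms imply (And = 000, Or = 001, Add = 010, Subtract = 110, Set-on-less-than = 111). It also assumes that the first product terms of ALUctr<1> and ALUctr<0> test ALUop<1>, which is what the don't-care reasoning in the notes leads to.

```python
def alu_control(aluop: int, func: int) -> int:
    """Return ALUctr<2:0> given ALUop<2:0> and func<3:0>."""
    a2, a1, a0 = (aluop >> 2) & 1, (aluop >> 1) & 1, aluop & 1
    f3, f2, f1, f0 = (func >> 3) & 1, (func >> 2) & 1, (func >> 1) & 1, func & 1

    def inv(bit):
        """Logical NOT of a single bit."""
        return bit ^ 1

    c2 = (inv(a2) & a0) | (a2 & inv(f2) & f1 & inv(f0))
    c1 = (inv(a2) & inv(a1)) | (a2 & inv(f2) & inv(f0))
    c0 = ((inv(a2) & a1)
          | (a2 & inv(f3) & f2 & inv(f1) & f0)
          | (a2 & f3 & inv(f2) & f1 & inv(f0)))
    return (c2 << 2) | (c1 << 1) | c0


# Assumed encoding of the ALU operations (see the lead-in above).
AND, OR, ADD, SUB, SLT = 0b000, 0b001, 0b010, 0b110, 0b111
```

With these assumptions, `alu_control(0b000, f)` yields Add for any func value (lw, sw), `alu_control(0b001, f)` yields Subtract (beq), `alu_control(0b010, f)` yields Or (ori), and `alu_control(0b100, f)` decodes the R-type func field.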
50
The “Truth Table” for the Main Control
[Figure: op (6 bits) feeds Main Control, which produces RegDst, ALUSrc, ... and ALUop (3 bits); ALUop and func feed the local ALU Control, which produces ALUctr]

op                  R-type     ori   lw    sw    beq        jump
RegDst              1          0     0     x     x          x
ALUSrc              0          1     1     1     0          x
MemtoReg            0          0     1     x     x          x
RegWrite            1          1     1     0     0          0
MemWrite            0          0     0     1     0          0
Branch              0          0     0     0     1          0
Jump                0          0     0     0     0          1
ExtOp               x          0     1     1     x          x
ALUop (Symbolic)    “R-type”   Or    Add   Add   Subtract   xxx
ALUop <2>           1          0     0     0     0          x
ALUop <1>           0          1     0     0     0          x
ALUop <0>           0          0     0     0     1          x

Now that we have taken care of the local control (ALU Control), let's refocus our attention on the Main Control. The job of the Main Control is to look at the opcode field of the instruction and generate these control signals for the datapath (RegDst, ... ExtOp) as well as the 3-bit ALUop field for the ALU Control. Here, I have shown you the symbolic value of the ALUop field as well as the actual bit assignment. For example, the R-type ALUop is encoded as 100 and the Add operation is encoded as 000. This is called a quote “Truth Table” unquote because, if you think about it, this is like having the truth table rotated 90 degrees. Let me show you what I mean by that.
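One way to read the “rotated truth table” is as a lookup keyed by instruction class. The following Python sketch is a hypothetical illustration: the names `MAIN_CONTROL`, `SIGNALS`, and `main_control` are invented here, don't cares are represented as `None`, and the zero entries are filled in from the per-instruction slides rather than taken from a complete source table.

```python
X = None  # don't care

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemWrite", "Branch", "Jump", "ExtOp", "ALUop")

# One row per column of the slide's table (the table rotated 90 degrees).
MAIN_CONTROL = {
    "R-type": (1, 0, 0, 1, 0, 0, 0, X, 0b100),
    "ori":    (0, 1, 0, 1, 0, 0, 0, 0, 0b010),
    "lw":     (0, 1, 1, 1, 0, 0, 0, 1, 0b000),
    "sw":     (X, 1, X, 0, 1, 0, 0, 1, 0b000),
    "beq":    (X, 0, X, 0, 0, 1, 0, X, 0b001),
    "jump":   (X, X, X, 0, 0, 0, 1, X, X),
}


def main_control(op: str) -> dict:
    """Return the control signal settings for one instruction class."""
    return dict(zip(SIGNALS, MAIN_CONTROL[op]))
```

Note that the write enables (RegWrite, MemWrite) and the Branch and Jump signals are never don't cares: a stray 1 on any of them would change machine state, so they must be driven to a definite value for every instruction.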
51
Putting it All Together: A Single Cycle Processor
[Figure: the complete single cycle processor. Instr<31:26> (op) feeds Main Control, which produces RegDst, ALUSrc, ... and ALUop (3 bits); Instr<5:0> (func) and ALUop feed ALU Control, which produces ALUctr (3 bits); the Rs, Rt, Rd, and Imm16 (Instr<15:0>) fields feed the datapath (Instruction Fetch Unit, 32 32-bit registers, Extender, ALU, Data Memory)]
OK, now that we have the Main Control implemented, we have everything we need for the single cycle processor, and here it is. The Instruction Fetch Unit gives us the instruction. The op field is fed to the Main Control for decoding and the func field is fed to the ALU Control for local decoding. The Rt, Rs, Rd, and Imm16 fields of the instruction are fed to the datapath. Based on the op field of the instruction, the Main Control sets the control signals RegDst, ALUSrc, etc. properly, as I showed you earlier using separate slides. Furthermore, the ALU Control uses the ALUop from the Main Control and the func field of the instruction to generate the ALUctr signals that ask the ALU to do the right thing: Add, Subtract, Or, and so on. This processor executes each of the MIPS instructions in the subset in one cycle.
52
Worst Case Timing (Load)
[Figure: timing diagram for load. After each delay the named signals change from old value to new value: Clk-to-Q (PC); Instruction Memory Access Time (Rs, Rt, Rd, Op, Func); Delay through Control Logic (ALUctr, ExtOp, ALUSrc, MemtoReg, RegWr); Register File Access Time (busA); Delay through Extender & Mux (busB); ALU Delay (Address); Data Memory Access Time (busW); then the register write occurs]
This timing diagram shows the worst case timing of our single cycle datapath, which occurs for the load instruction. Clock-to-Q time after the clock tick, the PC presents its new value to the Instruction Memory. After a delay of the instruction memory access time, the instruction bus (Rs, Rt, ...) becomes valid. Then three things happen in parallel: (a) the Control generates the control signals (Delay through Control Logic); (b) the register file is accessed to put Rs onto busA; (c) the immediate field is sign extended to get the second operand (busB). Here I assume register file access takes longer than the sign extension, so we have to wait until busA is valid before the ALU can start the address calculation (ALU Delay). With the address ready, we access the data memory, and after a delay of the Data Memory Access Time, busW will be valid. By this time, the control unit will have set the RegWr signal to one, so at the next clock tick we write the new data coming from memory (busW) into the register file.
53
Drawback of this Single Cycle Processor
Long cycle time: cycle time must be long enough for the load instruction:
  PC's Clock-to-Q +
  Instruction Memory Access Time +
  Register File Access Time +
  ALU Delay (address calculation) +
  Data Memory Access Time +
  Register File Setup Time +
  Clock Skew
Cycle time for load is much longer than needed for all other instructions
The last slide pretty much illustrates one of the biggest disadvantages of the single cycle implementation: it has a long cycle time. More specifically, the cycle time must be long enough for the load instruction, which has the following components: Clock-to-Q time of the PC, and so on down the list above. Having a long cycle time is a big problem but not the only problem. Another problem of this single cycle implementation is that this cycle time, which is long enough for the load instruction, is too long for all other instructions. We will show you why this is bad and what we can do about it in the next few lectures. That's all for today.
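A small worked example shows how the load path fixes the cycle time for every instruction. The delay values below are entirely hypothetical (the slides give no numbers), and the names `DELAYS_NS`, `LOAD_PATH`, `RTYPE_PATH`, and `path_delay` are invented for this sketch; the R-type path is assumed to be the load path without the data memory access.

```python
# Hypothetical component delays in nanoseconds (illustrative values only).
DELAYS_NS = {
    "clk_to_q": 1,
    "inst_mem": 10,
    "reg_read": 5,
    "alu": 8,
    "data_mem": 10,
    "reg_setup": 1,
    "clock_skew": 1,
}

# Components on the critical path of a load, in the order listed on the slide.
LOAD_PATH = ["clk_to_q", "inst_mem", "reg_read", "alu",
             "data_mem", "reg_setup", "clock_skew"]

# Assumed path for an R-type instruction: same as load, minus data memory.
RTYPE_PATH = [name for name in LOAD_PATH if name != "data_mem"]


def path_delay(path):
    """Sum the delays of the components along a path."""
    return sum(DELAYS_NS[name] for name in path)


cycle_time = path_delay(LOAD_PATH)                 # the clock must fit the load
wasted_on_rtype = cycle_time - path_delay(RTYPE_PATH)
```

With these made-up numbers the clock period would be 36 ns, and every R-type instruction would leave 10 ns of it unused, which is the inefficiency the slide points to.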
54
Summary
Single cycle datapath => CPI=1, CCT => long
5 steps to design a processor
  1. Analyze instruction set => datapath requirements
  2. Select set of datapath components & establish clock methodology
  3. Assemble datapath meeting the requirements
  4. Analyze implementation of each instruction to determine setting of control points that effects the register transfer.
  5. Assemble the control logic
Control is the hard part
MIPS makes control easier
  - Instructions same size
  - Source registers always in same place
  - Immediates same size, location
  - Operations always on registers/immediates
[Figure: Processor (Control, Datapath), Memory, Input, Output]
55
Multicycle Datapath
56
Partitioning the CPI=1 Datapath
Add registers between smallest steps
[Figure: the CPI=1 datapath split into stages: Next PC / PC, Instruction Fetch, Operand Fetch (Reg. File), Exec, Mem Access (Data Mem), Result Store; control signals nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, RegDst, RegWr]
57
Example Multicycle Datapath
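[Figure: multicycle datapath with registers between stages: Next PC / PC, Instruction Fetch into IR, Operand Fetch (Reg. File) into A and B, Ext and ALU into R, Mem Access (Data Mem) into M, Result Store (Reg File); control signals nPC_sel, ExtOp, ALUSrc, ALUctr, MemRd, MemWr, MemToReg, RegDst, RegWr; status signal Equal]
Critical Path ?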
58
Recall: Step-by-step Processor Design
Step 1: ISA => Logical Register Transfers Step 2: Components of the Datapath Step 3: RTL + Components => Datapath Step 4: Datapath + Logical RTs => Physical RTs Step 5: Physical RTs => Control
59
Step 4: R-type (add, sub, . . .)
Logical Register Transfer
  inst    Logical Register Transfers
  ADDU    R[rd] <- R[rs] + R[rt]; PC <- PC + 4
Physical Register Transfers
  inst    Physical Register Transfers
          IR <- MEM[pc]
  ADDU    A <- R[rs]; B <- R[rt]
          S <- A + B
          R[rd] <- S; PC <- PC + 4
[Figure: multicycle datapath: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, Reg File; status signal Equal]
60
Step 4: Logical immed
Logical Register Transfer
  inst    Logical Register Transfers
  ORI     R[rt] <- R[rs] OR zx(Im16); PC <- PC + 4
Physical Register Transfers
  inst    Physical Register Transfers
          IR <- MEM[pc]
  ORI     A <- R[rs]; B <- R[rt]
          S <- A or ZeroExt(Im16)
          R[rt] <- S; PC <- PC + 4
[Figure: multicycle datapath: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, Reg File; status signal Equal]
61
Step 4: Load
Logical Register Transfer
  inst    Logical Register Transfers
  LW      R[rt] <- MEM(R[rs] + sx(Im16)); PC <- PC + 4
Physical Register Transfers
  inst    Physical Register Transfers
          IR <- MEM[pc]
  LW      A <- R[rs]; B <- R[rt]
          S <- A + SignEx(Im16)
          M <- MEM[S]
          R[rt] <- M; PC <- PC + 4
[Figure: multicycle datapath: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, Reg File; status signal Equal]
62
Step 4: Store
Logical Register Transfer
  inst    Logical Register Transfers
  SW      MEM(R[rs] + sx(Im16)) <- R[rt]; PC <- PC + 4
Physical Register Transfers
  inst    Physical Register Transfers
          IR <- MEM[pc]
  SW      A <- R[rs]; B <- R[rt]
          S <- A + SignEx(Im16)
          MEM[S] <- B; PC <- PC + 4
[Figure: multicycle datapath: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, Reg File; status signal Equal]
63
Step 4: Branch
Logical Register Transfer
  inst    Logical Register Transfers
  BEQ     if R[rs] == R[rt] then PC <- PC + sx(Im16) || 00 else PC <- PC + 4
Physical Register Transfers
  inst         Physical Register Transfers
               IR <- MEM[pc]
  BEQ & ~Eq    PC <- PC + 4
  inst         Physical Register Transfers
               IR <- MEM[pc]
  BEQ & Eq     PC <- PC + sx(Im16) || 00
[Figure: multicycle datapath: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, S, Mem Access, Data Mem, M, Reg File; status signal Equal]
64
Alternative datapath (book): Multiple Cycle Datapath
Minimizes hardware: 1 memory, 1 adder
[Figure: multiple cycle datapath from the book: PC, Ideal Memory (RAdr, WrAdr, Din, Dout), Instruction Reg, Reg File (Ra, Rb, Rw, busA, busB, busW), Extend and << 2 for Imm16, ALU with ALU Control, ALU Out, and Target registers, with muxes between them; control signals PCWr, PCWrCond, PCSrc, BrWr, IorD, MemWr, IRWr, RegDst, RegWr, ALUSelA, ALUSelB, ALUOp, ExtOp, MemtoReg; status signal Zero]
Putting it all together, here it is: the multiple cycle datapath we set out to build.
65
Our Control Model
State specifies control points for Register Transfer
Transfer occurs upon exiting state (same falling edge)
[Figure: State X holds the register transfer control points; Next State Logic takes the inputs (conditions) and the control state, so the next state depends on the input; Output Logic produces the outputs (control points)]
66
Step 4 => Control Specification for multicycle proc
“instruction fetch”:           IR <= MEM[PC]
“decode / operand fetch”:      A <= R[rs]; B <= R[rt]
Then, by instruction (Execute, Memory, Write-back):
  R-type:         Execute: S <= A fun B.  Write-back: R[rd] <= S; PC <= PC + 4
  ORi:            Execute: S <= A or ZX.  Write-back: R[rt] <= S; PC <= PC + 4
  LW:             Execute: S <= A + SX.   Memory: M <= MEM[S].  Write-back: R[rt] <= M; PC <= PC + 4
  SW:             Execute: S <= A + SX.   Memory: MEM[S] <= B; PC <= PC + 4
  BEQ & Equal:    Execute: PC <= PC + SX || 00
  BEQ & ~Equal:   Execute: PC <= PC + 4
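The control specification above is a finite state machine, and its next-state function can be sketched directly. The Python below is an illustration under assumptions: the symbolic state names (`rtype_exec`, `lw_mem`, and so on) and the helpers `next_state` and `cycles` are hypothetical, chosen only to mirror the states named in the diagram.

```python
# States that have exactly one successor, regardless of inputs.
NEXT = {
    "fetch": "decode",
    "rtype_exec": "rtype_wb",
    "ori_exec": "ori_wb",
    "lw_exec": "lw_mem",
    "lw_mem": "lw_wb",
    "sw_exec": "sw_mem",
}

# Dispatch out of the decode state, keyed by instruction class.
DISPATCH = {"R-type": "rtype_exec", "ORi": "ori_exec",
            "LW": "lw_exec", "SW": "sw_exec"}


def next_state(state, op=None, equal=False):
    """Next control state given the current state, opcode class, and Equal."""
    if state == "decode":
        if op == "BEQ":
            return "beq_taken" if equal else "beq_not_taken"
        return DISPATCH[op]
    # Any state without a listed successor ends the instruction.
    return NEXT.get(state, "fetch")


def cycles(op, equal=False):
    """Count the states visited to execute one instruction."""
    state, count = "fetch", 0
    while True:
        state = next_state(state, op, equal)
        count += 1
        if state == "fetch":
            return count
```

Counting states this way gives 5 cycles for LW, 4 for R-type, ORi, and SW, and 3 for BEQ, which is the payoff of the multicycle design: short instructions no longer pay for the load's full path.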
67
Step 5: datapath + state diagram => control
Translate RTs into control points Assign states Then go build the controller
68
Mapping RTs to Control Points
IR <= MEM[PC] “instruction fetch” imem_rd, IRen A <= R[rs] B <= R[rt] “decode” Aen, Ben LW BEQ & Equal R-type ORi SW BEQ & ~Equal Execute S <= A fun B PC <= PC + SX || 00 PC <= PC + 4 ALUfun, Sen S <= A or ZX S <= A + SX S <= A + SX Memory M <= MEM[S] MEM[S] <= B PC <= PC + 4 R[rd] <= S PC <= PC + 4 RegDst, RegWr, PCen Write-back R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4
69
Assigning States “instruction fetch” IR <= MEM[PC] 0000 “decode”
A <= R[rs] B <= R[rt] “decode” 0001 LW BEQ & Equal R-type ORi SW BEQ & ~Equal Execute PC <= PC + SX || 00 S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= PC + 4 0100 0110 1000 1011 0011 0010 Memory M <= MEM[S] MEM[S] <= B PC <= PC + 4 1001 1100 Write-back R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 0101 0111 1010
70
Detailed Control Specification
State Op field Eq Next IR PC Ops Exec Mem Write-Back en sel A B Ex Sr ALU S R W M M-R Wr Dst 0000 ?????? ? 0001 BEQ 0001 BEQ 0001 R-type x 0001 orI x 0001 LW x 0001 SW x 0010 xxxxxx x 0011 xxxxxx x 0100 xxxxxx x fun 1 0101 xxxxxx x 0110 xxxxxx x or 1 0111 xxxxxx x 1000 xxxxxx x add 1 1001 xxxxxx x 1010 xxxxxx x 1011 xxxxxx x add 1 1100 xxxxxx x -all same in Moore machine R: ORi: LW: SW:
71
Controller Design
The state diagrams that define the controller for an instruction set processor are highly structured
Use this structure to construct a simple “microsequencer”
Control reduces to programming this very simple device: microprogramming
[Figure: sequencer control and datapath control fields of a microinstruction; micro-PC and sequencer]
72
Example: Jump-Counter
i i 0000 i+1 Map ROM op-code zero inc load Counter
73
Using a Jump Counter “instruction fetch” IR <= MEM[PC] 0000 inc
A <= R[rs] B <= R[rt] “decode” 0001 load inc LW BEQ & Equal R-type ORi SW BEQ & ~Equal Execute PC <= PC + SX || 00 S <= A fun B S <= A or ZX S <= A + SX S <= A + SX PC <= PC + 4 0100 0110 1000 1011 0011 0010 inc inc inc inc Memory zero zero M <= MEM[S] MEM[S] <= B PC <= PC + 4 1001 1100 inc Write-back R[rd] <= S PC <= PC + 4 R[rt] <= S PC <= PC + 4 R[rt] <= M PC <= PC + 4 zero zero 0101 0111 1010 zero zero
74
Our Microsequencer taken datapath control Z I L Micro-PC op-code
Map ROM
75
Microprogram Control Specification
µPC Taken Next IR PC Ops Exec Mem Write-Back en sel A B Ex Sr ALU S R W M M-R Wr Dst 0000 ? inc 1 load inc 0010 x zero 0011 x zero 0100 x inc fun 1 0101 x zero 0110 x inc or 1 0111 x zero 1000 x inc add 1 1001 x inc x zero 1011 x inc add 1 1100 x zero BEQ R: ORi: LW: SW:
76
Mapping ROM
          op-code   state
R-type    000000    0100
BEQ       000100    0011
ori       001101    0110
LW        100011    1000
SW        101011    1011
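The jump counter and mapping ROM together form a very small sequencer, which the following Python sketch models. It is a hedged reading of the slides rather than a definitive implementation: the names `MAP_ROM`, `NEXT_CTL`, and `next_upc` are hypothetical, the LW and SW entries assume the standard MIPS opcodes and the execute states from the state assignment, and "taken" is assumed to mean BEQ with Equal asserted, which steps the counter from 0001 to 0010.

```python
# Mapping ROM: op-code -> micro-PC of the instruction's first execute state.
# The LW and SW rows are assumptions (standard MIPS opcodes).
MAP_ROM = {
    0b000000: 0b0100,  # R-type
    0b000100: 0b0011,  # BEQ (not taken)
    0b001101: 0b0110,  # ori
    0b100011: 0b1000,  # LW
    0b101011: 0b1011,  # SW
}

# Counter control for every micro-PC except 0001 (decode), which dispatches.
NEXT_CTL = {
    0b0000: "inc",
    0b0010: "zero", 0b0011: "zero",
    0b0100: "inc", 0b0101: "zero",
    0b0110: "inc", 0b0111: "zero",
    0b1000: "inc", 0b1001: "inc", 0b1010: "zero",
    0b1011: "inc", 0b1100: "zero",
}


def next_upc(upc, opcode, taken=False):
    """Next micro-PC: counter is zeroed, incremented, or loaded from the ROM."""
    if upc == 0b0001:
        ctl = "inc" if taken else "load"
    else:
        ctl = NEXT_CTL[upc]
    if ctl == "zero":
        return 0
    if ctl == "inc":
        return upc + 1
    return MAP_ROM[opcode]


def trace(opcode, taken=False):
    """List the micro-PC values visited by one instruction, starting at fetch."""
    upc, visited = 0, [0]
    while True:
        upc = next_upc(upc, opcode, taken)
        if upc == 0:
            return visited
        visited.append(upc)
```

The only opcode-dependent step is the single load at micro-PC 0001; every other transition is either "add one" or "go back to zero", which is why the next-state logic shrinks to a counter and a small ROM.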
77
Overview of Control
Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique.

                            “hardwired control”     “microprogrammed control”
Initial Representation      Finite State Diagram    Microprogram
Sequencing Control          Explicit Next State     Microprogram counter
                            Function                + Dispatch ROMs
Logic Representation        Logic Equations         Truth Tables
Implementation Technique    PLA                     ROM
78
Summary
Disadvantages of the Single Cycle Processor
  - Long cycle time
  - Cycle time is too long for all instructions except the Load
Multiple Cycle Processor:
  - Divide the instructions into smaller steps
  - Execute each step (instead of the entire instruction) in one cycle
Partition datapath into equal size chunks to minimize cycle time
  - ~10 levels of logic between latches
Follow same 5-step method for designing “real” processor
79
Summary (cont'd)
Control is specified by finite state diagram
Specialized state diagrams are easily captured by a microsequencer
  - simple increment & “branch” fields
  - datapath control fields
Control design reduces to microprogramming
Control is more complicated with:
  - complex instruction sets
  - restricted datapaths (see the book)
Simple instruction set and powerful datapath => simple control
  - could try to reduce hardware (see the book)
  - rather go for speed => many instructions at once!
80
Our Controller FSM Spec
IR <= MEM[PC] PC <= PC + 4 “instruction fetch” 0000 A <= R[rs] B <= R[rt] “decode” 0001 Equal BEQ PC <= PC + SX || 00 0010 0011 S <= A - B LW R-type ORi SW Execute S <= A fun B S <= A op ZX S <= A + SX S <= A + SX 0100 0110 1000 1011 ~Equal Memory M <= MEM[S] MEM[S] <= B 1001 1100 Write-back R[rd] <= S R[rt] <= S R[rt] <= M 0101 0111 1010
81
Microprogramming
Control is the hard part of processor design
  - Datapath is fairly regular and well-organized
  - Memory is highly regular
  - Control is irregular and global
Microprogramming: a particular strategy for implementing the control unit of a processor by "programming" at the level of register transfer operations
Microarchitecture: logical structure and functional capabilities of the hardware as seen by the microprogrammer
Historical Note: IBM 360 Series first to distinguish between architecture & organization. Same instruction set across wide range of implementations, each with different cost/performance
82
Sequencer-based control unit
Control Logic Multicycle Datapath Outputs Inputs Types of “branching” • Set state to 0 • Dispatch (state 1) • Use incremented state number 1 State Reg Adder Address Select Logic Opcode
83
Designing a Microinstruction Set
1) Start with list of control signals
2) Group signals together that make sense (vs. random): called “fields”
3) Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last)
4) Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals
   Use computers to design computers
5) To minimize the width, encode operations that will never be used at the same time
84
1&2) Start with list of control signals, grouped into fields
Single Bit Control
Signal name     Effect when deasserted          Effect when asserted
ALUSelA         1st ALU operand = PC            1st ALU operand = Reg[rs]
RegWrite        None                            Reg. is written
MemtoReg        Reg. write data input = ALU     Reg. write data input = memory
RegDst          Reg. dest. no. = rt             Reg. dest. no. = rd
TargetWrite     None                            Target reg. = ALU
MemRead         None                            Memory at address is read
MemWrite        None                            Memory at address is written
IorD            Memory address = PC             Memory address = ALU
IRWrite         None                            IR = Memory
PCWrite         None                            PC = PCSource
PCWriteCond     None                            IF ALUzero then PC = PCSource

Multiple Bit Control
Signal name     Value   Effect
ALUOp           00      ALU adds
                01      ALU subtracts
                10      ALU does function code
                11      ALU does logical OR
ALUSelB         000     2nd ALU input = Reg[rt]
                001     2nd ALU input = 4
                010     2nd ALU input = sign extended IR[15-0]
                011     2nd ALU input = sign extended, shift left 2 IR[15-0]
                100     2nd ALU input = zero extended IR[15-0]
PCSource        00      PC = ALU
                01      PC = Target
                10      PC = PC+4[29-26] : IR[25-0] << 2
85
Start with list of control signals, cont’d
For next state function (next microinstruction address), use Sequencer-based control unit from last lecture
Called “microPC” or “µPC” vs. state register

Signal       Value   Effect
Sequencing   00      Next address = 0
             01      Next address = dispatch ROM
             10      Next address = address + 1

[Figure: µPC register, adder (+1), mux, address select logic, and a ROM indexed by the opcode]
86
3) Microinstruction Format: unencoded vs. encoded fields
Field Name          Width (wide)   Width (narrow)   Control Signals Set
ALU Control         4              2                ALUOp
SRC1                2              1                ALUSelA
SRC2                5              3                ALUSelB
ALU Destination     6              4                RegWrite, MemtoReg, RegDst, TargetWr.
Memory              4              3                MemRead, MemWrite, IorD
Memory Register     1              1                IRWrite
PCWrite Control     5              4                PCWrite, PCWriteCond, PCSource
Sequencing          3              2                AddrCtl
Total width         30             20               bits
87
4) Legend of Fields and Symbolic Names
Field Name         Values for Field   Function of Field with Specific Value
ALU                Add                ALU adds
                   Subt.              ALU subtracts
                   Func code          ALU does function code
                   Or                 ALU does logical OR
SRC1               PC                 1st ALU input = PC
                   rs                 1st ALU input = Reg[rs]
SRC2               4                  2nd ALU input = 4
                   Extend             2nd ALU input = sign ext. IR[15-0]
                   Extend0            2nd ALU input = zero ext. IR[15-0]
                   Extshft            2nd ALU input = sign ex., sl IR[15-0]
                   rt                 2nd ALU input = Reg[rt]
ALU destination    Target             Target = ALUout
                   rd                 Reg[rd] = ALUout
Memory             Read PC            Read memory using PC
                   Read ALU           Read memory using ALU output
                   Write ALU          Write memory using ALU output
Memory register    IR                 IR = Mem
                   Write rt           Reg[rt] = Mem
                   Read rt            Mem = Reg[rt]
PC write           ALU                PC = ALU output
                   Target-cond.       IF ALU Zero then PC = Target
                   jump addr.         PC = PCSource
Sequencing         Seq                Go to sequential µinstruction
                   Fetch              Go to the first microinstruction
                   Dispatch           Dispatch using ROM.

Note: it is possible to specify combinations of fields that may not be possible or may not work properly given the datapath (e.g., ALU operand and write register in a single cycle)
88
Microprogramming Pros and Cons
Ease of design
Flexibility
  Easy to adapt to changes in organization, timing, technology
  Can make changes late in design cycle, or even in the field
Can implement very powerful instruction sets (just more control memory)
Generality
  Can implement multiple instruction sets on same machine
  Can tailor instruction set to application
Compatibility
  Many organizations, same instruction set
Costly to implement
Slow
89
Exceptions
[Figure: a user program transfers control to the system exception handler on an exception and resumes after the return from exception]
Normal control flow: sequential, jumps, branches, calls, returns
Exception = unprogrammed control transfer
  system takes action to handle the exception
    must record the address of the offending instruction
  returns control to user
    must save and restore user state
Allows construction of a “user virtual machine”
90
What happens to Instruction with Exception?
The MIPS architecture defines the instruction as having no effect if the instruction causes an exception. When we get to virtual memory we will see that certain classes of exceptions must prevent the instruction from changing the machine state. This aspect of handling exceptions becomes complex and potentially limits performance, which is why it is hard.
91
Two Types of Exceptions
Interrupts
  caused by external events
  asynchronous to program execution
  may be handled between instructions
  simply suspend and resume user program
Traps
  caused by internal events
    exceptional conditions (overflow)
    errors (parity)
    faults (non-resident page)
  synchronous to program execution
  condition must be remedied by the handler
  instruction may be retried or simulated and program continued, or program may be aborted
92
MIPS convention: exception means any unexpected change in control flow, without distinguishing internal or external; use the term interrupt only when the event is externally caused.
Type of event: from where? / MIPS terminology
I/O device request: External / Interrupt
Invoke OS from user program: Internal / Exception
Arithmetic overflow: Internal / Exception
Using an undefined instruction: Internal / Exception
Hardware malfunctions: Either / Exception or Interrupt
93
Additions to MIPS ISA to support Exceptions?
EPC: a 32-bit register used to hold the address of the affected instruction (register 14 of coprocessor 0).
Cause: a register used to record the cause of the exception. In the MIPS architecture this register is 32 bits, though some bits are currently unused. Assume that bits 5 to 2 of this register encode the two possible exception sources mentioned above: undefined instruction = 0 and arithmetic overflow = 1 (register 13 of coprocessor 0).
BadVAddr: register containing the memory address at which the offending memory reference occurred (register 8 of coprocessor 0)
Status: interrupt mask and enable bits (register 12 of coprocessor 0)
Control signals to write EPC, Cause, BadVAddr, and Status
Be able to write the exception address into the PC: widen the PC mux to add the exception address as an input
May have to undo PC = PC + 4, since we want EPC to point to the offending instruction (not its successor); PC = PC - 4
94
How Control Detects Exceptions in our FSD
Undefined instruction: detected when no next state is defined from state 1 for the op value. We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as new state 12. Shown symbolically using “other” to indicate that the op field does not match any of the opcodes that label arcs out of state 1.
Arithmetic overflow: Chapter 4 included logic in the ALU to detect overflow, and a signal called Overflow is provided as an output from the ALU. This signal is used in the modified finite state machine to specify an additional possible next state.
Note: The challenge in designing the control of a real machine is to handle the different interactions between instructions and other exception-causing events such that the control logic remains small and fast. Complex interactions make the control unit the most challenging aspect of hardware design.
95
Modification to the Control Specification
[Figure: modified finite state diagram with two added exception states]
Fetch: IR <= MEM[PC]; PC <= PC + 4
Decode: A <= R[rs]; B <= R[rt]
Dispatch on opcode:
  other (undefined instruction): EPC <= PC - 4; PC <= exp_addr; cause <= 10 (RI)
  BEQ: S <= A - B; if Equal, PC <= PC + SX || 00
  R-type: S <= A fun B; then R[rd] <= S
  ORi: S <= A op ZX; then R[rt] <= S
  LW: S <= A + SX; M <= MEM[S]; R[rt] <= M
  SW: S <= A + SX; MEM[S] <= B
Additional condition from datapath (overflow): EPC <= PC - 4; PC <= exp_addr; cause <= 12 (Ovf)
96
Summary
Specialized state diagrams are easily captured by a microsequencer
  simple increment and “branch” fields
  datapath control fields
Control design reduces to microprogramming
Exceptions are the hard part of control
  Need to find a convenient place to detect exceptions and to branch to the state or microinstruction that saves the PC and invokes the operating system
  It gets even harder with pipelined CPUs that support page faults on memory accesses: the faulting instruction cannot complete, and you must be able to restart the program at exactly the instruction that caused the exception
97
Pipelining
98
Pipelining is Natural! Laundry Example
Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
99
Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 minutes, one after another]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
100
Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped so that each stage starts as soon as it is free]
Pipelined laundry takes 3.5 hours for 4 loads
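The 6-hour and 3.5-hour figures can be reproduced with a small calculation. This is a sketch under one stated assumption: a finished load may wait between stages (for example, wet clothes sit in a basket until the dryer is free). The function names are illustrative.

```python
STAGE_MINUTES = (30, 40, 20)  # washer, dryer, "folder"


def sequential_minutes(stage_minutes, loads):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load enters a stage as soon as both it and the stage are ready."""
    stage_free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
    return stage_free_at[-1]


print(sequential_minutes(STAGE_MINUTES, 4) / 60)  # hours, sequential
print(pipelined_minutes(STAGE_MINUTES, 4) / 60)   # hours, pipelined
```

The pipelined total is 30 + 4 x 40 + 20 minutes: the 40-minute dryer, the slowest stage, sets the rate once the pipeline is full.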
101
Pipelining Lessons
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously using different resources
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for Dependences
[Figure: pipelined laundry timeline from 6 PM to 9 PM for loads A, B, C, D]
102
Pipelined Execution
[Figure: instructions overlapped in time along the program flow, each passing through IFetch, Dcd, Exec, Mem, WB]
Utilization?
Now we just have to make it work
103
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing diagrams. Single Cycle Implementation: Load, then Store with the end of its cycle marked Waste. Multiple Cycle Implementation (Cycles 1 to 10): Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), then the Ifetch of the R-type. Pipeline Implementation: Load, Store, and R-type overlapped, each shown as Ifetch, Reg, Exec, Mem, Wr.]
Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than that of the multiple cycle implementation. But maybe more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of the cycle here is wasted.
104
Why Pipeline?
Suppose we execute 100 instructions
Single Cycle Machine
  45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
  10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine
  10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
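The three execution times follow from time = cycle time x cycles. A minimal sketch of that arithmetic in Python, using only the numbers on this slide; the function names and default arguments are illustrative.

```python
def single_cycle_ns(instructions, cycle_ns=45):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * instructions


def multicycle_ns(instructions, cycle_ns=10, cpi=4.6):
    """Short cycles, but several per instruction (CPI from the instruction mix)."""
    return cycle_ns * cpi * instructions


def ideal_pipelined_ns(instructions, cycle_ns=10, drain_cycles=4):
    """One instruction completes per cycle, plus the cycles needed to drain."""
    return cycle_ns * (instructions + drain_cycles)


for name, fn in [("single cycle", single_cycle_ns),
                 ("multicycle", multicycle_ns),
                 ("ideal pipeline", ideal_pipelined_ns)]:
    print(f"{name}: {fn(100):.0f} ns")
```

The ideal pipeline keeps the short 10 ns cycle of the multicycle machine while approaching the CPI of 1 of the single cycle machine.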
105
Why Pipeline? Because the resources are there!
[Figure: time (clock cycles) vs. instruction order; Inst 0 through Inst 4 each use Im, Reg, ALU, Dm, Reg in successive cycles, so in any one cycle different instructions use different resources]
106
Can pipelining get us into trouble?
Yes: Pipeline Hazards
structural hazards: attempt to use the same resource two different ways at the same time
  E.g., combined washer/dryer would be a structural hazard, or folder busy doing something else (watching TV)
data hazards: attempt to use item before it is ready
  E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
  instruction depends on result of prior instruction still in the pipeline
control hazards: attempt to make a decision before condition is evaluated
  E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
  branch instructions
Can always resolve hazards by waiting
  pipeline control must detect the hazard
  take action (or delay action) to resolve hazards
107
Summary 1/3
Specialized state diagrams are easily captured by a microsequencer
  simple increment and “branch” fields
  datapath control fields
Control design reduces to microprogramming
Exceptions are the hard part of control
  Need to find a convenient place to detect exceptions and to branch to the state or microinstruction that saves the PC and invokes the operating system
  It gets even harder with pipelined CPUs that support page faults on memory accesses: the faulting instruction cannot complete, and you must be able to restart the program at exactly the instruction that caused the exception
108
Summary 2/3
Microprogramming is a fundamental concept
  implement an instruction set by building a very simple processor and interpreting the instructions
  essential for very complex instructions and when few register transfers are possible
Pipelining is a fundamental concept
  multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined instruction processing
  start next instruction while working on the current one
  limited by length of longest stage (plus fill/flush)
  detect and resolve hazards
109
The Five Stages of Load
[Figure: Load occupying Cycle 1 to Cycle 5 as Ifetch, Reg/Dec, Exec, Mem, Wr]
Ifetch: Instruction Fetch
  Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
As shown here, each of these five steps will take one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.
110
Pipelining
Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
111
Basic Idea What do we need to add to actually split the datapath into stages?
112
Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths
113
Conventional Pipelined Execution Representation
[Figure: instructions overlapped in time along the program flow, each passing through IFetch, Dcd, Exec, Mem, WB]
115
Single Memory is a Structural Hazard
[Figure: time (clock cycles) vs. instruction order for Load, Instr 1, Instr 2, Instr 3, Instr 4, each using Mem, Reg, ALU, Mem, Reg; with a single memory, a data access by Load and a later instruction fetch fall in the same cycle]
Detection is easy in this case! (right half highlight means read, left half write)
116
Control Hazard Solutions
Stall: wait until decision is clear
  It’s possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
Impact: 2 clock cycles per branch instruction => slow
[Figure: time (clock cycles) vs. instruction order for Add, Beq, Load]
117
Control Hazard Solutions
Predict: guess one direction then back up if wrong
  Predict not taken
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ≈ 50% of time)
More dynamic scheme: history of 1 branch (≈ 90%)
[Figure: time (clock cycles) vs. instruction order for Add, Beq, Load]
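The cost of prediction can be put as an expected number of cycles per branch. A minimal sketch using the slide's figures (1 cycle when the guess is right, 2 when it is wrong); the function name and the exact probabilities 0.5 and 0.9 are illustrative stand-ins for the approximate rates quoted above.

```python
def avg_cycles_per_branch(p_correct, cycles_right=1, cycles_wrong=2):
    """Expected cycles per branch for a predictor that is right with probability p_correct."""
    return p_correct * cycles_right + (1 - p_correct) * cycles_wrong


always_stall = avg_cycles_per_branch(0.0)       # never "right": always pay 2 cycles
predict_not_taken = avg_cycles_per_branch(0.5)  # right about half the time
one_bit_history = avg_cycles_per_branch(0.9)    # right about 90% of the time

print(always_stall, predict_not_taken, round(one_bit_history, 2))
```

Under these assumptions a better predictor moves the average from 2 cycles per branch toward 1.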
118
Control Hazard Solutions
Redefine branch behavior (takes place after next instruction): “delayed branch”
Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (≈ 50% of time)
Less useful as more instructions are launched per clock cycle
[Figure: time (clock cycles) vs. instruction order for Add, Beq, Misc, Load]
119
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
120
Data Hazard on r1:
Dependencies backwards in time are hazards
[Figure: time (clock cycles) vs. instruction order through stages IF, ID/RF, EX, MEM, WB for
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
each using Im, Reg, ALU, Dm, Reg]
121
Data Hazard Solution:
“Forward” result from one stage to another
“or” OK if define read/write properly
[Figure: time (clock cycles) vs. instruction order through stages IF, ID/RF, EX, MEM, WB for
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
each using Im, Reg, ALU, Dm, Reg]
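Which consumers of r1 need forwarding can be worked out from how many instructions they sit behind the producer. The sketch below assumes the five-stage pipeline on this slide (registers read in stage 2, written in stage 5) and a register file that writes in the first half of a cycle and reads in the second; it ignores later redefinitions of the register. The names `parse` and `classify_consumers` are illustrative.

```python
PROGRAM = [
    "add r1,r2,r3",
    "sub r4,r1,r3",
    "and r6,r1,r7",
    "or r8,r1,r9",
    "xor r10,r1,r11",
]


def parse(instr):
    """Split 'op rd,rs,rt' into (op, destination, [sources])."""
    op, operands = instr.split(None, 1)
    dest, *sources = [r.strip() for r in operands.split(",")]
    return op, dest, sources


def classify_consumers(program, producer_index=0):
    """Label each later instruction that reads the producer's destination."""
    _, dest, _ = parse(program[producer_index])
    labels = []
    for i in range(producer_index + 1, len(program)):
        op, _, sources = parse(program[i])
        if dest not in sources:
            continue
        distance = i - producer_index
        if distance <= 2:
            labels.append((op, "forward"))      # reads before the producer's WB
        elif distance == 3:
            labels.append((op, "split-cycle"))  # reads in the same cycle as WB
        else:
            labels.append((op, "none"))         # reads after WB
    return labels


print(classify_consumers(PROGRAM))
```

With these assumptions `sub` and `and` need forwarding, `or` works because of the write-then-read register file, and `xor` sees the value normally.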
122
Forwarding (or Bypassing): What about Loads
Dependencies backwards in time are hazards
Can’t solve with forwarding: must delay/stall instruction dependent on loads
[Figure: time (clock cycles) through stages IF, ID/RF, EX, MEM, WB for
lw r1,0(r2)
sub r4,r1,r3
the loaded value leaves Dm after the cycle in which sub needs it in the ALU]
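A load-use hazard of this kind can be detected by comparing the load's destination with the next instruction's sources. A minimal sketch, assuming the simple `op rd,rs,rt` and `lw rt,offset(rs)` text forms used on this slide; the helper names are hypothetical, and stores and branches (which have no destination register) are not handled.

```python
def dest_and_sources(instr):
    """Return (op, destination register, [source registers])."""
    op, operands = instr.split(None, 1)
    parts = [p.strip() for p in operands.split(",")]
    if op == "lw":
        # lw rt, offset(rs): rt is written, rs is read
        base = parts[1][parts[1].index("(") + 1:parts[1].index(")")]
        return op, parts[0], [base]
    return op, parts[0], parts[1:]


def load_use_stall(prev_instr, next_instr):
    """True if next_instr reads the register that prev_instr loads."""
    prev_op, prev_dest, _ = dest_and_sources(prev_instr)
    _, _, next_sources = dest_and_sources(next_instr)
    return prev_op == "lw" and prev_dest in next_sources


print(load_use_stall("lw r1,0(r2)", "sub r4,r1,r3"))   # dependent on the load
print(load_use_stall("add r1,r2,r3", "sub r4,r1,r3"))  # ALU result: forwarding suffices
```

When the check is true, the dependent instruction is held for a cycle; forwarding from the Mem stage then supplies the loaded value.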
123
Designing a Pipelined Processor
Go back and examine your datapath and control diagram
associate resources with states
ensure that flows do not conflict, or figure out how to resolve
assert control in appropriate stage
124
Pipelined Processor (almost) for slides
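What happens if we start a new instruction every cycle?
[Figure: pipelined datapath with Next PC, PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access, Data Mem, M, and Reg. File write back; the instruction register is carried along as IRex, IRmem, IRwb, each with a Valid bit, feeding Dcd Ctrl, Ex Ctrl, Mem Ctrl, and WB Ctrl; Equal feeds Next PC]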
125
Control and Datapath
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond PC <- PC+SX;
M <- Mem[S]; Mem[S] <- B
R[rd] <- S; R[rt] <- S; R[rd] <- M;
[Figure: datapath with Next PC, PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg. File, and the Equal signal]
126
Pipelining the Load Instruction
[Figure: Cycles 1 to 7; the 1st, 2nd, and 3rd lw each pass through Ifetch, Reg/Dec, Exec, Mem, Wr, each starting one cycle after the previous one]
The five independent functional units in the pipeline datapath are:
  Instruction Memory for the Ifetch stage
  Register File’s Read ports (bus A and bus B) for the Reg/Dec stage
  ALU for the Exec stage
  Data Memory for the Mem stage
  Register File’s Write port (bus W) for the Wr stage
For the load instructions, the five independent functional units in the pipeline datapath are: (a) Instruction Memory for the Ifetch stage. (b) Register File’s Read ports for the Reg/Decode stage. (c) ALU for the Exec stage. (d) Data Memory for the Mem stage. (e) And finally the Register File’s Write port for the Write Back stage. Notice that I have treated the Register File’s read and write ports as separate functional units because the register file we have allows us to read and write at the same time. Notice that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is only used ONCE per instruction, we will not have any conflict down the pipeline (Exec-Ifet, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later. But for now, I want to point out the performance advantages of pipelining. If these 3 load instructions are executed by the multiple cycle processor, it will take 15 cycles. But with pipelining, it only takes 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantages of pipelining. A better way to look at it is that one instruction enters the pipeline every cycle, so one instruction comes out of the pipeline (Wr stage) every cycle. Consequently, the “effective” (or average) number of cycles per instruction is now ONE even though it takes a total of 5 cycles to complete each instruction.
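The 15-cycle and 7-cycle counts, and the claim that the effective CPI tends to one, can be checked with a short calculation. A sketch assuming an ideal five-stage pipeline with no stalls; the function names are illustrative.

```python
def multicycle_cycles(instructions, cycles_per_instruction=5):
    """Each instruction finishes before the next one starts."""
    return instructions * cycles_per_instruction


def pipelined_cycles(instructions, stages=5):
    """Fill the pipeline once, then one instruction completes per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


print(multicycle_cycles(3), pipelined_cycles(3))

# Effective CPI of the pipeline approaches 1 as the instruction count grows.
for n in (3, 100, 10_000):
    print(n, pipelined_cycles(n) / n)
```

For short sequences the fill time dominates (7 cycles for 3 loads is a CPI above 2), which is why the per-cycle completion rate is the better measure.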
127
The Four Stages of R-type
[Figure: R-type occupying Cycle 1 to Cycle 4 as Ifetch, Reg/Dec, Exec, Wr]
Ifetch: Instruction Fetch
  Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
  ALU operates on the two register operands
  Update PC
Wr: Write the ALU output back to the register file
So far so good. Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory, so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to those of the Load instruction. They have to be, because at this point we do not yet know that we have an R-type instruction. Instead of calculating the effective address during the Exec stage, the R-type instruction uses the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr stage.
128
Pipelining the R-type and Load Instruction
[Figure: Cycles 1 to 9; R-type, R-type, Load, R-type, R-type issued in successive cycles, the R-types as Ifetch, Reg/Dec, Exec, Wr and the Load as Ifetch, Reg/Dec, Exec, Mem, Wr. Caption: Oops! We have a problem!]
We have a pipeline conflict or structural hazard:
  Two instructions try to write to the register file at the same time!
  Only one write port
What happens if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here! We end up having two instructions trying to write to the register file at the same time. Why do we have this problem (the write “bubble”)? The reason is that there is something I have not yet told you.
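The write-port collision can be found mechanically by computing the cycle in which each instruction reaches its Wr stage. A minimal sketch, assuming one instruction enters the pipeline per cycle starting at cycle 1; the tables `FOUR_STAGE_RTYPE` and `FIVE_STAGE_RTYPE` and the function name are illustrative.

```python
FOUR_STAGE_RTYPE = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}

# Same pipeline with a do-nothing Mem stage added to R-type instructions.
FIVE_STAGE_RTYPE = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}


def write_port_conflicts(sequence, stages):
    """Map each cycle with more than one register write to the instruction indices involved."""
    writers = {}
    for index, kind in enumerate(sequence):
        start_cycle = index + 1  # one new instruction per cycle
        write_cycle = start_cycle + stages[kind].index("Wr")
        writers.setdefault(write_cycle, []).append(index)
    return {cycle: idx for cycle, idx in writers.items() if len(idx) > 1}


SEQUENCE = ["R-type", "R-type", "Load", "R-type", "R-type"]
print(write_port_conflicts(SEQUENCE, FOUR_STAGE_RTYPE))
print(write_port_conflicts(SEQUENCE, FIVE_STAGE_RTYPE))
```

Under these assumptions the Load and the R-type issued right after it both reach Wr in the same cycle; giving every instruction its Wr in the same stage number removes the collision.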
129
Important Observation
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
  Load uses Register File’s Write Port during its 5th stage (Load: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr)
  R-type uses Register File’s Write Port during its 4th stage (R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Wr)
2 ways to solve this pipeline hazard.
I already told you that for the pipeline to work perfectly, each functional unit can ONLY be used once per instruction. What I have not told you is that this (1st bullet) is a necessary but NOT sufficient condition for the pipeline to work. The other condition to prevent pipeline hiccups is that each functional unit must be used at the same stage for all instructions. For example here, the load instruction uses the Register File’s Write port during its 5th stage, but the R-type instruction right now uses the Register File’s Write port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions.
130
Solution 1: Insert “Bubble” into the Pipeline
[Figure: Cycles 1 to 9; a Load (Ifetch, Reg/Dec, Exec, Mem, Wr) among R-type instructions (Ifetch, Reg/Dec, Exec, Wr), with a pipeline bubble delaying the R-types that follow the Load]
Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
  The control logic can be complex.
  Lose instruction fetch and issue opportunity.
No instruction is started in Cycle 6!
The first solution is to insert a “bubble” into the pipeline AFTER the load instruction to push back by one cycle every instruction after the load that is already in the pipeline. At the same time, the bubble delays by one cycle the Instruction Fetch of the instruction that is about to enter the pipeline. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) the Load instruction has, we will not have one instruction finishing every cycle (points to Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let’s try something else.
131
Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle:
  Now R-type instructions also use Reg File’s write port at Stage 5
  Mem stage is a NOOP stage: nothing is being done.
R-type: 1 Ifetch, 2 Reg/Dec, 3 Exec, 4 Mem, 5 Wr
[Figure: Cycles 1 to 9; R-type, R-type, Load, R-type, R-type issued in successive cycles, every instruction now shown as Ifetch, Reg/Dec, Exec, Mem, Wr]
One thing we can do is add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline.
132
Modified Control & Datapath
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond PC <- PC+SX;
M <- S; M <- S; M <- Mem[S]; Mem[S] <- B
R[rd] <- M; R[rt] <- M; R[rd] <- M;
[Figure: datapath with Next PC, PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg. File, and the Equal signal]
133
The Four Stages of Store
[Figure: Store occupying Cycle 1 to Cycle 4 as Ifetch, Reg/Dec, Exec, Mem, followed by an unused Wr]
Ifetch: Instruction Fetch
  Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory
Let’s continue by looking at the store instruction. Once again, the Ifetch and Reg/Decode stages are the same as for all other instructions. The Exec stage of the store instruction calculates the memory address. Once the address is calculated, the store instruction writes the data it read from the register file back at the Reg/Decode stage into the data memory during the Mem stage. Notice that unlike the load instruction, which takes five cycles to accomplish its task, the Store instruction only takes four cycles or four pipe stages. To keep our pipeline diagrams looking more uniform, however, we will keep the Wr stage for the store instruction in the pipeline diagram. But keep in mind that as far as the pipelined control and pipelined datapath are concerned, the store instruction requires NOTHING to be done once it finishes its Mem stage.
134
The Three Stages of Beq
[Figure: Beq shown over Cycle 1 to Cycle 4 as Ifetch, Reg/Dec, Exec, followed by unused Mem and Wr]
Ifetch: Instruction Fetch
  Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
  compares the two register operands,
  selects the correct branch target address,
  latches it into the PC
Similar to the store instruction, the branch instruction only consists of four pipe stages. Ifetch and Reg/Decode are the same as for all other instructions because we do not know what instruction we have at this point. We have not finished decoding the instruction yet. During the Execute stage of the pipeline, the BEQ instruction uses the ALU to compare the two register operands it fetched during the Reg/Dec stage. At the same time, a separate adder is used to calculate the branch target address. If the registers we compared during the Execute stage (point to the last bullet) have the same value, the branch is taken. That is, the branch target address we calculated earlier (last bullet) is written into the Program Counter. Once again, similar to the Store instruction, the BEQ instruction requires NEITHER the pipelined control nor the pipelined datapath to do ANYTHING once it finishes its Mem stage. With all this talk about pipelined datapath and pipelined control, let’s take a look at what the pipelined datapath looks like.
135
Control Diagram
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond PC <- PC+SX;
M <- S; M <- S; M <- Mem[S]; Mem[S] <- B
R[rd] <- S; R[rt] <- S; R[rd] <- M;
[Figure: datapath with Next PC, PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg. File, and the Equal signal]
136
Datapath + Data Stationary Control
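[Figure: pipelined datapath (Next PC, PC, Inst. Mem, IR, Reg File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg. File) with control fields travelling alongside the data: Decode produces op, fun, im, rs, rt, and the ex, me, wb, and rw fields, each with a valid bit v, which are passed on through Mem Ctrl and WB Ctrl]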
137
Let’s Try it Out
10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, 15
these addresses are octal
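Because the addresses are octal, stepping by one 4-byte instruction goes 10, 14, 20, 24 rather than 10, 14, 18. A small Python sketch of that stepping; the function name is illustrative.

```python
def sequential_addresses(start, count, step=4):
    """Return `count` instruction addresses from `start`, formatted in octal."""
    return [format(start + i * step, "o") for i in range(count)]


# The straight-line part of the program starts at octal 10.
print(sequential_addresses(0o10, 6))

# The branch target and the instructions fetched after it.
print(sequential_addresses(0o100, 4))
```

The same stepping applies after the branch target at octal 100, which is decimal 64.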
138
Start: Fetch 10
[Figure: pipelined datapath snapshot. The instruction at address 10 is being fetched; the four later stages are marked n. PC value shown: 10.]
139
Fetch 14, Decode 10
[Figure: pipelined datapath snapshot. lw r1, r2(35) is in Decode; the three later stages are marked n. PC value shown: 14.]
140
Fetch 20, Decode 14, Exec 10
[Figure: pipelined datapath snapshot. addI r2, r2, 3 is in Decode; lw r1 is in Exec with operands r2 and 35; the two later stages are marked n. PC value shown: 20.]
141
Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: pipelined datapath snapshot. sub r3, r4, r5 is in Decode; addI r2, r2, 3 is in Exec; lw r1 is in Mem with address r2+35; the last stage is marked n. PC value shown: 24.]
142
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: pipelined datapath snapshot. beq r6, r7, 100 is in Decode; sub r3 is in Exec with operands r4 and r5; addI r2 is in Mem with r2+3; lw r1 is in WB with M[r2+35]. PC value shown: 30.]
143
Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: pipelined datapath snapshot. ori r8, r9, 17 is in Decode; beq is in Exec with operands r6 and r7 and offset 100; sub r3 is in Mem with r4-r5; addI r2 is in WB with r2+3; the register file shows r1=M[r2+35]. PC value shown: 34.]
144
Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Figure: pipelined datapath snapshot. add r10, r11, r12 is in Decode; ori r8 is in Exec with operands r9 and 17; beq is in Mem; sub r3 is in WB with r4-r5; the register file shows r1=M[r2+35] and r2 = r2+3. PC value shown: 100.]
ooops, we should have only one delayed instruction
145
Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Figure: pipelined datapath snapshot. and r13, r14, r15 is in Decode; add r10 is in Exec with operands r11 and r12; ori r8 is in Mem with r9 | 17; beq is in WB; the register file shows r1=M[r2+35], r2 = r2+3, r3 = r4-r5. PC value shown: 104.]
Squash the extra instruction in the branch shadow!
146
Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30
[Figure: pipelined datapath snapshot. and r13 is in Exec with operands r14 and r15; add r10 is in Mem with r11+r12; ori r8 is in WB with r9 | 17; the register file shows r1=M[r2+35], r2 = r2+3, r3 = r4-r5. PC value shown: 110.]
Squash the extra instruction in the branch shadow!
147
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Figure: pipelined datapath snapshot, annotated “NO WB” and “NO Ovflow”. and r13 is in Mem with r14 & r15; add r10 is in WB with r11+r12; the register file shows r1=M[r2+35], r2 = r2+3, r3 = r4-r5, r8 = r9 | 17. PC value shown: 114.]
Squash the extra instruction in the branch shadow!
148
Summary: Pipelining
What makes it easy
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard?
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction
We’ll build a simple pipeline and look at these issues
We’ll talk about modern processors and what really makes it hard:
  exception handling
  trying to improve performance with out-of-order execution, etc.
149
Summary
Pipelining is a fundamental concept
  multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined instruction processing
  start next instruction while working on the current one
  limited by length of longest stage (plus fill/flush)
  detect and resolve hazards
150
What about Interrupts, Traps, Faults?
External Interrupts:
  Allow pipeline to drain,
  Load PC with interrupt address
Faults (within instruction, restartable)
  Force trap instruction into IF
  disable writes till trap hits WB
  must save multiple PCs or PC + state
Refer to MIPS solution
151
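Exception Handling
[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) executing lw $2,20($5), annotated with where each exception is detected]
  I mem: detect bad instruction address
  Decode: detect bad instruction
  alu: detect overflow
  D mem: detect bad data address
  Regs (write back): allow exception to take effect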
152
Exception Problem
Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline
  How to stop the pipeline?
  Restart?
  Who caused the interrupt?
Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic exception
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
Load with data page fault, Add with instruction page fault?
Solution 1: interrupt vector/instruction, check last stage
Solution 2: interrupt ASAP, restart everything incomplete
153
Resolution: Freeze above & Bubble Below
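[Figure: pipelined datapath (IAU with npc and PC, instruction memory, register file, ALU, data memory, register write-back) with "freeze" marked on the upstream stages and a "bubble" inserted into the stages below them.]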
154
Memory
155
The Goal: illusion of large, fast, cheap memory
Fact: Large memories are slow, fast memories are small
How do we create a memory that is large, cheap and fast (most of the time)?
  Hierarchy
  Parallelism
156
An Expanded View of the Memory System
[Figure: a processor (control and datapath) followed by a chain of memory boxes. Speed: fastest nearest the processor, slowest farthest away. Size: smallest to biggest. Cost: highest to lowest.]
The memory system of a modern computer consists of a series of black boxes ranging from the fastest to the slowest. Besides varying in speed, these boxes also vary in size (smallest to biggest) and cost. What makes this kind of arrangement work is one of the most important principles in computer design: the principle of locality.
157
Why hierarchy works
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference plotted over the address space, 0 to 2^n - 1.]
The principle of locality states that programs access a relatively small portion of the address space at any instant of time. This is kind of like real life: we all have a lot of friends, but at any given time most of us can only keep in touch with a small group of them. There are two different types of locality: temporal and spatial.
Temporal locality is locality in time: if an item is referenced, it will tend to be referenced again soon. This is like saying that if you just talked to one of your friends, you are likely to talk to him or her again soon. For example, if you just had lunch with a friend, you may say, let's go to the ball game this Sunday. So you will talk to him again soon.
Spatial locality is locality in space: if an item is referenced, items whose addresses are close by tend to be referenced soon. Using our analogy again, we can usually divide our friends into groups: friends from high school, friends from work, friends from home. Say you just talked to one of your friends from high school and she says something like: "So did you hear so and so just won the lottery?" You will probably say no, I'd better give him a call and find out more. You just talked to a friend from your high school days and, as a result, you end up talking to another high school friend. Or at least, in this case, you hope he still remembers you are his friend.
158
Memory Hierarchy: How Does it Work?
Temporal Locality (Locality in Time): => Keep most recently accessed data items closer to the processor
Spatial Locality (Locality in Space): => Move blocks consisting of contiguous words to the upper levels
[Figure: the processor exchanges data with the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).]
How does the memory hierarchy work? Well, it is rather simple, at least in principle. To take advantage of temporal locality, that is, locality in time, the memory hierarchy keeps the more recently accessed data items closer to the processor, because chances are the processor will access them again soon. To take advantage of spatial locality, not ONLY do we move the item that has just been accessed to the upper level, but we ALSO move the data items that are adjacent to it.
159
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
  Hit Rate: the fraction of memory accesses found in the upper level
  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
[Figure: the processor exchanges data with the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y).]
A HIT is when the data the processor wants to access is found in the upper level (Blk X). The fraction of memory accesses that HIT is defined as the hit rate. Hit time is the time to access the upper level where the data is found (X). It consists of (a) the time to access this level and (b) the time to determine whether this is a hit or a miss. If the data the processor wants cannot be found in the upper level, then we have a miss and we need to retrieve the data (Blk Y) from the lower level. By definition, the miss rate is just 1 minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (Blk Y to Blk X) in the upper level and (b) the time it takes to deliver this new block to the processor. It is very important that your hit time be much, much smaller than your miss penalty. Otherwise, there would be no reason to build a memory hierarchy.
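These terms are commonly combined into a single average access time: hit time plus miss rate times miss penalty. The slide does not state this formula, so the sketch below is an illustration under that standard definition, and the cycle counts are hypothetical.

```python
def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average time per access for one upper level backed by a lower level."""
    miss_rate = 1.0 - hit_rate          # Miss Rate = 1 - (Hit Rate)
    return hit_time + miss_rate * miss_penalty


# Hypothetical numbers: 1-cycle hit time, 50-cycle miss penalty.
for hit_rate in (0.90, 0.95, 0.99):
    print(hit_rate, round(average_access_time(1, hit_rate, 50), 2))
```

Because the miss penalty is so much larger than the hit time, a few points of hit rate change the average noticeably.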
160
Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
  Present the user with as much memory as is available in the cheapest technology.
  Provide access at the speed offered by the fastest technology.
Level                                                     Speed (ns)                  Size (bytes)
Registers (in the processor, next to the datapath)        1s                          100s
On-Chip Cache, Second Level Cache (SRAM)                  10s                         Ks
Main Memory (DRAM)                                        100s                        Ms
Secondary Storage (Disk)                                  10,000,000s (10s ms)        Gs
Tertiary Storage (Disk)                                   10,000,000,000s (10s sec)   Ts
The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk), while, by taking advantage of the principle of locality, providing an average access speed that is very close to the speed offered by the fastest technology. (We will go over this slide in detail in the next lecture on caches.)
161
How is the hierarchy managed?
Registers <-> Memory
  by compiler (programmer?)
cache <-> memory
  by the hardware
memory <-> disks
  by the hardware and operating system (virtual memory)
  by the programmer (files)
162
Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2 ** N byte cache:
  The uppermost (32 - N) bits are always the Cache Tag
  The lowest M bits are the Byte Select (Block Size = 2 ** M)
Address fields: Cache Tag = bits 31-10 (example: 0x50), Cache Index = bits 9-5 (example: 0x01), Byte Select = bits 4-0 (example: 0x00)
[Figure: 32 cache entries (index 0 to 31), each with a Valid Bit, a Cache Tag (stored as part of the cache "state"), and 32 bytes of Cache Data. Entry 0 holds Byte 0 to Byte 31, entry 1 (tag 0x50) holds Byte 32 to Byte 63, and entry 31 holds Byte 992 to Byte 1023.]
Let's use a specific example with realistic numbers: assume we have a 1 KB direct mapped cache with a block size of 32 bytes. In other words, each block associated with a cache tag has 32 bytes in it (Row 1). With a block size of 32 bytes, the 5 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10 bits, or 22 bits, of the address are stored as the cache tag. The remaining address bits in the middle, bits 5 through 9, are used as the cache index to select the proper cache entry.
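The field split for this 1 KB, 32 B block cache can be sketched in a few lines. The function and constant names are made up for the illustration; the example address is assembled from the slide's sample values (tag 0x50, index 0x01, byte select 0x00).

```python
CACHE_BYTES = 1024   # 2 ** 10, so N = 10
BLOCK_BYTES = 32     # 2 ** 5, so M = 5

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                 # 5 byte-select bits
INDEX_BITS = CACHE_BYTES.bit_length() - 1 - OFFSET_BITS    # 5 index bits


def split_address(addr):
    """Split a 32-bit address into (tag, index, byte_select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select


# Address built from the slide's example fields.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(hex(addr), split_address(addr))  # 0x14020 (80, 1, 0)
```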
163
Extreme Example: single big line
[Figure: a single cache entry with a Valid Bit, a Cache Tag, and Cache Data of Byte 0 to Byte 3.]
Cache Size = 4 bytes, Block Size = 4 bytes
  Only ONE entry in the cache
If an item is accessed, it is likely that it will be accessed again soon
  But it is unlikely that it will be accessed again immediately!!!
  The next access will likely be a miss again
    Continually loading data into the cache but discarding (forcing out) it before it is used again
    Worst nightmare of a cache designer: Ping Pong Effect
Conflict Misses are misses caused by:
  Different memory locations mapped to the same cache index
    Solution 1: make the cache size bigger
    Solution 2: Multiple entries for the same Cache Index
Let's go back to our 4-byte direct mapped cache and increase its block size to 4 bytes. Now we end up with one cache entry instead of 4 entries. What do you think this will do to the miss rate? Well, the miss rate will probably go to hell. It is true that if an item is accessed, it is likely that it will be accessed again soon. But probably NOT as soon as the very next access, so the next access will cause a miss again. What we end up doing is loading data into the cache, only to have it forced out by another cache miss before we have a chance to use it again. This is called the ping pong effect: the data acts like a ping pong ball bouncing in and out of the cache. It is one of the nightmare scenarios cache designers hope never happens. We also have a term for this type of cache miss, a miss caused by different memory locations mapped to the same cache index: it is called a conflict miss. There are two ways to reduce conflict misses. The first is to increase the cache size. The second is to increase the number of cache entries per cache index. Let me show you what I mean.
164
Another Extreme Example: Fully Associative
Fully Associative Cache
  Forget about the Cache Index
  Compare the Cache Tags of all cache entries in parallel
  Example: with 32 B blocks, we need N 27-bit comparators
By definition: Conflict Miss = 0 for a fully associative cache
Address fields: Cache Tag = bits 31-5 (27 bits long), Byte Select = bits 4-0 (example: 0x01)
[Figure: every entry has a Valid Bit, a Cache Tag with its own comparator (X), and 32 bytes of Cache Data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...).]
While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end. It is the N-way set associative cache carried to the extreme, where N is set to the number of cache entries in the cache. In other words, we don't even bother to use any address bits as the cache index. We just store all the upper bits of the address (everything except the byte select) associated with the cache block as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the entry that matches is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive. Usually, a fully associative cache is limited to 64 or fewer entries. Since we are not doing any mapping with the cache index, we never push an item out of the cache because multiple memory locations map to the same cache location. Therefore, by definition, the conflict miss rate is zero for a fully associative cache. This, however, does not mean the overall miss rate will be zero. Assume we have 64 entries here. The first 64 items we access can fit in. But when we try to bring in the 65th item, we need to throw one of them out to make room for the new item. This brings us to the third type of cache miss: the capacity miss.
165
A Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index
  N direct mapped caches operate in parallel
Example: Two-way set associative cache
  Cache Index selects a "set" from the cache
  The two tags in the set are compared in parallel
  Data is selected based on the tag result
[Figure: the Cache Index selects one entry (Valid, Cache Tag, Cache Data) from each of two banks. Each tag is compared with the Adr Tag. The two compare results drive Sel0 and Sel1 of a Mux that picks the Cache Block, and are ORed to produce Hit.]
This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct mapped caches working in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we select the data from the side where the tag match occurs. This is simple enough. What are its disadvantages?
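A tag-only software model makes the effect of a second entry per index easy to see. This is a sketch, not the hardware: it checks tags sequentially instead of in parallel, it assumes LRU replacement within a set, and the class name and the two-address trace are made up for the illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only cache model: tracks hits and misses, not data."""

    def __init__(self, num_sets, ways, block_bytes):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One OrderedDict per set; key order records recency (oldest first).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block = addr // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            return True
        if len(entries) == self.ways:
            entries.popitem(last=False)   # evict the least recently used tag
        entries[tag] = True
        return False


def count_misses(cache, addresses):
    return sum(not cache.access(a) for a in addresses)


# Two addresses that share index 0 in a 1 KB cache with 32 B blocks.
trace = [0x0000, 0x0400] * 4
direct_mapped = SetAssociativeCache(num_sets=32, ways=1, block_bytes=32)
two_way = SetAssociativeCache(num_sets=16, ways=2, block_bytes=32)
print(count_misses(direct_mapped, trace))  # 8
print(count_misses(two_way, trace))        # 2
```

Both caches hold 1 KB. The direct mapped one ping-pongs on every access, while the two-way one misses only on the first touch of each block.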
166
Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache:
  N comparators vs. 1
  Extra MUX delay for the data
  Data comes AFTER Hit/Miss decision and set selection
In a direct mapped cache, Cache Block is available BEFORE Hit/Miss:
  Possible to assume a hit and continue. Recover later if miss.
[Figure: the two-way set associative datapath again: Cache Index, two banks of Valid / Cache Tag / Cache Data, two comparators against the Adr Tag, Sel0 and Sel1 into the Mux, and an OR producing Hit.]
First of all, an N-way set associative cache needs N comparators instead of just one (use the right side of the diagram for the direct mapped cache). An N-way set associative cache is also slower than a direct mapped cache because of the extra multiplexer delay. Finally, for an N-way set associative cache, the data is available only AFTER the hit/miss signal becomes valid, because the hit/miss result is needed to control the data MUX. For a direct mapped cache, that is, everything before the MUX on the right or left side, the cache block is available BEFORE the hit/miss signal (AND gate output) because the data does not have to go through the comparator. This can be an important consideration because the processor can go ahead and use the data without knowing whether it is a hit or a miss. Just assume it is a hit. Since the cache hit rate is in the upper 90% range, you will be ahead of the game more than 90% of the time, and for the remaining accesses where you are wrong, just make sure you can recover. You cannot play this speculation game with an N-way set associative cache because, as I said earlier, the data is not available to you until the hit/miss signal is valid.
167
A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block
  "Cold" fact of life: not a whole lot you can do about it
  Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
Conflict (collision): Multiple memory locations mapped to the same cache location
  Solution 1: increase cache size
  Solution 2: increase associativity
Capacity: Cache cannot contain all blocks accessed by the program
  Solution: increase cache size
Invalidation: other process (e.g., I/O) updates memory
Capacity misses are cache misses due to the fact that the cache is simply not large enough to contain all the blocks that are accessed by the program. The solution to reduce the capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache miss we talked about. First are the compulsory misses. These are the misses that we cannot avoid. They occur when we first start the program. Then we talked about the conflict misses. They are the misses caused by multiple memory locations being mapped to the same cache location. There are two ways to reduce conflict misses. The first one is, once again, to increase the cache size. The second one is to increase the associativity, for example by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that cache miss rate is only one part of the equation. You also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache misses we will not cover today. Those are referred to as invalidation misses: another process, such as I/O, updates main memory, so you have to flush the cache to avoid inconsistency between memory and cache.
168
Improving Cache Performance: 3 general options
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.
169
4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)
170
Q1: Where can a block be placed in the upper level?
Block 12 placed in 8 block cache: fully associative, direct mapped, 2-way set associative
Set associative mapping = (Block Number) modulo (Number of Sets)
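The block 12 example can be worked through with the modulo rule. In this sketch the function name is hypothetical, and it assumes the ways of a set occupy adjacent cache frames, which is only one way to number them.

```python
def candidate_frames(block_number, num_frames, ways):
    """Cache frames where a memory block may be placed.

    ways=1 is direct mapped; ways=num_frames is fully associative.
    Assumes the frames of one set are numbered consecutively.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets   # Block Number modulo Number of Sets
    first = set_index * ways
    return list(range(first, first + ways))


print(candidate_frames(12, 8, 1))  # [4]
print(candidate_frames(12, 8, 2))  # [0, 1]
print(candidate_frames(12, 8, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Direct mapped gives exactly one legal frame (12 mod 8 = 4), 2-way gives the two frames of set 0 (12 mod 4 = 0), and fully associative allows any of the 8 frames.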
171
Q2: How is a block found if it is in the upper level?
Tag on each block
  No need to check index or block offset
Increasing associativity shrinks index, expands tag
172
Q3: Which block should be replaced on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
  Random
  LRU (Least Recently Used)
Miss rates by cache size, associativity, and replacement policy:
          2-way             4-way             8-way
Size      LRU     Random    LRU     Random    LRU     Random
16 KB     5.2%    5.7%      4.7%    5.3%      4.4%    5.0%
64 KB     1.9%    2.0%      1.5%    1.7%      1.4%    1.5%
256 KB    1.15%   1.17%     1.13%   1.13%     1.12%   1.12%
173
Q4: What happens on a write?
Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  Is the block clean or dirty?
Pros and cons of each?
  WT: read misses cannot result in writes
  WB: repeated writes to the same block do not each go to memory
WT is always combined with write buffers so that the processor does not wait for lower-level memory
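The difference in memory traffic between the two policies can be sketched with a small counter. The model below is an assumption-laden illustration: direct mapped, write-allocate, tag-only, with a final flush of dirty blocks. Write-through counts one memory write per store, while write-back counts one block write per dirty eviction.

```python
def memory_writes(stores, num_blocks, policy):
    """Count writes reaching memory for a trace of store block numbers.

    Direct mapped, write-allocate, tag-only model (all assumptions).
    policy is "write-through" or "write-back".
    """
    tags = [None] * num_blocks     # block number cached in each frame
    dirty = [False] * num_blocks
    writes = 0
    for block in stores:
        frame = block % num_blocks
        if tags[frame] != block:
            if policy == "write-back" and dirty[frame]:
                writes += 1        # evicted dirty block goes to memory
            tags[frame] = block
            dirty[frame] = False
        if policy == "write-through":
            writes += 1            # every store also updates memory
        else:
            dirty[frame] = True    # defer the update until replacement
    if policy == "write-back":
        writes += sum(dirty)       # flush what is still dirty at the end
    return writes


trace = [0, 0, 0, 0, 4, 4, 0]      # blocks 0 and 4 share frame 0 of 4
print(memory_writes(trace, 4, "write-through"))  # 7
print(memory_writes(trace, 4, "write-back"))     # 3
```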
174
Write Buffer for Write Through
[Figure: the Processor writes to the Cache and to a Write Buffer; the Write Buffer drains into DRAM.]
A Write Buffer is needed between the Cache and Memory
  Processor: writes data into the cache and the write buffer
  Memory controller: writes contents of the buffer to memory
Write buffer is just a FIFO:
  Typical number of entries: 4
  Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
You are right, memory is too slow. We really don't write to memory directly. We write to a write buffer. Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write. The memory controller then moves the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice that I am referring to frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time. It sets the upper limit on how frequently you can write to main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer and the CPU must stop and wait.
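The "works fine if store frequency << 1 / DRAM write cycle" condition can be explored with a small FIFO model. This is a simplified sketch under stated assumptions: memory retires one buffered write at a time, an entry is freed when its write completes, and the store times and cycle counts are hypothetical.

```python
from collections import deque


def stall_cycles(store_times, dram_write_cycle, entries=4):
    """Cycles the CPU waits because the FIFO write buffer is full.

    store_times: sorted cycle numbers at which the CPU wants to store.
    dram_write_cycle: cycles memory needs to retire one buffered write.
    """
    finish = deque()      # completion times of writes still buffered
    last_finish = 0
    stalled = 0
    for t in store_times:
        t += stalled                      # earlier stalls delay later stores
        while finish and finish[0] <= t:
            finish.popleft()              # memory has retired this entry
        if len(finish) == entries:
            wait = finish.popleft() - t   # wait for the oldest write to finish
            stalled += wait
            t += wait
        start = max(t, last_finish)       # memory handles one write at a time
        last_finish = start + dram_write_cycle
        finish.append(last_finish)
    return stalled


# Stores spaced wider than the DRAM write cycle never fill the buffer.
print(stall_cycles([0, 20, 40, 60, 80, 100], dram_write_cycle=10))  # 0
# Back-to-back stores saturate the 4-entry buffer and the CPU must wait.
print(stall_cycles(list(range(10)), dram_write_cycle=10))           # 51
```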
175
Write-miss Policy: Write Allocate versus Not Allocate
Assume: a 16-bit write to memory location 0x0 causes a miss
  Do we read in the block?
    Yes: Write Allocate
    No: Write Not Allocate
Address fields: Cache Tag = bits 31-10 (example: 0x00), Cache Index = bits 9-5 (example: 0x00), Byte Select = bits 4-0 (example: 0x00)
[Figure: the 1 KB direct mapped cache: 32 entries (index 0 to 31), each with a Valid Bit, a Cache Tag, and 32 bytes of Cache Data. Entry 0 (tag 0x00) holds Byte 0 to Byte 31, entry 1 holds Byte 32 to Byte 63, and entry 31 holds Byte 992 to Byte 1023.]
Let's look at our 1 KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x0 and it causes a cache miss in our 1 KB direct mapped cache with 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second. Is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access those bytes soon. But the type of access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway, so it is common practice NOT to read in the rest of the block on a write miss. If you don't bring in the rest of the block, or, to use the more technical term, Write Not Allocate, you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.
176
Recall: Levels of the Memory Hierarchy
Level           Capacity      Access Time   Cost
CPU Registers   100s Bytes    <10s ns
Cache           K Bytes       ns            $ /bit
Main Memory     M Bytes       100ns-1us     $
Disk            G Bytes       ms            cents (-3, -4)
Tape            infinite      sec-min       10^-6
Staging / transfer unit between adjacent levels:
  Registers <-> Cache: Instr. Operands, managed by prog./compiler, 1-8 bytes
  Cache <-> Memory: Blocks, managed by cache cntl, 8-128 bytes
  Memory <-> Disk: Pages, managed by OS, 512-4K bytes
  Disk <-> Tape: Files, managed by user/operator, Mbytes
Upper levels are faster; lower levels are larger.
177
Basic Issues in Virtual Memory System Design
size of information blocks that are transferred from secondary to main storage (M)
block of information brought into M, and M is full, then some region of M must be released to make room for the new block --> replacement policy
which region of M is to hold the new block --> placement policy
missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy
[Figure: reg <-> cache <-> mem <-> disk; pages move between disk and frames in memory.]
Paging Organization
  virtual and physical address space partitioned into blocks of equal size
  physical blocks are page frames; virtual blocks are pages
178
Address Map
V = {0, 1, ..., n - 1} virtual address space
M = {0, 1, ..., m - 1} physical address space
n > m
MAP: V --> M U {0} address mapping function
  MAP(a) = a' if data at virtual address a is present in physical address a' and a' in M
  MAP(a) = 0 if data at virtual address a is not present in M
[Figure: the Processor issues address a from Name Space V to the Addr Trans Mechanism. On success, physical address a' goes to Main Memory. On a missing item fault, the fault handler brings the item from Secondary Memory into Main Memory; the OS performs this transfer.]
179
Paging Organization
Physical Memory (P.A.): 8 frames of 1K each; frame 0 at address 0, frame 1 at 1024, ..., frame 7 at 7168
Virtual Memory (V.A.): 32 pages of 1K each; page 0 at address 0, page 1 at 1024, ..., page 31 at 31744
The page is the unit of mapping, and also the unit of transfer from virtual to physical memory
Address Mapping (Addr Trans MAP):
  VA = page no. followed by a 10-bit disp
  Page Table Base Reg + page no. gives the index into the page table
  the page table is located in physical memory
  each page table entry holds V, Access Rights, and PA
  PA + disp gives the physical memory address (actually, concatenation is more likely)
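The address mapping can be sketched as a page table lookup on 1K pages. The table layout, the `PageFault` exception, and the mapping of page 2 to frame 5 are assumptions made up for the example, and access rights are left out.

```python
PAGE_BITS = 10                 # 1K pages, so a 10-bit displacement
PAGE_SIZE = 1 << PAGE_BITS


class PageFault(Exception):
    """Raised when the page table entry's valid bit is clear."""


def translate(va, page_table):
    """Map a virtual address to a physical address.

    page_table[page_no] is (valid, frame_no); access rights are omitted.
    """
    page_no = va >> PAGE_BITS
    disp = va & (PAGE_SIZE - 1)
    valid, frame_no = page_table[page_no]
    if not valid:
        raise PageFault(page_no)
    # Concatenate frame number and displacement rather than adding.
    return (frame_no << PAGE_BITS) | disp


# 32 virtual pages as on the slide; only page 2 is resident here,
# and its placement in frame 5 is a hypothetical choice.
page_table = [(False, 0)] * 32
page_table[2] = (True, 5)
print(translate(2 * 1024 + 17, page_table))  # 5137
```

A non-resident page raises `PageFault`, which stands in for the missing item fault that the OS would service by loading the page from secondary memory.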
180
Virtual Address and a Cache
[Figure: the CPU sends a VA to Translation, which sends a PA to the Cache; on a hit the cache returns data, on a miss the access goes to Main Memory.]
It takes an extra memory access to translate VA to PA
This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible
ASIDE: Why access cache with PA at all? VA caches have a problem!
  synonym / alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  for update: must update all cache entries with the same physical address or memory becomes inconsistent
  determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits
  or a software-enforced alias boundary: the same low-order bits in VA and PA, spanning more than the cache size
181
TLBs
A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer or TLB.
TLB entry fields: Virtual Address, Physical Address, Dirty, Ref, Valid, Access
TLB access time is comparable to cache access time (much less than main memory access time)
182
Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped
TLBs are usually small, typically not more than entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations.
Translation with a TLB
[Figure: the CPU sends a VA to the TLB Lookup. On a TLB hit the PA goes to the Cache; on a TLB miss the address goes through Translation first. A cache hit returns data; a cache miss goes to Main Memory. Timing labels in the figure: 1/2 t, t, 20 t.]
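A TLB behaves like a small cache placed in front of the page table. The sketch below models a fully associative TLB; the LRU replacement, the class name, and the toy page table are assumptions for illustration rather than anything the slides specify.

```python
from collections import OrderedDict


class TLB:
    """Small fully associative cache of page-number -> frame-number pairs."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()   # oldest (least recently used) first
        self.hits = 0
        self.misses = 0

    def lookup(self, page_no, page_table):
        if page_no in self.map:
            self.hits += 1
            self.map.move_to_end(page_no)   # mark as most recently used
            return self.map[page_no]
        self.misses += 1
        frame_no = page_table[page_no]      # the slow walk through memory
        if len(self.map) == self.entries:
            self.map.popitem(last=False)    # evict the LRU translation
        self.map[page_no] = frame_no
        return frame_no


# Hypothetical page table and reference string.
page_table = {0: 3, 1: 7, 2: 5}
tlb = TLB(entries=2)
for page in [0, 1, 0, 2, 1]:
    tlb.lookup(page, page_table)
print(tlb.hits, tlb.misses)  # 1 4
```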
183
Summary #1 / 4: The Principle of Locality
The Principle of Locality:
  Programs are likely to access a relatively small portion of the address space at any instant of time.
    Temporal Locality: Locality in Time
    Spatial Locality: Locality in Space
Three Major Categories of Cache Misses:
  Compulsory Misses: sad facts of life. Example: cold start misses.
  Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
  Capacity Misses: increase cache size
Cache Design Space
  total size, block size, associativity
  replacement policy
  write-hit policy (write-through, write-back)
  write-miss policy
Let's summarize today's lecture. I know you have heard this many times and in many ways, but it is still worth repeating. The memory hierarchy works because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping pong effect, when a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by increasing the cache size, increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write-through requires a write buffer, and the nightmare scenario is when stores occur so frequently that you saturate your write buffer. The second write policy is write-back. In this case, you write only to the cache, and only when the cache block is being replaced do you write the cache block back to memory.
184
Summary #2 / 4: The Cache Design Space
Several interacting dimensions
  cache size
  block size
  associativity
  replacement policy
  write-through vs write-back
  write allocation
[Figure: the design space drawn along three axes: Cache Size, Associativity, Block Size.]
No fancy replacement policy is needed for the direct mapped cache. As a matter of fact, that is what causes direct mapped caches trouble to begin with: there is only one place to go in the cache, which causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she is more likely to get killed in a multi-engine light airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine airplane, you have one option: sooner or later, you land. Probably sooner. But in a multi-engine airplane with one engine stopped, you have a lot of options. It is the need to make a decision that kills those people.
185
Summary #3 / 4 : TLB, Virtual Memory
Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:
  1) Where can a block be placed?
  2) How is a block found?
  3) What block is replaced on a miss?
  4) How are writes handled?
Page tables map virtual address to physical address
TLBs are important for fast translation
TLB misses are significant in processor performance: (funny times, as most systems can't access all of 2nd level cache without TLB misses!)
Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of the memory hierarchy so that programmers, faced with main memory much smaller than their programs, did not have to manage the loading and unloading of portions of their programs in and out of memory. It was a controversial proposal at the time because very few programmers believed software could manage the limited memory resource as well as a human. This all changed as DRAM size grew exponentially over the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don't have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, though we like to emphasize not the only technique, for translating a virtual memory address to a physical memory address is to use a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques for reducing address translation time. Since the TLB is so effective at reducing address translation time, TLB misses have a significant negative impact on processor performance.
186
Summary #4 / 4: Memory Hierarchy
Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy
Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms?