Non-Pipelined Processors Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 6, 2013

Slides:

Advertisements

Similar presentations

CIS 314 Fall 2005 MIPS Datapath (Single Cycle and Multi-Cycle)

Advertisements

ELEN 468 Advanced Logic Design

Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology January 13, 2012L6-1

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Pipelining III Steve Ko Computer Sciences and Engineering University at Buffalo.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture MIPS Review & Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture ISAs and MIPS Steve Ko Computer Sciences and Engineering University at Buffalo.

The Processor: Datapath & Control

January 31, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer.

Computer Structure - Datapath and Control Goal: Design a Datapath  We will design the datapath of a processor that includes a subset of the MIPS instruction.

Computer Architecture: A Constructive Approach Instruction Representation Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

Computer Architecture: A Constructive Approach Branch Prediction - 1 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of.

Constructive Computer Architecture Interrupts/Exceptions/Faults Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1

Interrupts / Exceptions / Faults Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology April 30, 2012L21-1

CS3350B Computer Architecture Winter 2015 Lecture 5.6: Single-Cycle CPU: Datapath Control (Part 1) Marc Moreno Maza [Adapted.

Asanovic/Devadas Spring Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections

Computer Architecture: A Constructive Approach One-Cycle Implementation Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

Electrical and Computer Engineering University of Cyprus LAB 2: MIPS.

Computer Architecture and Design – ECEN 350 Part 6 [Some slides adapted from A. Sprintson, M. Irwin, D. Paterson and others]

Datapath and Control Unit Design

1 A single-cycle MIPS processor  An instruction set architecture is an interface that defines the hardware operations which are available to software.

Constructive Computer Architecture: Branch Prediction Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

Constructive Computer Architecture Virtual Memory and Interrupts Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Computer Architecture: A Constructive Approach Branch Prediction - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of.

Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.

Datapath and Control AddressInstruction Memory Write Data Reg Addr Register File ALU Data Memory Address Write Data Read Data PC Read Data Read Data.

Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

COM181 Computer Hardware Lecture 6: The MIPs CPU.

Computer Architecture: A Constructive Approach Multi-Cycle and 2 Stage Pipelined SMIPS Implementations Teacher: Yoav Etsion Taken (with permission) from.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010

Computer Architecture: A Constructive Approach Implementing SMIPS Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture Interrupts/Exceptions/Faults Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Computer Architecture Lecture 6.  Our implementation of the MIPS is simplified memory-reference instructions: lw, sw arithmetic-logical instructions:

EE204 Computer Architecture

CS161 – Design and Architecture of Computer Systems

Electrical and Computer Engineering University of Cyprus

CS 230: Computer Organization and Assembly Language

Control Hazards Constructive Computer Architecture: Arvind

Non-Pipelined Processors

ELEN 468 Advanced Logic Design

Processor Architecture: Introduction to RISC Datapath (MIPS and Nios II) CSCE 230.

Interrupts/Exceptions/Faults

Multistage Pipelined Processors and modular refinement

Non-Pipelined Processors

CSCI206 - Computer Organization & Programming

Non-Pipelined Processors - 2

Non-Pipelined Processors

Non-Pipelined Processors

Single-Cycle CPU DataPath.

Non-Pipelined and Pipelined Processors

Multi-cycle SMIPS Implementations

Topic 5: Processor Architecture Implementation Methodology

Rocky K. C. Chang 6 November 2017

Multistage Pipelined Processors and modular refinement

Composing the Elements

Composing the Elements

The Processor Lecture 3.2: Building a Datapath with Control

Topic 5: Processor Architecture

Pipelined Processors Arvind

COMS 361 Computer Organization

Pipelined Processors Constructive Computer Architecture: Arvind

Modeling Processors Arvind

Modular Refinement Arvind

Non-Pipelined Processors

Branch Predictor Interface

The Processor: Datapath & Control.

COMS 361 Computer Organization

Processor: Datapath and Control

Presentation transcript:

Non-Pipelined Processors Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 6, L09-1

Single-Cycle RISC Processor PC Inst Memory Decode Register File Execute Data Memory +4 Datapath and control are derived automatically from a high-level rule-based description 2 read & 1 write ports separate Instruction & Data memories As an illustrative example, we will use SMIPS, a 35- instruction subset of MIPS ISA without the delay slot March 6,

Single-Cycle Implementation code structure module mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; IMemory iMem <- mkIMemory; DMemory dMem <- mkDMemory; rule doProc; let inst = iMem.req(pc); let dInst = decode(inst); let rVal1 = rf.rd1(dInst.rSrc1); let rVal2 = rf.rd2(dInst.rSrc2); let eInst = exec(dInst, rVal1, rVal2, pc); update rf, pc and dMem to be explained later extracts fields needed for execution produces values needed to update the processor state instantiate the state March 6,

SMIPS Instruction formats Only three formats but the fields are used differently by different types of instructions opcode rsrt immediate I-type opcode rsrt rd shamt func R-type 6 26 opcode target J-type March 6,

Instruction formats cont Computational Instructions Load/Store Instructions opcode rsrt immediate rt  (rs) op immediate rsrt rd 0 func rd  (rs) func (rt) rs is the base register rt is the destination of a Load or the source for a Store addressing mode opcode rsrt displacement (rs) + displacement March 6,

Control Instructions Conditional (on GPR) PC-relative branch target address = (offset in words)4 + (PC+4) range: 128 KB range Unconditional register-indirect jumps Unconditional absolute jumps target address = {PC, target4} range : 256 MB range opcode rs offset BEQZ, BNEZ 6 26 opcode target J, JAL opcode rs JR, JALR jump-&-link stores PC+4 into the link register (R31) March 6,

decode Decoding Instructions: extract fields needed for execution instruction brComp rDst rSrc1 rSrc2 imm ext aluFunc iType 31:26, 5:0 31:26 5:0 31:26 20:16 15:11 25:21 20:16 15:0 25:0 Type DecodedInst Bit#(32) IType AluFunc Maybe#(RIndx) BrFunc Maybe#(Bit#(32)) pure combinational logic: derived automatically from the high-level description March 6,

Decoded Instruction typedef struct { IType iType; AluFunc aluFunc; BrFunc brFunc; Maybe#(FullIndx) dst; Maybe#(FullIndx) src1; Maybe#(FullIndx) src2; Maybe#(Data) imm; } DecodedInst deriving(Bits, Eq); typedef enum {Unsupported, Alu, Ld, St, J, Jr, Br} IType deriving(Bits, Eq); typedef enum {Add, Sub, And, Or, Xor, Nor, Slt, Sltu, LShift, RShift, Sra} AluFunc deriving(Bits, Eq); typedef enum {Eq, Neq, Le, Lt, Ge, Gt, AT, NT} BrFunc deriving(Bits, Eq); Instruction groups with similar executions paths FullIndx is similar to RIndx; to be explained later Destination register 0 behaves like an Invalid destination March 6,

Decode Function function DecodedInst decode(Bit#(32) inst); DecodedInst dInst = ?; let opcode = inst[ 31 : 26 ]; let rs = inst[ 25 : 21 ]; let rt = inst[ 20 : 16 ]; let rd = inst[ 15 : 11 ]; let funct = inst[ 5 : 0 ]; let imm = inst[ 15 : 0 ]; let target = inst[ 25 : 0 ]; case (opcode)... endcase return dInst; endfunction initially undefined opcoders rt immediate opcode rs rt rd shamt func opcode target March 6,

Naming the opcodes Bit#(6) opADDIU = 6’b001001; Bit#(6) opSLTI = 6’b001010; Bit#(6) opLW = 6’b100011; Bit#(6) opSW = 6’b101011; Bit#(6) opJ = 6’b000010; Bit#(6) opBEQ = 6’b000100; … Bit#(6) opFUNC = 6’b000000; Bit#(6) fcADDU = 6’b100001; Bit#(6) fcAND = 6’b100100; Bit#(6) fcJR = 6’b001000; … Bit#(6) opRT = 6’b000001; Bit#(6) rtBLTZ = 5’b00000; Bit#(6) rtBGEZ = 5’b00100; bit patterns are specified in the SMIPS ISA March 6,

Instruction Groupings instructions with common execution steps case (opcode) opADDIU, opSLTI, opSLTIU, opANDI, opORI, opXORI, opLUI: … opLW: … opSW: … opJ, opJAL: … opBEQ, opBNE, opBLEZ, opBGTZ, opRT: … opFUNC: case (funct) fcJR, fcJALR: … fcSLL, fcSRL, fcSRA: … fcSLLV, fcSRLV, fcSRAV: … fcADDU, fcSUBU, fcAND, fcOR, fcXOR, fcNOR, fcSLT, fcSLTU: … ; default: // Unsupported endcase default: // Unsupported endcase; These groupings are somewhat arbitrary March 6,

Decoding Instructions: I-Type ALU opADDIU, opSLTI, opSLTIU, opANDI, opORI, opXORI, opLUI: begin dInst.iType = dInst.aluFunc = dInst.dst = dInst.src1 = dInst.src2 = dInst.imm = dInst.brFunc = end almost like writing (Valid rt) Alu; case (opcode) opADDIU, opLUI: Add; opSLTI: Slt; opSLTIU: Sltu; opANDI: And; opORI: Or; opXORI: Xor; endcase; validReg(rt); Valid (case(opcode) opADDIU, opSLTI, opSLTIU: signExtend(imm); opLUI: {imm, 16'b0}; endcase); NT; validReg(rs); Invalid; March 6,

Decoding Instructions: Load & Store opLW: begin dInst.iType = Ld; dInst.aluFunc = Add; dInst.rDst = validReg(rt); dInst.rSrc1 = validReg(rs); dInst.rSrc2 = Invalid; dInst.imm = Valid (signExtend(imm)); dInst.brFunc = NT; end opSW: begin dInst.iType = St; dInst.aluFunc = Add; dInst.rDst = Invalid; dInst.rSrc1 = validReg(rs); dInst.rSrc2 = validReg(rt); dInst.imm = Valid(signExtend(imm)); dInst.brFunc = NT; end March 6,

Decoding Instructions: Jump opJ, opJAL: begin dInst.iType = J; dInst.rDst = opcode==opJ ? Invalid : validReg(31); dInst.rSrc1 = Invalid; dInst.rSrc2 = Invalid; dInst.imm = Valid(zeroExtend( {target, 2’b00})); dInst.brFunc = AT; end March 6,

Decoding Instructions: Branch opBEQ, opBNE, opBLEZ, opBGTZ, opRT: begin dInst.iType = Br; dInst.brFunc = case(opcode) opBEQ: Eq; opBNE: Neq; opBLEZ: Le; opBGTZ: Gt; opRT: (rt==rtBLTZ ? Lt : Ge); endcase; dInst.dst = Invalid; dInst.src1 = validReg(rs); dInst.src2 = (opcode==opBEQ||opcode==opBNE)? validReg(rt) : Invalid; dInst.imm = Valid(signExtend(imm) << 2); end March 6,

Decoding Instructions: opFUNC, JR opFUNC: case (funct) fcJR, fcJALR: begin dInst.iType = Jr; dInst.dst = funct == fcJR? Invalid: validReg(rd); dInst.src1 = validReg(rs); dInst.src2 = Invalid; dInst.imm = Invalid; dInst.brFunc = AT; end fcSLL, fcSRL, fcSRA: … March 6, JALR stores the pc in rd as opposed to JAL which stores the pc in R31

Decoding Instructions: opFUNC- ALU ops fcADDU, fcSUBU, fcAND, fcOR, fcXOR, fcNOR, fcSLT, fcSLTU: begin dInst.iType = Alu; dInst.aluFunc = case (funct) fcADDU: Add; fcSUBU: Sub; fcAND : And; fcOR : Or; fcXOR : Xor; fcNOR : Nor; fcSLT : Slt; fcSLTU: Sltu; endcase; dInst.dst = validReg(rd); dInst.src1 = validReg(rs); dInst.src2 = validReg(rt); dInst.imm = Invalid; dInst.brFunc = NT end default: // Unsupported endcase March 6,

Decoding Instructions: Unsupported instruction default: begin dInst.iType = Unsupported; dInst.dst = Invalid; dInst.src1 = Invalid; dInst.src2 = Invalid; dInst.imm = Invalid; dInst.brFunc = NT; end endcase March 6,

Reading Registers Read registers RVal2 RVal1 RF RSrc2 Pure combinational logic RSrc1 March 6,

Executing Instructions execute dInst addr brTaken data iType dst ALU Branch Address pc rVal2 rVal1 Pure combinational logic either for memory reference or branch target either for rf write or St ALUBr March 6, missPredict

Output of exec function typedef struct { IType iType; Maybe#(FullIndx) dst; Data data; Addr addr; Bool mispredict; Bool brTaken; } ExecInst deriving(Bits, Eq); March 6,

Execute Function function ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc); ExecInst eInst = ?; Data aluVal2 = let aluRes = eInst.iType = eInst.data = let brTaken = let brAddr = eInst.brTaken = eInst.addr = eInst.dst = return eInst; endfunction fromMaybe(rVal2, dInst.imm); alu(rVal1, aluVal2, dInst.aluFunc); dInst.iType; dInst.iType==St? rVal2 : (dInst.iType==J || dInst.iType==Jr)? (pc+4) : aluRes; aluBr(rVal1, rVal2, dInst.brFunc); brAddrCalc(pc, rVal1, dInst.iType, fromMaybe(?, dInst.imm), brTaken); brTaken; (dInst.iType==Ld || dInst.iType==St)? aluRes : brAddr; dInst.dst; March 6,

Branch Address Calculation function Addr brAddrCalc(Addr pc, Data val, IType iType, Data imm, Bool taken); Addr pcPlus4 = pc + 4; Addr targetAddr = case (iType) J : {pcPlus4[31:28], imm[27:0]}; Jr : val; Br : (taken? pcPlus4 + imm : pcPlus4); Alu, Ld, St, Unsupported: pcPlus4; endcase; return targetAddr; endfunction March 6,

Single-Cycle SMIPS atomic state updates if(eInst.iType == Ld) eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?}); else if (eInst.iType == St) let dummy <- dMem.req(MemReq{op: St, addr: eInst.addr, data: data}); if(isValid(eInst.dst)) rf.wr(validRegValue(eInst.dst), eInst.data); pc <= eInst.brTaken ? eInst.addr : pc + 4; endrule endmodule state updates The whole processor is described using one rule; lots of big combinational functions March 6,

Harvard-Style Datapath for MIPS 0x4 RegWrite Add clk WBSrcMemWrite addr wdata rdata Data Memory we RegDst BSrc ExtSelOpCode z OpSel clk zero? clk addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext ALU Control 31 PCSrc br rind jabs pc+4 old way March 6,

Hardwired Control Table OpcodeExtSelBSrcOpSelMemWRegWWBSrcRegDstPCSrc ALU ALUi ALUiu LW SW BEQZ z=0 BEQZ z=1 J JAL JR JALR BSrc = Reg / ImmWBSrc = ALU / Mem / PC RegDst = rt / rd / R31PCSrc = pc+4 / br / rind / jabs ***no yesrindPC R31 rind***no ** jabs ** * no yesPCR31 jabs * * * no ** pc+4sExt 16 *0?no ** brsExt 16 *0?no ** pc+4sExt 16 Imm+yesno** pc+4ImmOpnoyesALUrt pc+4*RegFuncnoyesALUrd sExt 16 ImmOppc+4noyesALUrt pc+4sExt 16 Imm+noyesMemrt uExt 16 old way March 6,

Single-Cycle SMIPS: Clock Speed PC Inst Memory Decode Register File Execute Data Memory +4 t Clock > t M + t DEC + t RF + t ALU + t M + t WB We can improve the clock speed if we execute each instruction in two clock cycles t Clock > max {t M, (t DEC + t RF + t ALU + t M + t WB )} However, this may not improve the performance because each instruction will now take two cycles to execute March 6,

Structural Hazards Sometimes multicycle implementations are necessary because of resource conflicts, aka, structural hazards Princeton style architectures use the same memory for instruction and data and consequently, require at least two cycles to execute Load/Store instructions If the register file supported less than 2 reads and one write concurrently then most instructions would take more than one cycle to execute Usually extra registers are required to hold values between cycles March 6,

Two-Cycle SMIPS PC Inst Memory Decode Register File Execute Data Memory +4 f2d state Introduce register “f2d” to hold a fetched instruction and register “state” to remember the state (fetch/execute) of the processor March 6,

Two-Cycle SMIPS module mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; IMemory iMem <- mkIMemory; DMemory dMem <- mkDMemory; Reg#(Data) f2d <- mkRegU; Reg#(State) state <- mkReg(Fetch); rule doFetch (state == Fetch); let inst = iMem.req(pc); f2d <= inst; state <= Execute; endrule March 6,

Two-Cycle SMIPS rule doExecute(stage==Execute); let inst = f2d; let dInst = decode(inst); let rVal1 = rf.rd1(validRegValue(dInst.src1)); let rVal2 = rf.rd2(validRegValue(dInst.src2)); let eInst = exec(dInst, rVal1, rVal2, pc); if(eInst.iType == Ld) eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?}); else if(eInst.iType == St) let d <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data}); if (isValid(eInst.dst)) rf.wr(validRegValue(eInst.dst), eInst.data); pc <= eInst.brTaken ? eInst.addr : pc + 4; stage <= Fetch; endrule endmodule no change from single-cycle March 6,

Two-Cycle SMIPS: Analysis PC Inst Memory Decode Register File Execute Data Memory +4 f2d stage In any given clock cycle, lots of unused hardware! ExecuteFetch next lecture: Pipelining to increase the throughput March 6,

Coprocessor Registers MIPS allows extra sets of 32-registers each to support system calls, floating point, debugging etc. These registers are known as coprocessor registers The registers in the n th set are written and read using instructions MTCn and MFCn, respectively Set 0 is used to get the results of program execution (Pass/Fail), the number of instructions executed and the cycle counts Type FullIndx is used to refer to the normal registers plus the coprocessor set 0 registers function validRegValue(FullIndx r) returns index of r typedef Bit#(5) RIndx; typedef enum {Normal, CopReg} RegType deriving (Bits, Eq); typedef struct {RegType regType; RIndx idx;} FullIndx; deriving (Bits, Eq); March 6,

Processor interface interface Proc; method Action hostToCpu(Addr startpc); method ActionValue#(Tuple2#(RIndx, Data)) cpuToHost; endinterface refers to coprocessor registers March 6,

Code with coprocessor calls let copVal = cop.rd(validRegValue(dInst.src1)); let eInst = exec(dInst, rVal1, rVal2, pc, copVal); cop.wr(eInst.dst, eInst.data); write coprocessor registers (MTC0) and indicate the completion of an instruction pass coprocessor register values to execute MFC0 We did not show these lines in our processor to avoid cluttering the slides March 6,