6.375 Tutorial 4 RISC-V and Final Projects Ming Liu March 4, 2016http://csg.csail.mit.edu/6.375T04-1.

Slides:

Advertisements

Similar presentations

Lecture 19: Cache Basics Today’s topics: Out-of-order execution

Advertisements

Final Presentation Part-A

EECE476: Computer Architecture Lecture 21: Faster Branches Branch Prediction with Branch-Target Buffers (not in textbook) The University of British ColumbiaEECE.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Arvind and Joel Emer Computer Science and Artificial Intelligence Laboratory M.I.T. Branch Prediction.

Lecture 7 Lecture 7: Hardware/Software Systems on the XUP Board ECE 412: Microcomputer Laboratory.

Inside The CPU. Buses There are 3 Types of Buses There are 3 Types of Buses Address bus Address bus –between CPU and Main Memory –Carries address of where.

Constructive Computer Architecture Tutorial 4: SMIPS on FPGA Andy Wright 6.S195 TA October 7, 2013http://csg.csail.mit.edu/6.s195T04-1.

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1

Computer Architecture: A Constructive Approach Branch Direction Prediction – Six Stage Pipeline Joel Emer Computer Science & Artificial Intelligence Lab.

A Relational Algebra Processor Final Project Ming Liu, Shuotao Xu.

IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.

GBT Interface Card for a Linux Computer Carson Teale 1.

Presented by: Sergio Ospina Qing Gao. Contents ♦ 12.1 Processor Organization ♦ 12.2 Register Organization ♦ 12.3 Instruction Cycle ♦ 12.4 Instruction.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

Lecture 9. MIPS Processor Design – Instruction Fetch Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System Education &

Constructive Computer Architecture: Branch Prediction Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

PROCStar III Performance Charactarization Instructor : Ina Rivkin Performed by: Idan Steinberg Evgeni Riaboy Semestrial Project Winter 2010.

Branch.1 10/14 Branch Prediction Static, Dynamic Branch prediction techniques.

Constructive Computer Architecture: Branch Prediction Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

Spring 2003CSE P5481 Advanced Caching Techniques Approaches to improving memory system performance eliminate memory operations decrease the number of misses.

UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan.

Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Computer Architecture: A Constructive Approach Next Address Prediction – Six Stage Pipeline Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts.

The Alpha – Data Stream Matt Ziegler.

Constructive Computer Architecture Tutorial 4: Running and Debugging SMIPS Andy Wright TA October 10, 2014http://csg.csail.mit.edu/6.175T04-1.

Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

1 Tutorial: Lab 4 Again Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 8,

October 22, 2009http://csg.csail.mit.edu/korea Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

6.375 Tutorial 3 Scheduling, Sce-Mi & FPGA Tools Ming Liu

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010

Lecture: Out-of-order Processors

Constructive Computer Architecture Tutorial 6 Cache & Exception

6.175: Constructive Computer Architecture Tutorial 5 Epochs, Debugging, and Caches Quan Nguyen (Troubled by the two biggest problems in computer science…

Bluespec-3: A non-pipelined processor Arvind

Bluespec-6: Modeling Processors

CS252 Graduate Computer Architecture Spring 2014 Lecture 8: Advanced Out-of-Order Superscalar Designs Part-II Krste Asanovic

Constructive Computer Architecture Tutorial 6: Discussion for lab6

Branch Prediction Constructive Computer Architecture: Arvind

Branch Prediction Constructive Computer Architecture: Arvind

Caches-2 Constructive Computer Architecture Arvind

Constructive Computer Architecture: Lab Comments & RISCV

Non-Pipelined Processors - 2

Lecture 19: Branches, OOO Today’s topics: Instruction scheduling

Lecture 18: Pipelining Today’s topics:

Constructive Computer Architecture Tutorial 5 Epoch & Branch Predictor

Non-Pipelined and Pipelined Processors

Lecture 18: Pipelining Today’s topics:

Constructive Computer Architecture Tutorial 3 RISC-V Processor

Branch Prediction Constructive Computer Architecture: Arvind

6.375 Tutorial 5 RISC-V 6-Stage with Caches Ming Liu

Lecture 19: Branches, OOO Today’s topics: Instruction scheduling

Bluespec-3: A non-pipelined processor Arvind

Branch Prediction: Direction Predictors

Branch Prediction: Direction Predictors

Modular Refinement - 2 Arvind

Morgan Kaufmann Publishers Memory Hierarchy: Cache Basics

Lecture 20: OOO, Memory Hierarchy

Control Hazards Constructive Computer Architecture: Arvind

Branch Prediction: Direction Predictors

Modeling Processors Arvind

Modeling Processors Arvind

Modular Refinement Arvind

Tutorial 7: SMIPS Labs and Epochs Constructive Computer Architecture

Presentation transcript:

6.375 Tutorial 4 RISC-V and Final Projects Ming Liu March 4, 2016http://csg.csail.mit.edu/6.375T04-1

Overview Branch Target Buffers RISC V Infrastructure Final Project March 4, 2016http://csg.csail.mit.edu/6.375T04-2

Two-stage Pipeline with BTB PC Decode Register File Execute Data Memory Inst Memory f2d Fetch Decode-RegisterFetch-Execute-Memory-WriteBack kill misprediction correct pc BTB: Branch Target Buffer At fetch: Use BTB to predict next PC At execute: Update BTB with correct next PC Only if instruction is a branch (iType == J, Jr, Br) BTB Update BTB {PC, correct PC} Predict Next PC March 4, 2016http://csg.csail.mit.edu/6.375T04-3

Next Address Predictor: Branch Target Buffer (BTB) iMem pc pc i target i valid match kk 2 k -entry direct-mapped BTB BTB remembers recent targets for a set of control instructions Fetch: looks for the pc and the associated target in BTB; if pc in not found then ppc is pc+4 Execute: checks prediction, if wrong kills the instruction and updates BTB (only for branches and jumps)  Even small BTBs are effective March 4, 2016http://csg.csail.mit.edu/6.375T04-4

Next Addr Predictor interface interface AddrPred; method Addr nap(Addr pc); method Action update(Redirect rd); endinterface  Two implementations: a)Simple PC+4 predictor b)Predictor using BTB March 4, 2016http://csg.csail.mit.edu/6.375T04-5

Simple PC+4 predictor module mkPcPlus4(AddrPred); method Addr nap(Addr pc); return pc + 4; endmethod method Action update(Redirect rd); endmethod endmodule March 4, 2016http://csg.csail.mit.edu/6.375T04-6

BTB predictor module mkBtb(AddrPred); RegFile#(BtbIndex, Addr) ppcArr <- mkRegFileFull; RegFile#(BtbIndex, BtbTag) entryPcArr <- mkRegFileFull; Vector#(BtbEntries, Reg#(Bool)) validArr <- replicateM(mkReg(False)); function BtbIndex getIndex(Addr pc)=truncate(pc>>2); function BtbTag getTag(Addr pc) = truncateLSB(pc); method Addr nap(Addr pc); BtbIndex index = getIndex(pc); BtbTag tag = getTag(pc); if(validArr[index] && tag == entryPcArr.sub(index)) return ppcArr.sub(index); else return (pc + 4); endmethod method Action update(Redirect redirect);... endmodule March 4, 2016http://csg.csail.mit.edu/6.375T04-7

BTB predictor update method method Action update(Redirect redirect); if(redirect.taken) begin let index = getIndex(redirect.pc); let tag = getTag(redirect.pc); validArr[index] <= True; entryPcArr.upd(index, tag); ppcArr.upd(index, redirect.nextPc); end else if(tag == entryPcArr.sub(index)) validArr[index] <= False; endmethod redirect input contains a pc, the correct next pc and whether the branch was taken or not (to avoid making entries for not-taken branches) March 4, 2016http://csg.csail.mit.edu/6.375T04-8

Multiple Predictors: BTB + Branch Direction Predictors  Need  next PC immediately  Instr type, PC relative targets available  Simple conditions, register targets available  Complex conditions available Next Addr Pred tight  loop PCPC Decode Reg Read Execute Write Back mispred insts must be filtered Br Dir Pred correct mispred March 4, 2016http://csg.csail.mit.edu/6.375T04-9

RISC-V Processor SCE-MI Infrastructure March 4, 2016http://csg.csail.mit.edu/6.375T04-10

RISC-V Interface cpuToHost hostToCpu iMemInit dMemInit mkProc – BSV CSR Core iMem dMem PC Host (Testbench) March 4, 2016http://csg.csail.mit.edu/6.375T04-11

RISC-V Interface interface Proc; method Action hostToCpu(Addr startpc); method ActionValue#(CpuToHostData) cpuToHost; interface MemInit iMemInit; interface MemInit dMemInit; endinterface typedef struct { CpuToHostType c2hType; Bit#(16) data; } CpuToHostData deriving(Bits, Eq); typedef enum { ExitCode, PrintChar, PrintIntLow, PrintIntHigh } CpuToHostType deriving(Bits, Eq); March 4, 2016http://csg.csail.mit.edu/6.375T04-12

RISC-V Interface: cpuToHost Write mtohost CSR: csrw mtohost, rs1 rs1[15:0]: data  32-bit Integer needs two writes rs1[17:16]: c2hType  0: Exit code  1: Print character  2: Print int low 16 bits  3: Print int high 16 bits typedef struct { CpuToHostType c2hType; Bit#(16) data; } CpuToHostData deriving(Bits, Eq); March 4, 2016http://csg.csail.mit.edu/6.375T04-13

RISC-V Interface: Others hostToCpu Tells the processor to start running from the given address iMemInit/dMemInit Used to initialize iMem and dMem Can also be used to check when initialization is done Defined in MemInit.bsv March 4, 2016http://csg.csail.mit.edu/6.375T04-14

SceMi Interface tb – C++mkProc – BSV CSR Core iMem dMem PC March 4, 2016http://csg.csail.mit.edu/6.375T04-15

Load Program Bypass this step in simulation tb – C++ add.riscv.vmh mkProc – BSV CSR Core iMem dMem PC March 4, 2016http://csg.csail.mit.edu/6.375T04-16

Load Program Simulation: load with mem.vmh (fixed file name) Copy.riscv.vmh to mem.vmh tb – C++ mkProc – BSV CSR Core iMem dMem PC mem.vmh March 4, 2016http://csg.csail.mit.edu/6.375T04-17

Start Processor mkProc – BSV CSR Core iMem dMem tb – C++ Starting PC 0x200 PC March 4, 2016http://csg.csail.mit.edu/6.375T04-18

Print & Exit tb – C++mkProc – BSV CSR Core iMem dMem PC Get reg c2hType: 1,2,3: print 0: Exit Data == 0 PASSED Data != 0 FAILED March 4, 2016http://csg.csail.mit.edu/6.375T04-19

Final Project March 4, 2016http://csg.csail.mit.edu/6.375T04-20

Overview Groups of 2-3 students Each group assigned to a graduate mentor in our group Groups meet individually with Arvind, mentor and me Weekly reports due before the meeting to and March 4, 2016http://csg.csail.mit.edu/6.375T04-21

Schedule March 4, 2016http://csg.csail.mit.edu/6.375T04-22

Project Considerations Design a complex digital system Choose an application that could benefit from hardware acceleration or FPGAs Application should be well understood Find/implement reference software code Look at past year projects on the website March 4, 2016http://csg.csail.mit.edu/6.375T04-23

FPGA IPs and Resources Many Xilinx related IPs are available in the BSV library $BLUESPECDIR/BSVSource/Xilinx BRAMs, DRAM, Clock generators/buffers, LED controller, HDMI controller, LCD controller Can wrap Verilog libraries/IPs in BSV code using importBVI Tutorial: Users/Import-BVI Users/Import-BVI March 4, 2016http://csg.csail.mit.edu/6.375T04-24

BRAMs on FPGAs Fast, small, on-chip distributed RAM on FPGA 1 cycle access latency 36Kbits x 1500 (approx) = ~6.75MB total Up to 2 ports  BRAM  Port A  Port B  Request  Resp  Request  Resp March 4, 2016http://csg.csail.mit.edu/6.375T04-25

BRAMs in BSV Library 2 Ported BRAM server: mkBRAM2Server() Large FIFOs: mkSizedBRAMFIFO() Large sync FIFOs: mkSyncBRAMFIFO() Primitive BRAM: mkBRAMCore2() import BRAM::*; BRAM_Configure cfg = defaultValue ; cfg.memorySize = 1024*32 ; //define custom memorySize //instantiate 32K x 16 bits BRAM module BRAM2Port#(UInt#(15), Bit#(16)) bram <- mkBRAM2Server (cfg) ; rule doWrite; bram.portA.request.put( BRAMRequest{ write: True, responseOnWrite: False, address: 15’h01 datain: data } ); March 4, 2016http://csg.csail.mit.edu/6.375T04-26

DRAM on FPGA Large capacity (1GB on VC707) Longer access latency, especially random access BSV library at $BLUESPECDIR/BSVSource/ Xilinx/XilinxVC707DDR3.bsv Misc/DDR3.bsv Not officially in documentation Example code will be given as part of Lab 6  DRAM DRAM Controller IP  BSV Wrapper  DDR3_Pin s  DDR3_Use r  FPGA  Off-chip March 4, 2016http://csg.csail.mit.edu/6.375T04-27

DRAM Request/Response 512-bit wide user interface DDR Request: Write: write or read Byteen: byte enable mask. Which of the 8-bit bytes in the 512-bits will be written Address: DRAM address for 512-bit words Data: data to be written DDR Response: Bit#(512) read data March 4, 2016http://csg.csail.mit.edu/6.375T04-28

Indirect Memory Access Host CPU load/stores data from host DRAM to PCIe device (FPGA) Low bandwidth, consumes CPU cycles Used in SceMi: ~50MB/s Host DRAM Bus FPGA  Host CPU FPGA DRAM March 4, 2016http://csg.csail.mit.edu/6.375T04-29

Direct Memory Access (DMA) Host CPU sets up DMA engine DMA engine performs data transfer High bandwidth, minimal CPU involved: 1-4 GB/s Not supported by SceMi Host DRAM Bus FPGA  Host CPU FPGA DRAM DMA Eng March 4, 2016http://csg.csail.mit.edu/6.375T04-30

Connectal A SceMi Alternative Open source hardware/software co- design library Generates glue logic between software/hardware Supports DMA onnectal Guest lecture next Wed on this March 4, 2016http://csg.csail.mit.edu/6.375T04-31