Realistic Memories and Caches


Realistic Memories and Caches
Constructive Computer Architecture
Arvind
Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
October 18, 2017
http://csg.csail.mit.edu/6.175

2-Stage Pipeline

[Figure: a two-stage pipeline (Fetch/Decode/RegisterFetch, then Execute/Memory/WriteBack) with PC, epoch, the nap next-address predictor, register file, scoreboard, the d2e FIFO, and separate instruction and data memories; Execute redirects the PC on a misprediction.]

The use of magic memories (combinational reads) makes such designs unrealistic.

Magic Memory Model

[Figure: a RAM block with Address, WriteData, WriteEnable, and Clock inputs and a ReadData output.]

Reads and writes are always completed in one cycle
  - a Read can be done at any time (i.e. combinationally)
  - if enabled, a Write is performed at the rising clock edge (the write address and data must be stable at the clock edge)
In a real SRAM or DRAM the data will be available several cycles after the address is supplied
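As a point of reference, here is a minimal BSV sketch of the magic-memory view described above; the interface name MagicMem and the 32-bit Addr/Data widths are illustrative assumptions, not names from the lecture's code:

    typedef Bit#(32) Addr;
    typedef Bit#(32) Data;

    // Combinational read, synchronous write: convenient to model,
    // but not what a real SRAM or DRAM provides.
    interface MagicMem;
        method Data   read(Addr a);            // value available in the same cycle
        method Action write(Addr a, Data d);   // takes effect at the rising clock edge
    endinterface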

Memory System

View iMem as a request/response system and split the fetch rule into two rules – one to send a request and the other to receive the response
  - insert a FIFO (f12f2) to hold the pc addresses of the instructions being fetched; it can be the same as f2d
  - assume iMem behaves like a FIFO: enq the request, then first/deq the response
A similar idea applies to dMem

[Figure: PC feeding iMem and the nap next-address predictor, with the Epoch register and the f12f2/f2d FIFO carrying fetch information to Decode.]
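A sketch of the FIFO-like request/response port assumed here, reusing the Addr and Data typedefs from the sketch above. The interface name MemReqRes and its type parameters are illustrative assumptions; the enq/first/deq methods and the MemReq fields mirror how iMem and dMem are used on the following slides (iMem is given just a pc, dMem a full MemReq):

    typedef enum {Ld, St} MemOp deriving (Bits, Eq);

    typedef struct {
        MemOp op;
        Addr  addr;
        Data  data;    // don't-care (?) for loads
    } MemReq deriving (Bits, Eq);

    // Latency-insensitive memory port: enq a request now, pick up the
    // corresponding response with first/deq some cycles later.
    interface MemReqRes#(type reqT, type respT);
        method Action enq(reqT r);    // issue a request
        method respT  first;          // oldest outstanding response
        method Action deq;            // consume that response
    endinterface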

Connecting 2-Stage-pipeline to req/res memory

Starting point (doExecute < doFetch): with a magic memory, doFetch does both the fetch (magic-memory read) and the decode work, and will have to be split.

rule doFetch;
    let instF = iMem.req(pc[1]);                 // magic memory: combinational read
    let ppcF = nap(pc[1]);
    let dInst = decode(instF);
    let stall = sb.search1(dInst.src1) || sb.search2(dInst.src2);
    if (!stall) begin
        ... fetch register values
        d2e.enq(Decode2Execute{pc: pc[1], ppc: ppcF, dInst: dInst,
                               epoch: epoch[1], rVal1: rVal1, rVal2: rVal2});
        sb.insert(dInst.rDst);
        pc[1] <= ppcF;
    end
endrule

rule doExecute;
    ... the same as before ...
        if (x.ppc != nextPC) begin
            pc[0] <= eInst.addr; epoch[0] <= !epoch[0];
        end
    end
    d2e.deq; sb.remove;
endrule

Connecting 2-Stage-pipeline to Req/Res memory

doExecute < doFetch. With a req/res memory, doFetch is split into a fetch rule (send the request) and a decode rule (receive the response):

rule fetch;
    iMem.enq(pc[1]);
    let ppcF = nap(pc[1]);
    pc[1] <= ppcF;
    f2d.enq(Fetch2Decode{pc: pc[1], ppc: ppcF, epoch: epoch[1]});
endrule

rule decode;
    let inst = iMem.first;
    let x = f2d.first;
    let dInst = decode(inst);
    let stall = sb.search1(dInst.src1) || sb.search2(dInst.src2);
    if (!stall) begin
        ... fetch register values
        d2e.enq(Decode2Execute{pc: x.pc, ppc: x.ppc, dInst: dInst,
                               epoch: x.epoch, rVal1: rVal1, rVal2: rVal2});
        sb.insert(dInst.rDst);
        iMem.deq; f2d.deq;          // must be done only if not stalling
    end
endrule

What is the advantage of doing nap in the fetch rule vs. in the decode rule?
We can also drop the instruction in decode if the epoch has changed (next slide).

Dropping instructions

rule decode;
    let inst = iMem.first;
    let x = f2d.first;
    if (epoch[?] != x.epoch) begin
        iMem.deq; f2d.deq;          // dropping wrong-path instruction
    end
    else begin
        let dInst = decode(inst);
        let stall = sb.search1(dInst.src1) || sb.search2(dInst.src2);
        if (!stall) begin
            ... fetch register values
            d2e.enq(Decode2Execute{pc: x.pc, ppc: x.ppc, dInst: dInst,
                                   epoch: x.epoch, rVal1: rVal1, rVal2: rVal2});
            sb.insert(dInst.rDst);
            iMem.deq; f2d.deq;
        end
    end
endrule

Which port should epoch[?] use: are both 0 and 1 correct? Yes, but 1 is better (port 1 sees a redirection made by doExecute in the same cycle, so wrong-path instructions are dropped sooner).

Data access in the execute stage

The Execute rule has to be split too, in order to deal with the multicycle memory system.

How should the functions of execute be split across rules?
  - call exec
  - initiate memory ops, wait for load results
  - redirection
  - register update
  - scoreboard updates
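The split adds one more pipeline FIFO between the new execute and writeback rules. A sketch of the added state, assuming ExecInst is the type returned by exec, a one-bit (Bool) epoch, and the Exec2WB field names used on the next slides; the FIFO instantiation would sit inside the processor module:

    import FIFO::*;

    // Information passed from execute to writeback (first attempt).
    typedef struct {
        ExecInst eInst;   // result of exec
        Addr     pc;
        Bool     epoch;   // epoch the instruction was fetched in
    } Exec2WB deriving (Bits, Eq);

    FIFO#(Exec2WB) e2w <- mkFIFO;   // execute -> writeback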

Transforming the Execute rule – first attempt

The original doExecute mixes execute, memory request/response, and writeback work; the comments mark where it will be split.

rule doExecute;
    let x = d2e.first;
    let dInstE = x.dInst;   let pcE = x.pc;   let inEp = x.epoch;
    let rVal1E = x.rVal1;   let rVal2E = x.rVal2;
    if (epoch == inEp) begin
        let eInst = exec(dInstE, rVal1E, rVal2E, pcE);                        // execute
        if (eInst.iType == Ld)                                                // memory req/res
            eInst.data <- dMem.req(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
            let d <- dMem.req(MemReq{op: St, addr: eInst.addr, data: eInst.data});
        // ---- split here: the rest belongs to writeback ----
        if (isValid(eInst.dst))
            rf.wr(fromMaybe(?, eInst.dst), eInst.data);
        let nextPC = eInst.brTaken ? eInst.addr : pcE + 4;
        if (x.ppc != nextPC) begin
            pc[0] <= eInst.addr; epoch[0] <= !epoch[0];
        end
    end
    d2e.deq; sb.remove;
endrule

Execute rule: first attempt

rule execute;
    let x = d2e.first;
    let dInstE = x.dInst;   let pcE = x.pc;   let inEp = x.epoch;
    let rVal1E = x.rVal1;   let rVal2E = x.rVal2;
    if (epoch[1] != inEp) begin
        sb.remove;       // why? (the instruction reserved its destination in sb at decode)
    end
    else begin
        let eInst = exec(dInstE, rVal1E, rVal2E, pcE);
        e2w.enq(Exec2WB{eInst: eInst, pc: pcE, epoch: inEp});
        if (eInst.iType == Ld)
            dMem.enq(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
            dMem.enq(MemReq{op: St, addr: eInst.addr, data: eInst.data});
    end
    d2e.deq;
endrule

Writeback rule: first attempt

rule writeback;
    let x = e2w.first;
    let pcE = x.pc;   let eInst = x.eInst;   let inEp = x.epoch;
    if (epoch[0] == inEp) begin
        if (isValid(eInst.dst)) begin
            let data = (eInst.iType == Ld) ? dMem.first : eInst.data;
            rf.wr(fromMaybe(?, eInst.dst), data);
        end
        if (eInst.iType == Ld) dMem.deq;
        let nextPC = eInst.brTaken ? eInst.addr : pcE + 4;
        if (x.ppc != nextPC) begin
            pc[0] <= eInst.addr; epoch[0] <= !epoch[0];
        end
    end
    sb.remove; e2w.deq;
endrule

Notice: we have assumed that a St does not get a response.

Problems with the first attempt

sb.remove is being called from both execute and writeback
  - out of order removals – correctness
  - simultaneous removals – concurrency
A St that was initiated in execute could be invalidated in writeback (wrong-path instruction); consider a branch followed by a store
  - a store, once it is sent to the memory, cannot be recalled
Let us move redirection from writeback to execute, and sb.remove from execute to writeback

Dropping vs. poisoning an instruction

Once an instruction is determined to be on the wrong path, it is either dropped or poisoned
  - Drop: if the wrong-path instruction has not modified any bookkeeping structures (e.g., the scoreboard), it is simply removed
  - Poison: if the wrong-path instruction has modified bookkeeping structures, it is poisoned and passed down for bookkeeping reasons (say, to remove it from the scoreboard)
Subsequent stages know not to update any architectural state for a poisoned instruction
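The second attempt on the next slides encodes poisoning by making each e2w entry a Maybe: a poisoned instruction is enqueued as Invalid, so writeback still dequeues it and calls sb.remove but skips all architectural-state updates. A sketch, reusing the (assumed) Exec2WB struct from the earlier sketch, now without the epoch field since redirection moves to execute:

    // Invalid = poisoned wrong-path instruction, Valid = normal instruction.
    FIFO#(Maybe#(Exec2WB)) e2w <- mkFIFO;

    // In writeback (two slides below):
    //   vx matches tagged Valid .x  -> write rf, deq dMem for loads
    //   sb.remove and e2w.deq are done for both Valid and Invalid entries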

Execute rule: second attempt

rule execute;
    let x = d2e.first;
    ...
    if (epoch[0] != inEp) begin   // epoch[1] would create a combinational cycle and make the rule invalid
        e2w.enq(Invalid);         // poisoning!
        d2e.deq;
    end
    else begin
        let eInst = exec(dInstE, rVal1E, rVal2E, pcE);
        if (eInst.iType == Ld)
            dMem.enq(MemReq{op: Ld, addr: eInst.addr, data: ?});
        else if (eInst.iType == St)
            dMem.enq(MemReq{op: St, addr: eInst.addr, data: eInst.data});
        let nextPC = eInst.brTaken ? eInst.addr : pcE + 4;
        if (x.ppc != nextPC) begin
            pc[0] <= eInst.addr; epoch[0] <= !epoch[0];
        end
        e2w.enq(Valid (Exec2WB{eInst: eInst, pc: pcE}));
        d2e.deq;
    end
endrule

Writeback rule: second attempt

rule writeback;
    let vx = e2w.first;
    if (vx matches tagged Valid .x) begin
        let pcE = x.pc;   let eInst = x.eInst;
        if (isValid(eInst.dst)) begin
            let data = (eInst.iType == Ld) ? dMem.first : eInst.data;
            rf.wr(fromMaybe(?, eInst.dst), data);
        end
        if (eInst.iType == Ld) dMem.deq;
    end
    sb.remove; e2w.deq;
endrule

Observations

sb.remove is called only from writeback ==> no concurrency issues
Redirection is done from execute ==> better for performance
A St is initiated in execute and cannot be squashed by any older instruction still in writeback or the e2w fifo
stall will work correctly in decode because the scoreboard is not updated until the reg-file is also updated

Memory Hierarchy

[Figure: CPU with register file, backed by a small, fast memory (SRAM) that holds frequently used data, backed by a big, slow memory (DRAM).]

size:      RegFile << SRAM << DRAM   (due to cost)
latency:   RegFile << SRAM << DRAM   (due to the size of DRAM)
bandwidth: on-chip >> off-chip       (due to cost and wire delays: wires on-chip cost much less, and are faster)

On a data access:
  hit  (data ∈ fast memory)  ⇒  low latency access
  miss (data ∉ fast memory)  ⇒  long latency access (DRAM)

Managing fast storage

User managed: scratchpad memory
  - the ISA is aware of the storage hierarchy; separate instructions are needed to access different storage levels
Automatically managed: cache memory
  - the programmer has little control over how data moves between fast and slow memory
  - historically very successful (painless for the programmer)

Why do caches work?

Temporal locality
  - if a memory location is referenced at time t, there is a very high probability that it will be referenced again in the near future, say within the next several thousand instructions (frequently observed behavior)
  - working set of locations for an instruction window
Spatial locality
  - if address x is referenced, then addresses x+1, x+2, etc. are very likely to be referenced in the near future
  - consider instruction streams, array and record accesses

Inside a Cache

[Figure: cache entries, each with a valid bit, a tag, and a data block; e.g. the entry with tag 100 holds the data bytes from locations 100, 101, ..., alongside entries with tags 304 and 6848.]

A cache line usually holds more than one word, to
  - exploit spatial locality
  - transport large data sets more efficiently
  - reduce the number of tag bits needed to identify a cache line
The tag only needs enough bits to uniquely identify the block (jse)
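A BSV sketch of one such cache line; the sizes (four 32-bit words per line, 20 tag bits) and the type names are illustrative choices rather than the lecture's:

    import Vector::*;

    typedef Bit#(32) Word;
    typedef Bit#(20) Tag;                // just enough bits to identify the block
    typedef Vector#(4, Word) Line;       // words from consecutive addresses

    typedef struct {
        Bool valid;   // valid bit
        Tag  tag;
        Line data;    // e.g. the data from locations 100, 101, ...
    } CacheLine deriving (Bits, Eq);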