Caches-2
Constructive Computer Architecture
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
November 3, 2014
http://csg.csail.mit.edu/6.175
Blocking vs. Non-Blocking cache

Blocking cache:
- At most one outstanding miss
- The cache must wait for memory to respond
- The cache does not accept requests in the meantime

Non-blocking cache:
- Multiple outstanding misses
- The cache can continue to process requests while waiting for memory to respond to misses

We will first design a write-back, write-miss-allocate, direct-mapped, blocking cache.
Blocking Cache Interface

(Diagram: the processor sends req and reads resp; on a miss the cache forwards a request over memReq/mReqQ to DRAM or the next-level cache and receives the fill over memResp/mRespQ; internal state includes status, missReq, and hitQ.)

interface Cache;
    method Action req(MemReq r);
    method ActionValue#(Data) resp;
    method ActionValue#(MemReq) memReq;
    method Action memResp(Line r);
endinterface
Interface dynamics

- The cache either gets a hit and responds immediately, or it gets a miss, in which case it takes several steps to process the miss
- Reading the response dequeues it
- Requests and responses follow FIFO order
- Methods are guarded, e.g., the cache may not be ready to accept a request because it is processing a miss
- A status register keeps track of the state of the cache while it is processing a miss:

typedef enum {Ready, StartMiss, SendFillReq, WaitFillResp}
    CacheStatus deriving (Bits, Eq);
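The miss-handling states form a simple one-way cycle. The following Python sketch (names are ours, modeled on the BSV enum, not taken from it) makes the legal transitions explicit:

```python
from enum import Enum, auto

class CacheStatus(Enum):
    READY = auto()
    START_MISS = auto()
    SEND_FILL_REQ = auto()
    WAIT_FILL_RESP = auto()

# Legal transitions of the miss-handling FSM
NEXT = {
    CacheStatus.READY: CacheStatus.START_MISS,              # a request misses
    CacheStatus.START_MISS: CacheStatus.SEND_FILL_REQ,      # write-back (if dirty) issued
    CacheStatus.SEND_FILL_REQ: CacheStatus.WAIT_FILL_RESP,  # fill request sent
    CacheStatus.WAIT_FILL_RESP: CacheStatus.READY,          # fill data installed
}

def walk(start, steps):
    """Follow the FSM for `steps` transitions starting at `start`."""
    s = start
    for _ in range(steps):
        s = NEXT[s]
    return s
```

While status is anything other than Ready, the guard on req keeps new processor requests out, which is exactly the blocking behavior.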
Blocking Cache code structure

module mkCache(Cache);
    RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
    ...
    rule startMiss ... endrule;
    method Action req(MemReq r) ... endmethod;
    method ActionValue#(Data) resp ... endmethod;
    method ActionValue#(MemReq) memReq ... endmethod;
    method Action memResp(Line r) ... endmethod;
endmodule
Extracting cache tags & index

(Address layout, MSB to LSB: tag | index | L-bit word offset | 2-bit byte offset; the index width is determined by the cache size in bytes.)

Processor requests are for a single word, but internal communications are in line sizes (2^L words, typically L = 2).

AddrSz = CacheTagSz + CacheIndexSz + L + 2

We need getIdx, getTag, and getOffset functions:

function CacheIndex getIdx(Addr addr) = truncate(addr >> 4);
function Bit#(2) getOffset(Addr addr) = truncate(addr >> 2);
function CacheTag getTag(Addr addr) = truncateLSB(addr);

truncate keeps the least-significant bits (drops the MSBs); truncateLSB keeps the most-significant bits.
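As a sanity check on the field arithmetic, here is the same decomposition in Python, assuming 32-bit addresses, L = 2, and a hypothetical 8-bit index (the real index width follows from the cache size):

```python
# Hypothetical field widths: 2-bit byte offset, 2-bit word offset (L = 2),
# an 8-bit index; the tag is whatever remains of a 32-bit address.
BYTE_OFF_BITS = 2
WORD_OFF_BITS = 2   # L = 2, i.e. 4 words per line
INDEX_BITS = 8

def get_offset(addr):
    # word offset within the line; mirrors truncate(addr >> 2)
    return (addr >> BYTE_OFF_BITS) & ((1 << WORD_OFF_BITS) - 1)

def get_idx(addr):
    # mirrors truncate(addr >> 4): drop the low 4 bits, keep INDEX_BITS
    return (addr >> (BYTE_OFF_BITS + WORD_OFF_BITS)) & ((1 << INDEX_BITS) - 1)

def get_tag(addr):
    # mirrors truncateLSB(addr): keep the most-significant bits
    return addr >> (BYTE_OFF_BITS + WORD_OFF_BITS + INDEX_BITS)
```

For example, for address 0x12345678 the word offset is 2, the index is 0x67, and the tag is 0x12345.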
Blocking cache state elements

RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
RegFile#(CacheIndex, Maybe#(CacheTag)) tagArray <- mkRegFileFull;
RegFile#(CacheIndex, Bool) dirtyArray <- mkRegFileFull;
Fifo#(1, Data) hitQ <- mkBypassFifo;
Reg#(MemReq) missReq <- mkRegU;
Reg#(CacheStatus) status <- mkReg(Ready);
Fifo#(2, MemReq) memReqQ <- mkCFFifo;
Fifo#(2, Line) memRespQ <- mkCFFifo;

The tag and valid bits are kept together as a Maybe type. CF Fifos are preferable on the memory side because they provide better decoupling; an extra cycle there may not affect performance by much.
Req method: hit processing

method Action req(MemReq r) if(status == Ready);
    let idx = getIdx(r.addr);
    let tag = getTag(r.addr);
    Bit#(2) wOffset = truncate(r.addr >> 2);
    let currTag = tagArray.sub(idx);
    let hit = isValid(currTag) ? fromMaybe(?, currTag) == tag : False;
    if(hit) begin
        let x = dataArray.sub(idx);
        if(r.op == Ld) hitQ.enq(x[wOffset]);
        else begin // St: overwrite the appropriate word of the line
            x[wOffset] = r.data;
            dataArray.upd(idx, x);
            dirtyArray.upd(idx, True);
        end
    end
    else begin missReq <= r; status <= StartMiss; end
endmethod

It is straightforward to extend the cache interface to include a cache-line flush command.
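The hit path can be modeled in a few lines of Python (a sketch under our own naming, mirroring the per-index data, Maybe-style tags, and dirty bits of the BSV module; None plays the role of an Invalid tag):

```python
LINE_WORDS = 4  # 2^L words per line, L = 2

class BlockingCacheModel:
    def __init__(self, num_lines=256):
        self.data = [[0] * LINE_WORDS for _ in range(num_lines)]
        self.tag = [None] * num_lines    # None = Invalid
        self.dirty = [False] * num_lines
        self.hit_q = []                  # models hitQ

    def req(self, op, idx, tag, w_offset, data=None):
        """Handle one request; returns True on a hit.
        A miss would instead record the request and start the miss FSM."""
        hit = self.tag[idx] == tag and self.tag[idx] is not None
        if hit:
            if op == "Ld":
                self.hit_q.append(self.data[idx][w_offset])
            else:  # St: overwrite the appropriate word of the line
                self.data[idx][w_offset] = data
                self.dirty[idx] = True
        return hit
```

A store hit writes only the addressed word and sets the dirty bit; a load hit enqueues the word for the resp method to drain.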
Rest of the methods

method ActionValue#(Data) resp;
    hitQ.deq;
    return hitQ.first;
endmethod

Memory-side methods:

method ActionValue#(MemReq) memReq;
    memReqQ.deq;
    return memReqQ.first;
endmethod

method Action memResp(Line r);
    memRespQ.enq(r);
endmethod
Start-miss and Send-fill rules

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule startMiss(status == StartMiss);
    let idx = getIdx(missReq.addr);
    let tag = tagArray.sub(idx);
    let dirty = dirtyArray.sub(idx);
    if(isValid(tag) && dirty) begin // write-back the victim line
        let addr = {fromMaybe(?, tag), idx, 4'b0};
        let data = dataArray.sub(idx);
        memReqQ.enq(MemReq{op: St, addr: addr, data: data});
    end
    status <= SendFillReq;
endrule

rule sendFillReq(status == SendFillReq);
    memReqQ.enq(missReq);
    status <= WaitFillResp;
endrule
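The two decisions in startMiss are whether the victim needs a write-back (valid and dirty) and how its address is rebuilt by the concatenation {tag, idx, 4'b0}. A small Python check, assuming the earlier 8-bit-index layout (None again standing for an Invalid tag):

```python
INDEX_BITS = 8

def needs_writeback(victim_tag, dirty):
    # Write back only a line that is both valid (tag present) and dirty
    return victim_tag is not None and dirty

def writeback_addr(victim_tag, idx, index_bits=INDEX_BITS):
    # {tag, idx, 4'b0}: tag in the high bits, then index, then 4 zero bits
    # (the 4 zeros cover the 2-bit word offset and 2-bit byte offset)
    return (victim_tag << (index_bits + 4)) | (idx << 4)
```

Reassembling tag 0x12345 with index 0x67 gives back 0x12345670, the line-aligned address the victim originally came from.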
Wait-fill rule

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule waitFillResp(status == WaitFillResp);
    let idx = getIdx(missReq.addr);
    let tag = getTag(missReq.addr);
    Bit#(2) wOffset = getOffset(missReq.addr);
    let data = memRespQ.first;
    tagArray.upd(idx, Valid(tag));
    if(missReq.op == Ld) begin
        dirtyArray.upd(idx, False);
        dataArray.upd(idx, data);
        hitQ.enq(data[wOffset]);
    end
    else begin
        data[wOffset] = missReq.data;
        dirtyArray.upd(idx, True);
        dataArray.upd(idx, data);
    end
    memRespQ.deq;
    status <= Ready;
endrule

Is there a problem with waitFillResp? What if hitQ is blocked? Should we not at least write the line into the cache?
Hit and miss performance

Hit:
- Combinational read/write, i.e., 0-cycle response
- Requires the req and resp methods to be concurrently schedulable, which in turn requires hitQ.enq < {hitQ.deq, hitQ.first}, i.e., hitQ should be a bypass Fifo

Miss:
- No evacuation: memory load latency plus combinational read/write
- Evacuation: memory store followed by memory load latency, plus combinational read/write

Adding an extra cycle here and there in the miss case should not have a big negative performance impact.
Non-blocking cache

(Diagram: the processor sends req and receives resp; the cache forwards mReq/mReqQ to memory and receives mResp/mRespQ; responses may return out of order.)

Requests have to be tagged because responses come out of order (OOO). We will assume that all tags are unique and that the processor is responsible for reusing tags properly.
Non-blocking Cache

Behavior is described by two concurrent FSMs that process input requests and memory responses, respectively.

(Diagram: each cache line carries V, D, and W bits plus Tag and Data; a store queue StQ and a load buffer LdBuff sit beside the cache; wbQ, mReqQ, and mRespQ connect to memory.)

- A St req goes into StQ and waits until its data can be written into the cache
- LdBuff holds load requests waiting for data
- An extra bit (W) per cache line indicates whether the data for that line is present
Incoming req

Load request:
- If the address matches an entry in StQ: bypass the value from StQ (hit)
- Else if the cache state is V: hit
- Else: put the request in LdBuff; if the line's W bit is not set, send a memReq, set W, and if the victim must be evacuated, send the write-back

Store request:
- If the cache state is V and StQ is empty: write into the cache
- Else if the cache state is V (StQ not empty): put the request in StQ
- Else: put the request in StQ; if the line's W bit is not set, send a memReq, set W, and if the victim must be evacuated, send the write-back
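The decision tree above can be rendered as one Python function (a sketch with hypothetical names; it returns the list of actions the cache would take for a single incoming request, given the relevant predicates):

```python
def handle_req(op, valid, in_stq, stq_empty, waiting, must_evacuate):
    """op: "Ld" or "St"; the other arguments are the predicates from the tree."""
    acts = []
    if op == "Ld":
        if in_stq:
            acts.append("bypass-from-StQ")   # newest matching store wins
        elif valid:
            acts.append("hit")
        else:
            acts.append("put-in-LdBuff")
            if not waiting:                  # W not set: no fill outstanding yet
                acts += ["send-memReq", "set-W"]
                if must_evacuate:
                    acts.append("writeback")
    else:  # St
        if valid and stq_empty:
            acts.append("write-in-cache")
        else:
            acts.append("put-in-StQ")
            if not valid and not waiting:
                acts += ["send-memReq", "set-W"]
                if must_evacuate:
                    acts.append("writeback")
    return acts
```

Note the asymmetry: a store behind other buffered stores must also go into StQ, even on a valid line, to preserve store order.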
Mem Resp (line)

1. Update the cache line (set V and unset W)
2. Process all matching LdBuff entries and send their responses
3. L: If the cache state for the oldest StQ entry's address is V, then update the cache word with that StQ entry, remove the oldest entry, and loop back to L; else, if that line's W bit is not set, then (if evacuation is needed) send the write-back, send a memReq for this store entry, and set W
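The first part of this processing can be sketched in Python over a toy state (hypothetical names; step 3's "else" arm, which restarts a miss for the oldest non-hitting store, is omitted here for brevity):

```python
def process_mem_resp(idx, fill_line, state):
    """Apply one memory response for line `idx` carrying `fill_line`."""
    # 1. Install the line: set V, unset W
    state["data"][idx] = fill_line
    state["valid"][idx] = True
    state["waiting"][idx] = False
    # 2. Answer every load buffered for this line; ldbuff entries are
    #    (request id, line index, word offset)
    responses = [("ld", r_id, fill_line[off])
                 for (r_id, l_idx, off) in state["ldbuff"] if l_idx == idx]
    state["ldbuff"] = [e for e in state["ldbuff"] if e[1] != idx]
    # 3. Retire stores from the head of StQ while they hit in the cache;
    #    stq entries are (line index, word offset, value)
    while state["stq"] and state["valid"][state["stq"][0][0]]:
        s_idx, off, val = state["stq"].pop(0)
        state["data"][s_idx][off] = val
        state["dirty"][s_idx] = True
    return responses
```

The StQ loop may retire stores to other lines too, as long as each successive head hits; it stops at the first store whose line is not valid.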
Non-blocking Cache state declaration (code has not been tested)

module mkNBCache(NBCache);
    RegFile#(Index, Bool) valid <- mkRegFileFull;
    RegFile#(Index, Bool) dirty <- mkRegFileFull;
    RegFile#(Index, Bool) wait <- mkRegFileFull;
    RegFile#(Index, Tag) tagArray <- mkRegFileFull;
    RegFile#(Index, Line) dataArray <- mkRegFileFull;
    StQ#(StQSz) stQ <- mkStQ;
    LdBuff#(LdBuffSz) ldBuff <- mkLdBuff;
    FIFOF#(Tuple2#(Addr, Line)) wbQ <- mkFIFOF;
    FIFOF#(Addr) mReqQ <- mkFIFOF;
    FIFOF#(Tuple2#(Addr, Line)) mRespQ <- mkFIFOF;
    FIFOF#(Tuple2#(Id, Data)) respQ <- mkFIFOF;
    Reg#(Addr) addrResp <- mkRegU;
    Reg#(CacheState) buffSearch <- mkReg(None); // either LdBuff, StQ, or None
Non-blocking Cache req method (code has not been tested)

method Action req(MemReq r) if(buffSearch == None);
    let idx = getIdx(r.addr);
    let tag = getTag(r.addr);
    let line = dataArray.sub(idx);
    Bit#(2) offset = getOffset(r.addr);
    let v = valid.sub(idx);
    let t = tagArray.sub(idx);
    let d = dirty.sub(idx);
    if(r.op == Ld) begin
        if(isValid(stQ.search(r.addr)))
            respQ.enq(tuple2(r.id, fromMaybe(?, stQ.search(r.addr))));
        else if(t == tag && v)
            respQ.enq(tuple2(r.id, line[offset]));
        else begin
            ldBuff.enq(r);
            if(!wait.sub(idx)) begin
                mReqQ.enq(r.addr);
                wait.upd(idx, True);
                if(t != tag && d) begin // evacuate
                    wbQ.enq(tuple2({t, idx, 4'b0}, line));
                    tagArray.upd(idx, tag);
                end
            end
        end
    end
    else ...
Non-blocking Cache req method (cont) (code has not been tested)

    else begin // store req
        if(t == tag && v) begin
            if(stQ.empty) begin
                line[offset] = r.data;
                dataArray.upd(idx, line);
                dirty.upd(idx, True);
            end
            else stQ.enq(r);
        end
        else begin
            stQ.enq(r);
            if(!wait.sub(idx)) begin
                mReqQ.enq(r.addr);
                wait.upd(idx, True);
                if(t != tag && d) begin // evacuate
                    wbQ.enq(tuple2({t, idx, 4'b0}, line));
                    tagArray.upd(idx, tag);
                end
            end
        end
    end
endmethod
Non-blocking Cache memory response processing (code has not been tested)

rule memResp(buffSearch == None);
    match {.addr, .data} = mRespQ.first;
    mRespQ.deq;
    dataArray.upd(getIdx(addr), data);
    valid.upd(getIdx(addr), True);
    wait.upd(getIdx(addr), False);
    dirty.upd(getIdx(addr), False);
    buffSearch <= LdBuff;
    addrResp <= addr;
endrule

rule clearLoad(buffSearch == LdBuff);
    let rMaybe = ldBuff.search(addrResp);
    if(isValid(rMaybe)) begin
        let r = fromMaybe(?, rMaybe);
        Bit#(2) offset = getOffset(r.addr);
        respQ.enq(tuple2(r.id, dataArray.sub(getIdx(r.addr))[offset]));
        ldBuff.remove(r.ldBuffId);
    end
    else buffSearch <= StQ;
endrule
Non-blocking Cache rules (cont) (code has not been tested)

rule clearStore(buffSearch == StQ);
    let r = stQ.first;
    let idx = getIdx(r.addr);
    let tag = getTag(r.addr);
    Bit#(2) offset = getOffset(r.addr);
    let line = dataArray.sub(idx);
    let v = valid.sub(idx);
    let t = tagArray.sub(idx);
    if(t == tag && v) begin
        line[offset] = r.data;
        dataArray.upd(idx, line);
        dirty.upd(idx, True);
        stQ.deq; // stay in the StQ state to retire the next store
    end
    else begin
        if(!wait.sub(idx)) begin
            mReqQ.enq(r.addr);
            wait.upd(idx, True);
            if(t != tag && dirty.sub(idx)) begin // evacuate
                wbQ.enq(tuple2({t, idx, 4'b0}, line));
                tagArray.upd(idx, tag);
            end
        end
        buffSearch <= None;
    end
endrule
Non-blocking Cache methods (cont) (code has not been tested)

method ActionValue#(Addr) memReq;
    mReqQ.deq;
    return mReqQ.first;
endmethod

method ActionValue#(Tuple2#(Addr, Line)) wbResp;
    wbQ.deq;
    return wbQ.first;
endmethod

method Action memResp(Tuple2#(Addr, Line) r);
    mRespQ.enq(r);
endmethod
Four-Stage Pipeline

(Diagram: PC and next-address prediction feed Fetch; FIFOs f12f2, f2d, d2e, e2m, and m2w connect the Fetch, Decode, Execute, Memory, and Writeback stages; the design also includes the epoch, register file, scoreboard, and the instruction and data memories.)

Insert bypass FIFOs to deal with (0, n)-cycle memory responses.