Caches-2
Constructive Computer Architecture
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
November 3, 2014
http://csg.csail.mit.edu/6.175
Blocking vs. Non-Blocking cache

Blocking cache:
- At most one outstanding miss
- The cache must wait for memory to respond
- The cache does not accept requests in the meantime

Non-blocking cache:
- Multiple outstanding misses
- The cache can continue to process requests while waiting for memory to respond to misses

We will first design a write-back, write-miss-allocate, direct-mapped, blocking cache.
Blocking Cache Interface

(Diagram: the processor sends req and reads resp; on a miss the cache forwards a request over memReq/mReqQ to DRAM or the next-level cache and receives the fill over memResp/mRespQ; internal state includes status, missReq, and hitQ.)

interface Cache;
    method Action req(MemReq r);
    method ActionValue#(Data) resp;
    method ActionValue#(MemReq) memReq;
    method Action memResp(Line r);
endinterface
Interface dynamics

- The cache either gets a hit and responds immediately, or it gets a miss, in which case it takes several steps to process the miss
- Reading the response dequeues it
- Requests and responses follow FIFO order
- Methods are guarded, e.g., the cache may not be ready to accept a request because it is processing a miss
- A status register keeps track of the state of the cache while it is processing a miss:

typedef enum {Ready, StartMiss, SendFillReq, WaitFillResp}
    CacheStatus deriving (Bits, Eq);
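The miss-handling states form a simple one-way cycle. The following Python sketch (names are ours, modeled on the BSV enum, not taken from it) makes the legal transitions explicit:

```python
from enum import Enum, auto

class CacheStatus(Enum):
    READY = auto()
    START_MISS = auto()
    SEND_FILL_REQ = auto()
    WAIT_FILL_RESP = auto()

# Legal transitions of the miss-handling FSM
NEXT = {
    CacheStatus.READY: CacheStatus.START_MISS,              # a request misses
    CacheStatus.START_MISS: CacheStatus.SEND_FILL_REQ,      # write-back (if dirty) issued
    CacheStatus.SEND_FILL_REQ: CacheStatus.WAIT_FILL_RESP,  # fill request sent
    CacheStatus.WAIT_FILL_RESP: CacheStatus.READY,          # fill data installed
}

def walk(start, steps):
    """Follow the FSM for `steps` transitions starting at `start`."""
    s = start
    for _ in range(steps):
        s = NEXT[s]
    return s
```

While status is anything other than Ready, the guard on req keeps new processor requests out, which is exactly the blocking behavior.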
Blocking Cache code structure

module mkCache(Cache);
    RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
    ...
    rule startMiss ... endrule;
    method Action req(MemReq r) ... endmethod;
    method ActionValue#(Data) resp ... endmethod;
    method ActionValue#(MemReq) memReq ... endmethod;
    method Action memResp(Line r) ... endmethod;
endmodule
Extracting cache tags & index

(Address layout, MSB to LSB: tag | index | L-bit word offset | 2-bit byte offset; the index width is determined by the cache size in bytes.)

Processor requests are for a single word, but internal communications are in line sizes (2^L words, typically L = 2).

AddrSz = CacheTagSz + CacheIndexSz + L + 2

We need getIdx, getTag, and getOffset functions:

function CacheIndex getIdx(Addr addr) = truncate(addr >> 4);
function Bit#(2) getOffset(Addr addr) = truncate(addr >> 2);
function CacheTag getTag(Addr addr) = truncateLSB(addr);

truncate keeps the least-significant bits (drops the MSBs); truncateLSB keeps the most-significant bits.
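As a sanity check on the field arithmetic, here is the same decomposition in Python, assuming 32-bit addresses, L = 2, and a hypothetical 8-bit index (the real index width follows from the cache size):

```python
# Hypothetical field widths: 2-bit byte offset, 2-bit word offset (L = 2),
# an 8-bit index; the tag is whatever remains of a 32-bit address.
BYTE_OFF_BITS = 2
WORD_OFF_BITS = 2   # L = 2, i.e. 4 words per line
INDEX_BITS = 8

def get_offset(addr):
    # word offset within the line; mirrors truncate(addr >> 2)
    return (addr >> BYTE_OFF_BITS) & ((1 << WORD_OFF_BITS) - 1)

def get_idx(addr):
    # mirrors truncate(addr >> 4): drop the low 4 bits, keep INDEX_BITS
    return (addr >> (BYTE_OFF_BITS + WORD_OFF_BITS)) & ((1 << INDEX_BITS) - 1)

def get_tag(addr):
    # mirrors truncateLSB(addr): keep the most-significant bits
    return addr >> (BYTE_OFF_BITS + WORD_OFF_BITS + INDEX_BITS)
```

For example, for address 0x12345678 the word offset is 2, the index is 0x67, and the tag is 0x12345.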
Blocking cache state elements

RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
RegFile#(CacheIndex, Maybe#(CacheTag)) tagArray <- mkRegFileFull;
RegFile#(CacheIndex, Bool) dirtyArray <- mkRegFileFull;
Fifo#(1, Data) hitQ <- mkBypassFifo;
Reg#(MemReq) missReq <- mkRegU;
Reg#(CacheStatus) status <- mkReg(Ready);
Fifo#(2, MemReq) memReqQ <- mkCFFifo;
Fifo#(2, Line) memRespQ <- mkCFFifo;

The tag and valid bits are kept together as a Maybe type. CF Fifos are preferable on the memory side because they provide better decoupling; an extra cycle there may not affect performance by much.
Req method: hit processing

method Action req(MemReq r) if(status == Ready);
    let idx = getIdx(r.addr);
    let tag = getTag(r.addr);
    Bit#(2) wOffset = truncate(r.addr >> 2);
    let currTag = tagArray.sub(idx);
    let hit = isValid(currTag) ? fromMaybe(?, currTag) == tag : False;
    if(hit) begin
        let x = dataArray.sub(idx);
        if(r.op == Ld) hitQ.enq(x[wOffset]);
        else begin // St: overwrite the appropriate word of the line
            x[wOffset] = r.data;
            dataArray.upd(idx, x);
            dirtyArray.upd(idx, True);
        end
    end
    else begin missReq <= r; status <= StartMiss; end
endmethod

It is straightforward to extend the cache interface to include a cache-line flush command.
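The hit path can be modeled in a few lines of Python (a sketch under our own naming, mirroring the per-index data, Maybe-style tags, and dirty bits of the BSV module; None plays the role of an Invalid tag):

```python
LINE_WORDS = 4  # 2^L words per line, L = 2

class BlockingCacheModel:
    def __init__(self, num_lines=256):
        self.data = [[0] * LINE_WORDS for _ in range(num_lines)]
        self.tag = [None] * num_lines    # None = Invalid
        self.dirty = [False] * num_lines
        self.hit_q = []                  # models hitQ

    def req(self, op, idx, tag, w_offset, data=None):
        """Handle one request; returns True on a hit.
        A miss would instead record the request and start the miss FSM."""
        hit = self.tag[idx] == tag and self.tag[idx] is not None
        if hit:
            if op == "Ld":
                self.hit_q.append(self.data[idx][w_offset])
            else:  # St: overwrite the appropriate word of the line
                self.data[idx][w_offset] = data
                self.dirty[idx] = True
        return hit
```

A store hit writes only the addressed word and sets the dirty bit; a load hit enqueues the word for the resp method to drain.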
Rest of the methods

method ActionValue#(Data) resp;
    hitQ.deq;
    return hitQ.first;
endmethod

Memory-side methods:

method ActionValue#(MemReq) memReq;
    memReqQ.deq;
    return memReqQ.first;
endmethod

method Action memResp(Line r);
    memRespQ.enq(r);
endmethod
Start-miss and Send-fill rules

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule startMiss(status == StartMiss);
    let idx = getIdx(missReq.addr);
    let tag = tagArray.sub(idx);
    let dirty = dirtyArray.sub(idx);
    if(isValid(tag) && dirty) begin // write-back the victim line
        let addr = {fromMaybe(?, tag), idx, 4'b0};
        let data = dataArray.sub(idx);
        memReqQ.enq(MemReq{op: St, addr: addr, data: data});
    end
    status <= SendFillReq;
endrule

rule sendFillReq(status == SendFillReq);
    memReqQ.enq(missReq);
    status <= WaitFillResp;
endrule
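The two decisions in startMiss are whether the victim needs a write-back (valid and dirty) and how its address is rebuilt by the concatenation {tag, idx, 4'b0}. A small Python check, assuming the earlier 8-bit-index layout (None again standing for an Invalid tag):

```python
INDEX_BITS = 8

def needs_writeback(victim_tag, dirty):
    # Write back only a line that is both valid (tag present) and dirty
    return victim_tag is not None and dirty

def writeback_addr(victim_tag, idx, index_bits=INDEX_BITS):
    # {tag, idx, 4'b0}: tag in the high bits, then index, then 4 zero bits
    # (the 4 zeros cover the 2-bit word offset and 2-bit byte offset)
    return (victim_tag << (index_bits + 4)) | (idx << 4)
```

Reassembling tag 0x12345 with index 0x67 gives back 0x12345670, the line-aligned address the victim originally came from.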
Wait-fill rule

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule waitFillResp(status == WaitFillResp);
    let idx = getIdx(missReq.addr);
    let tag = getTag(missReq.addr);
    Bit#(2) wOffset = getOffset(missReq.addr);
    let data = memRespQ.first;
    tagArray.upd(idx, Valid(tag));
    if(missReq.op == Ld) begin
        dirtyArray.upd(idx, False);
        dataArray.upd(idx, data);
        hitQ.enq(data[wOffset]);
    end
    else begin
        data[wOffset] = missReq.data;
        dirtyArray.upd(idx, True);
        dataArray.upd(idx, data);
    end
    memRespQ.deq;
    status <= Ready;
endrule

Is there a problem with waitFillResp? What if hitQ is blocked? Should we not at least write the line into the cache?
Hit and miss performance

Hit:
- Combinational read/write, i.e., 0-cycle response
- Requires the req and resp methods to be concurrently schedulable, which in turn requires hitQ.enq < {hitQ.deq, hitQ.first}, i.e., hitQ should be a bypass Fifo

Miss:
- No evacuation: memory load latency plus combinational read/write
- Evacuation: memory store followed by memory load latency, plus combinational read/write

Adding an extra cycle here and there in the miss case should not have a big negative performance impact.
Non-blocking cache

(Diagram: the processor sends req and receives resp; the cache forwards mReq/mReqQ to memory and receives mResp/mRespQ; responses may return out of order.)

Requests have to be tagged because responses come out of order (OOO). We will assume that all tags are unique and that the processor is responsible for reusing tags properly.
Non-blocking Cache

Behavior is described by two concurrent FSMs that process input requests and memory responses, respectively.

(Diagram: each cache line carries V, D, and W bits plus Tag and Data; a store queue StQ and a load buffer LdBuff sit beside the cache; wbQ, mReqQ, and mRespQ connect to memory.)

- A St req goes into StQ and waits until its data can be written into the cache
- LdBuff holds load requests waiting for data
- An extra bit (W) per cache line indicates whether the data for that line is present
Incoming req

Load request:
- If the address matches an entry in StQ: bypass the value from StQ (hit)
- Else if the cache state is V: hit
- Else: put the request in LdBuff; if the line's W bit is not set, send a memReq, set W, and if the victim must be evacuated, send the write-back

Store request:
- If the cache state is V and StQ is empty: write into the cache
- Else if the cache state is V (StQ not empty): put the request in StQ
- Else: put the request in StQ; if the line's W bit is not set, send a memReq, set W, and if the victim must be evacuated, send the write-back
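The decision tree above can be rendered as one Python function (a sketch with hypothetical names; it returns the list of actions the cache would take for a single incoming request, given the relevant predicates):

```python
def handle_req(op, valid, in_stq, stq_empty, waiting, must_evacuate):
    """op: "Ld" or "St"; the other arguments are the predicates from the tree."""
    acts = []
    if op == "Ld":
        if in_stq:
            acts.append("bypass-from-StQ")   # newest matching store wins
        elif valid:
            acts.append("hit")
        else:
            acts.append("put-in-LdBuff")
            if not waiting:                  # W not set: no fill outstanding yet
                acts += ["send-memReq", "set-W"]
                if must_evacuate:
                    acts.append("writeback")
    else:  # St
        if valid and stq_empty:
            acts.append("write-in-cache")
        else:
            acts.append("put-in-StQ")
            if not valid and not waiting:
                acts += ["send-memReq", "set-W"]
                if must_evacuate:
                    acts.append("writeback")
    return acts
```

Note the asymmetry: a store behind other buffered stores must also go into StQ, even on a valid line, to preserve store order.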
Mem Resp (line)

1. Update the cache line (set V and unset W)
2. Process all matching LdBuff entries and send their responses
3. L: If the cache state for the oldest StQ entry's address is V, then update the cache word with that StQ entry, remove the oldest entry, and loop back to L; else, if that line's W bit is not set, then (if evacuation is needed) send the write-back, send a memReq for this store entry, and set W
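The first part of this processing can be sketched in Python over a toy state (hypothetical names; step 3's "else" arm, which restarts a miss for the oldest non-hitting store, is omitted here for brevity):

```python
def process_mem_resp(idx, fill_line, state):
    """Apply one memory response for line `idx` carrying `fill_line`."""
    # 1. Install the line: set V, unset W
    state["data"][idx] = fill_line
    state["valid"][idx] = True
    state["waiting"][idx] = False
    # 2. Answer every load buffered for this line; ldbuff entries are
    #    (request id, line index, word offset)
    responses = [("ld", r_id, fill_line[off])
                 for (r_id, l_idx, off) in state["ldbuff"] if l_idx == idx]
    state["ldbuff"] = [e for e in state["ldbuff"] if e[1] != idx]
    # 3. Retire stores from the head of StQ while they hit in the cache;
    #    stq entries are (line index, word offset, value)
    while state["stq"] and state["valid"][state["stq"][0][0]]:
        s_idx, off, val = state["stq"].pop(0)
        state["data"][s_idx][off] = val
        state["dirty"][s_idx] = True
    return responses
```

The StQ loop may retire stores to other lines too, as long as each successive head hits; it stops at the first store whose line is not valid.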
Non-blocking Cache state declaration (code has not been tested)

module mkNBCache(NBCache);
    RegFile#(Index, Bool) valid <- mkRegFileFull;
    RegFile#(Index, Bool) dirty <- mkRegFileFull;
    RegFile#(Index, Bool) wait <- mkRegFileFull;
    RegFile#(Index, Tag) tagArray <- mkRegFileFull;
    RegFile#(Index, Line) dataArray <- mkRegFileFull;
    StQ#(StQSz) stQ <- mkStQ;
    LdBuff#(LdBuffSz) ldBuff <- mkLdBuff;
    FIFOF#(Tuple2#(Addr, Line)) wbQ <- mkFIFOF;
    FIFOF#(Addr) mReqQ <- mkFIFOF;
    FIFOF#(Tuple2#(Addr, Line)) mRespQ <- mkFIFOF;
    FIFOF#(Tuple2#(Id, Data)) respQ <- mkFIFOF;
    Reg#(Addr) addrResp <- mkRegU;
    Reg#(CacheState) buffSearch <- mkReg(None); // either LdBuff, StQ, or None
Non-blocking Cache req method (code has not been tested)

method Action req(MemReq r) if(buffSearch == None);
    let idx = getIdx(r.addr);
    let tag = getTag(r.addr);
    let line = dataArray.sub(idx);
    Bit#(2) offset = getOffset(r.addr);
    let v = valid.sub(idx);
    let t = tagArray.sub(idx);
    let d = dirty.sub(idx);
    if(r.op == Ld) begin
        if(isValid(stQ.search(r.addr)))
            respQ.enq(tuple2(r.id, fromMaybe(?, stQ.search(r.addr))));
        else if(t == tag && v)
            respQ.enq(tuple2(r.id, line[offset]));
        else begin
            ldBuff.enq(r);
            if(!wait.sub(idx)) begin
                mReqQ.enq(r.addr);
                wait.upd(idx, True);
                if(t != tag && d) begin // evacuate
                    wbQ.enq(tuple2({t, idx, 4'b0}, line));
                    tagArray.upd(idx, tag);
                end
            end
        end
    end
    else ...
Non-blocking Cache req method (cont) (code has not been tested)

    else begin // store req
        if(t == tag && v) begin
            if(stQ.empty) begin
                line[offset] = r.data;
                dataArray.upd(idx, line);
                dirty.upd(idx, True);
            end
            else stQ.enq(r);
        end
        else begin
            stQ.enq(r);
            if(!wait.sub(idx)) begin
                mReqQ.enq(r.addr);
                wait.upd(idx, True);
                if(t != tag && d) begin // evacuate
                    wbQ.enq(tuple2({t, idx, 4'b0}, line));
                    tagArray.upd(idx, tag);
                end
            end
        end
    end
endmethod
Non-blocking Cache memory response processing (code has not been tested)

rule memResp(buffSearch == None);
    match {.addr, .data} = mRespQ.first;
    mRespQ.deq;
    dataArray.upd(getIdx(addr), data);
    valid.upd(getIdx(addr), True);
    wait.upd(getIdx(addr), False);
    dirty.upd(getIdx(addr), False);
    buffSearch <= LdBuff;
    addrResp <= addr;
endrule

rule clearLoad(buffSearch == LdBuff);
    let rMaybe = ldBuff.search(addrResp);
    if(isValid(rMaybe)) begin
        let r = fromMaybe(?, rMaybe);
        Bit#(2) offset = getOffset(r.addr);
        respQ.enq(tuple2(r.id, dataArray.sub(getIdx(r.addr))[offset]));
        ldBuff.remove(r.ldBuffId);
    end
    else buffSearch <= StQ;
endrule
Non-blocking Cache rules (cont) (code has not been tested)

rule clearStore(buffSearch == StQ);
    let r = stQ.first;
    let idx = getIdx(r.addr);
    let tag = getTag(r.addr);
    Bit#(2) offset = getOffset(r.addr);
    let line = dataArray.sub(idx);
    let v = valid.sub(idx);
    let t = tagArray.sub(idx);
    if(t == tag && v) begin
        line[offset] = r.data;
        dataArray.upd(idx, line);
        dirty.upd(idx, True);
        stQ.deq; // stay in the StQ state to retire the next store
    end
    else begin
        if(!wait.sub(idx)) begin
            mReqQ.enq(r.addr);
            wait.upd(idx, True);
            if(t != tag && dirty.sub(idx)) begin // evacuate
                wbQ.enq(tuple2({t, idx, 4'b0}, line));
                tagArray.upd(idx, tag);
            end
        end
        buffSearch <= None;
    end
endrule
Non-blocking Cache methods (cont) (code has not been tested)

method ActionValue#(Addr) memReq;
    mReqQ.deq;
    return mReqQ.first;
endmethod

method ActionValue#(Tuple2#(Addr, Line)) wbResp;
    wbQ.deq;
    return wbQ.first;
endmethod

method Action memResp(Tuple2#(Addr, Line) r);
    mRespQ.enq(r);
endmethod
Four-Stage Pipeline

(Diagram: PC and next-address prediction feed Fetch; FIFOs f12f2, f2d, d2e, e2m, and m2w connect the Fetch, Decode, Execute, Memory, and Writeback stages; the design also includes the epoch, register file, scoreboard, and the instruction and data memories.)

Insert bypass FIFOs to deal with (0, n)-cycle memory responses.