Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Slides:

Advertisements

Similar presentations

Constructive Computer Architecture: Data Hazards in Pipelined Processors Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

Advertisements

Realistic Memories and Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology January 19, 2012L9-1

Stmt FSM Richard S. Uhler Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology (based on a lecture prepared by Arvind)

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L07-1

Computer Architecture: A Constructive Approach Branch Direction Prediction – Six Stage Pipeline Joel Emer Computer Science & Artificial Intelligence Lab.

February 21, 2007http://csg.csail.mit.edu/6.375/L07-1 Bluespec-4: Architectural exploration using IP lookup Arvind Computer Science & Artificial Intelligence.

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L06-1.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 4, 2013

Constructive Computer Architecture Realistic Memories and Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

Caches and in-order pipelines Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 11, 2012L24-1.

Virtual Memory Part 1 Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology May 2, 2012L22-1

March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Realistic Memories and Caches – Part II Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology April 2, 2012L14-1.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Computer Architecture: A Constructive Approach Next Address Prediction – Six Stage Pipeline Joel Emer Computer Science & Artificial Intelligence Lab. Massachusetts.

Constructive Computer Architecture: Control Hazards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October.

1 Tutorial: Lab 4 Again Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Constructive Computer Architecture Realistic Memories and Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 8,

October 22, 2009http://csg.csail.mit.edu/korea Modular Refinement Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Realistic Memories and Caches – Part III Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology April 4, 2012L15-1.

Constructive Computer Architecture Store Buffers and Non-blocking Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

Modeling Processors Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 1, 2010

October 6, 2009http://csg.csail.mit.edu/koreaL10-1 IP Lookup-2: The Completion Buffer Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Elastic Pipelines: Concurrency Issues

Caches-2 Constructive Computer Architecture Arvind

Realistic Memories and Caches

Constructive Computer Architecture Tutorial 6 Cache & Exception

Bluespec-3: A non-pipelined processor Arvind

Realistic Memories and Caches

Bluespec-6: Modeling Processors

Folded “Combinational” circuits

Scheduling Constraints on Interface methods

Sequential Circuits: Constructive Computer Architecture

Caches-2 Constructive Computer Architecture Arvind

IP Lookup: Some subtle concurrency issues

Stmt FSM Arvind (with the help of Nirav Dave)

Cache Coherence Constructive Computer Architecture Arvind

Pipelining combinational circuits

Constructive Computer Architecture Tutorial 7 Final Project Overview

Pipelining combinational circuits

Realistic Memories and Caches

Modular Refinement Arvind

Modular Refinement Arvind

Cache Coherence Constructive Computer Architecture Arvind

Bluespec-4: Architectural exploration using IP lookup Arvind

Modules with Guarded Interfaces

Pipelining combinational circuits

Cache Coherence Constructive Computer Architecture Arvind

CC protocol for blocking caches

Bluespec-3: A non-pipelined processor Arvind

Stmt FSM Arvind (with the help of Nirav Dave)

Caches-2 Constructive Computer Architecture Arvind

Modular Refinement - 2 Arvind

IP Lookup Arvind Computer Science & Artificial Intelligence Lab

Caches and store buffers

CS 3410, Spring 2014 Computer Science Cornell University

Modeling Processors Arvind

Modular Refinement Arvind

Implementing for Correct Concurrency

Modular Refinement Arvind

Realistic Memories and Caches

Cache Coherence Constructive Computer Architecture Arvind

Caches-2 Constructive Computer Architecture Arvind

Presentation transcript:

Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Non-blocking caches Enable multiple outstanding cache misses It is like pipelining the memory system Extremely important for hiding memory latency Dramatically more complicated than blocking caches We will use the same processor interface for both blocking and non-blocking caches May 14, 2012 L25-2http://csg.csail.mit.edu/6.S078

Non-blocking cache Completion buffer controls the entries of requests and ensures that departures take place in order even if loads complete out-of- order Processor DRAM cache req resp mReq mResp mReqQ mRespQ respDeq cbuf req proc May 14, 2012 L25-3http://csg.csail.mit.edu/6.S078

Non-blocking caches Split the non-blocking cache in two parts The front end attaches tags to requests and ensures FIFO delivery of responses to the processor The backend is a non-locking cache which can return responses out-of-order One may merge the front end with the processor and directly expose the backend interface respDeq cache req resp mReq mResp mReqQ mRespQ req proc FIFO responses inputs are tagged; responses can be OOO Processor req resp respDeq cbuf May 14, 2012 L25-4http://csg.csail.mit.edu/6.S078

Completion buffer: Interface interface CBuffer#(type t); method ActionValue#(Token) getToken(); method Action put(Token tok, t d); method t getResult(); method Action deqResult(); endinterface cbuf getResult getToken put (result & token) deqResult Concurrency requirement (needed to achieve (0,n), i.e., combinational responses) getToken < put < getResult < deqResult May 14, 2012 L25-5http://csg.csail.mit.edu/6.S078

Non-blocking FIFO Cache module mkNBfifoCache(Cache); cBuf <- mkCompletionBuffer; nbCache <- mkNBtaggedCache; method Action req(MemReq x); tok <- cBuf.getToken; nbCache.req({req:x, tag:tok}); endmethod method MemResp resp return cBuf.getResult endmethod method Action respDeq; cBuf.deqResult; endmethod rule nbCacheResponse; cBuf.put(nbCache.resp); nbCache.respDeq; endrule endmodule req resp respDeq cbuf May 14, 2012 L25-6http://csg.csail.mit.edu/6.S078

Completion buffer: Implementation IIVIVIIIVIVI cnt iidx ridx buf A circular buffer with two pointers iidx and ridx, and a counter cnt Elements are of Maybe type May 14, 2012 module mkCompletionBuffer(CompletionBuffer#(size)); Vector#(size, EHR#(Maybe#(t))) cb <- replicateM(mkEHR3(Invalid)); Reg#(Bit#(TAdd#(TLog#(size),1))) iidx <- mkReg(0); Reg#(Bit#(TAdd#(TLog#(size),1))) ridx <- mkReg(0); EHR#(Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR(0); Integer vsize = valueOf(size); Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize); rules and methods... endmodule L25-7http://csg.csail.mit.edu/6.S078

Completion Buffer cont method ActionValue#(t) getToken() if(cnt.r0!==sz); cb[iidx].w0(Invalid); iidx <= iidx==sz-1 ? 0 : iidx + 1; cnt.w0(cnt.r0 + 1); return iidx; endmethod method Action put(Token idx, t data); cb[idx].w1(Valid data); endmethod method t getResult() if(cnt.r1 !== 0 &&& (cb[ridx].r2 matches tagged (Valid.x)); return x; endmethod method Action deqResult if(cnt.r1!=0); cb[ridx].w2(Invalid); ridx <= ridx==sz-1 ? 0 : ridx + 1; cnt.w1(cnt.r1 – 1); endmethod May 14, 2012 getToken < put < getResult < deqResult L25-8

Non-blocking Cache St Buff Ld Buff Wait Buff TagData wbQ mReqQmRespQ hitQ respDeq respreq May 14, 2012 L25-9http://csg.csail.mit.edu/6.S078 Holds reqs that must be deferred because the address is already being processed Holds St’s because of no-allocate on a write-miss policy V/C/D Holds Ld misses that are being processed

Dynamics of NB cache processor side A ld request is checked first for a cache hit  respond, then store buff hit  respond, then ld buffer addr match  put in wait buff, otherwise put it in the ld buff A st request is checked first for a cache hit  respond, then ld buffer addr match  put in wait buff, otherwise put it in the store buff May 14, 2012 L25-10http://csg.csail.mit.edu/6.S078

Dynamics of NB cache Memory side: Responses for ld misses come back in the mRespQ retire the ld from the ld buff search wait buff for addr match, then if match is found go into the procWaitBuff mode and process all matching waiting misses. Proc requests are blocked off. Ld buff keeps issuing WB and mem requests as long as there are unfullfilled ld misses May 14, 2012 L25-11http://csg.csail.mit.edu/6.S078

Non-blocking Cache state declaration module mkNBCache(NBCache); RegFile#(Index, LineStatus) sArray <- mkRegFileFull; RegFile#(Index, Tag) tagArray <- mkRegFileFull; RegFile#(Index, Data) dataArray <- mkRegFileFull; StBuff#(StBuffSz) stBuff <- mkStBuff; LdBuff#(LdBuffSz) ldBuff <- mkLdBuff; SPipeFIFO#(TaggedMemReq) waitBuff <- mkSPipeFIFO; FIFOF#(MemReq) wbQ <- mkFIFOF; FIFOF#(MemReq) mReqQ <- mkFIFOF; FIFOF#(MemResp) mRespQ <- mkFIFOF; EHRBypassReg#(TypeHit) hitQ <- mkEHRBypassReg; Reg#(Bool) procWaitBuff <- mkReg(False); May 14, 2012 L25-12http://csg.csail.mit.edu/6.S078

Non-blocking Cache Processor-side request method method Action req(TaggedMemReq x) if(!procWaitBuff); Index idx = truncate(x.req.addr>>2); Tag tag = truncateLSB(x.req.addr); let lnSt = sArray.sub(idx); Bool tagMatch = tagArray.sub(idx)==tag; let sbMatch = stBuff.search(x.req.addr); let lbMatch = ldBuff.search(x.req.addr); if(lnSt!==Invalid && tagMatch) hitQ.enq(TypeHit{tag:x.tag, req:x.req, data:Invalid}); else if(lbMatch) waitBuff.enq(x); else if(x.req.op==St) stBuff.insert(x.req); else if(x.req.op==Ld && isValid(sbMatch)) hitQ.enq(TypeHit{tag:x.tag, req:x.req, data:sbMatch}); else ldBuff.insert(x, lnSt==Dirty ? WrBack : FillReq); endmethod May 14, 2012 L25-13http://csg.csail.mit.edu/6.S078

Non-blocking Cache Processor-side response methods method TaggedMemResp resp if(hitQ.first.req.op==Ld); let x = hitQ.first.req; Index idx = truncate(x.addr>>2); Data d = isValid(hitQ.first.data) ? fromMaybe(hitQ.first.data) : dataArray.sub(idx); return TaggedMemResp{tag:hitQ.first.tag, resp:d}; endmethod method Action respDeq if(hitQ.first.req.op==Ld); hitQ.deq; endmethod May 14, 2012 L25-14http://csg.csail.mit.edu/6.S078 No response for stores same as Blocking cache

Non-blocking Cache store hit rule rule storeHit(hitQ.first.req.op==St); let x = hitQ.first.req; Index idx = truncate(x.addr>>2); dataArray.upd(idx, x.data); sArray.upd(idx, Dirty); hitQ.deq; endrule May 14, 2012 L25-15http://csg.csail.mit.edu/6.S078 same as Blocking cache

Non-blocking Cache load buff rule rule ldBuffUpdate; let u = ldBuff.usearch; if(u.valid) if(u.cst==FillReq) begin mReqQ.enq(MemReq{op:Ld, addr:u.addr, data:?}); ldBuff.update(u.addr, FillResp); end else begin Index idx = truncate(u.addr>>2); wbQ.enq(MemReq{op:St, addr:{tagArray.sub(idx),idx,2'b00}, data:dataArray.sub(idx)}); mReqQ.enq(MemReq{op:Ld, addr:u.addr, data:?}); compBuff.update(u.addr, FillResp); end endrule May 14, 2012 L25-16http://csg.csail.mit.edu/6.S078

Non-blocking Cache memory response rule rule mRespAvailable; let data = mRespQ.first.data; let addr = mRespQ.first.addr; Index idx = truncate(addr>>2); Tag tag = truncateLSB(addr); sArray.upd(idx, Clean); tagArray.upd(idx, tag); dataArray.upd(idx, data); let x <- ldBuff.remove(addr); hitQ.enq(TypeHit{tag:x.tag, req:x.req, data:Valid (data)}); if(waitBuff.search(addr)) procWaitBuff <= True; else mRespQ.deq; endrule May 14, 2012 L25-17http://csg.csail.mit.edu/6.S078

Non-blocking Cache wait buffer rule rule goThroughWaitBuff(procWaitBuff); let data = mRespQ.first.data; let addr = mRespQ.first.addr; let x = waitBuff.first; waitBuff.deq; if(x.req.addr==addr) hitQ.enq(TypeHit{tag:x.tag, req:x.req, data:Valid (data)}); else waitBuff.enq(x); if(!waitBuff.search(addr)) begin procWaitBuff <= False; mRespQ.deq; end endrule May 14, 2012 L25-18http://csg.csail.mit.edu/6.S078

Non-blocking Cache memory-side methods method ActionValue#(MemReq) wb; wbQ.deq; return wbQ.first; endmethod method ActionValue#(MemReq) mReq; mReqQ.deq; return mReqQ.first; endmethod method Action mResp(MemResp x); mRespQ.enq(x); endmethod endmodule May 14, 2012 L25-19http://csg.csail.mit.edu/6.S078 same as Blocking cache

Misc rules May 14, 2012 L25-20http://csg.csail.mit.edu/6.S078 rule stBuffRemove; let x <- stBuff.remove; wbQ.enq(x); endrule

Store buffer methods insert: when a cache miss occurs on a store search: associatively searched for Ld addresses remove: when a store moves to the write- back queue May 14, 2012 L25-21http://csg.csail.mit.edu/6.S078 remove < search < insert

Store Buffer module mkStBuff(StBuff#(size)); Vector#(size, EHR2#(Maybe#(MemReq))) buff <- replicateM(mkEHR2(Invalid)); Reg#(Bit#(TAdd#(TLog#(size),1))) iidx <- mkReg(0); EHR2#(Bit#(TAdd#(TLog#(size),1))) ridx <- mkEHR2(0); EHR2#(Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR2(0); Integer vsize = valueOf(size)); Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize); method ActionValue#(MemReq) remove if(cnt.r0!=0); buff[ridx.r0].w0(Invalid); ridx.w0(ridx.r0==sz-1 ? 0 : ridx.r0 + 1); cnt.w0(cnt.r0 - 1); return fromMaybe(buff[ridx.r0].r0); endmethod May 14, 2012 L25-22http://csg.csail.mit.edu/6.S078

Store Buffer cont method Maybe#(Data) search(Addr a); Maybe#(Data) m = Invalid; let idx = ridx.r1; for(Integer i=0; i<vsize); i=i+1) begin if(isValid(buff[idx].r1) && fromMaybe(buff[idx].r1).addr==a) m = Valid (fromMaybe(buff[idx].r1).data); idx = idx + 1; end return m; endmethod method Action insert(MemReq x) if(cnt.r1!==sz); buff[iidx].w1(Valid (x)); iidx <= iidx==sz-1 ? 0 : iidx + 1; cnt.w1(cnt.r1 + 1); endmethod endmodule May 14, 2012 L25-23http://csg.csail.mit.edu/6.S078

Load buffer methods insert: when a cache miss occurs on a store search: associatively searched for Ld addresses remove: when a store moves to the write- back queue update: updates the ld status when a memory request is satisfied by the memory usearch: this search is needed to see if a Ld is is in Wb state or Fill state May 14, 2012 L25-24http://csg.csail.mit.edu/6.S078

Load Buffer module mkLdBuff(LdBuff#(size)); Vector#(size, EHR3#(LdStatus)) ldstA <- replicateM(mkEHR3(Invalid)); Vector#(size, EHR3#(TaggedMemReq)) buff <- replicateM(mkEHR3U); EHR3#(Bit#(TAdd#(TLog#(size),1))) cnt <- mkEHR3(0); Integer vsize = valueOf(size)); Bit#(TAdd#(TLog#(size),1)) sz = fromInteger(vsize); method ActionValue#(TaggedMemReq) remove(Addr a) if(cnt.r0!=0); Bit#(TLog#(size)) idx = 0; for(Integer i=0; i<vsize; i=i+1) if(buff[i].r0.req.addr==a) idx = fromInteger(i); ldstA[idx].w0(Invalid); cnt.w0(cnt.r0 - 1); return buff[idx].r0; endmethod May 14, 2012 L25-25http://csg.csail.mit.edu/6.S078

Load Buffer cont method Action update(Addr a, LdStatus ldst); for(Integer i=0; i<vsize; i=i+1) if(ldstA[i].r2!==Invalid && buff[i].r2.req.addr==a) ldstA[i].w2(ldst); endmethod method TypeUpdate usearch; TypeUpdate u = TypeUpdate{valid:False, ldst:?, addr:?}; for(Integer i=0; i<vsize; i=i+1) if(ldstA[i].r2==WrBack || ldstA[i].r2==FillReq) u = TypeUpdate{valid:True, ldstA:ldstA[i].r2, addr:buff[i].r2.req.addr}; return u; endmethod May 14, 2012 L25-26http://csg.csail.mit.edu/6.S078

Load Buffer cont method Bool search(Addr a); Bool s = False; for(Integer i=0; i<vsize; i=i+1) if(ldstA[i].r1!=Invalid && buff[i].r1.req.addr==a) s = True; return s; endmethod method Action insert(TaggedMemReq x, LdStatus ldst) if(cnt.r1!=sz); Bit#(TLog#(size)) idx = 0; for(Integer i=0; i<vsize; i=i+1) if(ldstA[i].r1==Invalid) idx = fromInteger(i); buff[idx].w1(x); ldstA[idx].w1(ldst); cnt.w1(cnt.r1 + 1); endmethod endmodule May 14, 2012 L25-27http://csg.csail.mit.edu/6.S078