Caches-2 Constructive Computer Architecture Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November 2, 2016 http://csg.csail.mit.edu/6.175
Cache Interface

[Figure: the Processor sends cache req and receives cache resp (via hitQ); the cache, with its Miss Status Handling Register(s) (MSHRs), talks to DRAM or the next-level cache through memReq (mReqQ) and memResp (mRespQ)]

interface Cache;
   method Action req(MemReq r);           // (op: Ld/St, addr: ..., data: ...)
   method ActionValue#(Data) resp;        // no response for St
   method ActionValue#(MemReq) memReq;
   method Action memResp(Line r);
endinterface

Requests are tagged for non-blocking caches.
We will first design a write-back, write-miss-allocate, direct-mapped, blocking cache.
Interface dynamics

The cache either gets a hit and responds immediately, or it gets a miss, in which case it takes several steps to process the miss.
Reading the response dequeues it.
Methods are guarded, e.g., the cache may not be ready to accept a request because it is processing a miss.
An mshr register keeps track of the state of the request while it is being processed:

typedef enum {Ready, StartMiss, SendFillReq, WaitFillResp} ReqStatus deriving (Bits, Eq);
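As a concrete illustration of these guarded methods (not from the slides), a processor-side client could drive the cache with two rules; cache, procReqQ, and procRespQ are assumed names here:

rule sendReq;
   cache.req(procReqQ.first); // the guard (mshr == Ready) stalls this rule while a miss is being processed
   procReqQ.deq;
endrule

rule recvResp;
   let d <- cache.resp;       // reading the response dequeues it (only Ld requests produce one)
   procRespQ.enq(d);
endrule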
Blocking cache state elements

RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
RegFile#(CacheIndex, Maybe#(CacheTag)) tagArray <- mkRegFileFull;
RegFile#(CacheIndex, Bool) dirtyArray <- mkRegFileFull;
Fifo#(1, Data) hitQ <- mkBypassFifo;
Reg#(MemReq) missReq <- mkRegU;
Reg#(ReqStatus) mshr <- mkReg(Ready);
Fifo#(2, MemReq) memReqQ <- mkCFFifo;
Fifo#(2, Line) memRespQ <- mkCFFifo;

Tag and valid bits are kept together as a Maybe type.
mshr and missReq go together.
CF Fifos are preferable because they provide better decoupling; an extra cycle here may not affect performance by much.
Req method (blocking cache)

method Action req(MemReq r) if(mshr == Ready);
   let idx = getIdx(r.addr);
   let tag = getTag(r.addr);
   let wOffset = getOffset(r.addr);
   let currTag = tagArray.sub(idx);
   let hit = isValid(currTag) ? fromMaybe(?,currTag)==tag : False;
   if(hit) begin
      let x = dataArray.sub(idx);
      if(r.op == Ld) hitQ.enq(x[wOffset]);
      else begin // St: overwrite the appropriate word of the line
         x[wOffset] = r.data;
         dataArray.upd(idx, x);
         dirtyArray.upd(idx, True);
      end
   end
   else begin missReq <= r; mshr <= StartMiss; end
endmethod
Miss processing

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

mshr = StartMiss: if the slot is occupied by dirty data, initiate a write-back of that data; mshr <= SendFillReq
mshr = SendFillReq: send the fill request to memory; mshr <= WaitFillResp
mshr = WaitFillResp: fill the slot when the data is returned from memory and put the load response in hitQ; mshr <= Ready
Start-miss and Send-fill rules

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule startMiss(mshr == StartMiss);
   let idx = getIdx(missReq.addr);
   let tag = tagArray.sub(idx);
   let dirty = dirtyArray.sub(idx);
   if(isValid(tag) && dirty) begin // write-back the victim line
      let addr = {fromMaybe(?,tag), idx, 4'b0};
      let data = dataArray.sub(idx);
      memReqQ.enq(MemReq{op: St, addr: addr, data: data});
   end
   mshr <= SendFillReq;
endrule

rule sendFillReq (mshr == SendFillReq);
   memReqQ.enq(missReq);
   mshr <= WaitFillResp;
endrule
Wait-fill rule

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule waitFillResp(mshr == WaitFillResp);
   let idx = getIdx(missReq.addr);
   let tag = getTag(missReq.addr);
   let wOffset = getOffset(missReq.addr);
   let data = memRespQ.first;
   tagArray.upd(idx, Valid (tag));
   if(missReq.op == Ld) begin
      dirtyArray.upd(idx, False);
      dataArray.upd(idx, data);
      hitQ.enq(data[wOffset]);
   end
   else begin
      data[wOffset] = missReq.data;
      dirtyArray.upd(idx, True);
      dataArray.upd(idx, data);
   end
   memRespQ.deq;
   mshr <= Ready;
endrule

Is there a problem with waitFill? What if the hitQ is blocked? Should we not at least write the line into the cache?
Rest of the methods

method ActionValue#(Data) resp;
   hitQ.deq;
   return hitQ.first;
endmethod

Memory-side methods:

method ActionValue#(MemReq) memReq;
   memReqQ.deq;
   return memReqQ.first;
endmethod

method Action memResp(Line r);
   memRespQ.enq(r);
endmethod
Hit and miss performance

Hit:
   Directly related to the latency of L1
   0-cycle latency if hitQ is a bypass FIFO
Miss:
   No evacuation: memory load latency plus combinational read/write
   Evacuation: memory store followed by memory load latency, plus combinational read/write

Adding a few extra cycles in the miss case does not have a big impact on performance.
Speeding up Store Misses

Unlike a load, a store does not require the memory system to return any data to the processor; it only requires the cache to be updated for future load accesses.
Instead of delaying the pipeline, a store can therefore be performed in the background. In case of a miss, the data does not have to be brought into L1 at all (write-miss no-allocate policy).
This is done with a store buffer (stb), a small FIFO of (a, v) pairs sitting between the processor and L1 (a possible interface is sketched below).

[Figure: Processor -> Store buffer (stb) -> L1 -> mReqQ/mRespQ -> DRAM or next-level cache]
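The slides use stb only through enq, deq, first, and search; its interface declaration is not shown. A minimal declaration consistent with that usage might be the following sketch (method names and types are inferred, not taken from the course code):

interface StoreBuffer;
   method Action enq(Addr a, Data d);      // blocks the cache's req method when the buffer is full
   method Action deq;                      // remove the oldest store
   method Tuple2#(Addr, Data) first;       // oldest store, as an (a, v) pair
   method Maybe#(Data) search(Addr a);     // data of the most recent store to address a, if any
endinterface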
Store Buffer

A St req is enqueued into stb; input reqs are blocked if there is no space in stb.
A Ld req simultaneously searches L1 and stb:
   If the Ld hits in stb, the most recent matching entry is selected and the L1 search result is discarded (one way to implement this search is sketched below)
   If the Ld misses in stb but hits in L1, the L1 result is returned
   If there is no match in either stb or L1, miss processing commences
In the background, the oldest store in stb is dequeued and processed:
   If the St address hits in L1: update L1; if write-through, also send it to memory
   If it misses:
      Write-back, write-miss-allocate: fetch the cache line from memory (miss processing) and then process the store
      Write-back or write-through, write-miss-no-allocate: pass the store to memory
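The stb search must return the newest matching store. A sketch of one way to write it (an assumption, not the course implementation): keep the entries in age order, index 0 oldest, scan the whole vector, and let later matches overwrite earlier ones. This assumes import Vector::* and the course's Addr/Data types; searchStb is a hypothetical helper name.

function Maybe#(Data) searchStb(Vector#(n, Maybe#(Tuple2#(Addr, Data))) entries, Addr a);
   Maybe#(Data) result = Invalid;
   for (Integer i = 0; i < valueOf(n); i = i + 1) begin
      // a later (newer) entry that matches overwrites any earlier match,
      // so the most recent store to address a wins
      if (entries[i] matches tagged Valid {.ea, .ed} &&& ea == a)
         result = Valid (ed);
   end
   return result;
endfunction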
L1+Store Buffer (write-back, write-miss-allocate): Req method

method Action req(MemReq r) if(mshr == Ready);
   ... get idx, tag and wOffset
   if(r.op == Ld) begin // search stb
      let x = stb.search(r.addr);
      if (isValid(x)) hitQ.enq(fromMaybe(?, x));
      else begin // search L1
         let currTag = tagArray.sub(idx);
         let hit = isValid(currTag) ? fromMaybe(?,currTag)==tag : False;
         if(hit) begin
            let x = dataArray.sub(idx);
            hitQ.enq(x[wOffset]);
         end
         else begin missReq <= r; mshr <= StartMiss; end
      end
   end
   else stb.enq(r.addr, r.data); // r.op == St: entry into the store buffer
endmethod

No change in the miss handling rules.
L1+Store Buffer (write-back, write-miss-allocate): Exit from Store Buffer

rule mvStbToL1 (mshr == Ready);
   stb.deq;
   match {.addr, .data} = stb.first; // move the oldest entry of stb into L1;
                                     // may start allocation/evacuation
   ... get idx, tag and wOffset
   let currTag = tagArray.sub(idx);
   let hit = isValid(currTag) ? fromMaybe(?,currTag)==tag : False;
   if(hit) begin
      let x = dataArray.sub(idx);
      x[wOffset] = data;
      dataArray.upd(idx, x);
      dirtyArray.upd(idx, True);
   end
   else begin // miss: allocate by treating the store as the miss request
      missReq <= MemReq{op: St, addr: addr, data: data};
      mshr <= StartMiss;
   end
endrule
Give priority to the req method in accessing L1

method Action req(MemReq r) if(mshr == Ready);
   lockL1[0] <= True; // lock L1 while processing processor requests
   ... get idx, tag and wOffset
   if(r.op == Ld) begin // search stb
      let x = stb.search(r.addr);
      if (isValid(x)) hitQ.enq(fromMaybe(?, x));
      else begin // search L1
         ...
      end
   end
   else stb.enq(r.addr, r.data); // r.op == St
endmethod

rule mvStbToL1 (mshr == Ready && !lockL1[1]);
   stb.deq;
   match {.addr, .data} = stb.first;
   ... get idx, tag and wOffset
   ...
endrule

rule clearL1Lock;
   lockL1[1] <= False;
endrule
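The lock written at port [0] and read and cleared at port [1] implies a two-ported EHR; a declaration consistent with this usage (assumed, using the course's Ehr library) would be:

Ehr#(2, Bool) lockL1 <- mkEhr(False);

Because mvStbToL1 reads port [1], it sees a same-cycle write from req at port [0] and stays out of L1 whenever the processor issues a request; clearL1Lock writes port [1] every cycle, so the lock never stays set into the next cycle.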
Write Through Caches

L1 values are always consistent with the values in the next-level cache:
   No need for a dirty array
   No need to write back on evacuation
L1+Store Buffer (write-through, write-miss-no-allocate): Req method (no change)

method Action req(MemReq r) if(mshr == Ready);
   ... get idx, tag and wOffset
   if(r.op == Ld) begin // search stb
      let x = stb.search(r.addr);
      if (isValid(x)) hitQ.enq(fromMaybe(?, x));
      else begin // search L1
         let currTag = tagArray.sub(idx);
         let hit = isValid(currTag) ? fromMaybe(?,currTag)==tag : False;
         if(hit) begin
            let x = dataArray.sub(idx);
            hitQ.enq(x[wOffset]);
         end
         else begin missReq <= r; mshr <= StartMiss; end
      end
   end
   else stb.enq(r.addr, r.data); // r.op == St
endmethod
Start-miss and Send-fill rules (write-through)

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule startMiss(mshr == StartMiss);
   let idx = getIdx(missReq.addr);
   let tag = tagArray.sub(idx);
   let dirty = dirtyArray.sub(idx);
   if(isValid(tag) && dirty) begin // write-back
      let addr = {fromMaybe(?,tag), idx, 4'b0};
      let data = dataArray.sub(idx);
      memReqQ.enq(MemReq{op: St, addr: addr, data: data});
   end
   mshr <= SendFillReq;
endrule
No need for this rule: a miss can go directly into the SendFillReq state.

rule sendFillReq (mshr == SendFillReq);
   memReqQ.enq(missReq);
   mshr <= WaitFillResp;
endrule
Wait-fill rule (write-through)

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule waitFillResp(mshr == WaitFillResp);
   let idx = getIdx(missReq.addr);
   let tag = getTag(missReq.addr);
   let wOffset = getOffset(missReq.addr);
   let data = memRespQ.first;
   tagArray.upd(idx, Valid (tag));
   if(missReq.op == Ld) begin
      dirtyArray.upd(idx, False);
      dataArray.upd(idx, data);
      hitQ.enq(data[wOffset]);
   end
   else begin
      data[wOffset] = missReq.data;
      dirtyArray.upd(idx, True);
      dataArray.upd(idx, data);
   end
   memRespQ.deq;
   mshr <= Ready;
endrule

Is there a problem with waitFill? What if the hitQ is blocked? Should we not at least write the line into the cache?
Miss rules (cleaned up)

Ready -> SendFillReq -> WaitFillResp -> Ready

rule sendFillReq (mshr == SendFillReq);
   memReqQ.enq(missReq);
   mshr <= WaitFillResp;
endrule

rule waitFillResp(mshr == WaitFillResp);
   // with write-miss-no-allocate, only Ld requests reach miss processing
   let idx = getIdx(missReq.addr);
   let tag = getTag(missReq.addr);
   let wOffset = getOffset(missReq.addr);
   let data = memRespQ.first;
   tagArray.upd(idx, Valid (tag));
   dataArray.upd(idx, data);
   hitQ.enq(data[wOffset]);
   memRespQ.deq;
   mshr <= Ready;
endrule

Is there a problem with waitFill? What if the hitQ is blocked? Should we not at least write the line into the cache?
Exit from Store Buffer (write-through, write-miss-no-allocate)

rule mvStbToL1 (mshr == Ready);
   stb.deq;
   match {.addr, .data} = stb.first;
   ... get idx, tag and wOffset
   memReqQ.enq(MemReq{op: St, addr: addr, data: data}); // always send the store to memory
   // move this entry into L1 only if the address is already present in L1
   let currTag = tagArray.sub(idx);
   let hit = isValid(currTag) ? fromMaybe(?,currTag)==tag : False;
   if(hit) begin
      let x = dataArray.sub(idx);
      x[wOffset] = data;
      dataArray.upd(idx, x);
   end
   // on a miss, nothing more to do: write-miss-no-allocate does not start miss processing
endrule
Functions to extract the cache tag, index, and word offset

Address layout (byte addresses): | tag | index | word offset | byte offset |
For a direct-mapped cache, the index and offset bits together cover log2(cache size in bytes).

function CacheIndex getIdx(Addr addr) = truncate(addr >> 4);
function Bit#(2) getOffset(Addr addr) = truncate(addr >> 2);
function CacheTag getTag(Addr addr) = truncateLSB(addr);

truncate keeps the least-significant bits (it truncates from the MSB side); truncateLSB keeps the most-significant bits.
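For concreteness, one set of type definitions consistent with these functions (the sizes here are an assumption, not from the slides: 32-bit addresses and words, 16-byte lines, 64 lines, so a 1 KB cache; Vector comes from import Vector::*):

typedef Bit#(32) Addr;
typedef Bit#(32) Data;            // one 4-byte word
typedef Vector#(4, Data) Line;    // 16-byte line = 4 words: 2 word-offset bits + 2 byte-offset bits
typedef Bit#(6) CacheIndex;       // 64 lines => 6 index bits
typedef Bit#(22) CacheTag;        // 32 - 6 (index) - 4 (offset) = 22 tag bits

With these sizes, {tag, idx, 4'b0} in the write-back rule reconstructs a full 22 + 6 + 4 = 32-bit line-aligned address.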