Realistic Memories and Caches


Constructive Computer Architecture: Realistic Memories and Caches
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
October 31, 2016
http://csg.csail.mit.edu/6.175

Multistage Pipeline

[Pipeline diagram: PC, Inst Memory, Decode, Execute, Data Memory, Register File, with fEpoch/eEpoch registers, the d2e FIFO, the redirect path, and the scoreboard]

The use of magic memories (combinational reads) makes such a design unrealistic.

Magic Memory Model

[Diagram: a RAM with Address, WriteData, WriteEnable, and Clock inputs and a ReadData output]

Reads and writes are always completed in one cycle:
- a Read can be done at any time (i.e., combinationally)
- if enabled, a Write is performed at the rising clock edge (the write address and data must be stable at the clock edge)

In a real DRAM the data is available several cycles after the address is supplied.

Memory Hierarchy

[Diagram: CPU with RegFile, backed by a small, fast memory (SRAM) that holds frequently used data, backed by a big, slow memory (DRAM)]

- size: RegFile << SRAM << DRAM (why? cost)
- latency: RegFile << SRAM << DRAM (why? the size of DRAM)
- bandwidth: on-chip >> off-chip (why? cost and wire delays: on-chip wires cost much less and are faster)

On a data access:
- hit (data ∈ fast memory) → low-latency access
- miss (data ∉ fast memory) → long-latency access (DRAM)

Inside a Cache

[Diagram: each cache line pairs a tag (e.g., 100) with a data block holding the data from locations 100, 101, ...]

A cache line usually holds more than one word:
- Spatial locality: if address x is referenced, then addresses x+1, x+2, etc. are very likely to be referenced in the near future (consider instruction streams, and array and record accesses)
- Larger blocks of data can be transferred more efficiently
- It reduces the number of tag bits needed to identify a cache line: a tag needs only enough bits to uniquely identify the block

Internal Cache Organization

Cache designs restrict where in the cache a particular address can reside:
- Direct-mapped: an address can reside in exactly one location in the cache. The cache location is typically determined by the low-order address bits.
- n-way set-associative: an address can reside in any of a set of n locations in the cache. The set is typically determined by the low-order address bits.

Extracting Cache Tags and Index

[Address layout: tag | index | L | 2, for a cache size of 2^(K+L+2) bytes]

Processor requests are for a single word, but the cache line size is 2^L words (typically L = 2). The processor uses word-aligned byte addresses, i.e., the two least-significant bits of the address are 00.

Suppose the cache size is 2^(K+L+2) bytes:
- Direct-mapped cache: index = K bits; tag = 32-(K+L+2) bits
- 2-way set-associative cache: index = K-1 bits; tag = 32-(K-1+L+2) bits

We need getIndex, getTag, and getOffset functions.
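As a concrete sizing example (hypothetical numbers, not the lab's actual parameters): a 16 KB direct-mapped cache with 4-word (16-byte) lines and 32-bit byte addresses gives K+L+2 = 14, so the type widths below follow directly from the formula above.

typedef Bit#(32) Addr;        // 32-bit byte address
typedef Bit#(10) CacheIndex;  // K = 10, since 16 KB = 2^14 bytes and L = 2
typedef Bit#(18) CacheTag;    // 32 - (K+L+2) = 18
typedef Bit#(2)  WordOffset;  // L = 2, i.e., 4 words per line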

Direct-Mapped Cache: The Simplest Implementation

[Diagram: 2^k cache lines, each holding a valid bit V, a tag, and a data block; the request address splits into a tag (t bits), an index (k bits), and an offset (b bits); the stored tag is compared with the request tag to produce HIT and select the data word or byte]

The simplest scheme extracts bits of the block number to determine the cache line (the set).

What is a bad reference pattern? A strided pattern whose stride equals the size of the cache: every access maps to the same line.

Direct-Mapped Address Selection: Higher-Order vs. Lower-Order Address Bits

[Same diagram, with index and tag reversed: the higher-order bits select the line and the lower-order bits form the tag]

Why might this be undesirable? Spatially local blocks conflict with each other, so spatial locality cannot be exploited.

Conflict Misses

In a direct-mapped cache, a cache line can be stored only if the cache slot corresponding to its address is not occupied, even if other slots are unoccupied.

In an n-way set-associative cache, a cache line can be stored in any of n different slots, which reduces misses due to address conflicts.

2-Way Set-Associative Cache

[Diagram: two ways, each with a valid bit, tag, and data block; the index (k bits) selects a set, both tags are compared with the request tag in parallel, and the matching way supplies the data word or byte]

A high degree of associativity rapidly increases hardware complexity and access latency. Compare the latency to the direct-mapped case: two tag comparisons and a way-select multiplexer instead of a single comparison.
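As a minimal sketch of the parallel tag check in a 2-way cache (the function name and the vector-of-ways representation are assumptions for illustration, not the design built later in this lecture):

import Vector::*;

// Returns the way holding the requested tag, if any; both tag
// comparisons are combinational and happen in parallel.
function Maybe#(Bit#(1)) findWay(CacheTag tag,
                                 Vector#(2, Maybe#(CacheTag)) tags);
    if (tags[0] == tagged Valid tag)      return tagged Valid 0;
    else if (tags[1] == tagged Valid tag) return tagged Valid 1;
    else                                  return tagged Invalid;
endfunction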

Loads

Search the cache for the processor-generated address:
- Found in the cache (a hit): return a copy of the word from the cache line.
- Not in the cache (a miss): bring the missing cache line in from main memory; this may require writing back a cache line to create space. Then update the cache and return the word to the processor.

Which line do we replace?

Stores

On a write hit:
- Write-back policy: write only to the cache, and update the next-level memory only when the line is evicted.
- Write-through policy: write to both the cache and the next-level memory.

On a write miss:
- Allocate: because of multi-word lines, first fetch the line, then update the word in it.
- No allocate: the cache is not affected; the store is forwarded to memory.

Write-through vs. write-back:
- WT: + a read miss never results in writes to main memory; + main memory always has the most current copy of the data (consistent); - writes are slower; - every write needs a main-memory access, so more memory bandwidth is used.
- WB: + writes occur at the speed of the cache; + multiple writes within a block require only one write to main memory, so less memory bandwidth is used; - main memory is not always consistent with the cache; - reads that cause a replacement may trigger writes of dirty blocks to main memory.

Why not write-back with no allocate? On a miss the block would be updated in main memory without being brought into the cache, so subsequent writes to the same block would miss all the way to memory, which is very inefficient; hence WB is paired with allocate. Conversely, with write-through, subsequent writes update main memory anyway, so write misses are not helped by allocation (only read misses would be); hence WT is paired with no-allocate. (These options are worth revisiting when the next level is a cache rather than main memory.)

Typically one uses either a write-back, allocate policy or a write-through, no-allocate policy.
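For contrast with the write-back design built in the rest of this lecture, here is a hedged sketch of the store path under a write-through, no-allocate policy; it borrows the state elements (tagArray, dataArray, memReqQ, mshr) introduced on a later slide and omits load handling:

method Action req(MemReq r) if (mshr == Ready);
    let idx     = getIndex(r.addr);
    let wOffset = getOffset(r.addr);
    let hit     = tagArray.sub(idx) == tagged Valid getTag(r.addr);
    if (r.op == St) begin
        if (hit) begin               // keep the cached copy current
            let x = dataArray.sub(idx);
            x[wOffset] = r.data;
            dataArray.upd(idx, x);
        end
        memReqQ.enq(r);              // every store goes to memory,
    end                              // so no dirty bits are needed
endmethod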

Replacement Policy

To bring in a new cache line, usually another cache line has to be thrown out. Which one?
- Direct-mapped cache: no choice.
- N-way set-associative cache: a choice of policy. Prefer a line that is not dirty, i.e., one that has not been modified (in an I-cache all lines are clean). If there is still a choice of more than one: Least Recently Used (LRU)? Most Recently Used? Random?

How much is performance affected by the choice? That is difficult to know without benchmarks and quantitative measurements.
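A minimal sketch of such a preference order for one set of a 2-way cache (the function name and the fallback to way 0 are assumptions; a real design would consult LRU state or a pseudo-random source instead):

import Vector::*;

function Bit#(1) chooseVictim(Vector#(2, Maybe#(CacheTag)) tags,
                              Vector#(2, Bool) dirty);
    if      (!isValid(tags[0])) return 0;  // prefer an empty way
    else if (!isValid(tags[1])) return 1;
    else if (!dirty[0])         return 0;  // then a clean (not dirty) way
    else if (!dirty[1])         return 1;
    else                        return 0;  // otherwise pick arbitrarily
endfunction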

Blocking vs. Non-Blocking Caches

Blocking cache:
- At most one outstanding miss
- The cache must wait for memory to respond, and does not accept requests in the meantime

Non-blocking cache:
- Multiple outstanding misses
- The cache can continue to process requests while waiting for memory to respond to misses

We will first design a write-back, write-miss-allocate, direct-mapped, blocking cache.

Blocking Cache Interface

[Diagram: the processor sends req and receives resp via hitQ; a miss is held in missReq and tracked by the Miss Status Handling Register (mshr); memory requests flow through mReqQ to DRAM or the next-level cache, and responses return through mRespQ]

interface Cache;
    method Action req(MemReq r);
    method ActionValue#(Data) resp;
    method ActionValue#(MemReq) memReq;
    method Action memResp(Line r);
endinterface
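To show how the memory-side methods are meant to be used, here is a hedged glue sketch; the Memory interface and its method names are assumptions for illustration:

interface Memory;
    method Action req(MemReq r);
    method ActionValue#(Line) resp;
endinterface

module mkConnect#(Cache cache, Memory mem)(Empty);
    rule toMem;                // forward refill requests and write-backs
        let r <- cache.memReq;
        mem.req(r);
    endrule
    rule fromMem;              // deliver refill lines back to the cache
        let line <- mem.resp;
        cache.memResp(line);
    endrule
endmodule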

Interface Dynamics

The cache either gets a hit and responds immediately, or it gets a miss, in which case it takes several steps to process the miss.
- Reading the response dequeues it.
- Methods are guarded; e.g., the cache may not be ready to accept a request because it is processing a miss.
- The mshr register keeps track of the state of the cache while it is processing a miss:

typedef enum {Ready, StartMiss, SendFillReq, WaitFillResp}
    CacheStatus deriving (Bits, Eq);
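A hedged usage sketch from the processor side: because the methods are guarded, these rules simply stop firing while the cache is busy with a miss (the sequential load traffic is only for illustration):

module mkCacheUser#(Cache cache)(Empty);
    Reg#(Addr) a <- mkReg(0);
    rule sendReq;                 // blocks while the cache processes a miss
        cache.req(MemReq{op: Ld, addr: a, data: ?});
        a <= a + 4;               // next word-aligned address
    endrule
    rule getResp;                 // reading the response dequeues it
        let d <- cache.resp;
        $display("data = %x", d);
    endrule
endmodule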

Blocking Cache State Elements

RegFile#(CacheIndex, Line) dataArray <- mkRegFileFull;
RegFile#(CacheIndex, Maybe#(CacheTag)) tagArray <- mkRegFileFull;
RegFile#(CacheIndex, Bool) dirtyArray <- mkRegFileFull;
Fifo#(1, Data) hitQ <- mkBypassFifo;
Reg#(MemReq) missReq <- mkRegU;        // the request being processed
Reg#(CacheStatus) mshr <- mkReg(Ready);
Fifo#(2, MemReq) memReqQ <- mkCFFifo;
Fifo#(2, Line) memRespQ <- mkCFFifo;

The tag and valid bits are kept together as a Maybe type. CF (conflict-free) FIFOs are preferable on the memory side because they provide better decoupling; an extra cycle here does not affect performance by much.

Req Method: Hit Processing

method Action req(MemReq r) if (mshr == Ready);
    let idx = getIndex(r.addr);
    let tag = getTag(r.addr);
    let wOffset = getOffset(r.addr);
    let currTag = tagArray.sub(idx);
    let hit = isValid(currTag) ? fromMaybe(?, currTag) == tag : False;
    if (hit) begin
        let x = dataArray.sub(idx);
        if (r.op == Ld) hitQ.enq(x[wOffset]);
        else begin
            // store hit: overwrite the appropriate word of the line
            x[wOffset] = r.data;
            dataArray.upd(idx, x);
            dirtyArray.upd(idx, True);
        end
    end
    else begin
        missReq <= r;
        mshr <= StartMiss;
    end
endmethod

(It would be straightforward to extend the cache interface to include a cache-line flush command.)

Miss Processing

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

- mshr = StartMiss: if the slot is occupied by dirty data, initiate a write-back of that data; mshr <= SendFillReq
- mshr = SendFillReq: send the fill request to memory; mshr <= WaitFillResp
- mshr = WaitFillResp: fill the slot when the data is returned from memory, and put the load response in the cache response FIFO; mshr <= Ready

The rest of the slides contain the miss-handling rules.

Start-Miss and Send-Fill Rules

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule startMiss(mshr == StartMiss);
    let idx = getIndex(missReq.addr);
    let tag = tagArray.sub(idx);
    let dirty = dirtyArray.sub(idx);
    if (isValid(tag) && dirty) begin
        // write back the victim line
        let addr = {fromMaybe(?, tag), idx, 4'b0};
        let data = dataArray.sub(idx);
        memReqQ.enq(MemReq{op: St, addr: addr, data: data});
    end
    mshr <= SendFillReq;
endrule

rule sendFillReq(mshr == SendFillReq);
    memReqQ.enq(missReq);
    mshr <= WaitFillResp;
endrule

(Note that the write-back enqueues a whole Line as the data of a MemReq, whereas processor stores carry a single word; a real design would use a wider memory-side request type.)

Wait-Fill Rule

Ready -> StartMiss -> SendFillReq -> WaitFillResp -> Ready

rule waitFillResp(mshr == WaitFillResp);
    let idx = getIndex(missReq.addr);
    let tag = getTag(missReq.addr);
    let wOffset = getOffset(missReq.addr);
    let data = memRespQ.first;
    tagArray.upd(idx, Valid (tag));
    if (missReq.op == Ld) begin
        dirtyArray.upd(idx, False);
        dataArray.upd(idx, data);
        hitQ.enq(data[wOffset]);
    end
    else begin
        data[wOffset] = missReq.data;
        dirtyArray.upd(idx, True);
        dataArray.upd(idx, data);
    end
    memRespQ.deq;
    mshr <= Ready;
endrule

Is there a problem with waitFillResp? What if hitQ is blocked? Should we not at least write the data into the cache?
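One possible fix, as a hedged sketch: extend CacheStatus with an extra Resp state (an assumption, not part of the typedef shown earlier), finish the fill unconditionally, and enqueue the load response from a separate rule, so a full hitQ can no longer delay the array update:

rule waitFillResp(mshr == WaitFillResp);   // replaces the rule above
    let idx  = getIndex(missReq.addr);
    let data = memRespQ.first;
    if (missReq.op == St) begin
        let wOffset = getOffset(missReq.addr);
        data[wOffset] = missReq.data;
        dirtyArray.upd(idx, True);
    end
    else dirtyArray.upd(idx, False);
    tagArray.upd(idx, Valid (getTag(missReq.addr)));
    dataArray.upd(idx, data);
    memRespQ.deq;
    mshr <= (missReq.op == Ld) ? Resp : Ready;
endrule

rule sendResp(mshr == Resp);               // only loads reach this state
    let idx     = getIndex(missReq.addr);
    let wOffset = getOffset(missReq.addr);
    hitQ.enq(dataArray.sub(idx)[wOffset]);
    mshr <= Ready;
endrule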

Rest of the Methods

method ActionValue#(Data) resp;
    hitQ.deq;
    return hitQ.first;
endmethod

Memory-side methods:

method ActionValue#(MemReq) memReq;
    memReqQ.deq;
    return memReqQ.first;
endmethod

method Action memResp(Line r);
    memRespQ.enq(r);
endmethod

Functions to Extract the Cache Tag, Index, and Word Offset

[Address layout: tag | index | L | 2, for a cache size of 2^(K+L+2) bytes]

function CacheIndex getIndex(Addr addr) = truncate(addr >> 4);
function Bit#(2) getOffset(Addr addr) = truncate(addr >> 2);
function CacheTag getTag(Addr addr) = truncateLSB(addr);

truncate discards high-order (most-significant) bits, while truncateLSB discards low-order (least-significant) bits; in each case the result width is given by the declared return type.
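A worked example with a hypothetical address, assuming the 10-bit index width from the sizing example earlier:

// addr = 32'h0000_1234
//   getOffset(addr) = (addr >> 2)[1:0] = 2'b01    -> word 1 of its line
//   getIndex(addr)  = (addr >> 4)[9:0] = 10'h123  -> cache line 0x123
//   getTag(addr)    = addr[31:14]      = 18'h0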