1
Lecture slides originally adapted from Prof. Valeriu Beiu (Washington State University, Spring 2005, EE 334)
2
Temporal Locality (Locality in Time)
Keep most recently accessed data items closer to the processor
Spatial Locality (Locality in Space)
Move blocks consisting of contiguous words to the upper levels of the cache
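As a quick illustration (not from the original slides), a sequential array traversal exhibits both kinds of locality: the accumulator is reused on every iteration (temporal), and consecutive elements share cache blocks (spatial). A minimal Python sketch:

# Illustrative sketch only: interpreter overhead hides much of the hardware
# effect in Python, but the access pattern is the point.
data = list(range(1_000_000))

total = 0                  # temporal locality: reused on every iteration
for x in data:             # spatial locality: elements are visited in order,
    total += x             # so each fetched cache block serves many accesses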
3
Hit Time: time to find and retrieve data from the current-level cache
Miss Penalty: average time to retrieve data on a current-level miss (includes the possibility of misses at successive levels of the memory hierarchy)
Hit Rate: % of requests that are found in the current-level cache
Miss Rate: 1 - Hit Rate
4
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Combining reads and writes: Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Memory hit time is included in the execution cycles
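A small sketch of this performance model; every number in the example call is made up for illustration:

def cpu_time(exec_cycles, mem_accesses, miss_rate, miss_penalty, cycle_time):
    stall_cycles = mem_accesses * miss_rate * miss_penalty
    return (exec_cycles + stall_cycles) * cycle_time

# 10^9 execution cycles, 3x10^8 memory accesses, 2% miss rate,
# 100-cycle miss penalty, 1 ns clock => 1.6 s
print(cpu_time(1e9, 3e8, 0.02, 100, 1e-9))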
5
Average Memory Access Time (AMAT) = Hit Time + (Miss Rate x Miss Penalty)
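The formula as a one-line helper; the worked examples on the later slides plug straight into it:

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))  # 2.0 cycles (the single-level example later on)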
6
Benefits of Larger Block Size
Spatial locality: if we access a given word, we're likely to access other nearby words soon
Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well
Works nicely for sequential array accesses too
7
Drawbacks of Larger Block Size
Larger block size means larger miss penalty
▪ On a miss, it takes longer to load a new block from the next level
If the block size is too big relative to the cache size, then there are too few blocks
▪ Result: miss rate goes up
In general, minimize Average Access Time = Hit Time + Miss Rate x Miss Penalty
8
Compulsory Misses
Occur when a program is first started
The cache does not contain any of that program's data yet, so misses are bound to occur
Can't be avoided easily
9
Conflict Misses
A miss that occurs because two distinct memory addresses map to the same cache location
Two blocks (which happen to map to the same location) can keep overwriting each other
A big problem in direct-mapped caches
How do we reduce the effect of these?
10
Capacity Misses
A miss that occurs because the cache has a limited size
A miss that would not occur if we increased the size of the cache
This is the primary type of miss for fully associative caches
11
Compulsory (cold start or process migration, first reference): first access to a block
▪ Not a whole lot you can do about it
▪ Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
Capacity: the cache cannot contain all blocks accessed by the program
▪ Solution: increase cache size
Conflict (collision): multiple memory locations mapped to the same cache location
▪ Solution 1: increase cache size
▪ Solution 2: increase associativity
Coherence (invalidation): another process (e.g., I/O) updates memory
12
                  Direct Mapped    N-way Set Associative    Fully Associative
Cache Size        Large            Medium                   Small
Capacity Miss     Low              Medium                   High
Conflict Miss     High             Medium                   Zero
Compulsory Miss   Same             Same                     Same
Coherence Miss    Same             Same                     Same
Note: if you run many (millions or billions of) instructions, compulsory misses are insignificant
13
Direct mapped: the index completely specifies which position a block can go in on a miss
N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss
Fully associative: the block can be written into any position
Question: if we have the choice, where should we write an incoming block?
14
If there are any locations with the valid bit off (empty), then usually write the new block into the first one
If all possible locations already have a valid block, we must use a replacement policy to determine which block gets "cached out" on a miss
15
LRU (Least Recently Used)
Idea: cache out the block which has been accessed (read or write) least recently
Pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
Con: with 2-way set associativity it is easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and considerable time to keep track of this
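A minimal sketch of LRU bookkeeping for one cache set; Python's OrderedDict stands in for the recency ordering that real hardware tracks with LRU bits, and the class name is ours:

from collections import OrderedDict

class LRUSet:
    """One N-way set; keys are block tags, least recently used first."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.ways:      # miss, set full: evict LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None                # install the new block
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in (0xA, 0xB, 0xA, 0xC, 0xB)])
# [False, False, True, False, False] -> 0xB was evicted at the 0xC access,
# because 0xA had been used more recently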
16
Larger cache: limited by cost and technology; the hit time of the first-level cache should be less than the cycle time
More places in the cache to put each block of memory (associativity):
▪ fully associative: any block can go in any line
▪ k-way set associative: k places for each block
▪ direct mapped: k = 1
17
How do we choose between options of associativity, block size, and replacement policy?
Design against a performance model
Minimize: Average Access Time = Hit Time + Miss Rate x Miss Penalty
Influenced by technology and program behavior
18
Assume: Hit Time = 1 cycle, Miss Rate = 5%, Miss Penalty = 20 cycles
Average memory access time?
AMAT = 1 + 0.05 x 20 = 2 cycles
19
When caches first became popular, the miss penalty was ~10 processor clock cycles
Today, with GHz processors (<1 ns per clock cycle) and 100 ns to go to DRAM, it is 100-300 processor clock cycles!
Solution: add another cache between the processor cache and memory: a second-level (L2) cache
20
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate x L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate x L2 Miss Penalty
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate x (L2 Hit Time + L2 Miss Rate x L2 Miss Penalty)
21
Assume: L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L1 Miss Penalty = 100 cycles
Average memory access time?
AMAT = 1 + 0.05 x 100 = 6 cycles
22
Assume: L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L2 Hit Time = 5 cycles, L2 Miss Rate = 15% (% of L1 misses that miss), L2 Miss Penalty = 100 cycles
L1 Miss Penalty = 5 + 0.15 x 100 = 20 cycles
Average memory access time?
AMAT = 1 + 0.05 x 20 = 2 cycles
3x faster with the L2 cache
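A quick check of this arithmetic, composing the two formulas from the previous slide (the function is ours):

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    l1_penalty = l2_hit + l2_miss_rate * l2_penalty  # average cost of an L1 miss
    return l1_hit + l1_miss_rate * l1_penalty

print(amat_two_level(1, 0.05, 5, 0.15, 100))  # 2.0 cycles, vs. 6 without L2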
23
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
24
Example: block 12 placed in an 8-block cache, under the three organizations: fully associative, direct mapped, and set associative
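The same example in code, with 2-way sets assumed for the set-associative case (the slide does not fix N):

block, num_blocks, ways = 12, 8, 2

# Direct mapped: exactly one candidate slot.
print("direct mapped: slot", block % num_blocks)          # slot 4

# 2-way set associative: 4 sets; the block maps to one set of two slots.
num_sets = num_blocks // ways
s = block % num_sets
print("set associative: set", s, "slots", [s * ways, s * ways + 1])  # set 0

# Fully associative: any slot will do.
print("fully associative: slots", list(range(num_blocks)))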
25
Direct indexing (using the index and block offset), tag compares, or a combination
Increasing associativity shrinks the index and expands the tag
Address breakdown: Block Address (Tag | Index) | Block Offset; the index selects the set, and the block offset selects the data within the block
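A sketch of the field extraction; the 64-byte block and 256-set sizes are example values of ours, not from the slides:

BLOCK_BITS = 6   # 64-byte blocks -> 6 offset bits (example value)
INDEX_BITS = 8   # 256 sets       -> 8 index bits  (example value)

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)                  # low bits: byte in block
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)   # middle bits: set select
    tag = addr >> (BLOCK_BITS + INDEX_BITS)                  # high bits: compared on lookup
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x48d1 0x59 0x38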
26
Easy for direct mapped: there is only one candidate block
Set associative or fully associative:
▪ Random
▪ LRU (Least Recently Used)
27
How do we handle: Write Hit? Write Miss?
28
Write-through: update the block in the cache and the corresponding block in lower-level memory
Write-back: update the word in the cache block and allow the memory word to be "stale"
▪ Add a 'dirty' bit to each line indicating that memory needs to be updated when the block is replaced
▪ The OS flushes the cache before I/O!!!
Performance trade-offs?
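The two write-hit policies as a rough sketch; cache, memory, and dirty are assumed dict-like stores keyed by block address:

def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value              # lower level updated on every write

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty[addr] = True                # memory copy is now stale

def write_back_evict(cache, dirty, memory, addr):
    if dirty.pop(addr, False):        # dirty bit set: update memory now
        memory[addr] = cache[addr]
    del cache[addr]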
29
Pros and cons of each?
WT: read misses cannot result in writes
WB: repeated writes to a block cause only a single write to lower-level memory
WT is always combined with write buffers so that the processor doesn't wait for lower-level memory
30
A write buffer is needed between the cache and memory
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
▪ Typical number of entries: 4
▪ Must handle bursts of writes
▪ Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
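A minimal model of the buffer as a FIFO, matching the description above; all names are illustrative:

from collections import deque

class WriteBuffer:
    def __init__(self, memory, entries=4):
        self.memory = memory          # dict-like lower level
        self.fifo = deque()
        self.entries = entries

    def write(self, addr, value):
        if len(self.fifo) >= self.entries:
            self.drain_one()          # buffer full: the processor would stall
        self.fifo.append((addr, value))

    def drain_one(self):
        addr, value = self.fifo.popleft()   # memory controller retires the
        self.memory[addr] = value           # oldest buffered write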
31
Problem: store frequency (w.r.t. time) > 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
▪ The store buffer will overflow no matter how big you make it
Solutions for write buffer saturation:
▪ Use a write-back cache
▪ Install a second-level (L2) cache (does this always work?)
32
Write-buffer issues: could introduce a RAW hazard with memory!
▪ The write buffer may contain the only copy of valid data => reads from memory may get the wrong result if we ignore the write buffer
Solutions:
▪ Simply wait for the write buffer to empty before servicing reads: might increase the read miss penalty (by 50% on the old MIPS 1000)
▪ Check the write buffer contents before the read ("fully associative"): if there are no conflicts, let the memory access continue; else grab the data from the buffer
Can the write buffer help with write-back? Yes: on a read miss replacing a dirty block,
▪ Copy the dirty block to the write buffer while starting the read to memory
▪ The CPU stalls less since it restarts as soon as the read is done
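The second solution sketched against the WriteBuffer above: scan every buffered entry (the "fully associative" check) before falling through to memory:

def read_with_forwarding(addr, write_buffer, memory):
    for buf_addr, value in reversed(write_buffer.fifo):   # newest entry wins
        if buf_addr == addr:
            return value              # conflict: forward data from the buffer
    return memory.get(addr)           # no conflict: normal memory access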
33
Assume a 16-bit write to memory location 0x0 causes a miss
Do we allocate space in the cache and possibly read in the block?
▪ Yes: Write Allocate
▪ No: No Write Allocate
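Both write-miss policies as a rough sketch at block granularity; memory is modeled as a dict of word lists, and all names are ours:

def write_miss_allocate(cache, memory, block_addr, offset, value):
    block = list(memory[block_addr])  # read the whole block in on the miss
    block[offset] = value             # then complete the write as a hit
    cache[block_addr] = block

def write_miss_no_allocate(memory, block_addr, offset, value):
    memory[block_addr][offset] = value    # write around the cache

memory = {0x0: [0] * 8}               # one 8-word block at address 0x0
cache = {}
write_miss_allocate(cache, memory, 0x0, 0, 0xBEEF)
print(cache[0x0][0])                  # 48879; the block now lives in the cache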
34
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time
▪ Temporal Locality: locality in time
▪ Spatial Locality: locality in space
Three (+1) major categories of cache misses:
▪ Compulsory misses: e.g., cold-start misses
▪ Conflict misses: increase cache size and/or associativity (nightmare scenario: ping-pong effect!)
▪ Capacity misses: increase cache size
▪ Coherence misses: caused by external processors or I/O devices
35
Size of cache: speed vs. capacity
Block size
Associativity: direct mapped vs. associative
▪ Choice of N for N-way set associative
Block replacement policy
Second-level cache?
Write-hit policy (write-through, write-back)
Write-miss policy
36
The optimal choice is a compromise
It depends on access characteristics:
▪ workload
▪ use (I-cache, D-cache, TLB)
It also depends on technology / cost
Use a performance model to pick between choices, depending on programs, technology, budget, ...
Simplicity often wins
37
Migration: move the data item to a local cache
▪ Benefits? Concerns?
Replication: when shared data is simultaneously read, make copies of the data in the local caches
▪ Benefits? Concerns?
38
We have to be careful with the memory hierarchy in an architecture/system with parallelism
A big issue for multicore multiprocessors
Coherence problems can also occur with I/O devices
39
Preserves program order:
CPU A writes value 100 to location X   // assume no other CPU writes here
CPU A reads from X                     // should read 100
Coherent view of memory:
CPU A writes value 200 to location X   // assume no other CPU writes here
CPU B reads from X                     // should read 200 if the read and write are sufficiently separated in time
40
Writes to the same location are serialized:
CPU A writes value 300 to location X
CPU B writes value 400 to location X
A CPU can never read 400 from X and then later read 300
All writes to the same location must be seen in the same order
We must enforce coherence
41
Snooping: the most popular cache coherence protocol
Problem: there is no centralized state of the caches
Solution: cache controllers snoop the bus/network to determine which block is accessed
One example: the Write Invalidate Protocol
▪ Ensure exclusive access to the block to be written
▪ Beware of false sharing
Scalable?
Also see: directory-based cache coherence protocols
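A toy model of write invalidate, with a Python list standing in for the shared bus; real protocols track per-block states (e.g., MSI), which this sketch omits:

class SnoopyCache:
    def __init__(self, bus):
        self.data = {}
        bus.append(self)              # register this cache as a bus snooper

    def write(self, bus, addr, value):
        for cache in bus:             # broadcast: other caches invalidate
            if cache is not self:
                cache.data.pop(addr, None)
        self.data[addr] = value       # writer now holds the only cached copy

bus = []
a, b = SnoopyCache(bus), SnoopyCache(bus)
a.write(bus, 0x40, 100)
b.write(bus, 0x40, 200)               # invalidates A's copy before writing
print(0x40 in a.data, b.data[0x40])   # False 200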
42
Caches are NOT mandatory:
▪ The processor performs arithmetic, memory stores data, and caches simply make data transfers go faster
Each level of the memory hierarchy is just a subset of the next higher level
Caches speed things up due to temporal locality: store data used recently
Block size > 1 word speeds things up due to spatial locality: store words adjacent to the ones used recently
Cache coherency is important/tricky for multicore