1
Lecture slides originally adapted from Prof. Valeriu Beiu (Washington State University, Spring 2005, EE 334)
2
Temporal Locality (Locality in Time)
Keep most recently accessed data items closer to the processor
Spatial Locality (Locality in Space)
Move blocks consisting of contiguous words to the upper levels of the cache
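As a quick illustration (not from the original slides), a sequential array traversal exhibits both kinds of locality: the accumulator is reused on every iteration (temporal), and consecutive elements share cache blocks (spatial). A minimal Python sketch:

# Illustrative sketch only: interpreter overhead hides much of the hardware
# effect in Python, but the access pattern is the point.
data = list(range(1_000_000))

total = 0                  # temporal locality: reused on every iteration
for x in data:             # spatial locality: elements are visited in order,
    total += x             # so each fetched cache block serves many accesses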
3
Hit Time: time to find and retrieve data from the current-level cache
Miss Penalty: average time to retrieve data on a current-level miss (includes the possibility of misses at successive levels of the memory hierarchy)
Hit Rate: % of requests that are found in the current-level cache
Miss Rate: 1 - Hit Rate
4
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
Memory stall clock cycles = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
Combining reads and writes: Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
Memory hit time is included in the execution cycles
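A small sketch of this performance model; every number in the example call is made up for illustration:

def cpu_time(exec_cycles, mem_accesses, miss_rate, miss_penalty, cycle_time):
    stall_cycles = mem_accesses * miss_rate * miss_penalty
    return (exec_cycles + stall_cycles) * cycle_time

# 10^9 execution cycles, 3x10^8 memory accesses, 2% miss rate,
# 100-cycle miss penalty, 1 ns clock => 1.6 s
print(cpu_time(1e9, 3e8, 0.02, 100, 1e-9))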
5
Average Memory Access Time (AMAT) = Hit Time + (Miss Rate x Miss Penalty)
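The formula as a one-line helper; the worked examples on the later slides plug straight into it:

def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.05, 20))  # 2.0 cycles (the single-level example later on)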
6
Benefits of Larger Block Size
Spatial locality: if we access a given word, we're likely to access other nearby words soon
Very applicable with the stored-program concept: if we execute a given instruction, it's likely that we'll execute the next few as well
Works nicely for sequential array accesses too
7
Drawbacks of Larger Block Size
Larger block size means larger miss penalty
▪ On a miss, it takes longer to load a new block from the next level
If the block size is too big relative to the cache size, then there are too few blocks
▪ Result: miss rate goes up
In general, minimize Average Access Time = Hit Time + Miss Rate x Miss Penalty
8
Compulsory Misses
Occur when a program is first started
The cache does not contain any of that program's data yet, so misses are bound to occur
Can't be avoided easily
9
Conflict Misses
A miss that occurs because two distinct memory addresses map to the same cache location
Two blocks (which happen to map to the same location) can keep overwriting each other
A big problem in direct-mapped caches
How do we reduce the effect of these?
10
Capacity Misses
A miss that occurs because the cache has a limited size
A miss that would not occur if we increased the size of the cache
This is the primary type of miss for fully associative caches
11
Compulsory (cold start or process migration, first reference): first access to a block
▪ Not a whole lot you can do about it
▪ Note: if you are going to run "billions" of instructions, compulsory misses are insignificant
Capacity: the cache cannot contain all blocks accessed by the program
▪ Solution: increase cache size
Conflict (collision): multiple memory locations mapped to the same cache location
▪ Solution 1: increase cache size
▪ Solution 2: increase associativity
Coherence (invalidation): another process (e.g., I/O) updates memory
12
                  Direct Mapped    N-way Set Associative    Fully Associative
Cache Size        Large            Medium                   Small
Capacity Miss     Low              Medium                   High
Conflict Miss     High             Medium                   Zero
Compulsory Miss   Same             Same                     Same
Coherence Miss    Same             Same                     Same
Note: if you run many (millions or billions of) instructions, compulsory misses are insignificant
13
Direct mapped: the index completely specifies which position a block can go in on a miss
N-way set associative (N > 1): the index specifies a set, but the block can occupy any position within the set on a miss
Fully associative: the block can be written into any position
Question: if we have the choice, where should we write an incoming block?
14
If there are any locations with the valid bit off (empty), then usually write the new block into the first one
If all possible locations already have a valid block, we must use a replacement policy to determine which block gets "cached out" on a miss
15
LRU (Least Recently Used)
Idea: cache out the block which has been accessed (read or write) least recently
Pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
Con: with 2-way set associativity it is easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and considerable time to keep track of this
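A minimal sketch of LRU bookkeeping for one cache set; Python's OrderedDict stands in for the recency ordering that real hardware tracks with LRU bits, and the class name is ours:

from collections import OrderedDict

class LRUSet:
    """One N-way set; keys are block tags, least recently used first."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark as most recently used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.ways:      # miss, set full: evict LRU block
            self.blocks.popitem(last=False)
        self.blocks[tag] = None                # install the new block
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in (0xA, 0xB, 0xA, 0xC, 0xB)])
# [False, False, True, False, False] -> 0xB was evicted at the 0xC access,
# because 0xA had been used more recently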
16
Larger cache: limited by cost and technology; the hit time of the first-level cache should be less than the cycle time
More places in the cache to put each block of memory (associativity):
▪ fully associative: any block can go in any line
▪ k-way set associative: k places for each block
▪ direct mapped: k = 1
17
How do we choose between options of associativity, block size, and replacement policy?
Design against a performance model
Minimize: Average Access Time = Hit Time + Miss Rate x Miss Penalty
Influenced by technology and program behavior
18
Assume: Hit Time = 1 cycle, Miss Rate = 5%, Miss Penalty = 20 cycles
Average memory access time?
AMAT = 1 + 0.05 x 20 = 2 cycles
19
When caches first became popular, the miss penalty was ~10 processor clock cycles
Today, with GHz processors (<1 ns per clock cycle) and 100 ns to go to DRAM, it is 100-300 processor clock cycles!
Solution: add another cache between the processor cache and memory: a second-level (L2) cache
20
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate x L1 Miss Penalty
L1 Miss Penalty = L2 Hit Time + L2 Miss Rate x L2 Miss Penalty
Avg Mem Access Time = L1 Hit Time + L1 Miss Rate x (L2 Hit Time + L2 Miss Rate x L2 Miss Penalty)
21
Assume: L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L1 Miss Penalty = 100 cycles
Average memory access time?
AMAT = 1 + 0.05 x 100 = 6 cycles
22
Assume: L1 Hit Time = 1 cycle, L1 Miss Rate = 5%, L2 Hit Time = 5 cycles, L2 Miss Rate = 15% (% of L1 misses that miss), L2 Miss Penalty = 100 cycles
L1 Miss Penalty = 5 + 0.15 x 100 = 20 cycles
Average memory access time?
AMAT = 1 + 0.05 x 20 = 2 cycles
3x faster with the L2 cache
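A quick check of this arithmetic, composing the two formulas from the previous slide (the function is ours):

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    l1_penalty = l2_hit + l2_miss_rate * l2_penalty  # average cost of an L1 miss
    return l1_hit + l1_miss_rate * l1_penalty

print(amat_two_level(1, 0.05, 5, 0.15, 100))  # 2.0 cycles, vs. 6 without L2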
23
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
24
Example: block 12 placed in an 8-block cache, under the three organizations: fully associative, direct mapped, and set associative
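The same example in code, with 2-way sets assumed for the set-associative case (the slide does not fix N):

block, num_blocks, ways = 12, 8, 2

# Direct mapped: exactly one candidate slot.
print("direct mapped: slot", block % num_blocks)          # slot 4

# 2-way set associative: 4 sets; the block maps to one set of two slots.
num_sets = num_blocks // ways
s = block % num_sets
print("set associative: set", s, "slots", [s * ways, s * ways + 1])  # set 0

# Fully associative: any slot will do.
print("fully associative: slots", list(range(num_blocks)))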
25
Direct indexing (using the index and block offset), tag compares, or a combination
Increasing associativity shrinks the index and expands the tag
Address breakdown: Block Address (Tag | Index) | Block Offset; the index selects the set, and the block offset selects the data within the block
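A sketch of the field extraction; the 64-byte block and 256-set sizes are example values of ours, not from the slides:

BLOCK_BITS = 6   # 64-byte blocks -> 6 offset bits (example value)
INDEX_BITS = 8   # 256 sets       -> 8 index bits  (example value)

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)                  # low bits: byte in block
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)   # middle bits: set select
    tag = addr >> (BLOCK_BITS + INDEX_BITS)                  # high bits: compared on lookup
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))   # 0x48d1 0x59 0x38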
26
Easy for direct mapped: there is only one candidate block
Set associative or fully associative:
▪ Random
▪ LRU (Least Recently Used)
27
How do we handle: Write Hit? Write Miss?
28
Write-through: update the block in the cache and the corresponding block in lower-level memory
Write-back: update the word in the cache block and allow the memory word to be "stale"
▪ Add a 'dirty' bit to each line indicating that memory needs to be updated when the block is replaced
▪ The OS flushes the cache before I/O!!!
Performance trade-offs?
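The two write-hit policies as a rough sketch; cache, memory, and dirty are assumed dict-like stores keyed by block address:

def write_through(cache, memory, addr, value):
    cache[addr] = value
    memory[addr] = value              # lower level updated on every write

def write_back(cache, dirty, addr, value):
    cache[addr] = value
    dirty[addr] = True                # memory copy is now stale

def write_back_evict(cache, dirty, memory, addr):
    if dirty.pop(addr, False):        # dirty bit set: update memory now
        memory[addr] = cache[addr]
    del cache[addr]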
29
Pros and cons of each?
WT: read misses cannot result in writes
WB: repeated writes to a block cause only a single write to lower-level memory
WT is always combined with write buffers so that the processor doesn't wait for lower-level memory
30
A write buffer is needed between the cache and memory
Processor: writes data into the cache and the write buffer
Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO:
▪ Typical number of entries: 4
▪ Must handle bursts of writes
▪ Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
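A minimal model of the buffer as a FIFO, matching the description above; all names are illustrative:

from collections import deque

class WriteBuffer:
    def __init__(self, memory, entries=4):
        self.memory = memory          # dict-like lower level
        self.fifo = deque()
        self.entries = entries

    def write(self, addr, value):
        if len(self.fifo) >= self.entries:
            self.drain_one()          # buffer full: the processor would stall
        self.fifo.append((addr, value))

    def drain_one(self):
        addr, value = self.fifo.popleft()   # memory controller retires the
        self.memory[addr] = value           # oldest buffered write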
31
Problem: store frequency (w.r.t. time) > 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
▪ The store buffer will overflow no matter how big you make it
Solutions for write buffer saturation:
▪ Use a write-back cache
▪ Install a second-level (L2) cache (does this always work?)
32
Write-buffer issues: could introduce a RAW hazard with memory!
▪ The write buffer may contain the only copy of valid data => reads from memory may get the wrong result if we ignore the write buffer
Solutions:
▪ Simply wait for the write buffer to empty before servicing reads: might increase the read miss penalty (by 50% on the old MIPS 1000)
▪ Check the write buffer contents before the read ("fully associative"): if there are no conflicts, let the memory access continue; else grab the data from the buffer
Can the write buffer help with write-back? Yes: on a read miss replacing a dirty block,
▪ Copy the dirty block to the write buffer while starting the read to memory
▪ The CPU stalls less since it restarts as soon as the read is done
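The second solution sketched against the WriteBuffer above: scan every buffered entry (the "fully associative" check) before falling through to memory:

def read_with_forwarding(addr, write_buffer, memory):
    for buf_addr, value in reversed(write_buffer.fifo):   # newest entry wins
        if buf_addr == addr:
            return value              # conflict: forward data from the buffer
    return memory.get(addr)           # no conflict: normal memory access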
33
Assume a 16-bit write to memory location 0x0 causes a miss
Do we allocate space in the cache and possibly read in the block?
▪ Yes: Write Allocate
▪ No: No Write Allocate
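Both write-miss policies as a rough sketch at block granularity; memory is modeled as a dict of word lists, and all names are ours:

def write_miss_allocate(cache, memory, block_addr, offset, value):
    block = list(memory[block_addr])  # read the whole block in on the miss
    block[offset] = value             # then complete the write as a hit
    cache[block_addr] = block

def write_miss_no_allocate(memory, block_addr, offset, value):
    memory[block_addr][offset] = value    # write around the cache

memory = {0x0: [0] * 8}               # one 8-word block at address 0x0
cache = {}
write_miss_allocate(cache, memory, 0x0, 0, 0xBEEF)
print(cache[0x0][0])                  # 48879; the block now lives in the cache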
34
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time
▪ Temporal Locality: locality in time
▪ Spatial Locality: locality in space
Three (+1) major categories of cache misses:
▪ Compulsory misses: e.g., cold-start misses
▪ Conflict misses: increase cache size and/or associativity (nightmare scenario: ping-pong effect!)
▪ Capacity misses: increase cache size
▪ Coherence misses: caused by external processors or I/O devices
35
Size of cache: speed vs. capacity
Block size
Associativity: direct mapped vs. associative
▪ Choice of N for N-way set associative
Block replacement policy
Second-level cache?
Write-hit policy (write-through, write-back)
Write-miss policy
36
The optimal choice is a compromise
It depends on access characteristics:
▪ workload
▪ use (I-cache, D-cache, TLB)
It also depends on technology / cost
Use a performance model to pick between choices, depending on programs, technology, budget, ...
Simplicity often wins
37
Migration: move the data item to a local cache
▪ Benefits? Concerns?
Replication: when shared data is simultaneously read, make copies of the data in the local caches
▪ Benefits? Concerns?
38
We have to be careful with the memory hierarchy in an architecture/system with parallelism
A big issue for multicore multiprocessors
Coherence problems can also occur with I/O devices
39
Preserves program order:
CPU A writes value 100 to location X   // assume no other CPU writes here
CPU A reads from X                     // should read 100
Coherent view of memory:
CPU A writes value 200 to location X   // assume no other CPU writes here
CPU B reads from X                     // should read 200 if the read and write are sufficiently separated in time
40
Writes to the same location are serialized:
CPU A writes value 300 to location X
CPU B writes value 400 to location X
A CPU can never read 400 from X and then later read 300
All writes to the same location must be seen in the same order
We must enforce coherence
41
Snooping: the most popular cache coherence protocol
Problem: there is no centralized state of the caches
Solution: cache controllers snoop the bus/network to determine which block is accessed
One example: the Write Invalidate Protocol
▪ Ensure exclusive access to the block to be written
▪ Beware of false sharing
Scalable?
Also see: directory-based cache coherence protocols
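A toy model of write invalidate, with a Python list standing in for the shared bus; real protocols track per-block states (e.g., MSI), which this sketch omits:

class SnoopyCache:
    def __init__(self, bus):
        self.data = {}
        bus.append(self)              # register this cache as a bus snooper

    def write(self, bus, addr, value):
        for cache in bus:             # broadcast: other caches invalidate
            if cache is not self:
                cache.data.pop(addr, None)
        self.data[addr] = value       # writer now holds the only cached copy

bus = []
a, b = SnoopyCache(bus), SnoopyCache(bus)
a.write(bus, 0x40, 100)
b.write(bus, 0x40, 200)               # invalidates A's copy before writing
print(0x40 in a.data, b.data[0x40])   # False 200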
42
Caches are NOT mandatory:
▪ The processor performs arithmetic, memory stores data, and caches simply make data transfers go faster
Each level of the memory hierarchy is just a subset of the next higher level
Caches speed things up due to temporal locality: store data used recently
Block size > 1 word speeds things up due to spatial locality: store words adjacent to the ones used recently
Cache coherency is important/tricky for multicore