Chap.7 Memory system Jen-Chang Liu, Spring 2006
Big Ideas so far: 15 weeks to learn big ideas in CS&E. Principle of abstraction, used to build systems as layers. Pliable data: a program determines what it is. Stored program concept: instructions are just data. Greater performance by exploiting parallelism (pipeline). Principle of locality, exploited via a memory hierarchy (cache). Principles/pitfalls of performance measurement
Five components of computer Input, output, memory, datapath, control
Outline Introduction Basics of caches Measuring cache performance Set associative cache Multilevel cache Virtual memory Make memory system fast Make memory system big
Introduction: the programmer's view of memory is an unlimited amount of fast memory. How do we create this illusion? Scene: a library (bookshelf, desk, one book, books)
Principle of locality: programs access a relatively small portion of their address space at any instant of time. Temporal locality: if an item is referenced, it will tend to be referenced again soon. Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon
Cost and performance of memory (2004)
Technology      Access time        $ per GB
SRAM            0.5-5 ns           $4000-$10000
DRAM            50-70 ns           $100-$200
Magnetic disk   5-20 million ns    $0.5-$2
SRAM: static random access memory; DRAM: dynamic random access memory
How to build a memory system from the above memory technologies?
Memory hierarchy. Ex. disk (all data) -> DRAM (subset of data) -> SRAM (subset of data)
Operation in memory hierarchy: if data is found /* hit */ transfer to processor; else /* miss */ transfer data to upper level. Access time: hit time (on a hit); miss penalty (on a miss)
Outline Introduction Basics of caches Measuring cache performance Set associative cache Multilevel cache Virtual memory How to design memory hierarchy?
Cache: "a safe place for hiding or storing things" (Webster's dictionary). Cache memory: the level of the memory hierarchy between CPU and main memory; more generally, any storage managed to take advantage of locality of access
What does a cache do?
Problem to design a cache: the cache contains part of the data in memory or disk. Q1: How do we know if a data item is in the cache? => How do we place data fetched from memory into the cache?
Direct mapped cache (Fig 7.5): address of word -> location in cache. Ex. (block address) modulo (number of blocks in the cache)
Direct mapped cache (cont.): many memory words -> one location in cache. Q: Which memory word is in the cache? Use a tag to identify it. Q: Is the cache block valid? Ex. initially, the cache is empty. Use a valid bit to identify it. Cache entry fields: valid, tag, data word
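The placement rule above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the function name `direct_mapped_slot` and the 8-block cache size are assumptions chosen for the example.

```python
def direct_mapped_slot(block_address, num_blocks):
    """Return (index, tag) for a block address in a direct-mapped cache."""
    # (block address) modulo (number of blocks in the cache) picks the cache location
    index = block_address % num_blocks
    # the remaining upper part of the block address is kept as the tag
    tag = block_address // num_blocks
    return index, tag

# Hypothetical 8-block cache: these block addresses all compete for index 1
for block_address in (1, 9, 17, 25):
    print(block_address, direct_mapped_slot(block_address, 8))
```

Because several block addresses share one index, the stored tag is what tells them apart on a lookup.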
Fig7.6
Cache access (Fig 7.7): word = 4 bytes; address; the part of the cache that actually stores data; cache block size:
Ex. Calculate bits in a cache: how many bits are required for a direct-mapped cache with 64KB of data and one-word blocks, assuming a 32-bit address?
64KB = 16K words = 2^14 words, so index = 14 bits and byte offset = 2 bits
Tag = 32 - 14 - 2 = 16 bits
Cache bits: 2^14 x (32 data + 16 tag + 1 valid) = 2^14 x 49 = 784 Kbits = 98KB
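The arithmetic in this example can be checked with a short Python sketch. The function name and its parameters are assumptions made for illustration; it assumes power-of-two sizes and one valid bit per entry, as in the example.

```python
def direct_mapped_total_bits(data_bytes, words_per_block=1, address_bits=32, word_bytes=4):
    """Return (tag_bits, total_bits) for a direct-mapped cache with power-of-two sizes."""
    block_bytes = words_per_block * word_bytes
    num_blocks = data_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1     # log2(num_blocks)
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    tag_bits = address_bits - index_bits - offset_bits
    bits_per_entry = block_bytes * 8 + tag_bits + 1   # data + tag + valid bit
    return tag_bits, num_blocks * bits_per_entry

tag_bits, total_bits = direct_mapped_total_bits(64 * 1024)
print(tag_bits, total_bits, total_bits / 8 / 1024)   # tag bits, total bits, total in KB
```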
Ex. Real machine: DECstation … … KB data, 98KB cache size (2^14)
Ex. DECStation 3100 Use MIPS R2000 CPU Use pipeline as in Chap. 6 Data memory Instruction memory Two memory Units?
Ex. DECStation 3100 caches Instruction cache and data cache 64KB Instruction cache 64KB data cache
Ex. DECStation 3100 Cache access: Read 64KB Instruction cache 64KB data cache PC Address calculated from ALU Cache hit Cache miss Update cache
Peer Instruction A. Mem hierarchies were invented before (UNIVAC I wasn't delivered 'til 1951) B. If you know your computer's cache size, you can often make your code run faster. C. Memory hierarchies take advantage of spatial locality by keeping the most recent data items closer to the processor. ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
CS61C L31 Caches I (25) Garcia 2005 © UCB Peer Instruction Answer A. Mem hierarchies were invented before (UNIVAC I wasn't delivered 'til 1951): "We are…forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less accessible." (von Neumann, 1946) B. If you know your computer's cache size, you can often make your code run faster: Certainly! That's called "tuning". C. Memory hierarchies take advantage of spatial locality by keeping the most recent data items closer to the processor: "most recent" items => temporal locality. ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
Peer Instructions 1. All caches take advantage of spatial locality. 2. All caches take advantage of temporal locality. 3. On a read, the return value will depend on what is in the cache. ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
Peer Instruction Answer 1. All caches take advantage of spatial locality. FALSE: with block size = 1 there is no spatial locality! 2. All caches take advantage of temporal locality. TRUE: that's the idea of caches; we'll need the data again soon. 3. On a read, the return value will depend on what is in the cache. FALSE: it had better not! If the data is there, use it; otherwise, get it from memory. ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
Handling cache misses Cache miss processing Stall the processor Fetch the data from memory Write the cache entry Put the data Update the tag field Update the valid bit Continue execution
Ex. DECStation 3100 Cache access: Write. Storing a new value makes the data in cache and memory inconsistent! 1. Write-through: update the cache and also write to memory at the same time. 2. Write-back: update only the cache, without writing to memory
Problems with write-through: writing to main memory slows down performance. Ex. CPI without cache misses = 1.2 clock cycles; each write to memory causes 10 extra cycles; 13% of instructions in gcc are stores: CPI = 1.2 + 10 x 13% = 2.5 clock cycles. Solution: write buffer. Store the data in the write buffer while it is waiting to be written to memory; the processor can continue execution after writing the data into the cache and the write buffer
Problems with write-back: the new value is written only to the cache. Problem: cache and memory inconsistency. Complex to implement. Ex. when a cache entry is replaced, the corresponding memory address must be updated
Use of spatial locality: the previous cache design takes advantage of temporal locality. To use spatial locality in cache design, make a cache block larger than 1 word in length; on a cache miss, we fetch multiple adjacent words at once
One-word cache (Fig 7.7) address
Multiple-word cache 4-word block addr.
Advantage of multiple-word block (spatial locality). Ex. access words with byte addresses 16, 24, 20
1-word block cache: 16 - cache miss; 24 - cache miss; 20 - cache miss
4-word block cache: 16 - cache miss (load 4-word block from memory); 24 - cache hit; 20 - cache hit
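The hit/miss pattern in this example can be reproduced with a small simulation. This is a sketch under stated assumptions: the function name `hit_or_miss` and the 4-block cache size are hypothetical, and the cache starts empty.

```python
def hit_or_miss(byte_addresses, words_per_block, num_blocks=4, word_bytes=4):
    """Replay byte addresses against an initially empty direct-mapped cache."""
    block_bytes = words_per_block * word_bytes
    tags = [None] * num_blocks          # None marks an invalid (empty) entry
    results = []
    for addr in byte_addresses:
        block_address = addr // block_bytes
        index = block_address % num_blocks
        tag = block_address // num_blocks
        if tags[index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            tags[index] = tag           # a miss loads the whole block
    return results

print(hit_or_miss([16, 24, 20], words_per_block=1))
print(hit_or_miss([16, 24, 20], words_per_block=4))
```

With 4-word blocks, byte addresses 16, 20 and 24 fall in the same 16-byte block, so only the first access misses.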
Multiple-word cache: write miss (figure: addr., 1-word data, miss -> reload 4-word block)
CS61C L32 Caches II (37) Garcia, 2005 © UCB Cache read walk-through (figure: cache table with columns Valid, Tag, and data words 0x0-3, 0x4-7, 0x8-b, 0xc-f, one row per Index; address split into Tag field, Index field, Offset)
1. Read 0x…: read block 1; no valid data, so load that data (a b c d) into the cache, setting tag and valid; read from cache at offset, return word b
2. Read 0x…C: index is valid and tag matches, so return word d
3. Read 0x…: read block 3; no valid data, so load that cache block (e f g h) and return word f
4. Read 0x…: read cache block 1; data is valid but the tag does not match (0 != 2); miss, so replace block 1 with new data (i j k l) and tag, and return word j
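The four reads in this walk-through can be replayed in Python. The addresses are lost in the slides, so the addresses below (0x14, 0x1C, 0x34, 0x94), the 4-block cache size, and the `MEMORY` contents are all assumptions picked only to reproduce the same index, tag, and offset sequence; they are not the original values.

```python
BLOCK_BYTES = 16   # four 4-byte words per block (columns 0x0-3, 0x4-7, 0x8-b, 0xc-f)
NUM_BLOCKS = 4     # assumed cache size, chosen only to keep the example small

# Hypothetical memory contents, keyed by block address (byte address // 16)
MEMORY = {1: "abcd", 3: "efgh", 9: "ijkl"}

def simulate_reads(addresses):
    """Replay reads against an initially empty direct-mapped cache."""
    cache = [{"valid": False, "tag": None, "data": None} for _ in range(NUM_BLOCKS)]
    results = []
    for addr in addresses:
        offset = addr % BLOCK_BYTES
        block_address = addr // BLOCK_BYTES
        index = block_address % NUM_BLOCKS
        tag = block_address // NUM_BLOCKS
        entry = cache[index]
        hit = entry["valid"] and entry["tag"] == tag
        if not hit:
            # miss: load the block, then set the tag and the valid bit
            entry.update(valid=True, tag=tag, data=MEMORY[block_address])
        word = entry["data"][offset // 4]   # select the word inside the block
        results.append(("hit" if hit else "miss", word))
    return results

print(simulate_reads([0x14, 0x1C, 0x34, 0x94]))
```

The last read maps to index 1 with tag 2, so it evicts the block loaded by the first read, as in step 4.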
Advantage of multiple-word block (spatial locality): comparison of miss rate
Program   Block size in words   Instruction miss rate   Data miss rate
gcc       1                     6.1%                    2.1%
gcc       4                     2.0%                    1.7%
spice     1                     1.2%                    1.3%
spice     4                     0.3%                    0.6%
Why is the improvement in instruction miss rate significant? Instruction references have better spatial locality
Miss rate vs. block size. Why? The number of blocks decreases!
Short conclusion Direct mapped cache Map a memory word to a cache block Valid bit, tag field Cache read Hit, read miss, miss penalty Cache write Write-through Write-back Write miss penalty Multi-word cache (use spatial locality)
Outline Introduction Basics of caches Measuring cache performance Set associative cache Multilevel cache Virtual memory Make memory system fast
Cache performance: how does the cache affect system performance?
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x clock cycle time
(CPU execution clock cycles cover cache hits; memory-stall clock cycles come from cache misses)
Memory-stall cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = (Reads / Program) x read miss rate x read miss penalty
Assuming read and write miss penalties are the same:
Memory-stall cycles = (Memory accesses / Program) x miss rate x miss penalty
Ex. Calculate cache performance: CPI = 2 without any memory stalls. For gcc, instruction cache miss rate = 2%, data cache miss rate = 4%, miss penalty = 40 cycles, and 36% of instructions are lw/sw
Sol: Set instruction count = I
Instruction miss cycles = I x 2% x 40 = 0.8 x I
Data miss cycles = I x 36% x 4% x 40 = 0.58 x I
Memory-stall cycles = 0.8I + 0.58I = 1.38I
(CPU time with stalls) / (CPU time with perfect cache) = (2I + 1.38I) / 2I = 1.69
Why is memory the bottleneck for system performance? In the previous example, if we make the processor faster and change CPI from 2 to 1, memory-stall cycles remain the same = 1.38I
(CPU time with stalls) / (CPU time with perfect cache) = (I + 1.38I) / I = 2.38
Percentage of memory stall: 1.38 / 3.38 = 41% -> 1.38 / 2.38 = 58%
As the CPU gets faster (lower CPI or higher clock rate), memory stalls account for a larger share of execution time
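The two cases above (CPI = 2 and CPI = 1) can be compared with a short Python sketch. The function name `stall_summary` is an assumption for illustration; the default parameter values are the gcc numbers from the example.

```python
def stall_summary(base_cpi, inst_miss_rate=0.02, data_miss_rate=0.04,
                  mem_inst_fraction=0.36, miss_penalty=40):
    """Return (stall cycles per instruction, slowdown vs. perfect cache, stall share)."""
    inst_stall = inst_miss_rate * miss_penalty                        # 0.8 per instruction
    data_stall = mem_inst_fraction * data_miss_rate * miss_penalty    # about 0.58 per instruction
    stall = inst_stall + data_stall
    total_cpi = base_cpi + stall
    return stall, total_cpi / base_cpi, stall / total_cpi

for base_cpi in (2, 1):
    stall, slowdown, share = stall_summary(base_cpi)
    print(base_cpi, round(stall, 2), round(slowdown, 2), round(share * 100))
```

The stall cycles per instruction do not change with the base CPI, which is why their share of execution time grows as the processor gets faster.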
Outline Introduction Basics of caches Measuring cache performance Set associative cache (reduce miss rate) Multilevel cache Virtual memory Make memory system fast
How to improve cache performance? Memory-stall cycles = (Memory accesses / Program) x miss rate x miss penalty. Reduce cache miss rate: larger cache; set associative cache (new placement rule other than direct mapping). Reduce cache miss penalty: multi-level cache
Flexible placement of blocks. Recall: direct mapped cache, one address -> one block in cache. Alternative: one address -> more than one possible block in cache
Fully associative cache: a memory block can be placed in any block in the cache. Disadvantage: must search all entries in the cache for a match, using parallel comparators
Set-associative cache: between direct mapped and fully associative. A memory block can be placed in any block of one set in the cache; the set is (block address) modulo (number of sets in cache). Ex. 12 modulo 4 = 0. Disadvantage: must search all entries in the set for a match, using parallel comparators
Example: 4-way set-associative cache Parallel comparators
Take all schemes as a case of set-associativity Ex. 8-block cache
Example: set-associative caches (p. 500). A cache with 4 blocks; load data with block addresses 0, 8, 0, 6, 8. One-way set-associative cache (direct mapped): 5 misses
Example: set-associative caches. 2-way set-associative cache: 4 misses. 4-way set-associative cache: 3 misses
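The miss counts in this example can be reproduced with a small simulation. This sketch assumes least-recently-used (LRU) replacement within each set, which the slides do not state; the function name `count_misses` is also an assumption. For simplicity each set stores full block addresses rather than tags.

```python
def count_misses(block_addresses, num_blocks, ways):
    """Count misses in a set-associative cache with LRU replacement (assumed policy)."""
    num_sets = num_blocks // ways
    # each set is a list ordered from least to most recently used
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]   # (block address) modulo (number of sets)
        if block in lines:
            lines.remove(block)          # hit: re-appended below as most recent
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)             # set is full: evict the least recently used
        lines.append(block)
    return misses

trace = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(ways, count_misses(trace, num_blocks=4, ways=ways))
```

Setting `ways=1` gives the direct-mapped case and `ways=4` makes the 4-block cache fully associative.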
Short conclusion Higher degree of associativity Lower miss rate More hardware cost to search
Outline Introduction Basics of caches Measuring cache performance Set associative cache Multilevel cache (reduce miss penalty) Virtual memory Make memory system fast
Multi-level cache Goal: reduce miss penalty Primary cache (L1) Secondary cache(L2) L1 cache miss L2 cache miss Cache hit Main memory
Example: Performance of multilevel cache. CPI = 1 without cache miss, clock rate = 500MHz. Primary cache: miss rate = 5%. Secondary cache: miss rate = 2%, access time = 20ns. Main memory: access time = 200ns. Total CPI = Base CPI (1) + memory-stall CPI (?)
Example: Performance of multilevel cache (cont.)
Total CPI = Base CPI (1) + memory-stall CPI
Access to main memory = 200ns x 500M clock/sec = 100 clocks
Access to L2 cache = 20ns x 500M clock/sec = 10 clocks
One-level cache: Total CPI = 1 + 5% x 100 = 6
Two-level cache: Total CPI = 1 + L1 miss penalty + L2 miss penalty = 1 + 5% x 10 + 2% x 100 = 3.5
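The one-level and two-level results can be recomputed with a short Python sketch. The function and variable names are assumptions for illustration; both miss rates are applied per instruction, as in the formula above.

```python
def access_cycles(access_time_ns, clock_mhz):
    """Convert an access time in ns to clock cycles at the given clock rate in MHz."""
    return access_time_ns * clock_mhz / 1000

BASE_CPI = 1
CLOCK_MHZ = 500
L1_MISS_RATE = 0.05   # primary cache miss rate
L2_MISS_RATE = 0.02   # misses that also go to main memory

l2_penalty = access_cycles(20, CLOCK_MHZ)     # L2 access time in cycles
mem_penalty = access_cycles(200, CLOCK_MHZ)   # main memory access time in cycles

one_level_cpi = BASE_CPI + L1_MISS_RATE * mem_penalty
two_level_cpi = BASE_CPI + L1_MISS_RATE * l2_penalty + L2_MISS_RATE * mem_penalty

print(l2_penalty, mem_penalty, one_level_cpi, two_level_cpi)
```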