Virtual Memory
  Topics
    Virtual memory access
    Page table, TLB
    Programming for locality
    Memory Mountain revisited
Memory Hierarchy
  Levels, from the top (smaller, faster, costlier per byte) to the bottom (larger, slower, cheaper per byte):
    Registers
    On-chip L1 cache (SRAM)
    On-chip L2 cache (SRAM)
    Main memory (DRAM)
    Local secondary storage (local disks)
    Remote secondary storage (tapes, distributed file systems, Web servers)
Why Caches Work
  Temporal locality: recently referenced items are likely to be referenced again in the near future
  Spatial locality: items with nearby addresses tend to be referenced close together in time
Cache (L1 and L2) Performance Metrics
  Miss rate
    Fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate
    Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
  Hit time
    Time to deliver a block in the cache to the processor, including the time to determine whether the line is in the cache
    Typical numbers: 1-3 clock cycles for L1; more clock cycles for L2
  Miss penalty
    Additional time required because of a miss
    Typically on the order of 100 cycles for main memory
Let's Think About Those Numbers
  Huge difference between a hit and a miss: could be 100x with just L1 and main memory
  Would you believe 99% hits is twice as good as 97%?
  Consider a cache hit time of 1 cycle and a miss penalty of 100 cycles
  Average access time:
    97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
    99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles
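The arithmetic on this slide can be checked with a short sketch. The function name and its default parameters are illustrative only; the weighting (hits cost the hit time, misses cost the miss penalty) follows the slide's own formula rather than any particular textbook definition.

```python
def average_access_time(hit_rate, hit_time=1.0, miss_penalty=100.0):
    """Average cycles per access, weighted as on the slide:
    hits cost hit_time, misses cost miss_penalty."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time + miss_rate * miss_penalty

for rate in (0.97, 0.99):
    # Prints 3.97 cycles for 97% hits and 1.99 cycles for 99% hits.
    print(f"{rate:.0%} hits: {average_access_time(rate):.2f} cycles")
```

Going from 97% to 99% hits removes two thirds of the misses, which is why the average time roughly halves.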
Types of Cache Misses
  Cold (compulsory) miss
    Occurs on the first access to a block
    Spatial locality of access helps (as does prefetching, more later)
  Conflict miss
    Multiple data objects all map to the same slot (like in hashing)
    E.g., block i must be placed in cache slot i mod 8, replacing the block already in that slot
    Referencing blocks 0, 8, 0, 8, ... would miss every time
    Conflict misses are less of a problem these days: set-associative caches with 8 or 16 lines per set help
  Capacity miss
    Occurs when the set of active cache blocks (working set) is larger than the cache
    This is where to focus nowadays
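The 0, 8, 0, 8, ... pattern can be reproduced with a toy simulation. This is a minimal sketch, not a model of any real cache: the function name, the LRU replacement policy, and the 4-set, 2-way configuration used for comparison are all assumptions made for illustration.

```python
def count_misses(block_refs, num_sets=8, ways=1):
    """Simulate a tiny cache with LRU replacement.
    Block i maps to set i % num_sets; each set holds up to `ways` blocks."""
    sets = [[] for _ in range(num_sets)]  # resident blocks per set, LRU first
    misses = 0
    for block in block_refs:
        resident = sets[block % num_sets]
        if block in resident:
            resident.remove(block)        # hit: re-appended below as most recent
        else:
            misses += 1
            if len(resident) == ways:
                resident.pop(0)           # evict the least recently used block
        resident.append(block)
    return misses

refs = [0, 8] * 4                                # blocks 0, 8, 0, 8, ...
print(count_misses(refs, num_sets=8, ways=1))    # direct-mapped, 8 slots
print(count_misses(refs, num_sets=4, ways=2))    # 2-way, same total capacity
```

With the direct-mapped layout every reference misses, because blocks 0 and 8 keep evicting each other. With two lines per set, only the two cold misses remain.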
What About Writes?
  Multiple copies of data exist: L1, L2, main memory, disk
  What to do on a write hit?
    Write-back (defer the write to memory until the line is replaced)
      Needs a dirty bit (is the line different from memory or not?)
  What to do on a write miss?
    Write-allocate (load into cache, update line in cache)
  Typical: write-back + write-allocate
  Rare: write-through (write immediately to memory, usually for I/O)
Main Memory Is Something Like a Cache (for Disk)
  Driven by the enormous miss penalty: disk is about 10,000x slower than DRAM
  DRAM design: large page (block) size, typically 4KB
Virtual Memory
  Programs refer to virtual memory addresses
    Conceptually a very large array of bytes (4GB for IA32, 16 exabytes for 64 bits)
    Each byte has its own address
    System provides an address space private to each process
  Allocation: compiler and run-time system
    All allocation happens within a single virtual address space
Virtual Addressing
  MMU = Memory Management Unit
  MMU keeps the mapping of VAs -> PAs in a “page table”
  [Figure: on the CPU chip, the CPU sends a virtual address (VA) to the MMU; the MMU sends the physical address (PA) to main memory (locations 0, 1, 2, ..., 7, ...), which returns the data word.]
MMU Needs Table of Translations
  MMU keeps the mapping of VAs -> PAs in a “page table”
  [Figure: CPU -> virtual address (VA) -> MMU -> physical address (PA) -> main memory (locations 0 through 7, ...), with the page table shown.]
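What the MMU does with the page table can be sketched in a few lines. Everything here is illustrative: the page table contents are made up, a real page table is a hardware-defined structure rather than a dictionary, and the 4KB page size is taken from the page-size figure quoted earlier in the deck.

```python
PAGE_SIZE = 4096  # assumed 4KB pages

# Hypothetical page table: virtual page number -> physical page number.
page_table = {0: 5, 1: 2, 2: 7}

def translate(va):
    """Split the VA into (virtual page number, offset), then swap in
    the physical page number from the page table."""
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault: virtual page {vpn} is not mapped")
    return page_table[vpn] * PAGE_SIZE + offset

# VA 0x1234 is virtual page 1, offset 0x234; page 1 maps to physical page 2.
print(hex(translate(0x1234)))  # 0x2234
```

Only the page number is translated; the offset within the page is carried over unchanged.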
Where Is the Page Table Kept?
  In main memory; it can be cached, e.g., in L2 (like data)
  [Figure: CPU -> virtual address (VA) -> MMU -> physical address (PA) -> main memory, with the page table shown.]
Speeding Up Translation with a TLB
  Translation Lookaside Buffer (TLB)
    Small hardware cache for the page table, located in the MMU
    Caches page table entries for a number of pages (e.g., 256 entries)
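TLB Hit
  A TLB hit saves you from accessing memory for the page table
  [Figure: the CPU sends the VA to the MMU; the MMU looks up the VA in the TLB and gets back the PTE; the MMU then sends the PA to memory, which returns the data to the CPU. The page table is not accessed.]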
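TLB Miss
  A TLB miss incurs an additional memory access (to the page table)
  [Figure: the CPU sends the VA to the MMU; the MMU looks up the VA in the TLB and misses, so it sends a PTE request to the page table in memory; once the PTE comes back, the MMU sends the PA to memory, which returns the data to the CPU.]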
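The hit and miss paths can be mimicked with a small software model. This is a sketch under stated assumptions: the class name `TinyTLB`, the LRU replacement policy, the 4-entry size, and the page table contents are all hypothetical, and real TLBs are hardware structures whose policies vary.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4KB pages

class TinyTLB:
    """LRU cache of page table entries that counts hits and misses."""

    def __init__(self, page_table, entries=4):
        self.page_table = page_table
        self.entries = entries
        self.cache = OrderedDict()  # vpn -> ppn, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, va):
        vpn, offset = divmod(va, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1                      # no page table access needed
            self.cache.move_to_end(vpn)
        else:
            self.misses += 1                    # extra memory access for the PTE
            self.cache[vpn] = self.page_table[vpn]
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)  # evict the LRU entry
        return self.cache[vpn] * PAGE_SIZE + offset

page_table = {vpn: vpn + 100 for vpn in range(16)}  # made-up mapping
tlb = TinyTLB(page_table, entries=4)
for va in range(0, 2 * PAGE_SIZE, 1024):  # 8 accesses spread over 2 pages
    tlb.translate(va)
print(tlb.hits, tlb.misses)  # 6 2
```

Each page costs one miss on first touch; the other accesses to the same page hit in the TLB and skip the page table entirely.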
How to Program for Virtual Memory
  At any point in time, programs tend to access a set of active virtual pages called the working set
    Programs with better temporal locality will have smaller working sets
  If (working set size) > (main memory size)
    Thrashing: performance meltdown where pages are swapped (copied) in and out continuously
  If (# working set pages) > (# TLB entries)
    Will suffer TLB misses
    Not as bad as page thrashing, but still worth avoiding
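One way to see how access order changes the page-level working set is to count the distinct pages touched in a short window of accesses. The sketch below is illustrative only: the array shape, the 8-byte element size, the row-major layout, and the function name are assumptions chosen so that each row occupies exactly one 4KB page.

```python
PAGE_SIZE = 4096
ELEM_SIZE = 8          # assumed 8-byte elements
ROWS, COLS = 64, 512   # assumed shape: one row = 512 * 8 bytes = one page

def pages_touched(order, window):
    """Distinct pages referenced by the first `window` accesses to a
    row-major ROWS x COLS array, traversed in "row" or "col" order."""
    if order == "row":
        indices = ((r, c) for r in range(ROWS) for c in range(COLS))
    else:
        indices = ((r, c) for c in range(COLS) for r in range(ROWS))
    pages = set()
    for n, (r, c) in enumerate(indices):
        if n == window:
            break
        byte_offset = (r * COLS + c) * ELEM_SIZE
        pages.add(byte_offset // PAGE_SIZE)
    return len(pages)

print(pages_touched("row", 64))  # 1: all 64 accesses stay on one page
print(pages_touched("col", 64))  # 64: every access lands on a new page
```

The same 64 accesses need one TLB entry in row order and 64 in column order, which is the difference between a small working set and one that may overflow the TLB.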
More on TLBs
  Assume a 256-entry TLB and 4KB pages
    Can only have TLB hits for 1MB of data (256 * 4KB = 1MB)
    This is called the “TLB reach”: the amount of memory the TLB can cover
  A typical L2 cache is 6MB
    Hence should we consider TLB size before L2 size when tiling?
  Real CPUs have second-level TLBs (like an L2 for the TLB)
    This is getting complicated to reason about!
    Likely have to experiment to find the best tile size
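The TLB reach calculation is simple enough to write down directly. This sketch uses only the numbers from the slide (256 entries, 4KB pages, 6MB L2); the function and variable names are illustrative, and it ignores second-level TLBs entirely.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_size):
    """Bytes of memory the TLB can cover at once."""
    return entries * page_size

reach = tlb_reach(entries=256, page_size=4 * KB)
l2_size = 6 * MB

print(f"TLB reach: {reach // MB} MB, L2 size: {l2_size // MB} MB")
if reach < l2_size:
    print("TLB reach is the tighter limit")
else:
    print("L2 size is the tighter limit")
```

With these numbers the TLB covers only a sixth of the L2 cache, so a tile sized to fill L2 could still suffer TLB misses.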
Memory Optimization: Summary
  Caches
    Conflict misses: not much of a concern (set-associative caches)
    Cache capacity: keep the working set within on-chip cache capacity
      Fit in L1 or L2, depending on working-set size
  Virtual memory
    Page misses: keep the page-level working set within main memory capacity
    TLB misses: may want to keep the working set's # pages < # TLB entries
IA32 Linux Memory Layout
  Stack
    Runtime stack (8MB limit)
  Data
    Statically allocated data
    E.g., arrays and strings declared in code
  Heap
    Dynamically allocated storage
    Used when calling malloc(), calloc(), new
  Text
    Executable machine instructions
    Read-only