Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Multilevel On-Chip Caches
Supporting Multiple Issue
- Both the Cortex-A53 and the Core i7 have multi-banked caches that allow multiple accesses per cycle, assuming no bank conflicts
- Cortex-A53 and Core i7 cache optimizations:
  - Return the requested word first
  - Non-blocking cache
    - Hit under miss: allows additional cache hits during a miss, hiding some of the miss latency with other work
    - Miss under miss: allows multiple outstanding cache misses, overlapping the latency of two different misses
  - Data prefetching: look for a pattern in data misses and predict the next address, so the data can be fetched before the miss occurs
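The prefetching on these chips is done by hardware that watches the miss stream, but the same idea can be sketched in software. The fragment below is a minimal illustration, assuming a GCC/Clang toolchain (it uses the real __builtin_prefetch builtin); the function name sum_with_prefetch and the PREFETCH_DISTANCE tuning value are hypothetical.

#include <stddef.h>

/* Illustration of the prefetching idea in software: hint that a cache
 * line will be needed soon, so its fetch overlaps with useful work
 * instead of stalling on a miss when the data is finally touched.
 * PREFETCH_DISTANCE (how far ahead to fetch) is a tuning parameter
 * chosen here purely for illustration. */
#define PREFETCH_DISTANCE 16

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1); /* 0 = read, 1 = low temporal locality */
        sum += a[i];
    }
    return sum;
}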
Performance of the Cortex-A53 Memory Hierarchies
- 32 KiB two-way set-associative L1 instruction cache
- 32 KiB four-way set-associative L1 data cache
- 1 MiB 16-way set-associative L2 cache
- Running the integer SPEC2006 benchmarks
Performance of the Cortex-A53 Memory Hierarchies
- The L1 miss penalty for a 1 GHz Cortex-A53 is 12 clock cycles, while the L2 miss penalty is 124 clock cycles
- Even low miss rates, multiplied by these high miss penalties, account for a significant fraction of the CPI for 5 of the 12 SPEC2006 programs
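A rough sketch of how those penalties feed into CPI. The miss penalties (12 and 124 cycles) are the ones quoted above; the per-instruction miss rates used below are hypothetical example values, not measured Cortex-A53 numbers.

#include <stdio.h>

/* Memory-stall cycles per instruction for a two-level hierarchy.
 * Penalties come from the slide; the miss rates are illustrative. */
int main(void)
{
    double l1_misses_per_instr = 0.02;   /* hypothetical: 2% of instructions miss in L1 */
    double l2_misses_per_instr = 0.005;  /* hypothetical: 0.5% also miss in L2 */
    double l1_miss_penalty = 12.0;       /* cycles, from the slide */
    double l2_miss_penalty = 124.0;      /* cycles, from the slide */

    double stall_cpi = l1_misses_per_instr * l1_miss_penalty
                     + l2_misses_per_instr * l2_miss_penalty;

    /* 0.02*12 + 0.005*124 = 0.24 + 0.62 = 0.86 stall cycles per instruction:
     * a large CPI contribution even though both miss rates look "low". */
    printf("Memory-stall CPI contribution: %.2f\n", stall_cpi);
    return 0;
}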
Performance of the Core i7 Memory Hierarchies
- L1 instruction cache miss rate: 0.1% to 1.8%, average 0.4%
- L1 data cache miss rate: 5% to 10%, sometimes higher
- L2 average data miss rate: 4%
- L3 average data miss rate: 1%
DGEMM
- Combine cache blocking, subword parallelism, and instruction-level parallelism
- Blocking improves performance over the unrolled AVX code by factors of 2 to 2.5 for the larger matrices
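A minimal sketch of the cache-blocking part of this optimization in plain C, without the AVX subword parallelism or loop unrolling; it assumes column-major storage and a matrix dimension n that is a multiple of BLOCKSIZE, and the value 32 is only an illustrative choice.

/* Cache-blocked DGEMM sketch: operate on BLOCKSIZE x BLOCKSIZE
 * submatrices so the three blocks of A, B, and C being used fit in
 * the cache, improving temporal locality. */
#define BLOCKSIZE 32

static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];      /* column-major indexing */
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

/* n is assumed to be a multiple of BLOCKSIZE for simplicity. */
void dgemm_blocked(int n, double *A, double *B, double *C)
{
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

In the full version, the innermost work would additionally use AVX vectors (subword parallelism) and unrolled loops (instruction-level parallelism), which is what the factor of 2 to 2.5 is measured against.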
Pitfalls
- Ignoring memory system effects when writing or generating code
  - Example: iterating over rows vs. columns of arrays
  - Large strides result in poor locality
- Byte vs. word addressing
  - Example: 32-byte direct-mapped cache with 4-byte blocks, so 8 blocks
    - Byte 36 maps to block 1, since byte address 36 is block address 36 / 4 = 9 and 9 modulo 8 = 1
    - Word 36 maps to block 4, since 36 modulo 8 = 4
  - Example: cache with 256 bytes and a block size of 32 bytes. Into which block does byte address 300 fall?
    - Byte address 300 is block address ⌊300 / 32⌋ = 9
    - The number of blocks in the cache is 256 / 32 = 8
    - Block address 9 falls into cache block number 9 modulo 8 = 1
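The block-mapping arithmetic used in both examples can be written as a small C sketch; the function name and parameters below are illustrative.

#include <stdio.h>

/* Which cache block a byte address maps to in a direct-mapped cache:
 * block address = byte address / block size, then modulo the number
 * of blocks in the cache. */
unsigned cache_block(unsigned byte_addr, unsigned block_size, unsigned num_blocks)
{
    unsigned block_addr = byte_addr / block_size;
    return block_addr % num_blocks;
}

int main(void)
{
    /* 32-byte direct-mapped cache, 4-byte blocks (8 blocks): byte 36 -> block 1 */
    printf("byte 36  -> block %u\n", cache_block(36, 4, 32 / 4));
    /* 256-byte cache, 32-byte blocks (8 blocks): byte 300 -> block 1 */
    printf("byte 300 -> block %u\n", cache_block(300, 32, 256 / 32));
    return 0;
}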
Pitfalls
- In a multiprocessor with a shared L2 or L3 cache
  - Less associativity than the number of cores sharing the cache results in conflict misses
  - More cores require increased associativity
- Using AMAT (Average Memory Access Time) to evaluate the performance of out-of-order processors
  - AMAT ignores the effect of non-blocked accesses, i.e., misses whose latency is overlapped with other work
  - Instead, evaluate performance by simulation
Pitfalls
- Extending the address range using segments
  - E.g., Intel 80286
  - But a segment is not always big enough
  - Makes address arithmetic complicated
- Implementing a VMM on an ISA not designed for virtualization
  - E.g., non-privileged instructions accessing hardware resources
  - Either extend the ISA, or require the guest OS not to use the problematic instructions
Fallacies
- Disk failure rates in the field match their specifications
  - A study of 100,000 disks with a quoted MTTF of 1,000,000 to 1,500,000 hours (an AFR of 0.6% to 0.8%) found AFRs of 2% to 4%, often 3 to 5 times higher than the specified rates
  - A study of more than 100,000 disks at Google with a quoted AFR of 1.5% found failure rates of 1.7% for drives in their first year, rising to 8.6% for drives in their third year, about 5 to 6 times the declared rate
- Operating systems are the best place to schedule disk accesses
  - The OS sorts requests by LBA into increasing order to improve performance
  - But the disk knows the actual mapping of logical addresses onto the physical geometry of sectors, tracks, and surfaces, so it can reduce rotational and seek latencies by rescheduling the requests itself
Concluding Remarks
- Fast memories are small, large memories are slow
  - We really want fast, large memories
  - Caching gives this illusion
- Principle of locality
  - Programs use a small part of their memory space frequently
- Memory hierarchy
  - L1 cache, L2 cache, …, DRAM memory, disk
- Multilevel caches make it easier to apply more cache optimizations
- Memory system design is critical for multiprocessors
- Compiler enhancements, such as restructuring the loops that access arrays, can substantially improve locality and cache performance
- Prefetching: a block of data is brought into the cache before it is actually referenced