Chapter 5 — Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers
Average Access Time
- Hit time is also important for performance.
- Average memory access time (AMAT):
  AMAT = Hit time + Miss rate × Miss penalty
- Example: CPU with 1 ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  AMAT = 1 + 0.05 × 20 = 2 cycles = 2 ns per instruction fetch
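As a quick cross-check of this arithmetic, here is a minimal C sketch (the function name amat_cycles is illustrative, not from the slides):

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty (all in cycles) */
    static double amat_cycles(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Values from the example: 1-cycle hit, 5% miss rate, 20-cycle penalty */
        double cycles = amat_cycles(1.0, 0.05, 20.0);
        printf("AMAT = %.1f cycles = %.1f ns at a 1 ns clock\n", cycles, cycles * 1.0);
        return 0;
    }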
Performance Summary
- As CPU performance increases, the miss penalty becomes more significant.
- Decreasing the base CPI: a greater proportion of time is spent on memory stalls.
- Increasing the clock rate: memory stalls account for more CPU cycles.
- Cache behavior cannot be neglected when evaluating system performance.
Associative Caches
- Fully associative
  - Allows a given block to go in any cache entry
  - Requires all entries to be searched at once
  - One comparator per entry (expensive hardware cost)
- n-way set associative
  - Each set contains n entries
  - Block number determines the set: (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)
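The set-mapping rule is simple enough to state in code. A minimal sketch (names are illustrative, not from the slides):

    #include <stdio.h>

    /* A block may be placed in any way of exactly one set:
       set = (block number) modulo (number of sets). */
    static unsigned set_index(unsigned block_number, unsigned num_sets)
    {
        return block_number % num_sets;
    }

    int main(void)
    {
        /* Illustrative: 8 blocks arranged as 4 sets of 2 ways each */
        unsigned num_sets = 4;
        for (unsigned block = 0; block < 12; block++)
            printf("block %2u -> set %u\n", block, set_index(block, num_sets));
        return 0;
    }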
Associative Cache Example
[Figure: example placements of a block in direct-mapped, set-associative, and fully associative caches]
Spectrum of Associativity
- For a cache with 8 entries: direct mapped (8 sets of 1 way), 2-way (4 sets), 4-way (2 sets), fully associative (1 set of 8 ways)
- Higher associativity
  - Advantage: decreases miss rate
  - Disadvantage: increases hit time
Associativity Example
- Compare 4-block caches: direct mapped, 2-way set associative, fully associative
- Block access sequence: 0, 8, 0, 6, 8
- Direct mapped: block 0 maps to (0 modulo 4) = 0, block 6 to (6 modulo 4) = 2, block 8 to (8 modulo 4) = 0

  Block address | Cache index | Hit/miss | Index 0 | Index 1 | Index 2 | Index 3
  0             | 0           | miss     | Mem[0]  |         |         |
  8             | 0           | miss     | Mem[8]  |         |         |
  0             | 0           | miss     | Mem[0]  |         |         |
  6             | 2           | miss     | Mem[0]  |         | Mem[6]  |
  8             | 0           | miss     | Mem[8]  |         | Mem[6]  |

  Result: 5 misses out of 5 accesses.
Associativity Example (cont.)
- 2-way set associative: blocks 0, 6, and 8 all map to set (block address modulo 2) = 0

  Block address | Set | Hit/miss | Set 0 contents  | Set 1 contents
  0             | 0   | miss     | Mem[0]          |
  8             | 0   | miss     | Mem[0], Mem[8]  |
  0             | 0   | hit      | Mem[0], Mem[8]  |
  6             | 0   | miss     | Mem[0], Mem[6]  |
  8             | 0   | miss     | Mem[8], Mem[6]  |

  Result: 4 misses (LRU evicts Mem[8], then Mem[0]).

- Fully associative

  Block address | Hit/miss | Cache content after access
  0             | miss     | Mem[0]
  8             | miss     | Mem[0], Mem[8]
  0             | hit      | Mem[0], Mem[8]
  6             | miss     | Mem[0], Mem[8], Mem[6]
  8             | hit      | Mem[0], Mem[8], Mem[6]

  Result: 3 misses.

- With an 8-block cache there would be no replacements in the 2-way set-associative cache.
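These tables are easy to verify with a small simulation. A sketch (illustrative, not from the slides) that replays the sequence against the 4-block direct-mapped cache:

    #include <stdio.h>

    /* Replay the access sequence 0, 8, 0, 6, 8 against a 4-block
       direct-mapped cache: index = (block address) modulo 4. */
    int main(void)
    {
        int tags[4];                      /* block address held by each line */
        int valid[4] = {0, 0, 0, 0};
        int seq[] = {0, 8, 0, 6, 8};
        int misses = 0;

        for (int i = 0; i < 5; i++) {
            int blk = seq[i];
            int idx = blk % 4;
            if (valid[idx] && tags[idx] == blk) {
                printf("block %d -> index %d: hit\n", blk, idx);
            } else {
                printf("block %d -> index %d: miss\n", blk, idx);
                valid[idx] = 1;
                tags[idx] = blk;
                misses++;
            }
        }
        printf("%d misses out of 5 accesses\n", misses);
        return 0;
    }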
How Much Associativity
- Increased associativity decreases miss rate, but with diminishing returns
- Simulation of a system with a 64 KiB D-cache, 16-word blocks, SPEC2000:
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%
Four-way Set Associative Cache Organization
[Figure: four-way set-associative cache organization]
Replacement Policy
- Direct mapped: no choice
- Set associative
  - Prefer a non-valid entry, if there is one
  - Otherwise, choose among the entries in the set
- Least-recently used (LRU)
  - Choose the entry unused for the longest time
  - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity
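For 2-way sets, LRU needs only one bit per set. A minimal sketch (illustrative, not from the slides) that also reproduces the 4 misses of the earlier 2-way example:

    #include <stdio.h>

    /* One LRU bit per set suffices for 2-way set associativity:
       each access marks the other way as the next victim. */
    struct set2 {
        int tag[2];
        int valid[2];
        int lru;                 /* index of the least-recently-used way */
    };

    /* Returns 1 on hit, 0 on miss (filling a free way or evicting LRU). */
    static int access_set(struct set2 *s, int tag)
    {
        for (int w = 0; w < 2; w++) {
            if (s->valid[w] && s->tag[w] == tag) {
                s->lru = 1 - w;  /* the other way becomes LRU */
                return 1;
            }
        }
        /* Prefer a non-valid way; otherwise evict the LRU way. */
        int victim = !s->valid[0] ? 0 : (!s->valid[1] ? 1 : s->lru);
        s->tag[victim] = tag;
        s->valid[victim] = 1;
        s->lru = 1 - victim;
        return 0;
    }

    int main(void)
    {
        struct set2 s = {{0, 0}, {0, 0}, 0};
        int seq[] = {0, 8, 0, 6, 8};   /* all map to the same set */
        for (int i = 0; i < 5; i++)
            printf("block %d: %s\n", seq[i],
                   access_set(&s, seq[i]) ? "hit" : "miss");
        return 0;
    }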
Size of Tags versus Associativity
- Cache = 4096 blocks, block size = 4 words, 64-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative.
- Block size = 4 words = 16 bytes = 2^4 bytes, so index + tag = 64 − 4 = 60 bits
- Direct mapped: #sets = #blocks = 4096 = 2^12, so 12 index bits; tag = 60 − 12 = 48 bits; total tag bits = 48 × 4096 ≈ 197K bits
- Two-way set associative: #sets = 2048 = 2^11; tag = 60 − 11 = 49 bits; total tag bits = 49 × 2 × 2048 ≈ 201K bits
- Four-way set associative: #sets = 1024 = 2^10; tag = 60 − 10 = 50 bits; total tag bits = 50 × 4 × 1024 ≈ 205K bits
- Fully associative: 1 set of 4096 blocks, no index bits; tag = 60 bits; total tag bits = 60 × 4096 × 1 ≈ 246K bits
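These totals can be checked mechanically. A C sketch (illustrative, not from the slides) that derives sets, tag width, and total tag bits for each organization:

    #include <stdio.h>

    /* Tag bits per block = 64 - byte-offset bits - index bits,
       for a 4096-block cache with 16-byte blocks (4-bit offset). */
    int main(void)
    {
        int blocks = 4096, offset_bits = 4;
        int ways[] = {1, 2, 4, 4096};   /* direct, 2-way, 4-way, fully assoc. */

        for (int i = 0; i < 4; i++) {
            int sets = blocks / ways[i];
            int index_bits = 0;
            while ((1 << index_bits) < sets)
                index_bits++;
            int tag_bits = 64 - offset_bits - index_bits;
            printf("%4d-way: %4d sets, %2d-bit tag, %6d total tag bits\n",
                   ways[i], sets, tag_bits, tag_bits * blocks);
        }
        return 0;
    }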
Multilevel Caches
- Primary (L-1) cache attached to the CPU: small, but fast
- Level-2 cache services misses from the primary cache: larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include an L-3 cache
Multilevel Cache Example
- Given: CPU base CPI = 1, clock rate = 4 GHz; miss rate/instruction in primary cache = 2%; main memory access time = 100 ns
- With just the primary cache:
  - Miss penalty = 100 ns / 0.25 ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9
Example (cont.)
- Now add an L-2 cache: access time = 5 ns, global miss rate to main memory = 0.5%
- Primary miss with L-2 hit: penalty = 5 ns / 0.25 ns = 20 cycles
- Primary miss with L-2 miss: extra penalty = 400 cycles (the main memory access)
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9 / 3.4 = 2.6
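The same CPI accounting in a short C sketch (illustrative, not from the slides):

    #include <stdio.h>

    /* Effective CPI = base CPI + (misses/instruction x penalty), per level.
       Values from the example: 4 GHz clock (0.25 ns/cycle), base CPI = 1. */
    int main(void)
    {
        double base_cpi = 1.0;
        double l1_miss_rate = 0.02, l2_global_miss_rate = 0.005;
        double l2_hit_penalty = 5.0 / 0.25;    /* 20 cycles  */
        double mem_penalty = 100.0 / 0.25;     /* 400 cycles */

        double cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty;
        double cpi_with_l2 = base_cpi + l1_miss_rate * l2_hit_penalty
                                      + l2_global_miss_rate * mem_penalty;

        printf("L1 only: CPI = %.1f\n", cpi_l1_only);   /* 9.0 */
        printf("L1 + L2: CPI = %.1f\n", cpi_with_l2);   /* 3.4 */
        printf("performance ratio = %.1f\n", cpi_l1_only / cpi_with_l2);
        return 0;
    }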
Multilevel Cache Considerations
- Primary cache: focus on minimal hit time
- L-2 cache: focus on low miss rate to avoid main memory accesses; hit time has less overall impact
- Results
  - L-1 cache is usually smaller than a single-level cache, with a smaller block size than L-2
  - L-2 cache is larger than a single-level cache and consequently has a larger block size
  - L-2 has higher associativity than L-1 to reduce miss rate
Interactions with Advanced CPUs
- Out-of-order CPUs can execute instructions during a cache miss
  - A pending store stays in the load/store unit
  - Dependent instructions wait in reservation stations
  - Independent instructions continue
- The effect of a miss depends on program data flow
  - Much harder to analyze; use system simulation
Interactions with Software
- Misses depend on memory access patterns
  - Algorithm behavior: radix sort performs fewer operations, yet quicksort consistently has many fewer misses per item sorted
  - Compiler optimization for memory access also matters
Software Optimization via Blocking
- Goal: maximize accesses to data loaded into the cache before it is replaced
- Improve temporal locality to reduce cache misses
- Blocked algorithms operate on submatrices, or blocks
- Consider the inner loops of DGEMM (for a fixed row i; matrices are stored column-major, so element (i, j) is at offset i + j*n):

    for (int j = 0; j < n; ++j)
    {
        double cij = C[i+j*n];             /* cij = C[i][j] */
        for (int k = 0; k < n; k++)
            cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k] * B[k][j] */
        C[i+j*n] = cij;                    /* C[i][j] = cij */
    }
DGEMM Access Pattern (C, A, and B arrays)
- The inner loops read all N×N elements of B, read the same N elements of one row of A repeatedly, and write one row of N elements of C.
[Figure: access pattern of C, A, and B, shading older versus newer accesses]
Cache Blocked DGEMM

    #define BLOCKSIZE 32

    /* Compute one BLOCKSIZE x BLOCKSIZE block of C */
    void do_block (int n, int si, int sj, int sk,
                   double *A, double *B, double *C)
    {
        for (int i = si; i < si+BLOCKSIZE; ++i)
            for (int j = sj; j < sj+BLOCKSIZE; ++j)
            {
                double cij = C[i+j*n];             /* cij = C[i][j] */
                for (int k = sk; k < sk+BLOCKSIZE; k++)
                    cij += A[i+k*n] * B[k+j*n];    /* cij += A[i][k] * B[k][j] */
                C[i+j*n] = cij;                    /* C[i][j] = cij */
            }
    }

    void dgemm (int n, double* A, double* B, double* C)
    {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }

Blocking exploits a combination of spatial and temporal locality, since A benefits from spatial locality and B benefits from temporal locality.
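A possible driver (illustrative, not from the slides). Note that do_block never checks bounds, so n must be a multiple of BLOCKSIZE, and dgemm accumulates into C, so C should hold the desired starting values:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 960;   /* 960 = 30 x BLOCKSIZE */
        double *A = malloc(n * n * sizeof(double));
        double *B = malloc(n * n * sizeof(double));
        double *C = calloc(n * n, sizeof(double));   /* start C at zero */

        for (int i = 0; i < n * n; i++) {
            A[i] = 1.0;
            B[i] = 2.0;
        }
        dgemm(n, A, B, C);
        printf("C[0] = %g (expected %g)\n", C[0], 2.0 * n);

        free(A); free(B); free(C);
        return 0;
    }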
Blocked DGEMM Access Pattern
- Unoptimized performance is halved for the largest matrix, while the cache-blocked version is less than 10% slower even for matrices of size 960×960.
[Figure: access patterns of the unoptimized and blocked versions]
Dependability
- Assumption: a specification of proper service exists. Users then see the system alternating between two states of delivered service with respect to that specification:
  - Service accomplishment: service delivered as specified
  - Service interruption: deviation from the specified service
  - A failure moves the system from accomplishment to interruption; a restoration moves it back
- Fault: failure of a component; may or may not lead to system failure
- Failures can be permanent or intermittent. Intermittent failures are the more difficult case: it is harder to diagnose the problem when a system oscillates between the two states. Permanent failures are far easier to diagnose.
Dependability Measures
- Reliability measure: mean time to failure (MTTF)
- Annual failure rate (AFR): the percentage of devices expected to fail in a year for a given MTTF. When MTTF gets large it can be misleading; AFR gives better intuition.
- Example: disks with a 1,000,000-hour MTTF
  - 1,000,000 hours = 1,000,000 / (365 × 24) = 114 years, so a single disk almost never fails
  - Warehouse-scale computer: 50,000 servers, each with 2 disks, i.e., 100,000 disks
  - AFR = (365 × 24) / 1,000,000 = 0.876%
  - 100,000 disks × 0.876% ≈ 876 disk failures per year, or on average more than 2 disk failures per day
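The same AFR arithmetic as a C sketch (illustrative, not from the slides):

    #include <stdio.h>

    /* AFR = hours per year / MTTF; expected annual failures = AFR x devices. */
    int main(void)
    {
        double mttf_hours = 1000000.0;
        double hours_per_year = 365.0 * 24.0;
        double afr = hours_per_year / mttf_hours;   /* 0.876% */
        double disks = 100000.0;                    /* 50,000 servers x 2 disks */

        printf("AFR = %.3f%%\n", afr * 100.0);
        printf("expected failures/year = %.0f (%.1f per day)\n",
               afr * disks, afr * disks / 365.0);
        return 0;
    }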
Dependability Measures (cont.)
- Service interruption is measured as mean time to repair (MTTR)
- Mean time between failures: MTBF = MTTF + MTTR
- Availability measures service accomplishment with respect to the alternation between the two states of accomplishment and interruption:
  Availability = MTTF / (MTTF + MTTR)
- Improving availability
  - Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  - Reduce MTTR: improved tools and processes for diagnosis and repair
- A common shorthand quotes the number of "nines of availability" per year; very good Internet service today offers 4 or 5 nines
  - One nine: 90% => 36.5 days of repair/year
  - Two nines: 99% => 3.65 days of repair/year
  - Three nines: 99.9% => 526 minutes of repair/year
  - Four nines: 99.99% => 52.6 minutes of repair/year
  - Five nines: 99.999% => 5.26 minutes of repair/year
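The downtime implied by each level follows from availability = 1 − 10^(−nines). A C sketch (illustrative, not from the slides; compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* n nines of availability means availability = 1 - 10^-n;
       downtime/year = (1 - availability) x minutes per year. */
    int main(void)
    {
        double minutes_per_year = 365.0 * 24.0 * 60.0;   /* 525,600 */
        for (int n = 1; n <= 5; n++) {
            double unavail = pow(10.0, -n);
            printf("%d nine(s): %.3f%% => %.2f minutes of repair/year\n",
                   n, (1.0 - unavail) * 100.0, unavail * minutes_per_year);
        }
        return 0;
    }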