Chapter 5: Large and Fast: Exploiting Memory Hierarchy
Morgan Kaufmann Publishers, 12 September 2018

Average Access Time
- Hit time is also important for performance
- Average memory access time (AMAT):
  AMAT = Hit time + Miss rate × Miss penalty
- Example: CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
  AMAT = 1 + 0.05 × 20 = 2ns (2 cycles per instruction)
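A minimal sketch of the AMAT formula in C, using the numbers from this slide (the function name amat is ours, for illustration only):

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty (times in cycles) */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Slide values: 1-cycle hit, 5% miss rate, 20-cycle penalty */
        double cycles = amat(1.0, 0.05, 20.0);
        printf("AMAT = %.1f cycles = %.1f ns at a 1ns clock\n",
               cycles, cycles * 1.0);
        return 0;
    }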

Performance Summary
- As CPU performance increases, the miss penalty becomes more significant
- Decreasing the base CPI: a greater proportion of time is spent on memory stalls
- Increasing the clock rate: memory stalls account for more CPU cycles
- Can't neglect cache behavior when evaluating system performance

Associative Caches
- Fully associative
  - Allow a given block to go in any cache entry
  - Requires all entries to be searched at once
  - Comparator per entry (expensive hardware cost)
- n-way set associative
  - Each set contains n entries
  - Block number determines the set: (Block number) modulo (#Sets in cache)
  - Search all entries in a given set at once
  - n comparators (less expensive)
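A minimal sketch of the set-selection rule in C (the helper name set_index is ours; the printed values anticipate the 4-block example below):

    #include <stdio.h>

    /* Set selection: (block number) modulo (#sets) */
    static unsigned set_index(unsigned block_number, unsigned num_sets)
    {
        return block_number % num_sets;
    }

    int main(void)
    {
        /* With a 4-block direct-mapped cache (4 sets of 1 entry): */
        printf("block 0 -> set %u\n", set_index(0, 4));  /* 0 */
        printf("block 6 -> set %u\n", set_index(6, 4));  /* 2 */
        printf("block 8 -> set %u\n", set_index(8, 4));  /* 0 */
        return 0;
    }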

Associative Cache Example
(figure only)

Spectrum of Associativity
(figure: configurations for a cache with 8 entries, from direct mapped to fully associative)
- Higher associativity
  - Advantage: decreases miss rate
  - Disadvantage: increases hit time

Associativity Example
- Compare 4-block caches: direct mapped, 2-way set associative, fully associative
- Block access sequence: 0, 8, 0, 6, 8

Direct mapped:
- Block 0 maps to (0 modulo 4) = 0
- Block 6 maps to (6 modulo 4) = 2
- Block 8 maps to (8 modulo 4) = 0

Block address | Cache index | Hit/miss | Index 0 | Index 1 | Index 2 | Index 3
0             | 0           | miss     | Mem[0]  |         |         |
8             | 0           | miss     | Mem[8]  |         |         |
0             | 0           | miss     | Mem[0]  |         |         |
6             | 2           | miss     | Mem[0]  |         | Mem[6]  |
8             | 0           | miss     | Mem[8]  |         | Mem[6]  |

(5 misses)

Associativity Example (cont.)

2-way set associative (LRU replacement):

Block address | Cache index | Hit/miss | Set 0           | Set 1
0             | 0           | miss     | Mem[0]          |
8             | 0           | miss     | Mem[0], Mem[8]  |
0             | 0           | hit      | Mem[0], Mem[8]  |
6             | 0           | miss     | Mem[0], Mem[6]  |
8             | 0           | miss     | Mem[8], Mem[6]  |

(4 misses, 1 hit)

Fully associative:

Block address | Hit/miss | Cache content after access
0             | miss     | Mem[0]
8             | miss     | Mem[0], Mem[8]
0             | hit      | Mem[0], Mem[8]
6             | miss     | Mem[0], Mem[8], Mem[6]
8             | hit      | Mem[0], Mem[8], Mem[6]

(3 misses, 2 hits)

- With an 8-block cache there would be no replacements in the 2-way set-associative cache
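The tables above can be verified with a small trace simulator. The following sketch is ours (names and structure are illustrative, not from the text); it models a 4-block cache with LRU replacement at each associativity:

    #include <stdio.h>

    #define NUM_BLOCKS 4   /* total cache blocks in the example */

    /* Run the block-address trace through a 4-block cache with LRU
       replacement. assoc = 1 (direct mapped), 2 (2-way set assoc.),
       or 4 (fully associative). Returns the number of misses. */
    static int simulate(int assoc, const int *trace, int n)
    {
        int sets = NUM_BLOCKS / assoc;
        int tag[NUM_BLOCKS], valid[NUM_BLOCKS] = {0}, last_use[NUM_BLOCKS] = {0};
        int misses = 0;

        for (int t = 0; t < n; t++) {
            int base = (trace[t] % sets) * assoc;
            int way = -1;

            /* search all entries in the set for a hit */
            for (int w = 0; w < assoc; w++)
                if (valid[base + w] && tag[base + w] == trace[t]) way = w;

            if (way < 0) {                 /* miss: pick a victim */
                misses++;
                way = 0;
                for (int w = 0; w < assoc; w++) {
                    if (!valid[base + w]) { way = w; break; } /* prefer non-valid */
                    if (last_use[base + w] < last_use[base + way]) way = w; /* else LRU */
                }
                tag[base + way] = trace[t];
                valid[base + way] = 1;
            }
            last_use[base + way] = t + 1;  /* mark as most recently used */
        }
        return misses;
    }

    int main(void)
    {
        int trace[] = {0, 8, 0, 6, 8};
        printf("direct mapped:     %d misses\n", simulate(1, trace, 5)); /* 5 */
        printf("2-way set assoc.:  %d misses\n", simulate(2, trace, 5)); /* 4 */
        printf("fully associative: %d misses\n", simulate(4, trace, 5)); /* 3 */
        return 0;
    }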

How Much Associativity
- Increased associativity decreases miss rate, but with diminishing returns
- Simulation of a system with 64KiB D-cache, 16-word blocks, SPEC2000:
  - 1-way: 10.3%
  - 2-way: 8.6%
  - 4-way: 8.3%
  - 8-way: 8.1%

Four-way Set Associative Cache Organization
(figure only)

Replacement Policy
- Direct mapped: no choice
- Set associative
  - Prefer a non-valid entry, if there is one
  - Otherwise, choose among entries in the set
- Least-recently used (LRU)
  - Choose the one unused for the longest time
  - Simple for 2-way, manageable for 4-way, too hard beyond that
- Random
  - Gives approximately the same performance as LRU for high associativity
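For 2-way, LRU needs only one bit per set. A sketch of the update rule in C (type and function names are ours, not from the text):

    /* 2-way LRU with one bit per set: lru_bit records the way that is
       least recently used, i.e. the way to evict on the next miss. */
    typedef struct { unsigned char lru_bit; } set_state;

    static void touch(set_state *s, int way_used)
    {
        s->lru_bit = (unsigned char)(1 - way_used); /* other way is now LRU */
    }

    static int victim(const set_state *s)
    {
        return s->lru_bit;  /* evict the least recently used way */
    }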

Size of Tags versus Associativity
Cache = 4096 blocks, block size = 4 words, 64-bit address. Find the total number of sets and the total number of tag bits for caches that are direct mapped, two-way and four-way set associative, and fully associative.
- Block size = 4 words = 16 bytes = 2^4 bytes, so index + tag bits = 64 − 4 = 60 bits
- Direct mapped: #sets = #blocks = 4096 = 2^12; 12 index bits; tag bits = 60 − 12 = 48; total tag bits = 48 × 4096 ≈ 197K bits
- Two-way set associative: #sets = 2048; tag bits = 60 − 11 = 49; total tag bits = 49 × 2 × 2048 ≈ 201K bits
- Four-way set associative: #sets = 1024; tag bits = 60 − 10 = 50; total tag bits = 50 × 4 × 1024 ≈ 205K bits
- Fully associative: #sets = 1 with 4096 blocks; tag bits = 60; total tag bits = 60 × 4096 × 1 ≈ 246K bits
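The same totals can be recomputed mechanically. A minimal C sketch (the helper log2i and all names are ours; the fully associative case is treated as 4096-way):

    #include <stdio.h>

    /* integer log2 for power-of-two inputs */
    static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void)
    {
        const int blocks = 4096;
        const int addr_bits = 60;      /* 64-bit address minus 4 offset bits */
        int assoc[] = {1, 2, 4, 4096}; /* direct mapped .. fully associative */

        for (int i = 0; i < 4; i++) {
            int sets = blocks / assoc[i];
            int tag_bits = addr_bits - log2i(sets);
            printf("%5d-way: %5d sets, %2d tag bits, %d total tag bits\n",
                   assoc[i], sets, tag_bits, tag_bits * blocks);
        }
        return 0;
    }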

Multilevel Caches
- Primary cache attached to CPU: small, but fast
- Level-2 cache services misses from the primary cache: larger, slower, but still faster than main memory
- Main memory services L-2 cache misses
- Some high-end systems include an L-3 cache

Multilevel Cache Example
- Given
  - CPU base CPI = 1, clock rate = 4GHz
  - Miss rate/instruction in primary cache = 2%
  - Main memory access time = 100ns
- With just primary cache
  - Miss penalty = 100ns / 0.25ns = 400 cycles
  - Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.)
- Now add L-2 cache
  - Access time = 5ns
  - Global miss rate to main memory = 0.5%
- Primary miss with L-2 hit: penalty = 5ns / 0.25ns = 20 cycles
- Primary miss with L-2 miss: extra penalty = 400 cycles
- CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
- Performance ratio = 9 / 3.4 = 2.6
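A small C sketch recomputing both CPIs from the slide's parameters (all variable names are ours):

    #include <stdio.h>

    int main(void)
    {
        double base_cpi = 1.0;
        double cycle_ns = 0.25;           /* 4 GHz clock */
        double l1_miss_rate = 0.02;       /* misses per instruction */
        double mem_ns = 100.0, l2_ns = 5.0;
        double global_miss_rate = 0.005;  /* L1 and L2 both miss */

        double mem_penalty = mem_ns / cycle_ns; /* 400 cycles */
        double l2_penalty = l2_ns / cycle_ns;   /*  20 cycles */

        double cpi_l1_only = base_cpi + l1_miss_rate * mem_penalty;
        double cpi_with_l2 = base_cpi + l1_miss_rate * l2_penalty
                                      + global_miss_rate * mem_penalty;

        printf("L1 only: CPI = %.1f\n", cpi_l1_only);          /* 9.0 */
        printf("With L2: CPI = %.1f\n", cpi_with_l2);          /* 3.4 */
        printf("Speedup = %.1f\n", cpi_l1_only / cpi_with_l2); /* 2.6 */
        return 0;
    }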

Multilevel Cache Considerations
- Primary cache: focus on minimal hit time
- L-2 cache
  - Focus on low miss rate to avoid main memory access
  - Hit time has less overall impact
- Results
  - L-1 cache is usually smaller than a single-level cache, and L-1 block size is smaller than L-2 block size
  - L-2 cache is larger than a single-level cache and consequently has a larger block size
  - L-2 has higher associativity than L-1 to reduce miss rates

Interactions with Advanced CPUs
- Out-of-order CPUs can execute instructions during a cache miss
  - Pending store stays in the load/store unit
  - Dependent instructions wait in reservation stations
  - Independent instructions continue
- Effect of a miss depends on program data flow
  - Much harder to analyse
  - Use system simulation

Interactions with Software
- Misses depend on memory access patterns
  - Algorithm behavior
  - Compiler optimization for memory access
- Example: sorting (figure: radix sort vs. quicksort)
  - Radix sort has fewer operations per item
  - But quicksort consistently has many fewer misses per item to be sorted

Software Optimization via Blocking
- Goal: maximize accesses to data loaded into the cache before it is replaced
  - Improve temporal locality to reduce cache misses
  - Blocked algorithms operate on submatrices or blocks
- Consider the inner loops of DGEMM (C = C + A × B on column-major n×n matrices), shown here with the enclosing i loop and function signature restored so the fragment is complete:

    void dgemm (int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double cij = C[i+j*n];            /* cij = C[i][j] */
                for (int k = 0; k < n; k++)
                    cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k] * B[k][j] */
                C[i+j*n] = cij;                   /* C[i][j] = cij */
            }
    }

DGEMM Access Pattern
(figure: access patterns for the C, A, and B arrays; shading distinguishes older from newer accesses)
- For one row of C, DGEMM reads all N×N elements of B, reads the same N elements of one row of A repeatedly, and writes one row of N elements of C

Cache Blocked DGEMM

    #define BLOCKSIZE 32

    void do_block (int n, int si, int sj, int sk,
                   double *A, double *B, double *C)
    {
        for (int i = si; i < si+BLOCKSIZE; ++i)
            for (int j = sj; j < sj+BLOCKSIZE; ++j)
            {
                double cij = C[i+j*n];            /* cij = C[i][j] */
                for (int k = sk; k < sk+BLOCKSIZE; k++)
                    cij += A[i+k*n] * B[k+j*n];   /* cij += A[i][k]*B[k][j] */
                C[i+j*n] = cij;                   /* C[i][j] = cij */
            }
    }

    void dgemm (int n, double* A, double* B, double* C)
    {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }

Blocking exploits a combination of spatial and temporal locality, since A benefits from spatial locality and B benefits from temporal locality.
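A minimal driver for the blocked version, ours rather than from the text; note that the blocked loops above assume n is a multiple of BLOCKSIZE:

    #include <stdlib.h>

    int main(void)
    {
        int n = 960;   /* must be a multiple of BLOCKSIZE (32) */
        double *A = calloc((size_t)n * n, sizeof *A);
        double *B = calloc((size_t)n * n, sizeof *B);
        double *C = calloc((size_t)n * n, sizeof *C);
        if (!A || !B || !C) return 1;

        dgemm(n, A, B, C);   /* C = C + A * B, computed block by block */

        free(A); free(B); free(C);
        return 0;
    }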

Blocked DGEMM Access Pattern
(figure: performance of unoptimized vs. cache-blocked DGEMM as matrix size grows)
- The unoptimized version's performance is halved for the largest matrix
- The cache-blocked version is less than 10% slower even for matrices of 960×960

Dependability
- Assumption: a specification of proper service exists
- Users then see the system alternating between two states of delivered service with respect to that specification:
  - Service accomplishment: service delivered as specified
  - Service interruption: deviation from specified service
  - Transitions between them: failure (accomplishment to interruption) and restoration (interruption to accomplishment)
- Fault: failure of a component
  - May or may not lead to system failure
- Failures can be permanent or intermittent
  - The latter is the more difficult case; it is harder to diagnose the problem when a system oscillates between the two states
  - Permanent failures are far easier to diagnose

Dependability Measures
- Reliability measure: mean time to failure (MTTF)
- Annual failure rate (AFR): percentage of devices expected to fail in a year for a given MTTF
  - When MTTF gets large it can be misleading, while AFR leads to better intuition
- Example: disk with a 1,000,000-hour MTTF
  - 1,000,000 hours = 1,000,000 / (365 × 24) = 114 years, so a single disk almost never fails
  - Warehouse-scale computer: 50,000 servers, each with 2 disks
  - AFR = (365 × 24) / 1,000,000 = 0.876%
  - 100,000 disks × 0.876% = 876 disks fail per year, or on average more than 2 disk failures per day
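A short C sketch recomputing the disk example (variable names are ours):

    #include <stdio.h>

    int main(void)
    {
        double mttf_hours = 1e6;
        double hours_per_year = 365.0 * 24.0;      /* 8760 */
        double afr = hours_per_year / mttf_hours;  /* 0.876% */
        double disks = 50000.0 * 2.0;              /* 100,000 disks */

        printf("MTTF = %.0f years per disk\n", mttf_hours / hours_per_year);
        printf("AFR  = %.3f%%\n", afr * 100.0);
        printf("Expected failures: %.0f per year (%.1f per day)\n",
               disks * afr, disks * afr / 365.0);
        return 0;
    }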

Dependability Measures (cont.)
- Service interruption measured as mean time to repair (MTTR)
- Mean time between failures: MTBF = MTTF + MTTR
- Availability is a measure of service accomplishment with respect to the alternation between the two states of accomplishment and interruption:
  Availability = MTTF / (MTTF + MTTR)
- Improving availability
  - Increase MTTF: fault avoidance, fault tolerance, fault forecasting
  - Reduce MTTR: improved tools and processes for diagnosis and repair
- One shorthand is to quote the number of "nines of availability" per year; very good Internet service today offers 4 or 5 nines
  - One nine:    90%     => 36.5 days of repair/year
  - Two nines:   99%     => 3.65 days of repair/year
  - Three nines: 99.9%   => 526 minutes of repair/year
  - Four nines:  99.99%  => 52.6 minutes of repair/year
  - Five nines:  99.999% => 5.26 minutes of repair/year
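The "nines" table follows directly from the year length; a minimal C sketch of the calculation (ours, for illustration):

    #include <stdio.h>

    int main(void)
    {
        double minutes_per_year = 365.0 * 24.0 * 60.0;  /* 525,600 */
        for (int nines = 1; nines <= 5; nines++) {
            double p = 1.0;                             /* unavailability */
            for (int i = 0; i < nines; i++) p /= 10.0;  /* 10^-nines */
            printf("%d nines (%.5f%%): %.2f minutes of repair/year\n",
                   nines, (1.0 - p) * 100.0, minutes_per_year * p);
        }
        return 0;
    }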