Memory hierarchy.

Memory

The operating system and the CPU's memory management unit give each process the "illusion" of a uniform, dedicated memory space (i.e. 0x0 – 0xFFFFFFFFFFFFFFFF for x86-64). This:
- Allows easy multitasking
- Hides the underlying memory organization

Random-Access Memory (RAM)

Non-persistent program code and data are stored in two types of memory:
- Static RAM (SRAM): each cell stores a bit with a six-transistor circuit. Retains its value as long as it is kept powered. Faster and more expensive than DRAM.
- Dynamic RAM (DRAM): each cell stores a bit with a capacitor and a transistor. The value must be refreshed every 10-100 ms. Slower and cheaper than SRAM.

CPU-Memory gap

DRAM
    Metric             1985   1990  1995  2000   2005   2010    2015  2015:1985
    access (ns)         200    100    70    60     50     40      20         10
    typical size (MB) 0.256      4    16    64  2,000  8,000  16,000     62,500

SRAM
    Metric        1985  1990  1995  2000  2005  2010  2015  2015:1985
    access (ns)    150    35    15     3     2   1.5   1.3        115

CPU
    Metric             1985   1990     1995    2000    2005        2010        2015  2015:1985
    CPU               80286  80386  Pentium     P-4  Core 2  Core i7(n)  Core i7(h)
    Clock rate (MHz)      6     20      150   3,300   2,000       2,500       3,000        500
    Cycle time (ns)     166     50        6    0.30    0.50         0.4        0.33        500
    Cores                 1      1        1       1       2           4           4          4
    Effective cycle
    time (ns)           166     50        6    0.30    0.25        0.10        0.08      2,075

    (n) Nehalem processor, (h) Haswell processor

Through the mid-1990s, CPU and memory cycle times were similar (memory was not a bottleneck); their improvement then diverged (memory became a bottleneck). The inflection point in computer history came when designers hit the "Power Wall".

Addressing the CPU-Memory gap

Observation: fast memory can better keep up with CPU speeds, but it costs more per byte and thus has less capacity. So: use smaller, faster, SRAM-based memories closer to the CPU.

This leads to an approach for organizing memory and storage systems as a memory hierarchy: for each level k of the hierarchy, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. Faster memory holds the more frequently accessed blocks of main memory, managed automatically in hardware.

Memory hierarchy example

Smaller, faster, more expensive toward the top; larger, slower, cheaper toward the bottom:

- L0: Registers (hold words retrieved from the L1 cache)
- L1: Level 1 cache, per-core, split (holds cache lines retrieved from the L2 cache)
- L2: Level 2 cache, per-core, unified (holds cache lines retrieved from L3)
- L3: Level 3 cache, shared
- L4: Main memory (off chip)
- L5: Local secondary storage (HD, Flash)
- Below that: remote secondary storage

Hierarchy requires locality

The key to making this work is for programs to exhibit locality: data needed in the near future is kept in fast memory near the CPU. Programs are designed to reuse data and instructions near those they have used recently, or that were recently referenced themselves.

Kinds of locality:
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.

Locality in code

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

- An example of spatial locality for data access: referencing array elements in succession (stride-1 pattern)
- An example of temporal locality for data access: referencing sum on each iteration
- An example of spatial locality for instruction access: referencing instructions in sequence
- An example of temporal locality for instruction access: cycling through the loop repeatedly

Locality example

Which function has better locality?

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Practice problem 6.7

Permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality):

    int sumarray3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < M; k++)
                    sum += a[k][i][j];
        return sum;
    }

Practice problem 6.7 (solution)

Make the leftmost array index vary in the outermost loop and the rightmost index in the innermost loop:

    int sumarray3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;
        for (k = 0; k < M; k++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    sum += a[k][i][j];
        return sum;
    }

Practice problem 6.8

Rank-order the functions with respect to the spatial locality enjoyed by each.

    #define N 1000

    typedef struct {
        int vel[3];
        int acc[3];
    } point;

    point p[N];

Best:

    void clear1(point *p, int n)
    {
        int i, j;
        for (i = 0; i < n; i++) {
            for (j = 0; j < 3; j++)
                p[i].vel[j] = 0;
            for (j = 0; j < 3; j++)
                p[i].acc[j] = 0;
        }
    }

    void clear2(point *p, int n)
    {
        int i, j;
        for (i = 0; i < n; i++) {
            for (j = 0; j < 3; j++) {
                p[i].vel[j] = 0;
                p[i].acc[j] = 0;
            }
        }
    }

Worst:

    void clear3(point *p, int n)
    {
        int i, j;
        for (j = 0; j < 3; j++) {
            for (i = 0; i < n; i++)
                p[i].vel[j] = 0;
            for (i = 0; i < n; i++)
                p[i].acc[j] = 0;
        }
    }

General cache operation

A smaller, faster, more expensive memory caches a subset of the blocks of a larger, slower, cheaper memory, which is viewed as partitioned into fixed-size "blocks". Data is copied between the two levels in block-sized transfer units. In the figure, the cache holds copies of a few memory blocks (4, 8, 9, 10, and 14), with blocks 4 and 10 in transit.

Direct-mapped caches

The simplest cache to implement:
- One block (one cache line) per set
- Each location in memory maps to a single location in the cache, e.g. block i at level k+1 is placed in block (i mod 4) at level k for a 4-block cache.

Simple direct-mapped cache

Implemented by using the least-significant bits of the address to index the set. Assume a cache block size of 1 byte. The address divides into a tag (t bits) and a set index (s bits); the set index selects one of the S = 2^s sets, each holding a valid bit, a tag, and the data.

Simple direct-mapped cache

After the set is found, examine the line to see whether it is valid and whether its stored tag matches the tag bits of the address: valid + match = hit. If the tag doesn't match, the old line is evicted and replaced.

Simple direct-mapped cache example (S=4)

- M = size of memory space = 16 bytes
- m = number of bits in address = log2(16) = 4
- S = number of sets = 4
- s = log2(S) = log2(4) = 2, i.e. log2 of the number of cache lines; the index is the 2 LSB of the address
- Tag is the rest of the address: t = m - s = 2 bits
- Address layout: t (2 bits) | s (2 bits)

Main memory:

    Address  Data      Address  Data
    0000     0x00      1000     0x88
    0001     0x11      1001     0x99
    0010     0x22      1010     0xAA
    0011     0x33      1011     0xBB
    0100     0x44      1100     0xCC
    0101     0x55      1101     0xDD
    0110     0x66      1110     0xEE
    0111     0x77      1111     0xFF

The cache has four sets, indexed 00, 01, 10, 11, each holding a tag and one byte of data.

Simple direct-mapped cache example (S=4), continued

Access pattern: 0000, 0011, 1000, followed by re-references to cached lines. Walking the trace (index = 2 LSB, tag = 2 MSB):

- 0000 (index 00, tag 00): cache miss; 0x00 loaded into set 00
- 0011 (index 11, tag 00): cache miss; 0x33 loaded into set 11
- 1000 (index 00, tag 10): cache miss (tag mismatch); set 00 is evicted and 0x88 loaded
- Subsequent accesses to the cached addresses (0011, 1000) are cache hits

Miss rate = 3/5 = 60% over the five-access trace.

Identify issue #1: (S=4)

Same 16-byte main memory and 4-set, 1-byte-block cache as above; address layout t (2 bits) | s (2 bits).

Access pattern: 0000, 0001, 0010, 0011, 0100, 0101, …

Miss rate = ?

Block size and spatial locality

Larger block sizes can take advantage of spatial locality by loading data not just from one address, but also from nearby addresses, into the cache. This increases the hit rate for sequential access:
- Sequential bytes in memory are stored together in a block (also called a cache line)
- The least-significant bits of the address are used as the block offset: with B = number of bytes in a block, the number of offset bits is log2(B)
- The set index now uses the next least-significant bits

Example: direct-mapped cache (B=8)

Direct mapped: one line per set, cache block size 8 bytes. The address of a byte divides into t tag bits, s set-index bits (which find the set among S = 2^s sets), and 3 block-offset bits; each line holds a valid bit, a tag, and bytes 0-7 of the block.

Example: direct-mapped cache (B=8)

Valid bit set and tag match: hit. The block offset (the low 3 bits of the address, e.g. 100 = byte 4) then selects the byte within the 8-byte block.

Example: direct-mapped cache (B=8)

On a hit, the addressed byte and the next 3 bytes are loaded for an int access. If the tag doesn't match: the old line is evicted and replaced.

Direct-mapped cache (B=2)

Same 16-byte main memory; the cache now has 2 sets with 2 bytes per block. Address layout: t (2 bits) | s (1 bit) | b (1 bit).

Access pattern: 0000, 0001, 0010, 0011, 0100, 0101, …

- 0000: cache miss; block {0x11, 0x00} loaded into set 0 (tag 00)
- 0001: cache hit (same block)
- 0010: cache miss; block {0x33, 0x22} loaded into set 1 (tag 00)
- 0011: cache hit
- 0100: cache miss; block {0x55, 0x44} replaces set 0's line (tag 01)
- 0101: cache hit
- …

Miss rate = ? Each miss now brings in the neighboring byte as well, so the sequential pattern alternates miss, hit: a 50% miss rate, half that of the 1-byte-block cache.

Practice problem

Consider a direct-mapped cache with 2 sets and 4 bytes per block, and a 5-bit address (S=2, B=4).

- Derive the number of address bits used for the block offset (b), the set index (s), and the tag (t), using the formula m = t + s + b: s = 1, b = 2, t = 2
- Address layout: t (2 bits) | s (1 bit) | b (2 bits)
- Cache diagram: two sets (s=0 and s=1), each with a 2-bit tag and 4 bytes of data

Practice problem

Using this cache (S=2, B=4), derive the miss rate for the following access pattern:

00000, 00001, 00011, 00100, 00111, 01000, 01001, 10000, 10011, 00000

Practice problem (solution)

Consider the same cache, (S,E,B,m) = (2,1,4,5). Tracing each access (offset = low 2 bits, set = next bit, tag = top 2 bits):

- 00000: Miss — set 0 loads tag 00, data from 00000 to 00011
- 00001: Hit
- 00011: Hit
- 00100: Miss — set 1 loads tag 00, data from 00100 to 00111
- 00111: Hit
- 01000: Miss — set 0 loads tag 01, data from 01000 to 01011
- 01001: Hit
- 10000: Miss — set 0 loads tag 10, data from 10000 to 10011
- 10011: Hit
- 00000: Miss — set 0 reloads tag 00, data from 00000 to 00011

Miss rate = 5/10 = 50%

Block sizing

Blocks are typically 8 to 64 bytes. Advantages and disadvantages of a large block size?
- Higher hit rates, since more spatial locality is captured
- But higher memory latency on a miss, since a larger line must be loaded
- And inefficiency from loading data that is never subsequently used (consider instruction access)

Identify issue #2: (S,E,B,m) = (4,1,1,4)

Access pattern: 0010, 0110, 0010, 0110, …

Each access results in a cache miss and a load into cache block 2: both addresses have set index 10 but different tags, so they keep evicting each other. Only one of the four cache blocks is used. Miss rate = 100%. Address layout: t (2 bits) | s (2 bits).

Direct mapping and conflict misses

The direct-mapped cache is simple:
- Each memory address belongs to exactly one block in the cache
- Indices and offsets can be computed with simple bit operations

But it has problems when multiple blocks map to the same entry, causing thrashing in the cache.

General causes of cache misses

Types of cache misses:
- Cold (compulsory) miss: occurs because the cache is empty, e.g. at program start-up or after a context switch.
- Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.
- Conflict miss: most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g. referencing blocks 0, 8, 0, 8, 0, 8, … would miss every time.

Set-associative caches

Reduce conflict misses:
- Each set can store multiple distinct blocks/lines
- Each address maps to exactly one set in the cache, but data may be placed in any line within that set
- Larger sets and higher associativity lead to fewer cache conflicts and lower miss rates, but they also increase hardware cost

E-way set-associative cache (E = 2)

E = 2: two lines per set; assume a cache block size of 8 bytes. As before, the set-index bits of the address select the set; now each set contains two (valid bit, tag, 8-byte block) lines.

2-way set-associative cache (E = 2, B = 8 bytes)

Both lines in the selected set are compared against the address tag: valid bit set and a tag match in either line = hit. The block offset then selects the bytes within the matching line.

2-way set-associative cache (E = 2, B = 8 bytes)

On a hit to a short int, the 2 bytes starting at the block offset are returned. If there is no match: one line in the set is selected for eviction and replacement. Replacement policies: random, least recently used (LRU), …

2-way set-associative cache simulation

16 byte addresses (M=16); S=2 sets, E=2 blocks/set, B=2 bytes/block; so b=1, s=1, t=2 (address = xx | x | x).

Address trace (reads, one byte per read):

- 0 [0000]: miss — set 0 loads tag 00, M[0-1]
- 1 [0001]: hit
- 7 [0111]: miss — set 1 loads tag 01, M[6-7]
- 8 [1000]: miss — set 0's second line loads tag 10, M[8-9]
- 0 [0000]: hit

Final contents: set 0 holds {tag 00: M[0-1], tag 10: M[8-9]}; set 1 holds {tag 01: M[6-7]}.

2-way set-associative cache implementation

Each way holds its own array of 2^s (valid, tag, data) entries, indexed by the s set-index bits of the m-bit address. The stored tag from each way is compared against the address's t tag bits, so two comparators are needed; their outputs drive the hit signal and a 2-to-1 multiplexer that selects the matching way's data block, from which the block offset picks the word.

General cache organization (S, E, B)

- S sets, with E blocks per set
- Each block: a valid bit, a tag, and B bytes of data (the block is also referred to as a cache line)
- Cache size: C = S × E × B data bytes

General cache organization

Bits in the virtual address determine the set and the block offset in the line:
- s = number of address bits used to select the set = log2(S)
- b = number of address bits used to select the byte in the block = log2(B)
- A unique tag identifies the block
- Address of a word: t bits (tag) | s bits (set index) | b bits (block offset)

General cache organization: determining tag size

- M = total size of main memory = 2^m addresses, so m = number of address bits = log2(M). For x86-64, m = 64 and M = 2^64.
- The tag consists of the bits not used in the set index and block offset: t = m - s - b
- The tag uniquely identifies the data
- Note: in practice, tag overhead is often reduced by using only 48 of the 64 address bits

Cache read

- Address bits determine the location in the cache: tag (t bits) | set index (s bits) | block offset (b bits)
- Locate the set using the set-index bits (S = 2^s sets, E lines per set)
- Check whether any line in the set has a matching tag
- Yes + line valid: hit
- Locate the data, which begins at the block offset within the line (B = 2^b bytes per cache block, each line holding a valid bit, a tag, and bytes 0 through B-1)

Practice problem

Assume an 8-block cache with 16 bytes per block. Find where the data from address 1100000110011 goes.

B = 16 bytes, so the lowest 4 bits are the block offset: 110000011 | 0011

Set index:
- 1-way (i.e. direct mapped) cache: 8 sets, so the next three bits (011) are the set index: 110000 | 011 | 0011
- 2-way cache: 4 sets, so the next two bits (11) are the set index: 1100000 | 11 | 0011
- 4-way cache: 2 sets, so the next one bit (1) is the set index: 11000001 | 1 | 0011

(1-way associativity: 8 sets, 1 block each. 2-way associativity: 4 sets, 2 blocks each. 4-way associativity: 2 sets, 4 blocks each.)

Practice problem 6.9

The table gives the parameters for a number of different caches, with cache and block sizes in bytes. For each cache, derive the missing values using cache size C = S * E * B and number of tag bits t = m - (s + b):

    Cache    m      C     B    E     S    s    b    t
    1       32   1024     4    1   256    8    2   22
    2       32   1024     8    4    32    5    3   24
    3       32   1024    32   32     1    0    5   27

Practice problem 6.11

Consider the following code, running on a system with a cache of the form (S,E,B) = (512,1,32), where the block size is given in bytes:

    int array[N];

    for (i = 0; i < N; i++)
        sum += array[i];

- How many integers can fit into a single block/line? B = 32 bytes, so 8 integers.
- Assuming sequential allocation of the integer array, what is the maximum number of integers from the array stored in the cache at any point in time? 8 * 512 = 4096.

Practice problem 6.12

Consider a 2-way set-associative cache, (S,E,B,m) = (8,2,4,13).

- Excluding the overhead of the tags and valid bits, what is the capacity of this cache? 8 * 2 * 4 = 64 bytes.
- Figure of the cache: 8 sets (0-7), each holding two lines of (Tag, Data0-3).
- Parts of the address used to determine the cache tag, set index, and block offset: t (8 bits) | s (3 bits) | b (2 bits)

Practice problem 6.13

Consider the same 2-way set-associative cache, (S,E,B,m) = (8,2,4,13), with the contents shown on the slide (invalid cache lines are left blank). Address layout: t (8 bits) | s (3 bits) | b (2 bits). Consider an access to 0x0E34.

- What is the block offset of this address in hex? 00 = 0x0
- What is the set index of this address in hex? 101 = 0x5
- What is the cache tag of this address in hex? 01110001 = 0x71
- Does this access hit or miss in the cache? Hit (set 5 holds a valid line with tag 0x71)
- What value is returned if it is a hit? 0x0B

Practice problem 6.14

Same cache, (S,E,B,m) = (8,2,4,13); invalid cache lines are left blank. Consider an access to 0x0DD5.

- What is the block offset of this address in hex? 01 = 0x1
- What is the set index of this address in hex? 101 = 0x5
- What is the cache tag of this address in hex? 01101110 = 0x6E
- Does this access hit or miss in the cache? Miss (the matching line is invalid)
- What value is returned if it is a hit? None; the access misses.

Practice problem 6.16

Same cache, (S,E,B,m) = (8,2,4,13); invalid cache lines are left blank. List all hex memory addresses that will hit in set 3.

The valid line in set 3 holds tag 0x32, so the hitting addresses are 00110010 011 xx = 0x64C, 0x64D, 0x64E, 0x64F.

Practice problem

Consider the following set-associative cache: (S,E,B,m) = (2,2,1,4).

- Derive the number of address bits used for the tag (t), the index (s), and the block offset (b): s = 1, b = 0, t = 3
- Address layout: t (3 bits) | s (1 bit) — no offset bits, since B = 1
- Cache diagram: two sets (s=0 and s=1), each holding two lines of (3-bit tag, 1 byte of data)

Extra

Memory speed over time

- In 1985: a memory access took about as long as a CPU instruction, so memory was not a bottleneck to performance.
- Today: CPUs are about 500 times faster than in 1980, while DRAM memory is only about 10 times faster than in 1980.

SRAM vs DRAM summary

            Trans.    Access
            per bit   time    Persist?  Sensitive?  Cost   Applications
    SRAM    6         1x      Yes       No          100x   Cache memories
    DRAM    1         10x     No        Yes         1x     Main memories, frame buffers

Examples of caching in the hierarchy

    Cache Type            What Cached           Where Cached         Latency (cycles)  Managed By
    CPU registers         4-byte word           On-chip registers    0                 Compiler
    TLB                   Address translations  On-Chip TLB          0                 Hardware
    L1 cache              32-byte block         On-Chip L1           1                 Hardware
    L2 cache              32-byte block         Off-Chip L2          10                Hardware
    Virtual Memory        4-KB page             Main memory          100               Hardware+OS
    Buffer cache          Parts of files        Main memory          100               OS
    Network buffer cache  Parts of files        Local disk           10,000,000        AFS/NFS client
    Browser cache         Web pages             Local disk           10,000,000        Web browser
    Web cache             Web pages             Remote server disks  1,000,000,000     Web proxy server

Practice problem

A byte-addressable machine with m = 16 address bits and a direct-mapped cache, (S,E,B,m) = (?, 1, 1, 16), block size = 1, s = 5 (the 5 LSB of the address give the index).

- How many blocks does the cache hold? 2^5 = 32
- How many bits of storage are required to build the cache (data plus all overhead, including tags, etc.)? Each line holds an 11-bit tag (t = 16 - 5 - 0 = 11), 8 bits of data, and 1 valid bit, so (11 + 8 + 1) * 32 = 640 bits.

Direct-mapped cache, B=2: (S,E,B,m) = (2,1,2,4)

Same 16-byte main memory as before. Partition the address to identify which pieces are used for the tag, the index, and the block offset:
- b = 1: block offset = LSB of the address = log2(# of bytes per block)
- s = 1: index = bit 1 of the address = log2(# of cache lines)
- t = 2: tag = 2 MSB of the address

Address layout: t (2 bits) | s (1 bit) | b (1 bit)