
Memory hierarchy

– 2 – Memory
The operating system and the CPU's memory management unit give each process the "illusion" of a uniform, dedicated memory space, i.e. 0x0 – 0xFFFFFFFFFFFFFFFF for x86-64.
This allows multitasking and hides the underlying non-uniform memory hierarchy.

– 3 – Random-Access Memory (RAM)
Non-persistent program code and data are stored in two types of memory:
Static RAM (SRAM): each cell stores a bit with a six-transistor circuit; retains its value as long as it is kept powered; faster and more expensive than DRAM.
Dynamic RAM (DRAM): each cell stores a bit with a capacitor and a transistor; the value must be refreshed periodically (on the order of tens of ms); slower and cheaper than SRAM.

– 4 – CPU-Memory gap
[Table: SRAM and DRAM access times (ns) and typical sizes (MB) from 1985 onward, together with CPU metrics (clock rate in MHz, cycle time in ns, cores, effective cycle time in ns) for the Pentium, P-4, Core 2, Core i7 (n = Nehalem), and Core i7 (h = Haswell); the specific figures are not preserved in the transcript.]
The inflection point in computer history came when designers hit the "Power Wall."
In 1985, CPU and memory cycle times were similar: memory was not a bottleneck. Since then their improvement has been disparate: memory is now the bottleneck.

– 5 – Addressing the CPU-Memory gap
Observation: fast memories can keep up with CPU speeds, but they cost more per byte and have less capacity, and the gap is widening.
This suggests an approach for organizing memory and storage systems known as a memory hierarchy: for each level k of the hierarchy, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
The key is for computer programs to exhibit locality:
Keep data needed in the near future in fast memory near the CPU.
Design programs to ensure data at level k is accessed much more often than data at level k+1.

– 6 – Cache Memories
Cache memories are smaller, faster, SRAM-based memories managed automatically in hardware. They hold frequently accessed blocks of main memory.
Typical system structure: the CPU chip contains the register file, ALU, and cache memory; its bus interface connects over the system bus to the I/O bridge, which connects over the memory bus to main memory.

– 7 – Intel Core i7 cache hierarchy

– 8 – Intel Core i7 cache hierarchy (2014)
Processor package containing cores Core 0 … Core 3. Each core has:
Regs: 16 registers (rax, rbx, …)
L1 d-cache and L1 i-cache: 32 kB each
L2 unified cache: 256 kB
Shared by all cores: L3 unified cache, 2-20 MB; on Iris Pro parts, a 128 MB L4*.
Main memory sits below the processor package.

– 9 – Memory hierarchy (another view)
From smaller, faster, and more expensive (top) to larger, slower, and cheaper (bottom):
L0: Registers. CPU registers hold words retrieved from the L1 cache.
L1: Level 1 Cache (per-core, split). Holds cache lines retrieved from the L2 cache.
L2: Level 2 Cache (per-core, unified). Holds cache lines retrieved from L3.
L3: Level 3 Cache (shared).
L4: Main Memory (off chip).
L5: Local Secondary Storage (HD, Flash).
Below that: Remote Secondary Storage.

– 10 – Hierarchy requires locality
Principle of Locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
Kinds of locality:
Temporal locality: recently referenced items are likely to be referenced in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.

– 11 – Locality in code

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Give an example of spatial locality for data access: array elements are referenced in succession (stride-1 pattern).
Give an example of temporal locality for data access: sum is referenced in each iteration.
Give an example of spatial locality for instruction access: instructions are referenced in sequence.
Give an example of temporal locality for instruction access: the loop is cycled through repeatedly.

– 12 – Locality example
Which function has better locality?

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

(Answer: sumarrayrows. C stores arrays in row-major order, so scanning row by row is a stride-1 pattern, while sumarraycols strides by N.)

– 13 – Practice problem 6.7
Permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality).

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}

– 14 – Practice problem 6.7 (solution)

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;

    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}

With k outermost and j innermost, successive iterations touch adjacent elements of a[k][i][], a stride-1 pattern.

– 15 – Practice problem 6.8
Rank-order the functions with respect to the spatial locality enjoyed by each.

#define N 1000
typedef struct {
    int vel[3];
    int acc[3];
} point;
point p[N];

void clear1(point *p, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < 3; j++)
            p[i].vel[j] = 0;
        for (j = 0; j < 3; j++)
            p[i].acc[j] = 0;
    }
}

void clear2(point *p, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = 0; j < 3; j++) {
            p[i].vel[j] = 0;
            p[i].acc[j] = 0;
        }
    }
}

void clear3(point *p, int n)
{
    int i, j;
    for (j = 0; j < 3; j++) {
        for (i = 0; i < n; i++)
            p[i].vel[j] = 0;
        for (i = 0; i < n; i++)
            p[i].acc[j] = 0;
    }
}

Best: clear1 (visits each struct's fields in order, stride 1). Next: clear2 (hops between vel and acc within each struct). Worst: clear3 (strides across the whole array for each field element).

– 16 – General Cache Operation
The larger, slower, cheaper memory is viewed as partitioned into "blocks."
Data is copied between levels in block-sized transfer units.
The smaller, faster, more expensive memory caches a subset of the blocks.

– 17 – General Cache Organization (S, E, B)
A cache is organized as S = 2^s sets; each set has E = 2^e lines; each line holds a valid bit, a tag, and a cache block of B = 2^b bytes (the data, bytes 0, 1, …, B-1).
Cache size: C = S x E x B data bytes.

– 18 – Cache Read
The address of a word determines its location in the cache; it is split into t tag bits, s set-index bits, and b block-offset bits.
Locate the set using the set index.
Check whether any line in the set has a matching tag.
Yes + line valid: hit.
Locate the data: it begins at the block offset within the line.

– 19 – General Cache Organization
Address of a word: t tag bits | s set-index bits | b block-offset bits
s = # of address bits used to select the set; s = log2(S) for S sets.
b = # of address bits used to select the byte in the line; b = log2(B) for B bytes per block.
How big is the tag?

– 20 – General Cache Organization
Address of a word: t tag bits | s set-index bits | b block-offset bits
M = total size of main memory = 2^m addresses, where m = # of address bits = log2(M). For x86-64, m = 64 and M = 2^64.
The tag consists of the bits not used in the set index and block offset, and must uniquely identify the data:
m = t + s + b, so t = m - s - b.
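
Aside (not from the slides): a minimal C sketch of this decomposition using shifts and masks. The parameter values are illustrative only.

#include <stdio.h>
#include <stdint.h>

#define S_BITS 2   /* s = log2(number of sets), illustrative value  */
#define B_BITS 1   /* b = log2(bytes per block), illustrative value */

int main(void)
{
    uint64_t addr   = 0xD;                                      /* example address  */
    uint64_t offset = addr & ((1u << B_BITS) - 1);              /* low b bits       */
    uint64_t set    = (addr >> B_BITS) & ((1u << S_BITS) - 1);  /* next s bits      */
    uint64_t tag    = addr >> (S_BITS + B_BITS);                /* remaining t bits */
    printf("addr=0x%llx tag=%llu set=%llu offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
    return 0;
}

For addr = 0xD = 1101 in binary, this prints tag=1, set=2, offset=1.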

– 21 – Practice problem 6.9
The table gives the parameters for a number of different caches. For each cache, derive the missing values.
[Table columns: Cache, m, C, B, E, S, s, b, t; the given values are not preserved in the transcript.]
Cache size C = S * E * B. Number of tag bits = m - (s + b).
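
A small C helper (an added sketch, not part of the original slides) that derives s, b, t, and C from the given parameters, assuming the usual power-of-two sizes. The example values are hypothetical, not the book's table.

#include <stdio.h>

/* Integer log2 for power-of-two inputs. */
static unsigned ilog2(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    /* Hypothetical cache parameters; the values in problem 6.9 differ. */
    unsigned m = 32, B = 4, E = 1, S = 256;
    unsigned s = ilog2(S);      /* set-index bits            */
    unsigned b = ilog2(B);      /* block-offset bits         */
    unsigned t = m - s - b;     /* tag bits: t = m - (s + b) */
    unsigned C = S * E * B;     /* cache size in data bytes  */
    printf("C=%u s=%u b=%u t=%u\n", C, s, b, t);
    return 0;
}

Output: C=1024 s=8 b=2 t=22.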

– 22 – Direct mapped caches
The simplest of caches to implement: one line per set (E = 1).
Each location in memory maps to a single cache entry. If multiple locations map to the same entry, their accesses will cause the cache to "thrash" due to conflict misses.
Recall (S,E,B,m): C = S*E*B, which is S*B for a direct-mapped cache.
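
To make the behavior concrete before the worked examples, here is a minimal direct-mapped cache simulator in C (an added sketch; the access pattern is made up, since the slides' addresses were not preserved, but it happens to reproduce the 3/5 miss rate of the example that follows):

#include <stdio.h>
#include <stdbool.h>

#define S      4    /* number of sets (= lines, since E = 1) */
#define S_BITS 2    /* log2(S)                               */
#define B_BITS 0    /* log2(B); B = 1 byte per block         */

struct line { bool valid; unsigned tag; };
static struct line cache[S];

/* Returns true on a hit; on a miss, loads the line (evicting the old one). */
static bool access_cache(unsigned addr)
{
    unsigned set = (addr >> B_BITS) & (S - 1);
    unsigned tag = addr >> (S_BITS + B_BITS);
    if (cache[set].valid && cache[set].tag == tag)
        return true;              /* hit                     */
    cache[set].valid = true;      /* miss: evict and replace */
    cache[set].tag   = tag;
    return false;
}

int main(void)
{
    /* Hypothetical 4-bit addresses: miss, miss, miss (tag mismatch), hit, hit. */
    unsigned pattern[] = { 0x0, 0x3, 0xB, 0xB, 0x0 };
    int n = sizeof pattern / sizeof pattern[0], misses = 0;
    for (int i = 0; i < n; i++)
        if (!access_cache(pattern[i]))
            misses++;
    printf("miss rate = %d/%d\n", misses, n);   /* prints 3/5 */
    return 0;
}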

– 23 – Simple Direct Mapped Cache (E = 1, B = 1)
Direct mapped: one line per set. Assume a cache block size of 1 byte.
Address of a byte: t tag bits followed by the set-index bits (0…01 in the example). The set-index bits select (find) one of the S = 2^s sets; each set is a single line with a valid bit v and a tag.

– 24 – Example: Direct Mapped Cache (E = 1)
In the selected set, check valid? and compare the tag: match (assume yes) = hit.
If the tag doesn't match: the old line is evicted and replaced.

– 25 – Simple direct-mapped cache example: (S,E,B,m) = (4,1,1,4)
S = 4 sets, E = 1 block per set, B = 1 byte per line/block, m = 4 bits of address = 16 bytes of memory.
s = log2(S) = log2(4) = 2 = log2(# of cache lines); the index is the 2 LSBs of the address.
b = log2(B) = 0
t = 4 - (s + b) = 2; the tag is the 2 MSBs of the address.
[Diagram: a 4-entry cache (Index, Tag, Data) alongside the 16-byte main memory; the memory contents are only partially preserved in the transcript (…, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF).]

– 26 to 32 – Simple direct-mapped cache example: (S,E,B,m) = (4,1,1,4)
These slides step through a five-reference access pattern one access at a time (the addresses themselves are not preserved in the transcript): cache miss, cache miss, cache miss (a tag mismatch that evicts and replaces a line), cache hit, cache hit.
Miss rate = 3/5 = 60%

– 33 – Issue #1: (S,E,B,m) = (4,1,1,4)
Access pattern: … (the addresses are not preserved in the transcript)
Miss rate = ?
With 1-byte blocks the cache gets no benefit from spatial locality; the next slide addresses this with larger block sizes.

– 34 – Block size and spatial locality
Larger block sizes can take advantage of spatial locality by loading data from not just one address, but also nearby addresses, into the cache. This increases the hit rate for sequential access.
Use the least significant bits of the address as the block offset; sequential bytes in memory are stored together in a block (or cache line). Number of bits = log2(# of bytes in a block).
Use the next (middle) bits to choose the set in the cache that a block maps into. Number of bits = log2(# of sets in the cache).
Use the most significant bits as a tag to uniquely identify blocks in the cache.
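
As an added illustration (assuming a cold cache and a block-aligned array): a stride-1 scan of 4096 bytes with B = 8 touches 4096/8 = 512 distinct blocks, so only the first reference to each block misses, giving 512 misses out of 4096 references (12.5%), versus a 100% miss rate with B = 1.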

– 35 – Example: Direct Mapped Cache (E = 1, B = 8)
Direct mapped: one line per set. Assume a cache block size of 8 bytes.
Address of a byte: t tag bits, set-index bits (0…01 in the example), and a 3-bit block offset (100 in the example). The set-index bits select (find) the set.

– 36 – Example: Direct Mapped Cache (E = 1, B = 8)
In the selected set, check valid? and compare the tag: match (assume yes) = hit. The block offset selects the starting byte within the 8-byte block.

– 37 – Example: Direct Mapped Cache (E = 1, B = 8)
On a hit, the addressed byte and the next 3 bytes are loaded (a 4-byte word starting at the block offset).
If the tag doesn't match: the old line is evicted and replaced.

– 38 – Direct-mapped cache with B = 2: (S,E,B,m) = (2,1,2,4)
Partition the address to identify which pieces are used for the tag, the index, and the block offset:
t = 2: the tag is the 2 MSBs of the address.
s = 1: the index is bit 1 of the address; s = log2(# of cache lines).
b = 1: the block offset is the LSB of the address; b = log2(# of bytes per block).
Each cache line now holds a tag and two data bytes (Data0 and Data1).
[Diagram: the 2-entry cache alongside the 16-byte main memory; the memory contents are only partially preserved in the transcript.]

– 39 to 45 – Direct-mapped cache with B = 2: (S,E,B,m) = (2,1,2,4)
These slides step through a six-reference access pattern one access at a time (the addresses themselves are not preserved in the transcript). Each miss loads a two-byte block, and the neighboring byte in the block then hits: the outcomes are cache miss, cache hit, cache miss, cache hit, cache miss, cache hit.
Miss rate = 3/6 = 50%; the larger block size exploits spatial locality that the 1-byte-block cache could not.

– 46 – Exercise
Consider the following cache: (S,E,B,m) = (2,1,4,5)
Derive values for the number of address bits used for the tag (t), the index (s), and the block offset (b): s = log2(2) = 1, b = log2(4) = 2, t = 5 - (s + b) = 2.
Address layout: t (2 bits) | s (1 bit) | b (2 bits)
Cache diagram: two sets (s = 0 and s = 1), each holding a 2-bit tag and a 4-byte data block.

– 47 to 57 – Exercise
Consider the same cache: (S,E,B,m) = (2,1,4,5). Derive the miss rate for the access pattern stepped through on these slides (the addresses themselves are not preserved in the transcript; each miss loads a 4-byte block into set 0 or set 1).
Outcomes, one access at a time: Miss, Hit, Hit, Miss, Hit, Miss, Hit, Miss, Hit, Miss.
Miss rate = 5/10 = 50%

– 58 – Practice problem 6.11
Consider the following code running on a system with a cache of the form (S,E,B,m) = (512,1,32,32):

int array[4096];
for (i = 0; i < 4096; i++)
    sum += array[i];

Assuming sequential allocation of the integer array, what is the maximum number of integers from the array that are stored in the cache at any point in time?
Answer: B = 32 bytes = 8 integers per block, so the cache holds up to 8 * 512 = 4096 integers (the entire array).

– 59 – Cache misses
Types of cache misses:
Cold (compulsory) miss: occurs because the cache is empty, e.g. at program start-up or after a context switch.
Conflict miss: most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k; e.g. block i at level k+1 must be placed in block (i mod 4) at level k. Conflict misses occur when the level-k cache is large enough, but multiple data objects all map to the same level-k block: referencing blocks 0, 8, 0, 8, 0, 8, … would miss every time.
Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.
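
As an added illustration of conflict misses (not from the slides): on a hypothetical direct-mapped cache, two arrays whose corresponding elements map to the same set can evict each other on every reference:

#include <stdio.h>

/* Suppose a direct-mapped cache of 8 KB total. If x and y happen to sit
 * exactly 8 KB apart, x[i] and y[i] map to the same set, so each access
 * evicts the block the other just loaded: every reference can miss even
 * though only two blocks are "active" at a time. (Whether the arrays are
 * laid out this way depends on the linker; this is for illustration.)  */
#define N 2048                  /* 2048 floats = 8 KB */
static float x[N], y[N];

static float dotprod(void)
{
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        sum += x[i] * y[i];     /* x[i] and y[i] thrash the same set */
    return sum;
}

int main(void)
{
    printf("%f\n", dotprod());
    return 0;
}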

– 60 – Direct mapping and conflict misses
The direct-mapped cache is easy: each memory address belongs in exactly one block in the cache, and indices and offsets can be computed with simple bit operations.
But it is not flexible: when multiple blocks map to the same entry, their accesses cause thrashing in the cache.

– 61 – Issue #2: (S,E,B,m) = (4,1,1,4)
Access pattern: … (two addresses with different tags but the same index, alternated; the addresses themselves are not preserved in the transcript)
Miss rate = 100%. Each access results in a cache miss and a load into cache block 2; only one out of four cache blocks is used.

– 62 – Set associative caches
Each memory address is assigned to a particular set in the cache, but not to a specific block within the set.
Set sizes range from 1 (direct-mapped) to all blocks in the cache (fully associative).
Larger sets and higher associativity lead to fewer cache conflicts and lower miss rates, but they also increase the hardware cost.
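
A minimal sketch of a set-associative lookup with LRU replacement (added code with illustrative parameters; not from the slides):

#include <stdio.h>
#include <stdbool.h>

#define S      2    /* sets                  */
#define E      2    /* lines (ways) per set  */
#define S_BITS 1    /* log2(S)               */
#define B_BITS 2    /* log2(bytes per block) */

struct line { bool valid; unsigned tag; unsigned last_used; };
static struct line cache[S][E];
static unsigned now;

/* Returns true on a hit; on a miss, replaces the least-recently-used way. */
static bool access_cache(unsigned addr)
{
    unsigned set = (addr >> B_BITS) & (S - 1);
    unsigned tag = addr >> (S_BITS + B_BITS);
    struct line *victim = &cache[set][0];
    now++;
    for (int w = 0; w < E; w++) {
        struct line *l = &cache[set][w];
        if (l->valid && l->tag == tag) {    /* hit: refresh LRU stamp */
            l->last_used = now;
            return true;
        }
        if (!l->valid || l->last_used < victim->last_used)
            victim = l;                     /* track empty or LRU way */
    }
    victim->valid = true;                   /* miss: fill the victim  */
    victim->tag = tag;
    victim->last_used = now;
    return false;
}

int main(void)
{
    /* Two blocks that map to the same set: with E = 2 they coexist, */
    /* so only the first reference to each misses (2 misses of 6).   */
    unsigned pattern[] = { 0x00, 0x08, 0x00, 0x08, 0x00, 0x08 };
    int n = sizeof pattern / sizeof pattern[0], misses = 0;
    for (int i = 0; i < n; i++)
        if (!access_cache(pattern[i]))
            misses++;
    printf("misses = %d of %d\n", misses, n);
    return 0;
}

The same alternating pattern on a direct-mapped cache would miss every time.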

– 63 – Extra

– 64 – Memory speed over time
In 1985: a memory access took about as long as a CPU instruction, so memory was not a bottleneck to performance.
Today: CPUs are about 500 times faster than in 1980, while DRAM memory is only about 10 times faster than in 1980.

– 65 – SRAM vs DRAM summary

       Tran. per bit   Access time   Persist?   Sensitive?   Cost   Applications
SRAM   6               1X            Yes        No           100X   Cache memories
DRAM   1               10X           No         Yes          1X     Main memories, frame buffers

– 66 – Examples of Caching in the Hierarchy

Cache Type             What Cached            Where Cached          Latency (cycles)   Managed By
Registers              4-byte word            CPU registers         0                  Compiler
TLB                    Address translations   On-Chip TLB           0                  Hardware
L1 cache               32-byte block          On-Chip L1            1                  Hardware
L2 cache               32-byte block          Off-Chip L2           10                 Hardware
Virtual Memory         4-KB page              Main memory           100                Hardware + OS
Buffer cache           Parts of files         Main memory           100                OS
Network buffer cache   Parts of files         Local disk            10,000,000         AFS/NFS client
Browser cache          Web pages              Local disk            10,000,000         Web browser
Web cache              Web pages              Remote server disks   1,000,000,000      Web proxy server

– 67 – Practice problem
Byte-addressable machine, m = 16 bits.
Cache = (S,E,B,m) = (?, 1, 1, 16): direct mapped, block size = 1, s = 5 (the 5 LSBs of the address give the index).
How many blocks does the cache hold? 2^5 = 32.
How many bits of storage are required to build the cache (data plus all overhead, including tags, etc.)? Each line holds a Valid bit (1 bit), a Tag (t = 16 - 5 - 0 = 11 bits), and one data Byte (8 bits): (1 + 11 + 8) * 32 = 640 bits.