Caches — Han Wang, CS 3410, Spring 2012, Computer Science, Cornell University


Caches
Han Wang, CS 3410, Spring 2012, Computer Science, Cornell University
See P&H 5.1, 5.2 (except writes)

Announcements
This week: PA2 work-in-progress submission
Next six weeks: two labs and two projects
Prelim2 will be Thursday, March 29th

Agenda
Memory Hierarchy Overview
The Principle of Locality
Direct-Mapped Cache
Fully Associative Cache

Performance
CPU clock rates: ~0.2 ns – 2 ns (5 GHz – 500 MHz)

Technology       Capacity   $/GB     Latency
Tape             1 TB       $.17     100s of seconds
Disk             2 TB       $.03     millions of cycles (ms)
SSD (Flash)      128 GB     $2       thousands of cycles (us)
DRAM             8 GB       $10      50-300 cycles (10s of ns)
SRAM (off-chip)  8 MB       $4000    5-15 cycles (few ns)
SRAM (on-chip)   256 KB     ???      1-3 cycles (ns)
Others: eDRAM (aka 1-T SRAM), FeRAM, CD, DVD, …

Capacity and latency are closely coupled; cost is inversely proportional.
Q: Can we create the illusion of cheap + large + fast?

Memory Pyramid
RegFile (100s of bytes): < 1 cycle access
L1 Cache (several KB): 1-3 cycle access
L2 Cache (½–32 MB): 5-15 cycle access
Memory (128 MB – few GB): 50-300 cycle access
Disk (many GB – few TB): 1,000,000+ cycle access
Caches are usually made of SRAM (or eDRAM); L1, L2, and even L3 are all on-chip these days, and L3 is becoming more common (eDRAM?).
These are rough numbers: mileage may vary for the latest/greatest hardware.

Memory Hierarchy
Memory closer to the processor: small & fast; stores active data.
Memory farther from the processor: big & slow; stores inactive data.

Active vs Inactive Data
Assumption: most data is not active (the 90/10 rule).
Q: How do we decide what is active?
A: The programmer decides.
A: The compiler decides.
A: The OS decides at run-time.
A: The hardware decides at run-time.

Insight of Caches
Q: What is “active” data?  A: Data that will be used soon.
If Mem[x] was accessed recently...
… then Mem[x] is likely to be accessed again soon.
Exploit temporal locality: put recently accessed Mem[x] higher in the pyramid.
… and Mem[x ± ε] is likely to be accessed soon.
Exploit spatial locality: put the entire block containing Mem[x] higher in the pyramid.
(“Soon” is relative to the frequency of loads.)

Locality Analogy
Writing a report on a specific topic: while at the library, you check out books and keep them on your desk. If you need more, you check those out and bring them to the desk too. But you don't return the earlier books, since you might still need them. The desk has limited space; which books do you keep? You hope this collection of ~20 books on the desk is enough to write the report, despite 20 being only 0.00002% of the books in Cornell's libraries.

Two Types of Locality
Temporal locality (locality in time): if a memory location is referenced, it will tend to be referenced again soon.  Keep the most recently accessed data items closer to the processor.
Spatial locality (locality in space): if a memory location is referenced, locations with nearby addresses will tend to be referenced soon.  Move blocks of contiguous words closer to the processor.
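
A concrete (hypothetical) C illustration, not from the slides: summing an array exercises both kinds of locality, since consecutive elements share a cache line (spatial) and the loop variables and code are reused every iteration (temporal).

#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    int sum = 0;
    for (int i = 0; i < 1024; i++) {
        /* Spatial locality: a[i] and a[i+1] sit in the same cache line,
           so one miss brings in the next several elements for free. */
        sum += a[i];
        /* Temporal locality: sum, i, and the loop's instructions are
           reused every iteration, so they stay high in the pyramid. */
    }
    printf("%d\n", sum);
    return 0;
}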

Locality
Memory trace (addresses accessed while the program runs):
0x7c9a2b18, 0x7c9a2b19, 0x7c9a2b1a, 0x7c9a2b1b, 0x7c9a2b1c, 0x7c9a2b1d, 0x7c9a2b1e, 0x7c9a2b1f, 0x7c9a2b20, 0x7c9a2b21, 0x7c9a2b22, 0x7c9a2b23, 0x7c9a2b28, 0x7c9a2b2c, 0x0040030c, 0x00400310, 0x7c9a2b04, 0x00400314, 0x7c9a2b00, 0x00400318, 0x0040031c, ...

Program:
int n = 4;
int k[] = { 3, 14, 0, 10 };

int fib(int i) {
  if (i <= 2) return i;
  else return fib(i-1) + fib(i-2);
}

int main(int ac, char **av) {
  for (int i = 0; i < n; i++) {
    printi(fib(k[i]));
    prints("\n");
  }
}

Memory Hierarchy
Memory closer to the processor is fast but small. It usually stores a subset of the memory farther away (“strictly inclusive”); alternatives are strictly exclusive and mostly inclusive.
Transfers happen in whole blocks (cache lines):
4 KB: disk ↔ RAM
256 B: RAM ↔ L2
64 B: L2 ↔ L1

Cache Lookups (Read)
The processor tries to access Mem[x]. Check: is the block containing Mem[x] in the cache?
Yes: cache hit. Return the requested data from the cache line.
No: cache miss. Read the block from memory (or a lower-level cache), evict an existing cache line if needed to make room, place the new block in the cache, return the requested data, and stall the pipeline while all of this happens.

Cache Organization
The cache has to be fast and dense.
Gain speed by performing lookups in parallel, but this costs die real estate for the lookup logic.
Reduce the lookup logic by limiting where in the cache a block might be placed, but this might reduce cache effectiveness.
(Diagram: DRAM on one side, the cache controller between the CPU and memory, with the SRAM and lookup logic inside the controller.)

Three Common Designs
A given data block can be placed…
… in any cache line  Fully Associative
… in exactly one cache line  Direct Mapped
… in a small set of cache lines  Set Associative (e.g., a 2-way set-associative cache has way 0 and way 1 in each set)

TIO: Mapping the Memory Address
Full address format: Tag | Index | Offset
The lowest bits of the address (the Offset) determine which byte within a block the address refers to.
An n-bit Offset means a block is how many bytes?
An n-bit Index means the cache has how many blocks?

Direct Mapped Cache
Each block number maps to a single cache line index. Simplest hardware.
Example: with 4-byte words and 8 bytes (two words) per line, addresses 0x000000–0x000007 map to line 0, 0x000008–0x00000f to line 1, 0x000010–0x000017 to line 2, 0x000018–0x00001f to line 3, and then the pattern wraps around; each line also stores a tag to record which block it currently holds.

Tags and Offsets
Assume sixteen 64-byte cache lines.
0x7FFF3D4D = 0111 1111 1111 1111 0011 1101 0100 1101
bottom 2 bits = byte offset within the word
bottom 6 bits = byte offset within the cache line
next 4 bits = cache line index
remaining bits = tag
We also need meta-data for each cache line:
valid bit: is the cache line non-empty?
tag: which block is stored in this line (if valid)?
Q: How do we check whether X is in the cache?
Q: How do we clear a cache line?
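
A minimal C sketch (not from the slides) of that split for 0x7FFF3D4D, assuming the same sixteen 64-byte lines and a 32-bit address:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr   = 0x7FFF3D4D;
    uint32_t offset = addr & 0x3F;        /* bottom 6 bits              */
    uint32_t index  = (addr >> 6) & 0xF;  /* next 4 bits                */
    uint32_t tag    = addr >> 10;         /* remaining 22 bits          */
    /* prints: tag=0x1fffcf index=5 offset=13 */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}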

A Simple Direct Mapped Cache
Using byte addresses in this example! Address bus = 5 bits.
Memory: byte addresses 0–16 hold the values 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181.
Cache: 4 cache lines, line size = 2 bytes; offset is 1 bit, index is 2 bits, tag is 2 bits. Each line holds V, tag, and data.
Processor registers: $1 $2 $3 $4. Trace the loads and count the hits and misses:
lb $1 ← M[ 1 ]
lb $2 ← M[ 13 ]
lb $3 ← M[ 0 ]
lb $3 ← M[ 6 ]
lb $2 ← M[ 5 ]
lb $2 ← M[ 6 ]
lb $2 ← M[ 10 ]
lb $2 ← M[ 12 ]

Direct Mapped Cache (Reading)
[Diagram] The address is split into Tag | Index | Offset. Each cache line stores a valid bit (V), a Tag, and a data Block. The Index selects one line, the stored tag is compared (=) against the address tag to produce hit?, and the Offset does word select within the block to produce data.
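
A rough C sketch of the lookup the diagram describes; the line count (16), line size (64 bytes), and the cache_read name are illustrative assumptions, and real hardware does the tag compare and word select in parallel rather than in software:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define LINES      16
#define LINE_BYTES 64

/* One cache line: valid bit, tag, and a block of data. */
struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  block[LINE_BYTES];
};

static struct line cache[LINES];

/* Direct-mapped read lookup: the index selects exactly one line, then the
   stored tag is compared with the address tag (the "=" in the diagram).
   Returns true on a hit and copies the requested 32-bit word out. */
bool cache_read(uint32_t addr, uint32_t *word_out) {
    uint32_t offset = addr % LINE_BYTES;
    uint32_t index  = (addr / LINE_BYTES) % LINES;
    uint32_t tag    = addr / (LINE_BYTES * LINES);

    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {                   /* hit?        */
        memcpy(word_out, &l->block[offset & ~3u], 4);  /* word select */
        return true;
    }
    return false;  /* miss: fetch the block from memory, fill, retry */
}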

Direct Mapped Cache Size
n-bit index, m-bit offset.
Q: How big is the cache (data only)?
A: (2^m bytes per line) × (2^n lines) = 2^(n+m) bytes of data.
Q: How much SRAM is needed (data + overhead)?
A: the data plus (1 + (32 − n − m)) bits of overhead (valid bit + tag) per line, times 2^n lines.
Some numbers: a 32 KB cache with a 64-byte line size has 512 lines, a 6-bit offset, a 9-bit index, and a 17-bit tag, so the overhead is 18 bits × 512 lines = 9216 bits = 1152 bytes.
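
The same arithmetic as a small C sketch (the 32-bit address width and one valid bit per line are the slide's assumptions):

#include <stdio.h>

int main(void) {
    int addr_bits   = 32, line_bytes = 64, cache_bytes = 32 * 1024;
    int lines       = cache_bytes / line_bytes;             /* 512            */
    int offset_bits = 6, index_bits = 9;                    /* log2(64), log2(512) */
    int tag_bits    = addr_bits - index_bits - offset_bits; /* 17             */
    int overhead    = (1 + tag_bits) * lines;               /* valid + tag    */
    printf("overhead = %d bits = %d bytes\n", overhead, overhead / 8);
    return 0;
}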

Cache Performance (very simplified)
L1 (SRAM): 512 × 64-byte cache lines, direct mapped.
Data cost: 3 cycles per word access. Lookup cost: 2 cycles.
Mem (DRAM): 4 GB. Data cost: 50 cycles for the first word, plus 3 cycles per consecutive word.
Performance depends on: access time for a hit, miss penalty, and hit rate.
Hit rate 90%: 0.90 × 5 + 0.10 × (2 + 50 + 16 × 3) = 4.5 + 10.0 = 14.5 cycles on average.
Hit rate 95%: 0.95 × 5 + 0.05 × (2 + 50 + 16 × 3) ≈ 9.75 cycles on average.
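
The slide's numbers plugged into a small C sketch of the average-access-time formula (hit_cost and miss_cost follow the costs above; the names are illustrative):

#include <stdio.h>

int main(void) {
    double hit_cost  = 2 + 3;            /* lookup + word access        */
    double miss_cost = 2 + 50 + 16 * 3;  /* lookup + DRAM + 16-word line */
    double rates[]   = { 0.90, 0.95 };
    for (int i = 0; i < 2; i++) {
        double avg = rates[i] * hit_cost + (1 - rates[i]) * miss_cost;
        printf("hit rate %.0f%% -> %.2f cycles on average\n",
               100 * rates[i], avg);
    }
    return 0;
}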

Misses
Q: What causes a cache miss?
Cold (aka Compulsory) Miss: the line is being referenced for the first time.
…or the line was in the cache, but has been evicted (conflict: collisions, competition for lines).

Avoiding Misses
Q: How do we avoid…
Cold misses? Mostly unavoidable: the data was never in the cache. Prefetching can help (fix the program, special prefetch instructions, prediction in the CPU, prediction in the cache controller, …).
Other misses? Buy more SRAM, or use a more flexible cache design.

Bigger cache doesn’t always help…
Mem access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, …
Hit rate with four direct-mapped 2-byte cache lines?
With eight 2-byte cache lines?
With four 4-byte cache lines?
What program is this? (memcpy)
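
For reference, this is the access pattern a byte-by-byte copy produces; a hypothetical sketch with the source at address 0 and the destination at address 16:

#include <stddef.h>

/* Copying 16 bytes from address 0 to address 16 touches 0, 16, 1, 17,
   2, 18, ... (a read interleaved with a write each iteration).  With
   four direct-mapped 2-byte lines, addresses x and x+16 map to the same
   line (index = (x/2) % 4), so the source and destination evict each
   other on every access and the hit rate is 0%. */
void copy16(char *dst, const char *src) {
    for (size_t i = 0; i < 16; i++)
        dst[i] = src[i];
}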

Misses
Q: What causes a cache miss?
Cold (aka Compulsory) Miss: the line is being referenced for the first time.
Conflict Miss: the line was in the cache, but was evicted because some other access had the same index (collisions, competition).
Capacity Miss: the line was evicted because the cache is too small, i.e., the working set of the program is larger than the cache.

Avoiding Misses
Q: How do we avoid…
Cold misses? Mostly unavoidable: the data was never in the cache. Prefetching can help.
Capacity misses? Buy more SRAM.
Conflict misses? Use a more flexible cache design.

Three Common Designs
A given data block can be placed…
… in any cache line  Fully Associative
… in exactly one cache line  Direct Mapped
… in a small set of cache lines  Set Associative

A Simple Fully Associative Cache
Using byte addresses in this example! Address bus = 5 bits.
Memory: byte addresses 0–16 hold the values 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181.
Cache: 4 cache lines, line size = 2 bytes; offset is 1 bit, tag is 4 bits (no index). Each line holds V, tag, and data.
Processor registers: $1 $2 $3 $4. Trace the loads and count the hits and misses:
lb $1 ← M[ 1 ]
lb $2 ← M[ 13 ]
lb $3 ← M[ 0 ]
lb $3 ← M[ 6 ]
lb $2 ← M[ 5 ]
lb $2 ← M[ 6 ]
lb $2 ← M[ 10 ]
lb $2 ← M[ 12 ]

Fully Associative Cache (Reading)
[Diagram] The address is split into Tag | Offset (no index). Every line (“way”) stores V, a Tag, and a 64-byte block; the address tag is compared (=) against all stored tags in parallel to do line select, the Offset does word select (32 bits), and any match asserts hit? and drives data.
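
A rough C sketch of this lookup; the way count (4), line size (64 bytes), and the cache_read name are illustrative, and the loop stands in for the bank of parallel comparators in the diagram:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define WAYS       4
#define LINE_BYTES 64

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  block[LINE_BYTES];
};

static struct line cache[WAYS];

/* Fully associative read lookup: there is no index field, so the
   address tag is compared against every way. */
bool cache_read(uint32_t addr, uint32_t *word_out) {
    uint32_t offset = addr % LINE_BYTES;
    uint32_t tag    = addr / LINE_BYTES;

    for (int way = 0; way < WAYS; way++) {
        if (cache[way].valid && cache[way].tag == tag) {       /* hit?        */
            memcpy(word_out, &cache[way].block[offset & ~3u], 4); /* word select */
            return true;
        }
    }
    return false;  /* miss: choose a victim line, refill, retry */
}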

Fully Associative Cache Size
m-bit offset, 2^n cache lines (no index bits).
Q: How big is the cache (data only)?
A: (2^m bytes per line) × (2^n lines) = 2^(n+m) bytes of data.
Q: How much SRAM is needed (data + overhead)?
A: the data plus (1 + (32 − m)) bits of overhead (valid bit + tag) per line, times 2^n lines.
Some numbers: a 32 KB cache with a 64-byte line size has 512 lines, a 6-bit offset, and a 26-bit tag, so the overhead is 27 bits × 512 lines = 13824 bits = 1728 bytes.

Fully-associative caches reduce conflict misses…
… assuming a good eviction strategy.
Mem access trace: 0, 16, 1, 17, 2, 18, 3, 19, 4, 20, …
Hit rate with four fully-associative 2-byte cache lines?
There is no temporal locality here, but lots of spatial locality  larger blocks would help.

… but a large block size can still reduce the hit rate
Vector add trace: 0, 100, 200, 1, 101, 201, 2, 102, 202, …
Hit rate with four fully-associative 2-byte cache lines?
With two fully-associative 4-byte cache lines?

Misses: classification
Cold (aka Compulsory): the line is being referenced for the first time. (Block size influences cold misses.)
Capacity: the line was evicted because the cache was too small, i.e., the working set of the program is larger than the cache.
Conflict: the line was evicted because of another access whose index conflicted. (Conflict misses can’t happen in a fully associative cache.)

Summary
Caching assumptions: a small working set (the 90/10 rule) and a predictable future (spatial & temporal locality).
Benefits: a big & fast memory built from (big & slow) + (small & fast).
Tradeoffs: associativity, line size, hit cost, miss penalty, hit rate.
Fully associative  higher hit cost, higher hit rate.
Larger block size  lower hit cost, higher miss penalty.
Next up: other designs; writing to caches.