Caches II CSE 351 Autumn 2016 Instructor: Justin Hsia


Caches II, CSE 351 Autumn 2016. Instructor: Justin Hsia. Teaching Assistants: Chris Ma, Hunter Zahn, John Kaltenbach, Kevin Bi, Sachin Mehta, Suraj Bhat, Thomas Neuman, Waylon Huang, Xi Liu, Yufang Sun. (Comic: https://what-if.xkcd.com/111/)

Administrivia
Lab 3 is due Thursday.
Midterm grades: the unadjusted average is currently right around 65%; all scores will be adjusted up by 6 points in Catalyst.
Regrade requests are open on Gradescope until the end of Thursday. It is possible for your grade to go down. Make sure you submit separate requests for each portion of a question (e.g. Q5A and Q5B), since these may go to different graders!

An Example Memory Hierarchy
From top to bottom, levels get larger, slower, and cheaper per byte; from bottom to top, smaller, faster, and costlier per byte.
Registers: CPU registers hold words retrieved from the L1 cache.
On-chip L1 cache (SRAM): holds cache lines retrieved from the L2 cache.
Off-chip L2 cache (SRAM): holds cache lines retrieved from main memory.
Main memory (DRAM): holds disk blocks retrieved from local disks.
Local secondary storage (local disks): holds files retrieved from disks on remote network servers.
Remote secondary storage (distributed file systems, web servers).
The levels are implemented with different technologies: disks (spinning or solid-state); DRAM, little wells that drain slowly and must be refreshed periodically; SRAM, flip-flops built from small logic gates that feed back on each other, faster but using more power. That's why the levels differ in size and cost.

Making memory accesses fast!
Cache basics: the principle of locality, memory hierarchies.
Cache organization: direct-mapped (sets; addresses divided into index + tag), associativity (ways), replacement policy, handling writes.
Program optimizations that consider caches.

Cache Organization (1)
Block size (K): the unit of transfer between the cache ($) and memory.
Given in bytes and always a power of 2 (e.g. 64 B).
Blocks consist of adjacent bytes (differing in address by 1), which is what makes spatial locality pay off. Just like withinBlock() from Lab 1!
Offset field: the low-order log2(K) = O bits of an address tell you which byte within a block, since (address) mod 2^n yields the n lowest bits of the address, and the offset is (address) mod (# of bytes in a block).
So an A-bit address (which refers to a byte in memory) splits into a block number (the high A - O bits) and a block offset (the low O bits).
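The block number/offset split above can be sketched in C. These are hypothetical helpers (not from the slides), and the masking trick assumes K is a power of 2:

```c
#include <stdint.h>

/* Hypothetical helpers: split a byte address into block number and
 * block offset, for a power-of-2 block size K. */
uint64_t block_offset(uint64_t addr, uint64_t K) {
    return addr & (K - 1);      /* the low O = log2(K) bits */
}

uint64_t block_number(uint64_t addr, uint64_t K) {
    return addr / K;            /* the high A - O bits, i.e. addr >> log2(K) */
}

/* Example: with 64-B blocks, address 0x12345 has
 * block_offset(0x12345, 64) == 0x05 and block_number(0x12345, 64) == 0x48D. */
```

The mask `K - 1` works only because K is a power of 2: e.g. K = 64 gives the mask 0b111111, exactly the six offset bits.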

Cache Organization (2)
Cache size (C): the amount of data the cache can store. The cache can only hold so much data (a subset of the next level). Given in bytes (C) or in number of blocks (C/K). Example: C = 32 KiB = 512 blocks if using 64-B blocks.
Where should data go in the cache? We need a mapping from memory addresses to specific locations in the cache that makes checking the cache for an address fast. What data structure provides fast lookup? A hash table!
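The block-count arithmetic is simple enough to write down directly (a hypothetical helper, not from the slides):

```c
/* Number of blocks a cache of C bytes holds with K-byte blocks.
 * K always divides C, since both are powers of 2. */
unsigned num_blocks(unsigned C, unsigned K) {
    return C / K;
}

/* Example from the slide: num_blocks(32 * 1024, 64) == 512. */
```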

Aside: Hash Tables for Fast Lookup
Apply a hash function to map data to "buckets" (here N = 10 buckets, numbered 0 through 9).
Insert: 5, 27, 34, 102, 119. For example, hash(27) = 27 % N = 27 % 10 = 7, so 27 goes in bucket 7.

Place Data in Cache by Hashing Address
Here K = 4 B and C/K = 4 (memory block addresses 0000 through 1111 map into cache indices 00 through 11).
Map a block address to a cache index using the next log2(C/K) = I bits of the address: index = (block address) mod (# of blocks in cache).
This lets adjacent blocks fit in the cache simultaneously!
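The slide's hash function can be sketched as a one-liner (hypothetical helper, not from the slides):

```c
/* Cache index for a block address: (block address) mod (# of blocks).
 * For power-of-2 block counts this equals block_addr & (num_blocks - 1),
 * i.e. just the low I bits of the block address. */
unsigned cache_index(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;
}

/* With C/K = 4 blocks: block 0b0110 (6) -> index 0b10 (2),
 * and block 0b0010 (2) also -> index 2: the collision on the next slide. */
```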

Place Data in Cache by Hashing Address
Still with K = 4 B and C/K = 4: two different memory blocks can hash to the same cache index. Collision! This might confuse the cache later when we access the data. Solution?

Tags Differentiate Blocks in Same Index
(Here K = 4 B and C/K = 4.) The tag is the rest of the address bits: T bits = A - I - O. Check the tag during a cache lookup to tell which memory block currently occupies that index.

Checking for a Requested Address
The CPU sends an address request for a chunk of data. The address and the requested data are not the same thing! Analogy: your friend is not the same as his or her phone number.
TIO breakdown of an A-bit address: Tag (T) | Index (I) | Offset (O), where the tag and index together form the block number.
The index field tells you where to look in the cache; the tag field lets you check that the data there is the block you want; the offset field selects the specified start byte within the block.
Note: the T and I sizes will change based on the hash function.
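The TIO breakdown is pure bit manipulation, so it is worth seeing once as code. This is a hypothetical sketch (names and struct are mine, not from the slides), taking the field widths I and O in bits:

```c
#include <stdint.h>

/* Hypothetical TIO split: tag, index, and offset fields of an address,
 * given index width I and offset width O (in bits). */
typedef struct { uint64_t tag, index, offset; } TIO;

TIO split_address(uint64_t addr, unsigned I, unsigned O) {
    TIO f;
    f.offset = addr & ((1ULL << O) - 1);         /* low O bits        */
    f.index  = (addr >> O) & ((1ULL << I) - 1);  /* next I bits       */
    f.tag    = addr >> (I + O);                  /* remaining A-I-O bits */
    return f;
}

/* Example: 0x1833 with O = 4, I = 3 (16-B blocks, 8-block direct-mapped
 * cache) splits into offset 0x3, index 3, tag 0x30. */
```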

Cache Puzzle #1
What can you infer from the following behavior? The cache starts empty (also known as a cold cache). Access (address: hit/miss) stream: (14: miss), (15: hit), (16: miss).
What are the minimum and maximum possible block sizes? (This kind of inference shows up in Lab 4!)
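One way to check your pencil-and-paper answer is brute force (a hypothetical sketch, not the intended solution method): a candidate block size K fits the stream exactly when 14 and 15 share a block (so 15 hits) while 16 falls in a different block (so 16 misses):

```c
/* Returns 1 if a power-of-2 block size K is consistent with the stream
 * (14: miss), (15: hit), (16: miss) starting from a cold cache. */
int block_size_consistent(unsigned K) {
    return (14 / K == 15 / K)    /* 15 must be in 14's block  */
        && (16 / K != 14 / K);   /* 16 must be in a new block */
}

/* K = 2, 4, 8, 16 all pass; K = 1 and K = 32 fail.
 * So the minimum block size is 2 B and the maximum is 16 B. */
```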

Direct-Mapped Cache
Here K = 4 B and C/K = 4. Hash function: index = (block address) mod (# of blocks in cache).
Each memory address maps to exactly one index in the cache, which makes it fast (and simpler) to find an address.

Direct-Mapped Cache Problem
Still with K = 4 B and C/K = 4: what happens if we access the addresses 8, 24, 8, 24, 8, ...? Both map to the same index, so they conflict in the cache (misses every time!) while the rest of the cache goes unused. Solution?
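You can verify the conflict with the index arithmetic from the previous slides (hypothetical helper, not from the slides):

```c
/* Direct-mapped index for a byte address, with K-byte blocks
 * and `blocks` blocks in the cache. */
unsigned dm_index(unsigned addr, unsigned K, unsigned blocks) {
    return (addr / K) % blocks;
}

/* With K = 4 and 4 blocks: addresses 8 and 24 are different blocks
 * (2 and 6) but both land in index 2, so 8, 24, 8, 24, ... always misses. */
```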

Associativity
What if we could store data in any place in the cache? More complicated hardware means more power consumed and slower operation. So we combine the two ideas: each address maps to exactly one set, but each set can store a block in more than one way ("Where is address 2?").
With 8 blocks total: 1-way = 8 sets of 1 block each (direct-mapped); 2-way = 4 sets of 2 blocks each; 4-way = 2 sets of 4 blocks each; 8-way = 1 set of 8 blocks (fully associative).

Cache Organization (3)
Associativity (N): the number of ways for each set. Such a cache is called an "N-way set associative cache".
We now index into cache sets, of which there are C/K/N, using the lowest log2(C/K/N) = I bits of the block address.
Direct-mapped: N = 1, so I = log2(C/K), as we saw previously. Fully associative: N = C/K (only one set), so I = 0 bits.
In the address breakdown, the Tag (T) is used for tag comparison, the Index (I) selects the set, and the Offset (O) selects the byte within the block. Increasing associativity shrinks the index field; fully associative means only one set, and direct-mapped means only one way.
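The relationship between associativity and index width can be checked numerically (hypothetical helpers, not from the slides; all parameters are powers of 2):

```c
/* Integer log2 for powers of 2, counting how far x can be shifted right. */
unsigned log2u(unsigned x) {
    unsigned n = 0;
    while (x >>= 1) n++;
    return n;
}

/* Index-field width for an N-way cache of C bytes with K-byte blocks:
 * I = log2(S), where S = C/K/N is the number of sets. */
unsigned index_bits(unsigned C, unsigned K, unsigned N) {
    return log2u(C / K / N);
}

/* For C = 32 KiB and K = 64 B: direct-mapped (N = 1) gives I = 9;
 * 8-way gives I = 6; fully associative (N = C/K = 512) gives I = 0. */
```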

Example Placement
Where would data from address 0x1833 be placed? Block size: 16 B; capacity: 8 blocks; address width: 16 bits. In binary: 0b 0001 1000 0011 0011.
Field widths: O = log2(K) = 4, I = log2(C/K/N), T = A - I - O.
Direct-mapped: I = 3, so 0x1833 goes in set 3. 2-way set associative: I = 2, set 3. 4-way set associative: I = 1, set 1.
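The three placements follow from the same bit-extraction each time (hypothetical helper, not from the slides):

```c
/* Set index: the I bits sitting just above the O offset bits. */
unsigned set_index(unsigned addr, unsigned I, unsigned O) {
    return (addr >> O) & ((1u << I) - 1);
}

/* For 0x1833 with 16-B blocks (O = 4):
 *   direct-mapped (I = 3): set 3
 *   2-way         (I = 2): set 3
 *   4-way         (I = 1): set 1 */
```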

Block Replacement
Any empty block in the correct set may be used to store a newly fetched block. If there are no empty blocks, which one should we replace? There is no choice for direct-mapped caches. Set-associative caches typically use something close to least recently used (LRU); hardware usually implements the cheaper "not most recently used" approximation.
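To make LRU concrete, here is a minimal sketch of one 2-way set; this is my own illustrative structure, not hardware from the slides. With only two ways, tracking the single least recently used way is enough:

```c
/* Hypothetical 2-way set with LRU replacement.
 * `lru` holds the way to evict next. */
typedef struct {
    unsigned tag[2];
    int valid[2];
    int lru;                        /* index of the least recently used way */
} Set2;

/* Returns 1 on hit, 0 on miss; on a miss, fills (or evicts) the LRU way. */
int access_set(Set2 *s, unsigned tag) {
    for (int w = 0; w < 2; w++) {
        if (s->valid[w] && s->tag[w] == tag) {
            s->lru = 1 - w;         /* hit: the other way is now LRU */
            return 1;
        }
    }
    int victim = s->lru;            /* miss: replace the LRU way */
    s->tag[victim] = tag;
    s->valid[victim] = 1;
    s->lru = 1 - victim;            /* the filled way is now most recent */
    return 0;
}
```

Real hardware avoids the per-access bookkeeping cost by approximating LRU, which is why "not most recently used" is common.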

Cache Puzzle #2
What can you infer from the following behavior? The cache starts empty (cold cache). Access (address: hit/miss) stream: (10: miss), (12: miss), (10: miss).
How much associativity is there? None: the access to 12 kicked 10 out, so the cache is 1-way (direct-mapped).
How many sets? 10 (0b1010) and 12 (0b1100) must map to the same set.
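As with Puzzle #1, a brute-force consistency check can confirm your reasoning (hypothetical sketch, not the intended solution method). A direct-mapped cache with K-byte blocks and S sets matches the stream exactly when 10 and 12 sit in different blocks that map to the same set:

```c
/* Returns 1 if a direct-mapped cache with block size K and S sets is
 * consistent with (10: miss), (12: miss), (10: miss) from a cold cache. */
int puzzle2_consistent(unsigned K, unsigned S) {
    unsigned b10 = 10 / K, b12 = 12 / K;     /* block addresses */
    return b10 != b12                        /* different blocks...      */
        && b10 % S == b12 % S;               /* ...mapping to one set    */
}

/* e.g. K = 4, S = 1 works (blocks 2 and 3 share the only set);
 * K = 4, S = 2 does not (blocks 2 and 3 go to different sets);
 * K = 8, S = 1 does not (10 and 12 share a block, so 12 would hit). */
```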

General Cache Organization (S, N, K)
S = number of sets = 2^I. N = blocks ("lines") per set, where a line is a block plus its management bits. K = bytes per block.
Each line holds a valid bit (V), so we know whether the line has been initialized, a tag, and the K data bytes (numbered 0 through K-1).
Cache size: C = S x N x K data bytes (this does not include the valid bits or tags).
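The capacity formula from this slide as code (trivial, but it keeps the three parameters straight):

```c
/* Data capacity of an (S, N, K) cache: C = S * N * K bytes.
 * Valid bits and tags are overhead on top of this. */
unsigned cache_data_bytes(unsigned S, unsigned N, unsigned K) {
    return S * N * K;
}

/* Example: 64 sets x 8 ways x 64-B blocks = 32768 B = 32 KiB of data. */
```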

Notation Changes
We are using different variable names from previous quarters. Please be aware of this when you look at past exam questions or watch videos.

Variable             This Quarter   Previous Quarters
Block size           K              B
Cache size           C              ---
Associativity        N              E
Address width        A              m
Tag field width      T              t
Index field width    I              k, s
Offset field width   O              n, b
Number of sets       S