10/18: Lecture Topics
- Using spatial locality: memory blocks
- Write-back vs. write-through
- Types of cache misses
- Cache performance
- Cache tradeoffs
- Cache summary
Locality
- Temporal locality: the principle that data being accessed now will probably be accessed again soon.
  - Useful data tends to continue to be useful.
- Spatial locality: the principle that data near the data being accessed now will probably be needed soon.
  - If data item n is useful now, then it's likely that data item n+1 will be useful soon.
Memory Access Patterns
- Memory accesses don't look like this: random accesses scattered across the address space.
- Memory accesses do look like this: repeated accesses to hot variables, and stepping sequentially through arrays.
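A minimal C sketch of the two patterns described above (the array name and size are illustrative):

#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    int sum = 0;                 /* "hot" variable: temporal locality */

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* stepping through the array: consecutive addresses, spatial locality */
    for (int i = 0; i < N; i++)
        sum += a[i];             /* sum is touched on every iteration */

    printf("%d\n", sum);
    return 0;
}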
Locality
- Last time, we improved memory performance by taking advantage of temporal locality.
- When a word in memory was accessed, we loaded it into the cache.
- This does nothing for spatial locality.
Possible Spatial Locality Solution
- Store one word per cache line.
- When memory word N is accessed, load word N, word N+1, word N+2, … into the cache.
- This is called prefetching.
- What's a drawback?
- Example: What if we access the word at address 1000100?
[Cache diagram: direct-mapped cache with indices 000-111 and tag/valid/value fields per line]
Memory Blocks
- Divide memory into blocks.
- If any word in a block is accessed, then load the entire block into the cache.
  - Block 0: 0x00000000-0x0000003F
  - Block 1: 0x00000040-0x0000007F
  - Block 2: 0x00000080-0x000000BF
- Cache line for a 16-word block size: tag | valid | w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15
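A small sketch (assuming the 64-byte / 16-word blocks shown above) of how an address maps to a block number and a block-aligned base address:

#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 64   /* 16 words * 4 bytes, as on the slide */

int main(void) {
    uint32_t addr  = 0x00000084;            /* example address, arbitrary */
    uint32_t block = addr / BLOCK_SIZE;     /* which block? -> 2 */
    uint32_t base  = addr & ~(uint32_t)(BLOCK_SIZE - 1);  /* block start -> 0x80 */

    printf("block %u, base 0x%08X\n", block, base);
    return 0;
}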
Address Tags Revisited
- A cache block size > 1 word requires the address to be divided differently.
- Instead of a byte offset into a word, we need a byte offset into the block.
- Assuming we had 10-bit addresses and 4 words in a block:
  10-bit address 0101100111 -> Tag (3 bits): 010 | Index (3 bits): 110 | Block offset (4 bits): 0111
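A sketch of extracting those fields with shifts and masks, using the 10-bit layout above (3-bit tag, 3-bit index, 4-bit block offset):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr = 0x167;               /* 0101100111 in binary */

    uint32_t offset = addr & 0xF;        /* low 4 bits  -> 0111 */
    uint32_t index  = (addr >> 4) & 0x7; /* next 3 bits -> 110  */
    uint32_t tag    = (addr >> 7) & 0x7; /* top 3 bits  -> 010  */

    printf("tag=%X index=%X offset=%X\n", tag, index, offset);
    return 0;
}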
Cache Diagram
- Cache with a block size of 4 words.
[Cache diagram: indices 000-111, each line with tag, valid bit, and four values]
- What does the cache look like after accesses to these 10-bit addresses? 1000010010, 1100110011
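Using the split from the previous slide (3-bit tag, 3-bit index, 4-bit offset): 1000010010 divides as tag 100, index 001, offset 0010, and 1100110011 divides as tag 110, index 011, offset 0011, so the two accesses fill the lines at indices 001 and 011 with the blocks tagged 100 and 110.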
Cache Lookup
- 32-bit address, 64KB direct-mapped cache, 4 words/block.
- Reference address: 10000110 11101010 10000101 11101010
- Lookup procedure:
  1. Index into the cache.
  2. Is the valid bit on? If no: cache miss; access memory.
  3. Do the tags match? If no: cache miss; access memory.
  4. If yes: cache hit; select the word and return the data.
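A minimal sketch of that lookup in C, using the slide's parameters (64KB direct-mapped, 16-byte blocks, so a 4-bit offset, 12-bit index, and 16-bit tag); the structure names are illustrative:

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 4096              /* 64KB / 16-byte blocks */

struct line {
    bool     valid;
    uint32_t tag;
    uint32_t words[4];              /* 4 words per block */
};

static struct line cache[NUM_LINES];

/* Returns true on a hit and stores the word in *data;
   a miss would go to memory (not shown). */
bool lookup(uint32_t addr, uint32_t *data) {
    uint32_t word_off = (addr >> 2) & 0x3;     /* which word in the block */
    uint32_t index    = (addr >> 4) & 0xFFF;   /* 12-bit index */
    uint32_t tag      = addr >> 16;            /* 16-bit tag */

    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {           /* valid? tags match? */
        *data = l->words[word_off];            /* hit: select the word */
        return true;
    }
    return false;                              /* miss: access memory */
}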
Cache Example
- Suppose the L1 cache is 32KB, 2-way set associative, and has 8 words per block. How do we partition the 32-bit address?
  - How many bits for the block offset?
  - How many bits for the index?
  - How many bits for the tag?
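Working it out: 8 words x 4 bytes = 32 bytes per block, so 5 offset bits; 32KB / 32B = 1024 blocks, and 2-way associativity gives 1024 / 2 = 512 sets, so 9 index bits; the tag gets the remaining 32 - 5 - 9 = 18 bits.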
The Effects of Block Size
- Big blocks are good:
  - Fewer first-time misses.
  - Exploits spatial locality.
- Small blocks are good:
  - Don't evict as much other data when bringing in a new entry.
  - More likely that all items in the block will turn out to be useful.
- How do you choose a block size?
Reads vs. Writes
- Caching is essentially making a copy of the data.
- When you read, the copies still match when you're done.
- When you write, the results must eventually propagate to both copies.
  - Especially to the lowest level, which is in some sense the permanent copy.
Write-Through Caches
- Write the update to the cache and the memory immediately.
- Advantages:
  - The cache and the memory are always consistent.
  - Evicting a cache line is cheap because no data needs to be written back.
  - Easier to implement.
- Disadvantages?
Write-Back Caches
- Write the update to the cache only. Write to the memory only when the cache block is evicted.
- Advantages:
  - Writes go at cache speed rather than memory speed.
  - Some writes never need to be written to the memory.
  - When a whole block is written back, a high-bandwidth transfer can be used.
- Disadvantages?
Dirty Bit
- When evicting a block from a write-back cache, we could:
  - always write the block back to memory, or
  - write it back only if we changed it.
- Caches use a "dirty bit" to mark whether a line was changed:
  - the dirty bit is 0 when the block is loaded
  - it is set to 1 if the block is modified
  - when the line is evicted, it is written back only if the dirty bit is 1
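A sketch of the write-back policy with a dirty bit; the line structure and memory helpers are illustrative stand-ins for whatever the real memory interface is:

#include <stdint.h>
#include <stdbool.h>

struct line {
    bool     valid;
    bool     dirty;                 /* 0 when loaded, 1 once modified */
    uint32_t tag;
    uint32_t words[4];
};

/* hypothetical memory interface */
void write_block_to_memory(uint32_t tag, uint32_t index, const uint32_t *words);
void read_block_from_memory(uint32_t tag, uint32_t index, uint32_t *words);

/* Store hit: update the cache only and mark the line dirty. */
void store_hit(struct line *l, uint32_t word_off, uint32_t value) {
    l->words[word_off] = value;
    l->dirty = true;
}

/* Eviction: write back only if the line was modified. */
void evict(struct line *l, uint32_t new_tag, uint32_t index) {
    if (l->valid && l->dirty)
        write_block_to_memory(l->tag, index, l->words);
    read_block_from_memory(new_tag, index, l->words);
    l->tag   = new_tag;
    l->valid = true;
    l->dirty = false;               /* a freshly loaded block is clean */
}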
Dirty Bit Example
- Use the dirty bit to determine when evicted cache lines need to be written back to memory.
[Cache diagram: indices 00-11, each line with tag, valid bit, dirty bit, and values]
- Assume 8-bit addresses.
- Assume all memory words are initialized to 7.
  $r2 = Mem[10010000]
  Mem[10010100] = 10
  $r3 = Mem[11010100]
  $r4 = Mem[11011000]
  Mem[01010000] = 10
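A worked trace, assuming 4-word blocks (so a 4-bit offset, 2-bit index, and 2-bit tag; the slide doesn't state the block size): every one of these accesses maps to index 01. The first load brings in the block with tag 10; the store to Mem[10010100] hits that block and sets its dirty bit; the load of Mem[11010100] evicts it (writing it back, since it is dirty) and loads the tag-11 block; the load of Mem[11011000] hits that block; and the final store evicts the clean tag-11 block (no write-back) and dirties the newly loaded tag-01 block.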
i-Cache and d-Cache
- There usually are two separate caches for instructions and data. Why?
  - Avoids structural hazards in pipelining.
  - The combined capacity is twice as big, but each cache still has the access time of a small cache.
  - Allows both caches to operate in parallel, for twice the bandwidth.
Handling i-Cache Misses
1. Stall the pipeline and send the address of the missed instruction to the memory.
2. Instruct memory to perform a read; wait for the access to complete.
3. Update the cache.
4. Restart the instruction, this time fetching it successfully from the cache.
- d-Cache misses are even easier, but still require a pipeline stall.
Cache Replacement
- How do you decide which cache block to replace?
- If the cache is direct-mapped, it's easy: there is only one candidate.
- Otherwise, common strategies:
  - Random
  - Least Recently Used (LRU)
- Other strategies are used at lower levels of the hierarchy. More on those later.
LRU Replacement
- Replace the block that hasn't been used for the longest time.
- Reference stream: A B C D B D E B A C B C E D C B
LRU Implementations
- LRU is very difficult to implement for high degrees of associativity.
- 4-way approximation:
  - 1 bit to indicate the least recently used pair
  - 1 bit per pair to indicate the least recently used item in that pair
- Much more complex approximations at lower levels of the hierarchy.
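A sketch of that 4-way approximation (tree pseudo-LRU) in C. The slide only names the bits, so the encoding below (each bit points toward the less recently used side) is an assumption:

#include <stdint.h>

/* 3 bits per set: bit0 picks the LRU pair (ways 0-1 vs. 2-3),
   bit1 picks within ways 0-1, bit2 picks within ways 2-3. */
struct plru { uint8_t bits; };

/* On an access to `way`, flip the bits on its path to point away from it. */
void plru_touch(struct plru *s, int way) {
    if (way < 2) {
        s->bits |= 1;                       /* pair 2-3 is now the LRU side */
        s->bits = (way == 0) ? (s->bits | 2) : (s->bits & ~2);
    } else {
        s->bits &= ~1;                      /* pair 0-1 is now the LRU side */
        s->bits = (way == 2) ? (s->bits | 4) : (s->bits & ~4);
    }
}

/* Follow the bits to find the (approximately) least recently used way. */
int plru_victim(const struct plru *s) {
    if (s->bits & 1)                        /* LRU pair is 2-3 */
        return (s->bits & 4) ? 3 : 2;
    else                                    /* LRU pair is 0-1 */
        return (s->bits & 2) ? 1 : 0;
}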
The Three C's of Caches
Three reasons for cache misses:
- Compulsory miss: the item has never been in the cache.
- Capacity miss: the item has been in the cache, but space was tight and it was forced out (occurs even with fully associative caches).
- Conflict miss: the item was in the cache, but the cache was not associative enough, so it was forced out (never occurs with fully associative caches).
Eliminating Cache Misses
- What cache parameters (cache size, block size, associativity) can you change to eliminate the following kinds of misses?
  - compulsory
  - capacity
  - conflict
Multi-Level Caches
- Use each level of the memory hierarchy as a cache over the next lowest level.
- Inserting level 2 between levels 1 and 3 allows:
  - level 1 to have a higher miss rate (so it can be smaller and cheaper)
  - level 3 to have a larger access time (so it can be slower and cheaper)
- The new effective access time equation:
  Effective access time = L1 hit time + L1 miss rate x (L2 access time + L2 miss rate x memory access time)
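For example, with illustrative numbers (not from the slide): a 1-cycle L1 hit time, 5% L1 miss rate, 10-cycle L2 access time, 20% L2 miss rate, and 100-cycle memory access time give an effective access time of 1 + 0.05 x (10 + 0.2 x 100) = 1 + 0.05 x 30 = 2.5 cycles.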
Which Cache System Is Better?
- 32KB unified data and instruction cache, hit rate of 97%
- or: 16KB data cache, hit rate of 92%, and 16KB instruction cache, hit rate of 98%
- Assume 20% of instructions are loads or stores.
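A sketch of the arithmetic, using the slide's numbers: each instruction generates 1 instruction fetch plus 0.2 data accesses, so the split caches miss 1 x 0.02 + 0.2 x 0.08 = 0.036 times per instruction, a miss rate of 0.036 / 1.2 = 3%, exactly matching the unified cache's 3% miss rate. The tie-breaker is that the split caches can serve an instruction fetch and a data access in parallel, while the unified cache cannot.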
Cache Parameters and Tradeoffs
- If you are designing a cache, what choices do you have, and what are their tradeoffs?
Cache Comparisons

L1 i-Cache:
- Alpha 21164: 8KB direct-mapped, 32B block
- MIPS R10000: 32KB 2-way (LRU), 64B block
- Pentium Pro: 4-way
- UltraSparc 1: 16KB pseudo 2-way

L1 d-Cache:
- Alpha 21164: 8KB direct-mapped, 32B block
- MIPS R10000: 32KB 2-way (LRU)
- Pentium Pro: 2-way
- UltraSparc 1: 16KB

L2 unified Cache:
- Alpha 21164: 96KB 3-way, 64B block, on chip
- Pentium Pro: 256KB 4-way, 32B block, same package
Summary: Classifying Caches
- Where can a block be placed?
  - Direct mapped, set associative, or fully associative.
- How is a block found?
  - Direct mapped: by index.
  - Set associative: by index and search.
  - Fully associative: by search.
- What happens on a write access?
  - Write-back or write-through.
- Which block should be replaced?
  - Random
  - LRU (Least Recently Used)