CSE 378 Cache Performance


Slide 1: Performance metrics for caches

Basic performance metric: hit ratio h
– h = (number of memory references that hit in the cache) / (total number of memory references)
– Typically h = 0.90 to 0.97
Equivalent metric: miss rate m = 1 - h
Other important metric: average memory access time (AMAT)
– AMAT = h * T_cache + (1 - h) * T_mem
– where T_cache is the time to access the cache (e.g., 1 cycle) and T_mem is the time to access main memory (e.g., 100 cycles)
– (Of course, this formula has to be modified in the obvious way if you have a hierarchy of caches)
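To see what these numbers mean, here is a minimal sketch in Python (the latencies are the slide's example values; the hit ratios are illustrative) of how AMAT moves with the hit ratio:

    # AMAT = h * T_cache + (1 - h) * T_mem, using the slide's example
    # latencies: 1-cycle cache, 100-cycle main memory.
    def amat(h, t_cache=1, t_mem=100):
        return h * t_cache + (1 - h) * t_mem

    for h in (0.90, 0.95, 0.97):
        print(f"h = {h:.2f} -> AMAT = {amat(h):.2f} cycles")
    # h = 0.90 -> AMAT = 10.90 cycles
    # h = 0.95 -> AMAT = 5.95 cycles
    # h = 0.97 -> AMAT = 3.97 cycles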

Slide 2: Parameters for cache design

Goal: have h as high as possible without paying too much for T_cache
The bigger the cache size (or capacity), the higher h
– True, but too big a cache increases T_cache
– There is a limit on the amount of "real estate" on the chip (although this limit is not what constrains first-level caches)
The larger the cache associativity, the higher h
– True, but too much associativity is costly because of the number of comparators required, and it might also slow down T_cache (extra logic is needed to select the "winner")
Line (or block) size
– For a given application there is an optimal line size, but that optimal size varies from application to application

Slide 3: Impact of capacity on miss rate

[Figure: miss rate vs. capacity for a 2-way data cache with 64-byte lines, running 176.gcc from SPEC 2000]

Slide 4: Parameters for cache design (cont'd)

Write policy (see later)
– There are several policies; as expected, the most complex ones give the best performance results
Replacement algorithm (for set-associative caches)
– Not very important for caches with small associativity (it will be very important for paging systems)
Split I- and D-caches vs. unified caches
– First-level caches need to be split because pipelining requests an instruction every cycle; splitting also allows different design parameters for I-caches and D-caches
– Second- and higher-level caches are unified (mostly used for data)

Slide 5: Example cache hierarchies (don't quote me on these numbers)

Entries are (capacity, associativity, line size in bytes); WT = write-through, WB = write-back.

Processor      I-cache        D-cache              L2
Alpha 21064    (8KB,1,32)     (8KB,1,32) WT        Off-chip (2MB,1,64)
Alpha 21164    (8KB,1,32)     (8KB,1,32) WT        (96KB,3,64) WB
Alpha 21264    (64KB,2,64)    (64KB,2,64) WB       Off-chip (16MB,1,64)
Pentium        (8KB,2,32)     (8KB,2,32) WT/WB     Off-chip, up to 8MB
Pentium II     (16KB,2,32)    (16KB,2,32) WB       Glued (512KB,8,32) WB
Pentium III    (16KB,2,32)    (16KB,4,32) WB       (512KB,8,32) WB
Pentium 4      Trace cache    (16KB,4,64) WT       (1MB,8,128) WB

Slide 6: Examples (cont'd)

AMD K7: 64KB I-cache, 64KB D-cache

Slide 7: Back to associativity

Advantages
– Reduces conflict misses
Disadvantages
– Needs more comparators
– Access time is longer (need to choose among the comparisons, i.e., need a multiplexor)
– A replacement algorithm is needed, and it can get more complex as associativity grows
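To make the comparator and multiplexor cost concrete, here is a minimal lookup sketch (a hypothetical SetAssocCache class, illustrative only; fills are omitted). Each additional way is one more tag comparison feeding the select logic:

    class SetAssocCache:
        def __init__(self, num_sets, ways, line_size):
            self.num_sets = num_sets
            self.ways = ways
            self.line_size = line_size
            # One (valid, tag) pair per way; data arrays omitted.
            self.sets = [[(False, None)] * ways for _ in range(num_sets)]

        def lookup(self, addr):
            index = (addr // self.line_size) % self.num_sets
            tag = addr // (self.line_size * self.num_sets)
            # In hardware all 'ways' tag comparisons run in parallel,
            # and a multiplexor selects the matching way ("the winner").
            for way, (valid, t) in enumerate(self.sets[index]):
                if valid and t == tag:
                    return way    # hit: which way matched
            return None           # miss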

Slide 8: Replacement algorithm

None for direct-mapped caches
Random, LRU, or pseudo-LRU for set-associative caches
– LRU means that the entry in the set that has not been used for the longest time will be replaced (think of a stack)
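The stack intuition is easy to make concrete. A minimal sketch (illustrative, one set only) that keeps the most recently used tag at the top of the stack:

    # LRU for a single set, kept as a recency stack: the most recently
    # used tag sits at the end of the list, the victim at the front.
    def lru_access(stack, tag, ways):
        """Touch 'tag'; return the evicted tag, or None."""
        if tag in stack:            # hit: move to most-recent position
            stack.remove(tag)
            stack.append(tag)
            return None
        victim = None
        if len(stack) == ways:      # set full: evict the least recent
            victim = stack.pop(0)
        stack.append(tag)
        return victim

    s = []
    for t in [1, 2, 3, 1, 4]:       # 3-way set: the 4 evicts 2, the LRU tag
        lru_access(s, t, ways=3)
    print(s)                        # [3, 1, 4]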

Slide 9: Impact of associativity on miss rate

[Figure: miss rate vs. associativity for a 32 KB data cache with 64-byte lines; same application as the capacity figure]

Slide 10: Impact of line size

Recall: line size = number of bytes stored in a cache entry
On a cache miss the whole line is brought into the cache
For a given cache capacity, advantages of a large line size:
– Fewer lines, so less real estate is needed for tags
– Lower miss rate IF the program exhibits good spatial locality
– Better transfer efficiency between cache and main memory
For a given cache capacity, drawbacks of a large line size:
– Higher latency per transfer
– Might bring in unused data IF the program exhibits poor spatial locality
– Might increase the number of conflict/capacity misses

Slide 11: Impact of line size on miss rate

[Figure: miss rate vs. line size]

Slide 12: Classifying the cache misses: the 3 C's

Compulsory misses (cold start)
– The first time you touch a line. Reduced (for a given cache capacity and associativity) by having large line sizes
Capacity misses
– The working set is too big for the ideal cache of the same capacity and line size (i.e., fully associative with an optimal replacement algorithm). Only remedy: a bigger cache!
Conflict misses (interference)
– Two lines map to the same location. Increasing associativity decreases this type of miss.
There is a fourth C: coherence misses (cf. multiprocessors)
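These categories can actually be measured by simulating two caches over the same address trace: the real one, and a fully associative one of the same capacity. A minimal sketch, using LRU as a practical stand-in for the slide's optimal replacement (true optimal replacement would need the future of the trace):

    # 3-C classification over an address trace (illustrative sketch).
    def classify_misses(trace, num_lines, line_size):
        seen = set()      # lines ever touched (for compulsory misses)
        full_assoc = []   # fully associative LRU cache, as a recency list
        counts = {"compulsory": 0, "capacity": 0}
        for addr in trace:
            line = addr // line_size
            if line not in seen:
                counts["compulsory"] += 1
                seen.add(line)
            elif line not in full_assoc:
                counts["capacity"] += 1
            # Maintain the fully associative LRU state.
            if line in full_assoc:
                full_assoc.remove(line)
            elif len(full_assoc) == num_lines:
                full_assoc.pop(0)
            full_assoc.append(line)
        return counts

    # Conflict misses = (misses of the real cache on the same trace)
    #                   - compulsory - capacity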

Slide 13: Performance revisited

Recall: AMAT = h * T_cache + (1 - h) * T_mem
We can expand on T_mem as T_mem = T_acc + b * T_tra, where
– T_acc is the time to send the address of the line to main memory and have the DRAM read the line into its own buffer, and
– T_tra is the time to transfer one word (4 bytes) on the memory bus from the DRAM to the cache, and b is the line size in words (this might also depend on the width of the bus)
For example, if T_acc = 5 and T_tra = 1, which cache is better:
– C1 (b1 = 1) or C2 (b2 = 4), for a program with h1 = 0.85 and h2 = 0.92, assuming T_cache = 1 in both cases?
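Working it out (the answer is not on the slide, so this is a worked check using the formulas above):

    # Worked version of the slide's example.
    def amat(h, b, t_cache=1, t_acc=5, t_tra=1):
        t_mem = t_acc + b * t_tra        # T_mem = T_acc + b * T_tra
        return h * t_cache + (1 - h) * t_mem

    print(amat(h=0.85, b=1))  # C1: 0.85*1 + 0.15*6 = 1.75 cycles
    print(amat(h=0.92, b=4))  # C2: 0.92*1 + 0.08*9 = 1.64 cycles
    # C2 wins: its longer lines raise the hit ratio enough to pay
    # for the larger transfer time.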

Slide 14: Writing in a cache

On a write hit, should we write:
– in the cache only (write-back policy), or
– in the cache and in main memory (or the next-level cache) (write-through policy)?
On a write miss, should we:
– allocate a line as in a read (write-allocate), or
– write only in memory (write-around)?

Slide 15: Write-through policy

Write-through (aka store-through)
– On a write hit, write both in the cache and in memory
– On a write miss, the most frequent option is write-around, i.e., write only in memory
Pros:
– Memory is always coherent (better for I/O)
– More reliable (no error detection/correction "ECC" required for the cache)
Cons:
– More memory traffic (can be somewhat alleviated with write buffers)

Slide 16: Write-back policy

Write-back (aka copy-back)
– On a write hit, write only in the cache (requires a dirty bit)
– On a write miss, most often write-allocate (fetch on miss), but variations are possible
– We write to memory only when a dirty line is replaced
Pros and cons: the reverse of write-through
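For concreteness, here is a minimal sketch (a hypothetical one-line helper, not the course's code) of where the dirty bit enters write-back with write-allocate:

    # Write-back, write-allocate handling for one cache line.
    class Line:
        def __init__(self):
            self.valid = False
            self.dirty = False
            self.tag = None

    def write(line, tag, write_to_memory, fetch_from_memory):
        if line.valid and line.tag == tag:   # write hit
            line.dirty = True                # write in cache only; mark dirty
            return
        # Write miss with write-allocate: replace the current line.
        if line.valid and line.dirty:
            write_to_memory(line.tag)        # write back the dirty victim
        fetch_from_memory(tag)               # fetch on miss
        line.valid, line.dirty, line.tag = True, True, tag

    line = Line()
    write(line, tag=0x12, write_to_memory=lambda t: None,
          fetch_from_memory=lambda t: None)  # cold miss: allocate, mark dirty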

Slide 17: Cutting back on write-backs

In write-through, you write only the word (or byte) you modify
In write-back, you write back the entire line
– But you could keep one dirty bit per word, so on replacement you would need to write back only the words that are dirty

Slide 18: Hiding memory latency

With write-through, the processor has to wait until memory has stored the data
– Inefficient, since the store does not prevent the processor from continuing to work
To speed up the process, place write buffers between the cache and main memory
– A write buffer is a set of temporary registers that hold the contents and the address of what to store in main memory
– The store to main memory from the write buffer can be done while the processor continues processing
The same concept can be applied to dirty lines in the write-back policy
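In software terms the write buffer is just a small FIFO drained off the critical path. A minimal sketch (hypothetical WriteBuffer class, illustrative only):

    from collections import deque

    class WriteBuffer:
        def __init__(self, memory, capacity=4):
            self.memory = memory
            self.capacity = capacity
            self.fifo = deque()              # (address, data) entries

        def store(self, addr, data):
            """Called by the processor; stalls only if the buffer is full."""
            while len(self.fifo) == self.capacity:
                self.drain_one()             # processor would stall here
            self.fifo.append((addr, data))   # then continues immediately

        def drain_one(self):
            """Runs 'in the background' as the memory bus frees up."""
            if self.fifo:
                addr, data = self.fifo.popleft()
                self.memory[addr] = data     # off the processor's critical path

    mem = {}
    wb = WriteBuffer(mem)
    wb.store(0x100, 42)   # processor continues; the buffer drains later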

Slide 19: Coherency: caches and I/O

In general, I/O transfers occur directly between memory and disk
What happens on a memory-to-disk transfer?
– With write-through, memory is up to date. No problem
– With write-back, we need to "purge" cache entries that are dirty and that will be sent to the disk
What happens on a disk-to-memory transfer?
– The entries in the cache that correspond to memory locations read from disk must be invalidated
– This requires a valid bit in the cache (or other techniques)

Slide 20: Reducing cache misses with more "associativity": victim caches

An example of a "hardware assist"
Victim cache: a small fully-associative buffer "behind" the cache and "before" main memory
– Of course, it can also exist within a cache hierarchy (e.g., behind L1 and before L2, or behind L2 and before main memory)
Main goal: remove some of the conflict misses in direct-mapped caches (or any cache with low associativity)

Slide 21: Victim cache organization

[Diagram: the address (index + tag) probes the cache and the victim cache. Paths shown: (1) hit in the cache; (2) miss in the cache, hit in the victim cache: send the data to the register and swap the two lines; (3) on a miss in both, the line comes from the next level of the memory hierarchy, while (3') the line evicted from the cache goes into the victim cache]

Slide 22: Operation of a victim cache

1. Hit in L1: nothing else needed
2. Miss in L1 for the block at location b, hit in the victim cache at location v: swap the contents of b and v (takes an extra cycle)
3. Miss in L1, miss in the victim cache: load the missing item from the next level and put it in L1; put the entry replaced in L1 into the victim cache; if the victim cache is full, evict one of its entries
A victim buffer of 4 to 8 entries works well for a 32KB direct-mapped cache.
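The three cases map directly onto code. A minimal sketch (a hypothetical direct-mapped L1 plus a FIFO-managed fully associative victim cache, illustrative only):

    from collections import deque

    class L1WithVictim:
        def __init__(self, num_lines, victim_entries=4):
            self.l1 = [None] * num_lines          # direct-mapped: one tag per line
            self.num_lines = num_lines
            self.victim = deque(maxlen=victim_entries)  # FIFO eviction

        def access(self, line_addr):
            idx = line_addr % self.num_lines
            if self.l1[idx] == line_addr:
                return "L1 hit"                   # case 1
            if line_addr in self.victim:          # case 2: swap L1 <-> VC
                self.victim.remove(line_addr)
                if self.l1[idx] is not None:
                    self.victim.append(self.l1[idx])
                self.l1[idx] = line_addr
                return "victim cache hit (extra cycle)"
            # Case 3: miss in both; fill from the next level and move
            # the replaced L1 line into the victim cache.
            if self.l1[idx] is not None:
                self.victim.append(self.l1[idx])  # maxlen auto-evicts the oldest
            self.l1[idx] = line_addr
            return "miss"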