Reducing Cache Misses 5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory


Reducing Cache Misses: Classifying Misses (3 Cs)
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
–Compulsory—The first access to a block can never hit in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (These misses occur even in an infinite cache.)
–Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur as blocks are discarded and later retrieved. (These are the misses, beyond compulsory, in a fully associative cache of size X.)
–Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (These are the misses, beyond compulsory and capacity, in an N-way set-associative cache of size X.)
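The lecture gives no code, but the standard way to make this taxonomy concrete is a small simulation: a miss is compulsory if the block was never referenced before, capacity if it would also miss in a fully associative LRU cache of the same total size, and conflict otherwise. The Python sketch below follows that methodology; the trace, sizes, and LRU replacement are illustrative assumptions, not from the slides.

    from collections import OrderedDict

    def classify_misses(trace, num_blocks, assoc):
        seen = set()                      # blocks ever referenced (compulsory test)
        full = OrderedDict()              # fully associative LRU cache, same size
        sets = [OrderedDict() for _ in range(num_blocks // assoc)]
        counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
        for block in trace:
            # Set-associative cache with LRU replacement per set.
            s = sets[block % len(sets)]
            hit = block in s
            if hit:
                s.move_to_end(block)
            else:
                if len(s) >= assoc:
                    s.popitem(last=False)
                s[block] = True
            # Fully associative reference cache of identical capacity.
            full_hit = block in full
            if full_hit:
                full.move_to_end(block)
            else:
                if len(full) >= num_blocks:
                    full.popitem(last=False)
                full[block] = True
            if hit:
                counts["hit"] += 1
            elif block not in seen:
                counts["compulsory"] += 1
            elif not full_hit:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1   # hits fully associative, misses set-assoc.
            seen.add(block)
        return counts

    # Hypothetical trace: blocks 0 and 8 collide in a direct-mapped 8-block cache.
    print(classify_misses([0, 8, 0, 8, 0, 8], num_blocks=8, assoc=1))

On this trace the first two misses are compulsory and the remaining four are conflicts: the fully associative cache of the same size would have held both blocks.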

Reducing Cache Misses: 3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate versus cache size for SPEC92, stacked by miss type. The conflict component shrinks as associativity grows; the compulsory component is vanishingly small.]

Reducing Cache Misses: 2:1 Cache Rule
The conflict-miss data suggest a rule of thumb: the miss rate of a 1-way (direct-mapped) cache of size X roughly equals the miss rate of a 2-way set-associative cache of size X/2.

Reducing Cache Misses: 3Cs Relative Miss Rate
[Figure: the same miss-rate data normalized so each cache size totals 100%, showing the relative share of compulsory, capacity, and conflict misses.]

Reducing Cache Misses: 1. Larger Block Size
Larger blocks exploit the principle of locality: the larger the block, the greater the chance that other parts of it will be used soon, so compulsory misses fall. The benefit depends on the size of the cache: for a fixed cache size, very large blocks eventually raise conflict and capacity misses, and they always raise the miss penalty.
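A toy experiment makes the effect visible. The sketch below (all sizes and the sequential trace are assumptions, not from the lecture) simulates a direct-mapped cache and shows the miss rate dropping as the block size grows on a trace with strong spatial locality.

    def miss_rate(addresses, cache_bytes=1024, block_bytes=16):
        num_blocks = cache_bytes // block_bytes
        tags = [None] * num_blocks          # one tag per direct-mapped frame
        misses = 0
        for addr in addresses:
            block = addr // block_bytes     # block address
            index = block % num_blocks
            if tags[index] != block:        # tag mismatch or empty frame: miss
                tags[index] = block
                misses += 1
        return misses / len(addresses)

    # Sequential byte scan over a 4 KB array: spatial locality rewards big blocks.
    trace = list(range(4096))
    for b in (4, 16, 64, 256):
        print(f"block {b:3d} B: miss rate {miss_rate(trace, block_bytes=b):.3f}")

For a sequential scan the miss rate is simply 1/(block size in accesses), which is why the curve falls so quickly; a real workload with less spatial locality flattens out much sooner.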

Reducing Cache Misses: 2. Higher Associativity
2:1 Cache Rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way set-associative cache of size N/2.
But beware: execution time is the only final measure we can believe!
–Clock cycle time increases as a result of having a more complicated cache.
–Hill [1988] suggested the hit-time penalty for 2-way versus 1-way is about +10% for an external cache and +2% for an internal (on-chip) cache.

Reducing Cache Misses: 2. Higher Associativity, Avg. Memory Access Time vs. Miss Rate
The time to access memory has several components. The equation is:
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
Assume a miss penalty of 50 cycles; see the data on the next slide.
[Table: Associativity | Clock Cycle Time | Result]

Reducing Cache Misses: 2. Higher Associativity, Example
[Table: example average memory access times computed from the equation on the previous slide, for caches of increasing associativity.]
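Since the slide's table is not reproduced in this transcript, here is the calculation it tabulates, with assumed numbers: a 1-cycle direct-mapped hit, a 2-way hit slowed by roughly Hill's 10%, the 50-cycle miss penalty from the previous slide, and hypothetical miss rates.

    def amat(hit_time, miss_rate, miss_penalty=50.0):
        """Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty."""
        return hit_time + miss_rate * miss_penalty

    # Hypothetical values for one cache size; all numbers are assumptions.
    configs = [("1-way", 1.00, 0.050),   # (name, hit time in cycles, miss rate)
               ("2-way", 1.10, 0.040)]
    for name, hit, mr in configs:
        print(f"{name}: AMAT = {amat(hit, mr):.2f} cycles")
    # 1-way: 1.00 + 0.050*50 = 3.50; 2-way: 1.10 + 0.040*50 = 3.10 cycles.

Here the 2-way cache wins despite the slower clock, but with a smaller miss-rate gap or a larger cache the longer cycle time can make the direct-mapped design faster, which is the slide's warning about trusting execution time, not miss rate.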

Reducing Cache Misses: 3. Victim Caches
How can we combine the fast hit time of direct mapped and still avoid conflict misses?
–Add a small buffer that holds data recently discarded from the cache.
–A 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache.
–Used in Alpha and HP machines.
–In effect, this gives the same behavior as associativity, but only on those cache lines that really need it. (A sketch of the mechanism follows.)
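The slide describes the mechanism only in words; below is a minimal sketch of how a victim cache could sit beside a direct-mapped array. The sizes, the swap-back policy, and all names are assumptions for illustration.

    from collections import OrderedDict

    class VictimCache:
        def __init__(self, num_blocks=256, victim_entries=4):
            self.tags = [None] * num_blocks      # direct-mapped main array
            self.victims = OrderedDict()         # tiny fully associative, LRU
            self.victim_entries = victim_entries

        def access(self, block):
            index = block % len(self.tags)
            if self.tags[index] == block:
                return "hit"
            if block in self.victims:            # conflict victim still nearby:
                self.victims.pop(block)          # swap it back into the main array
                evicted = self.tags[index]
                self.tags[index] = block
                if evicted is not None:
                    self._stash(evicted)
                return "victim hit"
            evicted = self.tags[index]           # true miss: fetch from memory
            self.tags[index] = block
            if evicted is not None:
                self._stash(evicted)
            return "miss"

        def _stash(self, block):
            self.victims[block] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False) # drop least recently stashed

    c = VictimCache(num_blocks=8)
    print([c.access(b) for b in (0, 8, 0, 8)])  # ping-pong conflict:
    # ['miss', 'miss', 'victim hit', 'victim hit'] instead of four misses.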

Reducing Cache Miss Penalty
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
The time to handle a miss is increasingly the controlling factor, because processor speed has improved far faster than memory speed.

Reducing Cache Miss Penalty: Giving Priority to Read Misses over Writes
Write through with write buffers creates RAW conflicts between buffered writes and main-memory reads on cache misses:
–Simply waiting for the write buffer to empty can increase the read-miss penalty (by about 50% on the old MIPS 1000).
–Instead, check the write-buffer contents before the read; if there are no conflicts, let the memory access continue.
Write back?
–A read miss may replace a dirty block.
–Normal: write the dirty block to memory, and then do the read.
–Instead: copy the dirty block to a write buffer, then do the read, and then do the write.
–The CPU stalls less, since it restarts as soon as the read is done.
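A pseudocode-style sketch of the check made on a read miss (the structures and the forwarding choice are assumptions; a controller could equally stall until the conflicting write drains):

    class WriteBuffer:
        def __init__(self):
            self.pending = {}                 # address -> data awaiting memory

        def write(self, addr, data):
            self.pending[addr] = data         # CPU continues; memory drains later

        def read_miss(self, addr, memory):
            if addr in self.pending:          # RAW conflict with a buffered write:
                return self.pending[addr]     # forward the newest value
            return memory[addr]               # no conflict: read proceeds at once

    memory = {0x100: 1}
    wb = WriteBuffer()
    wb.write(0x100, 42)                       # buffered, not yet in memory
    print(wb.read_miss(0x100, memory))        # 42, not the stale 1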

Reducing Cache Miss Penalty: Sub-Block Placement for Reduced Miss Penalty
Don't load the full block on a miss: keep a valid bit per sub-block, and fetch only the sub-block that is needed.
[Figure: a cache line holding one tag, several sub-blocks, and one valid bit per sub-block.]
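A sketch of one such line (the sub-block count and interface are assumed): one tag covers the whole line, but each sub-block has its own valid bit, so a miss fetches only the piece that was requested.

    class SubBlockLine:
        def __init__(self, subblocks=4):
            self.tag = None
            self.valid = [False] * subblocks  # one valid bit per sub-block

        def access(self, tag, subblock):
            if self.tag == tag and self.valid[subblock]:
                return "hit"
            if self.tag != tag:               # new line: invalidate all sub-blocks
                self.tag = tag
                self.valid = [False] * len(self.valid)
            self.valid[subblock] = True       # fetch just this sub-block
            return "miss (fetched 1 sub-block)"

    line = SubBlockLine()
    print(line.access(tag=7, subblock=0))     # miss, fetches one sub-block
    print(line.access(tag=7, subblock=0))     # hit
    print(line.access(tag=7, subblock=2))     # miss, but tag already matches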

Reducing Cache Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU:
–Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
–Critical word first—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the block fills. Also called wrapped fetch or requested word first.
Generally useful only with large blocks. Spatial locality is a problem: the CPU tends to want the next sequential word anyway, so it is not clear the early restart pays off.
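A back-of-the-envelope comparison, with all latencies assumed for illustration: waiting for the whole block versus restarting as soon as the critical word arrives.

    words_per_block = 8
    first_word_latency = 40      # cycles until the first word arrives
    per_word_transfer = 4        # cycles for each subsequent word

    full_block = first_word_latency + (words_per_block - 1) * per_word_transfer
    critical_first = first_word_latency   # requested word is delivered first

    print(f"wait for full block: {full_block} cycles before restart")      # 68
    print(f"critical word first: {critical_first} cycles before restart")  # 40

The gap grows with block size, which is why the technique only pays off for large blocks.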

Reducing Cache Miss Penalty: Second-Level Caches
L2 Equations:
Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
Average Memory Access Time = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
–Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
–Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).
–The global miss rate is what matters.
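A worked two-level example with assumed counts and latencies, showing why local and global L2 miss rates differ so much:

    references, l1_misses, l2_misses = 1000, 40, 20   # assumed event counts

    miss_rate_l1 = l1_misses / references          # 4% (local = global for L1)
    local_l2 = l2_misses / l1_misses               # 50%: L2 only sees L1 misses
    global_l2 = l2_misses / references             # 2%: what matters overall

    hit_l1, hit_l2, penalty_l2 = 1, 10, 100        # cycles (assumed)
    miss_penalty_l1 = hit_l2 + local_l2 * penalty_l2
    amat = hit_l1 + miss_rate_l1 * miss_penalty_l1

    print(f"local L2 miss rate  = {local_l2:.0%}")     # 50%
    print(f"global L2 miss rate = {global_l2:.0%}")    # 2%
    print(f"AMAT = {amat:.1f} cycles")                 # 1 + 0.04*(10 + 0.5*100) = 3.4

The local L2 miss rate looks terrible (50%) because the L2 only ever sees the references the L1 already filtered, while the global rate (2%) reflects the CPU's actual experience.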

Reducing Hit Time
5.1 Introduction 5.2 The ABCs of Caches 5.3 Reducing Cache Misses 5.4 Reducing Cache Miss Penalty 5.5 Reducing Hit Time 5.6 Main Memory 5.7 Virtual Memory 5.8 Protection and Examples of Virtual Memory
Average Memory Access Time = Hit Time + Miss Rate × Miss Penalty
This section is about reducing the time to access data that IS in the cache: techniques for quickly and efficiently finding out whether data is in the cache and, if it is, getting it out.

Reducing Hit Time: Small and Simple Caches
Why does the Alpha have an 8 KB instruction cache and an 8 KB data cache, plus a 96 KB second-level cache?
–Smaller is faster: a small data cache helps keep the clock rate high.
–Direct mapped and on chip, keeping the hit path short and simple.

Reducing Hit Time: Pipelining Writes for Fast Write Hits
Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update. Only STORES occupy this pipeline, and it is empty during a miss:

Store r2, (r1)   Check r1
Add              --
Sub              --
Store r4, (r3)   M[r1] <- r2 & Check r3

The shaded stage is the "delayed write buffer"; it must be checked on reads, which either complete the pending write or read directly from the buffer.
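A sketch of the one-entry delayed write buffer implied by the slide (the structure and single-entry depth are assumptions): stage 1 checks the tag, stage 2 performs the cache update for the previous store, so back-to-back store hits each appear to take a single cycle.

    class PipelinedWriteCache:
        def __init__(self):
            self.data = {}               # tag -> value (the cache array)
            self.delayed = None          # (tag, value) waiting to be written

        def store(self, tag, value):
            self._drain()                # previous store updates the array now
            # ...this store's tag check would happen here...
            self.delayed = (tag, value)  # this store's update is deferred

        def load(self, tag):
            if self.delayed and self.delayed[0] == tag:
                return self.delayed[1]   # read from the delayed write buffer
            return self.data.get(tag)

        def _drain(self):
            if self.delayed:
                t, v = self.delayed
                self.data[t] = v
                self.delayed = None

    c = PipelinedWriteCache()
    c.store("A", 1)
    print(c.load("A"))   # 1, forwarded from the delayed write buffer
    c.store("B", 2)      # draining writes A into the array
    print(c.load("A"))   # 1, now from the cache array itself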

Reducing Hit Time: Way Prediction
Associativity reduces conflict misses, but the way-selection multiplexor adds delay to every hit. Way prediction recovers the fast hit time:
–Predict which block within the set contains the requested data.
–The multiplexor is preset to the predicted way, so the delay caused by the multiplexor is avoided.
–On a misprediction, the correct block is chosen in an extra cycle and the prediction is updated.
–A one-bit history can be used for prediction.
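A sketch of a one-bit way predictor for a single 2-way set (structure, timing labels, and the naive fill policy are assumptions): a fast probe checks only the predicted way; a wrong guess costs an extra cycle to probe the other way and retrain the bit.

    class WayPredictedSet:
        def __init__(self):
            self.ways = [None, None]     # tags stored in way 0 and way 1
            self.predict = 0             # one-bit predictor: last way that hit

        def access(self, tag):
            p = self.predict
            if self.ways[p] == tag:
                return "fast hit (1 cycle)"
            other = 1 - p
            if self.ways[other] == tag:  # mispredicted: probe the other way
                self.predict = other     # retrain the one-bit history
                return "slow hit (extra cycle)"
            self.ways[other] = tag       # miss: fill (naive replacement here)
            self.predict = other
            return "miss"

    s = WayPredictedSet()
    print(s.access("A"))   # miss
    print(s.access("A"))   # fast hit: predictor now points at A's way
    print(s.access("B"))   # miss, fills the other way
    print(s.access("A"))   # slow hit: predictor was pointing at B's way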

Reducing Hit Time: Trace Caches
–Used in the Pentium 4.
–Idea: cache dynamic traces of the executed instruction stream, so a whole sequence of instructions can be fetched as a unit.
–Complex to implement; high overhead.

Nonblocking Caches
Most caches can handle only one outstanding request at a time. If a request misses, the cache must wait for memory to supply the value that was needed, and until then it is "blocked."
A nonblocking cache has the ability to work on other requests while waiting for memory to supply any misses.
The Intel Pentium Pro and Pentium II processors use this technology for their level-2 caches, which can manage up to four simultaneous requests.
This is done by using a transaction-based architecture and a dedicated "backside" bus for the cache that is independent of the main memory bus; Intel calls this the "dual independent bus" (DIB) architecture.
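A conceptual sketch of the idea, framed in terms of miss status holding registers (MSHRs); the slide does not name MSHRs or describe Intel's internals, so the structures, the four-entry limit's mapping onto this design, and all names are assumptions.

    class NonblockingCache:
        MAX_OUTSTANDING = 4                   # e.g., four simultaneous requests

        def __init__(self):
            self.data = {}                    # resident blocks
            self.mshrs = set()                # block addresses with misses in flight

        def access(self, block):
            if block in self.data:
                return "hit (serviced immediately)"
            if block in self.mshrs:
                return "miss merged into outstanding MSHR"
            if len(self.mshrs) >= self.MAX_OUTSTANDING:
                return "structural stall: all MSHRs busy"
            self.mshrs.add(block)             # launch the memory request
            return "miss: new MSHR allocated, cache stays available"

        def fill(self, block):                # memory response arrives later
            self.mshrs.discard(block)
            self.data[block] = True

    c = NonblockingCache()
    print(c.access(1))   # miss: MSHR allocated
    print(c.access(2))   # second miss overlaps with the first
    c.fill(1)
    print(c.access(1))   # hit, even though block 2 is still outstanding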