Anshul Kumar, CSE IITD CSL718 : Memory Hierarchy Cache Performance Improvement 23rd Feb, 2006

Anshul Kumar, CSE IITD slide 2 Performance
Average memory access time = Hit time + Mem stalls / access = Hit time + Miss rate * Miss penalty
Program execution time = IC * Cycle time * (CPI_exec + Mem stalls / instr)
Mem stalls / instr = Miss rate * Miss penalty * Mem accesses / instr
Miss penalty in an OOO processor = Total miss latency - Overlapped miss latency
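A minimal sketch of these formulas in C; all of the numbers (hit time, miss rate, miss penalty, accesses per instruction, CPI_exec) are assumed for illustration and are not from the slides.

    /* Hedged sketch: plug assumed numbers into the AMAT and memory-stall formulas above. */
    #include <stdio.h>

    int main(void) {
        double hit_time = 1.0;           /* cycles (assumed) */
        double miss_rate = 0.05;         /* misses per access (assumed) */
        double miss_penalty = 100.0;     /* cycles per miss (assumed) */
        double accesses_per_instr = 1.3; /* memory accesses per instruction (assumed) */
        double cpi_exec = 1.0;           /* base CPI without memory stalls (assumed) */

        double amat = hit_time + miss_rate * miss_penalty;                          /* 6.0 cycles */
        double mem_stalls_per_instr = miss_rate * miss_penalty * accesses_per_instr;
        double cpi = cpi_exec + mem_stalls_per_instr;                               /* 7.5 */

        printf("AMAT = %.2f cycles, CPI = %.2f\n", amat, cpi);
        return 0;
    }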

Anshul Kumar, CSE IITD slide 3 Performance Improvement
Reducing miss penalty
Reducing miss rate
Reducing miss penalty * miss rate
Reducing hit time

Anshul Kumar, CSE IITD slide 4 Reducing Miss Penalty
Multi level caches
Critical word first and early restart
Giving priority to read misses over writes
Merging write buffer
Victim caches

Anshul Kumar, CSE IITD slide 5 Multi Level Caches
Average memory access time = Hit time_L1 + Miss rate_L1 * Miss penalty_L1
Miss penalty_L1 = Hit time_L2 + Miss rate_L2 * Miss penalty_L2
Multi level inclusion and multi level exclusion
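A hedged worked example of this composition, with assumed L1/L2 parameters (none of these numbers come from the slides):

    /* Hedged sketch: two-level AMAT using the formulas on this slide. */
    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,  local_miss_l1 = 0.05;  /* L1 hit time and local miss rate (assumed) */
        double hit_l2 = 10.0, local_miss_l2 = 0.20;  /* L2 hit time and local miss rate (assumed) */
        double mem_penalty = 100.0;                  /* main-memory miss penalty in cycles (assumed) */

        double miss_penalty_l1 = hit_l2 + local_miss_l2 * mem_penalty;  /* 30 cycles */
        double amat = hit_l1 + local_miss_l1 * miss_penalty_l1;         /* 2.5 cycles */

        printf("L1 miss penalty = %.1f cycles, AMAT = %.2f cycles\n", miss_penalty_l1, amat);
        return 0;
    }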

Anshul Kumar, CSE IITD slide 6 Misses in Multilevel Cache
Local miss rate – no. of misses / no. of requests, as seen at a level
Global miss rate – no. of misses / no. of requests, on the whole
Solo miss rate – miss rate if only this cache was present

Anshul Kumar, CSE IITD slide 7 Two level cache miss example
A: hit in L1, would hit in L2
B: miss in L1, hit in L2
C: hit in L1, would miss in L2
D: miss in L1, miss in L2
Local miss (L1) = (B+D)/(A+B+C+D)
Local miss (L2) = D/(B+D)
Global miss = D/(A+B+C+D)
Solo miss (L2) = (C+D)/(A+B+C+D)

Anshul Kumar, CSE IITD slide 8 Critical Word First and Early Restart
Read policy
Load policy
More effective when block size is large

Anshul Kumar, CSE IITD slide 9 Read Miss Priority Over Write
Provide write buffers
Processor writes into buffer and proceeds (for write through as well as write back)
On read miss
– wait for buffer to be empty, or
– check addresses in buffer for conflict
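The second option amounts to an associative search of the write buffer on every read miss. A minimal sketch, assuming a hypothetical 8-entry buffer holding block addresses (structure and names are illustrative, not a description of any particular machine):

    /* Hedged sketch: on a read miss, the buffered writes are checked for an address
     * conflict; only if no entry matches may the read bypass the pending writes. */
    #include <stdbool.h>
    #include <stdint.h>

    #define WBUF_ENTRIES 8

    typedef struct { bool valid; uint32_t block_addr; } WBufEntry;
    static WBufEntry wbuf[WBUF_ENTRIES];

    /* Returns true if the read miss may be serviced ahead of the buffered writes. */
    bool read_may_bypass_writes(uint32_t miss_block_addr) {
        for (int i = 0; i < WBUF_ENTRIES; i++)
            if (wbuf[i].valid && wbuf[i].block_addr == miss_block_addr)
                return false;   /* conflict: wait for the buffer (or forward from it) */
        return true;
    }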

Anshul Kumar, CSE IITD slide 10 Merging Write Buffer Merge writes belonging to same block in case of write through

Anshul Kumar, CSE IITD slide 11 Victim Cache (proposed by Jouppi)
Evicted blocks are recycled
Much faster than getting a block from the next level
Size = 1 to 5 blocks
A significant fraction of misses may be found in victim cache
[diagram: victim cache placed beside the cache, on the path from memory to processor]
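A hedged sketch of the lookup path with a victim cache (tags only, no data; the direct-mapped geometry, victim-cache size and names are assumptions for illustration):

    /* Hedged sketch: probe the main (direct-mapped) cache first; on a miss, probe the
     * small fully associative victim cache and, on a victim hit, swap the block back. */
    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 256         /* direct-mapped main cache (assumed size) */
    #define VICTIM_BLKS 4    /* victim cache of 1 to 5 blocks, here 4 (assumed) */

    typedef struct { bool valid; uint32_t block_addr; } Line;

    static Line cache[SETS];            /* indexed by block_addr % SETS */
    static Line victim[VICTIM_BLKS];    /* fully associative */

    /* Returns true on a hit in either the main cache or the victim cache. */
    bool lookup(uint32_t block_addr) {
        uint32_t index = block_addr % SETS;

        if (cache[index].valid && cache[index].block_addr == block_addr)
            return true;                               /* ordinary hit */

        for (int i = 0; i < VICTIM_BLKS; i++) {
            if (victim[i].valid && victim[i].block_addr == block_addr) {
                Line evicted = cache[index];           /* recycle: swap the two blocks */
                cache[index] = victim[i];
                victim[i] = evicted;
                return true;                           /* much faster than the next level */
            }
        }
        return false;                                  /* real miss: go to the next level */
    }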

Anshul Kumar, CSE IITD slide 12 Reducing Miss Rate
Large block size
Larger cache
Higher associativity
Way prediction and pseudo-associative cache
Warm start in multi-tasking
Compiler optimizations

Anshul Kumar, CSE IITD slide 13 Large Block Size
Reduces compulsory misses
Too large block size - misses increase
Miss penalty increases

Anshul Kumar, CSE IITD slide 14 Large Cache
Reduces capacity misses
Hit time increases
Keep small L1 cache and large L2 cache

Anshul Kumar, CSE IITD slide 15 Higher Associativity
Reduces conflict misses
8-way is almost like fully associative
Hit time increases

Anshul Kumar, CSE IITD slide 16 Way Prediction and Pseudo-associative Cache
Way prediction: low miss rate of SA cache with hit time of DM cache
Only one tag is compared initially
Extra bits are kept for prediction
Hit time in case of mis-prediction is high
Pseudo-assoc. or column assoc. cache: get advantage of SA cache in a DM cache
Check sequentially in a pseudo-set
Fast hit and slow hit
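A minimal sketch of the pseudo-associative probe, assuming the common convention of re-probing the location whose index has its most significant bit inverted (all sizes and names are illustrative):

    /* Hedged sketch: fast hit at the primary index, slow hit at the pseudo-set
     * obtained by flipping the top index bit, otherwise a miss. */
    #include <stdbool.h>
    #include <stdint.h>

    #define SETS 1024    /* direct-mapped cache (assumed size, power of two) */

    typedef struct { bool valid; uint32_t block_addr; } Line;
    static Line cache[SETS];

    /* Returns 1 for a fast hit, 2 for a slow hit in the pseudo-set, 0 for a miss. */
    int probe(uint32_t block_addr) {
        uint32_t index = block_addr % SETS;

        if (cache[index].valid && cache[index].block_addr == block_addr)
            return 1;                              /* fast hit */

        uint32_t pseudo = index ^ (SETS >> 1);     /* invert the MSB of the index */
        if (cache[pseudo].valid && cache[pseudo].block_addr == block_addr)
            return 2;                              /* slow hit: second, sequential check */

        return 0;                                  /* miss */
    }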

Anshul Kumar, CSE IITD slide 17 Warm Start in Multi-tasking
Cold start
– process starts with empty cache
– blocks of previous process invalidated
Warm start
– some blocks from previous activation are still available

Anshul Kumar, CSE IITD slide 18 Compiler optimizations
Loop interchange
– improve spatial locality by scanning arrays row-wise
Blocking
– improve temporal and spatial locality
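A small sketch of loop interchange on a row-major C array (sizes are illustrative): both versions compute the same sum, but the interchanged one scans memory with stride 1 and so exploits spatial locality.

    /* Hedged sketch: column-order traversal (stride N) vs. the interchanged
     * row-order traversal (stride 1) of the same row-major array. */
    #define N 1024
    static double x[N][N];

    double sum_column_order(void) {        /* poor spatial locality */
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += x[i][j];
        return s;
    }

    double sum_row_order(void) {           /* after loop interchange */
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += x[i][j];
        return s;
    }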

Anshul Kumar, CSE IITD slide 19 Improving Locality Matrix Multiplication example

Anshul Kumar, CSE IITD slide 20 Cache Organization for the example Cache line (or block) = 4 matrix elements. Matrices are stored row wise. Cache can’t accommodate a full row/column. (In other words, L, M and N are so large w.r.t. the cache size that after an iteration along any of the three indices, when an element is accessed again, it results in a miss.) Ignore misses due to conflict between matrices. (as if there was a separate cache for each matrix.)

Anshul Kumar, CSE IITD slide 21 Matrix Multiplication : Code I
for (i = 0; i < L; i++)
  for (j = 0; j < M; j++)
    for (k = 0; k < N; k++)
      c[i][j] += A[i][k] * B[k][j];

           C       A       B
accesses   LM      LMN     LMN
misses     LM/4    LMN/4   LMN

Total misses = LM(5N+1)/4

Anshul Kumar, CSE IITD slide 22 Matrix Multiplication : Code II
for (k = 0; k < N; k++)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

           C       A     B
accesses   LMN     LN    LMN
misses     LMN/4   LN    LMN/4

Total misses = LN(2M+4)/4

Anshul Kumar, CSE IITD slide 23 Matrix Multiplication : Code III
for (i = 0; i < L; i++)
  for (k = 0; k < N; k++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k] * B[k][j];

           C       A      B
accesses   LMN     LN     LMN
misses     LMN/4   LN/4   LMN/4

Total misses = LN(2M+1)/4

Anshul Kumar, CSE IITD slide 24 Blocking
[diagram: blocked traversal of the j and k loops in tiles of size b]
5 nested loops, blocking factor = b

           C        A        B
accesses   LMN/b    LMN/b    LMN
misses     LMN/4b   LMN/4b   MN/4

Total misses = MN(2L/b+1)/4
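One possible shape of the five-loop blocked multiplication sketched above; the loop order, dimensions and blocking factor here are illustrative assumptions, not necessarily the exact variant the miss counts were derived for.

    /* Hedged sketch: tile the j and k loops with blocking factor b so the touched
     * pieces of A, B and c stay resident in the cache across reuse. */
    #define L 512
    #define M 512
    #define N 512
    #define BLK 32                       /* blocking factor b (assumed) */

    static double A[L][N], B[N][M], c[L][M];

    void matmul_blocked(void) {
        for (int kk = 0; kk < N; kk += BLK)
            for (int jj = 0; jj < M; jj += BLK)
                for (int i = 0; i < L; i++)
                    for (int k = kk; k < kk + BLK && k < N; k++)
                        for (int j = jj; j < jj + BLK && j < M; j++)
                            c[i][j] += A[i][k] * B[k][j];
    }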

Anshul Kumar, CSE IITD slide 25 Loop Blocking
for (k = 0; k < N; k += 4)
  for (i = 0; i < L; i++)
    for (j = 0; j < M; j++)
      c[i][j] += A[i][k]   * B[k][j]
               + A[i][k+1] * B[k+1][j]
               + A[i][k+2] * B[k+2][j]
               + A[i][k+3] * B[k+3][j];

           C        A      B
accesses   LMN/4    LN     LMN
misses     LMN/16   LN/4   LMN/4

Total misses = LN(5M/4+1)/4

Anshul Kumar, CSE IITD slide 26 Reducing Miss Penalty * Miss Rate
Non-blocking cache
Hardware prefetching
Compiler controlled prefetching

Anshul Kumar, CSE IITD slide 27 Non-blocking Cache
In OOO processor
Hit under a miss
– complexity of cache controller increases
Hit under multiple misses or miss under a miss
– memory should be able to handle multiple misses

Anshul Kumar, CSE IITD slide 28 Hardware Prefetching
Prefetch items before they are requested
– both data and instructions
What and when to prefetch?
– fetch two blocks on a miss (requested + next)
Where to keep prefetched information?
– in cache
– in a separate buffer (most common case)

Anshul Kumar, CSE IITD slide 29 Prefetch Buffer / Stream Buffer
[diagram: prefetch buffer placed beside the cache, on the path from memory to processor]

Anshul Kumar, CSE IITD slide 30 Hardware prefetching: Stream buffers
Jouppi's experiment [1990]:
Single instruction stream buffer catches 15% to 25% of misses from a 4 KB direct mapped instruction cache with 16 byte blocks
4 block buffer – 50%, 16 block buffer – 72%
Single data stream buffer catches 25% of misses from a 4 KB direct mapped cache
4 data stream buffers (each prefetching at a different address) – 43%

Anshul Kumar, CSE IITD slide 31 HW prefetching: UltraSPARC III example
64 KB data cache, 36.9 misses per 1000 instructions
22% of instructions make a data reference
hit time = 1, miss penalty = 15
prefetch hit rate = 20%
1 cycle to get data from prefetch buffer
What size of cache will give the same performance?
miss rate = 36.9/220 = 16.7%
av mem access time = 1 + (.167*.2*1) + (.167*.8*15) = 3.046
effective miss rate = (3.046 - 1)/15 = 13.6% => 256 KB cache
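A hedged re-computation of the slide's arithmetic (parameters as stated above; the 256 KB equivalence follows the slide's own reasoning, not a separate calculation):

    /* Hedged sketch: reproduce the miss rate, average memory access time and
     * effective miss rate computed on this slide. */
    #include <stdio.h>

    int main(void) {
        double misses_per_1000_instr = 36.9;
        double data_refs_per_instr = 0.22;      /* 22% of instructions reference data */
        double hit_time = 1.0, miss_penalty = 15.0;
        double prefetch_hit_rate = 0.20, prefetch_hit_time = 1.0;

        double miss_rate = misses_per_1000_instr / (1000.0 * data_refs_per_instr);   /* ~0.167 */
        double amat = hit_time
                    + miss_rate * prefetch_hit_rate * prefetch_hit_time
                    + miss_rate * (1.0 - prefetch_hit_rate) * miss_penalty;          /* ~3.046 */
        double effective_miss_rate = (amat - hit_time) / miss_penalty;               /* ~0.136 */

        printf("miss rate = %.3f, AMAT = %.3f, effective miss rate = %.3f\n",
               miss_rate, amat, effective_miss_rate);
        return 0;
    }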

Anshul Kumar, CSE IITD slide 32 Compiler Controlled Prefetching
Register prefetch / cache prefetch
Faulting / non-faulting (non-binding)
Semantically invisible (no change in registers or cache contents)
Makes sense if processor doesn't stall while prefetching (non-blocking cache)
Overhead of prefetch instruction should not exceed the benefit

Anshul Kumar, CSE IITD slide 33 SW Prefetch Example
8 KB direct mapped, write back data cache with 16 byte blocks; a is 3 × 100, b is 101 × 3
for (i = 0; i < 3; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = b[j][0] * b[j+1][0];
each array element is 8 bytes
misses in array a = 3 * 100 / 2 = 150
misses in array b = 101
total misses = 251

Anshul Kumar, CSE IITD slide 34 SW Prefetch Example – contd.
Suppose we need to prefetch 7 iterations in advance
for (j = 0; j < 100; j++) {
  prefetch(b[j+7][0]);
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)
  for (j = 0; j < 100; j++) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }
misses in first loop = 7 (for b[0..6][0]) + 4 (for a[0][0..6])
misses in second loop = 4 (for a[1][0..6]) + 4 (for a[2][0..6])
total misses = 19, total prefetches = 400

Anshul Kumar, CSE IITD slide 35 SW Prefetch Example – contd.
Performance improvement?
Assume no capacity and conflict misses; prefetches overlap with each other and with misses
Cycles per iteration: original loop 7, prefetch loops 9 and 8
Miss penalty = 100 cycles
Original loop = 300*7 + 251*100 = 27,200 cycles
1st prefetch loop = 100*9 + 11*100 = 2,000 cycles
2nd prefetch loop = 200*8 + 8*100 = 2,400 cycles
Speedup = 27200/(2000 + 2400) = 6.2

Anshul Kumar, CSE IITD slide 36 Reducing Hit Time
Small and simple caches
Avoid time loss in address translation
Pipelined cache access
Trace caches

Anshul Kumar, CSE IITD slide 37 Small and Simple Caches
Small size => faster access
Small size => fit on the chip, lower delay
Simple (direct mapped) => lower delay
Second level – tags may be kept on chip

Anshul Kumar, CSE IITD slide 38 Cache access time estimates using CACTI
0.8 micron technology, 1 R/W port, 32 b address, 64 b output, 32 B block

Anshul Kumar, CSE IITD slide 39 Avoid time loss in addr translation
Virtually indexed, physically tagged cache
– simple and effective approach
– possible only if cache is not too large
Virtually addressed cache
– protection?
– multiple processes?
– aliasing?
– I/O?

Anshul Kumar, CSE IITD slide 40 Cache Addressing
Physical address
– first convert virtual address into physical address, then access cache
– no time loss if index field available without address translation
Virtual address
– access cache directly using the virtual address

Anshul Kumar, CSE IITD slide 41 Problems with virtually addressed cache
Page level protection?
– copy protection info from TLB
Same virtual address from two different processes needs to be distinguished
– purge cache blocks on context switch or use PID tags along with other address tags
Aliasing (different virtual addresses from two processes pointing to same physical address) – inconsistency?
I/O uses physical addresses

Anshul Kumar, CSE IITD slide 42 Multi processes in virtually addr cache
Purge cache blocks on context switch
Use PID tags along with other address tags

Anshul Kumar, CSE IITD slide 43 Inconsistency in virtually addr cache
Hardware solution (Alpha 21264)
– 64 KB cache, 2-way set associative, 8 KB page
– a block with a given offset in a page can map to 8 locations in cache
– check all 8 locations, invalidate duplicate entries
Software solution (page coloring)
– make 18 lsbs of all aliases the same – ensures that a direct mapped cache ≤ 256 KB has no duplicates
– i.e., 4 KB pages are mapped to 64 sets (or colors)
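A tiny sketch of the page-coloring rule (the function name is made up for illustration): two virtual addresses that may alias index the same block of a direct mapped cache of up to 256 KB only if they agree in their 18 least significant bits.

    /* Hedged sketch: check the page-coloring constraint described on this slide. */
    #include <stdbool.h>
    #include <stdint.h>

    bool aliases_index_same_block(uint64_t va1, uint64_t va2) {
        const uint64_t mask = (1u << 18) - 1;   /* 18 lsbs cover a 256 KB direct-mapped cache */
        return (va1 & mask) == (va2 & mask);
    }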

Anshul Kumar, CSE IITD slide 44 Pipelined Cache Access
Multi-cycle cache access but pipelined
– reduces cycle time but hit time is more than one cycle
Pentium 4 takes 4 cycles
– greater penalty on branch misprediction
– more clock cycles between issue of load and use of data

Anshul Kumar, CSE IITD slide 45 Trace Caches
What maps to a cache block?
– not statically determined
– decided by the dynamic sequence of instructions, including predicted branches
Used in Pentium 4 (NetBurst architecture)
Starting addresses are not restricted to word size * powers of 2
Better utilization of cache space
Downside – the same instruction may be stored multiple times