1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 2, 2005 Mon, Nov 7, 2005 Topic: Caches (contd.)

2 Outline
 Cache Organization
 Cache Read/Write Policies
  Block replacement policies
  Write-back vs. write-through caches
  Write buffers
 Cache Performance
  Means of improving performance
Reading: HP3 Sections

3 Review: 4 Questions for Mem Hierarchy
 Where can a block be placed in the upper level? (Block placement)
 How is a block found if it is in the upper level? (Block identification)
 Which block should be replaced on a miss? (Block replacement)
 What happens on a write? (Write strategy)

4 Review: Cache Shapes (in each case A × S = 16 blocks total)
 Direct-mapped (A = 1, S = 16)
 2-way set-associative (A = 2, S = 8)
 4-way set-associative (A = 4, S = 4)
 8-way set-associative (A = 8, S = 2)
 Fully associative (A = 16, S = 1)

5 Example 1: 1KB, Direct-Mapped, 32B Blocks
 For a 1024 (2^10) byte cache with 32-byte blocks:
  The uppermost 22 (= 32 − 10) address bits are the Cache Tag, stored as part of the cache "state"
  The lowest 5 address bits are the Byte Select (Block Size = 2^5)
  The next 5 address bits (bit5 - bit9) are the Cache Index
[Figure: direct-mapped cache array; each line holds a Valid bit, a stored Cache Tag (e.g., 0x50), and a 32-byte data block (Byte 0 … Byte 1023 across the 32 lines); the incoming address splits into Cache Tag (e.g., 0x50), Cache Index (e.g., 0x01), and Byte Select (e.g., 0x00)]
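A small C sketch of this address decomposition (the function and constant names are mine, not from the slides; assumes 32-bit addresses and the 1KB/32B geometry above):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 5   /* 32-byte blocks -> 5 byte-select bits        */
    #define INDEX_BITS 5   /* 1KB / 32B = 32 blocks -> 5 index bits       */

    /* Split a 32-bit address into (tag, index, byte-select) fields. */
    static void split_address(uint32_t addr, uint32_t *tag,
                              uint32_t *index, uint32_t *byte_sel)
    {
        *byte_sel = addr & ((1u << BLOCK_BITS) - 1);
        *index    = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
        *tag      = addr >> (BLOCK_BITS + INDEX_BITS);  /* remaining 22 bits */
    }

    int main(void)
    {
        uint32_t tag, index, byte_sel;
        /* Address built from the slide's example fields: tag 0x50, index 0x01 */
        split_address(0x00014020u, &tag, &index, &byte_sel);
        printf("tag=0x%x index=0x%x byte=0x%x\n", tag, index, byte_sel);
        return 0;   /* prints: tag=0x50 index=0x1 byte=0x0 */
    }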

6 Example 1a: Cache Miss; Empty Block
[Figure: access with Cache Tag 0x0002fe, Cache Index 0x00, Byte Select 0x00; the indexed line's Valid bit is 0, so the tag comparison is irrelevant and the access is a cache miss on an empty (invalid) block]

7 Example 1b: … Read in Data
[Figure: the missing 32-byte block is fetched from memory into the indexed line, the Valid bit is set to 1, and the line's tag field is written with 0x0002fe]

8 Example 1c: Cache Hit
[Figure: a subsequent access to the same block, Byte Select 0x08; the indexed line is valid and its stored tag (0x0002fe) matches the address tag, so the access is a cache hit and the selected byte is returned]

9 Example 1d: Cache Miss; Incorrect Block
[Figure: access with Cache Tag 0x002450, Byte Select 0x04; the indexed line is valid but holds tag 0x0002fe, so the tags mismatch and the access is a cache miss on an incorrect (conflicting) block]

10 Example 1e: … Replace Block
[Figure: the old block is evicted, a new 32-byte block of data is read in from memory, and the line's tag is updated to 0x002450]

11 Cache Performance

12 Block Size Tradeoff
 In general, a larger block size takes advantage of spatial locality, BUT:
  Larger block size means larger miss penalty: it takes longer to fill up the block
  If block size is too big relative to cache size, miss rate will go up: too few cache blocks compromises temporal locality
 Average Access Time = Hit Time + Miss Rate × Miss Penalty
[Figure: three plots vs. block size; Miss Rate first falls (exploiting spatial locality) then rises (fewer blocks), Miss Penalty grows steadily, and Average Access Time is minimized at an intermediate block size]
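As a quick sanity check of the formula, a minimal C sketch; the hit time, miss rate, and penalty below are made-up illustrative values, not measurements from the slides:

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Illustrative: 1-cycle hit, 5% miss rate, 40-cycle miss penalty */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 40.0)); /* 3.00 */
        return 0;
    }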

13 Sources of Cache Misses
 Compulsory (cold start or process migration, first reference): first access to a block
  "Cold" fact of life: not a whole lot you can do about it
 Conflict/Collision/Interference: multiple memory locations mapped to the same cache location
  Solution 1: Increase cache size
  Solution 2: Increase associativity
 Capacity: cache cannot contain all blocks accessed by the program
  Solution 1: Increase cache size
  Solution 2: Restructure program
 Coherence/Invalidation: other process (e.g., I/O) updates memory

14 The 3C Model of Cache Misses
 Based on comparison with another cache:
  Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses. (Misses in infinite cache)
  Capacity: If the cache cannot contain all the blocks needed during execution of a program (its working set), capacity misses will occur due to blocks being discarded and later retrieved. (Misses in fully associative cache of size X)
  Conflict: If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses or interference misses. (Misses in A-way associative cache of size X but not in fully associative cache of size X)

15 Sources of Cache Misses

                      Direct Mapped   N-way Set Associative   Fully Associative
    Cache Size        Big             Medium                  Small
    Compulsory Miss   Same            Same                    Same
    Conflict Miss     High            Medium                  Zero
    Capacity Miss     Low(er)         Medium                  High
    Invalidation Miss Same            Same                    Same

If you are going to run "billions" of instructions, compulsory misses are insignificant.

16 3Cs Absolute Miss Rate
[Figure: absolute miss rate vs. cache size, decomposed into compulsory, capacity, and conflict components, with the conflict component shown for several associativities]

17 3Cs Relative Miss Rate
[Figure: the same decomposition normalized, showing each component as a fraction of the total miss rate]

18 How to Improve Cache Performance
 Latency
  Reduce miss rate
  Reduce miss penalty
  Reduce hit time
 Bandwidth
  Increase hit bandwidth
  Increase miss bandwidth

19 1. Reduce Misses via Larger Block Size

20 2. Reduce Misses via Higher Associativity
 2:1 Cache Rule: Miss Rate of a direct-mapped cache of size N ≈ Miss Rate of a 2-way set-associative cache of size N/2
  Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM, 28(2):202-208, 1985
 Beware: Execution time is the only final measure!
  Will clock cycle time increase?
  Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way

21 Example: Avg. Memory Access Time vs. Miss Rate
Example: assume clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the clock cycle time of a direct-mapped cache.
[Table: average memory access time for each cache size and associativity; red entries mean A.M.A.T. is not improved by more associativity]

22 3. Reduce Conflict Misses via Victim Cache
 Goal: keep the fast hit time of direct-mapped, yet avoid conflict misses
 Add a small, highly associative buffer to hold data discarded from the cache
 Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
[Figure: CPU backed by a direct-mapped cache (tag/data arrays) with a small fully associative victim cache between it and memory; both are probed on an access]
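Below is a minimal software model of the idea; sizes and names are illustrative, and a real victim cache is a hardware buffer probed in parallel with the main cache (and would use LRU among its entries rather than the FIFO pointer used here):

    #include <stdbool.h>
    #include <stdint.h>

    #define INDEX_BITS 7                    /* 128 lines (illustrative)   */
    #define NLINES  (1u << INDEX_BITS)
    #define NVICTIM 4                       /* Jouppi's 4-entry buffer    */

    typedef struct { bool valid; uint32_t tag;  } Line;
    typedef struct { bool valid; uint32_t addr; } VictimEntry; /* full block addr */

    static Line        cache[NLINES];
    static VictimEntry victim[NVICTIM];
    static unsigned    vnext;               /* FIFO pointer (real HW: LRU) */

    /* block_addr = address >> block-offset bits. Returns true on a hit
       (either in the main cache or, more slowly, in the victim cache). */
    static bool access_block(uint32_t block_addr)
    {
        uint32_t index = block_addr & (NLINES - 1);
        uint32_t tag   = block_addr >> INDEX_BITS;

        if (cache[index].valid && cache[index].tag == tag)
            return true;                    /* fast direct-mapped hit     */

        for (unsigned i = 0; i < NVICTIM; i++) {
            if (victim[i].valid && victim[i].addr == block_addr) {
                /* Swap: promote the victim entry, demote the resident line */
                if (cache[index].valid)
                    victim[i].addr = ((uint32_t)cache[index].tag << INDEX_BITS) | index;
                else
                    victim[i].valid = false;
                cache[index].valid = true;
                cache[index].tag   = tag;
                return true;                /* slow hit via victim cache  */
            }
        }
        /* Miss everywhere: evicted resident line moves to the victim buffer */
        if (cache[index].valid) {
            victim[vnext].valid = true;
            victim[vnext].addr  = ((uint32_t)cache[index].tag << INDEX_BITS) | index;
            vnext = (vnext + 1) % NVICTIM;
        }
        cache[index].valid = true;
        cache[index].tag   = tag;
        return false;
    }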

23 4. Reduce Conflict Misses via Pseudo-Assoc.
 Goal: combine the fast hit time of direct-mapped with the lower conflict misses of a 2-way SA cache
 Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
 Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
  Better for caches not tied directly to the processor
[Figure: access timeline showing Hit Time < Pseudo Hit Time < Miss Penalty]
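A sketch of the probe sequence in the column-associative style, where the "other half" is the line whose top index bit is flipped. It reuses the includes and the NLINES constant from the victim-cache sketch above; all names are illustrative. Storing the full block address per line sidesteps the extra tag bit a real design needs to distinguish a line's two possible homes:

    typedef struct { bool valid; uint32_t addr; } PLine;
    static PLine pcache[NLINES];

    /* Returns 1 for a fast hit, 2 for a slow (pseudo) hit, 0 for a miss. */
    static int pseudo_lookup(uint32_t block_addr)
    {
        uint32_t index = block_addr & (NLINES - 1);
        uint32_t alt   = index ^ (NLINES >> 1);   /* flip MSB of index   */

        if (pcache[index].valid && pcache[index].addr == block_addr)
            return 1;                             /* fast hit            */
        if (pcache[alt].valid && pcache[alt].addr == block_addr) {
            PLine tmp = pcache[index];            /* swap the two lines  */
            pcache[index] = pcache[alt];          /* so the next access  */
            pcache[alt]   = tmp;                  /* to this block is fast */
            return 2;                             /* pseudo (slow) hit   */
        }
        return 0;                                 /* miss: fetch block   */
    }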

24 5. Reduce Misses by Hardware Prefetching
 Instruction prefetching
  Alpha 21064 fetches 2 blocks on a miss
  Extra block placed in a stream buffer
  On a miss, check the stream buffer
 Works with data blocks too
  Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 stream buffers caught 43%
  Palacharla & Kessler [1994]: for scientific programs, 8 stream buffers caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
 Prefetching relies on extra memory bandwidth that can be used without penalty
  e.g., up to 8 prefetch stream buffers in the UltraSPARC III
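The stream-buffer behavior can be sketched in software as well; this is a toy model, not the actual Alpha hardware, and the 4-deep buffer of block addresses is an assumption:

    #include <stdbool.h>
    #include <stdint.h>

    #define SBUF_DEPTH 4
    static uint32_t sbuf[SBUF_DEPTH];        /* block addresses, [0] = head */
    static bool     sbuf_valid[SBUF_DEPTH];

    /* On a cache miss, check the stream buffer head.  On a stream-buffer
       hit, shift the buffer up and prefetch the next sequential block;
       otherwise restart the buffer at the blocks after this one. */
    static bool stream_buffer_check(uint32_t miss_block)
    {
        if (sbuf_valid[0] && sbuf[0] == miss_block) {
            for (int i = 0; i + 1 < SBUF_DEPTH; i++) {       /* shift up */
                sbuf[i]       = sbuf[i + 1];
                sbuf_valid[i] = sbuf_valid[i + 1];
            }
            sbuf[SBUF_DEPTH - 1]       = miss_block + SBUF_DEPTH; /* prefetch */
            sbuf_valid[SBUF_DEPTH - 1] = true;
            return true;     /* block supplied by the stream buffer */
        }
        for (int i = 0; i < SBUF_DEPTH; i++) {  /* restart sequentially */
            sbuf[i]       = miss_block + 1 + (uint32_t)i;
            sbuf_valid[i] = true;
        }
        return false;        /* block must come from memory */
    }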

25 6. Reducing Misses by Software Prefetching
 Data prefetch
  Compiler inserts special "prefetch" instructions into the program
   Register Prefetch: load data into a register (HP PA-RISC loads)
   Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
  A form of speculative execution: we don't really know if the data is needed, or whether it is in the cache already
  The most effective prefetches are "semantically invisible" to the program:
   they do not change registers or memory
   they cannot cause a fault/exception: if they would fault, they are simply turned into NOPs
 Issuing prefetch instructions takes time
  Is the cost of issuing prefetches < the savings in reduced misses?
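As a concrete illustration of a cache prefetch written by hand, here is a sketch using GCC/Clang's __builtin_prefetch intrinsic (a compiler builtin, not the HP PA-RISC or MIPS IV instructions the slide names; the prefetch distance of 16 is an assumed tuning value):

    /* Fetch a[] elements a fixed distance ahead of their use.  The
       distance would be tuned to cover memory latency without
       polluting the cache.  Prefetches that would fault are ignored. */
    double sum_with_prefetch(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
            sum += a[i];
        }
        return sum;
    }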

26 7. Reduce Misses by Compiler Optzns.
 Instructions
  Reorder procedures in memory so as to reduce misses
  Use profiling to look at conflicts
  McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
 Data (examples on the following slides)
  Merging Arrays: improve spatial locality with a single array of compound elements vs. 2 arrays
  Loop Interchange: change nesting of loops to access data in the order it is stored in memory
  Loop Fusion: combine two independent loops that have the same looping and some variables in common
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

27 Merging Arrays Example

    /* Before */
    int val[SIZE];
    int key[SIZE];

    /* After */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

 Reduces conflicts between val and key
 Addressing expressions are different

28 Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k++)
        for (j = 0; j < 100; j++)
            for (i = 0; i < 5000; i++)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k++)
        for (i = 0; i < 5000; i++)
            for (j = 0; j < 100; j++)
                x[i][j] = 2 * x[i][j];

 Sequential accesses instead of striding through memory every 100 words

29 Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

 Before: 2 misses per access to a and c
 After: 1 miss per access to a and c

30 Blocking Example

    /* Before */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

 Two inner loops:
  Read all NxN elements of z[]
  Read N elements of 1 row of y[] repeatedly
  Write N elements of 1 row of x[]
 Capacity misses are a function of N and cache size
  If the cache can hold all 3 NxN matrices, no capacity misses; otherwise...
 Idea: compute on a BxB submatrix that fits in the cache

31 Blocking Example (contd.)
 Age of accesses:
  White means not touched yet
  Light gray means touched a while ago
  Dark gray means newer accesses
[Figure: snapshots of x, y, and z shaded by access age for the unblocked loop]

32 Blocking Example (contd.)

    /* After -- x[][] assumed zero-initialized, since it accumulates.
       (The loop bounds use min(jj+B, N), fixing the off-by-one in the
       commonly reprinted min(jj+B-1, N) version, which skips the last
       column of each block.) */
    for (jj = 0; jj < N; jj = jj + B)
        for (kk = 0; kk < N; kk = kk + B)
            for (i = 0; i < N; i++)
                for (j = jj; j < min(jj + B, N); j++) {
                    r = 0;
                    for (k = kk; k < min(kk + B, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

 Work with BxB submatrices
  smaller working set can fit within the cache
  fewer capacity misses

33 Blocking Example (contd.)
 Capacity required goes from (2N³ + N²) down to (2N³/B + N²)
 B = "blocking factor"
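A hedged sketch of where those totals come from, counting memory words touched in the worst case (the usual accounting for this example):

    \underbrace{N^3}_{z \text{ reread for each } i}
    \;+\; \underbrace{N^3}_{y\text{'s rows reread for each } j}
    \;+\; \underbrace{N^2}_{x \text{ written once}}
    \;=\; 2N^3 + N^2

With B×B blocking, each tile of y and z is reused from the cache across a whole block of the computation, cutting both cubic terms by a factor of B:

    \frac{2N^3}{B} + N^2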