COMP 740: Computer Architecture and Implementation


COMP 740: Computer Architecture and Implementation
Montek Singh
Sep 14, 2016
Topic: Optimization of Cache Performance

Outline: cache performance; means of improving performance. Read textbook Appendix B.3 and Ch. 2.2.

How to Improve Cache Performance
Latency: reduce miss rate, reduce miss penalty, reduce hit time.
Bandwidth: increase hit bandwidth, increase miss bandwidth.

1. Reduce Misses via Larger Block Size
Figure B.10: miss rate versus block size for five different-sized caches. Note that miss rate actually goes up if the block size is too large relative to the cache size. Each line represents a different cache size. Figure B.11 shows the data used to plot these lines. Unfortunately, SPEC2000 traces would take too long if block size were included, so these data are based on SPEC92 on a DECstation 5000 [Gee et al. 1993].

2. Reduce Misses by Increasing Cache Size
Increasing cache size reduces cache misses: both capacity misses and conflict misses are reduced.

3. Reduce Misses via Higher Associativity
2:1 Cache Rule: the miss rate of a direct-mapped cache of size N ≈ the miss rate of a 2-way set-associative cache of size N/2.
Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized Efficiency of List Update and Paging Rules", CACM 28(2):202-208, 1985.
Beware: execution time is the only final measure! Will clock cycle time increase? Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way.

Example: Avg. Memory Access Time vs. Miss Rate
Example: assume the clock cycle time is 1.10× that of direct-mapped for 2-way, 1.12× for 4-way, and 1.14× for 8-way. (In the accompanying table, red means AMAT is not improved by higher associativity.)

4. Miss Penalty Reduction: L2 Cache
L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
Definitions:
Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2).
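To make the equations concrete, here is a quick check in C; all the numbers are made up for illustration and are not from the lecture:

#include <stdio.h>

int main(void) {
    double hit_l1          = 1.0;    /* L1 hit time, cycles        */
    double miss_rate_l1    = 0.05;   /* local = global for L1      */
    double hit_l2          = 10.0;   /* L2 hit time, cycles        */
    double miss_rate_l2    = 0.40;   /* LOCAL L2 miss rate         */
    double miss_penalty_l2 = 100.0;  /* main-memory access, cycles */

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    printf("L1 miss penalty     = %.1f cycles\n", miss_penalty_l1);      /* 50.0  */
    printf("AMAT                = %.2f cycles\n", amat);                 /* 3.50  */
    printf("Global L2 miss rate = %.3f\n", miss_rate_l1 * miss_rate_l2); /* 0.020 */
    return 0;
}

Note how the local L2 miss rate (40%) looks alarming while the global rate (2%) is small: L2 only ever sees the accesses that L1 has already filtered.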

5. Reducing Miss Penalty: Read Priority over Write on Miss
Goal: allow reads to be served before earlier writes have completed.
Write-through caches with write buffers: RAW conflicts arise between buffered writes and reads on cache misses. Simply waiting for the write buffer to empty might increase the read miss penalty by 50% (old MIPS 1000). Instead, check the write buffer contents before the read; if there are no conflicts, let the memory access continue (a sketch of this check follows).
Write-back caches: a read miss may replace a dirty block. Normal: write the dirty block to memory, then do the read. Instead: copy the dirty block to a write buffer, do the read, and then do the write. The CPU stalls less, since it restarts as soon as the read completes.
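As a rough illustration of the write-through case, a sketch of the conflict check in C; the buffer size and field names are hypothetical, not from any real controller:

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

typedef struct {
    bool     valid;
    uint32_t block_addr;   /* block-aligned address of the pending write */
    uint32_t data;
} WBEntry;

static WBEntry write_buffer[WB_ENTRIES];

/* On a read miss: the read may go to memory ahead of the buffered writes
   only if no pending write targets the same block (no RAW hazard). */
bool read_may_bypass(uint32_t block_addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
            return false;  /* conflict: wait for (or forward from) the write */
    return true;
}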

Summary of Basic Optimizations
Six basic cache optimizations:
1. Larger block size: reduces compulsory misses; increases capacity and conflict misses, increases miss penalty.
2. Larger total cache capacity: reduces miss rate; increases hit time, increases power consumption.
3. Higher associativity: reduces conflict misses.
4. More cache levels: reduces overall memory access time.
5. Giving priority to read misses over writes: reduces miss penalty.
6. Avoiding address translation in cache indexing (later): reduces hit time.

More advanced optimizations

1. Fast Hit Times via Small, Simple Caches
Simple caches can be faster: cache hit time is increasingly a bottleneck to CPU performance. Set associativity requires complex tag matching, hence is slower; direct-mapped caches are simpler, hence faster, allowing shorter CPU cycle times, and the tag check can be overlapped with transmission of the data.
Smaller caches can be faster: they can fit on the same chip as the CPU, avoiding the penalty of going off-chip. For L2 caches, a compromise: keep tags on chip and data off chip, giving a fast tag check yet greater cache capacity. The L1 data cache was reduced from 16KB in the Pentium III to 8KB in the Pentium 4.

Simple and small is fast
[Figure: access time vs. cache size and associativity]

Simple and small is energy-efficient
[Figure: energy per read vs. cache size and associativity]

2. Way Prediction
Way prediction to improve hit time. Goal: reduce conflict misses, yet maintain the hit speed of a direct-mapped cache.
Approach: keep extra bits to predict the "way" within the set; the output multiplexor is pre-set to select the predicted block. If that block is the correct one, fast hit time of 1 clock cycle; if not, check the other blocks in a 2nd clock cycle. A misprediction gives a longer hit time.
Prediction accuracy: > 90% for two-way, > 80% for four-way; the I-cache has better accuracy than the D-cache.
First used on the MIPS R10000 in the mid-90s; used on the ARM Cortex-A8.

2a. Way Selection
Extension of way prediction. Idea: instead of pre-setting the output multiplexor to select the correct block out of many, only the ONE predicted block is actually read from the cache.
Pros: energy efficient, since only one block is read (assuming the prediction is correct).
Cons: longer latency on a misprediction; if the prediction was wrong, the other block(s) have to be read and their tags checked. A sketch of both ideas follows.
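A minimal sketch in C for a 2-way set-associative cache; the geometry and structure names are hypothetical. Probing only the predicted way first is what makes way selection energy-efficient, and the second probe is what costs the extra cycle:

#include <stdbool.h>
#include <stdint.h>

#define SETS 64            /* hypothetical geometry: 64 sets, 2 ways, */
#define WAYS 2             /* 64-byte blocks                          */

typedef struct { bool valid; uint32_t tag; } Line;

static Line    cache[SETS][WAYS];
static uint8_t predicted_way[SETS];    /* the extra prediction bits */

/* Returns the number of cycles the lookup took: 1 on a correctly
   predicted hit, 2 otherwise (miss handling not shown). */
int lookup(uint32_t addr, bool *hit) {
    uint32_t set = (addr >> 6) % SETS;  /* 6 offset bits for 64B blocks */
    uint32_t tag = addr >> 12;          /* 6 offset + 6 index bits      */
    int w = predicted_way[set];

    if (cache[set][w].valid && cache[set][w].tag == tag) {
        *hit = true;
        return 1;                       /* fast hit: only one way read */
    }
    w ^= 1;                             /* probe the other way         */
    if (cache[set][w].valid && cache[set][w].tag == tag) {
        predicted_way[set] = w;         /* retrain the predictor       */
        *hit = true;
        return 2;                       /* slow hit                    */
    }
    *hit = false;
    return 2;                           /* miss: go to the next level  */
}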

3. Pipelining the Cache
Pipeline cache access to improve bandwidth. For a faster clock cycle time: allow the L1 hit time to be multiple clock cycles (instead of 1 cycle) and pipeline the cache, so it still has high bandwidth.
Examples: Pentium: 1 cycle; Pentium Pro - Pentium III: 2 cycles; Pentium 4 - Core i7: 4 cycles.
Cons: increases the number of pipeline stages for an instruction, giving a longer branch misprediction penalty and more clock cycles between a load and receiving the data.
Pros: allows a faster clock rate for the processor and makes it easier to increase associativity.

4. Non-blocking Caches
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss.
"Hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU.
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses (see the MSHR sketch below).
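The usual bookkeeping structure is a set of Miss Status Holding Registers (MSHRs). A rough sketch in C, with hypothetical sizes and fields; a real MSHR also records which words and which destination registers each miss is waiting on:

#include <stdbool.h>
#include <stdint.h>

#define MSHRS 8

typedef struct { bool busy; uint32_t block_addr; } MSHR;

static MSHR mshr[MSHRS];

/* On a cache miss: merge with an MSHR already tracking this block
   (a secondary miss), else allocate a free MSHR (a primary miss);
   if all MSHRs are busy, the cache must finally stall. */
int register_miss(uint32_t block_addr) {
    int free_slot = -1;
    for (int i = 0; i < MSHRS; i++) {
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return i;                   /* merge: no new memory request */
        if (!mshr[i].busy && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                      /* out of MSHRs: stall */
    mshr[free_slot].busy = true;
    mshr[free_slot].block_addr = block_addr;
    return free_slot;                   /* new outstanding miss */
}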

Value of Hit Under Miss for SPEC
Hit under 1 miss, 2 misses, and 64 misses:
Hit under 1 miss: miss penalty reduced 9% for integer and 12.5% for floating-point programs.
Hit under 2 misses: the benefit is slightly higher, 10% and 16% respectively.
No further benefit at 64 misses.

5. Multibanked Caches
Organize the cache as independent banks to support simultaneous access. Originally, banks were used only for main memory; now common for L2 caches: the ARM Cortex-A8 supports 1-4 banks for L2; the Intel i7 supports 4 banks for L1 and 8 banks for L2.
Interleave banks according to block address, so consecutive blocks can be accessed in parallel (see the sketch below).
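Sequential interleaving is just a modulo on the block address. A one-line sketch in C (the bank count is illustrative):

#include <stdint.h>

#define NBANKS 4

/* Blocks 0,1,2,3,4,... land in banks 0,1,2,3,0,... so consecutive
   blocks can be read from different banks in the same cycle. */
static inline int bank_of(uint32_t block_addr) {
    return block_addr % NBANKS;
}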

6. Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU.
Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block. Also called "wrapped fetch" and "requested word first" (see the sketch below).
Generally useful only with large blocks. Spatial locality can be a problem: the CPU tends to want the next sequential word, so it is not clear how much early restart benefits.
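A small sketch of the wrapped-fetch order in C; the block size and names are illustrative:

#include <stdint.h>

#define WORDS_PER_BLOCK 8

/* Deliver the block's words starting at the requested (critical) word
   and wrapping around; out[0] is the word the CPU is stalled on, so it
   can restart after the first transfer instead of the eighth. */
void fetch_critical_first(const uint32_t block[WORDS_PER_BLOCK],
                          uint32_t critical,
                          uint32_t out[WORDS_PER_BLOCK]) {
    for (uint32_t i = 0; i < WORDS_PER_BLOCK; i++)
        out[i] = block[(critical + i) % WORDS_PER_BLOCK];
}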

7. Merging Write Buffer
Write buffers are used in both write-through and write-back caches: in write-through, each write is sent to the buffer so the memory update can happen in the background; in write-back, when a dirty block is replaced, the write is sent to the buffer.
Merging writes: when updating a location that is already pending in the write buffer, update that write buffer entry instead of creating a new entry.
[Figure: write buffer contents without and with write merging]

Merging Write Buffer (contd.)
Pros: reduces stalls due to the write buffer being full (see the sketch below).
But: I/O writes cannot be merged. With memory-mapped I/O, I/O writes become memory writes; these should not be merged, because I/O has different semantics: we want to keep each I/O event distinct.
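A sketch of the merge check in C, mimicking the classic four-entry, four-words-per-entry figure; all sizes and names are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4
#define WORDS      4       /* 4 words x 4 bytes = 16-byte entries */

typedef struct {
    bool     valid;
    uint32_t block_addr;   /* block number of the entry's first word */
    uint32_t data[WORDS];
    uint8_t  word_valid;   /* one valid bit per word */
} MergeEntry;

static MergeEntry wb[WB_ENTRIES];

/* Returns true if the write was buffered (merged or newly allocated),
   false if the buffer is full and the CPU must stall. */
bool buffer_write(uint32_t addr, uint32_t value) {
    uint32_t block = addr / (4 * WORDS);
    uint32_t word  = (addr / 4) % WORDS;
    int free_slot  = -1;

    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[word] = value;          /* merge into pending entry */
            wb[i].word_valid |= 1u << word;
            return true;
        }
        if (!wb[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;                          /* buffer full */
    wb[free_slot] = (MergeEntry){ .valid = true, .block_addr = block };
    wb[free_slot].data[word] = value;
    wb[free_slot].word_valid = 1u << word;
    return true;
}

Without merging, four sequential word writes occupy four entries; with merging they occupy one, which is why the buffer fills (and stalls the CPU) far less often.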

8. Reduce Misses by Compiler Optimizations
Instructions: reorder procedures in memory so as to reduce misses; use profiling to look at conflicts. McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks.
Data:
Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays.
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine two independent loops that have the same looping structure and some variables in common.
Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows.

Merging Arrays Example

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; the addressing expressions are different.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k++)
    for (j = 0; j < 100; j++)
        for (i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
    for (i = 0; i < 5000; i++)
        for (j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

Before: 2 misses per access to a and c. After: 1 miss per access to a and c.

Blocking Example

/* Before */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

The two inner loops read all N×N elements of z[], read N elements of one row of y[] repeatedly, and write N elements of one row of x[].
Capacity misses are a function of N and cache size: if the cache can hold all 3 N×N matrices, there are no capacity misses; otherwise, misses grow with N.
Idea: compute on a B×B submatrix that fits in the cache.

Blocking Example (contd.)
Age of accesses in the figure: white means not touched yet; light gray means touched a while ago; dark gray means newer accesses.

Blocking Example (contd.)

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i++)
            for (j = jj; j < min(jj+B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk+B, N); k++)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

Work with B×B submatrices: a smaller working set can fit within the cache, giving fewer capacity misses.

Blocking Example (contd.)
Capacity required goes from (2N^3 + N^2) to (2N^3/B + N^2). B is called the "blocking factor".

Summary: Compiler Optimizations to Reduce Cache Misses

9. Reduce Misses by Hardware Prefetching
Prefetching done by hardware, outside of the cache.
Instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a stream buffer, and on a miss the stream buffer is checked (see the sketch below).
Works with data blocks too: Jouppi [1990] found 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 stream buffers caught 43%. Palacharla & Kessler [1994] found that, for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches.
Prefetching relies on extra memory bandwidth that can be used without penalty, e.g., up to 8 prefetch stream buffers in the UltraSPARC III.
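A rough sketch of a single stream buffer in C; the depth and names are made up, and for brevity prefetched entries are marked valid immediately (a real buffer tracks in-flight fills):

#include <stdbool.h>
#include <stdint.h>

#define SB_DEPTH 4

typedef struct { bool valid; uint32_t block_addr; } SBEntry;

static SBEntry  stream_buf[SB_DEPTH];
static uint32_t next_prefetch;

/* Called on a cache miss. A hit at the head pops the buffer and keeps
   the sequential stream going; a miss flushes and restarts the stream. */
bool stream_buffer_check(uint32_t block_addr) {
    if (stream_buf[0].valid && stream_buf[0].block_addr == block_addr) {
        for (int i = 0; i < SB_DEPTH - 1; i++)   /* shift entries up    */
            stream_buf[i] = stream_buf[i + 1];
        stream_buf[SB_DEPTH - 1] =
            (SBEntry){ true, next_prefetch++ };  /* prefetch one more   */
        return true;                             /* serviced w/o a full memory stall */
    }
    next_prefetch = block_addr + 1;              /* restart the stream  */
    for (int i = 0; i < SB_DEPTH; i++)
        stream_buf[i] = (SBEntry){ true, next_prefetch++ };
    return false;
}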

Hardware Prefetching: Benefit
Fetch two blocks on a miss: the requested block plus the next sequential block.
[Figure: speedup from Pentium 4 hardware prefetching]

10. Reducing Misses by Software Prefetching
Data prefetch: the compiler inserts special "prefetch" instructions into the program. Two flavors: load data into a register (HP PA-RISC loads), or cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9).
A form of speculative execution: we don't really know whether the data will be needed, or whether it is already in the cache.
The most effective prefetches are "semantically invisible" to the program: they do not change registers or memory and cannot cause a fault/exception; if they would fault, they are simply turned into NOPs.
Issuing prefetch instructions takes time: is the cost of issuing prefetches less than the savings in reduced misses? Combine with loop unrolling and software pipelining (see the sketch below).
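A small example in C using GCC/Clang's __builtin_prefetch, which compiles to the target's non-binding prefetch instruction; the prefetch distance of 16 elements is a made-up tuning parameter that would have to be measured:

#define PF_DIST 16   /* how far ahead to prefetch; machine-dependent */

long sum_with_prefetch(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + PF_DIST < n)   /* rw=0: read; locality=1: low temporal reuse */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        sum += a[i];           /* non-binding: a useless prefetch cannot fault */
    }
    return sum;
}

In practice, the extra branch and issue slots argue for combining this with loop unrolling, exactly as the slide suggests.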

A couple of other optimizations

Reduce Conflict Misses via Victim Cache
How to keep the fast hit time of direct mapped yet avoid conflict misses? Add a small, highly associative buffer to hold data discarded from the cache (see the sketch below).
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache.
[Figure: CPU, direct-mapped cache with tag/data arrays, victim cache, and memory]
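A minimal sketch of the lookup-and-swap in C; the geometry and names are hypothetical (tags only, no data paths):

#include <stdbool.h>
#include <stdint.h>

#define SETS    64
#define VICTIMS 4

typedef struct { bool valid; uint32_t block_addr; } Line;

static Line dm[SETS];        /* direct-mapped main cache       */
static Line victim[VICTIMS]; /* small fully associative buffer */

bool vc_lookup(uint32_t addr) {
    uint32_t blk = addr >> 6;           /* assume 64-byte blocks */
    uint32_t set = blk % SETS;

    if (dm[set].valid && dm[set].block_addr == blk)
        return true;                    /* normal fast hit */

    for (int i = 0; i < VICTIMS; i++) {
        if (victim[i].valid && victim[i].block_addr == blk) {
            Line tmp  = dm[set];        /* swap: two conflicting blocks  */
            dm[set]   = victim[i];      /* can ping-pong between cache   */
            victim[i] = tmp;            /* and victim buffer             */
            return true;                /* slow hit (typically +1 cycle) */
        }
    }
    return false;  /* true miss: dm[set] would be evicted into the victim buffer */
}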

Reduce Conflict Misses via Pseudo-Associativity
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache? Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a "pseudo-hit" (slow hit); see the probe sketch below.
Drawback: the CPU pipeline design is hard if a hit can take 1 or 2 cycles; better for caches not tied directly to the processor.
[Figure: access time divided into hit time, pseudo-hit time, and miss penalty]
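A sketch of the two-probe sequence in C; the usual trick for locating the second ("pseudo") set is to flip the most significant index bit (names and sizes illustrative):

#include <stdbool.h>
#include <stdint.h>

#define SETS 64

typedef struct { bool valid; uint32_t block_addr; } Line;

static Line cache[SETS];

/* Probe the primary set first (fast hit), then the pseudo-set obtained
   by inverting the index MSB (slow hit, one extra cycle). */
bool pa_lookup(uint32_t addr, int *cycles) {
    uint32_t blk = addr >> 6;
    uint32_t set = blk % SETS;

    if (cache[set].valid && cache[set].block_addr == blk) {
        *cycles = 1;
        return true;                    /* fast hit */
    }
    uint32_t alt = set ^ (SETS >> 1);   /* flip MSB of the index */
    if (cache[alt].valid && cache[alt].block_addr == blk) {
        *cycles = 2;                    /* pseudo-hit; real designs often
                                           swap the two blocks here so the
                                           next access is a fast hit */
        return true;
    }
    *cycles = 2;
    return false;
}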

Fetching Subblocks to Reduce Miss Penalty
Don't have to load the full block on a miss: keep a valid bit per subblock to indicate validity (see the sketch below).
[Figure: a cache block with per-subblock valid bits]
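A tiny sketch of the bookkeeping in C (the subblock count is hypothetical): one tag covers the whole block, but each subblock keeps its own valid bit, so a miss need only fetch the missing subblock rather than the whole block:

#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS 4

typedef struct {
    bool     tag_valid;
    uint32_t tag;
    uint8_t  sub_valid;   /* one valid bit per subblock */
} SubblockLine;

/* A hit needs both a tag match and the addressed subblock present;
   a tag match with the bit clear triggers a (cheap) subblock fill. */
bool subblock_hit(const SubblockLine *line, uint32_t tag, int sub) {
    return line->tag_valid
        && line->tag == tag
        && (line->sub_valid & (1u << sub)) != 0;
}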

Review: Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

Summary