CS252 Graduate Computer Architecture
Lecture 22: Caching Optimizations
April 19th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences, University of California, Berkeley

Brief Discussion of Transactional Memory

LogTM: Log-based Transactional Memory
– Kevin Moore, Jayaram Bobba, Michelle Moravan, Mark Hill & David Wood
– Uses the cache-coherence protocol to detect transaction conflicts

Transactional interface:
– begin_transaction(): requests that subsequent statements form a transaction
– commit_transaction(): ends a successful transaction begun by the matching begin_transaction(); discards any transaction state saved for a potential abort
– abort_transaction(): transfers control to a previously registered conflict handler, which should undo and discard all work since the last begin_transaction()
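A minimal usage sketch of this interface, assuming only the three calls named above (the account type, the transfer routine, and the handler comment are illustrative, not from LogTM):

    /* Hedged sketch of the transactional interface described above. */
    extern void begin_transaction(void);
    extern void commit_transaction(void);
    extern void abort_transaction(void);

    typedef struct { long balance; } account_t;   /* illustrative type */

    void transfer(account_t *from, account_t *to, long amount) {
        begin_transaction();       /* subsequent statements form a transaction */
        from->balance -= amount;   /* old values logged so an abort can undo them */
        to->balance   += amount;
        commit_transaction();      /* success: discard the saved undo state */
        /* On a conflict (detected via the coherence protocol), control would
           instead reach the registered handler, which can call
           abort_transaction() to roll back to the last begin_transaction(). */
    }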

Specific Logging Mechanism
(Figure slide: the LogTM undo-log mechanism; diagram not captured in the transcript.)

Review: Cache Performance

Miss-oriented approach to memory access:
– CPU time = IC × (CPI_execution + Misses/Instruction × Miss penalty) × Cycle time

Separating out the memory component entirely:
– AMAT = Average Memory Access Time
– CPU time = IC × (CPI_ALUOps × ALUOps/Instruction + Memory accesses/Instruction × AMAT) × Cycle time
– AMAT = Hit time + Miss rate × Miss penalty

AMAT for a second-level cache:
– AMAT = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

Example: Impact of Cache on Performance

Suppose a processor executes at:
– Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
– 50% arith/logic, 30% ld/st, 20% control

Miss behavior:
– 10% of memory operations get a 50-cycle miss penalty
– 1% of instructions get the same miss penalty

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) × 0.10 (miss/DataMop) × 50 (cycles/miss)]
      + [1 (InstMop/ins) × 0.01 (miss/InstMop) × 50 (cycles/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1

So 65% of the time the processor is stalled waiting for memory!

AMAT = (1/1.3)×[1 + 0.01×50] + (0.3/1.3)×[1 + 0.1×50] = 2.54
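The arithmetic above is easy to check mechanically. A small C program under the slide's assumptions (1.3 memory accesses per instruction: 1 fetch plus 0.3 data ops):

    #include <stdio.h>

    int main(void) {
        double ideal_cpi = 1.1, penalty = 50.0;
        double data_ops = 0.30, data_miss = 0.10, inst_miss = 0.01;

        double cpi = ideal_cpi
                   + data_ops * data_miss * penalty   /* 1.5 data-stall cycles */
                   + 1.0 * inst_miss * penalty;       /* 0.5 inst-stall cycles */
        double amat = (1.0/1.3) * (1 + inst_miss * penalty)
                    + (0.3/1.3) * (1 + data_miss * penalty);

        /* prints: CPI = 3.1, stalled = 65%, AMAT = 2.54 */
        printf("CPI = %.1f, stalled = %.0f%%, AMAT = %.2f\n",
               cpi, 100.0 * (cpi - ideal_cpi) / cpi, amat);
        return 0;
    }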

What Is the Impact of a Harvard Architecture?

Unified vs. separate I&D (Harvard) statistics (given in H&P):
– 16KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%
– 32KB unified: aggregate miss rate = 1.99%

Which is better (ignoring the L2 cache)?
– Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
– Hit time = 1, miss time = 50
– Note that a data hit has 1 extra stall for the unified cache (only one port)

AMAT_Harvard = 75%×(1 + 0.64%×50) + 25%×(1 + 6.47%×50) = 2.05
AMAT_Unified = 75%×(1 + 1.99%×50) + 25%×(1 + 1 + 1.99%×50) = 2.24

(Diagram: a processor with split I- and D-caches vs. a processor with a unified L1, each backed by a unified L2.)

Recall: Reducing Misses

Classifying misses: the 3 Cs
– Compulsory: the first access to a block cannot hit, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses in even an infinite cache.)
– Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses occur as blocks are discarded and later retrieved. (Misses in a fully associative cache of size X.)
– Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved when too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache of size X.)

A more recent fourth "C":
– Coherence: misses caused by cache coherence.

Review: 6 Basic Cache Optimizations

Reducing hit time:
1. Avoiding address translation during cache indexing (e.g., overlap TLB and cache access, virtually addressed caches)

Reducing miss penalty:
2. Giving reads priority over writes (e.g., a read completes before earlier writes still in the write buffer)
3. Multilevel caches

Reducing miss rate:
4. Larger block size (compulsory misses)
5. Larger cache size (capacity misses)
6. Higher associativity (conflict misses)

12 Advanced Cache Optimizations

Reducing hit time:
1. Small and simple caches
2. Way prediction
3. Trace caches

Increasing cache bandwidth:
4. Pipelined caches
5. Multibanked caches
6. Nonblocking caches

Reducing miss penalty:
7. Critical word first
8. Merging write buffers

Reducing miss rate:
9. Victim cache
10. Hardware prefetching
11. Compiler prefetching
12. Compiler optimizations

3. Fast (Instruction Cache) Hit Times via Trace Cache

Key idea: pack multiple non-contiguous basic blocks into one contiguous trace-cache line
– A single fetch brings in multiple basic blocks
– The trace cache is indexed by start address and the next n branch predictions

(Diagram: basic blocks separated by branches repacked into one trace-cache line.)

3. Fast Hit Times via Trace Cache (Pentium 4 only; and last time?)

Find more instruction-level parallelism? How to avoid translation from x86 to micro-ops? The trace cache in the Pentium 4:
1. Holds dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
– Built-in branch predictor
2. Caches the micro-ops vs. x86 instructions
– Decode/translate from x86 to micro-ops on a trace-cache miss

+ Better utilizes long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
– Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
– Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes

Reducing Misses: a "Victim Cache"

How to combine the fast hit time of a direct-mapped cache yet still avoid conflict misses?
– Add a small buffer that holds data discarded from the cache
– Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct-mapped data cache
– Used in Alpha and HP machines

(Diagram: a direct-mapped cache of tags and data, plus four fully associative victim lines, each one cache line of data with its own tag and comparator, sitting between the cache and the next lower level of the hierarchy.) A lookup sketch follows below.
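A hedged sketch of that lookup path (sizes, struct layout, and the swap policy are illustrative assumptions, not Jouppi's exact design):

    #include <stdbool.h>
    #include <stdint.h>

    #define L1_SETS 128   /* illustrative: 4 KB / 32 B lines */
    #define VC_WAYS 4

    typedef struct { bool valid; uint32_t tag; /* ... data ... */ } line_t;
    static line_t l1[L1_SETS];   /* direct-mapped, tag = block_addr / L1_SETS */
    static line_t vc[VC_WAYS];   /* fully associative, tag = full block addr  */

    bool lookup(uint32_t block_addr) {
        uint32_t set = block_addr % L1_SETS;
        uint32_t tag = block_addr / L1_SETS;
        if (l1[set].valid && l1[set].tag == tag)
            return true;                          /* fast direct-mapped hit */
        for (int w = 0; w < VC_WAYS; w++) {       /* probe all victim entries */
            if (vc[w].valid && vc[w].tag == block_addr) {
                line_t evicted = l1[set];         /* swap: promote the victim, */
                l1[set].valid = true;             /* demote the conflicting line */
                l1[set].tag = tag;
                vc[w] = evicted;
                vc[w].tag = evicted.tag * L1_SETS + set;  /* store full address */
                return true;                      /* hit at a small extra cost */
            }
        }
        return false;  /* miss: fetch from below; the displaced L1 line
                          (if valid) would be installed in the victim cache */
    }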

Reducing Misses by Hardware Prefetching of Instructions & Data

Prefetching relies on having extra memory bandwidth that can be used without penalty.

Instruction prefetching:
– Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
– The requested block is placed in the instruction cache when it returns; the prefetched block is placed in an instruction stream buffer

Data prefetching:
– The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
– Prefetching is invoked on 2 successive L2 cache misses to a page, if the distance between the missing cache blocks is < 256 bytes

Issues in Prefetching

– Usefulness: prefetches should produce hits
– Timeliness: not too late and not too early
– Cache and bandwidth pollution

(Diagram: CPU and register file backed by split L1 instruction and data caches and a unified L2; prefetched data is installed alongside demand-fetched data.)

Hardware Data Prefetching

Prefetch-on-miss:
– Prefetch block b+1 upon a miss on block b

One-Block Lookahead (OBL) scheme:
– Initiate a prefetch for block b+1 when block b is accessed
– Why is this different from doubling the block size?
– Can extend to N-block lookahead

Strided prefetch:
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc. (see the sketch below)

Example: the IBM Power5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access.
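A hedged sketch of strided detection, modeled in C as a small reference prediction table indexed by the PC of the load (the table size, confidence threshold, and issue_prefetch() hook are illustrative assumptions, not the Power5 design):

    #include <stdint.h>

    #define RPT_SIZE 64

    typedef struct {
        uint64_t last_addr;    /* previous address touched by this PC */
        int64_t  stride;       /* last observed stride */
        int      confidence;   /* consecutive repeats of that stride */
    } rpt_entry_t;

    static rpt_entry_t rpt[RPT_SIZE];

    extern void issue_prefetch(uint64_t addr);   /* hypothetical hardware hook */

    void on_load(uint64_t pc, uint64_t addr) {
        rpt_entry_t *e = &rpt[pc % RPT_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->stride)
            e->confidence++;                     /* saw b, b+N, b+2N, ... */
        else {
            e->stride = stride;
            e->confidence = 0;
        }
        e->last_addr = addr;
        if (e->confidence >= 2)
            issue_prefetch(addr + stride);       /* ... so fetch the next block ahead */
    }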

Reducing Misses by Software Prefetching of Data

Data prefetch:
– Load data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
– Special prefetching instructions cannot cause faults; a form of speculative execution

Issuing prefetch instructions takes time:
– Is the cost of issuing prefetches < the savings in reduced misses?
– Wider superscalar issue reduces the difficulty of finding the extra issue bandwidth
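As a concrete, compiler-specific illustration: GCC and Clang expose a non-faulting cache-prefetch hint, __builtin_prefetch(addr, rw, locality), in the spirit of the instructions above. The prefetch distance here is a tuning assumption:

    #define PREFETCH_AHEAD 16   /* illustrative distance; must be tuned */

    double sum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n)          /* hint: read (0), low reuse (1) */
                __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
            s += a[i];
        }
        return s;
    }

Whether this pays off is exactly the trade-off above: each prefetch consumes issue slots and bandwidth, so it only wins when the loop is otherwise memory-bound.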

Reducing Misses by Compiler Optimizations

McFarling [1989] reduced cache misses by 75% in software (8KB direct-mapped cache, 4-byte blocks).

Instructions:
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)

Data (before/after sketches of two of these follow below):
– Merging arrays: improve spatial locality with a single array of compound elements vs. 2 separate arrays
– Loop interchange: change the nesting of loops to access data in the order it is stored in memory
– Loop fusion: combine 2 independent loops that have the same looping and overlapping variables
– Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
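Minimal before/after sketches of two of these transformations, following the textbook-style examples and assuming C's row-major array layout (declarations omitted):

    /* Loop interchange: before, the inner loop walks a column of x
       (stride N words); after, it walks a row (stride 1). */
    for (j = 0; j < N; j = j+1)          /* before */
        for (i = 0; i < N; i = i+1)
            x[i][j] = 2 * x[i][j];

    for (i = 0; i < N; i = i+1)          /* after */
        for (j = 0; j < N; j = j+1)
            x[i][j] = 2 * x[i][j];

    /* Loop fusion: one pass over a[] and c[] instead of two, so a[i]
       and c[i] are reused while still cached. */
    for (i = 0; i < N; i = i+1) {
        a[i] = b[i] * c[i];
        d[i] = a[i] + c[i];
    }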

Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k]*z[k][j];
            x[i][j] = r;
        }

Two inner loops:
– Read all N×N elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]

Capacity misses are a function of N and cache size:
– 2N³ + N² accesses (assuming no conflicts; otherwise …)

Idea: compute on a B×B submatrix that fits in the cache.

Blocking Example

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B-1,N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B-1,N); k = k+1)
                        r = r + y[i][k]*z[k][j];
                    x[i][j] = x[i][j] + r;
                }

B is called the blocking factor.
Capacity misses drop from 2N³ + N² to 2N³/B + N².
Conflict misses too?

Reducing Conflict Misses by Blocking

Conflict misses in caches that are not fully associative vary with blocking size:
– Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, even though both fit in the cache

Administrivia

Next exam: tentatively Monday 5/3
– Could do it during the day on Tuesday
– Material: everything up to the last lecture
– Closed book, but 2 pages of hand-written notes (both sides)

We have been talking about Chapter 5 (memory)
– You should take a look, since it might show up on the test

Impact of Hierarchy on Algorithms

Today, CPU time is a function of (ops, cache misses). What does this mean for compilers, data structures, and algorithms?
– Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
– Radix sort: also called "linear time" sort; for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys

"The Influence of Caches on the Performance of Sorting" by A. LaMarca and R.E. Ladner, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997.
– For an Alphastation 250: 32-byte blocks, direct-mapped 2MB L2 cache, 8-byte keys, job sizes from 4000 keys up (see the following plots)

Quicksort vs. Radix: Instructions
(Plot: instructions vs. job size in keys.)

Quicksort vs. Radix: Instructions & Time
(Plots: instructions and time vs. job size in keys.)

Quicksort vs. Radix: Cache Misses
(Plot: cache misses vs. job size in keys.)

Experimental Study of Memory (Membench)

Microbenchmark for memory system performance:

    for array A of length L from 4KB to 8MB by 2x
        for stride s from 4 bytes (1 word) to L/2 by 2x
            time the following loop (repeat many times and average):
                for i from 0 to L by s
                    load A[i] from memory (4 bytes)

One experiment per (L, s) pair. A C sketch follows below.
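A C sketch of that loop (the repetition count, timer choice, and volatile sink are implementation assumptions; the real benchmark repeats and averages more carefully):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define MAXLEN (8*1024*1024)                  /* 8 MB */
    static int32_t A[MAXLEN / 4];

    static double ns_per_load(long L, long s) {   /* L, s in bytes */
        volatile int32_t sink = 0;
        long reps = 100;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long r = 0; r < reps; r++)
            for (long i = 0; i < L; i += s)
                sink = A[i / 4];                  /* one 4-byte load per step */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / (reps * (L / s));             /* average cost per load */
    }

    int main(void) {
        for (long L = 4*1024; L <= MAXLEN; L *= 2)    /* 4 KB .. 8 MB */
            for (long s = 4; s <= L/2; s *= 2)        /* 4 B .. L/2 */
                printf("L=%ld s=%ld : %.2f ns/load\n", L, s, ns_per_load(L, s));
        return 0;
    }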

Membench: What to Expect

Consider the average cost per load:
– Plot one line for each array length: time vs. stride
– A small stride is best: if a cache line holds 4 words, at most ¼ of accesses miss
– If the array is smaller than a given cache, all accesses hit (after the first run, which is negligible for large enough runs)
– The picture assumes only one level of cache
– Values have become more difficult to measure on modern processors

(Sketch: average cost per access vs. stride s, flat at the L1 hit time when the total size fits in L1, rising toward memory time when the size exceeds L1.)

Memory Hierarchy on a Sun Ultra-2i

Sun Ultra-2i, 333 MHz (plot: one line per array length):
– L1: 16 KB, 2 cycles (6 ns), 16-byte lines
– L2: 2 MB, 12 cycles (36 ns), 64-byte lines
– Mem: 396 ns (132 cycles)
– 8 KB pages, 32 TLB entries

Memory Hierarchy on a Power3

Power3, 375 MHz (plot: one line per array size):
– L1: 32 KB, 128-byte lines, 0.5–2 cycles
– L2: 8 MB, 128-byte lines, 9 cycles
– Mem: 396 ns (132 cycles)

Compiler Optimization vs. Memory Hierarchy Search

The compiler tries to figure out memory-hierarchy optimizations.

New approach: "auto-tuners" first run variations of the program on the target computer to find the best combination of optimizations (blocking, padding, …) and algorithms, then produce C code to be compiled for that computer.

Auto-tuners targeted to numerical methods:
– E.g., PHiPAC (BLAS), ATLAS (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW

4/19/2010 cs252-S10, Lecture Reference Best: 4x2 Mflop/s Sparse Matrix – Search for Blocking for finite element problem [Im, Yelick, Vuduc, 2005]

Setup for Error Correction Codes (ECC)

Memory systems generate errors (accidentally flipped bits):
– DRAMs store very little charge per bit
– "Soft" errors occur occasionally when cells are struck by alpha particles or other environmental upsets
– Less frequently, "hard" errors occur when chips permanently fail
– The problem gets worse as memories get denser and larger

Where is "perfect" memory required?
– Servers, spacecraft/military computers, eBay, …

Memories are protected against failures with ECC: extra bits are added to each data word
– Used to detect and/or correct faults in the memory system
– In general, each possible data-word value is mapped to a unique "code word". A fault changes a valid code word to an invalid one, which can be detected.

ECC Approach

Approach: redundancy
– Add extra information so that we can recover from errors
– Simple technique: duplicate (a tiny example follows below)

Block codes: data coded in blocks
– k data bits are coded into n encoded bits
– Measure of overhead: rate of the code, k/n; often called an (n,k) code
– Consider data as vectors in GF(2), i.e., vectors of bits
– The code space is the set of all 2^n vectors; the data space is the set of 2^k vectors
– Encoding function: C = f(d)
– Decoding function: d = f⁻¹(C′)
– Not all possible code vectors C are valid!
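A tiny concrete instance of the "duplicate" idea above, written as a (3,1) repetition code (a hedged sketch; __builtin_popcount is a GCC/Clang intrinsic):

    /* (n,k) = (3,1): one data bit becomes three code bits, rate 1/3.
       The two valid code words, 000 and 111, are distance d = 3 apart. */
    int encode(int d) { return d ? 7 : 0; }           /* C = f(d)     */
    int decode(int c) {                               /* d = f^-1(C') */
        return __builtin_popcount(c & 7) >= 2;        /* majority vote:
                                                         corrects 1 flipped bit */
    }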

Code Vector Space and Code Distance (Hamming Distance)

General idea: not every vector in the code space is valid; valid code words (e.g., C₀ = f(v₀)) are sparse points in the space.

Hamming distance (d):
– The minimum number of bit flips needed to turn one valid code word into another
– Number of errors we can detect: d − 1
– Number of errors we can fix: ⌊(d − 1)/2⌋
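A minimal sketch of the distance computation (again using the __builtin_popcount intrinsic):

    #include <stdint.h>

    /* Hamming distance: number of bit positions where a and b differ. */
    int hamming_distance(uint32_t a, uint32_t b) {
        return __builtin_popcount(a ^ b);
    }
    /* For a code whose minimum distance between valid code words is d:
       detectable errors = d - 1, correctable errors = (d - 1) / 2. */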

Conclusion

The memory wall inspires optimizations, since much performance is otherwise lost:
– Reducing hit time: small and simple caches, way prediction, trace caches
– Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
– Reducing miss penalty: critical word first, merging write buffers
– Reducing miss rate: compiler optimizations
– Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching

Performance of programs can be a complicated function of the architecture:
– To write fast programs, you need to consider the architecture (true on a sequential or a parallel processor)
– We would like simple models to help us design efficient algorithms

Will "auto-tuners" replace compilation to optimize performance?