CS136, Advanced Architecture: Cache and Memory Performance

Outline
  11 Advanced Cache Optimizations
  Memory Technology and DRAM Optimizations

Why More on Memory Hierarchy?
  The processor-memory performance gap keeps growing.

Review: 6 Basic Cache Optimizations
Reducing hit time
  1. Giving reads priority over writes
     – E.g., a read completes before earlier writes still sitting in the write buffer
  2. Avoiding address translation during cache indexing
Reducing miss penalty
  3. Multilevel caches
Reducing miss rate
  4. Larger block size (reduces compulsory misses)
  5. Larger cache size (reduces capacity misses)
  6. Higher associativity (reduces conflict misses)
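All of these techniques act on one of the three terms of average memory access time (AMAT). A minimal C sketch of that relationship (my own illustration; the function name and cycle units are assumptions, not from the slides):

    /* AMAT = hit time + miss rate * miss penalty, all in CPU cycles */
    double amat(double hit_time_cycles, double miss_rate, double miss_penalty_cycles) {
        return hit_time_cycles + miss_rate * miss_penalty_cycles;
    }

For example, a 1-cycle hit time, a 2% miss rate, and a 100-cycle miss penalty give amat(1, 0.02, 100) = 3 cycles.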

11 Advanced Cache Optimizations
Reducing hit time
  1. Small and simple caches
  2. Way prediction
  3. Trace caches
Increasing cache bandwidth
  4. Pipelined caches
  5. Multibanked caches
  6. Nonblocking caches
Reducing miss penalty
  7. Critical word first
  8. Merging write buffers
Reducing miss rate
  9. Compiler optimizations
Reducing miss penalty or miss rate via parallelism
  10. Hardware prefetching
  11. Compiler prefetching

Fast Hit Times Via Small and Simple Caches
Indexing the tag memory and then comparing takes time
Small: a small cache helps hit time because a smaller memory takes less time to index
  – E.g., the L1 caches stayed the same size across 3 generations of AMD microprocessors: K6, Athlon, and Opteron
  – Also keeps the L2 cache small enough to fit on chip with the processor, avoiding the time penalty of going off chip
Simple: direct mapping
  – Can overlap the tag check with data transmission, since there is no way selection to wait for
Access-time estimates for 90 nm using the CACTI 4.0 model:
  – Median access times relative to a direct-mapped cache are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches

Fast Hit Times Via Way Prediction
How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
Way prediction: keep extra bits in the cache to predict the "way" (block within the set) of the next access
  – The multiplexer is set early to select the predicted block; only 1 tag comparison is done that cycle (in parallel with reading the data)
  – On a way mispredict, check the other blocks for matches in the next cycle
Accuracy is roughly 85%
Drawback: the CPU pipeline is harder to design if the hit time is variable-length
(Timeline on the slide: a predicted-way hit is fastest, a way-miss hit takes an extra cycle, and a true miss pays the full miss penalty.)
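A minimal C sketch of a way-predicted 2-way lookup (my own illustration; the structure, sizes, and field names are assumptions, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 256
    #define WAYS 2

    struct cache {
        uint32_t tag[SETS][WAYS];
        bool     valid[SETS][WAYS];
        uint8_t  predicted_way[SETS];   /* the extra prediction bits per set */
    };

    /* Returns 0 on a fast hit (predicted way), 1 on a slower hit in the
     * other way (costs an extra cycle), or -1 on a miss. */
    int lookup(struct cache *c, uint32_t set, uint32_t tag) {
        uint8_t w = c->predicted_way[set];
        if (c->valid[set][w] && c->tag[set][w] == tag)
            return 0;                          /* fast hit: only 1 tag compare */
        uint8_t other = 1 - w;
        if (c->valid[set][other] && c->tag[set][other] == tag) {
            c->predicted_way[set] = other;     /* retrain the predictor */
            return 1;                          /* hit, but with an extra cycle */
        }
        return -1;                             /* miss */
    }

The variable-length hit time returned here is exactly what makes the pipeline harder to schedule.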

Fast Hit Times Via Trace Cache (Pentium 4 Only; and for the Last Time?)
How to find more instruction-level parallelism? How to avoid translating from x86 to micro-ops on every fetch?
Trace cache in the Pentium 4:
  – Caches dynamic traces of executed instructions rather than static sequences from the memory layout
    » Uses a built-in branch predictor
  – Caches micro-ops instead of x86 instructions
    » Decode/translate from x86 to micro-ops only on a trace miss
Utilizes long instruction blocks better (traces don't enter or exit in the middle)
Complicated address mapping (addresses are no longer aligned to nice multiples of the word size)
Instructions can appear many times, in multiple dynamic traces, because of different branch outcomes

Increasing Cache Bandwidth by Pipelining
Pipeline cache accesses to maintain bandwidth (at the cost of higher latency)
Instruction-cache pipeline stages:
  – Pentium: 1 stage
  – Pentium Pro through Pentium III: 2 stages
  – Pentium 4: 4 stages
⇒ Greater penalty on mispredicted branches
⇒ More clock cycles between issue of a load and use of its data

Increasing Cache Bandwidth with Non-Blocking Caches
A non-blocking (lockup-free) cache allows cache hits to continue during a miss
  – Requires full/empty (F/E) bits on registers or out-of-order execution
  – Requires multi-banked memories
"Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests
"Hit under multiple miss" or "miss under miss" further lowers the effective miss penalty by overlapping multiple misses
  – Significantly increases cache-controller complexity, since there can be many outstanding memory accesses
  – Requires multiple memory banks
  – The Pentium Pro allows 4 outstanding memory misses
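The bookkeeping for outstanding misses is typically done with miss status holding registers (MSHRs). A minimal C sketch (my own illustration; the structure, size, and names are assumptions, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_MSHR 4                 /* e.g., the Pentium Pro allowed 4 outstanding misses */

    struct mshr {
        bool     busy;
        uint64_t block_addr;           /* block currently being fetched from memory */
    };

    static struct mshr mshrs[NUM_MSHR];

    /* Called on a cache miss.  Returns true if the miss can proceed (merged
     * with an in-flight miss or given a free MSHR); false if all MSHRs are
     * busy and the CPU must stall. */
    bool handle_miss(uint64_t block_addr) {
        for (int i = 0; i < NUM_MSHR; i++)        /* miss to an in-flight block? */
            if (mshrs[i].busy && mshrs[i].block_addr == block_addr)
                return true;                      /* merge: no new memory request */
        for (int i = 0; i < NUM_MSHR; i++)        /* otherwise allocate an MSHR */
            if (!mshrs[i].busy) {
                mshrs[i].busy = true;
                mshrs[i].block_addr = block_addr;
                return true;                      /* issue the request to memory */
            }
        return false;                             /* all MSHRs busy: stall */
    }

Hits are serviced normally no matter how many MSHRs are busy, which is what makes the cache non-blocking.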

Value of Hit Under Miss for SPEC (Old Data)
8 KB direct-mapped data cache, 32-byte blocks, 16-cycle miss penalty, SPEC92
(Chart of relative average memory access time for a blocking cache vs. "hit under n misses" for n = 1, 2, and 64, shown separately for integer and floating-point programs; with hit under 64 misses, floating-point programs average 0.26 and integer programs 0.19.)

Increasing Cache Bandwidth Via Multiple Banks
Rather than treating the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  – E.g., the Sun T1 ("Niagara") L2 has 4 banks
Works best when accesses naturally spread themselves across banks, so the mapping of addresses to banks affects the behavior of the memory system
A simple mapping that works well is sequential interleaving:
  – Spread block addresses sequentially across banks
  – E.g., with n banks, bank i holds all blocks whose address equals i modulo n
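A one-line C sketch of sequential interleaving (my own illustration; the names are assumptions, not from the slides):

    #include <stdint.h>

    /* With n_banks banks, a block maps to bank (block address mod n_banks),
     * so consecutive blocks land in consecutive banks. */
    unsigned bank_of(uint64_t block_addr, unsigned n_banks) {
        return (unsigned)(block_addr % n_banks);
    }

With 4 banks, blocks 0, 1, 2, 3 go to banks 0, 1, 2, 3 and block 4 wraps back to bank 0, so a unit-stride stream keeps all banks busy at once.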

Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block before restarting the CPU
Early restart: as soon as the requested word of the block arrives, send it to the CPU and continue execution
  – Spatial locality means the CPU tends to want the next sequential word soon, so it may still pay to fetch it
Critical word first: request the missed word from memory first and send it to the CPU right away; let the CPU continue while the rest of the block fills
  – Long blocks are more popular today, so critical word first is widely used
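A small C sketch of the critical-word-first fill order (my own illustration; the block size and names are assumptions, not from the slides):

    #define WORDS_PER_BLOCK 8

    /* Fill the block starting at the word that missed and wrapping around,
     * so the CPU gets its word first and the rest of the block follows. */
    void fill_order(unsigned critical_word, unsigned order[WORDS_PER_BLOCK]) {
        for (unsigned i = 0; i < WORDS_PER_BLOCK; i++)
            order[i] = (critical_word + i) % WORDS_PER_BLOCK;
    }

For a miss on word 5 of an 8-word block the fill order is 5, 6, 7, 0, 1, 2, 3, 4; execution can restart as soon as word 5 arrives.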

Merging Write Buffer to Reduce Miss Penalty
A write buffer lets the processor continue while waiting for a write to complete
If the buffer contains modified blocks, the address of a new write can be checked against the existing write-buffer entries
If it matches one, the new data is combined with that entry instead of taking a new slot
For sequential writes in write-through caches, this increases the effective size of each write to memory (more efficient)
The Sun T1 (Niagara) and many other processors use write merging
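A minimal C sketch of the merging check (my own illustration; the entry format, sizes, and names are assumptions, not from the slides):

    #include <stdint.h>
    #include <stdbool.h>

    #define WB_ENTRIES 4
    #define WORDS_PER_ENTRY 4              /* one cache block per entry */

    struct wb_entry {
        bool     valid;
        uint64_t block_addr;               /* block this entry covers */
        uint8_t  word_mask;                /* which words hold valid data */
        uint64_t data[WORDS_PER_ENTRY];
    };

    static struct wb_entry wb[WB_ENTRIES];

    /* Buffer a one-word write (word is the index within the block).  Merge
     * into an existing entry for the same block if possible; otherwise take
     * a free entry; otherwise the caller must stall until the buffer drains. */
    bool buffer_write(uint64_t block_addr, unsigned word, uint64_t value) {
        for (int i = 0; i < WB_ENTRIES; i++)
            if (wb[i].valid && wb[i].block_addr == block_addr) {
                wb[i].data[word] = value;          /* merge with existing entry */
                wb[i].word_mask |= 1u << word;
                return true;
            }
        for (int i = 0; i < WB_ENTRIES; i++)
            if (!wb[i].valid) {                    /* allocate a new entry */
                wb[i].valid = true;
                wb[i].block_addr = block_addr;
                wb[i].word_mask = 1u << word;
                wb[i].data[word] = value;
                return true;
            }
        return false;                              /* buffer full: stall */
    }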

Reducing Misses by Compiler Optimizations
McFarling [1989] reduced misses by 75% in software on an 8 KB direct-mapped cache with 4-byte blocks
Instructions
  – Reorder procedures in memory to reduce conflict misses
  – Use profiling to look at conflicts (with tools they developed)
Data
  – Merging arrays: improve spatial locality with a single array of compound elements instead of 2 separate arrays
  – Loop interchange: change the nesting of loops to access data in memory order
  – Loop fusion: combine 2 independent loops that have the same looping and some variable overlap
  – Blocking: improve temporal locality by accessing blocks of data repeatedly instead of walking down whole columns or rows

Merging Arrays Example

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

Reduces conflicts between val and key and improves spatial locality

Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality

Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

Before: 2 misses per access to a and c; after: 1 miss per access, since the second use finds the data still in the cache (improves temporal locality)

Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

The two inner loops:
  – Read all N x N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
Capacity misses are a function of N and the cache size:
  – 2N^3 + N^2 words are accessed (assuming no conflict misses; otherwise even more)
Idea: compute on a B x B submatrix that fits in the cache

Blocking Example (continued)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B, N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B, N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

B is called the blocking factor
Capacity misses drop from 2N^3 + N^2 to roughly 2N^3/B + N^2
Can blocking reduce conflict misses too?

Reducing Conflict Misses by Blocking
In caches that are not fully associative, conflict misses depend on the blocking size, not just on whether the blocks fit
  – Lam et al. [1991]: a blocking factor of 24 had 1/5 the misses of a factor of 48, even though both working sets fit in the cache

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

Reducing Misses by Hardware Prefetching of Instructions and Data
Prefetching relies on having extra memory bandwidth that can be used without penalty
Instruction prefetching
  – Typically the CPU fetches 2 blocks on a miss: the requested block and the next one
  – The requested block goes into the instruction cache; the prefetched block goes into an instruction stream buffer
Data prefetching
  – The Pentium 4 can prefetch data into its L2 cache from up to 8 streams in 8 different 4 KB pages
  – Prefetching is invoked after 2 successive L2 cache misses to a page, if the distance between the missed blocks is less than 256 bytes

Reducing Misses by Software Prefetching of Data
Data prefetch
  – Register prefetch: load the data into a register (HP PA-RISC loads)
  – Cache prefetch: load the data into the cache (MIPS IV, PowerPC, SPARC V9)
  – Special prefetch instructions cannot cause faults; a form of speculative execution
Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches less than the savings from reduced misses?
  – Wider superscalar issue reduces the problem of issue bandwidth
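As a concrete (compiler-specific) sketch, GCC and Clang expose cache prefetch through the __builtin_prefetch intrinsic. The loop below is my own illustration, not from the slides, and the prefetch distance of 16 elements is an arbitrary assumption to be tuned per machine:

    /* Sum an array while prefetching ahead of the current element. */
    double sum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1);  /* read, low temporal locality */
            s += a[i];
        }
        return s;
    }

The prefetch is only a hint: it never faults, and a useless prefetch just wastes issue slots and bandwidth, which is exactly the cost/benefit trade-off on this slide.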

Compiler Optimization vs. Memory Hierarchy Search
A compiler tries to figure out memory-hierarchy optimizations statically
New approach: "auto-tuners"
  – Run variations of the program to find the best combination of optimizations (blocking, padding, ...) and algorithms for the target machine
  – Then produce C code specialized for that computer
Each auto-tuner targets one numerical method
  – E.g., PHiPAC (BLAS), ATLAS (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW

Sparse Matrix: Search for the Best Blocking for a Finite-Element Problem [Im, Yelick, Vuduc, 2005]
(Chart of Mflop/s for each register block size, comparing the unblocked reference code against the best block size found, 4x2.)

Best Sparse Blocking for 8 Computers
(Chart: best row block size (r) and column block size (c) for each machine: Intel Pentium M; Sun Ultra 2; Sun Ultra 3; AMD Opteron; IBM Power 4; Intel/HP Itanium; Intel/HP Itanium 2; IBM Power.)
All possible column block sizes are selected somewhere across the 8 machines; how could a compiler know which one to pick?

Summary: The 11 Advanced Cache Optimizations
(For each technique: effect on hit time, bandwidth, miss penalty, and miss rate, where + helps and – hurts; hardware cost/complexity rated from 0 = trivial to 3 = complex; comment.)
  Small and simple caches: hit time +, miss rate –; complexity 0; trivial, widely used
  Way-predicting caches: hit time +; complexity 1; used in Pentium 4
  Trace caches: hit time +; complexity 3; used in Pentium 4
  Pipelined cache access: hit time –, bandwidth +; complexity 1; widely used
  Nonblocking caches: bandwidth +, miss penalty +; complexity 3; widely used
  Banked caches: bandwidth +; complexity 1; used in the L2 of Opteron and Niagara
  Critical word first and early restart: miss penalty +; complexity 2; widely used
  Merging write buffer: miss penalty +; complexity 1; widely used with write-through
  Compiler techniques to reduce cache misses: miss rate +; complexity 0; software is a challenge; some computers have a compiler option
  Hardware prefetching of instructions and data: miss penalty +, miss rate +; complexity 2 (instructions), 3 (data); many prefetch instructions; AMD Opteron prefetches data
  Compiler-controlled prefetching: miss penalty +, miss rate +; complexity 3; needs a nonblocking cache; available in many CPUs

Main-Memory Background
Performance of main memory:
  – Latency: determines the cache miss penalty
    » Access time: time between a request and when the word arrives
    » Cycle time: minimum time between requests
  – Bandwidth: matters for I/O and for large-block miss penalties (L2)
Main memory is DRAM: Dynamic Random Access Memory
  – Dynamic because it must be refreshed periodically (every 8 ms, costing about 1% of the time)
  – Addresses are divided into 2 halves (memory is organized as a 2D matrix):
    » RAS, or Row Access Strobe
    » CAS, or Column Access Strobe
Caches use SRAM: Static Random Access Memory
  – No refresh (6 transistors per bit vs. 1 transistor for DRAM)
Size ratio DRAM/SRAM ~ 4-8; cost and cycle-time ratio SRAM/DRAM ~ 8-16

Main Memory Deep Background
Ever wonder about "out-of-core", "in-core", and "core dump"? They refer to "core memory"
  – Non-volatile, magnetic storage
  – Lost out to the 4 Kbit DRAM (today we use 512 Mbit DRAMs)
  – Access time around 750 ns, with an even longer cycle time

DRAM Logical Organization (4 Mbit)
(Diagram: a 2,048 x 2,048 memory array of storage cells with word lines, a row decoder, a column decoder, and sense amps/I-O. The 11 address pins A0...A10 are multiplexed, carrying the row address on RAS and then the column address on CAS, i.e. the square root of the number of bits per RAS/CAS.)
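A tiny C sketch of the multiplexed addressing the diagram implies (my own illustration; the names are assumptions, not from the slides):

    #include <stdint.h>

    /* A 4 Mbit part organized as a 2,048 x 2,048 array: the 22-bit cell
     * address is sent as an 11-bit row (on RAS) and then an 11-bit column
     * (on CAS) over the same 11 pins A0..A10. */
    void split_address(uint32_t cell_addr, uint32_t *row, uint32_t *col) {
        *row = (cell_addr >> 11) & 0x7FF;   /* upper 11 bits -> row address */
        *col = cell_addr & 0x7FF;           /* lower 11 bits -> column address */
    }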

The Quest for DRAM Performance
1. Fast page mode
  – Add timing signals that allow repeated accesses to the row buffer without paying another row-access time
  – Easy, since the array already buffers a full row of bits on each access
2. Synchronous DRAM (SDRAM)
  – Add a clock signal to the DRAM interface so repeated transfers don't pay the overhead of synchronizing with the controller
3. Double Data Rate (DDR SDRAM)
  – Transfer data on both the rising and falling edges of the clock, doubling the peak data rate
  – DDR2 lowers power by dropping the supply voltage from 2.5 to 1.8 volts and offers higher clock rates, up to 400 MHz
  – DDR3 drops to 1.5 volts and raises clock rates to 800 MHz
All of these improve bandwidth, not latency

DRAM Named by Peak Chip Transfers/Sec; DIMM Named by Peak DIMM MBytes/Sec

  Standard | Clock rate (MHz) | M transfers/sec (x2) | DRAM name | MBytes/s per DIMM (x8) | DIMM name
  DDR      | 133              | 266                  | DDR266    | 2128                   | PC2100
  DDR      | 150              | 300                  | DDR300    | 2400                   | PC2400
  DDR      | 200              | 400                  | DDR400    | 3200                   | PC3200
  DDR2     | 266              | 533                  | DDR2-533  | 4264                   | PC4300
  DDR2     | 333              | 667                  | DDR2-667  | 5336                   | PC5300
  DDR2     | 400              | 800                  | DDR2-800  | 6400                   | PC6400
  DDR3     | 533              | 1066                 | DDR3-1066 | 8528                   | PC8500
  DDR3     | 666              | 1333                 | DDR3-1333 | 10664                  | PC10700
  DDR3     | 800              | 1600                 | DDR3-1600 | 12800                  | PC12800

  (Transfers/sec is 2x the clock rate; MBytes/s per DIMM is 8x the transfers/sec, since a DIMM is 8 bytes wide.)
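A short C sketch of the naming arithmetic behind this table (my own illustration; the function is not from the slides):

    #include <stdio.h>

    /* The chip transfer rate is twice the clock (double data rate); a DIMM
     * is 8 bytes wide, so its peak MB/s is 8x the transfer rate. */
    void ddr_names(int clock_mhz) {
        int transfers = 2 * clock_mhz;      /* e.g., DDR-400, DDR2-667 */
        int dimm_mbs  = 8 * transfers;      /* e.g., PC3200, PC5300    */
        printf("DDRx-%d chip -> PC%d DIMM (%d MB/s peak)\n",
               transfers, dimm_mbs, dimm_mbs);
    }

ddr_names(200) prints "DDRx-400 chip -> PC3200 DIMM (3200 MB/s peak)", matching the DDR400/PC3200 row above (the marketing names round the MB/s figure).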

The Need for Error Correction!
Motivation:
  – Failures per unit time are proportional to the number of bits!
  – As DRAM cells shrink, they become more vulnerable
We went through a period when the failure rate was low enough to ignore error correction
  – Servers have always had corrected memory systems
  – Even desktop DRAM banks are too large now to go unprotected
Basic idea: add redundancy through parity bits
  – Common configuration: random error correction
    » SEC-DED (single error correct, double error detect)
    » One example: 64 data bits + 8 parity bits (11% overhead)
  – Really want to handle physical-component failures as well
    » Memory is organized as multiple DRAMs per DIMM and multiple DIMMs
    » Want to recover from a failed DRAM and a failed DIMM!
    » "Chipkill" handles failures the width of a single DRAM chip

Basics of Error Correction
Parity (the XOR of all bits) can detect one flipped bit
Identifying which bit flipped takes at least log2(word size) additional bits (why?)
Hamming's insight (for a 32-bit word):
  – Also XOR only the top 16 bits ⇒ can tell whether the flipped bit is in the top or bottom half
  – Also XOR the top 8 of the top 16 and the top 8 of the bottom 16 ⇒ can tell which byte it's in
  – Keep halving
End result: the check bits can identify an individual bit, and if they are properly arranged, the pattern of incorrect XORs directly gives the index of the bit to invert!
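A minimal C sketch of the halving idea (my own illustration, not from the slides): alongside a 32-bit word we keep overall parity plus the XOR of the positions of its set bits (log2(32) = 5 bits), and a single flipped bit changes that position-XOR by exactly its own index.

    #include <stdint.h>
    #include <stdio.h>

    static unsigned pos_xor(uint32_t w) {          /* XOR of the indices of the 1 bits */
        unsigned x = 0;
        for (unsigned i = 0; i < 32; i++)
            if (w & (1u << i)) x ^= i;
        return x;
    }
    static unsigned parity(uint32_t w) { return __builtin_popcount(w) & 1; }

    int main(void) {
        uint32_t data = 0xDEADBEEF;
        unsigned check = pos_xor(data), par = parity(data);

        uint32_t received = data ^ (1u << 13);     /* inject a single-bit error */
        if (parity(received) != par) {             /* parity says one bit flipped */
            unsigned bad = pos_xor(received) ^ check;   /* index of the flipped bit */
            received ^= 1u << bad;                 /* invert it to correct */
        }
        printf("corrected: %s\n", received == data ? "yes" : "no");
        return 0;
    }

Each bit of the position-XOR is exactly one of the "half / quarter / ..." parity groups the slide describes; real SEC-DED codes arrange the check bits so this works for the stored word plus the check bits themselves.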

Fancier Error Correction
Based on serious math (groups and fields)
Basic idea: each valid word is a coordinate in n-space
  – E.g., 32-space for a 32-bit word
Append k correction bits to place the word into (n+k)-space
  – Ensure that valid words are "far" from each other in the larger space
  – Errors perturb the word's position in that space
  – Correction simply means finding the closest "legal" point
The state of the art has focused on burst errors, e.g., 17 consecutive incorrect bits
  – Caused by noise bursts in communication links or on disk
  – Not as relevant to memory
Slight variations are needed for modern memory