
ENGS 116 Lecture 14, Slide 1: Caches and Main Memory
Vincent H. Berk, November 5th, 2008
Reading for Today: Sections C.4 – C.7
Reading for Wednesday: Sections 5.1 – 5.3
Reading for Monday: Sections 5.4 – 5.8

Slide 2: Reducing Hit Time
• Small and simple caches
• Way prediction
• Trace caches

Slide 3: 1. Fast Hit Times via Small and Simple Caches
• Alpha has an 8KB instruction and 8KB data cache + 96KB second-level cache
  – Small cache and fast clock rate
• Intel Pentium 4 moved from 2-way 16KB data + 16KB instruction to 4-way 8KB + 8KB
  – Now the cache runs at core speed
• Direct mapped, on chip
• Intel Core Duo architecture, however:
  – L1 data & instruction caches both 32KB, 8-way, per core
  – 3-cycle latency (hit time)
  – L2 shared (between the 2 cores): 4 MB, 16-way, 14-cycle latency (incl. L1)
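The tension behind this slide is the standard average-memory-access-time balance: a bigger, more associative L1 lowers the miss rate but raises the hit time. A minimal sketch of that arithmetic, with all latencies and miss rates invented purely for illustration (none of these numbers come from the lecture):

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty, for two hypothetical L1s. */
    int main(void) {
        double penalty    = 20.0;                      /* assumed L2 latency, cycles */
        double small_fast = 1.0 + 0.05 * penalty;      /* 1-cycle hit, 5% misses: 2.0 */
        double big_slow   = 3.0 + 0.03 * penalty;      /* 3-cycle hit, 3% misses: 3.6 */
        printf("small, fast L1: %.1f cycles\n", small_fast);
        printf("big, slow  L1:  %.1f cycles\n", big_slow);
        return 0;
    }

With these made-up numbers the small, simple cache wins despite its higher miss rate, which is the argument for designs like the Alpha's 8KB L1s.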

Slide 4: 2. Reducing Hit Time via "Pseudo-Associativity" or Way Prediction
• How to combine the fast hit time of a direct-mapped cache with the lower conflict misses of a 2-way set-associative cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
• Way prediction: keep prediction bits to decide which comparison is made first
• Drawback: pipelining the CPU is hard if a hit can take 1 or 2 cycles
  – Better for caches not tied directly to the processor (L2)
  – Used in the MIPS R10000 L2 cache, similar in UltraSPARC
  – Not common today
[Figure: access-time line showing hit time < pseudo-hit time < miss penalty]
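A minimal sketch of way prediction, assuming a hypothetical 2-way set-associative cache with 64-byte blocks and one prediction bit per set; all sizes, names, and cycle counts are illustrative rather than taken from any real design:

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 256
    #define WAYS 2

    typedef struct {
        uint32_t tag[WAYS];
        bool     valid[WAYS];
        int      predicted_way;              /* which way to compare first */
    } CacheSet;

    static CacheSet cache[SETS];

    /* Returns cycles spent: 1 = fast hit, 2 = slow (pseudo) hit, 0 = miss. */
    int lookup(uint32_t addr) {
        uint32_t set = (addr >> 6) % SETS;   /* 64-byte blocks: drop 6 offset bits */
        uint32_t tag = addr >> 14;           /* 6 offset + 8 index bits */
        CacheSet *s = &cache[set];

        int first = s->predicted_way;
        if (s->valid[first] && s->tag[first] == tag)
            return 1;                        /* predicted way was right: fast hit */

        int other = 1 - first;
        if (s->valid[other] && s->tag[other] == tag) {
            s->predicted_way = other;        /* retrain the prediction bit */
            return 2;                        /* pseudo-hit: one extra cycle */
        }
        return 0;                            /* genuine miss */
    }

The variable hit latency (1 or 2 cycles) is exactly the pipelining headache the slide mentions.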

Slide 5: 3. Trace Caches
• Combine branch prediction and instruction prefetching
• Ability to load instruction blocks, including taken branches, into a cache block
• Basis for the Pentium 4 NetBurst architecture
• Contains decoded instructions – especially useful for the x86 CISC architecture, whose instructions are decoded to RISC-like internal operations

Slide 6: Increasing Cache Bandwidth
• Pipelined writes
• Multi-banked memory (later, in the Main Memory section)
• Non-blocking caches

Slide 7: 1. Pipelined Writes
• Pipeline the tag check and the cache update as separate stages: the current write's tag check overlaps the previous write's cache update
• Only STOREs occupy the pipeline; it sits empty during a miss
• The shaded element in the figure is the "delayed write buffer"; it must be checked on reads, which either complete the pending write or read from the buffer
[Figure: cache datapath with tag comparator, delayed write buffer, write buffer, and lower-level memory]
  Example sequence:
    Store r2, (r1)    check r1
    Add               --
    Sub               --
    Store r4, (r3)    M[r1] ← r2 & check r3

Slide 8: 2. Increase Bandwidth: Non-Blocking Caches to Reduce Stalls on Misses
• A non-blocking (or lockup-free) cache allows the data cache to continue to supply cache hits during a miss
  – Requires an out-of-order execution CPU
• "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  – Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  – Requires multiple memory banks (otherwise multiple misses cannot be supported)
  – The Pentium Pro allows 4 outstanding memory misses

Slide 9: Value of Hit Under Miss for SPEC
• FP programs on average: AMAT = 0.68 → 0.52 → 0.34 → 0.26
• Integer programs on average: AMAT = 0.24 → 0.20 → 0.19 → 0.19
• 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
[Figure: average memory access time per benchmark (eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) for "hit under n misses" with n going 0→1, 1→2, and 2→64, vs. the base blocking cache]

Slide 10: Reducing Miss Penalty
• Critical word first
• Merging write buffers
• Victim caches
• Subblock placement

Slide 11: 1. Reduce Miss Penalty: Early Restart and Critical Word First
• Don't wait for the full block to be loaded before restarting the CPU
  – Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  – Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling in the rest of the words in the block. Also called wrapped fetch and requested word first.
• Generally useful only with large blocks
• Spatial locality is a complication: the CPU tends to want the next sequential word anyway, so it is not clear how much early restart buys
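A minimal sketch of the wrap-around ("requested word first") fetch order, assuming an 8-word block and a hypothetical miss on word 5:

    #include <stdio.h>

    #define WORDS_PER_BLOCK 8

    int main(void) {
        int requested = 5;                        /* the word the CPU missed on */
        for (int i = 0; i < WORDS_PER_BLOCK; i++) {
            int word = (requested + i) % WORDS_PER_BLOCK;
            printf("fetch word %d\n", word);      /* order: 5 6 7 0 1 2 3 4 */
        }
        return 0;
    }

The CPU restarts as soon as word 5 arrives; words 6, 7, 0, ... fill in behind it.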

Slide 12: 2. Reduce Miss Penalty by Merging Write Buffers
• Write merging in the write buffer: writes to words of a block already in the buffer are folded into the existing entry
• Example: a 4-entry, 4-word buffer can absorb 16 sequential writes in a row
[Figure: write-buffer entries, each a write address plus four valid (V) bits, shown with and without merging]
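A minimal sketch of the merging logic, assuming the 4-entry, 4-word buffer from the slide; the structure and function names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define ENTRIES 4
    #define WORDS   4

    typedef struct {
        bool     used;
        uint32_t block_addr;           /* which 4-word block this entry covers */
        bool     valid[WORDS];         /* one valid bit per word, as in the figure */
        uint32_t data[WORDS];
    } WBEntry;

    static WBEntry wb[ENTRIES];

    /* Returns false when the buffer is full and the CPU must stall. */
    bool buffer_write(uint32_t addr, uint32_t value) {
        uint32_t block = addr / (WORDS * 4);       /* 4 bytes per word */
        int      word  = (addr / 4) % WORDS;

        for (int i = 0; i < ENTRIES; i++)          /* merge into an existing entry */
            if (wb[i].used && wb[i].block_addr == block) {
                wb[i].valid[word] = true;
                wb[i].data[word]  = value;
                return true;
            }

        for (int i = 0; i < ENTRIES; i++)          /* otherwise allocate a new one */
            if (!wb[i].used) {
                memset(&wb[i], 0, sizeof wb[i]);
                wb[i].used        = true;
                wb[i].block_addr  = block;
                wb[i].valid[word] = true;
                wb[i].data[word]  = value;
                return true;
            }
        return false;                               /* full: stall the store */
    }

Without the merge loop, 16 sequential one-word stores would need 16 entries; with it, they collapse into 4.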

Slide 13: 3. Reduce Miss Penalty via a "Victim Cache"
• How to combine the fast hit time of direct mapping yet still avoid conflict misses?
• Add a small buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4-KB direct-mapped data cache
• Used in Alpha and HP machines
[Figure: CPU address/data path with a small tag+data victim cache between the cache and lower-level memory]
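A minimal sketch of the victim-cache interaction, assuming a hypothetical direct-mapped L1 and a 4-entry fully associative victim buffer with FIFO replacement (all names and sizes are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define L1_SETS     64
    #define VICTIM_SIZE 4

    typedef struct { bool valid; uint32_t tag; } Line;

    static Line l1[L1_SETS];
    static Line victim[VICTIM_SIZE];      /* holds full block addresses */
    static int  victim_next;              /* FIFO replacement pointer */

    /* Returns true on a hit (fast L1 hit or slow victim-cache hit). */
    bool access_block(uint32_t block_addr) {
        uint32_t set = block_addr % L1_SETS;
        uint32_t tag = block_addr / L1_SETS;

        if (l1[set].valid && l1[set].tag == tag)
            return true;                              /* L1 hit */

        for (int i = 0; i < VICTIM_SIZE; i++)
            if (victim[i].valid && victim[i].tag == block_addr) {
                Line evicted = l1[set];               /* swap the two blocks */
                l1[set].valid = true;
                l1[set].tag   = tag;
                if (evicted.valid)
                    victim[i].tag = evicted.tag * L1_SETS + set;
                else
                    victim[i].valid = false;
                return true;          /* conflict absorbed, no memory access */
            }

        if (l1[set].valid) {                          /* real miss: save the victim */
            victim[victim_next].valid = true;
            victim[victim_next].tag   = l1[set].tag * L1_SETS + set;
            victim_next = (victim_next + 1) % VICTIM_SIZE;
        }
        l1[set].valid = true;
        l1[set].tag   = tag;
        return false;
    }

Two blocks that conflict in the direct-mapped array ping-pong between L1 and the victim buffer instead of going back to memory, which is where Jouppi's 20% to 95% conflict reduction comes from.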

Slide 14: 4. Reduce Miss Penalty: Subblock Placement
• Don't have to load the full block on a miss
• Keep valid bits per subblock to indicate which parts are valid
• (Originally invented to reduce tag storage)
[Figure: cache blocks with per-subblock valid bits]

Slide 15: Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% (8KB direct-mapped cache, 4-byte blocks) in software
• Instructions
  – Reorder procedures in memory so as to reduce conflict misses
  – Profiling to look at conflicts (using tools they developed)
• Data
  – Merging arrays: improve spatial locality with a single array of compound elements instead of 2 separate arrays
  – Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  – Loop fusion: combine 2 independent loops that have the same looping structure and overlapping variables
  – Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of walking down whole columns or rows

Slide 16: Merging Arrays Example

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality

Slide 17: Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

Slide 18: Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

Before: 2 misses per access to a & c; after: one miss per access; improves temporal locality

Slide 19: Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

• The two inner loops:
  – Read all N × N elements of z[]
  – Read N elements of 1 row of y[] repeatedly
  – Write N elements of 1 row of x[]
• Capacity misses are a function of N and the cache size:
  – If 3 × N × N × 4 bytes fit in the cache, no capacity misses; otherwise...
• Idea: compute on a B × B submatrix that fits

Slide 20: Blocking Example (continued)

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B-1,N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B-1,N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

• B is called the blocking factor
• Capacity misses are reduced from 2N³ + N² to 2N³/B + N²: y and z are each streamed through roughly N times in the original code, but only roughly N/B times once the computation is tiled
• Conflict misses, too?
• Blocks don't have to be square.

Slide 21: Reducing Conflict Misses by Blocking
• Conflict misses in caches that are not fully associative, as a function of blocking size
  – Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a factor of 48, despite the fact that both fit in the cache
[Figure: miss rate vs. blocking factor for direct-mapped and fully associative caches]

Slide 22: Reducing Misses by Hardware Prefetching of Instructions & Data
• E.g., instruction prefetching
  – Alpha fetches 2 blocks on a miss
  – The extra block is placed in a "stream buffer"
  – On a miss, check the stream buffer
  – Intel Core Duo has prefetchers for both code and data
• Works with data blocks, too:
  – Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
  – Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty (use it when the bus is idle)
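A minimal sketch of a single stream buffer in the spirit of Jouppi [1990]; the depth, names, and refill policy are all illustrative:

    #include <stdint.h>

    #define DEPTH 4

    static uint32_t stream[DEPTH];   /* FIFO of prefetched block addresses */
    static int      head, count;
    static uint32_t next_prefetch;

    /* Called on every cache miss, with the missing block address. */
    void on_miss(uint32_t block) {
        if (count > 0 && stream[head] == block) {
            head = (head + 1) % DEPTH;       /* hit in stream buffer: consume it */
            count--;
        } else {
            head = count = 0;                /* sequence broken: flush, restart */
            next_prefetch = block + 1;
        }
        /* Refill sequential successors when the bus is otherwise idle. */
        while (count < DEPTH) {
            stream[(head + count) % DEPTH] = next_prefetch++;
            count++;
        }
    }

Sequential miss streams (instruction fetch, array walks) keep hitting the buffer head, which is how even one buffer caught a quarter of the 4KB cache's misses.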

Slide 23: Reducing Misses by Software Prefetching of Data
• Data prefetch
  – Load data into a register (HP PA-RISC loads)
  – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9)
  – Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing prefetch instructions takes time
  – Is the cost of issuing prefetches < the savings in reduced misses?
  – Wider superscalar issue makes it easier to find the spare issue bandwidth
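As a modern illustration of compiler-visible, non-faulting prefetch (not one of the lecture's ISAs): GCC and Clang expose __builtin_prefetch, which compiles to the target's prefetch instruction where one exists and to nothing otherwise:

    /* Prefetch ~16 iterations ahead; the distance is a tuning guess. */
    void scale(int n, double *a) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 1, 0);  /* rw=1 write, low temporal locality */
            a[i] = 2.0 * a[i];
        }
    }

The lookahead distance of 16 illustrates the slide's cost question: every prefetch issued spends an instruction slot, which had better be cheaper than the miss it hides.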

Slide 24: Cache Example
Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 500 MHz. Assume a main memory access time of 200 ns, including all miss handling. Suppose the miss rate per instruction at the primary cache is 5%. How much faster will the machine be if we add a secondary cache that has a 20-ns access time for either a hit or a miss and is large enough to reduce the global miss rate to main memory to 2%?
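A worked sketch of the answer, using the standard CPI-with-memory-stalls model (the slide leaves the computation as an exercise; the arithmetic below follows directly from the numbers given):

    #include <stdio.h>

    int main(void) {
        double base_cpi   = 1.0;
        double cycle_ns   = 1000.0 / 500.0;        /* 500 MHz -> 2 ns/cycle */
        double mem_cycles = 200.0 / cycle_ns;      /* 200 ns  -> 100 cycles */
        double l2_cycles  = 20.0 / cycle_ns;       /* 20 ns   -> 10 cycles  */

        /* Without L2: all 5% of primary misses go to main memory. */
        double cpi_no_l2 = base_cpi + 0.05 * mem_cycles;                 /* 6.0 */

        /* With L2: all 5% pay the L2 access; 2% still go on to memory. */
        double cpi_l2 = base_cpi + 0.05 * l2_cycles + 0.02 * mem_cycles; /* 3.5 */

        printf("CPI without L2: %.1f\n", cpi_no_l2);
        printf("CPI with L2:    %.1f\n", cpi_l2);
        printf("speedup:        %.2f\n", cpi_no_l2 / cpi_l2);            /* ~1.7 */
        return 0;
    }

So the secondary cache makes the machine roughly 1.7 times faster.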

Slide 25: What is the Impact of What You've Learned About Caches?
• 1960–1985: Speed = ƒ(no. of operations)
• 1990:
  – Pipelined execution & fast clock rate
  – Out-of-order execution
  – Superscalar instruction issue
• 1998: Speed = ƒ(non-cached memory accesses)
• What does this mean for:
  – Compilers, operating systems, algorithms, data structures?

Slide 26: Main Memory Background
• Performance of main memory:
  – Latency: determines the cache miss penalty
    » Access time: time between the request and the word arriving
    » Cycle time: minimum time between requests
  – Bandwidth: matters for I/O & large-block miss penalties (L2)
• Main memory is DRAM: dynamic random access memory
  – Dynamic since it needs to be refreshed periodically (≈1% of the time)
  – Addresses divided into 2 halves (memory as a 2-D matrix):
    » RAS, or Row Access Strobe
    » CAS, or Column Access Strobe
• Cache uses SRAM: static random access memory
  – No refresh; 6 transistors/bit vs. 1 transistor/bit
  – Size: DRAM/SRAM ≈ 4–8; cost & cycle time: SRAM/DRAM ≈ 8–16

Slide 27: 4 Key DRAM Timing Parameters
• t_RAC: minimum time from the RAS line falling to valid data output
  – Quoted as the speed of a DRAM when buying
  – A typical 512Mbit DRAM has a t_RAC of 8–10 ns
• t_RC: minimum time from the start of one row access to the start of the next
  – t_RC = 15 ns for a 512Mbit DRAM with a t_RAC of 8–10 ns
• t_CAC: minimum time from the CAS line falling to valid data output
  – 1 ns for a 512Mbit DRAM with a t_RAC of 8–10 ns
• t_PC: minimum time from the start of one column access to the start of the next
  – 15 ns for a 512Mbit DRAM with a t_RAC of 8–10 ns

Slide 28: DRAM Performance
• An 8 ns (t_RAC) DRAM can:
  – Perform a row access only every 16 ns (t_RC)
  – Perform a column access (t_CAC) in 1 ns, but the time between column accesses is at least 3 ns (t_PC)
    » In practice, external address delays and bus turnaround make it 8 to 10 ns
• These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!

Slide 29: DRAM History
• DRAMs: capacity +60%/yr, cost –30%/yr
  – 2.5X cells/area, 1.5X die size in ≈ 3 years
• A '98 DRAM fab line cost $2B; an '08 line costs $2.5B and takes 3–4 years to build
• Relies on increasing numbers of computers & memory per computer (60% of the market)
  – SIMM, DIMM, or RIMM is the replaceable unit ⇒ computers can use any generation of (S)DRAM
• Commodity, second-source industry ⇒ high volume, low profit, conservative
  – Little organizational innovation in 20 years
• Order of importance: 1) cost/bit, 2) capacity
  – First-generation RAMBUS: 10X bandwidth at +30% cost ⇒ little impact
• Current SDRAM yield is very high: > 80%

Slide 30: Main Memory Performance
• Simple:
  – CPU, cache, bus, and memory all the same width (32 or 64 bits)
• Wide:
  – CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
  – CPU, cache, and bus 1 word; memory N modules (e.g., 4 modules); the example below is word-interleaved

Slide 31: Main Memory Performance
• Timing model (word size is 32 bits):
  – 1 cycle to send the address
  – 6 cycles for the access time, 1 cycle to send the data
  – Cache block is 4 words
• Simple memory: 4 × (1 + 6 + 1) = 32 cycles
• Wide memory: 1 + 6 + 1 = 8 cycles
• Interleaved memory: 1 + 6 + 4 × 1 = 11 cycles
[Figure: word addresses interleaved across Bank 0 through Bank 3]
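The same arithmetic as a runnable check, with the cycle counts taken straight from the timing model above:

    #include <stdio.h>

    int main(void) {
        int addr = 1, access = 6, xfer = 1, words = 4;

        int simple      = words * (addr + access + xfer);  /* 4 x 8 = 32 */
        int wide        = addr + access + xfer;            /* one 4-word transfer: 8 */
        int interleaved = addr + access + words * xfer;    /* overlapped accesses: 11 */

        printf("simple: %d, wide: %d, interleaved: %d\n",
               simple, wide, interleaved);
        return 0;
    }

Interleaving overlaps the four 6-cycle accesses, so only the four 1-cycle data transfers serialize.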

Slide 32: Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses
  – Multiprocessor
  – I/O (DMA)
  – CPU with hit under n misses, non-blocking cache
• Superbank: all memory active on one block transfer (or bank)
• Bank: portion within a superbank that is word-interleaved (or subbank)
[Figure: address split into superbank number, bank number, and bank offset]

Slide 33: Independent Memory Banks
• How many banks?
  – number of banks ≥ number of clocks to access a word in a bank
  – Needed for sequential accesses; otherwise the CPU will return to the original bank before it has the next word ready
  – (as in the vector case)
• Increasing DRAM capacity ⇒ fewer chips ⇒ harder to have many banks

Slide 34: Avoiding Bank Conflicts
• Lots of banks

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict
• SW: loop interchange, or declaring the array to not be a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide on every memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
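A minimal sketch of the array-padding fix mentioned above; the extra column is the whole trick (the +1 is illustrative, any non-multiple of the bank count works):

    /* 513 is not a multiple of 128, so successive x[i][j] for fixed j
       now land in different banks instead of all mapping to one. */
    int x[256][512 + 1];

    void scale(void) {
        for (int j = 0; j < 512; j = j + 1)
            for (int i = 0; i < 256; i = i + 1)
                x[i][j] = 2 * x[i][j];
    }

The cost is a little wasted memory per row; the alternative software fix is the loop interchange shown earlier.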

Slide 35: Fast Memory Systems: DRAM-Specific
• Multiple CAS accesses: several names (page mode)
  – Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap; what will they cost, will they survive?
  – RAMBUS: startup company; reinvented the DRAM interface
    » Each chip a module vs. a slice of memory
    » Short bus between CPU and chips
    » Does its own refresh
    » Variable amount of data returned
    » 1 byte / 2 ns (500 MB/s per chip)
  – Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock
  – Intel claims RAMBUS Direct is the future of PC memory; the Pentium 4 originally supported only RAMBUS memory...
• Niche memory or main memory?
  – e.g., video RAM for frame buffers: DRAM + fast serial output

Slide 36: Pitfall: Predicting Cache Performance of One Program from Another (ISA, compiler, ...)
• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%
• Why 2X Alpha vs. MIPS?
[Figure: miss rate (0%–35%) vs. cache size (KB) for the data and instruction caches of tomcatv, gcc, and espresso]

Slide 37: Pitfall: Simulating Too Small an Address Trace
[Figure: cumulative average memory access time vs. instructions executed (billions)]
• I$ = 4 KB, B = 16 B
• D$ = 4 KB, B = 16 B
• L2 = 512 KB, B = 128 B
• MP = 12, 200 (miss penalties)

Slide 38: Additional Pitfalls
• Having too small an address space
• Ignoring the impact of the operating system on the performance of the memory hierarchy

Slide 39: Figure 5.53, Summary of the memory-hierarchy examples in Chapter 5.