Slide 1: Lecture 11: Memory Hierarchy: Caches, Main Memory, & Virtual Memory
Michael B. Greenwald, Computer Architecture, CIS 501, Fall 1999

Slide 2: Administration
Homework #3: goals reduced. Problem 1: 20 points; Problem 2: 40 points.
My out-of-contact period should end after the weekend.

Slide 3: Improving Cache Performance
Average memory-access time (AMAT) = Hit time + Miss rate x Miss penalty (in ns or clock cycles)
Improve performance by:
1. Reducing the miss rate,
2. Reducing the miss penalty, or
3. Reducing the time to hit in the cache.
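As a quick sanity check of the formula, here is a minimal C sketch; the hit time, miss rate, and miss penalty values are illustrative assumptions, not numbers from the lecture:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative values (assumptions, not from the slides). */
    double hit_time     = 1.0;   /* cycles for an L1 hit         */
    double miss_rate    = 0.05;  /* 5% of accesses miss in L1    */
    double miss_penalty = 50.0;  /* cycles to service an L1 miss */

    /* AMAT = Hit time + Miss rate x Miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.05 * 50 = 3.50 */
    return 0;
}
```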

Slide 4: 5th Miss Penalty Technique: Second-Level (L2) Cache
L2 equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2).
– The global miss rate is what matters.
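The same equations in runnable form, a minimal C sketch with assumed illustrative rates; it also prints the global L2 miss rate alongside the local one:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative values (assumptions, not from the slides). */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;  /* L1: 1 cycle, 5% miss  */
    double hit_l2 = 10.0, miss_rate_l2 = 0.20;  /* L2 local miss rate    */
    double penalty_l2 = 100.0;                  /* cycles to main memory */

    /* Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2 */
    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * penalty_l2;

    /* AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1 */
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

    /* Global L2 miss rate = Miss Rate_L1 x Miss Rate_L2 */
    double global_l2 = miss_rate_l1 * miss_rate_l2;

    printf("L1 miss penalty  = %.1f cycles\n", miss_penalty_l1);  /* 30.0 */
    printf("AMAT             = %.2f cycles\n", amat);             /* 2.50 */
    printf("Global L2 misses = %.1f%% of CPU accesses\n", global_l2 * 100.0);
    return 0;
}
```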

Slide 5: Reducing Misses: Which Apply to the L2 Cache?
Miss-rate reduction techniques:
1. Reduce misses via larger block size
2. Reduce conflict misses via higher associativity
3. Reduce conflict misses via a victim cache
4. Reduce conflict misses via pseudo-associativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data (see the sketch after this list)
7. Reduce capacity/conflict misses by compiler optimizations
But: must be careful about CPU pin bandwidth when you go off-chip (Perl & Sites [OSDI '96]).
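Technique 6, software prefetching of data, can be sketched in C using GCC/Clang's __builtin_prefetch; the prefetch distance here is an assumed tuning parameter, not a value from the lecture:

```c
#include <stddef.h>

/* Sum an array, prefetching data a fixed distance ahead so that by the
 * time the loop reaches element i, a[i] is (hopefully) already in the
 * cache. PREFETCH_AHEAD is an illustrative tuning parameter. Requires
 * GCC or Clang for __builtin_prefetch. */
#define PREFETCH_AHEAD 16

double sum_with_prefetch(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], /*rw=*/0, /*locality=*/1);
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint: on a target without a prefetch instruction it generates no code, so correctness never depends on it.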

Slide 6: L2 Cache Block Size & AMAT
[Figure: AMAT as a function of L2 cache block size, for a 32 KB L1 cache and an 8-byte path to memory.]

Slide 7: Reducing Miss Penalty: Summary
Five techniques:
– Read priority over write on miss (a sketch follows this list)
– Subblock placement
– Early restart and critical word first on miss
– Non-blocking caches (hit under miss, miss under miss)
– Second-level cache
Can be applied recursively to multilevel caches:
– The danger is that the time to DRAM grows with multiple levels in between (the miss rate is higher at the lower levels, and so is the percentage of writes).
– First attempts at L2 caches can make things worse, since the increased worst case is worse.
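A minimal sketch of the first technique, giving read misses priority over buffered writes: on a read miss the cache checks the write buffer for the requested address and forwards the buffered data, instead of draining the whole buffer first. The structure sizes and names are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative write buffer: a few pending (address, data) pairs waiting
 * to go to memory. Size and field names are assumptions for the sketch. */
#define WB_ENTRIES 4

struct write_buffer {
    uint32_t addr[WB_ENTRIES];
    uint32_t data[WB_ENTRIES];
    bool     valid[WB_ENTRIES];
};

/* On a read miss, scan the write buffer. If a buffered write to this
 * address exists, forward its data: the read need not wait for the
 * buffer to drain and still sees the correct value. Entries are assumed
 * appended in order, so scanning from the end finds the newest first. */
bool read_miss_check_wb(const struct write_buffer *wb,
                        uint32_t addr, uint32_t *out) {
    for (int i = WB_ENTRIES - 1; i >= 0; i--) {
        if (wb->valid[i] && wb->addr[i] == addr) {
            *out = wb->data[i];
            return true;   /* serviced from the write buffer */
        }
    }
    return false;          /* no conflict: go to memory immediately */
}
```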

Slide 8: What is the Impact of What You've Learned About Caches?
1960-1985: Speed = f(number of operations)
1990: pipelined execution & fast clock rate; out-of-order execution; superscalar instruction issue
1998: Speed = f(non-cached memory accesses)
Superscalar, out-of-order machines hide the L1 data cache miss (~5 clocks) but not the L2 cache miss (~50 clocks)?

Slide 9: Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)

Technique                           MR  MP  HT  Complexity
Larger block size                   +   –       0
Higher associativity                +       –   1
Victim caches                       +           2
Pseudo-associative caches           +           2
HW prefetching of instr/data        +           2
Compiler-controlled prefetching     +           3
Compiler reduces misses             +           0
Priority to read misses                 +       1
Subblock placement                      +   +   1
Early restart & critical word 1st       +       2
Non-blocking caches                     +       3
Second-level caches                     +       2

Slide 10: Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

Slide 11: 1. Fast Hit Times via Small and Simple Caches
Hit time is coupled to the clock rate.
Direct mapped is faster and allows overlapping the tag check with the data load.
Keep it on chip. (Why does the Alpha have an 8 KB instruction cache and an 8 KB data cache plus a 96 KB second-level cache?)

Slide 12: 2. Fast Hits by Avoiding Address Translation
Every memory reference must be translated from a virtual address (a logical, per-process name) to a physical address (where the data currently resides in memory). Translation is a lookup in the page table (itself kept in memory and on disk), indexed by the high-order address bits (the "page" number).
Why not just send the virtual address to the cache? Such a cache is called a virtually addressed cache, or just a virtual cache, as opposed to a physical cache.
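A minimal sketch of the translation step, assuming a single-level page table and 4 KB pages (both simplifications chosen for illustration; real systems use multi-level tables and a TLB):

```c
#include <stdint.h>

#define PAGE_SHIFT  12                       /* assume 4 KB pages */
#define PAGE_OFFSET ((1u << PAGE_SHIFT) - 1)

/* page_table[vpn] holds the physical frame number for each virtual page.
 * A single flat table is an illustrative simplification. */
extern uint32_t page_table[];

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* virtual page number       */
    uint32_t offset = vaddr & PAGE_OFFSET;   /* unchanged by translation  */
    uint32_t pfn    = page_table[vpn];       /* physical frame number     */
    return (pfn << PAGE_SHIFT) | offset;     /* physical address          */
}
```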

Slide 13: Virtually Addressed Caches
[Figure: three organizations, where TB = translation buffer (TLB) and $ = cache.
(a) Conventional organization: CPU sends the VA to the TB; the resulting PA indexes the cache, then memory.
(b) Virtually addressed cache: CPU sends the VA straight to the cache; translate only on a miss. Synonym problem.
(c) Overlapped organization: cache access (with PA tags) proceeds in parallel with VA translation; requires the cache index to remain invariant across translation. An L2 cache sits below.]

Slide 14: 2. Fast Hits by Avoiding Address Translation (cont.)
Why aren't all caches virtual? There are problems:
– Every time the process is switched, the cache logically must be flushed; otherwise we get false hits.
  » The cost is the time to flush plus the "compulsory" misses from the now-empty cache.
– Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address.
– I/O must interact with the cache, so it needs a virtual address: what if that address belongs to a process that is not currently running?
Solution to the cache flush:
– Add a process-identifier tag that identifies the process as well as the address within the process: we can't get a hit if the process is wrong. (See the sketch after this slide.)
Solution to aliases:
– Force aliases to be identical in the last N bits of their virtual addresses (called page coloring). If the cache is direct mapped and indexed by no more than the N lower bits, then no two aliases can be in the cache simultaneously.
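A minimal sketch of the process-ID tag check in a direct-mapped virtual cache lookup; all field names, sizes, and bit splits are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 256   /* illustrative direct-mapped cache: 256 sets */

struct cache_line {
    bool     valid;
    uint8_t  asid;     /* address-space (process) ID tag  */
    uint32_t tag;      /* virtual-address tag             */
};

/* A hit now requires matching the process ID as well as the address tag,
 * so a context switch needs no cache flush: another process's blocks
 * simply never hit. Assumes 64-byte blocks, so 6 offset bits plus
 * 8 index bits leaves the tag in bits 14 and up. */
bool vcache_hit(const struct cache_line cache[NUM_SETS],
                uint32_t vaddr, uint8_t cur_asid) {
    uint32_t index = (vaddr >> 6) % NUM_SETS;
    uint32_t tag   = vaddr >> 14;
    const struct cache_line *line = &cache[index];
    return line->valid && line->asid == cur_asid && line->tag == tag;
}
```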

Slide 15: 2. Fast Cache Hits by Avoiding Translation: Process-ID Impact
[Figure: miss rate (y-axis, up to 20%) vs. cache size (x-axis, 2 KB to 1024 KB). Black: uniprocess. Light gray: multiprocess, flushing the cache on a switch. Dark gray: multiprocess, using a process-ID tag.]

Slide 16: 2. Fast Cache Hits by Avoiding Translation: Index with the Physical Portion of the Address
If the index uses only the physical part of the address (the page offset), tag access can start in parallel with translation, and the comparison is against the physical tag (which won't change).
This limits a direct-mapped cache to the page size. What if we want bigger caches?
– Higher associativity moves the barrier to the right (even beyond 8-way).
– Page coloring (require the physical and virtual addresses to match in the low N bits).
Address layout:
  | Page address | Page offset          |
  | Address tag  | Index | Block offset |
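The constraint can be made concrete with a little arithmetic: the index and block-offset bits must fit inside the page offset, so each way of the cache can be at most one page. A sketch, assuming 4 KB pages and 64-byte blocks (illustrative choices):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative parameters: 4 KB pages, 64-byte blocks. */
    unsigned page_size  = 4096;
    unsigned block_size = 64;

    /* Index + block offset must lie within the page offset:
     *   cache_size / associativity <= page_size
     * so higher associativity allows a proportionally larger cache. */
    for (unsigned assoc = 1; assoc <= 8; assoc *= 2) {
        unsigned max_cache = page_size * assoc;
        printf("%u-way: max cache for parallel lookup = %u KB "
               "(%u sets of %u-byte blocks)\n",
               assoc, max_cache / 1024,
               page_size / block_size, block_size);
    }
    return 0;
}
```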

Slide 17: 3. Fast Hit Times via Pipelined Writes
Pipeline the tag check and the cache update as separate stages: the current write does its tag check while the previous write updates the cache.
Only stores occupy the pipeline; it is empty during a miss.

  Store r2, (r1)   check r1
  Add              --
  Sub              --
  Store r4, (r3)   M[r1] <- r2 & check r3

The parked store (the "delayed write buffer", shaded in the original figure) must be checked on reads: either complete the write first or read from the buffer.
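A minimal sketch of a single-entry delayed write buffer: a store does only its tag check and parks in the buffer; the next store drains it into the cache, and a load to the same address forwards from it. Names and the helper functions cache_read/cache_write are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* One parked store whose data has not yet been written into the cache. */
struct delayed_write {
    bool     pending;
    uint32_t addr;
    uint32_t data;
};

static struct delayed_write dwb;

extern void     cache_write(uint32_t addr, uint32_t data);  /* real update */
extern uint32_t cache_read(uint32_t addr);

/* A store first drains the previously parked write into the cache, then
 * parks itself; its own tag check happens in this same stage. */
void store(uint32_t addr, uint32_t data) {
    if (dwb.pending)
        cache_write(dwb.addr, dwb.data);
    dwb = (struct delayed_write){ .pending = true, .addr = addr, .data = data };
}

/* A load must check the buffer: either forward the parked data or read
 * the cache as usual. */
uint32_t load(uint32_t addr) {
    if (dwb.pending && dwb.addr == addr)
        return dwb.data;
    return cache_read(addr);
}
```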

Slide 18: 4. Fast Writes on Misses via Small Subblocks
If most writes are one word, the subblock size is one word, and the cache is write-through, then always write the subblock and tag immediately:
– Tag match and valid bit already set: writing the block was proper, and nothing is lost by setting the valid bit again.
– Tag match and valid bit not set: the tag match means this is the proper block; writing the data into the subblock makes it appropriate to turn the valid bit on.
– Tag mismatch: this is a miss and will modify the data portion of the block. Since the cache is write-through, no harm is done; memory still has an up-to-date copy of the old value. Only the tag (to the address of the write) and the valid bits of the other subblocks need be changed, because the valid bit for this subblock has already been set.
This doesn't work with write-back caches, because of the last case. (A sketch of the three cases follows.)
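A sketch of the three cases in C, for a write-through cache with one-word subblocks; the structure and sizes are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS 4   /* illustrative: 4 one-word subblocks per block */

struct block {
    uint32_t tag;
    bool     valid[SUBBLOCKS];
    uint32_t word[SUBBLOCKS];
};

/* Write-through, one-word writes: always write the addressed subblock
 * and the tag immediately; the three cases differ only in how the other
 * valid bits are handled. */
void write_word(struct block *b, uint32_t tag, unsigned sub, uint32_t data) {
    bool tag_match = (b->tag == tag);

    b->word[sub]  = data;   /* write data and set this subblock valid ... */
    b->valid[sub] = true;   /* ... unconditionally, in every case         */
    b->tag        = tag;    /* harmless on a match, required on a miss    */

    if (!tag_match) {
        /* Tag mismatch: the other subblocks belong to the old block and
         * are now invalid. Memory is still up to date (write-through). */
        for (unsigned i = 0; i < SUBBLOCKS; i++)
            if (i != sub)
                b->valid[i] = false;
    }
    /* Tag match (valid bit previously set or not): nothing more to do. */
    /* The word also goes to memory via write-through (not shown). */
}
```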

Slide 19: Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)

Technique                           MR  MP  HT  Complexity
Larger block size                   +   –       0
Higher associativity                +       –   1
Victim caches                       +           2
Pseudo-associative caches           +           2
HW prefetching of instr/data        +           2
Compiler-controlled prefetching     +           3
Compiler reduces misses             +           0
Priority to read misses                 +       1
Subblock placement                      +   +   1
Early restart & critical word 1st       +       2
Non-blocking caches                     +       3
Second-level caches                     +       2
Small & simple caches               –       +   0
Avoiding address translation                +   2
Pipelining writes                           +   1

Slide 20: Main Memory Background
Performance of main memory:
– Latency: determines the cache miss penalty.
  » Access time: time between the request and the word arriving.
  » Cycle time: minimum time between successive requests.
– Bandwidth: matters for I/O and for the large-block miss penalty (L2).
Main memory is DRAM: dynamic random access memory.
– "Dynamic" because it must be refreshed periodically (every 8 ms, ~1% of time).
– Addresses are divided into two halves (memory as a 2-D matrix):
  » RAS, or row access strobe
  » CAS, or column access strobe
Caches use SRAM: static random access memory.
– No refresh (6 transistors/bit vs. 1 transistor/bit).
– Size: DRAM/SRAM ~ 4-8x; cost and cycle time: SRAM/DRAM ~ 8-16x.
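The "(8 ms, ~1% of time)" claim can be checked with a little arithmetic. This sketch assumes a 2,048-row array (as in the logical-organization slide below) and an assumed per-row refresh time, chosen only for illustration:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers (assumptions, not from the slides). */
    double refresh_period_ms = 8.0;   /* every row refreshed within 8 ms */
    int    rows              = 2048;  /* e.g. a 2,048 x 2,048 array      */
    double row_refresh_ns    = 40.0;  /* assumed time to refresh one row */

    /* Fraction of time the DRAM is busy refreshing instead of serving
     * accesses: (rows x time per row) / refresh period. */
    double busy_ns  = rows * row_refresh_ns;
    double total_ns = refresh_period_ms * 1e6;
    printf("Refresh overhead = %.2f%% of time\n",
           100.0 * busy_ns / total_ns);   /* ~1.02%, matching the slide */
    return 0;
}
```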

Slide 21: Core Memory
Ever wonder where "out-of-core", "in-core", and "core dump" come from?
Non-volatile, magnetic storage.
Lost out to the 4 Kbit DRAM.
Access time 750 ns; cycle time 1500-3000 ns.

Slide 22: DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit address (A0-A10) passes through an address buffer and is used twice, as a row address and as a column address (the square root of the bits per RAS/CAS). The row decoder selects a word line in a 2,048 x 2,048 memory array; each storage cell sits at a word-line/bit-line intersection. Sense amps & I/O read the bit lines, and the column decoder selects the data bit (D in, Q out).]

MBG 23 CIS501, Fall 99 DRAM physical organization (4 Mbit) Block Row Dec. 9 : 512 Row Block Row Dec. 9 : 512 ColumnAddress … Block Row Dec. 9 : 512 Block Row Dec. 9 : 512 … Block 0Block 3 … I/O D Q Address 2 8 I/Os