1 Memory Hierarchy Design

2 Outline: Introduction, Cache Performance, Reducing Cache Miss Penalty, Reducing Miss Rate, Virtual Memory, Protection and Examples of Virtual Memory

3 5.1 Introduction

4 Memory Hierarchy Design
Motivated by the principle of locality:
– Take advantage of 2 forms of locality: spatial – nearby references are likely soon; temporal – the same reference is likely to recur soon
Also motivated by cost/performance structure:
– Smaller hardware is faster: SRAM, DRAM, disk, tape
– Access time vs. bandwidth variations
– Fast memory is more expensive
Goal – provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level

5 DRAM/CPU Gap
CPU performance improves at about 55% per year
– In 1996 it was a phenomenal 18% per month
DRAM performance has improved at only about 7% per year

6 Levels in A Typical Memory Hierarchy

7 5.3 Cache Performance

8 Cache Memory
Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU.
When the CPU finds a requested data item in the cache, it is called a cache hit.
When the CPU does not find a data item in the cache, it is called a cache miss.

9 Cache Performance
A better measure of memory-hierarchy performance is the average memory access time:
Average memory access time = Hit time + Miss rate * Miss penalty
where Hit time is the time to hit in the cache, and Miss rate is the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the total number of accesses).
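As a quick illustration, the sketch below plugs assumed numbers into the formula above (a 1 ns hit time, a 5% miss rate, and a 100 ns miss penalty; these values are illustrative and do not come from the slides):

#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* ns, assumed */
    double miss_rate    = 0.05;   /* fraction of accesses that miss, assumed */
    double miss_penalty = 100.0;  /* ns to fetch from the next level, assumed */

    /* Average memory access time = Hit time + Miss rate * Miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f ns\n", amat);   /* prints 6.0 ns */
    return 0;
}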

10 Average Memory Access Time and Processor Performance
As we know, we can model CPU time as:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) * Clock cycle time
This formula raises the question of whether the clock cycles for a cache hit should be considered
– part of CPU execution clock cycles, or
– part of memory stall clock cycles.
The most widely accepted convention is to include hit clock cycles in CPU execution clock cycles.
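A minimal sketch of this model with assumed numbers (instruction count, CPI, memory references per instruction, miss rate, miss penalty, and cycle time are all illustrative):

#include <stdio.h>

int main(void) {
    double insts             = 1e6;
    double cpi_execution     = 1.0;    /* hit clock cycles are counted here */
    double mem_refs_per_inst = 1.5;
    double miss_rate         = 0.02;
    double miss_penalty      = 100.0;  /* clock cycles per miss */
    double cycle_time_ns     = 0.5;

    double exec_cycles  = insts * cpi_execution;
    double stall_cycles = insts * mem_refs_per_inst * miss_rate * miss_penalty;
    double cpu_time_ns  = (exec_cycles + stall_cycles) * cycle_time_ns;

    /* prints CPU time = 2.00 ms, effective CPI = 4.00 for these numbers */
    printf("CPU time = %.2f ms, effective CPI = %.2f\n",
           cpu_time_ns / 1e6, (exec_cycles + stall_cycles) / insts);
    return 0;
}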

11 (contd.)
Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:
– The lower the CPI of execution, the higher the relative impact of a fixed number of cache-miss clock cycles.
– When calculating CPI, the cache miss penalty is measured in CPU clock cycles per miss. Therefore, even if the memory hierarchies of two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence a higher memory portion of CPI.
Although minimizing average memory access time is a reasonable goal, the final goal is to reduce CPU execution time.

12 Improving Cache Performance
Strategies for improving cache performance:
– Reducing the miss penalty: multilevel caches, critical word first, read miss before write miss
– Reducing the miss rate: larger block size, larger cache size, higher associativity, compiler optimizations
– Reducing the miss penalty or miss rate via parallelism: non-blocking caches
– Reducing the time to hit in the cache: small and simple caches, avoiding address translation, pipelined cache access, and trace caches

13 Reducing Cache Miss Penalty

14 Reducing cache misses has been the traditional focus of cache research, but the cache performance formula assures us that improvements in miss penalty can be just as beneficial as improvements in miss rate.
We give five optimizations here to address increasing miss penalty. The most interesting is the first, which adds more levels of caches to reduce the miss penalty.

15 Techniques for Reducing Miss Penalty
– Multilevel caches (the most important)
– Critical word first and early restart
– Giving priority to read misses over writes
– Merging write buffer
– Victim caches

16 Multi-Level Caches
Probably the best miss-penalty reduction.
Performance measurement for 2-level caches:
– AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
and
– Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
so
– AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
L1 and L2 refer, respectively, to the first-level and second-level caches. The second-level miss rate is measured on the leftovers from the first-level cache.
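A minimal sketch of the two-level formula with assumed numbers (all times in clock cycles; a 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, and 100-cycle memory access are illustrative only):

#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.04;
    double hit_l2 = 10.0, miss_rate_l2 = 0.50;   /* local L2 miss rate */
    double miss_penalty_l2 = 100.0;              /* main-memory access */

    double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
    double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;
    printf("AMAT = %.1f cycles\n", amat);        /* 1 + 0.04 * 60 = 3.4 */
    return 0;
}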

17 Multi-Level Caches (Cont.)
Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss-rate-L2)
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss-rate-L1 x Miss-rate-L2)
– The global miss rate is what matters
Advantages:
– Capacity misses in L1 end up with a significant penalty reduction, since they will likely be supplied from L2 with no need to go to main memory
– Conflict misses in L1 will similarly be supplied by L2
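A minimal numeric sketch of the local vs. global distinction, using assumed counts (1000 CPU references, 40 L1 misses, of which 20 also miss in L2):

#include <stdio.h>

int main(void) {
    double refs = 1000, l1_misses = 40, l2_misses = 20;

    printf("L1 miss rate        = %.1f%%\n", 100 * l1_misses / refs);       /* 4.0%  */
    printf("L2 local miss rate  = %.1f%%\n", 100 * l2_misses / l1_misses);  /* 50.0% */
    printf("L2 global miss rate = %.1f%%\n", 100 * l2_misses / refs);       /* 2.0%  */
    return 0;
}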

18 Comparing Local and Global Miss Rates
– With huge second-level caches, the global miss rate is close to the single-level cache miss rate, provided L2 >> L1
– The global cache miss rate should be used when evaluating second-level caches (or 3rd, 4th, … levels of the hierarchy)
– L2 sees many fewer hits than L1, so the target is to reduce misses

19 Critical Word First and Early Restart
Multilevel caches require extra hardware to reduce the miss penalty; these two techniques do not.
Do not wait for the full block to be loaded before restarting the CPU:
– Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.
– Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

20 Benefits of critical word first and early restart depend on:
– Block size: generally useful only with large blocks
– The likelihood of another access to the portion of the block that has not yet been fetched

21 Giving Priority to Read Misses over Writes
With write-through, write buffers complicate memory accesses in that they might hold the updated value of a location needed on a read miss:
– RAW conflicts with main memory reads on cache misses
– Making the read miss wait until the write buffer is empty increases the read miss penalty
Instead, check the contents of the write buffer on a read miss, and
– if there are no conflicts, and
– the memory system is available, let the read miss continue.
Example:
SW R3, 512(R0)
LW R1, 1024(R0)
LW R2, 512(R0)
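A minimal sketch of that check, assuming a small hypothetical write buffer (the structure and size are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* Returns true if the read miss may go to memory ahead of the buffered writes. */
bool read_may_bypass_writes(uint32_t read_addr) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == read_addr)
            return false;   /* RAW conflict: wait, or forward the buffered value */
    return true;            /* no conflict: service the read miss immediately */
}

In the three-instruction sequence above, the load from 512(R0) would find a matching write-buffer entry and must either wait for the buffer to drain or take the value from the buffer.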

22 Merging Write Buffer
This technique also involves write buffers, this time improving their efficiency. A write-buffer entry often holds multiple words, but a write often involves a single word:
– A single-word write occupies a whole entry if there is no write merging
Write merging: check whether the address of the new data matches the address of a valid write buffer entry.
– If so, the new data are combined with that entry
Advantages:
– Multi-word writes are usually faster than writes performed one word at a time
– Reduces stalls due to the write buffer being full
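A minimal sketch of the merge check, assuming 4-word write-buffer entries addressed by block (all structures and sizes are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES      4
#define WORDS_PER_ENTRY 4

struct wb_entry {
    bool     valid;
    uint32_t block_addr;                 /* address of the first word in the entry */
    uint32_t data[WORDS_PER_ENTRY];
    uint8_t  word_valid;                 /* one bit per word already buffered */
};
static struct wb_entry wb[WB_ENTRIES];

/* Try to merge a one-word write into an existing entry; return false if a new entry is needed. */
bool try_merge(uint32_t addr, uint32_t value) {
    uint32_t block = addr & ~(uint32_t)(WORDS_PER_ENTRY * 4 - 1);
    uint32_t word  = (addr >> 2) & (WORDS_PER_ENTRY - 1);
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block_addr == block) {
            wb[i].data[word]  = value;                  /* combine with the existing entry */
            wb[i].word_valid |= (uint8_t)(1u << word);
            return true;
        }
    }
    return false;
}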

23 Victim Caches
One approach to lowering the miss penalty is to remember what was just discarded in case it is needed again. Since the discarded data have already been fetched, they can be used again at small cost.
Add a small fully associative cache (called a victim cache) between the cache and the refill path:
– It contains only blocks discarded from the cache because of a miss
– It is checked on a miss to see if it has the desired data before going to the next lower level of memory; if so, the victim block and the cache block are swapped
– Addressing both the victim cache and the regular cache at the same time means the hit time does not increase
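A minimal sketch of the victim-cache lookup, assuming a 4-entry fully associative victim cache that stores only tags here (block data omitted; the structure is illustrative):

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

struct vc_entry { bool valid; uint32_t tag; /* block data omitted for brevity */ };
static struct vc_entry victim[VC_ENTRIES];

/* On a main-cache miss: if the block is in the victim cache, swap it with the
   block being evicted from the main cache instead of going to the next level. */
int victim_lookup(uint32_t block_tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == block_tag)
            return i;          /* hit in the victim cache: swap blocks */
    return -1;                 /* miss: fetch from the next lower level of memory */
}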

24 Victim Cache Organization

25 Reducing Miss Rate

26 Classify Cache Misses - 3 C's
To gain better insight into the causes of misses, we sort all misses into three simple categories:
– Compulsory: independent of cache size. The first access to a block leaves no choice but to load it. Also called cold-start or first-reference misses; measured with an infinite (ideal) cache.
– Capacity: decreases as cache size increases. The cache cannot contain all the blocks needed during execution, so blocks are discarded and later retrieved.
– Conflict (collision): decreases as associativity increases. A side effect of set-associative or direct-mapped placement: a block may be discarded and later retrieved if too many blocks map to the same cache set. These misses are also called collision or interference misses.

27 Techniques for Reducing Miss Rate
The classical approach to improving cache behavior is to reduce miss rates. We present five techniques to do so:
– Larger block size
– Larger caches
– Higher associativity
– Way prediction and pseudo-associative caches
– Compiler optimizations

28 Larger Block Sizes
The simplest way to reduce the miss rate is to increase the block size. A larger block size reduces compulsory misses, thanks to spatial locality.
Obvious disadvantages:
– Higher miss penalty: a larger block takes longer to move
– May increase conflict misses, and capacity misses if the cache is small

29 Larger Caches
The way to reduce capacity misses is to increase the capacity of the cache.
– Helps with both conflict and capacity misses
Disadvantage:
– May require a longer hit time and higher hardware cost

30 Higher Associativity
There are 2 general rules of thumb:
– 8-way set associative is, for practical purposes, as effective in reducing misses as fully associative
– 2:1 rule of thumb: a 2-way set-associative cache of size N/2 has about the same miss rate as a direct-mapped cache of size N (holds for cache sizes < 128 KB)
Greater associativity comes at the cost of increased hit time:
– It may lengthen the clock cycle

31 Way Prediction
In way prediction, extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
The multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle.
A miss results in checking the other blocks for matches in subsequent clock cycles.

32 Pseudo-Associative Caches
A related approach is called pseudo-associative or column associative. It attempts to get the miss rate of set-associative caches with the hit speed of a direct-mapped cache.
Idea:
– Start with a direct-mapped cache
– On a miss, check another entry; the usual method is to invert the high-order index bit to get the next try
Problem - fast hit and slow hit:
– You may find that you mostly need the slow hit; in that case it is better to swap the blocks
Drawback: CPU pipelining is hard if a hit takes 1 or 2 cycles
– Better for caches not tied directly to the processor (L2)
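A minimal sketch of the "invert the high-order index bit" trick; the cache geometry (32-byte blocks, 256 direct-mapped sets) is an assumption for illustration:

#include <stdint.h>

#define BLOCK_BITS 5                   /* 32-byte blocks */
#define INDEX_BITS 8                   /* 256 direct-mapped sets */

uint32_t primary_index(uint32_t addr) {
    return (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
}

/* On a miss in the primary slot, probe the "pseudo-set" whose index differs
   only in its high-order bit. */
uint32_t secondary_index(uint32_t addr) {
    return primary_index(addr) ^ (1u << (INDEX_BITS - 1));
}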

33 Compiler Optimization for Code
This final technique reduces miss rates without any hardware changes. Code can easily be rearranged without affecting correctness; for example, reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses.

34 Compiler Optimization for Data
Idea – improve the spatial and temporal locality of the data. There are lots of options (a sketch of one follows this slide):
– Array merging – allocate arrays so that paired operands end up in the same cache block
– Loop interchange – exchange the inner and outer loop order to improve cache behavior
– Loop fusion – for independent loops accessing the same data, fuse them into a single aggregate loop
– Blocking – do as much as possible on a sub-block before moving on
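A minimal sketch of loop interchange; the array size is arbitrary and the example is illustrative only. C stores arrays in row-major order, so making the column index the inner loop lets consecutive accesses fall in the same cache block:

#define N 1024
static double x[N][N];

/* Before: the inner loop strides down a column, touching a new cache block
   on almost every access (poor spatial locality). */
void scale_before(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: the inner loop walks along a row, so consecutive
   accesses hit the same cache block. */
void scale_after(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

Blocking restructures loops in a similar spirit, operating on cache-sized sub-blocks of the arrays before moving on.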

35 Virtual Memory

36 Virtual Memory
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes.
With virtual memory, the CPU produces virtual addresses that are translated by a combination of hardware and software into physical addresses, which access main memory. This process is called memory mapping or address translation.
Today, the two memory-hierarchy levels controlled by virtual memory are DRAM and magnetic disk.

37 Example of Virtual to Physical Address Mapping: mapping by a page table

38 Address Translation Hardware for Paging
[Figure: the physical address is formed from a frame number f of (l-n) bits and a frame offset d of n bits; the page table supplies the frame number for each virtual page number.]

39 Page table when some pages are not in main memory…
– an access to such a page is an illegal access
– the OS puts the process in the backing store when it starts executing

40 Virtual Memory (Cont.)
Permits applications to grow bigger than the main memory size.
Helps with multiple-process management:
– Each process gets its own chunk of memory
– Permits protection of one process' chunks from another
– Maps multiple chunks onto shared physical memory
– Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
– The application and CPU run in virtual space (logical memory, 0 – max)
– The mapping onto physical space is invisible to the application
Cache vs. VM terminology:
– A block becomes a page or segment
– A miss becomes a page fault or address fault

41 Cache vs. VM Differences
Replacement:
– A cache miss is handled by hardware
– A page fault is usually handled by the OS
Addresses:
– The VM space is determined by the address size of the CPU
– The cache size is independent of the CPU address size
Lower-level memory:
– For caches, the main memory is not shared by something else
– For VM, most of the disk contains the file system, which is addressed differently (usually in I/O space); the VM lower level is usually called swap space

42 2 VM Styles - Paged or Segmented?
Virtual memory systems can be categorized into two classes: pages (fixed-size blocks) and segments (variable-size blocks).
– Words per address: page – one; segment – two (segment and offset)
– Programmer visible? page – invisible to the application programmer; segment – may be visible to the application programmer
– Replacing a block: page – trivial (all blocks are the same size); segment – hard (must find a contiguous, variable-size, unused portion of main memory)
– Memory use inefficiency: page – internal fragmentation (unused portion of page); segment – external fragmentation (unused pieces of main memory)
– Efficient disk traffic: page – yes (adjust page size to balance access time and transfer time); segment – not always (small segments may transfer just a few bytes)

43 Block Identification Example
Physical space = 2^5, logical space = 2^4, page size = 2^2
PT size = 2^4 / 2^2 = 2^2 entries
Each PT entry needs 5 - 2 = 3 bits (the frame number)
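A minimal sketch of translation with these example parameters (4-bit virtual addresses, 5-bit physical addresses, 4-byte pages); the page-table contents are made up for illustration:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 2                 /* page size = 2^2 */
#define PT_ENTRIES  4                 /* logical space 2^4 / page size 2^2 */

/* Hypothetical page table: each entry holds a 3-bit frame number (physical space = 2^5). */
static const uint8_t page_table[PT_ENTRIES] = { 5, 0, 2, 7 };

uint8_t translate(uint8_t vaddr) {
    uint8_t vpn    = vaddr >> OFFSET_BITS;              /* virtual page number */
    uint8_t offset = vaddr & ((1 << OFFSET_BITS) - 1);  /* page offset */
    uint8_t frame  = page_table[vpn];
    return (uint8_t)((frame << OFFSET_BITS) | offset);  /* 5-bit physical address */
}

int main(void) {
    /* vaddr 0x6: vpn 1 -> frame 0, offset 2, so paddr = 0x2 */
    printf("vaddr 0x6 -> paddr 0x%x\n", translate(0x6));
    return 0;
}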

44 Virtual Memory Block Replacement
LRU is the best, but true LRU is a bit complex, so an approximation is used:
– The page table contains a use (reference) bit, and on an access the use bit is set
– The OS checks the bits every so often, records what it sees in a data structure, and then clears them all
– On a miss, the OS decides which page has been used least and replaces that one
Write strategy – always write back:
– Given the access time of the disk, write-through makes no sense
– A dirty bit is used so that only pages that have been modified are written back
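A minimal sketch of that use-bit approximation; the aging counter and table layout are illustrative, not from the slides:

#include <stdint.h>

#define NPAGES 8

struct pte { uint8_t use; uint8_t dirty; uint8_t age; };
static struct pte pt[NPAGES];

/* Called periodically by the OS: record the use bits in a history byte, then clear them. */
void age_pages(void) {
    for (int i = 0; i < NPAGES; i++) {
        pt[i].age = (uint8_t)((pt[i].age >> 1) | (pt[i].use << 7));
        pt[i].use = 0;
    }
}

/* On a page fault: pick the page whose history shows the least recent use. */
int choose_victim(void) {
    int victim = 0;
    for (int i = 1; i < NPAGES; i++)
        if (pt[i].age < pt[victim].age)
            victim = i;
    return victim;   /* if pt[victim].dirty is set, write the page back first */
}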

45 Techniques for Fast Address Translation
The page table is kept in main memory (kernel memory), and each process has its own page table.
Every data/instruction access therefore requires two memory accesses:
– One for the page table and one for the data/instruction
– This can be solved by a special fast-lookup hardware cache called associative registers or a translation look-aside buffer (TLB)
If locality applies, recent translations are cached:
– TLB = translation look-aside buffer
– TLB entry: virtual page number, physical page number, protection bit, use bit, dirty bit
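A minimal sketch of a TLB lookup in front of the page-table walk; the tiny fully associative TLB, its size, and the stubbed walk_page_table() helper are assumptions for illustration:

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 4

struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Stub for the slow path: a real walk would make an extra memory access
   to the in-memory page table. */
static uint32_t walk_page_table(uint32_t vpn) { return vpn; }

uint32_t lookup_ppn(uint32_t vpn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].ppn;                      /* TLB hit: no extra memory access */
    uint32_t ppn = walk_page_table(vpn);            /* TLB miss: take the slow path */
    tlb[0] = (struct tlb_entry){ true, vpn, ppn };  /* naive replacement into entry 0 */
    return ppn;
}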

46 Paging Hardware with TLB

47 Page Size – An Architectural Choice
Large pages are good:
– They reduce the page table size
– They amortize the long disk access
– If spatial locality is good, the hit rate improves
– They reduce the number of TLB misses
Large pages are bad:
– More internal fragmentation: if everything is random, each structure's last page is only half full
– Process start-up time takes longer

48 Protection and Examples of VM

49 Protection
Multiprogramming forces us to worry about protection, since there are lots of processes sharing the machine:
– Hence task-switch overhead
– The hardware must provide savable state
– The OS must promise to save and restore it properly
– Most machines task switch every few milliseconds
– A task switch typically takes several microseconds

50 Protection Options
Simplest - base and bound (an address is valid if Base <= Address <= Bound; a sketch follows this slide):
– Two registers; check that each address falls between the values
– These registers must be changeable by the OS but not by the application, which requires two modes (regular and privileged), hence the need for a privilege trap and a mode-switch ability
VM provides another option:
– Check as part of the VA -> PA translation process
– Memory protection is implemented by associating protection bits with each page; the protection bits reside in the page table and the TLB (read-only, read-write, execute-only)
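A minimal sketch of the base-and-bound check; the register values are illustrative only:

#include <stdbool.h>
#include <stdint.h>

static uint32_t base_reg  = 0x4000;   /* writable only in privileged mode */
static uint32_t bound_reg = 0x7fff;

/* Valid if Base <= Address <= Bound; otherwise the hardware raises a protection trap. */
bool address_ok(uint32_t addr) {
    return addr >= base_reg && addr <= bound_reg;
}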

51 Memory Protection with Valid-Invalid Bit
[Figure: a process of a given length uses only part of its address space; page-table entries beyond the process's pages are marked invalid, and the page-table length register (PTLR) limits which entries may be used.]

52 VM Examples: Alpha and Intel Pentium (paged segmentation and multi-level paging)