
1 Memory Hierarchy Design

2 Outline
–Introduction
–Cache Performance
–Reducing Cache Miss Penalty
–Reducing Miss Rate
–Virtual Memory
–Protection and Examples of Virtual Memory

3 5.1 Introduction

4 Memory Hierarchy Design
Motivated by the principle of locality. Take advantage of two forms of locality:
–Spatial: nearby references are likely to be accessed soon
–Temporal: the same reference is likely to be accessed again soon
Also motivated by cost/performance trade-offs:
–Smaller hardware is faster: SRAM, DRAM, disk, tape
–Access time vs. bandwidth variations
–Fast memory is more expensive
Goal: provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level.

5 DRAM/CPU Gap
–CPU performance has improved at about 55% per year (in 1996 it was a phenomenal 18% per month)
–DRAM performance has improved at only about 7% per year

6 Levels in a Typical Memory Hierarchy

7 5.3 Cache Performance

8 Cache Memory
Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU.
–When the CPU finds a requested data item in the cache, it is called a cache hit.
–When the CPU does not find the data item in the cache, it is called a cache miss.

9 Cache Performance
A better measure of memory hierarchy performance is the average memory access time:
–Average memory access time = Hit time + Miss rate × Miss penalty
where Hit time is the time to hit in the cache and Miss rate is the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the total number of accesses).
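As a quick illustration of the formula above, here is a minimal Python sketch; the hit time, miss rate, and miss penalty values are made-up numbers rather than figures from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical values: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(amat(hit_time=1.0, miss_rate=0.05, miss_penalty=100.0))  # 6.0 cycles
```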

10 Average Memory Access Time and Processor Performance
We can model CPU time as:
–CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
This formula raises the question of whether the clock cycles for a cache hit should be counted as part of the CPU execution clock cycles or as part of the memory stall clock cycles. The most widely accepted convention is to include hit clock cycles in the CPU execution clock cycles.
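A minimal sketch of this CPU-time model, assuming the common expansion of memory stall cycles as accesses × miss rate × miss penalty and folding hit cycles into the execution CPI, as the slide recommends. All the workload numbers are hypothetical.

```python
def cpu_time(instructions, cpi_execution, mem_accesses_per_instr,
             miss_rate, miss_penalty, clock_cycle_time):
    """CPU time = (execution cycles + memory stall cycles) * clock cycle time.

    Hit cycles are assumed to be folded into cpi_execution; stall cycles are
    modeled as memory accesses * miss rate * miss penalty.
    """
    execution_cycles = instructions * cpi_execution
    stall_cycles = instructions * mem_accesses_per_instr * miss_rate * miss_penalty
    return (execution_cycles + stall_cycles) * clock_cycle_time

# Hypothetical workload: 1e9 instructions, CPI 1.0, 1.5 accesses/instruction,
# 2% miss rate, 100-cycle penalty, 1 ns clock cycle.
print(cpu_time(1e9, 1.0, 1.5, 0.02, 100, 1e-9))  # 4.0 seconds
```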

11 (cont'd)
Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:
–The lower the CPI of execution, the higher the relative impact of a fixed number of cache miss clock cycles.
–When calculating CPI, the cache miss penalty is measured in CPU clock cycles per miss. Therefore, even if the memory hierarchies of two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence a higher memory portion of CPI.
Although minimizing average memory access time is a reasonable goal, the final goal is to reduce CPU execution time.

12 Improving Cache Performance
Strategies for improving cache performance:
–Reducing the miss penalty: multilevel caches, critical word first, read misses before writes
–Reducing the miss rate: larger block size, larger cache size, higher associativity, compiler optimizations
–Reducing the miss penalty or miss rate via parallelism: non-blocking caches
–Reducing the time to hit in the cache: small and simple caches, avoiding address translation, pipelined cache access, and trace caches

13 Reducing Cache Miss Penalty

14 Reducing cache misses has been the traditional focus of cache research, but the cache performance formula assures us that improvements in miss penalty can be just as beneficial as improvements in miss rate. We give five optimizations here to address increasing miss penalty. The most interesting is the first, which adds more levels of caches to reduce the miss penalty.

15 Techniques for Reducing Miss Penalty
–Multilevel caches (the most important)
–Critical word first and early restart
–Giving priority to read misses over writes
–Merging write buffers
–Victim caches

16 Multi-Level Caches
Probably the best miss-penalty reduction. Performance measurement for two-level caches:
–AMAT = Hit-time-L1 + Miss-rate-L1 × Miss-penalty-L1
–Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 × Miss-penalty-L2
So:
–AMAT = Hit-time-L1 + Miss-rate-L1 × (Hit-time-L2 + Miss-rate-L2 × Miss-penalty-L2)
L1 and L2 refer, respectively, to a first-level and a second-level cache. The second-level miss rate is measured on the leftovers from the first-level cache.

17 Multi-Level Caches (cont.)
Definitions:
–Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss-rate-L2)
–Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss-rate-L1 × Miss-rate-L2)
–The global miss rate is what matters
Advantages:
–Capacity misses in L1 see a significant penalty reduction, since they will likely be supplied from L2 with no need to go to main memory
–Conflict misses in L1 will similarly be supplied by L2
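A sketch of the two-level AMAT formula and the local/global miss-rate distinction from the last two slides; all of the cache parameters below are invented for illustration.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, miss_penalty_l2):
    """AMAT for a two-level cache.

    miss_rate_l1 is the L1 miss rate; local_miss_rate_l2 is L2 misses divided
    by the accesses that reach L2. The global L2 miss rate is their product.
    """
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1

# Hypothetical numbers: 1-cycle L1 hit, 4% L1 miss rate,
# 10-cycle L2 hit, 50% local L2 miss rate, 200-cycle memory penalty.
print(amat_two_level(1, 0.04, 10, 0.5, 200))   # 1 + 0.04 * (10 + 100) = 5.4
print(0.04 * 0.5)                              # global L2 miss rate = 0.02
```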

18 Comparing Local and Global Miss Rates
–With a huge second-level cache, the global miss rate is close to the single-level cache miss rate, provided L2 >> L1
–The global cache miss rate should be used when evaluating second-level caches (or 3rd-, 4th-, ... levels of the hierarchy)
–L2 sees many fewer hits than L1, so its target is to reduce misses

19 Critical Word First and Early Restart
Multilevel caches require extra hardware to reduce the miss penalty; these two techniques instead avoid waiting for the full block to be loaded before restarting the CPU.
–Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first.
–Early restart: fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

20 Benefits of critical word first and early restart depend on:
–Block size: generally useful only with large blocks
–The likelihood of another access to the portion of the block that has not yet been fetched

21 Giving Priority to Read Misses over Writes
With write-through caches, write buffers complicate memory access because they might hold the updated value of a location needed on a read miss (a RAW conflict with main memory reads on cache misses).
–The simplest solution is to make the read miss wait until the write buffer is empty, but this increases the read miss penalty.
–A better alternative is to check the contents of the write buffer on a read miss and, if there are no conflicts and the memory system is available, let the read miss continue.
Example (the last load should return the value just stored; if the read is serviced from memory before the buffered write completes, it could fetch stale data):
SW R3, 512(R0)
LW R1, 1024(R0)
LW R2, 512(R0)

22 Merging Write Buffer
This technique also involves write buffers, this time improving their efficiency.
–An entry of the write buffer often contains multiple words, but a write often involves only a single word; without write merging, a single-word write occupies a whole entry.
–Write merging: check whether the address of new data matches the address of a valid write buffer entry; if so, combine the new data with that entry.
Advantages:
–Multi-word writes are usually faster than single-word writes
–Reduces stalls due to the write buffer being full
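The following is a minimal sketch of the write-merging check described above, modeling the buffer as a dictionary of block-aligned entries; the entry width and the addresses are illustrative, not taken from any particular machine.

```python
WORDS_PER_ENTRY = 4
WORD_BYTES = 4
ENTRY_BYTES = WORDS_PER_ENTRY * WORD_BYTES

# block-aligned base address -> list of words (None = word not yet written)
write_buffer = {}

def buffered_write(addr, value):
    base = addr - addr % ENTRY_BYTES
    word_index = (addr % ENTRY_BYTES) // WORD_BYTES
    entry = write_buffer.get(base)
    if entry is None:
        # No valid entry matches this address: a single word occupies a whole entry.
        entry = [None] * WORDS_PER_ENTRY
        write_buffer[base] = entry
    # Write merging: the new word is combined into the matching entry
    # instead of taking up a fresh buffer slot.
    entry[word_index] = value

buffered_write(0x100, 1)
buffered_write(0x104, 2)   # merges into the entry created by the write to 0x100
print(write_buffer)        # {256: [1, 2, None, None]}
```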

23 Victim Caches
One approach to lowering the miss penalty is to remember what was just discarded in case it is needed again. Since the discarded data has already been fetched, it can be used again at small cost.
–Add a small fully associative cache (called a victim cache) between the cache and its refill path
–It contains only blocks discarded from the cache because of a miss
–It is checked on a miss to see if it has the desired data before going to the next lower level of memory; if so, the victim block and the cache block are swapped
–Because the victim and regular caches are addressed at the same time, the penalty does not increase
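A rough sketch of the victim-cache lookup path, assuming a direct-mapped main cache and a tiny fully associative victim cache modeled with Python dictionaries; the sizes and the replacement choice are arbitrary.

```python
from collections import OrderedDict

cache = {}               # set index -> (tag, block data), direct-mapped
victim = OrderedDict()   # (tag, index) -> block data, small and fully associative
VICTIM_ENTRIES = 4

def access(tag, index):
    entry = cache.get(index)
    if entry is not None and entry[0] == tag:
        return "hit"
    # Miss in the main cache: check the victim cache before going to memory.
    key = (tag, index)
    if key in victim:
        block = victim.pop(key)
        if entry is not None:                    # swap victim block and cache block
            victim[(entry[0], index)] = entry[1]
        cache[index] = (tag, block)
        return "victim hit"
    # Real miss: fetch from memory; the evicted block goes into the victim cache.
    if entry is not None:
        victim[(entry[0], index)] = entry[1]
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)           # discard the oldest victim
    cache[index] = (tag, "data")
    return "miss"

print(access(tag=1, index=3))   # miss
print(access(tag=2, index=3))   # miss, evicts tag 1 into the victim cache
print(access(tag=1, index=3))   # victim hit, blocks are swapped back
```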

24 Victim Cache Organization

25 Reducing Miss Rate

26 Classifying Cache Misses – The 3 C's
To gain better insight into the causes of misses, all misses can be sorted into three simple categories:
–Compulsory (independent of cache size): the first access to a block has no choice but to load it. Also called cold-start or first-reference misses. Measured with an infinite (ideal) cache.
–Capacity (decreases as cache size increases): the cache cannot contain all the blocks needed during execution, so blocks are discarded and later retrieved.
–Conflict, or collision (decreases as associativity increases): a side effect of set-associative or direct mapping; a block may be discarded and later retrieved if too many blocks map to the same cache set. These misses are also called collision or interference misses.

27 Techniques for Reducing Miss Rate
The classical approach to improving cache behavior is to reduce miss rates. We present five techniques to do so:
–Larger block size
–Larger caches
–Higher associativity
–Way prediction and pseudo-associative caches
–Compiler optimizations

28 Larger Block Sizes
The simplest way to reduce the miss rate is to increase the block size.
–A larger block size reduces compulsory misses, thanks to spatial locality
–Obvious disadvantages: a higher miss penalty (a larger block takes longer to move), and possibly more conflict and capacity misses if the cache is small

29 Larger Caches
The way to reduce capacity misses is to increase the capacity of the cache.
–Helps with both conflict and capacity misses
–Disadvantages: may lengthen the hit time and increases hardware cost

30 Higher Associativity
There are two general rules of thumb:
–An 8-way set-associative cache is, for practical purposes, as effective at reducing misses as a fully associative cache
–The 2:1 rule of thumb: a 2-way set-associative cache of size N/2 has about the same miss rate as a direct-mapped cache of size N (this held for cache sizes < 128 KB)
Greater associativity comes at the cost of increased hit time, which can lengthen the clock cycle.

31 Way Prediction
In way prediction, extra bits are kept in the cache to predict the way (the block within the set) of the next cache access.
–The multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle
–A miss results in checking the other blocks for matches in subsequent clock cycles

32 Pseudo-Associative Caches
A related approach is called pseudo-associative or column-associative. It attempts to get the miss rate of set-associative caches with the hit speed of a direct-mapped cache.
Idea:
–Start with a direct-mapped cache
–On a miss, check another entry; the usual method is to invert the high-order index bit to get the next try (e.g., 010111 → 110111)
Problem: there are now fast hits and slow hits
–If most accesses end up needing the slow hit, it is better to swap the blocks
Drawback: pipelining the CPU is hard if a hit can take 1 or 2 cycles, so this is better for caches not tied directly to the processor (e.g., L2)

33 Compiler Optimizations for Code
This final technique reduces miss rates without any hardware changes. Code can easily be rearranged without affecting correctness; for example, reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses.

34 Compiler Optimizations for Data
Idea: improve the spatial and temporal locality of the data. There are several options (loop interchange is illustrated in the sketch after this list):
–Array merging: allocate arrays so that paired operands show up in the same cache block
–Loop interchange: exchange the inner and outer loop order to improve cache behavior
–Loop fusion: for independent loops accessing the same data, fuse them into a single aggregate loop
–Blocking: do as much as possible on a sub-block before moving on
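Below is a small sketch of loop interchange, assuming a row-major NumPy array; in a compiled language the effect on cache behavior is far more pronounced, so treat this purely as an illustration of the transformation.

```python
import numpy as np

a = np.zeros((256, 256))   # row-major (C-order) by default

def column_order(x):
    # Before interchange: the inner loop strides down a column, so consecutive
    # accesses land in different cache blocks.
    rows, cols = x.shape
    for j in range(cols):
        for i in range(rows):
            x[i, j] += 1.0

def row_order(x):
    # After interchange: the inner loop walks along a row, so consecutive
    # accesses fall in the same cache block (spatial locality).
    rows, cols = x.shape
    for i in range(rows):
        for j in range(cols):
            x[i, j] += 1.0

column_order(a)
row_order(a)
```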

35 Virtual Memory

36 Virtual Memory
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes.
–With virtual memory, the CPU produces virtual addresses that are translated by a combination of hardware and software into physical addresses, which access main memory. This process is called memory mapping or address translation.
–Today, the two memory-hierarchy levels controlled by virtual memory are DRAM and magnetic disk.

37 Example of Virtual-to-Physical Address Mapping (mapping by a page table)

38 Address Translation Hardware for Paging (figure: the physical address is split into a frame number f of l−n bits and a frame offset d of n bits)

39 Page table when some pages are not in main memory (figure): entries for such pages are marked invalid, and referencing them is an illegal access that traps to the OS. The OS puts the process in the backing store when it starts executing.

40 Virtual Memory (cont.)
–Permits applications to grow bigger than main memory
–Helps with managing multiple processes: each process gets its own chunk of memory, one process's chunks are protected from another's, and multiple chunks are mapped onto shared physical memory
–Mapping also facilitates relocation: a program can run in any memory location and can be moved during execution
–The application and CPU run in virtual space (logical memory, 0 to max); the mapping onto physical space is invisible to the application
Cache vs. VM terminology:
–A block becomes a page or segment
–A miss becomes a page fault or address fault

41 Cache vs. VM Differences
Replacement:
–A cache miss is handled by hardware
–A page fault is usually handled by the OS
Addresses:
–The VM space is determined by the address size of the CPU
–The cache size is independent of the CPU address size
Lower-level memory:
–For caches, the main memory is not shared by something else
–For VM, most of the disk contains the file system, which is addressed differently (usually in I/O space); the VM lower level is usually called swap space

42 Two VM Styles: Paged or Segmented?
Virtual memory systems can be categorized into two classes: pages (fixed-size blocks) and segments (variable-size blocks).
–Words per address: Page: one. Segment: two (segment and offset).
–Programmer visible? Page: invisible to the application programmer. Segment: may be visible to the application programmer.
–Replacing a block: Page: trivial (all blocks are the same size). Segment: hard (must find a contiguous, variable-size, unused portion of main memory).
–Memory use inefficiency: Page: internal fragmentation (unused portion of a page). Segment: external fragmentation (unused pieces of main memory).
–Efficient disk traffic: Page: yes (adjust page size to balance access time and transfer time). Segment: not always (small segments may transfer just a few bytes).

43 Block Identification Example
–Physical space = 2^5, logical space = 2^4, page size = 2^2
–Page table size = 2^4 / 2^2 = 2^2 entries; each page-table entry needs 5 − 2 = 3 bits
–Example translation: virtual address 1101 = page 11, offset 01; page 11 maps to frame 010, so the physical address is 010 01 = 2 × 4 + 1 = 9
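A worked version of this example in Python. The page-table contents are hypothetical except for the mapping of virtual page 3 (binary 11) to physical frame 2 (binary 010), which matches the translation shown above.

```python
PAGE_BITS = 2                            # page size = 2**2 = 4
page_table = {0: 1, 1: 3, 2: 0, 3: 2}    # virtual page -> physical frame (2**2 entries)

def translate(vaddr):
    page = vaddr >> PAGE_BITS            # upper 4 - 2 = 2 bits of the virtual address
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = page_table[page]             # each entry needs 5 - 2 = 3 bits
    return (frame << PAGE_BITS) | offset

print(translate(0b1101))   # page 0b11 -> frame 0b010, offset 0b01 => 0b01001 = 9
```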

44 Virtual Memory
Block replacement: LRU is the best policy.
–However, true LRU is a bit complex, so an approximation is used: the page table contains a use bit that is set on each access; the OS checks the bits every so often, records what it sees in a data structure, and then clears them all. On a miss the OS decides which page has been used the least and replaces it.
Write strategy: always write back.
–Because of the access time to the disk, write-through makes no sense; a dirty bit is used so that only pages that have been modified are written back.
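A toy sketch of the use-bit approximation described above: a use bit is set on each access, the OS periodically samples and clears the bits, and the page seen in use least often is chosen as the victim. The pages and sampling points are made up.

```python
use_bit = {}    # page -> 0/1, set by "hardware" on each access
history = {}    # page -> recent-use count recorded by the OS

def touch(page):
    use_bit[page] = 1

def os_sample():
    # The OS periodically records the use bits in its own data structure
    # and then clears them all.
    for page, bit in use_bit.items():
        history[page] = history.get(page, 0) + bit
        use_bit[page] = 0

def choose_victim():
    # Replace the page the OS has seen used the least.
    return min(history, key=history.get)

for p in "ABC":
    use_bit[p] = 0
touch("A"); touch("B"); os_sample()
touch("A"); os_sample()
print(choose_victim())   # "C": never observed in use, so it is evicted
```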

45 Techniques for Fast Address Translation
The page table is kept in main memory (kernel memory), and each process has its own page table.
–Every data/instruction access therefore requires two memory accesses: one for the page table and one for the data/instruction
–This can be solved with a special fast-lookup hardware cache called associative registers or a translation look-aside buffer (TLB)
–If locality applies, recent translations can be cached; a TLB entry holds the virtual page number, physical page number, protection bits, use bit, and dirty bit
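A minimal sketch of a TLB in front of the page table: translations are first looked up in a small table, and only on a TLB miss is the in-memory page table consulted and the TLB refilled. The page size, TLB capacity, and replacement choice are arbitrary assumptions.

```python
PAGE_BITS = 12        # assume 4 KB pages
TLB_ENTRIES = 64
tlb = {}              # virtual page -> physical frame (small, fast)
page_table = {}       # full table, lives in main memory

def translate(vaddr):
    page = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = tlb.get(page)
    if frame is None:                     # TLB miss: extra access to the page table
        frame = page_table[page]
        if len(tlb) >= TLB_ENTRIES:
            tlb.pop(next(iter(tlb)))      # crude replacement of some entry
        tlb[page] = frame                 # refill the TLB with this translation
    return (frame << PAGE_BITS) | offset

page_table[5] = 42
print(hex(translate(0x5123)))   # first access to page 5 misses in the TLB
print(hex(translate(0x5456)))   # second access to the same page hits
```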

46 Paging Hardware with TLB

47 Page Size: An Architectural Choice
Large pages are good:
–They reduce the page table size
–They amortize the long disk access
–If spatial locality is good, the hit rate will improve
–They reduce the number of TLB misses
Large pages are bad:
–More internal fragmentation: if everything is random, each structure's last page is only half full
–Process start-up time takes longer

48 Protection and Examples of VM

49 Protection
Multiprogramming forces us to worry about protection, since many processes share the machine:
–Task switching adds overhead
–The hardware must provide savable state, and the OS must promise to save and restore it properly
–Most machines task-switch every few milliseconds, and a task switch typically takes several microseconds

50 Protection Options
Simplest: base and bound (an access is valid if Base ≤ Address ≤ Bound).
–Two registers: check that each address falls between the values
–These registers must be changeable by the OS but not by the application, which requires two modes (regular and privileged), a privilege trap, and the ability to switch modes
VM provides another option:
–Check as part of the VA → PA translation process
–Memory protection is implemented by associating protection bits with each page; the protection bits reside in the page table and the TLB (read-only, read-write, execute-only)
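A tiny sketch of the base-and-bound check as stated on this slide (valid if Base ≤ Address ≤ Bound); the register values are arbitrary.

```python
BASE, BOUND = 0x4000, 0x7FFF    # set by the OS while in privileged mode

def checked_access(addr):
    if not (BASE <= addr <= BOUND):
        # Out-of-range access: trap to the OS.
        raise MemoryError("protection violation")
    return addr                 # the physical access would proceed here

print(hex(checked_access(0x5000)))    # within the process's region
# checked_access(0x9000) would raise a protection trap
```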

51 Memory Protection with a Valid-Invalid Bit (figure): for a process of length 10,468, addresses 10,468 – 12,287 are also invalid; a page-table length register (PTLR) can limit the legal range.

52 VM Examples: Alpha 21264 (multi-level paging) and Intel Pentium (paged segmentation)

