Chapter 7 Large and Fast: Exploiting Memory Hierarchy


1 Chapter 7 Large and Fast: Exploiting Memory Hierarchy

2 Outline I. Introduction II. Basics of Caches III. Measuring and improving cache performance IV. Virtual memory V. A common framework for memory hierarchies

3 Exploiting Memory Hierarchy
• We want memory that is both large and fast. – Is that possible? We can only create the illusion of unlimited fast memory. – SRAM access times are 0.5–5 ns at a cost of $4,000 to $10,000 per GB. DRAM access times are 50–70 ns at $100 to $200 per GB. Disk access times are 5 to 20 million ns at $0.50 to $2 per GB. • Try to give programmers that illusion anyway – build a memory hierarchy.

4 Memory Hierarchy
• A memory hierarchy consists of multiple levels of memory with different speeds and sizes. • The faster a memory is, the more expensive it is per bit and the smaller its capacity.

5 Principle of Locality • The principle of locality is what makes a memory hierarchy a good idea. • If an item is referenced: – Temporal locality (locality in time): it will tend to be referenced again soon. – Spatial locality (locality in space): nearby items will tend to be referenced soon. • Code examples – Temporal locality: e.g., the instructions and data of a loop are reused on every iteration. – Spatial locality: e.g., instructions are normally accessed sequentially, and array elements are accessed in order, giving good spatial locality.
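A minimal C sketch of the two kinds of locality in a simple array-sum loop; the array name and size are illustrative, not taken from the slides:

```c
#include <stdio.h>

int main(void) {
    int a[256];
    for (int i = 0; i < 256; i++) a[i] = i;

    int sum = 0;
    /* Temporal locality: the loop instructions and the variables i and sum
       are referenced again on every iteration.
       Spatial locality: a[i] is accessed sequentially, so a block fetched
       for a[i] also contains a[i+1], a[i+2], ...                          */
    for (int i = 0; i < 256; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```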

6 How Memory Hierarchy Uses Locality
• Temporal locality – Keep more recently accessed data items closer to the processor • Spatial locality – Move blocks consisting of multiple contiguous words in memory to upper levels of the hierarchy

7 Levels in Memory Hierarchy
• The memory system is organized as a hierarchy – A level closer to the processor is generally a subset of any level further away, and all the data is stored at the lowest level. • Block – The minimum unit of information that can be either present or not present in a two-level hierarchy. • Upper level – The one closer to the CPU – Smaller and faster

8 Hit or Miss • If the data requested by the processor appears in some block in the upper level => “hit” else “miss” • Hit rate (hit ratio) – The fraction of memory accesses found in the upper level – A measure of the performance of the memory hierarchy • Miss rate = 1 – hit rate • Hit time – The time to access the upper level which includes the time needed to determine whether the access is a hit or a miss • Miss penalty – The time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver the data to the processor • If the hit rate is high enough – The memory hierarchy has an effective access time closer to that of the highest (fastest) level and a size equal to that of the lowest (largest) level

9 Outline I. Introduction II. Basics of Caches III. Measuring and improving cache performance IV. Virtual memory V. A common framework for memory hierarchies

10 Caches
• Cache: a safe place to hide or store things – A memory level between the CPU and main memory – Storage managed to take advantage of locality of access

11 Caches
• Two questions about a cache: – How do we know if a data item is in the cache? – If it is, how do we find it? • Our first example: – Block size is one word of data. – "Direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be, so many items at the lower level share one location in the upper level.

12 Direct Mapped Cache
• Each memory location is mapped to exactly one location in the cache. • Mapping between addresses and cache locations: – cache index = (block address) modulo (number of blocks in the cache) – When the number of blocks is a power of two, the cache can be indexed directly with the low-order log2(cache size in blocks) bits of the block address. – Because many addresses map to the same entry, one cache entry can hold the contents of different addresses at different times.
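A hedged C sketch of how a direct-mapped cache splits a byte address into offset, index, and tag fields; the 1024-block, one-word-block geometry is an assumed example, not a configuration given in the slides:

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed example geometry: 1024 blocks, 4-byte (one-word) blocks. */
#define NUM_BLOCKS   1024u          /* must be a power of two */
#define BLOCK_BYTES  4u
#define INDEX_BITS   10u            /* log2(NUM_BLOCKS)       */
#define OFFSET_BITS  2u             /* log2(BLOCK_BYTES)      */

int main(void) {
    uint32_t addr = 0x12345678u;

    uint32_t block_addr = addr / BLOCK_BYTES;             /* drop the byte offset          */
    uint32_t index = block_addr % NUM_BLOCKS;             /* (block address) mod (#blocks) */
    uint32_t tag   = addr >> (INDEX_BITS + OFFSET_BITS);  /* remaining upper bits          */

    printf("address 0x%08x -> index %u, tag 0x%x\n", addr, index, tag);
    return 0;
}
```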

13 Tags and Valid Bits • Each cache location can contain the contents of different memory locations, so how do we know whether the data in the cache is what we want? – Tags • Contain the address information required to identify whether a word in the cache corresponds to the requested word. Only the upper portion of the address is needed (the upper 2 bits in the previous figures). – Valid bits • Indicate whether an entry contains a valid address. • Immediately after the CPU starts up, the cache data and tags are meaningless, so the valid bits are needed.

14 Accessing a Cache

15 Accessing a Cache (Cont.)

16 Accessing a Cache (Cont.)

17 Address and Operations in the Direct Mapped Caches
• Referenced address fields – The cache index selects the block. – The tag field is compared with the tag stored in that block. • Cache size with a single-word block – Assume 1 block = 1 word = 4 bytes and a 32-bit address. – For 2^n blocks: cache size (in bits) = 2^n × (block size + tag size + valid field size) = 2^n × (32 + (32 − n − 2) + 1) = 2^n × (63 − n) • Cache size with multiword blocks – If each block holds 2^m words: cache size (in bits) for 2^n blocks = 2^n × (2^m × 32 + (32 − n − m − 2) + 1)
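The two formulas above can be checked with a small C helper; the block counts and block sizes passed in main are only illustrative:

```c
#include <stdio.h>

/* Total bits in a direct-mapped cache with 2^n blocks of 2^m 32-bit words,
   given 32-bit byte addresses: 2^n * (2^m*32 + (32 - n - m - 2) + 1).     */
static long cache_bits(int n, int m) {
    long blocks = 1L << n;
    long words_per_block = 1L << m;
    long tag = 32 - n - m - 2;
    return blocks * (words_per_block * 32 + tag + 1);
}

int main(void) {
    /* One-word blocks (m = 0): matches 2^n * (63 - n). */
    printf("n=10, m=0: %ld bits\n", cache_bits(10, 0));   /* 1024 * 53 = 54272       */
    printf("n=10, m=2: %ld bits\n", cache_bits(10, 2));   /* 16 KB of data + overhead */
    return 0;
}
```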

18 Address Mapping to a Multiword Cache
• What if the block is more than a single word? • Block mapping – (block address) modulo (number of cache blocks) – Block address = byte address / bytes per block – E.g., 16 bytes per block, byte address 1200: – Block address = 1200 / 16 = 75 – Maps to cache block number (75 modulo 64) = 11
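A short C check of the worked example above (byte address 1200, 16-byte blocks, 64-block cache):

```c
#include <stdio.h>

int main(void) {
    unsigned byte_addr = 1200;
    unsigned bytes_per_block = 16;
    unsigned num_blocks = 64;

    unsigned block_addr = byte_addr / bytes_per_block;   /* 1200 / 16 = 75 */
    unsigned cache_block = block_addr % num_blocks;      /* 75 mod 64 = 11 */

    printf("block address %u maps to cache block %u\n", block_addr, cache_block);
    return 0;
}
```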

19 Block Size Effect on Cache Performance

20 Block Size Effect on Cache Performance
• Miss rate ↓ as block size ↑ – Larger blocks exploit spatial locality to lower the miss rate. – But if the block size becomes too large relative to the cache size, the miss rate goes back up: • The number of blocks becomes small, which raises the competition for them. • A block may be bumped out of the cache before many of its words are accessed. • Spatial locality among the words of a very large block decreases. – Miss penalty ↑ with block size • The time to fetch a block has two parts: the latency to the first word and the transfer time for the rest of the block. • Two solutions to hide some of the transfer time: – Early restart: resume execution as soon as the requested word of the block is returned. Less effective for data caches, because data requests are less predictable than sequential instruction fetch. – Requested word first: the requested word is transferred first; slightly faster than early restart.

21 Hits vs. Misses
• Read hits – This is what we want! • Read misses – Stall the CPU, fetch the block from memory, deliver it to the cache, restart the access. • Write hits – Write the data into both the cache and memory (write-through), or – Write the data only into the cache and update memory later (write-back). • Write misses – Read the entire block into the cache, then write the word.

22 Handle Cache Misses
• Cache miss – A request for data from the cache that cannot be filled because the data is not present in the cache. • Work to do on a cache miss: – Stall the CPU, freezing the contents of all registers. – A separate controller handles the miss, fetching the data into the cache from memory. – Once the data is present, execution restarts at the cycle that caused the cache miss. • Four steps for an instruction cache miss: – Send the original PC value (current PC − 4) to memory. – Instruct main memory to perform a read and wait for it to complete the access. – Write the cache entry: put the data in the data portion of the entry, write the upper bits of the address into the tag field, and turn the valid bit on. – Restart the instruction execution at the first step, which refetches the instruction and this time finds it in the cache.

23 Handling Writes
• Cache and memory become "inconsistent" if we write data only into the cache without changing memory. • Two solutions: – Write-through • Always write the data into both the cache and the memory. • Problem: poor performance, because every write causes the data to be written to main memory, and those writes take a long time. – E.g., with a base CPI of 1, a 100-cycle memory write, and 10% of instructions being stores, the effective CPI becomes 1 + 100 × 10% = 11. • Solution: write buffer – Holds the data while it is waiting to be written to memory. – The CPU continues execution after writing the data to the cache and the buffer. – The buffer entry is freed when the memory write completes. – The CPU stalls if the buffer is full, so increase the buffer depth if the write rate outpaces memory for long stretches. – Write-back • The new value is written only to the block in the cache. • The modified block is written to the lower level of the hierarchy when it is replaced. • Faster writes, but more complex to implement.
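A minimal C sketch contrasting the two write policies on a write hit for a single cache line; the line structure and toy memory array are assumptions for illustration, not the slides' hardware:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t memory[1024];                 /* toy "main memory" (word-addressed) */

struct line { bool valid, dirty; uint32_t tag, data; };

/* Write-through: update the cache line and memory immediately
   (a real design would queue the memory write in a write buffer). */
static void write_through(struct line *l, uint32_t addr, uint32_t value) {
    l->data = value;
    memory[addr] = value;
}

/* Write-back: update only the cache line and mark it dirty;
   memory is updated when the line is later evicted.            */
static void write_back(struct line *l, uint32_t addr, uint32_t value) {
    (void)addr;                               /* memory is not touched on a hit */
    l->data = value;
    l->dirty = true;
}

static void evict(struct line *l, uint32_t addr) {
    if (l->valid && l->dirty) {               /* only dirty lines go back to memory */
        memory[addr] = l->data;
        l->dirty = false;
    }
    l->valid = false;
}

int main(void) {
    struct line a = { true, false, 0, 0 };
    write_through(&a, 5, 111);
    write_back(&a, 5, 222);
    evict(&a, 5);
    printf("memory[5] = %u\n", memory[5]);    /* 222 after the write-back eviction */
    return 0;
}
```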

24 More Complication on Writes
• Policy on write misses (for write-through caches) – Fetch-on-miss (also called fetch-on-write or allocate-on-miss) • Allocate a cache block for the address that missed and fetch the rest of the block before writing the word and continuing execution. • Write the word into the cache, updating both the tag and data, and also write the word to main memory using the entire address. – Write-around (no-fetch-on-write, no-allocate-on-write) • Programs sometimes write entire blocks of data before reading them, so the initial fetch would be wasted.

25 More Complication on Writes (Cont.)
• Efficient implementation of writes in a write-back cache – For a write-back cache, on a miss we must first write the dirty block back to memory. (There is no such need in a write-through cache; why?) – Write-through • The tag check and the data write into the cache can happen in the same cycle. • If the tags mismatch, a miss occurred, but memory still has the correct value, so no harm is done. – Write-back • Two cycles: a cycle for the hit check followed by a cycle for the actual write. • Alternatively, an extra store buffer lets a store take only one cycle by pipelining it.

26 Example: Intrinsity FastMath

27 Split or Combined Cache
• Split cache: separate data cache and instruction cache – Exploits the different spatial locality of data and instructions. – Increases cache bandwidth. • Combined cache: one cache for both data and instructions – Slightly lower miss rate (Fig. 7.10), because the cache can flexibly devote more of its space to instructions or to data as the program needs. • FastMath: uses a split cache – Doubles the cache bandwidth by supporting an instruction access and a data access in the same cycle.

28 Design the Memory System to Support Caches
• Reduce the miss penalty by increasing memory bandwidth – Achieved with different memory organizations. – Increase the physical or logical width of the memory system, or interleave it across banks. – In the baseline one-word-wide organization, all the words of a block are accessed sequentially.

29 Impact of Different Memory Organizations (1)
• Assume hypothetical memory access times: – 1 cycle to send the address – 15 cycles for each DRAM access initiated – 1 cycle to send a word of data • If the cache block = 4 words: 1. One-word-wide memory organization Miss penalty = 1 + 4 × 15 + 4 × 1 = 65 cycles Bytes transferred per clock cycle for a single miss = 4 × 4 / 65 ≈ 0.25

30 Impact of Different Memory Organizations (2)
2. Wider memory, bus, and cache – Parallel access to the words of the wide word – If 2 words wide: Miss penalty = 1 + 2 × 15 + 2 × 1 = 33 cycles Bandwidth for a single miss = 4 × 4 / 33 ≈ 0.48 bytes per cycle – If 4 words wide: Miss penalty = 1 + 1 × 15 + 1 × 1 = 17 cycles Bandwidth for a single miss = 4 × 4 / 17 ≈ 0.94 bytes per cycle – Cost • Wider bus • Potential increase in cache access time (due to the multiplexor and control logic between the processor and cache)

31 Impact of Different Memory Organizations (3)
3. Interleaved memory with multiple banks – Read or write multiple words in one access time – Each bank is one word wide – The address is sent to all banks simultaneously and they are accessed in parallel – Miss penalty = 1 + 1 × 15 + 4 × 1 = 20 cycles – Bandwidth per miss = 4 × 4 / 20 = 0.8 bytes per cycle – Side advantage • Valuable on writes • Each bank can write independently, quadrupling the write bandwidth and leading to fewer stalls in a write-through cache
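A small C program reproducing the miss-penalty arithmetic for the organizations above (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per word transferred, 4-word blocks):

```c
#include <stdio.h>

int main(void) {
    int block_words = 4, addr = 1, dram = 15, xfer = 1;

    int one_wide   = addr + block_words * dram + block_words * xfer;  /* 65 */
    int two_wide   = addr + 2 * dram + 2 * xfer;                      /* 33 */
    int four_wide  = addr + 1 * dram + 1 * xfer;                      /* 17 */
    int interleave = addr + 1 * dram + block_words * xfer;            /* 20 */

    printf("1-word wide : %2d cycles, %.2f bytes/cycle\n", one_wide,   16.0 / one_wide);
    printf("2-word wide : %2d cycles, %.2f bytes/cycle\n", two_wide,   16.0 / two_wide);
    printf("4-word wide : %2d cycles, %.2f bytes/cycle\n", four_wide,  16.0 / four_wide);
    printf("4 banks     : %2d cycles, %.2f bytes/cycle\n", interleave, 16.0 / interleave);
    return 0;
}
```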

32 Summary
• Direct-mapped cache – Each block goes in exactly one location. • Keeping cache and memory consistent – Write-through – Write-back • Taking advantage of spatial locality – Multiple words per block – Larger blocks reduce the miss rate and the tag storage; a one-word block has no spatial locality to exploit. • Reducing the miss penalty – Make the memory wider or interleave it.

33 Outline I. Introduction II. Basics of Caches III. Measuring and improving cache performance IV. Virtual memory V. A common framework for memory hierarchies

34 Cache Performance
• Simplified model – CPU execution time = (CPU execution cycles + Memory-stall cycles) × clock cycle time – Memory-stall cycles = number of memory accesses × miss rate × miss penalty • Two ways of improving performance: – Decreasing the miss rate • Reduce the probability that two different memory blocks will contend for the same cache location. – Decreasing the miss penalty • Multilevel caching: add another level to the hierarchy.

35 Performance
• Stall cycles – Accurate stall-cycle counts are obtained by detailed simulation. – For write-through, the read-miss and write-miss penalties are equal; if we assume write-buffer stalls are negligible, we can combine reads and writes into a single miss rate.

36 Write-stall-cycle (more complicated)
• Write-through – 2 sources of stalls • Write miss: block move from memory • Write buffer stalls: when the buffer is full • Write-back – Potential stall to write a cache block back to memory when the block is replaced

37 Calculating Cache Performance
• Q. How much faster would the processor run with a perfect cache? – Instruction cache miss rate: 2% – Data cache miss rate: 4% – Processor CPI: 2 without any memory stalls – Miss penalty: 100 cycles for all misses – SPECint: loads/stores are 36% of instructions • A. – Instruction miss cycles = I × 2% × 100 = 2.00 I – Data miss cycles = I × 36% × 4% × 100 = 1.44 I – Total memory-stall cycles = 2.00 I + 1.44 I = 3.44 I – CPI with memory stalls = 2 + 3.44 = 5.44 – Ratio of the execution times = 5.44 / 2 = 2.72
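The arithmetic above, wrapped in a small C function so the 2.72 slowdown can be checked:

```c
#include <stdio.h>

/* CPI including memory stalls = base CPI + I-miss stalls + D-miss stalls,
   all measured per instruction.                                          */
static double cpi_with_stalls(double base_cpi, double i_miss, double d_miss,
                              double load_store_frac, double penalty) {
    return base_cpi + i_miss * penalty + load_store_frac * d_miss * penalty;
}

int main(void) {
    double cpi = cpi_with_stalls(2.0, 0.02, 0.04, 0.36, 100.0);  /* 5.44 */
    printf("CPI with stalls = %.2f, slowdown vs perfect cache = %.2f\n",
           cpi, cpi / 2.0);                                      /* 2.72 */
    return 0;
}
```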

38 Cache Performance with Lower CPI
• What if the processor is faster but memory is not? (Amdahl's law) • If the processor CPI is reduced from 2 to 1, what happens in the previous example? • A. – The system with cache misses now has CPI = 1 + 3.44 = 4.44. – The system with a perfect cache would be 4.44 / 1 = 4.44 times faster. – The fraction of execution time spent on memory stalls rises from 3.44 / 5.44 = 63% to 3.44 / 4.44 = 77%.

39 Cache Performance with Higher Clock Rate
• What if the processor clock rate is doubled but memory is not faster? (Amdahl's law again) • Assume the absolute time to handle a cache miss does not change. • A. – Measured in the faster clock cycles, the new miss penalty is 200 cycles. – Total miss cycles per instruction = 2% × 200 + 36% × (4% × 200) = 6.88 – Faster-clock system: CPI = 2 + 6.88 = 8.88 – Slower-clock system: CPI = 5.44 – Relative performance = execution time of slow clock / execution time of fast clock = (I × 5.44 × cycle time) / (I × 8.88 × ½ × cycle time) = 5.44 / 4.44 = 1.23, so doubling the clock makes the machine only 1.23 times faster.

40 CPU vs. Cache Performance
• If a processor improves both clock rate and CPI: – The lower the CPI, the more pronounced the impact of stall cycles. – The main memory system is unlikely to improve as fast as the processor cycle time. • So cache performance matters even more for processors with low CPI and high clock rates. • Hit time is related to the clock cycle time – A larger cache may have a longer hit time. – The increase in hit time could dominate the improvement in hit rate, leading to a decrease in processor performance.

41 Reducing cache Miss by More Flexible Placement of Blocks
Block placement schemes 1. Direct mapped – A block can go in exactly one place in the cache. – A direct mapping from any block address in memory to a single location in the upper level of the hierarchy. 2. Fully associative – A block can be placed in any location in the cache. – Requires a parallel search with one comparator per entry. • Practical only for caches with a small number of blocks, due to the hardware cost; a sequential search would be too slow. 3. Set associative – There are a fixed number of locations (at least two) where each block can be placed. – n locations for a block => n-way set-associative cache. – A combination of direct mapped and fully associative: a block is directly mapped into a set, and then all the blocks in that set are searched for a match.

42 Mapping
• Direct mapped – (block address) modulo (number of cache blocks) • Set associative – (block address) modulo (number of sets in the cache)

43 Location of a Memory Block in a Cache
• Map block address 12 to a location in a cache with 8 blocks: – Direct mapped: block 12 modulo 8 = block 4. – 2-way set associative: set 12 modulo 4 = set 0 (either block in the set). – Fully associative: any of the 8 blocks.

44 Decreasing miss ratio with associativity

45 Associativity
• A direct-mapped cache is simply a 1-way set-associative cache. • A fully associative cache with m entries is an m-way set-associative cache. • As the degree of associativity ↑, the miss rate ↓ but the hit time ↑.

46 Miss and Associativity In Caches
• Q. Three small caches (fully associative, 2-way set associative, direct mapped), each consisting of four one-word blocks; find the number of misses for the block address sequence 0, 8, 0, 6, 8. • A. – Direct mapped: addresses 0 and 8 both map to block 0 and address 6 maps to block 2, so the sequence gives 5 misses.

47 Miss and Associativity In Caches
• A. 2-way set associative cache (LRU replacement): blocks 0, 8, and 6 all map to set 0, so the sequence gives 4 misses.

48 Miss and Associativity In Caches
A. Fully associative cache (LRU replacement): any block can go anywhere, so only the first references to 0, 8, and 6 miss, giving 3 misses.

49 Miss and Associativity In Caches
• Three misses is the best we can do for this sequence. • If we had 8 blocks: – There would be no replacements in the two-way set-associative cache. – It would have the same performance as the fully associative cache. • If we had 16 blocks: – All three caches would have the same number of misses. Why? • Cache size and associativity are not independent in determining cache performance (Fig. 7.15).
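A compact C simulation of the example above (block addresses 0, 8, 0, 6, 8 on four one-word blocks) for the three placement schemes; LRU is tracked with simple timestamps, which is an implementation choice for this sketch rather than real hardware:

```c
#include <stdio.h>

/* Simulate a 4-block cache with LRU replacement.
   ways = 1 (direct mapped), 2 (2-way set associative), 4 (fully associative). */
static int simulate(int ways, const int *refs, int n) {
    int tag[4] = {0}, stamp[4] = {0}, valid[4] = {0};
    int sets = 4 / ways;
    int misses = 0;

    for (int t = 0; t < n; t++) {
        int addr = refs[t];
        int base = (addr % sets) * ways;      /* first way of the selected set */
        int hit = -1;

        for (int w = base; w < base + ways; w++)
            if (valid[w] && tag[w] == addr) { hit = w; break; }

        if (hit < 0) {                        /* miss: pick an empty or LRU way */
            misses++;
            hit = base;
            for (int w = base; w < base + ways; w++) {
                if (!valid[w]) { hit = w; break; }
                if (stamp[w] < stamp[hit]) hit = w;
            }
            valid[hit] = 1;
            tag[hit] = addr;
        }
        stamp[hit] = t;                       /* record the most recent use */
    }
    return misses;
}

int main(void) {
    const int refs[] = {0, 8, 0, 6, 8};
    printf("direct mapped    : %d misses\n", simulate(1, refs, 5));  /* 5 */
    printf("2-way set assoc  : %d misses\n", simulate(2, refs, 5));  /* 4 */
    printf("fully associative: %d misses\n", simulate(4, refs, 5));  /* 3 */
    return 0;
}
```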

50 Locating a Block in a Cache
• The address is divided into Tag | Index | Block offset fields; the index selects the set, and the tags of all blocks in that set are searched in parallel. • If the total cache size is kept the same: – Increasing the associativity increases the number of blocks per set and decreases the number of sets. – Doubling the associativity doubles the number of blocks per set and halves the number of sets.

51 4-Way Set Associative Implementation

52 Size of Tags
• Q. Assume a cache with 4K blocks, a 4-word block size, and a 32-bit address. What are the total number of sets and the total number of tag bits for a direct-mapped, 2-way, 4-way, and fully associative cache? • A. – 4 words = 16 bytes per block, so 32 − 4 = 28 bits remain for tag and index. – Direct mapped • Number of sets = 4K; index = log2(4K) = 12 bits • Tag bits = (28 − 12) × 4K = 64 Kbits – 2-way set associative • 4K / 2 = 2K sets; index = 11 bits • Tag bits = (28 − 11) × 2 × 2K = 34 × 2K = 68 Kbits – 4-way set associative • 4K / 4 = 1K sets; index = 10 bits • Tag bits = (28 − 10) × 4 × 1K = 72 × 1K = 72 Kbits – Fully associative • Only one set • Tag bits = 28 × 4K = 112 Kbits
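The tag-storage arithmetic above as a tiny C helper (4K blocks, 4-word blocks, 32-bit addresses):

```c
#include <stdio.h>
#include <math.h>

/* Total tag bits = (address bits - offset bits - index bits) * assoc * sets. */
static long tag_bits(long blocks, int assoc) {
    int offset_bits = 4;                       /* 4 words = 16 bytes per block */
    long sets = blocks / assoc;
    int index_bits = (int)round(log2((double)sets));
    return (32 - offset_bits - index_bits) * (long)assoc * sets;
}

int main(void) {
    long blocks = 4 * 1024;
    printf("direct mapped : %ld Kbits\n", tag_bits(blocks, 1) / 1024);            /* 64  */
    printf("2-way         : %ld Kbits\n", tag_bits(blocks, 2) / 1024);            /* 68  */
    printf("4-way         : %ld Kbits\n", tag_bits(blocks, 4) / 1024);            /* 72  */
    printf("fully assoc   : %ld Kbits\n", tag_bits(blocks, (int)blocks) / 1024);  /* 112 */
    return 0;
}
```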

53 Choose Which Block to Replace
• Least recently used (LRU) – Most commonly used for fully associative and set-associative caches. – The block replaced is the one that has been unused for the longest time. • Implementation for a 2-way set-associative cache – Keep one bit per set and update it whenever a block in the set is referenced, recording which of the two blocks was used most recently.

54 Reduce the Miss Penalty Using Multilevel Caches
• First-level (L1) cache – On the same chip as the microprocessor. • Second-level (L2) cache – Often SRAM, either on-chip or as separate off-chip memory. – Accessed whenever a miss occurs in the primary cache. – If L2 contains the data, the miss penalty is the L2 access time. – If neither the primary nor the secondary cache contains the data, a main memory access is required, with a much larger miss penalty. • Design considerations for L1 and L2 caches differ: – L1: minimize hit time → smaller cache, smaller block size. – L2: minimize the miss rate → larger cache, larger block size.

55 Example of 2-level Cache
• Q. – Base CPI = 1 if all references hit in the primary cache – Clock rate 5 GHz – Main memory: 100 ns access time, including all miss handling – Miss rate per instruction at the L1 cache: 2% – What if we add an L2 cache with a 5 ns access time that reduces the miss rate to main memory to 0.5%? • A. – Miss penalty to main memory = 100 ns / (0.2 ns per clock) = 500 clock cycles – Effective CPI with L1 only = 1 + 2% × 500 = 11 – With L2, the miss penalty to L2 = 5 ns / 0.2 ns = 25 clock cycles – Ideal case: every miss is satisfied in the L2 cache, so the miss penalty is just 25 cycles. – Realistic case: total CPI with L1 and L2 = 1 + 2% × 25 + 0.5% × 500 = 1 + 0.5 + 2.5 = 4 – The processor with L1 and L2 is faster by 11 / 4 = 2.8
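A quick C check of the two-level cache example (5 GHz clock, 100 ns memory, 5 ns L2, 2% L1 misses per instruction, 0.5% of instructions going to memory):

```c
#include <stdio.h>

int main(void) {
    double clock_ns = 0.2;                        /* 1 / 5 GHz  */
    double mem_penalty = 100.0 / clock_ns;        /* 500 cycles */
    double l2_penalty  = 5.0 / clock_ns;          /* 25 cycles  */

    double cpi_l1_only = 1.0 + 0.02 * mem_penalty;                       /* 11.0 */
    double cpi_l1_l2   = 1.0 + 0.02 * l2_penalty + 0.005 * mem_penalty;  /* 4.0  */

    printf("CPI with L1 only   : %.1f\n", cpi_l1_only);
    printf("CPI with L1 and L2 : %.1f\n", cpi_l1_l2);
    printf("speedup from L2    : %.2f\n", cpi_l1_only / cpi_l1_l2);      /* 2.75, ~2.8 */
    return 0;
}
```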

56 Summary
• As processors get faster, the effect of memory stalls ↑. • Two ways to improve cache performance: – Reduce the miss rate with associative placement. • Fully associative is the most flexible, but also the most costly. • Set-associative caches are a practical alternative. – Reduce the miss penalty with multilevel caches. • L1 focuses on hit time. • L2 focuses on miss rate.

57 Outline I. Introduction II. Basics of Caches III. Measuring and improving cache performance IV. Virtual memory V. A common framework for memory hierarchies

58 Virtual Memory • The main memory can act as a “cache” for the secondary storage =>magnetic disk – Main memory contains the “active portions” of the programs • Two motivations for virtual memory – 1. To allow efficient and safe sharing of memory among multiple programs (major reason) – 2. To remove the programming burdens of a small, limited amount of main memory program address space physical address translation with protection overlay: divide program into pieces, loaded or unload under user program control

59 Terminology: Cache vs. Virtual Memory
• Address: a virtual address is translated to a physical address • Block: a virtual memory block is called a page • Miss: a miss is called a page fault

60 Program Relocation • Virtual memory also simplifies loading the program for execution by providing relocation – Relocation maps the virtual address used by a program to different physical address – Virtual memory system relocate the program as a set of fixed-size blocks (pages) • Eliminate the need to find a contiguous block of memory for program • OS only need to allocate a sufficient number of pages in memory

61 Address Translation
• The virtual address is split into a virtual page number and a page offset; translation replaces the virtual page number with a physical page number. – The number of bits in the page-offset field determines the page size. – Usually the number of virtual pages is larger than the number of physical pages, so the virtual page number field is wider than the physical one.
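A hedged C sketch of splitting a 32-bit virtual address into a virtual page number and page offset for 4 KB pages; the page size, example address, and physical page number are assumptions for illustration:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u                 /* 4 KB pages => 12 offset bits */

int main(void) {
    uint32_t vaddr = 0x00403a24u;            /* example virtual address */

    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;               /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

    /* Translation keeps the offset and replaces the VPN with a physical
       page number (here an assumed value, as if read from the page table). */
    uint32_t ppn = 0x123u;
    uint32_t paddr = (ppn << PAGE_OFFSET_BITS) | offset;

    printf("vaddr 0x%08x -> vpn 0x%x, offset 0x%03x, paddr 0x%08x\n",
           vaddr, vpn, offset, paddr);
    return 0;
}
```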

62 Design choices in Virtual Memory
• Design choices are mainly motivated by the high cost of a miss: – A page fault takes millions of cycles to process. – The miss rate (page fault rate) must therefore be kept very low. – Almost any technique is worthwhile, regardless of cost. • How to reduce the page fault rate: – Large page (block) size • Traditional: 4 KB – 16 KB; newer systems: 32 KB – 64 KB; embedded: down to 1 KB. – Fully associative placement of pages • Why can we afford fully associative placement here? – Handle page faults in software with clever replacement algorithms • The overhead is small compared to the disk access time, and even a small reduction in the miss rate pays for the cost of the algorithm. – Use write-back • Write-through would be far too slow.

63 How Do You Place the Page and Find it Again?
• Fully associative placement – The major difficulty is locating an entry, since it can be anywhere; a full search is impractical. • Solution: locate pages with an index table, the page table – The page table resides in memory and is indexed by the virtual page number. – No tag is required, because the page table contains a mapping for every possible virtual page. – Each program has its own page table, containing that program's virtual-to-physical address translations. – A register (the page table register) points to the start of the page table, indicating where the page table is located in memory.

64 Page Tables

65 Page Faults
• If the valid bit for a virtual page is off, a page fault occurs. • How is it handled? – The OS gets control (through an exception). – It finds the page at the next level down (disk). – It decides where to place the requested page in main memory.

66 How You Find the Pages? • You have to keep track of the location on disk of each page in virtual memory space • Swap space – The space on the disk reserved for the full virtual memory space of a process – OS creates the swap space when it creates the process • Data structured created by OS – To record where each virtual page is stored on disk • This structure can be part of the page table or an auxiliary data structure index as the page table • Created when creating swap space – To track which process and which virtual address use each physical page

67 Page Tables

68 Where to Place the Requested Pages?
• If some physical pages are free, use one of them. • If all pages in main memory are in use, choose a page to replace. – LRU replacement (least recently used). – Replaced pages are written to the swap space on disk. • Example – Page references in order: 10, 12, 9, 7, 11, 10. – If we now reference page 8 and must replace a page, LRU chooses 12. – On the next page fault, LRU chooses 9. • More on LRU – A complete LRU implementation is too expensive: it would require updating a data structure on every memory access. – Alternative: a "use" or "reference" bit per page. • The bit is set whenever the page is accessed; the OS periodically clears the use bits and later examines them to pick a replacement victim.

69 Size of Page Tables • Example: 32-bit virtual address, 4KB pages (212), 4 bytes per page table entry – No. of page table entries: 232/212=220 – Size of page table = 220x22=4MB/each program • To reduce the amount of storage required for the page table – 1. A limit register to restricts the size of the page for a give process • If needed, more entries are added to the page table • Expand in one direction – 2. Two page tables, and two separate limits • One for stack and one for heap – 3. Inverted page table • Apply a hashing function to the virtual address, so that size of page table = no. of physical pages in the main memory – 4. Multiple levels of pages • First level maps large fixed-size blocks (segment) of virtual address space (64 ~ 256 pages) by segment table • Each entry in the page table points to a page table for that segment – 5. Page tables to be paged • Allow the page tables to reside in the virtual address space

70 Making Address Translation Fast: TLB
• Why a TLB (translation lookaside buffer)? – Without one, every access would take two memory accesses: • Since the page tables are stored in main memory, one memory access is needed to obtain the physical address, • and a second access to get the data. – Improvement: exploit the locality of references to the page table (both spatial and temporal). • TLB – A cache that holds recently used page table mappings. – In the TLB: • Tag: the virtual page number. • Data: the physical page number. • Other bits: the reference bit and dirty bit from the page table entry. – On a TLB hit: • Use the physical page number to form the address. • Turn the reference bit on. • Turn the dirty bit on if the access is a write.
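A hedged C sketch of a tiny fully associative TLB lookup; the entry format and 16-entry size are assumptions for illustration, not the FastMath design shown in the next slides:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16

struct tlb_entry {
    bool     valid;
    bool     ref;        /* reference bit, copied from the page table */
    bool     dirty;      /* dirty bit, copied from the page table     */
    uint32_t vpn;        /* tag: virtual page number                  */
    uint32_t ppn;        /* data: physical page number                */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Return true on a TLB hit; on a hit, set the reference bit (and the dirty
   bit if this is a write) and return the physical page number via *ppn.   */
static bool tlb_lookup(uint32_t vpn, bool is_write, uint32_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].ref = true;
            if (is_write) tlb[i].dirty = true;
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;        /* TLB miss: consult the page table, then refill */
}

int main(void) {
    tlb[0] = (struct tlb_entry){ true, false, false, 0x00404u, 0x123u };

    uint32_t ppn;
    if (tlb_lookup(0x00404u, false, &ppn))
        printf("hit: ppn = 0x%x\n", ppn);
    else
        printf("miss\n");
    return 0;
}
```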

71 Making Address Translation Fast: TLB
• A cache for address translations: translation lookaside buffer

72 How About TLB Miss? • If TLB miss, determine whether it is a page fault or a TLB miss (check the valid bit) – If the page exists in memory: TLB miss • Only the translation is missing • Load the translation from the page into the TLB and try again • TLB miss will occur more frequent than true page fault since TLB has fewer entries that the no. of pages in memory – If the page is not in memory: Page fault • Translation is not in the page table=> true page fault • Raise an exception to OS • Replacement scheme – Fully associative since TLB is often small – Randomly choosing an entry to replace for large TLB

73 FastMath TLB and cache

74 TLBs and caches

75 Combination of events in page table, TLB, and Cache
• Fig. 7-26 • If the TLB hits: – The page table cannot miss (the translation must be in the page table). – The cache may hit or miss. • If the TLB misses: – If the entry is found in the page table, the access is retried after loading the TLB, and the data then hits or misses in the cache. – If the entry is not in the page table, it is a page fault.

76 Combination of events in page table, TLB, and Cache
• If the page table misses (page fault): – A TLB hit or cache hit is impossible; both the TLB and the cache must miss. • Else if the page table hits: – TLB hit: the cache may hit or miss. – TLB miss: the entry is found in the page table; after the retry, the data may hit or miss in the cache.

77 Hardware Support for Virtual Memory System
• Support two operating modes – User process (user mode) – OS or supervisor process, i.e., a kernel process (kernel mode, supervisor mode) • User-read-only processor state – A portion of the processor state that a user process can read but not write – Writable only via special instructions that are available only in supervisor mode • A mechanism to change from user mode to supervisor mode and vice versa – The system call exception

78 Protection with Virtual Memory
• Goal: allow memory to be shared by multiple processes safely. • Write protection – A write-access bit in each page table entry. • Read protection (isolation) – Each process has its own virtual address space. – Independent virtual pages map to disjoint physical addresses. • The page table must not be modifiable by a user process: – Only the OS can modify page tables. – Page tables are placed in the OS's protected address space. • Sharing information – The OS creates a page table entry that maps to a page in another process's space.

79 Summary
• Virtual memory – Supports sharing of main memory among multiple processes. – Protection is required for that sharing. • Reducing the high cost of page faults: – Large page size. – Fully associative placement via the page table. – The OS uses an LRU approximation with reference bits to choose the page to replace. – Write-back. • Cache for address translations: the TLB.

80 Outline I. Introduction II. Basics of Caches III. Measuring and improving cache performance IV. Virtual memory V. A common framework for memory hierarchies

81 A Common Framework for Memory Hierarchies
• Q1. Where can a block be placed? – Direct mapped, set associative, fully associative – Degree of associativity ↑ miss rate ↓, but cost ↑, access time ↑

82 Miss Rate v.s. Cache Size v.s. Associativity

83 A Common Framework for Memory Hierarchies
• Q2. How is a block found?

84 Choice Among Direct Mapped, Set Associative, and Fully Associative
• Considerations in the choice: – The cost of a miss. – The cost of implementing the associativity: extra time and hardware. – With an L2 cache, L1 can be small and can have higher associativity. • Virtual memory always uses fully associative placement because: – Full associativity is beneficial, since misses are very expensive. – It allows software to use sophisticated replacement schemes. – The full map (page table) can be easily indexed, with no extra hardware and no searching required. – The large page size means the page-table size overhead is relatively small. • Set-associative placement is often used for caches and TLBs: – It combines indexing with the search of a small set. • A few systems have used direct-mapped caches: – Advantages in access time and simplicity.

85 A Common Framework for Memory Hierarchies
• Q3. Which block should be replaced on a cache miss? – Random • Candidate blocks are randomly selected, possibly with some hardware assistance. • The miss rate is about 1.1 times that of LRU for a 2-way set-associative cache, but the difference becomes small (or even favors random) for larger caches. • Easy to implement in hardware. – LRU • Replace the block that has been unused for the longest time. • True LRU is too costly for higher associativity, so approximations (e.g., one bit per pair) or random replacement are used. • Virtual memory always uses an approximation of LRU, with reference bits or equivalent functionality. – Misses are so expensive, even though relatively infrequent, that even a tiny reduction in the miss rate is important.

86 A Common Framework for Memory Hierarchies
• Q4. What happens on a write? – Write-through • Easier to implement. – Write-back (copy-back) • Individual words can be written at the speed of the cache rather than of memory. • Multiple writes within a block require only one write to the next level. • Writing back a whole block can use a high-bandwidth transfer. • This is the trend in current cache designs, and the only practical option for virtual memory.

87 BIG Picture
• Q1. Where can a block be placed? – One place (direct mapped), a few places (set associative), or any place (fully associative). • Q2. How is a block found? – Four methods: indexing (as in a direct-mapped cache), limited search (as in a set-associative cache), full search (as in a fully associative cache), or a separate lookup table (as in a page table). • Q3. Which block is replaced on a miss? – LRU or random. • Q4. How are writes handled? – Write-through or write-back.

88 3Cs (3 categories for the sources of misses)
• Compulsory misses – The first access to a block. – Also called cold-start misses. • Capacity misses – Caused when the cache cannot contain all the blocks needed during execution of a program. – Occur when blocks are replaced and then later retrieved. • Conflict misses – Occur in set-associative or direct-mapped caches. – Multiple blocks compete for the same set. – Also called collision misses.

89 3Cs Three source of Misses

90 Memory Hierarchy Design Challenges

91 Modern Systems

92 Modern Systems • Things are getting complicated!

93 Cache Complexities: SW Impact
• Not always easy to understand implications of caches:

94 Cache Complexities: SW Impact
• Here is why: – Memory system performance is often the critical factor. – Multilevel caches and pipelined processors make it harder to predict outcomes. – Compiler optimizations that increase locality sometimes hurt ILP. • It is difficult to predict the best algorithm analytically: experimental data is needed.

95 Some Issues for Research
• Processor speeds continue to increase very fast – much faster than either DRAM or disk access times • Design challenge: dealing with this growing disparity – Prefetching? 3rd level caches and more? Memory design?

