1 ECE3055 Computer Architecture and Operating Systems Lecture 8 Memory Subsystem Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering.

Slides:

Advertisements

Similar presentations

1 Lecture 13: Cache and Virtual Memroy Review Cache optimization approaches, cache miss classification, Adapted from UCB CS252 S01.

Advertisements

Lecture 8: Memory Hierarchy Cache Performance Kai Bu

Computation I pg 1 Embedded Computer Architecture Memory Hierarchy: Cache Recap Course 5KK73 Henk Corporaal November 2014

1 Recap: Memory Hierarchy. 2 Memory Hierarchy - the Big Picture Problem: memory is too slow and or too small Solution: memory hierarchy Fastest Slowest.

CSIE30300 Computer Architecture Unit 10: Virtual Memory Hsin-Chou Chi [Adapted from material by and

Caching IV Andreas Klappenecker CPSC321 Computer Architecture.

Cs 325 virtualmemory.1 Accessing Caches in Virtual Memory Environment.

Spring 2003CSE P5481 Introduction Why memory subsystem design is important CPU speeds increase 55% per year DRAM speeds increase 3% per year rate of increase.

The Memory Hierarchy CPSC 321 Andreas Klappenecker.

1 Lecture 20 – Caching and Virtual Memory  2004 Morgan Kaufmann Publishers Lecture 20 Caches and Virtual Memory.

S.1 Review: The Memory Hierarchy Increasing distance from the processor in access time L1$ L2$ Main Memory Secondary Memory Processor (Relative) size of.

1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.

Review CPSC 321 Andreas Klappenecker Announcements Tuesday, November 30, midterm exam.

1 Chapter Seven Large and Fast: Exploiting Memory Hierarchy.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon, Oct 31, 2005 Topic: Memory Hierarchy Design (HP3 Ch. 5) (Caches, Main Memory and.

Chapter 7 Large and Fast: Exploiting Memory Hierarchy Bo Cheng.

Memory Hierarchy Design Chapter 5 Karin Strauss. Background 1980: no caches 1995: two levels of caches 2004: even three levels of caches Why? Processor-Memory.

The Memory Hierarchy II CPSC 321 Andreas Klappenecker.

Caching I Andreas Klappenecker CPSC321 Computer Architecture.

1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.

1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.

ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )

1  1998 Morgan Kaufmann Publishers Chapter Seven.

1  2004 Morgan Kaufmann Publishers Chapter Seven.

1 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value is stored as a charge.

1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy (Part II)

Processor Design 5Z032 Memory Hierarchy Chapter 7 Henk Corporaal Eindhoven University of Technology 2009.

11/10/2005Comp 120 Fall November 10 8 classes to go! questions to me –Topics you would like covered –Things you don’t understand –Suggestions.

1 CSE SUNY New Paltz Chapter Seven Exploiting Memory Hierarchy.

Computer ArchitectureFall 2007 © November 12th, 2007 Majd F. Sakr CS-447– Computer Architecture.

Computing Systems Memory Hierarchy.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Patterson.

ECM534 Advanced Computer Architecture

Lecture 19: Virtual Memory

Lecture 10 Memory Hierarchy and Cache Design Computer Architecture COE 501.

1  2004 Morgan Kaufmann Publishers Multilevel cache Used to reduce miss penalty to main memory First level designed –to reduce hit time –to be of small.

The Memory Hierarchy 21/05/2009Lecture 32_CA&O_Engr Umbreen Sabir.

CS1104 – Computer Organization PART 2: Computer Architecture Lecture 10 Memory Hierarchy.

CSIE30300 Computer Architecture Unit 08: Cache Hsin-Chou Chi [Adapted from material by and

1 Virtual Memory Main memory can act as a cache for the secondary storage (disk) Advantages: –illusion of having more physical memory –program relocation.

1  1998 Morgan Kaufmann Publishers Recap: Memory Hierarchy of a Modern Computer System By taking advantage of the principle of locality: –Present the.

The Goal: illusion of large, fast, cheap memory Fact: Large memories are slow, fast memories are small How do we create a memory that is large, cheap and.

Computer Organization & Programming

Lecture 08: Memory Hierarchy Cache Performance Kai Bu

1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.

Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.

CS.305 Computer Architecture Memory: Caches Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.

CPE232 Cache Introduction1 CPE 232 Computer Organization Spring 2006 Cache Introduction Dr. Gheith Abandah [Adapted from the slides of Professor Mary Irwin.

1 Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4.

1 Chapter Seven. 2 Users want large and fast memories! SRAM access times are ns at cost of $100 to $250 per Mbyte. DRAM access times are ns.

1  1998 Morgan Kaufmann Publishers Chapter Seven.

1  2004 Morgan Kaufmann Publishers Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality:

Improving Memory Access 2/3 The Cache and Virtual Memory

1 Appendix C. Review of Memory Hierarchy Introduction Cache ABCs Cache Performance Write policy Virtual Memory and TLB.

Summary of caches: The Principle of Locality: –Program likely to access a relatively small portion of the address space at any instant of time. Temporal.

1 Chapter Seven. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value.

1  2004 Morgan Kaufmann Publishers Page Tables. 2  2004 Morgan Kaufmann Publishers Page Tables.

COSC3330 Computer Architecture

Memory COMPUTER ARCHITECTURE

Yu-Lun Kuo Computer Sciences and Information Engineering

The Goal: illusion of large, fast, cheap memory

Improving Memory Access 1/3 The Cache and Virtual Memory

Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.

Cache Memory Presentation I

Morgan Kaufmann Publishers Memory & Cache

Virtual Memory Main memory can act as a cache for the secondary storage (disk) Advantages: illusion of having more physical memory program relocation protection.

Virtual Memory 4 classes to go! Today: Virtual Memory.

Chapter Five Large and Fast: Exploiting Memory Hierarchy

Memory & Cache.

Presentation transcript:

1 ECE3055 Computer Architecture and Operating Systems Lecture 8 Memory Subsystem Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology

2 Performance matters  Consider basic 5 stage pipeline design CPU runs at 1ns cycle time (1GHz) Main memory runs at 100ns access time  How is performance?  Fetch A (100 cycles)  Decode A (1 cycle) + Fetch B (100 cycles)  Ex A (1 cy) + Dec B (1 cy) + Fetch C (100 cycles)  Effective 1 instr per 100 cycles --> 10MHz CPU Latency killing system  Problem only getting worse CPU speeds grow much faster than DRAM speeds DRAM bandwidth improving well

3 Typical Solution  With the large gap in CPU vs DRAM speeds, must have solution Cache (“slave memories”) Ultra-fast, small local memory Cache runs at 1ns access time...  Fetch A (1 cycle)  Decode A (1 cycle) + Fetch B (1 cycle)  Ex A (1 cy) + Dec B (1 cy) + Fetch C (1 cy)  Effective 1 instr per 1 cyc ----> 1GHz CPU  Special terms hit miss rates (hit/miss)

4  SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors)  DRAM: value is stored as a charge on capacitor (must be refreshed) very small but slower than SRAM (factor of 5 to 10)  * ignoring new technologies ‘on the horizon’ Memories: Two basic types* Word line Pass transistor Bit line Bit line bar Word line Pass transistor Capacitor Bit line

5 Memory Photos Intel Paxville (dual Core) 90nm 8-way 2MB L2 for each core Intel Itanium2.13µm 24-way 6MB L3 240-pin DDR2 DRAM

6  Users want large and fast memories! For example: SRAM access times are 700ps-1ns (1-3 cycles) DRAM access times are ns ( cycles) Disk access times are ~1 million ns (~3M cycles)  Try and give it to them anyway build a memory hierarchy Exploiting Memory Hierarchy

7 Model of Memory Hierarchy RegFileL1 Data cache L1 Inst cache L2 Cache Cache Main Memory Memory DISKSRAMDRAM

8 P4: Prescott w/ 2MB L2 (90nm)  Prescott runs very fast (3.4+ GHz)  2MB L2 Unified Cache  12K* trace cache (think “I$”)  16KB data cache  Where is the cache?  What about the similar blocks?  Why the visual differences?  Why is it square?  What’s with the colors?  Check this out: TC L1D L2

9 Interfacing Processors and Peripherals  I/O Design affected by many factors (expandability, resilience)  Performance: — access latency — throughput — connection between devices and the system — the memory hierarchy — the operating system  A variety of different users (e.g., banks, supercomputers, engineers)

10 I/O Devices  Very diverse devices — behavior (i.e., input vs. output) — partner (who is at the other end?) — data rate

11 I/O Example: Disk Drives  To access data: — seek: position head over the proper track ( ms. avg.) — rotational latency: wait for desired sector (.5 / RPM) — transfer: grab the data (one or more sectors) 2 to 15 MB/sec  not considering disk buffer hits ( MB/s)

12 Locality  A principle that makes having a memory hierarchy a good idea Temporal locality  If an item is referenced, Temporal locality: it will tend to be referenced again soon Spatial locality Spatial locality: nearby items will tend to be referenced soon. Why does code have locality? What about data locality?  Our initial focus: two levels (upper, lower) block: minimum unit of data hit: data requested is in the upper level miss: data requested is not in the upper level

13  Two issues: How do we know if a data item is in the cache? If it is, how do we find it?  Our first example: block size is one word of data "direct mapped" For each item of data at the lower level, there is exactly one location in the cache where it might be. e.g., lots of items at the lower level share locations in the upper level Cache

14  Mapping: address is modulo the number of blocks in the cache Direct Mapped Cache

15  For MIPS: What kind of locality are we taking advantage of? Direct Mapped Cache

16 Example: DM$, 8-Entry, 4B Valid Tag DataIndex Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] (4 bytes)

17 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes)

18 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 #24 is (4 bytes)

19 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 #24 is (4 bytes) 4-byte block, drop low 2 bits for byte offset! Only matters for byte- addressable systems

20 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 #24 is (4 bytes) Next log 2 (8) bits mod 8 = Index

21 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) #24 is Next log 2 (8) bits mod 8 = Index

22 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) #24 is

23 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) #24 is [a] copy

24 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) #24 is [a] copy 000

25 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) #24 is [a] copy 000

26 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000

27 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000

28 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 #28 is

29 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 #28 is [b] copy000

30 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [b] copy000

31 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [b] copy000

32 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [b] copy000 #60 is It’s valid! How to tell it’s the wrong address?

33 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [b] copy000 #60 is The tags don’t match! It’s not what we want to access!

34 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 #60 is

35 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 #60 is [c] copy001

36 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] copy001

37 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] copy001 Q: What if the machine is only word-addressable?

38 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] copy001 Q: What if the machine is only word-addressable?

39 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] copy001 Q: What if the machine is only word-addressable? #60 is

40 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] copy001 Q: What if the machine is only word-addressable? Tag is 2 bits larger; otherwise same (note index+data change!) #60 is

41 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] copy001 Q: What about writing back to memory?

42 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] OLD [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] NEW001 Q: What about writing back to memory? Do we update memory now? Or later?

43 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] OLD [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] NEW001 Q: What about writing back to memory? Assume later...

44 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] OLD [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] NEW001 Q: What about writing back to memory?

45 Example: DM$, 8-Entry, 4B Valid Tag DataIndex lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] OLD [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] NEW001 Q: What about writing back to memory? # Now What? How do we know to write back?

46 Example: DM$, 8-Entry, 4B Valid Tag Dat Index lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] OLD [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] NEW001 Q: What about writing back to memory? # Need extra state! The “dirty” bit! a Dirty

47 Example: DM$, 8-Entry, 4B Valid Tag Dat Index lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] NEW [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 [c] NEW001 Q: What about writing back to memory? # a Dirty

48 Example: DM$, 8-Entry, 4B Valid Tag Dat Index lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] NEW [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 Q: What about writing back to memory? # a Dirty

49 Example: DM$, 8-Entry, 4B Valid Tag Dat Index lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] NEW [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 Q: What about writing back to memory? # a Dirty [d] copy101

50 Example: DM$, 8-Entry, 4B Valid Tag Dat Index lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] NEW [d] #24 #28 #60 #188 (4 bytes) [a] copy 000 Q: What about writing back to memory? a Dirty [d] copy101

51 Example: DM$, 8-Entry, 4B Valid Tag Dat Index lw $0, #24 [a] lw $1, #28 [b] sw $2, #60 [c] sw $3, #188 [d] Main Memory (System) [a] [b] [c] NEW [d] OLD #24 #28 #60 #188 (4 bytes) [a] copy 000 Q: What about writing back to memory? a Dirty [d] NEW101

52  Trade-Offs: Write-back or Write-Through? Write-Alloc or No-Write-Alloc? How does Tag change with # of Entries? How does minimum machine word size impact tag? What kind of locality are we taking advantage of? DM$: Thoughts

53  Taking advantage of spatial locality: Direct Mapped Cache Address (showing bit positions) 1612Byte offset V Tag Data HitData K entries 16 bits128 bits Mux Block offsetIndex Tag

54  Read hits this is what we want!  Read misses stall the CPU, fetch block from memory, deliver to cache, restart  Write hits: can replace data in cache and memory (write-through) write the data only into the cache (write-back the cache later)  Write misses: read the entire block into the cache, then write the word… ? Hits vs. Misses

55  Make reading multiple words easier by using banks of memory  It can get a lot more complicated... Hardware Issues

56  Increasing the block size tends to decrease miss rate :  Use split caches because there is more spatial locality in code: Performance % 35% 30% 25% 20% 15% 10% 5% 0% M i s s r a t e Block size (bytes) 1 KB 8 KB 16 KB 64 KB 256 KB

57 Performance  Simplified model: execution time = (execution cycles + stall cycles)  cycle time stall cycles = # of instructions  miss ratio  miss penalty  Two ways of improving performance: decreasing the miss ratio decreasing the miss penalty What happens if we increase block size?

58 Compared to direct mapped, give a series of references that: results in a lower miss ratio using a 2-way set associative cache results in a higher miss ratio using a 2-way set associative cache (assuming “least recently used” replacement strategy) Decreasing miss ratio with associativity

59 An implementation

60 Set-Associative Cache  Multiple cache blocks (lines) can be allocated into the same set  When full, needs to evict some block out of the cache  Need to consider the locality  Replacement policy Last-In First-Out (LIFO), like a stack Random First-In First-Out (FIFO) Least Recently Used (LRU)

61 Least Recently Used (LRU) ABCD ABCD MRU LRULRU+1MRU-1 Access C CABD Access D DCAB Access E EDCA Access C CEDA Access G GCED

62 Performance

63 Decreasing miss penalty with multilevel caches  Add a second level cache: often primary cache is on the same chip as the processor use SRAMs to add another cache above primary memory (DRAM) miss penalty goes down if data is in 2nd level cache  Example: CPI of 1.0 on a 5GHz machine for no cache miss CPI The same machine with 1 st level cache (L1) with a 2% miss rate per instruction, and a 100ns DRAM access (what is the CPI ?) speedup Adding 2nd level cache with 5ns access time (including L1 access time) decreases miss rate per instruction to 0.5%, what is the speedup over the machine with only L1  Using multilevel caches: try and optimize the hit time on the 1st level cache try and optimize the miss rate on the 2nd level cache

64 Virtual Memory  Main memory can act as a cache for the secondary storage (disk)  Advantages: illusion of having more physical memory program relocation protection

65 Pages: virtual memory blocks  Page faults: the data is not in memory, retrieve it from disk huge miss penalty, thus pages should be fairly large (e.g., 4KB) reducing page faults is important (LRU is worth the price) can handle the faults in software instead of hardware using write-through is too expensive so we use writeback Page offsetVirtual page number Virtual address Page offsetPhysical page number Physical address Translation

66 Page Tables Physical memory Disk storage Valid Page table Virtual page number Physical page or disk address

67 Page Tables

68 Making Address Translation Fast  A cache for address translations: translation lookaside buffer

69 TLBs and caches

70 Modern Systems  Very complicated memory systems:

71  Processor speeds continue to increase very fast — much faster than either DRAM or disk access times  Design challenge: dealing with this growing disparity  Trends: synchronous SRAMs (provide a burst of data) redesign DRAM chips to provide higher bandwidth or processing restructure code to increase locality use prefetching (make cache visible to ISA) Some Issues