1 CACHE BASICS

2 1977: DRAM faster than microprocessors

3 Since 1980, CPU has outpaced DRAM...

4 How do architects address this gap? Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per bit than slower memory Solution: organize the memory system into a hierarchy –Entire addressable memory space available in the largest, slowest memory –Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor Temporal and spatial locality ensures that nearly all references can be found in the smaller memories –Gives the illusion of a large, fast memory being presented to the processor

5 Memory Hierarchy

6 Memory Hierarchy Design Memory hierarchy design becomes more crucial with recent multi-core processors: –Aggregate peak bandwidth grows with # cores: Intel Core i7 can generate two references per core per clock Four cores and a 3.2 GHz clock –25.6 billion 64-bit data references/second + –12.8 billion 128-bit instruction references/second –= 409.6 GB/s! DRAM bandwidth is only 6% of this (25 GB/s) Requires: –Multi-port, pipelined caches –Two levels of cache per core –Shared third-level cache on chip
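Checking that figure from the numbers on the slide (a quick sketch; the per-reference sizes come straight from the bullets above):

Data references:        4 cores x 2 refs/clock x 3.2 GHz = 25.6 billion/s x 8 bytes  = 204.8 GB/s
Instruction references: 4 cores x 1 ref/clock  x 3.2 GHz = 12.8 billion/s x 16 bytes = 204.8 GB/s
Total peak demand                                                                    = 409.6 GB/s
A DRAM bandwidth of roughly 25 GB/s is about 6% of that.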

7 Memory Hierarchy Basics When a word is not found in the cache, a miss occurs: –Fetch the word from a lower level in the hierarchy, requiring a higher latency reference –The lower level may be another cache or the main memory –Also fetch the other words contained within the block Takes advantage of spatial locality –Place the block into the cache in any location within its set, determined by the address: (block address) MOD (number of sets)

8 Locality A principle that makes memory hierarchy a good idea If an item is referenced –Temporal locality: it will tend to be referenced again soon –Spatial locality: nearby items will tend to be referenced soon Our initial focus: two levels (upper, lower) –Block: minimum unit of data –Hit: data requested is in the upper level –Miss: data requested is not in the upper level

9 Memory Hierarchy Basics Note that speculative and multithreaded processors may execute other instructions during a miss –Reduces performance impact of misses

10 Cache For each item of data at the lower level, there is exactly one location in the cache where it might be, e.g., lots of items at the lower level share locations in the upper level. Two issues –How do we know if a data item is in the cache? –If it is, how do we find it? Our first example –Block size is one word of data –"Direct mapped"

11 Direct mapped cache Mapping –Cache address is Memory address modulo the number of blocks in the cache –(Block address) modulo (#Blocks in cache)
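As a minimal sketch of this mapping rule (the function and variable names are illustrative, not from the slides):

/* Direct-mapped placement: each memory block can live in exactly one cache block. */
unsigned cache_index(unsigned block_address, unsigned num_blocks)
{
    return block_address % num_blocks;   /* (Block address) modulo (#Blocks in cache) */
}
/* Example: with 8 cache blocks, memory block 12 maps to cache block 12 % 8 = 4. */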

12 What kind of locality are we taking advantage of? Direct mapped cache

13 Direct mapped cache Taking advantage of spatial locality (16KB cache, 256 Blocks, 16 words/block)

14 Block Size vs. Performance

15 Block Size vs. Cache Measures Increasing block size generally increases miss penalty and decreases miss rate. (Figure: miss rate x miss penalty = average memory access time, each plotted as a function of block size.)

16 Four Questions for Memory Hierarchy Designers Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

17 Q1: Where can a block be placed in the upper level? Direct Mapped: Each block has only one place that it can appear in the cache. Fully associative: Each block can be placed anywhere in the cache. Set associative: Each block can be placed in a restricted set of places in the cache. –If there are n blocks in a set, the cache placement is called n-way set associative

18 Associativity Examples Fully associative: Block 12 can go anywhere Direct mapped: Block no. = (Block address) mod (No. of blocks in cache) Block 12 can go only into block 4 (12 mod 8) Set associative: Set no. = (Block address) mod (No. of sets in cache) Block 12 can go anywhere in set 0 (12 mod 4)

19 Direct Mapped Cache

20 2 Way Set Associative Cache

21 Fully Associative Cache

22 An implementation of a four-way set associative cache

23 Performance

24 Q2: How Is a Block Found If It Is in the Upper Level? The address can be divided into two main parts –Block offset: selects the data from the block offset size = log2(block size) –Block address: tag + index index: selects the set in the cache index size = log2(#blocks/associativity) –tag: compared to the tag in the cache to determine a hit tag size = address size - index size - offset size (Address layout: Tag | Index | Block offset)
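A sketch of how those field widths turn into shifts and masks (an illustrative helper, assuming the block size and number of sets are powers of two):

#include <stdint.h>

static unsigned log2u(unsigned x)        /* log2 of a power of two */
{
    unsigned n = 0;
    while (x >>= 1) n++;
    return n;
}

/* Split a 32-bit address into block offset, index, and tag. */
void split_address(uint32_t addr, unsigned block_size_bytes, unsigned num_sets,
                   uint32_t *offset, uint32_t *index, uint32_t *tag)
{
    unsigned offset_bits = log2u(block_size_bytes);  /* offset size = log2(block size) */
    unsigned index_bits  = log2u(num_sets);          /* index size = log2(#blocks/associativity) */
    *offset = addr & (block_size_bytes - 1);
    *index  = (addr >> offset_bits) & (num_sets - 1);
    *tag    = addr >> (offset_bits + index_bits);    /* tag = remaining high-order bits */
}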

25 Q3: Which Block Should be Replaced on a Miss? Easy for Direct Mapped. Set Associative or Fully Associative: –Random - easier to implement –Least Recently Used (LRU) - harder to implement - may approximate. Miss rates for caches with different size, associativity, and replacement algorithm:

Associativity:   2-way            4-way            8-way
Size      LRU      Random    LRU      Random    LRU      Random
16 KB     5.18%    5.69%     4.67%    5.29%     4.39%    4.96%
64 KB     1.88%    2.01%     1.54%    1.66%     1.39%    1.53%
256 KB    1.15%    1.17%     1.13%    1.13%     1.12%    1.12%

For caches with low miss rates, random is almost as good as LRU.

26 Q4: What Happens on a Write?

27 Q4: What Happens on a Write? Since data does not have to be brought into the cache on a write miss, there are two options: –Write allocate The block is brought into the cache on a write miss Used with write-back caches Hope subsequent writes to the block hit in the cache –No-write allocate The block is modified in memory, but not brought into the cache Used with write-through caches Writes have to go to memory anyway, so why bring the block into the cache?

28 Hits vs. misses Read hits –This is what we want! Read misses –Stall the CPU, fetch block from memory, deliver to cache, restart Write hits –Can replace data in cache and memory (write-through) –Write the data only into the cache (write-back the cache later) Write misses –Read the entire block into the cache, then write the word

29 Cache Misses On cache hit, CPU proceeds normally On cache miss –Stall the CPU pipeline –Fetch block from next level of hierarchy –Instruction cache miss Restart instruction fetch –Data cache miss Complete data access

30 Cache Measures Hit rate: fraction of accesses found in the cache –Miss rate = 1 - Hit rate Hit time: time to access the cache Miss penalty: time to replace a block from the lower level –access time: time to access the lower level –transfer time: time to transfer the block CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

31 Improving Cache Performance Average memory-access time = Hit time + Miss rate x Miss penalty Improve performance by: 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.
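A small worked instance of the formula (the numbers are illustrative, not from the slides):

/* Average memory-access time = Hit time + Miss rate x Miss penalty */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
/* e.g., amat(1.0, 0.05, 50.0) = 1 + 0.05 * 50 = 3.5 cycles per access */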

32 Types of misses Compulsory –Very first access to a block (cold-start miss) Capacity –Cache cannot contain all blocks needed Conflict –Too many blocks mapped onto the same set

33 How do you solve... Compulsory misses? –Larger blocks (side effect: larger miss penalty; see the next slide) Capacity misses? –Not many options: enlarge the cache, otherwise you face "thrashing" and the computer runs at the speed of the lower-level memory or slower! Conflict misses? –Fully associative cache, at a cost in hardware that may slow the processor!

34 Basic cache optimizations: –Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty –Larger total cache capacity to reduce miss rate Increases hit time, increases power consumption –Higher associativity Reduces conflict misses Increases hit time, increases power consumption –Higher number of cache levels Reduces overall memory access time –Giving priority to read misses over writes Reduces miss penalty

35 Other Optimizations: Victim Cache Add a small fully associative victim cache to place data discarded from regular cache When data not found in cache, check victim cache 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache Get access time of direct mapped with reduced miss rate

36 Other Optimizations: Reducing Misses by HW Prefetching of Instructions & Data E.g., Instruction Prefetching –Alpha fetches 2 blocks on a miss –Extra block placed in a stream buffer –On a miss, check the stream buffer –Jouppi [1990]: 1 data stream buffer got 25% of misses from a 4KB cache; 4 streams got 43% Works with data blocks too: –Palacharla & Kessler [1994]: for scientific programs, 8 streams got 50% to 70% of misses from two 64KB, 4-way set associative caches Prefetching relies on extra memory bandwidth that can be used without penalty

37 Other Optimizations: Reducing Misses by Compiler Optimizations Instructions –Reorder procedures in memory so as to reduce misses –Profiling to look at conflicts –McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4-byte blocks Data –Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays –Loop Interchange: change nesting of loops to access data in the order stored in memory –Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap –Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

38 Merging Arrays Example
Problem: referencing multiple arrays in the same dimension, with the same index, at the same time can lead to conflict misses.
Solution: merge the independent arrays into a compound array.

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

39 Miss Rate Reduction Techniques: Compiler Optimizations –Loop Interchange
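The transcript preserves only the slide title; the usual textbook illustration of loop interchange is roughly the following (a sketch assuming x is a 5000 x 100 array of ints):

/* Before: the inner loop strides through x in steps of 100 words,
   touching a new cache block on almost every access. */
for (int j = 0; j < 100; j++)
    for (int i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];

/* After (loops interchanged): x is swept in the order it is stored in memory,
   so each cache block is fully used before moving on. */
for (int i = 0; i < 5000; i++)
    for (int j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];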

40 Miss Rate Reduction Techniques: Compiler Optimizations –Loop Fusion
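Again only the title survives in the transcript; a sketch of the standard loop-fusion example (assuming a, b, c, d are N x N arrays of doubles):

/* Before: two separate loop nests; the second re-reads a[][] and c[][]
   after they may already have been evicted from the cache. */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        a[i][j] = 1.0 / b[i][j] * c[i][j];
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

/* After (fused): a[i][j] and c[i][j] are reused while still in the cache. */
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
        a[i][j] = 1.0 / b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }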

41 Blocking
Problem: when accessing multiple multi-dimensional arrays (e.g., for matrix multiplication), capacity misses occur if not all of the data can fit into the cache.
Solution: divide the matrix into smaller submatrices (or blocks) that fit within the cache. The block size chosen depends on the size of the cache. Blocking can only be used for certain types of algorithms.
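A sketch of blocked (tiled) matrix multiplication along these lines; the block size B and the flat row-major layout are illustrative choices, and n is assumed to be a multiple of B for brevity:

#define B 32   /* chosen so three B x B submatrices fit in the cache */

/* Computes x += y * z for row-major n x n matrices of doubles. */
void matmul_blocked(int n, double *x, const double *y, const double *z)
{
    for (int jj = 0; jj < n; jj += B)
        for (int kk = 0; kk < n; kk += B)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i*n + k] * z[k*n + j];   /* work stays within the current tiles */
                    x[i*n + j] += r;
                }
}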

42 Summary of Compiler Optimizations to Reduce Cache Misses

43 Decreasing miss penalty with multi-level caches Add a second level cache: –Often primary cache is on the same chip as the processor –Use SRAMs to add another cache above primary memory (DRAM) –Miss penalty goes down if data is in 2nd level cache Using multilevel caches: –Try and optimize the hit time on the 1st level cache –Try and optimize the miss rate on the 2nd level cache

44 Multilevel Caches Primary cache attached to CPU –Small, but fast Level-2 cache services misses from primary cache –Larger, slower, but still faster than main memory Main memory services L-2 cache misses Some high-end systems include L-3 cache

45 Virtual Memory Use main memory as a “cache” for secondary (disk) storage –Managed jointly by CPU hardware and the operating system (OS) Programs share main memory –Each gets a private virtual address space holding its frequently used code and data –Protected from other programs CPU and OS translate virtual addresses to physical addresses –VM “block” is called a page –VM translation “miss” is called a page fault

46 Virtual Memory Main memory can act as a cache for the secondary storage (disk) Advantages: –illusion of having more physical memory –program relocation –protection

47 Pages: virtual memory blocks Page faults: the data is not in memory; retrieve it from disk –huge miss penalty, thus pages should be fairly large (e.g., 4KB) –What type of placement (direct mapped, set associative, or fully associative)? Reducing page faults is important (LRU is worth the price) –can handle the faults in software instead of hardware –using write-through is too expensive, so we use write-back (Analogy: the virtual address is like a book title; the physical address is like the book's location in the library.)
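A sketch of the translation step described above (the page size, field widths, and structure names are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE  4096u      /* 4 KB pages, as suggested on the slide */
#define PAGE_SHIFT 12

typedef struct {
    bool     valid;           /* page is resident in main memory */
    bool     dirty;           /* written since it was loaded (write-back) */
    uint32_t ppn;             /* physical page number */
} pte_t;

/* Translate a virtual address using a page table indexed by virtual page number.
   Returns false on a page fault, which the OS then handles in software. */
bool translate(const pte_t *page_table, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);
    if (!page_table[vpn].valid)
        return false;         /* page fault: fetch the page from disk */
    *paddr = (page_table[vpn].ppn << PAGE_SHIFT) | offset;
    return true;
}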

48 Page Tables (Fully Associative Search Time)

49 A Program's State: Page Table, PC, Registers

50 Page Tables

51 Page Faults: Replacement Policy Handle with hardware or software? –Disk access time is large relative to the overhead of a software-based solution, so faults are handled in software LRU –Costly to keep track of every page –Mechanism? Approximate it with a reference (use) bit per page that hardware sets and the OS periodically clears

52 Making Address Translation Fast Page tables are in memory, so a memory access by a program takes twice as long: –Obtain the physical address –Get the data Make use of locality of reference –Temporal & spatial (words in a page) Solution –Special cache: keep track of recently used translations Translation Lookaside Buffer (TLB) –Translation cache –Like the piece of paper where you record the locations of the books you need from the library

53 Making Address Translation Fast A cache for address translations: the translation lookaside buffer Typical values: 16-512 entries, miss rate: 0.01% - 1%, miss penalty: 10 - 100 cycles
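A sketch of the TLB check that happens before a full page-table access (fully associative lookup; the entry count and field names are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64        /* illustrative size */

typedef struct {
    bool     valid;
    uint32_t vpn;             /* virtual page number */
    uint32_t ppn;             /* physical page number */
} tlb_entry_t;

/* On a hit, the translation costs no extra memory access;
   on a miss, fall back to the page table and refill an entry. */
bool tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES], uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;      /* TLB hit */
        }
    return false;             /* TLB miss: walk the page table */
}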

54 TLBs and caches

55 TLBs and Caches

56 Some Issues Processor speeds continue to increase very fast, much faster than either DRAM or disk access times Design challenge: dealing with this growing disparity –Prefetching? 3rd-level caches and more? Memory design?

57 Memory Technology Performance metrics –Latency is the concern of caches –Bandwidth is the concern of multiprocessors and I/O –Access time: time between a read request and when the desired word arrives DRAM is used for main memory; SRAM is used for caches

58 Latches and Flip-flops

59 Latches and Flip-flops

60 Latches and Flip-flops Latches: the output changes whenever the inputs change and the clock is asserted Flip-flops: the state changes only on a clock edge (edge-triggered methodology)

61 SRAM

62 SRAM vs. DRAM Which one has better memory density? Which one is faster? Static RAM (SRAM): the value stored in a cell is kept on a pair of inverting gates Dynamic RAM (DRAM): the value kept in a cell is stored as a charge on a capacitor DRAMs use only a single transistor per bit of storage; by comparison, SRAMs require four to six transistors per bit In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must be refreshed periodically (hence "dynamic") –Every ~8 ms –Each row can be refreshed simultaneously –Must be re-written after being read

63 Memory Technology Amdahl: –Memory capacity should grow linearly with processor speed (followed this trend for about 20 years) –Unfortunately, memory capacity and speed have not kept pace with processors –Fourfold improvement every 3 years (originally) –More recently, capacity has only been doubling over a comparable period

64 Memory Optimizations

65 Memory Technology Some optimizations: –Synchronous DRAM Added a clock to the DRAM interface Burst mode with critical word first –Wider interfaces 4-bit transfer mode originally In 2010, up to 16-bit busses –Double data rate (DDR) Transfer data on both the rising and falling clock edges

66 Memory Optimizations

67 Memory Optimizations DDR: –DDR2 Lower power (2.5 V -> 1.8 V) Higher clock rates (266 MHz, 333 MHz, 400 MHz) –DDR3 1.5 V, 800 MHz –DDR4 (scheduled for production in 2014) Lower voltage, 1600 MHz GDDR5 is graphics memory based on DDR3

68 Memory Optimizations Graphics memory: –Achieves 2-5x bandwidth per DRAM vs. DDR3 Wider interfaces (32 vs. 16 bit) Higher clock rate –Possible because they are attached via soldering instead of socketed Dual Inline Memory Modules (DIMMs) Reducing power in SDRAMs: –Lower voltage –Low power mode (ignores clock, continues to refresh)

69 Virtual Machines First developed in the 1960s Regained popularity recently –Need for isolation and security in modern systems –Failures in security and reliability of standard operating systems –Sharing of a single computer among many unrelated users (datacenter, cloud) –Dramatic increase in raw speed of processors Overhead of VMs now more acceptable

70 Virtual Machines Emulation methods that provide a standard software interface –IBM VM/370, VMware ESX Server, Xen Create the illusion of having an entire computer to yourself, including a copy of the OS Allow different ISAs and operating systems to be presented to user programs –"System Virtual Machines" –SVM software is called a "virtual machine monitor" or "hypervisor" –Individual virtual machines running under the monitor are called "guest VMs"

71 Impact of VMs on Virtual Memory Each guest OS maintains its own set of page tables –VMM adds a level of memory between physical and virtual memory called “real memory” –VMM maintains shadow page table that maps guest virtual addresses to physical addresses Requires VMM to detect guest’s changes to its own page table Occurs naturally if accessing the page table pointer is a privileged operation

72 Assume 75% instruction, 25% data access

73

74 Cost of Misses, CPU time

75

76

77 Example CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100 ns main memory access time Adding a 2nd-level cache with 5 ns access time decreases the miss rate to 0.5% How much faster is the new configuration?
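A sketch of the standard way this example is worked (treating the miss rates as misses per instruction and counting only memory stalls):

Clock cycle at 5 GHz = 0.2 ns
Miss penalty to main memory = 100 ns / 0.2 ns = 500 cycles
CPI without L2 = 1.0 + 2% x 500 = 11.0
Miss penalty to the L2 cache = 5 ns / 0.2 ns = 25 cycles
CPI with L2 = 1.0 + 2% x 25 + 0.5% x 500 = 1.0 + 0.5 + 2.5 = 4.0
Speedup = 11.0 / 4.0 ≈ 2.8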