Improving memory with caches (CSE 141, Winter 2002)
Memory hierarchy: CPU -> on-chip cache -> off-chip cache -> DRAM memory -> disk memory.

The five components
A computer has five classic components: input, output, memory, datapath, and control.

Memory technologies
SRAM: access time 3-10 ns (on-processor SRAM can be 1-2 ns); cost roughly $100 per MByte (??).
DRAM: access time … ns; cost roughly $0.50 per MByte.
Disk: access time 5 to 20 million ns; cost roughly $0.01 per MByte.
We want SRAM's access time and disk's capacity.
Disclaimer: access times and prices are approximate and constantly changing. (2/2002)

The problem with memory
It's expensive (and perhaps impossible) to build a large, fast memory, where "fast" means "low latency". Why is low latency important? To access data quickly, the data must be physically close, and there can't be too many layers of logic in the path.
Solution: move data you are about to access into a nearby, smaller memory: a cache. This works assuming you can make good guesses about what you will access soon.

A typical memory hierarchy
CPU
on-chip "level 1" cache (SRAM): small, fast
off-chip "level 2" cache (SRAM): larger
main memory (DRAM): big, slower, cheaper per bit
disk: huge, very slow, very cheap

Cache basics
In a running program, main memory is the data's "home location"; addresses refer to locations in main memory. ("Virtual memory" allows disk to extend DRAM; we'll study virtual memory later.)
When data is accessed, it is automatically moved into the cache, and the processor (or a smaller cache) uses the cache's copy. Data in main memory may temporarily get out of date, but hardware must keep everything consistent.
Unlike registers, the cache is not part of the ISA: different models can have totally different cache designs.

The principle of locality
Memory hierarchies take advantage of memory locality: the principle that future memory accesses tend to be near past accesses. There are two types of locality (the following are "fuzzy" terms):
Temporal locality: near in time. We will often access the same data again very soon.
Spatial locality: near in space/distance. Our next access is often very close to recent accesses.
This sequence of addresses has both types: 1, 2, 3, 1, 2, 3, 8, 8, 47, 9, 10, 8, 8, ... The runs 1, 2, 3 and 8, 9, 10 show spatial locality; the repeats of 1, 2, 3 and of 8 show temporal locality; 47 is non-local.
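To make the two kinds concrete, here is a minimal C sketch (mine, not the slides'): the array sweep exhibits spatial locality, while the accumulator exhibits temporal locality.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];
        long sum = 0;

        /* Spatial locality: consecutive iterations touch adjacent
           addresses, so one cache-line fill serves several accesses. */
        for (int i = 0; i < N; i++)
            a[i] = i;

        /* Temporal locality: sum (and the loop counter) is reused on
           every iteration, so it stays in cache (or in a register). */
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("%ld\n", sum);
        return 0;
    }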

How does HW decide what to cache?
Taking advantage of temporal locality: bring data into the cache whenever it is referenced, and kick out something that hasn't been used recently.
Taking advantage of spatial locality: bring in a block of contiguous data (a cache line), not just the requested word.
Some processors have instructions that let software influence the cache: a prefetch instruction ("bring location x into cache"), and "never cache x" or "keep x in cache" instructions. (On the slide this software-hint material is set in Helvetica italics, meaning "won't be on test".)
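As one concrete (and more modern) illustration of such a software hint, GCC and Clang expose the __builtin_prefetch intrinsic. This sketch is mine, not the lecture's, and the prefetch distance of 16 elements is an arbitrary illustrative choice:

    /* __builtin_prefetch(addr, rw, locality) is only a hint;
       the hardware is free to ignore it. */
    void scale(float *x, int n, float k) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&x[i + 16], 1, 3); /* 1 = for write, 3 = keep in cache */
            x[i] *= k;
        }
    }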

Cache Vocabulary
cache hit: an access where the data is already in the cache.
cache miss: an access where the data isn't in the cache.
cache block size (cache line size): the amount of data that gets transferred on a cache miss.
instruction cache (I-cache): a cache that can hold only instructions.
data cache (D-cache): a cache that can hold only data.
unified cache: a cache that holds both data and instructions (like the multi-cycle design's single memory).
A typical processor today has separate "level 1" I- and D-caches on the same chip as the processor (like the single-cycle and pipelined designs' separate memories), possibly a larger unified "L2" on-chip cache, and a larger L2 (or L3) unified cache on a separate chip.

Cache Issues
On a memory access: how does the hardware know whether it is a hit or a miss?
On a cache miss: where do we put the new data? What data do we throw out? How do we remember what data is where?

A simple cache
Fully associative: any line of data can go anywhere in the cache.
LRU replacement strategy: make room by throwing out the least recently used data.
A very small cache: 4 entries, each holding a four-byte word; any entry can hold any word. Each entry has three fields: a tag (which identifies the addresses of the cached data), the data itself, and the time since last reference (used to help decide which entry to replace).

Simple cache in action
Sequence of memory references: 24, 20, 04, 12, 20, 44, 04, 24, 44.
The first four references are all misses; they fill up the cache with the words at 24-27, 20-23, 04-07, and 12-15 (each entry recording its time since last reference).
The next reference (20) is a hit, and the reference times are updated.
44 is a miss; the oldest data (24-27) is replaced by 44-47.
Now what happens with the remaining references (04, 24, 44)?
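Here is a hypothetical C sketch (mine, not the slides') that simulates this 4-entry, fully associative, LRU cache on the slide's reference sequence:

    #include <stdio.h>

    #define ENTRIES 4
    #define LINE    4                        /* bytes per cache line */

    int main(void) {
        int tag[ENTRIES], last[ENTRIES], valid[ENTRIES] = {0};
        int refs[] = {24, 20, 4, 12, 20, 44, 4, 24, 44};
        int n = sizeof refs / sizeof refs[0];

        for (int t = 0; t < n; t++) {
            int block = refs[t] / LINE;      /* line-aligned block number */
            int hit = -1;
            for (int e = 0; e < ENTRIES; e++)
                if (valid[e] && tag[e] == block) { hit = e; break; }

            int slot = hit;
            if (slot < 0) {                  /* miss: use an empty slot... */
                for (int e = 0; e < ENTRIES && slot < 0; e++)
                    if (!valid[e]) slot = e;
                if (slot < 0) {              /* ...else evict the LRU entry */
                    slot = 0;
                    for (int e = 1; e < ENTRIES; e++)
                        if (last[e] < last[slot]) slot = e;
                }
                tag[slot] = block;
                valid[slot] = 1;
            }
            last[slot] = t;                  /* update time of last reference */
            printf("%2d: %s\n", refs[t], hit >= 0 ? "hit" : "miss");
        }
        return 0;
    }

Running it answers the question: 04 hits, 24 misses (evicting 12-15, by then the least recently used entry), and the final 44 hits.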

An even simpler cache
Keeping track of when cache entries were last used (for LRU replacement) in a big cache takes lots of hardware and can be slow. In a direct-mapped cache, each memory location is assigned a single location in the cache, usually* by using a few bits of the address. We'll let bits 2 and 3 of the address (counting from LSB = bit 0) be the index.
* Some machines instead use a pseudo-random hash of the address.
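For this toy cache (4-byte lines, 4 entries), extracting the index is one shift and mask; a minimal sketch, with function names of my choosing:

    /* Bits 0-1 of a byte address are the offset within the 4-byte line;
       bits 2-3 select one of the 4 direct-mapped entries; the rest is the tag. */
    unsigned cache_index(unsigned addr) { return (addr >> 2) & 0x3; }
    unsigned cache_tag(unsigned addr)   { return addr >> 4; }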

Direct mapped cache in action
Sequence of memory references: 24, 20, 04, 12, 20, 44, 04, 24. (Remember: the index is bits 2-3 of the address.)
24 = 011000, so the index is 10; the word at 24-27 goes in entry 10.
20 = 010100, so the index is 01; 20-23 goes in entry 01.
04 = 000100, so the index is 01; 04-07 kicks 20-23 out of the cache.
12 = 001100, so the index is 11; 12-15 goes in entry 11.
Your turn: 20 = 010100, 44 = 101100, 04 = 000100, and then 24.
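Working the remaining references out (my arithmetic, following the slide's rules):
20 = 010100, index 01: miss (entry 01 holds 04-07); 20-23 replaces it.
44 = 101100, index 11: miss (entry 11 holds 12-15); 44-47 replaces it.
04 = 000100, index 01: miss (entry 01 now holds 20-23); 04-07 replaces it.
24 = 011000, index 10: hit; 24-27 has been in entry 10 since the first reference.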

A Better Cache Design
Direct-mapped caches are simpler: less hardware, and possibly faster. Fully associative caches usually have fewer misses. Set-associative caches try to get the best of both:
An index is computed from the address.
In a "k-way set associative cache", the index specifies a set of k cache locations where the data can be kept. k = 1 is direct mapped; k = cache size (in lines) is fully associative.
Use LRU replacement (or something else) within the set.
In a 2-way set-associative cache, each index names a set with two tag/data slots, so there are two places to look for data with index "0".
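A hypothetical lookup sketch for a tiny 2-way set-associative cache (my code and naming; the per-set "most recent" bit is how the later slides track LRU within a set):

    #define SETS 2                          /* toy size, as on the next slide */

    struct way { int valid; unsigned tag; int data; };
    struct set { struct way w[2]; int mru; };   /* mru: most recently used way */

    /* Returns 1 on a hit; on a miss the cache would refill way 1 - mru. */
    int lookup(struct set cache[SETS], unsigned addr, int *out) {
        unsigned idx = (addr / 4) % SETS;   /* 4-byte lines: bit 2 when SETS == 2 */
        unsigned tag = (addr / 4) / SETS;
        for (int i = 0; i < 2; i++)
            if (cache[idx].w[i].valid && cache[idx].w[i].tag == tag) {
                cache[idx].mru = i;         /* mark this way most recent */
                *out = cache[idx].w[i].data;
                return 1;
            }
        return 0;
    }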

2-way set associative cache in action
Sequence of memory references: 24, 20, 04, 12, 20, 44, 04, 24. (With two sets, the index is bit 2 of the address.)
24 = 011000; the index is 0, so 24-27 goes in set 0.
20 = 010100; the index is 1, so 20-23 goes in set 1.
04 = 000100; the index is 1, so 04-07 goes in the 2nd slot of the "1" set.
12 = 001100; the index is 1; 12-15 kicks the older item (20-23) out of the "1" set.
Your turn: 20 = 010100, 44 = 101100, 04 = 000100, and then 24.
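Again working the rest out (my arithmetic, using LRU within each set):
20 = 010100, set 1: miss (set 1 holds 04-07 and 12-15); 20-23 replaces the older entry, 04-07.
44 = 101100, set 1: miss; 44-47 replaces the older entry, 12-15.
04 = 000100, set 1: miss; 04-07 replaces the older entry, 20-23.
24 = 011000, set 0: hit; 24-27 has been in set 0 since the first reference.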

Cache Associativity
An observation (?): a 4-way set-associative cache has about the same hit rate as a direct-mapped cache of twice the size.

Longer Cache Blocks
Large cache blocks take advantage of spatial locality, and less tag space is needed (for a given cache capacity). But too large a block size can waste cache space, and large blocks require longer transfer times.
Good design requires compromise! (Each entry now holds a tag plus room for a big block of data.)

Larger block size in action
Sequence of memory references: 24, 20, 28, 12, 20, 08, 44, 04, ... (With 8-byte lines and two entries, the index is bit 3 of the address.)
24 = 011000; the index is 1, so the block at 24-31 goes in entry 1.
20 = 010100; the index is 0. (Notice that the line holds 16-23: a line starts at a multiple of its length.)
28 = 011100; the index is 1. A hit, even though we haven't referenced 28 before!
Your turn: 12 = 001100, 20 = 010100, 08 = 001000, and the rest.
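Working the "your turn" references (my arithmetic, following the slide's rules):
12 = 001100, index 1: miss (entry 1 holds 24-31); the block 8-15 replaces it.
20 = 010100, index 0: hit (entry 0 still holds 16-23).
08 = 001000, index 1: hit; 8-15 was brought in by the reference to 12, spatial locality at work.
44 = 101100, index 1: miss; 40-47 replaces 8-15.
04 = 000100, index 0: miss (entry 0 holds 16-23); 0-7 replaces it.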

Block Size and Miss Rate
Rule of thumb: the block size should be less than the square root of the cache size. For example, a 4 KB cache suggests blocks no larger than 64 bytes (since √4096 = 64).

Cache Parameters
Cache size = number of sets × block size × associativity.
128 blocks, 32-byte blocks, direct mapped: size = ?
128 KB cache, 64-byte blocks, 512 sets: associativity = ?
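Working the two exercises from the formula:
Direct mapped means one block per set, so the size is 128 sets × 32 bytes × 1 = 4096 bytes = 4 KB.
128 KB / 64-byte blocks = 2048 blocks; 2048 blocks / 512 sets = 4, so the cache is 4-way set associative.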

Details
What bits should we use for the index?
How do we know if a cache entry is empty?
Are stores and loads treated the same?
What if a word overlaps two cache lines?
How does this all work, anyway?

Choosing bits for the index
If the line length is n bytes, the low-order log2(n) bits of a byte address give the offset of the address within a line. The next group of bits is the index; this ensures that if the cache holds X bytes, then any block of X contiguous byte addresses can co-reside in the cache (provided the block starts on a cache-line boundary). The remaining bits are the tag.
Anatomy of an address: | tag | index | offset |
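A minimal C sketch of that decomposition (my naming; the sizes are chosen to match the 64 KB direct-mapped cache on the next slide):

    #include <stdint.h>

    #define LINE 32u     /* bytes per line: log2(32) = 5 offset bits   */
    #define SETS 2048u   /* number of sets: log2(2048) = 11 index bits */

    static uint32_t addr_offset(uint32_t a) { return a % LINE; }
    static uint32_t addr_index(uint32_t a)  { return (a / LINE) % SETS; }
    static uint32_t addr_tag(uint32_t a)    { return (a / LINE) / SETS; }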

Is a cache entry empty?
Problem: when a program starts up, the cache is empty, and it might contain stuff left from a previous user. How do you make sure you don't match an invalid tag?
Solution: an extra "valid" bit per cache line. The entire cache can be marked "invalid" on a context switch.

Putting it all together
A 64 KB cache, direct mapped, with 32-byte cache blocks. 64 KB / 32 bytes = 2 K cache blocks (sets), so the index is 11 bits; the low bits of the address are the word offset within the block, and the remaining high bits are the tag. Each entry stores a valid bit, a tag, and the data; hit/miss is determined by comparing the stored tag against the address's tag bits and checking the valid bit.
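A hypothetical C rendering of that lookup (mine, assuming 32-bit byte addresses, which makes the tag 32 - 11 - 5 = 16 bits):

    #include <stdint.h>

    #define BLOCK_WORDS 8                      /* 32 bytes = 8 words */
    #define NSETS 2048

    struct line {
        uint32_t valid;
        uint32_t tag;                          /* low 16 bits used */
        uint32_t data[BLOCK_WORDS];
    };

    /* Returns 1 and the word on a hit, 0 on a miss. */
    int read_word(struct line cache[NSETS], uint32_t addr, uint32_t *word) {
        uint32_t offset = (addr >> 2) & 0x7;   /* which word in the block */
        uint32_t index  = (addr >> 5) & 0x7FF; /* 11 index bits */
        uint32_t tag    = addr >> 16;          /* everything above the index */
        struct line *l = &cache[index];
        if (l->valid && l->tag == tag) {
            *word = l->data[offset];
            return 1;                          /* hit */
        }
        return 0;                              /* miss: fill from memory */
    }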

A set associative cache
A 32 KB cache, 2-way set associative, with 16-byte blocks. 32 KB / 16 bytes / 2 = 1 K cache sets, so the index is 10 bits and the tag is 18 bits (with the word offset in the low bits). Each way of a set has its own valid bit, tag, and data; the two stored tags are compared in parallel against the address's tag to determine hit/miss. (This picture doesn't show the "most recent" bit, one per set, needed for LRU replacement.)

Key Points
Caches give the illusion of a large, cheap memory with the access time of a fast, expensive memory.
Caches take advantage of memory locality, specifically temporal locality and spatial locality.
Cache design presents many options (block size, cache size, associativity) that an architect must combine to minimize miss rate and access time and thereby maximize performance.

Computer of the Day
Integrated circuits (ICs): a single chip holds transistors, resistors, and "wires". Invented in 1958 at Texas Instruments, and used in the "third generation" computers of the late 60's (1st generation = tubes, 2nd = transistors).
Some computers using IC technology...
Apollo guidance system (the first computer on the moon): ~5000 ICs, each with 3 transistors and 4 resistors.
Illiac IV, "the most infamous computer" (at that time): designed in the late 60's, built in the early 70's, and actually used. Plan: 1000 MFLOP/s. Reality: 15 MFLOP/s (200 MIPS). Cost: $31M. It was the first "massively parallel" computer: four groups of 64 processors, each group operating SIMD (Single Instruction, Multiple Data).