Presentation transcript:

1 Spatial locality
- Caches load multiple bytes per block to take advantage of spatial locality
- If the cache block size is 2^n bytes, conceptually split memory into 2^n-byte chunks:
  block address = byte address / 2^n = (address >> n)   (see the sketch below)
- [Figure: an m-bit address is split into an (m-k-n)-bit tag, a k-bit index, and an n-bit block offset]
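A minimal sketch of how these address fields can be extracted in software. The concrete geometry (32-bit addresses, 16-byte blocks, 32 sets) is an assumption chosen for illustration, not taken from the slide:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache geometry: 2^N_OFFSET-byte blocks, 2^K_INDEX sets. */
#define N_OFFSET 4   /* 16-byte blocks */
#define K_INDEX  5   /* 32 cache sets  */

int main(void) {
    uint32_t address = 0x12345678;

    uint32_t offset     = address & ((1u << N_OFFSET) - 1);    /* low n bits            */
    uint32_t block_addr = address >> N_OFFSET;                 /* address / 2^n         */
    uint32_t index      = block_addr & ((1u << K_INDEX) - 1);  /* next k bits           */
    uint32_t tag        = address >> (N_OFFSET + K_INDEX);     /* remaining m-k-n bits  */

    printf("offset=%u index=%u tag=0x%x block=0x%x\n", offset, index, tag, block_addr);
    return 0;
}
```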

2 How big is the cache?
Byte-addressable machine, 16-bit addresses, cache details:
- direct-mapped
- block size = one byte
- index = 5 least significant bits
Two questions (one worked answer follows below):
- How many blocks does the cache hold?
- How many bits of storage are required to build the cache (data plus all overhead, including tags, etc.)?
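A worked answer under the stated parameters (the arithmetic is not shown on the slide itself): the 5-bit index selects one of 2^5 = 32 blocks, so the cache holds 32 blocks. With one-byte blocks there are no offset bits, so the tag is 16 - 5 = 11 bits. Each block then needs 8 data bits + 11 tag bits + 1 valid bit = 20 bits, for a total of 32 × 20 = 640 bits of storage.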

3 Disadvantage of direct mapping
- The direct-mapped cache is easy: indices and offsets can be computed with bit operators or simple arithmetic, because each memory address belongs in exactly one block
- However, this isn't very flexible. If a program uses addresses 2, 6, 2, 6, 2, ..., then every access results in a cache miss and a load into cache block 2 (simulated below)
- This cache has four blocks, but direct mapping might not let us use all of them
- This can result in more misses than we might like
- [Figure: four-block direct-mapped cache, showing which memory addresses map to each index]
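A toy simulation of that access pattern, just to make the conflict behavior concrete. The four-block, one-byte-block cache mirrors the example above; the code itself is not from the slides:

```c
#include <stdio.h>
#include <stdbool.h>

/* Tiny illustrative model: 4-block direct-mapped cache, one-byte blocks. */
#define NUM_BLOCKS 4

int main(void) {
    int  tags[NUM_BLOCKS];
    bool valid[NUM_BLOCKS] = { false };
    int  trace[] = { 2, 6, 2, 6, 2, 6 };
    int  misses = 0;

    for (int i = 0; i < 6; i++) {
        int addr  = trace[i];
        int index = addr % NUM_BLOCKS;   /* both 2 and 6 map to index 2 */
        int tag   = addr / NUM_BLOCKS;

        if (!valid[index] || tags[index] != tag) {
            misses++;                    /* conflict miss: evict the other address */
            valid[index] = true;
            tags[index]  = tag;
        }
    }
    printf("6 accesses, %d misses\n", misses);   /* prints: 6 accesses, 6 misses */
    return 0;
}
```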

4 A fully associative cache
- A fully associative cache permits data to be stored in any cache block, instead of forcing each memory address into one particular block
  - when data is fetched from memory, it can be placed in any unused block of the cache
  - this way we'll never have a conflict between two or more memory addresses which map to a single cache block
- In the previous example, we might put memory address 2 in cache block 2 and address 6 in block 3. Then subsequent repeated accesses to 2 and 6 would all be hits instead of misses.
- If all the blocks are already in use, it's usually best to replace the least recently used one, on the assumption that if a block hasn't been used in a while, it won't be needed again anytime soon (see the LRU sketch below).
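A minimal sketch of a fully associative lookup with least-recently-used replacement, using a simple timestamp per block. The block count and the timestamp scheme are illustrative choices, not from the slides:

```c
#include <stdio.h>
#include <stdbool.h>

#define NUM_BLOCKS 4

typedef struct {
    bool          valid;
    int           tag;        /* the whole block address serves as the tag */
    unsigned long last_used;  /* timestamp for LRU */
} Block;

static Block cache[NUM_BLOCKS];
static unsigned long now = 0;

/* Returns true on a hit; on a miss, fills an invalid block or the LRU one. */
bool access_block(int block_addr) {
    now++;
    int victim = 0;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (cache[i].valid && cache[i].tag == block_addr) {
            cache[i].last_used = now;              /* hit: refresh recency */
            return true;
        }
        if (!cache[i].valid ||
            (cache[victim].valid && cache[i].last_used < cache[victim].last_used))
            victim = i;                            /* track empty or least recent block */
    }
    cache[victim].valid     = true;                /* miss: replace the chosen victim */
    cache[victim].tag       = block_addr;
    cache[victim].last_used = now;
    return false;
}

int main(void) {
    int trace[] = { 2, 6, 2, 6, 2, 6 };
    int misses = 0;
    for (int i = 0; i < 6; i++)
        if (!access_block(trace[i])) misses++;
    printf("misses = %d\n", misses);               /* prints: misses = 2 (only cold misses) */
    return 0;
}
```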

5 The price of full associativity
- However, a fully associative cache is expensive to implement.
  - Because there is no index field in the address anymore, the entire address must be used as the tag, increasing the total cache size.
  - Data could be anywhere in the cache, so we must check the tag of every cache block. That's a lot of comparators!
- [Figure: fully associative cache in which the 32-bit address is compared against the tag of every block in parallel to produce the hit signal]

6 Set associativity
- An intermediate possibility is a set-associative cache
  - The cache is divided into groups of blocks, called sets
  - Each memory address maps to exactly one set in the cache, but data may be placed in any block within that set
- If each set has 2^x blocks, the cache is a 2^x-way associative cache (a lookup sketch follows below)
- Here are several possible organizations of an eight-block cache:
  - 1-way associativity: 8 sets, 1 block each
  - 2-way associativity: 4 sets, 2 blocks each
  - 4-way associativity: 2 sets, 4 blocks each
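A small sketch of how a set-associative lookup splits the work: the index now selects a set, and the tag is compared against every block in that set. The 2-way, 4-set geometry matches one of the organizations above; the code and its one-byte blocks are illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS    4        /* 2-way associative, 8 blocks total */
#define WAYS        2
#define OFFSET_BITS 0        /* one-byte blocks, for simplicity   */

typedef struct {
    bool     valid;
    uint32_t tag;
} Line;

static Line cache[NUM_SETS][WAYS];

bool lookup(uint32_t address) {
    uint32_t block = address >> OFFSET_BITS;
    uint32_t set   = block % NUM_SETS;        /* index selects the set  */
    uint32_t tag   = block / NUM_SETS;        /* remaining bits are tag */

    for (int way = 0; way < WAYS; way++)      /* check every block in the set */
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;                      /* hit */
    return false;                             /* miss (replacement not shown) */
}
```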

7 Writing to a cache
- Writing to a cache raises several additional issues
- First, let's assume that the address we want to write to is already loaded in the cache. We'll assume a simple direct-mapped cache:
- If we write a new value to that address, we can store the new data in the cache and avoid an expensive main memory access [but the cache and memory are now inconsistent, which is a HUGE problem in multiprocessors]
- [Figure: direct-mapped cache before and after the write; the cached copy is updated but main memory still holds the old value]

8 Write-through caches
- A write-through cache solves the inconsistency problem by forcing all writes to update both the cache and the main memory (see the sketch below)
- This is simple to implement and keeps the cache and memory consistent
- Why is this not so good?
- [Figure: the write updates both the cache block and main memory]
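A sketch of the write-hit path under a write-through policy. The data structures and sizes are hypothetical, and the memory array is just a stand-in for DRAM; the point is that every write also goes to memory:

```c
#include <stdint.h>

#define BLOCK_SIZE 64
#define NUM_BLOCKS 1024

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} Line;

static Line    cache[NUM_BLOCKS];
static uint8_t main_memory[1 << 20];    /* stand-in for DRAM */

/* Write-through, assuming the address already hits in the cache. */
void write_through(uint32_t address, uint8_t value) {
    uint32_t block  = address / BLOCK_SIZE;
    uint32_t index  = block % NUM_BLOCKS;
    uint32_t offset = address % BLOCK_SIZE;

    cache[index].data[offset] = value;  /* update the cache ...            */
    main_memory[address]      = value;  /* ... and main memory, every time */
}
```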

9 Write-back caches
- In a write-back cache, the memory is not updated until the cache block needs to be replaced (e.g., when loading data into a full cache set)
- For example, we might write some data to the cache at first, leaving it inconsistent with main memory as shown before
  - The cache block is marked "dirty" to indicate this inconsistency
- Subsequent reads to the same memory address will be serviced by the cache, which contains the correct, updated data
- [Figure: the written block now carries a dirty bit alongside its valid bit, tag, and data]

10 Finishing the write back
- We don't need to store the new value back to main memory unless the cache block gets replaced (a code sketch follows below)
- E.g., on a read from Mem[ ], which maps to the same cache block, the modified cache contents will first be written to main memory
- Only then can the cache block be replaced with data from address 142
- [Figure: the dirty block is written back to memory before the new block is loaded in its place]
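A sketch of the write-back idea: the dirty bit is set on a write hit, and the old block is flushed only when it is evicted. Data structures and sizes are again illustrative assumptions, and addresses are assumed to fit the 1 MB stand-in memory:

```c
#include <stdint.h>

#define BLOCK_SIZE 64
#define NUM_BLOCKS 1024

typedef struct {
    int      valid;
    int      dirty;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} Line;

static Line    cache[NUM_BLOCKS];
static uint8_t main_memory[1 << 20];

/* Write hit: update only the cache and mark the block dirty. */
void write_back_hit(uint32_t address, uint8_t value) {
    uint32_t index = (address / BLOCK_SIZE) % NUM_BLOCKS;
    cache[index].data[address % BLOCK_SIZE] = value;
    cache[index].dirty = 1;                          /* memory is now stale */
}

/* Before replacing this block, flush it to memory if it is dirty. */
void evict(uint32_t index) {
    Line *line = &cache[index];
    if (line->valid && line->dirty) {
        uint32_t base = (line->tag * NUM_BLOCKS + index) * BLOCK_SIZE;
        for (uint32_t i = 0; i < BLOCK_SIZE; i++)
            main_memory[base + i] = line->data[i];   /* write the dirty block back */
        line->dirty = 0;
    }
    line->valid = 0;                                 /* block may now be refilled */
}
```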

11 Write misses
- A second scenario is if we try to write to an address that is not already contained in the cache; this is called a write miss
- Let's say we want to store into Mem[ ] but we find that address is not currently in the cache
- When we update Mem[ ], should we also load it into the cache?
- [Figure: direct-mapped cache in which the target address does not currently reside]

12 Write around caches (a.k.a. write-no-allocate)
- With a write-around policy, the write operation goes directly to main memory without affecting the cache
- This is good when data is written but not immediately used again, in which case there's no point in loading it into the cache yet:

  for (int i = 0; i < SIZE; i++)
      a[i] = i;

- [Figure: the write updates main memory only; the cache is left untouched]

13 Allocate on write
- An allocate-on-write strategy would instead load the newly written data into the cache (both policies are sketched below)
- If that data is needed again soon, it will be available in the cache
- [Figure: the cache block for address 214 is filled with the newly written value 21763]
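A side-by-side sketch of the two write-miss policies just described. The fill_block_from_memory helper, the geometry, and the write-through handling of the allocate case are assumptions for illustration:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64
#define NUM_BLOCKS 1024

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} Line;

static Line    cache[NUM_BLOCKS];
static uint8_t main_memory[1 << 20];

/* Hypothetical helper: copy the whole block containing 'address' into a cache line. */
static void fill_block_from_memory(Line *line, uint32_t address) {
    uint32_t base = address - (address % BLOCK_SIZE);
    memcpy(line->data, &main_memory[base], BLOCK_SIZE);
    line->tag   = (address / BLOCK_SIZE) / NUM_BLOCKS;
    line->valid = 1;
}

/* Write miss, write-no-allocate: update memory only, leave the cache untouched. */
void store_no_allocate(uint32_t address, uint8_t value) {
    main_memory[address] = value;
}

/* Write miss, write-allocate: bring the block into the cache, then write. */
void store_allocate(uint32_t address, uint8_t value) {
    uint32_t index = (address / BLOCK_SIZE) % NUM_BLOCKS;
    fill_block_from_memory(&cache[index], address);
    cache[index].data[address % BLOCK_SIZE] = value;
    main_memory[address] = value;
}
```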

14 Which is it?
- Given the following trace of accesses, can you determine whether the cache is write-allocate or write-no-allocate?
  - Assume A and B are distinct, and can be in the cache simultaneously.
- Trace: Load A, Store B, Store A, Load A, Load B, Load A
  - The Load B is a miss; on a write-allocate cache it would be a hit, because the earlier Store B would have brought B into the cache
- Answer: write-no-allocate

15 Memory System Performance
- To examine the performance of a memory system, we need to focus on a few important factors.
  - How long does it take to send data from the cache to the CPU?
  - How long does it take to copy data from memory into the cache?
  - How often do we have to access main memory?
- There are names for all of these variables.
  - The hit time is how long it takes data to be sent from the cache to the processor. This is usually fast, on the order of 1-3 clock cycles.
  - The miss penalty is the time to copy data from main memory to the cache. This often requires dozens of clock cycles (at least).
  - The miss rate is the percentage of misses.
- [Figure: CPU, a little static RAM (the cache), and lots of dynamic RAM]

16 Average memory access time
- The average memory access time, or AMAT, can then be computed as:
  AMAT = Hit time + (Miss rate × Miss penalty)
  This is just averaging the amount of time for cache hits and the amount of time for cache misses
- How can we improve the average memory access time of a system?
  - Obviously, a lower AMAT is better
  - Miss penalties are usually much greater than hit times, so the best way to lower AMAT is to reduce the miss penalty or the miss rate
- However, AMAT should only be used as a general guideline. Remember that execution time is still the best performance metric.

17 Performance example
- Assume that 33% of the instructions in a program are data accesses. The cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles (worked out below):
- To make AMAT smaller, we can decrease the miss rate
  - e.g., make the cache larger, add more associativity
  - but larger/more complex caches mean a longer hit time!
- Alternate approach: decrease the miss penalty
  - BIG idea: a big, slow cache is still faster than RAM!
- Modern processors have at least two cache levels
  - too many levels introduce other problems (keeping data consistent, communicating across levels)
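Working the slide's numbers through the AMAT formula from the previous slide (assuming the 3% miss rate and 20-cycle penalty apply to every memory access):

AMAT = 1 + (0.03 × 20) = 1 + 0.6 = 1.6 cycles per memory access

The 33% data-access figure comes into play when converting AMAT into stall cycles per instruction, since each instruction makes one instruction fetch plus, on average, 0.33 data accesses through the cache.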

18 Opteron Vital Statistics
- L1 Caches: Instruction & Data
  - 64 KB
  - 64-byte blocks
  - 2-way set associative
  - 2-cycle access time
- L2 Cache:
  - 1 MB
  - 64-byte blocks
  - 4-way set associative
  - 16-cycle access time (total, not just miss penalty)
- Memory
  - 200+ cycle access time (a two-level AMAT estimate follows below)
- [Figure: CPU connected to the L1 cache, then the L2 cache, then main memory]
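To see why "a big, slow cache is still faster than RAM", here is a two-level AMAT estimate using the Opteron latencies above with made-up miss rates (the 5% L1 and 20% L2 local miss rates are assumptions for illustration, not Opteron measurements):

AMAT = L1 time + L1 miss rate × (L2 time + L2 miss rate × memory time)
     = 2 + 0.05 × (16 + 0.20 × 200)
     = 2 + 0.05 × 56
     = 4.8 cycles

Without the L2, the same L1 misses would pay the full 200-cycle memory latency: 2 + 0.05 × 200 = 12 cycles.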

19 Associativity tradeoffs and miss rates
- Higher associativity means more complex hardware
- But a highly associative cache will also exhibit a lower miss rate
  - Each set has more blocks, so there's less chance of a conflict between two addresses which both belong in the same set
  - Overall, this will reduce AMAT and memory stall cycles
- The figure from the textbook shows the miss rates decreasing as the associativity increases
- [Figure: miss rate (0%-12%) vs. associativity (one-way through eight-way)]

20 Cache size and miss rates
- The cache size also has a significant impact on performance
  - The larger a cache is, the less chance there will be of a conflict
  - Again this means the miss rate decreases, so the AMAT and number of memory stall cycles also decrease
- [Figure: miss rate (0%-15%) as a function of associativity (one-way through eight-way) for 1 KB, 2 KB, 4 KB, and 8 KB caches]

21 Block size and miss rates
- Finally, miss rates relative to the block size and overall cache size
  - Smaller blocks do not take maximum advantage of spatial locality
  - But if blocks are too large, there will be fewer blocks available, and more potential misses due to conflicts
- [Figure: miss rate (0%-35%) vs. block size in bytes, for 1 KB, 8 KB, 16 KB, and 64 KB caches]