1 Associativity tradeoffs and miss rates
• Higher associativity means more complex hardware
• But a highly-associative cache will also exhibit a lower miss rate
  — Each set has more blocks, so there's less chance of a conflict between two addresses which both belong in the same set
• The figure from the textbook shows the miss rates decreasing as the associativity increases
  [Figure: miss rate (0% to 12%) versus associativity, for one-way, two-way, four-way, and eight-way caches]
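
To make the "more blocks per set" point concrete, here is a minimal sketch assuming a hypothetical 8 KB cache with 16-byte blocks (these parameters and the helper name set_index are our own, not from the slides). It shows two addresses that collide in a direct-mapped cache but can coexist once each set holds two blocks:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical parameters (not from the slides): 8 KB cache, 16-byte blocks. */
#define CACHE_SIZE  (8 * 1024)
#define BLOCK_SIZE  16

/* Set index = (address / block size) mod number of sets.
 * Doubling the associativity halves the number of sets,
 * but each set can now hold twice as many blocks.          */
static unsigned set_index(uint32_t addr, unsigned ways) {
    unsigned num_sets = CACHE_SIZE / (BLOCK_SIZE * ways);
    return (addr / BLOCK_SIZE) % num_sets;
}

int main(void) {
    uint32_t a = 0x1000, b = 0x3000;   /* two addresses 8 KB apart */

    /* Direct-mapped: a and b map to the same one-block set, so accessing
     * them alternately evicts the other block every time (conflict misses). */
    printf("1-way: a -> set %u, b -> set %u\n", set_index(a, 1), set_index(b, 1));

    /* 2-way set-associative: they still share a set index, but the set now
     * holds two blocks, so both can stay resident and the conflict disappears. */
    printf("2-way: a -> set %u, b -> set %u\n", set_index(a, 2), set_index(b, 2));
    return 0;
}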

2 Cache size and miss rates
• The cache size also has a significant impact on performance
  — The larger a cache is, the less chance there will be of a conflict
  — Again this means the miss rate decreases, but the hit time increases
• Miss rate as a function of both the cache size and its associativity
  [Figure: miss rate (0% to 15%) versus associativity (one-way to eight-way), with separate curves for 1 KB, 2 KB, 4 KB, and 8 KB caches]

3 Block size and miss rates
• Finally, miss rates relative to the block size and overall cache size
  — Smaller blocks do not take maximum advantage of spatial locality
  — But if blocks are too large, there will be fewer blocks available, and more potential misses due to conflicts
  [Figure: miss rate versus block size in bytes, for 1 KB, 8 KB, 16 KB, and 64 KB caches]

4 Performance example
• Assume that 33% of the instructions in a program are data accesses, the cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles

  Memory stall cycles = Memory accesses x Miss rate x Miss penalty
                      = 0.33 N x 0.03 x 20 cycles
                      ≈ 0.2 N cycles

• If N instructions are executed, then the number of wasted cycles will be 0.2 x N
• This code is 1.2 times slower than a program with a “perfect” CPI of 1!
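
A quick way to sanity-check this arithmetic is to plug the slide's numbers straight into the stall-cycle formula; the short sketch below does exactly that, per instruction (the variable names are our own):

#include <stdio.h>

int main(void) {
    double data_accesses_per_instr = 0.33;  /* 33% of instructions access data */
    double miss_rate    = 0.03;             /* 97% hit ratio                   */
    double miss_penalty = 20.0;             /* cycles                          */
    double base_cpi     = 1.0;              /* "perfect" CPI with 1-cycle hits */

    /* Memory stall cycles per instruction = accesses x miss rate x penalty */
    double stalls_per_instr = data_accesses_per_instr * miss_rate * miss_penalty;
    double cpi = base_cpi + stalls_per_instr;

    printf("Stall cycles per instruction: %.3f\n", stalls_per_instr);  /* 0.198 */
    printf("Effective CPI: %.3f (%.2fx slower than CPI = 1)\n",
           cpi, cpi / base_cpi);                                        /* ~1.2  */
    return 0;
}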

5 Memory systems are a bottleneck

  CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

• Processor performance traditionally outpaces memory performance, so the memory system is often the system bottleneck
• For example, with a base CPI of 1, the CPU time from the last page is:

  CPU time = (1.0 I + 0.2 I) x Cycle time

• What if we could double the CPU performance so the CPI becomes 0.5, but memory performance remained the same?

  CPU time = (0.5 I + 0.2 I) x Cycle time

• The overall CPU time improves by just 1.2/0.7 = 1.7 times!
• Amdahl's Law again:
  — Speeding up only part of a system has diminishing returns
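
The diminishing return is easy to see numerically; a minimal sketch (our own variable names, reusing the 0.2 stall cycles per instruction from the previous slide):

#include <stdio.h>

/* Halving the execution portion of CPI helps much less than 2x overall,
 * because the memory stall cycles per instruction are unaffected.        */
int main(void) {
    double stalls_per_instr = 0.2;                 /* memory stalls, unchanged */
    double cpi_before = 1.0 + stalls_per_instr;    /* = 1.2                    */
    double cpi_after  = 0.5 + stalls_per_instr;    /* = 0.7                    */

    printf("Speedup = %.2f (not 2.0)\n", cpi_before / cpi_after);  /* ~1.71    */
    return 0;
}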

6 Basic main memory design
• Let's assume the following three steps are taken when a cache needs to load data from the main memory:
  1. It takes 1 cycle to send an address to the RAM
  2. There is a 15-cycle latency for each RAM access
  3. It takes 1 cycle to return data from the RAM
• In the setup shown here, the buses from the CPU to the cache and from the cache to RAM are all one word wide
• If the cache has one-word blocks, then filling a block from RAM (i.e., the miss penalty) would take 17 cycles:

  1 + 15 + 1 = 17 clock cycles

• The cache controller has to send the desired address to the RAM, wait and receive the data
  [Diagram: CPU, cache, and main memory connected by one-word-wide buses]

7 Miss penalties for larger cache blocks
• If the cache has four-word blocks, then loading a single block would need four individual main memory accesses, and a miss penalty of 68 cycles!

  4 x (1 + 15 + 1) = 68 clock cycles

  [Diagram: CPU, cache, and main memory, with four sequential one-word transfers per block]

8 A wider memory
• A simple way to decrease the miss penalty is to widen the memory and its interface to the cache, so we can read multiple words from RAM in one shot
• If we could read four words from the memory at once, a four-word cache load would need just 17 cycles:

  1 + 15 + 1 = 17 cycles

• The disadvantage is the cost of the wider buses—each additional bit of memory width requires another connection to the cache
  [Diagram: CPU, cache, and main memory, with a four-word-wide bus between the cache and memory]

9 An interleaved memory
• Another approach is to interleave the memory, or split it into “banks” that can be accessed individually
• The main benefit is overlapping the latencies of accessing each word
• For example, if our main memory has four banks, each one byte wide, then we could load four bytes into a cache block in just 20 cycles:

  1 + 15 + (4 x 1) = 20 cycles

• Our buses are still one byte wide here, so four cycles are needed to transfer data to the caches
• This is cheaper than implementing a four-byte bus, but not too much slower
  [Diagram: CPU and cache connected to a main memory split into Bank 0, Bank 1, Bank 2, and Bank 3]
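
To tie the last three slides together, here is a small sketch that computes the miss penalty of a four-word block fill under each memory organization. The helper names are our own; the timing parameters are the 1 / 15 / 1 cycles assumed above:

#include <stdio.h>

/* Timing assumptions from the slides: 1 cycle to send an address,
 * 15 cycles of RAM latency, 1 cycle per bus transfer of data.       */
#define ADDR_CYCLES     1
#define RAM_LATENCY    15
#define XFER_CYCLES     1

/* Narrow (one-word) memory: each of the word accesses is fully serial. */
static int narrow_penalty(int words) {
    return words * (ADDR_CYCLES + RAM_LATENCY + XFER_CYCLES);
}

/* Wide memory: all words come back in a single access over a wide bus. */
static int wide_penalty(void) {
    return ADDR_CYCLES + RAM_LATENCY + XFER_CYCLES;
}

/* Interleaved memory: the bank latencies overlap, but the narrow bus
 * still needs one transfer cycle per word.                             */
static int interleaved_penalty(int words) {
    return ADDR_CYCLES + RAM_LATENCY + words * XFER_CYCLES;
}

int main(void) {
    printf("Four-word block fill:\n");
    printf("  narrow memory:      %d cycles\n", narrow_penalty(4));      /* 68 */
    printf("  wide memory:        %d cycles\n", wide_penalty());         /* 17 */
    printf("  interleaved memory: %d cycles\n", interleaved_penalty(4)); /* 20 */
    return 0;
}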

10 Interleaved memory accesses
• Here is a diagram to show how the memory accesses can be interleaved
  — The magenta cycles represent sending an address to a memory bank
  — Each memory bank has a 15-cycle latency, and it takes another cycle (shown in blue) to return data from the memory
• This is the same basic idea as pipelining!
  — As soon as we request data from one memory bank, we can go ahead and request data from another bank as well
  — Each individual load takes 17 clock cycles, but four overlapped loads require just 20 cycles
  [Diagram: a clock-cycle timeline showing "Load word 1" through "Load word 4" staggered so their 15-cycle bank latencies overlap]

11 Summary
• Writing to a cache poses a couple of interesting issues
  — Write-through and write-back policies keep the cache consistent with main memory in different ways for write hits
  — Write-around and allocate-on-write are two strategies to handle write misses, differing in whether updated data is loaded into the cache
• Memory system performance depends upon the cache hit time, miss rate and miss penalty, as well as the actual program being executed
  — We can use these numbers to find the average memory access time
  — We can also revise our CPU time formula to include stall cycles

  AMAT = Hit time + (Miss rate x Miss penalty)
  Memory stall cycles = Memory accesses x miss rate x miss penalty
  CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time

• The organization of a memory system affects its performance
  — The cache size, block size, and associativity affect the miss rate
  — We can organize the main memory to help reduce miss penalties. For example, interleaved memory supports pipelined data accesses.
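
A short sketch applying the summary formulas; the numbers reuse the earlier example (1-cycle hits, 3% miss rate, 20-cycle miss penalty, 0.33 data accesses per instruction), and the 1 ns clock period and 10^9 instruction count are made up purely for illustration:

#include <stdio.h>

int main(void) {
    double hit_time = 1.0, miss_rate = 0.03, miss_penalty = 20.0;  /* cycles */
    double accesses_per_instr = 0.33;
    double base_cpi = 1.0, cycle_time_ns = 1.0;   /* assumed 1 ns clock      */
    double instructions = 1e9;                    /* assumed program length  */

    /* AMAT = Hit time + (Miss rate x Miss penalty) */
    double amat = hit_time + miss_rate * miss_penalty;

    /* Memory stall cycles = Memory accesses x miss rate x miss penalty */
    double stall_cycles = instructions * accesses_per_instr * miss_rate * miss_penalty;

    /* CPU time = (CPU execution cycles + Memory stall cycles) x Cycle time */
    double cpu_time_ns = (instructions * base_cpi + stall_cycles) * cycle_time_ns;

    printf("AMAT         = %.2f cycles\n", amat);            /* 1.60         */
    printf("Stall cycles = %.3g\n", stall_cycles);           /* 1.98e+08     */
    printf("CPU time     = %.3f s\n", cpu_time_ns * 1e-9);   /* ~1.198 s     */
    return 0;
}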

12 Inferring the cache structure
• Consider the following program:

  char a[LENGTH];
  int sum = 0;
  for (int i = 0; i < 10000; ++i)
      for (int j = 0; j < LENGTH; j += STEP)
          sum += a[j];

• Key idea: compute the average time it takes to execute sum += a[j]
• What is the cache size (data)? What is the block size? What is the associativity?
  [Plot: average time per access (in ns) as LENGTH and STEP vary]
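
A runnable version of this experiment might look like the sketch below. The timing helper, the sweep ranges, and the choice to hold the number of accesses per data point roughly constant are our own, not from the slide; the idea is the same: sweep LENGTH and STEP, and the points where the average access time jumps reveal the cache size, block size, and associativity.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average time per "sum += a[j]" for a given array length and stride, in the
 * spirit of the slide's loop.  When 'length' exceeds the cache size the
 * average time jumps; when 'step' reaches the block size, every access
 * touches a new block, so misses stop being amortized across a block.        */
static double avg_access_ns(volatile char *a, long length, long step) {
    long per_pass = (length + step - 1) / step;   /* accesses per inner sweep */
    long iters = (1L << 24) / per_pass;           /* ~16M accesses per point  */
    if (iters < 1) iters = 1;
    long accesses = iters * per_pass;
    long sum = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; ++i)
        for (long j = 0; j < length; j += step)
            sum += a[j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    if (sum == 42) fprintf(stderr, "unlikely\n");  /* keep 'sum' observable   */
    return ns / (double)accesses;
}

int main(void) {
    long max_len = 4 * 1024 * 1024;                /* sweep up to 4 MB (arbitrary) */
    volatile char *a = malloc(max_len);
    if (!a) return 1;
    for (long i = 0; i < max_len; ++i) a[i] = (char)i;

    printf("%10s %6s %12s\n", "LENGTH", "STEP", "avg ns");
    for (long length = 4 * 1024; length <= max_len; length *= 2)
        for (long step = 16; step <= 256; step *= 2)
            printf("%10ld %6ld %12.3f\n", length, step,
                   avg_access_ns(a, length, step));

    free((void *)a);
    return 0;
}

Reading the output: the average time rises sharply once LENGTH exceeds the data cache size, and for arrays larger than the cache it stops growing with STEP once STEP reaches the block size, since every access then misses regardless of stride.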