CS 704 Advanced Computer Architecture


CS 704 Advanced Computer Architecture Lecture 28: Memory Hierarchy Design (Cache Design and Policies) Prof. Dr. M. Ashraf Chughtai. Welcome to the 28th lecture in the series of lectures on Advanced Computer Architecture.

Today's Topics: Recap: Cache Addressing Techniques; Placement and Replacement Policies; Cache Write Strategy; Cache Performance Enhancement; Summary. MAC/VU-Advanced Computer Architecture Lecture 28 Memory Hierarchy (4)

Recap: Block Size Trade-off. Last time we discussed the impact of block size on cache performance and on the categories of cache design: the trade-off of block size versus miss rate, miss penalty, and average access time, the basic CPU performance metrics.

Recap: Block Size Trade-off. A larger block size reduces the miss rate, but if the block size is too big relative to the cache size, the miss rate goes back up; and the miss penalty grows as the block size increases. Combining these two parameters gives the third parameter, the average access time.

Recap: Cache Organizations. For the block placement policy, we studied three cache organizations.

Recap: Cache Organizations. Direct Mapped, where each block has only one place it can appear in the cache (a source of conflict misses); Fully Associative, where any block of main memory can be placed anywhere in the cache; and Set Associative, which allows a block to be placed anywhere within one set of places in the cache.

Memory Hierarchy Designer's Concerns. Block placement: Where can a block be placed in the upper level? Block identification: How is a block found if it is in the upper level? Block replacement: Which block should be replaced on a miss? Write strategy: What happens on a write?

Block Placement Policy. Fully Associative: a block can be placed anywhere in the upper level (cache). E.g., block 12 from main memory can be placed at block 2, block 6, or any of the 8 block locations in the cache.

Block Placement Policy. Set Associative: a block can be placed anywhere within one set in the upper level (cache). The set number in the upper level is given by: (block number) MOD (number of sets). E.g., an 8-block, 2-way set associative cache has 4 sets [0-3] of two blocks each; therefore block 12 or block 16 of main memory can go anywhere in set 0, as 12 MOD 4 = 0 and 16 MOD 4 = 0. Similarly, block 14 can be placed at either of the 2 locations in set 2, as 14 MOD 4 = 2.

Block Placement Policy. Direct Mapped (1-way associative): a block can be placed at only one specific location in the upper level (cache), given by: (block number) MOD (number of cache blocks). E.g., block 12 or block 20 can only be placed at location 4 in a cache of 8 blocks, as 12 MOD 8 = 4 and 20 MOD 8 = 4.
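All three placement policies reduce to the same MOD computation with different set counts. A minimal sketch in Python (the function name and the set-major frame numbering are illustrative, not from the lecture):

```python
def candidate_frames(block, num_frames, ways):
    """Return the cache frame numbers where `block` may be placed.

    ways = 1          -> direct mapped (exactly one frame)
    ways = num_frames -> fully associative (any frame)
    otherwise         -> set associative (the `ways` frames of one set)
    """
    num_sets = num_frames // ways
    s = block % num_sets                       # (block no) MOD (number of sets)
    return [s * ways + w for w in range(ways)]

# 8-frame direct mapped cache: blocks 12 and 20 both map to frame 4
print(candidate_frames(12, 8, 1))              # [4]  (12 MOD 8 = 4)
print(candidate_frames(20, 8, 1))              # [4]  (20 MOD 8 = 4)
# 2-way set associative: blocks 12 and 16 share set 0; block 14 maps to set 2
print(candidate_frames(12, 8, 2))              # [0, 1]
print(candidate_frames(14, 8, 2))              # [4, 5]
# fully associative: block 12 may go into any of the 8 frames
print(candidate_frames(12, 8, 8))              # [0, 1, 2, 3, 4, 5, 6, 7]
```

Note that direct mapped is simply the degenerate case where each set holds one block, which is why the slides call it "1-way associative".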

Block Identification. How is a block found if it is in the upper level? A TAG is associated with each block frame; the tag gives the block address. All tags of the locations where the block may be placed are checked in parallel. A valid bit identifies whether the block frame contains correct data. There is no need to include the index or the block offset in the comparison.

Block Identification: Direct Mapped. [Figure: lower-level (main) memory of 4 GB, i.e., a 32-bit address, split into a 22-bit cache tag (e.g., 0x00), a 5-bit cache index, and a 5-bit byte select; each cache entry holds a valid bit, the tag, and a 32-byte data block (Byte 0 ... Byte 31).]

Block Identification. [Figure: the same 32-bit address split for a 2-way set associative cache: a 23-bit cache tag, a 4-bit cache index, and a 5-bit byte select; the tags of the two ways of each set (each way holding 32-byte blocks, Byte 0 ... Byte 63) are compared in parallel.]
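The field widths in the two figures follow directly from the geometry: 32-byte blocks need a 5-bit byte select, 32 direct-mapped entries need a 5-bit index, and the remaining 22 bits form the tag; halving the number of sets for the 2-way case drops the index to 4 bits and grows the tag to 23. A sketch of the decomposition (field widths are parameters, since they change with the organization):

```python
def split_address(addr, index_bits, offset_bits=5):
    """Split a 32-bit address into (tag, index, byte_select) fields."""
    byte_select = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, byte_select

# Direct mapped figure: 22-bit tag, 5-bit index, 5-bit byte select.
# Build an address with tag 42, index 3, byte 7 and decompose it again:
addr = (42 << 10) | (3 << 5) | 7
print(split_address(addr, index_bits=5))   # (42, 3, 7)
# 2-way figure: 23-bit tag, 4-bit index, 5-bit byte select
print(split_address(addr, index_bits=4))
```

Only the tag is stored and compared; the index selects the set, so storing it would be redundant, which is the point made on the previous slide.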

Block Replacement Policy. On a cache miss, a new block must be brought in. If the block locations defined by the block placement policy are all filled, then an existing block has to be evicted, based on the cache mapping and some block replacement policy.

Block Replacement Policy. For a direct mapped cache, block replacement is very simple, as a block can be placed at only one location, given by: (block number) MOD (number of cache blocks). For fully and set associative mapping, there are three commonly used schemes. These policies are:

Block Replacement Policy. Random: replace any block. It is the simplest and easiest to implement; the candidates for replacement are randomly selected. Some designers use pseudo-random block numbers.

Block Replacement Policy. Least Recently Used (LRU): replace the block that has either never been used or was used longest ago. This reduces the chance of throwing out information that may be needed soon. Here, the times at which each block is accessed are recorded, and the block replaced is the one that has not been used for the longest time. E.g., if the blocks are accessed in the sequence 0, 2, 3, 0, 4, 3, 0, 1, 8, 0, the victim to replace is block 2.

Block Replacement Policy. First-in, First-out (FIFO): the block placed in the cache first is thrown out first. E.g., if the blocks are accessed in the sequence 2, 3, 4, 5, 3, 4, then to bring a new block into the cache, block 2 will be thrown out, as it is the oldest block in the sequence. FIFO is used as an approximation to LRU, since true LRU can be complicated to compute.
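Both victim-selection rules from the two slides above can be checked in a few lines. This sketch assumes every referenced block is resident and asks only which block each policy would evict next:

```python
def lru_victim(accesses):
    # LRU: evict the block whose most recent use is furthest in the past
    last_use = {}
    for t, blk in enumerate(accesses):
        last_use[blk] = t                  # remember latest access time
    return min(last_use, key=last_use.get)

def fifo_victim(accesses):
    # FIFO: evict the block that entered the cache earliest
    arrival_order = []
    for blk in accesses:
        if blk not in arrival_order:
            arrival_order.append(blk)
    return arrival_order[0]

print(lru_victim([0, 2, 3, 0, 4, 3, 0, 1, 8, 0]))  # 2, as in the LRU slide
print(fifo_victim([2, 3, 4, 5, 3, 4]))             # 2, as in the FIFO slide
```

The two policies happen to agree on these sequences; they diverge when an old block is re-accessed late, which FIFO ignores but LRU rewards.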

Block Replacement Policy: Conclusion. Data-cache miss rates for LRU versus random replacement:

Size      2-way LRU  2-way Random  4-way LRU  4-way Random  8-way LRU  8-way Random
16 KB     5.2%       5.7%          4.7%       5.3%          4.4%       5.0%
64 KB     1.9%       2.0%          1.5%       1.7%          1.4%       1.5%
256 KB    1.15%      1.17%         1.13%      1.13%         1.12%      1.12%

Write Strategy. A cache block must not be overwritten unless main memory is up to date: multiple CPUs may have individual caches, and I/O may address main memory directly. Memory is accessed for both read and write purposes.

Write Strategy (cont'd). Instruction cache accesses are all reads, and instruction fetch dominates cache traffic, so writes are typically only about 10% of cache accesses. Furthermore, only 10%-20% of the data cache's accesses are writes.

Write Strategy (cont'd). To optimize cache performance, following Amdahl's law, we make the common case fast. Fortunately the common case, the cache read, is easy to make fast: reads can be optimized by performing the tag check and the data transfer in parallel. Thus read performance is good.

Write Strategy (cont'd). In the case of a cache write, however, modification of the cache contents cannot begin until the tag has been checked for an address hit; therefore the write cannot proceed in parallel with the tag check. Another complication is that the processor specifies the size of the write, which is usually only a portion of the block. Therefore writes need special consideration.

Write Strategy (cont'd). Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced. Write through: the information is written both to the block in the cache and to the block in the lower-level memory.

Write Strategy: Pros and Cons of Each. Write Back: no writes to the lower level for repeated writes to the same cache block. A dirty bit is commonly used to indicate whether the cache block is modified (dirty) or unmodified (clean). This reduces memory bandwidth requirements and hence memory power requirements.

Write Strategy: Pros and Cons of Each. Write Through: simplifies the replacement procedure, since the block is always clean; unlike the write-back strategy, read misses cannot result in writes to the lower level. It is almost always combined with write buffers so the CPU does not wait for the lower-level memory. It also simplifies data coherency, as the next lower level always has the most recent copy (we will discuss this later).

Write Buffer for Write Through. [Figure: the processor writes into the cache and into a write buffer; the write buffer drains into DRAM.]

Write Buffer for Write Through. The write buffer is just a FIFO; a typical number of entries is 4. Once the data is written into the write buffer (and assuming a cache hit), the CPU is done with the write; the memory controller then moves the write buffer's contents to the real memory behind the scenes.

Write Buffer for Write Through. The DRAM cycle time sets the upper limit on how frequently you can write to main memory. The write buffer works as long as the frequency of stores over time is not too high, i.e., store frequency << 1 / DRAM write cycle time.

Write Buffer for Write Through. If the stores come too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait. A memory system designer's nightmare is when the store frequency approaches 1 / DRAM write cycle time, i.e., when the time between stores drops to the DRAM write cycle time or below. We call this write buffer saturation.

Write Buffer Saturation. In that case, it does NOT matter how big you make the write buffer: it will still overflow, because you are feeding entries in faster than you can drain them. There are two solutions to this problem. The first solution is to get rid of the write buffer and replace the write-through cache with a write-back cache.
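The saturation condition is simple rate arithmetic: the buffer only postpones overflow while the average store rate stays below the DRAM drain rate. A hedged sketch as a steady-state approximation (ignoring burstiness; the function name and example numbers are illustrative):

```python
def cycles_until_overflow(store_interval, dram_write_cycle, depth=4):
    """Return the number of cycles until a `depth`-entry write buffer
    overflows, or None if it never saturates (stores drain fast enough)."""
    fill_rate = 1.0 / store_interval       # stores issued per cycle
    drain_rate = 1.0 / dram_write_cycle    # stores retired per cycle
    net = fill_rate - drain_rate
    if net <= 0:
        return None                        # store frequency <= drain rate
    return depth / net                     # entries pile up at `net` per cycle

print(cycles_until_overflow(10, 5))   # None: DRAM keeps up with the stores
print(cycles_until_overflow(2, 10))   # 10.0: saturation, the CPU must stall
```

The second call shows why buffer depth barely helps once the rates cross: doubling the depth from 4 to 8 only doubles the time to overflow, it does not prevent it.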

Write Buffer Saturation. [Figure: the second solution, inserting a second-level (L2) cache between the write buffer and main memory: Processor, Cache, Write Buffer, L2, DRAM.]

Write-Miss Policy. In the case of a write miss, two options are used. These options are: Write allocate: a block is allocated on a write miss, followed by the write-hit actions. No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.

Write-Miss Policy. With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will be in the cache. Let us discuss this with the help of an example: our 1KB direct mapped cache again.

Write-Miss Policy. Assume we do a 16-bit write to memory location 0x000000, which causes a cache miss in our 1KB direct mapped cache with 32-byte blocks. [Figure: the 1KB direct mapped cache, with a valid bit, a cache tag (e.g., 0x00), a cache index, a byte select, and 32-byte data blocks (Byte 0 ... Byte 1023).]

Write-Miss Policy. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ..., Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second: is it really necessary to bring in the rest of the block on a write miss?

Write-Miss Policy. True, the principle of spatial locality implies that we are likely to access those bytes soon. But the type of access we are going to do is likely to be another write.

Write-Miss Policy. So even if we do read in the data, we may end up overwriting it anyway; hence it is common practice NOT to read in the rest of the block on a write miss. If you don't bring in the rest of the block, or to use the more technical term, no-write allocate, you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.

No-Write Allocate versus Write Allocate: Example. Consider a fully associative write-back cache whose cache entries start empty. Consider the following sequence of five memory operations, and find the number of hits and misses when using no-write allocate versus write allocate.

No-Write Allocate versus Write Allocate: Example. Write Mem[100]; Write Mem[100]; Read Mem[200]; Write Mem[200]; Write Mem[100]. For no-write allocate, address [100] is not in the cache (i.e., its tag is not in the cache).

No-Write Allocate versus Write Allocate: Example. So the first two writes result in MISSES. Address [200] is also not in the cache, so the read is also a miss (and the block is brought in). The subsequent write to [200] is a hit. The last write to [100] is still a miss. The result is 4 MISSes and 1 HIT.

No-Write Allocate versus Write Allocate: Example. For the write-allocate policy, the first accesses to [100] and [200] are MISSES, and the rest are HITS, as [100] and [200] are then both found in the cache. The result is 2 MISSes and 3 HITs.
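The example can be replayed mechanically; the only difference between the two policies is whether a write miss allocates the block. A sketch matching the fully associative, initially empty cache of the example (function and variable names are illustrative):

```python
def run(ops, write_allocate):
    """Return (misses, hits) for a fully associative cache starting empty."""
    resident = set()
    misses = hits = 0
    for op, addr in ops:
        if addr in resident:
            hits += 1
        else:
            misses += 1
            # reads always allocate; writes allocate only under write allocate
            if op == "read" or write_allocate:
                resident.add(addr)
    return misses, hits

ops = [("write", 100), ("write", 100), ("read", 200),
       ("write", 200), ("write", 100)]
print(run(ops, write_allocate=False))  # (4, 1)
print(run(ops, write_allocate=True))   # (2, 3)
```

With no-write allocate, [100] never enters the cache, so every write to it misses; with write allocate, the very first miss on each address is also the last.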

No-Write Allocate versus Write Allocate: Conclusion. Either write-miss policy could be used with write through or write back. Normally, write-back caches use write allocate, hoping that subsequent writes to the block will be captured by the cache.

No-Write Allocate versus Write Allocate: Conclusion. Write-through caches often use no-write allocate; the reason is that even if there is a subsequent write to the block, that write must still go to the lower-level memory.

Allah Hafiz and Assalam-u-Alaikum.