1
361 Computer Architecture Lecture 15: Cache Memory
Start: X:40
2
Outline of Today’s Lecture
Cache Replacement Policy; Cache Write Policy; Example; Summary. And here is an outline of today's lecture. In the next 15 minutes, I will give you a review of the important concepts concerning memory system design and introduce you to the basic principles of caches. An in-depth discussion of cache design is next. After that, we will spend some time talking about the cache write and replacement policies. Finally, we will finish off where we left off Wednesday on our tour of the SPARCstation 20 memory system. +1 = 4 min. (X:44)
3
An Expanded View of the Memory System
(Diagram: the Processor, with its Control and Datapath, surrounded by successive levels of Memory.) Well, what is a cache? In the overall scheme of things, a cache is nothing but memory. Caches are fast and expensive memory, but they are still memory. They usually occupy the two levels just outside the processor's datapath. What do we call the memory that is included INSIDE the processor's datapath? Registers. +1 = 5 min. (X:45) Moving away from the processor, Speed goes from fastest to slowest, Size from smallest to biggest, and Cost from highest to lowest.
4
The Need to Make a Decision!
Direct Mapped Cache: each memory location can only be mapped to 1 cache location, so there is no decision to make :-) The current item replaces the previous item in that cache location. N-way Set Associative Cache: each memory location has a choice of N cache locations. Fully Associative Cache: each memory location can be placed in ANY cache location. Cache miss in an N-way Set Associative or Fully Associative Cache: bring in the new block from memory and throw out a cache block to make room for it. We need to make a decision on which block to throw out! In a direct mapped cache, each memory location can go to 1 and only 1 cache location, so on a cache miss we don't have to make a decision on which block to throw out. We just throw out the item that was originally in the cache and keep the new one. As a matter of fact, that is what causes direct mapped caches trouble to begin with: every memory location has only one place to go in the cache, hence high conflict misses. In an N-way set associative cache, each memory location can go to one of N cache locations, while in a fully associative cache, each memory location can go to ANY cache location. So when we have a cache miss in an N-way set associative or fully associative cache, we have a slight problem: we need to make a decision on which block to throw out. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she has a higher probability of getting killed in a multi-engine airplane than in a single engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single engine airplane, you have one option, you land. But in a multi-engine airplane, when one engine stops, you have a lot of options. It is the need to make a decision that kills those indecisive people. +3 = 54 min. (Y:34)
5
Cache Block Replacement Policy
Random Replacement: hardware randomly selects a cache item and throws it out. (Diagram: Entry 0, Entry 1, ..., Entry 63, with a replacement pointer.) If you are one of those indecisive people, do I have a replacement policy for you: Random. Just throw away any one you like. Flip a coin, throw darts, whatever; you don't care. The important thing is to make a decision and move on. Just don't crash. All jokes aside, a lot of the time a good random replacement policy is sufficient. What is the problem with this? Can we do better?
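To make the random policy concrete, here is a minimal sketch in Python of a fully associative cache that evicts a randomly chosen entry on a miss. The 64-entry size and the tag-only bookkeeping are illustrative assumptions, not details from the slide.

```python
import random

class RandomReplacementCache:
    """Fully associative cache with random replacement (illustrative sketch)."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.entries = {}                  # tag -> block payload

    def access(self, tag):
        if tag in self.entries:
            return True                    # hit: nothing to decide
        # Miss: if the cache is full, evict a victim chosen at random.
        if len(self.entries) >= self.num_entries:
            victim = random.choice(list(self.entries))
            del self.entries[victim]
        self.entries[tag] = "block data"
        return False
```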
6
Cache Block Replacement Policy
Least Recently Used: hardware keeps track of the access history and replaces the entry that has not been used for the longest time. (Diagram: Entry 0, Entry 1, ..., Entry 63, ordered by LRU.) If you really want to be fancy, you can adopt the Least Recently Used algorithm and throw out the one that has not been accessed for the longest time. It is hard to implement a true LRU algorithm in hardware. What about Cost/Performance?
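For contrast, a true LRU policy has to maintain the full recency ordering. The hedged sketch below uses Python's OrderedDict as that ordering; real hardware would need age bits or a recency matrix per set, which is exactly why exact LRU is expensive to build.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with exact LRU replacement (illustrative sketch)."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.entries = OrderedDict()            # least recently used first, newest last

    def access(self, tag):
        if tag in self.entries:
            self.entries.move_to_end(tag)       # mark as most recently used
            return True
        if len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)    # evict the least recently used entry
        self.entries[tag] = "block data"
        return False
```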
7
Cache Block Replacement Policy – A compromise
Example of a Simple "Pseudo" Least Recently Used Implementation: assume 64 fully associative entries. A hardware replacement pointer points to one cache entry. Whenever an access is made to the entry the pointer points to: move the pointer to the next entry. Otherwise: do not move the pointer. Here is how you can do a "pseudo" LRU with very simple hardware. Build a hardware replacement pointer, most likely a shift register, that selects the entry you will throw out the next time you have to make room for a new item. We can start the pointer at any random location. During normal access, whenever an access is made to the entry the pointer points to, you move the pointer to the next entry. So at any given time, statistically speaking, the pointer is more likely to end up at an entry that has not been used very often than at an entry that is heavily used. +2 = 56 min. (Y:36) (Diagram: Entry 0, Entry 1, ..., Entry 63, with the replacement pointer.)
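The pointer scheme above can be sketched in a few lines. This is an illustrative model, not the actual hardware; advancing the pointer after a miss fill (treating the fill as an access to that entry) is an assumption about one detail the slide leaves open.

```python
class PseudoLRUCache:
    """Fully associative cache with the pointer-based pseudo-LRU described above."""

    def __init__(self, num_entries=64):
        self.tags = [None] * num_entries       # one tag per entry, None = invalid
        self.pointer = 0                       # hardware replacement pointer

    def access(self, tag):
        if tag in self.tags:
            # Hit: only if we touched the entry the pointer selects does the pointer move.
            if self.tags.index(tag) == self.pointer:
                self.pointer = (self.pointer + 1) % len(self.tags)
            return True
        # Miss: replace the entry the pointer currently selects, then advance it
        # (assumption: the fill counts as an access to that entry).
        self.tags[self.pointer] = tag
        self.pointer = (self.pointer + 1) % len(self.tags)
        return False
```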
8
Cache Write Policy: Write Through versus Write Back
Cache reads are much easier to handle than cache writes: the instruction cache is much easier to design than the data cache. Cache write: how do we keep the data in the cache and memory consistent? Two options (decision time again :-). Write Back: write to the cache only; write the cache block to memory when that cache block is being replaced on a cache miss. Needs a "dirty" bit for each cache block. Greatly reduces the memory bandwidth requirement, but the control can be complex. Write Through: write to the cache and memory at the same time. What!!! How can this be? Isn't memory too slow for this? So far, we have been thinking mostly in terms of cache reads, not cache writes, because cache reads are much easier to handle. That's why, in general, an instruction cache is much easier to build than a data cache. The problem with cache writes is that we have to keep the cache and memory consistent. There are two options we can follow. The first one is write back: you only write to the cache, and you write the cache block back to memory only when the cache block is being replaced on a cache miss. In order to take full advantage of the write back feature, you will need a dirty bit for each block to indicate whether you have written into this cache block before. A write back cache can greatly reduce the memory bandwidth requirement, but it also requires rather complex control. The second option is write through: you write to the cache and memory at the same time. You probably say: WHAT? You can't do that. Memory is too slow. +2 = 58 min. (Y:38)
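A small sketch of the bookkeeping difference between the two policies, using hypothetical per-block records with a dirty flag and a list standing in for memory traffic; allocation on a miss is deliberately not modeled.

```python
class WritePolicyDemo:
    """Per-block bookkeeping for write back vs. write through (illustrative sketch)."""

    def __init__(self, write_back=True):
        self.write_back = write_back
        self.blocks = {}           # tag -> {"data": ..., "dirty": bool}
        self.memory_writes = []    # every write that actually reaches memory

    def write(self, tag, data):
        # Allocation policy on a write miss is not modeled; we just create the block.
        block = self.blocks.setdefault(tag, {"data": None, "dirty": False})
        block["data"] = data
        if self.write_back:
            block["dirty"] = True                        # defer the memory update
        else:
            self.memory_writes.append((tag, data))       # write through to memory now

    def evict(self, tag):
        block = self.blocks.pop(tag)
        if self.write_back and block["dirty"]:
            self.memory_writes.append((tag, block["data"]))  # write back on replacement
```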
9
Write Buffer for Write Through
(Diagram: Processor -> Cache -> Write Buffer -> DRAM.) A Write Buffer is needed between the Cache and Memory. Processor: writes data into the cache and the write buffer. Memory controller: writes the contents of the buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4. It works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle. The memory system designer's nightmare: Store frequency (w.r.t. time) -> 1 / DRAM write cycle, i.e. write buffer saturation. You are right, memory is too slow. We really don't write to the memory directly; we write to a write buffer. Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write. The memory controller then moves the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice here, I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait. +2 = 60 min. (Y:40)
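A rough model of the write buffer as a 4-entry FIFO, matching the depth on the slide; the DRAM write latency and the cycle-by-cycle drain loop are illustrative assumptions.

```python
from collections import deque

class WriteBuffer:
    """FIFO write buffer between the cache and DRAM (illustrative sketch)."""

    def __init__(self, depth=4, dram_write_cycles=10):
        self.fifo = deque()
        self.depth = depth
        self.dram_write_cycles = dram_write_cycles   # assumed DRAM write latency in CPU cycles
        self.busy = 0                                # cycles left on the write in progress

    def cpu_store(self, addr, data):
        """Return True if the store is accepted; False means the CPU must stall."""
        if len(self.fifo) >= self.depth:
            return False                             # buffer full: write buffer saturation
        self.fifo.append((addr, data))
        return True

    def tick(self):
        """One CPU cycle of the memory controller draining the buffer."""
        if self.busy > 0:
            self.busy -= 1
        elif self.fifo:
            self.fifo.popleft()                      # start writing the oldest entry to DRAM
            self.busy = self.dram_write_cycles - 1
```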
10
Write Buffer Saturation
(Diagram: Processor -> Cache -> Write Buffer -> DRAM.) Store frequency (w.r.t. time) -> 1 / DRAM write cycle. If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it, and the effective CPU cycle time becomes no better than the DRAM write cycle time. Solutions for write buffer saturation: use a write back cache, or install a second level (L2) cache. A memory system designer's nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time. We call this Write Buffer Saturation. In that case, it does NOT matter how big you make the write buffer; it will still overflow because you are simply feeding it faster than you can empty it. I have seen this happen in simulation, and when it happens your processor runs at DRAM cycle time, which is very, very slow. The first solution for write buffer saturation is to get rid of the write buffer and replace this write through cache with a write back cache. Another solution is to install a second level cache between the write buffer and memory and make the second level write back. +2 = 62 min. (Y:42) (Diagram with L2: Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM.)
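The saturation condition is just a rate comparison: if stores arrive (per unit time) faster than one per DRAM write cycle, no finite buffer can keep up. A back-of-the-envelope check, with the clock rate, DRAM write cycle, and store mix all assumed purely for illustration.

```python
# Assumed numbers for illustration only.
cpu_clock_hz       = 200e6     # 5 ns CPU cycle
dram_write_cycle_s = 100e-9    # 100 ns per DRAM write
store_fraction     = 0.15      # stores as a fraction of issued instructions

store_rate = cpu_clock_hz * store_fraction   # stores per second (~3.0e7/s)
drain_rate = 1.0 / dram_write_cycle_s        # DRAM writes per second (1.0e7/s)

# If store_rate > drain_rate, no finite write buffer can keep up.
saturated = store_rate > drain_rate
print(f"store rate {store_rate:.1e}/s vs drain rate {drain_rate:.1e}/s -> saturated: {saturated}")
```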
11
Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the rest of the block (Bytes 2, 3, ..., 31)? Yes: Write Allocate. No: Write Not Allocate. (Diagram: the 1 KB direct mapped cache; the address splits into a Cache Tag in bits 31..10, example 0x00, a Cache Index in bits 9..5, example 0x00, and a Byte Select in bits 4..0, example 0x00; each of the 32 entries holds a valid bit, a cache tag, and 32 bytes of cache data, Byte 0 through Byte 31, covering Bytes 0 through 1023 overall.) Let's look at our 1 KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x0 and it causes a cache miss in our 1 KB direct mapped cache with 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second: is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access them soon. But the type of access we are going to do is likely to be another write, so even if we do read in the data, we may end up overwriting it anyway. So it is common practice to NOT read in the rest of the block on a write miss. If you don't bring in the rest of the block, or to use the more technical term, Write Not Allocate, you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking. +2 = 64 min. (Y:44)
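The address breakdown on the slide follows directly from the cache geometry: 32-byte blocks give a 5-bit byte select, 1 KB / 32 B = 32 blocks give a 5-bit index, and the remaining 22 bits are the tag. A small helper, written as a sketch for this example geometry:

```python
def split_address(addr, cache_size=1024, block_size=32):
    """Split an address into (tag, index, byte select) for a direct-mapped cache."""
    offset_bits = block_size.bit_length() - 1                   # 5 bits for 32-B blocks
    num_blocks  = cache_size // block_size                      # 32 blocks
    index_bits  = num_blocks.bit_length() - 1                   # 5 index bits
    byte_select = addr & (block_size - 1)
    index       = (addr >> offset_bits) & (num_blocks - 1)
    tag         = addr >> (offset_bits + index_bits)
    return tag, index, byte_select

# The 16-bit (2-byte) write to address 0x0 lands in block 0, bytes 0 and 1:
print(split_address(0x0))      # -> (0, 0, 0)
```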
12
What is a Sub-block?
Sub-block: a unit within a block that has its own valid bit. Example: 1 KB Direct Mapped Cache, 32-B Block, 8-B Sub-block. Each cache entry will have 32/8 = 4 valid bits. Write miss: only the bytes in that sub-block are brought in. (Diagram: each entry holds a cache tag, valid bits for Sub-block0 through Sub-block3, and cache data B0 through B31, covering Bytes 0 through 1023 overall.) A sub-block is defined as a unit within a block that has its own valid bit. For example here, if each 32-byte block is divided into four 8-byte sub-blocks, we will need 4 valid bits per block. On a cache write miss, we only need to bring in the data for the sub-block that we are writing to and set the valid bits of all the other sub-blocks to zero. This works fine for writes. For reads, we still want to bring in the rest of the block, so the time it takes to fill the block is still important. Modern computers have several ways to reduce the time to fill the cache. +2 = 66 min. (Y:46)
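A sketch of the per-sub-block valid bits for the geometry on the slide (32-B blocks, 8-B sub-blocks, so 4 valid bits per entry); the class and method names are made up for illustration.

```python
class SubBlockedEntry:
    """One cache block with per-sub-block valid bits (32-B block, 8-B sub-blocks)."""

    BLOCK_SIZE = 32
    SUB_BLOCK_SIZE = 8

    def __init__(self, tag):
        self.tag = tag
        self.valid = [False] * (self.BLOCK_SIZE // self.SUB_BLOCK_SIZE)   # 4 valid bits

    def write_miss_fill(self, offset_in_block):
        """On a write miss, validate only the sub-block being written; clear the rest."""
        self.valid = [False] * len(self.valid)
        self.valid[offset_in_block // self.SUB_BLOCK_SIZE] = True

    def covers(self, offset_in_block):
        """True if the sub-block holding this offset is valid."""
        return self.valid[offset_in_block // self.SUB_BLOCK_SIZE]
```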
13
SPARCstation 20’s Memory System
(Diagram: a Memory Controller sits between the Memory Bus (SIMM Bus), a 128-bit wide datapath holding Memory Module 0 through Memory Module 7, and the Processor Bus (Mbus), a 64-bit wide bus holding the Processor Module (Mbus Module) with the SuperSPARC Processor, its Instruction Cache, Data Cache, Register File, and External Cache.) The SPARCstation 20 memory system is rather simple. It consists of a 128-bit memory bus, which inside Sun we call the SIMM bus, where you can put in up to 8 memory modules. The memory bus is controlled by the memory controller. On the other side of the memory controller is the processor bus; inside Sun we call the processor bus the Mbus. On the processor bus is a processor module, which contains a SuperSPARC processor as well as some external cache. We already looked at the memory module in Wednesday's lecture. Today, we will talk about the caches on the processor module. +1 = 68 min. (Y:48)
14
SPARCstation 20’s External Cache
(Diagram: the Processor Module (Mbus Module) with the SuperSPARC Processor, its internal Instruction Cache, Data Cache, and Register File, and the 1 MB direct mapped, write back, write allocate External Cache.) SPARCstation 20's External Cache: Size and organization: 1 MB, direct mapped. Block size: 128 B. Sub-block size: 32 B. Write policy: write back, write allocate. First is the external cache. It is 1 megabyte, direct mapped. The block size is 128 bytes with 32-byte sub-blocks. It is a write back cache and it also does write allocate. +1 = 69 min. (Y:49)
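From these parameters you can derive the bookkeeping the external cache needs; a quick arithmetic check (direct mapped, so blocks and sets coincide):

```python
cache_size     = 1 * 1024 * 1024   # 1 MB, direct mapped
block_size     = 128               # bytes
sub_block_size = 32                # bytes

num_blocks           = cache_size // block_size        # 8192 blocks (= sets)
valid_bits_per_block = block_size // sub_block_size    # 4 sub-block valid bits
index_bits           = num_blocks.bit_length() - 1     # 13 index bits

print(num_blocks, valid_bits_per_block, index_bits)    # -> 8192 4 13
```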
15
SPARCstation 20’s Internal Instruction Cache
(Diagram: the Processor Module (Mbus Module) with the SuperSPARC Processor, its 20 KB 5-way I-Cache, Data Cache, and Register File, and the 1 MB direct mapped, write back, write allocate External Cache.) SPARCstation 20's Internal Instruction Cache: Size and organization: 20 KB, 5-way set associative. Block size: 64 B. Sub-block size: 32 B. Write policy: does not apply. Note: the sub-block size is the same as the External (L2) Cache's. Inside the SuperSPARC processor chip is a 20 KB, 5-way set associative cache. The block size of this cache is 64 B with 32 B sub-blocks. There is no such thing as a write policy for this cache because it is read only: it is an instruction cache. +1 = 70 min. (Y:50)
16
SPARCstation 20’s Internal Data Cache
(Diagram: the Processor Module (Mbus Module) with the SuperSPARC Processor, its 20 KB 5-way I-Cache, 16 KB 4-way write through, write not allocate D-Cache, and Register File, and the 1 MB direct mapped, write back, write allocate External Cache.) SPARCstation 20's Internal Data Cache: Size and organization: 16 KB, 4-way set associative. Block size: 64 B. Sub-block size: 32 B. Write policy: write through, write not allocate. The sub-block size is the same as the External (L2) Cache's. Finally, the data cache is 16 KB and 4-way set associative. The block size is 64 B with 32 B sub-blocks. The write policy is write through, and it does not allocate on a write; that is, the rest of the block is not filled on a write miss. Notice that the sub-block size here is the same as the L2 cache's, which makes transferring data between the two levels much easier. +1 = 71 min. (Y:51)
17
Two Interesting Questions?
(Diagram: the same processor module as on the previous slides.) Why did they use an N-way set associative cache internally? Answer: an N-way set associative cache is like having N direct mapped caches in parallel, and they want each of those N direct mapped caches to be 4 KB, the same as the "virtual page size." (Virtual page size: covered in next week's virtual memory lecture.) How many levels of cache does the SPARCstation 20 have? Answer: three levels. (1) Internal I & D caches, (2) External cache and (3) ... If you look at this cache organization, you may ask why they used 4-way and 5-way set associative caches inside the processor chip. The answer is that an N-way set associative cache is like having N direct mapped caches in parallel, and they want each of those N direct mapped caches to be 4 KB in size, the same as the virtual page size. You will learn about the virtual page size in next Wednesday's lecture, and when you do, you will appreciate why you want to keep the set size of the cache the same as the virtual page size. The next question you may ask is: how many levels of cache does the SPARCstation 20 have? The answer is three. The first level is the internal instruction and data caches. The second level is the external cache. Can anybody tell me where the third-level cache is? +2 = 73 min. (Y:53)
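The "N direct mapped caches in parallel" view makes the sizing rule easy to verify: cache size divided by associativity should equal the 4 KB virtual page size, and it does for both internal caches:

```python
page_size = 4 * 1024   # 4 KB virtual page size (per the slide)

caches = {
    "I-cache": {"size": 20 * 1024, "ways": 5},   # 20 KB, 5-way
    "D-cache": {"size": 16 * 1024, "ways": 4},   # 16 KB, 4-way
}

for name, c in caches.items():
    way_size = c["size"] // c["ways"]            # one direct-mapped "slice"
    print(name, way_size, way_size == page_size) # -> 4096, True for both
```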
18
SPARCstation 20’s Memory Module
Supports a wide range of sizes. Smallest, 4 MB: 16 2-Mbit DRAM chips, 8 KB of Page Mode SRAM. Biggest, 64 MB: 32 16-Mbit chips, 16 KB of Page Mode SRAM. (Diagram: DRAM Chip 0 through DRAM Chip 15, each 512 rows x 512 cols, 256K x 8 = 2 Mbit, each with a 512 x 8 Page Mode SRAM feeding 8 bits of the 128-bit Memory Bus<127:0>.) Well, the third-level cache is not explicit or well defined, so it is sort of a trick question. But remember all the Page Mode SRAM we have in the memory module? For a fully loaded SPARCstation 20 with 8 of the biggest memory modules, you can have up to 8 x 16 KB, or 128 KB, of page mode SRAM sitting on the memory bus. This Page Mode SRAM, in a sense, serves as your third-level cache. +1 = 74 min. (Y:54)
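The module sizes follow from the chip counts, and the same arithmetic gives the 128 KB of page mode SRAM mentioned in the note; a quick check using the chip organizations stated on the slide:

```python
# Smallest module: 16 chips of 2 Mbit (256K x 8), each with a 512 x 8 page mode SRAM buffer.
small_dram = 16 * (256 * 1024 * 8) // 8        # bytes of DRAM -> 4 MB
small_sram = 16 * 512                          # bytes of SRAM -> 8 KB

# Biggest module: 32 chips of 16 Mbit, same 512 x 8 SRAM per chip.
big_dram = 32 * (16 * 1024 * 1024) // 8        # bytes of DRAM -> 64 MB
big_sram = 32 * 512                            # bytes of SRAM -> 16 KB

# Fully loaded machine: 8 of the biggest modules of page mode SRAM on the memory bus.
total_sram = 8 * big_sram                      # -> 128 KB

print(small_dram // 2**20, small_sram // 2**10,
      big_dram // 2**20, big_sram // 2**10,
      total_sram // 2**10)                     # -> 4 8 64 16 128
```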
19
Summary. Replacement Policy: exploit the principle of locality.
Write Policy: Write Through: needs a write buffer; nightmare: write buffer saturation. Write Back: the control can be complex. Getting data into the processor from the cache, and into the cache from slower memory, is one of the most important R&D topics in industry. Let's summarize today's lecture. I know you have heard this many times and many ways, but it is still worth repeating. The memory hierarchy works because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping-pong effect, when a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by increasing the cache size or increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write through requires a write buffer, and the nightmare scenario is when stores occur so frequently that you saturate your write buffer. The second write policy is write back. In this case, you only write to the cache, and only when the cache block is being replaced do you write the cache block back to memory. +3 = 77 min. (Y:57)