CS152 Computer Architecture and Engineering Lecture 20 Caches

1 CS152 Computer Architecture and Engineering Lecture 20 Caches

2 The Big Picture: Where are We Now?
The Five Classic Components of a Computer (Processor: Control and Datapath; Memory; Input; Output). Today's Topics: recap of last lecture; simple caching techniques; many ways to improve cache performance; virtual memory? So where are we in the overall scheme of things? Well, we just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47)

3 The Art of Memory System Design
Workload or benchmark programs run on the processor and produce a reference stream <op,addr>, <op,addr>, <op,addr>, <op,addr>, . . . where op is i-fetch, read, or write. Goal: optimize the memory system organization (cache + memory) to minimize the average memory access time for typical workloads.

4 Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2^N-byte cache: the uppermost (32 - N) bits are always the Cache Tag; the lowest M bits are the Byte Select (Block Size = 2^M). On a cache miss, pull in the complete "Cache Block" (or "Cache Line"). [Diagram: a 32-bit address split into Cache Tag (e.g. 0x50), Cache Index (e.g. 0x01), and Byte Select (e.g. 0x00); each cache entry holds a Valid Bit, the Cache Tag stored as part of the cache "state", and 32 bytes of Cache Data (Byte 0 ... Byte 31 in entry 0, up through Byte 992 ... Byte 1023 in entry 31).] Let's use a specific example with realistic numbers: assume we have a 1 KB direct-mapped cache with a block size of 32 bytes. In other words, each block associated with a cache tag will have 32 bytes in it. With a block size of 32 bytes, the 5 least significant bits of the address will be used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10 bits, or 22 bits, of the address will be stored as the cache tag. The rest of the address bits in the middle, that is bits 5 through 9, will be used as the Cache Index to select the proper cache entry. +2 = 30 min. (Y:10)
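
A minimal sketch in C of this address breakdown, assuming a 32-bit address and the 1 KB direct-mapped, 32-byte-block cache from the slide (the example address is hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 5   /* 32-byte blocks -> 5 byte-select bits */
    #define INDEX_BITS 5   /* 1 KB / 32 B = 32 entries -> 5 index bits */

    int main(void) {
        uint32_t addr = 0x0000A020u;                        /* hypothetical address */
        uint32_t byte_select = addr & ((1u << BLOCK_BITS) - 1);
        uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag = addr >> (BLOCK_BITS + INDEX_BITS);   /* remaining 22 bits */
        printf("tag=0x%x index=0x%x byte_select=0x%x\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_select);
        return 0;
    }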

5 N-way set associative: N entries for each Cache Index
Set Associative Cache. N-way set associative: N entries for each Cache Index; N direct-mapped caches operate in parallel. Example: two-way set associative cache: the Cache Index selects a "set" from the cache; the two tags in the set are compared to the incoming tag in parallel; data is selected based on the tag comparison result. [Diagram: the Cache Index selects one set, each side holding Valid, Cache Tag, and Cache Data for a cache block; two tag comparators, Sel1/Sel0, a Mux, and an OR gate produce the selected Cache Block and the Hit signal.] This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct-mapped caches working in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we select the data on the side where the tag match occurred. This is simple enough. What are its disadvantages? +1 = 36 min. (Y:16)
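
A minimal C sketch of that 2-way lookup, assuming a hypothetical 16-set cache with 32-byte blocks; the struct layout and names are illustrative rather than anything specified in the lecture:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS   16
    #define BLOCK_SIZE 32

    struct way { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct way cache[NUM_SETS][2];            /* two ways per set */

    /* Returns true on a hit and copies the requested byte. */
    bool lookup(uint32_t addr, uint8_t *out) {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t set    = (addr / BLOCK_SIZE) % NUM_SETS;
        uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
        for (int w = 0; w < 2; w++) {                /* both tags compared (in parallel in hardware) */
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *out = cache[set][w].data[offset];   /* mux selects the matching way */
                return true;
            }
        }
        return false;                                /* neither tag matched: miss */
    }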

6 Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache: N comparators vs. 1; extra MUX delay for the data; data comes AFTER the Hit/Miss decision and set selection. In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss: possible to assume a hit and continue, and recover later if it was a miss. [Diagram: same 2-way structure as the previous slide, with one side read as the direct-mapped case.] First of all, an N-way set associative cache will need N comparators instead of just one comparator (use the right side of the diagram for the direct mapped cache). An N-way set associative cache will also be slower than a direct mapped cache because of this extra multiplexer delay. Finally, for an N-way set associative cache, the data will be available AFTER the hit/miss signal becomes valid because the hit/miss is needed to control the data MUX. For a direct mapped cache, that is, everything before the MUX on the right or left side, the cache block will be available BEFORE the hit/miss signal (AND gate output) because the data does not have to go through the comparator. This can be an important consideration because the processor can go ahead and use the data without knowing if it is a Hit or Miss. Just assume it is a hit. Since the cache hit rate is in the upper 90% range, you will be ahead of the game 90% of the time, and for those 10% of the time that you are wrong, just make sure you can recover. You cannot play this speculation game with an N-way set-associative cache because, as I said earlier, the data will not be available to you until the hit/miss signal is valid. +2 = 38 min. (Y:18)

7 Example: Fully Associative
Fully Associative Cache. Forget about the Cache Index: compare the Cache Tags of all cache entries in parallel. Example: with a Block Size of 32 B, we need N 27-bit comparators. By definition, Conflict Miss = 0 for a fully associative cache. [Diagram: address bits 31-5 form the 27-bit Cache Tag, bits 4-0 the Byte Select (e.g. 0x01); every entry holds a Valid Bit, a Cache Tag with its own comparator (=), and 32 bytes of Cache Data.] While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end. It is the N-way set associative cache carried to the extreme, where N in this case is set to the number of cache entries in the cache. In other words, we don't even bother to use any address bits as the cache index. We just store all the upper bits of the address (except the Byte Select) associated with the cache block as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the one that matches is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive. Usually, a fully associative cache is limited to 64 or fewer entries. Since we are not doing any mapping with a cache index, we will never push an item out of the cache because multiple memory locations map to the same cache location. Therefore, by definition, the conflict miss rate is zero for a fully associative cache. This, however, does not mean the overall miss rate will be zero. Assume we have 64 entries here. The first 64 items we access can fit in. But when we try to bring in the 65th item, we will need to throw one of them out to make room for the new item. This brings us to the third type of cache miss: the Capacity Miss. +3 = 41 min. (Y:21)
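
A minimal C sketch of the associative lookup, assuming a hypothetical 64-entry fully associative cache with 32-byte blocks (the entry layout is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_ENTRIES 64
    #define BLOCK_SIZE  32

    struct entry { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
    static struct entry cache[NUM_ENTRIES];

    /* No index bits: every entry's tag is compared (in parallel, in hardware). */
    bool lookup(uint32_t addr, uint8_t *out) {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t tag    = addr / BLOCK_SIZE;         /* upper 27 bits of the address */
        for (int i = 0; i < NUM_ENTRIES; i++) {
            if (cache[i].valid && cache[i].tag == tag) {
                *out = cache[i].data[offset];
                return true;
            }
        }
        return false;                                /* no comparator matched */
    }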

8 A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block. A "cold" fact of life: not a whole lot you can do about it. Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant. Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity. Coherence (invalidation): another process (e.g., I/O) updates memory. (Capacity miss) That is, these cache misses are due to the fact that the cache is simply not large enough to contain all the blocks accessed by the program. The solution to reduce the Capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache miss we talked about. First are the Compulsory misses. These are the misses that we cannot avoid. They are caused when we first start the program. Then we talked about the conflict misses. They are the misses caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses. The first one is, once again, to increase the cache size. The second one is to increase the associativity, for example, by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that cache miss rate is only one part of the equation. You also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache miss we will not cover today. Those are referred to as invalidation misses, caused by another process, such as I/O, updating main memory, so you have to flush the cache to avoid inconsistency between memory and cache. +2 = 43 min. (Y:23)

9 Design options at constant cost
Comparing the three organizations at constant cost (this is a hidden slide):

                     Direct Mapped    N-way Set Associative    Fully Associative
    Cache Size       Big              Medium                   Small
    Compulsory Miss  Same             Same                     Same
    Conflict Miss    High             Medium                   Zero
    Capacity Miss    Low              Medium                   High
    Coherence Miss   Same             Same                     Same

Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant (except for streaming-media types of programs).

10 Recap: Four Questions for Caches and Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

11 Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative. S.A. mapping = block number modulo number of sets. Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block frame 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4). [Diagram: the 8 block frames drawn three times (fully associative; direct mapped; 2-way set associative with sets 0-3), alongside the block-frame address of memory block 12.]
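
A small worked check of the placement arithmetic, as a hedged C sketch (the geometry numbers come from the slide; everything else is illustrative):

    #include <stdio.h>

    int main(void) {
        int block = 12;
        int num_frames = 8;                                             /* 8-block cache */

        /* Direct mapped: exactly one legal frame. */
        printf("direct mapped: frame %d\n", block % num_frames);        /* 12 mod 8 = 4 */

        /* 2-way set associative: 8 frames / 2 ways = 4 sets. */
        printf("2-way set assoc: set %d\n", block % (num_frames / 2));  /* 12 mod 4 = 0 */

        /* Fully associative: any of the 8 frames, no modulo restriction. */
        return 0;
    }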

12 Q2: How is a block found if it is in the upper level?
[Diagram: the Block Address is split into Tag and Index, followed by the block offset; the Index serves as the Set Select and the offset as the Data Select.] Direct indexing (using the index and block offset), tag compares, or a combination. Increasing associativity shrinks the index and expands the tag.
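
A hedged C sketch of how those field widths shift with associativity, using the 1 KB, 32-byte-block cache from earlier and a 32-bit address (the helper name is illustrative):

    #include <stdio.h>

    /* Field widths for a cache of cache_bytes, block_bytes-sized blocks, `ways` ways. */
    static void fields(int cache_bytes, int block_bytes, int ways) {
        int offset_bits = 0, index_bits = 0;
        int sets = cache_bytes / (block_bytes * ways);
        for (int b = block_bytes; b > 1; b >>= 1) offset_bits++;
        for (int s = sets;        s > 1; s >>= 1) index_bits++;
        printf("%2d-way: tag=%2d index=%2d offset=%2d\n",
               ways, 32 - index_bits - offset_bits, index_bits, offset_bits);
    }

    int main(void) {
        fields(1024, 32, 1);    /* direct mapped:     tag=22 index=5 offset=5 */
        fields(1024, 32, 2);    /* 2-way:             tag=23 index=4 offset=5 */
        fields(1024, 32, 32);   /* fully associative: tag=27 index=0 offset=5 */
        return 0;
    }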

13 Q3: Which block should be replaced on a miss?
Easy for Direct Mapped. For Set Associative or Fully Associative: Random or LRU (Least Recently Used). Miss rates by associativity and replacement policy:

                 2-way               4-way               8-way
    Size         LRU      Random     LRU      Random     LRU      Random
    16 KB        5.2%     5.7%       -        5.3%       -        5.0%
    64 KB        1.9%     2.0%       -        1.7%       -        1.5%
    256 KB       1.15%    1.17%      -        -          -        -
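
A minimal sketch of LRU bookkeeping within one set, assuming a hypothetical 4-way set where each way carries a small age counter (field names are illustrative, not from the lecture):

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4

    struct way { bool valid; uint32_t tag; uint32_t age; };  /* age 0 = most recently used */

    /* Pick a victim: an invalid way if one exists, else the least recently used. */
    int choose_victim(struct way set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) return w;
            if (set[w].age > set[victim].age) victim = w;
        }
        return victim;
    }

    /* After a hit or a fill, mark the touched way as most recently used. */
    void touch(struct way set[WAYS], int used) {
        for (int w = 0; w < WAYS; w++) set[w].age++;
        set[used].age = 0;
    }

A random policy would simply pick a way with rand() % WAYS; the table above shows it trails LRU only slightly, especially for the larger caches.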

14 Q4: What happens on a write?
Write through: the information is written both to the block in the cache and to the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?). Pros and cons of each? WT: read misses cannot result in writes. WB: repeated writes to the same block do not each go to memory. WT is always combined with write buffers so that the processor doesn't wait for the lower-level memory.
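
A hedged C sketch contrasting the two policies on a store hit; the helper functions standing in for the lower-level memory path are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[32]; };

    /* Hypothetical stand-ins for the path toward lower-level memory. */
    static void write_buffer_push(uint32_t addr, uint8_t byte) {
        printf("buffered write of 0x%02x to 0x%x\n", (unsigned)byte, (unsigned)addr);
    }
    static void memory_write_block(uint32_t addr, const uint8_t *block) {
        (void)block;
        printf("write back of dirty block at 0x%x\n", (unsigned)addr);
    }

    /* Write through: update the cache and also send the write toward memory. */
    void store_hit_write_through(struct line *l, uint32_t addr, uint8_t byte) {
        l->data[addr % 32] = byte;
        write_buffer_push(addr, byte);       /* buffered, so the CPU does not wait */
    }

    /* Write back: update only the cache and mark the block dirty. */
    void store_hit_write_back(struct line *l, uint32_t addr, uint8_t byte) {
        l->data[addr % 32] = byte;
        l->dirty = true;
    }

    /* Memory sees write-back data only when the dirty block is replaced. */
    void evict(struct line *l, uint32_t block_addr) {
        if (l->valid && l->dirty) memory_write_block(block_addr, l->data);
        l->valid = false;
    }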

15 Write Buffer for Write Through
[Diagram: Processor → Cache and Write Buffer → DRAM.] A Write Buffer is needed between the Cache and Memory. Processor: writes data into the cache and the write buffer. Memory controller: writes the contents of the buffer to memory. The write buffer is just a FIFO: typical number of entries: 4; must handle bursts of writes; works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle. You are right, memory is too slow. We really don't write to the memory directly. We are writing to a write buffer. Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write. The memory controller will then move the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice here, I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time. It sets the upper limit on how frequently you can write to main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait. +2 = 60 min. (Y:40)
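
A minimal C sketch of that 4-entry FIFO, with illustrative names; the only point is that the processor enqueues and the memory controller drains in the background:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr; uint32_t data; };
    static struct wb_entry buf[WB_ENTRIES];
    static int head, tail, count;

    /* Processor side: returns false when the buffer is full (the CPU must stall). */
    bool wb_enqueue(uint32_t addr, uint32_t data) {
        if (count == WB_ENTRIES) return false;
        buf[tail] = (struct wb_entry){ addr, data };
        tail = (tail + 1) % WB_ENTRIES;
        count++;
        return true;
    }

    /* Memory-controller side: drains one entry per DRAM write cycle. */
    bool wb_dequeue(struct wb_entry *out) {
        if (count == 0) return false;
        *out = buf[head];
        head = (head + 1) % WB_ENTRIES;
        count--;
        return true;
    }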

16 Write Buffer Saturation
Store frequency (w.r.t. time) > 1 / DRAM write cycle. If this condition exists for a long period of time (the CPU cycle time is too quick and/or there are too many store instructions in a row): the store buffer will overflow no matter how big you make it, and the CPU effectively runs at the DRAM write cycle time. Solutions for write buffer saturation: use a write back cache, or install a second-level (L2) cache (does this always work?). A memory system designer's nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time. We call this Write Buffer Saturation. In that case, it does NOT matter how big you make the write buffer; the write buffer will still overflow because you are simply feeding it faster than you can empty it. I have seen this happen in simulation, and when it happens your processor will be running at DRAM cycle time, which is very, very slow. The first solution for write buffer saturation is to get rid of the write buffer and replace the write-through cache with a write-back cache. Another solution is to install a second-level cache between the write buffer and memory and make the second level write back. +2 = 62 min. (Y:42) [Diagram: Processor → Cache and Write Buffer → L2 Cache → DRAM.]

17 RAW Hazards from Write Buffer!
Write-Buffer Issues: they could introduce a RAW hazard with memory! The write buffer may contain the only copy of valid data, so reads to memory may get the wrong result if we ignore the write buffer. Solutions: simply wait for the write buffer to empty before servicing reads (this might increase the read miss penalty; on the old MIPS 1000, by 50%), or check the write buffer contents before the read ("fully associative"): if there are no conflicts, let the memory access continue; else grab the data from the buffer. Can the Write Buffer help with Write Back? Yes: on a read miss replacing a dirty block, copy the dirty block to the write buffer while starting the read to memory. The CPU stalls less since it can restart as soon as the read is done, rather than waiting for the dirty block to be written first.
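
A hedged C sketch of the second solution: on a read miss, the write buffer is checked associatively and matching data is forwarded; the buffer layout and the memory_read stub are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4

    struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
    static struct wb_entry wb[WB_ENTRIES];

    static uint32_t memory_read(uint32_t addr) {    /* hypothetical DRAM read */
        (void)addr;
        return 0;
    }

    /* Read miss: scan the write buffer first to avoid a RAW hazard with memory. */
    uint32_t read_miss(uint32_t addr) {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (wb[i].valid && wb[i].addr == addr)
                return wb[i].data;                  /* newest copy still sits in the buffer */
        }
        return memory_read(addr);                   /* no conflict: memory access continues */
    }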

18 Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 that causes a miss. Do we allocate space in the cache and possibly read in the block? Yes: Write Allocate. No: Not Write Allocate. [Diagram: the 1 KB direct-mapped cache again, with the address split into Cache Tag (e.g. 0x00), Cache Index (e.g. 0x00), and Byte Select (e.g. 0x00), and each entry holding a Valid Bit, Cache Tag, and 32 bytes of Cache Data.] Let's look at our 1 KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x0 that causes a cache miss in our 1 KB direct mapped cache with 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second. Is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access them soon. But the type of access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway, so it is common practice to NOT read in the rest of the block on a write miss. If you don't bring in the rest of the block, or to use the more technical term, Write Not Allocate, you had better have some way to tell the processor the rest of the block is no longer valid. This brings us to the topic of sub-blocking. +2 = 64 min. (Y:44)
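
A hedged C sketch of the two write-miss policies; memory_read_block and memory_write_halfword are hypothetical stand-ins, and the 16-bit store lands in bytes 0-1 of the block as in the slide's example:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 32

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_SIZE]; };

    static void memory_read_block(uint32_t block_addr, uint8_t *dst) {   /* hypothetical */
        (void)block_addr;
        memset(dst, 0, BLOCK_SIZE);
    }
    static void memory_write_halfword(uint32_t addr, uint16_t v) {       /* hypothetical */
        (void)addr; (void)v;
    }

    /* Write allocate: fetch the rest of the block, then merge in the new data. */
    void write_miss_allocate(struct line *l, uint32_t addr, uint16_t value) {
        memory_read_block(addr & ~(uint32_t)(BLOCK_SIZE - 1), l->data);
        l->tag   = addr / (BLOCK_SIZE * 32);     /* 32 entries in the 1 KB example */
        l->valid = true;
        memcpy(&l->data[addr % BLOCK_SIZE], &value, sizeof value);
        l->dirty = true;                         /* assuming a write-back cache */
    }

    /* Not write allocate: bypass the cache and send the write straight to memory. */
    void write_miss_no_allocate(uint32_t addr, uint16_t value) {
        memory_write_halfword(addr, value);
    }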

19 Impact of Memory Hierarchy on Algorithms
Today CPU time is a function of (ops, cache misses). What does this mean to compilers, data structures, and algorithms? Quicksort: the fastest comparison-based sorting algorithm when keys fit in memory. Radix sort: also called "linear time" sort; for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys. "The Influence of Caches on the Performance of Sorting" by A. LaMarca and R. E. Ladner, Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997. Measurements on an AlphaStation 250 with 32-byte blocks, a direct mapped 2 MB L2 cache, and 8-byte keys, from 4000 to ... keys. Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of the memory hierarchy so that programmers, faced with main memory much smaller than their programs, did not have to manage the loading and unloading of portions of their program in and out of memory. It was a controversial proposal at the time because very few programmers believed software could manage the limited memory resource as well as a human. This all changed as DRAM size grew exponentially over the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don't have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, but we would like to emphasize not the only technique, to translate virtual memory addresses to physical memory addresses is to use a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since the TLB is so effective in reducing address translation time, what this means is that TLB misses will have a significant negative impact on processor performance. +3 = 3 min. (X:43)

20 Quicksort vs. Radix as vary number keys: Instructions
[Plot: instructions per key vs. job size in keys, for radix sort and quicksort.]

21 Quicksort vs. Radix as vary number keys: Instrs & Time
[Plot: instructions per key and time per key vs. job size in keys, for radix sort and quicksort.]

22 Quicksort vs. Radix as vary number keys: Cache misses
[Plot: cache misses per key vs. job size in keys, for radix sort and quicksort.] What is the proper approach to fast algorithms?

23 The Principle of Locality:
Summary #1/2: The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space. Three (+1) major categories of cache misses: Compulsory Misses: sad facts of life; example: cold start misses. Conflict Misses: increase cache size and/or associativity; nightmare scenario: the ping-pong effect! Capacity Misses: increase cache size. Coherence Misses: caused by external processors or I/O devices. Cache design space: total size, block size, associativity; replacement policy; write-hit policy (write-through, write-back); write-miss policy. Let's summarize today's lecture. I know you have heard this many times and many ways, but it is still worth repeating. The memory hierarchy works because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping-pong effect, when a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by increasing the cache size, increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write through requires a write buffer, and the nightmare scenario is when stores occur so frequently that you saturate your write buffer. The second write policy is write back. In this case, you only write to the cache, and only when the cache block is being replaced do you write the cache block back to memory. +3 = 77 min. (Y:57)

24 Summary #2 / 2: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation. The optimal choice is a compromise: it depends on access characteristics (workload; use as I-cache, D-cache, or TLB) and on technology / cost. Simplicity often wins. [Diagram: the design space sketched along Cache Size, Associativity, and Block Size axes, from less to more and from good to bad.] No fancy replacement policy is needed for the direct mapped cache. As a matter of fact, that is what causes the direct mapped cache trouble to begin with: there is only one place to go in the cache, which causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she is more likely to get killed in a multi-engine light airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine airplane, you have one option: sooner or later, you land. Probably sooner. But in a multi-engine airplane with one engine stopped, you have a lot of options. It is the need to make a decision that kills those people.

