1
361 Computer Architecture Lecture 14: Cache Memory
Start: X:40
2
The Motivation for Caches
Memory System: Processor - Cache - DRAM. Motivation: Large memories (DRAM) are slow. Small memories (SRAM) are fast. Make the average access time small by: Servicing most accesses from a small, fast memory. Reduce the bandwidth required of the large memory. Since DRAMs, which are dense and cheap, are slow, and SRAMs, which are very expensive, are fast, we are motivated to build a cache between the main memory and the processor. The cache is nothing but a small, fast memory built from SRAM. Its function is to make the average access time small while at the same time reducing the bandwidth required of the main memory by fulfilling most of the memory requests made by the processor. I will show you how these (last two bullets) work in today’s lecture. +1 = 3 min. (X:43)
3
Outline of Today’s Lecture
Recap of Memory Hierarchy & Introduction to Cache An In-Depth Look at the Operation of Cache Cache Write and Replacement Policy Summary And here is an outline of today’s lecture. In the next 15 minutes, I will give you a review of the important concepts concerning memory system design and introduce you to the basic principles of caches. An in-depth discussion of cache design is next. After that, we will spend some time talking about the cache write and replacement policy. Finally, we will finish off where we left off Wednesday on our tour of the SPARCstation 20 memory system. +1 = 4 min. (X:44)
4
An Expanded View of the Memory System
Processor (Control + Datapath) connected to successive levels of Memory. Well, what is a cache? In the overall scheme of things, a cache is nothing but memory. It is fast and expensive memory, but it is still memory. Caches usually occupy the two levels just outside the processor’s datapath. What do we call the memory that is included INSIDE the processor’s datapath? Registers. +1 = 5 min. (X:45) Speed: slowest to fastest, Size: biggest to smallest, Cost: lowest to highest, moving toward the processor.
5
Levels of the Memory Hierarchy
Levels of the hierarchy, Upper Level at the top (faster), Lower Level at the bottom (larger):
Registers: 100s of bytes, <10s of ns; transfer unit: instruction operands (1-8 bytes), managed by the program/compiler.
Cache: K bytes, 10s of ns; transfer unit: blocks (8-128 bytes), managed by the cache controller.
Main Memory: M bytes, 100 ns - 1 us; transfer unit: pages (512 bytes - 4 KB), managed by the OS.
Disk: G bytes, ms, 10^-3 - 10^-4 cents/bit; transfer unit: files (Mbytes), managed by the user/operator.
Tape: infinite capacity, sec-min, 10^-6 cents/bit.
Here is a picture of how the different levels stack up. The top of the heap, that is, the fastest and most expensive memory resource, are your registers. For example, in the MIPS processor you only have 32 of them, each 4 bytes long. So you only have 128 bytes of registers, but they are extremely fast, as fast as the processor. The transfer of data between a register and its lower level is controlled by your program, more specifically the load and store instructions, which move 1 to 8 bytes at a time. The level below the registers is the cache. Caches are still fast, usually in the 10s of ns range, and you can usually have thousands of bytes of cache in your computer. Data transfer between the cache and its lower level is done automatically by the hardware cache controller, so the user does not have to worry about it. The unit of transfer is called a block, and the block size is usually less than 128 bytes. When we talk about memory, we generally refer to the main memory that is built out of DRAM chips. Here we are talking about millions of bytes with access times in the hundreds of nanoseconds. The unit of transfer between the main memory and disk is the page. The common page sizes are 4 KB and 8 KB, and the transfer is usually done by the operating system. You will learn more about this interface (disk and memory) in the virtual memory lecture next Wednesday. Today’s lecture will concentrate on the memory / cache interface. +3 = 8 min. (X:48)
6
The Principle of Locality
Figure: probability of reference plotted over the address space (0 to 2^n); references cluster in a small portion of the address space at any one time.
7
The Principle of Locality
Figure: probability of reference plotted over the address space (0 to 2^n). The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time. Example: 90% of time in 10% of the code. Two Different Types of Locality: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon. Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon. Before we go any further, let’s review the most important observation that makes the memory hierarchy feasible. The observation is called the Principle of Locality, which says a program accesses a relatively small portion of the address space at any instant of time. A general rule of thumb to remember is that most programs will spend 90% of their time in 10% of the code. When we talk about locality, we are talking about two types: temporal and spatial. Temporal locality is locality in time, which says if an item is referenced, it will tend to be referenced again soon. Spatial locality is locality in space. That is, if an item is referenced, items whose addresses are close by tend to be referenced soon. +2 = 10 min. (X:50)
8
Memory Hierarchy: Principles of Operation
At any given time, data is copied between only 2 adjacent levels: Upper Level (Cache): the one closer to the processor. Smaller, faster, and uses more expensive technology. Lower Level (Memory): the one further away from the processor. Bigger, slower, and uses less expensive technology. Block: the minimum unit of information that can either be present or not present in the two-level hierarchy. The key thing to remember about a memory hierarchy is that although it can consist of multiple levels, data is copied between only two adjacent levels at a time. For example, if this Upper Level is the cache, then we only have to worry about how data is transferred to or from the level below it, that is, the main memory. How its lower level gets its data (Blk Y) is really not your concern. Well, at least not until the next lecture, when you learn about virtual memory. The minimum unit of information that can either be present or not present in the two-level hierarchy is called a block. +1 = 11 min. (X:51) Figure: Blk X in the Upper Level (Cache), Blk Y in the Lower Level (Memory), with data moving to and from the processor.
9
Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X). Hit Rate: the fraction of memory accesses found in the upper level. Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss. Miss: data needs to be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate). Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor. Hit Time << Miss Penalty. Here is a quick review of the terminology. A HIT is when the data the processor wants to access is found in the upper level (Block X), that is, the cache. The fraction of the memory accesses that are hits is defined as the hit rate. Hit time is the time to access the upper-level cache where the data is found. It consists of: (a) the time to access the cache, and (b) the time to determine if this is a hit or miss. If the data the processor wants cannot be found in the upper level, then we have a miss and we need to retrieve the data (Blk Y) from the lower level, that is, the main memory. By definition, the miss rate is just one minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block (Blk Y to Blk X) in the upper-level cache, and (b) the time it takes to deliver this new block to the processor. Needless to say, it is very important that your hit time is much, much smaller than your miss penalty. Otherwise, there would be no reason to build the cache. +2 = 13 min. (X:53) Figure: Blk X in the Upper Level (Cache), Blk Y in the Lower Level (Memory), with data moving to and from the processor.
10
Basic Terminology: Typical Values
Block (line) size: 4 - 128 bytes. Hit time: 1 - 4 cycles. Miss penalty: up to 32 cycles (and increasing), made up of access time (6-10 cycles) plus transfer time. Miss rate: 1% - 20%. Cache size: 1 KB - 256 KB. And here are some typical values for caches. The block size is usually in the range of 4 to 128 bytes. The hit time is in the range of 1 to 4 processor cycles, while the miss penalty can be as high as 32 cycles, and it is getting bigger all the time since the processor is getting faster at a much higher rate than the DRAM. A typical miss rate for the cache is between one and twenty percent, assuming the cache size is in the range of 1 KB to 256 KB. +1 = 14 min. (X:54)
11
How Does Cache Work? Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon. Keep more recently accessed data items closer to the processor. Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon. Move blocks consisting of contiguous words to the cache. Here is a simple overview of how a cache works. A cache works on the principle of locality. It takes advantage of temporal locality by keeping the more recently addressed words in the cache. In order to take advantage of spatial locality, on a cache miss we move an entire block of data that contains the missing word from the lower level into the cache. This way, we are not ONLY storing the most recently touched data in the cache but also the data that are adjacent to it. +1 = 15 min. (X:55) Figure: Blk X in the Upper Level Cache, Blk Y in the Lower Level Memory, with data moving to and from the processor.
12
The Simplest Cache: Direct Mapped Cache
Figure: memory addresses 0 through F on the left mapping into a 4-byte direct mapped cache with Cache Index 0 through 3 on the right. Location 0 can be occupied by data from: memory location 0, 4, 8, ... etc. In general: any memory location whose 2 LSBs of the address are 0s. Address<1:0> => cache index. Which one should we place in the cache? How can we tell which one is in the cache? Let’s look at the simplest cache one can build: a direct mapped cache that only has 4 bytes. In this direct mapped cache with only 4 bytes, location 0 of the cache can be occupied by data from memory location 0, 4, 8, C, ... and so on, while location 1 of the cache can be occupied by data from memory location 1, 5, 9, ... etc. So in general, the cache location where a memory location can map to is uniquely determined by the 2 least significant bits of the address (the Cache Index). For example, any memory location whose two least significant bits of the address are 0s can go to cache location zero. With so many memory locations to choose from, which one should we place in the cache? Of course, the one we have read or written most recently, because by the principle of temporal locality, the one we just touched is most likely to be the one we will need again soon. Of all the possible memory locations that can be placed in cache location 0, how can we tell which one is in the cache? +2 = 22 min. (Y:02)
13
Cache Tag and Cache Index
Assume a 32-bit memory (byte) address. For a 2**N byte direct mapped cache: Cache Index: the lower N bits of the memory address. Cache Tag: the upper (32 - N) bits of the memory address, stored as part of the cache “state” along with a Valid Bit. Example: Cache Tag = 0x50, Cache Index = 0x03. Well, in order to determine whether a memory location is currently in a direct mapped cache, we need to store part of the address as cache state. In general, if we have a 32-bit address and 2 to the power of N bytes of cache, then the lower N bits of the address will be used as the cache index to select the one and only cache location where this memory location can be mapped to. The upper 32 minus N bits of the address will be stored as part of the cache state we call the cache tag. Let me show you some examples using our simple 4-byte direct mapped cache. +1 = 23 min. (Y:03) Figure: a direct mapped cache of 2**N bytes, entries Byte 0 through Byte 2**N - 1, each with a Valid Bit and a Cache Tag (e.g., 0x50).
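As a concrete illustration of this address split, here is a minimal C sketch. It assumes the 4-byte, 1-byte-block direct mapped cache of the running example (N = 2); the example address 0x143 is simply one that decomposes into the slide's tag 0x50 and index 0x03.

```c
#include <stdint.h>
#include <stdio.h>

#define N 2   /* 2**N = 4-byte cache, as in the running example */

/* Lower N bits of the byte address select the cache index; the
 * remaining upper (32 - N) bits are stored as the cache tag. */
static uint32_t cache_index(uint32_t addr) { return addr & ((1u << N) - 1); }
static uint32_t cache_tag(uint32_t addr)   { return addr >> N; }

int main(void) {
    uint32_t addr = 0x00000143;          /* made-up example address */
    printf("addr 0x%08x -> index %u, tag 0x%x\n",
           (unsigned)addr, (unsigned)cache_index(addr), (unsigned)cache_tag(addr));
    return 0;
}
```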
14
Cache Access Example Sad Fact of Life: A lot of misses at start up:
V Tag Data. Start Up: all entries invalid. Access 00001 (miss); Miss Handling: Load Data, Write Tag & Set V, giving 000 M[00001]. Access 01010 (miss); load 010 M[01010]. Access 00001 (HIT). Access 01010 (HIT). Let’s start out with the cache empty. When we receive the memory address 00001, we use 01 as the cache index and find that the valid bit at cache location 01 is 0, not valid, so we have a cache miss. As a result of the cache miss, the contents of memory location 00001 are loaded into cache location 01, while the upper portion of the memory address, 000, is stored as the cache tag. Now let’s say we access the cache again, this time with address 01010. Once again we have a cache miss, because the valid bit at cache location 10 is not set. As a result of this cache miss, the contents of memory location 01010 are loaded into cache location 10, while the upper bits of the memory address, 010, are stored as the cache tag. Now if we access the cache again with 00001, we will have a cache hit because the 2 LSBs of the address take us to cache location 01, where the cache tag stored there matches the upper bits of the current address. Besides showing you how a simple direct mapped cache works, this picture also shows you a sad fact about caches: there will be a lot of cache misses at start up. These misses are referred to as Compulsory Misses (cold start misses) because you cannot eliminate them even if you make the cache bigger or more sophisticated. As a matter of fact, compulsory misses may get worse if you make the cache bigger. This is like having an electric water heater at your house. The larger the tank, the more hot water will be available to you as long as your electric heater is running. Let’s say you have a power outage for several hours. When the power comes back on, the larger the water tank, the longer the electric heater will take to heat up the water. +3 = 26 min. (Y:06)
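To make the walk-through concrete, here is a toy C simulation of this 4-entry, 1-byte-block direct mapped cache. The access trace 0x01 (00001) and 0x0A (01010) mirrors the slide; the array sizes and names are otherwise illustrative.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Minimal 4-entry direct mapped cache with 1-byte blocks:
 * index = low 2 address bits, tag = remaining upper bits. */
typedef struct { bool valid; uint32_t tag; uint8_t data; } Line;

static Line    cache[4];
static uint8_t memory[32];   /* toy backing store */

static bool access_cache(uint32_t addr, uint8_t *out) {
    uint32_t idx = addr & 0x3, tag = addr >> 2;
    if (cache[idx].valid && cache[idx].tag == tag) {   /* hit */
        *out = cache[idx].data;
        return true;
    }
    /* miss handling: load data, write tag, set valid bit */
    cache[idx].valid = true;
    cache[idx].tag   = tag;
    cache[idx].data  = memory[addr];
    *out = cache[idx].data;
    return false;
}

int main(void) {
    uint8_t v;
    uint32_t trace[] = {0x01, 0x0A, 0x01, 0x0A};   /* 00001, 01010, repeated */
    for (int i = 0; i < 4; i++)
        printf("addr %02x -> %s\n", (unsigned)trace[i],
               access_cache(trace[i], &v) ? "HIT" : "miss");
    return 0;
}
```

The first two accesses are the compulsory misses of the slide; the repeated accesses then hit.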
15
Definition of a Cache Block
Cache Block: the cache data that has its own cache tag. Our previous “extreme” example: 4-byte direct mapped cache with Block Size = 1 Byte. It takes advantage of Temporal Locality: if a byte is referenced, it will tend to be referenced again soon. It did not take advantage of Spatial Locality: if a byte is referenced, its adjacent bytes will be referenced soon. In order to take advantage of Spatial Locality: increase the block size. Let’s look at a definition here. A cache block is defined as the block of cache data that has its own cache tag. Our previous 4-byte direct mapped cache is on the extreme end of the spectrum where the block size is one byte, because each byte here has its own cache tag. This extreme arrangement does NOT take advantage of spatial locality, which states that if a byte is referenced, adjacent bytes will tend to be referenced soon. Well, NO KIDDING. When was the last time you wrote a program that accesses one byte at a time? Even if you just fetch one instruction, you will be accessing 4 bytes at a time. So without further insulting your intelligence, let’s look at bigger block sizes. How big should we make the block size? Well, just like anything in life, it is a tradeoff. But before we look at the tradeoff, let’s take a look at how block size affects our Cache Index and Cache Tag arrangement. +2 = 28 min. (Y:08) Figure: direct mapped cache with Valid bit, Cache Tag, and Cache Data Byte 0 through Byte 3.
16
Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2**N byte cache: The uppermost (32 - N) bits are always the Cache Tag. The lowest M bits are the Byte Select (Block Size = 2**M). Example: Cache Tag = 0x50 (bits 31:10, stored as part of the cache “state”), Cache Index = 0x01 (bits 9:5), Byte Select = 0x00 (bits 4:0). Let’s use a specific example with realistic numbers: assume we have a 1 KB direct mapped cache with a block size equal to 32 bytes. In other words, each block associated with a cache tag will have 32 bytes in it. With a block size of 32 bytes, the 5 least significant bits of the address will be used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10 bits, or 22 bits, of the address will be stored as the cache tag. The rest of the address bits in the middle, that is, bits 5 through 9, will be used as the Cache Index to select the proper cache entry. +2 = 30 min. (Y:10) Figure: 32 cache entries (index 0 through 31), each with a Valid Bit, Cache Tag (e.g., 0x50), and Cache Data Byte 0 through Byte 31 (Bytes 0-31 in entry 0, Bytes 32-63 in entry 1, ..., Bytes 992-1023 in entry 31).
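A small C sketch of this exact address breakdown follows; the example address 0x00014020 is chosen so that it decomposes into the slide's tag 0x50, index 0x01, and byte select 0x00.

```c
#include <stdint.h>
#include <stdio.h>

/* Address breakdown for a 1 KB direct mapped cache with 32-byte blocks:
 * bits <4:0>   byte select (32-byte block)
 * bits <9:5>   cache index (1 KB / 32 B = 32 entries)
 * bits <31:10> cache tag */
static void decompose(uint32_t addr) {
    uint32_t byte_sel = addr & 0x1F;
    uint32_t index    = (addr >> 5) & 0x1F;
    uint32_t tag      = addr >> 10;
    printf("addr 0x%08x: tag=0x%x index=0x%x byte=0x%x\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)byte_sel);
}

int main(void) {
    decompose(0x00014020);   /* example address matching the slide's fields */
    return 0;
}
```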
17
Block Size Tradeoff
In general, larger block sizes take advantage of spatial locality, BUT: Larger block size means larger miss penalty: it takes longer to fill up the block. If the block size is too big relative to the cache size, the miss rate will go up. Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate. As I said earlier, block size is a tradeoff. In general, a larger block size will reduce the miss rate because it takes advantage of spatial locality. But remember, miss rate is NOT the only cache performance metric. You also have to worry about miss penalty. As you increase the block size, your miss penalty will go up because, as the block gets larger, it takes longer to fill up the block. Even if you look at miss rate by itself, which you should NOT, a bigger block size does not always win. As you increase the block size, keeping the cache size constant, your miss rate will drop off rapidly at the beginning due to spatial locality. However, once you pass a certain point, your miss rate actually goes up. As a result of these two curves, the Average Access Time (point to equation), which is really the more important performance metric than the miss rate, will go down initially because the miss rate is dropping much faster than the miss penalty is increasing. But eventually, as you keep on increasing the block size, the average access time can go up rapidly because not only is the miss penalty increasing, the miss rate is increasing as well. Let me show you why your miss rate may go up as you increase the block size with another extreme example. +3 = 33 min. (Y:13) Figure: three curves plotted against Block Size: Miss Penalty (increases with block size), Miss Rate (drops as spatial locality is exploited, then rises when fewer blocks compromise temporal locality), and Average Access Time (falls, then rises with the increased miss penalty and miss rate).
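To see how the equation behaves, here is a short C sketch that evaluates the average access time at a few block sizes. The (block size, miss rate, miss penalty) points are invented purely to illustrate the shape of the tradeoff; they are not measurements.

```c
#include <stdio.h>

/* Average access time = hit_time * (1 - miss_rate) + miss_penalty * miss_rate */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
}

int main(void) {
    /* hypothetical points along the block-size curve */
    struct { int block; double mr, mp; } pts[] = {
        { 16, 0.10, 20.0 },   /* small block: higher miss rate, small penalty  */
        { 64, 0.05, 32.0 },   /* sweet spot: spatial locality pays off         */
        {256, 0.08, 80.0 },   /* too big: miss rate and penalty both rise      */
    };
    for (int i = 0; i < 3; i++)
        printf("block %3d B: average access time = %.2f cycles\n",
               pts[i].block, amat(1.0, pts[i].mr, pts[i].mp));
    return 0;
}
```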
18
Another Extreme Example
Cache Size = 4 bytes, Block Size = 4 bytes: only ONE entry in the cache (one Valid Bit, one Cache Tag, Cache Data Bytes 0-3). True: if an item is accessed, it is likely that it will be accessed again soon. But it is unlikely that it will be accessed again immediately!!! The next access will likely be a miss again. We continually load data into the cache but discard (force out) it before it is used again. The worst nightmare of a cache designer: the Ping Pong Effect. Conflict Misses are misses caused by: different memory locations mapped to the same cache index. Solution 1: make the cache size bigger. Solution 2: multiple entries for the same Cache Index. Let’s go back to our 4-byte direct mapped cache and increase its block size to 4 bytes. Now we end up having one cache entry instead of 4 entries. What do you think this will do to the miss rate? Well, the miss rate probably will go to hell. It is true that if an item is accessed, it is likely that it will be accessed again soon, but probably NOT as soon as the very next access, so the next access will cause a miss again. So what we end up doing is loading data into the cache, only for it to be forced out by another cache miss before we have a chance to use it again. This is called the ping pong effect: the data is acting like a ping pong ball, bouncing in and out of the cache. It is one of the nightmare scenarios cache designers hope never happens. We also define a term for this type of cache miss, the miss caused by different memory locations mapped to the same cache index: it is called a Conflict Miss. There are two solutions we can use to reduce conflict misses. The first one is to increase the cache size. The second one is to increase the number of cache entries per cache index. Let me show you what I mean. +2 = 35 min. (Y:15)
19
A Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index, i.e., N direct mapped caches operating in parallel. Example: two-way set associative cache. The Cache Index selects a “set” from the cache. The two tags in the set are compared in parallel. Data is selected based on the tag result. This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct mapped caches working in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit, and we select the data on the side where the tag match occurs. This is simple enough. What are its disadvantages? +1 = 36 min. (Y:16) Figure: for each Cache Index, two ways, each with Valid, Cache Tag, and Cache Data (Cache Block 0); the address tag is compared against both ways (Compare, Sel1/Sel0, Mux, OR) to produce the Cache Block output and the Hit signal.
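Here is a compact C sketch of the lookup path just described: the index picks a set, both tags are checked (sequentially here, in parallel in hardware), and the matching way supplies the data. The geometry (32 sets, 32-byte blocks) is an assumption for illustration only.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum { WAYS = 2, SETS = 32, BLOCK = 32 };
typedef struct { bool valid; uint32_t tag; uint8_t data[BLOCK]; } Way;
static Way cache[SETS][WAYS];

static bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t byte_sel = addr & (BLOCK - 1);
    uint32_t set      = (addr / BLOCK) % SETS;
    uint32_t tag      = addr / (BLOCK * SETS);
    for (int w = 0; w < WAYS; w++) {              /* the "parallel" compares */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *out = cache[set][w].data[byte_sel];  /* mux selects the hit way */
            return true;
        }
    }
    return false;   /* miss: the caller must pick a victim way and refill */
}

int main(void) {
    uint8_t v;
    printf("addr 0x40 -> %s\n", lookup(0x40, &v) ? "HIT" : "miss");
    return 0;
}
```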
20
Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache: N comparators vs. 1. Extra MUX delay for the data. Data comes AFTER Hit/Miss. In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss: it is possible to assume a hit and continue, and recover later if it was a miss. First of all, an N-way set associative cache will need N comparators instead of just one comparator (use the right side of the diagram for the direct mapped cache). An N-way set associative cache will also be slower than a direct mapped cache because of the extra multiplexer delay. Finally, for an N-way set associative cache, the data will be available only AFTER the hit/miss signal becomes valid, because the hit/miss is needed to control the data MUX. For a direct mapped cache, that is, everything before the MUX on the right or left side, the cache block is available BEFORE the hit/miss signal (the AND gate output) because the data does not have to go through the comparator. This can be an important consideration because the processor can go ahead and use the data without knowing whether it is a hit or a miss: just assume it is a hit. Since the cache hit rate is in the upper 90% range, you will be ahead of the game 90% of the time, and for the 10% of the time that you are wrong, just make sure you can recover. You cannot play this speculation game with an N-way set-associative cache because, as I said earlier, the data will not be available to you until the hit/miss signal is valid. +2 = 38 min. (Y:18) Figure: the same 2-way diagram (Cache Tag, Valid, Cache Data, Cache Index, Compare, Sel1/Sel0, Mux, OR, Hit, Cache Block), highlighting that the MUX sits between the data arrays and the output.
21
And yet Another Extreme Example: Fully Associative
Fully Associative Cache -- push the set associative idea to its limit! Forget about the Cache Index; compare the Cache Tags of all cache entries in parallel. Example: with 32 B blocks, we need N 27-bit comparators (Cache Tag = bits 31:5, Byte Select = bits 4:0, e.g., 0x01). By definition: Conflict Miss = 0 for a fully associative cache. While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end. It is the N-way set associative cache carried to the extreme, where N is set to the number of cache entries in the cache. In other words, we don’t even bother to use any address bits as the cache index. We just store all the upper bits of the address (except the Byte Select) associated with the cache block as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the one that matches is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive. Usually, a fully associative cache is limited to 64 or fewer entries. Since we are not doing any mapping with a cache index, we will never push an item out of the cache because multiple memory locations map to the same cache location. Therefore, by definition, the conflict miss is zero for a fully associative cache. This, however, does not mean the overall miss rate will be zero. Assume we have 64 entries here. The first 64 items we access can fit in, but when we try to bring in the 65th item, we will need to throw one of them out to make room for the new item. This brings us to the third type of cache miss: the Capacity Miss. +3 = 41 min. (Y:21) Figure: entries each with a Valid Bit, a 27-bit Cache Tag compared against the address (X markers), and Cache Data Byte 0 through Byte 31 (Bytes 32-63 in the next entry, and so on).
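For contrast with the set associative sketch, here is the fully associative lookup in C: no index bits at all, and every entry's tag is compared (in hardware, all 64 comparisons happen at once). The 64-entry, 32-byte-block geometry follows the slide's remarks; the rest is illustrative.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum { ENTRIES = 64, BLOCK = 32 };
typedef struct { bool valid; uint32_t tag; uint8_t data[BLOCK]; } Entry;
static Entry cache[ENTRIES];

static bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t byte_sel = addr & (BLOCK - 1);
    uint32_t tag      = addr / BLOCK;      /* 27-bit tag for 32-bit addresses */
    for (int i = 0; i < ENTRIES; i++) {    /* associative lookup over all entries */
        if (cache[i].valid && cache[i].tag == tag) {
            *out = cache[i].data[byte_sel];
            return true;
        }
    }
    return false;   /* miss: any entry may be chosen as the victim */
}

int main(void) {
    uint8_t v;
    printf("addr 0x1000 -> %s\n", lookup(0x1000, &v) ? "HIT" : "miss");
    return 0;
}
```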
22
A Summary on Sources of Cache Misses
Compulsory (cold start, first reference): the first access to a block. “Cold” fact of life: not a whole lot you can do about it. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity. Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size. Invalidation: another process (e.g., I/O) updates memory. (Capacity miss) That is, these cache misses are due to the fact that the cache is simply not large enough to contain all the blocks that are accessed by the program. The solution to reduce the capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache miss we talked about. First are the compulsory misses. These are the misses that we cannot avoid. They are caused when we first start the program. Then we talked about the conflict misses. They are the misses caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses. The first one is, once again, to increase the cache size. The second one is to increase the associativity, for example by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that the cache miss rate is only one part of the equation. You also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache misses we will not cover today. Those are referred to as invalidation misses, caused by another process, such as I/O, updating the main memory, so you have to flush the cache to avoid inconsistency between memory and cache. +2 = 43 min. (Y:23)
23
Source of Cache Misses Quiz
Table to fill in: rows are Cache Size, Compulsory Miss, Conflict Miss, Capacity Miss, Invalidation Miss; columns are Direct Mapped, N-way Set Associative, Fully Associative. Categorize each entry as high, medium, low, or zero. Let me see whether you can help me fill out this table. So, all things being equal, which organization do you think we should use if we want to build the biggest cache (Big, Medium, Small)? Answer: the direct mapped cache is the simplest, so we can make it the biggest. The fully associative cache is the most complex, so it will be the smallest. How about compulsory misses: which cache do you think has the highest compulsory miss rate (High, Medium, Low)? Answer: well, the direct mapped cache with the biggest cache size probably will take the longest to warm up, so it probably will have the highest compulsory miss rate. But the good thing about the compulsory miss rate is that if you are going to run billions of instructions, you really don’t care about your initial misses before the cache is warm. Conflict misses: which do you think has the lowest conflict miss rate? Answer: simple. By definition, the fully associative cache will have zero conflict misses, so you can’t get any lower than that. For the direct mapped cache, each memory location can ONLY go to one cache location, so its conflict misses probably will be the highest. Capacity misses: whoever has the biggest cache size wins in this category, so the direct mapped cache will have the lowest capacity miss rate while the fully associative cache will have the highest. Finally, invalidation misses: they probably affect all three caches the same way. +3 = 46 min. (Y:26)
24
Sources of Cache Misses Answer
                    Direct Mapped          N-way Set Associative   Fully Associative
Cache Size          Big                    Medium                  Small
Compulsory Miss     High (but who cares!)  Medium                  Low (see Note)
Conflict Miss       High                   Medium                  Zero
Capacity Miss       Low                    Medium                  High
Invalidation Miss   Same                   Same                    Same
Note: If you are going to run “billions” of instructions, Compulsory Misses are insignificant.
25
Summary: The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space. Three Major Categories of Cache Misses: Compulsory Misses: sad facts of life. Example: cold start misses. Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: the ping pong effect! Capacity Misses: increase cache size.
26
What about ... Replacement? Writing? NEXT CLASS
27
The Need to Make a Decision!
Direct Mapped Cache: each memory location can only be mapped to 1 cache location. No need to make any decision :-) The current item replaces the previous item in that cache location. N-way Set Associative Cache: each memory location has a choice of N cache locations. Fully Associative Cache: each memory location can be placed in ANY cache location. Cache miss in an N-way Set Associative or Fully Associative Cache: bring in the new block from memory and throw out a cache block to make room for it. Damn! We need to make a decision on which block to throw out! In a direct mapped cache, each memory location can go to 1 and only 1 cache location, so on a cache miss we don’t have to make a decision on which block to throw out. We just throw out the item that was originally in the cache and keep the new one. As a matter of fact, that is what causes direct mapped caches trouble to begin with: every memory location has only one place to go in the cache, hence high conflict misses. In an N-way set associative cache, each memory location can go to one of N cache locations, while in a fully associative cache, each memory location can go to ANY cache location. So when we have a cache miss in an N-way set associative or fully associative cache, we have a slight problem: we need to make a decision on which block to throw out. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she has a higher probability of getting killed in a multi-engine airplane than in a single engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single engine airplane, you have one option, you land. But in a multi-engine airplane with one engine stopped, you have a lot of options, and it is the need to make a decision that kills those indecisive people. +3 = 54 min. (Y:34)
28
Cache Block Replacement Policy
Random Replacement: the hardware randomly selects a cache item and throws it out. Least Recently Used: the hardware keeps track of the access history and replaces the entry that has not been used for the longest time. Example of a simple “pseudo” Least Recently Used implementation: assume 64 fully associative entries. A hardware replacement pointer points to one cache entry. Whenever an access is made to the entry the pointer points to: move the pointer to the next entry. Otherwise: do not move the pointer. If you are one of those indecisive people, do I have a replacement policy for you: Random. Just throw away any one you like. Flip a coin, throw darts, whatever, you don’t care. The important thing is to make a decision and move on. Just don’t crash. Well, all jokes aside, a lot of the time a good random replacement policy is sufficient. If you really want to be fancy, you can adopt the Least Recently Used algorithm and throw out the one that has not been accessed for the longest time. It is hard to implement a true LRU algorithm in hardware. Here is an example where you can do a “pseudo” LRU with very simple hardware. Build a hardware replacement pointer, most likely a shift register, that selects the entry you will throw out the next time you have to make room for a new item. We can start the pointer at any random location. During normal access, whenever an access is made to the entry that the pointer points to, you move the pointer to the next entry. So at any given time, statistically speaking, the pointer is more likely to end up at an entry that has not been used very often than at an entry that is heavily used. +2 = 56 min. (Y:36) Figure: Entry 0, Entry 1, ..., Entry 63, with the Replacement Pointer pointing at one entry.
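Here is a software sketch, in C, of that replacement-pointer scheme as described: the pointer advances only when the entry it points to is accessed, and on a miss the pointed-to entry is the victim. It models the behavior of the idea, not the shift-register hardware; the 64-entry size follows the slide.

```c
#include <stdio.h>

enum { ENTRIES = 64 };
static unsigned repl_ptr;                    /* the hardware replacement pointer */

static void on_access(unsigned entry) {
    if (entry == repl_ptr)                   /* touched the pointed-to entry... */
        repl_ptr = (repl_ptr + 1) % ENTRIES; /* ...so move the pointer along    */
}

static unsigned pick_victim(void) {
    return repl_ptr;                         /* statistically a little-used entry */
}

int main(void) {
    on_access(0);                            /* pointer starts at 0, so it advances */
    on_access(5);                            /* not the pointed-to entry: no move   */
    unsigned victim = pick_victim();         /* on a miss, evict this entry         */
    on_access(victim);                       /* the refill is itself an access      */
    printf("victim was entry %u, pointer now at %u\n", victim, repl_ptr);
    return 0;
}
```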
29
Cache Write Policy: Write Through versus Write Back
Cache reads are much easier to handle than cache writes: an instruction cache is much easier to design than a data cache. Cache write: how do we keep the data in the cache and memory consistent? Two options (decision time again :-) Write Back: write to the cache only. Write the cache block to memory when that cache block is being replaced on a cache miss. Needs a “dirty” bit for each cache block. Greatly reduces the memory bandwidth requirement. Control can be complex. Write Through: write to the cache and memory at the same time. What!!! How can this be? Isn’t memory too slow for this? So far, we have been thinking mostly in terms of cache reads, not cache writes, because cache reads are much easier to handle than cache writes. That’s why, in general, an instruction cache is much easier to build than a data cache. The problem with cache writes is that we have to keep the cache and memory consistent. There are two options we can follow. The first one is write back. In this case, you only write to the cache. We write the cache block back to memory only when the cache block is being replaced on a cache miss. In order to take full advantage of the write back feature, you will need a dirty bit for each block to indicate whether you have written into this cache block before. A write back cache can greatly reduce the memory bandwidth requirement, but it also requires rather complex control. The second option is write through: you write to the cache and memory at the same time. You probably say: WHAT? You can’t do that. Memory is too slow. +2 = 58 min. (Y:38)
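To make the write back option concrete, here is a minimal C sketch of a single direct mapped line with a dirty bit: writes go to the cache only, and the old block is written back to memory when it is evicted dirty. The line size, the toy memory array, and the write-allocate refill are assumptions for illustration.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

enum { BLOCK = 32 };
typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK]; } Line;

static uint8_t memory[1 << 16];              /* toy backing store (< 64 KB addresses) */

static void write_byte(Line *line, uint32_t addr, uint8_t value) {
    uint32_t tag = addr / BLOCK, offset = addr % BLOCK;
    if (!line->valid || line->tag != tag) {             /* miss: evict first */
        if (line->valid && line->dirty)                 /* write back the old block */
            memcpy(&memory[line->tag * BLOCK], line->data, BLOCK);
        memcpy(line->data, &memory[tag * BLOCK], BLOCK); /* refill (write allocate) */
        line->valid = true;
        line->dirty = false;
        line->tag   = tag;
    }
    line->data[offset] = value;   /* write to the cache only...                 */
    line->dirty = true;           /* ...and remember it must reach memory later */
}

int main(void) {
    Line line = {0};
    write_byte(&line, 0x0040, 0xAB);
    printf("dirty after write: %d\n", line.dirty);
    return 0;
}
```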
30
Write Buffer for Write Through
Processor -> Cache -> Write Buffer -> DRAM. A Write Buffer is needed between the Cache and Memory. Processor: writes data into the cache and the write buffer. Memory controller: writes the contents of the buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4. It works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle. Memory system designer’s nightmare: Store frequency (w.r.t. time) -> 1 / DRAM write cycle, i.e., write buffer saturation. You are right, memory is too slow. We don't really write to the memory directly; we write to a write buffer. Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write. The memory controller will then move the write buffer’s contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice here, I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to the main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait. +2 = 60 min. (Y:40)
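A small C sketch of the FIFO write buffer described above: the processor enqueues stores, the memory controller drains them to DRAM, and a full buffer means the processor must stall. The 4-entry depth comes from the slide; the function names and the printf stand-in for a DRAM write are illustrative.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum { WB_ENTRIES = 4 };
typedef struct { uint32_t addr; uint32_t data; } Store;

static Store    buf[WB_ENTRIES];
static unsigned head, tail, count;

static bool wb_enqueue(uint32_t addr, uint32_t data) {   /* processor side */
    if (count == WB_ENTRIES) return false;   /* full: the CPU must stall */
    buf[tail] = (Store){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;
}

static bool wb_drain_one(void) {             /* memory controller side */
    if (count == 0) return false;
    Store s = buf[head];                     /* one slow DRAM write per call */
    head = (head + 1) % WB_ENTRIES;
    count--;
    printf("DRAM <- [0x%08x] = 0x%08x\n", (unsigned)s.addr, (unsigned)s.data);
    return true;
}

int main(void) {
    for (uint32_t i = 0; i < 5; i++)         /* five back-to-back stores */
        if (!wb_enqueue(0x1000 + 4 * i, i))
            printf("store %u stalls: write buffer saturated\n", (unsigned)i);
    while (wb_drain_one()) { }
    return 0;
}
```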
31
Write Buffer Saturation
Processor -> Cache -> Write Buffer -> DRAM. Store frequency (w.r.t. time) -> 1 / DRAM write cycle. If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row): the store buffer will overflow no matter how big you make it, because the CPU Cycle Time <= DRAM Write Cycle Time. Solutions for write buffer saturation: use a write back cache, or install a second level (L2) cache. A memory system designer’s nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time. We call this write buffer saturation. In that case, it does NOT matter how big you make the write buffer; the write buffer will still overflow because you are simply feeding things into it faster than you can empty it. I have seen this happen in simulation, and when it happens your processor will be running at DRAM cycle time: very, very slow. The first solution for write buffer saturation is to get rid of the write buffer and replace the write through cache with a write back cache. Another solution is to install a second level cache between the write buffer and memory and make the second level cache write back. +2 = 62 min. (Y:42) Figure: with an L2 cache inserted, the path becomes Processor -> Cache -> Write Buffer -> L2 Cache -> DRAM.
32
Write Allocate versus Not Allocate
Assume: a 16-bit write to memory location 0x0 causes a miss. Do we read in the rest of the block (Byte 2, 3, ..., 31)? Yes: Write Allocate. No: Write Not Allocate. Example address breakdown: Cache Tag = 0x00 (bits 31:10), Cache Index = 0x00 (bits 9:5), Byte Select = 0x00 (bits 4:0). Let’s look at our 1 KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x0, and it causes a cache miss in our 1 KB direct mapped cache with 32-byte blocks. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ..., Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second: is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access those bytes soon, but the type of access we are going to do is likely to be another write. So even if we do read in the data, we may end up overwriting it anyway, so it is common practice to NOT read in the rest of the block on a write miss. If you don’t bring in the rest of the block, or to use the more technical term, Write Not Allocate, you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking. +2 = 64 min. (Y:44) Figure: the 1 KB direct mapped cache with a Valid Bit, Cache Tag (0x00), and Cache Data Byte 0 through Byte 31 per entry (entries 0 through 31, Bytes 992-1023 in the last entry).
33
What is a Sub-block?
Sub-block: a unit within a block that has its own valid bit. Example: 1 KB Direct Mapped Cache, 32-B Block, 8-B Sub-block. Each cache entry will have 32/8 = 4 valid bits (SB0 through SB3). Write miss: only the bytes in that sub-block are brought in. A sub-block is defined as a unit within a block that has its own valid bit. For example, here, if each 32-byte block is divided into four 8-byte sub-blocks, we will need 4 valid bits per block. On a cache write miss, we only need to bring in the data for the sub-block that we are writing to and set the valid bits of all other sub-blocks to zero. This works fine for writes. For reads, we still want to bring in the rest of the block, so the time it takes to fill the block is still important. Modern computers have several ways to reduce the time to fill the cache. +2 = 66 min. (Y:46) Figure: each cache entry has SB3/SB2/SB1/SB0 valid bits, a Cache Tag, and Cache Data B0 through B31 split into Sub-block0 through Sub-block3 (entries 0 through 31, Bytes 992-1023 in the last entry).
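Here is a brief C sketch of the write-miss handling just described for a line with four 8-byte sub-blocks: only the written sub-block's valid bit is set, and the others are cleared. The structure and names are illustrative, not a description of any particular machine.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

enum { BLOCK = 32, SUB = 8, NSUB = BLOCK / SUB };
typedef struct { uint32_t tag; bool sub_valid[NSUB]; uint8_t data[BLOCK]; } Line;

static void write_miss(Line *line, uint32_t addr, uint8_t value) {
    uint32_t tag    = addr / BLOCK;
    uint32_t offset = addr % BLOCK;
    line->tag = tag;
    for (int i = 0; i < NSUB; i++)
        line->sub_valid[i] = false;       /* everything else is now stale      */
    line->sub_valid[offset / SUB] = true; /* only this sub-block is brought in */
    line->data[offset] = value;
}

int main(void) {
    Line line = {0};
    write_miss(&line, 0x0049, 0xCD);      /* offset 9 lands in sub-block 1 */
    for (int i = 0; i < NSUB; i++)
        printf("sub-block %d valid: %d\n", i, line.sub_valid[i]);
    return 0;
}
```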
34
SPARCstation 20’s Memory System
Memory Controller on the Memory Bus (SIMM Bus), a 128-bit wide datapath, with Memory Module 0 through Memory Module 7. On the other side, the Processor Bus (Mbus), 64 bits wide, with the Processor Module (Mbus Module) containing the SuperSPARC Processor (Register File, Instruction Cache, Data Cache) and an External Cache. The SPARCstation 20 memory system is rather simple. It consists of a 128-bit memory bus, which inside Sun we call the SIMM bus, where you can put in up to 8 memory modules. The memory bus is controlled by the memory controller. On the other side of the memory controller is the processor bus; inside Sun we call the processor bus the Mbus. On the processor bus is a processor module, which contains a SuperSPARC processor as well as some external cache. We already looked at the memory module in Wednesday's lecture. Today, we will talk about the caches on the processor module. +1 = 68 min. (Y:48)
35
SPARCstation 20’s External Cache
Processor Module (Mbus Module): SuperSPARC Processor (Register File, Instruction Cache, Data Cache) plus the External Cache (1 MB, Direct Mapped, Write Back, Write Allocate). SPARCstation 20’s External Cache: Size and organization: 1 MB, direct mapped. Block size: 128 B. Sub-block size: 32 B. Write policy: write back, write allocate. First is the external cache. It is 1 megabyte, direct mapped. The block size is 128 bytes with 32-byte sub-blocks. It is a write back cache, and it also does write allocate. +1 = 69 min. (Y:49)
36
SPARCstation 20’s Internal Instruction Cache
Processor Module (Mbus Module): SuperSPARC Processor with I-Cache (20 KB, 5-way), D-Cache, and Register File, plus the 1 MB direct mapped, write back, write allocate External Cache. SPARCstation 20’s Internal Instruction Cache: Size and organization: 20 KB, 5-way set associative. Block size: 64 B. Sub-block size: 32 B. Write policy: does not apply. Note: the sub-block size is the same as in the External (L2) Cache. Inside, the SuperSPARC processor chip has a 20 KB, 5-way set associative cache. The block size of this cache is 64 B with 32 B sub-blocks. Well, there is no such thing as a write policy for this cache because this cache is read only: it is an instruction cache. +1 = 70 min. (Y:50)
37
SPARCstation 20’s Internal Data Cache
Processor Module (Mbus Module): SuperSPARC Processor with I-Cache (20 KB, 5-way), D-Cache (16 KB, 4-way, WT, WNA), and Register File, plus the 1 MB direct mapped, write back, write allocate External Cache. SPARCstation 20’s Internal Data Cache: Size and organization: 16 KB, 4-way set associative. Block size: 64 B. Sub-block size: 32 B. Write policy: write through, write not allocate. The sub-block size is the same as in the External (L2) Cache. Finally, the data cache size is 16 KB. It is 4-way set associative. The block size is 64 B with 32 B sub-blocks. The write policy is write through, and it does not allocate on writes; that is, the rest of the block is not filled on a write miss. Notice that the sub-block size here is the same as in the L2 cache. This makes transferring data between the two levels much easier. +1 = 71 min. (Y:51)
38
Two Interesting Questions?
Why did they use an N-way set associative cache internally? Answer: an N-way set associative cache is like having N direct mapped caches in parallel. They want each of those N direct mapped caches to be 4 KB, the same as the “virtual page size.” Virtual page size: covered in next week’s virtual memory lecture. How many levels of cache does the SPARCstation 20 have? Answer: three levels. (1) Internal I & D caches, (2) External cache and (3) ... If you look at this cache organization, you may ask why they used 4-way and 5-way set associative caches inside the processor chip. The answer is that an N-way set associative cache is like having N direct mapped caches in parallel. They want each of those N direct mapped caches to be 4 KB in size, the same as the virtual page size. You will learn about the virtual page size in next Wednesday's lecture, and when you do, you will appreciate why you want to keep the set size of the cache the same as the virtual page size. The next question you may ask is: how many levels of cache does the SPARCstation 20 have? The answer is three. The first level is the internal instruction and data caches. The second level is the external cache. Can anybody tell me where the third level cache is? +2 = 73 min. (Y:53)
39
SPARCstation 20’s Memory Module
Supports a wide range of sizes: Smallest, 4 MB: 16 2-Mb DRAM chips, 8 KB of Page Mode SRAM. Biggest, 64 MB: 32 16-Mb chips, 16 KB of Page Mode SRAM. Figure: DRAM Chip 0 through DRAM Chip 15, each 512 rows x 512 cols (256K x 8 = 2 Mb), each feeding 8 bits of the 128-bit Memory Bus (bits<127:0>) through a 512 x 8 Page Mode SRAM. Well, the third level cache is not explicit or well defined, so it is sort of a trick question. But remember all the Page Mode SRAM we have in the memory module? For a fully loaded SPARCstation 20 with 8 of the biggest memory modules, you can have up to 8 x 16 KB, or 128 KB, of Page Mode SRAM sitting on the memory bus. This Page Mode SRAM, in a sense, serves as your third level cache. +1 = 74 min. (Y:54)
40
Summary: The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space. Three Major Categories of Cache Misses: Compulsory Misses: sad facts of life. Example: cold start misses. Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: the ping pong effect! Capacity Misses: increase cache size. Write Policy: Write Through: needs a write buffer. Nightmare: write buffer saturation. Write Back: control can be complex. Let’s summarize today’s lecture. I know you have heard this many times and many ways, but it is still worth repeating. The memory hierarchy works because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don’t bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping pong effect, when a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by either increasing the cache size or increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write through requires a write buffer, and the nightmare scenario is when stores occur so frequently that you saturate your write buffer. The second write policy is write back. In this case, you only write to the cache, and only when the cache block is being replaced do you write the cache block back to memory. +3 = 77 min. (Y:57)