COMP 206 Computer Architecture and Implementation. Unit 8a: Basics of Caches. Siddhartha Chatterjee, Fall 2000.
The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Input, Output, Memory, and the Processor's Control and Datapath. This unit: the Memory System.

You should know by now that all computers consist of five components: (1) input and (2) output devices, (3) the memory system, and the (4) control and (5) datapath of the processor. You have already learned how to design the datapath and control for the processor; today's and Friday's lectures will cover the memory system. We call this the BIG picture because it is a simplification, or abstraction, of what the real world looks like. In the real world, the memory system of a modern computer is not just a black box like this.
The Motivation for Caches
Motivation:
Large memories (DRAM) are slow; small memories (SRAM) are fast.
Make the average access time small by servicing most accesses from a small, fast memory.
Reduce the bandwidth required of the large memory.
[Figure: the Processor talks to the Memory System, with a Cache placed in front of DRAM.]

Since DRAMs, which are dense and cheap, are slow, and SRAMs, which are very expensive, are fast, we have a motivation to build a cache between the main memory and the processor. The cache is nothing but a small, fast memory built from SRAM. Its function is to make the average access time small while at the same time reducing the bandwidth required of the main memory, by fulfilling most of the memory requests made by the processor. I will show you how these last two points work in today's lecture.
The Principle of Locality
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. Example: 90% of time spent in 10% of the code. [Figure: probability of reference is concentrated in a few regions of the 2^n-byte address space.]

Two different types of locality:
Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

Before we go any further, let's review the most important observation that makes a memory hierarchy feasible: the Principle of Locality. A general rule of thumb to remember is that most programs spend 90% of their time in 10% of their code.
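To make the two kinds of locality concrete, here is a minimal C sketch; it is not from the slides, and the 64x64 array size is an arbitrary assumption:

    /* Illustrative sketch of locality, assuming a 64x64 int array. */
    int sum_rows(int a[64][64])
    {
        int sum = 0;                  /* reused every iteration: temporal locality */
        for (int i = 0; i < 64; i++)
            for (int j = 0; j < 64; j++)
                sum += a[i][j];       /* walks consecutive addresses: spatial locality */
        return sum;                   /* the loop code itself is re-executed too: temporal locality */
    }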
Levels of the Memory Hierarchy
Capacity, access time, and cost/bit at each level, together with the staging transfer unit and who manages it (Upper Level = faster; Lower Level = larger):

Level           Capacity        Access Time    Cost/bit      Transfer Unit   Managed by            Unit Size
CPU Registers   100s of bytes   1-5 ns         ~$.05         words           programmer/compiler   1-8 bytes
Cache (L1, L2)  KB-MB           1-30 ns        ~$.0005       blocks          cache controller      8-128 bytes
Main Memory     MB-GB           100 ns-1 µs    ~$            pages           operating system      4-64 KB
Disk            1-30 GB         3-15 ms        cents         files           user/operator         MB
Tape/Network    "infinite"      sec-min        10^-6 cents

Here is a picture of how the different levels stack up. At the top of the heap, the fastest and most expensive memory resources are your registers. In the MIPS processor, you have only 32 of them, each 4 bytes long, so you have only 128 bytes of registers, but they are extremely fast, as fast as the processor. The transfer of data between registers and the level below is controlled by your program, specifically by the load and store instructions, which move 1 to 8 bytes at a time. The level below the registers is the cache. Caches are still fast, usually in the tens of nanoseconds, and you can usually have thousands of bytes of cache in your computer. Data transfer between the cache and the level below it is done automatically by the hardware cache controller, so the user does not have to worry about it; the unit of transfer is called a block, and the block size is usually no more than 128 bytes. When we talk about memory, we generally mean the main memory built out of DRAM chips. Here we are talking about millions of bytes with access times in the hundreds of nanoseconds. The unit of transfer between main memory and disk is the page; common page sizes are 4 KB and 8 KB, and the transfer is done by the operating system. You will learn more about the disk/memory interface in the virtual memory lecture next Wednesday. Today's lecture concentrates on the memory/cache interface.
Memory Hierarchy: Principles of Operation
At any given time, data is copied between only 2 adjacent levels.
Upper Level (Cache): the one closer to the processor; smaller, faster, and uses more expensive technology.
Lower Level (Memory): the one further from the processor; bigger, slower, and uses less expensive technology.
Block: the minimum unit of information that can either be present or not present in the two-level hierarchy.
[Figure: the processor exchanges Blk X with the upper level (cache); the upper level exchanges Blk Y with the lower level (memory).]

The key thing to remember about a memory hierarchy is that although it can consist of multiple levels, data is copied between only two adjacent levels at a time. For example, if the upper level is the cache, then we only have to worry about how data is transferred to or from the level below it, the main memory. How the lower level gets its data (Blk Y) is really not your concern, at least not until the next lecture, when you learn about virtual memory. The minimum unit of information that can either be present or not present in the two-level hierarchy is called a block.
Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (example: Block X on the previous slide).
Hit Rate: the fraction of memory accesses found in the upper level.
Hit Time: time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
Miss: the data must be retrieved from a block in the lower level (Block Y on the previous slide).
Miss Rate = 1 - Hit Rate.
Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor.
Hit Time << Miss Penalty.

A HIT is when the data the processor wants is found in the upper level, the cache. The hit time consists of two parts: (a) the time to access the cache, and (b) the time to determine whether this is a hit or a miss. If the data the processor wants cannot be found in the upper level, then we have a miss, and we need to retrieve the data (Blk Y) from the lower level, the main memory. By definition, the miss rate is just one minus the hit rate. The miss penalty also consists of two parts: (a) the time it takes to replace a block in the upper-level cache, and (b) the time it takes to deliver the new block to the processor. Needless to say, it is very important that your hit time be much, much smaller than your miss penalty; otherwise there would be no reason to build the cache.
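These definitions combine into the standard average memory access time (AMAT) formula; the slide does not state it explicitly, and the numbers below are assumed purely for illustration:

$$\text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty}$$

For example, with a 2 ns hit time, a 5% miss rate, and a 100 ns miss penalty, $\text{AMAT} = 2 + 0.05 \times 100 = 7$ ns. Because the miss rate is small, the large miss penalty is mostly hidden, which is exactly why Hit Time << Miss Penalty is tolerable.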
Cache Addressing

[Diagram: the cache consists of Set 0 through Set j-1; each set holds Block 0 through Block k-1 plus replacement info; each block carries a Tag and state bits (Valid, Dirty, Shared) and is divided into Sector 0 through Sector m-1; each sector holds Byte 0 through Byte n-1.]
The block/line is the unit of allocation.
The sector/sub-block is the unit of transfer and coherence.
Cache parameters j, k, m, n are integers, and generally powers of 2.
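A quick sketch of how the four parameters determine the data capacity. This is my own illustration, reading the diagram as j sets of k blocks of m sectors of n bytes; the concrete figures match the 1 KB direct-mapped example used later in the deck:

    #include <stdio.h>

    /* Data capacity implied by the slide's parameters:
       j sets, k blocks per set, m sectors per block, n bytes per sector. */
    static unsigned cache_data_bytes(unsigned j, unsigned k, unsigned m, unsigned n)
    {
        return j * k * m * n;
    }

    int main(void)
    {
        /* The later 1 KB direct-mapped example: 32 sets, 1 block/set,
           1 sector/block, 32 bytes/sector. */
        printf("%u bytes\n", cache_data_bytes(32, 1, 1, 32));   /* prints: 1024 bytes */
        return 0;
    }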
Examples of Cache Configurations
[Diagram: a memory address splits into a block address and a block offset; the block address divides into Tag and Set fields, and the offset divides into Sector and Byte fields.]
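A hedged sketch of how such an address might be decomposed in software; the field widths are parameters, since the slide leaves them generic:

    #include <stdint.h>

    struct fields { uint32_t tag, set, sector, byte; };

    /* Split an address into Tag | Set | Sector | Byte, given each field's width.
       The tag is whatever remains at the top of the address. */
    static struct fields split_address(uint32_t addr,
                                       int set_bits, int sector_bits, int byte_bits)
    {
        struct fields f;
        f.byte   = addr & ((1u << byte_bits) - 1);
        addr   >>= byte_bits;
        f.sector = addr & ((1u << sector_bits) - 1);
        addr   >>= sector_bits;
        f.set    = addr & ((1u << set_bits) - 1);
        f.tag    = addr >> set_bits;
        return f;
    }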
Storage Overhead of Cache
Basics of Cache Operation
Details of Simple Blocking Cache
[Diagram contrasting Write Through and Write Back operation.]
Example: 1KB, Direct-Mapped, 32B Blocks
For a 1024-byte (2^10) cache with 32-byte blocks:
The uppermost 22 (= 32 - 10) address bits are the Cache Tag.
The lowest 5 address bits are the Byte Select (block size = 2^5).
The next 5 address bits (bit 5 through bit 9) are the Cache Index.
[Figure: the 32-bit address splits into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). The tag and a valid bit are stored as part of the cache "state"; the data array holds 32 bytes per line: bytes 0-31, 32-63, ..., 992-1023.]

Let's use a specific example with realistic numbers: assume we have a 1 KB direct-mapped cache with a block size of 32 bytes. In other words, each block associated with a cache tag holds 32 bytes. With a block size of 32 bytes, the 5 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10, that is 22, bits of the address are stored as the cache tag. The address bits in the middle, bits 5 through 9, are used as the cache index to select the proper cache entry.
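Working the same bit counts as equations; the 32-bit address 0x14020 below is my own assumed example, chosen so that its fields match the tag and index values (0x50 and 0x01) appearing on the following slides:

$$\text{tag bits} = 32 - \log_2 1024 = 22, \quad \text{index bits} = \log_2(1024/32) = 5, \quad \text{byte-select bits} = \log_2 32 = 5$$

$$\mathtt{0x14020} = \underbrace{\mathtt{0x50}}_{\text{tag (bits 31-10)}} \times 2^{10} + \underbrace{\mathtt{0x01}}_{\text{index (bits 9-5)}} \times 2^{5} + \underbrace{\mathtt{0x00}}_{\text{byte select (bits 4-0)}}$$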
Example 1a: 1KB Direct Mapped Cache
[Figure: access with Cache Tag = 0x0002fe, Cache Index = 0x00, Byte Select = 0x00. The entry at index 0 is invalid (tag shown as 0xxxxxxx), so the tag comparison fails: Cache Miss. Index 1 holds valid tag 0x000050 and index 2 holds valid tag 0x004440.]
Example 1b: 1KB Direct Mapped Cache
[Figure: after the miss of Example 1a, a new block of data is fetched from memory and installed at index 0: its valid bit is set and its tag field is written with 0x0002fe.]
Example 2: 1KB Direct Mapped Cache
[Figure: access with Cache Tag = 0x000050, Cache Index = 0x01, Byte Select = 0x08. Index 1 is valid and its stored tag 0x000050 matches the incoming tag: Cache Hit; the byte select picks byte 8 of that block.]
Example 3a: 1KB Direct Mapped Cache
[Figure: access with Cache Tag = 0x002450, Cache Index = 0x02, Byte Select = 0x04. Index 2 is valid but its stored tag (0x004440) does not match the incoming tag: Cache Miss.]
Example 3b: 1KB Direct Mapped Cache
[Figure: after the miss of Example 3a, the block at index 2 is replaced: a new block of data is brought in and the tag field at index 2 is rewritten to 0x002450.]
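The whole sequence of Examples 1a through 3b can be replayed with a tiny direct-mapped model. This is my own sketch, not course code; the initial cache state and the full 32-bit addresses are assumptions reconstructed from the tag/index/byte fields shown on the slides:

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 5                /* 32-byte blocks */
    #define INDEX_BITS  5                /* 32 cache lines */
    #define NUM_LINES   (1 << INDEX_BITS)

    typedef struct { int valid; uint32_t tag; } Line;
    static Line cache[NUM_LINES];

    /* Returns 1 on hit, 0 on miss; a miss installs the new tag (Examples 1b, 3b). */
    static int access_cache(uint32_t addr)
    {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        if (cache[index].valid && cache[index].tag == tag)
            return 1;
        cache[index].valid = 1;
        cache[index].tag   = tag;
        return 0;
    }

    int main(void)
    {
        /* Assumed initial state matching the figures: index 1 holds tag
           0x000050, index 2 holds tag 0x004440, index 0 is invalid. */
        cache[1] = (Line){1, 0x000050};
        cache[2] = (Line){1, 0x004440};

        uint32_t trace[] = {
            (0x0002feu << 10) | (0x00u << 5) | 0x00,  /* Example 1a: miss (invalid entry) */
            (0x000050u << 10) | (0x01u << 5) | 0x08,  /* Example 2:  hit                  */
            (0x002450u << 10) | (0x02u << 5) | 0x04,  /* Example 3a: miss (tag mismatch)  */
        };
        for (int i = 0; i < 3; i++)
            printf("0x%08x -> %s\n", (unsigned)trace[i],
                   access_cache(trace[i]) ? "hit" : "miss");
        return 0;
    }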
A-way Set-Associative Cache
A-way set associative: A entries for each Cache Index; in effect, A direct-mapped caches operating in parallel.
Example: two-way set-associative cache.
The Cache Index selects a "set" from the cache.
The two tags in the set are compared in parallel.
Data is selected based on the result of the tag comparison.
[Figure: the Cache Index selects one set; two tag comparators work in parallel, their outputs are ORed into the Hit signal, and SEL0/SEL1 drive a mux that picks Cache Block 0 or Cache Block 1.]

This is called a two-way set-associative cache because there are two cache entries for each cache index; essentially, you have two direct-mapped caches working in parallel. Here is how it works: the cache index selects a set from the cache, and the two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit, and we select the data on the side where the tag match occurred. Simple enough. What are its disadvantages?
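A sketch of the lookup path just described, with the two parallel comparators rendered as a loop; this is my own illustration, not the course's code:

    #include <stdint.h>

    typedef struct { int valid; uint32_t tag; } Way;
    typedef struct { Way way[2]; } Set;      /* two entries per cache index */

    /* Returns 1 on hit and sets *sel (the mux control, SEL0/SEL1); 0 on miss.
       Hardware compares both tags in parallel; the loop is a software stand-in. */
    static int lookup(const Set *s, uint32_t addr_tag, int *sel)
    {
        for (int w = 0; w < 2; w++) {
            if (s->way[w].valid && s->way[w].tag == addr_tag) {
                *sel = w;                    /* selects Cache Block 0 or 1 */
                return 1;                    /* the OR of the two comparators */
            }
        }
        return 0;
    }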
Fully Associative Cache
Push the set-associative idea to its limit! Forget about the Cache Index; compare the Cache Tags of all cache entries in parallel. Example: with Block Size = 32 B, we need N 27-bit comparators.
[Figure: each entry stores a 27-bit Cache Tag (address bits 31-5) plus a valid bit and 32 bytes of data; the incoming tag is compared against every entry at once, with only the Byte Select (example 0x01) excluded from the comparison.]

While the direct-mapped cache is at the simple end of the cache design spectrum, the fully associative cache is at the most complex end. It is the N-way set-associative cache carried to the extreme, where N is the number of entries in the cache. In other words, we don't bother to use any address bits as a cache index. We store all the upper bits of the address (everything except the byte select) as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the matching entry's data is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive; usually, a fully associative cache is limited to 64 or fewer entries. Since we do no mapping with a cache index, we will never push an item out of the cache because multiple memory locations map to the same cache location; by definition, the conflict miss rate is zero for a fully associative cache. This does not, however, mean the overall miss rate will be zero. Assume we have 64 entries here: the first 64 items we access can all fit, but when we try to bring in the 65th item, we will need to throw one of them out to make room. This brings us to the third type of cache miss: the capacity miss.
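Pushed to the limit, the lookup scans every entry. A sketch assuming 64 entries and the 27-bit tags from the slide; again, the loop stands in for 64 parallel comparators:

    #include <stdint.h>

    #define ENTRIES 64

    typedef struct { int valid; uint32_t tag; } Entry;   /* tag = address bits 31-5 */

    /* Associative lookup: no index, every stored tag is compared. */
    static int fa_lookup(const Entry cache[ENTRIES], uint32_t addr)
    {
        uint32_t tag = addr >> 5;            /* everything above the Byte Select */
        for (int i = 0; i < ENTRIES; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return i;                    /* hit: entry index drives the data mux */
        return -1;                           /* miss */
    }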
Cache Shapes

For a fixed total of 16 blocks, associativity A and the number of sets S trade off against each other (A × S = 16):
Direct-mapped (A = 1, S = 16)
2-way set-associative (A = 2, S = 8)
4-way set-associative (A = 4, S = 4)
8-way set-associative (A = 8, S = 2)
Fully associative (A = 16, S = 1)
Need for Block Replacement Policy
Direct-mapped cache: each memory location can be mapped to only one cache location, so there is no decision to make :-) The current item replaces the previous item in that cache location.
N-way set-associative cache: each memory location has a choice of N cache locations.
Fully associative cache: each memory location can be placed in ANY cache location.
On a cache miss in an N-way set-associative or fully associative cache, we bring in the new block from memory and throw out a cache block to make room for it. We need to decide which block to throw out!

In a direct-mapped cache, each memory location can go to one and only one cache location, so on a cache miss we don't have to decide which block to throw out; we just evict the item that was originally in the cache and keep the new one. As a matter of fact, that is what causes direct-mapped caches trouble to begin with: every memory location has only one place to go in the cache, hence high conflict misses. In an N-way set-associative cache, each memory location can go to one of N cache locations, while in a fully associative cache it can go to ANY cache location. So when we have a miss in an N-way set-associative or fully associative cache, we have a slight problem: we need to decide which block to throw out. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she has a higher probability of getting killed in a multi-engine airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine airplane, you have one option, you land. But in a multi-engine airplane with one engine stopped, you have a lot of options, and it is the need to make a decision that kills the indecisive.
Cache Block Replacement Policies
Random Replacement: hardware randomly selects a cache entry and throws it out.
Least Recently Used (LRU): hardware keeps track of the access history and replaces the entry that has not been used for the longest time. For a two-way set-associative cache, one bit per set suffices for LRU replacement.
Example of a simple "pseudo" LRU implementation: assume 64 fully associative entries. A hardware replacement pointer points to one cache entry. Whenever an access is made to the entry the pointer points to, move the pointer to the next entry; otherwise, do not move the pointer.
[Figure: a replacement pointer selecting among Entry 0 through Entry 63.]

If you are one of those indecisive people, do I have a replacement policy for you: Random. Just throw out any entry you like; flip a coin, throw darts, whatever. The important thing is to make a decision and move on. Jokes aside, a good random replacement policy is often sufficient. If you really want to be fancy, you can adopt the Least Recently Used algorithm and throw out the entry that has not been accessed for the longest time. It is hard to implement a true LRU algorithm in hardware, but here is an example of a "pseudo" LRU with very simple hardware. Build a hardware replacement pointer, most likely a shift register, that selects the entry you will throw out next time you have to make room for a new item. Start the pointer at any location. During normal operation, whenever an access is made to the entry the pointer points to, move the pointer to the next entry. At any given time, statistically speaking, the pointer is more likely to rest on an entry that has not been used very often than on an entry that is heavily used.
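The replacement-pointer scheme on this slide translates almost directly into code; this is a sketch of the described hardware, assuming the slide's 64 fully associative entries:

    #define ENTRIES 64

    static int replacement_ptr = 0;          /* points at the next candidate victim */

    /* On every access: if the accessed entry is the one under the pointer,
       move the pointer along; otherwise leave it alone. Over time the pointer
       tends to rest on entries that are rarely used. */
    static void note_access(int entry)
    {
        if (entry == replacement_ptr)
            replacement_ptr = (replacement_ptr + 1) % ENTRIES;
    }

    /* On a miss that needs a victim: evict the entry under the pointer. */
    static int choose_victim(void)
    {
        return replacement_ptr;
    }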
Cache Write Policy

A cache read is much easier to handle than a cache write; an instruction cache is thus much easier to design than a data cache. The problem with cache writes: how do we keep the data in the cache and in memory consistent? Two options (decision time again :-):
Write Back: write to the cache only, and write the cache block to memory when that block is being replaced on a cache miss. Needs a "dirty bit" for each cache block. Greatly reduces the memory bandwidth requirement, but the control can be complex.
Write Through: write to the cache and to memory at the same time. What!!! How can this be? Isn't memory too slow for this?

So far, we have been thinking mostly in terms of cache reads, because reads are much easier to handle than writes; that is why, in general, an instruction cache is much easier to build than a data cache. The problem with cache writes is that we have to keep the cache and memory consistent, and there are two options we can follow. The first is write back: you write only to the cache, and write the cache block back to memory only when the block is replaced on a cache miss. To take full advantage of write back, you need a dirty bit for each block to indicate whether you have written into that cache block before. A write-back cache can greatly reduce the memory bandwidth requirement, but it also requires rather complex control. The second option is write through: you write to the cache and to memory at the same time. You probably say: WHAT? You can't do that, memory is too slow.
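The two policies differ only in when memory gets updated. A hedged sketch; the memory array, 32-byte line layout, and helper names are stand-ins of my own:

    #include <stdint.h>
    #include <string.h>

    enum policy { WRITE_THROUGH, WRITE_BACK };

    typedef struct { int valid, dirty; uint32_t tag; uint8_t data[32]; } Line;

    static uint8_t memory[1 << 20];          /* stand-in for DRAM */

    static void write_hit(Line *line, uint32_t addr, uint8_t value, enum policy p)
    {
        line->data[addr & 31] = value;       /* both policies update the cache */
        if (p == WRITE_BACK)
            line->dirty = 1;                 /* remember: memory is now stale */
        else
            memory[addr] = value;            /* write-through: memory updated too */
    }

    /* Write-back pays its memory traffic here, once, at replacement time. */
    static void evict(Line *line, uint32_t block_addr)
    {
        if (line->dirty)
            memcpy(&memory[block_addr], line->data, 32);
        line->valid = line->dirty = 0;
    }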
Write Buffer for Write Through
A Write Buffer is needed between the cache and main memory.
Processor: writes data into the cache and the write buffer.
Memory controller: writes the contents of the buffer to memory.
The write buffer is just a FIFO; a typical number of entries is 4.
Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle.
The memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, causing write buffer saturation.
[Figure: Processor feeding both the Cache and the Write Buffer, which drains into DRAM.]

You are right, memory is too slow. We don't really write to the memory directly; we write to a write buffer. Once the data is written into the write buffer (and assuming a cache hit), the CPU is done with the write; the memory controller then moves the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice that I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to main memory. If the stores come too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer, and the CPU must stop and wait.
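The buffer described is just a small FIFO sitting between cache and DRAM. A sketch with the slide's typical four entries:

    #include <stdint.h>

    #define WB_ENTRIES 4                     /* typical size from the slide */

    typedef struct { uint32_t addr, data; } Store;
    typedef struct { Store e[WB_ENTRIES]; int head, tail, count; } WriteBuffer;

    /* Processor side: returns 0 if the buffer is full (the CPU must stall). */
    static int wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data)
    {
        if (wb->count == WB_ENTRIES)
            return 0;                        /* write buffer saturation */
        wb->e[wb->tail] = (Store){addr, data};
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return 1;                            /* CPU proceeds without waiting on DRAM */
    }

    /* Memory-controller side: drains one store per DRAM write cycle. */
    static int wb_pop(WriteBuffer *wb, Store *out)
    {
        if (wb->count == 0)
            return 0;
        *out = wb->e[wb->head];
        wb->head = (wb->head + 1) % WB_ENTRIES;
        wb->count--;
        return 1;
    }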
Write Buffer Saturation
Store frequency (w.r.t. time) > 1 / DRAM write cycle: if this condition exists for a long period of time (CPU cycle time too short and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it, because CPU cycle time << DRAM write cycle time.
Solutions for write buffer saturation: use a write-back cache, or install a second-level (L2) cache.
[Figure: Processor -> Cache -> Write Buffer -> L2 -> DRAM.]

A memory system designer's nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time. In that case, it does NOT matter how big you make the write buffer: it will still overflow, because you are simply feeding things in faster than you can empty it. This is called write buffer saturation; I have seen it happen in simulation, and when it happens your processor runs at DRAM cycle time, very, very slowly. The first solution is to get rid of the write buffer and replace the write-through cache with a write-back cache. Another solution is to install a second-level cache between the write buffer and memory and make the second level write back.
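Putting assumed numbers on the saturation condition (mine, not the slide's): the buffer drains at a rate of at most $1/t_{\text{DRAM write}}$, so it saturates whenever

$$f_{\text{store}} > \frac{1}{t_{\text{DRAM write}}}$$

For instance, with a 100 ns DRAM write cycle, the buffer empties at most $10^7$ stores per second; a processor issuing one store every 20 ns generates $5 \times 10^7$ stores per second, so any finite buffer eventually overflows.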
Write Allocate versus Not Allocate
Assume that a 16-bit write to memory location 0x00 causes a cache miss. Do we read in the rest of the block?
Yes: Write Allocate. No: Write No-Allocate.
[Figure: the 1 KB direct-mapped cache again, now with a dirty bit per line alongside the valid bit and cache tag (example tag 0x00, index 0x00).]

Let's look at our 1 KB direct-mapped cache again, and assume a 16-bit write to memory location 0x00 causes a cache miss. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2 through Byte 31) from memory? If we do read the rest of the block in, the policy is called write allocate. But stop and think for a second: is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access those bytes soon, but the type of access we are likely to make is another write. So even if we do read in the data, we may end up overwriting it anyway, and it is common practice NOT to read in the rest of the block on a write miss. If you don't bring in the rest of the block, or, to use the more technical term, write no-allocate, you had better have some way to tell the processor that the rest of the block is no longer valid. This brings us to the topic of sub-blocking.
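A sketch of the decision taken on a write miss. The memory array, tag arithmetic, and layout are my own assumptions, using the 1 KB / 32 B geometry from this slide, and the 16-bit write is assumed to be aligned within the block:

    #include <stdint.h>
    #include <string.h>

    typedef struct { int valid, dirty; uint32_t tag; uint8_t data[32]; } Line;

    static uint8_t memory[1 << 20];          /* stand-in for DRAM */

    /* A 16-bit write that missed in the cache: 'allocate' picks the policy. */
    static void write_miss(Line *line, uint32_t addr, uint16_t value, int allocate)
    {
        if (allocate) {                                    /* write allocate */
            memcpy(line->data, &memory[addr & ~31u], 32);  /* read in the whole block */
            line->tag   = addr >> 10;                      /* 1 KB cache: tag = bits 31-10 */
            line->valid = 1;
            memcpy(&line->data[addr & 31], &value, 2);     /* then perform the write */
            line->dirty = 1;
        } else {                                           /* write no-allocate */
            memcpy(&memory[addr], &value, 2);              /* bypass the cache entirely */
        }
    }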
Four Questions for Memory Hierarchy
Where can a block be placed in the upper level? (Block placement)
How is a block found if it is in the upper level? (Block identification)
Which block should be replaced on a miss? (Block replacement)
What happens on a write? (Write strategy)