1
CPS220 Computer System Organization Lecture 18: The Memory Hierarchy
Alvin R. Lebeck Fall 1999 Start: X:40
2
Outline of Today’s Lecture
The Memory Hierarchy
Review: Direct Mapped Cache
Review: Two-Way Set Associative Cache
Review: Fully Associative Cache
Review: Replacement Policies
Review: Cache Write Strategies
Reducing cache misses:
  Choosing block size
  Associativity and the 2:1 rule
  Victim buffer
  Prefetching
  Software techniques

Speaker notes: Here is an outline of today's lecture. In the next 15 minutes I will give you an overview, a big picture, of what memory system design is all about. Then I will spend some time showing you the technology (SRAM & DRAM) that drives memory system design. Finally, I will give you a real-life example by showing you what the SPARCstation 20's memory system looks like. +1 = 5 min. (X:45)
3
Review: Who Cares About the Memory Hierarchy?
Processor only thus far in the course: CPU cost/performance, ISA, pipelined execution
The CPU-DRAM gap:
  1980: no cache in the µproc
  1995: 2-level cache, 60% of transistors on the Alpha µproc (150 clock cycles for a miss!)
4
Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)
5
A Direct Mapped cache is an array of fixed-size blocks.
Each block holds consecutive bytes of main memory data. The tag array holds the block's memory address, and a valid bit associated with each cache block tells whether the data is valid.
  Cache Index: the location of a block (and its tag) in the cache
  Block Offset: the byte location within the cache block

  Cache-Index = (<Address> mod Cache_Size) / Block_Size
  Block-Offset = <Address> mod Block_Size
  Tag = <Address> / Cache_Size

Address fields: Tag (32-M bits) | Cache Index (M-L bits) | block offset (L bits). (A small sketch of this decomposition follows below.)

Speaker notes: Let's look at some definitions here. A cache block is defined as the block of cache data that has its own cache tag. Our previous 4-byte direct mapped cache is at the extreme end of the spectrum where the block size is one byte, because each byte has its own cache tag. This extreme arrangement does not take advantage of spatial locality, which states that if a byte is referenced, adjacent bytes will tend to be referenced soon. Well, no kidding. When was the last time you wrote a program that accesses one byte at a time? Even if you just fetch one instruction, you are accessing 4 bytes at a time. So without further insulting your intelligence, let's look at bigger block sizes. How big should we make the block size? Like anything in life, it is a tradeoff. But before we look at the tradeoff, let's see how block size affects the cache index and cache tag arrangement. +2 = 28 min. (Y:08)
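As a concrete illustration of the index/offset/tag formulas above, here is a minimal C sketch; the cache and block sizes (1 KB, 32 bytes) and the sample address are illustrative assumptions, not parameters from the slide.

/* Sketch: decoding a byte address for a direct mapped cache.
   Assumed (hypothetical) parameters: 1 KB cache, 32-byte blocks. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE 1024u                       /* bytes */
#define BLOCK_SIZE 32u                         /* bytes */
#define NUM_BLOCKS (CACHE_SIZE / BLOCK_SIZE)   /* 32 block frames */

int main(void) {
    uint32_t addr = 0x12345678u;                              /* example address */
    uint32_t block_offset = addr % BLOCK_SIZE;                /* byte within the block */
    uint32_t cache_index  = (addr % CACHE_SIZE) / BLOCK_SIZE; /* which block frame */
    uint32_t tag          = addr / CACHE_SIZE;                /* upper address bits */
    printf("offset=%u index=%u tag=0x%x\n",
           (unsigned)block_offset, (unsigned)cache_index, (unsigned)tag);
    return 0;
}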
6
An N-way Set Associative Cache
N-way set associative: N entries for each cache index
  N direct mapped caches operating in parallel
Example: a two-way set associative cache
  The cache index selects a "set" from the cache
  The two tags in the set are compared in parallel
  Data is selected based on the tag comparison result (see the sketch below)

[Figure: two-way set associative cache. Each way holds a valid bit, cache tag, and cache data block; the cache index selects the set, both tags are compared against the address tag, a mux (SEL0/SEL1) picks the matching way's cache block, and the OR of the comparisons produces the hit signal.]

Speaker notes: This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct mapped caches working in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we select the data on the side where the tag match occurs. This is simple enough. What are its disadvantages? +1 = 36 min. (Y:16)
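A minimal software sketch of the lookup just described, using hypothetical sizes and structures (hardware compares the two tags in parallel; the loop here is only for illustration):

/* Sketch of a 2-way set associative lookup (sizes and structures are assumptions). */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS   16
#define BLOCK_SIZE 32
#define WAYS       2

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK_SIZE]; };
static struct line cache[NUM_SETS][WAYS];

/* Returns a pointer to the requested byte on a hit, NULL on a miss. */
uint8_t *lookup(uint32_t addr) {
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);
    for (int w = 0; w < WAYS; w++)        /* both ways checked; done in parallel in hardware */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return &cache[index][w].data[offset];
    return NULL;                           /* miss */
}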
7
And Yet Another Extreme Example: The Fully Associative Cache
Fully associative cache: push the set associative idea to its limit!
  Forget about the cache index
  Compare the cache tags of all cache entries in parallel
  Example: with 32-byte blocks, we need N 27-bit comparators (address bits 31-5 form the tag; bits 4-0 are the byte select, e.g. 0x01)
By definition: conflict misses = 0 for a fully associative cache

[Figure: fully associative cache. Every entry holds a valid bit, a 27-bit cache tag, and a 32-byte data block (Byte 0 ... Byte 31, Byte 32 ... Byte 63, ...), with one tag comparator per entry.]

Speaker notes: While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end. It is the N-way set associative cache carried to the extreme, where N is the number of cache entries. In other words, we don't even bother to use any address bits as the cache index. We just store all the upper bits of the address (except the byte select) as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel, and only the entry that matches is sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive; usually a fully associative cache is limited to 64 or fewer entries. Since we are not doing any mapping with a cache index, we will never push an item out of the cache because multiple memory locations map to the same cache location. Therefore, by definition, the conflict miss rate is zero for a fully associative cache. This, however, does not mean the overall miss rate will be zero. Assume we have 64 entries here. The first 64 items we access can fit, but when we try to bring in the 65th item, we need to throw one of them out to make room. This brings us to the third type of cache miss: the capacity miss. +3 = 41 min. (Y:21)
8
The Need to Make a Decision!
Direct Mapped Cache:
  Each memory location can be mapped to only 1 cache location
  No need to make any decision :-)
  The current item replaces the previous item in that cache location
N-way Set Associative Cache:
  Each memory location has a choice of N cache locations
Fully Associative Cache:
  Each memory location can be placed in ANY cache location
Cache miss in an N-way set associative or fully associative cache:
  Bring in the new block from memory
  Throw out a cache block to make room for the new block
  We need to make a decision on which block to throw out!

Speaker notes: In a direct mapped cache, each memory location can go to one and only one cache location, so on a cache miss we don't have to make a decision about which block to throw out. We just throw out the item that was originally in the cache and keep the new one. As a matter of fact, that is what causes direct mapped caches trouble to begin with: every memory location has only one place to go in the cache, hence high conflict misses. In an N-way set associative cache, each memory location can go to one of N cache locations, while in the fully associative cache each memory location can go to ANY cache location. So when we have a cache miss in an N-way set associative or fully associative cache, we have a slight problem: we need to decide which block to throw out. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if the engine fails, a pilot has a higher probability of getting killed in a multi-engine airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine airplane, you have one option, you land. But in a multi-engine airplane with one engine stopped, you have a lot of options. It is the need to make a decision that kills those indecisive people. +3 = 54 min. (Y:34)
9
Cache Block Replacement Policy
Random Replacement:
  Hardware randomly selects a cache item and throws it out
Least Recently Used (LRU):
  Hardware keeps track of the access history
  Replace the entry that has not been used for the longest time
  For a two-way set associative cache, one bit per set is enough for LRU replacement
Example of a simple "pseudo" least recently used implementation (see the sketch below):
  Assume 64 fully associative entries
  A hardware replacement pointer points to one cache entry
  Whenever an access is made to the entry the pointer points to: move the pointer to the next entry
  Otherwise: do not move the pointer

[Figure: 64 fully associative entries (Entry 0 ... Entry 63) with a replacement pointer.]

Speaker notes: If you are one of those indecisive people, do I have a replacement policy for you: Random. Just throw away any one you like. Flip a coin, throw darts, whatever; you don't care. The important thing is to make a decision and move on. Just don't crash. All jokes aside, a lot of the time a good random replacement policy is sufficient. If you really want to be fancy, you can adopt the Least Recently Used algorithm and throw out the one that has not been accessed for the longest time. It is hard to implement a true LRU algorithm in hardware. Here is an example where you can do a "pseudo" LRU with very simple hardware. Build a hardware replacement pointer, most likely a shift register, that selects the entry you will throw out the next time you have to make room for a new item. We can start the pointer at any random location. During normal access, whenever an access is made to the entry that the pointer points to, you move the pointer to the next entry. So at any given time, statistically speaking, the pointer is more likely to end up at an entry that has not been used very often than at an entry that is heavily used. +2 = 56 min. (Y:36)
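A minimal C sketch of the replacement-pointer scheme just described, using the 64-entry fully associative cache from the example. Whether the pointer also advances after an eviction is not specified on the slide, so that detail is an assumption here.

/* Sketch of the "replacement pointer" pseudo-LRU scheme (64-entry example). */
#define NUM_ENTRIES 64

static unsigned replacement_ptr = 0;   /* candidate victim entry */

/* Call on every access that hits entry `e`: the pointer only moves
   when its current target is the entry being used. */
void on_access(unsigned e) {
    if (e == replacement_ptr)
        replacement_ptr = (replacement_ptr + 1) % NUM_ENTRIES;
}

/* Call on a miss: evict the pointed-to entry.
   Advancing the pointer afterwards is an assumption, not from the slide. */
unsigned choose_victim(void) {
    unsigned victim = replacement_ptr;
    replacement_ptr = (replacement_ptr + 1) % NUM_ENTRIES;
    return victim;
}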
10
Cache Write Policy: Write Through versus Write Back
Cache reads are much easier to handle than cache writes:
  An instruction cache is much easier to design than a data cache
Cache writes: how do we keep the data in the cache and memory consistent?
Two options (decision time again :-)
  Write Back: write to the cache only. Write the cache block back to memory when that cache block is being replaced on a cache miss.
    Needs a "dirty bit" for each cache block
    Greatly reduces the memory bandwidth requirement
    Control can be complex
  Write Through: write to the cache and memory at the same time.
    What!!! How can this be? Isn't memory too slow for this?

Speaker notes: So far we have been thinking mostly in terms of cache reads, not cache writes, because reads are much easier to handle. That's why, in general, an instruction cache is much easier to build than a data cache. The problem with cache writes is that we have to keep the cache and memory consistent. There are two options. The first is write back: you write only to the cache, and you write the cache block back to memory only when the block is being replaced on a cache miss. To take full advantage of write back, you need a dirty bit for each block to indicate whether you have written into that cache block. A write back cache can greatly reduce the memory bandwidth requirement, but it also requires rather complex control. The second option is write through: you write to the cache and memory at the same time. You probably say: WHAT? You can't do that, memory is too slow. +2 = 58 min. (Y:38)
11
Write Buffer for Write Through
[Figure: processor writes go to the Cache and to a Write Buffer; the Write Buffer drains to DRAM.]

A Write Buffer is needed between the cache and memory
  Processor: writes data into the cache and the write buffer
  Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO (a sketch follows below):
  Typical number of entries: 4
  Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
Memory system designer's nightmare:
  Store frequency (w.r.t. time) > 1 / DRAM write cycle
  Write buffer saturation

Speaker notes: You are right, memory is too slow. We really don't write to the memory directly; we write to a write buffer. Once the data is written into the write buffer (and assuming a cache hit), the CPU is done with the write. The memory controller then moves the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice that I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets an upper limit on how frequently you can write to main memory. If the stores come too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer and the CPU must stop and wait. +2 = 60 min. (Y:40)
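A minimal sketch of such a FIFO write buffer in C. The entry count of 4 matches the slide, but the structure and function names are illustrative assumptions.

/* Sketch of a 4-entry FIFO write buffer between the cache and DRAM. */
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { uint32_t addr; uint32_t data; };

static struct wb_entry buf[WB_ENTRIES];
static unsigned head = 0, tail = 0, count = 0;

/* Processor side: returns false (stall) if the buffer is full. */
bool wb_push(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;         /* write buffer saturation */
    buf[tail] = (struct wb_entry){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;
}

/* Memory controller side: drains one entry per DRAM write cycle. */
bool wb_pop(struct wb_entry *out) {
    if (count == 0) return false;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return true;
}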
12
Write Buffer Saturation
[Figure: Processor, Cache, Write Buffer, DRAM.]

Store frequency (w.r.t. time) -> 1 / DRAM write cycle
If this condition exists for a long period of time (CPU cycle time too fast and/or too many store instructions in a row):
  The store buffer will overflow no matter how big you make it
  (The CPU cycle time << DRAM write cycle time)
Solutions for write buffer saturation:
  Use a write back cache
  Install a second-level (L2) cache: store compression

Speaker notes: A memory system designer's nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time. We call this write buffer saturation. In that case it does NOT matter how big you make the write buffer; it will still overflow because you are simply feeding things into it faster than you can empty it. I have seen this happen in simulation, and when it happens your processor will be running at DRAM cycle time, which is very, very slow. The first solution for write buffer saturation is to get rid of the write buffer and replace the write through cache with a write back cache. Another solution is to install a second-level cache between the write buffer and memory and make the second level write back. +2 = 62 min. (Y:42)

[Figure: Processor, Cache, Write Buffer, L2 Cache, DRAM.]
13
What is a Sub-block?
A sub-block is a unit within a block that has its own valid bit:
  All sub-blocks in a block share one cache tag
  Each sub-block within the block has its own valid bit
Example: 1 KB direct mapped cache, 32-B blocks, 8-B sub-blocks
  Each cache entry has 32/8 = 4 valid bits (see the sketch below)
Write miss: only the bytes in that sub-block are brought in, which reduces the cache fill bandwidth (penalty).

[Figure: a cache entry with a cache tag, four sub-block valid bits (SB0-SB3), and 32 bytes of data (B0 ... B31) divided into sub-blocks 0-3; the 1 KB cache holds Byte 0 ... Byte 1023.]

Speaker notes: A sub-block is defined as a unit within a block that has its own valid bit. For example, if each 32-byte block is divided into four 8-byte sub-blocks, we need 4 valid bits per block. On a cache write miss, we only need to bring in the data for the sub-block we are writing to and set the valid bits of all other sub-blocks to zero. This works fine for writes. For reads, we still want to bring in the rest of the block, so the time it takes to fill the block is still important. Modern computers have several ways to reduce the time to fill the cache. +2 = 66 min. (Y:46)
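A minimal C sketch of per-block sub-block valid bits kept as a bitmask, using the 1 KB / 32-B / 8-B parameters from the example; the structure and function names are illustrative assumptions.

/* Sketch: sub-block valid bits for a 32-B block with 8-B sub-blocks. */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE    32
#define SUBBLOCK_SIZE 8
#define SUBBLOCKS     (BLOCK_SIZE / SUBBLOCK_SIZE)   /* 4 valid bits per block */

struct cache_entry {
    uint32_t tag;
    uint8_t  valid_mask;                /* bit i set => sub-block i valid */
    uint8_t  data[BLOCK_SIZE];
};

/* On a write miss to a new block, mark only the addressed sub-block valid
   (all other sub-blocks become invalid, as described in the notes above). */
void write_miss_fill(struct cache_entry *e, uint32_t addr, uint32_t new_tag) {
    unsigned sb = (addr % BLOCK_SIZE) / SUBBLOCK_SIZE;
    e->tag = new_tag;
    e->valid_mask = (uint8_t)(1u << sb);
    /* fetch SUBBLOCK_SIZE bytes into e->data[sb * SUBBLOCK_SIZE ...] */
}

bool subblock_valid(const struct cache_entry *e, uint32_t addr) {
    unsigned sb = (addr % BLOCK_SIZE) / SUBBLOCK_SIZE;
    return (e->valid_mask >> sb) & 1u;
}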
14
Cache Performance

CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time

Misses per instruction = Memory accesses per instruction x Miss rate

CPU time = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time
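As a quick worked example with illustrative numbers (these values are assumptions, not data from the lecture): suppose CPI_execution = 1.0, 1.5 memory accesses per instruction, a 5% miss rate, and a 50-cycle miss penalty. Then:

Misses per instruction = 1.5 x 0.05 = 0.075
CPU time = IC x (1.0 + 0.075 x 50) x Clock cycle time = IC x 4.75 x Clock cycle time

Memory stalls contribute 3.75 of the 4.75 cycles per instruction, which is why the following slides focus on reducing the miss rate and the miss penalty.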
15
Improving Cache Performance
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.
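These three levers map onto the terms of the usual average memory access time expression (a standard formula, added here for reference rather than taken from the slide):

Average memory access time = Hit time + Miss rate x Miss penalty

Reducing the miss rate shrinks the middle factor, reducing the miss penalty shrinks the last factor, and reducing the hit time shrinks the first term.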
16
Sources of Cache Misses
Compulsory (cold start, process migration, first reference): the first access to a block (misses even in an infinite cache)
  Solutions: prefetching? larger blocks?
Conflict (collision): multiple memory locations mapped to the same cache location (misses in an N-way associative, size X cache)
  Solution 1: increase the cache size
  Solution 2: increase the associativity
Capacity: the cache cannot contain all the blocks accessed by the program (misses in a size X cache)
  Solution: increase the cache size
Invalidation: another process (e.g., I/O) updates memory

Speaker notes: Capacity misses are cache misses due to the fact that the cache is simply not large enough to contain all the blocks accessed by the program. The solution to reduce the capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache miss we talked about. First are the compulsory misses. These are the misses we cannot avoid; they are caused when we first start the program. Then we talked about conflict misses. They are caused by multiple memory locations being mapped to the same cache location. There are two ways to reduce conflict misses. The first is, once again, to increase the cache size. The second is to increase the associativity, for example by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that the cache miss rate is only one part of the equation; you also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache misses we will not cover today: invalidation misses, caused when another process, such as I/O, updates main memory, so you have to flush the cache to avoid inconsistency between memory and cache. +2 = 43 min. (Y:23)
17
Sources of Cache Misses
                    Direct Mapped    N-way Set Associative    Fully Associative
Cache Size          Big              Medium                   Small
Compulsory Miss     Same             Same                     Same
Conflict Miss       High             Medium                   Zero
Capacity Miss       Low(er)          Medium                   High
Invalidation Miss   Same             Same                     Same

Note: If you are going to run "billions" of instructions, compulsory misses are insignificant.
18
3Cs Absolute Miss Rate
[Figure: absolute miss rate per type (conflict, capacity, compulsory) vs. cache size, 1-128 KB.]
19
2:1 Cache Rule
[Figure: miss rate per type (conflict, capacity, compulsory) vs. cache size, 1-128 KB, illustrating the 2:1 cache rule.]
20
3Cs Relative Miss Rate
[Figure: relative (percentage) miss rate per type vs. cache size, 1-128 KB: conflict misses for 1-way, 2-way, 4-way, and 8-way associativity, plus capacity and compulsory misses.]
21
1. Reduce Misses via Larger Block Size
22
2. Reduce Misses via Higher Associativity
2:1 Cache Rule: Miss rate of a direct mapped cache of size N ≈ miss rate of a 2-way set associative cache of size N/2
Beware: execution time is the only final measure!
  Will the clock cycle time increase?
  Hill [1988] suggested hit time increases about +10% for an external cache and +2% for an internal cache for 2-way vs. 1-way
23
Example: Avg. Memory Access Time vs. Miss Rate
Example: assume CCT = 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, vs. the CCT of a direct mapped cache.
[Table: average memory access time by cache size (KB) and associativity (1-way, 2-way, 4-way, 8-way); red entries mean A.M.A.T. is not improved by more associativity.]
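One worked case shows how the tradeoff plays out (the hit time, miss penalty, and miss rates here are illustrative assumptions, not the table's data). Assume a 1-cycle hit, a 50-cycle miss penalty, a 5% miss rate direct mapped, and a 4% miss rate 2-way:

Direct mapped: A.M.A.T. = 1 + 0.05 x 50 = 3.50 cycles
2-way (CCT x 1.10): A.M.A.T. = 1 x 1.10 + 0.04 x 50 = 3.10 cycles

Here 2-way wins; with a smaller miss-rate improvement or a larger cycle-time penalty, the direct mapped cache can come out ahead, which is what the red entries in the table indicate.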
24
3. Reducing Conflict Misses via Victim Cache
How to combine the fast hit time of a direct mapped cache yet still avoid conflict misses?
Add a small buffer (the victim cache) to hold data discarded from the cache (see the sketch below)
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct mapped data cache

[Figure: CPU, direct mapped cache (TAG/DATA), a small victim cache (TAG/DATA) holding discarded blocks, and memory.]
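A minimal C sketch of the lookup path with a victim cache. The sizes and structures are assumptions, as is the swap-on-victim-hit behavior (as in Jouppi's proposal); the slide itself only says the buffer holds discarded data.

/* Sketch: direct mapped cache backed by a small victim cache of discarded blocks. */
#include <stdbool.h>
#include <stdint.h>

#define SETS    128     /* direct mapped main cache */
#define VICTIMS 4       /* small fully associative victim cache */

struct line { bool valid; uint32_t block_addr; };   /* victim keeps the full block address */

static struct line dcache[SETS];
static struct line victim[VICTIMS];

bool access_block(uint32_t block_addr) {
    uint32_t index = block_addr % SETS;
    if (dcache[index].valid && dcache[index].block_addr == block_addr)
        return true;                              /* main cache hit */
    for (int v = 0; v < VICTIMS; v++) {
        if (victim[v].valid && victim[v].block_addr == block_addr) {
            struct line tmp = dcache[index];      /* victim hit: swap the two blocks */
            dcache[index] = victim[v];
            victim[v] = tmp;
            return true;
        }
    }
    /* Miss in both: the evicted main-cache line would move to the victim cache
       and the new block would be fetched from memory (refill omitted here). */
    return false;
}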
25
4. Reducing Conflict Misses via Pseudo-Associativity
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, it is a pseudo-hit (slow hit)
Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  Better for caches not tied directly to the processor

[Figure: access time line showing hit time < pseudo hit time < miss penalty.]
26
5. Reducing Misses by HW Prefetching of Instruction & Data
E.g., instruction prefetching:
  The Alpha fetches 2 blocks on a miss
  The extra block is placed in a stream buffer
  On a miss, check the stream buffer
Works with data blocks too:
  Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43%
  Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set associative caches
Prefetching relies on extra memory bandwidth that can be used without penalty
27
6. Reducing Misses by SW Prefetching Data
Data prefetch:
  Load data into a register (HP PA-RISC loads): binding
  Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v9): non-binding
  Special prefetch instructions cannot cause faults; a form of speculative execution
Issuing prefetch instructions takes time:
  Is the cost of issuing prefetches < the savings in reduced misses? (see the sketch below)
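A small sketch of non-binding software prefetching in C, using the GCC/Clang __builtin_prefetch intrinsic rather than the ISA instructions named on the slide (MIPS IV, PowerPC, SPARC v9); the prefetch distance of 16 elements is an illustrative assumption that would be tuned against the miss penalty.

/* Prefetch a[i + D] while working on a[i]; the prefetch cannot fault. */
#define PREFETCH_DISTANCE 16

double sum_array(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);  /* read, low temporal locality */
        sum += a[i];
    }
    return sum;
}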
28
7. Reducing Misses by Compiler Optimizations
Instructions:
  Reorder procedures in memory so as to reduce misses
  Profiling to look at conflicts
  McFarling [1989] reduced cache misses by 75% on an 8 KB direct mapped cache with 4-byte blocks
Data:
  Merging arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  Loop interchange: change the nesting of loops to access data in the order it is stored in memory
  Loop fusion: combine 2 independent loops that have the same looping and some variables in common
  Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
29
Merging Arrays Example
/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key
30
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words
31
Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j]; }

2 misses per access to a & c vs. one miss per access
32
Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    };

Two inner loops:
  Read all NxN elements of z[]
  Read N elements of 1 row of y[] repeatedly
  Write N elements of 1 row of x[]
Capacity misses are a function of N and the cache size:
  If the cache can hold 3 NxN matrices => no capacity misses; otherwise ...
Idea: compute on a BxB submatrix that fits in the cache
33
Blocking Example

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B-1,N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B-1,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            };

Capacity misses reduced from 2N^3 + N^2 to 2N^3/B + N^2
B is called the blocking factor
Conflict misses too?
34
Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative vs. blocking size
  Lam et al. [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, even though both fit in the cache
35
Summary of Compiler Optimizations to Reduce Cache Misses