CEG3420 Computer Design Caches and Virtual Memory

CEG3420 Computer Design Caches and Virtual Memory

Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency) µProc 60%/yr. (2X/1.5yr) 1000 CPU “Moore’s Law” 100 Processor-Memory Performance Gap: (grows 50% / year) Performance 10 DRAM 9%/yr. (2X/10 yrs) Y-axis is performance X-axis is time Latency Cliché: Not e that x86 didn’t have cache on chip until 1989 DRAM 1 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Time

Recap: Static RAM Cell Write: Read: 6-Transistor SRAM Cell word word
(row select) 1 1 bit bit Write: 1. Drive bit lines (bit=1, bit=0) 2.. Select row Read: 1. Precharge bit and bit to Vdd 3. Cell pulls one line low 4. Sense amp on column detects difference between bit and bit bit bit replaced with pullup to save area The classical SRAM cell looks like this. It consists of two back-to-back inverters that serves as a flip-flop. Here is an expanded view of this cell, you can see it consists of 6 transistors. In order to write a value into this cell, you need to drive from both sides. For example, if you want to write a 1, you will drive “bit” to 1 while at the same time, drive “bit bar” to zero. Once the bit lines are driven to their desired values, you will turn on these two transistors by setting the word line to high so the values on the bit lines will be written into the cell. Remember now these are very very tiny transistors so we cannot rely on them to drive these long bit lines effectively during read. Also, the pull down devices are usually much stronger than the pull up devices. So the first thing we need to do on read is to charge these two bit lines to a high values. Once these bit lines are charged to high, we will turn on these two transistors so one of these inverters (the lower one in our example) will start pulling one of the bit line low while the other bit line will remain at HI. It will take this small inverter a long time to drive this long bit line to low but we don’t have to wait that long since all we need to detect the difference between these two bit lines. And if you ask any circuit designer, they will tell you it is much easier to detect a “differential signal” (point to bit and bit bar) than to detect an absolute signal. +2 = 30 min. (Y:10)

Recap: 1-Transistor Memory Cell (DRAM)
row select Write: 1. Drive bit line 2.. Select row Read: 1. Precharge bit line to Vdd 3. Cell and bit line share charges Very small voltage changes on the bit line 4. Sense (fancy sense amp) Can detect changes of ~1 million electrons 5. Write: restore the value Refresh 1. Just do a dummy read to every cell. bit The state of the art DRAM cell only has one transistor. The bit is stored in a tiny transistor. The write operation is very simple. Just drive the bit line and select the row by turning on this pass transistor. For read, we will need to precharge this bit line to high and then turn on the pass transistor. This will cause a small voltage change on the bit line and a very sensitive amplifier will be used to measure this small voltage change with respect to a reference bit line. Once again, the value we stored will be destroyed by the read operation so an automatic write back has to be performed at the end of every read. + 2 = 48 min. (Y:28)

DRAMs over Time DRAM Generation 1st Gen. Sample Memory Size
Die Size (mm2) Memory Area (mm2) Memory Cell Area (µm2) ‘84 ‘87 ‘90 ‘93 ‘96 ‘99 1 Mb 4 Mb 16 Mb 64 Mb 256 Mb 1 Gb Top is generations Side is minimum memory per PC 32 chips in 1986 4 chips today 60% vs. system at 25% (from Kazuhiro Sakashita, Mitsubishi)

DRAM v. Desktop Microprocessors Cultures
Standards pinout, package, binary compatibility, refresh rate, IEEE 754, I/O bus capacity, ... Sources Multiple Single Figures 1) capacity, 1a) $/bit 1) SPEC speed of Merit 2) BW, 3) latency 2) cost Improve 1) 60%, 1a) 25%, 1) 60%, Rate/year 2) 20%, 3) 7% 2) little change

Recap: Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality: Present the user with as much memory as is available in the cheapest technology. Provide access at the speed offered by the fastest technology. Processor Control Tertiary Storage (Disk) Secondary Storage (Disk) Second Level Cache (SRAM) Main Memory (DRAM) Datapath On-Chip Cache Registers The design goal is to present the user with as much memory as is available in the cheapest technology (points to the disk). While by taking advantage of the principle of locality, we like to provide the user an average access speed that is very close to the speed that is offered by the fastest technology. (We will go over this slide in details in the next lecture on caches). +1 = 16 min. (X:56) Speed (ns): 1s 10s 100s 10,000,000s (10s ms) 10,000,000,000s (10s sec) Size (bytes): 100s Ks Ms Gs Ts

Two Different Types of Locality:
Recap: Two Different Types of Locality: Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon. Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon. By taking advantage of the principle of locality: Present the user with as much memory as is available in the cheapest technology. Provide access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: Good choice for presenting the user with a BIG memory system SRAM is fast but expensive and not very dense: Good choice for providing the user FAST access time. Let’s summarize today’s lecture. The first thing we covered is the principle of locality. There are two types of locality: temporal, or locality of time and spatial, locality of space. We talked about memory system design. The key idea of memory system design is to present the user with as much memory as possible in the cheapest technology while by taking advantage of the principle of locality, create an illusion that the average access time is close to that of the fastest technology. As far as Random Access technology is concerned, we concentrate on 2: DRAM and SRAM. DRAM is slow but cheap and dense so is a good choice for presenting the use with a BIG memory system. SRAM, on the other hand, is fast but it is also expensive both in terms of cost and power, so it is a good choice for providing the user with a fast access time. I have already showed you how DRAMs are used to construct the main memory for the SPARCstation 20. On Friday, we will talk about caches. +2 = 78 min. (Y:58)

The Big Picture: Where are We Now?
The Five Classic Components of a Computer Today’s Topics: Recap last lecture Cache Review Administrivia Advanced Cache Virtual Memory Protection TLB Processor Input Control Memory Datapath Output So where are in in the overall scheme of things. Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47)

The Art of Memory System Design
Workload or Benchmark programs Processor reference stream <op,addr>, <op,addr>,<op,addr>,<op,addr>, . . . op: i-fetch, read, write Memory Optimize the memory system organization to minimize the average memory access time for typical workloads $ MEM

Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2 ** N byte cache: The uppermost (32 - N) bits are always the Cache Tag The lowest M bits are the Byte Select (Block Size = 2 ** M) 31 9 4 Cache Tag Example: 0x50 Cache Index Byte Select Ex: 0x01 Ex: 0x00 Stored as part of the cache “state” Valid Bit Cache Tag Cache Data : Byte 31 Byte 1 Byte 0 : 0x50 Byte 63 Byte 33 Byte 32 1 Let’s use a specific example with realistic numbers: assume we have a 1 KB direct mapped cache with block size equals to 32 bytes. In other words, each block associated with the cache tag will have 32 bytes in it (Row 1). With Block Size equals to 32 bytes, the 5 least significant bits of the address will be used as byte select within the cache block. Since the cache size is 1K byte, the upper 32 minus 10 bits, or 22 bits of the address will be stored as cache tag. The rest of the address bits in the middle, that is bit 5 through 9, will be used as Cache Index to select the proper cache entry. +2 = 30 min. (Y:10) 2 3 : : : : Byte 1023 Byte 992 31

Increased Miss Penalty
Block Size Tradeoff In general, larger block size take advantage of spatial locality BUT: Larger block size means larger miss penalty: Takes longer time to fill up the block If block size is too big relative to cache size, miss rate will go up Too few cache blocks In gerneral, Average Access Time: = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate Average Access Time Miss Penalty Miss Rate As I said earlier, block size is a tradeoff. In general, larger block size will reduce the miss rate because it take advantage of spatial locality. But remember, miss rate NOT the only cache performance metrics. You also have to worry about miss penalty. As you increase the block size, your miss penalty will go up because as the block gets larger, it will take you longer to fill up the block. Even if you look at miss rate by itself, which you should NOT, bigger block size does not always win. As you increase the block size, assuming keeping cache size constant, your miss rate will drop off rapidly at the beginning due to spatial locality. However, once you pass certain point, your miss rate actually goes up. As a result of these two curves, the Average Access Time (point to equation), which is really the more important performance metric than the miss rate, will go down initially because the miss rate is dropping much faster than the increase in miss penalty. But eventually, as you keep on increasing the block size, the average access time can go up rapidly because not only is the miss penalty is increasing, the miss rate is increasing as well. Let me show you why your miss rate may go up as you increase the block size by another extreme example. +3 = 33 min. (Y:13) Exploits Spatial Locality Increased Miss Penalty & Miss Rate Fewer blocks: compromises temporal locality Block Size Block Size Block Size

Extreme Example: single big line
Cache Data Valid Bit Byte 0 Byte 1 Byte 3 Cache Tag Byte 2 Cache Size = 4 bytes Block Size = 4 bytes Only ONE entry in the cache If an item is accessed, likely that it will be accessed again soon But it is unlikely that it will be accessed again immediately!!! The next access will likely to be a miss again Continually loading data into the cache but discard (force out) them before they are used again Worst nightmare of a cache designer: Ping Pong Effect Conflict Misses are misses caused by: Different memory locations mapped to the same cache index Solution 1: make the cache size bigger Solution 2: Multiple entries for the same Cache Index Let’s go back to our 4-byte direct mapped cache and increase its block size to 4 byte. Now we end up have one cache entries instead of 4 entries. What do you think this will do to the miss rate? Well the miss rate probably will go to hell. It is true that if an item is accessed, it is likely that it will be accessed again soon. But probably NOT as soon as the very next access so the next access will cause a miss again. So what we will end up is loading data into the cache but the data will be forced out by another cache miss before we have a chance to use it again. This is called the ping pong effect: the data is acting like a ping pong ball bouncing in and out of the cache. It is one of the nightmares scenarios cache designer hope never happens. We also defined a term for this type of cache miss, cache miss caused by different memory location mapped to the same cache index. It is called Conflict miss. There are two solutions we can use to reduce the conflict miss. The first one is to increase the cache size. The second one is to increase the number of cache entries per cache index. Let me show you what I mean. +2 = 35 min. (Y:15)

Another Extreme Example: Fully Associative
Fully Associative Cache Forget about the Cache Index Compare the Cache Tags of all cache entries in parallel Example: Block Size = 2 B blocks, we need N 27-bit comparators By definition: Conflict Miss = 0 for a fully associative cache 31 4 Cache Tag (27 bits long) Byte Select Ex: 0x01 Cache Tag Valid Bit Cache Data : X Byte 31 Byte 1 Byte 0 While the direct mapped cache is on the simple end of the cache design spectrum, the fully associative cache is on the most complex end. It is the N-way set associative cache carried to the extreme where N in this case is set to the number of cache entries in the cache. In other words, we don’t even bother to use any address bits as the cache index. We just store all the upper bits of the address (except Byte select) that is associated with the cache block as the cache tag and have one comparator for every entry. The address is sent to all entries at once and compared in parallel and only the one that matches are sent to the output. This is called an associative lookup. Needless to say, it is very hardware intensive. Usually, fully associative cache is limited to 64 or less entries. Since we are not doing any mapping with the cache index, we will never push any other item out of the cache because multiple memory locations map to the same cache location. Therefore, by definition, conflict miss is zero for a fully associative cache. This, however, does not mean the overall miss rate will be zero. Assume we have 64 entries here. The first 64 items we accessed can fit in. But when we try to bring in the 65th item, we will need to throw one of them out to make room for the new item. This bring us to the third type of cache misses: Capacity Miss. +3 = 41 min. (Y:21) : X Byte 63 Byte 33 Byte 32 X X : : : X

A Two-way Set Associative Cache
N-way set associative: N entries for each Cache Index N direct mapped caches operates in parallel Example: Two-way set associative cache Cache Index selects a “set” from the cache The two tags in the set are compared in parallel Data is selected based on the tag result Cache Index Valid Cache Tag Cache Data Cache Data Cache Block 0 Cache Tag Valid : Cache Block 0 : : : This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct mapped cache works in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we will select the data on the side where the tag matches occur. This is simple enough. What is its disadvantages? +1 = 36 min. (Y:16) Compare Adr Tag Compare 1 Sel1 Mux Sel0 OR Cache Block Hit

Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache: N comparators vs. 1 Extra MUX delay for the data Data comes AFTER Hit/Miss decision and set selection In a direct mapped cache, Cache Block is available BEFORE Hit/Miss: Possible to assume a hit and continue. Recover later if miss. Cache Data Cache Block 0 Cache Tag Valid : Cache Index Mux 1 Sel1 Sel0 Cache Block Compare Adr Tag OR Hit First of all, a N-way set associative cache will need N comparators instead of just one comparator (use the right side of the diagram for direct mapped cache). A N-way set associative cache will also be slower than a direct mapped cache because of this extra multiplexer delay. Finally, for a N-way set associative cache, the data will be available AFTER the hit/miss signal becomes valid because the hit/mis is needed to control the data MUX. For a direct mapped cache, that is everything before the MUX on the right or left side, the cache block will be available BEFORE the hit/miss signal (AND gate output) because the data does not have to go through the comparator. This can be an important consideration because the processor can now go ahead and use the data without knowing if it is a Hit or Miss. Just assume it is a hit. Since cache hit rate is in the upper 90% range, you will be ahead of the game 90% of the time and for those 10% of the time that you are wrong, just make sure you can recover. You cannot play this speculation game with a N-way set-associatvie cache because as I said earlier, the data will not be available to you until the hit/miss signal is valid. +2 = 38 min. (Y:18)

A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): first access to a block “Cold” fact of life: not a whole lot you can do about it Note: If you are going to run “billions” of instruction, Compulsory Misses are insignificant Conflict (collision): Multiple memory locations mapped to the same cache location Solution 1: increase cache size Solution 2: increase associativity Capacity: Cache cannot contain all blocks access by the program Solution: increase cache size Invalidation: other process (e.g., I/O) updates memory (Capacity miss) That is the cache misses are due to the fact that the cache is simply not large enough to contain all the blocks that are accessed by the program. The solution to reduce the Capacity miss rate is simple: increase the cache size. Here is a summary of other types of cache miss we talked about. First is the Compulsory misses. These are the misses that we cannot avoid. They are caused when we first start the program. Then we talked about the conflict misses. They are the misses that caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses. The first one is, once again, increase the cache size. The second one is to increase the associativity. For example, say using a 2-way set associative cache instead of directed mapped cache. But keep in mind that cache miss rate is only one part of the equation. You also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache miss we will not cover today. Those are referred to as invalidation misses caused by another process, such as IO , update the main memory so you have to flush the cache to avoid inconsistency between memory and cache. +2 = 43 min. (Y:23)

Source of Cache Misses Quiz
Direct Mapped N-way Set Associative Fully Associative Cache Size: Small, Medium, Big? Compulsory Miss: Conflict Miss Capacity Miss Invalidation Miss Let me see whether you guys can help me fill out this table. So all things being equal, which organization do you think we should use if we want to build the biggest cache (Big, Medium, Small). Answer: The direct mapped cache is the simplest so we can make it the biggest. The fully associative cache is the most complex so it will be the smallest. How about compulsory miss, which cache do you think has the highest compulsory miss rate (High, Medium Low)? Answer: Well, the direct mapped cache with the biggest cache size probably will take the longest to warm up so it probably will have the highest compulsory miss rate. But the good thing about compulsory miss rate is that if you are going to run billions of instructions, you really don’t care your initial misses before the cache is warm. Conflict Misses: which do you think has the lowest conflict miss rate. Answer: Simple. By definition, the fully-associative cache will have zero conflict misses so you can’t get any lower than that. For the direct mapped cache, each memory location can ONLY go to one cache location, so the conflict misses probably will be highest. Capacity Misses: well who has the biggest cache size wins in this category. So the direct mapped cache will have the lowest capacity miss rate while the fully associative cache will have the highest. Finally, invalidation misses. They probably affect all three caches the same way. + 3 = 46 min. (Y:26) Choices: Zero, Low, Medium, High, Same

Administrative Issues
New Office Hours: Gebis: Tue, 3:30-4:30, Kirby: Wed 1-2, Kozyrakis: Mon 1pm-2pm, Th 11am-noon ,Patterson: Wed 1-2 and Wed 3:30-4:30 Reflector site for handouts and lecture notes (backup): Read: Chapter 7 of COD 2/e; how many taken CS162? Upcoming events in CS152: Wed 11/5 Intro to I/O Systems Brian Wong, Sun Fri 11/7 Advanced I/O Systems Brian Wong, Sun Wed 11/12 Intro Digital Signal Processor (DSP) Prof. Brodersen Fri 11/14 Advanced DSP Jeff Bier, BDTI Sun 11/16 Miterm Review 1-3PM 306 Soda TAs Wed 11/19 Midterm II 5:30-8: Soda; >8:30 - Val’s Fri 11/21 Field Trip to Intel (leave 9AM, Return 5PM)

Sources of Cache Misses Answer
Direct Mapped N-way Set Associative Fully Associative Cache Size Big Medium Small Compulsory Miss Same Same Same Conflict Miss High Medium Zero Capacity Miss Low Medium High Invalidation Miss Same Same Same This is a hidden slide. Note: If you are going to run “billions” of instruction, Compulsory Misses are insignificant.

How Do you Design a Cache?
Set of Operations that must be supported read: data <= Mem[Physical Address] write: Mem[Physical Address] <= Data Deterimine the internal register transfers Design the Datapath Design the Cache Controller Inside it has: Tag-Data Storage, Muxes, Comparators, . . . Physical Address Memory “Black Box” Read/Write Data Control Points Cache DataPath R/W Active Cache Controller Address Data In wait Data Out Signals

Example: direct map allows miss signal after data
Impact on Cycle Time IR PC I -Cache D Cache A B R T IRex IRm IRwb miss invalid Miss Cache Hit Time: directly tied to clock rate increases with cache size increases with associativity Average Memory Access time = Hit Time + Miss Rate x Miss Penalty Time = IC x CT x (ideal CPI + memory stalls) Example: direct map allows miss signal after data

Improving Cache Performance: 3 general options
1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
Block 12 placed in 8 block cache: Fully associative, direct mapped, 2-way set associative S.A. Mapping = Block Number Modulo Number Sets

Q2: How is a block found if it is in the upper level?
Tag on each block No need to check index or block offset Increasing associativity shrinks index, expands tag

Q3: Which block should be replaced on a miss?
Easy for Direct Mapped Set Associative or Fully Associative: Random LRU (Least Recently Used) Associativity: 2-way 4-way 8-way Size LRU Random LRU Random LRU Random 16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0% 64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5% 256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%

Q4: What happens on a write?
Write through—The information is written to both the block in the cache and to the block in the lower-level memory. Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced. is block clean or dirty? Pros and Cons of each? WT: read misses cannot result in writes WB: no writes of repeated writes WT always combined with write buffers so that don’t wait for lower level memory

Write Buffer for Write Through
Cache Processor DRAM Write Buffer A Write Buffer is needed between the Cache and Memory Processor: writes data into the cache and the write buffer Memory controller: write contents of the buffer to memory Write buffer is just a FIFO: Typical number of entries: 4 Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle Memory system designer’s nightmare: Store frequency (w.r.t. time) -> 1 / DRAM write cycle Write buffer saturation You are right, memory is too slow. We really didn't writ e to the memory directly. We are writing to a write buffer. Once the data is written into the write buffer and assuming a cache hit, the CPU is done with the write. The memory controller will then move the write buffer’s contents to the real memory behind the scene. The write buffer works as long as the frequency of store is not too high. Notice here, I am referring to the frequency with respect to time, not with respect to number of instructions. Remember the DRAM cycle time we talked about last time. It sets the upper limit on how frequent you can write to the main memory. If the store are too close together or the CPU time is so much faster than the DRAM cycle time, you can end up overflowing the write buffer and the CPU must stop and wait. +2 = 60 min. (Y:40)

Write Buffer Saturation
Cache Processor DRAM Write Buffer Store frequency (w.r.t. time) -> 1 / DRAM write cycle If this condition exist for a long period of time (CPU cycle time too quick and/or too many store instructions in a row): Store buffer will overflow no matter how big you make it The CPU Cycle Time <= DRAM Write Cycle Time Solution for write buffer saturation: Use a write back cache Install a second level (L2) cache: A Memory System designer’s nightmare is when the Store frequency with respect to time approaches 1 over the DRAM Write Cycle Time. We called this Write Buffer Saturation. In that case, it does NOT matter how big you make the write buffer, the write buffer will still overflow because you simply feeding things in it faster than you can empty it. This is called Write Buffer Saturation and I have seen this happened before in simulation and whey that happens your processor will be running at DRAM cycle time--very very slow. The first solution for write buffer saturation is to get rid of this write buffer and replace this write through cache with a write back cache. Another solution is to install the 2nd level cache between the write buffer and memory and makes the 2nd level write back. +2 = 62 min. (Y:42) Cache L2 Cache Processor DRAM Write Buffer

Write-miss Policy: Write Allocate versus Not Allocate
Assume: a 16-bit write to memory location 0x0 and causes a miss Do we read in the block? Yes: Write Allocate No: Write Not Allocate 31 9 4 Cache Tag Example: 0x00 Cache Index Byte Select Ex: 0x00 Ex: 0x00 Valid Bit Cache Tag Cache Data : 0x00 Byte 31 Byte 1 Byte 0 : Byte 63 Byte 33 Byte 32 1 Let’s look at our 1KB direct mapped cache again. Assume we do a 16-bit write to memory location 0x and causes a cache miss in our 1KB direct mapped cache that has 32-byte block select. After we write the cache tag into the cache and write the 16-bit data into Byte 0 and Byte 1, do we have to read the rest of the block (Byte 2, 3, ... Byte 31) from memory? If we do read the rest of the block in, it is called write allocate. But stop and think for a second. Is it really necessary to bring in the rest of the block on a write miss? True, the principle of spatial locality implies that we are likely to access them soon. But the type of access we are going to do is likely to be another write. So if even if we do read in the data, we may end up overwriting them anyway so it is a common practice to NOT read in the rest of the block on a write miss. If you don’t bring in the rest of the block, or use the more technical term, Write Not Allocate, you better have some way to tell the processor the rest of the block is no longer valid. This bring us to the topic of sub-blocking. +2 = 64 min. (Y:44) 2 3 : : : : Byte 1023 Byte 992 31

Impact of Memory Hierarchy on Algorithms
Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms? “The Influence of Caches on the Performance of Sorting” by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, Quicksort: fastest comparison based sorting algorithm when all keys fit in memory Radix sort: also called “linear time” sort because for keys of fixed length and fixed radix a constant number of passes over the data is sufficient independent of the number of keys For Alphastation 250, 32 byte blocks, direct mapped L2 2MB cache, 8 byte keys, from 4000 to Let’s do a short review of what you learned last time. Virtual memory was originally invented as another level of memory hierarchy such that programers, faced with main memory much smaller than their programs, do not have to manage the loading and unloading portions of their program in and out of memory. It was a controversial proposal at that time because very few programers believed software can manage the limited amount of memory resource as well as human. This all changed as DRAM size grows exponentially in the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don’t have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, but we like to emphasis not the only technique, to translate virtual memory address to physical memory address is to use a page table. TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since TLB is so effective in reducing the address translation time, what this means is that TLB misses will have a significant negative impact on processor performance. +3 = 3 min. (X:43)

Quicksort vs. Radix as vary number keys: Instructions
Radix sort Quick sort Instructions/key Set size in keys

Quicksort vs. Radix as vary number keys: Instrs & Time
Radix sort Time Quick sort Instructions Set size in keys

Quicksort vs. Radix as vary number keys: Cache misses
Radix sort Cache misses Quick sort Set size in keys What is proper approach to fast algorithms?

Recall: Levels of the Memory Hierarchy
Upper Level Capacity Access Time Cost Staging Xfer Unit faster CPU Registers 100s Bytes <10s ns Registers Instr. Operands prog./compiler 1-8 bytes Cache K Bytes ns $ /bit Cache cache cntl 8-128 bytes Blocks Main Memory M Bytes 100ns-1us $ Memory OS 512-4K bytes Pages Disk G Bytes ms cents Disk -3 -4 user/operator Mbytes Files Tape infinite sec-min 10 Larger Tape Lower Level -6

Basic Issues in Virtual Memory System Design
size of information blocks that are transferred from secondary to main storage (M) block of information brought into M, and M is full, then some region of M must be released to make room for the new block --> replacement policy which region of M is to hold the new block --> placement policy missing item fetched from secondary memory only on the occurrence of a fault --> demand load policy disk mem cache reg pages frame Paging Organization virtual and physical address space partitioned into blocks of equal size page frames pages

Address Map V = {0, 1, . . . , n - 1} virtual address space
M = {0, 1, , m - 1} physical address space MAP: V --> M U {0} address mapping function n > m MAP(a) = a' if data at virtual address a is present in physical address a' and a' in M = 0 if data at virtual address a is not present in M a missing item fault Name Space V fault handler Processor Addr Trans Mechanism Main Memory Secondary Memory a a' physical address OS performs this transfer

Paging Organization V.A. P.A. unit of mapping frame 0 1K Addr Trans
frame 0 1K Addr Trans MAP page 0 1K 1024 1 1K 1024 1 1K also unit of transfer from virtual to physical memory 7168 7 1K Physical Memory 31744 31 1K Virtual Memory Address Mapping 10 VA page no. disp Page Table Page Table Base Reg Access Rights V actually, concatenation is more likely PA + index into page table table located in physical memory physical memory address

Virtual Address and a Cache
VA PA miss Trans- lation Cache Main Memory CPU hit data It takes an extra memory access to translate VA to PA This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible ASIDE: Why access cache with PA at all? VA caches have a problem! synonym / alias problem: two different virtual addresses map to same physical address => two different cache entries holding data for the same physical address! for update: must update all cache entries with same physical address or memory becomes inconsistent determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or software enforced alias boundary: same lsb of VA &PA > cache size

TLBs A way to speed up translation is to use a special cache of recently used page table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer or TLB Virtual Address Physical Address Dirty Ref Valid Access TLB access time comparable to cache access time (much less than main memory access time)

Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped TLBs are usually small, typically not more than entries even on high end machines. This permits fully associative lookup on these machines. Most mid-range machines use small n-way set associative organizations. hit VA PA miss TLB Lookup Cache Main Memory CPU Translation with a TLB miss hit Trans- lation data 1/2 t t 20 t

Reducing Translation Time
Machines with TLBs go one step further to reduce # cycles/cache access They overlap the cache access with the TLB access Works because high order bits of the VA are used to look in the TLB while low order bits are used as index into cache

Overlapped Cache & TLB Access
assoc lookup index 32 1 K 4 bytes 10 2 00 Hit/ Miss PA Data 20 12 PA Hit/ Miss page # disp = IF cache hit AND (cache tag = PA) then deliver data to CPU ELSE IF [cache miss OR (cache tag = PA)] and TLB hit THEN access memory with the PA from the TLB ELSE do standard VA translation

Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation This usually limits things to small caches, large page sizes, or high n-way set associative caches if you want a large cache Example: suppose everything the same except that the cache is increased to 8 K bytes instead of 4 K: 11 2 cache index 00 This bit is changed by VA translation, but is needed for cache lookup 20 12 virt page # disp Solutions: go to 8K byte page sizes; go to 2 way set associative cache; or SW guarantee VA[13]=PA[13] 1K 2 way set assoc cache 10 4 4

The Principle of Locality:
Summary #1/ 4: The Principle of Locality: Program likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: Locality in Time Spatial Locality: Locality in Space Three Major Categories of Cache Misses: Compulsory Misses: sad facts of life. Example: cold start misses. Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! Capacity Misses: increase cache size Cache Design Space total size, block size, associativity replacement policy write-hit policy (write-through, write-back) write-miss policy Let’s summarize today’s lecture. I know you have heard this many times and many ways but it is still worth repeating. Memory hierarchy works because of the Principle of Locality which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them but if you are going to run billions of instructions anyway, compulsory misses usually don’t bother you. Conflict misses are misses caused by multiple memory location being mapped to the same cache location. The nightmare scenario is the ping pong effect when a block is read into the cache but before we have a chance to use it, it was immediately forced out by another conflict miss. You can reduce Conflict misses by either increase the cache size or increase the associativity, or both. Finally, Capacity misses occurs when the cache is not big enough to contains all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policy as far as cache write is concerned. Write through requires a write buffer and a nightmare scenario is when the store occurs so frequent that you saturates your write buffer. The second write polity is write back. In this case, you only write to the cache and only when the cache block is being replaced do you write the cache block back to memory. +3 = 77 min. (Y:57)

Summary #2 / 4: The Cache Design Space
Several interacting dimensions cache size block size associativity replacement policy write-through vs write-back write allocation The optimal choice is a compromise depends on access characteristics workload use (I-cache, D-cache, TLB) depends on technology / cost Simplicity often wins Cache Size Associativity Block Size Bad No fancy replacement policy is needed for the direct mapped cache. As a matter of fact, that is what cause direct mapped trouble to begin with: only one place to go in the cache--causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistic have shown that if a pilot crashed after an engine failure, he or she is more likely to get killed in a multi-engine light airplane than a single engine airplane. The joke among us flight instructors is that: sure, when the engine quit in a single engine stops, you have one option: sooner or later, you land. Probably sooner. But in a multi-engine airplane with one engine stops, you have a lot of options. It is the need to make a decision that kills those people. Good Factor A Factor B Less More

Summary #3 / 4 : TLB, Virtual Memory
Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions: 1) Where can block be placed? 2) How is block found? 3) What block is repalced on miss? 4) How are writes handled? Page tables map virtual address to physical address TLBs are important for fast translation TLB misses are significant in processor performance: (funny times, as most systems can’t access all of 2nd level cache without TLB misses!) Let’s do a short review of what you learned last time. Virtual memory was originally invented as another level of memory hierarchy such that programers, faced with main memory much smaller than their programs, do not have to manage the loading and unloading portions of their program in and out of memory. It was a controversial proposal at that time because very few programers believed software can manage the limited amount of memory resource as well as human. This all changed as DRAM size grows exponentially in the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don’t have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, but we like to emphasis not the only technique, to translate virtual memory address to physical memory address is to use a page table. TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since TLB is so effective in reducing the address translation time, what this means is that TLB misses will have a significant negative impact on processor performance. +3 = 3 min. (X:43)

Summary #4 / 4: Memory Hierachy
VIrtual memory was controversial at the time: can SW automatically manage 64KB across many programs? 1000X DRAM growth removed the controversy Today VM allows many processes to share single memory without having to swap all processes to disk; VM protection is more important than memory hierarchy Today CPU time is a function of (ops, cache misses) vs. just f(ops): What does this mean to Compilers, Data structures, Algorithms? Let’s do a short review of what you learned last time. Virtual memory was originally invented as another level of memory hierarchy such that programers, faced with main memory much smaller than their programs, do not have to manage the loading and unloading portions of their program in and out of memory. It was a controversial proposal at that time because very few programers believed software can manage the limited amount of memory resource as well as human. This all changed as DRAM size grows exponentially in the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don’t have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, but we like to emphasis not the only technique, to translate virtual memory address to physical memory address is to use a page table. TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since TLB is so effective in reducing the address translation time, what this means is that TLB misses will have a significant negative impact on processor performance. +3 = 3 min. (X:43)

CEG3420 Computer Design Caches and Virtual Memory

Similar presentations

Presentation on theme: "CEG3420 Computer Design Caches and Virtual Memory"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CEG3420 Computer Design Caches and Virtual Memory

Similar presentations

Presentation on theme: "CEG3420 Computer Design Caches and Virtual Memory"— Presentation transcript:

Similar presentations

About project

Feedback