Presentation on theme: "Memory Hierarchy and Cache Memory" — Presentation transcript:

1 Memory Hierarchy and Cache Memory
Outline: reasons for a memory hierarchy, virtual memory, cache memory, the translation lookaside buffer (TLB), address translation, and demand paging.

2 Why Care About the Memory Hierarchy?
[Chart: Processor-DRAM memory gap (latency), 1980-2000. Processor performance ("Moore's Law") improves about 60% per year (2x every 1.5 years) while DRAM improves about 9% per year (2x every 10 years), so the processor-memory performance gap grows roughly 50% per year. Note that x86 did not have on-chip cache until 1989.]

3 DRAMs over Time
[Table: DRAM generations with first samples in '84, '87, '90, '93, '96 and '99, with memory sizes of 1 Mb, 4 Mb, 16 Mb, 64 Mb, 256 Mb and 1 Gb, and columns for die size (mm2), memory area (mm2) and memory cell area (µm2). The minimum memory per PC went from 32 chips in 1986 to 4 chips today; DRAM grows about 60% per year versus about 25% per year for the rest of the system. (From Kazuhiro Sakashita, Mitsubishi.)]

4 Recap: Two Different Types of Locality:
Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon. By taking advantage of the principle of locality we can present the user with as much memory as is available in the cheapest technology while providing access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time.

Let's summarize today's lecture. The first thing we covered is the principle of locality. There are two types of locality: temporal, or locality in time, and spatial, or locality in space. We then talked about memory system design. The key idea is to present the user with as much memory as possible in the cheapest technology while, by taking advantage of the principle of locality, creating the illusion that the average access time is close to that of the fastest technology. As far as random-access technology is concerned, we concentrate on two: DRAM and SRAM. DRAM is slow but cheap and dense, so it is a good choice for presenting the user with a BIG memory system. SRAM, on the other hand, is fast but expensive both in terms of cost and power, so it is a good choice for providing the user with a fast access time. I have already shown you how DRAMs are used to construct the main memory for the SPARCstation 20. On Friday, we will talk about caches.

5 Memory Hierarchy of a Modern Computer
By taking advantage of the principle of locality we can present the user with as much memory as is available in the cheapest technology while providing access at the speed offered by the fastest technology. [Diagram: processor (control, datapath, registers) → on-chip cache → second-level cache (SRAM) → main memory (DRAM) → secondary storage (disk) → tertiary storage (disk/tape). Typical speeds: 1 ns, a few ns, 10 ns, 100 ns, 10 ms, 10 s; typical sizes: 100 bytes, 64 KB, KB-MB, MB, GB, TB.] The design goal is to present the user with as much memory as is available in the cheapest technology (the disk) while, by taking advantage of the principle of locality, providing an average access speed close to that offered by the fastest technology. (We will go over this slide in detail in the next lecture on caches.)

6 Levels of the Memory Hierarchy
The staging/transfer unit between levels, its typical size, and who controls the transfer (faster at the top, larger at the bottom):
Registers <-> Cache: instruction operands, 1-8 bytes, managed by the program/compiler.
Cache <-> Memory: blocks, 8-128 bytes, managed by the cache controller.
Memory <-> Disk: pages, 512 bytes - 4 KB, managed by the OS.
Disk <-> Tape: files, Mbytes, managed by the user/operator.

7 The Art of Memory System Design
[Diagram: processor → cache ($) → main memory (MEM), driven by a workload or benchmark programs.] Optimize the memory system organization to minimize the average memory access time for typical workloads. The reference stream is <op,addr>, <op,addr>, <op,addr>, <op,addr>, ... where op is i-fetch, read, or write.

8 Virtual Memory System Design
Design decisions: the size of the information blocks that are transferred from secondary to main storage (M); if a block of information is brought into M and M is full, some region of M must be released to make room for the new block --> replacement policy; which region of M is to hold the new block --> placement policy; a missing item is fetched from secondary memory only on the occurrence of a fault --> demand load policy. [Diagram: registers - cache - memory - disk, with pages moving between memory and disk.] Paging organization: the virtual and physical address spaces are partitioned into blocks of equal size (pages).

9 Address Map
V = {0, 1, ..., n - 1} virtual address space; M = {0, 1, ..., m - 1} physical address space, with n > m. MAP: V --> M U {0} is the address mapping function: MAP(a) = a' if the data at virtual address a is present at physical address a' in M, and MAP(a) = 0 if the data at virtual address a is not present in M (a missing-item fault). [Diagram: the processor issues an address a in name space V; the address translation mechanism either returns the physical address a' in main memory or raises a fault, and the fault handler brings the missing item in from secondary memory - the OS performs this transfer.]

10 Paging Organization
Pages are 1 KB: physical memory holds frames 0, 1, ..., 7 at physical addresses 0, 1024, ..., 7168, and virtual memory holds pages 0, ..., 31 at virtual addresses 0, ..., 31744. The page is the unit of mapping and also the unit of transfer from virtual to physical memory. Address mapping: the VA is split into a page number and a 10-bit displacement; the page number indexes the page table (base register + index), whose entries hold a valid bit, access rights, and a PA; the page table itself is located in physical memory. The entry's PA is combined with the displacement to form the physical address (actually, concatenation is more likely than addition).
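A minimal sketch of this translation in C, assuming the 1 KB pages and 24-bit physical addresses used on these slides (the page-table layout and field names are illustrative, not a real MMU format):

```c
#include <stdint.h>

#define PAGE_BITS 10u                 /* 1 KB pages: 10-bit displacement   */
#define PAGE_SIZE (1u << PAGE_BITS)

/* One page-table entry: valid bit, access rights, physical frame number.  */
typedef struct {
    uint8_t  valid;
    uint8_t  access_rights;
    uint32_t frame;                   /* PA[23:10] of the page in memory   */
} pte_t;

/* Index the page table with the page number and concatenate the frame
 * number with the displacement (concatenation, not addition).             */
uint32_t translate(const pte_t *page_table, uint32_t va)
{
    uint32_t page_no = va >> PAGE_BITS;          /* VA[31:10]              */
    uint32_t disp    = va & (PAGE_SIZE - 1);     /* VA[9:0]                */
    const pte_t *e   = &page_table[page_no];
    /* A real system would raise a page fault here if !e->valid.           */
    return (e->frame << PAGE_BITS) | disp;       /* 24-bit physical addr   */
}
```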

11 Address Mapping
[Diagram: the MIPS pipeline issues 32-bit virtual addresses for instructions and data; CP0 translates them into 24-bit physical addresses into user memory. Kernel memory holds one page table per process (Page Table 1, Page Table 2, ..., Page Table n). With user process 2 running, page table 2 is the one needed for address mapping.]

12 Translation Lookaside Buffer (TLB)
On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware (CP0); we never call the kernel! [Diagram: the MIPS pipeline sends the 32-bit virtual address to the TLB in CP0; each TLB entry holds status bits (D, R), the virtual page number, and Physical Addr[23:10]. Kernel memory still holds the page tables (Page Table 1, ..., Page Table n).]
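A sketch of this fully associative TLB lookup in C (the 64-line size comes from slide 45; the entry fields are illustrative, not the real CP0 format):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_LINES 64

typedef struct {
    bool     valid;          /* V: entry holds a translation               */
    bool     dirty;          /* D: page has been written via this entry    */
    uint32_t page_no;        /* 22-bit virtual page number, VA[31:10]      */
    uint32_t frame;          /* PA[23:10]                                  */
} tlb_entry_t;

/* On a hit, return true and produce the 24-bit physical address; on a
 * miss, the kernel would walk the current process's page table instead.   */
bool tlb_lookup(const tlb_entry_t tlb[TLB_LINES], uint32_t va, uint32_t *pa)
{
    uint32_t page_no = va >> 10;
    for (int i = 0; i < TLB_LINES; i++) {        /* hardware checks all    */
        if (tlb[i].valid && tlb[i].page_no == page_no) {  /* lines at once */
            *pa = (tlb[i].frame << 10) | (va & 0x3FFu);
            return true;
        }
    }
    return false;                                /* TLB miss               */
}
```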

13 So Far, NO GOOD
[Diagram: the MIPS pipe (IM, DE, EX, DM stages) with the TLB in CP0 and the 60 ns RAM holding kernel memory and the page tables.] The MIPS pipe is clocked at 50 MHz, so the critical path is 20 ns; the TLB takes about 5 ns to produce the 24-bit physical address. But the RAM takes 60 ns, i.e. 3 cycles, to read or write, which STALLS the pipe.

14 A cache Hit never STALLS the pipe
Let's put in a cache between the pipe and the 60 ns RAM. [Diagram: MIPS pipe (IM, DE, EX, DM) clocked at 50 MHz with a 20 ns critical path; TLB about 5 ns; cache about 15 ns; kernel memory holds the page tables.] A cache hit never STALLS the pipe.

15 Fully Associative Cache
The 24-bit PA is split into a tag, PA[23:2], and a byte offset, PA[1:0]. Check all cache lines: cache hit if PA[23:2] = TAG, and the matching line supplies the data word. All 2^16 lines are checked; 2^16 lines * 4 bytes = 256 KB.

16 Fully Associative Cache
Very good hit ratio (number of hits / number of accesses). But! Too expensive: checking all 2^16 cache lines concurrently needs a comparator for each line, a lot of hardware.

17 Direct Mapped Cache
The 24-bit PA is split into a tag, PA[23:18], an index, PA[17:2], that selects ONE cache line, and a byte offset, PA[1:0]. Cache hit if PA[23:18] = TAG, and the selected line supplies the data word. Only 1 line is checked; 2^16 lines * 4 bytes = 256 KB.
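A hedged sketch of this direct mapped lookup, with the address split used on the slide (the structure layout is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES (1u << 16)            /* 2^16 lines of 4 bytes = 256 KB  */

typedef struct {
    uint32_t tag;                       /* PA[23:18]                       */
    uint32_t data;                      /* one 4-byte word                 */
} dm_line_t;

/* PA[17:2] selects ONE line; hit if that line's tag equals PA[23:18].     */
bool dm_lookup(const dm_line_t cache[NUM_LINES], uint32_t pa, uint32_t *word)
{
    uint32_t index = (pa >> 2) & 0xFFFFu;   /* PA[17:2]                    */
    uint32_t tag   =  pa >> 18;             /* PA[23:18]                   */
    if (cache[index].tag == tag) {
        *word = cache[index].data;          /* PA[1:0] then selects a byte */
        return true;
    }
    return false;                           /* miss: fetch from RAM        */
}
```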

18 Direct Mapped Cache
Not so good hit ratio: each line can hold only certain addresses, so there is less freedom. But! Much cheaper to implement: only one line is checked, so only one comparator is needed.

19 Set Associative Cache
The 24-bit PA is split into a tag, PA[23:18-z], an index, PA[17-z:2], that selects ONE set of 2^z lines, and a byte offset, PA[1:0]. Cache hit if PA[23:18-z] = TAG for some line in the set, and that line supplies the data word. With 2^z lines per set the cache is 2^z-way set associative; 2^16 lines * 4 bytes = 256 KB in total.

20 Set Associative Cache Quite good hit ratio
The set of different addresses that each line can hold is larger than in a direct mapped cache, so the hit ratio is quite good. The larger z, the better the hit ratio, but also the more expensive the cache: 2^z comparators are needed. A cost-performance tradeoff.

21 Cache Miss A Cache Miss should be handled by the hardware
If it were handled by the OS it would be very slow (>> 60 ns). On a cache miss: stall the pipe, read the new data into the cache, then release the pipe; now we get a cache hit.

22 A Summary on Sources of Cache Misses
Compulsory (cold start or process migration, first reference): the first access to a block. A "cold" fact of life: not a whole lot you can do about it. Note: if you are going to run "billions" of instructions, compulsory misses are insignificant. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity. Capacity: the cache cannot contain all blocks accessed by the program. Solution: increase the cache size. Invalidation: another process (e.g., I/O) updates memory.

Capacity misses are due to the fact that the cache is simply not large enough to contain all the blocks that are accessed by the program; the solution to reduce the capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache miss we talked about. First are the compulsory misses. These are the misses that we cannot avoid; they are caused when we first start the program. Then we talked about the conflict misses, caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses: the first is, once again, to increase the cache size; the second is to increase the associativity, for example by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that the cache miss rate is only one part of the equation; you also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache miss we will not cover today: invalidation misses, caused by another process, such as I/O, updating main memory, so you have to flush the cache to avoid inconsistency between memory and cache.

23 Example: 1 KB Direct Mapped Cache with 32 Byte Blocks
For a 2^N byte cache: the uppermost (32 - N) bits are always the cache tag; the lowest M bits are the byte select (block size = 2^M). [Diagram: a 1 KB direct mapped cache with 32-byte blocks. Each entry holds a valid bit, the cache tag (stored as part of the cache "state", e.g. 0x50), and 32 bytes of cache data (byte 0 ... byte 31 in entry 0, byte 32 ... byte 63 in entry 1, ..., byte 992 ... byte 1023 in entry 31). Example address fields: cache tag 0x50, cache index 0x01, byte select 0x00.]

Let's use a specific example with realistic numbers: assume we have a 1 KB direct mapped cache with a block size of 32 bytes. In other words, each block associated with a cache tag has 32 bytes in it. With a block size of 32 bytes, the 5 least significant bits of the address are used as the byte select within the cache block. Since the cache size is 1 KB, the upper 32 minus 10 bits, i.e. 22 bits, of the address are stored as the cache tag. The remaining address bits in the middle, bits 5 through 9, are used as the cache index to select the proper cache entry.
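The bit slicing for this example as a small sketch; the address 0x00014020 is chosen so that it reproduces the fields shown on the slide (tag 0x50, index 0x01, byte select 0x00):

```c
#include <stdint.h>
#include <stdio.h>

/* 1 KB direct mapped cache with 32-byte blocks:
 *   byte select = address[4:0]   (block size 2^5)
 *   cache index = address[9:5]   (32 entries)
 *   cache tag   = address[31:10] (22 bits)                                */
int main(void)
{
    uint32_t addr   = 0x00014020u;            /* hypothetical address      */
    uint32_t select =  addr        & 0x1Fu;   /* bits 4:0                  */
    uint32_t index  = (addr >> 5)  & 0x1Fu;   /* bits 9:5                  */
    uint32_t tag    =  addr >> 10;            /* bits 31:10                */
    printf("tag=0x%X index=0x%X byte=0x%X\n",
           (unsigned)tag, (unsigned)index, (unsigned)select);
    return 0;
}
```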

24 Block Size Tradeoff
In general, a larger block size takes advantage of spatial locality, BUT: a larger block size means a larger miss penalty (it takes longer to fill the block), and if the block size is too big relative to the cache size, the miss rate goes up (too few cache blocks). In general, the average access time is: T_avg = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate. [Charts vs. block size: the miss penalty increases with block size; the miss rate first drops because larger blocks exploit spatial locality, then rises once there are too few blocks (compromising temporal locality); the average access time therefore falls and then rises again from the increased miss penalty and miss rate.]

As I said earlier, block size is a tradeoff. In general, a larger block size will reduce the miss rate because it takes advantage of spatial locality. But remember, miss rate is NOT the only cache performance metric; you also have to worry about miss penalty. As you increase the block size, your miss penalty goes up because, as the block gets larger, it takes longer to fill it. Even if you look at miss rate by itself, which you should NOT, a bigger block size does not always win. As you increase the block size, keeping the cache size constant, your miss rate drops off rapidly at the beginning due to spatial locality; however, once you pass a certain point, your miss rate actually goes up. As a result of these two curves, the average access time (see the equation), which is really the more important performance metric, goes down initially because the miss rate drops much faster than the miss penalty increases. But eventually, as you keep increasing the block size, the average access time can go up rapidly because not only is the miss penalty increasing, the miss rate is increasing as well. Let me show you why your miss rate may go up as you increase the block size with another extreme example.
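The formula from the slide as a tiny runnable sketch (the hit time, miss penalty and miss rate values are made-up illustrations, not measurements):

```c
#include <stdio.h>

/* T_avg = HitTime * (1 - MissRate) + MissPenalty * MissRate (as on slide) */
int main(void)
{
    double hit_time     = 20.0;   /* ns, one 50 MHz cycle (assumed)        */
    double miss_penalty = 60.0;   /* ns, one RAM access (assumed)          */
    double miss_rate    = 0.05;   /* 5% of accesses miss (assumed)         */

    double t_avg = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
    printf("average access time = %.1f ns\n", t_avg);
    return 0;
}
```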

25 Extreme Example: single big line
[Diagram: a cache with a single entry: valid bit, cache tag, and cache data bytes 0-3.] Cache size = 4 bytes, block size = 4 bytes: only ONE entry in the cache. If an item is accessed, it is likely that it will be accessed again soon, but it is unlikely that it will be accessed again immediately! So the next access will likely be a miss again: we continually load data into the cache but force it out before it is used again. The worst nightmare of a cache designer: the Ping Pong Effect. Conflict misses are misses caused by different memory locations mapped to the same cache index. Solution 1: make the cache size bigger. Solution 2: multiple entries for the same cache index.

Let's go back to our 4-byte direct mapped cache and increase its block size to 4 bytes. Now we end up with one cache entry instead of 4. What do you think this will do to the miss rate? Well, the miss rate will probably go through the roof. It is true that if an item is accessed, it is likely to be accessed again soon, but probably NOT as soon as the very next access, so the next access will cause a miss again. What we end up with is loading data into the cache and having it forced out by another cache miss before we have a chance to use it again. This is called the ping pong effect: the data acts like a ping pong ball bouncing in and out of the cache. It is one of the nightmare scenarios a cache designer hopes never happens. We also defined a term for this type of cache miss, a miss caused by different memory locations mapped to the same cache index: it is called a conflict miss. There are two solutions to reduce conflict misses: the first is to increase the cache size, and the second is to increase the number of cache entries per cache index, as shown next.

26 What if copies are changed?
Hierarchy: small, fast and expensive VS slow, big and inexpensive. [Diagram: HD 2 GB, RAM 16 MB, cache 256 KB split into I and D.] The cache contains copies; what if the copies are changed? INCONSISTENCY!

27 Cache Miss, Write Through/Back
To avoid INCONSISTENCY we can use: Write Through: always write data to RAM. Not so good performance (each write takes 60 ns), therefore WT is always combined with write buffers so that we don't wait for the lower-level memory. Write Back: write data to memory only when the cache line is replaced. We need a Dirty bit (D) for each cache line, set by hardware on a write operation. Much better performance, but more complex hardware.
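A hedged sketch of the two policies on a write hit, under a simple line model with a dirty bit (write misses and allocation policy are ignored; ram_write is a stub standing in for the slow memory write):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint32_t data;
} line_t;

/* Stub standing in for the slow (~60 ns) main-memory write.               */
static void ram_write(uint32_t pa, uint32_t word) { (void)pa; (void)word; }

/* Write through: update the cache line and always write RAM as well (in
 * practice via a write buffer so the pipe does not have to wait).         */
void write_through(line_t *line, uint32_t pa, uint32_t word)
{
    line->data = word;
    ram_write(pa, word);
}

/* Write back: update only the cache line and set its dirty bit; RAM is
 * updated later, when the line is replaced.                               */
void write_back(line_t *line, uint32_t word)
{
    line->data  = word;
    line->dirty = true;
}
```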

28 Write Buffer for Write Through
[Diagram: Processor → Cache and Write Buffer → DRAM.] A write buffer is needed between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller writes the contents of the buffer to memory. The write buffer is just a FIFO, with a typical number of entries of 4. It works fine as long as the store frequency (with respect to time) << 1 / DRAM write cycle. The memory system designer's nightmare is when the store frequency (with respect to time) approaches 1 / DRAM write cycle: write buffer saturation.

You are right, memory is too slow. We really don't write to the memory directly; we write to a write buffer. Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write. The memory controller then moves the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice that I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer and the CPU must stop and wait.
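A minimal 4-entry FIFO write buffer sketch (the ring-buffer layout and function names are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                      /* typical depth per the slide   */

typedef struct { uint32_t addr, data; } wb_entry_t;

typedef struct {
    wb_entry_t entry[WB_ENTRIES];
    unsigned   head, tail, count;         /* simple ring buffer            */
} write_buffer_t;

/* Processor side: returns false (the CPU must stall) when the buffer is
 * saturated, i.e. stores arrive faster than DRAM can drain them.          */
bool wb_push(write_buffer_t *wb, uint32_t addr, uint32_t data)
{
    if (wb->count == WB_ENTRIES)
        return false;                     /* write buffer saturation       */
    wb->entry[wb->tail] = (wb_entry_t){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory-controller side: drain one entry per DRAM write cycle.           */
bool wb_pop(write_buffer_t *wb, wb_entry_t *out)
{
    if (wb->count == 0)
        return false;
    *out = wb->entry[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}
```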

29 Write Buffer Saturation
Write buffer saturation: the store frequency (with respect to time) approaches 1 / DRAM write cycle. If this condition exists for a long period of time (the CPU cycle time is too short and/or there are too many store instructions in a row), the store buffer will overflow no matter how big you make it, because the CPU cycle time <= the DRAM write cycle time. Solutions for write buffer saturation: use a write back cache, or install a second level (L2) cache. [Diagram: Processor → Cache → Write Buffer → L2 → DRAM.]

A memory system designer's nightmare is when the store frequency with respect to time approaches 1 over the DRAM write cycle time. We call this write buffer saturation. In that case it does NOT matter how big you make the write buffer; it will still overflow because you are simply feeding things into it faster than you can empty it. I have seen this happen in simulation, and when it happens your processor will be running at DRAM cycle time: very, very slow. The first solution for write buffer saturation is to get rid of the write buffer and replace the write through cache with a write back cache. Another solution is to install a second-level cache between the write buffer and memory and make the second level write back.

30 Replacement Strategy in Hardware
A direct mapped cache selects ONE cache line, so no replacement strategy is needed. A set/fully associative cache selects a set of lines, so we need a strategy to select one cache line to replace. Random or round robin: not so good, it spoils the idea of an associative cache. Least Recently Used (a move-to-top strategy): good, but complex and costly for large z. We could use an approximation (heuristic): Not Recently Used (replace a line that has not been used for a certain time).
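A small sketch of exact LRU within one set, using per-line age stamps (the 4-way set size and field names are illustrative; real hardware typically uses a cheaper approximation such as the Not Recently Used scheme mentioned above):

```c
#include <stdint.h>

#define WAYS 4                           /* e.g. a 4-way (z = 2) set       */

typedef struct {
    uint32_t tag;
    uint32_t last_used;                  /* pseudo-timestamp of last access */
} way_t;

/* Least Recently Used: the victim is the way with the oldest stamp.       */
unsigned lru_victim(const way_t set[WAYS])
{
    unsigned victim = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    return victim;
}

/* Call on every hit so the stamp tracks recency ("move to top").          */
void lru_touch(way_t *way, uint32_t now)
{
    way->last_used = now;
}
```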

31 Sequential RAM Access
Accessing sequential words from RAM is faster than accessing RAM randomly, since only the lower address bits change. How could we exploit this? Let each cache line hold an array of data words: give the base address and the array size, "burst read" the array from RAM into the cache, and "burst write" the array from the cache back to RAM.

32 System Startup, RESET Random Cache Contents
We might read incorrect values from the cache. We need to know whether the contents are valid, so there is a V-bit for each cache line. Let the hardware clear all V-bits on RESET, and set the V-bit (and clear the D-bit) for each line copied from RAM into the cache.

33 Final Cache Model
The 24-bit PA is split into a tag, PA[23:18-z], an index, PA[17-z:2+j], that selects ONE set of 2^z lines, and a word/byte offset, PA[1+j:0]. Cache hit if (PA[23:18-z] = TAG) and V is set for a line in the set; the D bit is set on a write. Each line holds the V and D bits, the tag, and a block of data words selected by PA[1+j:0].
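Putting the pieces together, a hedged sketch of this final model in C with valid and dirty bits and parameters z and j (the concrete values and data layout are assumptions; the field widths follow the slide):

```c
#include <stdbool.h>
#include <stdint.h>

#define Z 1                         /* 2^Z lines per set (2-way here)      */
#define J 3                         /* 2^J words per line (32-byte blocks) */
#define WAYS      (1u << Z)
#define WORDS     (1u << J)
#define SET_BITS  (16u - Z - J)     /* keeps the total size at 256 KB      */
#define NUM_SETS  (1u << SET_BITS)

typedef struct {
    bool     valid, dirty;
    uint32_t tag;                   /* PA[23:18-Z]                         */
    uint32_t word[WORDS];
} line_t;

typedef line_t cache_t[NUM_SETS][WAYS];

/* Hit if the tag matches and V is set in the selected set; set D on write. */
bool cache_access(cache_t cache, uint32_t pa, uint32_t *data, bool is_write)
{
    uint32_t index = (pa >> (2 + J)) & (NUM_SETS - 1);   /* PA[17-Z:2+J]   */
    uint32_t tag   =  pa >> (2 + J + SET_BITS);          /* PA[23:18-Z]    */
    uint32_t word  = (pa >> 2) & (WORDS - 1);            /* PA[1+J:2]      */

    for (unsigned w = 0; w < WAYS; w++) {
        line_t *l = &cache[index][w];
        if (l->valid && l->tag == tag) {
            if (is_write) { l->word[word] = *data; l->dirty = true; }
            else          { *data = l->word[word]; }
            return true;
        }
    }
    return false;                   /* miss: stall, fill a line, retry     */
}
```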

34 Translation Lookaside Buffer (TLB)
On a TLB hit, the 32-bit virtual address is translated into a 24-bit physical address by hardware (CP0); we never call the kernel! [Diagram: the MIPS pipeline sends the 32-bit virtual address to the TLB in CP0; each TLB entry holds status bits (D, R), the virtual page number, and Physical Addr[23:10]. Kernel memory still holds the page tables (Page Table 1, ..., Page Table n).]

35 Virtual Address and a Cache
It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible. ASIDE: why access the cache with a PA at all? VA caches have a problem! The synonym / alias problem: two different virtual addresses map to the same physical address, so two different cache entries hold data for the same physical address. On an update, you must update all cache entries with the same physical address or memory becomes inconsistent. Determining this requires significant hardware, essentially an associative lookup on the physical address tags to see if you have multiple hits; or a software-enforced alias boundary: the low-order bits of VA and PA must agree over a range at least as large as the cache size. [Diagram: CPU → VA → translation → PA → cache → main memory on a miss; data returns to the CPU on a hit.]

36 Translation Look-Aside Buffers
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped. TLBs are usually small, typically no more than a few hundred entries even on high-end machines. This permits a fully associative lookup on these machines; most mid-range machines use small n-way set associative organizations. [Diagram: translation with a TLB. The CPU issues a VA; on a TLB hit the PA goes to the cache, and on a cache miss on to main memory; on a TLB miss the OS page table provides the translation, and the data is returned to the CPU.]

37 Reducing Translation Time
Machines with TLBs go one step further to reduce cycles per cache access: they overlap the cache access with the TLB access. This works because the high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.

38 Overlapped Cache & TLB Access
[Diagram: the 32-bit VA splits into a 20-bit page number and a 12-bit displacement. The page number goes to the associative TLB lookup while the low bits (a 10-bit index plus the 2-bit "00" offset into the 4-byte entries of a 1 K-entry cache) index the cache in parallel; the TLB delivers the PA, and the cache delivers hit/miss and data after the tag comparison.] IF cache hit AND (cache tag = PA) THEN deliver data to the CPU; ELSE IF [cache miss OR (cache tag != PA)] AND TLB hit THEN access memory with the PA from the TLB; ELSE do the standard VA translation.
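The same decision logic as a hedged C sketch, written as a pure function of the three conditions named above (the enum names are invented for illustration):

```c
#include <stdbool.h>

typedef enum { DELIVER_DATA, USE_TLB_PA, DO_FULL_TRANSLATION } action_t;

/* cache_hit      - the VA-indexed cache lookup found a line
 * tag_matches_pa - that line's tag equals the PA delivered by the TLB
 * tlb_hit        - the TLB produced a translation                         */
action_t overlapped_access(bool cache_hit, bool tag_matches_pa, bool tlb_hit)
{
    if (cache_hit && tag_matches_pa)
        return DELIVER_DATA;           /* deliver data to the CPU          */
    if ((!cache_hit || !tag_matches_pa) && tlb_hit)
        return USE_TLB_PA;             /* access memory with the TLB's PA  */
    return DO_FULL_TRANSLATION;        /* do the standard VA translation   */
}
```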

39 Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as a result of VA translation. This usually limits things to small caches, large page sizes, or highly set associative caches if you want a large cache. Example: suppose everything is the same except that the cache is increased to 8 KB instead of 4 KB. [Diagram: with a 20-bit virtual page number and a 12-bit displacement, the 8 KB cache now needs an 11-bit index plus the 2-bit "00" offset, so the cache index reaches into the page number; that bit is changed by VA translation but is needed for the cache lookup.] Solutions: go to 8 KB page sizes; go to a 2-way set associative cache (1 K sets x 4 bytes x 2 ways, back to a 10-bit index); or have software guarantee VA[13] = PA[13].

40 Startup a User process Allocate Stack pages, Make a Page Table;
Set up the Instruction (I), Global Data (D) and Stack (S) pages; clear the Resident (R) and Dirty (D) bits in the page table; clear the V-bits in the TLB. [Diagram: the page table in kernel memory lists the I, D and S pages, all initially placed on the hard disk; the TLB entries (V, D, page #) are all invalid.]

41 Demand Paging
IM stage: we get a TLB miss and a page fault (page 0 is not resident). The page table (in kernel memory) holds the HD address for page 0 (P0). Read the page into RAM page X and update PA[23:10] in the page table. Update the TLB: set V, clear D, and write the page number and PA[23:10]. Restart the failing instruction: TLB hit! [Diagram: page 0 (I) now sits in RAM at address XX...X00..0, and the TLB entry holds V = 1, the 22-bit page number 00...0, D = 0, and physical address XX...X.]

42 Demand Paging
DM stage: we get a TLB miss and a page fault (page 3 is not resident). The page table (in kernel memory) holds the HD address for page 3 (P3). Read the page into RAM page Y and update PA[23:10] in the page table. Update the TLB: set V, clear D, and write the page number and PA[23:10]. Restart the failing instruction: TLB hit! [Diagram: page 0 (I) is in RAM at XX...X00..0 and page 3 (D) now at YY...Y00..0; the TLB holds V = 1, page number 00...0, physical address XX...X for the instruction page, and V = 1, page number 00...11, physical address YY...Y for the data page. The page table lists P0-P3.]
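A simplified sketch of the page-fault steps listed on these two slides (the kernel data structures and helper functions are invented stand-ins, not the real MIPS/OS interface):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     resident;     /* R: the page is in RAM                        */
    bool     dirty;        /* D: the page was modified since it was loaded */
    uint32_t frame;        /* PA[23:10] when resident                      */
    uint32_t disk_addr;    /* where the page lives on the hard disk        */
} page_entry_t;

/* Stubs standing in for the real kernel services.                         */
static uint32_t allocate_frame(void) { static uint32_t next; return next++; }
static void read_page_from_disk(uint32_t disk_addr, uint32_t frame)
{ (void)disk_addr; (void)frame; /* would issue the disk read */ }
static void tlb_insert(uint32_t page_no, uint32_t frame)
{ (void)page_no; (void)frame; /* would set V, clear D, write page#, PA */ }

/* Handle a page fault for page_no; afterwards the faulting instruction is
 * restarted and hits in the TLB.                                          */
void page_fault(page_entry_t *page_table, uint32_t page_no)
{
    uint32_t frame = allocate_frame();            /* RAM page X or Y       */
    read_page_from_disk(page_table[page_no].disk_addr, frame);
    page_table[page_no].frame    = frame;         /* update PA[23:10]      */
    page_table[page_no].resident = true;
    page_table[page_no].dirty    = false;
    tlb_insert(page_no, frame);                   /* update the TLB        */
}
```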

43 Spatial and Temporal Locality
Spatial locality: the TLB now holds the page translation (1024 bytes, 256 instructions); the next instruction (PC+4) will cause a TLB hit, as will accesses to a data array, e.g. 0($t0), 4($t0), etc. Temporal locality: the TLB still holds the translation when we branch within the same page and access the same instruction addresses again, or access the array again, e.g. 0($t0), 4($t0), etc. THIS IS THE ONLY REASON A SMALL TLB WORKS.

44 Replacement Strategy
If the TLB is full, the OS selects the TLB line to replace; any line will do, since they are all equivalent and checked concurrently. Strategies to select one: Random: not so good. Round robin: not so good, about the same as random. Least Recently Used (a move-to-top strategy): much better, the best we can do without knowing or predicting page accesses; it is based on temporal locality.

45 Hierarchy
Small, fast and expensive VS slow, big and inexpensive. [Diagram: TLB with 64 lines, versus the page tables in kernel memory (RAM 256 MB, far more entries than the TLB's 64), versus the HD at 32 GB.] The TLB and RAM contain copies; what if the copies are changed? INCONSISTENCY!

46 Inconsistency Replace a TLB entry, caused by TLB Miss
If the old TLB entry is dirty (D-bit set), we update the page table (kernel memory). Replacing a page in RAM (swapping), caused by a page fault: if the old page is in the TLB, check its TLB D-bit and, if dirty, write the page to the HD; then clear the TLB V-bit and the page table R-bit (the page is no longer resident). If the old page is not in the TLB, check its page-table D-bit and, if dirty, write the page to the HD; then clear the page-table R-bit (the page is not resident any more).

47 Current Working Set
If RAM is full, the OS selects a page to replace on a page fault. Note that RAM is shared by many user processes. Least Recently Used (a move-to-top strategy) is much better, the best we can do without knowing or predicting page accesses. Swapping is VERY expensive (maybe > 100 ms), so why not try harder to keep the pages needed (the working set) in RAM, using advanced memory-paging algorithms? The current working set of process P, e.g. {p0, p3, ...}, is the set of pages used during the window Δt up to t_now.

48 Thrashing
[Chart: probability of a page fault (0 to 1) versus the fraction of the working set that is not resident (0 to 1); as the fraction grows, the fault probability approaches 1 and we are thrashing.] When thrashing, no useful work is done! This is what we want to avoid.

49 Summary: Cache, TLB, Virtual Memory
Caches, TLBs and virtual memory can all be understood by examining how they deal with four questions: Where can a page (or block) be placed? How is it found? Which one is replaced on a miss? How are writes handled? Page tables map virtual addresses to physical addresses; TLBs are important for fast translation, and TLB misses are significant for processor performance (some systems can't access all of the second-level cache without TLB misses!).

Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of the memory hierarchy so that programmers, faced with a main memory much smaller than their programs, did not have to manage loading and unloading portions of their program in and out of memory. It was a controversial proposal at the time because very few programmers believed software could manage the limited memory resource as well as a human. This all changed as DRAM sizes grew exponentially over the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don't have to swap all the non-active processes to disk; consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, though we would like to emphasize not the only technique, to translate virtual memory addresses to physical memory addresses is to use a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since the TLB is so effective in reducing address translation time, TLB misses have a significant negative impact on processor performance.

50 Summary: Memory Hierarchy
Virtual memory was controversial at the time: can software automatically manage 64 KB across many programs? 1000X DRAM growth removed the controversy. Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is now more important than the increase in memory space. Today CPU time is a function of (ops, cache misses), not just of (ops): what does this mean for compilers, data structures, and algorithms? (See, e.g., the VTune performance analyzer for cache misses.)

