
1 Csci 136 Computer Architecture II – Overview on Cache Memory
Xiuzhen Cheng

2 Announcement
Homework assignment #11, due by April 21. Readings: Sections 7.1 – 7.3. Problems: , , 7.29, . Project #3 is due at 11:59 PM, April 17, 2004. Final: Thursday, May 12, 12:40AM-2:40PM. Note: you must pass the final to pass this course!

3 The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output. Today's Topics: locality and the memory hierarchy; simple caching techniques; many ways to improve cache performance.

4 Technology Trends
Capacity and speed (latency) trends:
Logic: capacity 2x in 3 years, speed 2x in 3 years
DRAM: capacity 4x in 3 years, speed 2x in 10 years
Disk: capacity 4x in 3 years, speed 2x in 10 years

Memory Technology   Typical Access Time         $ per GB in 2004
SRAM                0.5 - 5 ns                  $4,000 - $10,000
DRAM                50 - 70 ns                  $100 - $200
Magnetic Disk       5,000,000 - 20,000,000 ns   $0.50 - $2

5 Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency). Rely on caches to bridge the gap. (Figure: performance vs. year, 1980-2000, log scale: CPU performance grows ~60%/yr (2x/1.5 yr, "Moore's Law"), DRAM ~9%/yr (2x/10 yrs); the processor-memory performance gap grows ~50% per year.)

6 The Goal: illusion of large, fast, cheap memory
Fact: large memories are slow, and fast memories are small. How do we create a memory that is large, cheap, and fast (most of the time)? Hierarchy.

7 An Expanded View of the Memory System
(Figure: the processor (control + datapath) connects to a chain of memories.) Moving away from the processor: speed goes from fastest to slowest, size from smallest to biggest, and cost per bit from highest to lowest.

8 Why hierarchy works The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time. (Figure: probability of reference vs. address, over the address space 0 to 2^n - 1.)

9 Memory Hierarchy: How Does it Work?
Temporal Locality (Locality in Time): keep most recently accessed data items closer to the processor. Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels. (Figure: blocks Blk X and Blk Y moving between the lower-level memory and the upper level, to and from the processor.)
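A small C sketch (not from the slides; the array name and size are made up) makes the two kinds of locality concrete: the accumulator s is reused on every iteration (temporal locality), while a[0], a[1], a[2], ... sit at consecutive addresses, so each block brought into the upper level serves several subsequent accesses (spatial locality).

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];                 /* contiguous array: consecutive addresses */
    long s = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: 's' and the loop variables are touched on every
       iteration, so keeping them close to the processor pays off.
       Spatial locality: a[0], a[1], a[2], ... are adjacent in memory, so one
       fetched cache block supplies several subsequent accesses. */
    for (int i = 0; i < N; i++)
        s += a[i];

    printf("sum = %ld\n", s);
    return 0;
}
```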

10 Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X). Hit Rate: the fraction of memory accesses found in the upper level. Hit Time: time to access the upper level, which consists of the RAM access time + the time to determine hit/miss. Miss: data needs to be retrieved from a block in the lower level (Block Y). Miss Rate = 1 - (Hit Rate). Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor. Hit Time << Miss Penalty.

11 How is the hierarchy managed?
Registers <-> Memory: by the compiler (programmer?). Cache <-> Memory: by the hardware. Memory <-> Disks: by the hardware and operating system (virtual memory), or by the programmer (files).

12 Memory Hierarchy Technology
Random Access: "random" is good: access time is the same for all locations. DRAM: Dynamic Random Access Memory. High density, low power, cheap, slow. Dynamic: needs to be "refreshed" regularly. SRAM: Static Random Access Memory. Low density, high power, expensive, fast. Static: content will last "forever" (until you lose power). "Not-so-random" Access Technology: access time varies from location to location and from time to time. Examples: Disk, CDROM. Sequential Access Technology: access time linear in location (e.g., Tape). The technology we use to build our memory hierarchy can be divided into two categories: Random Access and Not-so-Random Access. Unlike most aspects of life, where the word random is usually associated with bad things, random, when associated with memory access, is, for lack of a better word, good! Random access means you can access any random location at any time and the access time will be the same as for any other random location, which is NOT the case for disks or tape, where the access time for a given location at any time can be quite different from some other random location at some other random time. As far as Random Access technology is concerned, we will concentrate on two specific technologies: Dynamic RAM and Static RAM. The advantages of Dynamic RAM are high density, low cost, and low power, so we can have a lot of it without burning a hole in our budget or our desktop. The disadvantage of DRAM is that it is slow. Also, it will forget what you tell it if you don't remind it constantly (Refresh). We will talk more about refresh later today. SRAM only has one redeeming feature: it is fast. Other than that, it has low density, is expensive, and burns a lot of power. Oh, SRAM actually has another redeeming feature: it will not forget what you tell it. It will keep whatever you write to it forever. Well, "forever" is a long time, so let's just say it will keep your data as long as you don't pull the plug on your computer. In the next two lectures, we will be focusing on DRAMs and SRAMs. We will not get into disk until the Virtual Memory lecture a week from now. +3 = 19 min. (X:59)

13 Levels of the Memory Hierarchy
From the upper level (faster) to the lower level (larger): Processor (instructions and operands), Cache (blocks), Memory (pages), Disk (files), Tape.

14 Memory Hierarchy (1/3)
Processor: executes instructions on the order of nanoseconds to picoseconds; holds a small amount of code and data in registers. Memory: more capacity than registers, still limited; access time ~ ns. Disk: HUGE capacity (virtually limitless); VERY slow: runs ~milliseconds.

15 Levels in the memory hierarchy
(Figure: Processor at the top, then Level 1, Level 2, Level 3, ..., Level n below; increasing distance from the processor means decreasing speed and increasing size of memory at each level.) As we move to deeper levels, the latency goes up and the price per bit goes down. Q: Can $/bit go up as we move deeper?

16 Memory Hierarchy (3/3)
If a level is closer to the processor, it must be: smaller; faster; a subset of lower levels (contains most recently used data). The lowest level (usually disk) contains all available data. Other levels?

17 Why Memory Caching We've discussed three levels in the hierarchy: processor, memory, disk. The mismatch between processor and memory speeds leads us to add a new level: a memory cache. Implemented with SRAM technology: faster but more expensive than DRAM memory. "S" = Static, no need to refresh, ~10 ns. "D" = Dynamic, needs to be refreshed, ~60 ns. arstechnica.com/paedia/r/ram_guide/ram_guide.part1-1.html

18 Memory Hierarchy Analogy: Library (1/2)
You're writing a term paper (Processor) at a table in Doe Library. Doe Library is equivalent to disk: essentially limitless capacity, but very slow to retrieve a book. The table is memory: smaller capacity (you must return a book when the table fills up), but easier and faster to find a book there once you've already retrieved it.

19 Memory Hierarchy Analogy: Library (2/2)
Open books on the table are the cache: smaller capacity (only a few open books fit on the table; again, when the table fills up, you must close a book), but much, much faster to retrieve data. Illusion created: the whole library is open on the tabletop. Keep as many recently used books open on the table as possible, since they are likely to be used again. Also keep as many books on the table as possible, since that is faster than going to the library.

20 Memory Hierarchy Basis
Disk contains everything. When the Processor needs something, bring it into all the higher levels of memory. Cache contains copies of data in memory that are being used. Memory contains copies of data on disk that are being used. The entire idea is based on Temporal/Spatial Locality.

21 Summary: Exploit Locality to Achieve Fast Memory
Two Different Types of Locality: Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon. Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon. By taking advantage of the principle of locality: present the user with as much memory as is available in the cheapest technology; provide access at the speed offered by the fastest technology. DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time. Let's summarize today's lecture. The first thing we covered is the principle of locality. There are two types of locality: temporal, or locality in time, and spatial, or locality in space. We talked about memory system design. The key idea of memory system design is to present the user with as much memory as possible in the cheapest technology while, by taking advantage of the principle of locality, creating an illusion that the average access time is close to that of the fastest technology. As far as Random Access technology is concerned, we concentrate on 2: DRAM and SRAM. DRAM is slow but cheap and dense, so it is a good choice for presenting the user with a BIG memory system. SRAM, on the other hand, is fast but it is also expensive both in terms of cost and power, so it is a good choice for providing the user with a fast access time. I have already shown you how DRAMs are used to construct the main memory for the SPARCstation 20. On Friday, we will talk about caches. +2 = 78 min. (Y:58)

22 Cache Design How do we organize cache?
Where does each memory address map to? (Remember that cache is subset of memory, so multiple memory addresses map to the same cache location.) How do we know which elements are in cache? How do we quickly locate them?

23 Cache Technologies
Direct Mapped Cache, Fully Associative Cache, Set Associative Cache

24 Direct Mapped Cache IDEA: assign the cache location based on the address of the block in memory. Each memory location is mapped to exactly one location in the cache. Mapping: (block address) modulo (number of blocks in the cache). How to find the block address? Questions to answer: how do we find the data in the cache, and how do we know the data in the cache is the one we want? TAG, INDEX and VALID BIT.
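A minimal sketch of the mapping rule, assuming a byte-addressed memory and illustrative (made-up) power-of-two sizes: the block address is the byte address divided by the block size, and the cache index is that block address modulo the number of blocks in the cache.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative direct-mapped parameters (not from the slides). */
#define BLOCK_SIZE  16u   /* bytes per block */
#define NUM_BLOCKS  64u   /* blocks in the cache */

int main(void) {
    uint32_t addr = 0x1234u;

    uint32_t block_addr = addr / BLOCK_SIZE;        /* which memory block */
    uint32_t index      = block_addr % NUM_BLOCKS;  /* (block address) mod (number of blocks) */
    uint32_t tag        = block_addr / NUM_BLOCKS;  /* remaining upper bits identify the block */

    printf("addr 0x%x -> block %u, index %u, tag 0x%x\n",
           addr, block_addr, index, tag);
    return 0;
}
```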

25 Direct Mapped Cache: One word cache
Cache Size = 4 bytes, Block Size = 4 bytes: only ONE entry in the cache. (Figure: a one-entry cache with a Valid Bit, Cache Tag, and Cache Data bytes 0-3.) This does not exploit temporal locality: if an item is accessed, it is likely to be accessed again soon, but it is unlikely to be accessed again immediately! The next access will likely be a miss again: we continually load data into the cache but discard (force out) it before it is used again. Worst nightmare of a cache designer: the Ping Pong Effect. Too many Conflict Misses, caused by different memory locations mapped to the same cache index. Solution 1: make the cache size bigger. Solution 2: multiple entries for the same Cache Index.

26 Exploring Temporal Locality

27 Exploring Spatial Locality

28 Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2^N byte cache: the uppermost (32 - N) bits are always the Cache Tag; the lowest M bits are the Byte Select (Block Size = 2^M). (Figure: the block address is split into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00); the valid bit and cache tag are stored as part of the cache "state", alongside 32 cache lines of data, Byte 0 through Byte 1023.) Questions: total bytes in the cache? How to compute the index?
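The field boundaries above can be checked with shifts and masks. A sketch for exactly this 1 KB, 32-byte-block configuration; the example address is reconstructed from the slide's tag/index/byte-select values:

```c
#include <stdint.h>
#include <stdio.h>

/* 1 KB direct-mapped cache, 32 B blocks:
   byte select = bits 4..0, index = bits 9..5, tag = bits 31..10 */
#define BYTE_BITS   5u
#define INDEX_BITS  5u

int main(void) {
    /* Build the slide's example address: tag 0x50, index 0x01, byte 0x00. */
    uint32_t addr = (0x50u << (BYTE_BITS + INDEX_BITS)) | (0x01u << BYTE_BITS) | 0x00u;

    uint32_t byte_sel = addr & ((1u << BYTE_BITS) - 1);
    uint32_t index    = (addr >> BYTE_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag      = addr >> (BYTE_BITS + INDEX_BITS);

    printf("addr=0x%05x tag=0x%x index=0x%x byte=0x%x\n", addr, tag, index, byte_sel);
    /* Expected: tag=0x50 index=0x1 byte=0x0 */
    return 0;
}
```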

29 Block Size Tradeoff
In general, larger block sizes take advantage of spatial locality, BUT: a larger block size means a larger miss penalty (it takes longer to fill up the block), and if the block size is too big relative to the cache size, the miss rate will go up (too few cache blocks). In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate. (Figure: three plots against block size: miss penalty increases with block size; miss rate first drops as spatial locality is exploited, then rises once there are too few blocks, compromising temporal locality; average access time therefore has a minimum at an intermediate block size.)

30 An Extreme Example: Fully Associative
Fully Associative Cache: forget about the Cache Index; there is no fixed location for each block; compare the Cache Tags of all cache entries in parallel. Example: with 32 B blocks, we need N 27-bit comparators. By definition: Conflict Misses = 0 for a fully associative cache. (Figure: the address splits into a 27-bit Cache Tag (bits 31-5) and a Byte Select (example 0x01); every entry's valid bit and tag are compared against the incoming tag at once.)

31 Set Associative Cache N-way set associative: N entries for each Cache Index, i.e., N direct mapped caches operating in parallel. Mapping: (block address) modulo (number of sets in the cache). Example: two-way set associative cache. The Cache Index selects a "set" from the cache; the two tags in the set are compared to the input tag in parallel; data is selected based on the tag comparison result. (Figure: a two-way set: the index selects a cache block in both ways, the address tag is compared against both stored tags, the comparison results are ORed to form Hit, and a mux (Sel1/Sel0) picks the matching way's cache block.)
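A rough C sketch of the two-way lookup just described (the structure and array names are invented for illustration). Hardware compares both tags at once; the loop here is only a software stand-in for that parallel compare.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    64u    /* illustrative sizes, not from the slides */
#define BLOCK_SIZE  32u
#define WAYS        2u

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct line cache[NUM_SETS][WAYS];

/* Returns a pointer to the block's data on a hit, NULL on a miss. */
static uint8_t *lookup(uint32_t addr) {
    uint32_t block_addr = addr / BLOCK_SIZE;
    uint32_t set        = block_addr % NUM_SETS;   /* (block address) mod (number of sets) */
    uint32_t tag        = block_addr / NUM_SETS;

    for (unsigned w = 0; w < WAYS; w++)            /* hardware does both compares in parallel */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return cache[set][w].data;
    return 0;                                      /* miss */
}

int main(void) {
    return lookup(0x1000) ? 0 : 1;   /* trivial use; the cache starts empty, so this misses */
}
```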

32 A Four-way Set-Associative Cache

33 Disadvantage of Set Associative Cache
N-way Set Associative Cache versus Direct Mapped Cache: N comparators vs. 1; extra MUX delay for the data; data comes AFTER the Hit/Miss decision and set selection. In a direct mapped cache, the Cache Block is available BEFORE Hit/Miss: it is possible to assume a hit and continue, and recover later on a miss. (Figure: the same two-way lookup path as before, highlighting that the data mux must wait for the tag compare and OR that produce Hit.)

34 Sources of Cache Misses
Compulsory (cold start or process migration, first reference): the first access to a block. A "cold" fact of life: there is not a whole lot you can do about it. Note: if you are going to run "billions" of instructions, Compulsory Misses are insignificant. Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity. Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size. Coherence (Invalidation): another process (e.g., I/O) updates memory. Capacity misses are due to the fact that the cache is simply not large enough to contain all the blocks that are accessed by the program. The solution to reduce the Capacity miss rate is simple: increase the cache size. Here is a summary of the other types of cache misses we talked about. First are the Compulsory misses. These are the misses that we cannot avoid. They are caused when we first start the program. Then we talked about the Conflict misses. They are the misses caused by multiple memory locations being mapped to the same cache location. There are two solutions to reduce conflict misses. The first one is, once again, to increase the cache size. The second one is to increase the associativity, for example by using a 2-way set associative cache instead of a direct mapped cache. But keep in mind that cache miss rate is only one part of the equation. You also have to worry about cache access time and miss penalty. Do NOT optimize miss rate alone. Finally, there is another source of cache misses we will not cover today. Those are referred to as invalidation misses, caused by another process, such as I/O, updating main memory, so you have to flush the cache to avoid inconsistency between memory and cache. +2 = 43 min. (Y:23)

35 Source of Cache Misses Quiz
Assume constant cost. For each organization (Direct Mapped, N-way Set Associative, Fully Associative), fill in: Cache Size (Small, Medium, Big?), Compulsory Miss, Conflict Miss, Capacity Miss, Coherence Miss. Choices: Zero, Low, Medium, High, Same.

36 Sources of Cache Misses Answer
                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Big             Medium                  Small
Compulsory Miss   Same            Same                    Same
Conflict Miss     High            Medium                  Zero
Capacity Miss     Low             Medium                  High
Coherence Miss    Same            Same                    Same

Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant.

37 Four Questions for Caches and Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

38 Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative. Set-associative mapping = (block number) modulo (number of sets). Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). Set associative: block 12 can go anywhere in set 0 (12 mod 4). (Figure: block-frame address and block number layouts for the three organizations, with sets 0-3 in the 2-way case.)
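The three placements on this slide follow directly from the modulo rule; a trivial check:

```c
#include <stdio.h>

int main(void) {
    int block = 12, cache_blocks = 8, ways = 2;
    int sets  = cache_blocks / ways;                             /* 4 sets for 2-way */

    printf("direct mapped  : block %d\n", block % cache_blocks); /* 12 mod 8 = 4 */
    printf("2-way set assoc: set %d\n",   block % sets);         /* 12 mod 4 = 0 */
    printf("fully assoc    : any of the %d blocks\n", cache_blocks);
    return 0;
}
```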

39 Q2: How is a block found if it is in the upper level?
(Figure: the Block Address is divided into Tag and Index, followed by the block offset; the index acts as the set select and the offset as the data select.) Lookup uses direct indexing (using the index and block offset), tag compares, or a combination. Increasing associativity shrinks the index and expands the tag.

40 Q3: Which block should be replaced on a miss?
Easy for Direct Mapped. For Set Associative or Fully Associative: Random, or LRU (Least Recently Used). Miss rates by associativity:

            2-way            4-way            8-way
Size        LRU     Random   LRU     Random   LRU     Random
16 KB       5.2%    5.7%             5.3%             5.0%
64 KB       1.9%    2.0%             1.7%             1.5%
256 KB      1.15%   1.17%
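For a two-way set, LRU needs only a single bit per set. A hedged sketch of that bookkeeping (names and sizes are illustrative):

```c
#include <stdint.h>

#define NUM_SETS 64u

/* One LRU bit per set: records which way was used least recently. */
static uint8_t lru[NUM_SETS];   /* each entry is 0 or 1 */

/* Call on every access that hits (or fills) 'way' in 'set'. */
static void touch(uint32_t set, uint32_t way) {
    lru[set] = (uint8_t)(way ^ 1u);   /* the other way is now least recently used */
}

/* Call on a miss to pick the victim way. */
static uint32_t victim(uint32_t set) {
    return lru[set];
}

int main(void) {
    touch(3, 1);                /* way 1 used, so way 0 becomes the LRU victim */
    return (int)victim(3);      /* returns 0 */
}
```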

41 Q4: What happens on a write?
Write through: the information is written to both the block in the cache and the block in the lower-level memory. Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (is the block clean or dirty?). Pros and cons of each? WT: simpler and cheaper, easy to implement, and read misses cannot result in writes. WB: no repeated writes to memory for repeated writes to the same block. WT is always combined with write buffers so that we don't wait for the lower-level memory.
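A sketch of how the two policies differ on a write hit, using a per-block dirty bit for write back (all names are illustrative; write misses and the write buffer are left out):

```c
#include <stdbool.h>
#include <stdint.h>

struct block {
    bool     dirty;      /* meaningful for write back only: modified since fill? */
    uint32_t data;
};

/* Stand-in for the lower-level memory write. */
static void memory_write(uint32_t addr, uint32_t value) { (void)addr; (void)value; }

/* Write through: update the cache block and the lower level on every write. */
static void write_through(struct block *b, uint32_t addr, uint32_t value) {
    b->data = value;
    memory_write(addr, value);
}

/* Write back: update only the cache block and mark it dirty; memory is
   updated later, when the block is evicted (not shown). */
static void write_back(struct block *b, uint32_t value) {
    b->data  = value;
    b->dirty = true;
}

int main(void) {
    struct block b = {0};
    write_through(&b, 0x100, 1);
    write_back(&b, 2);
    return b.dirty ? 0 : 1;
}
```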

42 Write Buffer for Write Through
(Figure: Processor -> Cache, with a Write Buffer between the cache and DRAM.) A Write Buffer is needed between the Cache and Memory. The processor writes data into the cache and the write buffer; the memory controller writes the contents of the buffer to memory. The write buffer is just a FIFO; a typical number of entries is 4. This works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle. The memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, i.e., write buffer saturation.

43 Write Buffer Saturation
Store frequency (w.r.t. time) > 1 / DRAM write cycle. If this condition exists for a long period of time (the CPU cycle time is too quick and/or there are too many store instructions in a row), the store buffer will overflow no matter how big you make it, because the CPU cycle time <= DRAM write cycle time. Solutions for write buffer saturation: use a write back cache, or install a second-level (L2) cache between the cache and the write buffer (does this always work?).

44 Cache performance equations:
Time = IC x CT x (ideal CPI + memory stalls/inst). Memory stalls/instruction = Average accesses/inst x Miss Rate x Miss Penalty = (Average IFETCH/inst x MissRateInst x MissPenaltyInst) + (Average Data/inst x MissRateData x MissPenaltyData). This assumes that the ideal CPI includes Hit Times, and that read and write miss penalties are the same. Average Memory Access Time = Hit Time x Hit Rate + (Miss Rate x Miss Penalty).
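These equations translate directly into code. A sketch, with an illustrative AMAT evaluation (the numbers in main are made up, not from the slides):

```c
#include <stdio.h>

/* CPU time = IC x CT x (ideal CPI + memory stalls per instruction) */
double cpu_time(double ic, double cycle_time, double ideal_cpi, double stalls_per_inst) {
    return ic * cycle_time * (ideal_cpi + stalls_per_inst);
}

/* Memory stalls per instruction, split into instruction-fetch and data parts. */
double mem_stalls_per_inst(double ifetch_per_inst, double miss_rate_inst, double penalty_inst,
                           double data_per_inst,   double miss_rate_data, double penalty_data) {
    return ifetch_per_inst * miss_rate_inst * penalty_inst
         + data_per_inst   * miss_rate_data * penalty_data;
}

/* Average memory access time, written exactly as on the slide. */
double amat(double hit_time, double hit_rate, double miss_rate, double miss_penalty) {
    return hit_time * hit_rate + miss_rate * miss_penalty;
}

int main(void) {
    /* Illustrative numbers only: 1-cycle hit, 5% miss rate, 50-cycle penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.95, 0.05, 50.0));
    return 0;
}
```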

45 Impact on Performance Suppose a processor executes at
Clock Rate = 200 MHz (5 ns per cycle), CPI = 1.1; 50% arith/logic, 30% ld/st, 20% control. Suppose that 10% of memory operations get a 50 cycle miss penalty. CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 (data mops/inst) x 0.10 (miss/data mop) x 50 (cycles/miss)) = 1.1 cycles + 1.5 cycles = 2.6. So 58% of the time the processor is stalled waiting for memory! A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
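Plugging the slide's numbers into the previous slide's stall equation reproduces the result; a quick check, not part of the original slides:

```c
#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double data_ops     = 0.30;   /* loads/stores per instruction */
    double miss_rate    = 0.10;   /* fraction of memory operations that miss */
    double miss_penalty = 50.0;   /* cycles */

    double stalls = data_ops * miss_rate * miss_penalty;   /* 1.5 cycles/instruction */
    double cpi    = ideal_cpi + stalls;                     /* 2.6 */

    printf("CPI = %.1f, stalled fraction = %.0f%%\n", cpi, 100.0 * stalls / cpi);  /* ~58%% */

    /* A 1% instruction miss rate: 1 fetch/inst x 0.01 x 50 cycles = 0.5 extra CPI. */
    printf("extra CPI from 1%% I-miss rate = %.1f\n", 1.0 * 0.01 * miss_penalty);
    return 0;
}
```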

46 Improving Cache Performance: 3 general options
Time = IC x CT x (ideal CPI + memory stalls). Average Memory Access Time = (Hit Rate x Hit Time) + (Miss Rate x Miss Penalty). 1. Reduce the miss rate. 2. Reduce the miss penalty. 3. Reduce the time to hit in the cache.

47 3Cs Absolute Miss Rate (SPEC92)
(Figure: absolute miss rate vs. cache size for SPEC92, broken into the 3Cs; the conflict component is labeled, and the compulsory component is vanishingly small.)

48 2:1 Cache Rule: the miss rate of a 1-way associative (direct mapped) cache of size X = the miss rate of a 2-way associative cache of size X/2. (Figure: the conflict-miss component illustrating the rule.)

49 3Cs Relative Miss Rate
(Figure: the 3Cs shown as relative miss rates; the conflict component is labeled.) Flaws: for fixed block size. Good: insight => invention.

50 1. Reduce Misses via Larger Block Size

51 2. Reduce Misses via Higher Associativity
2:1 Cache Rule: the Miss Rate of a direct mapped cache of size N is about the same as the Miss Rate of a 2-way cache of size N/2. Beware: execution time is the only final measure! Will the clock cycle time increase? Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache, +2% for an internal cache.

52 3. Reducing Misses via a “Victim Cache”
How do we combine the fast hit time of direct mapped and still avoid conflict misses? Add a small buffer that holds data discarded from the cache. Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache. Used in Alpha, HP machines. (Figure: the cache's tag/data arrays sit next to a few fully associative victim-cache lines, each with its own tag and comparator, on the path to the next lower level in the hierarchy.)

53 4. Reducing Misses by Hardware Prefetching
E.g., instruction prefetching: the Alpha fetches 2 blocks on a miss; the extra block is placed in a "stream buffer"; on a miss, check the stream buffer. Works with data blocks too: Jouppi [1990] found 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%. Palacharla & Kessler [1994], for scientific programs, found 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches. Prefetching relies on having extra memory bandwidth that can be used without penalty.

54 5. Reducing Misses by Software Prefetching Data
Data Prefetch: load data into a register (HP PA-RISC loads), or Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9). Special prefetching instructions cannot cause faults; this is a form of speculative execution. Issuing prefetch instructions takes time: is the cost of the prefetch issues < the savings in reduced misses? Wider superscalar machines reduce the difficulty of finding issue bandwidth.
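The MIPS IV / PA-RISC prefetch instructions named above are not shown here; as a rough stand-in, GCC and Clang expose a cache-prefetch builtin. A sketch assuming __builtin_prefetch and a made-up look-ahead distance:

```c
#include <stdio.h>

#define N         4096
#define LOOKAHEAD 16        /* how far ahead to prefetch; tuning is workload-dependent */

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = i;

    for (int i = 0; i < N; i++) {
        if (i + LOOKAHEAD < N)
            __builtin_prefetch(&a[i + LOOKAHEAD]);   /* hint: bring this block into the cache */
        sum += a[i];                                  /* the prefetch hint itself cannot fault */
    }

    printf("sum = %f\n", sum);
    return 0;
}
```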

55 6. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software. Instructions: reorder procedures in memory so as to reduce conflict misses; use profiling to look at conflicts (using tools they developed). Data: Merging Arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays. Loop Interchange: change the nesting of loops to access data in the order it is stored in memory (see the sketch below). Loop Fusion: combine 2 independent loops that have the same looping and some variables in common. Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows.
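A short C illustration of one of these transformations, loop interchange (the array and sizes are made up): C stores arrays row-major, so the interchanged version walks memory sequentially and gets far better spatial locality.

```c
#define N 512
static double x[N][N];

/* Before: column-major traversal; consecutive iterations touch addresses
   N*sizeof(double) bytes apart, so most accesses start a new cache block. */
void scale_cols_first(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] *= 2.0;
}

/* After loop interchange: row-major traversal matches the storage order,
   so each fetched block is fully used before moving on. */
void scale_rows_first(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] *= 2.0;
}

int main(void) {
    scale_rows_first();
    return 0;
}
```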

56 Summary #1/ 3: The Principle of Locality:
A program is likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space. Three (+1) major categories of cache misses: Compulsory Misses: sad facts of life; example: cold start misses. Conflict Misses: increase cache size and/or associativity. Nightmare scenario: the ping pong effect! Capacity Misses: increase cache size. Coherence Misses: caused by external processors or I/O devices. Cache Design Space: total size, block size, associativity; replacement policy; write-hit policy (write-through, write-back); write-miss policy. Let's summarize today's lecture. I know you have heard this many times and in many ways, but it is still worth repeating. The memory hierarchy works because of the Principle of Locality, which says a program will access a relatively small portion of the address space at any instant of time. There are two types of locality: temporal locality, or locality in time, and spatial locality, or locality in space. So far, we have covered three major categories of cache misses. Compulsory misses are cache misses due to cold start. You cannot avoid them, but if you are going to run billions of instructions anyway, compulsory misses usually don't bother you. Conflict misses are misses caused by multiple memory locations being mapped to the same cache location. The nightmare scenario is the ping pong effect, when a block is read into the cache but, before we have a chance to use it, it is immediately forced out by another conflict miss. You can reduce conflict misses by increasing the cache size or increasing the associativity, or both. Finally, capacity misses occur when the cache is not big enough to contain all the cache blocks required by the program. You can reduce this miss rate by making the cache larger. There are two write policies as far as cache writes are concerned. Write through requires a write buffer, and a nightmare scenario is when stores occur so frequently that you saturate your write buffer. The second write policy is write back. In this case, you only write to the cache, and only when the cache block is being replaced do you write the cache block back to memory. +3 = 77 min. (Y:57)

57 Summary #2 / 3: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation. The optimal choice is a compromise: it depends on access characteristics (workload; use as I-cache, D-cache, or TLB) and on technology / cost. Simplicity often wins. (Figure: the design space sketched along the Cache Size, Associativity, and Block Size axes, with a Good/Bad trade-off curve between two factors, Less vs. More.) No fancy replacement policy is needed for the direct mapped cache. As a matter of fact, that is what causes direct mapped trouble to begin with: there is only one place to go in the cache, which causes conflict misses. Besides working at Sun, I also teach people how to fly whenever I have time. Statistics have shown that if a pilot crashes after an engine failure, he or she is more likely to get killed in a multi-engine light airplane than in a single-engine airplane. The joke among us flight instructors is: sure, when the engine quits in a single-engine airplane, you have one option: sooner or later, you land. Probably sooner. But in a multi-engine airplane with one engine stopped, you have a lot of options. It is the need to make a decision that kills those people.

58 Summary #3 / 3: Cache Miss Optimization
Lots of techniques people use to improve the miss rate of caches. Techniques that reduce the miss rate: Larger Block Size, Higher Associativity, Victim Caches, Pseudo-Associative Caches, HW Prefetching of Instr/Data, Compiler Controlled Prefetching, Compiler Techniques to Reduce Misses.

59 Questions?

