ENG3380 Computer Organization and Architecture “Cache Memory Part III”


1 ENG3380 Computer Organization and Architecture “Cache Memory Part III”
Winter 2017 S. Areibi School of Engineering University of Guelph

2 Topics
Cache Associativity
Direct Mapped vs. N-way Set Associative
Cost of Set Associative Cache Design
Write Policy
Replacement Policy
Summary
With thanks to W. Stallings, Hamacher, J. Hennessy, and M. J. Irwin for lecture slide contents. Many slides are adapted from the PPT slides accompanying the textbook and the CSE331 course.

3 References
"Computer Organization and Architecture: Designing for Performance", 10th edition, by William Stallings, Pearson.
"Computer Organization and Design: The Hardware/Software Interface", 5th edition, by D. Patterson and J. Hennessy, Morgan Kaufmann.
"Computer Organization and Architecture: Themes and Variations", 2014, by Alan Clements, CENGAGE Learning.

4 Associative Cache

5 Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache: fully associative, direct mapped, or 2-way set associative.
Set-associative mapping: set = block number modulo number of sets.
Direct mapped: (12 mod 8) = 4. 2-way set associative: (12 mod 4) = 0. Fully associative: anywhere.

6 Spectrum of Associativity
For a cache with 8 entries (i.e., 8 blocks): # of sets = # of blocks / associativity.
1-way (direct mapped): # of sets = 8, each set holds 1 block.

7 Spectrum of Associativity
For a cache with 8 entries: # of sets = # of blocks / associativity.
1-way (direct mapped): 8 sets of 1 block. 2-way: 4 sets of 2 blocks.

8 Spectrum of Associativity
For a cache with 8 entries: 1-way: 8 sets of 1 block; 2-way: 4 sets of 2 blocks; 4-way: 2 sets of 4 blocks.

9 Spectrum of Associativity
For a cache with 8 entries: 1-way: 8 sets of 1 block; 2-way: 4 sets of 2 blocks; 4-way: 2 sets of 4 blocks; 8-way (fully associative): 1 set of 8 blocks.

10 Direct Mapping
Simplest approach, using a fixed mapping: memory block j → cache block (j mod # blocks in the cache).
There is only one possible location for each memory block, so two blocks may contend for the same location, and a new block always overwrites the previous one.
Divide the address into 3 fields: word, block, tag. The block field determines the location in the cache; the tag field from the original address is stored in the cache and compared with the tag of a later address to decide hit or miss. (A minimal field-extraction sketch follows.)
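As a concrete illustration of the word/block/tag split, here is a minimal C sketch of the field extraction. It borrows the 128-block, 16-words-per-block, 16-bit-address geometry from the mapping-function examples later in the deck, so the field widths are assumptions tied to that example rather than values fixed by this slide.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed geometry (from the later mapping-function example):
   16 words per block -> 4 word bits, 128 cache blocks -> 7 block bits,
   16-bit word address -> 5 tag bits. */
#define WORD_BITS  4
#define BLOCK_BITS 7

int main(void) {
    uint16_t addr  = 0xABCD;                                         /* example word address */
    uint16_t word  = addr & ((1u << WORD_BITS) - 1);                 /* low 4 bits           */
    uint16_t block = (addr >> WORD_BITS) & ((1u << BLOCK_BITS) - 1); /* next 7 bits          */
    uint16_t tag   = addr >> (WORD_BITS + BLOCK_BITS);               /* remaining 5 bits     */
    printf("tag=%u block=%u word=%u\n", tag, block, word);
    return 0;
}
```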

11 Fully Associative Mapping
Full flexibility: a block may be located anywhere in the cache.
The block field of the address no longer needs any bits; the tag field is enlarged to encompass those bits, and the larger tag is stored in the cache with each block.
For hit/miss detection, all tags are compared simultaneously, in parallel, against the tag field of the given address. This associative search increases complexity.
The flexible mapping also requires an appropriate replacement algorithm when the cache is full.

12 Set-Associative Mapping
A combination of direct and associative mapping; the associative search involves only the tags within one set.
k blocks/set → k-way set-associative cache. Direct mapped → 1-way; fully associative → all-way.
Blocks of the cache are grouped into sets. The block (set) field bits map a block to a unique set, but any block within that set may be used.
Reducing flexibility also reduces complexity: the replacement algorithm only chooses among the blocks of one set. (A small mapping sketch follows.)
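A small sketch of the set-mapping rule, checked against the block-12-in-an-8-block-cache example from the earlier placement slide; the function and its parameters are illustrative, not from the slides.

```c
#include <stdio.h>

/* Map a memory block number to a set in a k-way set-associative cache. */
static unsigned set_index(unsigned block_addr, unsigned num_blocks, unsigned ways) {
    unsigned num_sets = num_blocks / ways;   /* # of sets = # of blocks / associativity */
    return block_addr % num_sets;            /* block maps to exactly one set           */
}

int main(void) {
    /* Block 12 in an 8-block cache, as on the earlier placement slide: */
    printf("direct mapped (1-way): set %u\n", set_index(12, 8, 1)); /* 12 mod 8 = 4 */
    printf("2-way set associative: set %u\n", set_index(12, 8, 2)); /* 12 mod 4 = 0 */
    printf("fully associative:     set %u\n", set_index(12, 8, 8)); /* 12 mod 1 = 0 */
    return 0;
}
```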

13 # of Sets and Blocks per Set
For a cache with a given number of blocks, # of sets = # of blocks / associativity.
Direct mapped: # of sets = # of blocks in the cache; 1 block per set.

14 # of Sets and Blocks per Set
Direct mapped: # of sets = # of blocks in the cache; 1 block per set.
Set associative: # of sets = (# of blocks in the cache) / associativity; blocks per set = associativity (typically 2 to 16).

15 # of Sets and Blocks per Set
Direct mapped: # of sets = # of blocks in the cache; 1 block per set.
Set associative: # of sets = (# of blocks in the cache) / associativity; blocks per set = associativity (typically 2 to 16).
Fully associative: 1 set; blocks per set = # of blocks in the cache.

16 Location Method and # of Comparisons
Direct mapped: location method = index; # of comparisons = 1.

17 Location Method and # of Comparisons
Direct mapped: location method = index; # of comparisons = 1.
Set associative: location method = index the set, then compare the set's tags; # of comparisons = degree of associativity.

18 Location Method and # of Comparisons
Direct mapped: location method = index; # of comparisons = 1.
Set associative: location method = index the set, then compare the set's tags; # of comparisons = degree of associativity.
Fully associative: location method = compare all blocks' tags; # of comparisons = # of blocks.

19 Two-way Set Associative Cache Architecture
N-way set associative: N entries for each cache index, i.e., N direct-mapped caches operating in parallel (N is typically 2 to 4).
Example: a two-way set associative cache. The cache index selects a set from the cache, and the two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss; otherwise we have a hit and the data is selected from the way whose tag matched.
(Figure: two Valid/Tag/Data ways indexed by the cache index, two tag comparators, and a 2-to-1 mux driven by the compare results that produces the hit signal and the selected cache block.)
What are its disadvantages? (A software sketch of this lookup follows.)
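The lookup path on this slide can be mirrored in software; the sketch below is a hedged illustration, not the deck's implementation, and the set count and data type are assumed for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 128          /* assumed cache size; not specified on the slide */

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    uint32_t data[2];         /* stand-in for a whole cache block */
} set_t;

static set_t cache[NUM_SETS];

/* Two-way lookup: index selects the set, both tags are compared "in parallel",
   and the ternary below plays the role of the 2-to-1 mux on the slide. */
bool lookup(uint32_t index, uint32_t addr_tag, uint32_t *out) {
    set_t *s = &cache[index % NUM_SETS];
    bool hit0 = s->valid[0] && s->tag[0] == addr_tag;   /* comparator for way 0 */
    bool hit1 = s->valid[1] && s->tag[1] == addr_tag;   /* comparator for way 1 */
    if (hit0 || hit1) {
        *out = hit0 ? s->data[0] : s->data[1];          /* select data by tag result */
        return true;                                    /* cache hit  */
    }
    return false;                                       /* cache miss */
}
```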

20 Range of Set Associative Caches
For a fixed-size cache, each increase in associativity by a factor of two doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit.
Address fields: tag (used for tag compare) | index (selects the set) | block offset (selects the word in the block) | byte offset (selects the byte in a word). Increasing associativity shifts bits from the index to the tag; decreasing associativity does the opposite.
Direct mapped (only one way) has the smallest tags; fully associative (only one set) has no index at all, so the tag is all the bits except the block and byte offsets.

21 Mapping Functions: Direct Mapping
Cache block = (block address) modulo (# blocks in the cache).
Example: a small cache with 128 blocks of 16 words each; main memory of 64K words (4K blocks); the memory is word-addressable, so addresses are 16 bits.

22 Mapping Functions: Fully Associative Mapping
Example: a small cache with 128 blocks of 16 words each; main memory of 64K words (4K blocks); the memory is word-addressable, so addresses are 16 bits.

23 Mapping Functions: 2-Way Set Associative
Cache set = (block address) modulo (# sets in the cache).
Example: a small cache with 128 blocks of 16 words each (64 sets of 2 blocks); main memory of 64K words (4K blocks); the memory is word-addressable, so addresses are 16 bits.

24 Example
A 32 KB, 4-way set-associative data cache with 32-byte lines (assume the processor address is 40 bits).
How many sets? 32 KB / 128 B per set = 256 sets, i.e., 2^8.
How many index, offset, and tag bits? 5 offset bits (2^5 = 32), 8 index bits, and 40 - 13 = 27 tag bits.
How large is the tag array? 27 × 4 × 256 = 27,648 bits, about 27 Kbits (roughly 3.4 KB).
Address fields: tag (27 bits) | index (8 bits) | offset (5 bits), 40 bits total.
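The arithmetic on this slide can be checked with a few lines of C; the 32-byte line size is inferred from the 5-bit offset, and the other parameters come from the slide.

```c
#include <stdio.h>

int main(void) {
    unsigned cache_bytes = 32 * 1024;   /* 32 KB data cache            */
    unsigned line_bytes  = 32;          /* inferred from 5 offset bits */
    unsigned ways        = 4;           /* 4-way set associative       */
    unsigned addr_bits   = 40;          /* processor address width     */

    unsigned sets        = cache_bytes / (line_bytes * ways);      /* 256       */
    unsigned offset_bits = 5;                                      /* log2(32)  */
    unsigned index_bits  = 8;                                      /* log2(256) */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;   /* 27        */
    unsigned tag_array   = tag_bits * ways * sets;                 /* in bits   */

    printf("sets=%u, tag=%u bits, tag array=%u bits (%.2f KB)\n",
           sets, tag_bits, tag_array, tag_array / 8.0 / 1024.0);
    return 0;
}
```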

25 Miss Rate vs Block Size vs Cache Size
Miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache becomes smaller (increasing capacity misses).
AMAT = Hit Time + (Miss Rate × Miss Penalty)
Solution?

26 Reducing Cache Miss Rates
Allow more flexible block placement.
In a direct mapped cache a memory block maps to exactly one cache block (no flexibility). At the other extreme, a memory block could be allowed to map to any cache block: a fully associative cache.
A compromise is to divide the cache into sets, each consisting of n "ways" (an n-way set associative cache). A memory block maps to a unique set, specified by the index field, and can be placed in any way of that set (so there are n choices):
set = (block address) modulo (# sets in the cache)

27 Recall in Direct Mapped Cache
Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache, all blocks initially marked not valid.
0 (tag 00, index 00): miss. 1 (tag 00, index 01): miss. 2 (tag 00, index 10): miss. 3 (tag 00, index 11): miss.
4 (tag 01, index 00): miss, replaces Mem(0). 3: hit. 4: hit. 15 (tag 11, index 11): miss, replaces Mem(3).
8 requests, 6 misses.

28 2-Way Set Associative: Spatial Locality
Let the cache block hold more than one word (two words per block here). Start with an empty cache, all blocks initially marked not valid. Same reference string: 0 1 2 3 4 3 4 15.
0: miss, block Mem(1)/Mem(0) loaded. 1: hit. 2: miss, block Mem(3)/Mem(2) loaded. 3: hit. 4: miss, Mem(5)/Mem(4) replaces Mem(1)/Mem(0). 3: hit. 4: hit. 15: miss, block Mem(15)/Mem(14) loaded.
Index = (block address) modulo (# sets in the cache).
8 requests, 4 misses vs. 6 misses in the direct mapped cache.
AMAT = Hit Time + (Miss Rate × Miss Penalty)

29 Direct Mapped Cache: Ping Pong Effect
Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache, all blocks initially marked not valid.
0: miss. 4: miss, replaces Mem(0). 0: miss, replaces Mem(4). 4: miss. 0: miss. 4: miss. 0: miss. 4: miss.
8 requests, 8 misses.
Ping pong effect due to conflict misses: two memory locations that map into the same cache block keep evicting each other.

30 Reference String: 2-way Set Associative
Consider the same reference string 0 4 0 4 0 4 0 4 with a 2-way set associative cache. Start with an empty cache, all blocks initially marked not valid.
0: miss. 4: miss. All six remaining accesses: hits, since Mem(0) and Mem(4) now co-exist in the same set.
8 requests, 2 misses vs. 8 misses in the direct mapped cache!
This solves the ping pong effect of a direct mapped cache due to conflict misses, since two memory locations that map into the same cache set can co-exist.

31 Associativity Example
Compare 4-block caches: direct mapped (4 sets of 1 block), 2-way set associative (2 sets of 2 blocks), and fully associative (1 set of 4 blocks). Block access sequence: 0, 8, 0, 6, 8.
Direct mapped: block 0 maps to index (0 mod 4) = 0, block 6 to (6 mod 4) = 2, block 8 to (8 mod 4) = 0.
0: miss, Mem[0] into index 0. 8: miss, replaces Mem[0]. 0: miss, replaces Mem[8]. 6: miss, Mem[6] into index 2. 8: miss, replaces Mem[0].
Direct mapped: 5 misses.

32 Associativity Example
2-way set associative: block 0 maps to set (0 mod 2) = 0, block 6 to set 0, block 8 to set 0. (What happens if the number of blocks is 8 or 16?)
0: miss. 8: miss. 0: hit. 6: miss, replaces Mem[8]. Why does 6 replace 8? With LRU, block 0 was just referenced, so 8 is the least recently used block in the set. 8: miss.
2-way set associative: 4 misses.
Fully associative: 0: miss. 8: miss. 0: hit. 6: miss. 8: hit.
Fully associative: 3 misses. (A small simulation of this sequence follows.)
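The miss counts above can be reproduced with a toy LRU simulation; this is only a sketch of the example, with timestamps standing in for real LRU hardware.

```c
#include <stdio.h>

#define BLOCKS 4   /* 4-block cache, as in the example */

/* Simulate the block sequence 0, 8, 0, 6, 8 with LRU replacement in each set. */
static int simulate(int ways) {
    int sets = BLOCKS / ways;
    int tag[BLOCKS], last_use[BLOCKS], valid[BLOCKS];
    int seq[] = {0, 8, 0, 6, 8}, misses = 0;

    for (int i = 0; i < BLOCKS; i++) { valid[i] = 0; last_use[i] = -1; }

    for (int t = 0; t < 5; t++) {
        int blk = seq[t], base = (blk % sets) * ways;
        int hit = -1, victim = base;
        for (int w = 0; w < ways; w++) {
            int line = base + w;
            if (valid[line] && tag[line] == blk) hit = line;      /* tag match        */
            if (last_use[line] < last_use[victim]) victim = line; /* least recent way */
        }
        if (hit >= 0) { last_use[hit] = t; continue; }            /* hit: update LRU    */
        misses++;                                                 /* miss: fill LRU way */
        valid[victim] = 1; tag[victim] = blk; last_use[victim] = t;
    }
    return misses;
}

int main(void) {
    printf("direct mapped:     %d misses\n", simulate(1));  /* expect 5 */
    printf("2-way, LRU:        %d misses\n", simulate(2));  /* expect 4 */
    printf("fully associative: %d misses\n", simulate(4));  /* expect 3 */
    return 0;
}
```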

33 Four-Way Set Associative Cache
2^8 = 256 sets, each with four ways (each way holding one block). Address fields: 22-bit tag, 8-bit index, byte offset.
(Figure: four parallel Valid/Tag/Data arrays of 256 entries each, four tag comparators, and a 4-to-1 select producing the hit signal and the 32-bit data.)
This is called a 4-way set associative cache because there are four cache entries for each cache index; essentially, four direct-mapped caches working in parallel. The cache index selects a set, and the four tags in the set are compared in parallel with the upper bits of the memory address. If no tag matches the incoming address tag, we have a cache miss; otherwise we have a hit and the data is selected from the way whose tag matched. What are its disadvantages?

34 Costs of Set Associative Caches
An N-way set associative cache costs N comparators (delay and area) plus a MUX delay (way selection) before the data is available.
The data is available only after the way selection and Hit/Miss decision. In a direct mapped cache the cache block is available before the Hit/Miss decision, so it is possible to assume a hit and continue, recovering later if it was a miss; with a set associative cache this is not possible.

35 Costs of Set Associative Caches
When a miss occurs, which way's block do we pick for replacement?
Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time. This requires hardware to keep track of when each way's block was used relative to the other blocks in the set.
For a 2-way set associative cache, this takes one bit per set: set the bit when a block is referenced (and reset the other way's bit). (A one-bit LRU sketch follows.)
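The one-bit-per-set bookkeeping can be sketched as follows; the structure and function names are illustrative only.

```c
#include <stdbool.h>

/* One LRU bit per set for a 2-way set associative cache:
   the bit records which way was referenced most recently,
   so the other way is the replacement victim. */
typedef struct {
    bool mru_is_way1;                 /* set on a reference to way 1, reset on way 0 */
} lru2_t;

static void lru2_touch(lru2_t *s, int way) { s->mru_is_way1 = (way == 1); }

static int lru2_victim(const lru2_t *s)    { return s->mru_is_way1 ? 0 : 1; }
```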

36 Benefits of Set Associative Caches
The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation. (Data from Hennessy & Patterson, Computer Architecture, 2003.)
As cache sizes grow, the relative improvement from associativity increases only slightly: since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases and the absolute improvement in miss rate from associativity shrinks significantly.
The largest gains come from going from direct mapped to 2-way (20%+ reduction in miss rate).

37 How Much Associativity
Increased associativity decreases the miss rate, but with diminishing returns.
Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000: 1-way 10.3%, 2-way 8.6%, 4-way 8.3%, 8-way 8.1%.

38 Associative Caches
Fully associative: allow a given block to go in any cache entry; all entries must be searched at once; one comparator per entry (expensive).
n-way set associative: each set contains n entries; the block number determines the set, as (block number) modulo (# of sets in the cache); all entries in that set are searched at once; n comparators (less expensive).

39 Replacement Policy

40 Four Questions for Cache Design
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement strategy) Q4: What happens on a write? (Write strategy)

41 Replacement Strategy
Replacement is trivial for direct mapping, since there is only one possible location for each block (determined by its address).
In associative and set-associative caches there is some flexibility. When a new block is to be brought into the cache and all positions that it may occupy are full, the cache controller must decide which of the old blocks to overwrite. This is an important issue, because the decision can be a strong determining factor in system performance.
In general, the objective is to keep in the cache the blocks that are likely to be referenced in the near future, but it is not easy to determine which blocks those are. The property of locality of reference in programs gives a clue to a reasonable strategy.

42 Cache Replacement
We need a free line to insert the new block. Which block should we kick out? Several strategies:
Random (randomly selected line)
FIFO (the line that has been in the cache the longest)
LFU (Least Frequently Used line)
LRU (Least Recently Used line)
LRU approximations, such as NMRU (Not Most Recently Used)

43 LRU Replacement Algorithm
Replace the block in the set that has been in the cache longest with no reference to it: exploit temporal locality of reference with a least-recently-used (LRU) algorithm.
For k-way set associativity, each block in a set has a counter ranging from 0 to k-1.
Hitting on a block clears its counter to 0; counters in the set that were originally lower are incremented; the others are unchanged.
If the set is full, the block replaced is the one whose counter is k-1.

44 Another Implementation of LRU
Keep an LRU counter for each line in the set.
When a line is accessed: read the old value X of its counter, set its counter to the maximum value, and for every other line in the set decrement its counter if it is larger than X.
When a replacement is needed: select the line whose counter is 0. (A sketch of this scheme follows.)
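A sketch of this counter scheme for one set; the 4-way size and the initial counter values are assumptions for the example.

```c
#define WAYS 4                              /* assumed associativity */

static int counters[WAYS] = {0, 1, 2, 3};   /* assume an initial LRU ordering */

/* On an access to `way`: counters larger than its old value move one step
   toward LRU, and the accessed line becomes the most recently used. */
void lru_on_access(int way) {
    int x = counters[way];                  /* old value X */
    for (int w = 0; w < WAYS; w++)
        if (w != way && counters[w] > x)
            counters[w]--;
    counters[way] = WAYS - 1;               /* maximum value = most recently used */
}

/* On a replacement: the line whose counter is 0 is the LRU victim. */
int lru_victim(void) {
    for (int w = 0; w < WAYS; w++)
        if (counters[w] == 0) return w;
    return 0;                               /* unreachable if counters stay a permutation */
}
```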

45 Approximating LRU
True LRU is pretty complicated (especially for many ways): it must access, and possibly update, all the counters in a set on every access, not just on a replacement.
We need something simpler and faster, but still close to LRU.
NMRU (Not Most Recently Used): the entire set has one MRU pointer, which points to the last-accessed line in the set. Replacement: randomly select a non-MRU line. (A sketch follows.)
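A hedged sketch of NMRU victim selection; the 4-way size and the use of rand() are illustrative choices, not prescribed by the slide.

```c
#include <stdlib.h>

#define WAYS 4                               /* assumed associativity */

typedef struct { int mru; } nmru_set_t;      /* one MRU pointer per set */

void nmru_touch(nmru_set_t *s, int way) { s->mru = way; }   /* remember last-accessed line */

/* Replacement: pick uniformly among the WAYS-1 lines that are not the MRU one. */
int nmru_victim(const nmru_set_t *s) {
    int v = rand() % (WAYS - 1);
    return (v >= s->mru) ? v + 1 : v;        /* skip over the MRU way */
}
```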

46 Which block should be replaced on a miss?
Replacement: Random vs. LRU. Which block should be replaced on a miss in a set associative or fully associative cache?
Miss rates, LRU vs. Random, by cache size and associativity:
16 KB: 2-way 5.2% / 5.7%, 4-way 4.7% / 5.3%, 8-way 4.4% / 5.0%
64 KB: 2-way 1.9% / 2.0%, 4-way 1.5% / 1.7%, 8-way 1.4% / 1.5%
256 KB: 2-way 1.15% / 1.17%, 4-way 1.13% / 1.13%, 8-way 1.12% / 1.12%

47 Miss Rate for 2-way Set Associative Cache
Replacement: Random vs. LRU.
The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity.
A randomly chosen block? Easy to implement; how well does it work?
Miss rate for a 2-way set associative cache:
16 KB: Random 5.7%, LRU 5.2%
64 KB: Random 2.0%, LRU 1.9%
256 KB: Random 1.17%, LRU 1.15%
Also try other LRU approximations.

48 Replacement Policy: Summary
Direct mapped: no choice.
Set associative: prefer a non-valid entry, if there is one; otherwise, choose among the entries in the set.
Least recently used (LRU): choose the one unused for the longest time. Simple for 2-way, manageable for 4-way, too hard beyond that.
Random: gives approximately the same performance as LRU for high associativity.

49 Write Policy

50 Four Questions for Cache Design
Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement strategy) Q4: What happens on a write? (Write strategy)

51 1. Cache Hits
The processor issues Read and Write requests as if it were accessing main memory directly, but the control circuitry first checks the cache.
If the desired information is present in the cache, a read or write hit occurs: read hit, the instruction or data is read from the cache; write hit, the data is written in the cache.
For a read hit, main memory is not involved; the cache provides the desired information.
For a write hit, there are two approaches: Write Through and Write Back.

52 Write Policy
Write-through: update both the upper and lower levels (keeps them consistent). It simplifies replacement, but may require a write buffer. Simpler, yet it results in unnecessary write operations to main memory when a given cache word is updated several times during its cache residency.
Write-back: update the upper level only, and update the lower level when the block is replaced. It also involves unnecessary write operations, because all words of the block are eventually written back even if only a single word has been changed. Write-back is more complex to implement than write-through.

53 Write-Back Caches Need a Dirty bit for each line
A dirty bit indicates that the cache and main memory are inconsistent: a dirty line has more recent data in the cache than in memory.
A line starts clean (not dirty) and becomes dirty on the first write to it; memory is not updated yet, so the cache holds the only up-to-date copy of the data for a dirty line.
Replacing a dirty line requires writing its data back to memory (write-back). (A sketch of this bookkeeping follows.)
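A small sketch of the dirty-bit bookkeeping described above; the line layout and the write_block_to_memory() helper are hypothetical, for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty;
    uint32_t tag;
    uint8_t  data[32];                /* assumed 32-byte block */
} line_t;

/* Hypothetical helper standing in for the path to DRAM. */
static void write_block_to_memory(const line_t *l) { (void)l; }

/* Write hit: only the cache is updated; the line is marked dirty. */
void write_hit(line_t *l, int offset, uint8_t value) {
    l->data[offset] = value;          /* cache now holds the only up-to-date copy   */
    l->dirty = true;                  /* memory is stale until this line is evicted */
}

/* Eviction: a dirty line must be written back before it is replaced. */
void evict(line_t *l) {
    if (l->valid && l->dirty)
        write_block_to_memory(l);     /* write-back happens only for dirty lines */
    l->valid = false;
    l->dirty = false;
}
```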

54 2. Cache Misses
A cache miss is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss.
Instruction read misses cause the largest delay (why? the processor must stall until the instruction is fetched from main memory). Data read misses usually cause a smaller delay (why? instructions that do not depend on the read can continue). Data write misses generally cause the shortest delay (why? the write can be queued).

55 Handling Cache Misses
If the desired information is not present in the cache, a read or write miss occurs.
For a read miss, the block containing the desired word is transferred from main memory to the cache. Note: the word may be sent to the processor as soon as it is read from main memory ("early restart"), which reduces processor wait time.
For a write miss (one possible scenario): first fetch the words of the block from memory; after the block is fetched and placed into the cache, overwrite the word that caused the miss in the cache block; also write the word to main memory using the full address.

56 Write Miss: Example
We try to write to an address that is not already contained in the cache (a write miss). Say we want to store into Mem[214], but we find that this address is not currently in the cache.
(Figure: the direct mapped cache entry at that index holds a different tag and data; main memory holds the old value 6378 at the target address.)
Write allocation policy: should we bring the block from main memory into the cache and overwrite the value, or should we forget about the cache and write directly to main memory?

57 Write Allocation: Misses
Do we allocate cache lines on a write? Write-allocate: A write miss brings block into cache

58 Cache Write Misses Do we allocate cache lines on a write?
Write-allocate: a write miss brings the block into the cache.
No-write-allocate: a write miss leaves the cache as it was.

59 Allocate on Write: Example
An allocate-on-write strategy loads the cache with the block from memory and then updates it with the new data: Mem[214] = 21763.
(Figure: the cache now also holds a line with tag 11010 and data 21763.)
If that data is needed again soon, it will be available in the cache. This is generally the baseline behavior on processors.

60 Write Around: Example
With a write-around policy, the write operation goes directly to main memory without affecting the cache: Mem[214] = 21763.
(Figure: main memory now holds 21763; the cache contents are unchanged.)
Some modern processors with write-allocate caches provide special store instructions, called non-temporal stores, to do this.

61 Write Allocation
What should happen on a write miss?
Alternatives for write-through: allocate on miss (fetch the block), or write around (don't fetch the block, since programs often write a whole block before reading it, e.g., during initialization).
For write-back: usually fetch the block.

62 Write-Through (Write Buffer)
The advantage of the write-through strategy is that it updates both the cache and memory: consistency. But writes to memory take a long time!
E.g., if the base CPI = 1, 10% of instructions are stores, and a write to memory takes 100 cycles: effective CPI = 1 + 0.1 × 100 = 11. Performance is reduced by more than a factor of 10!
Solution: a write buffer. It holds data waiting to be written to memory, the CPU continues immediately, and the CPU only stalls on a write if the write buffer is already full.

63 Write Buffer for Write-Through Caching
A write buffer sits between the cache and main memory: the processor writes data into the cache and into the write buffer, and the memory controller writes the contents of the write buffer to memory (DRAM).
The write buffer is just a FIFO; a typical number of entries is 4. It works fine as long as the store frequency (with respect to time) << 1 / DRAM write cycle time.
The memory system designer's nightmare: when the store frequency (with respect to time) approaches 1 / DRAM write cycle time, the write buffer saturates. No matter how large the buffer is, it fills faster than it can be drained, and the processor ends up running at DRAM speed.
Solutions: use a write-back cache instead, or insert a (write-back) L2 cache between the write buffer and memory. (A FIFO write-buffer sketch follows.)
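A minimal sketch of the 4-entry FIFO write buffer described above; the entry layout and function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4                    /* typical number of entries, per the slide */

typedef struct { uint32_t addr, data; } wb_entry_t;

static wb_entry_t buf[WB_ENTRIES];
static int head = 0, count = 0;

/* Processor side: enqueue a store. The CPU only stalls if the buffer is full. */
bool wb_enqueue(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;          /* full: processor must stall */
    buf[(head + count) % WB_ENTRIES] = (wb_entry_t){ addr, data };
    count++;
    return true;                                    /* CPU continues immediately */
}

/* Memory-controller side: drain one entry toward DRAM, in FIFO order. */
bool wb_drain_one(wb_entry_t *out) {
    if (count == 0) return false;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return true;
}
```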

64 Cache in Pipeline Architecture
Read misses (I$ and D$): stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache, send the requested word to the processor, and let the pipeline resume.
Write misses (D$ only): stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word into the cache, and let the pipeline resume. Two options: write allocate or no-write allocate.
Consider our 1 KB direct mapped cache with 32-byte blocks again, and a 16-bit write that misses. After we write the cache tag and the 16 bits of data into bytes 0 and 1, do we have to read the rest of the block (bytes 2 through 31) from memory? If we do, it is called write allocate. Spatial locality suggests the rest of the block may be accessed soon, but that access is likely to be another write, so a common practice is not to read in the rest of the block on a write miss (write no-allocate). In that case there must be some way to tell the processor that the rest of the block is not valid, which brings us to the topic of sub-blocking.

65 Summary

66 Summary: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation.
The optimal choice is a compromise: it depends on access characteristics (workload; use as I-cache, D-cache, or TLB) and it depends on technology and cost. Simplicity often wins.
(Figure: the design space sketched as trade-off curves, good vs. bad, against cache size, associativity, and block size.)
No fancy replacement policy is needed for a direct mapped cache; in fact, that is what causes direct mapped caches trouble in the first place: there is only one place a block can go, which causes conflict misses.
Besides working at Sun, I also teach people how to fly whenever I have time. Statistics show that after an engine failure, a pilot is more likely to be killed in a light twin than in a single-engine airplane. The joke among flight instructors is: when the engine quits in a single, you have one option, sooner or later you land, probably sooner. But in a twin with one engine out you have a lot of options, and it is the need to make a decision that kills people.

67 End Slides

