
1 CMPE 421 Parallel Computer Architecture PART 3: Accessing a Cache

2 Direct Mapped Cache Example 2 (4 blocks, 1 word/block)  Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache: all blocks initially marked as not valid.

3 Direct Mapped Cache Example 2 (4 blocks, 1 word/block)  Reference string 0 1 2 3 4 3 4 15, starting with an empty cache. Accesses 0, 1, 2, and 3 are compulsory misses that fill the four blocks (tag 00). Access 4 (index 0, tag 01) misses and replaces Mem(0) with Mem(4); accesses 3 and 4 then hit; access 15 (index 3, tag 11) misses and replaces Mem(3) with Mem(15). In total: 8 requests, 6 misses, 2 hits.
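This hit/miss bookkeeping can be checked mechanically. Below is a minimal sketch of a direct-mapped cache simulator for 1-word blocks (Python; the function name simulate and its interface are illustrative, not from the lecture):

    # Minimal direct-mapped cache simulator, 1-word blocks.
    # refs is a list of word addresses; num_blocks must be a power of two.
    def simulate(refs, num_blocks):
        valid = [False] * num_blocks
        tags = [None] * num_blocks
        hits = 0
        for addr in refs:
            index = addr % num_blocks   # (block address) modulo (# of blocks)
            tag = addr // num_blocks    # remaining high-order address bits
            if valid[index] and tags[index] == tag:
                hits += 1               # valid entry with matching tag: hit
            else:                       # miss: fill the block with the new tag
                valid[index], tags[index] = True, tag
        return hits, len(refs) - hits

    print(simulate([0, 1, 2, 3, 4, 3, 4, 15], 4))  # -> (2, 6): 2 hits, 6 misses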

4 Direct Mapped Caching: A Simple First Example  A four-entry cache (indices 00, 01, 10, 11), each entry holding a valid bit, a tag, and a data word. Main memory addresses run from 0000xx to 1111xx; the two low-order bits select the byte within the 32-bit word.  Q1: How do we find it? Use the next two low-order memory address bits, the index, to determine which cache block: (block address) modulo (# of blocks in the cache).  Q2: Is it there? Compare the cache tag to the high-order two memory address bits to tell whether the memory block is in the cache. The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.

5 Direct Mapped Cache Example 1  8 blocks, 1 word/block, direct mapped  Initial state:

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    N
  111    N

6 Direct Mapped Cache Example 1

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Miss      110

  Index  V  Tag  Data
  000    N
  001    N
  010    N
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

7 Direct Mapped Cache Example 1

  Word addr  Binary addr  Hit/miss  Cache block
  26         11 010       Miss      010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

8 Direct Mapped Cache Example 1

  Word addr  Binary addr  Hit/miss  Cache block
  22         10 110       Hit       110
  26         11 010       Hit       010

  Index  V  Tag  Data
  000    N
  001    N
  010    Y  11   Mem[11010]
  011    N
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

9 Direct Mapped Cache Example 1

  Word addr  Binary addr  Hit/miss  Cache block
  16         10 000       Miss      000
  3          00 011       Miss      011
  16         10 000       Hit       000

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  11   Mem[11010]
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

10 Direct Mapped Cache Example 1

  Word addr  Binary addr  Hit/miss  Cache block
  18         10 010       Miss      010

  Index  V  Tag  Data
  000    Y  10   Mem[10000]
  001    N
  010    Y  10   Mem[10010]   (address 18 replaces Mem[11010] in block 010)
  011    Y  00   Mem[00011]
  100    N
  101    N
  110    Y  10   Mem[10110]
  111    N

11 Address Subdivision: Direct Mapped Cache  One word/block, cache size = 1K words. The 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2, selecting one of entries 0-1023), and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, a 20-bit tag, and a 32-bit data word; Hit is asserted when the stored tag matches the address tag and the valid bit is set. What kind of locality are we taking advantage of? (FIGURE 7.7: for this cache, the lower portion of the address is used to select a cache entry consisting of a data word and a tag.)
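As a sketch of how this address split can be computed (the helper name fields is illustrative; the shifts and masks follow the bit positions above):

    # Split a 32-bit byte address for a 1K-entry, 1-word-block cache:
    # 2-bit byte offset, 10-bit index, 20-bit tag.
    def fields(addr):
        byte_offset = addr & 0x3        # bits 1-0
        index = (addr >> 2) & 0x3FF     # bits 11-2
        tag = addr >> 12                # bits 31-12
        return tag, index, byte_offset

    print(fields(0x1234))  # -> (1, 141, 0): tag 1, index 0x08D, offset 0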

12 Another Example for Direct Mapping  Consider the main memory word reference string 0 4 0 4 0 4 0 4. Start with an empty cache: all blocks initially marked as not valid.

13 Another Example for Direct Mapping  Reference string 0 4 0 4 0 4 0 4, starting with an empty cache. Addresses 0 and 4 both map to block 0 (tags 00 and 01), so each access misses and evicts the word loaded by the previous one: Mem(0) is replaced by Mem(4), then Mem(4) by Mem(0), and so on.  Ping-pong effect due to conflict misses: two memory locations that map into the same cache block.  8 requests, 8 misses.
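Reusing the simulate() sketch from slide 3 above (assumed still in scope), the ping-pong effect is easy to reproduce:

    # 0 mod 4 == 4 mod 4 == 0: both words fight over block 0.
    print(simulate([0, 4, 0, 4, 0, 4, 0, 4], 4))  # -> (0, 8): 8 misses, no hits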

14 Handling Cache Misses  Our control unit must detect a miss and process it by fetching the data from memory or from a lower-level cache.  Approach: stall the CPU, freezing the contents of all the registers; a separate controller fetches the data from memory; once the data is present, execution of the datapath resumes.

15 Instruction Cache Misses  Send the original PC value (current PC - 4) to the memory.  Instruct main memory to perform a read and wait for the memory to complete its access.  Write the cache entry: put the data from memory in the data portion, write the upper bits of the address (from the ALU) into the tag field, and set the valid bit.  Restart the instruction execution at the first step, which will re-fetch the instruction (now in the cache).
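A minimal software model of this control sequence (purely illustrative: the names, the dictionary-based cache, and the 16K-entry geometry are assumptions, not the lecture's hardware):

    # Toy model of the instruction-miss steps above.
    def handle_instruction_miss(cache, memory, pc, num_blocks=16384):
        word_addr = (pc - 4) // 4         # original PC value, as a word address
        data = memory[word_addr]          # stall until the memory read completes
        index = word_addr % num_blocks    # select the cache entry
        tag = word_addr // num_blocks     # upper address bits
        cache[index] = (tag, data)        # write data + tag; entry is now valid
        return word_addr                  # restart the fetch; it will now hit

    memory = {10: "lw $t0, 0($sp)"}       # toy memory: word address -> instruction
    cache = {}
    handle_instruction_miss(cache, memory, pc=44)
    print(cache)                          # {10: (0, 'lw $t0, 0($sp)')}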

16 Example Machine  The Digital DECStation 3100, one of the first commercially available RISC-architecture machines, used a MIPS R2000 processor: a 5-stage pipeline that requested an instruction word and a data word on every clock cycle, static branch prediction, delayed branch instructions, and a 64 KB data cache plus a 64 KB instruction cache.

17 DECStation 3100 Cache (16K words)  This cache has 2^14 (16K) words with a block size of 1 word. 14 bits of the address are used to index into the cache, and 16 bits are compared against the tag. A hit results if the upper 16 bits match the tag AND the valid bit is set.

18 DECStation 3100 Cache  64 KB = 16K words with a 1-word block  Read requests (lw instruction): 1. Send the address to the appropriate cache (instruction: from the PC; data: from the ALU). 2. If the cache signals a hit, read the requested word from the data lines. Else, send the address to main memory; when the requested word comes back, write it into the cache.

19 DECStation 3100 Cache  Write requests: on a sw instruction, the DECStation 3100 used a scheme called write-through.  Write-through stores the word in the data cache AND in main memory; this is done to keep the data cache and main memory consistent.  Don't bother to check for hits and misses: just write the word into the cache and into main memory.  Example of the inconsistency this avoids: suppose A = 5 in both cache and memory, and the process wants to update A to 10. If only the cache copy is written, the cache copy of A no longer equals the memory copy of A; with the write-through policy, the word is written into both, so they stay consistent.

20 DECStation 3100 Cache  Write requests using a write-through scheme (see slide 17): 1. Index the cache using bits 2-15 of the address. 2. Write bits 31-16 of the address into the tag, write the data word into the data portion, and set the valid bit. 3. Also write the word to main memory using the entire address.  This simple approach slows our performance down because of the long write to main memory.
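A minimal sketch of these write-through steps (Python, with illustrative names and dictionary-based structures; real hardware does this with no software involvement):

    # Write-through store for a 16K-entry, 1-word-block cache.
    def write_through(cache, memory, word_addr, word, num_blocks=16384):
        index = word_addr % num_blocks    # index field of the address
        tag = word_addr // num_blocks     # upper address bits
        cache[index] = (tag, word)        # update the cache entry, set valid
        memory[word_addr] = word          # also write main memory (the slow part)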

21 EXAMPLE: the cost of write-through  Suppose 10% of the instructions are stores. If the CPI without cache misses is 1.0 and every write spends 100 extra cycles, the effective CPI is 1.0 + 0.10 x 100 = 11.0, reducing performance by more than a factor of 10.  SOLUTION: a write buffer.

22 Write Buffer Solution  Use a write buffer to hold the data while it is waiting to be written to memory: write the data into the cache and into the write buffer; the processor continues with execution; the write buffer (memory controller side) copies the data into main memory, and when a write to main memory completes, the entry in the write buffer is freed.  Hopefully, the processor does not generate writes faster than the write buffer can take them. If the write buffer becomes full, stalls occur.
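A software analogy of the buffer (a sketch only; the four-entry depth and all names are assumptions):

    from collections import deque

    WRITE_BUFFER_DEPTH = 4                 # real buffers hold only a few entries
    write_buffer = deque()

    def buffered_write(memory, addr, word):
        if len(write_buffer) == WRITE_BUFFER_DEPTH:
            drain_one(memory)              # buffer full: the processor stalls here
        write_buffer.append((addr, word))  # then continues executing immediately

    def drain_one(memory):
        addr, word = write_buffer.popleft()  # memory-controller side: retire the
        memory[addr] = word                  # oldest pending write, freeing its entry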

23 Write Buffer Saturation  PROBLEM: the memory system designer's nightmare.  If the store frequency (w.r.t. time) << 1 / DRAM write cycle, the write buffer works fine.  If the store frequency approaches 1 / DRAM write cycle and this condition persists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it, because you are simply feeding data faster than it can be emptied (CPU cycle time << DRAM write cycle time).

24 Solutions for write buffer saturation:  Use a write-back cache.  Install a second-level (L2) cache.  Use store compression.

25 Alternative: Write-Back Solution  The write-back scheme does not automatically write to the cache AND to main memory.  Instead, it writes only to the cache; the data does not get written to memory until that cache block has to be replaced with a different block (i.e., when the block is evicted on a cache miss).  Can improve performance and greatly reduce the memory bandwidth requirement, but it is more complex to implement than write-through: control can be complex, and we need a "dirty bit" for each cache block.
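A sketch of the dirty-bit bookkeeping (illustrative; each entry is modeled as (dirty, tag, data) for 1-word blocks):

    # Write-back: stores go only to the cache and set the dirty bit.
    def write_back_store(cache, word_addr, word, num_blocks=1024):
        index, tag = word_addr % num_blocks, word_addr // num_blocks
        cache[index] = (True, tag, word)             # dirty = True

    # The block reaches memory only when it is evicted.
    def evict(cache, memory, index, num_blocks=1024):
        dirty, tag, data = cache.pop(index)
        if dirty:
            memory[tag * num_blocks + index] = data  # write the dirty word back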

26 Sources of Cache Misses  Compulsory (cold start, process migration, first reference): caused when we first start the program. The first access to a block is a "cold" fact of life; there is not a whole lot you can do about it, and if you are going to run millions of instructions, compulsory misses are insignificant.  Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity (next lecture).  Capacity: the cache cannot contain all blocks accessed by the program. Solution: increase cache size.

27 Multiword (4-word) Block Direct Mapped Cache: Taking Advantage of Spatial Locality  The cache organization discussed so far (1 word = 1 block) does not exploit spatial locality, so we want a cache block size > one word.  Then, when a cache miss occurs, we fetch multiple adjacent words, and the probability that one of these words will be needed shortly is high.  Example: choose block size = 4 words (4x4 = 16 bytes). We now need an extra "block offset" field, which selects one of the four words in the indexed block according to the request. The total size of the tag field per word is also reduced, because each tag is shared by 4 words (only 25% of the tag overhead compared to the case where the block size is 1 word).

28 Multiword (4-word) Block Direct Mapped Cache  Cache size: 4K blocks x 4 words = 16K words; 16K x 4 bytes = 64 KB cache.

29 Formulas  During Lecture Period

30 Multiword (4-word) Block Direct Mapped Cache  Four words/block, cache size = 16K words (4096 blocks, indices 0-4095). The 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2) selecting one of the four 32-bit words in the block, and a 2-bit byte offset (bits 1-0). What kind of locality are we taking advantage of?
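A sketch of this four-field split (the helper name fields_4word is illustrative; the masks follow the bit positions above):

    # 4 words/block, 4096 blocks (16K words), byte-addressed.
    def fields_4word(addr):
        byte_offset  = addr & 0x3           # bits 1-0
        block_offset = (addr >> 2) & 0x3    # bits 3-2: word within the block
        index        = (addr >> 4) & 0xFFF  # bits 15-4: one of 4096 blocks
        tag          = addr >> 16           # bits 31-16
        return tag, index, block_offset, byte_offset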

31 Taking Advantage of Spatial Locality  Let the cache block hold more than one word. Consider the main memory word reference string 0 1 2 3 4 3 4 15. Start with an empty cache: all blocks initially marked as not valid.

32 Taking Advantage of Spatial Locality  Reference string 0 1 2 3 4 3 4 15 with two-word blocks, starting with an empty cache. Access 0 misses and loads Mem(1)/Mem(0); access 1 hits in that block; access 2 misses and loads Mem(3)/Mem(2); access 3 hits; access 4 (tag 01) misses and replaces Mem(1)/Mem(0) with Mem(5)/Mem(4); accesses 3 and 4 then hit; access 15 (tag 11) misses and replaces Mem(3)/Mem(2) with Mem(15)/Mem(14).  8 requests, 4 misses.

33 Cache Hits and Misses  Read misses are processed the same as with a 1-word-block cache: read the entire block from memory into the cache.  Write hits and misses must be handled differently with a multiword-block cache. Consider: assume memory addresses X and Y both map to cache block C, and C is a 4-word block containing Y. What would happen if we did a write to address X by simply overwriting the data and tag in cache block C? Ans: cache block C would contain the tag for X, 1 word of X, and 3 words of Y.

34 Cache Hits and Misses: Solution 1 (when a write miss occurs)  Ignore the cache when we have a write miss: do not change the tag for X and do not update the data, just write the word to memory.  But then where is the benefit of using the cache, if the data resides only in memory? In this case we cannot take advantage of locality.  ... Another solution?

35 Cache Hits and Misses: Solution 2 (first handle the miss as a read, then write)  Perform a tag comparison while writing the data.  If equal, we have a write hit: no problem.  If unequal, we have a write miss: fetch the block from memory, then rewrite the word that caused the miss (write-through).  So, with a multiword cache, a write miss causes a read from memory.
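A sketch of this fetch-then-write sequence for 4-word blocks with write-through (Python; the names and dictionary-based structures are illustrative):

    # On a store: compare tags; on a mismatch, read the whole block first,
    # then overwrite the one word, and write through to memory.
    def store_word(cache, memory, addr, word, num_blocks=4096, block_words=4):
        block_addr = addr // block_words
        index, tag = block_addr % num_blocks, block_addr // num_blocks
        entry = cache.get(index)
        if entry is None or entry[0] != tag:            # write miss
            base = block_addr * block_words
            block = [memory[base + i] for i in range(block_words)]
            cache[index] = (tag, block)                 # fetch the block from memory
        cache[index][1][addr % block_words] = word      # rewrite the missed word
        memory[addr] = word                             # write-through to memory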

36 Miss Penalty and Miss Rate vs Block Size  In general, larger block sizes take advantage of spatial locality, BUT:  A larger block size means a larger miss penalty: it takes longer to fill up the block.  If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks. Block sizes of 16-64 bytes work well.  In general, Average Access Time = Hit Time + Miss Rate x Miss Penalty.  Need to find a middle ground (good design needs compromise).
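As a worked instance of the formula (the numbers are illustrative, not from the lecture): with a hit time of 1 cycle, a miss rate of 5%, and a miss penalty of 20 cycles,

    Average Access Time = 1 + 0.05 x 20 = 2 cycles.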

37 Miss Rate vs Block Size vs Cache Size  Miss rate goes up if the block size is too large relative to the cache size, because the number of blocks that can be held in the same-size cache becomes smaller (increasing capacity misses).

38 Spatial Locality  Does increasing the block size help the miss rate? Yes, until the number of blocks in the cache becomes small. Then a cache block may be swapped out before many of its words are accessed, losing any spatial-locality benefits.

  Program                Instruction miss rate  Data miss rate  Effective combined miss rate
  gcc (1-word blocks)    6.1%                   2.1%            5.4%
  gcc (4-word blocks)    2.0%                   1.7%            1.9%
  spice (1-word blocks)  1.2%                   1.3%            1.2%
  spice (4-word blocks)  0.3%                   0.6%            0.4%

39 Miss Penalty  Increasing the block size also increases the miss penalty, since we must read more words from memory for each miss, and reading more words takes more time.  Miss Penalty = latency to the first word + transfer time for the rest.  One way around this problem is to design the memory system to transfer blocks of memory efficiently. One common way is to increase the width of the memory and the bus (transfer 2 or 4 words at a time, with a single latency period). Another common way is interleaving, which uses banks of memory: the requested address is sent to each bank in parallel, the memory latency is incurred once, and then the banks take turns sending the requested words to the cache.

40 Cache Summary  The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space.  Three major categories of cache misses: compulsory misses (sad facts of life, e.g. cold-start misses); conflict misses (increase cache size and/or associativity; nightmare scenario: the ping-pong effect); capacity misses (increase cache size).  Cache design space: total size, block size, associativity (replacement policy), write policy (write-through, write-back).

41 Main Memory Organizations  (a) One-word-wide memory organization: CPU, cache, bus, and memory are each one word wide.  (b) Wide memory organization: a wider memory and bus, with a multiplexor between the cache and the CPU.  (c) Interleaved memory organization: four one-word-wide memory banks (bank 0 to bank 3) sharing one bus.  DRAM access time >> bus transfer time.  Processing a cache miss costs the latency to fetch the first word from memory (finding the address for word 0) plus the block transfer time (bringing all the words of the block from memory). It is difficult to reduce the latency to fetch the first word, but we can reduce the miss penalty by increasing the bandwidth from memory to cache.

42 Memory Access Time Example  Assume it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data.  With a cache block of 4 words and a one-word-wide DRAM (slide 41, fig. a): miss penalty = 1 + 4x15 + 4x1 = 65 cycles.  With a main memory and bus width of 2 words (slide 41, fig. b): miss penalty = 1 + 2x15 + 2x1 = 33 cycles; for a 4-word-wide memory, the miss penalty is 17 cycles. This is expensive due to the wide bus and control circuits.  With interleaved memory of 4 memory banks and the same one-word bus (slide 41, fig. c): miss penalty = 1 + 1x15 + 4x1 = 20 cycles. The memory controller must supply consecutive addresses to different memory banks. Interleaving is universally adopted in high-performance computers.
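These four penalties can be checked with a small helper (a sketch whose formula matches only this example's assumptions: 1 cycle to send the address, 15-cycle DRAM accesses that fully overlap across banks, 1 cycle per bus transfer):

    import math

    def miss_penalty(block_words, width_words, banks=1):
        transfers = math.ceil(block_words / width_words)     # bus trips needed
        dram = 15 if banks >= transfers else transfers * 15  # banks overlap accesses
        return 1 + dram + transfers * 1                      # address + DRAM + bus

    print(miss_penalty(4, 1))           # one-word wide:     65
    print(miss_penalty(4, 2))           # two-word wide:     33
    print(miss_penalty(4, 4))           # four-word wide:    17
    print(miss_penalty(4, 1, banks=4))  # 4-bank interleave: 20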

43 Cache Performance  During Lecture Period

