CSCI 2510 Computer Organization, Memory System II: Cache In Action


1 CSCI 2510 Computer Organization, Memory System II: Cache In Action

2 Cache-Main Memory Mapping
A way to record which part of main memory is currently in the cache. Design concerns:
 Be efficient: fast determination of cache hits/misses
 Be effective: make full use of the cache; increase the probability of cache hits
In the following discussion, we assume:
 16-bit addresses, i.e. 2^16 = 64K bytes in main memory
 One word = one byte
 Synonym: cache line == cache block

3 Imagine: Trivial Conceptual Case
Cache size == main memory size == 64KB, giving a trivial one-to-one mapping. Do we need main memory any more?
[Figure: CPU (fastest), Cache 64KB (fast), Main Memory 64KB (slow)]

4 Reality: Cache Block / Cache Line
The cache is much smaller than main memory, so a block in main memory maps to a block in the cache: a many-to-one mapping.
[Figure: main memory blocks 0-4095 mapping onto cache blocks 0-127, each cache block holding a tag]

5 Direct Mapping
Block j of main memory maps to block (j mod 128) of the cache [same colour in figure]. A cache hit occurs if the stored tag matches the desired address.
 There are 2^4 = 16 words (bytes) in a block
 2^7 = 128 cache blocks
 2^(7+5) = 2^7 x 2^5 = 4096 main memory blocks
A 16-bit main memory address splits into a 5-bit cache tag, a 7-bit cache block number, and a 4-bit word/byte address within the block; the upper 5 + 7 = 12 bits form the main memory block number/address.
[Figure: main memory blocks 0-4095 mapped many-to-one onto cache blocks 0-127]

6 Direct Mapping
The memory address is divided into 3 fields:
 The cache block number determines the position of the block in the cache
 The tag keeps track of which block is in the cache (since many main memory blocks can map to the same cache position)
 The last 4 bits of the address select the target word in the block
Given an address t,b,w (16 bits):
 See if the block is already in the cache by comparing t with the tag stored at block b
 If not, cache miss! Replace the current block at b with a new one from memory block t,b (12 bits)
E.g. the CPU is looking for [A7B4]:
 MAR = 1010011110110100
 Go to cache block 1111011
 See if the tag is 10100
 If YES, cache hit! Get word/byte 0100 in cache block 1111011
[Figure: same mapping diagram as slide 5]
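The field extraction in the lookup above can be sketched in C; the function names are mine, not the slides':

```c
#include <assert.h>

/* Direct-mapped lookup fields for a 16-bit address:
   5-bit tag, 7-bit cache block number, 4-bit word within the block. */
unsigned dm_tag(unsigned addr)   { return (addr >> 11) & 0x1F; }
unsigned dm_block(unsigned addr) { return (addr >> 4) & 0x7F; }
unsigned dm_word(unsigned addr)  { return addr & 0xF; }
```

For [A7B4], dm_tag gives 10100, dm_block gives 1111011, and dm_word gives 0100, matching the worked example.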

7 Associative Mapping
In direct mapping, a main memory block is restricted to reside at a fixed position in the cache (determined by mod). Associative mapping allows a main memory block to reside in an arbitrary cache block location.
In this example, all 128 tag entries must be compared with the address tag in parallel (by hardware). The tag width is 12 bits; the remaining 4 bits select the word/byte.
E.g. the CPU is looking for [A7B4]:
 MAR = 1010011110110100
 See if the tag 101001111011 matches one of the 128 cache tags
 If YES, cache hit! Get word/byte 0100 in the matching cache block
[Figure: any main memory block 0-4095 may occupy any cache block 0-127]
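The associative lookup splits the address into only two fields; a minimal sketch (function names are mine):

```c
#include <assert.h>

/* Associative lookup fields: 12-bit tag, 4-bit word within the block. */
unsigned am_tag(unsigned addr)  { return (addr >> 4) & 0xFFF; }
unsigned am_word(unsigned addr) { return addr & 0xF; }
```

For [A7B4], am_tag gives 101001111011 (0xA7B), the value compared against all 128 cache tags in parallel.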

8 Set Associative Mapping
A combination of direct and associative mapping. Main memory blocks 0, 64, 128, ..., 4032 [same colour in figure] map to cache set 0 and can occupy either of the 2 positions within that set. A cache with k blocks per set is called a k-way set associative cache.
 (j mod 64) derives the set number
 Within the target set, compare the 2 tag entries with the address tag in parallel (by hardware)
A 16-bit main memory address splits into a 6-bit tag, a 6-bit set number, and a 4-bit word/byte address. The tag width is 6 bits.
[Figure: cache organized as 64 sets (0-63) of 2 blocks each]

9 Set Associative Mapping
E.g. 2-way set associative; the CPU is looking for [A7B4]:
 MAR = 1010011110110100
 Go to cache set 111011 (59 decimal), i.e. block 1110110 (118) or block 1110111 (119)
 See if ONE of the TWO tags in set 111011 is 101001
 If YES, cache hit! Get word/byte 0100 in the matching cache block
[Figure: same 2-way set associative organization as slide 8; tag width is 6 bits]
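The three-field split for this 2-way cache can be sketched the same way (function names are mine, not the slides'):

```c
#include <assert.h>

/* 2-way set associative lookup fields for a 16-bit address:
   6-bit tag, 6-bit set number, 4-bit word within the block. */
unsigned sa_tag(unsigned addr)  { return (addr >> 10) & 0x3F; }
unsigned sa_set(unsigned addr)  { return (addr >> 4) & 0x3F; }
unsigned sa_word(unsigned addr) { return addr & 0xF; }
```

For [A7B4], sa_set gives 111011 (59) and sa_tag gives 101001, matching the worked example.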

10 Replacement Algorithms
Direct mapped cache:
 The position of each block is fixed
 Whenever replacement is needed (i.e. a cache miss forces a new block to load), the choice is obvious, so no "replacement algorithm" is needed
Associative and set associative:
 Need to decide which block to replace (and thus keep/retain the blocks likely to be used again in the near future)
 One strategy is least recently used (LRU): e.g. for a 4-block/set cache, use a log2(4) = 2-bit counter for each block. Reset the counter to 0 whenever the block is accessed; the counters of the other blocks in the same set are incremented. On a cache miss, replace/evict a block whose counter has reached 3
 Another is random replacement (choose a random block); its advantage is that it is easier to implement at high speed
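The LRU counter scheme can be sketched as below. One refinement over the slide's one-line summary: in the standard scheme, only counters lower than the accessed block's value are incremented, which keeps the four counters a permutation of 0..3. The struct and function names are mine:

```c
#include <assert.h>

#define WAYS 4

/* One cache set with a 2-bit LRU counter per block
   (0 = most recently used, WAYS-1 = least recently used). */
typedef struct {
    int tag[WAYS];
    int counter[WAYS];
    int valid[WAYS];
} Set;

/* Mark way `hit` as most recently used: age every block that was more
   recently used than it, then reset its counter to 0. */
void touch(Set *s, int hit) {
    for (int w = 0; w < WAYS; w++)
        if (w != hit && s->counter[w] < s->counter[hit])
            s->counter[w]++;
    s->counter[hit] = 0;
}

/* On a miss, the victim is an empty way if any, otherwise the block
   whose counter has reached WAYS - 1 (the LRU block). */
int victim(const Set *s) {
    for (int w = 0; w < WAYS; w++)
        if (!s->valid[w]) return w;
    for (int w = 0; w < WAYS; w++)
        if (s->counter[w] == WAYS - 1) return w;
    return 0; /* not reached if the counters stay consistent */
}

/* Access `tag` in the set; returns 1 on a hit, 0 on a miss
   (the block is then loaded and becomes most recently used). */
int access_set(Set *s, int tag) {
    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag) { touch(s, w); return 1; }
    int v = victim(s);
    s->tag[v] = tag;
    s->valid[v] = 1;
    s->counter[v] = WAYS - 1;  /* treat as oldest, then promote to MRU */
    touch(s, v);
    return 0;
}

/* Demo: fill the set with tags 1..4, access tag 1 again; the next
   victim should then hold tag 2, the least recently used. */
int lru_demo(void) {
    Set s = {{0}, {0}, {0}};
    for (int t = 1; t <= 4; t++) access_set(&s, t);
    access_set(&s, 1);
    return s.tag[victim(&s)];
}
```

Touching tag 1 promotes it to most recently used, so the eviction candidate rotates to the oldest untouched block.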

11 Cache Example
 Assume separate instruction and data caches; we consider only the data cache
 The cache has space for 8 blocks
 A block contains one 16-bit word (i.e. 2 bytes)
 A[10][4] is an array of words located at 7A00-7A27 in row-major order
 The program normalizes the first column of the array (a vertical vector) by its average:

    short A[10][4];
    int sum = 0;
    int j, i;
    double mean;
    // forward loop
    for (j = 0; j <= 9; j++)
        sum += A[j][0];
    mean = sum / 10.0;
    // backward loop
    for (i = 9; i >= 0; i--)
        A[i][0] = A[i][0] / mean;

12 Cache Example
[Figure: the 40 array elements A[0][0] ... A[9][3], with the tag bits used by the direct-mapped, set-associative, and associative caches]
To simplify the discussion in this example:
 16-bit word addresses; byte addresses are not shown
 One item == one block == one word == two bytes
 8 blocks in the cache; 3 bits encode the cache block number
 4 blocks/set, 2 cache sets; 1 bit encodes the cache set number

13 Direct Mapping
 The least significant 3 bits of the address determine the location in the cache; no replacement algorithm is needed in direct mapping
 During the forward j loop, the even-indexed elements A[0][0], A[2][0], ..., A[8][0] successively overwrite one another in cache block 0, while the odd-indexed elements A[1][0], ..., A[9][0] do the same in cache block 4
 In the backward i loop, only i = 9 and i = 8 get cache hits (2 hits in total)
 Only 2 of the 8 cache positions are used: very inefficient cache utilization
[Table: contents of the data cache after each loop pass (j = 0..9, then i = 9..0); tags not shown but are needed]
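The two-hit result can be reproduced with a small simulation, a sketch under the slides' assumptions (word addressing, A[x][0] at word address 0x7A00 + 4*x, 8 one-word blocks; the names are mine):

```c
#include <assert.h>

#define NBLOCKS 8
#define BASE 0x7A00  /* word address of A[0][0], per the slides */

/* Count cache hits for the example's access pattern: A[j][0] for
   j = 0..9 (forward loop), then A[i][0] for i = 9..0 (backward loop).
   With one word per block, the block number is the low 3 bits of the
   address and the tag is the remaining high bits. */
int direct_mapped_hits(void) {
    int tag[NBLOCKS];
    int valid[NBLOCKS] = {0};
    int hits = 0;
    for (int pass = 0; pass < 20; pass++) {
        int x = (pass < 10) ? pass : 19 - pass;  /* forward, then backward */
        int addr = BASE + 4 * x;
        int block = addr % NBLOCKS;
        int t = addr / NBLOCKS;
        if (valid[block] && tag[block] == t) {
            hits++;
        } else {
            valid[block] = 1;  /* miss: load the block */
            tag[block] = t;
        }
    }
    return hits;
}
```

Only the i = 9 and i = 8 accesses find their blocks still resident, so the simulation reports 2 hits out of 20 accesses.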

14 Associative Mapping
 With the LRU replacement policy, we get cache hits for i = 9, 8, ..., 2
 The 10 elements do not all fit in the 8-block cache: during the forward loop, A[8][0] and A[9][0] evict the least recently used A[0][0] and A[1][0], so the backward loop misses at i = 1 and i = 0
 If the i loop were a forward one, we would get no hits at all!
[Table: contents of the data cache after each loop pass; tags and LRU counters not shown but are needed]

15 Set Associative Mapping
 Since all accessed blocks have even addresses (7A00, 7A04, 7A08, ...), they all map to set 0: only half of the cache is used
 With the LRU replacement policy, we get hits for i = 9, 8, 7 and 6
 Random replacement would have better average performance here
 If the i loop were a forward one, we would get no hits!
[Table: contents of set 0 of the data cache after each loop pass; set 1 stays empty; tags and LRU counters not shown but are needed]

16 Comments on the Example
 In this example, associative is best, then set-associative, and lastly direct mapping
 What are the advantages and disadvantages of each scheme?
 In practice, hit rates as low as in this example are very rare; set-associative caches with an LRU replacement scheme are the usual choice
 Larger blocks and more blocks, i.e. more cache memory, greatly improve the cache hit rate

17 Real-life Example: Intel Core 2 Duo
[Figure: two cores, each with a processing unit, a 32KB L1 instruction cache and a 32KB L1 data cache, sharing a 4MB L2 cache; the system bus connects to main memory and input/output]

18 Real-life Example: Intel Core 2 Duo
 Number of processors: 1; cores: 2 per processor; threads: 2 per processor
 Name: Intel Core 2 Duo E6600; code name: Conroe
 Specification: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
 Technology: 65 nm; core speed: 2400 MHz
 Multiplier x bus speed: 9.0 x 266.0 MHz = 2400 MHz
 Front-side-bus speed: 4 x 266.0 MHz = 1066 MHz
 Instruction sets: MMX, SSE, SSE2, SSE3, SSSE3, EM64T
 L1 data cache: 2 x 32 KB, 8-way set associative, 64-byte line size
 L1 instruction cache: 2 x 32 KB, 8-way set associative, 64-byte line size
 L2 cache: 4096 KB, 16-way set associative, 64-byte line size

19 Memory Module Interleaving
 The processor and cache are fast; main memory is slow
 Try to hide access latency by interleaving memory accesses across several memory modules
 Each memory module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR)
 Which scheme below can be better interleaved?
[Figure: (a) consecutive words in a module, where the high-order k bits of the memory address select the module and the remaining m bits address within it; (b) consecutive words in consecutive modules, where the low-order k bits select the module]
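The difference between the two layouts can be illustrated with module-selection helpers (the function names are mine; 2^k modules and an m-bit address within each module are assumed, following the figure):

```c
#include <assert.h>

/* (a) Consecutive words in a module: the high-order bits pick the
   module, so sequential addresses all hit the same module. */
unsigned module_scheme_a(unsigned addr, unsigned k, unsigned m) {
    (void)k;
    return addr >> m;
}

/* (b) Consecutive words in consecutive modules: the low-order k bits
   pick the module, so sequential addresses rotate across modules. */
unsigned module_scheme_b(unsigned addr, unsigned k, unsigned m) {
    (void)m;
    return addr & ((1u << k) - 1);
}
```

With 4 modules (k = 2), addresses 0, 1, 2, 3 all land in module 0 under scheme (a), but in modules 0, 1, 2, 3 under scheme (b); that is why (b) interleaves better for sequential accesses such as cache line fills.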

20 Memory Module Interleaving
 Memory interleaving is realized in technologies such as the "Dual Channel Memory Architecture": two or more compatible (ideally identical) memory modules are used in matching banks
 Within a memory module, 8 chips are in fact used in "parallel" to achieve an 8 x 8 = 64-bit memory bus; this is also a kind of memory interleaving

21 Memory Interleaving Example
 Suppose we have a cache read miss and need to load from main memory
 Assume a cache with an 8-word block (cache line size = 8 bytes)
 Assume it takes one clock cycle to send the address to DRAM and one clock cycle to send data back
 In addition, DRAM has a 6-cycle latency for the first word; happily, each subsequent word in the same row takes only 4 cycles
 Read 8 bytes without interleaving: 1 + (1x6) + (7x4) + 1 = 36 cycles
 With a 4-module interleaved scheme: 1 + (1x6) + (1x8) = 15 cycles; the transfer of the first 4 words is overlapped with the access of the second 4 words
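The cycle counts above can be written as small formulas, a sketch of the slides' timing model:

```c
#include <assert.h>

/* Without interleaving: 1 cycle to send the address, 6-cycle DRAM
   latency for the first word, 4 cycles for each later word in the
   same row, 1 cycle to send the data back. */
int cycles_no_interleave(int words) {
    return 1 + 6 + (words - 1) * 4 + 1;
}

/* With interleaving, the module latencies overlap and one word is
   returned per cycle. This closed form assumes the module count
   covers the 4-cycle per-module access gap, as with 4 modules here. */
int cycles_interleaved(int words) {
    return 1 + 6 + words;
}
```

The formulas reproduce the slides' figures: 36 cycles for a non-interleaved 8-word read, 15 cycles interleaved, and 8 cycles for a single read.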

22 Memory Without Interleaving
[Timing diagram]
 Single memory read: 1 + 6 + 1 = 8 cycles
 Read 8 bytes (non-interleaving): 1 + 1x6 + 7x4 + 1 = 36 cycles

23 Four Memory Modules Interleaved
[Timing diagram: the four modules' 6-cycle accesses overlap]
 Read 8 bytes (interleaving): 1 + 6 + 1x8 = 15 cycles

24 Cache Hit Rate and Miss Penalty
 The goal is to have a big memory system with the speed of the cache
 High cache hit rates (> 90%) are essential; the miss penalty must also be reduced
 Example: let h be the hit rate, M the miss penalty, and C the cache access time; then
   Average memory access time = h x C + (1 - h) x M
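The formula above as a small helper, with the slides' symbols as parameters (the function name is mine):

```c
#include <assert.h>

/* Average memory access time with hit rate h, cache access time c,
   and miss penalty m, per the slides' formula. */
double avg_access_time(double h, double c, double m) {
    return h * c + (1.0 - h) * m;
}
```

For example, with h = 0.95, a 1-cycle cache, and a 15-cycle miss penalty, the average access is 1.7 cycles, far closer to the cache than to main memory.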

25 Cache Hit Rate and Miss Penalty
 Optimistically assume 8 cycles to read a single memory item, 15 cycles to load an 8-byte block from main memory (previous example), and a cache access time of 1 cycle
 For every 100 instructions, statistically 30 instructions do a data read/write
 Instruction fetch: 100 memory accesses, assumed hit rate 0.95
 Data read/write: 30 memory accesses, assumed hit rate 0.90
 Execution cycles without cache = (100 + 30) x 8
 Execution cycles with cache = 100 x (0.95x1 + 0.05x15) + 30 x (0.9x1 + 0.1x15)
 Ratio = 4.30, i.e. the cache speeds up execution more than 4 times!
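The arithmetic above can be checked directly; the hit rates and cycle counts are the slides' assumptions, not measurements:

```c
#include <assert.h>

/* Recompute the slides' speedup ratio: cycles without a cache divided
   by cycles with a cache, for 100 instruction fetches plus 30 data
   accesses per 100 instructions. */
double cache_speedup(void) {
    double without_cache = (100 + 30) * 8.0;
    double with_cache = 100 * (0.95 * 1 + 0.05 * 15)   /* instruction fetches */
                      +  30 * (0.90 * 1 + 0.10 * 15);  /* data reads/writes */
    return without_cache / with_cache;
}
```

This gives 1040 / 242, approximately 4.30, confirming the slide's "more than 4 times" claim.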

26 Summary
 Cache organizations: direct, associative, set-associative
 Cache replacement algorithms: random, least recently used (LRU)
 Memory interleaving
 Cache hit rate and miss penalty

