1
Embedded Computer Architecture 5KK73, TU/e
Henk Corporaal, Bart Mesman
Data Memory Management
Part d: Data Layout for Caches
2
@H.C. Embedded Computer Architecture
Data layout for caches
Caches are hardware controlled. Therefore: no explicit code for making reuse copies is needed in your program!
What can we still do to improve performance?
Topics:
– Cache principles
– The 3 C's: Compulsory, Capacity and Conflict misses
– Data layout examples reducing misses
3
Cache operation (direct mapped cache)
[Figure: the cache (higher level), holding tags and data per block or line, in front of memory (lower level)]
4
Why does a cache work?
Principle of Locality
– Temporal locality: an accessed item has a high probability of being accessed again in the near future
– Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next
Check for yourself why there is temporal and spatial locality for instruction accesses and for data accesses
– Regular programs have high instruction and data locality
5
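Direct mapped cache
[Figure: direct mapped cache with 1024 entries (index 0 to 1023), each holding a valid bit, a 20-bit tag and 32 bits of data. The 32-bit address (bit positions 31 to 0) is split into a 20-bit tag, a 10-bit index and a 2-bit byte offset; the outputs are Hit and Data.]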
6
Taking advantage of spatial locality
Direct mapped cache: larger blocks
[Figure: direct mapped cache with larger blocks; the address is shown by bit positions]
7
Performance: effect of block size
Increasing the block (or line) size tends to decrease the miss rate
8
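Cache principles
[Figure: the CPU issues a p-bit address (virtual or physical), split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits). The cache has 2^k lines, each holding a tag of p-k-m bits and a cache line (block) of 2^m bytes of data. The tag comparison yields Hit?; behind the cache sits main memory.]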
9
4 Cache Architecture Fundamentals
1. Block placement
– Where in the cache will a new block be placed?
2. Block identification
– How is a block found in the cache?
3. Block replacement policy
– Which block is evicted from the cache?
4. Updating policy
– When is a block written from cache to memory?
– Write-Through vs. Write-Back caches
10
Block placement policies
[Figure: memory blocks 0 to 15 mapped onto a cache of 8 blocks (0 to 7). Direct mapped (one-to-one): a memory block can be placed in one cache block only ("Here only!"). Fully associative (one-to-many): a memory block can be placed anywhere in the cache.]
11
4-way associative cache
[Figure: cache organized as 4 ways x 256 sets]
12
Performance: effect of associativity
[Figure: curves for cache sizes of 1 KB, 2 KB and 8 KB]
13
Cache Basics
Cache_size = N_sets x Associativity x Block_size
Block_address = Byte_address DIV Block_size (in bytes)
Index = Block_address MOD N_sets
Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently
Address layout (bit 31 down to bit 0): tag | index | block offset, where tag and index together form the block address
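The DIV and MOD formulas above reduce to shifts and masks when the sizes are powers of two. The following is a minimal sketch of that idea, assuming 32-bit byte addresses; the names cache_addr, log2_pow2 and split_address are hypothetical and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of a byte address. */
typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} cache_addr;

/* log2 of x, where x must be a nonzero power of two. */
static unsigned log2_pow2(uint32_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Split a byte address into tag, index and block offset.
   block_size is in bytes; block_size and n_sets are powers of two. */
static cache_addr split_address(uint32_t byte_addr, uint32_t block_size,
                                uint32_t n_sets) {
    unsigned offset_bits = log2_pow2(block_size);
    unsigned index_bits = log2_pow2(n_sets);
    uint32_t block_addr = byte_addr >> offset_bits;  /* Byte_address DIV Block_size */
    cache_addr r;
    r.offset = byte_addr & (block_size - 1);         /* Byte_address MOD Block_size */
    r.index = block_addr & (n_sets - 1);             /* Block_address MOD N_sets */
    r.tag = block_addr >> index_bits;                /* remaining upper bits */
    return r;
}
```

For example, with 16-byte blocks and 4096 sets, the address 0x12345678 splits into tag 0x1234, index 0x567 and offset 0x8.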
14
Example 1
Assume
– Cache of 4K blocks, with 4-word block size
– 32-bit addresses
Direct mapped (associativity = 1):
– 16 bytes per block = 2^4, so 4 (2+2) bits for byte and word offsets
– 32-bit address: 32-4 = 28 bits for index and tag
– #sets = #blocks / associativity = 4K; log2 of 4K = 12, so 12 bits for index
– Total number of tag bits: (28-12) * 4K = 64 Kbits
2-way associative:
– #sets = #blocks / associativity: 2K sets
– 1 bit less for indexing, 1 bit more for tag (compared to direct mapped)
– Tag bits: (28-11) * 2 * 2K = 68 Kbits
4-way associative:
– #sets = #blocks / associativity: 1K sets
– 2 bits less for indexing, 2 bits more for tag (compared to direct mapped)
– Tag bits: (28-10) * 4 * 1K = 72 Kbits
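The tag-bit arithmetic of Example 1 can be checked mechanically. This is a small sketch, not part of the original slides; the function name tag_storage_bits and its parameters are assumptions made for illustration, and it assumes block size and number of sets are powers of two.

```c
#include <assert.h>

/* Total tag storage in bits for a cache with n_blocks blocks of
   block_bytes bytes each, the given associativity, and byte addresses
   of addr_bits bits. Sizes are assumed to be powers of two. */
static unsigned long tag_storage_bits(unsigned long n_blocks,
                                      unsigned long block_bytes,
                                      unsigned long assoc,
                                      unsigned addr_bits) {
    assert(assoc > 0 && n_blocks % assoc == 0);
    unsigned long n_sets = n_blocks / assoc;      /* #sets = #blocks / associativity */
    unsigned offset_bits = 0, index_bits = 0;
    while ((1UL << offset_bits) < block_bytes) offset_bits++;
    while ((1UL << index_bits) < n_sets) index_bits++;
    unsigned tag_bits = addr_bits - offset_bits - index_bits;
    return (unsigned long)tag_bits * n_blocks;    /* one tag per block */
}
```

With 4K blocks of 16 bytes and 32-bit addresses, this yields 64 Kbits, 68 Kbits and 72 Kbits for associativity 1, 2 and 4, matching the figures in the example.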
15
Example 2
3 caches, each consisting of 4 one-word blocks:
– Cache 1: fully associative
– Cache 2: two-way set associative
– Cache 3: direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8
16
Example 2: Direct Mapped
Block address   Cache block
0               0 mod 4 = 0
6               6 mod 4 = 2
8               8 mod 4 = 0

Address of memory block   Hit or miss   Location 0   Location 1   Location 2   Location 3
0                         miss          Mem[0]
8                         miss          Mem[8]
0                         miss          Mem[0]
6                         miss          Mem[0]                    Mem[6]
8                         miss          Mem[8]                    Mem[6]
(On the slide, a coloured entry marks a new entry, i.e. a miss.)
17
Example 2: 2-way Set Associative (4/2 = 2 sets)
Block address   Cache set
0               0 mod 2 = 0
6               6 mod 2 = 0
8               8 mod 2 = 0
(so all in set/location 0)

Address of memory block   Hit or miss   Set 0 entry 0   Set 0 entry 1   Set 1 entry 0   Set 1 entry 1
0                         miss          Mem[0]
8                         miss          Mem[0]          Mem[8]
0                         hit           Mem[0]          Mem[8]
6                         miss          Mem[0]          Mem[6]
8                         miss          Mem[8]          Mem[6]
On a miss in a full set, the least recently used block is replaced.
18
Example 2: Fully associative (4-way associative, 4/4 = 1 set)
Address of memory block   Hit or miss   Block 0   Block 1   Block 2   Block 3
0                         miss          Mem[0]
8                         miss          Mem[0]    Mem[8]
0                         hit           Mem[0]    Mem[8]
6                         miss          Mem[0]    Mem[8]    Mem[6]
8                         hit           Mem[0]    Mem[8]    Mem[6]
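The three variants of Example 2 differ only in associativity, so one small simulator can replay the trace for all of them. The sketch below assumes LRU replacement (as stated for the 2-way case) and non-negative block addresses; count_hits and MAX_BLOCKS are hypothetical names, not taken from the slides.

```c
#include <assert.h>

#define MAX_BLOCKS 8   /* upper bound on cache size for this sketch */

/* Replay a trace of block addresses on a cache of n_blocks one-word
   blocks with the given associativity and LRU replacement.
   Returns the number of hits. */
static int count_hits(const int *trace, int len, int n_blocks, int assoc) {
    assert(n_blocks <= MAX_BLOCKS && assoc >= 1 && n_blocks % assoc == 0);
    int block[MAX_BLOCKS];     /* block address held by each entry, -1 = empty */
    int last_use[MAX_BLOCKS];  /* time of the most recent access, -1 = never */
    int n_sets = n_blocks / assoc;
    int hits = 0;
    for (int e = 0; e < n_blocks; e++) { block[e] = -1; last_use[e] = -1; }

    for (int t = 0; t < len; t++) {
        int base = (trace[t] % n_sets) * assoc;   /* first entry of the set */
        int found = -1, victim = base;
        for (int e = base; e < base + assoc; e++) {
            if (block[e] == trace[t]) found = e;
            if (last_use[e] < last_use[victim]) victim = e;  /* empty or oldest */
        }
        if (found >= 0) { hits++; victim = found; }
        else block[victim] = trace[t];            /* miss: replace LRU entry */
        last_use[victim] = t;
    }
    return hits;
}
```

For the trace 0, 8, 0, 6, 8 on 4 blocks, this gives 0 hits for associativity 1 (direct mapped), 1 hit for associativity 2 and 2 hits for associativity 4 (fully associative), in line with the three tables.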
19
Cache Fundamentals: The "Three C's"
Compulsory Misses
– 1st access to a block: never in the cache
Capacity Misses
– Cache cannot contain all the blocks
– Blocks are discarded and retrieved later
– Avoided by increasing cache size
Conflict Misses
– Too many blocks mapped to same set
– Avoided by increasing associativity
Some add a 4th C: Coherence Misses
20
Compulsory miss example
for(i=0; i<10; i++) A[i] = f(B[i]);
Cache (@ i=2): A[0] B[1] B[2] B[0] A[1] A[2] ---
Cache (@ i=3): B[3] and A[3] required
– B[3] never loaded before: loaded into cache
– A[3] never loaded before: allocates new line
21
Capacity miss example
for(i=0; i<N; i++) A[i] = B[i+3]+B[i];
Cache size: 8 blocks of 1 word, fully associative
Cache contents after each iteration:
i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]
Result: 11 compulsory misses (+8 write misses), 5 capacity misses
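The miss counts of this example can be reproduced with a short simulation. The sketch below rests on assumptions that the slide does not state explicitly: LRU replacement, write-allocate, and the access order B[i+3], B[i], A[i] within each iteration. The names miss_stats and simulate, and the placement of A and B at distinct word addresses, are hypothetical.

```c
#include <assert.h>

#define CACHE_BLOCKS 8   /* fully associative, one word per block */
#define MAX_WORDS 64     /* size of the hypothetical word address space */

typedef struct { int read_compulsory, write_miss, capacity, hits; } miss_stats;

/* Simulate for(i=0; i<n; i++) A[i] = B[i+3]+B[i]; with A at word
   addresses 0..n-1 and B at n..2n+2 (assumed layout). */
static miss_stats simulate(int n) {
    assert(n > 0 && 2 * n + 3 <= MAX_WORDS);
    int block[CACHE_BLOCKS], last_use[CACHE_BLOCKS];
    int seen[MAX_WORDS] = {0};   /* 1 once an address has been touched */
    miss_stats s = {0, 0, 0, 0};
    int now = 0;
    for (int e = 0; e < CACHE_BLOCKS; e++) { block[e] = -1; last_use[e] = -1; }

    for (int i = 0; i < n; i++) {
        int addr[3] = { n + i + 3, n + i, i };   /* B[i+3], B[i], A[i] */
        for (int k = 0; k < 3; k++) {
            int found = -1, victim = 0;
            for (int e = 0; e < CACHE_BLOCKS; e++) {
                if (block[e] == addr[k]) found = e;
                if (last_use[e] < last_use[victim]) victim = e;  /* LRU */
            }
            if (found >= 0) { s.hits++; victim = found; }
            else {
                if (k == 2) s.write_miss++;                   /* write to A[i] */
                else if (!seen[addr[k]]) s.read_compulsory++; /* first touch */
                else s.capacity++;   /* fully associative: no conflict misses */
                block[victim] = addr[k];
            }
            seen[addr[k]] = 1;
            last_use[victim] = now++;
        }
    }
    return s;
}
```

Under these assumptions, simulate(8) reports 11 compulsory read misses, 8 write misses, 5 capacity misses and no hits, the same totals as on the slide.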
22
Conflict miss example
for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];
[Figure: main memory holds A[0] to A[3] at addresses 0 to 3, followed by B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], B[1][1], ... up to B[3][9]; each memory address maps to one of the cache addresses 0 to 7. The cache is shown at i=0: for even j, B[0][j] occupies a different cache location than A[0]; for odd j, A[0] and B[0][j] map to the same cache location 0.]
– A[i] is read 10 times, so A[0] is loaded multiple times
– For odd j, A[0] is flushed in favor of B[0][j] -> Miss
23
"Three C's" vs Cache size [Gee93]
24
Data layout may reduce cache misses
25
Example 1: Capacity & Compulsory miss reduction
for(i=0; i<N; i++) A[i] = B[i+3]+B[i];
Cache contents after each iteration:
i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]
Result: 11 compulsory misses (+8 write misses), 5 capacity misses
26
Fit data in cache with in-place mapping
for(i=0; i<12; i++) A[i] = B[i+3]+B[i];
[Figure: number of words in use by A[] and B[] plotted against i, next to the cache memory and a main memory of 16 words holding the combined array AB[new]]
– Traditional analysis: max = 27 words
– Detailed analysis: max = 15 words
27
Remove capacity / compulsory misses with in-place mapping
for(i=0; i<N; i++) AB[i] = AB[i+3]+AB[i];
Cache contents after each iteration:
i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]
Result: 11 compulsory misses, 5 cache hits (+8 write hits)
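In-place mapping is safe here because iteration i reads AB[i+3] and AB[i] before it overwrites AB[i], and no later iteration reads the old value at index i. The following sketch compares both versions on the same input; the function names separate and in_place are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: separate arrays. B has n + 3 words, A has n words. */
static void separate(int n, const int *B, int *A) {
    assert(A != NULL && B != NULL);
    for (int i = 0; i < n; i++)
        A[i] = B[i + 3] + B[i];
}

/* In-place version: AB holds the n + 3 words of B on entry;
   on exit AB[0..n-1] holds the values of A. */
static void in_place(int n, int *AB) {
    assert(AB != NULL);
    for (int i = 0; i < n; i++)
        AB[i] = AB[i + 3] + AB[i];   /* old AB[i] is dead after this read */
}
```

For n = 12 the separate version needs 12 + 15 = 27 words, while the in-place version needs only the 15 words of AB, which is the difference between the traditional and the detailed analysis.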
28
Example 2: Conflict miss reduction
for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];
[Figure: main memory holds A[0] to A[3] at addresses 0 to 3, followed by B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], B[1][1], ... up to B[3][9]; each memory address maps to one of the cache addresses 0 to 7. The cache is shown at i=0: for even j, B[0][j] occupies a different cache location than A[0]; for odd j, A[0] and B[0][j] map to the same cache location 0.]
– A[i] is read 10 times, so A[0] is loaded multiple times
– For odd j, A[0] is flushed in favor of B[0][j] -> Miss
29
Avoid conflict miss with main memory data layout
for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];
[Figure: main memory holds A[0] to A[3] at addresses 0 to 3 and B[0][0] to B[3][0] at addresses 4 to 7; a gap is left, and B[0][1] to B[3][1] start at address 12, and so on. The cache is shown at i=0 for any j: A[0] stays at cache location 0 and B[0][j] occupies a different location.]
– A[i] is read multiple times, and A[0] is no longer loaded multiple times
– No conflict
© imec 2001
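The effect of the gap can be estimated with a small direct-mapped cache simulation. This is a sketch under stated assumptions, not the authors' analysis: an 8-line cache with one word per line, write-allocate, the access order read A[i], read B[i][j], write A[i], and B[i][j] stored at word address 4 + col_stride*j + i. The names touch, count_misses and col_stride are hypothetical.

```c
#include <assert.h>

#define CACHE_LINES 8   /* direct mapped, one word per line */

static long cache_line[CACHE_LINES];   /* word address held, -1 = empty */

/* Access one word address; returns 1 on a miss, 0 on a hit. */
static int touch(long addr) {
    int line = (int)(addr % CACHE_LINES);
    if (cache_line[line] == addr) return 0;
    cache_line[line] = addr;           /* allocate on read and write misses */
    return 1;
}

/* Count misses of the loop nest when consecutive columns of B start
   col_stride words apart: 4 = packed layout, 8 = layout with a gap. */
static int count_misses(int col_stride) {
    assert(col_stride >= 4);
    int misses = 0;
    for (int k = 0; k < CACHE_LINES; k++) cache_line[k] = -1;
    for (int j = 0; j < 10; j++)
        for (int i = 0; i < 4; i++) {
            long a = i;                                /* A[i] at 0..3 */
            long b = 4 + (long)col_stride * j + i;     /* B[i][j] */
            misses += touch(a);   /* read A[i] */
            misses += touch(b);   /* read B[i][j] */
            misses += touch(a);   /* write A[i] */
        }
    return misses;
}
```

Under these assumptions count_misses(4) gives 64 misses and count_misses(8) gives 44. The 44 are the compulsory misses (4 for A, 40 for B); the 20 extra misses of the packed layout are conflict misses on A[i] in the odd-j iterations, which the gap removes at the cost of unused main memory words.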
30
Data Layout Organization for Direct Mapped Caches
31
Conclusions on Data Management
In multi-media applications, exploring data transfer and storage issues should be done at source code level
DMM method:
– Reducing number of external memory accesses
– Reducing external memory size
– Trade-offs between internal memory complexity and speed
– Platform independent high-level transformations
– Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
– Substantial energy reduction
Although caches are hardware controlled, data layout can largely influence the miss rate