Embedded Systems in Silicon TD5102 Data Management (3) SCBD, MAA, and Data Layout Henk Corporaal Technical University Eindhoven DTI / NUS Singapore 2005/2006
Part 3 overview: Recap on design flow; Platform-dependent steps – SCBD: Storage Cycle Budget Distribution – MAA: Memory Allocation and Assignment – Data layout techniques for RAM – Data layout techniques for caches; Results; Conclusions. Thanks to the IMEC DTSE people.
Design flow (figure): SW design flow and HW design flow joined by SW/HW co-design; steps: Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization.
DM steps: C-in → Preprocessing → Dataflow transformations → Loop transformations → Data reuse → Memory hierarchy layer assignment → Cycle budget distribution → Memory allocation and assignment → Data layout → C-out, followed by Address optimization.
Result of memory hierarchy assignment for cavity detection (figure: L3 = 1 MB SDRAM, L2 = 16 KB cache, L1 = 128 B register file; arrays image_in, gauss_x, gauss_xy, comp_edge, image_out of size N*M with intermediate copies of sizes M*3, 3*3, 3*1, 1*1 between the layers).
Data-reuse – cavity detection code, after reuse transformation (partly; some loop-bound conditions were garbled in extraction and are reconstructed here):
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      in_pixels[x%3] = image_in[x][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<N-1 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<N && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}
Storage Cycle Budget Distribution & Memory Allocation and Assignment
Define the memory organization which can provide enough bandwidth at minimal cost
Lower required performance by balancing bandwidth: reduce the max. number of loads/stores per cycle (figure: required memory bandwidth over time, high peak before vs. balanced low profile after).
Data management approach: one of the many possible schedules (figure).
Data management approach (figure).
Conflict cost calculation: self conflicts; chromatic number; number of conflicts.
Self conflict requires a dual-port memory.
Chromatic number of the conflict graph = minimum # of single-port memories.
Low number of conflicts gives large assignment freedom.
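The chromatic-number bound above can be illustrated with a greedy first-fit coloring of a conflict graph. A minimal sketch (the 5-array graph and the greedy heuristic are illustrative, not the actual tool's algorithm; greedy coloring only gives an upper bound on the chromatic number):

```c
#include <assert.h>
#include <string.h>

#define NARR 5  /* arrays A..E in a hypothetical conflict graph */

/* Greedy first-fit coloring; returns the number of colors used.
   Each color corresponds to one single-port memory: two arrays
   that conflict (are accessed in the same cycle) must get
   different colors. */
int greedy_colors(int adj[NARR][NARR], int color[NARR]) {
    int used = 0;
    for (int v = 0; v < NARR; ++v) {
        int taken[NARR];
        memset(taken, 0, sizeof taken);
        for (int u = 0; u < v; ++u)
            if (adj[v][u]) taken[color[u]] = 1;
        int c = 0;
        while (taken[c]) ++c;          /* smallest free color */
        color[v] = c;
        if (c + 1 > used) used = c + 1;
    }
    return used;
}
```

For example, a conflict cycle A-B-C-D-A with E unconflicted needs only 2 single-port memories.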
Conflict Directed Ordering is used for flat-graph scheduling: reduce access intervals until all conflicts are known; driven by the cost of conflicts; constructive algorithm (figure: R/W accesses to arrays A–D being placed into time slots).
Local optimization does not imply global optimization.
Budget distribution has large impact on memory cost
Decreasing basic block length until target cycle budget is met
Obtain more freedom by merging loops: more scheduling freedom; extension to different threads.
Memory allocation and assignment
Memory Allocation and Assignment substeps: 1. memory allocation; 2. array-to-memory assignment; 3. port assignment and bus sharing (figure: arrays A–D mapped onto the allocated memories).
Influence of MAA: assign arrays to memories and define the memory interconnect to minimize power and area. Per array: bitwidth, address range; per memory organization: nr. of memories, nr. of ports (R/W/RW), bitwidth (maximum), size (figure: arrays A, B, …, K, L assigned to MEMORY-1 … MEMORY-N).
Trade-offs in the physical memory: trade off area and power for the required bandwidth (figure: arrays A–D in alternative memory configurations with different area/power points).
Example of bus sharing possibilities (figure: parallel access pairs such as R(A)R(B), W(C)R(A), W(A)W(C) and three alternative assignments of arrays A, B, C to memories m1–m3, some combinations marked infeasible).
Decreasing cycle budget limits freedom and raises cost
Resulting Pareto curve for DAB synchro application
Example conflict graph for cavity detection
MAA result (figure): Power: On-chip area:
Data layout: how to put data into memory.
Memory data layout for custom and cache architectures (figure: arrays A–D and F–H placed into MEM1/MEM2 of a custom PE, vs. copies A', B' held in the cache of a cache-based PE).
Intra-array in-place mapping reduces the size of one array at a time; window = max nr. of live elements (figure: rows i-1 and i of a[][] with the sliding window):
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
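The window equals the maximum number of simultaneously live elements. A minimal liveness-bookkeeping sketch (assuming row 0 of a[][] is initialized before the loop and each a[i-1][j] dies right after its single use) reproduces the 6-element window for this loop:

```c
#include <assert.h>

/* Max simultaneously live elements for:
   for (i=1; i<5; i++) for (j=0; j<5; j++) a[i][j] = f(a[i-1][j]);
   Row 0 is assumed live at the start (5 values). */
static int max_live_elements(void) {
    int live = 5, max_live = live;
    for (int i = 1; i < 5; ++i)
        for (int j = 0; j < 5; ++j) {
            ++live;                        /* a[i][j] produced */
            if (live > max_live) max_live = live;
            --live;                        /* a[i-1][j] dead after its only use */
        }
    return max_live;
}
```

At any point the live set is the unread tail of row i-1 plus the already-produced prefix of row i, which peaks at 6 elements.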
Two-phase mapping of array elements onto addresses: the storage order maps variable domains to abstract addresses; allocation maps abstract addresses to real addresses (figure: arrays A, B, C mapped to abstract addresses aA, aB, aC and then into memory).
H.C. TD a2a2 a1a1 a=3a 1 +a 2 a=3(1-a 1 )+a 2 a=3a 1 +(2-a 2 ) a=3(1-a 1 )+(2-a 2 ) a=2a 2 +a 1 a=2a 2 +(1-a 1 )a=2(2-a 2 )+(1-a 1 ) a=2(2-a 2 )+a 1 a a=??? memory address variable domain Exploration of storage orders for 2-dimensional array ??????
Chosen storage order determines window size.
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
Row-major ordering, a = 5i+j:
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*i+j] = f(a[5*i+j-5]);
highest live address 5*i+j, lowest live address 5*i+j-5; difference + 1 = window: 6.
Column-major, a = 5j+i:
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*j+i] = f(a[5*j+i-1]);
highest live address 5*4+i-1, lowest live address 5*0+i-1; difference + 1 = window: 21.
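With the row-major window of 6, the 25-element array can be folded modulo 6 without changing the results. A runnable sketch (the function f and the row-0 initialization are made up for illustration):

```c
#include <assert.h>

static int f(int v) { return 2 * v + 1; }   /* placeholder for f() */

/* Reference: full 5x5 array, row-major. */
static void run_full(int out[5]) {
    int a[5][5];
    for (int j = 0; j < 5; ++j) a[0][j] = j;      /* assumed init of row 0 */
    for (int i = 1; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            a[i][j] = f(a[i-1][j]);
    for (int j = 0; j < 5; ++j) out[j] = a[4][j];
}

/* Windowed: row-major address 5*i+j folded modulo the window (6). */
static void run_windowed(int out[5]) {
    int a[6];
    for (int j = 0; j < 5; ++j) a[(5*0 + j) % 6] = j;
    for (int i = 1; i < 5; ++i)
        for (int j = 0; j < 5; ++j)
            a[(5*i + j) % 6] = f(a[(5*i + j - 5) % 6]);
    for (int j = 0; j < 5; ++j) out[j] = a[(5*4 + j) % 6];
}
```

The fold is safe because the producing and consuming addresses differ by 5, which is never 0 modulo 6, so a value is never overwritten before its use.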
Static allocation: no in-place mapping (figure: arrays A–E each at a fixed address range aA…aE; memory size = sum of the full array sizes).
H.C. TD C Memory Size A D B E Static, windowed C Memory Size A D B E Dynamic, windowed Windowed Allocation: intra-array in-place mapping
Dynamic allocation: inter-array in-place mapping (figure: arrays A–E share addresses over time; memory size below the sum of the array sizes).
Dynamic allocation strategy with common window (figure: arrays A–E dynamically placed within one common window).
Expressing memory data layout in source code. Example: B is an array of 10x20 elements; A: offset 120, no window; B: storage order [20, 2], offset 134, window 78.
Before:
bit8 B[10][20];
bit6 A[30];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[x][y] = …;
  }
After:
bit8 memory[334];
bit8* B = (bit8*)&memory[134];
bit6* A = (bit6*)&memory[120];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[(x*20+y*2)%78] = …;
  }
Example of memory data layout for storage size reduction:
int x[W], y[W];
for (i1=0; i1 < W; i1++)
  x[i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++)
    sum += c[N+di2] * x[wrap(i2+di2,W)];
  y[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(y[i3]);
Occupied address-time domain of x[] and y[]
Optimized source code after memory data layout:
int mem1[N+W];
for (i1=0; i1 < W; i1++)
  mem1[N+i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++)
    sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
  mem1[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(mem1[i3]);
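A runnable sketch checking that the single-buffer version computes the same outputs as the two-array version. W, N, the coefficients c[] and the input values are made up; wrap() is not defined on the slides, so it is assumed here to clamp the index into [0, W-1] (with a circular modulo, x[0] would stay live until the last iteration and this particular in-place layout would not be valid):

```c
#include <assert.h>

#define W 8
#define NTAP 2
static const int c[2*NTAP + 1] = {1, 2, 3, 2, 1};   /* made-up coefficients */

/* Assumed semantics: clamp the index into [0, w-1]. */
static int wrap(int a, int w) { return a < 0 ? 0 : (a >= w ? w - 1 : a); }

/* Two-array version (x[], y[]). */
static void filter_two_arrays(const int in[W], int y[W]) {
    int x[W];
    for (int i1 = 0; i1 < W; i1++) x[i1] = in[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -NTAP; di2 <= NTAP; di2++)
            sum += c[NTAP + di2] * x[wrap(i2 + di2, W)];
        y[i2] = sum;
    }
}

/* Single-buffer version (mem1[NTAP+W]); y[i2] overwrites only
   x values that are already dead. */
static void filter_in_place(const int in[W], int out[W]) {
    int mem1[NTAP + W];
    for (int i1 = 0; i1 < W; i1++) mem1[NTAP + i1] = in[i1];
    for (int i2 = 0; i2 < W; i2++) {
        int sum = 0;
        for (int di2 = -NTAP; di2 <= NTAP; di2++)
            sum += c[NTAP + di2] * mem1[NTAP + wrap(i2 + di2, W)];
        mem1[i2] = sum;
    }
    for (int i3 = 0; i3 < W; i3++) out[i3] = mem1[i3];
}
```

The N-word shift leaves exactly enough room: y[i2] lands at address i2, which only clobbers x[i2-N-1] and earlier, values no longer read by later iterations.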
Optimized OAT domain after memory data layout
In-place mapping for cavity detection example: the input image is partly consumed by the time the first results for the output image are ready (figures: occupied address-time domains of image_in, image_out, and the merged array image).
In-place – cavity detection code. Before (separate input and output arrays):
for (y=0; y<=M+3; ++y)
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = …; /* code removed */
    … = image_in[x+1][y];
  }
After (image_in and image_out merged into one array image):
for (y=0; y<=M+3; ++y)
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = …; /* code removed */
    … = image[x+1][y];
  }
Cavity detection summary. Overall result: local accesses reduced by factor 3; memory size reduced by factor 5; power reduced by factor 5; system bus load reduced by factor 12; performance worsened by factor 6.
Data layout for caches. Caches are hardware controlled; therefore no explicit copy code is needed! What can we do?
Cache principles (figure: the CPU issues a p-bit address split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits); the cache holds 2^k lines (blocks) of 2^m bytes; the stored tag is compared with the address tag to decide Hit?; on a miss the line is fetched from main memory).
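The tag / index / byte-offset split from the figure can be sketched directly in code. The sizes here are made up for illustration: 16-bit addresses (p = 16), 2^k = 8 lines (k = 3), 2^m = 4 bytes per line (m = 2):

```c
#include <assert.h>
#include <stdint.h>

#define M_BITS 2               /* 2^m = 4 bytes per cache line */
#define K_BITS 3               /* 2^k = 8 lines in the cache   */

typedef struct { unsigned tag, index, byte; } AddrParts;

/* Split a p-bit address into tag | index | byte offset. */
static AddrParts split_address(uint16_t addr) {
    AddrParts a;
    a.byte  =  addr               & ((1u << M_BITS) - 1);
    a.index = (addr >> M_BITS)    & ((1u << K_BITS) - 1);
    a.tag   =  addr >> (M_BITS + K_BITS);
    return a;
}
```

For example, address 0xB7 (1011 0111) splits into tag 5, index 5, byte offset 3.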
Cache Architecture Fundamentals: Block placement – where in the cache will a new block be placed? Block identification – how is a block found in the cache? Block replacement policy – which block is evicted from the cache? Updating policy – how is a block written from cache to memory?
Block placement policies (figure): direct mapped (one-to-one) – a memory block can go to one cache line only ("here only!"); fully associative (one-to-many) – anywhere in the cache.
Direct-mapped cache (figure: address bit positions split into tag, index and byte offset; valid bit and stored tag are checked to produce Hit and Data).
Taking advantage of spatial locality: direct-mapped cache with larger blocks (figure: address bit positions).
Performance: increasing the block size tends to decrease the miss rate (figure).
Set-associative cache (figure).
Performance (figure: miss rate for 1 KB, 2 KB and 8 KB caches).
Cache Fundamentals: the "Three C's". Compulsory misses – 1st access to a block: never in the cache. Capacity misses – the cache cannot contain all the blocks; blocks are discarded and retrieved later; avoided by increasing cache size. Conflict misses – too many blocks mapped to the same set; avoided by increasing associativity.
Compulsory miss example:
for (i=0; i<10; i++)
  A[i] = f(B[i]);
(figure: at i=2 the cache holds B[0..2] and A[0..2]; at i=3, B[3] and A[3] are required – B[3] was never loaded before, so it is loaded into the cache; A[3] was never loaded before, so a new line is allocated).
Capacity miss example (cache size: 8 blocks of 1 word, fully associative):
for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];
(figure: cache contents for i=0..7) Result: 11 compulsory misses (+8 write misses), 5 capacity misses.
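A minimal fully associative LRU simulator reproduces the miss counts for i = 0..7. Assumptions: one-word blocks, write-allocate (a write miss also loads the block), and made-up disjoint base addresses for A[] and B[]:

```c
#include <assert.h>
#include <string.h>

#define BLOCKS 8
#define NO_ADDR (-1)

static int cache_tag[BLOCKS];         /* address held by each block */
static int cache_age[BLOCKS];         /* LRU timestamps             */
static int now;

static void cache_reset(void) {
    for (int b = 0; b < BLOCKS; ++b) cache_tag[b] = NO_ADDR;
    memset(cache_age, 0, sizeof cache_age);
    now = 0;
}

/* Access one word; returns 1 on a miss. Fully associative, LRU. */
static int cache_access(int addr) {
    int lru = 0;
    ++now;
    for (int b = 0; b < BLOCKS; ++b) {
        if (cache_tag[b] == addr) { cache_age[b] = now; return 0; }
        if (cache_age[b] < cache_age[lru]) lru = b;
    }
    cache_tag[lru] = addr;            /* evict least recently used */
    cache_age[lru] = now;
    return 1;
}
```

Replaying the access stream B[i+3], B[i], A[i] for i = 0..7 and classifying each read miss as compulsory (first touch) or capacity (re-touch) yields 11 compulsory misses, 5 capacity misses and 8 write misses, matching the figure.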
Conflict miss example:
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
(figure: memory and cache addresses of A[0..3] and B[0..3][0..9]; A[i] is read multiple times, but A[0] is flushed in favor of B[0][j], which maps to the same cache address, so every j causes a miss).
"Three C's" vs cache size [Gee93] (figure).
Data layout may reduce cache misses
Example 1: capacity & compulsory miss reduction. Same example as before (8-block fully associative cache):
for (i=0; i<N; i++)
  A[i] = B[i+3]+B[i];
(figure: cache contents for i=0..7) 11 compulsory misses (+8 write misses), 5 capacity misses.
Fit data in cache with in-place mapping:
for (i=0; i<12; i++)
  A[i] = B[i+3]+B[i];
Traditional analysis: max = 27 words; detailed analysis: max = 15 words, so the merged array AB fits a 16-word cache (figure: address-time domains of A[] and B[] folded into AB[new] in main memory).
Remove capacity/compulsory misses with in-place mapping:
for (i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];
(figure: cache contents for i=0..7) Result: 11 compulsory misses, 5 cache hits (+8 write hits).
Example 2: conflict miss reduction. Same example as before:
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
(figure: A[0] and B[0][j] map to the same cache address, so A[0] is repeatedly flushed and misses).
Avoid conflict miss with main-memory data layout (© imec 2001): leave a gap in main memory so that A[0..3] and the B[i][j] elements accessed with them no longer map to the same cache address; A[i] is read multiple times without conflict (figure: memory layout with gap, cache at i=0 for any j).
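The effect of the gap can be shown with a small direct-mapped simulator. Assumptions (all made up for illustration): 16 one-word blocks, A[] at base address 0, B[] at base 4, and the gap realized by padding B's row stride to the cache size so B never lands in A's sets:

```c
#include <assert.h>

#define SETS 16                     /* direct-mapped, 1 word per block */

static int dm_tag[SETS];
static int dm_misses;

static void dm_reset(void) {
    for (int s = 0; s < SETS; ++s) dm_tag[s] = -1;
    dm_misses = 0;
}

static void dm_access(int addr) {
    int set = addr % SETS;
    if (dm_tag[set] != addr) { dm_tag[set] = addr; ++dm_misses; }
}

/* Run  for(j) for(i) A[i] = A[i] + B[i][j]  with the given row
   stride for B, counting misses: read A[i], read B[i][j], write A[i]. */
static int run_kernel(int b_base, int b_stride) {
    dm_reset();
    for (int j = 0; j < 10; ++j)
        for (int i = 0; i < 4; ++i) {
            dm_access(0 + i);                       /* read A[i]    */
            dm_access(b_base + b_stride * i + j);   /* read B[i][j] */
            dm_access(0 + i);                       /* write A[i]   */
        }
    return dm_misses;
}
```

With the packed stride 10, parts of B alias A's sets 0..3 and evict A[i] between uses; with stride padded to 16, B stays in sets 4..13, so only the 44 compulsory misses (4 for A, 40 for B) remain.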
Data Layout Organization for Direct Mapped Caches
Conclusion on Data Management. In multimedia applications, data transfer and storage issues should be explored at source-code level. DMM method: reducing the number of external memory accesses; reducing external memory size; trade-offs between internal memory complexity and speed; platform-independent high-level transformations; platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, …); substantial energy reduction.