
1 Embedded Systems in Silicon TD5102
Data Management (3): SCBD, MAA, and Data Layout
Henk Corporaal
http://www.ics.ele.tue.nl/~heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore 2005/2006

2 H.C. TD5102 Part 3 overview
Recap of the design flow
Platform-dependent steps:
–SCBD: Storage Cycle Budget Distribution
–MAA: Memory Allocation and Assignment
–Data layout techniques for RAM
–Data layout techniques for caches
Results
Conclusions
Thanks to the IMEC DTSE people

3 H.C. TD5102 Design flow (figure): starting from a concurrent OO specification, OO overhead is removed, followed by dynamic memory management, task concurrency management, physical memory management and address optimization; the result feeds SW/HW co-design and the SW and HW design flows.

4 H.C. TD5102 DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse
Memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
C-out
Address optimization

5 H.C. TD5102 Result of memory hierarchy layer assignment for cavity detection
(figure: the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out and their copies, with sizes ranging from N*M down to M*3, 3*1, 3*3 and 1*1, are distributed over three layers — a 1 MB SDRAM (L3), a 16 KB cache (L2) and a 128 B register file (L1) — with access counts between N*M and N*M*8)

6 H.C. TD5102 Data reuse - cavity detection code
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      in_pixels[x%3] = image_in[x][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<N-1 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<N-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3] * Gauss[Abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
Code after reuse transformation (partly)

7 Storage Cycle Budget Distribution & Memory Allocation and Assignment

8 H.C. TD5102 Goal: define the memory organization that provides enough bandwidth at minimal cost

9 H.C. TD5102 Lower the required performance by balancing bandwidth
(figure: required memory bandwidth over time, before and after balancing — the peak requirement is reduced)
Reduce the maximum number of loads/stores per cycle

10 H.C. TD5102 Data management approach (figure: one of the many possible schedules)

11 H.C. TD5102 Data management approach

12 H.C. TD5102 Conflict cost calculation
Self conflicts
Chromatic number
Number of conflicts

13 H.C. TD5102 Self conflict → dual-port memory

14 H.C. TD5102 Chromatic number → minimum number of single-port memories

15 H.C. TD5102 Low number of conflicts → large assignment freedom

16 H.C. TD5102 Conflict Directed Ordering (CDO) is used for flat-graph scheduling
(figure: read/write accesses R(A)/W(A), R(B)/W(B), R(C)/W(C), R(D)/W(D) being ordered into time slots 1-6)
Reduce intervals until all conflicts are known
Driven by the cost of conflicts
Constructive algorithm

17 H.C. TD5102 Local optimization does not lead to global optimization

18 H.C. TD5102 Budget distribution has a large impact on memory cost

19 H.C. TD5102 Decreasing basic-block lengths until the target cycle budget is met

20 H.C. TD5102 Obtain more freedom by merging loops
More scheduling freedom
Extension to different threads

21 H.C. TD510221 Memory allocation and assignment

22 H.C. TD5102 Memory Allocation and Assignment substeps
(figure: (1) memory allocation, (2) array-to-memory assignment of arrays A, B, C, D, (3) port assignment and bus sharing)

23 H.C. TD5102 Influence of MAA
(figure: arrays are assigned to memories and the memory interconnect is defined so as to minimize power and area; the relevant array properties are bit width, address range, number of memories and number of ports, while each memory (MEMORY-1 … MEMORY-N) is characterized by its maximum bit width, size and number of ports (R/W/RW))

24 H.C. TD5102 Trade-offs in the physical memory
(figure: different assignments of arrays A, B, C, D to memories trade off area against power for the required bandwidth)

25 H.C. TD5102 Example of bus-sharing possibilities
(figure: for simultaneous access pairs such as R(A)/R(B), R(B)/W(A), W(C)/R(A), R(A)/W(B), W(A)/W(B) and W(A)/W(C), different assignments of arrays A, B, C to memories m1, m2, m3 allow or forbid sharing a bus)

26 H.C. TD5102 Decreasing the cycle budget limits freedom and raises cost

27 H.C. TD5102 Resulting Pareto curve for the DAB synchro application

28 H.C. TD5102 Example conflict graph for cavity detection

29 H.C. TD5102 MAA result (figure: resulting power and on-chip area)

30 H.C. TD5102 Data layout: how to put data into memory

31 H.C. TD5102 Memory data layout for custom and cache architectures
(figure: for a custom memory architecture, arrays A, B, C and F, G, H are assigned concrete locations in memories MEM1 and MEM2; for a cache-based processor (PE), the main-memory layout determines how copies such as A' and B' map into the cache)

32 H.C. TD5102 Intra-array in-place mapping reduces the size of one array
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
(figure: address/time domain of a[][]; the window is the maximum number of simultaneously live elements)

33 H.C. TD5102 Two-phase mapping of array elements onto addresses
(figure: the storage order maps each variable domain (A, B, C) to abstract addresses, and the allocation maps abstract addresses to real addresses)

34 H.C. TD5102 Exploration of storage orders for a 2-dimensional array
(figure: for an array with index a1 in {0,1} and a2 in {0,1,2}, candidate storage orders map the variable domain to memory addresses a = 3*a1 + a2, a = 3*(1-a1) + a2, a = 3*a1 + (2-a2), a = 3*(1-a1) + (2-a2), a = 2*a2 + a1, a = 2*a2 + (1-a1), a = 2*(2-a2) + a1, or a = 2*(2-a2) + (1-a1), i.e. row-major or column-major order with either dimension traversed forward or backward)

35 H.C. TD5102 Chosen storage order determines window size
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);

row-major ordering: a = 5*i + j
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*i+j] = f(a[5*i+j-5]);
Highest live address: 5*i+j
Lowest live address:  5*i+j-5
Difference + 1 = window: 6

column-major ordering: a = 5*j + i
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*j+i] = f(a[5*j+i-1]);
Highest live address: 5*4+i-1
Lowest live address:  5*0+i-1
Difference + 1 = window: 21
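Once the window is known, the chosen storage order can be folded into a buffer of that size with modulo addressing. A minimal sketch for the row-major case above (the buffer name a_buf and the assumption that row 0 is already in the buffer are mine, not from the slides):

  int a_buf[6];   /* replaces a[5][5]; holds only the live window of 6 elements */
  int i, j;
  /* assume a[0][0..4] has already been produced into a_buf[0..4] */
  for (i = 1; i < 5; i++)
    for (j = 0; j < 5; j++)
      a_buf[(5*i + j) % 6] = f(a_buf[(5*i + j - 5) % 6]);  /* addresses taken modulo the window */

Because the live addresses at any moment span at most 6 consecutive locations, no two live elements collide under the modulo-6 mapping.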

36 H.C. TD5102 Static allocation: no in-place mapping
(figure: arrays A-E each occupy their own address range aA-aE for the whole duration, so the required memory size is the sum of all array sizes)

37 H.C. TD5102 Windowed allocation: intra-array in-place mapping
(figure: memory size over time for arrays A-E, shown for static, windowed allocation and for dynamic, windowed allocation)

38 H.C. TD5102 Dynamic allocation: inter-array in-place mapping
(figure: arrays A-E share addresses over time according to their lifetimes, further reducing the required memory size)

39 H.C. TD5102 Dynamic allocation strategy with common window
(figure: arrays A-E are mapped into one common window, combining intra-array and inter-array in-place mapping)

40 H.C. TD5102 Expressing memory data layout in source code
Example: B is an array of 10x20 elements with storage order [20, 2], offset 134 and window 78; A has offset 120 and no window.

Before:
bit8 B[10][20];
bit6 A[30];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[x][y] = …;
  }

After:
bit8 memory[334];
bit8* B = (bit8*)&memory[134];
bit6* A = (bit6*)&memory[120];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[(x*20+y*2)%78] = …;
  }

41 H.C. TD5102 Example of memory data layout for storage size reduction
int x[W], y[W];
for (i1=0; i1 < W; i1++)
  x[i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++) {
    sum += c[N+di2] * x[wrap(i2+di2,W)];
  }
  y[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(y[i3]);
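The code uses a helper wrap() that the slides do not define; a plausible reading is a circular index into the W-element array. A minimal sketch under that assumption:

  /* assumed helper: map index i into [0, W), also for negative i such as i2+di2 */
  static int wrap(int i, int W) {
    return ((i % W) + W) % W;
  }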

42 H.C. TD5102 Occupied address-time (OAT) domain of x[] and y[]

43 H.C. TD5102 Optimized source code after memory data layout
int mem1[N+W];
for (i1=0; i1 < W; i1++)
  mem1[N+i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++) {
    sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
  }
  mem1[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(mem1[i3]);

44 H.C. TD5102 Optimized OAT domain after memory data layout

45 H.C. TD5102 In-place mapping for the cavity detection example
(figure: occupied address/time domains of image_in, image_out and the merged array image)
The input image is partly consumed by the time the first results of the output image are ready, so input and output can share the same storage.

46 H.C. TD5102 In-place mapping - cavity detection code
Before (separate input and output arrays):
for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = …;   /* code removed */
    … = image_in[x+1][y];
  }
}
After (image_in and image_out merged into one array image):
for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = …;       /* code removed */
    … = image[x+1][y];
  }
}

47 H.C. TD5102 Cavity detection summary
Overall result:
Local accesses reduced by a factor of 3
Memory size reduced by a factor of 5
Power reduced by a factor of 5
System bus load reduced by a factor of 12
Performance worsened by a factor of 6

48 H.C. TD5102 Data layout for caches
Caches are hardware controlled
Therefore: no explicit copy code is needed!
What can we do?

49 H.C. TD5102 Cache principles
(figure: the p-bit CPU address is split into a (p-k-m)-bit tag, a k-bit index and an m-bit byte address; the index selects one of 2^k cache lines of 2^m bytes each (the cache line / block), the stored tag is compared with the address tag to decide a hit, and on a miss the block is fetched from main memory)
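As a concrete reading of this split, a small sketch in C; the field names and the example values of k and m (chosen to match the direct-mapped cache on the next slide) are assumptions, not code from the slides:

  #include <stdint.h>

  #define M_BITS 2    /* assumed: 2^m = 4-byte blocks      */
  #define K_BITS 10   /* assumed: 2^k = 1024 cache lines   */

  /* split a byte address into tag, index and byte offset */
  static void split_address(uint32_t addr,
                            uint32_t *tag, uint32_t *index, uint32_t *offset) {
    *offset = addr & ((1u << M_BITS) - 1);               /* lowest m bits  */
    *index  = (addr >> M_BITS) & ((1u << K_BITS) - 1);   /* next k bits    */
    *tag    = addr >> (M_BITS + K_BITS);                 /* remaining bits */
  }

A lookup then reads the line selected by *index, checks its valid bit, and compares its stored tag with *tag to decide the hit.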

50 H.C. TD5102 Cache Architecture Fundamentals
Block placement
–Where in the cache will a new block be placed?
Block identification
–How is a block found in the cache?
Block replacement policy
–Which block is evicted from the cache?
Updating policy
–How is a block written from cache to memory?

51 H.C. TD5102 Block placement policies
(figure: memory blocks 0-15 mapped onto a cache with 8 lines; fully associative (one-to-many): a block can be placed anywhere in the cache; direct mapped (one-to-one): each block can go to exactly one cache line)

52 H.C. TD5102 Direct mapped cache
(figure, address bit positions: a 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2) and a 2-bit byte offset; the index selects one of 1024 entries, each holding a valid bit, a tag and a data word, and a hit is signalled when the entry is valid and its tag matches)

53 H.C. TD5102 Taking advantage of spatial locality: direct mapped cache with larger blocks (figure: address bit positions)

54 H.C. TD5102 Performance (figure: increasing the block size tends to decrease the miss rate)

55 H.C. TD5102 4-way set-associative cache

56 H.C. TD5102 Performance (figure: miss rates for cache sizes of 1 KB, 2 KB and 8 KB)

57 H.C. TD5102 Cache Fundamentals: The "Three C's"
Compulsory misses
–1st access to a block: never in the cache
Capacity misses
–Cache cannot contain all the blocks
–Blocks are discarded and retrieved later
–Avoided by increasing cache size
Conflict misses
–Too many blocks mapped to the same set
–Avoided by increasing associativity

58 H.C. TD5102 Compulsory miss example
for (i=0; i<10; i++)
  A[i] = f(B[i]);
(figure: cache contents at i=2 hold B[0..2] and A[0..2]; at i=3, B[3] and A[3] are required)
B[3] was never loaded before → it is loaded into the cache
A[3] was never loaded before → it allocates a new line

59 H.C. TD5102 Capacity miss example
for (i=0; i<N; i++)
  A[i] = B[i+3] + B[i];
Cache: 8 blocks of 1 word, fully associative
(figure: cache contents for i=0 to i=7; B[i+3] is loaded, evicted before it is reused as B[i], and must be loaded again)
Result: 11 compulsory misses (+8 write misses) and 5 capacity misses
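To make the counts concrete, a small simulation sketch of an 8-block fully associative cache; LRU replacement, write-allocate and the base addresses of A[] and B[] are assumptions (the slide does not state them):

  #include <stdio.h>

  #define LINES 8            /* fully associative, 8 one-word blocks */
  #define NONE  -1

  static long tags[LINES];   /* word addresses currently cached */
  static long last_use[LINES];
  static long now = 0, misses = 0;

  /* access one word; on a miss, insert it using LRU replacement */
  static void access(long addr) {
    int i, victim = 0;
    now++;
    for (i = 0; i < LINES; i++)
      if (tags[i] == addr) { last_use[i] = now; return; }   /* hit */
    misses++;
    for (i = 1; i < LINES; i++)                              /* pick LRU victim */
      if (last_use[i] < last_use[victim]) victim = i;
    tags[victim] = addr; last_use[victim] = now;
  }

  int main(void) {
    long A = 100, B = 200;   /* assumed disjoint base addresses for A[] and B[] */
    int i;
    for (i = 0; i < LINES; i++) { tags[i] = NONE; last_use[i] = 0; }
    for (i = 0; i < 8; i++) {          /* the 8 iterations shown in the figure */
      access(B + i + 3);               /* read B[i+3]                    */
      access(B + i);                   /* read B[i]                      */
      access(A + i);                   /* write A[i] (write-allocate)    */
    }
    printf("misses: %ld\n", misses);
    return 0;
  }

Under these assumptions every one of the 24 accesses misses: the 11 compulsory read misses and 8 write misses plus the 5 capacity misses from the slide.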

60 H.C. TD5102 Conflict miss example
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i] + B[i][j];
(figure: A[0..3] and B[i][j] are placed at consecutive main-memory addresses and mapped onto a direct-mapped cache of 8 lines; the figure distinguishes even and odd j)
For part of the j values, A[i] and B[i][j] map to the same cache line, so the repeatedly read A[i] is flushed in favour of B[i][j] and must be reloaded → conflict misses.
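To see where such collisions come from, a minimal sketch that only computes which line of an 8-line direct-mapped cache each word address falls into; the base address 0 for A, base address 4 for B, and column-wise placement of B are assumptions for illustration, not taken from the slide:

  #include <stdio.h>

  #define CACHE_LINES 8   /* direct mapped, 1 word per line */

  int main(void) {
    int i, j;
    for (j = 0; j < 10; j++)
      for (i = 0; i < 4; i++) {
        int addr_A = 0 + i;            /* assumed: A[0..3] at addresses 0..3  */
        int addr_B = 4 + 4*j + i;      /* assumed: B[i][j] stored column-wise */
        printf("j=%d i=%d: A -> line %d, B -> line %d%s\n",
               j, i, addr_A % CACHE_LINES, addr_B % CACHE_LINES,
               (addr_A % CACHE_LINES == addr_B % CACHE_LINES) ? "  (collision)" : "");
      }
    return 0;
  }

With this layout the two addresses fall onto the same line for every odd j, so A[i] is evicted and reloaded on each such iteration.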

61 H.C. TD5102 “Three C's” vs cache size [Gee93]

62 Data layout may reduce cache misses

63 H.C. TD5102 Example 1: Capacity & compulsory miss reduction
for (i=0; i<N; i++)
  A[i] = B[i+3] + B[i];
(figure: the same cache traces as on slide 59)
Starting point: 11 compulsory misses (+8 write misses) and 5 capacity misses

64 H.C. TD5102 Fit data in the cache with in-place mapping
for (i=0; i<12; i++)
  A[i] = B[i+3] + B[i];
Traditional analysis: max = 27 words
Detailed analysis: max = 15 words
(figure: number of live words of A[] and B[] over i; after in-place mapping the merged array AB[new] fits in a main memory / cache of 16 words)

65 H.C. TD5102 Remove capacity / compulsory misses with in-place mapping
for (i=0; i<N; i++)
  AB[i] = AB[i+3] + AB[i];
(figure: cache contents for i=0 to i=7 when A[] and B[] are merged into one array AB[])
Result: 11 compulsory misses, 5 cache hits (+8 write hits)

66 H.C. TD5102 Example 2: Conflict miss reduction
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i] + B[i][j];
(figure: the same main-memory and cache layout as on slide 60 — for part of the j values the repeatedly read A[i] is flushed in favour of B[i][j], causing conflict misses)

67 H.C. TD5102 Avoid conflict misses with main-memory data layout
for (j=0; j<10; j++)
  for (i=0; i<4; i++)
    A[i] = A[i] + B[i][j];
(figure, © imec 2001: by leaving a gap in main memory between the parts of B[], A[i] and B[i][j] no longer map to the same cache line, so A[i] stays in the cache and no conflict misses occur, for any j)
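In plain C, one way to express such a gap is to give each 4-word column of B a full 8-word stride, so every B element lands in cache lines 4-7 while A stays in lines 0-3. The concrete offsets below are my reading of the figure, so treat them as an assumption:

  #define CACHE_LINES 8                     /* direct mapped, 1 word per line      */

  int memory[4 + 10 * CACHE_LINES];         /* one backing array for A, B and gaps */
  #define A(i)    memory[(i)]                            /* address i      -> line i   */
  #define B(i, j) memory[4 + (j) * CACHE_LINES + (i)]    /* address 4+8j+i -> line 4+i */

  /* the loop nest from the slide, now free of A/B conflict misses */
  void accumulate(void) {
    int i, j;
    for (j = 0; j < 10; j++)
      for (i = 0; i < 4; i++)
        A(i) = A(i) + B(i, j);
  }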

68 H.C. TD5102 Data Layout Organization for Direct Mapped Caches

69 H.C. TD5102 Conclusion on Data Management
In multimedia applications, data transfer and storage issues should be explored at the source-code level
DMM method:
–Reduces the number of external memory accesses
–Reduces the external memory size
–Trades off internal memory complexity against speed
–Platform-independent high-level transformations
–Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
–Substantial energy reduction

