Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Data Management Part c: SCBD, MAA, and Data Layout.


1 Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Data Management Part c: SCBD, MAA, and Data Layout

2 Part 3 overview
Recap on design flow
Platform dependent steps
– SCBD: Storage Cycle Budget Distribution
– MAA: Memory Allocation and Assignment
– Data layout techniques for RAM
– Data layout techniques for Caches
Results
Conclusions
Thanks to the IMEC DTSE people

3 [Figure: design flow with the stages Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization, SW/HW co-design, SW design flow, HW design flow]

4 DM steps: C-in, Preprocessing, Dataflow transformations, Loop transformations, Data reuse, Memory hierarchy layer assignment, Cycle budget distribution, Memory allocation and assignment, Data layout, C-out, Address optimization. The slide marks the part covered today.

5 Result of memory hierarchy assignment for cavity detection
[Figure: arrays image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three layers: L2 (1MB SDRAM), L1 (16KB cache), L0 (128 B register file); transfer counts are annotated in terms of N*M]

6 Data-reuse - cavity detection code
Code after reuse transformation (partly):
  for (y=0; y<M+3; ++y) {
    for (x=0; x<N+2; ++x) {
      /* first in_pixels initialized */
      if (x==0 && y>=1 && y<=M-2)
        in_pixels[x%3] = image_in[x][y];
      /* copy rest of in_pixels in row
         (the upper bound on x is garbled in the transcript;
          x<=N-2 is assumed here by symmetry with y<=M-2) */
      if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
        in_pixels[(x+1)%3] = image_in[x+1][y];
      if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k) // 3x1 filter
          gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
        gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
      } else if (x<N && y<M)
        gauss_x_lines[x][y%3] = 0;

7 Storage Cycle Budget Distribution & Memory Allocation and Assignment

8 Define the memory organization which can provide enough bandwidth with minimal cost

9 Balancing memory bandwidth
Reduce the maximum number of loads/stores per cycle.
[Figure: two plots of required memory bandwidth over time, one with a high peak and one with a low peak]

10 Data management approach
One of the many possible schedules
Idea: find a schedule which
– fits in the number of cycles (= budget)
– reduces the number of ports
– avoids multi-ported memories
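As a rough illustration of the idea on this slide, the sketch below counts how many memory accesses a candidate schedule issues in its busiest cycle, which is the number of ports that schedule would need. The function name `peak_accesses`, the `cycle_of` array and the 64-cycle limit are assumptions made for this example and do not come from the course material.

```c
#include <assert.h>
#include <stddef.h>

/* Peak number of memory accesses issued in any single cycle.
 * cycle_of[i] is the (hypothetical) cycle assigned to access i.
 * Returns -1 if the budget is unsupported or an access falls
 * outside the cycle budget. */
int peak_accesses(const int *cycle_of, size_t n_accesses, int budget)
{
    int count[64] = {0};   /* sketch only: budgets of up to 64 cycles */
    int peak = 0;

    assert(cycle_of != NULL);
    if (budget < 1 || budget > 64)
        return -1;
    for (size_t i = 0; i < n_accesses; ++i) {
        int c = cycle_of[i];
        if (c < 0 || c >= budget)
            return -1;     /* schedule does not fit the budget */
        if (++count[c] > peak)
            peak = count[c];
    }
    return peak;
}
```

Two schedules with the same accesses and the same budget can give different peaks; the one with the lower peak needs fewer ports.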

11 Data management approach; details

12 Conflict cost calculation
Key issues:
– Number of conflicts
– Self conflicts
– Chromatic number (at least the size of the maximum clique)

13 Self conflict → dual port memory
Reschedule

14 Chromatic number → minimum # single port memories
Reschedule
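The link between colouring the conflict graph and counting single-port memories can be sketched as follows. This is a plain greedy colouring, swapped in for illustration only: it gives an upper bound on the chromatic number, not necessarily the minimum, and it is not the SCBD/MAA tool's algorithm. The names `assign_memories`, `conflict` and `memory_of` are hypothetical.

```c
#include <assert.h>

#define MAX_ARRAYS 16

/* Greedy colouring of a conflict graph.
 * conflict[i][j] != 0 means arrays i and j are accessed in the same
 * cycle, so they cannot share one single-port memory.
 * memory_of[i] receives the memory index chosen for array i.
 * Returns the number of memories used. */
int assign_memories(int n, int conflict[MAX_ARRAYS][MAX_ARRAYS],
                    int memory_of[MAX_ARRAYS])
{
    int used = 0;

    assert(n >= 0 && n <= MAX_ARRAYS);
    for (int i = 0; i < n; ++i) {
        int taken[MAX_ARRAYS] = {0};
        int m = 0;

        /* A self conflict needs a dual-port memory or a reschedule;
         * colouring alone cannot resolve it. */
        assert(!conflict[i][i]);

        /* Mark memories already holding a conflicting array. */
        for (int j = 0; j < i; ++j)
            if (conflict[i][j])
                taken[memory_of[j]] = 1;

        /* Pick the lowest-numbered free memory. */
        while (taken[m])
            ++m;
        memory_of[i] = m;
        if (m + 1 > used)
            used = m + 1;
    }
    return used;
}
```

With three mutually conflicting arrays the function has to open three memories, which matches the clique argument on the conflict cost slide.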

15 Lower number of conflicts → larger assignment freedom
Reschedule

16 Conflict Directed Ordering is used to find a good schedule
– Reduce intervals until all conflicts are known
– Driven by cost of conflicts
– Constructive algorithm
[Figure: accesses R(A), W(A), R(B), W(B), R(C), W(C), R(D), W(D) placed in time slots 1 to 6]

17 Local optimization is not good for global optimization

18 Budget distribution has large impact on memory cost

19 Decreasing basic block length until target cycle budget is met

20 What's the effect of merging loops? More scheduling freedom!
Reschedule

21 Memory allocation and assignment

22 Memory Allocation and Assignment substeps
– Memory allocation (allocation = select number and type of memories)
– Array-to-memory assignment
– Port assignment
– Bus sharing
[Figure: arrays A, B, C, D assigned to memories 1, 2, 3]

23 Influence of MAA
Decisions: bit width, address range, nr. of memories, nr. of ports, assignment of arrays to memories, memory interconnect
Goal: minimize power & area
[Figure: memories MEMORY-1 (arrays A, B) to MEMORY-N (arrays K, L), each characterized by bitwidth (maximum), size and nr. of ports (R/W/RW); example words of different bit widths with unused bits marked X]

24 Example of bus sharing possibilities
[Figure: a given schedule with the access pairs R(A) R(B), R(B) W(A), W(C) R(A), R(A) W(B), W(A) W(B), W(A) W(C), and three alternative ways to connect arrays A, B, C to m1, m2, m3]

25 Decreasing cycle budget limits freedom and raises cost

26 Example: Resulting Pareto curve for DAB synchro application
[Figure: Pareto curve; axis label: energy cost]

27 Example conflict graph for cavity detection

28 MAA result
[Figure: power and on-chip area]

29 Data layout: how to put data into memory

30 Memory data layout for custom and cache architectures
[Figure: a PE with a CACHE holding copies A' and B', and memories MEM1 and MEM2 holding arrays A, B, C, F, G, H; the placement of each array is first open (marked ?) and then fixed]

31 Intra-array in-place mapping reduces size of one array
  for (i=1; i<5; i++)
    for (j=0; j<5; j++)
      a[i][j] = f(a[i-1][j]);
[Figure: window over rows i-1 and i of a; plot of memory addresses against time showing the max. nr. of live elements]
This number depends on the layout! Compare e.g. row major and column major ordering.

32 Two-phase mapping of array elements onto addresses
[Figure: array domains A, B, C are mapped by a storage order onto abstract addresses a_A, a_B, a_C, which are then mapped by allocation onto real addresses a]

33 Exploration of storage orders for 2-dimensional array: 8 options
Variable domain: (a1, a2); memory address: a
a = 3*a1 + a2
a = 3*(1-a1) + a2
a = 3*a1 + (2-a2)
a = 3*(1-a1) + (2-a2)
a = 2*a2 + a1
a = 2*a2 + (1-a1)
a = 2*(2-a2) + a1
a = 2*(2-a2) + (1-a1)
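The eight options on this slide come from three binary choices for a 2x3 array: which dimension varies slowest, and whether each index is traversed forwards or backwards. The sketch below enumerates them and checks that each one maps the array one-to-one onto addresses 0 to 5. The encoding of the choices as bits of `order`, and the function names, are assumptions made for this example.

```c
#include <assert.h>

enum { D1 = 2, D2 = 3 };   /* extents: a1 in 0..1, a2 in 0..2 */

/* order bit 0: traverse a1 in reverse
 * order bit 1: traverse a2 in reverse
 * order bit 2: a2 is the slow dimension instead of a1 */
int storage_address(int order, int a1, int a2)
{
    assert(order >= 0 && order < 8);
    assert(a1 >= 0 && a1 < D1 && a2 >= 0 && a2 < D2);

    int i1 = (order & 1) ? (D1 - 1 - a1) : a1;
    int i2 = (order & 2) ? (D2 - 1 - a2) : a2;

    return (order & 4) ? D1 * i2 + i1    /* a = 2*a2' + a1' */
                       : D2 * i1 + i2;   /* a = 3*a1' + a2' */
}

/* Returns 1 if the order maps the array one-to-one onto 0..5. */
int is_bijection(int order)
{
    int seen[D1 * D2] = {0};

    for (int a1 = 0; a1 < D1; ++a1)
        for (int a2 = 0; a2 < D2; ++a2) {
            int a = storage_address(order, a1, a2);
            if (a < 0 || a >= D1 * D2 || seen[a])
                return 0;
            seen[a] = 1;
        }
    return 1;
}
```

For instance, `order` 0 is a = 3*a1 + a2 and `order` 7 is a = 2*(2-a2) + (1-a1).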

34 Chosen storage order determines window size
Original code:
  for (i=1; i<5; i++)
    for (j=0; j<5; j++)
      a[i][j] = f(a[i-1][j]);
Row-major ordering: a = 5i+j
  for (i=1; i<5; i++)
    for (j=0; j<5; j++)
      a[5*i+j] = f(a[5*i+j-5]);
  Highest live address: 5*i+j
  Lowest live address: 5*i+j-5
  Window = difference + 1 = 6
Column-major ordering: a = 5j+i
  for (i=1; i<5; i++)
    for (j=0; j<5; j++)
      a[5*j+i] = f(a[5*j+i-1]);
  Highest live address: 5*4+i-1
  Lowest live address: 5*0+i-1
  Window = difference + 1 = 21
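To see what the row-major window of 6 buys in practice, the following sketch runs the loop once on a full 25-element array and once with every address folded into a 6-element buffer by a modulo. Both must produce the same last row. The body of `f`, the constants and the function names are placeholders chosen for this example; the slide leaves `f` unspecified.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 5, COLS = 5, WINDOW = 6 };

/* Hypothetical stand-in for the slide's f(). */
static int f(int v) { return 2 * v + 1; }

/* Reference version: full array in row-major order (a = 5i+j). */
void run_full(const int first_row[COLS], int last_row[COLS])
{
    int a[ROWS * COLS];

    assert(first_row != NULL && last_row != NULL);
    for (int j = 0; j < COLS; ++j)
        a[j] = first_row[j];
    for (int i = 1; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            a[COLS * i + j] = f(a[COLS * i + j - COLS]);
    for (int j = 0; j < COLS; ++j)
        last_row[j] = a[COLS * (ROWS - 1) + j];
}

/* In-place version: same addresses, taken modulo the window size. */
void run_windowed(const int first_row[COLS], int last_row[COLS])
{
    int a[WINDOW];

    assert(first_row != NULL && last_row != NULL);
    for (int j = 0; j < COLS; ++j)
        a[j % WINDOW] = first_row[j];
    for (int i = 1; i < ROWS; ++i)
        for (int j = 0; j < COLS; ++j)
            a[(COLS * i + j) % WINDOW] =
                f(a[(COLS * i + j - COLS) % WINDOW]);
    for (int j = 0; j < COLS; ++j)
        last_row[j] = a[(COLS * (ROWS - 1) + j) % WINDOW];
}
```

The folding is safe because an element is read exactly five addresses after it was written, and the four writes in between land on different slots of the 6-element buffer.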

35 Static allocation: no in-place mapping
[Figure: address/time diagram; arrays A to E each occupy their own address range a_A to a_E; total memory size marked]

36 Windowed allocation: intra-array in-place mapping
[Figure: two address/time diagrams for arrays A to E, "Static, windowed" and "Dynamic, windowed", with window W_A and the resulting memory size marked]

37 Dynamic allocation: inter-array in-place mapping
[Figure: address/time diagram; arrays A to E with address ranges a_A to a_E share addresses when their lifetimes do not overlap; resulting memory size marked]
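A back-of-the-envelope comparison of static allocation and inter-array in-place mapping can be sketched as below. It assumes each array is described by a lifetime interval and a size (the `struct lifetime` type and both function names are hypothetical) and computes the sum of all sizes against the largest total size alive at one time. The second number is only a lower bound: an actual address assignment may need more because of fragmentation.

```c
#include <assert.h>
#include <stddef.h>

struct lifetime {
    int start;   /* first time step at which the array is live */
    int end;     /* first time step at which it is dead again  */
    int size;    /* number of words                            */
};

/* Static allocation: every array keeps its own address range. */
int static_size(const struct lifetime *arr, int n)
{
    int total = 0;

    assert(arr != NULL);
    for (int i = 0; i < n; ++i)
        total += arr[i].size;
    return total;
}

/* Lower bound for in-place mapping across arrays: the largest
 * total size of arrays that are live at the same time. */
int peak_live_size(const struct lifetime *arr, int n)
{
    int peak = 0;

    assert(arr != NULL);
    for (int i = 0; i < n; ++i) {
        /* The peak is always reached where some lifetime starts. */
        int t = arr[i].start;
        int live = 0;

        assert(arr[i].start < arr[i].end);
        for (int k = 0; k < n; ++k)
            if (arr[k].start <= t && t < arr[k].end)
                live += arr[k].size;
        if (live > peak)
            peak = live;
    }
    return peak;
}
```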

38 Dynamic allocation strategy with common window
[Figure: address/time diagram labelled "Dynamic, common window"; arrays A to E placed inside one shared window; resulting memory size marked]

39 Expressing memory data layout in source code
Example: array of 10x20 elements
A: offset 120, no window
B: storage order [20, 2], offset 134, window 78
Before:
  bit8 B[10][20];
  bit6 A[30];
  for (x=0; x<10; ++x)
    for (y=0; y<20; ++y) {
      … = A[3*x-y];
      B[x][y] = …;
    }
After:
  bit8 memory[334];
  bit8* B = (bit8*)&memory[134];
  bit6* A = (bit6*)&memory[120];
  for (x=0; x<10; ++x)
    for (y=0; y<20; ++y) {
      … = A[3*x-y];
      B[(x*20+y*2)%78] = …;
    }

40 Example of memory data layout for storage size reduction
  int x[W], y[W];
  for (i1=0; i1 < W; i1++)
    x[i1] = getInput();
  for (i2=0; i2 < W; i2++) {
    sum = 0;
    for (di2=-N; di2 <= N; di2++) {
      sum += c[N+di2] * x[wrap(i2+di2,W)];
    }
    y[i2] = sum;
  }
  for (i3=0; i3 < W; i3++)
    putOutput(y[i3]);

41 Occupied address-time domain of x[] and y[]

42 Optimized source code after memory data layout
  int mem1[N+W];
  for (i1=0; i1 < W; i1++)
    mem1[N+i1] = getInput();
  for (i2=0; i2 < W; i2++) {
    sum = 0;
    for (di2=-N; di2 <= N; di2++) {
      sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
    }
    mem1[i2] = sum;
  }
  for (i3=0; i3 < W; i3++)
    putOutput(mem1[i3]);

43 Optimized OAT domain after memory data layout

44 In-place mapping for cavity detection example
The input image is partly consumed by the time the first results for the output image are ready.
[Figure: index/time plots for Image_in and Image_out, and an address/time plot for the combined Image]

45 In-place - cavity detection code
With separate input and output arrays:
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image_out[x-5][y-3] = …;
      /* code removed */
      … = image_in[x+1][y];
    }
  }
With both mapped in place onto one array:
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image[x-5][y-3] = …;
      /* code removed */
      … = image[x+1][y];
    }
  }

46 Cavity detection summary
Overall result:
– Local accesses reduced by factor 3
– Memory size reduced by factor 5
– Power reduced by factor 5
– System bus load reduced by factor 12
– Performance worsened by factor 6

47 The last step: ADOPT (Address OPTimization)
Increased execution time introduced by DTSE
– Complicated address arithmetic (modulo: a%b)
– Additional complex control flow
Additional transformations needed to
– Simplify control flow
– Simplify address arithmetic: common subexpression elimination, modulo expansion, …
– Match remaining expressions on target machine

48 ADOPT principles
How to avoid % in address expressions, like
  int A[7];
  for (i=0; i<…; i++)
    … A[i % 7]
Option 1: increase the buffer size to a power of 2, so that the modulo becomes a bit mask
  i % 8 => i & 0x07
Option 2: use an if-statement with a separate wrap-around counter
  int A[7];
  for (i=0, j=0; i<…; i++) {
    … A[j]
    if (++j == 7) j = 0;
  }
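The two ways of avoiding the modulo can be checked against the plain `%` form with a few small helpers. This is a minimal sketch; the function names are made up for the example, and the mask variant assumes the buffer has been enlarged from 7 to 8 entries, so it walks the larger buffer rather than reproducing `i % 7`.

```c
#include <assert.h>

/* Original addressing: modulo by the buffer size 7. */
unsigned index_modulo(unsigned i)
{
    return i % 7u;
}

/* Buffer enlarged to 8 entries: the modulo becomes a bit mask.
 * Equivalent to i % 8, not to i % 7. */
unsigned index_mask(unsigned i)
{
    return i & 0x07u;
}

/* Wrap-around counter: advance j by one and reset it at 7,
 * so j follows i % 7 without any division. */
unsigned next_wrapped(unsigned j)
{
    assert(j < 7u);
    ++j;
    if (j == 7u)
        j = 0u;
    return j;
}
```

The mask trades one extra buffer entry for a single AND; the counter keeps the buffer at 7 entries but adds a compare and a branch per iteration.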

49 ADOPT principles: CSE
Example: Full-search Motion Estimation - applying Common Subexpression Elimination (CSE)
Algebraic transformations at word-level
Before:
  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++)
        A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
    }
    dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
  }
After:
  for (i=-8; i<=8; i++) {
    cse1 = (33025*i+6869616)*2;
    cse3 = 1040+i;
    for (j=-4; j<=3; j++) {
      cse4 = j*257+1032;
      for (k=-4; k<=3; k++) {
        cse5 = k+cse4;
        A[cse5+cse1] = B[cse5+cse3];
      }
    }
    dist += A[3096] - B[cse1];
  }
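The cse1, cse3, cse4 and cse5 expressions on this slide can be checked mechanically against the original index expressions. The sketch below does that over the whole iteration space. The function names are invented for this example, and `long` is used only to make sure the intermediate products (around 13.7 million) fit on every platform.

```c
/* Index expressions as written in the original loop nest. */
long a_index_orig(long i, long j, long k)
{
    return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
}

long b_index_orig(long i, long j, long k)
{
    return (8 + j) * 257 + 16 + i + k;
}

long b_dist_index_orig(long i)
{
    return ((208 + i) * 257 + 4) * 257 + 16 + i - 4;
}

/* Returns 1 if the CSE forms give the same indices as the
 * original expressions for every (i, j, k) in the loop nest. */
int cse_matches(void)
{
    for (long i = -8; i <= 8; ++i) {
        long cse1 = (33025 * i + 6869616) * 2;   /* depends on i only */
        long cse3 = 1040 + i;                    /* depends on i only */

        if (b_dist_index_orig(i) != cse1)
            return 0;
        for (long j = -4; j <= 3; ++j) {
            long cse4 = j * 257 + 1032;          /* depends on j only */

            for (long k = -4; k <= 3; ++k) {
                long cse5 = k + cse4;

                if (a_index_orig(i, j, k) != cse5 + cse1)
                    return 0;
                if (b_index_orig(i, j, k) != cse5 + cse3)
                    return 0;
            }
        }
    }
    return 1;
}
```

Each cse value depends on a single loop variable, which is what allows it to be computed once per iteration of that loop instead of once per innermost iteration.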

50 Conclusion on Data Management
In multi-media applications exploring data transfer and storage issues should be done at source code level
DMM method
– Reducing number of external memory accesses
– Reducing external memory size
– Trade-offs between internal memory complexity and speed
– Platform independent high-level transformations
– Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
– Substantial energy reduction

