Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Data Management Part c: SCBD, MAA, and Data Layout
Part 3 overview
  Recap on design flow
  Platform dependent steps
    – SCBD: Storage Cycle Budget Distribution
    – MAA: Memory Allocation and Assignment
    – Data layout techniques for RAM
    – Data layout techniques for Caches
  Results
  Conclusions
  Thanks to the IMEC DTSE people
[Figure: design flow. Boxes: Concurrent OO spec; Remove OO overhead; Dynamic memory mgmt; Task concurrency mgmt; Physical memory mgmt; Address optimization; SW/HW co-design; SW design flow; HW design flow]
DM steps
  C-in
  Preprocessing
  Dataflow transformations
  Loop transformations
  Data reuse
  Memory hierarchy layer assignment
  Cycle budget distribution
  Memory allocation and assignment
  Data layout
  C-out
  Address optimization
  (A "Today" marker on the slide highlights the steps covered in this part.)
Result of Memory hierarchy assignment for cavity detection
  [Figure: arrays image_in, gauss_x, gauss_xy, comp_edge, and image_out with their copies placed over three layers: L2 (1MB SDRAM), L1 (16KB Cache), L0 (128 B RegFile). Copy sizes and transfer counts on the slide include N*M, M*3, 3*3, 3*1, 1*1, N*M*3, and N*M*8]
Data-reuse - cavity detection code
Code after reuse transformation (partly); "…" marks an upper bound on x that is missing from this transcript:
  for (y=0; y<M+3; ++y) {
    for (x=0; x<N+2; ++x) {
      /* first in_pixels initialized */
      if (x==0 && y>=1 && y<=M-2)
        in_pixels[x%3] = image_in[x][y];
      /* copy rest of in_pixels in row */
      if (x>=0 && x<… && y>=1 && y<=M-2)
        in_pixels[(x+1)%3] = image_in[x+1][y];
      if (x>=1 && x<… && y>=1 && y<=M-2) {
        gauss_x_tmp = 0;
        for (k=-1; k<=1; ++k)  // 3x1 filter
          gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
        gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
      } else if (x<N && y<M)
        gauss_x_lines[x][y%3] = 0;
      …
Storage Cycle Budget Distribution & Memory Allocation and Assignment
Define the memory organization that provides enough bandwidth at minimal cost
Balancing memory bandwidth
  Reduce max. number of loads/stores per cycle
  [Figure: required memory bandwidth versus time for two schedules, one with a high peak and one with a low peak]
Data management approach
  [Figure: one of the many possible schedules]
  Idea: find a schedule which
    – fits in the number of cycles (= budget)
    – reduces the number of ports
    – avoids multi-ported memories
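The port requirement of a candidate schedule can be estimated by counting how many accesses land in the same cycle. The sketch below is only an illustration: the `struct access` type, the function name `peak_accesses`, and the 64-cycle limit are assumptions made here, not part of the SCBD tooling.

```c
#include <assert.h>
#include <stddef.h>

/* One memory access placed in a cycle of the schedule (illustrative type). */
struct access {
    int cycle;   /* time slot in which the access is issued */
    char array;  /* accessed array, e.g. 'A' */
};

/* Peak number of accesses issued in any single cycle. This is a lower bound
   on the total number of memory ports the schedule needs.
   Assumption of this sketch: the cycle budget is at most 64. */
static int peak_accesses(const struct access *sched, size_t n, int budget)
{
    assert(budget > 0 && budget <= 64);
    int per_cycle[64] = {0};
    int peak = 0;
    for (size_t k = 0; k < n; k++) {
        assert(sched[k].cycle >= 0 && sched[k].cycle < budget);
        if (++per_cycle[sched[k].cycle] > peak)
            peak = per_cycle[sched[k].cycle];
    }
    return peak;
}
```

With a budget of four cycles, a schedule that issues three of four accesses in cycle 0 has a peak of 3, while spreading one access per cycle brings the peak down to 1.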
Data management approach; details
Conflict cost calculation
  Key issues:
    – Number of conflicts
    – Self conflicts
    – Chromatic number of the conflict graph (at least the size of the maximum clique)
Self conflict => dual-port memory
  [Figure: conflict graph before and after "Reschedule"]
Chromatic number => minimum # single-port memories
  [Figure: conflict graph before and after "Reschedule"]
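Assigning arrays to single-port memories corresponds to coloring the conflict graph: two arrays that conflict must not share a memory. The sketch below uses a simple greedy coloring, which is an assumption of this example and not necessarily the method used by the MAA tools; greedy coloring yields an upper bound on the chromatic number, not always the minimum. The names `assign_memories`, `conflict`, and `MAX_ARRAYS` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ARRAYS 16

/* conflict[a][b] is true when arrays a and b are accessed in the same cycle
   and therefore cannot share a single-port memory.
   Writes the chosen memory index of each array to memory_of[] and returns
   the number of memories used (greedy, so an upper bound on the minimum). */
static int assign_memories(int n, bool conflict[MAX_ARRAYS][MAX_ARRAYS],
                           int memory_of[MAX_ARRAYS])
{
    assert(n >= 0 && n <= MAX_ARRAYS);
    int used = 0;
    for (int a = 0; a < n; a++) {
        /* A self conflict cannot be solved with single-port memories. */
        assert(!conflict[a][a]);
        bool taken[MAX_ARRAYS] = {false};
        for (int b = 0; b < a; b++)
            if (conflict[a][b])
                taken[memory_of[b]] = true;
        int m = 0;
        while (taken[m])   /* lowest memory not used by a conflicting array */
            m++;
        memory_of[a] = m;
        if (m + 1 > used)
            used = m + 1;
    }
    return used;
}
```

For three arrays where A conflicts with both B and C, but B and C never conflict, two memories suffice: B and C share one. Adding a B-C conflict forms a clique of size three and forces three memories.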
Lower number of conflicts => larger assignment freedom
  [Figure: conflict graph before and after "Reschedule"]
Conflict Directed Ordering is used to find a good schedule
  – Constructive algorithm
  – Reduce intervals until all conflicts known
  – Driven by cost of conflicts
  [Figure: read and write accesses R(A), W(A), R(B), W(B), R(C), W(C), R(D), W(D) to be placed in time slots]
Local optimization is not good for global optimization
Budget distribution has large impact on memory cost
Decreasing basic block length until target cycle budget is met
What's the effect of merging loops?
  More scheduling freedom!
  [Figure: schedule before and after "Reschedule"]
Memory allocation and assignment
Memory Allocation and Assignment Substeps
  – Memory Allocation (= select number and type of memories)
  – Array-to-memory Assignment
  – Port Assignment
  – Bus Sharing
  [Figure: arrays A, B, C, D assigned to memories 1, 2, 3]
Influence of MAA
  – Bit width
  – Address range
  – Nr. memories
  – Nr. ports
  – Assign arrays to memory
  – Memory interconnect
  Minimize power & area
  [Figure: MEMORY-1 (arrays A, B) … MEMORY-N (arrays K, L), each characterized by bitwidth (maximum), size, and nr. ports (R/W/RW)]
Example of bus sharing possibilities
  Given schedule: R(A)R(B), R(B)W(A), W(C)R(A), R(A)W(B), W(A)W(B), W(A)W(C)
  [Figure: three alternative bus configurations connecting arrays A, B, C to memories m1, m2, m3]
Decreasing cycle budget limits freedom and raises cost
Example: Resulting Pareto curve for DAB synchro application
  [Figure: Pareto curve; axis label: energy cost]
Example conflict graph for cavity detection
MAA result Power: On-chip area:
Data layout: how to put data into memory
Memory data layout for custom and cache architectures
  [Figure: a PE with a CACHE holding copies A' and B', and two memories: MEM1 with arrays A, B, C and MEM2 with arrays F, G, H. One panel shows the placement as open ("?"), the other shows a chosen placement]
Intra-array in-place mapping reduces size of one array
  for (i=1; i<5; i++)
    for (j=0; j<5; j++)
      a[i][j] = f(a[i-1][j]);
  [Figure: window covering rows i-1 and i of a[i][j]; occupied memory addresses versus time, with the window marked as the max nr. of live elements]
  The window size depends on the layout! Compare e.g. row-major and column-major ordering.
Two-phase mapping of array elements onto addresses
  1. Storage order: array domains (A, B, C) => abstract addresses ($a_A$, $a_B$, $a_C$)
  2. Allocation: abstract addresses => real addresses $a$
Exploration of storage orders for 2-dimensional array: 8 options
  Variable domain ($a_1 \in \{0,1\}$, $a_2 \in \{0,1,2\}$) => memory address $a$:
    $a = 3a_1 + a_2$
    $a = 3(1-a_1) + a_2$
    $a = 3a_1 + (2-a_2)$
    $a = 3(1-a_1) + (2-a_2)$
    $a = 2a_2 + a_1$
    $a = 2a_2 + (1-a_1)$
    $a = 2(2-a_2) + a_1$
    $a = 2(2-a_2) + (1-a_1)$
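The eight options follow from three binary choices: which dimension is the outer one, and whether each index runs forward or backward. A minimal sketch, assuming a 2x3 array and a hypothetical 3-bit `order` encoding that is introduced here only for illustration:

```c
#include <assert.h>

#define N1 2  /* a1 ranges over 0..1 */
#define N2 3  /* a2 ranges over 0..2 */

/* order bits (encoding assumed for this sketch):
   bit 0 = reverse a1, bit 1 = reverse a2, bit 2 = a2 is the outer dimension */
static int storage_address(int order, int a1, int a2)
{
    assert(order >= 0 && order < 8);
    assert(a1 >= 0 && a1 < N1 && a2 >= 0 && a2 < N2);
    int t1 = (order & 1) ? (N1 - 1 - a1) : a1;
    int t2 = (order & 2) ? (N2 - 1 - a2) : a2;
    return (order & 4) ? N1 * t2 + t1   /* a = 2*a2' + a1' */
                       : N2 * t1 + t2;  /* a = 3*a1' + a2' */
}

/* Returns 1 when the order maps the 2x3 domain one-to-one onto 0..5. */
static int is_valid_order(int order)
{
    int seen[N1 * N2] = {0};
    for (int a1 = 0; a1 < N1; a1++)
        for (int a2 = 0; a2 < N2; a2++) {
            int a = storage_address(order, a1, a2);
            if (a < 0 || a >= N1 * N2 || seen[a]++)
                return 0;
        }
    return 1;
}
```

Every one of the eight orders is a valid one-to-one mapping; they differ only in which elements end up adjacent in memory, which is what later determines the window size.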
Chosen storage order determines window size
  for (i=1; i<5; i++)
    for (j=0; j<5; j++)
      a[i][j] = f(a[i-1][j]);
  Row-major ordering: a=5i+j
    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[5*i+j] = f(a[5*i+j-5]);
    Highest live address: 5*i+j
    Lowest live address: 5*i+j-5
    Difference + 1 = Window: 6
  Column-major ordering: a=5j+i
    for (i=1; i<5; i++)
      for (j=0; j<5; j++)
        a[5*j+i] = f(a[5*j+i-1]);
    Highest live address: 5*4+i-1
    Lowest live address: 5*0+i-1
    Difference + 1 = Window: 21
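The two window sizes can be checked by brute force. The sketch below assumes a simple liveness model for this loop nest: at iteration (i, j) the elements a[i-1][j..4] are still to be read (or being read) and a[i][0..j] have been written (or are being written). The function names and the `storage_order` function-pointer type are introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 5
#define COLS 5

typedef int (*storage_order)(int i, int j);

static int row_major(int i, int j)    { return COLS * i + j; }
static int column_major(int i, int j) { return ROWS * j + i; }

/* Largest span (highest live address - lowest live address + 1) seen over
   all iterations of:  for i in 1..4, for j in 0..4: a[i][j] = f(a[i-1][j]) */
static int window_size(storage_order addr)
{
    assert(addr != NULL);
    int window = 0;
    for (int i = 1; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            int lo = addr(i, j), hi = addr(i, j);
            for (int jj = j; jj < COLS; jj++) {   /* row i-1, not yet consumed */
                int a = addr(i - 1, jj);
                if (a < lo) lo = a;
                if (a > hi) hi = a;
            }
            for (int jj = 0; jj <= j; jj++) {     /* row i, already produced */
                int a = addr(i, jj);
                if (a < lo) lo = a;
                if (a > hi) hi = a;
            }
            if (hi - lo + 1 > window)
                window = hi - lo + 1;
        }
    }
    return window;
}
```

Under this model, `window_size(row_major)` gives 6 and `window_size(column_major)` gives 21, matching the values derived on the slide.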
Static allocation: no in-place mapping
  [Figure: address versus time; arrays A, B, C, D, E each keep their own address range ($a_A$ … $a_E$) for the whole time; total memory size indicated]
Windowed Allocation: intra-array in-place mapping
  [Figure: memory size for arrays A to E, each reduced to its window (e.g. $W_A$); two panels: "Static, windowed" and "Dynamic, windowed"]
Dynamic allocation: inter-array in-place mapping
  [Figure: address versus time; arrays A, B, C, D, E with address ranges $a_A$ … $a_E$ placed so that arrays with disjoint lifetimes share addresses; total memory size indicated]
Dynamic allocation strategy with common window
  [Figure: memory size for arrays A to E; panel "Dynamic, common window"]
Expressing memory data layout in source code
  Example: array of 10x20 elements
    A: offset 120, no window
    B: storage order [20, 2], offset 134, window 78
  Before:
    bit8 B[10][20];
    bit6 A[30];
    for (x=0; x<10; ++x)
      for (y=0; y<20; ++y) {
        … = A[3*x-y];
        B[x][y] = …;
      }
  After:
    bit8 memory[334];
    bit8* B = (bit8*)&memory[134];
    bit6* A = (bit6*)&memory[120];
    for (x=0; x<10; ++x)
      for (y=0; y<20; ++y) {
        … = A[3*x-y];
        B[(x*20+y*2)%78] = …;
      }
Example of memory data layout for storage size reduction
  int x[W], y[W];
  for (i1=0; i1 < W; i1++)
    x[i1] = getInput();
  for (i2=0; i2 < W; i2++) {
    sum = 0;
    for (di2=-N; di2 <= N; di2++) {
      sum += c[N+di2] * x[wrap(i2+di2,W)];
    }
    y[i2] = sum;
  }
  for (i3=0; i3 < W; i3++)
    putOutput(y[i3]);
Occupied address-time domain of x[] and y[]
Optimized source code after memory data layout
  int mem1[N+W];
  for (i1=0; i1 < W; i1++)
    mem1[N+i1] = getInput();
  for (i2=0; i2 < W; i2++) {
    sum = 0;
    for (di2=-N; di2 <= N; di2++) {
      sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
    }
    mem1[i2] = sum;
  }
  for (i3=0; i3 < W; i3++)
    putOutput(mem1[i3]);
Optimized occupied address-time (OAT) domain after memory data layout
In-place mapping for cavity detection example
  Input image is partly consumed by the time the first results for the output image are ready
  [Figure: index versus time for Image_in and Image_out, and address versus time for the shared array Image]
In-place - cavity detection code
  Separate input and output arrays:
    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image_out[x-5][y-3] = …;
        /* code removed */
        … = image_in[x+1][y];
      }
    }
  In place, one shared array:
    for (y=0; y<=M+3; ++y) {
      for (x=0; x<N+5; ++x) {
        image[x-5][y-3] = …;
        /* code removed */
        … = image[x+1][y];
      }
    }
Cavity detection summary
  Overall result:
    – Local accesses reduced by factor 3
    – Memory size reduced by factor 5
    – Power reduced by factor 5
    – System bus load reduced by factor 12
    – Performance worsened by factor 6
The last step: ADOPT (Address OPTimization)
  Increased execution time introduced by DTSE
    – Complicated address arithmetic (modulo: a%b)
    – Additional complex control flow
  Additional transformations needed to
    – Simplify control flow
    – Simplify address arithmetic: common subexpression elimination, modulo expansion, …
    – Match remaining expressions on target machine
ADOPT principles
  How to avoid % in address expressions, like
    int A[7];
    for (i=0; i<…; i++)
      … A[i % 7]
  Option 1: increase buffer size to a power of 2 and use a bitwise AND
    i % 8 => i & 0x07
  Option 2: use an if-statement on a separate wrap-around counter
    int A[7];
    for (i=0,j=0; i<…; i++,j++) {
      if (j==7) j=0;
      … A[j]
    }
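Both ways of removing the modulo can be checked against the plain `%` version. The following is a minimal sketch, assuming the bitwise AND operator for the power-of-two case and a counter that wraps at the buffer length 7; the function names are hypothetical and the loop body (a running sum) is only a stand-in for the real computation.

```c
#include <assert.h>

#define BUF 7  /* logical buffer length, as in A[7] */

/* Reference: index computed with the modulo operator. */
static long sum_modulo(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i % BUF];
    return s;
}

/* Wrap-around counter: j tracks i % BUF without a division. */
static long sum_counter(const int *a, int n)
{
    long s = 0;
    for (int i = 0, j = 0; i < n; i++, j++) {
        if (j == BUF)  /* wrap before the element is used */
            j = 0;
        s += a[j];
    }
    return s;
}

/* Power-of-two buffer: i % 8 equals i & 0x07 for non-negative i. */
static int index_pow2(int i)
{
    assert(i >= 0);
    return i & 0x07;
}
```

The counter version keeps the buffer at 7 elements at the price of a compare per iteration; the mask version is cheaper per access but requires growing the buffer to 8 elements.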
ADOPT principles: CSE
  Example: Full-search Motion Estimation - applying Common Subexpression Elimination (CSE)
  Algebraic transformations at word-level
  ("…" marks constants that are missing from this transcript.)
  Before:
    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++)
          A[((208+i)*257+8+j)*… i+k] = B[(8+j)*… i+k];
      }
      dist += A[3096] - B[((208+i)*257+4)*… i-4];
    }
  After:
    for (i=-8; i<=8; i++) {
      for (j=-4; j<=3; j++) {
        for (k=-4; k<=3; k++) {
          B[…] = A[…];
        }
      }
      dist += A[…] - B[…];
    }
    with: cse1 = (33025*i …)*2; cse3 = 1040+i; cse4 = j*…; cse5 = k+cse4;
    index expressions shown on the slide: cse5+cse1, cse5+cse…, cse1
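The principle behind CSE on address expressions is that the parts of an index that do not change in an inner loop are computed once, outside that loop. The sketch below illustrates this on a simplified block access. The row length `WIDTH`, the offsets 16 and 8, and the function names are hypothetical choices for this example and are not the constants of the motion-estimation code on the slide.

```c
#include <assert.h>

#define WIDTH 32  /* hypothetical row length, not taken from the slide */

/* Naive version: the full address expression is evaluated for every (j, k). */
static long sum_block_naive(const int *frame, int i)
{
    assert(i >= -8 && i <= 8);
    long s = 0;
    for (int j = -4; j <= 3; j++)
        for (int k = -4; k <= 3; k++)
            s += frame[(16 + i + j) * WIDTH + 8 + k];
    return s;
}

/* After CSE: loop-invariant parts of the address are hoisted. */
static long sum_block_cse(const int *frame, int i)
{
    assert(i >= -8 && i <= 8);
    long s = 0;
    const int cse1 = (16 + i) * WIDTH + 8;   /* invariant in j and k */
    for (int j = -4; j <= 3; j++) {
        const int cse2 = cse1 + j * WIDTH;   /* invariant in k */
        for (int k = -4; k <= 3; k++)
            s += frame[cse2 + k];            /* one addition per access */
    }
    return s;
}
```

Both functions read exactly the same elements, since (16+i+j)*WIDTH + 8 + k equals (16+i)*WIDTH + 8 + j*WIDTH + k; the second form leaves a single addition in the innermost loop.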
Conclusion on Data Management
  In multi-media applications, exploring data transfer and storage issues should be done at source code level
  DMM method
    – Reducing number of external memory accesses
    – Reducing external memory size
    – Trade-offs between internal memory complexity and speed
    – Platform independent high-level transformations
    – Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
    – Substantial energy reduction