ASCI Winterschool on Embedded Systems March 2004 Renesse Data Memory Management Henk Corporaal Peter Knijnenburg
ASCI winterschool H.C.-P.K.2
Data Memory Management: overview
–Motivation
–Example application
–DMM steps
–Results
Notes:
–We concentrate on static data memory management.
–The Data Transfer and Storage Exploration (DTSE) methodology, on which these slides are based, was developed by IMEC, Leuven.
The underlying idea
[Figure: system block diagram with VLIW CPU, I$, D$, video-in, video-out, audio-in, audio-out, PCI bridge, serial I/O, timers, I2C I/O and SDRAM. The figure marks a data storage bottleneck and a data transfer bottleneck around the SDRAM and the D$.]
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
Platform architecture model
[Figure: memory hierarchy with levels 1 to 4. On chip: CPUs and HW accelerators with ICache, DCache and local memories, an L2 cache, on-chip busses, a bus interface and a bridge. Off chip: main memory, and disks attached through SCSI and a SCSI bus.]
Data transfer and storage power
Power(memory) / Power(arithmetic) = 33
Positioning in the Y-chart
[Figure: Y-chart. Applications and Architecture Instance feed Mapping, followed by Performance Analysis, which yields Performance Numbers.]
DMM contributes data transfer and data storage specific rewrites in the application code.
Mapping
Given
–architecture SuperDuperXYZ
–reference C code for the application, e.g. MPEG-4 Motion Estimation
Task
–map the application on this architecture
But … wait a moment
  sdcc -o mpeg4_me mpeg4_me.c
  Thank you for running SuperDuperXYZ compiler.
  Your program uses bytes memory, 78 Watt, clock cycles
Let's help the compiler
Application example
Application domain:
–Computer tomography in medical imaging
Algorithm:
–Cavity detection in CT scans
–Detect dark regions in successive images
–Indicate cavity in brain
Bad news for owner of brain
Application
Reference (conceptual) C code for the algorithm
–all functions: image_in[N x M] at t-1 -> image_out[N x M] at t
–new value of a pixel depends on its neighbors
–neighbor pixels are read from background memory
–approximately 110 lines of C code (ignoring file I/O etc.)
–experiments with N x M = 640 x 400 pixels
–straightforward implementation: 6 image buffers
[Figure: processing stages shown: Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots, Max Value]
DMM principles
[Figure: processor data paths, L1 cache, L2 cache, cache banks (bank combine; local latch 1 & bank 1 up to local latch N & bank N), off-chip SDRAM]
–Exploit memory hierarchy
–Exploit limited life-time
–Avoid N-port memories within real-time constraints
–Reduce redundant transfers
–Introduce locality
DMM steps
C-in
–Preprocessing
–Dataflow transformations
–Loop transformations
–Data reuse
–Memory hierarchy layer assignment
–Cycle budget distribution
–Memory allocation and assignment
–Data layout
–Address expression optimization
C-out
The DMM steps
Preprocessing
–Rewrite code in 3 layers (parts)
–Selective inlining, single-assignment form, ...
Data-flow transformations
–Eliminate redundant transfers and storage
Loop and control-flow transformations
–Improve regularity of accesses and data locality
Data re-use and memory hierarchy layer assignment
–Determine when to move which data between memories to meet the cycle budget of the application at low cost
–Determine in which layer to put the arrays (and copies)
The DMM steps (continued)
Per memory layer:
Cycle budget distribution
–Determine memory access constraints for the given cycle budget
Memory allocation and assignment
–Which memories to use, and where to put the arrays
Data layout
–Determine how to combine and place arrays in the memories
Address expression optimizations
Preprocessing: dividing an application into the 3 layers
LAYER1 (Module1a, Module1b, Module2, Module3)
–Synchronisation
–testbench call
–dynamic event behaviour
–mode selection
LAYER2
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      if (i == 0)
        B[i][j] = 1;
      else
        B[i][j] = func1(A[i][j], A[i-1][j]);
LAYER3
  int func1(int a, int b) { return a*b; }
Layered code structure
main() {                          /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}

void cav_detect() {               /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}
Layered code structure
int foo(int arg1) {               /* Layer 3 code */
  /* arithmetic, data-dependent operations
   * to be mapped to data-path, controller */
}

void cav_detect() {               /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}
Layer 2 makes the code for data access and data transfer explicit.
Data-flow trafo - cavity detection
[Figure: N x M image with an (N-2) x (M-2) interior region]
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    gauss_x_image[x][y] = 0;

for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; ++k) {
      gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
    }
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}
#accesses: N * M + (N-2) * (M-2)
Data-flow trafo - cavity detection
[Figure: N x M image with an (N-2) x (M-2) interior region]
for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k) {
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }
#accesses: N * M
The gain is approximately 50 %: every element of gauss_x_image is now written once instead of up to twice.
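The access counts on these two slides can be checked with a small sketch. It models only the writes to gauss_x_image and leaves out the filter arithmetic; the function names writes_separate and writes_merged are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Count writes to gauss_x_image for the two versions of the loop nest.
 * Only the write pattern is modelled; the filter arithmetic is omitted. */

/* Version 1: clear the whole image, then overwrite the interior. */
long writes_separate(int n, int m)
{
    assert(n >= 2 && m >= 2);
    long writes = 0;
    for (int x = 0; x < n; ++x)
        for (int y = 0; y < m; ++y)
            ++writes;                 /* gauss_x_image[x][y] = 0 */
    for (int x = 1; x <= n - 2; ++x)
        for (int y = 1; y <= m - 2; ++y)
            ++writes;                 /* gauss_x_image[x][y] = foo(...) */
    return writes;
}

/* Version 2: a single pass, each element written exactly once. */
long writes_merged(int n, int m)
{
    assert(n >= 2 && m >= 2);
    long writes = 0;
    for (int x = 0; x < n; ++x)
        for (int y = 0; y < m; ++y) {
            if (x >= 1 && x <= n - 2 && y >= 1 && y <= m - 2)
                ++writes;             /* interior: filtered value */
            else
                ++writes;             /* border: zero */
        }
    return writes;
}
```

For the 640 x 400 image used in the experiments, writes_separate(640, 400) counts 509924 writes and writes_merged(640, 400) counts 256000, which is the roughly 50 % reduction quoted on the slide.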
Data-flow transformation
In total 5 types of data-flow transformations:
–advanced signal substitution and propagation
–algebraic transformations (associativity etc.)
–shifting "delay lines"
–re-computation
–transformations to eliminate bottlenecks for subsequent loop transformations
Data-flow transformation - result
Loop transformations
Loop transformations
–improve regularity of accesses
–improve temporal locality: bring production and consumption closer together
Expected influence
–reduce temporary storage and (anticipated) background storage

Before (storage size N):
for (j=1; j<=M; j++)
  for (i=1; i<=N; i++)
    A[i] = foo(A[i]);
for (i=1; i<=N; i++)
  out[i] = A[i];

After (storage size 1):
for (i=1; i<=N; i++) {
  for (j=1; j<=M; j++) {
    A[i] = foo(A[i]);
  }
  out[i] = A[i];
}
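A minimal sketch of why the storage shrinks from N to 1: after interchange and merging, each value is produced and consumed inside one iteration of the i loop, so a scalar is enough. The body of foo, the 0-based indexing and the helper same_result are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for foo() on the slide. */
static int foo(int v) { return 3 * v + 1; }

/* Original order: j outer, i inner. All n intermediate values stay live. */
void run_original(const int *in, int *out, int n, int m)
{
    assert(n > 0 && m >= 0);
    int *a = malloc((size_t)n * sizeof *a);    /* storage size N */
    assert(a != NULL);
    for (int i = 0; i < n; i++)
        a[i] = in[i];
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            a[i] = foo(a[i]);
    for (int i = 0; i < n; i++)
        out[i] = a[i];
    free(a);
}

/* Interchanged and merged: only one intermediate value is live at a time. */
void run_transformed(const int *in, int *out, int n, int m)
{
    assert(n > 0 && m >= 0);
    for (int i = 0; i < n; i++) {
        int a = in[i];                         /* storage size 1 */
        for (int j = 0; j < m; j++)
            a = foo(a);
        out[i] = a;
    }
}

/* Returns 1 when both versions produce the same output. */
int same_result(int n, int m)
{
    assert(n > 0 && n <= 64);
    int in[64], out_a[64], out_b[64];
    for (int i = 0; i < n; i++)
        in[i] = i - 2;
    run_original(in, out_a, n, m);
    run_transformed(in, out_b, n, m);
    for (int i = 0; i < n; i++)
        if (out_a[i] != out_b[i])
            return 0;
    return 1;
}
```

The interchange is legal here because iteration (i, j) depends only on iteration (i, j-1) of the same element.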
Data enters Cavity Detector row-wise
[Figure: scan device, serial scan, buffer (= image_in), Cavity Detector with GaussBlur loop]
Loop trafo - cavity detection
[Figure: Scanner, N x M buffer, Gauss Blur x, N x M buffer, with X and Y scan directions]
X-Y loop interchange: from double buffer to single buffer
Loop trafo - cavity detection
[Figure: Gauss Blur x, Gauss Blur y, Compute Edges, with line buffers of 2GB+1 and 3 lines]
Repeated fold and loop merge
–From N x M to N x (2GB+1) buffer size
–From N x M to N x 3 buffer size (offset arrays)
Loop merging: improve regularity and locality
Before:
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 1st filtering code */
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 2nd filtering code */

After merging the y loops:
for (y=0; y<M; y++) {
  for (x=0; x<N; x++)
    /* 1st filtering code */
  for (x=0; x<N; x++)
    /* 2nd filtering code */
}
!! Impossible due to dependencies!
Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … gauss_x_image[x][y] = …

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … for (k=-GB; k<=GB; k++)
        … = … gauss_x_image[x][y+k] …
Enable merging with loop folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … gauss_x_image[x][y] = …

for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    … y-GB …
    for (k=-GB; k<=GB; k++)
      … gauss_x_image[x][y+k-GB] …
Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++)
  if (y<M)
    for (x=0; x<N; x++)
      … gauss_x_image[x][y] = …
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        … for (k=-GB; k<=GB; k++)
            … gauss_x_image[x][y-GB+k] …
      else
        …
Simplify conditions in merged loop nest
for (y=0; y<M+GB; y++)
  for (x=0; x<N; x++)
    if (y<M)
      … gauss_x_image[x][y] = …
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      … for (k=-GB; k<=GB; k++)
          … gauss_x_image[x][y-GB+k] …
    else if (y>=GB)
      …
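The fold-and-merge sequence can be illustrated on a one-dimensional simplification, dropping the x dimension. This is a sketch under stated assumptions, not the cavity detector itself: stage1 is a hypothetical stand-in for the first filter, the second stage is an unweighted sum, and the helper merged_matches exists only to compare the two versions. Because the consumer is delayed by GB iterations, the merged loop needs only the last 2GB+1 intermediate values, held here in a circular buffer.

```c
#include <assert.h>

#define GB 1                 /* filter half-width, as on the slides */
#define WIN (2 * GB + 1)     /* values that must stay live after merging */
#define MAXM 64

static int stage1(int v) { return 2 * v + 1; }   /* hypothetical 1st filter */

/* Reference: two separate loops with a full-size intermediate buffer. */
void two_loops(const int *in, int *out, int m)
{
    assert(m > 2 * GB && m <= MAXM);
    int g[MAXM];
    for (int y = 0; y < m; y++)
        g[y] = stage1(in[y]);
    for (int y = 0; y < m; y++) {
        out[y] = 0;                               /* border stays zero */
        if (y >= GB && y <= m - 1 - GB)
            for (int k = -GB; k <= GB; k++)
                out[y] += g[y + k];
    }
}

/* Folded and merged: the consumer runs GB iterations behind the producer. */
void merged_loop(const int *in, int *out, int m)
{
    assert(m > 2 * GB && m <= MAXM);
    int g[WIN];                                   /* circular buffer */
    for (int y = 0; y < m + GB; y++) {
        if (y < m)
            g[y % WIN] = stage1(in[y]);           /* produce g[y] */
        if (y >= GB) {
            int p = y - GB;                       /* folded output position */
            out[p] = 0;
            if (p >= GB && p <= m - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    out[p] += g[(p + k) % WIN];   /* consume g[p-GB..p+GB] */
        }
    }
}

/* Returns 1 when both versions produce the same output. */
int merged_matches(int m)
{
    assert(m > 2 * GB && m <= MAXM);
    int in[MAXM], a[MAXM], b[MAXM];
    for (int i = 0; i < m; i++)
        in[i] = i * i - 3 * i;
    two_loops(in, a, m);
    merged_loop(in, b, m);
    for (int i = 0; i < m; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

At iteration y the consumer reads g[y-2GB] through g[y], which are exactly the WIN most recently produced values, so no entry is overwritten before its last use.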
End result of global loop trafo
for (y=0; y<M+GB+2; ++y) {
  for (x=0; x<N+2; ++x) {
    …
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k]
          + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
    …
Loop transformations - result
Data re-use & memory hierarchy
Introduce memory hierarchy
–reduce number of reads from main memory
–heavily accessed arrays stored in smaller memories
[Figure: processor data paths and register file, fed by memories M'' (P = ), M' (P = 0.1) and main memory M (P = 1)]
#A = 100
P (original) = # access x power/access = 100
P (after) = 100 x x x 1 = 3
Data re-use
Data-flow transformations to introduce extra copies of heavily accessed signals
–Step 1: figure out data re-use possibilities
–Step 2: calculate possible gain
–Step 3: decide on data assignment to memory hierarchy

int A[2][6];
for (h=0; h<N; h++)
  for (i=0; i<2; i++)
    for (j=0; j<3; j++)
      for (k=0; k<6; k++)
        B[j] = A[i][k];
[Figure: array index (6 * i + k) plotted against iterations]
Data re-use
Data-flow transformations to introduce extra copies of heavily accessed signals
–Step 1: figure out data re-use possibilities
–Step 2: calculate possible gain
–Step 3: decide on data assignment to memory hierarchy
[Figure: array index against iterations for frame1, frame2, frame3, with copy candidates of size 6*2 and 6*1]
Transfer counts in the copy chain:
–into the 6*2 copy: 1*2*1*6
–into the 6*1 copy: N*2*1*6
–into the CPU: N*2*3*6
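The three transfer counts can be reproduced by instrumenting the loop nest. The sketch below is an illustration only: it indexes k from 0 to 5, and the struct reads, the counters and the function names are assumptions introduced here. It compares reading A directly, re-using a 6-element row copy, and re-using a 12-element copy of the whole array.

```c
#include <assert.h>

/* Read counters: main memory (array A) versus the smaller copy. */
struct reads { long main_mem; long copy; };

static int A[2][6];
static int B[3];

/* No copy: every access goes to main memory. */
struct reads no_copy(int n)
{
    assert(n >= 0);
    struct reads r = {0, 0};
    for (int h = 0; h < n; h++)
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 6; k++) {
                    B[j] = A[i][k];
                    r.main_mem++;
                }
    return r;
}

/* 6*1 copy: one row is fetched once per (h, i) and re-used for all j. */
struct reads row_copy(int n)
{
    assert(n >= 0);
    struct reads r = {0, 0};
    int row[6];
    for (int h = 0; h < n; h++)
        for (int i = 0; i < 2; i++) {
            for (int k = 0; k < 6; k++) {
                row[k] = A[i][k];
                r.main_mem++;
            }
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 6; k++) {
                    B[j] = row[k];
                    r.copy++;
                }
        }
    return r;
}

/* 6*2 copy: the whole array is fetched once and re-used for all h. */
struct reads array_copy(int n)
{
    assert(n >= 0);
    struct reads r = {0, 0};
    int all[2][6];
    for (int i = 0; i < 2; i++)
        for (int k = 0; k < 6; k++) {
            all[i][k] = A[i][k];
            r.main_mem++;
        }
    for (int h = 0; h < n; h++)
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 6; k++) {
                    B[j] = all[i][k];
                    r.copy++;
                }
    return r;
}
```

With N = 10 the main-memory reads drop from 360 (N*2*3*6) to 120 (N*2*1*6) with the row copy and to 12 (1*2*1*6) with the full copy, while the CPU still performs 360 reads from the smaller memory.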
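Data re-use tree
[Figure: copy-candidate trees from background memory down to the CPU for the arrays image_in, gauss_x, gauss_xy/comp_edge and image_out. Nodes are labelled with copy sizes (N*M, N*1, M*3, 3*3, 3*1, 1*3, 1*1) and edges with access counts (0, N*M, N*M*3, N*M*8).]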
Memory hierarchy assignment
[Figure: the data re-use trees of image_in, gauss_x, gauss_xy, comp_edge and image_out assigned to three layers, with copy sizes (N*M, M*3, 3*3, 3*1, 1*1) and access counts (0, N*M, N*M*3, N*M*8).]
–L3: 1MB SDRAM
–L2: 16KB cache
–L1: 128 B register file
Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      if (x<N && y<M)
        gauss_x_image[x][y] = 0;
    }
    /* Other merged code omitted … */
  }
}
Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; ++k)
        in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3][y%1]*Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
    /* Other merged code omitted … */
  }
}
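The effect of the in_pixels copy can be seen on a single image row. The following sketch is a simplification with assumed names and weights (Gauss[] = {2, 1}, blur_direct, blur_reuse, reuse_matches): the direct version reads three pixels from the large array for every output, while the re-use version reads each pixel once into a 3-entry circular buffer.

```c
#include <assert.h>
#include <stdlib.h>

#define MAXN 64
static const int Gauss[2] = {2, 1};    /* hypothetical filter weights */

/* Direct version: returns the number of reads from the large row array. */
long blur_direct(const int *row, int *out, int n)
{
    assert(n >= 3 && n <= MAXN);
    long image_reads = 0;
    out[0] = out[n - 1] = 0;
    for (int x = 1; x <= n - 2; ++x) {
        int tmp = 0;
        for (int k = -1; k <= 1; ++k) {
            tmp += row[x + k] * Gauss[abs(k)];
            ++image_reads;
        }
        out[x] = tmp;
    }
    return image_reads;
}

/* Re-use version: each pixel is copied once into a 3-entry window. */
long blur_reuse(const int *row, int *out, int n)
{
    assert(n >= 3 && n <= MAXN);
    int in_pixels[3];
    long image_reads = 0;
    out[0] = out[n - 1] = 0;
    in_pixels[0] = row[0];                    /* first pixel initialized */
    ++image_reads;
    for (int x = 0; x <= n - 2; ++x) {
        in_pixels[(x + 1) % 3] = row[x + 1];  /* copy next pixel in row */
        ++image_reads;
        if (x >= 1) {
            int tmp = 0;
            for (int k = -1; k <= 1; ++k)
                tmp += in_pixels[(x + k) % 3] * Gauss[abs(k)];
            out[x] = tmp;
        }
    }
    return image_reads;
}

/* Returns 1 when outputs agree and the read counts are 3(n-2) and n. */
int reuse_matches(int n)
{
    assert(n >= 3 && n <= MAXN);
    int row[MAXN], a[MAXN], b[MAXN];
    for (int i = 0; i < n; i++)
        row[i] = 7 * i % 11;
    long direct = blur_direct(row, a, n);
    long reuse = blur_reuse(row, b, n);
    for (int i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return direct == 3L * (n - 2) && reuse == n;
}
```

The price of the smaller buffer is the modulo in every index expression, which is the overhead that the address optimization step later removes.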
Data reuse & memory hierarchy - result
Data layout optimization
At this point multi-dimensional arrays are to be assigned to physical memories.
Data layout optimization determines exactly where in each memory an array should be placed, to
–reduce memory size by "in-placing" arrays that do not overlap in time (disjoint lifetimes)
–avoid cache misses due to conflicts
–exploit spatial locality of the data in memory, to improve the performance of e.g. page-mode memory access sequences
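In-place mapping
[Figure: addresses against time for arrays A, B, C, D and E, shown for separate allocation and after in-place mapping]
–Inter in-place: different arrays share addresses
–Intra in-place: elements of one array share addresses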
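A rough feel for the possible gain of inter in-place mapping comes from comparing the sum of all array sizes with the largest amount of data that is live at any one time. The sketch below is an assumption-laden illustration: the struct array_life, the inclusive lifetime intervals and the example values are invented here, and the peak-live figure is only a lower bound, since a real mapping may need more because of placement constraints.

```c
#include <assert.h>

/* An array with its size and the first and last time step it is live. */
struct array_life { long size; int first; int last; };

/* Separate allocation: every array gets its own address range. */
long footprint_separate(const struct array_life *a, int count)
{
    assert(count >= 0);
    long total = 0;
    for (int i = 0; i < count; i++)
        total += a[i].size;
    return total;
}

/* Lower bound for in-place mapping: the peak of simultaneously live data.
 * The live total can only rise when an array becomes live, so it is
 * enough to evaluate it at every 'first' time step. */
long footprint_peak_live(const struct array_life *a, int count)
{
    assert(count >= 0);
    long peak = 0;
    for (int i = 0; i < count; i++) {
        int t = a[i].first;
        long live = 0;
        for (int j = 0; j < count; j++)
            if (a[j].first <= t && t <= a[j].last)
                live += a[j].size;
        if (live > peak)
            peak = live;
    }
    return peak;
}

/* Illustrative lifetimes: the first and third array never overlap in time. */
static const struct array_life example[3] = {
    {100, 0, 3},
    { 50, 2, 5},
    {100, 4, 7},
};
```

For the example, separate allocation needs 250 locations, whereas at most 150 are live at once, so the third array could in principle reuse the addresses of the first.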
In-place mapping - results
The last step: ADOPT (ADdress OPTimization)
Increased execution time introduced by DMM
–Complicated address arithmetic (modulo!)
–Additional complex control flow
Additional transformations needed to
–Simplify control flow
–Simplify address arithmetic: common sub-expression elimination, modulo expansion, …
–Match remaining expressions on target machine
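One simple way to remove a modulo from an address expression is to replace it with a counter that wraps around, a form of strength reduction. The sketch below is a generic illustration and is not claimed to be what ADOPT itself generates; ROWS, the function names and the helper counters_agree are assumptions.

```c
#include <assert.h>

#define ROWS 3    /* depth of a circular line buffer */

/* Straightforward addressing: one modulo operation per iteration. */
long sum_with_modulo(const int buf[ROWS], int m)
{
    assert(m >= 0);
    long sum = 0;
    for (int y = 0; y < m; ++y)
        sum += buf[y % ROWS];
    return sum;
}

/* Same access sequence with the modulo replaced by a wrap-around counter. */
long sum_with_counter(const int buf[ROWS], int m)
{
    assert(m >= 0);
    long sum = 0;
    int row = 0;                      /* always equals y % ROWS */
    for (int y = 0; y < m; ++y) {
        sum += buf[row];
        if (++row == ROWS)            /* compare and reset instead of divide */
            row = 0;
    }
    return sum;
}

/* Returns 1 when both addressing schemes visit the same elements. */
int counters_agree(int m)
{
    static const int buf[ROWS] = {5, 7, 11};
    return sum_with_modulo(buf, m) == sum_with_counter(buf, m);
}
```

On processors without a hardware divider the counter version avoids a costly operation in the innermost loop, at the expense of one extra compare per iteration.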
ADOPT example
From full-search motion estimation. Algebraic transformations at word level.
Before:
for (i=-8; i<=8; i++) {
  for (j=-4; j<=3; j++) {
    for (k=-4; k<=3; k++)
      A[((208+i)*257+8+j)* i+k] = B[(8+j)* i+k];
  }
  dist += A[3096] - B[((208+i)*257+4)* i-4];
}
After:
for (i=-8; i<=8; i++) {
  for (j=-4; j<=3; j++) {
    for (k=-4; k<=3; k++) {
      Ad[ ] = A[ ];
    }
  }
  dist += A[ ]-Ad[ ];
}
Common sub-expressions: cse1 = (33025*i )*2; cse3 = 1040+i; cse4 = j* ; cse5 = k+cse4; cse5+cse1 = cse5+cse cse1
Address optimization - result
Conclusion
In embedded applications, data transfer and storage issues should be explored at system level.
DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting
–Platform-independent high-level transformations
–Platform-dependent transformations exploit platform characteristics (efficient use of cache, local memories)
–Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...