Embedded Systems in Silicon TD5102 Data Management (1) Overview Henk Corporaal Technical University Eindhoven DTI / NUS Singapore 2005/2006
H.C. TD5102

Data Management Overview
- Motivation
- Example application
- Data Management (DM) steps
- Results

Important note:
- We consider here statically declared data structures only
- DM is also called
  - DTSE (Data Transfer and Storage Exploration), or
  - Physical Memory Management
[Figure: system design flow with stages labelled Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization, SW/HW co-design, SW design flow, HW design flow]
The underlying idea

[Figure: media processor block diagram with VLIW cpu, I$, D$, video-in, video-out, audio-in, audio-out, PCI bridge, serial I/O, timers, I2C I/O and external SDRAM]

for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];

Figure labels, each attached to the access B[j] = A[i*4+k]:
- Data storage bottleneck (SDRAM, D$)
- Data transfer bottleneck
Platform architecture model

[Figure: chip with CPUs and HW accelerators, ICache, DCache, local memories, L2 cache, on-chip busses, bus-if and bridge; off-chip main memory, SCSI bus, disk bus and disks; the memories are grouped into Level-1, Level-2, Level-3 and Level-4]
Platform example: TriMedia

[Figure: TriMedia memory organisation. Labels: 5 out of 27 processor FU’s; 128*32b 16-port RegFile; hardware accelerators; TriMedia TM M 1-port SDRAM; 16K 2-port SRAM; 256M 1-port SDRAM; SW cache 8KB; TriMedia TM1000 HW cache 8/16KB; CPU; cache bypass; SW controlled; HW controlled]
Data transfer and storage power

Power(memory) / Power(arithmetic) = 33
Positioning in the Y-chart

[Figure: Y-chart with Applications and Architecture Instance feeding Mapping, followed by Performance Analysis, which yields Performance Numbers]

Data transfer and data storage specific rewrites in the application code
Current practice

Mapping, easy, but ...

Given
– reference C code for application, e.g. MPEG-4 Motion Estimation
– platform: SUPERDUPER-LX50
Task
– map application on architecture

But … wait a moment

  CC –o2 mpeg4_me mpeg4_me.c
  Thank you for running SUPERDUPER-LX50 compiler.
  Your program uses bytes memory, 78 Watt, clock cycles

[Figure: code fragments "a=b*5+d;" and "for (...) {.. }" next to an "Idea" label]
Let’s help the compiler...

DTSE: data transfer and storage exploration

- DTSE is a methodology to explore data transfer and data storage in multi-media applications
  – Transforms C-code of the application
  – By focusing on multi-dimensional signals (arrays)
  – To better exploit platform capabilities
- This overview covers the major steps to improve the power, area, performance trade-off
Data Management principles

- Exploit memory hierarchy
- Exploit limited life-time
- Avoid N-port memories within real-time constraints
- Reduce redundant transfers
- Introduce locality

[Figure: Processor Data Paths connected to L1 cache, L2 cache and off-chip SDRAM; a Cache Bank Combine block groups local latch 1 & bank 1 up to local latch N & bank N]
DM steps

[Figure: flow of steps, in the order given on the slide]
- C-in
- Preprocessing
- Dataflow transformations
- Loop transformations
- Data reuse
- Memory hierarchy layer assignment
- Cycle budget distribution
- Memory allocation and assignment
- Data layout
- C-out
- Address optimization
The DM steps

- Preprocessing
  – Rewrite code in 3 layers (parts)
  – Selective inlining, single-assignment form, ...
- Data flow transformations
  – Eliminate redundant transfers and storage
- Loop and control flow transformations
  – Improve regularity of accesses and data locality
- Data re-use and memory hierarchy layer assignment
  – Determine when to move which data between memories to meet the cycle budget of the application at low cost
  – Determine in which layer to put the arrays (and copies)
The DM steps

Per memory layer:
- Cycle budget distribution
  – determine memory access constraints for the given cycle budget
- Memory allocation and assignment
  – which memories to use, and where to put the arrays
- Data layout
  – determine how to combine and put arrays into memories

Address optimization on the final C-code
Application example

- Application domain:
  – Computer Tomography in medical imaging
- Algorithm:
  – Cavity detection in CT-scans
  – Detect dark regions in successive images
  – Indicate cavity in brain

Bad news for owner of brain
Data enters Cavity Detector row-wise

[Figure: scan device delivering a serial scan into a Buffer that feeds the Cavity Detector; the GaussBlur loop reads the buffer as image_in]
Application

Reference (conceptual) C code for the algorithm
– all functions: image_in[N x M] (t-1) -> image_out[N x M] (t)
– new value of pixel depends on its neighbors
– neighbor pixels read from background memory
– approximately 110 lines of C code (ignoring file I/O etc.)
– experiments with N x M = 640 x 400 pixels
– straightforward implementation: 6 image buffers

[Figure: processing stages labelled Gauss Blur x, Gauss Blur y, Compute Edges, Reverse, Detect Roots, Max Value]
Preprocessing: Dividing an application in the 3 layers

LAYER1: Synchronisation (Module1a, Module1b, Module2, Module3)
  - testbench call
  - dynamic event behaviour
  - mode selection

LAYER2:
  for (i=0; i<N; i++)
    for (j=0; j<M; j++)
      if (i == 0) B[i][j] = 1;
      else B[i][j] = func1(A[i][j], A[i-1][j]);

LAYER3:
  int func1(int a, int b) { return a*b; }
Layered code structure

int main(void) {                 /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}

void cav_detect() {              /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}
Layered code structure

int foo(int arg1) {              /* Layer 3 */
  /* arithmetic, data-dependent operations
   * to be mapped to data-path, controller */
}

void cav_detect() {              /* Layer 2 code */
  for (x=GB; x<=N-1-GB; ++x) {
    for (y=GB; y<=M-1-GB; ++y) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; ++k) {
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

The layering makes the code for data access and data transfer explicit.
Data-flow trafo - cavity detection

for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    gauss_x_image[x][y] = 0;

for (x=1; x<=N-2; ++x) {
  for (y=1; y<=M-2; ++y) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; ++k) {
      gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
    }
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}

[Figure: N x M image with its (N-2) x (M-2) interior region]

#accesses: N * M + (N-2) * (M-2)
Data-flow trafo - cavity detection

for (x=0; x<N; ++x)
  for (y=0; y<M; ++y)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k) {
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      }
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }

[Figure: N x M image with its (N-2) x (M-2) interior region]

#accesses: N * M
gain is ± 50 %
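The write counts quoted for the two versions can be checked with a small experiment. The sketch below is only an illustration under assumptions: the 8 x 6 image size, the tap values in `Gauss` and the body of `foo` are made up so that the code compiles, and the helper names `blur_at`, `blur_two_pass` and `blur_one_pass` do not come from the course material.

```c
#include <assert.h>
#include <stdlib.h>

enum { N = 8, M = 6 };                  /* assumed small image size */
static const int Gauss[2] = { 2, 1 };   /* assumed filter taps */

static int foo(int v) { return v / 4; } /* assumed stand-in for the Layer 3 function */

/* 3-tap blur along x for one interior pixel. */
static int blur_at(int in[N][M], int x, int y)
{
    int tmp = 0;
    assert(x >= 1 && x <= N - 2);
    for (int k = -1; k <= 1; ++k)
        tmp += in[x + k][y] * Gauss[abs(k)];
    return foo(tmp);
}

/* Before: clear the whole image, then overwrite the interior.
 * Returns the number of writes to out. */
long blur_two_pass(int in[N][M], int out[N][M])
{
    long writes = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < M; ++y) { out[x][y] = 0; ++writes; }
    for (int x = 1; x <= N - 2; ++x)
        for (int y = 1; y <= M - 2; ++y) { out[x][y] = blur_at(in, x, y); ++writes; }
    return writes;
}

/* After: one traversal, each element written exactly once. */
long blur_one_pass(int in[N][M], int out[N][M])
{
    long writes = 0;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < M; ++y) {
            int interior = x >= 1 && x <= N - 2 && y >= 1 && y <= M - 2;
            out[x][y] = interior ? blur_at(in, x, y) : 0;
            ++writes;
        }
    return writes;
}
```

With N = 8 and M = 6 the first version performs N*M + (N-2)*(M-2) = 72 writes and the second N*M = 48, while both produce the same image.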
Data-flow transformation

In total 5 types of data-flow transformations:
– advanced signal substitution and (copy) propagation
– algebraic transformations (associativity, etc.)
– shifting “delay lines”
– re-computation
– transformations to eliminate bottlenecks for subsequent loop transformations
Loop transformations

- Loop transformations
  – improve regularity of accesses
  – improve temporal locality: production → consumption
- Expected influence
  – reduce temporary storage and (anticipated) background storage

Before (storage size N):
  for (j=1; j<=M; j++)
    for (i=1; i<=N; i++)
      A[i] = foo(A[i]);
  for (i=1; i<=N; i++)
    out[i] = A[i];

After (storage size 1):
  for (i=1; i<=N; i++) {
    for (j=1; j<=M; j++) {
      A[i] = foo(A[i]);
    }
    out[i] = A[i];
  }
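The storage reduction from N to 1 can be made concrete with a compilable variant of the two loop nests. This is a minimal sketch, assuming a made-up `foo` and 0-based indexing; the function names are hypothetical. In the second version each value is produced and consumed inside one outer iteration, so a scalar replaces the array.

```c
#include <assert.h>

static int foo(int v) { return 2 * v + 1; }   /* assumed per-element update */

/* Before: j is the outer loop, so all n values stay live between sweeps
 * and the full n-element array A is needed. */
void sweep_then_copy(int A[], int out[], int n, int m)
{
    assert(n >= 0 && m >= 0);
    for (int j = 1; j <= m; ++j)
        for (int i = 0; i < n; ++i)
            A[i] = foo(A[i]);
    for (int i = 0; i < n; ++i)
        out[i] = A[i];
}

/* After interchange and merging: element i is finished before element i+1
 * is touched, so one scalar holds the only live intermediate value. */
void per_element(const int in[], int out[], int n, int m)
{
    assert(n >= 0 && m >= 0);
    for (int i = 0; i < n; ++i) {
        int a = in[i];
        for (int j = 1; j <= m; ++j)
            a = foo(a);
        out[i] = a;
    }
}
```

The interchange is legal here because iteration (i, j) depends only on iteration (i, j-1) of the same element.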
Global loop transformation steps applied to cavity detection

- Removal of data-flow bottleneck
  – allows merging of loops
  – done in global data-flow trafo step
- Make all loop dimensions equal
- Regularize loop traversal: Y and X loop interchange
  – follow order of input stream
- Y loop folding and global merging
- X loop folding and global merging
  – full, global scope regularity
  – nearly complete locality for main signals
Loop trafo - cavity detection

X-Y Loop Interchange

[Figure: Scanner feeding an N x M buffer read by Gauss Blur x, which writes an N x M buffer; X and Y axes indicate the traversal order]

From double buffer to single buffer
Loop interchange (Y <-> X)

- For all loops, to maintain regularity
- Single assignment: always possible

Before:
  for (x=0; x<N; x++)
    for (y=0; y<M; y++)
      /* filtering code */

After:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* filtering code */
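Why the traversal should follow the input stream can be illustrated with a simple occupancy model. The sketch below rests on assumptions that are not in the slides: pixels arrive one per step in row-wise order (all x for y = 0, then y = 1, and so on), each loop iteration consumes exactly one pixel, and pixels are fetched only when first needed. The function `peak_buffer` is hypothetical and returns the largest number of pixels that are received but not yet consumed.

```c
#include <assert.h>

/* Peak number of buffered pixels for an n x m image that arrives row-wise
 * (arrival index = y*n + x) and is consumed by a two-deep loop nest.
 * x_outer != 0 models "for x { for y }", x_outer == 0 models "for y { for x }". */
long peak_buffer(int n, int m, int x_outer)
{
    long received = 0;   /* pixels pulled from the stream so far */
    long peak = 0;
    long step = 0;       /* pixels consumed so far */
    int outer = x_outer ? n : m;
    int inner = x_outer ? m : n;

    assert(n > 0 && m > 0);
    for (int o = 0; o < outer; ++o) {
        for (int i = 0; i < inner; ++i, ++step) {
            int x = x_outer ? o : i;
            int y = x_outer ? i : o;
            long arrival = (long)y * n + x;
            if (arrival + 1 > received)
                received = arrival + 1;   /* must wait for this pixel */
            if (received - step > peak)
                peak = received - step;
        }
    }
    return peak;
}
```

Under this model a 4 x 3 image needs 7 buffered pixels with x as the outer loop and only 1 with y as the outer loop, which is the motivation for interchanging the loops to match the scanner.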
Loop trafo - cavity detection

Repeated fold and loop merge

[Figure: Gauss Blur x feeding Gauss Blur y through an N x (2GB+1) buffer, and Gauss Blur y feeding Compute Edges through an N x 3 buffer (offset arrays)]

- From N x M to N x (2GB+1) buffer size
- From N x M to N x (3) buffer size
Loop Merging

Improve regularity and locality

Before:
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 1st filtering code */
  for (y=0; y<M; y++)
    for (x=0; x<N; x++)
      /* 2nd filtering code */

After:
  for (y=0; y<M; y++) {
    for (x=0; x<N; x++)
      /* 1st filtering code */
    for (x=0; x<N; x++)
      /* 2nd filtering code */
  }

!! Impossible due to dependencies!
Data dependencies between 1st and 2nd loop

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … gauss_x_image[x][y] = …

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    …
    for (k=-GB; k<=GB; k++)
      … = … gauss_x_image[x][y+k] …
Enable merging with Loop Folding (bumping)

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    … gauss_x_image[x][y] = …

for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    … y-GB …
    for (k=-GB; k<=GB; k++)
      … gauss_x_image[x][y+k-GB] …
Y-loop merging on 1st and 2nd loop nest

for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      … gauss_x_image[x][y] = …
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        …
        for (k=-GB; k<=GB; k++)
          … gauss_x_image[x][y-GB+k] …
      else
        …
}
Simplify conditions in merged loop nest

for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      … gauss_x_image[x][y] = …
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      …
      for (k=-GB; k<=GB; k++)
        … gauss_x_image[x][y-GB+k] …
    else if (y>=GB)
      …
}
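The folding and merging sequence can be tried out on a one-dimensional analogue. The sketch below is an illustration under assumptions, not the cavity detector itself: a single line of M = 10 values, GB = 1, a made-up first filter `produce`, and a second filter that simply sums 2*GB+1 neighbours. After the consumer loop is delayed by GB iterations the two loops share one body, and the intermediate signal only needs a circular buffer of 2*GB+1 entries.

```c
#include <assert.h>

enum { M = 10, GB = 1, WIN = 2 * GB + 1 };   /* assumed sizes */

static int produce(int v) { return 3 * v; }  /* assumed 1st filter */

/* Reference: two separate loops and a full M-element intermediate array. */
void two_loops(const int in[M], int out[M])
{
    int g[M];
    for (int y = 0; y < M; ++y)
        g[y] = produce(in[y]);
    for (int y = 0; y < M; ++y) {
        out[y] = 0;
        if (y >= GB && y <= M - 1 - GB)
            for (int k = -GB; k <= GB; ++k)
                out[y] += g[y + k];
    }
}

/* Folded and merged: the consumer runs GB iterations behind the producer,
 * so only the last WIN produced values have to be kept. */
void folded_merged(const int in[M], int out[M])
{
    int g[WIN];                               /* circular buffer */
    for (int y = 0; y < M + GB; ++y) {
        if (y < M)
            g[y % WIN] = produce(in[y]);
        if (y >= GB) {
            int yc = y - GB;                  /* consumer index */
            out[yc] = 0;
            if (yc >= GB && yc <= M - 1 - GB) {
                /* the oldest tap is still inside the window */
                assert(yc - GB >= y - (WIN - 1));
                for (int k = -GB; k <= GB; ++k)
                    out[yc] += g[(yc + k) % WIN];
            }
        }
    }
}
```

Both functions compute the same output; the merged version replaces the M-element intermediate by WIN = 3 entries, the 1-D counterpart of going from an N x M to an N x (2GB+1) buffer.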
Global loop merging/folding steps

1. x/y loop interchange (done)
2. Global y-loop folding/merging: 1st and 2nd nest (done)
3. Global y-loop folding/merging: 1st/2nd and 3rd nest
4. Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
5. Global x-loop folding/merging: 1st and 2nd nest
6. Global x-loop folding/merging: 1st/2nd and 3rd nest
7. Global x-loop folding/merging: 1st/2nd/3rd and 4th nest
End result of global loop trafo

for (y=0; y<M+GB+2; ++y) {
  for (x=0; x<N+2; ++x) {
    …
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k]
          + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
    …
  }
}
Data re-use & memory hierarchy

Introduce memory hierarchy
– reduce number of reads from main memory
– heavily accessed arrays stored in smaller memories

[Figure: Processor Data Paths and Reg File backed by main memory M (P = 1); after the transformation, smaller memories M’ (P = 0.1) and M’’ (P = …) sit between the processor and M]

#A = 100
P (original) = # access x power/access = 100
P (after) = 100 x … + … x … + … x 1 = 3
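The power estimate is a weighted sum of access counts and per-access costs. The sketch below works it through with integer arithmetic in units of 0.01. Several values are assumptions, since they are not legible on the slide: a cost of 0.01 per access for M'' and access counts of 100, 10 and 1 for M'', M' and M. They were chosen because they reproduce the stated total of 3; the type `Layer` and the function `total_power_centi` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    long accesses;     /* number of accesses to this layer */
    long cost_centi;   /* power per access, in units of 0.01 */
} Layer;

/* Total power in units of 0.01: sum of accesses x cost over all layers. */
long total_power_centi(const Layer *layers, size_t count)
{
    long total = 0;
    assert(layers != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        total += layers[i].accesses * layers[i].cost_centi;
    return total;
}
```

With a single main memory (100 accesses at cost 1) the function returns 10000, i.e. P = 100. With the assumed hierarchy (100 accesses at 0.01, 10 at 0.1, 1 at 1) it returns 300, i.e. P = 3.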
Data re-use

Data flow transformations to introduce extra copies of heavily accessed signals
– Step 1: figure out data re-use possibilities
– Step 2: calculate possible gain
– Step 3: decide on data assignment to memory hierarchy

int A[2][6];
for (h=0; h<N; h++)
  for (i=0; i<2; i++)
    for (j=0; j<3; j++)
      for (k=1; k<7; k++)
        B[j] = A[i][k];

[Figure: array index (6 * i + k) plotted against iterations]
Data re-use

[Figure: array index plotted against iterations for frame1, frame2, frame3, with copy candidates of size 6*2 and 6*1]

Access counts along the copy chain towards the CPU:
– 1*2*1*6
– N*2*1*6
– N*2*3*6 (CPU)
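The three access counts can be reproduced by actually inserting the copies and counting reads at each level. The sketch below is a hedged illustration: it uses 0-based k (0..5) instead of the slide's 1..6 so that all indices stay inside `A[2][6]`, and the names `ReadCounts`, `frame_copy`, `row_copy` and `run_with_copies` are hypothetical.

```c
#include <assert.h>

enum { ROWS = 2, COLS = 6, TAPS = 3 };

typedef struct {
    long from_main;        /* reads of the original array A */
    long from_frame_copy;  /* reads of the 2 x 6 copy */
    long from_row_copy;    /* reads of the 1 x 6 copy (seen by the CPU) */
} ReadCounts;

ReadCounts run_with_copies(int A[ROWS][COLS], int n_frames, int B[TAPS])
{
    ReadCounts c = { 0, 0, 0 };
    int frame_copy[ROWS][COLS];
    int row_copy[COLS];

    assert(n_frames > 0);
    /* Copy A once: 1*2*1*6 reads of the large memory. */
    for (int i = 0; i < ROWS; ++i)
        for (int k = 0; k < COLS; ++k) {
            frame_copy[i][k] = A[i][k];
            ++c.from_main;
        }
    for (int h = 0; h < n_frames; ++h)
        for (int i = 0; i < ROWS; ++i) {
            /* Refill the row copy per (h, i): N*2*1*6 reads in total. */
            for (int k = 0; k < COLS; ++k) {
                row_copy[k] = frame_copy[i][k];
                ++c.from_frame_copy;
            }
            /* The innermost loops only touch the smallest copy: N*2*3*6 reads. */
            for (int j = 0; j < TAPS; ++j)
                for (int k = 0; k < COLS; ++k) {
                    B[j] = row_copy[k];
                    ++c.from_row_copy;
                }
        }
    return c;
}
```

For N = 4 frames this gives 12 reads of A, 48 reads of the frame copy and 144 reads of the row copy, so the heavily accessed data sits in the smallest buffer.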
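Data re-use tree

[Figure: re-use tree from the arrays down to the CPU. Signals: image_in, gauss_x, gauss_xy/comp_edge, image_out. Copy-candidate sizes shown: N*M, N*1, 3*1, M*3, 1*3, 3*3, 1*1. Access counts on the edges: N*M, N*M*3, N*M*8, 0]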
Memory hierarchy assignment

[Figure: the re-use tree for image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three layers. L3: 1MB SDRAM; L2: 16KB Cache; L1: 128 B RegFile. Copy sizes shown: N*M, M*3, 3*1, 3*3, 1*1. Access counts on the edges: N*M, N*M*3, N*M*8, 0]
Data-reuse - cavity detection code

Code before reuse transformation:

for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      if (x<N && y<M)
        gauss_x_lines[x][y] = 0;
    }
    /* Other merged code omitted … */
  }
}
Data-reuse - cavity detection code

Code after reuse transformation:

for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; ++k)
        in_pixels[(x+k)%3][y%1] = image_in[x+k][y];
    /* copy rest of in_pixel's in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3][y%1] = image_in[x+1][y];
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3][y%1]*Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}
Data layout optimization

At this point multi-dimensional arrays are to be assigned to physical memories.

Data layout optimization determines exactly where in each memory an array should be placed, to
– reduce memory size by “in-placing” arrays that do not overlap in time (disjoint lifetimes)
– avoid cache misses due to conflicts
– exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences
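In-place mapping

[Figure: addresses versus time for arrays A, B, C, D, E, shown before and after mapping. Inter in-place: arrays with disjoint lifetimes share addresses. Intra in-place: a single array occupies less space than its declared size]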
In-place mapping

- Implements all the “anticipated” memory size savings obtained in previous steps
- Modifies code to introduce one array per “real” memory
- Changes indices to addresses in memory arrays

Before:
  b8 A[100][100];
  b6 B[20][20];
  for (i,j,k,l; …)
    B[i][j] = f(B[j][i], A[i+k][j+l]);

After:
  b8 mem1[10400];
  for (i,j,k,l; …)
    mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]));

[Figure: memory map of mem1 with A from 0x0 up to 0x2710 and B from 0x2710 up to 0x28a0]
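The index-to-address rewrite in this example can be checked mechanically. The sketch below takes the two address formulas from the slide (A at `i + 100*j`, B at `10000 + i + 20*j`) and verifies that no two elements share a cell. Here A and B are simply placed back to back in one memory array; sharing cells between arrays with disjoint lifetimes would use the same kind of formula with overlapping ranges. The helper names and the use of `uint8_t` for the slide's `b8` type are assumptions.

```c
#include <assert.h>
#include <stdint.h>

enum {
    A_DIM = 100,
    B_DIM = 20,
    B_BASE = A_DIM * A_DIM,                 /* 10000 = 0x2710 */
    MEM_SIZE = B_BASE + B_DIM * B_DIM       /* 10400 = 0x28a0 */
};

static uint8_t mem1[MEM_SIZE];              /* the one "real" memory */

int addr_A(int i, int j)
{
    assert(i >= 0 && i < A_DIM && j >= 0 && j < A_DIM);
    return i + A_DIM * j;
}

int addr_B(int i, int j)
{
    assert(i >= 0 && i < B_DIM && j >= 0 && j < B_DIM);
    return B_BASE + i + B_DIM * j;
}

void store_B(int i, int j, uint8_t v) { mem1[addr_B(i, j)] = v; }
uint8_t load_A(int i, int j) { return mem1[addr_A(i, j)]; }

/* Counts the cells of mem1 that are claimed by more than one array element. */
int count_overlaps(void)
{
    static uint8_t hits[MEM_SIZE];
    int overlaps = 0;
    for (int a = 0; a < MEM_SIZE; ++a)
        hits[a] = 0;
    for (int i = 0; i < A_DIM; ++i)
        for (int j = 0; j < A_DIM; ++j)
            if (hits[addr_A(i, j)]++) ++overlaps;
    for (int i = 0; i < B_DIM; ++i)
        for (int j = 0; j < B_DIM; ++j)
            if (hits[addr_B(i, j)]++) ++overlaps;
    return overlaps;
}
```

The first element of B lands at 0x2710 and its last element at 0x28a0 - 1, matching the address labels on the slide.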
In-place mapping

Input image is partly consumed by the time the first results for the output image are ready.

[Figure: index versus time for Image_in and Image_out separately, and address versus time for the combined array Image]
In-place - cavity detection code

Before:
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image_out[x-5][y-3] = …;
      /* code removed */
      … = image_in[x+1][y];
    }
  }

After:
  for (y=0; y<=M+3; ++y) {
    for (x=0; x<N+5; ++x) {
      image[x-5][y-3] = …;
      /* code removed */
      … = image[x+1][y];
    }
  }
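The reason the input and output images can share one array is that every write lands on an element whose original value has already been read. The one-dimensional sketch below illustrates this under assumptions: a made-up 3-tap sum stands in for the filter chain, and two scalars play the role of the small re-use buffers introduced earlier. The function names are hypothetical.

```c
#include <assert.h>

/* Reference with separate input and output arrays. */
void blur_separate(const int in[], int out[], int n)
{
    assert(n >= 3);
    out[0] = 0;
    out[n - 1] = 0;
    for (int x = 1; x <= n - 2; ++x)
        out[x] = in[x - 1] + in[x] + in[x + 1];
}

/* In-place version: one array serves as both input and output.
 * The read of image[x + 1] always runs ahead of the write to image[x];
 * the two older input values are kept in scalars. */
void blur_in_place(int image[], int n)
{
    assert(n >= 3);
    int prev = image[0];
    int cur = image[1];
    image[0] = 0;                       /* border; original value is in prev */
    for (int x = 1; x <= n - 2; ++x) {
        int next = image[x + 1];        /* still an unmodified input value */
        image[x] = prev + cur + next;   /* overwrite an already consumed cell */
        prev = cur;
        cur = next;
    }
    image[n - 1] = 0;                   /* border */
}
```

Both versions yield the same result, but the in-place one needs a single image buffer instead of two.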
The last step: ADOPT (Address OPTimization)

- Increased execution time introduced by DTSE
  – Complicated address arithmetic (modulo!)
  – Additional complex control flow
- Multimedia platform not adapted to address calculations
- Additional transformations needed to
  – Simplify control flow
  – Simplify address arithmetic: common sub-expression elimination, modulo expansion, …
  – Match remaining expressions on target machine
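One common way to remove the modulo operations that the circular buffers introduce is to replace `y % 3` by a counter that wraps around. The sketch below shows this strength reduction on a made-up weighted sum over a 3-entry buffer; whether this is exactly what the slide means by modulo expansion is an assumption, and the function names are hypothetical.

```c
#include <assert.h>

/* Address computed with a modulo in every iteration. */
long sum_with_modulo(const int lines[3], int steps)
{
    long sum = 0;
    assert(steps >= 0);
    for (int y = 0; y < steps; ++y)
        sum += (long)lines[y % 3] * (y + 1);
    return sum;
}

/* Same access sequence, with the modulo replaced by a wrapping counter. */
long sum_with_counter(const int lines[3], int steps)
{
    long sum = 0;
    int slot = 0;                  /* always equal to y % 3 */
    assert(steps >= 0);
    for (int y = 0; y < steps; ++y) {
        sum += (long)lines[slot] * (y + 1);
        if (++slot == 3)
            slot = 0;
    }
    return sum;
}
```

The counter version needs only an increment and a compare per iteration, which maps better onto processors without a fast divide or modulo instruction.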
ADOPT principles

[Figure: ADOPT flow]
- Behavioral description
- Extract address expr. code
- Perform addr. expr. splitting
- Apply transformations:
  - Loop invariant code motion
  - Induction variable analysis
  - Algebraic transformations
- Optimized behavioral descr.
- Either: processor specific algebraic transformations, giving an optimized behavioral descr. for the target processor, then compile to target processor
- Or: map to custom ACU
ADOPT principles

Example: Full-search Motion Estimation
Algebraic transformations at word-level

Before:
  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++)
        A[((208+i)*257+8+j)* i+k] = B[(8+j)* i+k];
    }
    dist += A[3096] - B[((208+i)*257+4)* i-4];
  }

After:
  for (i=-8; i<=8; i++) {
    for (j=-4; j<=3; j++) {
      for (k=-4; k<=3; k++) {
        Ad[ ] = A[ ];
      }
    }
    dist += A[ ]-Ad[ ];
  }

Common sub-expressions shown on the slide:
  cse1 = (33025*i )*2;
  cse3 = 1040+i;
  cse4 = j* ;
  cse5 = k+cse4;
  cse5+cse1 = cse5+cse cse1
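The effect of loop-invariant code motion and induction variables on such an address expression can be shown on a simplified stand-in. The sketch below keeps the loop bounds and the `(208+i)*257+8+j` term from the example, but the second multiplier `PITCH` is an assumed value because the original constant is not legible, and the function names are hypothetical. Both functions emit the same address sequence; the second one performs no multiplications inside the innermost loops.

```c
#include <assert.h>

enum {
    I_LO = -8, I_HI = 8,
    J_LO = -4, J_HI = 3,
    K_LO = -4, K_HI = 3,
    COUNT = 17 * 8 * 8,
    ROW = 257,
    PITCH = 257          /* assumed: second multiplier is not legible in the source */
};

/* Direct form: two multiplications per address. */
void addrs_direct(long out[COUNT])
{
    int n = 0;
    for (int i = I_LO; i <= I_HI; ++i)
        for (int j = J_LO; j <= J_HI; ++j)
            for (int k = K_LO; k <= K_HI; ++k)
                out[n++] = ((208L + i) * ROW + 8 + j) * PITCH + k;
    assert(n == COUNT);
}

/* Hoisted form: invariant parts are computed once per outer iteration
 * and advanced by additions. */
void addrs_hoisted(long out[COUNT])
{
    int n = 0;
    long i_part = (208L + I_LO) * ROW + 8;           /* (208+i)*ROW + 8 */
    for (int i = I_LO; i <= I_HI; ++i, i_part += ROW) {
        long j_part = (i_part + J_LO) * PITCH;       /* one multiply per i */
        for (int j = J_LO; j <= J_HI; ++j, j_part += PITCH)
            for (int k = K_LO; k <= K_HI; ++k)
                out[n++] = j_part + k;
    }
    assert(n == COUNT);
}
```

The direct form costs 2 x 1088 multiplications for the whole nest, the hoisted form 17 plus one for the initial value.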
DMM – results for cavity detection on ASIC
Cavity detection on Pentium-MMX

[Figure: three charts titled Main Memory Accesses, Local Memory Accesses, and Execution Time (sec)]
The Y-chart revisited

[Figure: Y-chart with Applications and Architecture Instance feeding Mapping, followed by Performance Analysis, which yields Performance Numbers]

- Data transfer and data storage specific rewrites in the application code
- Data transfer and data storage specific platform customization
Fixing platform parameters

Assume configurable on-chip memory hierarchy
– Trade-off power versus cycle-budget

[Figure: power [mW] plotted against storage cycle budget; x-axis ticks at 50,000, 100,000 and 150,000]
Conclusion

- In multi-media applications, exploring data transfer and storage issues should be done at system level
- DTSE is a methodology for Data Transfer and Storage Exploration based on manual and/or tool-assisted code rewriting
  – Platform independent high-level transformations
  – Platform dependent transformations exploit platform characteristics (optimal use of cache, …)
  – Substantial reduction in power and memory size demonstrated on MPEG-4, OFDM, H.263, ADSL, ...