Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems
Florin Balasa, University of Illinois at Chicago
Introduction
- Real-time multimedia processing systems: video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.
- A large part of the power dissipation is due to data transfer and data storage
- Fetching an operand from an off-chip memory for an addition consumes 33 times more power than the addition itself [Catthoor 98]
- The area cost is often largely dominated by memories
Introduction
- In the early years of high-level synthesis, memory management tasks were tackled at the scalar level
- Algebraic techniques, similar to those used in modern compilers, allow memory management to be handled at the non-scalar level
- Requirement: addressing the entire class of affine specifications
  - multidimensional signals with (complex) affine indexes
  - loop nests whose boundaries are affine functions of the iterators
  - conditions built from relational and/or logical operators on affine functions
Outline
- Memory size computation using data-dependence analysis
- Hierarchical memory allocation based on data-reuse analysis
- Data-flow driven data partitioning for on/off-chip memories
- Conclusions
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++) {
    … A[2i+3j+1][5i+j+2][4i+6j+3] …
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …
  }

How many memory locations are necessary to store the array references A[2i+3j+1][5i+j+2][4i+6j+3] and B[i+k][j+k]?
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

Is it the number of iterator triplets (i,j,k), that is, 512^3 ?? No !!
For instance, (i,j,k)=(0,1,1) and (i,j,k)=(1,2,0) both access B[1][2].
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

Is it the number of possible index values (i+k, j+k), that is, 1023^2 (since 0 <= i+k, j+k <= 1022) ?? No !!
For instance, no (i,j,k) in the iterator space yields B[0][512]: i+k=0 forces i=k=0, and then j+k=j cannot reach 512.
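Both wrong guesses, and the true count, can be cross-checked by brute force for a smaller loop bound (a sketch; exhaustive enumeration is only feasible for small bounds):

```python
# Count the distinct index pairs (i+k, j+k) actually accessed by
# B[i+k][j+k] for 0 <= i, j, k <= N-1, and compare with the two naive
# guesses: the iterator count N^3 and the index bounding box (2N-1)^2.
def distinct_accesses(N):
    touched = set()
    for i in range(N):
        for j in range(N):
            for k in range(N):
                touched.add((i + k, j + k))
    return len(touched)

N = 8  # loop bound 7
print(N**3)                  # 512 iterator triplets
print((2 * N - 1)**2)        # 225 points in the index bounding box
print(distinct_accesses(N))  # 169 locations actually accessed
```

For loop bound 7 this reproduces the 169 storage locations of the table for B[i+k][j+k]: strictly fewer than both naive counts.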
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A[2i+3j+1][5i+j+2][4i+6j+3] …

[Figure: the 2-D iterator space (i, j) is mapped to the 3-D index space A[x][y][z], where x = 2i+3j+1, y = 5i+j+2, z = 4i+6j+3]
Computation of array reference size

Remark: the iterator space may have "holes" too.

for (i=4; i<=8; i++)
  for (j=i-2; j<=i+2; j+=2)
    … C[i+j] …

After normalization (unit loop steps):

for (i=4; i<=8; i++)
  for (j=0; j<=2; j++)
    … C[2i+2j-2] …

[Figure: the original iterator space (i = 4..8, j = i-2..i+2 with step 2) and the normalized iterator space (i = 4..8, j = 0..2)]
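The normalization step can be checked by enumerating both loop nests and comparing the sets of locations of C they touch (a small sketch of the example above):

```python
# Original nest: i = 4..8, j = i-2..i+2 with step 2, accessing C[i+j].
original = {i + j for i in range(4, 9) for j in range(i - 2, i + 3, 2)}

# Normalized nest: i = 4..8, j = 0..2, accessing C[2i+2j-2].
normalized = {2 * i + 2 * j - 2 for i in range(4, 9) for j in range(3)}

print(sorted(original))           # only even indexes: the space has "holes"
print(original == normalized)     # both nests touch the same locations
```

Both nests access exactly the even locations C[6], C[8], …, C[18], confirming that the normalized nest is equivalent.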
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A[2i+3j+1][5i+j+2][4i+6j+3] …

The affine mapping from iterator space to index space:

[x]   [2 3] [i]   [1]
[y] = [5 1] [j] + [2]
[z]   [4 6]       [3]

with 0 <= i, j <= 511.
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

[Figure: the 3-D iterator space (i, j, k) is mapped to the 2-D index space B[x][y], where x = i+k, y = j+k]
Computation of array reference size

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

The affine mapping from iterator space to index space:

[x]   [1 0 1] [i]
[y] = [0 1 1] [j]
              [k]

with 0 <= i, j, k <= 511.
Computation of array reference size

Any array reference can be modeled as a linearly bounded lattice (LBL):

LBL = { x = T·i + u | A·i >= b }

- x = T·i + u is the affine mapping
- A·i >= b is the iterator space: the scope of the nested loops and the iterator-dependent conditions
- the LBL is the image of a polytope (the iterator space) under the affine mapping
Computation of array reference size

The size of the array reference is the size of its index space, which is itself an LBL !!

LBL = { x = T·i + u | A·i >= b },   f : Z^n -> Z^m,   f(i) = T·i + u

Is function f a one-to-one mapping ?? If YES, size(index space) = size(iterator space).
Computation of array reference size

f : Z^n -> Z^m,   f(i) = T·i + u

Bring T to the form

P·T·S = [H]
        [G]        [Minoux 86]

- H: nonsingular lower-triangular matrix
- S: unimodular matrix
- P: row permutation

When rank(H) = m <= n, H is the Hermite Normal Form of T.
Computation of array reference size

Case 1: rank(H) = n, so function f is a one-to-one mapping

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    … A[2i+3j+1][5i+j+2][4i+6j+3] …

            [2 3]   [-1  3]   [ 1  0]  <- H
P·T·S = I3 ·[5 1] · [ 1 -2] = [-4 13]
            [4 6]             [ 2  0]  <- G

H = [1 0; -4 13] is nonsingular, so rank(H) = n = 2 and f is one-to-one:

Nr. locations A[ ][ ][ ] = size( 0 <= i, j <= 511 ) = 512 x 512
Computation of array reference size

Case 2: rank(H) < n

for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      … B[i+k][j+k] …

            [1 0 1]   [1 0 -1]   [1 0 0]
P·T·S = I2 ·[0 1 1] · [0 1 -1] = [0 1 0]   with H = I2, rank(H) = 2 < n = 3
                      [0 0  1]

Change of iterators (I, J, K) = (i+k, j+k, k):
{ 0 <= i, j, k <= 511 }  becomes  { 0 <= I-K, J-K, K <= 511 }

|B[i+k][j+k]| = size( 0 <= I, J <= 1022, I-511 <= J <= I+511 ) = 784,897
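The distinction between the two cases hinges on the rank of the mapping matrix T; this can be checked numerically (a sketch using numpy's matrix_rank in place of a full Hermite Normal Form computation):

```python
import numpy as np

# Mapping of A[2i+3j+1][5i+j+2][4i+6j+3]: T_A is 3x2, n = 2 iterators.
T_A = np.array([[2, 3], [5, 1], [4, 6]])

# Mapping of B[i+k][j+k]: T_B is 2x3, n = 3 iterators.
T_B = np.array([[1, 0, 1], [0, 1, 1]])

# f(i) = T·i + u is one-to-one on Z^n exactly when rank(T) = n.
print(np.linalg.matrix_rank(T_A))  # 2 = n -> Case 1: one-to-one
print(np.linalg.matrix_rank(T_B))  # 2 < 3 -> Case 2: not one-to-one
```

Since rank(T) = rank(H), this reproduces the case split: the A reference is one-to-one, the B reference is not.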
Computation of array reference size

Array reference B[i+k][j+k]

Loop bounds | Size iterator space | Size index space (# storage locations) | |index space| / |iterator space|
        7   |            512      |      169  |  33 %
       15   |          4,096      |      721  |  17 %
       31   |         32,768      |    2,977  |  9.0 %
       63   |        262,144      |   12,097  |  4.6 %
      127   |      2,097,152      |   48,769  |  2.3 %
      255   |     16,777,216      |  195,841  |  1.1 %
      511   |    134,217,728      |  784,897  |  0.5 %
Computation of array reference size

Computation of the size of an integer polytope: the Fourier-Motzkin elimination

Given an n-dim polytope { sum_k a_ik x_k >= b_i }, sort the inequalities involving x_n into:
1. x_n >= D_i(x_1, …, x_{n-1})
2. x_n <= E_j(x_1, …, x_{n-1})
3. 0 <= F_k(x_1, …, x_{n-1})

Eliminating x_n yields an (n-1)-dim polytope:
  D_i(x_1, …, x_{n-1}) <= E_j(x_1, …, x_{n-1})   for each pair (i, j)
  0 <= F_k(x_1, …, x_{n-1})

Repeating the elimination down to a 1-dim polytope yields the range of x_1; for each value of x_1, add the size of the corresponding (n-1)-dim polytope.
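The counts produced by the symbolic elimination can be cross-checked by brute force on small instances (a sketch, not Fourier-Motzkin itself; the enclosing box bounds are assumed known):

```python
from itertools import product

# Brute-force count of the integer points of a polytope { x | A·x >= b },
# used only to cross-check the sizes derived by Fourier-Motzkin elimination.
def count_points(ineqs, box):
    """ineqs: list of (a, b) meaning a·x >= b; box: per-dimension (lo, hi)."""
    ranges = [range(lo, hi + 1) for lo, hi in box]
    return sum(1 for x in product(*ranges)
               if all(sum(ai * xi for ai, xi in zip(a, x)) >= b
                      for a, b in ineqs))

# Index space of B[i+k][j+k] for loop bound 7:
# { 0 <= I, J <= 14, I-7 <= J <= I+7 }
ineqs = [((1, 0), 0), ((-1, 0), -14),   # 0 <= I <= 14
         ((0, 1), 0), ((0, -1), -14),   # 0 <= J <= 14
         ((-1, 1), -7), ((1, -1), -7)]  # |I - J| <= 7
print(count_points(ineqs, [(0, 14), (0, 14)]))  # 169, as in the table
```

This reproduces the 169 storage locations listed for loop bound 7, matching the analytical count.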
Memory size computation

#define n 6
for (j=0; j<n; j++) {
  A[j][0] = in0;
  for (i=0; i<n; i++)
    A[j][i+1] = A[j][i] + 1;
}
for (i=0; i<n; i++) {
  alpha[i] = A[i][n+i];
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
}
for (j=0; j<n; j++)
  B[j] = A[j][2*n];
Memory size computation

Decompose the LBLs of the array references into non-overlapping pieces !!

The intersection LBL1 ∩ LBL2 of
LBL1 = { x = T1·i1 + u1 | A1·i1 >= b1 }
LBL2 = { x = T2·i2 + u2 | A2·i2 >= b2 }
is again an LBL: the solutions of the Diophantine system of equations
T1·i1 + u1 = T2·i2 + u2
subject to { A1·i1 >= b1, A2·i2 >= b2 } define a new polytope.
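The effect of the intersection can be illustrated by enumerating two references of signal A from the example code and intersecting their images as sets (a brute-force sketch; the actual method solves the Diophantine system symbolically):

```python
# Writes of the first nest:  A[j][i+1] for 0 <= i, j < n  (columns 1..n)
# Reads of the second nest:  A[j][n+i] for 0 <= i, j < n  (columns n..2n-1)
n = 6
write1 = {(j, i + 1) for j in range(n) for i in range(n)}
read2  = {(j, n + i) for j in range(n) for i in range(n)}

overlap = write1 & read2
print(sorted(overlap))  # only column n: the decomposition splits write1
                        # into columns 1..n-1 and the overlapping column n
```

The two LBLs overlap in exactly one column of A, which becomes a separate non-overlapping piece in the decomposition.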
Memory size computation

Keeping the set of inequalities in the LBL intersection minimal

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];

Iterator space of the j < i branch:
{ 0 <= i, j <= n-1, j+1 <= i }   (5 inequalities)
reduces to
{ 0 <= j, i <= n-1, j+1 <= i }   (3 inequalities)

[Figure: the triangular iterator space, i = 1..n-1, j = 0..i-1]
Memory size computation

Keeping the set of inequalities in the LBL intersection minimal

The decomposition theorem of polyhedra [Motzkin 1953]:

Polyhedron = { x | C·x = d, A·x >= b }
Polyhedron = { x | x = V·a + L·b + R·g },   a, g >= 0,   sum a_i = 1

where the columns of V, L, R are the vertices, lines, and extreme rays of the polyhedron.
Memory size computation

[Figure: the LBLs of signal A (illustrative example)]
Polyhedral data-dependence graphs

[Figure: polyhedral data-dependence graphs at granularity level 0 and granularity level 1]
Scalar-level data-dependence graph

[Figure: the data-dependence graph at granularity level 2, the scalar level]
Polyhedral data-dependence graph

[Figure: for the motion detection algorithm [Chan 93], the number of scalars and the number of dependencies at each granularity level]
Memory size computation

[Figure: memory size variation during the execution of the motion detection algorithm]
Memory size computation

To handle high-throughput applications:
- Extract the (largely hidden) parallelism from the initially specified code
- Find the lowest degree of parallelism that meets the throughput/hardware requirements
- Perform memory size computation for code with explicit parallelism instructions
Hierarchical memory allocation

- A large part of the power dissipation in data-dominated applications is due to data transfers and data storage
- Power cost reduction: a memory hierarchy exploiting temporal locality in the data accesses
- Power dissipation = f( memory size, access frequency )
Hierarchical memory allocation

Power dissipation = f( memory size, access frequency ): copies of heavily used data are kept in a layer of small memories, backed by a layer of large memories.
Hierarchical memory allocation

Trade-offs of a hierarchical versus a non-hierarchical distribution:
- Lower power consumption by accessing data from smaller memories
- Higher power consumption due to the additional transfers that store copies of data
- Larger area due to the additional overhead (addressing logic)
Hierarchical memory allocation

Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:
1. Data reuse exploration: which intermediate copies of data are necessary for accessing data in a power- and area-efficient way
2. Memory allocation & assignment: a distributed (hierarchical) memory architecture (memory layers, memory size/ports/address logic, signal-to-memory and signal-to-port assignment)
Hierarchical memory allocation

Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:
1. Data reuse exploration: the array partitions considered as copy candidates are the LBLs obtained from the recursive intersection of the array references
2. Memory allocation & assignment:
   Cost = a · sum P_read/write( N_bits, N_words, f_read/write ) + b · sum Area( N_bits, N_words, N_ports, technology )
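The allocation cost above can be sketched as a weighted sum over the memories of the hierarchy; the monomial power and area models below are purely hypothetical placeholders standing in for the technology-dependent library models:

```python
# Hypothetical per-memory models (placeholders, not real library data):
# power grows with bit width, word count, and access frequency;
# area grows with total bit capacity and port count.
def power(nbits, nwords, freq):
    return nbits * nwords**0.5 * freq              # assumed model

def area(nbits, nwords, nports):
    return nbits * nwords * (1 + 0.5 * (nports - 1))  # assumed model

def allocation_cost(memories, a=1.0, b=1.0):
    """memories: list of dicts with nbits, nwords, nports, freq."""
    return (a * sum(power(m["nbits"], m["nwords"], m["freq"]) for m in memories)
          + b * sum(area(m["nbits"], m["nwords"], m["nports"]) for m in memories))

# A small hierarchy: one small on-chip copy memory, one large background memory.
mems = [{"nbits": 16, "nwords": 256,   "nports": 1, "freq": 1e6},
        {"nbits": 16, "nwords": 65536, "nports": 1, "freq": 1e4}]
print(allocation_cost(mems))
```

The weights a and b steer the exploration toward power- or area-dominated solutions, as in the cost function above.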
Partitioning for on/off-chip memories

[Figure: a CPU with a 1-cycle on-chip SRAM and a 1-cycle cache backed by an off-chip DRAM with a 10-20 cycle latency, sharing the memory address space]

Goal: an optimal data mapping to the SRAM/DRAM that maximizes the performance of the application.
Partitioning for on/off-chip memories

- Total conflict factor: the total number of array accesses exposed to cache conflicts, measuring the importance of mapping an array to the on-chip SRAM
- The polyhedral data-dependence graph provides precise information about the relative lifetimes of the different parts of the arrays
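One simple way to use the total conflict factor is a greedy mapping under the SRAM size budget (a sketch; the array names, sizes, and conflict factors are made up for illustration, and an exact version would be a knapsack formulation):

```python
# Greedy on/off-chip partitioning: place the arrays with the highest
# conflict factor per byte into the on-chip SRAM until the budget is spent.
# (Array names, sizes, and conflict factors are hypothetical.)
def partition(arrays, sram_budget):
    """arrays: list of (name, size_bytes, total_conflict_factor)."""
    on_chip, used = [], 0
    for name, size, tcf in sorted(arrays, key=lambda a: a[2] / a[1],
                                  reverse=True):
        if used + size <= sram_budget:
            on_chip.append(name)
            used += size
    return on_chip  # everything else stays in the off-chip DRAM

arrays = [("frame_prev", 4096, 12000), ("frame_cur", 4096, 11500),
          ("motion_vecs", 512, 9000), ("coeffs", 16384, 3000)]
print(partition(arrays, sram_budget=8192))
```

Arrays with many conflict-exposed accesses per byte are mapped on-chip first, which is the intuition behind prioritizing by the total conflict factor.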
Conclusions
- Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing systems
- Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
  - memory size computation for behavioral specifications
  - hierarchical memory allocation
  - data partitioning between on- and off-chip memories

The End