Florin Balasa University of Illinois at Chicago

Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems

2 Introduction
Real-time multimedia processing systems (video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.)
  A large part of the power dissipation is due to data transfer and data storage
  Fetching operands from an off-chip memory for an addition consumes 33 times more power than the computation itself [Catthoor 98]
  The area cost is often largely dominated by memories

3 Introduction
In the early years of high-level synthesis, memory management tasks were tackled at the scalar level
Algebraic techniques, similar to those used in modern compilers, allow memory management to be handled at the non-scalar level
Requirement: addressing the entire class of affine specifications
  multidimensional signals with (complex) affine indexes
  loop nests whose boundaries are affine functions of the iterators
  conditions: relational and/or logical operators of affine functions

4 Outline
Memory size computation using data dependence analysis
Hierarchical memory allocation based on data reuse analysis
Data-flow driven data partitioning for on/off-chip memories
Conclusions

5 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …
How many memory locations are necessary to store the array references A[2i+3j+1][5i+j+2][4i+6j+3] and B[i+k][j+k]?

6 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …
Is it the number of iterator triplets (i,j,k), that is 512^3? No!
Different triplets can address the same element: (i,j,k)=(0,1,1) and (i,j,k)=(1,2,0) both access B[1][2].

7 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …
Is it the number of index values (i+k,j+k), that is 1023^2 (since 0 <= i+k , j+k <= 1022)? No!
Not every index pair in that range is reached: no triplet (i,j,k) accesses B[0][512], for instance.
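Both wrong guesses can be checked by brute force for a small loop bound. The sketch below is an added illustration, not the method of the slides; the function name count_distinct_b and the bitmap approach are assumptions made for the example. For a bound of 7 there are 8^3 = 512 iterator triplets and 15^2 = 225 index pairs in the bounding box, yet only 169 elements are actually touched.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the distinct elements B[i+k][j+k] touched when i, j and k each run
 * over 0..bound. Hypothetical helper, written for illustration only. */
long count_distinct_b(int bound)
{
    assert(bound >= 0);
    int side = 2 * bound + 1;            /* i+k and j+k both lie in 0..2*bound */
    unsigned char *seen = calloc((size_t)side * (size_t)side, 1);
    if (seen == NULL)
        return -1;                       /* allocation failure */

    long distinct = 0;
    for (int i = 0; i <= bound; i++)
        for (int j = 0; j <= bound; j++)
            for (int k = 0; k <= bound; k++) {
                size_t cell = (size_t)(i + k) * (size_t)side + (size_t)(j + k);
                if (!seen[cell]) {       /* first access to this element */
                    seen[cell] = 1;
                    distinct++;
                }
            }
    free(seen);
    return distinct;
}
```

Calling count_distinct_b(7) returns 169. The enumeration costs (bound+1)^3 steps, which is why an algebraic method is preferable for realistic loop bounds.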

8 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …
[Figure: the iterator space (axes i, j) is mapped to the index space of A[x][y][z] (axes x, y, z), with x=2i+3j+1, y=5i+j+2, z=4i+6j+3]

9 Computation of array reference size
    … A[2i+3j+1][5i+j+2][4i+6j+3] …
[Figure: iterator space (axes i, j) and index space of A[x][y][z]]

10 Computation of array reference size
Remark: the iterator space may have "holes" too
    for (i=4; i<=8; i++)
      for (j=i-2; j<=i+2; j+=2)
        … C[i+j] …
After normalization:
    for (i=4; i<=8; i++)
      for (j=0; j<=2; j++)
        … C[2i+2j-2] …
[Figure: the iterator space (axes i, j) before and after normalization]

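11 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …
Affine mapping from the iterator space to the index space:
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
+
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix},
\qquad 0 \le i, j \le 511
\]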

12 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …
[Figure: the iterator space (axes i, j, k) is mapped to the index space of B[x][y] (axes x, y), with x=i+k, y=j+k]

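13 Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …
Affine mapping from the iterator space to the index space:
\[
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} i \\ j \\ k \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\qquad 0 \le i, j, k \le 511
\]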

14 Computation of array reference size
Any array reference can be modeled as a linearly bounded lattice (LBL)
LBL = { x = T·i + u | A·i >= b }
  x = T·i + u : the affine mapping
  A·i >= b : the iterator space (the scope of the nested loops and the iterator-dependent conditions)
[Figure: a polytope is transformed by the affine mapping into an LBL]

15 Computation of array reference size
The size of the array reference is the size of its index space, which is an LBL!
LBL = { x = T·i + u | A·i >= b }
f : Z^n -> Z^m ,  f(i) = T·i + u
Is function f a one-to-one mapping?
If YES: Size(index space) = Size(iterator space)

16 Computation of array reference size
f : Z^n -> Z^m ,  f(i) = T·i + u
\[
P \cdot T \cdot S = \begin{bmatrix} H & 0 \\ G & 0 \end{bmatrix}
\qquad \text{[Minoux 86]}
\]
  H - nonsingular lower-triangular matrix
  S - unimodular matrix
  P - row permutation
When rank(H) = m <= n , H is the Hermite Normal Form

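17 Computation of array reference size
Case 1: rank(H) = n , so function f is a one-to-one mapping
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
+
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
\]
\[
P \cdot T \cdot S
= I_3
\begin{bmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{bmatrix}
\begin{bmatrix} -1 & 3 \\ 1 & -2 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ -4 & 13 \\ 2 & 0 \end{bmatrix},
\qquad
H = \begin{bmatrix} 1 & 0 \\ -4 & 13 \end{bmatrix},
\quad
G = \begin{bmatrix} 2 & 0 \end{bmatrix}
\]
Nr. locations A[ ][ ][ ] = size( 0 <= i,j <= 511 ) = 512 x 512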
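The one-to-one test can also be sketched without the full decomposition: an affine map f(i) = T·i + u is one-to-one on all of Z^n exactly when rank(T) = n. The code below is a minimal illustration under that observation and does not compute H, S or P; the names int_matrix_rank, is_one_to_one and MAX_DIM are hypothetical, and the cross-multiplying elimination is only meant for the small index matrices of array references, since entries can overflow on large inputs.

```c
#include <assert.h>

#define MAX_DIM 8

/* Rank of a rows x cols integer matrix, by division-free Gaussian
 * elimination on a local copy (no fractions are introduced). */
int int_matrix_rank(int rows, int cols, long long t[][MAX_DIM])
{
    assert(rows >= 0 && rows <= MAX_DIM && cols >= 0 && cols <= MAX_DIM);
    long long m[MAX_DIM][MAX_DIM];
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            m[r][c] = t[r][c];

    int rank = 0;
    for (int c = 0; c < cols && rank < rows; c++) {
        int p = rank;
        while (p < rows && m[p][c] == 0)
            p++;
        if (p == rows)
            continue;                        /* no pivot in this column */
        for (int k = 0; k < cols; k++) {     /* move the pivot row up */
            long long tmp = m[p][k];
            m[p][k] = m[rank][k];
            m[rank][k] = tmp;
        }
        for (int r = rank + 1; r < rows; r++) {
            long long f = m[r][c], piv = m[rank][c];
            for (int k = 0; k < cols; k++)   /* cancel column c in row r */
                m[r][k] = m[r][k] * piv - m[rank][k] * f;
        }
        rank++;
    }
    return rank;
}

/* f(i) = T*i + u is one-to-one on Z^n exactly when rank(T) == n (= cols). */
int is_one_to_one(int rows, int cols, long long t[][MAX_DIM])
{
    return int_matrix_rank(rows, cols, t) == cols;
}
```

For the index matrix of A (rows {2,3}, {5,1}, {4,6}) the rank is 2 = n, so the reference is one-to-one; for the index matrix of B (rows {1,0,1}, {0,1,1}) the rank is 2 < 3, so it is not.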

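18 Computation of array reference size
Case 2: rank(H) < n
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …
\[
P \cdot T \cdot S
= I_2
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix},
\qquad
H = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\]
With the new iterators (I, J, K), where i = I-K, j = J-K, k = K:
{ 0 <= i , j , k <= 511 } becomes { 0 <= I-K , J-K , K <= 511 }
| B[i+k][j+k] | = size( 0 <= I,J <= 1022 , I-511 <= J <= I+511 ) = 784,897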

19 Computation of array reference size
Array reference B[i+k][j+k]

Loop bounds   Size iterator space   Size index space         |index space| /
                                    (# storage locations)    |iterator space|
      7                     512                      169          33 %
     15                   4,096                      721          17 %
     31                  32,768                    2,977          9.0 %
     63                 262,144                   12,097          4.6 %
    127               2,097,152                   48,769          2.3 %
    255              16,777,216                  195,841          1.1 %
    511             134,217,728                  784,897          0.5 %
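The index-space sizes in the table are consistent with a closed form that follows from the polytope of Case 2. The derivation below is an added note, assuming all three iterators run from 0 to N; it counts the pairs (I, J) of the (2N+1) x (2N+1) grid and removes those with |I-J| > N.

```latex
\begin{align*}
|\text{index space}|
  &= \#\{(I,J) \in \mathbb{Z}^2 : 0 \le I, J \le 2N,\ |I-J| \le N\} \\
  &= (2N+1)^2 - 2 \sum_{d=N+1}^{2N} (2N+1-d) \\
  &= (2N+1)^2 - N(N+1) \\
  &= 3N^2 + 3N + 1
\end{align*}
```

For N = 7 this gives 169 and for N = 511 it gives 784,897, matching the table. Since the iterator space holds (N+1)^3 triplets, the ratio behaves roughly like 3/(N+1), which is why it keeps halving as the loop bound doubles.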

20 Computation of array reference size
Computation of the size of an integer polytope: the Fourier-Motzkin elimination
n-dim polytope: \( \sum_k a_{ik} x_k \ge b_i \), rewritten as
  1. x_n >= D_i(x_1,…,x_{n-1})
  2. x_n <= E_j(x_1,…,x_{n-1})
  3. 0 <= F_k(x_1,…,x_{n-1})
(n-1)-dim polytope, after eliminating x_n:
  D_i(x_1,…,x_{n-1}) <= E_j(x_1,…,x_{n-1})
  0 <= F_k(x_1,…,x_{n-1})
Repeating the elimination ends with a 1-dim polytope: the range of x_1
For each value of x_1 in this range, add the size of the corresponding (n-1)-dim polytope
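The final summation step ("for each value of x_1, add the size of the lower-dimensional polytope") can be sketched for the 2-D case. This is a simplified illustration and not a full Fourier-Motzkin implementation: the range of x is passed in by the caller instead of being derived by elimination, and the names ineq and count_points_2d are hypothetical.

```c
#include <assert.h>

/* One inequality a*x + b*y >= c over the integers. */
struct ineq { long a, b, c; };

/* Floor and ceiling of p/q for q > 0 (C division truncates toward zero). */
static long floor_div(long p, long q) { return p / q - ((p % q != 0) && (p < 0)); }
static long ceil_div(long p, long q)  { return p / q + ((p % q != 0) && (p > 0)); }

/* Count the integer points (x, y) with x_min <= x <= x_max satisfying all n
 * inequalities. For a fixed x, inequalities with b != 0 bound y from below
 * or above; those with b == 0 only accept or reject the whole slice.
 * Assumes the inequalities bound y on both sides. */
long count_points_2d(const struct ineq *q, int n, long x_min, long x_max)
{
    long total = 0;
    for (long x = x_min; x <= x_max; x++) {
        int have_lo = 0, have_hi = 0, feasible = 1;
        long lo = 0, hi = 0;
        for (int k = 0; k < n && feasible; k++) {
            long rhs = q[k].c - q[k].a * x;          /* b*y >= rhs */
            if (q[k].b > 0) {                        /* lower bound on y */
                long v = ceil_div(rhs, q[k].b);
                if (!have_lo || v > lo) { lo = v; have_lo = 1; }
            } else if (q[k].b < 0) {                 /* upper bound on y */
                long v = floor_div(-rhs, -q[k].b);
                if (!have_hi || v < hi) { hi = v; have_hi = 1; }
            } else if (rhs > 0) {
                feasible = 0;                        /* 0 >= rhs fails */
            }
        }
        assert(!feasible || (have_lo && have_hi));   /* y must be bounded */
        if (feasible && lo <= hi)
            total += hi - lo + 1;                    /* size of the 1-dim slice */
    }
    return total;
}
```

Applied to the polytope of B[i+k][j+k], with x = I in 0..1022 and the four inequalities J >= 0, J <= 1022, J >= I-511, J <= I+511, the function returns 784,897 after 1,023 slices instead of 512^3 loop iterations.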

21 Memory size computation
    #define n 6

    for (j=0; j<n; j++) {
      A[j][0] = in0;
      for (i=0; i<n; i++)
        A[j][i+1] = A[j][i] + 1;
    }
    for (i=0; i<n; i++) {
      alpha[i] = A[i][n+i];
      for (j=0; j<n; j++)
        A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
    }
    for (j=0; j<n; j++)
      B[j] = A[j][2*n];

22 Memory size computation
Decompose the LBLs of the array references into non-overlapping pieces!
The intersection LBL1 ∩ LBL2 is again an LBL:
  LBL1 = { x = T1·i1 + u1 | A1·i1 >= b1 }
  LBL2 = { x = T2·i2 + u2 | A2·i2 >= b2 }
  T1·i1 + u1 = T2·i2 + u2            Diophantine system of equations
  { A1·i1 >= b1 , A2·i2 >= b2 }      new polytope
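In the simplest case, two one-dimensional references X[a1·i1 + u1] and X[a2·i2 + u2], the Diophantine system reduces to the single equation a1·i1 - a2·i2 = u2 - u1, which has integer solutions exactly when gcd(a1, a2) divides u2 - u1. The sketch below covers only this special case and ignores the loop bounds (the polytope constraints still have to be checked on the solutions); the array name X and the function names ext_gcd and solve_overlap_1d are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Extended Euclid: returns g = gcd(|a|, |b|) and sets *x, *y so that
 * a*(*x) + b*(*y) == g. */
long ext_gcd(long a, long b, long *x, long *y)
{
    if (b == 0) {
        *x = (a >= 0) ? 1 : -1;
        *y = 0;
        return labs(a);
    }
    long x1, y1;
    long g = ext_gcd(b, a % b, &x1, &y1);
    *x = y1;
    *y = x1 - (a / b) * y1;
    return g;
}

/* Solve a1*i1 + u1 == a2*i2 + u2 over the integers, ignoring loop bounds.
 * Returns 1 and stores one solution in *i1, *i2 if the two references can
 * address a common element, 0 otherwise. */
int solve_overlap_1d(long a1, long u1, long a2, long u2, long *i1, long *i2)
{
    assert(i1 != NULL && i2 != NULL);
    long x, y;
    long rhs = u2 - u1;                    /* a1*i1 - a2*i2 == rhs */
    long g = ext_gcd(a1, -a2, &x, &y);     /* a1*x + (-a2)*y == g */
    if (g == 0) {                          /* both coefficients are zero */
        if (rhs != 0)
            return 0;
        *i1 = 0;
        *i2 = 0;
        return 1;
    }
    if (rhs % g != 0)
        return 0;                          /* gcd does not divide rhs */
    *i1 = x * (rhs / g);
    *i2 = y * (rhs / g);
    return 1;
}
```

For example, X[2·i1 + 1] and X[4·i2 + 2] never overlap (odd versus even indexes), while X[3·i1 + 1] and X[5·i2 + 2] do, e.g. at index 7. The general multidimensional case needs a matrix decomposition rather than a single gcd.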

23 Memory size computation
Keeping minimal the set of inequalities in the LBL intersection
    for (i=0; i<n; i++)
      for (j=0; j<n; j++)
        A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
Iterator space (condition j < i):
  { 0 <= i , j <= n-1 , j+1 <= i }     (5 ineq.)
  { 0 <= j , i <= n-1 , j+1 <= i }     (3 ineq.)
[Figure: the triangular iterator space in the (i, j) plane, with i ranging from 1 to n-1]
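That the two dropped inequalities are redundant can be checked by enumeration for a concrete n. The sketch below is an added sanity check, not the redundancy-removal method of the slides; the function names are hypothetical. Both descriptions are compared on a window larger than the loop bounds, so that a missing inequality would show up as an extra point.

```c
#include <assert.h>

/* Full description: 0 <= i <= n-1, 0 <= j <= n-1, j+1 <= i (5 inequalities). */
static int in_full(int i, int j, int n)
{
    return 0 <= i && i <= n - 1 && 0 <= j && j <= n - 1 && j + 1 <= i;
}

/* Reduced description: 0 <= j, i <= n-1, j+1 <= i (3 inequalities). */
static int in_reduced(int i, int j, int n)
{
    return 0 <= j && i <= n - 1 && j + 1 <= i;
}

/* Compare both descriptions on the window [-n, 2n] x [-n, 2n].
 * Returns the common number of points, or -1 if they disagree anywhere. */
long compare_descriptions(int n)
{
    assert(n >= 1);
    long count = 0;
    for (int i = -n; i <= 2 * n; i++)
        for (int j = -n; j <= 2 * n; j++) {
            if (in_full(i, j, n) != in_reduced(i, j, n))
                return -1;
            count += in_full(i, j, n);
        }
    return count;
}
```

For n = 6 both descriptions select the same 15 points, i.e. n(n-1)/2. The inequalities 0 <= i and j <= n-1 are implied by the remaining three, since j >= 0 and j+1 <= i give i >= 1, and i <= n-1 with j+1 <= i gives j <= n-2.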

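24 Memory size computation
Keeping minimal the set of inequalities in the LBL intersection
The decomposition theorem of polyhedra [Motzkin 1953]:
\[
\text{Polyhedron} = \{\, x \mid C \cdot x = d ,\ A \cdot x \ge b \,\}
\]
\[
\text{Polyhedron} = \{\, x \mid x = V \cdot \alpha + L \cdot \beta + R \cdot \gamma \,\},
\qquad \alpha, \gamma \ge 0 ,\quad \sum_i \alpha_i = 1
\]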

25 Memory size computation
LBL’s of signal A (illustrative example)

26 Polyhedral data-dependence graphs
[Figure: polyhedral data-dependence graphs at granularity level = 0 and granularity level = 1]

27 Scalar-level data-dependence graph
[Figure: scalar-level data-dependence graph, granularity level = 2]

28 Polyhedral data-dependence graph
Motion detection algorithm [Chan 93]
[Figure: polyhedral data-dependence graph of the motion detection algorithm, annotated with # scalars and # dependencies]

29 Memory size computation
[Figure: memory size variation during the motion detection algorithm]

30 Memory size computation
To handle high-throughput applications:
  Extract the (largely hidden) parallelism from the initially specified code
  Find the lowest degree of parallelism that meets the throughput/hardware requirements
  Perform memory size computation for code with explicit parallelism instructions

31 Hierarchical memory allocation
A large part of the power dissipation in data-dominated applications is due to data transfers and data storage
Power cost reduction: a memory hierarchy exploiting the temporal locality in the data accesses
Power dissipation = f ( memory size , access frequency )

32 Hierarchical memory allocation
Power dissipation = f ( memory size , access freq. )
[Figure: heavily used data in a layer of small memories; a layer of large memories]

33 Hierarchical memory allocation
Hierarchical distribution vs. non-hierarchical distribution
Trade-offs:
  Lower power consumption by accessing from smaller memories
  Higher power consumption due to the additional transfers needed to store copies of data
  Larger area: additional area overhead (addressing logic)

34 Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints
1. Data reuse exploration
   Which intermediate copies of data are necessary for accessing data in a power- and area-efficient way
2. Memory allocation & assignment
   Distributed (hierarchical) memory architecture (memory layers, memory size/ports/address logic, signal-to-memory & signal-to-port assignment)

35 Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints
1. Data reuse exploration
   Array partitions to be considered as copy candidates: the LBLs from the recursive intersection of array references
2. Memory allocation & assignment
\[
\text{Cost} = \alpha \cdot \sum P_{\text{read/write}}(N_{\text{bits}}, N_{\text{words}}, f_{\text{read/write}})
            + \beta \cdot \sum \text{Area}(N_{\text{bits}}, N_{\text{words}}, N_{\text{ports}}, \text{technology})
\]

36 Partitioning for on/off-chip memories
[Figure: the memory address space seen by the CPU: on-chip SRAM (1 cycle), cache (1 cycle), off-chip DRAM (10-20 cycles)]
Optimal data mapping to the SRAM / DRAM to maximize the performance of the application

37 Partitioning for on/off-chip memories
Total conflict factor: the total number of array accesses exposed to cache conflicts
  It measures the importance of mapping an array to the on-chip SRAM
Using the polyhedral data-dependence graph
  Precise information about the relative lifetimes of the different parts of the arrays

38 Conclusions
Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing
Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
  memory size computation for behavioral specifications
  hierarchical memory allocation
  data partitioning between on- and off-chip memories
The End

