Florin Balasa University of Illinois at Chicago

Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems

Introduction
Real-time multimedia processing systems (video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.)
- A large part of the power dissipation is due to data transfer and data storage.
- Fetching the operands of an addition from an off-chip memory consumes 33 times more power than the computation itself [Catthoor 98].
- The area cost is often largely dominated by the memories.

Introduction
In the early years of high-level synthesis, memory management tasks were tackled at the scalar level.
Algebraic techniques, similar to those used in modern compilers, make it possible to handle memory management at the non-scalar level.
Requirement: addressing the entire class of affine specifications:
- multidimensional signals with (complex) affine indexes
- loop nests whose boundaries are affine functions of the iterators
- conditions built from relational and/or logical operators on affine functions

Outline
- Memory size computation using data dependence analysis
- Hierarchical memory allocation based on data reuse analysis
- Data-flow driven data partitioning for on-/off-chip memories
- Conclusions

Computation of array reference size
    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …

How many memory locations are necessary to store the array references A[2i+3j+1][5i+j+2][4i+6j+3] and B[i+k][j+k]?

    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …

Is it the number of iterator triplets (i,j,k), that is, 512^3? No: distinct triplets can address the same element. For example, (i,j,k) = (0,1,1) and (i,j,k) = (1,2,0) both access B[1][2].

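    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …

Is it the number of index pairs (i+k, j+k) in the bounding box, that is, 1023^2 (since 0 <= i+k, j+k <= 1022)? No: some pairs in that box are never addressed. For example, no triplet (i,j,k) accesses B[0][512].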
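The two wrong guesses above can be checked by brute force. The following is a minimal sketch, not the method of the presentation: it simply marks every element B[i+k][j+k] that is touched and counts the marks. The function name `count_distinct_b` and the parameter `n` (the common upper loop bound, 511 in the slides) are assumptions introduced for illustration.

```c
#include <stdlib.h>

/* Count the distinct elements B[i+k][j+k] touched when i, j, k each run
 * from 0 to n (inclusive), by marking every accessed index pair once.
 * Returns -1 if the scratch array cannot be allocated. */
long count_distinct_b(int n)
{
    int side = 2 * n + 1;                 /* i+k and j+k both lie in 0..2n */
    unsigned char *seen = calloc((size_t)side * (size_t)side, 1);
    long count = 0;

    if (seen == NULL)
        return -1;

    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
            for (int k = 0; k <= n; k++) {
                size_t pos = (size_t)(i + k) * (size_t)side + (size_t)(j + k);
                if (!seen[pos]) {         /* first access to this element */
                    seen[pos] = 1;
                    count++;
                }
            }

    free(seen);
    return count;
}
```

For a loop bound of 7 the function returns 169, fewer than both the 512 iterator triplets and the 225 index pairs of the bounding box. The enumeration grows with the cube of the bound, which is why an algebraic computation is preferable for realistic loop nests.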

    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …

[Figure: the iterator space (axes i, j) is mapped onto the index space of A[x][y][z], with x = 2i+3j+1, y = 5i+j+2, z = 4i+6j+3]

[Figure: iterator space (axes i, j) and index space of A[x][y][z] for the reference A[2i+3j+1][5i+j+2][4i+6j+3]]

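Remark
The iterator space may have "holes" too:

    for (i=4; i<=8; i++)
      for (j=i-2; j<=i+2; j+=2)
        … C[i+j] …

After normalization (unit loop steps), the same accesses are written as:

    for (i=4; i<=8; i++)
      for (j=0; j<=2; j++)
        … C[2i+2j-2] …

[Figure: the iterator space in the (i, j) plane before and after normalization]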
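As a quick sanity check of the normalization in this remark, the sketch below marks the elements of C touched by each loop nest and compares the two sets. The helper names and the constant `MAX_INDEX` are hypothetical; only the loop bounds and index expressions come from the slide.

```c
#include <string.h>

#define MAX_INDEX 32   /* hypothetical bound, large enough for both nests */

/* Original nest: the inner loop has step 2 and iterator-dependent bounds. */
void mark_original(unsigned char touched[MAX_INDEX])
{
    memset(touched, 0, MAX_INDEX);
    for (int i = 4; i <= 8; i++)
        for (int j = i - 2; j <= i + 2; j += 2)
            touched[i + j] = 1;
}

/* Normalized nest: unit step, the old j equals i - 2 + 2*j. */
void mark_normalized(unsigned char touched[MAX_INDEX])
{
    memset(touched, 0, MAX_INDEX);
    for (int i = 4; i <= 8; i++)
        for (int j = 0; j <= 2; j++)
            touched[2 * i + 2 * j - 2] = 1;
}

/* Returns 1 when both nests touch exactly the same elements of C. */
int same_accesses(void)
{
    unsigned char a[MAX_INDEX], b[MAX_INDEX];
    mark_original(a);
    mark_normalized(b);
    return memcmp(a, b, MAX_INDEX) == 0;
}
```

Both nests touch only the even indices from 6 to 18, so normalization moves the "holes" from the iterator space into the coefficients of the index expression.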

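    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …

Affine mapping from the iterator space to the index space:

\[
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix},
\qquad 0 \le i, j \le 511
\]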

    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …

[Figure: the iterator space (axes i, j, k) is mapped onto the index space of B[x][y], with x = i+k, y = j+k]

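    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …

Affine mapping from the iterator space to the index space:

\[
\begin{pmatrix} x \\ y \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}
\begin{pmatrix} i \\ j \\ k \end{pmatrix},
\qquad 0 \le i, j, k \le 511
\]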

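Any array reference can be modeled as a linearly bounded lattice (LBL):

\[ \mathrm{LBL} = \{\, x = T \cdot i + u \mid A \cdot i \ge b \,\} \]

- x = T·i + u is the affine mapping.
- A·i >= b is the iterator space, a polytope given by the scope of the nested loops and by the iterator-dependent conditions.
- The LBL is the image of the polytope under the affine mapping.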

The size of the array reference is the size of its index space, which is an LBL:

\[ \mathrm{LBL} = \{\, x = T \cdot i + u \mid A \cdot i \ge b \,\} \]

\[ f : \mathbb{Z}^n \to \mathbb{Z}^m, \qquad f(i) = T \cdot i + u \]

Is the function f a one-to-one mapping? If yes, Size(index space) = Size(iterator space).

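\[ f : \mathbb{Z}^n \to \mathbb{Z}^m, \qquad f(i) = T \cdot i + u \]

\[ P \cdot T \cdot S = \begin{pmatrix} H & 0 \\ G & 0 \end{pmatrix} \qquad \text{[Minoux 86]} \]

- H: nonsingular lower-triangular matrix
- S: unimodular matrix
- P: row permutation

When rank(H) = m <= n, H is the Hermite Normal Form.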

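Case 1: rank(H) = n, so the function f is a one-to-one mapping.

    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        … A[2i+3j+1][5i+j+2][4i+6j+3] …

\[
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
+
\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}
\]

\[
P \cdot T \cdot S
= I_3
\begin{pmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{pmatrix}
\begin{pmatrix} -1 & 3 \\ 1 & -2 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ -4 & 13 \\ 2 & 0 \end{pmatrix},
\qquad
H = \begin{pmatrix} 1 & 0 \\ -4 & 13 \end{pmatrix},
\quad
G = \begin{pmatrix} 2 & 0 \end{pmatrix}
\]

Number of locations of A[ ][ ][ ] = size(0 <= i, j <= 511) = 512 x 512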
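The matrix identity of Case 1 can be verified numerically. This sketch multiplies the index matrix T of A[2i+3j+1][5i+j+2][4i+6j+3] by S and checks that S is unimodular. The entries of S used in the usage note, (-1, 3; 1, -2), are read from the flattened matrix on the slide, which is an assumption about the garbled layout; the function names are hypothetical.

```c
/* Multiply the 3x2 index matrix t by the 2x2 matrix s: out = t * s. */
void mul_3x2_2x2(int t[3][2], int s[2][2], int out[3][2])
{
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 2; c++)
            out[r][c] = t[r][0] * s[0][c] + t[r][1] * s[1][c];
}

/* Determinant of a 2x2 integer matrix; a value of +1 or -1 means that
 * the matrix is unimodular. */
int det_2x2(int s[2][2])
{
    return s[0][0] * s[1][1] - s[0][1] * s[1][0];
}
```

With T = (2, 3; 5, 1; 4, 6) and S = (-1, 3; 1, -2), the product is (1, 0; -4, 13; 2, 0) and the determinant of S is -1. The top two rows form a nonsingular lower-triangular H of rank 2 = n, which is the condition for the mapping to be one-to-one.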

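Case 2: rank(H) < n

    for (i=0; i<=511; i++)
      for (j=0; j<=511; j++)
        …
        for (k=0; k<=511; k++)
          … B[i+k][j+k] …

\[
P \cdot T \cdot S
= I_2
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},
\qquad
H = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]

With the change of variables (i, j, k) = S·(I, J, K), that is i = I-K, j = J-K, k = K, the index expressions become x = I and y = J, and the iterator space is rewritten as:

{ 0 <= i , j , k <= 511 }  becomes  { 0 <= I-K , J-K , K <= 511 }

| B[i+k][j+k] | = size ( 0 <= I, J <= 1022 , I-511 <= J <= I+511 ) = 784,897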
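The final count of Case 2 can be reproduced by scanning the two-dimensional polytope row by row. This is a minimal sketch for this specific polytope only; the function name `size_index_space_b` and the parameter `n` (the upper loop bound, 511 in the slide) are assumptions introduced here.

```c
/* Number of integer points (x, y) with 0 <= x, y <= 2n and
 * x - n <= y <= x + n. Here x plays the role of I and y the role of J,
 * so the result is the size of the index space of B[i+k][j+k] when
 * i, j, k run from 0 to n. */
long size_index_space_b(long n)
{
    long total = 0;
    for (long x = 0; x <= 2 * n; x++) {
        long lo = (x - n > 0) ? x - n : 0;          /* max(0, x - n)  */
        long hi = (x + n < 2 * n) ? x + n : 2 * n;  /* min(2n, x + n) */
        total += hi - lo + 1;                       /* y values for this x */
    }
    return total;
}
```

Calling `size_index_space_b(511)` gives 784,897, the value stated on the slide, in about a thousand loop iterations instead of the 134 million needed to enumerate the iterator space.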

Array reference B[i+k][j+k]

Loop bounds   Size of iterator space   Size of index space (# storage locations)   |index space| / |iterator space|
7                         512                    169                                33 %
15                      4,096                    721                                17 %
31                     32,768                  2,977                                9.0 %
63                    262,144                 12,097                                4.6 %
127                 2,097,152                 48,769                                2.3 %
255                16,777,216                195,841                                1.1 %
511               134,217,728                784,897                                0.5 %
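The index-space column of this table follows a simple closed form. The derivation below is not on the slides; it is a short worked calculation, with N denoting the common upper loop bound. It counts the (2N+1)^2 pairs of the bounding box and removes the two corner triangles excluded by I-N <= J <= I+N, each of which holds 1 + 2 + … + N points.

\[
|\text{index space}| = (2N+1)^2 - 2 \cdot \frac{N(N+1)}{2} = 3N^2 + 3N + 1,
\qquad
|\text{iterator space}| = (N+1)^3
\]

\[
N = 7:\; 3 \cdot 49 + 21 + 1 = 169,
\qquad
N = 511:\; 3 \cdot 261{,}121 + 1{,}533 + 1 = 784{,}897
\]

The ratio therefore behaves like 3/N for large N, which matches the steady decrease of the last column.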

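Computation of the size of an integer polytope: the Fourier-Motzkin elimination

n-dim polytope: \( \sum_k a_{ik} x_k \ge b_i \), with the inequalities sorted into three groups:
1. \( x_n \ge D_i(x_1, \ldots, x_{n-1}) \)
2. \( x_n \le E_j(x_1, \ldots, x_{n-1}) \)
3. \( 0 \le F_k(x_1, \ldots, x_{n-1}) \)

Eliminating x_n gives the (n-1)-dim polytope:
\( D_i(x_1, \ldots, x_{n-1}) \le E_j(x_1, \ldots, x_{n-1}) \)
\( 0 \le F_k(x_1, \ldots, x_{n-1}) \)

Repeating the elimination down to a 1-dim polytope gives the range of x_1. For each value of x_1 in that range, add the size of the corresponding (n-1)-dim polytope.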
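The counting scheme can be sketched for the two-dimensional case as follows. This is a simplified illustration under stated assumptions, not the full algorithm: the range of x1 is supplied by the caller (a complete Fourier-Motzkin elimination would derive it by eliminating x2), the polytope is assumed bounded in x2, and the names `struct ineq`, `floor_div`, and `count_points_2d` are hypothetical.

```c
#include <limits.h>

/* One inequality a1*x1 + a2*x2 >= b over the integers. */
struct ineq { long a1, a2, b; };

/* Floor of p / q for q > 0 (the C operator '/' truncates toward zero). */
static long floor_div(long p, long q)
{
    long d = p / q;
    if (p % q != 0 && p < 0)
        d--;
    return d;
}

/* Count the integer points of a 2-D polytope by scanning x1 over
 * [x1_lo, x1_hi] and, for each value, deriving the interval of x2.
 * Assumes that x2 is bounded from both sides by the inequalities. */
long count_points_2d(const struct ineq *c, int m, long x1_lo, long x1_hi)
{
    long total = 0;
    for (long x1 = x1_lo; x1 <= x1_hi; x1++) {
        long lo = LONG_MIN, hi = LONG_MAX;
        int feasible = 1;
        for (int t = 0; t < m; t++) {
            long r = c[t].b - c[t].a1 * x1;            /* a2*x2 >= r      */
            if (c[t].a2 > 0) {                         /* lower bound on x2 */
                long bound = -floor_div(-r, c[t].a2);  /* ceil(r / a2)    */
                if (bound > lo) lo = bound;
            } else if (c[t].a2 < 0) {                  /* upper bound on x2 */
                long bound = floor_div(-r, -c[t].a2);  /* floor(r / a2)   */
                if (bound < hi) hi = bound;
            } else if (r > 0) {
                feasible = 0;                          /* 0 >= r fails    */
            }
        }
        if (feasible && lo <= hi)
            total += hi - lo + 1;                      /* size of 1-D slice */
    }
    return total;
}
```

Each inequality with a positive coefficient of x2 acts as a bound of type 1, each with a negative coefficient as a bound of type 2, and each with a zero coefficient as a condition of type 3 on x1 alone.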

Memory size computation
    #define n 6

    for ( j=0; j<n; j++ ) {
      A[j][0] = in0;
      for ( i=0; i<n; i++ )
        A[j][i+1] = A[j][i] + 1;
    }
    for ( i=0; i<n; i++ ) {
      alpha[i] = A[i][n+i];
      for ( j=0; j<n; j++ )
        A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
    }
    for ( j=0; j<n; j++ )
      B[j] = A[j][2*n];

Decompose the LBLs of the array references into non-overlapping pieces.

LBL1 = { x = T1·i1 + u1 | A1·i1 >= b1 }
LBL2 = { x = T2·i2 + u2 | A2·i2 >= b2 }

The intersection LBL1 ∩ LBL2 is again an LBL, obtained from:
- T1·i1 + u1 = T2·i2 + u2 (a Diophantine system of equations)
- { A1·i1 >= b1 , A2·i2 >= b2 } (a new polytope)
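For one-dimensional references the intersection test reduces to a single linear Diophantine equation. The sketch below is a deliberately simplified illustration, assuming two references X[a1*i + c1] and X[a2*k + c2] with positive strides and iterators starting at 0; the array name X and all function names are hypothetical. The divisibility check is the standard solvability condition for a1*i - a2*k = c2 - c1, and the enumeration accounts for the loop bounds.

```c
/* Greatest common divisor of two positive integers (Euclid). */
static long gcd(long a, long b)
{
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Necessary condition for X[a1*i + c1] and X[a2*k + c2] to share an
 * element: a1*i - a2*k = c2 - c1 has an integer solution only if
 * gcd(a1, a2) divides c2 - c1. Assumes a1 > 0 and a2 > 0. */
int may_overlap(long a1, long c1, long a2, long c2)
{
    return (c2 - c1) % gcd(a1, a2) == 0;
}

/* Exact size of the intersection when i runs over 0..n1 and k over
 * 0..n2, obtained by enumerating both iterator ranges. With positive
 * strides each common element is counted exactly once. */
long overlap_size(long a1, long c1, long n1, long a2, long c2, long n2)
{
    long count = 0;
    for (long i = 0; i <= n1; i++)
        for (long k = 0; k <= n2; k++)
            if (a1 * i + c1 == a2 * k + c2)
                count++;
    return count;
}
```

For example, X[2i] and X[2k+1] can never overlap because 2 does not divide 1, while X[2i] and X[3k] with both iterators in 0..10 share the four elements 0, 6, 12, and 18.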

Keeping minimal the set of inequalities in the LBL intersection

    for ( i=0; i<n; i++ )
      for ( j=0; j<n; j++ )
        A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];

Iterator space for the branch j < i:
{ 0 <= i , j <= n-1 , j+1 <= i }   (5 inequalities)
{ 0 <= j , i <= n-1 , j+1 <= i }   (3 inequalities, the redundant ones removed)

[Figure: the triangular iterator space in the (i, j) plane, with both axes running up to n-1]
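That the two descriptions above define the same set of iterator points can be confirmed by exhaustive comparison. This is only an illustrative check (the function name `same_iterator_space` is hypothetical); it scans a window wider than the loop bounds so that a wrongly dropped bound would be detected.

```c
/* Returns 1 when the five-inequality and the three-inequality
 * descriptions select exactly the same integer points (i, j) inside
 * the window -n..2n in both directions, and 0 otherwise. */
int same_iterator_space(int n)
{
    for (int i = -n; i <= 2 * n; i++)
        for (int j = -n; j <= 2 * n; j++) {
            int five  = (0 <= i) && (i <= n - 1) &&
                        (0 <= j) && (j <= n - 1) && (j + 1 <= i);
            int three = (0 <= j) && (i <= n - 1) && (j + 1 <= i);
            if (five != three)
                return 0;
        }
    return 1;
}
```

The two dropped inequalities are implied by the remaining ones: 0 <= j and j+1 <= i give 0 <= i, and j+1 <= i together with i <= n-1 gives j <= n-1.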

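Keeping minimal the set of inequalities in the LBL intersection
The decomposition theorem of polyhedra [Motzkin 1953]:

\[ \text{Polyhedron} = \{\, x \mid C \cdot x = d ,\; A \cdot x \ge b \,\} \]

\[ \text{Polyhedron} = \{\, x \mid x = V \cdot \alpha + L \cdot \beta + R \cdot \gamma \,\}, \qquad \alpha, \gamma \ge 0 ,\quad \sum_i \alpha_i = 1 \]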

LBL’s of signal A (illustrative example)

Polyhedral data-dependence graphs
[Figures: polyhedral data-dependence graphs at granularity level 0 and at granularity level 1]

Scalar-level data-dependence graph
[Figure: scalar-level data-dependence graph, granularity level 2]

Polyhedral data-dependence graph of the motion detection algorithm [Chan 93]
[Figure: graph annotated with the number of scalars and the number of dependencies]

Memory size variation during the motion detection algorithm

To handle high-throughput applications:
- Extract the (largely hidden) parallelism from the initially specified code.
- Find the lowest degree of parallelism that meets the throughput/hardware requirements.
- Perform the memory size computation for code with explicit parallelism instructions.

Hierarchical memory allocation
A large part of the power dissipation in data-dominated applications is due to data transfers and data storage.
Power cost reduction: a memory hierarchy exploiting the temporal locality in the data accesses.
Power dissipation = f(memory size, access frequency)

Power dissipation = f(memory size, access frequency)
[Figure: heavily used data are held in a layer of small memories, placed in front of a layer of large memories]

Hierarchical vs. non-hierarchical distribution: trade-offs
- Lower power consumption, by accessing data from smaller memories
- Higher power consumption, due to the additional transfers needed to store copies of data
- Larger area: additional area overhead (addressing logic)

Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints
1. Data reuse exploration
   Which intermediate copies of data are necessary for accessing data in a power- and area-efficient way?
2. Memory allocation & assignment
   Distributed (hierarchical) memory architecture (memory layers; memory size, ports, address logic; signal-to-memory and signal-to-port assignment)

Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints
1. Data reuse exploration
   Array partitions to be considered as copy candidates: the LBLs obtained from the recursive intersection of the array references.
2. Memory allocation & assignment

\[
\text{Cost} = \alpha \cdot \sum P_{\text{read/write}}(N_{\text{bits}}, N_{\text{words}}, f_{\text{read/write}})
\;+\; \beta \cdot \sum \text{Area}(N_{\text{bits}}, N_{\text{words}}, N_{\text{ports}}, \text{technology})
\]

Partitioning for on-/off-chip memories
[Figure: the CPU reaches the on-chip SRAM in 1 cycle, the cache in 1 cycle, and the off-chip DRAM in 10-20 cycles; the SRAM and the DRAM share the memory address space]
Goal: an optimal mapping of the data to the SRAM / DRAM that maximizes the performance of the application.

Partitioning for on-/off-chip memories
- Total number of array accesses exposed to cache conflicts
- Total conflict factor: the importance of mapping to the on-chip SRAM
- Using the polyhedral data-dependence graph: precise information about the relative lifetimes of the different parts of arrays

Conclusions
- Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing.
- Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
  - memory size computation for behavioral specifications
  - hierarchical memory allocation
  - data partitioning between on- and off-chip memories

The End
