Published by Emma Davis. Modified over 9 years ago.
1
Novel Algorithms in the Memory Management of Multi-Dimensional Signal Processing
Florin Balasa
University of Illinois at Chicago
2
Outline
- The importance of memory management in multi-dimensional signal processing
- A lattice-based framework
- The computation of the minimum data memory size
- Optimization of the dynamic energy consumption in a hierarchical memory subsystem
- Mapping multi-dimensional signals into hierarchical memory organizations
- Future research directions
- Conclusions
3
Memory management for signal processing applications
Real-time multi-dimensional signal processing systems (video and image processing, telecommunications, audio and speech coding, medical imaging, etc.)
Data transfer and data storage affect:
- system performance
- power consumption
- chip area
The designer must focus on the exploration of the memory subsystem.
4
Memory management for signal processing applications
In the early years of high-level synthesis:
- register-transfer level (RTL) algorithmic specifications
- memory management tasks tackled at scalar level
More recently:
- high-level algorithmic specifications
- memory management tasks at non-scalar level
- algebraic techniques (similar to those used in modern compilers)
5
Memory management for signal processing applications
Affine algorithmic specifications

    T[0] = 0;
    for ( j=16; j<=512; j++ ) {
      S[0][j-16][0] = 0;
      for ( k=0; k<=8; k++ )
        for ( i=j-16; i<=j+16; i++ )
          S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + A[4][j] - A[k][i];
      T[j-15] = S[0][j-16][297] + T[j-16];
    }
    out = T[497];

- Loop-organized algorithmic specification
- Main data structures: multi-dimensional arrays
6
A Lattice-Based Framework

    for (i=0; i<=4; i++)
      for (j=0; j <= 2*i && j <= -i+6; j++)
        … A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] …

Iterator space: the (i, j) points visited by the loop nest
Index space: the elements A[x][y][z] with x = 2i+3j+1, y = 5i+j+3, z = 4i+6j+2
7
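A Lattice-Based Framework

    for (i=0; i<=4; i++)
      for (j=0; j <= 2*i && j <= -i+6; j++)
        … A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] …

Iterator space: 0 <= i <= 4, 0 <= j <= 2i, j <= -i+6
Affine mapping from the iterator space to the index space:

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} 2 & 3 \\ 5 & 1 \\ 4 & 6 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
+
\begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}
\]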
8
A Lattice-Based Framework
Any array reference can be modeled as a linearly bounded lattice (LBL):
LBL = { x = T·i + u | A·i >= b }
- x = T·i + u: the affine mapping
- A·i >= b: the iterator space, given by the scope of the nested loops and the iterator-dependent conditions
Polytope -> (affine mapping) -> LBL
9
A Lattice-Based Framework

    for (i=0; i<=4; i++)
      for (j=0; j <= 2*i && j <= -i+6; j++)
        … A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] …

How many memory locations are necessary to store the array reference A[2i+3j+1][5i+j+3][4i+6j+2]?
10
A Lattice-Based Framework
The storage requirement of an array reference is the size of its index space (i.e., a lattice!).
LBL = { x = T·i + u | A·i >= b }
f : Z^n -> Z^m, f(i) = T·i + u
Is the function f a one-to-one mapping?
If YES: Size(index space) = Size(iterator space)
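The size argument can be checked by brute force on the deck's running example. The sketch below is only an illustration (the type `IndexPoint` and the function `count_spaces` are hypothetical names, and enumeration is not the formal technique the slides describe): it counts the iterator points of the loop nest and the distinct elements of A they reach. Because the two columns of T, (2, 5, 4) and (3, 1, 6), are linearly independent, the mapping is one-to-one and both counts should agree.

```c
#include <stdbool.h>
#include <stddef.h>

/* One point of the index space of A[2i+3j+1][5i+j+3][4i+6j+2]. */
typedef struct { int x, y, z; } IndexPoint;

/* Brute-force illustration: counts the iterator-space points of the
 * example loop nest and the distinct index-space points they map to. */
void count_spaces(size_t *iterator_size, size_t *index_size)
{
    IndexPoint seen[64];   /* more than enough for this small example */
    size_t n_iter = 0, n_index = 0;

    for (int i = 0; i <= 4; i++) {
        for (int j = 0; j <= 2 * i && j <= -i + 6; j++) {
            IndexPoint p = { 2*i + 3*j + 1, 5*i + j + 3, 4*i + 6*j + 2 };
            bool duplicate = false;
            for (size_t k = 0; k < n_index; k++) {
                if (seen[k].x == p.x && seen[k].y == p.y && seen[k].z == p.z) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate)
                seen[n_index++] = p;
            n_iter++;
        }
    }
    *iterator_size = n_iter;
    *index_size = n_index;
}
```

For this loop nest both counts come out as 16, so 16 memory locations suffice for the array reference.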
11
A Lattice-Based Framework
Computation of the size of an integer polytope

    for (i=0; i<=4; i++)
      for (j=0; j <= 2*i && j <= -i+6; j++)
        … A[2*i+3*j+1][5*i+j+3][4*i+6*j+2] …

Step 1: Find the vertices of the iterator space and their supporting polyhedral cones.

\[
C(V_1) = \{ r_1, r_2 \} = \left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \end{bmatrix} \right\}
\]
12
A Lattice-Based Framework
Computation of the size of an integer polytope (cont'd)
Step 2: Decompose the supporting cones into unimodular cones (Barvinok's decomposition algorithm). Each matrix lists the generators of one cone as its columns.

\[
C(V_1) = \begin{bmatrix} 1 & 0 \\ 2 & -1 \end{bmatrix} + \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\]

Step 3: Find the generating function of each supporting cone.

\[
F(V_1) = \frac{1}{(1 - x y^{2})(1 - y^{-1})} + \frac{1}{(1 - y)(1 - x)}
\]

Step 4: Find the number of monomials in the generating function of the whole polytope.

\[
F = F(V_1) + F(V_2) + \dots
\]
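As a worked check that is not on the original slide, the two fractions of F(V_1) in Step 3 can be combined into a single rational function. The derivation assumes that V_1 is the vertex at the origin, whose supporting cone is generated by (1, 2) and (1, 0).

```latex
\begin{align*}
F(V_1) &= \frac{1}{(1 - x y^{2})(1 - y^{-1})} + \frac{1}{(1 - y)(1 - x)} \\
       &= \frac{-y}{(1 - x y^{2})(1 - y)} + \frac{1}{(1 - y)(1 - x)} \\
       &= \frac{(1 - x y^{2}) - y\,(1 - x)}{(1 - x)(1 - y)(1 - x y^{2})} \\
       &= \frac{(1 - y)(1 + x y)}{(1 - x)(1 - y)(1 - x y^{2})}
        = \frac{1 + x y}{(1 - x)(1 - x y^{2})}
\end{align*}
```

Expanding the last form as a power series gives exactly one monomial x^p y^q for every integer point with 0 <= q <= 2p, which is the cone spanned by (1, 0) and (1, 2). The numerator 1 + xy accounts for the two integer points (0, 0) and (1, 1) of the fundamental parallelepiped of this non-unimodular cone.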
13
The Memory Size Computation Problem
Affine algorithmic specifications

    T[0] = 0;
    for ( j=16; j<=512; j++ ) {
      S[0][j-16][0] = 0;
      for ( k=0; k<=8; k++ )
        for ( i=j-16; i<=j+16; i++ )
          S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + A[4][j] - A[k][i];
      T[j-15] = S[0][j-16][297] + T[j-16];
    }
    out = T[497];

What is the minimum data storage necessary to execute an algorithm (affine specification)?
- Any scalar signal must be stored only during its lifetime.
- Signals having disjoint lifetimes can share the same location.
14
The Memory Size Computation Problem
Affine algorithmic specification: the same code as on the previous slide.
- The number of scalars (array elements): 153,366
- The minimum data storage: 4,763
All the previous works proposed estimation techniques!
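The scalar count can be reproduced by marking every array element the specification touches. This is a brute-force sketch for illustration only, not the lattice-based method of the deck; the function name `count_scalars` and the flag arrays are hypothetical, and the array bounds are derived from the loop limits.

```c
#include <stdbool.h>
#include <string.h>

/* Flags marking which array elements the specification touches.
 * Bounds follow from the loop limits: j in [16,512], k in [0,8],
 * i in [j-16, j+16], hence i in [0,528]. */
static bool used_S[497][298];  /* S[0][j-16][0..297] */
static bool used_A[9][529];    /* A[k][i] and A[4][j] */
static bool used_T[498];       /* T[0..497] */

long count_scalars(void)
{
    memset(used_S, 0, sizeof used_S);
    memset(used_A, 0, sizeof used_A);
    memset(used_T, 0, sizeof used_T);

    used_T[0] = true;
    for (int j = 16; j <= 512; j++) {
        used_S[j - 16][0] = true;
        for (int k = 0; k <= 8; k++) {
            for (int i = j - 16; i <= j + 16; i++) {
                used_S[j - 16][33*k + i - j + 17] = true;
                used_S[j - 16][33*k + i - j + 16] = true;
                used_A[4][j] = true;
                used_A[k][i] = true;
            }
        }
        used_S[j - 16][297] = true;
        used_T[j - 15] = true;
        used_T[j - 16] = true;
    }

    /* Counting `out` as one scalar reproduces the figure on the slide. */
    long total = 1;
    for (int a = 0; a < 497; a++)
        for (int b = 0; b < 298; b++)
            total += used_S[a][b];
    for (int a = 0; a < 9; a++)
        for (int b = 0; b < 529; b++)
            total += used_A[a][b];
    for (int a = 0; a < 498; a++)
        total += used_T[a];
    return total;
}
```

The total breaks down as 497 x 298 elements of S, 9 x 529 elements of A, 498 elements of T, and the scalar `out`, which adds up to 153,366.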
15
The Memory Size Computation Problem

    #define n 6

    for ( j=0; j<n; j++ ) {
      A[j][0] = in0;
      for ( i=0; i<n; i++ )
        A[j][i+1] = A[j][i] + 1;
    }
    for ( i=0; i<n; i++ ) {
      alpha[i] = A[i][n+i];
      for ( j=0; j<n; j++ )
        A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
    }
    for ( j=0; j<n; j++ )
      B[j] = A[j][2*n];
16
The Memory Size Computation Problem
Decompose the LBLs of the array references into disjoint lattices.
LBL_1 = { x = T_1·i_1 + u_1 | A_1·i_1 >= b_1 }
LBL_2 = { x = T_2·i_2 + u_2 | A_2·i_2 >= b_2 }
The intersection LBL_1 ∩ LBL_2 is found from:
- T_1·i_1 + u_1 = T_2·i_2 + u_2 (a Diophantine system of equations)
- { A_1·i_1 >= b_1, A_2·i_2 >= b_2 } (a new polytope)
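A one-dimensional special case makes the two ingredients concrete. In the sketch below, which is an illustration under stated assumptions and not the deck's algorithm, `Ref1D`, `may_overlap`, and `intersection_size` are hypothetical names. The first function tests whether the Diophantine equation a1·i1 + u1 = a2·i2 + u2 has any integer solution (ignoring the loop bounds), and the second counts the common elements exactly by enumerating both bounded ranges. It assumes a nonzero coefficient in the first reference, so that distinct iterations reach distinct elements.

```c
#include <stdbool.h>
#include <stdlib.h>

/* A 1-D array reference A[a*i + u] with lo <= i <= hi (hypothetical type). */
typedef struct { int a, u, lo, hi; } Ref1D;

static int gcd(int p, int q)
{
    p = abs(p);
    q = abs(q);
    while (q != 0) {
        int r = p % q;
        p = q;
        q = r;
    }
    return p;
}

/* Necessary condition for overlap: a1*i1 - a2*i2 = u2 - u1 has integer
 * solutions iff gcd(a1, a2) divides (u2 - u1). Loop bounds are ignored. */
bool may_overlap(Ref1D r1, Ref1D r2)
{
    int g = gcd(r1.a, r2.a);
    if (g == 0)
        return r1.u == r2.u;
    return (r2.u - r1.u) % g == 0;
}

/* Exact number of elements accessed by both references, by enumeration.
 * Assumes r1.a != 0, so each i1 addresses a distinct element. */
int intersection_size(Ref1D r1, Ref1D r2)
{
    int count = 0;
    for (int i1 = r1.lo; i1 <= r1.hi; i1++) {
        for (int i2 = r2.lo; i2 <= r2.hi; i2++) {
            if (r1.a * i1 + r1.u == r2.a * i2 + r2.u) {
                count++;
                break;
            }
        }
    }
    return count;
}
```

For example, A[2i+1] with 0 <= i <= 9 and A[3j] with 0 <= j <= 6 share the elements 3, 9, and 15, while A[2i] and A[2j+1] can never overlap because gcd(2, 2) = 2 does not divide 1.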
17
The Memory Size Computation Problem
[Figure: decomposition of the array references of signal A (illustrative example)]
18
Memory Size Computation Algorithm
Step 1: For every indexed signal in the algorithmic specification, decompose the array references into disjoint lattices.
Step 2: Based on the lattice lifetime analysis, find the memory size at the boundaries between the blocks of code.
Step 3: By analyzing the amounts of signals produced and consumed in each block, prune the blocks of code where the maximum storage cannot happen.
Step 4: For each of the remaining blocks, compute the maximum memory size by
- computing the maximum iterator vectors of the scalars
- exploiting the one-to-one mapping property of array references
19
Memory trace for an SVD updating algorithm
20
Memory trace for a 2-D Gaussian blur filter algorithm
21
Study the effect of loop transformations on the data memory

Loop order (i, j): 784 locations

    for ( i = 0; i < 95; i++ )
      for ( j = 0; j < 32; j++ ) {
        if ( i+j > 30 && i+j < 63 ) A[i][j] = … ;
        if ( i+j > 62 && i+j < 95 ) … = A[i-32][j] ;
      }

Loop order (j, i): 32 locations

    for ( j = 0; j < 32; j++ )
      for ( i = 0; i < 95; i++ ) {
        if ( i+j > 30 && i+j < 63 ) A[i][j] = … ;
        if ( i+j > 62 && i+j < 95 ) … = A[i-32][j] ;
      }
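The two storage figures can be reproduced with a small lifetime simulation. The sketch below is illustrative only (the names `step` and `peak_storage` are hypothetical): it treats an element of A as alive from its write until its single read, executes the loop nest in the chosen order, and records the peak number of live elements.

```c
#include <stdbool.h>
#include <string.h>

static bool alive[95][32];   /* alive[r][c]: A[r][c] written, not yet read */

/* Executes the loop body for one iteration (i, j), updates the number of
 * live elements, and returns the updated peak. */
static int step(int i, int j, int *live, int peak)
{
    if (i + j > 30 && i + j < 63) {            /* A[i][j] = ... */
        if (!alive[i][j]) {
            alive[i][j] = true;
            (*live)++;
        }
        if (*live > peak)
            peak = *live;
    }
    if (i + j > 62 && i + j < 95) {            /* ... = A[i-32][j] */
        if (alive[i - 32][j]) {
            alive[i - 32][j] = false;          /* last (and only) read */
            (*live)--;
        }
    }
    return peak;
}

/* Peak number of simultaneously live elements of A.
 * i_outer selects the loop order: (i, j) if true, (j, i) otherwise. */
int peak_storage(bool i_outer)
{
    int live = 0, peak = 0;
    memset(alive, 0, sizeof alive);
    if (i_outer) {
        for (int i = 0; i < 95; i++)
            for (int j = 0; j < 32; j++)
                peak = step(i, j, &live, peak);
    } else {
        for (int j = 0; j < 32; j++)
            for (int i = 0; i < 95; i++)
                peak = step(i, j, &live, peak);
    }
    return peak;
}
```

With i as the outer loop, an element stays alive for 32 outer iterations and the peak is 784. With j as the outer loop, every element is written and read within the same outer iteration and the peak drops to 32.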
22
The Memory Size Computation Problem
- For the first time, the storage requirements of applications can be exactly computed using formal techniques. All the previous works are estimation techniques; they are sometimes very inaccurate.
- The previous works have constraints on the specifications. This approach works for the entire class of affine specifications.
- The previous works are illustrated with "simple" benchmarks (in terms of array elements, array references, lines of code). This approach was tested on complex benchmarks, e.g. code with 113 loop nests 3-level deep, 906 array references, over 900 lines of code, 4 million scalar signals.
23
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
- Multi-dimensional arrays are stored off-chip.
- Copies of the frequently accessed array parts should be stored on-chip.
Two-layer model: off-chip memory and on-chip scratch-pad memory (SPM)
[Figure labels: SPM, off-chip memory, copy candidate]
24
How to select the copy candidates?
- Entire arrays: unlikely …
- Rows/columns: somewhat better …
How to find array parts heavily accessed?
- The need of an array partitioning based on the intensity of memory accesses
- Data reuse model based on lattices
25
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
[Figure: lattice labels A1, A2, A3, A15, A16, A17]
26
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
- Decomposition in disjoint lattices
- Computation of the exact number of memory accesses per lattice
Lattice A17:
- #accesses A1: 13,569
- #accesses A2: 13,569
- #accesses A3: 13,569
- #accesses A17: 131,625
- Total #accesses A17: 172,332
27
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
[Figure: map of the array space of signal A based on average memory accesses]
28
[Figure: 3-D map of the array space of signal A based on the exact number of memory accesses]
29
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
[Figure: map of the array space of signal A based on average memory accesses]
30
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
[Figure: map of the array space of signal A based on average memory accesses]
31
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Dynamic energy: computed based on the number of accesses to each memory layer
CACTI power model [ Reinman 99 ]
- One or two orders of magnitude of difference in energy between an SPM access and an off-chip access
- Energy per access is SPM size-dependent: constant for small SPM sizes (< a few Kbytes)
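The access-count model can be written down directly. The sketch below is a minimal illustration for the two-layer case; the function names are hypothetical and the per-access energies are caller-supplied placeholders, not values taken from CACTI.

```c
/* Dynamic energy of a two-layer hierarchy: number of accesses to each
 * layer times the energy per access of that layer. The per-access
 * energies are parameters of this sketch, not CACTI outputs. */
double dynamic_energy(long spm_accesses, double spm_energy_per_access,
                      long offchip_accesses, double offchip_energy_per_access)
{
    return (double)spm_accesses * spm_energy_per_access
         + (double)offchip_accesses * offchip_energy_per_access;
}

/* Energy saved when `accesses` are served by the SPM instead of the
 * off-chip memory. The cost of copying the data into the SPM is ignored. */
double energy_saving(long accesses, double spm_energy_per_access,
                     double offchip_energy_per_access)
{
    return (double)accesses
         * (offchip_energy_per_access - spm_energy_per_access);
}
```

Under this model, the lattices with the highest access counts are the most attractive copy candidates, since the saving grows linearly with the number of accesses moved on-chip.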
32
Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
33
Signal-to-Memory Mapping
[Figure: signal A[index_1][index_2] mapped into the physical memory; labels: base address of signal A, window size of signal A]
34
Signal-to-Memory Mapping
Mapping model (can be used in hierarchical memory organizations)
An m-dimensional array is mapped to an m-dimensional window (w_1, …, w_m):
A[index_1] … [index_m] -> A[index_1 mod w_1] … [index_m mod w_m]
w_i = max { distance between alive elements having the same index i } + 1
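The modulo mapping translates into a short address computation. The sketch below is an illustration under assumptions that are not stated on the slide: the window is laid out in row-major order starting at the signal's base address, and the names `fold` and `window_address` are hypothetical.

```c
#include <stddef.h>

/* Non-negative remainder, so that negative indices also fold correctly. */
static size_t fold(long index, size_t w)
{
    long r = index % (long)w;
    return (size_t)(r < 0 ? r + (long)w : r);
}

/* Maps A[index[0]]...[index[m-1]] to a physical address inside the
 * window (w[0], ..., w[m-1]) that starts at `base`. The row-major layout
 * of the window is an assumption of this sketch. */
size_t window_address(size_t base, const long *index, const size_t *w,
                      size_t m)
{
    size_t offset = 0;
    for (size_t d = 0; d < m; d++)
        offset = offset * w[d] + fold(index[d], w[d]);
    return base + offset;
}
```

With a window of (4, 6), the element A[7][9] folds to position (3, 3), that is, offset 3 x 6 + 3 = 21 from the base address, and every offset stays within the 24 locations of the window.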
35
Bounding window: (w_1, w_2) = (4, 6)
Storage requirements: 4 x 6 = 24
Iteration (i=7, j=9)
36
Bounding window: (w_1, w_2) = (4, 6)
Storage requirements: 4 x 6 = 24
Iteration (i=7, j=9)
37
Computation of the Window of a Lattice of Live Signals

    for ( i=0; i<=3; i++ )
      for ( j=0; j<=2; j++ )
        if ( 3*i >= 2*j ) … A[2*i+3*j][5*i+j] …

Iterator space (i, j) -> index space (index_1, index_2) of A[2i+3j][5i+j], through the affine mapping x = T·i + u
38
Computation of the Window of a Lattice of Live Signals
A[2i+3j][5i+j]
2-D window obtained by integer projection of the lattice on the axes:
( w_1 = 13, w_2 = 18 )
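The window of this example can be checked by enumerating the lattice and projecting it on each axis. The sketch below is a brute-force illustration (the function name `bounding_window` is hypothetical), and it assumes that all elements of the lattice are alive at the same time.

```c
#include <limits.h>

/* Bounding window of the lattice of A[2i+3j][5i+j] for 0 <= i <= 3,
 * 0 <= j <= 2, 3i >= 2j: the extent of the projection on each index
 * axis. Assumes every element of the lattice is alive simultaneously. */
void bounding_window(int *w1, int *w2)
{
    int min1 = INT_MAX, max1 = INT_MIN;
    int min2 = INT_MAX, max2 = INT_MIN;

    for (int i = 0; i <= 3; i++) {
        for (int j = 0; j <= 2; j++) {
            if (3 * i < 2 * j)
                continue;                 /* outside the iterator space */
            int x = 2*i + 3*j;            /* index_1 */
            int y = 5*i + j;              /* index_2 */
            if (x < min1) min1 = x;
            if (x > max1) max1 = x;
            if (y < min2) min2 = y;
            if (y > max2) max2 = y;
        }
    }
    *w1 = max1 - min1 + 1;
    *w2 = max2 - min2 + 1;
}
```

The first index ranges from 0 to 12 and the second from 0 to 17, which gives the window (13, 18).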
39
Future work
- Computation of storage requirements for high-throughput applications, where the code contains explicit parallelism
- Improve the algorithm that aims to optimize the dynamic energy consumption, extending it to an arbitrary number of memory layers
- Extend the hierarchical memory allocation model to save leakage energy
- Use area models for memories in order to trade off the decrease of energy consumption against the increase of area implied by the memory fragmentation
40
Future work
Memory management for configurable architectures
- Several FPGAs contain distributed RAM modules
- Homogeneous architectures: RAMs of the same capacity, evenly distributed (Xilinx Virtex II Pro)
- Heterogeneous architectures: a variety of RAMs (Altera Stratix II)
Memory management for dynamically reconfigurable systems
41
Conclusions
Unique features of this research
- The exact computation of the data storage requirement of an application [ IEEE TVLSI 2007 ]
- A data reuse formal model based on partitioning the arrays according to the intensity of memory accesses [ ICCAD 2006 ]
- A signal-to-memory mapping model which works for hierarchical memory organizations [ DATE 2007 ]
- A general framework based on lattices for addressing several memory management problems
42
Conclusions
General goal: design of a (hierarchical) memory subsystem optimized for power consumption and chip area, subject to performance constraints, starting from the specification of a (multi-dimensional) signal processing application
- This topic is considered by the Semiconductor Research Corporation (SRC) to be one of the top synthesis problems still unsolved.
- This research is interdisciplinary: EE + CS + Math.
- There is interest in international co-operation (potential funding: the NSF-PIRE program).
43
Conclusions
Graduate students
- Hongwei Zhu (Ph.D. defense: Spring 2007)
- Ilie I. Luican (Ph.D. defense: Spring 2009)
- Karthik Chandramouli (M.S. completed)