Novel Algorithms in the Memory Management of Multi-Dimensional Signal Processing Florin Balasa University of Illinois at Chicago

Outline
 The importance of memory management in multi-dimensional signal processing
 A lattice-based framework
 The computation of the minimum data memory size
 Optimization of the dynamic energy consumption in a hierarchical memory subsystem
 Mapping multi-dimensional signals into hierarchical memory organizations
 Future research directions
 Conclusions

Memory management for signal processing applications
In real-time multi-dimensional signal processing systems (video and image processing, telecommunications, audio and speech coding, medical imaging, etc.), data transfer and data storage largely determine the system performance, power consumption, and chip area. The designer must therefore focus on the exploration of the memory subsystem.

Memory management for signal processing applications
In the early years of high-level synthesis:
 memory management tasks tackled at the scalar level
 register-transfer level (RTL) algorithmic specifications
 algebraic techniques (similar to those used in modern compilers)
More recently:
 memory management tasks at the non-scalar level
 high-level algorithmic specifications

Memory management for signal processing applications
 Affine algorithmic specifications
 Loop-organized algorithmic specifications
 Main data structures: multi-dimensional arrays

  T[0] = 0;
  for ( j=16; j<=512; j++ ) {
    S[0][j-16][0] = 0;
    for ( k=0; k<=8; k++ )
      for ( i=j-16; i<=j+16; i++ )
        S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + A[4][j] - A[k][i];
    T[j-15] = S[0][j-16][297] + T[j-16];
  }
  out = T[497];

A Lattice-Based Framework

  for (i=0; i<=4; i++)
    for (j=0; j <= 2i && j <= -i+6; j++)
      … A[2i+3j+1][5i+j+3][4i+6j+2] …

The iterator space (i, j) is mapped by x = 2i+3j+1, y = 5i+j+3, z = 4i+6j+2 into the index space of the array reference A[x][y][z].

A Lattice-Based Framework
The affine mapping carries the iterator space into the index space:

  [ x ]   [ 2 3 ] [ i ]   [ 1 ]
  [ y ] = [ 5 1 ] [ j ] + [ 3 ]
  [ z ]   [ 4 6 ]         [ 2 ]

over the iterator space 0 <= i <= 4, 0 <= j <= 2i, j <= -i+6, i.e., the reference … A[2i+3j+1][5i+j+3][4i+6j+2] … inside

  for (i=0; i<=4; i++)
    for (j=0; j <= 2i && j <= -i+6; j++)

A Lattice-Based Framework
Any array reference can be modeled as a linearly bounded lattice (LBL):
  LBL = { x = T·i + u | A·i >= b }
The polytope A·i >= b is the iterator space, defined by the scope of the nested loops and the iterator-dependent conditions; the affine mapping x = T·i + u carries this polytope into the LBL.
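
As an aside, here is a minimal C sketch of how an LBL { x = T·i + u | A·i >= b } might be stored in a tool; the field names and the fixed maximum sizes are assumptions for illustration, not the actual data structure of this work:

  /* Minimal LBL container: affine mapping (T, u) plus the
     iterator-space constraints A*i >= b. Sizes are illustrative. */
  #define MAX_ITER 8             /* max number of iterators (n)        */
  #define MAX_DIM  8             /* max number of array dimensions (m) */
  #define MAX_ROWS 16            /* max inequality rows in A*i >= b    */

  typedef struct {
      int n, m;                  /* iterators / array dimensions       */
      int T[MAX_DIM][MAX_ITER];  /* affine mapping matrix (m x n)      */
      int u[MAX_DIM];            /* affine mapping offset (m)          */
      int rows;                  /* number of constraint rows          */
      int A[MAX_ROWS][MAX_ITER]; /* iterator-space constraints         */
      int b[MAX_ROWS];           /* right-hand sides of A*i >= b       */
  } LBL;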

A Lattice-Based Framework

  for (i=0; i<=4; i++)
    for (j=0; j <= 2i && j <= -i+6; j++)
      … A[2i+3j+1][5i+j+3][4i+6j+2] …

How many memory locations are necessary to store the array reference A[2i+3j+1][5i+j+3][4i+6j+2]?

A Lattice-Based Framework
The storage requirement of an array reference is the size of its index space (i.e., a lattice!).
  LBL = { x = T·i + u | A·i >= b },  f : Z^n → Z^m,  f(i) = T·i + u
Is the function f a one-to-one mapping? If YES, then Size(index space) = Size(iterator space).
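
For this running example, the one-to-one property can be sanity-checked by brute force: enumerate the iterator space, apply the affine mapping, and test whether two iterator points collide in the index space. A small C sketch (exhaustive checking stands in for the symbolic test a real tool would use):

  #include <stdio.h>

  /* Check that f(i,j) = (2i+3j+1, 5i+j+3, 4i+6j+2) is one-to-one
     over the iterator space 0<=i<=4, 0<=j<=2i, j<=-i+6. */
  int main(void) {
      int pts[64][3], n = 0, i, j, k;
      for (i = 0; i <= 4; i++)
          for (j = 0; j <= 2*i && j <= -i+6; j++) {
              pts[n][0] = 2*i + 3*j + 1;
              pts[n][1] = 5*i + j + 3;
              pts[n][2] = 4*i + 6*j + 2;
              n++;
          }
      for (i = 0; i < n; i++)
          for (k = i+1; k < n; k++)
              if (pts[i][0] == pts[k][0] && pts[i][1] == pts[k][1] &&
                  pts[i][2] == pts[k][2]) {
                  printf("collision: f is not one-to-one\n");
                  return 1;
              }
      printf("f is one-to-one here: %d iterator points -> %d index points\n",
             n, n);
      return 0;
  }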

A Lattice-Based Framework
Computation of the size of an integer polytope, for

  for (i=0; i<=4; i++)
    for (j=0; j <= 2i && j <= -i+6; j++)
      … A[2i+3j+1][5i+j+3][4i+6j+2] …

Step 1: Find the vertices of the iterator space and their supporting polyhedral cones, e.g., C(V_1) = { r_1, r_2 }.

A Lattice-Based Framework
Computation of the size of an integer polytope (cont'd)
Step 2: Decompose the supporting cones into unimodular cones (Barvinok's decomposition algorithm), writing C(V_1) as a sum of unimodular cones.
Step 3: Find the generating function of each supporting cone, e.g.,
  F(V_1) = 1 / ((1 - x·y^2)(1 - y^-1)) + 1 / ((1 - y)(1 - x))
Step 4: Find the number of monomials in the generating function of the whole polytope, F = F(V_1) + F(V_2) + …

The Memory Size Computation Problem
What is the minimum data storage necessary to execute an algorithm (affine specification)?
 Any scalar signal must be stored only during its lifetime
 Signals having disjoint lifetimes can share the same location

  T[0] = 0;
  for ( j=16; j<=512; j++ ) {
    S[0][j-16][0] = 0;
    for ( k=0; k<=8; k++ )
      for ( i=j-16; i<=j+16; i++ )
        S[0][j-16][33*k+i-j+17] = S[0][j-16][33*k+i-j+16] + A[4][j] - A[k][i];
    T[j-15] = S[0][j-16][297] + T[j-16];
  }
  out = T[497];

The Memory Size Computation Problem
For the affine specification above:
 The number of scalars (array elements): 153,366
 The minimum data storage: 4,763
All the previous works proposed estimation techniques!

The Memory Size Computation Problem (illustrative example)

  #define n 6

  for ( j=0; j<n; j++ ) {
    A[j][0] = in0;
    for ( i=0; i<n; i++ )
      A[j][i+1] = A[j][i] + 1;
  }
  for ( i=0; i<n; i++ ) {
    alpha[i] = A[i][n+i];
    for ( j=0; j<n; j++ )
      A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
  }
  for ( j=0; j<n; j++ )
    B[j] = A[j][2*n];

The Memory Size Computation Problem
Decompose the LBLs of the array references into disjoint lattices. The intersection of two lattices
  LBL_1 = { x = T_1·i_1 + u_1 | A_1·i_1 >= b_1 }
  LBL_2 = { x = T_2·i_2 + u_2 | A_2·i_2 >= b_2 }
is computed by solving the Diophantine system of equations T_1·i_1 + u_1 = T_2·i_2 + u_2; its solutions, constrained by { A_1·i_1 >= b_1, A_2·i_2 >= b_2 }, define a new polytope.
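
To see what such an intersection looks like in the simplest 1-D case, consider two toy lattices { x = 4·i_1 + 1 } and { x = 6·i_2 + 3 } (invented for illustration, not taken from the talk): equating them gives the Diophantine equation 4·i_1 + 1 = 6·i_2 + 3, whose integer solutions i_1 = 2 + 3t, i_2 = 1 + 2t yield the new lattice { x = 12t + 9 }. A brute-force check in C:

  #include <stdio.h>

  /* Toy 1-D lattice intersection: points common to { x = 4*i1 + 1 }
     and { x = 6*i2 + 3 } over a small range form { x = 12*t + 9 }. */
  int main(void) {
      for (int x = 0; x <= 60; x++)
          if (x % 4 == 1 && x % 6 == 3)   /* member of both lattices */
              printf("%d ", x);           /* prints: 9 21 33 45 57   */
      printf("\n");
      return 0;
  }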

The Memory Size Computation Problem
Decomposition of the array references of signal A (illustrative example)

Memory Size Computation Algorithm
Step 1: For every indexed signal in the algorithmic specification, decompose the array references into disjoint lattices.
Step 2: Based on the lattice lifetime analysis, find the memory size at the boundaries between the blocks of code.
Step 3: Analyzing the amounts of signals produced and consumed in each block, prune the blocks of code where the maximum storage cannot occur.
Step 4: For each of the remaining blocks, compute the maximum memory size by
 computing the maximum iterator vectors of the scalars
 exploiting the one-to-one mapping property of array references

Memory trace for an SVD updating algorithm

Memory trace for a 2-D Gaussian blur filter algorithm

Study the effect of loop transformations on the data memory

Version 1 – 784 locations:

  for ( i = 0; i < 95; i++ )
    for ( j = 0; j < 32; j++ ) {
      if ( i+j > 30 && i+j < 63 )  A[i][j] = … ;
      if ( i+j > 62 && i+j < 95 )  … = A[i-32][j] ;
    }

Version 2 – 32 locations:

  for ( j = 0; j < 32; j++ )
    for ( i = 0; i < 95; i++ ) {
      if ( i+j > 30 && i+j < 63 )  A[i][j] = … ;
      if ( i+j > 62 && i+j < 95 )  … = A[i-32][j] ;
    }
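
These figures can be sanity-checked by brute force: each element A[i][j] with 30 < i+j < 63 is written once and read once later (as A[i-32][j]), so the storage requirement is the peak number of elements written but not yet read. A C sketch of this check (the counting scheme is an illustration, not the formal technique of the talk):

  #include <stdio.h>

  static int live, peak;                 /* currently live / peak count */

  static void touch(int is_write) {
      if (is_write) { if (++live > peak) peak = live; }
      else          --live;              /* the single read kills it    */
  }

  int main(void) {
      int i, j;

      live = peak = 0;                   /* version 1: i outer, j inner */
      for (i = 0; i < 95; i++)
          for (j = 0; j < 32; j++) {
              if (i+j > 30 && i+j < 63) touch(1);
              if (i+j > 62 && i+j < 95) touch(0);
          }
      printf("i-outer: peak storage = %d locations\n", peak);

      live = peak = 0;                   /* version 2: j outer, i inner */
      for (j = 0; j < 32; j++)
          for (i = 0; i < 95; i++) {
              if (i+j > 30 && i+j < 63) touch(1);
              if (i+j > 62 && i+j < 95) touch(0);
          }
      printf("j-outer: peak storage = %d locations\n", peak);
      return 0;
  }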

The Memory Size Computation Problem
 For the first time, the storage requirements of applications can be exactly computed using formal techniques. All the previous works are estimation techniques, and they are sometimes very inaccurate.
 This approach works for the entire class of affine specifications, whereas the previous works place constraints on the specifications.
 This approach was tested on complex benchmarks (e.g., code with 113 loop nests 3 levels deep, 906 array references, over 900 lines of code, and 4 million scalar signals), whereas the previous works are illustrated with "simple" benchmarks (in terms of array elements, array references, lines of code).

Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
 Multi-dimensional arrays are stored in the off-chip memory
 Copies of the frequently accessed array parts should be stored on-chip
Two-layer model: off-chip memory + on-chip scratch-pad memory (SPM), with copy candidates placed in the SPM.

How to select the copy candidates?
 Entire arrays – unlikely …
 Rows/columns – somewhat better …
How to find the heavily accessed array parts?
 The need for an array partitioning based on the intensity of memory accesses
 A data reuse model based on lattices

Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
[Figure: the array space of signal A partitioned into disjoint lattices A_1, A_2, A_3, …, A_15, A_16, A_17]

Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Decomposition into disjoint lattices, and computation of the exact number of memory accesses per lattice (figure: lattice A_17):
  #accesses A_1: 13,569
  #accesses A_2: 13,569
  #accesses A_3: 13,569
  #accesses A_17: 131,625
  Total #accesses A_17: 172,332

Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
Map of the array space of signal A based on average memory accesses
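
Such an access map can be reproduced by brute-force instrumentation: execute the loop nest and count how many times each element of A is touched. A C sketch for the kernel shown earlier (the 9 x 529 bounds of A are inferred from the loop ranges; exhaustive counting stands in for the tool's lattice-based computation):

  #include <stdio.h>

  static long acc[9][529];       /* per-element access counts for A */

  int main(void) {
      int j, k, i, mk = 0, mi = 0;
      long max = 0;
      for (j = 16; j <= 512; j++)
          for (k = 0; k <= 8; k++)
              for (i = j-16; i <= j+16; i++) {
                  acc[4][j]++;   /* read of A[4][j] */
                  acc[k][i]++;   /* read of A[k][i] */
              }
      for (k = 0; k < 9; k++)
          for (i = 0; i < 529; i++)
              if (acc[k][i] > max) { max = acc[k][i]; mk = k; mi = i; }
      printf("most accessed element: A[%d][%d] with %ld reads\n", mk, mi, max);
      return 0;
  }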

3-D map of the array space of signal A based on the exact number of memory accesses



Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem
 Dynamic energy – computed based on the number of accesses to each memory layer
 CACTI power model [Reinman 99]
 One or two orders of magnitude between an SPM access and an off-chip access
 Energy per access is SPM size-dependent – constant for small SPM sizes (< a few Kbytes)
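
Given the per-layer access counts, the dynamic energy figure of merit is just a weighted sum. A toy C illustration (the per-access energies are made-up placeholders, not CACTI data, and the access counts loosely echo the A_17 example above):

  #include <stdio.h>

  int main(void) {
      double e_spm = 0.05;       /* nJ per SPM access (assumed)      */
      double e_off = 5.0;        /* nJ per off-chip access (assumed) */
      long n_spm   = 131625;     /* accesses served by the SPM copy  */
      long n_off   = 40707;      /* accesses still going off-chip    */

      /* dynamic energy = sum over layers of #accesses x energy/access */
      printf("E_dyn = %.1f nJ\n", n_spm * e_spm + n_off * e_off);
      return 0;
  }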

Optimizing the Dynamic Energy Consumption in a Hierarchical Memory Subsystem

Signal-to-Memory Mapping
A[index_1][index_2] is mapped into the physical memory at the base address of signal A, inside a window of signal A of a given window size.

Signal-to-Memory Mapping
Mapping model (can be used in hierarchical memory organizations): an m-dimensional array is mapped to an m-dimensional window (w_1, …, w_m)
  A[index_1] … [index_m]  →  A[index_1 mod w_1] … [index_m mod w_m]
where w_i = Max { distance, along dimension i, between simultaneously alive elements } + 1.

Bounding window: (w_1, w_2) = (4, 6). Storage requirements: 4 x 6 = 24 locations. [Figure: snapshot of the alive elements at iteration (i=7, j=9)]
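In code, the mapping model amounts to accessing a small physical buffer through modulo-reduced indices. A minimal C sketch for the 2-D case with the window (w_1, w_2) = (4, 6) above (the buffer and helper names are invented for illustration):

  #include <stdio.h>

  /* A[i1][i2] -> buf[i1 % 4][i2 % 6]: 24 physical locations suffice,
     provided no two simultaneously alive elements share a slot. */
  #define W1 4
  #define W2 6

  static int buf[W1][W2];        /* physical window of signal A */

  static int  rdA(int i1, int i2)        { return buf[i1 % W1][i2 % W2]; }
  static void wrA(int i1, int i2, int v) { buf[i1 % W1][i2 % W2] = v; }

  int main(void) {
      wrA(7, 9, 42);             /* logical element A[7][9]      */
      printf("A[7][9] -> buf[%d][%d] = %d\n", 7 % W1, 9 % W2, rdA(7, 9));
      return 0;
  }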

Computation of the Window of a Lattice of Live Signals

  for ( i=0; i<=3; i++ )
    for ( j=0; j<=2; j++ )
      if ( 3i >= 2j )
        … A[2i+3j][5i+j] …

The iterator space (i, j) is mapped by x = T·i + u into the index space (index_1, index_2).

Computation of the Window of a Lattice of Live Signals
The 2-D window is obtained by integer projection of the lattice on the axes: for A[2i+3j][5i+j], w_1 = 13 and w_2 = 18.
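
For this small example, the projections can be reproduced by exhaustive enumeration: list the lattice points and take the extent of each index coordinate. A C sketch (brute force stands in for the framework's integer-projection machinery):

  #include <stdio.h>

  /* Window of the lattice of A[2i+3j][5i+j] over the iterator space
     0<=i<=3, 0<=j<=2, 3i>=2j, by projection on the two index axes. */
  int main(void) {
      int i, j, first = 1;
      int min1 = 0, max1 = 0, min2 = 0, max2 = 0;
      for (i = 0; i <= 3; i++)
          for (j = 0; j <= 2; j++) {
              if (3*i < 2*j) continue;       /* outside the lattice */
              int x1 = 2*i + 3*j, x2 = 5*i + j;
              if (first) { min1 = max1 = x1; min2 = max2 = x2; first = 0; }
              if (x1 < min1) min1 = x1;
              if (x1 > max1) max1 = x1;
              if (x2 < min2) min2 = x2;
              if (x2 > max2) max2 = x2;
          }
      printf("w1 = %d, w2 = %d\n", max1-min1+1, max2-min2+1);  /* 13, 18 */
      return 0;
  }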

Future work  Computation of storage requirements for high-throughput applications, where the code contains explicit parallelism applications, where the code contains explicit parallelism  Improve the algorithm that aims to optimize the dynamic energy consumption, extending it to an arbitrary number energy consumption, extending it to an arbitrary number of memory layers of memory layers  Extend the hierarchical memory allocation model to save leakage energy leakage energy  Use area models for memories in order to trade-off decrease of energy consumption and the increase of area decrease of energy consumption and the increase of area implied by the memory fragmentation implied by the memory fragmentation

Future work  Memory management for configurable architectures Several FPGA contain distributed RAM modules Homogeneous architectures RAMs of same capacity evenly distributed (Xilinx Virtex II Pro) Heterogeneous architectures A variety of RAMs (Altera Stratix II)  Memory management for dynamically reconfigurable systems

Conclusions  The exact computation of the data storage requirement of an application [ IEEE TVLSI 2007 ] of an application [ IEEE TVLSI 2007 ]  A data reuse formal model based on partitioning the arrays according to the intensity of memory accesses the arrays according to the intensity of memory accesses [ ICCAD 2006 ] [ ICCAD 2006 ]  A general framework based on lattices for addressing several memory management problems several memory management problems Unique features of this research  A signal-to-memory mapping model which works for hierarchical memory organizations [ DATE 2007 ] for hierarchical memory organizations [ DATE 2007 ]

Conclusions  This topic is considered by the Semiconductor Research Corporation (SRC) one of the top synthesis problems Corporation (SRC) one of the top synthesis problems still unsolved still unsolved  This research is interdisciplinary: EE+CS+Math  General goal design of a (hierarchical) memory subsystem optimized for power consumption and chip area, s.t. performance constraints, starting from the specification of a (multi-dimensional) signal processing application  There is interest for international co-operation (potential funding: the NSF-PIRE program) (potential funding: the NSF-PIRE program)

Conclusions
Graduate students:
 Hongwei Zhu (Ph.D. defense: Spring 2007)
 Ilie I. Luican (Ph.D. defense: Spring 2009)
 Karthik Chandramouli (M.S. completed)