Data-Flow Analysis in the Memory Management of Real-Time Multimedia Processing Systems
Florin Balasa, University of Illinois at Chicago

Introduction
Real-time multimedia processing systems: video and image processing, real-time 3D rendering, audio and speech coding, medical imaging, etc.
A large part of the power dissipation is due to data transfer and data storage.
Fetching an operand from an off-chip memory for an addition consumes 33 times more power than the computation itself [Catthoor 98].
The area cost is often largely dominated by memories.

Introduction
In the early years of high-level synthesis, memory management tasks were tackled at the scalar level.
Algebraic techniques -- similar to those used in modern compilers -- allow memory management to be handled at the non-scalar level.
Requirement: address the entire class of affine specifications --
multidimensional signals with (complex) affine indexes;
loop nests whose bounds are affine functions of the iterators;
conditions built from relational and/or logical operators applied to affine functions.

Outline
Memory size computation using data dependence analysis
Hierarchical memory allocation based on data reuse analysis
Data-flow driven data partitioning for on/off-chip memories
Conclusions

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    ... A[2i+3j+1][5i+j+2][4i+6j+3] ...
    for (k=0; k<=511; k++)
      ... B[i+k][j+k] ...
How many memory locations are necessary to store the array references A[2i+3j+1][5i+j+2][4i+6j+3] and B[i+k][j+k]?

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      ... B[i+k][j+k] ...
Is it the number of iterator triplets (i,j,k), that is, 512^3? No!
For instance, (i,j,k) = (0,1,1) and (i,j,k) = (1,2,0) both access B[1][2].

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      ... B[i+k][j+k] ...
Is it the number of possible index values (i+k, j+k), that is, 1023^2 (since 0 <= i+k, j+k <= 1022)? No!
For instance, no (i,j,k) in the iterator space accesses B[0][512].
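Both observations can be checked by brute force. The sketch below (illustrative only, not part of the original slides) enumerates the iterator space for a reduced loop bound of 7 -- the smallest bound in the table shown later -- and counts the distinct index tuples each reference actually touches: 64 for the A reference (its mapping is one-to-one) and 169 for the B reference (fewer than both 8^3 = 512 and 15^2 = 225).

#include <cstdio>
#include <set>
#include <tuple>
#include <utility>

int main() {
    const int N = 8;   // loop bound 7, i.e. 0 <= i, j, k <= 7

    // A[2i+3j+1][5i+j+2][4i+6j+3]: the affine mapping is one-to-one,
    // so the number of distinct index triplets equals |iterator space|.
    std::set<std::tuple<int, int, int>> idxA;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            idxA.insert(std::make_tuple(2*i + 3*j + 1, 5*i + j + 2, 4*i + 6*j + 3));

    // B[i+k][j+k]: many iterator triplets map to the same index pair, and
    // not every pair in [0,14]x[0,14] is reachable.
    std::set<std::pair<int, int>> idxB;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                idxB.insert(std::make_pair(i + k, j + k));

    std::printf("A: %zu distinct indexes (iterator space %d)\n", idxA.size(), N * N);
    std::printf("B: %zu distinct indexes (iterator space %d)\n", idxB.size(), N * N * N);
    return 0;
}

For the bound 7 this prints 64 and 169, matching the first row of the table further below; for the bound 511 used above the same enumeration becomes impractical, which is exactly why the analytical method of the following slides is needed.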

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    ... A[2i+3j+1][5i+j+2][4i+6j+3] ...
[Figure: the 2-D iterator space (i,j) is mapped to the 3-D index space of A[x][y][z], with x = 2i+3j+1, y = 5i+j+2, z = 4i+6j+3.]

Computation of array reference size
... A[2i+3j+1][5i+j+2][4i+6j+3] ...
[Figure: the points of the iterator space (i,j) and the corresponding points of the index space of A[x][y][z].]

Computation of array reference size
Remark: the iterator space may have ``holes'' too.
for (i=4; i<=8; i++)
  for (j=i-2; j<=i+2; j+=2)
    ... C[i+j] ...
After normalization:
for (i=4; i<=8; i++)
  for (j=0; j<=2; j++)
    ... C[2i+2j-2] ...
[Figure: the original iterator space (4 <= i <= 8, j in {i-2, i, i+2}) and the dense normalized iterator space (4 <= i <= 8, 0 <= j <= 2).]

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    ... A[2i+3j+1][5i+j+2][4i+6j+3] ...
Affine mapping from the iterator space to the index space:
| x |   | 2 3 |   | i |   | 1 |
| y | = | 5 1 | * | j | + | 2 |
| z |   | 4 6 |         | 3 |
Iterator space: 0 <= i, j <= 511

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      ... B[i+k][j+k] ...
[Figure: the 3-D iterator space (i,j,k) is mapped to the 2-D index space of B[x][y], with x = i+k, y = j+k.]

Computation of array reference size
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      ... B[i+k][j+k] ...
Affine mapping from the iterator space to the index space:
| x |   | 1 0 1 |   | i |
| y | = | 0 1 1 | * | j |
                    | k |
Iterator space: 0 <= i, j, k <= 511

Computation of array reference size
Any array reference can be modeled as a linearly bounded lattice (LBL):
LBL = { x = T·i + u | A·i >= b }
x = T·i + u is the affine mapping; A·i >= b is the iterator space -- the scope of the nested loops plus any iterator-dependent conditions.
The index space (an LBL) is the image of the iterator polytope under the affine mapping.
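As a data-structure sketch (an assumed representation, not taken from the slides), an LBL can be stored as an affine mapping together with its iterator polytope; the helpers below test membership in the iterator space and map an iterator point to its index-space point.

#include <vector>

// Minimal sketch of a linearly bounded lattice, assuming dense integer
// matrices; the type and function names are illustrative only.
using Vec = std::vector<long>;
using Mat = std::vector<Vec>;   // row-major

struct LBL {
    Mat T; Vec u;   // affine mapping:     x = T*i + u
    Mat A; Vec b;   // iterator polytope:  A*i >= b
};

// Matrix-vector product.
static Vec mul(const Mat& M, const Vec& v) {
    Vec r(M.size(), 0);
    for (size_t row = 0; row < M.size(); ++row)
        for (size_t col = 0; col < v.size(); ++col)
            r[row] += M[row][col] * v[col];
    return r;
}

// True if the iterator point i satisfies all constraints A*i >= b.
static bool inIteratorSpace(const LBL& l, const Vec& i) {
    Vec Ai = mul(l.A, i);
    for (size_t r = 0; r < Ai.size(); ++r)
        if (Ai[r] < l.b[r]) return false;
    return true;
}

// Index-space point x = T*i + u of an iterator point i.
static Vec indexOf(const LBL& l, const Vec& i) {
    Vec x = mul(l.T, i);
    for (size_t r = 0; r < x.size(); ++r) x[r] += l.u[r];
    return x;
}

For the A reference above, T = {{2,3},{5,1},{4,6}}, u = {1,2,3}, and A·i >= b encodes 0 <= i, j <= 511.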

Computation of array reference size
The size of the array reference is the size of its index space -- an LBL!
LBL = { x = T·i + u | A·i >= b },   f : Z^n -> Z^m,   f(i) = T·i + u
Is f a one-to-one mapping?
If YES, then size(index space) = size(iterator space).

Computation of array reference size
f : Z^n -> Z^m,   f(i) = T·i + u
          | H |
P·T·S =   |---|        [Minoux 86]
          | G |
H -- nonsingular lower-triangular matrix; S -- unimodular matrix; P -- row permutation.
When rank(H) = m <= n, H is the Hermite Normal Form.

Computation of array reference size
Case 1: rank(H) = n -- function f is a one-to-one mapping
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    ... A[2i+3j+1][5i+j+2][4i+6j+3] ...
              | 2 3 |   | -1  3 |   |  1  0 |   | H |
P·T·S = I3 * | 5 1 | * |  1 -2 | = | -4 13 | = |---|
              | 4 6 |              |  2  0 |   | G |
Number of locations for A[ ][ ][ ] = size( 0 <= i, j <= 511 ) = 512 x 512

Computation of array reference size
Case 2: rank(H) < n
for (i=0; i<=511; i++)
  for (j=0; j<=511; j++)
    for (k=0; k<=511; k++)
      ... B[i+k][j+k] ...
              | 1 0 1 |       | 1 0 0 |
P·T·S = I2 * | 0 1 1 | * S = | 0 1 0 | = [ H | 0 ],   rank(H) = 2 < n = 3
With I = i+k, J = j+k, K = k, the iterator space { 0 <= i, j, k <= 511 } becomes { 0 <= I-K, J-K, K <= 511 }.
|B[i+k][j+k]| = size( 0 <= I, J <= 1022, I-511 <= J <= I+511 ) = 784,897
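The one-to-one test itself only needs the rank: f(i) = T·i + u is one-to-one exactly when T has full column rank n, which equals rank(H). A small sketch (illustrative only, not the decomposition of [Minoux 86]) that checks this with Gaussian elimination over the rationals for the two examples above:

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Rank of a matrix by Gaussian elimination with partial pivoting.
static int rankOf(Mat m) {
    const double eps = 1e-9;
    size_t rows = m.size(), cols = m[0].size(), r = 0;
    for (size_t c = 0; c < cols && r < rows; ++c) {
        size_t piv = r;
        for (size_t i = r; i < rows; ++i)
            if (std::fabs(m[i][c]) > std::fabs(m[piv][c])) piv = i;
        if (std::fabs(m[piv][c]) < eps) continue;    // no pivot in this column
        std::swap(m[piv], m[r]);
        for (size_t i = r + 1; i < rows; ++i) {
            double f = m[i][c] / m[r][c];
            for (size_t j = c; j < cols; ++j) m[i][j] -= f * m[r][j];
        }
        ++r;
    }
    return static_cast<int>(r);
}

int main() {
    Mat Ta = {{2, 3}, {5, 1}, {4, 6}};   // A[2i+3j+1][5i+j+2][4i+6j+3], n = 2
    Mat Tb = {{1, 0, 1}, {0, 1, 1}};     // B[i+k][j+k], n = 3
    std::printf("rank(T_A) = %d = n = 2  -> one-to-one (Case 1)\n", rankOf(Ta));
    std::printf("rank(T_B) = %d < n = 3  -> not one-to-one (Case 2)\n", rankOf(Tb));
    return 0;
}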

Computation of array reference size
Array reference B[i+k][j+k]:
Loop bound | Size of iterator space | Size of index space (# storage locations) | |index space| / |iterator space|
  7 |         512 |     169 | 33 %
 15 |       4,096 |     721 | 17 %
 31 |      32,768 |   2,977 | 9.0 %
 63 |     262,144 |  12,097 | 4.6 %
127 |   2,097,152 |  48,769 | 2.3 %
255 |  16,777,216 | 195,841 | 1.1 %
511 | 134,217,728 | 784,897 | 0.5 %

Computation of array reference size
Computing the size of an integer polytope: the Fourier-Motzkin elimination.
n-dim polytope { Σ_k a_ik·x_k >= b_i } -- sort its constraints into
1. x_n >= D_i(x_1,...,x_{n-1})
2. x_n <= E_j(x_1,...,x_{n-1})
3. 0 <= F_k(x_1,...,x_{n-1})
Eliminating x_n yields the (n-1)-dim polytope
D_i(x_1,...,x_{n-1}) <= E_j(x_1,...,x_{n-1}),   0 <= F_k(x_1,...,x_{n-1})
Recursing down to a 1-dim polytope gives the range of x_1; for each value of x_1 in that range, add the size of the corresponding (n-1)-dim polytope.
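For the B reference, the projected 2-D polytope was derived above; the sketch below (illustrative only) performs the last step of this counting scheme by scanning the outer variable I and adding the length of the remaining 1-D range of J, reproducing the 784,897 storage locations.

#include <algorithm>
#include <cstdio>

int main() {
    // Polytope for B[i+k][j+k]: 0 <= I, J <= 1022, I-511 <= J <= I+511.
    long long count = 0;
    for (int I = 0; I <= 1022; ++I) {
        int lo = std::max(0, I - 511);      // lower bound of the 1-D range of J
        int hi = std::min(1022, I + 511);   // upper bound of the 1-D range of J
        if (hi >= lo) count += hi - lo + 1;
    }
    std::printf("lattice points: %lld\n", count);   // prints 784897
    return 0;
}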

Memory size computation
#define n 6
for (j=0; j<n; j++) {
  A[j][0] = in0;
  for (i=0; i<n; i++)
    A[j][i+1] = A[j][i] + 1;
}
for (i=0; i<n; i++) {
  alpha[i] = A[i][n+i];
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
}
for (j=0; j<n; j++)
  B[j] = A[j][2*n];

Memory size computation
Decompose the LBLs of the array references into non-overlapping pieces.
LBL1 = { x = T1·i1 + u1 | A1·i1 >= b1 }
LBL2 = { x = T2·i2 + u2 | A2·i2 >= b2 }
The intersection LBL1 ∩ LBL2 is obtained by solving the Diophantine system of equations T1·i1 + u1 = T2·i2 + u2; substituting its solution into { A1·i1 >= b1 , A2·i2 >= b2 } yields a new polytope -- the intersection is again an LBL.
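For intuition only -- a brute-force one-dimensional sketch with invented numbers: enumerating two small LBLs and intersecting them shows that the result is again a lattice restricted to a polytope, which the exact method obtains symbolically by solving the Diophantine system above.

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <set>

int main() {
    // LBL1 = { x = 2*i     | 0 <= i <= 10 }   (0, 2, ..., 20)
    // LBL2 = { x = 3*j + 1 | 0 <= j <= 6  }   (1, 4, ..., 19)
    std::set<int> lbl1, lbl2;
    for (int i = 0; i <= 10; ++i) lbl1.insert(2 * i);
    for (int j = 0; j <= 6;  ++j) lbl2.insert(3 * j + 1);

    // Brute-force intersection; the exact method solves 2*i = 3*j + 1 instead.
    std::set<int> inter;
    std::set_intersection(lbl1.begin(), lbl1.end(), lbl2.begin(), lbl2.end(),
                          std::inserter(inter, inter.begin()));
    for (int x : inter) std::printf("%d ", x);   // prints: 4 10 16
    std::printf("\n");                           // i.e. { x = 6*k + 4 | 0 <= k <= 2 }
    return 0;
}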

Memory size computation
Keeping the set of inequalities in the LBL intersection minimal:
for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    A[j][n+i+1] = j < i ? A[j][n+i] : alpha[i] + A[j][n+i];
Iterator space of the branch j < i: { 0 <= i, j <= n-1 , j+1 <= i } (5 inequalities), which reduces to the minimal set { 0 <= j , i <= n-1 , j+1 <= i } (3 inequalities).
[Figure: the triangular iterator space 1 <= i <= n-1, 0 <= j <= i-1.]

Memory size computation
Keeping the set of inequalities in the LBL intersection minimal: the decomposition theorem of polyhedra [Motzkin 1953].
Polyhedron = { x | C·x = d , A·x >= b }
           = { x | x = V·α + L·β + R·γ },   α, γ >= 0,   Σ α_i = 1
(the columns of V are the vertices, L spans the lines, R holds the extreme rays)

Memory size computation LBL’s of signal A (illustrative example)

Polyhedral data-dependence graphs
[Figure: the data-dependence graphs of the example at granularity level 0 and granularity level 1.]

Scalar-level data-dependence graph
[Figure: the data-dependence graph at granularity level 2 -- the scalar level.]

Polyhedral data-dependence graph
Motion detection algorithm [Chan 93]
[Figure: the number of scalars and the number of dependencies in the graph.]

Memory size computation Memory size variation during the motion detection alg.

Memory size computation
To handle high-throughput applications:
Extract the (largely hidden) parallelism from the initially specified code.
Find the lowest degree of parallelism that meets the throughput / hardware requirements.
Perform the memory size computation on code with explicit parallelism instructions.

Hierarchical memory allocation
A large part of the power dissipation in data-dominated applications is due to data transfers and data storage.
Power cost reduction: introduce a memory hierarchy that exploits the temporal locality of the data accesses.
Power dissipation = f( memory size , access frequency )

Hierarchical memory allocation
Power dissipation = f( memory size , access frequency )
[Figure: heavily used data is copied into a layer of small memories, backed by a layer of large memories.]

Hierarchical memory allocation
Hierarchical vs. non-hierarchical distribution -- the trade-offs:
Lower power consumption, since most accesses go to smaller memories.
Higher power consumption due to the additional transfers that store copies of data.
Larger area: additional area overhead (addressing logic).

Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:
1. Data reuse exploration -- which intermediate copies of data are necessary for accessing the data in a power- and area-efficient way
2. Memory allocation & assignment -- the distributed (hierarchical) memory architecture (memory layers; memory size / ports / address logic; signal-to-memory and signal-to-port assignment)

Hierarchical memory allocation
Synthesis of a multilevel memory architecture optimized for area and/or power, subject to performance constraints:
1. Data reuse exploration -- array partitions to be considered as copy candidates: the LBLs obtained from the recursive intersection of the array references
2. Memory allocation & assignment --
Cost = a · Σ P_read/write( N_bits , N_words , f_read/write ) + b · Σ Area( N_bits , N_words , N_ports , technology )
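A sketch of how this weighted objective might be evaluated for candidate memory architectures. Only the structure Cost = a·ΣP + b·ΣArea comes from the slide; the power and area models below are invented placeholders, not the models of any real memory library.

#include <cmath>
#include <cstdio>
#include <vector>

struct Memory {
    double nBits, nWords, fAccess;   // word width, depth, access frequency
};

// Placeholder power model: grows with access frequency, width and depth.
static double powerModel(const Memory& m) {
    return m.fAccess * m.nBits * std::log2(m.nWords + 2);
}
// Placeholder area model: proportional to the number of stored bits.
static double areaModel(const Memory& m) {
    return m.nBits * m.nWords * 1.0e-3;
}

// Weighted cost of a candidate architecture: Cost = a * sum(P) + b * sum(Area).
static double cost(const std::vector<Memory>& arch, double a, double b) {
    double p = 0, area = 0;
    for (const Memory& m : arch) { p += powerModel(m); area += areaModel(m); }
    return a * p + b * area;
}

int main() {
    // Two candidates: a flat memory vs. the same memory plus a small copy layer
    // that absorbs most of the accesses (all numbers are invented).
    std::vector<Memory> flat = {{32, 1 << 20, 1.0e8}};
    std::vector<Memory> hier = {{32, 1 << 20, 1.0e7}, {32, 1 << 10, 9.0e7}};
    std::printf("flat cost: %.3g   hierarchical cost: %.3g\n",
                cost(flat, 1.0, 1.0), cost(hier, 1.0, 1.0));
    return 0;
}

Evaluating this cost for each candidate produced by the data reuse exploration lets the allocation step pick the cheapest hierarchy that still meets the performance constraints.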

Partitioning for on/off-chip memories
[Figure: the memory address space is split between an on-chip SRAM (1-cycle access) and an off-chip DRAM (10-20 cycles), the latter reached by the CPU through a cache (1-cycle hit).]
Goal: an optimal data mapping to the SRAM / DRAM that maximizes the performance of the application.

Partitioning for on/off-chip memories
Total conflict factor: the total number of array accesses exposed to cache conflicts -- a measure of how important it is to map an array to the on-chip SRAM.
Using the polyhedral data-dependence graph provides precise information about the relative lifetimes of the different parts of the arrays.
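The slides do not spell out the selection algorithm; purely as an illustration, one simple formulation is a 0/1 knapsack that maximizes the total conflict factor of the array parts placed in the SRAM under its size budget. All sizes and conflict factors below are invented.

#include <algorithm>
#include <cstdio>
#include <vector>

struct ArrayPart {
    const char* name;
    int sizeWords;        // storage needed if placed on-chip
    int conflictFactor;   // accesses exposed to cache conflicts if left off-chip
};

int main() {
    std::vector<ArrayPart> parts = {
        {"A_part1", 400, 90}, {"A_part2", 700, 30}, {"B", 300, 60}, {"alpha", 6, 5}};
    const int sramWords = 800;   // on-chip SRAM budget

    // dp[w] = best total conflict factor achievable with at most w SRAM words.
    std::vector<int> dp(sramWords + 1, 0);
    for (const ArrayPart& p : parts)
        for (int w = sramWords; w >= p.sizeWords; --w)
            dp[w] = std::max(dp[w], dp[w - p.sizeWords] + p.conflictFactor);

    std::printf("best total conflict factor mapped on-chip: %d\n", dp[sramWords]);
    return 0;
}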

Conclusions
Algebraic techniques are powerful non-scalar instruments in the memory management of multimedia signal processing systems.
Data-dependence analysis at the polyhedral level is useful in many memory management tasks:
memory size computation for behavioral specifications;
hierarchical memory allocation;
data partitioning between on- and off-chip memories.
The End