Data Locality Analysis and Optimization


Data Locality Analysis and Optimization
Hua Lin and Wayne Wolf
Dept. of Electrical Engineering, Princeton University

Improving Data Locality

- Background: cache configuration synthesis
- Measurement of data locality
  - Data reuse, reuse distance
  - Conflicts in cache
- Approaches (a sketch of the loop-transformation idea follows this list)
  - Loop transformation: reducing the reuse distance
  - Data layout transformation: creating more opportunities for reuse, reducing conflicts in cache
  - Combination of the above two transformations
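As a minimal sketch of the first approach (illustrative C, not taken from the slides; the array names and sizes are assumptions), interchanging the loops of a stride-N walk shortens the reuse distance of each cache line from roughly N iterations to 1:

    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    void column_walk(void) {         /* long reuse distance: inner stride is N */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;
    }

    void row_walk(void) {            /* after interchange: inner stride is 1 */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;
    }

    int main(void) {
        column_walk();
        row_walk();
        printf("%g\n", a[0][0]);
        return 0;
    }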

Affine Representation

Two-dimensional loop nest (1):

for i = 2, n
  for j = 2, n
    y[i,j] = 0.5*(x[i-1,j-1]+x[i-1,j])
  end

[Figure: iteration space of loop nest (1), with axes i and j]

- N-dimensional loop nest
- Locality space
- Non-singular loop transformation T applied to nest (1)
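The slide's formulas were lost in transcription; the following is a hedged reconstruction of the standard affine model it refers to, with B and b denoting an assumed bound matrix and bound vector. An n-dimensional nest is the set of iteration vectors satisfying affine bounds, and a non-singular transformation T remaps that set:

\[
  \vec{i} = (i_1, \dots, i_n)^T, \qquad B\,\vec{i} \le \vec{b}, \qquad
  \vec{i}\,' = T\,\vec{i}, \quad \det T \ne 0 .
\]

For nest (1), for example, loop interchange corresponds to
\( T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \).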

Previous Work

- Loop transformation
  - A sequence of transformations
  - Unimodular or non-singular transformation: the legal transformation matrix
  - Tiling: efficiency
  - Loop distribution, fusion
- Data layout transformation
  - Non-singular transformation: skewed layout
  - Padding, tiling and partitioning: the effective range of data reuse
- The combination
- Innermost loop?

Our Work

- The starting point
  - Locality space instead of the innermost loop
  - Constructing the legal transformation matrix
  - Unified loop and data layout transformation
- Other initiatives
  - Dimensionality of the locality space and the reuse vector space
  - Individual statement instead of loop body as the atomic unit of the iteration space
  - Locality space for arrays

Compacting the Reuse Distance with Non-Singular Transformation

- For a reuse vector r that is not in the locality space L, i.e. r ∉ L, a non-singular transformation T can be applied s.t. Tr lies in the locality space of the transformed nest (a worked instance follows this list)
- For a set of reuse vectors: the dimensionality of the reuse vector space
  - Choosing among reuse vectors: reuse quality, legal transformation
  - The difficulty: the legal transformation matrix
- To capture more reuse:
  - Reducing the dimensionality of the reuse vector space
  - Increasing the dimensionality of the locality space
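As a hedged worked instance (the numbers are illustrative, not from the slides): suppose the locality space is L = span{(0, 1)^T} and a reference carries the temporal reuse vector r = (1, 1)^T ∉ L. The skewing transformation below is non-singular and maps r into L:

\[
  T = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}, \qquad
  T\vec{r} = \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}
             \begin{pmatrix} 1 \\ 1 \end{pmatrix}
           = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \in L, \qquad \det T = 1 \ne 0 .
\]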

Reducing the Dimensionality of the Reuse Vector Space with Loop Alignment

for i = 2, n
  for j = 2, n
    x[i,j] = x[i,j]-128                  (S1)
    y[i,j] = 0.5*(x[i-1,j-1]+x[i-1,j])   (S2)
    z[i,j] = x[i,j-1]                    (S3)
  end

Before loop alignment:
- Spatial reuse (not shown in the figure): [0, 1]: S1 -> S1, [0, 1]: S2 -> S2, [0, 1]: S3 -> S3
- Temporal reuse: [1, 0]: S1 -> S2, [1, 1]: S1 -> S2, [0, 1]: S1 -> S3

After loop alignment (S2 shifted by [1, 0]):
- Temporal reuse: [0, 0]: S1 -> S2, [0, 1]: S1 -> S2, [0, 1]: S1 -> S3

[Figures: iteration spaces before and after loop alignment, axes i and j]

Reducing the Dimensionality of the Reuse Vector Space with Loop Alignment (cont'd)

for i = 1, n
  for j = 2, n
    if i = 1 then
      y[2,j] = 0.5*(x[1,j-1]+x[1,j])
    else if i = n then
      x[n,j] = x[n,j]-128
      z[n,j] = x[n,j-1]
    else
      x[i,j] = x[i,j]-128
      y[i+1,j] = 0.5*(x[i,j-1]+x[i,j])
      z[i,j] = x[i,j-1]
    endif
  end

Minimizing the dimensionality of the reuse vector space

Increasing the Dimensionality of the Locality Space with Loop Tiling

- Tiling the innermost loops (a sketch follows this list)
- The dimensionality of the locality space with loop tiling
- The integrated procedure: loop alignment -> non-singular transformation -> tiling
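A minimal sketch of tiling the two innermost loops (illustrative C; the function name, arrays, and tile size B are assumptions, with B chosen so a BxB block fits in cache):

    #include <stdio.h>

    #define N 1024
    #define B 32                           /* hypothetical tile size; divides N */

    static double x[N][N], y[N][N];

    void smooth_tiled(void) {
        for (int ii = 0; ii < N; ii += B)          /* tile (controlling) loops */
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)  /* intra-tile loops */
                    for (int j = jj; j < jj + B; j++)
                        y[i][j] = 0.5 * (x[i][j] + x[j][i]);
    }

    int main(void) {
        smooth_tiled();
        printf("%g\n", y[0][0]);
        return 0;
    }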

More Spatial Reuse with Non-singular Transformation for Array Layout

- Locality space for arrays
  - Spanned by the last m columns of the reference matrix A (or of AT^(-1) under an affine loop transformation T)
  - Let S = span(those last m columns): the locality space for the array
- Creating spatial reuse
  - If the fastest changing dimension of an array is not in its locality space S, a non-singular transformation T_A can be applied to the array layout s.t. the new fastest changing dimension lies in S
  - Dimension interchange (a sketch follows this list)
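A minimal sketch of dimension interchange as a layout transformation (illustrative C; the macro A, the arrays, and the loop are assumptions, not the paper's notation). The logical access a[i][k] is remapped to a transposed physical array so the inner loop walks stride-1 memory:

    #include <stdio.h>

    #define N 512

    static double a_T[N][N];           /* physical layout: dimensions interchanged */
    #define A(i, k) a_T[k][i]          /* logical a[i][k] -> physical a_T[k][i] */

    void scale_columns(const double s[N]) {
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                A(i, k) *= s[k];       /* varying i is now stride-1 in memory */
    }

    int main(void) {
        static double s[N];
        for (int k = 0; k < N; k++) s[k] = 2.0;
        scale_columns(s);
        printf("%g\n", A(0, 0));
        return 0;
    }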

In-dimension Stride Vector

- In-dimension Stride Vector (ISV): the distance vector between two iterations that access adjacent data within the same dimension of an array
- Why ISV?
  - Each dimension of the array has its own ISV space
  - If a dimension is switched to be the fastest changing dimension, its ISVs become the self-spatial reuse vectors
  - Automates the unification of loop transformation and array dimension interchange
- How to compute an ISV (worked through below)
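The computation formula on this slide was lost in transcription; the following is a hedged reconstruction from the definition. If a reference has access matrix A, an ISV v for array dimension d is an iteration-space vector whose access difference is one unit in dimension d, i.e. a solution of A v = e_d. For array b(k, j) in the matrix-multiplication example on the next slide, with iteration vector (i, j, k):

\[
  A_b = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad
  A_b \vec{v} = \vec{e}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
  \;\Rightarrow\; \vec{v} = (0, 0, 1)^T ,
\]

which matches the 1st ISV space span{(0, 0, 1)} listed for b on the next slide.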

An Example

for i = 1, N
  for j = 1, N
    for k = 1, N
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    end

ISV spaces:
- Array c: 1st ISV space: span{(1, 0, 0)}, 2nd ISV space: span{(0, 1, 0)}
- Array a: 1st ISV space: span{(1, 0, 0)}, 2nd ISV space: span{(0, 0, 1)}
- Array b: 1st ISV space: span{(0, 0, 1)}, 2nd ISV space: span{(0, 1, 0)}

Transformed loop nest, (1) assuming row major, (2) with T: (1, 0, 0) -> (0, 0, 1):

for l = 1, N
  for m = 1, N
    for n = 1, N
      c(m,n) = c(m,n) + a(l,n)*b(m,l)
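For comparison, a runnable C rendering of the original nest next to the classic ikj interchange (a standard illustration of the same locality idea, not the slide's exact unified transformation): in row-major C, the interchanged version makes every inner-loop access either loop-invariant or stride-1.

    #include <stdio.h>

    #define N 256
    static double a[N][N], b[N][N], c[N][N];

    void matmul_ijk(void) {            /* original order: b is walked by column */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    void matmul_ikj(void) {            /* interchanged: stride-1 inner accesses */
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    int main(void) {
        matmul_ijk();
        matmul_ikj();
        printf("%g\n", c[0][0]);
        return 0;
    }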

Padding and Tiling with Locality Space

- Considering the cache conflicts together with the locality space
- Locality space for the array -> a subspace of the array's index space
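A minimal sketch of intra-array padding (illustrative C; N, PAD, and the function are assumptions): with a power-of-two row length, successive rows of x map to the same cache sets and a column walk conflicts in a direct-mapped cache; padding each row by PAD elements spreads the rows across sets.

    #include <stdio.h>

    #define N   1024
    #define PAD 8                       /* hypothetical padding, in elements */

    static double x[N][N + PAD];        /* padded leading dimension */

    double column_sum(int j) {
        double s = 0.0;
        for (int i = 0; i < N; i++)     /* rows no longer collide in one set */
            s += x[i][j];
        return s;
    }

    int main(void) {
        printf("%g\n", column_sum(0));
        return 0;
    }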

Experimental Result (Matrix Multiplication)

Cache configurations (n: number of line sets, l: line size in words, a: degree of associativity):
- Cfg1: n=128, l=8, a=1
- Cfg2: n=128, l=8, a=2
- Cfg3: n=256, l=8, a=1
- Cfg4: n=256, l=8, a=2

[Figure: miss rate for matrix multiplication under Cfg1-Cfg4]

Summary

Contributions in this work:
- The legal loop transformation matrix
- Locality space instead of the innermost loop
- Locality space for arrays
- Dimensionality of the locality space and the reuse vector space
- Individual statement instead of loop body as the atomic unit of the iteration space
- ISV: automating the unification of loop transformation and array dimension interchange

Work in Progress

- More work on quasi-affine transformations
- Parallelization with data locality