Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality.


CS243 Winter 2006 Stanford University 2 Agenda Introduction Loop Transformations Affine Transform Theory

CS243 Winter 2006 Stanford University 3 Memory Hierarchy (figure: CPU, caches, memory)

CS243 Winter 2006 Stanford University 4 Cache Locality for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_for Suppose array A has column-major layout: A[1,1] A[2,1] … A[1,2] A[2,2] … A[1,3] … The loop nest has poor spatial cache locality.

CS243 Winter 2006 Stanford University 5 Loop Interchange for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_for Suppose array A has column-major layout: A[1,1] A[2,1] … A[1,2] A[2,2] … A[1,3] … The new loop nest has better spatial cache locality. for j = 1, 200 for i = 1, 100 A[i, j] = A[i, j] + 3 end_for
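The interchange above can be sketched in runnable form; a minimal Python version (0-based indices, the array sizes from the slide), checking that the interchanged nest computes the same result while only the traversal order changes:

```python
# Loop interchange: both nests compute the same values; only the
# order in which the 100x200 array A is traversed changes. With
# column-major storage, the interchanged (j outer, i inner) order
# walks consecutive memory locations.

def original(A):
    for i in range(100):
        for j in range(200):
            A[i][j] += 3
    return A

def interchanged(A):
    for j in range(200):
        for i in range(100):
            A[i][j] += 3
    return A

A1 = original([[0] * 200 for _ in range(100)])
A2 = interchanged([[0] * 200 for _ in range(100)])
```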

CS243 Winter 2006 Stanford University 6 Interchange Loops? for i = 2, 100 for j = 1, 200 A[i, j] = A[i-1, j+1]+3 end_for e.g. dependence from (3,3) to (4,2)

CS243 Winter 2006 Stanford University 7 Dependence Vectors Distance vector (1,-1) = (4,2)-(3,3) Direction vector (+, -) from the signs of the distance vector Loop interchange is not legal if there exists a dependence (+, -)

CS243 Winter 2006 Stanford University 8 Agenda Introduction Loop Transformations Affine Transform Theory

CS243 Winter 2006 Stanford University 9 Loop Fusion for i = 1, 1000 A[i] = B[i] + 3 end_for for j = 1, 1000 C[j] = A[j] + 5 end_for for i = 1, 1000 A[i] = B[i] + 3 C[i] = A[i] + 5 end_for Better reuse: the value written to A[i] by the first statement is read immediately by the second.
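A minimal Python sketch of the fusion above (0-based indices; the concrete contents of B are arbitrary), checking that the fused loop produces the same A and C:

```python
# Loop fusion: merging the two loops lets A[i], written by the first
# statement, be consumed by the second while it is still "hot"
# (in a register or cache line), instead of a second pass over A.
B = list(range(1000))

# Unfused version
A = [0] * 1000
C = [0] * 1000
for i in range(1000):
    A[i] = B[i] + 3
for j in range(1000):
    C[j] = A[j] + 5

# Fused version
A2 = [0] * 1000
C2 = [0] * 1000
for i in range(1000):
    A2[i] = B[i] + 3
    C2[i] = A2[i] + 5
```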

CS243 Winter 2006 Stanford University 10 Loop Distribution for i = 1, 1000 A[i] = A[i-1] + 3 end_for for i = 1, 1000 C[i] = B[i] + 5 end_for for i = 1, 1000 A[i] = A[i-1] + 3 C[i] = B[i] + 5 end_for The 2nd loop is parallel.
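A runnable sketch of the distribution above (Python, arrays padded with an unused slot so the slide's 1-based indices carry over; B's contents are arbitrary). The A statement carries a dependence through A[i-1]; the C statement does not, so after splitting, the C loop has no loop-carried dependence and could run fully in parallel:

```python
N = 1000
B = [0] + list(range(N))            # 1-based: B[1..N]
A  = [0] * (N + 1); C  = [0] * (N + 1)
A2 = [0] * (N + 1); C2 = [0] * (N + 1)

# Original combined loop: the A statement is a recurrence,
# the C statement is independent of it.
for i in range(1, N + 1):
    A[i] = A[i - 1] + 3
    C[i] = B[i] + 5

# After distribution: a sequential A loop and a parallel C loop.
for i in range(1, N + 1):
    A2[i] = A2[i - 1] + 3
for i in range(1, N + 1):
    C2[i] = B[i] + 5
```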

CS243 Winter 2006 Stanford University 11 Register Blocking for j = 1, 2*m for i = 1, 2*n A[i, j] = A[i-1, j] + A[i-1, j-1] end_for for j = 1, 2*m, 2 for i = 1, 2*n, 2 A[i, j] = A[i-1,j] + A[i-1,j-1] A[i, j+1] = A[i-1,j+1] + A[i-1,j] A[i+1, j] = A[i, j] + A[i, j-1] A[i+1, j+1] = A[i, j+1] + A[i, j] end_for Better reuse: values such as A[i, j] and A[i, j+1] computed in the block feed the A[i+1, *] statements directly.

CS243 Winter 2006 Stanford University 12 Virtual Register Allocation for j = 1, 2*M, 2 for i = 1, 2*N, 2 r1 = A[i-1,j] r2 = r1 + A[i-1,j-1] A[i, j] = r2 r3 = A[i-1,j+1] + r1 A[i, j+1] = r3 A[i+1, j] = r2 + A[i, j-1] A[i+1, j+1] = r3 + r2 end_for Memory operations are reduced to register loads/stores: 8MN array loads drop to 4MN.
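A small runnable check of the register-blocked body above (Python, with M = N = 4 as an assumed small size and a border of 1s so the results are nontrivial; r1–r3 are ordinary variables standing in for registers). The 2x2 block with scalars is verified against the plain nest:

```python
# Register blocking + virtual registers: the 2x2 unrolled body keeps
# the values shared between its four statements in scalars r1..r3
# instead of reloading them from A.
M, N = 4, 4

def plain(A):
    for j in range(1, 2 * M + 1):
        for i in range(1, 2 * N + 1):
            A[i][j] = A[i - 1][j] + A[i - 1][j - 1]
    return A

def blocked(A):
    for j in range(1, 2 * M + 1, 2):
        for i in range(1, 2 * N + 1, 2):
            r1 = A[i - 1][j]
            r2 = r1 + A[i - 1][j - 1]
            A[i][j] = r2
            r3 = A[i - 1][j + 1] + r1
            A[i][j + 1] = r3
            A[i + 1][j] = r2 + A[i][j - 1]
            A[i + 1][j + 1] = r3 + r2
    return A

def fresh():
    # (2N+2) x (2M+2) array of 1s: row 0 and column 0 act as borders
    return [[1] * (2 * M + 2) for _ in range(2 * N + 2)]
```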

CS243 Winter 2006 Stanford University 13 Scalar Replacement for i = 2, N+1 = A[i-1]+1 A[i] = end_for t1 = A[1] for i = 2, N+1 = t1 + 1 t1 = A[i] = t1 end_for Eliminate loads and stores for array references
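The slide elides the expressions around A[i-1] and A[i]; a runnable Python sketch with hypothetical stand-ins (an array U consuming "= A[i-1]+1" and an array B producing "A[i] ="), checking that the scalar-replaced loop matches the original:

```python
# Scalar replacement: A[i-1] read in each iteration is exactly the
# value stored to A[i] in the previous iteration, so it can live in
# the scalar t1 across iterations, eliminating one array load per
# iteration. U and B are hypothetical; the slide leaves them elided.
N = 10
B = [i * i for i in range(N + 2)]

# Original loop
A = list(range(N + 2))
U = [0] * (N + 2)
for i in range(2, N + 2):
    U[i] = A[i - 1] + 1
    A[i] = B[i]

# After scalar replacement
A2 = list(range(N + 2))
U2 = [0] * (N + 2)
t1 = A2[1]
for i in range(2, N + 2):
    U2[i] = t1 + 1
    t1 = B[i]
    A2[i] = t1
```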

CS243 Winter 2006 Stanford University 14 Unroll-and-Jam for j = 1, 2*M for i = 1, N A[i, j] = A[i-1, j] + A[i-1, j-1] end_for for j = 1, 2*M, 2 for i = 1, N A[i, j]=A[i-1,j]+A[i-1,j-1] A[i, j+1]=A[i-1,j+1]+A[i-1,j] end_for Expose more opportunity for scalar replacement
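A runnable sketch of the unroll-and-jam above (Python, assumed small sizes M = 4, N = 8, borders of 1s), verifying the jammed nest against the plain one:

```python
# Unroll-and-jam: the outer j loop is unrolled by 2 and the two
# copies of the inner loop are fused ("jammed"). Each inner iteration
# now touches A[i-1][j-1..j+1], and A[i-1][j] feeds both statements,
# exposing reuse for scalar replacement.
M, N = 4, 8

def plain(A):
    for j in range(1, 2 * M + 1):
        for i in range(1, N + 1):
            A[i][j] = A[i - 1][j] + A[i - 1][j - 1]
    return A

def unroll_jam(A):
    for j in range(1, 2 * M + 1, 2):
        for i in range(1, N + 1):
            A[i][j] = A[i - 1][j] + A[i - 1][j - 1]
            A[i][j + 1] = A[i - 1][j + 1] + A[i - 1][j]
    return A

def fresh():
    # (N+1) rows x (2M+2) columns of 1s; row 0 / column 0 are borders
    return [[1] * (2 * M + 2) for _ in range(N + 1)]
```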

CS243 Winter 2006 Stanford University 15 Large Arrays for i = 1, 1000 for j = 1, 1000 A[i, j] = A[i, j] + B[j, i] end_for Suppose arrays A and B have row-major layout B has poor cache locality. Loop interchange will not help.

CS243 Winter 2006 Stanford University 16 Loop Blocking for v = 1, 1000, 20 for u = 1, 1000, 20 for j = v, v+19 for i = u, u+19 A[i, j] = A[i, j] + B[j, i] end_for Access to small blocks of the arrays has good cache locality.
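A runnable sketch of the blocking (tiling) above, shrunk to an assumed 100x100 instance with the slide's 20x20 tiles so it runs quickly, and checked against the untiled loop:

```python
# Loop blocking (tiling): the transpose-add A[i][j] += B[j][i] is
# traversed in 20x20 tiles so the touched pieces of both A (row-wise)
# and B (column-wise) stay resident in cache while a tile is processed.
n, T = 100, 20

def untiled(A, B):
    for i in range(n):
        for j in range(n):
            A[i][j] += B[j][i]
    return A

def tiled(A, B):
    for v in range(0, n, T):          # tile origin in j
        for u in range(0, n, T):      # tile origin in i
            for j in range(v, v + T):
                for i in range(u, u + T):
                    A[i][j] += B[j][i]
    return A

def mk():
    return [[i * n + j for j in range(n)] for i in range(n)]
```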

CS243 Winter 2006 Stanford University 17 Loop Unrolling for ILP for i = 1, 10 a[i] = b[i]; *p =... end_for for i = 1, 10, 2 a[i] = b[i]; *p = … a[i+1] = b[i+1]; *p = … end_for Larger scheduling regions and fewer dynamic branches, at the cost of increased code size.

CS243 Winter 2006 Stanford University 18 Agenda Introduction Loop Transformations Affine Transform Theory

CS243 Winter 2006 Stanford University 19 Objective Unify a large class of program transformations. Example: float Z[100]; for i = 0, 9 Z[i+10] = Z[i]; end_for

CS243 Winter 2006 Stanford University 20 Iteration Space A d-deep loop nest has d index variables and is modeled by a d-dimensional space. The space of iterations is bounded by the lower and upper bounds of the loop indices. Iteration space: i = 0, 1, …, 9 for i = 0, 9 Z[i+10] = Z[i]; end_for

CS243 Winter 2006 Stanford University 21 Matrix Formulation The iterations in a d-deep loop nest can be represented mathematically as { i ∈ Z^d | Bi + b ≥ 0 }, where Z is the set of integers, B is a d x d integer matrix, b is an integer vector of length d, and 0 is a vector of d 0's.

CS243 Winter 2006 Stanford University 22 Example for i = 0, 5 for j = i, 7 Z[j,i] = 0; E.g. the 3rd row, -i + j ≥ 0, comes from the lower bound j ≥ i of loop j.
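The constraint system for this nest can be written out and checked directly; a Python sketch (one row of B per loop bound, so four rows for the two loops) that enumerates the integer points satisfying B·(i, j) + b ≥ 0:

```python
# Iteration space of (i = 0..5, j = i..7) as the integer points with
# B.(i, j) + b >= 0. Each row encodes one loop bound.
B = [[ 1,  0],   # i >= 0
     [-1,  0],   # 5 - i >= 0        (i <= 5)
     [-1,  1],   # -i + j >= 0       (lower bound j >= i)
     [ 0, -1]]   # 7 - j >= 0        (j <= 7)
b = [0, 5, 0, 7]

def in_space(i, j):
    return all(B[r][0] * i + B[r][1] * j + b[r] >= 0
               for r in range(len(B)))

points = [(i, j) for i in range(-2, 10) for j in range(-2, 10)
          if in_space(i, j)]
```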

CS243 Winter 2006 Stanford University 23 Symbolic Constants for i = 0, n Z[i] = 0; E.g. the 1st row, -i + n ≥ 0, comes from the upper bound i ≤ n.

CS243 Winter 2006 Stanford University 24 Data Space An n-dimensional array is modeled by an n-dimensional space. The space is bounded by the array bounds. Data space: a = 0, 1, …, 99 float Z[100] for i = 0, 9 Z[i+10] = Z[i]; end_for

CS243 Winter 2006 Stanford University 25 Processor Space Initially assume an unbounded number of virtual processors (vp1, vp2, …) organized in a multi-dimensional space: (iteration 1, vp1), (iteration 2, vp2), … After parallelization, map to physical processors (p1, p2): (vp1, p1), (vp2, p2), (vp3, p1), (vp4, p2), …

CS243 Winter 2006 Stanford University 26 Affine Array Index Function Each array access in the code specifies a mapping from an iteration in the iteration space to an array element in the data space. Both i+10 and i are affine. float Z[100] for i = 0, 9 Z[i+10] = Z[i]; end_for

CS243 Winter 2006 Stanford University 27 Array Affine Access The bounds of the loop are expressed as affine expressions of the surrounding loop variables and symbolic constants, and the index for each dimension of the array is also an affine expression of the surrounding loop variables and symbolic constants.

CS243 Winter 2006 Stanford University 28 Matrix Formulation An array access maps a vector i within the bounds to the array element at location Fi + f. E.g. the access X[i-1] in a loop nest over i, j has F = [1 0] and f = [-1].
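A tiny Python sketch of the access function Fi + f for this example, evaluated at a few iterations:

```python
# Affine array access: X[i-1] inside an (i, j) nest maps the
# iteration vector (i, j) to the 1-D element index F.(i, j) + f,
# with F = [1 0] (only i matters) and f = [-1].
F = [[1, 0]]
f = [-1]

def access(i, j):
    # One output coordinate per row of F (here, one).
    return [F[0][0] * i + F[0][1] * j + f[0]]
```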

CS243 Winter 2006 Stanford University 29 Affine Partitioning An affine function assigns iterations in the iteration space to processors in the processor space. E.g. iteration i to processor 10-i. float Z[100] for i = 0, 9 Z[i+10] = Z[i]; end_for

CS243 Winter 2006 Stanford University 30 Data Access Region The set of array elements touched by a reference, described by affine constraints derived from the access function and the loop bounds. The region for Z[i+10] is {a | 10 ≤ a ≤ 19}. float Z[100] for i = 0, 9 Z[i+10] = Z[i]; end_for

CS243 Winter 2006 Stanford University 31 Data Dependences Solved as linear constraints, as shown in the last lecture: there exist i_r, i_w such that 0 ≤ i_r, i_w ≤ 9 and i_w + 10 = i_r. float Z[100] for i = 0, 9 Z[i+10] = Z[i]; end_for
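Because the iteration space here is tiny, the dependence system can simply be enumerated; a Python sketch showing that it has no solution, so the loop carries no dependence and its iterations are independent:

```python
# Dependence test for "Z[i+10] = Z[i]", i = 0..9: a flow dependence
# needs iterations i_w (write) and i_r (read) in bounds with
# i_w + 10 = i_r. Since i_w + 10 >= 10 > 9 >= i_r, no pair exists.
solutions = [(iw, ir)
             for iw in range(10)
             for ir in range(10)
             if iw + 10 == ir]
```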

CS243 Winter 2006 Stanford University 32 Affine Transform (figure: affine mapping from iteration space (i, j) to transformed space (u, v))

CS243 Winter 2006 Stanford University 33 Locality Optimization for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_for for u = 1, 200 for v = 1, 100 A[v,u] = A[v,u]+ 3 end_for

CS243 Winter 2006 Stanford University 34 Old Iteration Space for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_for

CS243 Winter 2006 Stanford University 35 New Iteration Space for u = 1, 200 for v = 1, 100 A[v,u] = A[v,u]+ 3 end_for

CS243 Winter 2006 Stanford University 36 Old Array Accesses for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_for

CS243 Winter 2006 Stanford University 37 New Array Accesses for u = 1, 200 for v = 1, 100 A[v,u] = A[v,u]+ 3 end_for

CS243 Winter 2006 Stanford University 38 Interchange Loops? for i = 2, 1000 for j = 1, 1000 A[i, j] = A[i-1, j+1]+3 end_for e.g. dependence vector d_old = (1,-1)

CS243 Winter 2006 Stanford University 39 Interchange Loops? A transformation is legal if the new dependence is lexicographically positive, i.e. the leading non-zero entry in the dependence vector is positive.
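The legality test above is mechanical; a minimal Python sketch that permutes each dependence vector the way interchange does and checks lexicographic positivity, applied to the d_old = (1,-1) example from the previous slide:

```python
# Legality of loop interchange: interchange swaps the components of
# each dependence vector; the transform is legal only if every
# transformed vector is lexicographically positive (its leading
# non-zero entry is positive).
def lex_positive(v):
    for x in v:
        if x != 0:
            return x > 0
    return False  # the all-zero vector is not lexicographically positive

def interchange_legal(deps):
    return all(lex_positive((dj, di)) for (di, dj) in deps)
```

For d_old = (1,-1) the interchanged vector is (-1,1), which is lexicographically negative, so the interchange on the previous slide is illegal.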

CS243 Winter 2006 Stanford University 40 Summary Locality Optimizations Loop Transformations Affine Transform Theory