CR18: Advanced Compilers
L05: Scheduling for Locality
Tomofumi Yuki

Recap
- Last time: scheduling techniques; maximum parallelism != best performance
- This time: how can we do better?

Pluto Strategy
- We want only 1D parallelism: coarse-grained (outer) parallelism, good data locality
- We want tiling: wave-front parallelism is guaranteed, each tile can be executed atomically, good for sequential performance

Intuition of the Pluto Algorithm
- Skew and tile
[Figure: (i,j) iteration space, before and after skewing, with rectangular tiles overlaid]
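The skew can be written out directly as a loop rewrite. Below is a minimal C sketch (not from the slides): a generic statement S(i,j) under the transform (i,j -> i+j, i); N, M, and S are hypothetical placeholders.

    #define MAX(a,b) ((a) > (b) ? (a) : (b))
    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    void S(int i, int j);   /* placeholder statement */

    void original(int N, int M) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                S(i, j);
    }

    void skewed(int N, int M) {
        /* New coordinates: c1 = i+j, c2 = i, so i = c2 and j = c1-c2. */
        for (int c1 = 0; c1 <= (N-1) + (M-1); c1++)
            for (int c2 = MAX(0, c1-(M-1)); c2 <= MIN(c1, N-1); c2++)
                S(c2, c1 - c2);
    }

Both versions visit the same (i,j) points; only the order changes, which is what makes rectangular tiling of the new loops legal once all dependence components become non-negative.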

Tiling Hyper-Planes
- Another name for a 1D schedule θ; a set of θs defines the tiling
- The θs define the transform: (i,j -> i+j,i) corresponds to the skew in the previous slide
[Figure: (i,j) iteration space with hyper-planes θ1 = i+j and θ2 = i]

Legality of Tiling
- Each tiling hyper-plane must satisfy, for every dependence from source s (in statement S) to target t (in statement T):
  θ_T(t) - θ_S(s) ≥ 0
- What is the difference from the causality condition? Note that this is about an affine transform, not a schedule
- Must be weakly satisfied (≥ 0) for each dimension!
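For uniform dependences the condition reduces to a dot product test against each distance vector. A minimal sketch, with hypothetical dependence data:

    #include <stdio.h>

    /* Weak legality of a hyper-plane theta for uniform dependences:
       theta . d >= 0 must hold for every distance vector d = target - source. */
    int is_legal(const int theta[2], const int deps[][2], int ndeps) {
        for (int e = 0; e < ndeps; e++) {
            int delta = theta[0] * deps[e][0] + theta[1] * deps[e][1];
            if (delta < 0) return 0;   /* the dependence would go backward */
        }
        return 1;
    }

    int main(void) {
        int deps[3][2] = {{1,-1}, {1,0}, {1,1}};  /* stencil-like distances */
        int skew[2]  = {1, 1};   /* theta = i+j */
        int plain[2] = {0, 1};   /* theta = j: delta = -1 on (1,-1) */
        printf("i+j legal? %d\n", is_legal(skew,  deps, 3));  /* prints 1 */
        printf("j   legal? %d\n", is_legal(plain, deps, 3));  /* prints 0 */
        return 0;
    }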

What Does the Condition Mean? 1. Fully Permutable
- Recall that the θs define the transform: all statements are mapped to a common d-D space; let i1, ..., in be the new indices
- Weakly satisfied in all dimensions => i1 ≥ i'1, ..., in ≥ i'n for all dependences (target i, source i')
- This is a reformulation of the fully-permutable condition that also works for scheduling imperfect loop nests

What Does the Condition Mean? 2. All Statements Are Fused
- Somewhat implied by full permutability
- What are the possible dependences: from S1 to S2? from S2 to S1?
- Exception: when S1 does not use values of S2

for i
  for j
    S1
  for j
    S2
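For contrast, a C sketch of the fused form the condition pushes toward (hypothetical statement functions; legal only when the dependences between S1 and S2 allow it):

    void S1(int i, int j);
    void S2(int i, int j);

    void fused(int N, int M) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++) {
                S1(i, j);   /* both statements share the same (i,j) iteration */
                S2(i, j);
            }
    }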

Selecting Tiling Hyper-Planes
- Which is better?
[Figure: the same (i,j) iteration space with two candidate sets of hyper-planes]

Cost Functions in Pluto
- Formulated as: for each dependence e from s to t, δ_e = θ(t) - θ(s)
- What does this capture?
- Example: dep (i,j -> i+1,j-1), with θ1 = i+j and θ2 = i:
  δ1 = ((i+1)+(j-1)) - (i+j) = 0
  δ2 = (i+1) - i = 1

Cost Functions in Pluto (continued)
- Same dependence (i,j -> i+1,j-1), now with θ1 = i+j and θ2 = i-j:
  δ1 = ((i+1)+(j-1)) - (i+j) = 0
  δ2 = ((i+1)-(j-1)) - (i-j) = 2
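Since these δs are dot products with the distance vector, the comparison can be scripted. A small hypothetical driver:

    #include <stdio.h>

    /* delta_e = theta . d for a uniform dependence with distance vector d. */
    int delta(const int theta[2], const int d[2]) {
        return theta[0] * d[0] + theta[1] * d[1];
    }

    int main(void) {
        int d[2]   = {1, -1};   /* dep (i,j -> i+1,j-1) */
        int t1[2]  = {1, 1};    /* theta1 = i+j */
        int t2a[2] = {1, 0};    /* theta2 = i   */
        int t2b[2] = {1, -1};   /* theta2 = i-j */
        printf("theta1 = i+j: delta = %d\n", delta(t1, d));    /* 0 */
        printf("theta2 = i  : delta = %d\n", delta(t2a, d));   /* 1 */
        printf("theta2 = i-j: delta = %d\n", delta(t2b, d));   /* 2 */
        return 0;
    }

A smaller δ means reuse along that hyper-plane happens sooner, which is why θ2 = i beats θ2 = i-j here.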

Reuse Distance
- When θ corresponds to a sequential loop
- Two dependences: (i,j -> i+1,j) and (i,0 -> i,j) : j>0; what are the δs?
- δ represents the number of iterations of the loop (corresponding to θ) until reuse via dependence e
[Figure: (i,j) iteration space with θ1 = i, θ2 = j]

Communication Volume
- When θ corresponds to a parallel loop
- Let s_i, s_j be the tile sizes
- Horizontal dependence: s_j values go to the horizontal neighbor
- Vertical dependence: s_i values go to N/s_j tiles
- A constant volume is better; 0 is even better!
[Figure: tiled (i,j) space showing the two communication patterns]

Iterative Search
- We need d hyper-planes for a d-D space
- Note that we are not looking for parallelism here; parallelism comes from the tile wave-fronts
- Approach: find one θ for each statement; constrain the search space to be linearly independent of the θs already found; repeat (see the toy sketch below)
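The loop below is a toy of this iterative structure for one statement with uniform dependences, replacing Pluto's ILP with brute-force enumeration of small non-negative coefficients. Everything here (the dependence data, the coefficient bound, the cost as max δ) is a hypothetical sketch, not Pluto's actual solver:

    #include <stdio.h>

    #define NDEPS 3
    #define BOUND 2   /* enumerate coefficients in [0, BOUND] */

    static const int deps[NDEPS][2] = {{1,-1}, {1,0}, {1,1}};

    /* Worst-case delta over all dependences; -1 if any delta < 0 (illegal). */
    static int cost(int a1, int a2) {
        int worst = 0;
        for (int e = 0; e < NDEPS; e++) {
            int d = a1 * deps[e][0] + a2 * deps[e][1];
            if (d < 0) return -1;
            if (d > worst) worst = d;
        }
        return worst;
    }

    int main(void) {
        int found[2][2], nfound = 0;
        while (nfound < 2) {
            int best1 = 0, best2 = 0, bestc = -1;
            for (int a1 = 0; a1 <= BOUND; a1++)
                for (int a2 = 0; a2 <= BOUND; a2++) {
                    if (a1 == 0 && a2 == 0) continue;   /* trivial solution */
                    int c = cost(a1, a2);
                    if (c < 0) continue;                /* illegal */
                    /* linear independence with the previous row (2D: determinant) */
                    if (nfound == 1 &&
                        found[0][0] * a2 - found[0][1] * a1 == 0) continue;
                    if (bestc < 0 || c < bestc) { bestc = c; best1 = a1; best2 = a2; }
                }
            if (bestc < 0) { printf("no legal hyper-plane\n"); return 1; }
            found[nfound][0] = best1; found[nfound][1] = best2;
            printf("theta_%d = %d*t + %d*i  (max delta = %d)\n",
                   nfound + 1, best1, best2, bestc);
            nfound++;
        }
        return 0;
    }

On the Jacobi-like distances {(1,-1),(1,0),(1,1)} this prints theta_1 = 1*t + 0*i and then theta_2 = 1*t + 1*i, i.e. (t, t+i), matching the result derived for Example 1 below.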

Tilable Band
- Band of loops/schedules: a consecutive sequence of dimensions
- Tilable band: a band that satisfies the legality condition for a common set of dependences
- Pluto tiles the outermost tilable band

So, Which Is Better?
- What are the θs and δs? What is the order?
[Figure: two candidate hyper-plane choices over the (i,j) iteration space]

Solving with ILP
- Farkas' Lemma again (we had enough of Farkas last time)
- There is a problem when the constraint is: [formula not preserved in the transcript]
- The issue: the δs depend on the loop indices, so "minimize δ" is not directly a linear objective; the Pluto paper bounds each δ_e by a parametric function u·N + w and minimizes (u, w) instead

The "Practical" Choice
- Given the schedule prototype: θ_S(x) = a1·x1 + ... + ad·xd + a0
- Constrain the coefficients to: ai ≥ 0
- What does this mean? (e.g., loop reversal and negative skews can never appear)
- Relaxed recently by a paper on PLUTO+

Example 1: Jacobi 1D
- One example implementation (the two-array version below), but it is rather contrived due to limitations in polyhedral compilers
- The dependences are simple, as the single-assignment version shows

for t = 0 .. T
  for i = 1 .. N-1
    S1: B[i] = foo(A[i], A[i-1], A[i+1]);
  for i = 1 .. N-1
    S2: A[i] = foo(B[i], B[i-1], B[i+1]);

for t = 0 .. T
  for i = 1 .. N-1
    S1: A[t,i] = foo(A[t-1,i], A[t-1,i-1], A[t-1,i+1]);

Example 1: Jacobi 1D
Prototype: θ_S1(t,i) = a1·t + a2·i + a0

Dependences and their δs:
  S1[t,i] -> S1[t+1,i]   : δ1 = θ_S1(t+1,i)   - θ_S1(t,i)
                              = a1(t+1)+a2·i+a0 - (a1·t+a2·i+a0) = a1
  S1[t,i] -> S1[t+1,i+1] : δ2 = θ_S1(t+1,i+1) - θ_S1(t,i)
                              = a1(t+1)+a2(i+1)+a0 - (a1·t+a2·i+a0) = a1+a2
  S1[t,i] -> S1[t+1,i-1] : δ3 = θ_S1(t+1,i-1) - θ_S1(t,i)
                              = a1(t+1)+a2(i-1)+a0 - (a1·t+a2·i+a0) = a1-a2

Example 1: Jacobi 1D (second hyper-plane)
- Same prototype and δs as before; the new constraint is that the next θ must be linearly independent of the previous one

Example 1: Jacobi 1D
- We have a set of hyper-planes: θ_S1(t,i) = (t, t+i)
[Figure: original (t,i) iteration space and the skewed space with tiles]
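Written out, the transform (t, t+i) followed by rectangular tiling gives code along these lines. A hand-written C sketch of the single-assignment version, with hypothetical tile size TS and placeholder function foo; t starts at 1 so that A[t-1] is defined, and real generated code would also add the tile wave-front parallel loop:

    #define TS 32   /* hypothetical tile size */
    #define MAX(a,b) ((a) > (b) ? (a) : (b))
    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    double foo(double a, double b, double c);   /* placeholder stencil */

    /* New coordinates: c1 = t, c2 = t+i (theta = (t, t+i)); both carry only
       non-negative dependence components, so TS x TS tiling is legal.
       Assumes A[0][*] holds the initial values. */
    void jacobi_tiled(int T, int N, double A[T+1][N]) {
        for (int c1b = 1; c1b <= T; c1b += TS)                    /* tile loops */
            for (int c2b = c1b + 1; c2b <= c1b + TS - 1 + N - 2; c2b += TS)
                for (int c1 = c1b; c1 <= MIN(c1b + TS - 1, T); c1++)
                    for (int c2 = MAX(c2b, c1 + 1);
                         c2 <= MIN(c2b + TS - 1, c1 + N - 2); c2++) {
                        int t = c1, i = c2 - c1;
                        A[t][i] = foo(A[t-1][i], A[t-1][i-1], A[t-1][i+1]);
                    }
    }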

Example 2: 2mm
Simplified a bit:

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
      S1: C[i,j] += A[i,k] * B[k,j];
for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
      S2: E[i,j] += C[i,k] * D[k,j];

Dependences:
  S1[i,j,k] -> S1[i,j,k+1]
  S2[i,j,k] -> S2[i,j,k+1]
  S1[i,j,N] -> S2[i',j',k'] : i=i' and j=k'

Example 2: 2mm (dim 1)
Prototype: θ_S1(i,j,k) = a1·i + a2·j + a3·k + a0
           θ_S2(x,y,z) = b1·x + b2·y + b3·z + b0

Easy ones (the self-dependences):
  S1[i,j,k] -> S1[i,j,k+1] : δ = a3
  S2[x,y,z] -> S2[x,y,z+1] : δ = b3
  (the minimization will drive a3 = 0 and b3 = 0)

The interesting case is the inter-statement dependence:
  S1[i,j,N] -> S2[x,y,z] : i=x and j=z
  (viewed from the target: S2[x,y,z] depends on S1[x,z,N])

Example 2: 2mm (dim 1)
Prototype: (as above)

For the inter-statement dependence S1[i,j,N] -> S2[x,y,z] : i=x and j=z (i.e., S2[x,y,z] depends on S1[x,z,N]):

  δ = θ_S2(x,y,z) - θ_S1(x,z,N)
    = (b1·x + b2·y + b3·z + b0) - (a1·x + a2·z + a3·N + a0)
    = (b1-a1)x + b2·y + (b3-a2)z - a3·N + (b0-a0)

With a3 = b3 = 0 (and dropping the constant part):
  δ = (b1-a1)x + b2·y - a2·z

Example 2: 2mm (dim 1)
Prototype: (as above)

Minimize (b1-a1)x + b2·y - a2·z subject to:
  a1+a2+b1+b2 ≠ 0 (avoid the trivial all-zero solution)
  a, b ≥ 0
  (plus weak satisfaction of all dependences)

We get θ_S1(i,j,k) = i and θ_S2(x,y,z) = x

Example 2: 2mm (dim 2)
Prototype: (as above)

Minimize (b1-a1)x + b2·y - a2·z subject to the same constraints, now also linearly independent of the previous θs.

We get θ_S1(i,j,k) = j and θ_S2(x,y,z) = z

Example 2: 2mm (dim 3)
Prototype: (as above)

Minimize the same objective, linearly independent of the previous θs.
Does θ_S1 = k and θ_S2 = y work? (a3 = 1, b2 = 1, rest 0)
Recall: S1[i,j,N] -> S2[x,y,z] : i=x and j=z

Example 2: 2mm (dim 3)
- Checking θ_S1 = k, θ_S2 = y against the inter-statement dependence (S2[x,y,z] depends on S1[x,z,N]):
  δ = θ_S2(x,y,z) - θ_S1(x,z,N) = y - N, which is negative whenever y < N
- So this choice violates weak satisfaction

Example 2: 2mm (dim 3)
- θ_S1 = k with θ_S2 = y does not work, so we have to split here:
  θ_S1 = 0 and θ_S2 = 1
  (a constant dimension that orders all of S1 before all of S2)

Example 2: 2mm (dim 4)
- Proceed to the 4th dimension, because the 3rd dimension is only for statement ordering
- Now solve the problem independently for each statement; the remaining dependences are
  S1[i,j,k] -> S1[i,j,k+1] and S2[x,y,z] -> S2[x,y,z+1]
- Case S1: linearly independent of [i] and [j]; Case S2: linearly independent of [x] and [z]
- We get [k] and [y]

Example 2: 2mm
Finally, we have a set of hyper-planes:
  θ_S1(i,j,k) = (i, j, 0, k)
  θ_S2(i,j,k) = (i, k, 1, j)

The first two dimensions form the tilable band. Transformed code (a C-style sketch of the tiled version follows below):

for i = 0 .. N
  for j = 0 .. N {
    for k = 0 .. N
      S1: C[i,j] += A[i,k] * B[k,j];
    for k = 0 .. N
      S2: E[i,k] += C[i,j] * D[j,k];
  }
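Tiling the (i,j) band of the fused code gives something like the following. A hand-written sketch with a hypothetical tile size TS; Pluto's actual output (next slide) differs in details:

    #define TS 32   /* hypothetical tile size */
    #define MIN(a,b) ((a) < (b) ? (a) : (b))

    /* 2mm after fusion, with the outer (i,j) tilable band tiled.
       Within one (i,j) iteration, S1 finishes C[i][j] before S2 reads it. */
    void mm2_tiled(int N, double C[N+1][N+1], double E[N+1][N+1],
                   double A[N+1][N+1], double B[N+1][N+1],
                   double D[N+1][N+1]) {
        for (int ib = 0; ib <= N; ib += TS)
            for (int jb = 0; jb <= N; jb += TS)
                for (int i = ib; i <= MIN(ib + TS - 1, N); i++)
                    for (int j = jb; j <= MIN(jb + TS - 1, N); j++) {
                        for (int k = 0; k <= N; k++)
                            C[i][j] += A[i][k] * B[k][j];    /* S1 */
                        for (int k = 0; k <= N; k++)
                            E[i][k] += C[i][j] * D[j][k];    /* S2 */
                    }
    }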

Example 2: 2mm
- Output of Pluto
[Figure: Pluto's generated code for 2mm; the listing was not preserved in the transcript]

Summary of Pluto
- Paper in 2008; huge impact: 350+ citations already
- Works very well as the default strategy
- But it is far from perfect!