Polyhedral Code Generation in the Real World — Nicolas Vasilache, Cédric Bastoul, Albert Cohen


Outline
- Introduction
- Affine schedules: formal view, general form
- Contributions, with a focus on modulo conditional removal (speed & quality)
- Experimental results

Introduction – Polyhedral Model
- Powerful expressiveness for high-level transformations (parallelism, locality)
- Can express any composition of the usual loop transformations [Pugh91]
- Compact representation of all legal transformations [Feautrier90]
- Code generation was the weakest link [Griebl et al. 98] until a recent algorithm [Quilleré00], which however does not handle transformations
- Still problematic on long, parametric statement sequences such as the SPEC benchmarks

Introduction – Transformations
Why transform?
- Cholesky factorization: 6 statements, optimal allocation functions [McKin92]
- Swim (SPECFP2000) [ICS05]: ~30 polyhedral loop transformations, 40% speedup w.r.t. the best peak performance on AMD64
- But code generation times are huge in the context of complex transformations (e.g., full Swim grows from 421 to 2267 lines: 20 min / 300 MB)
- Goal: generation time comparable to the back end of a real compiler (EKOPath)

Introduction – Context & Notations
Code generation: producing syntactic loops back from the matrix representation.

Affine Schedules

Affine Schedule – Trivial Example

Schedule: (i,j) → (t1 = i, t2 = j)
- A bijection between domain and time iterations
- The time iterations determine the generated loops (nesting, bounds)
- Execution follows the lexicographic order on the time dimensions
- Domain values touched by the statement: i = t1, j = t2

Original code:
for (i = 1; i <= 3; i++)
  for (j = 1; j <= 3; j++)
    S(i, j);

Generated code (time domain = domain):
for (t1 = 1; t1 <= 3; t1++)
  for (t2 = 1; t2 <= 3; t2++)
    S(i = t1, j = t2);

[Figure: 3x3 iteration domain mapped onto an identical 3x3 time domain]
Execution order: S(1,1) S(1,2) S(1,3) S(2,1) S(2,2) S(2,3) S(3,1) S(3,2) S(3,3)

Affine Schedule – Loop Interchange

Schedule: (i,j) → (t1 = j, t2 = i)
- Another bijection between domain and time iterations
- New bounds computation
- Lexicographic order on the time dimensions
- Domain values touched by the statement: i = t2, j = t1

Original code:
for (i = 1; i <= 3; i++)
  for (j = 1; j <= 3; j++)
    S(i, j);

Generated code:
for (t1 = 1; t1 <= 3; t1++)
  for (t2 = 1; t2 <= 3; t2++)
    S(i = t2, j = t1);

[Figure: 3x3 iteration domain mapped onto the transposed 3x3 time domain]
Execution order: S(1,1) S(2,1) S(3,1) S(1,2) S(2,2) S(3,2) S(1,3) S(2,3) S(3,3)

Affine Schedule – Parallel Wavefronts

Schedule: (i,j) → (t1 = i + j)
- NOT a bijection (just a surjection)
- New bounds computation: t1 spans [2, 6]
- Domain values touched by the statement: {(i,j) | i + j == t1}

Original code:
for (i = 1; i <= 3; i++)
  for (j = 1; j <= 3; j++)
    S(i, j);

Generated code:
for (t1 = 2; t1 <= 6; t1++)
  DOALL {(i,j) | i + j == t1}
    S(i, j);

[Figure: anti-diagonal wavefronts of the 3x3 domain]
Wavefronts: {S(1,1)} {S(1,2), S(2,1)} {S(1,3), S(2,2), S(3,1)} {S(2,3), S(3,2)} {S(3,3)}
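For concreteness, here is one way the wavefront can be emitted as plain (sequential) C once the inner bounds are derived from the domain constraints 1 <= i <= 3 and 1 <= j <= 3 with j = t1 - i; a minimal sketch, with S left abstract, and the inner loop being the one a DOALL could run in parallel:

for (t1 = 2; t1 <= 6; t1++)
  for (i = max(1, t1 - 3); i <= min(3, t1 - 1); i++)  /* points of one wavefront */
    S(i, t1 - i);                                     /* j is recovered as t1 - i */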

Affine Schedule – Statement Shifting

Schedules: S1: (i,j) → (t1 = i, t2 = j); S2: (i,j) → (t1 = i + 1, t2 = j)
- New bounds computation: S1 spans [1,3]x[1,3], S2 spans [2,4]x[1,3]; the time domains have disjoint parts
- A separation phase is needed on each time dimension (3^nb_stmt worst-case complexity)
- The generated code splits into a prologue (P), kernel (K), and epilogue (E)

Original code:
for (i = 1; i <= 3; i++)
  for (j = 1; j <= 3; j++) {
    S1(i, j);
    S2(i, j);
  }

Generated code:
for (t2 = 1; t2 <= 3; t2++)            /* P: t1 = 1 */
  S1(i = 1, j = t2);
for (t1 = 2; t1 <= 3; t1++)            /* K */
  for (t2 = 1; t2 <= 3; t2++) {
    S1(i = t1, j = t2);
    S2(i = t1 - 1, j = t2);
  }
for (t2 = 1; t2 <= 3; t2++)            /* E: t1 = 4 */
  S2(i = 4 - 1, j = t2);

General Case
- Schedules: Z^mi → Z^ni for each statement Si, associating a logical time to each iteration domain point
- Time value sets need to be separated → scattering functions
- The time part is used for separation and ordering (PolyLib computations, exponential in the number of dimensions [Wilde93])
- The domain part determines the values spanned by the time dimensions
- Quilleré separation phase [Quilleré00, Bastoul04]
- A scattering function stacks the time iterators on top of the domain iterators
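In symbols, and reusing the interchange example above, a schedule is just an affine map from domain iterations to logical time; a sketch of the notation, assuming the usual matrix form:

$$\theta_{S}\colon \mathbb{Z}^{m} \to \mathbb{Z}^{n}, \qquad
\theta_{S}(\vec{x}) = T\vec{x} + \vec{t}, \qquad
\text{e.g. interchange: } \theta(i,j) =
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix} =
\begin{pmatrix} j \\ i \end{pmatrix}.$$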

Quilleré Separation Phase

Separation Principles

Consider statements with domains and schedule functions such that:
- S1 has time dimensions (t1, t2, t3) spanning [2,5] x [2,7] x [5,8]
- S2 has time dimensions (t1, t2) spanning [0,3] x [5,7]
- S3 has time dimensions (t1, t2) spanning [-2,6] x [5,9]

Considering t1: polyhedral intersections and differences (2^dim) between the span [2,5] of S1 and the span [0,3] of S2 produce a worklist of disjoint pieces [0,1], [2,3], [4,5] and a remainder.

Separation Principles (continued)

Still considering t1: intersecting and differencing against the span [-2,6] of S3 refines the partition into [-2,-1], [0,1], [2,3], [4,5], ... where each piece carries the subset of statements alive on it (the kernel being the piece on which all statements are alive). Worst-case complexity: 3^nb_stmt.

Separation Principles (continued)

- That was for the first time dimension
- Recurse likewise on all time dimensions
- The result is a syntax tree of the generated loops

Generated code for the running example (beginning):
for (t1 = -2; t1 <= -1; t1++)
  for (t2 = 5; t2 <= 9; t2++)
    S3(...);
for (t1 = 0; t1 <= 1; t1++) {
  for (t2 = 5; t2 <= 7; t2++) {
    S2(...);
    S3(...);
  }
  for (t2 = 8; t2 <= 9; t2++)
    S3(...);
}
...
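The separation on the first time dimension can be mimicked on plain integer intervals; below is a deliberately simplified, self-contained C sketch (the real generator manipulates parametric polyhedra through PolyLib and recurses on the inner dimensions, both of which are omitted here; all names are illustrative). Running it on the spans above reproduces the t1 partition of the previous slides:

#include <stdio.h>

typedef struct { int lb, ub; } Interval;   /* inclusive bounds [lb, ub] */

/* One level of Quillere-style separation, restricted to intervals on
   t1: partition the union of the statements' spans into maximal
   sub-intervals, each covered by a fixed subset of statements,
   emitted in increasing (lexicographic) order. */
static void separate_t1(const Interval *dom, int n) {
    int cuts[64], nc = 0;
    for (int s = 0; s < n; s++) {          /* candidate cut points */
        cuts[nc++] = dom[s].lb;
        cuts[nc++] = dom[s].ub + 1;
    }
    for (int a = 1; a < nc; a++)           /* insertion sort of the cuts */
        for (int b = a; b > 0 && cuts[b] < cuts[b - 1]; b--) {
            int tmp = cuts[b]; cuts[b] = cuts[b - 1]; cuts[b - 1] = tmp;
        }
    for (int k = 0; k + 1 < nc; k++) {     /* elementary intervals */
        int lb = cuts[k], ub = cuts[k + 1] - 1, alive = 0;
        if (lb > ub) continue;             /* duplicate cut point */
        for (int s = 0; s < n; s++)
            if (dom[s].lb <= lb && ub <= dom[s].ub) alive++;
        if (!alive) continue;              /* gap between statements */
        printf("for (t1 = %d; t1 <= %d; t1++) {", lb, ub);
        for (int s = 0; s < n; s++)
            if (dom[s].lb <= lb && ub <= dom[s].ub) printf(" S%d;", s + 1);
        printf(" }\n");
    }
}

int main(void) {
    Interval dom[] = { {2, 5}, {0, 3}, {-2, 6} };  /* t1 spans of S1, S2, S3 */
    separate_t1(dom, 3);
    return 0;
}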

Contributions

Real-world issues:
- Problems provided by different sources (academia, industry, SPECFP2000) exhibit different challenging issues
- State-of-the-art polyhedral code generator: CLooG [Bastoul04]
- ALL PERFORMANCE COMPARISONS WILL BE CLooG vs. URGenT

Code generation speed:
- Node fusion (exploiting transformations' "locality")
- Exploiting scalar dimensions (replacing exponential computations with trivial ones)
- Improved domain iterator mapping (replacing exponential projections with matrix inversions)

Code quality:
- Faster if-hoisting yielding much smaller code (conditional factorization)
- Modulo conditional removal by strip-mining (stride issue), detailed below

Generation Speed Improvements

Generation Speed – Node Fusion
- Multidimensional schedules allow expressing non-affine (polynomial) quantities as affine ones with more dimensions → improved flexibility
- Drawback: pressure on code generation (height of the tree); added parameters and dimensions raise the polyhedral operation complexity
- HOWEVER: loop-level transformations affect blocks of statements (tiling, interchange, ...), and the polyhedron inclusion check is NOT exponential
- Therefore, before each separation phase, fuse consecutive nodes with equal scattering polyhedra (a sketch follows below)
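A minimal sketch of the fusion pass, assuming a toy AST in which the scattering polyhedron is abbreviated to a string key; in the real generator the comparison is a polyhedral equality/inclusion test and nodes carry statement lists, so every name and structure here is illustrative only:

#include <stdio.h>
#include <string.h>

typedef struct Node {
    const char *scat;     /* stand-in for the scattering polyhedron */
    const char *stmts;    /* statements carried by this node        */
    struct Node *next;
} Node;

/* Before each separation phase: merge consecutive nodes whose
   scattering at the current depth is equal, so the (worst-case
   exponential) separation runs on fewer, larger nodes. */
static void fuse_consecutive(Node *n) {
    while (n != NULL && n->next != NULL) {
        if (strcmp(n->scat, n->next->scat) == 0) {
            printf("fuse {%s} with {%s} under %s\n",
                   n->stmts, n->next->stmts, n->scat);
            n->next = n->next->next;  /* statement lists would be concatenated here */
        } else {
            n = n->next;
        }
    }
}

int main(void) {
    Node n3 = { "t1 = i",     "S3", NULL };
    Node n2 = { "t1 = i + j", "S2", &n3 };
    Node n1 = { "t1 = i + j", "S1", &n2 };
    fuse_consecutive(&n1);   /* fuses S1 and S2, leaves S3 alone */
    return 0;
}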

Generation Speed – Scalar Dimensions
- Some multidimensional schedules have scalar dimensions (UTF, URUK [ICS05])
- Scalar dimensions express a strict statement interleaving
- They require only comparisons of integers, no polyhedral separation
- Syntactic tree height reduction (potentially half the height)
- Marginal overhead for detection and computation
- Combines well with node fusion (see the sketch below)
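An assumed illustration (not on the slide): give S1 and S2 the schedules θ_S1(i) = (i, 0) and θ_S2(i) = (i, 1). The last dimension is a constant, so deciding that S1 precedes S2 inside the shared t1 loop is the integer comparison 0 < 1, and the generator can emit the interleaving directly, with no polyhedral separation at that depth:

for (t1 = 1; t1 <= N; t1++) {
    S1(t1);   /* scalar dimension 0: first in the interleaving  */
    S2(t1);   /* scalar dimension 1: second in the interleaving */
}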

Generation Speed – Domain Iterator Regeneration
- Problem: generating sequential loops for non-invertible schedules (wavefronts)
- CLooG [Bastoul04] handles it with a polyhedral projection on the domain iterators
- Drawback: adds dimensions (polyhedral operation complexity)
- Drawback: additional polyhedral computations on each leaf of the syntax tree left by the Quilleré separation phase (3^nb_stmts)
- Instead, use the invertibility of the transformation (ideally, given its rank, a mix of projections and invertibility)

Code Quality Improvements

Code Quality – If-Hoisting
- The Quilleré separation phase leaves conditionals on triangular loops (e.g., guards t1 <= 4, 5 <= t1 <= 10, 11 <= t1 buried below the t2 and t3 loops)
- The so-called backtracking phase removes them but is too aggressive: code bloat
- Left in place, they cause a potentially tremendous amount of useless work
- If-hoisting instead factorizes the conditionals upward (as sketched below): smaller code, no useless work (simplification IS needed)
- Explains the generation speedup on dreamupT3
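A hedged reconstruction of the illustration: the guard bounds 5 and 10 come from the slide, while the loop extents are made up for the sketch.

/* Before: the guard left by separation sits below t2 and t3 and is
   re-evaluated on every iteration (useless work); backtracking would
   instead duplicate the whole nest per t1 region (code bloat). */
for (t1 = 1; t1 <= 15; t1++)
  for (t2 = 1; t2 <= N; t2++)
    for (t3 = 1; t3 <= N; t3++)
      if (5 <= t1 && t1 <= 10)
        S(t1, t2, t3);

/* After if-hoisting: the condition is factorized into the t1 range
   once, and the loop body is guard-free. */
for (t1 = 5; t1 <= 10; t1++)
  for (t2 = 1; t2 <= N; t2++)
    for (t3 = 1; t3 <= N; t3++)
      S(t1, t2, t3);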

Code Quality – If-Hoisting (continued)
- The previous example does not occur as such in real life (it is just an illustration)
- Matrix multiplication transformed with URUK (strip-mine by factor 4 on the 3 loops, interchange loops twice, unroll): backtracking costs +50% code size

Removing Modulos – Domain Iterator Regeneration
- A problem since 1991: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others
- Let Θ be the transformation function for a statement and suppose Θ is invertible; Θ⁻¹ generally has rational coefficients, so consider the matrix of denominators of Θ⁻¹
- The Inverse Scatter Matrix (ISM) expresses the domain iterators from the time iterators; scaling each row by its denominator ensures all coefficients are integral
- This replaces the leaf polyhedral projections by matrix inversions
- It substitutes for the usual Hermite Normal Form in stride computations

Removing Modulos – Inverse Scatter Matrix

Consider S with 2 domain iterators and the schedule (i,j) → (t1 = 2i + 3j, t2 = i):

for (i = 1; i <= M; i++)
  for (j = 1; j <= N; j++)
    S(i, j);

The ISM expresses the domain iterators from the time iterators:

( i )   (  0    1  ) ( t1 )
( j ) = ( 1/3 -2/3 ) ( t2 )

whose INTEGRAL meaning is: i = t2 and 3*j = t1 - 2*t2.

Naively generated code (OUCH!):

for (t1 = 5; t1 <= 2*M + 3*N; t1++)
  for (t2 = 1; t2 <= min(M, t1/2); t2++)
    if ((t1 - 2*t2) % 3 == 0)
      S(i = t2, j = (t1 - 2*t2)/3);

Strip-mine & unroll t2 by 3/gcd(2,3) = 3 (k is the strip-mined counterpart of t2):

for (t1 = 5; t1 <= 2*M + 3*N; t1++)
  for (t2 = ?; t2 <= ?; t2++) {
    if (t1 % 3 == 0) S(i = t2, j = t1/3 - k);            /* t2 = 3k   */
    if (t1 % 3 == 2) S(i = t2, j = (t1 - 2)/3 - k);      /* t2 = 3k+1 */
    if (t1 % 3 == 1) S(i = t2, j = (t1 - 1)/3 - k - 1);  /* t2 = 3k+2 */
  }

Then strip-mine & unroll t1 by 3/gcd(1,3) = 3 (l is the strip-mined counterpart of t1), leaving no modulo guard at all:

for (t1 = ?; t1 <= ?; t1++)
  for (t2 = ?; t2 <= ?; t2++) {
    S(i = t2, j = l - k);      /* t1 = 3l   */
    S(i = t2, j = l - k - 1);  /* t1 = 3l+1 */
    S(i = t2, j = l - k - 2);  /* t1 = 3l+2 */
  }

Removing Modulos – There Is a Catch
- The previous example flowed nicely, but what about the loop bounds?
- "Issue" (actually a feature) of our strip-mine + unroll transformation: the strip-mine is NOT strided
- So modulos are indeed removed from the kernel (K) only; a prologue (P) and epilogue (E) appear, and code size grows:

for (i = M; i <= N; i++)
  S(i, j);

versus

for (i = M; i <= N; i += 2)
  for (ii = i; ii <= min(i + 1, N); ii++)
    S(ii, j);

- HOWEVER: P and E have marginal execution time when the strip-mine factor is "decent"
- The PROLOGUE gives us ALIGNMENT on %2 (the strip-mine factor)!
- This is a transformation quality issue

Removing Modulos – Hermite Normal Form
- Requiring all statements to have the same transformation would avoid the problem, but it is TOO RESTRICTIVE; our solution instead unrolls modulo guards out of kernels after strip-mining
- Hermite Normal Form: the mathematical decomposition Θ = U·H, where U is unimodular (a skewing matrix) and H is triangular, its diagonal coefficients giving the strides in the transformed space (worked example below)
- HNF suppresses the need for internal modulo guards, BUT if U is not the same for all statements the skewings differ: how to deal with non-parallel lattices?
- In practice, HNF is used for one statement or "simple" examples
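Worked on the schedule of the running example (t1 = 2i + 3j, t2 = i); this particular decomposition is easy to check by hand, though conventions for which side carries the unimodular factor vary in the literature:

$$\Theta = \begin{pmatrix} 2 & 3 \\ 1 & 0 \end{pmatrix}
        = U \cdot H
        = \begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix}
          \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix},
\qquad \det U = -1.$$

Here H maps (i, j) to (i, 3j): the intermediate lattice has stride 1 in its first coordinate and stride 3 in its second, and U merely skews it. These strides (1, 3) are exactly the mod-3 periodicity that the strip-mine-and-unroll scheme of the previous slides removes from the kernel.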

Experimental Results

Putting It All Together – Code Size Experiments
- State-of-the-art polyhedral code generator: CLooG [Bastoul04]
- Performance comparisons: CLooG04 vs. CLooG06
[Table: generated code size per benchmark, columns CL04, CL06, Improvement — data not preserved in the transcript]

Generation Speed – Experiments
- We compare the original CLooG (CL04) from the [Bastoul04] PACT paper with our optimized CLooG (CL06)
- Domain iterators, on Swim: 36% time reduction, 58% memory reduction
[Charts: node fusion and scalar dimensions, generation time as % of CL04 — data not preserved in the transcript]

Putting It All Together – Code Generation Speed Experiments
- State-of-the-art polyhedral code generator: CLooG [Bastoul04]; performance comparisons: CLooG vs. URGenT
- Affine schedule: 412 → 2267 lines (40% execution speedup w.r.t. best peak)
- PathScale -Ofast needs ~22 s to process the AST (LNO off)
[Chart: generation times, CL (CLooG) vs. UR (URGenT) — data not preserved in the transcript]

Conclusion / Future Work
- Implemented as the code generation phase of the URUK framework [ICS05]
- Generation speed goal achieved (up to 56x faster; stands the comparison with PathScale)
- Greatly improved code size with the improved if-hoisting technique (up to 5.8x smaller)
- Modulo conditionals are removed (from the kernel); can be mixed with HNF
- Still room for speeding up generation (caches, memory pools, parallelization)
- Future work: focus on code-generation-friendly transformations

Thank you!