Polyhedral Code Generation In The Real World Nicolas VASILACHE Cédric BASTOUL Albert COHEN.

Outline
Introduction
Affine schedules
Formal General Form
Contributions
Focus on Modulo Conditional Removal (speed & quality)
Experimental Results

Introduction – Polyhedral Model
Powerful expressiveness for high-level transformations (parallelism, locality)
Can express any composition of usual loop transformations [Pugh91]
Compact representation of all legal transformations [Feautrier90]
Code generation was the weakest link [Griebl & al. 98]
Until the recent algorithm of [Quilleré00] (without transformations)
However, still problematic on long, parametric sequences on SPECs

Introduction – Transformations
WHY TRANSFORM?
Cholesky factorization, 6 statements, optimal allocation functions [McKin92]
Swim (SPEC FP2000) [ICS05]: ~30 polyhedral loop transformations, 40% speedup wrt. best peak performance on AMD64
Huge code generation times (e.g. full Swim: ~421 → 2267 lines, 20 min / 300 MB)
In the context of complex transformations
Goal: generation time comparable to the back end (BE) of a real compiler (EKOPath)

Introduction – Context & Notations
Code generation: syntactic loops from matrix representation

Affine Schedules

Affine Schedule – Trivial Example
(i,j) → (t1 = i, t2 = j), schedule matrix [[1, 0], [0, 1]]
Bijection between domain and time iterations
Time iterations determine the generated loops (nesting, bounds)
Execution follows lexicographic order on time dimensions
Domain values touched by the statement: i = t1, j = t2
Original code:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++)
        S(i,j)
Generated code:
    for(t1=1;t1<=3;t1++)
      for(t2=1;t2<=3;t2++)
        S(i=t1,j=t2)
[Figure: the 3x3 iteration domain (axes i, j) mapped to the 3x3 time space (axes t1, t2); instances S(1,1) to S(3,3)]

Affine Schedule – Loop Interchange
(i,j) → (t1 = j, t2 = i), schedule matrix [[0, 1], [1, 0]]
Another bijection between domain and time iterations
New bounds computation
Lexicographic order on time dimensions
Domain values touched by the statement: i = t2, j = t1
Original code:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++)
        S(i,j)
Generated code:
    for(t1=1;t1<=3;t1++)
      for(t2=1;t2<=3;t2++)
        S(i=t2,j=t1)
[Figure: the 3x3 iteration domain (axes i, j) mapped to the 3x3 time space (axes t1, t2); instances S(1,1) to S(3,3)]
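The effect of the interchange schedule on execution order can be checked with a small sketch. The helper name interchange_order and the output layout are assumptions made for this illustration; they are not part of CLooG or URGenT. It scans the time space in lexicographic order and records which domain point (i, j) each time point executes.

```c
#include <assert.h>

/* Hypothetical helper: fill out[k] = {i, j} in the order in which the
   interchanged nest (t1 = j, t2 = i) executes S(i,j) on [1,n]x[1,n].
   Returns the number of executed instances. */
int interchange_order(int n, int out[][2]) {
    assert(n >= 1);
    int k = 0;
    for (int t1 = 1; t1 <= n; t1++)        /* outer time dimension */
        for (int t2 = 1; t2 <= n; t2++) {  /* inner time dimension */
            out[k][0] = t2;                /* i = t2 */
            out[k][1] = t1;                /* j = t1 */
            k++;
        }
    return k;
}
```

With n = 3 the first instances recorded are S(1,1), S(2,1), S(3,1): the i index now varies fastest, which is what interchange is expected to do.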

Affine Schedule – Parallel Wavefronts
(i,j) → (t1 = i+j), schedule matrix [1, 1]
NOT a bijection (just a surjection)
New bounds computation (t1: [2, 6])
Domain values touched by the statement: {(i,j) | i+j == t1}
Original code:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++)
        S(i,j)
Generated code:
    for(t1=2;t1<=6;t1++)
      DOALL {(i,j) | i+j == t1}
        S(i,j)
[Figure: the 3x3 iteration domain (axes i, j) mapped to the one-dimensional time space t1 in [2, 6]; instances S(1,1) to S(3,3)]
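A minimal sketch of how one wavefront can be enumerated sequentially, assuming a square domain [1,n]x[1,n]. The function name wavefront_size is hypothetical; the loop bounds follow from requiring j = t1 - i to stay in [1, n].

```c
#include <assert.h>

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Hypothetical helper: count the instances S(i,j) on wavefront
   t1 = i + j of the domain [1,n]x[1,n]. */
int wavefront_size(int n, int t1) {
    assert(n >= 1);
    int count = 0;
    /* i is restricted so that j = t1 - i satisfies 1 <= j <= n. */
    for (int i = imax(1, t1 - n); i <= imin(n, t1 - 1); i++)
        count++;  /* S(i, t1 - i) would run here; iterations are independent */
    return count;
}
```

For n = 3 the wavefronts t1 = 2..6 hold 1, 2, 3, 2 and 1 instances, which adds up to the 9 points of the domain.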

Affine Schedule – Statement Shifting
S1: (i,j) → (t1 = i, t2 = j)
S2: (i,j) → (t1 = i+1, t2 = j)
Original code:
    for(i=1;i<=3;i++)
      for(j=1;j<=3;j++) {
        S1(i,j)
        S2(i,j)
      }
Generated code (prologue P, kernel K, epilogue E):
    for(t2=1;t2<=3;t2++)
      S1(i=1,j=t2)
    for(t1=2;t1<=3;t1++)
      for(t2=1;t2<=3;t2++) {
        S1(i=t1,j=t2)
        S2(i=t1-1,j=t2)
      }
    for(t2=1;t2<=3;t2++)
      S2(i=4-1,j=t2)
New bounds computation (S1: [1,3]x[1,3], S2: [2,4]x[1,3]): the time domains have disjoint parts
Separation phase needed on each time dimension (3^nb_stmt worst-case complexity)
[Figure: domains of S1 and S2 (axes i, j) mapped to the time space (t1 in [1, 4], t2 in [1, 3])]
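The prologue / kernel / epilogue structure produced by shifting S2 by one iteration can be sketched as follows. The names record and shifted_trace, and the {statement, i, j} trace layout, are assumptions for illustration only.

```c
#include <assert.h>

/* Append one executed instance {statement, i, j} to the trace. */
static int record(int trace[][3], int k, int stmt, int i, int j) {
    trace[k][0] = stmt; trace[k][1] = i; trace[k][2] = j;
    return k + 1;
}

/* Hypothetical helper: execute S1 under (t1=i, t2=j) and S2 under
   (t1=i+1, t2=j) on [1,n]x[1,n]; returns the number of instances. */
int shifted_trace(int n, int trace[][3]) {
    assert(n >= 1);
    int k = 0;
    /* Prologue: t1 = 1, only S1 is active. */
    for (int t2 = 1; t2 <= n; t2++)
        k = record(trace, k, 1, 1, t2);
    /* Kernel: t1 in [2, n], both statements are active. */
    for (int t1 = 2; t1 <= n; t1++)
        for (int t2 = 1; t2 <= n; t2++) {
            k = record(trace, k, 1, t1, t2);
            k = record(trace, k, 2, t1 - 1, t2);
        }
    /* Epilogue: t1 = n + 1, only S2 is active (i = t1 - 1 = n). */
    for (int t2 = 1; t2 <= n; t2++)
        k = record(trace, k, 2, n, t2);
    return k;
}
```

The trace array needs room for 2*n*n entries. For n = 3 the kernel starts with S1(2,1) followed by S2(1,1), matching the generated code on the slide.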

General Case
Schedules: Z^mi → Z^ni for each statement Si
Schedules associate logical time to each iteration domain point
Time value sets need to be separated → scattering functions
Time part used for separation and ordering (Polylib computations, 2^dim [Wilde93])
Domain part determines the values spanned by time dimensions
Quilleré separation phase [Quilleré00, Bastoul04]
[Figure: scattering matrix layout, time part (time iterators) and domain part (domain iterators)]

Quilleré separation phase

Separation Principles
Consider statements with domain and schedule functions such that:
S1 has time dimensions (t1, t2, t3) spanning [2,5]x[2,7]x[5,8]
S2 has time dimensions (t1, t2) spanning [0,3]x[5,7]
S3 has time dimensions (t1, t2) spanning [-2,6]x[5,9]
Considering t1: S1 spans [2,5], S2 spans [0,3]
Polyhedral intersection / difference (2^dim) yields [0,1], [2,3], [4,5]
[Figure: worklist and remaining statements]

Separation Principles
Considering t1: the pieces [0,1], [2,3], [4,5] are then separated against S3, which spans [-2,6]
Polyhedral intersection / difference (2^dim) yields [-2,-1], [0,1], [2,3], ...
3^nb_stmt worst-case complexity
[Figure: worklist, remaining statements and kernel]

Separation Principles
That was for the first time dimension
Recursively for all time dimensions
Result is a syntax tree of the generated loops
    for(t1=-2;t1<=-1;t1++)
      for(t2=5;t2<=9;t2++)
        S3(…)
    for(t1=0;t1<=1;t1++) {
      for(t2=5;t2<=7;t2++) {
        S2(…)
        S3(…)
      }
      for(t2=8;t2<=9;t2++)
        S3(…)
    }
    ...
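The separation of the first time dimension can be illustrated with a one-dimensional stand-in. This is a deliberately simplified sketch: it works on closed integer intervals rather than polyhedra, and it cuts at all interval endpoints at once instead of running the worklist-based intersection / difference steps of the Quilleré algorithm. The types Interval and Segment and the function separate_1d are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int lo, hi; } Interval;                /* closed [lo, hi] */
typedef struct { int lo, hi; unsigned mask; } Segment;  /* bit s set: statement s is active */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Split the union of n intervals (n <= 16) into disjoint segments, each
   tagged with the statements covering it. out needs 2*n - 1 slots.
   Returns the number of segments, in increasing order of lo. */
int separate_1d(const Interval *dom, int n, Segment *out) {
    assert(n >= 1 && n <= 16);
    int cuts[32], nc = 0, ns = 0;
    for (int s = 0; s < n; s++) {
        cuts[nc++] = dom[s].lo;      /* a segment may start here */
        cuts[nc++] = dom[s].hi + 1;  /* and must end before here */
    }
    qsort(cuts, (size_t)nc, sizeof cuts[0], cmp_int);
    for (int c = 0; c + 1 < nc; c++) {
        if (cuts[c] == cuts[c + 1]) continue;  /* empty piece */
        unsigned mask = 0;
        for (int s = 0; s < n; s++)
            if (dom[s].lo <= cuts[c] && cuts[c + 1] - 1 <= dom[s].hi)
                mask |= 1u << s;
        if (mask) {                            /* skip gaps covered by no statement */
            out[ns].lo = cuts[c];
            out[ns].hi = cuts[c + 1] - 1;
            out[ns].mask = mask;
            ns++;
        }
    }
    return ns;
}
```

Applied to the t1 ranges [2,5], [0,3] and [-2,6] of S1, S2 and S3, the sketch returns [-2,-1] for S3 alone, [0,1] for S2 and S3, [2,3] for all three statements, [4,5] for S1 and S3, and [6,6] for S3 alone.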

Contributions
Real World Issues
Problems provided by different sources (academia, industry, SPECFP2000)
Exhibit different challenging issues
State of the art polyhedral code generator: CLooG [Bastoul04]
ALL PERFORMANCE COMPARISONS WILL BE CLooG vs URGenT
Code Generation Speed
Node fusion (exploiting transformations' "locality")
Exploiting scalar dimensions (replacing exponential computations with trivial ones)
Domain iterator mapping improvement (replacing exponential computations by matrix inversions)
Code Quality
Faster If-Hoisting yielding much smaller code (conditional factorization)
Modulo conditional removal by strip-mining (stride issue) (detailed)

Generation Speed Improvements

Generation Speed – Node Fusion
Multidimensional schedules allow expression of non-affine (polynomial) quantities as affine ones with more dimensions → improved flexibility
Drawback → pressure on code generation (height of the tree)
Add parameters → add dimensions (polyhedral operation complexity)
HOWEVER
Loop-level transformations affect blocks of statements (tiling, interchange…)
Polyhedron inclusion check is NOT exponential
Before each separation phase, fuse consecutive nodes with equal scattering polyhedra
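A rough sketch of the fusion step, under a strong simplifying assumption: scattering polyhedra are replaced by axis-aligned boxes, so equality is a plain comparison of bounds. A real implementation would instead test equality through two polyhedron inclusion checks. The Node type, DIMS constant and fuse_nodes function are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define DIMS 2

/* Stand-in for a node: a box lo[d] <= t_d <= hi[d] and a bitmask of statements. */
typedef struct { int lo[DIMS], hi[DIMS]; unsigned stmts; } Node;

static int same_box(const Node *a, const Node *b) {
    return memcmp(a->lo, b->lo, sizeof a->lo) == 0 &&
           memcmp(a->hi, b->hi, sizeof a->hi) == 0;
}

/* Fuse runs of consecutive nodes with equal boxes, in place.
   Returns the new number of nodes. */
int fuse_nodes(Node *nodes, int n) {
    assert(n >= 1);
    int w = 0;                                 /* index of the last kept node */
    for (int r = 1; r < n; r++) {
        if (same_box(&nodes[w], &nodes[r]))
            nodes[w].stmts |= nodes[r].stmts;  /* merge statement sets */
        else
            nodes[++w] = nodes[r];             /* start a new node */
    }
    return w + 1;
}
```

Only consecutive nodes are merged, so the relative order of statements is preserved; the separation phase then sees fewer nodes.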

Generation Speed – Scalar Dimensions
Some multidimensional schedules have scalar dimensions (UTF, URUK [ICS05])
Scalar dimensions express strict statement interleaving
Comparison of integers, no need for polyhedral separation
Syntactic tree height reduction (potentially half the height)
Marginal overhead for detection and computation
Combines well with Node Fusion
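On a scalar dimension each statement carries a constant, so ordering reduces to integer comparison. The sketch below assumes that the constant has already been extracted for each statement; ScalarEntry and split_on_scalar are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* One statement and its constant value on the scalar dimension. */
typedef struct { int stmt; int value; } ScalarEntry;

static int by_value(const void *a, const void *b) {
    const ScalarEntry *x = (const ScalarEntry *)a;
    const ScalarEntry *y = (const ScalarEntry *)b;
    if (x->value != y->value) return (x->value > y->value) - (x->value < y->value);
    return (x->stmt > y->stmt) - (x->stmt < y->stmt);  /* deterministic tie-break */
}

/* Order statements on a scalar dimension and mark where each group of
   equal values starts. Returns the number of groups. */
int split_on_scalar(ScalarEntry *e, int n, int *group_start) {
    assert(n >= 1);
    qsort(e, (size_t)n, sizeof *e, by_value);
    int groups = 0;
    for (int k = 0; k < n; k++)
        if (k == 0 || e[k].value != e[k - 1].value)
            group_start[groups++] = k;
    return groups;
}
```

Each group becomes one child in the syntax tree, and no polyhedral intersection or difference is computed for this dimension.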

Generation Speed – Domain Iterator Regen.
Generation of sequential loops for non-invertible schedules (wavefronts)
CLooG [Bas04] handles it with polyhedral projection on domain iterators
Drawback → adds dimensions (polyhedral operation complexity)
Drawback → additional polyhedral computations on each leaf of the syntax tree after the Quilleré separation phase (3^nb_stmts)
Use transformation invertibility (ideally, given the rank, a mix of projections and invertibility)

Code Quality Improvements

Code Quality – If Hoisting
Quilleré separation phase leaves conditionals on triangular loops
Need of the so-called backtracking phase → too aggressive (code bloat)
Potentially tremendous amount of useless work
[Figure: backtracking illustration, separate (t1, t2, t3) loop nests for cond: t1 <= 4, cond: 5 <= t1 <= 10 and cond: 11 <= t1; code bloat, useless work]
[Figure: If-Hoisting illustration, a single (t1, t2, t3) loop nest; smaller code, no useless work (simplification IS needed)]
Explains the generation speedup on dreamupT3

Code Quality – If Hoisting
Previous example doesn't take place in real life (just an illustration)
Matrix multiplication with URUK: strip-mine by factor 4 (x3), interchange loops (x2), unroll
Backtrack: +50%

Removing Modulos – Domain Iterator Regen.
Problem since 91: [Irigoin91], [Pingali92], [Ramanujam95], [Xue96], [Griebl98] and others…
Let T be the transformation function for a statement
Suppose T is invertible, and let D be the diagonal matrix of denominators of T^-1
Let T' = D.T^-1 and ISM = [T' | -D], with columns for time iterators, then domain iterators
The Inverse Scatter Matrix (ISM) expresses domain iterators from time iterators
D ensures all coefficients are integral
Replaces leaf polyhedral projections by matrix inversions
Substitute for the usual Hermite Normal Form in stride computations
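For a 2x2 schedule the inverse scatter matrix can be computed with integer arithmetic only. The sketch below is an illustration under assumptions: the function name inverse_scatter_2x2 and the row layout (time-iterator coefficients first, then minus the denominator on the matching domain iterator) are choices made here to match the schedule t1 = 2i + 3j, t2 = i used as the running example in this part of the talk.

```c
#include <assert.h>
#include <stdlib.h>

static int gcd(int a, int b) {
    a = abs(a); b = abs(b);
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* Row r of ism reads:
   ism[r][0]*t1 + ism[r][1]*t2 + ism[r][2]*i + ism[r][3]*j = 0.
   Returns 0 if T is singular, 1 otherwise. */
int inverse_scatter_2x2(int T[2][2], int ism[2][4]) {
    assert(T && ism);
    int det = T[0][0] * T[1][1] - T[0][1] * T[1][0];
    if (det == 0) return 0;
    /* T^-1 = adj / det, with adj the adjugate of T. */
    int adj[2][2] = { { T[1][1], -T[0][1] }, { -T[1][0], T[0][0] } };
    for (int r = 0; r < 2; r++) {
        /* Reduce row r of T^-1 to lowest terms with a positive denominator. */
        int g = gcd(gcd(adj[r][0], adj[r][1]), det);
        int den = det / g, n0 = adj[r][0] / g, n1 = adj[r][1] / g;
        if (den < 0) { den = -den; n0 = -n0; n1 = -n1; }
        ism[r][0] = n0;
        ism[r][1] = n1;
        ism[r][2] = (r == 0) ? -den : 0;
        ism[r][3] = (r == 1) ? -den : 0;
    }
    return 1;
}
```

For T = [[2, 3], [1, 0]] this gives the rows (0, 1, -1, 0) and (1, -2, 0, -3), that is i = t2 and 3*j = t1 - 2*t2.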

Removing Modulos – Inverse Scatter Matrix
Consider S with 2 domain iterators and the schedule t1 = 2i + 3j, t2 = i:
    for(i=1;i<=M;i++)
      for(j=1;j<=N;j++)
        S(i,j)
T = [[2, 3], [1, 0]] and T^-1 = [[0, 1], [1/3, -2/3]]
We have D = [[1, 0], [0, 3]] and ISM = [[0, 1, -1, 0], [1, -2, 0, -3]] (columns: time iterators t1, t2, then domain iterators i, j): all coefficients are INTEGRAL
Meaning: i = t2, 3*j = t1 - 2*t2
    for(t1=5;t1<=2*M+3*N;t1++)
      for(t2=1;t2<=min(M,t1/2);t2++)
        if((t1-2*t2)%3 == 0)
          S(i=t2,j=(t1-2*t2)/3)
OUCH !!! A modulo guard sits in the innermost loop
Strip-mine (SM) & unroll t2 by (3 / gcd(2,3)):
    for(t1=5;t1<=2*M+3*N;t1++)
      for(t2=?;t2<=?;t2++) {
        if(t1%3 == 0) S(i=t2,j=t1/3-2*k)          (t2 = 3k)
        if(t1%3 == 2) S(i=t2,j=(t1-2)/3-2*k)      (t2 = 3k+1)
        if(t1%3 == 1) S(i=t2,j=(t1-1)/3-2*k-1)    (t2 = 3k+2)
      }
SM & unroll t1 by (3 / gcd(1,3)):
    for(t1=?;t1<=?;t1++)
      for(t2=?;t2<=?;t2++) {
        S(i=t2,j=l-2*k)      (t1 = 3l, t2 = 3k)
        S(i=t2,j=l-2*k-1)    (t1 = 3l+1, t2 = 3k+2)
        S(i=t2,j=l-2*k)      (t1 = 3l+2, t2 = 3k+1)
      }
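The following sketch scans the schedule t1 = 2i + 3j, t2 = i with no modulo test inside the inner loop body. Two points are assumptions of this illustration rather than the scheme of the slides: it uses a strided inner loop (step 3) after a short alignment search, not strip-mining plus unrolling, and it derives exact t2 bounds from 1 <= j <= N instead of the simplified bound min(M, t1/2). The function name scan_strided and the visits array layout are hypothetical.

```c
#include <assert.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* visits must hold (M+1)*(N+1) zero-initialised counters; the counter of
   instance S(i,j) is visits[i*(N+1) + j]. Returns the number of instances. */
int scan_strided(int M, int N, int *visits) {
    assert(M >= 1 && N >= 1);
    int count = 0;
    for (int t1 = 5; t1 <= 2 * M + 3 * N; t1++) {
        int a = t1 - 3 * N;
        int lo = a > 1 ? (a + 1) / 2 : 1;  /* max(1, ceil((t1 - 3N) / 2)), from j <= N */
        int hi = imin(M, (t1 - 3) / 2);    /* from j >= 1 */
        int t2 = lo;
        /* Alignment: at most two steps to reach 2*t2 == t1 (mod 3). */
        while (t2 <= hi && (t1 - 2 * t2) % 3 != 0)
            t2++;
        /* Kernel: a step of 3 keeps (t1 - 2*t2) divisible by 3. */
        for (; t2 <= hi; t2 += 3) {
            int i = t2, j = (t1 - 2 * t2) / 3;
            visits[i * (N + 1) + j]++;
            count++;
        }
    }
    return count;
}
```

Every point of [1,M]x[1,N] should be visited exactly once, so the returned count equals M*N.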

Removing Modulos – There is a CATCH
Previous example flowed nicely → what about the loops' bounds???
"Issue" (feature) with our SM + unroll transformation (strip-mine NOT strided)
Modulos are indeed removed from the kernels only
    for(i=M;i<=N;i++)
      S(i,j)
vs.
    for(i=M;i<=N;i+=2)
      for(ii=i;ii<=min(i+1,N);ii++)
        S(ii,j)
Prologue (P), kernel (K) and epilogue (E) → code size
HOWEVER: P and E have marginal execution time when the SM factor is "decent"
PROLOGUE gives us ALIGNMENT on %2 (strip-mine factor) !!!
Transformation quality issue
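A minimal sketch of the prologue / kernel / epilogue idea for a strip-mine factor of 2, assuming two hypothetical statements: S0 runs on even i and S1 on odd i. Both versions fold the executed instances into an order-sensitive hash so that they can be compared; the names visit, run_guarded and run_stripmined are illustrative.

```c
#include <assert.h>

/* Order-sensitive accumulation of one executed instance. */
static unsigned long visit(unsigned long acc, int stmt, int i) {
    return acc * 31u + (unsigned long)(stmt * 1000 + i);
}

/* Reference version: modulo guards inside the loop. */
unsigned long run_guarded(int lo, int hi) {
    assert(lo >= 0);
    unsigned long acc = 0;
    for (int i = lo; i <= hi; i++) {
        if (i % 2 == 0) acc = visit(acc, 0, i);
        if (i % 2 == 1) acc = visit(acc, 1, i);
    }
    return acc;
}

/* Prologue aligns i on a multiple of 2; the kernel is guard-free. */
unsigned long run_stripmined(int lo, int hi) {
    assert(lo >= 0);
    unsigned long acc = 0;
    int i = lo;
    for (; i <= hi && i % 2 != 0; i++)  /* prologue: i is odd, only S1 runs */
        acc = visit(acc, 1, i);
    for (; i + 1 <= hi; i += 2) {       /* kernel: unrolled strip, no modulo */
        acc = visit(acc, 0, i);
        acc = visit(acc, 1, i + 1);
    }
    for (; i <= hi; i++)                /* epilogue: at most one even i left */
        acc = visit(acc, 0, i);
    return acc;
}
```

The prologue and epilogue each run at most once here, which is why their cost is marginal, while the price is the extra code they add around the kernel.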

Removing Modulos – Hermite Normal Form
Hermite Normal Form: mathematical decomposition of the transformation matrix T = U.H, where
U is unimodular (skewing matrix)
H is diagonal (stride in transformed space → diagonal coefficients)
Suppresses the need for internal modulo guards
BUT all statements need to have the same transformation: TOO RESTRICTIVE
If U is not the same, skewings are different
Deal with non-parallel lattices… how?
In practice, used for 1 statement or "simple" examples
Our solution unrolls modulo guards from kernels after strip-mining

Experimental Results

Putting it all Together – Code Size Experiments
State of the art polyhedral code generator: CLooG [Bas04]
PERFORMANCE COMPARISONS: CLooG04 vs CLooG06
[Table: code size; columns CL04, CL06, Improv.]

Generation Speed – Experiments
We compare the original CLooG (CL04) from the [Bastoul04] PACT paper with our optimized CLooG (CL06)
Domain Iterators, Swim: 36% time reduction, 58% memory reduction
[Charts: Node Fusion (% of CL04); Scalar Dimensions (% of CL04)]

Putting it all Together – Code Generation Speed Experiments
State of the art polyhedral code generator: CLooG [Bas04]
PERFORMANCE COMPARISONS: CLooG vs URGenT
Affine schedule: 412 → 2267 lines (40% execution speedup wrt. best peak)
PathScale -Ofast needs ~22s to process the AST (LNO OFF)
[Chart: code generation speed, CL vs UR]

Conclusion / Future Work
Implemented as the code generation phase of the URUK framework [ICS05]
Generation speed goal achieved (up to 56x, stands the comparison with PathScale)
Greatly improved code size with improved if-hoisting technique (up to 5.8x)
Modulo conditionals are removed (from kernel) → mix with HNF
Still room for speeding up generation (caches, memory pools, parallelization)
Focus on code-generation-friendly transformations

Thank you!
www.cloog.org for full presentation & more
