Parametric Tiling of Affine Loop Nests*

Sanket Tavarageri (1), Albert Hartono (1), Muthu Baskaran (1), Louis-Noel Pouchet (1), J. “Ram” Ramanujam (2), P. “Saday” Sadayappan (1)
(1) Ohio State University
(2) Louisiana State University

*Supported by US NSF
Loop Tiling

A key loop transformation for:
  ◦ Efficient coarse-grained parallel execution
  ◦ Data locality optimization

Original loops:
    for (i=1; i<=7; i++)
      for (j=1; j<=6; j++)
        S(i,j);

Tiled loops:
    for (it=1; it<=7; it+=Ti)                   /* inter-tile loops */
      for (jt=1; jt<=6; jt+=Tj)
        for (i=it; i<=min(7,it+Ti-1); i++)      /* intra-tile loops */
          for (j=jt; j<=min(6,jt+Tj-1); j++)
            S(i,j);

[Figure: the (i, j) iteration space before and after tiling]
Rectangular Tileability

Legality of rectangular tiling:
  ◦ Atomic execution of each tile
  ◦ No cyclic dependence between tiles
Data dependences must be lexicographically positive in all space dimensions
Unimodular transformations (e.g., skewing) are used as a pre-processing step to make rectangular tiling valid

[Figure: skewing, an affine map from (i, j, 1) to (i', j'), applied to the iteration space]
Parametric Tiling

Original loops:
    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        S(i,j);

Tiled loops (loop i tiled with tile size Ti, loop j tiled with tile size Tj):
    for (it=1; it<=N; it+=Ti)
      for (jt=1; jt<=N; jt+=Tj)
        for (i=it; i<=min(N,it+Ti-1); i++)
          for (j=jt; j<=min(N,jt+Tj-1); j++)
            S(i,j);

Performance of tiled code can vary greatly with the choice of tile sizes
  → Model-driven and/or empirical search for the best tile sizes
Parametric tile sizes
  ◦ Not fixed at compile time
  ◦ Runtime parameters
  ◦ Valuable for:
      Auto-tuning systems
      Generalized “ATLAS”
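The parametric tiled loop above can be sketched as a compilable C function. This is a minimal illustration, not code generated by any of the tools discussed: the names `run_tiled`, `min_int` and the `visits` array are assumptions introduced here, and the statement S(i,j) is replaced by a counter so that coverage can be checked.

```c
#include <stdio.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Scans the N x N domain through Ti x Tj tiles, with Ti and Tj supplied
 * at run time. visits[] (row-major, N*N ints, zero-initialised by the
 * caller) records how often each point is executed.
 * Returns the total number of executed points. */
long run_tiled(int N, int Ti, int Tj, int *visits)
{
    long total = 0;
    for (int it = 1; it <= N; it += Ti)                 /* inter-tile loops */
        for (int jt = 1; jt <= N; jt += Tj)
            for (int i = it; i <= min_int(N, it + Ti - 1); i++)   /* intra-tile loops */
                for (int j = jt; j <= min_int(N, jt + Tj - 1); j++) {
                    visits[(i - 1) * N + (j - 1)]++;    /* stand-in for S(i,j) */
                    total++;
                }
    return total;
}
```

Calling `run_tiled(7, 3, 2, visits)` with different tile sizes should leave every entry of `visits` equal to 1, which is the property that makes the tile sizes safe to tune at run time.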
Approaches to Loop Tiling

TLOG and HiTLOG
  ◦ Handles only perfectly nested loops
  ◦ Tile sizes can be runtime parameters
  ◦ Does not address parallelism
Pluto
  ◦ Handles imperfectly nested loops
  ◦ Tile sizes must be fixed at compile time
  ◦ Addresses parallelism
PrimeTile
  ◦ Handles imperfectly nested loops
  ◦ Tile sizes can be runtime parameters
  ◦ Does not address parallelism
DynTile and PTile (this work): systems with all positive features of existing tiling tools:
  ◦ Handle imperfectly nested loops
  ◦ Tile sizes can be runtime parameters
  ◦ Address parallelism
  ◦ Support multilevel tiling
Tiled Code Generation with Polyhedral Model

Assume: rectangular tiling is valid. Tile sizes = 32 x 32

Original loop:
    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        S(i,j);

Statement domain:
    1 ≤ i ≤ N
    1 ≤ j ≤ N
Affine schedule:
    i' = i
    j' = j

Tiled statement domain (adds the tile constraints):
    0 ≤ i-32∙it ≤ 31
    0 ≤ j-32∙jt ≤ 31
Tiled affine schedule:
    it' = it
    jt' = jt
    i' = i
    j' = j

Tiled loop:
    for (it=0; it<=floord(N,32); it++)
      for (jt=0; jt<=floord(N,32); jt++)
        for (i=max(1,32*it); i<=min(N,32*it+31); i++)
          for (j=max(1,32*jt); j<=min(N,32*jt+31); j++)
            S(i,j);

Constraint of polyhedral model: inequalities of the loop bounds must be linear in terms of loop iterators and problem sizes. A parametric tile size Ti would introduce the product Ti∙it, which is not linear.

[Figure: the iteration space bounded by i ≥ 1, i ≤ N, j ≥ 1, j ≤ N]
PrimeTile: Approach to Sequential Parametric Tiling

Recursive level-by-level generation of tiling loops by non-polyhedral AST processing

Input loops:
    for (i=lbi; i<=ubi; i++)
      for (j=lbj(i); j<=ubj(i); j++)
        S(i,j);

Output pseudocode:
    for it {                      /* full tiles (loop i) */
      [compute lbv]
      [compute ubv]
      if (lbv<ubv) {              /* full tiles */
        [prolog j]
        [full tiles j]
        [epilog j]
      } else {                    /* no full tiles */
        [untiled j]
      }
    }
    [epilog i]                    /* partial tile (loop i) */

[Figure: (i, j) iteration space split into full tiles and partial tiles]
PrimeTile: Multi-Level Tiling

Essential for:
  ◦ Exploiting data locality in deep multi-level memory hierarchies
Approach:
  ◦ Boundary tiles can be recursively tiled using smaller tile sizes

[Figure: (i, j) iteration space with 1 level, 2 levels, and 3 levels of tiling]
DynTile: Parametric Tiling (Multiple Statement Domains)

Pre-processing to embed all statements in a common space

Before:
    for i
      S1(i)
      for j2=l2,u2
        S2(i,j2)
      for j3
        S3(i,j3)

After:
    for i
      for j1=l2-1,l2-1        /* one-trip loop */
        S1(i,j1)
      for j2=l2,u2
        S2(i,j2)
      for j3
        S3(i,j3)
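DynTile: Parametric Tiling (Multiple Statement Domains)

Embedded loops:
    for i
      for j1
        S1(i,j1)
      for j2
        S2(i,j2)
      for j3
        S3(i,j3)

Tiled code generated from the convex hull of the statement domains:
    /* Inter-tile loops */
    for it {
      for jt {
        /* Intra-tile loops */
      }
    }

[Figure: convex hull enclosing the domains of S1, S2 and S3 in the (i, j) space]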
DynTile: Wave-front Parallelism

After sequential tiling:
  1. If no loop-carried dependences exist, then each tiling loop is directly parallelizable
  2. If none of the tiling loops is parallel, then wave-front parallelization is always possible (all points in the same wavefront are independent of each other)

[Figure: tile space (i, j) mapped to (i', j') by an affine transformation of (i, j, 1), with wavefronts k, k+1, k+2, k+3, k+4]
DynTile: Inspector Code for Dynamic Scheduled Parallel Execution

Sequential tiled code:
    /* Inter-tile loops */
    for it {
      for jt {
        /* Intra-tile loops (treated as a black box) */
      }
    }

Inspector steps:
  Step 1: Count #wavefronts and #tiles in each wavefront
  Step 2: Allocate bins to store wavefronts
  Step 3: Fill each bin with its corresponding tiles
  Step 4: Execute in parallel all tiles in each bin

Parallel execution:
    for each bin w {
      #pragma omp parallel for
      for each tile in w {
        /* Intra-tile loops (treated as a black box) */
      }
    }

[Figure: tile iteration space (it, jt) partitioned into wavefronts w1 to w5]
Example: #wavefronts = 5; w1 has 2 tiles, w2 has 3 tiles, w3 has 4 tiles, w4 has 3 tiles, w5 has 4 tiles
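The four inspector steps can be sketched in C as follows. This is a minimal illustration under stated assumptions, not DynTile's generated code: the two-dimensional triangular tile space (`jt_lb`), the wavefront numbering w = it + jt, and the names `Tile`, `Schedule`, `inspect`, `execute` and `free_schedule` are all hypothetical. Allocation failures are not handled, and the intra-tile loops are replaced by a single store.

```c
#include <stdlib.h>

typedef struct { int it, jt; } Tile;

typedef struct {
    int    nwaves;  /* number of wavefronts (bins) */
    int   *count;   /* count[w]  = number of tiles in wavefront w */
    Tile **bin;     /* bin[w][k] = k-th tile of wavefront w */
} Schedule;

/* Hypothetical inter-tile lower bound giving a triangular tile space. */
static int jt_lb(int it) { return it; }

/* Inspector: re-runs only the inter-tile loops; no tile is executed here. */
Schedule inspect(int nit, int njt)
{
    Schedule s;
    s.nwaves = nit + njt - 1;                 /* w = it + jt lies in 0..nit+njt-2 */
    s.count  = calloc((size_t)s.nwaves, sizeof *s.count);
    s.bin    = malloc((size_t)s.nwaves * sizeof *s.bin);
    int *next = calloc((size_t)s.nwaves, sizeof *next);

    /* Step 1: count the tiles in each wavefront. */
    for (int it = 0; it < nit; it++)
        for (int jt = jt_lb(it); jt < njt; jt++)
            s.count[it + jt]++;

    /* Step 2: allocate one bin per wavefront (at least one slot each). */
    for (int w = 0; w < s.nwaves; w++)
        s.bin[w] = malloc((size_t)(s.count[w] > 0 ? s.count[w] : 1) * sizeof(Tile));

    /* Step 3: fill each bin with its tiles. */
    for (int it = 0; it < nit; it++)
        for (int jt = jt_lb(it); jt < njt; jt++) {
            int w = it + jt;
            s.bin[w][next[w]++] = (Tile){ it, jt };
        }

    free(next);
    return s;
}

/* Step 4: bins run one after another; tiles inside a bin run in parallel.
 * wave_of[it*njt + jt] receives the wavefront in which the tile ran. */
void execute(const Schedule *s, int njt, int *wave_of)
{
    for (int w = 0; w < s->nwaves; w++) {
        #pragma omp parallel for
        for (int k = 0; k < s->count[w]; k++) {
            Tile t = s->bin[w][k];
            wave_of[t.it * njt + t.jt] = w;   /* stand-in for the intra-tile loops */
        }
    }
}

void free_schedule(Schedule *s)
{
    for (int w = 0; w < s->nwaves; w++) free(s->bin[w]);
    free(s->bin);
    free(s->count);
}
```

Because the inspector only walks the inter-tile loops, it needs no symbolic manipulation of the tile-size parameters; the price is the extra pass and the storage for the bins. Without an OpenMP flag the pragma is ignored and the executor simply runs sequentially.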
DynTile: Implementation

Components:
  ◦ Pluto (with Clan)
  ◦ Modified CLooG
  ◦ Parser + AST Generator
  ◦ Convex Hull Generator (using ISL)
  ◦ Tiling Transformer
  ◦ Inspector Code Specifier
  ◦ Code Generator

Data passed between the stages, from input to output:
  Sequence of loop nests
  → Pre-process
  → Statement polyhedra + affine transforms (for rectangular tileability)
  → Tileable loop code (with preserved embedding information)
  → Loop ASTs and statement polyhedra
  → Convex-hull loop AST
  → Tiled loop ASTs
  → Parallel tiled loop ASTs
  → Parallel tiled loop code

[Figure: block diagram of the DynTile tool chain]
PTile: Loop Generation

Representation of statement domains
  ◦ Set of affine inequalities S:
      $v_1, v_2, \dots, v_n$ are loop variables ($v_1$ outermost and $v_n$ innermost)
      $p_1, p_2, \dots, p_k$ are program parameters
  ◦ Bounds of $v_r$, $1 \le r \le n$:

\[
\max\bigl(f_1(v_1, \dots, v_{r-1}, p_1, \dots, p_k, c), \dots, f_t(v_1, \dots, v_{r-1}, p_1, \dots, p_k, c)\bigr)
\;\le\; v_r \;\le\;
\min\bigl(g_1(v_1, \dots, v_{r-1}, p_1, \dots, p_k, c), \dots, g_s(v_1, \dots, v_{r-1}, p_1, \dots, p_k, c)\bigr)
\]

  ◦ Bounds depend only on outer loop variables and parameters (row echelon form)
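Loop Generation (cont.)

The system S in matrix form, $[\,B \mid P \mid C\,]\,(v, p, 1)^T \ge 0$:

\[
\left[\begin{array}{cccc|cccc|c}
B_{11} & 0      & \cdots & 0      & P_{11} & P_{12} & \cdots & P_{1k} & c_1 \\
B_{21} & B_{22} & \cdots & 0      & P_{21} & P_{22} & \cdots & P_{2k} & c_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots &        & \vdots & \vdots \\
B_{n1} & B_{n2} & \cdots & B_{nn} & P_{n1} & P_{n2} & \cdots & P_{nk} & c_n
\end{array}\right]
\begin{bmatrix} v_1 \\ \vdots \\ v_n \\ p_1 \\ \vdots \\ p_k \\ 1 \end{bmatrix}
\ge 0
\]

B is in row echelon form, which is suitable for generating loop code to scan the iteration points represented by the system.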
Parametric Sequential Tiling

Tiling transformation
  ◦ Express each variable $v_j$ in terms of inter-tile (tile) co-ordinates $t_j$, intra-tile co-ordinates $u_j$ and tile sizes $s_j$:
      $v_j = s_j \cdot t_j + u_j$ and $0 \le u_j \le s_j - 1$

S' (where I is the identity matrix):

\[
\left[\begin{array}{c|c|c|c|c}
B \cdot s & B  & P & 0 & C \\
0         & I  & 0 & 0 & 0 \\
0         & -I & 0 & I & -1
\end{array}\right]
\begin{bmatrix} t \\ u \\ p \\ s \\ 1 \end{bmatrix}
\ge 0
\]

  ◦ S' is equivalent to S
  ◦ S' is not in row echelon form for t
  ◦ S' is in row echelon form for u
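Parametric Sequential Tiling (cont.)

  ◦ To derive a system in row echelon form for all variables:
      Create a system $S_T$ with only tile variables, program parameters and tile sizes (also parameters)
      Use a relaxed projection to eliminate the intra-tile variables $u_j$
      In $S_T$, each term $B_{ij} \cdot u_j$ is replaced by:

\[
B_{ij} \cdot u_j \;\mapsto\;
\begin{cases}
0 & \text{if } B_{ij} \le 0 \\
B_{ij} \cdot (s_j - 1) & \text{if } B_{ij} > 0
\end{cases}
\]

      All solutions to S' also satisfy $S_T$

$S_T$:

\[
\left[\begin{array}{c|c|c|c}
B \cdot s & P & B^{+} & C'
\end{array}\right]
\begin{bmatrix} t \\ p \\ s \\ 1 \end{bmatrix}
\ge 0
\]

  ◦ $B \cdot s$ has the same nonzero structure as B, where s is a diagonal matrix of parametric tile sizes => row echelon form for t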
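As a worked illustration of the relaxed projection, consider a single loop $1 \le i \le N$ tiled with $i = s \cdot t + u$. The choice of this one-loop example is an assumption made here for illustration; the symbols follow the slides.

```latex
\begin{align*}
S':\quad & s\,t + u - 1 \ge 0, \qquad -s\,t - u + N \ge 0, \qquad 0 \le u \le s - 1 \\[4pt]
\text{row 1 } (B_{11} = 1 > 0):\quad
  & u \mapsto s - 1 \;\Rightarrow\; s\,t + s - 2 \ge 0 \;\Rightarrow\; t \ge \frac{2 - s}{s} \\[4pt]
\text{row 2 } (B_{21} = -1 \le 0):\quad
  & u \mapsto 0 \;\Rightarrow\; N - s\,t \ge 0 \;\Rightarrow\; t \le \frac{N}{s}
\end{align*}
```

The tile loop then runs from $\lceil (2-s)/s \rceil$ to $\lfloor N/s \rfloor$. Every $(t, u)$ that satisfies S' also satisfies both relaxed rows, because $0 \le u \le s - 1$; the converse does not hold, so some enumerated tiles may be partial.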
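Parametric Sequential Tiling (cont.)

$S_T$: in row echelon form for t; used to generate the tile loops

\[
\left[\begin{array}{c|c|c|c}
B \cdot s & P & B^{+} & C'
\end{array}\right]
\begin{bmatrix} t \\ p \\ s \\ 1 \end{bmatrix}
\ge 0
\]

S': in row echelon form for u; used to generate the intra-tile loops

\[
\left[\begin{array}{c|c|c|c|c}
B \cdot s & B  & P & 0 & C \\
0         & I  & 0 & 0 & 0 \\
0         & -I & 0 & I & -1
\end{array}\right]
\begin{bmatrix} t \\ u \\ p \\ s \\ 1 \end{bmatrix}
\ge 0
\]

$S_T \mid S'$: in row echelon form for t and u; used to generate the tile loops and the intra-tile loops

\[
\left[\begin{array}{c|c|c|c|c}
B \cdot s & B  & P & 0     & C \\
0         & I  & 0 & 0     & 0 \\
0         & -I & 0 & I     & -1 \\
B \cdot s & 0  & P & B^{+} & C'
\end{array}\right]
\begin{bmatrix} t \\ u \\ p \\ s \\ 1 \end{bmatrix}
\ge 0
\]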
Parallel Non-parameterized Tiling

    /* Original loops */
    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        for (k=i; k<=N; k++)
          S(i,j,k);

Tiling (8x8x8 tile sizes):

    /* Sequential tiled loops */
    for (it=⌈-6/8⌉; it<=⌊N/8⌋; it++)
      for (jt=⌈-6/8⌉; jt<=⌊N/8⌋; jt++)
        for (kt=⌈(it*8-7)/8⌉; kt<=⌊N/8⌋; kt++)
          // intra-tile loops i,j,k

Original loop constraints:
    Lower-bound constraints         Upper-bound constraints
    (1a): -6/8 <= it                (1b): it <= N/8
    (2a): -6/8 <= jt                (2b): jt <= N/8
    (3a): (it*8-7)/8 <= kt          (3b): kt <= N/8

Introduce new wavefront constraints (for loop kt), with w = it+jt+kt:
    (4a): w-it-jt <= kt             (4b): kt <= w-it-jt

Use Fourier Motzkin Elimination to derive new wavefront constraints (for loops w, it, jt):
    (5a): Combine (4a) and (3b)
          w-it-jt <= N/8
          (8*w-8*it-N)/8 <= jt
    (5b): Combine (4b) and (3a)
          (it*8-7)/8 <= w-it-jt
          jt <= (8*w-16*it+7)/8
    (6a): Combine (5a) and (2b)
          (8*w-8*it-N)/8 <= N/8
          (4*w-N)/4 <= it
    (6b): Combine (5b) and (2a)
          -6/8 <= (8*w-16*it+7)/8
          it <= (8*w+7)/16
    (7a): Combine (6b) and (1a)
          -6/8 <= (8*w+13)/16
          -7/8 <= w
    (7b): Combine (6a) and (1b)
          (4*w-N)/4 <= N/8
          w <= 3*N/8
Parallel Non-parameterized Tiling (cont.)

    /* Parallel tiled loops */
    for (w=⌈-7/8⌉; w<=⌊3*N/8⌋; w++)                                        /* sequential */
      for (it=max(⌈-6/8⌉, ⌈(4*w-N)/4⌉);
           it<=min(⌊N/8⌋, ⌊(8*w+7)/16⌋); it++)                             /* parallel */
        for (jt=max(⌈-6/8⌉, ⌈(8*w-8*it-N)/8⌉);
             jt<=min(⌊N/8⌋, ⌊(8*w-16*it+7)/8⌋); jt++)                      /* parallel */
          for (kt=max(⌈(it*8-7)/8⌉, w-it-jt);
               kt<=min(⌊N/8⌋, w-it-jt); kt++)                              /* one-trip-count */
            // intra-tile loops i,j,k

This works when tile sizes are fixed
When tile sizes are parametric, Fourier Motzkin Elimination becomes problematic
  ◦ The sign of a coefficient in the combined inequalities can be indeterminate, so it is impossible to determine whether the new inequality is a lower-bound or an upper-bound inequality
Parallel Parametric Tiling

1. Introduce an outermost wavefront loop
2. Optimize the innermost iterator using the wavefront inequalities
       w-t_1-…-t_{n-1} ≤ t_n ≤ w-t_1-…-t_{n-1}

    /* Parallel tiled loops */
    for (w=w_min; w<=w_max; w++)                                /* sequential */
      for (it=lbit; it<=ubit; it++)                             /* parallel */
        for (jt=lbjt; jt<=ubjt; jt++)                           /* parallel */
          for (kt=max(lbkt, w-it-jt);
               kt<=min(ubkt, w-it-jt); kt++)                    /* one-trip-count */
            // intra-tile loops i,j,k
Static Determination of Lowest and Highest Wavefront Numbers

The outermost tiling loop enumerates the wavefront numbers from lowest (w_min) to highest (w_max)
The values of w_min and w_max can be determined at compile time using ILP solvers such as PIP/PipLib
Similarly, the parametric bound values of each tiling loop variable (t_j^min and t_j^max for 1 ≤ j ≤ n) can also be computed using an ILP solver

Inputs to the ILP solver:
  ◦ Original point loops (affine inequalities)
  ◦ Global parameter values (affine inequalities)
Outputs of the ILP solver:
  ◦ Lexicographic minimum point in each loop level, e.g., 1, 1
  ◦ Lexicographic maximum point in each loop level, e.g., 200, 2*N
Derived wavefront numbers:
  ◦ Lowest wavefront number, e.g., w_min = ⌊1/Ti⌋ + ⌊1/Tj⌋
  ◦ Highest wavefront number, e.g., w_max = ⌊200/Ti⌋ + ⌊(2*N)/Tj⌋
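Once the ILP solver has produced the extreme points, evaluating w_min and w_max at run time only needs floor division. The sketch below shows that arithmetic for the two-dimensional example on the slide; the helper names `floord`, `ceild`, `wavefront_min` and `wavefront_max` are assumptions introduced here, and the extreme points are passed in by hand rather than obtained from PIP/PipLib.

```c
/* Floor division for a positive divisor d.
 * C's '/' truncates toward zero, so negative numerators need a fix-up. */
int floord(int n, int d)
{
    int q = n / d;
    return (n % d != 0 && n < 0) ? q - 1 : q;
}

/* Ceiling division for a positive divisor d. */
int ceild(int n, int d)
{
    return -floord(-n, d);
}

/* Lowest wavefront number from the lexicographic minimum point (i_min, j_min). */
int wavefront_min(int i_min, int j_min, int Ti, int Tj)
{
    return floord(i_min, Ti) + floord(j_min, Tj);
}

/* Highest wavefront number from the lexicographic maximum point (i_max, j_max). */
int wavefront_max(int i_max, int j_max, int Ti, int Tj)
{
    return floord(i_max, Ti) + floord(j_max, Tj);
}
```

For the slide's example, the calls would be `wavefront_min(1, 1, Ti, Tj)` and `wavefront_max(200, 2*N, Ti, Tj)`. The fix-up in `floord` matters for bounds such as ⌈-6/8⌉, which plain truncating division would not distinguish from ⌊-6/8⌋.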
Parallel Parametric Tiling

1. Introduce an outermost wavefront loop
       Utilize an ILP solver to derive w_min and w_max
2. Optimize the innermost iterator using the wavefront inequalities
       w-t_1-…-t_{n-1} ≤ t_n ≤ w-t_1-…-t_{n-1}

    /* Parallel tiled loops */
    for (w=w_min; w<=w_max; w++)                                /* sequential */
      for (it=lbit; it<=ubit; it++)                             /* parallel */
        for (jt=lbjt; jt<=ubjt; jt++)                           /* parallel */
          for (kt=max(lbkt, w-it-jt);
               kt<=min(ubkt, w-it-jt); kt++)                    /* one-trip-count */
            // intra-tile loops i,j,k

Correct code, but may visit many empty tiles
Parallel Parametric Tiling (cont.)

3. Optimize using bounded wavefront inequalities
       Utilize an ILP solver to derive the parametric bound values t_j^min, t_j^max for 1 ≤ j ≤ n

    /* Parallel tiled loops */
    for (w=w_min; w<=w_max; w++)                                /* sequential */
      for (it=max(lbit, w-jt_max-kt_max);
           it<=min(ubit, w-jt_min-kt_min); it++)                /* parallel */
        for (jt=max(lbjt, w-it-kt_max);
             jt<=min(ubjt, w-it-kt_min); jt++)                  /* parallel */
          for (kt=max(lbkt, w-it-jt);
               kt<=min(ubkt, w-it-jt); kt++)                    /* one-trip-count */
            // intra-tile loops i,j,k

Tighter loop bounds, but may still visit empty tiles
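The effect of the bounded wavefront inequalities can be measured on a small tile space. The sketch below is an illustration under stated assumptions, not PTile output: the tile space is taken to be 0 <= it, jt <= n-1 and it <= kt <= n-1 (the shape produced by a k >= i loop), the bound values t_min = 0 and t_max = n-1 are written in by hand instead of coming from an ILP solver, and the names `Counts` and `scan_wavefronts` are hypothetical.

```c
typedef struct {
    long visited;   /* (w, it, jt) iterations reached by the tile loops */
    long nonempty;  /* iterations whose one-trip kt loop actually runs */
} Counts;

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

/* Scans the wavefront loop nest and counts visited versus useful iterations.
 * bounded == 0: plain wavefront loop nest (steps 1 and 2).
 * bounded != 0: adds the bounded wavefront inequalities (step 3). */
Counts scan_wavefronts(int n, int bounded)
{
    const int tmin = 0, tmax = n - 1;      /* bound values for it, jt and kt */
    Counts c = { 0, 0 };

    for (int w = 3 * tmin; w <= 3 * tmax; w++) {            /* sequential */
        int it_lo = tmin, it_hi = tmax;
        if (bounded) {
            it_lo = max_int(it_lo, w - tmax - tmax);        /* w - jt_max - kt_max */
            it_hi = min_int(it_hi, w - tmin - tmin);        /* w - jt_min - kt_min */
        }
        for (int it = it_lo; it <= it_hi; it++) {
            int jt_lo = tmin, jt_hi = tmax;
            if (bounded) {
                jt_lo = max_int(jt_lo, w - it - tmax);      /* w - it - kt_max */
                jt_hi = min_int(jt_hi, w - it - tmin);      /* w - it - kt_min */
            }
            for (int jt = jt_lo; jt <= jt_hi; jt++) {
                c.visited++;
                int kt = w - it - jt;                       /* one-trip kt loop */
                if (kt >= it && kt <= tmax)                 /* lbkt = it, ubkt = n-1 */
                    c.nonempty++;
            }
        }
    }
    return c;
}
```

For n = 4 the tile space holds 40 tiles. The plain nest reaches 160 iterations to find them, while the bounded nest reaches 64; the remaining 24 empty iterations come from the kt >= it constraint, which the bounded inequalities do not see.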
Parallel Parametric Tiling (cont.)

4. Optimize using Relaxed Symbolic Fourier Motzkin Elimination (RSFME)

    /* Original loops */
    for (i=1; i<=N; i++)
      for (j=1; j<=N; j++)
        for (k=i; k<=N; k++)
          S(i,j,k);

    /* Sequential tiled loops */
    for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
      for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊N/Tj⌋; jt++)
        for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
          // intra-tile loops i,j,k

    Lower-bound constraints          Upper-bound constraints
    (1a): w_min <= w                 (1b): w <= w_max
    (2a): (1-Ti+1)/Ti <= it          (2b): it <= N/Ti
    (3a): (1-Tj+1)/Tj <= jt          (3b): jt <= N/Tj
    (4a): (it*Ti-Tk+1)/Tk <= kt      (4b): kt <= N/Tk
    (5a): w-it-jt <= kt              (5b): kt <= w-it-jt

    (6a): Combine (5a) and (4b)
          w-it-jt <= N/Tk
          w-it-N/Tk <= jt
          (w*Tk-it*Tk-N)/Tk <= jt
    (6b): Combine (5b) and (4a)
          (it*Ti-Tk+1)/Tk <= w-it-jt
          jt <= w-it-it*Ti/Tk+1-1/Tk
          jt <= (w*Tk-it*Tk-it*Ti+Tk-1)/Tk
    (7a): Combine (6a) and (3b)
          w-it-N/Tk <= N/Tj
          w-N/Tj-N/Tk <= it
          (w*Tj*Tk-N*Tk-N*Tj)/(Tj*Tk) <= it
    (7b): Combine (6b) and (3a)
          2/Tj-1 <= w-it-it*Ti/Tk+1-1/Tk
          it+it*Ti/Tk <= w+2-2/Tj-1/Tk
          it <= (w*Tj*Tk^2+2*Tj*Tk^2-Tj*Tk-2*Tk^2) / (Ti*Tj*Tk+Tj*Tk^2)

No ambiguous signs encountered
Very tight loop bounds, with negligible overhead of scanning empty tiles
Ambiguous Sign Resolution

Resolving an ambiguous sign in RSFME
Relaxation step
  ◦ Replace the tile loop variables with their parametric bound values (t_j^min and t_j^max)

    /* Original loops */
    for (i=1; i<=N; i++)
      for (j=1; j<=N-i; j++)
        for (k=i; k<=N; k++)
          S(i,j,k);

    /* Sequential tiled loops */
    for (it=⌈(1-Ti+1)/Ti⌉; it<=⌊N/Ti⌋; it++)
      for (jt=⌈(1-Tj+1)/Tj⌉; jt<=⌊(N-it*Ti)/Tj⌋; jt++)
        for (kt=⌈(it*Ti-Tk+1)/Tk⌉; kt<=⌊N/Tk⌋; kt++)
          // intra-tile loops i,j,k

    Lower-bound constraints          Upper-bound constraints
    (1a): w_min <= w                 (1b): w <= w_max
    (2a): (1-Ti+1)/Ti <= it          (2b): it <= N/Ti
    (3a): (1-Tj+1)/Tj <= jt          (3b): jt <= (N-it*Ti)/Tj
    (4a): (it*Ti-Tk+1)/Tk <= kt      (4b): kt <= N/Tk
    (5a): w-it-jt <= kt              (5b): kt <= w-it-jt

    (6a): Combine (5a) and (4b)
          w-it-jt <= N/Tk
          w-it-N/Tk <= jt
          (w*Tk-it*Tk-N)/Tk <= jt
    (6b): Combine (5b) and (4a)
          (it*Ti-Tk+1)/Tk <= w-it-jt
          jt <= w-it-it*Ti/Tk+1-1/Tk
          jt <= (w*Tk-it*Tk-it*Ti+Tk-1)/Tk
    (7a): Combine (6a) and (3b)
          w-it-N/Tk <= N/Tj-it*Ti/Tj
          w-N/Tj-N/Tk <= it*(1-Ti/Tj)
    (7b): Combine (6b) and (3a)
          2/Tj-1 <= w-it-it*Ti/Tk+1-1/Tk
          it+it*Ti/Tk <= w+2-2/Tj-1/Tk
          it <= (w*Tj*Tk^2+2*Tj*Tk^2-Tj*Tk-2*Tk^2) / (Ti*Tj*Tk+Tj*Tk^2)

Ambiguous sign encountered in (7a): the sign of the coefficient (1-Ti/Tj) of it is unknown

Use it_min and it_max to resolve the sign ambiguity, starting from
          w-N/Tj-N/Tk <= it-it*Ti/Tj
    (7a.1): replace it by it_min in the term it*Ti/Tj
          w-N/Tj-N/Tk+it_min*Ti/Tj <= it
          (w*Tj*Tk-N*Tj-N*Tk+it_min*Ti*Tk)/(Tj*Tk) <= it
    (7a.2): replace the isolated it by it_max
          it*Ti/Tj <= it_max-w+N/Tj+N/Tk
          it <= (it_max*Tj*Tk-w*Tj*Tk+N*Tj+N*Tk)/(Ti*Tk)
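The relaxation is safe in the sense that it never cuts off a value of it that the original constraint (7a) allows. The sketch below checks this numerically for given parameters; it is an illustration only, the function names are assumptions introduced here, and it_min and it_max are supplied by the caller rather than derived by an ILP solver.

```c
/* (7a) before relaxation: w - N/Tj - N/Tk <= it*(1 - Ti/Tj).
 * The sign of (1 - Ti/Tj) is unknown when Ti and Tj are parameters. */
static int holds_7a(double w, double N, double Ti, double Tj, double Tk, double it)
{
    return w - N / Tj - N / Tk <= it * (1.0 - Ti / Tj);
}

/* (7a.1): lower bound on it, using it_min in the term it*Ti/Tj. */
double lower_7a1(double w, double N, double Ti, double Tj, double Tk, double it_min)
{
    return (w * Tj * Tk - N * Tj - N * Tk + it_min * Ti * Tk) / (Tj * Tk);
}

/* (7a.2): upper bound on it, using it_max for the isolated it. */
double upper_7a2(double w, double N, double Ti, double Tj, double Tk, double it_max)
{
    return (it_max * Tj * Tk - w * Tj * Tk + N * Tj + N * Tk) / (Ti * Tk);
}

/* Returns 1 if every integer it in [it_min, it_max] that satisfies (7a)
 * also lies between the relaxed bounds (7a.1) and (7a.2), else 0.
 * A small tolerance absorbs floating-point rounding at equality. */
int relaxation_is_safe(int w, int N, int Ti, int Tj, int Tk, int it_min, int it_max)
{
    const double eps = 1e-9;
    double lo = lower_7a1(w, N, Ti, Tj, Tk, it_min);
    double hi = upper_7a2(w, N, Ti, Tj, Tk, it_max);

    for (int it = it_min; it <= it_max; it++)
        if (holds_7a(w, N, Ti, Tj, Tk, it) && (it < lo - eps || it > hi + eps))
            return 0;
    return 1;
}
```

The check is expected to succeed whether Ti is smaller than, equal to, or larger than Tj, which is exactly the case distinction that plain Fourier Motzkin Elimination cannot make symbolically. The relaxed bounds may be looser than (7a), so some empty tiles can still be scanned.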
PTile: Prototype Implementation

Components:
  ◦ Pluto (with Clan)
  ◦ Modified CLooG
  ◦ Parser + AST Generator
  ◦ Convex Hull Generator (using ISL)
  ◦ Tiling Transformer
  ◦ Wavefront Parallelizer + RSFME
  ◦ Code Generator

Data passed between the stages, from input to output:
  Sequence of loop nests
  → Pre-process
  → Statement polyhedra + affine transforms (for rectangular tileability)
  → Tileable loop code (with preserved embedding information)
  → Loop ASTs and statement polyhedra
  → Convex-hull loop AST
  → Sequential tiled loop ASTs
  → Parallel tiled loop ASTs
  → Parallel tiled loop code

[Figure: block diagram of the PTile tool chain]
PTile, DynTile, PrimeTile: Experiments

Main comparison:
  ◦ PTile, DynTile and PrimeTile
AMD Opteron 2380:
  ◦ Dual-socket quad-core AMD Opteron 2380 processors running at 2.6 GHz with KB L1 cache, 2 MB of L2 cache
Compilers:
  ◦ GCC version
  ◦ ICC version 11.0
Experiments:
  ◦ With and without vectorization
  ◦ For parallel runs, used OpenMP
Benchmarks: 2-D FDTD, Cholesky, DTRMM, LU
Results - 1

  ◦ PTile: the RSFME relaxation step was never needed in these benchmarks or in the other benchmarks that we have tested
  ◦ Control overhead:
      PrimeTile has simple loop bounds but larger code size
      PTile and DynTile generate more complex loop bounds
      For 2D-FDTD, there is a 20% to 40% difference in execution time due to control overhead; for the other benchmarks, there is no significant difference
Results - 2

Execution times (seconds):

                          Sequential                      Parallel
Bench      Compiler       PrimeTile  DynTile   PTile      DynTile   PTile
2d-fdtd    gcc-novec      43.84s     49.19s    56.78s     9.32s     10.98s
2d-fdtd    gcc-vec        43.82s     49.22s    56.85s     9.37s     10.98s
2d-fdtd    icc-novec      40.27s     48.12s    54.29s     13.30s    12.96s
2d-fdtd    icc-vec        40.52s     49.61s    54.63s     13.03s    13.18s
cholesky   gcc-novec      6.13s      10.50s    13.43s     1.91s     2.81s
cholesky   gcc-vec        6.08s      10.46s    13.45s     1.89s     2.82s
cholesky   icc-novec      5.63s      5.86s     8.19s      1.21s     2.40s
cholesky   icc-vec        5.36s      5.74s     8.22s      1.27s     2.61s
dtrmm      gcc-novec      9.29s      14.34s    18.99s     2.55s     4.50s
dtrmm      gcc-vec        9.25s      14.57s    18.99s     2.54s     3.69s
dtrmm      icc-novec      9.84s      9.19s     13.27s     2.17s     3.22s
dtrmm      icc-vec        9.91s      9.12s     13.44s     2.33s     3.27s
lu         gcc-novec      8.30s      9.15s     10.98s     2.56s     2.94s
lu         gcc-vec        8.29s      9.15s     10.98s     2.98s     2.43s
lu         icc-novec      6.30s      5.63s     7.49s      6.18s     1.60s
lu         icc-vec        6.36s      5.58s     6.52s      6.36s     1.62s
Results - 3

Sequential:
  ◦ PrimeTile performs best; DynTile is close
  ◦ gcc has more trouble optimizing code from DynTile than code from PrimeTile (difference between icc and gcc)
  ◦ PTile is slower because the order of execution of tiles impacts locality
Parallel:
  ◦ DynTile performs better than PTile (except for LU; we need to understand this better)
  ◦ All tiles in a wavefront are executed in parallel with DynTile, whereas the OpenMP parallel pragma works only with the outermost tiled parallel loop in PTile
Vectorization:
  ◦ The complexity of the loop bounds in the generated code appears to make it difficult for the compiler to vectorize
Summary

Developed DynTile and PTile, two parametric tiling systems with the following features:
  ◦ Handle imperfectly nested loops
  ◦ Allow tile sizes to be runtime parameters
  ◦ Address parallelism
  ◦ Support multi-level tiling
Ongoing: a much more extensive set of experiments to understand and improve the efficiency of the approaches for generation of parallel parametrically tiled code