1 A Geometric Approach for Partitioning N-Dimensional Non-Rectangular Iteration Spaces
  Arun Kejariwal, Paolo D’Alberto, Alexandru Nicolau, Constantine D. Polychronopoulos
  Center for Embedded Computer Systems, University of California at Irvine
  Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign
2 Outline
  Introduction
  Terminology
  Motivation
  Problem statement
    Uniform Partitioning
    Processor Allocation
  Our Approach
  Experimental Results
  Conclusion
3 Introduction
  Scientific and numerical applications
    Computation intensive
    Large amounts of parallelism
  Multiprocessor systems
    Exploit parallelism
    Expose high-level loop parallelism
    Loop spreading
    Minimize communication overhead
    Minimize the number of processors
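4 Terminology
  Loop nest:
    do i = 1, N
      do j = 1, N
        H(i, j)
      enddo
    enddo
  Index point: one iteration (i, j) of the nest, e.g. (2,5) or (5,5)
  Iteration space (Γ): the set of all index points, drawn on the i and j axes starting at (1,1)
  Notation as used in “Loop Transformations for Restructuring Compilers” [Banerjee’93]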
5 Motivating Example
    do i_1 = 1, N
      do i_2 = 1, i_1
        do i_3 = 1, N
          H(i_1, i_2, i_3)
        enddo
      enddo
    enddo
  Figure (N = 6): the iteration space on axes i_1, i_2, i_3
  Top view (i_1-i_2 plane): triangular geometry
  Front view (i_1-i_3 plane): rectangular geometry
6 Motivating Example (top view, assume P = 3)
  Figure: the i_1-i_2 view, starting at (1,1), split into sets S_1, S_2, S_3 in two ways
  Contiguous partitioning
    Load imbalance
  Non-contiguous partitioning
    Perfect load balance
    Multiple loops per set
    Loss of locality
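The trade-off on this slide can be checked by counting index points in the top view. The sketch below is only an illustration: it assumes equal-width blocks of i_1 for the contiguous case and pairs row k with row N + 1 - k for the non-contiguous case. Neither choice is stated on the slide, and the helper names are hypothetical.

```python
def row_sizes(n):
    """Index points per outer iteration i_1 in the top view (i_2 runs 1..i_1)."""
    return {i1: i1 for i1 in range(1, n + 1)}


def load(rows, sizes):
    """Total number of index points in a set made of the given i_1 rows."""
    return sum(sizes[i1] for i1 in rows)


N, P = 6, 3
sizes = row_sizes(N)

# Contiguous (assumed scheme): equal-width blocks of the outer loop.
width = N // P
contiguous = [list(range(p * width + 1, (p + 1) * width + 1)) for p in range(P)]

# Non-contiguous (assumed scheme): pair a short row with a long row.
non_contiguous = [[k, N + 1 - k] for k in range(1, P + 1)]

contiguous_loads = [load(rows, sizes) for rows in contiguous]
non_contiguous_loads = [load(rows, sizes) for rows in non_contiguous]

print("contiguous    :", contiguous, contiguous_loads)
print("non-contiguous:", non_contiguous, non_contiguous_loads)
```

Under these assumptions the contiguous sets carry 3, 7 and 11 index points, while the paired sets carry 7 each, at the cost of each processor running two separate ranges of the outer loop.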
7 Motivating Example (front view, assume P = 3)
  Figure: the rectangular i_1-i_3 view, starting at (1,1)
  Loop permutation-based contiguous partitioning
    Perfect load balance
    Requires remapping of index expressions
    Finding a permutation that yields a uniform partition is non-trivial
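To see why the permuted loop order balances perfectly here, one can count the index points each processor receives when the nest is cut into equal-width contiguous blocks along i_1 versus along i_3. This is a small brute-force sketch for N = 6 and P = 3; the equal-width block assignment is an assumption made for illustration.

```python
N, P = 6, 3
block = N // P  # outer iterations per processor (N is divisible by P here)

loads_by_i1 = [0] * P  # contiguous blocks of the original outermost loop i_1
loads_by_i3 = [0] * P  # contiguous blocks after moving i_3 outermost

for i1 in range(1, N + 1):
    for i2 in range(1, i1 + 1):
        for i3 in range(1, N + 1):
            loads_by_i1[(i1 - 1) // block] += 1
            loads_by_i3[(i3 - 1) // block] += 1

print("cut along i_1:", loads_by_i1)
print("cut along i_3:", loads_by_i3)
```

Cutting along i_3 works because every i_3 slab contains the same triangular set of (i_1, i_2) pairs; a nest without such a rectangular dimension offers no permutation of this kind.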
8 Motivating Example (top view)
  Figure: the i_1-i_2 view, starting at (1,1), partitioned for P = 4 (sets S_1 to S_4) and for P = 5 (sets S_1 to S_5)
  Processor allocation during iteration space partitioning
9 Previous Work
  Cyclic partitioning
    False sharing
  Balanced chunk scheduling [Haghighat92]
    Restricted to double loops
  Canonical loop partitioning [Sakellariou96]
    Non-contiguous partitioning
  Communication minimization [Dion96, Koziris97]
    Do not address processor allocation
10 Our Model
  A perfectly nested DOALL loop over a non-rectangular iteration space:
    do i_1 = 1, N, s_1
      do i_2 = f_1(i_1), g_1(i_1), s_2
        ...
          do i_n = f_{n-1}(i_1, i_2, ..., i_{n-1}), g_{n-1}(i_1, i_2, ..., i_{n-1}), s_n
            LOOP BODY
          enddo
        ...
      enddo
    enddo
  The bounds of loop i_r (2 ≤ r ≤ n) are affine functions of the enclosing loop indices:
    f_{r-1}(i_1, i_2, ..., i_{r-1}) = a_{r0} + a_{r1} i_1 + ... + a_{r(r-1)} i_{r-1}
    g_{r-1}(i_1, i_2, ..., i_{r-1}) = b_{r0} + b_{r1} i_1 + ... + b_{r(r-1)} i_{r-1}
    f_{r-1} ≤ g_{r-1}
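A nest of this form can be described by one coefficient tuple per inner-loop bound. The sketch below counts the index points of such a nest by direct enumeration. It is an illustration, not the authors' implementation: it assumes unit strides (s_1 = ... = s_n = 1), and the names `affine` and `count_points` are hypothetical.

```python
def affine(coeffs, idx):
    """Evaluate c0 + c1*i_1 + c2*i_2 + ... for the outer indices chosen so far."""
    return coeffs[0] + sum(c * i for c, i in zip(coeffs[1:], idx))


def count_points(N, lower, upper, idx=()):
    """Count index points of a perfect nest with unit strides.

    The outermost loop runs 1..N. lower[r] and upper[r] hold the coefficient
    tuples (c0, c1, ..., c_{r+1}) of the bounds of the (r+2)-th loop.
    """
    depth = len(idx)
    if depth == 0:
        lo, hi = 1, N
    else:
        lo = affine(lower[depth - 1], idx)
        hi = affine(upper[depth - 1], idx)
    if depth == len(lower):  # innermost loop: count its iterations
        return max(0, hi - lo + 1)
    return sum(count_points(N, lower, upper, idx + (i,)) for i in range(lo, hi + 1))


# Motivating example: i_2 = 1..i_1, i_3 = 1..N, with N = 6.
motivating = count_points(6, lower=[(1, 0), (1, 0, 0)], upper=[(0, 1), (6, 0, 0)])

# Fully triangular nest: j = 1..i, k = 1..j, with N = 7.
triangular = count_points(7, lower=[(1, 0), (1, 0, 0)], upper=[(0, 1), (0, 0, 1)])

print(motivating, triangular)
```

Enumeration costs time proportional to the size of the iteration space, which is the motivation for a geometric (volume-based) treatment.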
11 Problem Statement
  Input: an N-dimensional iteration space (Γ) and P processors
  Output: P partitions with “uniform” load
  Figure: Γ on axes i and j, from (1,1) to (N,1), cut along the outermost loop into sets assigned to processors P_1, P_2, ...
12 Problem Statement
  I. Uniform Partitioning
    Given: an iteration space Γ and P processors
    Objective: find a contiguous partition with uniform load across the processors
  II. Processor Allocation
    Given: a partition with minimum execution time
    Objective: minimize the number of processors for the given partition while maintaining the performance
13 Our Approach
  Basic idea
    Model the iteration space as a convex polytope
    Partition the polytope into sets of equal volume
    Equal volumes ≡ uniform distribution of index points
    Map each set of the partition to a different processor
14 Our Approach
  Step 1: Compute the total volume V of Γ
    do i = 1, N
      do j = 1, i
        do k = 1, j
          LOOP BODY
        enddo
      enddo
    enddo
  Figure (N = 7): Γ on axes i, j, k, from (1,1,1) to (7,7,7)
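As a worked instance of step 1, the total can be written in closed form for this nest. The derivation below reads "volume" as the number of index points, which is an assumption; the slide does not state which measure is used.

```latex
V \;=\; \sum_{i=1}^{N}\sum_{j=1}^{i}\sum_{k=1}^{j} 1
  \;=\; \sum_{i=1}^{N} \frac{i(i+1)}{2}
  \;=\; \frac{N(N+1)(N+2)}{6},
\qquad
N = 7 \;\Rightarrow\; V = \frac{7 \cdot 8 \cdot 9}{6} = 84 .
```

The leading term N^3/6 is the volume of the corresponding continuous simplex, so for large N the two readings agree to first order.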
15 Our Approach
  Step 2: Compute a partial volume V(x) of Γ
    Figure: Γ from (1,1,1) to (7,7,7) on axes i, j, k, cut at x along the outermost axis
  Step 3: Determine the breakpoints γ_k, for 1 ≤ k ≤ P-1
    Figure (P = 3): breakpoints γ_1 and γ_2; each set has equal volume
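Steps 2 and 3 amount to solving V(γ_k) = k V / P for each k. A minimal sketch follows, assuming V(x) is continuous and increasing on [0, N] so that bisection applies. The partial volume x^3/6 used in the example is a continuous relaxation of the triangular nest chosen for illustration, and the function names are hypothetical.

```python
def breakpoints(partial_volume, N, P, tol=1e-9):
    """Return real breakpoints g_1 < ... < g_{P-1} with V(g_k) = k * V(N) / P.

    Assumes partial_volume is continuous and increasing on [0, N].
    """
    total = partial_volume(N)
    points = []
    for k in range(1, P):
        target = k * total / P
        lo, hi = 0.0, float(N)
        while hi - lo > tol:  # plain bisection on the outermost axis
            mid = (lo + hi) / 2
            if partial_volume(mid) < target:
                lo = mid
            else:
                hi = mid
        points.append((lo + hi) / 2)
    return points


def tetra_volume(x):
    """Assumed relaxation: volume of the region 0 <= k <= j <= i <= x."""
    return x ** 3 / 6


gammas = breakpoints(tetra_volume, N=7, P=3)
print(gammas)
```

For this particular V(x) the breakpoints also have the closed form γ_k = N (k/P)^(1/3); bisection is shown because it carries over to any monotone partial volume.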
16 Our Approach
  Step 4: Eliminate void sets
    Figure (P = 5): the i_1-i_2 view, starting at (1,1), with non-empty sets S_1 to S_4; the void set is eliminated
    Minimizes the number of processors
    Size of the largest set remains constant
17 Our Approach
  Step 5: Determine loop bounds
    Given the breakpoints γ_k, compute (lb_i, ub_i) for each set S_i
    Figure: the i_1-i_2 view from (1,1) to (6,1), sets S_1 to S_4 separated by γ_1, γ_2, γ_3
      (lb_1, ub_1) = (1, 3)
      (lb_2, ub_2) = (4, 4)
      (lb_3, ub_3) = (5, 5)
      (lb_4, ub_4) = (6, 6)
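Steps 4 and 5 can be sketched together: round each real breakpoint to an integer cut of the outermost loop, then drop any set whose range comes out empty. The example assumes the triangular i_1-i_2 view with N = 6 and P = 5, the continuous partial volume V(x) = x^2/2 (so γ_k = N sqrt(k/P)), and round-to-nearest. These are illustrative assumptions rather than the authors' stated procedure, and `loop_bounds` is a hypothetical name.

```python
import math


def loop_bounds(gammas, N):
    """Convert real breakpoints into integer (lb, ub) ranges of the outer loop.

    Ranges that come out empty (void sets) are dropped.
    """
    cuts = [round(g) for g in gammas] + [N]  # the last set always ends at N
    bounds = []
    lb = 1
    for ub in cuts:
        if ub >= lb:  # non-empty set: keep it and advance the lower bound
            bounds.append((lb, ub))
            lb = ub + 1
        # otherwise the set is void and needs no processor
    return bounds


N, P = 6, 5
gammas = [N * math.sqrt(k / P) for k in range(1, P)]  # assumed V(x) = x^2 / 2
bounds = loop_bounds(gammas, N)

# Index points per set in the triangular view: row i_1 holds i_1 points.
set_sizes = [sum(range(lb, ub + 1)) for lb, ub in bounds]

print(bounds)
print(set_sizes)
```

Under these assumptions the five requested sets collapse to four, with bounds (1, 3), (4, 4), (5, 5), (6, 6), and the largest set still holds 6 index points.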
18 Setup
  Applications: numerical packages (LINPACK etc.) and the literature
  Platform: 4-way shared-memory multiprocessor
  Problem size: N = 1000
  Results legend
    VOL: our volume-based approach
    CAN: canonical loop partitioning
19 Results (contd.)
  Performance comparison
  Highlights
    a) Yields better performance
    b) A generic approach
  Table (flattened in extraction): number of index points in the largest set, by loop nest (L_1 to L_4) and number of processors, VOL vs. CAN
    Recovered VOL/CAN pairs, first row: 200000/NA, 100000/NA, 50000/NA, 25000/NA
    Recovered VOL/CAN pairs, second row: 25000/NA, 12500/NA, 6250/NA, 3150/NA
20 Conclusions
  Geometric approach for iteration space partitioning
    Load balancing
    Processor allocation
  More general than existing techniques
  Future work
    Run-time partitioning