1
A Geometric Approach for Partitioning N-Dimensional Non-Rectangular Iteration Spaces
Arun Kejariwal, Paolo D’Alberto, Alexandru Nicolau, Constantine D. Polychronopoulos
1: Center for Embedded Computer Systems, University of California at Irvine
2: Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign
2
2 Outline
  Introduction
  Terminology
  Motivation
  Problem statement
    Uniform Partitioning
    Processor Allocation
  Our Approach
  Experimental Results
  Conclusion
3
3 Introduction
  Scientific and numerical applications
    Computation intensive
    Large amounts of parallelism
  Multiprocessor systems
    Exploit parallelism
      Expose high-level loop parallelism
      Loop spreading
    Minimize communication overhead
    Minimize the number of processors
4
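4 Terminology
      do i = 1, N
        do j = 1, N
          H(i, j)
        enddo
      enddo
  Index point: one iteration (i, j) of the loop nest, e.g. (2,5) or (5,5)
  Iteration space (Γ): the set of all index points of the loop nest
  [Figure: iteration space Γ in the i-j plane, starting at (1,1), with the index points (2,5) and (5,5) marked]
  Notation as used in “Loop Transformations for Restructuring Compilers” [Banerjee’93]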
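As a small illustration of the terminology, the iteration space of the rectangular nest above can be enumerated directly. This is only a sketch: the function name `iteration_space` is hypothetical, and N = 5 is an assumption suggested by the (5,5) point in the figure.

```python
from itertools import product

def iteration_space(n):
    """Index points (i, j) of the nest: do i = 1, n / do j = 1, n."""
    return list(product(range(1, n + 1), repeat=2))

# N = 5 is an assumption taken from the (5,5) label in the figure.
gamma = iteration_space(5)
print(len(gamma))         # 25 index points in the 5 x 5 space
print((2, 5) in gamma)    # True: (2, 5) is one index point of the space
```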
5
5 Motivating Example
      do i1 = 1, N
        do i2 = 1, i1
          do i3 = 1, N
            H(i1, i2, i3)
          end do
        end do
      end do
  [Figure: iteration space for N = 6]
  Top view (i1-i2 plane): triangular geometry
  Front view (i1-i3 plane): rectangular geometry
6
6 Motivating Example
  [Figure: top view (i1-i2 plane) starting at (1,1), partitioned into sets S1, S2, S3 in two ways]
  Assume P = 3
  Contiguous partitioning
    Load imbalance
  Non-contiguous partitioning
    Perfect load balance
    Multiple loops per set
    Loss of locality
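The trade-off on this slide can be checked by counting index points per set in the triangular top view for N = 6 and P = 3. The sketch below is illustrative only: the equal-width blocks and the "folded" pairing of short and long columns are assumed partitions, not necessarily the ones drawn in the figure, and all function names are hypothetical.

```python
def set_load(columns):
    """Index points in the top view for a set of i1 values (i2 runs 1..i1)."""
    return sum(i1 for i1 in columns)

def contiguous_sets(n, p):
    """Equal-width blocks of consecutive i1 values (assumes p divides n)."""
    width = n // p
    return [list(range(k * width + 1, (k + 1) * width + 1)) for k in range(p)]

def folded_sets(n, p):
    """Pair the k-th shortest column with the k-th longest (assumes n == 2 * p)."""
    return [[k + 1, n - k] for k in range(p)]

N, P = 6, 3
contiguous = contiguous_sets(N, P)   # [[1, 2], [3, 4], [5, 6]]
folded = folded_sets(N, P)           # [[1, 6], [2, 5], [3, 4]]
print([set_load(s) for s in contiguous])   # [3, 7, 11]: load imbalance
print([set_load(s) for s in folded])       # [7, 7, 7]: balanced, but non-contiguous
```

Each folded set needs two separate loops over i1, which is the "multiple loops per set" cost named on the slide.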
7
7 Motivating Example
  [Figure: front view (i1-i3 plane) starting at (1,1)]
  Assume P = 3
  Loop permutation-based contiguous partitioning
    Perfect load balance
    Remapping of index expressions
    Finding a permutation for uniform partitioning is non-trivial
8
8 Motivating Example
  Processor allocation during iteration space partitioning
  [Figure: top view (i1-i2 plane) starting at (1,1), partitioned into sets S1 to S4 for P = 4 and into sets S1 to S5 for P = 5]
9
9 Previous Work
  Cyclic partitioning
    False sharing
  Balanced chunk scheduling [Haghighat92]
    Restricted to double loops
  Canonical loop partitioning [Sakellariou96]
    Non-contiguous partitioning
  Communication minimization [Dion96, Koziris97]
    Do not address processor allocation
10
10 Our Model
  A perfectly nested DOALL loop with a non-rectangular iteration space:
      do i_1 = 1, N, s_1
        do i_2 = f_1(i_1), g_1(i_1), s_2
          ...
            do i_n = f_{n-1}(i_1, i_2, …, i_{n-1}), g_{n-1}(i_1, i_2, …, i_{n-1}), s_n
              LOOP BODY
            enddo
          ...
        enddo
      enddo
  The lower and upper bounds are affine functions of the outer loop indices, for 1 ≤ r ≤ n-1:
    $f_r(i_1, i_2, \ldots, i_r) = a_{r0} + a_{r1} i_1 + \cdots + a_{rr} i_r$
    $g_r(i_1, i_2, \ldots, i_r) = b_{r0} + b_{r1} i_1 + \cdots + b_{rr} i_r$
    $f_r \le g_r$
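To make the loop model concrete, the following sketch enumerates the index points of a perfect nest whose inner bounds are affine in the outer indices. The coefficient-list encoding and the names `affine` and `index_points` are assumptions made for illustration; they do not come from the slides.

```python
def affine(coeffs, outer):
    """Evaluate c0 + c1*i_1 + ... + ck*i_k for the outer indices i_1..i_k."""
    return coeffs[0] + sum(c * i for c, i in zip(coeffs[1:], outer))

def index_points(n, lowers, uppers, steps=None):
    """Enumerate the index points of a perfect nest with affine bounds.

    lowers[r-1] and uppers[r-1] hold the coefficients of the lower and
    upper bound of loop i_(r+1); the outermost loop runs from 1 to n.
    """
    depth = len(lowers) + 1
    steps = steps or [1] * depth

    def walk(prefix):
        level = len(prefix)            # number of indices fixed so far
        if level == depth:
            yield tuple(prefix)
            return
        if level == 0:
            lo, hi = 1, n
        else:
            lo = affine(lowers[level - 1], prefix)
            hi = affine(uppers[level - 1], prefix)
        for i in range(lo, hi + 1, steps[level]):
            yield from walk(prefix + [i])

    return list(walk([]))

# Motivating example with N = 6: i2 = 1..i1, i3 = 1..6
points = index_points(6, lowers=[[1], [1]], uppers=[[0, 1], [6]])
print(len(points))   # 126 = 21 (triangle) * 6
```

A missing trailing coefficient is treated as zero, so `[1]` stands for the constant bound 1 and `[0, 1]` for the bound i_1.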
11
11 Problem Statement
  Input: an N-dimensional iteration space (Γ) and P processors
  Output: P partitions with “uniform” load
  [Figure: iteration space in the i-j plane from (1,1) to (N,1), cut along the outermost loop (i) into sets assigned to processors P_1, P_2, …, P_P]
12
12 Problem Statement
  I. Uniform Partitioning
    Given: an iteration space Γ and P processors
    Objective: find a contiguous partition with uniform load across the processors
  II. Processor Allocation
    Given: a partition with minimum execution time
    Objective: minimize the number of processors for the given partition while maintaining the performance
13
13 Our Approach
  Basic idea
    Model the iteration space as a convex polytope
    Partition the polytope into sets of equal volume
      Equal volumes ≡ uniform distribution of index points
    Each set of the partition is mapped to a different processor
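One way to write the equal-volume condition down is given here as a sketch. The notation is assumed for illustration: V is the volume of the polytope Γ, and S_1, …, S_P are the sets of the partition.

```latex
V = \int_{\Gamma} d\mathbf{x}, \qquad
\Gamma = S_1 \cup S_2 \cup \dots \cup S_P, \qquad
\mathrm{vol}(S_k) = \frac{V}{P} \quad \text{for } 1 \le k \le P .
```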
14
14 Our Approach
  Step 1: Compute the total volume V of Γ
      do i = 1, N
        do j = 1, i
          do k = 1, j
            LOOP BODY
          enddo
        enddo
      enddo
  [Figure: iteration space in (i, j, k) for N = 7, from (1,1,1) to (7,7,7)]
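For the nest on this slide, the amount of work can be checked by brute force. The sketch below counts index points exactly and compares the count with the closed form n(n+1)(n+2)/6. Using the discrete count as a stand-in for the polytope volume is an assumption made here; a continuous volume would differ slightly, and the function names are hypothetical.

```python
def total_points(n):
    """Count the index points of: do i = 1, n / do j = 1, i / do k = 1, j."""
    return sum(1 for i in range(1, n + 1)
                 for j in range(1, i + 1)
                 for k in range(1, j + 1))

def closed_form(n):
    """Closed form n(n+1)(n+2)/6 for the same count."""
    return n * (n + 1) * (n + 2) // 6

print(total_points(7), closed_form(7))   # 84 84 for N = 7
```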
15
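15 Our Approach
  Step 2: Compute a partial volume V(x) of Γ
  [Figure: the part of Γ from (1,1,1) up to x, inside the space that extends to (7,7,7)]
  Step 3: Determine the breakpoints γ_k, for 1 ≤ k ≤ P-1
  [Figure: for P = 3, the breakpoints γ_1 and γ_2 cut Γ into three sets; each set has equal volume]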
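Steps 2 and 3 can be sketched numerically: given a monotone partial volume V(x), each breakpoint γ_k solves V(γ_k) = k·V/P. The continuous model V(x) = x³/6 on [0, N] for the tetrahedral nest and the bisection solver are assumptions for illustration; the slides do not state how V(x) is obtained or solved.

```python
def partial_volume(x):
    """Volume of {0 <= k <= j <= i <= x}: a continuous stand-in for the nest."""
    return x ** 3 / 6.0

def breakpoints(volume, n, p, tol=1e-9):
    """Solve volume(gamma_k) = k * volume(n) / p for k = 1..p-1 by bisection."""
    total = volume(n)
    gammas = []
    for k in range(1, p):
        target = k * total / p
        lo, hi = 0.0, float(n)
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if volume(mid) < target:
                lo = mid
            else:
                hi = mid
        gammas.append((lo + hi) / 2.0)
    return gammas

gammas = breakpoints(partial_volume, n=7, p=3)
print([round(g, 2) for g in gammas])   # approximately [4.85, 6.12]
```

The breakpoints are real numbers under this model, so they still have to be mapped to integer loop bounds.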
16
16 Our Approach
  Step 4: Eliminate void sets
    Minimizes the number of processors
    Size of the largest set remains constant
  [Figure: top view (i1-i2 plane) starting at (1,1) for P = 5; after elimination the sets S1, S2, S3, S4 remain]
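A possible reading of this step, sketched under stated assumptions: real breakpoints are rounded to integer values of the outermost index, and any set whose range becomes empty is dropped. The continuous model (area x²/2 on [0, N] for the triangular top view, hence γ_k = N·sqrt(k/P)) and the round-to-nearest rule are assumptions, not taken from the slides; the function names are hypothetical.

```python
import math

def rounded_sets(n, gammas):
    """Turn real breakpoints into integer (lb, ub) ranges over i1 = 1..n."""
    cuts = [0] + [round(g) for g in gammas] + [n]
    return [(cuts[k] + 1, cuts[k + 1]) for k in range(len(cuts) - 1)]

def eliminate_void(sets):
    """Drop sets that contain no value of i1 (lb > ub)."""
    return [(lb, ub) for lb, ub in sets if lb <= ub]

def largest_load(sets):
    """Largest number of index points in any set (i2 runs 1..i1)."""
    return max(sum(range(lb, ub + 1)) for lb, ub in sets)

N, P = 6, 5
gammas = [N * math.sqrt(k / P) for k in range(1, P)]   # assumed model
before = rounded_sets(N, gammas)
after = eliminate_void(before)
print(before)   # [(1, 3), (4, 4), (5, 5), (6, 5), (6, 6)]
print(after)    # [(1, 3), (4, 4), (5, 5), (6, 6)]
print(largest_load(before), largest_load(after))   # 6 6
```

Under these assumptions one of the five sets is void, four processors suffice, and the largest set still holds 6 index points.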
17
17 Our Approach
  Step 5: Determine loop bounds
    Given the breakpoints γ_k, compute (lb_i, ub_i) for each set S_i
      (lb_1, ub_1) = (1, 3)
      (lb_2, ub_2) = (4, 4)
      (lb_3, ub_3) = (5, 5)
      (lb_4, ub_4) = (6, 6)
  [Figure: top view (i1-i2 plane) from (1,1) to (6,1), with breakpoints γ_1, γ_2, γ_3 separating the sets S1, S2, S3, S4]
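The bounds listed on this slide follow mechanically once the breakpoints are integers. The sketch below assumes integer breakpoints 3, 4, 5 (chosen to be consistent with the bounds shown; the slide does not give γ_k numerically) and checks that the per-processor loops cover the triangular space exactly once. Function names are hypothetical.

```python
def loop_bounds(breaks, n):
    """Integer breakpoints gamma_1 < ... < gamma_m -> one (lb, ub) per set."""
    lbs = [1] + [g + 1 for g in breaks]
    ubs = list(breaks) + [n]
    return list(zip(lbs, ubs))

def run_set(lb, ub):
    """Index points executed by one processor: i1 = lb..ub, i2 = 1..i1."""
    return [(i1, i2) for i1 in range(lb, ub + 1) for i2 in range(1, i1 + 1)]

bounds = loop_bounds([3, 4, 5], n=6)
print(bounds)                                        # [(1, 3), (4, 4), (5, 5), (6, 6)]
print([len(run_set(lb, ub)) for lb, ub in bounds])   # [6, 4, 5, 6]
```

Each processor then runs the original nest with its outermost loop restricted to lb..ub.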
18
18 Results
  Setup
    Applications: numerical packages (LINPACK etc.) and literature
    Platform: 4-way shared-memory multiprocessor
    Problem size: N = 1000
  VOL: our volume-based approach
  CAN: canonical loop partitioning
19
19 Results (contd.)
  Performance comparison: number of index points in the largest set (P = number of processors)
  Highlights:
    a) Yields better performance
    b) A generic approach

  Loop    P = 2                 P = 4                 P = 8                 P = 16
  nest    VOL       CAN         VOL       CAN         VOL       CAN         VOL       CAN
  L1      83368284  83596878    41810044  43807393    21003180  22397760    10738024  11538395
  L2      476000    516000      219900    225037      109500    112056      55000     57232
  L3      200000    NA          100000    NA          50000     NA          25000     NA
  L4      25000     NA          12500     NA          6250      NA          3150      NA
20
20 Conclusions
  Geometric approach for iteration space partitioning
    Load balancing
    Processor allocation
  More general than existing techniques
  Future work
    Run-time partitioning
21
21 Results (contd.)
  Performance comparison: number of index points in the largest set
  Highlights:
    a) Yields better performance
    b) A generic approach