Published by Isabella Elaine Kelley. Modified over 9 years ago.
1
Parallelization of Irregular Applications: Hypergraph-based Models and Methods for Partitioning and Load Balancing Cevdet Aykanat Bilkent University Computer Engineering Department
2
2 “Partitioning” of irregular computations
Partitioning of computation into smaller units of work
Divide the work and data for efficient parallel computation (applicable to data parallelism)
3
3 Partitioning of irregular computations: “Good Partitioning”
Low computational imbalance (computational load balancing)
Low communication overhead
– Total message volume
– Total message count (latency)
– Maximum message volume and count per processor (communication load minimization and balancing)
4
4 Why are graph models not sufficient?
Existing graph models
– Standard graph model
– Bipartite graph model
– Multi-constraint / multi-objective graph partitioning
– Skewed partitioning
Flaws
– Wrong cost metric for communication volume
– Latency: the number of messages is also important
– The maximum volume and/or message count should also be minimized
Limitations
– The standard graph model can only express symmetric dependencies
– Symmetric = identical partitioning of input and output data
– Multiple computation phases
5
5 Preliminaries: Graph Partitioning Partitioning constraint: maintaining balance on part weights Partitioning objective: minimizing cutsize
6
6 Preliminaries: Hypergraph Partitioning Partitioning constraint: maintaining balance on part weights Partitioning objective: minimizing cutsize
7
7 Irregular Computations – an example: Parallel Sparse Matrix-Vector Multiply (y=Ax)
Abounds in scientific computing
– A concrete example is iterative methods, where SpMxV is applied repeatedly
– Efficient parallelization requires load balance and small communication cost
Characterizes a wide range of applications with irregular computational dependency
A fine-grain computation
– Guaranteeing efficiency here will guarantee higher efficiency in applications with coarser-grain computations
8
8 Sparse-matrix partitioning taxonomy for parallel y=Ax
9
9 Row-parallel y=Ax
Rows (and hence y) and x are partitioned
1. Expand x vector (sends/receives)
2. Compute with diagonal blocks
3. Receive x and compute with off-diagonal blocks
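The three steps can be sketched as a single-process simulation. This is only an illustration: the function name, the dictionary-based matrix layout, and the tiny 4 × 4 example are assumptions, and real messages are replaced by a per-processor `received` dictionary.

```python
# Sketch of row-parallel y = Ax on K simulated processors (all names are illustrative).
def row_parallel_spmv(rows, x, row_part, x_part, K):
    """rows: {i: {j: a_ij}}; row_part[i] and x_part[j] give the owning processor."""
    # Step 1 (expand): each processor collects the x entries it needs but does not own.
    received = [dict() for _ in range(K)]
    for i, row in rows.items():
        p = row_part[i]
        for j in row:
            if x_part[j] != p:
                received[p][j] = x[j]  # stands in for a message from x_part[j] to p
    # Steps 2 and 3: multiply with locally owned x entries (diagonal block),
    # then with the received entries (off-diagonal blocks).
    y = {}
    for i, row in rows.items():
        p = row_part[i]
        local = sum(a * x[j] for j, a in row.items() if x_part[j] == p)
        remote = sum(a * received[p][j] for j, a in row.items() if x_part[j] != p)
        y[i] = local + remote
    return y, received

rows = {0: {0: 2.0, 2: 1.0}, 1: {1: 3.0}, 2: {0: 4.0, 2: 5.0}, 3: {1: 1.0, 3: 6.0}}
x = [1.0, 2.0, 3.0, 4.0]
part = [0, 0, 1, 1]  # symmetric partitioning: row i and x_i share an owner
y, received = row_parallel_spmv(rows, x, part, part, K=2)
```

Here `received[1]` ends up holding x_0 and x_1, the two entries processor 1 must obtain from processor 0 before it can finish its rows.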
10
10 Row-parallel y=Ax: communication requirements
Total message volume: number of nonzero column segments in off-diagonal blocks (13 in the pictured example)
Total message count: number of nonzero off-diagonal blocks (9 in the pictured example)
Per processor: the above two metrics confined within a column stripe
We minimize the total volume and number of messages and obtain balance on a per-processor basis
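Both totals can be counted directly from the nonzero pattern and the partition. The sketch below is an illustration under assumed names (`row_parallel_comm`, `row_part`, `x_part`); the small matrix is made up and is not the one pictured on the slide.

```python
def row_parallel_comm(nonzeros, row_part, x_part):
    """Return (total volume in words, total message count) for the expand phase."""
    needs = set()     # (receiver, column j): one nonzero column segment in an off-diagonal block
    messages = set()  # (sender, receiver): one nonzero off-diagonal block
    for i, j in nonzeros:
        receiver, sender = row_part[i], x_part[j]
        if receiver != sender:
            needs.add((receiver, j))
            messages.add((sender, receiver))
    return len(needs), len(messages)

nonzeros = [(0, 0), (0, 2), (1, 1), (1, 2), (2, 0), (2, 2), (3, 0), (3, 1), (3, 3)]
part = [0, 0, 1, 1]
volume, count = row_parallel_comm(nonzeros, part, part)
```

Several nonzeros in the same column of the same off-diagonal block count once toward the volume, because the x entry is sent to that processor only once.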
11
11 Column-parallel y=Ax
Columns (and hence x) and y are partitioned
1. Compute with off-diagonal blocks; obtain partial y results; issue sends/receives
2. Compute with diagonal block
3. Receive partial results on y for nonzero off-diagonal blocks and add the partial results
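A single-process sketch of the column-parallel scheme follows. The names and the example matrix are assumptions for illustration, and the fold messages are only recorded as (sender, receiver, row) triples rather than actually sent.

```python
def column_parallel_spmv(cols, x, col_part, y_part, n_rows, K):
    """cols: {j: {i: a_ij}}; col_part[j] and y_part[i] give the owning processor."""
    # Each processor multiplies its own columns and accumulates partial y results.
    partial = [dict() for _ in range(K)]
    for j, col in cols.items():
        p = col_part[j]
        for i, a in col.items():
            partial[p][i] = partial[p].get(i, 0.0) + a * x[j]
    # Fold: a partial result for a y entry owned elsewhere is sent to its owner and added.
    y = [0.0] * n_rows
    folds = []
    for p in range(K):
        for i, value in partial[p].items():
            if y_part[i] != p:
                folds.append((p, y_part[i], i))
            y[i] += value
    return y, folds

cols = {0: {0: 2.0, 2: 4.0}, 1: {1: 3.0, 3: 1.0}, 2: {0: 1.0, 2: 5.0}, 3: {3: 6.0}}
x = [1.0, 2.0, 3.0, 4.0]
part = [0, 0, 1, 1]
y, folds = column_parallel_spmv(cols, x, part, part, n_rows=4, K=2)
```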
12
12 Row-column-parallel y=Ax
Matrix nonzeros, x, and y are partitioned
1. Expand x vector
2. Scalar multiply and add: y_i = a_ij x_j + a_ik x_k
3. Fold on y vector (send and receive partial results)
[Figure labels: send x_5 and x_7 (expand); send partial y_9 (fold)]
13
13 Hypergraph Models for y = Ax
14
14 Hypergraph Model for row-parallel y = Ax
1D Rowwise Partitioning: Column-net Model
Columns are nets; rows are vertices: connect v_i to n_j if a_ij is nonzero
Assign y_i using v_i, x_j using n_j (permutation policy)
Respects row coherence, disturbs column coherence
Symmetric partitioning: a_ii should be nonzero
Partitioning constraint: computational load balancing
Partitioning objective: minimizing total communication volume
Comm. vol. = 2(x_5) + 2(x_12) + 1(x_7) + 1(x_11) = 6 words
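The column-net construction and the volume it measures can be sketched as follows. The function names and the 4 × 4 pattern are illustrative assumptions; the cutsize used is the connectivity-minus-one metric, which equals the expand volume when every x_j is assigned to a part that its net connects (guaranteed here by the nonzero diagonal).

```python
def column_net_hypergraph(nonzeros):
    """nets[j] = set of row vertices i with a_ij != 0."""
    nets = {}
    for i, j in nonzeros:
        nets.setdefault(j, set()).add(i)
    return nets

def connectivity_minus_one(nets, vertex_part):
    """Sum over nets of (number of parts the net connects - 1)."""
    return sum(len({vertex_part[v] for v in pins}) - 1 for pins in nets.values())

nonzeros = [(0, 0), (0, 2), (1, 1), (1, 2), (2, 0), (2, 2), (3, 0), (3, 1), (3, 3)]
nets = column_net_hypergraph(nonzeros)
cut = connectivity_minus_one(nets, vertex_part=[0, 0, 1, 1])
```

Nets 0, 1, and 2 each connect both parts and contribute one word apiece, while net 3 is internal to part 1 and contributes nothing.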
15
15 Hypergraph Model for column-parallel y = Ax
1D Columnwise Partitioning: Row-net Model
Dual of column-net model: rows are nets; columns are vertices
Assign x_j using v_j, y_i using n_i (permutation policy)
Symmetric partitioning: nonzero diagonal entries
Respects column coherence, disturbs row coherence
Partitioning constraint: computational load balancing
Partitioning objective: minimizing total communication volume
Comm. vol. = 2(y_5) + 2(y_12) + 1(y_6) + 1(y_10) = 6 words
16
16 Hypergraph Model for row-column-parallel y = Ax
2D Fine-Grain Partitioning: Row-column-net Model
M × N matrix A with nnz nonzeros: M row nets, N column nets, nnz vertices
Vertex a_ij is connected to nets r_i and c_j
[Figure: vertex a_32 connected to nets r_3 and c_2]
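A minimal sketch of the row-column-net construction, assuming the nonzeros are given as (i, j) pairs and that vertices are simply numbered in input order. The function name and the example pattern are hypothetical.

```python
def fine_grain_hypergraph(nonzeros):
    """One vertex per nonzero; vertex v = a_ij is a pin of row net r_i and column net c_j."""
    row_nets, col_nets = {}, {}
    for v, (i, j) in enumerate(nonzeros):
        row_nets.setdefault(i, set()).add(v)
        col_nets.setdefault(j, set()).add(v)
    return row_nets, col_nets

nonzeros = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]
row_nets, col_nets = fine_grain_hypergraph(nonzeros)
total_pins = sum(len(p) for p in row_nets.values()) + sum(len(p) for p in col_nets.values())
```

Every vertex has exactly two pins, so the hypergraph always has 2 × nnz pins regardless of the sparsity pattern.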
17
17 2D Fine-Grain Partitioning
One vertex for each nonzero
Disturbs both row and column coherence
Partitioning constraint: computational load balancing
Partitioning objective: minimizing total communication volume
Symmetric partitioning: nonzero diagonal entries
[Figure: 4-way partition over P_1 to P_4, with example nets labeled con−1 = 2 and con−1 = 1]
18
18 Hypergraph Methods for y = Ax
19
19 Hypergraph Methods for y = Ax
2D Jagged and 2D Checkerboard
√K-by-√K virtual processor mesh
2-phase method: each phase models either expand or fold communication
Column-net and row-net models are used
20
20 HP Methods for y = Ax: 2D Jagged Partitioning Phase 1: √K-way rowwise partitioning using column-net model Respects row coherence at processor-row level
21
21 HP Methods for y = Ax: 2D Jagged Partitioning
Phase 2: √K independent √K-way columnwise partitionings using row-net model
Symmetric partitioning
Column coherence for columns 2 and 5 is disturbed => P_3 sends messages to both P_1 and P_2
22
22 HP Methods for y = Ax: 2D Jagged Partitioning
Row coherence => fold communications confined to processors in the same row => max message count per processor = √K − 1 in fold phase
No column coherence => max message count per processor = K − √K in expand phase
23
23 HP Methods for y = Ax: 2D Checkerboard Partitioning Phase 1: √K-way rowwise partitioning using column-net model Respects row coherence at processor-row level Same as phase 1 of jagged partitioning
24
24 HP Methods for y = Ax: 2D Checkerboard Partitioning
Phase 2: One √K-way √K-constraint columnwise partitioning using row-net model
Respects column coherence at processor-column level
Symmetric partitioning
25
25 HP Methods for y = Ax: 2D Checkerboard Partitioning
Row coherence => fold communications confined to processors in the same row => max message count per processor = √K − 1 in fold phase
Column coherence => expand communications confined to processors in the same column => max message count per processor = √K − 1 in expand phase
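The per-phase bounds stated for the jagged and checkerboard methods (slides 22 and 25) can be tabulated for a given K. This is only a convenience sketch: the function name and dictionary layout are assumptions, and K is required to be a perfect square so that the √K-by-√K mesh exists.

```python
import math

def max_message_bounds(K):
    """Per-processor upper bounds on message count, per phase, as stated on the slides."""
    q = math.isqrt(K)
    if q * q != K:
        raise ValueError("K must be a perfect square for a sqrt(K)-by-sqrt(K) mesh")
    return {
        "2D jagged": {"expand": K - q, "fold": q - 1},
        "2D checkerboard": {"expand": q - 1, "fold": q - 1},
    }

bounds = max_message_bounds(16)
```

For K = 16 the checkerboard bound is 3 messages in each phase, while the jagged expand phase may involve up to 12 messages per processor.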
26
26 Comparison of Partitioning Schemes: 1D Partitioning
Respects either row or column coherence at processor level
Single communication phase (either expand or fold)
Max message count per processor = K – 1
Fast partitioning time
Determining the partitioning dimension
– Dense columns => rowwise partitioning
– Dense rows => columnwise partitioning
– Partition both rowwise and columnwise and choose the best
27
27 Comparison of Partitioning Schemes: 2D Partitioning
Do not respect row or column coherence at processor level
Two communication phases (both expand and fold)
Better in load balancing
2D Fine-Grain:
– Disturbs both row and column coherence at processor-row and processor-column level
– Significant reductions in total volume with respect to 1D and other 2D partitionings
– Worse in total number of messages
– Slow partitioning time
2D Checkerboard:
– Respects both row and column coherence at processor-row and processor-column level
– Restricts max message count per processor to 2(√K – 1)
– Good for architectures with high message latency
– As fast as 1D partitioning schemes
2D Jagged:
– Respects either row or column coherence at processor-row or processor-column level
– Restricts max message count per processor to K – 2√K – 1
– Better communication volume and load balance with respect to checkerboard
– As fast as 1D partitioning schemes
28
28 Communication Hypergraph Model: 2-Phase Approach for Minimizing Multiple Communication-Cost Metrics
Communication-cost metrics
– Total message volume
– Total message count
– Maximum message volume and count per processor
2-phase approach: 1D unsymmetric partitioning
– Phase 1: Rowwise (colwise) partitioning using col-net (row-net) model: minimize total message volume; maintain computational load balance
– Phase 2: Refine rowwise (colwise) partition using communication hypergraph model: encapsulate the remaining communication-cost metrics; try to attain the total message volume bound obtained in Phase 1
29
29 Communication Hypergraph Model: Phase 2 for 1D rowwise partitioning
Construct communication matrix C_x
– Perform rowwise compression: compress row stripe R_k of A into a single row r_k of C_x
– Sparsity pattern of r_k = union of sparsity patterns of all rows in R_k
– Discard internal columns of R_k
– Nonzeros in r_k: subset of x-vector entries needed by processor P_k
– Nonzeros in column c_j: the set of processors that need x_j
Construct row-net model H of communication matrix C_x
– Vertices (columns of C_x) represent expand communication tasks
– Nets (rows of C_x) represent processors
– Partitioning constraint: balancing send volume loads of processors
– Partitioning objective: minimizing total number of messages
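Rowwise compression can be sketched in a few lines. This is an illustration under assumptions: the function name is made up, C_x is stored as one set of column indices per processor, and an internal column is taken to be one whose nonzeros all lie in a single row stripe.

```python
def communication_matrix_x(nonzeros, row_part, K):
    """Row k of C_x: columns with nonzeros in row stripe R_k that are not internal to one stripe."""
    stripes_of_col = {}
    for i, j in nonzeros:
        stripes_of_col.setdefault(j, set()).add(row_part[i])
    Cx = [set() for _ in range(K)]
    for j, stripes in stripes_of_col.items():
        if len(stripes) > 1:  # internal columns (a single stripe) are discarded
            for k in stripes:
                Cx[k].add(j)
    return Cx

nonzeros = [(0, 0), (0, 3), (1, 1), (2, 0), (2, 2), (3, 3), (4, 3), (4, 4), (5, 5)]
row_part = [0, 0, 1, 1, 2, 2]  # Phase 1 result: rows 0-1 on P0, 2-3 on P1, 4-5 on P2
Cx = communication_matrix_x(nonzeros, row_part, K=3)
```

In this example only columns 0 and 3 survive compression, so the communication hypergraph has two vertices (the expand tasks for x_0 and x_3) and three nets, one per processor.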
30
30 Communication Hypergraph Model
[Figure panels: rowwise partition obtained in Phase 1; communication matrix; communication hypergraph; colwise permutation induced by communication hypergraph partitioning]
31
31 Communication Hypergraph Model
Communication cost metrics
– Total message volume and count
– Maximum message volume and count per processor
Balancing on external nets of parts => minimizing max message count per processor
[Figure: communication hypergraph model showing incoming and outgoing messages of processor P_k]
32
32 Communication Hypergraph Model: Phase 2 for 1D columnwise partitioning
Construct communication matrix C_y
– Perform columnwise compression: compress the kth column stripe of A into a single column c_k of C_y
– Sparsity pattern of c_k = union of sparsity patterns of all columns in the kth column stripe
– Discard internal rows of the kth column stripe
– Nonzeros in c_k: subset of y-vector entries to which processor P_k contributes partial results
– Nonzeros in row r_j: the set of processors that contribute to y_j
Construct column-net model of communication matrix C_y
– Vertices (rows of C_y) represent fold communication tasks
– Nets (columns of C_y) represent processors
– Partitioning constraint: balancing send volume loads of processors
– Partitioning objective: minimizing total number of messages
33
33 Communication Hypergraph Model
[Figure panels: colwise partition obtained in Phase 1; communication matrix; communication hypergraph; colwise permutation induced by communication hypergraph partitioning]
34
34 Communication Hypergraph Model for 2D Fine-Grain Partitioning
Phase 1: Obtain a fine-grain partition on nonzero elements
Phase 2: Obtain 2 communication matrices
– C_x: rowwise compression
– C_y: columnwise compression
Phase 2: Construct two communication hypergraphs
– H_x: row-net model of C_x (nets are rows, i.e. processors, as in the 1D rowwise case)
– H_y: column-net model of C_y (nets are columns, i.e. processors, as in the 1D columnwise case)
Phase 2 for unsymmetric partitioning
– Partition H_x and H_y independently
– Partitioning constraints: balancing expand and fold volume loads of processors
– Partitioning objectives: minimizing total number of messages in the expand and fold phases
35
35 Communication Hypergraph Model for 2D Fine-Grain Partitioning
C_x represents which processor needs which x-vector entry
C_y represents which processor contributes to which y entry
[Figure: fine-grain partitioned matrix obtained in Phase 1; rowwise compression gives communication matrix C_x; columnwise compression gives communication matrix C_y]
36
36 Phase 2 for unsymmetric partitioning
Partition communication hypergraphs H_x and H_y
[Figure: partitioned H_x and H_y, each labeled with Cut = 5]
37
37 Number of messages regarding x
P_1 → P_2: x_1 and x_2
P_4 → P_1: x_5
P_4 → P_3: x_5 and x_7
P_3 → P_2: x_8
P_2 → P_4: x_10
38
38 Number of messages regarding y
[Figure: fold-phase messages between the processor pairs P_1 and P_4, P_3 and P_4, P_2 and P_3, P_4 and P_2, P_3 and P_1, carrying partial results for y_4, y_5, y_7, and y_9]
39
39 Communication Hypergraph Model for 2D Fine-Grain Partitioning
Phase 2 for symmetric partitioning
– Combine the two hypergraphs H_x and H_y into a new one, H, by merging corresponding vertices
– Both nets n_k and m_k represent processor k
  Cut net n_k: processor k sending message(s) in expand phase
  Cut net m_k: processor k receiving message(s) in fold phase
– Fix both nets n_k and m_k to part V_k; H contains K fixed vertices
– Partition augmented hypergraph H
  Partitioning objective: minimizing total number of messages
  2 partitioning constraints: balancing expand and fold volume loads of processors
40
40 Phase 2 for symmetric partitioning
Partition augmented hypergraph H
[Figure: a portion of augmented hypergraph H]
Vertex v_4 does not exist in H_x whereas it exists in H_y
– All nonzeros of column 4 were assigned to processor 2 in fine-grain partitioning
Vertex v_7 exists in both H_x and H_y
41
41 SpMxV Context: Iterative Methods
Used for solving linear systems Ax = b
– Usually A is sparse
Involves
– Linear vector operations: x = x + αy, i.e. x_i = x_i + α y_i
– Inner products: α = <x, y> = sum of x_i y_i
– Sparse matrix-vector multiplies (SpMxV): y = Ax, i.e. y_i = <A_i*, x>; y = A^T x, i.e. y_i = <(A^T)_i*, x>
– Assuming 1D rowwise partitioning of A
while not converged do
  computations
  check convergence
42
42 Parallelizing Iterative Methods
Avoid communicating vector entries for linear vector operations and inner products
Nothing to do for inner products
– Regular communication
– Low communication volume: partial sum values are communicated
Efficiently parallelize the SpMxV operations
– In the light of the previous discussion, are we done?
– Preconditioning?
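The point about inner products can be illustrated with a small sketch: when x and y are partitioned conformably, each processor forms its partial sum from local entries only, and only the K partial sums are communicated. The function name and the data are illustrative assumptions, and the final `sum` stands in for a reduction across processors.

```python
def parallel_inner_product(x, y, part, K):
    """part[i] is the processor that owns both x_i and y_i (conformable partitioning)."""
    partial = [0.0] * K
    for i, (xi, yi) in enumerate(zip(x, y)):
        partial[part[i]] += xi * yi  # purely local work, no vector entries are exchanged
    return sum(partial), partial     # the sum stands in for a reduction over K values

total, partial = parallel_inner_product([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0, 0, 1, 1], K=2)
```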
43
43 Preconditioning
Iterative methods may converge slowly, or diverge
Transform Ax = b into another system that is easier to solve
A preconditioner is a matrix that helps in obtaining the desired transformation
44
44 Preconditioning
We consider parallelization of iterative methods that use approximate inverse preconditioners
An approximate inverse is a matrix M such that AM ≈ I
Instead of solving Ax = b, use right preconditioning: solve AMy = b and then set x = My
Preconditioning cliché: “Preconditioning is art rather than science”
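Applying the right-preconditioned operator amounts to two successive SpMxVs. The sketch below assumes a row-wise dictionary storage and uses the inverse of A's diagonal as a deliberately crude stand-in for an approximate inverse; both choices are illustrative and not taken from the slides.

```python
def spmv(rows, x):
    """rows: list of {column: value} dictionaries, one per matrix row."""
    return [sum(a * x[j] for j, a in row.items()) for row in rows]

def apply_AM(A, M, y):
    """Compute (AM)y as A(My): two successive SpMxVs, without ever forming AM."""
    return spmv(A, spmv(M, y))

A = [{0: 2.0, 1: 1.0}, {1: 4.0}]
M = [{0: 0.5}, {1: 0.25}]  # assumed approximate inverse: inverse of A's diagonal
y = [1.0, 2.0]
z = apply_AM(A, M, y)      # what an iterative solver would request at each step
x = spmv(M, y)             # recover x = My once y has been found
```

Because each application touches A and M separately, both matrices need efficient parallel SpMxV, which is what motivates partitioning them together.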
45
45 Preconditioned Iterative Methods
Additional SpMxV operations with M
– Never form matrix AM; perform successive SpMxVs
Parallelizing a full step in these methods requires efficient SpMxV operations with A and M
– Partition A and M
What has been done
– A bipartite graph model with limited usage
A blend of dependencies and interactions among matrices and vectors
– Partition A and M simultaneously
46
46 Preconditioned Iterative Methods Partition A and M simultaneously Figure out partitioning requirements through analyzing linear vector operations and inner products –Reminder: never communicate vector entries for these operations Different methods have different partitioning requirements
47
47 Preconditioned BiCG-STAB p, r, v should be partitioned conformably s should be with r and v t should be with s x should be with p and s
48
48 Preconditioned BiCG-STAB
p, r, v, s, t, and x should be partitioned conformably
What remains?
– Columns of M and rows of A should be conformal
– Rows of M and columns of A should be conformal
Resulting requirement: PAQ^T QMP^T
49
49 Partitioning Requirements
BiCG-STAB: PAQ^T QMP^T, i.e. PAMP^T
TFQMR: PAP^T and PM_1M_2P^T
GMRES: PAP^T and PMP^T
CGNE: PAQ and PMP^T
“and” means there is a synchronization point between successive SpMxVs
– Load balance each SpMxV individually
50
50 Model for simultaneous partitioning We use the previously proposed models –define operators to build composite models
51
51 Combining Hypergraph Models
Net amalgamation (not allowed):
– Never combine nets of individual hypergraphs
Vertex amalgamation:
– Combine vertices of individual hypergraphs, and connect the composite vertex to the nets of the individual vertices
Vertex weighting:
– Define multiple weights; individual vertex weights are not added up
Vertex insertion:
– Create a new vertex, d_i, to be connected to nets n_i of individual hypergraphs
Pin addition:
– Connect a specific vertex to a net
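Vertex amalgamation can be sketched on two small hypergraphs stored as net-to-pins dictionaries. The function name, the tagging of nets with "A" and "B", and the example data are all assumptions for illustration; the other operators are not shown.

```python
def amalgamate(nets_a, nets_b, pairs):
    """Merge vertex u of hypergraph A with vertex v of hypergraph B for each (u, v) in pairs.

    nets_a and nets_b map a net name to its set of vertices. Nets stay separate;
    only vertices are combined, and a composite vertex keeps the nets of both originals.
    """
    merged = {}
    for u, v in pairs:
        merged[("A", u)] = (u, v)
        merged[("B", v)] = (u, v)
    composite = {}
    for tag, nets in (("A", nets_a), ("B", nets_b)):
        for net, pins in nets.items():
            composite[(tag, net)] = {merged.get((tag, p), (tag, p)) for p in pins}
    return composite

nets_a = {"n1": {0, 1}, "n2": {1}}
nets_b = {"m1": {0}, "m2": {0, 1}}
H = amalgamate(nets_a, nets_b, pairs=[(0, 0), (1, 1)])
```

After the call, composite vertex (0, 0) is a pin of n1, m1, and m2, so any partition of the composite hypergraph places the two original vertices in the same part.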
52
52 Combining Guideline
1. Determine partitioning requirements
2. Decide on partitioning dimension for each matrix
– Generate column-net model for the matrices to be partitioned rowwise
– Generate row-net model for the matrices to be partitioned columnwise
53
53 Combining Guideline
3. Apply vertex operations
 i. To impose an identical partition on two vertices, amalgamate them
 ii. If the applications of the matrices are interleaved with synchronization, apply vertex weighting
4. Apply net operations
 i. If a net ought to be permuted with a specific vertex, establish a policy for that net; if the net is not connected to that vertex, apply pin addition
 ii. If two nets ought to be permuted together independent of the existing vertices, apply vertex insertion
54
54 Combining Example
BiCG-STAB requires PAMP^T
– Reminder: rows of A and columns of M; columns of A and rows of M
A rowwise and M columnwise
[Figure panels illustrate guideline steps 1, 2, and 3i]
55
55 Combining Example
– PAMP^T: columns of A and rows of M should be conformable
– d_i vertices are different
[Figure panels illustrate guideline steps 3ii and 4ii]
56
56 Remarks on composite models
Operations on hypergraphs are defined to build composite hypergraphs
Partitioning the composite hypergraphs
– Balances computational loads of processors
– Minimizes the total communication volume in a full step of the preconditioned iterative methods
Can meet a broad spectrum of partitioning requirements: multi-phase, multi-physics applications
57
57 Parallelization of a sample application (other than SpMxV)
A Remapping Model for Image-Space Parallel Volume Rendering
Volume rendering
– Mapping a set of scalar or vector values defined in a 3D dataset to a 2D image on the screen
– Surface-based rendering
– Direct volume rendering (DVR)
58
58 Ray-Casting-Based DVR
59
59 Parallel Rendering: Object-space (OS) and Image-space (IS) Parallelization
60
60 Screen Partitioning
Static
– Simplicity
– Good load balancing
– High primitive replication
Dynamic
– Good load balancing
– Overhead of broadcasting pixel assignments
– High primitive replication
Adaptive
– Good load balancing
– Respects image-space coherency
– View-dependent preprocessing overhead
61
61 Adaptive Parallel DVR Pipeline
62
62 Screen Partitioning
63
63 View Independent Cell Clustering
[Figure panels: tetrahedral dataset; graph representation]
64
64 View Independent Cell Clustering
[Figure panels: resulting 6 cell clusters; 6-way partitioned graph]
65
65 Pixel Clustering
66
66 Pros and Cons of Cell and Pixel Clustering
Pros
– Reduction in the 3D-2D interaction
– Less preprocessing overhead
Cons
– Estimation errors in workload calculations
– Increase in data replication
67
67 Screen Partitioning Model
[Figure panels: a sample visualization instance with 15 cell clusters and eight pixel blocks; interaction hypergraph]
68
68 Screen Partitioning Model 3-way partition of Interaction hypergraph Cell cluster (net) Pixel block (vertex)
69
69 Two-Phase Remapping Model Cell cluster (net) Pixel block (vertex)
70
70 Two-Phase Remapping Model Weighted bipartite graph matching
71
71 One-Phase Remapping Model
Hypergraph partitioning with fixed vertices
[Figure: 3-way partitioning of a remapping hypergraph; legend: cell cluster (net), pixel block (free vertex), processor (fixed vertex)]
72
72 Adaptive IS-Parallel DVR Algorithm
73
73 Example Rendering
74
74 Example Screen Partitionings
[Figure panels: jagged partitioning; hypergraph partitioning]
75
75 Rendering Load Imbalance
76
76 Total Volume of Communication
77
77 Speedups
78
78 Related works of our group
Sparse matrix partitioning for parallel processing
U.V. Çatalyürek, C. Aykanat, and B. Uçar, “On Two-Dimensional Sparse Matrix Partitioning: Models, Methods and a Recipe,” submitted to SIAM Journal on Scientific Computing.
B. Uçar and C. Aykanat, “Revisiting Hypergraph Models for Sparse Matrix Partitioning,” SIAM Review, vol. 49(4), pp. 595–603, 2007.
B. Uçar and C. Aykanat, “Partitioning Sparse Matrices for Parallel Preconditioned Iterative Methods,” SIAM Journal on Scientific Computing, vol. 29(4), pp. 1683–1709, 2007.
B. Uçar and C. Aykanat, “Encapsulating Multiple Communication-Cost Metrics in Partitioning Sparse Rectangular Matrices for Parallel Matrix-Vector Multiplies,” SIAM Journal on Scientific Computing, vol. 25(6), pp. 1837–1859, 2004.
C. Aykanat, A. Pinar, and U.V. Çatalyürek, “Permuting Sparse Rectangular Matrices into Block Diagonal Form,” SIAM Journal on Scientific Computing, vol. 25(6), pp. 1860–1879, 2004.
B. Uçar and C. Aykanat, “Minimizing Communication Cost in Fine-Grain Partitioning of Sparse Matrices,” Lecture Notes in Computer Science, vol. 2869, pp. 926–933, 2003.
U.V. Çatalyürek and C. Aykanat, “Hypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication,” IEEE Transactions on Parallel and Distributed Systems, vol. 10, pp. 673–693, 1999.
U.V. Çatalyürek and C. Aykanat, “Decomposing Irregularly Sparse Matrices for Parallel Matrix-Vector Multiplication,” Lecture Notes in Computer Science, vol. 1117, pp. 75–86, 1996.
A. Pinar, C. Aykanat, and M. Pinar, “Decomposing Linear Programs for Parallel Solution,” Lecture Notes in Computer Science, vol. 1041, pp. 473–482, 1996.
79
79 Related works of our group
Graph and Hypergraph Models and Methods for other Parallel & Distributed Applications
B.B. Cambazoğlu and C. Aykanat, “Hypergraph-Partitioning-Based Remapping Models for Image-Space-Parallel Direct Volume Rendering of Unstructured Grids,” IEEE Transactions on Parallel and Distributed Systems, vol. 18(1), pp. 3–16, 2007.
K. Kaya, B. Uçar, and C. Aykanat, “Heuristics for Scheduling File-Sharing Tasks on Heterogeneous Systems with Distributed Repositories,” Journal of Parallel and Distributed Computing, vol. 67, pp. 271–285, 2007.
B. Uçar, C. Aykanat, M. Pınar, and T. Malas, “Parallel Image Restoration Using Surrogate Constraint Methods,” Journal of Parallel and Distributed Computing, vol. 67, pp. 186–204, 2007.
C. Aykanat, B.B. Cambazoğlu, F. Findik, and T.M. Kurc, “Adaptive Decomposition and Remapping Algorithms for Object-Space-Parallel Direct Volume Rendering of Unstructured Grids,” Journal of Parallel and Distributed Computing, vol. 67, pp. 77–99, 2006.
K. Kaya and C. Aykanat, “Iterative-Improvement-Based Heuristics for Adaptive Scheduling of Tasks Sharing Files on Heterogeneous Master-Slave Environments,” IEEE Transactions on Parallel and Distributed Systems, vol. 17(8), pp. 883–896, August 2006.
B. Uçar, C. Aykanat, K. Kaya, and M. İkinci, “Task Assignment in Heterogeneous Systems,” Journal of Parallel and Distributed Computing, vol. 66(1), pp. 32–46, 2006.
M. Koyuturk and C. Aykanat, “Iterative-Improvement Based Declustering Heuristics for Multi-Disk Databases,” Information Systems, vol. 30, pp. 47–70, 2005.
B.B. Cambazoğlu, A. Turk, and C. Aykanat, “Data-Parallel Web-Crawling Models,” Lecture Notes in Computer Science, vol. 3280, pp. 801–809, 2004.
80
80 Related works of our group
Hypergraph Models and Methods for Sequential Applications
E. Demir and C. Aykanat, “A Link-Based Storage Scheme for Efficient Aggregate Query Processing on Clustered Road Networks,” Information Systems, accepted for publication.
E. Demir, C. Aykanat, and B.B. Cambazoğlu, “Clustering Spatial Networks for Aggregate Query Processing, a Hypergraph Approach,” Information Systems, vol. 33(1), pp. 1–17, 2008.
M. Özdal and C. Aykanat, “Hypergraph Models and Algorithms for Data-Pattern Based Clustering,” Data Mining and Knowledge Discovery, vol. 9, pp. 29–57, 2004.
Software Package / Tool Development
C. Aykanat, B.B. Cambazoğlu, and B. Uçar, “Multi-level Direct K-way Hypergraph Partitioning with Multiple Constraints and Fixed Vertices,” Journal of Parallel and Distributed Computing, vol. 68, pp. 609–625, 2008.
B. Uçar and C. Aykanat, “A Library for Parallel Sparse Matrix Vector Multiplies,” Tech. Rept. BU-CE-0506, Dept. of Computer Eng., Bilkent Univ., 2005.
Ü.V. Çatalyürek and C. Aykanat, “PaToH: A Multilevel Hypergraph-Partitioning Tool, ver. 3.0,” Tech. Rept. BU-CE-9915, Dept. of Computer Eng., Bilkent Univ., 1999.
81
81 Conclusion
Computational hypergraph models & methods for 1D & 2D sparse matrix partitioning for parallel SpMxV
– All models encapsulate exact total message volume
– 1D models: good trade-off between communication overhead & partitioning time
– 2D Fine-Grain: lowest total message volume
– 2D Jagged & Checkerboard: restrict message latency
Communication hypergraph models
– Minimize message latency (number of messages)
– Balance communication loads of processors
Composite hypergraph models
– Parallelization of preconditioned iterative methods
– Partition two or more matrices together for efficient SpMxVs
Extension and adaptation of these models & methods for the parallelization of other irregular applications
– e.g. image-space parallelization of direct volume rendering