Parallel Hypergraph Partitioning for Scientific Computing

Parallel Hypergraph Partitioning for Scientific Computing

Erik Boman, Karen Devine, Robert Heaphy, Bruce Hendrickson (Sandia National Laboratories, Albuquerque)
Umit Çatalyürek (Ohio State University, Columbus)
Rob Bisseling (Utrecht University, The Netherlands)

Graph & Hypergraph Partitioning

Applications in scientific computing:
- Load balancing (minimize communication)
- Sparse matrix-vector product
- Decomposition for linear programming
- Matrix orderings
- Parallel preconditioners

Graph partitioning is commonly used, but has deficiencies:
- It doesn't accurately represent communication volume.
- It is not suitable for non-symmetric and rectangular matrices.

Hypergraph Partitioning

More powerful than the graph model:
- A hyperedge connects a set of vertices.
- Represents communication volume accurately (Aykanat & Çatalyürek, '95-'99).

Problem definition: given a hypergraph H = (V, E'), where E' is a set of hyperedges, and an integer k:
- Partition V into k disjoint subsets
- such that each subset has (approximately) the same size, and
- the number of hyperedges cut between subsets (scaled by the number of parts each edge spans) is minimized.

The problem is NP-hard, but fast multilevel heuristics work well.
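The cut objective above can be illustrated with a small sketch (plain Python with a hypothetical list-of-edges representation, not Zoltan's actual data structures): a hyperedge spanning lambda parts contributes lambda - 1 to the cut.

```python
def connectivity_cut(hyperedges, part):
    """(lambda - 1) metric: each hyperedge spanning lambda parts
    contributes (lambda - 1) to the cut."""
    total = 0
    for edge in hyperedges:
        parts_spanned = {part[v] for v in edge}
        total += len(parts_spanned) - 1
    return total

# Six vertices, three hyperedges, a 2-way partition:
edges = [[0, 1, 2], [2, 3], [3, 4, 5]]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(connectivity_cut(edges, part))  # only [2, 3] is cut -> 1
```

An ordinary graph edge cut happens to give the same count here; the two metrics diverge once a hyperedge spans three or more parts.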

Graph Partitioning vs. Hypergraph Partitioning

Graph partitioning: Kernighan, Lin, Schweikert, Fiduccia, Mattheyses, Pothen, Simon, Hendrickson, Leland, Kumar, Karypis, et al.
Hypergraph partitioning: Kernighan, Alpert, Kahng, Hauck, Borriello, Aykanat, Çatalyürek, Karypis, et al.

In both models, vertices represent computation. An edge connects exactly two vertices; a hyperedge connects two or more vertices. Edge cuts only approximate communication volume, whereas hyperedge cuts measure it accurately. Both models assign equal vertex weight to each part while minimizing the cut weight (edge cut vs. hyperedge cut).

Hypergraph Partitioning

Several serial hypergraph partitioners are available:
- hMETIS (Karypis)
- PaToH (Çatalyürek)
- Mondriaan (Bisseling)

Parallel partitioners are needed for large and dynamic problems:
- Zoltan-PHG (Sandia)
- Parkway (Trifunovic)

Prediction: the hypergraph model and partitioning tools will eventually replace graph partitioning in scientific computing, except when partitioning time is important and quality matters less.

Matrix Representation

View the hypergraph as a sparse matrix (Aykanat & Çatalyürek). We use the row-net model:
- Vertices == columns
- Hyperedges == rows

Example: 1D partitioning of a sparse matrix along its columns for the product y = A*x.

[Figure: 6x6 sparse matrix, columns 1-6 partitioned, with vectors x and y.]
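The row-net model can be sketched as follows (a minimal illustration with a dense 0/1 pattern standing in for a sparse matrix; `row_net_hypergraph` is a hypothetical helper, not part of Zoltan):

```python
def row_net_hypergraph(matrix):
    """Row-net model: each column is a vertex, and each row becomes
    one hyperedge containing the columns where that row is nonzero."""
    hyperedges = []
    for row in matrix:
        edge = [j for j, a in enumerate(row) if a != 0]
        if edge:
            hyperedges.append(edge)
    return hyperedges

# 3x4 nonzero pattern:
A = [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 1]]
print(row_net_hypergraph(A))  # [[0, 2], [1], [0, 1, 3]]
```

Partitioning the vertices of this hypergraph is exactly a 1D column partitioning of A.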

Sparse Matrix-Vector Product

- Important in scientific computing (iterative methods).
- Communication volume associated with hyperedge e: C_e = (# of processors sharing edge e) - 1.
- Total communication volume: the sum of C_e over all hyperedges e.

[Figure: sparse matrix-vector product y = A*x.]
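Under a 1D column distribution, C_e for each row e can be computed directly from the parts owning its nonzero columns. A sketch (hypothetical helper names, dense 0/1 pattern for illustration):

```python
def row_comm_volumes(matrix, col_part):
    """C_e = (number of parts holding a nonzero of row e) - 1,
    for a 1D column distribution given by col_part."""
    vols = []
    for row in matrix:
        parts = {col_part[j] for j, a in enumerate(row) if a != 0}
        vols.append(max(len(parts) - 1, 0))
    return vols

A = [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 0, 1]]
col_part = [0, 1, 0, 1]  # columns 0, 2 on part 0; columns 1, 3 on part 1
vols = row_comm_volumes(A, col_part)
print(vols, sum(vols))   # [0, 0, 1] 1
```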

Sparse Matrix Partitioning

- 1D (rows or columns): hypergraph partitioning (Aykanat & Çatalyürek).
- 2D Cartesian: multiconstraint hypergraph partitioning (Çatalyürek & Aykanat).
- 2D recursive: Mondriaan (Bisseling & Vastenhouw); non-Cartesian, lower communication volume.
- Fine-grain model: a hypergraph in which each nonzero is a vertex (Çatalyürek); ultimate flexibility.

(Figures courtesy of Rob Bisseling.)

Zoltan Toolkit: Suite of Partitioning Algorithms

- Recursive coordinate bisection
- Recursive inertial bisection
- Space-filling curves
- Refinement-tree partitioning
- Octree partitioning
- Graph partitioning (ParMETIS, Jostle)
- Hypergraph partitioning (NEW!)

Zoltan Hypergraph Partitioner

- A parallel hypergraph partitioner for large-scale problems; distributed memory (MPI).
- New package in the Zoltan toolkit.
- Available Fall 2005.
- Open source (LGPL).

Data Layout

- 2D data layout within the hypergraph partitioner; it does not affect the layout returned to the application.
- Processors are logically (not physically) organized as a 2D grid.
- Vertex/hyperedge communication is limited to the processors along a grid row/column.
- Maintains scalable memory usage: no "ghosting" of off-processor neighbor info.
- Differs from parallel graph partitioners and Parkway, which use 1D layouts.
- The design allows comparison of 1D and 2D distributions.

Recursive Bisection

- Recursive bisection approach: partition the data into two sets, then recursively subdivide each set into two sets.
- We allow arbitrary k (k need not equal 2^n).
- Parallelism: split both the data and the processors into two sets; the subproblems are then solved independently in parallel.
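The arbitrary-k splitting can be sketched as follows. A real partitioner chooses each bisection to minimize the hyperedge cut; this toy just splits an ordered vertex list into sizes proportional to ceil(k/2) and floor(k/2) (all names hypothetical):

```python
def recursive_bisect(vertices, k, part=None, offset=0):
    """Recursive bisection for arbitrary k: split into two subproblems
    sized proportionally to k_left = ceil(k/2) and k_right = floor(k/2),
    then recurse on each half."""
    if part is None:
        part = {}
    if k == 1:
        for v in vertices:
            part[v] = offset
        return part
    k_left = (k + 1) // 2
    k_right = k - k_left
    split = (len(vertices) * k_left) // k   # weight-proportional split point
    recursive_bisect(vertices[:split], k_left, part, offset)
    recursive_bisect(vertices[split:], k_right, part, offset + k_left)
    return part

print(recursive_bisect(list(range(9)), 3))  # three parts of three vertices each
```

In the parallel version, the processor set is split in the same proportions as k, so both halves recurse concurrently.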

Multilevel Partitioning

Multilevel hypergraph partitioning (Çatalyürek, Karypis) is analogous to multilevel graph partitioning (Bui & Jones; Hendrickson & Leland; Karypis & Kumar):
- Contraction: reduce the hypergraph to a smaller representative hypergraph.
- Coarse partitioning: assign coarse vertices to partitions.
- Refinement: improve balance and cuts at each level.

[Figure: the multilevel V-cycle, from the initial hypergraph through contraction to the coarse hypergraph, then coarse partitioning and refinement back up to the final partition.]
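The V-cycle structure (contract, partition the coarsest level, project back up) can be sketched as a toy recursion. Real contraction uses vertex matching and each uncoarsening level runs FM refinement; both are simplified away here, and all names are hypothetical:

```python
def multilevel_bipartition(n, hyperedges, min_size=4):
    """Toy multilevel V-cycle on a hypergraph with n vertices.
    Contracts by merging consecutive vertex pairs (v -> v // 2),
    bipartitions the coarsest level by halving, then projects the
    coarse partition back to the fine vertices."""
    if n <= min_size:
        # Coarse partitioning: trivial balanced split.
        return [0 if v < n // 2 else 1 for v in range(n)]
    # Contraction: fine vertex v maps to coarse vertex v // 2.
    coarse_n = (n + 1) // 2
    coarse_edges = [sorted({v // 2 for v in e}) for e in hyperedges]
    coarse_edges = [e for e in coarse_edges if len(e) > 1]
    coarse_part = multilevel_bipartition(coarse_n, coarse_edges, min_size)
    # Projection: each fine vertex inherits its coarse vertex's part.
    return [coarse_part[v // 2] for v in range(n)]

print(multilevel_bipartition(16, [[0, 1, 2], [7, 8], [14, 15]]))
```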

Contraction

- Merge pairs of "similar" vertices: matching. Currently there is no agglomeration of more than two vertices.
- Greedy maximal-weight matching heuristics; the matching is computed on a related graph (edge weights = similarities). A maximum-weight solution is not necessary.
- We use heavy-connectivity matching (Aykanat & Çatalyürek), also called inner-product matching (Bisseling) and first-choice (Karypis): match the columns with the greatest inner product, i.e., the vertices with the most shared hyperedges.
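A serial sketch of greedy inner-product matching (hypothetical helper; each vertex is represented by the set of hyperedges containing it, so the inner product of two 0/1 columns is the size of the set intersection):

```python
from itertools import combinations

def inner_product_matching(columns):
    """Score each vertex pair by the number of shared hyperedges
    (the inner product of their 0/1 columns), then greedily match
    the heaviest pairs first."""
    scores = []
    for i, j in combinations(range(len(columns)), 2):
        ip = len(columns[i] & columns[j])
        if ip > 0:
            scores.append((ip, i, j))
    scores.sort(reverse=True)           # heaviest pairs first
    matched, pairs = set(), []
    for ip, i, j in scores:
        if i not in matched and j not in matched:
            matched.update((i, j))
            pairs.append((i, j))
    return pairs

# Vertex v -> set of hyperedges containing v:
cols = [{0, 1, 2}, {0, 1}, {2, 3}, {3}]
print(inner_product_matching(cols))  # [(0, 1), (2, 3)]
```

Greedy matching gives at least half the maximum weight, which is good enough here since the matching only guides contraction.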

Parallel Matching in the 2D Data Layout

On each processor:
1. Broadcast a subset of vertices ("candidates") along the processor row.
2. Compute (partial) inner products of the received candidates with local vertices.
3. Accrue inner products in the processor column.
4. Identify the best local matches for the received candidates.
5. Send the best matches to the candidates' owners.
6. Select the best global match for each owned candidate.
7. Send "match accepted" messages to the processors owning the matched vertices.

Repeat until all unmatched vertices have been sent as candidates.

Coarse Partitioning

- Gather the coarsest hypergraph to each processor: edges to each processor in the column, vertices to each processor in the row.
- Compute several different coarse partitions on each processor and select the best local partition.
- Compute the best partition over all processors and broadcast it to all.

Refinement

For each level in the V-cycle:
- Project the coarse partition to the finer hypergraph.
- Use local optimization (KL/FM) to improve balance and reduce cuts.
- Compute a "root" processor in each processor column: the processor with the most nonzeros. The root computes the moves for the vertices in its column; all column processors provide cut information and receive move information.
- The parallel KL/FM is approximate: an exact parallel version needs too much synchronization.
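The core of an FM-style pass is the per-vertex move gain. A serial sketch for a bipartition under the hyperedge-cut metric (hypothetical names; the parallel version computes these gains at the column root from contributed cut information):

```python
def fm_gains(hyperedges, part):
    """FM move gains for a bipartition: moving v gains 1 for each cut
    edge where v is its side's only vertex (the move uncuts it), and
    loses 1 for each edge entirely on v's side (the move cuts it)."""
    gains = {v: 0 for v in part}
    for edge in hyperedges:
        sides = [part[v] for v in edge]
        for v in edge:
            same = sides.count(part[v])
            if same == 1:                # v alone on its side of this edge
                gains[v] += 1
            elif same == len(edge):      # edge currently uncut
                gains[v] -= 1
    return gains

edges = [[0, 1, 2], [2, 3]]
part = {0: 0, 1: 0, 2: 0, 3: 1}
print(fm_gains(edges, part))  # {0: -1, 1: -1, 2: 0, 3: 1}
```

Moving vertex 3 to part 0 has gain 1: it uncuts the edge [2, 3]. A full FM pass would repeatedly move the highest-gain vertex that keeps the balance constraint, locking each moved vertex.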

Results

Cage14, a cage model of DNA electrophoresis (van Heukelum):
- 1.5M rows and columns; 27M nonzeros; symmetric structure.
- 64 partitions.
- Hypergraph partitioning reduced communication volume by 10-20% vs. graph partitioning.
- Zoltan was much faster than Parkway.

More Results

Sensor placement (IP/LP model):
- 5M rows, 4M columns; 16M nonzeros.
- Parkway ran out of memory: its 1D layout with ghosting is not scalable.

Future Work

- More evaluation of the current design: 1D vs. 2D data layouts.
- Increase speed while maintaining quality: heuristics for more local, less expensive matching; better load balance within our code; k-way refinement?
- Incremental partitioning for dynamic applications: minimize data migration.
- Multiconstraint partitioning.
- Interface for 2D partitioning (sparse matrices).
- Watch for the release in Zoltan later this year!