Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation. CRTI-02-0093RD Project Review Meeting, Canadian Meteorological Centre, August 22-23, 2006.


Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation. CRTI-02-0093RD Project Review Meeting, Canadian Meteorological Centre, August 22-23, 2006

Outline
- Introduction
- Numerical methods
  - Parallel load-balancing with space-filling curves (SFC)
  - Data distribution
  - Adaptive mesh refinement and derefinement
  - Construction of the ghost boundary cells for each processor
  - Discretization of the Poisson equation
  - Parallel multigrid preconditioner with the conjugate gradient method
- Numerical results
- Conclusions

Introduction
Structured adaptive mesh refinement (AMR): block-structured AMR. Each node represents a block of cells. Advantage: the cells in each block can be organized as two- or three-dimensional arrays, so a structured-grid solver can be used for AMR without many modifications. Disadvantage: it is inflexible, and a substantial number of cells can be wasted in regions where the flow is smooth.

Introduction
Cell-based AMR. Each node represents a single cell, so the mesh is refined only locally, in contrast to block-structured AMR. It is more flexible and computationally more efficient than block-structured AMR, so cell-based AMR is chosen in the present work.

Introduction
An ordinary tree data structure. The cells can be organized as a quad-tree in 2D or an oct-tree in 3D. An ordinary oct-tree needs 17 words of memory per cell if the connectivity information is stored explicitly. If it is not stored explicitly, the tree may have to be traversed all the way up to its root to find a required neighboring cell, which is difficult to parallelize because a search may extend from one processor to another.

Introduction
Fully Threaded Tree (FTT) structure. All cells are grouped together into Octs, which significantly reduces the memory overhead: maintaining an octal FTT requires about three words of memory per cell, instead of the 17 words of the ordinary oct-tree. (An oct-tree structure in an FTT.)
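To make the 17-versus-3 word comparison concrete, here is an illustrative Python sketch of the two layouts. This is not the authors' code; the per-field word counts are the approximate ones quoted above.

```python
# Illustrative memory-layout sketch: ordinary oct-tree vs. FTT Oct.

class OctTreeCell:
    """Ordinary oct-tree: ~17 words per cell with explicit connectivity."""
    __slots__ = ("parent",      # 1 word: pointer to parent cell
                 "children",    # 8 words: one pointer per child
                 "neighbors",   # 6 words: one pointer per face neighbor
                 "level",       # 1 word: refinement level
                 "data")        # 1 word: payload  -> 17 words total


class FTTOct:
    """FTT: one Oct is shared by its 8 sibling cells.

    Per Oct: parent (1) + level (1) + 6 neighbor pointers, plus one
    child-Oct pointer per cell (8), i.e. 16 words for 8 cells: about
    2 words of connectivity per cell, or roughly 3 words per cell
    once per-cell data is included -- the figure quoted above.
    """
    __slots__ = ("parent", "level", "neighbors", "children")
```

Because the six neighbor pointers are stored once per Oct rather than once per cell, the connectivity cost is amortized over all eight siblings.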

Introduction
Fully Threaded Tree (FTT) structure. The west and south neighbors of cell 6 can be found directly through its explicitly stored parent Oct; the east and north neighbors can be found through the neighboring cells of its parent Oct. No more than one level of the tree needs to be traversed to access the neighbors of a cell. (An example of accessing the neighbors of a cell without searching, using the FTT structure.)

Introduction
Objective: propose a new parallel approach to an AMR code based on the FTT data structure.

Numerical methods
Parallel load-balancing with space-filling curves (SFC). An SFC is chosen as the grid partitioner because of its mapping, compactness, and locality properties: points in a higher-dimensional space are mapped to points on a line (mapping); only the coordinates of a point in the higher-dimensional domain are needed to compute its location on the 1D line (compactness); and, in the Hilbert ordering, adjacent neighbors on the 1D line are face-neighboring cells in the higher-dimensional domain (locality). (Space-filling curves in two dimensions: (a) Hilbert or U ordering, (b) Morton or N ordering.)
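Both orderings in the figure can be computed directly from a cell's integer coordinates, which is the "compactness" property above. Below is a self-contained sketch of the two index computations: the standard rotate-and-reflect Hilbert algorithm and Morton bit interleaving. The function names are ours, not the paper's.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid
    (n a power of two).  Standard rotate-and-reflect algorithm: at each
    scale s, pick the quadrant, then rotate/flip coordinates so the
    sub-curve has the canonical orientation."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)     # quadrant's offset along the curve
        if ry == 0:                      # rotate/reflect into canonical form
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def morton_index(x, y):
    """Morton (N-ordering) position: interleave the bits of x and y."""
    d = 0
    for i in range(16):                  # supports coordinates up to 2**16
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d
```

The locality property can be checked directly: consecutive Hilbert indices always belong to face-neighboring cells, whereas consecutive Morton indices occasionally jump across the domain.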

Numerical methods
Parallel load-balancing with space-filling curves (SFC). The different colors correspond to partitions on different processors; only leaf cells are shown. (Two-dimensional adaptive grids partitioned on four processors with the Hilbert SFC.)

Numerical methods
Data distribution. A unique global ID identifies each cell, instead of a local ID on each processor. The processor ID is not stored for each cell; it can be computed from the cell's spatial coordinates using the SFC. A hash table stores the cell and Oct structures on each processor. If a cell is marked for migration to another processor by the Hilbert SFC, both the cell data and the corresponding Oct structures have to be migrated. (The global ID is used to identify each cell.)
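A minimal sketch of this scheme, with all names hypothetical: Morton ordering stands in for the Hilbert curve to keep the code short, and the SFC is cut into equal chunks, whereas the actual load balancer weights the cuts by the number of cells each processor receives.

```python
def morton_index(x, y):
    """Interleave the bits of (x, y): the cell's position on the SFC."""
    d = 0
    for i in range(16):
        d |= ((x >> i) & 1) << (2 * i)
        d |= ((y >> i) & 1) << (2 * i + 1)
    return d


def owner(x, y, n, n_procs):
    """Processor ID for the cell at (x, y) on an n x n grid, computed on
    the fly: the 1D SFC ordering is cut into contiguous chunks, one per
    processor.  (Uniform cuts for brevity; the real balancer weights
    them by cell counts.)"""
    return morton_index(x, y) * n_procs // (n * n)


# Per-processor hash table keyed by the unique global cell ID, so no
# processor-dependent local numbering is needed:
local_cells = {}
gid = (4, 10, 7)                  # e.g. (level, x, y) as a unique key
local_cells[gid] = {"phi": 0.0}   # cell payload
```

Because `owner` is a pure function of coordinates, any processor can locate any cell's owner without a lookup table, which is what makes migration under repartitioning cheap to coordinate.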

Numerical methods
Adaptive mesh refinement and derefinement. Constraint: neighboring cells may not differ by more than one refinement level. When cell A is marked for refinement:
– Check the neighboring cells of the parent of cell A (i.e., cells B and D); if these neighbors are leaves, mark them for refinement as well.
– If cells B and C belong to two different processors, send the global ID of the neighbor of cell B's parent to the processor where cell C resides.
(An example showing how cells are flagged for refinement across 2 processors.)
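The flag-propagation rule can be sketched as a recursive operation on a quadtree of leaves: refining a cell first refines any face neighbor that is one level coarser, which is exactly what spreads the refinement flags described above. This is a serial, illustrative sketch; the paper performs the same propagation across processors by exchanging global IDs.

```python
def refine(leaves, cell):
    """Refine `cell` = (level, i, j), recursively refining any coarser
    face neighbor first so that no two adjacent leaves differ by more
    than one level (the 2:1 constraint).  `leaves` is a set of leaf
    cells; level-l cells have coordinates in [0, 2**l)."""
    level, i, j = cell
    assert cell in leaves
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if not (0 <= ni < 2 ** level and 0 <= nj < 2 ** level):
            continue                       # neighbor is outside the domain
        coarse = (level - 1, ni // 2, nj // 2)
        if coarse in leaves:               # neighbor is one level coarser:
            refine(leaves, coarse)         # refine it first (may cascade)
    leaves.remove(cell)                    # split the cell into 4 children
    for ci in (2 * i, 2 * i + 1):
        for cj in (2 * j, 2 * j + 1):
            leaves.add((level + 1, ci, cj))
```

Refining a fine cell next to a coarse region therefore cascades outward until the whole mesh satisfies the constraint, mirroring the "before and after enforcing the refinement constraints" figure.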

Numerical methods Adaptive mesh refinement and derefinement Before and after enforcing the refinement constraints on 4 processors

Numerical methods
Adaptive mesh refinement and derefinement. When cell A is marked for coarsening:
– All children of cell A must be leaves.
– If any neighboring cell is not a leaf, check its two nearest children. If these two children are not leaves and are not marked for coarsening, cell A cannot be coarsened.
(An example showing how cell A is coarsened without violating the constraint.)

Numerical methods
Construction of the ghost boundary cells for each processor. The corresponding Oct data structure has to be generated so that the boundary cells can find their neighboring cells. Seven cells in each neighbor direction need Oct A to find their neighbors. The Hilbert coordinates of all neighboring cells are computed to obtain their processor IDs, and the data in Oct A are sent to the processors where the related neighboring cells reside. (The neighboring cells related to Oct A in the FTT data structure.)

Numerical methods
Construction of the ghost boundary cells for each processor. The ghost boundary cells for each processor can then be determined from the Oct data structures. (The local leaf cells together with their corresponding ghost leaf boundary cells on two processors.)

Numerical methods
Discretization of the Poisson equation. The Poisson equation is discretized to second-order accuracy, using cell-centered gradients to approximate the value at the auxiliary node. A least-squares approach is used to evaluate the cell-centered gradients. (Approximation of the gradient flux Fe based on the values at node E and the auxiliary node P'.)
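A least-squares cell-centered gradient of this kind can be computed by solving the 2x2 normal equations directly. The sketch below is illustrative, not the paper's discretization code; note that for a linear field it recovers the gradient exactly, which is what makes the reconstructed auxiliary-node value second-order accurate.

```python
def ls_gradient(xc, uc, nbrs):
    """Least-squares gradient at a cell center.

    xc: (x, y) of the cell center, uc: the value there,
    nbrs: list of ((x, y), u) for the neighboring cell centers.
    Minimizes sum_k (u_k - uc - g . (x_k - xc))^2 by assembling and
    solving the 2x2 normal equations in closed form."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), u in nbrs:
        dx, dy, du = x - xc[0], y - xc[1], u - uc
        a11 += dx * dx
        a12 += dx * dy
        a22 += dy * dy
        b1 += dx * du
        b2 += dy * du
    det = a11 * a22 - a12 * a12          # needs >= 2 non-collinear neighbors
    return ((a22 * b1 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det)
```

The value at a hypothetical auxiliary node P' would then be reconstructed as u_P + g . (x_P' - x_P) and used in the face flux, as in the figure.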

Numerical methods
Parallel multigrid preconditioner with the conjugate gradient method. Additive multigrid method:
– The smoothing can be performed simultaneously (in parallel) on all grid levels.
– Better parallel performance than the classical multigrid method.
– Does not converge when used as a stand-alone solver.
– Used instead as a preconditioner combined with the conjugate gradient method.
(A sketch of the V-cycle additive multigrid method.)
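The preconditioned conjugate gradient driver is independent of the preconditioner, so the additive multigrid V-cycle simply plugs in as `apply_Minv`. The sketch below is illustrative only: it uses a one-level Jacobi (diagonal) stand-in for the V-cycle to stay short, and solves a 1D model Poisson problem; all names are ours.

```python
def pcg(apply_A, b, apply_Minv, tol=1e-10, maxiter=500):
    """Preconditioned conjugate gradient.  In the paper, apply_Minv is
    one additive multigrid V-cycle; the CG driver is unchanged either way."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                    # residual for x = 0
    z = apply_Minv(r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(maxiter):
        Ap = apply_A(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if max(abs(ri) for ri in r) < tol:
            break
        z = apply_Minv(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x


# Model problem: -u'' = f on (0, 1), u(0) = u(1) = 0, n interior points.
n, h = 63, 1.0 / 64

def apply_A(v):
    """Standard second-order finite-difference Laplacian."""
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append((2 * v[i] - left - right) / (h * h))
    return out

def jacobi_prec(r):
    """Diagonal scaling: the stand-in here for the additive MG V-cycle."""
    return [ri * h * h / 2.0 for ri in r]
```

For f = 1 the exact solution is u(x) = x(1 - x)/2, so the solver can be checked against u(1/2) = 1/8.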

Numerical results
Consider a 2D Poisson equation (the equation and the computational domain are given on the slide). Neumann boundary conditions are used on all four boundaries. The parallel efficiency is tested on a cluster in SHARCNET.

Numerical results
Uniform grids: using more processors does not always reduce the run time. For the cases with fewer than 8 refinement levels, the times increase from 16 to 32 processors because communication time dominates. As the problem becomes larger, the parallel efficiency improves because computation time dominates. For the largest case, a parallel efficiency of 98% is achieved on 64 processors. (Wall-clock times on regular grids from level 5 to 10 with up to 64 processors.)

Numerical results
AMR grids: leaf cells are refined where the refinement indicator (given on the slide) is larger than its mean value. For problems with large grid sizes, the times decrease monotonically as the number of processors increases. For the largest case, a parallel efficiency of 106% (>100%) is achieved, due to more efficient use of cache memory as the grid size on each processor becomes smaller. (Wall-clock times on AMR grids with up to 64 processors.)

Numerical results
Grid partitioning and mapping times using the Hilbert SFC: the percentage increases slightly as more processors are used, because a large amount of data has to be migrated over a larger number of processors. Even so, the ratio of load-balancing time to total computation time is only 0.22% on 64 processors, so the proposed method is very efficient. (Wall-clock times of the load-balancing procedure for an adaptive grid on different numbers of processors.)

Conclusions
The FTT data structure is used to organize the adaptive meshes because of its low memory overhead and its ability to access neighboring cells without searching. The Hilbert SFC approach is used to dynamically partition the adaptive meshes. The numerical experiments show that the proposed parallel Poisson solver is highly efficient.