Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation CRTI RD Project Review Meeting Canadian Meteorological Centre August 22-23, 2006
Outline Introduction. Numerical methods: parallel load-balancing with space-filling curves (SFC); data distribution; adaptive mesh refinement and derefinement; construction of the ghost boundary cells for each processor; discretization of the Poisson equation; parallel multigrid preconditioner with conjugate gradient method. Numerical results. Conclusions.
Introduction Structured adaptive mesh refinement (AMR) Block-structured AMR Each node represents a block of cells. Advantage: The cells in each block can be organized as two- or three-dimensional arrays, so a structured-grid solver can be used for AMR with few modifications. Disadvantage: It is inflexible. A substantial number of cells can be wasted in regions where the flow is smooth.
Introduction Each Node represents a cell. The mesh is only locally refined in contrast to the block-structured AMR. It is more flexible, and computationally more efficient than the block-structured AMR. The cell-based AMR is chosen in the present paper. Cell-based AMR
Introduction The cells can be organized as a quad-tree in 2D or an oct-tree in 3D. An oct-tree needs 17 words of memory per cell if the connectivity information is explicitly stored. If it is not explicitly stored, the tree may have to be traversed up to its root to find the required neighboring cell. This is difficult to parallelize because a search may extend from one processor to another. An ordinary tree data structure
Introduction All cells are grouped together as Octs, which significantly reduces the memory overhead. Maintaining an octal FTT requires about three words of memory per cell instead of the 17 words of an ordinary oct-tree. An oct-tree structure in an FTT Fully Threaded Tree (FTT) structure
Introduction The west and south neighbors of cell 6 can be found directly through its explicitly stored parent Oct. The east and north neighbors of cell 6 can be found through the neighboring cells of its parent Oct. No more than one level of the tree needs to be traversed to access the neighbors of a cell. Fully Threaded Tree (FTT) structure An example of accessing the neighbors of a cell without searching, using the FTT structure.
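The neighbor lookup described on this slide can be illustrated with a minimal two-dimensional sketch. The names `Cell`, `Oct`, `neighbor`, and `refine` are hypothetical, and the layout (each Oct holds four cells, a pointer to its parent cell, and pointers to the four face neighbors of that parent) is an assumption based on the slide text, not the authors' code. The sketch also assumes the one-level-difference constraint holds whenever a cell is refined.

```python
WEST, EAST, SOUTH, NORTH = range(4)  # direction d: axis = d // 2, side = d % 2

class Cell:
    def __init__(self, oct_, index):
        self.oct = oct_      # Oct containing this cell (None for the root cell)
        self.index = index   # position in the Oct: bit 0 = east half, bit 1 = north half
        self.child = None    # child Oct, or None if the cell is a leaf

class Oct:
    def __init__(self, parent, nbrs):
        self.parent = parent  # parent cell of the four cells in this Oct
        self.nbrs = nbrs      # face neighbors (W, E, S, N) of the parent cell, or None
        self.cells = [Cell(self, i) for i in range(4)]

def neighbor(cell, d):
    """Return the face neighbor of cell in direction d without any tree search."""
    if cell.oct is None:
        return None                          # the root cell has no neighbors
    axis, side = d // 2, d % 2
    mirrored = cell.index ^ (1 << axis)      # same cell position, flipped along axis
    if ((cell.index >> axis) & 1) != side:
        return cell.oct.cells[mirrored]      # neighbor is a sibling in the same Oct
    coarse = cell.oct.nbrs[d]                # neighbor of the parent cell
    if coarse is None or coarse.child is None:
        return coarse                        # domain boundary, or a coarser leaf
    return coarse.child.cells[mirrored]      # same-level cell in the adjacent Oct

def refine(cell):
    """Split a leaf cell; the new Oct stores the current neighbors of its parent cell."""
    cell.child = Oct(cell, [neighbor(cell, d) for d in range(4)])
    return cell.child
```

In this sketch a lookup touches at most the cell's own Oct and one neighboring parent cell, which mirrors the statement that no more than one level of the tree is traversed.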
Introduction Objective: Propose a new parallel approach to the AMR code based on the FTT data structure
Numerical methods An SFC is chosen as the grid partitioner because of its mapping, compactness, and locality properties. Points in the higher-dimensional space are mapped to corresponding points on a line. Only the coordinates of a point in the higher-dimensional domain are required to compute its location on the 1D line. In the Hilbert ordering, all adjacent neighbors on the 1D line are face-neighboring cells in the higher-dimensional domain (locality). Parallel load-balancing with space-filling curves (SFC) Space-filling curves in two dimensions: (a) Hilbert or U ordering, (b) Morton or N ordering.
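As an illustration of the mapping property, the sketch below computes the 1D position of a cell from its integer coordinates on a uniform 2^order by 2^order grid. `hilbert_index` follows the widely used bit-manipulation formulation of the 2D Hilbert curve; `morton_index` interleaves bits with y in the low position, which is an assumed convention chosen to give the N-shaped ordering. Both function names are hypothetical, and the slides do not show how the authors compute their keys.

```python
def hilbert_index(order, x, y):
    """Position of cell (x, y) along the Hilbert curve on a 2**order square grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d

def morton_index(order, x, y):
    """Position of cell (x, y) along the Morton curve (bit interleaving, y lowest)."""
    d = 0
    for b in range(order):
        d |= ((y >> b) & 1) << (2 * b)
        d |= ((x >> b) & 1) << (2 * b + 1)
    return d
```

Only the coordinates enter either function, so any processor can evaluate the key of any cell locally. Consecutive Hilbert indices always belong to face-neighboring cells, which is the locality property; the Morton ordering does not have it.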
Numerical methods The different colors correspond to partitions on the different processors. Only leaf cells are shown in the left figure. Parallel load-balancing with space-filling curves (SFC) The two-dimensional adaptive grids partitioned on four processors with the Hilbert SFC.
Numerical methods A unique global ID is used to identify each cell instead of a local ID on each processor. The processor ID is not stored for each cell; it can be computed from the cell's spatial coordinates using the SFC. A hash table is used to store the cells and oct structures on each processor. If a cell is marked by the Hilbert SFC for migration to another processor, both the data in the cell and the corresponding oct structures have to be migrated. Data distribution The global ID is used to identify each cell.
Numerical methods Constraint: no neighboring cells with a level difference greater than one are allowed. Cell A is marked to be refined: – Check the neighboring cells of the parent cell of cell A (i.e., cells B and D); if these neighbors are leaves, they are marked to be refined. – If cells B and C belong to two different processors, send the global ID of the neighbor of the parent cell of cell B to the processor where cell C resides. Adaptive mesh refinement and derefinement An example showing how to flag cells to be refined over 2 processors
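A minimal serial sketch of this data distribution follows. It assumes each processor owns a contiguous range of SFC keys, that each cell record carries a precomputed `sfc_key` field, and that a Python `dict` keyed by global ID stands in for the hash table. The names `make_splitters`, `owner`, and `migrate` are hypothetical. Real migration would use message passing, and the oct structures would travel with the cells; neither is modeled here.

```python
from bisect import bisect_right

def make_splitters(sorted_keys, nprocs):
    """First SFC key owned by each rank after rank 0, for an equal-count partition."""
    n = len(sorted_keys)
    return [sorted_keys[(r * n) // nprocs] for r in range(1, nprocs)]

def owner(key, splitters):
    """Rank that owns an SFC key; nothing is stored per cell."""
    return bisect_right(splitters, key)

def migrate(stores, splitters):
    """Move every cell record to the hash table of the rank that owns its key.

    stores[r] is the hash table of rank r, mapping global ID -> record.
    """
    new_stores = [dict() for _ in stores]
    for store in stores:
        for gid, record in store.items():
            new_stores[owner(record["sfc_key"], splitters)][gid] = record
    return new_stores
```

Because ownership is a pure function of the key and the splitter list, every rank can determine where any cell lives after exchanging only `nprocs - 1` splitter values.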
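The flagging rule above can be sketched serially as follows. Leaves are represented as `(level, i, j)` tuples on a square domain, which is an assumed representation chosen for brevity rather than the FTT storage used in the slides. `covering_leaf` and `flag_with_constraint` are hypothetical names. In the parallel setting, the step that appends a coarser neighbor to the work list would instead send its global ID to the owning processor when that neighbor is remote.

```python
def covering_leaf(leaves, level, i, j):
    """Leaf at this level or coarser that covers cell (level, i, j), else None."""
    while level >= 0:
        if (level, i, j) in leaves:
            return (level, i, j)
        level, i, j = level - 1, i // 2, j // 2
    return None  # the position is covered by finer leaves

def flag_with_constraint(leaves, seeds):
    """Flag the seed leaves plus every coarser leaf that must also be refined
    so that face neighbors never differ by more than one level."""
    flagged = set()
    stack = list(seeds)
    while stack:
        cell = stack.pop()
        if cell in flagged:
            continue
        flagged.add(cell)
        level, i, j = cell
        n = 1 << level
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if not (0 <= ni < n and 0 <= nj < n):
                continue                      # outside the domain
            nb = covering_leaf(leaves, level, ni, nj)
            if nb is not None and nb[0] < level:
                stack.append(nb)              # coarser leaf neighbor: refine it too
    return flagged
```

A coarser leaf adjacent to a flagged cell is a leaf neighbor of that cell's parent, so this check follows the rule stated on the slide; the propagation continues because a newly flagged cell may itself have coarser neighbors.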
Numerical methods Adaptive mesh refinement and derefinement Before and after enforcing the refinement constraints on 4 processors
Numerical methods If cell A is marked to be coarsened: – All the children cells of cell A should be leaves. – If any neighboring cell is not a leaf, check its nearest two children cells. If the nearest two children cells are not leaves and are not marked to be coarsened, cell A cannot be coarsened. Adaptive mesh refinement and derefinement An example showing how cell A is coarsened without violating the constraint.
Numerical methods The corresponding oct data structure has to be generated so that the boundary cells can find their neighboring cells. Seven cells in each neighbor direction should use oct A to find their neighboring cells. Hilbert coordinates of all neighboring cells are computed to obtain their processor ID. The data in oct A are sent to the processors where the related neighboring cells reside. Construction of the ghost boundary cells for each processor The neighboring cells related to the Oct A in the FTT data structure.
Numerical methods The ghost boundary cells for each processor can be determined based on the oct data structures. Construction of the ghost boundary cells for each processor The local leaf cells together with their corresponding ghost leaf boundary cells on two processors
Numerical methods Poisson equation: second-order accuracy using the cell-centered gradients to approximate the value at the auxiliary node The least squares approach is used to evaluate the cell-centered gradients. Discretization of the Poisson equation Approximation of the gradient flux Fe based on values at the node E and the auxiliary node P'
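A minimal two-dimensional sketch of the least squares gradient and of its use at an auxiliary node follows. The stencil (which neighbors enter the fit), the unweighted fit, and the function names `ls_gradient`, `value_at`, and `normal_derivative` are assumptions for illustration; the slide does not give these details.

```python
from math import hypot

def ls_gradient(center, phi0, neighbors):
    """Least squares estimate of the gradient of phi at a cell center.

    neighbors is a list of ((x, y), phi) pairs for the surrounding cell centers.
    Minimizes sum_k (phi_k - phi0 - g . (x_k - x_0))**2 via the 2x2 normal equations.
    """
    x0, y0 = center
    sxx = sxy = syy = bx = by = 0.0
    for (x, y), phi in neighbors:
        dx, dy, dphi = x - x0, y - y0, phi - phi0
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
        bx += dx * dphi
        by += dy * dphi
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-14:
        raise ValueError("neighbors are collinear; gradient is not determined")
    return ((syy * bx - sxy * by) / det, (sxx * by - sxy * bx) / det)

def value_at(center, phi0, grad, point):
    """Linear extrapolation from the cell center, e.g. to the auxiliary node P'."""
    return phi0 + grad[0] * (point[0] - center[0]) + grad[1] * (point[1] - center[1])

def normal_derivative(phi_e, node_e, phi_aux, node_aux):
    """Difference quotient between node E and the auxiliary node P'."""
    return (phi_e - phi_aux) / hypot(node_e[0] - node_aux[0], node_e[1] - node_aux[1])
```

The fit reproduces a linear field exactly, so the extrapolated value at P' is exact for linear data; this is the property that the second-order claim relies on.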
Numerical methods Additive multigrid method: – The smoothing can be performed simultaneously (or in parallel) at all grid levels. – Better parallel performance than the classical multigrid method. – Does not converge if used as a stand-alone smoother. – Used as a preconditioner combined with the conjugate gradient method. Parallel multigrid preconditioner with conjugate gradient method A sketch of the V-cycle additive multigrid method.
Numerical results Considering a 2D Poisson equation The computational domain is The Neumann boundary conditions are used on the four boundaries. The parallel efficiency is tested on the cluster of computers in SHARCNET.
Numerical results Uniform grids: Using more processors does not always reduce the time. For the cases corresponding to levels less than 8, the times increase from 16 to 32 processors because the communication times dominate. As the problem becomes bigger, the parallel efficiency improves because the computational times dominate. For the last case, a parallel efficiency of 98% is achieved with 64 processors. The wall clock times on regular grids from level 5 to 10 with up to 64 processors
Numerical results AMR grids: The leaf cells are refined if is larger than the mean value. For problems with large grid sizes, the times decrease monotonically as the number of processors increases. For the last case, a parallel efficiency of 106% (>100%) is achieved due to efficient use of cache memory when the grid size on each processor becomes smaller. The wall clock times on AMR grids with up to 64 processors
Numerical results The grid partitioning and mapping times using the Hilbert SFC: The percentage increases slightly when a larger number of processors is used because a large amount of data has to be migrated over a larger number of processors. The ratio of the load balancing time to the total computational time is only 0.22% in the case of 64 processors. The proposed method is very efficient. The wall clock times associated with the load balancing procedure for an adaptive grid on the different processors.
Conclusions The FTT data structure is used to organize the adaptive meshes because of its low memory overhead and its ability to access neighboring cells without searching. The Hilbert SFC approach is used to dynamically partition the adaptive meshes. The numerical experiments show that the proposed parallel Poisson solver is highly efficient.
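The combination described on this slide can be sketched on a much simpler model problem. The code below is serial and one-dimensional, uses homogeneous Dirichlet boundaries rather than the Neumann conditions of the talk, and uses one Jacobi step per level as the smoother; all of these are simplifying assumptions, and every function name is hypothetical. The residual is restricted to all levels, each level is smoothed independently (the part that can run concurrently), and the corrections are prolonged and summed.

```python
from math import sqrt

def apply_poisson(u, h):
    """Matrix-free 1D operator tridiag(-1, 2, -1) / h**2 with zero Dirichlet ends."""
    n = len(u)
    return [(2.0 * u[i] - (u[i - 1] if i > 0 else 0.0)
             - (u[i + 1] if i < n - 1 else 0.0)) / (h * h) for i in range(n)]

def restrict(f):
    """Transpose of linear interpolation: 2m+1 fine values -> m coarse values."""
    return [0.5 * f[2 * i] + f[2 * i + 1] + 0.5 * f[2 * i + 2]
            for i in range((len(f) - 1) // 2)]

def prolong(c):
    """Linear interpolation: m coarse values -> 2m+1 fine values."""
    f = [0.0] * (2 * len(c) + 1)
    for i, ci in enumerate(c):
        f[2 * i] += 0.5 * ci
        f[2 * i + 1] += ci
        f[2 * i + 2] += 0.5 * ci
    return f

def additive_mg(r, h):
    """Additive multigrid preconditioner: z = sum over levels of P D^-1 P^T r."""
    residuals = [r]
    while len(residuals[-1]) > 1:
        residuals.append(restrict(residuals[-1]))
    diag = 2.0 / (h * h)          # diagonal of the fine operator
    corrections = []
    for res in residuals:         # levels are independent of each other here
        corrections.append([ri / diag for ri in res])
        diag *= 0.5               # diagonal of the Galerkin operator one level coarser
    z = corrections[-1]
    for corr in reversed(corrections[:-1]):
        z = [a + b for a, b in zip(prolong(z), corr)]
    return z

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg(apply_a, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient; returns (solution, iterations used)."""
    x = [0.0] * len(b)
    r = list(b)
    z = precond(r)
    p = list(z)
    rz = dot(r, z)
    b_norm = sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        ap = apply_a(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

A usage example under the same assumptions: `pcg(lambda v: apply_poisson(v, h), b, lambda r: additive_mg(r, h))` with `h = 1 / (len(b) + 1)` and `len(b)` of the form 2^k - 1. Because restriction is the transpose of prolongation, the preconditioner is symmetric positive definite, which is what the conjugate gradient method requires.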
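The slides do not state how parallel efficiency is defined. Assuming the conventional definition, with $T_p$ the wall clock time on $p$ processors, speedup and efficiency are:

```latex
S_p = \frac{T_1}{T_p}, \qquad
E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}
```

If the smallest run uses $p_0 > 1$ processors, the relative form $E_p = \dfrac{p_0\,T_{p_0}}{p\,T_p}$ is the usual substitute. Under either form, $E_p > 1$ corresponds to superlinear speedup.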