1
Parallelizing Unstructured FEM Computation
Xing Cai, Department of Informatics, University of Oslo
2
Contents
- Background & introduction
- Parallelization approaches
  - at the linear algebra level
  - based on domain decomposition
- Implementational aspects
- Numerical experiments
3
The Question
- Starting point: sequential FEM code
  - unstructured grids, implicit computation…
- How to do the parallelization?
- We need
  - a good parallelization strategy
  - a good and simple implementation of the strategy
- Resulting parallel solvers should have
  - good overall numerical performance
  - good parallel efficiency
4
Basic Strategy
- Different approaches to parallelization
  - Automatic compiler parallelization
  - Loop-level parallelization
- We use the strategy of divide & conquer
  - divide the global domain into subdomains
  - one process is responsible for one subdomain
  - make use of the message-passing paradigm
- Domain decomposition at different levels
5
A Generic Finite Element PDE Solver
- Time stepping t0, t1, t2…
- Spatial discretization on the computational grid
- Solution of nonlinear problems
- Solution of linearized problems
- Iterative solution of Ax=b
6
Important Observations
- The computation-intensive part is the iterative solution of Ax=b
- A parallel finite element PDE solver needs to run the linear algebra kernels in parallel
  - vector addition
  - inner-product of two vectors
  - matrix-vector product
- Two types of inter-processor communication
- Ratio computation/communication is high
- Relatively tolerant of slow communication
7
Solution Domain Partition
- Partition of the elements is non-overlapping
- Grid points shared between neighboring subdomains on the internal boundaries
[Figure: non-overlapping grid partition]
8
Natural Parallelization of PDE Solvers
- The global solution domain is partitioned into many smaller subdomains
- One subdomain works as a "unit", with its sub-matrices and sub-vectors
- No need to create global matrices and vectors physically
- The global linear algebra operations can be realized by local operations + inter-processor communication
9
Work in Parallel
- Assembly of local stiffness matrix etc. is embarrassingly parallel
- Vector addition/update is also embarrassingly parallel
- Inner-product between 2 distributed vectors requires collective communication
- Matrix-vector product requires immediate neighbors to exchange info
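The two communication patterns named above can be sketched without MPI by letting one process play all subdomains. The following is a minimal illustration, not Diffpack code: the `Subdomain` struct, the 1D mesh of unit-length linear elements, and the function names are all assumptions made for the example. Each subdomain assembles a stiffness matrix from its own elements only, the inner product counts each shared node once through an ownership flag before a global sum, and the matrix-vector product adds the partial results of neighbors at shared nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One subdomain of a 1D mesh of unit-length linear elements (illustrative layout).
struct Subdomain {
    std::vector<int> globalNode;  // local index -> global node number
    std::vector<bool> owned;      // false where a neighbor "owns" a shared node
    std::vector<Vec> A;           // stiffness matrix assembled from own elements only
};

// Assembly is purely local: no information from other subdomains is needed.
Subdomain makeSubdomain(int firstNode, int numElements, bool ownsFirstNode) {
    const int n = numElements + 1;
    Subdomain s;
    s.owned.assign(n, true);
    s.owned[0] = ownsFirstNode;
    s.A.assign(n, Vec(n, 0.0));
    for (int i = 0; i < n; ++i) s.globalNode.push_back(firstNode + i);
    for (int e = 0; e < numElements; ++e) {  // element matrix [1 -1; -1 1]
        s.A[e][e] += 1.0;      s.A[e][e + 1] -= 1.0;
        s.A[e + 1][e] -= 1.0;  s.A[e + 1][e + 1] += 1.0;
    }
    return s;
}

// Extract the local part of a global vector (used only to set up the example).
Vec restrictTo(const Subdomain& s, const Vec& global) {
    Vec local;
    for (int g : s.globalNode) local.push_back(global[g]);
    return local;
}

// Inner product: local sums over owned nodes, then one global sum.
// The loop over subdomains stands in for a collective reduction.
double innerProduct(const std::vector<Subdomain>& subs,
                    const std::vector<Vec>& x, const std::vector<Vec>& y) {
    assert(x.size() == subs.size() && y.size() == subs.size());
    double global = 0.0;
    for (std::size_t p = 0; p < subs.size(); ++p) {
        double local = 0.0;
        for (std::size_t i = 0; i < x[p].size(); ++i)
            if (subs[p].owned[i]) local += x[p][i] * y[p][i];
        global += local;
    }
    return global;
}

// Matrix-vector product: local product, then shared-node values are summed.
// The `exchange` buffer stands in for neighbor-to-neighbor messages.
std::vector<Vec> matVec(const std::vector<Subdomain>& subs,
                        const std::vector<Vec>& x, int numGlobalNodes) {
    std::vector<Vec> y(subs.size());
    Vec exchange(numGlobalNodes, 0.0);
    for (std::size_t p = 0; p < subs.size(); ++p) {
        const std::size_t n = x[p].size();
        y[p].assign(n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                y[p][i] += subs[p].A[i][j] * x[p][j];
        for (std::size_t i = 0; i < n; ++i)
            exchange[subs[p].globalNode[i]] += y[p][i];
    }
    for (std::size_t p = 0; p < subs.size(); ++p)
        for (std::size_t i = 0; i < y[p].size(); ++i)
            y[p][i] = exchange[subs[p].globalNode[i]];
    return y;
}
```

In a real message-passing code the reduction would be a collective call and the exchange would involve only the immediate neighbors; the global-sized buffer here is a shortcut for the simulation.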
10
Overlapping Grid Partition
- Necessary for preconditioning etc.
11
Linear-algebra Level Parallelization
- An SPMD model
- Reuse of existing code for local linear algebra operations
- Need new code for the parallelization-specific tasks
  - grid partition (non-overlapping, overlapping)
  - communication pattern recognition
  - inter-processor communication routines
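As an illustration of what communication pattern recognition can mean for a non-overlapping element partition, the sketch below finds, for every pair of subdomains, the grid points they share. It is a minimal example under stated assumptions: the `Element` alias, the `sharedNodes` function, and the input layout (one owner number per element, as a partitioner such as METIS might deliver) are hypothetical and not taken from the toolbox.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Element = std::vector<int>;                   // global node numbers of one element
using SubdomainPair = std::pair<int, int>;          // (p, q) with p < q
using SharedMap = std::map<SubdomainPair, std::set<int>>;

// Given the element connectivity and the subdomain that owns each element,
// return the nodes shared by each pair of neighboring subdomains.
SharedMap sharedNodes(const std::vector<Element>& elements,
                      const std::vector<int>& elementOwner) {
    assert(elements.size() == elementOwner.size());

    // Step 1: record which subdomains touch each node.
    std::map<int, std::set<int>> subdomainsOfNode;
    for (std::size_t e = 0; e < elements.size(); ++e)
        for (int node : elements[e])
            subdomainsOfNode[node].insert(elementOwner[e]);

    // Step 2: a node touched by more than one subdomain lies on an
    // internal boundary; register it for every pair that touches it.
    SharedMap result;
    for (const auto& entry : subdomainsOfNode) {
        const std::set<int>& subs = entry.second;
        for (auto p = subs.begin(); p != subs.end(); ++p)
            for (auto q = std::next(p); q != subs.end(); ++q)
                result[std::make_pair(*p, *q)].insert(entry.first);
    }
    return result;
}
```

The keys of the returned map give the neighbor relations, and each node set gives the values that would have to be exchanged in a matrix-vector product.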
12
OOP Simplifies Parallelization
- Develop a small add-on "toolbox" containing all the parallelization-specific code
- The "toolbox" has many high-level routines and hides the low-level MPI details
- The existing sequential libraries are slightly modified to include a "dummy" interface, thus incorporating "fake" inter-processor communication routines
- A seamless coupling between the huge sequential libraries and the add-on toolbox
13
Diffpack
- O-O software environment for scientific computation (C++)
- Rich collection of PDE solution components: portable, flexible, extensible
- H. P. Langtangen, Computational Partial Differential Equations, Springer 1999
14
Straightforward Parallelization
- Develop a sequential simulator, without paying attention to parallelism
- Follow the Diffpack coding standards
- Use the add-on toolbox for parallel computing
- Add a few new statements for transformation to a parallel simulator
15
Linear-Algebra-Level Approach
- Parallelize matrix/vector operations
  - inner-product of two vectors
  - matrix-vector product
  - preconditioning: block contribution from subgrids
- Easy to use
  - access to all Diffpack v3.0 CG-like methods, preconditioners and convergence monitors
  - "hidden" parallelization
  - need only to add a few lines of new code
  - arbitrary choice of number of processors at run-time
16
A Simple Coding Example
GridPartAdm* adm;    // access to parallelization functionality
LinEqAdm*    lineq;  // administrator for linear system & solver
// ...
#ifdef PARALLEL_CODE
adm->scan (menu);
adm->prepareSubgrids ();
adm->prepareCommunication ();
lineq->attachCommAdm (*adm);
#endif
lineq->solve ();

set subdomain list = DEFAULT
set global grid = grid1.file
set partition-algorithm = METIS
set number of overlaps = 0
17
Solving an Elliptic PDE
- Highly unstructured grid
- Discontinuity in the coefficient K (0.1 & 1)
18
Measurements
- 130,561 degrees of freedom
- Overlapping subgrids
- Global BiCGStab using (block) ILU prec.
19
Parallel Simulation of 3D Acoustic Field
3D nonlinear model
20
3D Nonlinear Acoustic Field Simulation
- Comparison between Origin 2000 and Linux cluster
- 1,030,301 grid points

        Origin 2000            Linux Cluster
CPUs    CPU-time    Speedup    CPU-time    Speedup
2       8670.8      N/A        6681.5      N/A
4       4726.5      3.75       3545.9      3.77
8       2404.2      7.21       1881.1      7.10
16      1325.6      13.0       953.89      14.0
24      1043.7      16.6       681.77      19.6
32      725.23      23.9       563.54      23.7
48      557.61      31.1       673.77      19.8
21
Incompressible Navier-Stokes
- Numerical strategy: operator splitting
  - Calculation of an intermediate velocity in a predictor-corrector way
  - Solution of a Poisson equation
  - Correction of the intermediate velocity
22
Incompressible Navier-Stokes
- Explicit schemes for predicting and correcting the velocity
- Implicit solution of the pressure by CG
- Measurements on a Linux cluster

P     CPU-time    Speedup    Efficiency
1     665.45      N/A        N/A
2     329.57      2.02       1.01
4     166.55      4.00       1.00
8     89.98       7.40       0.92
16    48.96       13.59      0.85
24    34.85       19.09      0.80
48    34.22       19.45      0.41
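The speedup and efficiency figures in these measurements are consistent with the usual definitions S(P) = T(1)/T(P) and E(P) = S(P)/P. The sketch below recomputes them from CPU times. The `Timing` and `Scaling` structs and the `scaling` function are names invented for this example. Scaling the speedup by the processor count of the first run, for the case where no single-processor time is available, is an assumed convention and not something stated on the slide.

```cpp
#include <cassert>
#include <vector>

struct Timing  { int procs; double cpuTime; };
struct Scaling { int procs; double speedup; double efficiency; };

// Speedup relative to the first run in the list.
// With a single-processor baseline this is S(P) = T(1) / T(P).
// If the baseline used several processors, the speedup is scaled by that
// count (an assumed convention for this sketch).
std::vector<Scaling> scaling(const std::vector<Timing>& runs) {
    assert(!runs.empty() && runs.front().cpuTime > 0.0);
    const Timing base = runs.front();
    std::vector<Scaling> out;
    for (const Timing& r : runs) {
        const double speedup = base.procs * base.cpuTime / r.cpuTime;
        out.push_back({r.procs, speedup, speedup / r.procs});  // E(P) = S(P) / P
    }
    return out;
}
```

Applied to the Navier-Stokes timings, the run on 48 processors takes almost as long as the run on 24 (34.22 versus 34.85), so the efficiency roughly halves, from 0.80 to 0.41.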
23
Example: Vortex-Shedding
24
Simulation Snapshots Pressure
25
Simulation Snapshots Velocity
26
Animated Pressure Field
27
Parallel Simulation of Heart
- Special code for balanced partition of coupled heart-torso grids
- Simple extension of sequential elliptic and parabolic solvers
28
Higher Level Parallelization
- Apply overlapping Schwarz methods as both stand-alone solution method and preconditioner
- Solution of the original large problem through iteratively solving many smaller subproblems
- Flexibility: localized treatment of irregular geometries, singularities etc.
- Inherent parallelism, suitable for coarse-grained parallelization
29
One Example of DD
- Poisson Eq. on unit square
30
Observations
- DD is a good parallelization strategy
- The approach is not PDE-specific
- A program for the original global problem can be reused (modulo B.C.) for each subdomain
- Must communicate overlapping point values
- No need for global data
- Data distribution implied
- Explicit temporal schemes are a special case where no iteration is needed ("exact DD")
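These observations can be made concrete with a small overlapping Schwarz iteration. The sketch below is an illustrative 1D stand-in, not the deck's 2D example: it solves -u'' = 1 on (0,1) with u(0) = u(1) = 0 by finite differences on two overlapping subdomains. The function names, the grid indexing, and the choice of a direct tridiagonal solver as the "sequential solver" are assumptions made for the example. Each subdomain reuses the same solver with changed boundary values, holds only its own data, and receives one overlapping point value from its neighbor per iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// "Sequential solver": solve -u'' = f (constant f) on m consecutive grid
// points of spacing h, with Dirichlet values `left` and `right` just outside
// the block, using the Thomas algorithm for the stencil (-1, 2, -1).
std::vector<double> solveLocal(int m, double h, double f, double left, double right) {
    assert(m >= 1);
    std::vector<double> diag(m, 2.0), rhs(m, h * h * f), u(m);
    rhs.front() += left;
    rhs.back() += right;
    for (int i = 1; i < m; ++i) {            // forward elimination
        diag[i] -= 1.0 / diag[i - 1];
        rhs[i] += rhs[i - 1] / diag[i - 1];
    }
    u[m - 1] = rhs[m - 1] / diag[m - 1];     // back substitution
    for (int i = m - 2; i >= 0; --i) u[i] = (rhs[i] + u[i + 1]) / diag[i];
    return u;
}

// Schwarz iteration on interior points 1..n. Subdomain 1 holds points 1..b,
// subdomain 2 holds points a..n, with 1 < a <= b < n so that they overlap.
// Returns the max-norm error against the exact discrete solution x(1-x)/2.
double schwarzError(int n, int a, int b, int iters) {
    assert(1 < a && a <= b && b < n);
    const double h = 1.0 / (n + 1);
    std::vector<double> u1(b, 0.0);          // local data of subdomain 1
    std::vector<double> u2(n - a + 1, 0.0);  // local data of subdomain 2
    for (int k = 0; k < iters; ++k) {
        // "Communication": one overlapping point value in each direction.
        const double fromRight = u2[(b + 1) - a];  // value at point b+1
        const double fromLeft  = u1[(a - 1) - 1];  // value at point a-1
        // Both subdomain solves use old neighbor data, so they could run in parallel.
        u1 = solveLocal(b, h, 1.0, 0.0, fromRight);
        u2 = solveLocal(n - a + 1, h, 1.0, fromLeft, 0.0);
    }
    double err = 0.0;
    for (int i = 1; i <= b; ++i) {
        const double x = i * h;
        err = std::max(err, std::fabs(u1[i - 1] - 0.5 * x * (1.0 - x)));
    }
    for (int i = a; i <= n; ++i) {
        const double x = i * h;
        err = std::max(err, std::fabs(u2[i - a] - 0.5 * x * (1.0 - x)));
    }
    return err;
}
```

For example, `schwarzError(19, 8, 12, 40)` uses 19 interior points with an overlap of five points. Widening the overlap speeds up convergence, while shrinking it slows the iteration down.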
31
Goals for the Implementation
- Reuse sequential solver as subdomain solver
- Add DD management and communication as separate modules
- Collect common operations in generic library modules
- Flexibility and portability
- Simplified parallelization process for the end-user
32
Generic Programming Framework
33
The Subdomain Simulator
[Diagram: sequential solver plus add-on communication]
34
The Communicator
- Need functionality for exchanging point values inside the overlapping regions
- The communicator works with a hidden communication model
- Make use of the add-on toolbox for linear-algebra level parallelization
- MPI in use, but easy to change
35
Making A Simulator Parallel
class SimulatorP : public SubdomainFEMSolver, public Simulator
{
  // … just a small amount of code
  virtual void createLocalMatrix ()
    { Simulator::makeSystem (); }
};

[Class diagram: Administrator, SubdomainSimulator, Simulator, SubdomainFEMSolver, SimulatorP]
36
Performance
- Algorithmic efficiency
  - efficiency of original sequential simulator(s)
  - efficiency of domain decomposition method
- Parallel efficiency
  - communication overhead (low)
  - coarse grid correction overhead (normally low)
  - load balancing
    - subproblem size
    - work on subdomain solves
37
A Simple Application
- Poisson Equation on unit square
- DD as the global solution method
- Subdomain solvers use CG+FFT
- Fixed number of subdomains M=32 (independent of P)
- Straightforward parallelization of an existing simulator
- P: number of processors
38
Combined Approach
- Use a CG-like method as basic solver (i.e. use a parallelized Diffpack linear solver)
- Use DD as preconditioner (i.e. SimulatorP is invoked as a preconditioning solve)
- Combine with coarse grid correction
- CG-like method + DD prec. is normally faster than DD as a basic solver
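A minimal sketch of the combined approach, assuming a 1D Poisson model problem rather than any of the deck's applications: conjugate gradients is the basic solver, and each preconditioning step performs independent exact solves on contiguous blocks of unknowns. This is a DD preconditioner with zero overlap and no coarse grid correction, both simplifications chosen to keep the example short. All function names are hypothetical and do not reflect the Diffpack interface.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// y = A x for the 1D Laplacian stencil (-1, 2, -1), homogeneous Dirichlet ends.
Vec applyA(const Vec& x) {
    const int n = static_cast<int>(x.size());
    Vec y(n);
    for (int i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        if (i > 0) y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
    return y;
}

// Subdomain solve: exact Thomas solve of the block on entries [first, last).
void solveBlock(const Vec& r, Vec& z, int first, int last) {
    const int m = last - first;
    Vec diag(m, 2.0), rhs(r.begin() + first, r.begin() + last);
    for (int i = 1; i < m; ++i) {
        diag[i] -= 1.0 / diag[i - 1];
        rhs[i] += rhs[i - 1] / diag[i - 1];
    }
    z[last - 1] = rhs[m - 1] / diag[m - 1];
    for (int i = m - 2; i >= 0; --i)
        z[first + i] = (rhs[i] + z[first + i + 1]) / diag[i];
}

// DD preconditioner: independent solves per block; numBlocks == 0 means none.
Vec precondition(const Vec& r, int numBlocks) {
    if (numBlocks == 0) return r;
    const int n = static_cast<int>(r.size());
    Vec z(n, 0.0);
    for (int p = 0; p < numBlocks; ++p)
        solveBlock(r, z, p * n / numBlocks, (p + 1) * n / numBlocks);
    return z;
}

// Preconditioned CG for A x = b starting from x = 0; returns the iteration count.
int pcg(const Vec& b, Vec& x, int numBlocks, double tol, int maxIter) {
    assert(numBlocks >= 0 && numBlocks <= static_cast<int>(b.size()));
    x.assign(b.size(), 0.0);
    Vec r = b, z = precondition(r, numBlocks), p = z;
    double rz = dot(r, z);
    const double bnorm = std::sqrt(dot(b, b));
    for (int k = 1; k <= maxIter; ++k) {
        const Vec Ap = applyA(p);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        if (std::sqrt(dot(r, r)) <= tol * bnorm) return k;
        z = precondition(r, numBlocks);
        const double rzNew = dot(r, z);
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = z[i] + (rzNew / rz) * p[i];
        rz = rzNew;
    }
    return maxIter;
}
```

Calling `pcg` with `numBlocks` set to 0 and then to 4 on the same right-hand side gives a direct comparison of iteration counts with and without the preconditioner.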
39
Two-Phase Porous Media Flow
Simulation result obtained on 16 processors
40
Two-Phase Porous Media Flow
History of saturation for water and oil
41
Two-Phase Porous Media Flow
- SEQ:
- PEQ:
- BiCGStab + DD prec. for global pressure eq.
- Multigrid V-cycle in subdomain solves
42
3D Nonlinear Water Waves
- Fully nonlinear 3D water waves
- Primary unknowns:
43
3D Nonlinear Water Waves
- Global 3D grid: 49x49x41
- Global solver: CG + overlapping Schwarz prec.
- Multigrid V-cycle as subdomain solver
- CPU measurement of a total of 32 time steps
- Parallel simulation on a Linux cluster
44
Elasticity
- Test case: 2D linear elasticity, 241 x 241 global grid
- Vector equation
- Straightforward parallelization based on an existing Diffpack simulator
45
2D Linear Elasticity
- BiCGStab + DD prec. as global solver
- Multigrid V-cycle in subdomain solves
- I: number of global BiCGStab iterations needed
- P: number of processors (P=#subdomains)
46
2D Linear Elasticity
47
Summary
- Goal: provide software and programming rules for easy parallelization of FEM codes
- Applicable to a wide range of PDE problems
- Two parallelization strategies:
  - parallelization at the linear algebra level: "automatic" hidden parallelization
  - parallel domain decomposition: very flexible, compact visible code/algorithm
- Performance: satisfactory speed-up