1
Parallelizing finite element PDE solvers in an object-oriented framework
Xing Cai
Department of Informatics, University of Oslo
2
Outline of the Talk
Introduction & background
3 parallelization approaches
Implementational aspects
Numerical experiments
3
The Scientific Software Group
Department of Informatics, University of Oslo (faculty, post docs, Ph.D. students and part-time members):
Hans Petter Langtangen, Aslak Tveito, Are Magnus Bruaset (NO), Knut Andreas Lie (SINTEF), Bjørn Fredrik Nielsen (NR), Øyvind Hjelle (SINTEF), Kent Andre Mardal, Åsmund Ødegård, Joakim Sundnes, Wen Chen, Xing Cai, Ola Skavhaug, Aicha Bounaim, Linda Ingebrigtsen, Glenn Terje Lines, Tom Thorvaldsen
http://www.ifi.uio.no/~tpv
4
Projects
Simulation of electrical activity in the human heart
Simulation of the diastolic left ventricle
Numerical methods for option pricing
Software for numerical solution of PDEs
Scientific computing using a Linux cluster
Finite element modelling of ultrasound wave propagation
Multi-physics models by domain decomposition methods
Scripting techniques for scientific computing
Numerical modelling of reactive fluid flow in porous media
http://www.ifi.uio.no/~tpv
5
Diffpack
O-O software environment for scientific computation (C++)
Rich collection of PDE solution components – portable, flexible, extensible
http://www.nobjects.com
H. P. Langtangen, Computational Partial Differential Equations, Springer, 1999
6
The Diffpack Philosophy
(Diagram: PDE application areas – structural mechanics, porous media flow, aerodynamics, incompressible flow, water waves, stochastic PDEs, heat transfer, other PDE applications – built on top of core components: Field, Grid, Matrix, Vector, I/O, Ax=b, FEM, FDM.)
7
The Question
Starting point: a sequential PDE solver. How to do the parallelization?
The resulting parallel solvers should have
– good parallel efficiency
– good overall numerical performance
We need
– a good parallelization strategy
– a good and simple implementation of the strategy
8
A generic finite element PDE solver
Time stepping t0, t1, t2, ...
Spatial discretization (computational grid)
Solution of nonlinear problems
Solution of linearized problems
Iterative solution of Ax=b
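To make this structure concrete, here is a minimal C++ sketch of such a generic solver skeleton: a time loop around a nonlinear (Newton) loop around a linear solve. All names (LinearSystem, assembleSystem, solveLinearSystem) are placeholders invented for illustration, not Diffpack API.

#include <vector>
#include <cstdio>

// Hypothetical skeleton of a generic finite element PDE solver.
struct LinearSystem { /* sparse matrix A and right-hand side b */ };

static void assembleSystem(LinearSystem& sys, const std::vector<double>& u)
{ /* spatial discretization of the linearized problem on the grid */ }

static double solveLinearSystem(const LinearSystem& sys, std::vector<double>& du)
{ /* iterative solution of Ax=b (e.g. CG/BiCGStab); returns residual norm */ return 0.0; }

int main()
{
  const int nDof = 1000;                 // degrees of freedom
  std::vector<double> u(nDof, 0.0);      // current solution
  std::vector<double> du(nDof, 0.0);     // Newton correction

  double t = 0.0, dt = 0.01, T = 1.0;
  while (t < T) {                        // time stepping t0, t1, t2, ...
    for (int it = 0; it < 10; ++it) {    // nonlinear (Newton) iterations
      LinearSystem sys;
      assembleSystem(sys, u);            // linearized problem at the current state
      double res = solveLinearSystem(sys, du);
      for (int i = 0; i < nDof; ++i) u[i] += du[i];
      if (res < 1e-8) break;             // nonlinear convergence test
    }
    t += dt;
  }
  std::printf("done at t = %g\n", t);
  return 0;
}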
9
An observation
The computation-intensive part is the iterative solution of Ax=b
A parallel finite element PDE solver needs to run the linear algebra operations in parallel (a sketch of a parallel inner product follows this list):
– vector addition
– inner product of two vectors
– matrix-vector product
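As a hedged illustration of what "running the linear algebra in parallel" amounts to, the following C++/MPI sketch computes a distributed inner product: each process forms the dot product of its local vector pieces and the partial results are combined with MPI_Allreduce. The vector distribution and sizes are invented for the example.

#include <mpi.h>
#include <vector>
#include <cstdio>

// Distributed inner product: local dot product + global reduction.
int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each process owns an (artificial) piece of the global vectors x and y.
  const int nLocal = 4;
  std::vector<double> x(nLocal, 1.0), y(nLocal, 2.0);

  double localDot = 0.0;
  for (int i = 0; i < nLocal; ++i)
    localDot += x[i] * y[i];             // local linear algebra, reused as-is

  double globalDot = 0.0;                // inter-processor communication
  MPI_Allreduce(&localDot, &globalDot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("global inner product = %g (%d processes)\n", globalDot, size);

  MPI_Finalize();
  return 0;
}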
10
Several parallelization options
Automatic compiler parallelization
Loop-level parallelization (special compilation directives)
Domain decomposition:
– divide-and-conquer
– fully distributed computing
– flexible
– high parallel efficiency
11
A natural parallelization of PDE solvers
The global solution domain is partitioned into many smaller subdomains
One subdomain works as a ”unit”, with its sub-matrices and sub-vectors
No need to create global matrices and vectors physically
The global linear algebra operations can be realized by local operations + inter-processor communication (a matrix-vector product sketch follows below)
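The same idea for the matrix-vector product, as a hedged sketch: a 1D Laplacian stencil stands in for the assembled finite element matrix, each process multiplies its own rows, and only the two boundary ("halo") values are exchanged with neighbouring processes, so no global matrix or vector is ever formed. The partitioning is invented for the example.

#include <mpi.h>
#include <vector>
#include <cstdio>

// Distributed y = A x for the 1D Laplacian stencil (-1, 2, -1):
// local multiplication plus halo exchange of one value per neighbour.
int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int nLocal = 100;                      // rows owned by this process
  std::vector<double> x(nLocal, 1.0), y(nLocal, 0.0);

  // Exchange boundary (halo) values with left and right neighbours.
  double leftGhost = 0.0, rightGhost = 0.0;    // zero outside the global domain
  int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
  MPI_Sendrecv(&x[0], 1, MPI_DOUBLE, left, 0,
               &rightGhost, 1, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Sendrecv(&x[nLocal - 1], 1, MPI_DOUBLE, right, 1,
               &leftGhost, 1, MPI_DOUBLE, left, 1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  // Local part of the matrix-vector product.
  for (int i = 0; i < nLocal; ++i) {
    double xm = (i > 0)          ? x[i - 1] : leftGhost;
    double xp = (i < nLocal - 1) ? x[i + 1] : rightGhost;
    y[i] = 2.0 * x[i] - xm - xp;
  }

  if (rank == 0) std::printf("y[0] = %g on process 0\n", y[0]);
  MPI_Finalize();
  return 0;
}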
12
Grid partition
13
Linear-algebra level parallelization
An SPMD model
Reuse of existing code for local linear algebra operations
New code is needed only for the parallelization-specific tasks:
– grid partition (non-overlapping, overlapping)
– inter-processor communication routines
14
Object orientation
An add-on ”toolbox” containing all the parallelization-specific code
The ”toolbox” has many high-level routines
The existing sequential libraries are slightly modified to include a ”dummy” interface, thus incorporating ”fake” inter-processor communication (a sketch of this interface idea follows below)
A seamless coupling between the huge sequential libraries and the add-on toolbox
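A hedged sketch of the "dummy interface" idea; the class and method names (CommAdmin, MPICommAdmin, allReduceSum, innerProduct) are invented for illustration and are not the actual Diffpack toolbox API. The sequential library only sees the abstract base class, whose default implementation is a no-op, while the add-on toolbox supplies an MPI-backed subclass.

#include <mpi.h>
#include <cstdio>

class CommAdmin                     // abstract view seen by the sequential library
{
public:
  virtual ~CommAdmin() {}
  // Default ("fake") communication: a purely sequential no-op.
  virtual double allReduceSum(double localValue) { return localValue; }
};

class MPICommAdmin : public CommAdmin   // supplied by the add-on toolbox
{
public:
  double allReduceSum(double localValue) override
  {
    double globalValue = 0.0;
    MPI_Allreduce(&localValue, &globalValue, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return globalValue;
  }
};

// The sequential library only ever calls the base-class interface,
// so the same inner-product code runs unchanged in both settings.
double innerProduct(const double* x, const double* y, int n, CommAdmin& comm)
{
  double local = 0.0;
  for (int i = 0; i < n; ++i) local += x[i] * y[i];
  return comm.allReduceSum(local);
}

int main()
{
  double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
  CommAdmin seq;                          // sequential run uses the no-op interface
  std::printf("(x,y) = %g\n", innerProduct(x, y, 3, seq));
  return 0;
}

In a parallel run, an MPICommAdmin object would be attached instead, without touching the inner-product code itself.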
15
Straightforward Parallelization
Develop a sequential simulator, without paying attention to parallelism
Follow the Diffpack coding standards
Use the add-on toolbox for parallel computing
Add a few new statements to transform it into a parallel simulator
16
A Simple Coding Example

GridPartAdm* adm;   // access to parallelization functionality
LinEqAdm* lineq;    // administrator for linear system & solver
// ...
#ifdef PARALLEL_CODE
  adm->scan (menu);
  adm->prepareSubgrids ();
  adm->prepareCommunication ();
  lineq->attachCommAdm (*adm);
#endif
// ...
lineq->solve ();

Menu input:
set subdomain list = DEFAULT
set global grid = grid1.file
set partition-algorithm = METIS
set number of overlaps = 0
17
Solving an elliptic PDE
Highly unstructured grid
Discontinuity in the coefficient K
18
Measurements
130,561 degrees of freedom
Overlapping subgrids
Global BiCGStab using (block) ILU prec.
19
Parallel Vortex-Shedding Simulation incompressible Navier-Stokes solved by a pressure correction method
20
Simulation Snapshots: pressure field
21
Some CPU Measurements The pressure equation is solved by the CG method with “subdomain-wise” MILU prec.
22
Animated Pressure Field
23
Domain Decomposition
Solution of the original large problem through iteratively solving many smaller subproblems
Can be used as a solution method or as a preconditioner
Flexibility – localized treatment of irregular geometries, singularities, etc.
Very efficient numerical methods – even on sequential computers
Suitable for coarse-grained parallelization
24
Overlapping DD
Example: solving the Poisson problem on the unit square (a sketch of the overlapping Schwarz iteration follows below)
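A hedged LaTeX sketch of the classical overlapping (additive Schwarz) iteration for this Poisson example, assuming Dirichlet coupling on the artificial subdomain boundaries; the slide's own formulas are not reproduced here.

% Poisson problem on the unit square, partitioned into overlapping
% subdomains \Omega_1,\dots,\Omega_M.
\[
-\nabla^2 u = f \ \text{in}\ \Omega=(0,1)^2, \qquad u = g \ \text{on}\ \partial\Omega .
\]
% One Schwarz sweep: solve independently on each subdomain, using the
% latest global iterate as Dirichlet data on the artificial boundary.
\[
\begin{aligned}
-\nabla^2 u_s^{k+1} &= f && \text{in } \Omega_s, \\
u_s^{k+1} &= u^{k} && \text{on } \partial\Omega_s \setminus \partial\Omega, \\
u_s^{k+1} &= g && \text{on } \partial\Omega_s \cap \partial\Omega,
\end{aligned}
\qquad s = 1,\dots,M,
\]
% then compose the new global iterate u^{k+1} from the subdomain
% solutions u_s^{k+1} on the overlapping pieces.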
25
Observations
DD is a good parallelization strategy
The approach is not PDE-specific
A program for the original global problem can be reused (modulo B.C.) for each subdomain
Must communicate overlapping point values
No need for global data
Data distribution is implied
Explicit temporal schemes are a special case where no iteration is needed (“exact DD”)
26
Goals for the Implementation
Reuse the sequential solver as the subdomain solver
Add DD management and communication as separate modules
Collect common operations in generic library modules
Flexibility and portability
A simplified parallelization process for the end-user
27
Generic Programming Framework
28
Making the Simulator Parallel

class SimulatorP : public SubdomainFEMSolver,
                   public Simulator
{
  // ... just a small amount of code
  virtual void createLocalMatrix ()
    { Simulator::makeSystem (); }
};

(Accompanying class diagram: SubdomainSimulator, SubdomainFEMSolver, Administrator, SimulatorP, Simulator.)
29
Application
Poisson equation on the unit square
DD as the global solution method
Subdomain solvers use CG + FFT
Fixed number of subdomains M = 32 (independent of P)
Straightforward parallelization of an existing simulator
P: number of processors
30
A large-scale problem
Solving an elliptic boundary value problem on an unstructured grid
31
Combined Approach
Use a CG-like method as the basic solver (i.e. use a parallelized Diffpack linear solver)
Use DD as the preconditioner (i.e. SimulatorP is invoked as a preconditioning solve)
Combine with coarse grid correction
A CG-like method + DD prec. is normally faster than DD as a basic solver (the preconditioned CG iteration is sketched below)
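For reference, the standard preconditioned conjugate gradient iteration, where each application of M^{-1} is taken to be one DD sweep (subdomain solves, optionally with coarse grid correction); this is the textbook algorithm, not code from the talk.

% Preconditioned CG for Ax = b with preconditioner M^{-1}
% (here: one DD sweep, possibly with coarse grid correction).
\[
\begin{aligned}
& r_0 = b - A x_0, \quad z_0 = M^{-1} r_0, \quad p_0 = z_0, \\
& \text{for } k = 0, 1, 2, \dots \\
& \quad \alpha_k = \frac{(r_k, z_k)}{(A p_k, p_k)}, \qquad
  x_{k+1} = x_k + \alpha_k p_k, \qquad
  r_{k+1} = r_k - \alpha_k A p_k, \\
& \quad z_{k+1} = M^{-1} r_{k+1}, \qquad
  \beta_k = \frac{(r_{k+1}, z_{k+1})}{(r_k, z_k)}, \qquad
  p_{k+1} = z_{k+1} + \beta_k p_k .
\end{aligned}
\]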
32
Elasticity
Test case: 2D linear elasticity (a vector equation), 241 x 241 global grid
Straightforward parallelization based on an existing Diffpack simulator
33
2D Linear Elasticity
BiCGStab + DD prec. as the global solver
Multigrid V-cycle in subdomain solves
I: number of global BiCGStab iterations needed
P: number of processors (P = #subdomains)
34
2D Linear Elasticity
35
Two-Phase Porous Media Flow
PEQ (pressure equation) and SEQ (saturation equation); a hedged sketch of a standard formulation follows below
BiCGStab + DD prec. for the global pressure eq.
Multigrid V-cycle in subdomain solves
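The slide's PEQ and SEQ were shown as images; as a hedged point of reference only, a standard incompressible two-phase (pressure/saturation) formulation reads as follows. This is an assumption about their general form, not a transcription of the slide.

% Standard incompressible two-phase flow model (an assumption about the
% slide's PEQ/SEQ, not a transcription of them):
\[
\text{PEQ:}\quad \nabla \cdot \bigl( \lambda(S)\, K \,\nabla p \bigr) = q ,
\qquad
v = -\lambda(S)\, K \,\nabla p ,
\]
\[
\text{SEQ:}\quad \phi \,\frac{\partial S}{\partial t}
  + \nabla \cdot \bigl( f_w(S)\, v \bigr) = q_w ,
\]
% with p the pressure, S the water saturation, K the permeability,
% \lambda the total mobility, f_w the fractional flow function and \phi the porosity.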
36
Two-phase Porous Media Flow History of water saturation propagation
37
Nonlinear Water Waves Fully nonlinear 3D water waves Primary unknowns:
38
Nonlinear Water Waves
CG + DD prec. as the global solver
Multigrid V-cycle as the subdomain solver
Fixed number of subdomains M = 16 (independent of P)
Subgrids from a partition of a global 41x41x41 grid
39
Parallel Simulation of a 3D Acoustic Field
A Linux cluster: 48 Pentium III 500 MHz processors, 100 Mbit interconnect
SGI Cray Origin 2000: MIPS R10000
Linear-algebra level (LAL) parallelization; 2 cases:
– Linear model (linear wave equation), solved with an explicit method
– Nonlinear model, solved with an implicit method
40
Mathematical Nonlinear Model
41
Results - Linear Model

            Origin 2000            Linux Cluster
CPUs    CPU-time   Speedup     CPU-time   Speedup
   1     944.83      N/A        640.7       N/A
   2     549.21      1.72       327.8       1.95
   4     282.75      3.34       174.0       3.68
   8     155.01      6.10        90.98      7.04
  16      80.41     11.8         46.35     13.8
  24      65.63     14.4         34.05     18.8
  32      49.97     18.9         26.27     24.4
  48      35.23     26.8         17.74     36.1
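A hedged reading of the speedup columns: they appear to be measured against the single-CPU run, i.e.

\[
S(P) = \frac{T(1)}{T(P)},
\qquad \text{e.g. } S(2) = \frac{944.83}{549.21} \approx 1.72
\ \text{on the Origin 2000.}
\]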
42
Results - Nonlinear Model

            Origin 2000            Linux Cluster
CPUs    CPU-time   Speedup     CPU-time   Speedup
   2    8670.8       N/A       6681.5       N/A
   4    4726.5       3.75      3545.9       3.77
   8    2404.2       7.21      1881.1       7.10
  16    1325.6      13.0        953.89     14.0
  24    1043.7      16.6        681.77     19.6
  32     725.23     23.9        563.54     23.7
43
Summary
Goal: provide software and programming rules for easy parallelization of sequential simulators
Applicable to a wide range of PDE problems
Three parallelization approaches:
– parallelization at the linear algebra level: “automatic” parallelization
– domain decomposition: very flexible, compact visible code/algorithm
– combined approach
Performance: satisfactory speed-up
http://www.ifi.uio.no/~tpv