Parallelizing Unstructured FEM Computation Xing Cai Department of Informatics University of Oslo
Contents Background & introduction Parallelization approaches: at the linear algebra level, based on domain decomposition Implementation aspects Numerical experiments
The Question Starting point: sequential FEM code unstructured grids, implicit computation… How to do the parallelization? We need a good parallelization strategy a good and simple implementation of the strategy Resulting parallel solvers should have good overall numerical performance good parallel efficiency
Basic Strategy Different approaches to parallelization Automatic compiler parallelization Loop level parallelization We use the strategy of divide & conquer divide the global domain into subdomains one process is responsible for one subdomain make use of the message-passing paradigm Domain decomposition at different levels
A Generic Finite Element PDE Solver Time stepping t0, t1, t2… Spatial discretization on the computational grid Solution of nonlinear problems Solution of linearized problems Iterative solution of Ax=b
Important Observations The computation-intensive part is the iterative solution of Ax=b A parallel finite element PDE solver needs to run the linear algebra kernels in parallel vector addition inner-product of two vectors matrix-vector product Two types of inter-processor communication Ratio computation/communication is high Relatively tolerant of slow communication
Solution Domain Partition Partition of the elements is non-overlapping Grid points are shared between neighboring subdomains on the internal boundaries (figure: non-overlapping grid partition)
Natural Parallelization of PDE Solvers The global solution domain is partitioned into many smaller subdomains One subdomain works as a ”unit”, with its sub-matrices and sub-vectors No need to create global matrices and vectors physically The global linear algebra operations can be realized by local operations + inter-processor communication
Work in Parallel Assembly of local stiffness matrices etc. is embarrassingly parallel Vector addition/update is also embarrassingly parallel Inner-product between 2 distributed vectors requires collective communication Matrix-vector product requires immediate neighbors to exchange info
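As a concrete illustration of the collective-communication case, the sketch below computes the inner product of a distributed vector as a local dot product followed by a global reduction. The function name and the ownership convention (each shared boundary point counted by exactly one process) are illustrative assumptions, not Diffpack's actual API:

  #include <mpi.h>
  #include <vector>

  // Inner product of a distributed vector: each process holds the entries
  // of its own subdomain; we assume each shared boundary point is owned
  // by exactly one process, so no entry is counted twice.
  double parallelInnerProduct (const std::vector<double>& x,
                               const std::vector<double>& y)
  {
    double local = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
      local += x[i] * y[i];                     // purely local work
    double global = 0.0;
    MPI_Allreduce (&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;   // every process receives the same global value
  }

Vector addition/update, by contrast, needs no communication at all: each process simply updates its own entries.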
Overlapping Grid Partition Necessary for preconditioning etc.
Linear-algebra Level Parallelization An SPMD model Reuse of existing code for local linear algebra operations Need new code for the parallelization-specific tasks grid partition (non-overlapping, overlapping) communication pattern recognition inter-processor communication routines
OOP Simplifies Parallelization Develop a small add-on ”toolbox” containing all the parallelization-specific code The ”toolbox” has many high-level routines and hides the low-level MPI details The existing sequential libraries are slightly modified to include a ”dummy” interface, thus incorporating ”fake” inter-processor communication routines A seamless coupling between the huge sequential libraries and the add-on toolbox
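A minimal sketch of the ”dummy” interface idea, with hypothetical class names (CommAdm and MPICommAdm are not the actual Diffpack classes): the sequential library calls a virtual hook that does nothing, and the add-on toolbox substitutes a subclass that really communicates:

  // Base class used by the sequential library: the communication hook
  // is a no-op, so purely sequential runs are unaffected.
  class CommAdm
  {
  public:
    virtual ~CommAdm () {}
    virtual void updateSharedValues (double* values, int n) { /* do nothing */ }
  };

  // Supplied by the add-on toolbox: overrides the hook with real
  // message passing (e.g. MPI) between neighboring subdomains.
  class MPICommAdm : public CommAdm
  {
  public:
    virtual void updateSharedValues (double* values, int n)
    {
      // exchange/accumulate the entries shared with neighbor processes
    }
  };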
Diffpack O-O software environment for scientific computation (C++) Rich collection of PDE solution components - portable, flexible, extensible http://www.nobjects.com H.P.Langtangen, Computational Partial Differential Equations, Springer 1999
Straightforward Parallelization Develop a sequential simulator, without paying attention to parallelism Follow the Diffpack coding standards Use the add-on toolbox for parallel computing Add a few new statements for transformation to a parallel simulator
Linear-Algebra-Level Approach Parallelize matrix/vector operations inner-product of two vectors matrix-vector product preconditioning - block contribution from subgrids Easy to use access to all Diffpack v3.0 CG-like methods, preconditioners and convergence monitors “hidden” parallelization need only to add a few lines of new code arbitrary choice of number of processors at run-time
A Simple Coding Example

  GridPartAdm* adm;   // access to parallelization functionality
  LinEqAdm* lineq;    // administrator for linear system & solver
  // ...
  #ifdef PARALLEL_CODE
  adm->scan (menu);
  adm->prepareSubgrids ();
  adm->prepareCommunication ();
  lineq->attachCommAdm (*adm);
  #endif
  lineq->solve ();

Corresponding menu input:

  set subdomain list = DEFAULT
  set global grid = grid1.file
  set partition-algorithm = METIS
  set number of overlaps = 0
Solving an Elliptic PDE Highly unstructured grid Discontinuity in the coefficient K (0.1 & 1)
Measurements 130,561 degrees of freedom Overlapping subgrids Global BiCGStab using (block) ILU prec.
Parallel Simulation of 3D Acoustic Field 3D nonlinear model
3D Nonlinear Acoustic Field Simulation
Comparison between Origin 2000 and Linux cluster, 1,030,301 grid points

          Origin 2000            Linux cluster
  CPUs    CPU-time   Speedup     CPU-time   Speedup
   2      8670.8     N/A         6681.5     N/A
   4      4726.5     3.75        3545.9     3.77
   8      2404.2     7.21        1881.1     7.10
  16      1325.6     13.0        953.89     14.0
  24      1043.7     16.6        681.77     19.6
  32      725.23     23.9        563.54     23.7
  48      557.61     31.1        673.77     19.8
Incompressible Navier-Stokes Numerical strategy: operator splitting Calculation of an intermediate velocity in a predictor-corrector way Solution of a Poisson equation Correction of the intermediate velocity
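The time loop then has the following shape; this is a minimal compile-able sketch with hypothetical stub functions standing in for the actual discrete operators:

  #include <vector>
  struct Field { std::vector<double> values; };  // a discrete velocity/pressure field

  // Stubs standing in for the real discrete operators (names are illustrative)
  void predictVelocity (Field& uStar, const Field& u) { /* explicit predictor-corrector step */ }
  void solvePoisson    (Field& p, const Field& uStar) { /* implicit solve, e.g. by CG */ }
  void correctVelocity (Field& u, const Field& uStar, const Field& p) { /* explicit correction */ }

  int main ()
  {
    Field u, uStar, p;
    const int nSteps = 100;           // number of time steps
    for (int n = 0; n < nSteps; ++n) {
      predictVelocity (uStar, u);     // step 1: intermediate velocity
      solvePoisson    (p, uStar);     // step 2: pressure Poisson equation
      correctVelocity (u, uStar, p);  // step 3: correct the intermediate velocity
    }
    return 0;
  }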
Incompressible Navier-Stokes
Explicit schemes for predicting and correcting the velocity
Implicit solution of the pressure by CG
Measurements on a Linux cluster

   P    CPU-time   Speedup   Efficiency
   1    665.45     N/A       N/A
   2    329.57     2.02      1.01
   4    166.55     4.00      1.00
   8    89.98      7.40      0.92
  16    48.96      13.59     0.85
  24    34.85      19.09     0.80
  48    34.22      19.45     0.41
Example: Vortex-Shedding
Simulation Snapshots Pressure
Simulation Snapshots Velocity
Animated Pressure Field
Parallel Simulation of Heart Special code for balanced partition of coupled heart-torso grids Simple extension of sequential elliptic and parabolic solvers
Higher Level Parallelization Apply overlapping Schwarz methods as both stand-alone solution method and preconditioner Solution of the original large problem through iteratively solving many smaller subproblems Flexibility -- localized treatment of irregular geometries, singularities etc Inherent parallelism, suitable for coarse grained parallelization
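As a minimal sketch of one additive Schwarz iteration (all names are illustrative; the serial loop stands in for per-process work, since in the parallel version each process handles its own subdomain):

  #include <vector>
  struct Subdomain { /* local grid, matrix, right-hand side ... */ };

  // Stub: solve the local problem on one subdomain, taking artificial
  // Dirichlet data on the internal (overlapping) boundary from the
  // previous global iterate -- this is where the sequential solver is reused.
  void solveLocal (const Subdomain& sd, const std::vector<double>& uOld,
                   std::vector<double>& uLocal) { }

  // Stub: write a subdomain solution back into the global iterate
  // (values in the overlap come from the owning subdomain).
  void updateGlobal (const Subdomain& sd, const std::vector<double>& uLocal,
                     std::vector<double>& uGlobal) { }

  void schwarzIteration (const std::vector<Subdomain>& subdomains,
                         std::vector<double>& uGlobal)
  {
    const std::vector<double> uOld = uGlobal;  // additive: all solves use the old iterate
    for (std::size_t s = 0; s < subdomains.size(); ++s) {
      std::vector<double> uLocal;
      solveLocal (subdomains[s], uOld, uLocal);       // independent -> parallel
      updateGlobal (subdomains[s], uLocal, uGlobal);  // exchange overlap values
    }
    // repeat until the global solution converges
  }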
One Example of DD Poisson Eq. on unit square
Observations DD is a good parallelization strategy The approach is not PDE-specific A program for the original global problem can be reused (modulo B.C.) for each subdomain Must communicate overlapping point values No need for global data Data distribution implied Explicit temporal schemes are a special case where no iteration is needed (“exact DD”)
Goals for the Implementation Reuse sequential solver as subdomain solver Add DD management and communication as separate modules Collect common operations in generic library modules Flexibility and portability Simplified parallelization process for the end-user
Generic Programming Framework
The Subdomain Simulator: sequential solver + add-on communication
The Communicator Need functionality for exchanging point values inside the overlapping regions The communicator works with a hidden communication model Make use of the add-on toolbox for linear-algebra level parallelization MPI in use, but easy to change
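Underneath, such an exchange can look like the following sketch (names are illustrative; the real communicator hides these calls behind high-level routines):

  #include <mpi.h>
  #include <vector>

  // For one neighboring subdomain: send our values at the shared overlap
  // points and receive the neighbor's values at the points it owns.
  void exchangeWithNeighbor (int neighborRank,
                             const std::vector<double>& sendVals,
                             std::vector<double>& recvVals)
  {
    MPI_Sendrecv (sendVals.data (), (int) sendVals.size (), MPI_DOUBLE,
                  neighborRank, 0,
                  recvVals.data (), (int) recvVals.size (), MPI_DOUBLE,
                  neighborRank, 0,
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }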
Making A Simulator Parallel

  class SimulatorP : public SubdomainFEMSolver, public Simulator
  {
    // ... just a small amount of code
    virtual void createLocalMatrix ()
    { Simulator::makeSystem (); }
  };

Class hierarchy: Administrator, SubdomainSimulator, Simulator, SubdomainFEMSolver, SimulatorP
Performance Algorithmic efficiency: efficiency of the original sequential simulator(s), efficiency of the domain decomposition method Parallel efficiency: communication overhead (low), coarse grid correction overhead (normally low), load balancing (subproblem size, work on subdomain solves)
A Simple Application Poisson Equation on unit square DD as the global solution method Subdomain solvers use CG+FFT Fixed number of subdomains M=32 (independent of P) Straightforward parallelization of an existing simulator P: number of processors
Combined Approach Use a CG-like method as basic solver (i.e. use a parallelized Diffpack linear solver) Use DD as preconditioner (i.e. SimulatorP is invoked as a preconditioning solve) Combine with coarse grid correction CG-like method + DD prec. is normally faster than DD as a basic solver
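The combination can be pictured as follows: a minimal sketch with illustrative stub functions, where the preconditioning step z = M^{-1} r of the CG-like method performs one Schwarz sweep plus a coarse-grid correction instead of iterating DD to convergence:

  #include <vector>
  using Vec = std::vector<double>;

  // Stubs (illustrative): one pass of parallel subdomain solves, and a
  // cheap solve on a global coarse grid.
  Vec schwarzSweep (const Vec& r)         { return r; }
  Vec coarseGridCorrection (const Vec& r) { return r; }

  // DD as preconditioner: additive combination of the two contributions,
  // applied once per iteration of the CG-like method.
  Vec applyPreconditioner (const Vec& r)
  {
    Vec z  = schwarzSweep (r);
    Vec zc = coarseGridCorrection (r);
    for (std::size_t i = 0; i < z.size (); ++i)
      z[i] += zc[i];
    return z;
  }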
Two-Phase Porous Media Flow Simulation result obtained on 16 processors
Two-phase Porous Media Flow History of saturation for water and oil
Two-Phase Porous Media Flow SEQ: PEQ: BiCGStab + DD prec. for global pressure eq. Multigrid V-cycle in subdomain solves
3D Nonlinear Water Waves Fully nonlinear 3D water waves Primary unknowns:
3D Nonlinear Water Waves Global 3D grid: 49x49x41 Global solver: CG + overlapping Schwarz prec. Multigrid V-cycle as subdomain solver CPU measurement of a total of 32 time steps Parallel simulation on a Linux cluster
Elasticity Test case: 2D linear elasticity (a vector equation), 241 x 241 global grid Straightforward parallelization based on an existing Diffpack simulator
2D Linear Elasticity BiCGStab + DD prec. as global solver Multigrid V-cycle in subdomain solves I: number of global BiCGStab iterations needed P: number of processors (P=#subdomains)
2D Linear Elasticity
Summary Goal: provide software and programming rules for easy parallelization of FEM codes Applicable to a wide range of PDE problems Two parallelization strategies: parallelization at the linear algebra level: “automatic” hidden parallelization parallel domain decomposition: very flexible, compact visible code/algorithm Performance: satisfactory speed-up