Gordon Bell Prize Finalist Presentation, SC’99
Achieving High Sustained Performance in an Unstructured Mesh CFD Application
http://www.mcs.anl.gov/petsc-fun3d

Kyle Anderson, NASA Langley Research Center
William Gropp, Argonne National Laboratory
Dinesh Kaushik, Old Dominion University & Argonne National Laboratory
David Keyes, Old Dominion University, LLNL & ICASE
Barry Smith, Argonne National Laboratory

15-minute presentation for the first DOE NGI PI meeting

Application Performance History: 3 orders of magnitude in 10 years

Features of this 1999 Submission
- Based on a “legacy” (but contemporary) CFD application with significant F77 code reuse
- Portable, message-passing, library-based parallelization, run on NT boxes through Tflop/s ASCI platforms
- Simple multithreaded extension (for ASCI Red)
- Sparse, unstructured data, implying memory indirection with only modest reuse; nothing in this category has ever advanced to the Bell finalist round
- Wide applicability to other implicitly discretized multiple-scale PDE workloads, of interagency and interdisciplinary interest
- Extensive profiling has led to follow-on algorithmic research

Application Domain: Computational Aerodynamics

Background of FUN3D Application
- Tetrahedral, vertex-centered unstructured grid code developed by W. K. Anderson (NASA LaRC) for the steady compressible and incompressible Euler and Navier-Stokes equations (with one-equation turbulence modeling)
- Used in airplane, automobile, and submarine applications for analysis and design
- Standard discretization is 2nd-order Roe scheme for convection and Galerkin for diffusion
- Newton-Krylov solver with global point-block-ILU preconditioning, with false timestepping for nonlinear continuation toward steady state; competitive with FAS multigrid in practice
- Legacy implementation/ordering is vector-oriented
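
The “false timestepping” (pseudo-transient continuation) mentioned above can be written generically as below; the particular timestep-growth rule shown (switched evolution/relaxation, SER) is a common choice included only as an illustration, not a statement of FUN3D’s exact update.

% Pseudo-transient continuation toward the steady state F(u) = 0:
% at pseudo-time step l, take one (approximate) Newton step on
\left( \frac{1}{\Delta t^{\,l}}\, I + \frac{\partial F}{\partial u}(u^{l}) \right) \delta u = -F(u^{l}),
\qquad u^{l+1} = u^{l} + \delta u,
% and grow the timestep as the residual falls (SER rule):
\Delta t^{\,l} = \Delta t^{\,0}\, \frac{\|F(u^{0})\|}{\|F(u^{l})\|} .

As the timestep grows without bound, each step approaches a pure Newton step, which is what makes the continuation asymptotically quadratic.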

Surface Visualization of Test Domain for Computing Flow over an ONERA M6 Wing
- Wing surface outlined in green triangles
- Nearly 2.8 M vertices in this computational domain

Fixed-size Parallel Scaling Results (Flop/s)

Fixed-size Parallel Scaling Results (Time in seconds)

Inside the Parallel Scaling Results on ASCI Red
ONERA M6 wing test case: tetrahedral grid of 2.8 million vertices (about 11 million unknowns) on up to 3072 ASCI Red nodes (each with dual 333 MHz Pentium Pro processors)

Algorithm: Newton-Krylov-Schwarz
- Newton: nonlinear solver, asymptotically quadratic
- Krylov: accelerator, spectrally adaptive
- Schwarz: preconditioner, parallelizable
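
For reference, the way the three layers nest can be written compactly (standard Newton-Krylov notation, not anything specific to this submission): the outer Newton step solves a linearized system only approximately, and the Krylov solver that does so is preconditioned by Schwarz (defined three slides below).

% Outer Newton iteration on the nonlinear residual F(u) = 0:
F'(u^k)\,\delta u^k = -F(u^k), \qquad u^{k+1} = u^k + \lambda_k\,\delta u^k,
% with the inner Krylov solve carried only to an inexact-Newton tolerance:
\|F(u^k) + F'(u^k)\,\delta u^k\| \;\le\; \eta_k\,\|F(u^k)\| .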

Merits of NKS Algorithm/Implementation
Relative characteristics: the “exponents” are naturally good
- Convergence scalability: weak (or no) degradation with problem size and parallel granularity (with use of small global problems in the Schwarz preconditioner)
- Implementation scalability: no degradation in the ratio of surface communication to volume work (in the problem-scaled limit); only modest degradation from global operations (for sufficiently richly connected networks)
Absolute characteristics: the “constants” can be made good
- Operation count complexity: residual reductions of 10^{-9} in about 10^3 “work units”
- Per-processor performance: up to 25% of theoretical peak
Overall, machine-epsilon solutions require as little as 15 microseconds per degree of freedom!

Primary PDE Solution Kernels
- Vertex-based loops: state vector and auxiliary vector updates
- Edge-based “stencil op” loops: residual evaluation, approximate Jacobian evaluation, and Jacobian-vector products (often replaced with a matrix-free form, involving residual evaluation); a schematic loop of this kind is sketched below
- Sparse, narrow-band recurrences: approximate factorization and back substitution
- Vector inner products and norms: orthogonalization/conjugation, convergence progress and stability checks
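
To make the “edge-based stencil op” pattern concrete, here is a minimal sketch in C of an edge-based residual accumulation over an unstructured mesh. The data layout (edge_nodes, normal, and the placeholder flux routine) is hypothetical and only illustrates the memory-indirection pattern, not FUN3D’s actual data structures or flux formulas.

/* Placeholder flux: averages the two states and scales by a "face area".
   A real code would evaluate a Roe (approximate Riemann) flux here. */
static void placeholder_flux(const double *ql, const double *qr,
                             const double *normal, double *flux, int nvar)
{
    double area = normal[0] + normal[1] + normal[2];
    for (int k = 0; k < nvar; k++)
        flux[k] = 0.5 * (ql[k] + qr[k]) * area;
}

void edge_based_residual(int nedge, int nnode, int nvar,
                         const int *edge_nodes,   /* 2 vertex ids per edge    */
                         const double *normal,    /* 3 components per edge    */
                         const double *q,         /* nvar unknowns per vertex */
                         double *res)             /* same layout as q         */
{
    for (int i = 0; i < nnode * nvar; i++) res[i] = 0.0;

    double flux[8];                   /* scratch; assumes nvar <= 8 */
    for (int e = 0; e < nedge; e++) {
        int nl = edge_nodes[2*e];     /* left vertex  (indirect access) */
        int nr = edge_nodes[2*e + 1]; /* right vertex (indirect access) */

        /* numerical flux across the dual face associated with this edge */
        placeholder_flux(&q[nvar*nl], &q[nvar*nr], &normal[3*e], flux, nvar);

        /* scatter-add: each edge updates both of its endpoints */
        for (int k = 0; k < nvar; k++) {
            res[nvar*nl + k] += flux[k];
            res[nvar*nr + k] -= flux[k];
        }
    }
}

The indirect accesses through edge_nodes are what the submission means by “memory indirection with only modest reuse”: each vertex’s state is touched once per incident edge, so interlacing the unknowns per vertex and reordering edges for locality (see the single-processor performance table later) directly affect the achieved Mflop/s.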

Additive Schwarz Preconditioning for Au = f in Ω
- Form the preconditioner B out of (approximate) local solves on (overlapping) subdomains
- Let R_i and R_i^T be Boolean gather and scatter operations, mapping between a global vector and its i-th subdomain support
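
In this notation, the standard one-level additive Schwarz preconditioner assembled from those operators is

B \;=\; \sum_{i=1}^{P} R_i^{T}\,\tilde{A}_i^{-1}\,R_i ,
\qquad \tilde{A}_i = R_i\,A\,R_i^{T},

where each subdomain solve \tilde{A}_i^{-1} may itself be approximate (e.g., an incomplete factorization), and a two-level variant adds a coarse-grid term R_0^{T} A_0^{-1} R_0, which is what the “2-level” row of the table on the next slide refers to.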

Iteration Count Estimates from the Schwarz Theory [ref: Smith, Bjørstad & Gropp, 1996, Cambridge Univ. Press]
Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square root of the condition number of the Schwarz-preconditioned system.
In terms of N and P, where for d-dimensional isotropic problems N = h^{-d} and P = H^{-d}, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning type       | in 2D          | in 3D
Point Jacobi               | O(N^{1/2})     | O(N^{1/3})
Domain Jacobi              | O((NP)^{1/4})  | O((NP)^{1/6})
1-level Additive Schwarz   | O(P^{1/2})     | O(P^{1/3})
2-level Additive Schwarz   | O(1)           | O(1)
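
As a quick sanity check of the first row (an illustrative derivation using standard elliptic estimates, not anything specific to this talk):

% Krylov iteration counts scale roughly as the square root of the condition number,
k \;\approx\; C\,\sqrt{\kappa(B A)} .
% Point Jacobi leaves the O(h^{-2}) conditioning of a second-order elliptic operator intact,
% and with N = h^{-d} this gives
k = O(h^{-1}) = O(N^{1/d}) \;\Longrightarrow\; O(N^{1/2}) \text{ in 2D}, \quad O(N^{1/3}) \text{ in 3D},
% matching the first row of the table above.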

Time-Implicit Newton-Krylov-Schwarz Method
For nonlinear robustness, the NKS iteration is wrapped in time-stepping:

for (l = 0; l < n_time; l++) {                  # n_time ~ 50
  select time step
  for (k = 0; k < n_Newton; k++) {              # n_Newton ~ 1
    compute nonlinear residual and Jacobian
    for (j = 0; j < n_Krylov; j++) {            # n_Krylov ~ 50
      forall (i = 0; i < n_Precon; i++) {
        solve subdomain problems concurrently
      }                                         // End of loop over subdomains
      perform Jacobian-vector product
      enforce Krylov basis conditions
      update optimal coefficients
      check linear convergence
    }                                           // End of linear solver
    perform DAXPY update
    check nonlinear convergence
  }                                             // End of nonlinear loop
}                                               // End of time-step loop

Separation of Concerns: User Code / PETSc Library
- User code: Main routine, Application initialization, Function evaluation, Jacobian evaluation, Post-processing
- PETSc code: Timestepping Solvers (TS), Nonlinear Solvers (SNES), Linear Solvers (SLES), built on PC (preconditioners) and KSP (Krylov methods)
A minimal driver illustrating this split is sketched below.
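
The sketch below is written against the present-day PETSc interface (SNES/KSP/PC; the SLES object named on the slide was later absorbed into KSP). The global sizes, preallocation figures, and the FormFunction/FormJacobian routines are placeholders standing in for the application’s own residual and Jacobian code; it is an illustrative skeleton, not PETSc-FUN3D’s driver.

#include <petscsnes.h>

/* User-side callbacks: the library calls back into these (bodies omitted;
 * in an application they would wrap the legacy flux/Jacobian kernels). */
extern PetscErrorCode FormFunction(SNES, Vec, Vec, void *);
extern PetscErrorCode FormJacobian(SNES, Vec, Mat, Mat, void *);

int main(int argc, char **argv)
{
  SNES  snes;
  Vec   x, r;
  Mat   J;
  void *app_ctx = NULL;                 /* application context (placeholder) */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

  /* Distributed data structures (sizes and preallocation are placeholders) */
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, 1000000));
  PetscCall(VecSetFromOptions(x));
  PetscCall(VecDuplicate(x, &r));
  PetscCall(MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                         1000000, 1000000, 30, NULL, 20, NULL, &J));

  /* Library-side solver hierarchy: SNES -> KSP -> PC */
  PetscCall(SNESCreate(PETSC_COMM_WORLD, &snes));
  PetscCall(SNESSetFunction(snes, r, FormFunction, app_ctx));
  PetscCall(SNESSetJacobian(snes, J, J, FormJacobian, app_ctx));
  PetscCall(SNESSetFromOptions(snes));  /* e.g., -pc_type asm -sub_pc_type ilu */

  PetscCall(SNESSolve(snes, NULL, x));

  PetscCall(SNESDestroy(&snes));
  PetscCall(MatDestroy(&J));
  PetscCall(VecDestroy(&x));
  PetscCall(VecDestroy(&r));
  PetscCall(PetscFinalize());
  return 0;
}

Runtime options such as -pc_type asm -sub_pc_type ilu select the (overlapping) Schwarz preconditioner with incomplete-factorization subdomain solves without changing the user code, which is the point of the user-code/library split shown on the slide.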

Key Features of Implementation Strategy
- Follow the “owner computes” rule under the dual constraints of minimizing the number of messages and overlapping communication with computation
- Each processor “ghosts” its stencil dependences in its neighbors
- Ghost nodes ordered after contiguous owned nodes
- Domain mapped from (user) global ordering into local orderings
- Scatter/gather operations created between local sequential vectors and global distributed vectors, based on runtime connectivity patterns (see the sketch below)
- Newton-Krylov-Schwarz operations translated into local tasks and communication tasks
- Profiling used to help eliminate performance bugs in communication and memory hierarchy
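
A hedged sketch of the ghost-update pattern using PETSc’s VecScatter (current API; the sizes and ghost indices below are placeholders for the connectivity discovered at runtime, and PetscInitialize is assumed to have been called already):

#include <petscvec.h>

/* Update locally cached ghost values from the global vector.
 * Owned values live in a parallel vector 'global'; each process keeps a
 * sequential vector 'local' holding owned values followed by ghost values. */
PetscErrorCode ghost_update_example(void)
{
  Vec        global, local;
  IS         from, to;
  VecScatter scatter;
  PetscInt   n_owned = 100, n_ghost = 2;
  PetscInt   ghost_global_idx[2] = {17, 42};  /* global ids of ghost nodes (placeholder) */
  PetscInt   ghost_local_idx[2];

  PetscCall(VecCreateMPI(PETSC_COMM_WORLD, n_owned, PETSC_DETERMINE, &global));
  PetscCall(VecCreateSeq(PETSC_COMM_SELF, n_owned + n_ghost, &local));

  /* Ghosts are stored after the contiguous owned block in the local vector */
  for (PetscInt i = 0; i < n_ghost; i++) ghost_local_idx[i] = n_owned + i;

  PetscCall(ISCreateGeneral(PETSC_COMM_SELF, n_ghost, ghost_global_idx, PETSC_COPY_VALUES, &from));
  PetscCall(ISCreateGeneral(PETSC_COMM_SELF, n_ghost, ghost_local_idx, PETSC_COPY_VALUES, &to));
  PetscCall(VecScatterCreate(global, from, local, to, &scatter));

  /* The Begin/End split allows overlapping this communication with local work */
  PetscCall(VecScatterBegin(scatter, global, local, INSERT_VALUES, SCATTER_FORWARD));
  /* ... edge-based work on purely owned data could proceed here ... */
  PetscCall(VecScatterEnd(scatter, global, local, INSERT_VALUES, SCATTER_FORWARD));

  PetscCall(VecScatterDestroy(&scatter));
  PetscCall(ISDestroy(&from));
  PetscCall(ISDestroy(&to));
  PetscCall(VecDestroy(&global));
  PetscCall(VecDestroy(&local));
  return PETSC_SUCCESS;
}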

Background of PETSc
- Developed by Gropp, Smith, McInnes & Balay (ANL) to support research, prototyping, and production parallel solutions of operator equations in message-passing environments
- Distributed data structures as fundamental objects: index sets, vectors/gridfunctions, and matrices/arrays
- Iterative linear and nonlinear solvers, combinable modularly, recursively, and extensibly
- Portable, and callable from C, C++, and Fortran
- Uniform high-level API, with multi-layered entry
- Aggressively optimized: copies minimized, communication aggregated and overlapped, caches and registers reused, memory chunks preallocated, inspector-executor model for repetitive tasks (e.g., gather/scatter)

Single-processor Performance of PETSc-FUN3D
Processor       Clock (MHz)  Peak (Mflop/s)  Opt. % of Peak  Opt. Mflop/s  Reord. only  Interl. only  Orig.  Orig. % of Peak
R10000          250  500   25.4  127  74  59  26  5.2
P3              200  800   20.3  163  87  68  32  4.0
P2SC (2 card)   120  480   21.4  101  51  35  13  2.7
P2SC (4 card)   24.3  117  40  15  3.1
604e            332  664   9.9  66  43  31  2.3
Alpha 21164     450  900   8.3  75  39  14  1.6
                600  1200  7.6  91  47  37  16  1.3
Ultra II        300  12.5  42  18  3.0
                360  720   13.0  94  54  25  3.5
Ultra II/HPC    400  8.9   71  36  20  2.5
Pent. II/LIN    20.8  83  52  33
Pent. II/NT     19.5  78  49  7.8
Pent. Pro       21.0  27  8.0
                333  18.8  60  21  6.3
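
To make the column meanings concrete, the percentage columns are just the optimized and original Mflop/s rates divided by the machine peak; for the R10000 row:

\frac{127}{500} \approx 25.4\% \qquad\text{and}\qquad \frac{26}{500} \approx 5.2\% .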


Lessons for High-end Simulation of PDEs
- Unstructured (static) grid codes can run well on distributed, hierarchical-memory machines, with attention to partitioning, vertex ordering, component ordering, blocking, and tuning
- Parallel solver libraries can give new life to the most valuable, discipline-specific modules of legacy PDE codes
- Parallel scalability is easy, but attaining high per-processor performance for sparse problems gets more challenging with each machine generation
- The NKS family of algorithms can be and must be tuned to an application-architecture combination; profiling is critical
- Some gains from hybrid parallel programming models (message passing and multithreading together) require little work; squeezing out the last drop is likely much more difficult

Remaining Challenges
- Parallelization of the solver leaves mesh generation, I/O, and post-processing as Amdahl bottlenecks in overall time-to-solution; moving the finest mesh cross-country with ftp may take hours. The ideal software environment would generate and verify the correctness of the mesh in parallel, from a relatively small geometry file
- Solution adaptivity of the mesh and parallel redistribution are important in an ultimate production environment
- Better multilevel preconditioners are needed in some applications
- In progress: wrapping a parallel optimization capability (Lagrange-Newton-Krylov-Schwarz) around our parallel solver, with substantial code reuse (automatic differentiation tools will help); integrating computational snooping and steering into the PETSc environment

Bibliography
- Toward Realistic Performance Bounds for Implicit CFD Codes, Gropp, Kaushik, Keyes & Smith, 1999, in “Proceedings of Parallel CFD’99,” Elsevier (to appear)
- Implementation of a Parallel Framework for Aerodynamic Design Optimization on Unstructured Meshes, Nielsen, Anderson & Kaushik, 1999, in “Proceedings of Parallel CFD’99,” Elsevier (to appear)
- Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278
- Newton-Krylov-Schwarz Methods for Aerodynamics Problems: Compressible and Incompressible Flows on Unstructured Grids, Kaushik, Keyes & Smith, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 513-520
- How Scalable is Domain Decomposition in Practice, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297
- On the Interaction of Architecture and Algorithm in the Domain-Based Parallelization of an Unstructured Grid Incompressible Flow Code, Kaushik, Keyes & Smith, 1998, in “Proceedings of the 10th Intl. Conf. on Domain Decomposition Methods,” AMS, pp. 311-319

Acknowledgments
- Accelerated Strategic Computing Initiative, DOE: access to ASCI Red and Blue machines
- National Energy Research Scientific Computing Center (NERSC), DOE: access to large SGI-Cray T3E
- National Science Foundation: research sponsorship under the Multidisciplinary Computing Challenges Program
- U.S. Department of Education: graduate fellowship support for D. Kaushik

Related URLs
- Follow-up on this talk: http://www.mcs.anl.gov/petsc-fun3d
- PETSc: http://www.mcs.anl.gov/petsc
- FUN3D: http://fmad-www.larc.nasa.gov/~wanderso/Fun
- ASCI platforms: http://www.llnl.gov/asci/platforms
- International Conferences on Domain Decomposition Methods: http://www.ddm.org
- International Conferences on Parallel CFD: http://www.parcfd.org