Mark F. Adams, 22 October 2004. Applications of Algebraic Multigrid to Large Scale Mechanics Problems.


1 Mark F. Adams 22 October 2004 Applications of Algebraic Multigrid to Large Scale Mechanics Problems

2 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

3 Multigrid smoothing and coarse grid correction (projection) (figure: the Multigrid V-cycle, from the finest grid down to the first, smaller, coarse grid and back; restriction (R) moves residuals down, prolongation (P = Rᵀ) moves corrections up, with smoothing on each level)

4 Multigrid V(ν₁,ν₂)-cycle MG-V
  Function u = MG-V(A, f)
    if A is small
      u ← A⁻¹ f
    else
      u ← S^ν₁(f, u)             -- ν₁ steps of smoother (pre)
      r_H ← Pᵀ(f − A u)
      u_H ← MG-V(Pᵀ A P, r_H)    -- recursion (Galerkin)
      u ← u + P u_H
      u ← S^ν₂(f, u)             -- ν₂ steps of smoother (post)
  Iteration matrix: T = S (I − P (R A P)⁻¹ R A) S  -- multiplicative
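For concreteness, here is a minimal NumPy sketch of the V(ν₁,ν₂)-cycle above. It assumes dense matrices, a damped Jacobi stand-in for the smoother S, and a direct solve on the coarsest level; it is only the recursion pattern, not the solver used in the talk.

```python
import numpy as np

def jacobi(A, f, u, nu, omega=2.0 / 3.0):
    """nu sweeps of damped Jacobi: a simple stand-in for the smoother S."""
    Dinv = 1.0 / np.diag(A)
    for _ in range(nu):
        u = u + omega * Dinv * (f - A @ u)
    return u

def mg_v(A, f, Ps, u=None, nu1=2, nu2=2):
    """One V(nu1, nu2)-cycle; Ps is the list of prolongators, finest first."""
    if u is None:
        u = np.zeros_like(f)
    if not Ps:                       # coarsest level: u <- A^{-1} f
        return np.linalg.solve(A, f)
    P = Ps[0]
    u = jacobi(A, f, u, nu1)         # pre-smoothing
    rH = P.T @ (f - A @ u)           # restrict residual: r_H <- P^T (f - A u)
    AH = P.T @ A @ P                 # Galerkin coarse operator: A_H <- P^T A P
    uH = mg_v(AH, rH, Ps[1:])        # recursion on the coarse problem
    u = u + P @ uH                   # prolongate and add the coarse correction
    return jacobi(A, f, u, nu2)      # post-smoothing
```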

5 Smoothed Aggregation  Coarse grid space & smoother  MG method  Piecewise constant functions: “plain” aggregation (P₀)  Start with kernel vectors B of the operator  e.g., the 6 rigid body modes (RBMs) in elasticity  Nodal aggregation  “Smoothed” aggregation: lower the energy of the basis functions  One Jacobi iteration: P ← (I − ω D⁻¹ A) P₀
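A sketch, for a scalar problem, of the two ingredients named above: the piecewise-constant ("plain") tentative prolongator built from an aggregate map, and the single damped Jacobi sweep that smooths it. The aggregate map `agg` and the damping ω (commonly chosen near 4/(3 ρ(D⁻¹A))) are inputs and are assumptions of the sketch; for 3D elasticity the tentative prolongator would instead be built from the 6 rigid body modes per aggregate.

```python
import numpy as np
import scipy.sparse as sp

def plain_aggregation_P0(agg, n_aggs):
    """Tentative prolongator P0 for a scalar problem: column j is the
    indicator vector of aggregate j (the kernel vector B = 1 restricted
    to each aggregate).  agg[i] = aggregate index of node i."""
    n = len(agg)
    return sp.csr_matrix((np.ones(n), (np.arange(n), agg)), shape=(n, n_aggs))

def smooth_prolongator(A, P0, omega):
    """One damped Jacobi sweep lowers the energy of the coarse basis:
    P = (I - omega * D^{-1} A) P0."""
    Dinv = sp.diags(1.0 / A.diagonal())
    return (P0 - omega * (Dinv @ (A @ P0))).tocsr()
```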

6 Parallel Smoothers  CG/Jacobi: additive (requires damping for MG)  Damped by CG (Adams SC1999)  Dot products, non-stationary  Gauss-Seidel: multiplicative (optimal MG smoother)  Complex communication and computation (Adams SC2001)  Polynomial smoothers: additive  Chebyshev is ideal for multigrid smoothing  Chebyshev chooses p(λ) such that |1 − p(λ) λ| is minimized over the interval [λ*, λmax]  Estimate of λmax is easy  Use λ* = λmax / C (no need for the lowest eigenvalue)  C is related to the rate of grid coarsening
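Below is a generic sketch of a Chebyshev polynomial smoother using the standard three-term recurrence on the interval [λmax/C, λmax]; as the slide notes, only an upper eigenvalue estimate is needed and no dot products are required. The default C and the recurrence form are textbook choices, not necessarily those used in the talk.

```python
import numpy as np

def chebyshev_smooth(A, b, x, steps, lam_max, C=30.0):
    """Chebyshev smoothing for A x = b, targeting eigenvalues in
    [lam_max / C, lam_max]; additive, communication-free apart from mat-vecs."""
    lam_min = lam_max / C
    theta = 0.5 * (lam_max + lam_min)    # interval midpoint
    delta = 0.5 * (lam_max - lam_min)    # interval half-width
    sigma = theta / delta
    rho_old = 1.0 / sigma
    r = b - A @ x
    d = r / theta
    x = x + d
    for _ in range(steps - 1):
        rho_new = 1.0 / (2.0 * sigma - rho_old)
        r = b - A @ x
        d = rho_new * rho_old * d + (2.0 * rho_new / delta) * r
        x = x + d
        rho_old = rho_new
    return x
```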

7 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

8 Aircraft carrier 315,444 vertices Shell and beam elements (6 DOF per node) Linear dynamics – transient (time domain) About 1 min. per solve (rtol = 10⁻⁶) –2.4 GHz Pentium 4/Xeon processors –Matrix-vector product runs at 254 Mflops

9 Solve and setup times (26 Sun processors)

10 Adagio: “BR” tire  ADAGIO: quasi-static solid mechanics application (Sandia)  Nearly incompressible viscoelasticity (rubber)  Augmented Lagrange formulation with an Uzawa-like update  Contact (impenetrability constraint)  Saddle-point solution scheme:  Uzawa-like iteration (pressure) with contact search  Non-linear CG (with linear constraints and constant pressure), preconditioned with:  Linear solvers (AMG, FETI, …)  Nodal (Jacobi)

11 “BR” tire

12 Displacement history

13 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

14 Trabecular Bone  5-mm cube (figure: cortical bone and trabecular bone regions labeled)

15 Methods: µFE modeling (figure: micro-computed tomography of a 2.5 mm cube gives a 22 µm resolution 3D image; a µFE mesh with 44 µm elements is built from it; mechanical testing gives E, σ_yield, σ_ult, etc.)

16 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

17 Computational Architecture: Olympus  Athena: parallel FE  ParMetis: parallel mesh partitioner (University of Minnesota)  Prometheus: multigrid solver  FEAP: serial general-purpose FE application (University of California)  PETSc: parallel numerical libraries (Argonne National Labs)  (Figure: dataflow – the FE mesh input file is partitioned to SMPs by Athena/ParMetis into per-SMP FE input files; FEAP reads these together with a material card and writes Silo DBs for VisIt; pFEAP couples FEAP to the Prometheus solver, which builds on PETSc, ParMetis, and METIS)

18 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

19 Scalability: 80 µm w/o shell  Inexact Newton  CG linear solver  Variable tolerance  Smoothed aggregation AMG preconditioner  Nodal block diagonal smoothers:  2nd order Chebyshev (additive)  Gauss-Seidel (multiplicative)

20 Inexact Newton  Solve F(x) = 0  Newton: solve F′(x_{k−1}) s_k = −F(x_{k−1}) for s_k such that ‖F(x_{k−1}) + F′(x_{k−1}) s_k‖₂ < η_k ‖F(x_{k−1})‖₂  η_k = β ‖F(x_k)‖₂ / ‖F(x_{k−1})‖₂  x_k ← x_{k−1} + s_k  If (s_kᵀ F(x_{k−1}))^{1/2} < τ_r (s_kᵀ F(x_0))^{1/2}, return x_k
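A sketch of the inexact Newton loop above. The linear-solve callback `Jv_solve(x, rhs, rtol)` (e.g. AMG-preconditioned CG solving F′(x)s = rhs to relative tolerance rtol), the initial forcing term, and the plain relative-residual stopping test are illustrative assumptions; the slide itself uses an energy-style test on s_kᵀF.

```python
import numpy as np

def inexact_newton(F, Jv_solve, x0, beta=0.9, tol_rel=1e-6, max_it=20):
    """Inexact Newton: each step solves F'(x) s = -F(x) only to tolerance
    eta_k, with eta_k = beta * ||F(x_k)|| / ||F(x_{k-1})|| as on the slide."""
    x = x0.copy()
    f = F(x)
    f0_norm = np.linalg.norm(f)
    eta = 0.1                                   # initial forcing term (assumed)
    for _ in range(max_it):
        s = Jv_solve(x, -f, eta)                # ||F(x) + F'(x) s|| <= eta ||F(x)||
        x = x + s
        f_new = F(x)
        eta = min(0.9, beta * np.linalg.norm(f_new) / np.linalg.norm(f))
        f = f_new
        if np.linalg.norm(f) < tol_rel * f0_norm:   # simple residual test (assumed)
            break
    return x
```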

21 80 µm w/ shell: vertebral body with shell  Large deformation elasticity  6 load steps (3% strain)  Scaled speedup  ~131K dof/processor  7 to 537 million dof  4 to 292 nodes  IBM SP Power3  15 of 16 procs/node used  Double/Single Colony switch

22 Computational phases  Mesh setup (per mesh):  Coarse grid construction (aggregation)  Graph processing  Matrix setup (per matrix):  Coarse grid operator construction  Sparse matrix triple product RAP (expensive for S.A.)  Subdomain factorizations  Solve (per RHS):  Matrix vector products (residuals, grid transfer)  Smoothers (Matrix vector products)
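As an illustration of the per-matrix setup cost noted above, a minimal SciPy sketch of the Galerkin triple product; the helper name and the CSR conversions are assumptions, not the Prometheus code.

```python
import scipy.sparse as sp

def galerkin_coarse_operator(A, P):
    """Coarse-grid operator A_H = R A P with R = P^T.  With smoothed
    aggregation, P is denser than a plain-aggregation prolongator, which
    is why this sparse triple product dominates the matrix setup phase."""
    R = P.T.tocsr()
    return (R @ A @ P).tocsr()
```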

23 Linear solver iterations per Newton step and load step (table: small, 7.5M dof, vs. large, 537M dof, problems)

24 131K dof/processor – Flops/sec/processor vs. number of processors (0.47 Teraflops)

25 End-to-end times and (in)efficiency components

26 Sources of scale inefficiencies in solve phase (model vs. measured)
                  7.5M dof    537M dof
   #iterations
   #nnz/row           50          68
   Flop rate          76          74
   #elems/proc      19.3K       33.0K

27 164K dof/proc

28 First try: flop rates (265K dof/processor)  265K dof per processor  IBM switch bug: bisection bandwidth plateaus beyond a certain number of nodes  Solution:  Use more processors  Fewer dof per processor  Less pressure on the switch

29 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

30 Speedup with 7.5M dof problem (1 to 128 nodes)

31 Outline  Algebraic multigrid (AMG) introduction  Industrial applications  Micro-FE bone modeling  Olympus Parallel FE framework  Scalability studies on IBM SPs  Scaled speedup  Plain speedup  Nodal performance

32 Nodal Performance of IBM SP Power3 and Power4  IBM Power3, 16 processors per node  375 MHz, 4 flops per cycle  16 GB/sec bus (~7.9 GB/sec with the STREAM benchmark)  Implies a memory-bandwidth-limited peak of ~1.5 Gflop/s per node for Mat-Vec  We get ~1.2 Gflop/s (15 × 0.08 Gflop/s)  IBM Power4, 32 processors per node  1.3 GHz, 4 flops per cycle  Complex memory architecture
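As a rough cross-check of the bandwidth-limited Mat-Vec peak quoted above, a back-of-the-envelope sketch. The traffic model (8-byte value plus 4-byte column index streamed per nonzero, 2 flops per nonzero, vector traffic ignored) is an assumption, so it lands near but not exactly on the ~1.5 Gflop/s figure.

```python
# Bandwidth-bound estimate for sparse mat-vec on one Power3 node.
stream_bw = 7.9e9          # measured STREAM bandwidth from the slide, bytes/s
bytes_per_nnz = 8 + 4      # matrix value + column index (assumed traffic model)
flops_per_nnz = 2          # one multiply + one add per nonzero
gflops_bound = stream_bw / bytes_per_nnz * flops_per_nnz / 1e9
print(f"~{gflops_bound:.1f} Gflop/s per node")   # ~1.3 Gflop/s, same order as ~1.5
```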

33 Speedup