1 Mark F. Adams 22 October 2004 Applications of Algebraic Multigrid to Large Scale Mechanics Problems
2 Outline: Algebraic multigrid (AMG) introduction; Industrial applications; Micro-FE bone modeling; Olympus: parallel FE framework; Scalability studies on IBM SPs (scaled speedup, plain speedup, nodal performance)
3 Multigrid: smoothing and coarse grid correction (projection). [Figure: the multigrid V-cycle. Smoothing on the finest grid; restriction ($R$) to the first coarse grid (note: smaller grid); prolongation ($P = R^T$) back to the finest grid.]
4 Multigrid V($\nu_1$, $\nu_2$)-cycle
Function $u$ = MG-V($A$, $f$)
    if $A$ is small
        $u \leftarrow A^{-1} f$
    else
        $u \leftarrow S^{\nu_1}(f, u)$ -- $\nu_1$ steps of smoother (pre)
        $r_H \leftarrow P^T (f - A u)$
        $u_H \leftarrow$ MG-V($P^T A P$, $r_H$) -- recursion (Galerkin)
        $u \leftarrow u + P u_H$
        $u \leftarrow S^{\nu_2}(f, u)$ -- $\nu_2$ steps of smoother (post)
Iteration matrix: $T = S \, ( I - P (R A P)^{-1} R A ) \, S$ (multiplicative)
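The recursion on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the solver used in the talk: the 1D Poisson test matrix, the linear-interpolation prolongator, the damped Jacobi smoother (weight 2/3), and the names `poisson_1d`, `linear_prolongator`, `jacobi`, `mg_v`, `nu1`, `nu2`, and `min_size` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def poisson_1d(n):
    """Dense 1D Poisson matrix tridiag(-1, 2, -1); an assumed test operator."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def linear_prolongator(n_fine):
    """Linear interpolation from (n_fine - 1) // 2 coarse points; n_fine must be odd."""
    n_coarse = (n_fine - 1) // 2
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1              # fine-grid index of coarse point j
        P[i, j] = 1.0
        P[i - 1, j] = 0.5
        P[i + 1, j] = 0.5
    return P

def jacobi(A, f, u, steps, omega=2.0 / 3.0):
    """Damped Jacobi sweeps, standing in for the smoother S."""
    d = np.diag(A)
    for _ in range(steps):
        u = u + omega * (f - A @ u) / d
    return u

def mg_v(A, f, nu1=2, nu2=2, min_size=3):
    """One V-cycle for A u = f, starting from a zero initial guess."""
    n = A.shape[0]
    if n <= min_size:                          # "A is small": direct solve
        return np.linalg.solve(A, f)
    u = jacobi(A, f, np.zeros(n), nu1)         # pre-smoothing
    P = linear_prolongator(n)
    r_H = P.T @ (f - A @ u)                    # restricted residual
    u_H = mg_v(P.T @ A @ P, r_H, nu1, nu2, min_size)  # Galerkin coarse operator
    u = u + P @ u_H                            # coarse grid correction
    return jacobi(A, f, u, nu2)                # post-smoothing

# Usage: apply V-cycles as a stationary iteration on the residual equation.
A = poisson_1d(63)                             # 63 -> 31 -> 15 -> 7 -> 3
f = np.ones(63)
u = np.zeros(63)
for _ in range(12):
    u = u + mg_v(A, f - A @ u)
print(np.linalg.norm(f - A @ u) / np.linalg.norm(f))
```

In practice the V-cycle is more often used as a preconditioner for a Krylov method than as a stand-alone iteration; the stationary loop above is only the simplest way to exercise it.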
5 Smoothed Aggregation. Coarse grid space and smoother give the MG method. Piecewise constant function: “plain” aggregation ($P_0$). Start with kernel vectors $B$ of the operator, e.g., 6 rigid body modes (RBMs) in elasticity. Nodal aggregation: $B \rightarrow P_0$. “Smoothed” aggregation: lower the energy of the functions with one Jacobi iteration: $P \leftarrow ( I - \omega D^{-1} A ) P_0$
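The construction on this slide can be illustrated with a small NumPy sketch. It is written under stated assumptions rather than as the method used in the talk: the aggregates are given by hand instead of being computed from the matrix graph, each aggregate's block of the kernel vectors is orthonormalized with a QR factorization (a common convention that the slide does not state), and the damping factor `omega`, like the function names, is hypothetical.

```python
import numpy as np

def tentative_prolongator(B, aggregates):
    """Plain aggregation: restrict the kernel vectors B to each aggregate."""
    n, k = B.shape
    P0 = np.zeros((n, k * len(aggregates)))
    for j, rows in enumerate(aggregates):
        Q, _ = np.linalg.qr(B[rows, :])        # orthonormal local basis (assumed convention)
        P0[rows, j * k:(j + 1) * k] = Q
    return P0

def smoothed_prolongator(A, P0, omega):
    """One damped Jacobi step applied to P0: P = (I - omega D^{-1} A) P0."""
    d_inv = 1.0 / np.diag(A)
    return P0 - omega * (d_inv[:, None] * (A @ P0))

# Usage on a 1D Poisson matrix with the constant vector as the kernel.
n = 9
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
B = np.ones((n, 1))
aggregates = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]

P0 = tentative_prolongator(B, aggregates)
P = smoothed_prolongator(A, P0, omega=2.0 / 3.0)

# Total energy of the coarse basis functions before and after smoothing.
print(np.trace(P0.T @ A @ P0), np.trace(P.T @ A @ P))
```

The smoothed columns overlap neighbouring aggregates, so the Galerkin coarse operator built from `P` is denser than the one built from `P0`; that is the price paid for the lower-energy basis.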
6 Parallel Smoothers. CG/Jacobi: additive (requires damping for MG); damped by CG (Adams SC1999); dot products, non-stationary. Gauss-Seidel: multiplicative (optimal MG smoother); complex communication and computation (Adams SC2001). Polynomial smoothers: additive; Chebyshev is ideal for multigrid smoothers. Chebyshev chooses $p(y)$ to minimize the maximum of $|1 - p(y)\,y|$ over the interval $[\lambda^*, \lambda_{max}]$. Estimating $\lambda_{max}$ is easy. Use $\lambda^* = \lambda_{max} / C$ (no need for the lowest eigenvalue); $C$ is related to the rate of grid coarsening.
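A Chebyshev smoother of this kind can be sketched with the standard three-term Chebyshev iteration applied to the Jacobi-scaled operator $D^{-1}A$. The sketch below is illustrative only: the power-iteration estimate of the largest eigenvalue, the default `C = 10`, and the function names are assumptions, and the lower end of the interval is simply the upper bound divided by `C`.

```python
import numpy as np

def estimate_lambda_max(A, iters=20, seed=0):
    """Rough power-iteration estimate of the largest eigenvalue of D^{-1} A."""
    rng = np.random.default_rng(seed)
    d = np.diag(A)
    x = rng.standard_normal(A.shape[0])
    lam = 0.0
    for _ in range(iters):
        y = (A @ x) / d
        lam = np.linalg.norm(y) / np.linalg.norm(x)
        x = y / np.linalg.norm(y)
    return lam

def chebyshev_smooth(A, f, u, steps, lam_max, C=10.0):
    """Chebyshev iteration on D^{-1} A targeting the interval [lam_max / C, lam_max]."""
    lam_min = lam_max / C
    theta = 0.5 * (lam_max + lam_min)          # interval centre
    delta = 0.5 * (lam_max - lam_min)          # interval half-width
    sigma = theta / delta
    rho = 1.0 / sigma
    d_inv = 1.0 / np.diag(A)
    r = d_inv * (f - A @ u)                    # Jacobi-scaled residual
    d = r / theta
    for _ in range(steps):
        u = u + d
        r = r - d_inv * (A @ d)
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        rho = rho_new
    return u

# Usage: two smoothing steps (a 2nd-order polynomial) on a 1D Poisson matrix.
n = 63
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
lam_max = 1.1 * estimate_lambda_max(A)         # safety factor: an assumption
u = chebyshev_smooth(A, np.ones(n), np.zeros(n), steps=2, lam_max=lam_max)
```

Only matrix-vector products and vector updates appear in the loop, with no dot products, which is what makes this kind of smoother attractive in parallel.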
8 Aircraft carrier: 315,444 vertices; shell and beam elements (6 DOF per node); linear dynamics, transient (time domain). About 1 min. per solve (rtol = $10^{-6}$) on 2.4 GHz Pentium 4/Xeon processors; matrix-vector product runs at 254 Mflops.
9 Solve and setup times (26 Sun processors)
10 Adagio: “BR” tire. ADAGIO: quasi-static solid mechanics app. (Sandia). Nearly incompressible visco-elasticity (rubber); augmented Lagrange formulation w/ Uzawa-like update; contact (impenetrability constraint). Saddle point solution scheme: Uzawa-like iteration (pressure) w/ contact search; non-linear CG (with linear constraints and constant pressure), preconditioned with linear solvers (AMG, FETI, …) or nodal (Jacobi).
11 “BR” tire
12 Displacement history
14 Trabecular Bone 5-mm Cube Cortical bone Trabecular bone
15 Methods: FE modeling. Micro-computed tomography, 22 µm resolution; 3D image; mechanical testing (E, yield, ult, etc.); 2.5 mm cube; 44 µm elements; FE mesh.
17 Olympus: computational architecture. Components: Athena, parallel FE; ParMetis, parallel mesh partitioner (University of Minnesota); Prometheus, multigrid solver; FEAP, serial general-purpose FE application (University of California); PETSc, parallel numerical libraries (Argonne National Labs). [Diagram labels: FE mesh input file; Athena; ParMetis; FE input file (in memory); partition to SMPs; file; FEAP; material card; Silo DB; VisIt; Prometheus; PETSc; METIS; pFEAP]
19 Scalability: 80 µm w/o shell. Inexact Newton; CG linear solver with variable tolerance; smoothed aggregation AMG preconditioner; nodal block diagonal smoothers: 2nd-order Chebyshev (additive), Gauss-Seidel (multiplicative).
20 Inexact Newton
Solve $F(x) = 0$
Newton: solve $F'(x_{k-1}) s_k = -F(x_{k-1})$ for $s_k$ such that $\| F(x_{k-1}) + F'(x_{k-1}) s_k \|_2 < \eta_k \| F(x_{k-1}) \|_2$
$\eta_k = \beta \, \| F(x_k) \|_2 / \| F(x_{k-1}) \|_2$
$x_k \leftarrow x_{k-1} + s_k$
If $( s_k^T F(x_{k-1}) )^{1/2} < T_r \, ( s_k^T F(x_0) )^{1/2}$, return $x_k$
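The loop above can be sketched as follows. This is a hedged illustration, not the talk's implementation: the inner solver is plain unpreconditioned CG (no AMG preconditioner), the forcing term is clamped between `eta_min` and `eta0` as a safeguard, the energy test uses the first step's energy as its reference and takes absolute values, and the small nonlinear test problem is invented for the example. All function and parameter names are hypothetical.

```python
import numpy as np

def cg(A, b, rtol, max_iter=500):
    """Unpreconditioned CG from a zero guess; stops when ||b - A x|| <= rtol * ||b||."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    b_norm = np.sqrt(rs)
    for _ in range(max_iter):
        if np.sqrt(rs) <= rtol * b_norm:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def inexact_newton(F, J, x0, beta=0.1, eta0=0.1, eta_min=1e-10, tol=1e-8, max_newton=50):
    """Newton iteration whose linear solves are only as accurate as eta requires."""
    x = x0.copy()
    Fx = F(x)
    eta = eta0
    e0 = None
    for k in range(1, max_newton + 1):
        s = cg(J(x), -Fx, rtol=eta)            # inexact solve of J s = -F
        energy = abs(s @ Fx)                   # energy of this step
        if e0 is None:
            e0 = energy                        # reference energy (assumed: first step)
        x = x + s
        if energy <= tol**2 * e0:              # energy-norm convergence test
            return x, k
        F_new = F(x)
        # Variable tolerance: tighten the linear solve as the residual drops.
        ratio = np.linalg.norm(F_new) / np.linalg.norm(Fx)
        eta = min(eta0, max(eta_min, beta * ratio))
        Fx = F_new
    return x, max_newton

# Usage on an invented test problem: F(x) = A x + x^3 - b with A symmetric positive definite.
n = 20
A = (n + 1) ** 2 * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
b = np.ones(n)
F = lambda x: A @ x + x**3 - b
J = lambda x: A + np.diag(3.0 * x**2)
x, newton_steps = inexact_newton(F, J, np.zeros(n))
print(newton_steps, np.linalg.norm(F(x)))
```

Early Newton steps are solved loosely and later ones tightly, so linear-solver work is spent only where it improves the nonlinear iterate.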
21 80 µm w/ shell Vertebral Body With Shell Large deformation elast. 6 load steps (3% strain) Scaled speedup ~131K dof/processor 7 to 537 million dof 4 to 292 nodes IBM SP Power3 15 of 16 procs/node used Double/Single Colony switch
22 Computational phases. Mesh setup (per mesh): coarse grid construction (aggregation); graph processing. Matrix setup (per matrix): coarse grid operator construction; sparse matrix triple product RAP (expensive for S.A.); subdomain factorizations. Solve (per RHS): matrix-vector products (residuals, grid transfer); smoothers (matrix-vector products).
23 Linear solver iterations [Figure panels: small (7.5M dof), large (537M dof); axes: Newton, Load]
24 131K dof/proc: flops/sec/proc; 0.47 Teraflops [Figure; axis label: processors]
25 End to end times and (in)efficiency components
26 Sources of scale inefficiencies in solve phase
                7.5M dof    537M dof
#iteration
#nnz/row        50          68
Flop rate       76          74
#elems/pr       19.3K       33.0K
model
Measured
27 164K dof/proc
28 First try: Flop rates (265K dof/processor) 265K dof per proc. IBM switch bug Bisection bandwidth plateau nodes Solution: use more processors Less dof per proc. Less pressure on switch Bisection bandwidth
30 Speedup with 7.5M dof problem (1 to 128 nodes)
32 Nodal Performance of IBM SP Power3 and Power4. IBM Power3, 16 processors per node: 375 MHz, 4 flops per cycle; 16 GB/sec bus (~7.9 GB/sec w/ STREAM benchmark); implies ~1.5 Gflop/s MB peak for Mat-Vec; we get ~1.2 Gflop/s (15 × 0.08 Gflop/s). IBM Power4, 32 processors per node: 1.3 GHz, 4 flops per cycle; complex memory architecture.
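The Power3 arithmetic on this slide can be checked with a few lines. The constants are the slide's own figures; the variable names and the derived quantities (the share of the mat-vec bound that is reached, and the bytes moved per flop that the bound implies) are illustrative additions, computed on the assumption that the ~1.5 Gflop/s bound is per node, as the comparison with 15 processors suggests.

```python
# Figures quoted on the slide for one IBM Power3 node.
CLOCK_HZ = 375e6                 # processor clock
FLOPS_PER_CYCLE = 4
PROCS_USED = 15                  # 15 of 16 processors per node
MATVEC_GFLOPS_PER_PROC = 0.08    # measured mat-vec rate per processor
MATVEC_BOUND_GFLOPS = 1.5        # slide's "MB peak" for mat-vec
STREAM_GB_PER_S = 7.9            # STREAM benchmark bandwidth

# Nominal floating-point peak of a single processor, in Gflop/s.
proc_peak_gflops = CLOCK_HZ * FLOPS_PER_CYCLE / 1e9

# Aggregate measured mat-vec rate for the node, in Gflop/s.
node_matvec_gflops = PROCS_USED * MATVEC_GFLOPS_PER_PROC

# Share of the slide's mat-vec bound that is actually reached.
fraction_of_bound = node_matvec_gflops / MATVEC_BOUND_GFLOPS

# Memory traffic per flop implied by the bound and the STREAM rate.
implied_bytes_per_flop = STREAM_GB_PER_S / MATVEC_BOUND_GFLOPS

print(proc_peak_gflops, node_matvec_gflops, fraction_of_bound, implied_bytes_per_flop)
```

On these figures the measured mat-vec rate is about four fifths of the bandwidth-limited bound, far below the nominal floating-point peak of the node's processors.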
33 Speedup