Download presentation
Presentation is loading. Please wait.
Published byGavin Bishop Modified over 9 years ago
2
1 Mark F. Adams 22 October 2004 Applications of Algebraic Multigrid to Large Scale Mechanics Problems
3
2 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
4
3 Multigrid smoothing and coarse grid correction (projection) smoothing Finest Grid Prolongation (P=R T ) The Multigrid V-cycle First Coarse Grid Restriction (R) Note: smaller grid
5
4 Multigrid V( ) - cycle MG-V Function u = MG-V(A,f) if A is small u A -1 f else u S (f, u) -- steps of smoother (pre) r H P T ( f – Au ) MG-Vrecursion u H MG-V(P T AP, r H )-- recursion (Galerkin) u u + Pu H u S (f, u) -- steps of smoother (post) Iteration matrix: T = S ( I - P(RAP) -1 RA ) S multiplicative
6
5 Smoothed Aggregation Coarse grid space & smoother MG method Piecewise constant function: “Plain” agg. (P 0 ) Start with kernel vectors B of operator eg, 6 RBMs in elasticity Nodal aggregation BP0P0 “Smoothed” aggregation: lower energy of functions One Jacobi iteration: P ( I - D -1 A ) P 0
7
6 Parallel Smoothers CG/Jacobi: Additive (Requires damping for MG) Damped by CG (Adams SC1999) Dot products, non-stationary Gauss-Seidel: multiplicative (Optimal MG smoother) Complex communication and computation (Adams SC2001) Polynomial Smoothers: Additive Chebyshev is ideal for multigrid smoothers Chebychev chooses p(y) such that |1 - p(y) y | = min over interval [ *, max ] Estimate of max easy Use * = max / C (No need for lowest eigenvalue) C related to rate of grid coarsening
8
7 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
9
8 Aircraft carrier 315,444 vertices Shell and beam elements (6 DOF per node) Linear dynamics – transient (time domain) About 1 min. per solve (rtol=10 -6 ) –2.4 GHz Pentium 4/Xenon processors –Matrix vector product runs at 254 Mflops
10
9 Solve and setup times (26 Sun processors)
11
10 Adagio: “BR” tire ADAGIO: Quasi static solid mechanics app. (Sandia) Nearly incompressible visco-elasticity (rubber) Augmented Lagrange formulation w/ Uzawa like update Contact (impenetrability constraint) Saddle point solution scheme: Uzawa like iteration (pressure) w/ contact search Non-linear CG (with linear constraints and constant pressure) Preconditioned with Linear solvers (AMG, FETI, …) Nodal (Jacobi)
12
11 “BR” tire
13
12 Displacement history
14
13 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
15
14 Trabecular Bone 5-mm Cube Cortical bone Trabecular bone
16
15 Micro-Computed Tomography CT @ 22 m resolution 3D image Mechanical Testing E, yield, ult, etc. 2.5 mm cube 44 m elements FE mesh Methods: FE modeling
17
16 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
18
17 Athena: Parallel FE ParMetis Parallel Mesh Partitioner (Univerisity of Minnesota) Prometheus Multigrid Solver FEAP Serial general purpose FE application (University of California) PETSc Parallel numerical libraries (Argonne National Labs) FE Mesh Input File Athena ParMetis FE input file (in memory) Partition to SMPs Athena ParMetis File FEAP Material Card Silo DB Visit Prometheus PETSc ParMetis METIS pFEAP Computational Architecture Olympus
19
18 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
20
19 80 µm w/o shell Inexact Newton CG linear solver Variable tolerance Smoothed aggregation AMG preconditioner Nodal block diagonal smoothers: 2 nd order Chebeshev (add.) Gauss-Seidel (multiplicative) Scalability
21
20 Inexact Newton Solve F(x)=0 Newton: Solve F’(x k-1 )s k = -F(x k-1 ) for s k such that || F(x k-1 ) + F’(x k-1 )s k || 2 < n k || F(x k-1 ) || 2 n k = ß || F(x k ) || 2 / || F(x k-1 ) || 2 x k x k-1 + s k If ( s k T F(x k-1 ) ) 1/2 < T r ( s k T F(x 0 ) ) 1/2 Return x k
22
21 80 µm w/ shell Vertebral Body With Shell Large deformation elast. 6 load steps (3% strain) Scaled speedup ~131K dof/processor 7 to 537 million dof 4 to 292 nodes IBM SP Power3 15 of 16 procs/node used Double/Single Colony switch
23
22 Computational phases Mesh setup (per mesh): Coarse grid construction (aggregation) Graph processing Matrix setup (per matrix): Coarse grid operator construction Sparse matrix triple product RAP (expensive for S.A.) Subdomain factorizations Solve (per RHS): Matrix vector products (residuals, grid transfer) Smoothers (Matrix vector products)
24
23 Linear solver iterations Newton Load Small (7.5M dof)Large (537M dof) 12345123456 15142021185113525702 251420 5113626702 35142022195113626702 45142022195113626702 55142022195113626702 65142022195113626702
25
24 131K dof / proc - Flops/sec/proc.47 Terflops - 4088 processors
26
25 End to end times and (in)efficiency components
27
26 Sources of scale inefficiencies in solve phase 7.5M dof537M dof #iteration450897 #nnz/row5068 Flop rate7674 #elems/pr19.3K33.0K model1.002.78 Measured 1.002.61
28
27 164K dof/proc
29
28 First try: Flop rates (265K dof/processor) 265K dof per proc. IBM switch bug Bisection bandwidth plateau 64-128 nodes Solution: use more processors Less dof per proc. Less pressure on switch Bisection bandwidth
30
29 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
31
30 Speedup with 7.5M dof problem (1 to 128 nodes)
32
31 Outline Algebraic multigrid (AMG) introduction Industrial applications Micro-FE bone modeling Olympus Parallel FE framework Scalability studies on IBM SPs Scaled speedup Plain speedup Nodal performance
33
32 Nodal Performance of IBM SP Power3 and Power4 IBM power3, 16 processors per node 375 Mhz, 4 flops per cycle 16 GB/sec bus (~7.9 GB/sec w/ STREAM bm) Implies ~1.5 Gflops/s MB peak for Mat-Vec We get ~1.2 Gflops/s (15 x.08Gflops) IBM power4, 32 processors per node 1.3 GHz, 4 flops per cycle Complex memory architecture
34
33 Speedup
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.