Parallel Adaptive Mesh Refinement for Radiation Transport and Diffusion
Louis Howell
Center for Applied Scientific Computing / AX Division
Lawrence Livermore National Laboratory
May 18, 2005

Raptor Code: Overview
- Block-structured Adaptive Mesh Refinement (AMR)
- Multifluid Eulerian representation
- Explicit Godunov hydrodynamics
- Timestep varies with refinement level
- Single-group radiation diffusion (implicit, multigrid)
- Multi-group radiation diffusion under development
- Heat conduction, also implicit
- Now adding discrete ordinate (Sn) transport solvers
- AMR timestep requires both single-level and multilevel Sn
- Parallel implementation and scaling issues

Raptor Code: Core Algorithm Developers
- Rick Pember
- Jeff Greenough
- Sisira Weeratunga
- Alex Shestakov
- Louis Howell

Radiation Diffusion Capability
Single-group radiation diffusion is coupled with multi-fluid Eulerian hydrodynamics on a regular grid using block-structured adaptive mesh refinement (AMR).

Radiation Diffusion Contrasted with Discrete Ordinates
All three calculations conserve energy by using multilevel coarse-fine synchronization at the end of each coarse timestep. Fluid energy is shown (overexposed to bring out detail). Transport uses the step characteristic discretization.
(Figure panels: Flux-limited Diffusion; S16 (144 ordinates); 144 equally-spaced ordinates.)

Coupling of Radiation with Fluid Energy
Advection and Conduction:
Implicit Radiation Diffusion (gray, flux-limited):
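As a hedged sketch (standard notation assumed, not taken from the slide), gray flux-limited radiation diffusion coupled to the fluid energy has the form

    \frac{\partial E_R}{\partial t} - \nabla \cdot \left( \frac{c \lambda}{\kappa_R} \nabla E_R \right) = c \kappa_P \left( a T^4 - E_R \right)

    \frac{\partial (\rho e)}{\partial t} + \nabla \cdot (\rho e \mathbf{u}) + p \nabla \cdot \mathbf{u} = \nabla \cdot (k \nabla T) - c \kappa_P \left( a T^4 - E_R \right)

where E_R is the radiation energy density, \lambda the flux limiter, \kappa_R and \kappa_P the Rosseland and Planck mean opacities, and k the conduction coefficient.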

Coupling of Radiation with Fluid Energy
Advection and Conduction:
Implicit Radiation Transport (gray, isotropic scattering):
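A hedged sketch of the corresponding gray transport equation with isotropic scattering (again, standard notation assumed):

    \frac{1}{c} \frac{\partial I}{\partial t} + \mathbf{\Omega} \cdot \nabla I + (\kappa_a + \kappa_s) I = \frac{\kappa_s}{4\pi} \int_{4\pi} I \, d\Omega' + \frac{\kappa_a \, a c T^4}{4\pi}

with E_R = (1/c) \int_{4\pi} I \, d\Omega and the fluid energy receiving the matching absorption-minus-emission term.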

Implicit Radiation Update
Extrapolate Emission to New Temperature:
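The usual linearization (given here as a hedged sketch) extrapolates the Planck emission about the old temperature,

    a (T^{n+1})^4 \approx a (T^n)^4 + 4 a (T^n)^3 \, (T^{n+1} - T^n),

so that the emission source becomes linear in the new-time fluid energy (through c_v \Delta T) and the radiation update can be made fully implicit.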

Implicit Radiation Update
Iterative Form of Diffusion Update:

Implicit Radiation Update
Iterative Form of Transport Update:

Simplified Transport Equation
Gather Similar Terms:
Simplified Gray Semi-discrete Form:
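One standard way to arrive at such a form (offered as an assumption about the slide, not a quotation of it): backward-Euler time differencing folds the 1/(c \Delta t) term into an effective total cross section,

    \mathbf{\Omega} \cdot \nabla \psi + \sigma_t \psi = \frac{\sigma_s}{4\pi} \phi + q, \qquad
    \sigma_t = \kappa_a + \kappa_s + \frac{1}{c \Delta t}, \qquad
    \phi = \int_{4\pi} \psi \, d\Omega,

with q collecting the emission term and the old-time intensity.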

Discrete Ordinate Discretization
Angular Discretization:
Spatial Discretization in 2D Cartesian Coordinates:
Other Coordinate Systems: 1D & 3D Cartesian, 1D Spherical, 2D Axisymmetric (RZ)
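As a hedged sketch, the standard Sn angular discretization replaces the angular integral by a quadrature over ordinates \Omega_m with weights w_m:

    \mathbf{\Omega}_m \cdot \nabla \psi_m + \sigma_t \psi_m = \frac{\sigma_s}{4\pi} \sum_{m'} w_{m'} \psi_{m'} + q_m, \qquad
    \phi = \sum_m w_m \psi_m, \qquad m = 1, \dots, M.

The equations are coupled only through the scattering sum, which is what makes sweep-based solution and source iteration natural.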

Spatial Transport Discretizations
- Step: first-order upwind; positive; inaccurate in both the thick and thin limits
- Diamond Difference: second order, but very vulnerable to oscillations
- Simple Corner Balance (SCB): more accurate in the thick limit; groups cells in 2x2 blocks, each block requiring a 4x4 matrix inversion (8x8 in 3D)
- Upstream Corner Balance: attempts to improve on SCB in the streaming limit, but breaks conjugate gradient acceleration (implemented in 2D Cartesian only)
- Step Characteristic: gives sharp rays in the thin streaming limit; positive; inaccurate in the thick diffusion limit (implemented in 2D Cartesian only)
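For concreteness, a minimal C++ sketch (not Raptor code; the function and variable names are illustrative) of a single-grid sweep with the step discretization for one ordinate whose direction cosines are both positive:

    // Step (first-order upwind) sweep for one ordinate with mu > 0, eta > 0.
    #include <vector>

    void step_sweep(int nx, int ny, double dx, double dy,
                    double mu, double eta,                  // direction cosines
                    const std::vector<double>& sigt,        // total cross section, nx*ny
                    const std::vector<double>& src,         // scattering + emission source, nx*ny
                    const std::vector<double>& psi_in_w,    // incoming west-face values, size ny
                    const std::vector<double>& psi_in_s,    // incoming south-face values, size nx
                    std::vector<double>& psi)               // cell-centered intensity, nx*ny
    {
      auto idx = [nx](int i, int j) { return j * nx + i; };
      for (int j = 0; j < ny; ++j) {
        for (int i = 0; i < nx; ++i) {
          // Upwind (incoming) values come from the west and south neighbors.
          double pw = (i == 0) ? psi_in_w[j] : psi[idx(i - 1, j)];
          double ps = (j == 0) ? psi_in_s[i] : psi[idx(i, j - 1)];
          // Step scheme: the outgoing face value equals the cell average.
          double denom = mu / dx + eta / dy + sigt[idx(i, j)];
          psi[idx(i, j)] = (mu / dx * pw + eta / dy * ps + src[idx(i, j)]) / denom;
        }
      }
    }

The incoming faces would be filled either from a physical boundary condition or from an upstream grid's outgoing faces, which is exactly the dependence that drives the parallel grid sequencing described later.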

Axisymmetric Crooked Pipe Problem: Radiation Energy Density
(Figure panels: Diffusion; S2 Step; S8 Step; S2 SCB; S8 SCB.)

Axisymmetric Crooked Pipe Problem: Fluid Temperature
(Figure panels: Diffusion; S2 Step; S8 Step; S2 SCB; S8 SCB.)

AMR Timestep
Advance the coarse level (L0) through Δt0, then advance the finer level (L1) in steps of Δt1 and the finest level (L2) in steps of Δt2.

AMR Timestep (continued)
Synchronize L1 and L2 with a multilevel solve at the end of each Δt1, repeat the L1/L2 subcycle, then synchronize L0 and L1 at the end of Δt0.
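A hedged pseudocode sketch in plain C++ (not the BoxLib/Raptor interface; advance_level and synchronize are stand-ins) of the subcycled advance with coarse-fine synchronization:

    #include <cstdio>

    // Stubs standing in for the real single-level update and multilevel sync.
    void advance_level(int level, double dt)           { std::printf("advance L%d by %g\n", level, dt); }
    void synchronize(int coarse_level, int fine_level) { std::printf("sync L%d/L%d\n", coarse_level, fine_level); }

    // Recursive subcycled AMR advance: each finer level takes ref_ratio smaller
    // steps, and a multilevel synchronization follows each coarser-level step.
    void amr_advance(int level, int finest_level, double dt, int ref_ratio)
    {
      advance_level(level, dt);                      // hydro, diffusion, Sn sweeps
      if (level < finest_level) {
        const double dt_fine = dt / ref_ratio;
        for (int k = 0; k < ref_ratio; ++k)
          amr_advance(level + 1, finest_level, dt_fine, ref_ratio);
        synchronize(level, level + 1);               // restores conservation across levels
      }
    }

    int main() { amr_advance(0, 2, 1.0, 2); }        // three levels, refinement ratio 2

Running the stub prints the order of operations: L0 advances, each L1 step is followed by its L2 substeps and an L1/L2 sync, and the L0/L1 sync comes last.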

Requirements for Radiation Package
Features controlled by the package:
- Nonlinear implicit update with fluid energy coupling
- Single-level transport solver (for advancing each level)
- Multilevel transport solver (for synchronization)
Features not directly controlled by the package:
- Refinement criteria
- Grid layout
- Load balancing
- Timestep size
Parallel support provided by BoxLib:
- Each refinement level is distributed grid-by-grid over all processors
- Coarse and fine grids in the same region may be on different processors

Multilevel Transport Sweeps

Sources Updated Iteratively
Three “sources” must be recomputed after each sweep and iterated to convergence:
- Scattering source
- Reflecting boundaries
- AMR refluxing source
The AMR source converges most quickly, while the scattering source is often so slow that convergence acceleration is required.
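A hedged sketch of this outer iteration in C++ (the callables are placeholders for the real sweep and update routines, not the Raptor interface):

    #include <functional>

    // Outer source iteration: sweep all ordinates, refresh the lagged sources,
    // and test convergence of the scalar flux.
    void source_iteration(const std::function<void()>& sweep_all_ordinates,
                          const std::function<void()>& update_scattering_source,
                          const std::function<void()>& update_reflecting_boundaries,
                          const std::function<void()>& update_amr_reflux_source,
                          const std::function<double()>& scalar_flux_change,
                          double tol, int max_iters)
    {
      for (int it = 0; it < max_iters; ++it) {
        sweep_all_ordinates();            // one transport sweep per ordinate
        update_scattering_source();       // slowest to converge; may need acceleration
        update_reflecting_boundaries();   // mirror outgoing into incoming ordinates
        update_amr_reflux_source();       // fine-level fluxes fed back to the coarse level
        if (scalar_flux_change() < tol)   // global norm: one parallel reduction
          break;
      }
    }

When the scattering term dominates, this loop is exactly what the conjugate gradient acceleration described later speeds up.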

Parallel Communication
Four different communication operations are required:
1. From grid to grid on the same level
2. From the coarse level to the upstream edges of the fine level
3. From the coarse level to the downstream edges of the fine level (to initialize flux registers)
4. From the fine level back to the coarse level as a refluxing source
Operations 2 and 3 are only needed when preparing to transfer control from the coarse level to the fine level. Operation 3 could be eliminated, and operation 4 reduced, if a data structure existed on the coarse processor to hold the information.

Parallel Grid Sequencing
To sweep a single ordinate, a grid needs information from the grids on its upstream faces, so different grids sweep different ordinates at the same time.
2D Cartesian, first quadrant only of the S4 ordinate set: 13 stages for 3 ordinates.
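A hedged illustration (not the Raptor setup code) of how sweep stages can be assigned for one ordinate: each grid's stage is one more than the largest stage among its upstream neighbors, computed by relaxation over the grid dependency graph (which must be acyclic; in 3D, loops are first broken by grid splitting, as described later).

    #include <algorithm>
    #include <vector>

    // upstream[g] lists the grids whose outgoing faces feed grid g for this ordinate.
    std::vector<int> assign_stages(int ngrids,
                                   const std::vector<std::vector<int>>& upstream)
    {
      std::vector<int> stage(ngrids, -1);           // -1 means "not yet staged"
      bool changed = true;
      while (changed) {
        changed = false;
        for (int g = 0; g < ngrids; ++g) {
          int s = 0;
          for (int u : upstream[g]) {
            if (stage[u] < 0) { s = -1; break; }    // an upstream grid is not staged yet
            s = std::max(s, stage[u] + 1);
          }
          if (s >= 0 && s != stage[g]) { stage[g] = s; changed = true; }
        }
      }
      return stage;
    }

Grids that share a stage number have no mutual dependence and can be swept concurrently; interleaving the stages of many ordinates is what keeps all processors busy.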

Parallel Grid Sequencing
In practice, ordinates from all four quadrants are interleaved as much as possible, so execution begins at the four corners of the domain and moves toward the center.
2D Cartesian, all quadrants of the S4 ordinate set: 22 stages for 12 ordinates.

Parallel Grid Sequencing: RZ
In axisymmetric (RZ) coordinates, angular differencing transfers energy from ordinates directed inward toward the axis into more outward ordinates, so the inward ordinates must be swept first.
A 2D RZ S4 ordinate set requires 26 stages for 12 ordinates, up from 22 for Cartesian.

Parallel Grid Sequencing: AMR
43 level-1 grids; 66 stages for 40 ordinates (S8), with 20 waves in each direction.
(Figure panels show stages 4, 15, 34, and 62.)

Parallel Grid Sequencing: 3D AMR
- In 2D, grids are sorted for each ordinate direction
- In 3D, sorting isn't always possible: loops can form
- The solution is to split grids to break the loops
- Communication with split grids is implemented, as is a heuristic for determining which grids to split
- It is possible to always choose splits in the z direction only
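A hedged illustration of how such a loop might be detected in the grid dependency graph, using a three-color depth-first search (the splitting heuristic itself is not shown):

    #include <vector>

    // Returns true if the dependency graph for one ordinate contains a cycle.
    // downstream[g] lists the grids fed by grid g's outgoing faces.
    bool has_cycle_from(int g, const std::vector<std::vector<int>>& downstream,
                        std::vector<int>& color)    // 0 = unvisited, 1 = on stack, 2 = done
    {
      color[g] = 1;
      for (int d : downstream[g]) {
        if (color[d] == 1) return true;             // back edge: a loop exists
        if (color[d] == 0 && has_cycle_from(d, downstream, color)) return true;
      }
      color[g] = 2;
      return false;
    }

    bool dependency_graph_has_cycle(const std::vector<std::vector<int>>& downstream)
    {
      std::vector<int> color(downstream.size(), 0);
      for (int g = 0; g < static_cast<int>(downstream.size()); ++g)
        if (color[g] == 0 && has_cycle_from(g, downstream, color)) return true;
      return false;
    }

Any grid lying on a detected cycle is a candidate for splitting; once no cycles remain, stage assignment proceeds as in 2D.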

Acceleration by Conjugate Gradient
- A strong scattering term may make iterated transport sweeps slow to converge
- Conjugate gradient acceleration speeds up convergence dramatically
- The parallel operations required are then transport sweeps and inner products
- A diagonal preconditioner may be used, or, for larger ordinate sets, approximate solution of a related problem using a minimal S2 ordinate set
- No new parallel building blocks are required
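A hedged C++ sketch of preconditioned conjugate gradient in this setting: applying the operator hides a transport sweep, and the only other parallel operation is the inner product (a global reduction). Here A, M, and dot are placeholders for the sweep-based operator, the preconditioner, and the parallel dot product.

    #include <cmath>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;

    void pcg(const std::function<void(const Vec&, Vec&)>& A,   // y = A*x (one sweep inside)
             const std::function<void(const Vec&, Vec&)>& M,   // z = M^{-1}*r (diagonal or S2 solve)
             const std::function<double(const Vec&, const Vec&)>& dot,
             const Vec& b, Vec& x, double tol, int max_iters)
    {
      const std::size_t n = b.size();
      Vec r(n), z(n), p(n), Ap(n);
      A(x, Ap);                                                // initial residual r = b - A*x
      for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
      M(r, z);
      p = z;
      double rz = dot(r, z);
      for (int it = 0; it < max_iters; ++it) {
        A(p, Ap);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        if (std::sqrt(dot(r, r)) < tol) break;                 // converged
        M(r, z);
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
      }
    }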

2D Scaling (MCR Linux Cluster), Single Level, Not AMR
Stage counts: 40, 47, 52, 58, 64, 68, 74. Grids are arranged in a square array, one grid per processor, each grid 400x400 cells. Sn transport sweeps (Step and SCB) are for all 40 ordinates of an S8 ordinate set. Uses icc, ifc, and hypre version 1.8.2b on MCR (2.4 GHz Xeon, Quadrics QsNet Elan3).

3D Scaling (MCR Linux Cluster), Single Level, Not AMR
Stage counts: 85, 99, 113, 129. Grids are arranged in a cubical array, one grid per processor, each grid 40x40x40 cells. Sn transport sweeps (Step and SCB) are for all 80 ordinates of an S8 ordinate set.

AMR Scaling: 2D Grid Layout, Case 1: Separate Clusters of Fine Grids
To investigate scaling in AMR problems, I need to be able to generate “similar” problems of different sizes. I use repetitions of a unit cell of 4 coarse and 18 fine grids. Each processor gets 1 coarse grid. Due to load balancing, different processors get different numbers of fine grids.

AMR Scaling: 2D Grid Layout, Case 2: Coupled Fine Grids
The decoupled groups of fine grids in the previous AMR problem give the transport algorithms an advantage, since the groups do not depend on each other. This new problem couples fine grids across the entire width of the domain. Note the minor variations in grid layout from one tile to the next, due to the sequential nature of the regridding algorithm.

2D Fine Scaling (MCR Linux Cluster), Case 1: Separate Clusters of Fine Grids
Grids are arranged in a square array, with 4 coarse grids and 18 fine grids for every four processors; each coarse grid is 256x256 cells, with 41984 fine cells per processor. Sn transport sweeps are for all 40 ordinates of an S8 ordinate set.

2D Fine Scaling (MCR Linux Cluster), Case 2: Coupled Fine Grids
Grids are arranged in a square array, with one coarse grid and 5-6 fine grids per processor; each coarse grid is 256x256 cells, with ~51000 fine cells per processor. Sn transport sweeps are for all 40 ordinates of an S8 ordinate set.

3D Fine Scaling (MCR Linux Cluster), Case 1: Separate Clusters of Fine Grids
Grids are arranged in a cubical array, with 8 coarse grids and 58 fine grids for every eight processors; each coarse grid is 32x32x32 cells, with 28800 fine cells per processor. Sn transport sweeps are for all 80 ordinates of an S8 ordinate set.

3D Fine Scaling (MCR Linux Cluster), Case 2: Coupled Fine Grids
Grids are arranged in a cubical array, with one coarse grid and ~33 fine grids per processor; each coarse grid is 32x32x32 cells, with ~47600 fine cells per processor. Sn transport sweeps are for all 80 ordinates of an S8 ordinate set.

2D AMR Scaling (MCR Linux Cluster), Case 1: Separate Clusters of Fine Grids
Grids are arranged in a square array, with 4 coarse grids and 18 fine grids for every four processors; each coarse grid is 256x256 cells, with 41984 fine cells per processor. Sn transport sweeps are for all 40 ordinates of an S8 ordinate set.

2D AMR Scaling (MCR Linux Cluster), Case 2: Coupled Fine Grids
Grids are arranged in a square array, with one coarse grid and 5-6 fine grids per processor; each coarse grid is 256x256 cells, with ~51000 fine cells per processor. Sn transport sweeps are for all 40 ordinates of an S8 ordinate set.

2D AMR Scaling (MCR Linux Cluster), Case 2: Coupled Fine Grids (Optimized Setup)
This version implements the neighbor calculation in wave setup with an O(n) bin sort and uses a depth-first traversal for building waves (which makes little difference). In stage setup, wave intersections are optimized and stored. All optimizations are serial.

3D AMR Scaling (MCR Linux Cluster), Case 1: Separate Clusters of Fine Grids
Grids are arranged in a cubical array, with 8 coarse grids and 58 fine grids for every eight processors; each coarse grid is 32x32x32 cells, with 28800 fine cells per processor. Sn transport sweeps are for all 80 ordinates of an S8 ordinate set.

3D AMR Scaling (MCR Linux Cluster), Case 1: Separate Clusters (Optimized)
This version implements the neighbor calculation in wave setup with an O(n) bin sort. In stage setup, wave intersections are optimized and stored. All optimizations are serial.

3D AMR Scaling (MCR Linux Cluster), Case 2: Coupled Fine Grids (Optimized)
Grids are arranged in a cubical array, with one coarse grid and ~33 fine grids per processor; each coarse grid is 32x32x32 cells, with ~47600 fine cells per processor. Sn transport sweeps are for all 80 ordinates of an S8 ordinate set.

Transport Scaling Conclusions
A sweep through an S8 ordinate set and a multigrid V-cycle take similar amounts of time, and they scale in similar ways on up to 500 processors. Setup expenses for transport are amortized over several sweeps; this setup code determines the communication patterns between grids, including such things as the grid-splitting algorithm in 3D. So far, optimized scalar setup code has given acceptable performance, even in 3D.

Acceleration by Conjugate Gradient
Solve by sweeps, holding the right-hand side fixed:
Solve the homogeneous problem by conjugate gradient:
Matrix form:
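In standard operator notation (a hedged sketch; the slide's own symbols are not reproduced), sweeping with the right-hand side held fixed is the fixed-point iteration

    \psi^{k+1} = L^{-1} \left( S \psi^{k} + q \right),

which is equivalent to solving

    \left( I - L^{-1} S \right) \psi = L^{-1} q,

where L is the streaming-plus-collision operator that a sweep inverts and S is the scattering operator; conjugate gradient is then applied to this system (typically after symmetrization).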

Acceleration by Conjugate Gradient
Inner product:
Preconditioners:
- Diagonal
- Solution of a smaller (S2) system by DPCG; this system can be solved to a weak (inaccurate) tolerance without spoiling the accuracy of the overall iteration

“Clouds” Test Problem: Acceleration
Columns: Scheme, Res, Set, Accel, Iter, Sweeps, PreCon, Time
SCB, 128, S2: SI 18472 58.12; CG 290 876 3.283; DPCG 112 342 1.433
SCB, 128, S8: 18674 560.3 1752 52.88 111 678 20.92; S2PCG 12 84 836 6.583
SCB, 128, S16: 2615 319.4 1017 125.2 126 828 19.98
SCB, 128,512: 263 2891 1304. 163 1809 824.3 17 208 3570 144.9

“Clouds” Test Problem: Acceleration (continued), Res 128,512
Columns: Scheme, Set, Accel, Iter, Sweeps, PreCon, Time
SCB, S2: SI 19398 227.4; CG 260 1053 14.03; DPCG 168 683 9.333
SCB, S8: 263 2119 264.1 164 1327 166.4; S2PCG 16 143 3353 67.00
StepChar: 197 1392 398.9 158 1119 323.0 15 118 2892 121.0
S16: 2891 1304. 163 1809 824.3 17 208 3570 144.9
Step: 11 129 1889 106.0
Diamond: 19 274 5244 237.7 3108 265.3

“Clouds” Test Problem
- 1 km square domain
- No absorption or emission
- 400000 erg/cm²/s isotropic flux incoming at the top
- Specular reflection at the sides
- Absorbing bottom
- κs = 10⁻² cm⁻¹ inside clouds, κs = 10⁻⁶ cm⁻¹ elsewhere
- S2 uses DPCG; S8 uses S2PCG
- Serial timings on GPS (1 GHz Alpha EV6.8)

“Clouds” Test Problem: SCB Fluxes

Resolutions    Total Cells   S2 (4 ordinates)    S8 (40 ordinates)
                             Flux      Time      Flux      Time
32                    1024   17742     0.183     20115     0.950
64                    4096   13825     0.433     15842     1.783
128                  16384   18632     1.433     22677     6.583
32,128                7168   18633     1.233     22678     6.433
256                  65536   19804     6.833     26568     33.87
64,256               20480   19819     3.416     26571     19.00
512                 262144   20032     35.18     28644     209.2
128,512              57344   20057     9.333     28651     67.00
32,128,512           48128   20059     10.38     28654     70.55
1024               1048576   19994     162.7     29651     1035.
256,1024            212992   20014     45.87     29658     313.4
64,256,1024         167936   20031     47.13     29644     286.3

UCRL-PRES-212183 This work was performed under the auspices of the U. S. Department of Energy by the University of California Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.