Michael L. Norman, UC San Diego and SDSC
- A parallel AMR application for astrophysics and cosmology simulations
- Hybrid physics: fluid + particle + gravity + radiation
- Block-structured AMR
- MPI or hybrid parallelism
- Under continuous development since 1994 (Greg Bryan and Mike Norman, NCSA)
- Shared memory, distributed memory, and hierarchical memory versions
- C++/C/Fortran, >185,000 LOC
- Community code in widespread use worldwide
- Hundreds of users, dozens of developers
- Version
ASTROPHYSICAL FLUID DYNAMICS: supersonic turbulence
HYDRODYNAMIC COSMOLOGY: large-scale structure
Physics | Equations | Math type | Algorithm(s) | Communication
Dark matter | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
Gravity | Poisson | Elliptic | FFT, multigrid | Global
Gas dynamics | Euler | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
Magnetic fields | Ideal MHD | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
Radiation transport | Flux-limited radiation diffusion | Nonlinear parabolic | Implicit finite difference, multigrid solves | Global
Multispecies chemistry | Kinetic equations | Coupled stiff ODEs | Explicit BE, implicit | None
Inertial, tracer, source, and sink particles | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter

Physics modules can be used in any combination in 1D, 2D, and 3D, making ENZO a very powerful and versatile code.
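The explicit finite-volume solvers in the table above need only nearest-neighbor communication to fill ghost zones between adjacent patches. Below is a minimal sketch of such an exchange using MPI_Sendrecv over a periodic 1-D decomposition; the field size, layout, and neighbor arithmetic are illustrative assumptions, not ENZO's actual boundary-exchange code.

```cpp
// Minimal nearest-neighbor ghost-zone exchange along one axis.
// Hypothetical field layout: nx interior cells plus one ghost cell
// on each side, stored contiguously in 'field'.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int nx = 64;                        // interior cells per task (illustrative)
  std::vector<double> field(nx + 2, rank);  // [0] and [nx+1] are ghost zones

  int left  = (rank - 1 + nprocs) % nprocs; // periodic neighbors
  int right = (rank + 1) % nprocs;

  // Send the rightmost interior cell to the right neighbor,
  // receive the left neighbor's rightmost cell into the left ghost zone.
  MPI_Sendrecv(&field[nx], 1, MPI_DOUBLE, right, 0,
               &field[0],  1, MPI_DOUBLE, left,  0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  // Send the leftmost interior cell to the left neighbor,
  // receive the right neighbor's leftmost cell into the right ghost zone.
  MPI_Sendrecv(&field[1],      1, MPI_DOUBLE, left,  1,
               &field[nx + 1], 1, MPI_DOUBLE, right, 1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  MPI_Finalize();
  return 0;
}
```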
- Berger-Colella structured AMR
- Cartesian base grid and subgrids
- Hierarchical timestepping
AMR = collection of grids (patches) on Levels 0, 1, 2, ...; each grid is a C++ object
Unigrid = collection of Level 0 grid patches
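As a rough illustration of "each grid is a C++ object" and of hierarchical timestepping with a refinement factor of 2 in time, the sketch below models the hierarchy as per-level lists of grid objects. The class and member names are hypothetical and far simpler than ENZO's actual grid class.

```cpp
// Hypothetical, highly simplified model of a block-structured AMR hierarchy:
// each grid patch is an object, and each refinement level is a list of patches.
#include <array>
#include <memory>
#include <vector>

struct Grid {
  int level;                         // refinement level (0 = base grid)
  std::array<double, 3> left_edge;   // physical extent of the patch
  std::array<double, 3> right_edge;
  std::array<int, 3> dims;           // cells per axis
  std::vector<std::unique_ptr<Grid>> subgrids;  // finer patches nested inside
  // ... field arrays, particles, and boundary data would live here ...
};

struct Hierarchy {
  std::vector<std::vector<Grid*>> levels;  // levels[l] = all grids on level l

  // Hierarchical timestepping: level l takes two (refined) steps
  // for every step of level l-1 (refinement factor 2 in time).
  void EvolveLevel(int l, double dt) {
    if (l >= static_cast<int>(levels.size())) return;
    for (Grid* g : levels[l]) { (void)g; /* advance g by dt */ }
    EvolveLevel(l + 1, dt / 2);
    EvolveLevel(l + 1, dt / 2);
    // ... flux correction and projection of fine data back onto level l ...
  }
};

int main() {
  Hierarchy h;           // (levels would be populated by the refinement machinery)
  h.EvolveLevel(0, 1.0);
}
```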
Shared memory (PowerC) parallel ( )
- SMP and DSM architectures (SGI Origin 2000, Altix)
- Parallel DO across grids at a given refinement level, including the block-decomposed base grid
- O(10,000) grids

Distributed memory (MPI) parallel ( )
- MPP and SMP cluster architectures (e.g., IBM PowerN)
- Level 0 grid partitioned across processors
- Level >0 grids within a processor executed sequentially
- Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing; see the sketch below)
- O(100,000) grids
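A minimal sketch of greedy load balancing: repeatedly give the most expensive unassigned grid to the currently least-loaded processor. The grid costs and processor count below are hypothetical; in ENZO the assignment would trigger messaging the grid's data to the chosen processor.

```cpp
// Greedy load balancing sketch: assign grids (largest first) to the
// processor with the smallest accumulated work.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

int main() {
  const int nprocs = 4;
  std::vector<double> grid_work = {9, 7, 6, 5, 4, 3, 2, 1};  // hypothetical costs
  std::vector<double> load(nprocs, 0.0);
  std::vector<int> owner(grid_work.size());

  std::sort(grid_work.begin(), grid_work.end(), std::greater<double>());
  for (std::size_t g = 0; g < grid_work.size(); ++g) {
    int p = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
    owner[g] = p;              // in ENZO, the grid would be sent to processor p here
    load[p] += grid_work[g];
  }
  for (int p = 0; p < nprocs; ++p)
    std::printf("processor %d load = %g\n", p, load[p]);
}
```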
Projection of refinement levels: 160,000 grid patches at 4 refinement levels
- 1 MPI task per processor
- Task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels
Hierarchical memory (MPI+OpenMP) parallel (2008-)
- SMP and multicore cluster architectures (SUN Constellation, Cray XT4/5)
- Level 0 grid partitioned across shared-memory nodes/multicore processors
- Parallel DO across grids at a given refinement level within a node
- Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
- O(1,000,000) grids
- N MPI tasks per SMP, M OpenMP threads per task
- Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels
- Each grid is processed by an OpenMP thread
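A minimal sketch, under a deliberately simplified timestep loop, of the "parallel DO across grids at a given refinement level within a node": each MPI task owns its grids grouped by level, levels are processed sequentially, and OpenMP threads advance the grids of one level concurrently. Function and variable names are hypothetical.

```cpp
// Hybrid MPI+OpenMP sketch: MPI tasks own disjoint sets of grids;
// within a task, OpenMP threads advance the grids of one level concurrently.
#include <mpi.h>
#include <omp.h>
#include <cstddef>
#include <vector>

struct Grid { int id; /* field data omitted */ };

static void AdvanceGrid(Grid& g, double dt) { (void)g; (void)dt; /* solver work */ }

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  // Hypothetical: grids owned by this task, grouped by refinement level.
  std::vector<std::vector<Grid>> my_levels(3);
  my_levels[0].resize(4); my_levels[1].resize(16); my_levels[2].resize(64);

  double dt = 1.0;
  for (std::size_t level = 0; level < my_levels.size(); ++level, dt *= 0.5) {
    // Levels are processed sequentially; grids within a level concurrently.
    #pragma omp parallel for schedule(dynamic)
    for (long g = 0; g < static_cast<long>(my_levels[level].size()); ++g)
      AdvanceGrid(my_levels[level][g], dt);
    // ... inter-task ghost-zone exchange / flux correction (MPI) would go here ...
  }

  MPI_Finalize();
  return 0;
}
```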
ENZO ON CRAY XT5: 1% OF THE SIMULATION
- Non-AMR Mpc box
- 15,625 (25^3) MPI tasks, root grid tiles
- 6 OpenMP threads per task: 93,750 cores
- 30 TB per checkpoint/restart/data dump
- >15 GB/sec read, >7 GB/sec write
- Benefit of threading: reduce MPI overhead and improve disk I/O
ENZO ON CRAY XT5: 10^5 SPATIAL DYNAMIC RANGE
- AMR, Mpc box, 7 levels of refinement
- 4,096 (16^3) MPI tasks, 64^3 root grid tiles
- 1 to 6 OpenMP threads per task: up to 24,576 cores
- Benefit of threading: thread count increases with memory growth, reducing replication of grid hierarchy data
Using MPI+threads to access more RAM as the AMR calculation grows in size
ENZO-RHD ON CRAY XT5: COSMIC REIONIZATION
- Including radiation transport (10x more expensive)
- LLNL Hypre multigrid solver dominates run time; near-ideal scaling to at least 32K MPI tasks
- Non-AMR, and 16 Mpc boxes
- 4,096 (16^3) MPI tasks, 64^3 root grid tiles
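Since the LLNL Hypre multigrid solve dominates ENZO-RHD's run time, the sketch below shows a generic way to hand a distributed 7-point system to Hypre's Struct-interface PFMG solver. This is not ENZO-RHD's actual solver setup: the box decomposition, stencil coefficients, tolerance, and iteration limit are placeholders, and a default (32-bit integer, double-precision) Hypre build is assumed; newer Hypre versions may also require HYPRE_Init()/HYPRE_Finalize().

```cpp
// Generic sketch: solve A x = b on one structured box per MPI task with
// Hypre's PFMG geometric multigrid (Struct interface).
#include <mpi.h>
#include <vector>
#include "HYPRE_struct_ls.h"

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int n = 32;                              // cells per side per task (placeholder)
  int ilower[3] = {rank * n, 0, 0};              // simple 1-D decomposition of boxes in x
  int iupper[3] = {rank * n + n - 1, n - 1, n - 1};

  HYPRE_StructGrid grid;
  HYPRE_StructGridCreate(MPI_COMM_WORLD, 3, &grid);
  HYPRE_StructGridSetExtents(grid, ilower, iupper);
  HYPRE_StructGridAssemble(grid);

  // 7-point stencil for a diffusion-type operator.
  int offsets[7][3] = {{0,0,0},{-1,0,0},{1,0,0},{0,-1,0},{0,1,0},{0,0,-1},{0,0,1}};
  HYPRE_StructStencil stencil;
  HYPRE_StructStencilCreate(3, 7, &stencil);
  for (int e = 0; e < 7; ++e) HYPRE_StructStencilSetElement(stencil, e, offsets[e]);

  HYPRE_StructMatrix A;
  HYPRE_StructMatrixCreate(MPI_COMM_WORLD, grid, stencil, &A);
  HYPRE_StructMatrixInitialize(A);
  const long ncells = (long)n * n * n;
  std::vector<double> values(ncells * 7);
  for (long i = 0; i < ncells; ++i) {            // placeholder coefficients
    values[7 * i] = 6.0;                         // diagonal
    for (int e = 1; e < 7; ++e) values[7 * i + e] = -1.0;
  }
  int entries[7] = {0, 1, 2, 3, 4, 5, 6};
  HYPRE_StructMatrixSetBoxValues(A, ilower, iupper, 7, entries, values.data());
  HYPRE_StructMatrixAssemble(A);

  HYPRE_StructVector b, x;
  HYPRE_StructVectorCreate(MPI_COMM_WORLD, grid, &b);
  HYPRE_StructVectorCreate(MPI_COMM_WORLD, grid, &x);
  HYPRE_StructVectorInitialize(b);
  HYPRE_StructVectorInitialize(x);
  std::vector<double> rhs(ncells, 1.0), guess(ncells, 0.0);
  HYPRE_StructVectorSetBoxValues(b, ilower, iupper, rhs.data());
  HYPRE_StructVectorSetBoxValues(x, ilower, iupper, guess.data());
  HYPRE_StructVectorAssemble(b);
  HYPRE_StructVectorAssemble(x);

  HYPRE_StructSolver solver;
  HYPRE_StructPFMGCreate(MPI_COMM_WORLD, &solver);
  HYPRE_StructPFMGSetTol(solver, 1.0e-6);
  HYPRE_StructPFMGSetMaxIter(solver, 50);
  HYPRE_StructPFMGSetup(solver, A, b, x);
  HYPRE_StructPFMGSolve(solver, A, b, x);

  HYPRE_StructPFMGDestroy(solver);
  HYPRE_StructGridDestroy(grid);
  HYPRE_StructStencilDestroy(stencil);
  HYPRE_StructMatrixDestroy(A);
  HYPRE_StructVectorDestroy(b);
  HYPRE_StructVectorDestroy(x);
  MPI_Finalize();
  return 0;
}
```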
- Cosmic reionization is a weak-scaling problem: large volumes at a fixed resolution to span the range of scales
- Non-AMR with ENZO-RHD
- Hybrid MPI and OpenMP; SMT and SIMD tuning
- to root grid tiles
- 4-8 OpenMP threads per task
- 4-8 TB per checkpoint/restart/data dump (HDF5)
- In-core intermediate checkpoints (?)
- 64-bit arithmetic, 64-bit integers and pointers
- Aiming for K cores, M hours (?)
- ENZO's AMR infrastructure limits scalability to O(10^4) cores
- We are developing a new, extremely scalable AMR infrastructure called Cello
- ENZO-P will be implemented on top of Cello to scale to
- Hierarchical parallelism and load balancing to improve localization
- Relax global synchronization to a minimum
- Flexible mapping between data structures and concurrency
- Object-oriented design
- Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)
1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in both size and depth (see the sketch after this list)
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamic task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Addressing numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detecting and handling hardware or software faults during run-time to improve software resilience and enable software self-management
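As a sketch of the octree-of-blocks idea in point 1 (and of point 7's fixed block size per task), the hypothetical class below attaches a fixed-size block of cells to every tree node and creates eight child blocks on refinement. It is not Cello's actual data structure, only an illustration of the octree organization.

```cpp
// Minimal octree-of-blocks sketch: every node carries a fixed-size block of
// cells, and refining a node creates 8 children, one per octant.
#include <array>
#include <memory>
#include <vector>

constexpr int kBlockSize = 16;  // cells per axis in every block (fixed task size)

struct TreeNode {
  int level;
  std::array<double, 3> lower, upper;            // physical extent of this block
  std::vector<double> cells;                     // kBlockSize^3 cell values
  std::array<std::unique_ptr<TreeNode>, 8> child;

  TreeNode(int lev, std::array<double, 3> lo, std::array<double, 3> up)
      : level(lev), lower(lo), upper(up),
        cells(kBlockSize * kBlockSize * kBlockSize, 0.0) {}

  bool IsLeaf() const { return !child[0]; }

  // Refine: split this block's volume into 8 octants, each again covered by a
  // kBlockSize^3 block, so resolution doubles while task size stays fixed.
  void Refine() {
    for (int c = 0; c < 8; ++c) {
      std::array<double, 3> lo, up;
      for (int d = 0; d < 3; ++d) {
        double mid = 0.5 * (lower[d] + upper[d]);
        lo[d] = ((c >> d) & 1) ? mid : lower[d];
        up[d] = ((c >> d) & 1) ? upper[d] : mid;
      }
      child[c] = std::make_unique<TreeNode>(level + 1, lo, up);
    }
  }
};

int main() {
  TreeNode root(0, {0, 0, 0}, {1, 1, 1});
  root.Refine();            // refine where a hypothetical criterion triggers
  root.child[0]->Refine();  // deepen one octant
}
```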
Enzo website (code, documentation) 2010 Enzo User Workshop slides yt website (analysis and vis.) Jacques website (analysis and vis.)
Diagram: AMR grid patches labeled (level, grid index) on Level 0, Level 1, and Level 2, e.g., (0,0), (1,0), (2,0), (2,1).
Diagram: the AMR grid hierarchy as a tree of (level, sibling) nodes; depth = level, breadth = number of siblings. Caption: scaling the AMR grid hierarchy in depth and breadth.
Level | Grids | Memory (MB) | Work = Mem * 2^level
0 | … | … | …
1 | …,275 | 114,629 | 229,258
2 | …,522 | 21,226 | 84,904
3 | …,448 | 6,085 | 48,680
4 | 7,216 | 1,975 | 31,600
5 | 3,370 | 1,006 | 32,192
6 | 1,… | … | …,808
Total | 305,881 | 324,860 | 683,807
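One way to read the Work column, assuming hierarchical timestepping with a factor-of-2 refinement in time (so level l is advanced 2^l times per root-level step); the notation below is not from the original slide:

```latex
% Work estimate per level, assuming level \ell takes 2^{\ell} timesteps
% per level-0 timestep:
W_{\ell} = M_{\ell}\,2^{\ell},
\qquad
W_{\mathrm{total}} = \sum_{\ell=0}^{\ell_{\max}} M_{\ell}\,2^{\ell}
% e.g., level 5: 1{,}006\ \mathrm{MB} \times 2^{5} = 32{,}192.
```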
Diagram (current MPI implementation): real grid objects hold grid metadata plus physics data; virtual grid objects hold grid metadata only.
- The flat MPI implementation is not scalable because grid hierarchy metadata is replicated in every processor
- For very large grid counts, this metadata dominates the memory requirement (not the physics data!)
- The hybrid parallel implementation helps a lot: hierarchy metadata is now replicated only in every SMP node instead of every processor
- We would prefer fewer SMP nodes ( ) with bigger core counts (32-64) (= 262,144 cores)
- The communication burden is partially shifted from MPI to intra-node memory accesses
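A rough, hypothetical estimate of why the replicated metadata dominates: with N grids in the hierarchy and m bytes of metadata per grid, flat MPI stores all N*m bytes in every MPI process, while the hybrid scheme stores them once per SMP node. The numbers below (N = 10^6, m = 1 KB) are illustrative only.

```latex
% Illustrative numbers only: N = 10^{6} grids, m = 1\,\mathrm{KB} of metadata per grid.
M_{\mathrm{flat}} \approx N\,m \ \text{(per MPI process)} \approx 1\,\mathrm{GB},
\qquad
M_{\mathrm{hybrid}} \approx N\,m \ \text{(per SMP node, shared by its cores)}
```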
- Targeted at fluid, particle, or hybrid (fluid+particle) simulations on millions of cores
- Generic AMR scaling issues:
  - Small AMR patches restrict available parallelism
  - Dynamic load balancing
  - Maintaining data locality for deep hierarchies
  - Re-meshing efficiency and scalability
  - Inherently global multilevel elliptic solves
  - Increased range and precision requirements for deep hierarchies