Martin Berzins (Steve Parker)
What are the hard apps problems? ADAPTIVE, DYNAMIC, GLOBAL.
How do the solutions get shared? ENCAPSULATION, ABSTRACTION.
What non-apps work is needed? APPLICATION-DRIVEN TOOLS.
Thanks to DOE for funding since 1997 and NSF since 2008.
Uintah: a solver for multiphase fluid-structure interaction problems, e.g., an explosive-filled container in a fire.

Hard apps problems are multi-physics/multiscale, with adaptive methods and/or global communications:
1. Lack of predictability forces the use of dynamic load balancing methods.
2. AMR data structures require migration.
3. Radiation problems need global communications.
4. Particles move across the grid with our methods.
5. Any attempt to compute to a particular solution accuracy will need adaptive methods.
UNPREDICTABLE!

Uintah Domain Decomposition. The fundamental Uintah data structure is a patch, which carries multiple variable types: particle variables, cell-centered variables, and cell-vertex variables. Space-filling-curve (SFC) load balancing operates on patches. The user writes code for a patch and its communications only; Uintah uses this information to construct the communication pattern via a task graph.
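To make the SFC idea concrete, here is a minimal sketch (not Uintah's actual load balancer; Patch, mortonKey, and sfcAssign are illustrative names) of the standard technique: order patches along a Morton (Z-order) curve, then cut the curve into contiguous chunks of roughly equal cost, one chunk per processor.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Patch { uint32_t i, j, k; double cost; int owner; };

// Spread the low 10 bits of v apart so x, y, z bits can be interleaved.
static uint64_t spread3(uint64_t v) {
    v &= 0x3ff;
    v = (v ^ (v << 16)) & 0xff0000ff;
    v = (v ^ (v <<  8)) & 0x0300f00f;
    v = (v ^ (v <<  4)) & 0x030c30c3;
    v = (v ^ (v <<  2)) & 0x09249249;
    return v;
}

// Morton key: interleaved i, j, k bits give a 1-D order that preserves
// 3-D spatial locality.
static uint64_t mortonKey(const Patch& p) {
    return spread3(p.i) | (spread3(p.j) << 1) | (spread3(p.k) << 2);
}

void sfcAssign(std::vector<Patch>& patches, int nprocs) {
    std::sort(patches.begin(), patches.end(),
              [](const Patch& a, const Patch& b) {
                  return mortonKey(a) < mortonKey(b);
              });
    double total = 0.0;
    for (const Patch& p : patches) total += p.cost;
    // Cut the curve into nprocs chunks of roughly equal total cost.
    double perProc = total / nprocs, accumulated = 0.0;
    for (Patch& p : patches) {
        p.owner = std::min(nprocs - 1, (int)(accumulated / perProc));
        accumulated += p.cost;
    }
}

Because the curve preserves spatial locality, contiguous chunks of it correspond to compact regions of the domain, which keeps ghost-cell communication mostly local even after rebalancing.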

Combining MPM (the Material Point Method) and mesh refinement.

How Does Uintah Work? [Architecture diagram] The Simulation Controller reads an XML problem specification and drives one simulation component (one of Arches, ICE, MPM, MPMICE, MPMArches, …). The component hands tasks to the Scheduler, which issues callbacks and MPI assignments; the Load Balancer supplies the configuration, and the Data Archiver stores the results.
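A hypothetical sketch of the control flow the diagram implies (all class and method names here are illustrative, not Uintah's actual API):

#include <vector>

// Illustrative names only -- a caricature of the diagram, not Uintah code.
struct ProblemSpec {};                    // parsed from the XML input file
struct Task {};                           // a unit of work with declared I/O

struct Scheduler {
    std::vector<Task> tasks;
    void addTask(const Task& t) { tasks.push_back(t); }
    void execute() { /* run callbacks; issue MPI messages at graph edges */ }
};

struct LoadBalancer {
    void assignPatches(Scheduler&) { /* map patches to MPI ranks */ }
};

struct DataArchiver {
    void save(int timestep) { /* write checkpoint/output files */ (void)timestep; }
};

struct SimulationComponent {              // one of Arches, ICE, MPM, ...
    virtual void scheduleTimeAdvance(Scheduler& sched) { sched.addTask(Task{}); }
    virtual ~SimulationComponent() = default;
};

// The controller wires the pieces together and drives the timestep loop.
void runSimulation(SimulationComponent& sim, int nsteps) {
    Scheduler sched;
    LoadBalancer lb;
    DataArchiver archiver;
    for (int step = 0; step < nsteps; ++step) {
        sim.scheduleTimeAdvance(sched);   // component describes its tasks
        lb.assignPatches(sched);          // load balancer configures the mapping
        sched.execute();                  // scheduler runs the task graph
        archiver.save(step);              // archiver stores the results
    }
}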

Burgers Equation code

void Burger::timeAdvance(const ProcessorGroup*,
                         const PatchSubset* patches,
                         const MaterialSubset* matls,
                         DataWarehouse* old_dw, DataWarehouse* new_dw)
{
  // Loop over all patches on this processor
  for (int p = 0; p < patches->size(); p++) {
    const Patch* patch = patches->get(p);
    int matl = 0;                    // (declarations reconstructed; the slide omits them)
    constNCVariable<double> u;       // old solution, node-centered
    NCVariable<double> new_u;        // new solution
    delt_vartype dt;                 // timestep

    // Get data from the data warehouse, including 1 layer of "ghost"
    // nodes from the surrounding patches
    old_dw->get(u, lb_->u, matl, patch, Ghost::AroundNodes, 1);
    // dt, dx: time and space increments
    Vector dx = patch->getLevel()->dCell();
    old_dw->get(dt, sharedState_->get_delt_label());
    // Allocate memory for the results
    new_dw->allocateAndPut(new_u, lb_->u, matl, patch);
    // Define iterator range l and h ... (lots missing here) ...
    // Iterate through all the nodes
    for (NodeIterator iter(l, h); !iter.done(); iter++) {
      IntVector n = *iter;
      double dudx = (u[n+IntVector(1,0,0)] - u[n-IntVector(1,0,0)])
                    / (2.0 * dx.x());
      double du = -u[n] * dt * dudx;
      new_u[n] = u[n] + du;
    }
  }
}
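The inner loop is a single explicit (forward Euler) step for the inviscid Burgers equation $u_t + u\,u_x = 0$, with a central difference for the spatial derivative:

$$ u_i^{n+1} = u_i^n - u_i^n\,\Delta t\,\frac{u_{i+1}^n - u_{i-1}^n}{2\,\Delta x} $$

Each node is updated from its two neighbors in the old data warehouse, which is exactly why one layer of ghost nodes is requested.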

Task graph. Each algorithm defines a description of the computation: its required inputs and outputs (names and spatial relationships), and callbacks to perform each task on a single subregion of space. Communication is performed at the edges in the graph. Uintah uses this information to create a graph of computation and communication.
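For example, the Burgers task above would be registered with the scheduler along the following lines (a sketch based on Uintah's published example code; exact signatures may differ):

void Burger::scheduleTimeAdvance(const LevelP& level, SchedulerP& sched)
{
  // scinew is Uintah's tracked allocation macro (like new).
  Task* task = scinew Task("Burger::timeAdvance", this, &Burger::timeAdvance);
  // Declare inputs: u from the old data warehouse, with one layer of ghost nodes.
  task->requires(Task::OldDW, lb_->u, Ghost::AroundNodes, 1);
  task->requires(Task::OldDW, sharedState_->get_delt_label());
  // Declare outputs: the updated u in the new data warehouse.
  task->computes(lb_->u);
  // From these declarations Uintah derives the graph edges and thus the
  // communication pattern; the user never writes MPI calls directly.
  sched->addTask(task, level->eachPatch(), sharedState_->allMaterials());
}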

AMR Scalability. A challenging, dynamically changing workload: 8x8x8 patches, timings over about 30 timesteps, 8K to 20K patches. The original scheme remeshes at every step; the dilated scheme remeshes only every 4 steps or so.
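The dilation idea can be sketched as follows (illustrative code, not Uintah's regridder): grow each refinement flag by a buffer of cells so the refined region remains valid while the solution moves, allowing regridding every k steps instead of every step.

#include <algorithm>
#include <vector>

// flags marks cells needing refinement on a 1-D strip of cells.
// Dilating the flags by `buffer` cells keeps the mesh valid while features
// move up to `buffer` cells, so regridding can be deferred for several steps.
std::vector<bool> dilateFlags(const std::vector<bool>& flags, int buffer)
{
    const int n = (int)flags.size();
    std::vector<bool> dilated(n, false);
    for (int i = 0; i < n; ++i) {
        if (!flags[i]) continue;
        for (int j = std::max(0, i - buffer); j <= std::min(n - 1, i + buffer); ++j)
            dilated[j] = true;
    }
    return dilated;
}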

Machines used:
Atlas (LLNL) – 1,152 nodes of 4 dual-core Opterons, InfiniBand.
Thunder (LLNL) – 1,024 nodes of 4 Intel Itanium2, Quadrics switch.
Red Storm (SNL) – XT3, dual-core Opteron nodes.
Ranger (UT Austin) – 3,936 nodes of 4 quad-core Barcelonas, InfiniBand.
Small problem: only 2-3 patches per processor at 4,096 processors.

Summary. Uintah is an adaptive multi-physics AMR code, with a clear separation between the application user and the system components, and a very general CS approach to load balancing and scalability. It is an expensive multidisciplinary effort.

Existing AMR scalability work. Brian Van Stralen: weak scaling is easy; strong scaling is hard. Strong-scaling problems include message overload, malloc inconsistencies, and load imbalance, but it is not clear that these problems cannot be solved. AMR does scale to 12K processors with the Flash code on BG/L, and perhaps beyond. Strong scalability: for a fixed problem size, doubling the processors should halve the execution time. Weak scalability: the problem size grows with the processors, so doubling the processors gives constant execution time.
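In formulas, writing $T(N, P)$ for the execution time of a problem of size $N$ on $P$ processors, the two notions of parallel efficiency are

$$ E_{\mathrm{strong}}(P) = \frac{T(N, 1)}{P\,T(N, P)}, \qquad E_{\mathrm{weak}}(P) = \frac{T(N, 1)}{T(P N, P)}, $$

and perfect scaling corresponds to $E = 1$ in each case: halved time per doubling for strong scaling, constant time per doubling for weak scaling.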