
ALEGRA is a large, highly capable, option-rich production application that solves coupled multi-physics PDEs modeling magnetohydrodynamics, electromechanics, stochastic damage, and detailed interface mechanics in high strain rate regimes on unstructured meshes in an ALE framework. Nearly all the algorithms must accept dynamic, mixed-material elements, which are modified by the remeshing, interface reconstruction, and advection components. Recent trends in computing hardware have forced application developers to think about how to improve performance on traditional CPUs and to look forward to next-generation platforms. Core to the ALEGRA performance strategy is to improve and rewrite loop bodies so that they conform to the requirements of high-performance kernels: accessing data in array form, no pointer dereferencing, no function calls, and thread safety. Achieving this, however, requires changes to the underlying infrastructure. We report on recent progress in the infrastructure to support array-based data access and iteration over mesh objects. The effects on performance on traditional platforms will be shown. We also discuss the practical realities and cost estimates of moving an existing, full-featured production application like ALEGRA toward running effectively on future platforms while remaining maintainable.

The ALEGRA Production Application: Strategy, Challenges and Progress Toward Next Generation Platforms
Richard R. Drake
Dept Computational Multiphysics, Sandia National Laboratories
Algorithms & Abstractions for Assembly in PDE Codes, May 12-14, 2014

ALEGRA: Shock Hydro & MHD
- 20 years of development & evolution
- Operator split, multi-physics
- Includes explicit and implicit PDE solvers
- 2 and 3 spatial dimensions
- Core hydro is multi-material Lagrangian plus remap
- An XFEM capability is maturing
- 650k LOC (not including libraries, such as Trilinos)
- Mix of research, development, and production capabilities
- Extensive material model choices
Figure panels: Shock hydro; 2D Magnetics; 3D Resistive MHD

Some ALEGRA Core Algorithms
- Mixed material cell treatment
- Remap
- Remesh
- Material interface reconstruction
- Material & field advection
- Dynamic topology
- Extended Finite Element Method (XFEM)
- Spatial refinement/unrefinement
- Flexible set of material models comprising each material
- Central difference and midpoint time integration options
Figure captions: XFEM requires topological enrichment; Material interface reconstruction; Swept volume & intersection remap

NEVADA Infrastructure (A Framework)
Everything depends on the "Mesh"
Diagram components: Field I/O; Load Balancing; Contact; Spatial Adaptivity; XFEM Adaptivity; Halo Comm; In-Situ Processing; In-Situ Viz; Remesh; Interface Reconstruction; Advection; Input Parsing; Physics Algorithms; Unstructured Mesh; Structured Mesh; Materials

Performance
We need to run faster!
- Customer needs
- NW needs
- Optics (marketing)
It has become clear that:
- There is no performance silver bullet
- Application software must change
- This will require a resource shift
Can't rely on faster CPUs anymore!
Figure labels: 56%; 60%; Muzia, 2D

The ALEGRA Performance Strategy
Work in the present but aim for the future.
- Incrementally reimplement algorithms
  - Remesh, interface reconstruction, advection
  - Lagrangian step pieces
  - Matrix assembly coding
  - Time step size computation
- Focus on foundational concepts
  - Accessing bulk data in array form
  - Limit pointer dereferencing
  - Limit function calls (non-inlined)
  - Minimize the data read/writes
  - Thread safety
- Refactor support infrastructure
  - Enable array-based access
  - Enable flat index-based iteration
  - Enable thread safety (colorings?)
- Consider new algorithms
  - Alternate formulations
  - New/different algorithms
[Komatitsch]
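The slide only raises coloring as a question ("colorings?"), so the following is a generic illustration and not ALEGRA's approach. It is a minimal sketch of greedy element coloring, assuming element-to-node connectivity is available as a list of node indices per element; `color_elements`, `elem_nodes`, and `node_colors` are hypothetical names. Elements that share a node receive different colors, so all elements of one color can update nodal data concurrently without write conflicts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy element coloring (hypothetical helper, not an ALEGRA routine).
// elem_nodes[e] lists the node indices of element e.
// Returns one color per element; elements sharing a node never share a color.
std::vector<int> color_elements(const std::vector<std::vector<int>>& elem_nodes,
                                std::size_t num_nodes) {
    std::vector<int> color(elem_nodes.size(), -1);
    // node_colors[n] holds the colors already used by elements touching node n.
    std::vector<std::vector<int>> node_colors(num_nodes);
    for (std::size_t e = 0; e < elem_nodes.size(); ++e) {
        std::vector<bool> taken;  // taken[c] is true if color c is unavailable
        for (int n : elem_nodes[e]) {
            assert(n >= 0 && static_cast<std::size_t>(n) < num_nodes);
            for (int c : node_colors[n]) {
                if (static_cast<std::size_t>(c) >= taken.size())
                    taken.resize(static_cast<std::size_t>(c) + 1, false);
                taken[c] = true;
            }
        }
        // Pick the smallest color not used by any neighbor.
        int c = 0;
        while (static_cast<std::size_t>(c) < taken.size() && taken[c]) ++c;
        color[e] = c;
        for (int n : elem_nodes[e]) node_colors[n].push_back(c);
    }
    return color;
}
```

A threaded loop would then process one color at a time, running the elements within a color in parallel. The number of colors depends on the mesh and on the element visiting order; the greedy choice does not guarantee a minimum.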

Progress in Data Layout
"Transpose" the storage
Diagram: object-based layout (v1 v2 v3 v4 ...) versus array-based layout (v1 v2 v3 v4 ..., a "double**" indexed by "obj_idx" 0, 1, 2)
Common, existing access pattern:
    nd->Vector_Var( CURCOOR )
Becomes, in object layout:
    nd->data[ CURCOOR ]
in array layout:
    nd->data[ CURCOOR ][ nd->obj_idx ]
Object-based layout has more direct access to memory. Array-based layout has better cache & TLB behavior. Depending on the algorithm and problem size, the better memory behavior may or may not offset the extra dereferencing.
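To make the "transpose" concrete, here is a minimal sketch under simplifying assumptions: one scalar value per variable slot, and `std::vector` storage standing in for the raw "double**" on the slide. `ObjNode`, `ArrayStore`, `IdxNode`, `transpose`, and `sum_var` are hypothetical names, not NEVADA classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Object-based layout: each mesh object owns storage for all of its variables.
struct ObjNode {
    std::vector<double> data;  // data[var]
};

// Array-based layout: one contiguous array per variable, indexed by obj_idx.
struct ArrayStore {
    std::vector<std::vector<double>> data;  // data[var][obj_idx]
};

// In the array-based layout an object keeps only its index and a store pointer.
struct IdxNode {
    const ArrayStore* store;
    std::size_t obj_idx;
    // One extra index compared with the object layout: data[var][obj_idx].
    double value(std::size_t var) const { return store->data[var][obj_idx]; }
};

// "Transpose" object-based storage into array-based storage.
ArrayStore transpose(const std::vector<ObjNode>& nodes, std::size_t num_vars) {
    ArrayStore s;
    s.data.assign(num_vars, std::vector<double>(nodes.size(), 0.0));
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        assert(nodes[i].data.size() == num_vars);
        for (std::size_t v = 0; v < num_vars; ++v)
            s.data[v][i] = nodes[i].data[v];
    }
    return s;
}

// Sum one variable over all objects; in array layout this is a unit-stride sweep.
double sum_var(const ArrayStore& s, std::size_t var) {
    double total = 0.0;
    for (double x : s.data[var]) total += x;
    return total;
}
```

The sweep in `sum_var` touches consecutive memory, which is the cache and TLB benefit the slide refers to; `IdxNode::value` shows the extra indexing step that an object-style access pays in exchange.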

Speedups: Object- versus Array-Based
- Comparisons of unmodified versus array-based code
- Intel chips: RedSky = Nehalem, TLCC2 = SandyBridge
- The memory behavior wins over the extra offset in many cases.

Algorithms Should Use the Arrays Directly
(Oversimplified, hypothetical loop)

Object-based access:

    Element * el = 0;
    TOTAL_ELEMENT_LOOP(el) {
      const Vector vara = el->Vector_Var( VARA_IDX );
      Vector & varb = el->Vector_Var( VARB_IDX );
      el->Vector_Var( VARA_IDX ) += varb;
      el->Scalar_Var( VARC_IDX ) = vara * varb;
    }

Array-based access:

    ArrayView vara = mesh->getField( VARA_IDX );
    ArrayView varb = mesh->getField( VARB_IDX );
    ArrayView varc = mesh->getField( VARC_IDX );
    Element * el = 0;
    TOTAL_ELEMENT_LOOP(el) {
      const int ei = el->Idx();
      const Vector va = vara[ei];
      vara[ei] += varb[ei];
      varc[ei] = va * varb[ei];
    }
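For readers who want to compile something close to the array-based loop, the following is a self-contained sketch. It does not reproduce ALEGRA's `ArrayView`, `Vector`, or `mesh->getField`; instead it assumes a hypothetical 2D `Vec2` whose `operator*` is a dot product, and a hypothetical non-owning `FieldView` over contiguous storage. The arithmetic mirrors the slide's loop, driven purely by element index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D vector type; operator* is assumed to be a dot product.
struct Vec2 {
    double x, y;
    Vec2& operator+=(const Vec2& o) { x += o.x; y += o.y; return *this; }
};
inline double operator*(const Vec2& a, const Vec2& b) { return a.x * b.x + a.y * b.y; }

// Hypothetical non-owning view over a contiguous field array.
template <typename T>
struct FieldView {
    T* ptr;
    std::size_t len;
    T& operator[](std::size_t i) const { return ptr[i]; }
    std::size_t size() const { return len; }
};

template <typename T>
FieldView<T> make_view(std::vector<T>& v) { return FieldView<T>{v.data(), v.size()}; }

// Same arithmetic as the slide's loop, with no element objects involved.
void update_fields(FieldView<Vec2> vara, FieldView<Vec2> varb, FieldView<double> varc) {
    assert(vara.size() == varb.size() && vara.size() == varc.size());
    for (std::size_t ei = 0; ei < vara.size(); ++ei) {
        const Vec2 va = vara[ei];   // copy the old value before updating
        vara[ei] += varb[ei];
        varc[ei] = va * varb[ei];   // uses the pre-update value of vara
    }
}
```

The loop body contains no pointer chasing through mesh objects and no non-inlined calls, which matches the "foundational concepts" listed in the performance strategy.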

Object List & Iteration Improvements
- Index based mesh object storage
- Enables iteration without dereferencing objects
- Performance comparison shows no improvement
- Algorithms would have to take advantage first
Diagram: doubly linked lists (List, Data, Nodes) converted to index sets (List, Data) that use integer offsets 0, 1, 2, ...
Can now do this:

    for ( int i=0; i<N; ++i ) {
      int ni = index_list[i];
      vel[ni] = old_vel + dt * accl[ni];
      ...
    }
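The conversion from a linked list to an index set can be sketched as follows. This is an illustration under stated assumptions, not NEVADA code: `ListNode`, `build_index_list`, and `advance_velocity` are hypothetical, fields are plain `std::vector<double>`, and `old_vel` is assumed to be a per-node array (the slide writes it without an index).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical doubly linked list node, as it might look before conversion.
struct ListNode {
    int obj_idx;     // offset of this node's data in the field arrays
    ListNode* prev;
    ListNode* next;
};

// Walk the list once and record the integer offsets.
std::vector<int> build_index_list(const ListNode* head) {
    std::vector<int> index_list;
    for (const ListNode* n = head; n != nullptr; n = n->next)
        index_list.push_back(n->obj_idx);
    return index_list;
}

// Velocity update over an index set; no mesh objects are dereferenced.
void advance_velocity(const std::vector<int>& index_list, double dt,
                      const std::vector<double>& old_vel,
                      const std::vector<double>& accl,
                      std::vector<double>& vel) {
    for (std::size_t i = 0; i < index_list.size(); ++i) {
        const int ni = index_list[i];
        assert(ni >= 0 && static_cast<std::size_t>(ni) < vel.size());
        vel[ni] = old_vel[ni] + dt * accl[ni];
    }
}
```

Only entries named in the index set are updated, so a subset of nodes can be processed without building a separate object list.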

Object Ordering Exploration
- Improve cache locality by mesh object ordering
- Hmm? No speedups over default ordering
Figure captions: Order elements by space filling curve [wikipedia]; Order nodes by first touch element loop
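The slide does not say which space-filling curve was used. As one possible illustration, the sketch below orders elements by a 2D Morton (Z-order) key, assuming each element's centroid has already been quantized to integer grid coordinates below 2^16. `spread_bits`, `morton2d`, `Cell`, and `morton_order` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Spread the low 16 bits of v so a zero bit sits between each original bit.
std::uint32_t spread_bits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// 2D Morton key: x bits in even positions, y bits in odd positions.
std::uint32_t morton2d(std::uint32_t ix, std::uint32_t iy) {
    return spread_bits(ix) | (spread_bits(iy) << 1);
}

// Quantized element centroid (assumed input; each coordinate below 2^16).
struct Cell { std::uint32_t ix, iy; };

// Return the permutation that visits elements in Morton order.
std::vector<std::size_t> morton_order(const std::vector<Cell>& cells) {
    std::vector<std::size_t> perm(cells.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::stable_sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
        return morton2d(cells[a].ix, cells[a].iy) < morton2d(cells[b].ix, cells[b].iy);
    });
    return perm;
}
```

Applying the returned permutation to the element arrays places spatially nearby elements near each other in memory. As the slide reports, such a reordering gave no speedup over the default ordering in ALEGRA's tests, so any benefit should be measured rather than assumed.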

Summary
- ALEGRA has adopted a low risk performance strategy
  - Main concept: incrementally rewrite algorithms towards NGP standards
- Progress made on support infrastructure
  - Array-based field data
  - Integer index set object looping
  - 1.4X speedup realized on realistic simulations
- Work continues on infrastructure & algorithms
  - Data: Topology storage, integer field data, material data
  - Algorithms: Remap, Lagrangian step