1  Scalable Cosmological Simulations on Parallel Machines
Filippo Gioachin¹, Amit Sharma¹, Sayantan Chakravorty¹, Celso Mendes¹, Laxmikant V. Kale¹, Thomas R. Quinn²
¹ University of Illinois at Urbana-Champaign   ² University of Washington

2  Outline: ParallelGravity
● Motivations
● Charm++
● Basic algorithm
● CacheManager
● Prefetching
● Interaction lists
● Load balancer
● Scalability

3  Motivations
● Need for simulations of the evolution of the universe
● Current parallel codes:
  – PKDGRAV
  – Gadget
● Scalability problems:
  – load imbalance
  – expensive domain decomposition
  – limited to 128 processors

4  Charm++ Overview
● work decomposed into objects called chares
● message driven
● mapping of objects to processors transparent to the user
● automatic load balancing
● communication optimization
[Diagram: user view vs. system view — chares mapped by the runtime onto processors P1, P2, P3]
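
To make the chare model concrete, here is a minimal hedged Charm++ sketch. The module, class, and method names (sketch, Main, Piece, compute, kPieces) are invented for this example and are not taken from ParallelGravity; it assumes compilation with charmc, which generates the .decl.h/.def.h headers from the interface file. Work is expressed as an array of chares, each entry-method call is an asynchronous message, and the runtime chooses (and may later change) the mapping of chares to processors.

    // sketch.ci -- interface file processed by charmc
    mainmodule sketch {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] Piece {
        entry Piece();
        entry void compute();
      };
    };

    // sketch.C -- implementation; sketch.decl.h / sketch.def.h are generated from sketch.ci
    #include "sketch.decl.h"

    /*readonly*/ CProxy_Main mainProxy;
    static const int kPieces = 8;

    class Main : public CBase_Main {
      int finished;
    public:
      Main(CkArgMsg* m) : finished(0) {
        delete m;
        mainProxy = thisProxy;
        CProxy_Piece pieces = CProxy_Piece::ckNew(kPieces); // create the chares; mapping is up to the runtime
        pieces.compute();                                   // broadcast: one asynchronous message per chare
      }
      void done() {                                         // called back by each Piece when it finishes
        if (++finished == kPieces) CkExit();
      }
    };

    class Piece : public CBase_Piece {
    public:
      Piece() {}
      Piece(CkMigrateMessage*) {}
      void compute() {
        CkPrintf("Piece %d running on processor %d\n", thisIndex, CkMyPe());
        mainProxy.done();
      }
    };

    #include "sketch.def.h"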

5  ParallelGravity
● Simulator of cosmological interaction (gravity)
● Particle based
  – high resolution where needed
  – based on tree structures
    ● different types of trees
    ● different domain decompositions
● Implemented in Charm++
  – work divided among chares called TreePieces

6  Datasets and Systems
lambs: 3 million particles (47 MB)
dwarf: 5 million particles (80 MB)

7  Space decomposition
[Diagram: the simulation volume split among TreePiece 1, TreePiece 2, TreePiece 3, ...]
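
The slides do not spell out how particles are assigned to TreePieces. As one common approach in tree codes (an assumption for illustration, not necessarily the exact scheme used here), the sketch below orders particles along a Morton space-filling curve and cuts the ordering into contiguous chunks, so each TreePiece receives a spatially compact set of particles.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Particle { float x, y, z; };   // coordinates assumed normalized to [0,1)

    // Interleave the low 21 bits of a coordinate into every third bit of a 64-bit key.
    static uint64_t spread(uint64_t v) {
        v &= 0x1fffff;
        v = (v | (v << 32)) & 0x1f00000000ffffULL;
        v = (v | (v << 16)) & 0x1f0000ff0000ffULL;
        v = (v | (v << 8))  & 0x100f00f00f00f00fULL;
        v = (v | (v << 4))  & 0x10c30c30c30c30c3ULL;
        v = (v | (v << 2))  & 0x1249249249249249ULL;
        return v;
    }

    static uint64_t mortonKey(const Particle& p) {
        auto q = [](float c) { return (uint64_t)(c * 2097152.0f); };   // scale to 21 bits (2^21)
        return spread(q(p.x)) | (spread(q(p.y)) << 1) | (spread(q(p.z)) << 2);
    }

    // Sort particles along the curve and cut the order into `pieces` contiguous chunks,
    // so nearby particles tend to land in the same TreePiece.
    std::vector<std::vector<Particle>> decompose(std::vector<Particle> parts, int pieces) {
        std::sort(parts.begin(), parts.end(),
                  [](const Particle& a, const Particle& b) { return mortonKey(a) < mortonKey(b); });
        std::vector<std::vector<Particle>> out(pieces);
        size_t per = (parts.size() + pieces - 1) / pieces;
        for (size_t i = 0; i < parts.size(); ++i) out[i / per].push_back(parts[i]);
        return out;
    }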

8  Basic algorithm...
● Newtonian gravity interaction
  – each particle is influenced by all the others: O(n²) algorithm
● Barnes-Hut approximation: O(n log n)
  – influence from distant particles combined into a center of mass
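
A minimal sketch of the Barnes-Hut idea in plain C++ (the types and the opening parameter theta are illustrative; the production code supports several tree types and opening criteria): a node that is far enough away is reduced to its center of mass, otherwise it is opened and its children are visited.

    #include <cmath>
    #include <vector>

    struct Vec3 { double x, y, z; };

    struct Node {
        Vec3 com{};                    // center of mass of the particles below this node
        double mass = 0;               // total mass below this node
        double size = 0;               // edge length of the node's bounding box
        std::vector<Node*> children;   // empty for a leaf holding a single particle
    };

    // Accumulate the acceleration on a particle at `pos` from the subtree rooted at `n`.
    // theta is the standard Barnes-Hut opening parameter (e.g. 0.7).
    void gravityWalk(const Node* n, const Vec3& pos, double theta, Vec3& acc) {
        double dx = n->com.x - pos.x, dy = n->com.y - pos.y, dz = n->com.z - pos.z;
        double r2 = dx*dx + dy*dy + dz*dz;
        if (r2 == 0) return;                            // skip self-interaction
        double r = std::sqrt(r2);
        if (n->children.empty() || n->size / r < theta) {
            // Distant (or leaf) node: treat all its particles as a single center of mass.
            double f = n->mass / (r2 * r);
            acc.x += f * dx; acc.y += f * dy; acc.z += f * dz;
        } else {
            // Too close: open the node and recurse on its children.
            for (const Node* c : n->children) gravityWalk(c, pos, theta, acc);
        }
    }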

9  ... in parallel
● Remote data – needs to be fetched from other processors
● Data reuse – the same data is needed by more than one particle

10  Overall algorithm
[Diagram: each processor hosts several TreePieces (A, B, C), each alternating remote and global work with low-priority local work between the start and the end of the computation. When a TreePiece needs a remote node, it asks the local CacheManager whether the node is present: if YES, the node is returned immediately; if NO, the request is buffered and the node is fetched from the TreePiece that owns it on another processor, which replies with the requested data, after which the waiting TreePiece is resumed via a callback.]
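
The request/miss/reply/callback path in the diagram can be sketched as a small per-processor software cache. This is a hedged illustration in plain C++: the names NodeCache, request, and deliver are invented for the sketch and are not the actual ParallelGravity CacheManager API; the key property it shows is that a missing node is fetched only once while any number of waiting tree walks are resumed when the reply arrives.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    struct TreeNode;                                   // remote tree node (opaque here)
    using NodeKey  = uint64_t;                         // global identifier of a tree node
    using Callback = std::function<void(const TreeNode*)>;

    class NodeCache {
        std::unordered_map<NodeKey, const TreeNode*> cached_;
        std::unordered_map<NodeKey, std::vector<Callback>> pending_;
        std::function<void(NodeKey)> sendRequest_;     // asks the owning processor for the node
    public:
        explicit NodeCache(std::function<void(NodeKey)> send) : sendRequest_(std::move(send)) {}

        // Called by a tree walk that needs node `k`: returns it on a hit,
        // otherwise buffers the callback and (on the first miss) issues one request.
        const TreeNode* request(NodeKey k, Callback onArrival) {
            auto it = cached_.find(k);
            if (it != cached_.end()) return it->second;           // hit
            bool firstMiss = pending_.find(k) == pending_.end();
            pending_[k].push_back(std::move(onArrival));          // miss: wait for the data
            if (firstMiss) sendRequest_(k);                       // fetch each node only once
            return nullptr;
        }

        // Called when the owning processor replies with the node's data.
        void deliver(NodeKey k, const TreeNode* node) {
            cached_[k] = node;
            for (auto& cb : pending_[k]) cb(node);                // resume all waiting walks
            pending_.erase(k);
        }
    };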

11  Serial performance

12  CacheManager importance
lambs dataset (1 million particles) on HPCx

13  Prefetching
1) implicit in the cache
  ● computation performed with tree walks
  ● after visiting a node, its children will likely be visited
  ● while fetching remote nodes, the cache prefetches some of their children
2) explicit
  ● before force computation, data is requested for preload
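
Both prefetching modes can be sketched with two small helpers. The isCached and requestNode interfaces stand in for the per-processor cache and are invented names for this illustration, not the actual ParallelGravity API.

    #include <cstdint>
    #include <functional>
    #include <vector>

    using NodeKey = uint64_t;

    // 1) Implicit: when a remote node arrives, also request its children, because a
    //    tree walk that visited the parent is likely to open it and visit them next.
    void prefetchChildren(const std::vector<NodeKey>& childKeysOfArrivedNode,
                          const std::function<bool(NodeKey)>& isCached,
                          const std::function<void(NodeKey)>& requestNode) {
        for (NodeKey child : childKeysOfArrivedNode)
            if (!isCached(child)) requestNode(child);   // fetch ahead of the walk that will need it
    }

    // 2) Explicit: before the force computation starts, request up front the remote
    //    nodes a TreePiece already knows it will touch, so the walk finds them local.
    void prefetchWorkingSet(const std::vector<NodeKey>& expectedRemoteNodes,
                            const std::function<bool(NodeKey)>& isCached,
                            const std::function<void(NodeKey)>& requestNode) {
        for (NodeKey k : expectedRemoteNodes)
            if (!isCached(k)) requestNode(k);
    }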

14  Cache implicit prefetching

15  Interaction lists
[Diagram: opening-criterion cut-off around a node X — depending on the distance, node X is accepted, opened, or left undecided]
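
The three outcomes on the slide correspond to testing a source node against a whole target node (a bucket or subtree of the walking TreePiece) rather than against one particle at a time. The sketch below is an illustrative reconstruction with a simplified geometric test, not the exact criterion used by the code: accepted means the node goes on the interaction list shared by the entire target subtree, opened means it is replaced by its children, undecided means the same node must be re-examined when descending into the target's children.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Vec3 { double x, y, z; };

    struct TNode {
        Vec3 center{};               // center of the node's bounding volume
        double size = 0;             // edge length of the bounding volume
        std::vector<TNode*> children;
    };

    enum class OpenResult { Accepted, Opened, Undecided };

    OpenResult classify(const TNode& source, const TNode& target, double theta) {
        double dx = source.center.x - target.center.x;
        double dy = source.center.y - target.center.y;
        double dz = source.center.z - target.center.z;
        double dist  = std::sqrt(dx*dx + dy*dy + dz*dz);
        double rNear = dist - 0.5 * std::sqrt(3.0) * target.size;  // closest corner of the target
        double rFar  = dist + 0.5 * std::sqrt(3.0) * target.size;  // farthest corner of the target
        if (rNear > 0 && source.size / rNear < theta)
            return OpenResult::Accepted;    // far enough from every particle in the target
        if (source.size / std::max(rFar, 1e-12) >= theta)
            return OpenResult::Opened;      // too close even to the farthest target particle
        return OpenResult::Undecided;       // depends on where in the target the particle sits
    }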

17  Interaction list: results
● 10% average performance improvement

18  Load balancer
dwarf 5M dataset on BlueGene/L: improvement between 15% and 35%
[Plot legend: flat lines good, rising lines bad]

19  Load balancer
lambs 300K subset on 64 processors of Tungsten
● lightweight domain decomposition
● Charm++ load balancing
[Timeline view, processors vs. time — white: high utilization, dark: processor idle]
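
The "Charm++ load balancing" bullet relies on the runtime's measurement-based balancers; from the application's side, an array element mainly has to be migratable (a PUP routine) and periodically hand control to the balancer (AtSync). The sketch below is a hedged illustration with invented names (module balanced, class BalancedPiece, member particleMass), not ParallelGravity's actual TreePiece code, and it assumes a .ci file declaring the array and its entry methods.

    // Assumed balanced.ci: array [1D] BalancedPiece { entry BalancedPiece(); entry void nextStep(); };
    #include "balanced.decl.h"   // generated by charmc from the .ci above
    #include <pup_stl.h>         // PUP support for STL containers
    #include <vector>

    class BalancedPiece : public CBase_BalancedPiece {
        std::vector<double> particleMass;          // state that must migrate with the chare
    public:
        BalancedPiece() { usesAtSync = true; }     // opt into AtSync-based load balancing
        BalancedPiece(CkMigrateMessage*) {}

        void pup(PUP::er& p) {                     // pack/unpack: lets the runtime move the chare
            CBase_BalancedPiece::pup(p);
            p | particleMass;
        }

        void nextStep() {
            // ... do one timestep of work ...
            AtSync();                              // hand control to the load balancer
        }
        void ResumeFromSync() {                    // invoked by the runtime after (possible) migration
            thisProxy[thisIndex].nextStep();       // continue with the next timestep
        }
    };

    #include "balanced.def.h"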

20  Scalability comparison
dwarf 5M dataset on Tungsten
[Plot: flat lines indicate perfect scaling, diagonal lines no scaling]

21  ParallelGravity scalability
[Plot: flat lines indicate perfect scaling, diagonal lines no scaling]

22  Future work
● Production-level physics
  – periodic boundaries
  – Smoothed Particle Hydrodynamics
  – multiple timestepping
● New load balancers
● Beyond 1,024 processors

23  Questions?
Thank you

24  Tree decomposition
[Diagram: a global tree split among TreePiece 1, TreePiece 2, TreePiece 3; node types:]
● Exclusive
● Shared
● Remote
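
Assuming particles are globally ordered (for example along the decomposition's space-filling curve) and each TreePiece owns one contiguous range of that order, the three node types can be told apart by comparing ranges. This is an illustrative reconstruction, not the actual ParallelGravity code.

    #include <cstdint>

    enum class NodeType { Exclusive, Shared, Remote };

    // A node covers the contiguous particle range [nodeFirst, nodeLast];
    // this TreePiece owns the contiguous range [ownFirst, ownLast].
    NodeType classifyNode(uint64_t nodeFirst, uint64_t nodeLast,
                          uint64_t ownFirst,  uint64_t ownLast) {
        if (nodeLast < ownFirst || nodeFirst > ownLast)
            return NodeType::Remote;      // no particle of this node is local
        if (nodeFirst >= ownFirst && nodeLast <= ownLast)
            return NodeType::Exclusive;   // every particle of this node is local
        return NodeType::Shared;          // the node straddles a TreePiece boundary
    }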

25  Interaction list
[Diagram: interaction list involving a node X and TreePiece A]

26  Tree-in-cache
lambs 300K subset on 64 processors of Tungsten