1 ChaNGa: The Charm N-Body GrAvity Solver Filippo Gioachin¹ Pritish Jetley¹ Celso Mendes¹ Laxmikant Kale¹ Thomas Quinn² ¹ University of Illinois at Urbana-Champaign ² University of Washington

2 Parallel Programming UIUC 10/1/2016 Outline
● Motivations
● Algorithm overview
● Scalability
● Load balancer
● Multistepping

3 Parallel Programming UIUC 10/1/2016 Motivations
● Need for simulations of the evolution of the universe
● Current parallel codes:
  – PKDGRAV
  – Gadget
● Scalability problems:
  – load imbalance
  – expensive domain decomposition
  – scaling limited to about 128 processors

4 Parallel Programming UIUC 10/1/2016 ChaNGa: main characteristics
● Simulator of cosmological interaction
  – Newtonian gravity
  – Periodic boundary conditions
  – Multiple timestepping
● Particle based (Lagrangian)
  – high resolution where needed
  – based on tree structures
● Implemented in Charm++
  – work divided among chares called TreePieces
  – processor-level optimization using a Charm++ group called CacheManager

5 Parallel Programming UIUC 10/1/2016 Space decomposition
[Figure: the simulation volume is split among TreePiece 1, TreePiece 2, TreePiece 3, ...]

6 Parallel Programming UIUC 10/1/2016 Basic algorithm...
● Newtonian gravity interaction
  – Each particle is influenced by all others: O(n²) algorithm
● Barnes-Hut approximation: O(n log n)
  – Influence from distant particles combined into a center of mass
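A minimal sketch of the Barnes-Hut idea just described (hypothetical types and names, not ChaNGa's actual data structures): the tree walk replaces sufficiently distant subtrees with their center of mass, which is what brings the cost down to O(n log n).

```cpp
#include <cmath>
#include <vector>

// Hypothetical tree node: either a leaf holding one particle or an
// internal node summarizing its subtree by total mass and center of mass.
struct Node {
    double mass;                  // total mass of the subtree
    double com[3];                // center of mass of the subtree
    double size;                  // edge length of the cell
    std::vector<Node*> children;  // empty for a leaf
};

// Accumulate the acceleration on a particle at position `pos`.
// `theta` is the opening-angle parameter: larger values accept more
// distant-node approximations, trading accuracy for speed.
void accumulateForce(const Node* node, const double pos[3],
                     double theta, double acc[3]) {
    double d[3] = { node->com[0] - pos[0],
                    node->com[1] - pos[1],
                    node->com[2] - pos[2] };
    double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
    if (r2 == 0.0) return;                 // skip self-interaction
    double r = std::sqrt(r2);

    // Opening criterion: if the cell is small compared to its distance
    // (size/r < theta) or it is a leaf, use its center of mass directly.
    if (node->children.empty() || node->size / r < theta) {
        double f = node->mass / (r2 * r);  // G = 1 in code units
        for (int k = 0; k < 3; ++k) acc[k] += f * d[k];
    } else {
        // Otherwise open the node and recurse into its children.
        for (const Node* child : node->children)
            accumulateForce(child, pos, theta, acc);
    }
}
```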

7 Parallel Programming UIUC 10/1/2016 ... in parallel
● Remote data
  – needs to be fetched from other processors
● Data reuse
  – the same data is needed by more than one particle

8 Parallel Programming UIUC 10/1/2016 Overall algorithm
[Diagram: on Processor 1, TreePieces A, B, and C each interleave a high-priority prefetch visit of the tree, global and remote work, and low-priority local work, from the start to the end of the computation. On a miss, a TreePiece requests the node from the CacheManager: if the node is present it is returned immediately (YES); if not (NO), the CacheManager fetches it from the owning TreePiece on another processor (e.g. Processor 2 or Processor n), which replies with the requested data; the reply is buffered and delivered to the requester via callback.]
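The miss/callback path in the diagram can be pictured roughly as follows. This is a simplified, hypothetical sketch of the request pattern only; the names CacheManagerSketch, requestNode, and receiveNode are illustrative and not ChaNGa's real interfaces.

```cpp
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical payload of a tree node (multipole moments, child info, ...).
struct NodeData { double mass = 0.0; double com[3] = {0, 0, 0}; };

// Hypothetical per-processor software cache, loosely modeled on the role
// the CacheManager group plays in the diagram above. Nodes are identified
// here by a plain integer key.
class CacheManagerSketch {
    std::unordered_map<long, NodeData> cache_;    // nodes already fetched
    std::unordered_map<long, std::vector<std::function<void(const NodeData&)>>>
        pending_;                                 // callbacks waiting on a fetch
public:
    // Returns the node if present; otherwise registers `callback` and
    // issues (at most) one remote request for the missing node.
    const NodeData* requestNode(long key,
                                std::function<void(const NodeData&)> callback) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return &it->second;      // hit: serve immediately
        bool firstRequest = pending_[key].empty();
        pending_[key].push_back(std::move(callback));    // miss: queue the requester
        if (firstRequest) sendRemoteRequest(key);        // ask the owning TreePiece once
        return nullptr;
    }

    // Called when the reply from the remote processor arrives: buffer the
    // data and resume every requester that was waiting on it.
    void receiveNode(long key, NodeData data) {
        NodeData& slot = cache_[key];
        slot = std::move(data);
        for (auto& cb : pending_[key]) cb(slot);
        pending_.erase(key);
    }

private:
    void sendRemoteRequest(long /*key*/) { /* message to the owner (omitted) */ }
};
```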

9 Parallel Programming UIUC 10/1/2016 Datasets
● hrwh_LCDMs: 16 million particles (576 MB)
● dwarf: 5 and 50 million particles (80 MB and 1,778 MB)
● lambs: 3 million particles (47 MB), with subsets
● drgas: 700 million particles (25.2 GB)

10 Parallel Programming UIUC 10/1/2016 Systems

11 Parallel Programming UIUC 10/1/2016 Scaling: comparison lambs 3M on Tungsten

12 Parallel Programming UIUC 10/1/2016 Scaling: IBM BlueGene/L

13 Parallel Programming UIUC 10/1/2016 Scaling: Cray XT3

14 Parallel Programming UIUC 10/1/2016 Load balancing with GreedyLB dwarf 5M on 1,024 BlueGene/L processors [utilization timelines; iteration times of 5.6 s and 6.1 s]

15 Parallel Programming UIUC 10/1/2016 Load balancing with OrbLB lambs 5M on 1,024 BlueGene/L processors [utilization plot, processors vs. time; white indicates good utilization]

16 Parallel Programming UIUC 10/1/2016 Load balancing with OrbRefineLB dwarf 5M on 1,024 BlueGene/L processors [utilization timelines; iteration times of 5.6 s and 5.0 s]

17 Parallel Programming UIUC 10/1/2016 Scaling with load balancing [plot: number of processors × execution time per iteration (s)]

18 Parallel Programming UIUC 10/1/2016 Multistepping
● Particles with higher accelerations require smaller integration timesteps to be integrated accurately.
● Compute forces on the particles with the highest accelerations every step, and on particles with lower accelerations only every few steps (see the sketch below).
● Steps therefore differ in the amount of load they carry.
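A small illustrative sketch of this rung-based scheme, assuming the common acceleration-based timestep criterion Δt ∝ sqrt(ε/a) and power-of-two subdivisions of the base step; the constant and the exact criterion used by ChaNGa may differ.

```cpp
#include <cmath>

// Choose an integration rung for a particle: rung r means the particle is
// integrated with timestep dtBase / 2^r, i.e. updated on every 2^r-th
// substep of the largest step. Higher acceleration -> smaller timestep
// -> higher rung.
int chooseRung(double accel, double softening, double eta,
               double dtBase, int maxRung) {
    if (accel <= 0.0) return 0;                       // no constraint: largest step
    double dtWanted = eta * std::sqrt(softening / accel);  // illustrative criterion
    int rung = 0;
    double dt = dtBase;
    while (dt > dtWanted && rung < maxRung) {
        dt *= 0.5;                                    // halve until the criterion holds
        ++rung;
    }
    return rung;
}

// During a multistep iteration, only particles whose rung is "active" on
// the current substep have their forces recomputed.
bool isActive(int rung, int substep, int maxRung) {
    int period = 1 << (maxRung - rung);   // rung r is active every 2^(maxRung-r) substeps
    return substep % period == 0;
}
```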

19 Parallel Programming UIUC 10/1/2016 ChaNGa scalability - multistepping dwarf 5M on Tungsten

20 Parallel Programming UIUC 10/1/2016 ChaNGa scalability - multistepping

21 Parallel Programming UIUC 10/1/2016 Future work
● Adding new physics
  – Smoothed Particle Hydrodynamics
● More load balancing / scalability work
  – Reducing communication overhead
  – Load balancing without increasing communication volume
  – Multiphase load balancing for multistepping
  – Other phases of the computation

22 Parallel Programming UIUC 10/1/2016 Questions? Thank you

23 Parallel Programming UIUC 10/1/2016 Decomposition types
● OCT
  – a contiguous cubic volume of space is assigned to each TreePiece
● SFC (Morton and Peano-Hilbert)
  – a space-filling curve imposes a total ordering of the particles
  – a segment of this curve is assigned to each TreePiece (see the sketch below)
● ORB
  – space is divided by Orthogonal Recursive Bisection on the number of particles
  – a contiguous non-cubic volume of space is assigned to each TreePiece
  – due to the shapes of the resulting domains, more computation is required to produce correct results
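For the SFC option, one standard way to build the ordering is a Morton key obtained by interleaving the bits of the quantized particle coordinates; sorting particles by key and cutting the sorted order into contiguous segments then gives one segment per TreePiece. A hedged sketch (the 21-bits-per-dimension key width is an assumption for illustration, not necessarily what ChaNGa uses):

```cpp
#include <cstdint>

// Spread the lower 21 bits of x so that there are two zero bits between
// consecutive bits (standard bit-interleaving helper).
static uint64_t spreadBits(uint64_t x) {
    x &= 0x1fffff;                                   // keep 21 bits
    x = (x | x << 32) & 0x1f00000000ffffULL;
    x = (x | x << 16) & 0x1f0000ff0000ffULL;
    x = (x | x << 8)  & 0x100f00f00f00f00fULL;
    x = (x | x << 4)  & 0x10c30c30c30c30c3ULL;
    x = (x | x << 2)  & 0x1249249249249249ULL;
    return x;
}

// 63-bit Morton key from coordinates normalized to [0,1). Particles sorted
// by this key lie along the space-filling curve; contiguous key ranges are
// then handed to the TreePieces.
uint64_t mortonKey(double x, double y, double z) {
    const double scale = double(1 << 21);
    uint64_t ix = uint64_t(x * scale);
    uint64_t iy = uint64_t(y * scale);
    uint64_t iz = uint64_t(z * scale);
    return (spreadBits(ix) << 2) | (spreadBits(iy) << 1) | spreadBits(iz);
}
```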

24 Parallel Programming UIUC 10/1/2016 Serial performance

25 Parallel Programming UIUC 10/1/2016 CacheManager importance 1 million lambs dataset on HPCx

26 Parallel Programming UIUC 10/1/2016 Prefetching
1) explicit
  ● before the force computation, the needed data is requested so it can be preloaded
2) implicit in the cache
  ● the computation is performed with tree walks
  ● after visiting a node, its children will likely be visited
  ● therefore, while fetching a remote node, the cache also prefetches some of its children (see the sketch below)
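The implicit case could look roughly like this (illustrative only, and building on the CacheManagerSketch from the earlier sketch rather than being fully standalone): when a remote node arrives, the cache also issues requests for its children, so the next levels of the walk tend to hit in the cache.

```cpp
#include <utility>
#include <vector>

// Sketch of implicit prefetching (uses CacheManagerSketch and NodeData from
// the earlier sketch). When a remote node is received, also request its
// children: a tree walk that visited the parent is likely to visit them
// next, so they will usually be cached by then.
void onNodeReceived(CacheManagerSketch& cache, long key, NodeData data,
                    const std::vector<long>& childKeys) {
    cache.receiveNode(key, std::move(data));   // wake up the waiting walks
    for (long child : childKeys) {
        // Register a do-nothing callback: nobody is waiting for the child
        // yet, but the request warms the cache for the upcoming visits.
        cache.requestNode(child, [](const NodeData&) { /* prefetch only */ });
    }
}
```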

27 Parallel Programming UIUC 10/1/2016 Cache implicit prefetching

28 Parallel Programming UIUC 10/1/2016 Load balancer
lambs 300K subset on 64 processors of Tungsten
● lightweight domain decomposition
● Charm++ load balancing
[Utilization plot, processors vs. time; white: high utilization, dark: processor idle]

29 Parallel Programming UIUC 10/1/2016 Charm++ Overview
● work decomposed into objects called chares
● message driven
● mapping of objects to processors is transparent to the user
● automatic load balancing
● communication optimization
[Diagram: user view of interacting chares vs. system view of their mapping onto processors P1, P2, P3]
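For readers unfamiliar with Charm++, here is a minimal, generic chare-array example (not ChaNGa code, and the module/class names are made up) showing the flavor of the model: work objects ("chares") live in an array, entry methods are invoked by messages, and the runtime decides where elements are placed and can migrate them for load balance.

```cpp
// pieces.ci (Charm++ interface file), shown here as a comment:
//   mainmodule pieces {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done();
//     };
//     array [1D] Piece {
//       entry Piece();
//       entry void work();
//     };
//   };

// pieces.cpp
#include "pieces.decl.h"

/* readonly */ CProxy_Main mainProxy;

class Main : public CBase_Main {
    int remaining;                        // array elements still working
public:
    Main(CkArgMsg* m) : remaining(8) {
        delete m;
        mainProxy = thisProxy;
        // Create 8 array elements; the runtime maps them to processors.
        CProxy_Piece pieces = CProxy_Piece::ckNew(remaining);
        pieces.work();                    // broadcast: one message per element
    }
    void done() {                         // each element reports back once
        if (--remaining == 0) CkExit();
    }
};

class Piece : public CBase_Piece {
public:
    Piece() {}
    Piece(CkMigrateMessage*) {}           // needed so elements can migrate
    void work() {
        CkPrintf("Piece %d running on PE %d\n", thisIndex, CkMyPe());
        mainProxy.done();
    }
};

#include "pieces.def.h"
```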

30 Parallel Programming UIUC 10/1/2016 Tree decomposition
[Figure: the global tree is split among TreePiece 1, TreePiece 2, and TreePiece 3; nodes are classified as:]
● Exclusive
● Shared
● Remote

31 Parallel Programming UIUC 10/1/2016 Space decomposition
[Figure: the simulation volume is split among TreePiece 1, TreePiece 2, TreePiece 3, ...]

32 Parallel Programming UIUC 10/1/2016 Overall algorithm
[Diagram, as on slide 8: TreePieces A, B, and C on Processor 1 interleave prefetch visits of the tree, global and remote work, and low-priority local work; on a miss, the node is requested from the CacheManager, which either returns it immediately or fetches it from the owning TreePiece on another processor, buffers the reply, and delivers it via callback.]

33 Parallel Programming UIUC 10/1/2016 Scalability comparison (old result) dwarf 5M comparison on Tungsten [plot: flat lines indicate perfect scaling, diagonal lines no scaling]

34 Parallel Programming UIUC 10/1/2016 ChaNGa scalability (old results), results on BlueGene/L [plot: flat lines indicate perfect scaling, diagonal lines no scaling]

35 Parallel Programming UIUC 10/1/2016 Interaction list [Figure: interaction of node X with TreePiece A]

36 Parallel Programming UIUC 10/1/2016 Interaction lists
[Diagram: applying the opening-criterion cut-off to node X, the node is classified as accepted, opened, or undecided (see the sketch below).]
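A schematic of the three-way test, using an illustrative geometric criterion (compare the node's opening radius against the nearest and farthest points of a bucket's bounding box); ChaNGa's actual criterion differs in detail.

```cpp
#include <algorithm>
#include <cmath>

enum class NodeCheck { Accepted, Opened, Undecided };

// Axis-aligned bounding box of a bucket of particles.
struct Box { double lo[3], hi[3]; };

// Classify node X (given its position and opening radius) against a whole
// bucket at once:
//  - Accepted:  even the farthest point of the bucket lies outside the
//               opening radius, so the node's multipole serves everyone.
//  - Opened:    even the nearest point lies inside the radius, so the node
//               must be opened for the whole bucket.
//  - Undecided: the radius cuts through the bucket; defer the decision to
//               finer (per-child or per-particle) checks.
NodeCheck classify(const double nodePos[3], double openRadius, const Box& bucket) {
    double nearest2 = 0.0, farthest2 = 0.0;
    for (int k = 0; k < 3; ++k) {
        double lo = bucket.lo[k] - nodePos[k];
        double hi = bucket.hi[k] - nodePos[k];
        double nearer  = std::max({lo, -hi, 0.0});            // distance to box along axis k
        double farther = std::max(std::fabs(lo), std::fabs(hi));
        nearest2  += nearer * nearer;
        farthest2 += farther * farther;
    }
    double r2 = openRadius * openRadius;
    if (nearest2 > r2)  return NodeCheck::Accepted;
    if (farthest2 < r2) return NodeCheck::Opened;
    return NodeCheck::Undecided;
}
```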

37

38 Parallel Programming UIUC 10/1/2016 Interaction list: results ● 10% average performance improvement

39 Parallel Programming UIUC 10/1/2016 Tree-in-cache lambs 300K subset on 64 processors of Tungsten

40 Parallel Programming UIUC 10/1/2016 Load balancer dwarf 5M dataset on BlueGene/L, improvement between 15% and 35% [plot: flat lines are good, rising lines are bad]

41 Parallel Programming UIUC 10/1/2016 ChaNGa scalability [plot: flat lines indicate perfect scaling, diagonal lines no scaling]