1. ChaNGa: The Charm N-Body GrAvity Solver
Filippo Gioachin¹, Pritish Jetley¹, Celso Mendes¹, Laxmikant Kale¹, Thomas Quinn²
¹ University of Illinois at Urbana-Champaign
² University of Washington
2. Outline
● Motivations
● Algorithm overview
● Scalability
● Load balancer
● Multistepping
3. Motivations
● Need for simulations of the evolution of the universe
● Current parallel codes:
– PKDGRAV
– Gadget
● Scalability problems:
– load imbalance
– expensive domain decomposition
– limited to 128 processors
4. ChaNGa: main characteristics
● Simulator of cosmological interaction
– Newtonian gravity
– Periodic boundary conditions
– Multiple timestepping
● Particle based (Lagrangian)
– high resolution where needed
– based on tree structures
● Implemented in Charm++
– work divided among chares called TreePieces
– processor-level optimization using a Charm++ group called CacheManager
5. Space decomposition
[Figure: space partitioned among TreePiece 1, TreePiece 2, TreePiece 3, ...]
6. Basic algorithm...
● Newtonian gravity interaction
– Each particle is influenced by all others: O(n²) algorithm
● Barnes-Hut approximation: O(n log n)
– Influence from distant particles combined into a center of mass
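To make the approximation concrete, here is a minimal sketch of a Barnes-Hut tree walk, assuming an octree and an opening-angle parameter theta; the types and the function accelOn are illustrative, not ChaNGa's actual code.

```cpp
#include <cmath>

// Illustrative Barnes-Hut sketch (not ChaNGa source). theta is the opening angle.
struct Vec3 { double x = 0, y = 0, z = 0; };

struct Node {
    Vec3   com;            // center of mass of everything below this node
    double mass = 0;       // total mass below this node
    double size = 0;       // linear extent of the node's bounding box
    Node*  child[8] = {};  // all null for leaves (octree assumed)
    bool isLeaf() const { for (const Node* c : child) if (c) return false; return true; }
};

// Accumulate the gravitational acceleration on a particle at position p.
// A node whose size/distance ratio is below theta is "accepted" and treated as
// a single point mass; otherwise it is "opened" and its children are visited.
// Leaves are also reduced to their center of mass, purely to keep the sketch short.
void accelOn(const Node* n, const Vec3& p, double theta, Vec3& acc) {
    if (!n || n->mass == 0.0) return;
    double dx = n->com.x - p.x, dy = n->com.y - p.y, dz = n->com.z - p.z;
    double d = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (d > 0.0 && (n->isLeaf() || n->size / d < theta)) {
        double f = n->mass / (d * d * d);   // G = 1 in code units
        acc.x += f * dx; acc.y += f * dy; acc.z += f * dz;
    } else if (!n->isLeaf()) {
        for (const Node* c : n->child) accelOn(c, p, theta, acc);
    }
}
```

Smaller values of theta open more nodes, trading speed for accuracy.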
7. ... in parallel
● Remote data
– need to fetch from other processors
● Data reuse
– same data needed by more than one particle
8. Overall algorithm
[Diagram: each TreePiece starts with a high-priority prefetch walk of the tree, then interleaves global/remote work with low-priority local work. Node requests go through the CacheManager: if the node is present it is returned immediately; on a miss it is fetched from the owning TreePiece on another processor, buffered, and delivered to the requester through a callback.]
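A minimal sketch of the request/miss/callback pattern in the diagram, assuming a simplified per-processor cache; the class name CacheManager matches the slides, but the method names (requestNode, receiveNode, fetchRemote) and types are illustrative assumptions.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using NodeKey = std::uint64_t;
struct NodeData { /* multipole moments, bounding box, child keys, ... */ };

class CacheManager {
public:
    using Callback = std::function<void(const NodeData&)>;

    // Returns the node immediately on a hit; on a miss, records the callback
    // and issues a single remote fetch for that key.
    const NodeData* requestNode(NodeKey key, Callback cb) {
        auto hit = cache_.find(key);
        if (hit != cache_.end()) return &hit->second;   // YES: return
        auto& waiters = pending_[key];
        waiters.push_back(std::move(cb));               // NO: fetch, remember requester
        if (waiters.size() == 1) fetchRemote(key);      // avoid duplicate requests
        return nullptr;                                  // caller suspends this walk
    }

    // Invoked when the owning processor replies with the requested data.
    void receiveNode(NodeKey key, NodeData data) {
        NodeData& slot = cache_[key];                   // buffer the node locally
        slot = std::move(data);
        for (auto& cb : pending_[key]) cb(slot);        // resume every waiting walk
        pending_.erase(key);
    }

private:
    void fetchRemote(NodeKey /*key*/) {
        // In the real system this would send an asynchronous message to the
        // TreePiece that owns the node on another processor.
    }

    std::unordered_map<NodeKey, NodeData> cache_;
    std::unordered_map<NodeKey, std::vector<Callback>> pending_;
};
```

Because the cache lives at processor level (a Charm++ group), every TreePiece on that processor shares the buffered nodes, which is the data reuse mentioned on the previous slide.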
9. Datasets
● hrwh_LCDMs: 16 million particles (576 MB)
● dwarf: 5 and 50 million particles (80 MB and 1,778 MB)
● lambs: 3 million particles (47 MB), with subsets
● drgas: 700 million particles (25.2 GB)
10. Systems
11. Scaling: comparison (lambs 3M on Tungsten)
12. Scaling: IBM BlueGene/L
13. Scaling: Cray XT3
14. Load balancing with GreedyLB (dwarf 5M on 1,024 BlueGene/L processors)
[Utilization timelines; iteration times of 5.6 s and 6.1 s]
15. Load balancing with OrbLB (lambs 5M on 1,024 BlueGene/L processors)
[Utilization timeline, processors vs. time; white is good]
16. Load balancing with OrbRefineLB (dwarf 5M on 1,024 BlueGene/L processors)
[Utilization timelines; iteration times of 5.6 s and 5.0 s]
17. Scaling with load balancing
[Plot of Number of Processors x Execution Time per Iteration (s)]
18. Multistepping
● Particles with higher accelerations require smaller integration timesteps to be predicted accurately.
● Particles with the highest accelerations are computed every step; particles with lower accelerations only every few steps.
● As a result, steps differ in load.
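A minimal sketch of how particles could be assigned to timestep rungs, assuming a power-of-two hierarchy (rung r uses dtMax / 2^r) and a sqrt(softening/|a|)-style criterion; the formula, parameters, and function names are illustrative assumptions, not ChaNGa's exact timestep criterion.

```cpp
#include <cmath>

// Illustrative rung assignment: rung 0 uses the largest step dtMax, rung r uses
// dtMax / 2^r. Higher acceleration -> smaller desired step -> higher rung.
// The eta * sqrt(softening / |a|) criterion is a common choice, used here only
// as an example.
int chooseRung(double accelMag, double dtMax, double eta, double softening,
               int maxRung) {
    double dtWanted = eta * std::sqrt(softening / accelMag);
    int rung = 0;
    double dt = dtMax;
    while (dt > dtWanted && rung < maxRung) {   // halve until small enough
        dt *= 0.5;
        ++rung;
    }
    return rung;
}

// A big step is made of 2^maxRung substeps of the smallest size. Rung r is
// integrated every 2^(maxRung - r) substeps, so only part of the particles are
// active on most substeps, which is why the load differs from step to step.
bool isActive(int rung, int substep, int maxRung) {
    int period = 1 << (maxRung - rung);
    return substep % period == 0;
}
```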
19. ChaNGa scalability - multistepping (dwarf 5M on Tungsten)
20. ChaNGa scalability - multistepping
21. Future work
● Adding new physics
– Smoothed Particle Hydrodynamics
● More load balancing / scalability work
– Reducing communication overhead
– Load balancing without increasing communication volume
– Multiphase load balancing for multistepping
– Other phases of the computation
22. Questions? Thank you
23. Decomposition types
● OCT
– Contiguous cubic volume of space assigned to each TreePiece
● SFC (Morton and Peano-Hilbert)
– Space-filling curve imposes a total ordering on the particles
– A segment of this curve is assigned to each TreePiece
● ORB
– Space divided by Orthogonal Recursive Bisection on the number of particles
– Contiguous non-cubic volume of space assigned to each TreePiece
– Due to the shapes of the decomposition, requires more computation to produce correct results
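A minimal sketch of how an SFC decomposition can key particles: interleave the bits of quantized coordinates to obtain a Morton (Z-order) key, sort by key, and hand contiguous key segments to TreePieces. The 21-bit quantization and helper names are illustrative assumptions.

```cpp
#include <cstdint>

// Build a 63-bit Morton (Z-order) key by interleaving the bits of three 21-bit
// quantized coordinates. Sorting particles by this key lays them out along a
// space-filling curve, and contiguous segments of the sorted order can then be
// assigned to TreePieces. Bit widths are illustrative.
std::uint64_t mortonKey(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    std::uint64_t key = 0;
    for (int bit = 20; bit >= 0; --bit) {            // most significant bit first
        key = (key << 3)
            | (std::uint64_t)((z >> bit) & 1u) << 2
            | (std::uint64_t)((y >> bit) & 1u) << 1
            | (std::uint64_t)((x >> bit) & 1u);
    }
    return key;
}

// Quantize a coordinate in [min, max) to 21 bits before keying.
std::uint32_t quantize(double v, double min, double max) {
    double t = (v - min) / (max - min);
    if (t < 0.0) t = 0.0;
    if (t > 0.999999999) t = 0.999999999;            // clamp just below 1.0
    return (std::uint32_t)(t * (double)(1u << 21));
}
```

The Peano-Hilbert ordering mentioned on the slide preserves spatial locality better than Morton order, at the cost of a more involved key computation.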
24. Serial performance
25. CacheManager importance (1 million particle lambs dataset on HPCx)
26. Prefetching
1) Explicit
● before force computation, data is requested for preload
2) Implicit in the cache
● computation is performed with tree walks
● after visiting a node, its children will likely be visited
● while fetching remote nodes, the cache prefetches some of their children
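A minimal sketch of the implicit prefetch idea, reusing the assumed interface from the cache sketch on slide 8; the fixed octree fan-out, the shifted-key child naming, and the prefetch depth are all illustrative assumptions.

```cpp
#include <cstdint>
#include <functional>

// Assumed minimal interface, matching the cache sketch on slide 8.
using NodeKey = std::uint64_t;
struct NodeData { /* contents omitted in this sketch */ };
class CacheManager {
public:
    const NodeData* requestNode(NodeKey key,
                                std::function<void(const NodeData&)> cb);
};

// When a walk misses on a node, also request that node's likely-needed children:
// a visit to the parent is usually followed by visits to (some of) its children.
void prefetchChildren(CacheManager& cache, NodeKey parent, int depth) {
    if (depth <= 0) return;
    for (int c = 0; c < 8; ++c) {
        NodeKey child = (parent << 3) | static_cast<NodeKey>(c);
        cache.requestNode(child, [](const NodeData&) { /* warm the cache only */ });
        // A deeper prefetch would schedule prefetchChildren(cache, child, depth - 1)
        // once the child's data has actually arrived.
    }
}
```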
27. Cache implicit prefetching
[Diagram illustrating implicit prefetching; levels labeled 0 through 3]
28. Load balancer (lambs 300K subset on 64 processors of Tungsten)
● lightweight domain decomposition
● Charm++ load balancing
[Utilization timeline, time vs. processors; white is high utilization, dark is processor idle]
29. Charm++ Overview
● work decomposed into objects called chares
● message driven
● mapping of objects to processors transparent to the user
● automatic load balancing
● communication optimization
[Diagram: user view vs. system view of chares mapped onto processors P1, P2, P3]
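A minimal sketch of what a TreePiece chare array can look like in Charm++ (interface file plus C++ class); the entry method calculateGravity and the module layout are illustrative assumptions, not ChaNGa's actual interface.

```cpp
// TreePiece.ci -- Charm++ interface file (illustrative sketch):
//
//   module TreePiece {
//     array [1D] TreePiece {
//       entry TreePiece();
//       entry TreePiece(CkMigrateMessage*);
//       entry void calculateGravity(int iteration);
//     };
//   };
//
// charmc translates this into TreePiece.decl.h / TreePiece.def.h.

#include "TreePiece.decl.h"

class TreePiece : public CBase_TreePiece {
public:
    TreePiece() {
        // build the local tree from this piece's particles
    }
    TreePiece(CkMigrateMessage*) {}   // needed so the runtime can migrate elements

    // Entry method: invoked through asynchronous messages, possibly sent from
    // other processors; the runtime decides where each array element lives.
    void calculateGravity(int iteration) {
        // walk the tree, requesting remote nodes through the CacheManager, ...
        (void)iteration;
    }
};

#include "TreePiece.def.h"
```

Because the runtime controls where each array element lives, TreePieces can be migrated for load balancing without changes to the application logic.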
30. Tree decomposition
[Diagram: the global tree split among TreePiece 1, TreePiece 2, TreePiece 3; nodes are marked Exclusive, Shared, or Remote]
31. Space decomposition
[Figure: space partitioned among TreePiece 1, TreePiece 2, TreePiece 3, ...]
32. Overall algorithm
[Diagram: TreePieces interleave low-priority local work with global/remote work; node requests go to the CacheManager, which returns cached data or fetches it from the owning TreePiece on another processor and delivers it via a callback.]
33. Scalability comparison (old result)
[Plot: dwarf 5M comparison on Tungsten; flat means perfect scaling, diagonal means no scaling]
34. ChaNGa scalability (old results)
[Plots on BlueGene/L; flat means perfect scaling, diagonal means no scaling]
35. Interaction list
[Diagram: node X and TreePiece A]
36. Interaction lists
[Diagram: opening-criterion cut-offs around node X; a node is either undecided, accepted, or opened]
38. Interaction list: results
● 10% average performance improvement
39. Tree-in-cache (lambs 300K subset on 64 processors of Tungsten)
40. Load balancer (dwarf 5M dataset on BlueGene/L)
● improvement between 15% and 35%
[Plot: flat lines are good, rising lines are bad]
41. ChaNGa scalability
[Plot: flat means perfect scaling, diagonal means no scaling]