1
Massively Parallel Cosmological Simulations with ChaNGa
Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn
2
Simulations and Scientific Discovery
● Help reconcile observation and theory
– Calculate the final states predicted by theories of structure formation
● Direct observational programs
– What should we look for in space?
● Help determine underlying structures and masses
3
Computational Challenges
● N ~ 10^12 particles
– Direct-summation forces would take ~10^10 Teraflop-years (rough check below)
– Need efficient, scalable algorithms
● Large dynamic ranges
– Need multiple timestepping
● Irregular domains
– Must balance load across processors
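A rough consistency check of that figure (our own arithmetic, assuming ~30 floating-point operations per pairwise interaction and on the order of 10^4 timesteps, neither stated on the slide):

```latex
% direct summation: N^2 pair interactions per step
(10^{12})^2 \,\text{pairs} \times 30 \,\tfrac{\text{flops}}{\text{pair}} \times 10^{4}\,\text{steps}
  \approx 3\times10^{29}\ \text{flops},
\qquad
\frac{3\times10^{29}\ \text{flops}}
     {10^{12}\ \tfrac{\text{flops}}{\text{s}} \times 3.15\times10^{7}\ \tfrac{\text{s}}{\text{yr}}}
  \approx 10^{10}\ \text{Teraflop-years}.
```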
4
ChaNGa
● Uses the Barnes-Hut algorithm
● Based on Charm++ (minimal example below)
– Processor virtualization
– Asynchronous, message-driven model: computation and communication overlap
– Intelligent, adaptive runtime system: load balancing
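A minimal Charm++ sketch of this model (the module, chare names, and entry methods here are illustrative, not ChaNGa's actual interface). Entry-method invocations are asynchronous messages, and creating many more chares than processors gives the runtime room to overlap communication with computation and to balance load by migrating chares:

```cpp
// ---- sim.ci: Charm++ interface file (illustrative names) ----
// mainmodule sim {
//   mainchare Main { entry Main(CkArgMsg *m); };
//   array [1D] TreePiece {
//     entry TreePiece();
//     entry void computeForces();  // invoking this sends an async message
//   };
// };

// ---- sim.cpp ----
#include "sim.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    delete m;
    // Overdecomposition: several TreePieces per processor, so the runtime
    // always has other work to schedule while messages are in flight.
    CProxy_TreePiece pieces = CProxy_TreePiece::ckNew(4 * CkNumPes());
    pieces.computeForces();  // broadcast to all elements; returns immediately
  }
};

class TreePiece : public CBase_TreePiece {
 public:
  TreePiece() {}
  void computeForces() { /* walk the tree, request remote data, ... */ }
};

#include "sim.def.h"
```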
5
Barnes-Hut Algorithm Overview
● Space divided into cells
● Cells form the nodes of the Barnes-Hut tree (build sketched below)
– Particles grouped into buckets
– Buckets assigned to TreePieces
[Figure: spatial cells colored by owning TreePiece (TreePiece 1, 2, 3)]
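A minimal sketch of that structure (the names, cubic cells, and fixed bucket size are our assumptions): space is split recursively until a cell holds at most a bucket's worth of particles, and those leaf cells are the buckets dealt out to TreePieces.

```cpp
#include <vector>

struct Particle { double pos[3]; double mass; };

struct TreeNode {
  double center[3];                 // cell center
  double halfSide;                  // half the cell's side length
  std::vector<Particle*> bucket;    // non-empty only at leaves
  TreeNode* child[8] = {nullptr};   // octants
  double totalMass = 0, com[3] = {0, 0, 0};  // moments used during the walk
};

const unsigned BUCKET_SIZE = 12;    // tunable granularity knob

void build(TreeNode* node, std::vector<Particle*> parts) {
  if (parts.size() <= BUCKET_SIZE) {       // small enough: this cell is a bucket
    node->bucket = std::move(parts);
    return;
  }
  std::vector<Particle*> oct[8];           // partition particles by octant
  for (Particle* p : parts) {
    int idx = (p->pos[0] > node->center[0] ? 1 : 0)
            + (p->pos[1] > node->center[1] ? 2 : 0)
            + (p->pos[2] > node->center[2] ? 4 : 0);
    oct[idx].push_back(p);
  }
  for (int i = 0; i < 8; ++i) {            // recurse into non-empty octants
    if (oct[i].empty()) continue;
    TreeNode* c = new TreeNode;
    c->halfSide = node->halfSide / 2;
    for (int d = 0; d < 3; ++d)
      c->center[d] = node->center[d] + ((i & (1 << d)) ? c->halfSide : -c->halfSide);
    node->child[i] = c;
    build(c, std::move(oct[i]));
  }
}
```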
6
Computing Forces
● Collect relevant nodes/particles at each TreePiece
● Traverse the global tree to get the force on each bucket
– Nodes are "opened" (too close) or used whole (far enough); see the sketch below
[Figure: tree with nodes marked as involved or not involved in the computation]
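Continuing the sketch above, the open-or-not decision is the standard Barnes-Hut opening-angle test (the θ value is a typical choice, not ChaNGa's; we also assume totalMass and com were accumulated for every node after the build, and use a bare monopole where a real code would use a fuller multipole expansion):

```cpp
#include <cmath>

const double THETA = 0.7;  // open a node when size/distance > THETA

void accumulateForce(const TreeNode* n, const Particle* p, double acc[3]) {
  double dx[3], d2 = 1e-30;            // tiny epsilon guards against d = 0
  for (int i = 0; i < 3; ++i) { dx[i] = n->com[i] - p->pos[i]; d2 += dx[i] * dx[i]; }
  double d = std::sqrt(d2);
  bool leaf = !n->bucket.empty();
  if (!leaf && 2 * n->halfSide >= THETA * d) {
    for (const TreeNode* c : n->child)      // too close: "open" the node
      if (c) accumulateForce(c, p, acc);
  } else if (leaf) {                        // bucket: direct summation
    for (Particle* q : n->bucket) {
      if (q == p) continue;
      double e[3], r2 = 1e-30;
      for (int i = 0; i < 3; ++i) { e[i] = q->pos[i] - p->pos[i]; r2 += e[i] * e[i]; }
      double r = std::sqrt(r2);
      for (int i = 0; i < 3; ++i) acc[i] += q->mass * e[i] / (r2 * r);
    }
  } else {                                  // far enough: use the node whole
    double f = n->totalMass / (d2 * d);
    for (int i = 0; i < 3; ++i) acc[i] += f * dx[i];
  }
}
```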
7
Processor Algorithm Overview
[Figure: pipelined local and global work on one processor, alternating prefetch and compute phases Pref(n-1), Comp(n-1), Pref(n), Comp(n), Pref(n+1), Comp(n+1). When a TreePiece needs remote particles it asks the CacheManager: if they are in the cache it replies with the particles at once; otherwise it requests them from the owning TreePiece and forwards them on receipt.]
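A sketch of the hit/miss logic in that diagram (class and method names are ours, not ChaNGa's CacheManager interface). The key point is that a miss does not block: the waiting bucket's continuation is parked, one request goes to the owner, and the processor keeps computing on other buckets until the reply arrives.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

using Reply = std::function<void(const std::vector<Particle>&)>;

class CacheManager {
  std::unordered_map<long, std::vector<Particle>> cache_;  // node key -> particles
  std::unordered_map<long, std::vector<Reply>> waiting_;   // continuations per miss
 public:
  void request(long key, Reply cont) {
    auto hit = cache_.find(key);
    if (hit != cache_.end()) { cont(hit->second); return; } // have in cache: reply
    bool firstMiss = waiting_[key].empty();
    waiting_[key].push_back(std::move(cont));
    if (firstMiss) sendToOwner(key);        // only one message per missing key
  }
  void receive(long key, std::vector<Particle> data) {      // reply from owner
    cache_[key] = std::move(data);
    for (Reply& c : waiting_[key]) c(cache_[key]);          // resume waiters
    waiting_.erase(key);
  }
 private:
  void sendToOwner(long /*key*/) { /* async message to the owning TreePiece */ }
};
```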
8
Major Optimizations
● Pipelined computation
– Prefetch the next tree chunk before starting its traversal
● Tree-in-Cache
– Aggregate the trees of all chares on a processor
● Tunable computation granularity
– Trades response time for data requests against scheduling overhead
9
Experimental Setup
● Datasets:
– lambs: 3 million particles
– hrwh_LCDMs: 16 million particles
– dwarf: 5 and 50 million particles
– drgas: 700 million particles
10
Experimental Setup (contd.)
● Platforms: Tungsten, Cray XT3 and IBM BG/L
11
Parallel Performance
[Figure: comparison of parallel performance with PKDGRAV; 'dwarf' dataset on Tungsten]
12
Scaling Tests
[Figures: scaling on the Cray XT3 and IBM BG/L, showing poor scaling at larger processor counts]
13
Towards Greater Scalability
● Load imbalance causes the poor scaling
● Static balancing is not good enough
– An even number of particles does not mean an even distribution of work
● Must balance both computation and communication
14
Balancing Load to Improve Performance
● LB algorithms must consider both computation and communication
[Figure: processor activity over time; balancing computation alone increases communication, while accounting for both yields greater balance]
15
Accounting for Communication: OrbRefineLB
● Based on Charm++'s OrbLB
– ORB along the object-identifier line (sketched below)
● OrbRefineLB 'refines' the placement by exchanging load between processors within a shifting window
[Figure: processor timelines under OrbLB; dwarf dataset on 1024 BG/L processors]
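A sketch of the ORB half of this (names ours; OrbRefineLB's shifting refinement window is omitted). Objects stay sorted by identifier, and the range is cut where the measured load splits in proportion to the processors on each side:

```cpp
#include <vector>

struct Obj { int id; double load; };  // load comes from runtime instrumentation

// Assign objs[lo..hi) (already sorted by id) to processors [pLo, pHi);
// `assign` is indexed by object id, assumed to run 0..N-1.
void orb(const std::vector<Obj>& objs, int lo, int hi, int pLo, int pHi,
         std::vector<int>& assign) {
  if (pHi - pLo == 1) {
    for (int i = lo; i < hi; ++i) assign[objs[i].id] = pLo;
    return;
  }
  double total = 0;
  for (int i = lo; i < hi; ++i) total += objs[i].load;
  int pMid = (pLo + pHi) / 2;
  double target = total * double(pMid - pLo) / (pHi - pLo);  // left side's share
  int cut = lo;
  double acc = 0;
  while (cut < hi && acc + objs[cut].load <= target) acc += objs[cut++].load;
  orb(objs, lo, cut, pLo, pMid, assign);   // bisect recursively on both sides
  orb(objs, cut, hi, pMid, pHi, assign);
}
```

Cutting along the identifier line keeps objects that are adjacent in the tree on the same or nearby processors, which is what lets this balancer account for communication as well as computation.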
16
Results with OrbRefineLB
[Figures: performance with OrbRefineLB on the different datasets]
17
Multistepped Simulations for Greater Efficiency
● Group particles into 'rungs' (bookkeeping sketched below)
– A lower rung means higher acceleration
– Different rungs are active at different times
● Particles on higher rungs are updated less frequently
● Less work is done than with singlestepping
[Figure: processor activity over time; computation is split into phases, where a phase labeled 0 updates rung 0 only, 1 updates rungs 0 and 1, and 2 updates rungs 0-2]
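A sketch of the rung bookkeeping, using this slide's convention that rung 0 has the smallest timestep (the function names and the power-of-two step hierarchy are our assumptions):

```cpp
#include <algorithm>
#include <cmath>

// Rung r steps with dtMin * 2^r, so higher rungs are updated less often.
int chooseRung(double dtWanted, double dtMin, int maxRung) {
  int r = (int)std::floor(std::log2(dtWanted / dtMin));
  return std::clamp(r, 0, maxRung);  // high acceleration -> small dt -> low rung
}

// Rung r is active every 2^r substeps; a phase's label is its highest
// active rung, which is why different phases do different amounts of work.
bool rungActive(int rung, long substep) {
  return substep % (1L << rung) == 0;
}
```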
18
Imbalance in MS Simulations
● But load imbalance is even more severe
– Different particles are active during different phases
[Fig: execution profile of 32 processors during a MS simulation, with the computation split into phases]
19
Balancing Load in MS Runs
● Different strategies for different phases
● Multiphase instrumentation
● Model-based load estimation from the first few small steps (sketched below)
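A sketch of what model-based estimation could look like here (entirely our construction): per-phase cost coefficients are measured during the first few small steps, and an object's load in a phase is then predicted from its active-particle count rather than from stale whole-step timings.

```cpp
#include <vector>

// activeParticles[p]: how many of this TreePiece's particles are active in phase p.
struct PieceStats { std::vector<int> activeParticles; };

// costPerParticle[p]: measured during the first few small steps
// (multiphase instrumentation), one coefficient per phase.
double predictLoad(const PieceStats& s, int phase,
                   const std::vector<double>& costPerParticle) {
  return s.activeParticles[phase] * costPerParticle[phase];
}
```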
20
Preliminary Results
● Dwarf dataset, 32 BG/L processors, different timestepping schemes:
– Singlestepped: 613 s
– Multistepped: 429 s
– Multistepped with load balancing: 228 s
21
Preliminary Results
● ~50% reduction in execution time with multistepping and load balancing
– lambs dataset, 512 and 1024 BG/L processors
– Singlestepped vs load-balanced multistepped
● Overdecomposition: more TreePieces → greater load balance
– lambs dataset, 1024 BG/L processors, varying number of TreePieces
22
Future Work
● SPH
● Alternative decomposition schemes
● Runtime optimizations to reduce communication costs
● More sophisticated load-balancing algorithms, accounting for:
– the complete simulation-space topology
– the processor topology, to reduce hop-bytes (defined below)
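For reference, hop-bytes is the usual metric here: each message's size weighted by the number of network hops it travels, summed over all messages.

```latex
\text{hop-bytes} \;=\; \sum_{m \,\in\, \text{messages}} \mathrm{bytes}(m) \times \mathrm{hops}(m)
```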
23
Conclusions
● Introduced ChaNGa
● Presented optimizations that reduce simulation time
● Tackled load-imbalance issues
● Showed that multiple timestepping is beneficial
● Balanced load in multistepped simulations