Massively Parallel Cosmological Simulations with ChaNGa Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale and Thomas Quinn
Simulations and Scientific Discovery
● Help reconcile observation and theory
  – Calculate final states of theories of structure formation
● Direct observational programs
  – What should we look for in space?
● Help determine underlying structures and masses
Computational Challenges
● N ~ 10^12
  – Direct summation of forces would take ~10^10 Teraflop-years
  – Need efficient, scalable algorithms
● Large dynamic ranges
  – Need multiple timestepping
● Irregular domains
  – Balance load across processors
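As a rough illustration of the first bullet (the scaling argument is standard; the log factor is an estimate and constants per interaction are ignored), a tree code reduces the per-step force cost by roughly:

    \[
      \frac{C_{\mathrm{direct}}}{C_{\mathrm{tree}}}
      \;\sim\; \frac{N^{2}}{N\log_{2}N}
      \;=\; \frac{N}{\log_{2}N}
      \;\approx\; \frac{10^{12}}{40}
      \;\approx\; 2.5\times 10^{10}
    \]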
ChaNGa
● Uses Barnes-Hut algorithm
● Based on Charm++
  – Processor virtualization
  – Asynchronous message-driven model
    ● Computation and communication overlap
  – Intelligent, adaptive runtime system
    ● Load balancing
Barnes-Hut Algorithm Overview
● Space divided into cells
● Cells form nodes of Barnes-Hut tree
  – Particles grouped into buckets
  – Buckets assigned to TreePieces
Fig: Spatial domain and tree partitioned among TreePieces 1, 2 and 3
Computing Forces
● Collect relevant nodes/particles at TreePiece
● Traverse global tree to get force on each bucket
  – Nodes “opened” (too close)
  – or not (far enough)
Fig: Tree nodes involved vs. not involved in the computation for a given bucket
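A minimal sketch of the node-opening decision during the tree walk (illustrative only: the names, the opening-angle test, and the monopole-only force are assumptions, not ChaNGa's actual code; bucket-level particle-particle terms and self-interactions are omitted):

    #include <cmath>
    #include <vector>

    struct Particle { double x, y, z, mass; double ax = 0, ay = 0, az = 0; };

    // One cell of the Barnes-Hut tree: total mass, center of mass, cell size.
    struct TreeNode {
        double mass;                    // total mass of particles under this node
        double cx, cy, cz;              // center of mass
        double size;                    // side length of the cell
        std::vector<TreeNode*> kids;    // empty for a leaf (bucket)
        std::vector<Particle*> bucket;  // particles, only at leaves
    };

    // Opening criterion: a node is "far enough" if size / distance < theta.
    bool farEnough(const TreeNode& n, const Particle& p, double theta) {
        double dx = n.cx - p.x, dy = n.cy - p.y, dz = n.cz - p.z;
        double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
        return n.size < theta * dist;
    }

    // Walk the tree for one particle: use the node's monopole if it is far
    // enough, otherwise "open" the node and recurse into its children.
    void walk(const TreeNode& n, Particle& p, double theta, double eps2) {
        if (n.kids.empty() || farEnough(n, p, theta)) {
            double dx = n.cx - p.x, dy = n.cy - p.y, dz = n.cz - p.z;
            double r2 = dx * dx + dy * dy + dz * dz + eps2;   // softened distance
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            p.ax += n.mass * dx * inv_r3;
            p.ay += n.mass * dy * inv_r3;
            p.az += n.mass * dz * inv_r3;
            return;
        }
        for (const TreeNode* c : n.kids) walk(*c, p, theta, eps2);
    }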
Processor Algorithm Overview
Fig: Per-processor pipeline overlapping local and global work: prefetch of chunk n+1 proceeds while computation on chunk n runs. When a TreePiece needs remote particles, the CacheManager replies from its cache if it has them; otherwise it requests the particles from the owning TreePiece and delivers them when they are received.
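The request path in the figure can be sketched as follows (a conceptual plain-C++ sketch, not ChaNGa's Charm++ entry methods; the class name, keys and callbacks are assumptions):

    #include <functional>
    #include <map>
    #include <vector>

    struct ParticleData { /* positions, masses, ... */ };
    using NodeKey  = unsigned long long;
    using Callback = std::function<void(const std::vector<ParticleData>&)>;

    // Conceptual per-processor cache of remote tree data.
    class CacheManager {
        std::map<NodeKey, std::vector<ParticleData>> cache;
        std::map<NodeKey, std::vector<Callback>> pending;  // walks stalled on a miss
    public:
        // Called by a TreePiece whose walk needs particles it does not own.
        void request(NodeKey key, Callback resume) {
            auto hit = cache.find(key);
            if (hit != cache.end()) { resume(hit->second); return; }   // cache hit
            bool firstMiss = pending[key].empty();
            pending[key].push_back(std::move(resume));
            if (firstMiss) sendRequestToOwner(key);  // one remote message per key
        }
        // Called when the owning TreePiece's reply arrives; resumes stalled walks.
        void receive(NodeKey key, std::vector<ParticleData> data) {
            cache[key] = std::move(data);
            for (auto& cb : pending[key]) cb(cache[key]);
            pending.erase(key);
        }
    private:
        void sendRequestToOwner(NodeKey) { /* async message to the remote TreePiece */ }
    };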
Major Optimizations
● Pipelined computation
  – Prefetch tree chunk before starting traversal
● Tree-in-Cache
  – Aggregate trees from all chares on a processor
● Tunable computation granularity
  – Response time for data requests vs. scheduling overhead
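The pipelining idea in the first bullet, written as an explicit loop for clarity (ChaNGa drives this asynchronously through Charm++ messages; the helper functions below are hypothetical stand-ins):

    // Hypothetical helpers standing in for ChaNGa's asynchronous machinery.
    void prefetchChunk(int /*chunk*/) { /* issue async requests for the chunk's remote data */ }
    void waitForChunk(int /*chunk*/)  { /* block or yield until the chunk has arrived */ }
    void computeChunk(int /*chunk*/)  { /* walk the tree for local buckets against the chunk */ }

    // Schematic pipeline: prefetch chunk n+1 while computing on chunk n, so the
    // latency of remote data requests is hidden behind useful work.
    void remoteGravity(int numChunks) {
        if (numChunks <= 0) return;
        prefetchChunk(0);                                  // warm up the pipeline
        for (int n = 0; n < numChunks; ++n) {
            if (n + 1 < numChunks) prefetchChunk(n + 1);   // asynchronous request
            waitForChunk(n);                               // usually already resident
            computeChunk(n);                               // traverse and accumulate forces
        }
    }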
Experimental Setup
● Datasets:
  – lambs: 3 million particles
  – hrwh_LCDMs: 16 million particles
  – dwarf: 5 and 50 million particles
  – drgas: 700 million particles
Experimental Setup (contd.)
● Platforms: Tungsten, Cray XT3, IBM BG/L
Parallel Performance
Fig: A comparison of parallel performance with PKDGRAV (`dwarf' dataset on Tungsten)
Scaling Tests
Fig: Scaling on the Cray XT3 and IBM BG/L, showing poor scaling at larger processor counts
Towards Greater Scalability
● Load imbalance causes poor scaling
● Static balancing not good enough
  – Even number of particles != even work distribution
● Must balance both computation & communication
Balancing Load to Improve Performance
● LB algorithms must consider both computation and communication
Fig: Processor activity over time (computation vs. communication), showing that greater balance of computation can come at the cost of increased communication
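One way to make "both computation and communication" concrete is a weighted per-object cost (an illustrative model only; the weights and the metric are assumptions, not the actual Charm++ balancer objective):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Per-chare (TreePiece) measurements gathered by the runtime.
    struct ObjectLoad {
        double cpuTime;      // measured computation time
        double bytesSent;    // communication volume attributed to the object
        int    currentPe;    // processor it currently lives on
    };

    // Illustrative combined cost: computation plus a penalty for off-processor
    // communication. alpha and betaPerByte are tunable weights (assumed).
    double cost(const ObjectLoad& o, int candidatePe,
                double alpha = 1.0, double betaPerByte = 1e-8) {
        double comm = (candidatePe == o.currentPe) ? 0.0 : betaPerByte * o.bytesSent;
        return alpha * o.cpuTime + comm;
    }

    // Worst per-processor cost under a candidate assignment (object i -> assign[i]);
    // a balancer would try to keep this maximum as small as possible.
    double maxLoad(const std::vector<ObjectLoad>& objs,
                   const std::vector<int>& assign, int numPes) {
        std::vector<double> perPe(numPes, 0.0);
        for (std::size_t i = 0; i < objs.size(); ++i)
            perPe[assign[i]] += cost(objs[i], assign[i]);
        double worst = 0.0;
        for (double l : perPe) worst = std::max(worst, l);
        return worst;
    }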
Accounting for Communication: OrbRefineLB
● Based on Charm++ OrbLB
  – ORB along the object identifier line
● OrbRefineLB `refines' placement by exchanging load between processors in a shifting window
Fig: Execution profile (time per processor) under OrbLB: 1024 BG/L processors, dwarf dataset
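The underlying idea, recursive bisection of the objects along their identifier line so that each half carries load proportional to its processor count, can be sketched as follows (illustrative; OrbLB's actual implementation and tie-breaking differ):

    #include <vector>

    struct Obj { int id; double load; };  // objects assumed sorted by identifier

    // Recursively assign objs[lo, hi) to processors pe0 .. pe0 + numPes - 1,
    // splitting the identifier line so each side gets a proportional load share.
    void orb1D(const std::vector<Obj>& objs, int lo, int hi,
               int pe0, int numPes, std::vector<int>& assignment) {
        if (numPes == 1) {
            for (int i = lo; i < hi; ++i) assignment[objs[i].id] = pe0;
            return;
        }
        int leftPes = numPes / 2;
        double total = 0.0;
        for (int i = lo; i < hi; ++i) total += objs[i].load;
        double target = total * leftPes / numPes;   // load for the left half

        double acc = 0.0;
        int split = lo;
        while (split < hi - 1 && acc + objs[split].load <= target)
            acc += objs[split++].load;

        orb1D(objs, lo, split, pe0, leftPes, assignment);
        orb1D(objs, split, hi, pe0 + leftPes, numPes - leftPes, assignment);
    }

OrbRefineLB would then start from such an assignment and swap objects between neighboring processors in a shifting window to smooth out residual imbalance.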
Results with OrbRefineLB
Fig: Performance with OrbRefineLB across the different datasets
Multistepped Simulations for Greater Efficiency
● Group particles into `rungs'
  – Lower rung means higher acceleration
  – Different rungs active at different times
● Update particles on higher rungs less frequently
● Less work done than singlestepping
Fig: Time per processor; computation split into phases (0: rung 0; 1: rungs 0,1; 2: rungs 0,1,2)
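Under this slide's convention (lower rung = higher acceleration = smaller timestep, so rung 0 is updated every substep), the set of rungs active at each substep can be sketched as follows (an assumed scheme for illustration, not necessarily ChaNGa's exact bookkeeping):

    #include <vector>

    // Highest rung active at a given substep of a big step, assuming rung r
    // takes a timestep 2^r times larger than rung 0.
    int highestActiveRung(int substep, int maxRung) {
        if (substep == 0) return maxRung;   // start of the big step: all rungs active
        int r = 0;
        while (r < maxRung && substep % (1 << (r + 1)) == 0) ++r;
        return r;
    }

    // Active rungs are 0 .. highestActiveRung(substep, maxRung); e.g. with
    // maxRung = 2 the phases over substeps 0..3 are {0,1,2}, {0}, {0,1}, {0}.
    std::vector<int> activeRungs(int substep, int maxRung) {
        std::vector<int> rungs;
        for (int r = 0; r <= highestActiveRung(substep, maxRung); ++r)
            rungs.push_back(r);
        return rungs;
    }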
Imbalance in MS Simulations
● But load imbalance is even more severe
  – Different particles active during different phases
Fig: Execution profile of 32 processors during an MS simulation; computation split into phases (0, 0, 1, 2)
Balancing Load in MS Runs
● Different strategies for different phases
● Multiphase instrumentation
● Model-based load estimation (first few small steps)
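A sketch of what multiphase instrumentation with a model-based fallback might look like (illustrative data structure only; the names and the fallback rule are assumptions, not ChaNGa's implementation):

    #include <map>
    #include <vector>

    // Record each object's measured load separately for every phase, so the
    // balancer for phase p uses phase-p data, or a model-based estimate while
    // measurements are still being gathered during the first few small steps.
    struct PhaseLoadDB {
        // phaseLoad[phase][objectId] = accumulated wall time in that phase
        std::map<int, std::vector<double>> phaseLoad;

        void record(int phase, int objectId, double seconds, int numObjects) {
            auto& v = phaseLoad[phase];
            if (v.empty()) v.assign(numObjects, 0.0);
            v[objectId] += seconds;
        }

        // Load estimate used when balancing 'phase'; falls back to a
        // caller-supplied model prediction if the phase was never timed.
        double estimate(int phase, int objectId, double modelPrediction) const {
            auto it = phaseLoad.find(phase);
            if (it == phaseLoad.end()) return modelPrediction;
            return it->second[objectId];
        }
    };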
Preliminary Results
● Dwarf dataset, 32 BG/L processors, different timestepping schemes:
  – Singlestepped: 613 s
  – Multistepped: 429 s
  – Multistepped with load balancing: 228 s
Preliminary Results (contd.)
● ~50% reduction in execution time with multistepping and overdecomposition
  – Fig: Lambb dataset, 512 and 1024 BG/L processors, singlestepped vs. load-balanced multistepped
● More TreePieces → greater load balance
  – Fig: Lambb dataset, 1024 BG/L processors, varying number of TreePieces
Future Work
● SPH
● Alternative decomposition schemes
● Runtime optimizations to reduce communication cost
● More sophisticated load balancing algorithms
  – Account for:
    ● Complete simulation space topology
    ● Processor topology (reduce hop-bytes)
Conclusions
● Introduced ChaNGa
● Optimizations to reduce simulation time
● Load imbalance issues tackled
● Multiple timestepping beneficial
● Balancing load in multistepped simulations