Case Studies with Projections

Case Studies with Projections
Ronak Buch & Laxmikant (Sanjay) Kale
http://charm.cs.illinois.edu
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign
11th Workshop of the INRIA-Illinois-ANL JLPC, Sophia Antipolis, France, June 12, 2014

Basic Problem
- We have some Charm++ program
- Performance is worse than expected
- How can we:
  - Identify the problem?
  - Measure the impact of the problem?
  - Fix the problem?
  - Demonstrate that the fix was effective?

Key Ideas
- Start with a high-level overview and repeatedly specialize until the problem is isolated
- Select a metric to measure the problem
- Iteratively attempt solutions, guided by the performance data

Stencil3d Performance

Stencil3d
- Basic 7-point stencil in 3D
- 3D domain decomposed into blocks
- Blocks exchange faces with their neighbors
- Synthetic load balancing experiment: the calculation is repeated a number of times based on a block's position in the domain, creating artificial imbalance
- A minimal sketch of the per-block update follows
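To make the kernel concrete, here is a minimal standard-C++ sketch of the per-block 7-point update. This is illustrative only: the actual stencil3d example distributes such blocks over a 3D Charm++ chare array and exchanges ghost faces as messages, and the `Block` type and its members are hypothetical names, not the program's real source.

```cpp
// Illustrative 7-point Jacobi-style update over one block with
// one-cell ghost faces (the real code runs many such blocks as
// chare array elements and fills the ghosts from neighbor messages).
#include <cstddef>
#include <vector>

struct Block {
  int nx, ny, nz;                 // interior dimensions
  std::vector<double> cur, next;  // (nx+2)*(ny+2)*(nz+2) cells incl. ghosts

  Block(int x, int y, int z)
      : nx(x), ny(y), nz(z),
        cur((x + 2) * (y + 2) * (z + 2), 0.0),
        next(cur) {}

  std::size_t idx(int i, int j, int k) const {
    return (std::size_t)(k * (ny + 2) + j) * (nx + 2) + i;
  }

  // One sweep: average each interior cell with its six face
  // neighbors; ghost layers hold faces received from neighbors.
  void update() {
    for (int k = 1; k <= nz; ++k)
      for (int j = 1; j <= ny; ++j)
        for (int i = 1; i <= nx; ++i)
          next[idx(i, j, k)] =
              (cur[idx(i, j, k)] +
               cur[idx(i - 1, j, k)] + cur[idx(i + 1, j, k)] +
               cur[idx(i, j - 1, k)] + cur[idx(i, j + 1, k)] +
               cur[idx(i, j, k - 1)] + cur[idx(i, j, k + 1)]) / 7.0;
    cur.swap(next);
  }
};
```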

No Load Balancing

No Load Balancing: Clear load imbalance, but hard to quantify in this view.

No Load Balancing: Clear that load varies from 60% to 90% across processors.

Next Steps
- Poor load balance identified as the performance culprit
- Use Charm++'s load balancing support to evaluate the performance of different balancers
- Trivial to add load balancing (see the command sketch below):
  - Relink using -module CommonLBs
  - Run using +balancer <loadBalancer>
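For example, using the standard Charm++ build and launch tools (the binary and object file names here are placeholders), the two steps look like:

```sh
# Relink with the common suite of load balancers included
charmc -o stencil3d stencil3d.o -module CommonLBs

# Launch, choosing a balancer at runtime (GreedyLB shown)
./charmrun +p64 ./stencil3d +balancer GreedyLB
```

Because the balancer is selected at launch time, different strategies can be compared without recompiling.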

GreedyLB: Much improved balance, 75% average load.

RefineLB: Much improved balance, 80% average load.

Multirun Comparison: GreedyLB on the left, RefineLB on the right. Can see that RefineLB is a bit faster.

ChaNGa Performance

ChaNGa
- Charm N-body GrAvity solver
- Used for cosmological simulations
- Barnes-Hut force calculation (see the opening-test sketch below)
- Following data uses the dwarf dataset on 8K cores of Blue Waters
- The dwarf dataset has a high concentration of particles at its center
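As a reminder of how Barnes-Hut prunes the force calculation, here is an illustrative node-acceptance test. This is the generic textbook criterion, not necessarily the exact criterion ChaNGa uses, and the function and parameter names are hypothetical.

```cpp
// Generic Barnes-Hut opening test: a tree node of extent `nodeSize`
// at distance `dist` (> 0) from the target particle is treated as a
// single point mass when nodeSize/dist < theta; otherwise the tree
// walk opens the node and recurses into its children.
bool acceptNode(double nodeSize, double dist, double theta = 0.5) {
  // Small or distant nodes are well approximated by their center
  // of mass; nearby large nodes must be opened.
  return nodeSize / dist < theta;
}
```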

Original Time Profile

Original Time Profile: Why is utilization so low here?

Original Time Profile: Some PEs are doing work.

Next Steps
- Are all PEs doing a small amount of work, or are most idle while some do a lot?
- Outlier analysis can tell us (a toy version of such a test is sketched below):
  - If no outliers, then all PEs are doing little work
  - If outliers, then some PEs are overburdened while most are waiting
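As a toy illustration of the idea (not Projections' actual algorithm), one could flag PEs whose measured busy time sits far from the mean; the function below is a hypothetical sketch under that assumption.

```cpp
// Flag PEs whose busy time deviates from the mean by more than
// k standard deviations. `busy` holds per-PE busy time and is
// assumed non-empty.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<int> findOutliers(const std::vector<double>& busy,
                              double k = 2.0) {
  double mean = 0.0;
  for (double b : busy) mean += b;
  mean /= busy.size();

  double var = 0.0;
  for (double b : busy) var += (b - mean) * (b - mean);
  double stddev = std::sqrt(var / busy.size());

  std::vector<int> outliers;
  for (std::size_t pe = 0; pe < busy.size(); ++pe)
    if (std::fabs(busy[pe] - mean) > k * stddev)
      outliers.push_back((int)pe);
  return outliers;
}
```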

Outlier Analysis

Outlier Analysis: Large gulf between average and extrema => load imbalance.

Next Steps
- Why does this load imbalance exist? What are the busy PEs doing, and why are others waiting?
- Outlier analysis tells us which PEs are overburdened
- The timeline view will show which methods those PEs are actually executing

Timeline

Timeline

Original Message Count
- This investigation motivated the creation of a new analysis tool for Projections; development is driven by actual use cases
- Wrote a new tool to parse Projections logs and tally message counts
- The counts show a large disparity in messages across processors

Next Steps
- Can we distribute the work?
- After identifying the problem, inspection of the code revealed that the disparity was caused by tree node contention
- To solve this, we tried randomly distributing copies of tree nodes to other PEs to spread the load (sketched below)
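A minimal sketch of the mitigation, assuming a hypothetical `NodeReplicas` bookkeeping structure (ChaNGa's actual implementation differs): requests for a heavily shared tree node are spread across randomly placed read-only copies instead of all landing on the owner PE.

```cpp
// Hypothetical bookkeeping for one replicated tree node: requests
// are served by a PE chosen uniformly at random from the owner and
// its replicas, so no single PE absorbs all of the traffic.
#include <cstdlib>
#include <vector>

struct NodeReplicas {
  int ownerPE;
  std::vector<int> replicaPEs;  // PEs holding read-only copies

  // Pick a PE to serve the next request for this node.
  int pickServer() const {
    int n = 1 + (int)replicaPEs.size();
    int r = std::rand() % n;
    return r == 0 ? ownerPE : replicaPEs[r - 1];
  }
};
```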

Final Time Profile

Final Message Count
- Used to have 30,000+ messages on some PEs; now all process fewer than 5,000
- Much better balance