
Integrated Performance Views in Charm++: Projections meets TAU

Scott Biersdorff, Allen D. Malony
Department of Computer and Information Science, University of Oregon

Chee Wai Lee, Laxmikant V. Kale
Department of Computer Science, University of Illinois at Urbana-Champaign

ICPP 2009 Integrated Performance Views in Charm++

Outline
- Motivation for integrated performance views
- Charm++ background
  - Performance events
  - Charm++ performance framework
  - Callback-based performance module and Projections
- Brief introduction to TAU performance system
- Development of TAU performance module
- NAMD performance case study
  - Demonstrate integrated performance views
  - New results
- Conclusions and future work

Productivity and Performance
- High-level parallel paradigms improve productivity
  - Rich abstractions for application development
  - Hide low-level coding and computation complexities
- Natural tension between powerful development environments and the ability to achieve high performance
- General dogma
  - The further the application is removed from the raw machine, the more susceptible it is to performance inefficiencies
  - Performance problems and their sources become harder to observe and to understand
- Dual goals of productivity and performance require performance tool integration and language knowledge

Challenges
- Provide performance tool access to execution events of interest from different levels of language and runtime
  - Used to trigger performance measurements to record metrics specific to event semantics
  - Event observation supported as part of execution model
  - Enable different performance perspectives
- Build measurement techniques and runtime support that can integrate multiple performance technologies
- Map low-level performance data to high-level parallel abstractions and language constructs
  - Incorporate event knowledge and computation model
  - Identify performance factors at meaningful level
- Open tools to enable integration and long-term support

Charm++ Background
- Parallel object-oriented programming based on C++
- Programs decomposed into a set of parallel communicating objects (chares)
- Runtime system maps chares onto parallel processes/threads

Charm++ Computation Model
- Object entry method invocation triggers computation
  - Entry method message for remote process queued
  - Messages scheduled by Charm++ runtime scheduler
  - Entry methods executed to completion
  - May call new entry methods and other routines
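
The message-driven model on this slide can be illustrated with a toy scheduler. This is only a sketch under stated assumptions: `Message` and `ToyScheduler` are hypothetical names, everything runs in one process, and none of it is Charm++ API. It shows the two properties the slide names: invocations are queued as messages, and each entry method runs to completion before the scheduler picks the next message.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a queued entry-method invocation.
struct Message {
    std::string entryName;       // which entry method to run
    std::function<void()> body;  // the work performed by that entry method
};

// Toy single-process scheduler (not the Charm++ scheduler): it takes queued
// messages in FIFO order and runs each entry method to completion.
class ToyScheduler {
public:
    void enqueue(std::string entryName, std::function<void()> body) {
        assert(body != nullptr && "entry method must be callable");
        queue_.push_back(Message{std::move(entryName), std::move(body)});
    }

    // Drain the queue. Entry methods may enqueue further messages while
    // they run. Returns the entry-method names in execution order.
    std::vector<std::string> run() {
        std::vector<std::string> order;
        while (!queue_.empty()) {
            Message msg = std::move(queue_.front());
            queue_.pop_front();
            order.push_back(msg.entryName);
            msg.body();  // never preempted: runs to completion
        }
        return order;
    }

private:
    std::deque<Message> queue_;
};
```

An entry method that enqueues another message during its own execution sees that message run only after everything already in the queue.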

Charm++ Performance Events
- Several points in runtime system to observe execution
  - Make performance measurements (performance events)
  - Obtain information on execution context
- Charm++ events
  - Start of an entry method
  - End of an entry method
  - Sending a message to another object
  - Change in scheduler state:
    - active to idle
    - idle to active
- Observation of multiple events at different levels of abstraction is needed to get a full performance view
[Figure: performance levels: logical execution model, runtime object interaction, resource-oriented state transitions]

Charm++ Performance Framework
- How a parallel language system operationalizes events is critical to building an effective performance framework
- Charm++ implements performance callbacks
  - Runtime system calls performance module(s) at events
  - Any registered performance module (client) is invoked
  - Event ID and default performance data forwarded
  - Clients can access Charm++ internal runtime routines
- Performance framework exposes a set of key runtime events as a base C++ class
  - Performance modules inherit and implement methods
  - Listen only to events of interest
  - Framework calls performance client initialization
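
The callback scheme described on this slide can be sketched as a small registry that forwards each runtime event to every registered client. The names `PerfClient`, `PerfFramework`, and `EntryCounter` are hypothetical and the event set is reduced to four callbacks. The sketch illustrates the pattern only and is not the Charm++ implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified event interface (not the real Charm++ class).
// Default implementations do nothing, so a client overrides only the
// events it is interested in.
class PerfClient {
public:
    virtual ~PerfClient() = default;
    virtual void beginExecute(int /*entryId*/) {}
    virtual void endExecute() {}
    virtual void beginIdle(double /*wallTime*/) {}
    virtual void endIdle(double /*wallTime*/) {}
};

// Hypothetical registry: the runtime calls notify*() at each event and the
// framework forwards it to all registered clients in registration order.
class PerfFramework {
public:
    void registerClient(PerfClient* client) {
        assert(client != nullptr);
        clients_.push_back(client);
    }
    void notifyBeginExecute(int entryId) {
        for (PerfClient* c : clients_) c->beginExecute(entryId);
    }
    void notifyEndExecute() {
        for (PerfClient* c : clients_) c->endExecute();
    }
    void notifyBeginIdle(double wallTime) {
        for (PerfClient* c : clients_) c->beginIdle(wallTime);
    }
    void notifyEndIdle(double wallTime) {
        for (PerfClient* c : clients_) c->endIdle(wallTime);
    }

private:
    std::vector<PerfClient*> clients_;  // non-owning pointers
};

// Example client that listens only to entry-method starts.
class EntryCounter : public PerfClient {
public:
    void beginExecute(int /*entryId*/) override { ++count; }
    int count = 0;
};
```

Because the base class supplies empty defaults, `EntryCounter` ignores idle events without any extra code. Several independent clients can observe the same run side by side.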

Charm++ Performance Framework Interface

    // Base class of all tracing strategies.
    // Excerpt: envelope and CmiObjId are declared by the Charm++ runtime.
    class Trace {
    public:  // added so the runtime and subclasses can reach the callbacks
      // creation of message(s)
      virtual void creation(envelope *, int epIdx, int num=1) {}
      virtual void creationMulticast(envelope *, int epIdx, int num=1,
                                     int *pelist=NULL) {}
      virtual void creationDone(int num=1) {}
      // start and end of entry-method execution
      virtual void beginExecute(envelope *) {}
      virtual void beginExecute(CmiObjId *tid) {}
      virtual void beginExecute(
        int event,      // event type defined in trace-common.h
        int msgType,    // message type
        int ep,         // Charm++ entry point
        int srcPe,      // which PE originated the call
        int ml,         // message size
        CmiObjId* idx)  // index
      { }
      virtual void endExecute(void) {}
      // scheduler state changes
      virtual void beginIdle(double curWallTime) {}
      virtual void endIdle(double curWallTime) {}
      // start and end of the computation
      virtual void beginComputation(void) {}
      virtual void endComputation(void) {}
    };
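
A performance module built on this interface overrides only the callbacks it needs. The sketch below assumes a reduced stand-in base class, `TraceStub`, so that it compiles without Charm++ headers. `TraceStub` and `IdleAndCountTrace` are hypothetical names, and the real `Trace` callbacks take the richer argument lists shown on the slide.

```cpp
#include <cassert>
#include <map>

// Stand-in for the Trace base class, reduced to four callbacks
// (an assumption made so the example is self-contained).
class TraceStub {
public:
    virtual ~TraceStub() = default;
    virtual void beginExecute(int /*ep*/) {}
    virtual void endExecute() {}
    virtual void beginIdle(double /*curWallTime*/) {}
    virtual void endIdle(double /*curWallTime*/) {}
};

// Hypothetical module: accumulates scheduler idle time and counts how
// often each entry point starts executing.
class IdleAndCountTrace : public TraceStub {
public:
    void beginExecute(int ep) override { ++callsPerEntry_[ep]; }

    void beginIdle(double curWallTime) override {
        idleStart_ = curWallTime;
        idle_ = true;
    }

    void endIdle(double curWallTime) override {
        assert(idle_ && "endIdle without matching beginIdle");
        totalIdle_ += curWallTime - idleStart_;
        idle_ = false;
    }

    double totalIdle() const { return totalIdle_; }

    int calls(int ep) const {
        auto it = callsPerEntry_.find(ep);
        return it == callsPerEntry_.end() ? 0 : it->second;
    }

private:
    std::map<int, int> callsPerEntry_;  // entry point index -> start count
    double idleStart_ = 0.0;
    double totalIdle_ = 0.0;
    bool idle_ = false;
};
```

The module never overrides `endExecute`, so that event falls through to the empty default in the base class.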

Charm++ Performance Framework and Modules
- Framework allows for separation of concerns
  - Event visibility
  - Event measurement
- Allows measurement extension and customization
  - New modules may introduce new observation requirements
[Figure label: TAU Profiler API]

TAU Integration in Charm++
- Goal
  - Extend Projections performance measurement
    - Tracing and summary modules
  - Enable use of TAU Performance System® for Charm++
  - Demonstrate utility of alternate methods and integration
- TAU profiling capability
  - Address tracing overhead issues
- Leverage Charm++ performance framework
- Merge TAU performance model with Projections
- Apply to Charm++ applications
  - NAMD
  - OpenAtom, ChaNGa

TAU Performance System®
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Based on direct performance measurement approach
- Open source
- Available on all HPC platforms
[Figure: TAU Architecture]

TAU Performance Profiling
- Performance with respect to nested event regions
  - Program execution event stack (begin/end events)
- Profiling measures inclusive and exclusive data
  - Exclusive measurements cover the region only
  - Inclusive measurements include nested "child" regions
- Support multiple profiling types
  - Flat, callpath, and phase profiling
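
The inclusive/exclusive distinction can be made concrete with a small stack-based flat profiler. This is a sketch of the general technique, not TAU code. `StackProfiler` and `RegionTotals` are hypothetical names, and timestamps are passed in explicitly, whereas a real tool would read a clock or hardware counters.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct RegionTotals {
    double inclusive = 0.0;  // time in the region, nested children included
    double exclusive = 0.0;  // time in the region minus nested children
};

// Hypothetical flat profiler driven by begin/end events.
class StackProfiler {
public:
    void begin(const std::string& name, double now) {
        stack_.push_back(Frame{name, now, 0.0});
    }

    void end(double now) {
        assert(!stack_.empty() && "end event without matching begin");
        Frame f = stack_.back();
        stack_.pop_back();
        double inclusive = now - f.start;
        RegionTotals& t = totals_[f.name];
        t.inclusive += inclusive;
        t.exclusive += inclusive - f.childTime;
        // The enclosing region counts this whole interval as child time.
        if (!stack_.empty()) stack_.back().childTime += inclusive;
    }

    RegionTotals totals(const std::string& name) const {
        auto it = totals_.find(name);
        return it == totals_.end() ? RegionTotals{} : it->second;
    }

private:
    struct Frame {
        std::string name;
        double start;      // timestamp of the begin event
        double childTime;  // inclusive time of directly nested regions
    };
    std::vector<Frame> stack_;
    std::map<std::string, RegionTotals> totals_;
};
```

For a top-level region that spans 8 time units and contains two nested regions of 2 and 1 units, the profiler reports 8 inclusive and 5 exclusive for the top-level region. This simple version double-counts inclusive time for recursive regions.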

TAU Module for Charm++ Performance Interface
- Events of interest
  - Main: scheduler is active and processing messages
  - Idle: scheduler wait state
  - Entry method events (identified by entry name)
  - User program events and MPI events
    - Instrumented using TAU API
- Questions
  - What is the top-level event?
    - Scheduler regarded as top-level (Main is top-level event)
- Measurement
  - Execution time
  - Hardware counters

TAU Performance Overhead
- Measure module overhead with test program
  - Different instrumentation scenarios
- Overhead depends on several factors
  - Proportional to number of events collected
  - Look at overhead per method event
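
One common way to express overhead per event is the extra run time of the instrumented program divided by the number of events collected. The slide does not give a formula, so the helper below is an assumption about how such a figure could be computed, and `overheadPerEvent` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Average measurement overhead per event, in seconds.
// Assumes the instrumented and baseline runs do the same work, so the
// difference in run time is attributed entirely to instrumentation.
double overheadPerEvent(double instrumentedSeconds,
                        double baselineSeconds,
                        std::uint64_t eventCount) {
    assert(eventCount > 0 && "need at least one event");
    return (instrumentedSeconds - baselineSeconds) /
           static_cast<double>(eventCount);
}
```

With a 10-second baseline, a 12-second instrumented run, and 4 events, the helper returns 0.5 seconds per event.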

TAU Module Validation
- Validate TAU performance measurement
  - Against Projections summary measurement
  - See how performance profile information differs
- Test application
  - Charm++ 2D integration example

NAMD Performance Study
- Demonstrate integrated analysis in real application
- NAMD parallel molecular dynamics code
  - Compute interactions between atoms
  - Group atoms in patches
- Hybrid decomposition
  - Distribute patches to processors
  - Create compute objects to handle interactions between atoms of different patches
- Performance strategy
  - Distribute computational workload evenly
  - Keep communication to a minimum
- Several factors: model complexity, size, balancing cost

NAMD ApoA1 Experiments
- Solvated lipid-protein complex in periodic cell
- Small 92K atom model
- Demonstrate performance of small computational grain
- Experiment on 256-processor Cray XT3 (BigBen)
[Figure panels: Overview (low utilization); Timeline (color-coded events, zoomed process subset); Activity Load (changing utilization)]

NAMD STMV Experiments
- ApoA1 is a small problem
- Consider STMV virus benchmark
  - Ten times larger experiment
  - One million atom model
- Observe selected portion of the simulation
  - Remove startup
  - Look at 2000 timesteps
- Scaling studies
  - 256, 512, 1024, 2048, 4096 processors
  - BigBen, Ranger, Intrepid

NAMD STMV Performance (Intrepid)

NAMD STMV Scalability (Mean)
- Intrepid
  - Idle fraction increases due to lower utilization
- Ranger
  - Increase in Main reflects more time in communication
[Figure labels: Idle, Main]

NAMD STMV – Comparative Profile Analysis

NAMD STMV – Ranger versus Intrepid
- But Ranger has faster processors!

NAMD STMV – Ranger versus Intrepid
- We measured 10x as many timesteps on Ranger as on Intrepid!

TAU Profile Snapshots of NAMD
[Figure labels: Idle, WorkDistrib..133]

NAMD Performance Data Mining
- Use TAU PerfExplorer data mining tool
  - Dimensionality reduction, clustering, correlation
  - Single profiles and across multiple experiments
[Figure labels: PmeZPencil, PmeXPencil, PmeYPencil]

NAMD STMV – Overhead Analysis
- Evaluate overhead as the number of processors scales
- Overhead increases as granularity decreases
- Apply event selection and further overhead reduction

ChaNGa Performance Experiments
- Charm N-body GrAvity solver
- Collisionless N-body simulations
- Interested in observing relationships between events
  - Input TAU profiles to PerfExplorer
[Figure axis label: processors]

Conclusions
- TAU is now integrated with Charm++
  - Complements Projections performance capabilities
- Tested more advanced TAU features
  - User-level code events and communication events
  - Callpath and phase profiling
    - Separate different aspects of the computation and runtime
- Charm++ has more sophisticated execution modes
  - Threading, process migration, dynamic adaptation, …
  - Need to test TAU with these and make needed changes
- Apply to additional Charm++ applications
  - GPU-accelerated (see ParCo 2009 paper)
- Performance framework update and refinement

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
- NSF Software Development for Cyberinfrastructure (SDCI)
  - Grant No. NSFOCI