Teragrid 2009
Scalable Interaction with Parallel Applications
Filippo Gioachin, Chee Wai Lee, Laxmikant V. Kalé
Department of Computer Science, University of Illinois at Urbana-Champaign

Outline
● Overview
  – Charm++ RTS
  – Converse Client Server (CCS)
● Case Studies
  – CharmDebug (parallel debugger)
  – Projections (performance analysis tool)
  – Salsa (particle analysis tool)
● Conclusions

Overview
● Need for real-time communication with parallel applications
  – Steering the computation
  – Visualizing/analyzing data
  – Debugging problems
● Long-running applications
  – Time consuming to recompile the code (if it is available at all)
  – Need to wait for the application to re-execute
● Communication requirements:
  – Fast (low user waiting time)
  – Scalable
  – Uniform method of connection

Charm++ Overview
● Middleware written in C++
  – Message-passing paradigm (asynchronous communication)
● The user decomposes work among objects (chares)
  – The objects can be virtual MPI processors
● The system maps chares to processors
  – Automatic load balancing
  – Communication optimizations
[Figure: user view vs. system view of the chare-to-processor mapping]
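
For readers who have not seen the chare model before, here is a minimal Charm++ sketch (not taken from the talk): a main chare creates an array of worker chares and invokes an asynchronous entry method on them, and the runtime decides where the elements run. The module, class, and method names (hello, Main, Worker, work) are illustrative.

    /* hello.ci (interface file, compiled by charmc):
       mainmodule hello {
         mainchare Main { entry Main(CkArgMsg* m); };
         array [1D] Worker { entry Worker(); entry void work(int step); };
       };
    */
    #include "hello.decl.h"   // generated from hello.ci

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg* m) {
        delete m;
        // Create 64 chares; the runtime maps them onto the available processors
        // and may migrate them later for load balance.
        CProxy_Worker workers = CProxy_Worker::ckNew(64);
        workers.work(0);      // asynchronous invocation: each call is a message
      }
    };

    class Worker : public CBase_Worker {
    public:
      Worker() {}
      void work(int step) {
        CkPrintf("chare %d running step %d on PE %d\n", thisIndex, step, CkMyPe());
        // A real program would contribute to a reduction and call CkExit() when done.
      }
    };

    #include "hello.def.h"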

Adaptive Overlap and Modules
● Allow easy integration of different modules
● Automatic overlap of communication and computation

Parallel Objects, Adaptive Runtime System, Libraries and Tools
● Develop abstractions in the context of full-scale applications
● Applications: NAMD (molecular dynamics, STMV virus simulation), computational cosmology, rocket simulation, protein folding, dendritic growth, space-time meshes, crack propagation, quantum chemistry (LeanCP)

Charm++ RTS
[Diagram: an external client connects through a Converse Client Server frontend to the parallel program; Python modules run inside the Charm++ RTS]
1) The client sends a request
2) The parallel program executes the request
3) The results are combined
4) The reply is sent back to the client later
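
To make the request/reply cycle concrete, the two fragments below sketch the CCS API on both sides, assuming a standard Charm++ installation (the server side is usually enabled with charmrun's ++server option). The handler name "ping" and the host/port values are illustrative, and the signatures are given as recalled from the Converse client-server interface, so treat them as approximate.

    /* Parallel-program side, registered during startup (e.g. in the mainchare): */
    #include <string.h>
    #include "conv-ccs.h"

    static void pingHandler(char* msg) {
      // Runs on whichever PE the client addressed; a real tool would gather
      // data here (possibly via a reduction) before answering.
      const char* reply = "pong";
      CcsSendReply((int)strlen(reply) + 1, reply);  // step 4: reply travels back to the client
      CmiFree(msg);
    }
    /* ... in initialization code: */
    //   CcsRegisterHandler("ping", (CmiHandler)pingHandler);

    /* External-client side (links against Charm++'s ccs-client library): */
    #include "ccs-client.h"

    int main() {
      CcsServer svr;
      CcsConnect(&svr, "login.node.example", 1234, NULL);      // host/port reported by the runtime
      CcsSendRequest(&svr, "ping", 0 /*target PE*/, 0, NULL);  // step 1: send the request
      char buf[64];
      CcsRecvResponse(&svr, sizeof(buf), buf, 60 /*timeout*/); // blocks until step 4 arrives
      return 0;
    }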

Case Study: Parallel Debugging

Large-Scale Debugging: Motivations
● Bugs in sequential programs
  – Buffer overflows, memory leaks, pointers, ...
  – More than 50% of programming time is spent debugging
  – Tools: GDB and others
● Bugs in parallel programs
  – Race conditions, non-determinism, ...
  – Much harder to find: effects appear not only later in time, but also on different processors
  – Bugs may appear only on thousands of processors
    ● Network latencies delaying messages
    ● Data decomposition algorithm
  – Tools: TotalView, Allinea DDT

CharmDebug Overview
[Diagram: CharmDebug Java GUI (local machine) ↔ firewall ↔ CCS (Converse Client-Server) ↔ parallel application with CharmDebug support (remote machine), plus GDB]

Main Program View
[Screenshot: program output, entry methods, processor subsets, queued messages, message details]

CharmDebug at Scale
● Current parallel debuggers don't work at this scale
  – Direct connection to every processor
  – STAT (MRNet) is not a full debugger
● Kraken: Cray XT4 at NICS
  – Parallel operation collecting the total allocated memory (the reduction pattern is sketched below)
  – Time at the client: 16~20 ms
  – Up to 4K processors
  – Other tests to come
● Attaching to the running program also took very little time (a few seconds)
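
The parallel collection of the total allocated memory maps naturally onto a Charm++ reduction, which is what keeps the client-side time nearly flat as the processor count grows. The sketch below is not CharmDebug's actual code: the group name MemQuery and the helper memory_usage_bytes() are hypothetical stand-ins, and the .ci declarations are omitted.

    #include "charm++.h"

    long memory_usage_bytes();  // hypothetical per-PE helper, e.g. wrapping CmiMemoryUsage()

    class MemQuery : public CBase_MemQuery {   // group: one member per PE
    public:
      MemQuery() {}
      void collect(const CkCallback& cb) {     // invoked as a broadcast to all PEs
        long bytes = memory_usage_bytes();
        // Sum across all PEs in O(log P) steps; cb fires once with the grand total.
        contribute(sizeof(long), &bytes, CkReduction::sum_long, cb);
      }
    };
    // The callback (e.g., on PE 0) receives the total and can hand it to
    // CcsSendReply / CcsSendDelayedReply, so the remote CharmDebug client
    // sees a single number instead of one message per processor.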

CharmDebug: Introspection

Teragrid Filippo Gioachin - UIUC Severe leak: ghost layer messages leaked every iteration

Case Study: Performance Analysis

Online, Interactive Access to Parallel Performance Data: Motivations
● Observation of time-varying performance of long-running applications through streaming
  – Re-use of local performance data buffers
● Interactive manipulation of performance data when parameters are difficult to define a priori
  – Perform data-volume reduction before application shutdown
    ● k-clustering parameters (like the number of seeds to use)
    ● Write only one processor per cluster

Projections: Online Streaming of Performance Data
● The parallel application records performance data in local per-processor buffers
● The performance data is periodically processed and collected to a root processor
● The Charm++ runtime adaptively co-schedules the data collection's computation and messages with those of the host parallel application (see the sketch below)
● The performance data buffers can then be re-used
● The remote tool collects the data through CCS
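
The sketch below illustrates this pattern under stated assumptions: it is not Projections' actual implementation, the class and member names are invented, and the .ci declarations plus the code that saves the client's CCS token (via CcsDelayReply) are omitted. Each PE periodically contributes its local buffer to a concatenating reduction, and the root forwards the merged data to the waiting client over CCS.

    #include <vector>
    #include "charm++.h"
    #include "conv-ccs.h"

    class PerfStreamer : public CBase_PerfStreamer {  // group: one member per PE
      std::vector<char> localBuf;                     // locally recorded events
      CcsDelayedReply clientToken;                    // saved when the client's CCS request arrived
    public:
      PerfStreamer() { scheduleFlush(); }

      static void flushTrampoline(void* self, double /*curWallTime*/) {
        PerfStreamer* ps = (PerfStreamer*)self;
        ps->thisProxy[CkMyPe()].flush();  // re-enter through the scheduler as an entry method
      }
      void scheduleFlush() {
        // Ask the runtime to call us back in ~1 s; the collection is interleaved
        // with the application's own messages instead of stopping the application.
        CcdCallFnAfter((CcdVoidFn)flushTrampoline, this, 1000.0 /*ms*/);
      }
      void flush() {                      // entry method (declared in the omitted .ci)
        CkCallback cb(CkIndex_PerfStreamer::deliver(NULL), thisProxy[0]);
        contribute((int)localBuf.size(), localBuf.data(), CkReduction::concat, cb);
        localBuf.clear();                 // the local buffer can now be re-used
        scheduleFlush();
      }
      void deliver(CkReductionMsg* m) {   // runs on PE 0 only
        // Forward the merged performance data to the waiting Projections client over CCS.
        CcsSendDelayedReply(clientToken, m->getSize(), m->getData());
        delete m;
      }
    };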

Impact of Online Performance Data Streaming
[Chart: global reductions of 8-kilobyte messages from each processor; global reductions per second of 3.5 to 11 kilobyte messages from each processor; the visualization client receives 12 kilobytes/second]

Online Visualization of Streamed Performance Data
● Pictures show 10-second snapshots of live NAMD detailed performance profiles, from start-up (left) to the first major load-balancing phase (right), on 1024 Cray XT5 processors
● SSH tunnel between the client and the compute node through the head node

Case Study: Cosmological Data Analysis

Cosmological Data Analysis: Motivations
● Astronomical simulations and observations generate huge amounts of data
● This data cannot be loaded into a single machine
● Even if it could be loaded, interaction with the user would be too slow
● Need parallel analysis tools capable of
  – Scaling well to large numbers of processors
  – Providing flexibility to the user

Salsa
● Write your own piece of Python script
● Collaboration with Prof. Quinn (U. Washington) and Prof. Lawlor (U. Alaska)

LiveViz
● Every piece is represented by a chare
● Under integration into ChaNGa (the simulator)
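
As a rough illustration of how each chare can contribute its part of the picture, the fragment below follows the deposit pattern of Charm++'s liveViz library; the exact signatures should be treated as approximate (recalled from the liveViz manual), and Piece, localImage(), myX, and myY are illustrative names.

    #include "liveViz.h"

    // Registration, typically done once after creating the chare array 'pieces':
    //   liveVizConfig cfg(liveVizConfig::pix_color, /*serverPush=*/false);
    //   liveVizInit(cfg, pieces, CkCallback(CkIndex_Piece::requestImage(0), pieces));

    void Piece::requestImage(liveVizRequestMsg* m) {
      int w, h;
      unsigned char* rgb = localImage(m, &w, &h);  // hypothetical: render this chare's particles
      // Deposit only this chare's rectangle of the frame; liveViz assembles the
      // full image in parallel and ships it to the remote viewer over CCS.
      liveVizDeposit(m, myX, myY, w, h, rgb, this);
    }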

How Well Are We Doing?
● JPEG is CPU bound
  – Inefficient on high-bandwidth networks
● Bitmap is network bound
  – Bad on slow networks
● The bottleneck is on the client (network or processor)
  – Parallel application: use enough processors
[Chart: Salsa application frame rate; courtesy of Prof. Lawlor, U. Alaska]

Impostors: Basic Idea
[Diagram: camera, impostor, geometry; courtesy of Prof. Lawlor, U. Alaska]

Particle Set to Volume Impostors
[Images courtesy of Prof. Lawlor, U. Alaska]

Summary
● A generic framework (CCS) to connect to a running parallel application and interact with it
● Demonstrated in different scenarios:
  – Parallel debugging: low response time
  – Performance analysis: low runtime overhead
  – Cosmological data analysis: high frame rate
● All code is open source and available on our website

Questions? Thank you