Introduction to HPC Debugging with Allinea DDT Nick Forrington


Debugging is hard!
“Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?”
–Brian Kernighan, “The Elements of Programming Style”

Debugging in general
Hypothesize about potential cause
Identify variables/data of interest
Inspect values

HPC Debugging: Additional Challenges
Remote system
Batch systems
Large code bases
Parallelism => complexity
Large, distributed data sets

Can we reduce the complexity?
Reproduce at a smaller scale?
Reduced data set may not trigger the problem?
Is the problem related to the size?
Is probability stacking up against you?
Debugging at scale is a necessity

Print statement debugging
The original debugger
  –Allows inspection of program state
  –Diagnose the problem from evidence and intuition
Can be a long slow process
  –Particularly if trying to find code locations
  –Edit/Compile/Run cycle
Fails at modest scale
  –Too much output
  –Matching output between processes
Cycle diagram: Identify area of interest, Insert print statements, Compile, Run, View output
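The "matching output between processes" problem can be illustrated with a small sketch. It does not use MPI: `simulate_output` is a hypothetical stand-in for a parallel run whose debug lines arrive interleaved, and the tag format is an assumption made for this example. Tagging every line with its rank and step is what makes the output recoverable at all.

```python
import random

def tagged(rank, step, value):
    """Format one debug line with the rank and step that produced it."""
    return f"[rank {rank:04d} step {step:03d}] x={value}"

def simulate_output(n_ranks, n_steps, seed=0):
    """Hypothetical stand-in for a parallel run: every rank prints once per
    step, and the lines arrive in an arbitrary interleaved order."""
    lines = [tagged(r, s, r * s) for r in range(n_ranks) for s in range(n_steps)]
    random.Random(seed).shuffle(lines)
    return lines

def lines_for_rank(lines, rank):
    """Recover one rank's output in step order. The tag is fixed-width,
    so a plain string sort orders the lines by step."""
    prefix = f"[rank {rank:04d} "
    return sorted(line for line in lines if line.startswith(prefix))

output = simulate_output(n_ranks=8, n_steps=3)
rank5 = lines_for_rank(output, 5)
print(len(output), "lines in total")
for line in rank5:
    print(line)
```

With 8 ranks and 3 steps this is already 24 lines; the line count grows with ranks times steps, which is why the approach stops being practical at modest scale.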

Allinea Forge
Allinea Forge: a modern integrated environment for HPC developers
  –Allinea DDT + Allinea MAP
  –Productively debug code with Allinea DDT
  –Enhance application performance with Allinea MAP
Scalable
  –Tested at full scale on Titan
Supports various programming models/languages
  –C/C++, Fortran, CUDA
  –MPI, OpenMP, OpenACC
Available on OLCF systems
  – module load forge
  –Allinea Forge – 164,868 processes

HPC Debugging: Solutions
Remote system
  –Remote Client gives a local GUI – no remote graphics lag
Batch systems
  –Launch with minimal modification to existing batch scripts
  –DDT can restart a program within an existing session
  –“Offline mode” allows batch debugging
Large code base
  –Source code navigation – jump to class / function / etc.
  –Display version control information
Parallelism
  –Manage and control groups of processes simultaneously
  –Inspect program location and data across processes to identify outliers
Large, distributed data sets
  –Compare data across processes
  –Array viewer allows inspection of multi-dimension and distributed arrays
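The idea of comparing one variable across processes to spot outliers can be sketched in a few lines. This is only an illustration of the concept, not how DDT implements it: the function name `find_outlier_ranks`, the median-absolute-deviation test, and the threshold of 5 are all assumptions chosen for the example.

```python
from statistics import median

def find_outlier_ranks(values_by_rank, threshold=5.0):
    """Return the ranks whose value lies far from the median across ranks,
    measured in units of the median absolute deviation (MAD)."""
    med = median(values_by_rank.values())
    deviations = {r: abs(v - med) for r, v in values_by_rank.items()}
    mad = median(deviations.values())
    if mad == 0:
        # Most ranks agree exactly: any rank that differs is an outlier.
        return sorted(r for r, d in deviations.items() if d > 0)
    return sorted(r for r, d in deviations.items() if d / mad > threshold)

# Hypothetical snapshot of the same variable on 16 ranks; rank 11 is corrupted.
values = {rank: 100.0 + 0.1 * (rank % 3) for rank in range(16)}
values[11] = -1.0e30
print(find_outlier_ranks(values))  # [11]
```

A median-based test is used here because a single wildly wrong value would distort a mean and standard deviation computed over the same data.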

Demo

Quick and Easy Profiling with Allinea MAP Nick Forrington

The Uncomfortable Truth about Applications

Code optimization can be time consuming
Cycle diagram: Insert timers, Run code, Analyse result, Change code
Image source: xkcd.com/1445/
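For reference, the manual "insert timers" step usually looks something like the sketch below. The region names and the `timer` helper are hypothetical; the point is that every region of interest has to be wrapped by hand, and the code re-run after each change.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # accumulated wall-clock seconds per region

@contextmanager
def timer(region):
    """Add the wall-clock time spent inside the `with` block to totals[region]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[region] += time.perf_counter() - start

def compute(n):
    """Placeholder workload standing in for a real kernel."""
    return sum(i * i for i in range(n))

with timer("setup"):
    data = list(range(1000))
with timer("compute"):
    result = compute(200_000)

# Report regions from most to least expensive.
for region, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{region:10s} {seconds:.6f} s")
```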

Allinea MAP in a nutshell
Small data files
<5% runtime overhead
No instrumentation
No profiling configuration

How Allinea MAP is different
Adaptive sampling
  –Sample frequency decreases over time
  –Data never grows too much
  –Run for as long as you want
Scalable
  –Same scalable infrastructure as Allinea DDT
  –Merges sample data at end of job
  –Handles very high core counts, fast
Instruction analysis
  –Categorizes instructions sampled
  –Knows where processor spends time
  –Shows vectorization and memory use
Thread profiling
  –Core-time not thread-time profiling
  –Identifies lost compute time
  –Detects OpenMP issues
Integrated
  –Part of Forge tool suite
  –Zoom and drill into profile
  –Profiling within your code
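One way to get the two adaptive-sampling properties named here (sample frequency decreases over time, data never grows too much) is a fixed-capacity buffer that thins itself when full. The sketch below is an assumption about how such a scheme could work, not a description of MAP's actual implementation; `AdaptiveSampler` and its parameters are hypothetical.

```python
class AdaptiveSampler:
    """Keep at most `capacity` samples, however long the run lasts."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.interval = 1   # record every `interval`-th tick
        self.samples = []   # (tick, value) pairs, evenly spaced in time

    def tick(self, t, value):
        if t % self.interval != 0:
            return          # this tick is skipped at the current sampling rate
        self.samples.append((t, value))
        if len(self.samples) >= self.capacity:
            # Buffer full: keep every other sample and halve the sampling rate.
            self.samples = self.samples[::2]
            self.interval *= 2

sampler = AdaptiveSampler(capacity=8)
for t in range(1000):
    sampler.tick(t, value=t * 0.5)
print(len(sampler.samples), sampler.interval)
```

After 1000 ticks the buffer still holds fewer than 8 samples, and the ones that remain are spread evenly over the whole run rather than clustered at the start or end.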

6 Steps To Improve Performance
Get a realistic test case
  –Performance on real data matters
  –Keep the test case for reference and re-use
Profile your code
  –Add the “-g” flag to your compilation
  –Run with a profiler
Look for the significant
  –Which part/phase of the code dominates time?
  –Is there any unexpected significant time use?
What is the nature of the problem?
  –Compute? I/O? MPI? Thread synchronization?
  –Display the metrics that show the problem best
Apply brainpower to solve
  –MPI: can you balance the work better?
  –Compute: is memory time dominant? Can you improve layout?
Think of the future
  –Try larger process or thread counts to watch for scalability problems
  –Keep the profile (.map file) for future comparison
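For the "can you balance the work better?" question, a simple number to compute from per-rank compute times is the ratio of the slowest rank to the average. The sketch below uses made-up timings; the metric is a common rule of thumb and is not taken from MAP's own output.

```python
def imbalance(times):
    """Ratio of the slowest rank's time to the mean time (1.0 = perfectly balanced)."""
    mean = sum(times) / len(times)
    return max(times) / mean

def wasted_fraction(times):
    """Fraction of total core time spent waiting for the slowest rank."""
    mean = sum(times) / len(times)
    return 1.0 - mean / max(times)

# Hypothetical per-rank compute times in seconds; one rank has extra work.
times = [10.0, 10.0, 10.0, 18.0]
print(f"imbalance = {imbalance(times):.2f}")      # imbalance = 1.50
print(f"wasted    = {wasted_fraction(times):.2f}")  # wasted    = 0.33
```

Here three of the four ranks sit idle for 8 of 18 seconds each, so a third of the allocated core time does no useful work.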

Allinea MAP and other performance tools: a great synergy
Simple optimization with Allinea MAP
  –Characterize performance at-scale with a lightweight tool
  –See which lines of code are hotspots
  –Identify common problems with MAP
Prepare optimization strategy with Allinea MAP
  –Identify loop(s) to instrument
  –Identify performance counter(s) to record
  –Document performance issues to communicate to profiling experts
Fine tune the code with a tracing tool
  –Retrieve low-level details using Score-P/Vampir, nvprof, etc.
  –Fix up CPU usage to make the code fly

Preparing your program for profiling
Linking (on Titan)
  – $ module load forge
  – $ module load map-link-static # or map-link-dynamic
  –Re-link your program
Should I recompile?
  –Debug information (-g) required to display source code.
Caveats:
  –PGI: If using -g and -O, line number information may be inaccurate.
  –Cray: -g disables most optimizations – use -G2 instead.
  –Issues with source code locations?
    Include frame headers (e.g. --eh-frame-hdr)
    Function inlining (e.g. -fno-inline)

How to run MAP
Modify existing job submission script
  – $ source $MODULESHOME/init/bash
  – $ module load forge
  – $ map --profile aprun …
Submit to the queue
  – $ qsub submit.qsub
Open the result
  – $ map ./output.map
  –Use the remote client

Bonus: Summarize with Performance Reports
$ module load perf-reports
$ perf-report file.map

Demo