Profiling your application with Intel VTune at NERSC


VTune background and availability
- Focus: on-node performance analysis
    - Sampling and trace-based profiling
    - Performance counter integration
    - Memory bandwidth analysis
    - On-node parallelism: vectorization and threading
- Pre-defined analysis experiments
- GUI and command-line interface (good for headless collection and later analysis)
- NERSC availability (as the vtune module)
    - Edison (dual 12-core Ivy Bridge)
    - Babbage (dual 8-core Sandy Bridge + dual Xeon Phi)

Running VTune on Edison I
- Use the Cray cc or ftn wrappers for the Intel compilers
- Suggested compiler flags:
    -g  : enable debugging symbols
    -O2 : use production-realistic optimization levels (not -O0)
- To use VTune on Edison, you have to:
    - Run within a CCM job (batch or interactive)
    - Use dynamic linking if profiling OpenMP code (-dynamic)
    - Use a working directory on a Lustre $SCRATCH filesystem
    edison09:BGW > ftn -dynamic -g -O2 -xAVX -openmp bgw.f90 -o bgw.x
    edison09:BGW > mkdir $SCRATCH/vtune-runs
    edison09:BGW > cp bgw.x $SCRATCH/vtune-runs/
    edison09:BGW > cd $SCRATCH/vtune-runs/
    edison09:vtune-runs > qsub -I -q ccm_int -l mppwidth=24
    wait ...
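
The working-directory requirement above is easy to forget when moving between login and compute nodes. As a minimal sketch, a guard like the following could be sourced before launching a collection. The function name in_scratch is hypothetical and not part of VTune or the NERSC environment; the only assumption is that $SCRATCH is set, as it is in the session shown above.

```shell
#!/bin/bash
# Hypothetical helper (not a VTune or NERSC tool): succeed only if the
# given directory, or the current directory by default, lies under $SCRATCH.
in_scratch() {
    local dir="${1:-$PWD}"
    # Fail if SCRATCH is unset or empty.
    [ -n "${SCRATCH:-}" ] || return 1
    # Append a slash so that /scratch/user2 does not match /scratch/user.
    case "$dir/" in
        "${SCRATCH%/}"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `in_scratch || echo "cd to a directory under \$SCRATCH first"` prints the warning only when the current directory is outside $SCRATCH.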

Running VTune on Edison II
- Once you're in a CCM job (either interactive or batch script):
    - cd to your submission directory
    - Launch VTune to profile your code on a compute node with aprun
    CCM Start success, 1 of 1 responses
    nid02433:~ > cd $PBS_O_WORKDIR
    edison09:vtune-runs > module load vtune
    nid02433:vtune-runs > aprun -n 1 amplxe-cl -collect experiment_name -r result_dir -- ./bgw.x
- amplxe-cl is the VTune CLI
    -collect : specifies the collection experiment to run
    -r       : specifies an output directory to save results
- Set OMP_NUM_THREADS and associated aprun options (-d, -S, -cc depth, -cc numa_node) as needed
- Results can be analyzed by launching amplxe-gui and navigating to the result directory (preferably in NX)
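
To keep the thread count and the aprun depth consistent, the launch line can be assembled in one place. The sketch below only prints the command for review; it does not run aprun or amplxe-cl. The function name vtune_cmd is hypothetical, and pairing -d with -cc depth is one plausible affinity choice for a single-process OpenMP run, not a NERSC recommendation.

```shell
#!/bin/bash
# Hypothetical helper: print an aprun + amplxe-cl command line for review.
# Arguments: experiment name, result directory, thread count, then the
# program to profile and any arguments it takes.
vtune_cmd() {
    local experiment="$1" result_dir="$2" threads="$3"
    shift 3
    # -d reserves one core per thread; -cc depth is an assumed affinity choice.
    printf 'aprun -n 1 -d %s -cc depth amplxe-cl -collect %s -r %s --' \
        "$threads" "$experiment" "$result_dir"
    # Everything after "--" is the profiled program and its arguments.
    printf ' %s' "$@"
    printf '\n'
}
```

For instance, `vtune_cmd general-exploration ge_results 12 ./bgw.x` prints a line that can be pasted into the CCM session after `export OMP_NUM_THREADS=12`.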

Experiments: General exploration
- Available on Edison and Babbage (SNB + Xeon Phi)
- Detailed characterization of relevant performance metrics throughout your application
- Default: low-level detail aggregated into summary metrics
    - Mouse-over for explanation of their significance
    - Can be used to characterize locality issues, poor vectorization, etc.
- Multiple "viewpoints" available:
    - Direct access to hardware event counters
    - Spin / sync overhead for OpenMP threaded regions
    nid02433:vtune-runs > aprun -n 1 amplxe-cl -collect general-exploration -r ge_results -- ./bgw.x

Experiments: General exploration
- A whole lot of summary metrics!

Experiments: General exploration
- Filter by process and thread ID
- Show loops as well as functions

Experiments: General exploration
- Change viewpoint to get to hardware counters, hotspot analysis, and more

Experiments: Memory bandwidth
- Available on Edison and Babbage (Xeon Phi only)
    - Caveat: avoid Babbage SNB for now (node will lock up)
- Gives DRAM read / write traffic as a function of time during program execution
- Useful to first calibrate with a well-understood code on the same platform (e.g. STREAM)
- Can help determine whether your code is at least partially (effectively) BW bound
    nid02433:vtune-runs > aprun -n 1 amplxe-cl -collect bandwidth -r bw_results -- ./bgw.x
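
One way to use the STREAM calibration is to express your code's measured bandwidth as a percentage of what STREAM reaches on the same node. The sketch below does only that arithmetic; the function name bw_fraction is hypothetical, and both inputs are numbers you read from the VTune bandwidth view yourself, in the same units.

```shell
#!/bin/bash
# Hypothetical helper: percentage of the STREAM-calibrated bandwidth
# that the application reaches. Both arguments must use the same units
# (for example GB/s as read from the VTune bandwidth view).
bw_fraction() {
    awk -v measured="$1" -v stream="$2" 'BEGIN {
        # Refuse to divide by a missing or non-positive calibration value.
        if (stream <= 0) exit 1
        printf "%.1f\n", 100 * measured / stream
    }'
}
```

With placeholder inputs, `bw_fraction 40 80` prints 50.0. A value close to 100 during the hot regions suggests the code is effectively bandwidth bound there.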

Experiments: Bandwidth
- Average BW listed by CPU package

Experiments: Bandwidth
- OpenMP regions
- Peak BW
- Read, write, and aggregate BW time series
- Click and drag to zoom in for more detail

More resources
- At NERSC
    - On our debugging and profiling tools pages: http://www.nersc.gov/users/software/debugging-and-profiling/vtune/
    - More details on how to run your analysis on both the Edison compute nodes and the Babbage Xeon Phis
    - Pointers to materials from previous NERSC trainings
- At Intel
    - Main documentation for 2015 version: https://software.intel.com/en-us/node/529213
    - Detailed descriptions of the various experiment types
    - Pointers to tutorials on specific topics or platforms