Germán Llort, Judit Giménez

BSC Tools Hands-On
Germán Llort, Judit Giménez
Barcelona Supercomputing Center

Getting a trace with Extrae

Extrae features
- Platforms: Intel, Cray, BlueGene, Intel MIC, ARM, Android, Fujitsu Sparc…
- Parallel programming models: MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, Java, Python...
- Performance counters: using the PAPI interface
- Link to source code: callstack at MPI routines, OpenMP outlined routines, selected user functions
- Periodic samples
- User events (Extrae API)
No need to recompile / relink!
BSCTools Hands-on

Extrae overheads
                                Average values    Archer
  Event                         150 – 200 ns      160 – 170 ns
  Event + PAPI                  750 – 1000 ns     800 – 950 ns
  Event + callstack (1 level)   600 ns            540 ns
  Event + callstack (6 levels)  1.9 us            1.5 us
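To get a feel for what these per-event costs mean for a whole run, a back-of-the-envelope estimate multiplies the number of traced events by the cost per event. The sketch below is only illustrative: the function name, the event count and the runtime are hypothetical, and the 200 ns and 1000 ns figures are the upper bounds of the "Average values" column above.

```shell
#!/bin/bash
# estimate_overhead EVENTS NS_PER_EVENT RUNTIME_S
# Prints the time added by tracing, in seconds and as a share of the runtime.
# All three inputs are assumptions supplied by the caller.
estimate_overhead() {
    awk -v n="$1" -v ns="$2" -v t="$3" 'BEGIN {
        added = n * ns * 1e-9                      # ns -> s
        printf "%.3f s (%.2f%%)\n", added, 100 * added / t
    }'
}

# Hypothetical process: 2 million events in a 60 s run
estimate_overhead 2000000 200 60     # plain events
estimate_overhead 2000000 1000 60    # events + PAPI counters
```

The real event count depends on how many MPI calls the application makes and on what the XML configuration enables, so treat the result as an order of magnitude only.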

How does Extrae work?
- Symbol substitution through LD_PRELOAD (recommended)
  - Specific libraries for each combination of runtimes: MPI, OpenMP, OpenMP+MPI, …
- Dynamic instrumentation
  - Based on DynInst (developed by U. Wisconsin / U. Maryland)
  - Instrumentation in memory
  - Binary rewriting
- Static link (i.e., PMPI, Extrae API)

Linking in Archer
- Cray compilers link statically by default
- How to make it dynamic? Add the flag -dynamic
- This enables tracing with the LD_PRELOAD method
    archer> [ cc | CC | ftn ] ... -dynamic

Problems with dynamic linking?
- Link statically against the tracing library (+ dependencies)
- Only supports MPI instrumentation
- Insert before the actual MPI library
- Extrae will always intercept the MPI calls
- Don't set LD_PRELOAD
    LDFLAGS += \
      -L$EXTRAE_HOME/lib -lmpitrace \
      -L$BSCTOOLS_HOME/deps/binutils/2.24/lib -lbfd -liberty \
      -L$BSCTOOLS_HOME/deps/libunwind/1.1/lib -lunwind \
      -L/opt/cray/papi/5.4.1.2/lib -lpapi \
      -L/usr/lib64 -lxml2 \
      -lrt -lz -ldl

Using Extrae in 3 steps
1. Adapt your job submission script
2. Configure what to trace
   - XML configuration file
   - Example configurations at $EXTRAE_HOME/share/example
3. Run it!
For further reference check the Extrae User Guide: https://tools.bsc.es/sites/default/files/documentation/html/extrae/index.html
Also distributed with Extrae at $EXTRAE_HOME/share/doc

Login to Archer and copy the examples
    laptop> ssh -Y <USER>@login.archer.ac.uk
    archer> cp -r /work/y14/shared/bsctools/tools-material $WORK
    archer> ls $WORK/tools-material
    ... apps/ ... clustering/ ... extrae/ ... slides/ ... traces/
The slides/ folder holds a copy of these slides.

Step 1: Adapt the job script to load Extrae with LD_PRELOAD
    archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh
    #!/bin/bash --login
    #PBS -N LULESH2
    #PBS -l select=2
    #PBS -l walltime=00:05:00
    #PBS -A y14
    module unload PrgEnv-cray PrgEnv-gnu
    module load PrgEnv-intel
    export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
    cd ${PBS_O_WORKDIR}
    export OMP_NUM_THREADS=1
    aprun -n 27 -S 7 ../apps/lulesh2.0 ...
Callouts: Request resources (#PBS lines); Change MPI version (module lines); Run the program (aprun line).

Step 1: Adapt the job script to load Extrae with LD_PRELOAD
    archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh
    #!/bin/bash --login
    #PBS -N LULESH2
    #PBS -l select=2
    #PBS -l walltime=00:05:00
    #PBS -A y14
    module unload PrgEnv-cray PrgEnv-gnu
    module load PrgEnv-intel
    export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
    cd ${PBS_O_WORKDIR}
    export OMP_NUM_THREADS=1
    export TRACE_NAME=lulesh_27p.prv
    aprun -n 27 -S 7 ./trace.sh ../apps/lulesh2.0 ...
Callout: Activate Extrae during the run (the TRACE_NAME export and the ./trace.sh wrapper in the aprun line).

Step 1: Adapt the job script to load Extrae with LD_PRELOAD
The job script is the same as on the previous slide; trace.sh is the wrapper that its aprun line launches.
    archer> vi $WORK/tools-material/extrae/trace.sh
    #!/bin/bash
    source /work/.../extrae/intel-mpich/etc/extrae.sh
    # Configure Extrae
    export EXTRAE_CONFIG_FILE=./extrae.xml
    # Load the tracing library (choose C/Fortran)
    export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
    #export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitracef.so
    # Run the program
    $*
Callouts: Same MPI version as the application (the sourced extrae.sh); Select "what to trace" (EXTRAE_CONFIG_FILE); Select your type of application (LD_PRELOAD).
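The wrapper approach works because variables exported inside trace.sh are inherited only by the program it launches, so aprun itself does not run under the tracing library. The toy sketch below shows that mechanism with a harmless variable; run_wrapped is a hypothetical stand-in for trace.sh and no Extrae library is loaded.

```shell
#!/bin/bash
# run_wrapped CMD [ARGS...]
# Hypothetical stand-in for trace.sh: set the environment in a subshell,
# then run the command given on the command line.
run_wrapped() {
    (
        export EXTRAE_CONFIG_FILE=./extrae.xml
        "$@"
    )
}

# The wrapped command sees the variable...
run_wrapped sh -c 'echo "inside wrapper:  ${EXTRAE_CONFIG_FILE:-unset}"'
# ...while the calling shell is left untouched.
echo "outside wrapper: ${EXTRAE_CONFIG_FILE:-unset}"
```

This sketch uses "$@" so that arguments containing spaces reach the program intact; the $* used in the original trace.sh splits such arguments, which is harmless for the LULESH command line shown here.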

Step 1: Which tracing library?
Choose depending on the application type
  Library                  Serial  MPI  OpenMP  pthread  CUDA
  libseqtrace                x
  libmpitrace[f] (1)                 x
  libomptrace                             x
  libpttrace                                      x
  libcudatrace                                             x
  libompitrace[f] (1)                x    x
  libptmpitrace[f] (1)               x            x
  libcudampitrace[f] (1)             x                     x
(1) include suffix "f" in Fortran codes
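The table can be turned into a small lookup, sketched below. The function name pick_trace_lib and the model keywords (serial, mpi, mpi+openmp, ...) are hypothetical conventions chosen for this example; only the library names come from the slide.

```shell
#!/bin/bash
# pick_trace_lib MODEL [LANG]
# MODEL: serial | mpi | openmp | pthread | cuda | mpi+openmp | mpi+pthread | mpi+cuda
# LANG:  c (default) | fortran
# Prints the file name of the matching Extrae tracing library.
pick_trace_lib() {
    model=$1
    lang=${2:-c}
    case "$model" in
        serial)      base=libseqtrace ;;
        mpi)         base=libmpitrace ;;
        openmp)      base=libomptrace ;;
        pthread)     base=libpttrace ;;
        cuda)        base=libcudatrace ;;
        mpi+openmp)  base=libompitrace ;;
        mpi+pthread) base=libptmpitrace ;;
        mpi+cuda)    base=libcudampitrace ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
    # Only the MPI-based libraries have a Fortran variant (suffix "f")
    suffix=""
    if [ "$lang" = fortran ]; then
        case "$base" in
            *mpi*) suffix=f ;;
        esac
    fi
    echo "${base}${suffix}.so"
}

pick_trace_lib mpi fortran      # libmpitracef.so
pick_trace_lib mpi+openmp c     # libompitrace.so
# In trace.sh one could then write, for example:
#   export LD_PRELOAD=${EXTRAE_HOME}/lib/$(pick_trace_lib mpi c)
```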

Step 3: Run it!
- Submit your job
    archer> cd $WORK/tools-material/extrae
    archer> qsub run_lulesh_27p.sh
- Check the status of your job with: qstat -u $USER
- Once finished, the trace will be in the same folder: lulesh_27p.{pcf,prv,row} (3 files)
- Any issue? A trace is already generated at $WORK/tools-material/traces
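Since Paraver needs all three files, a quick check after the job finishes can save a failed load later. A minimal sketch follows; the helper name check_trace is hypothetical.

```shell
#!/bin/bash
# check_trace DIR NAME
# Reports whether NAME.prv, NAME.pcf and NAME.row exist in DIR and are non-empty.
# Returns 0 only if all three are present.
check_trace() {
    dir=$1
    name=$2
    status=0
    for ext in prv pcf row; do
        if [ -s "$dir/$name.$ext" ]; then
            echo "ok       $name.$ext"
        else
            echo "missing  $name.$ext"
            status=1
        fi
    done
    return $status
}

# Example (on Archer, after the job has finished):
#   check_trace "$WORK/tools-material/extrae" lulesh_27p
```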

Step 2: Extrae XML configuration
    archer> vi $WORK/tools-material/extrae/extrae.xml
    <mpi enabled="yes">
      <counters enabled="yes" />
    </mpi>
    <openmp enabled="yes">
      <locks enabled="no" />
    </openmp>
    <pthread enabled="no">
    </pthread>
    <callers enabled="yes">
      <mpi enabled="yes">1-3</mpi>
      <sampling enabled="no">1-5</sampling>
    </callers>
Callouts: Trace the MPI calls (What's the program doing?) for the <mpi> section; Trace the call-stack (Where in my code?) for the <callers> section. Compile with debug! (-g)

Step 2: Extrae XML configuration (II)
    <counters enabled="yes">
      <cpu enabled="yes" starting-set-distribution="cyclic">
        <set enabled="yes" changeat-time="500000us" domain="all">
          PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_L3_TCM, PAPI_BR_INS, PAPI_L2_DCA
        </set>
        <set enabled="yes" changeat-time="500000us" domain="all">
          PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_SR_INS, RESOURCE_STALLS:ROB, RESOURCE_STALLS:RS
        </set>
        <set … /set>
      </cpu>
      <network enabled="no" />
      <resource-usage enabled="no" />
      <memory-usage enabled="no" />
    </counters>
Select which HW counters are measured (How's the machine doing?)
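With several counter sets, each process measures only one set at a time. The sketch below is a simplified model of the rotation, assuming that starting-set-distribution="cyclic" starts task 0 on set 1, task 1 on set 2, and so on, and that every changeat-time period each task moves to the next set. That reading of the two attributes is an assumption made for illustration; the Extrae User Guide is the authority on the exact behaviour. The function name active_set is hypothetical.

```shell
#!/bin/bash
# active_set TASK TIME_US [PERIOD_US] [NSETS]
# Simplified model (assumption, see lead-in): which counter set, numbered from 1,
# a task is reading at a given time, for NSETS sets rotated every PERIOD_US.
active_set() {
    task=$1
    t=$2
    period=${3:-500000}   # changeat-time="500000us" from the slide
    nsets=${4:-2}         # the two fully specified sets on the slide
    echo $(( (task + t / period) % nsets + 1 ))
}

active_set 0 0         # task 0 at start          -> 1
active_set 1 0         # task 1 at start          -> 2
active_set 0 600000    # task 0 after one switch  -> 2
```

Under this model every counter set is being read by some process at any given moment, which is what lets a single run cover more counters than the hardware can count simultaneously.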

Step 2: Extrae XML configuration (III)
    <buffer enabled="yes">
      <size enabled="yes">5000000</size>
      <circular enabled="no" />
    </buffer>
    <sampling enabled="no" type="default" period="50m" variability="10m" />
    <merge enabled="yes" synchronization="default" tree-fan-out="16" max-memory="512" joint-states="yes" keep-mpits="yes" sort-addresses="yes" overwrite="yes">
      $TRACE_NAME$
    </merge>
Callouts: Trace buffer size (Flush/memory trade-off) for <buffer>; Enable sampling (Want more details?) for <sampling>; Automatic post-processing to generate the Paraver trace for <merge>.

Installing Paraver & First analysis steps

Install Paraver on your laptop
- Download from http://tools.bsc.es/downloads
- Also available @Archer: /work/y14/shared/bsctools/tools-packages
    laptop> scp <USER>@login.archer.ac.uk:/work/y14/shared/bsctools/tools-packages/<PACKAGE> $HOME
- Pick your version:
    wxparaver-4.7.2-win.zip
    wxparaver-4.7.2-mac.zip
    wxparaver-4.7.2-Linux_x86_64.tar.gz (64-bits)
    wxparaver-4.7.2-Linux_i686.tar.gz (32-bits)

Install Paraver (II)
- Download tutorials: Download links -> Documentation -> Tutorial guidelines
- Also available @Archer: /work/y14/shared/bsctools/tools-packages
    laptop> scp <USER>@login.archer.ac.uk:/work/y14/shared/bsctools/tools-packages/paraver-tutorials-20150526.tar.gz $HOME

Uncompress, rename & move
- Uncompress both packages
- Rename folders into "paraver" and "tutorials"
- Drag "tutorials" folder into "paraver"
- (Mac) Destination is… "Right click" -> Show Package Contents -> Contents -> Resources
- Command-line (Linux):
    laptop> tar xf wxparaver-4.7.2-linux-x86_64.tar.gz
    laptop> mv wxparaver-4.7.2-linux-x86_64 paraver
    laptop> tar xf paraver-tutorials-20150526.tar.gz
    laptop> mv paraver-tutorials-20150526 paraver/tutorials

Check that everything works
- Start Paraver
    laptop> $HOME/paraver/bin/wxparaver &
- Check that tutorials are available: click on Help -> Tutorials
- Remotely available in Archer
    laptop> ssh -Y <USER>@login.archer.ac.uk
    archer> /work/y14/shared/bsctools/wxparaver/latest/bin/wxparaver

First steps of analysis
- Copy the trace to your laptop (all 3 files: *.prv, *.pcf, *.row)
    laptop> scp <USER>@login.archer.ac.uk:$WORK/tools-material/extrae/lulesh_27p.* ./
  ($WORK is the Archer-side path; your laptop's shell expands it locally, so substitute its value if it is not defined there.)
- Load the trace: click on File -> Load Trace -> browse to the *.prv file
- Follow Tutorial #3, "Introduction to Paraver and Dimemas methodology": click on Help -> Tutorials

Measure the parallel efficiency
- Click on "mpi_stats.cfg"
- Click on "Open Control Window"
- Zoom to skip initialization / finalization phases (drag & drop)
- Right click -> Copy -> Time on the zoomed timeline, then Right click -> Paste -> Time on the table
- Read off: Parallel efficiency, Comm efficiency, Load balance
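The three figures are related. In the efficiency model commonly used with these tools, load balance is the average useful (outside MPI) time divided by the maximum, communication efficiency is the maximum useful time divided by the runtime, and parallel efficiency is their product. The sketch below computes them from per-process useful times; the function name and the sample numbers are hypothetical, not taken from the LULESH trace.

```shell
#!/bin/bash
# efficiencies RUNTIME USEFUL_1 USEFUL_2 ...
# RUNTIME:  elapsed time of the analysed window
# USEFUL_i: time process i spent computing (outside MPI), same unit
efficiencies() {
    runtime=$1
    shift
    printf '%s\n' "$@" | awk -v T="$runtime" '
        { sum += $1; if ($1 > max) max = $1 }
        END {
            avg = sum / NR
            # parallel_eff = load_balance * comm_eff = avg / T
            printf "load_balance=%.2f comm_eff=%.2f parallel_eff=%.2f\n",
                   avg / max, max / T, avg / T
        }'
}

# Hypothetical window: 10 s runtime, 4 processes
efficiencies 10 8 6 7 9
# prints: load_balance=0.83 comm_eff=0.90 parallel_eff=0.75
```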

Computation time and work distribution
- Click on "2dh_usefulduration.cfg" (2nd link) -> Shows time computing
- …and "2dh_useful_instructions.cfg" (3rd link) -> Shows amount of work
- Zoom to skip the large burst from the initialization (by dragging)
- Then look for: Performance imbalance (zig-zag), Work imbalance (zig-zag)

Where does this happen?
- Histogram labels: Slow / Fast; Slow & Fast at the same time -> Imbalance
- Hints -> Callers -> Caller function
- Go from the table to the timeline:
  - Click on "Open Filtered Control Window"
  - Select this area (by dragging)
  - Right click -> Copy, then Right click -> Paste -> Time
  - Right click -> Fit Semantic Scale -> Fit both
- Zoom into 1 of the iterations (by dragging)
- Hidden values (click to show)
- Functions shown: CommSend, CommMonoQ, TimeIncrement

Save CFG’s (2 methods)
- Method 1: Right click on timeline
- Method 2: 1. Main Paraver window  2. Select  3. Save

CFG’s distribution
Paraver comes with many more included CFG’s

Hints: a good place to start!
Paraver suggests CFG’s based on the information present in the trace

Cluster-based analysis

Use clustering analysis
- Run clustering
    laptop> ssh -Y <USER>@login.archer.ac.uk
    archer> cd $WORK/tools-material/clustering
    archer> /work/y14/shared/bsctools/clustering/2.6.6/bin/BurstClustering -d cluster.xml -i ../extrae/lulesh_27p.prv -o lulesh_27p_clustered.prv
- If you didn’t get your own trace, use a prepared one from:
    archer> ls $WORK/tools-material/traces/lulesh_27p.prv

Cluster-based analysis
- Check the resulting scatter plot
    archer> gnuplot lulesh_27p_clustered.IPC.PAPI_TOT_INS.gnuplot
- Identify main computing trends: Work (Y) vs. Speed (X)
- Look at the shape of the clusters: variability in both axes indicates potential imbalances (variable work on Y, variable speed on X)

Correlating scatter plot and time distribution
- Copy the clustered trace to your laptop and look at it
    laptop> $HOME/paraver/bin/wxparaver <path-to>/lulesh_27p_clustered.prv
- Display the distribution of clusters over time: File -> Load configuration -> $HOME/paraver/cfgs/clustering/clusterID_window.cfg
- Variable work / speed + simultaneously @ different processes = Imbalances