BSC Tools Hands-On
Germán Llort, Judit Giménez
Barcelona Supercomputing Center
Getting a trace with Extrae
Extrae features
- No need to recompile / relink!
- Platforms: Intel, Cray, BlueGene, Intel MIC, ARM, Android, Fujitsu Sparc…
- Parallel programming models: MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, Java, Python...
- Performance counters: using the PAPI interface
- Link to source code
    - Callstack at MPI routines
    - OpenMP outlined routines
    - Selected user functions
- Periodic samples
- User events (Extrae API)
Extrae overheads

                               Average values    Archer
Event                          150-200 ns        160-170 ns
Event + PAPI                   750-1000 ns       800-950 ns
Event + callstack (1 level)    600 ns            540 ns
Event + callstack (6 levels)   1.9 us            1.5 us
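To get a feel for what these per-event costs mean for a whole run, the sketch below multiplies an event count by a per-event overhead. Only the nanosecond figures come from the table (upper bound of each Archer range); the function names, the event count and the run length in the usage note are hypothetical, and the result is a rough upper-bound estimate, not a measurement.

```python
# Per-event overheads in nanoseconds: upper bound of each range in the
# Archer column of the overhead table. Keys are illustrative labels.
ARCHER_OVERHEAD_NS = {
    "event": 170,
    "event+papi": 950,
    "event+callstack1": 540,
    "event+callstack6": 1500,
}


def tracing_overhead_seconds(num_events, kind="event+papi"):
    """Estimate the time one process spends in instrumentation."""
    return num_events * ARCHER_OVERHEAD_NS[kind] * 1e-9


def overhead_fraction(num_events, runtime_seconds, kind="event+papi"):
    """Estimated instrumentation time as a fraction of the run time."""
    return tracing_overhead_seconds(num_events, kind) / runtime_seconds


if __name__ == "__main__":
    # Hypothetical process: one million events in a 60-second run.
    print(tracing_overhead_seconds(1_000_000))
    print(overhead_fraction(1_000_000, 60.0))
```

For a hypothetical process emitting one million events with PAPI counters during a 60-second run, the estimate is just under one second of added time, which is below 2% of the run.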
How does Extrae work?
- Symbol substitution through LD_PRELOAD (recommended)
    - Specific libraries for each combination of runtimes: MPI, OpenMP, OpenMP+MPI, …
- Dynamic instrumentation
    - Based on DynInst (developed by U. Wisconsin / U. Maryland)
    - Instrumentation in memory
    - Binary rewriting
- Static link (i.e., PMPI, Extrae API)
Linking in Archer
- Cray compilers link statically by default
- How to make it dynamic? Add the flag -dynamic
    - Enables tracing with the LD_PRELOAD method

    archer> [ cc | CC | ftn ] ... -dynamic
Problems with dynamic linking?
- Link statically against the tracing library (+ dependencies)
    - Only supports MPI instrumentation
    - Insert it before the actual MPI library
- Extrae will always intercept the MPI calls
    - Don't set LD_PRELOAD

    LDFLAGS += \
      -L$EXTRAE_HOME/lib -lmpitrace \
      -L$BSCTOOLS_HOME/deps/binutils/2.24/lib -lbfd -liberty \
      -L$BSCTOOLS_HOME/deps/libunwind/1.1/lib -lunwind \
      -L/opt/cray/papi/5.4.1.2/lib -lpapi \
      -L/usr/lib64 -lxml \
      -lrt -lz -ldl
Using Extrae in 3 steps
1. Adapt your job submission script
2. Configure what to trace
    - XML configuration file
    - Example configurations at $EXTRAE_HOME/share/example
3. Run it!

For further reference check the Extrae User Guide:
- https://tools.bsc.es/sites/default/files/documentation/html/extrae/index.html
- Also distributed with Extrae at $EXTRAE_HOME/share/doc
Login to Archer and copy the examples

    laptop> ssh -Y <USER>@login.archer.ac.uk
    archer> cp -r /work/y14/shared/bsctools/tools-material $WORK
    archer> ls $WORK/tools-material
    ... apps/
    ... clustering/
    ... extrae/
    ... slides/      (a copy of these slides)
    ... traces/
Step 1: Adapt the job script to load Extrae with LD_PRELOAD

    archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh

    #!/bin/bash --login
    #PBS -N LULESH2
    #PBS -l select=2
    #PBS -l walltime=00:05:00
    #PBS -A y14
    module unload PrgEnv-cray PrgEnv-gnu
    module load PrgEnv-intel
    export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
    cd ${PBS_O_WORKDIR}
    export OMP_NUM_THREADS=1
    aprun -n 27 -S 7 ../apps/lulesh2.0 ...

- The #PBS lines request resources
- The module lines change the MPI version
- The aprun line runs the program
Step 1: Adapt the job script to load Extrae with LD_PRELOAD (II)

    archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh

    #!/bin/bash --login
    #PBS -N LULESH2
    #PBS -l select=2
    #PBS -l walltime=00:05:00
    #PBS -A y14
    module unload PrgEnv-cray PrgEnv-gnu
    module load PrgEnv-intel
    export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
    cd ${PBS_O_WORKDIR}
    export OMP_NUM_THREADS=1
    export TRACE_NAME=lulesh_27p.prv
    aprun -n 27 -S 7 ./trace.sh ../apps/lulesh2.0 ...

- Activate Extrae during the run: export TRACE_NAME and launch the program through ./trace.sh
Step 1: Adapt the job script to load Extrae with LD_PRELOAD (III)

The job script run_lulesh_27p.sh launches the program through the wrapper trace.sh:

    archer> vi $WORK/tools-material/extrae/trace.sh

    #!/bin/bash
    source /work/.../extrae/intel-mpich/etc/extrae.sh
    # Configure Extrae
    export EXTRAE_CONFIG_FILE=./extrae.xml
    # Load the tracing library (choose C/Fortran)
    export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
    #export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitracef.so
    # Run the program
    $*

- The sourced extrae.sh must match the MPI version of the application
- EXTRAE_CONFIG_FILE selects "what to trace"
- LD_PRELOAD selects the library for your type of application
Step 1: Which tracing library?
Choose depending on the application type

Library                   Serial   MPI   OpenMP   pthread   CUDA
libseqtrace                 x
libmpitrace[f] (1)                  x
libomptrace                                x
libpttrace                                          x
libcudatrace                                                  x
libompitrace[f] (1)                 x      x
libptmpitrace[f] (1)                x               x
libcudampitrace[f] (1)              x                         x

(1) Include the suffix "f" in Fortran codes
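The library choice can be written down as a simple lookup. The sketch below only encodes the library names listed on this slide and the ".so" extension used in trace.sh; the function name, its parameters and the assumption that only the MPI variants take the Fortran suffix are illustrative, not part of Extrae.

```python
# Library names from the slide, keyed by (uses_mpi, other_model).
# other_model is None, "openmp", "pthread" or "cuda".
_LIBRARIES = {
    (False, None): "libseqtrace",
    (True, None): "libmpitrace",
    (False, "openmp"): "libomptrace",
    (False, "pthread"): "libpttrace",
    (False, "cuda"): "libcudatrace",
    (True, "openmp"): "libompitrace",
    (True, "pthread"): "libptmpitrace",
    (True, "cuda"): "libcudampitrace",
}


def tracing_library(uses_mpi=False, other_model=None, fortran=False):
    """Return the shared library file name for an application type."""
    try:
        name = _LIBRARIES[(uses_mpi, other_model)]
    except KeyError:
        raise ValueError(f"combination not listed: {other_model!r}") from None
    # Assumption: only the libraries marked [f] (the MPI ones) take the suffix.
    if uses_mpi and fortran:
        name += "f"
    return name + ".so"


if __name__ == "__main__":
    # A Fortran code using MPI + OpenMP.
    print(tracing_library(uses_mpi=True, other_model="openmp", fortran=True))
```

The returned name is what would follow ${EXTRAE_HOME}/lib/ in the LD_PRELOAD line of trace.sh.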
Step 3: Run it!
- Submit your job

    archer> cd $WORK/tools-material/extrae
    archer> qsub run_lulesh_27p.sh

- Check the status of your job with: qstat -u $USER
- Once finished, the trace will be in the same folder: lulesh_27p.{pcf,prv,row} (3 files)
- Any issue? Traces are already generated at $WORK/tools-material/traces
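A quick way to confirm that the run produced a complete trace is to check that all three files exist. This is a minimal sketch using only the Python standard library; the function name is hypothetical and the check says nothing about whether the trace contents are valid.

```python
from pathlib import Path

# A Paraver trace consists of three files sharing the same base name.
TRACE_SUFFIXES = (".prv", ".pcf", ".row")


def missing_trace_files(prv_path):
    """Return the trace files (as strings) that do not exist on disk."""
    prv = Path(prv_path)
    candidates = [prv.with_suffix(suffix) for suffix in TRACE_SUFFIXES]
    return [str(path) for path in candidates if not path.exists()]


if __name__ == "__main__":
    missing = missing_trace_files("lulesh_27p.prv")
    if missing:
        print("Incomplete trace, missing:", ", ".join(missing))
    else:
        print("All three trace files are present")
```

Run it from the folder where the job was submitted; an empty list means the .prv, .pcf and .row files are all there.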
Step 2: Extrae XML configuration

    archer> vi $WORK/tools-material/extrae/extrae.xml

Trace the MPI calls (What's the program doing?)

    <mpi enabled="yes">
      <counters enabled="yes" />
    </mpi>
    <openmp enabled="yes">
      <locks enabled="no" />
    </openmp>
    <pthread enabled="no">
    </pthread>

Trace the call-stack (Where in my code?). Compile with debug! (-g)

    <callers enabled="yes">
      <mpi enabled="yes">1-3</mpi>
      <sampling enabled="no">1-5</sampling>
    </callers>
Step 2: Extrae XML configuration (II)

Select which HW counters are measured (How's the machine doing?)

    <counters enabled="yes">
      <cpu enabled="yes" starting-set-distribution="cyclic">
        <set enabled="yes" changeat-time="500000us" domain="all">
          PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_L3_TCM, PAPI_BR_INS, PAPI_L2_DCA
        </set>
        <set enabled="yes" changeat-time="500000us" domain="all">
          PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_SR_INS, RESOURCE_STALLS:ROB, RESOURCE_STALLS:RS
        </set>
        <set … /set>
      </cpu>
      <network enabled="no" />
      <resource-usage enabled="no" />
      <memory-usage enabled="no" />
    </counters>
Step 2: Extrae XML configuration (III)

Trace buffer size (Flush/memory trade-off)

    <buffer enabled="yes">
      <size enabled="yes">5000000</size>
      <circular enabled="no" />
    </buffer>

Enable sampling (Want more details?)

    <sampling enabled="no" type="default" period="50m" variability="10m" />

Automatic post-processing to generate the Paraver trace

    <merge enabled="yes"
      synchronization="default"
      tree-fan-out="16"
      max-memory="512"
      joint-states="yes"
      keep-mpits="yes"
      sort-addresses="yes"
      overwrite="yes"
    >
      $TRACE_NAME$
    </merge>
Installing Paraver & First analysis steps
Install Paraver on your laptop
- Download from http://tools.bsc.es/downloads
- Pick your version:
    wxparaver-4.7.2-win.zip
    wxparaver-4.7.2-mac.zip
    wxparaver-4.7.2-Linux_x86_64.tar.gz (64-bits)
    wxparaver-4.7.2-Linux_i686.tar.gz (32-bits)
- Also available @Archer: /work/y14/shared/bsctools/tools-packages

    laptop> scp <USER>@login.archer.ac.uk:/work/y14/shared/bsctools/tools-packages/<PACKAGE> $HOME
Install Paraver (II)
- Download tutorials from the download links: Documentation → Tutorial guidelines
- Also available @Archer: /work/y14/shared/bsctools/tools-packages

    laptop> scp <USER>@login.archer.ac.uk:/work/y14/shared/bsctools/tools-packages/paraver-tutorials-20150526.tar.gz $HOME
Uncompress, rename & move
- Uncompress both packages
- Rename the folders to "paraver" and "tutorials"
- Drag the "tutorials" folder into "paraver"
    - Destination is… Right click → Show Package Contents → Contents → Resources
- Command-line (Linux):

    laptop> tar xf wxparaver-4.7.2-Linux_x86_64.tar.gz
    laptop> mv wxparaver-4.7.2-Linux_x86_64 paraver
    laptop> tar xf paraver-tutorials-20150526.tar.gz
    laptop> mv paraver-tutorials-20150526 paraver/tutorials
Check that everything works
- Start Paraver

    laptop> $HOME/paraver/bin/wxparaver &

- Check that tutorials are available: click on Help → Tutorials
- Remotely available in Archer

    laptop> ssh -Y <USER>@login.archer.ac.uk
    archer> /work/y14/shared/bsctools/wxparaver/latest/bin/wxparaver
First steps of analysis
- Copy the trace to your laptop (all 3 files: *.prv, *.pcf, *.row)

    laptop> scp <USER>@login.archer.ac.uk:$WORK/tools-material/extrae/lulesh_27p.* ./

- Load the trace: click on File → Load Trace, then browse to the *.prv file
- Follow Tutorial #3, "Introduction to Paraver and Dimemas methodology": click on Help → Tutorials
Measure the parallel efficiency
- Click on "mpi_stats.cfg"
- Click on "Open Control Window"
- Zoom to skip the initialization / finalization phases (drag & drop)
- Right click → Copy, then right click → Paste → Time
- Read the parallel efficiency, communication efficiency and load balance
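The three figures can be related with a few lines of arithmetic. The sketch below assumes the usual multiplicative model, in which parallel efficiency is the product of load balance and communication efficiency, computed from the time each process spends in useful computation (outside MPI). The function name and the sample numbers are hypothetical, not values from the LULESH trace.

```python
def efficiencies(useful_time, runtime):
    """Compute load balance, communication and parallel efficiency.

    useful_time: per-process time spent computing (outside MPI).
    runtime: elapsed time of the analysed region, in the same unit.
    """
    average = sum(useful_time) / len(useful_time)
    maximum = max(useful_time)
    load_balance = average / maximum        # 1.0 means perfectly balanced
    comm_efficiency = maximum / runtime     # 1.0 means no time lost in MPI
    parallel_efficiency = load_balance * comm_efficiency  # = average / runtime
    return load_balance, comm_efficiency, parallel_efficiency


if __name__ == "__main__":
    # Hypothetical 4-process region lasting 12.5 s.
    lb, comm, par = efficiencies([8.0, 6.0, 10.0, 8.0], 12.5)
    print(f"load balance {lb:.2f}, comm efficiency {comm:.2f}, "
          f"parallel efficiency {par:.2f}")
```

With these made-up numbers both factors are 0.80, giving a parallel efficiency of 0.64: the region loses as much to imbalance as to communication.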
Computation time and work distribution
- Click on "2dh_usefulduration.cfg" (2nd link): shows time computing
- …and "2dh_useful_instructions.cfg" (3rd link): shows amount of work
- Zoom to skip the large burst from the initialization (by dragging and dropping)
- Then…
    - Performance imbalance (zig-zag)
    - Work imbalance (zig-zag)
Where does this happen?
- Go from the table to the timeline
    - Select this area (by dragging and dropping)
    - Click on "Open Filtered Control Window"
    - Right click → Fit Semantic Scale → Fit both
    - Zoom into 1 of the iterations (by dragging and dropping)
    - Slow & fast at the same time: imbalance
- Hints → Callers → Caller function
    - Right click → Copy, then right click → Paste → Time
    - Hidden values (click to show): CommSend, CommMonoQ, TimeIncrement
Save CFGs (2 methods)
- Right click on timeline
- From the main window:
    1. Main Paraver window
    2. Select
    3. Save
CFGs distribution
- Paraver comes with many more included CFGs
Hints: a good place to start!
- Paraver suggests CFGs based on the information present in the trace
Cluster-based analysis
Use clustering analysis
- Run clustering

    laptop> ssh -Y <USER>@login.archer.ac.uk
    archer> cd $WORK/tools-material/clustering
    archer> /work/y14/shared/bsctools/clustering/2.6.6/bin/BurstClustering -d cluster.xml -i ../extrae/lulesh_27p.prv -o lulesh_27p_clustered.prv

- If you didn't get your own trace, use a prepared one from:

    archer> ls $WORK/tools-material/traces/lulesh_27p.prv
Cluster-based analysis
- Check the resulting scatter plot

    archer> gnuplot lulesh_27p_clustered.IPC.PAPI_TOT_INS.gnuplot

- Identify the main computing trends: Work (Y) vs. Speed (X)
- Look at the shape of the clusters
    - Variability in both axes indicates potential imbalances: variable work (Y), variable speed (X)
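The two axes of the scatter plot can be quantified per cluster. The sketch below assumes that work is measured as instructions (PAPI_TOT_INS) and speed as IPC (instructions per cycle), as the gnuplot file name suggests. The function names, the use of the relative spread as a variability measure and the sample bursts are illustrative assumptions, not part of the clustering tool.

```python
from statistics import mean, pstdev


def ipc(instructions, cycles):
    """Instructions per cycle of one computation burst (the speed axis)."""
    return instructions / cycles


def relative_spread(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


if __name__ == "__main__":
    # Hypothetical bursts of one cluster: (instructions, cycles).
    bursts = [(2.0e9, 1.0e9), (2.0e9, 1.6e9), (2.0e9, 1.25e9)]
    work = [instructions for instructions, _ in bursts]
    speed = [ipc(instructions, cycles) for instructions, cycles in bursts]
    print("work spread:", relative_spread(work))    # same work in every burst
    print("speed spread:", relative_spread(speed))  # bursts run at different IPC
```

In this made-up cluster every burst executes the same number of instructions but at a different IPC, which corresponds to a cluster stretched along the speed axis only.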
Correlating scatter plot and time distribution
- Copy the clustered trace to your laptop and look at it

    laptop> $HOME/paraver/bin/wxparaver <path-to>/lulesh_27p_clustered.prv

- Display the distribution of clusters over time
    - File → Load configuration → $HOME/paraver/cfgs/clustering/clusterID_window.cfg
- Variable work / speed + simultaneously @ different processes = imbalances