Germán Llort, Judit Giménez


1 BSC Tools Hands-On
Germán Llort, Judit Giménez
Barcelona Supercomputing Center

2 Getting a trace with Extrae

3 Extrae features
Platforms: Intel, Cray, BlueGene, Intel MIC, ARM, Android, Fujitsu Sparc…
Parallel programming models: MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, Java, Python...
Performance counters: using the PAPI interface
Link to source code:
  Callstack at MPI routines
  OpenMP outlined routines
  Selected user functions
Periodic samples
User events (Extrae API)
No need to recompile / relink!
BSCTools Hands-on

4 Extrae overheads
                                Average values    Archer
Event                           150-200 ns        160-170 ns
Event + PAPI                    750-1000 ns       800-950 ns
Event + callstack (1 level)     600 ns            540 ns
Event + callstack (6 levels)    1.9 us            1.5 us
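
To put these per-event costs in perspective, a rough estimate multiplies the number of instrumented events by the cost per event. The sketch below is only illustrative: the function name estimate_overhead_us and the event count are hypothetical, and 200 ns is taken from the upper end of the plain-event range above.

```shell
#!/bin/sh
# Rough estimate of the tracing overhead of one process (integer arithmetic).
# Usage: estimate_overhead_us EVENTS NS_PER_EVENT
# Prints the estimated overhead in microseconds.
estimate_overhead_us() {
    events=$1
    ns_per_event=$2
    echo $(( events * ns_per_event / 1000 ))
}

# Hypothetical example: 1,000,000 events at 200 ns each
estimate_overhead_us 1000000 200   # prints 200000 (i.e. 0.2 s)
```

The same function can be fed the "Event + PAPI" or callstack figures to see how much the extra information costs for a given event count.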

5 How does Extrae work?
Symbol substitution through LD_PRELOAD (recommended)
  Specific libraries for each combination of runtimes:
    MPI
    OpenMP
    OpenMP+MPI
Dynamic instrumentation
  Based on DynInst (developed by U.Wisconsin/U.Maryland)
  Instrumentation in memory
  Binary rewriting
Static link (i.e., PMPI, Extrae API)

6 Linking in Archer
Cray compilers link statically by default
How to make it dynamic?
  Add the flag -dynamic
  Enables tracing with the LD_PRELOAD method
    archer> [ cc | CC | ftn ] ... -dynamic

7 Problems with dynamic linking?
Link statically against the tracing library (+ dependencies)
  Only supports MPI instrumentation
  Insert before the actual MPI library
  Extrae will always intercept the MPI calls
  Don't set LD_PRELOAD
    LDFLAGS += \
      -L$EXTRAE_HOME/lib -lmpitrace \
      -L$BSCTOOLS_HOME/deps/binutils/2.24/lib -lbfd -liberty \
      -L$BSCTOOLS_HOME/deps/libunwind/1.1/lib -lunwind \
      -L/opt/cray/papi/ /lib -lpapi \
      -L/usr/lib64 -lxml \
      -lrt -lz -ldl

8 Using Extrae in 3 steps
1. Adapt your job submission script
2. Configure what to trace
     XML configuration file
     Example configurations at $EXTRAE_HOME/share/example
3. Run it!
For further reference, check the Extrae User Guide (also distributed with Extrae at $EXTRAE_HOME/share/doc)

9 Log in to Archer and copy the examples
    laptop> ssh -Y
    archer> cp -r /work/y14/shared/bsctools/tools-material $WORK
    archer> ls $WORK/tools-material
    ... apps/
    ... clustering/
    ... extrae/
    ... slides/
    ... traces/
The slides/ folder contains a copy of these slides

10 Step 1: Adapt the job script to load Extrae with LD_PRELOAD
    archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh

    #!/bin/bash --login
    # Request resources
    #PBS -N LULESH2
    #PBS -l select=2
    #PBS -l walltime=00:05:00
    #PBS -A y14

    # Change MPI version
    module unload PrgEnv-cray PrgEnv-gnu
    module load PrgEnv-intel

    export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
    cd ${PBS_O_WORKDIR}
    export OMP_NUM_THREADS=1

    # Run the program
    aprun -n 27 -S 7 ../apps/lulesh

11 Step 1: Adapt the job script to load Extrae with LD_PRELOAD
    archer> vi $WORK/tools-material/extrae/run_lulesh_27p.sh

    #!/bin/bash --login
    #PBS -N LULESH2
    #PBS -l select=2
    #PBS -l walltime=00:05:00
    #PBS -A y14

    module unload PrgEnv-cray PrgEnv-gnu
    module load PrgEnv-intel

    export PBS_O_WORKDIR=$(readlink -f $PBS_O_WORKDIR)
    cd ${PBS_O_WORKDIR}
    export OMP_NUM_THREADS=1

    # Activate Extrae during the run
    export TRACE_NAME=lulesh_27p.prv
    aprun -n 27 -S 7 ./trace.sh ../apps/lulesh

12 Step 1: Adapt the job script to load Extrae with LD_PRELOAD
    archer> vi $WORK/tools-material/extrae/trace.sh

    #!/bin/bash
    # Use the Extrae build that matches the MPI version of the application
    source /work/.../extrae/intel-mpich/etc/extrae.sh

    # Configure Extrae: select "what to trace"
    export EXTRAE_CONFIG_FILE=./extrae.xml

    # Load the tracing library for your type of application (choose C/Fortran)
    export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
    #export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitracef.so

    # Run the program; "$@" keeps arguments with spaces intact
    "$@"
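
The wrapper script works by setting up the environment and then running whatever command it receives as arguments. A small sketch of why the way those arguments are forwarded matters: the helper names count_args, forward_unquoted and forward_quoted are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Print how many arguments a command actually receives.
count_args() { echo $#; }

# Forward the arguments unquoted: they are re-split on whitespace.
forward_unquoted() { count_args $*; }

# Forward the arguments quoted: each one is passed through unchanged.
forward_quoted() { count_args "$@"; }

forward_unquoted ./app "input file.txt"   # prints 3
forward_quoted   ./app "input file.txt"   # prints 2
```

For a program such as lulesh that is launched without arguments containing spaces, both forms behave the same; the quoted form is simply the safer default.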

13 Step 1: Which tracing library?
Choose depending on the application type

Library                 Serial  MPI  OpenMP  pthread  CUDA
libseqtrace               x
libmpitrace[f] (1)               x
libomptrace                            x
libpttrace                                      x
libcudatrace                                           x
libompitrace[f] (1)              x     x
libptmpitrace[f] (1)             x              x
libcudampitrace[f] (1)           x                     x

(1) include suffix "f" in Fortran codes
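
The choice of library can be sketched as a small lookup. This is an illustration only, not part of Extrae: the function name pick_trace_lib and its argument spellings (serial, mpi, mpi+openmp, ...) are hypothetical, and the .so extension is assumed from the trace.sh example.

```shell
#!/bin/sh
# Usage: pick_trace_lib MODEL [LANG]
#   MODEL: serial | mpi | openmp | pthread | cuda |
#          mpi+openmp | mpi+pthread | mpi+cuda
#   LANG:  c (default) | fortran
# Prints the file name of the tracing library to preload.
pick_trace_lib() {
    model=$1
    lang=${2:-c}
    uses_mpi=no
    case $model in
        serial)      base=libseqtrace ;;
        openmp)      base=libomptrace ;;
        pthread)     base=libpttrace ;;
        cuda)        base=libcudatrace ;;
        mpi)         base=libmpitrace;     uses_mpi=yes ;;
        mpi+openmp)  base=libompitrace;    uses_mpi=yes ;;
        mpi+pthread) base=libptmpitrace;   uses_mpi=yes ;;
        mpi+cuda)    base=libcudampitrace; uses_mpi=yes ;;
        *) echo "unknown application type: $model" >&2; return 1 ;;
    esac
    # Only the MPI libraries take the "f" suffix for Fortran codes.
    suffix=""
    if [ "$uses_mpi" = yes ] && [ "$lang" = fortran ]; then
        suffix=f
    fi
    echo "${base}${suffix}.so"
}

pick_trace_lib mpi                  # prints libmpitrace.so
pick_trace_lib mpi+openmp fortran   # prints libompitracef.so
```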

14 Step 3: Run it!
Submit your job
    archer> cd $WORK/tools-material/extrae
    archer> qsub run_lulesh_27p.sh
Check the status of your job with: qstat -u $USER
Once the job has finished, the trace will be in the same folder: lulesh_27p.{pcf,prv,row} (3 files)
Any issue? Traces are already generated at $WORK/tools-material/traces
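
Since a trace is only usable when all three files are present, a quick check after the job ends can save a failed load later. A minimal sketch, assuming the trace basename is passed as an argument; check_trace is a hypothetical helper, not an Extrae or Paraver command.

```shell
#!/bin/sh
# Usage: check_trace BASENAME
# Succeeds only if BASENAME.prv, BASENAME.pcf and BASENAME.row all exist.
check_trace() {
    base=$1
    missing=0
    for ext in prv pcf row; do
        if [ ! -f "$base.$ext" ]; then
            echo "missing: $base.$ext" >&2
            missing=1
        fi
    done
    return $missing
}

if check_trace lulesh_27p; then
    echo "trace complete"
else
    echo "trace incomplete"
fi
```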

15 Step 2: Extrae XML configuration
    archer> vi $WORK/tools-material/extrae/extrae.xml

Trace the MPI calls (What's the program doing?)
    <mpi enabled="yes">
      <counters enabled="yes" />
    </mpi>
    <openmp enabled="yes">
      <locks enabled="no" />
    </openmp>
    <pthread enabled="no">
    </pthread>

Trace the call-stack (Where in my code?) Compile with debug! (-g)
    <callers enabled="yes">
      <mpi enabled="yes">1-3</mpi>
      <sampling enabled="no">1-5</sampling>
    </callers>

16 Step 2: Extrae XML configuration (II)
Select which HW counters are measured (How's the machine doing?)
    <counters enabled="yes">
      <cpu enabled="yes" starting-set-distribution="cyclic">
        <set enabled="yes" changeat-time="500000us" domain="all">
          PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_L1_DCM, PAPI_L3_TCM, PAPI_BR_INS, PAPI_L2_DCA
        </set>
        <set enabled="yes" changeat-time="500000us" domain="all">
          PAPI_TOT_INS, PAPI_TOT_CYC, PAPI_SR_INS, RESOURCE_STALLS:ROB, RESOURCE_STALLS:RS
        </set>
        <set … /set>
      </cpu>
      <network enabled="no" />
      <resource-usage enabled="no" />
      <memory-usage enabled="no" />
    </counters>

17 Step 2: Extrae XML configuration (III)
Trace buffer size (Flush/memory trade-off)
    <buffer enabled="yes">
      <size enabled="yes"> </size>
      <circular enabled="no" />
    </buffer>

Enable sampling (Want more details?)
    <sampling enabled="no" type="default" period="50m" variability="10m" />

Automatic post-processing to generate the Paraver trace
    <merge enabled="yes" synchronization="default" tree-fan-out="16" max-memory="512" joint-states="yes" keep-mpits="yes" sort-addresses="yes" overwrite="yes">
      $TRACE_NAME$
    </merge>

18 Installing Paraver & First analysis steps

19 Install Paraver on your laptop
Download Paraver (the packages are also available at /work/y14/shared/bsctools/tools-packages)
    laptop> scp shared/bsctools/tools-packages/<PACKAGE> $HOME
Pick your version:
    wxparaver win.zip
    wxparaver mac.zip
    wxparaver Linux_x86_64.tar.gz (64-bits)
    wxparaver Linux_i686.tar.gz (32-bits)

20 Install Paraver (II)
Download tutorials: Documentation -> Tutorial guidelines -> Download links
Also at /work/y14/shared/bsctools/tools-packages
    laptop> scp bsctools/tools-packages/paraver-tutorials tar.gz $HOME

21 Uncompress, rename & move
Uncompress both packages
Rename folders into "paraver" and "tutorials"
Drag "tutorials" folder into "paraver"
  Destination is… "Right click" -> Show Package Contents -> Contents -> Resources
Command-line (Linux):
    laptop> tar xf wxparaver linux-x86_64.tar.gz
    laptop> mv wxparaver linux-x86_64 paraver
    laptop> tar xf paraver-tutorials tar.gz
    laptop> mv paraver-tutorials paraver/tutorials

22 Check that everything works
Start Paraver
    laptop> $HOME/paraver/bin/wxparaver &
Check that tutorials are available
    Click on Help -> Tutorials
Remotely available in Archer
    laptop> ssh -Y
    archer> /work/y14/shared/bsctools/wxparaver/latest/bin/wxparaver

23 First steps of analysis
Copy the trace to your laptop (all 3 files: *.prv, *.pcf, *.row)
    laptop> scp ./
Load the trace
    Click on File -> Load Trace -> Browse to the *.prv file
Follow Tutorial #3: Introduction to Paraver and Dimemas methodology
    Click on Help -> Tutorials

24 Measure the parallel efficiency
Click on "mpi_stats.cfg"
Click on "Open Control Window"
Zoom to skip the initialization / finalization phases (drag & drop)
Right click -> Copy -> Time
Right click -> Paste -> Time
Values shown: Parallel efficiency, Comm efficiency, Load balance
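
How the three values relate can be sketched with a few lines of arithmetic. The slide does not spell out the formulas, so the definitions here are an assumption: load balance as average over maximum useful computation time, communication efficiency as maximum useful time over total runtime, and parallel efficiency as their product. The function name efficiencies and the sample numbers are hypothetical.

```shell
#!/bin/sh
# Usage: efficiencies RUNTIME T1 T2 ...
#   RUNTIME: total elapsed time of the analysed region
#   T1 T2 ...: useful computation time of each process (same unit)
efficiencies() {
    runtime=$1
    shift
    printf '%s\n' "$@" | awk -v runtime="$runtime" '
        { sum += $1; if ($1 > max) max = $1 }
        END {
            avg = sum / NR
            lb = avg / max        # load balance (assumed definition)
            ce = max / runtime    # communication efficiency (assumed definition)
            printf "load_balance=%.2f comm_eff=%.2f parallel_eff=%.2f\n", lb, ce, lb * ce
        }'
}

# Hypothetical numbers: 3 processes computing for 8, 6 and 10 s in a 12 s run
efficiencies 12 8 6 10
# prints: load_balance=0.80 comm_eff=0.83 parallel_eff=0.67
```

Under these definitions a low load balance points at uneven work between processes, while a low communication efficiency points at time lost in MPI even on the busiest process.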

25 Computation time and work distribution
Click on "2dh_usefulduration.cfg" (2nd link) -> Shows time computing
...and "2dh_useful_instructions.cfg" (3rd link) -> Shows amount of work
Zoom to skip the large burst from the initialization (by dragging and dropping)
Then…
  Performance imbalance (zig-zag)
  Work imbalance (zig-zag)

26 Where does this happen?
Performance imbalance & work imbalance at the same time -> Imbalance
Hints -> Callers -> Caller function
Go from the table to the timeline
  Select this area (by dragging and dropping), between the "Slow" and "Fast" ends
  Click on "Open Filtered Control Window"
  Right click -> Copy
  Right click -> Paste -> Time
  Right click -> Fit Semantic Scale -> Fit both
Zoom into 1 of the iterations (by dragging and dropping)
Hidden values (click to show)
Functions shown: CommSend, CommMonoQ, TimeIncrement

27 Save CFG's (2 methods)
Right click on timeline
1. Main Paraver window
2. Select
3. Save

28 CFG's distribution
Paraver comes with many more included CFG's

29 Hints: a good place to start!
Paraver suggests CFG's based on the information present in the trace

30 Cluster-based analysis

31 Use clustering analysis
Run clustering
    laptop> ssh -Y
    archer> cd $WORK/tools-material/clustering
    archer> /work/y14/shared/bsctools/clustering/2.6.6/bin/BurstClustering -d cluster.xml -i ../extrae/lulesh_27p.prv -o lulesh_27p_clustered.prv
If you didn't get your own trace, use a prepared one from:
    archer> ls $WORK/tools-material/traces/lulesh_27p.prv

32 Cluster-based analysis
Check the resulting scatter plot
    archer> gnuplot lulesh_27p_clustered.IPC.PAPI_TOT_INS.gnuplot
Identify main computing trends
  Work (Y) vs. Speed (X)
Look at the shape of the clusters
  Variability in both axes indicates potential imbalances
  Vertical spread: variable work; horizontal spread: variable speed

33 Correlating scatter plot and time distribution
Copy the clustered trace to your laptop and look at it
    laptop> $HOME/paraver/bin/wxparaver <path-to>/lulesh_27p_clustered.prv
Display the distribution of clusters over time
    File -> Load configuration -> $HOME/paraver/cfgs/clustering/clusterID_window.cfg
Variable work / speed + different processes = Imbalances

