
1 Brief introduction to the wonders of performance analysis with BSC tools
tools@bsc.es
Judit Giménez, Juan González, Pedro González, Jesús Labarta, Germán Llort, Eloy Martínez, Xavier Pegenaute, Harald Servat

2 Outline
1. Performance tools
   – Extrae
   – Paraver
   – Dimemas
2. Analysis methodology
3. Case study
4. Advanced techniques (Performance analytics)
5. Hands-on session

3 The ways of debugging & performance analysis

    printf("Hellooooo!?");
    …
    printf("I'm here!");
    …
    printf("Roger that");

    gettimeofday(&start, NULL);
    /* Stuff that matters */
    gettimeofday(&end, NULL);
    printf("Took %ld seconds to get here\n", (long)(end.tv_sec - start.tv_sec));

(Figure: Paraver timelines of NAS BT with 1 task and with 32 tasks.) A picture is worth a thousand words.
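
For completeness, a self-contained version of the gettimeofday idiom above; plain C, with a placeholder loop standing in for the "stuff that matters":

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval start, end;
        gettimeofday(&start, NULL);

        /* Placeholder workload */
        volatile double acc = 0.0;
        for (long i = 0; i < 100000000L; i++)
            acc += i * 0.5;

        gettimeofday(&end, NULL);
        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_usec - start.tv_usec) / 1e6;
        printf("Took %.3f seconds to get here\n", elapsed);
        return 0;
    }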

4 Performance tools @ BSC
Since 1991. Based on traces: flexibility and detail.
Core tools (open-source):
– Trace generation: Extrae
– Trace analyzer: Paraver
– Message-passing simulator: Dimemas
Do not speculate about your code performance. LOOK AT IT.

5 Basic Workflow
Instrumentation (run-time): each application process runs with Extrae attached and emits time-stamped data; merging produces a trace (PRV, PCF and ROW files).
Analysis (post-mortem): the trace is examined with Paraver, Dimemas, and analytics modules (clustering, tracking, folding, …).

6 E X T R A E

7 Extrae features
– Parallel programming models: MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, Intel MIC…
– Performance counters, using the PAPI and PMAPI interfaces
– Link to source code:
  – Callstack at MPI routines
  – OpenMP outlined routines and their containers
  – Selected user functions
– Periodic samples
– User events via the Extrae API (see the sketch below)
No need to recompile / relink!
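
A minimal sketch of emitting user events through the Extrae API, following the Extrae_event() call and extrae_user_events.h header described in the Extrae User Guide; the event type and values here are arbitrary choices for this example:

    #include "extrae_user_events.h"

    #define SOLVER_PHASE 1000   /* arbitrary event type for this example */

    void solve_step(void)
    {
        Extrae_event(SOLVER_PHASE, 1);  /* value 1: entering the region */
        /* ... computation to be highlighted in the trace ... */
        Extrae_event(SOLVER_PHASE, 0);  /* value 0: leaving the region */
    }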

8 How does Extrae work?
– Dynamic instrumentation
  – Based on DynInst (developed by U.Wisconsin/U.Maryland)
  – Instrumentation in memory
  – Binary rewriting
– Symbol substitution through LD_PRELOAD
  – Specific libraries for each combination of runtimes: MPI, OpenMP, OpenMP+MPI, …
– Alternatives: static link (i.e., PMPI, the Extrae API)
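
To illustrate the interposition idea behind both the LD_PRELOAD and PMPI routes: a wrapper defines an MPI entry point, records what it wants, and forwards to the real implementation through the standard MPI profiling interface. A toy sketch, not Extrae's actual code:

    #include <mpi.h>
    #include <stdio.h>

    /* Our MPI_Send shadows the library's; PMPI_Send is the real entry
     * point guaranteed by the MPI profiling interface. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();                     /* timestamp entry */
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        fprintf(stderr, "MPI_Send to %d took %g s\n",
                dest, MPI_Wtime() - t0);
        return rc;
    }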

9 How to use Extrae?
1. Adapt the job submission script
2. Tune the XML configuration file
   – Examples distributed with Extrae: $EXTRAE_HOME/share/example
3. Run it!
For further reference check the Extrae User Guide:
– Also distributed with Extrae at $EXTRAE_HOME/share/doc
– http://www.bsc.es/computer-sciences/performance-tools/documentation
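
For orientation only, a stripped-down extrae.xml in the spirit of the shipped examples; the element names follow the samples under $EXTRAE_HOME/share/example, but treat this as an illustrative sketch, not a schema reference, and copy a real example file as your starting point:

    <?xml version="1.0"?>
    <trace enabled="yes" home="/apps/extrae" initial-mode="detail" type="paraver">
      <mpi enabled="yes">
        <counters enabled="yes"/>      <!-- read counters at MPI calls -->
      </mpi>
      <callers enabled="yes">
        <mpi enabled="yes">1-3</mpi>   <!-- keep 3 callstack levels -->
      </callers>
      <counters enabled="yes">
        <cpu enabled="yes" starting-set-distribution="1">
          <set enabled="yes" domain="all">PAPI_TOT_INS,PAPI_TOT_CYC</set>
        </cpu>
      </counters>
    </trace>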

10 Example: Extrae with DynInst

application.job (the only change vs. the uninstrumented script is launching trace.sh instead of the binary directly):

    #!/bin/bash
    …
    # @ total_tasks = 4
    # @ cpus_per_task = 1
    # @ tasks_per_node = 4
    …
    srun ./trace.sh ./my_MPI_binary

trace.sh:

    #!/bin/sh
    export EXTRAE_HOME=…
    export EXTRAE_CONFIG_FILE=extrae.xml
    source ${EXTRAE_HOME}/etc/extrae.sh
    # Run the desired program
    ${EXTRAE_HOME}/bin/extrae -v $*

11 Example: Extrae with LD_PRELOAD

application.job:

    #!/bin/bash
    …
    # @ total_tasks = 4
    # @ cpus_per_task = 1
    # @ tasks_per_node = 4
    …
    srun ./trace.sh ./my_MPI_binary

trace.sh:

    #!/bin/sh
    export EXTRAE_HOME=…
    export EXTRAE_CONFIG_FILE=extrae.xml
    export LD_PRELOAD=${EXTRAE_HOME}/lib/libmpitrace.so
    # Run the desired program
    $*

12 LD_PRELOAD library selection
Choose depending on the application type:

    Library               Serial  MPI  OpenMP  pthread  CUDA
    libseqtrace             X
    libmpitrace[f]¹                X
    libomptrace                          X
    libpttrace                                    X
    libcudatrace                                           X
    libompitrace[f]¹               X     X
    libptmpitrace[f]¹              X              X
    libcudampitrace[f]¹            X                       X

¹ include the suffix "f" for Fortran codes, e.g. libmpitracef

13 P A R A V E R

14 Multiple views of the same reality
– Zoom in & out
– Apply filters to the data
– Highlight different aspects

15 Paraver displays
A trace (PRV, PCF and ROW files) holds raw time-stamped performance data: MPI calls, OpenMP regions, user functions, peer-to-peer & collective communications, performance counters, samples…
Paraver displays it as timelines and as 2D / 3D tables (statistics).

16 Timelines: Description
A timeline plots objects (vertical axis) over time (horizontal axis).
– Process dimension: Thread (default), Process, Application, Workload
– Resource dimension: CPU, Node, System

17 Timelines: Semantics
Each window computes a function of time per object. Two types of functions:
– Categorical (state, user function, MPI call…): color encoding, one color per value
– Numerical (IPC, instructions, cache misses, computation duration…): gradient encoding
  – Black (or background) for zero
  – From light green (Min) to dark blue (Max)
  – Values beyond the limits in yellow and orange
  – Function-line encoding as an alternative

18 Timelines: Deriving
Windows can be combined: basic metrics (e.g. an MPI call window and an instructions window) feed derived metrics, and derived metrics feed models.
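
As a concrete instance of a derived metric, IPC is just one counter window divided by another, burst by burst. A sketch of that arithmetic with made-up counter values (illustrative, not Paraver code):

    #include <stdio.h>

    /* Derived metric: IPC = instructions / cycles, per computation burst. */
    int main(void)
    {
        long long instructions[] = { 400000000LL, 150000000LL, 390000000LL };
        long long cycles[]       = { 250000000LL, 300000000LL, 260000000LL };
        int nbursts = 3;

        for (int i = 0; i < nbursts; i++)
            printf("burst %d: IPC = %.2f\n", i,
                   (double)instructions[i] / (double)cycles[i]);
        return 0;
    }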

19 From timelines to tables
An MPI calls timeline aggregates into an MPI calls profile; a computation duration timeline aggregates into a computation duration histogram.

20 Analyzing variability through histograms and timelines
(Figure: histograms and timelines of Useful Duration, Instructions, IPC and L2 miss ratio for the same run.)

21 Analyzing variability through histograms and timelines
By the way: six months later…
(Figure: the same views, Useful Duration, Instructions, IPC and L2 miss ratio, on a later run.)

22 Tables: back to timelines
Where in the timeline do certain values appear?
– e.g. what is the time distribution of a given routine?
– e.g. when does a routine occur in the timeline?

23 Configuration files
CFGs are programmable Paraver windows: codify your formula once, use it forever!
Find many pre-built configurations at $PARAVER_HOME/cfgs:
– General: basic views (timelines), tables (2D/3D profiles), links to source code
– Counters_PAPI: hardware-counter derived metrics
  – Program: related to algorithm/compilation (instructions, floating-point ops…)
  – Architecture: related to execution on specific architectures (cache misses…)
  – Performance: metrics reporting rates per time (MFLOPS, MIPS, IPC…)
– MPI: calls, peer-to-peer, collectives, bandwidth…
– OpenMP: parallel functions, outlined routines, locks…
– … and many more!
(Examples shown: Useful Duration, Instructions executed, IPC, User functions, L2 miss ratio, Instructions committed, Cycles wasted per L2 miss, MPI calls, Comm. bandwidth, MPI calls profile, Instructions histogram, L2 miss ratio histogram, IPC histogram.)

24 D I M E M A S

25 Dimemas
Coarse-grain, trace-driven simulator for MPI codes.
– Doesn't model details: simple MPI protocols, abstract architecture
Objective: fast & simple "what-if" analyses.
Model components:
– Non-linear: resource allocation time (e.g. waiting for output links)
– Linear: resource usage time (e.g. transfer time)
(Figure: abstract architecture, nodes with CPUs and local memory connected by a network characterized by bandwidth B and latency L.)
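
The linear component follows the classic latency/bandwidth law. A sketch of that part of the arithmetic, using the abstract parameters L (latency) and BW (bandwidth) from the figure; the non-linear contention for links is deliberately left out:

    #include <stdio.h>

    /* Linear cost of moving a message: time = L + bytes / BW.
     * Contention (the non-linear component) is not modeled here. */
    double transfer_time(double latency_s, double bandwidth_Bps, double bytes)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    int main(void)
    {
        /* Example: 1 MB message, 5 us latency, 1 GB/s bandwidth */
        printf("predicted transfer time: %.6f s\n",
               transfer_time(5e-6, 1e9, 1e6));   /* ~0.001005 s */
        return 0;
    }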

26 Dimemas vs. Paraver
Paraver trace → what happens when
– Actual wall-clock time of events
Dimemas trace → sequence of resource demands
– Duration of computation bursts
– Type of communication, partners and bytes
Mutual feedback:
– Paraver traces can be converted into Dimemas traces (prv2dim)
– Dimemas generates as output Paraver traces of the simulated run

27 The impossible machine
Ideal network (BW = ∞, L = 0) → transfer times gone!
Unveils the intrinsic application behavior:
– Load balance problems?
– Serialization problems?
(Figure: GADGET, 256 tasks, Nehalem cluster; real run vs. ideal network. MPI regions such as Waitall, Sendrecv, Alltoall, Allgather + Sendrecv and Allreduce shrink, leaving essentially computation.)

28 Impact of architectural parameters
Simulating an ideal speedup of ALL computations by a constant factor:
– The more processes, the less overall speedup: the network becomes critical!
(Figure: GADGET at 64, 128 and 256 tasks.)

29 The potential of hybrid/accelerator parallelization
Speedup obtained when accelerating SELECTED code regions only (GADGET, 128 tasks).
(Figure: predicted speedup vs. per-region acceleration factor, for code regions covering 93.67%, 97.49% and 99.11% of the elapsed time.)

30 M E T H O D O L O G Y

31 Performance analysis tools objective
Help generate hypotheses. Help validate hypotheses.
Qualitatively & quantitatively.

32 First steps
– Parallel efficiency: % of time invested in computation
  – Identify sources of "inefficiency": load balance, communication / synchronization
– Serial efficiency: how far from peak performance?
  – IPC, correlated with other counters (e.g. cache misses)
– Scalability: code replication?
  – Total number of instructions
– Behavioral structure: variability?
See also the Paraver tutorial "Introduction to Paraver & Dimemas methodology".

33 Scaling Model: Parallel Efficiency
Measured with the MPI call profile:
– η (parallel efficiency): "time % doing useful work"
– CommEff (communication efficiency): accounts for the time % spent communicating
– LB (load balance): accounts for stalls waiting for other ranks
The factors multiply: η = LB × CommEff (see the sketch after slide 34).

34 Scaling Model: Communication Efficiency
But… there is another type of LB!
– µLB: stalls due to serializations
We measure µLB using Dimemas: on an ideal network the transfer efficiency Trf = 1, so the communication efficiency decomposes as CommEff = µLB × Trf.
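
A sketch of the whole model's arithmetic under the usual definitions (per-rank useful, i.e. non-MPI, time taken from the MPI call profile, plus the runtime of the Dimemas ideal-network run); all input numbers below are invented for illustration:

    #include <stdio.h>

    /* Scaling-model sketch, from per-rank useful (non-MPI) time:
     *   LB      = avg(useful) / max(useful)    load balance
     *   CommEff = max(useful) / runtime        communication efficiency
     *   eta     = LB * CommEff                 parallel efficiency
     * With the Dimemas ideal-network runtime, CommEff splits further:
     *   Trf     = runtime_ideal / runtime      transfer efficiency
     *   uLB     = CommEff / Trf                serialization ("micro" LB) */
    int main(void)
    {
        double useful[] = { 8.0, 9.5, 7.5, 9.0 };  /* seconds per rank */
        int nranks = 4;
        double runtime = 12.0;        /* real wall-clock time */
        double runtime_ideal = 10.5;  /* Dimemas, BW = inf, L = 0 */

        double sum = 0.0, max = 0.0;
        for (int i = 0; i < nranks; i++) {
            sum += useful[i];
            if (useful[i] > max) max = useful[i];
        }
        double LB      = (sum / nranks) / max;
        double CommEff = max / runtime;
        double Trf     = runtime_ideal / runtime;

        printf("LB=%.2f CommEff=%.2f eta=%.2f Trf=%.2f uLB=%.2f\n",
               LB, CommEff, LB * CommEff, Trf, CommEff / Trf);
        return 0;
    }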

35 C A S E S T U D Y

36 AVBP (CFD) – Strong scaling efficiency (12 – 960 cores)
(Figure: efficiency vs. number of cores.)

37 Identifying iterations
(Figure: per-rank timelines of MPI calls and of the duration of computations between MPI calls; the iterative structure shows up as a repeating pattern over time.)

38 Comparing different core counts
Showing 5 iterations (at different time scales): increasing MPI times & variability.
(Figure: MPI_Isend, MPI_Irecv, MPI_Waitall and MPI_Allreduce regions at 12, 384 and 960 ranks.)

39 Measuring Parallel Efficiency
REMEMBER: real numbers from the MPI calls profile (η, LB, Trf) & the ideal-network simulation (µLB).
(Figure: efficiency factors vs. number of cores.)

40 Measuring Computation Efficiency

41 Looking at the variability
(Figure: 384 tasks; per-rank timelines of computation duration and of instructions, showing short vs. large bursts across MPI ranks.)

42 Network sensitivity (Dimemas analysis)
(Figure: 384 tasks; predicted speedup with respect to the real run as a function of the CPU ratio.)

43 (A G L I M P S E O F) A D V A N C E D T E C H N I Q U E S

44 Identifying structure (Clustering analysis)
(Figure: scatter plot of IPC vs. instructions completed, grouping computation bursts into clusters.)

45 Evolution of behavior (Tracking analysis)

46 Instantaneous performance (Folding analysis)

47 H A N D S – O N

