Automatic trace analysis with Scalasca
Alexandru Calotoiu, German Research School for Simulation Sciences
(with content used with permission from tutorials by Markus Geimer, Brian Wylie/JSC)

Automatic trace analysis

Idea
– Automatic search for patterns of inefficient behaviour
– Classification of behaviour & quantification of significance
– Guaranteed to cover the entire event trace
– Quicker than manual/visual trace analysis
– Parallel replay analysis exploits available memory & processors to deliver scalability

[Diagram: a low-level event trace is transformed by the analysis into a high-level result, organized by call path, property, and location]

The Scalasca project: Overview

Project started in 2006
– Initial funding by Helmholtz Initiative & Networking Fund
– Many follow-up projects

Follow-up to pioneering KOJAK project (started 1998)
– Automatic pattern-based trace analysis

Now joint development of
– Jülich Supercomputing Centre
– German Research School for Simulation Sciences

The Scalasca project: Objective

Development of a scalable performance analysis toolset for the most popular parallel programming paradigms

Specifically targeting large-scale parallel applications
– such as those running on IBM BlueGene or Cray XT systems with one million or more processes/threads

Latest release in March 2013: Scalasca v1.4.3
Here: Scalasca v2.0 with Score-P support
– release targeted with Score-P v1.2 in summer

Scalasca 1.4 features

Open source, New BSD license

Portable
– IBM BlueGene, IBM SP & blade clusters, Cray XT, SGI Altix, Fujitsu FX10 & K computer, NEC SX, Intel Xeon Phi, Solaris & Linux clusters, ...

Supports parallel programming paradigms & languages
– MPI, OpenMP & hybrid MPI+OpenMP
– Fortran, C, C++

Integrated instrumentation, measurement & analysis toolset
– Runtime summarization (callpath profiling)
– Automatic event trace analysis

Scalasca 2.0 features

Open source, New BSD license

Uses Score-P instrumenter & measurement libraries
– Scalasca 2.0 core package focuses on trace-based analyses
– Generally same usage as Scalasca 1.4

Supports common data formats
– Reads event traces in OTF2 format
– Writes analysis reports in CUBE4 format

Still aims to be portable
– But not widely tested yet
– Known issues:
   – Unable to handle OTF2 traces containing CUDA events
   – Trace flush & pause event regions not handled correctly
   – OTF2 traces created with SIONlib not handled correctly

Scalasca trace analysis

Scalasca workflow
[Workflow diagram: source modules pass through the instrumenter (compiler/linker) to produce an instrumented executable; the instrumented target application runs with the measurement library (and optional hardware counters) and yields either a summary report or local event traces; a parallel wait-state search over the traces produces a wait-state report; report manipulation and an optimized measurement configuration close the loop. The reports answer: Which problem? Where in the program? Which process?]

Example: Wait at NxN

Time spent waiting in front of a synchronizing collective operation until the last process reaches the operation

Applies to: MPI_Allgather, MPI_Allgatherv, MPI_Alltoall, MPI_Reduce_scatter, MPI_Reduce_scatter_block, MPI_Allreduce

[Timeline diagram (time vs. location): all processes blocked in MPI_Allreduce until the last one enters the operation]
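To make the pattern concrete, here is a minimal MPI sketch in C (illustration only, not part of the tutorial or the BT-MZ benchmark; the 2-second delay on rank 0 is an arbitrary assumption). In a Scalasca trace experiment, the time the other ranks spend blocked in MPI_Allreduce would show up under the Wait at NxN metric.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        sleep(2);   /* artificial imbalance: rank 0 reaches the collective last */

    /* all other ranks wait here until rank 0 enters the operation */
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}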

Example: Late Broadcast

Waiting times if the destination processes of a collective 1-to-N operation enter the operation earlier than the source process (root)

Applies to: MPI_Bcast, MPI_Scatter, MPI_Scatterv

[Timeline diagram (time vs. location): the root enters MPI_Bcast late while the other processes already wait inside their MPI_Bcast calls]
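A corresponding minimal sketch in C (again illustration only; the delayed root and the 2-second sleep are assumptions): the non-root ranks enter MPI_Bcast first, and their waiting time would be classified as Late Broadcast.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        sleep(2);   /* the root (source) is delayed */

    /* destination ranks typically wait here until the root issues the broadcast */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}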

Example: Late Sender

Waiting time caused by a blocking receive operation posted earlier than the corresponding send

Applies to blocking as well as non-blocking communication

[Timeline diagrams (time vs. location): MPI_Recv or MPI_Irecv/MPI_Wait blocking until a later MPI_Send or MPI_Isend is issued]
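A minimal sketch in C of the blocking variant (illustration only; the rank numbering, message tag, and delay are assumptions, and at least two processes are required): rank 1 posts its receive long before rank 0 sends, so the blocking time in MPI_Recv would be attributed to Late Sender.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(2);   /* sender is delayed */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive is posted immediately and blocks until the send arrives */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}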

Hands-on: NPB-MZ-MPI / BT

Preamble: NPB-MZ-MPI / BT

Copy the code example archive to a directory of your choice:
  % cp /bubo/home/gu/calotoiu/ex.tar.gz .

Unpack the example code:
  % tar -xzvf ex.tar.gz

Add the tools to your path:
  % PATH=$PATH:/bubo/home/gu/calotoiu/scorepinstall/bin
  % PATH=$PATH:/bubo/home/gu/calotoiu/cubeinstall/bin
  % PATH=$PATH:/bubo/home/gu/calotoiu/sc2install/bin

Scalasca compatibility command: skin

Scalasca application instrumenter
– Provides compatibility with Scalasca 1.X
– Generally use Score-P instrumenter directly

% skin
Scalasca 2.0: application instrumenter using scorep
usage: skin [-v] [-comp] [-pdt] [-pomp] [-user]
  -comp={all|none|...}: routines to be instrumented by compiler
        (... custom instrumentation specification for compiler)
  -pdt:  process source files with PDT instrumenter
  -pomp: process source files for POMP directives
  -user: enable EPIK user instrumentation API macros in source code
  -v:    enable verbose commentary when instrumenting
  --*:   options to pass to Score-P instrumenter

Scalasca convenience command: scan

Scalasca measurement collection & analysis nexus

% scan
Scalasca 2.0: measurement collection & analysis nexus
usage: scan {options} [launchcmd [launchargs]] target [targetargs]
where {options} may include:
  -h          Help: show this brief usage message and exit.
  -v          Verbose: increase verbosity.
  -n          Preview: show command(s) to be launched but don't execute.
  -q          Quiescent: execution with neither summarization nor tracing.
  -s          Summary: enable runtime summarization. [Default]
  -t          Tracing: enable trace collection and analysis.
  -a          Analyze: skip measurement to (re-)analyze an existing trace.
  -e exptdir  : Experiment archive to generate and/or analyze.
                (overrides default experiment archive title)
  -f filtfile : File specifying measurement filter.
  -l lockfile : File that blocks start of measurement.

Scalasca convenience command: square

Scalasca analysis report explorer

% square
Scalasca 2.0: analysis report explorer
usage: square [-v] [-s] [-f filtfile] [-F] <experiment archive | cube file>
  -F          : Force remapping of already existing reports
  -f filtfile : Use specified filter file when doing scoring
  -s          : Skip display and output textual score report
  -v          : Enable verbose mode

Automatic measurement configuration

scan configures the Score-P measurement by setting some environment variables automatically
– e.g., experiment title, profiling/tracing mode, filter file, …
– Precedence order:
   1. Command-line arguments
   2. Environment variables already set
   3. Automatically determined values

scan also includes consistency checks and prevents corrupting existing experiment directories

For tracing experiments, automatic parallel trace analysis is initiated after trace collection completes
– uses an identical launch configuration to that used for measurement (i.e., the same allocated compute resources)

NPB-MZ-MPI / BT Instrumentation

Edit config/make.def to adjust build configuration
– Modify specification of compiler/linker: MPIF77

# SITE- AND/OR PLATFORM-SPECIFIC DEFINITIONS
#
# Items in this file may need to be changed for each platform.
#

#
# The Fortran compiler used for MPI programs
#
#MPIF77 = mpif77

# Alternative variants to perform instrumentation
...
MPIF77 = scorep --user mpif77

# This links MPI Fortran programs; usually the same as ${MPIF77}
FLINK = $(MPIF77)
...

Uncomment the Score-P compiler wrapper specification
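As background on the --user flag in the wrapper above: it enables Score-P's manual user-instrumentation API in addition to automatic compiler instrumentation. The following is a hypothetical C sketch of such an annotation (the function, loop, and region name "compute" are invented for illustration; the BT-MZ sources are Fortran and rely on automatic instrumentation in this tutorial).

/* Hypothetical Score-P user-instrumentation example; requires building
 * with the "scorep --user" wrapper so the macros are active. */
#include <scorep/SCOREP_User.h>

void compute(int n)
{
    SCOREP_USER_REGION_DEFINE( region_handle )
    SCOREP_USER_REGION_BEGIN( region_handle, "compute",
                              SCOREP_USER_REGION_TYPE_COMMON )

    for (int i = 0; i < n; ++i) {
        /* ... application work measured as the "compute" region ... */
    }

    SCOREP_USER_REGION_END( region_handle )
}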

NPB-MZ-MPI / BT Instrumented Build

Return to root directory and clean up
Re-build the executable using the Score-P compiler wrapper

% make clean
% make bt-mz CLASS=W NPROCS=4
make: Entering directory 'BT-MZ'
cd ../sys; cc -o setparams setparams.c -lm
../sys/setparams bt-mz 4 W
scorep --user mpif77 -c -O3 -openmp bt.f
[...]
cd ../common; scorep --user mpif77 -c -O3 -fopenmp timers.f
scorep --user mpif77 -O3 -openmp -o ../bin.scorep/bt-mz_W.4 \
  bt.o initialize.o exact_solution.o exact_rhs.o set_constants.o \
  adi.o rhs.o zone_setup.o x_solve.o y_solve.o exch_qbc.o \
  solve_subs.o z_solve.o add.o error.o verify.o mpi_setup.o \
  ../common/print_results.o ../common/timers.o
Built executable ../bin.scorep/bt-mz_W.4
make: Leaving directory 'BT-MZ'

BT-MZ Summary Analysis Report Filtering

Report scoring with prospective filter listing 6 USR regions

% cat ../config/scorep.filt
SCOREP_REGION_NAMES_BEGIN EXCLUDE
  binvcrhs*
  matmul_sub*
  matvec_sub*
  exact_solution*
  binvrhs*
  lhs*init*
  timer_*
SCOREP_REGION_NAMES_END

% scorep-score -f ../config/scorep.filt scorep_bt-mz_C_8x8_sum/profile.cubex
Estimated aggregate size of event trace (total_tbc):       bytes
Estimated requirements for largest trace buffer (max_tbc): bytes
(hint: When tracing set SCOREP_TOTAL_MEMORY > max_tbc to avoid intermediate
 flushes or reduce requirements using a file listing names of USR regions
 to be filtered.)

(callouts on slide: 514 MB total memory, 64 MB per rank!)

BT-MZ summary measurement

Run the application using the Scalasca measurement collection & analysis nexus prefixed to launch command

Creates experiment directory ./scorep_bt-mz_W_4x4_sum

% cd bin.scorep
% OMP_NUM_THREADS=4 scan -f scorep.filt mpiexec -np 4 ./bt-mz_W.4
S=C=A=N: Scalasca 2.0 runtime summarization
S=C=A=N: ./scorep_bt-mz_W_4x4_sum experiment archive
S=C=A=N: Thu Sep 13 18:05: : Collect start
mpiexec -np 4 ./bt-mz_W.4

 NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark

 Number of zones: 8 x 8
 Iterations: 200  dt:
 Number of active processes: 4

[... More application output ...]

S=C=A=N: Thu Sep 13 18:05: : Collect done (status=0) 22s
S=C=A=N: ./scorep_bt-mz_W_4x4_sum complete.

copy / edit / llsubmit ../jobscripts/mds/run_scan.ll

BT-MZ summary analysis report examination

Score summary analysis report
Post-processing and interactive exploration with CUBE
– The post-processing derives additional metrics and generates a structured metric hierarchy

% square scorep_bt-mz_W_4x4_sum
INFO: Displaying ./scorep_bt-mz_W_4x4_sum/summary.cubex...
[GUI showing summary analysis report]

% square -s scorep_bt-mz_W_4x4_sum
INFO: Post-processing runtime summarization result...
INFO: Score report written to ./scorep_bt-mz_W_4x4_sum/scorep.score

Post-processed summary analysis report

[Screenshot: CUBE display in which base metrics are split into more specific metrics]

BT-MZ trace measurement collection...

To enable additional statistics and pattern instance tracking, set SCAN_ANALYZE_OPTS="-i -s"

Re-run the application using the Scalasca nexus with the "-t" flag

% export SCAN_ANALYZE_OPTS="-i -s"
% OMP_NUM_THREADS=4 scan -f scorep.filt -t mpiexec -np 4 ./bt-mz_W.4
S=C=A=N: Scalasca 2.0 trace collection and analysis
S=C=A=N: ./scorep_bt-mz_W_4x4_trace experiment archive
S=C=A=N: Thu Sep 13 18:05: : Collect start
mpiexec -np 4 ./bt-mz_W.4

 NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark

 Number of zones: 8 x 8
 Iterations: 200  dt:
 Number of active processes: 4

[... More application output ...]

S=C=A=N: Thu Sep 13 18:05: : Collect done (status=0) 19s
[.. continued ...]

BT-MZ trace measurement... analysis

Continues with automatic (parallel) analysis of trace files

S=C=A=N: Thu Sep 13 18:05: : Analyze start
mpiexec -np 4 scout.hyb -i -s ./scorep_bt-mz_W_4x4_trace/traces.otf2
SCOUT   Copyright (c) Forschungszentrum Juelich GmbH
        Copyright (c) German Research School for Simulation Sciences GmbH

Analyzing experiment archive ./scorep_bt-mz_W_4x4_trace/traces.otf2

Opening experiment archive ... done (0.002s).
Reading definition data   ... done (0.004s).
Reading event trace data  ... done (0.669s).
Preprocessing             ... done (0.975s).
Analyzing trace data      ... done (0.675s).
Writing analysis report   ... done (0.112s).

Max. memory usage         : MB
Total processing time     : 2.785s
S=C=A=N: Thu Sep 13 18:06: : Analyze done (status=0) 4s

BT-MZ trace analysis report exploration

Produces a trace analysis report in the experiment directory containing trace-based wait-state metrics

% square scorep_bt-mz_W_4x4_trace
INFO: Post-processing runtime summarization result...
INFO: Post-processing trace analysis report...
INFO: Displaying ./scorep_bt-mz_W_4x4_trace/trace.cubex...
[GUI showing trace analysis report]

Post-processed trace analysis report

[Screenshot: additional trace-based metrics appear in the metric hierarchy]

Online metric description

[Screenshot: the online metric description is accessed via the context menu]

Online metric description

[Screenshot: online metric description]

Pattern instance statistics

[Screenshot: pattern instance statistics are accessed via the context menu; clicking opens the statistics details]

Connect to Vampir trace browser

To investigate the most severe pattern instances, connect to a trace browser and select the trace file from the experiment directory

Show most severe pattern instances

Select “Max severity in trace browser” from context menu of call paths marked with a red frame

Investigate most severe instance in Vampir

Vampir will automatically zoom to the worst instance in multiple steps (i.e., undo zoom provides more context)

Further information

Scalable performance analysis of large-scale parallel applications
– toolset for scalable performance measurement & analysis of MPI, OpenMP & hybrid parallel applications
– supporting most popular HPC computer systems
– available under New BSD open-source license
– sources, documentation & publications: