Slide 1: Hybrid Performance Analysis in the TAU Performance System
Allen D. Malony (malony@cs.uoregon.edu)
Performance Research Laboratory, Department of Computer and Information Science, University of Oregon
http://www.cs.uoregon.edu/research/tau
Slide 2: Outline
39th Speedup Workshop, 2010
- Hybrid parallel programming and performance analysis
- TAU performance system
  - Instrumentation
  - Measurement
  - Analysis tools
- MPI support
- OpenMP support
- Hybrid support
- Conclusions and future work
Slide 3: SMP/Multi-core Clusters and Hybrid Programming
- Clusters of SMPs (with multi-core processors) motivate hybrid (mixed-mode) parallel programming
- Cluster of SMPs
  - Multiple processors per cluster node
  - Multi-core processors
- Heterogeneous cluster
  - Cluster of SMPs
  - Heterogeneous many-core devices per node
(Figure: cluster nodes, each with memory (M) and processors (P), joined by an interconnection network)
Slide 4: Hybrid Parallel Programming and Tools
- Multi-programming methods for hybrid execution: explicit / implicit
- Distributed-memory message passing: MPI
- Shared-memory multi-threading: pthreads, OpenMP, OpenCL, ...
- Implicit: UPC, CAF, GA
- What about tools? Performance debugging is difficult to integrate and often non-portable
(Photo: The Netherlands Bicycle Band, Zurich Police Band Festival, 2010)
Slide 5: Research and Tools
- A. Malony, B. Mohr, S. Shende, F. Wolf, "Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting," EWOMP 2001.
- B. Mohr, A. Malony, S. Shende, F. Wolf, "Design and Prototype of a Performance Tool Interface for OpenMP," The Journal of Supercomputing, 23:105–128, 2002.
- J. Cownie, J. DelSignore, B. de Supinski, K. Warren, "DMPL: An OpenMP DLL Debugging Interface," WOMPAT 2003, LNCS 2716:137–146, Springer, Heidelberg, 2003.
- K. Fuerlinger, M. Gerndt, "ompP: A Profiling Tool for OpenMP," IWOMP 2005/IWOMP 2006, LNCS 4315:15–23, Springer, Heidelberg, 2008.
- A. Morris, A. Malony, S. Shende, "Supporting Nested OpenMP Parallelism in the TAU Performance System," IWOMP 2005/IWOMP 2006, LNCS 4315:279–288, Springer, Heidelberg, 2008.
- V. Bui, O. Hernandez, B. Chapman, R. Kufrin, P. Gopalkrishnan, D. Tafti, "Towards an Implementation of the OpenMP Collector API," ParCo 2007.
- M. Itzkowitz, O. Mazurov, N. Copty, Y. Lin, "White Paper: An OpenMP Runtime API for Profiling," Technical Report, Sun Microsystems, Inc., 2007.
- OpenMP Architecture Review Board, "OpenMP Application Program Interface, Version 3.0," 2008, http://www.openmp.org/mp-documents/spec30.pdf.
- N. R. Tallent, J. Mellor-Crummey, "Effective Performance Measurement and Analysis of Multithreaded Applications," PPoPP 2009, ACM, New York, 2009.
- Y. Lin, O. Mazurov, "Providing Observability for OpenMP 3.0 Applications," IWOMP 2009, LNCS 5568:104–117, Springer-Verlag, 2009.
- Tools: TAU, Scalasca, Paraver, ompP, HPCToolkit, OpenUH, Sun, Intel, Cray, ...
Slide 6: TAU Performance System®
- Tuning and Analysis Utilities (18+ year project)
- Performance problem-solving framework for HPC: integrated, scalable, flexible, portable
- Targets all parallel programming / execution paradigms
- Integrated performance toolkit: instrumentation, measurement, analysis, visualization
- Widely ported performance profiling / tracing system
- Performance data management and data mining
- Open source (BSD-style license)
- Broad application use (NSF, DOE, DoD, ...)
- http://tau.uoregon.edu
Slide 7: General Target Computation Model in TAU
- Node: physically distinct shared-memory machine; message passing over the node interconnection network
- Context: distinct virtual memory space within a node
- Thread: execution threads (user/system) within a context
(Figure: physical view vs. model view — SMP nodes with memory, VM-space contexts, and threads, connected by an interconnection network carrying inter-node message communication)
Slide 8: TAU Performance System Components
(Architecture diagram: program analysis (PDT), parallel profile analysis (ParaProf), performance data management (PerfDMF), performance data mining (PerfExplorer), performance monitoring (TAUoverSupermon))
Slide 9: TAU Instrumentation / Measurement (architecture diagram)
Slide 10: TAU Analysis (architecture diagram)
Slide 11: ParaProf Profile Analysis Framework (architecture diagram)
Slide 12: TAU Instrumentation Approach
- Based on direct performance observation
  - Direct instrumentation of program (system) code with probes
  - Instrumentation invokes performance measurement
  - Event measurement: performance data, metadata, context
- Support for standard program events
  - Routines, classes, and templates
  - Statement-level blocks and loops
  - Begin/end (interval) events
- Support for user-defined events
  - Begin/end events specified by the user
  - Atomic events (e.g., size of memory allocated/freed)
  - Flexible selection of event statistics
- Provides static and dynamic events
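The interval and atomic events described above can be sketched as minimal probes. This is not TAU's actual API — all names here are illustrative — just a sketch of what direct observation records: an interval event accumulates time between begin/end calls, an atomic event keeps summary statistics of recorded values.

```c
/* Sketch of direct performance observation (hypothetical names, not
 * TAU's real API): interval events time a begin/end region; atomic
 * events record a value (e.g., bytes allocated) with statistics. */
#include <time.h>

typedef struct {
    double  total;   /* accumulated time in the region (seconds) */
    long    calls;   /* number of begin/end pairs                */
    clock_t start;   /* timestamp of the currently open interval */
} interval_event;

typedef struct {
    double sum, min, max;   /* summary statistics of recorded values */
    long   count;
} atomic_event;

void interval_begin(interval_event *e) { e->start = clock(); }

void interval_end(interval_event *e) {
    e->total += (double)(clock() - e->start) / CLOCKS_PER_SEC;
    e->calls++;
}

void atomic_record(atomic_event *e, double value) {
    if (e->count == 0 || value < e->min) e->min = value;
    if (e->count == 0 || value > e->max) e->max = value;
    e->sum += value;
    e->count++;
}
```

A real measurement layer would also attach metadata and calling context to each event, as the slide notes.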
Slide 13: Automatic Source Instrumentation
Application source → TAU source analyzer → parsed program → tau_instrumentor (with instrumentation specification file) → instrumented source
Slide 14: MPI Instrumentation and Measurement
- Uses the standard MPI profiling interface (PMPI)
  - Provides a name-shifted interface via weak bindings: MPI_Send → PMPI_Send
- Interpose TAU's MPI wrapper library: replace -lmpi with -lTauMpi -lpmpi -lmpi
  - No change to the source code; just re-link the application to generate performance data
- Alternatively, preload the TAU MPI library (LD_PRELOAD on Linux): no re-compilation or re-linking at all
- TAU captures profiles and traces of communication, including message sizes, source/destination, ...
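The PMPI interposition pattern can be sketched without an MPI installation. The signatures below are simplified stand-ins (a real wrapper uses the mpi.h prototypes and forwards to the PMPI_ symbols the MPI library exports); the point is the shape: the wrapper records measurements, then forwards to the name-shifted entry point.

```c
/* Sketch of PMPI name-shifted interposition. Simplified, hypothetical
 * signatures; PMPI_Send_sketch stands in for the MPI library's real
 * name-shifted entry point. */
#include <stddef.h>

static long sends = 0;        /* wrapper-collected statistics */
static long bytes_sent = 0;

/* Stand-in for the library-provided PMPI_Send. */
int PMPI_Send_sketch(const void *buf, int count, int elem_size, int dest) {
    (void)buf; (void)dest;
    return count >= 0 && elem_size >= 0 ? 0 : -1;
}

/* The wrapper that an application's MPI_Send call resolves to when
 * the tool library is interposed: measure, then forward. */
int MPI_Send_sketch(const void *buf, int count, int elem_size, int dest) {
    sends++;
    bytes_sent += (long)count * elem_size;
    return PMPI_Send_sketch(buf, count, elem_size, dest);
}
```

Because interposition happens at link or load time, the application source never changes — exactly the property the slide highlights.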
Slide 15: PFLOTRAN Profile (Exclusive, 16,380 cores)
(Screenshot: TAU ParaProf profile analyzer; prominent events include MPI_Allreduce, MPI_Waitany, KSPSolve, oursnesjacobian)
Slide 16: ParaProf 3D Full Profile (Exclusive)
(Screenshot: MPI_Allreduce dominates)
Slide 17: ParaProf 3D Full Profile (minus MPI_Allreduce)
(Screenshot)
Slide 18: OpenMP Instrumentation with POMP / OPARI
- POMP: profiling interface for OpenMP
  - POMP-1 specification (FZJ, UO) (EWOMP '01)
  - POMP-2 (FZJ, UO, Pallas, Intel) (EWOMP '02)
  - The measurement tool implements the POMP library
- OPARI: OpenMP Pragma And Region Instrumentor
  - Source rewriter that inserts POMP calls around OpenMP constructs and API functions
  - Supports C and C++, Fortran 77 and Fortran 90, OpenMP 2.x
  - Scalasca and TAU provide POMP implementations
  - Preserves source code information (#line, file)
Slide 19: OpenMP Event Model
- What events are necessary to observe performance?
- An OpenMP thread executes on behalf of an OpenMP task inside an OpenMP parallel region; a tool needs this context to relate performance information to the OpenMP execution model
- OpenMP constructs and directives/pragmas: Enter/Exit around the OpenMP construct, plus Begin/End around the associated body
- OpenMP API calls: Enter/Exit events around the omp_set_*_lock() functions
- User functions and regions
Slide 20: OpenMP Directive Instrumentation (POMP)
- Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives
  - NAME: name of the OpenMP construct
  - TYPE: fork/join mark a change in the degree of parallelism; enter/exit flag entering/exiting an OpenMP construct; begin/end mark the start/end of construct bodies
  - d: context descriptor
- To observe the implicit barrier at DO, SECTIONS, WORKSHARE, and SINGLE constructs: add NOWAIT to the construct and make the barrier explicit
Slide 21: !$OMP PARALLEL DO Instrumentation

Original:

  !$OMP PARALLEL DO clauses...
    do loop
  !$OMP END PARALLEL DO

Instrumented (OPARI rewrite):

        call pomp_parallel_fork(d)
  !$OMP PARALLEL other-clauses...
        call pomp_parallel_begin(d)
        call pomp_do_enter(d)
  !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
    do loop
  !$OMP END DO NOWAIT
        call pomp_barrier_enter(d)
  !$OMP BARRIER
        call pomp_barrier_exit(d)
        call pomp_do_exit(d)
        call pomp_parallel_end(d)
  !$OMP END PARALLEL
        call pomp_parallel_join(d)
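The PARALLEL DO rewrite on this slide can be mimicked in C with stand-in pomp_* routines that simply count events, so the fork/begin/end/join pairing can be checked. All names here are illustrative, not TAU's or OPARI's real symbols; compiled without OpenMP the pragmas are ignored and the sketch runs serially.

```c
/* Stand-in POMP measurement routines: each one just counts its event
 * (atomically, in case the sketch is compiled with OpenMP enabled). */
typedef struct { int fork, join, begin, end, do_enter, do_exit; } pomp_counts;

static pomp_counts pc;

void pomp_parallel_fork(int d)  { (void)d; pc.fork++; }
void pomp_parallel_join(int d)  { (void)d; pc.join++; }
void pomp_parallel_begin(int d) {
    (void)d;
    #pragma omp atomic
    pc.begin++;
}
void pomp_parallel_end(int d) {
    (void)d;
    #pragma omp atomic
    pc.end++;
}
void pomp_do_enter(int d) {
    (void)d;
    #pragma omp atomic
    pc.do_enter++;
}
void pomp_do_exit(int d) {
    (void)d;
    #pragma omp atomic
    pc.do_exit++;
}

/* C analogue of the instrumented PARALLEL DO: the master thread marks
 * fork/join; every team thread marks begin/end around the body. */
double instrumented_sum(const double *a, int n) {
    double s = 0.0;
    pomp_parallel_fork(1);
    #pragma omp parallel reduction(+:s)
    {
        pomp_parallel_begin(1);
        pomp_do_enter(1);
        #pragma omp for
        for (int i = 0; i < n; i++)
            s += a[i];
        pomp_do_exit(1);
        pomp_parallel_end(1);
    }
    pomp_parallel_join(1);
    return s;
}
```

Note the asymmetry the slide's Fortran shows: fork/join fire once on the master, while begin/end and do enter/exit fire once per team thread.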
Slide 22: OpenMP API Instrumentation
- Transform omp_#_lock() → pomp_#_lock() and omp_#_nest_lock() → pomp_#_nest_lock(), where # = init | destroy | set | unset | test
- The pomp version calls the omp version internally, so it can take performance measurements before and after the call
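A pomp-style lock wrapper following this transform might look like the sketch below: time the wait, then forward to the omp call. The names are illustrative; without OpenMP the sketch falls back to a trivial stub lock so it still compiles and runs serially.

```c
/* Sketch of a pomp_set_lock-style wrapper (hypothetical names).
 * Measures how long the thread waits to acquire the lock, then
 * forwards to the real omp_set_lock when OpenMP is available. */
#include <time.h>

#ifdef _OPENMP
#include <omp.h>
typedef omp_lock_t sk_lock_t;
static void sk_init_lock(sk_lock_t *l)    { omp_init_lock(l); }
static void sk_omp_set_lock(sk_lock_t *l) { omp_set_lock(l); }
#else
typedef int sk_lock_t;                     /* serial stub lock */
static void sk_init_lock(sk_lock_t *l)    { *l = 0; }
static void sk_omp_set_lock(sk_lock_t *l) { *l = 1; }
#endif

static double lock_wait_time = 0.0;        /* accumulated wait (s) */
static long   lock_acquires  = 0;

void pomp_set_lock_sketch(sk_lock_t *l) {
    clock_t t0 = clock();
    sk_omp_set_lock(l);                    /* forward to omp call  */
    lock_wait_time += (double)(clock() - t0) / CLOCKS_PER_SEC;
    lock_acquires++;
}
```

This is exactly the "measure before and after" pattern the slide describes, applied at the API-call boundary rather than around a directive.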
Slide 23: POMP Context Descriptors
Describe execution contexts through a context descriptor:

  typedef struct ompregdescr {
      char* name;                      /* construct        */
      char* sub_name;                  /* region name      */
      int   num_sections;
      char* filename;                  /* source filename  */
      int   begin_line1, begin_lineN;  /* begin line #s    */
      int   end_line1, end_lineN;      /* end line #s      */
      WORD  data[4];                   /* performance data */
      struct ompregdescr* next;
  } OMPRegDescr;

Generate context descriptors in global static memory and pass their addresses to the POMP functions:

  OMPRegDescr rd42675 =
      { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };
Slide 24: Support for OpenMP Nested Parallelism in TAU
- The hierarchical nature of nested parallelism poses problems (nesting is enabled via OMP_NESTED or omp_set_nested())
- Performance measurement requires knowledge of thread context to correctly attribute data to a thread's execution
- How to determine the thread nesting level?
  - omp_get_thread_num() cannot uniquely identify a thread: it returns only the logical thread ID within the team
  - Nesting context is not available to the tool interface
  - Static analysis and instrumentation (à la OPARI) do not help
  - A runtime solution is required to identify thread and nesting
- No performance tools interface specification existed (OpenMP v2.x)
Slide 25: Portable Approach in TAU (IWOMP 2006)
- Need a new scheme for thread identification
- Leverage the OpenMP directive #pragma omp threadprivate()
  - Creates persistent data for each thread in a parallel region
  - Values do not persist between parallel regions, but the memory locations of threadprivate variables are unique
- A single threadprivate variable is initialized in the TAU library and used to register threads not seen before during execution
- Per-region thread IDs are unique, but may change between parallel regions; nesting depth and team identifier are lost
- A. Morris, A. Malony, S. Shende, "Supporting Nested OpenMP Parallelism in the TAU Performance System," IWOMP 2006.
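The threadprivate registration scheme can be sketched as follows (illustrative names, not TAU's internals): each thread carries a private ID initialized to "not yet seen"; the first instrumented event it executes registers it under the next free tool-side ID. Serial builds ignore the pragmas, leaving a one-thread team, so the sketch runs either way.

```c
/* Sketch of threadprivate-based thread registration: the first time
 * a thread reaches an instrumented event, it claims the next free
 * tool-side thread ID. */
static int tau_tid = -1;          /* -1 = this thread not yet seen  */
#pragma omp threadprivate(tau_tid)

static int next_tid = 0;          /* shared registration counter    */

int tau_get_tid(void) {
    if (tau_tid == -1) {          /* first event on this thread     */
        #pragma omp critical(tau_register)
        tau_tid = next_tid++;
    }
    return tau_tid;
}
```

As the slide notes, the resulting IDs are unique within a run but carry no nesting depth or team identifier — that information is simply not recoverable from a single threadprivate variable.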
Slide 26: TFS Case Study (IWOMP 2006)
- TFS computational fluid dynamics code (RWTH Aachen)
- Parallelized using ParaWise to generate:
  - intra-block parallelization over a single block dimension
  - inter-block parallelization over blocks
  - multi-level (hybrid) parallelization with nested OpenMP (intra + inter)
- Instrumentation approach
  - OPARI for OpenMP constructs and regions
  - PDT for source-level information about routine names and their entry and exit locations
- .TAU application event: started at thread execution start and stopped at execution end; shows roughly how much time each thread spends idle
Slide 27: TFS Multi-level Profile (Exclusive: Mean, Flat)
(Screenshot: 90-second run; annotations mark idle time, 3 secs, 22 secs)
Slide 28: TFS Multi-level Profile (Exclusive)
(Screenshot)
Slide 29: TFS Callpath Profile (Exclusive, Thread 0)
(Screenshot)
Slide 30: TFS Callgraph Profile (Exclusive, Thread 0)
(Screenshot)
Slide 31: Callgraph for Each TFS Thread
(Screenshot: thread 0 is the main thread)
Slide 32: Hybrid Parallel Application Case Studies
- 2D Stommel model of ocean circulation
  - Jacobi iteration, 5-point stencil
  - Timothy Kaiser (San Diego Supercomputer Center)
- GTC: particle-in-cell simulation of fusion turbulence
  - 128 cores: 32 MPI processes × 4 OpenMP threads
  - Phases assigned to iterations
- Performance instrumentation: OpenMP with OPARI, MPI with PMPI
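The Jacobi 5-point stencil at the heart of the Stommel case study can be sketched as below. This is a minimal illustrative kernel, not the Stommel code itself: the real application splits the grid across MPI ranks and threads each sweep with OpenMP (the pragma here is ignored in a serial build).

```c
/* One Jacobi sweep of a 5-point stencil on an n x n grid; boundary
 * rows/columns are held fixed. Returns the largest pointwise change,
 * which drives the convergence test. */
#include <string.h>

double jacobi_sweep(double *u, double *unew, int n) {
    double diff = 0.0;
    memcpy(unew, u, (size_t)n * n * sizeof(double));  /* keep boundary */
    #pragma omp parallel for reduction(max:diff)
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++) {
            /* average of the four neighbors (5-point Laplace stencil) */
            unew[i*n + j] = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j]
                                  + u[i*n + j - 1] + u[i*n + j + 1]);
            double d = unew[i*n + j] - u[i*n + j];
            if (d < 0) d = -d;
            if (d > diff) diff = d;
        }
    return diff;
}
```

In the hybrid version each MPI rank owns a strip of the grid and exchanges halo rows with its neighbors between sweeps; that communication is exactly what the TAU traces on the following slides pair up thread by thread.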
Slide 33: OpenMP + MPI Ocean Modeling (Profile)
Configuration:
  % configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/libo
(Screenshot: integrated OpenMP + MPI events; floating-point instruction counts)
Slide 34: OpenMP + MPI Ocean Modeling (Trace)
(Screenshot: integrated OpenMP + MPI events; thread message pairing)
Slide 35: GTC Full Profile (32×4, Exclusive Time)
- Overall visual impression of parallel performance
- Shows the hybrid structure and interesting behavior
Slide 36: GTC Full Profile (32×4, Exclusive Time)
- Stacked view shows a per-event comparison
- More clearly highlights OpenMP/MPI differences
Slide 37: GTC MPI-only Performance (128 cores)
(Screenshot)
Slide 38: GTC OpenMP-only Performance (128 cores)
(Screenshot: height = floating-point count, color = time)
Slide 39: GTC Phase Profiling (32×4)
(Screenshot annotations: increasing phase execution time, decreasing flops rate, declining cache performance)
Slide 40: GTC Trace with Jumpshot (Argonne) (128 cores)
- Full trace visualization with collapsed process view
- Communication between computation phases highlighted
Slide 41: GTC Process Trace Visualization (128 cores)
- Zoomed in to show phase structure
- Still viewing the collapsed process view
Slide 42: GTC OpenMP Thread Visualization (128 cores)
- Zoomed view showing OpenMP thread performance
- Cumulative Exclusion Ratio view in Jumpshot
Slide 43: NAS Multi-zone Hybrid Benchmarks
- BT-MZ and SP-MZ in the NPB suite
- Two levels of hybrid parallelism are exploited
  - OpenMP is applied to fine-grained intra-zone parallelism
  - MPI is used for coarse-grained inter-zone parallelism
- Load balancing is based on a bin-packing algorithm
  - Multiple zones are clustered into zone groups so that computational workload is evenly distributed over them
  - Zones are sorted by size and bin-packed into zone groups
  - Each zone group is then assigned to an MPI process
  - Exchanging boundary data within each time step requires MPI many-to-many communication
- The hybrid version is part of the standard NPB distribution
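The sort-and-pack load balancing described above can be sketched as a greedy bin-packing pass: sort zones by size, largest first, then place each zone into the currently lightest group. This is a sketch of the general technique, not the NPB reference implementation; each resulting group would map to one MPI process.

```c
/* Greedy bin-packing sketch of zone-group load balancing: sort zone
 * sizes descending, then assign each zone to the lightest group. */
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x < y) - (x > y);               /* descending order */
}

/* sizes[] is sorted in place; group[g] receives the total workload
 * assigned to group g. Returns the heaviest group's load. */
double pack_zones(double *sizes, int nzones, double *group, int ngroups) {
    double worst = 0.0;
    for (int g = 0; g < ngroups; g++) group[g] = 0.0;
    qsort(sizes, (size_t)nzones, sizeof *sizes, cmp_desc);
    for (int z = 0; z < nzones; z++) {
        int lightest = 0;                   /* find lightest group */
        for (int g = 1; g < ngroups; g++)
            if (group[g] < group[lightest]) lightest = g;
        group[lightest] += sizes[z];
        if (group[lightest] > worst) worst = group[lightest];
    }
    return worst;
}
```

Sorting first matters: placing large zones before small ones leaves the small zones to even out the residual imbalance, which is why the benchmark sorts by size before packing.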
Slide 44: BT-MZ and SP-MZ Traces
(Screenshots: BT-MZ and SP-MZ trace views)
Slide 45: Conclusion and Future Work
- TAU supports hybrid parallel performance analysis
  - Measurement of MPI+OpenMP programs at scale
  - "Portable" OpenMP instrumentation with OPARI
  - Support for nested parallelism
- Need better integration with OpenMP compilers / runtime systems
- Need to build better support for OpenMP 3.0
  - Task model events and context
  - Leverage a performance tools interface in OpenMP 3.0, extended with instrumentation and measurement
  - Incorporate event-based sampling (TAUebs)
Slide 46: Support Acknowledgements
- Department of Energy (DOE): Office of Science, ASC/NNSA
- Department of Defense (DoD): HPC Modernization Office (HPCMO)
- NSF: Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich
- Argonne National Laboratory
- Technical University Dresden
- ParaTools, Inc.
Slide 47: Hybrid Parallel Computation (Opus / HPF)
- Hybrid, hierarchical programming and execution model
  - Multi-threaded SMP and inter-node message passing
  - Integrated task and data parallelism
- Opus / HPF environment (University of Vienna)
  - Combined data (HPF) and task (Opus) parallelism
  - The HPF compiler produces Fortran 90 modules
  - Processes interoperate using the Opus runtime system: producer/consumer model, MPI and pthreads
- Performance is influenced at multiple software levels
- Performance analysis oriented to the programming model
Slide 48: TAU Tracing of an Opus / HPF Application
(Screenshot: multiple producers, multiple consumers)
Slide 49: Opus / HPF Execution Trace
(Screenshot: 4 nodes, 28 processes; process grouping in the Vampir visualization)
Slide 50: Hybrid Parallel Computation (Java + MPI)
- Multi-language applications and hybrid execution: Java, C, C++, Fortran; Java threads and MPI
- mpiJava (Syracuse, JavaGrande): Java wrapper package with JNI C bindings to MPI routines
- Integrates cross-language, cross-system performance technology
  - JVMPI and the TAU profiler agent
  - MPI profiling interface: link-time interposition (wrapper) library
- Cross-execution-mode uniformity and consistency
  - Invoke JVMPI control routines to control Java threads
  - Access thread information and expose it to the MPI interface
- "Performance Tools for Parallel Java Environments," Java Workshop, ICS 2000, May 2000.
Slide 51: TAU Java Instrumentation Architecture
(Diagram: Java program with TAU and mpiJava packages; JVMPI thread API and event notification feed TAU via JNI; the TAU wrapper sits on the MPI profiling interface in front of the native MPI library; results go to the profile DB)
Slide 52: Parallel Java Game of Life (Profile)
- mpiJava test case: 4 nodes, 28 threads
- Merged Java and MPI event profiles
(Screenshot: nodes 0–2; thread 4 executes all MPI routines)
Slide 53: Parallel Java Game of Life (Trace)
- Integrated event tracing with merged trace visualization
- Node process grouping, thread message pairing
- Vampir display with multi-level event grouping