1
The Vampir Performance Analysis Tool
Hans-Christian Hoppe
Pallas GmbH (Gesellschaft für Parallele Anwendungen und Systeme mbH)
Hermülheimer Straße 10, D-50321 Brühl, Germany
info@pallas.com
http://www.pallas.com
SCICOMP 2000 Tutorial, San Diego
2
Outline
– Performance tools for parallel programming
– Performance analysis for MPI
– The Vampir tool
– The Vampir roadmap
3
Why performance tools?
– CPUs and interconnects are getting faster all the time
– Compilers are improving
– “Abundance of computing power”
– Shouldn’t it be sufficient to just write an application and let the system do the rest?
4
Why performance tools?
In reality, there remain severe performance bottlenecks:
– slow memory access (instructions and data)
– cache consistency effects
– starvation of instruction units
– contention of interconnection systems
– adverse interaction with schedulers
5
Why performance tools?
The application programmer contributes the rest:
– excessive sequential sections
– bad load balance
– non-optimized communication patterns
– excessive synchronization
Performance analysis tools can
– help to diagnose system-level performance problems
– help to identify user-level performance bottlenecks
– assist users in improving their applications
6
Achieved performance vs. effort
[Chart: code performance plotted against programming effort, contrasting OpenMP and MPI; annotated regions for “code doesn’t work”, performance tools, KAP, and debuggers]
7
Performance aspects
Sequential performance
– Optimize memory accesses
– Optimize instruction sequences
Parallel performance
– Minimize sequential sections and replicated work
– Optimize load balance and communication
– Reduce synchronization
Parallel correctness
– Analyze results
– Analyze execution traces
– Compare parallel vs. sequential code
8
Kinds of performance tools
Sequential performance
– Profiling tools
– Compiler- and hardware-specific
Parallel performance
– Static code analysis
– Automatic parallelisation
– Counter-based profiling tools
– Event tracing tools (analysis, prediction)
Parallel correctness
– Static code analysis tools
– Trace-based verification
9
Vendor-specific vs. portable tools
Vendor-specific tools
– Superior support for platform specifics
– Proprietary data formats, APIs and user interfaces
– Very useful for sequential optimizations and vendor-specific parallel models
Portable tools
– Concentrate on the (portable) programming model
– Open data formats and APIs
– Useful for parallel optimizations and portable parallel models
Examples
– Guide (counter-based profiling)
– Vampir, Dimemas, Jumpshot (event trace analysis)
– Assure (trace-based code verification)
10
Performance tools – goals?
Holy grail
– Automatic parallelisation and optimization
– One code version for sequential and parallel
– One code version for all platforms
– Automatic code verification
– Automatic performance verification
– Automatic detection of performance problems
– Integration of performance analysis and parallelisation
11
Performance tools – reality?
Open problems
– Limited capabilities of automatic parallelisation
– Performance portability
  – portable sequential optimizations
  – portable parallel optimizations
– Code version maintenance
– Verification of MPI applications
– Scaling to large, hierarchical systems
12
MPI performance specifics
– Static SPMD model, weak synchronization
– No sequential sections – work is replicated or sequential communication patterns are used
– Data distribution defined by communication
– Work distribution determined by data distribution
– Explicit communication and synchronization
Optimization areas
– Load balancing (tune data distribution)
– Parallelize replicated work
– Tune communication patterns (see the sketch after this list)
– Reduce synchronization
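A minimal sketch of the “tune communication patterns” item (not from the tutorial; the left/right neighbour ranks, tags and halo-buffer layout are illustrative assumptions): a halo exchange posted with nonblocking MPI calls, so that send/receive ordering across processes cannot serialize or deadlock the exchange the way a chain of blocking MPI_Send/MPI_Recv calls can.

    #include <mpi.h>

    /* Halo exchange with left/right neighbours.  sendbuf and recvbuf
     * each hold 2*n doubles: [0..n) for the left side, [n..2n) for the
     * right side.  Receives are posted first and all four requests are
     * completed together, so no process waits on another's send order. */
    void exchange(double *sendbuf, double *recvbuf, int n,
                  int left, int right, MPI_Comm comm)
    {
        MPI_Request req[4];
        MPI_Status  st[4];

        MPI_Irecv(recvbuf,     n, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Irecv(recvbuf + n, n, MPI_DOUBLE, right, 1, comm, &req[1]);
        MPI_Isend(sendbuf + n, n, MPI_DOUBLE, right, 0, comm, &req[2]);
        MPI_Isend(sendbuf,     n, MPI_DOUBLE, left,  1, comm, &req[3]);
        MPI_Waitall(4, req, st);
    }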
13
Event-based MPI analysis
Record a trace of the application execution
– Calls to MPI and user routines
– MPI communication events
– Source locations
– Values of performance registers or program variables
From a trace, a performance analysis tool can show
– Protocol of execution over time
– Statistics for MPI routine execution
– Statistics for communication
– Dynamic calling tree
Important advantage
– Focus on any phase of the execution
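Tracing libraries typically collect these events through the MPI profiling interface, which guarantees that every MPI_* routine is also callable under the PMPI_* name, so a tracer can interpose its own wrappers. A minimal sketch — log_event() is a hypothetical stand-in for the real trace writer:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical recorder: a real tracer writes a timestamped binary
     * event record; printing is only for illustration. */
    static void log_event(const char *what, int count, int peer)
    {
        printf("%.6f %s count=%d peer=%d\n", MPI_Wtime(), what, count, peer);
    }

    /* The tracing library defines its own MPI_Send with the standard
     * signature; it records events around a call to the real PMPI_Send. */
    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        log_event("MPI_Send enter", count, dest);
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        log_event("MPI_Send exit", count, dest);
        return rc;
    }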
14
Vampirtrace details
Vampirtrace™
– Instrumentation library producing traces for Vampir and Dimemas
– Supports MPI-1 (incl. collective operations) and MPI-I/O
– Exploits the MPI profiling interface
– Works with vendors’ MPI implementations
– API for user-level instrumentation (sketch below)
– Capability to filter for event subsets
Developed, productized and marketed by Pallas
Available for IBM SP, PE 3.x
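What user-level instrumentation looks like in practice — a sketch assuming the classic VT_symdef/VT_begin/VT_end entry points and VT.h header of the Vampirtrace API (check the product manual for the exact names and signatures; the symbol code and the "solver" activity group are arbitrary examples):

    #include <mpi.h>
    #include <VT.h>                          /* header name per manual   */

    #define SOLVE 1                          /* user-chosen symbol code  */

    static void solve(void) { /* ... user computation to be timed ... */ }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);              /* tracing hooks in here    */
        VT_symdef(SOLVE, "solve", "solver"); /* symbol + activity group  */

        VT_begin(SOLVE);                     /* enter-"solve" event      */
        solve();
        VT_end(SOLVE);                       /* leave-"solve" event      */

        MPI_Finalize();                      /* trace file written here  */
        return 0;
    }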
15
Vampir details
Vampir™
– Event-trace visualization tool
– Analyzes MPI and user routines
– Analyzes point-to-point, collective and MPI-I/O operations
– Focus on arbitrary execution phases
– Execution and communication statistics
– Filter processes, messages, and user/MPI routines
Jointly developed by TU Dresden and Pallas
Productized and marketed by Pallas
Available for IBM RS6000, AIX 4.2/AIX 4.3
16
Dimemas details
Dimemas
– Event-based performance prediction tool
– Parameterized machine model
  – CPU performance
  – Communication and network performance
– Predicts performance on the modeled platform
– What-if analysis to determine the influence of parameters (cost-model sketch below)
Jointly developed by UPC Barcelona and Pallas
Productized and marketed by Pallas
Available for IBM RS6000, AIX 4.2/AIX 4.3
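The heart of such a parameterized machine model is typically a simple analytic cost function; the linear latency/bandwidth form below shows the general idea only — the function and its constants are placeholders, not actual Dimemas parameters:

    /* Point-to-point transfer time under a linear machine model:
     * a fixed per-message latency plus size over bandwidth. */
    double transfer_time(double bytes, double latency_s, double bw_B_per_s)
    {
        return latency_s + bytes / bw_B_per_s;
    }

    /* e.g. transfer_time(1e6, 20e-6, 100e6): 20 us + 10 ms = ~10.02 ms;
     * varying latency_s/bw_B_per_s is the essence of what-if analysis. */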
17
Vampirtrace options
Filter events for
– Processes
– Time interval or record count
– Event type
Instrumentation (user routines, counters)
– portable: by hand
– some platforms (Fujitsu, Hitachi, NEC): by compiler
Limit memory use
– Spill data to disk, store all events
– Only store the n first/last events
18
Vampir main window
[Screenshot: Vampir 2.5 main window]
– Tracefile loading can be interrupted at any time
– Tracefile loading can be resumed
– Tracefile can be loaded starting at a specified time offset
– Tracefile can be re-written
19
Summary chart
Aggregated profiling information
– Execution time
– Number of calls
Both can be shown inclusive or exclusive of called routines.
20
Vampir state model
– User specifies activities and symbol grouping
– Look at all/any activities or all symbols
[Screenshot: summary chart with activities (Calculation, Tracing, MPI) grouping symbols (MPI_Send, MPI_Recv, MPI_Wait, ssor, exchange)]
21
Timeline display
To zoom, mark a region with the mouse.
22
© Pallas GmbH Timeline display – zoomed
23
Timeline display – contents
– Shows all selected processes
– Shows state changes (activity color)
– Shows messages, collective and MPI-I/O operations
– Can show a parallelism display at the bottom
24
Timeline display – message details
Click on a message line to open the message information window.
[Screenshot callouts: message send op, message receive op, message information]
25
Communication statistics
Message statistics for each process/node pair:
– Byte and message count
– min/max/avg message length and bandwidth
26
Message histograms
Message statistics by length, tag or communicator:
– Byte and message count
– min/max/avg bandwidth
27
Collective operations
– For each process: mark the operation locally
– Connect start/stop points by lines
[Screenshot callouts: start of op, data being sent, data being received, stop of op, connection lines]
28
Collective operations
Click on the collective operation display to
– see global timing info
– see local timing info
29
Collective operations
– Filter collective operations
– Change the display style
30
Collective operations statistics
Statistics for collective operations:
– operation counts, bytes sent/received
– transmission rates
[Screenshots: all collective operations vs. filtered for MPI_Gather only]
31
MPI-I/O operations
– I/O transfers are shown as lines
– Click on an I/O line to see detailed I/O information
32
MPI-I/O statistics
Statistics for MPI-I/O transfers, by file:
– Operation counts
– Bytes read/written, transmission rates
33
Activity chart
Profiling information for all processes
34
Global calling tree
Display for each symbol:
– Number of calls, min/max execution time
Fold/unfold or restrict to subtrees
35
Process-local displays
– Timeline (showing calling levels)
– Activity chart
– Calling tree (showing number of calls)
36
Other displays
– Parallelism display
– Pending Messages display
– Trace Comparison feature
  – compare different runs (scalability analysis)
  – compare different processes
37
Focus on a time interval
– Choose a time interval by zooming in the timeline display
– Enable the Show Timeline Portion option
– All statistics windows are updated for the selected interval
– Use this to focus on one application phase or iteration!
38
Effects of zooming
[Screenshots: one iteration selected; updated summary chart; updated message statistics]
39
Compare traces
Compare profiling information
– to check load balance (between processes)
– to evaluate scalability (different runs)
– to look at optimization effects (different code versions)
[Screenshot: comparison of processes 6 and 19, by routine]
40
Coupling Vampir and Dimemas
Actual program run vs. ideal communication
41
Vampir/Vampirtrace roadmap
Ongoing developments
– Scalability enhancements
– Functionality enhancements
– Instrumentation enhancements
Will first be available commercially on NEC and Compaq platforms
– Earth Simulator
– ASCI machines
PathForward developments for ASCI machines
42
Scalability challenges
Scalability in processor count
– ASCI-class machines have 1000s of processors
– High-end systems have 100s of processors
– Applications use most of them
Scalability in time
– Need to analyze actual production runs (hours/days)
Scalability in detail
– Record and analyze system-specific performance data
– Support for threaded and hybrid models
43
Scalability problems
Counter-based profiling tools scale reasonably well, but
– are severely limited in the level of detail
– can’t focus on parts of the application run
Event-based tools have problems
– Event traces get really large
– Display tools use huge amounts of memory
– Many displays do not scale
Example: Vampir tracefiles for NAS NPB-LU
– 128 processes: 3,000,000 records (120 MByte)
– 256 processes: 15,000,000 records (600 MByte)
– 512 processes: 150,000,000 records (6 GByte)
At roughly 40 bytes per record, trace volume here grows fifty-fold while the process count only quadruples.
44
Threaded programming models
Enhance Vampir to display
– Thread fork/join
– Thread synchronization
– A timeline per thread, or threads aggregated into a single timeline
– Subroutine/code block execution for each thread
Create an instrumentation library for thread packages
Integrate instrumentation capability into OpenMP systems
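In source terms, the events such thread support would have to capture come from constructs like the plain OpenMP region below (an illustration of what would be traced, not Vampirtrace instrumentation):

    #include <omp.h>

    /* A thread-aware tracer would record a fork event at the start of
     * the parallel region, per-thread execution of the loop body, and
     * a join event at the implicit barrier at the end. */
    void scale(double *x, int n, double a)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }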
45
Cluster node display
Cluster information is already recorded
Enhance Vampir to
– show aggregate execution information per node
– show communication volume per node
46
Cluster timeline display
– Display node-level information
– Show communication volume within nodes
– Show communication between nodes as usual
– Allow nodes to be expanded into processes
– There may be more than two hierarchy levels...
47
© Pallas GmbH Cluster timeline display
48
Structured tracefile format
Subdivide the tracefile into frames
– Time intervals, thread/process/node subsets
Put frame data
– all in one file (as today)
– in multiple files (one per frame...)
– on a parallel filesystem (exploit parallelism)
Frame index file holds
– location of frame start/end
– frame statistic data for immediate display
– a “frame thumbnail”
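One way such a frame-index entry could be laid out — purely an illustration of the idea, not the actual format under development:

    /* Illustrative layout of one frame-index entry: where the frame's
     * data lives, what it covers, and summary statistics so a viewer
     * can draw something before any frame data is loaded.  This is an
     * assumption, not the real structured-tracefile format. */
    typedef struct {
        double start_time, end_time;   /* time interval of the frame  */
        int    first_process, nprocs;  /* process subset covered      */
        long   file_offset, nbytes;    /* frame location and size     */
        long   nevents;                /* per-frame statistics...     */
        double mpi_time, user_time;    /* ...for immediate display    */
    } FrameIndexEntry;                 /* plus a thumbnail per frame  */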
49
Structured tracefile format
Vampir loads the frame index
Displays immediately available
– Global profiling/communication statistics
– By-frame profiling/communication statistics
– Thumbnail timeline
User gets an overview of the application run
– Can load particular frame data
– Can navigate between frames
User can refine instrumentation/tracing
– Get detailed traces of interesting frames
50
Dynamic tracing control
What can be controlled
– Definition of frames
– Data to be recorded per frame
Control methods
– Instrumentation with the Vampirtrace API
– Binary instrumentation (atom) or use of a debugger
– Configuration file
– Interactive control agent (debugger)
Tracing the right data is an iterative process!
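For the API-based control method, one iteration of that process might look like the sketch below, assuming the VT_traceoff()/VT_traceon() calls of the Vampirtrace API (the choice of iterations 100-110 as the interesting window is an arbitrary example):

    #include <VT.h>                         /* header name per manual */

    static void timestep(int it) { /* ... application work ... */ }

    /* Record only a window of iterations instead of the whole run:
     * tracing is off for the warm-up, on for the phase of interest,
     * and off again afterwards. */
    void run(int nsteps)
    {
        int it;
        VT_traceoff();                      /* skip the warm-up       */
        for (it = 0; it < nsteps; it++) {
            if (it == 100) VT_traceon();    /* start of window        */
            if (it == 110) VT_traceoff();   /* end of window          */
            timestep(it);
        }
    }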
51
Cluster timeline display
For very large systems, one still can’t look at the complete system (too many nodes)
Display “interesting” nodes only
– regarding communication volume/delays
– regarding load imbalance
– regarding execution times of particular code modules
52
Scalable Vampir structure
[Architecture diagram: a scalable user interface (Vampir DC – user interaction, trace data analysis, display handling; runs on a workstation) exchanges data and control with scalable internals (Vampir SC – trace data processing, trace data I/O; runs on the parallel system and may exploit a parallel FS), both operating on structured trace data]
53
Access to Pallas tools
Download free evaluation copies from http://www.pallas.com