Published by Hope Tyler. Modified over 9 years ago.
Intel Trace Collector and Trace Analyzer Evaluation Report
Hans Sherburne, Adam Leko
UPC Group, HCS Research Laboratory, University of Florida
Color encoding key: Blue: information; Red: negative note; Green: positive note
2 Basic Information
- Name: Intel Trace Collector, Intel Trace Analyzer
- Developer: Intel
- Current versions:
  - Intel Trace Collector 5.0.1.0
  - Intel Trace Analyzer 4.0.3.1
- Website: http://www.intel.com/software/products/cluster
- Contact: http://premier.intel.com
3 Intel Cluster Tools Overview
A toolkit for creating high-performance applications on Intel’s architectures (x86, IA64)
- Intel MPI Library
  - Intel’s implementation of MPI
- Intel Cluster Math Kernel Library
  - Contains several Intel-optimized math routines
  - Also has a version of ScaLAPACK
- Intel Trace Collector & Trace Analyzer
  - Represent the performance analysis portion of Intel Cluster Tools
  - The two are used in conjunction to analyze performance of parallel applications (mostly MPI):
    - Trace Collector: provides a method for instrumenting programs and recording performance data
    - Trace Analyzer: provides graphical representation of trace data from an STF trace file
  - Formerly known as Vampirtrace & Vampir
4 Trace Collector Overview
What can be traced:
- MPI applications can be traced automatically by linking against the profiling library
  - Records of MPI routine calls
  - Data describing communication (point-to-point and collective)
  - Hardware counter data, if available
  - Statistics: function calls, sent messages, collective operations (count, duration, bytes)
- User-level code can be traced through manual instrumentation using the ITC API
  - User-defined states
  - User-defined counters
- Non-MPI (distributed) applications can be traced
  - Use the same API calls as for instrumenting user code in MPI apps
- Binary instrumentation without recompilation is possible
  - Use the itcinstrument tool
  - Must use MPI or must explicitly initialize/finalize Trace Collector
- Java programs
5 Trace Collector Libraries
ITC offers four different libraries for creating trace files, each with different operating characteristics:
- libVT
  - Contains wrapper functions for automatic logging of MPI calls
  - Offers extended functionality through an API for logging of user-defined data
- libVTnull
  - Contains dummy versions of API calls
- libVTfs
  - Same functionality as libVT
  - Trace file writing is done via TCP sockets
  - In case of failure, trace data is not lost
- libVTcs
  - Similar to libVTfs in that it uses TCP sockets to write tracefiles
  - Does not automatically log MPI calls
  - Requires that a process be explicitly designated as the server that coordinates trace file creation
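The wrapper idea behind automatic logging can be sketched in a few lines. This is only an illustration of call interposition in Python, not how libVT is implemented: the names `traced`, `TRACE`, and `send` are hypothetical, and `send` is a stand-in rather than a real MPI routine.

```python
import time
from functools import wraps

# Hypothetical in-memory trace buffer: (function name, start time, end time).
TRACE = []

def traced(func):
    """Wrap func so every call is timed and logged, then forwarded unchanged."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the call even if the wrapped function raises.
            TRACE.append((func.__name__, start, time.perf_counter()))
    return wrapper

@traced
def send(dest, payload):
    """Stand-in for a communication call (not a real MPI routine)."""
    return len(payload)

sent = send(1, b"hello")
```

The caller's code is unchanged apart from which `send` it reaches, which is the property that lets a profiling library log calls without manual instrumentation.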
6 Structured Trace File Format (STF)
- Structured Trace File format is the default format for traces
- Data is divided into logical frames, which helps to partition data for large-scale programs with large traces (possibly GBs)
  - Time axis
  - Location axis
  - Type of data (state, collective operations, point-to-point messages, counter values, MPI-IO)
- Indexing allows for quick random access
- Uses multiple files
  - File division does not necessarily reflect frame division
  - Allows for parallelism in reading and writing
  - Documentation does not detail the inner workings
- Can be converted to single-file STF for ease of file handling and transmission
- No documentation is provided on how to actually construct STF trace files without using Trace Collector
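Since the STF internals are undocumented, the following is only a conceptual sketch of why frame partitioning helps with large traces: events are grouped into fixed-length time frames, and a query touches only the frames that overlap the requested window. The event layout and the functions `build_frames` and `load_window` are assumptions made for illustration and do not reflect the real file format.

```python
def build_frames(events, frame_length):
    """Group (timestamp, process, kind) events into fixed-length time frames."""
    frames = {}
    for event in sorted(events):
        frame_id = int(event[0] // frame_length)
        frames.setdefault(frame_id, []).append(event)
    return frames

def load_window(frames, frame_length, t_start, t_end):
    """Return events in [t_start, t_end], reading only the overlapping frames."""
    first = int(t_start // frame_length)
    last = int(t_end // frame_length)
    selected = []
    for frame_id in range(first, last + 1):
        for event in frames.get(frame_id, []):
            if t_start <= event[0] <= t_end:
                selected.append(event)
    return selected

# Made-up events for illustration only.
events = [
    (0.5, 0, "state"),
    (1.2, 1, "p2p"),
    (2.7, 0, "collective"),
    (3.1, 1, "state"),
]
frames = build_frames(events, frame_length=1.0)
window = load_window(frames, 1.0, t_start=1.0, t_end=2.9)
```

Here the frame id acts as the index: the lookup jumps straight to frames 1 and 2 and never reads frames 0 or 3.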
7 STF Utilities
- STF files can be manipulated using stftool and xstftool:
  - Extract various data
  - Manipulate frames and groupings
  - Convert STF files into AVT or XVT
- AVT
  - Format used by previous versions of Vampir
  - Should be understood by Trace Analyzer
  - Created by other existing tools
- XVT
  - Similar to AVT in syntax
  - Replaces integer descriptors with more easily understood titles
  - Combines all data in one file
- Alternatives are human and script readable
- No means is provided to facilitate importing the data into another tool
8 Trace Collector API
Intel Trace Collector offers an API to:
- Trace user code in detail
- Trace non-MPI distributed apps
Functions are defined to:
- Record user-defined states in the trace
- Record user-defined communication events in the trace
- Record source code locations for correlation in Intel Trace Analyzer
- Record user-defined counters in the trace
- Define process groupings used in Trace Analyzer
- Define frames (recommended to use config options instead)
- Turn tracing on and off during execution
- Enable tracing of multithreaded applications
- Initialize and finalize Intel Trace Collector (needed for non-MPI applications)
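To make the shape of such an instrumentation API concrete, here is a minimal sketch of a tracer that supports user-defined states, counters, and switching tracing on and off. The class `Tracer` and all of its method names are hypothetical; they mirror the capabilities listed above but are not the ITC API.

```python
import time

class Tracer:
    """Toy tracer with user-defined states, counters, and an on/off switch."""

    def __init__(self):
        self.enabled = True
        self.events = []        # (timestamp, kind, name, value)
        self._states = {}       # integer handle -> state name

    def define_state(self, name):
        """Register a state once and return a handle for begin/end calls."""
        handle = len(self._states)
        self._states[handle] = name
        return handle

    def begin(self, handle):
        self._log("begin", self._states[handle])

    def end(self, handle):
        self._log("end", self._states[handle])

    def counter(self, name, value):
        self._log("counter", name, value)

    def trace_off(self):
        self.enabled = False

    def trace_on(self):
        self.enabled = True

    def _log(self, kind, name, value=None):
        # Events are silently dropped while tracing is switched off.
        if self.enabled:
            self.events.append((time.perf_counter(), kind, name, value))

tracer = Tracer()
solve = tracer.define_state("solve")
tracer.begin(solve)
tracer.counter("iterations", 10)
tracer.trace_off()
tracer.counter("iterations", 20)   # not recorded: tracing is off
tracer.trace_on()
tracer.end(solve)
recorded = [(kind, name, value) for _, kind, name, value in tracer.events]
```

Defining a state up front and passing a handle afterwards keeps the per-event cost low, which is one plausible reason instrumentation APIs are often structured this way.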
9 Trace Collector Overhead
- All programs executed correctly when instrumented
- Benchmarks marked with a star had high variability in execution time
  - Readings with stars are probably not accurate
- In most cases overhead was less than 8%
- Wasn’t able to test overhead of hardware counter instrumentation
- However, trace file writing for class B LU with 32 processes took almost 20 minutes!
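The slide does not state how overhead was computed. A common definition, assumed here, is the relative increase in run time of the instrumented program over the uninstrumented one. The sketch below uses the median of repeated runs and reports a simple spread measure, since the slide notes that some benchmarks varied a lot between runs. The timings are made-up illustrative values, not measurements from the evaluation.

```python
from statistics import median

def overhead_percent(baseline_runs, instrumented_runs):
    """Relative slowdown in percent, using the median of each set of runs."""
    base = median(baseline_runs)
    if base <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (median(instrumented_runs) - base) / base

def relative_spread(runs):
    """(max - min) / median: a rough indicator of run-to-run variability."""
    return (max(runs) - min(runs)) / median(runs)

# Illustrative timings in seconds (hypothetical).
baseline = [100.0, 102.0, 98.0]
instrumented = [105.0, 107.0, 104.0]

overhead = overhead_percent(baseline, instrumented)
spread = relative_spread(baseline)
```

When the spread of the baseline is comparable to the computed overhead, the overhead figure is not meaningful, which matches the caution about the starred readings.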
10 Trace Analyzer
- Intel Trace Analyzer (ITA) is a visualization program
  - Reads STF tracefiles
  - Tracefiles from previous versions should also work
- ITA can display:
  - Event-based data (including messages)
  - Statistical data
  - Counter data, if it is contained in the trace
- Displays may represent a view of:
  - Multiple processes
    - Individual processes
    - Group of processes (depending on selected filtering options)
  - Single process
- Possible to configure the views in various ways
  - Activities / Symbols
  - Absolute Time / Scaled (percentage of total) Time
  - Number of processes displayed at once
  - Colors used for activities
11 Trace Analyzer (2)
- Data from a large trace file can be viewed in increments
  - Select the appropriate frames from the STF file
- Views may be linked to the visible portion of the zoomed timeline
- Pre-computed statistical data can be viewed without loading trace data
General Notes on ITA Interface
- Uses X-windows
- Is quite stable
- Provides good interface responsiveness
- Interface is intuitive (for the most part)
- ITA is not capable of automatic analysis of trace data
12 Trace Analyzer Views
- Summary Chart Display
  - Allows the user to see how much time is spent in MPI calls
- Timeline Display
  - Zoomable, scrollable timeline representation of program execution
[Figures: Summary Chart; Timeline Display]
13 Trace Analyzer Views (2)
- Summary Timeline
  - Timeline/histogram representation showing the number of processes in each activity per time bin
- Counter Timeline
  - Value-over-time representation (behavior depends on counter definition in trace)
[Figures: Summary Timeline; Counter Timeline]
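The quantity behind a summary timeline (processes per activity per time bin) can be sketched as follows. The interval layout and the function `summary_timeline` are assumptions for illustration; each bin is sampled at its midpoint, which is a simplification and not necessarily what Trace Analyzer does.

```python
from collections import Counter

def summary_timeline(intervals, bin_width, n_bins):
    """Count, per time bin, how many processes are in each activity.

    intervals: (process, activity, start, end) tuples. Each process is
    assumed to be in at most one activity at any given time.
    """
    bins = []
    for index in range(n_bins):
        midpoint = (index + 0.5) * bin_width
        counts = Counter(
            activity
            for _process, activity, start, end in intervals
            if start <= midpoint < end
        )
        bins.append(dict(counts))
    return bins

# Made-up trace of two processes alternating between user code and MPI.
intervals = [
    (0, "Application", 0.0, 2.0),
    (0, "MPI", 2.0, 4.0),
    (1, "Application", 0.0, 3.0),
    (1, "MPI", 3.0, 4.0),
]
timeline = summary_timeline(intervals, bin_width=1.0, n_bins=4)
```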
14 Trace Analyzer Views (3)
- Message Statistics Display
  - Message data to/from each process (count, length, rate, duration)
- Process Profile Display
  - Per-process data regarding activities
[Figures: Message Statistics; Process Profile Display]
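As a rough sketch of the per-pair message statistics named above (count, length, rate, duration), the following aggregates point-to-point messages by sender and receiver. The record layout and the function `message_statistics` are hypothetical, and rate is assumed to mean total bytes divided by total transfer time.

```python
from collections import defaultdict

def message_statistics(messages):
    """Aggregate (sender, receiver, bytes, send_time, recv_time) records."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "duration": 0.0})
    for sender, receiver, size, t_send, t_recv in messages:
        entry = stats[(sender, receiver)]
        entry["count"] += 1
        entry["bytes"] += size
        entry["duration"] += t_recv - t_send
    for entry in stats.values():
        # Rate in bytes per second; zero when no transfer time was recorded.
        duration = entry["duration"]
        entry["rate"] = entry["bytes"] / duration if duration > 0 else 0.0
    return dict(stats)

# Made-up messages, all sent to process 0.
messages = [
    (1, 0, 100, 0.0, 0.5),
    (1, 0, 300, 1.0, 1.5),
    (2, 0, 50, 0.0, 0.25),
]
stats = message_statistics(messages)
```

A table like this makes patterns such as many small messages converging on one process visible at a glance.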
15 Trace Analyzer Views (4)
- Statistics Display
  - Various statistics regarding activities in histogram, table, or text format
- Call Tree Display
[Figures: Statistics Display; Call Tree Display]
16 Trace Analyzer Views (5)
- Source View
  - Source code correlation with events in Timeline
- Activity Chart
  - Per-process histograms of Application and MPI activity
[Figures: Source View; Activity Chart]
17 Trace Analyzer Views (6)
- Process Timeline
  - Activity timeline and counter timeline for a single process
- Process Activity Chart
  - Same type of information as Global Summary Chart
- Process Call Tree
  - Same type of information as Global Call Tree
[Figures: Process Timeline; Process Activity Chart & Call Tree]
18 Bottleneck Identification Test Suite
Testing metric: what did trace visualization tell us (automatic instrumentation)?
- CAMEL: PASSED
  - Identified large number of small messages at beginning of program execution
  - Easy to see that MPI calls take up a small portion of run time (<3%)
- NAS LU: PASSED
  - Showed communication bottlenecks very clearly
    - Large(!) number of small messages
    - Shows sensitivity to latency for processors waiting on data from other processors
  - “W” Class: 18 MB trace file
    - Loads quickly
  - “B” Class: 240 MB trace file
    - Loads slowly (2-3 min.), responsiveness of program is diminished
    - However, can be loaded in small pieces that load much faster
    - Some information is available without loading any frames
    - Took nearly 20 minutes to write trace after program completion!
19 Bottleneck Identification Test Suite (2)
- Big message: PASSED
  - Traces illustrated large amount of time spent in send and receive
- Diffuse procedure: PASSED
  - Traces illustrated a lot of synchronization, with each process executing user code in an exclusive, alternating manner
- Hot procedure: TOSS-UP
  - Assuming hardware counters work, would be easy to see extra CPU utilization
  - Manually instrumenting code would improve accuracy of source code correlation
- Intensive server: PASSED
  - Trace clearly shows that all processes communicate with a single process whose response time is delayed by user code
- Ping pong: PASSED
  - Traces illustrated that most time is spent in MPI code sending and receiving messages, with little time spent in user code
- Random barrier: PASSED
  - Traces show that there are many barriers, with each one held up by a random processor in user code
- Small messages: PASSED
  - Traces illustrated a large number of messages being sent to node 0
- System time: TOSS-UP
  - Hardware counter timeline might be able to indicate the bottleneck if the counters were working
- Wrong way: PASSED
  - Trace shows that first receive takes a long time, but the rest of the messages sent during this time period are received quickly
20 General Comments
- Intel Trace Collector/Analyzer are very popular and effective tools for creating and displaying trace files.
- These tools are proprietary and closed source.
- Analyzing performance of MPI applications is the primary intended use.
- Support for analyzing non-MPI applications is provided via an API and a special library (libVTcs, which allows for coordination of tracefile creation without MPI).
- Performance analysis requires the user to have a good understanding of the types of problems likely to affect performance.
  - No automatic detection of bottlenecks
21 Evaluation (1)
- Available metrics: 4.5/5
  - Can use PAPI
  - Many metrics (event-based and counter-based) are available, but it is not possible to create custom metrics as in Paraver
- Cost: 3/5
  - A single-user license costs ~$500
  - Multiple-user licenses are for a single cluster only
    - A 20-user license costs ~$5,000
    - A 100-user license costs ~$15,000
    - An unlimited-user license costs ~$30,000
- Documentation quality: 4/5
  - Documentation covers most of the features in a clear and consistent fashion
  - Trace Analyzer documentation includes a section that walks a user through the process of analyzing a trace file for bottlenecks through a sample scenario
  - However, some parts of the documentation are confusing if the document is not read in its entirety
  - Doesn’t describe inner workings of trace collection/display
*Note: evaluated IA-32 MPICH Linux version
22 Evaluation (2)
- Extensibility: 0/5
  - Commercial (no source)
  - Trace file format is not documented
  - However, could possibly use distributed application tracing features to create traces
- Filtering and aggregation: 4/5
  - Much of what is recorded in trace files can be controlled through a configuration file (or command line arguments)
  - Some post-mortem filtering and aggregation can be controlled from within Trace Analyzer, but it is not as customizable as other tools
- Hardware support: 1/5
  - Supports only systems using Intel IA-32, Itanium 2, or Intel Extended Memory 64
- Heterogeneity support: 5/5
  - Through the use of libVTcs one may manually instrument the code of distributed applications across heterogeneous platforms
  - No automatic event capturing for heterogeneous applications, however
23 Evaluation (3)
- Installation: 4.5/5
  - Install was very simple and worked immediately
  - However, I was never able to get hardware counters to function due to incompatibilities with installed PAPI and getrusage
- Interoperability: 1/5
  - Trace Analyzer is capable of reading files in the older vampirtrace trace file format, which can be output by some other tools
  - Tracefiles can be output in (or converted to) the older ASCII-based vampirtrace trace file format
- Learning curve: 4.5/5
  - The most important and useful views and features are intuitive and easy to understand
  - Some features seem a bit redundant or oddly named
- Manual overhead: 3/5
  - MPI call tracing is done automatically by linking against the profiling library
  - Can also instrument all functions or a handful of functions using binary instrumentation
  - More detailed tracing information requires manually inserting API function calls
  - A null library is included so that binaries utilizing API function calls need not be altered
24 Evaluation (4)
- Measurement accuracy: 4/5
  - CAMEL overhead ~5%
  - Tracing overhead is negligible
  - However, sometimes Trace Analyzer finds reversed messages that shouldn’t be there
- Multiple executions: 1/5
  - Multiple instances of Trace Analyzer can be opened at once, but comparing views must be done manually
  - Some support is offered for comparing statistics between two different tracefiles, but it is greatly limited (difference or quotient of histograms between two runs)
- Multiple analyses & views: 4/5
  - A number of common, useful views are available
  - However, the values displayed are not as customizable as other tools
  - No automatic analysis is offered
  - Analysis can be performed by examining timelines, histograms, or textual representations
- Performance bottleneck identification: 4.5/5
  - No automatic detection
  - Views provided should allow for manual detection of most common bottlenecks
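The "difference or quotient of histograms" comparison mentioned under multiple executions amounts to a per-bucket calculation like the one below. This is a sketch under assumptions: the function `compare_histograms` is hypothetical, the histograms are keyed by routine name, and the times are made-up values.

```python
def compare_histograms(run_a, run_b):
    """Per-key (difference, quotient) of run_b relative to run_a.

    The quotient is None when the key is missing or zero in run_a.
    """
    result = {}
    for key in sorted(set(run_a) | set(run_b)):
        a = run_a.get(key, 0.0)
        b = run_b.get(key, 0.0)
        result[key] = (b - a, b / a if a else None)
    return result

# Illustrative time-per-routine histograms (seconds) for two runs.
run_a = {"MPI_Send": 2.0, "MPI_Recv": 4.0}
run_b = {"MPI_Send": 3.0, "MPI_Recv": 2.0, "MPI_Barrier": 1.0}

comparison = compare_histograms(run_a, run_b)
```

A quotient above 1 marks a routine that got slower in the second run, and a difference shows by how much in absolute terms; anything beyond that still has to be interpreted by hand.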
25 Evaluation (5)
- Profiling/tracing support: 5/5
  - Both tracing (recording events and messages) and profiling (recording statistics) are supported and can be used independently of each other
- Response time: 2/5
  - No data at all until after run has completed and tracefile has been opened
  - Some information available without fully loading tracefile
  - Large trace files can take a long time to write out and read back in
- Searching: 0/5 (not supported)
- Software support: 4.5/5
  - MPI profiling interface should permit use with many MPI implementations (support of Intel, LAM, and MPICH is explicitly offered)
  - Full support is available for C/C++ and Fortran, and some support for Java and OpenMP
26 Evaluation (6)
- Source code correlation: 4/5
  - All MPI calls on the timeline offer click-through source code correlation
  - User code correlation requires more manual effort
- System stability: 4.5/5
  - Trace Analyzer crashed (segmentation fault) only once throughout evaluation
  - Trace Collector never caused an application to fail
- Technical support: 4/5
  - Quick initial response through support webpage (a few hours)
  - Subsequent responses required a few days
27 References
- Intel Trace Analyzer 4.0 User’s Guide, version 4.0.3.0
- Intel Trace Collector (IA32-LIN-MPICH) User’s Guide, version 5.0.1.0