Survey of Performance Evaluation Tools Last modified: 10/18/05

Summary Given the features of existing performance evaluation tools, we want to:
- Determine the collectable performance metrics (what is recorded, hardware counters, etc.)
- Identify each tool's architectural components, e.g., data and communication (protocol) management
- Identify software capabilities: monitoring or profiling, visualization, and modeling
Goals are:
- Investigate how a component-based performance evaluation framework can be constructed by leveraging existing tools
- Investigate scaling (up and out) of this framework to large-scale systems (large scale: ranges from 1000 to … nodes)
- Analyze workload characterization on deployed platforms for real applications and users

Outline Background Tools Monitoring Profiling/Tracing Workload Characterization (WLC) Techniques A proposal: performance evaluation frameworks

Background

What is Workload? According to the Cambridge dictionary, a workload is “the amount of work to be done, especially by a particular person or machine in a period of time.” In the realm of computer systems, a workload can be loosely defined as a set of requests presented to a computer in a period of time. Workloads can be classified into: synthetic workloads, created for controlled testing; and real workloads, any observed requests during normal operations.

What is WLC? WLC plays a key role in all performance evaluation studies. WLC is a synthetic description of a workload by means of quantitative parameters and functions. The objective is to formulate a model that captures and reproduces the static and dynamic behavior of real workloads. WLC is a difficult as well as a neglected task: a large amount of measurement data must be collected, and extensive analysis has to be performed.

WLC in Performance Evaluation Life Cycle [diagram] Life-cycle phases: initial sizing & resizing, evaluation, production, on-going operation. Driving forces: competitions, hardware, software, growth. Activities: Workload Analysis (analyze performance, profile application); Performance Modeling (analyze requirements, predict requirements); Performance Analysis (analyze performance, optimize resource usage, predict requirements, predict application responsiveness); Performance Tuning (optimize application responsiveness, predict impact of changes); Performance Reporting (report performance, report resource usage).

WLC in Performance Evaluation [diagram relating methodology, mathematics, and models]

Workloads – Data Flows [diagram] Experimental environments: real system, execution-driven simulation, trace-driven simulation, stochastic simulation. Workload sources: real applications, benchmark applications, micro-benchmark programs, synthetic benchmark programs, traces, distributions and other statistics. A monitor (or profiler) feeds an analysis stage and a generator, producing synthetic traces, made-up data, and data sets. © 2003, Carla Ellis

Workload Issues Selection of benchmarks Requirements: Repeatability Availability (software) Acceptance (by community) Representative (of typical usage, e.g. timeliness) Realistic (predictive of real performance, e.g. scaling issue) Types of workloads: Real, Synthetic Workloads monitoring & tracing Monitor/Profiler design Compression, simulation Workload characterization Workload generators

Types: Real and Synthetic Workloads. Real workloads – advantages: they represent reality (“deployment experience”); disadvantages: they are uncontrolled, cannot be repeated or described simply, and are difficult to analyze; nevertheless, they are often useful for “final analysis” papers. Synthetic workloads – advantages: controllable, repeatable, portable to other systems, easily modified; disadvantage: you can never be sure the real world will behave the same way (i.e., are they representative?).

Types: Instruction Workloads. Useful only for CPU performance, but they teach useful lessons for other situations; developed over decades. Approaches: a “typical” instruction (ADD); instruction mixes (by frequency of use), which are sensitive to compiler, application, and architecture and still used today (MFLOPS), although modern complexity (pipelining, data/instruction caching, prefetching) makes mixes invalid; kernels, i.e., the inner loop that does the useful work (sieve, matrix inversion, sort, etc.), which ignore setup and I/O and so can be timed by analysis if desired (at least in theory).
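To make the kernel idea concrete, here is a minimal, hedged sketch in C that times a sieve inner loop; the problem size, the use of clock_gettime(), and the printed summary are illustrative choices, not taken from the slides.

```c
/* Minimal kernel-benchmark sketch: time a Sieve of Eratosthenes inner loop.
 * The problem size and timing approach are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000  /* hypothetical problem size */

int main(void) {
    char *composite = calloc(N + 1, 1);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 2; i * i <= N; i++)          /* the "kernel" inner loop */
        if (!composite[i])
            for (long j = i * i; j <= N; j += i)
                composite[j] = 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    long primes = 0;
    for (long i = 2; i <= N; i++)
        if (!composite[i]) primes++;
    printf("%ld primes <= %d, kernel time %.3f s\n", primes, N, secs);
    free(composite);
    return 0;
}
```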

Real Applications. Standard approach: pick a representative application and sample data, then run it on the system to be tested; easy to do and accurate for that sample data, but it fails to consider other applications and data. Microkernel approach: choose the most important subset of functions and write a benchmark to test those functions; it tests what the computer will actually be used for, but you need to be sure important characteristics aren’t missed.

Synthetic Applications. Complete programs designed specifically for measurement; they may do real or “fake” work and may be adjustable (parameterized). Two major classes: synthetic benchmarks, often used to compare general-purpose computer systems for general-purpose use (examples: Sieve, Ackermann’s function, Whetstone, Linpack, Dhrystone, Livermore loops, SPEC, MAB); and microbenchmarks, for I/O, network, and other non-CPU measurements (examples: HPCtoolkits).

Workload Considerations Services exercised Level of detail Representative Timeliness Other considerations

Services Exercised What services does a system actually use? Faster CPU won’t speed up “cp” Network performance useless for matrix work What metrics measure these services? MIPS for CPU speed Bandwidth for network, I/O TPS for transaction processing May be possible to isolate interfaces to just one component E.g., instruction mix for CPU System often has layers of services Consider services provided and used by that component Can cut at any point and insert workload

Integrity Computer systems are complex Effect of interactions hard to predict So must be sure to test entire system Important to understand balance between components I.e., don’t use 90% CPU mix to evaluate I/O-bound application Sometimes only individual components are compared Would a new CPU speed up our system? How does IPV6 affect Web server performance? But component may not be directly related to performance

Workload Characterization Identify service provided by major subsystem List factors affecting performance List metrics that quantify demands and performance Identify workload provided to that service

Example: Web Service
- Web Client Analysis – Services: visit page, follow hyperlink, display information. Factors: page size, number of links, fonts required, embedded graphics, sound. Metrics: response time. Workload: a list of pages to be visited and links to be followed.
- Network Analysis – Services: connect to server, transmit request, transfer data. Factors: bandwidth, latency, protocol used. Metrics: connection setup time, response latency, achieved bandwidth. Workload: a series of connections to one or more servers, with data transfer.
- Web Server Analysis – Services: accept and validate connection, fetch HTTP data. Factors: network performance, CPU speed, system load, disk subsystem performance. Metrics: response time, connections served. Workload: a stream of incoming HTTP connections and requests.
- File System Analysis – Services: open file, read file (writing doesn’t matter for a Web server). Factors: disk drive characteristics, file system software, cache size, partition size. Metrics: response time, transfer rate. Workload: a series of file-transfer requests.
- Disk Drive Analysis – Services: read sector, write sector. Factors: seek time, transfer rate. Metrics: response time. Workload: a statistically generated stream of read/write requests (see the sketch after this slide).
Chain of components: Web Client → (TCP/IP connections) Network → (HTTP requests) Web Server → (Web page accesses) File System → (Disk transfers) Disk Drive, driven by Web page visits.
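As referenced in the Disk Drive Analysis item, a statistically generated request stream can be produced with a few lines of code. The sketch below is a hedged illustration: the arrival rate, read/write mix, and device size are assumed parameters, and exponential inter-arrival times are one common modeling choice.

```c
/* Sketch of a statistically generated disk request stream. The rate,
 * read/write mix, and sector range are assumed, not from the slides. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void) {
    const double rate = 200.0;      /* requests per second (assumed) */
    const double read_frac = 0.7;   /* fraction of reads (assumed) */
    const long sectors = 1L << 24;  /* device size in sectors (assumed) */
    double t = 0.0;

    srand(42);
    for (int i = 0; i < 20; i++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        t += -log(u) / rate;                       /* exponential inter-arrival */
        long sector = (long)(((double)rand() / ((double)RAND_MAX + 1.0)) * sectors);
        const char *op = ((double)rand() / RAND_MAX < read_frac) ? "R" : "W";
        printf("%9.6f %s sector %ld\n", t, op, sector);
    }
    return 0;
}
```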

Level of Detail Detail trades off accuracy vs. cost Highest detail is complete trace Lowest detail is one request (most common) Intermediate approach: weight by frequency

Representative Obviously, workload should represent desired application Arrival rate of requests Resource demands of each request Resource usage profile of workload over time Again, accuracy and cost trade off Need to understand whether detail matters

Timeliness Usage patterns change over time File size grows to match disk size Web pages grow to match network bandwidth If using “old” workloads, must be sure user behavior hasn’t changed Even worse, behavior may change after test, as result of installing new system

Other Considerations Loading levels Full capacity Beyond capacity Actual usage External components not considered as parameters Repeatability of workload

Tools

Desired Features of a Measurement Tool. Basic uses of performance evaluation tools are: performance analysis and enhancement of system operations; troubleshooting and recovery of system components; support for components performing job scheduling and resource management (e.g., when accomplishing load balancing); collection of information on applications; fault detection or prevention (HA); detection of security threats and “holes.” Desirable features include, but are not limited to: non-intrusiveness; integration with batch job management systems; system usage statistics retrieval; availability in cluster distributions; ability to scale to large systems; a graphical interface (standard GUI or web portal).

Criteria for Comparing Tools. Evaluation criteria: metrics collected, monitored/profiled entities, visualization, data and communication management. Other criteria: knowledge representations, tool interoperability, “standard” APIs, scalability.

Some Terminology. Monitoring: observing, supervising, or controlling the activities of other programs. Profiling: a statistical view of how well resources are being used by a program, often in the form of a graph or table, representing distinctive features or characteristics. Tracing: a record of (system or application) events captured as the program runs.

Monitoring
System monitoring: provides continuous collection and aggregation of system performance data.
Application monitoring: measures actual application performance, typically via a batch system.
Tools:
- PerfMiner (perfminer.pdc.kth.se)
- NWPerf
- SuperMon (supermon.sourceforge.net)
- Hawkeye (www.cs.wisc.edu/condor/hawkeye/)
- Ganglia (ganglia.sourceforge.net)
- CluMon (clumon.ncsa.uiuc.edu)
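For a feel of the raw data such system monitors collect, here is a hedged, single-node sketch that periodically samples Linux /proc files; the 5-second interval and the specific files are assumptions, and tools such as Ganglia or SuperMon aggregate comparable data across a whole cluster rather than printing it on one node.

```c
/* Minimal system-monitoring sketch: sample load average and memory
 * figures from /proc on Linux. Interval and file choices are assumed. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int i = 0; i < 3; i++) {
        double load1, load5, load15;
        FILE *f = fopen("/proc/loadavg", "r");
        if (f) {
            if (fscanf(f, "%lf %lf %lf", &load1, &load5, &load15) == 3)
                printf("load: %.2f %.2f %.2f\n", load1, load5, load15);
            fclose(f);
        }

        char line[256];
        f = fopen("/proc/meminfo", "r");
        if (f) {
            for (int j = 0; j < 2 && fgets(line, sizeof line, f); j++)
                fputs(line, stdout);            /* MemTotal, MemFree */
            fclose(f);
        }
        sleep(5);                               /* sampling interval (assumed) */
    }
    return 0;
}
```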

Profiling and Tracing
Provide static instrumentation tools that focus on source code over which users have direct control.
Tools:
- TAU (www.cs.uoregon.edu/research/tau/home.php)
- Paradyn (www.paradyn.org)
- MPE/Jumpshot (www-unix.mcs.anl.gov/perfvis/)
- Dimemas/Paraver
- mpiP (www.llnl.gov/CASC/mpip/)
- Dynaprof (icl.cs.utk.edu/~mucci/dynaprof/)
- KOJAK (www.fz-juelich.de/zam/kojak/)
- ICT (www.intel.com/cd/software/products/asmo-na/eng/cluster/index.htm)
- Pablo (pablo.renci.org/Software/Pablo/pablo.htm)
- MPICL/ParaGraph (www.csm.ornl.gov/picl/)
- CoPilot (www.sgi.com/products/software/co-pilot/)
- IPM (www.nersc.gov/projects/ipm/)
- PerfSuite (perfsuite.ncsa.uiuc.edu)

Data Management and Data Formats
Databases/query languages: JDBC, SQL.
Data formats:
- HDF: involves the development and support of software and file formats for scientific data management. The HDF software includes I/O libraries and tools for analyzing, visualizing, and converting scientific data. There are two HDF formats: the original HDF (4.x and previous releases) and HDF5, a completely new format and library.
- NetCDF (the Network Common Data Form): provides an interface for array-oriented data access and a library that supports an implementation of the interface.
- XDR
- XML (the Extensible Markup Language): provides a standard way to define application-specific data languages.
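As a small illustration of the HDF5 route, the sketch below writes a one-dimensional array of samples with the HDF5 C API; the file name, dataset name, and sample values are made up for the example.

```c
/* Sketch of storing collected samples with the HDF5 C API.
 * File and dataset names, and the sample values, are illustrative. */
#include "hdf5.h"

int main(void) {
    double samples[4] = {0.12, 0.34, 0.56, 0.78};   /* e.g., per-interval CPU load */
    hsize_t dims[1] = {4};

    hid_t file  = H5Fcreate("perf_data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/cpu_load", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, samples);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```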

Monitoring Tools

CoPilot Metrics collected Monitored entities Visualizations

Hawkeye Metrics collected Monitored entities Visualizations

IPM Metrics collected Monitored entities Visualizations

PerfSuite Metrics collected Monitored entities Visualizations

NWPerf Metrics collected Profiled entities Visualizations

PerfMiner Metrics collected Profiled entities Visualizations

SuperMon Metrics collected Profiled entities Visualizations

CluMon Metrics collected Profiled entities Visualizations

Profiling/Tracing Tools

TAU Metrics recorded Two modes: profile, trace Profile mode Inclusive/exclusive time spent in functions Hardware counter information PAPI/PCL: L1/2/3 cache reads/writes/misses, TLB misses, cycles, integer/floating point/load/store/stalls executed, wall clock time, virtual time Other OS timers (gettimeofday, getrusage) MPI message size sent Trace mode Same as profile (minus hardware counters?) Message send time, message receive time, message size, message sender/recipient(?) Profiled entities Functions (automatic & dynamic), loops + regions (manual instrumentation)
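A hedged sketch of TAU's manual instrumentation of a region is shown below; the macro names follow TAU's documented C API, but the exact header and macro set should be checked against the installed TAU version, and compilation normally goes through TAU's compiler wrappers.

```c
/* Hedged sketch of TAU manual instrumentation for a loop/region.
 * Macro names as documented for TAU's C API; verify against your version. */
#include <TAU.h>

void compute(int n) {
    TAU_PROFILE_TIMER(t, "compute loop", "", TAU_USER);
    TAU_PROFILE_START(t);
    volatile double s = 0.0;
    for (int i = 0; i < n; i++)
        s += i * 0.5;                 /* region whose time TAU records */
    TAU_PROFILE_STOP(t);
}

int main(int argc, char **argv) {
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(0);          /* single-process example */
    compute(1000000);
    return 0;
}
```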

TAU Visualizations Profile mode Text-based: pprof (example next slide), shows a summary of profile information Graphical: racy (old), jracy a.k.a. paraprof Trace mode No built-in visualizations Can export to CUBE (see KOJAK), Jumpshot (see MPE), and Vampir format (see Intel Cluster Tools)

TAU – pprof output (example listing). pprof prints “Reading Profile files in profile.*” and then, for NODE 0; CONTEXT 0; THREAD 0, a table with the columns %Time, Exclusive msec, Inclusive total msec, #Call, #Subrs, Inclusive usec/call, and Name. The example lists main() (calls f1, f5), f1() (sleeps 1 sec, calls f2, f4), f2() (sleeps 2 sec, calls f3), f3() (sleeps 3 sec), f4() (sleeps 4 sec, calls f2), and f5() (sleeps 5 sec), along with parent => child rows such as main() => f1() and f1() => f2().

TAU – paraprof

Paradyn Metrics recorded Number of CPUs, number of active threads, CPU and inclusive CPU time Function calls to and by Synchronization (# operations, wait time, inclusive wait time) Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received) I/O (# operations, wait time, inclusive wait time, total bytes) All metrics recorded as “time histograms” (fixed-size data structure) Profiled entities Functions only (but includes functions linked to in existing libraries)

Paradyn Visualizations Time histograms Tables Barcharts “Terrains” (3-D histograms)

Paradyn Time View – histogram across multiple hosts

Paradyn – table and bar chart. Table (current metric values); bar chart (current or average metric values).

MPE/Jumpshot Metrics collected MPI message send time, receive time, size, message sender/recipient User-defined event entry & exit Profiled entities All MPI functions Functions or regions via manual instrumentation and custom events Visualization Jumpshot: timeline view (space-time diagram overlaid on Gantt chart), histogram
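The user-defined events mentioned above are added with MPE's logging calls; the sketch below is a hedged example whose state name, color, and log-file prefix are arbitrary, and the function names should be verified against the installed MPE release.

```c
/* Hedged sketch of MPE user-defined events (custom states), which Jumpshot
 * can display alongside automatically logged MPI calls. */
#include <mpi.h>
#include "mpe.h"

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPE_Init_log();

    int ev_start = MPE_Log_get_event_number();
    int ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "compute", "red");

    MPE_Log_event(ev_start, 0, NULL);
    /* ... user computation shown as a "compute" state in Jumpshot ... */
    MPE_Log_event(ev_end, 0, NULL);

    MPE_Finish_log("myapp");          /* writes a log file with this prefix */
    MPI_Finalize();
    return 0;
}
```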

Jumpshot – Timeline View and Histogram View

Dimemas/Paraver Metrics recorded (MPITrace) All MPI functions Hardware counters (2 from the following two lists, uses PAPI) Counter 1 Cycles Issued instructions, loads, stores, store conditionals Failed store conditionals Decoded branches Quadwords written back from scache(?) Correctible scache data array errors(?) Primary/secondary I- cache misses Instructions mispredicted from scache way prediction table(?) External interventions (cache coherency?) External invalidations (cache coherency?) Graduated instructions Counter 2 Cycles Graduated instructions, loads, stores, store conditionals, floating point instructions TLB misses Mispredicted branches Primary/secondary data cache miss rates Data mispredictions from scache way prediction table(?) External intervention/invalidation (cache coherency?) Store/prefetch exclusive to clean/shared block
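Since MPITrace reads its two counters through PAPI, the following hedged sketch shows the general PAPI pattern; the chosen events (total cycles and L1 data-cache misses) are illustrative and must be supported by the target processor.

```c
/* Hedged sketch of reading two hardware counters with PAPI. The particular
 * events are illustrative; availability depends on the CPU. */
#include <stdio.h>
#include <papi.h>

int main(void) {
    int es = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TOT_CYC);     /* counter 1: total cycles */
    PAPI_add_event(es, PAPI_L1_DCM);      /* counter 2: L1 data-cache misses */

    PAPI_start(es);
    volatile double s = 0.0;
    for (int i = 0; i < 1000000; i++) s += i;   /* measured region */
    PAPI_stop(es, values);

    printf("cycles=%lld  L1 D-cache misses=%lld\n", values[0], values[1]);
    return 0;
}
```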

Dimemas/Paraver Profiled entities (MPITrace) All MPI functions (message start time, message end time, message size, message recipient/sender) User regions/functions via manual instrumentation Visualization Timeline display (like Jumpshot) Shows Gantt chart and messages Also can overlay hardware counter information Clicking on timeline brings up a text listing of events near where you clicked 1D/2D analysis modules

Paraver – timelines: HW-counter timeline and standard timeline

Paraver – text module

Paraver – 1D analysis and 2D analysis

mpiP Metrics collected: start time, end time, and message size for each MPI call. Profiled entities: MPI function calls via the PMPI wrapper (see the sketch after this slide). Visualization: text-based output, with a graphical browser that displays statistics inline with the source. Displayed information: overall time (%) for each MPI node; top 20 call sites by time (MPI%, App%, variance); top 20 call sites by message size (MPI%, App%, variance); min/max/average/MPI%/App% time spent at each call site; min/max/average/sum of message sizes at each call site. App time = wall clock time between MPI_Init and MPI_Finalize; MPI time = all time consumed by MPI functions; App% = % of metric relative to overall app time; MPI% = % of metric relative to overall MPI time.
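The PMPI wrapper technique mpiP relies on can be sketched as below; this is not mpiP's actual code (its wrappers are generated and also record call-site information), just a hedged illustration of intercepting MPI_Send and forwarding to PMPI_Send. The const qualifier on the buffer argument matches MPI-3; older MPI versions use a plain void pointer.

```c
/* Sketch of PMPI interposition: define MPI_Send, record timing and size,
 * then forward to PMPI_Send. Illustrative only. */
#include <mpi.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double elapsed = MPI_Wtime() - t0;

    int bytes;
    MPI_Type_size(datatype, &bytes);
    bytes *= count;
    /* a real tool would accumulate (elapsed, bytes) per call site here */
    (void)elapsed; (void)bytes;
    return rc;
}
```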

mpiP – graphical view

Dynaprof Metrics collected Wall clock time or PAPI metric for each profiled entity Collects inclusive, exclusive, and 1-level call tree % information Profiled entities Functions (dynamic instrumentation) Visualizations Simple text-based Simple GUI (shows same info as text-based)

Dynaprof – output (example). Running “wallclockrpt lu-1.wallclock” produces three sections: an Exclusive Profile (Name, Percent, Total, Calls) with rows for TOTAL, unknown, and main; an Inclusive Profile (Name, Percent, Total, SubCalls) with rows for TOTAL and main; and a 1-Level Inclusive Call Tree (Parent/-Child, Percent, Total, Calls) with rows for TOTAL, main, f_setarg, f_setsig, f_init, atexit, and MAIN__.

KOJAK Metrics collected MPI: message start time, receive time, size, message sender/recipient Manual instrumentation: start and stop times 1 PAPI metric / run (only FLOPS and L1 data misses visualized) Profiled entities MPI calls (MPI wrapper library) Function calls (automatic instrumentation, only available on a few platforms) Regions and function calls via manual instrumentation Visualizations Can export traces to Vampir trace format (see ICT) Shows profile and analyzed data via CUBE (described on next few slides)

CUBE overview: simple description Uses a 3-pane approach to display information Metric pane Module/calltree pane Right-clicking brings up source code location Location pane (system tree) Each item is displayed along with a color to indicate severity of condition Severity can be expressed 4 ways Absolute (time) Percentage Relative percentage (changes module & location pane) Comparative percentage (differences between executions) Despite documentation, interface is actually quite intuitive

Intel Cluster Tools (ICT) Metrics collected MPI functions: start time, end time, message size, message sender/recipient User-defined events: counter, start & end times Code location for source-code correlation Instrumented entities MPI functions via wrapper library User functions via binary instrumentation(?) User functions & regions via manual instrumentation Visualizations Different types: timelines, statistics & counter info Described in next slides

ICT visualizations – timelines & summaries. Summary Chart Display: allows the user to see how much work is spent in MPI calls. Timeline Display: zoomable, scrollable timeline representation of program execution. (Fig. 1 Summary Chart; Fig. 2 Timeline Display)

ICT visualizations – histogram & counters. Summary Timeline: timeline/histogram representation showing the number of processes in each activity per time bin. Counter Timeline: value-over-time representation (behavior depends on the counter definition in the trace). (Fig. 3 Summary Timeline; Fig. 4 Counter Timeline)

ICT visualizations – message stats & process profiles. Message Statistics Display: message data to/from each process (count, length, rate, duration). Process Profile Display: per-process data regarding activities. (Fig. 5 Message Statistics; Fig. 6 Process Profile Display)

ICT visualizations – general stats & call tree Statistics Display Various statistics regarding activities in histogram, table, or text format Call Tree Display Fig. 7 Statistics Display Fig. 8 Call Tree Display

ICT visualizations – source & activity chart Source View Source code correlation with events in Timeline Activity Chart Per Process histograms of Application and MPI activity Fig 9. Source View Fig. 10 Activity Chart

ICT visualizations – process timeline & activity chart. Process Timeline: activity timeline and counter timeline for a single process. Process Activity Chart: same type of information as the Global Summary Chart. Process Call Tree: same type of information as the Global Call Tree. (Figure 11 Process Timeline; Figure 12 Process Activity Chart & Call Tree)

Pablo Metrics collected Time inclusive/exclusive of a function Hardware counters via PAPI Summary metrics computed from timing info Min/max/avg/stdev/count Profiled entities Functions, function calls, and outer loops All selected via GUI Visualizations Displays derived summary metrics color-coded and inline with source code Shown on next slide

SvPablo

MPICL/Paragraph Metrics collected MPI functions: start time, end time, message size, message sender/recipient Manual instrumentation: start time, end time, “work” done (up to user to pass this in) Profiled entities MPI function calls via PMPI interface User functions/regions via manual instrumentation Visualizations Many, separated into 4 categories: utilization, communication, task, “other” Described in following slides

ParaGraph visualizations. Utilization visualizations display a rough estimate of processor utilization, broken down into 3 states: Idle – the program is blocked waiting for a communication operation (or it has stopped execution); Overhead – the program is performing communication but is not blocked (time spent within the MPI library); Busy – the program is executing something other than communication. “Busy” doesn’t necessarily mean useful work is being done, since the tool assumes (not communication) := busy. Communication visualizations display different aspects of communication: frequency, volume, overall pattern, etc.; “distance” is computed from the topology set in the options menu.

ParaGraph visualizations Task visualizations Display information about when processors start & stop tasks Requires manually instrumented code to identify when processors start/stop tasks Other visualizations Miscellaneous things

Utilization visualizations – utilization count. Displays the number of processors in each state at a given moment in time (busy shown on the bottom, overhead in the middle, idle on top), and the utilization state of each processor as a function of time (Gantt chart).

Utilization visualizations – Kiviat diagram Shows our friend, the Kiviat diagram Each spoke is a single processor Dark green shows moving average, light green shows current high watermark Timing parameters for each can be adjusted Metric shown can be “busy” or “busy + overhead”

Utilization visualizations – streak Shows “streak” of state Similar to winning/losing streaks of baseball teams Win = overhead or busy Loss = idle Not sure how useful this is

Utilization visualizations – utilization summary Shows percentage of time spent in each utilization state up to current time

Utilization visualizations – utilization meter Shows percentage of processors in each utilization state at current time

Utilization visualizations – concurrency profile. Shows histograms of the number of processors in a particular utilization state. Example: the diagram shows that only 1 processor was busy ~5% of the time, while all 8 processors were busy ~90% of the time.

Communication visualizations – color code Color code controls colors used on most communication visualizations Can have color indicate message sizes, message distance, or message tag Distance computed by topology set in options menu

Communication visualizations – communication traffic Shows overall traffic at a given time Bandwidth used, or Number of messages in flight Can show single node or aggregate of all nodes

Communication visualizations – spacetime diagram Shows standard space-time diagram for communication Messages sent from node to node at which times

Communication visualizations – message queues Shows data about message queue lengths Incoming/outgoing Number of bytes queued/number of messages queued Colors mean different things Dark color shows current moving average Light color shows high watermark

Communication visualizations – communication matrix Shows which processors sent data to which other processors

Communication visualizations – communication meter Show percentage of communication used at the current time Message count or bandwidth 100% = max # of messages / max bandwidth used by the application at a specific time

Communication visualizations – animation Animates messages as they occur in trace file Can overlay messages over topology Available topologies Mesh Ring Hypercube User-specified Can layout each node as you want Can store to a file and load later on

Communication visualizations – node data Shows detailed communication data Can display Metrics Which node Message tag Message distance Message length For a single node, or aggregate for all nodes

Task visualizations – task count Shows number of processors that are executing a task at the current time At end of run, changes to show summary of all tasks

Task visualizations – task Gantt Shows Gantt chart of which task each processor was working on at a given time

Task visualizations – task speed Similar to Gantt chart, but displays “speed” of each task Must record work done by task in instrumentation call (not done for example shown above)

Task visualizations – task status Shows which tasks have started and finished at the current time

Task visualizations – task summary Shows % time spent on each task Also shows any overlap between tasks

Task visualizations – task surface. Shows time spent on each task by each processor; useful for seeing load imbalance on a task-by-task basis.

Task visualizations – task work Displays work done by each processor Shows rate and volume of work being done Example doesn’t show anything because no work amounts recorded in trace being visualized

Other visualizations – clock, coordinates Clock Shows current time Coordinate information Shows coordinates when you click on any visualization

Other visualizations – critical path Highlights critical path in space-time diagram in red Longest serial path shown in red Depends on point-to-point communication (collective can screw it up)

Other visualizations – phase portrait Shows relationship between processor utilization and communication usage

Other visualizations – statistics. Gives overall statistics for the run: % busy, overhead, and idle time; total count and bandwidth of messages; and max, min, and average message size, distance, and transit time. Shows a maximum of 16 processors at a time.

Other visualizations – processor status Shows Processor status Which task each processor is executing Communication (sends & receives) Each processor is a square in the grid (8-processor example shown)

Other visualizations – trace events Shows text output of all trace file events

WLC Techniques

Static vs. Dynamic. Static WLC explores the intrinsic characteristics of the workload and the correlation between workload parameters and distributions; techniques include clustering, principal component analysis, averaging, and correlations. Dynamic WLC explores the characteristics of the workload over time and predicts future workload behavior; techniques include Markov chains, user behavior graphs, and regression methods (a Markov-chain sketch follows).
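As a concrete example of a dynamic technique, the hedged sketch below generates a synthetic request stream from a first-order Markov chain; the three request types and the transition probabilities are invented for illustration.

```c
/* Sketch of a dynamic WLC technique: a first-order Markov chain over request
 * types generating a synthetic request stream. States and probabilities are
 * made-up illustrative values. */
#include <stdio.h>
#include <stdlib.h>

enum { READ, WRITE, COMPUTE, NTYPES };
static const char *names[NTYPES] = {"read", "write", "compute"};
static const double P[NTYPES][NTYPES] = {   /* row = current state, col = next */
    {0.6, 0.3, 0.1},
    {0.2, 0.5, 0.3},
    {0.4, 0.1, 0.5},
};

static int next_state(int cur) {
    double u = (double)rand() / RAND_MAX, acc = 0.0;
    for (int s = 0; s < NTYPES; s++) {
        acc += P[cur][s];
        if (u <= acc) return s;
    }
    return NTYPES - 1;
}

int main(void) {
    srand(7);
    int state = READ;
    for (int i = 0; i < 15; i++) {
        printf("%s ", names[state]);
        state = next_state(state);
    }
    printf("\n");
    return 0;
}
```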

Discussion

Proposed WLC Framework [diagram] Numbered stages: 1. Requirements Analysis; 2. Measurements; 3. Model Construction (investigate real-time analysis); 4. Model Validation; 5. Evaluation; 6. Visualize. Diagram elements include: graphical analysis; collection from UNIX/Linux/Windows XP; web access; ODBC: SQL; M/S Access; XML input; applying criteria for representativeness; workload model; execute model; calibrate model (yes/no validation loop); data mining; analytical/statistical tools; database; results; workload characterization; prediction; response time analysis; predictive analysis.

References
- Network monitoring tools
- PacketBench – network traffic
- Rubicon – I/O
- The Tracefile Testbed