INTRODUCTION TO PERFORMANCE ENGINEERING
Allen D. Malony
Performance Research Laboratory, University of Oregon
SC '09: Productive Performance Engineering of Petascale Applications with POINT and VI-HPS

Performance Engineering

[Diagram: the performance engineering process as a loop of performance observation, performance experimentation, performance diagnosis, and performance tuning, linked by characterization, hypotheses, and properties. Each stage is supported by performance technology: instrumentation, measurement, analysis, and visualization; experiment management and performance storage; data mining, models, and expert systems. The theme is effective use of performance technology throughout the optimization process.]

Performance Optimization Cycle

[Diagram: a cycle of steps mapped to tool capabilities: expose factors (instrumentation), collect performance data (measurement), calculate metrics and analyze results (analysis), visualize results (presentation), identify problems, and tune performance (optimization).]

Parallel Performance Properties

Parallel code performance is influenced by both sequential and parallel factors.
Sequential factors
  – computation and memory use
  – input / output
Parallel factors
  – thread / process interactions
  – communication and synchronization

Performance Observation

Understanding performance requires observation of performance properties.
Performance tools and methodologies are primarily distinguished by what observations are made and how:
  – what aspects of performance factors are seen
  – what performance data is obtained
Tools and methods cover a broad range.

Metrics and Measurement

Observability depends on measurement.
A metric represents a type of measured data
  – counts, time, hardware counters
A measurement records performance data
  – associated with aspects of program execution
Derived metrics are computed
  – rates (e.g., flops)
Metrics and measurements are decided by need.
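To make the count / time / derived-metric distinction concrete, here is a minimal C sketch (not from the slides) that takes two measured quantities, a hand-counted number of floating-point operations and a wall-clock duration, and computes the derived rate from them. A real tool would obtain the operation count from hardware counters (e.g., through PAPI) rather than counting by hand.

    /* Illustrative only: derived metric (MFLOP/s) = count / time. */
    #include <stdio.h>
    #include <sys/time.h>

    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        const long n = 50 * 1000 * 1000;
        double sum = 0.0;

        double t0 = wall_seconds();
        for (long i = 0; i < n; i++)
            sum += (double)i * 0.5;     /* 2 floating-point ops per iteration */
        double t1 = wall_seconds();

        long flops = 2 * n;             /* measured data: a count    */
        double secs = t1 - t0;          /* measured data: a duration */
        printf("sum=%g  MFLOP/s = %.1f\n", sum, flops / secs / 1.0e6);
        return 0;
    }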

Execution Time

Wall-clock time
  – based on a real-time clock
Virtual process time
  – time when the process is executing
  – user time and system time
  – does not include time when the process is stalled
Parallel execution time
  – runs whenever any parallel part is executing
  – global time basis
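The sketch below, assuming a POSIX system with clock_gettime, contrasts wall-clock time with virtual process (CPU) time: during the sleep the process is stalled, so wall-clock time keeps advancing while CPU time does not.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    static double secs(clockid_t id)
    {
        struct timespec ts;
        clock_gettime(id, &ts);
        return ts.tv_sec + ts.tv_nsec * 1.0e-9;
    }

    int main(void)
    {
        double w0 = secs(CLOCK_MONOTONIC);            /* wall-clock time  */
        double c0 = secs(CLOCK_PROCESS_CPUTIME_ID);   /* virtual CPU time */

        volatile double x = 0.0;
        for (long i = 0; i < 20 * 1000 * 1000; i++) x += i;  /* compute */
        sleep(1);   /* stalled: wall time advances, CPU time does not */

        printf("wall = %.3f s, cpu = %.3f s\n",
               secs(CLOCK_MONOTONIC) - w0,
               secs(CLOCK_PROCESS_CPUTIME_ID) - c0);
        return 0;
    }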

Direct Performance Observation

Execution actions exposed as events
  – In general, actions reflect some execution state:
      presence at a code location or a change in data
      occurrence in a parallelism context (thread of execution)
  – Events encode actions for observation
Observation is direct
  – direct instrumentation of program code (probes)
  – instrumentation invokes performance measurement
  – event measurement = performance data + context
Performance experiment
  – actual events + performance measurements
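A minimal illustration of direct observation: probes placed at the entry and exit of a region turn its execution into begin/end events that invoke measurement. The event_begin/event_end functions are hypothetical stand-ins for a measurement library, and the sketch handles only a single, non-nested region.

    #include <stdio.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1.0e-9;
    }

    static double t_start;

    static void event_begin(const char *name) { (void)name; t_start = now(); }
    static void event_end(const char *name)
    {
        /* event measurement = performance data (duration) + context (name) */
        printf("%s: %.6f s\n", name, now() - t_start);
    }

    static void solver(void)
    {
        event_begin("solver");                          /* probe at region entry */
        for (volatile long i = 0; i < 1000000; i++) ;   /* the observed work     */
        event_end("solver");                            /* probe at region exit  */
    }

    int main(void) { solver(); return 0; }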

Indirect Performance Observation

Program code instrumentation is not used.
Performance is observed indirectly
  – execution is interrupted
      can be triggered by different events
  – execution state is queried (sampled)
      different performance data measured
  – event-based sampling (EBS)
Performance attribution is inferred
  – determined by execution context (state)
  – observation resolution determined by the interrupt period
  – performance data associated with the context for that period
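A minimal sketch of event-based sampling on a POSIX system: a profiling timer (setitimer / SIGPROF) periodically interrupts execution, and each interrupt is charged to whatever the program was doing at that moment. For simplicity the execution context is tracked in a global variable rather than recovered from the interrupted program counter or call stack, which is what a real sampling profiler would do.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    enum { REGION_MAIN, REGION_COMPUTE, NREGIONS };
    static volatile sig_atomic_t current_region = REGION_MAIN;
    static volatile long samples[NREGIONS];

    static void on_sample(int sig)
    {
        (void)sig;
        samples[current_region]++;   /* attribute this period to the current context */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sample;
        sigaction(SIGPROF, &sa, NULL);

        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };  /* 10 ms sampling period */
        setitimer(ITIMER_PROF, &it, NULL);

        current_region = REGION_COMPUTE;
        volatile double x = 0.0;
        for (long i = 0; i < 200 * 1000 * 1000; i++) x += i;
        current_region = REGION_MAIN;

        printf("samples: main=%ld compute=%ld\n",
               samples[REGION_MAIN], samples[REGION_COMPUTE]);
        return 0;
    }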

Direct Observation: Events

Event types
  – interval events (begin/end events)
      measure performance between begin and end
      metrics monotonically increase
  – atomic events
      used to capture performance data state
Code events
  – routines, classes, templates
  – statement-level blocks, loops
User-defined events
  – specified by the user
Abstract mapping events
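As a concrete illustration of the two event types, the sketch below uses TAU's C instrumentation macros (names taken from the TAU user guide; treat the exact forms as an assumption): an interval event brackets a region, while an atomic event captures a single value, here a message size, each time it is triggered.

    #include <TAU.h>

    void exchange(double *buf, int nbytes)
    {
        TAU_PROFILE_TIMER(t, "exchange", "", TAU_USER);       /* interval event */
        TAU_PROFILE_START(t);

        TAU_REGISTER_EVENT(msg_size, "message size (bytes)"); /* atomic event   */
        TAU_EVENT(msg_size, (double)nbytes);                  /* one captured value */

        /* ... communication being observed ... */
        (void)buf;

        TAU_PROFILE_STOP(t);
    }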

Direct Observation: Instrumentation

Events defined by instrumentation access
Instrumentation levels
  – source code
  – library code
  – object code
  – executable code
  – runtime system
  – operating system
Different levels provide different information.
Different tools needed for each level.
Levels can have different granularity.

Direct Observation: Techniques

Static instrumentation
  – program instrumented prior to execution
Dynamic instrumentation
  – program instrumented at runtime
Manual and automatic mechanisms
Tool required for automatic support
  – source time: preprocessor, translator, compiler
  – link time: wrapper library, preload
  – execution time: binary rewrite, dynamic
Advantages / disadvantages
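One common link-time technique is a wrapper library built on the MPI standard profiling interface: the tool defines MPI_Send itself, records an event, and forwards to the real implementation via PMPI_Send. The sketch below assumes an MPI-3 mpi.h, and the record_event_* functions are hypothetical placeholders for a measurement library.

    #include <stdio.h>
    #include <mpi.h>

    /* stand-ins for a measurement library (hypothetical) */
    static void record_event_begin(const char *name)          { printf("begin %s\n", name); }
    static void record_event_end(const char *name, int bytes) { printf("end   %s (%d bytes)\n", name, bytes); }

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int size;
        PMPI_Type_size(type, &size);      /* use PMPI_ calls to avoid re-entering wrappers */

        record_event_begin("MPI_Send");
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);   /* forward to the real send */
        record_event_end("MPI_Send", count * size);
        return rc;
    }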

Direct Observation: Mapping

Associate performance data with high-level semantic abstractions.
Abstract events at user level provide semantic context.

Indirect Observation: Events/Triggers

Events are actions external to program code
  – timer countdown, hardware counter overflow, ...
  – consequence of program execution
  – event frequency determined by: type, setup, number enabled (exposed)
Triggers are used to invoke the measurement tool
  – traps when events occur (interrupt)
  – associated with events
  – may add differentiation to events

Indirect Observation: Context

When events trigger, the execution context is determined at the time of the trap (interrupt)
  – access to the PC from the interrupt frame
  – access to information about the process/thread
  – possible access to the call stack (requires a call stack unwinder)
The assumption is that the context was the same during the preceding period
  – between successive triggers
  – statistical approximation, valid for long-running programs

Direct / Indirect Comparison

Direct performance observation
  + measures performance data exactly
  + links performance data with application events
  – requires instrumentation of code
  – measurement overhead can cause execution intrusion and possibly performance perturbation
Indirect performance observation
  + argued to have less overhead and intrusion
  + can observe finer granularity
  + no code modification required (may need symbols)
  – inexact measurement and attribution

Measurement Techniques

When is measurement triggered?
  – external agent (indirect, asynchronous)
      interrupts, hardware counter overflow, ...
  – internal agent (direct, synchronous)
      through code modification
How are measurements made?
  – Profiling
      summarizes performance data during execution
      per process / thread, organized with respect to context
  – Tracing
      trace record with performance data and timestamp
      per process / thread

Measured Performance

  – counts
  – durations
  – communication costs
  – synchronization costs
  – memory use
  – hardware counts
  – system calls

Critical issues

Accuracy
  – timing and counting accuracy depends on resolution
  – any performance measurement generates overhead
      execution of performance measurement code
  – measurement overhead can lead to intrusion
  – intrusion can cause perturbation
      alters program behavior
Granularity
  – how many measurements are made
  – how much overhead per measurement
Tradeoff (general wisdom)
  – accuracy is inversely correlated with granularity
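One way to make the overhead/granularity tradeoff tangible is to time a loop with and without probes and divide the difference by the number of probe invocations. The sketch below is illustrative only; probe() is a trivial placeholder, and real measurement probes cost considerably more.

    #include <stdio.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1.0e-9;
    }

    static volatile long sink;
    static void probe(void) { sink++; }   /* placeholder measurement action */

    int main(void)
    {
        const long n = 1000000;
        volatile double x = 0.0;

        double t0 = now();
        for (long i = 0; i < n; i++) { x += i; }
        double base = now() - t0;

        t0 = now();
        for (long i = 0; i < n; i++) { probe(); x += i; probe(); }
        double instrumented = now() - t0;

        printf("approx. overhead per probe pair: %.1f ns\n",
               (instrumented - base) / n * 1.0e9);
        return 0;
    }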

Profiling

Recording of aggregated information
  – counts, time, ...
... about program and system entities
  – functions, loops, basic blocks, ...
  – processes, threads
Methods
  – event-based sampling (indirect, statistical)
  – direct measurement (deterministic)
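The essence of profiling is aggregation: one fixed-size record per observed entity, updated on every measurement rather than logged individually. A minimal sketch of such a record, with purely illustrative types and names:

    #define MAX_FUNCS 256

    struct profile_record {
        long   calls;            /* count                          */
        double exclusive_time;   /* time in the region itself      */
        double inclusive_time;   /* time including called regions  */
    };

    static struct profile_record profile[MAX_FUNCS];

    /* called by the measurement layer when a region (function) exits */
    void profile_update(int func_id, double excl, double incl)
    {
        profile[func_id].calls++;
        profile[func_id].exclusive_time += excl;
        profile[func_id].inclusive_time += incl;
    }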

Inclusive and Exclusive Profiles

Performance with respect to code regions
  – exclusive measurements: the region only
  – inclusive measurements: include child regions

    int foo()
    {                /* inclusive duration of foo: the entire body, including bar() */
        int a;
        a = a + 1;   /* exclusive duration: foo's own work, excluding bar()         */
        bar();       /* time in bar() counts toward foo's inclusive time and toward
                        bar()'s own profile, not toward foo's exclusive time        */
        a = a + 1;
        return a;
    }

Flat and Callpath Profiles

Static call graph
  – shows all parent-child calling relationships in a program
Dynamic call graph
  – reflects the calling relationships that actually occur at execution time
Flat profile
  – performance metrics for when an event is active
  – exclusive and inclusive
Callpath profile
  – performance metrics for a calling path (event chain)
  – differentiates performance with respect to program execution state
  – exclusive and inclusive
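The difference between a flat and a callpath profile is only the key under which measurements are aggregated. The sketch below (illustrative only; real tools use compact callpath identifiers rather than strings) builds the callpath key from the stack of currently active events.

    #include <stdio.h>
    #include <string.h>

    static const char *stack[64];
    static int depth;

    static void callpath_key(char *out, size_t len)
    {
        out[0] = '\0';
        for (int i = 0; i < depth; i++) {
            if (i) strncat(out, " => ", len - strlen(out) - 1);
            strncat(out, stack[i], len - strlen(out) - 1);
        }
    }

    static void region_enter(const char *name)
    {
        stack[depth++] = name;
        char key[256];
        callpath_key(key, sizeof key);
        printf("flat key '%s'  /  callpath key '%s'\n", name, key);
    }

    static void region_exit(void) { depth--; }

    int main(void)
    {
        region_enter("main");
        region_enter("solve");
        region_enter("MPI_Send");
        region_exit(); region_exit(); region_exit();
        return 0;
    }

For the innermost region this prints the flat key MPI_Send and the callpath key main => solve => MPI_Send, which is exactly the extra execution-state context a callpath profile provides.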

Tracing Measurement

Instrumented code on each process emits timestamped trace records to a monitor:

    Process A:                          Process B:
      void master {                       void worker {
        trace(ENTER, 1);                    trace(ENTER, 2);
        ...                                 ...
        trace(SEND, B);                     trace(RECV, A);
        send(B, tag, buf);                  recv(A, tag, buf);
        ...                                 ...
        trace(EXIT, 1);                     trace(EXIT, 2);
      }                                   }

Event definitions: 1 = master, 2 = worker, 3 = ...

Merged trace stream (MONITOR):
    timestamp  process  event  argument
    58         A        ENTER  1
    60         B        ENTER  2
    62         A        SEND   B
    64         A        EXIT   1
    68         B        RECV   A
    69         B        EXIT   2
    ...

Tracing Analysis and Visualization

[Diagram: the merged trace stream above rendered as a timeline, one horizontal bar per process showing which region (main, master, worker) is active over time, with an arrow from process A to process B representing the send/recv message.]

Trace Formats

Different tools produce different formats
  – differ by event types supported
  – differ by ASCII and binary representations
Examples
  – Vampir Trace Format (VTF)
  – KOJAK (EPILOG)
  – Jumpshot (SLOG-2)
  – Paraver
  – Open Trace Format (OTF)
      supports interoperation between tracing tools

Profiling / Tracing Comparison

Profiling
  + finite, bounded performance data size
  + applicable to both direct and indirect methods
  – loses the time dimension (not entirely)
  – lacks the ability to fully describe process interaction
Tracing
  + temporal and spatial dimension to performance data
  + captures parallel dynamics and process interaction
  – some inconsistencies with indirect methods
  – unbounded performance data size (large)
  – complex event buffering and clock synchronization

Performance Problem Solving Goals

Answer questions at multiple levels of interest
  – high-level performance data spanning dimensions
      machine, applications, code revisions, data sets
      examine broad performance trends
  – data from low-level measurements
      used to predict application performance
Discover general correlations
  – between performance and features of the external environment
  – identify primary performance factors
Benchmarking analysis for application prediction
Workload analysis for machine assessment

Performance Analysis Questions

  – How does performance vary with different compilers?
  – Is poor performance correlated with certain OS features?
  – Has a recent change caused unanticipated performance?
  – How does performance vary with MPI variants?
  – Why is one application version faster than another?
  – What is the reason for the observed scaling behavior?
  – Did two runs exhibit similar performance?
  – How are performance data related to application events?
  – Which machines will run my code the fastest, and why?
  – Which benchmarks predict my code performance best?

Automatic Performance Analysis

[Diagram: build the application (capturing build information), execute it (capturing environment and performance data), store both in a performance database, run offline analysis, and return simple analysis feedback to the developer, e.g., "72% faster!".]

Performance Data Management

Performance diagnosis and optimization involves multiple performance experiments.
Support for common performance data management tasks augments tool use
  – performance experiment data and metadata storage
  – performance database and query
What type of performance data should be stored?
  – parallel profiles or parallel traces
  – storage size will dictate the choice
  – experiment metadata helps in meta-analysis tasks
Serves tool integration objectives

Metadata Collection

Integration of metadata with each parallel profile
  – separate information from performance data
Three ways to incorporate metadata
  – measured hardware/system information
      CPU speed, memory in GB, MPI node IDs, ...
  – application instrumentation (application-specific)
      application parameters, input data, domain decomposition
      capture arbitrary name/value pairs and save with the experiment
  – data management tools can read additional metadata
      compiler flags, submission scripts, input files, ...
      before or after execution
Enhances analysis capabilities
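A sketch of application-level metadata capture, using TAU's TAU_METADATA macro as an example of recording arbitrary name/value pairs with an experiment (the macro name is taken from the TAU documentation; the surrounding function and parameters are hypothetical):

    #include <stdio.h>
    #include <TAU.h>

    void record_run_metadata(int nx, int ny, const char *input_file)
    {
        char value[64];

        snprintf(value, sizeof value, "%d x %d", nx, ny);
        TAU_METADATA("domain decomposition", value);   /* application parameter */
        TAU_METADATA("input data", input_file);        /* input description     */
    }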

Performance Data Mining

Conduct parallel performance analysis in a systematic, collaborative, and reusable manner
  – manage performance complexity and automate the process
  – discover performance relationships and properties
  – multi-experiment performance analysis
Data mining applied to parallel performance data
  – comparative, clustering, correlation, characterization, ...
  – large-scale performance data reduction
Implement an extensible analysis framework
  – abstraction / automation of data mining operations
  – interface to existing analysis and data mining tools

How to explain performance?

Should not just redescribe performance results
Should explain performance phenomena
  – What are the causes of the performance observed?
  – What are the factors and how do they interrelate?
  – performance analytics, forensics, and decision support
Add knowledge to do more intelligent things
  – automated analysis needs good, informed feedback
  – performance model generation requires interpretation
Performance knowledge discovery framework
  – integrating meta-information
  – knowledge-based performance problem solving

Metadata and Knowledge Role

[Diagram: to understand a performance result, you have to capture the source code, build environment, run environment, and execution context (application and machine). Combining this context knowledge with performance knowledge is what makes performance problems explainable.]

Performance Optimization Process

Performance characterization
  – identify major performance contributors
  – identify sources of performance inefficiency
  – utilize timing and hardware measures
Performance diagnosis (performance debugging)
  – look for conditions of performance problems
  – determine if conditions are met and their severity
  – what and where are the performance bottlenecks
Performance tuning
  – focus on dominant performance contributors
  – eliminate main performance bottlenecks

POINT Project

"High-Productivity Performance Engineering (Tools, Methods, Training) for NSF HPC Applications"
  – NSF SDCI, Software Improvement and Support
  – University of Oregon, University of Tennessee, National Center for Supercomputing Applications, Pittsburgh Supercomputing Center
POINT project
  – Petascale Productivity from Open, Integrated Tools

Motivation

Promise of HPC through scalable scientific and engineering applications
Performance optimization through effective performance engineering methods
  – performance analysis / tuning "best practices"
Productive petascale HPC will require
  – robust parallel performance tools
  – training good performance problem solvers

Objectives

Robust parallel performance environment
  – uniformly available across NSF HPC platforms
Promote performance engineering
  – training in performance tools and methods
  – leverage NSF TeraGrid EOT
Work with petascale application teams
Community building

Challenges

Consistent performance tool environment
  – tool integration, interoperation, and scalability
  – uniform deployment across NSF HPC platforms
Useful evaluation metrics and process
  – make performance engineering part of the code development routine
  – record performance engineering history
Develop a performance engineering culture
  – proceed beyond "hand holding" engagements

Performance Engineering Levels

Target different performance tool users
  – different levels of expertise
  – different performance problem solving needs
Level 0 (entry)
  – simpler tool use, limited performance data
Level 1 (intermediate)
  – more tool sophistication, increased information
Level 2 (advanced)
  – access to powerful performance techniques

POINT Project Organization

[Diagram: project organization chart; the testbed applications are ENZO, NAMD, and NEMO3D.]

Parallel Performance Technology

  – PAPI: University of Tennessee, Knoxville
  – PerfSuite: National Center for Supercomputing Applications
  – TAU Performance System: University of Oregon
  – KOJAK / Scalasca: Research Centre Juelich
  – Vampir and VampirTrace: T.U. Dresden

Parallel Engineering Training

User engagement
User support in TeraGrid
Training workshops
Quantify tool impact
POINT lead pilot site
  – Pittsburgh Supercomputing Center
  – NSF TeraGrid site

Testbed Applications

ENZO
  – adaptive mesh refinement (AMR), grid-based hybrid code (hydro + N-body) designed to do simulations of cosmological structure formation
NAMD
  – mature community parallel molecular dynamics application deployed for research in large-scale biomolecular systems
NEMO3D
  – quantum mechanics based simulation tool created to provide quantitative predictions for nanometer-scale semiconductor devices