John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit.

Slides:



Advertisements
Similar presentations
Performance Analysis and Optimization through Run-time Simulation and Statistics Philip J. Mucci University Of Tennessee
Advertisements

Copyright © SoftTree Technologies, Inc. DB Tuning Expert.
1 JuliusC A practical Approach to Analyze Divide-&-Conquer Algorithms Speaker: Paolo D'Alberto Authors: D'Alberto & Nicolau Information & Computer Science.
Professor Michael J. Losacco CIS 1150 – Introduction to Computer Information Systems Programming and Languages Chapter 13.
Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.
The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.
Robert Bell, Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science.
Snick  snack A Working Computer Slides based on work by Bob Woodham and others.
Computer Concepts 5th Edition Parsons/Oja Page 546 CHAPTER 11 Software Engineering Section A PARSONS/OJA Computer Programming.
1 Programming Languages b Each type of CPU has its own specific machine language b But, writing programs in machine languages is cumbersome (too detailed)
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
1 Dr. Frederica Darema Senior Science and Technology Advisor NSF Future Parallel Computing Systems – what to remember from the past RAMP Workshop FCRC.
Chapter 10 Application Development. Chapter Goals Describe the application development process and the role of methodologies, models and tools Compare.
Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,
Web-Enabling the Warehouse Chapter 16. Benefits of Web-Enabling a Data Warehouse Better-informed decision making Lower costs of deployment and management.
Virtualization Concept. Virtualization  Real: it exists, you can see it.  Transparent: it exists, you cannot see it  Virtual: it does not exist, you.
Computer Organization
Robert Fowler John Mellor-Crummey Nathan Tallent Gabriel Marin Department of Computer Science Rice University Performance Tuning on Modern Systems: Tools.
Center for Research on Multicore Computing (CRMC) Overview Ken Kennedy Rice University
Michelle Mills Strout OpenAnalysis: Representation- Independent Program Analysis CCA Meeting January 17, 2008.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
Overview of the Course Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University.
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
CS 594:SCIENTIFIC COMPUTING FOR ENGINEERS PERFORMANCE ANALYSIS TOOLS: PART III Gabriel Marin Includes slides from John Mellor-Crummey.
 To explain the importance of software configuration management (CM)  To describe key CM activities namely CM planning, change management, version management.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Copyright©2008 N.AlJaffan®KSU1 Chapter 11 Information system development and programming language.
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
4.2.1 Programming Models Technology drivers – Node count, scale of parallelism within the node – Heterogeneity – Complex memory hierarchies – Failure rates.
Profiling Memory Subsystem Performance in an Advanced POWER Virtualization Environment The prominent role of the memory hierarchy as one of the major bottlenecks.
Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Lawrence Livermore National Laboratory, P. O. Box 808, Livermore,
What if Performance Tools Were Really Easy to Use? Rob Fowler Center for High Performance Software Rice University.
CISC Machine Learning for Solving Systems Problems Presented by: Alparslan SARI Dept of Computer & Information Sciences University of Delaware
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
1 Optimizing compiler tools and building blocks project Alexander Drozdov, PhD Sergey Novikov, PhD.
McGraw-Hill/Irwin Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 4 Computer Software.
Statistical Analysis of Inlining Heuristics in Jikes RVM Jing Yang Department of Computer Science, University of Virginia.
Distributed Information Systems. Motivation ● To understand the problems that Web services try to solve it is helpful to understand how distributed information.
1 EECS 6083 Compiler Theory Based on slides from text web site: Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
1 Oracle Enterprise Manager Slides from Dominic Gélinas CIS
Debugging parallel programs. Breakpoint debugging Probably the most widely familiar method of debugging programs is breakpoint debugging. In this method,
HPCToolkit Evaluation Report Hans Sherburne, Adam Leko UPC Group HCS Research Laboratory University of Florida Color encoding key: Blue: Information Red:
Compiler Construction (CS-636)
Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan.
Connections to Other Packages The Cactus Team Albert Einstein Institute
1 University of Maryland Runtime Program Evolution Jeff Hollingsworth © Copyright 2000, Jeffrey K. Hollingsworth, All Rights Reserved. University of Maryland.
Introduction to OOP CPS235: Introduction.
CISC Machine Learning for Solving Systems Problems Presented by: Eunjung Park Dept of Computer & Information Sciences University of Delaware Solutions.
Other Tools HPC Code Development Tools July 29, 2010 Sue Kelly Sandia is a multiprogram laboratory operated by Sandia Corporation, a.
High throughput biology data management and data intensive computing drivers George Michaels.
Profiling/Tracing Method and Tool Evaluation Strategy Summary Slides Hung-Hsun Su UPC Group, HCS lab 1/25/2005.
Programming 2 Intro to Java Machine code Assembly languages Fortran Basic Pascal Scheme CC++ Java LISP Smalltalk Smalltalk-80.
Compilers: History and Context COMP Outline Compilers and languages Compilers and architectures – parallelism – memory hierarchies Other uses.
Kai Li, Allen D. Malony, Sameer Shende, Robert Bell
Java Programming: From the Ground Up
Assembler, Compiler, MIPS simulator
Chapter 1 Introduction.
Chapter 9 – Real Memory Organization and Management
Chapter 1 Introduction.
Web Applications Security What are web Applications?
Performance Analysis, Tools and Optimization
Chapter 4 Computer Software.
Overview of the Course Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University.
Computer Science I CSC 135.
CMSC 611: Advanced Computer Architecture
Calpa: A Tool for Automating Dynamic Compilation
CMSC 611: Advanced Computer Architecture
Introduction to Virtual Machines
Introduction to Virtual Machines
Performance Hardware – If I Ran the World
Presentation transcript:

John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit : Multi-platform Tools for Performance Analysis

2 The Big Picture Long-term: compiler and architecture research requires detailed performance understanding — identify performance bottlenecks in complex applications — understand the mismatch between application needs and architecture capabilities — automate strategies for performance improvement Short-term result: programmer-accessible tools for understanding application performance

3 Performance Analysis and Tuning Increasingly necessary — gap between typical and peak performance is growing Increasingly hard — complex architectures are harder to program effectively –deeply-pipelined microprocessors VLIW or superscalar –complex memory hierarchy non-blocking, multi-level caches — large-scale scientific applications pose challenges for tools

4 LACSI HPCToolkit Support large, multi-lingual applications — a mix of of Fortran, C, C++ — hundreds of thousands of lines, many procedures — external libraries Eliminate manual labor from run, analyze tune cycle — use optimized application binaries directly –no: manual instrumentation, build process changes, recompilation Platform, language, and compiler independence — emphasis on LANL ASC Platforms (Origin, AlphaServer, Opteron) — multiple data sources  cross platform comparisons Scalable data collection Effective presentation of analysis results — intuitive, top-down user interface –hierarchical program structure with loop level metrics

5 HPCToolkit System Overview application source application source profile execution performance profile performance profile binary object code binary object code compilation linking binary analysis program structure program structure interpret profile source correlation hyperlinked database hyperlinked database hpcviewer

6 HPCToolkit System Overview profile execution performance profile performance profile application source application source binary object code binary object code compilation linking binary analysis program structure program structure interpret profile source correlation hyperlinked database hyperlinked database hpcviewer — launch unmodified, optimized application binaries — collect statistical profiles of events of interest

7 HPCToolkit System Overview profile execution performance profile performance profile application source application source binary object code binary object code compilation linking binary analysis program structure program structure interpret profile source correlation hyperlinked database hyperlinked database hpcviewer — decode instructions and combine with profile data

8 HPCToolkit System Overview profile execution performance profile performance profile application source application source binary object code binary object code compilation linking binary analysis program structure program structure interpret profile source correlation hyperlinked database hyperlinked database hpcviewer — extract loop nesting information from executables

9 HPCToolkit System Overview profile execution performance profile performance profile application source application source binary object code binary object code compilation linking binary analysis program structure program structure interpret profile source correlation hyperlinked database hyperlinked database hpcviewer — synthesize new metrics by combining metrics — relate metrics, structure, and program source

10 HPCToolkit System Overview profile execution performance profile performance profile application source application source binary object code binary object code compilation linking binary analysis program structure program structure interpret profile source correlation hyperlinked database hyperlinked database hpcviewer — support top-down analysis with interactive viewer — analyze results anytime, anywhere

11 HPCViewer Screenshot MetricsNavigation Annotated Source View

12 Impact on LANL Code Teams HPCToolkit deployed on Origin — improved SAGE by 2x on one example (see next slide) First performance workshop (Feb 03) — Feedback: needed on Q, smaller DB on large codes — Improvements: Sophisticated support for Alpha/Tru64 platform, new Java browser using compact database Second performance workshop (July 03) — Feedback: ready to use, binary analysis too slow on large codes — Improvement: sped up binary analysis on large codes by 30x HPCToolkit deployed on secure machines (July 03) — used to evaluate FLAG for ASCI burn code review (Aug 03) Ongoing interactions — Feedback: better support for shared libraries and Opteron — Improvement: new support for shared libraries installed on Q — Ongoing work: LANL/Rice collaboration for Opteron support

13 Sage Solver Performance Improvement

14 Future Collect and present dynamic context — what path gets us to expensive computations — accurate call-graph profiling of unmodified binaries — analysis and presentation of dynamic context to explain performance –solver is slow only when called on non-preconditioned matrices –MPI wait cost is incurred in the backsolve Statistical clustering — effective analysis of large collections of processes Performance diagnosis — why rather than what