"Performance Modeling of Component Assemblies with TAU"
Nick Trebon, Alan Morris, Jaideep Ray, Sameer Shende, Allen Malony
Department of Computer and Information Science, Performance Research Laboratory, University of Oregon
Sandia National Laboratories

Component Performance Modeling with TAU (CompFrame, Jun. 23)

Outline
- Motivation
- Introduction and Background
- Performance Measurement in HPC Component Environments
- Performance Measuring and Modeling Infrastructure
  - Proxies
  - TAU component
  - Mastermind
- Component Assembly Optimization
- Conclusions

Motivation
- Given a set of components, where each component has multiple implementations, what is the optimal subset of implementations to solve a given problem?
  - How do we model a single component?
  - How do we create a global model from a set of component models?
  - How do we select the optimal subset of implementations?
- From a performance perspective, a component by itself has no meaning; a component needs a context. The context is affected by:
  - The problem being solved
  - Parameters (e.g., the size of an array)
  - Mismatched data structures

Performance in HPC Component Environments
- Traditional role of performance measurement and modeling:
  - An analysis-and-optimization phase (e.g., porting a stable code base to a new architecture)
  - A performance model is used to predict scalability
- In a component environment:
  - Applications are dynamically composed at runtime
  - Application developers typically do not implement all of their own components
  - Performance measurement needs to be non-intrusive
  - Users are interested in coarse-grained performance

What does performance mean?
- Given a problem (characterized by a tuple P), what time T_e does a component C need to solve it? That is, T_e = f(P); what is f?
- To create a performance model f(P), we need:
  - T_e = execution time of a method call
  - T_m = execution time of message-passing calls within the method
  - T_c = compute time for the method (T_c = T_e - T_m)
  - The input parameters that affect performance (e.g., the size of an array)
- For our purposes, we start with simplifying assumptions:
  - Blocking communication, with no overlap of communication and computation
  - Disk I/O is ignored
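The timing decomposition above can be sketched as a small record type. This is an illustration only; the field names and structure are assumptions, not the actual Mastermind data layout.

```python
from dataclasses import dataclass, field

@dataclass
class MethodTiming:
    """Per-invocation timing record for one component method call.

    Field names are hypothetical, chosen to mirror the slide's notation:
    t_e is total execution time, t_m is message-passing time.
    """
    t_e: float                      # T_e: execution time of the method call
    t_m: float                      # T_m: time spent in message-passing calls
    params: dict = field(default_factory=dict)  # parameters affecting performance

    @property
    def t_c(self) -> float:
        # Compute time is what remains after messaging: T_c = T_e - T_m
        return self.t_e - self.t_m

rec = MethodTiming(t_e=2.5, t_m=0.5, params={"n": 1024})
print(rec.t_c)  # 2.0
```

A model f(P) would then be fit over many such records collected for varying values of the parameters in `params`.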

How do we measure performance?
- We need to "instrument" the code, but the instrumentation must be non-intrusive
- What kind of measurement infrastructure can achieve this?
- Previous research suggests proxies: a proxy intercepts and forwards method calls

CCA Performance Infrastructure
The proxy measurement system infrastructure consists of:
- Proxy
  - Lightweight: simply a switch that turns measurement on and off
  - One proxy per component
- Tuning and Analysis Utilities (TAU) component
  - Utilizes the TAU measurement library
  - Provides a measurement port
  - Responsible for making the measurements
- Mastermind component
  - Responsible for gathering, storing, and reporting measurement data (timing data from TAU as well as input parameters from proxies)
  - Queries the TAU component for method-level measurements

Proxy
- A proxy uses and provides the same ports that the actual component provides
- It also uses a MonitorPort
- It identifies performance-dependent parameters

(Diagram: before, component C1 calls C2 directly; after, proxy P2 sits between C1 and C2 and reports to the Mastermind, MM.)
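The interception idea can be sketched in a few lines: a proxy object exposes the same port interface as the real component, forwards each call, and records the timing and performance-dependent parameters. This is a sketch of the concept, not the generated CCA proxy code; `SolverPort` and the record layout are hypothetical.

```python
import time

class SolverPort:
    """Stand-in for a component's provides port (hypothetical interface)."""
    def solve(self, n):
        return sum(i * i for i in range(n))

class MeasuringProxy:
    """Provides the same port it uses: forwards each call to the real
    component, timing it and capturing performance-dependent parameters.
    The monitor list stands in for the MonitorPort / Mastermind."""
    def __init__(self, component, monitor):
        self._component = component
        self._monitor = monitor

    def solve(self, n):
        start = time.perf_counter()
        result = self._component.solve(n)      # forward the intercepted call
        elapsed = time.perf_counter() - start
        self._monitor.append({"method": "solve",
                              "params": {"n": n},
                              "t_e": elapsed})
        return result

records = []
proxy = MeasuringProxy(SolverPort(), records)
proxy.solve(1000)
print(len(records))  # 1
```

Because the proxy satisfies the same interface, the caller C1 is unchanged: measurement is switched on or off simply by wiring either the proxy or the bare component into the assembly.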

Automatic Proxy Generation
- A tool based on the Program Database Toolkit (University of Oregon)
- One proxy is created per port

Mastermind
- A record is created for each instrumented routine; for each invocation it stores:
  - Measurement data (e.g., execution time, communication time, cache hits)
  - Input parameters
- Currently, the Mastermind outputs all records at application completion
- In the future, the Mastermind could perhaps output a performance model for a given component (based on a linear regression)
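The regression the slide speculates about could be as simple as a least-squares fit of execution time against one input parameter. A minimal sketch, assuming hypothetical record keys `n` (parameter) and `t_e` (execution time):

```python
def fit_linear_model(records):
    """Ordinary least-squares fit of t_e ~ a*n + b over invocation records.

    Record keys are illustrative; a real Mastermind model would likely
    handle multiple parameters, not just one.
    """
    xs = [r["n"] for r in records]
    ys = [r["t_e"] for r in records]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Synthetic records that follow t_e = 0.002*n + 0.1 exactly.
recs = [{"n": n, "t_e": 0.002 * n + 0.1} for n in (100, 200, 400, 800)]
a, b = fit_linear_model(recs)
```

The fitted (a, b) pair is then a per-component, per-context model: evaluating a*n + b predicts the execution time for a new problem size n.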

TAU Component
- The TAU component is a wrapper around the TAU library
- Provides access to timers that measure execution time and communication time
- Also provides access to hardware metrics (e.g., cache hits) via external libraries such as PAPI or PCL

TAU Performance System Architecture
(Architecture diagram slide; no transcript text.)

Using performance timings to select optimal components
- To find an optimal solution, the solution space must first be reduced by eliminating "insignificant" components
- A two-step heuristic:
  - Are the children, as a group, insignificant relative to their parent?
  - Is an individual node insignificant relative to its siblings?
- Optimize the reduced core for an approximately optimal solution
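The two-step heuristic can be sketched as a recursive prune over a call tree of inclusive times. The node shape (`name`, `time`, `children`) and the exact pruning rules are assumptions for illustration; the slides only name the two tests, not their precise form.

```python
def prune_insignificant(node, threshold=0.10):
    """Two-step pruning heuristic over an inclusive-time call tree.

    Step 1: if the children, as a group, account for less than `threshold`
    of the parent's time, drop them all. Step 2: drop any individual child
    whose time is insignificant relative to its siblings' combined time.
    """
    children = node.get("children", [])
    if not children:
        return node
    child_total = sum(c["time"] for c in children)
    # Step 1: children as a group insignificant to the parent?
    if child_total < threshold * node["time"]:
        return {**node, "children": []}
    # Step 2: prune nodes insignificant among their siblings, then recurse.
    kept = [c for c in children if c["time"] >= threshold * child_total]
    return {**node, "children": [prune_insignificant(c, threshold) for c in kept]}

tree = {"name": "root", "time": 100.0, "children": [
    {"name": "big",  "time": 80.0, "children": []},
    {"name": "tiny", "time": 2.0,  "children": []},
]}
pruned = prune_insignificant(tree)
print([c["name"] for c in pruned["children"]])  # ['big']
```

The surviving nodes form the reduced core over which implementation selection is then optimized.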

Case Study Example
- Core identification was run on a shock hydrodynamics simulation developed at Sandia National Laboratories
- 10% thresholds were used
- The original call graph of 18 nodes was reduced to 8 nodes

Conclusions
- The proxy-based measurement system allows non-intrusive measurement of components
- A single component may have multiple performance models, based on different contexts
- Eliminating "insignificant" components eases the identification of an approximately optimal solution

Future Work
- Synthesize a composite performance model from individual component models
- Generalize performance models (e.g., parameterize models by processor speed and cache model to make them architecture independent)
- Model representation (XML?)
- Quality of service
- Dynamic implementation selection