Parallel Performance Wizard: A Generalized Performance Analysis Tool
Hung-Hsun Su, Max Billingsley III, Seth Koehler, John Curreri, Alan D. George



PPW Overview

- Computationally intensive parallel applications are constantly being developed in many scientific fields using parallel programming paradigms such as:
  - Message passing: MPI, etc.
  - Partitioned Global Address Space (PGAS): Unified Parallel C (UPC), SHMEM, Co-array Fortran (CAF), Titanium, etc.
  - Reconfigurable Computing (RC) systems and other non-traditional paradigms
- Performance optimization is often needed to minimize an application's overall execution time
- Performance analysis tools are very useful in this process, but existing tools support only a limited set of programming paradigms
- Parallel Performance Wizard (PPW) was originally designed and developed to improve the much-needed performance tool support for PGAS programming models
  - Global Address Space Performance (GASP) interface introduced
  - Version 1.0 released in April 2007
- Latest PPW updates & extensions include:
  - Redesigned framework to enable additional model/paradigm support with minimal effort
  - Automatic performance bottleneck detection
  - Enhanced Cray XT UPC support; HP UPC support coming very soon
- Version 1.1 available for download

Data Visualizations

- Timeline visualization (through export to Jumpshot) of Synthetic Aperture Radar MPI application using PPW
- Visualization representing time spent in N-Queens RC benchmark program
- Data transfer visualization of Synthetic Aperture Radar MPI application
- PGAS model-specific array distribution visualization of UPC NPB FT benchmark
- Tree table visualization of N-Queens RC benchmark program

Generalized Operation Types

- Previous versions of PPW (as with other tools) were largely model-dependent
  - Multiple versions of the same component (one per model) had to be developed in a very similar fashion
- However, constructs from different models behave very similarly to one another, and thus can be handled similarly by the tool
- Latest version of PPW takes advantage of a generalized operation type abstraction
  - Model constructs are classified into one of the pre-defined operation types
  - Components are categorized into model-dependent or model-independent parts
- With this redesign in place, new programming model support can be added to PPW in a relatively small amount of time
  - In most cases, adding new model support can be achieved by performing:
    - Classification of model constructs
    - Implementation of instrumentation and bottleneck resolution units
  - MPI support was added in a matter of months (as opposed to years)
- Pre-defined operation types:
  - Data exchange: one-sided (put / get); two-sided (send / receive)
  - Pair-wise synchronization: lock manipulation; wait on remote (fence, join)
  - Group-wise synchronization: sub-group (barrier, collectives); global (barrier, collectives)
  - Local processing: work distribution (for-all); user functions & I/O operations

Automatic Bottleneck Detection

- An automatic bottleneck detection feature is desirable for a performance analysis tool because:
  - Novice users often do not know where they should concentrate their efforts
  - Performance data generated by long-running or complex applications can be difficult to visualize and understand
- A new post-mortem bottleneck detection approach is currently being developed for PPW
  - Performs data filtering at various stages to minimize execution time
  - Detection mechanism is parallelizable (each node performs analysis semi-independently)
    - Potential speedup for large applications
    - Performance data from all nodes need not be merged
  - Operates using the generalized operation type abstraction
    - New operation type-specific detection mechanisms to identify known bottleneck classes
    - Potential to support analysis of multi-model applications (those that use two or more models)

RC Application Performance Analysis

- Instrumentation and measurement of both CPUs and FPGAs, towards a unified performance tool for RC systems
  - Automated instrumentation of hardware & software for ease of use
  - Runtime storage & transfer of performance data for continued monitoring of performance
  - Configurable profiling, tracing, and sampling in hardware to complement software data
  - Low overhead (application can run at or near full speed to improve accuracy of results)
- Visualization of performance data in tables, charts, and timeline views
  - Allows for strategic instrumentation and measurement from hardware and software
  - Enables a cohesive view of system performance in order to facilitate locating performance bottlenecks
  - Provides useful information to aid the designer in fixing bottlenecks
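
To make the generalized operation type abstraction and the per-node bottleneck detection idea more concrete, the following is a minimal Python sketch. It is not PPW's implementation. The construct names are real UPC, SHMEM, and MPI calls, but the mapping table, the function names (`classify`, `time_by_type`, `flag_bottlenecks`), the event format, and the 25% threshold heuristic are all assumptions made for illustration. The point is that once constructs are classified, the analysis code no longer needs to know which model an event came from, and each node can analyze its own data without merging.

```python
from collections import defaultdict
from enum import Enum


class OpType(Enum):
    """Operation types taken from the poster's classification table."""
    ONE_SIDED = "data exchange: one-sided (put / get)"
    TWO_SIDED = "data exchange: two-sided (send / receive)"
    LOCK = "pair-wise sync: lock manipulation"
    WAIT_ON_REMOTE = "pair-wise sync: wait on remote (fence, join)"
    SUBGROUP_SYNC = "group-wise sync: sub-group (barrier, collectives)"
    GLOBAL_SYNC = "group-wise sync: global (barrier, collectives)"
    WORK_DISTRIBUTION = "local processing: work distribution (for-all)"
    USER_OR_IO = "local processing: user functions & I/O"


# Hypothetical, model-dependent part: one small table per model.
# The assignments below are illustrative, not PPW's actual classification.
CONSTRUCT_TYPES = {
    ("UPC", "upc_memput"): OpType.ONE_SIDED,
    ("UPC", "upc_memget"): OpType.ONE_SIDED,
    ("UPC", "upc_lock"): OpType.LOCK,
    ("UPC", "upc_fence"): OpType.WAIT_ON_REMOTE,
    ("UPC", "upc_barrier"): OpType.GLOBAL_SYNC,
    ("UPC", "upc_forall"): OpType.WORK_DISTRIBUTION,
    ("SHMEM", "shmem_putmem"): OpType.ONE_SIDED,
    ("SHMEM", "shmem_barrier_all"): OpType.GLOBAL_SYNC,
    ("MPI", "MPI_Send"): OpType.TWO_SIDED,
    ("MPI", "MPI_Recv"): OpType.TWO_SIDED,
    ("MPI", "MPI_Barrier"): OpType.GLOBAL_SYNC,
}

# Types treated as local work rather than communication/synchronization.
LOCAL_TYPES = {OpType.WORK_DISTRIBUTION, OpType.USER_OR_IO}


def classify(model, construct):
    """Map a model construct to an operation type; unknown names count as user code."""
    return CONSTRUCT_TYPES.get((model, construct), OpType.USER_OR_IO)


def time_by_type(events):
    """Model-independent part: sum seconds per operation type for one node.

    `events` is an iterable of (model, construct, seconds) tuples.
    """
    totals = defaultdict(float)
    for model, construct, seconds in events:
        totals[classify(model, construct)] += seconds
    return dict(totals)


def flag_bottlenecks(events, threshold=0.25):
    """Flag non-local operation types whose share of this node's time exceeds
    `threshold` (an assumed heuristic). Uses only this node's events."""
    totals = time_by_type(events)
    total = sum(totals.values())
    if total == 0:
        return []
    flagged = [t for t, s in totals.items()
               if t not in LOCAL_TYPES and s / total > threshold]
    return sorted(flagged, key=lambda t: t.name)


# Events recorded on two different nodes, from two different models.
upc_node = [("UPC", "upc_memget", 1.0), ("UPC", "upc_barrier", 3.0),
            ("UPC", "compute_fft", 6.0)]
mpi_node = [("MPI", "MPI_Send", 2.0), ("MPI", "MPI_Barrier", 2.0)]

for name, events in (("upc_node", upc_node), ("mpi_node", mpi_node)):
    print(name, [t.name for t in flag_bottlenecks(events)])
```

In this sketch, supporting another model only means adding entries to `CONSTRUCT_TYPES`; `time_by_type` and `flag_bottlenecks` stay unchanged, which mirrors the split between model-dependent and model-independent components described above.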