Tools for applications improvement George Bosilca.

Slides:



Advertisements
Similar presentations
Profiler In software engineering, profiling ("program profiling", "software profiling") is a form of dynamic program analysis that measures, for example,
Advertisements

D. Tam, R. Azimi, L. Soares, M. Stumm, University of Toronto Appeared in ASPLOS XIV (2009) Reading Group by Theo 1.
Topics covered: Memory subsystem CSE243: Introduction to Computer Architecture and Hardware/Software Interface.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Systems Analysis, Prototyping and Iteration Systems Analysis.
1 Architectural Complexity: Opening the Black Box Methods for Exposing Internal Functionality of Complex Single and Multiple Processor Systems EECC-756.
Copyright © 2003, SAS Institute Inc. All rights reserved. Where's Waldo Uncovering Hard-to-Find Application Killers Claire Cates SAS Institute, Inc
Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.
Extensible Networking Platform 1 Liquid Architecture Cycle Accurate Performance Measurement Richard Hough Phillip Jones, Scott Friedman, Roger Chamberlain,
The Path to Multi-core Tools Paul Petersen. Multi-coreToolsThePathTo 2 Outline Motivation Where are we now What is easy to do next What is missing.
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
MPI and C-Language Seminars Seminar Plan  Week 1 – Introduction, Data Types, Control Flow, Pointers  Week 2 – Arrays, Structures, Enums, I/O,
G Robert Grimm New York University Pulling Back: How to Go about Your Own System Project?
Copyright © 1998 Wanda Kunkle Computer Organization 1 Chapter 2.1 Introduction.
G Robert Grimm New York University Pulling Back: How to Go about Your Own System Project?
Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.
MPI Program Performance. Introduction Defining the performance of a parallel program is more complex than simply optimizing its execution time. This is.
1 Presenter: Chien-Chih Chen Proceedings of the 2002 workshop on Memory system performance.
1 A survey on Reconfigurable Computing for Signal Processing Applications Anne Pratoomtong Spring2002.
1 1 Profiling & Optimization David Geldreich (DREAM)
AMOST Experimental Comparison of Code-Based and Model-Based Test Prioritization Bogdan Korel Computer Science Department Illinois Institute of Technology.
Statistical Performance Analysis for Scientific Applications Presentation at the XSEDE14 Conference Atlanta, GA Fei Xing Haihang You Charng-Da Lu July.
IVEC: Off-Chip Memory Integrity Protection for Both Security and Reliability Ruirui Huang, G. Edward Suh Cornell University.
PMaC Performance Modeling and Characterization Performance Modeling and Analysis with PEBIL Michael Laurenzano, Ananta Tiwari, Laura Carrington Performance.
A Framework for Tracking Memory Accesses in Scientific Applications Antonio J. Peña Pavan Balaji Argonne National Laboratory
Is Out-Of-Order Out Of Date ? IA-64’s parallel architecture will improve processor performance William S. Worley Jr., HP Labs Jerry Huck, IA-64 Architecture.
A Time Predictable Instruction Cache for a Java Processor Martin Schoeberl.
Planned AlltoAllv a clustered approach Stephen Booth (EPCC) Adrian Jackson (EPCC)
Martin Schulz Center for Applied Scientific Computing Lawrence Livermore National Laboratory Lawrence Livermore National Laboratory, P. O. Box 808, Livermore,
Computing Infrastructure for Large Ecommerce Systems -- based on material written by Jacob Lindeman.
RPROM , 2002 Lassi A. Tuura, Northeastern University Using Valgrind Lassi A. Tuura Northeastern University,
RAM, PRAM, and LogP models
1 Components of the Virtual Memory System  Arrows indicate what happens on a lw virtual address data physical address TLB page table memory cache disk.
Performance Monitoring Tools on TCS Roberto Gomez and Raghu Reddy Pittsburgh Supercomputing Center David O’Neal National Center for Supercomputing Applications.
® IBM Software Group © 2006 IBM Corporation PurifyPlus on Linux / Unix Vinay Kumar H S.
CSC 7600 Lecture 28 : Final Exam Review Spring 2010 HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS FINAL EXAM REVIEW Daniel Kogler, Chirag Dekate.
Software Overview Environment, libraries, debuggers, programming tools and applications Jonathan Carter NUG Training 3 Oct 2005.
Investigating Adaptive Compilation using the MIPSpro Compiler Keith D. Cooper Todd Waterman Department of Computer Science Rice University Houston, TX.
Timing Programs and Performance Analysis Tools for Analysing and Optimising advanced Simulations.
Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.
CSCI Rational Purify 1 Rational Purify Overview Michel Izygon - Jim Helm.
Replicating Memory Behavior for Performance Skeletons Aditya Toomula PC-Doctor Inc. Reno, NV Jaspal Subhlok University of Houston Houston, TX By.
Measuring Memory using valgrind CSCE 221H Parasol Lab, Texas A&M University.
Perseus Design. 2 Lockheed Martin and Government Use Only Architecture Behavioral “signatures” are extracted from a baseline execution Prototype will.
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
Programmability Hiroshi Nakashima Thomas Sterling.
Bronis R. de Supinski and Jeffrey S. Vetter Center for Applied Scientific Computing August 15, 2000 Umpire: Making MPI Programs Safe.
1 University of Maryland Runtime Program Evolution Jeff Hollingsworth © Copyright 2000, Jeffrey K. Hollingsworth, All Rights Reserved. University of Maryland.
A Software Performance Monitoring Tool Daniele Francesco Kruse March 2010.
Vertical Profiling : Understanding the Behavior of Object-Oriented Applications Sookmyung Women’s Univ. PsLab Sewon,Moon.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
CSE598c - Virtual Machines - Spring Diagnosing Performance Overheads in the Xen Virtual Machine EnvironmentPage 1 CSE 598c Virtual Machines “Diagnosing.
Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.
LIOProf: Exposing Lustre File System Behavior for I/O Middleware
HP-SEE Valgrind Usage Josip Jakić Scientific Computing Laboratory Institute of Physics Belgrade The HP-SEE initiative.
Confessions of a Performance Monitor Hardware Designer Workshop on Hardware Performance Monitor Design HPCA February 2005 Jim Callister Intel Corporation.
PERFORMANCE OF THE OPENMP AND MPI IMPLEMENTATIONS ON ULTRASPARC SYSTEM Abstract Programmers and developers interested in utilizing parallel programming.
July 10, 2016ISA's, Compilers, and Assembly1 CS232 roadmap In the first 3 quarters of the class, we have covered 1.Understanding the relationship between.
??? ple r B Amulya Sai EDM14b005 What is simple scalar?? Simple scalar is an open source computer architecture simulator developed by Todd.
Kernel Code Coverage Nilofer Motiwala Computer Sciences Department
Debugging Memory Issues
Valgrind, the anti-Alzheimer pill for your memory problems
In-situ Visualization using VisIt
Characterization of Parallel Scientific Simulations
Understanding Performance Counter Data - 1
Chapter 4: Threads.
Architectural Interactions in High Performance Clusters
Determining the Accuracy of Event Counts - Methodology
What Are Performance Counters?
Dynamic Binary Translators and Instrumenters
Presentation transcript:

Tools for applications improvement George Bosilca

Performance vs. Correctness »MPI is a corner-stone library for most of the HPC applications »Delivering high performance require a fast and reliable MPI library »It require a well designed application »And well tested libraries …

GNU coverage tool »Describe a fine grained execution of the application »The applications is split in atomic blocks and information is gathered about each of the blocks. »How often each line of code is executed »Which lines are actually executed »How long each block take

GNU coverage tool »Allow to write tests that cover all the possible execution path »Give information about the most expensive parts/blocks of the code »Is this a very useful information ?! »Require a exhaustive knowledge of the application architecture to be able to provide useful information

GNU profiling tool »Provide information about: »How many times each function has been executed »How percentage of the total time has been spent inside each function »Dependencies between functions »Call-graph of the execution »Each sample counts as 0.01 seconds. » % cumulative self self total time seconds seconds calls us/call us/call name » mca_btl_mvapi_component_progress » ompi_group_free » ompi_errcode_get_mpi_code » right_rotate » mca_mpool_sm_free » mca_coll_sm_allreduce_intra

Valgrind »Suite of tools for debugging and profiling Linux programs. »Memcheck: detect erroneous memory accesses »Use of uninitialized memory »Reading/writing memory after it has been free’d »Reading/writing off the end of malloc’d blocks »Reading/writing inappropriate areas on the stack »Memory leaks »Passing of uninitialized and/or unaddressible memory to system calls »Mismatched use of malloc/new/new[] vs. free/delete/delete[]

Valgrind

»Cachegrind »Cache profiler »Detailed simulation of all levels of cache »Accurately pinpoint the source of cache misses in the code. »Statistics: number of cache misses, memory references and instructions executed for each line of code, function, module and program.

Valgrind »Callgrind »It’s more than just profiling !!! »Automatically instrument at runtime the application and gather information about the call-graphs and the timings »A visual version of gprof

Callgrind

The power of the 2P »Further improvement of the MPI library require an detailed understanding of the lifetime cycle of the point-to-point operations »There are 2 parts: »All local overheads on the sender and the receiver »And the network overhead/latency »However, we cannot improve the network latency …

Peruse »A standard to-be ? »An MPI extension for revealing unexposed implementation information… »Similar to PAPI (who expose processor counters) but exposing MPI request events »PERUSE is : A set of events tracing the lifetime of an MPI request

Peruse

PAPI »Use the processor counters to give information about the execution »Only few of then are interesting for the MPI library »Cache misses »Instruction counters »TLB »Cache disturbances

The 2P »Mixing PERUSE and PAPI »At each step in the lifetime of a request we gather: »Cache misses »Instruction counter »Therefore we can compute accurately the cost of each of the steps

The 2P »PERUSE will be included in the main Open MPI trunk shortly »The PAPI extension is in a quite advanced state »This is still work in progress …

Does it really work ?

Conclusion »Performance always start from the first design step »A bad design decision have a persistent impact on the overall performance of the application »Not always easy to achieve especially when several peoples are involved »But there is hope: »Now we have the tools required to work on

Open MPI multi-rail

»Actual algorithm depend only on the priority »Similar to collectives we plan to use a model to predict the »Message size from where multi-rail make sense »Determine the size of each segment based on the latency and the bandwidth of the network.