Performance Analysis, Tools and Optimization


Philip J. Mucci and Kevin S. London
University of Tennessee, Knoxville
ARL MSRC Users’ Group Meeting, September 2, 1998

PET, UT and You
- Training
- Environments
- Benchmarking
- Evaluation and Reviews
- Consulting
- Development

Training
- Courses on Benchmarking, Performance Optimization, Parallel Tools
- Provides a good mechanism for technology transfer
- Develop needs and direction from the interaction with the user community
- Tremendous knowledge base from which to draw

Environments
- Use of the MSRC environments provides
  - Bug reports to the vendor
  - System tuning
  - System administrator support
  - Analysis of software needs
  - Performance evaluation
  - Researcher access to advanced hardware

Performance Understanding
- In order to optimize we must understand
  - Why is our code performing a certain way?
  - What can be done about it?
  - How well can we do?
- Results in confidence, efficiency and better code development
- Time spent is an investment in the future

Tool Evaluation
- Ptools Consortium
- Review of available performance tools, particularly parallel
- Regular reports are issued
- Tools that we find useful get presented to the developers in training or consultation
- Installation, testing and training
- Example: VAMPIR for scalability analysis

Optimization Course
- Course focuses on compiler options, available tools and single processor performance
- Single processor performance is the biggest bottleneck for many codes, especially cache performance
- Why? Link speeds have increased to within an order of magnitude of memory bandwidths
- Also, MPI and language specific issues

Benchmarks
- CacheBench: performance of the memory hierarchy
- MPBench: performance of core MPI operations
- BLASBench: performance of dense numerical kernels
- Intended to provide an orthogonal set of low-level benchmarks with which we can parameterize codes

Cache Performance

Cache Performance
- Tuning for caches is difficult without some understanding of computer architecture
- No way to really know what’s in the cache at a given point in an application
- Factor of 2-4 performance increase is common
- Develop a tool to help identify regions in the source code, down to a specific reference

Cache Simulator
- Profiling the code reveals cache problems
- Automated instrumentation of offending routines via a GUI or by hand
- Link with simulator library
- Make architecture configuration file
- Addresses are traced and simulated
- Miss locations are recorded and reports are generated

PerfAPI
- A standardized interface to hardware performance counters
- Easily usable by application engineers as well as tool developers
- Intended for
  - Performance tools
  - Evaluation
  - Modeling
- Watch http://www.cs.utk.edu/~mucci/pdsa

High Performance Debugger
- Industry-wide lack of good debugging support for parallel programs
- TotalView is expensive and GUI only
- Bandwidth is often not available off-site
- Based on dbx and gdb as backends
- Uses p2d2 from NASA as a framework
- Standardized, familiar command-line interface

MPI Connect
- Connects separate MPI jobs with PVM
- 3 function calls to enroll
- Uses include
  - Metacomputing with Vendor MPI
  - Dynamic and Fault Tolerant MPI jobs now

The Future
- BYOC Workshops
- Regular Training Schedule
- Web Based Training
- Consulting
- Cross-MSRC Information Exchange
- Technology Transfer
- Tool development

Origin 2000 Performance Prescription
- Always use dplace on all codes
- Always use -LNO:cache_size2=4096
- For accuracy compile and link with -O2 -IPA -SWP:=ON -LNO -TENV:X=0-5
- or -Ofast=ip27 -OPT:roundoff=0-3 -OPT:IEEE_arithmetic=1-3

Origin 2000 Performance Prescription
- In Fortran, the first (leftmost) array index should change fastest, i.e. in the innermost loop
- Use functions in -lcomplib.sgimath or -lscs -lfastm -lm
- Use the nonblocking MPI_Ixxxx primitives
- Always execute IRECV early

Vampir Timeline Display

Vampir Global Activity Chart

Identifying a Message in Vampir

Identifying a Message in Vampir

Nupshot Display