
The Performance Evaluation Research Center (PERC)
Patrick H. Worley, ORNL

Participating Institutions:
- Argonne Natl. Lab.
- Lawrence Berkeley Natl. Lab.
- Lawrence Livermore Natl. Lab.
- Oak Ridge Natl. Lab.
- Univ. of California, San Diego
- Univ. of Maryland
- Univ. of North Carolina
- Univ. of Tennessee, Knoxville

Website:

Fishing for Performance

"Give a person a fish; you have fed them for today. Teach a person to fish; and you have fed them for a lifetime." (Chinese proverb)

PERC builds fishing equipment (poles, nets, and dynamite) and develops new fishing techniques for performance analysis, optimization, and modeling. We also do a little teaching and a little fishing.

PERC Mission:
- Develop a science of performance.
- Engineer tools for performance analysis and optimization.

New Fishing Techniques

Performance Monitoring
- Flexible performance instrumentation systems
- Performance data management infrastructure
- Tools to tie performance data to the user's source code

Performance Modeling
- Convolution schemes to generate models from independent application and system characterizations
- Statistical and data-fitting models
- Hierarchical phase models
- Performance bounds

Performance Optimization
- Source code analysis
- Self-tuning software
- Runtime adaptability

Fishing Equipment

Tools being developed within PERC:
- PAPI: cross-platform interface to hardware performance counters (see the sketch below)
- MetaSim Tracer: tool for acquiring operation counts and memory address stream information
- CONE: call-graph profiler for MPI applications
- mpiP: lightweight, scalable profiling tool for MPI
- MPIDTRACE: a communications (MPI and I/O) tracer
- Performance Assertions: performance expressions for triggering actions at runtime in application codes
- Dyninst: framework for modifying programs as they run
- dsd: memory access pattern identification tool
- EXPERT: automatic analysis of performance problems in traces
- CUBE: analysis framework for hierarchical performance data
- MetaSim Convolver: performance prediction tool for computational phases
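As a concrete illustration of the kind of measurement PAPI enables, the minimal sketch below uses PAPI's low-level C API to count instructions and cycles around a region of interest. The event choices and the placeholder loop being measured are assumptions for illustration; real usage depends on which counters a given platform exposes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[2];          /* one slot per event added below */
    double sum = 0.0;

    /* Initialize the library and create an event set. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return EXIT_FAILURE;
    }
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_INS);   /* total instructions */
    PAPI_add_event(event_set, PAPI_TOT_CYC);   /* total cycles */

    PAPI_start(event_set);

    /* Placeholder for the code region being measured. */
    for (int i = 0; i < 10000000; i++)
        sum += (double)i * 1.0e-7;

    PAPI_stop(event_set, counts);

    printf("instructions = %lld, cycles = %lld (sum = %f)\n",
           counts[0], counts[1], sum);
    return EXIT_SUCCESS;
}
```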

Fishing Equipment

Tools being developed within PERC:
- Active Harmony: software architecture for runtime tuning of distributed applications
- SvPablo: graphical environment for instrumenting source code and analyzing performance data
- ROSE: compiler framework for recognizing and optimizing high-level abstractions
- PBT: tool for generating performance bounds from source code

Tools being leveraged in PERC research:
- DIMEMAS: network simulator used to examine the impact of varying bandwidth, latency, topology, and computational performance
- TAU: tool suite for automatic instrumentation, data gathering, experiment multiplexing, and performance data storage and analysis
- ATOM: binary instrumentation package for Alpha systems
- PIN: binary instrumentation package for Intel systems

Recent Research Results

Selected recent PERC research highlights:
- Scalability analysis in SvPablo
- Identifying and exploiting regularity in memory access patterns using dsd
- Convolution-based performance modeling

SvPablo Scalability Analysis

Features:
- Automatic generation of scalability data based on SvPablo performance files
- Detection of how bottlenecks move as the number of processors changes
- Scalability analysis at the source code level
- Easy-to-use graphical user interface correlated with the main SvPablo GUI

Example: Enhanced Virginia Hydrodynamics One (EVH1)
- All source code in Fortran 90
- Intertask communication via MPI

Execution environment:
- IBM SP at LBNL/NERSC
- Test runs on 16, 32, 64, 128, 256, and 512 processors

Scalability Analysis: EVH1

[Figure: SvPablo scalability display. Colored boxes mark inefficiency, each column represents one test run, and a line graph shows scalability across the executions.]

Scalability Analysis: EVH1

The major routine sweepy scales very well across the executions.

Exploiting Memory Access Patterns

dsd: regularity measurement tool (a conceptual stride-regularity sketch follows this slide)
- Instrumentation
  - PAPI for identifying bottlenecks
  - DynInst for transitory instrumentation
  - User control of detection overhead
  - Binary instrumentation only (source code not needed)
- Applied to several codes
  - Fortran and C
  - SPEC and NAS benchmarks, and production codes
- Regularity data used to guide optimizations
  - 4% improvement in gzip
  - 5% improvement in FT
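The slides do not show how dsd computes regularity. As a rough, hypothetical illustration of what a regularity metric over a memory address stream might look like, the sketch below reports the fraction of consecutive accesses that repeat the previous stride. It is a conceptual stand-in only, not dsd's actual algorithm.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Fraction of accesses whose stride matches the immediately preceding
 * stride: a crude "regularity" score for an address stream.
 * (Illustrative only; dsd's real metric and implementation differ.) */
double stride_regularity(const uintptr_t *addr, size_t n)
{
    if (n < 3)
        return 0.0;

    size_t regular = 0;
    ptrdiff_t prev_stride = (ptrdiff_t)(addr[1] - addr[0]);

    for (size_t i = 2; i < n; i++) {
        ptrdiff_t stride = (ptrdiff_t)(addr[i] - addr[i - 1]);
        if (stride == prev_stride)
            regular++;
        prev_stride = stride;
    }
    return (double)regular / (double)(n - 2);
}

int main(void)
{
    /* A perfectly strided stream: every access 8 bytes after the last. */
    uintptr_t stream[16];
    for (size_t i = 0; i < 16; i++)
        stream[i] = 0x1000 + 8 * i;

    printf("regularity = %.2f\n", stride_regularity(stream, 16));  /* prints 1.00 */
    return 0;
}
```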

Metrics and Optimizations

Optimization of gzip Based on Regularity

- Code restructuring
  - Targets TLB misses
  - Alternate array access order in fill_window (illustrated in the sketch below)
  - Results in the table below
- Prefetching (SGI)
  - 14% improvement for fill_window
  - Overall 4% improvement
- Sensitivity of regularity
  - fill_window invariant
  - Additional guidance

[Table: change in TLB misses (%) and memory stalls (%) for fill_window, deflate, longest_match, compress_block, send_bits, updcrc, ct_tally, and Total; the numeric values were not preserved in this transcript.]
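The actual fill_window change is not shown on the slide. The toy example below only illustrates the general idea of altering array access order so that a 2-D array is walked in memory-contiguous order, which reduces TLB and cache misses; it is an illustrative assumption, not gzip's code.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Column-major traversal of a row-major C array: each inner-loop step
 * jumps N doubles ahead, touching a new page frequently (TLB-unfriendly). */
void scale_column_order(double a[N][N], double s)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] *= s;
}

/* Restructured access order: the inner loop walks contiguous memory,
 * so consecutive accesses stay on the same page and cache line. */
void scale_row_order(double a[N][N], double s)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] *= s;
}

int main(void)
{
    /* Heap allocation: N*N doubles (~128 MB) is too large for the stack. */
    double (*a)[N] = malloc(sizeof(double[N][N]));
    if (!a)
        return EXIT_FAILURE;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = 1.0;

    scale_row_order(a, 2.0);      /* memory-friendly traversal */
    scale_column_order(a, 0.5);   /* TLB-unfriendly traversal, for comparison */

    printf("a[0][0] = %f\n", a[0][0]);
    free(a);
    return EXIT_SUCCESS;
}
```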

Convolution-based Performance Modeling

Follows an automated "recipe" to generate the model (a minimal sketch follows this list):
- Application characteristics gathered with MetaSim Tracer
  - Number and types of primitive instructions (flops, loads, stores)
  - Memory access patterns and ranges (yielding cache hit rates)
  - Communication patterns (MPI primitives) and message sizes
  - I/O patterns and sizes
- System characteristics measured with PMaC HPC Benchmark Suite probes
  - EFF_BW
  - I/O Bench
  - MAPS
  - MAPS_CG
  - MAPS_Ping
  - PEAK
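To make the idea of convolving an application signature with machine measurements concrete, the hedged sketch below estimates per-phase execution time by dividing traced memory-operation counts by a measured rate chosen from the cache hit rate, and traced message volume by a measured network bandwidth. The structure, field names, and numbers are assumptions for illustration; the real MetaSim Convolver recipe is considerably more elaborate.

```c
#include <stdio.h>

/* Per-phase application signature, as might be gathered by a tracer. */
struct phase_sig {
    double mem_ops;        /* loads + stores executed in this phase  */
    double l1_hit_rate;    /* fraction of memory ops hitting cache   */
    double msg_bytes;      /* bytes sent via MPI in this phase       */
};

/* Machine profile, as might be measured by memory/network probes. */
struct machine {
    double cache_rate;     /* memory ops per second from cache       */
    double mem_rate;       /* memory ops per second from main memory */
    double net_bw;         /* network bandwidth in bytes per second  */
};

/* Convolve one phase's signature with the machine profile: time spent
 * on memory operations plus time spent moving message bytes. */
double predict_phase(const struct phase_sig *p, const struct machine *m)
{
    double mem_time = p->mem_ops * p->l1_hit_rate / m->cache_rate
                    + p->mem_ops * (1.0 - p->l1_hit_rate) / m->mem_rate;
    double net_time = p->msg_bytes / m->net_bw;
    return mem_time + net_time;
}

int main(void)
{
    struct machine m = { 2.0e9, 2.0e8, 1.0e8 };   /* illustrative rates only */
    struct phase_sig phases[2] = {
        { 5.0e9, 0.95, 2.0e7 },
        { 1.0e9, 0.60, 8.0e7 },
    };

    double total = 0.0;
    for (int i = 0; i < 2; i++)
        total += predict_phase(&phases[i], &m);

    printf("predicted run time: %.2f s\n", total);
    return 0;
}
```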

Performance Predictions for POP Ocean Code

Performance Predictions for POP Ocean Code

[Figure: results reported in seconds per simulation day.]

Performance Predictions for HYCOM Ocean Code

Results of "blind" predictions using the convolution recipe (predictions made by PMaC, confirmed by a third party at DoD HPCMO). Errors in the model were reduced below 10% in all cases post-mortem.

Application | Test Case | HPC System | Processors | Predicted (sec) | Actual (sec) | Error
HYCOM       | Standard  | IBM P3     | 59         | 4,558           | 4,359        | 4.6%
HYCOM       | Standard  | IBM P4     | 59         | 2,081           | 1,…          | …
HYCOM       | Standard  | HP SC45    | 59         | 2,382           | 1,…          | …
HYCOM       | Standard  | IBM P3     | 96         | 2,605           | 2,…          | …
HYCOM       | Standard  | IBM P4     | 96         | 1,205           | 1,…          | …
HYCOM       | Standard  | HP SC45    | 96         | 1,432           | 1,378        | 3.9%
HYCOM       | Large     | IBM P3     | 234        | 5,612           | 5,311        | 5.7%
HYCOM       | Large     | IBM P4     | 234        | 2,692           | 2,…          | …
HYCOM       | Large     | HP SC45    | 234        | 3,043           | 3,…          | …

Fish Stories (Recent Application Impacts)

PERC analysis and optimization techniques have recently led to the following (easily quantifiable) performance improvements:
- AGILE-BOLTZTRAN (TSI): performance improved up to 55% on IBM by eliminating redundant computations
- ZEUS-MP (TSI): single-processor performance improved by 90% on IBM by changing compiler options
- GS2 (PMP): performance improved by 300% on IBM by changing the data decomposition
- POP (CCSM): performance improved over 300% on Cray by modifying communication algorithms and correcting OS performance problems
- CAM (CCSM): performance improved by 10-30% from new load-balancing schemes (in addition to the performance optimizations reported last year)

Runtime Optimization of GS2

GS2 is a SciDAC code developed to study low-frequency turbulence in magnetized plasma. Typical uses:
- Assess the microstability of plasmas
- Calculate key properties of the turbulence
- Simulate turbulence in plasmas that occur in nature

Initial analysis:
- IBM Power3 (NERSC Seaborg: 8 nodes, 16 processors/node)
- Runtime adaptation using Active Harmony
- Performance (execution time) improved by changing the data layout (see the sketch below)
  - 55.06 s => 16.25 s (without collisions)
  - 71.08 s => 31.55 s (with collisions)

Work in progress:
- Tuning time spent in communication
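Active Harmony's client API is not shown on the slide. The sketch below only illustrates the general shape of runtime adaptation: time the work under each candidate data layout, keep the best, and continue the run with it. All names and the exhaustive search are hypothetical placeholders standing in for Harmony's search strategy and for the application's real layout switch.

```c
#include <stdio.h>
#include <time.h>

#define NUM_LAYOUTS 4

/* Hypothetical stand-in for the application's work under each candidate
 * data layout; a real code would redistribute its arrays instead. */
static void compute_step(int layout)
{
    volatile double x = 0.0;
    long iters = 1000000L * (layout + 1);   /* layouts differ in cost */
    for (long i = 0; i < iters; i++)
        x += 1.0e-9;
}

int main(void)
{
    double best_time = 1.0e30;
    int best_layout = 0;

    /* Naive exhaustive search standing in for a tuning framework's
     * strategy: time one step under each layout and keep the fastest. */
    for (int layout = 0; layout < NUM_LAYOUTS; layout++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        compute_step(layout);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double elapsed = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) * 1.0e-9;
        printf("layout %d: %.3f s\n", layout, elapsed);

        if (elapsed < best_time) {
            best_time = elapsed;
            best_layout = layout;
        }
    }

    /* The remaining iterations of the run would use best_layout. */
    printf("continuing with layout %d\n", best_layout);
    return 0;
}
```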

Evolution of Ocean Model Performance

Performance analysis motivated code modifications and identified OS optimizations that increased performance by more than a factor of 3 over the May 7 baseline.

Evolution of Atmospheric Model Performance (as of April 2003)

Last year's results were for a 64x128x26 grid and 3 advected fields. Since then, the resolution has increased to 128x256x26, with 11 advected fields and parameterizations for many new processes.

Evolution of Atmospheric Model Performance (as of March 2004)

The new resolution decreased the simulation rate significantly. Previous optimizations continue to be important, but new opportunities for optimization arise as the science in the code evolves and as the code is ported to new platforms.

Using Visualization to Identify Load Imbalance

Graphs show processor utilization on the Cray X1 before and after enabling load balancing in the Community Atmosphere Model. The visualization was used to motivate adopting a load-balancing option that had not been useful on other platforms.

Fishing Lessons

The PERC tutorial was presented at ScicomP'03 and SC'03, and will be taught again at Sigmetrics/Performance at Columbia University, N.Y., during the week of June 12th. PERC will also be happy to present the tutorial at SciDAC project meetings. The tutorial can also be downloaded from the PERC website:

Looking to the Future: The Massively Parallel Challenge

Systems featuring 10,000+ CPUs present daunting challenges for performance analysis and tools:
- What performance phenomena should we measure?
- How can we collect and manage the performance data spewed out by tens of thousands of CPUs?
- How can we visualize performance phenomena on 10,000+ CPUs?
- How can we identify bottlenecks in these systems?

Intelligent, highly automated performance tools, applicable over a wide range of system sizes and architectures, are needed.

Looking to the Future: Benchmarking and Modeling

- How can a center meaningfully procure a system with 10,000+ CPUs, a system 10 to 100 times more powerful than any system currently in existence?
- How can we define a benchmark that provides meaningful results for systems spanning four orders of magnitude in size?
- Can we assess the programming effort required to achieve these extra-high levels of performance?

Reliable performance modeling techniques, usable with modest effort and expertise, offer the best hope for a solution here. The DARPA High-Productivity Computing Systems (HPCS) program is also examining benchmarks for future high-end systems.

Looking to the Future: System Simulation

"Computational scientists have become expert in simulating every phenomenon except the systems they run on." -- speaker at a recent HPC workshop

System simulations have heretofore been used sparingly in system studies because of the great cost and the difficulty of parallelizing such simulations. Such simulations are now feasible, thanks to:
- The availability of large-scale parallel systems
- Developments in the field of parallel discrete event simulation

Looking to the Future: User-Level Automatic Tuning

Self-tuning software technology has already been demonstrated in a few large-scale libraries:
- FFTW (MIT)
- LAPACK-ATLAS (Univ. of Tennessee)

Near term: adapt these techniques to a wider group of widely used scientific libraries.
Mid term: automatically incorporate simple performance models into user application codes.
Future goal: automatically incorporate simple run-time tests, using compiler technology, into user application codes.

Working with PERC

For further information: