
The Ghost in the Machine: Observing the Effects of Kernel Operation on Parallel Application Performance

Aroon Nataraj, Alan Morris, Allen Malony, Matthew Sottile, Pete Beckman
Department of Computer and Information Science, University of Oregon
Mathematics and Computer Science Division, Argonne National Laboratory

2 | Observing the Effects of Kernel Operation on Parallel Application Performance | SC 2007

Outline of the Talk
 Part A: Operating System Interference & Us
   Introduction to OS/R Interference
   Research Questions & Our Contribution
   Contrasting with State-of-the-Art Noise Measurement
 Part B
   The (K)TAU Performance System
   Fine-grained OS/App Performance Correlation
   The Noise-Effect Estimation Technique Described
 Part C
   Evaluating Accuracy at Scale
   Demonstrating Noise Estimation
   Conclusion & Future Plans

Introduction to Operating System Interference

Parallel Applications' Noise Propagation
 Simple parallel application: phases of alternating communication & reduction
 Waiting time due to OS overhead accumulates (Local cycles: 3, Global cycles: 9)
 What is the cause of variability?
 [Figure: timelines of processes repeating Compute / w (wait) / Collective phases, with an "os" noise event inserted into one process's compute phase]
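The propagation effect above can be illustrated with a minimal simulation sketch (all parameters here are hypothetical, not measurements from the talk): in a bulk-synchronous application, each phase ends in a collective, so every process waits for the slowest one, and a noise event on any single rank delays all of them.

```python
import random

def simulate_bsp(procs, phases, compute=1.0, noise_prob=0.1,
                 noise_len=0.5, seed=42):
    """Toy model of a bulk-synchronous app: in each phase every process
    computes and then enters a collective, which completes only when the
    slowest process arrives, so noise on one rank delays everyone."""
    rng = random.Random(seed)
    global_time = 0.0             # runtime seen by the whole job
    local_time = [0.0] * procs    # time each rank spent on its own work
    for _ in range(phases):
        # per-rank phase cost: compute plus an occasional OS noise event
        costs = [compute + (noise_len if rng.random() < noise_prob else 0.0)
                 for _ in range(procs)]
        for p, c in enumerate(costs):
            local_time[p] += c
        global_time += max(costs)  # the collective waits for the slowest rank
    return global_time, max(local_time)

global_time, worst_local = simulate_bsp(procs=64, phases=100)
```

With 64 ranks and a 10% per-rank noise probability, nearly every phase has at least one noisy rank, so the global runtime grows well beyond even the worst single rank's local time: locally small noise accumulates as global waiting time.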

Is OS Noise a Problem? When?
 Earlier work has shown significant OS/R interference problems
   Petrini et al., SC'03: ASCI Q, P = 4,096; SAGE took twice as long as predicted by the performance model
   Jones et al., SC'03: IBM SP; degraded the linear scaling of allreduce(); cause identified as OS/R interference
 Noise distribution matters
   Agarwal et al., HiPC'05 (theoretical approach): heavy-tailed & Bernoulli noise are the most detrimental
   Beckman et al., CCJ'07 (noise emulation/injection): the maximum noise duration, for a given noise ratio, determines the noise effect on collectives
 Our own noise-injection experiments show noise can be a problem

Is OS Noise a Problem? When?
Effect of Noise Duration on Allreduce()
 Noise ratio (%noise) is not a good indicator by itself
 Noise-event duration significantly influences the effect
 Experiment: allreduce() tight loop; %noise fixed at 0.15%; noise-event duration increased
P. Beckman, K. Iskra, K. Yoshii, S. Coghlan, A. Nataraj, "Benchmarking the Effects of OS Interference on Extreme-Scale Parallel Machines," Cluster Computing Journal '07, special issue from CLUSTER'06 (to appear).
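The duration-vs-ratio observation can be reproduced with a small replay sketch (entirely hypothetical parameters, not the paper's benchmark): each process runs a tight loop of fixed work ending in a global synchronization, while periodic noise events with a fixed total ratio but varying duration are staggered across processes.

```python
def run_loop(procs, iters, work, noise_ratio, noise_len):
    """Tight allreduce-style loop under periodic per-process noise.
    Each process does `work` units of useful time per iteration; noise
    events of length `noise_len` recur with period noise_len/noise_ratio,
    staggered evenly (deterministically) across processes. Every iteration
    ends in a synchronization, so the clock advances to the slowest rank."""
    period = noise_len / noise_ratio
    offset = [p * period / procs for p in range(procs)]  # staggered phases

    def finish(p, start):
        # advance rank p's clock by `work` useful units from `start`,
        # stalling whenever a noise window [off + k*period, +noise_len) hits
        t, left = start, work
        while left > 0:
            k = (t - offset[p]) // period
            win_start = offset[p] + k * period
            if win_start <= t < win_start + noise_len:
                t = win_start + noise_len            # stalled by a noise event
            else:
                step = min(left, win_start + period - t)
                t += step
                left -= step
        return t

    clock = 0.0
    for _ in range(iters):
        clock = max(finish(p, clock) for p in range(procs))
    return clock

# same 0.15% noise ratio, short vs. long noise events
short_events = run_loop(procs=64, iters=200, work=1.0,
                        noise_ratio=0.0015, noise_len=0.01)
long_events = run_loop(procs=64, iters=200, work=1.0,
                       noise_ratio=0.0015, noise_len=5.0)
```

At the same fixed ratio, lengthening the events makes `long_events` noticeably larger than `short_events`: a long stall on any one rank holds up every synchronization, which is the qualitative behavior the injection experiments report.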

Is OS Noise a Problem? When?
Parallel Ocean Program (POP): Effect of Noise at Scale
 Yes: noise does affect (some) real applications at scale
 Different parts of the application have different sensitivities to noise
 Setup: POP, 1x problem size; BG/L, VN mode; P = 1024, 2048; injected noise: 1000 Hz, 16 usec (1.6% noise)
 [Figure: percent increase in runtime vs. number of processors, for Total, Baroclinic, and Barotropic at P = 1024 and 2048, where Total = Baroclinic + Barotropic]

Motivating a Direct Measurement Approach to Noise Estimation
 OS/R interference with parallel applications is a complex phenomenon
   Depends on the underlying platform (e.g. interconnect latencies, [*others?])
   Depends on the OS/R configuration & the resulting noise distributions
   Depends on parallel application behavior (computation grain, global interactions)
 Simulation, emulation, micro-benchmarking, modeling [*what else?]
   All are important parts of the performance tool-box
   But they need to be carefully parameterized to match the above dependencies
   Not all complex parallel applications have been modeled, and models/simulations can be too simple or wrongly parameterized
   Micro-benchmarks do not capture application complexity (intentionally)
 What is App.X's delay due to existing/intrinsic noise on Platform.Y?
   We need a general measurement technique to answer that question
   Complementary to the other approaches in the tool-box

Who Cares About Noise? What Are Their Concerns?
 System administrators / site managers:
   What are the noise characteristics of the new tera/peta-scale platform we are purchasing?
   Do the site's "champion" applications suffer significant noise slowdowns on it?
   Can these effects be mitigated by changes to the configuration (OS/R, H/W)?
 Application developers:
   Is my parallel application sensitive to noise? On a particular platform?
   Can I change application characteristics to reduce that sensitivity?
   Which parts and call sequences are most sensitive?
 OS/R developers:
   What are the noise sources in the OS/R?
   What are the effects of the different sources? Do all of them need to be removed?
 Not all of these questions are comparative, so we cannot simply look at differences in runtime: there is no "normal" (baseline) noise-less candidate to compare with

Our Key Contribution
 A general noise-effect estimation technique based on direct measurement, answering the performance questions:
   Q.1: Where does the OS/R noise on Platform.Y originate (sources)?
   Q.2: What is App.X's delay D due to existing OS/R interference on Platform.Y?
   Q.3: How do different OS/R noise sources contribute to delay D?
   Q.4: Which parts (operations, call sequences) of the application are sensitive?
 Applies to any X & Y (as long as X & Y allow the needed direct measurement)
 Works without requiring a "baseline" or "noise-less" candidate. Why?
   If a "noiseless" candidate existed, then where would the noise problem be?
   A "baseline" would make answering Q.2 trivial: no special technique, just runtimes
   Baselines can only answer comparative questions, i.e. the effect of adding/removing noise
 Wait a sec... can't existing work/methods already answer these questions?
   For Q.1, yes. But there is no general approach for Q.2 - Q.4. Let's look at some.

Contrasting with the State of the Art
 Petrini et al., SC'03: micro-benchmarks, simulation, and modeling to close the loop
   Very specific to the application (SAGE); an accurate model is difficult to produce
 Gioiosa et al., ISSIPIT'04: micro-benchmarks & measurement
   Identifies noise sources using OProfile sampling; quantifies only local noise
 Agarwal et al., HiPC'05: modeling
   Theoretical model used to study noise effects under different distributions
   Assumptions include balanced load, stationarity, balanced noise, identical noise
 Beckman et al., CLUSTER'06: noise emulation
   Injected artificial noise into micro-benchmarks to understand its effects
   Not applicable to measurement of real, existing noise
 Sottile et al., IPDPS'06: trace-based noise simulation
   Modifies application traces by adding artificial noise; compares the effect on runtime
 Nataraj et al., CLUSTER'06 / Europar'06: integrated measurement of OS / application using KTAU / TAU; detects & quantifies per-process OS interactions

Part B

The (K)TAU Performance System [* TODO]
 What is TAU? Features? Who uses it?
 What is KTAU? Why? ZeptoOS?
 Tight integration with TAU: why?

KTAU OS Metrics: How It Works
 [Figure: MPI Rank 0 (PID 1423) and MPI Rank 1 (PID 1430), each with the application in user space and a per-process KTAU performance state in the kernel. Kernel events (schedule, timer_interrupt, sys_read, sys_write, do_IRQ) update counters in the performance state, which are published to a user-level double-buffered metric container that the application reads via get_shared_counter().]
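The double-buffered container in the diagram can be sketched in miniature (a hypothetical Python stand-in; the real KTAU container is kernel-maintained per-process state): the writer, standing in for kernel instrumentation points such as schedule() or timer_interrupt, updates the inactive buffer and then flips the active index, so get_shared_counter() always reads a complete, consistent snapshot without locking.

```python
class DoubleBufferedCounters:
    """Sketch of a double-buffered metric container: updates go to the
    inactive buffer, then the active index flips, so a reader never sees
    a half-updated snapshot and never blocks the writer."""

    def __init__(self, metrics):
        self.buffers = [dict.fromkeys(metrics, 0), dict.fromkeys(metrics, 0)]
        self.active = 0  # index of the buffer readers currently use

    def update(self, metric, delta):
        """Writer side (kernel instrumentation point in the real design)."""
        inactive = 1 - self.active
        # copy the current snapshot, apply the update, then publish
        self.buffers[inactive] = dict(self.buffers[self.active])
        self.buffers[inactive][metric] += delta
        self.active = inactive  # flip: the new snapshot becomes visible

    def get_shared_counter(self, metric):
        """Reader side (the instrumented application)."""
        return self.buffers[self.active][metric]

ctr = DoubleBufferedCounters(["schedule", "timer_interrupt"])
for _ in range(3):
    ctr.update("timer_interrupt", 1)
ctr.update("schedule", 5)
```

Flipping an index instead of taking a lock keeps reads wait-free, which matters when the reader is the measured application itself and any blocking would perturb the measurement.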

Performance of KTAU OS Metrics & TAU MPI Tracing

Integrated User/OS Profiling
 Virtualized OS metrics per MPI rank | Detects & quantifies OS/R interactions
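What "detects & quantifies OS/R interactions per rank" amounts to can be sketched as a simple profile merge (all numbers and routine names below are hypothetical): for each MPI rank, kernel-level times gathered by KTAU are reported against the rank's total user-level runtime from TAU.

```python
# Hypothetical per-rank data: inclusive time (seconds) measured at user
# level (TAU) and inside the kernel (KTAU) for the same process.
user_profile = {"MPI_Allreduce": 12.0, "compute": 88.0}
os_profile = {"schedule": 1.4, "timer_interrupt": 0.8, "sys_read": 0.3}

def os_interaction_summary(user, kernel):
    """Merge user- and kernel-level profiles for one MPI rank, reporting
    each kernel routine's share of the rank's total runtime (the per-rank
    view an integrated KTAU/TAU profile provides)."""
    total = sum(user.values())
    return {name: {"secs": t, "pct_of_runtime": 100.0 * t / total}
            for name, t in sorted(kernel.items(), key=lambda kv: -kv[1])}

summary = os_interaction_summary(user_profile, os_profile)
```

Sorting by time puts the dominant OS interaction first, which is how such a view surfaces the heaviest noise sources per rank.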

Integrated User/OS Phase Profiling [* TODO]
 Can attribute OS noise to specific phases in MPI ranks
 Again, can detect & quantify, but at a finer grain, within phases
 [ToDo: collect data on garuda (or neuronic); easy to do]

Noise-Effect Estimation Technique Outlined: Example A
 Noise amplification with wait time: Process 1 sends, Process 2 receives
 [Figure: measured timeline vs. approximated (noise-removed) timeline, showing how a noise event on the sending process lengthens the receiver's wait, so the noise effect is amplified across processes]

Noise-Effect Estimation Technique Outlined: Example B
 Noise absorption with wait time: Process 1 sends, Process 2 receives
 [Figure: measured timeline vs. approximated (noise-removed) timeline, showing how existing wait time on the receiving process absorbs a noise event, so it causes little or no extra delay]
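Both examples can be made concrete with a tiny replay sketch (a hypothetical two-process trace format; the actual technique operates on measured traces): removing the recorded noise events and replaying the send/receive dependency yields the approximated noise-free timeline, and the estimated delay is measured minus approximated. Sender-side noise on the critical path shows up in full (amplification, Example A), while receiver-side noise hidden under existing wait time does not (absorption, Example B). The sketch simplifies by replaying the sender before the receiver.

```python
def replay(segments, noise_removed=False):
    """Replay a two-process trace: each process is a list of
    ('compute', dt), ('noise', dt), ('send',) or ('recv',) events.
    With noise_removed=True, noise segments are skipped, producing
    the approximated (noise-free) timeline. Returns the runtime."""
    t = [0.0, 0.0]
    send_time = None
    for p, evs in enumerate(segments):  # sender (p=0) replayed first
        for ev in evs:
            kind = ev[0]
            if kind == "noise" and noise_removed:
                continue  # drop measured noise in the approximation
            if kind in ("compute", "noise"):
                t[p] += ev[1]
            elif kind == "send":
                send_time = t[p]
            elif kind == "recv":
                # the receive completes only once the message has arrived
                t[p] = max(t[p], send_time)
    return max(t)

# Example A (amplification): noise on the sender delays the receiver too.
amp = [[("compute", 2.0), ("noise", 1.0), ("send",)],
       [("compute", 1.0), ("recv",), ("compute", 1.0)]]
# Example B (absorption): the receiver's wait-time slack hides its noise.
absorb = [[("compute", 2.0), ("send",)],
          [("compute", 1.0), ("noise", 0.5), ("recv",), ("compute", 1.0)]]

delay_a = replay(amp) - replay(amp, noise_removed=True)
delay_b = replay(absorb) - replay(absorb, noise_removed=True)
```

Here the sender's 1.0-unit noise event costs the whole job 1.0 unit (delay_a), while the receiver's 0.5-unit event is fully absorbed by its wait (delay_b is zero), matching the two slides.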

Noise-Effect Estimation Technique Outlined: Example C
 Multiple sources: combined noise effect
 [ToDo: have the picture; must place it here]

Prior Work on Trace-based Timeline Approximation
 Wolf, Malony, Shende, Morris | HPCC'06
   Uses timeline approximation from application traces
   Context: measurement perturbation compensation, not noise-effect estimation
 Sottile, Chandu, Bader | IPDPS'06
   Trace-based simulation: inject artificial noise into existing application traces
   The reverse of our current work: we remove real noise from the trace
 The current work is inspired by and builds upon the above two works

[* ToDo] Part C