Early Experiences with KTAU on the Blue Gene/L
A. Nataraj, A. Malony, A. Morris, S. Shende
Performance Research Lab, University of Oregon


Outline
 Introduction
 Motivations
 Objectives
 Architecture
 KTAU on Blue Gene/L
 Ongoing/Recent work
 Future work and directions
 Acknowledgements
 References
 Team

Introduction: ZeptoOS and TAU
DOE OS/RTS for Extreme Scale Scientific Computation (FastOS)
 Conduct OS research to provide an effective OS/Runtime for petascale systems
ZeptoOS (under FastOS)
 Scalable components for petascale architectures
 Joint project of Argonne National Lab and the University of Oregon
 ANL: putting a light-weight kernel (based on Linux) on BG/L and other platforms (XT3)
 University of Oregon: kernel performance monitoring and tuning
KTAU
 Integration of the TAU infrastructure in the Linux kernel
 Integration with ZeptoOS; installation on BG/L
 Port to 32-bit and 64-bit Linux platforms

ZeptoOS: The Small Linux for Big Computers
Research exploration
 What are the fundamental limits and advanced designs required for petascale operating system suites?
   Behaviour at large scales
   Management & optimization of OS suites
   Collectives
   Fault tolerance
   Measurement, collection and analysis of OS performance data from a large number of nodes
Strategy
 Modified Linux on BG/L I/O nodes: measure and understand behavior
 Modified Linux for BG/L compute nodes: measure and understand behavior
 Specialized I/O daemon on the I/O node (ZOID): measure and understand behavior
(ZeptoOS BG/L Symposium presentation slide reused with permission from Pete Beckman [beckman06-bgl])

ZeptoOS and KTAU
A lot of fine-grained OS measurement is required for each component of the ZeptoOS work:
 Exactly which aspects of Linux need to be changed to achieve the ZeptoOS goals?
 How and why do the various OS source and configuration changes affect parallel applications?
 How do we correlate performance data between the parallel application, the compute-node OS, the I/O daemon, and the I/O-node OS?
Enter TAU/KTAU: an integrated methodology and framework to measure the performance of applications and the OS kernel across a system like BG/L.

Motivation
Application performance = user-level execution performance + OS-level operations performance
Domains: time and hardware performance metrics
PAPI (Performance Application Programming Interface)
 Exposes virtualized hardware counters
TAU (Tuning and Analysis Utilities)
 Measures most user-level entities: parallel application, MPI, libraries, ...
 Time domain
 Uses PAPI to correlate counter information to source
But how do we correlate OS-level influences with application performance?
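To make the user-level, time-domain side of this concrete, here is a minimal sketch (not TAU's actual API; the decorator and the `profile` table are purely illustrative) of accumulating inclusive time per instrumented function, the kind of profile a TAU-like user-level tool builds:

```python
import time
from collections import defaultdict

# Illustrative sketch only: accumulate inclusive wall-clock time per
# instrumented function, the style of user-level, time-domain profile
# that TAU produces. Names here are made up, not TAU's real interface.
profile = defaultdict(float)

def instrumented(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            profile[func.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def compute(n):
    return sum(i * i for i in range(n))

compute(100000)
print(sorted(profile))  # ['compute']
```

Note the limitation this slide points at: a wrapper like this only sees user-level code; time the kernel spends on the process's behalf (or against it) is invisible to it.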

Motivation (continued)
As HPC systems continue to scale to larger processor counts:
 Application performance becomes more sensitive
 New OS factors become performance bottlenecks (e.g. [Petrini'03, Jones'03] and other work)
 Isolating these system-level issues as bottlenecks is non-trivial
This requires comprehensive performance understanding:
 Observation of all performance factors
 Their relative contributions and interrelationships
 Can we correlate them?
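Why OS factors that are negligible on one node become bottlenecks at scale can be shown with a toy simulation (entirely illustrative; the 1 ms work quantum, 5 ms noise event, and hit probability are made-up numbers): a barrier waits for the slowest rank, so noise that rarely hits any single rank almost always hits some rank once the rank count is large:

```python
import random

# Toy model: each rank does `work` ms of computation; with probability p
# it is delayed by a `noise` ms OS event. A barrier/collective waits for
# the slowest rank, so the chance that *some* rank was hit grows as
# 1 - (1 - p)**nranks, amplifying rare per-node noise at scale.
def barrier_time(nranks, p=0.01, work=1.0, noise=5.0, trials=500, seed=42):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(work + (noise if rng.random() < p else 0.0)
                     for _ in range(nranks))
    return total / trials

print(barrier_time(4))     # close to the pure work time of 1 ms
print(barrier_time(1024))  # dominated by the noise event
```

At 4 ranks the expected barrier time stays near 1 ms; at 1024 ranks virtually every barrier absorbs a full noise event, which is the effect [Petrini'03] and [Jones'03] observed in practice.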

Motivation (continued)
Program-OS interactions: direct vs. indirect entry points
 Direct: the application invokes the OS for certain services
   Syscalls (and internal OS routines called directly from syscalls)
 Indirect: the OS takes actions without explicit invocation by the application
   Preemptive scheduling
   (Hardware) interrupt handling
   OS background activity (keeping track of time and timers, bottom-half handling, etc.)
 Indirect interactions can occur at any OS entry (not just when entering through syscalls)
Direct interactions are easier to handle: synchronous with user code and in process context
Indirect interactions are more difficult: usually asynchronous and in interrupt context, hard to measure and harder to correlate/integrate with application measurements
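A direct interaction can be measured synchronously from user space precisely because the application itself issues the kernel entry; a minimal sketch (timing write() syscalls to /dev/null, purely illustrative):

```python
import os
import time

# Sketch of a *direct* program-OS interaction: the application explicitly
# enters the kernel via a syscall (here, write() to the null device), so
# the cost can be wrapped and measured synchronously in process context.
fd = os.open(os.devnull, os.O_WRONLY)
start = time.perf_counter()
for _ in range(1000):
    os.write(fd, b"x")          # each call is one direct kernel entry
elapsed = time.perf_counter() - start
os.close(fd)
print(f"1000 write() syscalls took {elapsed * 1e6:.0f} us")
```

Indirect interactions (a timer interrupt, a preemption) have no such wrappable call site in the application, which is why they need kernel-side instrumentation like KTAU.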

Motivation (continued)
Kernel-wide vs. process-centric views
Kernel-wide: aggregate kernel activity of all active processes in the system
 Understand overall OS behavior; identify and remove kernel hot spots
 Cannot show which parts of the application spend time in the OS, or why
Process-centric: OS performance within the context of a specific application's execution
 Virtualization and mapping of performance to the process
 Interactions between programs, daemons, and system services
 Tune the OS for a specific workload, or tune the application to better conform to the OS configuration
 Expose the real source of performance problems (in the OS or in the application)
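Linux already exposes a very coarse process-centric split of user vs. kernel time through /proc/&lt;pid&gt;/stat (fields 14 and 15, utime and stime, in clock ticks); the sketch below parses a hardcoded sample line so it is self-contained. KTAU refines this idea from two aggregate counters into full per-process kernel profiles:

```python
# Coarse process-centric attribution already available on Linux:
# /proc/<pid>/stat field 14 (utime) vs. field 15 (stime) splits one
# process's CPU time into user vs. kernel ticks. The line below is a
# made-up sample in that format, so the sketch runs anywhere.
sample = ("1234 (my app) R 1 1234 1234 0 -1 4194304 "
          "500 0 0 0 358 142 0 0 20 0 1 0 8000 10000000 250")

# comm (field 2) may itself contain spaces, so split around the
# closing parenthesis before splitting the remaining fields.
rest = sample.rsplit(")", 1)[1].split()
utime, stime = int(rest[11]), int(rest[12])   # overall fields 14 and 15
print(f"user ticks={utime} kernel ticks={stime}")
```

Two counters per process say how much kernel time was spent, but not where or why; the process-centric view this slide argues for needs the full breakdown.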

Motivation (continued)
Existing approaches
User-space-only measurement tools
 Many tools work only at user level and cannot observe system-level performance influences
Kernel-level-only measurement tools
 Most provide only the kernel-wide perspective and lack proper mapping/virtualization
 Some provide process-centric views but cannot integrate OS and user-level measurements
Combined or integrated user/kernel measurement tools
 A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance
 Typically these focus only on direct OS interactions; indirect interactions are not merged
Using combinations of the above tools
 Without better integration, does not allow fine-grained correlation between the OS and the application
 Many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks)
We need an integrated approach to parallel performance observation and analysis.

High-Level Objectives
 Support low-overhead OS performance measurement at multiple levels of function and detail
 Provide both kernel-wide and process-centric perspectives of OS performance
 Merge user-level and kernel-level performance information across all program-OS interactions
 Provide online information and the ability to function without a daemon where possible
 Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
 Leverage existing parallel performance analysis tools
 Support observing, collecting and analyzing parallel data

KTAU: Outline
 Introduction
 Motivations
 Objectives
 Architecture
 KTAU on Blue Gene/L
 Recent/Ongoing work (since publication)
 Future work and directions
 Acknowledgements
 References
 Team

KTAU Architecture

KTAU on BG/L's ZeptoOS
I/O node
 Open-source modified Linux kernel (2.4, 2.6): ZeptoOS
 Control I/O Daemon (CIOD) handles I/O syscalls from the compute nodes in its pset
Compute node
 IBM proprietary (closed-source) light-weight kernel
 No scheduling or virtual memory support
 Forwards I/O syscalls to the CIOD on the I/O node
KTAU on the I/O node
 Integrated into the ZeptoOS config and build system
 Requires KTAU-D (daemon), as CIOD is closed-source
 KTAU-D periodically monitors system-wide or individual-process kernel performance
 Visualization of traces/profiles of ZeptoOS and CIOD using ParaProf, Vampir/Jumpshot
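The KTAU-D monitoring pattern can be sketched as follows (a hedged illustration: the counter names and the read_counters() stand-in are invented here, not KTAU-D's real interface, which reads kernel profile data exposed by the KTAU-patched kernel):

```python
import time

# Stand-in data source: monotonically increasing kernel counters, as a
# KTAU-patched kernel might expose them. Values and names are made up.
_fake = {"syscalls": 0, "interrupts": 0}

def read_counters():
    _fake["syscalls"] += 120
    _fake["interrupts"] += 30
    return dict(_fake)

# Daemon loop: snapshot the counters each period and report the deltas,
# i.e. the kernel activity that occurred during each interval.
def monitor(intervals, period=0.0):
    prev = read_counters()
    deltas = []
    for _ in range(intervals):
        time.sleep(period)
        cur = read_counters()
        deltas.append({k: cur[k] - prev[k] for k in cur})
        prev = cur
    return deltas

print(monitor(3))
```

Reading deltas from outside the kernel is what makes a user-space daemon like KTAU-D workable even when the monitored component (CIOD) is closed-source.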

KTAU On BG/L (current)

On BG/L (continued) Early Experiences CIOD Kernel Trace zoomed-in (running iotest benchmark)

On BG/L (continued) Early Experiences

Correlating CIOD and RPC-IOD Activity

KTAU on BG/L will eventually look like ...
 Replace with: ZOID + TAU
 Replace with: Linux + KTAU

Ongoing/Recent Work (since publication)
Accurate identification of "noise" sources
 Modified Linux on BG/L should not take a performance loss
 One area of concern: OS "noise" effects on synchronization/collectives
 Requires identifying exactly which aspects of the OS (code paths, configurations, attached devices) induce which types of interference
 This requires user-level as well as OS measurement
Our approach
 Use the Selfish benchmark [Beckman06] to identify "detours" (noise events) in user space
   This shows the durations and frequencies of events, but NOT their cause/source
 Simultaneously use KTAU OS tracing to record OS activity
 Correlate times of occurrence (both use the same time source: the hardware time counter)
 Infer which type of OS activity (if any) caused each "detour"
Remove or alleviate interference using the above information (work in progress)
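The timestamp-correlation step above can be sketched as interval matching (all timestamps and event names below are made up for illustration): each user-space detour window is searched for OS trace events that fall inside it:

```python
from bisect import bisect_left

# User-space "detours" from a Selfish-style benchmark: (start, length)
# in ticks of the shared hardware time counter. Illustrative values.
detours = [(1000, 50), (5000, 200)]

# Timestamped OS trace events from KTAU-style kernel tracing, sorted by
# time and on the same time base. Illustrative values.
os_events = [(40, "syscall"), (1020, "timer irq"),
             (5100, "scheduler"), (9000, "bottom half")]

def attribute(detours, events):
    times = [t for t, _ in events]
    hits = {}
    for start, length in detours:
        i = bisect_left(times, start)
        # any OS event inside the detour window is a candidate cause
        inside = [name for t, name in events[i:] if t <= start + length]
        hits[start] = inside or ["unknown"]
    return hits

print(attribute(detours, os_events))
```

Because both sides sample the same hardware counter, no clock synchronization step is needed before matching, which is the point the slide makes about using a common time source.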

Ongoing/Recent Work (continued) “Noise” Source Identification BGL IO-N: Merged OS/User Performance View of Scheduling

Ongoing/Recent Work (continued) “Noise” Source Identification Merged OS/User View of OS Background Activity

Ongoing/Recent Work (continued) “Noise” Source Identification Zoomed-In: Merged OS/User View of OS Background Activity

Future Work
 Dynamic measurement control: enable/disable events without recompilation or reboot
 Improve the performance data sources that KTAU can access, e.g. PAPI
 Improve integration with TAU's user-space capabilities for even better correlation of user and kernel performance information
   Full callpaths
   Phase-based profiling
   Merged user/kernel traces (already available)
 Integration of TAU and KTAU with Supermon
 Porting efforts: IA-64, PPC-64 and AMD Opteron
 ZeptoOS: planned characterization efforts
   BG/L I/O node
   Dynamically adaptive kernels

Support Acknowledgements Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and National Science Foundation (grant no. NSF CCF )

References
[Petrini'03]: F. Petrini, D. J. Kerbyson, and S. Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," in SC '03.
[Jones'03]: T. Jones et al., "Improving the Scalability of Parallel Jobs by Adding Parallel Awareness to the Operating System," in SC '03.
[PAPI]: S. Browne et al., "A Portable Programming Interface for Performance Evaluation on Modern Processors," The International Journal of High Performance Computing Applications, 14(3), Fall.
[VAMPIR]: W. E. Nagel et al., "VAMPIR: Visualization and Analysis of MPI Resources," Supercomputer, vol. 12, no. 1, pp. 69–80.
[ZeptoOS]: "ZeptoOS: The Small Linux for Big Computers."
[NPB]: D. H. Bailey et al., "The NAS Parallel Benchmarks," The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.

References (continued)
[Sweep3D]: A. Hoisie et al., "A General Predictive Performance Model for Wavefront Algorithms on Clusters of SMPs," in International Conference on Parallel Processing, 2000.
[LMBENCH]: L. W. McVoy and C. Staelin, "lmbench: Portable Tools for Performance Analysis," in USENIX Annual Technical Conference, 1996, pp. 279–294.
[TAU]: "TAU: Tuning and Analysis Utilities."
[KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early Experiences with KTAU on the IBM BG/L," in EuroPar '06, European Conference on Parallel Processing.
[KTAU]: A. Nataraj et al., "Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project," in IEEE Cluster 2006 (Best Paper).

Team
University of Oregon (UO) core team
 Aroon Nataraj, PhD student
 Prof. Allen D. Malony
 Dr. Sameer Shende, Senior Scientist
 Alan Morris, Senior Software Engineer
Argonne National Lab (ANL) contributors
 Pete Beckman
 Kamil Iskra
 Kazutomo Yoshii
Past members
 Suravee Suthikulpanit, MS student, UO (graduated)

Thank You Questions? Comments? Feedback?