Early Experiences with KTAU on the Blue Gene/L
A. Nataraj, A. Malony, A. Morris, S. Shende
Performance Research Lab, University of Oregon
Introduction: ZeptoOS and TAU
  DOE OS/RTS for Extreme Scale Scientific Computation (FastOS)
    Conduct OS research to provide effective OS/runtime for petascale systems
  ZeptoOS (under FastOS)
    Scalable components for petascale architectures
    Joint project of Argonne National Lab and University of Oregon
    ANL: putting a light-weight kernel (based on Linux) on BG/L and other platforms (XT3)
    University of Oregon: kernel performance monitoring and tuning
  KTAU
    Integration of TAU infrastructure in the Linux kernel
    Integration with ZeptoOS, installation on BG/L
    Port to 32-bit and 64-bit Linux platforms
ZeptoOS: The Small Linux for Big Computers
  Research Exploration
    What are the fundamental limits and advanced designs required for petascale operating system suites?
      Behavior at large scales
      Management and optimization of OS suites
      Collectives
      Fault tolerance
      Measurement, collection and analysis of OS performance data from a large number of nodes
  Strategy
    Modified Linux on BG/L I/O nodes
      Measure and understand behavior
    Modified Linux for BG/L compute nodes
      Measure and understand behavior
    Specialized I/O daemon on I/O node (ZOID)
      Measure and understand behavior
  (ZeptoOS BG/L Symposium presentation slide reused with permission from Pete Beckman [beckman06-bgl])
ZeptoOS and KTAU
  Lots of fine-grained OS measurement is required for each component of the ZeptoOS work
    Exactly what aspects of Linux need to be changed to achieve ZeptoOS goals?
    How and why do the various OS source and configuration changes affect parallel applications?
    How do we correlate performance data between the parallel application, the compute node OS, the I/O daemon and the I/O node OS?
  Enter TAU/KTAU: an integrated methodology and framework to measure performance of applications and the OS kernel across a system like BG/L
Motivation
  Application performance
    User-level execution performance + OS-level operations performance
    Domains: time and hardware performance metrics
  PAPI (Performance Application Programming Interface)
    Exposes virtualized hardware counters
  TAU (Tuning and Analysis Utilities)
    Measures most user-level entities: parallel application, MPI, libraries …
    Time domain
    Uses PAPI to correlate counter information to source
  But how do we correlate OS-level influences with application performance?
Motivation (continued)
  As HPC systems continue to scale to larger processor counts
    Application performance becomes more sensitive
    New OS factors become performance bottlenecks (e.g. [Petrini’03, Jones’03, other works…])
    Isolating these system-level issues as bottlenecks is non-trivial
  Require comprehensive performance understanding
    Observation of all performance factors
    Relative contributions and interrelationships
    Can we correlate?
Motivation (continued): Program-OS Interactions
  Program-OS interactions: direct vs. indirect entry points
    Direct: applications invoke the OS for certain services
      Syscalls (and internal OS routines called directly from syscalls)
    Indirect: the OS takes actions without explicit invocation by the application
      Preemptive scheduling
      (HW) interrupt handling
      OS background activity (keeping track of time and timers, bottom-half handling, etc.)
      Indirect interactions can occur at any OS entry (not just when entering through syscalls)
  Direct interactions are easier to handle
    Synchronous with user code and in process context
  Indirect interactions are more difficult to handle
    Usually asynchronous and in interrupt context: hard to measure and harder to correlate/integrate with application measurements
Motivation (continued): Kernel-wide vs. Process-centric
  Kernel-wide: aggregate kernel activity of all active processes in the system
    Understand overall OS behavior, identify and remove kernel hot spots
    Cannot show what parts of the application spend time in the OS and why
  Process-centric: OS performance within the context of a specific application’s execution
    Virtualization and mapping of performance to process
    Interactions between programs, daemons, and system services
    Tune the OS for a specific workload or tune the application to better conform to the OS configuration
    Expose the real source of performance problems (in the OS or the application)
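The difference between the two perspectives can be illustrated with a minimal sketch. It assumes a hypothetical list of kernel event records of the form (pid, kernel routine, microseconds); the routine names and numbers are made up for illustration and are not KTAU output or the KTAU API. The kernel-wide view sums over all processes, while the process-centric view keeps each process's share separate.

```python
from collections import defaultdict

# Hypothetical kernel event records: (pid, kernel_routine, microseconds).
# Names and values are illustrative assumptions, not real KTAU data.
EVENTS = [
    (101, "sys_read", 40),
    (101, "schedule", 15),
    (202, "sys_read", 10),
    (202, "do_IRQ", 5),
    (101, "do_IRQ", 20),
]


def kernel_wide(events):
    """Aggregate time per kernel routine across all processes."""
    totals = defaultdict(int)
    for _pid, routine, usec in events:
        totals[routine] += usec
    return dict(totals)


def process_centric(events):
    """Aggregate time per kernel routine separately for each process."""
    per_pid = defaultdict(lambda: defaultdict(int))
    for pid, routine, usec in events:
        per_pid[pid][routine] += usec
    return {pid: dict(routines) for pid, routines in per_pid.items()}


if __name__ == "__main__":
    print("kernel-wide:    ", kernel_wide(EVENTS))
    print("process-centric:", process_centric(EVENTS))
```

In this toy data the kernel-wide view reports 50 microseconds in sys_read but cannot say that 40 of them belong to process 101; the process-centric view retains that mapping.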
Motivation (continued): Existing Approaches
  User-space-only measurement tools
    Many tools only work at user level and cannot observe system-level performance influences
  Kernel-level-only measurement tools
    Most only provide the kernel-wide perspective and lack proper mapping/virtualization
    Some provide process-centric views but cannot integrate OS and user-level measurements
  Combined or integrated user/kernel measurement tools
    A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance
    Typically these focus only on direct OS interactions; indirect interactions are not merged
  Using combinations of the above tools
    Without better integration, does not allow fine-grained correlation between OS and application
    Many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks)
  Need an integrated approach to parallel performance observation and analysis
High-Level Objectives
  Support low-overhead OS performance measurement at multiple levels of function and detail
  Provide both kernel-wide and process-centric perspectives of OS performance
  Merge user-level and kernel-level performance information across all program-OS interactions
  Provide online information and the ability to function without a daemon where possible
  Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
  Leverage existing parallel performance analysis tools
  Support for observing, collecting and analyzing parallel data
KTAU: Outline
  Introduction
  Motivations
  Objectives
  Architecture
  KTAU on Blue Gene/L
  Recent/Ongoing Work (since publication)
  Future work and directions
  Acknowledgements
  References
  Team
KTAU Architecture
KTAU on BG/L’s ZeptoOS
  I/O Node
    Open-source modified Linux kernel (2.4, 2.6): ZeptoOS
    Control I/O Daemon (CIOD) handles I/O syscalls from the compute nodes in its pset
  Compute Node
    IBM proprietary (closed-source) light-weight kernel
    No scheduling or virtual memory support
    Forwards I/O syscalls to CIOD on the I/O node
  KTAU on I/O Node
    Integrated into the ZeptoOS config and build system
    Requires KTAU-D (daemon) because CIOD is closed-source
    KTAU-D periodically monitors the whole system or individual processes
    Visualization of traces/profiles of ZeptoOS and CIOD using ParaProf, Vampir/Jumpshot
KTAU On BG/L (current)
On BG/L (continued) Early Experiences CIOD Kernel Trace zoomed-in (running iotest benchmark)
On BG/L (continued) Early Experiences
Correlating CIOD and RPC-IOD Activity
KTAU On BG/L Will Eventually Look Like … Replace with: ZOID + TAU Replace with: Linux + KTAU
Ongoing/Recent Work (since publication)
  Accurate identification of “noise” sources
    Modified Linux on BG/L should not take a performance loss
    One area of concern: OS “noise” effects on synchronization / collectives
    Requires identifying exactly what aspects (code paths, configurations, devices attached) of the OS induce what types of interference
    This will require user-level as well as OS measurement
  Our approach
    Use the Selfish benchmark [Beckman06] to identify “detours” (or noise events) in user space
      This shows durations and frequencies of events, but NOT cause/source
    Simultaneously use KTAU OS tracing to record OS activity
    Correlate time of occurrence (both use the same time source: the hardware time counter)
    Infer which type of OS activity (if any) caused the “detour”
    Remove or alleviate interference using the above information (work in progress)
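The correlation step can be sketched as follows. This is only an illustration under stated assumptions, not the Selfish benchmark or the KTAU trace format: user-level timer samples are assumed to be a plain list of counter values, kernel events are assumed to be (name, start, end) records on the same time base, and the names find_detours, attribute, the threshold and the event names are all hypothetical. A detour is taken to be any gap between consecutive samples that exceeds the threshold, and it is attributed to the kernel event that overlaps it the most.

```python
from typing import List, NamedTuple, Optional, Tuple


class Detour(NamedTuple):
    start: int  # last timer sample before the gap
    end: int    # first timer sample after the gap


class KernelEvent(NamedTuple):
    name: str   # hypothetical kernel routine name
    start: int
    end: int


def find_detours(samples: List[int], threshold: int) -> List[Detour]:
    """Return gaps between consecutive timer samples larger than threshold."""
    return [Detour(a, b) for a, b in zip(samples, samples[1:]) if b - a > threshold]


def overlap(d: Detour, e: KernelEvent) -> int:
    """Length of the time interval shared by a detour and a kernel event."""
    return max(0, min(d.end, e.end) - max(d.start, e.start))


def attribute(detours: List[Detour],
              events: List[KernelEvent]) -> List[Tuple[Detour, Optional[str]]]:
    """Pair each detour with the kernel event overlapping it most, or None."""
    result = []
    for d in detours:
        best = max(events, key=lambda e: overlap(d, e), default=None)
        if best is not None and overlap(d, best) > 0:
            result.append((d, best.name))
        else:
            result.append((d, None))  # no recorded OS activity explains it
    return result


if __name__ == "__main__":
    # Illustrative values only; both lists share one (assumed) time counter.
    samples = [0, 10, 20, 130, 140, 150, 400, 410]
    events = [
        KernelEvent("timer_interrupt", 25, 120),
        KernelEvent("schedule", 160, 390),
        KernelEvent("do_IRQ", 500, 520),
    ]
    for detour, cause in attribute(find_detours(samples, threshold=50), events):
        print(detour, "->", cause)
```

Attribution by largest overlap is a deliberate simplification; a detour that spans several kernel events would need all overlapping events reported rather than only one.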
Ongoing/Recent Work (continued) “Noise” Source Identification BGL IO-N: Merged OS/User Performance View of Scheduling
Ongoing/Recent Work (continued) “Noise” Source Identification Merged OS/User View of OS Background Activity
Ongoing/Recent Work (continued) “Noise” Source Identification Zoomed-In: Merged OS/User View of OS Background Activity
Future Work
  Dynamic measurement control: enable/disable events without recompilation or reboot
  Improve performance data sources that KTAU can access, e.g. PAPI
  Improve integration with TAU’s user-space capabilities to provide even better correlation of user and kernel performance information
    Full callpaths, phase-based profiling, merged user/kernel traces (already available)
  Integration of TAU and KTAU with Supermon
  Porting efforts: IA-64, PPC-64 and AMD Opteron
  ZeptoOS: planned characterization efforts
    BG/L I/O node
    Dynamically adaptive kernels
University of Oregon (UO) Core Team
  Aroon Nataraj, PhD Student
  Prof. Allen D Malony
  Dr. Sameer Shende, Senior Scientist
  Alan Morris, Senior Software Engineer
Argonne National Lab (ANL) Contributors
  Pete Beckman
  Kamil Iskra
  Kazutomo Yoshii
Past Members
  Suravee Suthikulpanit, MS Student, UO (graduated)
