Kernel-level Measurement for Integrated Parallel Performance Views
KTAU: Kernel-TAU
Aroon Nataraj, Performance Research Lab, University of Oregon

KTAU: Outline
- Introduction
- Motivations
- Objectives
- Architecture / Implementation Choices
- Experimentation: the performance views
- Perturbation Study
- Future work and directions
- Acknowledgements

Introduction: ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation (FastOS): conduct OS research to provide an effective OS/runtime for petascale systems
- ZeptoOS (under FastOS): scalable components for petascale architectures; a joint project between Argonne National Lab and the University of Oregon
- ANL: putting a lightweight kernel (based on Linux) on BG/L and other platforms (XT3)
- University of Oregon: kernel performance monitoring and tuning (KTAU): integration of the TAU infrastructure with the Linux kernel; integration with ZeptoOS and installation on BG/L; ports to 32-bit and 64-bit Linux platforms

KTAU: Motivation
- Application performance = user-level execution performance + OS-level operations performance
- Different domains, e.g. time and hardware performance metrics
- PAPI (Performance Application Programming Interface): exposes virtualized hardware counters (see the sketch below)
- TAU (Tuning and Analysis Utilities): measures many interesting user-level entities (parallel application, MPI, libraries, ...) in the time domain, and uses PAPI to correlate counter information with source
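
The PAPI usage the slide refers to can be made concrete with a minimal user-space sketch (assuming PAPI is installed; compile with -lpapi). The choice of PAPI_TOT_CYC here is just an example; any preset event the CPU supports works the same way:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long counts[1];

    /* Initialize the PAPI library and create an event set. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK)
        exit(1);
    /* Count total cycles over the measured region. */
    if (PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK)
        exit(1);

    PAPI_start(evset);
    /* ... user-level region to measure ... */
    for (volatile int i = 0; i < 1000000; i++)
        ;
    PAPI_stop(evset, counts);

    printf("cycles: %lld\n", counts[0]);
    return 0;
}
```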

KTAU: Motivation - Simple Parallel Model

KTAU: Motivation - Simple Parallel Model at Scale

KTAU: Motivation - Effects of Scale
- As HPC systems continue to scale to larger processor counts, application performance becomes more sensitive, and new OS factors become performance bottlenecks (e.g. [Petrini'03], [Jones'03], and other works); isolating these system-level issues as bottlenecks is non-trivial
- Comprehensive performance understanding requires observation of all performance factors and of their relative contributions and interrelationships: can we correlate?

KTAU: Motivation - Program-OS Interactions
- Program-OS interactions come in two kinds: direct and indirect entry points
- Direct: the application invokes the OS for certain services, via syscalls (and internal OS routines called directly from syscalls)
- Indirect: the OS takes actions without explicit invocation by the application, e.g. preemptive scheduling, (hardware) interrupt handling, and OS background activity (keeping track of time and timers, bottom-half handling, etc.)
- Indirect interactions can occur at any OS entry (not just when entering through syscalls)

KTAU: Motivation - Program-OS Interactions

- Direct interactions are easier to handle: they are synchronous with user code and run in process context
- Indirect interactions are more difficult: they are usually asynchronous and run in interrupt context, so they are hard to measure and harder to correlate/integrate with application measurements
- Indirect interactions may be unrelated to the current task (e.g. kernel-level packet processing for another process) but are still charged, in time, to the current process
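
To make the distinction concrete, here is a small illustrative sketch (not part of KTAU) that times a direct interaction, a read() syscall, from user space. Note that the user-level timer silently absorbs any indirect activity (interrupts, preemption) that lands inside the measured window, which is exactly why user-level measurement alone cannot separate the two:

```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void) {
    char buf[4096];
    struct timespec t0, t1;

    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    read(fd, buf, sizeof(buf));   /* direct interaction: explicit syscall */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* This wall-clock delta covers not just the syscall, but also any
     * indirect OS activity (interrupts, preemption) that occurred in
     * between; user-level timing cannot tell those apart. */
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("read() took %.2f us\n", us);

    close(fd);
    return 0;
}
```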

KTAU: Motivation - Program-OS Interactions (Partial)

KTAU: Motivation - Kernel-wide vs. Process-centric
- Kernel-wide: aggregate kernel activity of all active processes in the system; useful for understanding overall OS behavior and for identifying and removing kernel hot spots, but cannot show which parts of the application spend time in the OS, and why
- Process-centric: OS performance within the context of a specific application's execution; virtualizes and maps performance to the process; exposes interactions between programs, daemons, and system services; supports tuning the OS for a specific workload or tuning the application to better conform to the OS configuration; exposes the real source of performance problems (in the OS or in the application)

KTAU: Motivation - Kernel-wide vs. Process-centric

KTAU: Motivation - Existing Approaches
- User-space-only measurement tools: many tools work only at user level and cannot observe system-level performance influences
- Kernel-level-only measurement tools: most provide only the kernel-wide perspective and lack proper mapping/virtualization; some provide process-centric views but cannot integrate OS and user-level measurements
- Combined or integrated user/kernel measurement tools: a few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance, but they typically focus only on direct OS interactions; indirect interactions are not merged
- Combinations of the above tools: without better integration, they do not allow fine-grained correlation between OS and application; many kernel tools do not explicitly recognize parallel workloads (e.g. MPI ranks)
- We need an integrated approach to parallel performance observation and analysis

KTAU: High-Level Objectives
- Support low-overhead OS performance measurement at multiple levels of function and detail
- Provide both kernel-wide and process-centric perspectives of OS performance
- Integrate user- and kernel-level performance information across all program-OS interactions
- Provide online information, and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis/visualization tools
- Support observing, collecting, and analyzing parallel data

KTAU: Outline
- Introduction
- Motivations
- Objectives
- Architecture / Implementation Choices
- Experimentation: the performance views
- Perturbation Study
- ZeptoOS: KTAU on Blue Gene/L
- Future work and directions
- Acknowledgements

KTAU Architecture

KTAU: Architecture / Implementation Choices
- Instrumentation: static source instrumentation; the Map-ID macro maps a block of code and its process context to a unique index (a dense ID space, allowing easy array lookup); the Start and Stop macros take the mapping index, with the process context implicit (see the sketch below)
- Measurement: differentiate between 'local/self' and 'inter-context' access; HPC codes primarily use 'self'; performance data is stored in the PCB (task_struct)
- Integrating kernel/user performance state: don't assume synchronous kernel entry or process context; use memory mapping between kernel and application state; pin shared state in memory; kernel call groups summarize program-OS interactions
- Analysis and visualization: use TAU facilities
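
The dense-ID mapping idea can be illustrated with a small user-space analogue. The names and layout below are hypothetical and purely for illustration; KTAU itself implements this inside the kernel and keeps the per-process performance data in the task_struct:

```c
#include <stdio.h>
#include <time.h>

#define MAX_EVENTS 64   /* dense ID space: measurement is a plain array lookup */

/* Hypothetical per-context performance table; KTAU keeps the analogous
 * state in each process's task_struct inside the kernel. */
static struct { long long calls, total_ns, start_ns; } perf[MAX_EVENTS];

static long long now_ns(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (long long)t.tv_sec * 1000000000LL + t.tv_nsec;
}

/* Start/stop take only the mapped index; the context (here, the
 * process) stays implicit, as on the slide. */
static void ktau_start(int id) { perf[id].start_ns = now_ns(); }
static void ktau_stop(int id) {
    perf[id].total_ns += now_ns() - perf[id].start_ns;
    perf[id].calls++;
}

enum { EV_COMPUTE = 0 };  /* dense index a code block would be mapped to */

int main(void) {
    ktau_start(EV_COMPUTE);
    for (volatile int i = 0; i < 1000000; i++)
        ;                 /* instrumented block */
    ktau_stop(EV_COMPUTE);
    printf("calls=%lld total=%lld ns\n",
           perf[EV_COMPUTE].calls, perf[EV_COMPUTE].total_ns);
    return 0;
}
```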

KTAU: Controlled Experiments
- Exercise the kernel in a controlled fashion and check whether KTAU produces the expected, correct, and meaningful views
- Test machines: Neutron, a 4-CPU Intel P3 Xeon 550 MHz with 1 GB RAM, running a Linux (KTAU) kernel; Neuronic, a 16-node cluster of 2-CPU Intel P4 Xeon 2.8 GHz nodes with 2 GB RAM/node, running Red Hat Enterprise Linux 2.4 (KTAU)
- Benchmarks: the NPB LU and SP applications [NPB] (simulated computational fluid dynamics (CFD) applications; a regular-sparse, block lower and upper triangular system solution); LMBENCH [LMBENCH], a suite of micro-benchmarks exercising the Linux kernel; a few others not shown (e.g. SKaMPI)

KTAU: Controlled Experiments (continued): Profiling

KTAU: Controlled Experiments - Observing Interrupts
Benchmark: NPB-SP application on 16 nodes. User-level inclusive time and user-level exclusive time views:
- Approx. 25 secs dilation in total inclusive time. Why?
- Approx. 16 secs dilation in MPSP() exclusive time. Why?
- The user-level profile does not tell the whole story!

KTAU: Controlled Experiments - Observing Interrupts
User+OS inclusive time and user+OS exclusive time views:
- Kernel-space time is taken by: 1. do_softirq(), 2. schedule(), 3. do_IRQ(), 4. sys_poll(), 5. icmp_rcv(), 6. icmp_reply()
- The MPSP() exclusive-time difference is only 4 secs; the exclusive-time view clearly identifies the culprits: 1. schedule(), 2. do_IRQ(), 3. icmp_reply(), 4. do_softirq()
- Pings cause interrupts (do_IRQ), which are in turn handled after the interrupt by soft-interrupts (do_softirq); the actual routine is icmp_reply/icmp_rcv
- The large number of softirqs causes ksoftirqd to be scheduled in, causing SP to be scheduled out

KTAU: Controlled Experiments - Observing Scheduling
- NPB LU application on 8 CPUs (neuronic.nic.uoregon.edu)
- Simulate daemon interference using a 'toy' daemon that periodically wakes up and performs activity (a sketch of such a daemon follows below)
- What does KTAU show? Different views...
- A: aggregated kernel-wide view (each row is a single host)
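
The slide does not show the 'toy' daemon's implementation; a minimal sketch of such an interference generator, assuming a simple sleep-then-burn cycle, could look like this:

```c
#include <unistd.h>

/* Hypothetical interference daemon: sleeps, then burns CPU for a
 * while, stealing cycles from (and forcing scheduling of) whatever
 * application shares the processor. */
int main(void) {
    const long burn_iters = 50 * 1000 * 1000;
    for (;;) {
        sleep(1);                         /* periodically wake up... */
        for (volatile long i = 0; i < burn_iters; i++)
            ;                             /* ...and perform activity */
    }
    return 0;
}
```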

KTAU: Controlled Experiments - Observing Scheduling
- B: process-level view (each row is a single process on host 8)
- Selecting node 8 for a closer look shows the 2 NPB LU processes and the 'toy' daemon activity

KTAU: Controlled Experiments - Observing Scheduling
- C: voluntary vs. involuntary scheduling (each row is a single MPI rank)
- Instrumentation differentiates voluntary from involuntary scheduling; the experiment was re-run on a 4-processor SMP
- Local slowdown comes from preemptive scheduling: one rank is preempted by the 'toy' daemon
- Remote slowdown comes from voluntary scheduling: the other 3 ranks yield the CPU voluntarily and wait!

KTAU: Controlled Experiments - Observing Exceptions
LMBENCH page fault: call-group relations and the program-OS call graph

KTAU: Controlled Experiments - Tracing
- Merging application and OS traces: MPI_Send together with OS routines
- Fine-grained tracing shows detail inside interrupts and bottom halves
- Trace visualization using VAMPIR [VAMPIR]

KTAU: Controlled Experiments (continued): Tracing
Correlating CIOD and RPC-IOD activity

KTAU: Larger-Scale Runs
- Run parallel benchmarks at larger scale (128 dual-CPU nodes) to identify (and remove) system-level performance issues and to understand the perturbation overheads introduced by KTAU
- NPB benchmark: the LU application [NPB], a simulated computational fluid dynamics (CFD) application (a regular-sparse, block lower and upper triangular system solution)
- ASC benchmark: Sweep3D [Sweep3d], which solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh
- Test machine: Chiba City Linux cluster (ANL); 128 dual-CPU Pentium III 450 MHz nodes with 512 MB RAM/node, running a Linux (KTAU) kernel, connected by Ethernet

KTAU: Larger-Scale Runs
- Experienced problems on Chiba by chance: initially ran the NPB-LU and Sweep3D codes in a 128x1 configuration, then in a 64x2 configuration
- Extreme performance hit with the 64x2 runs (72% slower!)
- Used KTAU views to identify and solve the issues iteratively, eventually bringing the performance gap down to 13% for LU and 9% for Sweep3D

KTAU: Larger-Scale Runs
User-level MPI_Recv vs. MPI_Recv OS interactions:
- Two ranks show relatively very low MPI_Recv() time
- The same two ranks differ from the mean in OS-SCHED time within MPI_Recv()

KTAU: Larger-Scale Runs
Voluntary vs. preemptive scheduling (note: x-axis on a log scale):
- Two ranks have very low voluntary scheduling durations
- The same two ranks have very large preemptive scheduling

KTAU: Larger-Scale Runs
ccn10 node-level view and interrupt activity:
- NPB LU processes PID 4066 and PID 4068 are active, with no other significant activity: so why the preemption?
- 64x2 pinned: interrupt activity is bimodal across MPI ranks

KTAU: Larger-Scale Runs
TCP within the compute phase (time and calls):
- Using the 'merged' performance data identifies the imbalance: why does a purely compute-bound region have lots of I/O?
- Many more OS-TCP calls, taking approx. 100% longer; 100% more background OS-TCP activity in the compute phase means more imbalance!

KTAU: Larger-Scale Runs
Cost per call of OS-level TCP:
- OS-TCP is costlier on SMP: IRQ balancing blindly distributes interrupts and bottom halves
- E.g., handling a TCP-related bottom half on CPU 0 for an LU process running on CPU 1 causes cache issues! [COMSWARE]
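
One common mitigation, hinted at by the "64x2 pinned" runs, is to control placement explicitly. As a sketch, a process can pin itself to a CPU with the Linux-specific sched_setaffinity call (interrupt placement itself is steered separately, e.g. via /proc/irq/<n>/smp_affinity):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single CPU so that it stays co-located
 * with (or deliberately apart from) the CPU handling the NIC's
 * interrupts and TCP bottom halves. */
int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                     /* run only on CPU 1 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... application work proceeds on CPU 1 ... */
    return 0;
}
```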

KTAU: Perturbation Study
Five configurations:
- Base: vanilla kernel, un-instrumented benchmark
- Ktau-Off: kernel patched with KTAU and instrumentation compiled in, but all instrumentation turned off (boot-time control)
- Prof-All: all kernel instrumentation turned on
- Prof-Sched: only the scheduler subsystem's instrumentation turned on
- Prof-All+TAU: Prof-All, plus user-level TAU instrumentation enabled
Workloads:
- NPB LU application benchmark: 16 nodes, all 5 configurations, mean over 5 runs each
- ASC Sweep3D: 128 nodes, Base and Prof-All+TAU, mean over 5 runs each
Test machine: Chiba City (ANL)

KTAU: Perturbation Study
- Disabled probe effect: a single instrumentation point is very cheap (e.g. scheduling only)
- Complete integrated profiling cost is under 3% on average, and as low as 1.58%
- Sweep3D on 128 nodes, Base vs. Prof-All+TAU elapsed time: 0.49% average slowdown

KTAU: Outline
- Introduction
- Motivations
- Objectives
- Architecture / Implementation Choices
- Experimentation: the performance views
- Perturbation Study
- Future work and directions
- Acknowledgements

KTAU: Future Work
- Dynamic measurement control: enable/disable events without recompilation or reboot
- Improve the performance data sources KTAU can access (e.g. PAPI)
- Improve integration with TAU's user-space capabilities for even better correlation of user and kernel performance information: full callpaths, phase-based profiling, merged user/kernel traces
- Integration of TAU and KTAU with Supermon (possibly MRNet?) and TAUg
- Porting efforts: IA-64, PPC-64, and AMD Opteron
- ZeptoOS: planned characterization efforts (BG/L I/O node; dynamically adaptive kernels)

Acknowledgements
- Prof. Allen D. Malony
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer, PRL
- Suravee Suthikulpanit, MS student (graduated)

Support Acknowledgements
Department of Energy's Office of Science (contract no. DE-FG02-05ER25663) and National Science Foundation (grant no. NSF CCF )

References [petrini’03]:F. Petrini, D. J. Kerbyson, and S. Pakin, “The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of asci q,” in SC ’03 [jones’03]: T. Jones and et al., “Improving the scalability of parallel jobs by adding parallel awareness to the operating system,” in SC ’03 [PAPI]: S. Browne et al., “A Portable Programming Interface for Performance Evaluation on Modern Processors”. The International Journal of High Performance Computing Applications, 14(3): , Fall [VAMPIR]: W. E. Nagel et. al., “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, [ZeptoOS]: “ZeptoOS: The small linux for big computers,” [NPB]: D.H. Bailey et. al., “The nas parallel benchmarks,” The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.

References (continued)
[Sweep3d]: A. Hoisie et al., "A general predictive performance model for wavefront algorithms on clusters of SMPs," in International Conference on Parallel Processing, 2000.
[LMBENCH]: L. W. McVoy and C. Staelin, "lmbench: Portable tools for performance analysis," in USENIX Annual Technical Conference, 1996, pp. 279-294.
[TAU]: "TAU: Tuning and Analysis Utilities."
[KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early experiences with KTAU on the IBM BG/L," in EuroPar '06, European Conference on Parallel Processing.
[KTAU]: A. Nataraj et al., "Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project" (under submission).