Kernel-level Measurement for Integrated Parallel Performance Views
A. Nataraj, A. Malony, S. Shende, A. Morris


Slide 1: Kernel-level Measurement for Integrated Parallel Performance Views
A. Nataraj, A. Malony, S. Shende, A. Morris
{anataraj,malony,sameer,amorris}@cs.uoregon.edu
Performance Research Lab, University of Oregon

Slide 2: Motivation
- Application performance is a consequence of:
  - User-level execution
  - OS-level operation
- Good tools exist for observing user-level performance:
  - User-level events
  - Communication events
  - Execution time measures
  - Hardware performance
- Fewer tools exist to observe OS-level aspects
- Ideally we would like to do both simultaneously:
  - OS-level influences on application performance

Slide 3: Scale and Performance Sensitivity
- HPC systems continue to scale to larger processor counts:
  - Applications become more performance-sensitive
  - OS factors can lead to performance bottlenecks [Petrini'03, Jones'03, ...]
  - System/application performance effects are complex
  - Isolating system-level factors is non-trivial
- Comprehensive performance understanding is required:
  - Observation of all performance factors
  - Their relative contributions and interrelationships
  - Can we correlate OS and application performance?

Slide 4: Phase Performance Effects
[Figure: phase timeline showing waiting time due to OS activity; the overhead accumulates across phases]

Slide 5: Program-OS Interactions
- Direct: the application invokes the OS for certain services
  - syscalls, and internal OS routines called from syscalls
- Indirect: OS operations without explicit invocation by the application
  - preemptive scheduling
  - (hardware) interrupt handling
  - OS background activity, which can occur at any time
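
To make the distinction concrete, a minimal Python sketch (not from the talk): a file read is a direct OS interaction, and os.times() exposes the resulting user/system CPU-time split, though without kernel instrumentation the system share cannot be attributed to specific OS routines.

```python
import os

# Direct OS interaction: the application explicitly invokes syscalls.
# open()/read() below are direct syscalls issued on our behalf.
with open("/etc/hostname", "rb") as f:
    data = f.read()

t = os.times()
print(f"user time:   {t.user:.3f} s")
print(f"system time: {t.system:.3f} s")  # time spent in the kernel for us
# Indirect interactions (timer interrupts, preemption, softirqs) also
# consume time but are invisible here; they show up only with OS-level
# measurement such as KTAU.
```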

Slide 6: Performance Perspectives
- Kernel-wide:
  - Aggregate kernel activity of all active processes
  - Understand overall OS behavior
  - Identify and remove kernel hot spots
  - Cannot show application-specific OS actions
- Process-centric:
  - OS performance in a specific application context
  - Virtualization and mapping of OS performance to the process
  - Interactions among programs, daemons, and system services
  - Expose sources of performance problems
  - Tune the OS for a specific workload, and the application for the OS

Slide 7: Existing Approaches
- User-space-only measurement tools:
  - Cannot observe system-level performance influences
- Kernel-level-only measurement tools:
  - Cannot integrate OS and user-level measurements
- A few combined/integrated user/kernel measurement tools:
  - Can correlate kernel and user-level performance
  - Typically focus ONLY on direct OS interactions
  - Indirect interactions are not normally merged
  - Do not explicitly recognize parallel workloads (MPI ranks, OpenMP threads, ...)
- Needed: an integrated approach to parallel performance observation and analysis that supports both perspectives

Slide 8: High-Level Objectives
- Low-overhead OS performance measurement
- Both kernel-wide and process-centric perspectives
- Merge user-level and kernel-level performance information across all program-OS interactions
- Provide online information, and the ability to function without a daemon where possible
- Support both profiling and tracing for kernel-wide and process-centric views in parallel systems
- Leverage existing parallel performance analysis and visualization tools
- Support observing, collecting, and analyzing parallel data

Slide 9: ZeptoOS and TAU/KTAU
- Each component of the ZeptoOS work requires a lot of fine-grained OS measurement
- How and why do the various OS source and configuration changes affect parallel applications?
- How do we correlate performance data between:
  - OS components
  - the parallel application and the OS
- Solution: TAU/KTAU, an integrated methodology and framework to measure the performance of applications and the OS kernel

Slide 10: KTAU Architecture
[Figure: KTAU architecture diagram]

Slide 11: KTAU Case Studies
- Demonstrate KTAU capabilities:
  - Controlled experiments
  - Larger-scale benchmarks
- Investigate KTAU operating characteristics:
  - Overhead
- Test environment:
  - Neutron: 4-CPU Intel P3 Xeon, 550 MHz, 1 GB RAM, Linux 2.6.14.3 (ktau)
  - Neuronic: 16 nodes, each 2-CPU Intel P4 Xeon, 2.8 GHz, 2 GB RAM/node, Red Hat Enterprise Linux 2.4 (ktau)
  - Chiba City: 128 dual-CPU Pentium III nodes, 450 MHz, 512 MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet

Slide 12: Scenarios and Benchmarks
- Controlled experiment scenarios:
  - Interrupt load
  - Scheduling load
  - System calls
- Benchmarks:
  - NPB LU and SP applications [NPB]: simulated computational fluid dynamics (CFD) applications; a regular-sparse, block lower and upper triangular system solution
  - LMBENCH [LMBENCH] (not shown here): a suite of micro-benchmarks exercising the Linux kernel

Slide 13: Observing Interrupts
- Benchmark: NPB-SP application on 16 nodes
- [Figures: user-level inclusive time; user-level exclusive time]
- Approx. 25 secs dilation in total inclusive time. Why?
- Approx. 16 secs dilation in MPSP() exclusive time. Why?
- The user-level profile does not tell the whole story!

Slide 14: Observing Interrupts (continued)
- [Figures: user+OS inclusive time; user+OS exclusive time]
- Kernel-space time taken by: 1. do_softirq(), 2. schedule(), 3. do_IRQ(), 4. sys_poll(), 5. icmp_rcv(), 6. icmp_reply()
- MPSP() exclusive-time difference is only 4 secs
- The exclusive-time view clearly identifies the culprits: 1. schedule(), 2. do_IRQ(), 3. icmp_reply(), 4. do_softirq()
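
A small sketch of the inclusive/exclusive relationship these views rely on (illustrative only; the call tree and numbers are hypothetical): exclusive time is inclusive time minus the inclusive time of direct callees, which is why interrupt overhead buried inside MPSP's inclusive time stands out in the exclusive view.

```python
# Hypothetical call chain: MPSP -> MPI_Recv -> sys_poll -> do_IRQ,
# with made-up inclusive times in seconds.
inclusive = {"MPSP": 100.0, "MPI_Recv": 60.0, "sys_poll": 40.0, "do_IRQ": 25.0}
children = {"MPSP": ["MPI_Recv"], "MPI_Recv": ["sys_poll"],
            "sys_poll": ["do_IRQ"], "do_IRQ": []}

# Exclusive time = inclusive time minus the inclusive time of direct callees.
exclusive = {fn: inclusive[fn] - sum(inclusive[c] for c in children[fn])
             for fn in inclusive}
print(exclusive)
# {'MPSP': 40.0, 'MPI_Recv': 20.0, 'sys_poll': 15.0, 'do_IRQ': 25.0}
# Interrupt time charged below sys_poll inflates MPSP's inclusive time,
# while the exclusive view points directly at do_IRQ and friends.
```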

Slide 15: Observing Scheduling
- NPB LU application on 8 CPUs (neuronic.nic.uoregon.edu)
- Simulate daemon interference using a "toy" daemon (a sketch follows this slide)
- The daemon periodically wakes up and performs some activity
- What does KTAU show? Different views...
- A: Aggregated kernel-wide view (each row is a single host)
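
A minimal sketch of such a "toy" interference daemon; the period and burst length here are assumed, and the study's actual daemon may differ:

```python
import time

# Sleep most of the time, then wake and burn CPU for a short burst,
# stealing cycles (and cache) from the co-scheduled MPI rank.
PERIOD_S = 1.0   # wake up once per second (assumed value)
BURST_S = 0.05   # burn CPU for 50 ms per wakeup (assumed value)

while True:
    time.sleep(PERIOD_S)
    end = time.perf_counter() + BURST_S
    while time.perf_counter() < end:
        pass  # busy-wait: pure user-level CPU noise
```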

Slide 16: Observing Scheduling (continued)
- B: Process-level view (each row is a single process on host 8)
- Select node 8 and take a closer look...
- Visible: the 2 NPB LU processes and the "toy" daemon activity

Slide 17: Merging App/OS Traces
- Tracing system calls: MPI_Send and the OS routines beneath it
- Fine-grained tracing shows detail inside interrupts and bottom halves
- Using VAMPIR trace visualization [VAMPIR]
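
Merging the two streams reduces to interleaving time-sorted events from a common clock. A sketch of the general technique (the event data here are made up; KTAU's real merge operates on its own trace format):

```python
import heapq

# User-level and kernel-level event streams, each already sorted by
# timestamp and sharing a common time source (hypothetical values).
user_events = [(10.0, "MPI_Send enter"), (12.5, "MPI_Send exit")]
kernel_events = [(10.4, "sys_write enter"), (11.1, "do_IRQ"),
                 (11.9, "sys_write exit")]

# heapq.merge interleaves sorted iterables without re-sorting everything.
for ts, name in heapq.merge(user_events, kernel_events):
    print(f"{ts:6.1f}  {name}")
# The merged timeline shows the OS routines nested inside MPI_Send.
```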

Slide 18: Larger-Scale Runs
- Run parallel benchmarks at larger scale:
  - 128 dual-CPU nodes
  - Identify (and remove) system-level performance issues
  - Understand overheads introduced by KTAU
- NPB benchmark: LU application [NPB]
- ASC benchmark: Sweep3D [Sweep3d]

Slide 19: LU Experiment
- Encountered problems on Chiba City by chance
- Initially ran the NPB-LU and Sweep3D codes in a 128x1 configuration
- Then ran in a 64x2 configuration
- Extreme performance hit (72% slower!) with the 64x2 runs
- Used KTAU views to identify and solve issues iteratively
- Eventually brought the performance gap down to 13% for LU and 9% for Sweep3D

Slide 20: LU Experiment (64x2 configuration)
- [Figures: user-level MPI_Recv; MPI_Recv OS interactions]
- Two ranks show relatively very low MPI_Recv() time
- The same two ranks' MPI_Recv() differs from the mean in OS-SCHED

Slide 21: LU Experiment (continued)
- [Figures: voluntary scheduling; preemptive scheduling — note: x-axis log scale]
- Two ranks have very low voluntary scheduling durations
- The same two ranks have very large preemptive scheduling durations

Slide 22: LU Experiment (continued)
- [Figures: ccn10 node-level view; interrupt activity]
- NPB LU processes PID 4066 and PID 4068 are active; no other significant activity!
- So why the preemption?
- 64x2 pinned: interrupt activity is bimodal across MPI ranks

Slide 23: Sweep3D Experiment
- [Figures: TCP within compute phase — time (approx. 100% longer) and calls (many more OS-TCP calls)]
- 100% more background OS-TCP activity in the compute phase; more imbalance!
- Use "merged" performance data to identify overlap and imbalance
- Why does a purely compute-bound region have lots of I/O?
  1. I/O-compute overlap: explicit MPI or OS buffering (TCP sends)
  2. Imbalance in workload: TCP receives from on-time finishers

Slide 24: Sweep3D Experiment (continued)
- [Figure: cost per call of OS-level TCP]
- OS-level TCP is costlier in SMP mode
- IRQ balancing blindly distributes interrupts and bottom halves
- E.g., handling a TCP-related bottom half on CPU 0 for an LU process running on CPU 1 causes cache issues! [COMSWARE]
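
One standard remedy, sketched below, is to pin the NIC's interrupt to the CPU that runs the consuming process, so the bottom half and the application share a cache. /proc/irq/<N>/smp_affinity is the standard Linux interface; the IRQ number and CPU here are hypothetical (look up the actual line in /proc/interrupts), and writing the mask requires root.

```python
IRQ = 19         # hypothetical: the NIC's interrupt line
CPU = 1          # CPU running the LU process
mask = 1 << CPU  # affinity bitmask: bit i selects CPU i

# Writing the mask restricts delivery of this IRQ (and hence its
# bottom-half work) to the chosen CPU.
with open(f"/proc/irq/{IRQ}/smp_affinity", "w") as f:
    f.write(f"{mask:x}\n")  # mask is written in hex
```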

Slide 25: Perturbation Study
- Five different configurations:
  - Baseline: vanilla kernel, un-instrumented benchmark
  - Ktau-Off: kernel instrumentation compiled in but turned off (disabled-probe effect)
  - Prof-All: all kernel instrumentation turned on
  - Prof-Sched: only the scheduler subsystem's instrumentation turned on
  - Prof-All+TAU: Prof-All, plus user-level TAU instrumentation enabled
- NPB LU application benchmark: 16 nodes, all 5 configurations, mean over 5 runs each
- ASC Sweep3D: 128 nodes, Baseline and Prof-All+TAU, mean over 5 runs each
- Test machine: Chiba City, ANL

Slide 26: Perturbation Study (continued)
- The disabled-probe effect and single-subsystem instrumentation (e.g., scheduling) are very cheap
- Complete integrated profiling costs under 3% on average, and as low as 1.58%
- Sweep3D on 128 nodes:
    Elapsed time: Base 368.25 s, ProfAll+TAU 369.9 s
    Avg. slowdown: 0.49%
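
As a quick arithmetic check on the reported means (a sketch, not additional data from the study):

```python
# Slowdown from the reported Sweep3D means (368.25 s base, 369.9 s profiled).
base, prof = 368.25, 369.9
print(f"{100 * (prof - base) / base:.2f}% slower")  # prints 0.45%
# The slide reports 0.49%; presumably the per-run slowdowns were averaged
# rather than computed from the means, which would account for the small gap.
```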

Slide 27: Future Work
- Dynamic measurement control
- Improve performance data sources
- Improve integration with TAU's user-space capabilities:
  - Better correlation of user and kernel performance
  - Full callpaths and phase-based profiling
  - Merged user/kernel traces (already available)
- Integration of TAU and KTAU with Supermon
- Porting efforts to IA-64, PPC-64, and AMD Opteron
- ZeptoOS characterization efforts:
  - BG/L I/O node
  - Dynamically adaptive kernels

Slide 28: Acknowledgements
- Department of Energy, Office of Science
- National Science Foundation
- University of Oregon (UO) core team:
  - Aroon Nataraj, PhD student
  - Prof. Allen D. Malony
  - Dr. Sameer Shende, Senior Scientist
  - Alan Morris, Senior Software Engineer
- Argonne National Lab (ANL) contributors:
  - Pete Beckman
  - Kamil Iskra
  - Kazutomo Yoshii
- Thanks also to:
  - Suravee Suthikulpanit, UO MS (graduated)
  - Rick Bradshaw, HPC systems support, ANL

Slide 29: Outline
- Motivation
- Objectives
- ZeptoOS project
- KTAU architecture
- Case studies: the performance views
- Perturbation study
- Future work
- Acknowledgements

Slide 30: Recent Work on the ZeptoOS Project
- Accurate identification of "noise" sources:
  - Modified Linux on BG/L should be efficient
  - Effect of OS "noise" on synchronization/collectives
  - Which OS aspects induce which types of interference: code paths, configurations, attached devices
  - Requires user-level and OS measurement
- If we can identify the noise sources, then we can remove or alleviate the interference

Slide 31: Approach
- ANL's Selfish benchmark identifies "detours":
  - Noise events in user space
  - Shows durations and frequencies of events
  - Does NOT show cause or source
  - Runs a tight loop with an expected (ideal) duration; logs the times and durations of detours (a sketch follows this slide)
- Use KTAU OS tracing to record OS activity:
  - Correlate by time of occurrence (uses the same time source as the Selfish benchmark)
  - Infer the type of OS activity (if any) causing each "detour"
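
A sketch of the detour-detection idea (the real Selfish benchmark is ANL's C code; the threshold and loop count here are assumed):

```python
import time

THRESHOLD_S = 20e-6  # flag iterations taking > 20 microseconds (assumed)
detours = []

# Tight loop: each iteration should take roughly the same short time.
prev = time.perf_counter()
for _ in range(1_000_000):
    now = time.perf_counter()
    if now - prev > THRESHOLD_S:
        # The loop was stalled far longer than expected: something
        # (OS noise?) preempted us. Record when and for how long.
        detours.append((prev, now - prev))
    prev = now

for start, dur in detours[:10]:
    print(f"detour at t={start:.6f} s lasting {dur * 1e6:.1f} us")
# Matching these timestamps against a KTAU OS trace (same clock) reveals
# which kernel activity, if any, caused each detour.
```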

Slide 32: OS/User Performance View of Scheduling
[Figure: correlated OS/user timeline highlighting preemptive scheduling]

Slide 33: OS/User View of OS Background Activity
[Figure]

Slide 34: OS/User View of OS Background Activity (continued)
[Figure]

Slide 35: Other Applications of KTAU
[Placeholder slide — content not filled in]

