Presentation is loading. Please wait.

Presentation is loading. Please wait.

Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU Aroon Nataraj Performance Research Lab University of Oregon.

Similar presentations


Presentation on theme: "Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU Aroon Nataraj Performance Research Lab University of Oregon."— Presentation transcript:

1 Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU Aroon Nataraj Performance Research Lab University of Oregon

2 KTAU: Outline Introduction Motivations Objectives Architecture / Implementation Choices Experimentation – the performance views Perturbation Study Future work and directions Acknowledgements

3 Introduction : ZeptoOS and TAU DOE OS/RTS for Extreme Scale Scientific Computation(Fastos)  Conduct OS research to provide effective OS/Runtime for petascale systems ZeptoOS (under Fastos)  Scalable components for petascale architectures  Joint project Argonne National Lab and University of Oregon  ANL: Putting light-weight kernel (based on Linux) on BG/L and other platforms (XT3) University of Oregon  Kernel performance monitoring, tuning  KTAU Integration of TAU infrastructure with Linux Kernel Integration with ZeptoOS, installation on BG/L Port to 32-bit and 64-bit Linux platforms

4 KTAU: Motivation Application Performance  user-level execution performance +  OS-level operations performance Different Domains: E.g. Time, Hardware Perf. Metrics PAPI (Performance Application Programming Interface)  Exposes virtualized hardware counters TAU (Tuning and Analysis Utility)  Measures a lot of interesting user-level entities: parallel application, MPI, libraries …  Time domain  Uses PAPI to correlate counter information with source

5 KTAU: Motivation Simple Parallel Model

6 KTAU: Motivation Simple Parallel Model - Scale

7 As HPC systems continue to scale to larger processor counts  Application performance more sensitive  New OS factors become performance bottlenecks (E.g. [Petrini’03, Jones’03, other works…])  Isolating these system-level issues as bottlenecks is non-trivial Comprehensive performance understanding  Observation of all performance factors  Relative contributions and interrelationship: can we correlate? KTAU: Motivation Effects of Scale

8 KTAU: Motivation Program - OS Interactions Program OS Interactions - Direct vs. Indirect Entry Points  Direct - Applications invoke the OS for certain services Syscalls (and internal OS routines called directly from syscalls)  Indirect - OS takes actions without explicit invocation by application Preemptive Scheduling (HW) Interrupt handling OS-background activity (keeping track of time and timers, bottom-half handling, etc)  Indirect interactions can occur at any OS entry (not just when entering through Syscalls)

9 KTAU: Motivation Program - OS Interactions

10 Direct Interactions easier to handle  Synchronous with user-code and in process-context Indirect Interactions more difficult to handle  Usually asynchronous and in interrupt-context: Hard to measure and harder to correlate/integrate with app. measurements Indirect interactions may be unrelated to current task  E.g. Kernel-level packet processing for another process  But related in terms of time to current process

11 KTAU: Motivation Program - OS Interactions (Partial)

12 KTAU: Motivation Kernel-wide vs. Process-centric Kernel-wide - Aggregate kernel activity of all active processes in system  Understand overall OS behavior, identify and remove kernel hot spots.  Cannot show what parts of app. spend time in OS and why Process-centric perspective - OS performance within context of a specific application’s execution  Virtualization and Mapping performance to process  Interactions between programs, daemons, and system services  Tune OS for specific workload or tune application to better conform to OS config.  Expose real source of performance problems (in the OS or the application)

13 KTAU: Motivation Kernel-wide vs. Process-centric

14 KTAU: Motivation Existing Approaches User-space Only measurement tools  Many tools only work at user-level and cannot observe system-level performance influences Kernel-level Only measurement tools  Most only provide the kernel-wide perspective – lack proper mapping/virtualization  Some provide process-centric views but cannot integrate OS and user-level measurements Combined or Integrated User/Kernel Measurement Tools  A few powerful tools allow fine-grained measurement and correlation of kernel and user-level performance  Typically these focus only on Direct OS interactions. Indirect interactions not merged. Using Combinations of above tools  Without better integration, does not allow fine-grained correlation between OS and App.  Many kernel tools do not explicitly recognize Parallel workloads (e.g. MPI ranks) Need an integrated approach to parallel perf. observation, analyses

15 KTAU: High-Level Objectives Support low-overhead OS performance measurement at multiple levels of function and detail Provide both kernel-wide and process-centric perspectives of OS performance Integrate user and kernel-level performance information across all program-OS interactions Provide online information and the ability to function without a daemon where possible Support both profiling and tracing for kernel-wide and process- centric views in parallel systems Leverage existing parallel performance analysis/viz tools Support for observing, collecting and analyzing parallel data

16 KTAU: Outline Introduction Motivations Objectives Architecture / Implementation Choices Experimentation – the performance views Perturbation Study ZeptoOS – KTAU on Blue Gene / L Future work and directions Acknowledgements

17 KTAU Architecture

18 KTAU: Arch. / Impl. Choices Instrumentation  Static Source instrumentation  Macro Map-ID: Map block of code and process-context to unique index (dense id-space) – easy array lookup.  Macro Start, Stop – provide the mapping index and process-context is implicit Measurement  Differentiate between ‘local/self’ and ‘inter-context’ access. HPC codes primarily use ‘self’.  Store performance data in PCB (task_struct) Integrating Kernel/User Performance state  Don’t assume synchronous kernel-entry or process- context  Have to use memory mapping between kernel and appl. State  Pinning shared state in memory  Kernel Call Groups – program-OS interactions summary Analyses and Visualization – Use TAU facilities

19 KTAU: Controlled Experiments Controlled Experiments  Exercise kernel in controlled fashion  Check if KTAU produces the expected correct and meaningful views Test machines  Neutron: 4-CPU Intel P3 Xeon 550MHz, 1GB RAM, Linux 2.6.14.3(ktau)  Neuronic: 16-node 2-CPU Intel P4 Xeon 2.8GHz, 2GB RAM/node, Redhat Enterprise Linux 2.4(ktau) Benchmarks  NPB LU, SP applications [NPB] Simulated computational fluid dynamics (CFD) applications. A regular-sparse, block lower and upper triangular system solution.  LMBENCH [LMBENCH] Suite of micro-benchmarks exercising Linux kernel  A few others not shown (e.g. SKAMPI)

20 KTAU: Controlled Examples continued… Profiling

21 Approx. 25 secs dilation in Total Inclusive time. Why? Approx. 16 secs dilation in MPSP() Exclusive time. Why? User-level profile does not tell the whole story! KTAU: Controlled Experiments Observing Interrupts User-level Inclusive TimeUser-level Exclusive Time Benchmark: NPB-SP application on 16 nodes

22 KTAU: Controlled Experiments Observing Interrupts User+OS Inclusive TimeUser+OS Exclusive Time Kernel-Space Time Taken by: 1.do_softirq() 2. schedule() 3. do_IRQ() 4. sys_poll() 5. icmp_rcv() 6. icmp_reply() MPSP excl. time difference only 4 secs. Excl-time view clearly identifies the culprits. 1. schedule() 2. do_IRQ() 3. icmp_reply() 4. do_softirq() Pings cause interrupts (do_IRQ). Which in turn handled after interrupt by soft-interrupts (do_softirq). Actual routine is icmp_reply/rcv. Large number of softirqs causes ksoftirqd to be scheduled-in, causing SP to be scheduled-out.

23 KTAU: Controlled Experiments Observing Scheduling NPB LU application on 8 CPU (neuronic.nic.uoregon.edu) Simulate daemon interference using “toy” daemon Daemon periodically wakes-up and performs activity What does KTAU show - different views… A: Aggregated Kernel-wide View (Each row is single host)

24 KTAU: Controlled Experiments Observing Scheduling B: Process-level View (Each row is single process on host 8) Select node 8 and take a closer look … 2 NPB LU processes ‘Toy’ Daemon activity

25 KTAU: Controlled Experiments Observing Scheduling C: Voluntary / Involuntary Scheduling (Each row is single MPI rank) Instrumentation to differentiate voluntary/involuntary schedule Experiment re-run on 4-processor SMP Local slowdown - preemptive scheduling Remote slowdown - voluntary scheduling (waiting!) Pre-empted out by ‘Toy’ daemon Other 3 yield cpu voluntarily and wait!

26 LMBENCH Page-Fault Call Group Relations Program-OS Call Graph KTAU: Controlled Experiments Observing Exceptions

27 Merging App / OS Traces MPI_Send OS Routines Fine-grained Tracing Shows detail inside interrupts and bottom halves Using VAMPIR Trace Visualization [VAMPIR] KTAU: Controlled Experiments Tracing

28 KTAU: Controlled Examples continued… Tracing Correlating CIOD and RPC-IOD Activity

29 KTAU: Larger-Scale Runs Run parallel benchmarks on larger-scale (128 dual-cpu nodes)  Identify (and remove) system-level performance issues  Understand perturbation overheads introduced by KTAU NPB benchmark: LU Application [NPB]  Simulated computational fluid dynamics (CFD) application. A regular-sparse, block lower and upper triangular system solution. ASC benchmark: Sweep3D [Sweep3d]  Solves a 3-D, time-independent, neutron particle transport equation on an orthogonal mesh. Test machine: Chiba-City Linux cluster (ANL)  128 dual-CPU Pentium III, 450MHz, 512MB RAM/node, Linux 2.6.14.2 (ktau) kernel, connected by Ethernet

30 KTAU: Larger-Scale Runs Experienced problems on Chiba by chance Initially ran NPB-LU and Sweep3D codes on 128x1 configuration Then ran on 64x2 configuration Extreme performance hit (72% slower!) with the 64x2 runs Used KTAU views to identify and solve issues iteratively Eventually brought performance gap to 13% for LU and 9% for Sweep.

31 KTAU: Larger-scale Runs Two ranks - relatively very low MPI_Recv() time. Two ranks - MPI_Recv() diff. from Mean in OS-SCHED. User-level MPI_RecvMPI_Recv OS Interactions

32 KTAU: Larger-scale Runs Two ranks have very low voluntary scheduling durations. (Same) Two ranks have very large preemptive scheduling. Voluntary SchedulingPreemptive Scheduling Note: x-axis log scale

33 KTAU Larger-scale Runs NPB LU processes PID:4066, PID:4068 active. No other significant activity! Why the Pre- emption? 64x2 Pinned: Interrupt Activity Bimodal across MPI ranks. ccn10 Node-level ViewInterrupt Activity

34 KTAU Larger-scale Runs Many more OS-TCP CallsApprox. 100% longer 100% More background OS-TCP activity in Compute phase. More imbalance! Use ‘Merged’ performance data to identify imbalance.Why does purely compute bound region have lots of I/O? TCP within Compute : TimeTCP within Compute : Calls

35 KTAU Larger-scale Runs OS-TCP in SMP Costlier IRQ-Balancing blindly distributes interrupts and bottom-halves. E.g.: Handling TCP related BH in CPU-0 for LU-process on CPU-1  Cache issues! [COMSWARE] Cost / Call of OS-level TCP

36 KTAU Perturbation Study Five different Configurations  Base: Vanilla kernel, un-instrumented benchmark  Ktau-Off: Kernel patched with Ktau and instrumentations compiled-in. But all instrumentations turned Off (boot-time control)  Prof-All: All kernel instrumentations turned On.  Prof-Sched: Only scheduler subssystem’s instrumentations turned on  Prof-All+TAU: ProfAll, but also with user-level Tau instrumentation enabled NPB LU application benchmark:  16 nodes, 5 different configurations, Mean over 5 runs each ASC Sweep3D:  128 nodes, Base and Prof-All+TAU, Mean over 5 runs each. Test machine: Chiba-City ANL

37 KTAU Perturbation Study Disabled probe effect. Single instrumentation very cheap. E.g. Scheduling. Complete Integrated Profiling Cost under 3% on Avg. and as low as 1.58%. Sweep3d on 128 Nodes Base ProfAll+TAU Elapsed Time: 368.25 369.9 % Avg Slow.: 0.49%

38 KTAU: Outline Introduction Motivations Objectives Architecture / Implementation Choices Experimentation – the performance views Perturbation Study Future work and directions Acknowledgements

39 KTAU: Future Work Dynamic measurement control - enable/disable events w/o recompilation or reboot Improve performance data sources that KTAU can access - E.g. PAPI Improve integration with TAU’s user-space capabilities to provide even better correlation of user and kernel performance information  full callpaths,  phase-based profiling,  merged user/kernel traces Integration of Tau, Ktau with Supermon (possibly MRNet?), TAUg (next) Porting efforts: IA-64, PPC-64 and AMD Opteron ZeptoOS: Planned characterization efforts  BGL I/O node  Dynamically adaptive kernels

40 Acknowledgements Prof. Allen D Malony Dr. Sameer Shende, Senior Scientist Alan Morris, Senior Software Engineer, PRL Suravee Suthikulpanit, MS Student (Graduated)

41 Support Acknowledgements Department of Energy’s Office of Science (contract no. DE-FG02-05ER25663) and National Science Foundation (grant no. NSF CCF 0444475)

42 References [petrini’03]:F. Petrini, D. J. Kerbyson, and S. Pakin, “The case of the missing supercomputer performance: Achieving optimal performance on the 8,192 processors of asci q,” in SC ’03 [jones’03]: T. Jones and et al., “Improving the scalability of parallel jobs by adding parallel awareness to the operating system,” in SC ’03 [PAPI]: S. Browne et al., “A Portable Programming Interface for Performance Evaluation on Modern Processors”. The International Journal of High Performance Computing Applications, 14(3):189-- 204, Fall 2000. [VAMPIR]: W. E. Nagel et. al., “VAMPIR: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, 1996. [ZeptoOS]: “ZeptoOS: The small linux for big computers,” http://www.mcs.anl.gov/zeptoos/ [NPB]: D.H. Bailey et. al., “The nas parallel benchmarks,” The International Journal of Supercomputer Applications, vol. 5, no. 3, pp. 63–73, Fall 1991.

43 References [Sweep3d]: A. Hoise et. al., “A general predictive performance model for wavefront algorithms on clusters of SMPs,” in International Conference on Parallel Processing, 2000 [LMBENCH]: L. W. McVoy and C. Staelin, “lmbench: Portable tools for performance analysis,” in USENIX Annual Technical Conference, 1996, pp. 279–294 [TAU]: “TAU: Tuning and Analysis Utilities,” http://www.cs.uoregon.edu/research/paracomp/tau/ http://www.cs.uoregon.edu/research/paracomp/tau/ [KTAU-BGL]: A. Nataraj, A. Malony, A. Morris, and S. Shende, “Early experiences with ktau on the ibm bg/l,” in EuroPar’06, European Conference on Parallel Processing, 2006. [KTAU]: A. Nataraj et al., “Kernel-Level Measurement for Integrated Parallel Performance Views: the KTAU Project” (under submission)


Download ppt "Kernel-level Measurement for Integrated Parallel Performance Views KTAU: Kernel - TAU Aroon Nataraj Performance Research Lab University of Oregon."

Similar presentations


Ads by Google