Presentation on theme: "Tom Furlani, Director September 19, 2015 XDMoD Overview."— Presentation transcript:

1 XDMoD Overview. Tom Furlani, Director. September 19, 2015.

2 Outline
- Motivation
- XDMoD Portal
- Measuring QoS
- Job-Level Performance

3 XDMoD: Comprehensive HPC Management
- 5-year NSF grant (XD Net Metrics Service, XMS)
- XDMoD: XD Metrics on Demand*
  - Analytics framework developed for XSEDE
  - On-demand, responsive access to job accounting data
- Comprehensive framework for HPC management:
  - Support for several resource managers (Slurm, PBS, LSF, SGE)
  - Utilization metrics across multiple dimensions
  - Measures QoS of the HPC infrastructure (application kernels)
  - Job-level performance data (SUPReMM)
- Open XDMoD: open-source version for HPC centers
  - 100+ academic and industrial installations worldwide
  - http://xdmod.sourceforge.net/
1. J.T. Palmer, S.M. Gallo, T.R. Furlani, M.D. Jones, R.L. DeLeon, J.P. White, N. Simakov, A.K. Patra, J. Sperhac, T. Yearke, R. Rathsam, M. Innus, C.D. Cornelius, J.C. Browne, W.L. Barth, and R.T. Evans, "Open XDMoD: A Tool for the Comprehensive Management of High Performance Computing Resources," Computing in Science and Engineering 17(4), 52-62, July-August 2015. DOI: 10.1109/MCSE.2015.68

4 Motivation: Improve User Experience
- Users shouldn't be the "canary in the coal mine" that identifies problems
- Example: log file analysis discovers two malfunctioning nodes

5 Motivation: Improve User Throughput
- Software tools to automatically identify poorly performing jobs (a sketch of one such check follows below)
- Job 2552292 ran very inefficiently, with CPU efficiency below 35%
- After help from an HPC specialist in user support, a similar job ran at near 100% CPU efficiency
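As a minimal sketch of such a check, assuming per-job CPU-usage samples are already available, a simple threshold rule is enough to surface candidates for user support. The sample format, the second job ID, and the 35% cutoff are illustrative assumptions, not XDMoD's actual implementation:

```python
# Hypothetical sketch: flag jobs whose average CPU efficiency falls below
# a threshold. The sample format (per-interval fraction of CPU time spent
# in user code) and the 0.35 cutoff are illustrative, not XDMoD's code.
def cpu_efficiency(samples):
    """Mean fraction of available CPU time actually used across samples."""
    return sum(samples) / len(samples) if samples else 0.0

def flag_inefficient_jobs(jobs, threshold=0.35):
    """Return the IDs of jobs whose average efficiency is below threshold."""
    return [job_id for job_id, samples in jobs.items()
            if cpu_efficiency(samples) < threshold]

# Example: a job averaging ~30% CPU efficiency gets flagged for follow-up.
jobs = {2552292: [0.31, 0.28, 0.33], 1000001: [0.97, 0.99, 0.98]}
print(flag_inefficient_jobs(jobs))  # -> [2552292]
```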

6 XDMoD Portal: XD Metrics on Demand
- Displays metrics through a GUI interface: utilization, performance, publications
- Role-based: the view is tailored to the role of the user (public, end user, PI, center director, program officer)
- Custom report builder
- Multiple file export formats: Excel, PDF, XML, RSS, etc.

7 CPU Hours Delivered by Decanal Unit

8 Drill Down: CPU Hours and Jobs by Engineering Dept

9 QoS: Application Kernels
- Computationally lightweight
- Run continuously and on demand to actively measure performance (a toy timing harness follows below)
- Use open-source codes such as GAMESS, NWChem, NAMD, and OpenFOAM, as well as customized kernels
- Measure system performance from the user's perspective: local scratch and global filesystem performance, local processor-memory bandwidth, allocatable shared memory, processing speed, network latency and bandwidth
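The mechanics can be sketched as a small harness: run a lightweight kernel on a schedule, time it, and append the result to a history for later comparison. Everything here (the command, kernel name, and JSON-lines store) is an illustrative stand-in; the real application kernel toolkit submits kernels through the batch system and records much richer data:

```python
# Illustrative harness for periodically timing an application kernel.
# The command and the JSON-lines result store are stand-ins; the real
# app kernel toolkit submits kernels through the batch system and
# records far richer performance data.
import json
import subprocess
import time

def run_kernel(command):
    """Run one kernel command; return wall time in seconds, or None on failure."""
    start = time.time()
    result = subprocess.run(command, capture_output=True)
    return time.time() - start if result.returncode == 0 else None

def record(history_file, kernel_name, wall_time):
    """Append one timing sample to a JSON-lines history file."""
    with open(history_file, "a") as f:
        f.write(json.dumps({"kernel": kernel_name,
                            "timestamp": time.time(),
                            "wall_time": wall_time}) + "\n")

wall = run_kernel(["sleep", "1"])  # stand-in for a real NWChem/NAMD kernel run
if wall is not None:
    record("kernel_history.jsonl", "nwchem-small", wall)
```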

10 Application Kernel Use Case
- Application kernels helped detect a user-environment anomaly at CCR
- Example: performance variation of NWChem caused by a bug in a commercial parallel file system (PanFS); the vendor subsequently issued a patch that fixed it

11 Application Kernel Use Case
- Uncovered a performance issue with CCR's Panasas parallel file system
- The timing coincided with a recent core switch software upgrade
(A simple drift-detection rule covering cases like these is sketched below.)
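Both use cases above reduce to spotting a shift in a kernel's timing history. A simple control-chart rule, flagging runs outside the mean plus or minus two standard deviations of a trailing window, is one way such drift could be surfaced automatically; the window size and sigma multiplier are assumptions for this sketch, not how XDMoD actually detects anomalies:

```python
# Illustrative control-chart check over a kernel's run-time history.
# The 20-run window and 2-sigma band are arbitrary choices for this sketch.
import statistics

def anomalies(run_times, window=20, n_sigma=2.0):
    """Yield (index, run_time) for runs outside mean +/- n_sigma * stdev
    of the preceding `window` runs."""
    for i in range(window, len(run_times)):
        ref = run_times[i - window:i]
        mu = statistics.mean(ref)
        sigma = statistics.stdev(ref)
        if abs(run_times[i] - mu) > n_sigma * sigma:
            yield i, run_times[i]

# Example: a sudden slowdown (e.g. a file-system regression) stands out
# against normal run-to-run noise.
times = [100.0 + 0.5 * (i % 3) for i in range(30)] + [135.0, 138.0]
print(list(anomalies(times)))  # flags the last two runs
```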

12 Measuring Job-Level Performance
- Collaboration with the Texas Advanced Computing Center (TACC)
- Integration of XDMoD with monitoring frameworks: TACC_Stats/Lariat, Performance Co-Pilot, Ganglia, etc.
- These supply XDMoD with job performance data: applications run, memory, local I/O, network, file system, and CPU usage
- Goal: identify poorly performing jobs (users) and applications

13 Metrics Gathered
- Metrics gathered per node: essentially anything available (CPU, I/O, memory, filesystem); a toy sampler follows below
- Extensible: additional measurable quantities can be included with some development work (e.g., CUDA, MIC, PanFS, GPFS, script capture)
- Overhead: so far too small to measure against the variability inherent in running jobs (on the order of a percent), but keep the potential for overhead in mind when extending the metrics
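As a rough illustration of what "anything available" means on a Linux node, per-node collectors essentially sample /proc counters at an interval. The toy sampler below reads two such sources; real collectors such as TACC_Stats or Performance Co-Pilot gather many more subsystems and tag each sample with the occupying job:

```python
# Toy per-node sampler reading Linux /proc counters. Real collectors
# (TACC_Stats, Performance Co-Pilot) cover many more subsystems and
# tag each sample with the job occupying the node.
def read_cpu_jiffies():
    """Return (total, idle) jiffies from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        values = [int(x) for x in f.readline().split()[1:]]
    return sum(values), values[3]  # field 4 of the cpu line is idle time

def read_meminfo():
    """Return (MemTotal, MemAvailable) in kB from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    return info["MemTotal"], info["MemAvailable"]

total, idle = read_cpu_jiffies()
mem_total, mem_avail = read_meminfo()
print(f"idle CPU fraction since boot: {idle / total:.2f}, "
      f"memory in use: {(mem_total - mem_avail) / mem_total:.1%}")
```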

14 Why Collect Job-Level Performance Data?
- User report card
- Identify underperforming user codes
- An automated process is needed: with thousands of jobs run per day, it is not possible to manually search for poorly performing codes
- Jobs can be flagged for idle nodes, node failure, or high cycles per instruction (CPI); a schematic rule set follows below
- Performance plots and data are available from the web interface and the command line
- HPC consultants can use these tools to identify and diagnose problems (single job viewer)
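Given per-job summaries, the flagging step reduces to a rule set applied in bulk. A schematic version follows; the field names and thresholds are assumptions for illustration, not SUPReMM's actual schema:

```python
# Schematic rule-based flagging of per-job summaries. The field names
# and thresholds are illustrative assumptions, not SUPReMM's actual schema.
def flag_job(job):
    """Return the list of flag categories that apply to one job summary."""
    flags = []
    if job.get("min_node_cpu", 1.0) < 0.05:      # one node nearly idle
        flags.append("idle node")
    if job.get("nodes_reporting", job["nodes"]) < job["nodes"]:
        flags.append("node failure")
    if job.get("cpi", 0.0) > 1.0:                # high cycles per instruction
        flags.append("high CPI")
    return flags

job = {"nodes": 7, "nodes_reporting": 7, "min_node_cpu": 0.01, "cpi": 1.2}
print(flag_job(job))  # -> ['idle node', 'high CPI']
```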

15-18 Single Job Viewer
Job information for Stampede job #3147958, displaying accounting data, application classification, and SUPReMM metrics with custom analysis (four slides of screenshots of the same viewer).

19 Improving Job-Level Performance: Success Stories
MILC code
- One project using MILC was found to be running at a higher-than-expected CPI (1.1 vs. 0.7)
- Members were not using the available vectorized intrinsics
- Result: 11% reduction in runtime
DNS turbulence code
- CPI of 1.1 with heavy SU consumption
- Line-level profiling revealed MPI/OpenMP hot spots
- Converted an OpenMP workshare block to a parallel do block
- Improvement: 7% overall, 10% in the main loop, 76% in the converted code block
Singularity code (general relativity)
- CPI of 1.15 with heavy SU consumption: 239,631 SUs in the first quarter of 2014
- The code was making many extraneous calls to cat and rm
- The code was not using any optimization flags (-O3 or -xhost)
- Result: 26% decrease in runtime after simple changes

20 Recovering Wasted CPU Cycles
TACC Stampede: job 3130503 ran for 48 hours on one node out of seven.

21 Recovering Wasted CPU Cycles
- TACC Stampede (via the TAS single job viewer)
- Job 1836370 ran effectively on 45 nodes but spent 3 hours performing a serial write
- After user support, a similar job (1856224) fixed this by using a parallel write (an MPI-IO sketch follows below)
- Savings: 3 hours on 45 nodes recovered
(Before/after plots: CPU user time during the serial write versus the parallel write.)
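The fix here is the standard move from rank-0 serial output to collective MPI-IO. A minimal mpi4py sketch of the pattern (not the user's actual code), in which every rank writes its own slice collectively:

```python
# Minimal mpi4py sketch of the serial-write -> parallel-write fix:
# instead of funneling all data to rank 0 and writing from one process,
# every rank writes its own block collectively via MPI-IO.
# (Illustrative pattern only, not the user's actual code.)
# Run with e.g.: mpiexec -n 4 python parallel_write.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

block = np.full(1024, rank, dtype=np.float64)  # this rank's slice of the data

fh = MPI.File.Open(comm, "output.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * block.nbytes, block)  # collective: all ranks write at once
fh.Close()
```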

22 Underperforming Job Notification
Pipeline (a schematic of the final steps follows below):
1. All jobs running on the cluster
2. XDMoD/SUPReMM collects all job data (prolog, epilog, and 10-minute intervals)
3. All jobs are automatically processed and analyzed for low CPU usage, drops in cache use, etc.
4. "Bad" jobs are flagged by category
5. Notifications go out to user support and, by message, to the user
6. The user or HPC support can then analyze and improve job performance through the single job viewer
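Tying the pipeline together, the flag-and-notify pass can be sketched as a small loop over one day's job summaries. The rule set and mail hook below are hypothetical stand-ins; a production deployment would route notifications through the center's user-support tooling:

```python
# Schematic flag-and-notify pass over one day's job summaries. The rule
# set is a minimal stand-in for the checks sketched earlier, and the
# SMTP hook is hypothetical; real deployments would route notifications
# through the center's user-support tooling.
import smtplib
from email.message import EmailMessage

def flag_job(job):
    """Minimal stand-in rule: flag jobs with high cycles per instruction."""
    return ["high CPI"] if job.get("cpi", 0.0) > 1.0 else []

def notify(user_email, job_id, flags, smtp_host="localhost"):
    """Send the user a short note pointing at the single job viewer."""
    msg = EmailMessage()
    msg["Subject"] = f"Job {job_id} flagged: {', '.join(flags)}"
    msg["From"] = "hpc-support@example.edu"   # hypothetical support address
    msg["To"] = user_email
    msg.set_content(f"Job {job_id} was flagged ({', '.join(flags)}). "
                    "See the single job viewer for details.")
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)

def process(jobs):
    """Flag every job and notify the owners of the flagged ones."""
    for job in jobs:
        flags = flag_job(job)
        if flags:
            notify(job["user_email"], job["id"], flags)
```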

23 Broader Impact
- Open XDMoD: targeted at academic and industrial HPC centers; based on the XDMoD source code
- NCAR: beta test site; collaborating on the development of a storage reporting schema
- Blue Waters: NSF's largest supercomputer
- Many academic HPC centers: Rice, Cambridge, UGA, UF, Chicago, CERN, New Mexico, Southampton, Manitoba, Leibniz U, University Medical Center Utrecht, Liverpool, Illinois, Kansas, RIT, U Leuven (Belgium), U Geneva, SciNet (U Toronto), Case Western, NY Genome Center, U Buffalo, and others
- Known industrial HPC centers: Rolls-Royce, Dow, Lockheed Martin, Hess Energy, and others

24 Acknowledgements
TAS/SUPReMM team
- UB: Tom Furlani, Matt Jones, Steve Gallo, Bob DeLeon, Ryan Rathsam, Jeff Palmer, Tom Yearke, Joe White, Jeanette Sperhac, Abani Patra, Nikolay Simakov, Cynthia Cornelius, Martins Innus, Ben Plessinger
- Indiana: Gregor von Laszewski, Fugang Wang
- University of Texas: Jim Browne
- TACC: Bill Barth, Todd Evans, Weijia Xu
- NICS: Shiquan Su
NSF awards: TAS OCI 1025159, SUPReMM OCI 1203560, XMS ACI-1445806

25 Colella's 7 Dwarfs
The "seven dwarfs" of algorithms for simulation in the physical sciences (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf). "Dwarfs" mine compute cycles for golden results.

26 Colella's 7 Dwarfs
Table 1. Algorithms that play a key role within selected scientific applications, characterized according to a seven-dwarfs classification.*
Columns (dwarfs): structured grids, unstructured grids, FFT, dense linear algebra, sparse linear algebra, N-body, Monte Carlo.
Application areas covered: molecular physics, nanoscale science, climate, environment, combustion, fusion (which draws on all seven dwarfs), nuclear energy, astrophysics, nuclear physics, accelerator physics, QCD, aerodynamics.
*Scientific Application Requirements for Leadership Computing at the Exascale, ORNL/TM-2007/238

