Open XDMoD Overview
Tom Furlani, Center for Computational Research, University at Buffalo, October 15, 2015
XDMoD: What Is It?
- Comprehensive framework for HPC management
  - Provides a wide range of utilization metrics
  - Web-based portal interface
- Measures QoS of HPC infrastructure
  - Diagnostic tools: early identification of system problems
- Provides job-level performance data
  - Identifies underperforming jobs and applications
- 5-year NSF grant (XD Net Metrics Service, XMS)
  - XDMoD: XSEDE version
  - Open XDMoD: open source version for HPC centers*
- * 100+ academic and industrial installations worldwide
- http://xdmod.sourceforge.net/
Open XDMoD Benefits for the Stakeholders
- University senior leadership: comprehensive resource management and planning tool; scientific impact - return on investment metrics
- HPC center director: return on investment metrics
- Systems administrator: system diagnostic and performance tuning tool (QoS), application tuning, detailed job-level performance information
- HPC support specialist: tool to identify and help diagnose underperforming applications
- PI and end user: more effective use of allocation, resource selection, improved code performance, improved throughput
XDMoD Portal: XD Metrics on Demand
- Display metrics through a GUI: utilization, performance, publications
- Role based: view tailored to the role of the user (public, end user, PI, center director, program officer)
- Custom report builder
- Multiple file export capability: Excel, PDF, XML, RSS, etc.
QoS: Application Kernel Use Case
- Application kernels help detect user environment anomalies at CCR
- Example: performance variation of NWChem due to a bug in a commercial parallel file system (PanFS) that was subsequently fixed by the vendor
- Figure annotation: vendor patch installed
Measuring Job Level Performance
- Collaboration with Texas Advanced Computing Center
- Integration of XDMoD with monitoring frameworks
  - TACC_Stats/Lariat, Performance Co-Pilot, Ganglia, etc.
  - Supply XDMoD with job performance data: applications run, memory, local I/O, network, file-system, and CPU usage
  - Available in Open XDMoD as a beta release at SC15; already in production in the XSEDE version
- Identify poorly performing jobs (users) and applications
  - Automated process: thousands of jobs run per day, so manually searching for poorly performing codes is not possible
  - Jobs can be flagged for idle nodes, node failure, and high cycles per instruction (CPI)
- HPC consultants can use tools to identify and diagnose problems
  - Job viewer tab in XDMoD portal
  - User Report Card
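The automated flagging described on this slide can be sketched roughly as follows. This is an illustration only and not the XDMoD implementation: the `JobSummary` record, the `flag_job` function, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration.
IDLE_CPU_USER = 0.05  # a node below this CPU user fraction is treated as idle
HIGH_CPI = 2.0        # cycles per instruction above this is treated as high


@dataclass
class JobSummary:
    """Hypothetical per-job record assembled from monitoring data."""
    job_id: str
    node_cpu_user: List[float]  # CPU user fraction per node, 0.0 to 1.0
    cpi: float                  # cycles per instruction over the whole job
    node_failed: bool = False   # True if a node failure was reported


def flag_job(job: JobSummary) -> List[str]:
    """Return the reasons (possibly none) for flagging a job."""
    flags = []
    if any(u < IDLE_CPU_USER for u in job.node_cpu_user):
        flags.append("idle nodes")
    if job.node_failed:
        flags.append("node failure")
    if job.cpi > HIGH_CPI:
        flags.append("high CPI")
    return flags


# Made-up jobs: one healthy, one with an idle node and a high CPI.
jobs = [
    JobSummary("A", node_cpu_user=[0.97, 0.95], cpi=0.8),
    JobSummary("B", node_cpu_user=[0.96, 0.01], cpi=3.1),
]
flagged = {j.job_id: flag_job(j) for j in jobs if flag_job(j)}
```

With thousands of jobs per day, a filter of this kind leaves consultants with a short list of flagged jobs to open in the job viewer.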
XDMoD Job Viewer Example 1
- Relatively poor CPU User fraction (0.75), poor CPU User Balance (some cores not utilized)
XDMoD Job Viewer Example 1.1
- Per-node CPU activity tops out at 75% …
XDMoD Job Viewer Example 1.2
- Drilldown per node reveals underutilized cores (12/16) …
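The per-node drilldown can be illustrated with a small sketch. It assumes one reading of the figures: 12 of 16 cores on a node are busy, which is consistent with the 0.75 CPU User fraction. The `drill_down` function, the `BUSY_THRESHOLD` cutoff, and the sample data are hypothetical and are not taken from the job viewer.

```python
from typing import Dict, List

BUSY_THRESHOLD = 0.5  # hypothetical cutoff separating busy cores from idle ones


def drill_down(per_core_user: Dict[str, List[float]]) -> Dict[str, dict]:
    """Summarize CPU user fraction and busy-core count for each node."""
    report = {}
    for node, cores in per_core_user.items():
        busy = sum(1 for u in cores if u >= BUSY_THRESHOLD)
        report[node] = {
            "cpu_user": sum(cores) / len(cores),  # node-level average
            "busy_cores": busy,
            "total_cores": len(cores),
        }
    return report


# Made-up node: 12 cores fully busy, 4 cores idle.
example = {"node01": [1.0] * 12 + [0.0] * 4}
report = drill_down(example)
```

In this made-up case the node average is 0.75 even though every busy core runs at full load, which is why the per-core view is needed to see the imbalance.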
Recovering Wasted CPU Cycles
- Software tools to identify poorly performing jobs
- Job 2552292 ran very inefficiently (less than 30% CPU usage)
- After HPC specialist user support, a similar job had ~100% CPU usage
- Before: CPU efficiency below 35%
- After: CPU efficiency near 100%
- The Slurm script was using srun in a loop. The job was not utilizing all requested nodes and cores (only 6 of the 60 cores were in use). This type of computation is better suited to a job array.
- Job 2585868: launched by the same user as job 2552292, after the user corrected the Slurm script. As a consequence, this job used all requested cores.
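A job array replaces the loop with many independent tasks, each of which picks its own piece of work. The sketch below shows the task side in Python; it is not the user's corrected script. It assumes the array was submitted with zero-based indices (for example `sbatch --array=0-9`), and the `select_work_item` helper and the input file names are hypothetical. `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task.

```python
import os
from typing import Mapping, Sequence


def select_work_item(inputs: Sequence[str],
                     env: Mapping[str, str] = os.environ) -> str:
    """Return the input assigned to this array task.

    Assumes the array indices run from 0 to len(inputs) - 1.
    """
    task_id = int(env["SLURM_ARRAY_TASK_ID"])
    return inputs[task_id]


# Hypothetical input list; every array task sees the same list
# and processes only the entry matching its own index.
inputs = ["case_000.in", "case_001.in", "case_002.in"]

# Outside Slurm the variable is unset, so pass a stand-in mapping to try it out.
item = select_work_item(inputs, env={"SLURM_ARRAY_TASK_ID": "1"})
```

Because each task is scheduled separately, the scheduler can run the tasks concurrently across the available cores instead of leaving most of one large allocation idle.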
Derived Metrics
Derived metrics for job compute efficiency analysis:
- CPU User (job length > 1h): CPU user average, normalized to unity
- CPU User Balance (job length > 1h): ratio of best CPU user average to worst, normalized to unity (1 = uniform)
- CPU Homogeneity (job length > 1h): inverse ratio of largest drop in L1 data cache rate, normalized to one (zero = inhomogeneous)
  - Graphical header currently shown only if all three are available (User, User Balance, Homogeneity)
- CPI (counter availability): clocks per instruction
  - Intel fixed counters: CLOCKS_UNHALTED_REF, INSTRUCTIONS_RETIRED
- CPLD (counter availability): clocks per L1 data cache load (CLOCKS_UNHALTED_REF, LOAD_L1D_ALL, MEM_LOAD_RETIRED_L1D_HIT)
- Flop/s (counter availability): varies by CPU
  - Intel: SIMD_DOUBLE_256, SSE_DOUBLE_ALL (SSE_DOUBLE_SCALAR, SSE_DOUBLE_PACKED) (nada for Haswell; blame Intel)
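Several of these metrics reduce to simple ratios, sketched below under stated assumptions. The function names are hypothetical. CPU User Balance is computed as worst over best so that 1 means uniform, which is one reading of "normalized to unity". CPLD uses LOAD_L1D_ALL as the denominator, which is also an assumption, since the slide lists two candidate load counters. CPU Homogeneity is omitted because the slide does not define it fully enough to reproduce.

```python
from typing import Sequence


def cpu_user(per_core_user: Sequence[float]) -> float:
    """Average CPU user fraction over all cores (each value in 0.0 to 1.0)."""
    return sum(per_core_user) / len(per_core_user)


def cpu_user_balance(per_core_user: Sequence[float]) -> float:
    """Worst core over best core, so 1.0 means a uniform load (assumed reading)."""
    best = max(per_core_user)
    if best == 0:
        return 0.0  # no core did any user work; treated as fully unbalanced here
    return min(per_core_user) / best


def cpi(clocks_unhalted_ref: float, instructions_retired: float) -> float:
    """Clocks per instruction from the two Intel fixed counters."""
    return clocks_unhalted_ref / instructions_retired


def cpld(clocks_unhalted_ref: float, load_l1d_all: float) -> float:
    """Clocks per L1 data cache load (denominator choice is an assumption)."""
    return clocks_unhalted_ref / load_l1d_all


# Made-up counter values and per-core fractions, for illustration only.
cores = [1.0] * 12 + [0.0] * 4
user = cpu_user(cores)
balance = cpu_user_balance(cores)
job_cpi = cpi(3.0e9, 2.0e9)
```

A job with a reasonable CPU User value can still have a balance near zero, as in the made-up data here, so the two metrics are best read together.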