Open XDMoD Overview Tom Furlani, Center for Computational Research


Open XDMoD Overview
Tom Furlani, Center for Computational Research, University at Buffalo
October 15, 2015

XDMoD: What is It?
- Comprehensive framework for HPC management
  - Provides a wide range of utilization metrics
  - Web-based portal interface
- Measures QoS of the HPC infrastructure
  - Diagnostic tools for early identification of system problems
- Provides job-level performance data
  - Identifies underperforming jobs and applications
- 5-year NSF grant (XD Net Metrics Service – XMS)
  - XDMoD – XSEDE version
  - Open XDMoD – open-source version for HPC centers
    - 100+ academic and industrial installations worldwide
    - http://xdmod.sourceforge.net/

Open XDMoD Benefits for the Stakeholders
- University senior leadership: comprehensive resource management and planning tool; scientific impact and return-on-investment metrics
- HPC center director: return-on-investment metrics
- Systems administrator: system diagnostics and performance tuning (QoS), application tuning, detailed job-level performance information
- HPC support specialist: tool to identify and help diagnose underperforming applications
- PI and end user: more effective use of allocations, better resource selection, improved code performance, improved throughput

XDMoD Portal: XD Metrics on Demand
- Displays metrics through a GUI: utilization, performance, publications
- Role-based: view tailored to the role of the user (public, end user, PI, center director, program officer)
- Custom report builder
- Export to multiple file formats: Excel, PDF, XML, RSS, etc.

QoS: Application Kernel Use Case
- Application kernels help detect user-environment anomalies at CCR
- Example: performance variation of NWChem was traced to a bug in a commercial parallel file system (PanFS); the vendor subsequently issued a patch, and performance recovered once the patch was installed

Measuring Job-Level Performance
- Collaboration with the Texas Advanced Computing Center
- Integration of XDMoD with monitoring frameworks: TACC_Stats/Lariat, Performance Co-Pilot, Ganglia, etc.
  - These supply XDMoD with job performance data: applications run, memory, local I/O, network, file-system, and CPU usage
  - Available in Open XDMoD as a beta release at SC15; already in production in the XSEDE version
- Identifies poorly performing jobs (and users) and applications
  - Automated process: with thousands of jobs run per day, manually searching for poorly performing codes is not feasible
  - Jobs can be flagged for idle nodes, node failure, or high cycles per instruction (CPI)
- HPC consultants can use these tools to identify and diagnose problems
  - Job Viewer tab in the XDMoD portal
  - User report card
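The automated flagging described above amounts to applying simple rules to per-job performance data. A minimal sketch, assuming hypothetical job records and thresholds (the field names and cutoffs are illustrative, not Open XDMoD's actual schema):

```python
# Hypothetical sketch of rule-based job flagging; the job records,
# field names, and thresholds below are illustrative assumptions.

def flag_job(job):
    """Return a list of reasons a job looks suspect."""
    flags = []
    per_node = job["cpu_user_per_node"]
    # Idle node: any node whose mean CPU-user fraction is near zero
    if any(u < 0.05 for u in per_node):
        flags.append("idle node")
    # Overall low CPU-user fraction across the allocation
    if sum(per_node) / len(per_node) < 0.5:
        flags.append("low CPU user")
    # High cycles per instruction suggests stalls (e.g. memory-bound code)
    if job.get("cpi") is not None and job["cpi"] > 1.5:
        flags.append("high CPI")
    return flags

jobs = [
    {"id": 1, "cpu_user_per_node": [0.97, 0.95, 0.96], "cpi": 0.8},
    {"id": 2, "cpu_user_per_node": [0.92, 0.02, 0.90], "cpi": 2.1},
]
for job in jobs:
    print(job["id"], flag_job(job))  # job 2 is flagged: idle node, high CPI
```

Rules like these can run over every job as it completes, which is what makes the process practical at thousands of jobs per day.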

XDMoD Job Viewer: Example 1
Relatively poor CPU User fraction (0.75); poor CPU User Balance (some cores not utilized)

XDMoD Job Viewer: Example 1.1
Per-node CPU activity tops out at 75%

XDMoD Job Viewer: Example 1.2
Drill-down per node reveals underutilized cores: only 12 of 16 in use

Recovering Wasted CPU Cycles
- Software tools identify poorly performing jobs
- Before: Job 2552292 ran very inefficiently, with CPU efficiency below 35%. Its Slurm script ran srun in a loop and did not use all requested nodes and cores (only 6 of the 60 cores). This type of computation is better suited to a job array.
- After: Following support from an HPC specialist, the same user corrected the Slurm script. A similar job (2585868) used all requested cores, with CPU efficiency near 100%.
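The fix described above can be sketched as a Slurm batch fragment. This is illustrative only: the user's actual script and program names do not appear in the slide, so `./work` is a placeholder.

```shell
#!/bin/bash
# Illustrative sketch, not the user's actual script.
#
# Inefficient pattern: one large allocation, single-task steps run in a loop,
# which serializes the work and leaves most of the requested cores idle:
#   #SBATCH --ntasks=60
#   for i in $(seq 1 60); do srun -n 1 ./work "$i"; done
#
# Better: a Slurm job array of 60 independent single-core tasks, so each
# array element requests and uses exactly one core:
#SBATCH --array=1-60
#SBATCH --ntasks=1
srun ./work "$SLURM_ARRAY_TASK_ID"
```

With the array form, the scheduler can start elements independently as cores become free, and no allocated core sits idle waiting for a loop iteration.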

Derived Metrics
Derived metrics for job compute-efficiency analysis:
- CPU User (job length > 1 h): CPU user average, normalized to unity
- CPU User Balance (job length > 1 h): ratio of the best CPU user average to the worst, normalized to unity (1 = uniform)
- CPU Homogeneity (job length > 1 h): inverse ratio of the largest drop in L1 data-cache load rate, normalized to one (0 = inhomogeneous)
  - The graphical header is currently shown only if all three (CPU User, User Balance, Homogeneity) are available
- CPI (subject to counter availability): clocks per instruction, from the Intel fixed counters CLOCKS_UNHALTED_REF and INSTRUCTIONS_RETIRED
- CPLD (subject to counter availability): clocks per L1 data-cache load (CLOCKS_UNHALTED_REF, LOAD_L1D_ALL, MEM_LOAD_RETIRED_L1D_HIT)
- Flop/s (subject to counter availability): counters vary by CPU; on Intel, SIMD_DOUBLE_256 and SSE_DOUBLE_ALL (SSE_DOUBLE_SCALAR, SSE_DOUBLE_PACKED); no suitable counter on Haswell
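The CPI and CPLD metrics above are simple ratios of hardware-counter totals. A minimal sketch, assuming raw per-job counter sums are already collected (the sample values are made up):

```python
# Sketch of the counter ratios named in the slide; sample values are invented.

def cpi(counters):
    """Cycles per instruction: CLOCKS_UNHALTED_REF / INSTRUCTIONS_RETIRED."""
    return counters["CLOCKS_UNHALTED_REF"] / counters["INSTRUCTIONS_RETIRED"]

def cpld(counters):
    """Clocks per L1 data-cache load: CLOCKS_UNHALTED_REF / LOAD_L1D_ALL."""
    return counters["CLOCKS_UNHALTED_REF"] / counters["LOAD_L1D_ALL"]

sample = {
    "CLOCKS_UNHALTED_REF": 8_000_000_000,
    "INSTRUCTIONS_RETIRED": 10_000_000_000,
    "LOAD_L1D_ALL": 2_000_000_000,
}
print(f"CPI  = {cpi(sample):.2f}")   # 0.80: under one clock per retired instruction
print(f"CPLD = {cpld(sample):.2f}")  # 4.00
```

A high CPI (well above 1 on typical Intel cores) is what the flagging rules treat as a sign of a stalled, likely memory-bound job.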