XDMoD Overview
Tom Furlani, Director
September 19, 2015

Outline
Motivation
XDMoD Portal
Measuring QoS
Job Level Performance

XDMoD: Comprehensive HPC Management
5-year NSF grant (XD Net Metrics Service, XMS)
XDMoD: XD Metrics on Demand*
Analytics framework developed for XSEDE
On-demand, responsive access to job accounting data
Comprehensive framework for HPC management:
Support for several resource managers (Slurm, PBS, LSF, SGE)
Utilization metrics across multiple dimensions
Measurement of the QoS of the HPC infrastructure (application kernels)
Job-level performance data (SUPReMM)
Open XDMoD: open-source version for HPC centers, with 100+ academic and industrial installations worldwide
1. J.T. Palmer, S.M. Gallo, T.R. Furlani, M.D. Jones, R.L. DeLeon, J.P. White, N. Simakov, A.K. Patra, J. Sperhac, T. Yearke, R. Rathsam, M. Innus, C.D. Cornelius, J.C. Browne, W.L. Barth, R.T. Evans, "Open XDMoD: A Tool for the Comprehensive Management of High Performance Computing Resources", Computing in Science and Engineering, 17, No. 4, 52-62, July-August 2015. DOI: 10.1109/MCSE.2015.68
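As a rough illustration of the kind of accounting ingest described above, the minimal sketch below totals core-hours per account from completed Slurm jobs. It assumes sacct is available; the field list and the simple rollup are illustrative stand-ins for Open XDMoD's actual shredder and aggregation pipeline, not a reproduction of it.

```python
import csv
import subprocess
from collections import defaultdict

# Simplified ingest sketch: total core-hours per account from Slurm's sacct.
FIELDS = "JobID,Account,User,AllocCPUS,ElapsedRaw,State"
out = subprocess.run(
    ["sacct", "--allusers", "--parsable2", "--noheader",
     "--starttime", "2015-09-01", f"--format={FIELDS}"],
    capture_output=True, text=True, check=True,
).stdout

core_hours = defaultdict(float)
for row in csv.reader(out.splitlines(), delimiter="|"):
    jobid, account, user, alloc_cpus, elapsed_raw, state = row
    if "." in jobid:                      # skip job-step records (e.g. 12345.batch)
        continue
    if state.startswith("COMPLETED"):
        core_hours[account] += int(alloc_cpus) * int(elapsed_raw) / 3600.0

for account, hours in sorted(core_hours.items(), key=lambda kv: -kv[1]):
    print(f"{account:20s} {hours:12.1f} core-hours")
```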

Motivation: Improve User Experience
The user shouldn't be the "canary in the coal mine" who identifies problems.
Example: log file analysis discovers two malfunctioning nodes.

Motivation: Improve User Throughput
Software tools to automatically identify poorly performing jobs.
Example: a job ran very inefficiently; after HPC specialist user support, a similar job was vastly improved.
Before: CPU efficiency below 35%. After: CPU efficiency near 100%.
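CPU efficiency in this sense can be estimated from accounting data alone. A minimal sketch follows; the function name and the example numbers are illustrative, not taken from the slide.

```python
def cpu_efficiency(total_cpu_seconds: float, cores_allocated: int, wall_seconds: float) -> float:
    """Fraction of the allocated core-time that was actually spent on-CPU."""
    return total_cpu_seconds / (cores_allocated * wall_seconds)

# Illustrative numbers: a 16-core job that ran for 10 hours of walltime but
# accumulated only 50 core-hours of CPU time.
eff = cpu_efficiency(50 * 3600, 16, 10 * 3600)
print(f"CPU efficiency: {eff:.0%}")   # ~31%, i.e. below a 35% flagging threshold
```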

XDMoD Portal: XD Metrics on Demand
Display metrics via a GUI interface: utilization, performance, publications.
Role based: the view is tailored to the role of the user (public, end user, PI, center director, program officer).
Custom report builder.
Multiple file export capability: Excel, PDF, XML, RSS, etc.
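As a loose illustration of scripted export from such a portal, the sketch below pulls a chart as CSV over HTTP. The host, route, parameter names, and bearer-token authentication are hypothetical placeholders, not the actual XDMoD REST API; consult the XDMoD documentation for the real endpoints and authentication flow.

```python
import requests

BASE = "https://xdmod.example.edu"          # hypothetical host
resp = requests.get(
    f"{BASE}/rest/v1/charts",               # hypothetical route and parameters
    params={"realm": "Jobs", "group_by": "person",
            "statistic": "total_cpu_hours", "format": "csv"},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
with open("cpu_hours_by_person.csv", "wb") as f:
    f.write(resp.content)
```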

CPU Hours Delivered by Decanal Unit

Drill Down: CPU Hours and Jobs by Engineering Dept

QoS: Application Kernels
Computationally lightweight.
Run continuously and on demand to actively measure performance.
Utilize open-source codes such as GAMESS, NWChem, NAMD, OpenFOAM, etc., as well as customized kernels.
Measure system performance from the user's perspective: local scratch and global filesystem performance, local processor-memory bandwidth, allocatable shared memory, processing speed, network latency and bandwidth.
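The minimal sketch below is in the spirit of such kernels but much cruder: two toy probes (a dense matmul as a processing-speed proxy and a local-scratch write), with timings appended to a CSV history for later trend analysis. The file paths, sizes, and history format are illustrative assumptions, not the actual application kernel suite.

```python
import csv
import socket
import time
import numpy as np

def run_probes():
    """Two toy probes, loosely in the spirit of application kernels."""
    results = {}

    t0 = time.perf_counter()
    a = np.random.rand(2000, 2000)
    (a @ a).sum()                               # dense matmul as a compute probe
    results["matmul_seconds"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    with open("/tmp/ak_probe.bin", "wb") as f:  # assumed local-scratch path
        f.write(b"\0" * (256 * 1024 * 1024))    # 256 MB sequential write
    results["scratch_write_seconds"] = time.perf_counter() - t0
    return results

with open("app_kernel_history.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for name, seconds in run_probes().items():
        writer.writerow([time.strftime("%Y-%m-%dT%H:%M:%S"),
                         socket.gethostname(), name, round(seconds, 3)])
```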

Application Kernel Use Case
Application kernels help detect user-environment anomalies at CCR.
Example: performance variation of NWChem due to a bug in a commercial parallel file system (PanFS) that was subsequently fixed by the vendor (vendor patch installed).

Application Kernel Use Case
Uncovered a performance issue with CCR's Panasas parallel file system.
The timing coincided with a recent core-switch software upgrade.
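Detecting this kind of regression from application-kernel timings can be done with a simple control-chart rule. The sketch below is a simplification under assumed inputs (a list of timestamped walltimes) and thresholds, not the detection method used in production: it flags runs whose walltime exceeds the baseline mean by more than a few standard deviations.

```python
import statistics

def flag_regressions(history, window=30, n_sigma=3.0):
    """Flag runs whose walltime exceeds mean + n_sigma * stdev of the
    preceding `window` runs. `history` is a list of (timestamp, seconds)."""
    flagged = []
    for i in range(window, len(history)):
        baseline = [seconds for _, seconds in history[i - window:i]]
        mu = statistics.mean(baseline)
        sigma = statistics.pstdev(baseline)
        timestamp, seconds = history[i]
        if sigma > 0 and seconds > mu + n_sigma * sigma:
            flagged.append((timestamp, seconds, mu))
    return flagged
```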

Measuring Job-Level Performance
Collaboration with the Texas Advanced Computing Center.
Integration of XDMoD with monitoring frameworks: TACC_Stats/Lariat, Performance Co-Pilot, Ganglia, etc.
Supply XDMoD with job performance data: applications run, memory, local I/O, network, file-system, and CPU usage.
Identify poorly performing jobs (users) and applications.

Metrics Gathered
Metrics gathered per node: anything available, really: CPU, I/O, memory, filesystem.
Extensible: measurable quantities can be included with some development work (e.g., CUDA, MIC, PanFS, GPFS, script capture, etc.).
Overhead: so far we have not been able to measure it compared to the variability inherent in running jobs (on the order of a percent), but keep in mind the potential for overhead when extending metrics.
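As a minimal illustration of per-node sampling of this kind, the sketch below measures the node's busy-CPU fraction over one interval from /proc/stat. It is Linux-only and far simpler than TACC_Stats or Performance Co-Pilot; the interval and output format are assumptions.

```python
import json
import time

def read_cpu_jiffies():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    idle = values[3] + values[4]          # idle + iowait
    return sum(values), idle

def sample(interval_seconds=600):
    """Busy-CPU fraction for this node over one sampling interval."""
    total0, idle0 = read_cpu_jiffies()
    time.sleep(interval_seconds)
    total1, idle1 = read_cpu_jiffies()
    busy = 1.0 - (idle1 - idle0) / (total1 - total0)
    return {"timestamp": time.time(), "cpu_busy_fraction": round(busy, 4)}

if __name__ == "__main__":
    print(json.dumps(sample(interval_seconds=10)))
```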

Why Collect Job-Level Performance Data?
User report card.
Identify underperforming user codes.
An automated process is needed: thousands of jobs run per day, so it is not possible to manually search for poorly performing codes.
Jobs can be flagged for: idle nodes, node failure, high cycles per instruction (CPI).
Performance plots and data are available from the web interface and the command line.
HPC consultants can use these tools, including the Single Job Viewer, to identify and diagnose problems.
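A minimal sketch of such rule-based flagging, assuming a per-job summary dictionary with aggregated hardware-counter totals and per-node CPU usage; the field names and thresholds are illustrative, not SUPReMM's actual schema.

```python
def flag_job(summary, cpi_threshold=1.0, idle_cpu_threshold=0.05):
    """Return flag categories for one job summary (hypothetical schema)."""
    flags = []

    # High cycles per instruction suggests the code is stalling rather than computing.
    cpi = summary["cpu_cycles"] / max(summary["instructions"], 1)
    if cpi > cpi_threshold:
        flags.append("high_cpi")

    # Nodes that were allocated but barely used point to idle nodes or node failure.
    idle_nodes = [node for node, usage in summary["node_cpu_usage"].items()
                  if usage < idle_cpu_threshold]
    if idle_nodes:
        flags.append("idle_nodes:" + ",".join(idle_nodes))

    return flags
```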

Single Job Viewer
Job information for a Stampede job (# ), displaying accounting data, application classification, and SUPReMM metrics with custom analysis.

Improving Job-Level Performance: Success Stories
MILC code: one project using MILC was found to be running with a higher-than-expected CPI (1.1 vs. 0.7); members were not using the available vectorized intrinsics; fixing this gave an 11% reduction in runtime.
DNS turbulence code: CPI of 1.1 with a large SU allocation; line-level profiling revealed MPI/OpenMP hot spots; converting an OpenMP workshare block to a parallel do block improved performance by 7% overall, 10% in the main loop, and 76% in the code block.
Singularity code (general relativity): CPI of 1.15 with large SU usage (239,631 SUs in the 1st quarter of 2014); the code was making many extraneous calls to cat and rm and was not using any optimization flags (-O3 or -xhost); simple changes yielded a 26% decrease in run time.

Recovering Wasted CPU Cycles: TACC Stampede
Job ran for 48 hours on one node out of seven.

Recovering Wasted CPU Cycles: TACC Stampede (via the TAS Single Job Viewer)
Job ran effectively on 45 nodes but performed a serial write for 3 hours.
After user support, a similar job ( ) using a parallel write fixed this.
Savings: 3 hours on 45 nodes recovered.
Before: serial write. After: parallel write (CPU-user plots for both are shown on the slide).
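To make the distinction concrete, the sketch below contrasts a serial write (gather everything to rank 0, which writes alone) with a collective MPI-IO write in which every rank writes its own slice. It is a generic mpi4py illustration, not the user's actual code.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
local = np.full(1_000_000, rank, dtype=np.float64)   # this rank's slice of the data

# Serial write: gather everything to rank 0, which writes alone while the
# other ranks sit idle (the pattern seen in the "before" job).
gathered = comm.gather(local, root=0)
if rank == 0:
    np.concatenate(gathered).tofile("out_serial.bin")

# Parallel write: every rank writes its own slice collectively with MPI-IO.
fh = MPI.File.Open(comm, "out_parallel.bin",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```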

Underperforming Job Notification
All jobs running on the cluster.
XDMoD/SUPReMM collects all job data (prolog, epilog, and 10-minute intervals).
All jobs are automatically processed and analyzed for low CPU usage, drops in cache use, etc.
"Bad" jobs are flagged by category.
Notification: user support is notified and a message is sent to the user.
The user or HPC support can then analyze and improve job performance through the Single Job Viewer.
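A minimal sketch of the flag-and-notify step, assuming job summaries have already been collected into dictionaries; the field names, thresholds, rule set, and email addresses are illustrative assumptions, not the production pipeline.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical rule set keyed by flag category.
RULES = {
    "low_cpu_usage": lambda job: job["cpu_user_fraction"] < 0.35,
    "idle_node":     lambda job: min(job["per_node_cpu"].values()) < 0.05,
    "high_cpi":      lambda job: job["cpi"] > 1.0,
}

def process_jobs(job_summaries, smtp_host="localhost"):
    """Categorize each finished job and notify user support for flagged ones."""
    for job in job_summaries:
        categories = [name for name, rule in RULES.items() if rule(job)]
        if not categories:
            continue
        msg = EmailMessage()
        msg["Subject"] = f"Job {job['jobid']} flagged: {', '.join(categories)}"
        msg["From"] = "xdmod-supremm@example.edu"     # placeholder addresses
        msg["To"] = "hpc-support@example.edu"
        msg.set_content(
            f"Job {job['jobid']} was flagged for: {', '.join(categories)}.\n"
            "Inspect it in the Single Job Viewer."
        )
        with smtplib.SMTP(smtp_host) as smtp:
            smtp.send_message(msg)
```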

Broader Impact: Open XDMoD
Targeted at academic and industrial HPC centers; based on the XDMoD source code.
NCAR: beta test site; collaborating on the development of a storage reporting schema.
Blue Waters: NSF's largest supercomputer.
Many academic HPC centers: Rice, Cambridge, UGA, UF, Chicago, CERN, New Mexico, Southampton, Manitoba, Leibniz U, Univ Medical Center Utrecht, Liverpool, Illinois, Kansas, RIT, U Leuven (Belgium), U Geneva, SciNet (U Toronto), Case Western, NY Genome Center, U Buffalo, ...
Known industrial HPC centers: Rolls-Royce, Dow, Lockheed Martin, Hess Energy, ...

Acknowledgements: TAS/SUPReMM
UB: Tom Furlani, Matt Jones, Steve Gallo, Bob DeLeon, Ryan Rathsam, Jeff Palmer, Tom Yearke, Joe White, Jeanette Sperhac, Abani Patra, Nikolay Simakov, Cynthia Cornelius, Martins Innus, Ben Plessinger
Indiana: Gregor von Laszewski, Fugang Wang
University of Texas: Jim Browne
TACC: Bill Barth, Todd Evans, Weijia Xu
NICS: Shiquan Su
NSF grants: TAS: OCI , SUPReMM: OCI , XMS: ACI

Colella's 7 Dwarfs
The "Seven Dwarfs" of algorithms for simulation in the physical sciences.
"Dwarfs" mine compute cycles for golden results.

Colella's 7 Dwarfs
Table 1. Algorithms that play a key role within select scientific applications, as characterized according to a seven-dwarfs classification.* The table's columns are the seven dwarfs (structured grids, unstructured grids, FFT, dense linear algebra, sparse linear algebra, N-body, Monte Carlo) and its rows are application areas (molecular physics, nanoscale science, climate, environment, combustion, fusion, nuclear energy, astrophysics, nuclear physics, accelerator physics, QCD, aerodynamics); an X marks each dwarf on which an application area relies (the full mapping is given in the cited report).
*Scientific Application Requirements for Leadership Computing at the Exascale, ORNL/TM-2007/238