Slide 1 User-Centric Workload Analytics: Towards Better Cluster Management
Saurabh Bagchi, Purdue University
Joint work with: Subrata Mitra, Suhas Javagal, Stephen Harrell (Purdue), Adam Moody, Todd Gamblin (LLNL)
Presentation available at: engineering.purdue.edu/dcsl
Supported by the National Science Foundation (NSF), Jul '15 – Jul '18

Slide 2 Problem Context
Shared computing clusters at universities and government labs are common
Users have varying levels of expertise
–Some write their own job scripts
–Others use scripts as a black box
Users also have varying needs
–High computation power: analysis of large structures (Civil, Aerospace engineering)
–High Lustre bandwidth for file operations: working with multiple large databases/files (Genomics)
–High network bandwidth: parallel processing applications

Slide 3 Motivation
Challenges for cluster management
–Need for customer-centric analytics to proactively help users
–Improve cluster availability
–In addition to failures, investigate performance issues in jobs
Need for an open repository of system usage data
–Currently there is a lack of publicly available, annotated quantitative data for analyzing workloads
–Available public data sets provide only system-level information and are not up to date
–The data set must not violate user privacy or raise IT security concerns
URL:

Slide 4 Cluster Details: Purdue
Purdue's cluster is called Conte
Conte is a "community" cluster
–580 homogeneous nodes
–Each node contains two 8-core Intel Xeon E Sandy Bridge processors running at 2.6 GHz
–Two Xeon Phi 5110P accelerator cards per node, each with 60 cores
–Memory: 64 GB of DDR3 RAM at 1.6 GHz
40 Gbps FDR10 Infiniband interconnect along with IP
Lustre file system, 2 GB/s
RHEL 6.6
PBS-based job scheduling using Torque

Slide 5 Cluster Details: LLNL
SLURM job scheduler
TOSS 2.2 OS
16-core Intel Xeon processors (Cab), 12-core Intel Xeon processors (Sierra)
32 GB memory (Cab), 24 GB memory (Sierra)
1296 nodes (Cab) and 1944 nodes (Sierra)
Infiniband network

Slide 6 Cluster Policies
Scheduling:
–Each job requests a certain time duration, a number of nodes, and in some cases the amount of memory needed
–When a job exceeds its specified time limit, it is killed
–Jobs are also killed by out-of-memory (OOM) killer scripts if they exhaust the available physical memory and swap space
Node sharing:
–By default, only a single job is scheduled on an entire node, giving it dedicated access to all the resources
–However, users can enable sharing via a configuration option in their job submission scripts
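
As a rough illustration of how these policies surface in the accounting data, the sketch below flags jobs that likely ran into the walltime limit by comparing used against requested walltime. The field names follow common Torque accounting conventions and are an assumption here, not taken from the data set's documentation.

# Sketch: flag jobs that likely hit the scheduler's walltime limit.
# Assumes job records are dicts with Torque-style keys; adjust to your logs.

def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

def likely_walltime_kill(job: dict) -> bool:
    """True if the job ran up to (or past) its requested walltime."""
    used = to_seconds(job["resources_used.walltime"])
    requested = to_seconds(job["Resource_List.walltime"])
    return used >= requested

# Example record (hypothetical values):
job = {"resources_used.walltime": "12:00:05",
       "Resource_List.walltime": "12:00:00"}
print(likely_walltime_kill(job))   # True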

Slide 7 Data Set
Accounting logs from the job scheduler, TORQUE
Node-level performance statistics from TACC Stats
–CPU, Lustre, Infiniband, virtual memory, memory, and more…
Library list for each job, called liblist
Job scripts submitted by users
Syslog messages

Summary | Conte | Cab and Sierra
Data set duration | Oct'14 – Mar'15 | May'15 – Nov'15
Total number of jobs | 489,971 | 247,888 and 227,684
Number of users |  | and 207

URL:
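
For readers who want to work with the accounting portion of such a data set, a minimal parsing sketch is shown below. It assumes the common Torque accounting-log layout of "timestamp;record_type;job_id;key=value ..." with 'E' marking end-of-job records; the exact layout of the released traces should be checked against their documentation.

# Minimal sketch: parse Torque accounting-log lines into dictionaries.
# Assumes the common "timestamp;type;jobid;key=value ..." layout with
# 'E' records marking job completion; verify against the actual traces.

def parse_accounting_line(line: str):
    """Return (timestamp, job_id, fields) for an 'E' record, else None."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    timestamp, _, job_id, payload = parts
    fields = {}
    for token in payload.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return timestamp, job_id, fields

def load_finished_jobs(path: str):
    """Yield parsed end-of-job records from one accounting file."""
    with open(path) as f:
        for line in f:
            record = parse_accounting_line(line)
            if record is not None:
                yield record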

Slide 8 Analysis: Types of Jobs
[Figures: job size distributions — (A) Purdue cluster, (B) LLNL cluster]
Different job sizes
–Purdue has a large number of "narrow" jobs
–LLNL jobs span hundreds to thousands of processes
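
One way to reproduce a job-size breakdown like this is to estimate each job's width from its node request. The sketch below assumes Torque-style specifications such as "4:ppn=16" and a simple narrow/wide split; both the spec format and the 16-core cutoff are illustrative assumptions, not values from the study.

# Sketch: estimate job width (cores) from a Torque-style nodes request,
# e.g. "4:ppn=16" -> 64 cores. Real specs can be more elaborate.
from collections import Counter

def job_width(nodes_spec: str, default_ppn: int = 1) -> int:
    total = 0
    for chunk in nodes_spec.split("+"):      # requests may join chunks with '+'
        parts = chunk.split(":")
        count = int(parts[0]) if parts[0].isdigit() else 1
        ppn = default_ppn
        for p in parts[1:]:
            if p.startswith("ppn="):
                ppn = int(p.split("=", 1)[1])
        total += count * ppn
    return total

# Bucket a list of requests into a coarse size distribution (hypothetical data):
requests = ["1:ppn=16", "4:ppn=16", "128:ppn=16"]
sizes = Counter("narrow" if job_width(r) <= 16 else "wide" for r in requests)
print(sizes)   # Counter({'wide': 2, 'narrow': 1})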

Slide 9 Analysis: Requested versus Actual Runtime
[Figures: requested versus actual runtime — LLNL cluster, Purdue cluster]
Users have little idea how much runtime to request
–Purdue: 45% of jobs used less than 10% of their requested time
–LLNL: 15% of jobs used less than 1% of their requested time
Consequence: insufficient utilization of computing resources
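
The percentages above come from comparing each job's consumed walltime with its request. A minimal sketch of that computation, assuming HH:MM:SS walltime strings as in Torque logs, is shown below.

# Sketch: fraction of jobs that used less than some fraction of the
# walltime they requested (e.g. 10%). Walltime strings assumed HH:MM:SS.

def to_seconds(hms: str) -> int:
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

def underuse_fraction(jobs, cutoff=0.10):
    """jobs: iterable of (requested, used) walltime string pairs."""
    total, under = 0, 0
    for requested, used in jobs:
        req_s = to_seconds(requested)
        if req_s == 0:
            continue
        total += 1
        if to_seconds(used) < cutoff * req_s:
            under += 1
    return under / total if total else 0.0

# Hypothetical example:
jobs = [("04:00:00", "00:10:00"), ("01:00:00", "00:55:00")]
print(underuse_fraction(jobs))   # 0.5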

Slide 10 Analysis: Resource Usage by App Groups
[Figures: Infiniband read rate on Conte, Lustre read rate on Conte, memory usage on Conte]
Clearly, there are two distinct types of jobs
–A few jobs need a high-bandwidth backplane for network and I/O
–For memory, no such distinction is present
Follow-on: a specialized cluster was built in 2015 for jobs with high resource demands
–It has a 56 Gbps Infiniband network
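
The two-population split can be made concrete with a simple bandwidth cut over per-job I/O rates. The sketch below assumes the TACC Stats counters have already been reduced to one average rate per job; the 100 MB/s threshold is purely illustrative and not a value from the study.

# Sketch: split jobs into high- and low-I/O groups by average Lustre read
# rate. Assumes per-job rates (bytes/s) were already reduced from TACC Stats
# counters; the threshold below is an assumed cut point.

HIGH_IO_THRESHOLD = 100 * 1024 * 1024   # ~100 MB/s, illustrative only

def split_by_io(job_rates: dict):
    """job_rates: {job_id: avg_lustre_read_bytes_per_sec}."""
    high = {j for j, r in job_rates.items() if r >= HIGH_IO_THRESHOLD}
    low = set(job_rates) - high
    return high, low

# Hypothetical example:
rates = {"job1": 5e6, "job2": 8e8, "job3": 2e5}
high, low = split_by_io(rates)
print(sorted(high), sorted(low))   # ['job2'] ['job1', 'job3']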

Slide 11 Analysis: Performance Issues due to Memory
In the extreme, memory exhaustion leads to invocation of the oom-killer, the kernel's out-of-memory handler
Multiple sources of evidence for memory problems: syslog messages with an out-of-memory (OOM) code, and the application exit code
–92% of jobs with memory exhaustion logged OOM messages
–77% of jobs with OOM messages had a memory-exhaustion exit code
Find a (quantitative) threshold on the major page fault rate
Find all jobs (and job owners) that exceed the threshold
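
A sketch of the two steps, cross-checking OOM syslog evidence against job exit codes and flagging jobs whose major page-fault rate exceeds a threshold, is given below. The record layout, the exit code taken to mean memory exhaustion, and the threshold value are all assumptions for illustration.

# Sketch of the memory analysis: (1) cross-check jobs that logged OOM-killer
# syslog messages against jobs whose exit code indicates memory exhaustion,
# and (2) flag jobs whose major page-fault rate exceeds a threshold.
# Record layout, the OOM exit code, and the threshold are assumed here.

OOM_EXIT_CODES = {137}            # assumed code for memory exhaustion
MAJFAULT_RATE_THRESHOLD = 10.0    # major page faults per second, illustrative

def oom_overlap(oom_job_ids: set, exit_codes: dict) -> float:
    """Fraction of OOM-logging jobs whose exit code also indicates OOM."""
    if not oom_job_ids:
        return 0.0
    both = {j for j in oom_job_ids if exit_codes.get(j) in OOM_EXIT_CODES}
    return len(both) / len(oom_job_ids)

def thrashing_jobs(majfault_rates: dict):
    """Return (job_id, owner) pairs whose major page-fault rate is too high.

    majfault_rates: {job_id: (owner, faults_per_second)}
    """
    return [(j, owner) for j, (owner, rate) in majfault_rates.items()
            if rate > MAJFAULT_RATE_THRESHOLD]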

Slide 12 Current Status of the Repository
Workload traces from the Purdue cluster
–Accounting information (Torque logs)
–TACC Stats performance data
–User documentation
Privacy
–Anonymize machine-specific information
–Anonymize user/group identifiers
Library list is not shared, for privacy reasons
URL:
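
One common way to anonymize user and group identifiers consistently (the same user always maps to the same pseudonym) while keeping the mapping non-reversible is a keyed hash. A minimal sketch is shown below; the key handling is an assumption and not necessarily how the repository does it.

# Sketch: consistent, non-reversible anonymization of user/group names
# using a keyed hash (HMAC-SHA256). Key management is left out; the key
# must be kept secret and never published with the traces.
import hmac, hashlib

SECRET_KEY = b"replace-with-a-secret-key"   # placeholder, not a real key

def anonymize(identifier: str, prefix: str = "user") -> str:
    digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"          # short, stable pseudonym

# Example: the same input always maps to the same pseudonym.
print(anonymize("alice") == anonymize("alice"))   # True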

Slide 13 Conclusion
It is important to analyze how resources are being utilized by users
–Informs scheduler tuning, resource provisioning, and user education
It is important to look at workload information together with failure events
–Workload affects the kinds of hardware-software failures that are triggered
An open repository has been started to enable different kinds of analyses that enhance system dependability
URL: