SSS Test Results: Scalability, Durability, Anomalies
Todd Kordenbrock, Technology Consultant, Scalable Computing Division
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Overview
- Effective System Performance Benchmark
- Scalability
  - Service Node
  - Cluster Size
- Durability
- Anomalies

The Setup
- The physical machine
  - dual-processor 3 GHz Xeon
  - 2 GB RAM
  - FC3 and VMware 5
- The 4-node VMware cluster
  - 1 service node
  - 4 compute nodes
  - OSCAR 1.0 on Red Hat 9
- The 64-virtual-node cluster
  - 16 Warehouse NodeMonitors running on each compute node

[Diagram: a dual-processor Xeon host running VMware; guest service1 runs the SystemMonitor, and guests compute1 through compute4 each run 16 NodeMonitors assigned round-robin, so compute1 hosts NodeMon 1, 5, 9, ..., n-3; compute2 hosts NodeMon 2, 6, 10, ..., n-2; compute3 hosts NodeMon 3, 7, 11, ..., n-1; and compute4 hosts NodeMon 4, 8, 12, ..., n.]
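The round-robin placement the diagram shows can be written down directly; this tiny sketch is purely illustrative.

```python
# Tiny sketch of the round-robin NodeMonitor placement from the diagram.
def host_for(nodemon: int, computes: int = 4) -> str:
    # NodeMon 1, 5, 9, ... land on compute1; 2, 6, 10, ... on compute2; etc.
    return f"compute{(nodemon - 1) % computes + 1}"

assert host_for(1) == "compute1" and host_for(5) == "compute1"
assert host_for(64) == "compute4"   # 64 NodeMons spread over 4 compute nodes
```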

Effective System Performance Benchmark
- Developed by the National Energy Research Scientific Computing Center
- A system utilization test, NOT a throughput test
- Focused on OS attributes
  - launch time, accounting, job scheduling
- Constructed to be processor-speed independent
- Low resource usage (besides network)
- Two variants: Throughput and Multimode
- The final result is the ESP Efficiency Ratio

ESP Efficiency Ratio
Calculating the ESP Efficiency Ratio:
- CPUsecs = sum(jobsize * runtime * job count)
- AMT = CPUsecs / syssize
- ESP Efficiency Ratio = AMT / observed runtime
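To make the arithmetic concrete, here is a minimal Python sketch of the calculation; the job mix, runtimes, and observed runtime below are invented for illustration, not values from the test run.

```python
# Minimal sketch of the ESP Efficiency Ratio calculation; the job mix
# and runtimes are invented for illustration, not measured values.
jobs = [
    # (job size in CPUs, runtime in seconds, job count)
    (16, 300, 4),
    (32, 600, 2),
    (64, 120, 1),
]
syssize = 64              # CPUs in the whole system
observed_runtime = 1200   # wall-clock seconds for the full test

# CPUsecs = sum(jobsize * runtime * job count)
cpusecs = sum(size * runtime * count for size, runtime, count in jobs)
amt = cpusecs / syssize                # AMT = CPUsecs / syssize
efficiency = amt / observed_runtime    # ESP Efficiency Ratio
print(f"CPUsecs={cpusecs}  AMT={amt:.0f}s  ratio={efficiency:.2f}")
```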

ESP2 Efficiency (64 nodes)
- CPUsecs =
- AMT = CPUsecs / 64 =
- Observed Runtime =
- ESP Efficiency Ratio =

Scalability
- Service Node Scalability (Load Testing)
  - Bamboo (Queue Manager)
  - Gold (Accounting)
- Cluster Size
  - Warehouse scalability (Status Monitor)
  - Maui scalability (Scheduler)

[Diagram: the SSS component architecture: Meta Manager, Meta Scheduler, Meta Monitor, Scheduler, System Monitor, Node Configuration & Build Manager, Accounting, Usage Reports, User DB, Queue Manager, Job Manager & Monitor, Checkpoint/Restart, Data Migration, High Performance Communication & I/O, File System, Access Control, and Security Manager, grouped into resource allocation management and the application environment; user utilities interact with all components.]

Bamboo Job Submission

Gold Operations

Warehouse Scalability
- Initial concerns
  - per-process file descriptor (socket) limits (see the sketch below)
  - time required to gather status from 1000s of nodes
- Discussed with Craig Steffen
  - had the same concerns
  - experienced file descriptor limits
  - resolved with a hierarchical configuration
  - no tests on large clusters, just simulations
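The file descriptor concern can be checked directly on a service node. A minimal sketch, assuming a hypothetical 1024-node cluster and one socket held per monitored node:

```python
# Sketch: check (and optionally raise) the per-process file descriptor
# limit a single monitor hits when holding one socket per monitored node.
import resource

nodes = 1024  # hypothetical cluster size
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft} hard={hard}")
if soft < nodes + 64:  # leave headroom for stdio, logs, etc.
    # Raise the soft limit toward the hard limit; going past the hard
    # limit needs root, or a hierarchical monitor configuration as
    # described above.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(nodes + 64, hard), hard))
```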

Maui Scalability

Scalability Conclusions
- Bamboo
- Gold
- Warehouse
- Maui

Durability
- What is durability?
- A few terms regarding starting and stopping
- Easy Tests
- Hard Tests

Durability and Other Terms
- Durability Testing: examines a software system's ability to react to and recover from failures and conditions external to the system itself
- Warm Start/Stop: an orderly startup/shutdown of the SSS services on a particular node
- Cold Start/Stop: a warm start/stop paired with a system boot/shutdown on a particular node
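As an illustration of the terminology, a warm stop/start cycle might be scripted as below. The "/etc/init.d/sss" service-script name is an assumption; the deck only confirms /etc/init.d/nsmup for NSM.

```python
# Hypothetical warm stop/start harness; "/etc/init.d/sss" is an assumed
# service-script name, not confirmed by the deck.
import subprocess
import time

def warm_cycle(node: str, pause: int = 10) -> None:
    # Warm stop: orderly shutdown of the SSS services on the node.
    subprocess.run(["ssh", node, "/etc/init.d/sss stop"], check=True)
    time.sleep(pause)
    # Warm start: orderly startup of the same services. A cold cycle
    # would reboot the node between the stop and the start.
    subprocess.run(["ssh", node, "/etc/init.d/sss start"], check=True)

warm_cycle("compute1")
```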

Easy Tests
- Compute Node Warm Stop
  - 30 sec delay between stop and Maui notification
  - race condition
- Compute Node Warm Start
  - 10 sec delay between start and Maui notification
  - jobs already in the queue do not get scheduled, but new jobs do
- Compute Node Cold Stop
  - 30 sec delay between stop and Maui notification
  - race condition

More Easy Tests
- Single Node Job Failure
  - mpd to queue manager communication
- Resource Hog (stress)
  - disk
  - memory
  - network
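The deck names "stress", presumably the Linux stress(1) utility. A rough Python stand-in for its memory hog, with an invented size and hold time, looks like this:

```python
# Rough stand-in for the memory portion of the resource-hog test;
# the size and hold time are invented.
import time

def hog_memory(mb: int, hold_seconds: int = 30) -> None:
    chunk = bytearray(mb * 1024 * 1024)
    # Touch every page so the allocation is backed by real RAM.
    for i in range(0, len(chunk), 4096):
        chunk[i] = 1
    time.sleep(hold_seconds)   # hold the memory while services are observed

hog_memory(256)   # hold roughly 256 MB
```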

More Easy Tests
- Resource Exhaustion
  - compute node disk: no failures
  - service node disk: Gold fails in its logging package
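A minimal sketch of the disk-exhaustion case: fill the target filesystem until ENOSPC, observe how Gold and the other services behave, then release the space. The file path is an assumption; point it at the compute or service node disk under test.

```python
# Sketch of the disk-exhaustion test: write until the filesystem is
# full, then clean up. The path is an assumption.
import errno
import os

path = "/tmp/exhaust.dat"
try:
    with open(path, "wb") as f:
        block = b"\0" * (4 * 1024 * 1024)
        while True:
            f.write(block)       # loops until the write fails with ENOSPC
except OSError as e:
    if e.errno != errno.ENOSPC:
        raise
    print("filesystem full; check Gold and the other services now")
finally:
    os.remove(path)              # always give the space back
```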

Hard Tests
- Compute Node Failure/Restore
  - current release of Warehouse fails to reconnect
- Service Node Failure/Restore
  - requires a restart of mpd on all compute nodes
- Compute Node Network Failure/Restore
  - 30 sec delay between failure and Maui notification
  - race condition
  - 20 sec delay between restore and Maui notification

More Hard Tests
- Service Node Network Failure/Restore
  - 30 sec delay between failure and Maui notification
  - race condition
  - 20 sec delay between restore and Maui notification
  - if the outage exceeds 10 sec, mpd can't reconnect to the compute nodes

Durability Conclusions
- Bamboo
- Gold
- Warehouse
- Maui

Anomalies Discovered
- Maui
  - Jobs in the queue do not get scheduled after a service node warm restart.
  - If max runtime expires on the last job in the queue, repeated attempts are made to delete it, and the account is charged actual runtime + max runtime.
  - Otherwise, the last job in the queue doesn't get charged until another job is submitted.
  - Maui loses connections to other services.

More Anomalies
- Warehouse
  - warehouse_SysMon exits after ~8 hrs (current release)
  - warehouse_SysMon doesn't reconnect to power-cycled compute nodes (current release)
- Gold
  - the "Quotation Create" pair fails with a missing-column error
  - gquote succeeds, but glsquote fails with a similar error
  - CPU usage spikes when the gold.db file gets large (>64 MB); an SQLite problem?
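One way to test the SQLite suspicion is to grow a scratch database past 64 MB and watch query latency as it grows; the schema and row sizes below are invented and unrelated to Gold's real schema.

```python
# Probe for the size/CPU anomaly: grow a scratch SQLite database past
# 64 MB and time a query at each step. Schema and sizes are invented.
import os
import sqlite3
import time

db = "probe.db"
con = sqlite3.connect(db)
con.execute("CREATE TABLE IF NOT EXISTS txn (id INTEGER PRIMARY KEY, payload TEXT)")
row = ("x" * 1024,)                       # ~1 KB per row
for step in range(10):                    # ~10 MB per step, ~100 MB total
    con.executemany("INSERT INTO txn (payload) VALUES (?)", [row] * 10000)
    con.commit()
    t0 = time.perf_counter()
    con.execute("SELECT COUNT(*) FROM txn").fetchone()
    dt = time.perf_counter() - t0
    print(f"{os.path.getsize(db) / 2**20:6.1f} MB  count() in {dt * 1000:.1f} ms")
con.close()
```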

More Anomalies
- happynsm
  - /etc/init.d/nsmup needs a delay to allow the server time to initialize
  - Is NSM in use at this time?
- emng.py throws errors
  - after a few hundred jobs, errors begin showing up in /var/log/messages
  - jobs continue to execute, but slowly and without events

Conclusions
- Overall scalability is good; Warehouse still needs to be tested on a large cluster.
- Overall durability is good; some problems with Warehouse have been resolved in the latest development release.

ToDo List
- Develop and execute tests for the BLCR module
- Retest on a larger cluster
- Get the latest release of all the software and retest
- Formalize this information into a report