Process Management & Monitoring WG Quarterly Report, January 25, 2005

Components
Process Management
- Process Manager
- Checkpoint Manager
Monitoring
- Job Monitor
- System/Node Monitors
- Meta Monitoring

Component Progress
- Checkpoint Manager (LBNL): BLCR
- Process Manager (ANL): MPDPM
- Monitoring (NCSA): Warehouse

Checkpoint Manager: BLCR Status
Full save and restore of:
- CPU registers
- Memory
- Signals (handlers and pending signals)
- PID, PGID, etc.
- Files (with limitations)
- Communication (via MPI)
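The categories above can be illustrated with a toy user-level sketch of the checkpoint/restart round trip: serialize a process's recoverable state to a context file, then rebuild it at restart. BLCR does this in-kernel for real process state; the dictionary fields below are illustrative stand-ins, not BLCR's format.

```python
import os
import pickle
import signal

def checkpoint(state, context_path):
    """Write recoverable process state to a context file.
    (BLCR does this in-kernel; here we just pickle a dict.)"""
    with open(context_path, "wb") as f:
        pickle.dump(state, f)

def restart(context_path):
    """Rebuild the saved state from the context file."""
    with open(context_path, "rb") as f:
        return pickle.load(f)

# Toy "process state": the kinds of things the slide lists.
state = {
    "pid": os.getpid(),              # PID, PGID, etc.
    "memory": {"counter": 42},       # heap contents
    "pending_signals": [signal.SIGUSR1],
}
checkpoint(state, "context.save")
restored = restart("context.save")
assert restored["memory"]["counter"] == 42
```

In practice the same idea is driven from BLCR's command-line tools rather than application code: a checkpoint request produces a context file, and a restart consumes it.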

Checkpoint Manager: BLCR Status (Files)
Supported file cases:
- Files unmodified between checkpoint and restart
- Files appended to between checkpoint and restart
- Pipes between processes
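For the unmodified and append-only cases, it is enough to record each open file's path, mode, and offset at checkpoint time, then reopen and seek at restart. This hedged sketch shows why append-only files work: the saved offset is still valid even after new data arrives (the helper names are hypothetical, not BLCR's API).

```python
def save_file_state(f):
    """At checkpoint time, record enough to reopen the file later.
    Valid when the file is unmodified or only appended to
    between checkpoint and restart."""
    return {"path": f.name, "mode": f.mode, "offset": f.tell()}

def restore_file(saved):
    """At restart time, reopen and seek back to the saved offset."""
    f = open(saved["path"], saved["mode"])
    f.seek(saved["offset"])
    return f

with open("data.log", "w") as out:
    out.write("line 1\n")

f = open("data.log", "r")
f.readline()                    # process has consumed "line 1"
saved = save_file_state(f)      # checkpoint taken here
f.close()                       # process goes away

with open("data.log", "a") as out:
    out.write("line 2\n")       # file appended to while down

f = restore_file(saved)         # restart
assert f.readline() == "line 2\n"   # resumes exactly where it left off
f.close()
```

Files truncated or rewritten in place between checkpoint and restart break this invariant, which is why they fall under the slide's "limitations".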

Checkpoint Manager: BLCR Status (Comms)
LAM/MPI 7.x over TCP and GM:
- Handles in-flight data (drains the network before checkpoint)
- Linear scaling of checkpoint time with job size
- Migratable
OpenMPI:
- Will inherit LAM/MPI's support
ChaMPIon/Pro (Verari)
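"Draining" in-flight data means no message may be left on the wire when the checkpoint is taken: each peer sends a marker after its last message, and every receiver keeps reading each channel until that channel's marker arrives (the same idea that underlies Chandy-Lamport style snapshots). A minimal single-process sketch, using queues to stand in for network channels:

```python
from queue import Queue

MARKER = object()   # sentinel each peer sends after its last message

def drain(in_channels):
    """Receive on every incoming channel until its marker arrives,
    so nothing is left in flight when the checkpoint is taken."""
    drained = []
    for ch in in_channels:
        while True:
            msg = ch.get()
            if msg is MARKER:
                break
            drained.append(msg)
    return drained

# Two peers with messages still in flight at checkpoint time.
a, b = Queue(), Queue()
a.put("a1"); a.put("a2"); a.put(MARKER)
b.put("b1"); b.put(MARKER)

assert drain([a, b]) == ["a1", "a2", "b1"]
```

The drained messages are saved with the checkpoint and redelivered after restart, which is what lets a LAM/MPI job survive checkpointing mid-communication.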

Checkpoint Manager: BLCR Status (Ports)
Linux only:
- "Stock" 2.4.x
- Red Hat 7.2 - 9
- SuSE 7.2 - 9.0
- RHEL 3 / CentOS 3.1
- 2.6.x port in progress (FC2 and SuSE 9.2)
x86 (IA32) only today:
- x86_64 (Opteron) will follow the 2.6.x port
- Alpha, PPC, and PPC64 may be trivial
- No IA64 (Itanium) plans

Checkpoint Manager: BLCR Future Work
Additional coverage:
- Process groups and sessions (next priority)
- Terminal characteristics
- Interval timers
- Queued RT signals
More on files:
- Mutable files
- Directories

Checkpoint Manager: SSS Integration
Rudimentary Checkpoint Manager:
- Works with Bamboo, Maui, and MPDPM
- Long-delayed plans for a "next gen" version:
  - Upgraded interface spec (using LRS)
  - Management of "context files"
lampd:
- mpirun replacement for running LAM/MPI jobs under MPD

Checkpoint Manager: Non-SSS Integration
- Grid Engine: done by a third party (online howto)
- Verari Command Center: in testing
- PBS family:
  - Torque: Cluster Resources interested
  - PBSPro: Altair Engineering interested (if funded)
- SLURM: Mo Jette of LLNL interested (if funded)
- LoadLeveler: IBM may publish our URL in support documents

Process Manager Progress (ANL)
- Continued daily use on Chiba City, along with other components
- Miscellaneous hardening of the MPD implementation of the PM, particularly with respect to error conditions, prompted by Intel use and Chiba City experience
- Conversion to LRS, in preparation for presentation of the interface at this meeting
- Preparation for BG/L

Monitoring at NCSA: Warehouse Status
- Network code has been revamped; the new code is in CVS in sss-oscar
- Connections are now retried
- Monitoring starts without waiting for all connections to finish
- Connection and monitoring thread pools are independent
- No full reset (if many nodes are down, it continues blindly)
- Any component can be restarted; restart no longer depends on start order
- These features were intended for the sss-oscar 1.0 release (SC2004); they missed it, but are in 1.01
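The "connections are now retried" bullet can be sketched as a small retry loop: rather than abandoning a node on the first refused connection, the network code tries again a bounded number of times. This is a generic illustration of the behavior, not Warehouse's actual code; the function names are hypothetical.

```python
import time

def connect_with_retry(connect, retries=3, delay=0.01):
    """Retry a failing connection attempt instead of giving up.
    `connect` is any callable that raises OSError on failure."""
    for attempt in range(retries):
        try:
            return connect()
        except OSError:
            if attempt == retries - 1:
                raise       # out of attempts: report the failure
            time.sleep(delay)

# A node whose monitor only comes up on the third attempt.
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("connection refused")
    return "connected"

assert connect_with_retry(flaky_connect) == "connected"
assert attempts["n"] == 3
```

Running retries like this on a dedicated connection pool, separate from the monitoring threads, is what lets monitoring start without waiting for every connection to finish.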

Monitoring at NCSA: Warehouse Testing
- Warehouse run on the former Platinum cluster at NCSA
- Node count kept dropping:
  - 400 nodes originally
  - 200 nodes in the post-cluster configuration
  - 120 available for testing
- Ran on 120 nodes with no problems
- Have the head node, but cannot have the whole cluster, so didn't try sss-oscar

Monitoring at NCSA: Warehouse Testing (2)
- "Infinite" Itanium cluster (Infiniband development machine):
  - Have root access
  - Will run Warehouse for sure, for long-range testing
  - Might try the whole suite (semi-production)
- T2 cluster (Dell Xeon, 500+ nodes):
  - May run Warehouse across it (per Mike Showerman)
- Anecdote: went to test the new warehouse_monitor on xtorc. Installed and started new warehouse_monitors on the nodes, then called up warehouse_System_Monitor to make sure it wasn't running. The already-running System Monitor had connected to all the new warehouse_monitors, and everything was running fine.

Monitoring Work at NCSA
- David Boxer, an RA, worked on Warehouse: Craig handled bugs and fiddly things, David did the development heavy lifting
  - Revamped the network code (modularized)
  - Developed new info storage (more on this in the afternoon)
- New info store and logistics:
  - Info store redesigned and updated: done
  - Protocol redesigned: done
  - Send protocol: done
  - Receive protocol: still to do
- IBM offered him real money, so he is off to work for them

Monitoring Work at NCSA (2)
Wire protocol:
- I (Craig) need a working knowledge of signature/hash functions; once I do, I'll be back to coding on this
- Perilously close to being able to do useful stuff
Documentation:
- Have most of a web site written, with the philosophy of Warehouse and the debugging tools
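For the signature/hash functions mentioned above, a standard approach is to append a keyed hash (HMAC) to each wire-protocol message, so the receiver can verify it is authentic and untampered. The slides do not say which scheme Warehouse adopted; this is a generic sketch using Python's standard library, with an assumed pre-shared key.

```python
import hashlib
import hmac

SECRET = b"shared-key"   # assumed pre-shared between monitor and collector

def sign(payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to a wire-protocol payload."""
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"|" + tag

def verify(message: bytes) -> bytes:
    """Check the tag and return the payload, or raise on tampering."""
    payload, _, tag = message.rpartition(b"|")
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad signature")
    return payload

msg = sign(b"node42 load=0.93")
assert verify(msg) == b"node42 load=0.93"
```

Using `hmac.compare_digest` for the comparison avoids leaking timing information, which is why it is preferred over `==` here.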

Monitoring at NCSA: Future Work
New interaction (to come) with the Node Build and Config Manager:
- On start-up, will talk to the Node State Manager and get the list of up nodes
- Subscribe to Node State Manager events for updates
- For now, can continue to store node state, then transition to the Scheduler obtaining state information itself
Also to come:
- Intelligent error handling (target-based vs. severity-based)
- Command-line debugging/control?
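The query-then-subscribe interaction above follows a common pattern: fetch the current up-node list once at start-up, then register a callback for state-change events so the view stays current without polling. A minimal sketch, with hypothetical class and method names standing in for the Node State Manager interface:

```python
class NodeStateManager:
    """Toy publish/subscribe model of the interaction described above."""

    def __init__(self, up_nodes):
        self.up = set(up_nodes)
        self.subscribers = []

    def get_up_nodes(self):
        """Initial query: the current list of up nodes."""
        return sorted(self.up)

    def subscribe(self, callback):
        """Register for subsequent node state-change events."""
        self.subscribers.append(callback)

    def set_state(self, node, is_up):
        """Record a state change and notify all subscribers."""
        if is_up:
            self.up.add(node)
        else:
            self.up.discard(node)
        for cb in self.subscribers:
            cb(node, is_up)

events = []
nsm = NodeStateManager(["n1", "n2", "n3"])
known_up = set(nsm.get_up_nodes())                   # start-up query
nsm.subscribe(lambda node, up: events.append((node, up)))
nsm.set_state("n2", False)                           # a node goes down
assert events == [("n2", False)]
```

With this split, a monitor like Warehouse never needs to track node state itself, which is exactly the transition the slide proposes.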