Process Management & Monitoring WG Quarterly Report January 25, 2005.


1 Process Management & Monitoring WG Quarterly Report January 25, 2005

2 PMWG Quarterly Report: Components
Process Management
  - Process Manager
  - Checkpoint Manager
Monitoring
  - Job Monitor
  - System/Node Monitors
  - Meta Monitoring

3 Component Progress
Checkpoint Manager (LBNL)
  - BLCR
Process Manager (ANL)
  - MPDPM
Monitoring (NCSA)
  - Warehouse

4 Checkpoint Manager: BLCR Status
Full save and restore of
  - CPU registers
  - Memory
  - Signals (handlers & pending signals)
  - PID, PGID, etc.
  - Files (with limitations)
  - Communication (via MPI)

5 Checkpoint Manager: BLCR Status (Files)
Files
  - Files unmodified between checkpoint and restart
  - Files appended to between checkpoint and restart
  - Pipes between processes

6 Checkpoint Manager: BLCR Status (Comms)
LAM/MPI 7.x over TCP and GM
  - Handles in-flight data (drains)
  - Linear scaling of time with job size
  - Migratable
OpenMPI
  - Will inherit LAM/MPI’s support
ChaMPIon/Pro (Verari)

7 Checkpoint Manager: BLCR Status (Ports)
Linux only
  - “Stock” 2.4.X
  - RedHat 7.2 – 9
  - SuSE 7.2 – 9.0
  - RHEL3/CentOS 3.1
  - 2.6.x port in progress (FC2 & SuSE 9.2)
x86 (IA32) only today
  - x86_64 (Opteron) will follow 2.6.x port
  - Alpha, PPC and PPC64 may be trivial
  - No IA64 (Itanium) plans

8 Checkpoint Manager: BLCR Future Work
Additional coverage
  - Process groups and sessions (next priority)
  - Terminal characteristics
  - Interval timers
  - Queued RT signals
More on files
  - Mutable files
  - Directories

9 Checkpoint Manager: SSS Integration
Rudimentary Checkpoint Manager
  - Works with Bamboo, Maui and MPDPM
  - Long delayed plans for “next gen”
      - Upgraded interface spec (using LRS)
      - Management of “context files”
lampd
  - mpirun replacement for running LAM/MPI jobs under MPD

10 Checkpoint Manager: Non-SSS Integration
Grid Engine
  - DONE by 3rd party (online howto)
Verari Command Center
  - In testing
PBS family
  - Torque: Cluster Resources interested
  - PBSPro: Altair Engineering interested (if funded)
SLURM
  - Mo Jette of LLNL interested (if funded)
LoadLeveler
  - IBM may publish our URL in support documents

11 Process Manager Progress (ANL)
Continued daily use on Chiba City, along with other components
Miscellaneous hardening of MPD implementation of PM, particularly with respect to error conditions, prompted by Intel use and Chiba experience
Conversion to LRS, in preparation for presentation of interface at this meeting
Preparation for BG/L

12 Monitoring at NCSA: Warehouse Status
Network code has been revamped; that code is in CVS in oscar sss
Connections are now retried
Starting to monitor does not wait for all connections to finish
Connection and monitoring thread pools are independent
No full reset (if lots of nodes are down, continues blindly)
Any component can be restarted. Restart no longer depends on start order.
Features were intended for sss-oscar 1.0 (SC2004); they didn't make it, but made it into 1.01
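
The connection behaviour listed on this slide (retried connections, monitoring that starts without waiting for every connection, and separate connection and monitoring thread pools) can be illustrated with a minimal Python sketch. This is not Warehouse code: the function names, the `connect` and `poll` callables, and the pool sizes are all hypothetical, chosen only to show the pattern.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def connect_with_retry(connect, node, attempts=3, delay=0.0):
    """Call connect(node) until it succeeds or the attempts run out.

    Returns the connection object, or None if every attempt failed.
    (Hypothetical helper; not part of Warehouse.)
    """
    for _ in range(attempts):
        try:
            return connect(node)
        except OSError:
            time.sleep(delay)
    return None


def start_monitoring(nodes, connect, poll, attempts=3):
    """Connect in one thread pool and poll in a separate one.

    Polling of a node is submitted as soon as that node's connection
    succeeds. Nodes that never connect are skipped rather than
    triggering a full reset. Returns {node: poll result}.
    """
    poll_futures = {}
    lock = threading.Lock()

    with ThreadPoolExecutor(max_workers=4) as connect_pool, \
         ThreadPoolExecutor(max_workers=4) as monitor_pool:

        def connect_then_schedule(node):
            conn = connect_with_retry(connect, node, attempts)
            if conn is not None:
                future = monitor_pool.submit(poll, conn)
                with lock:
                    poll_futures[node] = future

        # Consuming the map iterator waits for the connect tasks only;
        # polls for early connections are already running by then.
        list(connect_pool.map(connect_then_schedule, nodes))
        return {node: f.result() for node, f in poll_futures.items()}
```

In this sketch a node that stays down simply never appears in the result, which mirrors the "continues blindly" behaviour noted on the slide.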

13 Monitoring at NCSA: Warehouse Testing
Warehouse run on former Platinum cluster at NCSA
Node count kept dropping
  - 400 nodes originally
  - 200 nodes in post-cluster configuration
  - 120 available for testing
Ran on 120 nodes with no problems
Have head node, but cannot have whole cluster
  - So didn't try sss-oscar

14 Monitoring at NCSA: Warehouse Testing (2)
"Infinite" Itanium cluster (Infiniband development machine)
  - Have root access
  - Will run warehouse for sure, for long range testing
  - Might try whole suite (semi-production)
T2 cluster (Dell Xeon 500+ nodes)
  - May run warehouse across (Mike Showerman says)
Anecdote:
  - Went to test new warehouse_monitor on xtorc. Installed and started new warehouse_monitors on nodes. Called up warehouse_System_Monitor to make sure it wasn't running. The already running System Monitor had connected to all the new warehouse_monitors and everything was running fine.

15 Monitoring Work at NCSA
David Boxer, RA, working on warehouse
Craig worked bugs and fiddly things, David did development heavy lifting
  - Revamped network code (modularized)
  - Developed new info storage (more on this in the afternoon)
  - New info store and logistics
      - info store: redesigned and updated: DONE
      - protocol re-designed: DONE
      - send protocol: DONE
      - receive protocol: still to do
IBM offered him real money - he's off to work for them.

16 Monitoring Work at NCSA
Wire Protocol:
  - I (Craig) need to have working knowledge of signature/hash functions. When I do, I'll be back to coding on this.
  - Perilously close to being able to do useful stuff
Documentation:
  - Have most of a web site written with philosophy of warehouse, and debugging tools.

17 Monitoring at NCSA: Future Work
New interaction (to come): Node Build and Config Manager
  - On start-up, will talk to Node State Manager and get list of up nodes
  - Subscribe to Node State Manager events for updates
  - For now, can continue to store node state, transition to Scheduler obtaining state information itself.
Also to come:
  - Intelligent error handling (target-based vs. severity-based)
  - Command line debugging/control?
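
The planned interaction with the Node State Manager (fetch the list of up nodes at start-up, then subscribe to events for updates) follows a common snapshot-plus-subscription pattern. The Python sketch below illustrates that pattern only; the class names, method names, and the "up"/"down" state strings are assumptions made for illustration and do not reflect the actual SSS component interfaces.

```python
class NodeStateManager:
    """Hypothetical stand-in for the Node State Manager component."""

    def __init__(self, initial_states):
        self._states = dict(initial_states)  # node name -> "up" or "down"
        self._subscribers = []

    def up_nodes(self):
        """Return the names of all nodes currently marked up."""
        return sorted(n for n, s in self._states.items() if s == "up")

    def subscribe(self, callback):
        """Register callback(node, state) for future state changes."""
        self._subscribers.append(callback)

    def set_state(self, node, state):
        """Record a state change and notify every subscriber."""
        self._states[node] = state
        for callback in self._subscribers:
            callback(node, state)


class MonitoringClient:
    """Hypothetical consumer: snapshot at start-up, then event updates."""

    def __init__(self, node_state_manager):
        # Step 1: get the list of up nodes once, at start-up.
        self.up = set(node_state_manager.up_nodes())
        # Step 2: subscribe so later changes arrive as events.
        node_state_manager.subscribe(self.on_node_event)

    def on_node_event(self, node, state):
        if state == "up":
            self.up.add(node)
        else:
            self.up.discard(node)
```

With this arrangement the client no longer needs to store node state authoritatively; it only mirrors what the Node State Manager reports.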

