Process Management & Monitoring WG Quarterly Report August 26, 2004.


1 Process Management & Monitoring WG Quarterly Report August 26, 2004

2 PMWG Quarterly Report: Components
Process Management
  - Process Manager
  - Checkpoint Manager
Monitoring
  - Job Monitor
  - System/Node Monitors
  - Meta Monitoring

3 Component Progress
  - Checkpoint Manager (LBNL)
  - Monitoring (NCSA)
  - Process Manager (ANL)

4 Checkpoint Manager: BLCR Status
Full save and restore of
  - CPU registers
  - Memory
  - Signals (handlers & pending signals)
  - PID, PGID, etc.
  - Files (w/ limitations)
  - Communication (via LAM/MPI)

5 Checkpoint Manager: BLCR Status
Files
  - Files unmodified between checkpoint and restart
  - Files appended to between checkpoint and restart
  - Pipes between processes

6 Checkpoint Manager: BLCR Status
LAM/MPI over TCP (and GM)
  - Handles in-flight data (drains)
  - Linear scaling
  - Migratable
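
The slide does not say how in-flight data is drained before a checkpoint. As a rough illustration only, one common approach is for each peer to report how many bytes it has sent (a "bookmark") and for the receiver to keep reading until its own count matches. The sketch below assumes that scheme; the function name `drain_to_bookmark` and its parameters are hypothetical and are not taken from BLCR or LAM/MPI.

```python
import socket


def drain_to_bookmark(sock, already_received, bookmark, bufsize=4096):
    """Read from sock until the total bytes received equals the peer's sent count.

    already_received: bytes this side has consumed so far (assumed tracked by caller).
    bookmark: total bytes the peer reports having sent (assumed exchanged out of band).
    Returns the drained bytes so they can be saved alongside the checkpoint.
    """
    pending = bytearray()
    while already_received + len(pending) < bookmark:
        remaining = bookmark - already_received - len(pending)
        chunk = sock.recv(min(bufsize, remaining))
        if not chunk:
            raise ConnectionError("peer closed before drain completed")
        pending.extend(chunk)
    return bytes(pending)
```

Once every connection has been drained this way, no data remains in the network, so the process image alone captures the full communication state.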

7 Checkpoint Manager: BLCR Status
Linux only
  - "Stock" 2.4.X
  - RedHat 7.2, 7.3, 8.0, 9
  - SuSE 7.? and 9
  - RHEL3/CentOS nearly ready
  - 2.6.x port has begun "in background"
X86 only
  - Alpha, PPC may be 95% ready
  - IA64 and X86_64 possible

8 Checkpoint Manager: BLCR Future Work
More on files
  - Mutable files
  - Directories
Misc.
  - Process groups and sessions
  - Terminal characteristics

9 Checkpoint Manager: SSS Work
Rudimentary Checkpoint Manager
  - Works with Bamboo and MPDPM
  - Long-delayed plans for "next gen"
      - Upgraded interface spec (what syntax?)
      - Management of "context files"
lampd
  - mpirun replacement for running LAM/MPI jobs under MPD

10 Process Manager Progress
Continued daily use on Chiba City, along with other components
At Brett's request, addition of an option to signal the entire (Unix) process group of a user process or just the process itself.
  - Default is just the top-level user process
  - Example:
Miscellaneous hardening of MPD system, particularly in error conditions, prompted by Intel use.
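
The difference between signaling one process and signaling its whole Unix process group can be sketched with the standard `os.kill` and `os.killpg` calls. This is a minimal illustration, not MPD's actual code or command-line option: the helper names `signal_job` and `run_and_signal` are hypothetical, and the default mirrors the slide (top-level process only).

```python
import os
import signal
import subprocess
import sys
import time


def signal_job(pid, sig, whole_group=False):
    """Send sig to one process, or to every process in its process group."""
    if whole_group:
        os.killpg(os.getpgid(pid), sig)
    else:
        os.kill(pid, sig)  # default: only the top-level user process


def run_and_signal(whole_group):
    """Start a process that spawns a helper, signal it, return its exit status."""
    child_code = (
        "import subprocess, sys, time\n"
        "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(5)'])\n"
        "time.sleep(5)\n"
    )
    # start_new_session=True gives the job its own session and process group,
    # so a group-wide signal does not reach this launcher.
    top = subprocess.Popen([sys.executable, "-c", child_code],
                           start_new_session=True)
    time.sleep(0.5)  # give the job a moment to spawn its helper
    signal_job(top.pid, signal.SIGTERM, whole_group=whole_group)
    return top.wait()  # negative signal number if killed by a signal
```

With `whole_group=False` the helper process keeps running after its parent is terminated, which is the situation a group-wide signal is meant to avoid.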

11 Monitoring Work at NCSA

A major fix has been implemented in warehouse. Previously, once network conditions degraded past a certain threshold, none of the nodes were monitored at all, because messages stacked up in the incoming sockets. The code has been fixed so that multiple messages can be processed per pass, which means that when that threshold is exceeded, the nodes are simply monitored more slowly.

This code was tested in the "good" realm against Dave, Scott, and Brett in July, before another release of the RMAP suite. It has not been tested in the "bad" realm, because that requires a dedicated test.

The bad news is that when I came back from vacation in Britain, the hard drive on my desktop had suffered a complete hardware failure. I had been backing up warehouse religiously, and since I had transported code down to xtorc to create new RPMs, I lost nothing on warehouse. I did, however, lose a bunch of work on the SSSRMAP wire protocol, including annotated code that I would have liked to keep. Fortunately, most of what I'd done was figuring things out, and enough of that carried over in memory that reconstructing it the second time is much easier.

I've been working feverishly to get back on track with that project. I am at the point where it will be useful to sit down with Narayan and Dave and ask "where does this go" install/Makefile sorts of questions.
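
The warehouse fix described on this slide (reading several queued messages per pass instead of one) might look roughly like the following. The report does not show warehouse's code, so this is only a sketch under assumptions: non-blocking datagram sockets, a `select`-based loop, and the hypothetical names `drain_ready`, `monitor_pass`, and `max_per_pass`.

```python
import select
import socket


def drain_ready(sock, max_per_pass=64, bufsize=4096):
    """Read up to max_per_pass queued datagrams from a non-blocking socket."""
    messages = []
    for _ in range(max_per_pass):
        try:
            messages.append(sock.recv(bufsize))
        except BlockingIOError:
            break  # nothing more queued right now
    return messages


def monitor_pass(socks, timeout=0.1, max_per_pass=64):
    """One monitoring pass: drain every socket that has data waiting."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return {s.fileno(): drain_ready(s, max_per_pass) for s in readable}
```

Capping the reads per socket keeps one busy node from starving the others; a backlog then takes more passes to clear, which matches the "monitored more slowly" behavior rather than no monitoring at all.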

