
Reliability and Troubleshooting with Condor Douglas Thain Condor Project University of Wisconsin PPDG Troubleshooting Workshop 12 December 2002.


1 Reliability and Troubleshooting with Condor
Douglas Thain
Condor Project
University of Wisconsin
PPDG Troubleshooting Workshop
12 December 2002

2 Condor Reliability
Condor was designed for idle machines:
–Reclaim, reboot, crash, out of memory...
–Sounds much like the grid!
US-CMS testbed
–Distributed ownership, control, and resources.
–(War stories abound.)
Condor tools add controlled reliability.
–Not absolute reliability, but:
  –A finite amount of retry.
  –A notification/recovery strategy.
  –Logging and book-keeping.
  –Known state after a failure.

3 US-CMS Physical Structure
[Diagram labels: Head Node, MOP Master, Private Network, Head Node, Public Internet, Workers]

4 US-CMS Logical Structure
[Diagram labels: Master Site, Impala, MOP, Condor-G, Worker, Globus, Condor, Real Work, DAGMan]
Red items expect a reliable environment.
Green items create a reliable environment.

5 [Diagram labels: Local Resource Manager, Condor-G Gatekeeper, Job Managers, Run, Idle, Head Node, Condor-G Submitter, System Log, Job Log, Job Queue, Run, Idle, Grid Managers, GAHP-Server, GRAM, End-User Tools (transaction interface)]

6 Directed Acyclic Graph Manager (DAGMan)
Condor-G deals with system failures; DAGMan deals with app and user failures.
PRE and POST scripts may be used to validate inputs and outputs.
“Rescue DAG” describes what is left unexecuted.
DAG nodes may themselves be DAGs.
[Diagram labels: nodes A, B, C, D; scripts pre.pl, post.pl]
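
The PRE/POST and rescue ideas on this slide can be sketched in a few lines of Python. This is an illustration only, not DAGMan's implementation: the `Node` class, the `run_dag` function, and the diamond shape (A feeds B and C, which feed D) are assumptions made for the example, and the success rule is deliberately simpler than the one DAGMan applies.

```python
from typing import Callable, Dict, List, Optional, Set

Step = Callable[[], bool]  # a step returns True on success


class Node:
    """One DAG node: an optional PRE check, the job, an optional POST check."""

    def __init__(self, job: Step, pre: Optional[Step] = None,
                 post: Optional[Step] = None):
        self.job, self.pre, self.post = job, pre, post

    def run(self) -> bool:
        # Simplification (assumption): the node succeeds only if every step
        # it has succeeds, and a failed step stops the steps after it.
        for step in (self.pre, self.job, self.post):
            if step is not None and not step():
                return False
        return True


def run_dag(nodes: Dict[str, Node], parents: Dict[str, List[str]]) -> List[str]:
    """Run every node whose parents all succeeded.

    Returns the names that did not complete, in declaration order: a
    stand-in for what a rescue DAG would record as left unexecuted.
    """
    done: Set[str] = set()
    failed: Set[str] = set()
    progress = True
    while progress:
        progress = False
        for name, node in nodes.items():
            if name in done or name in failed:
                continue
            if all(p in done for p in parents.get(name, [])):
                (done if node.run() else failed).add(name)
                progress = True
    return [name for name in nodes if name not in done]


def ok() -> bool:
    return True


# Hypothetical diamond: A feeds B and C, which both feed D.
nodes = {
    "A": Node(job=ok, pre=ok),               # PRE validates inputs
    "B": Node(job=ok, post=lambda: False),   # POST rejects the output
    "C": Node(job=ok),
    "D": Node(job=ok),
}
parents = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
rescue = run_dag(nodes, parents)  # B failed, so D never became ready
```

Rerunning only the names in `rescue` after fixing the cause of the failure is the sketch's analogue of resubmitting a rescue DAG: completed work (A and C here) is not repeated.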

7 Fault Tolerant Shell (FTSH)
Standard shell scripts are very error-prone.
FTSH adds time limits, retry, logging, and clean termination.
“Exceptions for scripts”: unexpected errors cannot accidentally be ignored.

try 10 times
    try for 15 minutes
        globus_url_copy A B
    end
    try for 1 hour
        run-simulation C
        gzip D
    end
    try for 15 minutes
        globus_url_copy D E
    end
end
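
For readers who do not use FTSH, the retry semantics of the `try` blocks on this slide can be approximated in Python. This is a rough sketch, not how FTSH is implemented: `try_block`, `TryFailed`, and `pipeline` are hypothetical names, the doubling delay between attempts is an assumption, and the time limit is only checked between attempts, so it does not interrupt a command that is already running.

```python
import subprocess
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TryFailed(Exception):
    """Raised when a try block exhausts its attempts or its time budget."""


def try_block(body: Callable[[], T], times: Optional[int] = None,
              seconds: Optional[float] = None, delay: float = 1.0,
              max_delay: float = 60.0) -> T:
    """Call body() until it returns without raising, within the given limits."""
    if times is None and seconds is None:
        raise ValueError("give a limit: times, seconds, or both")
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return body()
        except Exception as exc:  # any failure counts, none is ignored
            last = exc
        if times is not None and attempt >= times:
            raise TryFailed(f"gave up after {attempt} attempts") from last
        if seconds is not None and time.monotonic() - start >= seconds:
            raise TryFailed(f"gave up after {seconds} seconds") from last
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # assumed backoff policy


def run(*argv: str) -> None:
    """Run one command; a non-zero exit status raises CalledProcessError."""
    subprocess.run(list(argv), check=True)


def pipeline() -> None:
    """Mirror of the slide's script; command names are copied from the slide."""
    def attempt_all() -> None:
        try_block(lambda: run("globus_url_copy", "A", "B"), seconds=15 * 60)
        try_block(lambda: (run("run-simulation", "C"), run("gzip", "D")),
                  seconds=60 * 60)
        try_block(lambda: run("globus_url_copy", "D", "E"), seconds=15 * 60)

    try_block(attempt_all, times=10)
```

Because every failure surfaces as an exception, a failed inner block makes the whole outer attempt fail and restart, which is the "exceptions for scripts" behavior the slide describes.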

8 Hawkeye (Example Hawkeye Page)
[Diagram labels: Probe Modules (three), Hawkeye Manager, ClassAd Data, Policy Manager, Trigger Exprs, ClassAd Queries, Submit Repair Job, Contact Sysadmin, Log Event]
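
The labels on this slide suggest a flow in which probe data, published as ClassAds, is matched against trigger expressions that select an action. The Python sketch below illustrates that idea only; it does not use Hawkeye or the ClassAd language. The dictionary standing in for a ClassAd, the attribute names (`DiskFreeMB`, `LoadAvg`), the thresholds, and the `evaluate` function are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

ClassAd = Dict[str, object]  # stand-in for a ClassAd: attribute name -> value
# A trigger is (name, expression over one ad, action to take when it matches).
Trigger = Tuple[str, Callable[[ClassAd], bool], str]


def evaluate(ad: ClassAd, triggers: List[Trigger]) -> List[Tuple[str, str]]:
    """Return (trigger name, action) for every trigger whose expression matches."""
    return [(name, action) for name, expr, action in triggers if expr(ad)]


# Hypothetical probe output for one worker.
ad: ClassAd = {"Machine": "worker01", "DiskFreeMB": 120, "LoadAvg": 0.4}

# Hypothetical policy; the action names echo the slide's three outcomes.
triggers: List[Trigger] = [
    ("low_disk", lambda a: a["DiskFreeMB"] < 500, "submit_repair_job"),
    ("high_load", lambda a: a["LoadAvg"] > 8.0, "contact_sysadmin"),
    ("any_report", lambda a: "Machine" in a, "log_event"),
]

fired = evaluate(ad, triggers)
```

Keeping the policy as data, separate from the probes, means a new condition can be added without touching the code that collects measurements.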

9 For More Info...
Condor-G
–http://www.cs.wisc.edu/condor/condorg
DAGMan
–http://www.cs.wisc.edu/condor/dagman
Fault Tolerant Shell
–http://www.cs.wisc.edu/~thain/research/ftsh
Hawkeye
–http://www.cs.wisc.edu/condor/hawkeye
Philosophy of Error Management
–http://www.cs.wisc.edu/condor/doc/error-scope.pdf
The Condor Project
–http://www.cs.wisc.edu/condor

