Reliability and Troubleshooting with Condor Douglas Thain Condor Project University of Wisconsin PPDG Troubleshooting Workshop 12 December 2002.

Presentation transcript:

Reliability and Troubleshooting with Condor Douglas Thain Condor Project University of Wisconsin PPDG Troubleshooting Workshop 12 December 2002

Condor Reliability
Condor was designed for idle machines:
  - Reclaim, reboot, crash, out of memory...
  - Sounds much like the grid!
US-CMS testbed:
  - Distributed ownership, control, and resources.
  - (War stories abound.)
Condor tools add controlled reliability.
  - Not absolute reliability, but:
      A finite amount of retry.
      A notification/recovery strategy.
      Logging and book-keeping.
      Known state after a failure.

US-CMS Physical Structure
[Diagram labels: MOP Master, Public Internet, Head Node (two shown), Private Network, Workers]

US-CMS Logical Structure
[Diagram labels: Master Site, Impala, MOP, DAGMan, Condor-G, Worker, Globus, Condor, Real Work]
Red items expect a reliable environment.
Green items create a reliable environment.

[Diagram labels: Condor-G Submitter, End-User Tools (transaction interface), Job Queue, Job Log, System Log, Grid Managers, GAHP-Server, GRAM, Head Node, Condor-G Gatekeeper, Job Managers, Local Resource Manager; job states shown: Run, Idle]

Directed Acyclic Graph Manager (DAGMan)
  - Condor-G deals with system failures; DAGMan deals with application and user failures.
  - PRE and POST scripts may be used to validate inputs and outputs.
  - A "Rescue DAG" describes what is left unexecuted.
  - DAG nodes may themselves be DAGs.
[Diagram: DAG with nodes A, B, C, D and scripts pre.pl and post.pl]
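The PRE/POST and Rescue DAG ideas on this slide can be illustrated with a small Python sketch. It is not DAGMan itself and does not use DAGMan's file format; the function name `run_dag`, the diamond-shaped graph, and the always-succeeding jobs are assumptions made for illustration. Each node runs its optional PRE check, its job, and its optional POST check; any node that fails, or whose parent failed, is reported as the "rescue" set that a later run would still need to execute.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(parents, jobs, pre=None, post=None):
    """Run nodes in dependency order and return (done, rescue).

    parents: node -> set of nodes that must complete first
    jobs, pre, post: node -> zero-argument callable returning True on success
    """
    pre, post = pre or {}, post or {}
    done, rescue = [], []
    for node in TopologicalSorter(parents).static_order():
        # A node cannot run if any of its parents is still unfinished.
        if any(p in rescue for p in parents.get(node, ())):
            rescue.append(node)
            continue
        steps = (pre.get(node), jobs[node], post.get(node))
        # all() short-circuits, so a failed PRE check skips the job itself.
        if all(step() for step in steps if step is not None):
            done.append(node)
        else:
            rescue.append(node)
    return done, rescue


if __name__ == "__main__":
    # Hypothetical diamond: A feeds B and C, which both feed D.
    parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
    jobs = {name: (lambda: True) for name in parents}
    # Pretend the output validation after C fails.
    done, rescue = run_dag(parents, jobs, post={"C": lambda: False})
    print("done:", done)
    print("rescue:", rescue)
```

In this sketch a failed POST check on C leaves both C and D in the rescue set, while B still completes because it does not depend on C.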

Fault Tolerant Shell (FTSH)
  - Standard shell scripts are very error-prone.
  - FTSH adds time limits, retry, logging, and clean termination.
  - "Exceptions for scripts": unexpected errors cannot accidentally be ignored.
Example:
    try 10 times
        try for 15 minutes
            globus_url_copy A B
        end
        try for 1 hour
            run-simulation C
            gzip D
        end
        try for 15 minutes
            globus_url_copy D E
        end
    end
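For readers who do not have FTSH at hand, the retry semantics of a `try` block can be approximated in Python. This is a rough sketch rather than FTSH's implementation: the names `try_block` and `TryFailed`, the doubling delay between attempts, and the injectable `sleep` and `clock` parameters are assumptions made for illustration. Unlike FTSH, the sketch only checks the time limit between attempts and does not terminate a body that is still running.

```python
import time


class TryFailed(Exception):
    """Raised when a try block runs out of attempts or time."""


def try_block(body, attempts=None, seconds=None, base_delay=1.0,
              sleep=time.sleep, clock=time.monotonic):
    """Call body() until it returns without raising, within the given limits."""
    if attempts is None and seconds is None:
        raise ValueError("give at least one of attempts or seconds")
    deadline = None if seconds is None else clock() + seconds
    delay, count = base_delay, 0
    while True:
        count += 1
        try:
            return body()
        except Exception as err:  # any failure counts, so none is ignored
            last = err
        out_of_attempts = attempts is not None and count >= attempts
        out_of_time = deadline is not None and clock() + delay > deadline
        if out_of_attempts or out_of_time:
            raise TryFailed(f"gave up after {count} attempt(s)") from last
        sleep(delay)
        delay *= 2  # assumed back-off policy: wait longer before each retry


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_copy():
        # Hypothetical stand-in for a transfer that fails twice, then works.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("transient failure")
        return "copied"

    # Nested limits, as in the slide: an attempt limit around a time limit.
    result = try_block(
        lambda: try_block(flaky_copy, seconds=900, sleep=lambda s: None),
        attempts=10,
        sleep=lambda s: None,
    )
    print(result, "after", state["calls"], "calls")
```

Because `TryFailed` is an ordinary exception, an inner block that gives up is seen by the enclosing block as one more failed attempt, which is how the nesting in the slide's example composes.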

Hawkeye (Example Hawkeye Page)
[Diagram labels: Probe Modules (three shown), ClassAd Data, Hawkeye Manager, ClassAd Queries, Policy Manager, Trigger Exprs; resulting actions: Submit Repair Job, Contact Sysadmin, Log Event]
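The flow suggested by this slide (probes report attributes, trigger expressions are evaluated against them, and a match leads to an action) can be sketched in Python. This does not use Hawkeye or the ClassAd language; plain dictionaries stand in for ClassAds, plain functions stand in for trigger expressions, and the attribute names `Machine`, `DiskFreeMB`, and `LoadAvg` are hypothetical.

```python
def fire_triggers(ads, triggers):
    """Return (machine, action) pairs for every trigger that matches an ad.

    ads: list of dicts, one per machine, as reported by probe modules
    triggers: list of (expression, action) pairs; expression takes one ad
    """
    fired = []
    for ad in ads:
        for expression, action in triggers:
            try:
                matched = bool(expression(ad))
            except KeyError:
                # The probe did not report this attribute: treat as no match.
                matched = False
            if matched:
                fired.append((ad.get("Machine", "unknown"), action))
    return fired


if __name__ == "__main__":
    # Hypothetical probe data for two machines.
    ads = [
        {"Machine": "node1", "DiskFreeMB": 50, "LoadAvg": 0.3},
        {"Machine": "node2", "LoadAvg": 9.5},
    ]
    # Actions are taken from the slide; the thresholds are made up.
    triggers = [
        (lambda ad: ad["DiskFreeMB"] < 100, "Submit Repair Job"),
        (lambda ad: ad["LoadAvg"] > 8.0, "Contact Sysadmin"),
    ]
    for machine, action in fire_triggers(ads, triggers):
        print(machine, "->", action)
```

Keeping the expressions separate from the actions mirrors the split between the trigger expressions and the policy manager in the diagram: the same probe data can be checked against many policies without changing the probes.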

For More Info...
  - Condor-G
  - DAGMan
  - Fault Tolerant Shell
  - Hawkeye
  - Philosophy of Error Management
  - The Condor Project