Condor Project Computer Sciences Department University of Wisconsin-Madison Condor-G Operations.

Slides:

Advertisements

Similar presentations

Using EC2 with HTCondor Todd L Miller 1. › Introduction › Submitting an EC2 job (user tutorial) › New features and other improvements › John Hover talking.

Advertisements

Grid Resource Allocation Management (GRAM) GRAM provides the user to access the grid in order to run, terminate and monitor jobs remotely. The job request.

Part 7: CondorG A: Condor-G B: Laboratory: CondorG.

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

1 Using Stork Barcelona, 2006 Condor Project Computer Sciences Department University of Wisconsin-Madison

Condor Project Computer Sciences Department University of Wisconsin-Madison Stork An Introduction Condor Week 2006 Milan.

Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun.

A Computation Management Agent for Multi-Institutional Grids

Jaime Frey Computer Sciences Department University of Wisconsin-Madison Condor-G: A Case in Distributed.

Basic Grid Job Submission Alessandra Forti 28 March 2006.

First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova

1 Enabling Secure Internet Access with ISA Server.

Zach Miller Condor Project Computer Sciences Department University of Wisconsin-Madison Flexible Data Placement Mechanisms in Condor.

Jaime Frey Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.

Course 6421A Module 7: Installing, Configuring, and Troubleshooting the Network Policy Server Role Service Presentation: 60 minutes Lab: 60 minutes Module.

Condor at Brookhaven Xin Zhao, Antonio Chan Brookhaven National Lab CondorWeek 2009 Tuesday, April 21.

Zach Miller Computer Sciences Department University of Wisconsin-Madison What’s New in Condor.

Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison What’s New in Condor.

DONE-10: Adminserver Survival Tips Brian Bowman Product Manager, Data Management Group.

Troubleshooting Windows Vista Security Chapter 4.

Hao Wang Computer Sciences Department University of Wisconsin-Madison Security in Condor.

20411B 8: Installing, Configuring, and Troubleshooting the Network Policy Server Role Presentation: 60 minutes Lab: 60 minutes After completing this module,

Peter Keller Computer Sciences Department University of Wisconsin-Madison Quill Tutorial Condor Week.

GRAM5 - A sustainable, scalable, reliable GRAM service Stuart Martin - UC/ANL.

3-2.1 Topics Grid Computing Meta-schedulers –Condor-G –Gridway Distributed Resource Management Application (DRMAA) © 2010 B. Wilkinson/Clayton Ferner.

1 The Roadmap to New Releases Todd Tannenbaum Department of Computer Sciences University of Wisconsin-Madison

Zach Miller Computer Sciences Department University of Wisconsin-Madison Bioinformatics Applications.

Hao Wang Computer Sciences Department University of Wisconsin-Madison Authentication and Authorization.

Grid job submission using HTCondor Andrew Lahiff.

Grid Compute Resources and Job Management. 2 Local Resource Managers (LRM)‏ Compute resources have a local resource manager (LRM) that controls:  Who.

Dealing with real resources Wednesday Afternoon, 3:00 pm Derek Weitzel OSG Campus Grids University of Nebraska.

Remote Administration Remote Desktop Remote Desktop Gateway Remote Assistance Windows Remote Management Service Remote Server Administration Tools.

1 Implementing Monitoring and Reporting. 2 Why Should Implement Monitoring? One of the biggest complaints we hear about firewall products from almost.

Open Science Grid OSG CE Quick Install Guide Siddhartha E.S University of Florida.

1 The Roadmap to New Releases Todd Tannenbaum Department of Computer Sciences University of Wisconsin-Madison

The Roadmap to New Releases Derek Wright Computer Sciences Department University of Wisconsin-Madison

TeraGrid Advanced Scheduling Tools Warren Smith Texas Advanced Computing Center wsmith at tacc.utexas.edu.

Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Quill / Quill++ Tutorial.

Condor-G A Quick Introduction Alan De Smet Condor Project University of Wisconsin - Madison.

July 11-15, 2005Lecture3: Grid Job Management1 Grid Compute Resources and Job Management.

Review of Condor,SGE,LSF,PBS

Condor Project Computer Sciences Department University of Wisconsin-Madison Grids and Condor Barcelona,

Derek Wright Computer Sciences Department University of Wisconsin-Madison Condor and MPI Paradyn/Condor.

Configuring and Troubleshooting Identity and Access Solutions with Windows Server® 2008 Active Directory®

Module 10: Windows Firewall and Caching Fundamentals.

FermiGrid School Steven Timm FermiGrid School FermiGrid 201 Scripting and running Grid Jobs.

© Geodise Project, University of Southampton, Geodise Middleware Graeme Pound, Gang Xue & Matthew Fairman Summer 2003.

Grid Compute Resources and Job Management. 2 Job and compute resource management This module is about running jobs on remote compute resources.

Nicholas Coleman Computer Sciences Department University of Wisconsin-Madison Distributed Policy Management.

Grid Compute Resources and Job Management. 2 Grid middleware - “glues” all pieces together Offers services that couple users with remote resources through.

Jaime Frey Computer Sciences Department University of Wisconsin-Madison What’s New in Condor-G.

Dan Bradley Condor Project CS and Physics Departments University of Wisconsin-Madison CCB The Condor Connection Broker.

Matthew Farrellee Computer Sciences Department University of Wisconsin-Madison Condor and Web Services.

JSS Job Submission Service Massimo Sgaravatto INFN Padova.

Open Science Grid Build a Grid Session Siddhartha E.S University of Florida.

Hands-On Microsoft Windows Server 2008 Chapter 5 Configuring Windows Server 2008 Printing.

Condor Project Computer Sciences Department University of Wisconsin-Madison Condor Job Router.

EGEE-II INFSO-RI Enabling Grids for E-sciencE Practical using WMProxy advanced job submission.

WP1 WMS release 2: status and open issues Massimo Sgaravatto INFN Padova.

HTCondor’s Grid Universe Jaime Frey Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison.

Gabi Kliot Computer Sciences Department Technion – Israel Institute of Technology Adding High Availability to Condor Central Manager Adding High Availability.

Condor Project Computer Sciences Department University of Wisconsin-Madison Condor-G: Condor and Grid Computing.

HTCondor-CE. 2 The Open Science Grid OSG is a consortium of software, service and resource providers and researchers, from universities, national laboratories.

UCS D OSG Summer School 2011 Life of an OSG job OSG Summer School A peek behind the scenes The life of an OSG job by Igor Sfiligoi University of.

Module Overview Installing and Configuring a Network Policy Server

Primer for Site Debugging

The Condor JobRouter.

Condor-G Making Condor Grid Enabled

Condor-G: An Update.

Presentation transcript:

Condor Project Computer Sciences Department University of Wisconsin-Madison Condor-G Operations

Previously Covered › Some topics from the vanilla Condor operations talk apply to Condor-G  Configuration files  Log files  Command-line tools  Job policy expressions  Where to get more help

HELD Status › Jobs will be held when Condor-G needs help with an error › On release, Condor-G will retry › The reason for the hold will be saved in the job ad and user log

Hold Reason › condor_q –held jfrey 2/13 13:58 CREAM_Delegate Error: Received NULL fault; › cat job.log 012 ( ) 02/13 13:58:38 Job was held. CREAM_Delegate Error: Received NULL fault; the error is due to another cause… › condor_q –format ‘%s\n’ HoldReason CREAM_Delegate Error: Received NULL fault; the error is due to another cause…

Common Errors › Authentication  Hold reason may be misleading  User may not be authorized by CE  Condor-G may not have access to all Certificate Authority files  User’s proxy may have expired

Common Errors › CE no longer knows about job  CE admin may forcibly remove job files  Condor-G is obsessive about not leaving orphaned jobs  May need to take extra steps to convince Condor-G that remote job is gone

Nonessential Jobs › Jobs can be marked nonessential in the submit file  +nonessential = true › This makes Condor-G more willing to leave orphaned jobs and files on the CE › Use with caution

More Detail on Errors › More details on errors can be found in the gridmanager log › You’ll probably want to increase the debug level and log file size  GRIDMANAGER_DEBUG = D_FULLDEBUG  MAX_GRIDMANAGER_LOG =

Machines Down › If a remote server is down, Condor-G will wait for it to come back up › The time it went down is kept in the job ad  GridResourceUnavailableTime = › And in the user log 026 ( ) 02/13 14:20:39 Detected Down Grid Resource GridResource: gt2 chopin.cs.wisc.edu/jobmanager-fork

Throttles and Timeouts › Limits that prevent Condor-G or CEs from being overwhelmed by large numbers of jobs › Defaults are fairly conservative

Throttles and Timeouts › GRIDMANAGER_MAX_SUBMITTED_JOBS_PER _RESOURCE = 1000  You can increase to 10,000 or more › GRIDMANAGER_MAX_JOBMANAGERS_PER_RE SOURCE = 10  GRAM2 only  Default is conservative  Can increase to ~100 if this is the only client

Throttles and Timeouts › GRIDMANAGER_MAX_PENDING_REQUESTS = 50  Number of commands sent to a GAHP in parallel  Can increase to a couple hundred › GRIDMANAGER_GAHP_CALL_TIMEOUT = 300  Time after which a GAHP command is considered failed  May need to lengthen if pending requests is increased

Network Connectivity › Outbound connections only for most job types › GRAM requires incoming connections  Need 2 open ports per pair