Condor Week 2008 - Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi1 Condor Week 2008 Pseudo-interactive monitoring in Condor by Igor Sfiligoi.

Slides:



Advertisements
Similar presentations
© 2007 IBM Corporation IBM Global Engineering Solutions IBM Blue Gene/P Job Submission.
Advertisements

Parasol Architecture A mild case of scary asynchronous system stuff.
Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.
Southgreen HPC system Concepts Cluster : compute farm i.e. a collection of compute servers that can be shared and accessed through a single “portal”
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
Operating Systems Concepts Professor Rick Han Department of Computer Science University of Colorado at Boulder.
Chapter 8 Operating System Support
K.Harrison CERN, 23rd October 2002 HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION - Overview of LHCb distributed production system - Configuration.
First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova
Derek Wright Computer Sciences Department, UW-Madison Lawrence Berkeley National Labs (LBNL)
Minerva Infrastructure Meeting – October 04, 2011.
DIRAC API DIRAC Project. Overview  DIRAC API  Why APIs are important?  Why advanced users prefer APIs?  How it is done?  What is local mode what.
Thomas Finnern Evaluation of a new Grid Engine Monitoring and Reporting Setup.
Bigben Pittsburgh Supercomputing Center J. Ray Scott
BaBar MC production BaBar MC production software VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question:
Guide to Linux Installation and Administration, 2e1 Chapter 10 Managing System Resources.
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
Building a Real Workflow Thursday morning, 9:00 am Greg Thain University of Wisconsin - Madison.
Turning science problems into HTC jobs Wednesday, July 29, 2011 Zach Miller Condor Team University of Wisconsin-Madison.
Report from USA Massimo Sgaravatto INFN Padova. Introduction Workload management system for productions Monte Carlo productions, data reconstructions.
Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  Concept of the Project  System architecture  Implementation – HDFS  Implementation – System.
Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.
Pilot Factory using Schedd Glidein Barnett Chiu BNL
Peter Couvares Associate Researcher, Condor Team Computer Sciences Department University of Wisconsin-Madison
Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  System architecture  Implementation – HDFS  Implementation – System Analysis ◦ System Information.
Parag Mhashilkar (Fermi National Accelerator Laboratory)
CMS Multicore jobs at RAL Andrew Lahiff, RAL WLCG Multicore TF Meeting 1 st July 2014.
Introduction to Distributed HTC and overlay systems Tuesday morning, 9:00am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California.
OSG Consortium Meeting - March 6th 2007Evaluation of WMS for OSG - by I. Sfiligoi1 OSG Consortium Meeting Evaluation of Workload Management Systems for.
UCS D OSG Summer School 2011 Overlay systems OSG Summer School An introduction to Overlay systems Also known as Pilot systems by Igor Sfiligoi University.
UCS D OSG School 11 Grids vs Clouds OSG Summer School Comparing Grids to Clouds by Igor Sfiligoi University of California San Diego.
Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.
Honolulu - Oct 31st, 2007 Using Glideins to Maximize Scientific Output 1 IEEE NSS 2007 Making Science in the Grid World - Using Glideins to Maximize Scientific.
Welcome to Indiana University Clusters
RoboVis Alistair Wick.
Development Environment
Condor DAGMan: Managing Job Dependencies with Condor
A Grid Job Monitoring System
Unix Scripts and PBS on BioU
Welcome to Indiana University Clusters
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Workload Management System
Work report Xianghu Zhao Nov 11, 2014.
HTCondor and LSST Stephen Pietrowicz Senior Research Programmer National Center for Supercomputing Applications HTCondor Week May 2-5, 2017.
Glidein Factory Operations
High Availability in HTCondor
CREAM-CE/HTCondor site
Experience with jemalloc
Introduction to Operating System (OS)
FCT Follow-up Meeting 31 March, 2017 Fernando Meireles
Skill Based Assessment
Skill Based Assessment
Skill Based Assessment
Troubleshooting Your Jobs
Accounting, Group Quotas, and User Priorities
Open Source Toolkit for Turn-Key AI Cluster (Introduction)
Compiling and Job Submission
Processor Management Damian Gordon.
Initial job submission and monitoring efforts with JClarens
CPU scheduling decisions may take place when a process:
HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.
Condor: Firewall Mirroring
Late Materialization has (lately) materialized
Processor Management Damian Gordon.
Production client status
Troubleshooting Your Jobs
PU. Setting up parallel universe in your pool and when (not
Putting your users in a Box
Presentation transcript:

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi1 Condor Week 2008 Pseudo-interactive monitoring in Condor by Igor Sfiligoi

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi2 What is the problem ● Users submit vanilla jobs that run for hours ● Users want to know what is the status of the jobs – Are they using CPU? – Are they producing the expected files? – Any error messages in the log files? ● Especially when experimenting with new code, waiting for the jobs to finish may waste a lot of time and resources

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi3 What can be done about it? ● Waiting for jobs to finish – always an option, but inefficient ● Use the standard universe – may be tricky, not all apps can run there ● Implement application specific solution – lots of work! ● Let Condor do the monitoring – How????

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi4 Monitoring with Condor ● Before looking for a solution, define the problem ● What are the actions we are interested in? – Get the process list and the CPU usage – Get the content of the job working directory – Get the content of a job file ● They are all batch type actions! – We just need to run them on the same worker node

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi5 Use multiple slots x CPU ● Condor_startd can have multiple slots x CPU – Use one for the jobs, and one for the monitoring Worker Node Startd Starter.job User job Starter.monitor

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi6 Use multiple slots x CPU ● Condor_startd can have multiple slots x CPU – Use one for the jobs, and one for the monitoring Worker Node Startd Starter.job User job Starter.monitor Monitor job Monitor job Monitor output

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi7 Startd configuration ● Enable multiple slots NUM_CPUS = 2 VIRTUAL_MACHINE_TYPE_1 = cpus=1, memory=1%, swap=1%, disk=1% NUM_VIRTUAL_MACHINES_TYPE_1 = 1 VIRTUAL_MACHINE_TYPE_2 = cpus=1, memory=99%, swap=99%, disk=99% NUM_VIRTUAL_MACHINES_TYPE_2 = 1 ● Enable cross_slot information flow STARTD_RESOURCE_PREFIX = vm STARTD_VM_EXPRS = State, RemoteUser ● Config one slot for monitoring and one for jobs VM1_VM2_MATCH = (vm2_State=?="Claimed")&&(vm2_RemoteUser=?=User) VM1_START_CONDITION = $(VM1_VM2_MATCH) && (JOB_Is_Monitor) VM2_START_CONDITION = START = ((VirtualMachineID == 1) && ($(VM1_START_CONDITION))) || ((VirtualMachineID == 2) && ($(VM2_START_CONDITION))) Single CPU assumed here Can be done for multiple CPUs, too

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi8 Monitoring job ● Could be a simple shell script #!/bin/sh ps -fu `id -n -u` ● Determine which node the job is running on – Use condor_q ● Submit the monitoring job with +JOB_Is_Monitor=True ”) You can get a demo tool that does this for you at:

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi9 Does it work? ● Yes – Used in glideinWMS for the past few yearsglideinWMS ● Drawbacks: – It takes a negotiation cycle to get the results – Cross slot information updated only every few minutes ● May not be able to monitor a job for the first few minutes

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi10 Conclusions ● Users need job monitoring ● Condor out of the box does not provide an easy tool to monitor vanilla jobs ● Using multiple slots can solve the problem – Needs an additional tool to make it easy – Get a generic demo at