Condor Week 2008 - Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi1 Condor Week 2008 Pseudo-interactive monitoring in Condor by Igor Sfiligoi.

Slides:

Advertisements

Similar presentations

© 2007 IBM Corporation IBM Global Engineering Solutions IBM Blue Gene/P Job Submission.

Advertisements

Parasol Architecture A mild case of scary asynchronous system stuff.

Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.

Southgreen HPC system Concepts Cluster : compute farm i.e. a collection of compute servers that can be shared and accessed through a single “portal”

Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.

Operating Systems Concepts Professor Rick Han Department of Computer Science University of Colorado at Boulder.

Chapter 8 Operating System Support

K.Harrison CERN, 23rd October 2002 HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION - Overview of LHCb distributed production system - Configuration.

First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova

Derek Wright Computer Sciences Department, UW-Madison Lawrence Berkeley National Labs (LBNL)

Minerva Infrastructure Meeting – October 04, 2011.

DIRAC API DIRAC Project. Overview  DIRAC API  Why APIs are important?  Why advanced users prefer APIs?  How it is done?  What is local mode what.

Thomas Finnern Evaluation of a new Grid Engine Monitoring and Reporting Setup.

Bigben Pittsburgh Supercomputing Center J. Ray Scott

BaBar MC production BaBar MC production software VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question:

Guide to Linux Installation and Administration, 2e1 Chapter 10 Managing System Resources.

Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April

Building a Real Workflow Thursday morning, 9:00 am Greg Thain University of Wisconsin - Madison.

Turning science problems into HTC jobs Wednesday, July 29, 2011 Zach Miller Condor Team University of Wisconsin-Madison.

Report from USA Massimo Sgaravatto INFN Padova. Introduction Workload management system for productions Monte Carlo productions, data reconstructions.

Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  Concept of the Project  System architecture  Implementation – HDFS  Implementation – System.

Condor Week 2004 The use of Condor at the CDF Analysis Farm Presented by Sfiligoi Igor on behalf of the CAF group.

Pilot Factory using Schedd Glidein Barnett Chiu BNL

Peter Couvares Associate Researcher, Condor Team Computer Sciences Department University of Wisconsin-Madison

Youngil Kim Awalin Sopan Sonia Ng Zeng.  Introduction  System architecture  Implementation – HDFS  Implementation – System Analysis ◦ System Information.

Parag Mhashilkar (Fermi National Accelerator Laboratory)

CMS Multicore jobs at RAL Andrew Lahiff, RAL WLCG Multicore TF Meeting 1 st July 2014.

Introduction to Distributed HTC and overlay systems Tuesday morning, 9:00am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California.

OSG Consortium Meeting - March 6th 2007Evaluation of WMS for OSG - by I. Sfiligoi1 OSG Consortium Meeting Evaluation of Workload Management Systems for.

UCS D OSG Summer School 2011 Overlay systems OSG Summer School An introduction to Overlay systems Also known as Pilot systems by Igor Sfiligoi University.

UCS D OSG School 11 Grids vs Clouds OSG Summer School Comparing Grids to Clouds by Igor Sfiligoi University of California San Diego.

Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.

Honolulu - Oct 31st, 2007 Using Glideins to Maximize Scientific Output 1 IEEE NSS 2007 Making Science in the Grid World - Using Glideins to Maximize Scientific.

Welcome to Indiana University Clusters

RoboVis Alistair Wick.

Development Environment

Condor DAGMan: Managing Job Dependencies with Condor

A Grid Job Monitoring System

Unix Scripts and PBS on BioU

Welcome to Indiana University Clusters

Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)

Workload Management System

Work report Xianghu Zhao Nov 11, 2014.

HTCondor and LSST Stephen Pietrowicz Senior Research Programmer National Center for Supercomputing Applications HTCondor Week May 2-5, 2017.

Glidein Factory Operations

High Availability in HTCondor

CREAM-CE/HTCondor site

Experience with jemalloc

Introduction to Operating System (OS)

FCT Follow-up Meeting 31 March, 2017 Fernando Meireles

Skill Based Assessment

Skill Based Assessment

Skill Based Assessment

Troubleshooting Your Jobs

Accounting, Group Quotas, and User Priorities

Open Source Toolkit for Turn-Key AI Cluster (Introduction)

Compiling and Job Submission

Processor Management Damian Gordon.

Initial job submission and monitoring efforts with JClarens

CPU scheduling decisions may take place when a process:

HTCondor Training Florentia Protopsalti IT-CM-IS 1/16/2019.

Condor: Firewall Mirroring

Late Materialization has (lately) materialized

Processor Management Damian Gordon.

Production client status

Troubleshooting Your Jobs

PU. Setting up parallel universe in your pool and when (not

Putting your users in a Box

Presentation transcript:

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi1 Condor Week 2008 Pseudo-interactive monitoring in Condor by Igor Sfiligoi

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi2 What is the problem ● Users submit vanilla jobs that run for hours ● Users want to know what is the status of the jobs – Are they using CPU? – Are they producing the expected files? – Any error messages in the log files? ● Especially when experimenting with new code, waiting for the jobs to finish may waste a lot of time and resources

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi3 What can be done about it? ● Waiting for jobs to finish – always an option, but inefficient ● Use the standard universe – may be tricky, not all apps can run there ● Implement application specific solution – lots of work! ● Let Condor do the monitoring – How????

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi4 Monitoring with Condor ● Before looking for a solution, define the problem ● What are the actions we are interested in? – Get the process list and the CPU usage – Get the content of the job working directory – Get the content of a job file ● They are all batch type actions! – We just need to run them on the same worker node

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi5 Use multiple slots x CPU ● Condor_startd can have multiple slots x CPU – Use one for the jobs, and one for the monitoring Worker Node Startd Starter.job User job Starter.monitor

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi6 Use multiple slots x CPU ● Condor_startd can have multiple slots x CPU – Use one for the jobs, and one for the monitoring Worker Node Startd Starter.job User job Starter.monitor Monitor job Monitor job Monitor output

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi7 Startd configuration ● Enable multiple slots NUM_CPUS = 2 VIRTUAL_MACHINE_TYPE_1 = cpus=1, memory=1%, swap=1%, disk=1% NUM_VIRTUAL_MACHINES_TYPE_1 = 1 VIRTUAL_MACHINE_TYPE_2 = cpus=1, memory=99%, swap=99%, disk=99% NUM_VIRTUAL_MACHINES_TYPE_2 = 1 ● Enable cross_slot information flow STARTD_RESOURCE_PREFIX = vm STARTD_VM_EXPRS = State, RemoteUser ● Config one slot for monitoring and one for jobs VM1_VM2_MATCH = (vm2_State=?="Claimed")&&(vm2_RemoteUser=?=User) VM1_START_CONDITION = $(VM1_VM2_MATCH) && (JOB_Is_Monitor) VM2_START_CONDITION = START = ((VirtualMachineID == 1) && ($(VM1_START_CONDITION))) || ((VirtualMachineID == 2) && ($(VM2_START_CONDITION))) Single CPU assumed here Can be done for multiple CPUs, too

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi8 Monitoring job ● Could be a simple shell script #!/bin/sh ps -fu `id -n -u` ● Determine which node the job is running on – Use condor_q ● Submit the monitoring job with +JOB_Is_Monitor=True ”) You can get a demo tool that does this for you at:

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi9 Does it work? ● Yes – Used in glideinWMS for the past few yearsglideinWMS ● Drawbacks: – It takes a negotiation cycle to get the results – Cross slot information updated only every few minutes ● May not be able to monitor a job for the first few minutes

Condor Week Apr 30, 2008Pseudo Interactive monitoring - I. Sfiligoi10 Conclusions ● Users need job monitoring ● Condor out of the box does not provide an easy tool to monitor vanilla jobs ● Using multiple slots can solve the problem – Needs an additional tool to make it easy – Get a generic demo at