Evolution of BOSS, a tool for job submission and tracking
W. Bacchi, G. Codispoti, C. Grandi (INFN Bologna); D. Colling, B. MacEvoy, S. Wakefield, Y. Zhang (Imperial College London)
Presented by Stuart Wakefield, Imperial College London

Outline
- Introduction to BOSS.
- Previous features and usage.
- New functionality.
- Reengineering of the design.
- Current status and plans.

Introduction
- Batch Object Submission System. See the previous talk at CHEP03 (monitoring track, THET001).
- A tool for batch job submission, real-time monitoring and bookkeeping.
- Interfaced to many schedulers, both local and Grid.
- Uses a relational database for persistency; full logging and bookkeeping information is stored.
- Job commands: submit, kill, query and output retrieval.
- Custom job types can be defined, allowing monitoring specific to the submitted application.

BOSS in CMS computing
- Used in CMS MC production for 4 years.
- The prototype CMS distributed analysis system (GROSS) was based on BOSS, and the newer analysis system also uses BOSS.
- Last year it was decided that the BOSS architecture needed to be redesigned to meet the changing requirements of CMS computing.
[Diagram: production/analysis tools sit on top of BOSS, which provides logging & bookkeeping and monitoring.]

V3.x workflow I
[Diagram: the user's boss submit / boss query / boss kill commands go through BOSS and its DB to the scheduler, which runs the job wrapper on a farm node.]
- The user specifies the job parameters, including:
  – Executable name.
  – Executable type, to turn on customized monitoring.
  – Output files to retrieve (for sites without a shared file system, and for the Grid).
- The user tells BOSS to submit jobs, specifying the scheduler, e.g. PBS, LSF, SGE, Condor, LCG, gLite etc.
- The job consists of the job wrapper, the real-time monitoring service and the user's executable.
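The scheduler abstraction can be pictured as a thin common interface in front of each batch system. The following Python sketch is purely illustrative; the class and method names are assumptions, not the BOSS API:

    # Illustrative sketch, not the BOSS implementation: a common interface
    # in front of several batch systems, so submission code stays
    # scheduler-agnostic.
    import subprocess

    class Scheduler:
        def submit(self, wrapper_script):
            raise NotImplementedError

    class PBSScheduler(Scheduler):
        def submit(self, wrapper_script):
            # It is the job wrapper, not the bare user executable, that is
            # queued, so monitoring starts together with the job.
            result = subprocess.run(["qsub", wrapper_script],
                                    capture_output=True, text=True, check=True)
            return result.stdout.strip()  # scheduler-assigned job identifier

    SCHEDULERS = {"pbs": PBSScheduler()}  # lsf, sge, condor, lcg, glite, ...

    def submit(scheduler_name, wrapper_script):
        return SCHEDULERS[scheduler_name].submit(wrapper_script)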

V3.x workflow II
- Once running, the wrapper starts the real-time monitoring services and the user's executable.
- It writes all logging information (start time, finish time, exit code etc.) to a local journal file.
- The monitoring services parse the job output looking for regular expressions specified by the job type.
- Monitoring info is saved to the journal file and returned to the user via a database connection to the BOSS DB, or via R-GMA (if possible).

Example user job for the "test" job type:

    #!/usr/bin/perl
    $i = 0;
    while ($i < 3) {
        sleep(1);
        $i++;
        print "counter $i\n";
    }

Filter for the "test" job type, extracting the counter value from the job output:

    #!/usr/bin/perl
    while (<>) {
        if ($_ =~ /.*counter\s+(\d+).*/) {
            print "COUNTER=$1\n";
        }
    }

[Diagram: the job output (counter 1, counter 2, counter 3) is passed through the filter, producing COUNTER=1, COUNTER=2, COUNTER=3; the BOSS jobExecutor writes these to the journal file and the BOSS dbUpdator propagates them to the JOBID/COUNTER table in the BOSS DB.]
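A minimal sketch of the journalling idea in Python (function and record names are assumptions, not BOSS internals): the wrapper appends key=value records to a local file so that bookkeeping survives even if the connection back to the BOSS DB is unavailable:

    # Sketch only, with assumed record names: run the user executable, scan
    # its output with a job-type filter pattern, and journal everything
    # locally.
    import re, subprocess, time

    def run_with_journal(cmd, journal_path, pattern=r"counter\s+(\d+)"):
        with open(journal_path, "a") as journal:
            journal.write("START=%f\n" % time.time())
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
            for line in proc.stdout:
                print(line, end="")            # pass the output through
                match = re.search(pattern, line)
                if match:                      # job-type specific info
                    journal.write("COUNTER=%s\n" % match.group(1))
            proc.wait()
            journal.write("EXIT_CODE=%d\n" % proc.returncode)
            journal.write("STOP=%f\n" % time.time())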

V3.x workflow III
- Using BOSS, the user can get the status of jobs, pulling in information from the BOSS DB, the scheduler and the real-time monitoring DB.
- When the job finishes, the output is automatically stored at its final destination where possible (e.g. a shared file system on a local cluster); where not (e.g. LCG), the output must be fetched with a separate BOSS command.
- If real-time monitoring is not available (e.g. because of a firewall), the BOSS DB can be updated from the journal file.

Example query:

    % boss q -all -specific -type test
    [Table: one row per job with ID, submitting user (grandi), executable (test.pl), status (E = ended, R = running), execution host (pccms10.bo), start/stop times and the job-specific COUNTER value.]
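The journal-file fallback could look like the following sketch (the table layout and names are assumptions, not the real BOSS schema):

    # Sketch with an assumed schema: replay a journal file into a local
    # SQLite bookkeeping DB when real-time updates were blocked, e.g. by a
    # firewall.
    import sqlite3

    def replay_journal(journal_path, db_path, job_id):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS journal "
                    "(job_id INTEGER, key TEXT, value TEXT)")
        with open(journal_path) as journal:
            for line in journal:
                key, _, value = line.strip().partition("=")
                con.execute("INSERT INTO journal VALUES (?, ?, ?)",
                            (job_id, key, value))
        con.commit()
        con.close()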

Proposed changes
Following experience with the CMS MC production and distributed analysis systems, it was decided to re-engineer BOSS:
- Provide C++ and Python APIs (via SWIG) to allow higher-level tools to steer BOSS.
- Introduce the concepts of task, chain and program (see the sketch below):
  – A program is the user's executable.
  – A chain is an arbitrarily complex set of different programs run on the same worker node.
  – A task is a group of homogeneous jobs that may be executed in parallel.
- Move to XML task descriptions in order to describe the new task hierarchy.
- Separate bookkeeping from real-time monitoring.
- Improve real-time monitoring, but leave it optional; allow multiple real-time monitoring mechanisms.
- Allow pluggable chaining tools, e.g. ShReek (CHEP06 id 276).
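A rough sketch of the new hierarchy in Python (the class names mirror the concepts above, not the actual BOSS classes):

    # Sketch of the task/chain/program hierarchy; purely illustrative.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Program:                      # one user executable and its I/O
        executable: str
        args: str = ""
        program_type: str = "default"   # selects job-specific monitoring

    @dataclass
    class Chain:                        # programs run on the same worker node
        programs: List[Program] = field(default_factory=list)

    @dataclass
    class Task:                         # homogeneous chains, run in parallel
        chains: List[Chain] = field(default_factory=list)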

Logging and Monitoring
- Separate the user's logging DB from the (optional) monitoring DB.
- Only allow access to the logging DB via the BOSS tools, i.e. remove all server requirements (this allows a personal DB implementation in SQLite on a local disk).
- Fill the logging database, using the BOSS tools, from the information in the monitoring DB and the journal file retrieved at the end of the job.
- The real-time server is updated by an updater on the worker node; the transport mechanism possibly utilizes a proxy server.
- Possible implementations of the real-time update mechanism include R-GMA, MonALISA etc.; different RT mechanisms may be allowed for each job.
- Information in the monitoring database expires.
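The expiry of monitoring information could be as simple as a periodic cleanup; a sketch with an assumed table layout:

    # Sketch (assumed schema): monitoring records are transient, unlike the
    # permanent logging DB, so anything older than a time-to-live is purged.
    import sqlite3, time

    def expire_monitoring(db_path, ttl_seconds):
        con = sqlite3.connect(db_path)
        cutoff = time.time() - ttl_seconds
        con.execute("DELETE FROM monitoring WHERE timestamp < ?", (cutoff,))
        con.commit()
        con.close()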

New data flow
[Diagram not transcribed.]

New job wrapper
- The job wrapper starts the chainer and monitoring modules.
- The job chainer launches each executable separately within its own environment.
- The job wrapper provides two levels of monitoring, job level and executable level:
  – Job-level monitoring covers overall variables such as total time, total memory usage etc.
  – Executable-level monitoring follows each executable's progress and journals it.
- Future plans include allowing action to be taken when certain conditions are met, e.g. running out of memory or detecting infinite loops.
[Diagram: the JobExecuter (wrapper) runs the JobMonitor (real-time updater) and JobChaining; each ProgramExecuter runs a user executable with stdin/stdout/stderr passed through pre-, runtime- and post-filters, journalling to the chain journal.]
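The two monitoring levels can be sketched as follows (names assumed; this is not the BOSS wrapper itself):

    # Sketch: executable-level records per program, job-level totals for the
    # whole chain. resource.getrusage aggregates over all finished children.
    import resource, subprocess, time

    def run_chain(commands):
        job_start = time.time()
        for cmd in commands:
            t0 = time.time()
            rc = subprocess.call(cmd)          # executable-level monitoring
            print("%s: exit=%d wall=%.1fs" % (cmd[0], rc, time.time() - t0))
        usage = resource.getrusage(resource.RUSAGE_CHILDREN)
        print("job: wall=%.1fs maxrss=%dkB" %  # job-level monitoring
              (time.time() - job_start, usage.ru_maxrss))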

Sample Task specification

    <program exec="test.pl"
             args="ITR"
             stdin="in"
             stdout="out_ITR"
             stderr="err_ITR"
             program_type="test"
             infiles="Examples/test.pl,Examples/in"
             outfiles="out_ITR,err_ITR"
             outtopdir="" />

Example of a task containing 100 chains, each consisting of one program; ITR is the per-chain iterator placeholder. Program-specific monitoring is activated, with results returned via a MySQL connection.
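A sketch of how such an iterator could be expanded into per-chain program descriptions (an illustration only, not the BOSS XML parser):

    # Sketch: substitute the ITR placeholder to generate one <program>
    # element per chain.
    import xml.etree.ElementTree as ET

    TEMPLATE = ('<program exec="test.pl" args="ITR" '
                'stdout="out_ITR" stderr="err_ITR" />')

    def expand(template, n_chains):
        for i in range(1, n_chains + 1):
            yield ET.fromstring(template.replace("ITR", str(i)))

    for program in expand(TEMPLATE, 3):
        print(program.attrib["args"], program.attrib["stdout"])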

Status and plans
Significant new functionality has been identified and is being actively integrated into BOSS.
The latest release, v3.6, includes much of the new functionality:
– Tasks, jobs and executables.
– XML task description.
– C++ and Python APIs.
– Basic executable chaining; currently only the default chainer with linear chaining.
– Separate logging and monitoring DBs.
– DBs implemented in either MySQL or SQLite (more to come).
– Optional RT monitoring with multiple implementations; currently only MonALISA and direct MySQL connections (to be deprecated).
Still to be done:
– Allow chainer plugins.
– Implement more RT monitoring solutions, e.g. R-GMA.
– Finalize the API.
– Look at writing the wrapper in a scripting language, e.g. Perl/Python.