User Experience in using CRAB and the LPC CAF Suvadeep Bose TIFR/LPC US CMS 2008 Run Plan Workshop May 15, 2008.

Outline
 Learning to use CRAB
 Learning to use LPC CAF (Condor)
 Useful twiki pages
 Possible mistakes with CRAB
 Lessons learnt
 Some comments
USCMS RunPlan Workshop / Suvadeep Bose

How did I learn CRAB?

The workshops/tutorials that helped me
 A tutorial by Oliver Gutsche (26 June, 2007)
 US CMS First Physics Workshop (11 Oct - 13 Oct, 2007) by Eric Vaandering
 "CRAB Tutorial - analysis of starter kit, crab, condor, storage" in US CMS JTerm II (Jan '08)

A typical script for CRAB

[CRAB]
jobtype = cmssw
scheduler = glitecoll

[CMSSW]
datasetpath = /CSA07AllEvents/CMSSW_1_6_9-FastSim /AODSIM
pset = analysis.cfg
total_number_of_events = -1
events_per_job =
output_file = rootfile.root

[USER]
return_data = 1
use_central_bossDB = 0
use_boss_rt = 0

[EDG]
rb = CERN
proxy_server = myproxy.cern.ch
virtual_organization = cms
retry_count = 0
lcg_catalog_type = lfc
lfc_host = lfc-cms-test.cern.ch
lfc_home = /grid/cms

If one wants to be sure of getting the output rootfile, it is better to use a storage path

[CRAB]
jobtype = cmssw
scheduler = condor_g

[CMSSW]
datasetpath = /CSA07AllEvents/CMSSW_1_6_9-FastSim /AODSIM
pset = analysis.cfg
total_number_of_events = -1
events_per_job =
output_file = rootfile.root, test.txt

[USER]
return_data = 0
copy_data = 1
storage_element = cmssrm.fnal.gov
storage_path = /srm/managerv1?SFN=/resilient/username/subdir
use_central_bossDB = 0
use_boss_rt = 1

[EDG]
lcg_version = 2
rb = CERN
proxy_server = myproxy.cern.ch
additional_jdl_parameters = AllowZippedISB = false;
se_white_list = cmssrm.fnal.gov
ce_white_list = cmsosgce2.fnal.gov
virtual_organization = cms
retry_count = 2
lcg_catalog_type = lfc
lfc_host = lfc-cms-test.cern.ch
lfc_home = /grid/cms

Before submitting, create the storage directory and make it group-writable:
mkdir /pnfs/cms/WAX/resilient/username/subdir
chmod +775 /pnfs/cms/WAX/resilient/username/subdir
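The mkdir/chmod preparation step above can be sketched in Python as well. This is a minimal illustrative helper, not a CRAB tool: the function name is made up, and it assumes the resilient pnfs area is mounted as an ordinary directory on the node where it runs.

```python
import os
import stat

def prepare_storage_dir(base, username, subdir):
    """Create the stage-out directory and make it group-writable,
    mirroring the mkdir + chmod 775 step from the slide."""
    path = os.path.join(base, username, subdir)
    os.makedirs(path, exist_ok=True)
    # 775: owner/group rwx, world r-x -- group write is what
    # CRAB needs to be able to copy the output rootfiles in
    os.chmod(path, 0o775)
    return path
```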

Possible mistakes with CRAB
 The dataset is not available through the scheduler chosen.
 After declaring a storage path, if one forgets to perform chmod +775 on that directory prior to creating the crab jobs, then crab fails to write the output.
 The output file name in the crab config file does not match the one in the analysis config file.
 If the output file is too big and the user does not specify a storage path, then retrieving the output rootfiles through the output sandbox (by doing getoutput) gets the job aborted because of the file size.
 If the farms used by the scheduler chosen in the crab config file do not have the user's CMSSW version checked out, then the job aborts.
 The concept of Done (success) is not very clear. Many times a job is shown as Done (success), but on doing a getoutput one sees no output!
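Some of these mistakes can be caught before submission by inspecting the config. A minimal sketch (the checks and function name are illustrative, not part of CRAB) that reads the INI-style crab config with Python's configparser:

```python
from configparser import ConfigParser

def check_crab_cfg(path):
    """Flag two of the common mistakes listed above:
    stage-out enabled without a storage_path declared, and
    sandbox-only retrieval, which aborts jobs with big outputs."""
    cfg = ConfigParser()
    cfg.read(path)
    user = cfg["USER"] if cfg.has_section("USER") else {}
    problems = []
    if user.get("copy_data") == "1" and not user.get("storage_path"):
        problems.append("copy_data=1 but no storage_path declared")
    if user.get("return_data") == "1" and user.get("copy_data") != "1":
        problems.append("output comes back via the sandbox only: "
                        "jobs abort if the rootfile is too big")
    return problems
```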

Lessons learnt
 The user must run interactively on small samples in the local environment to develop and test the analysis code. Once ready, the user selects a large (whole) sample and submits the very same code to analyse many more events. Submitting a crab job, then realising the mistake and killing it, minimises the job's priority in the cluster.
 CRAB has (currently) two ways of handling output:
 - the output sandbox
 - copying files to a dedicated storage element (SE)
 The input and output sandboxes are limited in size:
 - Input Sandbox: 10 MB
 - Output Sandbox: 50 MB - TRUNCATED if it exceeds 50 MB - corrupt files
 Rule of thumb: if you would like to get CMSSW ROOT files back, please use a storage element. Recommended for large outputs. (The standard output and error from CRAB still come through the sandbox.)
 Even if your job fails, run crab -getoutput. Otherwise your output clogs up the server.
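The rule of thumb above boils down to a single size check. A toy sketch (the helper name is illustrative; the 50 MB sandbox limit is the figure quoted on the slide):

```python
OUTPUT_SANDBOX_LIMIT_MB = 50  # CRAB truncates sandbox output beyond this

def output_mode(expected_output_mb):
    """Pick the output-handling mode per the rule of thumb:
    small outputs can come back in the sandbox (return_data = 1),
    large ones need a storage element (copy_data = 1 + storage_path)
    or they come back truncated and corrupt."""
    if expected_output_mb < OUTPUT_SANDBOX_LIMIT_MB:
        return "sandbox"
    return "storage_element"
```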

Useful links
 Grid Analysis Job Diagnosis Template twiki page
 The best source of user support is the CRAB feedback hypernews.

Can we have a simpler but more elaborate error table? Help!

Using LPC CAF (Condor)
Main source of information:
 One needs to write scripts (shell script / python) that produce the condor executables. This is very much unlike CRAB, which splits user jobs into manageable pieces, transports the user's analysis code to the data location, and executes the jobs there.
 With Condor jobs the user cannot modify the config file after submitting until he/she gets the output, whereas with CRAB the whole thing gets bundled up into a package, so once the crab job is submitted one may modify the config file.
Issues with Condor that I faced:
 At times jobs stay Idle for long. Is there any way to give them higher priority?
 When jobs do NOT exit normally, one needs to look into the Output / Error files. Mostly it is easily understood.
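The job-splitting scripts mentioned above can be quite small. A minimal sketch that generates a Condor submit description for N jobs: the `$(Process)` macro and the submit-file keywords are standard Condor, while the file names and the executable are illustrative.

```python
def make_submit_file(executable, njobs, outdir="logs"):
    """Build a Condor submit description that queues njobs instances.
    Each job receives its index (0..njobs-1) as an argument, so the
    executable can pick its own slice of the input -- the splitting
    that CRAB would otherwise do automatically."""
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        "arguments  = $(Process)",
        f"output     = {outdir}/job_$(Process).out",
        f"error      = {outdir}/job_$(Process).err",
        f"log        = {outdir}/job.log",
        f"queue {njobs}",
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to, say, `analysis.jdl` and running `condor_submit analysis.jdl` queues all the jobs in one go.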

Comments - I
 CERN has options for submitting small jobs to a short queue; one can state the expected duration of the jobs at submission time. Presently the LPC CAF has no such system. There does exist a command, LENGTH = "short", for jobs that are expected to run for less than one hour, but it simply increases the job's priority, and only if nodes in the LPC CAF cluster are free; there is no separate queue for short job submission. Hence, to run short jobs, a user has to run interactively on one of the twelve cmslpc machines, consuming the available CPU time of that particular node.
 For job submission in Condor one needs to create the different jobs manually, which CRAB does automatically. This is important for massive job running.
 The LPC condor farm has fewer worker nodes than the production farm. So a job submitted by CRAB with scheduler condor_g gets more nodes to run on than the same job submitted directly by condor_submit, but the user then competes with the production jobs. If mass production is going on (e.g. CSA exercises), a common user gets lower priority and jobs stand idle. One needs to take these things into account before choosing the mode of submission.

Comments - II
 If a dataset sits at a remote site and the user needs it for analysis, then even to test the analysis config file he needs to submit a crab job and wait for the output. Can a user request small samples through FedEx?
 At times, once a crab job is aborted, the error message is not easy to understand. Can we have a more comprehensible error message?
 At times crab status reported "waiting", which is not among the three status messages listed on the CRAB twiki page: Scheduled, Running and Done. In such cases I killed the job and resubmitted, but what should one ideally do?