User Experience in using CRAB and the LPC CAF Suvadeep Bose TIFR/LPC CMS101++ June 20, 2008.



Outline
- Learning to use CRAB
- Learning to use LPC CAF (Condor)
- Useful twiki pages
- Possible mistakes with CRAB
- Lessons learnt
- Some comments

How did I learn CRAB?

Useful workshops/tutorials
- A tutorial by Oliver Gutsche (26 June, 2007)
- US CMS First Physics Workshop (11 Oct – 13 Oct, 2007) by Eric Vaandering
- “CRAB Tutorial - analysis of starter kit, crab, condor, storage” in US CMS JTerm II (Jan ’08)

Using CRAB – step by step
Initial preparation: this talk starts from here.

A typical script for CRAB:

[CRAB]
jobtype = cmssw
scheduler = glitecoll

[CMSSW]
datasetpath = /CSA07AllEvents/CMSSW_1_6_9-FastSim/AODSIM
pset = analysis.cfg
total_number_of_events = -1
events_per_job =
output_file = rootfile.root

[USER]
return_data = 1
use_central_bossDB = 0
use_boss_rt = 0

[EDG]
rb = CERN
proxy_server = myproxy.cern.ch
virtual_organization = cms
retry_count = 0
lcg_catalog_type = lfc
lfc_host = lfc-cms-test.cern.ch
lfc_home = /grid/cms
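The total_number_of_events / events_per_job pair drives how CRAB splits the work into jobs. As an illustrative sketch (not CRAB's actual code), the splitting logic amounts to:

```python
def split_jobs(total_events, events_per_job):
    """Illustrative sketch of CRAB-style job splitting (not CRAB's real code).

    Returns a list of (first_event, n_events) pairs, one per job.
    total_events = -1 ("all events") is not handled here; the caller
    would first resolve it to the actual dataset size.
    """
    jobs = []
    first = 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append((first, n))
        first += n
    return jobs

# 1000 events in chunks of 300 -> 4 jobs, the last one shorter
print(split_jobs(1000, 300))
# [(0, 300), (300, 300), (600, 300), (900, 100)]
```

This also explains why the last job in a task often runs faster than the others: it simply gets the remainder.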

To be sure of getting the output rootfile back, it is better to use a storage path:

[CRAB]
jobtype = cmssw
scheduler = condor_g

[CMSSW]
datasetpath = /CSA07AllEvents/CMSSW_1_6_9-FastSim/AODSIM
pset = analysis.cfg
total_number_of_events = -1
events_per_job =
output_file = rootfile.root, test.txt

[USER]
return_data = 0
copy_data = 1
storage_element = cmssrm.fnal.gov
storage_path = /srm/managerv1?SFN=/resilient/username/subdir
use_central_bossDB = 0
use_boss_rt = 1

[EDG]
lcg_version = 2
rb = CERN
proxy_server = myproxy.cern.ch
additional_jdl_parameters = AllowZippedISB = false;
se_white_list = cmssrm.fnal.gov
ce_white_list = cmsosgce2.fnal.gov
virtual_organization = cms
retry_count = 2
lcg_catalog_type = lfc
lfc_host = lfc-cms-test.cern.ch
lfc_home = /grid/cms

Before submitting, create the target directory and make it writable:

mkdir /pnfs/cms/WAX/resilient/username/subdir
chmod +775 /pnfs/cms/WAX/resilient/username/subdir

Possible mistakes with CRAB
- The dataset is not available with the scheduler chosen.
- After declaring a storage path, if one forgets to perform chmod +775 on that directory before creating the crab jobs, crab fails to write the output.
- The output file name in the crab-config file does not match the one in the analysis config file.
- If the output file is too big and the user does not specify a storage path but tries to retrieve the rootfiles from the output sandbox (by performing getoutput), the job gets aborted because of the output file size.
- If the farms chosen by the scheduler do not have the CMSSW version the user is working with, the job aborts.
- Sometimes the runtime of the job exceeds the grid-proxy lifetime, resulting in job failure.
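The last failure mode is simple arithmetic: a job whose expected runtime exceeds the remaining proxy lifetime dies partway through. A minimal pre-submission sanity check might look like the sketch below (illustrative only; the hours would really come from something like voms-proxy-info and the user's own runtime estimate):

```python
def proxy_covers_job(proxy_hours_left, expected_job_hours, safety_margin_hours=1.0):
    """Return True if the grid proxy should outlive the job.

    Illustrative sketch: the lifetime and runtime numbers are
    hypothetical inputs, not queried from any grid service here.
    """
    return proxy_hours_left >= expected_job_hours + safety_margin_hours

# A 10 h job with a 12 h proxy is fine; with an 8 h proxy it is not.
print(proxy_covers_job(12, 10))
print(proxy_covers_job(8, 10))
```

If the check fails, renew the proxy (or request a longer-lived one) before submitting rather than after the job has aborted.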

Lessons learnt
 The user must run interactively on small samples in the local environment to develop and test the analysis code. Once it is ready, the user selects a large (or whole) sample and submits the very same code to analyze many more events. Submitting a crab job, realising a mistake and then killing the job lowers the job's priority in the cluster.
 CRAB currently has two ways of handling output: the output sandbox, or copying files to a dedicated storage element (SE).
 The input and output sandboxes are limited in size: Input Sandbox: 10 MB; Output Sandbox: 50 MB. The output is TRUNCATED if it exceeds 50 MB, leaving corrupt files.
 Rule of thumb: if you would like to get CMSSW ROOT files back, please use a storage element. This is recommended for large outputs. (The standard output and error from CRAB still come through the sandbox.)
 Even if your job fails, run crab -getoutput. Otherwise your output clogs up the server.

Useful links
- Grid Analysis Job Diagnosis Template twiki page
- Best source for user support: the CRAB feedback hypernews
Can we have a simpler but more elaborate error table? Help!

Using LPC CAF (Condor)
Main source of information:
One needs to write scripts (shell script/python) that produce the condor executables. This is very much unlike CRAB, which splits user jobs into manageable pieces and transports the user's analysis code to the data location for execution. With Condor, the user cannot modify the config file after submitting the jobs until he/she gets the output, whereas with CRAB the whole thing gets bundled up into a package, so once the crab job is submitted one may modify the config file.
Issues with Condor that I faced:
◦ At times jobs stand idle for long. Is there any way to give them higher priority?
◦ When jobs do NOT exit normally, one needs to look into the Output/Error. Mostly it is easily understood.
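Since the slide notes that the user has to script this job creation themselves, here is a minimal sketch of such a helper (the executable name, argument convention and log directory are hypothetical); it builds one Condor submit description per job, each of which would then be passed to condor_submit:

```python
def make_submit_file(executable, job_index, log_dir="logs"):
    """Build a minimal Condor submit description for one job.

    Illustrative sketch, not a real LPC CAF recipe: the job index is
    passed as the single argument so the executable can pick its slice
    of the work.
    """
    return "\n".join([
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {job_index}",
        f"output = {log_dir}/job_{job_index}.out",
        f"error  = {log_dir}/job_{job_index}.err",
        f"log    = {log_dir}/job_{job_index}.log",
        "queue",
    ])

# Write one submit file per job; each then goes to condor_submit.
for i in range(3):
    with open(f"job_{i}.jdl", "w") as f:
        f.write(make_submit_file("run_analysis.sh", i))
```

On the idle-jobs question: as far as I know, condor_prio only reorders a user's own jobs relative to each other; it does not let a job jump the fair-share queue ahead of other users.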

Comments - I
 CERN has options for submitting small jobs in the short queue: one can state the expected duration of the jobs at submission time. Presently the LPC CAF has no such system. There does exist a command, LENGTH = “short”, for jobs expected to run for less than one hour, but this simply raises the job's priority, and only if nodes in the LPC CAF cluster are free; there is no separate queue for short job submission. Hence, to run short jobs a user has to run interactively on one of the twelve cmslpc machines, consuming the available CPU time of that particular node.
 For job submission in Condor one needs to create the different jobs manually, which CRAB does automatically. This is important for massive job running. (†)
 The LPC condor farm has fewer worker nodes than the production farm. So a job submitted by CRAB with scheduler condor_g gets more nodes to run on than the same job submitted directly with condor_submit, but then the user competes with the production jobs. If mass production is going on (e.g. CSA exercises), a common user gets lower priority and jobs stand idle. One needs to take these things into account before choosing the mode of submission.

Comments - II
 If a dataset is at a remote site and the user needs it for analysis, then even to test his analysis file he needs to submit a crab job and wait for the output. Can a user request small samples through PhEDEx?
 At times, once a crab job is aborted, the error message is not easy to understand. Can we have a more comprehensible error message?
 At times crab status reported “waiting”, which is not among the three status messages listed in the crab twiki page: Scheduled, Running and Done. In such cases I killed the job and resubmitted. But what should one ideally do?
 While checking the status of a job, the concept of Done (success) is not very clear. Many a time a job is shown as Done (success), but performing a getoutput one sees no output!

Comments - III
(†) Recently we have gained the facility to submit condor jobs through crab (Eric Vaandering). Now one can submit jobs to the cmslpc cluster just like jobs to the Grid, using the same CRAB commands such as crab -create, crab -submit, etc. Options for dealing with the output file:

copy_data = 1
storage_path = /uscms_data/d1/username/
copyCommand = cp

Conclusion
In a nutshell, a lot depends on the user: if the user is careful about the job he/she is going to submit, most of the unwanted failures can be avoided. If the data is sitting here at Fermilab, then using condor_submit is faster than using CRAB in most cases. The twiki pages for CRAB, along with the tutorials, are very useful and answer most of the questions a user comes across. If you have problems running crab, please ask people who have been using crab for a long time. Experience is the key thing here.