First test of the PoC
Caveats
– I am not a developer ;)
– I was also a beta tester of CRAB3+WMA in 2011; I restarted testing it ~2 weeks ago to have a one-to-one comparison
– The first 2 weeks of the PoC test were mainly: finding a problem, communicating with the developers, getting a new version, trying again. I simply skip this part, which is OK; I speak about the results after all the fixes
What I tested (with both)
– A complicated workflow: the official (V)H->bb analysis step1 (see nalysisNewCode#NtupleV42_CMSSW_5_3_3_patch2), which takes ~2 hours just to compile
– Indeed ISB ~45 MB, with 56 user-compiled libraries
– Running on dataset /DoubleElectron/Run2012B-PromptReco-v1/AOD
– 40 LS/job -> ~1200 jobs, a couple of hours each (a back-of-the-envelope check follows below)
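As a back-of-the-envelope check of that job count, here is a minimal sketch (mine, not from the slides) deriving the number of LumiBased jobs from a lumi mask; it assumes Lumi.json is in the standard CMS lumi mask format, i.e. {"run": [[firstLumi, lastLumi], ...]}, and count_lumis is a hypothetical helper:

    import json

    def count_lumis(path):
        # Count the lumi sections covered by a CMS-style lumi mask JSON file.
        with open(path) as f:
            mask = json.load(f)
        return sum(last - first + 1
                   for ranges in mask.values()
                   for first, last in ranges)

    n_lumis = count_lumis('Lumi.json')  # the lumi mask used in the configs below
    units_per_job = 40                  # matches config.Data.unitsPerJob
    n_jobs = (n_lumis + units_per_job - 1) // units_per_job  # ceiling division
    print(n_jobs)  # ~1200 jobs here implies a mask covering ~48000 lumi sections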
Where I tested
– CRAB3/Panda: the test is restricted to a few sites (FNAL, Pisa, DESY, …); the sample is indeed present only at FNAL and Pisa among the PoC sites
– CRAB3/WMA: 8 T2s available, some of poor quality (T2_RU_*)
– Always used Pisa as the storage site
Moreover, the PoC is not expected to provide full CRAB3 functionality, just (as in the list I got):
– Submit
– Resubmit
– Kill
– Status
– Getoutput
– Getlog
So I stick to these also for CRAB3/WMA (i.e. I do not do DBS publication). A sketch of scripting these follows below.
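For concreteness, a minimal sketch (not in the original slides) of driving these operations from a script through the crab CLI; the '-t <task directory>' form is taken from the status examples shown later, while the flags of the other subcommands are assumptions on my side:

    import subprocess

    task = 'crab_request_name2'  # task directory, as in the status example below

    # Loop over the read-only operations; the subcommand names mirror the list
    # above, and the '-t' flag is assumed to work the same way as for 'status'.
    for cmd in ('status', 'getoutput', 'getlog'):
        subprocess.check_call(['crab', cmd, '-t', task])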
Configs

Panda:

    from WMCore.Configuration import Configuration
    import os
    from datetime import datetime

    config = Configuration()

    config.section_("General")
    config.General.serverUrl = 'poc3test.cern.ch'
    config.General.ufccacheUrl = 'cmsweb-testbed.cern.ch'

    config.section_("JobType")
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'patData.py'

    config.section_("Data")
    config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
    config.Data.publishDataName = os.path.basename(os.path.abspath('.')) + "_tom"
    config.Data.lumiMask = 'Lumi.json'
    config.Data.publishDbsUrl = " _writer/servlet/DBSServlet"  # URL truncated in the original
    config.Data.splitting = 'LumiBased'
    config.Data.unitsPerJob = 40

    config.section_("User")
    config.User. = ''  # attribute name and value redacted in the original

    config.section_("Site")
    config.Site.storageSite = 'T2_IT_Pisa'

WMA:

    from WMCore.Configuration import Configuration
    import os

    config = Configuration()

    config.section_("General")
    config.General.requestName = 'request_name2'
    config.General.serverUrl = 'crab3-test.cern.ch'
    config.General.ufccacheUrl = 'cmsweb.cern.ch'

    config.section_("JobType")
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'patData.py'

    config.section_("Data")
    config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
    config.Data.splitting = 'LumiBased'
    config.Data.unitsPerJob = 40
    config.Data.lumiMask = 'Lumi.json'

    config.section_("User")
    config.User. = ''  # attribute name and value redacted in the original

    config.section_("Site")
    config.Site.storageSite = 'T2_IT_Pisa'
Soon after submit

Panda:

    bash-3.2$ crab status -t crab_ _ i
    Registering user credentials
    Task name:  tboccali_crab_ _113729_121127_
    Panda url:  &user=Tommaso%20Boccali
    Details:    running    0.78 % (10/1279)
                activated 99.22 % (1269/1279)
    Information per site are not available.
    Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/PoCTests/crab_ _113729/crab.log

No information per site; a link to the monitoring is present.

WMA:

    bash-3.2$ crab status -t crab_request_name2 -i
    Registering user credentials
    Task Status:  running
    Using 7 site(s):
    Jobs Details: submitted % ( running % pending % )
        T2_US_Florida:   submitted %
        T2_FR_GRIF_IRFU: submitted %
        T2_RU_JINR:      submitted %
        T2_UK_London_IC: submitted %
        T2_FR_GRIF_LLR:  submitted %
        T2_IT_Pisa:      submitted %
        T2_ES_IFCA:      submitted %
    Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/Crab3Tests/crab_request_name2/crab.log

No link to the dashboard? One has to find it by hand.
A few considerations
– Let's start from the obvious: with both systems I reached 100% done, with some "resubmit" (site problems)
– Feature: with Panda a resubmit is a second task (with a second web page)… Not used to it, but not a critical issue (you just need to get used to it)
ASO
– It worked flawlessly in both cases; nothing more to say, I guess… (I did not even need to look into the ASO monitoring)
– You can get the files before ASO has operated (I guess lcg-cp is used, …), as sketched below
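A minimal sketch (mine, not from the slides) of what fetching one output file by hand before ASO has copied it could look like, assuming lcg-cp is indeed the tool in play; the SRM endpoint and file paths below are placeholders, not the real ones:

    import subprocess

    # Placeholder source and destination; the real SRM endpoint and path depend
    # on the site where the job actually ran.
    src = 'srm://some-se.example.org/store/temp/user/output_1.root'
    dst = 'file:///tmp/output_1.root'

    # lcg-cp copies between grid storage and local files ('-v' = verbose).
    subprocess.check_call(['lcg-cp', '-v', src, dst])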
Issues with Panda
– Kill did not work for me; I understood it was simply a timeout that needed a different threshold, and I did not check further
Is resubmit working fine?
– In both cases, it was for me
– Caveat: the PoC-enabled sites are generally good/very good, so there was no chance to test a massive-failure scenario
Let's go straight to the point
Up to here the executive summary could be: "Limiting the scenario to what the PoC is supposed to allow me to do, PANDA performs at least as well as WMA" (again, this _after_ the two weeks of initial testing)
What is different
Panda monitoring seems far better than what we are used to
Dashboard/WMA… (as usual)
…Plus WMStats
Some debugging info is added, but not that much (where is the WN name? where is the LSF id?)
Features we usually do not have
All the logs (pilot + stderr + stdout) are on the web
– All: not only snippets for failed jobs
– I guess physics support would love it: instead of asking the (maybe not too skilled) user to upload logs, support can get all the info from the web
– Snippets are not OK in general: a failure can depend on a bad environment variable, which cannot be seen from the snippet alone (see the sketch below)
There is a link PILOT -> LSF id! This I considered lost since we left gLite, and it is a MAJOR help in debugging strange problems (like WNs acting as black holes)
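To make the environment-variable point concrete, a toy sketch (mine, not from the slides) of a job wrapper dumping its full environment to stdout at start-up; with full logs on the web this makes a bad variable visible, while a snippet around the final error message never would:

    import os

    # Print the complete job environment; with full stdout logs available this
    # is enough to spot, e.g., a broken LD_LIBRARY_PATH on a misconfigured WN.
    for name in sorted(os.environ):
        print('%s=%s' % (name, os.environ[name]))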
Pilot log (screenshot): the WN name and the LSF id are visible
Logs (screenshot): full logs present, not just snippets guessed as interesting by the system; full logs are also uploaded to the SE
Other features I liked
– Panda seems user-friendly when scheduling jobs: if you submit a task, even if your priority is very low, a few jobs are executed almost immediately, allowing you to spot broken workflows in advance
– It seems I can resubmit at any time (no need to wait for the task to be in cooloff…)
  – Is it because ACDC is not in the game? Is there anything we pay for this (side effects I am not aware of)?
Conclusions?
– As said, functionally both were doing what was asked; PANDA does not lag behind at all
– I cannot speak about what is NOT supposed to be in the PoC (which is not a small subset)
– The major differences to me are:
  – Monitoring: way better in the PoC, with full disclosure of all the info
  – The early prioritization of some jobs is a big help (it goes far beyond a simple Python sanity check)
  – You seem to be able to resubmit at any time, with no cooloff needed; this potentially cuts the time to process the tails