First test of the PoC

Caveats
- I am not a developer ;)
- I was also a beta tester of Crab3+WMA in 2011; I restarted testing it ~2 weeks ago to have a 1-to-1 comparison
- The first 2 weeks of the PoC test were mainly:
  - Finding a problem
  - Communicating it to the developers
  - Getting a new version
  - Trying again
- I simply skip this part, which is ok; I speak about the results after all the fixes

What I tested (with both)
- A complicated workflow: the official (V)H->bb analysis step1 (see https://twiki.cern.ch/twiki/bin/view/CMS/VHbbAnalysisNewCode#NtupleV42_CMSSW_5_3_3_patch2 ), which takes ~2 hours just to compile
  - Indeed the ISB is ~45 MB, with 56 user-compiled libraries
- Running on dataset /DoubleElectron/Run2012B-PromptReco-v1/AOD
  - 40 LS/job -> ~1200 jobs, a couple of hours each
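The "40 LS/job -> ~1200 jobs" figure can be sanity-checked with a little arithmetic. The sketch below assumes the simplest possible model, in which lumi sections are packed evenly into jobs; the real splitter may also respect file or run boundaries, so treat the numbers as an approximation. The function names are illustrative and not part of CRAB.

```python
def jobs_for(total_lumis, units_per_job):
    """Jobs needed if each job takes at most units_per_job lumi sections."""
    if units_per_job <= 0:
        raise ValueError("units_per_job must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return (total_lumis + units_per_job - 1) // units_per_job


def lumi_range_for(n_jobs, units_per_job):
    """Smallest and largest lumi-section counts that give exactly n_jobs."""
    return ((n_jobs - 1) * units_per_job + 1, n_jobs * units_per_job)


UNITS_PER_JOB = 40  # from the slide: 40 LS/job

low, high = lumi_range_for(1200, UNITS_PER_JOB)
print(f"~1200 jobs at {UNITS_PER_JOB} LS/job means roughly {low}-{high} lumi sections")
```

Under this even-packing assumption, about 1200 jobs corresponds to roughly 48 thousand lumi sections selected by the lumi mask.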

Where I tested
- CRAB3/Panda: the test is restricted to a few sites (FNAL, Pisa, DESY, …)
  - The sample is in fact only at FNAL and Pisa among the PoC sites
- CRAB3/WMA: 8 T2s available, some of poor quality (T2_RU_*)
- Always used Pisa as storage site

Moreover
- The PoC is not expected to provide full Crab3 functionality, just (as in the email I got):
  - Submit
  - Resubmit
  - Kill
  - Status
  - Getoutput
  - Getlog
- So I stick to these also for Crab3/WMA (i.e. I do not do DBS publication)

Configs

Panda:

    from WMCore.Configuration import Configuration
    import os
    from datetime import datetime

    config = Configuration()

    config.section_("General")
    config.General.serverUrl = 'poc3test.cern.ch'
    config.General.ufccacheUrl = 'cmsweb-testbed.cern.ch'

    config.section_("JobType")
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'patData.py'

    config.section_("Data")
    config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
    config.Data.publishDataName = os.path.basename(os.path.abspath('.')) + "_tom"
    config.Data.lumiMask = 'Lumi.json'
    config.Data.publishDbsUrl = "https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet"
    config.Data.splitting = 'LumiBased'
    config.Data.unitsPerJob = 40

    config.section_("User")
    config.User.email = ''

    config.section_("Site")
    config.Site.storageSite = 'T2_IT_Pisa'

WMA:

    from WMCore.Configuration import Configuration
    import os

    config = Configuration()

    config.section_("General")
    config.General.requestName = 'request_name2'
    config.General.serverUrl = 'crab3-test.cern.ch'
    config.General.ufccacheUrl = 'cmsweb.cern.ch'

    config.section_("JobType")
    config.JobType.pluginName = 'Analysis'
    config.JobType.psetName = 'patData.py'

    config.section_("Data")
    config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
    config.Data.splitting = 'LumiBased'
    config.Data.unitsPerJob = 40
    config.Data.lumiMask = 'Lumi.json'

    config.section_("User")
    config.User.email = ''

    config.section_("Site")
    config.Site.storageSite = 'T2_IT_Pisa'
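To make the 1-to-1 comparison explicit, the two configs can be diffed parameter by parameter. The sketch below does not use WMCore: it copies the values from the slide into plain dictionaries (the dotted key names and the `diff_configs` helper are illustrative), and it assumes the first config is the Panda one and the second the WMA one, as the slide labels suggest. The `publishDataName` value is computed at run time in the original, so a placeholder string stands in for it.

```python
# Values copied from the slide; keys are "Section.parameter".
panda = {
    "General.serverUrl": "poc3test.cern.ch",
    "General.ufccacheUrl": "cmsweb-testbed.cern.ch",
    "JobType.pluginName": "Analysis",
    "JobType.psetName": "patData.py",
    "Data.inputDataset": "/DoubleElectron/Run2012B-PromptReco-v1/AOD",
    "Data.publishDataName": "<cwd basename>_tom",  # placeholder, computed in the original
    "Data.lumiMask": "Lumi.json",
    "Data.publishDbsUrl": "https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet",
    "Data.splitting": "LumiBased",
    "Data.unitsPerJob": 40,
    "User.email": "",
    "Site.storageSite": "T2_IT_Pisa",
}

wma = {
    "General.requestName": "request_name2",
    "General.serverUrl": "crab3-test.cern.ch",
    "General.ufccacheUrl": "cmsweb.cern.ch",
    "JobType.pluginName": "Analysis",
    "JobType.psetName": "patData.py",
    "Data.inputDataset": "/DoubleElectron/Run2012B-PromptReco-v1/AOD",
    "Data.splitting": "LumiBased",
    "Data.unitsPerJob": 40,
    "Data.lumiMask": "Lumi.json",
    "User.email": "",
    "Site.storageSite": "T2_IT_Pisa",
}


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys that differ; None means absent."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


for key, (p, w) in diff_configs(panda, wma).items():
    print(f"{key}: Panda={p!r}  WMA={w!r}")
```

The differences are limited to the server endpoints, the request name and the publication settings; the workflow itself (pset, dataset, lumi mask, splitting, storage site) is the same in both.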

Soon after submit

Panda:

    bash-3.2$ crab status -t crab_20121127_113729 -i
    Registering user credentials
    Task name: tboccali_crab_20121127_113729_121127_103859
    Panda url: http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=19&user=Tommaso%20Boccali
    Details: running 0.78 % (10/1279)
             activated 99.22 % (1269/1279)
    Information per site are not available.
    Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/PoCTests/crab_20121127_113729/crab.log

- No information per site, link to monitoring present

WMA:

    bash-3.2$ crab status -t crab_request_name2 -i
    Registering user credentials
    Task Status: running
    Using 7 site(s):
    Jobs Details: submitted 100.00 % ( running 44.31 % pending 55.69 % )
    T2_US_Florida: submitted 14.58 %
    T2_FR_GRIF_IRFU: submitted 14.58 %
    T2_RU_JINR: submitted 14.58 %
    T2_UK_London_IC: submitted 12.54 %
    T2_FR_GRIF_LLR: submitted 14.58 %
    T2_IT_Pisa: submitted 14.58 %
    T2_ES_IFCA: submitted 14.58 %
    Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/Crab3Tests/crab_request_name2/crab.log

- No link to the dashboard (?): one has to find it by hand
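The Panda status line reports both a percentage and a raw count per state, so the two can be cross-checked. The sketch below parses fragments of the form "state P % (n/N)" with a regular expression written for this one line; it is an illustration based on the output shown on the slide, not a parser for the general `crab status` format.

```python
import re

# Matches e.g. "running 0.78 % (10/1279)".
STATE = re.compile(r"(\w+)\s+([\d.]+)\s*%\s*\((\d+)/(\d+)\)")


def parse_states(text):
    """Return {state: (reported_percent, count, total)} for each fragment found."""
    return {
        m.group(1): (float(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in STATE.finditer(text)
    }


sample = "Details: running 0.78 % (10/1279) activated 99.22 % (1269/1279)"
states = parse_states(sample)

for state, (reported, count, total) in states.items():
    recomputed = round(100.0 * count / total, 2)
    print(f"{state}: reported {reported} %, recomputed {recomputed} %")

# The per-state counts should add up to the task size.
print(sum(count for _, count, _ in states.values()))
```

The counts add up to the 1279 jobs of the task, and the recomputed percentages agree with the ones printed by the client.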

A few considerations
- Let’s start from the obvious: with both systems I reached 100% done, with some “resubmit” (site problems)
- Feature: with Panda a resubmit is a second task (with a second web page)… I am not used to it, but it is not a critical issue (you just need to get used to it)

ASO
- It worked flawlessly in both cases
- Nothing more to say, I guess … (I did not even need to look into the ASO monitoring)
- You can get the files before ASO has operated (I guess lcg-cp is used, …)

Issues with Panda
- Kill did not work for me; I understood it was simply a timeout that needs to be set to a different threshold, and I did not check further

Is resubmit working fine?
- In both cases, it was for me
- Caveat: the PoC-enabled sites are generally good/very good, so there was no chance to test a massive failure scenario

Let’s go straight to the point
- Up to here, the executive summary could be: “Limiting the scenario to what the PoC is supposed to allow me to do, PANDA performs at least as well as WMA”
- (again, this _after_ the two weeks of initial testing)

What is different
- Panda monitoring seems by far better than what we are used to

Dashboard/WMA… (as usual)

…Plus WMStats
- Some debugging info added, but not that much (where is the WN name? where is the LSF id?)

Features we usually do not have
- All the logs (pilots + stderr + stdout) are on the web
  - All: not only snippets for failed jobs
  - I guess ph support would love it, instead of asking users to upload logs
  - Support can get all the info from the web, no need to ask the (maybe not too skilled) user
  - Snippets are not ok in general: a failure can depend on a bad environment variable … which cannot be seen from the snippet alone
- There is a link PILOT -> LSF id! I considered this lost since we left gLite, and it is a MAJOR help to debug strange problems (like WNs acting as black holes)

Pilot log -> WN LSF id

Logs
- Full logs present, not just snippets guessed as interesting by the system
- Full logs uploaded to the SE
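Once every job can be traced back to its worker node, spotting a black-hole WN becomes a simple grouping exercise. The sketch below is purely illustrative: the node names and job outcomes are invented, the thresholds are arbitrary, and nothing here is a Panda or CRAB API. It only shows the kind of check that the pilot-to-LSF-id link makes possible.

```python
from collections import defaultdict


def suspect_nodes(jobs, min_jobs=5, min_failure_rate=0.9):
    """Return worker nodes that ran at least min_jobs jobs and failed nearly all of them.

    jobs: iterable of (worker_node, succeeded) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for node, succeeded in jobs:
        totals[node] += 1
        if not succeeded:
            failures[node] += 1
    return sorted(
        node
        for node, n in totals.items()
        if n >= min_jobs and failures[node] / n >= min_failure_rate
    )


# Hypothetical job records: (worker node, job succeeded?)
jobs = (
    [("wn-a", False)] * 6                       # many jobs, all failed: black-hole candidate
    + [("wn-b", True)] * 5 + [("wn-b", False)]  # one failure in six: looks healthy
    + [("wn-c", False)] * 2                     # too few jobs to judge
)

print(suspect_nodes(jobs))
```

A node that swallows many jobs and fails almost all of them is flagged, while a node with only a couple of failures is left alone until there is more evidence.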

Other features I liked
- Panda seems user friendly when scheduling jobs: if you submit a task, even if your priority is very low, a few jobs are executed almost immediately, allowing you to spot broken workflows in advance
- It seems I can resubmit at any time (no need to wait for the task to be in cooloff …)
  - Is it because ACDC is not in the game? Is there a price we pay for this (side effects I am not aware of)?

Conclusions?
- As said, functionally both did what was asked
  - PANDA does not look behind at all
- I cannot speak about what is NOT supposed to be in the PoC (which is not a small subset)
- The major differences to me are:
  - Monitoring: way better in the PoC, with full disclosure of all the info
  - The early prioritization of some jobs is a lot of help (it goes far beyond a simple python sanity check)
  - You seem to be able to resubmit at any time, with no cooloff needed; this potentially cuts the time to process tails
