22/10/2007, Software Week. Distributed analysis user feedback (I). Leonardo Carminati, Università degli Studi and INFN Sezione di Milano.

1 Distributed analysis user feedback (I)
Leonardo Carminati, Università degli Studi and INFN Sezione di Milano

2 Distributed analysis: the user dream
[Figure: diagram with the labels AOD, Distributed analysis, and DPD]

3 Distributed analysis with GANGA
- A fair report on ganga performance is complicated because it depends strongly on:
    - Data distribution: AODs are not completely replicated to all T1s as they should be (incomplete datasets)
    - Site configuration: jobs fail at some sites for several (local) reasons
- The discriminating variable is the COMPLETENESS of a dataset:
    - If a dataset is COMPLETE somewhere then ganga works perfectly: the jobs are sent correctly to sites which have a complete replica of the dataset.
- The problem of ‘bad sites’ (= sites on which my jobs fail) is clearly an issue for users:
    - Often the jobs fail because the site on which they run is not properly configured: this doesn’t depend (directly) on ganga
    - A less ambitious approach: restricting access to a minimal list of GOOD sites on which I’m sure my jobs will run (automatic procedure?) would greatly reduce the failure rate and improve user satisfaction
    - The possibility to define a ‘black list’ is included in the latest ganga release

4 Distributed analysis with ganga: current situation
If a dataset is INCOMPLETE everywhere then you might run into problems; the current picture is probably not optimal from the user's point of view.
Suppose you have a dataset of 30 files which is incomplete everywhere:
    Subjob 1: files 1-10
    Subjob 2: files 11-20
    Subjob 3: files 21-30
    Site A: incomplete dataset, files 5-15
    Site B: incomplete dataset, files 15-30
    Site C: incomplete dataset, files 15-20
In this configuration you will get 10 output files:
- Files 1-5 will be lost because they are missing everywhere
- Files 20-30 will be lost because of the job assignment to different sites
A subsequent submission might give different results.
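The 30-file example above can be reproduced with a few lines of Python. This is only a sketch: the file ranges are taken literally from the slide and treated as inclusive, and the subjob-to-site assignment is an assumption, since the slide does not say which site each subjob went to. Because the slide's ranges overlap at their boundaries, the counts printed here differ slightly from the slide's rounded figures.

```python
# Sketch of the 30-file example under the current splitting model.
# File ranges come from the slide and are treated as inclusive.

def file_range(first, last):
    """Return the set of file numbers first..last, inclusive."""
    return set(range(first, last + 1))

# Files held by each site (every replica is incomplete).
site_files = {
    "A": file_range(5, 15),
    "B": file_range(15, 30),
    "C": file_range(15, 20),
}

# Subjobs are cut as fixed blocks of 10 files, without looking at
# where the files actually are.
subjobs = {
    1: file_range(1, 10),
    2: file_range(11, 20),
    3: file_range(21, 30),
}

# ASSUMPTION: one possible subjob-to-site assignment (hypothetical).
assignment = {1: "A", 2: "C", 3: "C"}

def processed_files(subjobs, assignment, site_files):
    """Files each subjob can actually read at the site it was sent to."""
    done = set()
    for job_id, wanted in subjobs.items():
        done |= wanted & site_files[assignment[job_id]]
    return done

all_files = file_range(1, 30)
done = processed_files(subjobs, assignment, site_files)
missing_everywhere = all_files - set().union(*site_files.values())
lost_by_assignment = all_files - done - missing_everywhere

print("processed:", sorted(done))
print("missing everywhere:", sorted(missing_everywhere))
print("lost by assignment:", sorted(lost_by_assignment))
```

Changing `assignment` changes the result, which is the point made on the slide: a subsequent submission that lands on other sites can return a different set of files.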

5 Distributed analysis with ganga: current situation
- For the moment it’s not safe to let ganga decide for you: better to select the site with the largest number of files and send jobs there.
- So here things start to get complicated (at least from a common user's point of view) and users tend to become nervous.
- A user has to perform some operations which are not always simple:
    - Find out where the files are and select the site with the largest number of files
    - Be sure that the selected site is a good one
    - Send jobs there
- A good example is the AsymFilter gamma jet sample 6379: ~4M events
    - Four tids are available, all incomplete at all LCG sites (except for tid 10048, which according to dq2 is complete at DESY)

6 Ask ganga where the files are: d.list_locations_num_files('...')

trig1_misal1_mc12.006379.PythiaPhotonJet_AsymJetFilter.recon.AOD.v12000603_tid007531
{'CERNPROD': 10, 'DESY-HH': 0, 'NIKHEF': 2236, 'CPPM': 0, 'RALDISK': 0, 'CNAFDISK': 171, 'SACLAY': 0, 'ASGCDISK': 3931, 'AU-UNIMELB': 349, 'TORON': 1084, 'NIPNE_02': 0, 'TOKYO': 0, 'AGLT2': 0, 'NDGFT1DISK': 0, 'TRIUMFDISK': 3931, 'BNLDISK': 0, 'LAL': 0, 'BEIJING': 0, 'WUP': 3927, 'LAPP': 0, 'LYONDISK': 0, 'FZKDISK': 3928, 'LPNHE': 0, 'CERNCAF': 1120}

trig1_misal1_mc12.006379.PythiaPhotonJet_AsymJetFilter.recon.AOD.v12000603_tid009704
{'CERNPROD': 82, 'DESY-HH': 0, 'ASGCDISK': 549, 'RALDISK': 0, 'TW-FTT': 0, 'NAPOLI': 124, 'NIKHEF': 835, 'CNAFDISK': 2654, 'DESY-ZN': 2543, 'LNF': 151, 'CYF': 1756, 'TOKYO': 20, 'AGLT2': 0, 'NDGFT1DISK': 0, 'TRIUMFDISK': 2610, 'BNLDISK': 0, 'PICDISK': 2157, 'WUP': 1900, 'MILANO': 118, 'ROMA1': 95, 'LYONDISK': 170, 'FZKDISK': 1580, 'CERNCAF': 520}

trig1_misal1_mc12.006379.PythiaPhotonJet_AsymJetFilter.recon.AOD.v12000603_tid010048
{'MWT2_IU': 0, 'CNAFDISK': 0, 'TW-FTT': 0, 'TOKYO': 232, 'AGLT2': 0, 'NIKHEF': 161, 'FZU': 0, 'LYONDISK': 126, 'PICDISK': 1808, 'UTA_SWT2': 0, 'RALDISK': 0, 'FZKDISK': 1592, 'BU_DDM': 0, 'NDGFT1DISK': 0, 'BNLPANDA': 0, 'CERNCAF': 705, 'ASGCDISK': 256}

trig1_misal1_mc12.006379.PythiaPhotonJet_AsymJetFilter.recon.AOD.v12000603_tid011139
{'IFAE': 1598, 'DESY-HH': 0, 'NIKHEF': 540, 'UTA_SWT2': 0, 'MWT2_IU': 0, 'RALDISK': 2183, 'TOKYO': 297, 'SACLAY': 179, 'ASGCDISK_V2': 510, 'UAM': 1951, 'CYF': 881, 'DESY-ZN': 1523, 'AU-UNIMELB': 425, 'CNAFDISK': 723, 'FZU': 1700, 'NDGFT1DISK': 0, 'BNLDISK': 0, 'PICDISK': 1317, 'WUP': 331, 'LYONDISK': 2249, 'FZKDISK': 207, 'BU_DDM': 0, 'CERNCAF': 462}

How reliable is this output?
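Selecting the site with the largest number of files from a dictionary like the ones above can be sketched in plain Python. The dictionary below is the tid007531 output copied from the slide; the helper `rank_sites` and its `bad_sites` argument are hypothetical names introduced for illustration and are not part of ganga.

```python
# File counts per site for tid007531, copied from the slide.
locations = {
    'CERNPROD': 10, 'DESY-HH': 0, 'NIKHEF': 2236, 'CPPM': 0,
    'RALDISK': 0, 'CNAFDISK': 171, 'SACLAY': 0, 'ASGCDISK': 3931,
    'AU-UNIMELB': 349, 'TORON': 1084, 'NIPNE_02': 0, 'TOKYO': 0,
    'AGLT2': 0, 'NDGFT1DISK': 0, 'TRIUMFDISK': 3931, 'BNLDISK': 0,
    'LAL': 0, 'BEIJING': 0, 'WUP': 3927, 'LAPP': 0, 'LYONDISK': 0,
    'FZKDISK': 3928, 'LPNHE': 0, 'CERNCAF': 1120,
}

def rank_sites(locations, bad_sites=frozenset()):
    """Return (site, n_files) pairs, largest file count first.

    Sites with no files and sites listed in bad_sites are dropped.
    Ties are broken alphabetically so the result is reproducible.
    """
    candidates = [
        (site, n_files)
        for site, n_files in locations.items()
        if n_files > 0 and site not in bad_sites
    ]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))

ranked = rank_sites(locations)
best_site, best_count = ranked[0]
print("best site:", best_site, "with", best_count, "files")

# ASSUMPTION: excluding a site by hand. The site chosen here is purely
# illustrative and says nothing about that site's real reliability.
ranked_without = rank_sites(locations, bad_sites={"ASGCDISK"})
print("next choice:", ranked_without[0])
```

The ranking only reflects what the catalogue reports, so it inherits the reliability question raised on the slide.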

7 Distributed analysis with ganga: running jobs
- Running jobs is smooth on good sites:
    - Lyon and FZK are my favorite sites: plenty of files related to my analysis, fast execution (1h to run over 1M events).
    - I also ran successfully at DESY, Madrid, and CNAF
    - Very high efficiency (~90%) running at those sites on the available files
- But again: where are the missing files?
    - My result on the expected 4M AsymJet sample: ~2.1M -> about 50%
    - You can’t directly compare this number with the one obtained with Pathena because ganga suffers from the problem of asymmetric data distribution
- Clearly with a more careful search for good sites one can obtain better performance, but in any case this should be done automatically by ganga
- A new ganga working model is being provided:

8 Distributed analysis with ganga: new model
Send to each site a subjob running on files which are really present at that site:
    Subjob 1: files 5-15
    Subjob 2: files 21-30
    Subjob 3: files 16-20
    Site A: incomplete dataset, files 5-15
    Site B: incomplete dataset, files 15-30
    Site C: incomplete dataset, files 15-20
In this configuration you will get 25 output files: minimal user intervention and maximal result.
Clearly this would be a big improvement with respect to the current situation: I think this is really the key to dealing with the incomplete dataset issue and to enlarging the number of ganga clients.
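The idea of building subjobs from what each site actually holds can be sketched as follows. This is not the ganga implementation: `split_by_site` is a hypothetical helper, and the rule of serving the sites with the fewest files first is an assumption chosen only to keep the example short. File ranges are again taken from the slide as inclusive, so the exact boundaries and the total differ slightly from the slide's rounded figures.

```python
def file_range(first, last):
    """Return the set of file numbers first..last, inclusive."""
    return set(range(first, last + 1))

# Files held by each site, as listed on the slide.
site_files = {
    "A": file_range(5, 15),
    "B": file_range(15, 30),
    "C": file_range(15, 20),
}

def split_by_site(site_files):
    """Build one subjob per site from files that site really holds.

    Each available file is given to exactly one site. ASSUMPTION: sites
    holding fewer files are served first (an illustrative greedy rule).
    """
    remaining = set().union(*site_files.values())
    subjobs = {}
    for site in sorted(site_files, key=lambda s: len(site_files[s])):
        chunk = site_files[site] & remaining
        if chunk:
            subjobs[site] = sorted(chunk)
            remaining -= chunk
    return subjobs

subjobs = split_by_site(site_files)
for site, files in subjobs.items():
    print("site", site, "-> files", files[0], "to", files[-1])

total = sum(len(files) for files in subjobs.values())
print("files processed:", total)
```

Every file that exists at some site is processed exactly once, so the only losses left are the files that are missing everywhere.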

9 Distributed analysis with Pathena
- Not many comments here. Pathena is the closest thing to the user dream: you just choose the input dataset name and the output dataset name, and your jobs will on average succeed with very high efficiency (no additional worries!).
    pathena --inDS trig1_misal1_csc11.005310.PythiaH120gamgam.recon.AOD.v12000601_tid005860 --outDS user.LeonardoCarminati.trig1_misal1_csc11.005310.PythiaH120gamgam.recon.AOD --split 10 HggAnalysis_jobOptions.py
- Many users love Pathena for this!
- OK, so where’s the problem? I still have some questions about Pathena:
    - Pathena benefits from the fact that (almost) all AODs are copied to BNL: what happens if data are not collected at BNL?
    - When I run Pathena it seems to me that the jobs go to BNL only: is this correct?
    - Is this model scalable as the number of distributed analysis clients increases?

10 Conclusions
- From a user's point of view, distributed analysis can be very easy (and efficient) or very difficult, depending on the AOD distribution and the quality of the computing sites
- Using pathena these complications are hidden because almost all data are replicated at BNL, and the users are happy.
- Using GANGA the problems are more evident, although they are not caused by ganga itself.
- To allow users to run without panic and at the same level as pathena:
    - Ensure a correct AOD distribution, at least to the Tier1s
    - Strongly support the new ganga job assignment model
    - Provide an automatic mechanism to prevent users from running at ‘bad’ sites

