A. Minaenko, Meeting on Physics and Computing, 16 September 2009, IHEP, Protvino
Current Status and Near-Term Prospects of ATLAS Computing in Russia
Layout
– Ru-Tier-2 tasks
– Analysis Model for the First Year (AMFY)
– Ru-Tier-2 resources
– Ru-Tier-2 resources needed for 2010
– Current activity
– STEP09
ATLAS RuTier-2 tasks
– The Russian Tier-2 (RuTier-2) computing facility is planned to supply computing resources to all 4 LHC experiments, including ATLAS. It is a distributed computing centre comprising at the moment the computing farms of 6 institutions: ITEP, RRC-KI, SINP (all Moscow), IHEP (Protvino), JINR (Dubna), PNPI (St. Petersburg)
– The main RuTier-2 task is to provide facilities for physics analysis of the collected data, using mainly AOD, DPD and user-derived data formats
– It also includes the development of reconstruction algorithms using limited subsets of ESD and RAW data
– About 50 active ATLAS users are expected to carry out physics data analysis at RuTier-2
– A group "ru" has now been created in the framework of the ATLAS VO. It includes physicists intending to carry out analysis at RuTier-2, and the group list contains 61 names at the moment. The group will have the privilege of write access to local RuTier-2 disk resources (space token LOCALGROUPDISK)
– All the data used for analysis should be stored on disk, and some unique data (user and group DPD) are to be saved on tape
– The second important task is the production and storage of MC simulated data
– The full size of the data, and the CPU needed for their analysis, are proportional to the collected statistics; the resources needed must therefore grow steadily with the number of collected events
Analysis data formats (AMFY)
DPD format for first-year analysis:
– Recommendation: only one format between AOD and ntuple (merged D1PD/D2PD), called dAOD (suggest also renaming PerfDPD to dESD, without however changing their definition at this point)
– dAOD driven by group analysis needs, with the possibility of added group info (example: top D2PD, here produced directly from AOD: top-dAOD)
– Coordinated via PC (the signal of one group is useful for background studies of others)
– Should be exercised for MC09 with final formats for the first year
– Analyses not covered use AOD directly
All data formats used for data need to be available in MC as well
Where to run your jobs (AMFY)
Physics analysis should start from AOD (or dAOD):
– Note dESD have AOD objects included
– Analysing AOD in general needs DB access
– Oracle servers are only available at CERN/Tier-1s; they will be supplemented in the next weeks/months by increasing amounts of Frontier/Squid caches at Tier-2 (Tier-3, laptop, …) sites, along with POOL files in appropriate data elements
– DB Release (as used for reprocessing, simulation) will become the backup solution
Run jobs on Tier-2s to analyse (d)AOD and produce ntuples:
– Tier-1s may be involved for group production, Tier-3s depending on the availability of data sets
Skeleton physics analysis model 09/10 (AMFY)
[Diagram] Analysis flow: AOD (plus AODfix; 2-3 reprocessings from RAW in 2009/10, with release/cache) → dAOD (analysis-group-driven definitions coordinated by PC; may have added metadata to allow ARA-only analysis from this point; re-produced for reprocessed data and significant metadata updates; data super-set of good runs for the period) → user file via a PAT ntuple dumper in Athena [main selection & reco work], keeping track of tag versions of metadata, lumi-info etc.; DB access direct or via Frontier/Squid, POOL files, use of TAG. The user format [final complex analysis steps] may take several forms (left to the user): POOL file (ARA analysis), ROOT tree, histograms, … → results. Developed analysis algorithms are to be ported back to Athena as much as possible.
Current RuTier-2 resources for ATLAS

Site      CPU, kSI2k   Disk, TB   ATLAS disk, TB
IHEP          …            …            …
ITEP          …            …            …
JINR          …            …            …
RRC-KI        …            …            …
PNPI          …            …            …
SINP          …            …            …
MEPhI         …            …            …
FIAN          …            …            …
Total         …            …            …

Red – sites used for user analysis of ATLAS data (IHEP, JINR, RRC-KI, PNPI; cf. the data-distribution slide); the others are for simulation only
Now the main type of LHC grid job is official production, and CPU resources are at the moment shared dynamically by all 4 LHC VOs on the basis of equal access rights for each VO at each site
Later, when analysis jobs take a larger part of the RuTier-2 resources, the sharing of CPU resources will be made proportional to the VO disk share at each site
This year CPU resources will be increased by 1000 kSI2k and disk by 1000 TB
Estimate of resources needed to fulfil RuTier-2 tasks in 2010
– The announced effective time available for physics data taking during the long run is 0.6*10^7 seconds
– The ATLAS DAQ event recording rate is 200 events per second, i.e. the whole expected statistics is 1.2*10^9 events
– The current AOD event size is 150 KB, 1.5 times larger than the ATLAS computing model requirement, and it will hardly be decreased by the start of data taking
– The full expected size of the current AOD version is therefore 180 TB
– It is necessary to keep 30-40% of the previous AOD version for comparisons. This gives a full AOD size of 230 TB – DATADISK
– During the first years of LHC running a very important task is the study of detector performance, and this task requires more detailed information than is available in the AOD. ATLAS plans to use for this task "performance DPDs", prepared on the basis of ESD. The targeted full performance DPD size is equal to the full AOD size, i.e. another 230 TB – DATADISK
– The expected physics DPD size (official physics DPD produced by the physics analysis groups) is at the level of 0.5 of the full AOD size, i.e. 120 TB more – GROUPDISK
– 50 TB should be reserved for local users (ntuples, histograms kept at the LOCALGROUPDISK token, 1 TB per user) – LOCALGROUPDISK
– The expected size of simulated AOD for MC08 (10 TeV) events alone is 80 TB, so about 150 TB need to be reserved for the full simulated AOD – MCDISK
– It is also necessary to keep some samples of ESD and RAW events and, probably, calibration data
– So the minimal requirement for disk space (230 + 230 + 120 + 50 + 150 TB, plus the ESD/RAW samples) is at the level of 800 TB
– Using the usual CPU/disk ratio of 3/1, one gets an estimate of about 2400 kSI2k for the needed CPU resources (a worked check follows below)
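As a cross-check, the whole estimate above fits in a few lines of arithmetic. A minimal Python sketch; every input is a figure quoted on this slide, and the 30% retention of the previous AOD version is taken as the lower end of the quoted 30-40% range:

    # Worked 2010 resource estimate; all inputs are the numbers quoted above
    events   = 0.6e7 * 200                 # 1.2e9 recorded events
    aod_tb   = events * 150e3 / 1e12       # 150 KB/event -> 180 TB per AOD version
    aod_full = 1.3 * aod_tb                # keep ~30% of the previous version -> ~230 TB
    perf_dpd = aod_full                    # performance DPD targeted at full AOD size
    phys_dpd = 0.5 * aod_full              # physics DPD -> ~120 TB
    local_tb = 50                          # LOCALGROUPDISK, 1 TB for each of ~50 users
    mc_tb    = 150                         # MCDISK, full simulated AOD
    disk_tb  = aod_full + perf_dpd + phys_dpd + local_tb + mc_tb
    print(round(disk_tb))                  # ~785 TB, i.e. ~800 TB with ESD/RAW samples
    print(round(3 * disk_tb))              # 3/1 CPU/disk ratio -> ~2400 kSI2k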
ATLAS RuTier-2 and data distribution
– The sites of RuTier-2 are associated with the ATLAS Tier-1 SARA
– All 6 sites (IHEP, ITEP, JINR, RRC-KI, SINP, PNPI) are now included in the TiersOfAtlas list, and FTS channels are tuned for data transfers to/from the sites
– 4 of the sites (IHEP, JINR, RRC-KI, PNPI) will be used for ATLAS data analysis, and all physics data needed for analysis will be kept at these sites. The other 2 sites will be used for MC simulation only
– All sites participate successfully in the data transfer functional tests (next 2 slides). This is a coherent data transfer test Tier-0 → Tier-1s → Tier-2s for all clouds, using existing SW to generate and replicate data and to monitor the data flow. It is now a regular activity performed once per week; all the data transmitted during the FTs are deleted at the end of each week. The volume of data used for the functional tests is at the level of 10% of the volume expected during real data taking
– RuTier-2 is now subscribed to receive all simulated AOD, DPD and TAGs, as well as cosmic AOD. The data transfer is done automatically under the steering and control of the central ATLAS DDM (Distributed Data Management) group
– The currently used shares (40%, 30%, 15%, 15% for RRC-KI, JINR, IHEP, PNPI) correspond to the disk resources available for ATLAS at the sites
– A similar scheme will be used for real data transfer to RuTier-2: MC data are transferred to the MCDISK space token, and cosmic and future real data to the DATADISK space token at RuTier-2
Main Tasks of ATLAS STEP09 (first two weeks of June)
ATLAS STEP09 tasks for Ru-Tier-2
– The main activities at Ru-Tier-2 during STEP09 include data replication and the provision of facilities for simulation and analysis jobs, all at the rates and in the proportions expected during real data taking
– Data replication implies data transfer to the Ru-Tier-2 sites and includes two types of data:
 – About 80 TB of MC simulated AOD and DPD for long-term storage. These data are to be used by analysis jobs submitted by the ATLAS gangarobot during STEP09, and will be used later for real analysis carried out by ATLAS users. The data are stored at the MCDISK space token
 – About 110 TB of fake MC data, needed only to imitate the data flow at the rate corresponding to LHC running. These data are stored at the DATADISK space token
– The following DATADISK shares were defined for our sites: RRC-KI – 50%, JINR – 20%, IHEP and PNPI – 15% each. This corresponds to the free disk space available at the sites (see the sketch below)
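For illustration, the quoted shares translate directly into per-site DATADISK volumes; a trivial sketch, with the 110 TB figure taken from this slide:

    # Fake-data DATADISK volume per site implied by the STEP09 shares above
    faked_tb = 110
    shares = {"RRC-KI": 50, "JINR": 20, "IHEP": 15, "PNPI": 15}   # percent
    for site, pct in shares.items():
        print(site, faked_tb * pct / 100, "TB")   # 55, 22, 16.5 and 16.5 TB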
STEP09 Data Distribution
ATLAS STEP09 tasks for Ru-Tier-2
– Analysis jobs were submitted using two different backends: gLite WMS and Panda
– WMS jobs used POSIX I/O to fetch data to the WN, i.e. a quasi-online method. For this purpose the native protocols are normally used: dcap for dCache SEs, rfio for DPM SEs. Input size – about 30 GB/job
– Panda used the File Stager: at the beginning of the job, all needed files are fetched to the local disk of the WN with the lcg-cp command, which in turn uses the gsiftp protocol. Input size – about 10 GB/job
– ATLAS requested that local job schedulers be tuned to support the following shares between job types during STEP09:
 – Simulation jobs (Role=production) – 50%
 – Panda analysis jobs (Role=pilot) – 25%
 – All the others, including WMS analysis jobs – 25%
– For RDIG sites we additionally requested 5% for jobs of the group atlas/ru
– This requirement is crucial for successful analysis, because otherwise the numerous long simulation jobs simply crowd out the short analysis jobs
– None of this was done at the RDIG sites and, as I understand, at RRC-KI and PNPI even Role=pilot was not implemented for ATLAS
STEP09 Analysis Jobs Workflow
STEP09 DDM Results
STEP09 Analysis Global Results
STEP09 Analysis: 2 Backends
Analysis jobs in all clouds
Next slides: Ru-Tier-2 sites marked in red; in yellow – sites with a considerably larger contribution (better results) than IHEP
Analysis jobs in NL cloud
Panda Analysis Jobs
Panda Analysis Jobs
WMS Analysis Jobs
Problems with Analysis at Ru-Tier-2 sites
IHEP: the results are rather good for both backends. 40% of all events analysed in the NL cloud were treated here, and there are not many sites in the grid that made a considerably larger contribution (marked in yellow in the previous slides)

IHEP     Efficiency   Hz     CPU/Wall   #events
WMS      84%          9.7    39%        300 M
Panda    95%          12.4   38%        250 M

Problems:
– Poor scheduling, as at all the other sites. Even the fair share between VOs was not provided properly. This is due to the fact that different job types (production, pilot, ordinary) had different priorities, and job wait time had too large an influence. The required shares between the different groups of ATLAS jobs were not provided either. In general the results of such scheduling were unpredictable. Two slides from the STEP09 post-mortem follow, which hint that this is rather a MAUI scheduler problem
– ATLAS bug with the libgfal.so name during the ROOT library build. The problem affected WMS jobs, decreasing their efficiency

PNPI: there was one but severe problem – a very narrow SE → WNs bandwidth, which did not permit any real analysis at the site

PNPI     Efficiency   Hz     CPU/Wall   #events
WMS      21%          0.8    11%        4 M
Panda    0%           –      –          –
Problems with Analysis at Ru-Tier-2 sites
JINR: the results are not satisfactory, especially for WMS, but this is due to ATLAS rather than site problems

JINR     Efficiency   Hz     CPU/Wall   #events
WMS      4%           11.0   80%        6 M
Panda    55%          8.6    28%        215 M

Problems:
– The ATLAS bug with the libgfal.so name during the ROOT library build killed practically all WMS jobs. The site uses the gsidcap protocol rather than dcap, and in this case the library is always required
– Panda jobs use the lcg-cp command to fetch input files to the local WN disk at the very beginning of the job. lcg-cp in turn uses the gsiftp protocol for the data transfer. In the case of a dCache SE, all gsiftp traffic goes through gridftp doors placed on some nodes of the SE. At the very beginning of STEP09 there was only 1 gridftp door at the site; the number was then increased to 3, but in any case 1 or 3 Gbps is much less than the really available SE → WNs bandwidth, leading to a bottleneck and job failures. ATLAS should use only the native protocols (dcap/gsidcap, rfio) for intra-farm data transfers (see the sketch below)
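A back-of-the-envelope illustration of the gridftp-door bottleneck described above; the 10 GB/job input size and the 1-3 doors are from this slide, while the numbers of concurrent jobs are assumed round figures:

    # Stage-in time when all lcg-cp traffic is funnelled through gridftp doors
    GBIT_BPS = 1e9 / 8                        # bytes/s through one 1 Gbit/s door

    def stagein_seconds(input_gb, doors, jobs):
        """Time for one lcg-cp to copy its input when `jobs` concurrent
        transfers share `doors` doors of 1 Gbit/s each."""
        per_job_bw = doors * GBIT_BPS / jobs  # bytes/s available to each transfer
        return input_gb * 1e9 / per_job_bw

    print(stagein_seconds(10, doors=1, jobs=20))    # 1600 s: 20 jobs, 1 door
    print(stagein_seconds(10, doors=3, jobs=100))   # ~2670 s: 100 jobs, 3 doors,
                                                    # already over a 1800 s timeout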
Problems with Analysis at Ru-Tier-2 sites
RRC-KI: the results are also rather poor. The efficiency and frequency are low for both backends, as is CPU/Walltime for WMS

RRC-KI   Efficiency   Hz     CPU/Wall   #events
WMS      63%          1.6    10%        50 M
Panda    18%          6.6    39%        18 M

Problems:
– POSIX I/O fails when trying to read the next event
– lcg-cp times out after 1800 s. This and the above problem are due to a bottleneck in the SE → WNs pipeline. The farm network configuration needs to be seriously corrected to solve these problems
– lcg-cr fails on attempts to write small files (the log file and the output ROOT file with histograms) to SCRATCHDISK. This opposite traffic, SE ← WNs, is rather small and should not suffer from the bottleneck. The cause is evidently some SE configuration error, which has not yet been found and fixed
Job Scheduling Problem, slide from STEP09 post-mortem
Job Scheduling Problem, slide from STEP09 post-mortem
Summary STEP09
– The data transfer part of STEP09 was successful in general. Measures have been taken to make the RRC-KI SE work more stably; the external bandwidth at IHEP has already been increased to 1 Gbps
– The main problem: the farms at RRC-KI and PNPI need to be reconfigured to remove the bottlenecks. The farm design at the other sites needs to be checked and, probably, revised as well, to be ready for future challenges. Possible solutions are presented in the enclosed presentation
– Fix all the other problems found and repeat the analysis exercises as soon as possible
– A very serious and urgent problem: a new scheduler is needed, or MAUI must be tuned properly, if that is possible. This is not only our problem but rather a general LCG problem. Efficient analysis will not be possible without its solution
– It is necessary to measure the output bandwidth of our individual file servers of the different types, as well as the full SE output bandwidth at each site for each VO. This is needed to understand how many analysis jobs each site can accept (see the enclosed presentation and the sketch below)
– In the future: try to increase the frequency up to 20 Hz and CPU/Wall up to 100%
– Raise the question of the libgfal.so library problem, as well as of the use of the gsiftp protocol by ATLAS analysis jobs for intra-farm data transfers
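One possible form of that per-site capacity estimate; a sketch assuming the 20 Hz per-job target and the 150 KB AOD event size quoted in this talk, with the measured SE output bandwidth as the free parameter (the example bandwidth values are assumptions, not measurements):

    # How many concurrent analysis jobs a site can feed from its SE
    def max_analysis_jobs(se_bw_mb_s, target_hz=20, event_kb=150):
        per_job_mb_s = target_hz * event_kb / 1000.0   # 3 MB/s per job at 20 Hz
        return int(se_bw_mb_s / per_job_mb_s)

    for bw in (125, 375, 1250):                        # ~1, 3 and 10 Gbps in MB/s
        print(bw, "MB/s ->", max_analysis_jobs(bw), "jobs")
    # 125 MB/s -> 41 jobs, 375 -> 125, 1250 -> 416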