1
US ATLAS Computing Operations. Kaushik De, University of Texas at Arlington. US ATLAS Distributed Facility Workshop at SLAC, October 13, 2010.
2
Overview
- LHC is running. We expect ~50 pb^-1 by the end of this month.
- ATLAS focus has switched to physics results; ~20 publications are already in the pipeline.
- Distributed computing played a critical role in this success: congratulations and thanks to all of you.
- US computing operations are now closely integrated with ADC, as they should be during data-taking operations.
- The RAC is playing an active role in resource decisions.
- This talk covers overall ATLAS/ADC first, then US operations.
October 13, 2010 Kaushik De 2
3
Production
- Central production has been choppy: delays due to software releases, especially during the summer.
- The US has done its share reliably, but is now short of CPU resources.
(Plots: production over the past year and the past month; a big crunch is coming for the winter conferences.)
October 13, 2010 Kaushik De 3
4
Distributed Analysis
- Distributed analysis has scaled impressively: a factor of 10 more running jobs compared to a year ago.
- LHC reality is a factor of five higher than the stress test!
- We may need to scale even higher as the data volume grows.
(Plot annotations: Stress Test, LHC Start.)
October 13, 2010 Kaushik De 4
5
US Analysis Sites
- US sites have also scaled up impressively; all sites are running more analysis jobs.
- We had to make quick adjustments after the LHC start.
- Data distribution/management has been critical.
(Plot annotations: Stress Test, LHC Start.)
October 13, 2010 Kaushik De 5
6
No Matter How You Slice It
- As Kors recently presented, ATLAS has ~75 analysis sites, but 75% of analysis is done at only 20 sites and 90% at 36 sites.
- 7 of the top 20 sites are in the US (based on July + August data).
- US sites are running more jobs than many Tier 1 analysis sites.
(From Kors, July + August data.)
October 13, 2010 Kaushik De 6
7
Moving Forward in ADC
- We are doing remarkably well.
- Areas that are becoming important:
  - PAT (Physics Analysis Tools): Athena vs. ROOT analysis, standardizing user libraries, a default user tool
  - Shifts (DAST, Point1, ADCoS)
  - Integrated monitoring (converge on the PanDA monitoring and DDM monitoring platforms)
  - Data distribution: we need to get smarter as the data volume increases
  - Group production -> central production
  - Tier 3s becoming increasingly important
October 13, 2010 Kaushik De 7
8
What is US Operations?
- Data production: MC, reprocessing
- Data management: storage allocations, data distribution
- User analysis: site testing, validation
- Distributed computing shift teams: ADCoS, DAST
- Successful US computing operations is only possible because of the excellent US site managers at BNL and the Tier 2s.
October 13, 2010 Kaushik De 8
9
MC Production and Reprocessing
- The cornerstone of computing operations; the US team has more than 6 years of experience.
- Responsible for:
  - Efficient utilization of resources at Tier 1/2 sites
  - Monitoring site and task status 7 days a week (site online/offline)
  - Monitoring data flow
  - Reporting software validation issues
  - Reporting task and distributed software issues
- Part of the ADCoS shift team:
  - US team: Yuri Smirnov (captain), Mark Sosebee, Armen Vartapetian, Wensheng Deng
  - Coverage is 24/7, using shifts in 3 different time zones
- Task management / reprocessing / group production: Pavel Nevski
October 13, 2010 Kaushik De 9
10
Storage Management in the US
- The following slides are from Armen Vartapetian; Hiro Ito and Wensheng Deng are critical to the success of data management.
- Storage/data management is the most demanding and time-consuming operations activity.
- BNL (T1) and 5 Tier 2s: AGLT2 (Michigan), MWT2 (Chicago, Indiana), NET2 (Boston, Harvard), SWT2 (Arlington, Oklahoma), WT2 (SLAC).
- Storage systems: dCache, xrootd, GPFS, Lustre.
- All the site admins are part of a storage management group and participate in weekly US storage meetings to coordinate activities, exchange experience, and solve problems.
- Important decisions are discussed at the weekly US Resource Allocation Committee (RAC) meetings; the RAC decides priorities on the usage of computing resources, while the overall ATLAS priorities for the pledged resources are set by the ATLAS CREM committee.
October 13, 2010 Kaushik De 10
11
Primer on Space Tokens
- DATADISK: ESD (full copy at BNL, some versions at T2s), RAW (BNL only), AOD (four copies among US sites)
- MCDISK: AODs (four copies among US sites), ESDs (full copy at BNL), centrally produced DPDs (all sites), some HITS/RDOs
- DATATAPE/MCTAPE: archival data at BNL (mainly RAW)
- USERDISK: pathena output, limited lifetime (variable, at least 60 days, users notified before deletion)
- SCRATCHDISK: Ganga output, temporary user datasets, limited lifetime (maximum 30 days, no notification before deletion)
- GROUPDISK: physics/performance group data
- LOCALGROUPDISK: storage for (geographically) local groups/users
- PRODDISK: only used by PanDA production at Tier 2 sites
- HOTDISK: database releases (including conditions), SW tarballs
October 13, 2010 Kaushik De 11
12
Primer on Cleanup
- MCDISK/DATADISK: cleaned by central deletion or US DDM
- MCTAPE/DATATAPE: cleaned by central deletion, with notification to BNL to clean/reuse tapes
- SCRATCHDISK: cleaned by central deletion
- GROUPDISK: cleaned by central deletion
- HOTDISK: never cleaned!
- PRODDISK: cleaned by the site
- USERDISK: cleaned by US DDM
- LOCALGROUPDISK: cleaned by US DDM
October 13, 2010 Kaushik De 12
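The cleanup responsibilities above lend themselves to a simple lookup table. The following is a minimal sketch, not an official ADC or US DDM tool: the tokens, owners, and lifetimes are taken from the two primer slides, while the helper function and script structure are purely illustrative.

```python
# Sketch only: encodes the cleanup policy from the two "Primer" slides.
# The table values come from the slides; the helper itself is hypothetical.

SPACE_TOKEN_POLICY = {
    # token            (who cleans it,           minimum lifetime in days)
    "DATADISK":       ("central or US DDM",      None),
    "MCDISK":         ("central or US DDM",      None),
    "DATATAPE":       ("central (notify BNL)",   None),
    "MCTAPE":         ("central (notify BNL)",   None),
    "SCRATCHDISK":    ("central",                30),   # no user notification
    "GROUPDISK":      ("central",                None),
    "HOTDISK":        (None,                     None), # never cleaned
    "PRODDISK":       ("site",                   None),
    "USERDISK":       ("US DDM",                 60),   # users notified first
    "LOCALGROUPDISK": ("US DDM",                 None),
}

def cleanup_owner(token: str) -> str:
    """Return who is responsible for cleaning a space token (or 'nobody')."""
    owner, _lifetime = SPACE_TOKEN_POLICY[token]
    return owner or "nobody (never cleaned)"

if __name__ == "__main__":
    for tok in ("USERDISK", "HOTDISK"):
        print(tok, "->", cleanup_owner(tok))
```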
13
Available Storage Space
- Usually all of the available space is online, particularly during the recent months of high data flow.
- The general principle is to keep as much data as possible on disk (80%-90% full) and to leave as little storage idle as possible.
- In recent months the data volume has been high regardless, so all sites/space tokens are quite full (the situation is a bit more relaxed at BNL with the arrival of new storage).
- We try to move available space between space tokens (at dCache sites) to provide space where it is most needed; if that is not possible, we carry out additional deletions.
- Cleanup of old data that is no longer used is one of the most important issues.
- Sites are currently adding storage towards their pledged capacities, so the situation is generally improving, but we are getting ready for the next challenge.
October 13, 2010 Kaushik De 13
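As a concrete illustration of the 80%-90% rule above, a monitoring script could classify each space token by its fill fraction. This is only a sketch: the thresholds come from the slide, while the numbers, actions, and function name are hypothetical.

```python
# Minimal sketch of the "keep disks 80-90% full" rule described above.
# The thresholds come from the slide; the data source and actions are hypothetical.

WATERMARK_HIGH = 0.90   # above this, a token is at risk of filling up
WATERMARK_LOW = 0.80    # below this, space could be shifted to a fuller token

def classify_token(used_tb: float, total_tb: float) -> str:
    """Classify a space token by its fill fraction."""
    fill = used_tb / total_tb
    if fill > WATERMARK_HIGH:
        return "at risk: submit extra deletions or borrow space from another token"
    if fill < WATERMARK_LOW:
        return "has headroom: candidate to donate space (dCache sites)"
    return "in the target 80-90% band"

# Example with made-up numbers for two tokens at one site
for token, used, total in [("DATADISK", 460.0, 500.0), ("MCDISK", 300.0, 500.0)]:
    print(token, "->", classify_token(used, total))
```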
14
Recent Activities and Experience
- In recent months, data/MC reprocessing and distribution, plus LHC data at increasing luminosity, put significant stress on the SEs at all sites.
- Overall US cloud performance was quite good, partly due to the effort to always keep free space available.
- Central cleanup does not always catch up with fast-arriving data; during periods of high data flow, the space tokens at various sites are routinely at risk of filling up.
- Additional space reorganization and cleanup effort is needed to stay afloat; in the US we constantly monitor the space situation and submit additional deletions for the space tokens at risk.
- Part of the solution was also to stop distributing everything (ESDs) and to distribute instead what users actually run on (PD2P).
October 13, 2010 Kaushik De 14
15
Central Deletion Service
- The overall experience of the site admins with the automatic cleanup system is not very positive:
  - It is often late, acting only when a site has already run out of space; the grace period should be reduced or the deletion criteria and process optimized.
  - It is very difficult to trace a deletion through all its steps and to monitor it, and the interface is not very user friendly.
  - No total size values are presented, so it is hard to understand the gain.
  - The log file archiving needs better organization.
- A better central deletion service would significantly help data management operations; development of the service is underway by central operations.
October 13, 2010 Kaushik De 15
16
USERDISK Cleanup in the US
- pathena output for jobs running at US sites goes to USERDISK; cleanup is done by US operations.
- The lifetime of the datasets is at least 2 months.
- We send a notification email to users about the upcoming cleanup, with a link to the dataset list and basic instructions on how to proceed if a dataset is still needed.
- User datasets in USERDISK are matched to the dataset owner's DN from the DDM catalog, and the DN is matched to a known email address; manual effort is often needed for missing or obsolete addresses.
- Users give quite positive reviews of this cleanup operation when they are notified in advance.
- We plan to run the monitoring software and the USERDISK cleanup service on a more regular basis.
October 13, 2010 Kaushik De 16
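The notification workflow described above (match datasets to owner DNs, map DNs to email addresses, warn users before deletion) could be sketched roughly as follows. Every name in this snippet, from the sender address to the SMTP host, is a placeholder; this is not the actual US DDM cleanup tool.

```python
# Sketch of the notification step described above: expiring USERDISK datasets are
# grouped by owner and each owner gets one email. The lookup inputs, the sender
# address, the SMTP host, and the list URL are all hypothetical placeholders.
import smtplib
from collections import defaultdict
from email.message import EmailMessage

LIFETIME_DAYS = 60  # USERDISK datasets live at least 2 months

def notify_expiring_datasets(expiring, dn_to_email, list_url):
    """expiring: list of (dataset_name, owner_dn) pairs older than LIFETIME_DAYS."""
    by_owner = defaultdict(list)
    for dataset, dn in expiring:
        email = dn_to_email.get(dn)
        if email is None:
            print("manual follow-up needed, no email for DN:", dn)
            continue
        by_owner[email].append(dataset)

    for email, datasets in by_owner.items():
        msg = EmailMessage()
        msg["Subject"] = "USERDISK cleanup: %d of your datasets will be deleted" % len(datasets)
        msg["From"] = "usatlas-ddm-ops@example.org"   # placeholder address
        msg["To"] = email
        msg.set_content(
            "The following datasets are older than %d days and scheduled for deletion:\n\n%s\n\n"
            "Full list and instructions: %s\n" % (LIFETIME_DAYS, "\n".join(datasets), list_url)
        )
        with smtplib.SMTP("localhost") as smtp:       # placeholder SMTP host
            smtp.send_message(msg)
```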
17
Possible Improvements
- Continue discussions with central operations on the limitations of the size-based quota system for production space tokens.
- There is no need to put artificial restrictions on the storage used by the production system; the only meaningful limit should be the physical size of the pool where the production space tokens are located.
- This issue was escalated in particular from the xrootd sites, where individual space token sizes have no real meaning; unfortunately, the central deletion service currently cannot operate on space tokens without an individual size.
- The admins of the xrootd sites are not thrilled with the idea of changing the space organization and reporting:
  - You have to implement artificial space token sizes.
  - If you do not follow up and correct them all the time, the available space can go negative, which happens quite often when the storage is close to full.
  - You need to restart the system after every change, which can create errors on the data transfer dashboard.
October 13, 2010 Kaushik De 17
18
Possible Improvements (continued)
- On the other hand, managing just one pool is very convenient. In fact we imitate the single-pool situation when we move free space from one space token to another in dCache, and we do that quite often when a site starts to get full.
- From the point of view of the deletion service, it should also be much easier to clean up in any space token and achieve the goal, rather than having to find the most useless data within one specific space token.
- For the non-production space tokens (USERDISK, SCRATCHDISK, GROUPDISK, LOCALGROUPDISK) we can also operate without a size limit, but we will most definitely need some kind of quota system for users and groups, to monitor usage and limit storage abuse.
- Central operations generally agree with the proposed changes; we need to continue the discussions and follow the software developments covering these issues.
October 13, 2010 Kaushik De 18
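For the user/group quota system proposed above, a first cut could be as simple as comparing accounted usage against per-owner limits. The quota values and usage numbers below are invented for illustration; no such tool is implied by the slide.

```python
# Sketch of the kind of per-user/group quota check proposed above for the
# non-production space tokens. Quota values and the accounting input are
# illustrative; the slide does not define an actual implementation.

USER_QUOTA_TB = 2.0     # hypothetical per-user limit on USERDISK/LOCALGROUPDISK
GROUP_QUOTA_TB = 20.0   # hypothetical per-group limit on GROUPDISK

def over_quota(usage_tb, quota_tb):
    """Return owners whose accounted usage exceeds the quota."""
    return {owner: used for owner, used in usage_tb.items() if used > quota_tb}

# usage_tb would come from a DDM/SE accounting dump; numbers here are made up
userdisk_usage = {"alice": 0.4, "bob": 3.1, "carol": 1.9}
print("USERDISK over quota:", over_quota(userdisk_usage, USER_QUOTA_TB))
```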
19
DAST: Distributed Analysis Support Team
- DAST started in September 2008 to support pathena and Ganga users; it is led by N. Ozturk in the US time zone.
- First point of contact for all distributed analysis questions: Athena, physics analysis tools, conditions database access, site/service problems, dq2-* tools, data access at sites, data replication, etc.
- Help is provided to users via the forum hn-atlas-dist-analysis-help@cern.ch.
- DAST shifts are new Class-2 shifts in OTP (this attracted a few more people). There are three levels of shifts in 3 time zones (European, American, Asia-Pacific):
  - 1st level: trained shifter, shift credit 100%, 7 days/week
  - 2nd level: expert shifter, shift credit 50%, 7 days/week
  - Trainee level: trainee shifter, shift credit 50%, 7 days/week
- American shifts have been fully covered since the beginning of the team (no weekend shifts yet; the load is still manageable with 5-day shifts).
October 13, 2010 Kaushik De 19
20
Manpower in DAST
- EU time zone: Daniel van der Ster, Mark Slater, Hurng-Chun Lee, Manoj Jha, Christian Kummer, Maria Shiyakova, Jaroslava Schovancova, Elena Oliver Garcia, Frederic Brochu, Karl Harrison, Carl Gwilliam, Mohamed Gouighri, Borge Gjelsten, Katarina Pajchel
- NA time zone: Nurcan Ozturk (taking shifts in the EU time zone now), Alden Stradling, Sergey Panitkin, Bill Edson, Wensheng Deng, Venkat Kaushik, Shuwei Ye, Nils Krumnack, Woochun Park, Daniel Geerts
- (On the original slide, new members were marked in blue and trainees in red.)
- We are at the absolute minimum (8 people) in the US time zone most of the time; there is a continuous effort to find and train new people.
October 13, 2010 Kaushik De 20
21
New Data Distribution Model
- The first 6 months of LHC data showed the importance of data distribution for successful analysis site usage.
- Looking ahead to more LHC data: storage will get saturated, and we need to scale up to more users.
- PD2P was the first step in fixing some of these problems.
October 13, 2010 Kaushik De 21
22
The Success: Data Distribution Power for ATLAS (from Kors)
(Plot: data distribution throughput in MB/s per day, January through July, covering 2009 data reprocessing, MC reprocessing, data and MC reprocessing, the start of 7 TeV data-taking, and the start of 10^11 p/bunch operation. Rates of 6 GB/s with peaks of 10 GB/s were achieved, against a design rate of ~2 GB/s.)
23
Difficulty 1 (from Kors)
- A small fraction of the data we distribute is actually used.
(Plot: access counts for data* datasets; only accesses made by official tools are counted. There are ~200k datasets.)
24
Difficulty 2 (from Kors)
- We do not know a priori which data type will be used most.
(Same plot as before, normalized to the number of files per dataset.)
25
Difficulty 3 (from Kors)
- Data is popular for a very short time.
- Example dataset: data10_7TeV.00158116.physics_L1Calo.recon.ESD.f271 (99479 events, 6 replicas, 6066 files, 35 users, 17.1 TB).
- Note: the search covered the last 120 days, but the dataset was only used for 13 days.
26
Data Distribution to Tier 2s (from Kors, SW Week)
- Most user analysis jobs run at Tier 2 sites, and jobs are sent to the data, so we rely on pushing data out to Tier 2 sites promptly.
- This is difficult since there are many data formats and many sites; we frequently adjusted the number of copies and data types in April and May.
- But Tier 2 sites were filling up too rapidly, the user access pattern was unpredictable, and most datasets copied to Tier 2s were never used.
Oct 5, 2010 Kaushik De 26
27
We Changed the Data Distribution Model (from Kors, SW Week)
- Reduce pushed data copies to Tier 2s: only a small fraction of AODs is sent automatically; all other data types are pulled when needed by users.
- Note: for production we have always pulled data as needed.
- Users were insulated from this change: it did not affect the many critical ongoing analyses, there were no delays in running jobs, and there was no change in user workflow.
Oct 5, 2010 Kaushik De 27
28
Data Flow to Tier 2s
- The example shown is from the US Tier 2 sites: an exponential rise in April and May, after the LHC start.
- We changed the data distribution model at the end of June (PD2P).
- The rise has been much slower since July, even as the luminosity grows rapidly.
Oct 5, 2010 Kaushik De 28
29
What is PD2P?
- Dynamic data placement at Tier 2s: continue automatic distribution to Tier 1s (treat them as repositories), and reduce automatic data subscriptions to Tier 2s, using PD2P instead.
- The plan:
  - PanDA will subscribe a dataset to a Tier 2 as soon as any user needs it, if no other copies are available (except at a Tier 1). User jobs still go to the Tier 1 while the data is being transferred, so there is no delay.
  - PanDA will subscribe replicas to additional Tier 2s, if needed, based on the backlog of jobs using the dataset (PanDA checks continuously).
  - Cleanup will be done by the central DDM popularity-based cleaning service (as described in the previous talk by Stephane).
- A few caveats: start with DATADISK and MCDISK; exclude RAW, RDO and HITS datasets from PD2P; restrict transfers to within the cloud for now; do not add sites that are too small (mainly in storage) or too slow.
Oct 5, 2010 Kaushik De 29
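A highly simplified sketch of the first-request placement rule described above follows. The real decision is made inside the PanDA brokerage; the function, the site names, and the way a target Tier 2 is chosen are all illustrative.

```python
# Highly simplified sketch of the PD2P first-request placement described above.
# The real logic lives inside the PanDA brokerage; everything here, including
# the site names, is illustrative.

EXCLUDED_TYPES = {"RAW", "RDO", "HITS"}   # never placed at Tier 2s by PD2P

def pd2p_first_subscription(dataset, data_type, replica_sites, cloud_tier2s):
    """Subscribe a dataset to one Tier 2 in the same cloud when a user first
    requests it, if the only existing copies are at Tier 1s."""
    if data_type in EXCLUDED_TYPES:
        return None                        # policy: these types stay at Tier 1
    if any(site in cloud_tier2s for site in replica_sites):
        return None                        # a Tier 2 in this cloud already has it
    target = cloud_tier2s[0]               # brokering details omitted
    print("PD2P: subscribing", dataset, "to", target)
    return target

# Toy example: the dataset exists only at the Tier 1, so a user job triggers placement.
pd2p_first_subscription(
    dataset="data10_7TeV.00158116.physics_L1Calo.recon.ESD.f271",
    data_type="ESD",
    replica_sites=["TIER1_DATADISK"],                    # illustrative site names
    cloud_tier2s=["TIER2_A_DATADISK", "TIER2_B_DATADISK"],
)
```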
30
Main Goals
- User jobs should not experience delays due to data movement.
- The first dataset replication is 'request' based: any user request to run jobs triggers replication to a Tier 2 chosen by PanDA brokering, no matter how small or large the request.
- Additional dataset replication is 'usage' based: replicas are sent to more Tier 2s if a threshold is crossed (many jobs are waiting for the dataset).
- The types of datasets replicated are 'policy' based: following the Computing Model, RAW, RDO and HITS are never replicated to Tier 2s by PanDA (we may add more complex rules later to allow a small fraction of these types). PanDA replicates only to DATADISK and MCDISK, for now.
- The replication pattern is 'cloud' based: even though the subscription source is not specified, PanDA currently only initiates replication if a source is available within the cloud (we hope to relax this in the next phase of tests).
Oct 5, 2010 Kaushik De 30
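The 'usage' based rule can likewise be reduced to a threshold check on the backlog of waiting jobs. The threshold value below is made up for illustration; the slide does not quote a number.

```python
# Sketch of the 'usage' based rule: add another Tier 2 replica once the backlog
# of jobs waiting for a dataset crosses a threshold. The threshold value and the
# bookkeeping are invented; the real check runs continuously inside PanDA.

EXTRA_REPLICA_THRESHOLD = 200   # hypothetical waiting-job count per existing replica

def needs_extra_replica(waiting_jobs, n_tier2_replicas):
    """True if the dataset is hot enough to justify one more Tier 2 copy."""
    return waiting_jobs > EXTRA_REPLICA_THRESHOLD * max(n_tier2_replicas, 1)

# With one Tier 2 copy and 350 queued jobs, a second copy would be requested.
print(needs_extra_replica(waiting_jobs=350, n_tier2_replicas=1))
```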
31
Some Statistics
- PD2P has been running for 3+ months now, since June 15: first in the US cloud, then the FR cloud, and now the IT cloud.
- 5870 datasets have been subscribed so far.
- Most datasets are never used and therefore never copied to a Tier 2.
- The majority of datasets copied by PD2P are still not reused at the Tier 2 (this will change soon because of automatic re-brokering), but those that are reused are reused often: 1,634,272 files have been reused by other user jobs in 3+ months.
- Now let's look at some PD2P results/plots.
Sep 27, 2010 Kaushik De 31
32
Distribution Among Sites is Even Sep 27, 2010 Kaushik De 32
33
Rate is Also Even
(Plot: subscription rate, summed over all three clouds.)
Sep 27, 2010 Kaushik De 33
34
Reuse of PD2P Files
(Plot: the number of datasets subscribed by PD2P that were later accessed by other users; the x-axis shows the number of files accessed.)
Sep 27, 2010 Kaushik De 34
35
Patterns of Data Usage, Part 1
- Interesting patterns are emerging by type of data.
- LHC data is reused more often than MC data, which is not unexpected.
Sep 27, 2010 Kaushik De 35
36
Patterns of Data Usage, Part 2
- Interesting patterns also appear by format of data.
- During the past 3+ months all types of data have shown up; ESD, NTUP, AOD and DED are the most popular.
- But the highest reuse (counting files) is for ESD and NTUP.
Sep 27, 2010 Kaushik De 36
37
Trends in Data Reuse
- The PD2P pull model needs no a priori assumption about which data types are popular for user analysis; it automatically moves data based on user workflow.
- We now observe a shift towards using DPDs (NTUP).
Oct 5, 2010 Kaushik De 37
38
Recent Improvements to PD2P
- Re-brokering was implemented two weeks ago: PanDA will now re-broker jobs to a different site if they remain in the queue too long (site problems, too many users, long jobs, ...).
- Side effect: users can now use dataset containers for output. If dataset containers are used, sub-jobs may be brokered to multiple sites for faster execution (in the past, all sub-jobs went to a single site chosen by PanDA).
- The results of these changes do not show up in the plots yet, but they will speed up user job completion and balance the load better among sites.
Oct 5, 2010 Kaushik De 38
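A minimal sketch of the re-brokering idea, assuming a simple wait-time threshold: the 6-hour limit, the queue snapshot, and the site names are illustrative, not the actual PanDA implementation.

```python
# Minimal sketch of re-brokering: jobs waiting too long at one site are moved to
# another candidate site. The 6-hour threshold and all structures are illustrative.

MAX_WAIT_HOURS = 6.0   # hypothetical limit before a queued job is re-brokered

def rebroker(queued_jobs, site_candidates):
    """queued_jobs: list of dicts with 'id', 'site', 'wait_hours'.
    Returns a list of (job_id, old_site, new_site) reassignments."""
    moves = []
    for job in queued_jobs:
        if job["wait_hours"] <= MAX_WAIT_HOURS:
            continue
        alternates = [s for s in site_candidates if s != job["site"]]
        if alternates:
            moves.append((job["id"], job["site"], alternates[0]))
    return moves

# Toy queue: one job has waited 9 hours at a busy site and gets reassigned.
print(rebroker(
    [{"id": 1, "site": "SITE_A", "wait_hours": 9.0},
     {"id": 2, "site": "SITE_B", "wait_hours": 1.5}],
    site_candidates=["SITE_A", "SITE_B", "SITE_C"],
))
```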
39
New PD2P Monitoring Oct 5, 2010 Kaushik De 39
40
What Next?
- Is it time to tune the PD2P algorithm? Not yet: the rate of subscriptions is still low (much lower than subscribing all available datasets, as before PD2P).
- A low threshold for the first subscription helps additional users, even if the subscribed datasets are seldom reused; a high threshold for multiple subscriptions means only hot datasets get extra copies.
- We will monitor and optimize PD2P as the data volume grows, and we are looking at the possibility of matching data size to site capability.
- Can we improve and expand to other caching models? Many ideas are on the table, for example ROOT TreeCache or XRootD-based caching; these require longer-term development.
- Large Scale Demonstrator LST2010: a CERN IT and ATLAS project.
Oct 5, 2010 Kaushik De 40
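Since ROOT TreeCache is named above as one candidate caching model, here is a small PyROOT sketch of how an analysis job would enable it. The file URL and tree name are placeholders; only the TTree cache calls illustrate the mechanism, and this is not an ATLAS-endorsed setup.

```python
# PyROOT sketch of the ROOT TreeCache mechanism mentioned above. The file URL
# and tree name are placeholders; only the TTree cache calls are the point here.
import ROOT

f = ROOT.TFile.Open("root://some.storage.site//path/to/dataset.root")  # placeholder
tree = f.Get("CollectionTree")                                          # placeholder name

tree.SetCacheSize(100 * 1024 * 1024)   # 100 MB read-ahead cache
tree.AddBranchToCache("*", True)       # cache all branches (could restrict to a few)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)                   # reads come from the cache in large blocks
    # ... user analysis on the entry ...

print("bytes read from storage:", f.GetBytesRead())
print("read calls:", f.GetReadCalls())
```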
41
Wish List from Kors
- As we learn how to pull data, we should remove the artificial cloud boundaries (currently ATLAS has 10 clouds = 10 Tier 1s):
  - First: allow Tier 2s to get data from other Tier 1s.
  - Second: allow Tier 2s to get data from each other (already allowed in some clouds).
  - Finally: break down all artificial topological boundaries, preserving only the real ones.
Oct 5, 2010 Kaushik De 41
42
Data: The Ultimate Pull Model (from Kors)
(Diagram: the T0, Tier 1s, and many Tier 2s, fully interconnected.)
- Data can be pulled from anywhere.
- This needs another network: not necessarily more bandwidth, but a different topology.
43
The OPN + T2PN: A Possible Architecture (from Kors)
(Diagram: the T0 and Tier 1s on the OPN, with Tier 2s grouped around access points (APs): 2 Tier 2 APs in the EU and 2 in the US, well connected to the OPN.)
44
Moving Forward in US Operations
- Activities in the next 6 months:
  - Expect a lot more data: space management
  - Need more resources, CPUs especially
  - Expect many more users: scaling up distributed analysis
  - Tier 3s become more important for end-user analysis
- Priorities:
  - Maintain smooth operations in a challenging environment
  - Consolidate and incrementally update distributed software
  - Data cleanup/consolidation
October 13, 2010 Kaushik De 44
45
Summary
- US ATLAS Computing Operations provides critical support during the LHC data-taking period.
- We work closely with many US and ATLAS-wide teams; communication is critical, and we have a good track record here.
- Some systems are under active development, and operations people provide continuous feedback.
- We need to improve the automation of all systems.
- Documentation always comes last; this needs to improve.
October 13, 2010 Kaushik De 45