1
US ATLAS Computing Operations. Kaushik De, University of Texas at Arlington. US ATLAS Distributed Facility Workshop at SLAC, October 13, 2010.
2
Overview
- LHC is running. We expect ~50 pb^-1 by the end of this month.
- ATLAS focus has switched to physics results; ~20 publications are already in the pipeline.
- Distributed computing played a critical role in this success: congratulations and thanks to all of you.
- US computing operations are now closely integrated with ADC, as they should be during data-taking operations.
- The RAC is playing an active role in resource decisions.
- This talk covers overall ATLAS/ADC first, then US operations.
October 13, 2010 Kaushik De 2
3
Production
- Central production has been choppy: delays due to software releases, especially during the summer.
- The US has done its share reliably, but is now short of CPU resources.
(Plots: production over the past year and the past month; a big crunch is coming for the winter conferences.)
October 13, 2010 Kaushik De 3
4
Distributed Analysis
- Distributed analysis has scaled impressively: a factor of 10 more running jobs compared to a year ago.
- LHC reality is a factor of five higher than the stress test!
- We may need to scale even higher as the data volume grows.
(Plot annotations: Stress Test, LHC Start.)
October 13, 2010 Kaushik De 4
5
US Analysis Sites
- US sites have also scaled up impressively; all sites are running more analysis jobs.
- We had to make quick adjustments after the LHC start.
- Data distribution/management has been critical.
(Plot annotations: Stress Test, LHC Start.)
October 13, 2010 Kaushik De 5
6
No Matter How You Slice It
- As Kors recently presented, ATLAS has ~75 analysis sites, but 75% of analysis is done at only 20 sites and 90% at 36 sites.
- 7 of the top 20 sites are in the US (based on July + August data).
- US sites are running more jobs than many Tier 1 analysis sites.
(From Kors, July + August data.)
October 13, 2010 Kaushik De 6
7
Moving Forward in ADC
- We are doing remarkably well.
- Areas that are becoming important:
  - PAT (Physics Analysis Tools): Athena vs. ROOT analysis, standardizing user libraries, a default user tool
  - Shifts (DAST, Point1, ADCoS)
  - Integrated monitoring (converge on the PanDA monitoring and DDM monitoring platforms)
  - Data distribution: we need to get smarter as the data volume increases
  - Group production -> central production
  - Tier 3s becoming increasingly important
October 13, 2010 Kaushik De 7
8
What is US Operations?
- Data production: MC, reprocessing
- Data management: storage allocations, data distribution
- User analysis: site testing, validation
- Distributed computing shift teams: ADCoS, DAST
- Successful US computing operations is only possible because of the excellent US site managers at BNL and the Tier 2s.
October 13, 2010 Kaushik De 8
9
MC Production and Reprocessing
- The cornerstone of computing operations; the US team has more than 6 years of experience.
- Responsible for:
  - Efficient utilization of resources at Tier 1/2 sites
  - Monitoring site and task status 7 days a week (site online/offline)
  - Monitoring data flow
  - Reporting software validation issues
  - Reporting task and distributed software issues
- Part of the ADCoS shift team:
  - US team: Yuri Smirnov (captain), Mark Sosebee, Armen Vartapetian, Wensheng Deng
  - Coverage is 24/7, using shifts in 3 different time zones
- Task management / reprocessing / group production: Pavel Nevski
October 13, 2010 Kaushik De 9
10
Storage Management in the US
- The following slides are from Armen Vartapetian; Hiro Ito and Wensheng Deng are critical to the success of data management.
- Storage/data management is the most demanding and time-consuming operations activity.
- BNL (T1) and 5 Tier 2s: AGLT2 (Michigan), MWT2 (Chicago, Indiana), NET2 (Boston, Harvard), SWT2 (Arlington, Oklahoma), WT2 (SLAC).
- Storage systems: dCache, xrootd, GPFS, Lustre.
- All the site admins are part of a storage management group and participate in weekly US storage meetings to coordinate activities, exchange experience, and solve problems.
- Important decisions are discussed at the weekly US Resource Allocation Committee (RAC) meetings; the RAC decides priorities on the usage of computing resources, while the overall ATLAS priorities for the pledged resources are set by the ATLAS CREM committee.
October 13, 2010 Kaushik De 10
11
Primer on Space Tokens
- DATADISK: ESD (full copy at BNL, some versions at T2s), RAW (BNL only), AOD (four copies among US sites)
- MCDISK: AODs (four copies among US sites), ESDs (full copy at BNL), centrally produced DPDs (all sites), some HITS/RDOs
- DATATAPE/MCTAPE: archival data at BNL (mainly RAW)
- USERDISK: pathena output, limited lifetime (variable, at least 60 days, users notified before deletion)
- SCRATCHDISK: Ganga output, temporary user datasets, limited lifetime (maximum 30 days, no notification before deletion)
- GROUPDISK: physics/performance group data
- LOCALGROUPDISK: storage for (geographically) local groups/users
- PRODDISK: only used by PanDA production at Tier 2 sites
- HOTDISK: database releases (including conditions), SW tarballs
October 13, 2010 Kaushik De 11
12
Primer on Cleanup
- MCDISK/DATADISK: cleaned by central deletion or US DDM
- MCTAPE/DATATAPE: cleaned by central deletion, with notification to BNL to clean/reuse tapes
- SCRATCHDISK: cleaned by central deletion
- GROUPDISK: cleaned by central deletion
- HOTDISK: never cleaned!
- PRODDISK: cleaned by the site
- USERDISK: cleaned by US DDM
- LOCALGROUPDISK: cleaned by US DDM
October 13, 2010 Kaushik De 12
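The cleanup responsibilities above lend themselves to a simple lookup table. The following is a minimal sketch, not an official ADC or US DDM tool: the tokens, owners, and lifetimes are taken from the two primer slides, while the helper function and script structure are purely illustrative.

```python
# Sketch only: encodes the cleanup policy from the two "Primer" slides.
# The table values come from the slides; the helper itself is hypothetical.

SPACE_TOKEN_POLICY = {
    # token            (who cleans it,           minimum lifetime in days)
    "DATADISK":       ("central or US DDM",      None),
    "MCDISK":         ("central or US DDM",      None),
    "DATATAPE":       ("central (notify BNL)",   None),
    "MCTAPE":         ("central (notify BNL)",   None),
    "SCRATCHDISK":    ("central",                30),   # no user notification
    "GROUPDISK":      ("central",                None),
    "HOTDISK":        (None,                     None), # never cleaned
    "PRODDISK":       ("site",                   None),
    "USERDISK":       ("US DDM",                 60),   # users notified first
    "LOCALGROUPDISK": ("US DDM",                 None),
}

def cleanup_owner(token: str) -> str:
    """Return who is responsible for cleaning a space token (or 'nobody')."""
    owner, _lifetime = SPACE_TOKEN_POLICY[token]
    return owner or "nobody (never cleaned)"

if __name__ == "__main__":
    for tok in ("USERDISK", "HOTDISK"):
        print(tok, "->", cleanup_owner(tok))
```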
13
Available Storage Space
- Usually all of the available space is online, particularly during the recent months of high data flow.
- The general principle is to keep as much data as possible on disk (80%-90% full) and to leave as little storage idle as possible.
- In recent months the data volume has been high regardless, so all sites/space tokens are quite full (the situation is a bit more relaxed at BNL with the arrival of new storage).
- We try to move available space between space tokens (at dCache sites) to provide space where it is most needed; if that is not possible, we carry out additional deletions.
- Cleanup of old data that is no longer used is one of the most important issues.
- Sites are currently adding storage towards their pledged capacities, so the situation is generally improving, but we are getting ready for the next challenge.
October 13, 2010 Kaushik De 13
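As a concrete illustration of the 80%-90% rule above, a monitoring script could classify each space token by its fill fraction. This is only a sketch: the thresholds come from the slide, while the numbers, actions, and function name are hypothetical.

```python
# Minimal sketch of the "keep disks 80-90% full" rule described above.
# The thresholds come from the slide; the data source and actions are hypothetical.

WATERMARK_HIGH = 0.90   # above this, a token is at risk of filling up
WATERMARK_LOW = 0.80    # below this, space could be shifted to a fuller token

def classify_token(used_tb: float, total_tb: float) -> str:
    """Classify a space token by its fill fraction."""
    fill = used_tb / total_tb
    if fill > WATERMARK_HIGH:
        return "at risk: submit extra deletions or borrow space from another token"
    if fill < WATERMARK_LOW:
        return "has headroom: candidate to donate space (dCache sites)"
    return "in the target 80-90% band"

# Example with made-up numbers for two tokens at one site
for token, used, total in [("DATADISK", 460.0, 500.0), ("MCDISK", 300.0, 500.0)]:
    print(token, "->", classify_token(used, total))
```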
14
Recent Activities and Experience
- In recent months, data/MC reprocessing and distribution, plus LHC data at increasing luminosity, put significant stress on the SEs at all sites.
- Overall US cloud performance was quite good, partly due to the effort to always keep free space available.
- Central cleanup does not always catch up with fast-arriving data; during periods of high data flow, the space tokens at various sites are routinely at risk of filling up.
- Additional space reorganization and cleanup effort is needed to stay afloat; in the US we constantly monitor the space situation and submit additional deletions for the space tokens at risk.
- Part of the solution was also to stop distributing everything (ESDs) and to distribute instead what users actually run on (PD2P).
October 13, 2010 Kaushik De 14
15
Central Deletion Service
- The overall experience of the site admins with the automatic cleanup system is not very positive:
  - It is often late, acting only when a site has already run out of space; the grace period should be reduced or the deletion criteria and process optimized.
  - It is very difficult to trace a deletion through all its steps and to monitor it, and the interface is not very user friendly.
  - No total size values are presented, so it is hard to understand the gain.
  - The log file archiving needs better organization.
- A better central deletion service would significantly help data management operations; development of the service is underway by central operations.
October 13, 2010 Kaushik De 15
16
USERDISK Cleanup in the US
- pathena output for jobs running at US sites goes to USERDISK; cleanup is done by US operations.
- The lifetime of the datasets is at least 2 months.
- We send a notification email to users about the upcoming cleanup, with a link to the dataset list and basic instructions on how to proceed if a dataset is still needed.
- User datasets in USERDISK are matched to the dataset owner's DN from the DDM catalog, and the DN is matched to a known email address; manual effort is often needed for missing or obsolete addresses.
- Users give quite positive reviews of this cleanup operation when they are notified in advance.
- We plan to run the monitoring software and the USERDISK cleanup service on a more regular basis.
October 13, 2010 Kaushik De 16
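The notification workflow described above (match datasets to owner DNs, map DNs to email addresses, warn users before deletion) could be sketched roughly as follows. Every name in this snippet, from the sender address to the SMTP host, is a placeholder; this is not the actual US DDM cleanup tool.

```python
# Sketch of the notification step described above: expiring USERDISK datasets are
# grouped by owner and each owner gets one email. The lookup inputs, the sender
# address, the SMTP host, and the list URL are all hypothetical placeholders.
import smtplib
from collections import defaultdict
from email.message import EmailMessage

LIFETIME_DAYS = 60  # USERDISK datasets live at least 2 months

def notify_expiring_datasets(expiring, dn_to_email, list_url):
    """expiring: list of (dataset_name, owner_dn) pairs older than LIFETIME_DAYS."""
    by_owner = defaultdict(list)
    for dataset, dn in expiring:
        email = dn_to_email.get(dn)
        if email is None:
            print("manual follow-up needed, no email for DN:", dn)
            continue
        by_owner[email].append(dataset)

    for email, datasets in by_owner.items():
        msg = EmailMessage()
        msg["Subject"] = "USERDISK cleanup: %d of your datasets will be deleted" % len(datasets)
        msg["From"] = "usatlas-ddm-ops@example.org"   # placeholder address
        msg["To"] = email
        msg.set_content(
            "The following datasets are older than %d days and scheduled for deletion:\n\n%s\n\n"
            "Full list and instructions: %s\n" % (LIFETIME_DAYS, "\n".join(datasets), list_url)
        )
        with smtplib.SMTP("localhost") as smtp:       # placeholder SMTP host
            smtp.send_message(msg)
```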
17
Possible Improvements
- Continue discussions with central operations on the limitations of the size-based quota system for production space tokens.
- There is no need to put artificial restrictions on the storage used by the production system; the only meaningful limit should be the physical size of the pool where the production space tokens are located.
- This issue was escalated in particular from the xrootd sites, where individual space token sizes have no real meaning; unfortunately, the central deletion service currently cannot operate on space tokens without an individual size.
- The admins of the xrootd sites are not thrilled with the idea of changing the space organization and reporting:
  - You have to implement artificial space token sizes.
  - If you do not follow up and correct them all the time, the available space can go negative, which happens quite often when the storage is close to full.
  - You need to restart the system after every change, which can create errors on the data transfer dashboard.
October 13, 2010 Kaushik De 17
18
Possible Improvements (continued)
- On the other hand, managing just one pool is very convenient. In fact we imitate the single-pool situation when we move free space from one space token to another in dCache, and we do that quite often when a site starts to get full.
- From the point of view of the deletion service, it should also be much easier to clean up in any space token and achieve the goal, rather than having to find the most useless data within one specific space token.
- For the non-production space tokens (USERDISK, SCRATCHDISK, GROUPDISK, LOCALGROUPDISK) we can also operate without a size limit, but we will most definitely need some kind of quota system for users and groups, to monitor usage and limit storage abuse.
- Central operations generally agree with the proposed changes; we need to continue the discussions and follow the software developments covering these issues.
October 13, 2010 Kaushik De 18
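For the user/group quota system proposed above, a first cut could be as simple as comparing accounted usage against per-owner limits. The quota values and usage numbers below are invented for illustration; no such tool is implied by the slide.

```python
# Sketch of the kind of per-user/group quota check proposed above for the
# non-production space tokens. Quota values and the accounting input are
# illustrative; the slide does not define an actual implementation.

USER_QUOTA_TB = 2.0     # hypothetical per-user limit on USERDISK/LOCALGROUPDISK
GROUP_QUOTA_TB = 20.0   # hypothetical per-group limit on GROUPDISK

def over_quota(usage_tb, quota_tb):
    """Return owners whose accounted usage exceeds the quota."""
    return {owner: used for owner, used in usage_tb.items() if used > quota_tb}

# usage_tb would come from a DDM/SE accounting dump; numbers here are made up
userdisk_usage = {"alice": 0.4, "bob": 3.1, "carol": 1.9}
print("USERDISK over quota:", over_quota(userdisk_usage, USER_QUOTA_TB))
```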
19
DAST: Distributed Analysis Support Team
- DAST started in September 2008 to support pathena and Ganga users; it is led by N. Ozturk in the US time zone.
- First point of contact for all distributed analysis questions: Athena, physics analysis tools, conditions database access, site/service problems, dq2-* tools, data access at sites, data replication, etc.
- Help is provided to users via the forum hn-atlas-dist-analysis-help@cern.ch.
- DAST shifts are new Class-2 shifts in OTP (this attracted a few more people). There are three levels of shifts in 3 time zones (European, American, Asia-Pacific):
  - 1st level: trained shifter, shift credit 100%, 7 days/week
  - 2nd level: expert shifter, shift credit 50%, 7 days/week
  - Trainee level: trainee shifter, shift credit 50%, 7 days/week
- American shifts have been fully covered since the beginning of the team (no weekend shifts yet; the load is still manageable with 5-day shifts).
October 13, 2010 Kaushik De 19
20
Manpower in DAST
- EU time zone: Daniel van der Ster, Mark Slater, Hurng-Chun Lee, Manoj Jha, Christian Kummer, Maria Shiyakova, Jaroslava Schovancova, Elena Oliver Garcia, Frederic Brochu, Karl Harrison, Carl Gwilliam, Mohamed Gouighri, Borge Gjelsten, Katarina Pajchel
- NA time zone: Nurcan Ozturk (taking shifts in the EU time zone now), Alden Stradling, Sergey Panitkin, Bill Edson, Wensheng Deng, Venkat Kaushik, Shuwei Ye, Nils Krumnack, Woochun Park, Daniel Geerts
- (On the original slide, new members were marked in blue and trainees in red.)
- We are at the absolute minimum (8 people) in the US time zone most of the time; there is a continuous effort to find and train new people.
October 13, 2010 Kaushik De 20
21
New Data Distribution Model
- The first 6 months of LHC data showed the importance of data distribution for successful analysis site usage.
- Looking ahead to more LHC data: storage will get saturated, and we need to scale up to more users.
- PD2P was the first step in fixing some of these problems.
October 13, 2010 Kaushik De 21
22
The Success: Data Distribution Power for ATLAS (from Kors)
(Plot: data distribution throughput in MB/s per day, January through July, covering 2009 data reprocessing, MC reprocessing, data and MC reprocessing, the start of 7 TeV data-taking, and the start of 10^11 p/bunch operation. Rates of 6 GB/s with peaks of 10 GB/s were achieved, against a design rate of ~2 GB/s.)
23
Difficulty 1 (from Kors)
- A small fraction of the data we distribute is actually used.
(Plot: access counts for data* datasets; only accesses made by official tools are counted. There are ~200k datasets.)
24
Difficulty 2 (from Kors)
- We do not know a priori which data type will be used most.
(Same plot as before, normalized to the number of files per dataset.)
25
Difficulty 3 (from Kors)
- Data is popular for a very short time.
- Example dataset: data10_7TeV.00158116.physics_L1Calo.recon.ESD.f271 (99479 events, 6 replicas, 6066 files, 35 users, 17.1 TB).
- Note: the search covered the last 120 days, but the dataset was only used for 13 days.
26
Data Distribution to Tier 2s (from Kors, SW Week)
- Most user analysis jobs run at Tier 2 sites, and jobs are sent to the data, so we rely on pushing data out to Tier 2 sites promptly.
- This is difficult since there are many data formats and many sites; we frequently adjusted the number of copies and data types in April and May.
- But Tier 2 sites were filling up too rapidly, the user access pattern was unpredictable, and most datasets copied to Tier 2s were never used.
Oct 5, 2010 Kaushik De 26
27
We Changed the Data Distribution Model (from Kors, SW Week)
- Reduce pushed data copies to Tier 2s: only a small fraction of AODs is sent automatically; all other data types are pulled when needed by users.
- Note: for production we have always pulled data as needed.
- Users were insulated from this change: it did not affect the many critical ongoing analyses, there were no delays in running jobs, and there was no change in user workflow.
Oct 5, 2010 Kaushik De 27
28
Data Flow to Tier 2s
- The example shown is from the US Tier 2 sites: an exponential rise in April and May, after the LHC start.
- We changed the data distribution model at the end of June (PD2P).
- The rise has been much slower since July, even as the luminosity grows rapidly.
Oct 5, 2010 Kaushik De 28
29
What is PD2P?
- Dynamic data placement at Tier 2s: continue automatic distribution to Tier 1s (treat them as repositories), and reduce automatic data subscriptions to Tier 2s, using PD2P instead.
- The plan:
  - PanDA will subscribe a dataset to a Tier 2 as soon as any user needs it, if no other copies are available (except at a Tier 1). User jobs still go to the Tier 1 while the data is being transferred, so there is no delay.
  - PanDA will subscribe replicas to additional Tier 2s, if needed, based on the backlog of jobs using the dataset (PanDA checks continuously).
  - Cleanup will be done by the central DDM popularity-based cleaning service (as described in the previous talk by Stephane).
- A few caveats: start with DATADISK and MCDISK; exclude RAW, RDO and HITS datasets from PD2P; restrict transfers to within the cloud for now; do not add sites that are too small (mainly in storage) or too slow.
Oct 5, 2010 Kaushik De 29
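A highly simplified sketch of the first-request placement rule described above follows. The real decision is made inside the PanDA brokerage; the function, the site names, and the way a target Tier 2 is chosen are all illustrative.

```python
# Highly simplified sketch of the PD2P first-request placement described above.
# The real logic lives inside the PanDA brokerage; everything here, including
# the site names, is illustrative.

EXCLUDED_TYPES = {"RAW", "RDO", "HITS"}   # never placed at Tier 2s by PD2P

def pd2p_first_subscription(dataset, data_type, replica_sites, cloud_tier2s):
    """Subscribe a dataset to one Tier 2 in the same cloud when a user first
    requests it, if the only existing copies are at Tier 1s."""
    if data_type in EXCLUDED_TYPES:
        return None                        # policy: these types stay at Tier 1
    if any(site in cloud_tier2s for site in replica_sites):
        return None                        # a Tier 2 in this cloud already has it
    target = cloud_tier2s[0]               # brokering details omitted
    print("PD2P: subscribing", dataset, "to", target)
    return target

# Toy example: the dataset exists only at the Tier 1, so a user job triggers placement.
pd2p_first_subscription(
    dataset="data10_7TeV.00158116.physics_L1Calo.recon.ESD.f271",
    data_type="ESD",
    replica_sites=["TIER1_DATADISK"],                    # illustrative site names
    cloud_tier2s=["TIER2_A_DATADISK", "TIER2_B_DATADISK"],
)
```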
30
Main Goals
- User jobs should not experience delays due to data movement.
- The first dataset replication is 'request' based: any user request to run jobs triggers replication to a Tier 2 chosen by PanDA brokering, no matter how small or large the request.
- Additional dataset replication is 'usage' based: replicas are sent to more Tier 2s if a threshold is crossed (many jobs are waiting for the dataset).
- The types of datasets replicated are 'policy' based: following the Computing Model, RAW, RDO and HITS are never replicated to Tier 2s by PanDA (we may add more complex rules later to allow a small fraction of these types). PanDA replicates only to DATADISK and MCDISK, for now.
- The replication pattern is 'cloud' based: even though the subscription source is not specified, PanDA currently only initiates replication if a source is available within the cloud (we hope to relax this in the next phase of tests).
Oct 5, 2010 Kaushik De 30
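The 'usage' based rule can likewise be reduced to a threshold check on the backlog of waiting jobs. The threshold value below is made up for illustration; the slide does not quote a number.

```python
# Sketch of the 'usage' based rule: add another Tier 2 replica once the backlog
# of jobs waiting for a dataset crosses a threshold. The threshold value and the
# bookkeeping are invented; the real check runs continuously inside PanDA.

EXTRA_REPLICA_THRESHOLD = 200   # hypothetical waiting-job count per existing replica

def needs_extra_replica(waiting_jobs, n_tier2_replicas):
    """True if the dataset is hot enough to justify one more Tier 2 copy."""
    return waiting_jobs > EXTRA_REPLICA_THRESHOLD * max(n_tier2_replicas, 1)

# With one Tier 2 copy and 350 queued jobs, a second copy would be requested.
print(needs_extra_replica(waiting_jobs=350, n_tier2_replicas=1))
```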
31
Some Statistics
- PD2P has been running for 3+ months now, since June 15: first in the US cloud, then the FR cloud, and now the IT cloud.
- 5870 datasets have been subscribed so far.
- Most datasets are never used and therefore never copied to a Tier 2.
- The majority of datasets copied by PD2P are still not reused at the Tier 2 (this will change soon because of automatic re-brokering), but those that are reused are reused often: 1,634,272 files have been reused by other user jobs in 3+ months.
- Now let's look at some PD2P results/plots.
Sep 27, 2010 Kaushik De 31
32
Distribution Among Sites is Even Sep 27, 2010 Kaushik De 32
33
Rate is Also Even
(Plot: subscription rate, summed over all three clouds.)
Sep 27, 2010 Kaushik De 33
34
Reuse of PD2P Files
(Plot: the number of datasets subscribed by PD2P that were later accessed by other users; the x-axis shows the number of files accessed.)
Sep 27, 2010 Kaushik De 34
35
Patterns of Data Usage, Part 1
- Interesting patterns are emerging by type of data.
- LHC data is reused more often than MC data, which is not unexpected.
Sep 27, 2010 Kaushik De 35
36
Patterns of Data Usage, Part 2
- Interesting patterns also appear by format of data.
- During the past 3+ months all types of data have shown up; ESD, NTUP, AOD and DED are the most popular.
- But the highest reuse (counting files) is for ESD and NTUP.
Sep 27, 2010 Kaushik De 36
37
Trends in Data Reuse
- The PD2P pull model needs no a priori assumption about which data types are popular for user analysis; it automatically moves data based on user workflow.
- We now observe a shift towards using DPDs (NTUP).
Oct 5, 2010 Kaushik De 37
38
Recent Improvements to PD2P
- Re-brokering was implemented two weeks ago: PanDA will now re-broker jobs to a different site if they remain in the queue too long (site problems, too many users, long jobs, ...).
- Side effect: users can now use dataset containers for output. If dataset containers are used, sub-jobs may be brokered to multiple sites for faster execution (in the past, all sub-jobs went to a single site chosen by PanDA).
- The results of these changes do not show up in the plots yet, but they will speed up user job completion and balance the load better among sites.
Oct 5, 2010 Kaushik De 38
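A minimal sketch of the re-brokering idea, assuming a simple wait-time threshold: the 6-hour limit, the queue snapshot, and the site names are illustrative, not the actual PanDA implementation.

```python
# Minimal sketch of re-brokering: jobs waiting too long at one site are moved to
# another candidate site. The 6-hour threshold and all structures are illustrative.

MAX_WAIT_HOURS = 6.0   # hypothetical limit before a queued job is re-brokered

def rebroker(queued_jobs, site_candidates):
    """queued_jobs: list of dicts with 'id', 'site', 'wait_hours'.
    Returns a list of (job_id, old_site, new_site) reassignments."""
    moves = []
    for job in queued_jobs:
        if job["wait_hours"] <= MAX_WAIT_HOURS:
            continue
        alternates = [s for s in site_candidates if s != job["site"]]
        if alternates:
            moves.append((job["id"], job["site"], alternates[0]))
    return moves

# Toy queue: one job has waited 9 hours at a busy site and gets reassigned.
print(rebroker(
    [{"id": 1, "site": "SITE_A", "wait_hours": 9.0},
     {"id": 2, "site": "SITE_B", "wait_hours": 1.5}],
    site_candidates=["SITE_A", "SITE_B", "SITE_C"],
))
```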
39
New PD2P Monitoring Oct 5, 2010 Kaushik De 39
40
What Next?
- Is it time to tune the PD2P algorithm? Not yet: the rate of subscriptions is still low (much lower than subscribing all available datasets, as before PD2P).
- A low threshold for the first subscription helps additional users, even if the subscribed datasets are seldom reused; a high threshold for multiple subscriptions means only hot datasets get extra copies.
- We will monitor and optimize PD2P as the data volume grows, and we are looking at the possibility of matching data size to site capability.
- Can we improve and expand to other caching models? Many ideas are on the table, for example ROOT TreeCache or XRootD-based caching; these require longer-term development.
- Large Scale Demonstrator LST2010: a CERN IT and ATLAS project.
Oct 5, 2010 Kaushik De 40
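Since ROOT TreeCache is named above as one candidate caching model, here is a small PyROOT sketch of how an analysis job would enable it. The file URL and tree name are placeholders; only the TTree cache calls illustrate the mechanism, and this is not an ATLAS-endorsed setup.

```python
# PyROOT sketch of the ROOT TreeCache mechanism mentioned above. The file URL
# and tree name are placeholders; only the TTree cache calls are the point here.
import ROOT

f = ROOT.TFile.Open("root://some.storage.site//path/to/dataset.root")  # placeholder
tree = f.Get("CollectionTree")                                          # placeholder name

tree.SetCacheSize(100 * 1024 * 1024)   # 100 MB read-ahead cache
tree.AddBranchToCache("*", True)       # cache all branches (could restrict to a few)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)                   # reads come from the cache in large blocks
    # ... user analysis on the entry ...

print("bytes read from storage:", f.GetBytesRead())
print("read calls:", f.GetReadCalls())
```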
41
Wish List from Kors
- As we learn how to pull data, we should remove the artificial cloud boundaries (currently ATLAS has 10 clouds = 10 Tier 1s):
  - First: allow Tier 2s to get data from other Tier 1s.
  - Second: allow Tier 2s to get data from each other (already allowed in some clouds).
  - Finally: break down all artificial topological boundaries, preserving only the real ones.
Oct 5, 2010 Kaushik De 41
42
Data: The Ultimate Pull Model (from Kors)
(Diagram: the T0, Tier 1s, and many Tier 2s, fully interconnected.)
- Data can be pulled from anywhere.
- This needs another network: not necessarily more bandwidth, but a different topology.
43
The OPN + T2PN: A Possible Architecture (from Kors)
(Diagram: the T0 and Tier 1s on the OPN, with Tier 2s grouped around access points (APs): 2 Tier 2 APs in the EU and 2 in the US, well connected to the OPN.)
44
Moving Forward in US Operations
- Activities in the next 6 months:
  - Expect a lot more data: space management
  - Need more resources, CPUs especially
  - Expect many more users: scaling up distributed analysis
  - Tier 3s become more important for end-user analysis
- Priorities:
  - Maintain smooth operations in a challenging environment
  - Consolidate and incrementally update distributed software
  - Data cleanup/consolidation
October 13, 2010 Kaushik De 44
45
Summary
- US ATLAS Computing Operations provides critical support during the LHC data-taking period.
- We work closely with many US and ATLAS-wide teams; communication is critical, and we have a good track record here.
- Some systems are under active development, and operations people provide continuous feedback.
- We need to improve the automation of all systems.
- Documentation always comes last; this needs to improve.
October 13, 2010 Kaushik De 45