US ATLAS DDM Operations
Alexei Klimentov, BNL
US ATLAS Tier-2 Workshop, UCSD, Mar 8th, 2007
DDM Operations (US ATLAS)
Coordination: Alexei Klimentov and Wensheng Deng
– BNL: Wensheng Deng, Hironori Ito, Xin Zhao
– GLTier2 (UM): Shawn McKee
– NETier2 (BU): Saul Youssef
– WTier2 (SLAC): Wei Yang
– SWTier2: Patrick McGuigan
  – OU: Horst Severini, Karthik Arunachalam
  – UTA: Patrick McGuigan, Mark Sosebee
– MWTier2: Dan Schrager
  – IU: Kristy Kallback-Rose, Dan Schrager
  – UC: Robert Gardner, Greg Cross
DDM Operations (US ATLAS): Main Activities
– DDM: AODs, EVGEN, DB releases distribution, RDO consolidation: W.Deng, A.Klimentov, P.Nevski
– LFC stress test: A.Klimentov
– DDM development and deployment (0.2.12/0.3): P.McGuigan, W.Deng, H.Ito
– End-user tools (development and support): T.Maeno
– Monitoring and metrics: P.McGuigan, S.Reddy, D.Adams, H.Ito, A.Klimentov
– User support: X.Zhao, H.Ito, H.Severini, A.Klimentov
– Documentation, FAQs, wiki, procedures: H.Severini
  https://twiki.cern.ch/twiki/bin/view/Atlas/WorkBookDDM
– Troubleshooting console: R.Gardner et al.
AODs, RDOs, EVGEN and DB releases distribution (EVGEN & RDO)
– RDO replication follows the Physics Coordinator's requests:
  – from BNL (~8 TB) to LYON and FZK
  – from all sites (Tier-1 and Tier-2; for US ATLAS, BNLTAPE and BNLPANDA) to CERN
  – Status is available at http://panda.atlascomp.org/?mode=listRDOReplications
– EVGEN (P.Nevski): a systematic replication of event generator input files to all Tier-1 centers for production needs has started. For the moment only recently produced EVGEN datasets are subscribed and no dedicated monitoring is available; this will be improved in the future.
AODs, RDOs, EVGEN and DB releases distribution (AODs)
Follows the ATLAS and US ATLAS computing models:
– ATLAS model: each cloud must have at least 2 complete copies of the AODs, one on the Tier-1 and the second shared between the Tier-2s.
– US ATLAS model: each Tier-2 must have a complete AOD copy.
  – Currently: AGLT2, BU, MWT2, UTA – 100%; SLAC – 10%
  – http://panda.atlascomp.org/?mode=listAODReplications
Two-step central subscription policy (see the sketch below):
– AODs are subscribed centrally from the source Tier-1 to all Tier-1s and CERN (only if the datasets have files).
– AODs are subscribed centrally from the parent Tier-1 to the Tier-2s within the cloud.
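A minimal sketch of the two-step policy, for illustration only: the register_subscription helper, the site lists, and the dataset name are hypothetical stand-ins, not the actual DQ2 0.2.x API.

    # Sketch of the two-step central subscription policy described above.
    # register_subscription() is a hypothetical stand-in for the DQ2 call.
    TIER1S = ["CERN", "BNLDISK", "LYON", "FZK", "CNAF", "RAL", "ASGC",
              "PIC", "TRIUMF", "NIKHEF", "NDGF"]      # illustrative list
    US_TIER2S = ["AGLT2", "BU", "MWT2", "UTA", "SLAC"]

    def register_subscription(dataset, site):
        """Hypothetical stand-in for the DQ2 subscription call."""
        print("subscribe %s -> %s" % (dataset, site))

    def replicate_aod(dataset, source_tier1, has_files):
        # Step 1: from the source Tier-1 to all Tier-1s and CERN,
        # but only if the dataset already has files.
        if not has_files:
            return
        for t1 in TIER1S:
            if t1 != source_tier1:
                register_subscription(dataset, t1)
        # Step 2: from the parent Tier-1 (BNL for the US cloud)
        # to the Tier-2s within the cloud.
        for t2 in US_TIER2S:
            register_subscription(dataset, t2)

    # Hypothetical dataset name, for illustration only.
    replicate_aod("csc11.005300.AOD.v12000601", "BNLDISK", has_files=True)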
AODs, RDOs, EVGEN and DB releases distribution. AODs (cont.)
Last subscription round: 138K files, 600 datasets (total).
– 2 DDM instances: BNLDISK (initially BNLTAPE) and BNLPANDA
  – BNLDISK: destination for Tier-1 data and source for US Tier-2s
  – BNLPANDA: source for Tier-1s and for US Tier-2s, spreading the load on the BNL facilities
– BNL demonstrates steady performance:
  – All subscriptions are processed.
  – Data to BNL: ~1+ TB/week (reached after Feb 25th); data from BNL is also replicated to other sites with ~100% performance.
– 85% of files are replicated (similar numbers for CERN, FZK, LYON and NDGF). The remaining 15%:
  – 2-3% related to a BNL problem
  – 7% data not available on the source Tier-1s
  – ~5% instability of the central services
– PIC: out of disk space
– ASGC, RAL, NIKHEF/SARA: process ~50% of subscriptions
– CNAF, TRIUMF: backlog is growing
– AOD data volume on Tier-2s (P.McGuigan's page): http://gk03.swt2.uta.edu:8000/bnl_cloud.html

  Subscription period                 Feb 5 - Feb 22   Feb 25 - Mar 4   Mar 5 - Mar 8
  Datasets/files                      40/10K           230/45K          400/80K
  Performance/waiting subscriptions   95%/0            85%/2            71%/6
AODs, RDOs, EVGEN and DB releases distribution. AODs (known problems)
– Central services saturation
– Large (10+K files) datasets
– Interference of AOD and MC input file distribution within the US cloud: data are coming to the Tier-2s, but the AOD transfers dominate. Hypotheses:
  – MC data transfers get lower priority after several transfer failures
  – data is coming proportionally
  – dCache cannot serve all requests
– GB/files transferred to BU, UMICH and UTA (from Hiro, Mar 4th):

  Site    AOD files   AOD GB   NTUP files   NTUP GB   Total GB
  UMICH   7915        494      4146         87        580
  UTA     2893        220      1519         27        250
  BU      4632        363      2429         66        430

– Frequent disk "switching" on Tier-2s
– 200 B files (all sites except OSG); see the scan sketch below
– Monitoring: ARDA monitoring is improving, but we need more for day-by-day operations. It takes hours to get information from the LFC, and it is even slower to get it through the DQ2 APIs.
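The truncated 200 B files mentioned above can be caught with a simple scan. A minimal sketch, assuming a plain-text dump of the local replica catalog with "pfn size" per line; the dump format is an assumption, not the actual LRC export:

    import sys

    SUSPECT_SIZE = 200  # truncated transfers showed up as ~200-byte files

    def find_truncated(dump_path):
        """Scan a catalog dump of 'pfn size' lines for suspiciously small files."""
        bad = []
        for line in open(dump_path):
            fields = line.split()
            if len(fields) < 2 or not fields[1].isdigit():
                continue
            pfn, size = fields[0], int(fields[1])
            if size <= SUSPECT_SIZE:
                bad.append((pfn, size))
        return bad

    if __name__ == "__main__":
        for pfn, size in find_truncated(sys.argv[1]):
            print("%10d %s" % (size, pfn))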
DDM Deployment and Installation (Dec Tier-2 WS)
Installation:
– A2. DDM ops test-bed set up (coordinated by Wensheng and Patrick)
  – Central DB server prototype at BNL
  – Site services VO box at UTA
– A3. DDM installation procedure (coordinated by Patrick)
  – E-mail from Patrick (Dec 5th): "...Generic DDM installation procedure for all sites complete with pacman installation for 0.2.12. The most basic test of subscribing to data at BNL works without a problem with test site UTA_TEST1..."
  – DQ2 client (pacman version): Hiro (done)
– A11. DDM installation and deployment (coordinated by Alexei)
  – 0.2.12 will be in production at least up to Mar 2007
  – 0.2.12 validation done (Wensheng)
  – Time slot and scenario agreed with Kaushik
  – Installation on the first sites (BNL, BU, UTA) in week 50 (Dec 11 - 18)
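For reference, a pacman-based installation of that era would look roughly like "pacman -get <cache-URL>:DQ2-0.2.12"; the cache URL and package name here are placeholders, as the actual DQ2 cache location is not given in these slides.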
DDM Deployment and Installation. DQ2 0.3
– 0.2.12 has been in production since Oct 2006.
– 0.3 was expected in production in Mar 2007, with (pre-)testing on OSG from Feb 25th.
– Current status:
  – Miguel: 0.3 will go into production after the Tier-0 test, presumably end of March / beginning of April.
  – 0.3 is not yet available for pre-testing on OSG (confirmed by the DQ2 developers during the ATLAS operations meeting on Mar 6th).
  – 0.3 will not be available for pre-testing until the end of the Tier-0 test (Mar 25th).
  – The DDM developers plan to dedicate one week after the tests to 0.3 "integration".
– DDM operations: 0.3 will not become the production version until all necessary tests have been conducted and the pacman installation for OSG sites is ready.
DDM Operations priority list
– Data integrity check at BNL and the Tier-2s (H.Ito, P.Nevski); a cross-check sketch follows this list:
  – dCache content against the LRC
  – LRC content against the DDM catalogs
– 85% of data is replicated via central subscription (the number is monitored regularly). What about the remaining 15%?
  – All the needed information is available now; there is no need to scan the DQ2 datasets.
  – W.Deng is working on an automatic resubscription/dq2_cr program, to get ALL data to BNL and the US Tier-2s.
– Monitoring: not to repeat the ARDA monitoring, but to provide all information about day-by-day operations.
  – D.Adams, P.McGuigan, A.Klimentov: monitoring pages
  – S.Reddy will merge these into one coherent source of information.
– Metrics (H.Ito): number of subscriptions per site, obsolete subscriptions, time to process a subscription, etc. (in the action items for a while)
– LRC POOL dependencies (P.Salgado, W.Deng)
– Troubleshooting console (R.Gardner et al.)
– 0.3 installation (W.Deng, H.Ito, P.McGuigan)
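A minimal sketch of the catalog cross-check, assuming each catalog (dCache namespace, LRC, DQ2) can be dumped to one GUID per line; the dump files and their format are hypothetical, not the actual integrity-check tooling:

    def load_guids(dump_path):
        """Read one GUID per line from a pre-made catalog dump (assumed format)."""
        return set(line.strip() for line in open(dump_path) if line.strip())

    def cross_check(dcache_dump, lrc_dump, dq2_dump):
        dcache = load_guids(dcache_dump)
        lrc = load_guids(lrc_dump)
        dq2 = load_guids(dq2_dump)
        # Files physically in dCache but not registered in the LRC (orphans).
        print("in dCache, not in LRC :", len(dcache - lrc))
        # LRC entries with no physical replica behind them (dark entries).
        print("in LRC, not in dCache :", len(lrc - dcache))
        # LRC entries unknown to the central DDM catalogs, and vice versa.
        print("in LRC, not in DQ2    :", len(lrc - dq2))
        print("in DQ2, not in LRC    :", len(dq2 - lrc))

    cross_check("dcache.dump", "lrc.dump", "dq2.dump")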
DDM Operations priority list (cont.)
– DQ2/Panda:
  – Policy for the case when the network or the central services are unavailable (one possible retry shape is sketched below)
  – Register files in TID datasets only after the files are replicated to BNL and registered in the LRC.
– Minimize human intervention in DDM operations:
  – Generic integrity check scripts for all ATLAS sites
  – More scripts to check dataset content on a site (P.McGuigan's pages)
  – Kaushik's proposal: get the information for datasets produced by Panda using the dataset name or task ID.
– Recovery procedures after failures of FTS, dCache, the site services, the network, etc.
– The 2007 functional tests will address DDM performance issues (April 2007, after 0.3 deployment on the sites).
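One possible shape for such a policy is a bounded retry with backoff around subscription processing; this is a sketch under assumptions, with a hypothetical process callable, not the agreed procedure:

    import time

    def process_with_backoff(process, subscription, max_tries=5, base_delay=60):
        """Retry a subscription when central services or the network are down.

        `process` is a hypothetical callable that raises IOError on failure.
        """
        for attempt in range(max_tries):
            try:
                return process(subscription)
            except IOError:
                # Exponential backoff: 1, 2, 4, ... minutes between attempts.
                time.sleep(base_delay * 2 ** attempt)
        # After max_tries, park the subscription for human follow-up.
        raise RuntimeError("subscription %s needs manual intervention" % subscription)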
DDM Operations URLs
– AODs and RDOs distribution:
  – http://panda.atlascomp.org/?mode=listAODReplications
  – http://panda.atlascomp.org/?mode=listRDOReplications
– DB releases distribution:
  – http://panda.atlascomp.org/?mode=listDBRelease
– ATLAS AODs/NTUPs delivered in the BNL cloud (P.McGuigan):
  – http://gk03.swt2.uta.edu:8000/bnl_cloud.html
– ATLAS CSC datasets at BNL:
  – http://www.usatlas.bnl.gov/~dial/atprod/validation/current/html/bnl_datasets.html
Backup slides
Results of LFC Performance Testing at CERN
Machines used for the local tests: lxmrrb53[09/10].cern.ch
– CPUs: 2x Intel Xeon 3.0 GHz (2 MB L2 cache); RAM: 4 GB; NIC: 1 Gbps Ethernet
Local test conditions:
– Background load: < 2% (CPUs), < 45% (RAM)
– Ping to the LFC (LRC) server: ~0.5 (0.1) ms
Similar 2-CPU ATLAS VO boxes were used on the remote sites.

               December 2006                      January 2007
  Tier    Plateau rate (Hz)  Time/GUID (ms)   Plateau rate (Hz)  Time/GUID (ms)
  CERN    12.4               80.6             250 +/- 35         4.0
  CNAF    8.1                123              208                4.8
  RAL     6.4                156              222                4.5
  ASGC    0.68               1471             172                5.8

LFC production server: prod-lfc-atlas-local.cern.ch
LFC test server: lxb1540.cern.ch
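As a cross-check on the table, the time per GUID is simply the inverse of the plateau rate: e.g. 1000 ms / 12.4 Hz ≈ 80.6 ms for CERN in December 2006, and 1000 / 250 = 4.0 ms in January 2007.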
LFC Performance Testing
– Dec 2006: LFC production host, production API libraries, no bulk-operation support: 12.4 Hz GUID processing rate.
– Jan 2007: LFC test host, API library with bulk operations, the same set of GUIDs, average of 5 measurements: 250 +/- 35 Hz GUID processing rate.
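A minimal sketch of how such a serial-vs-bulk comparison can be timed; the two lookup functions are hypothetical stand-ins (with sleeps standing in for network round trips at the measured per-GUID costs), not the real LFC client API:

    import time

    def get_replicas(guid):
        """Hypothetical per-GUID LFC lookup: one round trip each."""
        time.sleep(0.080)  # ~80 ms per GUID, the Dec 2006 production figure

    def get_replicas_bulk(guids):
        """Hypothetical bulk LFC lookup: one round trip per batch."""
        time.sleep(0.004 * len(guids))  # ~4 ms per GUID, the Jan 2007 figure

    def measure_rate(lookup, guids, repeats=5):
        """Average GUID processing rate in Hz over `repeats` measurements."""
        rates = []
        for _ in range(repeats):
            start = time.time()
            lookup(guids)
            rates.append(len(guids) / (time.time() - start))
        return sum(rates) / len(rates)

    guids = ["guid-%04d" % i for i in range(50)]
    print("serial: %5.1f Hz" % measure_rate(lambda gs: [get_replicas(g) for g in gs], guids))
    print("bulk  : %5.1f Hz" % measure_rate(get_replicas_bulk, guids))

The point of the bulk API is visible directly in the harness: amortizing the round-trip cost over the whole GUID list is what moves the rate from ~12 Hz to ~250 Hz.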