US ATLAS Computing Operations
Kaushik De, University of Texas at Arlington
U.S. ATLAS Tier 2/Tier 3 Workshop, FNAL, March 8, 2010

Overview  We expect LHC collisions to resume next week  ATLAS is ready – after 16 years of preparations  The distributed computing infrastructure must perform  US facilities are required to provide about one quarter of ATLAS computing (though historically we have often provided one third)  US also responsible for PanDA software used ATAS wide  Smooth US computing operations is critical for ATLAS success  We did very well with the short LHC run in 2009  We have experienced people and systems in place  But the stress on the system will be far greater in  Focus of computing operations will be timely physics results  We have to adapt quickly to circumstances, as they arise March 8, 2010 Kaushik De 2

US Distributed Computing Organization
- See Michael Ernst's talk for the facilities overview and overall ATLAS plans
- Site integration is covered in Rob Gardner's talk
- The operations group started two years ago; the work will get a lot more intense now
- Organization (roles and leads):
  - ATLAS Distributed Computing (ADC): Alexei Klimentov
  - US ATLAS Facilities: Michael Ernst
  - US ATLAS Computing Operations: Kaushik De
  - Deputy (Batch & Storage): Armen Vartapetian
  - Resource Allocation Committee (RAC): Richard Mount
  - Production Shift Captain: Yuri Smirnov
  - DDM Ops: Hiro Ito
  - Deputy (User Analysis Support): Nurcan Ozturk
  - US ATLAS Facilities Integration Program: Rob Gardner

Scope of US Computing Operations
- Data production: MC simulation and reprocessing
- Data management: storage allocations, data distribution
- User analysis: site testing and validation
- Distributed computing shift and support teams
- The success of US computing operations depends on the site managers; we are fortunate to have excellent site teams

Integration with ADC
- The U.S. operations team works closely with (and parallels) the ATLAS Distributed Computing (ADC) team
- Overall goals and plans are synchronized
- US ATLAS is fully integrated with and well represented in ADC:
  - Jim Shank: deputy Computing Coordinator for ATLAS
  - Alexei Klimentov: ADC coordinator
  - KD, Pavel Nevski: processing and shifts coordinators
  - Nurcan Ozturk: Distributed Analysis Support Team co-coordinator
  - Hiro Ito and Charles Waldman: DDM support and tools
  - Torre Wenaus: PanDA software coordinator
- The ADC team reorganization is still ongoing

US Resource Allocation Committee
- The RAC is becoming increasingly important as the LHC starts
- Under new management and very active
- The core team meets weekly; the full team meets monthly
- Members: R. Mount (RAC Coordinator), K. Black, J. Cochran, K. De, M. Ernst, R. Gardner, I. Hinchliffe, A. Klimentov, A. Vartapetian, R. Yoshida
- Ex officio: H. Gordon, M. Tuts, T. Wenaus, S. Willocq

MC Production and Reprocessing
- The cornerstone of computing operations
- The US team has more than six years of experience
- Responsible for:
  - Efficient utilization of resources at Tier 1/2 sites
  - Monitoring site and task status 7 days a week (setting sites online/offline)
  - Monitoring data flow
  - Reporting software validation issues
  - Reporting task and distributed-software issues
- Part of the ADCoS shift team:
  - US team: Yuri Smirnov (Captain), Mark Sosebee, Armen Vartapetian, Wensheng Deng, Rupam Das
  - Coverage is 24/7, using shifts in three different time zones
  - Three new shifters added from Central and South America

US Production

MC Production Issues
- Computing resources (CPUs) were mostly idle in January-February 2010 because of a lack of central production requests
- Meanwhile, US physicists need more MC simulations
- In the future we will back-fill resources with US regional production, fast-tracking requests from US physicists (see the sketch below)
- We have successful experience with large-scale US regional production from summer 2009 and from previous years
- The mechanism for regional production is the same as for central production, so the data can be used for official physics analyses
- Central production kept all queues busy in the second half of 2009, so regional production dried up; it needs to be restarted
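The back-fill policy above boils down to a simple scheduling rule: release US regional production only when central production leaves slots idle. The sketch below is a minimal illustration of that rule, not actual PanDA code; the threshold, the function name, and the inputs are all hypothetical.

```python
# Minimal sketch of the back-fill rule described above (illustrative only):
# if central production leaves US CPU slots idle, top them up with
# US regional production requests. All names and numbers are hypothetical.

IDLE_THRESHOLD = 0.20  # back-fill when more than 20% of slots sit idle (arbitrary)

def backfill_decision(total_slots, central_jobs_running, regional_queue):
    """Return how many regional-production jobs to release at a site."""
    idle_slots = max(total_slots - central_jobs_running, 0)
    if total_slots == 0 or idle_slots / total_slots <= IDLE_THRESHOLD:
        return 0                      # central production keeps the site busy
    # Release at most as many regional jobs as there are idle slots,
    # so central production always retains priority.
    return min(idle_slots, len(regional_queue))

# Example: 1000 slots, only 400 central jobs running, 800 regional jobs waiting
print(backfill_decision(1000, 400, range(800)))   # -> 600
```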

Reprocessing Issues
- Tape reprocessing is slow
- We try to keep as much real data on disk as possible
- Due to the high I/O requirements, reprocessing is usually done at Tier 1s (ATLAS-wide)
- In the US, the option to run reprocessing at Tier 2s is also available, to speed up physics results
  - Tier 2s go through reprocessing validation
  - We have used US Tier 2s in the past (but not every time)
- Reprocessing will become increasingly challenging as we accumulate LHC data; the US is doing well

Reprocessing Experience

Storage Management
- Many tools are used for data placement, data cleaning, data consistency, and storage system management
- It also needs many people:
  - ADC Operations: led by Stephane Jezequel
  - DQ2 team: led by Simone Campana
  - ATLAS DDM operations: led by Ikuo Ueda
  - US team: KD (US Operations), Hironori Ito (US DDM), Armen Vartapetian (US Ops Deputy), Wensheng Deng (US DDM), Pedro Salgado (BNL dCache), plus the Tier 2 support team; weekly meetings

Storage Issues
- The US will have ~10 PB within the next few months
  - ~7.5 PB already deployed; a fast ramp-up is needed
- Storage is partitioned into space tokens by group
  - Good: keeps data types isolated (user, MC, LHC, ...)
  - Bad: fragments the storage, which is wasteful and difficult to manage (see the sketch below)
- Space token management
  - Each site must provide 6-10 different storage partitions (tokens)
  - This is quite labor intensive; ADC is trying to automate it
  - Many tools for efficient space management are still missing
- Good management of the space tokens is essential for physics analysis
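The fragmentation problem noted above is easy to see with numbers: the summed free space across tokens can look comfortable while no single token has room left for the data it is allowed to hold. A minimal sketch with invented capacities (the token names follow the next slide; none of the TB figures are real):

```python
# Illustration of why space-token partitioning is wasteful: summed free
# space looks fine, but each individual token may be nearly full.
# Token names follow the "Primer on Tokens" slide; the TB numbers are invented.

tokens = {                      # token: (capacity_tb, used_tb)
    "DATADISK":    (400, 390),
    "MCDISK":      (600, 585),
    "USERDISK":    (100,  60),
    "SCRATCHDISK": ( 50,  45),
    "GROUPDISK":   (150, 140),
    "PRODDISK":    ( 80,  70),
}

free_per_token = {t: cap - used for t, (cap, used) in tokens.items()}
total_free = sum(free_per_token.values())
largest_single = max(free_per_token.values())

print(f"total free: {total_free} TB")                       # 90 TB, looks comfortable
print(f"largest free space in one token: {largest_single} TB")  # only 40 TB
# A 30 TB dataset fits easily in the summed free space, but only USERDISK
# could actually hold it, and USERDISK is reserved for user job output --
# hence the management headache described on the slide.
```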

Primer on Tokens
- DATADISK: ESD (full copy at BNL, some versions at Tier 2s), RAW (BNL only), AOD (four copies among U.S. sites)
- MCDISK: AODs (four copies among U.S. sites), ESDs (full copy at BNL), centrally produced DPDs (all sites), some HITs/RDOs
- DATATAPE/MCTAPE: archival data at BNL
- USERDISK: pathena output, limited lifetime (variable, at least 60 days; users notified before deletion)
- SCRATCHDISK: Ganga output and temporary user datasets, limited lifetime (maximum 30 days, no notification before deletion; see the sketch below)
- GROUPDISK: physics/performance group data
- LOCALGROUPDISK: storage for (geographically) local groups and users
- PRODDISK: used only by PanDA production at Tier 2 sites
- HOTDISK: database releases
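The USERDISK and SCRATCHDISK entries above carry explicit lifetime rules (at least 60 days with notification, at most 30 days with no notification). The sketch below encodes just those two rules in a retention check; the function, the dataset dictionaries, and the example names are illustrative, not part of any real DDM tool.

```python
from datetime import date

# Lifetime rules taken from the slide above; everything else is illustrative.
RETENTION = {
    "USERDISK":    {"min_days": 60, "notify": True},   # users notified first
    "SCRATCHDISK": {"min_days": 30, "notify": False},  # deleted without notice
}

def deletion_candidates(datasets, today=None):
    """Yield (name, notify) for datasets older than their token's lifetime."""
    today = today or date.today()
    for ds in datasets:
        rule = RETENTION.get(ds["token"])
        if rule and (today - ds["created"]).days > rule["min_days"]:
            yield ds["name"], rule["notify"]

example = [
    {"name": "user.jdoe.pathena_out", "token": "USERDISK",
     "created": date(2010, 1, 2)},
    {"name": "user.jdoe.ganga_tmp",   "token": "SCRATCHDISK",
     "created": date(2010, 2, 20)},
]
for name, notify in deletion_candidates(example, today=date(2010, 3, 8)):
    print(name, "-> notify user first" if notify else "-> delete silently")
```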

Storage Tokens Status (from A. Vartapetian)

Data Management Issues
- DDM consolidation: see Hiro's talk
- MCDISK is the largest fraction of space usage
  - Not surprising given the delay in the LHC schedule
  - We need to prune MCDISK as data arrives
  - The US DDM team is studying what can be cleaned up without affecting the work of physics users
- Some Tier 2s are very full
  - New resources are coming soon
  - We are keeping a daily eye on these sites
- There may be a space crunch again in Fall 2010

Cleaning Up (responsibilities summarized in the sketch below)
- MCDISK/DATADISK: cleaned by Central Deletion or US DDM
- MCTAPE/DATATAPE: cleaned by Central Deletion, with notification to BNL to clean/reuse tapes
- SCRATCHDISK: cleaned by Central Deletion
- GROUPDISK: cleaned by Central Deletion
- HOTDISK: never cleaned!
- PRODDISK: cleaned by the site
- USERDISK: cleaned by US DDM
- LOCALGROUPDISK: cleaned by US DDM
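The list above is essentially a lookup table from space token to the agent responsible for cleaning it. The sketch below transcribes it into code so that, for example, a site administrator or the US DDM team can list the areas they own; the helper function is ours, only the mapping comes from the slide.

```python
# Who cleans which space token, transcribed from the list above.
CLEANER = {
    "MCDISK":         "Central Deletion or US DDM",
    "DATADISK":       "Central Deletion or US DDM",
    "MCTAPE":         "Central Deletion (BNL notified to clean/reuse tapes)",
    "DATATAPE":       "Central Deletion (BNL notified to clean/reuse tapes)",
    "SCRATCHDISK":    "Central Deletion",
    "GROUPDISK":      "Central Deletion",
    "HOTDISK":        "never cleaned",
    "PRODDISK":       "site",
    "USERDISK":       "US DDM",
    "LOCALGROUPDISK": "US DDM",
}

def tokens_cleaned_by(agent):
    """Return the tokens whose cleanup responsibility mentions the given agent."""
    return sorted(t for t, who in CLEANER.items() if agent in who)

print(tokens_cleaned_by("US DDM"))  # ['DATADISK', 'LOCALGROUPDISK', 'MCDISK', 'USERDISK']
print(tokens_cleaned_by("site"))    # ['PRODDISK']
```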

Some Examples of Data Distribution
- Panels: BNL DATADISK/MCDISK; MWT2 DATADISK; SWT2 MCDISK

GROUPDISK Endpoints in the US
- All Tier 1/Tier 2 sites in the US provide physics/performance endpoints on GROUPDISK
  - For example: PERF-MUONS at AGLT2 & NET2, PERF-EGAMMA at SWT2, PERF-JETS at MWT2 & WT2
- Each physics/performance group has a designated data manager
- GROUPDISK endpoints host data of common interest
- Data placement can be requested through the group manager or via DaTRI

US Endpoint Summary

User Analysis
- U.S. ATLAS has an excellent track record of supporting users
- The US is the most active cloud in ATLAS for user analysis
- The analysis sites have been in continuous and heavy use for more than two years
- We have regularly scaled up resources to match user needs
- Tier 3 issues will be discussed tomorrow

Analysis Usage Growing

900 GeV User Analysis Jobs (statistics from November 2009)

900 GeV MC User Analysis Jobs (statistics from November 2009)

ADCoS (ADC Operations Shifts)
- ADCoS combined shifts started January 28th, 2008
- ADCoS goals:
  - World-wide (distributed/remote) shifts
  - To monitor all ATLAS distributed computing resources
  - To provide Quality of Service (QoS) for all data processing
- Organization:
  - Senior/Trainee: 2-day shifts; Expert: 7-day shifts
  - Three shift times (in the CERN time zone), mapped in the sketch below:
    o ASIA/Pacific: 0h - 8h
    o EU-ME: 8h - 16h
    o Americas: 16h - 24h
- The U.S. shift team was in operation long before ADCoS started
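The three shift blocks are fixed windows of CERN local time, so mapping an hour to the responsible region is a one-line lookup. A minimal sketch (the function is ours, not an ADCoS tool):

```python
def adcos_shift_region(hour_cern):
    """Map an hour in CERN local time (0-23) to the ADCoS shift region."""
    if not 0 <= hour_cern <= 23:
        raise ValueError("hour must be in 0-23")
    if hour_cern < 8:
        return "ASIA/Pacific"   # 0h - 8h
    if hour_cern < 16:
        return "EU-ME"          # 8h - 16h
    return "Americas"           # 16h - 24h

print(adcos_shift_region(3))    # ASIA/Pacific
print(adcos_shift_region(10))   # EU-ME
print(adcos_shift_region(20))   # Americas
```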

Distributed Analysis Shift Team (DAST)
- User analysis support has been provided by AtlasDAST (the ATLAS Distributed Analysis Shift Team) since September 29, 2008. Previously, user support was provided on a best-effort basis by the PanDA and Ganga software developers.
- Nurcan Ozturk (UTA) and Daniel van der Ster (CERN) are coordinating this effort.
- DAST currently organizes shifts in two time zones, US and CERN. One person from each zone is on shift for 7 hours a day, covering 9 am to 11 pm CERN time, 5 days a week (see the sketch below).
- Please contact Nurcan to join this effort.
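The DAST rota described above amounts to two consecutive 7-hour blocks covering 09:00-23:00 CERN time on weekdays. The sketch below checks that coverage; the split of hours between the CERN-zone and US-zone shifter is an assumption made for illustration, only the overall window and shift length come from the slide.

```python
# DAST coverage: two shifters, 7 hours each, 09:00-23:00 CERN time, Mon-Fri.
# The 09-16 / 16-23 split between the two zones is an assumption.

def dast_shifter(hour_cern, weekday):
    """Return which DAST zone covers a given CERN-time hour, or None."""
    if weekday >= 5:                 # Saturday/Sunday: no DAST shift
        return None
    if 9 <= hour_cern < 16:
        return "CERN-zone shifter"   # assumed first 7-hour block
    if 16 <= hour_cern < 23:
        return "US-zone shifter"     # assumed second 7-hour block
    return None                      # outside the 9 am - 11 pm window

print(dast_shifter(10, weekday=0))   # CERN-zone shifter (Monday 10:00)
print(dast_shifter(21, weekday=3))   # US-zone shifter (Thursday 21:00)
print(dast_shifter(2,  weekday=1))   # None
```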

Conclusion
- Waiting for collisions!