U.S. ATLAS Computing Facilities
Bruce G. Gibbard
Brookhaven National Laboratory
Mid-year Review of U.S. LHC Software and Computing Projects
NSF Headquarters, Arlington, Virginia
July 8, 2003
Mission of US ATLAS Computing Facilities

- Supply capacities to the ATLAS Distributed Virtual Offline Computing Center
  - At levels agreed to in a computing resource MoU (yet to be written)
- Guarantee the computing required for effective participation by U.S. physicists in the ATLAS physics program
  - Direct access to and analysis of physics data sets
  - Simulation, re-reconstruction, and reorganization of data as required to support such analyses
ATLAS Facilities Model

- ATLAS will employ the ATLAS Virtual Offline Computing Facility to process and analyze its data
- A "cloud"-mediated set of resources including:
  - The CERN Tier 0
  - All regional facilities (Tier 1's), typically ~200 users each
  - Some national facilities (Tier 2's)
- All members of the ATLAS Virtual Organization (VO) must contribute, in funds or in kind (personnel, equipment), in proportion to author count
- All members of the ATLAS VO will have defined access rights
- Typically only a subset of the resources at a regional or national center is integrated into the Virtual Facility
  - The non-integrated portion, over which regional control is retained, is expected to be used to augment resources supporting analyses of regional interest
Analysis Model: All ESD Resident on Disk

- Enables ~24-hour selection/regeneration passes (versus ~a month if tape-stored): faster, better-tuned, more consistent selection (see the timing sketch after this slide)
- Allows navigation for individual events (to all processed, though not raw, data) without recourse to tape and its associated delay: faster, more detailed analysis of larger, consistently selected data sets
- Avoids contention between analyses over ESD disk space and the need to develop complex algorithms to optimize management of that space: better results with less effort
- Complete set on disk at the US Tier 1; cost impact discussed later
- Reduced sensitivity to the performance of multiple Tier 1's, the intervening (transatlantic) network, and middleware: improved system reliability, availability, robustness, and performance; cost impact discussed later
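The disk-versus-tape timing argument can be made concrete with a rough calculation. The sketch below is illustrative only: the ESD volume and the disk and tape bandwidths are assumed values chosen for the example (the tape rate echoes the 30 -> 60 MB/s upgrade quoted on a later slide), not ATLAS planning numbers.

```python
# Back-of-envelope sketch (not from the slides): time for one full ESD
# selection pass, disk-resident versus tape-resident. The ESD volume and
# bandwidth figures are illustrative assumptions, not ATLAS planning numbers.

ESD_BYTES = 200e12   # assumed ESD sample size, ~200 TB (hypothetical)
DISK_BW   = 2.5e9    # assumed aggregate farm disk read rate, bytes/s
TAPE_BW   = 60e6     # assumed aggregate robotic tape rate, bytes/s
                     # (cf. the 30 -> 60 MB/s tape upgrade on a later slide)

def pass_time_days(volume_bytes, bandwidth_bps):
    """Days needed to stream the full sample once at the given rate."""
    return volume_bytes / bandwidth_bps / 86400.0

print(f"disk-resident pass: {pass_time_days(ESD_BYTES, DISK_BW):.1f} days")
print(f"tape-resident pass: {pass_time_days(ESD_BYTES, TAPE_BW):.0f} days")
# -> roughly 1 day from disk versus ~40 days from tape, consistent with the
#    "~24 hour versus ~month" comparison on this slide.
```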
US ATLAS Facilities: A Coordinated Grid of Distributed Resources Including ...

- Tier 1 Facility at Brookhaven (Rich Baker / Bruce Gibbard)
  - Currently operational at ~1% of the required 2008 capacity
- 5 permanent Tier 2 facilities (Saul Youssef)
  - Scheduled for selection beginning in 2004
  - Currently there are 2 prototype Tier 2's:
    - Indiana University (Fred Luehring) / University of Chicago (Rob Gardner)
    - Boston University (Saul Youssef)
- 7 currently active Tier 3 (institutional) facilities
- WAN coordination activity (Shawn McKee)
- Program of Grid R&D activities (Rob Gardner)
  - Based on Grid projects (PPDG, GriPhyN, iVDGL, EU DataGrid, EGEE, etc.)
- Grid production & production support effort (Kaushik De / Pavel Nevski)
Facilities Organization Chart
WBS 2.3 Personnel Increase for FY '04

( ) = important request that is not fully funded
Tier 1 Facility Functions

- Primary U.S. data repository for ATLAS
- Programmatic event selection and AOD & DPD regeneration from ESD
- Chaotic high-level analysis by individuals
  - Especially for large data set analyses
- Significant source of Monte Carlo
- Re-reconstruction as needed
- Technical support for smaller US computing resource centers
- Co-located and operated with the RHIC Computing Facility
  - To date a very synergistic relationship
  - Some recent increased divergence
  - Substantial benefit from cross-use of idle resources (2000 CPU's)
Tier 1 Facility Evolution for FY '04

- No staff increase or equipment procurement in FY '03
  - The only new equipment for FY '02 was based on a DOE end-of-year funding supplement: a 10 TByte disk addition and an upgrade of a single tape drive
- The result has been capacities lower than expected and needed
  - Compute capacity applied to ATLAS Data Challenge 1 (DC1) was ~2x less than ATLAS expected based on the US author count
  - Only very efficient facility utilization and supplemental production at Tier 2's & 3's resulted in an acceptable level of US contribution
- Modest equipment upgrades planned for FY '04 (for DC2); see the arithmetic check after this slide
  - Disk: 12 TBytes -> 25 TBytes (factor of ~2)
  - CPU farm: 30 kSPECint2000 -> 130 kSPECint2000 (factor of ~4)
    - First processor farm upgrade since FY '01 (3 years)
  - Robotic tape storage: 30 MBytes/sec -> 60 MBytes/sec (factor of 2)
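As a quick sanity check on the upgrade factors quoted above, this short sketch computes the growth ratios from the before/after values given on the slide; no other numbers are assumed.

```python
# Quick arithmetic check of the FY '04 Tier 1 upgrade factors quoted above
# (before/after values taken directly from this slide).

upgrades = {
    "disk (TB)":               (12, 25),
    "CPU farm (kSPECint2000)": (30, 130),
    "tape I/O (MB/s)":         (30, 60),
}

for name, (before, after) in upgrades.items():
    print(f"{name:26s} {before:5g} -> {after:5g}   (x{after / before:.1f})")
# Disk and tape bandwidth roughly double; the CPU farm grows by a factor of ~4.
```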
Capital Equipment
Need for a Tier 1 Facility Staff Increase

- Procurement, installation, and operation of additional equipment
- Need for an ATLAS-specific Linux OS: RH 7.3 versus RHIC's RH 9
- Investigation of alternate disk technologies
  - In particular, CERN-style Linux disk-server approaches
- Increased complexity of cyber security and AAA for the Grid
- Major increases in user base and level of activity in 2004:
  - Grid3/PreDC2, a Grid demonstration exercise in preparation for DC2
  - DC2, ATLAS Data Challenge 2
  - LHC Computing Grid (LCG) deployment (LCG-0 -> LCG-1)
Cost Impact of All ESD on Local Disk

- Assumptions:
  - Increase from 480 TB to 1 PB of total disk
  - Some associated increase in CPU and infrastructure
  - Simple extension of current technology
    - A conservative technology choice, so the cost may be overestimated
  - Personnel requirement unchanged
    - The alternative is effort spent optimizing transfer and caching schemes
- Tier 1 Facility cost differential through 2008 (the first full year of LHC operation): see the parametric sketch after this slide
- Since the facility cost is not dominated by hardware, reduction to a "1/3 disk model" certainly reduces cost, but not dramatically
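To illustrate the scaling argument, here is a minimal parametric sketch of the disk-capital differential for the 480 TB -> 1 PB step. The cost-per-TB value is a placeholder assumption for the example only, not a figure from this review.

```python
# Illustrative sketch only: parametric disk-capital differential for the
# "all ESD on disk" option (480 TB -> 1 PB). The cost-per-TB value is a
# placeholder assumption, not a figure from this review; the point is the
# scaling, not the dollar total.

BASELINE_TB = 480       # disk capacity without the all-ESD requirement
ALL_ESD_TB  = 1000      # disk capacity with all ESD resident on disk
COST_PER_TB = 5000      # hypothetical 2003-era disk cost, dollars per TB

def disk_capital(capacity_tb, cost_per_tb=COST_PER_TB):
    """Capital cost of a disk farm of the given capacity."""
    return capacity_tb * cost_per_tb

delta = disk_capital(ALL_ESD_TB) - disk_capital(BASELINE_TB)
print(f"incremental disk capital: ~${delta / 1e6:.1f}M")
# Because personnel, infrastructure, and CPU dominate the facility budget,
# shrinking to a "1/3 disk model" saves only this relatively modest increment,
# which is the argument made on the slide.
```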
Tier 2 Facilities

- 5 permanent Tier 2 facilities
  - Primary resource for simulation
  - Empower individual institutions and small groups to do autonomous analyses using more directly accessible and locally managed resources
- 2 prototype Tier 2's, selected for their ability to contribute rapidly to Grid development
  - Indiana University / University of Chicago (effective FY '03)
  - Boston University
- Permanent Tier 2's will be selected to leverage strong institutional resources
  - Selection of the first two is scheduled for spring 2004
  - Currently there are 7 active Tier 3's in addition to the prototype Tier 2's; all are candidate Tier 2's
- The aggregate of the 5 permanent Tier 2's will be comparable to the Tier 1 in CPU
Tier 2 Facilities Evolution

- First significant iVDGL-funded equipment procurements are now underway
  - (Moore's law: don't buy it until you need it)
- Second round scheduled for summer FY '04
- At the time of DC2, aggregate Tier 2 capacities will be comparable to those of the Tier 1; later in 2004, very significantly more
Networking

- Responsible for:
  - Specifying both the national and international WAN requirements of US ATLAS
  - Communicating requirements to the appropriate network infrastructure suppliers (ESnet, Internet2, etc.)
  - Monitoring the extent to which WAN requirements ...
    - ... are currently being met
    - ... will continue to be met as they increase in the future
- Small base-program-supported effort includes:
  - Interacting with ATLAS facility site managers and technical staff
  - Participating in HENP networking forums
  - Adopting/adapting/developing, deploying, and operating WAN monitoring tools
- WAN upgrades are not anticipated during the next year
  - Currently the Tier 1 & 2 sites are at OC12, except UC, which is now planning an OC3 -> OC12 upgrade by fall
  - Upcoming exercises require ~1 TByte/day (~15% of OC12 theoretical capacity); a worked check follows this slide
  - Competing RHIC utilization at BNL is currently also in the ~15% range
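The ~15% figure can be verified directly; the short calculation below uses only the numbers quoted on this slide plus the standard OC-12 line rate.

```python
# Worked check of the "~1 TByte/day is ~15% of OC12" statement on this slide.

OC12_BPS        = 622.08e6   # OC-12 line rate, bits per second
BYTES_PER_DAY   = 1e12       # ~1 TByte transferred per day
SECONDS_PER_DAY = 86400

avg_rate_bps = BYTES_PER_DAY * 8 / SECONDS_PER_DAY
print(f"average rate: {avg_rate_bps / 1e6:.0f} Mb/s "
      f"({avg_rate_bps / OC12_BPS:.0%} of OC12 theoretical capacity)")
# -> ~93 Mb/s, i.e. roughly 15% of OC-12.
```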
Grid Tools & Services

- Responsible for the development, evaluation, and creation of an integrated Grid-based system for distributed production processing and user analysis
- Primary point of contact and coordination with Grid projects (PPDG, GriPhyN, iVDGL, EDG, EGEE, etc.)
  - Accept, evaluate, and integrate tools & services from Grid projects
  - Transmit requirements and feedback to Grid projects
- Responsible for supporting the integration of ATLAS applications with Grid tools & services
Grid Production

- Responsible for deploying, production-scale testing & hardening, operating, monitoring, and documenting the performance of systems for production processing & user analysis
- Primary point of contact to ATLAS production activities, including the transmission of ...
  - ... production requests to, and facility availability from, the rest of US ATLAS computing management
  - ... requirements to ATLAS production for optimal use of US resources
  - ... feedback to the Tools & Services effort regarding production-scale issues
- Responsible for the integration, on an activity-by-activity basis, of US ATLAS production contributions into the overall ATLAS production
- An increase of 2.65 Project-supported FTE's was requested for FY '04 to address growing production demands, but the budget supports only 1.65
Increasing Grid Production

- Two significant production activities in FY '04 (only DC1 in FY '03)
  - The Grid3/PreDC2 exercise
  - DC2
  - While each is anticipated to last a few months, experience from DC1 indicates that near-continuous ongoing production is more likely
- Production is moving from being Facility-centric to being Grid-centric
  - In its newness, Grid computing is a more complex and less stable production environment and currently requires more effort
- Level of effort
  - During DC1 (less than 50% Grid, using 5 sites): 3.35 FTE's (0.85 Project)
  - For Grid3/PreDC2/DC2 (~100% Grid, using 11 sites): a minimum of 6 FTE's
- Reductions below this level (forced by budget constraints)
  - Will reduce the efficiency of resource utilization
  - Will force some fallback from Grid- to Facility-type production to meet commitments
3 Major Near-Term Milestones

- LCG deployment, including the US Tier 1
  - LCG-0, exercise of the deployment mechanisms: completed May '03
    - Substantial comment on the mechanisms was offered, and they seemed well received
  - LCG-1, initial deployment beginning: July '03
  - LCG-1, full-function, reliable, manageable service: Jan '04
- PreDC2/Grid3 exercise: Nov '03
  - Full geographic chain (Tier 2 -> Tier 1 -> Tier 0 -> Tier 1 -> Tier 2) plus analysis
  - Goals: test the DC2 model, forge the Tier 0 / Tier 1 staff link, initiate Grid analysis
- ATLAS DC2: April '04 (slippage by ~3 months is not unlikely)
  - DC1 scale in number of events (~10^7) but x2 in CPU & storage for Geant4
  - Exercising the complete geographic chain (Tier 2 -> Tier 1 -> Tier 0 -> Tier 1 -> Tier 2)
  - Goal: use of LCG-1 for Grid computing as input to the Computing Model Document
Near Term Schedule
A U.S. ATLAS Physics Analysis Center at BNL

- Motivation:
  - Position the U.S. to ensure active participation in ATLAS physics analysis
  - Builds on the existing Tier 1 ATLAS computing center, CORE software leadership at BNL, and theorists who are already working closely with experimentalists
  - This BNL center will become a place where U.S. physicists come with their students and post-docs
- Scope and timing:
  - Hire at least 1 key physicist per year, starting in 2003, to add to the excellent existing staff and cover all aspects of ATLAS physics analysis: tracking, calorimetry, muons, trigger, simulation, etc.
  - The total staff, including migration from D0, is expected to reach ~25 by 2007
  - The first hire will arrive on August 26, 2003
  - The plan is to have a few of the members in residence at CERN for 1-2 years on a rotating basis
- Cost:
  - Will need a DOE increment to the declining BNL HEP base program
  - Additional base funding of ~$200k/year in FY03, rising to $1.5M in FY07

H. Gordon, BNL DOE Annual HEP Program Review, April 22, 2002