U.S. ATLAS Computing Facilities (Overview)
Bruce G. Gibbard
Brookhaven National Laboratory
Review of U.S. LHC Software and Computing Projects
Fermi National Laboratory
November 27-30, 2001
Outline
- US ATLAS Computing Facilities
  - Definition
  - Mission
  - Architecture & Elements
- Motivation for Revision of the Computing Facilities Plan
  - Schedule
  - Computing Model & Associated Requirements
  - Technology Evolution
  - Tier 1 Budgetary Guidance
- Tier 1 Personnel, Capacity, & Cost Profiles for New Facilities Plan
US ATLAS Computing Facilities Mission
Facilities procured, installed, and operated ...
- ... to meet U.S. "MOU" obligations to ATLAS
  - Direct IT support (Monte Carlo generation, for example)
  - Support for detector construction, testing, and calibration
  - Support for software development and testing
- ... to enable effective participation by US physicists in the ATLAS physics program
  - Direct access to and analysis of physics data sets
  - Simulation, re-reconstruction, and reorganization of data as required to complete such analyses
Elements of US ATLAS Computing Facilities
A Hierarchy of Grid-Connected Distributed Resources Including:
- Tier 1 Facility
  - Located at Brookhaven – Rich Baker / Bruce Gibbard
  - Operational at < 0.5% level
- 5 Permanent Tier 2 Facilities (to be selected in April '03)
  - 2 prototype Tier 2's selected earlier this year and now active
    - Indiana University – Rob Gardner
    - Boston University – Jim Shank
- Tier 3 / Institutional Facilities
  - Several currently active; most are candidates to become Tier 2's
  - Univ. of California at Berkeley, Univ. of Michigan, Univ. of Oklahoma, Univ. of Texas at Arlington, Argonne Nat. Lab.
- Distributed IT Infrastructure – Rob Gardner
- US ATLAS Persistent Grid Testbed – Ed May
- HEP Networking – Shawn McKee
- Coupled to Grid Projects with designated liaisons
  - PPDG – Torre Wenaus
  - GriPhyN – Rob Gardner
  - iVDGL – Rob Gardner
  - EU Data Grid – Craig Tull
Tier 2's
Mission of Tier 2's for US ATLAS:
- A primary resource for simulation
- Empower individual institutions and small groups to do relatively autonomous analysis, using high-performance regional networks and more directly accessible, locally managed resources
- Prototype Tier 2's were selected based on their ability to contribute rapidly to Grid architecture development
- Goal in future Tier 2 selections will be to leverage particularly strong institutional resources of value to ATLAS
- Aggregate of the 5 Tier 2's is expected to be comparable to the Tier 1 in CPU and disk capacity available for analysis
US ATLAS Persistent Grid Testbed
[Map of testbed sites and their network connections: Brookhaven National Laboratory, prototype Tier 2s at Indiana University and Boston University, Argonne National Laboratory, LBNL-NERSC / UC Berkeley, U Michigan, Oklahoma University, and University of Texas at Arlington (HPSS sites indicated), linked via ESnet, Abilene, CalREN, NTON, MREN, and NPACI networks.]
Evolution of US ATLAS Facilities Plan
In response to changes or potential changes in:
- Schedule
- Computing Model & Requirements
- Technology
- Budgetary Guidance
Changes in Schedule
- LHC start-up projected to be a year later: 2005/2006 → 2006/2007
- ATLAS Data Challenges (DC's) have, so far, stayed fixed
  - DC0 – Nov/Dec 2001 – 10^5 events – software continuity test
  - DC1 – Feb/Jul 2002 – 10^7 events – ~1% scale test
  - DC2 – Jan/Sep 2003 – 10^8 events – ~10% scale test
    - A serious functionality & capacity exercise
    - A high level of US ATLAS facilities participation is deemed very important
Computing Model and Requirements
Nominal model was:
- At Tier 0 (CERN)
  - Raw-to-ESD/AOD/TAG pass done, result shipped to Tier 1's
- At Tier 1's (six anticipated for ATLAS)
  - TAG/AOD/~25% of ESD on disk; tertiary storage for the remainder of the ESD
  - Selection passes through the complete ESD ~monthly
  - Analysis of TAG/AOD/selected ESD/etc. (n-tuples) on disk, supporting an analysis pass by ~200 users within 4 hours
- At Tier 2's (five in U.S.)
  - Data access primarily via Tier 1 (to control load on CERN and the transatlantic link)
  - Support ~50 users as above, but with frequent access to ESD on disk at the Tier 1 likely
Serious limitations are:
- A month is a long time to wait for the next selection pass
- Only 25% of the ESD is available for event navigation from TAG/AOD during analysis
- The 25% of ESD on disk will rarely have been consistently selected (once a month) and will be continuously rotating, altering the accessible subset of data
Changes in Computing Model and Requirements (2)
Underlying problem:
- Selection-pass and analysis event-navigation access to ESD is sparse
  - Estimated to be ~1 out of 100 events per analysis
- ESD is on tape rather than on disk
  - Tape is a sequential medium
    - Must access 100 times more data than needed
  - Tape is expensive per unit of I/O bandwidth
    - As much as 10 times that of disk
  - Thus the penalty in access cost relative to disk may be a factor of ~1000 (see the sketch below)
Solution: Get all ESD on disk
Methods for accomplishing this are:
- Buy more disk at Tier 1 – most straightforward
- Unify/coordinate use of existing disk across multiple Tier 1's – more economical
- Some combination of the above – compromise as necessitated by available funding
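The factor-of-~1000 estimate simply combines the two ratios above (access sparsity and relative bandwidth cost); a minimal back-of-the-envelope sketch, using only the numbers quoted on the slide:

```python
# Back-of-the-envelope estimate of the tape-vs-disk access-cost penalty for
# sparse ESD reads, using only the ratios quoted on this slide.

sparse_fraction = 1 / 100               # ~1 out of 100 ESD events actually needed per analysis
overread_factor = 1 / sparse_fraction   # sequential tape forces reading ~100x more data than needed
tape_vs_disk_bw_cost = 10               # tape I/O bandwidth costs up to ~10x that of disk

penalty = overread_factor * tape_vs_disk_bw_cost
print(f"Access-cost penalty of tape relative to disk: ~{penalty:.0f}x")  # ~1000x
```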
"2007" Capacities for U.S. Tier 1 Options
- "3 Tier 1" Model (complete ESD found on disk of the U.S. plus 2 other Tier 1's)
  - Highly dependent on the performance of the other Tier 1's and on the Grid middleware and (transatlantic) network used to connect to them
- "Standalone" Model (complete ESD on disk of the US Tier 1)
  - While avoiding the above dependencies, is more expensive (the contrast is sketched below)
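A toy sketch of the contrast between the two options, in terms of where a given ESD dataset would be read from; the site names and the hash-based partitioning are illustrative assumptions only, not the actual ATLAS data distribution scheme:

```python
# Toy illustration: where does a given ESD dataset live under each option?
# Site names and the hash-based partitioning are purely illustrative.

import zlib

TIER1_SITES = ["BNL", "EU-Tier1-A", "EU-Tier1-B"]   # hypothetical 3-Tier-1 partnership


def esd_location_standalone(dataset: str) -> str:
    """Standalone model: the complete ESD is on disk at the US Tier 1."""
    return "BNL"


def esd_location_3tier1(dataset: str) -> str:
    """'3 Tier 1' model: the ESD is partitioned across three Tier 1 disks,
    so many reads depend on remote sites, middleware, and transatlantic links."""
    return TIER1_SITES[zlib.crc32(dataset.encode()) % len(TIER1_SITES)]


for ds in ["esd.run1234.part07", "esd.run1234.part08"]:
    print(ds, "->", esd_location_standalone(ds), "|", esd_location_3tier1(ds))
```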
Changes in Technology
- No dramatic new technologies
- Previously assumed technologies are tracking Moore's Law well
- Recent price/performance points from the RHIC Computing Facility:
  - CPU: IBM procurement – $33/SPECint95
    - 310 dual 1 GHz Pentium III nodes, 97.2 SPECint95/node
    - Delivered Aug 2001
    - $1M fully racked, including cluster management hardware & software
  - Disk: OSSI/LSI procurement – $27k/TByte
    - 33 usable TB of high-availability Fibre Channel RAID, 1400 MBytes/sec
    - Delivered Sept 2001
    - $887k including SAN switch
- Strategy is to project, somewhat conservatively, from these points for facilities design and costing (see the projection sketch below)
  - Actually used a 20-month rather than the observed <18-month price/performance halving time for disk and CPU
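A minimal sketch of this kind of projection, anchored on the two 2001 procurement points and the 20-month halving time stated above; the mid-2006 purchase date and the resulting unit costs are illustrative assumptions, not figures from the plan:

```python
# Sketch of projecting unit costs forward with a 20-month price/performance
# halving time, anchored on the 2001 RHIC Computing Facility procurements.
# The 2006 purchase date below is an illustrative assumption.

HALVING_TIME_MONTHS = 20.0   # conservative choice (observed was < 18 months)


def projected_cost(cost_now: float, months_ahead: float) -> float:
    """Unit cost after months_ahead, assuming cost halves every HALVING_TIME_MONTHS."""
    return cost_now * 0.5 ** (months_ahead / HALVING_TIME_MONTHS)


# Reference points (Aug/Sept 2001):
cpu_cost_2001 = 33.0         # $/SPECint95 (IBM Pentium III procurement)
disk_cost_2001 = 27_000.0    # $/TByte (OSSI/LSI Fibre Channel RAID procurement)

months_to_purchase = 58      # e.g. Aug 2001 -> mid-2006 (hypothetical purchase date)
print(f"CPU:  ~${projected_cost(cpu_cost_2001, months_to_purchase):.1f}/SPECint95")
print(f"Disk: ~${projected_cost(disk_cost_2001, months_to_purchase):,.0f}/TByte")
```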
Changes in Budgetary Assumptions (2)
Assumed Funding Profiles (At-Year $K)
- For the revised LHC startup schedule, the new profile is better
- For ATLAS DC2, which stayed fixed in '03, the new profile is worse
  - Hardware capacity goals of DC2 will not be met
  - Personnel-intensive facility development may be as much as 1 year behind
- Hope is that another DC will be added, allowing validation of a more nearly fully developed Tier 1 and US ATLAS facilities Grid
Profiles for Standalone Disk Option
- Much higher functionality (than other options) and, given the new stretched-out LHC schedule, within budget guidance
- Fractions in the revised profiles in the table below are of a final system which has nearly 2.5 times the capacity of that discussed last year
Associated Labor Profile
Summary Tier 1 Cost Profile (At-Year $K)
- Current plan violates guidance by $370k in FY '04, but this is a year of some flexibility in guidance
- Strict adherence to FY '04 guidance would ...
  - reduce facility capacity from 3% to 1.5%, or
  - reduce staff by 2 FTE's
Tier 1 Capacity Profile
Tier 1 Cost Profiles
Standalone Disk Model Benefits
All ESD, AOD, and TAG data on local disk:
- Enables analysis-specific 24-hour selection passes (versus one-month aggregated passes) – faster, better-tuned, more consistent selection
- Allows navigation to individual events (to all processed, but not Raw, data) without recourse to tape and the associated delay – faster, more detailed analysis of larger, consistently selected data sets
- Avoids contention between analyses over ESD disk space and the need for complex algorithms to optimize management of that space – better results with less effort
- While prepared to serve appropriate levels of data access to other Tier 1's, the US will not in general be unduly sensitive to the performance of other Tier 1's or of the intervening (transatlantic) network and middleware – improved system reliability, availability, robustness, and performance
Tier 2 Issues
- The high availability of the complete ESD set on disk at the Tier 1, and the associated increased frequency of ESD selection passes, will, for connected Tier 2's (and Tier 3's), lead to ...
  - More analysis activity (increasing CPU & disk utilization): more frequent analysis passes on more and larger usable TAG, AOD, and ESD subsets
  - More network traffic into the site from the Tier 1 (increasing WAN utilization): selection results and event navigation into the full disk-resident ESD
- As in the case of the Tier 1, an additional year of funding before turn-on and the increased effectiveness of "year later" funding contribute to satisfying these increased needs within or near the integrated out-year ('05-'07) budget guidance
- The delay of some '06 funding to '07 is required for a better match of profiles
Tier 2 Distribution of Hardware Cost
Tier 1 Distribution of Hardware Cost
FY 2007 Capacity Comparison of Models
Conclusions
Standalone disk model:
- A dramatic improvement over the previous tape-based model – Functionality & Performance
- A significant improvement over the multi-Tier 1 disk model – Performance, Reliability & Robustness
- Respects funding guidance in model-sensitive out-years
- If costs are higher or funding lower than expected, a graceful fallback is to access some of the data on disks at other Tier 1's
  - Adiabatically move toward the multi-Tier 1 model