The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee University of Michigan CCP 2006, Gyeongju, Korea August 29th, 2006.


The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee University of Michigan CCP 2006, Gyeongju, Korea August 29th, 2006

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 2 Overview  The ATLAS collaboration has only a year before it must manage large amounts of “real” data for its globally distributed collaboration.  ATLAS physicists need the software and physical infrastructure required to:  Calibrate and align detector subsystems to produce well-understood data  Realistically simulate the ATLAS detector and its underlying physics  Provide access to ATLAS data globally  Define, manage, search and analyze datasets of interest  I will cover the current status, plans and some of the relevant research in this area, and indicate how it might benefit ATLAS in augmenting and extending its infrastructure.

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 3 The ATLAS Computing Model  The Computing Model is fairly well evolved and is documented in the C-TDR  There are many areas with significant questions/issues to be resolved:  Calibration and alignment strategy is still evolving  Physics data access patterns MAY be exercised (SC4: since June)  Unlikely to know the real patterns until 2007/2008!  Still uncertainties on the event sizes, reconstruction time  How best to integrate ongoing “infrastructure” improvements from research efforts into our operating model?  Lesson from the previous round of experiments at CERN (LEP):  Reviews in 1988 underestimated the computing requirements by an order of magnitude!

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 4 ATLAS Computing Model Overview  We have a hierarchical model (EF-T0-T1-T2) with specific roles and responsibilities  Data will be processed in stages: RAW → ESD → AOD → TAG  Data “production” is well-defined and scheduled  Roles and responsibilities are assigned within the hierarchy.  Users will send jobs to the data and extract relevant data  typically NTuples or similar  Goal is a production and analysis system with seamless access to all ATLAS grid resources  All resources need to be managed effectively to ensure ATLAS goals are met and resource providers’ policies are enforced; grid middleware must provide this

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 5 ATLAS Facilities and Roles  Event Filter Farm at CERN  Assembles data (at CERN) into a stream to the Tier 0 Center  Tier 0 Center at CERN  Data archiving: Raw data to mass storage at CERN and to Tier 1 centers  Production: Fast production of Event Summary Data (ESD) and Analysis Object Data (AOD)  Distribution: ESD, AOD to Tier 1 centers and mass storage at CERN  Tier 1 Centers distributed worldwide (10 centers)  Data stewardship: Re-reconstruction of the raw data they archive, producing new ESD, AOD  Coordinated access to full ESD and AOD (all AOD, a site-dependent fraction of the ESD)  Tier 2 Centers distributed worldwide (approximately 30 centers)  Monte Carlo simulation, producing ESD and AOD that are sent to Tier 1 centers  On-demand user physics analysis of shared datasets  Tier 3 Centers distributed worldwide  Physics analysis  A CERN Analysis Facility  Analysis  Enhanced access to ESD and RAW/calibration data on demand
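As a compact restatement of the roles listed above, here is a minimal Python sketch that simply encodes the slide's facility-to-role mapping as data; the wording is paraphrased from the slide and the structure is mine, not anything from ATLAS software.

```python
# Compact restatement of the facility roles on this slide (paraphrased, illustrative only).
FACILITY_ROLES = {
    "Event Filter (CERN)": [
        "assemble accepted events into a stream to the Tier-0 centre",
    ],
    "Tier-0 (CERN)": [
        "archive RAW to mass storage at CERN and ship copies to Tier-1s",
        "fast first-pass production of ESD and AOD",
        "distribute ESD and AOD to the Tier-1 centres",
    ],
    "Tier-1 (10 centres)": [
        "data stewardship: re-reconstruct archived RAW into new ESD/AOD",
        "coordinated access to all AOD and a site-dependent fraction of ESD",
    ],
    "Tier-2 (~30 centres)": [
        "Monte Carlo simulation, with ESD/AOD shipped to Tier-1s",
        "on-demand user physics analysis of shared datasets",
    ],
    "Tier-3": ["physics analysis"],
    "CERN Analysis Facility": ["analysis with enhanced access to ESD and RAW/calibration data"],
}

for facility, roles in FACILITY_ROLES.items():
    print(facility)
    for role in roles:
        print("  -", role)
```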

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 6 Computing Model: event data flow from EF  Events written in “ByteStream” format by the Event Filter farm in 2 GB files  ~1000 events/file (nominal size is 1.6 MB/event)  200 Hz trigger rate (independent of luminosity)  Currently 4+ streams are foreseen:  Express stream with “most interesting” events  Calibration events (including some physics streams, such as inclusive leptons)  “Trouble maker” events (for debugging)  Full (undivided) event stream  One 2-GB file every 5 seconds will be available from the Event Filter  Data will be transferred to the Tier-0 input buffer at 320 MB/s (average)  The Tier-0 input buffer will have to hold raw data waiting for processing  And also cope with possible backlogs  ~125 TB will be sufficient to hold 5 days of raw data on disk
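The headline numbers on this slide hang together arithmetically; the short sketch below is a back-of-envelope check using only the event size, trigger rate, file size and the 5-day buffer depth quoted above. Note that the slide's "one file every 5 seconds" corresponds to ~1000 events per file, while a full 2 GB of 1.6 MB events would be ~1250 events and ~6 seconds.

```python
# Back-of-envelope check of the Event Filter -> Tier-0 numbers quoted on the slide.
EVENT_SIZE_MB = 1.6      # nominal RAW event size
TRIGGER_RATE_HZ = 200    # EF output rate, independent of luminosity
FILE_SIZE_GB = 2.0       # ByteStream file size
BUFFER_DAYS = 5          # raw-data depth the Tier-0 input buffer should hold

rate_mb_s = EVENT_SIZE_MB * TRIGGER_RATE_HZ                 # ~320 MB/s average into Tier-0
seconds_per_file = FILE_SIZE_GB * 1024 / rate_mb_s          # ~6.4 s for a full 2-GB file (slide quotes ~5 s, i.e. ~1000 events/file)
buffer_tb = rate_mb_s * 86400 * BUFFER_DAYS / 1024 / 1024   # ~130 TB, close to the ~125 TB quoted

print(f"average rate into Tier-0 : {rate_mb_s:.0f} MB/s")
print(f"one 2-GB file every      : {seconds_per_file:.1f} s")
print(f"5-day input buffer       : {buffer_tb:.0f} TB")
```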

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 7 ATLAS Data Processing  Tier-0:  Prompt first-pass processing of the express/calibration & physics streams  After some hours, process the full physics streams with reasonable calibrations  Implies large data movement from T0 → T1s, some T0 ↔ T2 (calibration)  Tier-1:  Reprocess 1-2 months after arrival with better calibrations  Reprocess all local RAW at year end with improved calibration and software  Implies large data movement from T1 ↔ T1 and T1 → T2

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 8 ATLAS partial & “average” T1 Data Flow (2008) [diagram, slide from D. Barberis] From the Tier-0 CPU farm to one Tier-1 disk buffer: RAW (1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day), ESD2 (0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day), AOD2 (10 MB/file, 0.2 Hz, 17K files/day, 2 MB/s, 0.16 TB/day) and AODm2 (500 MB/file, 0.34K files/day, 2 MB/s, 0.16 TB/day); combined RAW + ESD2 + AODm2 into one Tier-1: 3.74K files/day, 44 MB/s, 3.66 TB/day; total Tier-0 output of RAW, ESD (2x) and AODm (10x) to all Tier-1s: 1 Hz, 85K files/day, 720 MB/s. RAW is also archived to tape (1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day). Flows between this Tier-1 and the other Tier-1s: ESD2 (0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day) and AODm2 (500 MB/file, 3.1K files/day, 18 MB/s, 1.44 TB/day); flows between this Tier-1 and each of its Tier-2s: ESD1/ESD2 (0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day) and AODm1/AODm2 (500 MB/file, 0.04 Hz, 3.4K files/day, 20 MB/s, 1.6 TB/day). Plus simulation and analysis data flow. There are a significant number of flows to be managed and optimized
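To make the bookkeeping above concrete, this small sketch recomputes bandwidth and daily volume from the per-stream file sizes and rates quoted for the Tier-0 to Tier-1 flow; the stream list is copied from the slide and the rest is plain arithmetic.

```python
# Recompute per-stream bandwidth and daily volume for the Tier-0 -> Tier-1 flows
# quoted on the slide (file sizes in GB, file rates in Hz).
STREAMS = {            # stream: (file size in GB, file rate in Hz)
    "RAW":  (1.6,  0.02),
    "ESD2": (0.5,  0.02),
    "AOD2": (0.01, 0.2),
}

total_mb_s = 0.0
for name, (size_gb, rate_hz) in STREAMS.items():
    mb_s = size_gb * 1024 * rate_hz
    tb_day = mb_s * 86400 / 1024 / 1024
    files_day = rate_hz * 86400
    total_mb_s += mb_s
    print(f"{name:5s}: {files_day/1000:.1f}K files/day, {mb_s:5.1f} MB/s, {tb_day:4.2f} TB/day")

print(f"total for these three streams into one Tier-1: ~{total_mb_s:.0f} MB/s")
```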

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 9 ATLAS Event Data Model  RAW:  “ByteStream” format, ~1.6 MB/event  ESD (Event Summary Data):  Full output of reconstruction in object (POOL/ROOT) format:  Tracks (+ their hits), Calo Clusters, Calo Cells, combined reconstruction objects etc.  Nominal size 500 kB/event  currently 2.5 times larger: contents and technology under revision  AOD (Analysis Object Data):  Summary of event reconstruction with “physics” (POOL/ROOT) objects:  electrons, muons, jets, etc.  Nominal size 100 kB/event  currently 70% of that: contents and technology under revision  TAG:  Database used to quickly select events in AOD and/or ESD files
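Combining the per-event sizes above with the 200 Hz rate from slide 6 gives the scale of a year of data. The sketch below assumes a canonical 10^7 live seconds per year, which is my assumption for illustration and not a number taken from the slide.

```python
# Rough yearly data volume per format, using the nominal per-event sizes on this
# slide and the 200 Hz EF rate from slide 6.  The 1e7 live seconds per year is an
# assumed canonical value, not a number quoted on the slide.
NOMINAL_SIZE_KB = {"RAW": 1600, "ESD": 500, "AOD": 100}
TRIGGER_RATE_HZ = 200
LIVE_SECONDS_PER_YEAR = 1e7          # assumption

events_per_year = TRIGGER_RATE_HZ * LIVE_SECONDS_PER_YEAR   # ~2e9 events
for fmt, kb in NOMINAL_SIZE_KB.items():
    pb_per_year = events_per_year * kb / 1024 / 1024 / 1024 / 1024  # KB -> PB
    print(f"{fmt}: ~{pb_per_year:.1f} PB/year for a single copy")
```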

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 10 ATLAS Data Streaming  ATLAS Computing TDR had 4 streams from event filter  primary physics, calibration, express, problem events  Calibration stream has split at least once since!  Discussions are focused upon optimisation of data access  At AOD, envisage ~10 streams  TAGs useful for event selection and data set definition  We are now planning ESD and RAW streaming  Straw man streaming schemes (trigger based) being agreed  Will explore the access improvements in large-scale exercises  Are also looking at overlaps, bookkeeping etc
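TAG-based selection, as used above for event selection and dataset definition, amounts to querying a compact per-event summary to decide which AOD/ESD files need to be touched. The sketch below is purely illustrative: the attribute names and the in-memory "tag table" are invented for the example and are not the ATLAS TAG schema.

```python
# Illustrative TAG-style event selection: filter a compact per-event record set,
# then collect the files that must actually be read.  Attribute names are invented.
from collections import namedtuple

Tag = namedtuple("Tag", "run event n_electrons missing_et_gev aod_file")

tag_table = [
    Tag(1234, 1, 2, 55.0, "aod_stream_egamma_001.pool.root"),
    Tag(1234, 2, 0, 12.0, "aod_stream_jet_007.pool.root"),
    Tag(1235, 9, 1, 80.0, "aod_stream_egamma_002.pool.root"),
]

# "Give me events with at least one electron and large missing ET"
selected = [t for t in tag_table if t.n_electrons >= 1 and t.missing_et_gev > 50.0]

files_to_read = sorted({t.aod_file for t in selected})
print(f"{len(selected)} events selected, reading {len(files_to_read)} AOD files")
```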

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 11 HEP Data Analysis  Raw data: hits, pulse heights  Reconstructed data (ESD): tracks, clusters…  Analysis Objects (AOD): physics objects, summarized, organized by physics topic  Ntuples, histograms, statistical data

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 12 Production Data Processing [workflow diagram] Real data chain: Data Acquisition and the Level-3 trigger produce raw data and trigger tags; reconstruction turns these into Event Summary Data (ESD) and event tags. Monte Carlo chain: physics models and detector simulation produce Monte Carlo truth data and MC raw data; reconstruction turns these into MC Event Summary Data and MC event tags. Calibration data, run conditions and trigger information feed both chains. System coordination required at the collaboration and group levels

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 13 Physics Analysis [workflow diagram] Event selection via event tags, together with calibration data, drives analysis processing of raw data and ESD into analysis objects (physics objects and statistical objects), which feed the physics analysis itself. The stages map onto the tiers: Tier 0/1 (collaboration-wide), Tier 2 (analysis groups) and Tiers 3/4 (individual physicists)

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 14 ATLAS Resource Requirements for 2008 [table from the Computing TDR] Recent (July 2006) updates have reduced the expected contributions

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 15 ATLAS Grid Infrastructure  ATLAS plans to use grid technology  To meet its resource needs  To manage those resources  Three grids  LCG  Nordugrid  OSG  Significant resources, but different middleware  Teams working on solutions are typically associated with one grid and its middleware  In principle all ATLAS resources are available to all ATLAS users  Works out to O(1) CPU per user  ATLAS users are interested in using their local systems with priority  This calls for flexibility concerning middleware, not only a central system Plan “A” is “the Grid”…there is no plan “B”

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 16 ATLAS Virtual Organization  Until recently the Grid has been a “free for all”  no CPU or storage accounting (new, in a prototyping/testing phase)  no or limited priorities (roles mapped to a small number of accounts: atlas01-04)  no storage space reservation  Last year ATLAS saw a competition for resources between “official” Rome productions and “unofficial”, but organized, productions  B-physics, flavour tagging...  The latest release of the VOMS (Virtual Organisation Management Service) middleware package allows the definition of user groups and roles within the ATLAS Virtual Organisation  and is used by all ATLAS grid flavors!  Relative priorities are easy to enforce IF all jobs go through the same system  For a distributed submission system, it is up to the resource providers to:  agree each site’s policies with ATLAS  publish and enforce the agreed policies
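Group and role definitions only bite once sites translate them into concrete shares; the sketch below shows one minimal way a site could map VOMS (group, role) pairs onto fair-share weights. The group names and percentages are invented for illustration and are not ATLAS or site policy.

```python
# Minimal illustration of mapping VOMS (group, role) pairs onto fair-share weights.
# Group names and percentages are invented for the example; they are not ATLAS policy.
SHARES = {
    ("/atlas", "production"):        50,   # central production
    ("/atlas/phys-higgs", None):     15,   # a physics working group
    ("/atlas/detector-muon", None):  10,   # a detector/calibration group
    ("/atlas", None):                25,   # everything else ("default" users)
}

def share_for(group, role=None):
    """Return the fair-share weight for a proxy carrying the given VOMS group/role."""
    while group:
        for key in ((group, role), (group, None)):
            if key in SHARES:
                return SHARES[key]
        group = group.rsplit("/", 1)[0]   # walk up the group hierarchy
    return 0

print(share_for("/atlas/phys-higgs/subgroup"))     # -> 15 (inherited from the group)
print(share_for("/atlas", role="production"))      # -> 50
```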

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 17 Calibrating and Aligning ATLAS Calibrating and aligning detector subsystems is a critical process  Without well-understood detectors we will have no meaningful physics data The default option for offline prompt calibrations is processing at Tier-0 or at the CERN Analysis Facility; however, the TDR states that:  “Tier-2 centres will provide analysis facilities, and some will provide the capacity to produce calibrations based on processing raw data”.  “Tier-2 facilities may take a range of significant roles in ATLAS such as providing calibration constants, simulation and analysis”.  “Some Tier-2s may take significant role in calibration following the local detector interests and involvements”.  ATLAS will have some subsystems utilizing Tier-2 centers as calibration and alignment sites.  Must ensure we can support the data flow without disrupting other planned flows  The real-time aspect is critical – the system must account for “deadlines”

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 18 Proposed ATLAS Muon Calibration System [diagram; quoted bandwidths are for a 10 kHz muon rate] Components: L2PU threads (x 25, x ~20) feeding memory queues, local servers that dequeue the data, a gatherer and calibration server, the calibration farm disk, and a control network; transport over TCP/IP, UDP, etc., with indicated bandwidths of ~500 kB/s and ~10 MB/s on different links

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 19 ATLAS Simulations  Within ATLAS the Tier-2 centers will be responsible for the bulk of the simulation effort.  Current planning assumes ATLAS will simulate approximately 20% of the real data volume  This number is dictated by resources; ATLAS may need to find a way to increase this fraction  The event generator framework interfaces multiple packages  including the Genser distribution provided by LCG-AA  Simulation with Geant4 since early 2004  automatic geometry built from GeoModel  >25M events fully simulated since mid-2004  only a handful of crashes!  Digitization tested and tuned with Test Beam

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 20 ATLAS Analysis Computing Model The ATLAS analysis model is broken into two components  Scheduled central production of augmented AOD, tuples & TAG collections from ESD  Derived files moved to other T1s and to T2s  Chaotic user analysis of augmented AOD streams, tuples, new selections etc., plus individual user simulation and CPU-bound tasks matching the official MC production  Modest to large(?) job traffic between T2s (and T1s, T3s)

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 21 Distributed Analysis  At this point the emphasis is on a batch model to implement the ATLAS Computing Model  Interactive solutions are difficult to realize on top of the current middleware layer  We expect ATLAS users to send large batches of short jobs to optimize their turnaround  Key issues: scalability, data access, analysis in parallel to production, job priorities  Distributed analysis effectiveness depends strongly upon the hardware and software infrastructure.  Analysis is divided into “group” and “on demand” types

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 22 ATLAS Group Analysis  Group analysis is characterised by access to full ESD and perhaps RAW data  This is resource intensive  Must be a scheduled activity  Can back-navigate from AOD to ESD at the same site  Can harvest small samples of ESD (and some RAW) to be sent to Tier 2s  Must be agreed by physics and detector groups  Group analysis will produce  Deep copies of subsets  Dataset definitions  TAG selections  Big Trains  Most efficient access if analyses are blocked into a “big train” (see the sketch below)  Idea has been around for a while, already used in e.g. heavy ions  Each wagon (group) has a wagon master (= production manager)  Must ensure a wagon will not derail the train  The train must run often enough (every ~2 weeks?)
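The "big train" is essentially many group selections batched into one scheduled pass over the ESD. The toy sketch below illustrates that scheduling pattern; the wagon names, selection functions and event loop are invented for the example and are not ATLAS code.

```python
# Toy illustration of the "big train" pattern: one scheduled pass over the ESD
# serves many group selections (wagons).  Names and checks are invented here.
def run_big_train(esd_events, wagons):
    """wagons: dict of group name -> (selection function, output list)."""
    # A wagon master would validate each wagon first so it cannot "derail the train"
    # (e.g. reject selections that crash on a small test sample) -- omitted here.
    for event in esd_events:            # the expensive part: read the ESD exactly once
        for group, (select, output) in wagons.items():
            if select(event):
                output.append(event)    # in reality: write a deep copy / dataset subset
    return {group: out for group, (_, out) in wagons.items()}

# Example: two groups ride the same train over a toy "ESD".
events = [{"n_muons": n % 3, "met": 10.0 * n} for n in range(1000)]
wagons = {
    "higgs":   (lambda e: e["met"] > 500.0, []),
    "b-group": (lambda e: e["n_muons"] >= 2, []),
}
results = run_big_train(events, wagons)
print({group: len(out) for group, out in results.items()})
```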

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 23 ATLAS On-demand Analysis  Restricted Tier 2s and CAF  Could specialize some Tier 2s for some groups  ALL Tier 2s are for ATLAS-wide usage  Role and group based quotas are essential  Quotas to be determined per group not per user  Data Selection  Over small samples with Tier-2 file-based TAG and AMI dataset selector  TAG queries over larger samples by batch job to database TAG at Tier-1s/large Tier 2s  What data?  Group-derived EventViews  Root Trees  Subsets of ESD and RAW  Pre-selected or selected via a Big Train run by working group  Each user needs 14.5 kSI2k (about 12 current boxes)  2.1 TB ‘associated’ with each user on average
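Scaling the per-user figures above to a whole community gives a feel for the aggregate Tier-2 analysis requirement; the sketch below does that multiplication for an assumed 1000 active analysis users, a number chosen purely for illustration.

```python
# Scale the per-user analysis figures from the slide to a user community.
# The number of active users is an assumption made for illustration only.
CPU_PER_USER_KSI2K = 14.5
DISK_PER_USER_TB = 2.1
ACTIVE_USERS = 1000        # assumed, not from the slide

total_cpu_msi2k = CPU_PER_USER_KSI2K * ACTIVE_USERS / 1000
total_disk_pb = DISK_PER_USER_TB * ACTIVE_USERS / 1024

print(f"{ACTIVE_USERS} users -> ~{total_cpu_msi2k:.1f} MSI2k CPU and ~{total_disk_pb:.1f} PB disk")
```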

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 24 ATLAS Data Management  Based on Datasets  PoolFileCatalog API is used to hide grid differences  On LCG, LFC acts as local replica catalog  Aims to provide uniform access to data on all grids  FTS is used to transfer data between the sites  To date FTS has tried to manage data flow by restricting allowed endpoints (“channel” definition)  Interesting possibilities exist to incorporate network related research advances to improve performance, efficiency and reliability  Data management is a central aspect of Distributed Analysis  PANDA is closely integrated with DDM and operational  LCG instance was closely coupled with SC3  Right now we run a smaller instance for test purposes  Final production version will be based on new middleware for SC4 (FPS)
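The dataset-based organisation described here can be pictured as a catalogue of datasets plus per-site subscriptions that a transfer agent works through. The sketch below is a toy model of that idea only; it deliberately does not use the real DQ2/DDM or PoolFileCatalog interfaces, and the dataset and site names are made up.

```python
# Toy model of dataset-based data management: a dataset is a named list of files,
# and a "subscription" asks for a full replica of a dataset at a site.
# This mimics the idea only; it is not the DQ2/DDM interface.
datasets = {
    "csc11.005300.Zmumu.recon.AOD.v1": ["f1.pool.root", "f2.pool.root", "f3.pool.root"],
}
replicas = {          # site -> set of files already present
    "CERN-T0": {"f1.pool.root", "f2.pool.root", "f3.pool.root"},
    "BNL-T1":  {"f1.pool.root"},
}
subscriptions = [("csc11.005300.Zmumu.recon.AOD.v1", "BNL-T1")]

def missing_files(dataset, site):
    """Files a transfer agent still has to move to satisfy the subscription."""
    return [f for f in datasets[dataset] if f not in replicas.get(site, set())]

for ds, site in subscriptions:
    print(site, "needs", missing_files(ds, site))
```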

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 25 Distributed Data Management  Accessing distributed data on the Grid is not a simple task  Several DBs are needed centrally to hold dataset information  “Local” catalogues hold information on local data storage  The new DDM system (shown on the slide) is under test this summer  It will be used for all ATLAS data from October on (LCG Service Challenge 3)

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 26 ATLAS plans for using FTS [diagram: Tier-0 and Tier-1 FTS servers, an LFC local within each “cloud”, VO boxes, and all SEs with SRM]  Tier-0 FTS server:  Channel from Tier-0 to all Tier-1s: used to move "Tier-0" data (raw and 1st-pass reconstruction data)  Channel from Tier-1s to Tier-0/CAF: to move e.g. AOD (the CAF also acts as a "Tier-2" for analysis)  Tier-1 FTS server:  Channel from all other Tier-1s to this Tier-1 (pulling data): used for DQ2 dataset subscriptions (e.g. reprocessing, or massive "organized" movement when doing Distributed Production)  Channel to and from this Tier-1 to all its associated Tier-2s  Association defined by ATLAS management (along with LCG)  “Star” channel for all remaining traffic [new: low-traffic]
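The channel scheme above is effectively a routing rule: which FTS server and channel handle a given source-to-destination pair. The sketch below encodes that rule as stated on the slide; the site names, tier assignments and the function itself are mine, for illustration only.

```python
# Decide which FTS server/channel handles a transfer, following the channel
# scheme described on the slide.  Site names and tier assignments are illustrative.
TIER = {"CERN-T0": 0, "BNL-T1": 1, "LYON-T1": 1, "AGLT2-T2": 2}
T2_TO_T1 = {"AGLT2-T2": "BNL-T1"}     # Tier-2 association, defined by ATLAS/LCG

def fts_route(src, dst):
    s, d = TIER[src], TIER[dst]
    if s == 0 and d == 1:
        return "Tier-0 FTS server, T0->T1 channel"
    if s == 1 and d == 0:
        return "Tier-0 FTS server, T1->T0/CAF channel"
    if s == 1 and d == 1:
        return f"{dst} FTS server, channel pulling from {src}"
    if 2 in (s, d):
        t1 = T2_TO_T1.get(src if s == 2 else dst)
        return f"{t1} FTS server, T1<->T2 channel"
    return "star channel (all remaining traffic)"

print(fts_route("CERN-T0", "BNL-T1"))
print(fts_route("LYON-T1", "BNL-T1"))
print(fts_route("BNL-T1", "AGLT2-T2"))
```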

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 27 ATLAS and Related Research  Up to now I have focused on the ATLAS computing model  Implicit in this model and central to its success are:  High-performance, ubiquitous and robust networks  Grid middleware to securely find, prioritize and manage resources  Without either of these capabilities the model risks melting down or failing to deliver the required capabilities.  Efforts to date have (necessarily) focused on building the most basic capabilities and demonstrating they can work.  To be truly effective will require updating and extending this model to include the best results of ongoing networking and resource management research projects.  A quick overview of some selected (US) projects follows…

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 28 The UltraLight Project  UltraLight is  A four year $2M NSF ITR funded by MPS (2005-8)  Application driven Network R&D.  A collaboration of BNL, Buffalo, Caltech, CERN, Florida, FIU, FNAL, Internet2, Michigan, MIT, SLAC, Vanderbilt.  Significant international participation: Brazil, Japan, Korea amongst many others.  Goal: Enable the network as a managed resource.  Meta-Goal: Enable physics analysis and discoveries which could not otherwise be achieved.

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 29 ATLAS and UltraLight Disk-to-Disk Research ATLAS MDT sub-systems need very fast calibration turn-around time (< 24 hours) Initial estimates plan for as much as 0.5 TB/day of high-Pt muon data for calibration. UltraLight could enable us to quickly transport (~1/4 hour) the needed events to Tier-2 sites for calibration Michigan is an ATLAS Muon Alignment and Calibration Center, a Tier-2 and an UltraLight site Muon calibration work has presented an opportunity to couple research efforts into production
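A quick check of the numbers above: moving ~0.5 TB of calibration data in about a quarter of an hour implies a sustained multi-Gbps disk-to-disk flow, which is why an UltraLight-class path matters. The sketch below is just that arithmetic.

```python
# What sustained rate does "0.5 TB in ~1/4 hour" imply for the muon calibration data?
DATA_TB = 0.5
WINDOW_HOURS = 0.25

gbits = DATA_TB * 8 * 1000            # TB -> Gbit (decimal units)
rate_gbps = gbits / (WINDOW_HOURS * 3600)
print(f"required sustained rate: ~{rate_gbps:.1f} Gb/s disk-to-disk")
```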

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 30 Networking at KNU (Korea)  Uses the 10 Gbps GLORIAD link from Korea to the US, which is called BIG-GLORIAD, also part of UltraLight  Trying to saturate this BIG-GLORIAD link with servers and cluster storage connected at 10 Gbps  Korea is planning to be a Tier-1 site for LHC experiments [map: Korea <-> U.S. via BIG-GLORIAD]

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 31 VINCI: Virtual Intelligent Networks for Computing Infrastructures  A network Global Scheduler implemented as a set of collaborating agents running on distributed MonALISA services  Each agent uses policy-based priority queues and negotiates for an end-to-end connection using a set of cost functions  A lease mechanism is implemented for each offer an agent makes to its peers  Periodic lease renewal is used for all agents; this results in a flexible response to task completion, as well as to application failure or network errors  If network errors are detected, supervising agents cause all segments along a path to be released.  An alternative path may then be set up rapidly enough to avoid a TCP timeout, allowing the transfer to continue uninterrupted.

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 32 Lambda Station A network path forwarding service to interface production facilities with advanced research networks:  Goal is selective forwarding on a per flow basis  Alternate network paths for high impact data movement  Dynamic path modification, with graceful cutover & fallback  Current implementation is based on policy-based routing & DSCP marking  Lambda Station interacts with:  Host applications & systems  LAN infrastructure  Site border infrastructure  Advanced technology WANs  Remote Lambda Stations D. Petravick, P. DeMar

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 33 TeraPaths (LAN QoS Integration) [diagram: Site A and Site B each run a user manager, scheduler, site monitor and router manager, driven through web pages, APIs and command-line QoS requests, and talk to WAN web services, WAN monitoring and hardware drivers] The TeraPaths project investigates the integration and use of LAN QoS and MPLS/GMPLS-based differentiated network services in the ATLAS data-intensive distributed computing environment in order to manage the network as a critical resource. TeraPaths includes: BNL, Michigan, ESnet (OSCARS), FNAL (Lambda Station), SLAC (DWMI)

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 34 Integrating Research into Production  As you can see there are many efforts, even just within the US, to help integrate a managed network into our infrastructure  There are also many similar efforts in computing, storage, grid-middleware and applications (EGEE, OSG, LCG,…).  The challenge will be to harvest these efforts and integrate them into a robust system for LHC physicists.  I will close with an “example” vision of what could result from such integration…

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 35 An Example: UltraLight/ATLAS Application (2008)
Node1> fts –vvv –in mercury.ultralight.org:/data01/big/zmumu05687.root –out venus.ultralight.org:/mstore/events/data –prio 3 –deadline +2:50 –xsum
FTS: Initiating file transfer setup…
FTS: Remote host responds ready
FTS: Contacting path discovery service
PDS: Path discovery in progress…
PDS: Path RTT ms, best effort path bottleneck is 10 GE
PDS: Path options found:
PDS: Lightpath option exists end-to-end
PDS: Virtual pipe option exists (partial)
PDS: High-performance protocol capable end-systems exist
FTS: Requested transfer 1.2 TB file transfer within 2 hours 50 minutes, priority 3
FTS: Remote host confirms available space for
FTS: End-host agent contacted…parameters transferred
EHA: Priority 3 request allowed for
EHA: request scheduling details
EHA: Lightpath prior scheduling (higher/same priority) precludes use
EHA: Virtual pipe sizeable to 3 Gbps available for 1 hour starting in 52.4 minutes
EHA: request monitoring prediction along path
EHA: FAST-UL transfer expected to deliver 1.2 Gbps (+0.8/-0.4) averaged over next 2 hours 50 minutes

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 36 ATLAS FTS 2008 Example (cont.)
EHA: Virtual pipe (partial) expected to deliver 3 Gbps (+0/-0.3) during reservation; variance from unprotected section < 0.3 Gbps 95%CL
EHA: Recommendation: begin transfer using FAST-UL using network identifier #5A-3C1. Connection will migrate to MPLS/QoS tunnel in 52.3 minutes. Estimated completion in 1 hour minutes.
FTS: Initiating transfer between mercury.ultralight.org and venus.ultralight.org using #5A-3C1
EHA: Transfer initiated…tracking at URL: fts://localhost/FTS/AE13FF132-FAFE39A-44-5A-3C1
EHA: Reservation placed for MPLS/QoS connection along partial path: 3 Gbps beginning in 52.2 minutes: duration 60 minutes
EHA: Reservation confirmed, rescode #9FA-39AF2E, note: unprotected network section included.
FTS: Transfer proceeding, average 1.1 Gbps, GB transferred
EHA: Connecting to reservation: tunnel complete, traffic marking initiated
EHA: Virtual pipe active: current rate 2.98 Gbps, estimated completion in minutes
FTS: Transfer complete, signaling EHA on #5A-3C1
EHA: Transfer complete received…hold for xsum confirmation
FTS: Remote checksum processing initiated…
FTS: Checksum verified—closing connection
EHA: Connection #5A-3C1 completed…closing virtual pipe with 12.3 minutes remaining on reservation
EHA: Resources freed. Transfer details uploading to monitoring node
EHA: Request successfully completed, transferred 1.2 TB in 1 hour 41.3 minutes (transfer 1 hour 34.4 minutes)

The ATLAS Computing Model: Status, Plans and Future Possibilities Shawn McKee 37 Conclusions  ATLAS is quickly approaching “real” data and our computing model has been successfully validated (as far as we have been able to take it).  Some major uncertainties exist, especially around “user analysis” and what resource implications these may have.  There are lots of R&D programs active in many areas of special importance to ATLAS (and LHC) which could significantly strengthen the core model  The challenge will be to select, integrate, prototype and test the R&D developments in time to have a meaningful impact upon the ATLAS (or LHC) program Questions?