
1 ATLAS Goals and Status. Jim Shank, US LHC. OSG Technology Roadmap, May 4–5, 2005

2 Outline
- High level US ATLAS goals
- Milestone summary and goals
- Identified Grid3 shortcomings
  - Workload management related
  - Distributed data management related
  - VO service related

3 US ATLAS High Level Requirements
- Ability to transfer large datasets between CERN, BNL, and the Tier2 centers
- Ability to store and manage access to large datasets at BNL
- Ability to produce Monte Carlo at Tier2 and other sites and deliver it to BNL
- Ability to store, serve, catalog, manage, and discover ATLAS-wide datasets, for both users and ATLAS file transfer services
- Ability to create and submit user analysis jobs to data locations
- Ability to opportunistically access non-dedicated ATLAS resources
- Deliver all of these capabilities at a scale commensurate with the available resources (CPU, storage, network, and human) and the number of users

4 ATLAS Computing Timeline (2003–2007)
- POOL/SEAL release (done)
- ATLAS release 7 (with POOL persistency) (done)
- LCG-1 deployment (done)
- ATLAS complete Geant4 validation (done)
- ATLAS release 8 (done)
- DC2 Phase 1: simulation production (done)
- DC2 Phase 2: intensive reconstruction (the real challenge!) LATE!
- Combined test beams (barrel wedge) (done)
- Computing Model paper (done)
- Computing Memorandum of Understanding (in progress) (NOW)
- ATLAS Computing TDR and LCG TDR (starting)
- Rome Physics Workshop
- DC3: produce data for PRR and test LCG-n
- Physics Readiness Report
- Start cosmic ray run: GO!

5 Grid Pressure
- Grid3/DC2 has been a very valuable exercise.
- There is now a lot of pressure on me to develop backup plans in case the "grid middleware" does not deliver the required functionality on our timescale.
- Since manpower is short, this could mean pulling manpower out of our OSG effort.

6 6 Schedule & Organization Grid Tools & Services 2.3.4.1 Grid Service Infrastructure 2.3.4.2 Grid Workload Management 2.3.4.3 Grid Data Management 2.3.4.4 Grid Integration & Validation 2.3.4.5 Grid User Support 20052006 2007 NOW CSCPRRCRRStartTDRDC2 ATLAS milestones

7 Milestone Summary (I): 2.3.4.1 Grid Service Infrastructure
- May 2005: ATLAS software management service upgrade
- June 2005: Deployment of OSG 0.2 (expected increments after this)
- June 2005: ATLAS-wide monitoring and accounting service
- June 2005: ATLAS site certification service
- June 2005: LCG interoperability (OSG ↔ LCG)
- July 2005: SRM/dCache deployed on Tier2 centers
- Sept 2005: LCG interoperability services for SC05

8 Milestone Summary (II): 2.3.4.2 Grid Workload Management
- June 2005: Defined metrics for submit host scalability met
- June 2005: Capone recovery possible without job loss
- July 2005: Capone2 WMS for ADA
- Aug 2005: Pre-production Capone2 delivered to the Integration team
  - Integration with SRM
  - Job scheduling
  - Provision of Grid WMS for ADA
  - Integration with DDMS
- Sept 2005: Validated Capone2 + DDMS
- Oct 2005: Capone2 WMS with LCG interoperability components

9 Milestone Summary (III): 2.3.4.3 Grid Data Management
- April 2005: Interfaces and functionality for OSG-based DDM
- June 2005: Integrate with OSG-based storage services, with benchmarks
- Aug 2005: Interfaces to the ATLAS OSG workload management system (Capone)
- Nov 2005: Implementation of storage authorization and policies

10 Milestone Summary (IV): 2.3.4.4 Grid Integration & Validation
- March 2005: Deployment of ATLAS on the OSG ITB (Integration Testbed)
- June 2005: Pre-production Capone2 deployed and validated on the OSG ITB
- July 2005: Distributed analysis service implemented with the WMS
- Aug 2005: Integrated DDM + WMS service challenges on OSG
- Sept 2005: CSC (formerly DC3) full functionality for the production service
- Oct 2005: Large-scale distributed analysis challenge on OSG
- Nov 2005: OSG-LCG interoperability exercises with ATLAS
- Dec 2005: CSC full-functionality pre-production validation

11 ATLAS DC2: Overview of Grids as of 2005-02-24 18:11:30
Per-grid job counts (submitted, pending, running, finished, failed) with overall efficiencies: Grid3 77 %, NorduGrid 62 %, LCG 38 %, TOTAL 53 %.
Capone + Grid3 Performance
- Capone submitted & managed more than 150K ATLAS jobs on Grid3
- In 2004: 1.2M CPU-hours
- Grid3 sites with more than 1000 successful DC2 jobs: 20
- Capone instances with more than 1000 jobs: 13

12 ATLAS DC2 and Rome Production on Grid3: average # jobs/day = 350, max # jobs/day = 1020

13 Most Painful Shortcomings
- "Unprotected" Grid services
  - GT2 GRAM and GridFTP vulnerable to multiple users and VOs
  - Frequent manual intervention by site administrators
- No reliable file transfer service or other data management services, such as space management
- No policy-based authorization infrastructure
  - No distinction between production and individual grid users
- Lack of a reliable information service
- Overall robustness and reliability (on client and server) poor

14 Resources: ATLAS Data Grid
- Strategy for dealing with multiple VOs
- Partition resources into VO-managed and shared
- Provision for hosting persistent "Guest" VO services & agents
[Diagram: shared OSG resources behind GRAM, GridFTP, IP, and SRM interfaces, with ATLAS VO services, agents, proxy caches, and catalogs alongside Guest VO services, agents, and proxy caches.]

15 ATLAS Production System: requirements for VO and core services (DDMS, WMS)

16 Grid Data Management
- ATLAS Distributed Data Management System (DDMS)
- US ATLAS GTS role in this project:
  - Provide input to the design: expected interfaces, functionality, scalability, and performance metrics, based on experience with Grid3 and DC2 and on compatibility with OSG services
  - Integrate with OSG-based storage services
  - Benchmark OSG implementation choices against ATLAS standards
  - Specify and develop, as needed, OSG-specific components required for integration with the overall ATLAS system
  - Introduce new middleware services as they mature
  - Interface to the OSG workload management system (Capone)
  - Implement storage authorization and policies for role-based usage (reservation, expiration, cleanup, connection to VOMS, etc.) consistent with ATLAS data management tools and services

17 DDMS Issues: File Transfers
- Dataset discovery and distribution for both production and analysis services
- AOD distribution from CERN to the Tier1
- Monte Carlo produced at Tier2 centers delivered to the Tier1
- Support role-based priorities for transfers and space

18 Hierarchical Data Model: VO Central and VO Sites (Torre Wenaus)

19 DDMS interfaces to the core: SE info provider, SE-specific claims catalog (Torre Wenaus)

20 ATLAS Interactions with Catalogues (LCG Baseline Services Working Group; Miguel Branco)
- Attempt to reuse the same Grid catalogues for the dataset catalogues (reusing both the mapping provided by the interface and the backend)
- The dataset catalogue infrastructure sits above many internal catalogues, accessed through the POOL FC API, an ATLAS API, and other Grid APIs
- On each site, a Local Replica Catalogue exports LFN & GUID -> SURL mappings (see the sketch below); worker nodes register files and datasets via ATLAS DQ, and POOL queries the catalogue to locate data on the SE
- On each site: a fault-tolerant service with multiple backends, internal space management, and user-defined metadata schemas
- Accept different catalogues and interfaces for different Grids, but expect to impose the POOL FC interface
- Covers datasets, internal space management, replication, metadata, and monitoring
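To make the catalogue interactions above concrete, here is a minimal sketch of the LFN & GUID -> SURL lookup that a site-local replica catalogue exports, assuming a hypothetical LocalReplicaCatalogue class; the real POOL FC and ATLAS DQ implementations differ, but the register/lookup pattern is the essential operation.

```python
# Minimal sketch of the LFN & GUID -> SURL lookup a site-local replica
# catalogue exports (hypothetical class; not the actual POOL FC or DQ code).
class LocalReplicaCatalogue:
    def __init__(self):
        # guid -> (lfn, surl) for files resident on this site's SE
        self._replicas = {}

    def register_file(self, guid, lfn, surl):
        """Called by a worker node after writing its output to the local SE."""
        self._replicas[guid] = (lfn, surl)

    def lookup(self, guid):
        """Resolve the GUID -> SURL mapping that a reading job needs."""
        lfn, surl = self._replicas[guid]
        return surl

# Example: register a produced file, then resolve it for a reading job.
lrc = LocalReplicaCatalogue()
lrc.register_file(
    guid="A1B2C3D4",
    lfn="dc2.simul.A._00001.pool.root",
    surl="srm://se.tier2.example.edu/atlas/dc2/dc2.simul.A._00001.pool.root",
)
print(lrc.lookup("A1B2C3D4"))
```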

21 Reliable File Transfer in DDMS (DQ evolving; M. Branco)
- MySQL backend
- Services for agents and clients
- Schedules transfers (see the sketch below); later, monitoring and accounting of resources
- Security and policy issues affecting the core infrastructure:
  - Authentication and authorization infrastructure needs to be in place
  - Priorities for production managers & end users need to be settable
  - Roles to set group and user quotas and permissions
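A minimal sketch of the scheduling idea, with an in-memory priority queue standing in for the MySQL-backed request table and illustrative role names; it shows how role-based priorities (production vs. end user) could decide which pending transfer an agent serves next.

```python
# Minimal sketch of prioritized transfer scheduling (in-memory stand-in for
# the MySQL-backed request table; role names and priorities are illustrative).
import heapq
import itertools

ROLE_PRIORITY = {"production": 0, "user": 10}   # lower value = served first
_counter = itertools.count()                    # FIFO order within a priority

transfer_queue = []

def request_transfer(source_surl, dest_se, role):
    """Client/agent call: enqueue a transfer request with a role-based priority."""
    prio = ROLE_PRIORITY.get(role, 20)
    heapq.heappush(transfer_queue, (prio, next(_counter), source_surl, dest_se))

def next_transfer():
    """Transfer agent call: pick the highest-priority pending request."""
    prio, _, source_surl, dest_se = heapq.heappop(transfer_queue)
    return source_surl, dest_se

request_transfer("srm://se.tier2.example.edu/f1", "BNL_SE", role="user")
request_transfer("srm://castor.cern.ch/aod/f2", "BNL_SE", role="production")
print(next_transfer())   # the production request is served first
```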

22 LCG-required SRM functions (LCG Deployment Group): DDMS interfacing to storage
- SRM v1.1 insufficient – mainly the lack of pinning
- SRM v3 not required – and its timescale is too late
- Require Volatile and Permanent space; Durable not practical
- Global space reservation: reserve, release, update (mandatory for LHCb, useful for ATLAS and ALICE); CompactSpace NN
- Permissions on directories mandatory
- Prefer permissions based on roles rather than DN (SRM integrated with VOMS is desirable, but on what timescale?)
- Directory functions (except mv) should be implemented asap
- Pin/unpin: high priority
- srmGetProtocols: useful but not mandatory
- Abort, suspend, resume request: all low priority
- Relative paths in SURLs: important for ATLAS and LHCb, not for ALICE
- CMS input/comments not yet included

23 Managing Persistent Services
- General summary: larger sites and ATLAS-controlled sites should let us run services such as the LRC, space management, and replication.
- Other sites can be handled by making them part of a multi-site domain managed by one of our sites (see the sketch below): their persistent-service needs are covered by services running at our sites, and site-local actions such as space management happen via submitted jobs (working with the remote service, and therefore requiring some remote connectivity or gateway proxying) rather than via local persistent services.
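A minimal sketch of the multi-site domain idea, with hypothetical site names and endpoints: sites that do not run persistent services are mapped to a managing site whose LRC (and other services) they use remotely.

```python
# Minimal sketch of multi-site service domains (site names and endpoints are
# hypothetical): sites without local persistent services resolve them at
# their managing site.
SERVICE_DOMAINS = {
    # managing site     sites it covers
    "BNL_Tier1": ["BNL_Tier1", "SmallSite_A", "SmallSite_B"],
    "UC_Tier2":  ["UC_Tier2", "OpportunisticSite_X"],
}

def lrc_endpoint_for(site):
    """Return the (possibly remote) LRC endpoint a job at `site` should use."""
    for manager, covered in SERVICE_DOMAINS.items():
        if site in covered:
            return f"http://lrc.{manager.lower()}.example.org"
    raise KeyError(f"no managing site configured for {site}")

print(lrc_endpoint_for("OpportunisticSite_X"))  # served by UC_Tier2's services
```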

24 Scalable Remote Data Access
- ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere
- This places additional traffic on the network, especially for large sites
- Suggest using local mechanisms, such as web proxy caches, to minimize this impact (see the sketch below)
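A minimal sketch of the proxy-cache approach, assuming a hypothetical site-local Squid at proxy.example.site:3128 and a made-up conditions-data URL; the real ATLAS database access goes through its own client libraries, but the caching idea is the same.

```python
# Minimal sketch: route remote reads through a site-local caching proxy so
# repeated lookups from many worker nodes do not all hit the central server.
# The proxy host and the conditions-data URL below are placeholders.
import urllib.request

proxy = urllib.request.ProxyHandler({"http": "http://proxy.example.site:3128"})
opener = urllib.request.build_opener(proxy)

url = "http://conditions.example.org/atlas/payload?tag=DC2&run=12345"
try:
    with opener.open(url, timeout=10) as response:
        payload = response.read()
    print(f"fetched {len(payload)} bytes via the site-local proxy cache")
except OSError as exc:
    print(f"lookup failed (expected with these placeholder hosts): {exc}")
```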

25 Workload Management
- LCG core services contributions focused on the LCG-RB, etc.
- Here we address the scalability and stability problems experienced on Grid3
- Defined metrics, to be achieved by job batching or by DAG appends (see the sketch below):
  - 5000 active jobs from a single submit host
  - Submission rate: 30 jobs/minute
  - >90% efficiency for accepted jobs
  - All of US production managed by one person
- Job state persistency mechanism
  - Capone recovery possible without job loss
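A minimal sketch of the job-batching idea (not the actual Capone or Condor-G code); submit_batch and the job list are hypothetical placeholders, and the loop simply keeps a single submit host within the target limits of 5000 active jobs and about 30 submissions per minute.

```python
# Minimal sketch of job batching with submission throttling (illustrative
# only; submit_batch() is a placeholder for handing jobs to Condor-G).
import time
from collections import deque

MAX_ACTIVE_JOBS = 5000
SUBMITS_PER_MINUTE = 30
BATCH_SIZE = 10

def submit_batch(batch):
    """Placeholder for handing a batch of job descriptions to Condor-G."""
    print(f"submitting batch of {len(batch)} jobs")

def run_submitter(pending, active_count):
    """Drain `pending`; `active_count` is a callable returning active jobs."""
    interval = 60.0 / SUBMITS_PER_MINUTE          # seconds per submission
    queue = deque(pending)
    while queue:
        if active_count() >= MAX_ACTIVE_JOBS:
            time.sleep(30)                        # wait for running jobs to drain
            continue
        batch = [queue.popleft() for _ in range(min(BATCH_SIZE, len(queue)))]
        submit_batch(batch)
        time.sleep(interval * len(batch))         # hold the average submission rate

# Usage (not executed here): run_submitter(job_list, active_count=query_condor_q)
```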

26 Other Considerations
- Require a reliable information system
- Integration with managed storage
  - Several resources in OSG will be SRM/dCache, which does not support third-party transfers (used heavily in Capone)
  - NeST is an option for space management
- Job scheduling
  - Static at present; change to matchmaking based on data location, job queue depth, and policy (see the sketch below)
- Integration with the ATLAS data management system (evolving)
  - The current system works directly with the file-based RLS
  - The new system will interface with the new ATLAS dataset model via the POOL file catalog interface
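A minimal sketch of the proposed matchmaking (not an actual Capone or Condor implementation); the Site fields and weights are illustrative, showing how data locality, queue depth, and policy could be combined into a single ranking.

```python
# Minimal sketch of data-location- and policy-aware matchmaking
# (illustrative fields and weights, not production scheduling logic).
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    queued_jobs: int
    policy_weight: float      # e.g. 1.0 for dedicated ATLAS, <1 for opportunistic
    local_datasets: set

def score(site: Site, input_dataset: str) -> float:
    locality = 1.0 if input_dataset in site.local_datasets else 0.0
    # Data-local sites and favored policies win; deep queues are penalized.
    return site.policy_weight * (2.0 * locality - 0.01 * site.queued_jobs)

def match(sites, input_dataset):
    return max(sites, key=lambda s: score(s, input_dataset))

sites = [
    Site("BNL_Tier1", queued_jobs=400, policy_weight=1.0, local_datasets={"dc2.simul.A"}),
    Site("UC_Tier2", queued_jobs=50, policy_weight=1.0, local_datasets=set()),
]
print(match(sites, "dc2.simul.A").name)
```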

27 Grid Interoperability
- Interoperability with LCG
  - A number of proof-of-principle exercises completed
  - Progress on the GLUE schema with LCG
  - Publication of US ATLAS resources (compute, storage) to the LCG index service (BDII)
  - Use of the LCG-developed Generic Information Provider
  - Submit an ATLAS job from LCG to OSG (via an LCG Resource Broker)
  - Submit from an OSG site to LCG services directly via Condor-G
  - Demonstration challenge under discussion with LCG in the OSG Interoperability Activity
- Interoperability with TeraGrid
  - Initial discussions begun in the OSG Interoperability Activity; F. Luehring (US ATLAS) co-chair
  - Initial issues identified: authorization, allocations, platform dependencies

28 Summary
- We have been busy almost continuously with production since July 2004.
- We uncovered many shortcomings in the Grid3 core infrastructure and in our own services (e.g. WMS scalability, no DDMS).
- Expect to see more ATLAS services and agents on gatekeeper nodes, especially for DDMS.
- We need VO-scoped management of resources by the WMS and DDMS until middleware services supply these capabilities.
- To be compatible with ATLAS requirements and OSG principles, we see resources partitioned (hosted dedicated-VO and guest-VO resources plus general OSG opportunistic use) to achieve reliability and robustness.
- Time is of the essence. A grid component that fully meets our specs but is delivered late is the same as a failed component: we will have already implemented a workaround.

29 Appendix

30 Grid3 System Architecture ("Capone inside")
Elements of the execution environment used to run ATLAS DC2 on Grid3. The dashed (red) box indicates processes executing on the Capone submit host; yellow boxes indicate VDT components.
[Diagram components: ProdDB, Windmill, Capone (jobsch), Pegasus, Chimera, VDC, RLS, DonQuijote, GridCat, MDS, MonALISA monitoring servers, Condor-G (schedd, GridMgr), GRAM, and Grid3 sites with CE, gsiftp, worker nodes (WN), and SE.]

31 Capone Workload Manager
The Capone architecture consists of three common component layers above selectable modules that interact with local and remote services.
- Message interface: Web Service & Jabber
- Translation layer: Windmill (Prodsys) schema; Distributed Analysis (ADA): TBD
- Process execution: Capone Finite State Machine (see the sketch below); Process eXecution Engine (PXE): SBIR (FiveSight)
- Processes: Grid interface (GCE client); Stub: local shell-script testing; Data Management: TBD
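A minimal sketch of the finite-state-machine idea behind the process-execution layer; the states, transitions, and JSON persistence here are illustrative rather than the actual Capone FSM, but persisting the state on every transition is what makes recovery without job loss possible.

```python
# Minimal sketch of a job finite state machine with persisted state
# (illustrative states/transitions/storage, not the actual Capone FSM).
import json

TRANSITIONS = {
    "defined":   ["submitted"],
    "submitted": ["running", "failed"],
    "running":   ["stage_out", "failed"],
    "stage_out": ["finished", "failed"],
    "finished":  [],
    "failed":    [],
}

class JobFSM:
    def __init__(self, job_id, store="jobstate.json"):
        self.job_id = job_id
        self.store = store
        self.state = "defined"

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self._persist()   # saving every step allows recovery after a crash

    def _persist(self):
        with open(self.store, "w") as f:
            json.dump({"job_id": self.job_id, "state": self.state}, f)

job = JobFSM("dc2.simul.A._00001")
for step in ("submitted", "running", "stage_out", "finished"):
    job.advance(step)
print(job.state)
```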


