ATLAS Use and Experience of FTS

Presentation transcript:

ATLAS Use and Experience of FTS
FTS workshop, 16 Nov 05

Outline
- Intro to ATLAS DDM
- How we use FTS
- SC3 Tier 0 exercise experience
- Things we like
- Things we would like

ATLAS DDM System
- Moves from a file-based system to one based on datasets
  - Hides file-level granularity from users
  - A hierarchical structure makes cataloging more manageable
  - However, file-level access is still possible
- Scalable global data discovery and access via a catalog hierarchy
- No global physical file replica catalog (but a global dataset replica catalog and a global logical file catalog)
(Diagram: Datasets, Sites, Files)

ATLAS DDM System
- As well as catalogs for datasets and locations, we have 'site services' to replicate data
- We use 'subscriptions' of datasets to sites, held in a global catalog (see the sketch below)
- Site services take care of replica resolution, transfer and registration at the destination site
(Diagram: Site 'X', Site 'Y'; Dataset 'A', (Container) Dataset 'B' with File1/File2 and Data block1/Data block2; Subscriptions: Dataset 'A' | Site 'X', Dataset 'B' | Site 'Y')
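A minimal sketch, with invented names and plain Python data structures, of the subscription idea described above; the real DDM catalogs are database-backed services, so this is only an illustration of the model.

    # Illustrative sketch only: a toy model of DDM-style dataset subscriptions.
    # The real DDM catalogs are services backed by databases; names here are invented.

    # Global dataset replica catalog: dataset -> sites holding a replica
    dataset_replicas = {
        "esd.0003": ["CERN"],
    }

    # Global subscription catalog: dataset -> sites that want a replica
    subscriptions = {
        "esd.0003": ["TAIWAN"],
    }

    def pending_subscriptions():
        """Yield (dataset, site) pairs for which the site services still have work to do."""
        for dataset, wanted_sites in subscriptions.items():
            for site in wanted_sites:
                if site not in dataset_replicas.get(dataset, []):
                    yield dataset, site

    for dataset, site in pending_subscriptions():
        print("Site services at %s must replicate dataset %s" % (site, dataset))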

Subscription Agents (file states held in a site-local MySQL DB)

  Agent             Function                           File state on success
  Fetcher           Finds incomplete datasets          unknownSURL
  ReplicaResolver   Finds remote SURL                  knownSURL
  MoverPartitioner  Assigns Mover agents               assigned
  Mover             Moves file (uses FTS here!)        toValidate
  ReplicaVerifier   Verifies local replica             validated
  BlockVerifier     Verifies whole dataset complete    done

This is what runs on the VO Boxes (see the sketch of this state chain below).
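A minimal sketch of the per-file state chain driven by these agents; agent and state names follow the table above, while the surrounding plumbing is invented for illustration.

    # Illustrative only: the per-file state machine driven by the subscription agents.
    # Agent and state names follow the table above; the plumbing is invented.

    # Each agent picks up files in one state and, on success, moves them to the next.
    AGENT_TRANSITIONS = [
        # (agent,            input state,    output state on success)
        ("Fetcher",          None,           "unknownSURL"),  # finds files of incomplete datasets
        ("ReplicaResolver",  "unknownSURL",  "knownSURL"),    # resolves a source SURL
        ("MoverPartitioner", "knownSURL",    "assigned"),     # hands files to a Mover instance
        ("Mover",            "assigned",     "toValidate"),   # bulk copy via FTS
        ("ReplicaVerifier",  "toValidate",   "validated"),    # checks the local replica
        ("BlockVerifier",    "validated",    "done"),         # checks the whole dataset
    ]

    def next_state(current_state):
        """Return (agent, state) a file reaches after the responsible agent succeeds."""
        for agent, in_state, out_state in AGENT_TRANSITIONS:
            if in_state == current_state:
                return agent, out_state
        raise ValueError("no agent handles state %r" % current_state)

    # Example: a failed transfer is put back to 'unknownSURL' and walks the chain again.
    print(next_state("unknownSURL"))   # ('ReplicaResolver', 'knownSURL')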

Within the Mover agent
- The Python Mover agent reads in an XML file catalog of source files to copy (a parsing sketch follows the example entry below)
- The destination file name is based on the SRM endpoint + dataset name + source filename

  <File ID="bc340aff-4057-4dcc-98aa-204432c4bb07">
    <physical>
      <pfn filetype="" name="srm://castorgridsc.cern.ch/castor/cern.ch/grid/atlas/ddm_tier0/perm/esd.0003/esd.0003._5645.1"/>
    </physical>
    <logical/>
    <metadata att_name="destination" att_value="http://vobox.grid.sinica.edu.tw:8000/dq2//esd.0003"/>
    <metadata att_name="fsize" att_value="500000000"/>
    <metadata att_name="md5sum" att_value=""/>
  </File>
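A minimal sketch, standard library only, of how a POOL-style catalog entry like the one above could be parsed and a destination SURL built from it; the destination endpoint and the exact naming rule are assumptions for illustration, not the actual Mover code.

    # Illustrative sketch: parse a POOL-style XML file catalog and build destination
    # SURLs. The endpoint and naming convention below are assumed examples.
    import os
    import xml.dom.minidom

    DEST_SRM_ENDPOINT = "srm://example-dest-se.example.org/data/atlas"  # assumed

    def source_dest_pairs(catalog_path, dataset_name):
        """Yield (source SURL, destination SURL) pairs for every <File> in the catalog."""
        doc = xml.dom.minidom.parse(catalog_path)
        for file_node in doc.getElementsByTagName("File"):
            for pfn in file_node.getElementsByTagName("pfn"):
                source = pfn.getAttribute("name")
                filename = os.path.basename(source)
                # destination = SRM endpoint + dataset name + source filename
                dest = "%s/%s/%s" % (DEST_SRM_ENDPOINT, dataset_name, filename)
                yield source, dest

    if __name__ == "__main__":
        # Prints one 'source destination' pair per line, the format used for bulk submission.
        for src, dst in source_dest_pairs("PoolFileCatalog.xml", "esd.0003"):
            print("%s %s" % (src, dst))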

Within the Mover agent
- We create a file of source and destination SURLs and submit the bulk job to FTS (using the CLI via the Python 'commands' module; see the sketch below)
- Then we query every x seconds using glite-transfer-status to see if the status changes
  - 'Done': mark all files as successfully copied
  - 'Hold', 'Failed': some or all files failed, so look through the output for successes and failures
- In the case of a failed file:
  - The file is put back into the 'unknownSURL' state and goes through the chain of agents again (max 5 times x 3 FTS retries = 15 retries overall)
- Successful files:
  - The destination file is validated by using SRM commands directly (getFileMetaData) to compare the file size with the source catalog file size
  - We would like to know if this stage is really necessary or if FTS already does it (or will in the future?) (more later...)
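A minimal sketch of the submit-and-poll pattern described above, driving the gLite FTS CLI from Python; the service endpoint, polling interval, exact option strings and output handling are assumptions here, not the actual Mover agent.

    # Illustrative sketch: submit a bulk copy job to FTS via the CLI and poll its state.
    # Endpoint, file names and options are assumed examples, not the real Mover agent.
    import commands   # Python 2 era module, as used at the time; subprocess would also work
    import time

    FTS_SERVICE = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"  # assumed

    def submit_bulk(copyjob_path):
        """Submit a file of 'source destination' SURL pairs and return the FTS job ID."""
        status, output = commands.getstatusoutput(
            "glite-transfer-submit -s %s -f %s" % (FTS_SERVICE, copyjob_path))
        if status != 0:
            raise RuntimeError("submission failed: %s" % output)
        return output.strip()   # the CLI prints the job ID

    def wait_for_job(job_id, poll_seconds=60):
        """Poll glite-transfer-status until the job leaves the active states."""
        while True:
            status, output = commands.getstatusoutput(
                "glite-transfer-status -s %s %s" % (FTS_SERVICE, job_id))
            state = output.strip()
            if state == "Done":
                return True            # mark all files as copied
            if state in ("Hold", "Failed"):
                return False           # inspect per-file output for successes/failures
            time.sleep(poll_seconds)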

Using FTS within SC3
- ATLAS' SC3 is a Tier 0 exercise in which we produce RAW data at CERN and replicate the reconstructed data to Tier 1 sites (using FTS!)
- We started officially on 2nd Nov, so we have been running for ~2 weeks now
- Preceded by ~1 month of small-scale testing using the FTS pilot service; this was very useful for testing the integration of FTS and for debugging site problems with SRM paths etc.

Results so far: 1 - 7 Nov (plots)

Results so far: 9 - 15 Nov (put latest plots here)

What worked well
- The service is very reliable: virtually no failures connecting to the service (apart from when CERN had an unstable network)
- 99.9% of failures are problems with sites/humans
- It hasn't lost any of our jobs' information
- The interface is friendly and self-explanatory
- The throughput rate is fast enough, but we haven't really stressed it so far
- Response to reported errors is good (fts-support)

What we would like
- Staging from tape
  - In theory this is not a problem for us in SC3, but it will be in the future
  - We would like FTS to deal with staging from tape properly (rather than giving SRM get timeouts), have a 'staging' status and perhaps enable us to query through FTS whether files are on tape or disk
- Integration with replica catalogs (see the hypothetical sketch below)
  - We use LFC (LCG) and Oracle/Globus RLS through the POOL FC interface (OSG)
  - So we can say 'move LFN x from site y to site z' and FTS calls a service that takes care of resolution and registration
- Bandwidth monitoring within FTS
- Error reporting
  - Email lists again... we would like to know who to tell in case of error. Can you give a hint based on the error?
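This catalog integration does not exist in FTS today; purely as illustration, a hypothetical sketch of the kind of wrapper we have in mind, resolving an LFN to a source SURL through the LCG catalog CLI before an FTS submission. The command options, site-prefix matching and function names are examples, not a proposal for a concrete API.

    # Hypothetical illustration only: what an LFN-level 'move x from site y to site z'
    # could look like if catalog resolution and registration were wrapped around FTS.
    import commands

    def replicas_for_lfn(lfn):
        """Resolve an LFN to its replica SURLs via the LCG catalog CLI (assumed lcg-lr usage)."""
        status, output = commands.getstatusoutput("lcg-lr --vo atlas lfn:%s" % lfn)
        if status != 0:
            raise RuntimeError("catalog lookup failed: %s" % output)
        return output.split()

    def move_lfn(lfn, source_site_prefix, dest_surl):
        """Pick the replica at the source site and hand the pair to the FTS submission."""
        sources = [r for r in replicas_for_lfn(lfn) if r.startswith(source_site_prefix)]
        if not sources:
            raise RuntimeError("no replica of %s at the requested source site" % lfn)
        # submit_bulk() as in the earlier sketch; registration of the new replica in the
        # catalog would follow once the transfer is reported 'Done'.
        return sources[0], dest_surl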

What we would like
- TierX to TierY transfers handled by the network fabric, so channels between all sites should exist
- Support for priorities, with the possibility to do late reshuffling
- Plugins to allow interactions with the experiment's services. Examples of plugins, or experiment-specific services:
  - catalog interactions (not exclusively grid catalogs)
  - plugins to zip files on the fly (transparently to users, but very good for MSS), after the transfer starts and/or before files are stored on storage
  - one idea is for FTS to provide a callback? We must understand the VO agents framework and what can be done with that!
- Reliability: keep retrying until told to stop, but allow real-time monitoring of transfer errors (parseable errors preferable) so that we can reshuffle transfers, cancel them, etc.
- Signal conditions such as source missing, destination down, etc.

Some Questions (maybe already answered today!)
- We would like to understand how to optimise (number of files per bulk job, etc.)
- Do you distinguish between permanent errors (channel doesn't exist) and temporary errors (SRM timeout)? I.e. not retrying permanent errors, and is there a way to report this to us so we don't retry either?
- Do we need our own verification stage, or are we just repeating what FTS does?
- 'Duration': is this the time from submission to completion, or the 'Active' time?

Conclusion
- We are happy with the FTS service so far; it has given us some good results
- But we haven't tested it until it breaks!
- Probably the most reliable part of SC3 in our experience
- We would like to see it integrated with more components to reduce our workload (staging, catalogs)
- We look forward to further developments!