
EGEE-II INFSO-RI-031688 – Enabling Grids for E-sciencE

WLCG File Transfer Service
Sophie Lemaitre – Gavin McCance
Joint EGEE and OSG Workshop on Data Handling in Production Grids, Monterey, 25 June 2007

FTS overview

The File Transfer Service (FTS) is a data movement fabric service:
– It is a multi-VO service, used to balance usage of site resources according to VO and site policies.

Why is it needed?
– For the user, it provides reliable point-to-point movement of files.
– For the site manager, it provides a reliable and manageable way of serving file movement requests from their experiments.
– For the production manager, it provides the ability to control requests coming from his users: re-ordering, prioritisation, etc.
– The focus is on the "service": it should make it easy to do these things well.

Who uses it (1)

The sites use it as part of their fabric:
– It is designed to make it easier for a multi-VO site to run the transfers of its VOs.
– Tier-1 sites run the FTS servers and are responsible for processing the transfer requests from Tier-2s and for transferring data between Tier-1s.
– Tier-0 export is run from CERN.
– The focus is on the service delivered, the ease of manageability, and service monitoring.

Who uses it (2)

FTS is used by experiment frameworks:
– Typically end-users do not interact with it directly; they interact with their experiment framework.
– Production managers sometimes query it directly to debug or chase problems.

The experiment framework decides it wants to move a set of files:
– The framework is responsible for staging-in (for now).
– It packages up a set of source/destination file pairs and submits transfer jobs to FTS (see the sketch below).
– The state of each job is tracked as it progresses through the various transfer stages.
– The experiment framework can poll the status at any time.
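As a rough illustration of that flow, here is a minimal Python sketch of a framework packaging source/destination SURL pairs into a job and handing it to FTS. The `FTSClient` class, its `submit` method, and the SURLs are hypothetical stand-ins, not the real FTS client API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import uuid

@dataclass
class TransferJob:
    """A bundle of source/destination SURL pairs submitted as one unit."""
    pairs: List[Tuple[str, str]]
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))

class FTSClient:
    """Hypothetical stand-in for the FTS job-submission web service."""
    def submit(self, job: TransferJob) -> str:
        # In reality this would call the FTS web-service endpoint.
        print(f"submitted {len(job.pairs)} file pairs as job {job.job_id}")
        return job.job_id

# The experiment framework packages the files it wants moved...
pairs = [
    ("srm://srm.cern.ch/castor/cern.ch/data/file1",
     "srm://srm.example-t1.org/dpm/data/file1"),
    ("srm://srm.cern.ch/castor/cern.ch/data/file2",
     "srm://srm.example-t1.org/dpm/data/file2"),
]
job_id = FTSClient().submit(TransferJob(pairs))
```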

Service APIs

FTS has three basic API groups:

Job submission / tracking
– Used by experiment frameworks to submit requests.

Service / channel management
– Used by administrators and VO production managers to control the service.

Statistics tools
– Provide aggregate statistics on what the service has been doing: current failure rates, failure classes, etc.
– This is being done as part of the WLCG monitoring group, to make sure the information is available to all interested stakeholders.

Security model

Transfers are always run using the user's credential:
– In FTS 2.0, the VOMS credential is now used (and renewed as necessary).

Authorisation to the service is done using either:
– the grid mapfile mechanism, or
– VOMS roles: VO production manager roles, channel administrator roles, and a service manager role (a rough sketch of such a role check follows).
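For illustration only, a minimal sketch of what a role-based check along these lines might look like. The role names mirror the slide, but the permission mapping and the function itself are assumptions, not the actual FTS authorisation code.

```python
# Hypothetical mapping from VOMS role to the operations it permits.
ROLE_PERMISSIONS = {
    "production-manager": {"submit", "cancel", "reprioritise"},
    "channel-admin":      {"set-channel-params", "pause-channel"},
    "service-admin":      {"set-channel-params", "pause-channel",
                           "add-channel", "drain-service"},
}

def is_authorised(voms_roles: list[str], operation: str) -> bool:
    """Grant an operation if any of the caller's VOMS roles permits it."""
    return any(operation in ROLE_PERMISSIONS.get(role, set())
               for role in voms_roles)

assert is_authorised(["production-manager"], "cancel")
assert not is_authorised(["production-manager"], "pause-channel")
```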

User API

Uses a submit/poll pattern with a unique job ID:
– Jobs can contain multiple copy requests.

Various polling methods with different levels of detail:
– Overall job status ("is it done yet?")
– Job summary
– Detailed status of individual files and file failures

Job cancellation and priority reshuffling by suitably authorised users:
– i.e. VO production managers.

No notification mechanism yet:
– The submit/poll pattern isn't so efficient... (a sketch of such a polling loop follows this slide).

Much commonality with the Globus RFT API:
– We've been talking...
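A minimal sketch of the submit/poll pattern from the client side, assuming a `get_status(job_id)` call that returns the overall job state. The state names, the fake status source, and the polling interval are illustrative assumptions, not the real FTS state machine.

```python
import itertools
import time

TERMINAL_STATES = {"Done", "Failed", "Canceled"}   # assumed state names

# Fake status source standing in for the FTS status call: a job that
# moves through Pending -> Active -> Done.
_fake_states = itertools.chain(["Pending", "Active"], itertools.repeat("Done"))

def get_status(job_id: str) -> str:
    """Hypothetical stand-in for the FTS job-status web-service call."""
    return next(_fake_states)

def wait_for_job(job_id: str, poll_interval: float = 1.0) -> str:
    """Poll the overall job status until it reaches a terminal state.

    This loop is exactly the inefficiency the slide points at: with no
    notification mechanism, the client has to keep asking.
    """
    while True:
        state = get_status(job_id)
        print(f"job {job_id}: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval)

print(wait_for_job("42-abc"))
```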

Channel concept

For ease of management, the service supports splitting jobs onto multiple "channels":
– Once a job is submitted to the FTS, it is assigned to a suitable channel for serving (see the assignment sketch below).

A channel may be:
– A point-to-point network link (e.g. we manage all the T0-export links in WLCG on a separate channel).
– A "catch-all" channel (e.g. "everything else coming to me", or "everything to one of my Tier-2 sites").
– More flexible "grouping of sites" channel definitions are on the way.

Channels are uni-directional:
– e.g. at CERN we have one set for export and one set for import.
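A rough sketch of the assignment step, assuming channels are matched most-specific first: a dedicated point-to-point channel if one exists for the site pair, otherwise a catch-all. The channel table, the `*` wildcard convention, and the lookup order are invented for illustration; the real FTS matching rules are more involved.

```python
# Channel table: (source site, destination site) -> channel name.
# "*" is an illustrative catch-all wildcard.
CHANNELS = {
    ("CERN", "RAL"): "CERN-RAL",   # dedicated T0-export link
    ("CERN", "FZK"): "CERN-FZK",
    ("*",    "RAL"): "STAR-RAL",   # everything else coming to RAL
}

def assign_channel(source_site: str, dest_site: str) -> str:
    """Pick the most specific channel that covers this site pair."""
    for key in [(source_site, dest_site),
                ("*", dest_site),
                (source_site, "*")]:
        if key in CHANNELS:
            return CHANNELS[key]
    raise LookupError(f"no channel serves {source_site} -> {dest_site}")

assert assign_channel("CERN", "RAL") == "CERN-RAL"
assert assign_channel("FNAL", "RAL") == "STAR-RAL"   # falls to the catch-all
```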

Channels…

"Channel": it's not a great name:
– It always causes confusion... (but we're ~stuck with the name now).
– It isn't tied to a physical network path.
– It's just a management concept.
– "Queue" might be a better name.

All file transfer jobs on the same channel are served as part of the same queue:
– Inter-VO priorities for the queue (e.g. ATLAS gets 75%, CMS gets the rest; a fair-share sketch follows this slide).
– Internal priorities within each VO.
– Each channel has its own set of transfer parameters: number of concurrent files running, number of streams, TCP buffer size, etc.

Given the transfers your FTS server is required to support (as defined by the experiment computing models and WLCG), channels allow you to split up the management of these as you see fit.
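To make the inter-VO share concrete, a small sketch of weighted selection between per-VO queues on one channel, using the 75%/25% split from the slide. This is an illustrative fair-share draw, not the actual FTS scheduling algorithm.

```python
import random
from collections import deque

# Per-channel VO shares from the slide: ATLAS 75%, CMS the rest.
VO_SHARES = {"atlas": 0.75, "cms": 0.25}

queues = {
    "atlas": deque(["a1", "a2", "a3"]),
    "cms":   deque(["c1", "c2"]),
}

def next_job():
    """Draw the next job, weighting the non-empty VO queues by their share."""
    candidates = {vo: q for vo, q in queues.items() if q}
    if not candidates:
        return None
    vos = list(candidates)
    weights = [VO_SHARES[vo] for vo in vos]
    chosen = random.choices(vos, weights=weights)[0]
    return candidates[chosen].popleft()

while (job := next_job()) is not None:
    print("serving", job)
```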

FTS topology

– Simplified tiered infrastructure.
– FTS servers are located at CERN and the Tier-1 sites.
– To provide full "coverage", WLCG defines what transfers a given FTS server has to support.
– FTS servers are independent.

FTS and data scheduling

FTS provides the reliable and manageable transport layer. It does not (and will not) provide more complex data scheduling:
– Multi-hop transfers
– Broadcast transfers
– Dataset collation

But it may be used as the underlying management layer for services providing this.

Much of this extra functionality is currently provided in the experiment layer:
– It is quite dependent on the computing model.
– e.g. PhEDEx from CMS.

FTS server architecture

All components are decoupled from each other:
– Each interacts only with the database (see the sketch below).
– Experiments interact via the web service.
– VO agents perform VO-specific operations (one per VO).
– Channel agents perform channel-specific operations (e.g. the transfers).
– Monitoring and statistics can be collected via the DB.
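A minimal sketch of this database-mediated decoupling: the web service only inserts rows and an agent only claims rows, so the two components never talk to each other directly. The table and column names are invented for illustration, and SQLite is used only to keep the sketch self-contained; the production service runs on Oracle.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE transfers (
    id INTEGER PRIMARY KEY,
    channel TEXT, source TEXT, dest TEXT,
    state TEXT DEFAULT 'Submitted')""")

def web_service_submit(channel, source, dest):
    """Front-end: record the request and return; no agent is contacted."""
    cur = db.execute(
        "INSERT INTO transfers (channel, source, dest) VALUES (?, ?, ?)",
        (channel, source, dest))
    return cur.lastrowid

def channel_agent_claim(channel):
    """Channel agent: claim the oldest submitted transfer on its channel."""
    row = db.execute(
        "SELECT id, source, dest FROM transfers "
        "WHERE channel = ? AND state = 'Submitted' ORDER BY id LIMIT 1",
        (channel,)).fetchone()
    if row:
        db.execute("UPDATE transfers SET state = 'Active' WHERE id = ?",
                   (row[0],))
    return row

print("submitted:", web_service_submit("CERN-RAL", "srm://cern/f1", "srm://ral/f1"))
print("claimed:", channel_agent_claim("CERN-RAL"))
```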

FTS server architecture

Designed for high availability and scalability:
– The user-facing web-service front-end is stateless and (should be) load-balanced, to provide availability and scalability.
   – Service interventions that don't require a DB schema upgrade can be made with zero user-visible downtime.
– Agent daemons are designed to scale over multiple nodes as necessary with load.
– The critical component is the central DB: the WLCG production services run on Oracle RAC to provide availability and scalability.

FTS 2.0

New features in the FTS 2.0 server:
– Delegation of the proxy from the client to the FTS service.
– Improved monitoring capabilities:
   – Critical to the operational stability of the 'overall transfer service'.
   – Much more data is retained in the database, with some new methods in the admin API to access it.
– Beta SRM 2.2 support:
   – Now being tested on the EGEE pre-production service as part of the SRM 2.2 testing activity.
– Better administration tools, to make it easier to run the service.
– A better database model, to improve performance and scalability.
– Placeholders for future functionality, to minimise the impact of future upgrade interventions.

FTS developments

Planned:
– Evolve the SRM 2.2 code as we understand the SRM 2.2 implementations (based on feedback from PPS).
– Incrementally improve service monitoring:
   – FTS will have the capacity to give very detailed measurements of the current service level and the problems currently being observed with sites.
   – Integration with experiment and operations dashboards.
   – Design work is ongoing.
– Site grouping in channel definitions ("clouds"):
   – To make it easier to implement the computing models of CMS and ALICE.
   – The code exists; it is to be tested on the pilot service.
– Incrementally improve the service administration tools.
– SRM/gridFTP split.
– Notification of job state changes.

Not planned:
– A non-Oracle version: sites with lower production requirements can use the restricted Oracle XE.

FTS current status

Current FTS production status:
– CERN has just moved to FTS 2.0.
– All T1 sites are currently using FTS 1.5.
– More than 10 petabytes have been exported from CERN since SC4.
– A few more petabytes have been moved between Tier-1 sites and from Tier-1 to Tier-2 sites.

The FTS infrastructure runs well:
– CERN and the T1 sites ~understand the software.
– Most problems were ironed out last year.
– The remaining problems are understood with the experiments, and we have a plan to address them.

There are still problems with the 'overall transfer service'.

Issues (1)

"There are still problems with the overall transfer service."

The overall system is very complex:
– Understanding the cross-site "end-to-end transfer service" is still an issue:
   – Experiment layer, FTS, SRM at source, SRM at destination, gridFTP servers, network, tape backends.
   – It can be done, but the manpower required is significant and is not sustainable in the long term.
– The number of retries needed to get files from A to B is still rather high, which reduces efficiency.
– Improving the stability of the services is critical (FTS included).
– Monitoring will help:
   – "Understanding the whole system" is our primary focus.
   – Can we coordinate the logging and monitoring of FTS and the SRMs to improve this situation?

Issues (2)

Behaviour under error conditions differs between SRM implementations:
– This took a lot of effort to resolve in SRM 1.1.
– The hope is that the SRM 2.2 standard is better in this regard.

Still, a conservative deployment schedule must anticipate problems of this type for the SRM 2.2 deployment in production:
– The "overall production service" will not be stable until any such integration problems are understood.

Issues (3)

FTS easily lets you throttle channels writing to your storage:
– This was a deployment choice of WLCG.
– But source overloading is still a problem:
   – Recently reported by ATLAS (e.g. BNL).
– It would be good if the SRMs could indicate their busy-ness to FTS by some mechanism, so that it could back off (a back-off sketch follows):
   – The other proposed solution, having all the FTS servers and other SRM clients cooperate (in a data-scheduler model) so as not to overload an SRM, is not seen as credible by WLCG.
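To illustrate the kind of back-off being asked for, a small sketch in which a channel halves its concurrent-file limit whenever the source SRM reports busy, and creeps back up otherwise. The busy signal and the multiplicative-decrease policy are assumptions for illustration, not an agreed FTS/SRM mechanism.

```python
class ChannelThrottle:
    """Adapt a channel's concurrent-file limit to an SRM 'busy' signal.

    Multiplicative decrease on busy, additive increase otherwise
    (the same shape as TCP congestion control).
    """
    def __init__(self, limit: int = 20, floor: int = 1, ceiling: int = 50):
        self.limit, self.floor, self.ceiling = limit, floor, ceiling

    def on_srm_report(self, busy: bool) -> int:
        if busy:
            self.limit = max(self.floor, self.limit // 2)   # back off fast
        else:
            self.limit = min(self.ceiling, self.limit + 1)  # recover slowly
        return self.limit

t = ChannelThrottle()
for busy in [False, False, True, False, True]:
    print("concurrent files allowed:", t.on_srm_report(busy))
```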

Summary

FTS is designed as a highly available and scalable service to help sites manage the file transfer requests from their VOs:
– The focus is upon service management.

The current WLCG FTS infrastructure runs well.

Problems remain with the "overall transfer service":
– Complexity: cross-site debugging is expensive.
– Resilience: it is too easy to overload services, and 'standard' interfaces are not always quite standard, especially under error conditions.

This is where we need to focus.