Busy Storage Services Flavia Donno CERN/IT-GS WLCG Management Board, CERN 10 March 2009.



The problem
Storage services can become very slow or unresponsive when "busy", i.e. when running at the limit of some resource (high load). In dCache, for example, PNFS bottlenecks can make the whole service very slow or unresponsive. The situation can degrade to the point where the service becomes unusable and requires human intervention. Current Data Management clients do not handle a "busy" storage service correctly: some clients (e.g. FTS) may abort and immediately retry a failed request, making the busy state of the storage server even worse. Until now there was no agreed way to communicate the busy status of a server to a client.

Goals
We organized two phone conferences with storage service and Data Management client developers, Data Management experts and some managers (OSG and EGEE):
- Storage services: CASTOR, dCache, DPM, StoRM, BeStMan
- Clients: GFAL/lcg-utils, FTS, dCache srm* clients, StoRM and BeStMan clients
- Experiment DM clients
The goals:
- Agree on a way to communicate the "busy" status of a server to clients
- Advise clients on how to react in the presence of a "busy" server

Conclusions
The SRM server MUST return SRM_INTERNAL_ERROR when experiencing transient problems (caused by high load on some component); SRM_FAILURE can always be returned to fail a request. If the request will be processed, the SRM server SHOULD return an estimatedWaitTime for each file in the request, telling the client when the next polling SHOULD happen in order to get an updated status for each file. If the client application receives SRM_INTERNAL_ERROR from the SRM server, it MAY repeat the request; if it does, it SHOULD use a randomized exponential retry time. The algorithm used by Ethernet when congestion is detected was proposed as a good example of a randomized exponential retry algorithm.
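The agreed client behaviour can be sketched as follows. This is a minimal illustration, not code from the reference implementation: `submit` stands for any hypothetical callable issuing an SRM request and returning a (status, result) pair, and the parameter names and defaults are assumptions.

```python
import random
import time

SRM_INTERNAL_ERROR = "SRM_INTERNAL_ERROR"

def submit_with_backoff(submit, max_retries=8, base=1.0, cap=300.0):
    """Retry submit() on SRM_INTERNAL_ERROR using a randomized
    exponential backoff, in the spirit of Ethernet's reaction to
    congestion: the retry window doubles on each attempt and the
    actual delay is drawn uniformly at random from that window."""
    for attempt in range(max_retries):
        status, result = submit()
        if status != SRM_INTERNAL_ERROR:
            # Any other outcome (success or permanent failure) is
            # returned to the caller unchanged.
            return status, result
        # Wait a random time in [0, base * 2**attempt], capped.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return SRM_INTERNAL_ERROR, None
```

The randomization matters: if all clients retried after the same fixed delay, their requests would arrive at the busy server in synchronized bursts.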

Backward-compatibility
Current official WLCG Data Management clients can already catch the SRM_INTERNAL_ERROR return code. However, they do not act optimally: they might either abort the request or retry immediately. Current storage services already send SRM_INTERNAL_ERROR, but only in very rare cases (failure of some internal component, e.g. the database), and the return code reaches the clients when it is already too late. The suggested solution is therefore backward-compatible, although it is advisable (but not required) that "busy storage services"-aware clients be deployed before "busy storage services"-enabled servers: otherwise clients might fail more often than necessary.

Status of WLCG DM clients
The official WLCG Data Management clients that need code changes:
- FTS: release 2.2 will not contain any of the agreed changes; the priority is now on the checksum checks. Coding for the issue under discussion will begin in April, and at least three months are expected before it is in production.
- GFAL/lcg-utils: implementation of the suggested changes will not start before mid-March.
- dCache srm* clients: no precise schedule yet.
- StoRM clients: they currently perform only atomic operations (no retries) and therefore need no changes; retries will be implemented later on.
- BeStMan clients: the required changes will be implemented in the next two weeks.

Experiment-specific DM clients
We cannot control what the experiments do. Therefore:
- Reference implementations are being provided for all known use cases, as specific guidelines and examples.
- We follow up with the experiments' DM developers.
- We make GGUS supporters aware of this work.
A first reference implementation for the pre-stage use case can be found at deployment/flavia/prestage_release_example.py. Reference implementations for all known use cases can be ready in about two months.
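The pre-stage use case combines the two agreed behaviours: submit a bring-online request, then poll each file no sooner than the server's estimatedWaitTime suggests. The sketch below is illustrative only, assuming a hypothetical `srm` client object; the method names `bring_online` and `status_of_bring_online` and the status strings other than SRM_FAILURE are assumptions, not the actual reference implementation's API.

```python
import time

def prestage_and_wait(srm, surls, poll_cap=600, default_wait=30):
    """Pre-stage a list of SURLs and wait until all are online.

    `srm` is a hypothetical client: bring_online(surls) returns a
    request token, and status_of_bring_online(token, surl) returns a
    (status, estimated_wait_seconds) pair, mirroring the per-file
    estimatedWaitTime the server SHOULD return."""
    token = srm.bring_online(surls)
    pending = set(surls)
    while pending:
        waits = []
        for surl in list(pending):
            status, est_wait = srm.status_of_bring_online(token, surl)
            if status == "SRM_SUCCESS":
                pending.discard(surl)
            elif status == "SRM_FAILURE":
                raise RuntimeError(f"pre-stage failed for {surl}")
            else:
                # Honour the server's hint; fall back to a default
                # polling interval if no estimate was given.
                waits.append(est_wait if est_wait is not None else default_wait)
        if pending:
            # Poll again when the most imminent file is expected ready.
            time.sleep(min(poll_cap, min(waits)))
```

Polling only when the server suggests, rather than in a tight loop, is exactly what keeps a busy server from being driven further into overload.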

Status of Storage Servers
The implementation of the needed changes can start now for all storage services:
- Storage developers will make available a development test endpoint implementing the agreed behaviour.
- A pre-release of the GFAL basic building-block functions needs to be available in order to check the correct implementation of the storage servers against them.
- The reference implementations made available for the experiments will be used to test the pre-release storage services (development test endpoints).
- The S2 stress test suite will be used to make the servers "busy" by causing high load.

Deployment plan
It seems feasible to have both the official WLCG Data Management clients and the storage services tested and ready for deployment by September 2009. This date is very close to LHC startup: how should we proceed?

More information
More information can be found on the Storage Solution Working Group twiki page: ReadinessChallenges#Storage_Solution_Working_Group_S
Please check the section "Busy Storage Services".