Status report on SRM v2.2 implementations: results of first stress tests. 2 July 2007, Flavia Donno, CERN IT/GD.

2 Tests
- All implementations pass the basic tests.
- The use-case test family has been enhanced with even more tests:
  - CASTOR: passes all use-cases.
    - Disk1 implemented by switching off the garbage collector (not gracefully handled by CASTOR). *Fixed*
    - PutDone slow. *Fixed*
  - dCache: passes all use-cases.
    - No tools provided for site administrators to reserve space statically for a VO. *Fixed*
    - In Tape1Disk0 the allocated space decreases when files are migrated to tape. *Fixed*
  - DPM: passes all use-cases. *Fixed*
    - Garbage collector for expired space available with the next release of DPM (1.6.5, in certification). *Fixed*
  - StoRM: passes all use-cases.
    - No tools provided for site administrators to reserve space statically for a VO.
  - BeStMan: passes all use-cases.
    - No tools provided for site administrators to reserve space statically for a VO.
    - Some calls are not compliant with the specs as defined during the WLCG Workshop in January 2007 (for instance, requestToken is not always returned).

3 Tests
- Details about the status of the implementations: see Problems.
- Minor issues still open:
  - dCache:
    - srmPrepareToPut or srmStatusOfPutRequest returns SRM_NO_FREE_SPACE at file and request level if the specified space is expired, instead of returning SRM_FAILURE at file level and SRM_SPACE_LIFETIME_EXPIRED at request level, or (if the space token is no longer valid) SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
    - srmPrepareToPut or srmStatusOfPutRequest returns SRM_FAILURE at file and request level if no space of the requested class is available, instead of returning SRM_NO_FREE_SPACE at file and request level, or SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
    - When a method is not supported, the explanation often contains the string: "handler discovery and dinamic load failedjava.lang.ClassNotFoundException:..."
  - StoRM:
    - srmPrepareToPut and srmStatusOfPutRequest return SRM_FAILURE at request level instead of SRM_SPACE_LIFETIME_EXPIRED when the space specified in the request is expired and the space token is still available. If the space token is unavailable, SRM_INVALID_REQUEST should be returned.
  - BeStMan:
    - Permanent files are not allowed to live in volatile space.
(The status-code combinations expected for the expired-space case are sketched after this slide.)
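To make the agreed behaviour concrete, here is a minimal sketch (Python; not part of the S2 test suite used for the real tests) of the request-level and file-level status-code combinations expected for srmPrepareToPut against an expired space. The `status_of_put` helper and the `EXPECTED` table are assumptions introduced purely for illustration.

```python
# Minimal sketch, assuming a hypothetical `status_of_put` helper that returns
# the request-level code and the list of file-level codes for a put request.
# It only encodes the expired-space behaviour discussed above; it is not the
# S2 test suite used for the actual compliance tests.

EXPECTED = {
    # space lifetime expired, but the space token is still known to the server
    "expired_space": ("SRM_SPACE_LIFETIME_EXPIRED", "SRM_FAILURE"),
    # the space token is no longer valid at all
    "invalid_token": ("SRM_INVALID_REQUEST", "SRM_FAILURE"),
}

def follows_agreed_behaviour(status_of_put, request_token, case):
    """Return True if the endpoint returns the agreed request/file-level codes."""
    request_code, file_codes = status_of_put(request_token)
    want_request, want_file = EXPECTED[case]
    return request_code == want_request and all(c == want_file for c in file_codes)
```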

4 Tests
- Stress tests have started on all development endpoints, using 9 client machines. Small server instances are preferred in order to reach their limits easily.
- First goals:
  - Understand the limits of the instance under test.
  - Make sure it does not crash or hang under heavy load.
  - Make sure that the response time does not degrade to an "unreasonable" level.
- Further goals:
  - Make sure there are no hidden race conditions in the most heavily used calls.
  - Understand server tuning.
  - Learn from stress testing.
- Parallel stress-testing activities are ongoing by the EIS team with GSSD input.
(A sketch of how load ramp-up and response-time measurement could be done follows this slide.)
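As an illustration of how the limits and response-time degradation of an endpoint could be probed, here is a minimal sketch in Python. The real stress tests are written with the S2 test suite; `prepare_to_get` is a hypothetical client wrapper introduced only for this example, and the thread counts are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def probe_response_time(prepare_to_get, surl, thread_counts=(10, 20, 40, 80)):
    """Ramp up concurrent srmPrepareToGet calls and record the mean latency.

    `prepare_to_get` is a hypothetical wrapper around an SRM client call;
    the real stress tests are written with the S2 test suite.
    """
    results = {}
    for n in thread_counts:
        latencies = []

        def one_call():
            start = time.time()
            prepare_to_get(surl)                   # one SRM request
            latencies.append(time.time() - start)  # record its duration

        with ThreadPoolExecutor(max_workers=n) as pool:
            for _ in range(n):
                pool.submit(one_call)              # all n calls run concurrently
        results[n] = sum(latencies) / max(len(latencies), 1)
    return results
```

Plotting the mean latency against the thread count gives a first indication of where the response time starts to degrade to an "unreasonable" level.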

5 Stress Tests description
- GetParallel
  - This test puts a file (/etc/group) into the SRM default space. It then spawns many threads (the number is configurable statically for each run) requesting a TURL (a protocol-dependent handle) to access the same file. The test can be driven to use different access protocols in different threads. The polling frequency for checking whether the TURL has been assigned can be fixed or can increase over time. Polling continues even after the TURL is assigned, to check for changes in status. The test tries to clean up after itself. A further test of the same kind is planned, adding other operations such as Abort while trying to use the aborted TURL. (A sketch of this pattern follows this slide.)
- GetParallelTransf
  - Same as the previous test, but once the TURL is obtained each thread tries to actually retrieve the file. The test tries to clean up after itself. A further test of the same kind is planned in which clients use the TURLs assigned to other clients.
- PutGet01
  - This test simulates many clients putting and getting (small) files simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
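For illustration, the sketch below shows the GetParallel pattern in Python: many threads request a TURL for the same SURL and poll until it is assigned. The `client` object with `prepare_to_get()` and `status_of_get()` methods is a hypothetical wrapper; unlike the real S2-based test, this sketch stops at the first assignment and does not clean up.

```python
import threading
import time

def get_parallel(client, surl, n_threads=50, protocol="gsiftp", poll=2.0):
    """Sketch of the GetParallel pattern: many threads request a TURL for the
    same SURL and poll until it is assigned. `client` is a hypothetical SRM
    wrapper exposing prepare_to_get() and status_of_get(); the real test is
    written with the S2 test suite."""
    results = [None] * n_threads

    def worker(idx):
        token = client.prepare_to_get(surl, protocol=protocol)
        while True:
            status, turl = client.status_of_get(token)
            if status == "SRM_SUCCESS":
                results[idx] = turl              # TURL assigned
                return
            if status not in ("SRM_REQUEST_QUEUED", "SRM_REQUEST_INPROGRESS"):
                results[idx] = status            # record the failure code
                return
            time.sleep(poll)                     # fixed polling frequency

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```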

6 Stress Tests description
- PutGetMix
  - This test simulates many clients putting and getting, at random, small files (O(KB)) and big files (O(MB/GB)) simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
- PutMany / PutMany01
  - This test performs many PrepareToPut requests in parallel; the requests are then also aborted in parallel (same characteristics as the previous tests). The PutMany01 variant only performs the PrepareToPut, without the abort. Better checking of the system response is needed. No file transfer is performed!
- ReserveSpace
  - This test does not apply to CASTOR. It simulates many parallel requests, each reserving 1 GB of disk space. (A sketch of this pattern follows this slide.)
- BringOnline
  - It reserves 1 GB of disk space of type Tape1Disk0, fills it with files (122 MB) and checks the response of the system when the reserved space is full. It checks whether some files are migrated to tape and, if so, requests that they be staged back to disk.
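As with the previous example, here is a minimal Python sketch of the ReserveSpace pattern. `client.reserve_space` is a hypothetical wrapper introduced only for this illustration; the real test is written with the S2 suite.

```python
from concurrent.futures import ThreadPoolExecutor

def reserve_space_parallel(client, n_requests=30, size_bytes=10**9):
    """Sketch of the ReserveSpace pattern: many parallel srmReserveSpace calls,
    each asking for 1 GB. `client.reserve_space` is a hypothetical wrapper;
    the real test is written with the S2 suite. Not applicable to CASTOR."""
    def one(_):
        try:
            return client.reserve_space(size_bytes)  # should return a space token
        except Exception as exc:                     # keep the error for the report
            return exc

    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        return list(pool.map(one, range(n_requests)))
```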

7 Stress Tests presentation

8 Stress Tests presentation. Under the date there is one directory per run.

9 Stress Tests presentation: DPM. Each number corresponds to a node; the nodes where failures occurred are shown in bold/italic. 9 client machines were used.

10 Stress Tests presentation: BeStMan. Small instances are preferred for stress-testing. In this case the failure happened on the client side (because of S2, each client cannot run more than 100 threads).

11 Stress Tests presentation: dCache

12 Stress Tests presentation: StoRM. The system is not yet dropping requests; the response time degrades with load.

13 Stress Tests presentation: StoRM. With 60 threads the system drops requests and slows down (tests take more time to complete). However, the server recovers nicely after the crisis.

14 Stress Tests presentation: CASTOR
srmStatusOfGetRequest srm://lxb6033.cern.ch:8443 requestToken=54549 SURL[srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/ txt]
Returns:
  sourceSURL0=srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/ txt
  returnStatus.explanation0="PrepareToGet failed: Bad address"
  returnStatus.statusCode0=SRM_FAILURE
  returnStatus.explanation="No subrequests succeeded"
  returnStatus.statusCode=SRM_FAILURE
Race condition?
Slow PutDone cured! Test completed in < 3 minutes.

15 Stress Tests presentation: CASTOR. The server responds well under load; requests get dropped but the response time is still good.

16 Summary of First Preliminary Results
- CASTOR:
  - Race conditions found. Working with the developers to address the problems.
  - Good handling of heavy load: requests are dropped if the server is busy (the client can retry).
  - Response time for the requests being processed is good.
- dCache:
  - Authorization module crash.
  - Server very slow or unresponsive (max heap size reached; a restart cures the problem).
  - Working with the developers to address the problems.
- DPM:
  - No failures.
  - Good handling of heavy load: requests are dropped if the server is busy (the client can retry).
  - Response time for the requests being processed is good.
- StoRM:
  - Response time degrades with load. The system might become unresponsive; however, it recovers after the crisis.
  - Working with the developers to address the problems.
- BeStMan:
  - Server unresponsive under heavy load. It does not resume operations when the load decreases.
  - Working with the developers to address the problems.
- More analysis is needed in order to draw conclusions.

17 Stress-test client improvements
- The green/red presentation is probably not adequate:
  - What does red mean?
  - How can we make it easy for the developers to diagnose a problem?
  - What happens when we increase the number of client nodes?
  - I AM STILL PLAYING WITH THE PRESENTATION PAGE! PLEASE DO NOT TAKE THE RED BOXES AS CORRECT!
- Improve the management of the test suite itself:
  - To efficiently stop/start/abort/restart
  - To easily diagnose client problems
  - Etc.
- How can we monitor improvements?
- Reproduce the race-condition problems.
- It is important to stress-test one system at a time.
- It is important to register degradation of performance.
- Extend the test suite with more use-cases. Experiments' input is very much appreciated.
- External system monitoring is needed. (A minimal sketch follows this slide.)
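On the last point, here is a minimal sketch of what external system monitoring could look like: it samples the load average and free memory on the server host at a fixed interval while the stress tests run elsewhere. It assumes a Linux host with /proc/meminfo; the file name, interval, and duration are illustrative choices, not part of the existing tooling.

```python
import csv
import os
import time

def sample_host_metrics(outfile="server_metrics.csv", interval=30, duration=3600):
    """Record the 1-minute load average and free memory of the host at a fixed
    interval; meant to run on the SRM server alongside (not inside) the tests.
    Assumes a Linux host with /proc/meminfo available."""
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "load1", "mem_free_kb"])
        end = time.time() + duration
        while time.time() < end:
            load1 = os.getloadavg()[0]
            with open("/proc/meminfo") as meminfo:
                mem_free = next(int(line.split()[1]) for line in meminfo
                                if line.startswith("MemFree:"))
            writer.writerow([int(time.time()), load1, mem_free])
            f.flush()
            time.sleep(interval)
```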

18 Plans
- Continue stress-testing of the development endpoints for as long as the developers/sites allow.
- Coordinate with other testers: in order to understand what happens, it is better to have dedicated machines.
- Publish results: as done for the basic and use-case tests, publish a summary of the status of the implementations to help the developers react, and as a reference for sites and experiments.
- Report monthly at the MB.
- Follow up possible problems at deployment sites.
- What else?