Status report on SRM v2.2 implementations: results of first stress tests
2nd July 2007
Flavia Donno, CERN IT/GD
Slide 2: Tests
All implementations pass the basic tests: https://twiki.cern.ch/twiki/bin/view/SRMDev
The use-case test family has been enhanced with even more tests:
- CASTOR: Passes all use-cases. Disk1 implemented by switching off the garbage collector (not gracefully handled by CASTOR). *Fixed* PutDone slow. *Fixed*
- dCache: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO. *Fixed* In Tape1Disk0 the allocated space decreases when files are migrated to tape. *Fixed*
- DPM: Passes all use-cases. *Fixed* Garbage collector for expired space available with the next release of DPM (1.6.5 in certification). *Fixed*
- StoRM: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO.
- BeStMan: Passes all use-cases. No tools provided for site administrators to reserve space statically for a VO. Some calls are not compliant with the specs as defined during the WLCG Workshop in January 2007 (for instance, requestToken is not always returned).
Slide 3: Tests
Details about implementation status and problems: https://twiki.cern.ch/twiki/bin/view/SRMDev/ImplementationsProblems
Minor issues still open:
- dCache:
  - An srmPrepareToPut or an srmStatusOfPutRequest returns SRM_NO_FREE_SPACE at file and request level if the specified space is expired, instead of returning SRM_FAILURE at file level and SRM_SPACE_LIFETIME_EXPIRED at request level, or (if the space token is no longer valid) SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level (the expected code pairs are sketched after this list).
  - An srmPrepareToPut or an srmStatusOfPutRequest returns SRM_FAILURE at file and request level if no space of the requested class is available, instead of returning SRM_NO_FREE_SPACE at file and request level, or SRM_INVALID_REQUEST at request level and SRM_FAILURE at file level.
  - When a method is not supported, the explanation often contains the string: "handler discovery and dinamic load failedjava.lang.ClassNotFoundException:..."
- StoRM: srmPrepareToPut and srmStatusOfPutRequest return SRM_FAILURE at request level instead of SRM_SPACE_LIFETIME_EXPIRED when the space specified in the request is expired and the space token is still available. If the space token is unavailable, SRM_INVALID_REQUEST should be returned.
- BeStMan: Permanent files are not allowed to live in volatile space.
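To make the expected behaviour above easier to scan, here is a minimal sketch encoding the status-code pairs described in the issues; the condition names and helper are our own illustration, not part of any SRM API.

    # Expected (request-level, file-level) status-code pairs for
    # srmPrepareToPut/srmStatusOfPutRequest, transcribed from the issues above.
    # Condition names are illustrative, not part of the SRM v2.2 API.
    EXPECTED_PUT_CODES = {
        "space_lifetime_expired": [("SRM_SPACE_LIFETIME_EXPIRED", "SRM_FAILURE")],
        "space_token_invalid":    [("SRM_INVALID_REQUEST", "SRM_FAILURE")],
        # Two acceptable alternatives when no space of the requested class exists:
        "no_free_space":          [("SRM_NO_FREE_SPACE", "SRM_NO_FREE_SPACE"),
                                   ("SRM_INVALID_REQUEST", "SRM_FAILURE")],
    }

    def codes_compliant(condition, request_code, file_code):
        """True if an implementation's codes match one of the accepted pairs."""
        return (request_code, file_code) in EXPECTED_PUT_CODES[condition]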
Slide 4: Tests
Stress tests have started on all development endpoints, using 9 client machines. Small server instances are preferred in order to reach the limits easily.
First goals:
- Understand the limits of the instance under test
- Make sure it does not crash or hang under heavy load
- Make sure that the response time does not degrade to an "unreasonable" level (a timing sketch follows this list)
Further goals:
- Make sure there are no hidden race conditions in the most frequently used calls
- Understand server tuning
- Learn from stress testing
Parallel stress-testing activities are ongoing by the EIS team with GSSD input.
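Registering response-time degradation requires timing every call. A minimal sketch of such a timing harness, purely our own illustration (the real tests are written in the S2 test language):

    import statistics
    import time

    def timed(call, *args, **kwargs):
        """Run one SRM call and return (elapsed_seconds, result)."""
        t0 = time.monotonic()
        result = call(*args, **kwargs)
        return time.monotonic() - t0, result

    def summarize(latencies):
        """Condense a run's latencies so degradation under load is visible."""
        ordered = sorted(latencies)
        return {
            "n": len(ordered),
            "median_s": statistics.median(ordered),
            "p95_s": ordered[max(0, int(0.95 * len(ordered)) - 1)],
            "max_s": ordered[-1],
        }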
Slide 5: Stress Tests description
- GetParallel: This test puts a file (/etc/group) in the SRM default space. It then spawns many threads (the number is configurable statically at each run), each requesting a TURL (a protocol-dependent handle) to access the same file. The test can be driven to use different access protocols in different threads. The polling frequency used to check whether the TURL has been assigned can be fixed or can increase over time. Polling continues even after the TURL is assigned, to check for changes in status. The test tries to clean up after itself. A planned variant of this test adds other operations, such as an Abort followed by an attempt to use the aborted TURL. (A threading sketch of this pattern follows this list.)
- GetParallelTransf: Same as the previous test, but once the TURL is obtained each thread tries to actually retrieve the file. The test tries to clean up after itself. A planned variant has clients use TURLs assigned to other clients.
- PutGet01: This test simulates many clients putting and getting (small) files simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
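A minimal sketch of the GetParallel pattern, assuming hypothetical Python bindings prepare_to_get and status_of_get_request; the endpoint, SURL, and module name are placeholders, and the real tests are written in the S2 language:

    import threading
    import time

    # Hypothetical client bindings; the actual tests drive
    # srmPrepareToGet/srmStatusOfGetRequest via S2.
    from srm_client import prepare_to_get, status_of_get_request  # assumed API

    ENDPOINT = "srm://srm.example.org:8443"      # placeholder endpoint
    SURL = ENDPOINT + "/dteam/stress/group.txt"  # file put there beforehand

    def request_turl(protocol, poll_interval=1.0, max_polls=60):
        """One client thread: ask for a TURL for SURL and poll until assigned."""
        token = prepare_to_get(ENDPOINT, SURL, protocol=protocol)
        for _ in range(max_polls):
            status, turl = status_of_get_request(ENDPOINT, token)
            if status == "SRM_SUCCESS":
                return turl               # the real test keeps polling for changes
            time.sleep(poll_interval)     # fixed frequency; could ramp up instead
        raise TimeoutError("TURL never assigned for " + SURL)

    # Many concurrent requests for the same file, mixing access protocols.
    threads = [threading.Thread(target=request_turl, args=(proto,))
               for proto in ["gsiftp", "rfio", "dcap"] * 20]
    for t in threads:
        t.start()
    for t in threads:
        t.join()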
Slide 6: Stress Tests description
- PutGetMix: This test simulates many clients putting and getting, at random, small files (O(KB)) and big files (O(MB)-O(GB)) simultaneously. The number of threads and the polling frequency can be set as in the previous tests. The test tries to clean up after itself.
- PutMany / PutMany01: PutMany performs many PrepareToPut requests in parallel; the requests are then also aborted in parallel (same characteristics as the previous tests). PutMany01 only performs the PrepareToPut, without the abort. Better checking of the system response is needed. No file transfer is performed!
- ReserveSpace: This test simulates many parallel requests to reserve 1 GB of disk space. It does not apply to CASTOR. (A sketch follows this list.)
- BringOnline: This test reserves 1 GB of disk space of type Tape1Disk0, fills it with files (122 MB each) and checks the response of the system when the reserved space is full. It checks whether some files are migrated to tape and, if so, requests that a file be staged back to disk.
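A minimal sketch of the ReserveSpace pattern, reusing the hypothetical srm_client bindings from the earlier sketch; reserve_space and its parameter name are assumptions, not a documented API:

    from concurrent.futures import ThreadPoolExecutor

    from srm_client import reserve_space  # assumed binding, as in the sketch above

    ENDPOINT = "srm://srm.example.org:8443"  # placeholder endpoint

    def reserve_1gb():
        """One client: ask the endpoint for a 1 GB disk-space reservation."""
        return reserve_space(ENDPOINT, desired_size_bytes=1024 ** 3)

    # Fire many reservations in parallel and collect the space tokens.
    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = [pool.submit(reserve_1gb) for _ in range(50)]
        tokens = [f.result() for f in futures]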
Slide 7: Stress Tests presentation
http://lxdev25.cern.ch/s2farm/results/final/history/
Slide 8: Stress Tests presentation
[Screenshot of the results page] Under the date there is one directory per run.
Slide 9: Stress Tests presentation (DPM)
[Screenshot: DPM results matrix] Each number corresponds to a node; the nodes where failures occur have bold/italic numbers. 9 client machines were used.
Slide 10: Stress Tests presentation (BeStMan)
[Screenshot: BeStMan results] Small instances are preferred for stress-testing. In this case the failure happened on the client side (an S2 limitation: each client cannot run more than 100 threads).
Slide 11: Stress Tests presentation (dCache)
[Screenshot: dCache results]
Slide 12: Stress Tests presentation (StoRM)
[Screenshot: StoRM results] The system is not yet dropping requests. The response time degrades with load.
Slide 13: Stress Tests presentation (StoRM)
[Screenshot: StoRM results] With 60 threads the system drops requests. The system slows down (tests take more time to complete). However, the server recovers nicely after the crisis.
Slide 14: Stress Tests presentation (CASTOR)
    srmStatusOfGetRequest srm://lxb6033.cern.ch:8443 requestToken=54549
      SURL[srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/20070701-220340-4202-0.txt]
    Returns:
      sourceSURL0=srm://lxb6033.cern.ch:8443/castor/cern.ch/grid/dteam/20070701-220340-4202-0.txt
      returnStatus.explanation0="PrepareToGet failed: Bad address"
      returnStatus.statusCode0=SRM_FAILURE
      returnStatus.explanation="No subrequests succeeded"
      returnStatus.statusCode=SRM_FAILURE
Race condition?
Slow PutDone cured! Test completed in < 3 minutes.
Slide 15: Stress Tests presentation (CASTOR)
[Screenshot: CASTOR results] The server responds well under load. Requests get dropped, but the response time is still good.
Slide 16: Summary of First Preliminary Results
- CASTOR: Race conditions found; working with the developers to address the problems. Good handling of heavy load: requests are dropped if the server is busy (the client can retry). Response time for the requests being processed is good.
- dCache: Authorization module crash. Server very slow or unresponsive (max heap size reached; a restart cures the problem). Working with the developers to address the problems.
- DPM: No failures. Good handling of heavy load: requests are dropped if the server is busy (the client can retry). Response time for the requests being processed is good.
- StoRM: Response time degrades with load, and the system might become unresponsive; however, it recovers after the crisis. Working with the developers to address the problems.
- BeStMan: Server unresponsive under heavy load, and it does not resume operations when the load decreases. Working with the developers to address the problems.
More analysis is needed in order to draw conclusions.
Slide 17: Stress-test client improvements
- The green/red presentation is probably not adequate:
  - What does red mean?
  - How can we make it easy for developers to diagnose the problem?
  - What happens when we increase the number of client nodes?
  - I AM STILL PLAYING WITH THE PRESENTATION PAGE!! PLEASE DO NOT TAKE THE RED BOXES AS CORRECT!!!
- Improve the management of the test suite itself: to efficiently stop/start/abort/restart, to easily diagnose client problems, etc.
- How can we monitor improvements?
  - Reproduce race-condition problems
  - It is important to stress-test one system at a time
  - It is important to register degradation of performance
- Extend the test suite with more use-cases. Input from the experiments is very much appreciated.
- External system monitoring is needed.
Slide 18: Plans
- Continue stress-testing of the development endpoints for as long as the developers/sites allow.
- Coordinate with other testers: in order to understand what happens, it is better to have dedicated machines.
- Publish results: as done for the basic/use-case tests, publish a summary of the status of the implementations to help developers react and to serve as a reference for sites and experiments. Report monthly at the MB.
- Follow up possible problems at deployment sites.
- What else?