
Flavia Donno, Jamie Shiers




1 Flavia Donno, Jamie Shiers
Grid Storage Services: status of servers, clients and the SRM protocol implementation
Flavia Donno, Jamie Shiers
CERN
Workshop 2007 su Calcolo e Reti dell'INFN
Rimini, 7-11 May 2007

2 Outline
- Introduction
- The Grid Storage Service: requirements
- The Storage Resource Manager: the SRM v2.2 protocol
- SRM v2.2 definition and validation: process and activities
- Available implementations
- Tests executed and their results
- Current issues with the implementations
- Status of the clients
- The GSSD working group
- Conclusions

3 Introduction
Because of its mission, one of the critical issues WLCG has to face is the provision of a Grid storage service that allows for the management and handling of the Petabytes of data produced by the LHC experiments every year.
In June 2005 the WLCG Baseline Service Working Group published a report stating that:
- A Storage Element service is mandatory and high priority; the full set of recommended features should be available by February 2006.
- The experiment requirements for a Grid storage service are defined.
- Experiments agree to use only high-level tools as the interface to SRM.
The report was based on the early experience acquired with the few existing implementations of a Grid storage service, built on early and not completely satisfactory versions of the Storage Resource Manager protocol: SRM v1.1 and v2.1 (the latter never deployed).
At the Mumbai workshop (CHEP2006) it became clear that the experiments had learned more about what was needed and had changed their requirements. It took from February to May 2006 to converge on what is now called SRM 2.2. In May 2006 the WLCG SRM MoU was agreed at FNAL.

4 Basic requirements
- Support for permanent files (and volatile copies)
- Support for permanent space
- Space reservation: static only, per VO. If dynamic space reservation is available, allow for the possibility of releasing the allocated space
- Permission functions: only on directories, based on VOMS groups/roles
- Directory management functions
- Data transfer functions
- File access protocol negotiation
- VO-specific relative paths
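The "file access protocol negotiation" requirement above can be illustrated with a small sketch: the client sends an ordered list of transfer protocols it supports, and the server answers with a transfer URL (TURL) in the first protocol both sides support. This is a toy model for illustration only; the server name, protocol set, and function are hypothetical, not code from any SRM implementation.

```python
# Illustrative sketch of SRM file-access-protocol negotiation: the client
# supplies an ordered preference list, the server returns a TURL in the
# first protocol both sides support. All names here are hypothetical.

SERVER_PROTOCOLS = {"gsiftp", "rfio"}  # protocols this storage element serves

def negotiate_turl(surl: str, client_protocols: list) -> str:
    """Return a transfer URL for `surl` in the first mutually supported protocol."""
    for proto in client_protocols:
        if proto in SERVER_PROTOCOLS:
            # Rewrite the SRM URL into a transfer URL for the chosen protocol
            return surl.replace("srm://", proto + "://", 1)
    raise RuntimeError("no mutually supported transfer protocol")

# Client prefers dcap but the server only serves gsiftp and rfio,
# so the negotiation falls back to gsiftp.
turl = negotiate_turl("srm://se.example.org/vo/data/file1", ["dcap", "gsiftp"])
print(turl)  # gsiftp://se.example.org/vo/data/file1
```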

5 The Storage Resource Manager: SRM v2.2
The Storage Resource Manager (SRM) is a middleware component whose function is to provide dynamic space allocation and file management on shared storage components on the Grid. More precisely, the SRM is a Grid service with several different implementations. Its main specification documents are:
- A. Sim, A. Shoshani (eds.), The Storage Resource Manager Interface Specification, v2.2
- F. Donno et al., Storage Element Model for SRM 2.2 and GLUE schema description, v3.5

6 The Storage Resource Manager Interface
The SRM Interface Specification lists the service requests, along with the data types for their arguments. Function signatures are given in an implementation-independent language and grouped by functionality:
- Space management functions allow the client to reserve, release, and manage spaces, their types and lifetimes, with support for different qualities of storage space: ReserveSpace/ReleaseSpace, ChangeSpaceForFiles, ExtendFileLifeTimeInSpace.
- Data transfer functions get files into SRM spaces, either from the client's space or from other remote storage systems on the Grid, and retrieve them: PrepareToPut/StatusOfPutRequest/PutDone, PrepareToGet/StatusOfGetRequest, BringOnline, Copy, ReleaseFiles, AbortRequest/AbortFiles, ExtendFileLifeTime.
- Other function classes are directory, permission, and discovery functions: Ping, Ls, Mkdir, Rm, Rmdir, SetPermission, CheckPermission, etc.
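The put cycle named above (PrepareToPut, poll StatusOfPutRequest, transfer, PutDone) can be sketched as a toy in-memory state machine. This is purely illustrative: the class, method names, and state strings are simplified stand-ins for the real asynchronous SOAP operations, not a real SRM client or server.

```python
# Toy model of the SRM put cycle:
# srmPrepareToPut -> poll srmStatusOfPutRequest -> transfer -> srmPutDone.
# Illustration only; real SRM requests are asynchronous SOAP calls.

import itertools

class ToySRM:
    _ids = itertools.count(1)  # request-ID generator

    def __init__(self):
        self.requests = {}  # request ID -> {"surl": ..., "state": ...}
        self.files = {}     # surl -> file contents

    def prepare_to_put(self, surl):
        rid = next(self._ids)
        # In a real SRM the state may first be REQUEST_QUEUED; we grant
        # space immediately to keep the sketch short.
        self.requests[rid] = {"surl": surl, "state": "SPACE_AVAILABLE"}
        return rid

    def status_of_put_request(self, rid):
        return self.requests[rid]["state"]

    def transfer(self, rid, data):  # stands in for the actual gridftp transfer
        self.files[self.requests[rid]["surl"]] = data

    def put_done(self, rid):
        self.requests[rid]["state"] = "SUCCESS"

srm = ToySRM()
rid = srm.prepare_to_put("srm://se.example.org/vo/data/file1")
if srm.status_of_put_request(rid) == "SPACE_AVAILABLE":  # poll until space granted
    srm.transfer(rid, b"payload")
    srm.put_done(rid)
print(srm.status_of_put_request(rid))  # SUCCESS
```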

7 SRM 2.2 activities
- Weekly meetings to follow the development of the implementations, up to December 2006.
- The LBNL testing suite, written in Java, was the only one available until September 2006 and was run manually. Only in January 2007 did this test suite start running automatically every day.
- In September 2006 CERN took over from RAL the development of the S2 SRM 2.2 testing suite, enhancing it with a complete test set and with publishing and monitoring tools.
- Reports on the status of the SRM 2.2 test endpoints are given monthly to the LCG MB.

8 Study of the SRM 2.2 specification
- In September 2006 there were very different interpretations of the spec. There have been 5 releases of the specification so far: July 2006, September 2006, December 2006, January 2007, April 2007.
- A study of the spec (proposal of a data model, UML state/activity diagrams, etc.) identified many behaviours not defined by the specification. A list of about 50 points was compiled in September 2006; many of the issues have been solved.
- Another 30 points were discussed and agreed during the WLCG Workshop in January 2007: agreement was reached on the fundamental concepts, with minor issues postponed to SRM v3.
- The study of the specification, the discussions, and the testing of the open issues have helped ensure consistency between SRM implementations.

9 Available implementations
SRM v2.2 implementations are available today for the following storage services: CASTOR2, dCache, DPM, StoRM, and BeStMan.
- CASTOR2 is a Hierarchical Storage Server (HSS) with a tape-system back-end, developed by CERN and RAL. The available version with SRM v2.2 support is v
- dCache is an HSS with plug-ins for many different tape-system back-ends, developed by DESY and FNAL. The available version with SRM 2.2 support is v1.8Beta.
- DPM is a disk-based-only storage server developed by CERN. The version available in production with SRM v2.2 support is v1.6.4.
- StoRM is a disk-based-only storage server developed by INFN and ICTP. It takes advantage of the underlying file systems to implement the features offered by the SRM interface, and offers an SRM v2.2 interface for many file-system-based solutions such as GPFS, Lustre, XFS and generic POSIX file systems. The version available with SRM v2.2 support is v1.3.15.
- BeStMan is a disk- or MSS-based storage server developed by LBNL. The version available with SRM v2.2 support is v

10 Tests executed
The S2 test suite tests availability of endpoints, basic functionality, use cases and boundary conditions, interoperability, and exhaustive and stress conditions:
- Availability: Ping and a full put cycle (putting and retrieving a file).
- Basic: basic functionality, checking only return codes and passing all basic input parameters.
- Use cases: testing boundary conditions, exceptions, and real use cases extracted from the middleware clients and experiment applications.
- Interoperability: servers acting as clients, cross-copy operations.
- Exhaustive: checking for long strings, strange characters in input arguments, and missing mandatory or optional arguments; output parsed.
- Stress: parallel tests stressing the systems: multiple requests, concurrent colliding requests, space exhaustion, etc.
The S2 tests run as a cron job 5 times per day. Results are published daily on a web page, with the latest results and history available.
Test results and issues are discussed on the srm-tester and srm-devel lists.
In parallel, manual tests are run from high-level libraries and CLIs such as GFAL/lcg-utils, FTS, and the DPM test suite.
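The "availability" family above (Ping plus a full put cycle) can be sketched as a tiny probe function. The endpoint object here is a trivial in-memory stand-in invented for the example; it is not the S2 suite, which is a dedicated test framework with its own language.

```python
# Sketch of an S2-style availability probe: Ping the endpoint, then run
# a full put cycle (write a test file and read it back). FakeEndpoint is
# a hypothetical stand-in for a real storage element.

class FakeEndpoint:
    def __init__(self):
        self.store = {}
    def ping(self):
        return "SRM_SUCCESS"
    def put(self, surl, data):
        self.store[surl] = data
    def get(self, surl):
        return self.store[surl]

def availability_test(ep):
    """Return True if the endpoint answers Ping and completes a put/get cycle."""
    if ep.ping() != "SRM_SUCCESS":
        return False
    surl, payload = "srm://se.example.org/vo/s2test", b"s2-probe"
    ep.put(surl, payload)            # full put cycle: write the probe file ...
    return ep.get(surl) == payload   # ... and read it back unchanged

print(availability_test(FakeEndpoint()))  # True
```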

11 The S2 result web-pages
- Several endpoints are available for testing. The main ones are at the sites developing the SRM v2.2 interface; other sites volunteered to install the most recent version and give the experiments access for the validation process.
- srmCopy and srmChangeSpaceForFiles are foreseen for all implementations by the end of 2007.

12 The S2 result web-pages
- dCache@DESY: authorization/connection errors (the dCache developers will look into the problem this coming week).
- CASTOR and DPM: do not provide srmCopy for the moment; foreseen for the end of 2007.
- StoRM: does not provide srmCopy in PULL mode.

13 Test results

14 Current issues with the implementations
CASTOR:
- Static space reservation, manually created. Special dteam space tokens need to be reserved for SAM tests.
- Pins are not honored. VOs need to negotiate with the sites the size of the disk buffers and the tuning of the garbage collector.
- The Tape0Disk1 storage class is implemented at the moment by switching the garbage collector off. However, CASTOR cannot handle gracefully the situation where the disk fills up: requests are kept in a queue waiting for space to be freed by the garbage collector. Disk1 storage issues will probably be addressed by the end of this year.
- Very slow PutDone. This has stopped us from performing real stress tests on the CASTOR instance, since it can crash the server or make it unresponsive. A fix for this problem is available in the latest releases and will be deployed in production in the coming months.
- No quota provided for the moment. Since the space is statically partitioned between VOs, it is the partition that guarantees that a user does not exceed the allocated space.
- No support for VOMS group/role ACLs at the moment.

15 Current issues with the implementations
dCache:
- To create a file in SRM you need to execute srmPrepareToPut, transfer, srmPutDone. The SRM protocol establishes that the SURL starts to exist when the srmPrepareToPut operation completes; in dCache the SURL exists only once the file transfer is initiated. This has consequences for the clients, which have to deal with this situation.
- Two concurrent srmPrepareToPut requests both succeed, but the associated transfers are then synchronized, with one of them failing. This was done to avoid having to explicitly abort Put requests for transfers that have never been initiated.
- dCache does not clean up after failed operations. It is up to the client to issue an explicit srmAbortRequest/srmAbortFiles and remove the corresponding SURLs before retrying. Other implementations do not behave the same way (see later); however, this behavior is compliant with the SRM protocol specification.
- Different interpretation of space reservation for the Tape1Disk0 storage class: the space dynamically reserved by the user is decreased by the size of the files migrated to tape, so users have to re-issue a space reservation request if more space is needed. Other implementations use the reserved space as a window for streaming files to tape.
- Soft quota provided for the moment. Transfers are controlled so that the user cannot write new files into a full space; however, transfers initiated when space was available are not aborted if the space becomes insufficient (with gridftp, even these transfers are aborted if the space allocation is exceeded).
- No quota for space reservation: users can reserve as much space as they want, if the space is available.
- No support for VOMS group/role ACLs at the moment; foreseen by the end of 2007.
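The client-side consequence of the dCache behaviour described here (the server does not clean up after a failed put, so the client must abort and remove the SURL before retrying) can be sketched as follows. The `client` object and its method names are hypothetical mirrors of the SRM operations, not a real client library.

```python
# Sketch of the retry discipline implied by dCache's behaviour: after a
# failed put, the client must issue srmAbortRequest and remove the
# half-created SURL itself before retrying. Hypothetical client API.

def robust_put(client, surl, data, retries=3):
    """Put `data` at `surl`, aborting and cleaning up before each retry."""
    for _attempt in range(retries):
        rid = client.prepare_to_put(surl)
        try:
            client.transfer(rid, data)
            client.put_done(rid)
            return True
        except IOError:
            client.abort_request(rid)  # the server does not clean up for us
            client.remove(surl)        # remove the half-created SURL
    return False

class FlakyClient:
    """Fake client whose first transfer fails, to exercise the retry path."""
    def __init__(self):
        self.calls = []
        self.fail_once = True
    def prepare_to_put(self, surl):
        self.calls.append("put"); return 1
    def transfer(self, rid, data):
        if self.fail_once:
            self.fail_once = False
            raise IOError("transfer failed")
    def put_done(self, rid):
        self.calls.append("done")
    def abort_request(self, rid):
        self.calls.append("abort")
    def remove(self, surl):
        self.calls.append("rm")

c = FlakyClient()
print(robust_put(c, "srm://se.example.org/vo/f", b"x"))  # True
print(c.calls)  # ['put', 'abort', 'rm', 'put', 'done']
```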

16 Current issues with the implementations
DPM:
- The SURL is created at the completion of the srmPrepareToPut operation. However, two concurrent srmPrepareToPut requests with the overwrite flag enabled both succeed. ATTENTION: the implementation does not warn the user that another concurrent srmPrepareToPut is being executed; a second srmPrepareToPut with the overwrite flag set automatically aborts the first one. This was done to avoid having to explicitly abort Put requests and remove the corresponding SURL for transfers that have never been initiated.
- DPM v1.6.4 does not at the moment provide a garbage collector for expired space: it is up to the client to explicitly remove the space when done. This has already caused problems on production installations that were under S2 tests. The DPM version in certification at the moment does provide a garbage collector for expired space.
- Soft quota provided for the moment. Transfers are controlled so that the user cannot write new files into a full space; however, transfers initiated when space was available are not aborted if the space becomes insufficient.
- No quota for space reservation: users can reserve as much space as they want, if the space is available. This will be provided as soon as accounting is implemented in DPM.

17 Current issues with the implementations
StoRM:
- Space tokens are strictly connected to VO-specific paths. The SRM protocol establishes that only the space token determines the storage class and that there is no connection between space tokens and the SRM namespace (in some implementations the path can determine the tape set where the files must reside). In StoRM, the path does determine the storage class at the moment.
- Space reservation is guaranteed. In other implementations space reservation is best effort: a user can request a given amount of space, but if another user on the same system exceeds their reserved space, the first user may not be able to benefit from the space reserved. StoRM instead pre-allocates the blocks at the file-system level, so the space reservation is guaranteed.
- Quota is available through the underlying file system: there is no direct support for quota in StoRM. If the file system supports it, sys-admins can decide to enable it. However, for GPFS, it was observed some time ago that enabling quota slows performance down.

18 Current issues with the implementations
BeStMan:
- Get/Put/Copy/BringOnline return a status referring to the list of SURLs specified in the call. All SRM calls are characterized by a request ID, and such an ID refers to the full set of files specified in the request: if an srmStatusOf* request is issued specifying the request ID, the status of the original request should be returned, not only the status relative to the files specified in the srmStatusOf* call. BeStMan does not comply with the SRM spec in this respect, as defined during the WLCG workshop in January 2007.
- PERMANENT files cannot live in volatile space. The SRM protocol specifies that permanent files created in volatile space should continue to live in a default space when the volatile space expires (as per the decisions taken during the WLCG workshop in January and subsequent discussions).
- Quota is available through the underlying file system: there is no direct support for quota in BeStMan. If the file system supports it, sys-admins can decide to enable it.
- No support for ACLs based on VOMS groups and roles.

19 Other general issues
- No dynamic information providers for GLUE v1.3 available yet.
- Monitoring tools.
- Accounting information.
- Management tools for the TapeNDisk1 storage classes.
- What else?
Different implementations offer different internal tools for monitoring, accounting, and management; there is no uniform interface and, for the moment, no clear way for the experiments to manage the storage resources on the Grid. We are working in this area.

20 Status of SRM high-level clients
lcg-utils and GFAL are available in production:
- SRM v1 is the default; SRM v2 is used only if a space token is specified or no SRM v1 endpoint is published.
- No ReserveSpace functionality is offered through lcg-utils/GFAL.
- BringOnline is only possible through GFAL.
- Pinning is not available, since it is not supported by all implementations; it will be available with the next release of GFAL.
- Releasing files is possible through lcg-utils and GFAL.
- lcg-cp now also takes a SURL as destination; no catalogue operations are performed.
- At the moment there is no way to force GFAL to use a specific access protocol.
- Bulk operations will be available with the next release of GFAL.
FTS 2.0 is in testing:
- SRM v2.2 support
- Improved monitoring
- Improved certificate delegation handling
- Deployment at CERN by mid-May 2007; deployment at the T1s a month after that.

21 Grid Storage System Deployment (GSSD) Working Group
Mandate:
- Coordinating with sites, experiments, and developers the deployment of the various SRM 2.2 implementations and the corresponding storage classes.
- Establishing a migration plan from SRM v1 to SRM v2 so that the experiments can access the same data transparently from the two kinds of endpoint.
- Coordinating the GLUE Schema v1.3 deployment for the Storage Element and making sure that the information published is correct.
- Coordinating the provision of the necessary information by the storage providers in order to monitor the status of storage resources, ensuring that all sites provide the experiments with the requested resources and with the correct usage.
People involved: developers (SRM and clients), storage providers (CASTOR, dCache, DPM, StoRM), site admins, and the experiments.
Goal: ensure transparency of data access and the functionalities required by the experiments (see the Baseline Service report).

22 GSSD achievements so far
(A) A proposal for a migration plan from SRM v1 to SRM v2 is in place, ensuring transparent data access from both SRM v1 and SRM v2. The main point is the coordination of the namespace adopted by the experiments: connect the namespace with the storage classes, as in the example below.
/my_exp/data/RAW/year/run/reco_pass/stream --> T1D0
/my_exp/data/DST/year/run/reco_pass/stream --> T1D0, but a different tape set
/my_exp/data/ESD_master/year/run/reco_pass/stream --> T1D1
/my_exp/data/ESD_replica/year/run/reco_pass/stream --> T0D1
/my_exp/data/AOD/year/run/reco_pass/stream --> T0D1
(B) Preparation of the EGEE pre-production testbed for experiment testing of SRM v2.2: 15 endpoints are available at the moment, mostly hardware dedicated to tests but also DPM production sites. Negotiating the details with the experiments: data to be accessed, catalogues, space, storage classes, space tokens, data flow patterns, clients to be exercised, etc.
(C) A monitoring/accounting group is investigating which information is available at the moment in the various implementations. A first report has been produced. A wrapper for a common interface is being drafted, to be used in GridView or the experiments' WLCG dashboard, for the experiments to evaluate.
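The namespace/storage-class coordination in point (A) can be expressed as a simple prefix map, following the mapping given on the slide (RAW and DST on T1D0, ESD_master on T1D1, ESD_replica and AOD on T0D1). The function and dictionary are illustrative only; the paths are the slide's own examples.

```python
# Sketch of the namespace -> storage-class mapping from the slide.
# The data type is the third component of /my_exp/data/<type>/year/...

STORAGE_CLASS = {
    "RAW": "T1D0",
    "DST": "T1D0",          # same class as RAW, but a different tape set
    "ESD_master": "T1D1",
    "ESD_replica": "T0D1",
    "AOD": "T0D1",
}

def storage_class_for(path: str) -> str:
    """Map a /my_exp/data/<type>/... path onto the agreed storage class."""
    data_type = path.split("/")[3]  # ['', 'my_exp', 'data', '<type>', ...]
    return STORAGE_CLASS[data_type]

print(storage_class_for("/my_exp/data/RAW/2007/run1/pass1/stream1"))  # T1D0
print(storage_class_for("/my_exp/data/AOD/2007/run1/pass1/stream1"))  # T0D1
```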

23 Conclusions
- Much clearer description of the SRM specification: all ambiguous behaviors have been made explicit, with a few issues left out for SRM v3 since they do not affect the SRM MoU.
- A well-established and agreed methodology to check the status of the implementations and trigger the cycle for the definition of the protocol. Boundary conditions and use cases from the upper-layer middleware and experiment applications are represented in the S2 test suite families of tests.
- Working with sites and experiments on the deployment of SRM 2.2 and the storage classes. Specific guidelines for Tier-1 and Tier-2 sites are being compiled.
- It is really important that the experiments perform tests in pre-production in order to verify the readiness of SRM v2.2 for production.

