The EU DataGrid Data Management


1 The EU DataGrid Data Management
The European DataGrid Project Team

2 EDG Tutorial Overview
Information Service
Workload Management Services
Data Management Services
Networking
Fabric Management

3 Overview
Data Management Issues
Main Components:
EDG Replica Catalog
EDG Replica Manager
GDMP
In this lecture you will learn about data management issues within a Data Grid and about the main components that address these issues. Subsequently, four data management tools which are currently deployed on the EDG testbed are presented.

4 Data Management Issues
On the grid the user has access to different kinds of computing and data services which are connected via the Internet. If you submit a job, you typically do not know in advance where it will end up. The input data needed by a job might be on any storage element.

5 Data Management Issues
Of course, you might steer a job to a location where the data is available, but what if you need several datasets that reside in different places? Similarly, when your job is finished, where should you put the output so that it is easily accessible to yourself and to others in subsequent applications?

6 Data Management Tools
Tools for:
Locating data
Copying data
Managing and replicating data
Meta data management
On the EDG Testbed you have:
EDG Replica Catalog
globus-url-copy (GridFTP)
EDG Replica Manager
Grid Data Mirroring Package (GDMP)
Spitfire
To cope with these issues you need several tools that allow you to locate data on the grid, to copy data, to replicate data (i.e. to produce exact copies for better locality) and, ideally, high-level data management tools that hide most of the complexity. In addition, applications may store additional meta data with their files, which of course also needs to be managed. On the EDG Testbed we currently provide for this: a replica catalog that stores the physical locations of files, i.e. Physical File Names, together with a logical identifier, the Logical File Name; Globus tools (globus-url-copy, which uses the GridFTP protocol) for secure file transfer; GDMP, a data mirroring package that automatically provides data to sites that have declared an interest in it; the EDG Replica Manager, which replicates files while keeping the replica catalog consistent; and Spitfire, a front end to grid-enabled databases for managing meta data (this tool is not covered in this tutorial).

7 EDG Replica Catalog
Based upon the Globus LDAP Replica Catalog
Stores LFN/PFN mappings and additional information (e.g. file size):
Physical File Name (PFN): host + full path and file name
Logical File Name (LFN): logical name that may be resolved to PFNs
LFN : PFN = 1 : n
Only files on storage elements may be registered
Each VO has a specific storage dir on an SE
Example PFN: lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat (host + storage dir + file name)
The LFN must be the full path of the file starting from the storage dir
LFN of the above PFN: file1.dat
The EDG Replica Catalog is based upon the RC provided by Globus. It is essentially an LDAP server that stores LFN/PFN mappings together with additional information, such as the file size. Note that only files stored on SEs may be registered in the catalog. Moreover, there is a restriction on file names: every replica must have the same file name starting from a VO-specific storage directory on the SE. This storage directory may be retrieved from the information service. The LFN for a file must be the full path and file name starting from the storage directory.
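A sketch of the 1:n LFN-to-PFN mapping, reusing the example PFN above; the second host and storage directory are hypothetical:

  LFN:   file1.dat
  PFN 1: lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat
  PFN 2: lxshare0333.cern.ch/flatfiles/SE2/iteam/file1.dat   (hypothetical second replica)

Note that both PFNs end in the same path relative to the VO storage directory, as the naming restriction requires.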

8 EDG Replica Catalog API and command line tools
addLogicalFileName
getLogicalFileName
deleteLogicalFileName
getPhysicalFileName
addPhysicalFileName
deletePhysicalFileName
addLogicalFileAttribute
getLogicalFileAttribute
deleteLogicalFileAttribute
Main command line options:
-l: logical file name
-p: physical file name
-n: attribute name
-v: attribute value
-c: config file name
-d: don't show all output
-C: clear-text password authorization (currently used by most catalogs; -C is required!)
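A hedged sketch of how these tools might be invoked. The command names are assumed to match the API calls listed above, and rc.conf is a hypothetical config file; the PFN reuses the example from the previous slide:

  # Register a logical file name and attach one physical replica (names assumed)
  addLogicalFileName -l file1.dat -c rc.conf -C
  addPhysicalFileName -l file1.dat \
    -p lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat -c rc.conf -C
  # Resolve the logical name back to all of its physical replicas
  getPhysicalFileName -l file1.dat -c rc.conf -C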

9 globus-url-copy
Low level tool for secure copying:
globus-url-copy <protocol>://<source file> \
                <protocol>://<destination file>
Main protocols:
gsiftp - for secure transfer, only available on SE and CE
file - for accessing files stored on the local file system, e.g. on a UI or WN
Example:
globus-url-copy file://`pwd`/file1.dat \
  gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat
globus-url-copy is a low level tool for secure copying. It should only be used if a plain copy, without registering the file in the RC, is desired. Otherwise, higher level tools such as those discussed next should be used.
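The transfer also works in the other direction; a sketch that fetches the file just copied back from the SE to the local working directory, reusing the same example host and paths:

  globus-url-copy \
    gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat \
    file://`pwd`/file1.dat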

10 The EDG Replica Manager
Extends the Globus replica manager
Client side tool only
Allows replication (copying) and registration of files in the RC
Works with the LDAP-based RC and with RLS (see future directions)
Keeps the RC consistent with the stored data
Uses GDMP's staging interface to stage to MSS
The EDG replica manager may be used to replicate data while keeping the replica catalog in sync. It is a prototype built upon the Globus replica manager in order to gain experience for the ongoing development of an intelligent replica manager, called Reptor (cf. the lecture on Future Directions).

11 The Replica Manager APIs
(un)registerEntry(LogicalFileName lfn, FileName source)
Replica Catalogue operations only - no file transfer
copyFile(FileName source, FileName destination, String protocol)
allows for third-party transfer between two Storage Elements, or between a Computing Element and a Storage Element
Space management policies are under development
All tools support parallel streams for file transfers
This and the next slide show the API functions of the Replica Manager (RM), to be used by clients and applications.
Main command line options:
-l: logical file name
-s: source file name
-d: destination file name
-c: config file for the replica catalog
-e: verbose error output (important for testing!)
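A sketch of a plain copy using the options listed above; the assumption is that the copyFile API function is exposed as a command of the same name:

  # Third-party copy from the UI to an SE, without touching the RC
  copyFile -s file://`pwd`/file1.dat \
    -d gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat \
    -e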

12 The Replica Manager APIs
copyAndRegisterFile(LogicalFileName lfn, FileName source, FileName destination, String protocol)
third-party transfer, but: files can only be registered in the Replica Catalogue if the destination PFN contains a valid SE (i.e. one that is registered in the RC)!
replicateFile(LogicalFileName lfn, …)
deleteFile(LogicalFileName lfn, FileName source)
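A sketch of copy-and-register in one step, under the same assumption that the API call maps to an equally named command; rc.conf is a hypothetical config file and the destination reuses the tutorial's example SE:

  copyAndRegisterFile -l file1.dat \
    -s file://`pwd`/file1.dat \
    -d gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat \
    -c rc.conf -e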

13 Genius file replication

14 Genius file replication

15 Genius file replication

16 Genius file replication

17 Genius file delete

18 Genius file delete

19 Genius file delete

20 GDMP
Originally based on CMS requirements for replicating Objectivity files for High Level Trigger studies
Production prototype project for evaluating Grid technologies (especially Globus)
Experience will directly be used in DataGrid
Input also for PPDG and GriPhyN
GDMP is a data mirroring package that was originally developed as a production prototype for evaluating grid technologies, in particular Globus, in the context of CMS trigger studies. GDMP has been integrated into the EDG software and the experience gained with it is exploited in ongoing EDG replication projects.

21 Overview of Components
Globus Replica Catalogue
GDMP client
GDMP has a client/server architecture in which a GDMP server runs on every SE. GDMP servers may also communicate with each other and have access to a common RC. A GDMP server is able to serve multiple VOs with different replica catalogs.
[Diagram: Site 1, Site 2 and Site 3, each running a GDMP server, sharing a common Replica Catalogue]

22 Subscription Model
All the sites that subscribe to a particular site get notified whenever there is an update in its catalog.
GDMP works according to a subscription/notification scheme: a site that is interested in some particular data subscribes to the site where the data is produced and is subsequently notified whenever an update is published on the provider site.
[Diagram: sites subscribe to one another; each site maintains a subscriber list]

23 Export / Import Catalogue
Export Catalogue: information about the new files produced is published here.
Import Catalogue: information about files which have been published by other sites but not yet transferred locally. As soon as a file is transferred locally, it is removed from the import catalogue. It is also possible to pull the information about new files into your import catalogue.
The subscription/notification model works with the help of two kinds of catalogs: an export and an import catalog. The export catalog is published on the provider site and contains information on the new files the provider wants to publish. The import catalog is automatically updated with any new information available in the export catalogs a site is subscribed to. Once a file referenced in the import catalog has actually been transferred, the corresponding entry is removed. If the import catalog is unreachable, for instance due to network problems, it might miss updates from sites it is subscribed to; in such a case, an active pull may be used to update the import catalog.
[Diagram: Site 1 registers and publishes new files in its export catalog; Sites 2 and 3 get info about the new files into their import catalogs, transfer the files, and then delete the corresponding entries]

24 Usage
gdmp_ping: ping a GDMP server and get its status
gdmp_host_subscribe: first thing to be done by a site
gdmp_register_local_file: registers a file in the local file catalogue but NOT in the Replica Catalogue (RC)
gdmp_publish_catalogue: send information about newly created files to subscribed hosts (no real data transfer) - updates the RC
gdmp_replicate_get / gdmp_replicate_put: get/put all the files from the import catalogue - updates the RC
gdmp_remove_local_file: delete a local file and update the RC
gdmp_get_catalogue: get remote catalogue contents - for error recovery
For detailed information on these commands as well as on the command line options, refer to the GDMP users guide. A typical session is sketched below.
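A typical session, pieced together from the commands above and the walkthrough on the following slides; <HOST>, <PORT> and the data directory are placeholders:

  # On a consumer site: subscribe once to the producer site
  gdmp_host_subscribe -r <HOST> -p <PORT>
  # On the producer site: register newly produced files, then publish them
  gdmp_register_local_file -d /data/files
  gdmp_publish_catalogue
  # Back on the consumer site: fetch everything listed in the import catalogue
  gdmp_replicate_get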

25 Using GDMP
Register all files in a directory at Site 1:
gdmp_register_local_file -d /data/files
[Diagram: data produced at Site 1 (/data/files/file1, /data/files/file2) is to be replicated to Sites 2-5]

26 Using GDMP 2
Start with subscription:
gdmp_host_subscribe -r <HOST> -p <PORT>
[Diagram: the other sites run gdmp_host_subscribe against Site 1, which adds them to its subscriber list]

27 Using GDMP 3
Publish new files; can be combined with filtering:
gdmp_publish_catalogue (might use filter option)
[Diagram: Site 1 publishes its export catalog; the import catalogs of the subscribed sites are updated]

28 Using GDMP 4
Poll for changes in a catalog (pull model); can be combined with filtering and is also used for error recovery:
gdmp_get_catalogue -host <HOST>
[Diagram: a subscribed site runs gdmp_get_catalogue against Site 1 to pull its export catalog into the local import catalog]

29 Using GDMP 5
Transfer files; can use the progress meter:
gdmp_replicate_get
get_progress_meter produces a progress.log; replica.log lists all files already transferred.
[Diagram: the subscribed sites run gdmp_replicate_get to fetch the files referenced in their import catalogs from Site 1]

30 GDMP vs. EDG Replica Manager
GDMP:
Replicates sets of files
Replication between SEs only
Mass storage interface
File size as logical attribute
Subscription model
Event notification
CRC file size check
Support for Objectivity
Client-server
Replica Manager:
Replicates single files
Replication between SEs, and from a UI or CE to an SE
Uses GDMP's mass storage interface at the SE
Client side only
In contrast to GDMP, the edg-replica-manager only supports basic replication of single files; however, it is able to copy between all kinds of testbed machines. For simple replication tasks, edg-replica-manager is the preferred tool; for complex replication tasks, including replication of multiple files to multiple locations, GDMP should be used.

31 File Management Summary
File Management on the Grid: what file management functionalities are expected?
The promise of the Grid is that the user's files will be managed by Grid services. The user's files are identified by a name known to the user, the Logical File Name (files A, B, C, D, X, Y in the slide). A file may be replicated to many Grid sites: in the example on the slide, files A and B are available on both sites, while C, D, X and Y are only available on one of the two sites. In the following we argue for the file management functionalities that are necessary for the DataGrid users (HEP, EO, BIO) in a Grid environment. In order to make files available on more than one site, we need File Transfer services between sites. The Storage Element is defined (from the point of view of Data Management) as an abstract storage service that 'speaks' several file transfer protocols, for example FTP, HTTP, GridFTP, RFIO. The File Transfer service is a service offered by the Storage Element, the default for DataGrid being GridFTP.
[Diagram: Site A with Storage Element A holding files A, B, C, D and Site B with Storage Element B holding files A, B, X, Y, connected by a File Transfer service]

32 File Management Summary
Replica Catalog: map Logical to Site files
If files are available from more than one storage site (because they have been transferred using the File Transfer service), we need a service that keeps track of them. This service is the Replica Catalog; it contains the mappings of the Logical File Names to their Site File Names, i.e. the names by which the files can be accessed through the Storage Element at a given site.
[Diagram as before, with the Replica Catalog added]

33 File Management Summary
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Next, we need to provide a service that chooses between the available replicas based on a set of information available from the Grid: network bandwidth, storage element access speed, etc.
[Diagram as before, with Replica Selection added]

34 File Management Summary
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Pre-/Post-processing: prepare files for transfer, validate files after transfer
Many file types need some kind of pre- and/or post-processing before and/or after transfer. These steps might involve data extraction, conversion or encryption for pre-processing, and importation, conversion, decryption, etc. for post-processing. Also, as a Quality-of-Service step, the file may be validated after transfer (by checking its checksum, for example). This service should be customizable by the applications that use certain file types.
[Diagram as before, with Pre-/Post-processing added]

35 File Management Summary
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Pre-/Post-processing: prepare files for transfer, validate files after transfer
Replication Automation: data source subscription
Automated replication based on a subscription model is also desirable, especially for HEP, where the data coming from the LHC will have to be distributed automatically over the world. The subscription semantics may be defined through file name patterns, data locations, etc. This service should also be customizable by the Virtual Organizations setting it up.
[Diagram as before, with Replication Automation added]

36 File Management Summary
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Pre-/Post-processing: prepare files for transfer, validate files after transfer
Replication Automation: data source subscription
Load balancing: replicate based on usage
Another desirable automatic replication mechanism that should be available through a service is automated replication to balance the access load on 'popular' files. This service needs to store and analyze user access patterns on the files available through the Grid. Replication is triggered by rules that are applied to the usage patterns of the existing files.
[Diagram as before, with Load balancing added]

37 Replica Manager: 'atomic' replication operation, single client interface, orchestrator
File Management
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Pre-/Post-processing: prepare files for transfer, validate files after transfer
Replication Automation: data source subscription
Load balancing: replicate based on usage
The user does not want to be burdened with contacting each of these services directly, taking care of the correct order of operations and making sure that errors are caught and handled accordingly. This functionality is provided by the Replica Manager, which acts as the single interface to all replication services. As an additional QoS functionality it should provide transactional integrity of the steps involved in high-level usage of the underlying services. It should be able to gracefully recover interrupted transfers, catalog updates, pre- and post-processing steps, etc.
[Diagram as before, with the Replica Manager orchestrating the services]

38 Replica Manager: 'atomic' replication operation, single client interface, orchestrator
File Management
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Pre-/Post-processing: prepare files for transfer, validate files after transfer
Replication Automation: data source subscription
Load balancing: replicate based on usage
Metadata: LFN metadata, transaction information, access patterns
A helper service to be used by the File Management services described so far is the metadata service. It should be able to store all metadata associated with Data Management, such as attributes on LFNs, patterns for the load-balancing service, transaction locks for the replica manager, etc.
[Diagram as before, with the Metadata service added]

39 Security
Replica Manager: 'atomic' replication operation, single client interface, orchestrator
File Management
Replica Catalog: map Logical to Site files
Replica Selection: get the 'best' file
Pre-/Post-processing: prepare files for transfer, validate files after transfer
Replication Automation: data source subscription
Load balancing: replicate based on usage
Metadata: LFN metadata, transaction information, access patterns
Across all of these services we need to impose Security, which means that the services need to authenticate and authorize their clients. In order to provide a higher level of consistency between the replicas, we require users to transfer all administrative rights on the files that they register through the Replica Manager. To administer their files, they therefore need to go through the Replica Manager again; administering the files directly through the Storage Element will be denied. This enables the Replica Manager to act on the user's behalf; automated replication would not be possible without this restriction.
[Diagram as before, with Security spanning all services]

