The EU DataGrid Data Management


The EU DataGrid Data Management
The European DataGrid Project Team
http://www.eu-datagrid.org

EDG Tutorial Overview
- Information Service
- Workload Management Services
- Data Management Services
- Networking
- Fabric Management

Overview
- Data Management Issues
- Main Components
- EDG Replica Catalog
- EDG Replica Manager
- GDMP

In this lecture you will learn about data management issues within a Data Grid and about the main components that cope with these issues. Subsequently, the data management tools currently deployed on the EDG testbed are presented.

Data Management Issues
On the grid, the user has access to different kinds of computing and data services connected via the Internet. If you submit a job, you typically do not know in advance where it will end up, and the input data needed by the job might be on any storage element.

Data Management Issues (cont.)
Of course, you might steer a job to a location where the data is available, but what if you need several pieces of data that reside in different places? Similarly, when your job is finished, where should you put the output data so that it is easily accessible to yourself and to others in subsequent applications?

Data Management Tools
Tools are needed for:
- Locating data
- Copying data
- Managing and replicating data
- Meta data management

On the EDG Testbed you have:
- EDG Replica Catalog
- globus-url-copy (GridFTP)
- EDG Replica Manager
- Grid Data Mirroring Package (GDMP)
- Spitfire

To cope with these issues you need several tools that allow you to find data on the grid, to copy data, to replicate data (i.e. to produce exact copies for better locality) and, ideally, high-level data management tools that hide most of the complexity. In addition, applications may store additional meta data with their files, which of course also needs to be managed. On the EDG Testbed we currently provide: a replica catalog that stores physical locations of files, i.e. Physical File Names, together with a logical identifier, a Logical File Name; Globus tools (globus-url-copy, which uses the GridFTP protocol) for secure file transfer; GDMP, a data mirroring package that automatically provides data to sites that have registered an interest in it; the EDG Replica Manager, which replicates files while keeping the replica catalog consistent; and Spitfire, a front end to grid-enabled databases for managing meta data (this tool is not covered in this tutorial).

EDG Replica Catalog
- Based upon the Globus LDAP Replica Catalog
- Stores LFN/PFN mappings and additional information (e.g. file size):
  - Physical File Name (PFN): host + full path and file name
  - Logical File Name (LFN): logical name that may be resolved to PFNs
  - LFN : PFN = 1 : n
- Only files on storage elements may be registered
- Each VO has a specific storage directory on an SE
- Example PFN: lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat
  (host: lxshare0222.cern.ch; storage dir: /flatfiles/SE1/iteam)
- The LFN must be the full path of the file starting from the storage dir;
  the LFN of the above PFN is therefore: file1.dat

The EDG Replica Catalog is based upon the RC provided by Globus. It is essentially an LDAP server that stores LFN/PFN mappings together with additional information, such as file size. Note that only files stored on SEs may be registered in the catalog. Moreover, there is a restriction on file names: every replica must have the same file name starting from a VO-specific storage directory on the SE. This storage directory may be retrieved from the information services: http://testbed007.cern.ch/tbstatus-bin/infoindexcern.pl. The LFN for a file must be the full path and file name starting from the storage directory.

EDG Replica Catalog API and command line tools
- addLogicalFileName / getLogicalFileName / deleteLogicalFileName
- addPhysicalFileName / getPhysicalFileName / deletePhysicalFileName
- addLogicalFileAttribute / getLogicalFileAttribute / deleteLogicalFileAttribute

http://cmsdoc.cern.ch/cms/grid/userguide/gdmp-3-0/node85.html

Main command line options:
-l: logical file name
-p: physical file name
-n: attribute name
-v: attribute value
-c: config file name
-d: don't show all output
-C: clear text password authorization (currently used by most catalogs; -C is required!)
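As a rough illustration of how these options combine, here is a hypothetical session; the command names simply reuse the API names above and the config file name is made up (check the user guide linked above for the real executables), only the options themselves are taken from the list:

  # register a logical file and attach a physical replica to it
  # (command names are illustrative; -l, -p, -c, -C as listed above)
  addLogicalFileName  -l file1.dat -c rc.conf -C
  addPhysicalFileName -l file1.dat \
      -p lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat -c rc.conf -C

  # look up all registered replicas of the logical file
  getPhysicalFileName -l file1.dat -c rc.conf -C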

globus-url-copy
- Low level tool for secure copying:
  globus-url-copy <protocol>://<source file> <protocol>://<destination file>
- Main protocols:
  - gsiftp: for secure transfer, only available on SE and CE
  - file: for accessing files stored on the local file system, e.g. on UI or WN

Example:
  globus-url-copy file://`pwd`/file1.dat \
      gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat

globus-url-copy is a low-level tool for secure copying. It should only be used if a plain copy, without registering the file in the RC, is desired. Otherwise, the higher-level tools discussed next should be used.
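The same tool works in the opposite direction, e.g. to fetch a replica from an SE back to the local file system; the host and path below simply reuse the example above:

  # copy a file from the storage element into the current directory
  globus-url-copy \
      gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat \
      file://`pwd`/file1.dat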

The EDG Replica Manager
- Extends the Globus replica manager
- Client-side tool only
- Allows replication (copying) and registration of files in the RC
- Works with the LDAP-based RC and with RLS (see future directions)
- Keeps the RC consistent with the stored data
- Uses GDMP's staging interface to stage to MSS

The EDG Replica Manager may be used to replicate data while keeping the replica catalog in sync. It is a prototype developed on top of the Globus replica manager in order to gain experience for the ongoing development of an intelligent replica manager, called Reptor (cf. the lecture on Future Directions).

The Replica Manager APIs
(un)registerEntry(LogicalFileName lfn, FileName source)
- Replica Catalogue operations only, no file transfer

copyFile(FileName source, FileName destination, String protocol)
- allows for third-party transfer
- transfer between two Storage Elements, or between a Computing Element and a Storage Element
- space management policies under development
- all tools support parallel streams for file transfers

This and the next slide show the API functions of the Replica Manager (RM), to be used by clients and applications. Main command line options:
-l: logical file name
-s: source file name
-d: destination file name
-c: config file for the replica catalog
-e: verbose error output (important for testing!)

The Replica Manager APIs (cont.)
copyAndRegisterFile(LogicalFileName lfn, FileName source, FileName destination, String protocol)
- third-party transfer, but: files can only be registered in the Replica Catalogue if the destination PFN contains a valid SE (i.e. one that is registered in the RC)!

replicateFile(LogicalFileName lfn, …)
deleteFile(LogicalFileName lfn, FileName source)
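On the command line these calls combine with the options listed on the previous slide. The executable name and config file below are assumptions (the actual names differed between EDG releases); only the options and the call semantics come from the slides:

  # copy a local file to an SE and register it under an LFN in one step
  # (illustrative command name; -l/-s/-d/-c/-e as documented above)
  edg-replica-manager-copyAndRegisterFile -e -c rc.conf \
      -l file1.dat \
      -s `pwd`/file1.dat \
      -d lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat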

Genius file replication
[sequence of slides showing the replication steps in the Genius portal; no transcript text]

Genius file delete
[sequence of slides showing the delete steps in the Genius portal; no transcript text]

GDMP
- Originally based on CMS requirements for replicating Objectivity files for High Level Trigger studies
- Production prototype project for evaluating Grid technologies (especially Globus)
- Experience will directly be used in DataGrid
- Input also for PPDG and GriPhyN
- http://cern.ch/GDMP

GDMP is a data mirroring package that was originally developed as a production prototype for evaluating grid technologies, in particular Globus, in the context of CMS trigger studies. GDMP has been integrated into the EDG software, and the experience gained with it is exploited in ongoing EDG replication projects.

Overview of Components
GDMP has a client/server architecture: a GDMP server runs on every SE. GDMP servers may also communicate with each other and have access to a common Globus Replica Catalogue. A GDMP server is able to serve multiple VOs with different replica catalogs.
[Diagram: GDMP clients and servers at Site 1, Site 2 and Site 3, sharing a Globus Replica Catalogue]

Subscription Model
All the sites that subscribe to a particular site get notified whenever there is an update in its catalog.

GDMP works according to a subscription/notification scheme. A site that is interested in some particular data subscribes to the site where the data is produced, and will subsequently be notified whenever an update is published at the provider site.
[Diagram: sites subscribing to one another; each site maintains a subscriber list]

Export / Import Catalogue
- Export Catalog: information about the new files produced is published here
- Import Catalog: information about files which have been published by other sites but not yet transferred locally; as soon as a file is transferred locally, it is removed from the import catalogue
- It is also possible to pull the information about new files into your import catalogue

The subscription/notification model works with the help of two kinds of catalogs: an export and an import catalog. The export catalog is published at the provider site and contains information on new files the provider wants to publish. The import catalog is automatically updated with new information available in any export catalog the site is subscribed to. Once a file referenced in the import catalog has actually been transferred, the corresponding entry is removed. If the import catalog is unreachable, for instance due to network problems, it might miss updates from sites it is subscribed to. In such a case, an active pull may be used to update the import catalog.
[Diagram: Site 1 registers and publishes new files in its export catalog; Sites 2 and 3 pull the information into their import catalogs, transfer the files, and then delete the corresponding entries]

Usage
- gdmp_ping: ping a GDMP server and get its status
- gdmp_host_subscribe: the first thing to be done by a site
- gdmp_register_local_file: registers a file in the local file catalogue, but NOT in the Replica Catalogue (RC)
- gdmp_publish_catalogue: sends information on newly created files to subscribed hosts (no real data transfer); updates the RC
- gdmp_replicate_get / gdmp_replicate_put: get/put all the files from the import catalogue; updates the RC
- gdmp_remove_local_file: deletes a local file and updates the RC
- gdmp_get_catalogue: gets remote catalogue contents, for error recovery

For detailed information on these commands as well as their command line options refer to the GDMP user's guide at: http://cmsdoc.cern.ch/cms/grid/userguide/gdmp-3-0/node47.html

Using GDMP (1)
Data produced at Site 1 is to be replicated to the other sites. First, register all files in a directory at Site 1:
  gdmp_register_local_file -d /data/files
[Diagram: Site 1 holds /data/files/file1, /data/files/file2, …; Sites 2 to 5 are to receive replicas]

Using GDMP (2)
Each interested site starts with a subscription to the producer:
  gdmp_host_subscribe -r <HOST> -p <PORT>
[Diagram: the other sites subscribe to Site 1, which adds them to its subscriber list]

Using GDMP (3)
Publish the new files; this can be combined with filtering:
  gdmp_publish_catalogue   (might use the filter option)
[Diagram: Site 1's export catalog is pushed to the import catalogs of the subscribed sites]

Using GDMP (4)
Poll for changes in the catalog (pull model); this can be combined with filtering and is also used for error recovery:
  gdmp_get_catalogue -host <HOST>
[Diagram: a site pulls Site 1's export catalog contents into its own import catalog]

Using GDMP (5)
Transfer the files; the progress meter can be used:
  gdmp_replicate_get
get_progress_meter produces a progress.log; replica.log lists all files already transferred.
[Diagram: the subscribed sites run gdmp_replicate_get to fetch the files from Site 1]
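Putting the five steps together, a minimal two-site session might look as follows. The host name and port are placeholders; only commands shown on the preceding slides are used:

  # --- at Site 2 (consumer), done once ---
  gdmp_host_subscribe -r site1.example.org -p 2000   # subscribe to the producer

  # --- at Site 1 (producer), for each batch of new files ---
  gdmp_register_local_file -d /data/files            # add files to the local catalog
  gdmp_publish_catalogue                             # notify subscribers, update RC

  # --- at Site 2 (consumer) ---
  gdmp_replicate_get                # fetch everything listed in the import catalog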

GDMP vs. EDG Replica Manager

GDMP                                     | EDG Replica Manager
-----------------------------------------+-----------------------------------------------
Replicates sets of files                 | Replicates single files
Replication between SEs only             | Replication between SEs, and from UI or CE to SE
Mass storage interface                   | Uses GDMP's mass storage interface at the SE
File size as logical attribute           |
Subscription model, event notification   |
CRC file size check                      |
Support for Objectivity                  |
Client-server                            | Client side only

In contrast to GDMP, the edg-replica-manager only supports basic replication of single files; however, it is able to copy between all kinds of testbed machines. For simple replication tasks, edg-replica-manager is the preferred tool; for complex replication tasks, including replication of multiple files to multiple locations, GDMP should be used.

File Management Summary
What file management functionalities are expected?

The promise of the Grid is that the user's files will be managed by the Grid services. The user's files are identified by a name that is known to the user, the Logical File Name (files A, B, C, D, X, Y in the diagram). A file may be replicated to many Grid sites; in the example, files A and B are available at both sites, while C, D, X and Y are only available at one of the two sites. In the following we will argue for the file management functionalities that are necessary for the DataGrid users (HEP, EO, BIO) in a Grid environment. In order to make files available at more than one site, we need a File Transfer service between sites. The Storage Element is defined (from the point of view of Data Management) as an abstract storage service that 'speaks' several file transfer protocols, for example FTP, HTTP, GridFTP, RFIO. The File Transfer service is a service offered by the Storage Element, the default for DataGrid being GridFTP.
[Diagram: Site A with Storage Element A (files A, B, C, D) and Site B with Storage Element B (files A, B, X, Y), connected by File Transfer]

File Management Summary (2)
Replica Catalog: map Logical File Names to Site files

If files are available from more than one storage site (because they have been transferred using the File Transfer service), we need a service that keeps track of them. This service is the Replica Catalog; it contains the mappings of the Logical File Names to their Site File Names, i.e. the names by which the files can be accessed through the Storage Element at a given site.
[Diagram as before, extended with the Replica Catalog]

File Management Summary (3)
Replica Selection: get the 'best' file

Next, we need a service that makes a choice between the available replicas, based on information available from the Grid: network bandwidth, storage element access speed, etc.
[Diagram as before, extended with Replica Selection]

File Management Summary (4)
Pre-/Post-processing: prepare files for transfer; validate files after transfer

Many file types need some kind of pre- and/or post-processing before and/or after transfer. These steps might involve data extraction, conversion or encryption for pre-processing, and importation, conversion or decryption for post-processing. Also, as a quality-of-service step, the file may be validated after transfer (by checking its checksum, for example). This service should be customizable by the applications that use certain file types.
[Diagram as before, extended with Pre-/Post-processing]
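To make the validation idea concrete, here is a minimal sketch of a checksum check around a transfer, done by hand with standard tools; the EDG post-processing service would perform the equivalent internally, and the host and path simply reuse the earlier example:

  # record the checksum before the transfer
  md5sum file1.dat > file1.dat.md5

  globus-url-copy file://`pwd`/file1.dat \
      gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat

  # after fetching the replica back into an empty directory, verify it
  md5sum -c file1.dat.md5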

File Management Summary (5)
Replication Automation: data source subscription

Automated replication based on a subscription model is also desirable, especially for HEP, where the data coming from the LHC will have to be distributed automatically over the world. The subscription semantics may be defined through file name patterns, data locations, etc. This service should also be customizable by the Virtual Organizations setting it up.
[Diagram as before, extended with Replication Automation]

File Management Summary (6)
Load balancing: replicate based on usage

Another desirable automatic replication mechanism, to be offered as a service, is replication that balances the access load on 'popular' files. This service needs to store and analyze user access patterns for the files available through the Grid. Replication is triggered by rules that are applied to the usage patterns of the existing files.
[Diagram as before, extended with Load Balancing]

File Management (7)
Replica Manager: 'atomic' replication operation; single client interface; orchestrator

The user does not want to be burdened with contacting each of these services directly, taking care of the correct order of operations and making sure that errors are caught and handled accordingly. This functionality is provided by the Replica Manager, which acts as the single interface to all replication services. As an additional QoS functionality it should provide transactional integrity of the steps involved in the high-level usages of the underlying services. It should be able to gracefully recover interrupted transfers, catalog updates, pre- and post-processing steps, etc.
[Diagram as before, with the Replica Manager orchestrating all of the services above]

File Management (8)
Metadata: LFN metadata; transaction information; access patterns

A helper service to be used by the File Management services described so far is the metadata service. It should be able to store all metadata associated with Data Management, such as attributes on LFNs, patterns for the load-balancing service, transaction locks for the Replica Manager, etc.
[Diagram as before, extended with the Metadata service]

File Management (9): Security

Across all of these services we need to impose the stamp of Security, which means that the services need to authenticate and authorize their clients. In order to provide a higher level of consistency between the replicas, we require users to transfer all administrative rights on the files that they register through the Replica Manager. To administer their files they will therefore need to go through the Replica Manager again; administering the files directly through the Storage will be denied. This enables the Replica Manager to act on the user's behalf; automated replication would not be possible without this restriction.
[Diagram as before, with Security spanning all services]