Published by Laureen Oliver. Modified over 9 years ago.
1 File and Object Replication in Data Grids
Chin-Yi Tsai
2 Outline
- Introduction
- Background and Related Work
- Globus Data Grid Tools
- File Replication Tool: GDMP
- Object Replication
- Experimental Results with GridFTP
- Conclusion
4 Introduction
- Data replication
  - Data replication is a key issue in Data Grids, at both the file and the object level
- Distributed analysis of experimental data
  - High Energy Physics (HEP) community: CERN, ATLAS, CMS
  - The CMS experiment is a high energy physics experiment located at CERN that will start taking data in 2006
- Computing, storage, network
  - There is a natural mapping to a Grid environment
- The GDMP architecture uses the Globus Data Grid tools as middleware
[Figure: a file containing several objects]
5 File Replication and Object Replication
[Figure: two diagrams, each showing Grid site 1 (source) and Grid site 2 (destination). Labels in the first diagram: local storage, application. Labels in the second diagram: application, object copier tool.]
6
- The major focus of the European DataGrid Project is on High Energy Physics
- Object data stores are used for next-generation experiments
  - objects are important for data handling
- Grid software mainly deals with file replication issues
  - a single file (about 1 or 2 GB in size; a few PB in total) contains many objects
  - most objects are read-only
7 Related Data Grid Projects
- Earth System Grid (ESG): management of climate data
- Particle Physics Data Grid (PPDG): HEP applications
- Grid Physics Network (GriPhyN): realizing the concept of Virtual Data
8 Objectivity
[Figure: a database federation made up of databaseA, databaseB, and databaseC, stored in file A, file B, and file C; each file holds object 1 through object 5]
9 Globus Data Grid Tools
- The Globus Toolkit is an open source software toolkit used for building Grid middleware
- Four main components of Globus
  - The Grid Security Infrastructure (GSI)
  - The Globus Resource Management
  - The Globus Information Management architecture
  - The Data Management architecture, or Data Grid: GridFTP, Replica Management
10 GridFTP
[Figure: GridFTP as a uniform client interface in front of the storage systems SRB, HPSS, DFS, and DPSS]
11 Features of GridFTP
- GSI and Kerberos support (GSS API)
- Third-party control of data transfer (adds GSS API)
- Parallel data transfer: multiple TCP streams, single host
- Striped data transfer: multiple TCP streams, multiple hosts/servers
- Partial file transfer
- Automatic negotiation of TCP buffer/window sizes
- Support for reliable and restartable data transfer
[Figure: users at Site A and Site B, each behind a local security infrastructure, connected through GSI]
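To make the parallel and partial transfer features concrete, here is a minimal sketch of how a file could be divided into byte ranges, one per TCP stream. It does not use the GridFTP libraries; the function name `split_into_ranges` and the (offset, length) representation are assumptions chosen for illustration only.

```python
def split_into_ranges(file_size: int, num_streams: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs that cover file_size bytes exactly once.

    Hypothetical helper: each pair describes the partial transfer that one
    parallel stream would be responsible for.
    """
    if file_size < 0 or num_streams < 1:
        raise ValueError("file_size must be >= 0 and num_streams >= 1")

    # Spread the remainder over the first streams so lengths differ by at most 1.
    base, extra = divmod(file_size, num_streams)
    ranges = []
    offset = 0
    for i in range(num_streams):
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # fewer bytes than streams: the remaining streams get nothing
        ranges.append((offset, length))
        offset += length
    return ranges


# A 100 MB file (104857600 bytes) split over 4 streams.
for offset, length in split_into_ranges(104857600, 4):
    print(f"stream reads {length} bytes starting at offset {offset}")
```

A striped transfer would use the same kind of partitioning, with the ranges assigned to different hosts instead of different streams on one host.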
12 The GridFTP Protocol Implementation
- The two main libraries
  - globus_ftp_control_library
  - globus_ftp_client_library
13 Replica Catalog
- Mapping between logical names for files or collections and one or more copies of the objects on physical storage systems
- Three types of entries
  - logical collections
  - locations (physical)
  - logical files
14 One Application Model
[Figure: a replica catalog tree. The catalog holds two logical collections, "Weather measurement 2003" and "Weather measurement 2002". A collection lists its filenames (Jan 2003, Feb 2003, … Dec 2003) and has location entries for cwb.gov.tw, ntu.edu.tw, and fcu.edu.tw. Each location lists the filenames it stores; one location entry shows Protocol: GridFTP, Hostname: cwb.gov.tw, Path: nfs/weather/. Logical file entries (for example Jan 2003) refer to their parent collection.]
15 An Example Replication Scenario
- Logical files: File1: 100MB, File2: 200MB, File3: 300MB, File4: 400MB, File5: 500MB
- Site A holds File1, File2, File3, File4
- Site B holds File2, File3, File5
- Input files for the command-line tool
  - listCollectionNamesFile: File1 File2 File3 File4 File5
  - listANamesFile: filename:File1 filename:File2 filename:File3 filename:File4
  - listBNamesFile: filename:File2 filename:File3 filename:File5
  - namesToSearchFile: filename:File4 filename:File5
- Location entry corresponding to site A
  - uc: gridftp://Ahost.isi.edu:2222/nfs/path/on/A
- Location entry corresponding to site B
  - uc: gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B
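The scenario can be modelled with a small in-memory structure, shown here as a sketch rather than as the Globus replica catalog itself. The class names, the `find_locations` and `physical_url` methods, and the choice to report every location that holds at least one of the searched files are all assumptions; the slides do not state the exact matching rule of the real tool. Only the file names, sizes, and the two uc values come from the scenario.

```python
from dataclasses import dataclass, field

MB = 1024 * 1024  # 100 MB = 104857600 bytes


@dataclass
class Location:
    """A physical location entry: a URL constructor plus the files stored there."""
    url_constructor: str
    filenames: set[str] = field(default_factory=set)

    def physical_url(self, filename: str) -> str:
        # Hypothetical rule: append the logical file name to the uc attribute.
        return self.url_constructor.rstrip("/") + "/" + filename


@dataclass
class LogicalCollection:
    """A logical collection with its logical files (name -> size) and locations."""
    sizes: dict[str, int] = field(default_factory=dict)
    locations: dict[str, Location] = field(default_factory=dict)

    def find_locations(self, wanted: list[str]) -> dict[str, list[str]]:
        """Map each location name to the wanted files it holds (empty hits omitted)."""
        hits = {}
        for name, location in self.locations.items():
            found = sorted(f for f in wanted if f in location.filenames)
            if found:
                hits[name] = found
        return hits


catalog = LogicalCollection(
    sizes={f"File{i}": i * 100 * MB for i in range(1, 6)},
    locations={
        "locationA": Location(
            "gridftp://Ahost.isi.edu:2222/nfs/path/on/A",
            {"File1", "File2", "File3", "File4"},
        ),
        "locationB": Location(
            "gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B",
            {"File2", "File3", "File5"},
        ),
    },
)

for name, files in catalog.find_locations(["File4", "File5"]).items():
    for f in files:
        print(name, catalog.locations[name].physical_url(f))
```

In this scenario neither site holds both File4 and File5, so a search for the two files has to combine results from site A and site B.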
16 Implementing This Scenario with the Command-Line Tool
- Registering the collection
    globus-replica-catalog -host -manager -password <> -collection -create listCollectionNamesFile
- Registering the location A
    globus-replica-catalog -host -manager -password <> -location locationA? -create gridftp://Ahost.isi.edu:2222/nfs/path/on/A listANamesFile
- Registering the location B
    globus-replica-catalog -host -manager -password <> -location locationB? -create gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B listBNamesFile
17
- Registering the logical files File1, File2, File3, File4, File5
    globus-replica-catalog -host -manager -password <> -logicalfile File1 -create 104857600
- Searching for the uc (URL constructor) attribute of all locations that contain File4 and File5
    globus-replica-catalog -host -manager -password <> uc -collection -find-locations namesToSearchFile -attributes uc
- Listing the value of the size attribute of File2
    globus-replica-catalog -host -manager -password <> -logicalfile File2? -list-attributes size
18 GDMP Architecture
- The GDMP client-server software system is a generic file replication tool
- Components
  - Request Manager
  - Security Layer
  - Replica Catalog Service
  - Data Mover Service
  - Storage Manager Service
19 Replica Catalog Service
- Maintains a global file name space of replicas
- For each new file, the catalog records
  - logical file name
  - meta-information
  - physical location
- Client sites query the Replica Catalog Service
- Implementation
  - LDAP and the Globus library (replica catalog)
  - High-level API
[Figure: layered stack of Application, Replica Catalog Service, and Globus Replica Catalog, with an API between the layers]
20 Data Mover Service
- Layered design: high-level API and low-level service
- Data transfer requirements: security, performance, robustness
- Uses GridFTP as GDMP’s underlying file transfer mechanism
- Handles network failures and performs an additional check for corruption
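The robustness requirement (recover from network failures, then check the received data for corruption) can be sketched as a retry loop around a transfer call. This is not GDMP code: `transfer_with_retry`, the `fetch` callback, and the use of SHA-256 as the corruption check are assumptions made for illustration, and the slides do not say which check GDMP performs.

```python
import hashlib
import time
from typing import Callable


def sha256_of(data: bytes) -> str:
    """Hex digest used here as the (assumed) corruption check."""
    return hashlib.sha256(data).hexdigest()


def transfer_with_retry(
    fetch: Callable[[], bytes],
    expected_digest: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> bytes:
    """Call fetch() until it returns data matching expected_digest.

    fetch stands in for the underlying file transfer; it may raise OSError
    (which includes ConnectionError) to signal a network failure.
    """
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            data = fetch()
        except OSError as exc:
            last_error = exc  # network failure: try again
        else:
            if sha256_of(data) == expected_digest:
                return data  # transfer succeeded and passed the check
            last_error = ValueError("checksum mismatch")  # corrupted: try again
        if attempt < max_attempts:
            time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error
```

Keeping the retry and verification logic in a high-level function and the actual byte movement in `fetch` mirrors the layered split between a high-level API and a low-level service.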
21 Storage Management Service
- Uses external tools for staging (different for each MSS)
- Assumes that each site has a local disk pool, which acts as a data transfer cache
- GDMP triggers file staging to the disk pool
- If a file is not located on the disk pool but requested by a remote site, GDMP initiates a disk-to-disk file transfer
- GDMP has a plug-in for the Hierarchical Resource Manager (HRM) APIs, which provide a common interface for accessing different Mass Storage Systems
- The implementation is based on CORBA
[Figure: Site B, GDMP, disk pool]
22 Object Replication: Motivation
- File replication works well for many kinds of applications; however, it is too inefficient for physics analysis
  - only a few objects of a file are requested
  - physicists want to have replicas on specific sites with sufficient CPU power
  - they do not want the entire file, only a few objects
  - file replication: overhead in terms of data to be transferred
- Use an object copier to copy objects to a file and then replicate the “new” file
  - one object per file is inefficient since object size is between 100 bytes and 1 MB: too many files
[Figure: Grid site 1 (source), Grid site 2 (destination), application, object copier tool]
23 Object Replication: Architecture Choices
- Large, world-wide distributed databases are not considered very attractive in HEP
- Significant parts of GDMP and Globus are used
- Object replication cycle
  - objects are identified by the application
  - objects not present at the location are identified
  - “missing” objects are copied into new files and then transferred to the application
- Copy and file transfer are pipelined to achieve a better response time
- Index files are used for locating objects
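The replication cycle can be sketched in a few functions: find the requested objects that are missing locally, pack them into new files, and hand each file to the transfer step as soon as it is full. All names below (`missing_objects`, `pack_into_files`, `replicate`, the `transfer` callback, and the size limit per file) are hypothetical. The generator only interleaves packing and transfer in one thread; a real pipeline would overlap them concurrently.

```python
from typing import Callable, Iterable, Iterator


def missing_objects(requested: Iterable[str], local_index: set[str]) -> list[str]:
    """Requested object IDs that the local index does not contain, without duplicates."""
    seen: set[str] = set()
    missing = []
    for oid in requested:
        if oid not in local_index and oid not in seen:
            seen.add(oid)
            missing.append(oid)
    return missing


def pack_into_files(
    object_ids: Iterable[str], sizes: dict[str, int], max_file_bytes: int
) -> Iterator[list[str]]:
    """Group objects into new files, yielding each file as soon as it is full."""
    batch: list[str] = []
    batch_bytes = 0
    for oid in object_ids:
        size = sizes[oid]
        if batch and batch_bytes + size > max_file_bytes:
            yield batch  # this file is complete and can be transferred now
            batch, batch_bytes = [], 0
        batch.append(oid)
        batch_bytes += size
    if batch:
        yield batch


def replicate(
    requested: Iterable[str],
    local_index: set[str],
    sizes: dict[str, int],
    max_file_bytes: int,
    transfer: Callable[[list[str]], None],
) -> int:
    """Run one replication cycle and return the number of files transferred."""
    files_sent = 0
    to_copy = missing_objects(requested, local_index)
    for new_file in pack_into_files(to_copy, sizes, max_file_bytes):
        transfer(new_file)             # ship the file while later ones are still unpacked
        local_index.update(new_file)   # index the objects that are now present locally
        files_sent += 1
    return files_sent
```

Packing several objects per file avoids the one-object-per-file case, which would create too many small files.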
24 Object Replication: Prototyping Experience
- Most current next-generation experiments do not do analysis yet, so object replication is still a prototype
- File replication based on GDMP is in production use
- The machine where the object copier is running needs to be powerful (CPU and I/O)
25 Experimental Results with GridFTP
- Main motivation
  - study the impact of TCP socket buffer size tuning on parallel data transfers
  - understand the throughput that can be achieved in realistic settings
- To get maximal throughput, it is critical to use optimal TCP send and receive socket buffer sizes (neither too small nor too large)
- Test server: WU-ftpd server 0.4b6
- Test programs: extended_get, extended_put
26 Experimental Results with GridFTP (cont’d)
27 Experimental Results with GridFTP (cont’d)
- Optimal TCP buffer size = RTT * (speed of bottleneck link)
  - RTT is measured with the Unix ping tool
  - bottleneck link speed is measured with pipechar (new tool from LBNL)
- A simple method to determine the optimal number of parallel streams is not known yet
  - too many streams may overload the receiving host
  - usually, 4 to 8 parallel streams are optimal
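The buffer size formula is the bandwidth-delay product, and a short worked example shows the unit handling. The function name and the choice of milliseconds and Mbit/s as input units are assumptions; the sample values are illustrative and are not measurements from the experiments.

```python
def optimal_tcp_buffer_bytes(rtt_ms: float, bottleneck_mbps: float) -> int:
    """Bandwidth-delay product: RTT * (speed of bottleneck link), in bytes.

    rtt_ms          round-trip time in milliseconds (for example from ping)
    bottleneck_mbps bottleneck link speed in Mbit/s (10**6 bits per second)
    """
    # (rtt_ms / 1000) s * (mbps * 10**6) bit/s = rtt_ms * mbps * 1000 bits
    bits_in_flight = rtt_ms * bottleneck_mbps * 1000
    return round(bits_in_flight / 8)


# Illustrative values: 100 ms RTT over a 100 Mbit/s bottleneck link.
print(optimal_tcp_buffer_bytes(100, 100))  # 1250000 bytes, about 1.2 MB
```

A buffer smaller than this value leaves the link idle while the sender waits for acknowledgements, which is why tuning it matters for wide-area transfers.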
28 Conclusion
- The GDMP replication service has been enhanced with more advanced data management features
  - namespace and file catalog management
  - efficient file transfer (GridFTP)
- Object-based replication
  - experimental analysis