File and Object Replication in Data Grids Chin-Yi Tsai.


Presentation on theme: "File and Object Replication in Data Grids Chin-Yi Tsai."— Presentation transcript:

1 File and Object Replication in Data Grids Chin-Yi Tsai

2 Outline
- Introduction
- Background and Related Work
- Globus Data Grid Tools
- File Replication Tool: GDMP
- Object Replication
- Experimental Results with GridFTP
- Conclusion

4 Introduction
- Data replication
  - Data replication is a key issue in Data Grids
  - File and object
- Distributed analysis of experimental data
  - High Energy Physics (HEP) community
    - CERN
      - ATLAS
      - CMS
- The CMS experiment is a high energy physics experiment located at CERN that will start taking data in 2006
- Computing, storage, network
  - There is a natural mapping to a Grid environment
- The GDMP architecture uses Globus Data Grid tools as middleware
[Figure: a file containing several objects]

5 File Replication and Object Replication
[Figure: two diagrams, each with Grid site 1 (source) and Grid site 2 (destination). The first is labeled with local storage and an application; the second with an application and an object copier tool.]

6
- Major focus of the European DataGrid Project is on High Energy Physics
- Object data stores are used for next-generation experiments
  - Objects are important for data handling
- Grid software mainly deals with file replication issues
  - A single file (about 1 or 2 GB in size; a few PB in total)
  - contains many objects
  - most objects are read-only

7 Related Data Grid Projects
- Earth System Grid (ESG)
  - Management of climate data
- Particle Physics Data Grid (PPDG)
  - HEP applications
- Grid Physics Network (GriPhyN)
  - Realizing the concept of Virtual Data

8 Objectivity
[Figure: a database federation made up of databaseA, databaseB and databaseC. Each database corresponds to one file (file A, file B, file C), and each file holds object 1 to object 5.]

9 Globus Data Grid Tools
- The Globus Toolkit is an open source software toolkit used for building grids
  - Middleware
- Four main components of Globus
  - The Grid Security Infrastructure (GSI)
  - The Globus Resource Management
  - The Globus Information Management architecture
  - Data Management architecture, or Data Grid
    - GridFTP, Replica Management

10 GridFTP
[Figure: GridFTP as a uniform client interface in front of the storage systems SRB, HPSS, DFS and DPSS.]

11 Features of GridFTP
- GSI and Kerberos support
  - GSS API
- Third-party control of data transfer
  - Adds GSS API
- Parallel data transfer
  - Multiple TCP streams, single host
- Striped data transfer
  - Multiple TCP streams, multiple hosts/servers
- Partial file transfer
- Automatic negotiation of TCP buffer/window sizes
- Support for reliable and restartable data transfer
[Figure: users at Site A and Site B, each behind a local security infrastructure, connected through GSI.]
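
The partial and parallel transfer features listed on this slide can be pictured with a small sketch. This is not how GridFTP itself assigns data to streams; it only illustrates, under that stated simplification, how a file of known size could be cut into contiguous byte ranges so that each of several streams fetches one part. The function name split_ranges is hypothetical.

```python
from typing import List, Tuple


def split_ranges(file_size: int, streams: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into contiguous (offset, length) ranges.

    Illustrative only: one range per stream, sizes differing by at most
    one byte. Streams that would receive zero bytes are dropped.
    """
    if file_size < 0 or streams < 1:
        raise ValueError("file_size must be >= 0 and streams >= 1")
    base, extra = divmod(file_size, streams)
    ranges: List[Tuple[int, int]] = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue
        ranges.append((offset, length))
        offset += length
    return ranges
```

Each (offset, length) pair corresponds to one partial file transfer; issuing them concurrently over separate TCP connections is the idea behind a parallel transfer from a single host.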

12 The GridFTP Protocol Implementation
- The two main libraries
  - globus_ftp_control_library
  - globus_ftp_client_library

13 Replica Catalog
- Mapping between logical names for files or collections and one or more copies of the objects on physical storage systems
- Three types of entries
  - Logical collections
  - Locations (physical)
  - Logical files
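
The three entry types can be modeled as a small in-memory structure. This is a minimal sketch for illustration, not the Globus replica catalog schema (which is stored in LDAP); the class names, field names and the example.org hosts are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Location:
    """A physical location entry: a URL prefix plus the files stored there."""
    url_prefix: str
    filenames: Set[str] = field(default_factory=set)


@dataclass
class LogicalCollection:
    """A logical collection with its location and logical file entries."""
    name: str
    locations: Dict[str, Location] = field(default_factory=dict)
    # Logical file entries: logical file name -> size in bytes.
    logical_files: Dict[str, int] = field(default_factory=dict)

    def replicas(self, filename: str) -> List[str]:
        """Return the physical URLs of every copy of a logical file."""
        return sorted(
            f"{loc.url_prefix}/{filename}"
            for loc in self.locations.values()
            if filename in loc.filenames
        )


# Hypothetical example data.
collection = LogicalCollection(name="demo-collection")
collection.logical_files["run1.db"] = 1024
collection.locations["siteX"] = Location("gridftp://x.example.org/data", {"run1.db"})
collection.locations["siteY"] = Location("gridftp://y.example.org/pool", {"run1.db"})
```

The point of the design is that applications ask for a logical name and the catalog answers with one or more physical copies.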

14 One Application Model
[Figure: a replica catalog tree. The Replica Catalog contains two logical collections, "Weather measurement 2003" and "Weather measurement 2002". Location entries are shown for cwb.gov.tw, ntu.edu.tw and fcu.edu.tw, each with a list of file names (Jan 2003, Feb 2003, ...). One location entry shows the attributes Protocol: GridFTP, Hostname: cwb.gov.tw, Path: nfs/weather/. Logical file entries (for example Jan 2003) are shown with the label "Logical File Parent".]

15 An Example Replication Scenario
Files:
  File1: 100MB
  File2: 200MB
  File3: 300MB
  File4: 400MB
  File5: 500MB
Site A holds: File1, File2, File3, File4
Site B holds: File2, File3, File5
Input files for the command line tool:
  listCollectionNamesFile: File1 File2 File3 File4 File5
  listANamesFile: filename:File1 filename:File2 filename:File3 filename:File4
  listBNamesFile: filename:File2 filename:File3 filename:File5
  namesToSearchFile: filename:File4 filename:File5
Location entry corresponding to site A:
  uc : gridftp://Ahost.isi.edu:2222/nfs/path/on/A
Location entry corresponding to site B:
  uc : gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B

16 Implementing This Scenario with the Command Line Tool
Registering the collection:
  globus-replica-catalog -host -manager -password <> -collection -create listCollectionNamesFile
Registering location A:
  globus-replica-catalog -host -manager -password <> -location locationA -create gridftp://Ahost.isi.edu:2222/nfs/path/on/A listANamesFile
Registering location B:
  globus-replica-catalog -host -manager -password <> -location locationB -create gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B listBNamesFile

17 Registering the logical files File1, File2, File3, File4, File5 (File1 shown):
  globus-replica-catalog -host -manager -password <> -logicalfile File1 -create 104857600
Searching for the uc (URL constructor) attribute of all locations that contain File4 and File5:
  globus-replica-catalog -host -manager -password <> -collection -find-locations namesToSearchFile -attributes uc
Listing the value of the size attribute of File2:
  globus-replica-catalog -host -manager -password <> -logicalfile File2 -list-attributes size
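
The scenario from the last three slides can be mirrored in a few lines of Python to make the expected lookups concrete. This sketch does not call the Globus tools; it only reuses the file sizes, site contents and uc values given in the scenario. The function names are hypothetical, and reporting the matching locations per file is an assumption about how the search result is read, not a statement about the tool's exact output.

```python
from typing import Dict, Iterable, List, Set, Tuple

MB = 1024 * 1024

# File sizes from the example scenario, in MB.
FILE_SIZES_MB: Dict[str, int] = {
    "File1": 100, "File2": 200, "File3": 300, "File4": 400, "File5": 500,
}

# Location name -> (uc attribute, files held), as given in the scenario.
LOCATIONS: Dict[str, Tuple[str, Set[str]]] = {
    "locationA": ("gridftp://Ahost.isi.edu:2222/nfs/path/on/A",
                  {"File1", "File2", "File3", "File4"}),
    "locationB": ("gridftp://Bhost.mcs.anl.gov:7777/nfs/path/on/B",
                  {"File2", "File3", "File5"}),
}


def size_in_bytes(name: str) -> int:
    """Size attribute of a logical file, in bytes."""
    return FILE_SIZES_MB[name] * MB


def locations_holding(names: Iterable[str]) -> Dict[str, List[str]]:
    """For each file name, list the uc of every location that holds it."""
    return {
        name: sorted(uc for uc, held in LOCATIONS.values() if name in held)
        for name in names
    }
```

With 1 MB taken as 1024 * 1024 bytes, size_in_bytes("File1") gives the 104857600 used when File1 is registered, and looking up File4 and File5 points to site A and site B respectively.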

18 GDMP Architecture
- The GDMP client-server software system is a generic file replication tool
[Figure: GDMP components: Request Manager, Security Layer, Replica Catalog Service, Data Mover Service, Storage Manager Service.]

19 Replica Catalog Service
- Maintains a global file name space of replicas
- New file
  - Logical file name
  - Meta-information
  - Physical location
- Client sites query the Replica Catalog Service
- Implementation
  - LDAP and Globus library (replica catalog)
  - High-level API
[Figure: layers labeled Globus Replica Catalog, Replica Catalog Service and Application API.]

20 Data Mover Service
- Layered design
  - High-level API and low-level service
- Data transfer
  - Security, performance, robustness
- Uses GridFTP as GDMP’s underlying file transfer mechanism
- Handles network failures and performs additional checks for corruption
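
The last point, retrying after network failures and checking for corruption, can be sketched as a retry loop around a transfer. This is an illustration of the idea rather than GDMP's code: the fetch callable stands in for the actual file transfer, and the use of CRC-32 and the name transfer_with_check are assumptions.

```python
import zlib
from typing import Callable, Optional


def transfer_with_check(
    fetch: Callable[[], bytes],
    expected_crc: int,
    max_attempts: int = 3,
) -> bytes:
    """Call fetch() until it returns data whose CRC-32 matches expected_crc.

    Network errors (OSError and subclasses such as ConnectionError) and
    checksum mismatches both trigger another attempt.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            data = fetch()
        except OSError as exc:
            last_error = exc  # network failure: retry
            continue
        if zlib.crc32(data) == expected_crc:
            return data
        last_error = ValueError("checksum mismatch")  # corrupted: retry
    raise RuntimeError(f"transfer failed after {max_attempts} attempts") from last_error
```

A caller would obtain expected_crc from the source site before the transfer and pass a function that performs one transfer attempt.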

21 Storage Management Service
- Uses external tools for staging (different for each MSS)
- Assumes that each site has a local disk pool that serves as a data transfer cache
- GDMP triggers file staging to the disk pool
  - If a file is not located on the disk pool but is requested by a remote site, GDMP initiates a disk-to-disk file transfer
- GDMP has a plug-in for the Hierarchical Resource Manager (HRM) APIs, which provide a common interface for accessing different Mass Storage Systems
  - The implementation is based on CORBA
[Figure: Site B with GDMP and its disk pool.]

22 Object Replication Motivation
- File replication works well for many kinds of applications
  - However, it is too inefficient for physics analysis:
  - Only a few objects of a file are requested
  - Physicists want to have replicas on specific sites with sufficient CPU power
  - They do not want the entire file, only a few objects
  - File replication: overhead in terms of data to be transferred
  - Use an object copier to copy objects to a file and then replicate the “new” file
  - One object per file is inefficient, since object sizes range from 100 bytes to 1 MB: too many files
[Figure: Grid site 1 (source) and Grid site 2 (destination), with an application and an object copier tool.]

23 Object Replication Architecture Choices
- Large, world-wide distributed databases are not considered very attractive in HEP
- Significant parts of GDMP and Globus are used
- Object replication cycle:
  - Objects are identified by the application
  - Objects not present at the location are identified
  - “Missing” objects are copied into new files and then transferred to the application
- Copying and file transfer are pipelined to achieve a better response time
- Index files are used for locating objects
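
The replication cycle on this slide can be sketched as two steps: find the requested objects that are missing locally, then pack them into new files for transfer. This is a minimal sketch under stated assumptions, not the prototype's implementation: object identifiers are plain strings, the size limit per file is a hypothetical parameter, and a Python generator stands in for the pipelining, since each packed file becomes available for transfer before the next one is built.

```python
from typing import Dict, Iterable, Iterator, List, Set


def missing_objects(requested: Iterable[str], local: Set[str]) -> List[str]:
    """Return requested object ids not present locally, in request order, without duplicates."""
    seen: Set[str] = set()
    missing: List[str] = []
    for oid in requested:
        if oid not in local and oid not in seen:
            seen.add(oid)
            missing.append(oid)
    return missing


def pack_into_files(
    object_ids: List[str],
    sizes: Dict[str, int],
    max_file_bytes: int,
) -> Iterator[List[str]]:
    """Group objects into files of at most max_file_bytes (an oversized object gets its own file).

    Yields one batch at a time so a consumer can start transferring a file
    while later files are still being assembled.
    """
    batch: List[str] = []
    used = 0
    for oid in object_ids:
        size = sizes[oid]
        if batch and used + size > max_file_bytes:
            yield batch
            batch, used = [], 0
        batch.append(oid)
        used += size
    if batch:
        yield batch
```

Packing several objects per file avoids the "too many files" problem that one object per file would cause.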

24 Object Replication Prototyping Experience
- Most current next-generation experiments do not do analysis yet:
  - Object replication is still a prototype
  - File replication based on GDMP is in production use
- The machine on which the object copier runs needs to be powerful (CPU and I/O)

25 Experimental Results with GridFTP
- Main motivation
  - Study the impact of TCP socket buffer size tuning on parallel data transfers
  - Understand the throughput that can be achieved in realistic settings
- Getting maximal throughput
  - It is critical to use optimal TCP send and receive socket buffer sizes (neither too small nor too large)
- Test server
  - WU-ftpd server 0.4b6
- Test programs
  - extended_get
  - extended_put

26 Experimental Results with GridFTP (cont’d)

27 Experimental Results with GridFTP (cont’d)
- Optimal TCP buffer size = RTT * (speed of bottleneck link)
  - RTT is measured with the Unix ping tool
  - Bottleneck link speed is measured with pipechar (a new tool from LBNL)
- A simple method to determine the optimal number of parallel streams is not known yet
  - Too many streams may overload the receiving host
  - Usually, 4 to 8 parallel streams are optimal
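
The buffer size rule on this slide is the bandwidth-delay product, and a short worked example shows the units involved. The function name and the numbers used below (125 ms RTT, 100 Mbit/s bottleneck) are hypothetical values chosen for illustration, not measurements from the presentation.

```python
def optimal_tcp_buffer_bytes(rtt_seconds: float, bottleneck_bits_per_second: float) -> int:
    """Bandwidth-delay product: RTT * bottleneck link speed, converted to bytes."""
    if rtt_seconds < 0 or bottleneck_bits_per_second < 0:
        raise ValueError("RTT and link speed must be non-negative")
    return int(round(rtt_seconds * bottleneck_bits_per_second / 8))


# Hypothetical wide-area link: 125 ms round trip, 100 Mbit/s bottleneck.
buffer_size = optimal_tcp_buffer_bytes(0.125, 100e6)
print(buffer_size)  # 1562500 bytes, roughly 1.5 MB
```

A buffer smaller than this value leaves the link idle while the sender waits for acknowledgements, which is why tuning it matters on long, fast paths.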

28 Conclusion
- The GDMP replication service has been enhanced with more advanced data management features
  - Namespace
  - File catalog management
  - Efficient file transfer (GridFTP)
  - Object-based replication
  - Experimental analysis

