
1 Distributed Metadata with the AMGA Metadata Catalog. Nuno Santos, Birger Koblitz. Workshop on Next-Generation Distributed Data Management, 20 June 2006. Enabling Grids for E-sciencE (EGEE), INFSO-RI-508833, www.eu-egee.org

2 Abstract
Metadata Catalogs on Data Grids – the case for replication
The AMGA Metadata Catalog
Metadata Replication with AMGA
Benchmark Results
Future Work / Open Challenges

3 Metadata Catalogs
Metadata on the Grid
– File Metadata: describes files with application-specific information
  • Purpose: file discovery based on their contents
– Simplified Database Service: stores generic structured data on the Grid
  • Not as powerful as a full DB, but easier to use and with better Grid integration (security, hiding DB heterogeneity)
Metadata services are essential for many Grid applications and must be accessible Grid-wide. But Data Grids can be large…

4 An Example: The LCG Sites
LCG – the LHC Computing Grid
– Distributes and processes the data generated by the LHC (Large Hadron Collider) at CERN
– ~200 sites and ~5,000 users worldwide
Map taken from: http://goc03.grid-support.ac.uk/googlemaps/lcg.html

5 Challenges for Catalog Services
Scalability
– Hundreds of Grid sites
– Thousands of users
Geographical distribution
– Network latency
Dependability
– In a large and heterogeneous system, failures will be common
A centralized system does not meet these requirements
– Distribution and replication are required

6 Off-the-shelf DB Replication?
Most DB systems have replication mechanisms
– Oracle Streams, Slony for PostgreSQL, MySQL replication
Example: the 3D Project at CERN (Distributed Deployment of Databases)
– Uses Oracle Streams for replication
– Being deployed only at a few LCG sites (~10 sites: Tier-0 and Tier-1s)
  • Requires Oracle ($$$) and expert on-site DBAs ($$$)
  • Most sites don't have these resources
Off-the-shelf replication is vendor-specific
– But Grids are heterogeneous by nature
– Sites have different DB systems available
It is therefore only a partial solution to the problem of metadata replication

7 Replication in the Catalog
The alternative we are exploring: replication in the Metadata Catalog itself
Advantages
– Database independent
– Metadata-aware replication
  • More efficient: replicates metadata commands
  • Better functionality: partial replication, federation
– Ease of deployment and administration
  • Built into the Metadata Catalog
  • No need for a dedicated DB administrator
The AMGA Metadata Catalog is the basis for our work on replication

8 The AMGA Metadata Catalog
The metadata catalog of the gLite middleware (EGEE)
Several groups of users among the EGEE community:
– High Energy Physics
– Biomed
Main features
– Dynamic schemas
– Hierarchical organization
– Security:
  • Authentication: user/password, X.509 certificates, GSI
  • Authorization: VOMS, ACLs

9 AMGA Implementation
C++ implementation
Back-ends
– Oracle, MySQL, PostgreSQL, SQLite
Front-end: TCP streaming
– Text-based protocol, like TELNET, SMTP, POP…
Examples:
– Adding data:
  addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'
– Retrieving data:
  selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
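To give a feel for what the TCP streaming front-end looks like from a client's point of view, here is a minimal sketch in Python that sends the two commands above over a raw socket. The host name and port are placeholders, and the sketch assumes a plain, already-authenticated, line-oriented session; the real AMGA clients also handle the authentication handshake (password, X.509 or GSI) and the exact response framing.

# Minimal client sketch for AMGA's text protocol (illustrative only).
# Endpoint and session handling are assumptions; real clients perform
# authentication before issuing commands.
import socket

def send_command(sock, command):
    """Send one text command and return the raw response lines."""
    sock.sendall((command + "\n").encode())
    return sock.recv(65536).decode().splitlines()

with socket.create_connection(("amga.example.org", 8822)) as sock:  # hypothetical endpoint
    # Add a file entry with application-specific attributes (as on the slide).
    send_command(sock, "addentry /DLAudio/song.mp3 "
                       "/DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'")
    # Query entries whose FILE attribute matches *.mp3.
    rows = send_command(sock, "selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album "
                              "'like(/DLAudio:FILE, \"%.mp3\")'")
    print(rows)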

10 Standalone Performance
A single server scales well up to 100 concurrent clients
– We could not go past 100; limited by the database
WAN access is one to two orders of magnitude slower than LAN access
Replication can address both bottlenecks

11 Metadata Replication with AMGA

12 Requirements of EGEE Communities
Motivation: the requirements of EGEE's user communities, mainly HEP and Biomed
High Energy Physics (HEP)
– Millions of files; 5,000+ users distributed across 200+ computing centres
– Mainly (read-only) file metadata
– Main concerns: scalability, performance and fault-tolerance
Biomed
– Manages medical images on the Grid
  • Data produced in a distributed fashion by laboratories and hospitals
  • Highly sensitive data: patient details
– Smaller scale than HEP
– Main concern: security

13 Metadata Replication
Some replication models:
– Full replication
– Partial replication
– Federation
– Proxy

14 Architecture
Main design decisions
– Asynchronous replication: to tolerate high latencies and for fault-tolerance
– Partial replication: replicate only what is of interest to the remote users
– Master-slave: writes are only allowed on the master
  • But mastership is granted per metadata collection, not per node
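To make these design decisions concrete, the following toy model shows a master that accepts writes only for the collections it masters, records each write in an update log, and later ships the log asynchronously to slaves that subscribe to a subset of collections. Class and method names are illustrative assumptions, not AMGA's actual internals.

# Toy model of the design above: asynchronous, log-based master-slave
# replication with partial replication and per-collection mastership.
from collections import deque

class Master:
    def __init__(self, mastered_collections):
        self.mastered = set(mastered_collections)   # collections this node may write
        self.store = {}                              # collection -> {entry: attributes}
        self.log = deque()                           # pending updates to ship
        self.slaves = []

    def write(self, collection, entry, attributes):
        if collection not in self.mastered:
            raise PermissionError(f"not master for {collection}")
        self.store.setdefault(collection, {})[entry] = attributes
        self.log.append((collection, entry, attributes))   # logged now, shipped later

    def ship_updates(self):
        """Runs asynchronously: push pending updates to interested slaves."""
        while self.log:
            collection, entry, attributes = self.log.popleft()
            for slave in self.slaves:
                if collection in slave.subscriptions:        # partial replication
                    slave.apply(collection, entry, attributes)

class Slave:
    def __init__(self, subscriptions):
        self.subscriptions = set(subscriptions)
        self.store = {}

    def apply(self, collection, entry, attributes):
        self.store.setdefault(collection, {})[entry] = attributes

master = Master(mastered_collections={"/DLAudio"})
slave = Slave(subscriptions={"/DLAudio"})
master.slaves.append(slave)
master.write("/DLAudio", "song.mp3", {"Author": "John Smith"})
master.ship_updates()   # replication happens after the write has returned

The point of the sketch is the decoupling: the write returns as soon as the update is logged, and shipping the log is a separate step that can lag behind without blocking clients.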

15 Status
Initial implementation completed
– Available functionality:
  • Full and partial replication
  • Chained replication (master → slave1 → slave2)
  • Federation: basic support (data is always copied to the slave)
  • Cross-DB replication: PostgreSQL → MySQL tested; other combinations should work (give or take some debugging)
Available as part of AMGA

16 Benchmark Results

17 Benchmark Study
We investigate the following:
1) The overhead of replication and the scalability of the master
2) The behaviour of the system under faults

18 Scalability
Small increase in CPU usage as the number of slaves increases
– With 10 slaves, a 20% increase over standalone operation
The number of update logs sent scales almost linearly
Setup:
– Insertion rate at the master: 90 entries/s; 10,000 entries in total
– 0 slaves means replication updates are saved but not shipped (slaves disconnected)

19 Fault Tolerance
The next test illustrates the fault-tolerance mechanisms
Slave fails
– The master keeps the updates for the slave
– The replication log grows
Slave reconnects
– The master sends the pending updates
– Eventually the system recovers to a steady state with the slave up to date
Test conditions:
– Insertion rate at the master: 50 entries/s; 20,000 entries in total
– Two slaves, both start connected
– Slave1 disconnects temporarily
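A sketch of this buffering-and-catch-up behaviour, again in Python: while a slave is down the master queues its updates in a bounded per-slave log, ships the backlog when the slave reconnects, and unsubscribes the slave if the log overflows (the size limit mentioned on the next slide). The names and the log bound are chosen purely for illustration.

# Sketch of the fault-tolerance behaviour: buffer updates for a disconnected
# slave in a bounded log, replay them on reconnection, unsubscribe on overflow.
from collections import deque

MAX_LOG = 10_000   # assumed bound on the per-slave replication log

class SlaveChannel:
    def __init__(self, name):
        self.name = name
        self.connected = True
        self.subscribed = True
        self.pending = deque()   # replication log kept by the master for this slave

    def enqueue(self, update):
        if not self.subscribed:
            return                       # slave was dropped; it needs a full resync
        if len(self.pending) >= MAX_LOG:
            self.subscribed = False      # log overflow: unsubscribe the slave
            self.pending.clear()
            return
        self.pending.append(update)      # buffered even while the slave is down

    def flush(self, apply):
        """Called when the slave (re)connects: ship the backlog in order."""
        while self.connected and self.pending:
            apply(self.pending.popleft())

slave1 = SlaveChannel("slave1")
slave1.connected = False                             # slave1 fails
for i in range(500):
    slave1.enqueue(("addentry", f"/coll/entry{i}"))  # master keeps the updates
slave1.connected = True                              # slave1 reconnects
slave1.flush(lambda update: None)                    # backlog drains; slave catches up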

20 Fault Tolerance and Recovery
While slave1 is disconnected, the replication log grows in size
– The log is limited in size; the slave is unsubscribed if it does not reconnect in time
After the slave reconnects, the system recovers in around 60 seconds

21 Future Work / Open Challenges

22 Scalability
Support hundreds of replicas
– HEP use case; extreme case: one replica catalog per site
Challenges
– Scalability
– Fault-tolerance: tolerating failures of slaves and of the master
The current method of shipping updates (direct streaming) might not scale
– Chained replication (divide and conquer)
  • Already possible with AMGA; its performance still needs to be studied
– Group communication
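As a back-of-the-envelope illustration of why chaining (divide and conquer) helps, the snippet below compares the master's fan-out under direct streaming with the per-node fan-out of a replication tree. The figures are made up for illustration and are not measurements from this work.

# Illustrative fan-out comparison: direct streaming vs. a chained/tree topology.
import math

replicas = 200            # extreme HEP case: one replica catalog per site
direct_fanout = replicas  # direct streaming: the master feeds every slave itself
tree_fanout = 2           # chained replication: each node forwards to a few children
depth = math.ceil(math.log(replicas, tree_fanout))

print(f"direct streaming: master fan-out = {direct_fanout}")
print(f"chained (fan-out {tree_fanout}): ~{depth} hops of added propagation delay")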

23 Federation
Federation of independent catalogs
– Biomed use case
Challenges
– Providing a consistent view over the federated catalogs
– Shared namespace
– Security: trust management, access control and user management
Ideas

24 Conclusion
Replication of metadata catalogs is necessary for Data Grids
We are exploring replication at the catalog level, using AMGA
Initial implementation completed
– First results are promising
Currently working on improving scalability and on federation
More information about our current work: http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/

