INFSO-RI-508833 Enabling Grids for E-sciencE
Distributed Metadata with the AMGA Metadata Catalog
Nuno Santos, Birger Koblitz
20 June 2006
Workshop on Next-Generation Distributed Data Management

Abstract
– Metadata Catalogs on Data Grids – the case for replication
– The AMGA Metadata Catalog
– Metadata Replication with AMGA
– Benchmark Results
– Future Work/Open Challenges

Metadata Catalogs
Metadata on the Grid:
– File Metadata – describes files with application-specific information. Purpose: file discovery based on their contents.
– Simplified Database Service – stores generic structured data on the Grid. Not as powerful as a full database, but easier to use and with better Grid integration (security, hiding DB heterogeneity).
Metadata services are essential for many Grid applications and must be accessible Grid-wide. But Data Grids can be large…

An Example – The LCG Sites
LCG – LHC Computing Grid:
– Distributes and processes the data generated by the LHC (Large Hadron Collider) at CERN
– ~200 sites and ~5,000 users worldwide

Challenges for Catalog Services
Scalability:
– Hundreds of grid sites
– Thousands of users
Geographical distribution:
– Network latency
Dependability:
– In a large and heterogeneous system, failures will be common
A centralized system does not meet these requirements – distribution and replication are required.

Off-the-shelf DB Replication?
Most DB systems have their own replication mechanisms – Oracle Streams, Slony for PostgreSQL, MySQL replication.
Example: the 3D Project at CERN (Distributed Deployment of Databases):
– Uses Oracle Streams for replication
– Being deployed only at a few LCG sites (~10 sites, Tier-0 and Tier-1s)
– Requires Oracle ($$$) and expert on-site DBAs ($$$) – most sites don't have these resources
Off-the-shelf replication is vendor-specific, but Grids are heterogeneous by nature and sites have different DB systems available. It is therefore only a partial solution to the problem of metadata replication.

Replication in the Catalog
The alternative we are exploring: replication in the Metadata Catalog itself.
Advantages:
– Database independent
– Metadata-aware replication: more efficient (replicates metadata commands) and more functional (partial replication, federation)
– Ease of deployment and administration: built into the Metadata Catalog, no need for a dedicated DB administrator
The AMGA Metadata Catalog is the basis for our work on replication.

The AMGA Metadata Catalog
Metadata Catalog of the gLite middleware (EGEE).
Several groups of users among the EGEE community:
– High Energy Physics
– Biomed
Main features:
– Dynamic schemas
– Hierarchical organization
– Security:
  – Authentication: user/pass, X509 certificates, GSI
  – Authorization: VOMS, ACLs
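A hypothetical session sketching what dynamic schemas and the hierarchical organization look like at the command level, in the style of the protocol examples on the next slide. The /DLAudio collection and its attributes are invented for illustration, and the createdir and addattr commands are assumed from the AMGA documentation:

    createdir /DLAudio
    addattr /DLAudio Author varchar(64)
    addattr /DLAudio Album varchar(64)

Attributes can be added to (or dropped from) a collection at any time, which is what makes the schema dynamic.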

AMGA Implementation
C++ implementation.
Back-ends: Oracle, MySQL, PostgreSQL, SQLite.
Front-end: TCP streaming – a text-based protocol like TELNET, SMTP, POP…
Examples:
Adding data:
  addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'
Retrieving data:
  selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
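Because the front-end is a plain text protocol, a client needs little more than a TCP socket. Below is a minimal C++ sketch assuming a simple line-based request/response exchange; the host name is invented, 8822 is assumed as the default port, and the real protocol's greeting and authentication steps are omitted:

    #include <netdb.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <iostream>
    #include <string>

    int main() {
        // Hypothetical endpoint; a real deployment would use the
        // catalog's configured host and port.
        const char* host = "amga.example.org";
        const char* port = "8822";

        // Resolve the server address and open a TCP connection.
        addrinfo hints{}, *res = nullptr;
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(host, port, &hints, &res) != 0) {
            std::cerr << "cannot resolve host\n";
            return 1;
        }
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
            std::cerr << "cannot connect\n";
            return 1;
        }
        freeaddrinfo(res);

        // Send one command, newline-terminated, as in the slide's examples.
        std::string cmd =
            "addentry /DLAudio/song.mp3 "
            "/DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'\n";
        send(fd, cmd.data(), cmd.size(), 0);

        // Read and print whatever the server answers.
        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            std::cout << buf;
        }
        close(fd);
        return 0;
    }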

Standalone Performance
– A single server scales well, up to 100 concurrent clients. We could not go past 100 clients: the database is the limiting factor.
– WAN access is one to two orders of magnitude slower than LAN access.
– Replication can solve both bottlenecks.

Metadata Replication with AMGA

Requirements of EGEE Communities
Motivation: the requirements of EGEE's user communities, mainly HEP and Biomed.
High Energy Physics (HEP):
– Millions of files; users distributed across 200+ computing centres
– Mainly (read-only) file metadata
– Main concerns: scalability, performance and fault tolerance
Biomed:
– Manages medical images on the Grid; data produced in a distributed fashion by laboratories and hospitals
– Highly sensitive data: patient details
– Smaller scale than HEP
– Main concern: security

Metadata Replication
Some replication models:
– Full replication
– Partial replication
– Federation
– Proxy

Architecture
Main design decisions (see the sketch below):
– Asynchronous replication – to tolerate high latencies and provide fault tolerance
– Partial replication – replicate only what is of interest to the remote users
– Master-slave – writes are only allowed on the master, but mastership is granted per metadata collection, not per node
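A toy model of the last two decisions, per-collection mastership and partial shipping of updates. This is an illustrative sketch under our own naming (CatalogNode, the node and collection names), not AMGA's actual code:

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // One logged metadata command, tagged with its collection so that
    // slaves subscribed to only part of the namespace can be served.
    struct Update { std::string collection, command; };

    class CatalogNode {
    public:
        explicit CatalogNode(std::string name) : name_(std::move(name)) {}

        // Record which node holds mastership of a collection.
        void setMaster(const std::string& collection, const std::string& node) {
            master_[collection] = node;
        }

        // Accept a write only if this node is the collection's master;
        // otherwise the client must contact the master node.
        bool write(const std::string& collection, const std::string& command) {
            auto it = master_.find(collection);
            if (it == master_.end() || it->second != name_) {
                std::cout << "redirect: master of " << collection << " is "
                          << (it == master_.end() ? std::string("<unknown>")
                                                  : it->second) << "\n";
                return false;
            }
            log_.push_back({collection, command});  // shipped asynchronously
            std::cout << "applied on " << name_ << ": " << command << "\n";
            return true;
        }

        // Updates a slave subscribed to `collection` would receive
        // (partial replication: only the collections it asked for).
        std::vector<Update> updatesFor(const std::string& collection) const {
            std::vector<Update> out;
            for (const auto& u : log_)
                if (u.collection == collection) out.push_back(u);
            return out;
        }

    private:
        std::string name_;
        std::map<std::string, std::string> master_;
        std::vector<Update> log_;
    };

    int main() {
        CatalogNode cern("cern");
        cern.setMaster("/DLAudio", "cern");   // mastered here
        cern.setMaster("/Biomed", "lyon");    // mastered elsewhere

        cern.write("/DLAudio", "addentry /DLAudio/song.mp3 ...");  // accepted
        cern.write("/Biomed", "addentry /Biomed/scan42 ...");      // redirected
        std::cout << cern.updatesFor("/DLAudio").size()
                  << " update(s) queued for /DLAudio subscribers\n";
        return 0;
    }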

Status
Initial implementation completed. Available functionality:
– Full and partial replication
– Chained replication (master → slave1 → slave2)
– Federation – basic support; data is always copied to the slave
– Cross-DB replication: PostgreSQL → MySQL tested; other combinations should work (give or take some debugging)
Available as part of AMGA.

Benchmark Results

Benchmark Study
We investigate the following:
1) Overhead of replication and scalability of the master
2) Behaviour of the system under faults

Scalability
Setup:
– Insertion rate at the master: 90 entries/s; 10,000 entries in total
– "0 slaves" means replication updates are saved but not shipped (slaves disconnected)
Results:
– Small increase in CPU usage as the number of slaves grows: with 10 slaves, a 20% increase over standalone operation
– The number of update logs sent scales almost linearly

Fault Tolerance
The next test illustrates the fault-tolerance mechanisms.
When a slave fails:
– The master keeps the updates destined for the slave
– The replication log grows
When the slave reconnects:
– The master sends the pending updates
– Eventually the system recovers to a steady state, with the slave up to date
Test conditions:
– Insertion rate at the master: 50 entries/s
– Total: entries
– Two slaves, both start connected
– Slave1 disconnects temporarily

Fault Tolerance and Recovery
– While slave1 is disconnected, the replication log grows. The log is limited in size: a slave is unsubscribed if it does not reconnect in time.
– After the slave reconnects, the system recovers in around 60 seconds.
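The bounded-log behaviour described above can be summarized in a few lines. A toy sketch, with the class name, the limit of 1000 pending updates and the slave names all invented for illustration:

    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <map>
    #include <string>

    // The master buffers updates for each subscribed slave in a bounded
    // log; a slave whose backlog overflows is unsubscribed and would
    // need a full resynchronization on its return.
    class ReplicationLog {
    public:
        explicit ReplicationLog(std::size_t maxPending)
            : maxPending_(maxPending) {}

        void subscribe(const std::string& slave) { pending_[slave]; }

        // Queue an update for every subscribed slave, dropping slaves
        // whose backlog exceeds the limit.
        void append(const std::string& update) {
            for (auto it = pending_.begin(); it != pending_.end();) {
                it->second.push_back(update);
                if (it->second.size() > maxPending_) {
                    std::cout << "unsubscribing " << it->first
                              << " (log limit exceeded)\n";
                    it = pending_.erase(it);
                } else {
                    ++it;
                }
            }
        }

        // On (re)connection the slave drains its pending updates and is
        // up to date again; returns 0 if it was already unsubscribed.
        std::size_t drain(const std::string& slave) {
            auto it = pending_.find(slave);
            if (it == pending_.end()) return 0;
            std::size_t n = it->second.size();
            it->second.clear();
            return n;
        }

    private:
        std::size_t maxPending_;
        std::map<std::string, std::deque<std::string>> pending_;
    };

    int main() {
        ReplicationLog log(1000);  // hypothetical size limit
        log.subscribe("slave1");
        log.subscribe("slave2");
        for (int i = 0; i < 500; ++i) log.append("update");
        // slave2 stayed connected and drains immediately; slave1 was down.
        std::cout << "slave2 applied " << log.drain("slave2") << " updates\n";
        std::cout << "slave1 reconnects, receives " << log.drain("slave1")
                  << " pending updates\n";
        return 0;
    }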

Future Work/Open Challenges

Scalability
Support hundreds of replicas:
– HEP use case; in the extreme case, one replica catalog per site
Challenges:
– Scalability
– Fault tolerance – tolerating failures of slaves and of the master
The current method of shipping updates (direct streaming) might not scale. Possible approaches:
– Chained replication (divide and conquer) – already possible with AMGA; its performance needs to be studied
– Group communication

Federation
Federation of independent catalogs:
– Biomed use case
Challenges:
– Providing a consistent view over the federated catalogs
– Shared namespace
– Security: trust management, access control and user management

Conclusion
– Replication of Metadata Catalogs is necessary for Data Grids
– We are exploring replication in the Catalog using AMGA
– Initial implementation completed; first results are promising
– Currently working on improving scalability and on federation
More information about our current work at: