Laboratoire LIP6 The Gedeon Project: Data, Metadata and Databases Yves DENNEULIN LIG laboratory, Grenoble ACI MD.



Context and goals
● Heterogeneous metadata management on grids
    ○ Clusters of clusters
● High-level queries using metadata
● Easy and flexible deployment and configuration
● Minimal overhead
● Various interfaces
● Initial target application domains
    ○ Biocomputing (lots of metadata, little data)
    ○ Microscopic imaging (lots of data, little metadata)

The Gedeon middleware
● Metadata management on lightweight grids
    ○ Records of (attribute, value) pairs stored in files
● Flexible requests
    ○ Can be combined through scripting
● Various interfaces
    ○ Command line (tools)
    ○ Libraries
    ○ Virtual FS (legacy application support)
● Deployment "à la carte"
    ○ Composition of various data sources
● Performance
    ○ Dedicated I/O library
    ○ Semantic caching

Outline
1. General architecture
    a. Gedeon internal structure
    b. Composition of various data sources
2. Practical use
3. "Dual" cache
Conclusion

Example of a deployment
[Figure: deployment diagram. Labels: client; query interface (API, FS, GUI, ...); local proxy; interconnect middleware; servers "close" to the client, with cache; storage sites.]

Gedeon components
● Gedeon kernel
    ○ fuple
        ▪ I/O library
        ▪ Evaluates the queries
    ○ lowerG
        ▪ Operators to compose bases
        ▪ Remote access
● Interface
    ○ API lowerG
    ○ Virtual FS
● Cache
[Figure: component stack. Labels: application, vSGF, lowerG, fuple, network, cache, local proxy.]

What is inside the sources?
● Records of attribute/value pairs
[Figure: an example record with attributes Id = 457, classifA = Bacteria, classifB = Clostridia, taille (size) = 26, and ref.]

Example of composition of sources
● Metadata can be local or copies
[Figure: a client querying sites S1, S2 and S3 through composition operators: union (+), join (J) and round robin (RR).]

Union (+)
[Figure: a source holding records A1, A2, A3, A4 + a source holding records B1, B2, B3, B4 gives one source holding records A1, A2, A3, A4, B1, B2, B3, B4.]
● Unified storage space
● Parallel evaluation
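The union operator can be illustrated with a minimal Python sketch. This is not Gedeon's implementation: the names `select` and `union_select` are hypothetical, sources are modelled as in-memory lists of dictionaries, and a thread pool stands in for the parallel evaluation on remote sites.

```python
from concurrent.futures import ThreadPoolExecutor

def select(source, predicate):
    """Evaluate a query on one source: keep the records that satisfy predicate."""
    return [record for record in source if predicate(record)]

def union_select(sources, predicate):
    """Evaluate the same query on every source in parallel, then concatenate."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # pool.map returns results in the same order as the sources.
        parts = list(pool.map(lambda source: select(source, predicate), sources))
    return [record for part in parts for record in part]

# Two hypothetical sources; the client sees them as a single base.
source_a = [{"Id": 1, "taille": 10}, {"Id": 2, "taille": 40}]
source_b = [{"Id": 3, "taille": 50}, {"Id": 4, "taille": 5}]
large = union_select([source_a, source_b], lambda r: r["taille"] >= 36)
```

Threads are used here because real sources would be remote and the work dominated by I/O; the sketch makes no claim about how Gedeon schedules the evaluation.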

Round Robin (RR): fault tolerance
[Figure: a client reaches Source 1 and Source 2 through an RR operator.]
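A minimal sketch of the fault-tolerance use of replicated sources, assuming each source is a callable that either answers a query or raises an error. `SourceUnavailable` and `query_with_failover` are hypothetical names, not part of Gedeon.

```python
class SourceUnavailable(Exception):
    """Raised by a source that cannot answer (hypothetical error type)."""

def query_with_failover(sources, query):
    """Try each replica in turn and return the first successful answer."""
    last_error = None
    for source in sources:
        try:
            return source(query)
        except SourceUnavailable as error:
            last_error = error  # remember the failure and try the next replica
    raise SourceUnavailable("no source could answer") from last_error

def broken_source(query):
    raise SourceUnavailable("source 1 is down")

def healthy_source(query):
    return [{"Id": 457}]

answer = query_with_failover([broken_source, healthy_source], "$Id==457")
```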

Round Robin (RR): load balancing
[Figure: two clients reach Source 1 and Source 2 through an RR operator.]
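The load-balancing use of the same operator can be sketched as follows, again assuming callable sources. The class name `RoundRobin` and its `query` method are illustrative only.

```python
import itertools

class RoundRobin:
    """Dispatch successive queries to the sources in rotation."""

    def __init__(self, sources):
        self._next_source = itertools.cycle(sources)

    def query(self, query):
        source = next(self._next_source)
        return source(query)

calls = []  # records which source served each query

def make_source(name):
    def source(query):
        calls.append(name)
        return []
    return source

rr = RoundRobin([make_source("source1"), make_source("source2")])
for _ in range(4):
    rr.query("$Id>459")
```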

Join operator (J)

First source:
Id    A1   A2   A3
457   v1   v2   v3
458   v4   v5   v6

Second source:
Id    An
457   vAn1
458   vAn2

Result:
Id    A1   A2   A3   An
457   v1   v2   v3   vAn1
458   v4   v5   v6   vAn2

● Enrich a source with another
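As a minimal sketch of the join, assuming records are dictionaries matched on a shared `Id` attribute (the function name `join` and the conflict rule are assumptions made for this example):

```python
def join(left, right, key="Id"):
    """Enrich each record of left with the attributes of the matching right record."""
    by_key = {record[key]: record for record in right if key in record}
    joined = []
    for record in left:
        extra = by_key.get(record.get(key), {})
        # Assumption: on a name clash, the attribute of the left source wins.
        joined.append({**extra, **record})
    return joined

first = [
    {"Id": 457, "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": 458, "A1": "v4", "A2": "v5", "A3": "v6"},
]
second = [{"Id": 457, "An": "vAn1"}, {"Id": 458, "An": "vAn2"}]
enriched = join(first, second)
```

Records of the first source with no match in the second are kept unchanged, which fits the "enrich" reading of the operator.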

Outline
1. General architecture
    a. Gedeon internal structure
    b. Composition of various data sources
2. Practical use
3. "Dual" cache
Conclusion

Tools 1/2
● Libraries
● CLI
● Operations
    ○ sort
    ○ projection
    ○ select
    ○ index
    ○ ...

Tools 2/2
● Examples
    ○ sort
        sort(attr='taille')
        $> cat mesmeta.g | fsort 'taille' > trie_taille.g
    ○ index
        create_idx(attr='Id') .Id.idx
        search_idx('Id', 'P0123')
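The sort and index operations can be sketched in Python under simple assumptions: records are dictionaries held in memory, and the index maps an attribute value to record positions. The function names echo the slide, but the signatures below are hypothetical and are not the Gedeon library API.

```python
def sort_records(records, attr):
    """Sort by attr; records that lack the attribute are placed last."""
    return sorted(records, key=lambda r: (attr not in r, r.get(attr)))

def create_idx(records, attr):
    """Build an index: attribute value -> positions of the records holding it."""
    idx = {}
    for position, record in enumerate(records):
        if attr in record:
            idx.setdefault(record[attr], []).append(position)
    return idx

def search_idx(records, idx, value):
    """Look a value up in the index instead of scanning every record."""
    return [records[position] for position in idx.get(value, [])]

records = [
    {"Id": "P0456", "taille": 47},
    {"Id": "P0789"},
    {"Id": "P0123", "taille": 26},
]
by_size = sort_records(records, "taille")
id_index = create_idx(records, "Id")
hits = search_idx(records, id_index, "P0123")
```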

Language for the requests
● Simple ($, type control with the operators)
● Regular expressions
● Second-order (conditions on attribute names)

Select expression

Input records:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
    Id=460, classifA=Fermicutes

Query: Select $Id>459

Result:
    Id=460, classifA=Fermicutes
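A minimal sketch of such a selection, assuming records are dictionaries and that a record lacking the tested attribute simply does not match. The function `select` and the `OPERATORS` table are illustrative names, not Gedeon's.

```python
import operator

OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq,
}

def select(records, attr, op, value):
    """Keep the records whose attribute attr compares to value with op."""
    compare = OPERATORS[op]
    return [r for r in records if attr in r and compare(r[attr], value)]

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
result = select(records, "Id", ">", 459)  # mirrors: Select $Id>459
```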

Select using regexp

Input records:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
    Id=460, classifA=Fermicutes

Query: Select $classifB==/.*a$/

Result:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
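The regular-expression form can be sketched with Python's `re` module. The helper `select_regexp` is hypothetical, and the choice of `re.search` semantics is an assumption about how the pattern is applied.

```python
import re

def select_regexp(records, attr, pattern):
    """Keep the records whose value for attr matches the regular expression."""
    regexp = re.compile(pattern)
    return [r for r in records if attr in r and regexp.search(str(r[attr]))]

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
ends_with_a = select_regexp(records, "classifB", r".*a$")
```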

Select using 2nd order logic

Input records:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
    Id=460, classifA=Fermicutes

Query: Select $/classif[AB]/==Bacteria && $taille>=36

Result:
    Id=459, classifB=Bacteria, taille=47
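In the second-order form the pattern ranges over attribute names rather than values. A minimal sketch, assuming the name pattern must match the whole attribute name and that a record matches when any such attribute holds the value (`select_any_attr` is a hypothetical helper):

```python
import re

def select_any_attr(records, attr_pattern, value):
    """Keep records in which some attribute whose name matches attr_pattern equals value."""
    name_regexp = re.compile(attr_pattern)
    return [
        r for r in records
        if any(name_regexp.fullmatch(name) and val == value for name, val in r.items())
    ]

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]
# Mirrors: Select $/classif[AB]/==Bacteria && $taille>=36
bacteria = select_any_attr(records, r"classif[AB]", "Bacteria")
result = [r for r in bacteria if r.get("taille", 0) >= 36]
```

Record 457 satisfies the first condition through `classifA` but is rejected by the size test, so only record 459 remains.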

Virtual FS interface
● Just a specific file-oriented interface
● Data and metadata can be anywhere in the grid
● Definition of logical directories
    ○ Ex: cd '$classifB==|.*a$|'
    ○ "and" between directories
    ○ 1 filename = value of a metadata attribute: logical view

/fs_virt/$classifB==|.*a$|> ls
/fs_virt/$classifB==|.*a$|> cat * > /tmp/mater
/fs_virt/$classifB==|.*a$|>
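The logical-directory idea can be sketched without any file-system machinery: a path is a list of conditions combined with "and", and listing it returns one name per matching record. Everything below (`list_logical_dir`, the use of `Id` as the file name, predicates as Python callables) is an assumption for illustration; it does not show how the Gedeon virtual FS is implemented.

```python
import re

def list_logical_dir(records, predicates, name_attr="Id"):
    """List a logical directory: records satisfying every predicate of the path."""
    matching = [r for r in records if all(p(r) for p in predicates)]
    # One file name per record, taken from the value of a metadata attribute.
    return sorted(str(r[name_attr]) for r in matching if name_attr in r)

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]

def classif_b_ends_with_a(record):
    return re.search(r".*a$", str(record.get("classifB", ""))) is not None

def is_large(record):
    return record.get("taille", 0) >= 36

top_level = list_logical_dir(records, [classif_b_ends_with_a])
nested = list_logical_dir(records, [classif_b_ends_with_a, is_large])
```

Entering a subdirectory corresponds to appending one more predicate to the list, which narrows the listing.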

Outline
1. General architecture
    a. Gedeon internal structure
    b. Composition of various data sources
2. Practical use
3. "Dual" cache
Conclusion

Dual cache (1)
● 2 cooperative caches
    ○ Cache of requests (R, {id, ...}) -> saves computing power
    ○ Cache of data (id, {attr, ...}) -> saves bandwidth
● Semantic cache
    ○ Can evaluate a query using the data in the cache
    ○ Can generate a remainder to complement the cached data
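A minimal sketch of two cooperating caches, under strong simplifying assumptions: the query cache only recognises a query it has already seen verbatim, and the "remainder" is reduced to the list of identifiers missing from the data cache. The class `DualCache` and the two callbacks are hypothetical and do not describe Gedeon's cache.

```python
class DualCache:
    """Query cache (query -> ids) cooperating with an object cache (id -> record)."""

    def __init__(self, evaluate_on_server, fetch_records):
        self._evaluate_on_server = evaluate_on_server  # query -> list of ids
        self._fetch_records = fetch_records            # list of ids -> {id: record}
        self.query_cache = {}
        self.object_cache = {}

    def lookup(self, query):
        if query not in self.query_cache:
            # Miss in the query cache: the server has to evaluate the query.
            self.query_cache[query] = self._evaluate_on_server(query)
        ids = self.query_cache[query]
        # Only the records absent from the object cache are transferred.
        remainder = [i for i in ids if i not in self.object_cache]
        if remainder:
            self.object_cache.update(self._fetch_records(remainder))
        return [self.object_cache[i] for i in ids]

database = {
    1: {"Id": 1, "OC": "Eukaryota"},
    2: {"Id": 2, "OC": "Archaea"},
    3: {"Id": 3, "OC": "Eukaryota"},
}
evaluations, transfers = [], []

def evaluate(query):
    evaluations.append(query)
    return [i for i, record in database.items() if record["OC"] == query]

def fetch(ids):
    transfers.append(list(ids))
    return {i: database[i] for i in ids}

cache = DualCache(evaluate, fetch)
first = cache.lookup("Eukaryota")
second = cache.lookup("Eukaryota")  # served entirely from the two caches
```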

Example
● Refinement of a request
    1) '$OC==/Eukaryota/' -> (R, Lid={id1, id2, ...})
    2) '$OC==/Eukaryota/ && $year>=1998'
       Select(*Lid, '$year>=1998')
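The refinement step can be sketched as follows: the second request is answered by filtering the identifier list cached for the first one, without re-evaluating the broader condition. The function `refine` and the sample data are hypothetical.

```python
def refine(query_cache, object_cache, cached_query, extra_predicate):
    """Answer a refined query from the id list cached for a broader query."""
    cached_ids = query_cache[cached_query]
    return [i for i in cached_ids if extra_predicate(object_cache[i])]

# State left by request 1: '$OC==/Eukaryota/' -> (R, Lid)
query_cache = {"$OC==/Eukaryota/": ["id1", "id2", "id3"]}
object_cache = {
    "id1": {"OC": "Eukaryota", "year": 1995},
    "id2": {"OC": "Eukaryota", "year": 1998},
    "id3": {"OC": "Eukaryota", "year": 2004},
}

# Request 2 only needs the extra condition, applied to the cached ids.
recent = refine(query_cache, object_cache, "$OC==/Eukaryota/",
                lambda record: record["year"] >= 1998)
query_cache["$OC==/Eukaryota/ && $year>=1998"] = recent
```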

Dual cache (2)
● Distributed semantic cache
    ○ Typically used inside communities
        ▪ Lots of common requests
    ○ No location constraints
        ▪ Members of the community can be geographically scattered
● Distributed data cache
    ○ Minimizes time and data transfer
    ○ Cooperation between sites that are topologically close

Dual cache (3)
[Figure: dual cache deployment. Labels: Grenoble, Rennes, servers, dual cache (query cache, object cache), semantic locality (community Eukaryota, community Archaea), geographic locality.]

Dual cache (4)
● Work in progress on the notion of distance
    ○ Find geographical proximity
    ○ Find common interests between communities
        ▪ Create hybrid communities based on their requests
● Could be used to change the cache parameters
    ○ Manual and/or automatic

Conclusion
● A data integration middleware
    ○ Handling of metadata
● Distributed and modular
    ○ Deployment can be done according to architectural/organisational constraints
● Definition of a dual cache infrastructure
    ○ Reflect both organisational use
● Prototype in use
    ○ Packaging and documentation needed

Questions?