Laboratoire LIP6 The Gedeon Project: Data, Metadata and Databases Yves DENNEULIN LIG laboratory, Grenoble ACI MD
Context and goals
● Heterogeneous metadata management on grids
    Clusters of clusters
● High-level queries using metadata
● Easy and flexible deployment and configuration
● Minimal overhead
● Various interfaces
● Initial target application domains
    Biocomputing (lots of metadata, little data)
    Microscopic imaging (lots of data, little metadata)
The Gedeon middleware
Metadata management on lightweight grids
    ● Records of (attribute, value) pairs stored in files
Flexible requests
    ● Can be combined through scripting
Various interfaces
    ● Command line (tools)
    ● Libraries
    ● Virtual FS (legacy application support)
Deployment « à la carte »
    ● Composition of various data sources
Performance
    ● Dedicated I/O library
    ● Semantic caching
Outline
1. General architecture
    a. Gedeon internal structure
    b. Composition of various data sources
2. Practical use
3. « Dual » cache
Conclusion
Example of a deployment
[Figure labels: Client; Query interface (API, FS, GUI, ...); Local proxy; Interconnect middleware; Servers « close » to the client (cache); Storage sites]
Gedeon components
● Gedeon kernel
    fuple
        ● I/O library
        ● Evaluates the queries
    lowerG
        ● Operators to compose bases
        ● Remote access
● Interface
    API lowerG
    Virtual FS
● Cache
[Figure labels: application, vSGF, lowerG, fuple, network, cache, local proxy]
What is inside the sources?
● Records of attribute/value pairs
Example record:
    Id        457
    classifA  Bacteria
    classifB  Clostridia
    taille    26
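A record of attribute/value pairs maps naturally onto a dictionary. The sketch below is only an illustration in Python and does not describe Gedeon's file format; the helper name `attributes` is hypothetical. It shows that two records in the same source need not carry the same attributes.

```python
# Illustrative only: Gedeon-style records modelled as Python dicts.
record = {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26}
other = {"Id": 460, "classifA": "Fermicutes"}  # fewer attributes, still a valid record


def attributes(rec):
    """Return the attribute names carried by one record, sorted."""
    return sorted(rec)


print(attributes(record))
print(attributes(other))
```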
Example of composition of sources
● Metadata can be local or copies
[Figure: a client queries sites S1, S2 and S3, composed with the union (+), join (J) and round-robin (RR) operators]
Union (+)
Source A: record A1, record A2, record A3, record A4, ...
Source B: record B1, record B2, record B3, record B4, ...
A + B: record A1, record A2, record A3, record A4, record B1, record B2, record B3, record B4
● Unifies the storage space
● Parallel evaluation
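The union operator can be sketched as follows, assuming each source is simply a list of records held in memory. The function name `union_select` and the use of a thread pool are assumptions made for illustration; the slides do not say how Gedeon parallelises the evaluation.

```python
from concurrent.futures import ThreadPoolExecutor


def union_select(sources, predicate):
    """Evaluate predicate on every source in parallel, then concatenate the results."""
    def evaluate(source):
        return [rec for rec in source if predicate(rec)]

    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(evaluate, sources))  # map() keeps the order of the sources
    return [rec for part in parts for rec in part]


source_a = [{"Id": "A1"}, {"Id": "A2"}, {"Id": "A3"}, {"Id": "A4"}]
source_b = [{"Id": "B1"}, {"Id": "B2"}, {"Id": "B3"}, {"Id": "B4"}]

# Selecting everything shows the unified storage space.
everything = union_select([source_a, source_b], lambda rec: True)
print([rec["Id"] for rec in everything])
```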
Round Robin (RR): fault tolerance
[Figure: one client reaches Source 1 and Source 2 through an RR operator]
Round Robin (RR): load balancing
[Figure: two clients reach Source 1 and Source 2 through an RR operator]
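Both uses of the round-robin operator can be illustrated with one small sketch. It assumes that a source is a callable taking a request and that an unreachable source raises `ConnectionError`; the class `RoundRobin` and the three toy sources are hypothetical and are not part of Gedeon.

```python
import itertools


class RoundRobin:
    """Hypothetical RR operator: rotate over the sources, skip the ones that fail."""

    def __init__(self, sources):
        self._sources = list(sources)
        self._turn = itertools.cycle(range(len(self._sources)))

    def query(self, request):
        last_error = None
        # Try each source at most once per request, starting at the current turn.
        for _ in range(len(self._sources)):
            source = self._sources[next(self._turn)]
            try:
                return source(request)
            except ConnectionError as err:
                last_error = err
        raise ConnectionError("no source answered") from last_error


def source_1(request):
    return "S1:" + request


def source_2(request):
    return "S2:" + request


def broken(request):
    raise ConnectionError("site unreachable")


# Load balancing: successive requests alternate between the two sources.
balanced = RoundRobin([source_1, source_2])
answers = [balanced.query("q") for _ in range(3)]

# Fault tolerance: the failing source is skipped and the other one answers.
tolerant = RoundRobin([broken, source_2])
recovered = tolerant.query("q")
```

In this sketch a failed source is retried on later requests, which is one possible policy among others.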
Join operator (J)
Source 1:
    Id 457: A1=v1, A2=v2, A3=v3
    Id 458: A1=v4, A2=v5, A3=v6
    ...
Source 2:
    Id 457: An=vAn1
    Id 458: An=vAn2
    ...
Result of J:
    Id 457: A1=v1, A2=v2, A3=v3, An=vAn1
    Id 458: A1=v4, A2=v5, A3=v6, An=vAn2
● Enriches a source with another
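A minimal sketch of the join, assuming records are dictionaries and that `Id` is the join key. The function name `join` and the choice to keep unmatched records unchanged are assumptions; the slide only states that one source enriches the other.

```python
def join(left, right, key="Id"):
    """Enrich each record of left with the attributes of the right record sharing its key."""
    extra = {rec[key]: rec for rec in right}
    joined = []
    for rec in left:
        merged = dict(rec)                       # copy, so the sources stay untouched
        merged.update(extra.get(rec[key], {}))   # no match: the record is kept as-is
        joined.append(merged)
    return joined


source_1 = [
    {"Id": 457, "A1": "v1", "A2": "v2", "A3": "v3"},
    {"Id": 458, "A1": "v4", "A2": "v5", "A3": "v6"},
]
source_2 = [
    {"Id": 457, "An": "vAn1"},
    {"Id": 458, "An": "vAn2"},
]

enriched = join(source_1, source_2)
```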
Tools 1/2
● Libraries
● CLI
● Operations
    sort
    projection
    select
    index
    ...
Tools 2/2
● Examples
    sort
        Library: sort(attr='taille')
        CLI: $> cat mesmeta.g | fsort 'taille' > trie_taille.g
    index
        create_idx(attr='Id') -> .Id.idx
        search_idx('Id', 'P0123')
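The two operations above can be mimicked in a few lines of Python. This is a sketch of the idea only: `sort_by`, `build_index` and `lookup` are hypothetical names, not the Gedeon library calls, and placing records that lack the sort attribute at the end is an assumption.

```python
def sort_by(records, attr):
    """Sort records by attr; records lacking attr are placed last (an assumption)."""
    return sorted(records, key=lambda rec: (attr not in rec, rec.get(attr, 0)))


def build_index(records, attr):
    """Map each value of attr to the list of records carrying that value."""
    index = {}
    for rec in records:
        if attr in rec:
            index.setdefault(rec[attr], []).append(rec)
    return index


def lookup(index, value):
    """Return the records indexed under value, or an empty list."""
    return index.get(value, [])


records = [
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
]

by_size = sort_by(records, "taille")
id_index = build_index(records, "Id")
```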
Language for the requests
● Simple ($ prefix for attributes, type control with the operators)
● Regular expressions
● Second-order (conditions on attribute names)
Select expression
Source records:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
    Id=460, classifA=Fermicutes
Select $Id>459
Result:
    Id=460, classifA=Fermicutes
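The selection on this slide can be reproduced with a short sketch. The function `select` is hypothetical and takes a Python callable instead of parsing the `$Id>459` syntax; it assumes that a record lacking the attribute is simply not selected.

```python
records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]


def select(records, attr, test):
    """Keep the records that carry attr and whose value passes test."""
    return [rec for rec in records if attr in rec and test(rec[attr])]


# Counterpart of: Select $Id>459
result = select(records, "Id", lambda value: value > 459)
```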
Select using regexp
Source records:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
    Id=460, classifA=Fermicutes
Select $classifB==/.*a$/
Result:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
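A regular-expression selection can be sketched with Python's `re` module. The function `select_regex` is hypothetical, and whether Gedeon anchors the pattern at the start of the value is not stated on the slide; `re.search` is used here as an assumption, which gives the same result for the pattern `.*a$`.

```python
import re

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]


def select_regex(records, attr, pattern):
    """Keep the records whose attr value, as text, matches the regular expression."""
    compiled = re.compile(pattern)
    return [rec for rec in records if attr in rec and compiled.search(str(rec[attr]))]


# Counterpart of: Select $classifB==/.*a$/
result = select_regex(records, "classifB", r".*a$")
```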
Select using 2nd order logic
Source records:
    Id=457, classifA=Bacteria, classifB=Clostridia, taille=26
    Id=459, classifB=Bacteria, taille=47
    Id=460, classifA=Fermicutes
Select $/classif[AB]/==Bacteria && $taille>=36
Result:
    Id=459, classifB=Bacteria, taille=47
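In a second-order request the regular expression applies to the attribute name rather than to the value. The sketch below assumes an existential reading (at least one attribute whose name matches must hold the value), which is consistent with the result on the slide; the helper `any_attr_equals` is hypothetical.

```python
import re

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]


def any_attr_equals(rec, name_pattern, value):
    """True if some attribute whose name matches name_pattern holds exactly value."""
    return any(
        re.fullmatch(name_pattern, name) is not None and val == value
        for name, val in rec.items()
    )


# Counterpart of: Select $/classif[AB]/==Bacteria && $taille>=36
result = [
    rec for rec in records
    if any_attr_equals(rec, r"classif[AB]", "Bacteria")
    and "taille" in rec and rec["taille"] >= 36
]
```

Record 457 satisfies the first condition through `classifA` but is rejected by the size condition, so only record 459 remains.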
Virtual FS interface
● Just a specific file-oriented interface
● Data and metadata can be anywhere in the grid
● Definition of logical directories
    Ex: cd '$classifB==|.*a$|'
    « and » between directories
    1 filename = value of a metadata attribute: logical view
/fs_virt/$classifB==|.*a$|> ls
/fs_virt/$classifB==|.*a$|> cat * > /tmp/mater
/fs_virt/$classifB==|.*a$|>
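The logical-directory idea can be sketched without a real file system. The function below assumes that a path is a list of (attribute, regular expression) directories combined with « and », and that the file name is taken from one chosen attribute; `list_directory` and its `name_attr` parameter are hypothetical and do not describe the actual virtual FS implementation.

```python
import re

records = [
    {"Id": 457, "classifA": "Bacteria", "classifB": "Clostridia", "taille": 26},
    {"Id": 459, "classifB": "Bacteria", "taille": 47},
    {"Id": 460, "classifA": "Fermicutes"},
]


def list_directory(records, directories, name_attr="Id"):
    """List one file name per record that satisfies every directory on the path."""
    names = []
    for rec in records:
        if all(
            attr in rec and re.search(pattern, str(rec[attr]))
            for attr, pattern in directories
        ):
            names.append(str(rec[name_attr]))
    return names


# Roughly what "ls" would show inside the directory $classifB==|.*a$|
top = list_directory(records, [("classifB", r".*a$")])

# Entering a second directory adds a condition (« and » between directories).
nested = list_directory(records, [("classifB", r".*a$"), ("taille", r"^4")])
```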
Dual cache (1)
● 2 cooperative caches
    Cache of requests (R, {id, ...}) -> saves computing power
    Cache of data (id, {attr, ...}) -> saves bandwidth
● Semantic cache
    Can evaluate a query using the data in the cache
    Can generate a remainder query to complement the cached data
Example
● Refinement of a request
    1) '$OC==/Eukaryota/' -> (R, Lid={id1, id2, ...})
    2) '$OC==/Eukaryota/ && $year>=1998'
       Select(*Lid, '$year>=1998')
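The refinement example can be sketched with two dictionaries, one per cache. Everything here is an assumption made for illustration: the class `DualCache`, the `refine_from` parameter, the in-memory "server" and the sample years are hypothetical, and remainder generation is left out. The point is that the second request is evaluated on the cached id list without any new transfer.

```python
class DualCache:
    """Toy dual cache: a query cache (request -> ids) and a data cache (id -> record)."""

    def __init__(self, server_records):
        self._server = {rec["Id"]: rec for rec in server_records}  # stands in for remote sites
        self.query_cache = {}
        self.data_cache = {}
        self.fetched = 0  # number of records transferred from the server

    def _fetch(self, ident):
        if ident not in self.data_cache:
            self.data_cache[ident] = self._server[ident]
            self.fetched += 1
        return self.data_cache[ident]

    def select(self, request, predicate, refine_from=None):
        if request in self.query_cache:
            ids = self.query_cache[request]
        elif refine_from in self.query_cache:
            # Refinement: apply only the added condition to the cached id list.
            ids = [i for i in self.query_cache[refine_from] if predicate(self._fetch(i))]
        else:
            # Cache miss: the server evaluates the whole request.
            ids = [i for i, rec in self._server.items() if predicate(rec)]
        self.query_cache[request] = ids
        return [self._fetch(i) for i in ids]


records = [
    {"Id": "id1", "OC": "Eukaryota", "year": 1995},
    {"Id": "id2", "OC": "Eukaryota", "year": 2001},
    {"Id": "id3", "OC": "Archaea", "year": 1999},
]
cache = DualCache(records)

first = cache.select("$OC==/Eukaryota/", lambda rec: rec["OC"] == "Eukaryota")
second = cache.select(
    "$OC==/Eukaryota/ && $year>=1998",
    lambda rec: rec["year"] >= 1998,          # only the added condition
    refine_from="$OC==/Eukaryota/",
)
```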
Dual cache (2)
● Distributed semantic cache
    Typically used inside communities
        ● Lots of common requests
    No location constraints
        ● Members of the community can be geographically scattered
● Distributed data cache
    Minimizes time and data transfer
    Cooperation between topologically close sites
Dual cache (3)
[Figure: servers in Grenoble and Rennes share a dual cache. The query cache follows semantic locality (communities Eukaryota and Archaea); the object cache follows geographic locality.]
Dual cache (4)
● Work in progress on the notion of distance
    Find geographical proximity
    Find common interests between communities
        ● Create hybrid communities based on their requests
● Could be used to change the cache parameters
    Manual and/or automatic
Conclusion
● A data integration middleware
    Handling of metadata
● Distributed and modular
    Deployment can follow architectural/organisational constraints
● Definition of a dual cache infrastructure
    Reflects both organisation and use
● Prototype in use
    Packaging and documentation needed
Questions?