The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities

Semantic Network Analysis
Analyzing Semantic Interoperability in Bioinformatic Database Networks
Philippe Cudré-Mauroux, EPFL
Joint work with: Julien Gaugaz, Adriana Budura and Karl Aberer

Overview
1. Peer Data Management Systems (PDMS)
2. Semantic Interoperability in the Large
  – Generatingfunctionologic framework
3. The Sequence Retrieval System
  – Degree distribution
  – Analysis of giant component
  – Weighted analysis
4. Conclusions

Beyond Keyword Search
Searching semantically richer objects in large-scale heterogeneous networks
[Figure: a "date?" query facing heterogeneous date values: T18:49:03Z, T20:09:28Z, 05/08/2004, Jan 1, 2005]

Decentralized Data Integration
Large Scale Information Systems (e.g., WWW)
  – Number of sources > 100
  – Unreliable data (autonomy)
  – Semi-structured data (e.g., XML/RDF)
  – No integrity constraints
  – No transactions
  – Simple SP queries (e.g., triple patterns, ranking)
  – Schemas created by end users
  – Network churn
vs.
Distributed Databases
  – Number of sources < 100
  – Consistent data (coordination)
  – Structured data (e.g., relational data model)
  – Integrity constraints
  – Transactions
  – Powerful queries (e.g., SQL, aggregation)
  – Schemas created by administrators
  – Relatively fixed topology

Data Integration: LAV/GAV
Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources
Not applicable to our context
  – Scale (upper ontologies?)
  – Churn
  – Autonomy
How can we foster semantic interoperability in decentralized settings?
[Figure: a central attribute Date mapped to the local attributes myDate and yourDate, with m(Date) = myDate and m(Date) = yourDate]

Semantic Interoperability
Extending semantic interoperability techniques to decentralized settings
Photoshop (own schema): 178A8CD8865 Robinson Tunbridge Wells Royal Council …
WinFS (known schema): 178A8CD8866 Henry Peach Robinson Photographer Tunbridge Council …
Q1 = $p/GUID FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%"
T12 = $fs/GUID $fs/Author/DisplayName FOR $fs IN /WinFSImage
Q2 = $p/GUID FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"

1. Peer Data Management Systems
Pairwise mappings
  – Peer Data Management Systems (PDMS)
Local mappings overcome global heterogeneity
  – Iterative query rewriting
[Figure: a "date?" query rewritten across peers (article, weather) through pairwise mappings between the attributes es:cDate, xap:CreateDate, xap:ModifyDate and myRDF:Date; sample values T18:49:03Z, T20:09:28Z, 05/08/2004, Jan 1, 2005]
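
Sketch (not from the slides): iterative query rewriting can be pictured as a breadth-first walk over pairwise mappings, translating the queried attribute at every hop. The attribute names come from the figure, but which attribute maps to which, the schema labels "es", "xap" and "myRDF", and the function name `rewrite_everywhere` are assumptions made for illustration only.

```python
from collections import deque

# Hypothetical pairwise mappings: (source schema, target schema) -> attribute translation.
# The pairing of attributes below is assumed, not taken from the slides.
MAPPINGS = {
    ("es", "xap"): {"es:cDate": "xap:CreateDate"},
    ("xap", "myRDF"): {"xap:CreateDate": "myRDF:Date"},
}


def rewrite_everywhere(start_schema, attribute, mappings=MAPPINGS):
    """Rewrite a single-attribute query hop by hop along pairwise mappings.

    Returns a dict: schema -> attribute name the query uses in that schema.
    Schemas that cannot be reached through a chain of mappings are absent.
    """
    reached = {start_schema: attribute}
    queue = deque([start_schema])
    while queue:
        schema = queue.popleft()
        for (src, dst), translation in mappings.items():
            if src != schema or dst in reached:
                continue
            translated = translation.get(reached[schema])
            if translated is not None:  # the mapping covers this attribute
                reached[dst] = translated
                queue.append(dst)
    return reached


if __name__ == "__main__":
    print(rewrite_everywhere("es", "es:cDate"))
```

No peer needs a global schema here: each hop only consults the local mapping to its neighbor, which is the point made by "local mappings overcome global heterogeneity".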

Semantic Mediation Layer
[Figure: three layers ("Physical" layer, Overlay Layer, Semantic Mediation Layer) with two "Correlated / Uncorrelated" labels]

Schema-to-Schema Graph
Inter-organization of the different schemas used by the peers
  – Logical model
  – Directed
  – Weighted
  – Redundant

The Semantic Connectivity Graph
Definition (Semantic Interoperability)
  Two peers are said to be semantically interoperable if they can forward queries to each other in the Schema-to-Schema graph, potentially through a series of semantic translation links
Idea
  – As for physical network analyses, create a connectivity layer to account for semantic interoperability
The Semantic Connectivity Graph S
  – Unweighted, irreflexive and non-redundant version of the Schema-to-Schema graph
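
Sketch (not from the slides): deriving the semantic connectivity graph from a Schema-to-Schema graph amounts to dropping weights, self-loops and duplicate links. The triple representation `(source, target, weight)` and the function name are assumptions for illustration.

```python
def semantic_connectivity_graph(mappings):
    """Build the unweighted, irreflexive, non-redundant graph.

    mappings: iterable of (source_schema, target_schema, weight) triples,
              possibly with self-loops and parallel links.
    Returns (nodes, edges) where edges is a set of directed (source, target) pairs.
    """
    nodes = set()
    edges = set()
    for source, target, _weight in mappings:  # weights are discarded (unweighted)
        nodes.update((source, target))
        if source != target:                  # no self-loops (irreflexive)
            edges.add((source, target))       # a set removes duplicates (non-redundant)
    return nodes, edges


if __name__ == "__main__":
    # Hypothetical schema-to-schema links.
    links = [("A", "B", 0.9), ("A", "B", 0.4), ("B", "B", 1.0), ("B", "C", 0.7)]
    print(semantic_connectivity_graph(links))
```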

Observations
Theorem
  Peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s \equiv \{ s \mid \exists p \in P_s,\ p \leftrightarrow s \}$
Observation 1
  A set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$
Observation 2
  A set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s| (|V_s| - 1) - (|V_s| - 1)$
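
Sketch (not from the slides): the theorem reduces interoperability to strong connectivity, which can be tested with one forward and one backward traversal from any node. The two observations give edge-count shortcuts that sometimes avoid the traversal. Function names are hypothetical; edges are assumed to be distinct, free of self-loops, and to connect only the listed nodes.

```python
def _reachable(adjacency, start):
    """Set of nodes reachable from start by following adjacency lists."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(nodes, edges):
    """True if every node can reach every other node along directed edges."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    forward, backward = {}, {}
    for source, target in edges:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
    root = next(iter(nodes))
    # root reaches everyone, and everyone reaches root
    return _reachable(forward, root) == nodes and _reachable(backward, root) == nodes


def quick_verdict(n_nodes, n_edges):
    """Edge-count shortcuts; returns None when the counts are inconclusive."""
    if n_nodes <= 1:
        return True
    if n_edges < n_nodes:                               # Observation 1
        return False
    if n_edges > n_nodes * (n_nodes - 1) - (n_nodes - 1):  # Observation 2
        return True
    return None
```

Observation 2 holds because a complete directed graph on |V| nodes has |V|(|V|-1) edges, and at least |V|-1 of them must be missing to cut off all outgoing (or all incoming) links of one node.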

Semantic Interoperability in the Large
Question
  – How can we analyze semantic interoperability in large-scale PDMS?
Idea: use percolation theory to detect the emergence of a strongly connected component in S
  – Necessary condition for vertex-strong connectivity
  – Necessary condition for semantic interoperability

The Model
Adaptation of a recent graph-theoretic framework
  – Newman, Strogatz, Watts 2001
Large-scale semantic graphs as random graphs with arbitrary degree distribution
  – Exponentially distributed, small-world, scale-free… graphs
Specificities of our model
  – Strong clustering (clustering coefficient cc)
  – Bidirectionality (bidirectionality coefficient bc) (for directed networks)
Based on generatingfunctionology
  – Percolation: ci > 0

Size of the giant component
[Equation for the size of the giant component not recovered]
With u the smallest non-negative solution of [equation not recovered]
And $G_1$ the distribution of edges from first- to second-order neighbors: [equation not recovered]
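
Sketch (not from the slides): the slide's own equations are not available, so the code below uses the standard Newman-Strogatz-Watts formulation for an undirected, unclustered random graph with degree distribution p_k: G_0(x) = sum_k p_k x^k, G_1(x) = G_0'(x) / G_0'(1), u = G_1(u), and giant-component fraction S = 1 - G_0(u). The clustering and bidirectionality corrections of the presented model are not reproduced, and `molloy_reed` is the classic criterion sum_k k(k-2) p_k > 0, not the indicator ci. All function names are hypothetical.

```python
def molloy_reed(degree_probs):
    """Classic percolation criterion: a giant component exists if the result is > 0."""
    return sum(k * (k - 2) * p for k, p in degree_probs.items())


def giant_component_fraction(degree_probs, tol=1e-12, max_iter=100_000):
    """Return (u, S) for a degree distribution given as {degree: probability}.

    u is the smallest non-negative solution of u = G1(u), found by iterating
    from u = 0; S = 1 - G0(u) is the fraction of nodes in the giant component.
    """
    mean_degree = sum(k * p for k, p in degree_probs.items())
    if mean_degree == 0:
        return 1.0, 0.0  # no edges at all, hence no giant component

    def g0(x):
        return sum(p * x ** k for k, p in degree_probs.items())

    def g1(x):
        total = sum(k * p * x ** (k - 1) for k, p in degree_probs.items() if k > 0)
        return total / mean_degree

    u = 0.0
    for _ in range(max_iter):
        nxt = g1(u)
        if abs(nxt - u) < tol:
            u = nxt
            break
        u = nxt
    return u, 1.0 - g0(u)


if __name__ == "__main__":
    # Hypothetical distribution: half the nodes have degree 1, half degree 3.
    print(molloy_reed({1: 0.5, 3: 0.5}))
    print(giant_component_fraction({1: 0.5, 3: 0.5}))
```

Starting the iteration at 0 matters: G_1 is non-decreasing on [0, 1], so the iterates climb monotonically to the smallest fixed point rather than to the trivial solution u = 1.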

The Sequence Retrieval System (SRS)
Commercial information indexing and retrieval system
Bioinformatic libraries
  – EMBL
  – SwissProt
  – Prosite
  – Etc.
Schemas described in a custom language (Icarus)
Mappings (links) from one database to others

Why is SRS interesting?
Applying our heuristics on a real large-scale corpus of interconnected databases
  – More than 380 databanks
  – More than 500 (undirected) links
  – Data used by professionals on a daily basis

Crawling the SRS schema-to-schema graph
Custom crawler
As of May 2005 (EBI repository)
  – 388 nodes
  – 518 edges
  – Giant connected component: 187 nodes
  – Power-law distribution of node degrees
  – Clustering coefficient = 0.32
  – Diameter = 9
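
Sketch (not from the slides): statistics of this kind can be computed from a crawled edge list with plain breadth-first searches. The slide does not say which clustering coefficient was used; the code assumes the transitivity definition (closed connected triples over all connected triples) and takes the diameter over the largest connected component. Function names and the toy edge list are hypothetical, and nodes without any link do not appear in the adjacency map.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from (a, b) pairs; self-loops are ignored."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def bfs_distances(adjacency, start):
    """Hop distance from start to every node in its connected component."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def largest_component(adjacency):
    """Node set of the largest connected component."""
    seen, best = set(), set()
    for node in adjacency:
        if node in seen:
            continue
        component = set(bfs_distances(adjacency, node))
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def transitivity(adjacency):
    """Closed connected triples divided by all connected triples."""
    closed = triples = 0
    for neighbours in adjacency.values():
        degree = len(neighbours)
        triples += degree * (degree - 1) // 2
        closed += sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return closed / triples if triples else 0.0


def diameter(adjacency, component):
    """Longest shortest path between two nodes of a non-empty connected component."""
    return max(max(bfs_distances(adjacency, node).values()) for node in component)


if __name__ == "__main__":
    toy = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]
    adj = build_adjacency(toy)
    giant = largest_component(adj)
    print(len(giant), transitivity(adj), diameter(adj, giant))
```

With the figures reported on this slide, the giant connected component covers 187 / 388 ≈ 0.48 of the nodes.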

Results
Connectivity indicator ci = 25.4
  – Super-critical state
Size of the giant component
  – 0.47 (derived)
  – 0.48 (observed)

Graphs with same power-law degree distribution
Varying number of edges

10x Bigger Graph

Analyzing weighted networks
Do we have a sufficient number of good mappings?
Introducing quality measures from the mappings
  – Weights
  – Attribute / schema level
  – Cf. Chatty Web (WWW03)
Semantic query forwarding
  – Per-hop forwarding behaviors
  – Only forward if $w_i \geq \tau$
      $\tau = 0$: flooding
      $\tau = 1$: exact answers
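
Sketch (not from the slides): the per-hop rule "only forward if the mapping weight reaches a threshold" can be simulated by pruning links below the threshold and then traversing what remains. The parameter name `tau`, the function name and the sample weights are assumptions for illustration.

```python
from collections import deque


def reachable_schemas(weighted_edges, start, tau):
    """Schemas a query issued at start can reach under threshold forwarding.

    weighted_edges: iterable of (source, target, weight), weights in [0, 1].
    tau: per-hop threshold; a link is followed only if weight >= tau.
         tau = 0 keeps every link (flooding), tau = 1 keeps only perfect mappings.
    """
    adjacency = {}
    for source, target, weight in weighted_edges:
        if weight >= tau:
            adjacency.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    # Hypothetical mapping weights.
    links = [("A", "B", 0.9), ("B", "C", 0.4), ("A", "D", 1.0)]
    for tau in (0.0, 0.5, 1.0):
        print(tau, sorted(reachable_schemas(links, "A", tau)))
```

Raising the threshold trades recall for precision: fewer schemas are reached, but only through higher-quality mappings.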

Weighted Results
Same degree distribution (388 nodes)
Uniformly distributed weights between 0 and 1

Conclusions
Analyzing a real network of bioinformatic databases
  – Accurate results (even for relatively small networks)
  – Weighted / unweighted
Current work
  – Compositions of weights along a path
  – Semantic random walkers
  – Public domain simulator
Future work
  – Analyzing other forwarding behaviors
  – Implementation in a real PDMS (self-organizing mappings)
      GridVine

References
A Necessary Condition for Semantic Interoperability in the Large. Philippe Cudré-Mauroux and Karl Aberer. ODBASE 2004.
GridVine: Building Internet-Scale Semantic Overlay Networks. Karl Aberer, Philippe Cudré-Mauroux and Tim van Pelt. ISWC 2004.
Semantic Overlay Networks (Tutorial). Karl Aberer and Philippe Cudré-Mauroux. VLDB 2005.
… complete reference list at

Thank you for your attention
Questions?