1 Data Management in Large-Scale Decentralized Information Systems
Philippe Cudré-Mauroux, EPFL
Joint work with: Karl Aberer (advisor @ EPFL)
IBM T.J. Watson, 02.06.06
The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities.

2 Overview
1. Motivation: Picture Sharing in Decentralized Settings
2. Data Integration in large-scale networks
   2.1. Peer Data Management Systems
   2.2. Semantic Gossiping
   2.3. Probabilistic Message-Passing
   2.4. Aspects of self-organization
        2.4.1. Self-Healing semantic networks
        2.4.2. Analyzing semantic interoperability in the large
3. Applications
   3.0. P-Grid
   3.1. GridVine
   3.2. PicShark
4. Conclusions & Future Work

3 1. Motivation: Picture Sharing
Profusion of digital images
– Variety of powerful devices
– Gigabytes of pictures is the new norm
Most of the images are kept local
Some are shared
– Mostly point-to-point
– Primitive search capabilities
[Figure labels: MMS, HTTP, SMTP]

4 Opportunity
More and more software uses metadata to organize images locally
– (Semi-)structured metadata (e.g., XML, PSA)
– Ontological metadata (e.g., RDF, XMP)
– Type-based metadata (e.g., WinFS)
[Example RDF record: <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'> 2001-12-19T18:49:03Z 2001-12-19T20:09:28Z John Doe …]

5 Hurdle: Metadata Heterogeneity
Why not take advantage of those metadata in a distributed setting?
X Syntactic discrepancies
X Semantic heterogeneity
All the aforementioned standards are extensible
Shared representation is not enough
Example table:
ImageGUID | cDate
A0657B25  | 05.08.04
109E7A25  | 05.08.04
05.08.04 VS 05/08/2004
Width VS Length-Y

6 Beyond Keyword Search
Searching semantically richer objects in large-scale heterogeneous networks
[Figure: a query "date?" posed to peers holding differently formatted values: 2001-12-19T18:49:03Z, 2001-12-19T20:09:28Z, 05/08/2004, Jan 1, 2005]

7 2. Data Integration in large-scale systems
Large-Scale Information Systems (e.g., WWW)
– Number of sources > 1000
– Unreliable data
  Autonomy
– Semi-structured data
  E.g., XML/RDF
– No integrity constraints
– No transactions
– Simple SP queries
  E.g., triple patterns, ranking
– Schemata created by end users
– Network churn
VS
Distributed Databases
– Number of sources < 100
– Consistent data
  Coordination
– Structured data
  E.g., relational data model
– Integrity constraints
– Transactions
– Powerful queries
  E.g., SQL, aggregation
– Schemas created by administrators
– Relatively fixed topology

8 Data Integration: LAV/GAV
Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources
Not applicable to our context
– Scale (upper ontologies?)
– Churn
– Autonomy
How can we foster semantic interoperability in decentralized settings?
[Figure: local attributes myDate and yourDate mapped to a central attribute Date: m(myDate) = Date, m(yourDate) = Date]

9 Semantic Interoperability
Photoshop (own schema): 178A8CD8865 Robinson Tunbridge Wells Royal Council …
WinFS (known schema): 178A8CD8866 Henry Peach Robinson Photographer Tunbridge Council …
Q1 = $p/GUID FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%"
T12 = $fs/GUID $fs/Author/DisplayName FOR $fs IN /WinFSImage
Q2 = $p/GUID FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
→ Extending semantic interoperability techniques to decentralized settings
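The rewriting of Q1 into Q2 through the view T12 can be sketched in Python. This is a minimal illustration under stated assumptions, not the system's implementation: records are plain dictionaries, the helper names `apply_mapping` and `query_creator_like` are hypothetical, and placing "Robinson" and "Henry Peach Robinson" in the Creator and Author/DisplayName fields is an assumption about the slide's example data.

```python
def apply_mapping(record, mapping):
    """Translate a foreign record into the local schema (plays the role of T12)."""
    return {local: record[foreign] for local, foreign in mapping.items()}


def query_creator_like(records, needle):
    """Q1/Q2: return the GUID of every record whose Creator contains `needle`."""
    return [r["GUID"] for r in records if needle in r["Creator"]]


# Example data; the field assignment is assumed for illustration.
photoshop_images = [{"GUID": "178A8CD8865", "Creator": "Robinson"}]
winfs_images = [{"GUID": "178A8CD8866", "Author/DisplayName": "Henry Peach Robinson"}]

# T12: local attribute name -> attribute path in the WinFS schema.
T12 = {"GUID": "GUID", "Creator": "Author/DisplayName"}

# Q1 runs on local data; Q2 is the same query evaluated over the mapped view.
q1 = query_creator_like(photoshop_images, "Robi")
q2 = query_creator_like([apply_mapping(r, T12) for r in winfs_images], "Robi")
```

The query itself is unchanged between Q1 and Q2; only the data source is swapped for the mapped view, which is what lets a peer answer queries posed against a schema it does not use.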

10 2.1. Peer Data Management Systems
Pairwise mappings
– Peer Data Management Systems (PDMS)
  Local mappings overcome global heterogeneity
– Iterative query rewriting
Example mappings:
es:cDate → xap:CreateDate
es:cDate → myRDF:Date
myRDF:Date → xap:ModifyDate
[Figure: a query "date?" rewritten across peers holding 2001-12-19T18:49:03Z, 2001-12-19T20:09:28Z, 05/08/2004, Jan 1, 2005; other labels: article, weather]

11 Problem: Precision/Recall Tradeoff
Semantic query routing
– To whom shall I forward a query posed against my local schema?
Some (most) mappings will be (partially) faulty
– Low expressive power of mappings
  samePropertyAs / sameClassAs / subclassOf … or even worse (Microformats)
– Automatic schema alignment techniques
– Different views on conceptualizations
Local query resolution
– Low recall
Flooding (PDMS so far)
– Low precision
Standard deductive integration is not sufficient
– Uncertainty on mappings and conceptualizations

12 2.2. Semantic Gossiping
Local translations enabling global agreements
Selective query forwarding paradigm
– Syntactic distances
  Lost predicates
– Semantic distances
  Results analysis
  Cycles analysis
→ Precision/Recall tradeoff

13 Semantic Gossiping
Selectively reformulate queries through mapping links
[Figure: a query reformulated from peer to peer; two of the reformulations are crossed out (X)]
$\pi_{Titre}\,\sigma_{Auteur=Joe}(R_1)$
$\pi_{Title}\,\sigma_{Author=Joe}(R_2)$
$\pi_{Title}\,\sigma_{Creator=Joe}(R_3)$
$\pi_{Title}\,\sigma_{Creator=Joe}(R_4)$
$\pi_{Title}\,\sigma_{Creature=Joe}(R_5)$
$\sigma_{Author=Joe}(R_4)$
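Selective forwarding based on a syntactic distance (lost predicates) can be sketched as follows. This is a simplified illustration, not the Semantic Gossiping algorithm itself: a query is reduced to a list of attribute names, a mapping is a dictionary from source to target attributes, and the similarity measure (fraction of attributes preserved), the threshold value, and all names are assumptions.

```python
def reformulate(attributes, mapping):
    """Translate query attributes through a mapping.

    Attributes without a correspondence are lost. Returns the translated
    attributes and the fraction of attributes that survived.
    """
    if not attributes:
        return [], 0.0
    kept = [mapping[a] for a in attributes if a in mapping]
    return kept, len(kept) / len(attributes)


def should_forward(similarity, threshold):
    """Forward the reformulated query only if enough of it was preserved."""
    return similarity >= threshold


query_r1 = ["Titre", "Auteur"]
r1_to_r2 = {"Titre": "Title", "Auteur": "Author"}  # complete mapping
r2_to_r4 = {"Author": "Author"}                    # hypothetical: Title has no counterpart

query_r2, sim_12 = reformulate(query_r1, r1_to_r2)
query_r4, sim_24 = reformulate(query_r2, r2_to_r4)
```

With a threshold of 0.75, the query is forwarded from R1 to R2 (nothing lost) but not from R2 to R4 (half of the attributes lost). Semantic distances, derived from result and cycle analysis, would be combined with this syntactic measure in the same forwarding decision.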

14 2.3. Probabilistic Message Passing
Where do the mapping quality measures come from?
Link-based analysis of the PDMS
– Automatically deriving quality measures for the mappings
Transitive closures on mapping operations
– Mapping cycles
– Parallel paths
[Figure: mapping graph with mappings m_0, m_1, m_2, m_3, m_4, m_5 and feedback f_0]

15 On Cycles / Parallel Paths
q VS m_3(m_4(m_0(q)))
Example: q: art/Creator? VS art/creatDate?
[Figure: cycle through mappings m_0, m_4, m_3 yielding feedback f_0]

16 Computing a Marginal for One Cycle
$P(m_0, m_3, m_4, f_0) = P(m_0)\, P(m_3)\, P(m_4)\, P(f_0 \mid m_0, m_3, m_4)$
$P(m_0 \mid f_0) = \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)\; P(f_0)^{-1}$
[Figure legend: observed / unknown variables]
But: feedbacks on different cycles are correlated
– One wrong mapping will affect several cycles/paths
– Need to express a global probabilistic model for the mapping graph
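The single-cycle marginal can be computed by brute-force enumeration. The sketch below is an illustration under assumptions that are not stated on this slide: all mappings share the same prior, and the feedback likelihood follows a simple model in which a cycle returns positive feedback with probability 1 if every mapping is correct, 0 if exactly one is incorrect, and `delta` if two or more are incorrect (errors may compensate). The function names are hypothetical.

```python
from itertools import product


def feedback_likelihood(states, delta):
    """P(f = positive | mapping states) for one cycle (assumed model)."""
    wrong = sum(1 for ok in states if not ok)
    if wrong == 0:
        return 1.0
    if wrong == 1:
        return 0.0
    return delta


def posterior_m0_correct(prior, delta, n_other=2):
    """P(m0 = correct | f0 = positive), marginalizing over the other mappings."""

    def joint(m0_ok):
        # Sum over all states of the remaining mappings (m3, m4 on the slide).
        total = 0.0
        for others in product([True, False], repeat=n_other):
            states = (m0_ok,) + others
            p_states = 1.0
            for ok in states:
                p_states *= prior if ok else 1.0 - prior
            total += p_states * feedback_likelihood(states, delta)
        return total

    numerator = joint(True)
    evidence = numerator + joint(False)  # P(f0 = positive)
    return numerator / evidence


posterior = posterior_m0_correct(prior=0.7, delta=0.1)
```

Observing positive feedback on the cycle raises the belief in m_0 above its prior. The enumeration grows exponentially with the cycle length and ignores correlations between cycles, which is why a global model is needed.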

17 A Brief Intro to Factor-Graphs
$g(x_1, x_2, x_3, x_4) = f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)$
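For this factorization, the marginal of x_2 can be obtained either by summing g over all other variables or by multiplying the two messages that the factors f_A and f_B send to x_2 (the sum-product rule on a tree). The sketch below checks that both give the same result; the binary domain and the factor tables are arbitrary assumptions chosen for illustration.

```python
from itertools import product

DOMAIN = (0, 1)  # assumed binary variables


def f_a(x1, x2):
    """Arbitrary example factor over (x1, x2)."""
    return 0.9 if x1 == x2 else 0.1


def f_b(x2, x3, x4):
    """Arbitrary example factor over (x2, x3, x4)."""
    return 1.0 + x2 + 2 * x3 * x4


def marginal_x2_brute_force():
    """Sum g(x1, x2, x3, x4) over x1, x3, x4 for each value of x2."""
    return [
        sum(f_a(x1, x2) * f_b(x2, x3, x4)
            for x1, x3, x4 in product(DOMAIN, repeat=3))
        for x2 in DOMAIN
    ]


def marginal_x2_sum_product():
    """Multiply the messages f_A -> x2 and f_B -> x2."""
    msg_a = [sum(f_a(x1, x2) for x1 in DOMAIN) for x2 in DOMAIN]
    msg_b = [sum(f_b(x2, x3, x4) for x3, x4 in product(DOMAIN, repeat=2))
             for x2 in DOMAIN]
    return [a * b for a, b in zip(msg_a, msg_b)]


brute = marginal_x2_brute_force()
messages = marginal_x2_sum_product()
```

The message-based version sums over 2 + 4 terms per value of x_2 instead of 8, and the saving grows with the size of the graph. On graphs with cycles the same local rule is applied iteratively and yields approximate marginals.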

18 Deriving PDMS Factor-Graphs
Abductive reasoning on transitive closures of mappings
[Figure label: a priori information on mapping]

19 PDMS Factor-Graphs
Cyclic graph
– Junction tree? Clustering / stretching of variables?
  Not applicable (decentralization)
– Iterative sum-product
  Approximate results
How to perform iterative sum-product by message passing on the mapping graph?
– Message passing in the factor graph does not correspond to the connectivity of the mapping graph
– We want to rely on decentralized computations only
Locality VS globality of nodes in the factor graph
– Mappings: local
– Feedback factor: common, global knowledge
– Observed feedback variables: neighborhood

20 Embedded Message-Passing (1)

21 Embedded Message-Passing (2)

22 Message Passing
Decentralized computations
Computationally inexpensive
– Sums and products
Message-passing schedules
– Periodic
– Lazy (piggybacking on query forwarding)
  No message overhead

23 Implemented System
Schemas
– Import from OWL (Web Ontology Language)
Mappings
– KnowledgeWeb Ontology Alignment API
– Import from RDF/XML
– Automated on-the-fly creation
– Comparison to standard alignments
→ Automatic derivation of quality measures P(m = correct | {F}) for the mappings using iterative message passing

24 2.4. Aspects of Self-Organization
What can we do once we have gathered quality measures about the mappings?
– Routing queries (✓)
– Correcting mappings
– Analyzing global properties of the graph

25 2.4.1. Self-Healing Semantic Networks
Two types of self-organization
– Static network
  Self-organizing dissemination of queries
– Dynamic network
  Self-organizing network of mappings
Idea:
– Quality evaluation of mappings through Semantic Gossiping
– Modify low-quality links
– Reorganized network leads to different quality evaluation
– Dynamic network changes
→ Self-organizing, self-referential semantic network

26 Some Results (1)
Sensitivity to TTL (cycle analysis only, 25 schemas, 4 concepts)

27 Some Results (2)
Scalability (results analysis only, 4 concepts, TTL=3, misclassification rate=0.1, 2 documents/peer on avg.)

28 2.4.2. Analyzing Semantic Interoperability in the Large
What about interoperability at a global scale?
Modeling semantic interoperability: the semantic connectivity graph
– Idea: as for physical network analyses, define a connectivity layer
– Unweighted, non-redundant version of the Schema-to-Schema graph
Schema-to-Schema graph
– Logical model
– Directed
– Weighted
– Redundant

29 Semantic Interoperability in the Large
Definition: Peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s \equiv \{ s \mid \exists p \in P_s,\ p \leftrightarrow s \}$
Observation 1: A set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$
Observation 2: A set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s|(|V_s| - 1) - (|V_s| - 1)$
What happens between these two bounds?
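The definition and the two edge-count observations can be checked on small graphs with the following sketch. It assumes a simple directed graph (no self-loops, no duplicate edges) with at least two schemas; the function names and the verdict strings are hypothetical.

```python
def reachable(start, adjacency):
    """Return the set of vertices reachable from `start`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(vertices, edges):
    """True iff every vertex reaches every other vertex along directed edges."""
    vertices = set(vertices)
    if not vertices:
        return True
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
    root = next(iter(vertices))
    # Strongly connected iff root reaches all vertices and all vertices reach root.
    return (reachable(root, forward) == vertices
            and reachable(root, backward) == vertices)


def edge_count_verdict(n_vertices, n_edges):
    """Apply Observations 1 and 2; assumes a simple digraph with n_vertices >= 2."""
    if n_edges < n_vertices:
        return "not interoperable"
    if n_edges > n_vertices * (n_vertices - 1) - (n_vertices - 1):
        return "interoperable"
    return "undetermined"


cycle = [("a", "b"), ("b", "c"), ("c", "a")]
path = [("a", "b"), ("b", "c")]
```

A directed 3-cycle has |E_s| = |V_s| = 3, so the edge counts alone are inconclusive even though the graph is strongly connected. This is the gap between the two bounds that the percolation analysis addresses.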

30 A Necessary Condition for Interoperability in the Large
Analyzing semantic interoperability in large-scale, decentralized networks
– Percolation theory for directed graphs
– Based on a recent graph-theoretic framework
– Random graphs with specific degree distributions $p_{jk}$, clustering coefficient $cc$ and bidirectionality coefficient $bc$
Based on generating functions
Distribution of edges from first to second-order neighbors: [equation not preserved in the transcript]
Necessary condition for semantic interoperability in the large:
$\sum_{j,k} \left( jk - j(bc + cc) - k \right) p_{jk} \ge 0$
Also: approximations of the size of semantically interoperable clusters
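The left-hand side of the necessary condition is straightforward to evaluate once the joint degree distribution is known. The sketch below assumes that p_jk is given as a dictionary keyed by (j, k) pairs, with j read as the in-degree and k as the out-degree of a schema; that reading and the function name are assumptions.

```python
def connectivity_indicator(p_jk, bc, cc):
    """Evaluate sum over (j, k) of (j*k - j*(bc + cc) - k) * p_jk.

    p_jk: dict mapping (j, k) degree pairs to probabilities.
    bc:   bidirectionality coefficient.
    cc:   clustering coefficient.
    """
    return sum((j * k - j * (bc + cc) - k) * p
               for (j, k), p in p_jk.items())


# Every schema has exactly one incoming and one outgoing mapping.
ring = {(1, 1): 1.0}
# Half of the schemas have degree (1, 1), half have degree (3, 3).
mixed = {(1, 1): 0.5, (3, 3): 0.5}

ci_ring = connectivity_indicator(ring, bc=0.0, cc=0.0)
ci_mixed = connectivity_indicator(mixed, bc=0.0, cc=0.0)
ci_mixed_redundant = connectivity_indicator(mixed, bc=0.5, cc=0.5)
```

Higher bidirectionality and clustering lower the indicator, since reciprocal and triangle-closing mappings lead back to already-reached schemas instead of to new ones.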

31 Some Results
(Poisson-distributed random graph, 10 000 vertices)

32 Analysis of a Real System
Analysis of the Sequence Retrieval System (SRS)
– Commercial information indexing and retrieval system for bioinformatic libraries
– Schemas described in a custom language (Icarus)
– Mappings (foreign keys) from one database to others
Crawling the EBI repository
– 388 databanks
– 518 (undirected) links
– Power-law distribution of node degrees
– Clustering coefficient = 0.32
– Diameter = 9
Connectivity indicator ci = 25.4
– Super-critical state
Size of the giant component
– 0.47 (derived) VS 0.48 (observed)

33 Query Dissemination in Weighted Networks
– Per-hop forwarding behaviors
– Only forward if $w_i \ge \tau$
  $\tau = 0$: flooding
  $\tau = 1$: exact answers
– Degree distribution similar to SRS
– Uniformly distributed weights between 0 and 1
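The per-hop forwarding rule can be sketched as a breadth-first traversal that only follows links whose weight reaches the threshold. The graph encoding (a dictionary of weighted adjacency dictionaries), the example weights, and the name `disseminate` are assumptions made for illustration; the threshold is written `tau` here.

```python
from collections import deque


def disseminate(graph, origin, tau):
    """Return the set of peers reached when forwarding only over links with weight >= tau."""
    reached = {origin}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for neighbour, weight in graph.get(node, {}).items():
            if weight >= tau and neighbour not in reached:
                reached.add(neighbour)
                queue.append(neighbour)
    return reached


# Hypothetical mapping graph: node -> {neighbour: mapping weight in [0, 1]}.
graph = {
    "a": {"b": 1.0, "c": 0.4},
    "b": {"d": 0.7},
    "c": {"d": 0.2},
}

flooded = disseminate(graph, "a", tau=0.0)
exact = disseminate(graph, "a", tau=1.0)
selective = disseminate(graph, "a", tau=0.5)
```

With a threshold of 0 every link qualifies and the query floods the network; with a threshold of 1 only perfect mappings are followed. Intermediate values trade recall for precision.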

34 Local View on Global Properties?
SRS-like distribution, 1000 vertices, 4000 edges

35 3. Applications
3.0. P-Grid: a peer-to-peer access structure
3.1. GridVine: self-organizing semantic overlay network
3.2. PicShark: self-organizing middleware to export pictures and create mappings

36 Standard Data Management over Overlay Networks
Hard problem?
– Strictly speaking impossible
  CAP theorem: pick at most two of the following:
  1. Consistency
  2. Availability
  3. Tolerance to network Partitions
Practical compromises: e.g., relaxing ACID properties
S. Gilbert and N. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, 33(2), 2002.

37 3.0. The P-Grid Access Structure
Distributed, virtual binary search tree
– Complete decentralization
– Self-organization
– Efficient search
Gridella: a DHT based on P-Grid
– Decentralized load balancing
– Updates
– Replication
– Management of dynamic IP addresses and identities
Used in several large-scale research projects
– MICS
– Alvis
– Bricks
– Evergrow

38 Prefix Routing
[Figure: example P-Grid. Trie nodes: 0, 1; 00, 01, 10, 11; 000, 001, 010, 011, 100, 101. Example query: query(101) @ 7]
Legend: peer identifier; data keys (2 = 0010 etc.); routing table entry, e.g. "1 : 12, 13" (prefix : referenced peers)
Peer | Data keys | Routing table
1  | 0,1      | 1 : 12,13   01 : 5,10    001 : 9,4
7  | 0,1      | 1 : 12,13   01 : 5,14    001 : 9,4
4  | 2,3      | 1 : 6,13    01 : 10,14   000 : 1,7
9  | 2,3      | 1 : 8,2     01 : 3,10    000 : 1,7
5  | 4,5      | 1 : 8,13    00 : 7,9     011 : 3,10
14 | 4,5      | 1 : 2,12    00 : 9,4     011 : 3,10
10 | 6,7      | 1 : 6,8     00 : 1,7     010 : 5,14
3  | 6,7      | 1 : 11,12   00 : 1,9     010 : 5,14
11 | 8,9      | 0 : 4,7     11 : 2,12    101 : 8,13
6  | 8,9      | 1 : 1,3     11 : 2,12    101 : 8,13
13 | 10,11    | 0 : 5,9     11 : 2,12    100 : 6,11
8  | 10,11    | 0 : 4,9     11 : 2,12    100 : 6,11
12 | 12,13,14 | 0 : 5,7     10 : 6,13
2  | 12,13,14 | 0 : 1,14    10 : 11,13
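The lookup query(101) issued at peer 7 can be traced with a short prefix-routing sketch. It rests on several assumptions: each peer's binary path is inferred from its data keys (for example, keys 0 and 1 give path 000), only the routing entries needed for this walk-through are listed, and the first referenced peer is always chosen, whereas a real deployment would typically pick among the references at random. The names `PEERS` and `route` are hypothetical.

```python
# peer id -> (path, {prefix: [referenced peers]}); excerpt of the slide's tables.
PEERS = {
    7: ("000", {"1": [12, 13], "01": [5, 14], "001": [9, 4]}),
    12: ("11", {"0": [5, 7], "10": [6, 13]}),
    6: ("100", {"11": [2, 12], "101": [8, 13]}),
    8: ("101", {"0": [4, 9], "11": [2, 12], "100": [6, 11]}),
}


def route(key, start, peers):
    """Return the list of peers visited while resolving `key` from `start`."""
    hops = [start]
    current = start
    while True:
        path, table = peers[current]
        # The peer is responsible if its path and the key agree on their common length.
        if key.startswith(path) or path.startswith(key):
            return hops
        # Length of the common prefix of key and path.
        common = 0
        while key[common] == path[common]:
            common += 1
        # Follow the reference for the first level at which the key diverges.
        current = table[key[: common + 1]][0]
        hops.append(current)


hops = route("101", 7, PEERS)
```

Each hop extends the common prefix between the key and the current peer's path by at least one bit, so a lookup needs at most as many hops as the key has bits.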

39 3.1. GridVine
Building large-scale semantic systems
– Self-organizing semantic overlay network
Principle of data independence
– Scalable physical layer
– Semantic logical layer

40 Semantic Mediation Layer
[Figure: three layers: “Physical” layer, Overlay Layer, Semantic Mediation Layer; the relation between adjacent layers is labelled Correlated / Uncorrelated]

41 GridVine: Annotating Content

42 Features
Self-organized, scalable, decentralized
Resolves key-based searches in O(log(n)) even for unbalanced trees
Semantic Web compliant
– RDF triples, RDFS schemas, OWL mappings
Structured searches
– Simple triple patterns, RDQL queries
– Query resolution: iterative, distributed table lookup
Semantic Gossiping
– Tradeoff precision / recall
– Automatic reformulation of queries
One of the building blocks of the Social Semantic Desktop
– Nepomuk, large-scale EU research project

43 Indexing Semi-Structure in GridVine
Soft states
– Each triple has an expiration time (cf. CAP theorem)
Locality-preserving hash function
– Range searches
Triple t = (lsir:GridVine, dc:creator, lsir:pcm)
Put(Hash(lsir:GridVine), t)
Put(Hash(dc:creator), t)
Put(Hash(lsir:pcm), t)
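The three Put operations and the soft-state expiry can be sketched with an in-memory stand-in for the overlay. Everything here is an assumption made for illustration: the class `SoftStateIndex`, the explicit `now` argument used instead of a clock, and the toy `hash_key` function, which keeps the bits of the first 8 bytes of a term so that lexicographic order is preserved. It does not reproduce GridVine's actual hash function.

```python
def hash_key(term, n_bytes=8):
    """Toy order-preserving key: the bits of the first `n_bytes` bytes of the term."""
    return "".join(format(b, "08b") for b in term.encode("utf-8")[:n_bytes])


class SoftStateIndex:
    """In-memory stand-in for the distributed hash table."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.table = {}  # key -> list of (triple, expiry time)

    def put(self, key, triple, now):
        self.table.setdefault(key, []).append((triple, now + self.ttl))

    def insert_triple(self, triple, now):
        # Index the triple once per component: subject, predicate, object.
        for term in triple:
            self.put(hash_key(term), triple, now)

    def lookup(self, term, now):
        # Expired entries are ignored (soft state).
        entries = self.table.get(hash_key(term), [])
        return [t for t, expiry in entries if expiry > now]


index = SoftStateIndex(ttl=10)
t = ("lsir:GridVine", "dc:creator", "lsir:pcm")
index.insert_triple(t, now=0)

by_predicate = index.lookup("dc:creator", now=5)
by_object = index.lookup("lsir:pcm", now=5)
after_expiry = index.lookup("dc:creator", now=11)
```

Indexing each triple under all three components lets a triple pattern be resolved from whichever component it fixes, at the cost of storing the triple three times. Because keys keep the order of the terms, terms sharing a prefix land in a contiguous key range, which is what makes range searches possible.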

44 Semantic Integration in GridVine (1)
Vertical integration: hierarchy of classes
– Fostering semantic interoperability through reuse of conceptualizations
– Simple, user-oriented constructs (cf. GUI)
– Few, popular base classes bootstrapping interoperability through properties
RDFS entailment can be materialized for the instances

45 Semantic Integration in GridVine (2)
Horizontal integration: mappings
– Simple links relating properties
– Cycle + feedback analysis to get probabilistic guarantees
→ Automatic Semantic Gossiping

46 Traversals of the Semantic Overlay Network
GridVine: structured P2P network
– No more constraints on gossiping
Different query forwarding paradigms
– Iterative forwarding
– Recursive forwarding

47 3.2. PicShark
Where do the mappings come from?
Middleware for sharing semi-structured metadata attached to pictures and creating mappings
[Figure: PicShark architecture. Labels: PSP, XMP, WinFS; Metadata Extractor; Features Extractor (60 moments); Information Tracker; (Distributed) Hashtable (e.g., GridVine) with Insert / Retrieve]

48 Features
Self-organization of mappings
– Based on low-level features extracted from
  Picture (color moment, textures)
  Structured metadata (lexicographical analysis)
Self-organization of annotations
– Probabilistic propagation of annotations between similar individuals
Self-organization of query dissemination
– Schema distance based on probabilistic subsumption
– Propagation within a certain diameter
Driven by user interaction
Scalable
– Computationally expensive operations are local at the peers
– Only simple in-network operations (look-ups)
(On-going) collaborative effort with Microsoft Research + MICS

49 PicShark Prototype

50 4. Conclusions
Fundamental issue: data interoperability and management in large-scale decentralized environments
– Content sharing
– Information search
– Semantic Web?
Traditional techniques are not sufficient
– Scale
– Autonomy
– Uncertainty
Self-organizing, decentralized stochastic processes
– Data indexing
– Data integration
  Semantics as agreement
  Abductive reasoning (on transitive closures of mappings)
– Query dissemination

51 Current Research Directions
Probabilistic firing of complex events
– Noisy streams of data (motes, sensor streams)
– Networked environment (HiFi)
Information retrieval in PDMS
– Characterization
– Probabilistic guarantees for multi-hop retrieval based on extensions
Information-theoretic measures for interoperability
Query optimization in GridVine

52 References
Karl Aberer and Philippe Cudré-Mauroux: Semantic Overlay Networks (Tutorial), International Conference on Very Large Data Bases (VLDB 05).
… complete reference list at http://lsirpeople.epfl.ch/pcudre/

53 Thank you for your attention
Questions?

54 Some (Preliminary) Results: Convergence
(undirected example graph, prior 0.7, delta 0.1)

55 Fault-Tolerance (faulty links)
(undirected example graph, prior 0.8, delta 0.1)

56 Preliminary Results: EON (Alignment contest)
Worst-case scenario: no prior knowledge
Set of 6 schemas on bibliographic data (approx. 30-40 attributes)
396 generated attribute mappings (84 incorrect)

