The National Centres of Competence in Research are managed by the Swiss National Science Foundation on behalf of the Federal Authorities 1 IBM T.J. Watson.

Presentation transcript:

IBM T.J. Watson
Data Management in Large-Scale Decentralized Information Systems
Philippe Cudré-Mauroux, EPFL
Joint work with: Karl Aberer (EPFL)

Overview
1. Motivation: Picture Sharing in Decentralized Settings
2. Data Integration in large-scale networks
   1. Peer Data Management Systems
   2. Semantic Gossiping
   3. Probabilistic Message-Passing
3. Aspects of self-organization
   1. Self-Healing semantic networks
   2. Analyzing semantic interoperability in the large
4. Applications
   0. P-Grid
   1. GridVine
   2. PicShark
5. Conclusions & Future Work

1. Motivation: Picture Sharing
Profusion of digital images
– Variety of powerful devices
– Gigabytes of pictures are the new norm
Most of the images are kept local
Some are shared
– Mostly point-to-point
– Primitive search capabilities
[Figure labels: MMS, HTTP, SMTP]

Opportunity
More and more software uses metadata to organize images locally
– (Semi-)structured metadata (e.g., XML, PSA)
– Ontological metadata (e.g., RDF, XMP)
– Type-based metadata (e.g., WinFS)
[Example, partially lost in extraction: <rdf:RDF xmlns:rdf= ' T18:49:03Z T20:09:28Z John Doe …]

Hurdle: Metadata Heterogeneity
Why not take advantage of those metadata in a distributed setting?
X Syntactic discrepancies
X Semantic heterogeneity
All the aforementioned standards are extensible
A shared representation is not enough
[Figure residue: ImageGUID, cDate, A0657B E7A /08/2004 VS Width, Length-Y]

Beyond Keyword Search
Searching semantically richer objects in large-scale heterogeneous networks
[Figure: a "date?" query meets heterogeneous values such as T18:49:03Z, T20:09:28Z, 05/08/2004 and Jan 1, 2005]

2. Data Integration in large-scale systems
Large Scale Information Systems (e.g., WWW)
– Number of sources > 1000
– Unreliable data
     Autonomy
– Semi-structured data
     E.g., XML/RDF
– No integrity constraints
– No transactions
– Simple SP queries
     E.g., triple patterns, ranking
– Schemata created by end users
– Network churn
VS
Distributed Databases
– Number of sources < 100
– Consistent data
     Coordination
– Structured data
     E.g., relational data model
– Integrity constraints
– Transactions
– Powerful queries
     E.g., SQL, aggregation
– Schemas created by administrators
– Relatively fixed topology

Data Integration: LAV/GAV
Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources
Not applicable to our context
– Scale (upper ontologies?)
– Churn
– Autonomy
How can we foster semantic interoperability in decentralized settings?
[Figure: local attributes myDate and yourDate mapped to a central attribute Date, with m(myDate) = Date and m(yourDate) = Date]

Semantic Interoperability
Photoshop (own schema): 178A8CD8865 Robinson Tunbridge Wells Royal Council …
WinFS (known schema): 178A8CD8866 Henry Peach Robinson Photographer Tunbridge Council …
Q1 = $p/GUID FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%"
T12 = $fs/GUID $fs/Author/DisplayName FOR $fs IN /WinFSImage
Q2 = $p/GUID FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
→ Extending semantic interoperability techniques to decentralized settings

2.1. Peer Data Management Systems
Pairwise mappings
– Peer Data Management Systems (PDMS)
Local mappings overcome global heterogeneity
– Iterative query rewriting
[Figure: a "date?" query is rewritten across peers holding values such as T18:49:03Z, T20:09:28Z, 05/08/2004 and Jan 1, 2005 (documents: article, weather), through the mappings es:cDate → xap:CreateDate, es:cDate → myRDF:Date and myRDF:Date → xap:ModifyDate]
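Iterative query rewriting can be sketched as a breadth-first traversal of the pairwise mappings. The following is a minimal illustration, not the PDMS implementation: the dictionary layout, the function name `rewrite_iteratively` and the `max_hops` limit are assumptions; only the three attribute mappings come from the slide.

```python
from collections import deque

# Pairwise attribute mappings named on the slide; the dict layout is an assumption.
MAPPINGS = {
    "es:cDate": ["xap:CreateDate", "myRDF:Date"],
    "myRDF:Date": ["xap:ModifyDate"],
}


def rewrite_iteratively(attribute, mappings, max_hops=3):
    """Return every attribute reachable from `attribute`, with its hop count."""
    reached = {attribute: 0}
    queue = deque([attribute])
    while queue:
        current = queue.popleft()
        hops = reached[current]
        if hops == max_hops:
            continue  # hypothetical hop limit, comparable to a TTL
        for target in mappings.get(current, []):
            if target not in reached:
                reached[target] = hops + 1
                queue.append(target)
    return reached


reachable = rewrite_iteratively("es:cDate", MAPPINGS)
```

In this example a creation date ends up rewritten into `xap:ModifyDate` after two hops, which is the kind of semantically doubtful chain that motivates assessing mapping quality.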

Problem: Precision/Recall Tradeoff
Semantic query routing
– To whom shall I forward a query posed against my local schema?
Some (most) mappings will be (partially) faulty
– Low expressive power of mappings
     samePropertyAs / sameClassAs / subclassOf … or even worse (Microformats)
– Automatic schema alignment techniques
– Different views on conceptualizations
Local query resolution
– Low recall
Flooding (PDMS so far)
– Low precision
Standard deductive integration is not sufficient
– Uncertainty on mappings and conceptualizations

2.2. Semantic Gossiping
Local translations enabling global agreements
Selective query forwarding paradigm
– Syntactic distances
     Lost predicates
– Semantic distances
     Results analysis
     Cycles analysis
→ Precision/Recall tradeoff

Semantic Gossiping
Selectively reformulate queries through mapping links
[Figure: the query $\pi_{Title}\,\sigma_{Author=Joe}(R_2)$ is reformulated along mapping links as $\pi_{Titre}\,\sigma_{Auteur=Joe}(R_1)$, $\pi_{Title}\,\sigma_{Creator=Joe}(R_3)$, $\pi_{Title}\,\sigma_{Creator=Joe}(R_4)$, $\sigma_{Author=Joe}(R_4)$ and $\pi_{Title}\,\sigma_{Creature=Joe}(R_5)$; two reformulations are crossed out (X)]
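The "lost predicates" idea behind selective forwarding can be sketched as follows. This is a simplified stand-in for the syntactic distance, not the measure used in Semantic Gossiping: the similarity is assumed to be the fraction of query attributes that survive a mapping, and the function names, the threshold and the `TO_R4` mapping are hypothetical.

```python
def reformulate(attributes, mapping):
    """Translate attribute names; attributes without a mapping are lost."""
    translated = [mapping[a] for a in attributes if a in mapping]
    lost = [a for a in attributes if a not in mapping]
    return translated, lost


def syntactic_similarity(attributes, lost):
    """Fraction of the query attributes preserved by the mapping (assumed measure)."""
    return 1.0 - len(lost) / len(attributes)


def should_forward(attributes, mapping, threshold):
    """Forward the reformulated query only if enough attributes survive."""
    _, lost = reformulate(attributes, mapping)
    return syntactic_similarity(attributes, lost) >= threshold


QUERY = ["Title", "Author"]                      # projection and selection attributes
TO_R1 = {"Title": "Titre", "Author": "Auteur"}   # mapping that preserves both attributes
TO_R4 = {"Author": "Author"}                     # hypothetical mapping that loses Title
```

With a threshold of 0.75, the query would be forwarded through `TO_R1` (similarity 1.0) but not through `TO_R4` (similarity 0.5).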

2.3. Probabilistic Message Passing
Where do the mapping quality measures come from?
Link-based analysis of the PDMS
– Automatically deriving quality measures for the mappings
Transitive closures on mapping operations
– Mapping cycles
– Parallel paths
[Figure: mapping graph with mappings $m_0, \dots, m_5$ and feedback $f_0$]

On Cycles / parallel paths
[Figure: a query q (art/Creator?) is compared (VS) with $m_3(m_4(m_0(q)))$, its image after the cycle through $m_0$, $m_4$ and $m_3$, which yields the feedback $f_0$; example: art/Creator? VS art/creatDate?]

Computing a Marginal for one cycle
$$P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$$
$$P(m_0 \mid f_0) = \frac{1}{P(f_0)} \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)$$
[Figure legend: observed, unknown]
But: feedbacks on different cycles are correlated
– One wrong mapping will affect several cycles/paths
– Need to express a global probabilistic model for the mapping graph
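The single-cycle marginal can be computed by brute-force enumeration, as in the sketch below. The conditional P(f0 | m0, m3, m4) is not given on the slide; the code assumes positive feedback with probability 1 when all mappings are correct, 0 when exactly one is wrong, and `delta` when two or more are wrong (errors compensating each other). The function name, the priors and `delta` are illustrative choices.

```python
from itertools import product


def posterior_m0_correct(priors, delta):
    """P(m0 = correct | f0 = positive) for one cycle; priors[0] belongs to m0."""

    def p_positive_feedback(states):
        # Assumed feedback model, see the lead-in.
        wrong = states.count(False)
        if wrong == 0:
            return 1.0
        if wrong == 1:
            return 0.0
        return delta

    joint_m0_correct = 0.0  # sum over m3, m4 of P(m0 = correct, m3, m4, f0)
    evidence = 0.0          # P(f0)
    for states in product([True, False], repeat=len(priors)):
        p_states = 1.0
        for prior, correct in zip(priors, states):
            p_states *= prior if correct else 1.0 - prior
        p = p_states * p_positive_feedback(states)
        evidence += p
        if states[0]:
            joint_m0_correct += p
    return joint_m0_correct / evidence


posterior = posterior_m0_correct([0.7, 0.7, 0.7], delta=0.1)
```

Under these assumptions a positive feedback raises the belief in m0 above its prior of 0.7. The enumeration is exponential in the cycle length, which is acceptable only for a single short cycle.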

A Brief Intro to Factor-Graphs
$$g(x_1, x_2, x_3, x_4) = f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)$$
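For this factorization, the marginal of x2 can be obtained either by summing g over all other variables or by multiplying the two messages that the factors fA and fB send to x2. The sketch below checks that both routes agree; the binary domain and the factor tables `f_a` and `f_b` are arbitrary assumptions chosen for illustration.

```python
from itertools import product

VALUES = (0, 1)  # assumed binary domain for every variable


def f_a(x1, x2):
    # Arbitrary illustrative factor over (x1, x2).
    return 0.9 if x1 == x2 else 0.1


def f_b(x2, x3, x4):
    # Arbitrary illustrative factor over (x2, x3, x4).
    return 1.0 + x2 + 2 * x3 * x4


def marginal_brute_force(x2):
    """Sum g(x1, x2, x3, x4) over x1, x3 and x4."""
    return sum(
        f_a(x1, x2) * f_b(x2, x3, x4)
        for x1, x3, x4 in product(VALUES, repeat=3)
    )


def marginal_sum_product(x2):
    """Product of the messages sent to x2 by the factors fA and fB."""
    message_from_a = sum(f_a(x1, x2) for x1 in VALUES)
    message_from_b = sum(f_b(x2, x3, x4) for x3, x4 in product(VALUES, repeat=2))
    return message_from_a * message_from_b
```

The graph here is a tree, so the sum-product result is exact; on cyclic graphs the same message updates are iterated and give approximate marginals.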

Deriving PDMS Factor-Graphs
Abductive reasoning on transitive closures of mappings
A priori information on mappings

PDMS Factor-Graphs
Cyclic graph
– Junction Tree? Clustering / stretching of variables?
     Not applicable (decentralization)
– Iterative Sum-Product
     Approximate results
How to perform iterative sum-product by message passing on the mapping graph?
– Message passing in the factor graph does not correspond to the connectivity of the mapping graph
– We want to rely on decentralized computations only
Locality VS globality of nodes in the factor graph
– Mappings: local
– Feedback factor: common, global knowledge
– Observed feedback variables: neighborhood

Embedded Message-Passing (1)

Embedded Message-Passing (2)

Message Passing
Decentralized computations
Computationally inexpensive
– Sums and products
Message-passing schedules
– Periodic
– Lazy (piggybacking on query forwarding)
     No message overhead

Implemented System
Schemas
– Import from OWL (Web Ontology Language)
Mappings
– KnowledgeWeb Ontology Alignment API
– Import from RDF/XML
– Automated on-the-fly creation
– Comparison to standard alignments
→ Automatic derivation of quality measures $P(m = \text{correct} \mid \{F\})$ for the mappings using iterative message-passing

3. Aspects of Self-Organization
What can we do once we have gathered quality measures about the mappings?
– Routing queries
– Correcting mappings
– Analyzing global properties of the graph

3.1. Self-Healing Semantic Networks
Two types of self-organization
– Static network
     Self-organizing dissemination of queries
– Dynamic network
     Self-organizing network of mappings
Idea:
– Quality evaluation of mappings through Semantic Gossiping
– Modify low-quality links
– Reorganized network leads to different quality evaluation
– Dynamic network changes
→ Self-organizing, self-referential semantic network

Some Results (1)
[Figure: sensitivity to TTL (cycle analysis only, 25 schemas, 4 concepts)]

Some Results (2)
[Figure: scalability (results analysis only, 4 concepts, TTL=3, misclassification rate=0.1, 2 documents/peer on avg.)]

3.2. Analyzing Semantic Interoperability in the Large
What about interoperability at a global scale?
Modeling semantic interoperability: the semantic connectivity graph
– Idea: as for physical network analyses, define a connectivity layer
– Unweighted, non-redundant version of the Schema-to-Schema graph
Schema-to-Schema Graph
– Logical model
– Directed
– Weighted
– Redundant

Semantic Interoperability in the Large
Definition: Peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s \equiv \{ s \mid \exists p \in P_s,\ p \leftrightarrow s \}$
Observation 1: A set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$
Observation 2: A set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s|(|V_s|-1) - (|V_s|-1)$
What happens between these two bounds?
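The definition and the two bounds can be checked on small graphs with the sketch below. It assumes schemas are vertices and mappings are directed edges between distinct schemas; the function names are illustrative, and `observation_2_holds` verifies the upper bound exhaustively, so it is only usable for very small vertex counts.

```python
from itertools import combinations


def reachable_from(start, adjacency):
    """Set of vertices reachable from `start` by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(vertices, edges):
    """True iff every vertex reaches, and is reached from, an arbitrary start vertex."""
    vertices = set(vertices)
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
    start = next(iter(vertices))
    return (reachable_from(start, forward) >= vertices
            and reachable_from(start, backward) >= vertices)


def observation_2_holds(n):
    """Check every digraph on n vertices with more than n(n-1) - (n-1) edges."""
    vertices = list(range(n))
    all_edges = [(a, b) for a in vertices for b in vertices if a != b]
    bound = n * (n - 1) - (n - 1)
    for size in range(bound + 1, len(all_edges) + 1):
        for edges in combinations(all_edges, size):
            if not is_strongly_connected(vertices, edges):
                return False
    return True
```

A directed ring over three schemas is strongly connected with exactly |V| edges, whereas a chain with |V| - 1 edges is not, which matches Observation 1.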

A Necessary Condition for interoperability in the large
Analyzing semantic interoperability in large-scale, decentralized networks
– Percolation theory for directed graphs
– Based on a recent graph-theoretic framework
– Random graphs with specific degree distributions $p_{jk}$, clustering coefficient $cc$ and bidirectionality coefficient $bc$
Based on generating functions
Distribution of edges from first to second-order neighbors: (formula not preserved in the transcript)
Necessary condition for semantic interoperability in the large:
$$\sum_{j,k} \bigl(jk - j(bc + cc) - k\bigr)\, p_{jk} \ge 0$$
Also: approximations of the size of semantically interoperable clusters
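The left-hand side of the necessary condition can be evaluated from an empirical degree distribution, as sketched below. Reading p_jk as the fraction of schemas with in-degree j and out-degree k is an assumption, as are the function name and the sample inputs; bc and cc are taken as given rather than measured.

```python
from collections import Counter


def connectivity_indicator(degrees, bc, cc):
    """Sum over (j, k) of (j*k - j*(bc + cc) - k) * p_jk.

    `degrees` holds one (in_degree, out_degree) pair per schema; p_jk is
    estimated as the fraction of schemas showing that pair.
    """
    counts = Counter(degrees)
    total = len(degrees)
    return sum(
        (j * k - j * (bc + cc) - k) * count / total
        for (j, k), count in counts.items()
    )


ring = [(1, 1)] * 10    # directed ring: every schema has one incoming and one outgoing mapping
denser = [(2, 2)] * 10  # every schema has two incoming and two outgoing mappings
```

A directed ring without bidirectional or clustered edges sits exactly at the boundary (indicator 0), and with these inputs the denser graph gives a positive value as long as bc + cc stays below 1.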

Some results
[Figure: Poisson-distributed random graph, vertices]

Analysis of a real system
Analysis of the Sequence Retrieval System (SRS)
– Commercial information indexing and retrieval system for bioinformatic libraries
– Schemas described in a custom language (Icarus)
– Mappings (foreign keys) from one database to others
Crawling the EBI repository
– 388 databanks
– 518 (undirected) links
– Power-law distribution of node degrees
– Clustering coefficient = 0.32
– Diameter = 9
Connectivity indicator ci = 25.4
– Super-critical state
Size of the giant component
– 0.47 (derived) VS 0.48 (observed)

Query dissemination in weighted networks
– Per-hop forwarding behaviors
– Only forward if $w_i \ge \tau$
     $\tau = 0$: flooding
     $\tau = 1$: exact answers
– Degree distribution similar to SRS
– Uniformly distributed weights between 0 and 1
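The per-hop forwarding rule can be sketched as a breadth-first dissemination that only follows links whose weight reaches the threshold (called `tau` here). The adjacency-list format, the function name and the sample graph are assumptions made for illustration.

```python
from collections import deque


def disseminate(graph, source, tau):
    """Return the set of peers reached when only links with weight >= tau are used."""
    reached = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor, weight in graph.get(node, []):
            if weight >= tau and neighbor not in reached:
                reached.add(neighbor)
                queue.append(neighbor)
    return reached


# Hypothetical weighted mapping graph: peer -> [(neighbor, mapping weight)].
GRAPH = {
    "a": [("b", 1.0), ("c", 0.4)],
    "b": [("d", 0.7)],
    "c": [("d", 0.9)],
}
```

With `tau = 0` the query floods the whole graph, with `tau = 1` it only follows mappings of weight 1, and intermediate values trade recall for precision.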

Local view on global properties?
[Figure: SRS-like distribution, 1000 vertices, 4000 edges]

4. Applications
0. P-Grid
   A peer-to-peer access structure
1. GridVine
   Self-organizing semantic overlay network
2. PicShark
   Self-organizing middleware to export pictures and create mappings

Standard data management over overlay networks
Hard problem?
– Strictly speaking impossible
     CAP theorem: pick at most two of the following:
     1. Consistency
     2. Availability
     3. Tolerance to network Partitions
Practical compromises: e.g., relaxing ACID properties
S. Gilbert and N. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, 33(2), 2002.

4.0. The P-Grid Access Structure
Distributed, virtual binary search tree
– Complete decentralization
– Self-organization
– Efficient search
Gridella: a DHT based on P-Grid
– Decentralized load balancing
– Updates
– Replication
– Management of dynamic IP addresses and identities
Used in several large-scale research projects
– MICS
– Alvis
– Bricks
– Evergrow

[Figure: example P-Grid showing peers with their binary paths, routing tables and data keys. Legend: peer identifier; data keys (e.g., 2,3 with 2 = 0010); prefix routing; routing table entry (e.g., 1 : 12, 13)]
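Prefix routing over a virtual binary tree can be sketched as follows. This is a simplification under stated assumptions, not P-Grid's actual algorithm: each peer is assumed to hold a binary path and, per level, references to peers whose path differs at that bit; the sketch always picks the first reference, ignores replication and randomization, and uses a hypothetical four-peer network rather than the one in the figure.

```python
def route(peers, start, key):
    """Follow routing-table references until a peer's path is a prefix of `key`.

    `peers` maps a peer id to (path, table), where table[level] lists peers
    that share the first `level` bits of the path and differ at bit `level`.
    The key is assumed to be at least as long as every path.
    """
    current = start
    hops = [current]
    while True:
        path, table = peers[current]
        # First bit position where the key leaves this peer's subtree, if any.
        level = next((i for i, bit in enumerate(path) if key[i] != bit), None)
        if level is None:
            return hops  # this peer is responsible for the key
        current = table[level][0]
        hops.append(current)


# Hypothetical network: four peers covering the paths 00, 01, 10 and 11.
PEERS = {
    1: ("00", {0: [3], 1: [2]}),
    2: ("01", {0: [4], 1: [1]}),
    3: ("10", {0: [1], 1: [4]}),
    4: ("11", {0: [2], 1: [3]}),
}
```

Each hop extends the prefix shared between the key and the current peer's path by at least one bit, so the number of hops is bounded by the path length.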

4.1. GridVine
Building large-scale semantic systems
– Self-organizing semantic overlay network
Principle of data independence
– Scalable physical layer
– Semantic logical layer

Semantic Mediation Layer
[Figure: three layers, from bottom to top: "Physical" layer, Overlay Layer, Semantic Mediation Layer; adjacent layers are labeled Correlated / Uncorrelated]

GridVine: Annotating Content

Features
Self-organized, scalable, decentralized
Resolves key-based searches in O(log(n)) even for unbalanced trees
Semantic Web compliant
– RDF triples, RDFS schemas, OWL mappings
Structured searches
– Simple triple patterns, RDQL queries
– Query resolution: iterative, distributed table lookup
Semantic Gossiping
– Tradeoff precision / recall
– Automatic reformulation of queries
One of the building blocks of the Social Semantic Desktop
– Nepomuk, large-scale EU research project

Indexing semi-structure in GridVine
Soft-states
– Each triple has an expiration time (cf. CAP theorem)
Locality-preserving hash function
– Range searches
Triple t =
Put(Hash(lsir:GridVine), t)
Put(Hash(dc:creator), t)
Put(Hash(lsir:pcm), t)
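The three Put operations can be sketched with an in-memory dictionary standing in for the DHT. Everything below is an assumption made for illustration: the triple is inferred from the three hashed terms, the byte-prefix hash is only one simple order-preserving choice and not GridVine's function, and expiration times are left out.

```python
def order_preserving_hash(term, width=16):
    """Map a string to an integer key that preserves byte-wise lexicographic order.

    Terms sharing their first `width` bytes collide (hypothetical scheme).
    """
    data = term.encode("utf-8")[:width].ljust(width, b"\x00")
    return int.from_bytes(data, "big")


def put(store, key, value):
    """Stand-in for the DHT insert operation."""
    store.setdefault(key, []).append(value)


def index_triple(store, triple):
    """Index the triple once per term: subject, predicate and object."""
    for term in triple:
        put(store, order_preserving_hash(term), triple)


def lookup(store, term):
    """Return the indexed triples that contain `term`."""
    return [t for t in store.get(order_preserving_hash(term), []) if term in t]


def range_lookup(store, low_term, high_term):
    """Return index entries whose key lies between the two terms (inclusive)."""
    low, high = order_preserving_hash(low_term), order_preserving_hash(high_term)
    return [t for key in sorted(store) if low <= key <= high for t in store[key]]


store = {}
t = ("lsir:GridVine", "dc:creator", "lsir:pcm")  # triple inferred from the Put calls
index_triple(store, t)
```

Because the keys keep the order of the terms, a range such as `lsir:` to `lsir:zzzz` covers both `lsir:` entries, which is what makes range searches possible on top of key-based lookups.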

Semantic Integration in GridVine (1)
Vertical integration: hierarchy of classes
– Fostering semantic interoperability through reuse of conceptualizations
– Simple, user-oriented constructs (cf. GUI)
– Few, popular base classes bootstrapping interoperability through properties
RDFS entailment can be materialized for the instances

Semantic Integration in GridVine (2)
Horizontal integration: mappings
– Simple links relating properties
– Cycle + feedback analysis to get probabilistic guarantees
→ Automatic Semantic Gossiping

Traversals of the Semantic Overlay Network
GridVine: structured P2P network
– No more constraints on gossiping
Different query forwarding paradigms
– Iterative forwarding
– Recursive forwarding

4.2. PicShark
Where do the mappings come from?
Middleware for sharing semi-structured metadata attached to pictures and creating mappings
[Figure: PicShark architecture. Labels: PSP, XMP, WinFS; Metadata Extractor; Features Extractor (60 moments); Information Tracker; Insert / Retrieve; (Distributed) Hashtable (e.g., GridVine)]

Features
Self-organization of mappings
– Based on low-level features extracted from
     Pictures (color moments, textures)
     Structured metadata (lexicographical analysis)
Self-organization of annotations
– Probabilistic propagation of annotations between similar individuals
Self-organization of query dissemination
– Schema distance based on probabilistic subsumption
– Propagation within a certain diameter
Driven by user interaction
Scalable
– Computationally expensive operations are local at the peers
– Only simple in-network operations (look-ups)
(Ongoing) collaborative effort with Microsoft Research + MICS

PicShark Prototype

5. Conclusions
Fundamental issue: data interoperability and management in large-scale decentralized environments
– Content sharing
– Information search
– Semantic Web?
Traditional techniques are not sufficient
– Scale
– Autonomy
– Uncertainty
Self-organizing, decentralized stochastic processes
– Data indexing
– Data integration
     Semantics as agreement
     Abductive reasoning (on transitive closures of mappings)
– Query dissemination

Current research directions
Probabilistic firing of complex events
– Noisy streams of data (motes, sensor streams)
– Networked environment (HiFi)
Information Retrieval in PDMS
– Characterization
– Probabilistic guarantees for multi-hop retrieval based on extensions
Information-theoretic measures for interoperability
Query optimization in GridVine

References
Semantic Overlay Networks (Tutorial), Karl Aberer and Philippe Cudré-Mauroux, International Conference on Very Large Data Bases (VLDB 05).
… complete reference list at

Thank you for your attention
Questions?

Some (Preliminary) Results: Convergence
[Figure: undirected example graph, prior 0.7, delta 0.1]

Fault-tolerance (faulty links)
[Figure: undirected example graph, prior 0.8, delta 0.1]

Preliminary Results: EON (Alignment contest)
Worst-case scenario: no prior knowledge
Set of 6 schemas on bibliographic data (approx attributes)
396 generated attribute mappings (84 incorrect)