iTrails: Pay-as-you-go Information Integration in Dataspaces. Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi.

Presentation transcript:

iTrails: Pay-as-you-go Information Integration in Dataspaces
Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi (ETH Zurich), VLDB 2007
Anat Heilper, Jan, CS Seminar in Databases (236826)

Problem: Querying heterogeneous data sources
– Data sources: Laptop, Server, Web Server, DB Server
– Query: "What is the impact of the global depression in Israel?"
– Which query system can answer it?

Solution 1: Use a Search Engine
– The data sources (Laptop, Server, Web Server, DB Server) are exposed as text and links to a graph IR search engine
– Example systems: TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
– Example query: global depression Israel
– Drawback: query semantics are not precise!

Solution 2: Use an Information Integration System
– A query interface over a global schema is mapped to the source schemas of data sources 1, 2, and 3
– Figure labels: price index, countries, unemployment, crime rate
– Drawback: too much effort to provide schema mappings!

Two opposite approaches to querying heterogeneous data sources
Schema-first approach (SFA)
– Semantically integrated view over the data sources
– Mappings between source schemas and mediated schema
– Queries have clearly defined semantics
– Drawback: expensive to construct and maintain
– Drawback: not all data sources have schemas
No-schema approach (NSA)
– Keyword search
– Requires good result ranking methods
– Performs no integration
– Drawback: query semantics are not well defined

Motivation of iTrails
– Can we find an integration solution in between these two extremes?
– A dataspace system sits between a graph IR search engine and a data integration system
– Figure: the sources Temps, Cities, CO2, and Sunspots, exposed as text and links
– The more effort you invest, the more query power you get.

iTrails Core Idea: Add Integration Hints Incrementally
1) Provide a search service over the data
– Use a general graph data model (iDM)
– Handles unstructured documents, XML, and relations
2) Add integration semantics via hints (trails)
3) If more semantics are needed, apply trails
– Smooth transition between search and data integration
– Semantics are added incrementally to improve precision / recall

Example of an iDM
X1 = {.name = 'home', .tuple = {.owner = 'root', .lastmodified = ' '}, .content = ""}
X2 = {.name = 'mike', .tuple = {.owner = 'root', .lastmodified = ' '}, .content = ""}
...
X5 = {.name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = ' '}, .content = ' '}
...
Figure: folder tree with the nodes home, Mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf

General graph data model: iDM
– iDM (iMeMex Data Model) represents every structural component of the input data as a node.
– It supports unstructured, semi-structured, and structured data, e.g., files and folders, XML, relations.

iMeMex: integrated memex
– Vannevar Bush introduced the concept of the "memex" in 1945: a "device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."
– Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified."

Data model
– Data is represented by a directed graph $G = (RV, E)$
– $RV = \{V_1, \ldots, V_n\}$: the set of resource views
– $E$: ordered pairs $(V_i, V_j)$ of resource views
– $V_i \rightsquigarrow V_j$: $V_j$ is reachable from $V_i$ by traversing the edges in $E$

Resource view
A resource view $V_i$ has three components: name, tuple, and content.
– $V_i$.name: string
– $V_i$.tuple: sequence of attribute-value pairs ((att_0, val_0), (att_1, val_1), ...)
– $V_i$.content: text
Example: {.name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = ' '}, .content = ' '}
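As a rough illustration of the data model and the resource view components, the following Python sketch stores resource views in a directed graph and computes reachability. The class names, the `tuple_` field name, the node identifiers, and the folder layout are assumptions made for this example; they are not taken from iMeMex.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """A resource view with the three components name, tuple, content."""
    name: str
    tuple_: dict = field(default_factory=dict)  # attribute -> value
    content: str = ""


class Dataspace:
    """Directed graph G = (RV, E) over resource views (hypothetical helper)."""

    def __init__(self):
        self.views = {}   # node id -> ResourceView
        self.edges = {}   # node id -> set of successor ids

    def add(self, vid, view):
        self.views[vid] = view
        self.edges.setdefault(vid, set())

    def link(self, src, dst):
        self.edges[src].add(dst)

    def reachable(self, src):
        """All node ids reachable from src by traversing one or more edges."""
        seen, stack = set(), list(self.edges.get(src, ()))
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.edges.get(v, ()))
        return seen


# Assumed layout for illustration: home -> mike -> papers -> pim -> sigmod42
g = Dataspace()
g.add("home", ResourceView("home", {"owner": "root"}))
g.add("mike", ResourceView("mike", {"owner": "root"}))
g.add("papers", ResourceView("papers"))
g.add("pim", ResourceView("PIM"))
g.add("sigmod42", ResourceView("SIGMOD42.pdf", {"size": "10k", "owner": "mike"}))
for src, dst in [("home", "mike"), ("mike", "papers"),
                 ("papers", "pim"), ("pim", "sigmod42")]:
    g.link(src, dst)
```

The `reachable` method keeps a `seen` set, so it also terminates on graphs that contain cycles.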

Query model
Query expression:
– A query Q selects nodes $R := Q(G) \subseteq G.RV$
– Example: //mike/papers
Component projection:
– $C \in \{.name, .tuple.\langle att \rangle, .content\}$: projection of the set of resource views selected by query Q onto one component, i.e. the set of components $R' := \{V_i.C \mid V_i \in Q(G)\}$

Component projection example
Example: //mike//PIM/*.tuple.lastmodified
X1 = {.name = 'home', .tuple = {.owner = 'root', .lastmodified = ' '}, .content = ""}
X2 = {.name = 'mike', .tuple = {.owner = 'root', .lastmodified = ' '}, .content = ""}
...
X5 = {.name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = ' '}, .content = ' '}
...
Figure: folder tree with the nodes home, Mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf
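A component projection can be sketched in a few lines of Python. The sketch assumes that the query has already selected a list of resource views, represented here as plain dictionaries; the function name `project` and the date value are placeholders invented for the example.

```python
def project(views, component):
    """Return V.C for every selected view V.

    component is 'name', 'content', or 'tuple.<att>'.
    Views lacking the requested tuple attribute are skipped.
    """
    out = []
    for v in views:
        if component.startswith("tuple."):
            att = component.split(".", 1)[1]
            if att in v["tuple"]:
                out.append(v["tuple"][att])
        else:
            out.append(v[component])
    return out


# Hypothetical result of //mike//PIM/* (the date is a made-up placeholder)
selected = [
    {"name": "SIGMOD42.pdf",
     "tuple": {"size": "10k", "owner": "mike", "lastmodified": "2005-01-01"},
     "content": ""},
]
dates = project(selected, "tuple.lastmodified")
```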

Syntax of query expression
QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*
PATH := (LOCATION_STEP)+
LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?
LS_SEP := `//` | `/`
NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?
KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*
KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*
TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE
OPERATOR := `=` | `<` | `>`
LOGOP := `AND` | `OR`
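To make the PATH rule concrete, here is a minimal Python sketch that splits a path expression into its location steps. It is a simplification, not the iMeMex parser: predicates are returned as raw strings, predicates containing `/` or `]` are not supported, and a trailing component projection such as `.tuple.date` is not separated from the name. The name `parse_path` is an assumption.

```python
import re

# One location step: separator, name predicate, optional [predicate]
STEP = re.compile(r"(//|/)([^/\[\]]+)(?:\[([^\]]*)\])?")


def parse_path(path):
    """Split a PATH into (LS_SEP, NAME_PREDICATE, predicate-or-None) triples."""
    steps, pos = [], 0
    while pos < len(path):
        m = STEP.match(path, pos)
        if m is None:
            raise ValueError(f"cannot parse location step at offset {pos}")
        steps.append((m.group(1), m.group(2), m.group(3)))
        pos = m.end()
    return steps


steps = parse_path("//Temperatures/*[celsius>10]")
```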

Semantics of query expressions
– All nodes in the graph
– All nodes in the graph that have 'a' in their content
– All nodes in the graph that have 'a' and 'b' in their content
– All nodes in the graph such that .name == 'A'
– All nodes with .name == 'B' such that there is an edge from a node W with W.name == 'A'

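Logical algebra for query expressions
Operator, name, semantics:
– $G$, all resource views: $\{V \mid V \in G.RV\}$
– $\sigma_P(I)$, selection: $\{V \mid V \in I \wedge P(V)\}$
– $\mu(I)$, shallow unnest: $\{W \mid (V, W) \in G.E \wedge V \in I\}$
– $\omega(I)$, deep unnest: $\{W \mid V \rightsquigarrow W \wedge V \in I\}$
– $I_1 \cap I_2$, intersection: $\{V \mid V \in I_1 \wedge V \in I_2\}$
– $I_1 \cup I_2$, union: $\{V \mid V \in I_1 \vee V \in I_2\}$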
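The operators of the logical algebra map directly onto set operations. The Python sketch below is one possible reading, with nodes identified by their names and edges given as a set of pairs; the function names and the small example graph are assumptions, and intersection and union are simply Python's `&` and `|` on sets.

```python
def select(I, P):
    """Selection: nodes of I that satisfy predicate P."""
    return {v for v in I if P(v)}


def shallow_unnest(I, E):
    """Nodes reachable from some node of I over exactly one edge."""
    return {w for (v, w) in E if v in I}


def deep_unnest(I, E):
    """Nodes reachable from some node of I over one or more edges."""
    result, frontier = set(), set(I)
    while frontier:
        frontier = shallow_unnest(frontier, E) - result
        result |= frontier
    return result


# Assumed example graph; node ids double as names
G = {"home", "mike", "papers", "projects", "PIM"}
E = {("home", "mike"), ("mike", "papers"), ("mike", "projects"), ("papers", "PIM")}

# //mike/papers: select mike, take one step, keep nodes named 'papers'
mike = select(G, lambda v: v == "mike")
papers = shallow_unnest(mike, E) & select(G, lambda v: v == "papers")
```

Because `deep_unnest` only expands nodes it has not seen yet, it terminates on cyclic graphs as well.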

Example

What have we seen so far?
– Problem: querying heterogeneous data sources
– Find a solution between SFA and NSA
– A generic graph data model describes the data
– Queries describe paths in the graph

How do iTrails help?
– Queries are modified by hints (trails), which add or modify the search paths to consider.
– Example: yesterday → //*[date = today() – 1]

iTrails: Defining Trails
– Basic form of a trail: $Q_L[.C_L] \rightarrow Q_R[.C_R]$
– Intuition: when I query for $Q_L[.C_L]$, you should also query for $Q_R[.C_R]$
– Queries: keyword and path expressions
– Attribute projections

iTrails: Defining Trails
– Unidirectional trail: $Q_L[.C_L] \rightarrow Q_R[.C_R]$
– Intuition: when querying for $Q_L[.C_L]$, also query for $Q_R[.C_R]$
– Bidirectional trail: $Q_L[.C_L] \leftrightarrow Q_R[.C_R]$
– Example: $\psi_i$ := //*.tuple.date ↔ //*.tuple.modified
– Queries: keyword and path expressions; attribute projections
– Query examples: global warming zurich, or //Temperatures/*[celsius > 10]
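A trail definition can be held in a small record type. The sketch below is only an illustration: the class `Trail` and its fields are invented for this example, and it assumes that a bidirectional trail behaves like two unidirectional trails, one per direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    """A trail Q_L -> Q_R; both sides are kept as query strings."""
    left: str
    right: str
    bidirectional: bool = False

    def directions(self):
        """Expand into unidirectional (left, right) pairs."""
        pairs = [(self.left, self.right)]
        if self.bidirectional:
            pairs.append((self.right, self.left))
        return pairs


date_modified = Trail("//*.tuple.date", "//*.tuple.modified", bidirectional=True)
warming = Trail("global warming", "//Temperatures/*[celsius > 10]")
```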

Trail Examples: Global Warming Zurich
– Trail for implicit meaning: when querying for global warming, also query for temperature data above 10 degrees: global warming → //Temperatures/*[celsius > 10]
– Trail for an entity: when querying for zurich, also query for references to Zurich as a region: zurich → //*[region = "ZH"]
– Example query: global warming zurich
– Figure: a Temperatures table with the columns city, celsius, date (rows for Bern and Zurich, dated 24-Sep to 26-Sep) and a region attribute (Uster, with region codes BE and ZH)

Trail Example: Deep Web Bookmarks
– Trail for a bookmark: when querying for train home, also query the train company's website (Web Server) with origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel
– train home → //trainCompany.com//*[origin = "Tel Aviv Uni" and dest = "Haifa Hof Hacarmel"]

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
– Trail for thesauri: when querying for car, also query for auto: car → auto
– Trails for dictionaries: when querying for car, also query for carro and vice versa: car → automobile, automobile → car
– Data sources: Laptop, Server

Trail Examples: Schema Equivalences
– Trail for a schema match on names: when querying for Employee.empName, also query for Person.name: //Employee//*.tuple.empName → //Person//*.tuple.name
– Trail for a schema match on salaries: when querying for Employee.salary, also query for Person.income: //Employee//*.tuple.salary → //Person//*.tuple.income
– Figure (DB Server): tables Employee (empName, salary) and Person (name, age, income); further attributes shown: empId, SSN

How are Trails Created?
Given by the user
– Explicitly
– Via relevance feedback
(Semi-)automatically
– Automatic schema matching
– Ontologies and thesauri (e.g., WordNet)
– User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails
Probabilistic trails:
– Model uncertain trails
– Probabilities are used to rank trails
– Form: $Q_L[.C_L] \rightarrow Q_R[.C_R]$, $0 \le p \le 1$
– Example: car → auto, p = 0.9
– The probability p reflects the likelihood that results obtained by the trail are correct.

Uncertainty and Trails (continued)
Scored trails:
– Give a higher value to certain trails
– Scoring factors boost the scores of results obtained by the trail
– Form: $Q_L[.C_L] \rightarrow Q_R[.C_R]$, $sf \ge 1$
– Example T1: weather → //Temperatures/*, sf ≥ 1
– Example T2: yesterday → //*[date = today() – 1], sf ≥ 1
Intuition: sf reflects the relevance of the trail.
– Results obtained by the trail are scored sf times higher than results obtained without the trail.
– If no scoring factor is available, sf = 1
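One way to picture the scoring factor is as a multiplier applied to the scores of results that were reached through a trail. The Python sketch below follows that reading; the function name, the dictionary representation of result scores, and the choice to keep the maximum when a result is found both ways are assumptions, not details given on the slides.

```python
def merge_scores(original, via_trail, sf=1.0):
    """Merge result scores, boosting results obtained by a trail by sf.

    original and via_trail map result ids to base scores. When a result
    appears in both, the higher of the two scores is kept (an assumption).
    """
    merged = dict(original)
    for doc, score in via_trail.items():
        merged[doc] = max(merged.get(doc, 0.0), score * sf)
    return merged


scores = merge_scores({"a": 0.5}, {"a": 0.3, "b": 0.4}, sf=2.0)
```

With the default `sf=1.0`, results reached through the trail keep their base score, which matches the rule that a missing scoring factor means sf = 1.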

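Rewriting Queries with Trails
Query: weather yesterday. Trail T2: yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the leaf yesterday of the query tree
(2) Transformation: the right side of the trail, //*[date = today() – 1], is turned into a query subtree
(3) Merging: the matched leaf is replaced by the union of yesterday and //*[date = today() – 1]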

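Replacing Trails
Trails that use replace semantics instead of union semantics.
Query: weather yesterday. Trail T2 (replace semantics): yesterday is replaced by //*[date = today() – 1]
(1) Matching: T2 matches the leaf yesterday of the query tree
(2) Transformation: the right side of the trail is turned into a query subtree
(3) Merging: the matched leaf yesterday is replaced by //*[date = today() – 1], so the query becomes weather combined with //*[date = today() – 1]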
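The matching, transformation, and merging steps, with either union or replace semantics, can be sketched on a toy query tree. In this Python sketch leaves are strings and inner nodes are tuples of the form `(operator, child, ...)`. Matching is plain string equality on leaves, which is much weaker than real trail matching, and the `"and"` root for the keyword query is an assumption.

```python
def rewrite(query, left, right, replace=False):
    """Apply the trail left -> right once to every matching leaf."""
    if isinstance(query, str):                      # leaf
        if query == left:                           # (1) matching
            # (2) transformation is trivial here: right is already a leaf
            # (3) merging: union semantics keeps the original leaf
            return right if replace else ("union", query, right)
        return query
    op, *children = query
    return (op, *(rewrite(c, left, right, replace) for c in children))


q = ("and", "weather", "yesterday")
t2_right = "//*[date = today() - 1]"
with_union = rewrite(q, "yesterday", t2_right)
with_replace = rewrite(q, "yesterday", t2_right, replace=True)
```

A single call does not revisit the subtree it just created, so it cannot loop. Applying the same trail again to `with_union` would match the remaining `yesterday` leaf a second time.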

Problem: Recursive Matches (1/2)
– T2: yesterday → //*[date = today() – 1]
– After T2 is applied to the query weather yesterday, the leaf yesterday becomes the union of yesterday and //*[date = today() – 1]
– The new query still matches T2, so T2 could be applied again, and then again to each new result
– Result: infinite recursion

Problem: Recursive Matches (2/2)
Trails may be mutually recursive.
– T3: //*.tuple.date → //*.tuple.modified
– T10: //*.tuple.modified → //*.tuple.date
– T3 matches //*[date = today() – 1] and adds //*[modified = today() – 1]
– T10 matches //*[modified = today() – 1] and adds //*[date = today() – 1] again
– We again match T3 and enter an infinite loop

Algorithm to solve recursion: MMCA
Multiple Match Coloring Algorithm (MMCA):
– Keep a history of all trails matched or introduced
– Given a set of trails Y, for every trail t in Y: apply t to Q iteratively and color the query tree nodes in Q according to the trails that have already touched those nodes

Multiple Match Coloring Algorithm: example
Trails:
– T1: weather → //Temperatures/*
– T2: yesterday → //*[date = today() – 1]
– T3: //*.tuple.date → //*.tuple.modified
– T4: //*.tuple.date → //*.tuple.received
Query: weather yesterday
– First level: T1 and T2 match; weather is extended with //Temperatures/* and yesterday with //*[date = today() – 1]
– Second level: T3 and T4 match //*[date = today() – 1], which is extended with //*[modified = today() – 1] and //*[received = today() – 1]
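The coloring idea can be sketched as follows. This is a simplified variant written for illustration, not the algorithm as published: leaves carry the set of trail names that already touched them, a trail is never applied to a leaf that carries its color, and matching is reduced to string equality. All names (`Leaf`, `apply_level`, `mmca`, `leaves`) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    expr: str
    colors: frozenset = frozenset()   # trails that already touched this leaf


def apply_level(node, trails):
    """Rewrite one level; return (new tree, number of trail applications)."""
    if isinstance(node, Leaf):
        hits = [(name, right) for name, (left, right) in trails.items()
                if left == node.expr and name not in node.colors]
        if not hits:
            return node, 0
        colors = node.colors | {name for name, _ in hits}
        branches = [Leaf(node.expr, colors)] + [Leaf(r, colors) for _, r in hits]
        return ("union", *branches), len(hits)
    op, *children = node
    rewritten = [apply_level(c, trails) for c in children]
    return (op, *(t for t, _ in rewritten)), sum(n for _, n in rewritten)


def mmca(query, trails, max_levels=10):
    """Apply trails level by level until nothing matches any more."""
    for _ in range(max_levels):
        query, applied = apply_level(query, trails)
        if applied == 0:
            break
    return query


def leaves(node):
    """Leaves of the tree in left-to-right order."""
    if isinstance(node, Leaf):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]


# Mutually recursive trails, reduced to symbolic attribute names
trails = {"T3": ("date", "modified"), "T10": ("modified", "date")}
result = mmca(Leaf("date"), trails)
```

On the mutually recursive pair the rewrite stops after two levels, because every leaf is then colored by both trails.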

Multiple Match Coloring Algorithm (continued)
MMCA is exponential in the number of levels
– Any of the trails can be applied to every leaf, and each trail can generate additional leaves.
Solution: trail pruning
– Limit the number of levels: punish recursive rewrites
– Keep only the top-K trails matched in each level, ranked by probability / certainty / weight
– Other options: timeout, progressively compute query results
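Top-K pruning within one level amounts to sorting the matched trails by their weight and cutting the list. A minimal Python sketch, assuming each match is a `(trail name, probability)` pair and using the hypothetical function name `prune`:

```python
def prune(matches, k):
    """Keep the k matched trails with the highest probability."""
    return sorted(matches, key=lambda m: m[1], reverse=True)[:k]


level_matches = [("T1", 0.9), ("T2", 0.2), ("T3", 0.5)]
kept = prune(level_matches, 2)
```

`sorted` is stable, so trails with equal probability keep their original relative order.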

iTrails Evaluation in iMeMex
Main questions in the evaluation:
– Quality: top-K precision and recall
– Performance: use of materialization
– Scalability: query-rewrite time vs. number of trails

iTrails Evaluation in iMeMex
Scenario 1: few high-quality trails
– Closer to information integration use cases
– Obtained real datasets and indexed them
– 18 hand-crafted trails
– 14 hand-crafted queries
Scenario 2: many low-quality trails
– Closer to search use cases
– Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1%

iTrails Evaluation in iMeMex: Scenario 1
Configured iMeMex to act in three modes:
– Baseline: graph / IR search engine
– iTrails: rewrite search queries with trails
– Perfect query: semantics-aware query
Data: shipped to a central index from the Laptop, Server, Web Server, and DB Server sources (table of sizes in MB)

Trails and queries used in Scenario 1
– Max original tree size: 14
– Max final tree size after applying trails: 35
– Max number of trails applied: 5

Quality: Top-K Precision and Recall (K = 20)
Scenario 1: few high-quality trails (18 trails)
– The search engine misses relevant results
– The search query is partially semantics-aware
– The perfect query always has precision and recall equal to 1
Figure: precision and recall per query
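For reference, top-K precision and recall can be computed as below. This is the usual textbook definition and an assumption about the evaluation, since the slides do not spell out the formula; in particular, precision is taken over the results actually returned within the top K.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall of the first k entries of a ranked result list."""
    top = ranked[:k]
    hits = sum(1 for r in top if r in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data invented for the example
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2)
```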

Performance: Use of Materialization
Scenario 1: few high-quality trails (18 trails)
– Trail merging adds overhead to query execution
– Trail materialization improves performance for almost all queries

Scalability: Query-rewrite Time vs. Number of Trails (Scenario 2)
– Without pruning, query plan sizes grow exponentially
– Query-rewrite time can be controlled with pruning

Summary
– First framework to explore pay-as-you-go information integration in dataspaces
– iTrails: a generic method to model semantic relationships gradually
– Trails are used to rewrite queries
– An algorithm controls recursive query rewrites

Personal opinion: advantages
The method is incremental
– Integrators can collect statistics, find the most common queries, and define trails for popular queries first.
– Dynamic system: if the popular queries change over time, trails for less popular queries can be disabled to reduce system workload.
Trails can be defined independently by domain experts for each data domain.

Personal opinion: disadvantages
– Trails are global: every rewritten query is evaluated over every data source, although a trail can have a different meaning for different data sources.
– For good query result quality, trails have to be defined manually, which is a problem for large systems. Solution: use machine learning techniques to improve automatic trail creation.
– Overlaps and inconsistencies between trails are possible, since a query returns the union of the results satisfying all trails. Solution: trail mining and weighting would be helpful here.

Questions?

Bibliography
– iTrails: Pay-as-you-go Information Integration in Dataspaces. Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org
– From Databases to Dataspaces: A New Abstraction for Information Management. Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University)
– Wikipedia: dataspace, memex
– iMeMex information

Backup slides

Multiple Match Coloring Algorithm: analysis
Algorithm runtime:
– L: number of leaves in query Q
– M: maximum number of leaves introduced into the query by a trail
– N: number of trails
– $d \in \{1, \ldots, N\}$: number of levels
Theorem: the maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by $O(L \cdot M^d)$.

MMCA run time analysis: $O(L \cdot M^d)$
– If trail t matches in query Q, it colors the matched leaf nodes of Q. A subtree containing only these nodes is not matched again by t.
– Worst case: in each level, only one of the trails matches for each of the leaves.
– 1st level: a trail matches each of the L leaves and adds M new leaves per leaf. That gives LM new leaves plus L old leaves, so L(M+1) leaves and L trail applications for the first level.
– 2nd level: t no longer matches any of the leaves (they were colored in the 1st level). However, all leaves may still be matched by the remaining N−1 trails. Worst case, again, only one of the trails matches each existing leaf.
– d-th level: this leads to L(M+1)^(d−1) trail applications and a total of L(M+1)^d leaves.
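The worst-case counts from this argument can be checked numerically. The Python sketch below simulates the growth level by level under the stated worst case (one trail application per existing leaf, each adding M leaves); the function name `worst_case` is an assumption.

```python
def worst_case(L, M, d):
    """Return (total trail applications, leaves) after d levels."""
    leaves, applications = L, 0
    for _ in range(d):
        applications += leaves    # one trail application per existing leaf
        leaves += leaves * M      # each application adds M new leaves
    return applications, leaves


apps, n_leaves = worst_case(L=2, M=1, d=3)
```

After d levels the leaf count equals L(M+1)^d, and the applications in the last level alone equal L(M+1)^(d−1), which is where the exponential dependence on the number of levels comes from.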

iDM: Lazily Computed Graph
– Nodes and edges are lazily computed
– Each node is a resource view

iDM: Lazily Computed Graph
iDM is not a static model
– Every component of every resource view may be created on demand
– Every resource view may be created on demand
Behind the scenes, obtaining the content may:
– Read a file on the filesystem
– Access a page on the web
– Fetch the data from an index structure
Behind the scenes, obtaining the group may:
– Get the children of a folder in the filesystem
– Look up an edge replica
– Obtain the sections of a document

How to implement iDM: Architectural Perspective
– Indexes and replicas access (warehousing)
– Data source access (mediation)
– Complex operators (query algebra)

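Data management approaches
Features compared across the integration solutions Search, Dataspaces, and Data Integration:
– Integration effort: Search: low; Dataspaces: pay-as-you-go; Data Integration: high
– Query semantics: Search: precision / recall; Data Integration: precise
– Need for schema: Search: schema-never; Dataspaces: schema-later; Data Integration: schema-first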

Canonical form
The canonical form $\Gamma(Q)$ of a query Q is obtained by decomposing Q into location step separators (LS_SEP) and predicates (P) according to the grammar. $\Gamma(Q)$ is constructed by the following recursion:
$$
tree := \begin{cases}
G & \text{if tree is empty} \\
\omega(tree) & \text{if LS\_SEP} = \texttt{//} \text{ and not first location step} \\
\mu(tree) & \text{if LS\_SEP} = \texttt{/} \text{ and not first location step} \\
tree \cap \sigma_P(G) & \text{otherwise}
\end{cases}
$$
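The recursion can be sketched as a small string builder in Python. It assumes that every location step after the first applies the unnest operator for its separator and then intersects with the selection for its predicate, that μ stands for shallow unnest, and that ω stands for deep unnest (the ω symbol is an assumption). The function name and the input format, a list of `(separator, predicate)` pairs, are invented for the example.

```python
def canonical(steps):
    """Build the canonical algebra expression for a list of location steps."""
    tree = "G"                                  # empty tree starts at G
    for i, (sep, pred) in enumerate(steps):
        if i > 0:                               # not the first location step
            tree = f"ω({tree})" if sep == "//" else f"μ({tree})"
        tree = f"({tree} ∩ σ[{pred}](G))"       # apply the predicate
    return tree


expr = canonical([("//", "name=mike"), ("/", "name=papers")])
```

For //mike/papers this produces a shallow unnest of the nodes named mike, intersected with the nodes named papers.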