iTrails: Pay-as-you-go Information Integration in Dataspaces. Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi.

iTrails: Pay-as-you-go Information Integration in Dataspaces. Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi (ETH Zurich). VLDB 2007. Anat Heilper, Jan. 2009, CS Seminar in Databases (236826) 1

Problem: Querying heterogeneous data sources. Data sources: Laptop, Email Server, Web Server, DB Server. Query: "What is the impact of the global depression in Israel?" Query system: ???? 2

Solution 1: Use a Search Engine. The data sources (Laptop, Email Server, Web Server, DB Server) provide text and links to a Graph IR search engine, which answers the keyword query: global depression Israel. Examples: TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]. Drawback: query semantics are not precise! 3

Solution 2: Use an Information Integration System. The query goes through a query interface over a global schema, which is mapped to the source schemas of data sources 1, 2, and 3 (example attributes: price index, countries, unemployment, crime rate); the result is returned through the same interface. Drawback: too much effort to provide schema mappings! 4

Two opposite approaches to querying heterogeneous data sources. Schema-first approach (SFA) – Semantically integrated view over the data sources – Mappings between source schemas and mediated schema – Queries have clearly defined semantics – Drawback: expensive to construct and maintain – Drawback: not all data sources have schemas. No-schema approach (NSA) – Keyword search – Requires good result ranking methods – Drawback: performs no integration – Drawback: query semantics are not well defined 5

Motivation of iTrails: can we find an integration solution in between these two extremes? Figure: a dataspace system positioned between the Graph IR search engine and the data integration system, over sources such as Temps, Cities, CO2, and Sunspots (text, links). The more effort you pay, the more query power you have. 6

iTrails Core Idea: Add Integration Hints Incrementally 1) Provide search service over the data – Use general graph data model (iDM) – handles unstructured documents, XML, and relations 2) Add integration semantics via hints (trails) 3) If more semantics needed, apply trails – Smooth transition between search and data integration – Semantics added incrementally to improve precision / recall 7

Example of an iDM graph. X_1 = {.name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = ''} X_2 = {.name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = ''} ... X_5 = {.name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF...'} ... Figure: folder graph with nodes home (1), mike (2), papers, PIM, SIGMOD42.pdf (5), SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf 8

General graph data model - iDM iDM (iMeMeX Data Model) represents every structural component of the input data as a node. Supports unstructured, semi-structured and structured data, e.g., files&folders, XML, relations 9

iMeMeX – integrated MeMeX. Vannevar Bush introduced the concept of the “memex” in 1945: "device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility." Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified." 10

Data model. Data is represented by a directed graph G = (RV, E). RV: set {V_1, ..., V_n} of nodes, termed resource views. E: set of ordered pairs (V_i, V_j) of resource views. V_i ⇝ V_j: V_j is reachable from V_i by traversing the edges in E 11

Resource view. A resource view V_i has three components: name, tuple, and content. – V_i.name: string – V_i.tuple: sequence of attribute-value pairs ((att_0, val_0), (att_1, val_1), …) – V_i.content: text. Example: {.name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF...'} 12
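A minimal sketch of a resource view as a Python data structure, assuming a dictionary is an acceptable stand-in for the sequence of attribute-value pairs. The class name `ResourceView` and the field name `attrs` (used instead of `tuple`, which is a Python built-in) are illustrative choices, not part of iDM.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """One iDM node with its three components: name, tuple, content."""
    name: str
    attrs: dict = field(default_factory=dict)  # stands in for the .tuple component
    content: str = ""


# The resource view for SIGMOD42.pdf from the example on this slide.
x5 = ResourceView(
    name="SIGMOD42.pdf",
    attrs={"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
    content="@PDF...",
)
print(x5.name, x5.attrs["owner"])
```

Python dictionaries keep insertion order, so the order of the attribute-value pairs is preserved in this sketch.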

Query model. Query expression: – Query Q selects nodes R := Q(G) ⊆ G.RV – Example: //mike/papers. Component projection: – C ∈ {.name, .tuple, .content}: projection of the set of resource views selected by query Q onto component C, i.e. the set of components R' := {V_i.C | V_i ∈ Q(G)} 13

Component projection example. Example: //mike//PIM/*.tuple.lastmodified. X_1 = {.name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = ''} X_2 = {.name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = ''} ... X_5 = {.name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF...'} ... Figure: folder graph with nodes home (1), mike (2), papers, PIM, SIGMOD42.pdf (5), SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf 14
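Component projection can be illustrated with a small sketch. It assumes the query has already selected a set of resource views, represented here as plain dictionaries; the function name `project` and the second entry's `lastmodified` value are made up for illustration.

```python
# Hypothetical result of a query Q: each resource view is a plain dict.
# The second entry's tuple values are invented for this example.
selected = [
    {"name": "SIGMOD42.pdf",
     "tuple": {"owner": "mike", "lastmodified": "04.01.2007"},
     "content": "@PDF..."},
    {"name": "SIGMOD44.pdf",
     "tuple": {"owner": "mike", "lastmodified": "04.17.2008"},
     "content": "@PDF..."},
]


def project(views, component, attribute=None):
    """Return the component C of every selected view (R' = {V.C | V in Q(G)}).

    If `attribute` is given, narrow the tuple component to that attribute.
    A list is returned instead of a set to keep the result order visible.
    """
    out = []
    for view in views:
        value = view[component]
        if attribute is not None:
            value = value.get(attribute)
        out.append(value)
    return out


print(project(selected, "tuple", "lastmodified"))
```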

Syntax of query expressions
QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*
PATH := (LOCATION_STEP)+
LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?
LS_SEP := `//` | `/`
NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?
KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*
KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*
TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE
OPERATOR := `=` | ` `
LOGOP := `AND` | `OR`
15

Semantics of example query expressions – All nodes in the graph – All nodes in the graph that have 'a' in their content – All nodes in the graph that have 'a' and 'b' in their content – All nodes in the graph with .name == 'A' – All nodes with .name == 'B' that have an incoming edge from a node W with W.name == 'A' 16

Logical algebra for query expressions
Operator     Name                 Semantics
G            All resource views   {V | V ∈ G.RV}
σ_P(I)       Selection            {V | V ∈ I ∧ P(V)}
μ(I)         Shallow unnest       {W | (V, W) ∈ G.E ∧ V ∈ I}
ω(I)         Deep unnest          {W | V ⇝ W ∧ V ∈ I}
I_1 ∩ I_2    Intersection         {V | V ∈ I_1 ∧ V ∈ I_2}
I_1 ∪ I_2    Union                {V | V ∈ I_1 ∨ V ∈ I_2}
17
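The selection operator and the two unnest operators (shallow and deep) can be sketched over a toy graph. The graph below is an assumption made for illustration (node names stand in for resource views, and the edges are not taken from the slides); deep unnest is read here as "reachable through at least one edge".

```python
# Toy graph (assumed for illustration): node names stand in for resource views.
RV = {"home", "mike", "papers", "PIM", "SIGMOD42.pdf"}
E = {("home", "mike"), ("mike", "papers"),
     ("papers", "PIM"), ("PIM", "SIGMOD42.pdf")}


def select(inputs, pred):
    """Selection: keep the views of the input set that satisfy pred."""
    return {v for v in inputs if pred(v)}


def shallow_unnest(inputs):
    """Shallow unnest: direct children of the input views."""
    return {w for (v, w) in E if v in inputs}


def deep_unnest(inputs):
    """Deep unnest: every view reachable from the input views."""
    result, frontier = set(), set(inputs)
    while frontier:
        # Only follow nodes not seen before, so cycles cannot loop forever.
        frontier = shallow_unnest(frontier) - result
        result |= frontier
    return result


# //mike/papers: children of nodes named 'mike', intersected with name == 'papers'.
mike = select(RV, lambda v: v == "mike")
print(shallow_unnest(mike) & select(RV, lambda v: v == "papers"))
print(sorted(deep_unnest(mike)))
```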

Example 18

What have we seen so far? Problem: querying heterogeneous data sources. Find a solution between SFA and NSA – Generic graph data model to describe the data – Queries describe paths in the graph 19

How do iTrails help? Queries are modified by hints (trails), which add or modify the search paths to look at. Example: yesterday → //*[date = today() – 1] 20

iTrails: Defining Trails. Basic form of a trail: Q_L[.C_L] → Q_R[.C_R]. Intuition: when I query for Q_L[.C_L], you should also query for Q_R[.C_R] – Q_L, Q_R: queries (keyword and path expressions) – C_L, C_R: attribute projections

iTrails: Defining Trails. Unidirectional trail: Q_L[.C_L] → Q_R[.C_R]. Intuition: – When querying for Q_L[.C_L], also query for Q_R[.C_R]. Bidirectional trail: Q_L[.C_L] ↔ Q_R[.C_R]. Example: ψ_i := //*.tuple.date ↔ //*.tuple.modified. Q_L, Q_R: queries (keyword and path expressions); C_L, C_R: attribute projections. Query example: global warming zurich or //Temperatures/*[celsius > 10] 22
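A trail definition can be sketched as a small record type. This is an illustrative sketch: the class name `Trail` and its fields are hypothetical, and it assumes that a bidirectional trail can be treated as a pair of unidirectional trails, one per direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    left: str                    # left side, Q_L[.C_L]
    right: str                   # right side, Q_R[.C_R]
    bidirectional: bool = False

    def directed(self):
        """Expand into the unidirectional trails this trail stands for."""
        forward = Trail(self.left, self.right)
        if not self.bidirectional:
            return [forward]
        return [forward, Trail(self.right, self.left)]


# The bidirectional example trail from this slide.
psi = Trail("//*.tuple.date", "//*.tuple.modified", bidirectional=True)
for t in psi.directed():
    print(t.left, "->", t.right)
```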

Trail Examples: Global Warming Zurich. Query: global warming zurich. Trail for implicit meaning: when querying for global warming, also query temperature data above 10 degrees: global warming → //Temperatures/*[celsius > 10]. Trail for an entity: when querying for zurich, also query for references to Zurich as a region: zurich → //*[region = "ZH"]. Figure: Temperatures table with columns city, celsius, date, and region (rows for Bern, Zurich, and Uster) 23

Trail Example: Deep Web Bookmarks. Trail for a bookmark: when querying for train home, also query the train website (on the Web Server) with origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel: train home → //trainCompany.com//*[origin = "Tel Aviv Uni" and dest = "Haifa Hof Hacarmel"] 24

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search. Trail for thesauri: when querying for car, also query for auto. Trails for dictionaries: when querying for car, also query for carro, and vice versa. Trails shown: car → auto; car → automobile; automobile → car. Data sources in the figure: Laptop, Email Server 25

Trail Examples: Schema Equivalences. Trail for a schema match on names: when querying for Employee.empName, also query for Person.name: //Employee//*.tuple.empName → //Person//*.tuple.name. Trail for a schema match on salaries: when querying for Employee.salary, also query for Person.income: //Employee//*.tuple.salary → //Person//*.tuple.income. Figure: tables on the DB Server, Employee (empName, salary) and Person (name, age, income), plus the columns empId and SSN 26

How are Trails Created? Given by the user – Explicitly – Via Relevance Feedback (Semi-)Automatically – Automatic schema matching – Ontologies and thesauri (e.g., wordnet) – User communities (e.g., trails on gene data, bookmarks) 27

Uncertainty and Trails. Probabilistic trails: – Model uncertain trails – Probabilities are used to rank trails: Q_L[.C_L] → Q_R[.C_R] with probability p, 0 ≤ p ≤ 1 – Example: car → auto, p = 0.9. The probability p reflects the likelihood that results obtained by the trail are correct. 28

Uncertainty and Trails (continued). Scored trails: – Give higher value to certain trails – Scoring factors boost the scores of results obtained by the trail: Q_L[.C_L] → Q_R[.C_R] with scoring factor sf ≥ 1. Examples: – T1: weather → //Temperatures/*, sf ≥ 1 – T2: yesterday → //*[date = today() – 1], sf ≥ 1. Intuition: sf reflects the relevance of the trail. – Results obtained by the trail are scored sf times higher than results obtained without the trail. – If no scoring factor is available, sf = 1 29
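The effect of a scoring factor can be shown with a few lines of arithmetic. This is a sketch under the assumption that a result's base score is simply multiplied by sf; the function name and the sample values are made up.

```python
def boosted_score(base_score, sf=1.0):
    """Score of a result obtained through a trail with scoring factor sf.

    sf defaults to 1, the value the slide gives for trails without a
    scoring factor, so the score then stays unchanged.
    """
    if sf < 1:
        raise ValueError("a scoring factor is expected to be >= 1")
    return base_score * sf


# Made-up numbers: a result with base score 0.5, found via a trail with sf = 2.
print(boosted_score(0.5, sf=2.0))
print(boosted_score(0.5))
```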

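Rewriting Queries with Trails. Query: weather yesterday, as the query tree U(weather, yesterday). Trail T2: yesterday → //*[date = today() – 1]. (1) Matching: T2 matches the leaf yesterday. (2) Transformation: the right side of the trail becomes a query subtree. (3) Merging: the matched leaf is combined with the new subtree by a union, giving U(weather, U(yesterday, //*[date = today() – 1])) 30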

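Replacing Trails. Trails that use replace instead of union semantics. Query: U(weather, yesterday). Replacing trail T2: yesterday is replaced by //*[date = today() – 1]. (1) Matching: T2 matches the leaf yesterday. (2) Transformation: the right side of the trail becomes a query subtree. (3) Merging: the matched leaf is replaced by the new subtree, giving U(weather, //*[date = today() – 1]) 31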
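The three rewrite steps, with both union and replace semantics, can be sketched on a query tree built from nested tuples. This is a simplified illustration, not the iMeMex implementation: matching is reduced to exact string equality on leaves, and the function name `rewrite` is hypothetical.

```python
def rewrite(tree, left, right, replace=False):
    """Apply the trail left -> right once to every matching leaf.

    A tree is either a leaf (a string) or a tuple (operator, child, ...).
    """
    if isinstance(tree, str):
        if tree != left:          # (1) matching: exact equality, a simplification
            return tree
        # (2) transformation: the right side becomes a new leaf
        # (3) merging: union with the matched leaf, or replace it
        return right if replace else ("U", tree, right)
    op, *children = tree
    return (op, *[rewrite(child, left, right, replace) for child in children])


query = ("U", "weather", "yesterday")
t2 = ("yesterday", "//*[date = today() - 1]")

print(rewrite(query, *t2))
print(rewrite(query, *t2, replace=True))
```

With union semantics the original leaf stays in the tree, which is what makes repeated matching possible; with replace semantics it disappears.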

Problem: Recursive Matches (1/2). T2: yesterday → //*[date = today() – 1]. After applying T2, the query is U(weather, U(yesterday, //*[date = today() – 1])). The new query still matches T2, so T2 could be applied again: U(weather, U(U(yesterday, //*[date = today() – 1]), ...)). Infinite recursion 32

Problem: Recursive Matches (2/2). Trails may be mutually recursive. T3: //*.tuple.date → //*.tuple.modified. T10: //*.tuple.modified → //*.tuple.date. In the query U(weather, U(yesterday, //*[date = today() – 1])), T3 matches the date predicate and adds //*[modified = today() – 1]; T10 then matches the new leaf and adds //*[date = today() – 1] again. We again match T3 and enter an infinite loop 33

Algorithm to solve recursion - MMCA Multiple Match Coloring Algorithm (MMCA): – Keep history of all trails matched or introduced – Given a set of trails Y. For every trail t in Y: – Apply t to Q iteratively and color the query tree nodes in Q according to the trails that already touched those nodes 34

Multiple Match Coloring Algorithm: example. Trails: T1: weather → //Temperatures/*; T2: yesterday → //*[date = today() – 1]; T3: //*.tuple.date → //*.tuple.modified; T4: //*.tuple.date → //*.tuple.received. Query: U(weather, yesterday). First level: T1 matches weather and T2 matches yesterday, giving U(U(weather, //Temperatures/*), U(yesterday, //*[date = today() – 1])). Second level: T3 and T4 match //*[date = today() – 1] and add //*[modified = today() – 1] and //*[received = today() – 1] 35
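The coloring idea can be sketched as follows. This is a simplified illustration and not the algorithm as published: trails match leaves by exact string equality, every leaf carries the set of trail names that already touched it, and the names `Leaf`, `one_level`, `mmca`, and `leaves` are hypothetical. The two demo trails are shortened stand-ins for a mutually recursive pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    text: str
    colors: frozenset = frozenset()  # names of trails that already touched this leaf


def one_level(node, trails):
    """Apply each trail at most once per leaf; return (new_node, changed)."""
    if isinstance(node, Leaf):
        hits = [(name, right) for name, (left, right) in trails.items()
                if left == node.text and name not in node.colors]
        if not hits:
            return node, False
        used = node.colors | {name for name, _ in hits}
        # Color the matched leaf and the introduced leaves, then merge by union.
        new_leaves = [Leaf(right, used) for _, right in hits]
        return ("U", Leaf(node.text, used), *new_leaves), True
    op, *children = node
    results = [one_level(child, trails) for child in children]
    return (op, *[r[0] for r in results]), any(r[1] for r in results)


def mmca(tree, trails, max_levels=10):
    """Rewrite level by level until no trail matches or the level limit is hit."""
    for _ in range(max_levels):
        tree, changed = one_level(tree, trails)
        if not changed:
            break
    return tree


def leaves(node):
    """Leaf texts of a tree, left to right."""
    if isinstance(node, Leaf):
        return [node.text]
    return [text for child in node[1:] for text in leaves(child)]


# Two mutually recursive stand-in trails: the rewrite still terminates.
trails = {"T3": ("date", "modified"), "T10": ("modified", "date")}
result = mmca(("U", Leaf("weather"), Leaf("date")), trails)
print(leaves(result))
```

Because a leaf introduced by a trail inherits the colors of the leaf it came from, neither trail can be applied a second time along the same branch.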

Multiple Match Coloring Algorithm (cont.) MMCA is exponential in the number of levels – Any of the trails can be applied to every leaf, and each trail can generate additional leaves. Solution: trail pruning – Limit the number of levels: punish recursive rewrites – Keep only the top-K trails matched in each level, ranked by probability / certainty / weight – Other: timeout, progressively compute query results 36
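Top-K pruning within one level can be sketched with the standard library. The sketch assumes each matched trail comes with a single ranking value (for example its probability); the function name `prune` and the sample values are made up.

```python
import heapq


def prune(matches, k):
    """Keep the k highest-ranked trail matches of one level.

    matches: list of (trail_name, rank_value) pairs.
    """
    return heapq.nlargest(k, matches, key=lambda match: match[1])


# Made-up matches for one level, ranked by probability.
level_matches = [("T1", 0.9), ("T2", 0.2), ("T3", 0.5)]
print(prune(level_matches, 2))
```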

iTrails Evaluation in iMeMex Main Questions in Evaluation – Quality: Top-K Precision and Recall – Performance: Use of Materialization – Scalability: Query-rewrite Time vs. Number of Trails 37

iTrails Evaluation in iMeMex Scenario 1: Few High-quality Trails – Closer to information integration use cases – Obtained real datasets and indexed them – 18 hand-crafted trails – 14 hand-crafted queries Scenario 2: Many Low-quality Trails – Closer to search use cases – Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1% 38

iTrails Evaluation in iMeMex: Scenario 1. Configured iMeMex to act in three modes – Baseline: Graph / IR search engine – iTrails: rewrite search queries with trails – Perfect Query: semantics-aware query. Data from the Laptop, Email Server, Web Server, and DB Server was shipped to a central index (figure: sizes in MB) 39

Trails and queries used in Scenario 1 max original tree size: 14 max final tree size after applying trails: 35 max # of trails applied: 5 40

Quality: Top-K Precision and Recall (K = 20). Scenario 1: few high-quality trails (18 trails). Figure: precision and recall per query – The search engine misses relevant results – The search query is partially semantics-aware – The perfect query always has precision and recall equal to 1 41
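For reference, a sketch of how top-K precision and recall are commonly computed. It assumes the usual information retrieval definitions, which may differ in detail from the ones used in the evaluation; the function name and the sample data are made up.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Top-K precision and recall of a ranked result list.

    ranked: result identifiers, best first. relevant: set of relevant ids.
    """
    top = ranked[:k]
    hits = sum(1 for result in top if result in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: one of the two returned results is relevant.
print(precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=2))
```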

Performance: Use of Materialization. Scenario 1: few high-quality trails (18 trails) – Trail merging adds overhead to query execution – Trail materialization improves performance for almost all queries 42

Scalability: Query-rewrite Time vs. Number of Trails (Scenario 2) – Without pruning: exponential growth in the query plan sizes – Query-rewrite time can be controlled with pruning 43

Summary – First framework to explore pay-as-you-go information integration in dataspaces – iTrails: generic method to model semantic relationships gradually – Trails are used to rewrite queries – Algorithm to control recursive query rewrites 44

Personal opinion - advantages. The method is incremental – Integrators can collect statistics, find the most common queries, and define trails for popular queries first. – Dynamic system: if the popular queries change over time, trails for less popular queries can be disabled to reduce system workload. Trails can be defined independently by domain experts for each data domain. 45

Personal opinion - disadvantages – Trails are global: every rewritten query is evaluated over every data source, although a trail can have a different meaning for different data sources. – For good query result quality, trails have to be defined manually, which is a problem for large systems. Possible solution: use machine learning techniques to improve automatic trail creation. – Overlaps and inconsistencies between trails are possible, since a query returns the union of the results satisfying all trails. Possible solution: trail mining and weighting would be helpful here. 46

Questions? 47

Bibliography – Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi: iTrails: Pay-as-you-go Information Integration in Dataspaces. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org – Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University): From Databases to Dataspaces: A New Abstraction for Information Management – Wikipedia: dataspace: http://en.wikipedia.org/wiki/Data_Spaces, memex: http://en.wikipedia.org/wiki/Vannevar_Bush – iMeMex information: http://imemex.ethz.ch/ 48

Backup slides 49

Multiple Match Coloring Algorithm Analysis. Algorithm runtime: – L: number of leaves in query Q – M: maximum number of leaves introduced into the query by a trail – N: number of trails – d ∈ {1, ..., N}: number of levels. Theorem: the maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by O(L M^d) 50

MMCA run time analysis (O(L M^d)). If a trail t matches in query Q, it colors the matched leaf nodes of Q; the subtree containing only these nodes is not matched again by t. Worst case: in each level, only one of the trails matches each of the leaves. 1st level: each of the L leaves is matched once and gains M new leaves, for a total of LM new leaves plus L old ones: L(M+1) leaves and L trail applications. 2nd level: t no longer matches any of these leaves (they were colored in the 1st level), but all leaves may still be matched against the other N − 1 trails. Worst case, again, only one of the trails matches each existing leaf. The d-th level therefore leads to L(M+1)^(d−1) trail applications and a total of L(M+1)^d leaves. 51
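The worst-case counts from this analysis can be tabulated with a short sketch. The function name `worst_case` and the sample values L = 2, M = 3 are made up for illustration.

```python
def worst_case(L, M, d):
    """Worst-case counts for level d of the rewrite.

    Returns (trail applications performed in level d, leaves after level d),
    i.e. L(M+1)^(d-1) and L(M+1)^d.
    """
    return L * (M + 1) ** (d - 1), L * (M + 1) ** d


# A query with 2 leaves and trails that introduce at most 3 leaves each.
for level in (1, 2, 3):
    print(level, worst_case(2, 3, level))
```

The leaf count grows by a factor of M + 1 per level, which is why limiting the number of levels keeps the rewritten query small.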

iDM: Lazily Computed Graph Nodes and edges are lazily computed Each node is a Resource View 52

iDM: Lazily Computed Graph iDM is not a static model – Every component of every Resource View may be created on demand – Every Resource View may be created on demand Behind the scenes, obtaining the content may: – Read a file on the filesystem – Access a page on the web – Fetch the data from an index structure Behind the scenes, obtaining the group may: – Get the children of a folder in the filesystem – Look up an edge replica – Obtain the sections of a document 53

How to implement iDM: Architectural Perspective Indexes&Replicas access (warehousing) Data source access (mediation) Complex operators (query algebra) 54

Data management approaches. Integration solutions compared: Search / Dataspaces / Data Integration – Integration effort: low / pay-as-you-go / high – Query semantics: precision / recall for Search, precise for Data Integration – Need for schema: schema-never / schema-later / schema-first 55

Canonical form. The canonical form Γ(Q) of a query Q is obtained by decomposing Q into location step separators and predicates (P) according to the grammar. Γ(Q) is constructed by the following recursion: tree := G if tree is empty; ω(tree) if LS_SEP = // and not first location step; μ(tree) if LS_SEP = / and not first location step; tree ∩ σ_P(G) otherwise 56
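One possible reading of this recursion, sketched for simple paths. The sketch rests on several assumptions: each separator contributes the G / unnest case and each name predicate contributes an intersection with a selection over G, a `*` name adds no selection, ω is used as an assumed symbol for deep unnest, and bracketed predicates are not handled. The function name `canonical_form` is hypothetical, and the result is built as a string for readability.

```python
import re


def canonical_form(path):
    """Build an algebra expression string for a PATH such as //mike/papers."""
    # Split the path into (separator, name) location steps.
    steps = re.findall(r"(//|/)([^/]+)", path)
    tree = ""
    for sep, name in steps:
        if not tree:
            tree = "G"                 # first location step: start from G
        elif sep == "//":
            tree = f"ω({tree})"        # deep unnest (symbol assumed)
        else:
            tree = f"μ({tree})"        # shallow unnest
        if name != "*":
            # Name predicate: intersect with a selection over G.
            tree = f"{tree} ∩ σ[name={name}](G)"
    return tree


print(canonical_form("//mike/papers"))
```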
