iTrails: Pay-as-you-go Information Integration in Dataspaces
Presented by Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi (ETH Zurich)
Summarized by Sungchan Park
Copyright 2008 by CEBT Problem: Querying Several Sources Center for E-Business Technology
Solution #1: Use a Search Engine
Solution #2: Use an Information Integration System
iTrails Core Idea
Is there an integration solution in between these two extremes?
Declaratively add lightweight ‘hints’ to a search engine, thus allowing gradual enrichment of loosely integrated data sources.
Example Scenario
Query: “pdf yesterday”
Hints (trails):
1. The date attribute is mapped to the modified attribute.
2. The date attribute is mapped to the received attribute.
3. The yesterday keyword is mapped to a query for values of the date attribute equal to yesterday's date.
4. The pdf keyword is mapped to a query for elements whose names end in pdf.
Where Do Hints Come From?
Given by the user
– Explicitly
– Via relevance feedback
(Semi-)automatically
– Information extraction techniques
– Automatic schema matching
– Ontologies and thesauri (e.g., WordNet)
– User communities (e.g., trails on gene data, bookmarks)
All these aspects are beyond the scope of this paper.
Data and Query Model
Data model: all data is assumed to be represented by a logical graph G.
Query model: a query is also represented by a graph.
Query Syntax
Query Example
//Home/projects//*[“Mike”]
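To make the data and query model concrete, the following is a minimal sketch of how a path query such as the one above could be evaluated over a small logical graph. The `Node` class, the helper names, and the sample data are assumptions made for illustration; they are not the iMeMex implementation, and the graph is simplified to a tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node: a name, optional text content, and children."""
    name: str
    content: str = ""
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node strictly below `node` (the `//*` step)."""
    for child in node.children:
        yield child
        yield from descendants(child)


def child_named(node, name):
    """Follow one `/name` step; names are compared case-insensitively."""
    return [c for c in node.children if c.name.lower() == name.lower()]


def query_home_projects(root, keyword):
    """Evaluate //Home/projects//*["keyword"] against the graph under `root`."""
    homes = [n for n in [root, *descendants(root)] if n.name.lower() == "home"]
    hits = []
    for home in homes:
        for projects in child_named(home, "projects"):
            hits += [n for n in descendants(projects) if keyword in n.content]
    return hits


# Invented sample data: only notes.txt lies below Home/projects and mentions Mike.
root = Node("root", children=[
    Node("Home", children=[
        Node("projects", children=[
            Node("PIM", children=[Node("notes.txt", "Meeting with Mike")]),
            Node("OLAP", children=[Node("paper.tex", "Draft by Anna")]),
        ]),
        Node("music", children=[Node("mike.mp3", "Mike live")]),
    ]),
])

print([n.name for n in query_home_projects(root, "Mike")])
```

The file under `music` also mentions Mike but is not returned, because the path restricts the keyword search to descendants of `Home/projects`.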
Basic Form of a Trail
A unidirectional trail
A bidirectional trail
Trail Example
Trails in the example scenario
Given query: “pdf yesterday”
Transformed query: //*.pdf[modified=yesterday() OR received=yesterday()]
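The rewrite in this example can be sketched at the string level as follows. This is a simplified illustration, not the iTrails rewriting algorithm: the two lookup tables and the function names are hypothetical, and the sketch only reproduces the transformed query shown on the slide.

```python
# Hypothetical trail tables mirroring the four hints of the example scenario.
KEYWORD_TRAILS = {
    "pdf": ("path", "//*.pdf"),                      # elements whose names end in pdf
    "yesterday": ("predicate", "date=yesterday()"),  # date equal to yesterday's date
}
ATTRIBUTE_TRAILS = {"date": ["modified", "received"]}  # date -> modified, received


def expand_predicate(predicate):
    """Replace the attribute of `attr=value` by the attributes it is mapped to."""
    attribute, sep, value = predicate.partition("=")
    if not sep:  # plain keyword predicate, nothing to expand
        return predicate
    targets = ATTRIBUTE_TRAILS.get(attribute, [attribute])
    return " OR ".join(f"{target}={value}" for target in targets)


def rewrite(keyword_query):
    """Rewrite a keyword query into a path query using the trail tables."""
    path, predicates = "//*", []
    for keyword in keyword_query.split():
        # Keywords without a trail stay as plain keyword predicates.
        kind, expansion = KEYWORD_TRAILS.get(keyword, ("predicate", f'"{keyword}"'))
        if kind == "path":
            path = expansion
        else:
            predicates.append(expand_predicate(expansion))
    return path + "".join(f"[{p}]" for p in predicates)


print(rewrite("pdf yesterday"))
```

For the query "pdf yesterday" the sketch prints `//*.pdf[modified=yesterday() OR received=yesterday()]`.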
iTrails Query Processing
1. Matching
2. Transforming
3. Merging
iTrails Query Processing Example
Given query: Q1 = //home/projects//*[“Mike”]
Trail: Ψ8 := //home/*.name -> //calendar//*.tuple.category
Resulting query: Q1{Ψ8} = //home/projects//*[“Mike”] U //calendar//*[category=“project”]//*[“Mike”]
Utilizing: G. Miklau and D. Suciu. Containment and Equivalence for an XPath Fragment. In PODS, 2002.
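The three steps (matching, transforming, merging) can be sketched on this example as shown below. Several simplifications are assumed: the query is passed in already split into a path prefix and a remainder, a glob comparison from Python's `fnmatch` module stands in for the XPath containment test, and the right side of the trail is written out by hand in its instantiated form, so the `.name` and `.tuple.category` projections are not modelled. All function names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(trail_left, query_prefix):
    """Matching: does the trail's left side cover the query's path prefix?"""
    patterns = trail_left.strip("/").split("/")
    steps = query_prefix.strip("/").split("/")
    return len(patterns) == len(steps) and all(
        fnmatchcase(step, pattern) for step, pattern in zip(steps, patterns)
    )


def transform(trail_right, query_rest):
    """Transforming: swap the matched prefix for the trail's right side."""
    return trail_right + query_rest


def merge(original, transformed):
    """Merging: the result is the union of original and transformed query."""
    return f"{original} U {transformed}"


def apply_trail(query_prefix, query_rest, trail):
    """Run the three steps; return the query unchanged if the trail does not match."""
    original = query_prefix + query_rest
    left, right = trail
    if not matches(left, query_prefix):
        return original
    return merge(original, transform(right, query_rest))


# Hand-instantiated stand-in for the trail of the example (an assumption).
PSI_8 = ("//home/*", '//calendar//*[category="project"]')

result = apply_trail("//home/projects", '//*["Mike"]', PSI_8)
print(result)
```

The original query is kept as one branch of the union, so applying a trail can only add results to those of the unmodified search.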
Applying Multiple Trails
MMCA (Multiple Match Colouring Algorithm)
Without a safeguard, trails could be applied over and over without end.
To prevent infinite recursion, a trail should not be rematched to nodes in a logical plan generated by itself.
Other Issues
Trail pruning
Problem: MMCA is exponential in the number of levels
Solution: trail pruning
– Prune by number of levels
– Prune by top-K trails matched in each level (give weights and probabilities to trails)
– Prune by both top-K trails and number of levels
Trail indexing
Precompute trail expressions in order to speed up query processing
– Trail materialization
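The pruning strategies listed on this slide can be sketched as two knobs on a level-by-level expansion. The code below is an illustrative assumption, not the pruning procedure from the paper: trails are reduced to weighted term rewrites, `max_levels` bounds the depth, and `top_k` keeps only the highest-weighted trails matched in each level. Setting both knobs corresponds to the combined strategy.

```python
def expand_with_pruning(term, trails, max_levels, top_k):
    """Expand `term` with weighted trails given as (left, right, weight) triples."""
    result = {term}
    frontier = {term}
    for _ in range(max_levels):            # prune by number of levels
        matched = [t for t in trails if t[0] in frontier]
        matched.sort(key=lambda t: t[2], reverse=True)
        kept = matched[:top_k]             # prune by top-K trails in this level
        frontier = {right for _, right, _ in kept} - result
        if not frontier:
            break
        result |= frontier
    return result


# Invented weights for illustration only.
WEIGHTED_TRAILS = [
    ("car", "auto", 0.9),
    ("car", "vehicle", 0.4),
    ("auto", "automobile", 0.8),
    ("auto", "motorcar", 0.3),
]

print(sorted(expand_with_pruning("car", WEIGHTED_TRAILS, max_levels=2, top_k=1)))
```

With `max_levels=2` and `top_k=1` only the best trail per level survives; raising `top_k` to 2 admits every rewrite in this small example.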
Experiments: Setting
Configured iMeMex to act in three modes
– Baseline: graph / IR search engine
– iTrails: rewrite search queries with trails
– Perfect Query: semantics-aware query
Data
Experiment, Quality
Compare with baseline
Experiment, Overhead
Compare with perfect query
Overhead is not negligible.
However, this can be fixed by exploiting trail materializations.
Experiment, Scalability #1: Rewrite Time
Query-rewrite time can be controlled with pruning.
Experiment, Scalability #2: Quality
Pruning improves precision.
Conclusion
Our contributions
iTrails: a generic method to model semantic relationships (e.g., implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
We propose a framework and algorithms for pay-as-you-go information integration
Smooth transition between search and data integration
Future work
Trail creation
– Use collections (ontologies, thesauri, Wikipedia)
– Work on automatic mining of trails from the dataspace
Other types of trails