Efficient Keyword Search over Virtual XML Views Feng Shao and Lin Guo and Chavdar Botev and Anand Bhaskar and Muthiah Chettiar and Fan Yang Cornell University.

Presentation transcript:

Efficient Keyword Search over Virtual XML Views
Feng Shao, Lin Guo, Chavdar Botev, Anand Bhaskar, Muthiah Chettiar and Fan Yang (Cornell University); Jayavel Shanmugasundaram (Yahoo! Research)
Summarized and presented by Dongmin Shin, IDS Lab., Seoul National University

Copyright © 2007 by CEBT
Index
- Introduction
- Background
- System Overview
- QPT Generation Module
- PDT Generation Module
- Experiments
- Conclusion and Future Work

Introduction
A fundamental assumption of traditional information retrieval systems: the set of documents being searched is materialized.

Introduction
But the view is often virtual (unmaterialized):
- The aggregator may not have the resources to materialize all the data
- If the view is materialized, its contents may be out of date, or maintaining the view may be expensive
- The data sources may not wish to provide the entire data

Introduction
Examples
- Personalized views: MyYahoo or Microsoft SharePoint
  - There are many users, and their content often overlaps
  - Materializing each user's view could lead to data duplication and the associated space overhead
- Information integration

Introduction
Need: efficiently evaluating keyword search queries over virtual XML views

Background
XML Scoring (TF-IDF method)
- tf(e,k): the number of distinct occurrences of the keyword k in element e and its descendants
- idf(k) = (formula not preserved in the transcript)
- score(e,Q) = (formula not preserved in the transcript)
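As a rough illustration of the scoring slide, the sketch below computes tf(e,k) over an element and its descendants and combines it with an idf term. The transcript does not preserve the idf and score formulas, so the logarithmic idf and the sum-of-products score used here are the textbook TF-IDF forms, assumed for illustration rather than taken from the slides. The `Element` class and all function names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical XML element: its own text tokens plus child elements."""
    tag: str
    tokens: list = field(default_factory=list)
    children: list = field(default_factory=list)


def tf(e, k):
    """Occurrences of keyword k in element e and all of its descendants."""
    return e.tokens.count(k) + sum(tf(c, k) for c in e.children)


def idf(k, view_elements):
    """ASSUMED textbook form: log(N / number of view elements containing k)."""
    containing = sum(1 for e in view_elements if tf(e, k) > 0)
    return math.log(len(view_elements) / containing) if containing else 0.0


def score(e, query, view_elements):
    """ASSUMED form: sum over the query keywords of tf * idf."""
    return sum(tf(e, k) * idf(k, view_elements) for k in query)


# Two view elements; "xml" appears only in the first, "search" in both.
book1 = Element("book", ["xml"], [Element("title", ["xml", "search"])])
book2 = Element("book", ["databases"], [Element("title", ["search"])])
view = [book1, book2]
```

Under these assumptions a keyword that occurs in every view element (here "search") contributes nothing to the score, while a rarer keyword ("xml") dominates the ranking.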

System Overview
(1) Users issue keyword queries over virtual views
(2) The parser redirects the query to the Query Pattern Tree (QPT) Generation Module
(3) The QPT is sent to the Pruned Document Tree (PDT) Generation Module
(4) The PDT Generation Module generates PDTs using only the path indices and inverted list indices
(5) The rewritten query and the PDTs are sent to the Evaluator
(6) The Evaluator produces the view that contains all view elements with pruned content
(7) Elements are scored, and only those with the highest scores are fully materialized using document storage

System Overview
XML Storage
- Dewey IDs
  - A popular ID format
  - A hierarchical numbering scheme
  - The ID of an element contains the ID of its parent
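The property that an element's ID contains its parent's ID can be sketched with a few helpers. The dotted string encoding (for example "1.2.3") and the function names are assumptions made for illustration; the slides do not say how the system encodes Dewey IDs.

```python
def parse(dewey):
    """Split a dotted Dewey ID such as "1.2.3" into integer components."""
    return tuple(int(p) for p in dewey.split("."))


def parent(dewey):
    """The parent's ID is the element's ID minus its last component."""
    parts = parse(dewey)
    return ".".join(map(str, parts[:-1])) if len(parts) > 1 else None


def is_ancestor(a, d):
    """a is a proper ancestor of d when a's components are a strict prefix of d's."""
    pa, pd = parse(a), parse(d)
    return len(pa) < len(pd) and pd[:len(pa)] == pa


def document_order(ids):
    """Component-wise comparison of Dewey IDs yields document order."""
    return sorted(ids, key=parse)
```

Comparing integer components rather than raw strings matters: "1.2" is not an ancestor of "1.20", and "1.10" sorts after "1.2".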

System Overview
XML Indexing
- Path indices
  - Used to evaluate XML path and twig (i.e., branching path) queries
  - Store XML paths with values in a relational table
  - Use indices such as a B+-tree
  - One row for each unique (Path, Value) pair
  - IDList: the list of IDs of all elements on the path
  - A B+-tree index is built on the (Path, Value) pair
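A minimal in-memory sketch of such a path index follows. It keeps one row per unique (Path, Value) pair, each holding an IDList. In place of a real B+-tree it uses a sorted key list searched with `bisect`, which is a stand-in chosen for brevity and not the structure used in the presented system. The class name, method names and sample paths are hypothetical.

```python
from bisect import bisect_left, insort


class PathIndex:
    """One row per unique (path, value) pair, each mapping to an IDList."""

    def __init__(self):
        self._rows = {}   # (path, value) -> list of Dewey IDs
        self._keys = []   # sorted keys; stand-in for the B+-tree ordering

    def add(self, path, value, dewey_id):
        key = (path, value)
        if key not in self._rows:
            self._rows[key] = []
            insort(self._keys, key)
        self._rows[key].append(dewey_id)

    def lookup(self, path, value):
        """Exact-match lookup on (path, value)."""
        return self._rows.get((path, value), [])

    def lookup_path(self, path):
        """Ordered scan over all rows that share the given path."""
        i = bisect_left(self._keys, (path,))
        rows = []
        while i < len(self._keys) and self._keys[i][0] == path:
            key = self._keys[i]
            rows.append((key[1], self._rows[key]))
            i += 1
        return rows


idx = PathIndex()
idx.add("/books/book/year", "2005", "1.1.3")
idx.add("/books/book/year", "2005", "1.2.3")
idx.add("/books/book/year", "1998", "1.3.3")
idx.add("/books/book/title", "XML", "1.1.1")
```

Because the keys are ordered by path first and value second, a lookup on a path alone becomes a contiguous range scan, which is the access pattern an index on the (Path, Value) pair supports.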

System Overview
- Inverted list indices
  - For each keyword in the document collection, store the list of XML elements that directly contain the keyword
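The inverted lists can be sketched as a mapping from each keyword to the elements that directly contain it. Storing a per-element term frequency next to each Dewey ID is an assumption that anticipates the tf lookups described later in the deck. Whitespace tokenization, lowercasing and the function name are also illustrative choices.

```python
from collections import Counter, defaultdict


def build_inverted_lists(elements):
    """elements: iterable of (dewey_id, text directly contained in that element)."""
    lists = defaultdict(list)
    for dewey_id, text in elements:
        for keyword, count in Counter(text.lower().split()).items():
            lists[keyword].append((dewey_id, count))
    # Keep each posting list in document order (component-wise Dewey order).
    for postings in lists.values():
        postings.sort(key=lambda p: tuple(int(x) for x in p[0].split(".")))
    return dict(lists)


elements = [
    ("1.2.1", "search search engines"),
    ("1.1.1", "XML search"),
    ("1.2", ""),  # an element with no direct text produces no postings
]
lists = build_inverted_lists(elements)
```

Only direct containment is indexed here; the tf of an ancestor element would be obtained by combining the postings of its descendants, which the Dewey ID prefix relationship makes possible.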

QPT (Query Pattern Tree) Generation Module
Node annotations in the QPT:
- V: used for query evaluation
- C: used for result materialization

PDT Generation Module
Output
- Contains only elements that correspond to nodes in the QPT
- Contains only element values that are required during query evaluation
Advantages
- Query evaluation is likely to be more efficient and scalable
  - Since the PDT is much smaller than the underlying data
- Allows us to use the regular (unmodified) query evaluator
  - Since the PDT is in regular XML format

PDT Generation Module
Key Idea
An element e in the document corresponding to a node n in the QPT is selected for inclusion only if it satisfies three types of constraints:
(1) Ancestor constraint: an ancestor element of e that corresponds to the parent of n in the QPT should also be selected
(2) Descendant constraint: for each mandatory edge from n to a child of n in the QPT, at least one child/descendant element of e corresponding to that child of n should also be selected
(3) Predicate constraint: if e is a leaf node, it satisfies all predicates associated with n
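The three constraints can be checked with a simple two-pass filter over candidate elements, sketched below. This is a batch simplification written for illustration and is not the single-pass, index-driven algorithm the following slides describe. The QPT encoding (dictionaries with `children`, `mandatory` and an optional `predicate`), the dotted Dewey IDs and the sample data are all assumptions.

```python
def is_ancestor(a, d):
    """True if dotted Dewey ID a is a proper ancestor of d."""
    return d.startswith(a + ".")


def select(qpt, root, candidates):
    """Return, per QPT node, the candidate element IDs satisfying all constraints."""
    order, stack = [], [root]
    while stack:  # pre-order traversal of the QPT
        n = stack.pop()
        order.append(n)
        stack.extend(qpt[n]["children"])

    kept = {}
    # Bottom-up pass: predicate and descendant constraints.
    for n in reversed(order):
        pred = qpt[n].get("predicate")
        kept[n] = {
            e for e, value in candidates.get(n, {}).items()
            if (pred is None or pred(value))
            and all(any(is_ancestor(e, d) for d in kept[c])
                    for c in qpt[n]["mandatory"])
        }
    # Top-down pass: ancestor constraint.
    for n in order:
        for c in qpt[n]["children"]:
            kept[c] = {d for d in kept[c]
                       if any(is_ancestor(e, d) for e in kept[n])}
    return kept


# Hypothetical QPT: books/book with a mandatory year > 1995 and an optional title.
qpt = {
    "books": {"children": ["book"], "mandatory": ["book"]},
    "book": {"children": ["year", "title"], "mandatory": ["year"]},
    "year": {"children": [], "mandatory": [], "predicate": lambda v: int(v) > 1995},
    "title": {"children": [], "mandatory": []},
}
candidates = {
    "books": {"1": None},
    "book": {"1.1": None, "1.2": None},
    "year": {"1.1.2": "1990", "1.2.2": "2005"},
    "title": {"1.1.1": "Old XML", "1.2.1": "New XML"},
}
selected = select(qpt, "books", candidates)
```

In this example book 1.1 is dropped because its only year fails the predicate, and its title 1.1.1 is then dropped by the ancestor constraint even though title carries no predicate of its own.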

PDT Generation Module
PrepareList
(1) Issues a lookup on the path indices for each QPT node that has no mandatory child edges
(2) Identifies nodes that have a 'v' annotation to obtain values and IDs
(3) Looks up the inverted list indices and retrieves the list of Dewey IDs containing the keywords, along with tf values

PDT Generation Module
Candidate Tree (CT)

PDT Generation Module
Step 1: adding new IDs
- Adds the current minimum IDs in pathLists

PDT Generation Module
Step 2: creating PDT nodes
- Create PDT nodes from CT nodes
- Top-down
- Check the DM value of each CT node
  - If it is "1", create the node in the pdt cache
  - If not, check the children of that node
    - If the DM value of a child node is "1", create it in the pdt cache of the parent node

PDT Generation Module
Step 3: removing CT nodes
- Bottom-up
- Check whether each node satisfies the ancestor constraints
  - If not, remove it
  - If so, propagate it to the pdt cache of the ancestor
- If a node has no children and does not satisfy the descendant constraints, remove it

PDT Generation Module
- When the root node "books" is removed, all IDs in its pdt cache are propagated to the result PDT

Experiments
- 500MB INEX dataset
- Varying parameters
  - Size of data, # keywords, selectivity of keywords
  - # of joins, join selectivity, level of nesting
  - # of results, avg. size of view element
- Four alternative approaches
  - Baseline
  - GTP: general solution to integrate structure and keyword search queries
  - Efficient: proposed architecture
  - Proj: techniques of projecting XML documents

Experiments
- EFFICIENT is a scalable and efficient solution
- The cost of generating PDTs scales gracefully
- The overhead of post-processing (scoring and materializing) is negligible
- The cost of the query evaluator dominates the entire cost

Experiments
- Run time for EFFICIENT increases slightly, because it accesses more inverted lists to retrieve tf values
- Run time for EFFICIENT increases, because the cost of the query evaluation increases

Conclusion and Future Work
Conclusion
- A general technique for evaluating keyword search queries over views
- Efficient over a wide range of parameters
Future Work
- Instead of using the regular query evaluator, we could use the techniques proposed for ranked query evaluation
- Views may contain non-monotonic operators such as group-by