Efficient Keyword Search for Smallest LCAs in XML Database Yu Xu Department of Computer Science & Engineering University of California, San Diego Yannis.

Presentation transcript:

Efficient Keyword Search for Smallest LCAs in XML Database
Yu Xu, Department of Computer Science & Engineering, University of California, San Diego
Yannis Papakonstantinou, Department of Computer Science & Engineering, University of California, San Diego
Rodica Bozianu, XML and Database Systems

Abstract
- Keyword search is a proven, user-friendly way to query HTML documents on the World Wide Web.
- This work covers keyword search in XML documents, modeled as labeled trees, and efficient algorithms for it.
- The answer to a query is the set of smallest trees containing all keywords.

Abstract
- Core contribution: the Indexed Lookup Eager algorithm. It exploits key properties of smallest trees and is used when the query contains keywords with significantly different frequencies.
- The Scan Eager algorithm is tuned for keywords with similar frequencies.
- The algorithms are evaluated analytically and experimentally.
- The XKSearch system is presented. It utilizes the Indexed Lookup Eager, Scan Eager and Stack algorithms.

Outline
1. Introduction
2. Notation
3. Algorithms for finding the SLCA of keyword lists
   1. The Indexed Lookup Eager Algorithm (IL)
   2. Scan Eager Algorithm
   3. The Stack Algorithm
4. XKSearch System Implementation
5. Experiments
6. Conclusions

Introduction
According to the Smallest Lowest Common Ancestor (SLCA) semantics, the result of a keyword query is the set of nodes that:
- contain the keywords either in their labels or in the labels of their descendant nodes, and
- have no descendant node that also contains all keywords.

Introduction
- Example: a query asking for the relation between John and Ben returns the node list [0.1.1, 0.1.2, ].
- The equivalent XQuery is complex and difficult to execute efficiently.


Notation
- Each node v of the tree corresponds to an XML element and is labeled with a tag λ(v).
- Each node v has a numerical id pre(v).
- The XKSearch implementation uses Dewey numbers as the ids. They provide a straightforward solution to locating the LCA of two nodes, and their order (<) is compatible with preorder numbering.
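The two properties of Dewey numbers named above can be illustrated with a minimal Python sketch. It assumes a Dewey number is represented as a tuple of integers (for example `(0, 1, 1)` for 0.1.1); the type alias `Dewey` and the helpers `lca` and `is_ancestor` are illustrative names, not part of XKSearch.

```python
from typing import Tuple

Dewey = Tuple[int, ...]  # assumption: 0.1.1 is written as (0, 1, 1)

def lca(u: Dewey, v: Dewey) -> Dewey:
    """The LCA of two nodes is the longest common prefix of their Dewey numbers."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]

def is_ancestor(u: Dewey, v: Dewey) -> bool:
    """u is a proper ancestor of v when u is a proper prefix of v."""
    return len(u) < len(v) and v[:len(u)] == u

a, b = (0, 1, 1), (0, 1, 2)
print(lca(a, b))                    # (0, 1)
print(is_ancestor((0, 1), a))       # True
# Python compares tuples lexicographically, with a prefix sorting first,
# which is the same order as a preorder traversal of the tree.
print(sorted([b, (0, 1), a, (0,)]))
```

With this representation, comparing two ids and computing an LCA both take time proportional to the depth of the tree.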

Notation
For a list of k keywords and an input XML tree T:
- An answer subtree is a subtree of T that contains at least one instance of each keyword.
- A smallest answer subtree is an answer subtree none of whose subtrees is an answer subtree.
- The query result is the set of the roots of all smallest answer subtrees.

Notation
- Sᵢ is the keyword list of wᵢ, i.e., the list of nodes whose label directly contains wᵢ, sorted by id.
- u ≺ v means that node u is an ancestor of node v; u ⪯ v means that u ≺ v or u = v.
- A node v is an ancestor node in a set S if there exists a node u in S such that v ≺ u.
- If u ≺ v then pre(u) < pre(v).
- lca denotes the lowest common ancestor, e.g., lca( , )=0.1.1

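Notation
- Given k sets of nodes S₁, …, Sₖ, a node v belongs to lca(S₁, …, Sₖ) if there exist nodes v₁ ∈ S₁, …, vₖ ∈ Sₖ such that v = lca(v₁, …, vₖ).
- v belongs to the smallest lowest common ancestor (SLCA) set slca(S₁, …, Sₖ) if v ∈ lca(S₁, …, Sₖ) and no descendant of v is in lca(S₁, …, Sₖ).
- The result is slca(S₁, …, Sₖ) = removeAncestor(lca(S₁, …, Sₖ)), where removeAncestor removes ancestor nodes from its input.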

Notation
- rm(v, S) (lm(v, S)) is the right (left) match of v in the set S: the node of S that has the smallest (biggest) id that is greater (smaller) than or equal to pre(v).
- rm (lm) returns null when there is no right (left) match node.
- Cost: O(log |S|) steps to find the right (left) match, each taking O(d) to compare two Dewey numbers, where d is the depth of the tree.
- descendant(v₁, v₂) returns the other argument when one argument is null, and the descendant node when v₁ and v₂ have an ancestor-descendant relationship. Cost: O(d).
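The match functions can be sketched with binary search over a sorted keyword list. This is a minimal sketch, assuming Dewey numbers are Python tuples and the keyword list is an in-memory sorted list; it only mirrors the definitions on this slide, not the disk-based XKSearch implementation.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Dewey = Tuple[int, ...]  # assumption: 0.1.1 is written as (0, 1, 1)

def rm(v: Dewey, s: List[Dewey]) -> Optional[Dewey]:
    """Right match: the node of s with the smallest id >= v, or None."""
    i = bisect_left(s, v)
    return s[i] if i < len(s) else None

def lm(v: Dewey, s: List[Dewey]) -> Optional[Dewey]:
    """Left match: the node of s with the biggest id <= v, or None."""
    i = bisect_right(s, v)
    return s[i - 1] if i > 0 else None

s = [(0, 0, 0), (0, 1, 2), (0, 2, 0)]   # hypothetical keyword list, sorted by id
print(lm((0, 1, 0), s), rm((0, 1, 0), s))   # (0, 0, 0) (0, 1, 2)
print(rm((0, 3), s))                        # None
```

Each call performs a logarithmic number of tuple comparisons, and each comparison touches at most as many components as the tree is deep.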


Algorithms for finding the SLCA
A brute-force solution to the SLCA problem:
- It computes the LCAs of all node combinations and then removes ancestor nodes.
- Complexity: O(kd|S₁|…|Sₖ|), since there are |S₁|…|Sₖ| combinations and each LCA of k Dewey numbers costs O(kd).
- It is blocking: after it computes an LCA v for some combination of k nodes, it cannot report v as an answer, since there might be another set of k nodes whose LCA is a descendant of v.
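The brute-force approach can be written down directly from the definition. The following Python sketch assumes tuple-encoded Dewey numbers; the keyword lists `john` and `ben` are made-up sample data, and the function names are illustrative.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

Dewey = Tuple[int, ...]  # assumption: 0.1.1 is written as (0, 1, 1)

def lca(u: Dewey, v: Dewey) -> Dewey:
    """Longest common prefix of two Dewey numbers."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]

def remove_ancestors(nodes: Iterable[Dewey]) -> List[Dewey]:
    """Drop every node that is a proper prefix (ancestor) of another node."""
    ordered = sorted(set(nodes))
    # In sorted order the descendants of a node follow it immediately,
    # so checking the next element is enough.
    return [n for i, n in enumerate(ordered)
            if i + 1 == len(ordered) or ordered[i + 1][:len(n)] != n]

def slca_brute_force(*lists: List[Dewey]) -> List[Dewey]:
    """Compute the LCA of every combination, then remove ancestor nodes."""
    lcas: Set[Dewey] = set()
    for combo in product(*lists):
        x = combo[0]
        for n in combo[1:]:
            x = lca(x, n)
        lcas.add(x)
    return remove_ancestors(lcas)   # nothing can be reported before this point

john = [(0, 0, 0), (0, 1, 1, 0), (0, 2, 0, 0)]   # hypothetical keyword lists
ben = [(0, 1, 1, 1), (0, 2, 0, 1)]
print(slca_brute_force(john, ben))   # [(0, 1, 1), (0, 2, 0)]
```

The final `remove_ancestors` call is what makes the approach blocking: no candidate is known to be an SLCA until every combination has been examined.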


Indexed Lookup Eager Algorithm
- Preferred when the keyword search includes at least one low-frequency keyword.
- Based on four properties of SLCAs.
- Property 1: slca({v}, S) = descendant(lca(v, lm(v, S)), lca(v, rm(v, S))).
- Observations: for any two nodes v₁, v₂ to the right of a node v, if pre(v) < pre(v₁) < pre(v₂) then lca(v, v₂) ⪯ lca(v, v₁); for any two nodes v₁, v₂ to the left of a node v, if pre(v₂) < pre(v₁) < pre(v) then lca(v, v₂) ⪯ lca(v, v₁).

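Indexed Lookup Eager Algorithm
- Property 2: slca({v}, S₂, …, Sₖ) = slca(slca({v}, S₂, …, Sₖ₋₁), Sₖ) for k > 2.
- Property 3: slca(S₁, …, Sₖ) = removeAncestor(⋃ slca({v₁}, S₂, …, Sₖ)), with the union taken over all v₁ ∈ S₁.
- This leads to an algorithm to compute slca(S₁, …, Sₖ):
  - it computes xᵢ = slca({vᵢ}, S₂, …, Sₖ) for each vᵢ ∈ S₁ (1 ≤ i ≤ n);
  - the answer is removeAncestor({x₁, …, xₙ}).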

Indexed Lookup Eager Algorithm
- Benefit over brute force: for each node v₁ in S₁, the algorithm does not compute lca(v₁, v₂, …, vₖ) for all combinations of v₂ ∈ S₂, …, vₖ ∈ Sₖ.
- It computes a single lca(v₁, v₂, …, vₖ), where each vᵢ (2 ≤ i ≤ k) is computed by the match functions (lm and rm).
- Complexity: O(|S₁| d (log|S₂| + … + log|Sₖ|)) or O(kd|S₁| log|S|), where S is the largest keyword list.
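The lookup-based idea can be sketched in Python as follows. This is a simplified, blocking variant (it removes ancestors once at the end), not the eager algorithm of the talk. It assumes that the SLCA of a single node v against a list S is the deeper of lca(v, lm(v, S)) and lca(v, rm(v, S)), and that the result for one list is fed into the next list. All function names and the sample data are illustrative.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List, Optional, Tuple

Dewey = Tuple[int, ...]  # assumption: 0.1.1 is written as (0, 1, 1)

def lca(u: Dewey, v: Dewey) -> Dewey:
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]

def slca_single(v: Dewey, s: List[Dewey]) -> Optional[Dewey]:
    """Deepest LCA of v with any node of s, found via the left and right match."""
    i, j = bisect_right(s, v), bisect_left(s, v)
    left = lca(v, s[i - 1]) if i > 0 else None       # lca with lm(v, s)
    right = lca(v, s[j]) if j < len(s) else None     # lca with rm(v, s)
    if left is None or right is None:
        return right if left is None else left
    # Both candidates are prefixes of v, so the longer one is the descendant.
    return left if len(left) >= len(right) else right

def remove_ancestors(nodes: Iterable[Dewey]) -> List[Dewey]:
    ordered = sorted(set(nodes))
    return [n for i, n in enumerate(ordered)
            if i + 1 == len(ordered) or ordered[i + 1][:len(n)] != n]

def slca_lookup(s1: List[Dewey], *others: List[Dewey]) -> List[Dewey]:
    """One candidate per node of s1, using two lookups per remaining list."""
    candidates = []
    for v in s1:
        x: Optional[Dewey] = v
        for s in others:
            x = slca_single(x, s)
            if x is None:        # an empty keyword list: no answer exists
                break
        if x is not None:
            candidates.append(x)
    return remove_ancestors(candidates)

john = [(0, 0, 0), (0, 1, 1, 0), (0, 2, 0, 0)]   # hypothetical keyword lists
ben = [(0, 1, 1, 1), (0, 2, 0, 1)]
print(slca_lookup(ben, john))   # [(0, 1, 1), (0, 2, 0)]
```

Passing the shortest list as `s1` keeps the number of lookups small, which is why this style suits queries containing a low-frequency keyword.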

Indexed Lookup Eager Algorithm
Example (figure): Result = {} ∪ {0.1.1}; then Result = {0.1.1} ∪ {1.2.0} = {0.1.1, 1.2.0}

Indexed Lookup Eager Algorithm
- Property 4 gives a subroutine-based derivation of an algorithm to compute the SLCAs.
- The derived algorithm is blocking: it only processes the last keyword list after it completely processes the first k-1 keyword lists.

Indexed Lookup Eager Algorithm
- All nodes in xᵢ except the last one are guaranteed to be SLCAs; the last node is carried over to the next operation.
- The operation is repeated for all groups of P nodes of Sᵢ.
- The smaller P is, the faster the algorithm produces the first SLCA.
- No operations are needed to remove ancestor nodes from a set.

Indexed Lookup Eager Algorithm
Example: keywords “Class”, “John” and “Ben”, with P=1.
Buffer trace (figure): B=[0.1.0]; Output B=ø; B={}; B=[0.1.1]; Output v=0.1.1 (line #13)


Scan Eager Algorithm
- Used when the occurrences of keywords do not differ significantly.
- Its lm and rm implementations scan keyword lists to find matches, keeping a cursor for each keyword list.
- Observation: nodes from different lists may not be accessed in order.
- Indexed Lookup Eager finds matches by lookups; Scan Eager finds matches by scanning the keyword lists.
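A cursor-based match can be sketched as below. This is a minimal sketch that assumes the queries sent to one keyword list arrive in non-decreasing id order, so the cursor never has to move backwards; the class name `ScanCursor` and its method are hypothetical.

```python
from typing import List, Optional, Tuple

Dewey = Tuple[int, ...]  # assumption: 0.1.1 is written as (0, 1, 1)

class ScanCursor:
    """Finds left and right matches in one sorted keyword list by scanning forward."""

    def __init__(self, nodes: List[Dewey]) -> None:
        self.nodes = nodes
        self.pos = 0   # every node before pos is smaller than the last query

    def matches(self, v: Dewey) -> Tuple[Optional[Dewey], Optional[Dewey]]:
        """Return (lm, rm) of v; assumes v is >= the previous query."""
        while self.pos < len(self.nodes) and self.nodes[self.pos] < v:
            self.pos += 1
        right = self.nodes[self.pos] if self.pos < len(self.nodes) else None
        if right == v:
            left: Optional[Dewey] = v          # v itself is in the list
        else:
            left = self.nodes[self.pos - 1] if self.pos > 0 else None
        return left, right

cursor = ScanCursor([(0, 0, 0), (0, 1, 2), (0, 2, 0)])   # hypothetical list
print(cursor.matches((0, 1, 0)))   # ((0, 0, 0), (0, 1, 2))
print(cursor.matches((0, 3)))      # ((0, 2, 0), None)
```

Over a whole query the cursor visits each list entry at most once, which pays off when the lists have similar sizes and most entries would be looked up anyway.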

Complexity: or Efficient Keyword Search for Smallest LCAs in XML Database Scan Eager Algorithm


The Stack Algorithm
Each stack entry has a pair of components:
- Id: the id components from the bottom entry up to a stack entry en together form the Dewey number of the node denoted by en.
- Keywords: an array of length k of boolean values. keywords[i]=T means that the subtree rooted at the node denoted by the stack entry directly or indirectly contains the keyword wᵢ.

Efficient Keyword Search for Smallest LCAs in XML Database The Stack Algorithm

The Stack Algorithm
Example: keywords “Class”, “John” and “Ben”. Keyword lists: [0.1.0,0.1.1], [0.0.0, , ], [ , , ]
- Initially the stack is empty; v=0.0.0 and p=NULL. The non-matching components are added to the stack.
- Second iteration: v=0.1.0 (the next smallest node) and p=lca(stack, v)=0. The top 2 entries of the stack are popped (the important information is carried over), and then the non-matching components are added.
(Figure: stack snapshots with one keyword flag column each for J, B and C.)

The Stack Algorithm
- Seventh iteration: p=lca( , )=0, so the top 4 entries of the stack are popped.
- An entry that is popped and is not an SLCA passes its keyword witness information to the new top entry.
- When the third component is popped, an SLCA is found and output.
- Complexity is measured by the number of lca operations and the number of Dewey number comparisons.
(Figure: stack snapshots before and after each pop.)
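The walk-through above can be approximated by the following Python sketch of a stack-based SLCA computation. It is a sketch under stated assumptions rather than the algorithm as presented: Dewey numbers are tuples, the keyword lists are merged in id order with `heapq.merge`, and each stack entry carries an extra `blocked` flag (an assumption introduced here) so that ancestors of an already reported SLCA are never reported.

```python
import heapq
from typing import List, Tuple

Dewey = Tuple[int, ...]  # assumption: 0.1.1 is written as (0, 1, 1)

def slca_stack(*lists: List[Dewey]) -> List[Dewey]:
    k = len(lists)
    # Visit all keyword nodes in id order, tagged with their keyword index.
    merged = heapq.merge(*[[(n, i) for n in s] for i, s in enumerate(lists)])
    stack: list = []            # entries: [component, keyword flags, blocked]
    out: List[Dewey] = []

    def pop_top() -> None:
        node = tuple(e[0] for e in stack)      # id of the node being popped
        _, flags, blocked = stack.pop()
        if all(flags) and not blocked:
            out.append(node)                   # node is an SLCA
            blocked = True
        if stack:
            parent = stack[-1]
            if blocked:
                parent[2] = True               # an SLCA lies below the parent
            else:                              # pass keyword witness information up
                parent[1] = [a or b for a, b in zip(parent[1], flags)]

    for node, i in merged:
        p = 0                                  # length of the shared prefix
        while p < len(stack) and p < len(node) and stack[p][0] == node[p]:
            p += 1
        while len(stack) > p:                  # pop the non-matching entries
            pop_top()
        for comp in node[p:]:                  # push the non-matching components
            stack.append([comp, [False] * k, False])
        stack[-1][1][i] = True                 # this node directly contains keyword i
    while stack:
        pop_top()
    return out

john = [(0, 0, 0), (0, 1, 1, 0), (0, 2, 0, 0)]   # hypothetical keyword lists
ben = [(0, 1, 1, 1), (0, 2, 0, 1)]
print(slca_stack(john, ben))   # [(0, 1, 1), (0, 2, 0)]
```

Unlike the lookup-based approach, this style reads every entry of every keyword list, and the stack never grows deeper than the tree.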

The Stack Algorithm
The Scan Eager algorithm has several advantages over the Stack algorithm:
- The Scan Eager algorithm starts from the smallest keyword list and does not have to scan to the end of every keyword list, so it may terminate much earlier than the Stack algorithm.
- The number of lca operations of the Scan Eager algorithm is usually much smaller than that of the Stack algorithm.
- The Stack algorithm operates on a stack whose depth is bounded by the depth of the input tree, while the Scan Eager algorithm with P=1 only needs to keep three nodes during the whole process and involves no push/pop operations.


XKSearch System Implementation
The Indexed Lookup Eager, Scan Eager and Stack algorithms are implemented in Java using the Apache Xerces XML parser and Berkeley DB.

XKSearch System Implementation
The architecture: the B-tree structure allows efficient implementation of the match operations.

XKSearch System Implementation
- The level table LT has d entries (d = depth of the input tree).
- LT(i) is the maximum number of bits needed to store the i-th component of a Dewey number: LT(i) = ⌈log₂ c⌉, where c is the number of children of the node at level i-1 that has the maximum number of children among all nodes at the same level.
- In general, ⌈(LT(1) + … + LT(i)) / 8⌉ bytes are needed for a Dewey number of a node at level i.
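The effect of the level table on storage size can be estimated with a short Python sketch. It assumes that a component at a level with at most c siblings needs ceil(log2(c)) bits and that the components of one Dewey number are packed back to back and rounded up to whole bytes; the function names and the fan-out figures are made up for illustration.

```python
from math import ceil, log2
from typing import List

def level_table(max_children: List[int]) -> List[int]:
    """Bits per level, given the maximum fan-out feeding each level (assumed >= 1)."""
    return [ceil(log2(c)) for c in max_children]

def dewey_bytes(lt: List[int], level: int) -> int:
    """Bytes for a Dewey number with `level` components, bit-packed."""
    bits = sum(lt[:level])
    return (bits + 7) // 8

# Hypothetical document: a single root, then fan-outs of at most 4, 100 and 10.
lt = level_table([1, 4, 100, 10])
print(lt)                   # [0, 2, 7, 4]
print(dewey_bytes(lt, 4))   # 2
```

Storing each component as a fixed-width integer instead would cost several bytes per level, which is the kind of overhead the level table is meant to avoid.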

XKSearch System Implementation
Indexed Lookup Eager algorithm:
- The keyword lists are stored in a single B+ tree where keywords are the primary key and Dewey numbers are the secondary key.
- For a keyword w and a Dewey number p, it takes a single scan operation to find the right and left match of p in the keyword list of w.
- The number of disk accesses is bounded in terms of Bᵢ, the number of blocks of keyword list Sᵢ.
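The composite-key layout can be imitated in memory with a sorted list of (keyword, Dewey number) pairs. This Python sketch stands in for the B+ tree only conceptually and does not use the Berkeley DB API; the function `matches` and the sample index are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Dewey = Tuple[int, ...]            # assumption: 0.1.1 is written as (0, 1, 1)
Entry = Tuple[str, Dewey]          # (keyword, Dewey number), sorted on both parts

def matches(index: List[Entry], w: str, p: Dewey) -> Tuple[Optional[Dewey], Optional[Dewey]]:
    """Return (left match, right match) of p in the keyword list of w."""
    i = bisect_left(index, (w, p))             # one positioning step
    right = index[i][1] if i < len(index) and index[i][0] == w else None
    if right == p:
        return p, p
    # The left match is the neighbouring entry, if it has the same keyword.
    left = index[i - 1][1] if i > 0 and index[i - 1][0] == w else None
    return left, right

index = sorted([
    ("john", (0, 0, 0)), ("john", (0, 1, 1, 0)), ("john", (0, 2, 0, 0)),
    ("ben", (0, 1, 1, 1)), ("ben", (0, 2, 0, 1)),
])
print(matches(index, "john", (0, 1, 1, 1)))   # ((0, 1, 1, 0), (0, 2, 0, 0))
print(matches(index, "ben", (0, 3)))          # ((0, 2, 0, 1), None)
```

Because both matches sit next to the position where (w, p) would be inserted, one positioning step followed by a look at the neighbouring entry finds both.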

Scan Eager algorithm The keys in the B+ tree are simply keywords  The data associated with each key w is the list of Dewey numbers of the nodes directly containing the keyword w  The number of disk accesses: Efficient Keyword Search for Smallest LCAs in XML Database XKSearch System Implementation


Experiments
- similarities among the Scan Eager, Indexed Lookup Eager and Stack algorithms.
- However, the performance differences among the algorithms for cold cache are not as significant as those in the hot cache experiments. The reason is that most keyword lists do not take many pages.
- The size of the keyword lists and the time to construct them are proportional to the size of the input XML documents.
- XKSearchB stores Dewey numbers without using a level table. On average:
  - the size of the indexes constructed by XKSearch is 65% of that of XKSearchB;
  - the construction time of XKSearch is 55% of that of XKSearchB;
  - the query response time of XKSearch for hot cache is 70% of that of XKSearchB.


Conclusions
- The XKSearch system takes a list of keywords as input and returns the set of Smallest Lowest Common Ancestor nodes.
- The complexity of the Indexed Lookup Eager algorithm is O(kd|S₁| log|S|), where k is the number of keywords, d the depth of the tree, S₁ the smallest keyword list and S the largest.
- The Indexed Lookup Eager algorithm outperforms other algorithms, often by orders of magnitude, when the keywords have different frequencies.
- The Scan Eager algorithm is the best variant for the case where the keywords have similar frequencies.

Efficient Keyword Search for Smallest LCAs in XML Database Thank you !