By A. Aboulnaga, A. R. Alameldeen and J. F. Naughton Vldb’01


Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. By A. Aboulnaga, A. R. Alameldeen and J. F. Naughton, VLDB’01. Presented by Kan Kin Fai.

Outline of Presentation: Introduction; Path Trees; Markov Tables; Experimental Findings.

Motivation: XML enables Internet-scale applications that query data from many sources (Niagara, Xyleme, …). Queries over XML data use path expressions.

An XML Document
<readings>
  <play>
    <title>Pygmalion</title>
    <author>Bernard Shaw</author>
  </play>
  <novel>
    <title>David Copperfield</title>
    <author>Charles Dickens</author>
  </novel>
</readings>

Querying XML Data
FOR $n_auth IN document("*")//novel/author,
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth
Optimizing this query requires estimating the selectivity of its path expressions (//novel/author and //play/author), which in turn requires information about the structure of the XML data.

Goal of this Paper: Build database statistics that capture the structure of XML data. Ensure that the statistics fit in a small amount of memory (needed for efficient query optimization, and important for Internet-scale applications). Use the statistics to estimate the selectivity of simple XML path expressions.

Simple Path Expression: A sequence of tags that represents a navigation through the tree structure of the XML data, starting anywhere in the tree: //t1/t2/…/tn. Assumes an unordered model of XML. Does not consider navigation based on IDREF attributes or on predicates on attribute values.

Path Trees: A tree representing the structure of an XML document. Every node represents a path starting from the root of the XML document. The root node represents the root element.

Path Trees: A node has a child node for every distinct element directly nested in any of the elements reachable by the path it represents. Every node is labeled with the tag name of the elements reachable by the path it represents and with the number of such elements (i.e. the frequency of the node).

Path Trees: Example document: <A> <B> </B> <B> <D> </D> </B> <C> <E> </E> </C> </A>. Path tree nodes (tag and frequency): A 1, B 2, C 1, D 1, D 1, E 3.
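The construction described on the Path Trees slides can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the class `PathTreeNode`, the helpers `build_path_tree` and `dump`, and the sample document `DOC` are all assumptions. `DOC` was chosen so that its path tree has the frequencies listed on the slide (A 1, B 2, C 1, D 1, D 1, E 3).

```python
import xml.etree.ElementTree as ET


class PathTreeNode:
    """One path tree node: a tag, a frequency, and children keyed by tag."""

    def __init__(self, tag):
        self.tag = tag
        self.freq = 0
        self.children = {}


def build_path_tree(xml_text):
    """Build a path tree by walking the XML document once."""
    root_elem = ET.fromstring(xml_text)
    root = PathTreeNode(root_elem.tag)

    def visit(elem, node):
        # Each element reachable by this node's path adds 1 to its frequency.
        node.freq += 1
        for child in elem:
            if child.tag not in node.children:
                node.children[child.tag] = PathTreeNode(child.tag)
            visit(child, node.children[child.tag])

    visit(root_elem, root)
    return root


def dump(node, prefix=""):
    """Return (root-to-node path, frequency) pairs in depth-first order."""
    path = prefix + "/" + node.tag
    rows = [(path, node.freq)]
    for child in node.children.values():
        rows.extend(dump(child, path))
    return rows


# Hypothetical document, chosen to match the frequencies on the slide.
DOC = "<A><B/><B><D/></B><C><D/><E/><E/><E/></C></A>"

tree = build_path_tree(DOC)
for path, freq in dump(tree):
    print(path, freq)
```

Note that the two B elements share a single path tree node with frequency 2, while the two D elements get separate nodes because they are reached by different paths.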

Summarizing Path Trees: Path trees contain all the information needed for selectivity estimation. Problem: a path tree may not fit in available memory, because little memory is available and the data is Internet scale. Solution: remove low-frequency nodes and replace the removed nodes with *-nodes. A *-node has tag name *, meaning "any tag", and its frequency is the average frequency of the nodes it replaces. Four methods: Sibling-*, Level-*, Global-*, No-*.

Sibling-* Summarization: Repeatedly choose the path tree node with the lowest frequency and mark it for deletion. Check its siblings to see if any of them is either a *-node or a regular node that has been marked for deletion. If so, coalesce the node with that sibling into one *-node. Also coalesce the children of the coalesced nodes if they have the same tag name.

Sibling-* Summarization: During summarization, every path tree node stores the number of nodes in the original unsummarized path tree that it represents and the total frequency of these nodes. When the path tree becomes small enough, traverse the tree and compute for every *-node the average frequency of the deleted nodes that it represents.
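The bookkeeping described above (a total frequency and a node count per node, with the average taken only at the end) can be sketched as follows. The names `SummaryNode` and `coalesce` are hypothetical, and the sketch covers only the merge step, not the choice of which nodes to delete. The numbers come from the example path tree used in this talk (I 2, J 4, K 11, K 12).

```python
from dataclasses import dataclass


@dataclass
class SummaryNode:
    tag: str          # "*" once nodes with different tags have been merged
    total_freq: int   # total frequency of the original nodes represented
    count: int        # number of original nodes represented

    def average(self):
        """Average frequency, computed once summarization has finished."""
        return self.total_freq / self.count


def coalesce(a, b):
    """Merge two nodes; the tag survives only if both nodes share it."""
    tag = a.tag if a.tag == b.tag else "*"
    return SummaryNode(tag, a.total_freq + b.total_freq, a.count + b.count)


# Siblings I (2) and J (4) become one *-node with f=6, n=2.
star = coalesce(SummaryNode("I", 2, 1), SummaryNode("J", 4, 1))
print(star, star.average())

# Children with the same tag name keep that tag: K (11) and K (12) give K with f=23, n=2.
merged_k = coalesce(SummaryNode("K", 11, 1), SummaryNode("K", 12, 1))
print(merged_k)
```

Keeping totals and counts rather than running averages means that further merges stay exact, which is presumably why the averages are computed only after the tree is small enough.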

Sibling-* Summarization [Figure, animated over several frames: step-by-step sibling-* summarization of the example path tree. As the frames progress, I (2) and J (4) are coalesced into a *-node with f=6, n=2; D (7) and E (5) into a *-node with f=12, n=2; G (10) and H (6) into a *-node with f=16, n=2; and the two K nodes (11 and 12) into a single K node with f=23, n=2. The last frame shows the *-nodes with their average frequencies 3, 6 and 8.]

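Original Path Tree [Figure: the unsummarized example path tree, with nodes A 1, B 13, C 9, D 7, E 5, F 15, G 10, H 6, I 2, J 4, K 11, K 12.]

Sibling-* Summarization [Figure: the summarized tree, with *-nodes of frequency 3, 6 and 8 and the merged K node with f=23, n=2.] Sibling-* tries to preserve the exact position of the deleted nodes in the original path tree. It may need to delete 2n nodes to reduce the size of the tree by n nodes.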

Selectivity Estimation: Try to match the tags in the path expression with tags in the path tree to find all path tree nodes to which the path expression leads. The estimated selectivity is the total frequency of all these nodes.
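As a rough sketch of this matching step on an unsummarized path tree (no *-nodes), the code below tries the expression //t1/…/tn from every node and sums the frequencies of the nodes it reaches. The helpers `node`, `match_from` and `estimate` are illustrative names, and the tree is the small A/B/C/D/E example from the Path Trees slide, with one D placed under B and one under C as an assumption.

```python
def node(tag, freq, *children):
    """Hypothetical helper: a path tree node as a plain dict."""
    return {"tag": tag, "freq": freq, "children": list(children)}


def match_from(n, tags):
    """Total frequency of the nodes reached by following `tags` starting at n."""
    if n["tag"] != tags[0]:
        return 0
    if len(tags) == 1:
        return n["freq"]
    return sum(match_from(child, tags[1:]) for child in n["children"])


def estimate(root, path_expr):
    """Estimate the selectivity of a simple path expression //t1/t2/.../tn."""
    tags = path_expr.lstrip("/").split("/")
    total = 0
    stack = [root]
    while stack:  # the expression may start at any node of the tree
        n = stack.pop()
        total += match_from(n, tags)
        stack.extend(n["children"])
    return total


tree = node("A", 1,
            node("B", 2, node("D", 1)),
            node("C", 1, node("D", 1), node("E", 3)))

print(estimate(tree, "//D"))     # 2: D occurs once under B and once under C
print(estimate(tree, "//C/E"))   # 3
print(estimate(tree, "//B/E"))   # 0: no such path in the tree
```

On a full path tree the result is exact; the estimate becomes approximate only once nodes have been replaced by *-nodes.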

Selectivity Estimation: When we can’t match a tag in the path expression to a path tree node with a regular tag, try to match it to a *-node that can take its place. E.g. //A/B/C would match all of //A/*/C, //A/*/* and //*/B/*. Allow matches with any number of *-nodes as long as they include at least one node with a regular tag name.

Level-* Summarization: Has a *-node for every level of the path tree, representing all deleted nodes at this level. All nodes deleted at any given level of the path tree are coalesced into the *-node for this level. Preserves only the level in the path tree of the deleted nodes, not their exact position as in sibling-*. Needs to delete n+h nodes to reduce the size of the path tree by n nodes, where h is the number of levels in the tree.

Level-* Summarization [Figure, three frames: the example path tree (B 13, C 9, D 7, E 5, F 15, G 10, H 6, I 2, J 4, K 11, K 12); the same tree with D, E, H, I and J highlighted; and the summarized tree with B 13, C 9, F 15, G 10, K 11, K 12, one *-node of frequency 6 and one *-node of frequency 3.]

Global-* Summarization: A single *-node represents all low-frequency nodes deleted from anywhere in the path tree. Preserves less information about the deleted nodes than sibling-* or level-*. Needs to delete only n+1 nodes to reduce the size of the path tree by n nodes.

Global-* Summarization [Figure, three frames: the example path tree; the same tree with D, E, H, I and J highlighted; and the summarized tree, which shows a single *-node of frequency 3 together with B 13, C 9, D 7, F 15, G 10, H 6, K 11, K 12.]

No-* Summarization: Low-frequency nodes are simply deleted and not replaced with *-nodes. Deletes exactly n nodes to reduce the size of a path tree by n nodes.

No-* Summarization [Figure, three frames: the example path tree (A 1, B 13, C 9, D 7, E 5, F 15, G 10, H 6, I 2, J 4, K 11, K 12); the same tree with D, E, H, I and J highlighted; and the summarized tree, which shows B 13, C 9, D 7, E 5, F 15, G 10, H 6, K 11, K 12 with I and J removed.]

Markov Tables: A table of all distinct paths of length up to m and their frequencies. For paths of length greater than m, combine paths from the Markov table, using the "short memory" or "Markov" property. Example: $f(A/B/C/D) = f(A/B/C) \times \frac{f(B/C/D)}{f(B/C)}$

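Markov Tables: Example table of paths of length 1 and 2 (path: frequency). A: 1, B: 11, C: 15, D: 19, AB: (frequency missing in the transcript), AC: 6, AD: 4, BC: 9, BD: 7, CD: 8.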
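The Markov estimate for a path longer than m can be sketched as below, using the frequencies from the example table with m = 2. The function `markov_estimate` and the "/"-separated key format are assumptions made for illustration. AB is left out of the dictionary because its frequency does not appear in the transcript.

```python
def markov_estimate(table, tags, m):
    """Estimate f(t1/.../tn) from a table of paths of length up to m.

    f(t1/.../tn) = f(t1/.../tm) * product over i of
                   f(t[i]/.../t[i+m-1]) / f(t[i]/.../t[i+m-2])
    """
    if len(tags) <= m:
        # Paths of length up to m are stored exactly.
        return table.get("/".join(tags), 0)
    estimate = table.get("/".join(tags[:m]), 0)
    for i in range(1, len(tags) - m + 1):
        window = "/".join(tags[i:i + m])        # next path of length m
        overlap = "/".join(tags[i:i + m - 1])   # the part already accounted for
        denominator = table.get(overlap, 0)
        if denominator == 0:
            return 0.0
        estimate *= table.get(window, 0) / denominator
    return estimate


# Frequencies from the example table (AB omitted: its value is not in the transcript).
TABLE = {"A": 1, "B": 11, "C": 15, "D": 19,
         "A/C": 6, "A/D": 4, "B/C": 9, "B/D": 7, "C/D": 8}

# f(A/C/D) is approximated by f(A/C) * f(C/D) / f(C) = 6 * 8 / 15
print(markov_estimate(TABLE, ["A", "C", "D"], m=2))
```

With m = 3 and the tags A, B, C, D the same loop computes f(A/B/C) × f(B/C/D) / f(B/C), which is the example on the Markov Tables slide.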

Summarizing Markov Tables: The table gives exact selectivities for paths of length up to m and approximate selectivities for paths longer than m. Problem: it may not fit in available memory. Solution: remove low-frequency paths. Discard removed paths of length > 2. Replace removed paths of length 1 or 2 with *-paths. Three methods: Suffix-*, Global-*, No-*.

Suffix-* Summarization: Two special *-paths: * represents all deleted paths of length 1, and */* represents all deleted paths of length 2. A deleted low-frequency path of length 1 is added to *. A set SD holds deleted paths of length 2. When a low-frequency path of length 2 is deleted, first look in the Markov table for a suffix-* path (e.g. A/*) with the same starting tag, then look in SD for a path with the same starting tag. If neither exists, the deleted path is kept in SD.

Suffix-* Summarization: A suffix-* path can itself be deleted, in which case it is added to */*. At the end, add the deleted paths remaining in SD to */* and compute the average frequencies of all *-paths. Selectivity Estimation: Use the frequencies of suffix-* paths and *-paths if any of the required paths is not found. Return 0 if only *-paths are used for estimation.
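The suffix-* steps can be sketched as the loop below. It is a reading of the slides, not the authors' code, and it rests on several assumptions: the function name `suffix_star_summarize`, a fixed number of deletions as the stopping rule, comparing suffix-* paths by their total frequency, and a placeholder frequency of 11 for AB (its value is missing from the transcript). With seven deletions it ends in the same state as the last walkthrough frame.

```python
def suffix_star_summarize(freqs, deletions):
    """Delete the `deletions` lowest-frequency paths, suffix-* style."""
    table = {p: (f, 1) for p, f in freqs.items()}  # path -> (total freq, paths represented)
    star = [0, 0]        # "*":   total freq and count of deleted length-1 paths
    star_star = [0, 0]   # "*/*": total freq and count of deleted length-2 paths
    sd = {}              # deleted length-2 paths still waiting for a partner

    for _ in range(deletions):
        path = min(table, key=lambda p: table[p][0])  # lowest total frequency
        f, n = table.pop(path)
        tags = path.split("/")
        if len(tags) == 1:
            star[0] += f
            star[1] += n
        elif len(tags) > 2:
            continue                                  # longer paths are discarded
        elif tags[1] == "*":
            star_star[0] += f                         # a deleted suffix-* path
            star_star[1] += n
        else:
            suffix = tags[0] + "/*"
            partner = next((p for p in sd if p.split("/")[0] == tags[0]), None)
            if suffix in table:                       # merge into existing suffix-* path
                sf, sn = table[suffix]
                table[suffix] = (sf + f, sn + n)
            elif partner is not None:                 # combine with a path from SD
                table[suffix] = (sd.pop(partner) + f, 2)
            else:
                sd[path] = f

    for f in sd.values():                             # leftovers go to */*
        star_star[0] += f
        star_star[1] += 1

    result = {p: f / n for p, (f, n) in table.items()}  # average frequencies
    if star[1]:
        result["*"] = star[0] / star[1]
    if star_star[1]:
        result["*/*"] = star_star[0] / star_star[1]
    return result


# Example table; the value 11 for A/B is an assumption (missing in the transcript).
FREQS = {"A": 1, "B": 11, "C": 15, "D": 19, "A/B": 11,
         "A/C": 6, "A/D": 4, "B/C": 9, "B/D": 7, "C/D": 8}

summary = suffix_star_summarize(FREQS, deletions=7)
print(summary)
```

In this run A goes to *, AD waits in SD until AC is deleted and the two form A/*, BD and BC form B/*, A/* is then deleted into */*, and CD is moved from SD to */* at the end.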

Suffix-* Summarization [Table, animated over several frames: step-by-step suffix-* summarization of the example Markov table (A 1, B 11, C 15, D 19, AB, AC 6, AD 4, BC 9, BD 7, CD 8), with SD the set of deleted paths of length 2. A (1) is deleted and added to *, giving f=1, n=1. AD (4) is deleted and placed in SD. AC (6) is deleted and combined with AD into A* with f=10, n=2, leaving SD empty. BD (7) and then CD (8) are deleted and placed in SD. BC (9) is deleted and combined with BD into B* with f=16, n=2, leaving SD = { (CD,8) }. A* (f=10, n=2) is then deleted and added to **: "We gave A* a second chance but it lost it." The final table, with average frequencies and SD emptied, is B 11, C 15, D 19, AB, B* 8, * 1, ** 6.]

Global-*, No-* Summarization: Global-* keeps two *-paths, * and **, and adds a low-frequency path of length 1 or 2 to the appropriate *-path immediately. It deletes fewer paths than suffix-* to summarize the Markov table. No-* keeps no *-paths.

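Experimental Findings: Path trees: for query paths that occur in the data, Global-* performs best; for query paths that do not occur in the data, No-* performs best. Markov tables: m = 2 is best (practical values: 2 and 3); for query paths that occur in the data, Suffix-* performs best.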

Explanation: Methods using *-nodes/*-paths aggressively assume that nodes/paths that cannot be found did exist in the original path tree/Markov table. No-* conservatively assumes that nodes/paths that cannot be found did not exist in the original path tree/Markov table.

Experimental Findings: When the data has many common sub-structures, Markov tables give more accurate estimates. When the data does not have many common sub-structures, path trees give more accurate estimates.

Explanation: In DBLP, repeated sub-structures are effectively captured by Markov tables:
<sigmod>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</sigmod>
<vldb>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</vldb>