XML Query Processing Talk prepared by Bhavana Dalvi (05305001) Uma Sawant (05305903)



Structural Joins: A Primitive for Efficient XML Query Pattern Matching
S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, Y. Wu. ICDE 2002

Motivation
Query: book[title='XML']//author[.='jane']

Query Tree
book[title='XML']//author[.='jane']

Decomposition Of Query Tree

Evaluation of Query
➢ Match each of the binary structural relationships against the database.
➢ Stitch these basic matches together.

Different ways of matching structural relationships
Tuple-at-a-time approach
➢ Tree traversal
➢ Using child & parent pointers
➢ Inefficient because it needs a complete pass through the data
Pointer-based approach
➢ Maintain (parent, child) pairs and derive (ancestor, descendant) pairs from them: high time complexity
➢ Maintain all (ancestor, descendant) pairs: high space complexity
➢ Either case is infeasible

Solution: Set-at-a-time approach
Uses a positional representation of occurrences of XML elements and string values
➢ Element: 3-tuple (DocId, StartPos:EndPos, LevelNum)
➢ String: 3-tuple (DocId, StartPos, LevelNum)

Positional Representation
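A minimal sketch of how such positions could be assigned, assuming a single depth-first pass with one running counter. The function name `label_document`, the use of `xml.etree.ElementTree`, and the choice to give a string value the level of its parent plus one are illustrative assumptions, not taken from the slides; text after a closing tag is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def label_document(xml_text, doc_id=1):
    """Return (name, DocId, StartPos, EndPos, LevelNum) tuples in document order.

    Elements get StartPos:EndPos; string values get a StartPos only
    (EndPos is None). Hypothetical helper for illustration.
    """
    labels = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        index = len(labels)
        labels.append(None)  # placeholder until EndPos is known
        text = (elem.text or "").strip()
        if text:
            counter += 1
            labels.append((text, doc_id, counter, None, level + 1))
        for child in elem:
            visit(child, level + 1)
        counter += 1
        labels[index] = (elem.tag, doc_id, start, counter, level)

    visit(ET.fromstring(xml_text), 1)
    return labels


for row in label_document("<book><title>XML</title><author>jane</author></book>"):
    print(row)
```

Every descendant's interval falls strictly inside its ancestor's interval, which is what the structural relationship test relies on.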

Structural Relationship Test
For two nodes (D1, S1:E1, L1) and (D2, S2:E2, L2):
Ancestor-Descendant
➢ D1 = D2, S1 < S2, E2 < E1
Parent-Child
➢ D1 = D2, S1 < S2, E2 < E1, L1 + 1 = L2
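The two tests translate directly into predicates. The sketch below assumes a hypothetical `Node` tuple with fields `doc`, `start`, `end`, `level` standing for (DocId, StartPos:EndPos, LevelNum); the names are illustrative only.

```python
from typing import NamedTuple


class Node(NamedTuple):
    doc: int
    start: int
    end: int
    level: int


def is_ancestor(a, d):
    """a is an ancestor of d: same document, d's interval lies inside a's."""
    return a.doc == d.doc and a.start < d.start and d.end < a.end


def is_parent(a, d):
    """a is the parent of d: ancestor test plus a one-level difference."""
    return is_ancestor(a, d) and a.level + 1 == d.level


book = Node(1, 1, 8, 1)
title = Node(1, 2, 4, 2)
print(is_ancestor(book, title), is_parent(book, title))
```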

Goal
Given an ancestor-descendant (or parent-child) structural relationship (e1, e2), find all node pairs that satisfy it.

Previous Approaches
Traditional merge join
➢ Does an equi-join on doc-id
➢ Tests the inequalities on the cross-product of the two sets
Multi-Predicate Merge Join (MPMGJN)
➢ Uses the positional representation
➢ Better than the traditional merge join
➢ Applies multiple predicates simultaneously
➢ Still a lot of unnecessary computation and I/O

Need: an algorithm that is both I/O and CPU optimal

Solution: Structural Joins
➢ Set-at-a-time approach
➢ Uses the positional representation of XML elements
➢ I/O and CPU optimal

Structural Join
To locate elements matching a tag:
➢ Index on Node (ElementTag, DocID, StartPos, EndPos, LevelNum)
➢ Index on ElementTag
➢ Tag list: elements sorted by (DocID, StartPos, EndPos, LevelNum)

Structural Join
Goal: join two lists based on either the parent-child or the ancestor-descendant relationship
Input:
➢ AList (a.DocID, a.StartPos : a.EndPos, a.LevelNum)
➢ DList (d.DocID, d.StartPos : d.EndPos, d.LevelNum)
Output can be sorted by
➢ Ancestor: (DocID, a.StartPos, d.StartPos), or
➢ Descendant: (DocID, d.StartPos, a.StartPos)

Two families of structural join algorithms
➢ Tree-Merge algorithms: MPMGJN is a member of this family
➢ Stack-Tree algorithms

Algorithm Tree-Merge-Anc
Output: ordered by ancestors
Algorithm: loop through the list of ancestors in increasing order of StartPos
➢ For each ancestor, skip over unmatchable descendants
➢ Check for the ancestor-descendant (or parent-child) relationship
➢ Append the result to the output list
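The loop above can be sketched as follows. This is an illustrative reading of the slide, assuming both lists are sorted by (DocID, StartPos) and that regions are properly nested; `Node`, `tree_merge_anc`, and the `parent_child` flag are hypothetical names.

```python
from typing import NamedTuple


class Node(NamedTuple):
    doc: int
    start: int
    end: int
    level: int


def tree_merge_anc(alist, dlist, parent_child=False):
    """Join sorted AList and DList; output pairs ordered by ancestor."""
    out = []
    begin = 0  # first descendant that can still match the current ancestor
    for a in alist:
        # Skip descendants starting before a: they cannot match a or any later ancestor.
        while begin < len(dlist) and (dlist[begin].doc, dlist[begin].start) < (a.doc, a.start):
            begin += 1
        # Rescan from begin for every ancestor: this is the repeated access
        # that makes the worst case quadratic.
        i = begin
        while i < len(dlist) and dlist[i].doc == a.doc and dlist[i].start < a.end:
            d = dlist[i]
            if d.end < a.end and (not parent_child or d.level == a.level + 1):
                out.append((a, d))
            i += 1
    return out


a1, a2 = Node(1, 1, 10, 1), Node(1, 2, 5, 2)
d1, d2 = Node(1, 3, 4, 3), Node(1, 6, 7, 2)
print(tree_merge_anc([a1, a2], [d1, d2]))
```

The inner scan restarts at `begin` for each ancestor, so nested ancestors revisit the same descendants.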

Worst case for Tree-Merge-Anc

Analysis (Tree-Merge-Anc)
Ancestor-descendant relationship
➢ |OutputList| = O(|AList| * |DList|)
➢ Time complexity O(|AList| * |DList|) is optimal, because the output itself can be that large
➢ But poor I/O performance
Parent-child relationship
➢ |OutputList| = O(|AList| + |DList|)
➢ Time complexity is still O(|AList| * |DList|)

Algorithm Tree-Merge-Desc
Output: ordered by descendants
Algorithm: loop over the list of descendants in increasing order of StartPos
➢ For each descendant, skip over unmatchable ancestors
➢ Check for the ancestor-descendant (or parent-child) relationship
➢ Append the result to the output list

Worst case for Tree-Merge-Desc

Analysis (Tree-Merge-Desc)
Ancestor-descendant relationship
➢ |OutputList| = O(|AList| * |DList|)
➢ Time complexity: O(|AList| * |DList|)
Parent-child relationship
➢ |OutputList| = O(|AList| + |DList|)
➢ Space and time complexities are O(|AList| * |DList|)

Tree-Merge algorithms are not I/O optimal
➢ Repetitive accesses to the ancestor or descendant list

Motivation for Stack-Tree Algorithm
Basic idea: depth-first traversal of the XML tree
➢ Takes linear time with a stack of size equal to the tree depth
➢ All ancestor-descendant relationships appear on the stack during traversal
Main problem: we do not want to traverse the whole database, just the nodes in AList or DList
Solution: Stack-Tree algorithm
➢ Stack: sequence of nodes in AList

Stack-Tree-Desc
Initialize start pointers (a*, d*, s->top)
While the input lists are not empty or the stack is not empty
➢ if the new nodes (a* and d*) are not descendants of the current s->top, pop the stack
➢ else if a* starts before d* (a*.StartPos < d*.StartPos, so a* may be an ancestor of d*), push a* on the stack and increment a*
➢ else
➔ compute the output list for d* by matching it with all nodes in the current stack, in bottom-up order
➔ increment d* to point to the next node
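A minimal sketch of Stack-Tree-Desc under stated assumptions: both lists are sorted by (DocID, StartPos), regions are properly nested, and the loop stops as soon as DList is exhausted because no further output is possible. `Node` and `stack_tree_desc` are hypothetical names, and the parent-child filter is a simple level check rather than an optimized variant.

```python
from typing import NamedTuple


class Node(NamedTuple):
    doc: int
    start: int
    end: int
    level: int


INF = (float("inf"), float("inf"))  # sentinel key for an exhausted list


def stack_tree_desc(alist, dlist, parent_child=False):
    """Join sorted AList and DList; output pairs ordered by descendant."""
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        a_key = (alist[i].doc, alist[i].start) if i < len(alist) else INF
        d = dlist[j]
        d_key = (d.doc, d.start)
        if stack and (stack[-1].doc, stack[-1].end) < min(a_key, d_key):
            # Neither a* nor d* lies inside the top of the stack: pop it.
            stack.pop()
        elif a_key < d_key:
            # a* starts first, so it may be an ancestor of d*: push it.
            stack.append(alist[i])
            i += 1
        else:
            # Every node on the stack contains d*; emit bottom-up.
            for s in stack:
                if not parent_child or s.level + 1 == d.level:
                    out.append((s, d))
            j += 1
    return out


a1, a2 = Node(1, 1, 10, 1), Node(1, 2, 5, 2)
d1, d2 = Node(1, 3, 4, 3), Node(1, 6, 7, 2)
print(stack_tree_desc([a1, a2], [d1, d2]))
```

Each input node is pushed and popped at most once, and each stack scan produces only output pairs when the ancestor-descendant relationship is requested.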

Example of Stack-Tree-Desc Execution
AList: a1, a2, ...
DList: d1, d2, ...
[Figures: Steps 1 to 3 of the execution]

Stack-Tree-Desc Analysis
* Time complexity (for ancestor-descendant and parent-child): O(|AList| + |DList| + |OutputList|)
* I/O complexity (for ancestor-descendant and parent-child): O(|AList| / B + |DList| / B + |OutputList| / B)
➢ where B is the blocking factor

Stack-Tree-Anc
Output: ordered by ancestors
Cannot use the same algorithm as in Stack-Tree-Desc
Basic problem: results for a particular descendant cannot be output immediately
➢ Later descendants may match an earlier ancestor, and hence have to be output first

Stack-Tree-Anc
Solution: keep lists of matching descendant nodes with each stack node
➢ Self-list: descendants that match this node. Each descendant node is added to the self-lists of all matching ancestor nodes
➢ Inherit-list: inherited from nodes already popped from the stack, to be output after the self-list matches are output

Algorithm Stack-Tree-Anc
● Initialize start pointers (a*, d*, s->top)
● While the input lists are not empty or the stack is not empty
if the new nodes (a* and d*) are not descendants of the current s->top, pop the stack (p* = popped ancestor node)
➢ Append p*.inherit_list to p*.self_list
➢ Append the resulting list to (s->top).inherit_list (if the stack is now empty, output the list instead)
else
➢ if a* starts before d* (a*.StartPos < d*.StartPos, so a* may be an ancestor of d*), push a* on the stack and increment a*
➢ else
➔ Append the corresponding tuple to the self-list of all nodes in the stack
➔ Increment d* to point to the next node
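The self-list and inherit-list bookkeeping can be sketched as below. This is an illustrative version assuming sorted, properly nested inputs; it uses plain Python lists instead of linked lists and emits a node's pairs only when the stack empties, so it does not model the buffer management a disk-based implementation would need. `Node` and `stack_tree_anc` are hypothetical names.

```python
from typing import NamedTuple


class Node(NamedTuple):
    doc: int
    start: int
    end: int
    level: int


INF = (float("inf"), float("inf"))  # sentinel key for an exhausted list


def stack_tree_anc(alist, dlist):
    """Ancestor-descendant join of sorted lists; output ordered by ancestor."""
    out = []
    stack = []  # each entry: [node, self_list, inherit_list]
    i = j = 0
    while i < len(alist) or j < len(dlist) or stack:
        a_key = (alist[i].doc, alist[i].start) if i < len(alist) else INF
        d_key = (dlist[j].doc, dlist[j].start) if j < len(dlist) else INF
        if stack and (stack[-1][0].doc, stack[-1][0].end) < min(a_key, d_key):
            _, self_list, inherit_list = stack.pop()
            pairs = self_list + inherit_list  # own matches first, then inherited
            if stack:
                stack[-1][2].extend(pairs)    # hand over to the new top
            else:
                out.extend(pairs)             # nothing above: safe to output
        elif a_key < d_key:
            stack.append([alist[i], [], []])
            i += 1
        else:
            d = dlist[j]
            for entry in stack:               # every stacked node contains d
                entry[1].append((entry[0], d))
            j += 1
    return out


a1, a2, a3 = Node(1, 1, 12, 1), Node(1, 2, 5, 2), Node(1, 6, 9, 2)
d1, d2 = Node(1, 3, 4, 3), Node(1, 7, 8, 3)
print(stack_tree_anc([a1, a2, a3], [d1, d2]))
```

With a2 containing d1 and a3 containing d2, both under a1, the result is (a1, d1), (a1, d2), (a2, d1), (a3, d2), ordered by ancestor.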

Example of Stack-Tree-Anc
AList: a1, a2, a3
DList: d1, d2
[Figures: Steps 1 to 5 of the execution]

Final output is : (a1, d1), (a1, d2), (a2, d1), (a3, d2)

Stack-Tree-Anc Analysis
➢ Requires careful handling of lists (linked lists)
➢ Time complexity (for ancestor-descendant and parent-child): O(|AList| + |DList| + |OutputList|)
➢ Careful buffer management needed

Performance Study

Data Set

Queries

Performance

Conclusion
➢ The performance of the traversal-style algorithms degrades considerably with the size of the dataset.
➢ The performance of STJD (Stack-Tree-Desc) is superior to that of the others: STJA (Stack-Tree-Anc), TMJA (Tree-Merge-Anc), and TMJD (Tree-Merge-Desc).

ORDPATHs: Insert-Friendly XML Node Labels
Patrick O'Neil, Elizabeth O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury (SIGMOD 2004)

Motivation
➢ Previous schemes are adequate for static XML data
➢ But they respond poorly to arbitrary inserts: many nodes must be relabeled
➢ Hence, if the data is not static, an insert-friendly labeling method is needed

Traditional Methods for Positional Representation
Dewey Order: hierarchical scheme

Dewey ID representation
➢ Is independent of the database schema
➢ Allows efficient access through Dewey list indexing
➢ But arbitrary inserts are costly

Solution: ORDPATH
➢ Similar to Dewey ID
➢ Differs in the initial labeling and the encoding scheme
➢ Provides efficient insertion at any position in the tree
➢ Encodes the path efficiently to give the maximum possible compression
➢ Byte-by-byte comparison yields the proper document order
➢ Supports extremely high-performance query plans

Example

➢ Initial load: only positive, odd integers are assigned
➢ Later insertions: even-numbered and negative integer components

Relational node table
ORDPATH: primary key

Compressed ORDPATH Format
➢ L_i: length in bits of the succeeding O_i bitstring (uses a prefix-free encoding)
➢ O_i (ordinal): variable-length representation of the node's ordinal, with its length given by L_i

L_i/O_i Pair Design
[Figure: bit layout of successive L_i/O_i pairs for ORDPATH = "1.-9"; table of L_i and O_i pairs]

Comparing ORDPATH values
➢ Simple bitstring comparison yields document order
➢ Ancestor-descendant relationships: if X is a strict prefix (proper initial substring) of Y, then X is an ancestor of Y
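As a rough illustration of both comparisons, the sketch below models an ORDPATH as a tuple of integer ordinals rather than the compressed bitstring; that substitution is an assumption made for readability. Tuple comparison in Python is lexicographic with a shorter prefix sorting first, which mirrors the document-order property. `parse`, `is_ancestor`, and the sample labels are hypothetical.

```python
def parse(label):
    """Turn a dotted label such as '1.3.1' into a tuple of ordinals."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(x, y):
    """x (a node label) is an ancestor of y when x is a strict prefix of y."""
    return len(x) < len(y) and y[:len(x)] == x


labels = ["1.5", "1.3.1", "1", "1.3", "1.-1"]
in_document_order = sorted(labels, key=parse)
print(in_document_order)  # ['1', '1.-1', '1.3', '1.3.1', '1.5']
print(is_ancestor(parse("1.3"), parse("1.3.1")))  # True
```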

ORDPATH Insertions
Insertions at the extreme ends
➢ Right of all: add 2 to the last ordinal of the last child
➢ Left of all: add -2 to the last ordinal of the first child

Other insertions
Arbitrary insertions:
➢ Careting in: create a component with an even ordinal falling between the final ordinals of the two siblings
➢ Append a new odd component
➢ Depth of the tree remains constant

Example
Adding a node under 5 between 5.5 & 5.7
➢ Create a new caret 6 (5 < 6 < 7)
➢ New siblings are 5.6.1, 5.6.3, ...
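The insertion rules can be sketched with labels held as tuples of ordinals. This is a simplified illustration: `caret_between` only handles two adjacent siblings whose final odd ordinals differ by exactly 2, and all function names are hypothetical.

```python
def insert_after_last(last_child):
    """Right of all siblings: add 2 to the final ordinal of the last child."""
    return last_child[:-1] + (last_child[-1] + 2,)


def insert_before_first(first_child):
    """Left of all siblings: add -2 to the final ordinal of the first child."""
    return first_child[:-1] + (first_child[-1] - 2,)


def caret_between(left, right):
    """Between adjacent siblings: even caret component, then a new odd component."""
    if left[:-1] != right[:-1] or right[-1] - left[-1] != 2:
        raise ValueError("sketch only handles adjacent siblings two ordinals apart")
    return left[:-1] + (left[-1] + 1, 1)


new_label = caret_between((5, 5), (5, 7))
print(new_label)                       # (5, 6, 1)
print((5, 5) < new_label < (5, 7))     # True: document order is preserved
```

No existing label changes, and the even component marks the caret so the new node still counts as a child of 5.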

We can caret in entire subtree

Use of ORDPATH representation in Query plans

ORDPATH Primitives
PARENT(ORDPATH X)
➢ Parent of X
➢ Remove the rightmost component of X (odd)
➢ Remove all rightmost even ordinals
➢ e.g. PARENT( ) = (1.3)
GRDESC(ORDPATH X)
➢ Smallest ORDPATH-like value greater than any descendant of X
➢ Increment the last ordinal component of X
➢ e.g. GRDESC(1.3.1) = (1.3.2)
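Both primitives are short when labels are modeled as tuples of ordinals, which is an assumption of this sketch rather than the stored bitstring format. The function names `parent` and `grdesc` and the sample labels are illustrative.

```python
def parent(x):
    """Drop the final odd component, then any trailing even (caret) components."""
    components = list(x[:-1])
    while components and components[-1] % 2 == 0:
        components.pop()
    return tuple(components)


def grdesc(x):
    """Smallest label that sorts after every descendant of x."""
    return x[:-1] + (x[-1] + 1,)


print(parent((1, 3, 1)))     # (1, 3)
print(parent((1, 3, 2, 1)))  # (1, 3): the caret component 2 is skipped
print(grdesc((1, 3, 1)))     # (1, 3, 2)
```

Every descendant y of x then satisfies x < y < grdesc(x) under tuple comparison, which allows the descendants to be fetched with a single range scan on the primary key.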

Secondary Indices
➢ Element and attribute TAG index: supports fast lookup of elements and attributes by name
➢ Element and attribute VALUE index: supports fast lookup of elements and attributes by value

Query Plans
Query: //Book//Publisher[.="xyz"]
Plan-1: #descendants are small
➢ Retrieve all Book elements (secondary index on TAG) and all their descendants (using GRDESC(X))
Plan-2: #descendants are large
➢ Retrieve separate sequences of Book and Publisher[value="xyz"] (secondary index on VALUE)
➢ Join by ancestor//descendant

Query plans contd...
Plan-3: #descendants are extremely small
➢ Start at Publisher and look for a Book element as an ancestor (using PARENT(X))
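Plan-1 can be sketched as a range scan over a node table sorted by ORDPATH. The table layout, the `bisect`-based scan standing in for a primary-key index, and the sample rows are all assumptions made for illustration; labels are tuples of ordinals.

```python
from bisect import bisect_left, bisect_right


def grdesc(x):
    """Smallest label that sorts after every descendant of x."""
    return x[:-1] + (x[-1] + 1,)


def descendants(table, x):
    """Rows whose label lies strictly between x and GRDESC(x)."""
    keys = [row[0] for row in table]
    return table[bisect_right(keys, x):bisect_left(keys, grdesc(x))]


def plan_1(table, anc_tag, desc_tag, value):
    """//anc_tag//desc_tag[.=value] by scanning each ancestor's label range."""
    result = []
    for label, tag, _ in table:
        if tag == anc_tag:  # a TAG index would supply these rows directly
            for d_label, d_tag, d_value in descendants(table, label):
                if d_tag == desc_tag and d_value == value:
                    result.append((label, d_label))
    return result


# Hypothetical node table: (ORDPATH, tag, value), sorted by ORDPATH.
table = [
    ((1,), "Library", None),
    ((1, 1), "Book", None),
    ((1, 1, 1), "Title", "A"),
    ((1, 1, 3), "Publisher", "xyz"),
    ((1, 3), "Book", None),
    ((1, 3, 1), "Publisher", "abc"),
]
print(plan_1(table, "Book", "Publisher", "xyz"))  # [((1, 1), (1, 1, 3))]
```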

Insert-Friendly IDs
Generate labels that reflect document order but not path information
➢ Pass through the XML tree in document order
➢ Generate single-component L_0/O_0 pairs with ordinals 1, 3, 5, 7, ...
➢ Later insertions: ORDPATH careting-in method, which gives multiple even O_i components
Short ID: primary key in the node table
No relabeling required on inserts

Example

Insert between 5 & 7

Insert children of 6.1

Conclusion
➢ ORDPATH provides a hierarchical naming scheme
➢ It supports insertion of nodes at arbitrary positions without relabeling
➢ ORDPATH primitives, together with secondary indices, lead to efficient query plans

References
➢ Structural Joins: A Primitive for Efficient XML Query Pattern Matching, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, D. Srivastava, Y. Wu, ICDE 2002.
➢ ORDPATHs: Insert-Friendly XML Node Labels, Patrick E. O'Neil, Elizabeth J. O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury, SIGMOD 2004.
➢ On Supporting Containment Queries in Relational Database Management Systems, Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, Guy Lohman, SIGMOD 2001.

Thank You !


ORDPATH Length
Worst case: small fanout at each level (2)
Proved result:
➢ The average depth $P(n)$ of such a tree obeys the inequality $P(n) \le \log_2 n$
➢ If max. depth = $d$ and max. degree = $t$, then the max. bit length $L$ of labels is bounded: $d \log_2 t - 1 \le L \le 4 d \log_2 t$
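As a quick worked instance of the bound, take the illustrative values $d = 4$ and $t = 8$ (chosen here for convenience, not taken from the slides), so that $\log_2 t = 3$:

```latex
\[
d \log_2 t - 1 = 4 \cdot 3 - 1 = 11
\;\le\; L \;\le\;
4 d \log_2 t = 4 \cdot 4 \cdot 3 = 48
\]
```

Under these assumed values the longest label would need between 11 and 48 bits.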

Alternate Representation Of Li

Traditional Merge Join vs. MPMGJN

Performance Study Data Tree