Efficient processing of path query with not-predicates on XML data

Presentation transcript:

Efficient processing of path query with not-predicates on XML data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan {jiaoenhu, lingtw, chancy}@comp.nus.edu.sg Computer Science Department School of Computing National University of Singapore

Outline
- XML basics
- Motivating example
- Naïve approach
- Our solutions: PathStack and imp-PathStack
- A performance study
- Conclusion and future work
As the name implies, the second solution is an improved variant of the first. It reduces the CPU processing cost of the first solution, as discussed later.

XML basics
- XML documents are commonly modeled as ordered trees.
- Tree nodes: elements and values.
- Parent-child node pairs: element and direct subelement, or element and value.
Example document: <project> <supplier> <part> <color>‘blue’</color> <color>‘red’</color> </part> </supplier> </project>
[Figure: XML data tree with a project root, supplier nodes beneath it, part nodes beneath the suppliers, and color elements holding the values ‘red’, ‘blue’ and ‘yellow’.]
To distinguish an element node from a value node, we enclose the value node in a rectangular box. Such a tree is called an XML data tree, and each of its nodes is called an XML data node.

XML basics: node labeling scheme
How do we determine the structural relationship (parent-child, ancestor-descendant, or preceding-following) between two XML data nodes? For example, given a data node with the tagname “supplier” and a data node with the tagname “part”, how do we know which of these relationships holds between them?
- A set of labeling schemes has been proposed.
- Each node in the XML data tree is represented by a label according to its position in the tree.
- The structural relationship between two data nodes can be easily determined from their respective labels.
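The slides do not say which labeling scheme is used. As one possible illustration, the sketch below assumes the widely used region encoding, where each node carries a (start, end, level) label; the class name `Label`, the helper functions and the sample label values are hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Assumed region-encoding label; not necessarily the scheme used in the talk."""
    start: int  # position at which the element opens, in document order
    end: int    # position at which the element closes
    level: int  # depth in the data tree (root = 1)


def is_ancestor(a: Label, d: Label) -> bool:
    # a is an ancestor of d when a's region strictly contains d's region
    return a.start < d.start and d.end < a.end


def is_parent(p: Label, c: Label) -> bool:
    # a parent is an ancestor exactly one level above
    return is_ancestor(p, c) and c.level == p.level + 1


def precedes(x: Label, y: Label) -> bool:
    # x precedes y when x closes before y opens
    return x.end < y.start


# Hypothetical labels for a small project/supplier/part/color tree
project = Label(1, 12, 1)
supplier1 = Label(2, 9, 2)
part = Label(3, 8, 3)
color = Label(4, 7, 4)
supplier2 = Label(10, 11, 2)

print(is_parent(project, supplier1))   # True
print(is_ancestor(supplier1, color))   # True
print(is_parent(supplier1, color))     # False
print(precedes(supplier1, supplier2))  # True
```

Each check is a constant-time comparison of integers, which is what makes label-based structural tests cheap.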

XML basics: XML path queries
- Building block of XML queries: the path query (PQ), which specifies a path pattern to be matched by paths in the XML data tree, e.g. //project/supplier[.//part/color=‘red’]
- Search by value, e.g. color=‘red’: specifies that an element must have a certain value. It is easily supported by existing indices.
- Search by structure, e.g. //project/supplier[.//part/color]: specifies the structural relationships that pairs of elements must satisfy. It is the focus of current research.

Motivating examples
- Current research focus: path queries without not-predicates, e.g. //project/supplier[.//part/color=‘red’]
- Path queries with not-predicates, e.g. //project/supplier[not(.//part[./color=‘red’])]
- No solution has been proposed so far to process such queries.

Naïve approach
- Decompose //project/supplier[not(.//part[./color=‘red’])] into two queries without not-predicates: //project/supplier and //project/supplier[.//part/color=‘red’]
- Evaluate each of them with existing solutions.
- Obtain the answer by comparing the two result sets, i.e. computing their set difference.
- Apply the same idea recursively to path queries with nested not-predicates.
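The comparison of the two result sets can be sketched as a set difference. The snippet below is only an illustration: the function name `naive_not_query` and the sample result tuples are hypothetical, and the two input lists stand in for the outputs of an existing path-query evaluator.

```python
def naive_not_query(all_matches, positive_matches):
    """Return the matches of the outer query that are absent from the positive query.

    all_matches:      results of //project/supplier
    positive_matches: results of //project/supplier[.//part/color='red']
    """
    excluded = set(positive_matches)  # intermediate result kept in memory
    return [m for m in all_matches if m not in excluded]


# Hypothetical intermediate results from the two decomposed queries
all_suppliers = [
    ("project1", "supplier1"),
    ("project1", "supplier2"),
    ("project1", "supplier3"),
]
with_red_part = [("project1", "supplier2")]

result = naive_not_query(all_suppliers, with_red_part)
print(result)  # [('project1', 'supplier1'), ('project1', 'supplier3')]
```

Both inputs must be fully materialized before the difference can be taken, which is the source of the intermediate-result cost this approach incurs.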

Naïve approach: problems
- High I/O cost: the XML data is scanned repeatedly, and intermediate results are written and read back.
- High CPU cost: some structural relationships are processed redundantly, and a set difference must be computed.
- High memory usage: intermediate results must be stored.

Our Solution: PathStack Objectives XML data is scanned only once. No intermediate results. No redundant processing of structural relationships. Run time memory is bounded by the longest path in XML data tree.

PathStack: query definitions //project/supplier[not(.//part/color=‘red’)] ni: element tagname where i indicates the nesting level of the element. Two query nodes are connected by “||” if they are of ancestor-descendant relationship, or “|” if they are of parent-child relationship. “” represents a not-predicate. Result: <project, supplier> such that this project node is a parent of the supplier node, and the supplier doesn’t have a descendant part node with ‘red’ color.

PathStack : satisfaction of subqueries This slide is very important, as our solution is developed based on this concept. Given the query on the left hand side, (a) query (b) Data tree

PathStack : data structures Each query node ni:X is associated with a data stream Ti and a stack Si. Data stream (Ti): containing all data nodes from XML data tree with tagname = X, sorted in document order. Stack (Si): Let nj: Y be the query node which is the parent of the highest negative edge. Regular stack: associated with query nodes with i<j Stack item: <X, pointer to an item in Si-1>, X is a data node. Boolean stack: associated with query nodes with i≥j. Stack item: <X, pointer to an item in Si-1, satisfy>, X is a data node, satisfy is a boolean variable indicating if X satisfies its corresponding subquery. Can be denoted as Sbooli as well.

PathStack : an example in (a), Ai , Bi , … are the labels for element with tagname ‘A’, ‘B’, … respectively. It’s for easy distinguish of elements with the same tagname.

PathStack : key idea Visit data nodes in the set of associated streams in document order. Pop nodes in the set of stacks that do not lie on the same path as the data node selected in current round. Nodes must be popped from Si in decreasing i order. Let nj: Y be the query node which is the parent of the highest negative edge. For <X, satisfy> popped from Si: if i>j, then we can determine if some nodes in Si-1 satisfies their corresponding subquery, based on the satisfy of X, and the edge between query node ni-1 and ni. Else if i=j and satisfy=true, then there is a potential answer which can be read from the set of stacks. Push current node into its corresponding stack Sk. If Sk is a boolean stack, current node’s satisfy value will be initialized according to the edge between nK and nK+1.

PathStack : key idea (cont.) B2, t D1, t C2, f C1, f C1, t

PathStack : key idea (cont.) answer <A1, B2> B2, t C2, f B1, f

Imp-PathStack: minimizing Number of Boolean Stacks Boolean stacks are more costly to maintain than regular stacks. Can we use less Boolean stacks to achieve the same result as PathStack? Yes, only query node with negative child edge needs to be associated with Boolean stack. The leaf node in query path: always true (virtual Boolean stack) Query node with positive child node: satisfy value can be determined easily from the nodes in Sboolj, where nj is the nearest descendant query node of ni that is associated with a (real or virtual) boolean stack

Imp-PathStack: optimizing Stack Operations Some document nodes that do not affect the final results are still pushed into stacks. Can we avoid pushing such nodes into stacks? Not affecting the satisfy value of A1, A2 and A3, can be skipped

Performance study: configurations
- Testbed: we implemented the naïve approach, PathStack and imp-PathStack in Java, using the file system as the storage engine. Experiments were run on a 750 MHz UltraSPARC III CPU with 512 MB of main memory and a 300 MB disk quota.
- Experimental dataset: Treebank.xml. It has a maximum depth of 35, an average depth of 7.87, an average fan-out of 2.3, and about half a million nodes.
- Experimental queries: three sets of path queries containing 1, 2 or 3 not-predicates (denoted Q1, Q2 and Q3 respectively). Every query has about 152,000 data nodes in total in its associated streams (30% of the experimental dataset) and 2,000 nodes in its final result (0.4% selectivity).
- Evaluation metrics: execution time, and disk I/O measured as the total number of data nodes read from or written to disk.

Performance study: experiment queries
Q1 (one not-predicate):
  EMPTY/S//X/VP[not(NP/PP//JJR)]
  EMPTY/S//X[not(VP/NP/PP//JJR)]
  EMPTY/S[not(//X/VP/NP/PP//JJR)]
Q2 (two not-predicates):
  EMPTY/S//X/VP[not(NP/PP[not(//JJR)])]
  EMPTY/S//X[not(VP/NP/PP[not(//JJR)])]
  EMPTY/S[not(//X/VP/NP/PP[not(//JJR)])]
Q3 (three not-predicates):
  EMPTY/S//X/VP[not(NP[not(PP[not(//JJR)])])]
  EMPTY/S//X[not(VP/NP[not(PP[not(//JJR)])])]
  EMPTY/S[not(//X/VP/NP[not(PP[not(//JJR)])])]

Performance study: Naïve vs. PathStack
Observation: PathStack is more efficient than the naïve approach, and the performance improvement increases with the number of not-predicates.
Why? In the naïve approach, the more not-predicates the query contains, the more often the associated streams are rescanned and the more intermediate results are generated.

Performance study: PathStack vs. imp-PathStack Observation: imp-PathStack requires less execution time, however the improvement is very marginal. Why? Execution time dominating factor: I/O cost, CPU cost contributes a small portion to the overall execution time. Due to lack of index support, in our implementation, we still need to read the entire associated streams of a query to determine what are the nodes that can be skipped (which means no reduction of I/O cost in node skipping step).

Performance study: PathStack vs. imp-PathStack * * Stream size of each query set refers the total number of nodes in the set of data streams of each query. Observation: (1) percentage of nodes skipped is irrelevant to the number of not-predicates in the query; (2) the percentage of nodes skipped is not exciting. Why? The experimental data set we used has a deeply nested structure with low fan-out, our node skipping mechanism works well for data set with high fan-out.

Conclusion and future work
In this paper, we have:
- Defined the representation and matching of path queries with not-predicates.
- Proposed PathStack and its improved variant imp-PathStack.
- Implemented the naïve approach and our two solutions to study their performance.
For future work, we would like to extend our algorithm to process more general twig queries with not-predicates.

References
- E. Jiao. Efficient processing of XML path queries with not-predicates. M.Sc. Thesis, National University of Singapore, 2004.
- N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. of the SIGMOD, 2002.
- D. Florescu and D. Kossmann. Storing and querying XML data using an RDBMS. IEEE Data Engineering Bulletin, 22(3): 27-34, 1999.
- H. Jiang, H. Lu, and W. Wang. Efficient processing of XML twig queries with OR-predicates. In Proc. of the SIGMOD, 2004.
- D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In Proc. of the ICDE, pages 141-152, 2002.