1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Presenter: Qi He.

Ting Chen, Jiaheng Lu, Tok Wang Ling

Presentation transcript:

1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Presenter: Qi He

2 Outline
☞ XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
Our algorithm TwigStackList
Performance
Conclusion

3 XML Twig Pattern Matching: XML Data Model
• An XML document is commonly modeled as a rooted, ordered and labeled tree.
• E.g. document D1. Note that identifiers (e.g. b1) are given to tree nodes for easy reference.
[Figure: document tree D1 rooted at book (b1), with preface (pf1), chapter (c1, c2), section (s1 to s3), title (t1, t2), paragraph (p1 to p4) and figure (f1 to f3) nodes.]

4 XML Twig Pattern Matching: Regional Coding [1]
• Node label: (startPos: endPos, LevelNum). startPos and endPos are calculated by performing a pre-order traversal of the document tree; LevelNum is the level of the node in the tree.
• E.g. node labels in D1:
  book (0:50, 1)
  preface (1:3, 2)
  paragraph (2:2, 3)
  chapter (4:22, 2)
  section (5:21, 3)
  title (6:6, 4)
  section (7:12, 4)
  title (8:8, 5)
  paragraph (9:11, 5)
  figure (10:10, 6)
  section (13:17, 4)
  paragraph (14:16, 5)
  figure (15:15, 6)
  paragraph (18:20, 4)
  figure (19:19, 5)
  chapter (23:45, 2)
[1] M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD, 1994.
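
The structural tests that regional coding makes possible can be sketched as follows. This is a minimal illustration, not code from the presentation: the names `Region`, `is_ancestor` and `is_parent` are hypothetical, and the use of strict inequalities on positions is an assumption that happens to agree with the labels listed on this slide.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A node label (startPos: endPos, LevelNum) as described on the slide."""
    tag: str
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    """A-D test: a's interval strictly encloses d's interval (assumed convention)."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """P-C test: enclosure plus a level difference of exactly one."""
    return is_ancestor(p, c) and c.level == p.level + 1


# Labels copied from the slide's example document D1.
section = Region("section", 5, 21, 3)
title = Region("title", 6, 6, 4)
figure = Region("figure", 10, 10, 6)

print(is_parent(section, title))     # True: title sits one level below section
print(is_ancestor(section, figure))  # True: figure lies inside section's interval
print(is_parent(section, figure))    # False: three levels apart
```

Both tests need only the two labels, which is why a join algorithm can work on label streams without touching the document tree.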

5 XML Twig Pattern Matching: What is a Twig Pattern?
• A twig pattern is a small tree whose nodes are predicates (e.g. element type tests) and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges.
• E.g. the XPath query Q1: Section[Title]/Paragraph//Figure selects Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements having at least one child element Title.
[Figure: twig pattern Q1 with nodes Section, Title, Paragraph and Figure.]

6 XML Twig Pattern Matching: Twig Pattern Matching
• Problem statement: given a query twig pattern Q and an XML database D that has index structures (e.g. a regional coding scheme) to identify database nodes that satisfy each of Q’s node predicates, compute ALL the answers to Q in D.
• E.g. the matches for the twig pattern Section[Title]/Paragraph//Figure in document D1 are (s1, t1, p4, f3) and (s2, t2, p2, f1).
[Figure: document tree D1 with nodes b1, pf1, c1, c2, p1, s1, s2, f1, p2, s3, f2, p3, f3, s4, t1, t2.]
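
To make the problem statement concrete, here is a brute-force sketch that enumerates the matches of Section[Title]/Paragraph//Figure over the region labels from the Regional Coding slide. It is deliberately naive (a Cartesian product over all streams) and is not the TwigStack or TwigStackList algorithm; the helper names are hypothetical, and the transcript does not say which label belongs to which identifier such as s1, so matches are reported by start position.

```python
from itertools import product

# (startPos, endPos, LevelNum) labels copied from the Regional Coding slide.
STREAMS = {
    "section": [(5, 21, 3), (7, 12, 4), (13, 17, 4)],
    "title": [(6, 6, 4), (8, 8, 5)],
    "paragraph": [(2, 2, 3), (9, 11, 5), (14, 16, 5), (18, 20, 4)],
    "figure": [(10, 10, 6), (15, 15, 6), (19, 19, 5)],
}


def encloses(a, d):
    """A-D test on region labels (strict enclosure is an assumption)."""
    return a[0] < d[0] and d[1] < a[1]


def is_child(p, c):
    """P-C test: enclosure and adjacent levels."""
    return encloses(p, c) and c[2] == p[2] + 1


def match_q1(streams):
    """Return (section, title, paragraph, figure) start positions of every match."""
    matches = []
    for s, t, p, f in product(streams["section"], streams["title"],
                              streams["paragraph"], streams["figure"]):
        # Section/Title and Section/Paragraph are P-C edges; Paragraph//Figure is A-D.
        if is_child(s, t) and is_child(s, p) and encloses(p, f):
            matches.append((s[0], t[0], p[0], f[0]))
    return matches


MATCHES = match_q1(STREAMS)
print(MATCHES)  # [(5, 6, 18, 19), (7, 8, 9, 10)]
```

The sketch finds two matches, the same number the slide lists. Its cost grows with the product of the stream sizes, which is the inefficiency that holistic twig joins are meant to avoid.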

7 XML Twig Pattern Matching: TwigStack [2], a holistic approach
• Tag streaming: all elements of tag q are grouped in a stream T_q, ordered by their startPos.
• Optimal when all the edges in the twig pattern are A-D edges.
• Two-phase algorithm. Phase 1 (TwigJoin): a list of intermediate paths is output. Phase 2 (Merge): the intermediate path list is merged to get the result.
[2] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
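
The tag-streaming step can be illustrated with a small sketch. It only shows how elements might be grouped into per-tag streams sorted by startPos; the function name `build_streams` and the in-memory list representation are assumptions made for illustration, whereas a real system would read the streams from disk-resident indexes.

```python
from collections import defaultdict


def build_streams(elements):
    """Group (tag, startPos, endPos, LevelNum) records into one stream per tag,
    each ordered by startPos."""
    streams = defaultdict(list)
    for tag, start, end, level in elements:
        streams[tag].append((start, end, level))
    for stream in streams.values():
        stream.sort(key=lambda region: region[0])
    return dict(streams)


# A few labels from the Regional Coding slide, deliberately out of order.
ELEMENTS = [
    ("paragraph", 18, 20, 4),
    ("section", 13, 17, 4),
    ("section", 5, 21, 3),
    ("title", 8, 8, 5),
    ("section", 7, 12, 4),
    ("title", 6, 6, 4),
]

STREAMS = build_streams(ELEMENTS)
print(STREAMS["section"])  # [(5, 21, 3), (7, 12, 4), (13, 17, 4)]
print(STREAMS["title"])    # [(6, 6, 4), (8, 8, 5)]
```

Because every stream is in document order, a join algorithm can advance a cursor through each one in a single forward pass.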

8 XML Twig Pattern Matching: TwigStack Review
• A node q in a twig pattern Q is coupled with a stack S_q.
• An element e is pushed into its stack if and only if e is in some match to Q (in the figure, only the color-highlighted elements are pushed into their stacks). This ensures that no redundant paths are output.
• An element e is popped from its stack once all matches involving it have been reported. This ensures that the memory space used by the stacks is bounded.
[Figure: query Q: Section[//Title]//Paragraph//Figure with stacks S_Section, S_Title, S_Paragraph and S_Figure, evaluated over document tree D1.]

9 XML Twig Pattern Matching: Optimality of TwigStack for twig patterns with only A-D edges
• Each stream T_q is scanned only once, where q appears in the twig pattern.
• No redundant intermediate results: all intermediate paths output in Phase 1 appear in the final result. CPU and I/O cost: O(|Input| + |Output|).
• Space complexity: O(|Longest path in the XML tree|).

10 Sub-optimality of TwigStack
Unfortunately, TwigStack is sub-optimal for queries with any parent-child relationship. For such queries, TwigStack may output a large number of intermediate results that are not merge-joinable to final solutions.

11 Example for sub-optimality of TwigStack
TwigStack outputs (s1, t1) as an intermediate result, since s1 has descendants t1 and p1, and p1 in turn has a descendant f2. Observe, however, that p1 has no child with tag figure. There is therefore no match in this XML tree, and (s1, t1) is a “useless” solution.
[Figure: a twig pattern with nodes Section, title, paragraph and figure, next to a simple XML tree with nodes s1, t1, p1, t2, f2.]
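
The gap described on this slide can be sketched by comparing a test that treats every query edge as ancestor-descendant with one that honors the parent-child edge between paragraph and figure. This is a simplification for illustration, not the TwigStack algorithm. The region labels below are hypothetical, and placing t2 between p1 and f2 is an assumption, since the transcript does not preserve the shape of the tree.

```python
def encloses(a, d):
    """A-D test on (startPos, endPos, LevelNum) labels."""
    return a[0] < d[0] and d[1] < a[1]


def is_child(p, c):
    """P-C test: enclosure and adjacent levels."""
    return encloses(p, c) and c[2] == p[2] + 1


# Hypothetical labels: s1 contains t1 and p1; p1 contains t2, which contains f2.
S1 = (1, 10, 1)
T1 = (2, 2, 2)
P1 = (3, 9, 2)
T2 = (4, 8, 3)
F2 = (5, 5, 4)


def descendant_only_test(s, t, p, f):
    """Every query edge is checked as if it were an A-D edge."""
    return encloses(s, t) and encloses(s, p) and encloses(p, f)


def edge_aware_test(s, t, p, f):
    """Same test, except that paragraph/figure is checked as a P-C edge."""
    return encloses(s, t) and encloses(s, p) and is_child(p, f)


print(descendant_only_test(S1, T1, P1, F2))  # True: s1 looks like part of a match
print(edge_aware_test(S1, T1, P1, F2))       # False: f2 is not a child of p1
```

An algorithm that relies only on the first test would accept s1 and emit (s1, t1), even though the second test shows that no complete match exists.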

12 Main problem and my experiment
As shown before, TwigStack might output intermediate results that are not merge-joinable to final solutions for queries with parent-child edges. To understand this better, we ran TwigStack on a real dataset.
Data set: TreeBank [UW XML repository]
Queries:
• Q1: VP[/DT]//PRP_DOLLAR_
• Q2: S//NP[//PP/TO][/VP/_NONE_]/JJ
• Q3: S[/JJ]/NP
All queries contain parent-child relationships.

13 Our experimental results
Columns: query | intermediate paths by TwigStack | merge-joinable paths | percentage of useless intermediate paths
Q1 | 10,… | … | …%
Q2 | 24,… | … | …%
Q3 | 70,… | … | …%
Most intermediate paths do not contribute to final answers because of parent-child edges! Improving TwigStack to answer queries with parent-child edges is a big challenge.

14 Our intuitive observation
We can improve TwigStack for queries like the one in the previous example. Our intuitive observation: why not read more paragraph elements and cache them in main memory? For example, in this XML tree, after scanning p1 we do not stop but continue to read the next element. We then find that there is only one paragraph element and that f1 is not a child of it, so we should not output any solution.
[Figure: a twig pattern with nodes Section, title, paragraph and figure, next to a simple XML tree with nodes s1, t1, p1, t2, f1.]

15 Outline
XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
☞ Our algorithm TwigStackList
Experimental results
Conclusion

16 Our main idea
Main idea: we read more elements in the input stream and cache some of them in main memory, so that we can decide more accurately whether an element can contribute to final answers. One desideratum: we cannot cache too many elements in main memory. For each node q in the twig query, the number of elements with tag q cached in main memory should not exceed the length of the longest path in the XML dataset.

17 Our caching strategy
What elements should be cached in main memory?
• Only those that may contribute to final answers.
In this example we only need to cache s1, s2 and s4 in main memory. Why not s3? If s3 contributed to a final answer, there would be an element before p1 that is a child of s3. But p1 is the first element, so s3 is guaranteed not to contribute to any final answer.
[Figure: a twig pattern with nodes Section, title and paragraph, next to a simple XML tree with nodes s1, s2, s3, s4, t1, p1.]

18 Our criteria for pushing an element to stack
Whether an element can be pushed into its stack is very important for controlling intermediate results. Why? Because once an element is pushed into the stack, it is ready to be output. So the fewer elements are pushed into stacks, the fewer intermediate results are output.
Our criteria: given an element e_q from stream T_q, before e_q is pushed into stack S_q we ensure that (i) e_q has a descendant e_q’ for each child q’ of q; (ii) if (q, q’) is a parent-child relationship, e_q’ has a parent with tag q on the path from e_q to e_qmax, where e_qmax is the descendant of e_q with the maximal start value; and (iii) each q’ recursively satisfies the first two conditions.

19 Examples
Let us see two examples to understand the criteria. In the first, element s1 can be pushed into the stack, but s2 and s3 cannot. Note that s1 can be pushed not just because t1, p1 and f1 are its descendants but, more importantly, because on the path from s1 to f1, elements t1, p1 and f1 can find their parents with tag section.
[Figure: a twig pattern with nodes Section, title, paragraph and figure, next to a simple XML tree with nodes s1, s2, s3, t1, p1, f1.]

20 Examples
In this example, s1 cannot be pushed into the stack. Although elements t1, p1 and f1 are still descendants of s1, on the path from s1 to f1 element p1 cannot find a parent with tag section: the parent of p1 is o1 (o1 stands for some other element). Here we cache s1 and s2 in main memory, because they might be involved in query answers in the future.
[Figure: a twig pattern with nodes Section, title, paragraph and figure, next to a simple XML tree with nodes s1, s2, o1, t1, p1, f1.]

21 TwigStackList
We propose a novel holistic twig algorithm, TwigStackList, to evaluate a twig query. Unlike TwigStack, TwigStackList has the following features:
• It considers the parent-child edges in the query and strengthens the criteria for pushing elements into stacks.
• It uses a list data structure to cache elements that are likely to participate in final solutions. The number of elements in any list is strictly bounded by the longest path in the dataset.
• It is optimal for a broader class of queries: TwigStackList guarantees that each output intermediate solution contributes to final answers when the query contains only ancestor-descendant edges below branching nodes.

22 Example
TwigStackList is I/O optimal for the following query, whereas TwigStack is sub-optimal. Note that below the branching node section, all edges in the query are A-D relationships. In this case, TwigStackList does not push s1 to the stack and thereby avoids outputting (s1, t1). TwigStack, however, pushes s1 to the stack and outputs (s1, t1), which is a useless intermediate solution.
[Figure: a twig pattern with nodes Section, title, paragraph and figure, next to a simple XML tree with nodes s1, t1, p1, t2, f1.]

23 Sub-optimality of TwigStackList
Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges below branching nodes. Observe that there is no matching solution for this dataset, yet TwigStackList caches s1 and s2 in the list and pushes s1 to the stack. So (s1, t1) will be output as a useless solution.
[Figure: a twig pattern with nodes Section, title and paragraph, next to a simple XML tree with nodes s1, s2, t1, f1.]

24 Outline
XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
Our algorithm TwigStackList
☞ Experimental results
Conclusion

25 Experimental Setting
• Pentium 4 CPU, 768MB RAM, 2GB disk
• TreeBank: maximal depth 36, 2.4 million nodes
• DTD data: a → bc | cb | d; c → a, where a and c are non-terminals and b and d are terminals
• Random: seven tags (a, b, c, d, e, f, g), uniformly distributed; fan-out of elements varied 2-100; depth varied

26 Performance against TreeBank
Queries with XPath expressions:
Q1: S[//MD]//ADJ
Q2: S/VP/PP[/NP/VBN]/IN
Q3: S/VP//PP[//NP/VBN]//IN
Q4: VP[/DT]//PRP_DOLLAR_
Q5: S[//VP/IN]//NP
Q6: S[/JJ]/NP
Number of intermediate path solutions, TwigStackList vs. TwigStack
Columns: query | TwigStack | TwigStackList | reduction percentage | useful paths
Q1 | 35 | … | 0% | 35
Q2 | … | … | …% | 92
Q3 | … | … | …% | 4612
Q4 | … | … | …% | 5
Q5 | … | … | …% | 22565
Q6 | … | … | …% | 10

27 Performance analysis
We have three observations:
(1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1.
(2) When edges below branching nodes contain only ancestor-descendant relationships, TwigStackList is optimal but TwigStack is sub-optimal. See Q3 and Q5.
(3) When edges below branching nodes contain parent-child relationships, both TwigStack and TwigStackList are sub-optimal, but TwigStackList typically outputs far fewer “useless” intermediate solutions than TwigStack. See Q2, Q4 and Q6.

28 Performance against DTD data
There is no matching solution for query a[//b]//c/d in the DTD dataset, yet TwigStack outputs many redundant path solutions. In contrast, TwigStackList is optimal for this query and significantly outperforms TwigStack.

29 Performance against random dataset
From the table below, we see that for all queries TwigStackList is again more efficient than TwigStack in terms of the size of intermediate results.
Columns: query | TwigStack | TwigStackList | reduction percentage | useful paths
Q… | … | … | …% | 2077
Q… | … | … | …% | 100
Q… | … | … | …% | 14476
Q… | … | … | …% | 16775
Q… | … | … | …% | 566
[Figure: the twig queries used for the random dataset.]

30 Outline
XML Twig Pattern Matching
  • Problem definition
  • State of the art: TwigStack
  • Sub-optimality of TwigStack
Our algorithm TwigStackList
Experimental results
☞ Conclusion

31 Conclusion
The previous algorithm TwigStack is sub-optimal for queries with parent-child edges. We propose a new algorithm, TwigStackList, to address this problem. TwigStackList broadens the class of queries with I/O optimality. Experiments show that TwigStackList typically outputs far fewer useless intermediate results whenever the query contains parent-child relationships. We recommend using TwigStackList to evaluate queries with parent-child relationships.

32 Backup questions: 1.
Turn back to the slide “Performance against DTD data”. In the two figures, what is the X-axis?
The X-axis shows the ratio of the number of elements with tag d to the number with tags b and c. This ratio is important because, according to the DTD (a → bc | cb | d; c → a), for query a[//b]//c/d the number of “useless” intermediate results output by TwigStack increases as the ratio decreases. In contrast, TwigStackList is optimal in this case, so it is not affected by changes in the ratio. Therefore, we show the superiority of TwigStackList over TwigStack by varying the ratio.

33 Backup questions: 2.
You say that TwigStackList is more efficient than TwigStack because it outputs fewer intermediate results. So it is easy to see that TwigStackList is better than TwigStack in terms of I/O cost, but what about CPU cost?
TwigStackList is more efficient than TwigStack for evaluating queries with parent-child relationships in terms of not only intermediate result size but also execution time. Of course, TwigStackList needs to scan the elements cached in main memory, which slightly increases the CPU cost. But compared with the large benefit from the reduced I/O cost, this cost is worthwhile.