1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Jiaheng Lu, Ting Chen, Tok Wang Ling National University of.

Presentation transcript:

1 Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach Jiaheng Lu, Ting Chen, Tok Wang Ling National University of Singapore Nov CIKM 2004 Washington D.C. U.S.A.

2 Outline XML Twig Pattern Matching Problem definition State of the Art: TwigStack Sub-optimality of TwigStack Our algorithm: TwigStackList Performance Conclusion

3 XML Twig Pattern Matching An XML document is commonly modeled as a rooted, ordered and labeled tree. [Figure: example XML tree rooted at book, with preface and chapter children, nested section elements containing title, paragraph and figure elements, and the text values "Intro", "XML" and "Data".]

4 Regional Coding
Node label [1]: (startPos:endPos, LevelNum). E.g.:
book (0:32, 1)
preface (1:3, 2)
chapter (4:29, 2)
chapter (30:31, 2)
Intro (2:2, 3)
section (5:28, 3)
section (9:17, 4)
figure (14:15, 6)
paragraph (13:16, 5)
section (18:23, 4)
figure (20:21, 6)
paragraph (19:22, 5)
figure (25:26, 5)
paragraph (24:27, 4)
title (6:8, 4)
title (10:12, 5)
Data (7:7, 3)
XML (11:11, 3)
[1] M. P. Consens and T. Milo. Optimizing queries on files. In Proceedings of ACM SIGMOD.
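The structural tests that region coding makes possible can be sketched as follows. This is a minimal illustration, not code from the talk: the names `Region`, `is_ancestor` and `is_parent` are assumptions, and only the three labels are copied from the slide.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Region label (startPos:endPos, LevelNum) of one node."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    """a is a proper ancestor of d when a's interval strictly contains d's."""
    return a.start < d.start and d.end < a.end


def is_parent(a: Region, d: Region) -> bool:
    """a is the parent of d when it contains d and sits exactly one level above."""
    return is_ancestor(a, d) and a.level + 1 == d.level


# Labels copied from the slide.
section = Region(5, 28, 3)
paragraph = Region(24, 27, 4)
figure = Region(25, 26, 5)

print(is_ancestor(section, figure))  # True
print(is_parent(section, figure))    # False
print(is_parent(paragraph, figure))  # True
```

Both tests need only the two labels being compared, which is what lets the join algorithms work on sorted element streams without touching the tree itself.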

5 What is a Twig Pattern? A twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either Parent-Child (P-C) edges or Ancestor-Descendant (A-D) edges. E.g., the following twig pattern selects Figure elements that are descendants of Paragraph elements, which in turn are children of Section elements that have a Title child element. [Figure: twig pattern with root Section, children Title and Paragraph, and Figure below Paragraph.]
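One possible in-memory representation of this twig pattern is sketched below. The `TwigNode` class, the `"PC"`/`"AD"` edge markers and the `to_xpath` helper are all hypothetical names chosen for illustration. The XPath rendering is our reading of the prose description, not an expression taken from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TwigNode:
    tag: str
    edge: Optional[str] = None  # edge from the parent: "PC", "AD", or None for the root
    children: List["TwigNode"] = field(default_factory=list)


def to_xpath(node: TwigNode) -> str:
    """Render the twig: the last child continues the path, earlier children become predicates."""
    if not node.children:
        return node.tag
    *predicates, main = node.children
    text = node.tag
    for p in predicates:
        prefix = "" if p.edge == "PC" else ".//"
        text += "[" + prefix + to_xpath(p) + "]"
    separator = "/" if main.edge == "PC" else "//"
    return text + separator + to_xpath(main)


# Section has a Title child (P-C) and a Paragraph child (P-C); Figure is a descendant (A-D) of Paragraph.
twig = TwigNode("Section", children=[
    TwigNode("Title", "PC"),
    TwigNode("Paragraph", "PC", [TwigNode("Figure", "AD")]),
])

print(to_xpath(twig))  # Section[Title]/Paragraph//Figure
```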

6 XML Twig Pattern Matching: Problem Statement Given a query twig pattern Q and an XML database D, we need to compute ALL the answers to Q in D. E.g., consider Q1 and Doc1. [Figure: Doc1, an XML tree with elements s1, s2, t1, t2, p1, f1; Q1, a twig pattern with root Section and children title and figure.] Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1).
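To make the problem statement concrete, here is a brute-force matcher for Q1. It rests on two assumptions that the transcript does not confirm: the (start, end) positions are hypothetical values chosen so that the tree yields the three solutions listed on the slide, and both edges of Q1 are treated as ancestor-descendant edges.

```python
from itertools import product

# Hypothetical region positions: name -> (tag, start, end).
doc = {
    "s1": ("section", 1, 12),
    "t1": ("title", 2, 3),
    "s2": ("section", 4, 11),
    "t2": ("title", 5, 6),
    "p1": ("paragraph", 7, 10),
    "f1": ("figure", 8, 9),
}


def contains(a: str, d: str) -> bool:
    """True when element a is a proper ancestor of element d."""
    return doc[a][1] < doc[d][1] and doc[d][2] < doc[a][2]


def by_tag(tag: str) -> list:
    return [name for name, (t, _, _) in doc.items() if t == tag]


def match_q1() -> list:
    """All (section, title, figure) triples with title and figure below the section."""
    return [
        (s, t, f)
        for s, t, f in product(by_tag("section"), by_tag("title"), by_tag("figure"))
        if contains(s, t) and contains(s, f)
    ]


print(match_q1())
```

Enumerating every combination is exponential in the query size. The holistic algorithms in the rest of the talk exist to avoid exactly this.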

7 Previous work: TwigStack TwigStack [2]: a holistic approach. Two-phase algorithm: Phase 1 (TwigJoin): intermediate root-to-leaf path solutions are output. Phase 2 (Merge): the intermediate path lists are merged to obtain the final result. [2] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.

8 Previous work: TwigStack Each node q in a twig pattern Q is associated with a stack S_q. Insertion: an element e_q from stream T_q is pushed onto its stack S_q if and only if (i) e_q has a descendant e_{q_i} in each stream T_{q_i}, where q_i is a child of q, and (ii) each such e_{q_i} recursively satisfies property (i). Deletion: an element e_q is popped from its stack once all matches involving it have been output.
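The insertion condition can be sketched as a recursive check. This is a simplification under stated assumptions: the real algorithm tests the current heads of the stream cursors, whereas this sketch scans whole in-memory streams. The query shape, the positions and the names `CHILDREN`, `STREAMS` and `can_push` are hypothetical.

```python
# Hypothetical twig: section has children title and paragraph; paragraph has child figure.
CHILDREN = {
    "section": ["title", "paragraph"],
    "title": [],
    "paragraph": ["figure"],
    "figure": [],
}

# Hypothetical streams of (start, end) pairs, sorted by start position.
# The figure (6, 7) is assumed to be wrapped by an element of another tag,
# so it is a descendant of the paragraph but not its child.
STREAMS = {
    "section": [(1, 12)],
    "title": [(2, 3)],
    "paragraph": [(4, 11)],
    "figure": [(6, 7)],
}


def can_push(q, e, streams, children=CHILDREN):
    """e qualifies if every child query node has a qualifying descendant of e."""
    start, end = e
    for qi in children[q]:
        if not any(
            start < s and t < end and can_push(qi, (s, t), streams, children)
            for (s, t) in streams[qi]
        ):
            return False
    return True


print(can_push("section", (1, 12), STREAMS))  # True
```

The check only looks at containment, never at levels. This is why it accepts the section element even when the query asks for a figure that is a child, rather than a descendant, of the paragraph.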

9 Sub-optimality of TwigStack TwigStack is I/O optimal only for queries whose edges are all ancestor-descendant edges. Unfortunately, it is sub-optimal for queries with any parent-child edge: for such queries it may output a large number of intermediate results that are not merge-joinable into any final solution.

10 Sub-optimality of TwigStack: an example [Figure: twig pattern with nodes Section, title, paragraph, figure; a simple XML tree with elements s1, t1, p1, t2, f1.] Since s1 has descendants t1 and p1, and p1 in turn has descendant f1, TwigStack outputs an intermediate path solution. But this path is useless, because the example has no solution at all.

11 Main problem and our experiment TwigStack may output intermediate results that do not contribute to any query answer. To understand the extent of the problem, we ran TwigStack on a real dataset. Data set: TreeBank (from the U. of Washington XML datasets). Queries: Q1: VP[/DT]//PRP_DOLLAR_; Q2: S//NP[//PP/TO][/VP/_NONE_]/JJ; Q3: S[/JJ]/NP. All queries contain parent-child relationships.

12 Our experimental results
Query | Intermediate paths by TwigStack | Merge-joinable paths | Percentage of useless intermediate paths
Q1 | 10,… | … | …%
Q2 | 24,… | … | …%
Q3 | 70,… | … | …%
Most intermediate paths do not contribute to final answers because of parent-child edges! Improving TwigStack to answer queries with parent-child edges is a big challenge.
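The last column of this table is presumably derived from the first two. The formula below is our assumption about how that percentage is computed, and the numbers in the usage line are hypothetical, not the measurements from the talk.

```python
def useless_percentage(intermediate_paths: int, merge_joinable_paths: int) -> float:
    """Share of intermediate paths that never join into a final answer, in percent."""
    if intermediate_paths == 0:
        return 0.0
    useless = intermediate_paths - merge_joinable_paths
    return 100.0 * useless / intermediate_paths


# Hypothetical counts: 1000 intermediate paths, of which 50 are merge-joinable.
print(useless_percentage(1000, 50))  # 95.0
```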

13 Intuition for improvement [Figure: twig pattern with nodes Section, title, paragraph, figure; a simple XML tree with elements s1, t1, p1, t2, f1.] Our intuitive observation: why not read more paragraph elements and cache them in main memory? For example, after scanning p1, we do not stop but continue to read the next paragraph element. We then find that p1 is the only paragraph element and that f1 is not a child of it, so we should not output any intermediate solution.

14 Outline XML Twig Pattern Matching Problem definition State of the Art: TwigStack Sub-optimality of TwigStack Our algorithm TwigStackList Experimental results Conclusion

15 Our main idea Main idea: we read more elements from the input streams and cache some of them in main memory, so that we can decide more accurately whether an element can contribute to a final answer. But we cannot cache too many elements: for each node q in the twig query, the number of elements with tag q cached in main memory should not exceed the length of the longest path in the XML dataset.

16 Our caching method Which elements should be cached in main memory? Only those that might contribute to final answers. [Figure: twig pattern with nodes Section, title, paragraph, figure; a simple XML tree with elements s1, t1, p1, p2, p3, f1.] We only need to cache p1 and p3. Why not p2? If p2 contributed to a final answer, some figure element before f1 would have to be a child of p2. But f1 is the first figure element, so p2 is guaranteed not to contribute to any final answer.
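The caching rule in this example can be sketched as follows: keep only those paragraph elements that contain the current head of the figure stream. The (start, end) positions and the function name `elements_to_cache` are hypothetical, picked to mirror the tree described on the slide.

```python
def elements_to_cache(parent_stream, child_head):
    """Return names of parent-stream elements that are ancestors of child_head.

    parent_stream: list of (name, (start, end)) sorted by start position.
    child_head:    (start, end) of the first unread element of the child stream.
    """
    child_start, child_end = child_head
    cached = []
    for name, (start, end) in parent_stream:
        if start > child_start:
            # Streams are sorted by start, so no later element can contain child_head.
            break
        if child_end < end:
            cached.append(name)
    return cached


# Hypothetical positions: p1 contains p2 and p3; p3 contains f1; p2 ends before f1 starts.
paragraphs = [("p1", (4, 13)), ("p2", (5, 6)), ("p3", (7, 10))]
f1 = (8, 9)

print(elements_to_cache(paragraphs, f1))  # ['p1', 'p3']
```

All cached elements contain the same child element, so they lie on one root-to-leaf path. That is why the cache size is bounded by the length of the longest path in the document.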

17 Our criteria for pushing an element onto a stack The criteria for pushing an element onto a stack are very important for controlling intermediate results. Why? Once an element is pushed onto a stack, it is ready to be output, so the fewer elements are pushed, the fewer intermediate results are output. Our criteria: given an element e_q from stream T_q, before e_q is pushed onto stack S_q, we ensure that (i) e_q has a descendant e_{q_i} for each child q_i of q; (ii) if (q, q_i) is a parent-child relationship, e_{q_i} has a parent with tag q in the path from e_q to e_{q_max}, where e_{q_max} is the descendant of e_q with the maximal start value, q_max being a child of q; and (iii) each q_i recursively satisfies the first two conditions.

18 Examples [Figure: twig pattern with nodes Section, title, paragraph, figure; a simple XML tree with elements s1, t1, p1, p2, p3, f1.] Element p3 can be pushed onto the stack, but p1 and p2 cannot, because only p3 has f1 as a child. Although p1 has f1 as a descendant, f1 is not a child of p1.
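The example above can be replayed with a small check. It covers only the parent-child edge between paragraph and figure, not the full three-part criterion or the algorithm's actual look-ahead procedure. The positions, levels and the names `Elem` and `has_child` are hypothetical.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    start: int
    end: int
    level: int


def has_child(parent: Elem, stream: list) -> bool:
    """True when some element of the stream is a direct child of parent."""
    return any(
        parent.start < c.start and c.end < parent.end and c.level == parent.level + 1
        for c in stream
    )


# Hypothetical labels: p1 contains p2 and p3, and only p3 is the parent of f1.
paragraphs = [Elem("p1", 4, 13, 2), Elem("p2", 5, 6, 3), Elem("p3", 7, 10, 3)]
figures = [Elem("f1", 8, 9, 4)]

pushable = [p.name for p in paragraphs if has_child(p, figures)]
print(pushable)  # ['p3']
```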

19 Our algorithm: TwigStackList We propose a novel holistic twig algorithm, TwigStackList, to evaluate a twig query. Unique features of TwigStackList: It takes the parent-child edges in the query into account. It keeps a list for each query node to cache elements that are likely to participate in final solutions. It identifies a broader class of optimal queries: TwigStackList guarantees I/O optimality for queries in which only ancestor-descendant edges connect branching nodes to their children.

20 TwigStackList: an example [Figure: twig pattern with nodes Section, title, paragraph, figure; an XML tree with elements Root, s1, s2, t1, t2, t3, p1, p2, p3, f1, f2; the stacks and lists of the query nodes.] Scan s1, t1, p1, f1.

21 TwigStackList: an example Since p1 is an ancestor of f1 but not its parent, we continue to scan p2 and put p1 in the list.

22 TwigStackList: an example Put p2 and p3 in the list; the cursor points to p3, because it is the parent of f2.

23 TwigStackList: an example Output the intermediate solutions. Final step: merge.

24 TwigStackList vs. TwigStack TwigStackList is I/O optimal for the above query. In contrast, TwigStack is sub-optimal, because it outputs a useless path solution. [Figure: twig pattern with nodes Section, title, paragraph, figure; an XML tree with elements Root, s1, s2, t1, t2, t3, p1, p2, p3, f1, f2.]

25 Sub-optimality of TwigStackList Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges connecting branching nodes. [Figure: twig pattern with root Section and children title and paragraph; a simple XML tree with elements s1, s2, t1, p1.] Observe that there is no matching solution for this dataset. But TwigStackList caches s1 and s2 in the list and pushes s1 onto the stack, so (s1, t1) is output as a useless solution.

26 Sub-optimality of TwigStackList Although TwigStackList broadens the class of optimal queries compared to TwigStack, it is still sub-optimal for queries with parent-child edges connecting branching nodes. [Figure: twig pattern with root Section and children title and paragraph; a simple XML tree with elements s1, s2, t1, p1, p2.] Observe that there is no matching solution for this dataset. But TwigStackList caches s1 and s2 in the list and pushes s1 onto the stack, so (s1, t1) is output as a useless solution. Here the behavior of TwigStackList is still reasonable: before advancing p1, we do not know whether s1 has a child p2 following p1.

27 Outline XML Twig Pattern Matching Problem definition State of the Art: TwigStack Sub-optimality of TwigStack Our algorithm TwigStackList Experimental results Conclusion

28 Experimental Setting Hardware: Pentium 4 CPU, 768 MB RAM, 2 GB disk. TreeBank: downloaded from the University of Washington XML datasets; maximal depth 36; 2.4 million nodes. Random: seven tags (a, b, c, d, e, f, g), uniformly distributed; fan-out of elements varied from 2 to 100; depth varied.

29 Performance against TreeBank
Queries with XPath expressions:
Q1: S[//MD]//ADJ
Q2: S/VP/PP[/NP/VBN]/IN
Q3: S/VP//PP[//NP/VBN]//IN
Q4: VP[/DT]//PRP_DOLLAR_
Q5: S[//VP/IN]//NP
Q6: S[/JJ]/NP
Number of intermediate path solutions for TwigStackList vs. TwigStack:
Query | TwigStack | TwigStackList | Reduction percentage | Useful paths
Q1 | 35 | … | 0% | 35
Q2 | … | … | …% | 92
Q3 | … | … | …% | 4612
Q4 | … | … | …% | 5
Q5 | … | … | …% | 22565
Q6 | … | … | …% | 10

30 Performance analysis We have three observations: (1) When queries contain only ancestor-descendant edges, the two algorithms have similar performance. See Q1. (2) When the edges connecting branching nodes are all ancestor-descendant edges, TwigStackList is optimal but TwigStack is sub-optimal. See Q3 and Q5. (3) When the edges connecting branching nodes include parent-child relationships, both TwigStack and TwigStackList are sub-optimal, but TwigStackList typically outputs far fewer useless intermediate solutions (<5%) than TwigStack. See Q2, Q4 and Q6.

31 Performance against random dataset
Query | TwigStack | TwigStackList | Reduction | Useful paths
Q… | … | … | …% | 2077
Q… | … | … | …% | 100
Q… | … | … | …% | 14476
Q… | … | … | …% | 16775
Q… | … | … | …% | 566
The table shows that, for all queries, TwigStackList is again more efficient than TwigStack in terms of the size of intermediate results.

32 Outline XML Twig Pattern Matching Problem definition State of the Art: TwigStack Sub-optimality of TwigStack Our algorithm TwigStackList Experimental results Conclusion

33 Conclusion The previous algorithm, TwigStack, is sub-optimal for queries with parent-child edges. We propose a new algorithm, TwigStackList, to address this problem. TwigStackList broadens the class of queries with I/O optimality. Experiments show that TwigStackList typically outputs far fewer useless intermediate results whenever the query contains parent-child edges. We recommend TwigStackList as a new holistic join algorithm for evaluating queries with parent-child edges.

34 Thank You! Q & A