From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen National.

Presentation transcript:

From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, Ting Chen National University of Singapore

2 Outline
Background
  Define our problem: XML twig pattern matching
  Previous work and problems
Our new twig matching algorithms
  A new labeling scheme: extended Dewey
  A new holistic algorithm: TJFast
Experimental results
Conclusion

3 XML basics XML is short for Extensible Markup Language, a language for defining the syntax and semantics of structured data. An XML document is commonly modeled as a rooted, ordered and tagged tree. [Figure: an example XML tree rooted at book, with preface, chapter, section, paragraph and title elements and text values such as “XML”, “Data” and “Intro”.]

4 Querying XML Data The major standards for querying XML data are XPath and XQuery. XML twig pattern matching is a core operation in both. Definition of an XML twig pattern: an XML twig pattern is a small tree whose nodes are tags, attributes or text values, and whose edges are either parent-child (P-C) edges or ancestor-descendant (A-D) edges.

5 An XML twig pattern example Task: create a flat list of all the title-author pairs for every book in the bibliography. XQuery: { for $b in doc("bib.xml")/bib//book, $t in $b/title, $a in $b/author return { $t } { $a } } To answer the XQuery, we first need to match the following XML twig pattern: bib//book ($b) with children title ($t) and author ($a). The edge from bib to book is an ancestor-descendant relationship; the edges from book to title and from book to author are parent-child relationships.

6 Our research problem Problem statement: given an XML twig pattern Q and an XML database D, we need to find ALL the matches of Q on D. E.g., consider the following twig pattern and document. [Figure: an XML tree with elements s1, s2, p1, t1, t2 and f1.] Twig pattern: Section with children Title and Figure. Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1).


10 Previous work: TwigStack [1] In TwigStack, each element in the document is labeled with the region encoding labeling scheme. Node label: (startPos : endPos, LevelNum). A sample XML tree with region encoding: s1 (1:12,1), t1 (2:3,2), s2 (4:9,2), t2 (5:6,3), f1 (7:8,3), f2 (10:11,2). s1 is an ancestor of f1, since (1) s1.start < f1.start and (2) s1.end > f1.end. [1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD.
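The containment test on region labels can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the `Region` type and the function names are assumptions, and the assignment of labels to elements is the one implied by the input streams of the TwigStack example (section, title and figure elements in document order).

```python
from typing import NamedTuple

class Region(NamedTuple):
    """Region encoding label: (startPos : endPos, LevelNum)."""
    start: int
    end: int
    level: int

def is_ancestor(a: Region, d: Region) -> bool:
    # a is an ancestor of d iff a's interval strictly contains d's interval.
    return a.start < d.start and a.end > d.end

def is_parent(a: Region, d: Region) -> bool:
    # A parent is an ancestor exactly one level above.
    return is_ancestor(a, d) and a.level + 1 == d.level

# Labels of the sample tree (element-to-label assignment is an assumption).
labels = {
    "s1": Region(1, 12, 1), "t1": Region(2, 3, 2), "s2": Region(4, 9, 2),
    "t2": Region(5, 6, 3), "f1": Region(7, 8, 3), "f2": Region(10, 11, 2),
}

print(is_ancestor(labels["s1"], labels["f1"]))  # ancestor-descendant check
print(is_parent(labels["s2"], labels["f1"]))    # parent-child check
```

Both tests need only the two labels involved, which is why region encoding supports structural joins without touching the document tree.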

11 Previous work: TwigStack TwigStack is a two-phase algorithm. Phase 1: partial solutions for the individual root-to-leaf paths of the query are output as intermediate results. Phase 2: the intermediate paths are merge-joined to get the final results.

12 Example: TwigStack Document (with region encoding): s1 (1:12,1), t1 (2:3,2), s2 (4:9,2), t2 (5:6,3), f1 (7:8,3), f2 (10:11,2). Query: section with children title and figure. Input streams: section: (1:12,1), (4:9,2); title: (2:3,2), (5:6,3); figure: (7:8,3), (10:11,2). The elements in each stream are arranged in document order. TwigStack scans all six labels once to answer the query.

15 Previous work: TwigStack Document and query as in the previous example. Phase 1 (intermediate paths): path solutions are output for Section//title and for Section//figure. Phase 2 (final solutions): the two sets of intermediate paths are joined.
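To make the two phases concrete, here is a deliberately naive Python sketch on the six region labels of the example. It is not TwigStack: it has no stacks and uses nested loops instead of a merge join, so it only shows what the intermediate paths and the final join look like. The stream layout and all function names are assumptions made for illustration.

```python
# Region labels (start, end, level), grouped by tag in document order.
streams = {
    "section": {"s1": (1, 12, 1), "s2": (4, 9, 2)},
    "title":   {"t1": (2, 3, 2), "t2": (5, 6, 3)},
    "figure":  {"f1": (7, 8, 3), "f2": (10, 11, 2)},
}

def contains(a, d):
    """Ancestor test on region labels: a's interval strictly contains d's."""
    return a[0] < d[0] and a[1] > d[1]

def path_solutions(anc_tag, desc_tag):
    """Phase 1 (naive): all (ancestor, descendant) pairs for anc_tag//desc_tag."""
    return [(a, d)
            for a, region_a in streams[anc_tag].items()
            for d, region_d in streams[desc_tag].items()
            if contains(region_a, region_d)]

def join_on_root(left, right):
    """Phase 2 (naive): combine paths that share the same branching element."""
    return [(s, t, f) for (s, t) in left for (s2, f) in right if s == s2]

title_paths = path_solutions("section", "title")
figure_paths = path_solutions("section", "figure")
solutions = join_on_root(title_paths, figure_paths)

for solution in solutions:
    print(solution)
```

A path produced in phase 1 is useless if no path on the other branch shares its section element; the holistic algorithms discussed in this talk aim to avoid emitting such paths in the first place.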

17 Previous work: TwigStackList [4] TwigStack may output many useless intermediate paths when the query contains parent-child (P-C) edges. A better method is TwigStackList [4]. Advantage: it reduces the number of intermediate paths for queries with P-C relationships. Disadvantage: it needs more comparisons and has a higher CPU cost than TwigStack. [4] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In Proceedings of CIKM.

18 Our research goal Our goal is to design a new holistic twig join algorithm that is more efficient than TwigStack and TwigStackList. There are two aspects to achieving this goal: (1) Input: reduce the input I/O cost. (2) Output: reduce the size of intermediate results.

19 Two aspects: (1) Reduce the input I/O cost To answer a twig query, TwigStack and TwigStackList need to read the labels of all elements whose tags occur in the query. Our motivation is to design a new algorithm that reduces the input cost by reading only the labels of elements that match query leaf nodes.

20 Two aspects: (2) Reduce intermediate results If the query contains any parent-child relationship, TwigStack/TwigStackList may output many intermediate path solutions that cannot contribute to final results. Our motivation is to design a new algorithm that avoids (or reduces) the output of those useless intermediate results.

22 Original Dewey Labeling Scheme In the Dewey labeling scheme, each element is represented by a vector: (i) the root is labeled by the empty string ε; (ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s. [Figure: the sample tree s1, s2, t1, t2, f1, f2 with Dewey labels, rooted at ε.]
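The two labeling rules translate directly into a short recursive function. The sketch below is illustrative only: labels are Python tuples (the empty tuple stands for ε), the function names are assumptions, and the tree shape is the one implied by the region-encoded sample document rather than something stated on this slide.

```python
def dewey_labels(tree, label=()):
    """Yield (name, label) pairs for a tree given as (name, [children]).

    Rule (i): the root gets the empty label.
    Rule (ii): the x-th child of s gets label(s) + (x,).
    """
    name, children = tree
    yield name, label
    for x, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (x,))

def is_ancestor(a, d):
    """a is a proper ancestor of d iff a's label is a proper prefix of d's."""
    return len(a) < len(d) and d[:len(a)] == a

# Assumed shape of the sample tree: s1 has children t1, s2, f2; s2 has t2, f1.
tree = ("s1", [("t1", []), ("s2", [("t2", []), ("f1", [])]), ("f2", [])])
labels = dict(dewey_labels(tree))

for name, label in labels.items():
    print(name, ".".join(map(str, label)) or "ε")
```

The prefix property gives ancestor labels for free, but an original Dewey label says nothing about the tag names along the path, which is the gap the extended scheme is meant to close.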

23 Main problem of the original Dewey If we use the original Dewey labeling scheme to answer a twig query, we still need to read the labels for all query nodes, so there is no speedup compared to TwigStack and TwigStackList. Our idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can derive the path of e from this label alone.

24 Modular function We need some schema information: a DTD (Document Type Definition) or an XML Schema. Given the DTD rule book → author, title, chapter*, our solution uses a modulo function to map each element tag to an integer. We define X_author mod 3 = 0, X_title mod 3 = 1 and X_chapter mod 3 = 2, where X_t is the last component of the label of an element with tag t. Example: under book (ε), the children author, title, chapter and chapter receive the components 0, 1, 2 and 5. Why not 3 for the second chapter, as in the original Dewey?

25 Derive element tag From a label, we can derive its tag name. DTD: book → author, title, chapter*. Recall that we define X_author mod 3 = 0, X_title mod 3 = 1 and X_chapter mod 3 = 2. [Figure: book (ε) with children labeled 0, 1, 2 and 5, whose tags are to be derived from the labels: author, title, chapter, chapter.]

26 More examples for assigning labels Let us consider a more complicated DTD: a → (b | c)*, d?, c+. We define X_b mod 3 = 0, X_c mod 3 = 1 and X_d mod 3 = 2. (Why do we use mod 3 instead of 4?) Example: under a (ε), the children b, d, c and c receive the components 0, 2, 4 and 7.
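Both examples are consistent with a simple assignment rule: give each child the smallest integer that is larger than its left sibling's component and has the right remainder. The sketch below encodes that rule. The rule itself is an assumption read off the two examples, not a formula quoted from the talk, and the function names are hypothetical.

```python
def next_component(prev, tag_index, n):
    """Smallest integer > prev whose remainder mod n equals tag_index.

    prev is the left sibling's component, or None for the first child.
    """
    start = 0 if prev is None else prev + 1
    return start + (tag_index - start) % n

def label_children(child_tags, tag_order):
    """Last label component of each child, left to right.

    tag_order lists the distinct child tags allowed by the DTD rule; the
    position of a tag in this list is its remainder modulo len(tag_order).
    """
    n = len(tag_order)
    components, prev = [], None
    for tag in child_tags:
        prev = next_component(prev, tag_order.index(tag), n)
        components.append(prev)
    return components

# book -> author, title, chapter*
print(label_children(["author", "title", "chapter", "chapter"],
                     ["author", "title", "chapter"]))
# a -> (b | c)*, d?, c+
print(label_children(["b", "d", "c", "c"], ["b", "c", "d"]))
```

Under this rule the second chapter cannot be 3, because 3 mod 3 = 0 would decode as author; the next value with remainder 2 is 5. The modulus is the number of distinct child tags (b, c and d in the second DTD), which is why 3 suffices there even though c occurs twice in the rule.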

27 Derive the path from a label By following a finite state transducer (FST), we can recursively derive the whole path from any extended Dewey label. For example, DTD: book → author, title, chapter*; chapter → (paragraph | section)*; section → (paragraph | section)*. FST transitions: from book, mod 3 = 0 leads to author, mod 3 = 1 to title and mod 3 = 2 to chapter; from chapter, mod 2 = 0 leads to paragraph and mod 2 = 1 to section; from section, mod 2 = 0 leads to paragraph and mod 2 = 1 to section. Question: given the label 5.1.0, what is the corresponding path?

28 Derive the path from a label Following the FST from book with the components of 5.1.0 (5 mod 3 = 2, then 1 mod 2 = 1, then 0 mod 2 = 0), we find that the label denotes book/chapter/section/paragraph.
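The decoding step can be sketched as a table-driven walk. The dictionary below plays the role of the FST for the DTD on the slide: for each tag it lists the child tags in remainder order. The names `FST`, `parse` and `derive_path` are assumptions made for this illustration.

```python
# For each tag, the child tags indexed by remainder (taken from the DTD above).
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}

def parse(label_text):
    """Turn '5.1.0' into [5, 1, 0]; the empty string is the root label."""
    return [int(c) for c in label_text.split(".")] if label_text else []

def derive_path(components, root="book"):
    """Decode an extended Dewey label into the tag path from the root."""
    path = [root]
    for component in components:
        children = FST[path[-1]]            # transitions of the current state
        path.append(children[component % len(children)])
    return path

print("/".join(derive_path(parse("5.1.0"))))
```

Because every prefix of the label is the label of an ancestor, the same walk yields the tag of each ancestor along the way, so no other label has to be read to recover the full path.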

29 Two properties of extended Dewey Find ancestor labels: from the label of any element, we can derive the labels of all its ancestors. Find ancestor names: from the label of any element, we can derive the tag names of all its ancestors. These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.

31 A new algorithm: TJFast (a Fast Twig Join algorithm) For each leaf node n in the query, there is a corresponding input stream T_n. T_n contains the extended Dewey labels of the elements with tag n, arranged in document order. For each branching node b of the twig pattern, there is a corresponding set S_b, which contains elements that may participate in query answers. (How does this differ from TwigStack?) At any point during the computation, the size of the set S_b is bounded by the depth of the XML document.

32 A new algorithm: TJFast TJFast is a two-phase algorithm. Phase 1: partial solutions for the individual root-to-leaf paths are output as intermediate results. To do so, the algorithm caches elements that may participate in query answers in the sets, and outputs intermediate paths according to the elements in the sets. Phase 2: the intermediate paths are merge-joined to get the final results.

33 An example for TJFast algorithm Query: A with the two branches A//D and A/B//C. DTD: a → a*, d*, b*; b → d*, c*; d → c*. Input streams: T_D and T_C. A set S_A, initially { }, is kept for the branching node A. [Figure: the document tree over the elements a1, a2, a3, b1, b2, c1, c2, d1, d2 and d3 with their extended Dewey labels, e.g. 0.3.1.] Why do we not need the streams T_A and T_B?

34 An example for TJFast algorithm By the finite state transducer of the extended Dewey labeling scheme, we derive the path a1/a2/d1 from the head of T_D and the path a1/a3/b1/c1 from the head of T_C.

35 An example for TJFast algorithm Both a1 and a3 may participate in query answers. (Why not a2?)

36 An example for TJFast algorithm We then insert a1 into the set, since a1 is an ancestor of a3. We cannot insert both a1 and a3 at this point: so far they are only possible participants in query answers, and inserting both may cause more useless intermediate results.

37 An example for TJFast algorithm S_A = {a1}. Move the cursor of T_D from d1 to d2 and output one path solution.

38 An example for TJFast algorithm From d2 we derive the path a1/a3/d2. We insert a3 into the set, since a3 definitely participates in query answers. S_A = {a1, a3}.

39 An example for TJFast algorithm Move the cursor of stream T_D from d2 to d3 and output the resulting path solutions.

40 An example for TJFast algorithm Move the cursor of stream T_C from c1 to c2 and output the resulting path solution.

41 An example for TJFast algorithm Move the cursor of T_D to the end and output the resulting path solution.

42 An example for TJFast algorithm Move the cursor of T_C to the end and output the resulting path solution.

43 An example for TJFast algorithm Now all five elements have been scanned. In the second phase we merge-join all output path solutions.

44 An example for TJFast algorithm Phase 1 (intermediate paths): path solutions are output for A//D and for A/B//C. Phase 2 (final solutions): the two sets of intermediate paths are joined.
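One building block of the example is checking whether the tag path derived from a leaf label matches the query's root-to-leaf path. The sketch below covers only that check, not the full TJFast algorithm (there are no streams, sets or output here). Its assumptions: the query root may bind anywhere in the document (a leading "//"), tag paths are given as lists of lowercase tags, and all names are hypothetical.

```python
from functools import lru_cache

def path_matches(tags, pattern):
    """Check a root-to-leaf tag path against a query path.

    tags:    tag names from the document root to the element, e.g. ["a", "a", "d"].
    pattern: list of (axis, tag) pairs, axis being "/" or "//"; the last
             pattern node must bind to the last tag in the path.
    """
    @lru_cache(maxsize=None)
    def bound(i, j):
        # True if pattern node j can bind to the path position i.
        if i < 0:
            return False
        axis, tag = pattern[j]
        if tags[i] != tag:
            return False
        if j == 0:
            return axis == "//" or i == 0
        if axis == "/":
            return bound(i - 1, j - 1)                     # parent must bind one step up
        return any(bound(k, j - 1) for k in range(i))      # any proper ancestor

    return bound(len(tags) - 1, len(pattern) - 1)

A_D = [("//", "a"), ("//", "d")]                # query path A//D
A_B_C = [("//", "a"), ("/", "b"), ("//", "c")]  # query path A/B//C

print(path_matches(["a", "a", "d"], A_D))          # path a1/a2/d1
print(path_matches(["a", "a", "b", "c"], A_B_C))   # path a1/a3/b1/c1
```

Since the whole path comes from the leaf label, this check needs no access to the A or B elements themselves, which is the intuition behind reading only the leaf streams T_D and T_C.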

45 Optimal query classes If an algorithm does not output any useless intermediate results for a query Q on any given document, we say that the algorithm is optimal for Q. The larger the optimal query class of an algorithm, the better it controls the size of intermediate results.

46 Optimal class of TJFast, TwigStack and TwigStackList
Algorithm: optimal query class
TwigStack: all edges are ancestor-descendant relationships
TJFast, TwigStackList: all edges connecting branching nodes and their children are ancestor-descendant relationships
For non-optimal queries, TJFast usually outputs fewer useless intermediate paths than TwigStack and TwigStackList do.
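The two class definitions are easy to state as predicates over a query tree. In this sketch a query is a dictionary from each node to its list of (axis, child) pairs; this representation and the function names are assumptions made for illustration, and the edge above the query root is ignored.

```python
def twigstack_optimal(query):
    """True if every edge of the query is ancestor-descendant ("//")."""
    return all(axis == "//"
               for children in query.values()
               for axis, _child in children)

def tjfast_optimal(query):
    """True if every edge below a branching node is ancestor-descendant.

    A branching node is a node with more than one child; the same class
    applies to TwigStackList.
    """
    return all(axis == "//"
               for children in query.values() if len(children) > 1
               for axis, _child in children)

# S[.//VP/IN]//NP: S branches with two "//" edges; VP/IN is parent-child.
q3 = {"S": [("//", "VP"), ("//", "NP")], "VP": [("/", "IN")]}
# A with branches A//D and A/B//C: the branching node A has a "/" edge.
q_example = {"A": [("//", "D"), ("/", "B")], "B": [("//", "C")]}

print(twigstack_optimal(q3), tjfast_optimal(q3))
print(twigstack_optimal(q_example), tjfast_optimal(q_example))
```

The first query falls in the larger class only, since its parent-child edge hangs below a non-branching node; the second falls in neither class because its branching node has a parent-child edge.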

48 Experiments Benchmarks: XMark (synthetic data), DBLP (real data from the DBLP database) and Treebank (real data from the Wall Street Journal).
Dataset: Max/Avg depth
XMark: 12/5
DBLP: 6/2.9
Treebank: 36/7.8
[Table rows for data size (MB) and nodes (million): values not recoverable.]

49 Path queries We compared PathStack [1] and TJFast on the following four path queries on XMark data.
PQ1: /site/closed-auctions/closed_auction/price
PQ2: /site/regions//item/location
PQ3: /site/people/person/gender
PQ4: /site/open_auctions/open_auction/reserve

50 Experiments: Number of elements read and input file size for path queries Observation: TJFast scans fewer elements than PathStack does. Explanation: TJFast scans labels only for the leaf node of a query, whereas PathStack scans labels for all nodes in the query.

51 Experiments: Execution time for path queries Observation: TJFast performs better than PathStack on all four path queries. Explanation: TJFast reduces the I/O cost by reading fewer elements.

52 Twig queries We compared TwigStack, TwigStackList and TJFast on the following five twig queries on DBLP and TreeBank data.
Query: source, twig query
TQ1: DBLP, //proceedings//title[.//i]//sup
TQ2: DBLP, //article[.//sup]//title//sub
TQ3: Treebank, /S[.//VP/IN]//NP
TQ4: Treebank, /S/VP/PP[IN]/NP/VBN
TQ5: Treebank, //VP[DT]//PRP_DOLLAR_

53 Experiments: Number of elements read and input file size for twig queries Observation: TJFast scans far fewer elements than TwigStack and TwigStackList do in two twig queries. Explanation: TJFast scans elements only for the leaf nodes of a query, whereas TwigStack/TwigStackList need to scan elements for all query nodes, and the number of elements for non-leaf nodes is much larger than that for leaf nodes.

54 Experiments: Execution time for twig queries Observation: on DBLP data, TJFast performs much better than TwigStack/TwigStackList. Explanation: TJFast reduces the I/O cost by reading fewer elements. TW-SS and TJ-SS denote the sequential scan time of the input data for TwigStack/TwigStackList and TJFast, respectively.

55 Experiments: Intermediate results Q3 (a twig over S, VP, IN and NP) contains only ancestor-descendant relationships in branching edges. TwigStack outputs many useless intermediate paths, but all path solutions output by TJFast and TwigStackList contribute to final answers. [Table: number of intermediate paths for Q3 under TwigStack, TwigStackList and TJFast, and the number of useful paths; values not recoverable.]

56 Experiments: Intermediate results Q4 (a twig over S, VP, PP, IN, NP and VBN) has parent-child relationships in branching edges. All three algorithms are sub-optimal here and output useless paths, but TJFast outputs the fewest useless paths. [Table: number of intermediate paths for Q4 under TwigStack, TwigStackList and TJFast, and the number of useful paths; values not recoverable.]

58 Conclusions Twig pattern matching is a core operation in XPath and XQuery, so processing twig queries efficiently is important. We have proposed a new labeling scheme, extended Dewey, and a new holistic twig pattern matching algorithm, TJFast. Compared to previous work, TJFast reduces the input I/O cost and reduces the output I/O cost for intermediate results. Future work: use index structures (B-tree or R-tree) to accelerate query processing, and modify extended Dewey to make it an update-friendly labeling scheme.

59 References
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD. (Proposes the TwigStack algorithm.)
[2] T. Chen, J. Lu, and T. Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In SIGMOD. (Proposes two new data streaming techniques.)
[3] Y. Chen, S. B. Davidson, and Y. Zheng. BLAS: An efficient XPath processing system. In Proc. of SIGMOD, pages 47-58. (Proposes a new algorithm for XPath queries.)
[4] J. Lu, T. Chen, and T. W. Ling. Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In CIKM. (Proposes the TwigStackList algorithm.)

60 END Thank you! Q & A

61 Backup [Figure: a query over the nodes a, b, c, d and e, and a document with the elements a1, a2, b1, c1, c2, d1, e1, f1 and f2.] On this example TwigStackList outputs a path solution that TJFast does not output.

62 Label sizes [Table: label size in MB for region encoding, original Dewey and extended Dewey on XMark, DBLP and TreeBank; values not recoverable.]