1 Optimizing Cursor Movement in Holistic Twig Joins Marcus Fontoura, Vanja Josifovski, Eugene Shekita (IBM Almaden Research Center) Beverly Yang (Stanford)

Presentation transcript:

1 Optimizing Cursor Movement in Holistic Twig Joins
Marcus Fontoura, Vanja Josifovski, Eugene Shekita (IBM Almaden Research Center)
Beverly Yang (Stanford)
CIKM’2005

2 Motivation
  for $a in //article[year = "2005" or keyword = "XML"]
  for $s in $a/section
  return $s/title
In an index-based method, 7 tag and text elements need to be verified to process this query
Running time is dominated by the I/O for manipulating these cursors
Twig join algorithms are not optimized for I/O and do not exploit the query’s extraction points
[Figure: query twig with nodes article, AND, OR, section, title, year, 2005, keyword, XML]

3 Our Contributions
1. TwigOptimal, a new holistic twig join algorithm that supports a large fraction of XQuery (including AND/OR branches)
2. Description of how extraction points improve query performance
3. Experimental evaluation that shows how TwigOptimal outperforms current algorithms

4 Agenda
Background
TwigOptimal algorithm
Experimental results
Conclusions

5 XML Indexing
Begin/End/Level (BEL) encoding
  Begin: preorder position of tag/text
  End: preorder position of last descendant
  Level: depth
Containment: X contains Y iff X.begin < Y.begin <= X.end (assuming well-formed documents)
[Figure: example tree with (begin, end, level) labels]
  R (0,7,0)
    A1 (1,5,1)
      B1 (2,2,2)
      B2 (3,5,2)
        C1 (4,4,3)
        D1 (5,5,3)
    B3 (6,7,1)
      C2 (7,7,2)
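
The encoding and the containment test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `Pos`, `encode`, `contains` and `is_parent` are hypothetical, the tree shape is inferred from the labels on the slide, and the parent test via `level` is a common use of this encoding that the slide itself does not state.

```python
from typing import Dict, List, NamedTuple, Tuple


class Pos(NamedTuple):
    begin: int  # preorder position of the tag/text
    end: int    # preorder position of the last descendant
    level: int  # depth in the document tree


# A tree node is (name, children); this shape is an assumption for the sketch.
Tree = Tuple[str, List["Tree"]]


def encode(tree: Tree) -> Dict[str, Pos]:
    """Assign Begin/End/Level labels with one preorder traversal."""
    labels: Dict[str, Pos] = {}
    next_pos = 0

    def visit(node: Tree, level: int) -> None:
        nonlocal next_pos
        name, children = node
        begin = next_pos
        next_pos += 1
        for child in children:
            visit(child, level + 1)
        # After the subtree is visited, next_pos - 1 is the last descendant.
        labels[name] = Pos(begin, next_pos - 1, level)

    visit(tree, 0)
    return labels


def contains(x: Pos, y: Pos) -> bool:
    """X contains Y iff X.begin < Y.begin <= X.end (well-formed input)."""
    return x.begin < y.begin <= x.end


def is_parent(x: Pos, y: Pos) -> bool:
    """Parent-child check: containment plus a level difference of one."""
    return contains(x, y) and y.level == x.level + 1


# Tree shape inferred from the slide's labels (an assumption).
example = ("R", [
    ("A1", [("B1", []), ("B2", [("C1", []), ("D1", [])])]),
    ("B3", [("C2", [])]),
])
labels = encode(example)
```

With this tree, `labels["B2"]` comes out as `(3, 5, 2)`, matching the label on the slide, and `contains(labels["A1"], labels["C1"])` holds while `contains(labels["B3"], labels["C1"])` does not.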

6 Basic Access Path
Inverted lists
Posting: Token = Location =
Supported method on cursor: C_B.forwardTo(Position p)
[Figure: example tree (R, A1, B1, B2, C1, D1, B3, C2) with inverted lists B: B1, B2, B3 and C: C1, C2]
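
A forward-only cursor over one inverted list might look like the sketch below. The slide gives only the method name, so the exact semantics used here (land on the first posting whose begin is at or after `p`, never move backward) are an assumption, as are the class name `Cursor` and the in-memory list standing in for an on-disk index.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Posting = Tuple[int, int, int]  # (begin, end, level)


class Cursor:
    """Forward-only cursor over one inverted list, sorted by begin."""

    def __init__(self, postings: List[Posting]) -> None:
        self._postings = sorted(postings)
        self._begins = [p[0] for p in self._postings]
        self._i = 0

    def at_end(self) -> bool:
        return self._i >= len(self._postings)

    def current(self) -> Optional[Posting]:
        return None if self.at_end() else self._postings[self._i]

    def forward_to(self, p: int) -> Optional[Posting]:
        """Skip to the first posting with begin >= p (assumed semantics).

        Searching from the current index keeps the cursor from moving back.
        """
        self._i = bisect_left(self._begins, p, lo=self._i)
        return self.current()


# Inverted list for tag B from the example tree.
c_b = Cursor([(2, 2, 2), (3, 5, 2), (6, 7, 1)])
```

In a disk-based index each `forward_to` call would be a B-tree seek, which is why the number of such calls matters for running time.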

7 Joins in XML
Structural (Containment) Joins
Twig Joins
[Figure: example join patterns over nodes A, B, C, D drawn with || edges, e.g. A || B and A || B || C]

8 LocateExtension
“Extension” (w.r.t. query node q): a solution for the subquery rooted at q
Input: q
Result: the cursors of all descendants of q point to an extension for q
[Figure: query over nodes A, B, C, D (|| edges) and elements A, B1, B3, C1, C2, D1, D2, X1, X2]

9 LocateExtension
  while (not end(q) && not hasExtension(q)) {
    (p, c) = PickBrokenEdge(q);
    ZigZagJoin(p, c);
  }
[Figure: query over nodes A, B, C, D (|| edges) and elements A, B1, B3, C1, C2, D1, D2, X1, X2]
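
The slide does not spell out ZigZagJoin, so the following is only a sketch of the usual zig-zag idea for a single broken edge (p, c): whichever cursor is behind skips forward until the parent posting contains the child posting. The function name `zigzag_join`, the list-plus-index representation of cursors, and the skip rules are assumptions; postings are (begin, end, level) tuples sorted by begin.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Posting = Tuple[int, int, int]  # (begin, end, level)


def contains(p: Posting, c: Posting) -> bool:
    """p contains c iff p.begin < c.begin <= p.end."""
    return p[0] < c[0] <= p[1]


def zigzag_join(parents: List[Posting], children: List[Posting],
                pi: int = 0, ci: int = 0) -> Optional[Tuple[int, int]]:
    """Advance both cursors until parents[pi] contains children[ci].

    Returns the aligned index pair, or None if either list is exhausted.
    """
    child_begins = [c[0] for c in children]
    while pi < len(parents) and ci < len(children):
        p, c = parents[pi], children[ci]
        if contains(p, c):
            return pi, ci
        if c[0] <= p[0]:
            # Child lies before the parent: skip it past p.begin in one seek.
            ci = bisect_right(child_begins, p[0], lo=ci)
        else:
            # Child starts after the parent ends: this parent cannot match.
            pi += 1
    return None


# Inverted lists taken from the XML Indexing example tree.
b_list = [(2, 2, 2), (3, 5, 2), (6, 7, 1)]
c_list = [(4, 4, 3), (7, 7, 2)]
first = zigzag_join(b_list, c_list)            # B2 contains C1
second = zigzag_join(b_list, c_list, 1, 1)     # B3 contains C2
```

A full LocateExtension would repeat this for each broken edge that PickBrokenEdge returns until the whole subquery under q is aligned; that outer loop is left out here.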

10 TwigOptimal Algorithm
Tests whether the cursor with the minimal location has an extension
If not, tries to virtually move cursors until they form an extension
Only moves cursors physically if no more virtual moves are possible
A virtual move just sets the begin value of the cursor, so no I/O is involved:
  Cq.begin = new begin value for Cq;
  Cq.virtual = true; // indicates that the cursor is virtual
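
To make the virtual/physical distinction concrete, here is a small sketch of a cursor that records virtual moves without touching the index and performs a single seek when a physical position is finally needed. The class name `VirtualCursor`, the `seeks` counter standing in for I/O, and the reading of a virtual begin as a lower bound on the next real position are assumptions for illustration, not details from the slides.

```python
import math
from bisect import bisect_left
from typing import Iterable, Union

Number = Union[int, float]


class VirtualCursor:
    def __init__(self, begins: Iterable[int]) -> None:
        self._begins = sorted(begins)
        self._i = 0
        self.begin: Number = self._begins[0] if self._begins else math.inf
        self.virtual = False
        self.seeks = 0  # hypothetical counter standing in for index I/O

    def virtual_move(self, new_begin: int) -> None:
        """Record a new begin value only; no index access happens."""
        if new_begin > self.begin:
            self.begin = new_begin
            self.virtual = True

    def physical_move(self) -> Number:
        """Seek to the first real posting at or after the virtual begin."""
        if not self.virtual:
            return self.begin
        self.seeks += 1
        self._i = bisect_left(self._begins, self.begin, lo=self._i)
        at_end = self._i >= len(self._begins)
        self.begin = math.inf if at_end else self._begins[self._i]
        self.virtual = False
        return self.begin


# Three virtual moves followed by one physical move cost a single seek.
cy = VirtualCursor(range(1, 101))
cy.virtual_move(10)
cy.virtual_move(50)
cy.virtual_move(30)  # ignored: cursors never move backward
```

After these calls `cy.begin` is 50, `cy.virtual` is true and `cy.seeks` is still 0; calling `cy.physical_move()` then performs the one seek.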

11 Checking Extension
We have an extension for cursor q if:
  All cursors underneath q are properly aligned
  All cursors underneath q have physical locations
[Figure: query over nodes A, B, C, D (|| edges) and elements A, B1, B3, C1, C2, D1, D2, X1, X2; in this cursor state the check returns false]

12 Checking Extension
We have an extension for cursor q if:
  All cursors underneath q are properly aligned
  All cursors underneath q have physical locations
[Figure: query over nodes A, B, C, D (|| edges) and elements A, B1, B3, C1, C2, D1, D2, X1, X2; in this cursor state the check returns true]
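
The two conditions can be read as a recursive check over the query tree. The sketch below assumes that "properly aligned" means every child cursor is contained in its parent cursor under the Begin/End containment rule, and that q itself must also be physical so that its end value is known. `QueryNode` and `has_extension` are hypothetical names, and the twig shape A over B over {C, D} is inferred from the figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryNode:
    name: str
    begin: int            # begin of the cursor's current position
    end: int              # end of the cursor's current position
    virtual: bool = False
    children: List["QueryNode"] = field(default_factory=list)


def has_extension(q: QueryNode) -> bool:
    """True if every cursor in q's subtree is physical and aligned."""
    if q.virtual:
        return False
    for child in q.children:
        aligned = q.begin < child.begin <= q.end
        if not aligned or not has_extension(child):
            return False
    return True


def twig(b_begin: int, b_end: int, d_virtual: bool = False) -> QueryNode:
    """Build the twig A over B over {C, D} with fixed A, C, D positions."""
    c = QueryNode("C", 4, 4)
    d = QueryNode("D", 5, 5, virtual=d_virtual)
    b = QueryNode("B", b_begin, b_end, children=[c, d])
    return QueryNode("A", 1, 5, children=[b])


aligned_twig = twig(3, 5)                      # B cursor on (3,5): contains C and D
broken_twig = twig(2, 2)                       # B cursor on (2,2): does not contain C
virtual_twig = twig(3, 5, d_virtual=True)      # D has no physical location yet
```

Only `aligned_twig` passes the check; the other two fail on alignment and on the physical-location condition respectively.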

13 Moving Cursors
Two passes over the query tree
  Bottom-up: move each parent cursor forward so it contains the children cursors
  Top-down: move the children cursors forward so they are contained by their parents

14 Move Cursors Example
Query = //x[.//y and .//z]
[Figure: cursor moves over elements x1, x2, y1 ... y5, z1, z2; legend: virtual move, physical move]

15 Comparing with TSGeneric+
Query = //w//x//y//z
[Figure: cursor moves over elements w1, w2, x1 ... x50, y1 ... y100, z1, z2; legend: current cursor position, virtual move, physical move]

16 Comparing with TSGeneric+
Query = //w//x//y//z
[Figure: cursor moves over elements w1, w2, x1 ... x50, y1 ... y100, z1, z2; legend: current cursor position, physical move]

17 Extraction Points Optimization
If neither q nor its descendants in the query are extraction points, we can virtually move these cursors within q’s parent
[Figure: query over nodes A, B, C (|| edges) and elements A1, A2, B1, B2, B3, C1, C99, C100]

18 Prototype
Implemented over Berkeley DB B-tree
Inverted lists
Posting: Token = Location =
Position is BEL

19 Data Sets
XMark
  10 documents of size ~100MB each
Synthetic
  4 tags: W, X, Y, Z
  Uncorrelated, no self-nesting
  Same frequency

20-24 Experimental Results
[Figures: result charts on five slides; no slide text]

25 Conclusion
The TwigOptimal algorithm outperforms existing twig join algorithms by more than 40%, especially for larger queries
Optimized for I/O, which is the performance bottleneck
The extraction points optimization improves performance