1 Virtual Cursors for XML Joins Beverly Yang (Stanford) Marcus Fontoura, Eugene Shekita Sridhar Rajagopalan, Kevin Beyer CIKM’2004.

Presentation transcript:

1 Virtual Cursors for XML Joins Beverly Yang (Stanford) Marcus Fontoura, Eugene Shekita Sridhar Rajagopalan, Kevin Beyer CIKM’2004

2 Motivation
Query: //article//section[ //title contains(‘Query Processing’) AND //figure//caption contains(‘XML’)]
In an index-based method, 8 tags and text elements need to be verified to process this query
Virtual cursors allow us to reduce the size of the input data by looking only at leaf nodes
[Figure: query tree with nodes article, section, title, figure, caption and text leaves “Query Processing” and “XML”]

3 Our Contributions
1. Virtual cursors improve runtime performance by more than an order of magnitude by eliminating I/O
2. Virtual cursors can be used by existing algorithms for structural and holistic twig joins
3. Overhead of path indices and ancestor information is subsumed by the advantages of virtual cursors

4 Agenda
Background
Virtual cursors algorithm
Experimental results
Conclusions

5 Position Encoding Scheme #1: Begin/End/Level
Begin: preorder position of tag/text
End: preorder position of last descendant
Level: depth
Containment: X contains Y iff X.begin < Y.begin <= X.end (assuming well-formed)
[Figure: example tree with (begin, end, level) labels: R (0,7,0); A1 (1,5,1); B1 (2,2,2); B2 (3,5,2); C1 (4,4,3); D1 (5,5,3); B3 (6,7,1); C2 (7,7,2)]
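The Begin/End/Level scheme can be illustrated with a minimal Python sketch. It assumes the example tree is R with children A1 and B3, A1 with children B1 and B2, B2 with children C1 and D1, and B3 with child C2 (the shape implied by the labels on the slide). The names `TREE`, `encode` and `contains` are hypothetical and chosen only for this illustration.

```python
# Hypothetical tree representation: (name, [children]).
TREE = ("R", [
    ("A1", [("B1", []), ("B2", [("C1", []), ("D1", [])])]),
    ("B3", [("C2", [])]),
])

def encode(tree):
    """Assign (begin, end, level) to every node with one preorder walk."""
    labels = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        name, children = node
        begin = counter          # preorder position of this node
        counter += 1
        for child in children:
            visit(child, level + 1)
        # end = preorder position of the last descendant (or the node itself)
        labels[name] = (begin, counter - 1, level)

    visit(tree, 0)
    return labels

def contains(x, y):
    """X contains Y iff X.begin < Y.begin <= X.end."""
    return x[0] < y[0] <= x[1]

labels = encode(TREE)
print(labels["B2"])                          # (begin, end, level) of B2
print(contains(labels["A1"], labels["C1"]))  # A1 is an ancestor of C1
```

Containment is answered from two integer comparisons, which is why this encoding is usually compact and simple to implement.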

6 Position Encoding Scheme #2: Dewey
Position of element E = {position of parent}.n, where E is the nth child of its parent
Containment: X contains Y iff X is a prefix of Y
[Figure: example tree with Dewey labels: R (1); A1 (1.1); B1 (1.1.1); B2 (1.1.2); C1 ( ); D1 ( ); B3 (1.2); C2 (1.2.1)]
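A small Python sketch of the Dewey rule, under the same assumed tree shape (R with children A1 and B3, A1 with children B1 and B2, B2 with children C1 and D1, B3 with child C2). Labels are stored as tuples of integers rather than dotted strings, which is a choice made here so that prefix tests and ordering behave correctly; `dewey_labels` and `is_ancestor` are hypothetical names.

```python
# Hypothetical tree representation: (name, [children]).
TREE = ("R", [
    ("A1", [("B1", []), ("B2", [("C1", []), ("D1", [])])]),
    ("B3", [("C2", [])]),
])

def dewey_labels(tree, prefix=(1,)):
    """Label the root with `prefix` and the nth child of a node with parent + (n,)."""
    name, children = tree
    labels = {name: prefix}
    for n, child in enumerate(children, start=1):
        labels.update(dewey_labels(child, prefix + (n,)))
    return labels

def is_ancestor(x, y):
    """X contains Y iff X is a proper prefix of Y."""
    return len(x) < len(y) and y[:len(x)] == x

labels = dewey_labels(TREE)
print(".".join(map(str, labels["B2"])))       # dotted form of B2's label
print(is_ancestor(labels["A1"], labels["C1"]))
```

Because every label carries the labels of all its ancestors as prefixes, ancestor positions can be read off without touching any other inverted list.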

7 Position Encoding
Begin/End/Level: typically more compact; fewer implementation issues
Dewey: encodes positions of all ancestors

8 Path Index
Path        PathID
/R          1
/R/A        2
/R/A/B      3
/R/A/B/C    4
/R/A/B/D    5
/R/B        6
/R/B/C      7
Path pattern -> set of matching path IDs
/R/B -> {6}
//R//C -> {4, 7}
[Figure: example tree R, A1, B1, B2, C1, D1, B3, C2]
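The pattern lookup can be sketched in Python as follows. This is only an illustration of the mapping shown on the slide, not how the prototype stores or evaluates its path index; translating a pattern into a regular expression is an assumption made here, and `PATH_IDS`, `pattern_to_regex` and `matching_path_ids` are hypothetical names.

```python
import re

# Path -> PathID table from the slide.
PATH_IDS = {
    "/R": 1, "/R/A": 2, "/R/A/B": 3, "/R/A/B/C": 4,
    "/R/A/B/D": 5, "/R/B": 6, "/R/B/C": 7,
}

def pattern_to_regex(pattern):
    """Translate a path pattern with / and // steps into a regular expression."""
    parts = []
    for axis, tag in re.findall(r"(//|/)([^/]+)", pattern):
        if axis == "//":
            # descendant step: allow any number of intermediate tags
            parts.append(r"(?:/[^/]+)*")
        parts.append("/" + re.escape(tag))
    return re.compile("".join(parts))

def matching_path_ids(pattern):
    """Return the set of path IDs whose full path matches the pattern."""
    rx = pattern_to_regex(pattern)
    return {pid for path, pid in PATH_IDS.items() if rx.fullmatch(path)}

print(matching_path_ids("/R/B"))
print(matching_path_ids("//R//C"))
```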

9 Basic Access Path
Inverted lists
Posting: Token = Location = Data = <>
Supported methods on cursor C_B:
C_B.advance()
C_B.fwdBeyond(Position p)
C_B.fwdToAncestor(Position p)
[Figure: example tree R, A1, B1, B2, C1, D1, B3, C2 with the inverted lists B: B1, B2, B3 and C: C1, C2]
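A rough Python sketch of a physical cursor over one inverted list. The slide only names the three methods, so their exact semantics here are assumptions: `fwd_beyond(p)` moves to the first posting that begins after `p`, and `fwd_to_ancestor(p)` moves to the first posting that contains `p`, stopping at the first posting beyond `p` if there is none. The in-memory list, the `Pos` tuple and the class name `Cursor` are hypothetical; the postings use the Begin/End/Level values of B1, B2, B3 and C1 from the earlier encoding slide.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional

class Pos(NamedTuple):
    begin: int
    end: int
    level: int

class Cursor:
    """Forward-only cursor over one token's postings, sorted by begin."""

    def __init__(self, postings):
        self.postings = sorted(postings)
        self.begins = [q.begin for q in self.postings]
        self.i = 0

    def current(self) -> Optional[Pos]:
        return self.postings[self.i] if self.i < len(self.postings) else None

    def advance(self) -> Optional[Pos]:
        self.i += 1
        return self.current()

    def fwd_beyond(self, p: Pos) -> Optional[Pos]:
        # Assumed semantics: first posting with begin > p.begin, never moving back.
        self.i = max(self.i, bisect_right(self.begins, p.begin))
        return self.current()

    def fwd_to_ancestor(self, p: Pos) -> Optional[Pos]:
        # Assumed semantics: first posting containing p, else first posting beyond p.
        while self.i < len(self.postings):
            q = self.postings[self.i]
            if q.begin < p.begin <= q.end or q.begin > p.begin:
                return q
            self.i += 1
        return None

B_LIST = [Pos(2, 2, 2), Pos(3, 5, 2), Pos(6, 7, 1)]   # B1, B2, B3
C1 = Pos(4, 4, 3)

c_b = Cursor(B_LIST)
print(c_b.fwd_to_ancestor(C1))   # skips B1, stops at B2, which contains C1
print(c_b.fwd_beyond(C1))        # moves on to B3
```

In a disk-based index each skipped posting may still cost I/O, which is the cost virtual cursors aim to avoid for non-leaf query nodes.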

10 Joins in XML
Structural (Containment) Joins
Twig Joins
[Figure: example query patterns drawn with || edges: a structural join A || B and twig queries over nodes A, B, C, D]

11 LocateExtension
“Extension” (w.r.t. query node q): a solution for the subquery rooted at q
Input: q
Result: the cursors of all descendants of q point to an extension for q
[Figure: twig query over A, B, C, D and a data tree with elements A, B1, B3, C1, C2, D1, D2, X1, X2]

12 LocateExtension
while (not end(q) && not hasExtension(q)) {
    (p, c) = PickBrokenEdge(q);
    ZigZagJoin(p, c);
}
[Figure: twig query over A, B, C, D and a data tree with elements A, B1, B3, C1, C2, D1, D2, X1, X2]

13 Virtual Cursors
Observation: every useful position in a non-leaf query node is an ancestor of some leaf position
GetAncestors(): given a position P, return all ancestor positions of P
Data: A1 – B1 – A2 – C1
GetAncestors(C1) = {A1, B1, A2}
Dewey: already encoded in position
Begin/End/Level: not simple, extra work is needed
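With Dewey positions, GetAncestors() amounts to listing the proper prefixes of a label. The sketch below assumes the chain A1 – B1 – A2 – C1 is labelled (1), (1.1), (1.1.1), (1.1.1.1); those labels and the names `get_ancestors` and `NAMES` are hypothetical and used only for illustration.

```python
def get_ancestors(dewey):
    """Return all ancestor positions of a Dewey label, outermost first."""
    return [dewey[:k] for k in range(1, len(dewey))]

# Assumed Dewey labels for the chain A1 - B1 - A2 - C1.
NAMES = {(1,): "A1", (1, 1): "B1", (1, 1, 1): "A2", (1, 1, 1, 1): "C1"}

c1 = (1, 1, 1, 1)
print([NAMES[a] for a in get_ancestors(c1)])
```

For Begin/End/Level the same information is not derivable from the position alone, so it would have to be stored alongside each posting.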

14 Join Points
GetLevels()
Input: Path ID, tag
Output: all ancestor levels at which this tag occurs
Path: A – B – A – C, PathID = 3
GetLevels(3, “A”) = {1, 3}
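GetLevels() can be sketched as a lookup in the path index. The sketch assumes the index keeps the tag sequence of each path, that levels are counted from 1 at the top of the path, and that the last step (the element itself) is not counted as an ancestor level; `PATH_TAGS` and `get_levels` are hypothetical names.

```python
# Assumed path index content: PathID -> tag sequence from the top of the path.
PATH_TAGS = {3: ["A", "B", "A", "C"]}

def get_levels(path_id, tag):
    """Return the ancestor levels (1-based) at which `tag` occurs on the path."""
    tags = PATH_TAGS[path_id]
    # tags[:-1] drops the last step, which is the element itself, not an ancestor
    return {level for level, t in enumerate(tags[:-1], start=1) if t == tag}

print(get_levels(3, "A"))
```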

15 Virtual Cursor Algorithm
VirtualFwdToAncestor(Position p) // C is the implicit parameter “this”
    AncArray = GetAncestors(p);
    LevelArray = GetLevels(p.PID, C.token);
    for (i = 1; i <= AncArray.length(); i++) {
        if (AncArray[i] < C.pCur) continue;
        if (AncArray[i].level not in LevelArray) continue;
        C.pCur = AncArray[i];
        return C.pCur;
    }
    return invalidPosition;

16 Example
[Figure: data tree with root, Ax, Ay, A1, A99, A100, B1, B2, and a marker “Position ZERO”]
GetAncestors(B1) = {root, Ay, A99}
Path root-A-A-B has PathID x, GetLevels(x, A) = {2, 3}
C_A.VirtualFwdToAncestor(B1):
For i = 1, AncArray[1].level = 1, which is not in LevelArray = {2, 3}
For i = 2, both conditions hold: first answer for //A//B
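The example can be traced with a minimal Python sketch of VirtualFwdToAncestor. Several things are assumptions made for illustration: positions are Dewey tuples whose length is the level, the labels root = (1), Ay = (1.2), A99 = (1.2.1), B1 = (1.2.1.1) are invented, the empty tuple stands in for position zero, the loop visits every ancestor, and the names `VirtualCursor`, `get_ancestors`, `get_levels` and `PATH_TAGS` are hypothetical.

```python
# Assumed path index content: PathID "x" is the path root-A-A-B.
PATH_TAGS = {"x": ["root", "A", "A", "B"]}

def get_ancestors(dewey):
    """Ancestor positions of a Dewey label, outermost first."""
    return [dewey[:k] for k in range(1, len(dewey))]

def get_levels(path_id, tag):
    """Ancestor levels (1-based) at which `tag` occurs on the path."""
    tags = PATH_TAGS[path_id]
    return {lvl for lvl, t in enumerate(tags[:-1], start=1) if t == tag}

class VirtualCursor:
    """Cursor for a non-leaf query node that never reads its own inverted list."""

    def __init__(self, token):
        self.token = token
        self.p_cur = ()   # stand-in for position zero: sorts before every label

    def virtual_fwd_to_ancestor(self, p, path_id):
        levels = get_levels(path_id, self.token)
        for anc in get_ancestors(p):
            if anc < self.p_cur:        # behind the cursor: skip
                continue
            if len(anc) not in levels:  # the tag does not occur at this level: skip
                continue
            self.p_cur = anc
            return anc
        return None                     # plays the role of invalidPosition

B1 = (1, 2, 1, 1)                       # hypothetical label: root / Ay / A99 / B1
c_a = VirtualCursor("A")
print(c_a.virtual_fwd_to_ancestor(B1, "x"))   # the label assumed for Ay
```

The cursor for A is positioned using only the leaf position B1 and the path index, so no posting for A is read.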

17 LocateExtension Revisited
With virtual cursors:
while (not end(q) && not hasExtension(q)) {
    l = PickBrokenLeaf(q);
    A = ancestors of l under q;
    amax = argmax { C_a | a is in A };
    C_l.fwdBeyond(C_amax);
    for each a in A
        C_a.virtualFwdToAncestor(C_l);
}
Original:
while (not end(q) && not hasExtension(q)) {
    (p, c) = PickBrokenEdge(q);
    ZigZagJoin(p, c);
}

18 Evaluation
Proved that, with the exception of invalid positions, every position returned by a virtual cursor would also be returned by a physical cursor
Typically far fewer positions are returned by virtual cursors
No additional I/O

19 Performance Analysis
Structural join: employee//name
No PathIDs and no ancestor information
[Figure: query tree employee, name; inverted lists shown: Emp, Name]

20 Performance Analysis
Structural join: employee//name
With PathIDs and no ancestor information
[Figure: query tree employee, name; inverted lists shown: Emp, Name]

21 Performance Analysis
Structural join: employee//name
No PathIDs but with ancestor information
[Figure: query tree employee, name; inverted lists shown: Emp, Name]

22 Performance Analysis
Structural join: employee//name
PathIDs and ancestor information
[Figure: query tree employee, name; inverted lists shown: Emp, Name]

23 Performance Analysis
Structural join: employee//name
PathIDs and ancestor information with Virtual Cursors
[Figure: query tree employee, name; inverted lists shown: Name]

24 Prototype
Implemented over Berkeley DB B-tree
Inverted lists
Posting: Token = Location = Position is either BEL or Dewey Data = or <>

25 Data Sets
XMark: 10 documents of size ~100 MB each
Synthetic: 7 tags (A, B, …, G); uncorrelated, no self-nesting
Frequency: A = B = C = D = X; E = X/10; F = X/100; G = X/1000

26 Experimental Results //employee//name

27 Experimental Results //employee//name

28 Experimental Results //  //e//d

29 Experimental Results //  //e//d

30 Experimental Results //α//A//B//C

31 Experimental Results //A//B//C//α

32 Experimental Results
Works better if elements in the dataset are uncorrelated //employee//name
The deeper the query, the better for the virtual cursors algorithm (more internal nodes)
The more selective the join at the bottom of the query, the better, since we use only leaf nodes

33 Overhead of Index Features
Uncompressed (XMark): BEL 463 MB, s; Dewey 538 MB, s
Path index incurs no overhead for text-centric datasets (size, index build time, and runtime)
Higher cost comes from integrating path information into the inverted index
Overall the overhead of index features is small, but grows with the dataset depth

34 Conclusion
Virtual cursors reduce the size of the input data by using only leaf nodes
Easily integrated in current structural and holistic twig join algorithms
Overhead of index features (path indices and ancestor information) is acceptable
Path indices and ancestor information combined produce better results

35 More details ntoura/papers/cikm2004.pdf