Wei Wang University of New South Wales, Australia

1 Efficient Processing of XML Path Queries Using the Disk-based F&B Index
Wei Wang, University of New South Wales, Australia
With Hongzhi Wang (HIT), Hongjun Lu (HKUST), Haifeng Jiang (IBM), Xuemin Lin (UNSW), Jianzhong Li (HIT)
7/4/2019 Dr. Wei CSE, UNSW

2 XML Query Processing
XML is modeled as a labeled tree.
Queries are expressed as structural constraints:
  Simple path queries, e.g., //Customer//Name
  Branching/twig queries, e.g., //Customer[//Zipcode]//Name
VLDB 2005

3 Index or Join?
Q1: /a/b
Index-based approaches
  DataGuide, 1-index
  F&B Index and a few approximate indexes
Join-based approaches
  Structural join
  Twig join
[Figure: example tree with nodes labeled a and b]
Join-based approaches appear to be more actively researched!
Notes: There are also hybrid approaches, e.g., the mixed-mode paper from Wisconsin in VLDB 2003. If the XML document is a tree, all of these indexes are trees as well.

4 Outline
Introduction
Disk-based F&B Index
Experiment
Conclusions

5 XML Structural Indexes
“Exact” indexes
  1-index
    Based on backward bisimilarity
    Covers all simple path queries
  F&B Index
    Based on backward and forward bisimilarity
    Covers all branching queries (optimally)
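
The two equivalence notions on this slide can be made concrete on a small tree. The sketch below is a minimal illustration and not the construction used in the paper. It assumes the data is a tree, so that backward bisimilarity reduces to "same root-to-node label path", and it computes the F&B partition by repeatedly refining the label partition with each node's parent class and set of child classes until nothing changes. The example tree and the names one_index and fb_index are hypothetical.

```python
from collections import defaultdict

def one_index(labels, parent):
    """Group nodes by root-to-node label path (1-index classes on a tree)."""
    groups = defaultdict(list)
    for n in labels:
        path, m = [], n
        while m is not None:
            path.append(labels[m])
            m = parent[m]
        groups[tuple(reversed(path))].append(n)
    return sorted(sorted(g) for g in groups.values())

def fb_index(labels, parent):
    """Refine the label partition by parent class and child classes until stable."""
    children = defaultdict(list)
    for n, p in parent.items():
        if p is not None:
            children[p].append(n)
    block = dict(labels)  # initial partition: one class per tag
    while True:
        ids, refined = {}, {}
        for n in labels:
            sig = (block[n],
                   block.get(parent[n]),  # None for the root
                   frozenset(block[c] for c in children[n]))
            refined[n] = ids.setdefault(sig, len(ids))
        # Each round only splits classes, so an unchanged count means stable.
        stable = len(ids) == len(set(block.values()))
        block = refined
        if stable:
            break
    groups = defaultdict(list)
    for n, b in block.items():
        groups[b].append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical tree: a has four b children with different child sets.
labels = {"a": "a", "b1": "b", "b2": "b", "b3": "b", "b4": "b",
          "c1": "c", "d2": "d", "c3": "c", "d3": "d", "c4": "c", "d4": "d"}
parent = {"a": None, "b1": "a", "b2": "a", "b3": "a", "b4": "a",
          "c1": "b1", "d2": "b2", "c3": "b3", "d3": "b3", "c4": "b4", "d4": "b4"}

one_classes = one_index(labels, parent)
fb_classes = fb_index(labels, parent)
```

On this tree the 1-index keeps all four b nodes in one class, which is enough for /a/b, while the F&B partition separates them by their children, which is what a branching query such as /a/b[c][d] needs.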

6 A Running Example
Q1: /a/b
Q2: /a/b[d]
Q3: /a/b[c][d]
Extent: {b, b, b}
The F&B index is refined from the 1-index.

7 Problems with F&B Index?
Lack of scalability
  Usually large in practice
  No immediate solution when it cannot be accommodated in memory
  Unbalanced, all-leaf-nodes tree
  Naïve solutions (e.g., B+-tree, pre-order clustering in Lore, subtree clustering in Natix) do not work well
Lack of efficiency
  Non-deterministic searching
  //-axis requires traversing the whole subtrees
  Much more costly when the index is not in memory
Example: 100M XMark, 2M doc nodes → 0.5 million F&B nodes if treated as a tree.

8 Outline
Introduction
Disk-based F&B Index
Experiment
Conclusions

9 Disk-based F&B Index
Overcome the memory limit by storing the F&B index on disk.
A naïve method does not work well: for the query Q1: /a/b, it needs to touch all the pages and incurs random I/O.

10 Basic Idea
Moral: Clustering is important
  Cluster by tag → tape
  Cluster by parent → segment & block
  Cluster by 1-index ID → chunk
Benefits:
  Optimized tree traversals
  Enable other intelligent algorithms

11 Q1: /a/b

12 Query Processing (Q.P.) by Tree Traversal
Dim 1: DFS/BFS
Dim 2: Path/Branching Path
Dim 3: / or //
Example queries: Q5: /a/b/c, Q2: /a/b[d], Q4: /a//c
Problem: Still have to traverse the entire subtrees to process //
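
The cost difference between / and // under plain traversal can be seen in a small in-memory sketch. It is an illustration only: the tree, the step encoding as (axis, tag) pairs, and the names descendants and evaluate are assumptions, and no disk layout is modeled. A / step inspects only the children of each context node, whereas a // step walks every node below it.

```python
def descendants(kids, n):
    """Yield every node strictly below n using an explicit stack (depth-first)."""
    stack = list(kids.get(n, []))
    while stack:
        m = stack.pop()
        yield m
        stack.extend(kids.get(m, []))

def evaluate(labels, children, root, steps):
    """Evaluate a simple path; return (matching nodes, number of nodes inspected)."""
    kids = dict(children)
    kids[None] = [root]          # virtual document node above the root
    context, touched = [None], 0
    for axis, tag in steps:
        matched, seen = [], set()
        for n in context:
            candidates = kids.get(n, []) if axis == "/" else descendants(kids, n)
            for m in candidates:
                touched += 1
                if labels[m] == tag and m not in seen:
                    seen.add(m)
                    matched.append(m)
        context = matched
    return context, touched

# Hypothetical index tree: a -> b1 -> c1 and a -> b2 -> d1 -> c2
labels = {"a": "a", "b1": "b", "b2": "b", "c1": "c", "d1": "d", "c2": "c"}
children = {"a": ["b1", "b2"], "b1": ["c1"], "b2": ["d1"], "d1": ["c2"]}

q5_nodes, q5_touched = evaluate(labels, children, "a", [("/", "a"), ("/", "b"), ("/", "c")])
q4_nodes, q4_touched = evaluate(labels, children, "a", [("/", "a"), ("//", "c")])
```

Here the // step in Q4 inspects all five nodes below a, even though only two of them are labeled c.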

13 Q.P. by RangeFetch
H(1, c) = [3, 6], where H is keyed by (chunkID, tagName)
Q4: /a//c
Restriction: Can only answer /p//q, where p is a simple path.
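
The slide gives only H(1, c) = [3, 6] and the key format (chunkID, tagName). One possible reading, which is an assumption here and is not confirmed by the transcript, is that H maps a chunk and a tag to a contiguous, inclusive range of positions on that tag's tape, so that the // step becomes a single sequential read instead of a subtree traversal. The sketch below illustrates that reading with a hypothetical tape represented as a Python list and 0-based positions.

```python
def range_fetch(tapes, H, chunk_id, tag):
    """Return the tape entries for `tag` recorded under `chunk_id`, or [] if none.

    Assumption: H[(chunk_id, tag)] is an inclusive (lo, hi) position range.
    """
    rng = H.get((chunk_id, tag))
    if rng is None:
        return []
    lo, hi = rng
    return tapes[tag][lo:hi + 1]

# Hypothetical data: a tape of c index nodes and the table entry from the slide.
tapes = {"c": ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]}
H = {(1, "c"): (3, 6)}

fetched = range_fetch(tapes, H, 1, "c")
missing = range_fetch(tapes, H, 2, "c")
```

Under this reading, the restriction to /p//q follows naturally: the simple path p identifies a chunk through its 1-index ID, and the table lookup handles the single // step.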

14 More Data Structures
3 more tapes:
  Extents Tape: add a region code for each d-node in the extents
    Use physical (start, end) codes
    Sort d-nodes according to (start, end)
  Doc Tape
  Value Tape
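
Region codes make ancestor and descendant tests a matter of interval containment. The sketch below is a minimal illustration: it numbers nodes with a simple counter during a depth-first walk, whereas the slide speaks of physical (start, end) codes, which presumably come from positions in the stored document. The tree and the names assign_regions and is_ancestor are hypothetical.

```python
def assign_regions(children, root):
    """Assign (start, end) to each node so that ancestors enclose descendants."""
    start, regions, counter = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, closing = stack.pop()
        counter += 1
        if closing:
            regions[node] = (start[node], counter)
        else:
            start[node] = counter
            stack.append((node, True))          # revisit after the subtree
            for c in reversed(children.get(node, [])):
                stack.append((c, False))
    return regions

def is_ancestor(regions, a, d):
    """True if a's region strictly encloses d's region."""
    (a_start, a_end), (d_start, d_end) = regions[a], regions[d]
    return a_start < d_start and d_end < a_end

# Hypothetical document tree: a -> (b -> c), d
children = {"a": ["b", "d"], "b": ["c"]}
regions = assign_regions(children, "a")

# Sorting d-nodes by (start, end) gives document order.
doc_order = sorted(regions, key=regions.get)
```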

15 Example

16 SegSJ
Key observation: The structural relationship between two segments can be inferred from the relationship between the first d-nodes in their extents.
SegSJ(/p//q)
  R(s, e) ← A = /p
  S(s, e) ← D = //q
  Structural join R and S
    Using partition-based or sorting-based SJ algorithm
Take the (s, e) of the first d-node in each segment:
  b1 → (10,78), (210, 297), …
  d1 → (19,25), (54, 66), …
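
The join step can be sketched with a standard sort-based, stack-driven ancestor-descendant structural join over (start, end) pairs. This is a generic textbook-style join given as an illustration and is not necessarily the exact algorithm in the paper. It assumes regions come from a tree, so any two regions are either nested or disjoint. The first two entries of each input are the values shown on the slide, the extra descendant (300, 310) is made up to show a non-match, and the name structural_join is hypothetical.

```python
def structural_join(ancestors, descendants):
    """Return all (a, d) pairs where region a strictly encloses region d."""
    A, D = sorted(ancestors), sorted(descendants)
    out, stack, i = [], [], 0
    for d in D:
        # Push every ancestor candidate that starts before d.
        while i < len(A) and A[i][0] < d[0]:
            # Drop candidates that ended before this one starts.
            while stack and stack[-1][1] < A[i][0]:
                stack.pop()
            stack.append(A[i])
            i += 1
        # Drop candidates that ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack is a nested chain enclosing d.
        out.extend((a, d) for a in stack)
    return out

R = [(10, 78), (210, 297)]            # first d-nodes of segments for A = /p
S = [(19, 25), (54, 66), (300, 310)]  # first d-nodes of segments for D = //q

pairs = structural_join(R, S)
```

Because each segment is represented by a single (s, e) pair, the join inputs stay proportional to the number of segments and not to the number of document nodes.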

17 Outline
Introduction
Disk-based F&B Index
Experiment
Conclusions

18 Experiments Setup
Datasets: DBLP/XMark/TreeBank
8 representative queries
  Dim 1: PC/AD (parent-child / ancestor-descendant)
  Dim 2: Path/Twig
  Dim 3: Large/Small
Algorithms: DFS, BFS, RangeFetch, SegSJ
Other systems: NoK, TwigStack, Kaushik’s algorithm in [SIGMOD 04]*
Metric: time/PIO/LIO
* Kaushik: On the integration …

19 Varying Buffer Size (PC-Path)

20 Varying Buffer Size (PC-Twig)

21 Varying Buffer Size (AD-Path)

22 Varying Buffer Size (AD-Twig)

23 Buffer Hit Ratio

24 Scalability

25 Comparing with Other Systems

26 Outline
Introduction
Disk-based F&B Index
Experiment
Conclusions

27 Conclusions
Disk-based F&B Index
  Store and cluster the index on the disk
  More efficient and intelligent query processing algorithms
  Demonstrated good scalability and query efficiency
Expecting new query processing algorithms based on index probing (in addition to join-based approaches)

28 Q&A
Thank You!

29 Related Work
Indexes
  Exact: DataGuide, 1-index, F&B Index
  Approx: Approx. DataGuide, A(k)-index, D(k)-index, M*(k)-index
Join-based approaches
Hybrid approach: “mixed-mode” in [VLDB 03]
  Niagara [VLDB 03] combines tree traversals + joins
  [SIGMOD 04] uses 1-index to accelerate joins
Clustering
  Lore: pre-order
  Natix: subtree

