Twig2Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents
Songting Chen, Hua-Gang Li*, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal and K. Selcuk Candan
NEC Laboratories America
* University of California, Santa Barbara
VLDB' Seoul, Korea

Background
XML
–Hierarchical (tree) structured data
–Provides flexibility to model semi-structured data
–Widely accepted as a universal data exchange format
Query over XML
–XPath, XQuery [W3C]
–Extensively used by many applications
–Adopted by a number of commercial systems

State-of-the-art: XML Query Processing
Binary structural joins [Timber]
–Large intermediate results
Holistic approach
–Path queries: PathStack [Bruno et al.]
–Tree queries: TwigStack [Bruno et al.]
–Generalized Tree Pattern (GTP) queries: ? (Twig2Stack)
Algebraic approach
–Generalized Tree Pattern (GTP): optimizes multiple path expressions of XQuery [Chen et al.]
–Expensive post-processing

Processing Generalized Tree Pattern (GTP) Queries
XQuery: FOR $b in //A[E]/B, $d in $b/D LET $c := $b/C RETURN $b, $c, $d
GTP (figure): query nodes A, B, C, D
–Node types: return node, group return node, non-return node
–Axis types: mandatory axis, optional axis
Algebraic approach [Chen et al.] relies on
–Structural joins
–Structural outer joins
–Grouping
–Duplicate elimination
–Sort
Example (figure): //A//B and //A/B over elements a1, a2, b1, b2
Our goal: avoid ALL of these!

Motivation
PathStack [Bruno et al.]
–Query: //A//B; data (figure): stacks S[A] with a1, a2 and S[B] with b1, b2
–Key observation: minimize intermediate results through a compact representation of path matches
–Inter-node: record the AD (ancestor-descendant) relationship between elements in different query nodes, e.g., b1→a2, b2→a2
–Intra-node: record the AD relationship between elements within the same query node, e.g., b1, b2
TwigStack [Bruno et al.]
–Minimizes intermediate results by outputting only those path matches that are in the final twig results
–However, such optimality cannot be guaranteed [Choi et al.]
–Not helpful for processing GTP queries
Question: can we minimize intermediate results for twig queries through compact result encoding (similar to PathStack)?
–Would this be useful for processing GTP queries as well?
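
The compact representation described for PathStack can be illustrated with a simplified sketch for linear ancestor-descendant paths such as //A//B. This is not the authors' code: the function name `path_stack`, the tuple layout, and the interval values are hypothetical, and the nesting a1 ⊃ a2 ⊃ {b1, b2} is only an assumption chosen to be consistent with the pointers b1→a2 and b2→a2 mentioned on the slide. Each stack entry stores a pointer to the top of the parent stack at push time, so four path matches are encoded by four stack entries and two pointers.

```python
def path_stack(elements, path):
    """Simplified PathStack-style evaluation of //t0//t1//...//tn.

    elements: (name, tag, left, right) tuples sorted by `left` (document order).
    path: list of distinct tags, root first. All axes are ancestor-descendant.
    """
    index = {tag: i for i, tag in enumerate(path)}
    stacks = [[] for _ in path]  # entries: (name, right, pointer into parent stack)
    results = []

    def expand(i, ptr):
        # Yield every chain of ancestors for query nodes 0..i that uses
        # entries 0..ptr of stacks[i]; all of them enclose the current leaf.
        if i < 0:
            yield ()
            return
        for k in range(ptr + 1):
            name, _, parent_ptr = stacks[i][k]
            for prefix in expand(i - 1, parent_ptr):
                yield prefix + (name,)

    for name, tag, left, right in elements:
        # Pop elements that ended before the current element starts.
        for stack in stacks:
            while stack and stack[-1][1] < left:
                stack.pop()
        i = index.get(tag)
        if i is None:
            continue
        if i > 0 and not stacks[i - 1]:
            continue  # no open ancestor for the parent query node
        ptr = len(stacks[i - 1]) - 1 if i > 0 else -1
        stacks[i].append((name, right, ptr))
        if i == len(path) - 1:
            # Leaf query node: decode the compact encoding into path matches.
            for prefix in expand(i - 1, ptr):
                results.append(prefix + (name,))
            stacks[i].pop()
    return results


# Hypothetical intervals; only the nesting matters.
ELEMENTS = [
    ("a1", "A", 1, 10),
    ("a2", "A", 2, 9),
    ("b1", "B", 3, 4),
    ("b2", "B", 5, 6),
]
MATCHES = path_stack(ELEMENTS, ["A", "B"])
print(MATCHES)
# [('a1', 'b1'), ('a2', 'b1'), ('a1', 'b2'), ('a2', 'b2')]
```

Both b1 and b2 carry a single pointer to a2; the pair with a1 is implied because a1 sits below a2 in S[A].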

Hierarchical Stack Encoding
Inter-node: //A//B
–Can still use explicit edges
Intra-node: A
–Matching elements form a tree structure as well
Associate each query node with a hierarchical stack
–Push element e into hierarchical stack HS[E] iff e satisfies the sub-twig query rooted at E
–Matching can be determined once the entire sub-tree of e has been seen
–Requires post-order document traversal
Figure: elements a1, a2, a3, a4 and the hierarchical stack HS[A] holding them
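
The push condition above can be sketched as a post-order traversal. The sketch below is a deliberate simplification and not the authors' implementation: `Node`, `QNode`, and `match_bottom_up` are hypothetical names, each HS[E] is kept as a flat list rather than a true hierarchical stack with edges, the document is an in-memory tree rather than a stream, and every tag is assumed to occur at most once in the query. The sample document is also an assumption, following one plausible reading of the deck's running example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical in-memory XML element
    name: str
    tag: str
    children: list = field(default_factory=list)


@dataclass
class QNode:  # hypothetical twig query node
    tag: str
    # (axis, child) pairs; axis is "/" (parent-child) or "//" (ancestor-descendant)
    children: list = field(default_factory=list)


def match_bottom_up(root, query):
    """Return {tag: [element names]} of elements satisfying the sub-twig at tag."""
    qnodes = {}

    def collect(q):
        qnodes[q.tag] = q
        for _, child in q.children:
            collect(child)

    collect(query)
    hs = {tag: [] for tag in qnodes}

    def visit(e):
        # Post-order: children first, so e's whole sub-tree is seen before e.
        child_tags, below_tags = set(), set()
        for c in e.children:
            c_self, c_subtree = visit(c)
            child_tags |= c_self      # query tags matched by direct children
            below_tags |= c_subtree   # query tags matched anywhere below e
        self_tags = set()
        q = qnodes.get(e.tag)
        if q is not None and all(
            qc.tag in (child_tags if axis == "/" else below_tags)
            for axis, qc in q.children
        ):
            hs[e.tag].append(e.name)  # e satisfies the sub-twig rooted at its tag
            self_tags.add(e.tag)
        return self_tags, below_tags | self_tags

    visit(root)
    return hs


# Assumed document shape: a1(a2(b1(d1(b2(d2, c1)), c2)), b3(d3))
b2 = Node("b2", "B", [Node("d2", "D"), Node("c1", "C")])
b1 = Node("b1", "B", [Node("d1", "D", [b2]), Node("c2", "C")])
a1 = Node("a1", "A", [Node("a2", "A", [b1]), Node("b3", "B", [Node("d3", "D")])])

# Query: A/B with B//D and B//C
QUERY = QNode("A", [("/", QNode("B", [("//", QNode("D")), ("//", QNode("C"))]))])
STACKS = match_bottom_up(a1, QUERY)
print(STACKS)
# {'A': ['a2'], 'B': ['b2', 'b1'], 'D': ['d2', 'd1', 'd3'], 'C': ['c1', 'c2']}
```

Note that b2 is pushed even though it has no A parent: the push test looks only at the sub-twig below the query node, and ancestors are checked later when a higher element is pushed. b3 is rejected because it has no C descendant, and a1 is rejected because none of its children is a matching B.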

Twig2Stack: Running Example
Query (figure): twig over nodes A, B, C, D
Data (figure): elements a1, a2, b1, b2, b3, c1, c2, d1, d2, d3, each labeled with an interval and a level
–Labels shown: [1,20], 1; [2,15], 2; [3,14], 3; [4,11], 4; [5,10], 5; [8,9], 6; d2: [6,7], 6; c2: [12,13], 4; b3: [16,19], 2; d3: [17,18], 3
Hierarchical stacks (figure): HS[A]: a2; HS[B]: b2, b1; HS[C]: c1, c2; HS[D]: d2, d1, d3
Merging stacks
TwigStack needs to enumerate 3 matches for //A/B//D and 2 for //A/B//C, then join them together.
Twig2Stack requires neither path joins nor path enumeration!
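
The interval and level labels in the running example support constant-time structural tests: an element is an ancestor of another when its interval strictly contains the other's, and a parent when the levels also differ by one. The sketch below uses naive nested loops purely to count path matches; it is not TwigStack or Twig2Stack. The names `Elem`, `is_ancestor`, `is_parent`, and `path_matches` are hypothetical, and the pairing of a1, a2, b1, d1, b2, and c1 with their intervals is an assumption, since the figure did not survive extraction. Under that assumed pairing the counts agree with the 3 and 2 path matches stated on the slide.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    name: str
    tag: str
    left: int
    right: int
    level: int


def is_ancestor(a: Elem, d: Elem) -> bool:
    # a's interval strictly contains d's interval
    return a.left < d.left and d.right < a.right


def is_parent(p: Elem, c: Elem) -> bool:
    return is_ancestor(p, c) and p.level + 1 == c.level


# Assumed pairing of labels and intervals (see lead-in).
DOC = [
    Elem("a1", "A", 1, 20, 1),
    Elem("a2", "A", 2, 15, 2),
    Elem("b1", "B", 3, 14, 3),
    Elem("d1", "D", 4, 11, 4),
    Elem("b2", "B", 5, 10, 5),
    Elem("d2", "D", 6, 7, 6),
    Elem("c1", "C", 8, 9, 6),
    Elem("c2", "C", 12, 13, 4),
    Elem("b3", "B", 16, 19, 2),
    Elem("d3", "D", 17, 18, 3),
]


def path_matches(doc, leaf_tag):
    """Naively enumerate matches of //A/B//leaf_tag."""
    def by_tag(tag):
        return [e for e in doc if e.tag == tag]

    return [
        (a.name, b.name, x.name)
        for a in by_tag("A")
        for b in by_tag("B")
        if is_parent(a, b)
        for x in by_tag(leaf_tag)
        if is_ancestor(b, x)
    ]


D_PATHS = path_matches(DOC, "D")
C_PATHS = path_matches(DOC, "C")
print(D_PATHS)  # [('a1', 'b3', 'd3'), ('a2', 'b1', 'd1'), ('a2', 'b1', 'd2')]
print(C_PATHS)  # [('a2', 'b1', 'c1'), ('a2', 'b1', 'c2')]
```

A path-based method would then join the two lists on the shared (A, B) prefix; only (a2, b1) appears in both, so (a1, b3, d3) is enumerated and then discarded.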

GTP Result Enumeration
Bottom-up computation vs. top-down enumeration
–Visit only those elements that are in the twig matches
Handling grouping results
–Automatic grouping through inter-node edges
Handling duplicates and out-of-order results
–Problems come from non-return nodes
–If D is a return node while B is not: b1 → d1, d2, d3 and b2 → d2, d3 (duplicates)
–Observation: the intra-node hierarchy provides hints
Figure: elements a4, b1, b2, c1, c2, d1, d2, d3
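
One way to read the observation about the intra-node hierarchy: when B is a non-return node connected to D by an ancestor-descendant axis, a B element nested inside another B element can only reach D elements that the outer one already reaches, so enumerating from the top-most B elements alone yields each D once and in document order. The sketch below illustrates that reading with plain interval tests; it is not the authors' enumeration algorithm, the function names are hypothetical, and the interval values are invented so that b1 → d1, d2, d3 and b2 → d2, d3 as on the slide.

```python
def is_ancestor(a, d):
    # Elements are (name, left, right); a strictly contains d.
    return a[1] < d[1] and d[2] < a[2]


def naive_descendants(bs, ds):
    """Follow every B element to its D descendants (may repeat D elements)."""
    return [d[0] for b in bs for d in ds if is_ancestor(b, d)]


def top_most_descendants(bs, ds):
    """Follow only B elements that are not nested inside an earlier B element.

    bs and ds must be in document order (sorted by left position).
    """
    out = []
    last_top = None  # most recent top-most B element
    for b in bs:
        if last_top is not None and is_ancestor(last_top, b):
            continue  # nested B: its D descendants were emitted via last_top
        last_top = b
        out.extend(d[0] for d in ds if is_ancestor(b, d))
    return out


# Hypothetical intervals: b1 contains d1, b2, d2, d3; b2 contains d2, d3.
BS = [("b1", 1, 12), ("b2", 4, 11)]
DS = [("d1", 2, 3), ("d2", 5, 6), ("d3", 7, 8)]

NAIVE = naive_descendants(BS, DS)
DEDUP = top_most_descendants(BS, DS)
print(NAIVE)  # ['d1', 'd2', 'd3', 'd2', 'd3']
print(DEDUP)  # ['d1', 'd2', 'd3']
```

The shortcut holds only for the ancestor-descendant axis; with a parent-child axis a nested B element can have D children that the outer B element does not, so nested elements could not simply be skipped.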

Experiment Setup
Implementation
–Twig2Stack: Java
–TwigStack, TJFast: Java, kindly provided by Jiaheng Lu from National University of Singapore (NUS)
Datasets
–XMark, DBLP, TreeBank
Metrics
–Query processing time
–IO time

Processing Full Twig Queries
Optimization of query processing: TwigStack, Twig2Stack
Optimization of IO: TJFast

Not Yet Done: Memory Usage
Hierarchical stack encoding could hold the entire document in memory in the worst case
–Unlike the DOM approach, only matches need to be stored: tag match, (partial) twig match, predicate evaluation
Early result enumeration dramatically reduces memory usage
–Enumerate query results before the end of the document and release the buffer
–Main idea: hybrid of top-down (PathStack) and bottom-up (Twig2Stack) approaches

Early Result Enumeration (ERM)
Enumerate results and release the buffer when elements in the top-branch node are popped from PathStack
Figure: the running example (query over A, B, C, D and the same labeled document) with PathStack stacks S[A], S[B], S[C], S[D] alongside the hierarchical stacks HS[A], HS[B], HS[C], HS[D]

Memory Usage
Figure: memory usage charts. Recoverable labels: dblp, article, title, year; site, open_auctions, bidder, bid, reserve; small sub-tree; huge sub-tree; increase

Conclusions and Future Work
Proposed a bottom-up GTP processing solution
–A twig encoding scheme
–A GTP enumeration algorithm that avoids any post-processing operations
–A hybrid scheme to reduce memory usage
Future directions
–Handling worst-case memory issues
–Optimizing IO cost by exploiting indexes
–Handling other axes, full XQuery, graph input
–Handling XML streams
–…

Processing GTP
Optimization of non-return nodes
Automatic grouping