Published by Deshaun Newsome. Modified over 9 years ago.
1
Efficient Physical Operators for a Cost-Based XPath Execution Engine
Haris Georgiadis, Minas Charalambides, Vasilis Vassalos
Athens University of Economics and Business
2
Motivation (1)
XPath query: /s/r/*/it[mb/m/to='x']//k
Three navigation alternatives (among others):
  1. Straightforward navigation: retrieve all it elements under /s/r/*/it; keep those having at least one to descendant under /mb/m/to with text value 'x'. For the it elements left, return their k descendants.
  2. Starting from k: return all k elements with at least one it ancestor which, in turn, has a to descendant under /mb/m/to with text value 'x' and has an s document-element ancestor via the relative path parent::*/parent::r/parent::s.
  3. Starting from to: return all to elements under /s/r/*/it/mb/m/to, keep only those with text value 'x', then go backward via parent::m/parent::mb/parent::it and, for the it elements left, return their k descendants.
3
Motivation (2)
Many XPath processing algorithms:
  PPFS+, Staircase Join, Sort-Merge-based structural joins, PathStack, Twig2Stack, etc.
Many physical data models and storage techniques:
  Shredding on relations: schema-based mapping vs. edge-based mapping
  Storage into disk pages preserving the XML hierarchy
  Structural encodings: region encoding vs. prefix-based encoding
  Data structures: XB-trees, F&B index, path indexes
4
Classifying current approaches
Techniques based on RDBMSs (schema-oblivious mapping):
  XPath to SQL: XPath Accelerator [Grust 2004], XRel [Yoshikawa 2001], Shrex [Amer-Yahia 2004], PPFX [GV 2006, 2007]
  XPath to (enhanced) relational algebra: IBM DB2, MS SQL Server, MonetDB/XQuery, Oracle DB
Techniques not based on RDBMSs (inverted lists and/or B-Trees):
  Structural joins: MPMGJN [Zhang 2001], Staircase Join [Grust 2003], XR-Tree [Jiang 2003]
  Holistic path or twig processing: PathStack and TwigStack [Bruno 2002], Twig2Stack [Chen 2006], TJFast [Lu 2005], DataGuide [Goldman 1997], M(k)-index [He 2004], F&B-index [Wang 2005]
  Navigational approach: Natix [Brantner 2005], Niagara [Halverson 2003], Lore [McHugh 1999]
  GeCOEX system: LU and SM families of physical operators
5
Contribution I
GeCOEX: the first generic XPath cost-based execution and optimization framework
  Agnostic to the underlying XML storage system and the access methods it supports
  Independent of the techniques and algorithms available for XPath processing; these are encapsulated in operator implementations and rewriting rules
  Cost-based optimization
6
Contribution II
XPalgebra: a novel XPath logical algebra
  Good fit with many XPath processing techniques
Lookup and SM: two novel and efficient families of physical operators for XPath
Multiple storage engines
Experimental evaluation: direct comparison of operator implementations
7
GeCOEX System Architecture
[Architecture diagram. An XPath query enters the Parser; Query Optimization uses the Physical Plan Selector together with Rewriting Rules and the Descriptors (Physical Operator Descriptors, Cost Models); Query Execution uses the Physical Plan Executor and the Physical Operators to produce the result. Both sit on top of the XPA API, which an XPA Driver implements by providing the Data Model, Primitive Access Methods, Primitive Access Method Cost Models and Database Statistics.]
8
XPalgebra
Generic sequence-based logical algebra for a subset of XPath
  Forward and backward axes
  Non-positional predicates involving conjunctive Boolean expressions
Maintains the navigational nature of XPath
Data model
  Element
  Sequence: duplicate-free list of elements in document order
Sequence operators: (mainly) navigation
  Input and output: Sequence
Boolean operators: used for filtering
  Input: Element
  Output: true or false
9
XPalgebra – Sequence Operators
Both the input and the output of a sequence operator are sequences of nodes.
The input sequence is called the context sequence.
BoolExpr: const | Ъ_1 ^ Ъ_2 ^ … ^ Ъ_n, where each Ъ_i is a Boolean operator
Examples (operator, then the XPath fragment it evaluates):
  c_a(S)                   …/a
  fp_{/d//c}(S)            …/d//c
  cs_{^g^^f, //k/l}(S)     parent::g/ancestor::f//k/l
  vf_{text()=5}(S)         …[text()=5]
Further operators shown on the slide: pf_{/a/b//g}(S) and f(S, Ъfp_{/d//c})
11
XPalgebra – Boolean Operators
Applied on single nodes only; the input element is called the context element.
Return Boolean values.
BoolExpr: const | Ъ_1 ^ Ъ_2 ^ … ^ Ъ_n, where each Ъ_i is a Boolean operator
Examples (operator, then the XPath fragment it evaluates):
  f(S, Ъfp_{/d//c})                          …[d//c]
  f(S, Ъfp_{/d//c}(Ъvf_{@a=2} ^ Ъc_d))       …[d//c[@a=2]/d]
13
XPalgebra – examples
XPath: /s/r/*/it[mb/m/to='x']//k
XPalgebra: d_k(f(fp_{/s/r/*/it}(root), Ъfp_{/mb/m/to}(Ъvf_{text()=x})))
14
Physical Operators
Each physical operator implements the Sequence interface of the XPA API.
It accesses the XML data through the AccessMethods interface of the XPA API.
Example: a physical operator implementation (shown as a figure on the slide).
This is how physical operators stay agnostic to the physical data model.
15
Physical Operators
Large number of physical operators, divided roughly into four "families":
  Lookup operators (LU)
    Inspired by the indexed nested loops join
    d^LU_a: for each element n from the input sequence S, make a lookup using XPAAPI.Descs(n, a)
  SortMerge-based operators (SM)
    Inspired by the sort-merge join
    d^SM_a: scan all elements from the input sequence S and all a elements (using XPAAPI.Descs(root, a)) and find "ancestor-descendant" matches
  Staircase Join operators [Grust 2003]
  PathStack operators [Bruno 2002]
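The lookup idea behind d^LU_a can be sketched in a few lines of Python. This is a minimal illustration, not the GeCOEX implementation: the `Elem` tuple, the in-memory per-tag index and the `descs` helper are hypothetical stand-ins for an XPA driver and for XPAAPI.Descs, and a simple region encoding (start, end) is assumed. Duplicates from nested context elements are removed with a set and a final sort, which is a simpler substitute for the pipelined technique the talk describes.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical region-encoded element: d is a descendant of n
# exactly when n.start < d.start and d.end < n.end.
Elem = namedtuple("Elem", "tag start end")

def build_index(elements):
    """Per-tag lists sorted by start (stand-in for one B-Tree per tag name)."""
    index = {}
    for e in sorted(elements, key=lambda e: e.start):
        starts, elems = index.setdefault(e.tag, ([], []))
        starts.append(e.start)
        elems.append(e)
    return index

def descs(index, n, tag):
    """Stand-in for XPAAPI.Descs(n, tag): the window of tag-descendants of n."""
    starts, elems = index.get(tag, ([], []))
    lo = bisect_right(starts, n.start)  # first candidate opening after n opens
    hi = bisect_right(starts, n.end)    # first candidate opening after n closes
    return elems[lo:hi]

def lu_descendant(index, context, tag):
    """Descendant step as indexed nested loops: one lookup per context element."""
    seen, result = set(), []
    for n in context:
        for e in descs(index, n, tag):
            if e.start not in seen:     # nested context elements yield repeats
                seen.add(e.start)
                result.append(e)
    result.sort(key=lambda e: e.start)  # restore document order
    return result

# Tiny hypothetical document: r contains b1 (which contains b2, which contains f1),
# then f2, then b3.
r = Elem("r", 1, 12)
b1, b2, b3 = Elem("b", 2, 7), Elem("b", 3, 6), Elem("b", 10, 11)
f1, f2 = Elem("f", 4, 5), Elem("f", 8, 9)
index = build_index([r, b1, b2, b3, f1, f2])
nested_result = lu_descendant(index, [b1, b2], "f")
root_result = lu_descendant(index, [r], "f")
```

The work done is proportional to the number of context elements plus the sizes of their windows, which is why a lookup plan tends to pay off when the context sequence is small.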
16
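Physical Operators
[Table: which logical operators each physical operator family implements. Columns: LU*, SM*, Staircase [Grust 2003], PathStack [Bruno 2002]. Rows: c (child), d (descendant), fp (forward path), p (parent), a (ancestor), bp (backward path), cs (cousin). Legend: X marks a missing combination; ** : inspired by original. The individual cell marks are not recoverable from this transcript.]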
17
Physical Plan
Query: /s/r/*/it[mb/m/to='x']//k
  1. Use SM to find the it elements under /s/r/*/it.
  2. Filter the it elements:
       For each it, use LU to check whether it has to elements under mb/m/to.
       For each to, check whether its text node equals 'x'.
  3. For the remaining it elements, find their k descendants using the Staircase Join.
18
Costing Physical Operators
The cost estimate of a physical operator is defined by its Descriptor and is based on:
  the cardinality of the input sequence/logical operator
  certain statistics (DBStatistics interface)
  the cost of the primitive access methods it invokes (AccMCostModels interface)
This is how physical operator cost models stay agnostic to the physical data model.
19
Costing an Operator
The cost of a physical operator is based on:
  the cardinality of the context sequence
  certain statistics
  the cost of the primitive access methods it invokes
Example:
  Cost(d^LU_k) = Card(f) * (c1 + Occ(/s/r/*/it, //k) * c2)
  where
    c1: the result of CostForDescLookup()
    c2: the result of CostForNextDesc()
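To make the shape of this cost formula concrete, here is a small worked sketch. The function name and every number other than Card(f) = 3.8 (the value given on the cardinality slide) are hypothetical, and Occ(p, q) is read here as the average number of elements reachable via q per element of p, which is an interpretation rather than a definition stated in the talk.

```python
def cost_desc_lookup(card_context, occ_per_context, c1, c2):
    """Cost(d^LU) = Card(context) * (c1 + Occ * c2).

    c1 plays the role of CostForDescLookup() (one index probe per context
    element) and c2 the role of CostForNextDesc() (one step per descendant
    retrieved from the window).
    """
    return card_context * (c1 + occ_per_context * c2)

# Hypothetical inputs: 3.8 context elements, 2 k-descendants each on average,
# a probe costing 10 units and each further descendant costing 1 unit.
example_cost = cost_desc_lookup(card_context=3.8, occ_per_context=2.0, c1=10.0, c2=1.0)
print(round(example_cost, 2))
```

The formula is linear in the context cardinality, so the optimizer can compare it directly against a scan-based alternative whose cost depends mainly on the total number of k elements.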
20
Operator Cardinality Estimation
Card(fp_{/s/r/*/it}) = 1 * Occ(/, /s/r/*/it) = 21750
Card(f) = Card(fp_{/s/r/*/it}) * Sel(/s/r/*/it, /mb/m/to) * … = 3.8
21
The XPA API
22
5 XML Storage Systems and their XPA drivers
[Architecture diagram repeated, with the XML Storage System placed beneath the XPA Driver.]
  The PE-basic native XML storage system: Dewey encoding, 1 B-Tree per tag name
  The RE-basic native XML storage system: Pre/Post/Level encoding, 1 B-Tree per tag name
  The PE-Path native XML storage system: Dewey encoding, 1 B-Tree per tag name, paths B-Tree
  The RE-Path native XML storage system: Pre/Post/Level encoding, 1 B-Tree per tag name, paths B-Tree
  The Edge-RE native XML storage system: Pre/Post/Level encoding, 1 B-Tree for all elements
23
The PE-basic XPA driver
The PE-basic native XML storage system
  Dewey encoding scheme
  XML elements are stored in B-Tree structures, one per tag name
And the corresponding driver
  Element interface: methods that check structural relationships are implemented by applying appropriate comparisons between Dewey keys
  AccessMethods interface: implemented by accessing the B-Tree structures directly
  AccMCostModels interface: cost models for the primitive access methods rely on the properties of B-Trees
  DBStatistics interface: uses the RTN path B-Tree and statistical information regarding node values
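The structural checks on Dewey keys mentioned above reduce to prefix comparisons. The following sketch illustrates this with hypothetical helper names; it is not the driver's actual Element interface, and it assumes Dewey keys written as dot-separated integers.

```python
def parse_dewey(key):
    """Turn a dotted Dewey key such as '1.1.2' into a tuple of ints."""
    return tuple(int(part) for part in key.split("."))

def is_ancestor(a, d):
    """a is a proper ancestor of d when a's key is a strict prefix of d's key."""
    return len(a) < len(d) and d[:len(a)] == a

def is_parent(p, c):
    """p is the parent of c when p's key is c's key minus its last component."""
    return len(p) + 1 == len(c) and c[:-1] == p

def document_order(keys):
    """Tuples compare lexicographically, which matches document order:
    an ancestor sorts before its descendants, and siblings sort by position."""
    return sorted(keys)

b = parse_dewey("1.1")
c = parse_dewey("1.1.2")
f = parse_dewey("1.1.2.1")
other = parse_dewey("1.2")
ordered = document_order([other, f, b, c])
```

Because every check touches only the two keys involved, no B-Tree access is needed to decide a structural relationship between elements that are already in hand.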
24
Four more XPA drivers
The PE-Path XML storage system and driver
  Similar to PE-basic
  Distinct root-to-node paths are stored in a separate index (RTN-paths index)
  getRTNPath() of the Element interface is very cheap due to the RTN-paths index
  Parent() and Ancs() of the AccessMethods interface are very cheap due to the combination of Dewey encoding and the RTN-paths index
The RE-basic XML storage system and driver
  Similar to PE-basic but uses region encoding
The RE-Path XML storage system and driver
  Similar to PE-Path but uses region encoding
  Parent() and Ancs() of the AccessMethods interface are not cheap
The Edge-based RE-Path XML storage system and driver
  Similar to RE-Path but stores all elements in a single B-Tree structure
25
Our Pool of Physical Operators
Four families of physical operators:
  SortMerge-based (SM)
    Traverse two sequences of XML elements, left and right, similar in spirit to the sort-merge join
    Implemented for the d, c, fp, a, p, bp and cs logical operators
  Lookups (LU)
    Search a minimum window of elements for each context element, similar in spirit to the indexed nested loops join
    Implemented for the d, c, fp, a, p, bp and cs logical operators
  Staircase Join
    2 Staircase operators, for d and a
  PathStack
    3 PathStack operators (d, c, fp)
26
Lookup Operators
Novel, efficient algorithms for holistically evaluating forward and backward multi-step paths
  Based on root-to-node filtering
  Buffered-leaping: a new technique for pipelined duplicate elimination and document order preservation
Search a minimum window of elements for each element in the context sequence
  Window: the result of calling the method of the AccessMethods interface of the XPA API (e.g. Descs(), Ancs()) that corresponds to the XPath axis (e.g. descendant, ancestor) for a given context element
27
Example: fp^LU_{/c//f}
The size of chain at any time is very small and upper bounded by the depth of the XML document.
[Figure: example XML tree with root r, b elements b1 to b9, f elements f1 to f17, and intermediate c, d and e elements. State shown next to the tree: rootAnc, contextEl, chain.]
Trace of successive next() calls, grouped by context element:
  Context b1 (next context element b2: b2 is not a descendant of b1)
    window = XPAAPI.Descs(b1, 'f');
    regExprFilter(f1.getRTNPath(), /c//f, 1) = true, so f1 is returned
    regExprFilter(f2.getRTNPath(), /c//f, 1) = false
    regExprFilter(f3.getRTNPath(), /c//f, 1) = true, so f3 is returned
  Context b2 (next context element b3: b3 is not a descendant of b2)
    window = XPAAPI.Descs(b2, 'f');
    regExprFilter(f4.getRTNPath(), /c//f, 1) = false
    regExprFilter(f5.getRTNPath(), /c//f, 1) = true, so f5 is returned
  Context b3 (b5 is a descendant of b3; b7 is a descendant of b3; b9 is not a descendant of b3)
    window = XPAAPI.Descs(b3, 'f');
    f6: descendant of b3 and regExprFilter(f6.getRTNPath(), /c//f, 1) = false; descendant of b5 and regExprFilter(f6.getRTNPath(), /c//f, 3) = false; not a descendant of b7
    f7: descendant of b3 and regExprFilter(f7.getRTNPath(), /c//f, 1) = false; descendant of b5 and regExprFilter(f7.getRTNPath(), /c//f, 3) = true, so f7 is returned
    f8: descendant of b3 and regExprFilter(f8.getRTNPath(), /c//f, 1) = false; not a descendant of b5; not a descendant of b7
    f9, f10, f11: again not reachable from any of b3, b5, b7 via /c//f
    f12: reachable from b7 via /c//f, so f12 is returned
    f13: reachable from b7 via /c//f, so f13 is returned
  Context b9 (next context element is null: the context sequence is exhausted)
    window = XPAAPI.Descs(b9, 'f');
    f16 is not reachable from b9 via /c//f
    f17 is reachable from b9 via /c//f, so f17 is returned
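The root-to-node filtering used in this trace can be illustrated with a short sketch. The talk does not spell out the semantics of regExprFilter, so everything here is an assumption: the function below takes the third argument to be the 0-based level of the context element within the root-to-node path, and accepts a candidate when the part of its path below that level matches the relative path, with / as a child step and // as a descendant step. The paths in the example are hypothetical, not those of the slide's tree.

```python
import re

def path_to_regex(rel_path):
    """Compile a relative path such as '/c//f' into a regex over '/'-joined steps."""
    parts = []
    for axis, name in re.findall(r"(//?)([^/]+)", rel_path):
        step = "[^/]+" if name == "*" else re.escape(name)
        if axis == "//":
            parts.append("(?:/[^/]+)*")  # any number of intermediate steps
        parts.append("/" + step)
    return re.compile("".join(parts))

def reg_expr_filter(rtn_path, rel_path, ctx_level):
    """Assumed semantics: does the path below the context element match rel_path?

    rtn_path  -- root-to-node path of the candidate, e.g. '/r/b/c/f'
    ctx_level -- 0-based position of the context element in that path
    """
    steps = rtn_path.strip("/").split("/")
    below = steps[ctx_level + 1:]               # steps strictly below the context
    suffix = "".join("/" + s for s in below)
    return path_to_regex(rel_path).fullmatch(suffix) is not None

# Hypothetical candidates under a context element b at level 1 (path /r/b).
direct = reg_expr_filter("/r/b/c/f", "/c//f", 1)          # c child, then f below it
deeper = reg_expr_filter("/r/b/c/x/f", "/c//f", 1)        # f further below c
wrong_child = reg_expr_filter("/r/b/d/f", "/c//f", 1)     # first step is d, not c
# The same candidate can fail for an outer context and pass for an inner one.
outer = reg_expr_filter("/r/b/d/b/c/f", "/c//f", 1)
inner = reg_expr_filter("/r/b/d/b/c/f", "/c//f", 3)
```

The point of such a filter is that a whole multi-step path is checked against one string per candidate, with no intermediate elements fetched from storage.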
28
Example: bp^LU_{parent::c/ancestor::b}
[Figure: the same example XML tree (root r, b elements b1 to b9, f elements f1 to f17, intermediate c, d and e elements). State shown next to the tree: contextEl, sortedElements.]
Windows requested, one per context element:
  window = XPAAPI.Ancs(f2, 'b');
  window = XPAAPI.Ancs(f3, 'b');
  window = XPAAPI.Ancs(f5, 'b');
  window = XPAAPI.Ancs(f6, 'b');
  window = XPAAPI.Ancs(f8, 'b');
  window = XPAAPI.Ancs(f11, 'b');
Cheap implementation of Ancs() in the PE-Path driver:
  Dewey(f2) = 1.1.2.1.1
  RTN(f2) = /r/b/c/f => there is a 'b' ancestor b' at level 2
  Dewey(b') = substr(Dewey(f2), …) = 1.1
  RTN(b') = substr(RTN(f2), …) = /r/b
  Ancs() outputs n without actually retrieving b1 from the database. n is the virtual representation of b1, denoted #b1.
Filtering uses the reversed path: reverseOf(parent::c/ancestor::b) = /c//f
  V: regExprFilter(f3.getRTNPath(), /c//f, 1) = true
Relationships checked during the trace (# marks a virtual element, V a check as above):
  f3 is a descendant of b1 (V)
  f5 is not a descendant of b1
  f6 is not a descendant of b2
  f8 is a descendant of b3 (V)
  f11 is a descendant of b3
  Virtual ancestors shown: #b1, #b2, #b3, #b4, #b5, #b7
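The "virtual ancestor" trick can be sketched as follows: given only an element's Dewey key and its root-to-node path, the keys and paths of its ancestors with a given tag are obtained by truncating both, with no database access. The function name and the example values are hypothetical, and the sketch assumes exactly one Dewey component per step of the root-to-node path.

```python
def virtual_ancestors(dewey, rtn_path, tag):
    """Return (dewey, rtn_path) pairs for every proper ancestor named `tag`,
    outermost first, computed purely from the element's own key and path."""
    comps = dewey.split(".")
    steps = rtn_path.strip("/").split("/")
    if len(comps) != len(steps):
        # Assumption of this sketch: one Dewey component per path step.
        raise ValueError("Dewey key and RTN path have different depths")
    found = []
    for depth in range(1, len(steps)):          # proper ancestors only
        if steps[depth - 1] == tag:
            anc_key = ".".join(comps[:depth])   # prefix of the Dewey key
            anc_path = "/" + "/".join(steps[:depth])
            found.append((anc_key, anc_path))
    return found

# Hypothetical element f with key 1.1.2.1 and path /r/b/c/f.
b_ancestors = virtual_ancestors("1.1.2.1", "/r/b/c/f", "b")
# Nested b elements give several virtual ancestors from a single key.
nested_b = virtual_ancestors("1.3.1.2.1", "/r/b/c/b/f", "b")
```

With region encoding this shortcut is unavailable, since an ancestor's (pre, post) values cannot be derived from the descendant's own, which is consistent with the talk's remark that Parent() and Ancs() are not cheap in the RE-Path driver.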
29
SM Operators
Inspired by sort-merge join algorithms
Traverse two sequences of elements, left and right
  left: the context sequence (the input sequence)
  right: always consists of all the elements of the requested tag name
Keeping track of the current elements on left and right, try to find matching pairs according to the appropriate navigation axis and condition
Novel techniques for holistic SM-based forward path and backward path operators with guaranteed low memory requirements
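A merge-style descendant step over two document-ordered sequences can be sketched as below. This is a simplified illustration under assumed region encoding (start, end), not the authors' SM algorithm: it exploits the fact that regions nest properly, so it only needs to remember the largest end value among the context elements already passed.

```python
from collections import namedtuple

# Hypothetical region-encoded element: d is a descendant of n
# exactly when n.start < d.start and d.end < n.end.
Elem = namedtuple("Elem", "tag start end")

def sm_descendant(left, right):
    """Single merge pass over left (context) and right (all elements of the
    requested tag), both sorted by start. Emits each matching right element
    once, in document order, using constant extra memory."""
    result = []
    i = 0
    max_end = -1                      # furthest end among contexts opened so far
    for d in right:
        # Advance over every context element that opens before d.
        while i < len(left) and left[i].start < d.start:
            max_end = max(max_end, left[i].end)
            i += 1
        # Regions nest, so d lies inside some passed context iff it closes
        # before the furthest-reaching one does.
        if d.end < max_end:
            result.append(d)
    return result

# Same tiny hypothetical document as a region-encoded list.
b1, b2, b3 = Elem("b", 2, 7), Elem("b", 3, 6), Elem("b", 10, 11)
f1, f2 = Elem("f", 4, 5), Elem("f", 8, 9)
merged = sm_descendant([b1, b2, b3], [f1, f2])
```

Unlike a lookup plan, the cost here is governed by the lengths of both sequences regardless of how many matches exist, so a merge tends to win when the context sequence is large relative to the right sequence.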
30
Performance Comparison
32
Sensitivity to context selectivity
[Plots, one per logical operator: descendant, ancestor, forward path.]
33
Conclusions I
Novel techniques for evaluating forward and backward multi-step paths
  Pipelined duplicate elimination and document order preservation
  Lookup fp, Lookup bp, Lookup cs, SM fp, SM bp, SM cs
Fast backward navigation that fully exploits the capabilities of the underlying storage system
Algorithms perform well across a variety of different physical storage models
First steps towards building cost models for XPath
34
Conclusions II
Operator-based XPath processing provides significant optimization opportunities
Different implementations of logical operators can provide benefits in different circumstances
  E.g. context selectivity
Query plans can be much more efficient than (existing) monolithic (twig) techniques in most circumstances
35
Related Work
Many different XPath processing techniques
  Structural joins
    Variations of the sort-merge and index nested loops joins
    MPMGJN, [Al-Khalifa 2003], [Chien 2002], [Jiang 2003], Staircase Join
  Holistic path or twig processing
    PathStack, Twig2Stack, TJFast, M(k)-index, F&B-index
Inefficient evaluation of backward axes
Difficult to combine
  End-to-end techniques
  Based on special data structures and/or indexes
No cost models have been provided
36
Thank you!