1
Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
University of Toronto
2
VLDB 2005: Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
Motivation
- Growing importance of XML query processing
- Plethora of implementations:
  - native XML DBMSs (e.g. Timber, Niagara, BEA/XQRL, Natix, ToX)
  - XQuery systems (e.g. Galax, IPSI-SQ, XSM, MS-XQuery)
  - XPath processors (e.g. XSQ, SPEX, XPush, Xalan, PathStack)
  - publish/subscribe systems (e.g. Y-Filter, IndexFilter, WebFilter, NiagaraCQ)
  - twig query processors (e.g. TwigStack, PRIX, TurboXPath)
- Our contribution: apply novel cost-based optimization techniques that exploit path summaries to XML query processing
3
Example XQuery and Pattern Tree
The query below corresponds to a Pattern Tree (PT), also called a Twig Query:
    for $x in document("catalog.xml")//item,
        $y in document("parts.xml")//part,
        $z in document("supplier.xml")//supplier
    where $x/part_no = $y/part_no
      and $z/supplier_no = $x/supplier_no
      and $z/city = "Toronto"
      and $z/province = "Ontario"
    return {$x/part_no} {$x/price} {$y/description}
4
Example XQuery Processing
    for $x in document("catalog.xml")//item,
        $y in document("parts.xml")//part,
        $z in document("supplier.xml")//supplier
    where $x/part_no = $y/part_no
      and $z/supplier_no = $x/supplier_no
      and $z/city = "Toronto"
      and $z/province = "Ontario"
    return {$x/part_no} {$x/price} {$y/description}
[Figure: the three pattern trees connected by the joins $x = $y and $z = $x]
5
Contributions
- Holistic Path Summary Pruning
- Access Order Selection
- Path Summaries as Catalogs
6
Outline
- Introduction
- Path Summaries in the Optimizer
- Holistic Path Summary Pruning
  - Experimental Evaluation
- Access Order Selection
  - Experimental Evaluation
- Future Work
7
ToXin Path Summary
- For each distinct path in the document there is a path in ToXin
- ToXin is an exact path summary: it reflects the structure of the document [RM01]
- Initially proposed as a back-end: it can answer any pattern query
[Figure: sample data (1001 Magna Toronto ON; 1002 MEC Vancouver BC) shown with the TT and TI parts of its ToXin summary]
8
Augmented ToXin Trees
- System catalog: schema + data statistics
- DTD and XML Schema are used for validation; they do not describe the actual schema of the instances
- ToXin is an exact path summary, so it captures the actual schema
- ToXin augmented with statistics serves as the system catalog
- ToXop statistics:
  - NCARD: number of instances of an element
  - ICARD: number of distinct values of an element
  - Fan-out: average number of sub-element instances for each sub-element
- Augmented ToXin Tree: existing schema (TT) + statistics + node instances (TI)
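As a rough illustration of the catalog idea above, the following Python sketch builds a path summary in one pass over a document and attaches NCARD, ICARD and fan-out to every distinct path. It is not the ToXin implementation: the function name `build_path_summary`, the sample element names, and the reading of fan-out as "instances per parent instance" are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; element names are illustrative only.
DOC = (
    "<suppliers>"
    "<supplier><supplier_no>1001</supplier_no><name>Magna</name>"
    "<city>Toronto</city><province>ON</province></supplier>"
    "<supplier><supplier_no>1002</supplier_no><name>MEC</name>"
    "<city>Vancouver</city><province>BC</province></supplier>"
    "</suppliers>"
)

def build_path_summary(xml_text):
    """Return {path: statistics} for every distinct root-to-element path."""
    ncard = defaultdict(int)    # NCARD: number of instances per path
    values = defaultdict(set)   # distinct text values per path (for ICARD)

    def visit(elem, parent_path):
        path = parent_path + "/" + elem.tag
        ncard[path] += 1
        text = (elem.text or "").strip()
        if text:
            values[path].add(text)
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")

    summary = {}
    for path, n in ncard.items():
        parent = path.rsplit("/", 1)[0]
        parent_n = ncard.get(parent, 1)  # the root has no parent
        summary[path] = {
            "NCARD": n,
            "ICARD": len(values[path]),
            # Assumed definition: average instances per parent instance.
            "fanout": n / parent_n,
        }
    return summary

SUMMARY = build_path_summary(DOC)
```

Here `SUMMARY` holds one entry per distinct path (six for the sample document), which plays the role of the schema (TT) part, with the statistics stored alongside each path.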
10
Holistic Path Summary Pruning
- All path summary based query processors perform some path summary pruning specific to the processor
- Idea: separate path pruning from the processor and the encoding
- Holistic Path Summary Pruning (HPSP):
  - evaluate the pattern tree on the actual schema (TT tree)
  - compute the twig query using an appropriate algorithm for the particular element encoding
- TwigStackScan is one possible HPSP-based access method
[Figure: TT and TI trees]
11
Stack Algorithms
- Stack algorithms: PathStack, TwigStack, TwigStackXB [BSK02]
- Use region algebra encoding:
  - T_element: [DocID, Term, StartPos, EndPos, LevelNum] (elements)
  - T_text: [DocID, Term, TextValue, StartPos, LevelNum] (string values)
- Build a stream (denoted T) for all elements having the same label, e.g. T_author encompasses all author elements in the document
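To make the region encoding concrete, here is a small Python sketch that numbers element start and end positions in document order, groups the resulting tuples into one stream per label, and tests structural relationships by comparing positions. The numbering scheme (one counter incremented on entry and on exit), the names `Region`, `encode`, `is_ancestor` and `is_parent`, and the sample document are assumptions for illustration; text nodes are left out for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import NamedTuple

class Region(NamedTuple):
    doc_id: int
    term: str
    start: int
    end: int
    level: int

def encode(xml_text, doc_id=1):
    """Return {label: [Region, ...]} with each stream sorted by StartPos."""
    streams = defaultdict(list)
    pos = 0

    def visit(elem, level):
        nonlocal pos
        pos += 1                      # position of the opening tag
        start = pos
        for child in elem:
            visit(child, level + 1)
        pos += 1                      # position of the closing tag
        streams[elem.tag].append(Region(doc_id, elem.tag, start, pos, level))

    visit(ET.fromstring(xml_text), 1)
    for stream in streams.values():
        stream.sort(key=lambda r: r.start)
    return dict(streams)

def is_ancestor(a, d):
    """a contains d when d's region is strictly nested inside a's region."""
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end

def is_parent(a, d):
    """Parent-child is containment plus a level difference of one."""
    return is_ancestor(a, d) and d.level == a.level + 1

# Hypothetical sample document.
DOC = "<book><author><name>x</name></author><author/></book>"
STREAMS = encode(DOC)
```

With this encoding, ancestor-descendant and parent-child tests need only the tuples themselves, which is what lets stack-based algorithms work directly on the sorted streams.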
12
TwigStackScan Access Method
- Extended region algebra encoding:
  - T_element: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID] (elements)
  - T_text: [DocID, Term, TextValue, StartPos, LevelNum, TTnodeID] (string values)
- TwigStackScan = HPSP + TwigStack
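The pruning half of "HPSP + TwigStack" can be sketched as follows: match the query against the path summary first, collect the TT node IDs that can take part in a match for each query step, and then drop every stream entry whose TTnodeID is not in that set before any join algorithm runs. This Python sketch handles only linear path patterns (no branches) and does not implement TwigStack itself; the function names, the step representation, and the sample TT paths are hypothetical.

```python
import re

def steps_to_regex(steps):
    """Compile a list of (axis, label) steps into a regex over path strings."""
    parts = []
    for axis, label in steps:
        gap = "(?:/[^/]+)*" if axis == "//" else ""   # "//" skips any levels
        parts.append(gap + "/" + re.escape(label))
    return re.compile("".join(parts))

def prune_tt_nodes(tt_paths, steps):
    """For each query step, return the TT node IDs that occur in a full match."""
    allowed = []
    for i in range(1, len(steps) + 1):
        prefix = steps_to_regex(steps[:i])
        rest = steps_to_regex(steps[i:])
        allowed.append({
            tt_id
            for tt_id, p in tt_paths.items()
            if prefix.fullmatch(p)
            and any(q.startswith(p) and rest.fullmatch(q[len(p):])
                    for q in tt_paths.values())
        })
    return allowed

def filter_stream(stream, allowed_ids):
    """Keep only entries whose last field (TTnodeID) survived pruning."""
    return [entry for entry in stream if entry[-1] in allowed_ids]

# Hypothetical TT tree: TTnodeID -> path from the root.
TT_PATHS = {
    1: "/site",
    2: "/site/people",
    3: "/site/people/person",
    4: "/site/people/person/@id",
    5: "/site/people/person/name",
    6: "/site/categories",
    7: "/site/categories/category",
    8: "/site/categories/category/@id",
    9: "/site/categories/category/name",
}
QUERY = [("//", "site"), ("/", "people"), ("/", "person"), ("/", "name")]
ALLOWED = prune_tt_nodes(TT_PATHS, QUERY)

# Entries are (StartPos, EndPos, LevelNum, TTnodeID); values are made up.
NAME_STREAM = [(10, 13, 4, 5), (40, 43, 4, 9), (50, 53, 4, 5)]
PRUNED = filter_stream(NAME_STREAM, ALLOWED[-1])
```

In this example the `name` stream loses the entry that belongs to `category/name`, so the join algorithm never sees elements that cannot contribute to a result.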
13
Experimental Datasets
| Dataset | Size (MB) | # of Elements | # of Attributes | # of Text | Total # of Nodes | # of TT nodes | Max depth |
|---|---|---|---|---|---|---|---|
| DBLP | 130.726 | 3,332,130 | 404,276 | 3,005,848 | 6,742,254 | 224 | 6 |
| SWISSPROT | 112.130 | 2,977,031 | 2,189,859 | 2,013,844 | 7,180,734 | 303 | 5 |
| XMARK (1.9) | 112.486 | 2,769,710 | 726,783 | 1,478,252 | 4,974,745 | 358 | 10 |
- DBLP, SWISSPROT: University of Washington XML Repository
  - Both are large (millions of nodes) and shallow
  - DBLP: regular in structure (5 structures that repeat)
  - SWISSPROT: irregular in structure (many one-of-a-kind structures)
- XMARK: simulates an on-line auction site
  - xmlgen factors from 0.01 (0.6 MB) to 2.8 (165.9 MB)
  - removed the content of 'Text' elements: 30% reduction in size
14
TwigStackScan Scale-Up
[Figures: Q7 scale-up with (XMARK) file size; TwigStackScan speedup with (XMARK) file size]
- Q7: //site/people/person[@id = "person0"]/name (1 twig match; @id occurs in person, category, item, open_auction)
- Q8: //site/people/person/name (38,760 twig matches)
- When applicable, TwigStackScan yields improvements of one order of magnitude
15
TwigStackScan vs. TwigStack
| Query | XPath | Dataset | TwigStack (ms) | TwigStackScan (ms) | Speedup |
|---|---|---|---|---|---|
| Q1 | //inproceedings[./author="Jim Gray"][./year="1990"]/@key | DBLP | 7,108 | 4,779 | 1.49 |
| Q2 | //www[./editor]/url | DBLP | 3,015 | 40 | 75.38 |
| Q3 | //book/author[text() = "C.J. Date"] | DBLP | 430 | 48 | 8.96 |
| Q4 | //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"] | SWISSPROT | 188 | 183 | 1.03 |
| Q5 | //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr] | SWISSPROT | 6,430 | 752 | 8.55 |
| Q6 | //Entry[./Org="Piroplasmida"]//Author | SWISSPROT | 6,687 | 6,891 | 0.97 |
| Q7 | //site/people/person[@id = "person0"]/name | XMARK | 699 | 119 | 5.87 |
| Q8 | //site/people/person/name | XMARK | 5,442 | 3,804 | 1.43 |
| Q9 | //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name | XMARK | 8,326 | 470 | 17.71 |
| Q10 | //person[@id = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,167 | 124 | 9.41 |
| Q11 | //person[@id AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 4,493 | 2,520 | 1.78 |
- High selectivity twig queries (Q1, Q4, Q6, Q7): speedup 0.97 to 5.87
- Low selectivity twig queries (Q8, Q11): speedup 1.43 to 1.78
- Scattered twig matches (Q2, Q3, Q5, Q9), grouped twig matches (Q10): speedup 8.55 to 75.38
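The Speedup column above appears to be the ratio of the two running times. A short Python check on a few rows, using the timings as read from the table (the dictionary and function names are chosen for this example only):

```python
# (TwigStack ms, TwigStackScan ms, reported speedup), as read from the table.
TIMINGS_MS = {
    "Q2": (3015, 40, 75.38),
    "Q5": (6430, 752, 8.55),
    "Q6": (6687, 6891, 0.97),
    "Q9": (8326, 470, 17.71),
}

def speedup(twigstack_ms, twigstackscan_ms):
    """Speedup of TwigStackScan relative to TwigStack (above 1 means faster)."""
    return twigstack_ms / twigstackscan_ms

# Difference between the recomputed ratio and the reported value, per query.
DEVIATION = {
    q: abs(speedup(base, scan) - reported)
    for q, (base, scan, reported) in TIMINGS_MS.items()
}
```

A value below 1, as for Q6, means TwigStackScan was slightly slower than TwigStack on that query.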
17
Order Selection in Pattern Trees
1. Order Selection: the order in which to evaluate the branches
2. Direction Selection: how to evaluate a branch, top-down or bottom-up
- Choosing between top-down and bottom-up is computationally extremely expensive: in the LORE optimizer [McW99], a document with 7 levels yields millions of possible plans
18
ToXinScan Access Method
- Relational optimizers compute a GOOD plan, not THE BEST plan
- Similarly, we use data statistics and heuristics to compute a good plan
- The access-order selection strategy:
  1. Sort the children according to parent selectivity
  2. Evaluate the path with the highest selectivity using a bottom-up evaluation
  3. Evaluate all other paths, in selectivity order, using a top-down evaluation
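The three-step strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: selectivity is taken to be the estimated fraction of instances that qualify (smaller means more selective), an equality predicate is estimated as 1/ICARD in the style of classical relational optimizers, and the branch names and ICARD values are made up. None of these details are confirmed by the slides.

```python
def equality_selectivity(icard):
    """Assumed estimate: an equality predicate keeps 1/ICARD of the instances."""
    return 1.0 / icard if icard > 0 else 1.0

def plan_access_order(branch_selectivity):
    """Order branches from most to least selective and pick a direction.

    branch_selectivity maps a branch name to its estimated qualifying
    fraction. The most selective branch is evaluated bottom-up; all
    remaining branches are evaluated top-down in selectivity order.
    """
    ordered = sorted(branch_selectivity, key=branch_selectivity.get)
    return [
        (branch, "bottom-up" if i == 0 else "top-down")
        for i, branch in enumerate(ordered)
    ]

# Hypothetical ICARD values for three branches of a supplier pattern tree.
ICARD = {"supplier_no": 1000, "city": 50, "province": 10}
PLAN = plan_access_order({b: equality_selectivity(n) for b, n in ICARD.items()})
```

With these made-up statistics the plan starts bottom-up from `supplier_no` and then checks `city` and `province` top-down, so only one ordering is considered instead of enumerating every combination of order and direction.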
19
ToXinScan Scale-Up
[Figure: speedup of ToXinScan vs. TwigStack with (XMARK) file size]
- Q8: //site/people/person[@id = "person0"]/name (1 twig match)
- Q9: //site/people/person/name (38,760 twig matches)
- Q10: //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name (8 twig matches)
- Two orders of magnitude improvement over TwigStack
20
ToXinScan vs. TwigStack
| Query | XPath | Dataset | TwigStack (ms) | ToXinScan (ms) | TwigStack/ToXinScan |
|---|---|---|---|---|---|
| Q1 | //inproceedings[./author="Jim Gray"][./year="1990"]/@key | DBLP | 7,108 | 130 | 54.68 |
| Q2 | //www[./editor]/url | DBLP | 3,015 | 39 | 77.31 |
| Q3 | //book/author[text() = "C.J. Date"] | DBLP | 386 | 90 | 4.29 |
| Q4 | //inproceedings[./title/text() = "Semantic Analysis Patterns."]/author | DBLP | 430 | 46 | 9.35 |
| Q5 | //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"] | SWISSPROT | 188 | 87 | 2.16 |
| Q6 | //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr] | SWISSPROT | 6,430 | 80 | 80.37 |
| Q7 | //Entry[./Org="Piroplasmida"]//Author | SWISSPROT | 6,687 | 131 | 51.05 |
| Q8 | //site/people/person[@id = "person0"]/name | XMARK | 699 | 75 | 9.32 |
| Q9 | //site/people/person/name | XMARK | 5,442 | 95 | 57.28 |
| Q10 | //regions/samerica/item[./location = "United States" AND ./@id AND ./quantity AND ./payment]/name | XMARK | 8,326 | 68 | 122.44 |
| Q11 | //person[@id = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,167 | 90 | 12.97 |
| Q12 | //person[@id = "person20125" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,816 | 92 | 19.74 |
| Q13 | //person[@id = "person48027" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 2,746 | 95 | 28.80 |
| Q14 | //person[@id AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 4,493 | 93 | 48.31 |
- High selectivity twig queries (Q3, Q4, Q5, Q8): speedup 2.16 to 9.32
- Grouped twig matches (Q11, Q12, Q13): speedup 12.97 to 28.80
- Low selectivity (Q2, Q9, Q10, Q14), scattered twig matches (Q1, Q6, Q7): speedup 48.31 to 122.44
21
ToXinScan vs. Heavier Indexes
- Pattern indexes (such as PRIX [RM04], ViST [WPF+03]) are the best twig-query processors
- Indexes are expensive to build (three passes over the document) and require extensive space
  - ViST uses O(SH) space, where S is the number of sequences and H is the height of the tree
- Indexes outperform TwigStack by two orders of magnitude
- Good news:
  - using path summaries and the presented optimization strategy, we achieve the same performance improvements as node indexes
  - path summaries are inexpensive to build (one pass over the document)
23
Future Work
- Generalize based on the strategy derived from the TwigStackScan access method
  - Holistic Path Summary Pruning (HPSP) can be used in conjunction with any twig query evaluation method
  - Can be used with path summaries other than ToXin
- ToXinScan
  - Add a generalized cost model for access methods
  - Enhance the XML statistics used
- Propose benchmarks for XML access methods
24
Thank you for your attention!
Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
{atibarta, consens, mendel}@cs.toronto.edu
University of Toronto
25
ToXinScan vs. PRIX
| Query | XPath | Dataset | TwigStack/ToXinScan | TwigStack/PRIX [RMo03, RMo04] |
|---|---|---|---|---|
| Q1 | //inproceedings[./author="Jim Gray"][./year="1990"]/@key | DBLP | 54.68 | 14.01 |
| Q2 | //www[./editor]/url | DBLP | 77.31 | 145.00 |
| Q6 | //Entry[PFAM[@prim_id="PF00304"]][.//DISULFID/Descr] | SWISSPROT | 80.37 | 43.15 |
- Good news: node indexes (e.g. PRIX) are computationally expensive to build (three passes over the document), while path summaries are inexpensive to build (one pass over the document)
References
[RMo03] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prüfer Sequences", Technical Report TR-03-06, Univ. of Arizona, Tucson, 2003
[RMo04] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prüfer Sequences", Proceedings of the 2004 International Conference on Data Engineering, Boston, MA, 2004