Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods
Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
University of Toronto

Presentation transcript:


VLDB 2005: Benefits of Path Summaries in an XML Query Optimizer Supporting Multiple Access Methods

Motivation

Growing importance of XML query processing.

Plethora of implementations:
- native XML DBMSs (e.g. Timber, Niagara, BEA/XQRL, Natix, ToX)
- XQuery systems (e.g. Galax, IPSI-SQ, XSM, MS-XQuery)
- XPath processors (e.g. XSQ, SPEX, XPush, Xalan, PathStack)
- publish/subscribe systems (e.g. Y-Filter, IndexFilter, WebFilter, NiagaraCQ)
- twig query processors (e.g. TwigStack, PRIX, TurboXPath)

Our contribution: apply novel cost-based optimization techniques that exploit path summaries to XML query processing.

Example XQuery and Pattern Tree

    for $x in document("catalog.xml")//item,
        $y in document("parts.xml")//part,
        $z in document("supplier.xml")//supplier
    where $x/part_no = $y/part_no
      and $z/supplier_no = $x/supplier_no
      and $z/city = "Toronto"
      and $z/province = "Ontario"
    return {$x/part_no} {$x/price} {$y/description}

Figure: Pattern Tree (PT) or Twig Query for the query above.

Example XQuery Processing

The same query is shown again, with the joins between the bound variables marked in the figure: $x = $y (on part_no) and $z = $x (on supplier_no).

Contributions

- Holistic Path Summary Pruning
- Access Order Selection
- Path Summaries as Catalogs

Outline

- Introduction
- Path Summaries in the Optimizer
- Holistic Path Summary Pruning
- Experimental Evaluation
- Access Order Selection
- Experimental Evaluation
- Future Work

ToXin Path Summary

- For each distinct path in the document there is a path in ToXin: it is an exact path summary that reflects the structure of the document [RM01].
- Initially proposed as a back-end: it can answer any pattern query.

Figure: example document with two entries (1001, Magna, Toronto, ON; 1002, MEC, Vancouver, BC) and its ToXin structures, labeled TT and TI.
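As a rough illustration of what an exact path summary records, the sketch below collects the distinct root-to-element label paths of a small document in a single traversal. The sample document reuses the values from the slide, but its element names and the function `build_path_summary` are assumptions made for illustration; this is not ToXin's actual data structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element names are assumed, values come from the slide.
DOC = """
<suppliers>
  <supplier><supplier_no>1001</supplier_no><name>Magna</name>
    <city>Toronto</city><province>ON</province></supplier>
  <supplier><supplier_no>1002</supplier_no><name>MEC</name>
    <city>Vancouver</city><province>BC</province></supplier>
</suppliers>
"""

def build_path_summary(root):
    """Map each distinct root-to-element label path to the elements reached by it."""
    summary = {}
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        summary.setdefault(path, []).append(node)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return summary

summary = build_path_summary(ET.fromstring(DOC))
for path in sorted(summary):
    # One line per distinct path, with the number of instances on that path.
    print(path, len(summary[path]))
```

Two supplier elements share one summary path each, so the summary stays small even when the document grows with more suppliers of the same shape.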

Augmented ToXin Trees

System catalog: schema + data statistics.

- DTD and XML Schema are used for validation; they do not describe the actual schema of the instances.
- ToXin is an exact path summary, so it gives the actual schema.
- ToXin augmented with statistics serves as the system catalog.

ToXop statistics:
- NCARD: number of instances of an element
- ICARD: number of distinct values of an element
- Fan-out: average number of sub-element instances for each sub-element

Augmented ToXin Tree: existing schema (TT) + statistics + node instances (TI).
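The three statistics can be gathered in the same pass that builds the summary. The following is a minimal sketch under stated assumptions: the document, the function `path_statistics`, and the reading of fan-out as "instances on a path divided by instances on its parent path" are illustrative choices, not the definitions used by ToXop.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document used only to exercise the statistics.
DOC = """<catalog>
  <item><part_no>7</part_no><price>10</price><price>12</price></item>
  <item><part_no>7</part_no><price>10</price></item>
</catalog>"""

def path_statistics(root):
    """Return (NCARD, ICARD, fan-out) dictionaries keyed by label path."""
    ncard = defaultdict(int)    # path -> number of element instances
    values = defaultdict(set)   # path -> distinct text values seen
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        ncard[path] += 1
        text = (node.text or "").strip()
        if text:
            values[path].add(text)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    icard = {p: len(values[p]) for p in ncard}
    # Assumed reading of fan-out: average instances per parent instance.
    fanout = {}
    for p in ncard:
        parent = p.rsplit("/", 1)[0]
        if parent:  # the root path has no parent
            fanout[p] = ncard[p] / ncard[parent]
    return dict(ncard), icard, fanout

ncard, icard, fanout = path_statistics(ET.fromstring(DOC))
print(ncard["/catalog/item/price"], icard["/catalog/item/price"],
      fanout["/catalog/item/price"])
```

An optimizer can then estimate, for example, how many price elements a scan of all item elements will touch without reading the data itself.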

Outline (next: Holistic Path Summary Pruning)

Holistic Path Summary Pruning

- All path-summary-based query processors perform some path summary pruning that is specific to the processor.
- Idea: separate path pruning from the processor and from the encoding.
- Holistic Path Summary Pruning (HPSP):
  1. evaluate the pattern tree on the actual schema (TT tree)
  2. compute the twig query using an appropriate algorithm for the particular element encoding
- TwigStackScan is one possible HPSP-based access method.
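Step 1 of HPSP can be pictured as matching the pattern tree against the summary paths only, before any element data is read. The sketch below is one simplified reading of that step: it handles a root step with child branches, uses regular expressions over label paths, and works on a made-up summary. The names `pattern_to_regex`, `hpsp`, and `SUMMARY`, as well as the node IDs, are hypothetical.

```python
import re

# Hypothetical path summary: TT node ID -> label path.
SUMMARY = {
    1: "/site",
    2: "/site/regions",
    3: "/site/regions/samerica",
    4: "/site/regions/samerica/item",
    5: "/site/regions/samerica/item/location",
    6: "/site/regions/samerica/item/payment",
    7: "/site/regions/samerica/item/name",
    8: "/site/regions/europe",
    9: "/site/regions/europe/item",
    10: "/site/regions/europe/item/location",
    11: "/site/regions/europe/item/name",
}

def pattern_to_regex(pattern):
    """Translate a path with / and // axes into a regex over label paths."""
    parts = []
    for axis, label in re.findall(r"(//|/)([^/]+)", pattern):
        prefix = "(?:/[^/]+)*/" if axis == "//" else "/"
        parts.append(prefix + re.escape(label))
    return re.compile("".join(parts))

def hpsp(summary, root_pattern, branch_patterns):
    """Keep summary nodes that match the root and have every branch below them."""
    root_re = pattern_to_regex(root_pattern)
    surviving = {}
    for tt_id, path in summary.items():
        if not root_re.fullmatch(path):
            continue
        matches = {}
        for branch in branch_patterns:
            branch_re = pattern_to_regex(branch)
            ids = [i for i, p in summary.items()
                   if p.startswith(path + "/") and branch_re.fullmatch(p[len(path):])]
            if not ids:
                break  # a required branch never occurs under this node
            matches[branch] = ids
        else:
            surviving[tt_id] = matches
    return surviving

# Twig //item[./location AND ./payment]/name on the hypothetical summary.
result = hpsp(SUMMARY, "//item", ["/location", "/payment", "/name"])
print(result)
```

In this made-up summary the item path under europe has no payment child, so it is pruned before step 2 ever touches its instances.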

Stack Algorithms

Stack algorithms: PathStack, TwigStack, TwigStackXB [BSK02].

They use region algebra encoding:
- T_element: [DocID, Term, StartPos, EndPos, LevelNum] for elements
- T_text: [DocID, Term, TextValue, StartPos, LevelNum] for string values

A stream (denoted T) is built for all elements having the same label; e.g. T_author encompasses all author elements from the document.
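The region encoding makes structural relationships checkable by comparing numbers. The sketch below assigns positions by counting element open and close events (a simplification: text words are not counted), groups the tuples into one stream per label, and tests ancestor and parent relationships. The names `Region`, `encode`, `is_ancestor`, and `is_parent` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict, namedtuple

Region = namedtuple("Region", "doc_id term start end level")

def encode(root, doc_id=1):
    """Return one stream per label, each ordered by start position."""
    streams = defaultdict(list)
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(streams[node.tag])
        streams[node.tag].append(None)  # reserve the slot to keep document order
        for child in node:
            visit(child, level + 1)
        counter += 1
        streams[node.tag][slot] = Region(doc_id, node.tag, start, counter, level)

    visit(root, 1)
    return streams

def is_ancestor(a, d):
    """a contains d when d's region lies strictly inside a's region."""
    return a.doc_id == d.doc_id and a.start < d.start and d.end < a.end

def is_parent(a, d):
    return is_ancestor(a, d) and d.level == a.level + 1

DOC = "<book><title>XML</title><author><name>Date</name></author></book>"
streams = encode(ET.fromstring(DOC))
book, author, name = streams["book"][0], streams["author"][0], streams["name"][0]
print(book, is_ancestor(book, name), is_parent(author, name))
```

Because each stream is sorted by start position, a stack-based join can merge streams in one forward pass, which is the property the stack algorithms rely on.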

TwigStackScan Access Method

Extended region algebra encoding:
- T_element: [DocID, Term, StartPos, EndPos, LevelNum, TTnodeID] for elements
- T_text: [DocID, Term, TextValue, StartPos, LevelNum, TTnodeID] for string values

TwigStackScan = HPSP + TwigStack
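The slides do not spell out how the extra TTnodeID field is consumed, so the following is only one plausible reading: once pruning has produced the set of summary nodes that can contribute to a twig match, each label stream is filtered down to the elements carrying one of those IDs before the stack join runs. `Element`, `prune_streams`, and all IDs and positions below are made up for illustration.

```python
from collections import namedtuple

Element = namedtuple("Element", "doc_id term start end level tt_node_id")

def prune_streams(streams, allowed_tt_ids):
    """Keep only elements whose summary node survived pruning for that label."""
    return {
        term: [e for e in stream if e.tt_node_id in allowed_tt_ids.get(term, set())]
        for term, stream in streams.items()
    }

# Hypothetical stream: name elements occur under person (TT node 7) and item (TT node 12).
streams = {
    "name": [
        Element(1, "name", 10, 11, 4, 7),
        Element(1, "name", 40, 41, 5, 12),
        Element(1, "name", 60, 61, 4, 7),
    ],
}

# For //site/people/person/name, assume pruning kept only TT node 7 for label name.
pruned = prune_streams(streams, {"name": {7}})
print([e.start for e in pruned["name"]])
```

The filtered streams keep their original order, so they can be handed to an unmodified stack join.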

Experimental Datasets

([…] marks digits or values missing from the transcript.)

Dataset     | Size (MB) | # of Elements | # of Attributes | # of Text | Total # of Nodes | # of TT nodes | Max-depth
DBLP        | […]       | […],332,130   | 404,276         | 3,005,848 | 6,742,[…]        | […]           | […]
SWISSPROT   | […]       | […],977,031   | 2,189,859       | 2,013,844 | 7,180,[…]        | […]           | […]
XMARK (1.9) | […]       | […],769,710   | 726,783         | 1,478,252 | 4,974,[…]        | […]           | […]

- DBLP, SWISSPROT: from the University of Washington XML Repository. Both are large (millions of nodes) and shallow.
- DBLP: regular in structure (5 structures that repeat).
- SWISSPROT: irregular in structure (many one-of-a-kind structures).
- XMARK: simulates an on-line auction site. Generated with xmlgen at scale factors from 0.01 (0.6 MB) to 2.8 (165.9 MB). The content of 'Text' elements was removed, giving a 30% reduction in size.

TwigStackScan Scale-Up

Figures: Q7 scale-up with (XMARK) file size; TwigStackScan speedup with (XMARK) file size.

- Q7: […] = "person0"]/name (1 twig match in person, category, item, open_auction)
- Q8: //site/people/person/name (38,760 twig matches)

When applicable, TwigStackScan yields improvements of one order of magnitude.

TwigStackScan vs. TwigStack

Columns: Query | Dataset | TwigStack (ms) | TwigStackScan (ms) | Speedup. Most timing values are missing from the transcript; […] marks lost text or digits.

Q1: //inproceedings[./author="Jim […]
Q2: //www[./editor]/url | DBLP | 3,[…]
Q3: //book/author[text() = "C.J. Date"] | DBLP
Q4: //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"] | SWISSPROT
Q6: //Entry[./Org="Piroplasmida"]//Author | SWISSPROT | 6,687 | 6,[…]
Q7: […] = "person0"]/name | XMARK
Q8: //site/people/person/name | XMARK | 5,442 | 3,[…]
Q9: //regions/samerica/item[./location = "United States" AND ./payment]/name | XMARK | 8,[…]
Q10: […] = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,[…]
Q11: […] AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 4,493 | 2,[…]

Summary:
- High selectivity twig queries (Q1, Q4, Q6, Q7): speedup 0.97 to 5.87
- Low selectivity twig queries (Q8, Q11): speedup 1.43 to 1.78
- Scattered twig matches (Q2, Q3, Q5, Q9), grouped twig matches (Q10): speedup 8.96 to 75.38

Outline (next: Access Order Selection)

Order Selection in Pattern Trees

1. Order Selection: the order in which to evaluate the branches
2. Direction Selection: how to evaluate each branch, top-down or bottom-up

Choosing between top-down and bottom-up is computationally very expensive: in the LORE optimizer [McW99], a document with 7 levels yields millions of possible plans.

ToXinScan Access Method

Relational optimizers compute a GOOD plan, not THE BEST plan. Similarly, we use data statistics and heuristics to compute a good plan.

The access-order selection strategy:
1. Sort the children according to parent selectivity
2. Evaluate the path with the highest selectivity using a bottom-up evaluation
3. Evaluate all other paths, in selectivity order, using a top-down evaluation
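The three-step strategy above can be sketched as a small planning function. This sketch assumes that "highest selectivity" means the branch expected to return the fewest matches and that an estimate per branch is already available from the catalog statistics; the function `plan_access_order` and the numbers are hypothetical, and the real ToXinScan cost estimation is not shown in the slides.

```python
def plan_access_order(estimated_matches):
    """Order branches from most to least selective and assign a direction to each.

    estimated_matches maps a branch expression to its estimated number of
    matching instances (smaller means more selective, by assumption).
    """
    ordered = sorted(estimated_matches, key=lambda branch: estimated_matches[branch])
    plan = []
    for position, branch in enumerate(ordered):
        # Only the most selective branch is evaluated bottom-up.
        direction = "bottom-up" if position == 0 else "top-down"
        plan.append((branch, direction))
    return plan

# Hypothetical estimates for the branches of a person twig.
estimates = {
    './name': 1000,
    './address/city/text() = "Lubbock"': 3,
    './address/country/text() = "United States"': 500,
}
plan = plan_access_order(estimates)
for branch, direction in plan:
    print(direction, branch)
```

Starting bottom-up from the rarest value keeps the set of candidate ancestors small, so the remaining top-down checks run over few nodes.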

ToXinScan Scale-Up

Figure: speedup of ToXinScan vs. TwigStack with (XMARK) file size.

- Q8: […] = "person0"]/name (1 twig match)
- Q9: //site/people/person/name (38,760 twig matches)
- Q10: //regions/samerica/item[./location = "United States" AND ./quantity AND ./payment]/name (8 twig matches)

Two orders of magnitude improvement over TwigStack.

ToXinScan vs. TwigStack

Columns: Query | Dataset | TwigStack (ms) | ToXinScan (ms) | TwigStack/ToXinScan. Most timing values are missing from the transcript; […] marks lost text or digits.

Q1: //inproceedings[./author="Jim Gray"] | […]
Q2: //www[./editor]/url | DBLP | 3,[…]
Q3: //book/author[text() = "C.J. Date"] | DBLP
Q4: //inproceedings[./title/text() = "Semantic Analysis Patterns."]/author | DBLP
Q5: //Entry/Keyword[text() = "Rhizomelic chondrodysplasia punctata"] | SWISSPROT
Q6: […][.//DISULFID/Descr] | SWISSPROT | 6,[…]
Q7: //Entry[./Org="Piroplasmida"]//Author | SWISSPROT | 6,[…]
Q8: […] = "person0"]/name | XMARK
Q9: //site/people/person/name | XMARK | 5,[…]
Q10: //regions/samerica/item[./location = "United States" AND ./quantity AND ./payment]/name | XMARK | 8,[…]
Q11: […] = "person217" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,[…]
Q12: […] = "person20125" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 1,[…]
Q13: […] = "person48027" AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 2,[…]
Q14: […] AND ./address[./city/text() = "Lubbock" AND ./country/text() = "United States"]]/name | XMARK | 4,[…]

Summary:
- High selectivity twig queries (Q3, Q4, Q5, Q8): speedup 2.16 to 9.32
- Grouped twig matches (Q11, Q12, Q13): speedup […] to […]
- Low selectivity (Q2, Q9, Q10, Q14), scattered twig matches (Q1, Q6, Q7): speedup […] to […]

ToXinScan vs. Heavier Indexes

- Pattern indexes (such as PRIX [RM04], ViST [WPF+03]) are the best twig-query processors.
- Indexes are expensive to build (three passes over the document) and require extensive space. ViST uses O(SH) space, where S is the number of sequences and H is the height of the tree.
- Indexes outperform TwigStack by two orders of magnitude.

Good news:
- Using path summaries and the presented optimization strategy, we achieve the same performance improvements as node indexes.
- Path summaries are inexpensive to build (one pass over the document).

Outline (next: Future Work)

Future Work

Generalize based on the strategy derived from the TwigStackScan access method:
- Holistic Path Summary Pruning (HPSP) can be used in conjunction with any twig query evaluation method
- It can be used with path summaries other than ToXin

ToXinScan:
- Add a generalized cost model for access methods
- Enhance the XML statistics used

Propose benchmarks for XML access methods.

Thank you for your attention!

Attila Barta, Mariano P. Consens, Alberto O. Mendelzon
{atibarta, consens, mendel}[…]
University of Toronto

ToXinScan vs. PRIX

Columns: Query | Dataset | TwigStack/ToXinScan | TwigStack/PRIX [RMo03, RMo04]. The ratio values are missing from the transcript; […] marks lost text.

Q1: //inproceedings[./author="Jim Gray"] | DBLP | […] | […]
Q2: //www[./editor]/url | DBLP | […] | […]
[…][.//DISULFID/Descr] | SWISSPROT | […] | […]

Good news: node indexes (e.g. PRIX) are computationally expensive to build (three passes over the document), while path summaries are inexpensive to build (one pass over the document).

References:
[RMo03] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prüfer Sequences", Technical Report TR-03-06, Univ. of Arizona, Tucson, 2003.
[RMo04] Praveen Rao, Bongki Moon, "PRIX: Indexing and Querying XML Using Prüfer Sequences", Proceedings of the 2004 International Conference on Data Engineering, Boston, MA, 2004.