Buffering in Query Evaluation over XML Streams. Ziv Bar-Yossef (Technion), Marcus Fontoura and Vanja Josifovski (IBM Almaden Research Center).


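2 XML Document
[Listing: the example XML document, shown on the slide with line numbers; the markup was lost in extraction. Surviving text content, in document order: three sections titled "Intro", "Results", and "Conclusions", with contents "bla bla bla", "yada yada yada", and "etc etc etc", followed by the paper title "On the Complexity of Database Queries" and the authors Papadimitriou and Yannakakis.]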

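3 XML Document Tree
[Figure: tree of the example document. The root has one child, paper. The paper node has a title ("On the Complexity of Database Queries"), two author nodes ("Papadimitriou", "Yannakakis"), and three section nodes. Each section has id, title, and content nodes; the ids are 1, 2, 3, the titles are "Intro", "Results", "Conclusions", and the contents are "bla bla bla", "yada yada yada", "etc etc etc".]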

4 XPath Queries
[Figure: the example document tree.]
Query (beginning lost in extraction): … = “2” or title = “Intro”]/content

5 XPath Queries
[Figure: the example document tree.]
Query: /paper[title != section/title]/author

6 XPath
- Query = path pattern + predicates
- XPath 2.0
- Forward axes only
- Eval(Q,D): the nodes in D that match Q
- Two modes of XPath evaluation:
  - Full-fledged evaluation: given Q and D, output Eval(Q,D)
  - Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
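The two evaluation modes can be contrasted with a minimal, non-streaming sketch. It uses Python's `xml.etree.ElementTree`, which supports only a limited XPath subset, and a shortened stand-in for the example document; the document text, the query, and the helper names `full_evaluation` and `filtering` are assumptions made for illustration, not part of the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the example document.
DOC = """<paper>
  <section><title>Intro</title><content>bla bla bla</content></section>
  <section><title>Results</title><content>yada yada yada</content></section>
  <title>On the Complexity of Database Queries</title>
</paper>"""

def full_evaluation(root, path):
    """Full-fledged evaluation: return every node that matches the query."""
    return root.findall(path)

def filtering(root, path):
    """Filtering: report only whether at least one node matches."""
    return root.find(path) is not None

root = ET.fromstring(DOC)
query = "./section[title='Intro']/content"

matched_texts = [node.text for node in full_evaluation(root, query)]
has_match = filtering(root, query)
has_summary = filtering(root, "./section[title='Summary']/content")
```

Filtering returns a single bit, while full-fledged evaluation must produce the matching nodes themselves, which is why the two modes can have different buffering requirements.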

7 XML Streams
XML stream: a sequence of SAX events
- startDocument(), endDocument(), startElement(name), endElement(name), text(str)
Why XML streams?
- For transferring XML between systems
- For efficient access to large XML documents
Critical resources
- Memory
- Processing time
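As a rough illustration of what such an event sequence looks like, the sketch below records the callbacks fired by Python's standard `xml.sax` parser. The class name `EventRecorder`, the helper `sax_events`, and the tiny input document are assumptions for illustration; Python names the text callback `characters`, which is recorded here under the slide's name `text`.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Collects SAX callbacks as (event_name, argument) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument", None))

    def endDocument(self):
        self.events.append(("endDocument", None))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # Skip whitespace-only chunks; record the rest as text events.
        if content.strip():
            self.events.append(("text", content.strip()))

def sax_events(xml_text):
    """Parse a document and return the list of recorded events."""
    recorder = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), recorder)
    return recorder.events

events = sax_events("<paper><title>Intro</title></paper>")
```

A streaming algorithm sees these events one at a time and in this order only; anything it may need later has to be kept in memory explicitly.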

8 Streaming XML Algorithms
- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …
All of them use lots of memory on certain queries & documents

9 Memory Bottleneck I: Storage of Large Transition Tables
Framework of most algorithms:
- Q → NFA
- Simulate the NFA by a DFA
- Caveat: exponential blowup
However, the exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]:
- Algorithm for filtering XML streams whose space is linear in the query size
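The general reason a precomputed DFA table can be avoided is that an NFA can be simulated directly by tracking its set of active states. The sketch below is not the cited filtering algorithm; it is a textbook string example, chosen as an assumption for illustration: the language "the n-th symbol from the end is a" has an NFA with n + 1 states, while any DFA for it needs 2^n states.

```python
def nth_from_end_is_a(word, n):
    """Simulate the (n + 1)-state NFA on the fly.

    State 0 loops on every symbol and, on 'a', guesses that this 'a'
    is the n-th symbol from the end. States 1..n-1 advance on any
    symbol. State n is accepting and has no outgoing transitions.
    """
    active = {0}  # at most n + 1 states are ever stored
    for symbol in word:
        following = {0}  # state 0 always stays active
        if symbol == "a":
            following.add(1)  # start a new guess
        for state in active:
            if 1 <= state < n:
                following.add(state + 1)  # advance earlier guesses
        active = following
    return n in active

check_ab = nth_from_end_is_a("ab", 2)
check_ba = nth_from_end_is_a("ba", 2)
check_long = nth_from_end_is_a("bbabbb", 4)
```

The active set never holds more than n + 1 states, so the memory used is linear in the size of the automaton rather than exponential, at the price of more work per input symbol.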

10 Memory Bottleneck II: Buffering of Document Fragments
Scenario 1: buffering nodes, which may or may not be part of the output.
[Figure: the example document tree.]
Query (beginning lost in extraction): … = “2” or title = “Intro”]/content
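Scenario 1 can be sketched on a stream of (event, value) tuples. The code below is a hand-written evaluator for one hypothetical query, /paper/section[title = T]/content, over a hypothetical document in which a section's content arrives before its title; the function name `evaluate`, the query, and the event list are all assumptions for illustration, and text values stand in for output nodes.

```python
def evaluate(events, wanted_title):
    """Return (outputs, peak_buffer) for /paper/section[title=T]/content."""
    outputs, buffered, peak = [], [], 0
    path = []          # names of the currently open elements
    matched = False    # predicate already satisfied for the open section?
    for kind, value in events:
        if kind == "startElement":
            path.append(value)
            if path == ["paper", "section"]:
                matched, buffered = False, []
        elif kind == "text":
            if path == ["paper", "section", "title"] and value == wanted_title:
                matched = True
                outputs.extend(buffered)  # predicate resolved: flush candidates
                buffered = []
            elif path == ["paper", "section", "content"]:
                if matched:
                    outputs.append(value)   # safe to output immediately
                else:
                    buffered.append(value)  # may or may not be output later
                    peak = max(peak, len(buffered))
        elif kind == "endElement":
            if path == ["paper", "section"]:
                buffered = []  # predicate never held: drop the candidates
            path.pop()
    return outputs, peak

def element(name, *children):
    """Build the event list for one element (illustrative helper)."""
    inner = [event for child in children for event in child]
    return [("startElement", name)] + inner + [("endElement", name)]

def text(value):
    return [("text", value)]

stream = element(
    "paper",
    element("section", element("content", text("bla bla bla")),
            element("title", text("Intro"))),
    element("section", element("title", text("Results")),
            element("content", text("yada yada yada"))),
)
outputs, peak = evaluate(stream, "Intro")
```

In the first section the content has to be held until the title is seen; in the second it is held until the section closes and is then discarded.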

11 Memory Bottleneck II: Buffering of Document Fragments
Scenario 2: buffering nodes needed for evaluating pending predicates.
[Figure: the example document tree.]
Query: /paper[title != section/title]/author

12 Memory Bottleneck II: Buffering of Document Fragments
Scenario 3: buffering multiple candidate matches that are nested within each other.
[Figure: a document tree with a root, three a nodes, two b nodes, and two c nodes.]
Query: //a[b and c]
- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
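A minimal sketch of why nested candidates cost memory, for the query //a[b and c] with b and c read as children of a. It keeps one record per open element and counts how many a elements are simultaneously undecided. The function name `count_matches` and the nested test document are assumptions for illustration, not the tree from the slide.

```python
def count_matches(events):
    """Return (matches, peak_pending) for //a[b and c] on a tag stream."""
    open_elems = []  # per open element: [seen_b, seen_c] for <a>, else None
    matches, peak = 0, 0
    for kind, name in events:
        if kind == "startElement":
            parent = open_elems[-1] if open_elems else None
            if parent is not None and name in ("b", "c"):
                parent[0 if name == "b" else 1] = True  # child of an <a>
            open_elems.append([False, False] if name == "a" else None)
            # Open <a> elements whose predicate is still undecided.
            pending = sum(1 for e in open_elems
                          if e is not None and not all(e))
            peak = max(peak, pending)
        elif kind == "endElement":
            record = open_elems.pop()
            if record is not None and all(record):
                matches += 1
    return matches, peak

def tags(*names):
    """Turn 'a', '/a', ... into start/end events (illustrative helper)."""
    return [("endElement", n[1:]) if n.startswith("/") else ("startElement", n)
            for n in names]

# <a><a><a><b/><c/></a><b/></a><c/></a>: three nested <a> elements.
nested = tags("a", "a", "a", "b", "/b", "c", "/c", "/a",
              "b", "/b", "/a", "c", "/c", "/a")
matches, peak_pending = count_matches(nested)
```

With three a elements nested inside each other, three candidates are undecided at once, so the state grows with the recursion depth of the document.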

13 Our Results
Quantitative space lower bounds for:
- Full-fledged evaluation of queries with predicates (Scenario 1)
- Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
Matching upper bound
- Eager evaluation of predicates
In all other scenarios: no buffering required
- Filtering of queries with “univariate” predicates over non-recursive documents is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

14 Related Work
- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

15 Document Concurrency
- $Q$: query
- $D = \sigma_1, \ldots, \sigma_n$: document
  - Each $\sigma_i$ is a SAX event
  - $\sigma^t = (\sigma_1, \ldots, \sigma_t)$
- Definition: $x \in D$ is alive at step $t$ if $x \in \sigma^t$ and $\exists\, \alpha, \beta$ s.t. $x \in \mathrm{Eval}(Q, \sigma^t \alpha)$ and $x \notin \mathrm{Eval}(Q, \sigma^t \beta)$
- t-concurrency(D,Q): number of nodes that are alive at step $t$
- concurrency(D,Q): $\max_t$ t-concurrency(D,Q)

16 Concurrency: Example
[Listing: the example XML document with line numbers 1 to 35; the markup was lost in extraction. On the slide, nodes are marked as alive or dead with respect to the query.]
Query (beginning lost in extraction): … = “2” or title = “Intro”]/content

17 Lower Bound Notions
A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
- Q and D may be “pathological”
- Doesn’t say much about real-world queries/documents
An “ideal” lower bound: for every A, every Q, and every D, A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
- Too good to be true
- A can have D and Q “hard-coded”, and then know the result a priori
- Space of A on D and Q = minimum description length of Q and D

18 Our Lower Bound
Theorem: for every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D’.
- D’ is the same as D, except for a few extra empty nodes with auxiliary names.
The theorem holds only if:
- Q is “star-free”
- D is non-recursive

19 Why isn’t this Obvious?
Reason 1: we want the theorem to work for every Q and D, not only ones with high minimum description length (MDL).
Reason 2:
- Obvious: if x is alive at step t, then A has to remember x
  - Because: A may or may not need to output x
- Not obvious: if x and y are alive at step t, then A has to remember both
  - If x and y are not “independent”, maybe it’s enough to remember just x (or just y)

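20 Proof of Lower Bound
- $C$ = t-concurrency(D,Q)
- $x_1, \ldots, x_C$ = nodes that are alive at step $t$
- Recall: for every $x_i$ there exist $\alpha_i$ and $\beta_i$ s.t.
  - $x_i \in \mathrm{Eval}(Q, \sigma^t \alpha_i)$
  - $x_i \notin \mathrm{Eval}(Q, \sigma^t \beta_i)$
- Lemma: there exist a single $\alpha$ and a single $\beta$ s.t. for all $i$,
  - $x_i \in \mathrm{Eval}(Q, \sigma^t \alpha)$
  - $x_i \notin \mathrm{Eval}(Q, \sigma^t \beta)$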

21 Proof of Lower Bound (cont.)
For every $S \subseteq \{1, \ldots, C\}$ define a document $D_S$:
- $D_S$ is the same as D, except that for every $i \in S$ we “mark” $x_i$
- Marking: an extra empty child with an auxiliary name
- Note: $D_S$ is almost isomorphic to D
A = any algorithm
- Note: from the output of A on $D_S$, one can “reconstruct” the set S.

22 Proof of Lower Bound (cont.)
Consider the state of A at step $t$ when running on $D_S$:
- If the suffix is $\beta$, none of the $x_i$’s should be output, so A could not have output any $x_i$ by step $t$
- If the suffix is $\alpha$, there is no information in the suffix about S, but S can be reconstructed from the output, so the state of A at step $t$ must have all the information about S
Conclusion: space $\geq \Omega(C)$
Actual proof: by one-way communication complexity
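The counting step behind this conclusion can be written out as follows. This is an informal pigeonhole sketch under the assumptions of the slides, not the communication-complexity proof the talk refers to: the documents $D_S$ for different sets $S$ end with the same suffix when every $x_i$ is to be output, nothing has been output before step $t$, and the output determines $S$, so different sets $S$ must leave A in different states at step $t$.

```latex
\begin{align*}
\#\{\text{states of } A \text{ at step } t\}
  &\ge \#\{\, S : S \subseteq \{1, \ldots, C\} \,\} = 2^{C}, \\
\text{space of } A
  &\ge \log_2 2^{C} = C \text{ bits}.
\end{align*}
```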

23 Conclusions
Our contributions:
- Quantitative space lower bounds
  - Full-fledged evaluation of queries with predicates
  - Filtering/full-fledged evaluation of queries with “multi-variate” predicates
- Matching upper bound
Open problems:
- Quantitative lower bounds for XQuery evaluation over streams
- Address larger fragments of XPath