On the Memory Requirements of XPath Evaluation over XML Streams Ziv Bar-Yossef Marcus Fontoura Vanja Josifovski IBM Almaden Research Center.


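Preliminaries: XML
[Figure: example document tree. root (x0) has one child, conference (x1). conference has children name (x2) = PODS, speaker (x3), and speaker (x6). Speaker x3 has children name (x4) = Josifovski and paper_cnt (x5) = 1; speaker x6 has children name (x7) = Fagin and paper_cnt (x8) = 3.]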

Preliminaries: XPath 1.0
Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
[Figure: the query drawn as a tree (root, conference, name = PODS, speaker, paper_cnt > 1, name) next to the example document with nodes x0 to x8.]
Result: { x7 }
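
As a rough illustration of what the query on this slide selects, the following sketch evaluates it by hand over an in-memory copy of the example document. The XML serialization in DOC and the function name evaluate are assumptions made for the example; the slides give only the tree. This is ordinary non-streaming evaluation, not the algorithm presented later.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the slide's example document (x1..x8).
DOC = """
<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>
"""

def evaluate(doc_text):
    """Names selected by /conference[name = PODS]/speaker[paper_cnt > 1]/name."""
    conference = ET.fromstring(doc_text.strip())  # the document element
    if conference.tag != "conference":
        return []
    # Predicate [name = PODS]: some name child has the text "PODS".
    if not any(n.text == "PODS" for n in conference.findall("name")):
        return []
    selected = []
    for speaker in conference.findall("speaker"):
        # Predicate [paper_cnt > 1]: some paper_cnt child exceeds 1.
        if any(int(c.text) > 1 for c in speaker.findall("paper_cnt")):
            selected.extend(n.text for n in speaker.findall("name"))
    return selected

print(evaluate(DOC))  # ['Fagin'], the text of node x7
```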

XML Streams
XML stream: an XML document arriving as a one-way stream.
Critical resources: memory and processing time.
Why XML streams? For transferring XML between systems, and for efficient access to large XML documents.

Streaming XML Algorithms
XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
X-scan [Ives, Levy, and Weld 00]
XMLTK [Avila-Campillo et al 02]
XTrie [Chan et al 02]
SPEX [Olteanu, Kiesling, and Bry 03]
Lazy DFAs [Green et al 03]
The XPush Machine [Gupta and Suciu 03]
XSQ [Peng and Chawathe 03]
TurboXPath [Josifovski, Fontoura, and Barta 04]
…

Our Results
Space lower bounds for evaluating XPath on XML streams.
A streaming XML algorithm that matches the lower bounds on a large fragment of the language and uses space sub-linear in the query size rather than exponential in the query size.

Related Work
Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

Data Complexity [Vardi 82]
Eval(Q, D): evaluation function of a query Q on document D.
Eval_Q(D): evaluation function of a fixed query Q on document D.
Data complexity of Q: complexity of the best algorithm for Eval_Q on the worst D.
Worst-case data complexity: max over Q of the complexity of Eval_Q.
We characterize the data complexity of Eval_Q separately for each Q (not just the worst-case one).

XPath Fragment
1. Queries are subsumption-free.
[Figure: two query trees under root. Not subsumption-free: conference with predicates name = PODS and name != SIGMOD. Subsumption-free: conference with the single predicate name != SIGMOD.]

XPath Fragment (cont.)
2. Queries are univariate.
[Figure: two query trees under root, each on conference with children paper_cnt and author_cnt. Not univariate: paper_cnt < author_cnt. Univariate: each child is compared only with the constant 30 (< 30, > 30).]

XPath Fragment (cont.)
3. Queries consist of conjunctions only.
4. Queries are “star-restricted”.

Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
Theorem 1: For all queries Q in the fragment, stream-space(Eval_Q) = Ω(FrontierSize(Q)).
[Figure: query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name.]
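
The frontier definition can be turned into a short computation. The sketch below reads the definition literally (the node itself, its siblings, and the siblings of its ancestors, not the ancestors themselves) and applies it to the query tree from the figure. The dictionary encoding and the node labels are assumptions made for the example.

```python
# Query tree as {node: [children]}; the labels are illustrative only.
QUERY = {
    "root": ["conference"],
    "conference": ["name", "speaker"],
    "name": [],
    "speaker": ["speaker/name", "paper_cnt"],
    "speaker/name": [],
    "paper_cnt": [],
}

def frontier_size(tree, root):
    """Size of the largest frontier over all nodes of the tree.

    If a node p has frontier size f, each child of p has frontier size
    len(children(p)) + f - 1: the child and its siblings replace p,
    while the siblings of p and of p's ancestors are kept.
    """
    best = 0
    stack = [(root, 1)]  # the root's frontier is just the root itself
    while stack:
        node, size = stack.pop()
        best = max(best, size)
        kids = tree[node]
        for kid in kids:
            stack.append((kid, len(kids) + size - 1))
    return best

print(frontier_size(QUERY, "root"))  # 3
```

Under this reading the largest frontier is reached at either child of speaker: the speaker's name, its paper_cnt, and the conference's name.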

Document Recursion Depth
Definition: recDepth_Q(D): maximum number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
Theorem 2: For all queries Q in the fragment that have at least one “//” node, stream-space(Eval_Q) = Ω(recDepth_Q(D)).
[Figure: query Q with a //part node and child nodes name and number; document D with nodes x0 to x7 containing part elements with name and number children (values Compressor, 12, Refrigerator, 456).]

Document Depth
Definition: depth(D): length of the longest root-to-leaf path.
Theorem 3: For all queries Q in the fragment that have at least one “/” node, stream-space(Eval_Q) = Ω(log depth(D)).
[Figure: document D with nodes x0 to x7 containing part elements with name and number children (values Compressor, 12, Refrigerator, 456).]

New Algorithm
Theorem 4(a): For all queries Q in “Univariate XPath”:
Space: O(|Q| recDepth(D) log depth(D)).
Time: O(|D| |Q| recDepth(D)).
Theorem 4(b): For all queries Q in a subset of our fragment and for non-recursive documents D:
Space: O(FrontierSize(Q) log depth(D)).
Time: O(|D| FrontierSize(Q)).

Proof of Theorem 1
Fragment: “subsumption-free”, “univariate”, conjunctions only, “star-restricted”.
Theorem 1: For all queries Q in the fragment, stream-space(Eval_Q) = Ω(FrontierSize(Q)).
[Figure: query tree for /conference[name = PODS]/speaker[paper_cnt > 1]/name.]

Critical Document
Definition: Document D is critical for query Q if:
(1) D matches Q.
(2) If we remove any node from D, it no longer matches Q.
[Figure: query Q = /conference[name = PODS]/speaker[paper_cnt > 1]/name next to the example document D with nodes x0 to x8.]

Main Lemmas
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Eval_Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D so that FrontierSize(D) = FrontierSize(Q).
Theorem 1: For all queries Q in the fragment, stream-space(Eval_Q) = Ω(FrontierSize(Q)).

One-way Communication Complexity
f: X × Y → Z. Alice holds x, Bob holds y. Alice sends a single message m to Bob, who outputs f(x, y).
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.

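Reduction
A: streaming algorithm for Eval_Q using space S.
Document D is partitioned into a prefix α (held by Alice) and a suffix β (held by Bob). Alice runs A on α and sends state_A(α) to Bob; Bob continues the run on β and outputs Eval_Q(D).
Theorem: stream-space(Eval_Q) ≥ CC(Eval_Q)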
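
The reduction can be made concrete with a toy example. In the sketch below, the streaming algorithm is a stand-in chosen for brevity (it checks whether the token stream contains both a "b" and a "c"); it is not an XPath evaluator, and all function names are assumptions. The point is only that the state after the prefix is the whole message, so a protocol costs no more bits than the algorithm's memory.

```python
# Toy streaming algorithm: the state is a pair (seen_b, seen_c).
INITIAL_STATE = (False, False)

def step(state, token):
    """Consume one token of the stream and return the new state."""
    seen_b, seen_c = state
    return (seen_b or token == "b", seen_c or token == "c")

def run(state, tokens):
    """Run the streaming algorithm from a given state over some tokens."""
    for token in tokens:
        state = step(state, token)
    return state

def alice(prefix):
    # Alice's single message is the algorithm's state after the prefix.
    return run(INITIAL_STATE, prefix)

def bob(message, suffix):
    # Bob resumes from Alice's state and reports the final answer.
    return all(run(message, suffix))

def one_way_protocol(prefix, suffix):
    return bob(alice(prefix), suffix)

print(one_way_protocol(["a", "b"], ["c"]))  # True
print(one_way_protocol(["a", "b"], ["d"]))  # False
```

The protocol's answer always equals the answer of a single uninterrupted run over prefix followed by suffix, which is why a communication lower bound transfers to the algorithm's space.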

Fooling Set Technique
Partitioned document: D_{α,β}, where α is the document prefix and β is the document suffix.
Definition: A set T of partitioned documents is a fooling set for Eval_Q if:
1. All documents in T match Q.
2. For any two distinct documents D_{α,β}, D_{α',β'} in T, either D_{α,β'} does not match Q or D_{α',β} does not match Q.
Theorem: For any fooling set T, CC(Eval_Q) = Ω(log |T|).

Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Eval_Q) = Ω(FS(D)).
[Figure: query Q = /conference[name = PODS]/speaker[paper_cnt > 1]/name next to a critical document D: root (x0), conference (x1), name (x2) = PODS, speaker (x3), name (x4) = Fagin, paper_cnt (x5) = 3.]

Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }.
[Figure: query Q next to the partitioned document D_S built from the critical document D with nodes x0 to x5.]

Proof of Lemma 1 (cont.)
Claim: { D_S : S is a subset of Frontier(D) } is a fooling set.
Consequently, stream-space(Eval_Q) ≥ log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_ST or D_TS does not match Q.

Proof of Claim (example)
S = { x2, x5 }, T = { x4, x5 }.
[Figure: the partitioned documents D_S and D_T, both built from the critical document (root x0, conference x1, name x2 = PODS, speaker x3, name x4 = Fagin, paper_cnt x5 = 3), and the mixed document D_TS. In D_TS the conference name is missing, so D_TS does not match Q.]

Algorithm
Uses the query as an NFA.
Based on three global data structures: pointer array, validation array, level array.
Matches the lower bounds for a fragment of XPath.

Algorithm Example Run
Query: /a[b and c], with query nodes $ (u0), /a (u1), /b (u2), /c (u3).
[Figure sequence: seven snapshots of the pointer, validation, and level arrays as the input XML is read. Entries are shown below as (query node, validation flag, level).]
1. Start of document ($): (a, F, 1). The pointer array has one entry.
2. After <a>: (a, F, 1); new entries (b, F, 2) and (c, F, 2) at index 0 and index 1.
3. After <c>: (b, F, 2), (c, F, 2).
4. After </c>: (b, F, 2), (c, T, 2).
5. After <b>: (b, F, 2), (c, T, 2).
6. After </b>: (b, T, 2), (c, T, 2).
7. After </a>: (a, T, 1). Return TRUE.
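
The example run can be mimicked with a few lines of event-driven code. The sketch below is a hand-written matcher for the single query /a[b and c], not the pointer/validation/level-array algorithm from the slides; it only shows the same pattern of a level counter plus flags that flip to true on end tags. The class and function names are assumptions.

```python
import xml.sax

class MatchAWithBAndC(xml.sax.ContentHandler):
    """Boolean streaming match of /a[b and c] using a level counter and flags."""

    def __init__(self):
        super().__init__()
        self.level = 0                             # depth of the current element
        self.in_a = False                          # document element is <a>
        self.validated = {"b": False, "c": False}  # one flag per predicate
        self.matched = False

    def startElement(self, name, attrs):
        self.level += 1
        if self.level == 1 and name == "a":
            self.in_a = True

    def endElement(self, name):
        if self.level == 2 and self.in_a and name in self.validated:
            self.validated[name] = True            # child of <a> closed: validate
        elif self.level == 1 and self.in_a:
            self.matched = all(self.validated.values())
        self.level -= 1

def matches(xml_text):
    handler = MatchAWithBAndC()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matched

print(matches("<a><c>1</c><b>1</b></a>"))  # True
print(matches("<a><b>1</b></a>"))          # False
```

The memory used is one counter and one flag per predicate, independent of the document length.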

Conclusion: Our Contributions
Space lower bounds on the instance data complexity of XPath on XML streams:
1. In terms of Query Frontier Size
2. In terms of Document Recursion Depth
3. In terms of Document Depth
A streaming XML algorithm:
Matches the lower bounds on a fragment of the language.
Does not use finite-state automata.

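XPath 1.0
Query Q: /conference/name, with query nodes $ (u0), /C (u1), /N (u2).
[Figure: document D, the example tree with nodes x0 to x8, labels abbreviated as C (conference), N (name), S (speaker), P (paper_cnt); values PODS, Josifovski, Fagin, 1, 3.]
Result: { x2 }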

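XPath 1.0
Query Q: /conference//name, with query nodes $ (u0), /C (u1), //N (u2).
[Figure: document D, the example tree with nodes x0 to x8, labels abbreviated as C (conference), N (name), S (speaker), P (paper_cnt); values PODS, Josifovski, Fagin, 1, 3.]
Result: { x2, x4, x7 }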
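
The difference between the child axis (/) and the descendant axis (//) in these two queries can be checked with the limited XPath support in Python's ElementTree. The XML serialization and the synthetic wrapper element standing in for the document root x0 are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the example document; <root> stands in for x0.
DOC = """
<root>
  <conference>
    <name>PODS</name>
    <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
    <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
  </conference>
</root>
"""

root = ET.fromstring(DOC.strip())

# /conference/name: name elements that are children of conference.
child_axis = [n.text for n in root.findall("./conference/name")]

# /conference//name: name elements anywhere below conference.
descendant_axis = [n.text for n in root.findall("./conference//name")]

print(child_axis)       # ['PODS']
print(descendant_axis)  # ['PODS', 'Josifovski', 'Fagin']
```

The three descendant matches correspond to nodes x2, x4, and x7 on the slide.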

Reduction
A: S-space streaming algorithm for Eval_Q. r ≥ 1: integer (r = 6 in the figure).
[Figure: document D split into segments held alternately by Alice and Bob, who pass the states s0, s1, ..., s6 of A back and forth until Eval_Q(D) is output.]
Theorem: S ≥ CC(Eval_Q^r) / r