Published by Kristopher Wickliffe. Modified over 9 years ago.
1
On the Memory Requirements of XPath Evaluation over XML Streams
Ziv Bar-Yossef, Marcus Fontoura, Vanja Josifovski
IBM Almaden Research Center
2
Preliminaries: XML
Example document (node ids in parentheses):
root (x0)
  conference (x1)
    name (x2): PODS
    speaker (x3)
      name (x4): Josifovski
      paper_cnt (x5): 1
    speaker (x6)
      name (x7): Fagin
      paper_cnt (x8): 3
3
Preliminaries: XPath 1.0
Query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
Query tree:
root
  conference
    name (= PODS)
    speaker
      name
      paper_cnt (> 1)
Document: the PODS conference example (nodes x0 to x8), where x7 is the name node of speaker Fagin.
Result: { x7 }
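As a point of reference, the query can be evaluated on the example document in memory, with no streaming constraint at all. The sketch below uses Python's `xml.etree.ElementTree`; the XML serialization of the document and the function name `evaluate` are assumptions made for illustration, and the numeric predicate is checked in Python because ElementTree's XPath subset has no `>` comparison.

```python
import xml.etree.ElementTree as ET

# Assumed serialization of the example document (node ids x0..x8 are not stored).
DOC = """
<conference>
  <name>PODS</name>
  <speaker><name>Josifovski</name><paper_cnt>1</paper_cnt></speaker>
  <speaker><name>Fagin</name><paper_cnt>3</paper_cnt></speaker>
</conference>
"""

def evaluate(xml_text):
    """Evaluate /conference[name = PODS]/speaker[paper_cnt > 1]/name in memory."""
    conference = ET.fromstring(xml_text)
    if conference.tag != "conference":
        return []
    # Predicate on the conference node: some name child equals PODS.
    if not any(n.text == "PODS" for n in conference.findall("name")):
        return []
    selected = []
    for speaker in conference.findall("speaker"):
        # Predicate on the speaker node: some paper_cnt child is greater than 1.
        if any(int(c.text) > 1 for c in speaker.findall("paper_cnt")):
            selected.extend(n.text for n in speaker.findall("name"))
    return selected

print(evaluate(DOC))  # ['Fagin']
```

The single selected node is the name of speaker Fagin, which corresponds to x7 on the slide.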
4
XML Streams
XML stream: an XML document arriving as a one-way stream.
Critical resources:
- Memory
- Processing time
Why XML streams?
- For transferring XML between systems
- For efficient access to large XML documents
5
Streaming XML Algorithms
- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- TurboXPath [Josifovski, Fontoura, and Barta 04]
- …
6
Our Results
- Space lower bounds for evaluating XPath on XML streams
- A streaming XML algorithm
  - Matches the lower bounds on a large fragment of the language
  - Uses space sub-linear in the query size rather than exponential in the query size
7
Related Work
- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
8
Data Complexity [Vardi 82]
- Combined view: the evaluation function takes both the query Q and the document D as input, (Q, D).
- Data view: the query Q is fixed and the evaluation function Q(D) takes only the document D as input.
- Data complexity of Q: complexity of the best algorithm for Q on the worst D.
- Worst-case data complexity: max over Q of (complexity of Q).
We characterize the data complexity of Q separately for each Q (not just the worst-case one).
9
XPath Fragment
1. Queries are subsumption-free
Not subsumption-free: root, conference, with two predicates on name: name = PODS and name != SIGMOD
Subsumption-free: root, conference, with one predicate on name: name != SIGMOD
10
XPath Fragment (cont.)
2. Queries are univariate
Not univariate: root, conference, with a predicate that compares two nodes: paper_cnt < author_cnt
Univariate: root, conference, with each predicate on a single node: paper_cnt < 30, author_cnt > 30
11
XPath Fragment (cont.)
3. Queries consist of conjunctions only
4. Queries are “star-restricted”
12
Query Frontier Size
Definitions:
1. Frontier at u: u, its siblings, and the siblings of its ancestors.
2. FrontierSize(Q): size of the largest frontier.
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
Example query tree:
root
  conference
    name (= PODS)
    speaker
      name
      paper_cnt (> 1)
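The two definitions can be turned into a few lines of code. This is a minimal sketch that follows the wording on the slide literally (the node, its siblings, and the siblings of its ancestors); the nested-tuple encoding of the query tree and the name `frontier_size` are assumptions for illustration, and predicates are ignored because they do not affect the count.

```python
# Query tree as (label, children); labels follow the example query.
QUERY = ("root", [
    ("conference", [
        ("name", []),            # predicate: = PODS
        ("speaker", [
            ("name", []),
            ("paper_cnt", []),   # predicate: > 1
        ]),
    ]),
])

def frontier_size(tree):
    """Size of the largest frontier over all nodes of the tree."""
    def walk(node, size_here):
        _label, children = node
        best = size_here
        for child in children:
            # Frontier at a child: the child and its siblings (len(children) nodes)
            # plus the siblings of its ancestors, which is the parent's frontier
            # without the parent itself (size_here - 1 nodes).
            best = max(best, walk(child, len(children) + size_here - 1))
        return best
    return walk(tree, 1)  # the frontier at the root is the root alone

print(frontier_size(QUERY))  # 3
```

For the example query the largest frontier is reached at either child of speaker: the two speaker children plus the name child of conference.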
13
Document Recursion Depth
Definition: recDepth_Q(D): max number of nodes in D that lie on one root-to-leaf path and “path match” the same node in Q.
Theorem 2: For all queries Q in the fragment that have at least one “//” node, stream-space(Q) = Ω(recDepth_Q(D)).
[Figure: query Q with root, //part and its children name and number; document D with nested part elements (nodes x0 to x7) and the values Compressor, 12, Refrigerator, 456.]
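For a query node of the form //part, every part element path-matches it, so the recursion depth reduces to the largest number of part elements open at the same time. The sketch below computes that special case in one pass with Python's `xml.sax`; the handler class, the function name, and the sample document (built from the values on the slide, with an assumed nesting) are illustrative, and general path matching is not covered.

```python
import xml.sax

class RecursionDepthHandler(xml.sax.ContentHandler):
    """Tracks how many elements with a given tag are open at once."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.open_count = 0   # matching elements on the current root-to-node path
        self.deepest = 0      # largest value of open_count seen so far

    def startElement(self, name, attrs):
        if name == self.tag:
            self.open_count += 1
            self.deepest = max(self.deepest, self.open_count)

    def endElement(self, name):
        if name == self.tag:
            self.open_count -= 1

def recursion_depth(xml_text, tag):
    handler = RecursionDepthHandler(tag)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.deepest

# Assumed nesting: a Compressor part inside a Refrigerator part.
DOC = ("<part><name>Refrigerator</name><number>456</number>"
       "<part><name>Compressor</name><number>12</number></part></part>")

print(recursion_depth(DOC, "part"))  # 2
```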
14
Document Depth
Definition: depth(D): length of the longest root-to-leaf path.
Theorem 3: For all queries Q in the fragment that have at least one “/” node, stream-space(Q) = Ω(log depth(D)).
[Figure: document D with nested part elements (nodes x0 to x7) and the values Compressor, 12, Refrigerator, 456.]
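One way to read the log depth(D) term: an algorithm that needs to know the current level of the stream can keep it in a single counter, and a counter that goes up to depth(D) takes about log depth(D) bits. The sketch below tracks that counter with `xml.sax`; it counts elements on the longest path (counting edges instead would shift the value by a constant), and the names and the sample document are assumptions for illustration, not part of the proof.

```python
import xml.sax

class DepthHandler(xml.sax.ContentHandler):
    """Keeps one integer for the current level and one for the maximum."""

    def __init__(self):
        super().__init__()
        self.level = 0
        self.depth = 0

    def startElement(self, name, attrs):
        self.level += 1
        self.depth = max(self.depth, self.level)

    def endElement(self, name):
        self.level -= 1

def document_depth(xml_text):
    handler = DepthHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.depth

doc = "<part><name>Refrigerator</name><part><name>Compressor</name></part></part>"
d = document_depth(doc)
print(d, d.bit_length())  # 3 2
```

The second printed value is the number of bits needed to store the level counter for this document.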
15
New algorithm
Theorem 4(a): For all queries Q in “Univariate XPath”:
- Space: O(|Q| · recDepth(D) · log depth(D))
- Time: O(|D| · |Q| · recDepth(D))
Theorem 4(b): For all queries Q in a subset of our fragment and for non-recursive documents D:
- Space: O(FrontierSize(Q) · log depth(D))
- Time: O(|D| · FrontierSize(Q))
16
Proof of Theorem 1
Fragment:
- “subsumption-free”
- “univariate”
- Conjunctions only
- “star-restricted”
Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
Example query: /conference[name = PODS]/speaker[paper_cnt > 1]/name
17
Critical Document
Definition: Document D is critical for query Q if:
(1) D matches Q.
(2) If we remove any node from D, it no longer matches Q.
[Figure: query Q, /conference[name = PODS]/speaker[paper_cnt > 1]/name, next to document D, the PODS conference example with nodes x0 to x8.]
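The definition can be checked by brute force on small examples. The sketch below is an illustration under stated assumptions: queries are conjunctive tree patterns with child steps only, matching is Boolean, and "removing a node" is taken to mean removing the node together with its subtree. The classes `Doc` and `Pat` and all function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Doc:
    label: str
    value: Optional[str] = None
    children: List["Doc"] = field(default_factory=list)

@dataclass
class Pat:
    label: str
    pred: Optional[Callable[[str], bool]] = None
    children: List["Pat"] = field(default_factory=list)

def matches(q, d):
    """True if pattern q embeds at document node d (child steps, conjunctions)."""
    if q.label != d.label:
        return False
    if q.pred is not None and (d.value is None or not q.pred(d.value)):
        return False
    return all(any(matches(qc, dc) for dc in d.children) for qc in q.children)

def without(d, target):
    """Copy of d with the subtree rooted at `target` removed (identity test)."""
    return Doc(d.label, d.value,
               [without(c, target) for c in d.children if c is not target])

def nodes_below(d):
    for c in d.children:
        yield c
        yield from nodes_below(c)

def is_critical(q, d):
    if not matches(q, d):
        return False
    return all(not matches(q, without(d, n)) for n in nodes_below(d))

QUERY = Pat("root", children=[Pat("conference", children=[
    Pat("name", lambda v: v == "PODS"),
    Pat("speaker", children=[Pat("name"), Pat("paper_cnt", lambda v: int(v) > 1)]),
])])

def speaker(name, count):
    return Doc("speaker", children=[Doc("name", name), Doc("paper_cnt", count)])

SMALL = Doc("root", children=[Doc("conference", children=[
    Doc("name", "PODS"), speaker("Fagin", "3")])])
FULL = Doc("root", children=[Doc("conference", children=[
    Doc("name", "PODS"), speaker("Josifovski", "1"), speaker("Fagin", "3")])])

print(is_critical(QUERY, SMALL), is_critical(QUERY, FULL))  # True False
```

Under these assumptions the six-node document with only the Fagin speaker is critical, while the nine-node example is not, because the Josifovski speaker can be removed without breaking the match.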
18
Main Lemmas
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FrontierSize(D)).
Lemma 2: For all queries Q in the fragment, there is a critical document D so that FrontierSize(D) = FrontierSize(Q).
Together these give Theorem 1: For all queries Q in the fragment, stream-space(Q) = Ω(FrontierSize(Q)).
19
One-way Communication Complexity
Alice holds x, Bob holds y. Alice sends a single message m to Bob, who outputs f(x, y).
f: (X, Y) → Z
CC(f) = number of communication bits used by the best protocol on the worst-case choice of inputs.
20
Reduction
A: streaming algorithm for Q using space S.
Alice runs A on the document prefix and sends the state of A to Bob. Bob resumes A from that state on the document suffix and outputs Q(D).
Theorem: stream-space(Q) >= CC(Q)
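The reduction can be made concrete with a toy streaming algorithm whose entire memory is an explicit state value: whatever Alice has to send is exactly that state, so the message is never larger than the space the algorithm uses. The sketch below is illustrative only; the event encoding, the toy query /a[b and c], and all names are assumptions.

```python
# Toy streaming algorithm for the Boolean query /a[b and c].
# State: (depth, root_is_a, seen_b, seen_c); events are ("start" | "end", tag).
INITIAL = (0, False, False, False)

def step(state, event):
    depth, root_is_a, seen_b, seen_c = state
    kind, tag = event
    if kind == "end":
        return (depth - 1, root_is_a, seen_b, seen_c)
    depth += 1
    if depth == 1:
        root_is_a = (tag == "a")
    elif depth == 2:
        seen_b = seen_b or tag == "b"
        seen_c = seen_c or tag == "c"
    return (depth, root_is_a, seen_b, seen_c)

def run(events, state=INITIAL):
    for event in events:
        state = step(state, event)
    return state

def accepts(state):
    _depth, root_is_a, seen_b, seen_c = state
    return root_is_a and seen_b and seen_c

def one_way_protocol(prefix, suffix):
    message = run(prefix)                  # Alice: the state is all she sends
    return accepts(run(suffix, message))   # Bob resumes from the received state

STREAM = [("start", "a"), ("start", "c"), ("end", "c"),
          ("start", "b"), ("end", "b"), ("end", "a")]

# Wherever the document is cut, the protocol agrees with a direct run.
for cut in range(len(STREAM) + 1):
    assert one_way_protocol(STREAM[:cut], STREAM[cut:]) == accepts(run(STREAM))
```

Because the protocol is correct for every cut, any lower bound on the one-way communication needed for the query carries over to the size of the state.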
21
Fooling Set Technique
Partitioned document: D = (α, β), where α is the document prefix and β is the document suffix.
Definition: A set T of partitioned documents is a fooling set for Q if:
1. All documents in T match Q.
2. For any two distinct documents (α, β), (α', β') in T, either (α, β') does not match Q or (α', β) does not match Q.
Theorem: For any fooling set T, CC(Q) = Ω(log |T|).
22
Proof of Lemma 1
Lemma 1: For all queries Q in the fragment and any critical document D for Q, stream-space(Q) = Ω(FS(D)).
Query Q: /conference[name = PODS]/speaker[paper_cnt > 1]/name
Critical document D:
root (x0)
  conference (x1)
    name (x2): PODS
    speaker (x3)
      name (x4): Fagin
      paper_cnt (x5): 3
23
Proof of Lemma 1
For each subset S of Frontier(D), define a partitioned document D_S.
Example: S = { x2, x5 }
[Figure: query Q and the partitioned document D_S built from the critical document D (nodes x0 to x5).]
24
Proof of Lemma 1 (cont.)
Claim: { D_S : S is a subset of Frontier(D) } is a fooling set.
Hence stream-space(Q) >= log(2^FS(D)) = FS(D).
Proof of Claim:
1. For all S, D_S matches Q.
2. If S ≠ T, need: either D_ST or D_TS does not match Q (D_ST is the prefix of D_S followed by the suffix of D_T).
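The counting argument can be replayed on a toy query. In the sketch below, the three required child tags stand in for the frontier of a critical document, each subset S goes into the prefix and the remaining tags into the suffix, and the two fooling-set conditions are checked exhaustively. The toy query /a[b and c and d], the flat tuple encoding, and all names are assumptions for illustration; this is not the construction for general queries.

```python
from itertools import combinations
from math import log2

REQUIRED = ("b", "c", "d")   # stands in for Frontier(D) of a critical document

def matches(children):
    """Toy Boolean query /a[b and c and d] on the child tags of a."""
    return set(REQUIRED) <= set(children)

def partitioned_documents():
    """One (prefix, suffix) pair per subset S: S in the prefix, the rest in the suffix."""
    docs = {}
    for size in range(len(REQUIRED) + 1):
        for subset in combinations(REQUIRED, size):
            suffix = tuple(t for t in REQUIRED if t not in subset)
            docs[subset] = (subset, suffix)
    return docs

def is_fooling_set(docs):
    pairs = list(docs.values())
    # Condition 1: every partitioned document matches the query.
    if not all(matches(p + s) for p, s in pairs):
        return False
    # Condition 2: for distinct documents, at least one crossed document fails.
    return all(
        not matches(p1 + s2) or not matches(p2 + s1)
        for (p1, s1), (p2, s2) in combinations(pairs, 2)
    )

T = partitioned_documents()
print(len(T), is_fooling_set(T), log2(len(T)))  # 8 True 3.0
```

With three frontier nodes there are 2^3 = 8 partitioned documents, so the bound is log 8 = 3 bits, one per frontier node.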
25
Proof of Claim (example)
S = { x2, x5 }, T = { x4, x5 }
[Figure: documents D_S, D_T and D_TS over the nodes root (x0), conference (x1), name (x2), speaker (x3), name (x4), paper_cnt (x5).]
In D_TS the conference name is missing, so D_TS does not match Q.
26
Algorithm
- Uses the query as an NFA
- Based on three global data structures:
  - Pointer array
  - Validation array
  - Level array
- Matches the lower bounds for a fragment of XPath
27
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: level array, validation array, and pointer array with one entry; the entry for a is F at level 1.]
28
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: a is F at level 1; below it, index 0 holds b (F, level 2) and index 1 holds c (F, level 2).]
29
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: a (F, level 1), b (F, level 2), c (F, level 2); the current input element is c.]
30
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: at /c, the entry for c changes from F to T at level 2; a and b are still F.]
31
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: c is T and b is F at level 2; the current input element is b.]
32
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: at /b, the entry for b changes from F to T at level 2; c is already T.]
33
Algorithm Example Run
Query: /a[b and c] (query nodes: $ = u0, /a = u1, /b = u2, /c = u3)
Input XML: c1 b1...
[Figure: at /a, with b and c both T, the entry for a changes from F to T at level 1.]
Return TRUE
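The run above can be imitated with a much simpler evaluator that keeps, for every open element, the query nodes it might match and a validated flag per query child. This is a sketch in the spirit of the example, not the authors' algorithm: it keeps one frame per open element instead of the three global arrays, it supports only child steps and conjunctions without value predicates, and the dictionary encoding of the query, the class name, and the sample inputs are assumptions.

```python
import xml.sax

# Query /a[b and c] as a pattern tree: node id -> (label, parent id, child ids).
# Node 0 is the virtual document root ($ in the slides).
QUERY = {
    0: ("$", None, [1]),
    1: ("a", 0, [2, 3]),
    2: ("b", 1, []),
    3: ("c", 1, []),
}

class ConjunctiveMatcher(xml.sax.ContentHandler):
    def __init__(self, query):
        super().__init__()
        self.query = query
        # One frame per open element: {query node id: set of validated child ids}.
        self.stack = [{0: set()}]

    def startElement(self, name, attrs):
        parent_frame = self.stack[-1]
        frame = {}
        # The element is a candidate for every query child of a parent candidate
        # whose label equals the element name.
        for q in parent_frame:
            for child in self.query[q][2]:
                if self.query[child][0] == name:
                    frame[child] = set()
        self.stack.append(frame)

    def endElement(self, name):
        frame = self.stack.pop()
        parent_frame = self.stack[-1]
        for q, validated in frame.items():
            # A candidate is validated once all of its query children are.
            if validated == set(self.query[q][2]):
                parent_frame[self.query[q][1]].add(q)

    def result(self):
        return self.stack[0][0] == set(self.query[0][2])

def matches(xml_text, query=QUERY):
    handler = ConjunctiveMatcher(query)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.result()

print(matches("<a><c>1</c><b>1</b></a>"))  # True
print(matches("<a><b>1</b></a>"))          # False
```

As in the slides, c is validated at its closing tag, then b, and a is validated only when its own closing tag arrives with both flags set.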
34
Conclusion: Our Contributions
- Space lower bounds on the instance data complexity of XPath on XML streams:
  1. In terms of Query Frontier Size
  2. In terms of Document Recursion Depth
  3. In terms of Document Depth
- A streaming XML algorithm
  - Matches the lower bounds on a fragment of the language
  - Does not use finite-state automata
35
XPath 1.0
Query Q: /conference/name (query nodes: $ = u0, /C = u1, /N = u2)
Document D: the PODS conference example (nodes x0 to x8), with labels abbreviated to C (conference), N (name), S (speaker), P (paper_cnt)
Result: { x2 }
36
XPath 1.0
Query Q: /conference//name (query nodes: $ = u0, /C = u1, //N = u2)
Document D: the PODS conference example (nodes x0 to x8), with labels abbreviated to C (conference), N (name), S (speaker), P (paper_cnt)
Result: { x2, x4, x7 }
37
Reduction
A: S-space streaming algorithm for Q.
r ≥ 1: integer (r = 6 in the figure).
[Figure: Alice and Bob running A over document D, with states s0, ..., s6.]
Theorem: S ≥ CC( Q r ) / r