1
Buffering in Query Evaluation over XML Streams
Ziv Bar-Yossef (Technion)
Marcus Fontoura, Vanja Josifovski (IBM Almaden Research Center)
2
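XML Document
(Element markup was lost in extraction. It is reconstructed here from the document tree on slide 3; the id values are assumed to follow document order.)
<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>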
3
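XML Document Tree
(The flattened figure does not preserve which @id belongs to which node; values 1 to 4 are assumed to follow document order.)
root
  department
    name: Software Testing
    employee
      @id: 1
      name: Alice
      position: engineer
    employee
      @id: 2
      name: Bob
      position: engineer
    employee
      @id: 3
      name: Carole
      position: assistant
    manager
      @id: 4
      name: John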
4
XPath Queries
/department[manager/name = "John"]/employee[position = "engineer"]/name
(Figure: the steps and predicates of the query overlaid on the document tree from slide 3.)
5
XPath Queries
/department[employee/name = manager/name]/name
(Figure: the steps and predicate of the query overlaid on the document tree from slide 3.)
6
XPath
- XPath 2.0
- Forward axes only
- Eval(Q,D): the nodes in D that match Q
- Two modes of XPath evaluation:
  - Full-fledged evaluation: given Q and D, output Eval(Q,D)
  - Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
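To make the two evaluation modes concrete, here is a minimal non-streaming sketch for one query from the earlier slides, hand-coded with Python's `xml.etree.ElementTree`. The helper names `evaluate` and `filter_matches`, the inline document, and its id values are assumptions for illustration; this is not the authors' algorithm.

```python
import xml.etree.ElementTree as ET

# Sample document from the slides (markup reconstructed; id values are assumed).
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


def evaluate(root):
    """Full-fledged mode: return Eval(Q, D) for the hand-coded query
    Q = /department[manager/name = "John"]/employee[position = "engineer"]/name."""
    if root.tag != "department":
        return []
    # Predicate on department: some manager/name equals "John".
    if not any(n.text == "John" for n in root.findall("manager/name")):
        return []
    selected = []
    for emp in root.findall("employee"):
        # Predicate on employee: some position child equals "engineer".
        if any(p.text == "engineer" for p in emp.findall("position")):
            selected.extend(emp.findall("name"))
    return selected


def filter_matches(root):
    """Filtering mode: only report whether Eval(Q, D) is nonempty."""
    return len(evaluate(root)) > 0


root = ET.fromstring(DOC)
names = [n.text for n in evaluate(root)]
print(names, filter_matches(root))
```

Filtering is implemented here on top of full evaluation only for brevity; a real filter could stop at the first match.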
7
XML Streams
- XML stream: a sequence of SAX events
  - startDocument(), endDocument(), startElement(name), endElement(name), text(str), …
- Critical resources
  - Memory
  - Processing time
- Why XML streams?
  - For transferring XML between systems
  - For efficient access to large XML documents
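The event sequence can be observed with Python's standard `xml.sax` module. This is a small sketch: the class name `EventRecorder` and the helper `sax_events` are hypothetical, and the slide's `text(str)` event corresponds to the `characters` callback in this API.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records SAX callbacks using the event names from the slide."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        # A parser may deliver text in several chunks; whitespace-only
        # chunks are skipped here to keep the trace readable.
        if content.strip():
            self.events.append(f"text({content.strip()})")


def sax_events(xml_text):
    recorder = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), recorder)
    return recorder.events


events = sax_events("<employee><name>Alice</name></employee>")
for event in events:
    print(event)
```

The handler sees one event at a time and never holds the whole document, which is why memory becomes the resource to account for.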
8
Streaming XML Algorithms
- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …
All of them use lots of memory on certain queries and documents.
9
Memory Bottleneck I: Storage of Large Transition Tables
- Framework of most algorithms:
  - Q → NFA
  - Simulate the NFA by a DFA
  - Caveat: exponential blowup
- However: exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  - Algorithm for filtering XML streams whose space is linear in the query size
10
Memory Bottleneck II: Buffering of Document Fragments
Scenario 1: buffering nodes that may or may not be part of the output.
/department[manager/name = "John"]/employee[position = "engineer"]/name
(Figure: the document tree from slide 3.)
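Scenario 1 can be sketched as a hand-coded SAX handler for this one query: employee names must be held until the manager's name settles the predicate on department. The class `Scenario1Evaluator`, the helper `evaluate_stream`, and the inline document (with assumed id values) are illustrative assumptions, not the algorithm from the talk.

```python
import xml.sax

# Sample document from the slides (markup reconstructed; id values are assumed).
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


class Scenario1Evaluator(xml.sax.ContentHandler):
    """Streams /department[manager/name="John"]/employee[position="engineer"]/name."""

    def __init__(self):
        super().__init__()
        self.path = []                # element names from the root to the current node
        self.text = []                # text chunks of the current leaf element
        self.emp_names = []           # names of the employee currently open
        self.emp_is_engineer = False
        self.john_seen = False        # predicate on department already satisfied?
        self.buffer = []              # names that passed the employee predicate
        self.output = []
        self.max_buffered = 0         # peak number of names held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "employee", "name"]:
            self.emp_names.append(value)
            held = len(self.buffer) + len(self.emp_names)
            self.max_buffered = max(self.max_buffered, held)
        elif self.path == ["department", "employee", "position"]:
            self.emp_is_engineer = self.emp_is_engineer or value == "engineer"
        elif self.path == ["department", "manager", "name"] and value == "John":
            # The pending predicate is resolved: flush everything buffered.
            self.john_seen = True
            self.output.extend(self.buffer)
            self.buffer.clear()
        elif self.path == ["department", "employee"]:
            # The employee predicate is now final: keep or drop its names.
            if self.emp_is_engineer:
                target = self.output if self.john_seen else self.buffer
                target.extend(self.emp_names)
            self.emp_names, self.emp_is_engineer = [], False
        self.path.pop()
        self.text = []


def evaluate_stream(xml_text):
    handler = Scenario1Evaluator()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.output, handler.max_buffered


output, peak = evaluate_stream(DOC)
print(output, peak)
```

On this document the handler holds three names at its peak (Alice, Bob, and Carole while her position is still unknown) before the manager element arrives.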
11
Memory Bottleneck II: Buffering of Document Fragments
Scenario 2: buffering nodes needed for evaluating pending predicates.
/department[employee/name = manager/name]/name
(Figure: the document tree from slide 3.)
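For Scenario 2, even filtering has to remember values: every employee name seen so far may later be matched by a manager name, and vice versa. The sketch below hand-codes a filter for this one query; the names `Scenario2Filter` and `run_filter` and the inline document (with assumed id values) are hypothetical.

```python
import xml.sax

# Sample document from the slides (markup reconstructed; id values are assumed).
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


class Scenario2Filter(xml.sax.ContentHandler):
    """Filters with /department[employee/name = manager/name]/name."""

    def __init__(self):
        super().__init__()
        self.path = []
        self.text = []
        self.emp_values = set()   # employee names waiting for a matching manager name
        self.mgr_values = set()   # manager names waiting for a matching employee name
        self.has_name = False     # department has at least one name child
        self.predicate = False    # the comparison predicate is already satisfied
        self.max_held = 0         # peak number of buffered values

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def _add(self, mine, other, value):
        if value in other:
            # Eager evaluation: once the predicate holds, drop both buffers.
            self.predicate = True
            self.emp_values.clear()
            self.mgr_values.clear()
        else:
            mine.add(value)
            held = len(self.emp_values) + len(self.mgr_values)
            self.max_held = max(self.max_held, held)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "name"]:
            self.has_name = True
        elif not self.predicate and self.path == ["department", "employee", "name"]:
            self._add(self.emp_values, self.mgr_values, value)
        elif not self.predicate and self.path == ["department", "manager", "name"]:
            self._add(self.mgr_values, self.emp_values, value)
        self.path.pop()
        self.text = []


def run_filter(xml_text):
    handler = Scenario2Filter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.predicate and handler.has_name, handler.max_held


print(run_filter(DOC))
print(run_filter(DOC.replace("John", "Bob")))
```

With the original document no employee shares the manager's name, so the filter rejects it after holding four values; renaming the manager to Bob makes it accept.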
12
Memory Bottleneck II: Buffering of Document Fragments
Scenario 3: buffering multiple candidate matches that are nested within each other.
- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
13
Our Results
- Quantitative space lower bounds for:
  - Full-fledged evaluation of queries with predicates (Scenario 1)
  - Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
- Matching upper bound
  - Eager evaluation of predicates
- In all other scenarios: no buffering required
  - Filtering non-recursive documents using queries with “univariate” predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]
14
Related Work
- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]
15
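Document Concurrency
- $Q$: query
- $D = \sigma_1, \ldots, \sigma_n$: document
  - Each $\sigma_i$ is a SAX event
  - $\delta_t = (\sigma_1, \ldots, \sigma_t)$: the first $t$ events of $D$
- Definition: $x \in D$ is alive at step $t$ if $x \in \delta_t$ and there exist suffixes $\alpha$ and $\beta$ s.t.
  - $x \in \mathrm{Eval}(Q, \delta_t\alpha)$
  - $x \notin \mathrm{Eval}(Q, \delta_t\beta)$
- $t$-concurrency$(D,Q)$: number of distinct nodes that are alive at step $t$
- concurrency$(D,Q)$: $\max_t$ $t$-concurrency$(D,Q)$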
16
Lower Bound Notions
- A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
  - Q and D may be “pathological”
  - Doesn’t say much about real-world queries/documents
- An “ideal” lower bound: for every A, every Q, and every D, A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
  - Too good to be true
  - A can have D and Q “hard-coded”, and then know the result a priori
  - Space of A on D and Q = minimum description length of Q and D
17
Our Lower Bound
- Theorem: for every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D’.
  - D’ is the same as D, except for a few extra empty nodes with auxiliary names.
- The theorem holds only if:
  - Q is “star-free”
  - D is non-recursive
18
Why isn’t this Obvious?
- Reason 1: we want the theorem to work for every Q and D, not only ones with high MDL.
- Reason 2:
  - Obvious: if x is alive at step t $\Rightarrow$ A has to buffer x
    - Because: A may or may not need to output x
  - Not obvious: if x and y are alive at step t $\Rightarrow$ A has to buffer both
    - If x and y are not “independent”, maybe it’s enough to buffer just x (or just y)
19
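Proof of Lower Bound
- $C$ = $t$-concurrency$(D,Q)$
- $x_1, \ldots, x_C$ = distinct nodes alive at step $t$
- Recall: for every $x_i$ there exist suffixes $\alpha_i$ and $\beta_i$ s.t. (with $\delta_t$ the first $t$ events of $D$)
  - $x_i \in \mathrm{Eval}(Q, \delta_t\alpha_i)$
  - $x_i \notin \mathrm{Eval}(Q, \delta_t\beta_i)$
- Lemma: there exist a single $\alpha$ and a single $\beta$ s.t. for all $i$,
  - $x_i \in \mathrm{Eval}(Q, \delta_t\alpha)$
  - $x_i \notin \mathrm{Eval}(Q, \delta_t\beta)$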
20
Proof of Lower Bound (cont.)
- For every $S \subseteq \{1, \ldots, C\}$ define a document $D_S$:
  - $D_S$ is the same as $D$, except that for every $i \in S$ we “mark” $x_i$
  - Marking: an extra empty child with an auxiliary name
- Note: $D_S$ is almost isomorphic to $D$
- $\delta_t^S$ = first $t$ events in $D_S$
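The marking construction can be sketched with `xml.etree.ElementTree`. The sketch assumes that the three employee name nodes of the sample document play the role of the alive nodes x_1, …, x_C; the auxiliary tag name `aux-mark`, the helper names, and the id values in the inline document are likewise assumptions.

```python
import copy
import itertools
import xml.etree.ElementTree as ET

# Sample document from the slides (markup reconstructed; id values are assumed).
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""

AUX = "aux-mark"  # hypothetical auxiliary element name


def marked_document(root, subset):
    """Build D_S: a copy of D in which x_i gets an empty AUX child for every i in S."""
    marked = copy.deepcopy(root)
    candidates = marked.findall("employee/name")  # x_1 .. x_C in document order
    for i in subset:
        ET.SubElement(candidates[i - 1], AUX)
    return marked


def recover_subset(marked):
    """Read S back from the marks, as one could from an output that contains the x_i."""
    names = marked.findall("employee/name")
    return {i for i, x in enumerate(names, start=1) if x.find(AUX) is not None}


root = ET.fromstring(DOC)
C = len(root.findall("employee/name"))
subsets = [set(c) for r in range(C + 1)
           for c in itertools.combinations(range(1, C + 1), r)]
documents = {ET.tostring(marked_document(root, s)) for s in subsets}
print(len(subsets), len(documents))
```

Each of the 2^C subsets yields a different document, and the subset can be read back from the marked nodes, which is the property the rest of the proof relies on.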
21
Proof of Lower Bound (cont.)
- A = any algorithm
- Consider the state of A after processing $\delta_t^S$:
  - If suffix = $\beta$, none of the $x_i$’s should be output $\Rightarrow$ A could not have output any $x_i$ by step $t$
  - If suffix = $\alpha$, there is no information in the suffix about $S$, but $S$ can be reconstructed from the output $\Rightarrow$ the state of A at step $t$ must hold all the information about $S$
- Conclusion: A needs $\Omega(C)$ bits of space
- Actual proof: by one-way communication complexity
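The conclusion can be read as a counting argument. The version below is a simplified sketch for deterministic algorithms, not the communication-complexity proof the slide refers to. If two different subsets led A to the same state after the marked prefix, A would produce the same output on the common suffix, although the outputs must differ. So A needs at least as many states as there are subsets:

```latex
\#\text{states of } A \text{ at step } t
  \;\ge\; \bigl|\{\, S : S \subseteq \{1,\ldots,C\} \,\}\bigr| \;=\; 2^{C}
\quad\Longrightarrow\quad
\text{space} \;\ge\; \log_2 2^{C} \;=\; C \text{ bits}.
```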
22
Conclusions
- Our contributions:
  - Quantitative space lower bounds
    - Full-fledged evaluation of queries with predicates
    - Filtering/full-fledged evaluation of queries with “multi-variate” predicates
  - Matching upper bound
- Open problems:
  - Quantitative lower bounds for XQuery evaluation over streams
  - Address larger fragments of XPath
23
Memory Bottleneck II: Buffering of Document Fragments
Scenario 3: buffering multiple candidate matches that are nested within each other.
- Example query: //a[b and c]
- (Figure: a document tree with nodes root, a, b, and c.)
- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
24
Concurrency: Example
/department[manager/name = "John"]/employee[position = "engineer"]/name
(Figure: the XML document from slide 2, with nodes marked as alive or dead.)