Buffering in Query Evaluation over XML Streams
Ziv Bar-Yossef (Technion)
Marcus Fontoura, Vanja Josifovski (IBM Almaden Research Center)

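2 XML Document
<department>
  <name>Software Testing</name>
  <employee id="1">
    <name>Alice</name>
    <position>engineer</position>
  </employee>
  <employee id="2">
    <name>Bob</name>
    <position>engineer</position>
  </employee>
  <employee id="3">
    <name>Carole</name>
    <position>assistant</position>
  </employee>
  <manager id="4">
    <name>John</name>
  </manager>
</department>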

3 XML Document Tree
[Figure: tree view of the document from slide 2. The root has a single child, department, whose children are name ("Software Testing"), three employee nodes (each with @id, name, and position), and one manager node (with @id and name "John").]

4 XPath Queries
/department[manager/name = "John"]/employee[position = "engineer"]/name
[Figure: the steps and predicates of the query overlaid on the document tree.]

5 XPath Queries
/department[employee/name = manager/name]/name
[Figure: the steps and the predicate of the query overlaid on the document tree.]

6 XPath
- XPath 2.0
- Forward axes only
- Eval(Q,D): the nodes in D that match Q
- Two modes of XPath evaluation:
  - Full-fledged evaluation: given Q and D, output Eval(Q,D)
  - Filtering: given Q and D, determine whether Eval(Q,D) is nonempty
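The two evaluation modes can be illustrated on the sample department document. The following is a minimal, non-streaming sketch using Python's `xml.etree.ElementTree`; the tag layout and `id` values of `DOC` are reconstructed from the slides and are assumptions, and the function names `evaluate` and `filter_doc` are hypothetical. ElementTree supports only a subset of XPath, so the `manager/name` predicate is checked by hand.

```python
import xml.etree.ElementTree as ET

# Sample document; tag layout and id values are assumptions.
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


def evaluate(doc):
    """Full-fledged evaluation of
    /department[manager/name = "John"]/employee[position = "engineer"]/name
    Returns the text of every selected name node, in document order."""
    dept = ET.fromstring(doc)  # the root element
    if dept.tag != "department":
        return []
    # Predicate [manager/name = "John"]: true if any manager name equals "John".
    if not any(n.text == "John" for n in dept.findall("manager/name")):
        return []
    # ElementTree's XPath subset handles the [child='text'] predicate directly.
    return [n.text for n in dept.findall("employee[position='engineer']/name")]


def filter_doc(doc):
    """Filtering: only report whether Eval(Q, D) is nonempty."""
    return len(evaluate(doc)) > 0
```

Here filtering is derived from full-fledged evaluation for brevity; a dedicated filter could stop at the first match and never needs to return the selected nodes.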

7 XML Streams
- XML stream: a sequence of SAX events
  - startDocument(), endDocument(), startElement(name), endElement(name), text(str), …
- Critical resources
  - Memory
  - Processing time
- Why XML streams?
  - For transferring XML between systems
  - For efficient access to large XML documents
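To make the event sequence concrete, here is a small sketch that records the SAX events of a document with Python's standard `xml.sax` module. The class name `EventRecorder` and the helper `sax_events` are hypothetical. Note that Python's handler method is called `characters()`, whereas the slide names the event `text(str)`.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Records each SAX callback as a string, in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument()")

    def endDocument(self):
        self.events.append("endDocument()")

    def startElement(self, name, attrs):
        self.events.append(f"startElement({name})")

    def endElement(self, name):
        self.events.append(f"endElement({name})")

    def characters(self, content):
        # Whitespace-only chunks are skipped; a parser may also split long
        # text into several chunks, which this sketch does not merge.
        if content.strip():
            self.events.append(f"text({content.strip()})")


def sax_events(doc):
    """Parse the document string and return its list of recorded events."""
    recorder = EventRecorder()
    xml.sax.parseString(doc.encode("utf-8"), recorder)
    return recorder.events
```

A streaming algorithm sees only this sequence, one event at a time, and anything it wants to consult later has to be kept in its own memory.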

8 Streaming XML Algorithms
- XFilter and YFilter [Altinel and Franklin 00] [Diao et al 02]
- X-scan [Ives, Levy, and Weld 00]
- XMLTK [Avila-Campillo et al 02]
- XTrie [Chan et al 02]
- SPEX [Olteanu, Kiesling, and Bry 03]
- Lazy DFAs [Green et al 03]
- The XPush Machine [Gupta and Suciu 03]
- XSQ [Peng and Chawathe 03]
- FluX [Koch et al 04]
- TurboXPath [Josifovski, Fontoura, and Barta 05]
- …
All of them use lots of memory on certain queries and documents.

9 Memory Bottleneck I: Storage of Large Transition Tables
- Framework of most algorithms:
  - Q → NFA
  - Simulate the NFA by a DFA
  - Caveat: exponential blowup
- However, exponential blowup is not necessary [Bar-Yossef, Fontoura, Josifovski 04]
  - Algorithm for filtering XML streams whose space is linear in the query size

10 Memory Bottleneck II: Buffering of Document Fragments
- Scenario 1: buffering nodes that may or may not be part of the output.
/department[manager/name = "John"]/employee[position = "engineer"]/name
[Figure: the document tree from slide 3.]
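Scenario 1 can be seen in a hand-written streaming evaluator for this one query. The sketch below is an illustration under stated assumptions, not the algorithm from the talk: the query is hard-coded, the class name `Q1Evaluator` and helper `run_q1` are hypothetical, and the tag layout of `DOC` is reconstructed from the slides. Names of engineers are held in a buffer until a manager named John is seen, and the peak number of held names is tracked.

```python
import xml.sax

# Sample document; tag layout and id values are assumptions.
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


class Q1Evaluator(xml.sax.ContentHandler):
    """Hard-coded evaluator for
    /department[manager/name = "John"]/employee[position = "engineer"]/name"""

    def __init__(self):
        super().__init__()
        self.path = []            # names of the currently open elements
        self.text = []            # text chunks of the current leaf element
        self.john_seen = False    # has [manager/name = "John"] become true?
        self.is_engineer = False  # predicate state of the open employee
        self.pending = []         # names of the open employee
        self.buffer = []          # engineer names waiting for the manager
        self.output = []
        self.peak = 0             # max number of names held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if self.path == ["department", "employee"]:
            self.pending, self.is_engineer = [], False

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if self.path == ["department", "employee", "name"]:
            self.pending.append(value)
        elif self.path == ["department", "employee", "position"]:
            self.is_engineer = self.is_engineer or value == "engineer"
        elif self.path == ["department", "manager", "name"] and value == "John":
            self.john_seen = True
            self.output.extend(self.buffer)  # flush everything buffered so far
            self.buffer = []
        elif self.path == ["department", "employee"]:
            if self.is_engineer:
                target = self.output if self.john_seen else self.buffer
                target.extend(self.pending)
            self.pending = []                # non-engineers are dropped
        self.peak = max(self.peak, len(self.buffer) + len(self.pending))
        self.path.pop()
        self.text = []


def run_q1(doc):
    """Return (selected names, peak number of names held in memory)."""
    handler = Q1Evaluator()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.output, handler.peak
```

On the sample document the manager comes last, so the evaluator holds up to three names at once. If the manager element came first, each name would be held only until its own employee element closes.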

11 Memory Bottleneck II: Buffering of Document Fragments
- Scenario 2: buffering nodes needed for evaluating pending predicates.
/department[employee/name = manager/name]/name
[Figure: the document tree from slide 3.]
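For Scenario 2, the predicate compares two node sets, so values from one side must be remembered until a matching value from the other side arrives. The sketch below evaluates only the predicate `[employee/name = manager/name]` over a SAX stream and tracks how many name values are held. The class name `NameJoinFilter`, the helper `run_filter`, and the tag layout of `DOC` are assumptions for illustration.

```python
import xml.sax

# Sample document; tag layout and id values are assumptions.
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


class NameJoinFilter(xml.sax.ContentHandler):
    """Streaming check of the predicate in
    /department[employee/name = manager/name]/name"""

    def __init__(self):
        super().__init__()
        self.path = []
        self.text = []
        # Name values seen so far on each side of the comparison.
        self.names = {"employee": set(), "manager": set()}
        self.satisfied = False
        self.peak = 0  # max number of name values held at once

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        is_name = (len(self.path) == 3 and self.path[0] == "department"
                   and self.path[1] in self.names and self.path[2] == "name")
        if is_name and not self.satisfied:
            side = self.path[1]
            other = "manager" if side == "employee" else "employee"
            if value in self.names[other]:
                # The comparison is existential: one matching pair suffices,
                # and the remembered values can be released.
                self.satisfied = True
                self.names = {"employee": set(), "manager": set()}
            else:
                self.names[side].add(value)
                held = sum(len(s) for s in self.names.values())
                self.peak = max(self.peak, held)
        self.path.pop()
        self.text = []


def run_filter(doc):
    """Return (predicate satisfied?, peak number of name values held)."""
    handler = NameJoinFilter()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.satisfied, handler.peak
```

Unlike Scenario 1, this memory is needed even in filtering mode, because the values serve the predicate itself rather than the output.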

12 Memory Bottleneck II: Buffering of Document Fragments
- Scenario 3: buffering multiple candidate matches that are nested within each other.
- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]

13 Our Results
- Quantitative space lower bounds for:
  - Full-fledged evaluation of queries with predicates (Scenario 1)
  - Filtering/full-fledged evaluation of queries with “multi-variate” predicates (Scenario 2)
- Matching upper bound
  - Eager evaluation of predicates
- In all other scenarios: no buffering required
  - Filtering non-recursive documents using queries with “univariate” predicates is possible without buffering [Bar-Yossef, Fontoura, Josifovski 04]

14 Related Work
- Space complexity of XPath evaluation over non-streaming XML documents [Gottlob, Koch, Pichler 03], [Segoufin 03]
- Space complexity of XPath evaluation over streams of indexed XML data [Choi, Mahoui, Wood 03]
- Space complexity of select-project-join queries over relational data streams [Arasu et al 02]

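15 Document Concurrency
- $Q$: query
- $D = \sigma_1, \ldots, \sigma_n$: document
  - Each $\sigma_i$ is a SAX event
  - $\alpha_t = (\sigma_1, \ldots, \sigma_t)$
- Definition: $x \in D$ is alive at step $t$ if $x \in \alpha_t$ and $\exists\, \beta, \gamma$ s.t. $x \in \mathrm{Eval}(Q, \alpha_t \beta)$ and $x \notin \mathrm{Eval}(Q, \alpha_t \gamma)$
- $t\text{-concurrency}(D,Q)$: number of distinct nodes that are alive at step $t$
- $\mathrm{concurrency}(D,Q) = \max_t \; t\text{-concurrency}(D,Q)$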
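As a worked example, consider the department document with employees Alice and Bob (engineers) and Carole (assistant) followed by manager John, and the query /department[manager/name = "John"]/employee[position = "engineer"]/name. The table below is a sketch that assumes the element order name, position inside each employee, and that a node belongs to the prefix once its start event has been read. A name stays alive as long as some continuation of the stream selects it and another does not.

```latex
\[
\begin{array}{lll}
\text{prefix ends just after} & \text{alive name nodes} & t\text{-concurrency} \\
\hline
\text{Alice's name} & \{\text{Alice}\} & 1 \\
\text{Bob's name} & \{\text{Alice}, \text{Bob}\} & 2 \\
\text{Carole's name} & \{\text{Alice}, \text{Bob}, \text{Carole}\} & 3 \\
\text{Carole's employee element closes} & \{\text{Alice}, \text{Bob}\} & 2 \\
\text{the manager's name ``John''} & \emptyset & 0
\end{array}
\]
\[
\mathrm{concurrency}(D,Q) = \max_t \; t\text{-concurrency}(D,Q) = 3
\]
```

Carole's name dies when her employee element closes without an engineer position, since no continuation can select it any more. Alice's and Bob's names stop being alive once the manager's name is read, since every continuation now selects them.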

16 Lower Bound Notions
- A “normal” lower bound: for every algorithm A, there exist Q and D s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
  - Q and D may be “pathological”
  - Doesn’t say much about real-world queries/documents
- An “ideal” lower bound: for every A, every Q, and every D, A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D.
  - Too good to be true
  - A can have D and Q “hard-coded”, and then know the result a priori
  - Space of A on D and Q = minimum description length of Q and D

17 Our Lower Bound
- Theorem: for every A, every Q, and every D, there exists an almost isomorphic document D’ s.t. A uses $\Omega(\mathrm{concurrency}(D,Q))$ bits of space on Q and D’.
  - D’ is the same as D, except for a few extra empty nodes with auxiliary names.
- The theorem holds only if:
  - Q is “star-free”
  - D is non-recursive

18 Why isn’t this Obvious?
- Reason 1: we want the theorem to work for every Q and D, not only ones with high minimum description length (MDL).
- Reason 2:
  - Obvious: if x is alive at step t ⇒ A has to buffer x
    - Because A may or may not need to output x
  - Not obvious: if x and y are alive at step t ⇒ A has to buffer both
    - If x and y are not “independent”, maybe it’s enough to buffer just x (or just y)

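19 Proof of Lower Bound
- $C = t\text{-concurrency}(D,Q)$
- $x_1, \ldots, x_C$: the distinct nodes alive at step $t$
- Recall: for every $x_i$ there exist $\beta_i$ and $\gamma_i$ s.t.
  - $x_i \in \mathrm{Eval}(Q, \alpha_t \beta_i)$
  - $x_i \notin \mathrm{Eval}(Q, \alpha_t \gamma_i)$
- Lemma: there exist a single $\beta$ and a single $\gamma$ s.t. for all $i$,
  - $x_i \in \mathrm{Eval}(Q, \alpha_t \beta)$
  - $x_i \notin \mathrm{Eval}(Q, \alpha_t \gamma)$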

20 Proof of Lower Bound (cont.)
- For every $S \subseteq \{1, \ldots, C\}$ define a document $D_S$:
  - $D_S$ is the same as $D$, except that for every $i \in S$ we “mark” $x_i$
  - Marking: an extra empty child with an auxiliary name
- Note: $D_S$ is almost-isomorphic to $D$
- $\alpha_t^S$ = first $t$ events in $D_S$
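The marking construction can be sketched with `xml.etree.ElementTree`. This is an illustration only: the helper names `mark` and `all_marked_documents`, the auxiliary tag name `aux`, and the choice of the three employee name nodes as the alive nodes are assumptions, as is the tag layout of `DOC`. Indices are 0-based here, while the slide uses 1 to C.

```python
import copy
import itertools
import xml.etree.ElementTree as ET

# Sample document; tag layout and id values are assumptions.
DOC = """<department>
  <name>Software Testing</name>
  <employee id="1"><name>Alice</name><position>engineer</position></employee>
  <employee id="2"><name>Bob</name><position>engineer</position></employee>
  <employee id="3"><name>Carole</name><position>assistant</position></employee>
  <manager id="4"><name>John</name></manager>
</department>"""


def mark(root, path, subset, aux_name="aux"):
    """Return a copy of the tree in which the i-th node matched by `path`
    gets an extra empty child named `aux_name`, for every i in `subset`."""
    marked = copy.deepcopy(root)
    nodes = marked.findall(path)
    for i in subset:
        ET.SubElement(nodes[i], aux_name)
    return marked


def all_marked_documents(root, path):
    """Map every subset S of node indices to the serialized document D_S."""
    count = len(root.findall(path))
    docs = {}
    for size in range(count + 1):
        for subset in itertools.combinations(range(count), size):
            docs[subset] = ET.tostring(mark(root, path, subset))
    return docs
```

With C = 3 candidate nodes this yields 2^3 = 8 pairwise distinct documents, each differing from the original only by empty auxiliary children.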

21 Proof of Lower Bound (cont.)
- A = any algorithm
- Consider the state of A after processing $\alpha_t^S$:
  - If suffix = $\gamma$, none of the $x_i$’s should be output ⇒ A could not have output any $x_i$ by step $t$
  - If suffix = $\beta$, there is no information in the suffix about $S$, but $S$ can be reconstructed from the output ⇒ the state of A at step $t$ must have all the information about $S$
- Conclusion: space $\geq \Omega(C)$
- Actual proof: by one-way communication complexity
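The informal conclusion can be read as a counting argument. The lines below are a sketch of that intuition only; as the slide notes, the actual proof goes through one-way communication complexity.

```latex
\[
\bigl|\{\, S : S \subseteq \{1, \ldots, C\} \,\}\bigr| = 2^{C}
\]
\[
\text{the state of } A \text{ at step } t \text{ must distinguish all } 2^{C} \text{ sets } S
\;\Longrightarrow\;
\text{space} \;\geq\; \log_2 2^{C} \;=\; C \text{ bits}
\]
```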

22 Conclusions
- Our contributions:
  - Quantitative space lower bounds
    - Full-fledged evaluation of queries with predicates
    - Filtering/full-fledged evaluation of queries with “multi-variate” predicates
  - Matching upper bound
- Open problems:
  - Quantitative lower bounds for XQuery evaluation over streams
  - Address larger fragments of XPath

23 Memory Bottleneck II: Buffering of Document Fragments
- Scenario 3: buffering multiple candidate matches that are nested within each other.
- Example query: //a[b and c]
- [Figure: a recursive document tree with nested a nodes that have b and c children.]
- Relevant only when the document is “recursive”
- Space required: $\Omega(\text{doc-recursion-depth})$ [Bar-Yossef, Fontoura, Josifovski 04]
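To get a feel for document recursion, the sketch below measures a simplified notion of recursion depth: the largest number of elements with the same name that are open at the same time. This definition is an assumption chosen for illustration and may differ from the one used in the cited work; the class name `RecursionDepth` and helper `recursion_depth` are hypothetical.

```python
import xml.sax
from collections import Counter


class RecursionDepth(xml.sax.ContentHandler):
    """Tracks, per element name, how many elements of that name are open."""

    def __init__(self):
        super().__init__()
        self.open_count = Counter()
        self.depth = 0  # largest same-name nesting seen so far

    def startElement(self, name, attrs):
        self.open_count[name] += 1
        self.depth = max(self.depth, self.open_count[name])

    def endElement(self, name):
        self.open_count[name] -= 1


def recursion_depth(doc):
    """Return the maximum same-name nesting depth of the document."""
    handler = RecursionDepth()
    xml.sax.parseString(doc.encode("utf-8"), handler)
    return handler.depth
```

For a query such as //a[b and c], every open a element is a candidate whose predicate may still be pending, so a document like `<a><a><b/><a><c/></a></a><c/></a>` has three nested candidates at once, while a flat document has at most one.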

24 Concurrency: Example
/department[manager/name = "John"]/employee[position = "engineer"]/name
[Figure: the XML document from slide 2, with its nodes marked as alive or dead with respect to the query above.]
