Schema-based Scheduling of Event Processors and Buffer Minimization for Queries on Structured Data Streams
Bernhard Stegmaier (TU München)
Joint work with Christoph Koch (TU Wien), Stefanie Scherzinger (TU Wien), Nicole Schweikardt (HU Berlin)
FluX – Intl. Conf. on Very Large Databases

Outline
- Motivation
- FluX Query Language
- Translating XQuery into FluX
- Further Aspects
- Experiments
- Conclusion

Traditional Approach
[Figure: Bibliography DTD]
Query: list title(s) and authors of books
    {for $b in /bib/book return {$b/title} {$b/author} }
Evaluation of a book node:
1. Print opening tag
2. Buffer titles and authors
3. Output titles
4. Output authors
5. Print closing tag
Example:
    Stream: … Kemper Datenbanksysteme Eickler 40€ …
    Buffer: Kemper Datenbanksysteme Eickler
    Output: Datenbanksysteme Kemper Eickler

The FluX Approach
[Figure: Bibliography DTD]
Query: list title(s) and authors of books
    {for $b in /bib/book return {$b/title} {$b/author} }
FluX query (for book node):
    …
    {process-stream $b:
       on title as $t return $t;
       on-first past(title,author) return
         {for $a in $b/author return $a}}
    …
Example:
    Stream: … Kemper Datenbanksysteme Eickler 40€ …
    Buffer: Kemper Eickler
    Output: Datenbanksysteme Kemper Eickler
Less buffering using order constraints

The FluX Approach II
[Figure: Bibliography DTD]
Query: list title(s) and authors of books
    {for $b in /bib/book return {$b/title} {$b/author} }
FluX query:
    …
    {process-stream $b:
       on title as $t return $t;
       on author as $a return $a;}
    …
Example:
    Stream: … Datenbanksysteme Kemper Eickler 40€ …
    Buffer: (empty)
    Output: Datenbanksysteme Kemper Eickler
No buffering using order constraints!
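
The two handler schedules above can be compared with a small Python sketch. It is not the FluX engine: the function name `process_book`, the flag `titles_first`, and the event tuples are hypothetical, and the sketch assumes that `past(title,author)` fires at the first child that is neither a title nor an author (or at the end of the book), since the DTD itself is not reproduced here.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, str]  # (child element name, text content)


def process_book(children: Iterable[Event], titles_first: bool) -> Tuple[List[str], int]:
    """Evaluate the title/author query on the children of one book node.

    titles_first=True models a DTD order constraint (all titles precede all
    authors), so authors can be streamed by an `on author` handler.
    Otherwise authors are buffered until past(title,author) fires, which this
    sketch ASSUMES happens at the first child that is neither title nor
    author, or at the end of the book.
    Returns (output, peak number of buffered authors).
    """
    output: List[str] = []
    buffer: List[str] = []
    peak = 0
    for name, text in children:
        if name == "title":
            output.append(text)            # on title as $t return $t
        elif name == "author":
            if titles_first:
                output.append(text)        # on author as $a return $a
            else:
                buffer.append(text)        # keep for the on-first handler
                peak = max(peak, len(buffer))
        else:
            output.extend(buffer)          # on-first past(title,author)
            buffer.clear()
    output.extend(buffer)                  # end of book: flush what is left
    return output, peak
```

On the stream from the first FluX slide (author, title, author, price) with `titles_first=False`, the sketch buffers the two authors and emits the title first; on the ordered stream with `titles_first=True`, nothing is buffered.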

FluX Query Language
Based on the XQuery fragment XQuery⁻:
    ε                                      (empty)
    s                                      (output fixed string)
    α β                                    (sequence)
    {for $x in $y/π [where χ] return α}    (for loop)
    {$x/π}                                 (output path)
    {$x}                                   (output)
    {if χ then α}                          (conditional)

FluX Query Language
An XQuery⁻ expression is simple if it can be executed without buffering the stream.
Example 1: {$x} {if $x/b = 5 then 5 }    simple
Example 2: {$x}{$x}                      not simple

FluX Query Language (ctd.)
FluX expressions:
- Simple XQuery⁻ expression
- s {process-stream $y: H} s′
Event handlers H:
- on-first past(S) return α
    α: XQuery⁻ expression, S: set of symbols
    α is executed on buffers
- on a as $x return Q
    a: symbol name, $x: variable, Q: FluX expression
    Q is executed in event-based fashion

Safe FluX Queries
A FluX query is safe if no XQuery⁻ expression refers to elements that may still be encountered in the stream.
[Figure: Bibliography DTD]
FluX query:
    …
    {process-stream $b:
       on title as $t return $t;
       on-first past(title,author) return
         {for $p in $b/price return $p}}
    …
Data stream: … Kemper Datenbanksysteme Eickler 39€ …
Not safe! The on-first handler reads $b/price, but price is not in past(title,author), so price elements may still arrive when the handler executes.

Safe FluX Queries
A FluX query is safe if no XQuery⁻ expression refers to elements that may still be encountered in the stream.
[Figure: Bibliography DTD]
FluX query:
    …
    {process-stream $b:
       on title as $t return $t;
       on-first past(title,author,price) return
         {for $p in $b/price return $p}}
    …
Data stream: … Kemper Datenbanksysteme Eickler 39€ …
Safe! The handler executes only once no further title, author, or price element can arrive.
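
The safe/unsafe distinction in the two examples can be approximated by a simple set check. This is a conservative sketch, not the definition used by FluX: it ignores the DTD entirely and only tests whether every child symbol read by a handler body is listed in the handler's `past(...)` set. The function names are hypothetical.

```python
from typing import Iterable, Set


def unsafe_references(past: Iterable[str], referenced: Iterable[str]) -> Set[str]:
    """Symbols a handler body reads that are not covered by its past(...) set."""
    return set(referenced) - set(past)


def is_safe(past: Iterable[str], referenced: Iterable[str]) -> bool:
    """Sufficient (not necessary) condition: all referenced symbols are past."""
    return not unsafe_references(past, referenced)
```

For the handler `on-first past(title,author)` reading `$b/price`, `unsafe_references` reports `price`; adding `price` to the past set makes the check succeed.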

XQuery to FluX
Rewrite an XQuery⁻ query Q into a FluX query F using a (non-recursive) DTD:
- F is safe w.r.t. the DTD
- F is equivalent to Q
- F has low memory consumption (appropriate scheduling of event processors)
Steps:
1. Normalization of Q
2. Rewriting into FluX

Normalization
Rule-based rewriting of XQuery:
- Split paths into single-step for loops
- Eliminate where using if
- Push down if expressions
- Rewrite paths $x/a/… to for loops
Example (XMP, Q1):
    {for $b in $ROOT/bib/book where χ return {$b/year} {$b/title} }
is normalized to
    {for $bib in $ROOT/bib return
      {for $b in $bib/book return
        {if χ then }
        {for $year in $b/year return {if χ then {$year}}}
        {for $title in $b/title return {if χ then {$title}}}
        {if χ then }}}

Example
[Figure: Bibliography DTD]
    {for $bib in $ROOT/bib return
      {for $b in $bib/book return
        {for $t in $b/title return {$t}}
        {for $a in $b/author return {$a}} }}
function rewrite(Variable parentVar, Set H, XQuery⁻ β): FluX
- H: symbols that delay the execution of β
- Initial call: rewrite($ROOT, {}, Q)

Example
    {for $bib in $ROOT/bib return
      {for $b in $bib/book return
        {for $t in $b/title return {$t}}
        {for $a in $b/author return {$a}} }}
[Figure labels: β1, β2]
rewrite($ROOT, {}, β1)
- β1 is simple, no delay
- generate: on-first past() return …

Example
    {ps $ROOT:
      on-first past() return
      {for $bib in $ROOT/bib return
        {for $b in $bib/book return
          {for $t in $b/title return {$t}}
          {for $a in $b/author return {$a}} }}
[Figure label: β2]
rewrite($ROOT, {}, β2)

Example
    {ps $ROOT:
      on-first past() return
      {for $bib in $ROOT/bib return
        {for $b in $bib/book return
          {for $t in $b/title return {$t}}
          {for $a in $b/author return {$a}} }}
[Figure labels: β21, β22]
rewrite($ROOT, {}, β2) splits β2 into β21, β22
rewrite($ROOT, {}, β21)
- no delay
- generate: on bib as $bib return …

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {for $b in $bib/book return
          {for $t in $b/title return {$t}}
          {for $a in $b/author return {$a}} }}
[Figure label: α1]
rewrite($bib, {}, α1)
- no delay
- generate: on book as $b return …

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {for $t in $b/title return {$t}}
            {for $a in $b/author return {$a}} }
[Figure label: α2]
rewrite($b, {}, α2)
- as before, no delays
- generate: on-first past() return …
- generate: on title as $t return …

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {ps $b:
              on-first past() return ;
              on title as $t return {$t};
              {for $a in $b/author return {$a}} }
[Figure labels: α32, α41, α42]
Assure all titles before α32:
rewrite($b, {title}, α32)
rewrite($b, {title}, α41)
- delay execution until after title, buffered execution
- generate: on-first past(title,author) return …

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {ps $b:
              on-first past() return ;
              on title as $t return {$t};
              on-first past(title,author) return
                {for $a in $b/author return {$a}}; }
[Figure label: α42]
Assure all titles and authors before α42:
rewrite($b, {title,author}, α42)
- α42 is simple; delay execution until after title, author
- generate: on-first past(title,author) return …

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {ps $b:
              on-first past() return ;
              on title as $t return {$t};
              on-first past(title,author) return
                {for $a in $b/author return {$a}};
              on-first past(title,author) return ;};

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {ps $b:
              on-first past() return ;
              on title as $t return {$t};
              on-first past(title,author) return
                {for $a in $b/author return {$a}};
              on-first past(title,author) return ;}
      on-first past(bib) return ;}

Example – Order Constraints
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {ps $b:
              on-first past() return ;
              on title as $t return {$t};
              {for $a in $b/author return {$a}} }
[Figure labels: α41, α42]
Assure all titles before α41:
rewrite($b, {title}, α41)
- the DTD ensures titles before authors
- generate: on author as $a return …

Example
    {ps $ROOT:
      on-first past() return
      on bib as $bib return
        {ps $bib:
          on book as $b return
            {ps $b:
              on-first past() return ;
              on title as $t return {$t};
              on author as $a return {$a};
              on-first past(title,author) return ;};
      on-first past(bib) return ;}
Assure all titles before α41:
rewrite($b, {title}, α41), H = {title}
- the DTD ensures titles before authors
- generate: on author as $a return …
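
The choice between the two handler kinds in the walkthrough can be summarized in a short Python sketch. It covers only the case shown in the example, a loop of the form `{for $x in $parent/symbol return {$x}}`, and the function name `schedule` and the encoding of order constraints as pairs are assumptions made for illustration; the actual rewrite function also has to consider dependencies and whether the loop body is simple.

```python
from typing import FrozenSet, Set, Tuple, Union

# (a, b) means: the DTD guarantees every a child precedes every b child.
Order = Set[Tuple[str, str]]
Handler = Union[Tuple[str, str], Tuple[str, FrozenSet[str]]]


def schedule(symbol: str, H: Set[str], order: Order) -> Handler:
    """Pick the handler for `{for $x in $parent/symbol return {$x}}`.

    H is the set of symbols that must be past before the loop may run.
    """
    if all((h, symbol) in order for h in H):
        # Every delaying symbol is guaranteed to come first: stream directly.
        return ("on", symbol)
    # Otherwise wait until H and the symbol itself are past, then use buffers.
    return ("on-first past", frozenset(H | {symbol}))
```

With `H = {"title"}` and no order constraints the sketch yields `on-first past(title,author)`; with the constraint `("title", "author")` it yields `on author`, matching the two variants of the example.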

Further Aspects
Visit our demonstration (Group 3: XML)
[Figure: system architecture. Labels: XQuery, DTD, Query Compiler, To Normal Form, Algebraic Optimizations, To FluX, Query Optimizer, Runtime Engine, Streamed Query Evaluator, XSAX, Memory Buffers, XML Input Stream, XML Output Stream]

Experiments
Based on XMark
- Queries adapted to the XQuery⁻ fragment
Environment
- AMD Athlon XP 2000, 512MB RAM
- Linux, Sun JDK 1.4.2_03
Measurements
- Execution time
- Memory consumption

Experiments
                    FluX                  Galax             AnonX
Query  Size    time [s]   memory     time [s]   memory     time [s]
Q1     5M         2.1     0            13.4     37M           3.4
       10M        2.8     0            29.8     83M           6.7
       50M        7.8     0             -       >500M        38.3
       100M      14.0     0             -       >500M         -
Q8     5M         6.8     1.54M       296.9     50M         143.8
       10M       17.2     3.16M      1498.3     100M        534.8
       50M      357.8     16.00M        -       >500M         -
       100M   11566.9     32.25M        -       >500M         -
Q11    5M         5.6     374k        277.0     50M          n/a
       10M       11.4     741k       1663.7     100M         n/a
       50M      170.8     3.64M         -       >500M        n/a
       100M     626.8     7.27M         -       >500M        n/a
Q13    5M         2.2     0            12.8     38M           3.0
       10M        3.1     0            27.2     73M           5.2
       50M        7.9     0           230.1     344M         88.0
       100M      13.9     0             -       >500M         -
Q20    5M         2.8     4.66k        13.2     36M           2.5
       10M        3.4     5.18k        29.7     80M           6.2
       50M        8.7     7.01k         -       >500M       151.9
       100M      15.4     7.02k         -       >500M         -

Conclusion
FluX
- Event-based extension of XQuery
- Rewriting of XQuery into FluX
- Uses information from the DTD
FluX supports buffer-conscious query processing
- Low main memory consumption
- Efficient and scalable query execution on data streams
Future work
- Recursive DTDs
- Extension of the XQuery⁻ subset (e.g., //, aggregate operators)
- Improve execution (joins)

FluX Query Language
Based on the XQuery fragment XQuery⁻:
    ε                                      (empty)
    s                                      (output fixed string)
    α β                                    (sequence)
    {for $x in $y/π [where χ] return α}    (for loop)
    {$x/π}                                 (output path)
    {$x}                                   (output)
    {if χ then α}                          (conditional)
Difference to XQuery in treating fixed strings:
    {$ROOT/bib/book}

FluX Query Language (ctd.)
An XQuery⁻ expression α β γ is simple if
1. α and γ are (possibly empty) sequences of
   - fixed strings s
   - {if χ then s}
2. β is empty or one of
   - {$u}
   - {if χ then {$u}}
   and $u does not occur in a condition of α γ
Simple expressions can be executed without buffering on streams.
Example:
    {$x} {if $x/b=5 then 5 }    simple
    {$x}{$x}                    not simple

Dependencies
    …
    { for $title in $book/title return
      { if $book/publisher = "Addison-Wesley" and $book/year > 1991
        then {$title} } }
    …
    {ps $book: …
       on-first past(publisher,year,title) return
         { for $title in $book/title return { … } } };
     … }
dependencies($book, "{for …}")

Further Aspects
The XSAX (eXtended SAX) parser
- Generates on-first events
Execution of FluX queries
- Using XSAX
Projection scheme
- Additional reduction of buffer size
Algebraic pre-optimizations
Visit our demonstration (Group 3: XML)
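
How a parser might generate on-first events can be illustrated with a Python sketch. This is not the XSAX implementation: the function `xsax_events` is hypothetical, and it assumes a heavily simplified content model, namely an ordered list of symbol groups in which all children from one group precede all children from the next (for example `(title|author)*` followed by `price`). The input is assumed to be valid with respect to that model.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple


def xsax_events(children: Sequence[str],
                groups: Sequence[Set[str]],
                watched: Sequence[FrozenSet[str]]) -> List[Tuple[str, object]]:
    """Interleave ordinary `on` events with `on-first past(S)` events.

    children: child element names of one parent, in document order
    groups:   simplified content model (ASSUMPTION: ordered symbol groups)
    watched:  the sets S for which an on-first past(S) handler exists
    """
    group_of = {name: i for i, g in enumerate(groups) for name in g}
    pending = list(watched)
    events: List[Tuple[str, object]] = []

    def fire(position: int) -> None:
        # S is past once all of its symbols belong to groups before `position`.
        for s in list(pending):
            if all(group_of[name] < position for name in s):
                events.append(("on-first past", s))
                pending.remove(s)

    fire(0)                          # past() with empty S fires immediately
    for name in children:
        fire(group_of[name])         # seeing this child may prove S is past
        events.append(("on", name))
    fire(len(groups))                # end of parent: everything is past
    return events
```

Each on-first event is emitted exactly once and as early as the simplified content model allows, which is what lets buffered handlers run before the parent element ends.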