1
1 Efficient XML Stream Processing with Automata and Query Algebra. A Master Thesis Presentation. Student: Jinhui Jian. Advisor: Prof. Elke A. Rundensteiner. Reader: Prof. Kathi Fisler.
2
2 The Need for XML Stream Processing. Sources on the Internet (XML, relational, HTML news) feed XML data streams into an XML Stream Processing Engine. New paradigms: distributed data providers, distributed data consumers. New applications: monitoring (e.g., sensor networks), information filtering (e.g., news, email). New challenges: arbitrarily nested structure, incomplete knowledge.
3
3 Two Existing Approaches Automata-based [xfilter01, yfilter02, x-scan01,…] Algebraic [tukwila01, rainbow02, …] This thesis intends to integrate both existing approaches into one system
4
4 A Running Example Give me book titles whose price is greater than 50: FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title Input: TCP/IP Illustrated, Stevens W., Addison-Wesley, 65.95; Data on the Web, Abiteboul Serge, Buneman Peter, Suciu Dan, Morgan Kaufmann Publishers, 39.95; Advanced Programming in the Unix environment, Stevens W., Addison-Wesley, 65.95. Result: TCP/IP Illustrated; Advanced Programming in the Unix environment
5
5 XML as a Stream of Tokens Input XML stream along a timeline: TCP/IP Illustrated Stevens … … Elements: bib, book, title, author, last, first, publisher, price; text. A token can be: an open tag, a close tag, or PCDATA
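The token view on this slide can be sketched with Python's standard `xml.sax` module. This is a minimal sketch, not the thesis's tokenizer (the deck lists the Xerces SAX parser in Java); the class name `TokenCollector`, the `(kind, value)` tuple layout, and the one-book sample document are assumptions made for illustration.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Turns SAX callbacks into (kind, value) tokens (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._text = []  # SAX may deliver text in several chunks

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.tokens.append(("pcdata", text))

    def startElement(self, name, attrs):
        self._flush_text()
        self.tokens.append(("open", name))

    def endElement(self, name):
        self._flush_text()
        self.tokens.append(("close", name))

    def characters(self, content):
        self._text.append(content)


def tokenize(xml_text):
    handler = TokenCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


# Assumed sample: one book with only a title and a price.
SAMPLE = ("<bib><book><title>TCP/IP Illustrated</title>"
          "<price>65.95</price></book></bib>")
tokens = tokenize(SAMPLE)
```

Each token is one of the three kinds named on the slide: an open tag, a close tag, or PCDATA.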
6
6 Basic State-Transition Model Q := //book/price, taken from FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title. Figure: an automaton with states 0 to 3 and transitions ε, *, book, price, shown with the active states and the stack contents for each input token (TCP/IP Illustrated … 65.95 …)
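A stack-based automaton for Q := //book/price could look roughly like the sketch below. The transition table is one reading of the slide's figure (state 1 loops on any tag, 1 goes to 2 on book, 2 goes to 3 on price) and is an assumption, as are the names `TRANSITIONS`, `step`, and `run`. The ε-move from state 0 is folded into the initial active set.

```python
# Assumed transition table for //book/price; "*" matches any tag name.
TRANSITIONS = {
    1: {"*": {1}, "book": {2}},
    2: {"price": {3}},
}
FINAL = 3


def step(active, tag):
    """Compute the next active-state set for an open tag."""
    nxt = set()
    for state in active:
        edges = TRANSITIONS.get(state, {})
        nxt |= edges.get("*", set())
        nxt |= edges.get(tag, set())
    return nxt


def run(tokens):
    active = {1}   # epsilon-closure of state 0, simplified
    stack = []     # one saved state set per currently open element
    matches = []   # PCDATA seen while the final state is active
    trace = []     # active states after each open tag
    for kind, value in tokens:
        if kind == "open":
            stack.append(active)
            active = step(active, value)
            trace.append(sorted(active))
        elif kind == "close":
            active = stack.pop()  # backtrack to the parent's states
        elif kind == "pcdata" and FINAL in active:
            matches.append(value)
    return matches, trace


tokens = [
    ("open", "bib"), ("open", "book"),
    ("open", "title"), ("pcdata", "TCP/IP Illustrated"), ("close", "title"),
    ("open", "price"), ("pcdata", "65.95"), ("close", "price"),
    ("close", "book"), ("close", "bib"),
]
matches, trace = run(tokens)
```

Pushing the state set on every open tag and popping it on every close tag is what lets the automaton handle arbitrarily nested input without rescanning.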
7
7 Extended with Data Buffer and Buffer Operations FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title Data-driven; token at a time; fixed order. Figure: the automaton extended with title and price states, a buffer, and a flag. On price: 1. eval pred and set/clear flag 2. output if buffer not empty. On title: 1. write buffer 2. output if flag is set.
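The buffer and flag operations listed on this slide can be sketched as follows. This is an illustration under assumptions, not the thesis's code: a simple path stack stands in for the automaton, and the names `book_tokens` and `titles_of_expensive_books`, plus the reset of buffer and flag at the end of each book, are hypothetical.

```python
def book_tokens(title, price, price_first=False):
    """Build the token sequence for one book (assumed layout)."""
    t = [("open", "title"), ("pcdata", title), ("close", "title")]
    p = [("open", "price"), ("pcdata", price), ("close", "price")]
    body = p + t if price_first else t + p
    return [("open", "book")] + body + [("close", "book")]


def titles_of_expensive_books(tokens, threshold=50.0):
    path = []      # names of currently open elements
    buffer = []    # titles waiting for the predicate result
    flag = False   # True once price > threshold was seen in this book
    out = []
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
        elif kind == "pcdata":
            if path[-2:] == ["book", "title"]:
                buffer.append(value)             # 1. write buffer
                if flag:                         # 2. output if flag is set
                    out.extend(buffer)
                    buffer.clear()
            elif path[-2:] == ["book", "price"]:
                flag = float(value) > threshold  # 1. eval pred, set/clear flag
                if flag and buffer:              # 2. output if buffer not empty
                    out.extend(buffer)
                    buffer.clear()
        elif kind == "close":
            path.pop()
            if value == "book":                  # assumed reset per book
                buffer.clear()
                flag = False
    return out


stream = (
    [("open", "bib")]
    + book_tokens("TCP/IP Illustrated", "65.95")
    + book_tokens("Data on the Web", "39.95")
    + book_tokens("Advanced Programming in the Unix environment", "65.95",
                  price_first=True)
    + [("close", "bib")]
)
result = titles_of_expensive_books(stream)
```

The two action pairs cover both arrival orders: a title that arrives before its price waits in the buffer, and a title that arrives after a passing price is output at once.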
8
8 Algebraic Query Plan FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title Set at a time Postponed operation Extract //book Navigate //book, price Select price > 50 Tagger Navigate //book, title
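The set-at-a-time plan on this slide can be sketched with Python's `xml.etree.ElementTree`. The operator names follow the slide, but their signatures, the dictionary-based tuple layout, and the sample document are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; only title and price are kept per book.
BIB = (
    "<bib>"
    "<book><title>TCP/IP Illustrated</title><price>65.95</price></book>"
    "<book><title>Data on the Web</title><price>39.95</price></book>"
    "<book><title>Advanced Programming in the Unix environment</title>"
    "<price>65.95</price></book>"
    "</bib>"
)


def extract(root, tag, var):
    """Extract //tag: one tuple per matching element."""
    return [{var: elem} for elem in root.iter(tag)]


def navigate(tuples, src, child, dst):
    """Navigate src, child: bind dst to each child element of t[src]."""
    return [{**t, dst: c} for t in tuples for c in t[src].findall(child)]


def select(tuples, pred):
    """Select: keep the tuples that satisfy the predicate."""
    return [t for t in tuples if pred(t)]


def tagger(tuples, var):
    """Tagger: serialize the bound element back to XML text."""
    return [ET.tostring(t[var], encoding="unicode") for t in tuples]


root = ET.fromstring(BIB)
books = extract(root, "book", "$b")
with_price = navigate(books, "$b", "price", "$p")
expensive = select(with_price, lambda t: float(t["$p"].text) > 50)
with_title = navigate(expensive, "$b", "title", "$t")
result = tagger(with_title, "$t")
```

Because each operator consumes a whole set of tuples, the navigation to title runs only for books that passed the selection, which is the kind of postponed operation the slide refers to.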
9
9 Exploit the Flexibility of Postponed Operations FOR $b in doc (bib.xml) //book WHERE $b/price > 50 and $b/author/last = “Stevens” RETURN $b/title Extract //book Navigate //book, price Select price > 50 Tagger Navigate //book, author/last Select last = “Stevens” Navigate //book, title
10
10 Query Optimization in Algebraic Systems Logical optimization: selection pushdown, projection pushdown, join order selection. Physical optimization: operator algorithms. Runtime optimization: scheduling, resource allocation.
11
11 Thesis Overview Motivation: the automata model is good for on-the-fly pattern matching/retrieval; the algebraic model is good for optimizing complex queries. Major challenges: How to integrate the two models? How to optimize a query within the integrated query model?
12
12 The Raindrop Approach Integration Optimization
13
13 Path Bindings in XQuery FOR $b in doc (bib.xml) //book WHERE $b/price > 50 RETURN $b/title can be rewritten as FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t. FLWR expression: FOR…LET…WHERE…RETURN… Path bindings (FOR, LET); filtering and restructuring (WHERE, RETURN). “The purpose of path bindings is to produce a tuple stream in which each tuple consists of one or more bound variables” [W3C]
14
14 A Two-Tier System Architecture Automata plan Master plan Tuple stream XML data stream Query answer
15
15 Modeling the Master Plan: Algebraic Navigate //book, price Select price > 50 Tagger Navigate //book, author/last Select last = … Navigate //book, title
16
16 Modeling the Automata Plan: Black Box vs. White Box Automata Plan Q1 := //book Q2 := //book/price Q3 := //book/title SJoin //book Extract //book/price Extract //book/title
17
17 How to optimize it? Automata plan Master plan Tuple stream XML data stream Query answer
18
18 Optimization: A Unified Process in the Logical View 0 1 Extract //book ε * Navigate //book, //book/price 2 book Select //book/price > 50 Navigate //book, //book/title Extract //book Navigate //book, price Select price > 50 Navigate //book, title Automata Plan Master Plan cBa Cba $c$b$a
19
19 The Algebra Core
Op: Semantic
Selection: Filter tuples based on the predicate pred
Projection: Filter columns in the input tuples based on the variable list v
Join: Join input tuples based on the predicate pred
Aggregate: Aggregate over input tuples with the aggregate function f, e.g., sum and average
Tagger: Format outputs based on the pattern pt, i.e., reconstruct XML tags
Navigate: Take input elements of path p1 and output ancestor elements of path p2
Extract: Identify elements of path p from the input stream
Structural Join: Join input tuples on their structural relationship, e.g., the common parent relationship p
20
20 The Extract Operator Extract //book/title Figure: an automaton with transitions ε, *, book, title reads the input stream (TCP/IP Illustrated … …). Extracted titles: TCP/IP Illustrated; Data on the Web; Advanced Programming in the Unix environment
21
21 The Structural Join Operator 120 book ε 3 title * 4 price Extract //book/title Extract //book/price SJoin //book FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t … TCP/IP Illustrated … … …
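One way to picture the structural join is the sketch below: two extract branches collect titles and prices, and the join fires when the shared parent book closes. This is a simplified illustration under assumptions (a path stack instead of the automaton, no nested book elements, and hypothetical names such as `sjoin_stream`), not the operator's actual implementation.

```python
def sjoin_stream(tokens):
    path = []                  # names of currently open elements
    titles, prices = [], []    # outputs of the two Extract branches
    tuples = []                # joined ($t, $p) tuples
    for kind, value in tokens:
        if kind == "open":
            path.append(value)
        elif kind == "pcdata":
            if path[-2:] == ["book", "title"]:
                titles.append(value)   # Extract //book/title
            elif path[-2:] == ["book", "price"]:
                prices.append(value)   # Extract //book/price
        elif kind == "close":
            path.pop()
            if value == "book":
                # SJoin //book: everything collected since <book> opened
                # shares this book as its common parent.
                tuples.extend({"$t": t, "$p": p}
                              for t in titles for p in prices)
                titles, prices = [], []
    return tuples


tokens = [
    ("open", "bib"),
    ("open", "book"),
    ("open", "title"), ("pcdata", "TCP/IP Illustrated"), ("close", "title"),
    ("open", "price"), ("pcdata", "65.95"), ("close", "price"),
    ("close", "book"),
    ("open", "book"),
    ("open", "title"), ("pcdata", "Data on the Web"), ("close", "title"),
    ("open", "price"), ("pcdata", "39.95"), ("close", "price"),
    ("close", "book"),
    ("close", "bib"),
]
joined = sjoin_stream(tokens)
```

The resulting tuples correspond to the bindings of $t and $p in the LET clause; the WHERE and RETURN clauses would then run over this tuple stream.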
22
22 The Navigate Operator Navigate //book, title Example input element: TCP/IP Illustrated, Stevens W., Addison-Wesley, 65.95 A navigate operation can be postponed, independent of the input stream
23
23 A Special Optimization: In or Out? Automata plan Master plan Tuple stream XML data stream Query answer
24
24 Two Options: Bottom-up vs. Top-down Example element: TCP/IP Illustrated, Stevens W., Addison-Wesley, 65.95
25
25 Exploiting the Options for Optimization 0 1 Extract //book ε * Navigate //book, price 2 book Select price > 50 Navigate //book, title The pull-out plan Extract //book/price 0 1 3 4 title price Extract //book/title ε * SJoin //book 2 book Select //book/price > 50 The push-in plan Tagger
26
26 Query Optimization by Rewriting Rules Algebraic transformation: Navigate push-in; Redundant SJoin; Redundant Extract; Selection Pushdown; etc.
27
27 Runtime Optimization: Why? Optimization relies on cost estimation, which in turn relies on statistics. Statistics unknown; statistics change. Extract //book Navigate //book, price Select price > 50 Navigate //book, title Tagger
28
28 Runtime Optimization Steps Stat Collection Decision Making Plan Migration
29
29 Why Is Migration Needed? When to interrupt the executor. Figure: the optimization cycle and the migration process. Executor: normal execution, prepare for migration. Optimizer: decision making, plan modification. Plans: master plan, automata plan.
30
30 Modifying the Automata: A Bad Example 0 1 Extract //book ε * Navigate //book, //book/price 2 book Select //book/price > 50 Navigate //book, //book/title Extract //book/price 0 1 3 4 title price Extract //book/title ε * SJoin //book 2 book Select //book/price > 50 TCP/IP Illustrated 36.65 … ……
31
31 Modifying the Automata: A Safe Approach … … … … … Safe point Unsafe point 0 1 ε * 2 book 0 1 3 4 title price ε * 2 book FOR $b in doc (bib.xml) //book LET $p := $b/price, $t := $b/title WHERE $p > 50 RETURN $t
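The slide does not spell out what makes a point safe. One plausible reading, taken here purely as an assumption, is that the automata plan may be swapped only at a token boundary where no book element is open, so that no partially matched book is lost. The sketch below defers a requested migration until that condition holds; the function name `run_with_migration` and the plan labels are hypothetical.

```python
def run_with_migration(tokens, request_at):
    """Return the plan in effect after each token.

    A migration request arrives at token index `request_at` and is
    applied at the first safe point, assumed here to be any moment
    when no <book> element is open.
    """
    open_books = 0
    plan = "pull-out"
    pending = False
    log = []
    for i, (kind, value) in enumerate(tokens):
        if i == request_at:
            pending = True
        if kind == "open" and value == "book":
            open_books += 1
        elif kind == "close" and value == "book":
            open_books -= 1
        if pending and open_books == 0:   # safe point reached
            plan = "push-in"
            pending = False
        log.append(plan)
    return log


def book_tokens(title, price):
    return [
        ("open", "book"),
        ("open", "title"), ("pcdata", title), ("close", "title"),
        ("open", "price"), ("pcdata", price), ("close", "price"),
        ("close", "book"),
    ]


tokens = (
    [("open", "bib")]
    + book_tokens("TCP/IP Illustrated", "65.95")
    + book_tokens("Data on the Web", "39.95")
    + [("close", "bib")]
)
# The request arrives in the middle of the first book (token 3) and is
# delayed until that book's close tag (token 8).
log = run_with_migration(tokens, request_at=3)
```

Under this reading, a request that arrives mid-book costs at most the remainder of that book in delay, and the new plan starts from a clean automaton state.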
32
32 Experimental Study Is it feasible to integrate the automata model and the algebraic model? Is push-in vs. pull-out a feasible optimization? Is runtime optimization worthwhile?
33
33 Experimental Setup Java 1.4; Pentium III 750 MHz, 384 MB; Windows XP Professional. Third-party components: Xerces SAX parser, the Kweelt XQuery parser, Rainbow core
34
34 Exp1: System Throughput
35
35 Exp2: Push-in vs. Pull-out
36
36 Exp3: Runtime Optimization
37
37 Related Work Automata-based XML processing: XFilter, YFilter, X-Scan, XTrie, XPush, … Algebraic XQuery engines: XPERANTO, LegoDB, Rainbow, Timber, … Runtime optimization: Tukwila, TelegraphCQ, …
38
38 Contribution While much recent XML stream work (e.g., in SIGMOD03) processes XPath queries, we are among the first to deal with XQuery. We are the first to consider the flexible automata and query algebra integration problem. Push-in vs. pull-out optimization techniques. Prototype system. Experimental study.
39
39 Conclusion Combining automata and query algebra results in a very powerful query model for XML stream processing. Special optimization techniques (e.g., push-in vs. pull-out) can be applied in the integrated system. Data statistics collected at runtime can be exploited via runtime optimization techniques.
40
40 Thanks to: Prof. Elke A. Rundensteiner Prof. Kathi Fisler The Raindrop/Rainbow team All DSRG members
41
41 Questions?