Raindrop: An Algebra-Automata Combined XQuery Engine over XML Streams

Hong Su, Elke Rundensteiner, Murali Mani, Ming Li
Worcester Polytechnic Institute, Worcester, MA
VLDB 2004

Stream Processing
[Figure: data sources, networks, data requesters]

What’s Special for XML Stream Processing
- Token-by-token access manner
- Pattern retrieval
- Filtering + restructuring

Example query:

FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bidder[sameAddr]
WHERE $b/*/phone = "508"
Return <auction> $b, $c </auction>

[Figure: "Pattern Retrieval on Token Streams": the tokens <auctions> <auction> <seller> <primary> <phone> arriving along a timeline]

Token: not a counterpart of a self-contained tuple.

Notes: Numerous projects address general issues of stream query processing. They usually assume a relational model, which means the data sources they consider are flat, structured tuples. Compared with such a relational model, what are the specific challenges for XML streams? An XML stream is accessed token by token, but a token is not a direct counterpart of a tuple, since a single token is meaningless without the context the other tokens provide.
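To make the token-by-token access manner concrete, here is a minimal sketch that turns incoming XML text into start, end and text tokens using Python's standard `xml.sax` incremental parser. The names `TokenCollector` and `tokenize` and the `(kind, value)` token shape are assumptions made for this illustration; they are not taken from Raindrop.

```python
import xml.sax


class TokenCollector(xml.sax.ContentHandler):
    """Collects SAX callbacks as (kind, value) tokens (hypothetical token shape)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def startElement(self, name, attrs):
        self.tokens.append(("start", name))

    def endElement(self, name):
        self.tokens.append(("end", name))

    def characters(self, content):
        # Skip whitespace-only text between tags.
        if content.strip():
            self.tokens.append(("text", content.strip()))


def tokenize(chunks):
    """Yield tokens while XML text arrives chunk by chunk."""
    handler = TokenCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
        # Hand over whatever the parser has produced so far.
        while handler.tokens:
            yield handler.tokens.pop(0)
    parser.close()
    # Tokens the parser only released on close.
    while handler.tokens:
        yield handler.tokens.pop(0)


# Chunks are split at tag boundaries so that no text node is cut in half.
stream = ["<auction><seller><phone>", "508</phone>", "</seller></auction>"]
for token in tokenize(stream):
    print(token)
```

A single token such as `("text", "508")` says nothing on its own; only the surrounding start and end tokens reveal that it is a seller's phone number.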

Two Computation Paradigms
- Automata-based [yfilter, xscan, xsm, xsq, xpush…]
- Algebraic [niagara00, …]

FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bidder[sameAddr]
WHERE $b/*/phone = "508"
Return <auction> $b, $c </auction>

[Figure, left, "Automata": an automaton with states 1 to 9 and transitions labeled auction, seller, homepage, *, phone, bid, bidder, sameAddr]
[Figure, right, "Algebra": an operator plan with Tagger on top, then Navigate $a, /bidder->$c, Navigate $a, /seller->$b, and Navigate stream(bids), //auction->$a]

Notes: Generally speaking, automata-based systems are good at pattern retrieval over tokenized XML streams, since automata are designed for matching a language grammar against an alphabet. In the XML scenario, you can think of XPath patterns as the grammar and tokens as the alphabet. The algebraic system, on the other hand, is good at both expressing and optimizing filtering and restructuring. Handling tokens does not fit naturally, since tokens are not a direct counterpart of tuples, which are the foundation of an algebraic system. In fact, the Niagara system, a typical algebraic system for XML stream processing, ignores how the tokenized inputs are processed into tuples in the first place. It is intuitive to combine the two models because they complement each other. The objective of the project is to integrate these two paradigms into one.
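The automata side can be illustrated with a small stack-based matcher for path patterns over tokens. This is a generic sketch of the technique, not the automaton construction of any of the cited systems; the function name `match_starts`, the `(axis, name)` step encoding and the `(kind, value)` token shape are assumptions for this example.

```python
def match_starts(steps, tokens):
    """Return indices of start tokens at which the path pattern is fully matched.

    steps:  list of (axis, name) pairs; axis is "/" (child) or "//" (descendant),
            name is an element name or "*".
    State i means "the first i steps have been matched".
    """
    final = len(steps)
    stack = [{0}]  # one set of active states per open element
    hits = []
    for i, (kind, value) in enumerate(tokens):
        if kind == "start":
            active = set()
            for s in stack[-1]:
                if s == final:
                    continue
                axis, name = steps[s]
                if name in ("*", value):
                    active.add(s + 1)  # transition on the matching tag
                if axis == "//":
                    active.add(s)      # descendant axis: keep waiting deeper down
            stack.append(active)
            if final in active:
                hits.append(i)
        elif kind == "end":
            stack.pop()                # restore the states of the parent element
    return hits


tokens = [
    ("start", "auctions"), ("start", "auction"),
    ("start", "seller"), ("end", "seller"),
    ("start", "bid"), ("start", "bidder"), ("end", "bidder"), ("end", "bid"),
    ("end", "auction"), ("end", "auctions"),
]
print(match_starts([("//", "auction"), ("/", "seller")], tokens))
print(match_starts([("//", "auction"), ("/", "bid"), ("/", "bidder")], tokens))
```

The matcher only reports where a pattern is found. Filtering on values and restructuring the result are left to whatever consumes the matches, which is where the algebraic side comes in.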

Comparison of Two Paradigms
Automata Paradigm | Algebra Paradigm
Good for pattern retrieval on tokens | Does not support token inputs
Need patches for filtering and restructuring | Good for filtering and restructuring
Present all details on same low level | Support multiple descriptive levels (e.g., logical plan, physical plan)
Little studied as query processing paradigm | Well studied as query processing paradigm

Notes: set-oriented; high-level (can be better reasoned over); sound. Set vs. token; logic rewrite vs. transitions.

- Either paradigm has deficiencies
- Both paradigms complement each other

Four-Level Algebraic Framework
The Raindrop framework intends to integrate both paradigms into one.

Abstraction level, from high (declarative) to low (procedural):

- Semantics-Focused Plan: express the semantics of the query regardless of input sources
- Stream Logical Plan: accommodate tokenized streams / automata computation
- Stream Physical Plan: describe implementation details of operators
- Stream Execution Plan: decide how an operator is invoked (scheduling)

Notes: Mention DB2 here: rewriting, physical plan completion. In part I, I will focus on the highest two levels, since they are more about modeling; that is, they define the data model and the semantics of each operator. The remaining two are more about implementation details, so I will postpone introducing the lowest two levels until part II.

Level I: Semantics-Focused Plan
- Express query semantics regardless of stored or stream input sources [Rainbow-ZPR02]
- Reuse existing general optimization techniques
  - Decorrelation
  - Cancel duplicate navigation operators
  - …

Example Semantics-Focused Plan
Query:

FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bidder[sameAddr]
WHERE $b/*/phone = "508"
Return <auction> $b, $c </auction>

Stream data:

<auctions>
  <auction>
    <seller>
      <primary><phone>508</phone></primary>
      <secondary><phone>613</phone></secondary>
    </seller>
    <bid><bidder>…</bidder><bidder>…</bidder></bid>
  </auction>
  …

Plan and input/output data (top operator first):

NavUnnest $a, /bid/bidder->$c
    output: source = <auctions>…</auctions>, $a = <auction>…</auction>, $b = <seller>…</seller>, $c = <bidder>…</bidder>
NavUnnest $a, /seller->$b
    output: source = <auctions>…</auctions>, $a = <auction>…</auction>, $b = <seller>…</seller>
NavUnnest stream(bids), //auction->$a
    output: source = <auctions>…</auctions>, $a = <auction>…</auction>
source
    output: source = <auctions>…</auctions>
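The tuple-extending behavior of the NavUnnest chain above can be sketched over a stored (fully parsed) document with Python's `xml.etree.ElementTree`. The function `nav_unnest`, the dictionary-based tuples and the bidder contents `b1` and `b2` are placeholders invented for this illustration; the slide elides the real bidder contents.

```python
import xml.etree.ElementTree as ET

# Placeholder document; "b1" and "b2" stand in for the elided bidder contents.
DOC = """<auctions><auction>
  <seller>
    <primary><phone>508</phone></primary>
    <secondary><phone>613</phone></secondary>
  </seller>
  <bid><bidder>b1</bidder><bidder>b2</bidder></bid>
</auction></auctions>"""


def nav_unnest(tuples, src, path, dst):
    """For each input tuple, navigate `path` from column `src` and emit one
    output tuple per node found, bound to the new column `dst`."""
    for t in tuples:
        for node in t[src].findall(path):
            yield {**t, dst: node}


source = [{"source": ET.fromstring(DOC)}]
plan = nav_unnest(source, "source", ".//auction", "$a")
plan = nav_unnest(plan, "$a", "seller", "$b")
plan = nav_unnest(plan, "$a", "bid/bidder", "$c")

for row in plan:
    print(row["$a"].tag, row["$b"].tag, row["$c"].text)
```

Each operator only adds a column, so one auction with one seller and two bidders yields two output tuples. Nothing in this level says whether the input is stored or streamed.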

Level II: Stream Logical Plan
Extend the semantics-focused plan to accommodate tokenized stream inputs:

- New input data format: tokens
- New operators: StreamSource, TokenNavigate, ExtractUnnest, ExtractNest, StructuralJoin
- New rewrite rules: push-into / pull-out-of automata

One Uniform Algebraic View
[Figure: Algebraic Stream Logical Plan. From bottom to top: XML data stream, token-based plan (automata plan), tuple stream, tuple-based plan, query answer]

Notes: At the higher level, all computations are modeled in an algebraic manner. At the lower level, the token-based subplan deals with token inputs and organizes tokens into tuples. These tuples are then fed to the tuple-based plan sitting above.

Modeling Automata in Algebraic Plan: Black Box[XScan01] vs. White Box
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bid/bidder[sameAddr]
WHERE $b/*/phone = "508"
Return <auction> $b, $c </auction>

Black Box:

XScan
    $a := stream(bids)//auction
    $b := $a/seller
    $c := $a/bid/bidder

White Box (top operator first):

StructuralJoin $a
ExtractUnnest $a, $b
ExtractUnnest $a, $c
TokenNavigate $a, /seller->$b
TokenNavigate $a, /bid/bidder->$c
TokenNavigate stream(bids), //auction->$a

Notes: Even when modeling automata computation in the algebraic plan, there are multiple choices. The first design choice is to wrap the whole automata computation into one single operator (the black box). This operator exposes an interface to its consumer but no internal logical details. Instead, we propose a white box approach, in which the automata computation is modeled as a plan consisting of operators of finer granularity. In this way, the internal logic of the plan can be better understood. A second advantage is that query rewriting techniques can be applied within the automata plan as well as across the whole plan, which provides more optimization opportunities. We will touch on that a few slides later.

Data Model in Algebraic Plan Modeling Automata
[Figure: data flowing through the white box plan, from bottom to top:
- StreamSource emits the token stream: <auctions> <auction> <seller> <primary> <phone> …
- TokenNavigate stream(bids), //auction->$a; TokenNavigate $a, /seller->$b; TokenNavigate $a, /bid/bidder->$c
- ExtractUnnest $a, $b composes the tokens <seller> <primary> <phone> 508 </phone> … </primary> … into <seller>…</seller>
- ExtractUnnest $a, $c composes the tokens <bidder> <bidderid> 0314 … into <bidder>...</bidder>
- StructuralJoin $a outputs tuples pairing <seller>…</seller> with <bidder>...</bidder>]

Notes: No automaton is visible in this figure; say that it will be shown later.
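How tokens become tuples in such a plan can be sketched in a single pass over the stream. The sketch below folds the roles of TokenNavigate (recognizing where $a, $b and $c start and end), ExtractUnnest (composing the tokens of a matched element) and StructuralJoin (pairing the $b and $c elements of the same $a) into one function for brevity. It is an illustration under simplifying assumptions (auctions are not nested, predicates are ignored), not Raindrop's operator implementation; `structural_join_plan` and the `(kind, value)` token shape are hypothetical.

```python
from itertools import product


def structural_join_plan(tokens):
    """Yield (seller_xml, bidder_xml) pairs, one batch per closed <auction>."""
    path = []                      # names of the currently open elements
    a_depth = None                 # len(path) just after <auction> opened
    targets = {("seller",): "$b", ("bid", "bidder"): "$c"}
    extracted = {"$b": [], "$c": []}
    current = None                 # (variable, depth, parts) while extracting

    for kind, value in tokens:
        if kind == "start":
            path.append(value)
            if a_depth is None and value == "auction":
                a_depth = len(path)                      # navigate to $a
            elif a_depth is not None and current is None:
                var = targets.get(tuple(path[a_depth:]))
                if var:                                  # navigate to $b or $c
                    current = (var, len(path), [])
            if current:
                current[2].append(f"<{value}>")
        elif kind == "text":
            if current:
                current[2].append(value)
        else:  # end token
            if current:
                current[2].append(f"</{value}>")
                if len(path) == current[1]:              # extraction complete
                    extracted[current[0]].append("".join(current[2]))
                    current = None
            elif a_depth is not None and len(path) == a_depth:
                # </auction>: join the buffered $b and $c of this $a.
                yield from product(extracted["$b"], extracted["$c"])
                extracted = {"$b": [], "$c": []}
                a_depth = None
            path.pop()


tokens = [
    ("start", "auctions"), ("start", "auction"),
    ("start", "seller"), ("start", "phone"), ("text", "508"),
    ("end", "phone"), ("end", "seller"),
    ("start", "bid"),
    ("start", "bidder"), ("text", "b1"), ("end", "bidder"),
    ("start", "bidder"), ("text", "b2"), ("end", "bidder"),
    ("end", "bid"), ("end", "auction"), ("end", "auctions"),
]
for seller, bidder in structural_join_plan(tokens):
    print(seller, bidder)
```

The join fires on the end token of $a, which is the earliest point at which all sellers and bidders of that auction are known to have arrived.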

For Details of Levels III and IV, please refer to
- "Automaton Meets Query Algebra: Towards a Unified Model for XQuery Evaluation over XML Data Streams", ER 2003
- "Raindrop: A Uniform and Layered Algebraic Framework for XQueries on XML Streams", CIKM 2003
- "Raindrop: A Uniform and Layered Algebraic Framework for XQueries on XML Streams", Journal Submission 2004

Optimization I: Computation Into or Out of Automata?
[Figure: three equivalent plans connected by "Into Automata" and "Out of Automata" rewrites (top operator first in each plan):

No computation in the automata:
    NavigateUnnest $a, /bid/bidder->$c
    NavigateUnnest $a, /seller->$b
    NavUnnest stream(bids), //auction->$a

Only //auction in the automata:
    NavigateUnnest $a, /bid/bidder->$c
    NavigateUnnest $a, /seller->$b
    Automata Plan:
        ExtractUnnest stream(bids), $a
        TokenNavigate stream(bids), //auction->$a

All navigation in the automata:
    Automata Plan:
        StructuralJoin $a
        ExtractUnnest $a, $b
        ExtractUnnest $a, $c
        TokenNavigate $a, /seller->$b
        TokenNavigate $a, /bid/bidder->$c
        TokenNavigate stream(bids), //auction->$a]

Experimentation Results

Optimization II: Semantic Query Optimization
- General schema-based optimizations
  - Eliminate predicate/join, …
  - Focus on operators manipulating flat values
- XML-specific schema-based optimizations
  - Focus on pattern retrieval
  - Fall into two categories
    - General XML SQO: minimize query tree [YCL+-AT&T 01]
    - Stream XML SQO (our focus)

Stream-Specific XML SQO
Observations:
- Pattern retrieval over tokens relies solely on document-order traversal
- Schema constraints help expedite document-order traversal

State of the art [XPush03]: covers limited queries (boolean XPath match) and one type of constraint.

Our goals:
- Support more powerful queries (XQuery)
- Support more types of constraints (XSchema)

Step I: Construct Query Graph
FOR $a in stream(bids)//auction,
    $b in $a/seller[homepage],
    $c in $a/bid/bidder[sameAddr]
WHERE $b/*/phone = "508"
Return <auction> $b, $c </auction>

[Figure: (a) Example Query; (b) Query Tree]

Example XML Schema

Step II: Apply Optimization Rules
Offer optimization rules utilizing:
- occurrence constraints
- exclusive constraints
- order constraints

Apply rules in an order ensuring that:
- no beneficial rule is missed
- no redundant rule is introduced

Step III: Translate Rewritten Query Graph Back to Plan (I)
Utilize Occurrence Constraints: when </phone> is encountered twice, check /*/phone; if the predicate fails, suspend states s2 and s3.
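The idea behind the occurrence-constraint rule can be sketched as an early decision on a predicate. The sketch assumes, purely for illustration, that the schema allows at most two phone elements per seller; the example schema itself is not reproduced in this transcript, so `max_phones=2`, the function name `decide_phone_predicate` and the `(kind, value)` token shape are all assumptions. Returning early stands in for suspending the automaton states.

```python
def decide_phone_predicate(tokens, wanted="508", max_phones=2):
    """Decide `*/phone = wanted` for one seller's tokens.

    Returns (result, tokens_read). Under the assumed occurrence constraint,
    the predicate is known to fail once `max_phones` phones have closed
    without a match, so the remaining tokens need not be examined.
    """
    seen = 0
    in_phone = False
    read = 0
    for kind, value in tokens:
        read += 1
        if kind == "start" and value == "phone":
            in_phone = True
        elif kind == "text" and in_phone:
            if value == wanted:
                return True, read          # predicate satisfied
        elif kind == "end" and value == "phone":
            in_phone = False
            seen += 1
            if seen == max_phones:
                return False, read         # no further phone can occur
    return False, read


seller = [
    ("start", "seller"),
    ("start", "primary"), ("start", "phone"), ("text", "613"),
    ("end", "phone"), ("end", "primary"),
    ("start", "secondary"), ("start", "phone"), ("text", "617"),
    ("end", "phone"), ("end", "secondary"),
    ("start", "homepage"), ("text", "…"), ("end", "homepage"),
    ("end", "seller"),
]
print(decide_phone_predicate(seller), "of", len(seller), "tokens")
```

Without the constraint, the matcher would have to stay active until </seller>, because another phone could still appear.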

Step III: Translate Rewritten Query Graph Back to Plan (II)
Utilize Exclusive Constraints: when <billTo> or <shipTo> is encountered once, suspend states s2 and s9.

Step III: Translate Rewritten Query Graph Back to Plan (III)
Utilize Order Constraints: when <primary> is encountered once, check /homepage; if it is not present, suspend states s10, s3 and s2.

Thank WPI DSRG Rainbow Team for XAT Algebra Support