Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December.


1 Streaming XPath / XQuery Evaluation and Course Wrap-Up Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems December 2, 2008

2 Administrivia
* Recall that the final project is due, with a write-up and a 10-minute demo presentation, on Tuesday 12/16, 9-11AM
* Also: course evaluations (at end)

3 XML: Its Roles
* Perhaps used as a superset of HTML for documents, but…
* Most successful as a transport format for sending data between systems
  * SOAP, WSDL, etc.
  * Data interchange formats like ebXML, MAGE-ML, …
* So why would we want to store it in a database to query it, when we could query over XML as it streams across the network?
* (Note: not infinite streams, as in DSMSs, and it's hierarchical)

4 Streaming XPaths and XQueries
* Suppose I give an XPath expression (essentially a restricted regular expression over element paths)
* Can I match it against the parse tree of the data?
* An XQuery takes multiple XPaths in the FOR clause, and iterates over the elements matched by each XPath (binding the variable to each)
  * FOR $i in doc("abc")/xyz, $j in $i/def
* We can think of an XQuery as doing tree matching, which returns a tuple ($i, $j) for each combination of subtrees matching $i and $j
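The binding semantics of the FOR clause above can be sketched in a few lines of Python. This is only an illustration of how ($i, $j) tuples arise: it materializes the whole tree with the standard library's ElementTree rather than streaming, and the function name `bind_tuples` and the sample document are assumptions, not part of the lecture.

```python
import xml.etree.ElementTree as ET

def bind_tuples(xml_text):
    """Return ($i, $j) bindings for: FOR $i in doc(...)/xyz, $j in $i/def."""
    root = ET.fromstring(xml_text)
    # doc(...)/xyz selects the document element only if it is named "xyz".
    i_matches = [root] if root.tag == "xyz" else []
    # $j ranges over the "def" children of each $i binding.
    return [(i, j) for i in i_matches for j in i.findall("def")]

doc = "<xyz><def>1</def><other/><def>2</def></xyz>"
for i, j in bind_tuples(doc):
    print(i.tag, j.text)
```

Each tuple holds references to subtrees, which is the "tuple of trees" view of an XQuery's FOR clause.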

5 Where This Leads
* An XQuery can be broken into two operations:
* A parsing / tree matching stage (FOR and also LET)
  * Finds matches to the variables
  * Returns a tuple of trees
* A (mostly) pipelined SPJ / union / group by / order by engine (WHERE, ORDER BY, nesting in RETURN)
  * Like a regular relational engine extended with an XML tree datatype!
* The first engine to put these things together: Tukwila (Ives+ 2000, 2002)
* IBM DB2 was built upon a nearly identical model: TurboXPath (Josifovski 2004)

6 The Key: SAX (Simple API for XML)
* If we are to match XPaths in streaming fashion, we need a stream of data items
* The original parser model: DOM (Document Object Model)
  * Builds an entire object hierarchy in memory, which is traversable
  * Not incremental! (Until later versions)
* SAX: a series of event notifications
  * open-tag, close-tag, character data
* Idea: build a state machine (or similar mechanism) to match on the events!
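To make the "state machine over SAX events" idea concrete, here is a minimal sketch using Python's standard `xml.sax` module. It handles only child-axis paths such as /xyz/def; the class `PathMatcher`, the helper `match_path`, and the choice to collect the text of matched elements are illustrative assumptions rather than how Tukwila or TurboXPath work.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Match a child-axis-only path (e.g. /xyz/def) against SAX events."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.stack = [0]        # state = number of steps matched; -1 = dead
        self.capture = None     # text buffer while inside a matched element
        self.matches = []

    def startElement(self, name, attrs):
        state = self.stack[-1]
        if 0 <= state < len(self.steps) and self.steps[state] == name:
            state += 1          # advance the state machine by one step
        else:
            state = -1          # this subtree can no longer match
        self.stack.append(state)
        if state == len(self.steps):
            self.capture = []   # full path matched: start collecting text

    def characters(self, content):
        if self.capture is not None:
            self.capture.append(content)

    def endElement(self, name):
        if self.stack.pop() == len(self.steps):
            self.matches.append("".join(self.capture))
            self.capture = None

def match_path(xml_text, steps):
    handler = PathMatcher(steps)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches

doc = "<xyz><def>1</def><other><def>skip</def></other><def>2</def></xyz>"
print(match_path(doc, ["xyz", "def"]))
```

The stack mirrors the open elements, so popping on a close-tag restores the state that was active in the parent. Memory use is bounded by document depth plus whatever is captured for a match.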

7 Different Options
* Many different "streaming XPath" matching algorithms were developed, with some differences
* What to match with (DFA, NFA, lazy DFA, PDA, proprietary format)
* Complexity of the path language (regular path expressions, XPath), axes (downwards, upwards, sideways), internal references (IDREFs, foreign keys), recursive patterns
* Which operations can be pushed into the operator (selection predicates, joins, position indices)
* We'll consider TurboXPath, highlighted in red above (Tukwila's x-scan is highlighted in green)
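As one example of the "what to match with" choice, the following sketch runs an NFA over SAX events to support the descendant axis (//). It is a generic textbook-style construction, not the algorithm of any system named above. The step encoding (`"child"` / `"desc"` pairs), the class `NfaMatcher`, and the helper `match_nfa` are all assumptions made for illustration.

```python
import xml.sax

class NfaMatcher(xml.sax.ContentHandler):
    """Run an NFA for a path with child (/) and descendant (//) steps."""

    def __init__(self, steps):
        super().__init__()
        self.steps = steps      # e.g. [("child", "a"), ("desc", "b")] for /a//b
        self.stack = [{0}]      # one set of active NFA states per open element
        self.path = []          # names of the currently open elements
        self.matches = []

    def startElement(self, name, attrs):
        self.path.append(name)
        active = set()
        for state in self.stack[-1]:
            if state == len(self.steps):
                continue        # accepting state has no outgoing transitions
            axis, wanted = self.steps[state]
            if wanted == name:
                active.add(state + 1)   # this element satisfies the step
            if axis == "desc":
                active.add(state)       # // may also skip over this element
        self.stack.append(active)
        if len(self.steps) in active:
            self.matches.append("/".join(self.path))

    def endElement(self, name):
        self.stack.pop()
        self.path.pop()

def match_nfa(xml_text, steps):
    handler = NfaMatcher(steps)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches

doc = "<a><b/><c><b/></c><d><a><b/></a></d></a>"
print(match_nfa(doc, [("child", "a"), ("desc", "b")]))
```

Keeping a set of states per open element is what distinguishes this from a DFA. A lazy DFA would instead cache each distinct state set the first time it is reached.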

8 From XPath Patterns to Tuples and A Normal Query Plan
Query:
    for $c in doc("d1")//customer
    for $p in doc("d2")//profiles[cid/text() = $c/cid/text()]
    for $o in $c/order[date = '12/12/01']
    return {$c/name} {$p/status} {$o/amount}
Query plan (bottom to top):
    TurboXPath over "d1" outputs ($c/cid/text(), $c/name, $o/amount)
    TurboXPath over "d2" outputs ($p/status, $p/cid/text())
    ⋈ Pipelined join outputs ($c/name, $p/status, $o/amount)
    XML tagger (add "result")
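The slide only says "pipelined join"; one common way to realize that is a symmetric hash join, sketched below as an assumption rather than as what TurboXPath or DB2 actually uses. Tuples from the two pattern matchers arrive interleaved, each is inserted into its own hash table and probed against the other, so results appear as soon as both sides of a match have been seen. The dictionary field names follow the query's cid, name, status, and amount; the sample values are invented.

```python
from collections import defaultdict

def symmetric_hash_join(events, key):
    """Join interleaved ("left" | "right", row) events on a shared key field."""
    left_table = defaultdict(list)
    right_table = defaultdict(list)
    for side, row in events:
        k = row[key]
        if side == "left":
            left_table[k].append(row)           # remember for later right rows
            for other in right_table.get(k, []):
                yield {**row, **other}          # emit as soon as a partner exists
        else:
            right_table[k].append(row)
            for other in left_table.get(k, []):
                yield {**other, **row}

# Tuples as they might arrive from the matchers over "d1" (left) and "d2" (right).
events = [
    ("left", {"cid": "1", "name": "Ann", "amount": "10"}),
    ("right", {"cid": "2", "status": "silver"}),
    ("right", {"cid": "1", "status": "gold"}),
    ("left", {"cid": "2", "name": "Bob", "amount": "5"}),
]
for row in symmetric_hash_join(events, "cid"):
    print(row["name"], row["status"], row["amount"])
```

Neither input has to finish before output begins, which is what makes this kind of join a natural fit on top of streaming pattern matchers.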

9 XPath Processing in TurboXPath

10 Performance Issues
* Predicate pushdown
  * Similar to "sargable predicates": reduces the internal state that must be run through a cross-product to produce tuples
* "Smart" memory management
  * Want to deallocate space from partial pattern matches as early as possible
* Parser efficiency
  * We found that Xerces-C (validating C++ parser used by TurboXPath) was 10x slower than expat (non-validating C parser)
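The effect of predicate pushdown on buffered state can be illustrated with a toy model. Within one customer subtree, the matcher holds the matched names and orders and emits their cross-product as tuples. Filtering orders on the date predicate while matching shrinks what has to be held. The function `customer_tuples`, its "buffered" count, and the sample data are assumptions made for illustration, not measurements or code from any system.

```python
from itertools import product

def customer_tuples(names, orders, date, pushdown):
    """Return (tuples, buffered item count) for one customer subtree."""
    if pushdown:
        # Predicate evaluated inside the matcher: non-matching orders are dropped.
        kept = [o for o in orders if o["date"] == date]
        buffered = len(names) + len(kept)
        tuples = [(n, o["amount"]) for n, o in product(names, kept)]
    else:
        # Predicate evaluated after the cross-product: every order is buffered.
        buffered = len(names) + len(orders)
        tuples = [(n, o["amount"]) for n, o in product(names, orders)
                  if o["date"] == date]
    return tuples, buffered

names = ["Ann"]
orders = [
    {"date": "12/12/01", "amount": 10},
    {"date": "01/01/02", "amount": 7},
    {"date": "12/12/01", "amount": 3},
]
print(customer_tuples(names, orders, "12/12/01", pushdown=True))
print(customer_tuples(names, orders, "12/12/01", pushdown=False))
```

Both settings produce the same tuples; only the amount of intermediate state differs.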

11 Wrapping up…
* This semester has been a whirlwind tour of many different aspects of the "data ecosystem"
  * Storage
  * Concurrency control
  * Query processing
  * Data distribution and streams
  * Heterogeneity, mappings, and reformulation (and the limitations thereof)
  * Many styles of data integration
  * XML processing
* I hope I've been able to convey some of what makes this field both relevant and, I think, cool…

12 Where There Is Room for More Work (Among Many Topics)
* Storage: rows versus columns
* Concurrency control
* Query processing
  * Is there a theory of adaptivity, and an optimal scheme?
* Data distribution, networks, and streams
  * How do we distribute to 10,000 nodes? What is the relationship between network communication and query processing?
* Data integration, better support for collaboration
  * How can we make it less human-intensive?
* "Lightweight databases"
* Probabilistic databases
* Visualization and interfaces
* Databases meets machine learning and info retrieval

13 A Sampler of Some of the Systems Work by (Some) Major DB Groups
* Washington: Mystiq – probabilistic databases; distrib. streams
* Stanford: Trio – probabilities and "lineage" meets databases
* Cornell: databases meets games; probabilistic databases
* Wisconsin: Cimple; database support for monitoring clusters
* MIT: Sensor query processing; signal processing; column stores
* Berkeley: Data management for sensors and networks
* Maryland: Querying data models; learning and probabilities meets databases
* Penn: Orchestra; data and workflow provenance; keyword querying with learned ranks over databases; lightweight data integration; networking meets databases; sensor integration

14 Thanks!!!
* I had a great time this semester. I hope you learned a lot and found it to be enjoyable
* I'm looking forward to seeing your projects!

