Published by Marshall Hensley. Modified over 9 years ago.
1
Streaming XPath / XQuery Evaluation and Course Wrap-Up
Zachary G. Ives
University of Pennsylvania
CIS 650 – Implementing Data Management Systems
December 2, 2008
2
Administrivia
* Recall that the final project is due, with a write-up and a 10-minute demo presentation, on Tuesday 12/16, 9-11 AM
* Also: course evaluations (at end)
3
XML: Its Roles
* Perhaps used as a superset of HTML for documents, but…
* Most successful as a transport format for sending data between systems
  * SOAP, WSDL, etc.
  * Data interchange formats like ebXML, MAGE-ML, …
* So why would we want to store it in a database to query it, when we could query over XML as it streams across the network?
  * (Note: not infinite streams, as in DSMSs, and it’s hierarchical)
4
Streaming XPaths and XQueries
* Suppose I give an XPath expression (which is a restricted form of regular path expression)
  * Can I match it against the parse tree of the data?
* An XQuery takes multiple XPaths in the FOR clause, and iterates over the elements matched by each XPath (binding the variable to each)
    for $i in doc("abc")/xyz, $j in $i/def
* We can think of an XQuery as doing tree matching, which returns a tuple ($i, $j) for each pair of trees matching $i and $j
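To make the binding-tuple idea concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` on an in-memory tree (so it is not streaming yet). The element names `xyz` and `def` come from the slide; the document contents and the helper name `bindings` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for doc("abc"); only the element names are from the slide.
DOC = "<xyz><def>a</def><def>b</def><other>c</other></xyz>"

def bindings(doc_root):
    """Yield one ($i, $j) tuple per combination of matching subtrees."""
    # $i in doc("abc")/xyz : the document element, if it is named xyz
    for i in ([doc_root] if doc_root.tag == "xyz" else []):
        # $j in $i/def : each def child of the current $i
        for j in i.findall("def"):
            yield (i, j)

root = ET.fromstring(DOC)
tuples = [(i.tag, j.text) for i, j in bindings(root)]
print(tuples)
```

Each tuple holds references to whole subtrees; the rest of the query (WHERE, RETURN) can then operate on these tuples much as a relational engine would.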
5
Where This Leads
An XQuery can be broken into two operations:
* A parsing / tree matching stage (FOR and also LET)
  * Finds matches to the variables
  * Returns a tuple of trees
* A (mostly) pipelined SPJ / union / group by / order by engine (WHERE, ORDER BY, nesting in RETURN)
  * Like a regular relational engine extended with an XML tree datatype!
* The first engine to put these things together: Tukwila (Ives+ 2000, 2002)
* IBM DB2 was built upon a nearly identical model: TurboXPath (Josifovski 2004)
6
The Key: SAX (Simple API for XML)
* If we are to match XPaths in streaming fashion, we need a stream of data items
* The original parser model: DOM (Document Object Model)
  * Builds an entire object hierarchy in memory, which is traversable
  * Not incremental! (Until later versions)
* SAX: a series of event notifications
  * open-tag, close-tag, character data
* Idea: build a state machine (or similar mechanism) to match on the events!
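The state-machine idea can be sketched with Python's standard `xml.sax` module. This is a simplified illustration, not the algorithm of any particular system: it handles only child-axis paths such as `/db/customer/name`, and the class name `PathMatcher` and the sample document are assumptions made for the example.

```python
import xml.sax

class PathMatcher(xml.sax.ContentHandler):
    """Match a child-axis-only path (e.g. /db/customer/name) over SAX events."""

    def __init__(self, path):
        super().__init__()
        self.steps = path.strip("/").split("/")
        self.stack = []      # one bool per open element: is it still on the path?
        self.text = None     # text buffer, non-None only inside a full match
        self.matches = []

    def startElement(self, name, attrs):
        depth = len(self.stack)
        parent_ok = depth == 0 or self.stack[-1]
        on_path = (parent_ok and depth < len(self.steps)
                   and self.steps[depth] == name)
        self.stack.append(on_path)
        if on_path and depth == len(self.steps) - 1:
            self.text = []   # reached the last step: start buffering

    def characters(self, content):
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        on_path = self.stack.pop()
        if on_path and len(self.stack) == len(self.steps) - 1:
            self.matches.append("".join(self.text))
            self.text = None  # release the buffer as soon as the match closes

DATA = b"""<db>
  <customer><name>Ann</name><cid>1</cid></customer>
  <customer><name>Bob</name><cid>2</cid></customer>
  <supplier><name>Acme</name></supplier>
</db>"""

handler = PathMatcher("/db/customer/name")
xml.sax.parseString(DATA, handler)
print(handler.matches)
```

The only state kept is a stack whose depth equals the current nesting depth, plus the text of the match in progress, so memory use does not grow with document size.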
7
Different Options
Many different “streaming XPath” matching algorithms were developed, differing in:
* What to match with (DFA, NFA, lazy DFA, PDA, proprietary format)
* Complexity of the path language (regular path expressions, XPath), axes (downwards, upwards, sideways), internal references (IDREFs, foreign keys), recursive patterns
* Which operations can be pushed into the operator (selection predicates, joins, position indices)
We’ll consider TurboXPath, highlighted in red above (Tukwila’s x-scan is highlighted in green)
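As one illustration of the "what to match with" choice, the sketch below simulates an NFA over start/end events so that the descendant axis (`//`) can be handled: each open element carries the set of NFA states that are active beneath it. This is a generic textbook-style simulation under simplifying assumptions (downward axes only, no predicates), not the TurboXPath or x-scan algorithm; `parse_path` and `count_matches` are hypothetical helper names.

```python
import io
import re
import xml.etree.ElementTree as ET

def parse_path(path):
    """Turn '//a/b' into [('desc', 'a'), ('child', 'b')]."""
    return [("desc" if sep == "//" else "child", name)
            for sep, name in re.findall(r"(//|/)([^/]+)", path)]

def count_matches(xml_bytes, path):
    """Count elements matching `path`, reading the document as an event stream."""
    steps = parse_path(path)
    accept = len(steps)            # state k means "the first k steps are matched"
    stack = [frozenset({0})]       # active NFA states for each open element
    hits = 0
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            active = set()
            for k in stack[-1]:
                if k == accept:
                    continue
                axis, name = steps[k]
                if name == elem.tag:
                    active.add(k + 1)   # this element satisfies step k
                if axis == "desc":
                    active.add(k)       # '//' may skip this level entirely
            if accept in active:
                hits += 1
            stack.append(frozenset(active))
        else:
            stack.pop()
            elem.clear()                # discard content we no longer need
    return hits

SAMPLE = b"<r><a><a><b/></a></a><b/></r>"
print(count_matches(SAMPLE, "//a/b"), count_matches(SAMPLE, "//b"))
```

A DFA-based matcher would precompute (or lazily build) one deterministic state per reachable set of NFA states, trading memory for less work per event.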
8
From XPath Patterns to Tuples and A Normal Query Plan
  for $c in doc("d1")//customer
  for $p in doc("d2")//profiles[cid/text() = $c/cid/text()]
  for $o in $c/order[date = '12/12/01']
  return {$c/name} {$p/status} {$o/amount}
Query plan, from the leaves up:
* TurboXPath over “d1” produces tuples ($c/cid/text(), $c/name, $o/amount)
* TurboXPath over “d2” produces tuples ($p/status, $p/cid/text())
* Pipelined join (⋈) of the two tuple streams produces ($c/name, $p/status, $o/amount)
* XML tagger (add “result”)
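The slide does not say which join algorithm the plan uses. One common way to make a join pipelined is a symmetric hash join, sketched below as an assumption: tuples from either input are hashed on arrival and immediately probed against the other side, so results appear before either input is complete. The sample values and the function name `symmetric_hash_join` are made up for illustration.

```python
from collections import defaultdict

def symmetric_hash_join(events):
    """events: iterable of (side, key, payload) in arrival order, side 'L' or 'R'.
    Yields (left_payload, right_payload) as soon as both halves have arrived."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, payload in events:
        other = "R" if side == "L" else "L"
        tables[side][key].append(payload)           # build into our own table
        for match in tables[other].get(key, []):    # probe the other table
            yield (payload, match) if side == "L" else (match, payload)

# Interleaved arrival of tuples from the two matchers, keyed on cid.
# L payload: ($c/name, $o/amount) from "d1"; R payload: ($p/status,) from "d2".
arrivals = [
    ("L", "c1", ("Ann", "10.00")),
    ("R", "c2", ("gold",)),
    ("R", "c1", ("silver",)),
    ("L", "c2", ("Bob", "25.00")),
]

# Project each joined pair to ($c/name, $p/status, $o/amount).
result = [(l[0], r[0], l[1]) for l, r in symmetric_hash_join(arrivals)]
print(result)
```

The first output tuple is produced after the third arrival, which is the property that matters when both inputs are network streams of unknown speed.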
9
XPath Processing in TurboXPath (figure)
10
Performance Issues
* Predicate pushdown
  * Similar to “sargable predicates”: reduces the internal state that must be run through a cross-product to produce tuples
* “Smart” memory management
  * Want to deallocate space from partial pattern matches as early as possible
* Parser efficiency
  * We found that Xerces-C (validating C++ parser used by TurboXPath) was 10x slower than expat (non-validating C parser)
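The effect of predicate pushdown on buffered state can be shown with a toy model. This is only an illustration of the idea, not TurboXPath's buffer management: for one matched customer subtree, the matcher buffers the name and order matches and emits their cross-product. The function name `tuples_for_customer` and the sample data are assumptions.

```python
from itertools import product

def tuples_for_customer(names, orders, pushed_predicate=None):
    """Return (items_buffered, tuples) for one matched customer subtree."""
    if pushed_predicate is not None:
        # Pushdown: non-qualifying orders are dropped before they are buffered.
        orders = [o for o in orders if pushed_predicate(o)]
    buffered = len(names) + len(orders)
    return buffered, list(product(names, orders))

def is_target(order):
    return order["date"] == "12/12/01"

names = ["Ann"]
orders = [
    {"date": "12/12/01", "amount": "10.00"},
    {"date": "01/05/02", "amount": "7.50"},
    {"date": "02/09/02", "amount": "3.25"},
]

# Without pushdown: buffer everything, cross-product, then filter downstream.
late_buffered, late = tuples_for_customer(names, orders)
late = [t for t in late if is_target(t[1])]

# With pushdown: filter inside the matcher, before the cross-product.
early_buffered, early = tuples_for_customer(names, orders, is_target)

print(late_buffered, early_buffered, late == early)
```

Both plans return the same tuples; the pushed-down version simply holds fewer items while the customer subtree is still open.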
11
Wrapping up…
This semester has been a whirlwind tour of many different aspects of the “data ecosystem”
* Storage
* Concurrency control
* Query processing
* Data distribution and streams
* Heterogeneity, mappings, and reformulation (and the limitations thereof)
* Many styles of data integration
* XML processing
I hope I’ve been able to convey some of what makes this field both relevant and, I think, cool…
12
Where There Is Room for More Work (Among Many Topics)
* Storage: rows versus columns
* Concurrency control
* Query processing
  * Is there a theory of adaptivity, and an optimal scheme?
* Data distribution, networks, and streams
  * How do we distribute to 10,000 nodes?
  * What is the relationship between network communication and query processing?
* Data integration, better support for collaboration
  * How can we make it less human-intensive?
* “Lightweight databases”
* Probabilistic databases
* Visualization and interfaces
* Databases meets machine learning and info retrieval
13
A Sampler of Some of the Systems Work by (Some) Major DB Groups
* Washington: Mystiq – probabilistic databases; distributed streams
* Stanford: Trio – probabilities and “lineage” meets databases
* Cornell: databases meets games; probabilistic databases
* Wisconsin: Cimple; database support for monitoring clusters
* MIT: Sensor query processing; signal processing; column stores
* Berkeley: Data management for sensors and networks
* Maryland: Querying data models; learning and probabilities meets databases
* Penn: Orchestra; data and workflow provenance; keyword querying with learned ranks over databases; lightweight data integration; networking meets databases; sensor integration
14
Thanks!!!
* I had a great time this semester. I hope you learned a lot and found it to be enjoyable
* I’m looking forward to seeing your projects!