MonetDB/XQuery: Using a Relational DBMS for XML Peter Boncz CWI The Netherlands
Outline
Basic XML / XQuery
Introduction of Pathfinder and MonetDB projects
Relational XQuery
–XPath steps in the pre/post plane
–Translating for-loops, and beyond
Optimizations
–Order prevention
–Loop-Lifted Staircase join
–Join recognition
Outlook
–Conclusions
Peter Boncz | TU Delft | Pathfinder - MonetDB/XQuery
XML
Standard, flexible syntax for data exchange
–Regular, structured data: database content of all kinds (inventory, billing, orders, …); “small” typed values
–Irregular, unstructured text: documents of all kinds (transcripts, books, legal briefs, …); “large” untyped values
Lingua franca of B2B applications…
–Increase access to products & services
–Integrate disparate data sources
–Automate business processes
… and numerous other application domains
–Bio-informatics, library science, …
XML : A First Look XML document describing catalog of books No Such Thing as a Bad Day Hamilton Jordan Longstreet Press, Inc Publisher : This book is the moving account of one man's successful battles against three cancers... No Such Thing as a Bad Day is warmly recommended.
XQuery 1.0 Functional, strongly-typed query language XQuery 1.0 = XPath 2.0 for navigation, selection, extraction + A few more expressions For-Let-Where-Order By-Return (FLWOR) XML construction Operators on types + User-defined functions & modules + Strong typing
XSLT vs. XQuery
XSLT 1.0: XML → XML, HTML, Text
–Loosely-typed scripting language
–Format XML in HTML for display in browser
–Must be highly tolerant of variability/errors in data
XQuery 1.0: XML → XML
–Strongly-typed query language
–Large-scale database access
–Must guarantee safety/correctness of operations on data
Over time, XSLT & XQuery may both serve needs of many application domains
XQuery will become a hidden, commodity language
Navigation, Selection, Extraction
Titles of all books published by Longstreet Press
$cat/catalog/book[publisher="Longstreet Press"]/title
=> No Such Thing As A Bad Day
Publications with Jerome Simeon as author or editor
$cat//*[(author|editor) = "Jerome Simeon"]
=> XQuery from the Experts …, XQuery Formal Semantics …
Transformation & Construction
First author & title of books published by A/W
for $b in $cat//book[publisher = "Addison Wesley"]
return { $b/author[1], $b/title }
=> Don Chamberlin, XQuery from the Experts
Literals & Constants Strings “hello world” Booleans fn:true() fn:false() –Avoid lexical conflicts with, e.g., //false $v/flag/true Numbers 12 [integer], 10.3E2 [double], 1.0 [decimal] xs:decimal(“1.0”) xs:unsignedLong(“ ”) Dates, times, & (totally ordered) durations xs:date(" ") xs:time(“04:20:00") xdt:dayTimeDuration("P21D") xdt:yearMonthDuration("P1Y2M") User-defined atomic types mycompany:inventory-id(“XXX-123")
Functions & Operators Arithmetic & comparison operators –Numerics E2 (-, *, div, idiv, mod) 1900, =, =) –Dates/Times/Durations xs:date(“ ”) + xdt:dayTimeDuration(“P10D”) xs:date(“ ”) >= xs:date(“ ”) –Nodes //incision >) Built-in functions –Strings fn:starts-with(“WWW 2004”, “WWW”) fn:matches(“WWW 2004”, “^W*”) –Sequences fn:avg((1,2,3,4)) fn:distinct-values(//price) –All other XML Schema primitive types …
Selection & Projection
Titles of all books published by Longstreet Press
$cat/catalog/book[publisher="Longstreet Press"]/title
=> No Such Thing As A Bad Day
Publications with Jérôme Siméon as author or editor
$cat//*[(author|editor) = "Jérôme Siméon"]
=> XQuery from the Experts …, XQuery 1.0 Formal Semantics …
Books with “good” reviews
$cat//book[fn:contains(review/text(), "2 thumbs up")]
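The two selections above can be tried outside an XQuery engine. The following is a minimal sketch using Python's `xml.etree.ElementTree`, which supports only a small XPath subset. The catalog markup is an assumption made for illustration: its element names follow the paths used in the queries, and the helper names `titles_by_publisher` and `publications_by_person` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element names follow the paths used in the queries.
CATALOG = """
<catalog>
  <book>
    <title>No Such Thing As A Bad Day</title>
    <author>Hamilton Jordan</author>
    <publisher>Longstreet Press</publisher>
  </book>
  <book>
    <title>XQuery from the Experts</title>
    <author>Don Chamberlin</author>
    <author>Jerome Simeon</author>
    <publisher>Addison Wesley</publisher>
  </book>
</catalog>
"""

cat = ET.fromstring(CATALOG)  # the root element is <catalog>


def titles_by_publisher(root, publisher):
    """Mimics $cat/catalog/book[publisher="..."]/title."""
    path = f"book[publisher='{publisher}']/title"
    return [title.text for title in root.findall(path)]


def publications_by_person(root, name):
    """Mimics $cat//*[(author|editor) = "..."], returning the titles."""
    hits = []
    for elem in root.iter():
        people = elem.findall("author") + elem.findall("editor")
        # Existential comparison: one matching author or editor suffices.
        if any(person.text == name for person in people):
            hits.append(elem.findtext("title"))
    return hits


print(titles_by_publisher(cat, "Longstreet Press"))
print(publications_by_person(cat, "Jerome Simeon"))
```

The second helper spells out the existential semantics of the `=` comparison in the predicate: a node qualifies as soon as any of its `author` or `editor` children matches.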
Sources of Input Several ways to access inputs Document function fn:doc(“ fn:doc( Expr ) Variables –Bound in for expression or in host language $cat/catalog/book
Sequences & Iteration
Sequence constructor
–Return all books followed by all W3C specifications
($cat/catalog/book, $cat/catalog/W3Cspec)
XPath expression
–Return all books & W3C specifications in doc order
$cat/catalog/(book|W3Cspec)
For expression
–Similar to map: apply function to each item in sequence
–Return number of authors in each book
for $b in $cat/catalog/book return fn:count($b/authors)
=> (3,1,2,…)
Conditional & Quantified
Conditional
if //show[year >= 2000] then "A-OK!" else "Error!"
Existential quantification
–Implicit meaning of predicate expressions: //show[year >= 2000]
–Explicit expression: //show[some $y in ./year satisfies $y >= 2000]
Universal quantification
//show[every $y in year satisfies $y >= 2000]
Putting It Together For each author, return number of books and receipts books published in past 2 years, ordered by name let $cat := fn:doc(“ Joinwww.bn.com/catalog.xml $sales := fn:doc(“ for $author in distinct-values($cat//author) Grouping let $books := >= 2000 and author = $a], S.J. $receipts := = order by $author Ordering return XML Construction { $author } { fn:count($books) } Aggregation { fn:sum($receipts) }
Recursive Processing Recursive functions support recursive data => declare function partCount($p as element(part)) as element(partCt) { { for $p2 in $p/part return partCount($p2) } }
XML Schema Languages
Many variants…
–DTDs, XML Schema, RELAX NG, XDuce
… with similar goals: to define
–Types of literal (terminal) data
–Names of elements & attributes
XQuery designed to support (all of) XML Schema
–Structural & name constraints over types
–Regular tree expressions over elements, attributes, atomic types
TeXQuery : Full-text extensions Text search & querying of structured content Limited support in XQuery 1.0 –String operators with collation sequences $cat//book[contains(review/text(), “two thumbs up”)] Stop words, proximity searching, ranking Ex: “Tony Blair” within two words of “George Bush” Phrases that span tags and annotations Ex: Match “Mr. English sponsored the bill” in Mr. English for himself and Mr.Coyne sponsored the bill in the Committee for Financial Services
XQuery Systems: 2 Approaches
Tree-based
–Tree is basic data structure (also on disk, if an XQuery DBMS)
–Navigational approach: Galax [Simeon..], Flux [Koch..], X-Hive
–Tree algebra approach: TIMBER [Jagadish..]
Relational
–Data shredded in relational tables
–XQuery translated into database query (e.g. SQL)
The Pathfinder Project
Challenge / Goal:
–Turn RDBMSs into efficient XQuery engines
People:
–Maurice van Keulen (University of Twente)
–Torsten Grust, Jens Teubner (University of Konstanz)
–Jan Rittinger (University of Konstanz & CWI)
Task: generate code for MonetDB
MonetDB: Applied CS Research at CWI
A decade of “query-intensive” application experience
–Image retrieval: Peter Bosch (ImageSpotter)
–Audio/video retrieval: Alex van Ballegooij (RAM)
–XML text retrieval: de Vries / Hiemstra (TIJAH)
–Biological sequences: Arno Siebes (BRICKS)
–XML databases: Albrecht Schmidt (XMark), Grust / van Keulen (Pathfinder)
–GIS: Wilco Quak (MAGNUM)
–Data warehousing / OLAP / data mining: SPSS DataDistilleries, Univ. Massachusetts (PROXIMITY)
CWI research group successfully spun off DataDistilleries (now SPSS)
Pathfinder - MonetDB
[Figure: architecture diagram. Pathfinder stages: Parser, Sem. Analysis, Core Translation, Typechecking, Relational Algebra, Core to MIL Translation. MonetDB: Database, reached via MIL (Query Algebra); SQL is also labeled.]
Open Source
MonetDB + Pathfinder on Sourceforge
–Mozilla License
–Project homepage
–Developers website
RoadMap
–14-apr-04: initial beta release of MonetDB/SQL
–30-sep-04: first official release of MonetDB/SQL
–30-may-05: beta release of MonetDB/XQuery (i.e. Pathfinder)
MonetDB
MonetDB Particulars: Column-wise fragmentation –BAT: Binary Association Tables [oid,X] –Don’t touch what you don’t need
Binary Association Tables (BATs)
BAT storage as thin arrays
MonetDB Particulars
Column-wise fragmentation
–BAT: Binary Association Tables [oid,X]
–Don’t touch what you don’t need
Void (virtual-oid) columns
–Contain dense sequence 0,1,2,3,4,…
–Require no space
–Positional access (nice for XPath skipping): pre = void
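To make the BAT and void-column idea concrete, here is a small Python sketch. It is an illustration only, not MonetDB's actual storage code: the class `VoidBAT`, the helper `decompose`, and the sample relation are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class VoidBAT:
    """A [void, X] BAT: the head is the dense sequence
    seqbase, seqbase + 1, ... and is therefore never stored."""
    tail: List[Any]
    seqbase: int = 0

    def fetch(self, oid: int) -> Any:
        # Positional access: the oid is an array offset, no search needed.
        return self.tail[oid - self.seqbase]


def decompose(rows, columns):
    """Column-wise fragmentation: one BAT per attribute of the relation."""
    return {
        name: VoidBAT([row[i] for row in rows])
        for i, name in enumerate(columns)
    }


# Hypothetical two-column relation.
rows = [("ann", 31), ("bob", 45), ("cy", 27)]
bats = decompose(rows, ["name", "age"])

# A query on "age" touches only the age column.
print(bats["age"].fetch(1))
```

Because the head column is implicit, looking up the value for a given oid is a plain array access, which is what makes positional skipping over `pre` numbers cheap.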
DBMS Architecture
Monet: DBMS Microkernel
MonetDB: extensible architecture
Front-end/back-end:
–support multiple data models
–support multiple end-user languages
–support diverse application domains
Pathfinder: XQuery frontend
Architecture
Outline
Basic XML / XQuery
Introduction of Pathfinder and MonetDB projects
Relational XQuery
–XPath steps in the pre/post plane
–Translating for-loops, and beyond
MonetDB Implementation
–Data structures
Optimizations
–Order prevention
–Loop-Lifted Staircase join
–Join recognition
Outlook
–Conclusions
XPath on an RDBMS
Node-based relational encoding of XQuery's data model
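A node-based encoding of this kind can be sketched in a few lines of Python. The function names (`encode`, `descendants`, `ancestors`, `following`) and the sample document are assumptions for illustration; the sketch covers elements only and ignores text and attribute nodes.

```python
import xml.etree.ElementTree as ET


def encode(root):
    """Return one row (pre, post, level, tag) per element, in document order."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(elem, level):
        pre = counters["pre"]
        counters["pre"] += 1
        slot = len(rows)
        rows.append(None)  # reserve the row; post is known only after the subtree
        for child in elem:
            visit(child, level + 1)
        post = counters["post"]
        counters["post"] += 1
        rows[slot] = (pre, post, level, elem.tag)

    visit(root, 0)
    return rows


# Each XPath axis becomes a region predicate in the pre/post plane.
def descendants(rows, pre):
    post = rows[pre][1]
    return [r[3] for r in rows if r[0] > pre and r[1] < post]


def ancestors(rows, pre):
    post = rows[pre][1]
    return [r[3] for r in rows if r[0] < pre and r[1] > post]


def following(rows, pre):
    post = rows[pre][1]
    return [r[3] for r in rows if r[0] > pre and r[1] > post]


doc = ET.fromstring("<a><b><c/><d/></b><e><f/></e></a>")
table = encode(doc)
print(table)
print(descendants(table, 1))  # context node b has pre = 1
```

Since rows are produced in document order, the row index equals the `pre` value, so `rows[pre]` is a positional lookup.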
Tree Knowledge 1: pruning
Tree Knowledge 2: Partitioning
Staircase Join Algorithm
Tree Knowledge 3: Skipping
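The three kinds of tree knowledge (pruning, partitioning, skipping) can be combined in a short sketch of a staircase join for the descendant axis. This is a simplified illustration under stated assumptions, not the published algorithm in full. The document is given as a list `post` indexed by `pre`, the context is a list of `pre` values sorted in document order, and the function names are hypothetical.

```python
def prune_descendant(post, context):
    """Pruning: drop context nodes that lie inside the subtree of an
    earlier context node, since they add no new descendants."""
    kept = []
    for c in context:  # context is sorted by pre
        if kept and post[c] < post[kept[-1]]:
            continue  # c is a descendant of kept[-1]
        kept.append(c)
    return kept


def staircase_descendant(post, context):
    """Descendant axis for a set of context nodes in one sequential pass."""
    result = []
    for c in prune_descendant(post, context):
        # Partitioning: each remaining context node owns a disjoint pre range.
        v = c + 1
        while v < len(post) and post[v] < post[c]:
            result.append(v)
            v += 1
        # Skipping: the first node with post[v] > post[c] ends the subtree
        # of c, so the rest of this partition is never inspected.
    return result


# post values indexed by pre for <a><b><c/><d/></b><e><f/></e></a>
post = [5, 2, 0, 1, 4, 3]
print(prune_descendant(post, [1, 2, 4]))
print(staircase_descendant(post, [1, 2, 4]))
```

The result comes out duplicate-free and in document order without a separate sort or duplicate-elimination step.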
Pre/Post → Pre/Level/Size
Done for better skipping and updates
Updates
Dense pre-numbers are nice for XPath
–Positional skipping in Staircase join!
But how to handle updates?
[Figure labels: Dense, Not Dense]
Planned Update Solution
XPath → XQuery
Sequence Representation
sequence = table of items
add pos column for maintaining order
ignore polymorphism for the moment
(10, "x", <a/>, 10) →
pos | item
1 | 10
2 | "x"
3 | pre(a)
4 | 10
For-loops: the iter column
Loop-lifting
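Loop-lifting can be illustrated with plain Python lists of `(iter, pos, item)` rows. The sketch below evaluates the hypothetical query `for $x in (10, 20) return $x + 100`. The helper names and the exact shape of the `mapping` relation are assumptions chosen for readability, not Pathfinder's actual translation scheme.

```python
def bind_for(seq):
    """for $x in seq: every item gets its own inner iteration.
    Returns $x in the inner scope plus an (outer, inner) mapping."""
    ordered = sorted(seq, key=lambda row: (row[0], row[1]))
    var, mapping = [], []
    for inner, (outer, _pos, item) in enumerate(ordered, start=1):
        var.append((inner, 1, item))
        mapping.append((outer, inner))
    return var, mapping


def lift_constant(loop, value):
    """A constant is replicated once per iteration of the enclosing loop."""
    return [(it, 1, value) for it in loop]


def add(left, right):
    """Arithmetic on singletons: join both operands on iter."""
    by_iter = {it: item for it, _pos, item in right}
    return [(it, 1, item + by_iter[it]) for it, _pos, item in left if it in by_iter]


def back_map(body, mapping):
    """Return to the outer scope and renumber pos in (inner iter, pos) order."""
    outer_of = {inner: outer for outer, inner in mapping}
    rows = sorted((outer_of[it], it, pos, item) for it, pos, item in body)
    result, next_pos = [], {}
    for outer, _inner, _pos, item in rows:
        next_pos[outer] = next_pos.get(outer, 0) + 1
        result.append((outer, next_pos[outer], item))
    return result


# The outer scope has a single iteration, iter = 1.
seq = [(1, 1, 10), (1, 2, 20)]             # (10, 20)
x, mapping = bind_for(seq)                 # $x, one inner iteration per item
inner_loop = [inner for _outer, inner in mapping]
body = add(x, lift_constant(inner_loop, 100))
result = back_map(body, mapping)
print(result)
```

The loop body is evaluated once for all iterations as a bulk operation over the `iter` column, rather than once per binding of `$x`.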
Full Example
[Figure: example plan; operator labels: join, calc, project]
Mapping Rules
XQuery construct → relational algebra
See VLDB’04 / TDM’04 [Grust, Teubner]
–Sequence construction → union
–If-Then-[Else] → select, [union]
–For loop → map with cartesian product (all combinations)
–Calculations → projection expressions
–List functions (e.g. fn:first) → select(pos=1)
–Element construction → updates using descendant
–Path steps → selections on the pre/post plane
Staircase join [VLDB03]:
–Single pass for a *set* of context nodes
–Elaborate skipping!
XMark Query 2
XMark Query 2 (common subexpr)
Order Prevention
To encode order, we use the pos column
New pos columns are created using the DENSE RANK (SQL) primitive
–Needs [pos] | [iter] order
–More commonly [iter,pos]
This requires a lot of sorting, which is often not necessary
Order Prevention
[VLDB03 Wang & Cherniack] define:
–Order properties of relations
–Order propagation rules for relational operators
–Decoration of physical plans with order properties → eliminate sort
New ideas:
–RefineSort: pipelined algorithm that extends sort order
–Order property [C1] | [C2]: “for each equal value of [C2], in order of appearance, the values in [C1] are monotonically increasing”
–Hash-based DENSE RANK only requires [pos] | [iter] → sorts on [iter,pos] avoided
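The point of the hash-based variant is that a weaker order property suffices. The Python sketch below is an illustration under assumptions: rows are `(iter, pos, item)` tuples, `pos` values are unique within an iteration, and the function names are hypothetical. It numbers positions per iteration with a hash table of counters and compares the outcome with a sort-based reference.

```python
def has_pos_given_iter_order(rows):
    """Check the property [pos] | [iter]: within each iter value, pos
    increases in order of appearance (iters may be interleaved)."""
    last = {}
    for it, pos, _item in rows:
        if it in last and pos <= last[it]:
            return False
        last[it] = pos
    return True


def hash_renumber(rows):
    """Assign dense new pos values per iter in one pass, without sorting.
    Correct only if the input satisfies [pos] | [iter]."""
    next_pos = {}
    out = []
    for it, _pos, item in rows:
        next_pos[it] = next_pos.get(it, 0) + 1
        out.append((it, next_pos[it], item))
    return out


def sort_renumber(rows):
    """Reference: sort on [iter, pos] first, then number sequentially."""
    out, next_pos = [], {}
    for it, _pos, item in sorted(rows, key=lambda r: (r[0], r[1])):
        next_pos[it] = next_pos.get(it, 0) + 1
        out.append((it, next_pos[it], item))
    return out


# Interleaved iterations: not sorted on [iter, pos], yet [pos] | [iter] holds.
rows = [(2, 3, "b"), (1, 5, "x"), (2, 7, "c"), (1, 9, "y")]
print(has_pos_given_iter_order(rows))
print(hash_renumber(rows))
```

Both functions assign the same position to every row; only the physical row order of the output differs, and the hash-based version never sorts.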
Order Prevention: XQuery Strategies
Generate logical plan (SQL)
–RDBMS optimizer must be order-aware
Generate physical plan (MIL)
–XQuery generator is order-aware (current Pathfinder/MonetDB approach)
Join Recognition (recap Mapping Rules)
Join Recognition
for $p in $auction/site/people/person
for $t in $auction/site/closed_auctions/closed_auction
where … = …
return $t
–For loop → map with all combinations: O(N*N)
–If a “simple” condition exists on two loop variables → join
–Only make a map with the matching combinations
–E.g. with a hash table: O(N)
Performed on the XCore tree
–Recognize if-then expressions
Open question: where to optimize best?
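The difference between the two strategies can be sketched in Python. The operands of the `where` clause are not preserved in the slide text, so the key functions below (`p_key`, `t_key`) and the record fields `id` and `buyer` are purely hypothetical stand-ins.

```python
from collections import defaultdict


def nested_loop(people, auctions, p_key, t_key):
    """Literal translation: build all combinations, then filter. O(N*N)."""
    return [t for p in people for t in auctions if p_key(p) == t_key(t)]


def hash_join(people, auctions, p_key, t_key):
    """Recognized join: only matching combinations are built. O(N) expected."""
    index = defaultdict(list)
    for t in auctions:
        index[t_key(t)].append(t)
    return [t for p in people for t in index.get(p_key(p), [])]


# Hypothetical data standing in for person and closed_auction nodes.
people = [{"id": "p1"}, {"id": "p2"}]
auctions = [
    {"buyer": "p2", "price": 10},
    {"buyer": "p1", "price": 5},
    {"buyer": "p2", "price": 7},
]


def p_key(p):
    return p["id"]


def t_key(t):
    return t["buyer"]


print(hash_join(people, auctions, p_key, t_key))
```

Both functions return the matches in the same order (outer loop first, then document order of the inner sequence), so the rewrite preserves the sequence order that XQuery requires.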
Join Optimization
for $x in $foo
for $y in $bar
where … < …
return $x
[Figure: two plans. First plan labels: p1, p2, theta-join, project. Second plan labels: /p1, /p2, Aggr(min), Aggr(max), theta-join.]
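The min/max aggregates in the plan exploit the existential semantics of general comparisons: some value on the left is smaller than some value on the right exactly when the smallest left value is smaller than the largest right value. The sketch below assumes, based on the plan labels, that the condition compares values reached via `/p1` from `$x` with values reached via `/p2` from `$y`; the function names and data are hypothetical.

```python
def naive_theta_join(xs, ys):
    """Compare every p1 value of $x with every p2 value of $y."""
    return [
        i
        for i, p1_values in enumerate(xs)
        for p2_values in ys
        if any(a < b for a in p1_values for b in p2_values)
    ]


def aggregate_theta_join(xs, ys):
    """Reduce each side to one value per node before the theta-join."""
    x_min = [(i, min(vals)) for i, vals in enumerate(xs) if vals]
    y_max = [max(vals) for vals in ys if vals]
    return [i for i, low in x_min for high in y_max if low < high]


# xs[i] holds the p1 values of the i-th $x, ys[j] the p2 values of the j-th $y.
xs = [[5, 1], [9], []]
ys = [[2, 3], [0], [10, 4]]
print(naive_theta_join(xs, ys))
print(aggregate_theta_join(xs, ys))
```

Nodes with an empty value sequence never satisfy the comparison, which is why they are filtered out before aggregation.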
Loop-Lifted Staircase Join (recap Mapping Rules)
Loop-lifted staircase join
Staircase join [VLDB03]:
–Single pass for a *set* of context nodes
–Elaborate skipping!
Loop-lifting → multiple iters → multiple sets of context nodes
Loop-Lifted Staircase Join
–In a single pass: process multiple input context node lists
–Use a stack
–Exploit axis properties for pruning
Staircase join [Figure: document, list of context nodes]
Loop-lifted staircase join [Figure: document, multiple lists of context nodes, active stack]
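A possible shape of such a single-pass algorithm for the descendant axis is sketched below in Python. It is an illustration under assumptions rather than the MonetDB implementation: the document is a list `size` indexed by `pre` (number of descendants per node), the context is a list of `(iter, pre)` pairs, and the function name is hypothetical.

```python
def loop_lifted_descendant(size, context):
    """Descendant step for many iterations in one pass over the document.
    Returns (iter, pre) pairs; per iter the result is duplicate-free
    and in document order."""
    ctx = sorted({(pre, it) for it, pre in context})  # document order
    result = []
    stack = []      # active context nodes as (last descendant pre, iter)
    active = set()  # iters that currently have a node on the stack
    k, v = 0, 0
    while k < len(ctx) or stack:
        # Close context subtrees that ended before v.
        while stack and stack[-1][0] < v:
            active.discard(stack.pop()[1])
        if not stack:
            if k == len(ctx):
                break
            v = max(v, ctx[k][0])  # skipping: jump to the next context node
        # v lies inside the subtree of every node on the stack.
        result.extend((it, v) for _end, it in stack)
        # Activate the context nodes located at v.
        while k < len(ctx) and ctx[k][0] == v:
            it = ctx[k][1]
            if it not in active:  # pruning: covered by an active ancestor
                stack.append((v + size[v], it))
                active.add(it)
            k += 1
        v += 1
    return result


# size values indexed by pre for <a><b><c/><d/></b><e><f/></e></a>
size = [5, 2, 0, 0, 1, 0]
context = [(1, 1), (2, 0), (2, 4), (3, 4)]  # (iter, pre) pairs
pairs = loop_lifted_descendant(size, context)
per_iter = {}
for it, pre in pairs:
    per_iter.setdefault(it, []).append(pre)
print(per_iter)
```

Because nested subtrees end no later than the subtrees that contain them, the entry that expires first is always on top of the stack, so closing finished subtrees only ever pops from the top.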
Scalability
Test platform: Opteron 1.6GHz, 8GB RAM, Red Hat Linux 64-bit
Can process an 11GB document!
Mostly linear scaling with document size
–Some swapping in the join queries
–Q11 + Q12 generate quadratic result
XMark 10MB: Pathfinder vs X-Hive & Galax
XMark 1GB: Pathfinder vs X-Hive [Figure label: did not finish]
Conclusions
Relational approach can be scalable & fast
Crucial optimizations
–Join recognition
–Loop-lifted XPath steps
–Order awareness
Future
–Roadmap (beta: May 30, Holland Open)
–Algebraic query optimization
–Updates (not in release)