Pattern tree algebras: sets or sequences?
Stelios Paparizos, H. V. Jagadish
University of Michigan, Ann Arbor, MI, USA
Outline
- XML and XQuery Order and Duplicates
  - Document Order
  - OrderBy Clause
  - Binding Order
  - Duplicates and XQuery
- Hybrid Collections
- Correct Output Order
- Thinking Efficiently
- Experimental Evaluation
- Final Words
Document Order Usage
- Provides the capability to re-establish the original document information
- Example: Return the authors of the book with title = "Grilling…"
    FOR $b IN document(t)//book
    WHERE $b/title = "Grilling for amateurs"
    RETURN $b/author
- Authors shown on the slide, in document order: Mario, Stelios, Alton
Document Order
- Implicit, derived from the XML data model
- The order in which data is represented in a document is important information
- Requires the original XML order to be represented within a single document
- Requires an order amongst documents during a single execution of a query
- Enforced on every XPath expression and every sequence operation, e.g. Union
ORDER BY Clause Order
- Explicit specification with the ORDER BY clause
- Results are sorted using each item's value
- Example: Return all books sorted by year of publication
    XQuery:
        FOR $b IN document(t)//book
        ORDER BY $b/year
        RETURN $b
    SQL:
        SELECT book FROM t ORDER BY year
Binding Order Usage
- Provides a mechanism to produce results in multiple document orders
- Example: Return books and articles with the same author, ordering the results by document order of (book, article) or of (article, book)

  Ordered by book, then article:
    FOR $b IN document(t)//book
    FOR $a IN document(t)//article
    WHERE $b/author = $a/author
    RETURN ($b, $a)
  Results:
    book1 – article1
    book1 – article2
    book2 – article1
    book2 – article2
    book2 – article3

  Ordered by article, then book:
    FOR $a IN document(t)//article
    FOR $b IN document(t)//book
    WHERE $b/author = $a/author
    RETURN ($b, $a)
  Results:
    book1 – article1
    book2 – article1
    book1 – article2
    book2 – article2
    book2 – article3
Binding Order
- Implicit, derived from the way the user writes the query
- Results are sorted based on the order in which variables are bound
- Uses multiple document orders
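The effect of binding order can be sketched with two nested loops. This is only an illustration, not the Timber implementation: the author names and the `shares_author` helper are hypothetical, chosen so that the matching pairs agree with the book/article results listed for the Binding Order Usage example. List position stands in for document order.

```python
# Hypothetical data: list position plays the role of document order.
books = [("book1", {"Ann"}), ("book2", {"Ann", "Bob"})]
articles = [("article1", {"Ann"}), ("article2", {"Ann"}), ("article3", {"Bob"})]


def shares_author(x, y):
    """Existential comparison, in the spirit of $b/author = $a/author."""
    return bool(x[1] & y[1])


# FOR $b ... FOR $a ...: the outer variable ($b) varies slowest.
book_then_article = [
    (b[0], a[0]) for b in books for a in articles if shares_author(b, a)
]

# FOR $a ... FOR $b ...: same pairs, but now $a varies slowest.
article_then_book = [
    (b[0], a[0]) for a in articles for b in books if shares_author(b, a)
]

for pair in book_then_article:
    print("by (book, article):", pair)
for pair in article_then_book:
    print("by (article, book):", pair)
```

Both loops return the same set of pairs. Only the sequence differs, which is why a set-based algebra cannot reproduce binding order without extra machinery.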
XQuery and Duplicates
- XQuery operates on duplicate-free sequences
- The LET clause binds a variable to the sequence of matching elements
- The FOR clause binds a variable to each element of the sequence of matching elements
- Hence, XQuery requires all duplicates to be removed at variable binding
Dilemma: Use Sequences or Sets (or Bags or …)
- Sets lose all ordering information
  - Order can be important in intermediate steps
- Sequences are expensive to manipulate
  - Optimization possibilities can be restricted
- Both sets and sequences are duplicate-free
  - Duplicate elimination can be a costly procedure that should be avoided when possible
Solution: Use Hybrid Collections
- A Hybrid Collection can have duplicate semantics that vary between a bag and a set, and order semantics that vary between a set and a sequence
- Duplicate Specification
- Ordering Specification
Duplicate Specification (D-Spec)
- Given a collection of trees C_T, the D-Spec describes how duplicates were removed from the collection
- Possible parameter values:
  - “empty”: duplicates can be present
  - “tree”: duplicates were removed using deep-tree comparison amongst the trees in C_T
  - List of nodes u: duplicates were removed by comparing the nodes referred to by u in each tree in C_T
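The three D-Spec values can be illustrated with a small sketch. It makes several assumptions that are not from the slides: each tree is flattened into a dict from a pattern-node label to a hashable value, "deep-tree comparison" is approximated by comparing the whole dict, and the first occurrence of each duplicate is the one kept. The function name `eliminate_duplicates` is hypothetical.

```python
def eliminate_duplicates(trees, d_spec):
    """Return the trees that survive duplicate elimination under d_spec.

    d_spec is "empty", "tree", or a list of node labels (assumed encoding).
    """
    if d_spec == "empty":
        # Bag semantics: duplicates may remain.
        return list(trees)
    if d_spec == "tree":
        # Stand-in for deep-tree comparison: compare every node of the tree.
        def key(tree):
            return tuple(sorted(tree.items()))
    else:
        # Compare only the nodes referred to by the list u.
        def key(tree):
            return tuple(tree.get(u) for u in d_spec)

    seen = set()
    result = []
    for tree in trees:
        k = key(tree)
        if k not in seen:  # keep the first occurrence (assumption)
            seen.add(k)
            result.append(tree)
    return result


trees = [
    {"B": "b1", "A": "a1"},
    {"B": "b1", "A": "a2"},
    {"B": "b1", "A": "a1"},
]
print(len(eliminate_duplicates(trees, "empty")))  # all three trees kept
print(len(eliminate_duplicates(trees, "tree")))   # exact repeats removed
print(len(eliminate_duplicates(trees, ["B"])))    # one tree per distinct B
```

A partial D-Spec such as `["B"]` removes more trees than "tree" does, because fewer nodes take part in the comparison.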
Duplicate Specification Example
Ordering Item (O-Item)
- Minimum unit used when sorting a collection C_T
- Parameters:
  - Reference to the node to sort by
  - Ascending (‘asc’) or descending (‘desc’)
  - Empty greater (‘g’) or empty least (‘l’) for trees without a matching node
- Example: O-Item (B, asc, l)
Ordering Specification (O-Spec)
- Given a collection C_T, the O-Spec describes how the trees are sorted in the collection
- It accepts as its parameter an ordered list of O-Items
- Sorting takes place in the order in which the O-Items are specified
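A minimal sketch of sorting by an O-Spec follows. It rests on assumptions that are not from the slides: trees are dicts from node label to a comparable value, an O-Item is a `(node, direction, empty)` tuple, and "empty greater" means a missing node compares above every present value before the direction is applied. The function name `sort_by_o_spec` is hypothetical.

```python
def sort_by_o_spec(trees, o_spec):
    """Sort trees by a list of O-Items; the first O-Item is the primary key."""
    result = list(trees)
    # Stable sorts applied from the least significant O-Item to the most
    # significant one give a lexicographic ordering overall.
    for node, direction, empty in reversed(o_spec):
        missing_rank = 1 if empty == "g" else -1

        def key(tree, node=node, missing_rank=missing_rank):
            if node in tree:
                return (0, tree[node])
            # Trees without the node sort above or below all present values.
            return (missing_rank, None)

        result.sort(key=key, reverse=(direction == "desc"))
    return result


trees = [
    {"id": 1, "B": 2000, "A": "x"},
    {"id": 2, "A": "y"},  # no B node
    {"id": 3, "B": 1999, "A": "z"},
    {"id": 4, "B": 2000, "A": "a"},
]

one_item = sort_by_o_spec(trees, [("B", "asc", "l")])
two_items = sort_by_o_spec(trees, [("B", "asc", "l"), ("A", "asc", "l")])
unordered = sort_by_o_spec(trees, [])

print([t["id"] for t in one_item])
print([t["id"] for t in two_items])
print([t["id"] for t in unordered])
```

With a single O-Item, trees that tie on B keep their input order, so the collection is only partially ordered. Adding a second O-Item resolves those ties, and an empty O-Spec leaves the input untouched.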
Ordering Specification Example
[Figure labels: “Fully-ordered”, “Partially-ordered”, “any order”]
TLC-C Correct Output Algorithm
TLC-C Basic Principles
- Duplicate behavior is correct with sets
- Document order is modeled by our node identifiers
- Pattern tree matches return information in document order
- The ORDER BY clause is mapped to a list of ordering items and a sort operation
- Binding order is determined during parsing by tracking how the query was typed
- A sort operation is used at the end of each single-block FLWOR statement to capture the binding order
Binding Order Example
    FOR $b IN document("lib.xml")//book
    FOR $a IN $b/author
    FOR $e IN $b/editor
    FOR $h IN $e/hobby
    FOR $i IN $a/interest
    RETURN $b
Algebraic plan (TLC)
Orderlist: 2, 3, 5, 6, 4
Binding Order Example
    FOR $b IN document("lib.xml")//book
    FOR $a IN $b/author
    FOR $e IN $b/editor
    FOR $h IN $e/hobby
    FOR $i IN $a/interest
    RETURN $b
Algebraic plan with correct output order (TLC-C)
Orderlist: 2, 3, 5, 6, 4
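The final sort that captures binding order can be sketched as a lexicographic sort on the order list. The encoding is assumed, not taken from the slides: the numbers 2 to 6 are treated as pattern-tree node labels, each match is a dict from label to the document-order position of the matched node, and the positions themselves are made up for illustration.

```python
ORDER_LIST = [2, 3, 5, 6, 4]  # order list from the slide


def sort_by_order_list(matches, order_list):
    """Sort matches lexicographically by the listed pattern-tree nodes."""
    return sorted(matches, key=lambda m: tuple(m[n] for n in order_list))


# Hypothetical matches: values are document-order positions of matched nodes.
matches = [
    {2: 10, 3: 11, 4: 13, 5: 20, 6: 21},
    {2: 10, 3: 11, 4: 12, 5: 20, 6: 22},
    {2: 10, 3: 11, 4: 12, 5: 20, 6: 21},
]

binding_order = sort_by_order_list(matches, ORDER_LIST)
label_order = sort_by_order_list(matches, [2, 3, 4, 5, 6])

print([(m[6], m[4]) for m in binding_order])
print([(m[4], m[6]) for m in label_order])
```

Sorting by the labels in numeric order gives a different sequence than sorting by the order list, which shows why the sort keys have to follow the sequence in which the variables were bound.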
Outline
- XML and XQuery Order and Duplicates
- Hybrid Collections
- Correct Output Order
- Thinking Efficiently
  - Enhancing an algebra with Hybrid Collections
  - Minimizing Duplicate Elimination procedures
  - Selections and Ordering
  - Nested Queries and Ordering
- Experimental Evaluation
- Final Words
Operators with Ordering (example)
- Select S[apt, ord](C_T): produces the matches of the annotated pattern tree (apt) on the input collection C_T
- The new parameter ord is used for ordering:
  - ‘empty’: unspecified order
  - ‘maintain’: preserve the order of the input C_T
  - ‘list-resort u’: destroy the order of C_T and re-sort using the input list of node references u
  - ‘list-add u’: preserve the order of the input C_T and sort ties using the input list of node references u
Algebraic Identities (example)
- Select S and Sort O can be merged:
    O[ol](S[any, any](…)) ↔ S[any, ol](…)
- Select S and Sort O can be swapped:
    O[ol](S[any, maintain](…)) ↔ S[any, maintain](O[ol](…))
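The ordering parameter of Select and the two identities can be checked on toy data. This is a sketch under stated assumptions, not the TLC operators themselves: the annotated pattern tree is replaced by a plain predicate, trees are dicts, all sorts are ascending, and the input is assumed to already be sorted by the order recorded in `in_order`. The names `sort_op` and `select_op` are hypothetical.

```python
def sort_op(trees, nodes):
    """Sort O[ol]: order trees by the listed node references (ascending)."""
    return sorted(trees, key=lambda t: tuple(t[n] for n in nodes))


def select_op(trees, in_order, predicate, ord_mode="maintain", u=()):
    """Select with an ordering parameter; returns (trees, resulting order)."""
    matches = [t for t in trees if predicate(t)]
    if ord_mode == "empty":
        return matches, []              # no order is promised
    if ord_mode == "maintain":
        return matches, list(in_order)  # filtering keeps the input order
    if ord_mode == "list-resort":
        return sort_op(matches, u), list(u)
    if ord_mode == "list-add":
        combined = list(in_order) + list(u)  # u only breaks existing ties
        return sort_op(matches, combined), combined
    raise ValueError("unknown ord mode: " + ord_mode)


trees = [
    {"year": 2001, "title": "c"},
    {"year": 1999, "title": "b"},
    {"year": 2001, "title": "a"},
    {"year": 1998, "title": "z"},
]


def recent(tree):
    return tree["year"] > 1998


# Merge identity: a Sort on top of a Select equals a Select that sorts.
sort_after_select = sort_op(select_op(trees, [], recent)[0], ["year"])
merged, _ = select_op(trees, [], recent, "list-resort", ["year"])

# Swap identity: an order-maintaining Select commutes with Sort.
select_after_sort, _ = select_op(sort_op(trees, ["year"]), ["year"], recent)

# list-add: input sorted by year, ties then broken by title.
tie_broken, new_order = select_op(
    sort_op(trees, ["year"]), ["year"], recent, "list-add", ["title"]
)
print([t["title"] for t in tie_broken], new_order)
```

The swap identity holds here because filtering never reorders trees, so the optimizer is free to place the sort wherever it is cheapest.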
Minimize Duplicate Eliminations
- Step 1: Remove redundant duplicate elimination procedures
- Step 2: Explore partial duplicate specifications to further minimize duplicate elimination procedures
Minimize DEs Step 1 Example
    FOR $o IN document("auction.xml")//open_auction
    WHERE count($o/bidder) > 5
    RETURN {$o/quantity} {$o/type}
From 6 DE procedures to 1
Minimize DEs Step 2 Example
    FOR $o IN document("auction.xml")//open_auction
    WHERE count($o/bidder) > 5
    RETURN {$o/quantity} {$o/type}
The DE procedure is modified to DE: ID(2). Algebraic rewrites then eliminate it completely.
Selections and Ordering
- For “selection” type queries, use algebraic rewrites and push the sort down to the select operator.
Selections and Ordering Example
    FOR $b IN document("lib.xml")//book
    FOR $a IN $b/author
    FOR $e IN $b/editor
    FOR $h IN $e/hobby
    FOR $i IN $a/interest
    RETURN $b
- Push Sort into Select using algebraic identities.
- The optimizer can then plan the Select operator without a forced blocking sort at the end.
Joins and Ordering Example
    FOR $a IN document(t)//article
    FOR $b IN document(t)//book
    WHERE $b/author = $a/author
    RETURN ($b, $a)
Algebraic plan with correct output order (TLC-C)
Joins and Ordering Example
- Push Sort into Join using algebraic identities.
Joins and Ordering Example
- Push Sort further down into Selects using algebraic identities.
Nested Queries and Ordering
    FOR $b IN document("lib.xml")/book
    LET $k := FOR $a IN document("lib.xml")/article
              WHERE $b/author = $a/author AND $a/conf = "VLDB"
              RETURN $a
    WHERE $b/year = 1999
    RETURN {$b} {$k}
Algebraic plan with correct output order (TLC-C)
Nested Queries and Reorder
- Rewrite the Sort and the blocking Join into a Reorder operation.
Experimental Setup
- Timber System
- 128 MB buffer pool
- Value index when necessary (not for all queries)
- Intel Pentium III-M 866 MHz
- Windows 2000 Professional
- IDE hard drive
- 512 MB RAM
- XMark dataset, factor 1
- 707 MB total space (472 MB data + 241 MB index)
Minimizing Duplicate Eliminations
[Chart labels: x17 (more selective), x19 (less selective), q2 (value join)]
Selections and Ordering
[Chart labels: x13 (simple output), x17 (more selective), x19 (less selective)]
Join and Ordering
[Chart labels: q1 (less selective), q2 (more selective), x3 (less selective)]
Nested Queries and Ordering
Ordering and Duplicate Optimizations
[Chart labels: x19 (selection), q2 (value join), x8 (nested query)]
Related Work
- Relational systems recognize smart sort placement as a problem:
    D. Simmen, E. Shekita, and T. Malkemus. Fundamental techniques for order optimization. In Proc. SIGMOD Conf., 1996.
- The XML navigational-based approach has a study of ordering requirements in:
    J. Hidders and P. Michiels. Avoiding unnecessary ordering operations in XPath. In Proc. DBPL Conf., 2003.
- XML algebraic-based approaches use sets or sequences. Aside from the performance limitations, it is unknown whether they fully address the XQuery binding order to produce correct results.
Final Words
- Ordering in XQuery is a complex procedure with significant performance ramifications
- Introduced Hybrid Collections with an Ordering Specification as a means to a correct and flexible solution
  - Similar path for Duplicates
- Showed algebraic optimizations that take advantage of the provided flexibility
- Demonstrated the performance increase experimentally