TAX: A Tree Algebra for XML
H.V. Jagadish (Univ. of Michigan)
Laks V.S. Lakshmanan (Univ. of British Columbia)
Divesh Srivastava (AT&T Labs – Research)
Keith Thompson (Univ. of Michigan)
Work supported by NSF and NSERC.
Overview
- Why an algebra for XML?
- Main challenges
- Data model
- Patterns & Witnesses
- Tree Value Functions
- Some Example Operators
- Translation Example – XQuery
Overview (contd.)
- Main Results
- Optimization Examples
- Implementation
- Summary & Future Work
Why an Algebra (for XML)? (aka Related Work)
- Bulk algebra for tree manipulation: efficient implementation of XML queries
- Algebras for manipulating trees have been attempted before:
  - Feature algebras (linguistics): efficient implementation?
  - Grammar-based algebra for trees [Tompa+ 87, Gyssens+ 89]
  - Aqua project [Zdonik+ 95]
Why XML algebra? [Related work] (contd.)
- GraphLog, Hy+ [Consens+ 90], GOOD [Paredaens+ 92]: cannot exploit special properties of trees (e.g., support for arbitrary recursion vs. ancestors, order)
- Semistructured (SS) data: Lorel [Abiteboul+ 96], UnQL [Buneman+ 96]
- XML algebras: [Beech+ 99], [Fernandez+ 00] (mainly type system issues), [Christophides+ 00] (trees tuples), [Ludascher+ 00] (nodes, not trees), SAL [Beeri+ 99] (ordered lists of nodes)
Why? (contd.)
- be close to the relational model, but with direct support for (collections of) trees
- express at least RA + aggregation
- capture a substantial fragment of XQuery
- admit efficient implementation and effective query optimization
Main Challenges
- Capture a rich variety of manipulations in a simple algebra
- Handle heterogeneity in tree collections
  - structure
  - “schema” of nodes of the same “type”
- Handle order (documents are ordered)
  - sometimes important (e.g., author list)
  - sometimes not (e.g., publisher vs. authors)
Data Model
- Data tree = rooted, ordered tree
- Data in a node = set of attribute-value pairs
- Special attribute: pedigree (“where did I come from?”)
  - “doc id + offset in doc”
  - preserved for (copies of) original nodes through manipulations
  - plays an important role in grouping, sorting, etc.
  - null for new nodes
- Collections (of trees) are unordered
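The data model above can be sketched in a few lines of Python. This is only an illustration, not the TIMBER representation: the class name `Node`, the helper `copy_tree`, the attribute names `tag` and `content`, the `result` tag, and the document id `doc1` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One data-tree node: attribute-value pairs plus ordered children."""
    attrs: dict                        # e.g. {"tag": "year", "content": "1910"}
    pedigree: Optional[tuple] = None   # (doc id, offset in doc); None for new nodes
    children: list = field(default_factory=list)  # child order is significant

def copy_tree(node):
    """Copy a subtree; copies of original nodes keep their pedigree."""
    return Node(dict(node.attrs), node.pedigree,
                [copy_tree(child) for child in node.children])

# Hypothetical fragment of a document with id "doc1".
year = Node({"tag": "year", "content": "1910"}, pedigree=("doc1", 7))
book = Node({"tag": "book"}, pedigree=("doc1", 1), children=[year])

book_copy = copy_tree(book)                              # pedigrees are preserved
wrapper = Node({"tag": "result"}, children=[book_copy])  # new node: null pedigree

# A collection of trees; the list order here carries no meaning.
collection = [book, wrapper]
```

Keeping the pedigree on copies is what lets later operators tell that two nodes in different output trees came from the same place in the input.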
Patterns & Witnesses
- First challenge: how do you get at nodes and/or attributes?
- Our solution: patterns
  - enable specification of parameters for most operations
  - show only the parts of interest: no need to know or care about the entire structure of the trees in the collection
Patterns & Witnesses (contd.)
Example P1:
- Structural part: $1 -pc-> $2, $1 -ad-> $3 (pc = direct, ad = transitive)
- Condition part: $1.tag = book & $2.tag = year & $2.content < 2000 & $3.tag = author
- Additional parameters possible: e.g., selection/projection lists, grouping, ordering, etc.
Patterns & Witnesses (contd.)
What does a pattern do for you?
- Generates witnesses against the i/p collection
  - one for each matching of the pattern against the i/p
  - conditions must be respected
  - (sub)structure is preserved in the o/p
- E.g., witness trees for pattern P1: one tree for each author of each book published before 2000, showing year & author
  - the book-author link may be transitive in the i/p but is necessarily direct in the o/p
- Source trees = the trees that witnesses “came from”
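Witness generation for P1 can be sketched as a brute-force search over candidate bindings. This is a naive illustration under stated assumptions, not the matching strategy used by any real system: the `Node` class, the function names, the integer-valued `year` content, the `authors` wrapper element, and the second book are all hypothetical.

```python
class Node:
    def __init__(self, tag, content=None, children=()):
        self.tag, self.content, self.children = tag, content, list(children)

def descendants(node):
    """Yield all proper descendants of `node` in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)

def p1_condition(n1, n2, n3):
    """Condition part of P1; the tag test guards the numeric comparison."""
    return (n1.tag == "book" and n2.tag == "year"
            and n2.content < 2000 and n3.tag == "author")

def p1_witnesses(tree):
    """Yield one witness tree per embedding of P1 into `tree`."""
    for n1 in [tree, *descendants(tree)]:
        for n2 in n1.children:           # pc edge: $2 is a child of $1
            for n3 in descendants(n1):   # ad edge: $3 is any descendant of $1
                if p1_condition(n1, n2, n3):
                    # In the witness, author hangs directly under book.
                    yield Node(n1.tag, n1.content,
                               [Node(n2.tag, n2.content),
                                Node(n3.tag, n3.content)])

# Hypothetical input: the first book nests its authors under a wrapper node.
bib = Node("bib", children=[
    Node("book", children=[
        Node("title", "Principia Mathematica"),
        Node("year", 1910),
        Node("authors", children=[Node("author", "Whitehead"),
                                  Node("author", "Russell")]),
    ]),
    Node("book", children=[Node("year", 2005),
                           Node("author", "Hypothetical Author")]),
])

witness_trees = list(p1_witnesses(bib))
```

Here the first book yields two witnesses (one per author) even though its authors are grandchildren in the input, and the second book yields none because its year fails the condition.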
Tree Value Functions (TVF)
- What are they? Primitive recursive functions on the structure of source trees
- Where are they used? Grouping, ordering, aggregation, etc.
- Example: f: T → value of author, number of authors, tuple of authors, {author tuple, title}, etc.
- Complete example coming up …
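The example TVFs listed above might look as follows in Python. This is a sketch only: trees are encoded here as `(tag, content, children)` tuples, and the function names are invented for the illustration.

```python
def nodes(tree):
    """Yield every node of a (tag, content, children) tree in document order."""
    yield tree
    for child in tree[2]:
        yield from nodes(child)

def author_count(tree):
    """TVF: number of authors in the tree."""
    return sum(1 for tag, _, _ in nodes(tree) if tag == "author")

def author_tuple(tree):
    """TVF: tuple of author values, in document order."""
    return tuple(content for tag, content, _ in nodes(tree) if tag == "author")

def authors_and_title(tree):
    """TVF: the author tuple paired with the first title found (or None)."""
    title = next((content for tag, content, _ in nodes(tree) if tag == "title"),
                 None)
    return (author_tuple(tree), title)

book = ("book", None, [
    ("title", "Principia Mathematica", []),
    ("author", "Whitehead", []),
    ("author", "Russell", []),
    ("year", "1910", []),
])

summary = authors_and_title(book)
```

Each function maps a whole tree to a single value, which is the property that lets such a function serve as a grouping or ordering key.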
Example Database
bib
  book
    author
      name: first = Alfred, mid = North, last = Whitehead
      deg: Sc.D., FRS
    author
      name: first = Bertrand, last = Russell
      deg: M.A., FRS
    title: Principia Mathematica
    year: 1910
  book
    author
      name: Panini
    title: Ashtadhyayi (First book on Sanskrit Grammar)
    year: 560 BC
Example Operators – Selection
- Input: collection; parameters: pattern, selection list (SL) of pattern nodes
- Example:
  - pattern P1 and empty SL: same witness trees as before
  - pattern P1 with SL = {$1}: whole book subtrees (i.e., retain $1’s descendants)
- In general a one-to-zero-or-more operator: one input tree can yield zero or more output trees
- Could retain other “relatives” instead (e.g., siblings)
Selection with P1 (empty SL)
[Figure: three witness trees, each a book node with a year child and an author child; two carry year 1910 and one carries year 560 BC.]
Whole author subtree included when SL = {$3}.
Example operators – Projection
- Input: collection; parameters: pattern, projection list (PL)
- Example:
  - pattern P1 w/ PL = {$1, $2, $3}: one tree for each book published before 2000, showing year and author(s)
  - pattern P1 w/ PL = {$3}: one tree for each author of the aforementioned books
- ‘*’ in PL causes descendants to be retained
- One-to-zero-or-more operator (for reasons different from selection)
Projection: P1 w/ PL = {$1,$2,$3}
[Figure: one result tree per book, showing the book node with its year (1910 or 560 BC) and author node(s).]
With $3*, can include whole author subtrees.
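A minimal Python sketch of projection with pattern P1: collect every data node bound to a pattern node in PL by some embedding, then keep only those nodes, attaching each to its closest kept ancestor. The `Node` class, the function names, the 0-based encoding of PL, the integer years, and the `authors` wrapper element are assumptions made for the example, and the `*` option is not modeled.

```python
class Node:
    def __init__(self, tag, content=None, children=()):
        self.tag, self.content, self.children = tag, content, list(children)

def descendants(node):
    """Yield all proper descendants of `node` in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)

def p1_embeddings(tree):
    """Yield ($1, $2, $3) bindings of P1: $1 -pc-> $2, $1 -ad-> $3."""
    for n1 in [tree, *descendants(tree)]:
        if n1.tag != "book":
            continue
        for n2 in n1.children:
            if n2.tag == "year" and n2.content < 2000:
                for n3 in descendants(n1):
                    if n3.tag == "author":
                        yield n1, n2, n3

def induced(node, keep):
    """Forest of copies of kept nodes, each under its closest kept ancestor."""
    kids = [t for child in node.children for t in induced(child, keep)]
    if id(node) in keep:
        return [Node(node.tag, node.content, kids)]
    return kids   # node dropped: its kept descendants move up

def project(tree, pl=(0, 1, 2)):
    """Project on P1; `pl` lists 0-based positions of the pattern nodes in PL."""
    keep = {id(binding[i]) for binding in p1_embeddings(tree) for i in pl}
    return induced(tree, keep)

book = Node("book", children=[
    Node("title", "Principia Mathematica"),
    Node("year", 1910),
    Node("authors", children=[Node("author", "Whitehead"),
                              Node("author", "Russell")]),
])

per_book = project(book)               # PL = {$1, $2, $3}
per_author = project(book, pl=(2,))    # PL = {$3}
```

With PL = {$1, $2, $3} the book comes back as a single tree holding its year and both authors, whereas PL = {$3} drops the root and leaves one tree per author, which is why one input tree can produce zero or more outputs.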
Selection vs. Projection Example
Selection:
FOR $b IN document(“doc.xml”)//book
FOR $y IN $b/year[data() … $y $a
versus
Projection:
FOR $b IN document(“doc.xml”)//book[/year/data() … $b/year $b/author
Example operators – grouping
- Input: collection; parameters: pattern, grouping TVF, ordering TVF
- Example:
  - input: collection of books
  - pattern, structural part: $1 -pc-> $2, $1 -ad-> $3, $3 -pc-> $4
  - pattern, condition part: $1.tag = book & $2.tag = title & $3.tag = author & $4.tag = name
  - grouping TVF: f_g(T) = “$4.content”
  - ordering TVF: f_o(T) = “$2.content”
Grouping (contd.)
Here is what the o/p looks like (books ordered by title in each group):
[Figure: tax_group_root with two children, tax_group_basis (above an author node) and tax_group_subroot (above the book nodes of the group, …).]
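The grouping example can be sketched in Python as follows, under several assumptions: trees are `(tag, content, children)` tuples, the function names are invented, the grouping value is stored directly as the content of the `author` node under `tax_group_basis` (a simplification), groups are emitted in sorted key order only to make the result deterministic, and the second book is illustrative data that is not part of the slides' example database.

```python
from collections import defaultdict

def descendants(tree):
    """Yield all proper descendants of a (tag, content, children) tree."""
    for child in tree[2]:
        yield child
        yield from descendants(child)

def group_books(books):
    """Group books by author name ($4.content), ordered by title ($2.content)."""
    groups = defaultdict(list)
    for book in books:
        titles = [c[1] for c in book[2] if c[0] == "title"]         # $1 -pc-> $2
        names = [n[1] for a in descendants(book) if a[0] == "author"
                 for n in a[2] if n[0] == "name"]      # $1 -ad-> $3, $3 -pc-> $4
        for title in titles:
            for name in names:             # one witness per (title, name) pair
                groups[name].append((title, book))
    result = []
    for name in sorted(groups):
        members = [b for _, b in sorted(groups[name], key=lambda pair: pair[0])]
        result.append(("tax_group_root", None, [
            ("tax_group_basis", None, [("author", name, [])]),
            ("tax_group_subroot", None, members),
        ]))
    return result

b1 = ("book", None, [
    ("title", "Principia Mathematica", []),
    ("author", None, [("name", "Whitehead", [])]),
    ("author", None, [("name", "Russell", [])]),
])
b2 = ("book", None, [
    ("title", "The Problems of Philosophy", []),
    ("author", None, [("name", "Russell", [])]),
])

grouped = group_books([b2, b1])
```

A book with two authors lands in two groups, so grouping in this sense does not partition the input collection.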
Other operators
- Derived operators: various joins
- Set operations: when are two data trees the “same”?
  - equality (shallow/deep) vs. isomorphism (include pedigree or not?)
- Multiset versions of operators
- Aggregation, Reordering, Renaming
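The slide leaves the notions of sameness open, so the following Python sketch shows only one possible reading, with every definition an assumption: "shallow" compares just the two root nodes, "deep" compares whole trees in child order, and a flag decides whether pedigree takes part in the comparison. The `Node` class, the function names, and the pedigree values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    tag: str
    content: Optional[str] = None
    pedigree: Optional[tuple] = None   # (doc id, offset); None for new nodes
    children: list = field(default_factory=list)

def shallow_equal(a, b, use_pedigree=False):
    """Compare only the two root nodes (assumed meaning of 'shallow')."""
    same_value = (a.tag, a.content) == (b.tag, b.content)
    return same_value and (not use_pedigree or a.pedigree == b.pedigree)

def deep_equal(a, b, use_pedigree=False):
    """Compare whole trees node by node, respecting child order."""
    return (shallow_equal(a, b, use_pedigree)
            and len(a.children) == len(b.children)
            and all(deep_equal(x, y, use_pedigree)
                    for x, y in zip(a.children, b.children)))

# The same author value found in two different (hypothetical) documents.
a = Node("author", "Russell", ("doc1", 4))
b = Node("author", "Russell", ("doc2", 9))

by_value = deep_equal(a, b)                          # pedigree ignored
by_identity = deep_equal(a, b, use_pedigree=True)    # pedigree compared
```

Under this reading, duplicate elimination by value would use the comparison without pedigree, while set operations that must distinguish separate occurrences would include it.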
Translation Examples – XQuery
FOR $b IN …
RETURN $b/title
  IF SOME $a IN $b//author SATISFIES $a/data() = “divesh”
  THEN $b/author
XQuery Translation (contd.)
Pre-IF part E:
- select w/ pattern over $1, $2: $1.tag=book & $2.tag=author & $2.hobby=tennis; SL = $1*
- then project w/ pattern over $3, $4: $3.tag=book & $4.tag=title; PL = $3, $4
XQuery Translation (contd.)
IF part F:
- select w/ pattern over $5, $6: $5.tag=book & $6.tag=author & $6.content = divesh; SL = $5*
- then project w/ pattern over $7, $8: $7.tag=book & $8.tag=author; PL = $7, $8
XQuery Translation (contd.)
- Do a left outerjoin of E with F w/ the condition $3 = $7
  [Figure: tax_prod_root with two book children; the left book has a title child, the right book has author children.]
- Project w/ pattern $9: $9.tag != book; PL = $9
- Rename tax_prod_root to sportydiveshbook
Main Results
- Duplicate elimination by value can be expressed in TAX.
- The operators in TAX are independent.
- TAX is complete for relational algebra w/ aggregation.
- TAX can capture the fragment of XQuery FLWR expressions w/o function calls or recursion, w/ all path expressions using only constants, wildcards, and / & //, when no new ancestor-descendant relationships are created.
Optimization Examples
- Revisit translation example: E can be simplified to a single project w/ pattern over $1, $2, $3: $1.tag=book & $2.tag=author & $2.hobby=tennis & $3.tag=title; PL = $1, $3
- Similar simplification applies to F
- Self-join can sometimes be eliminated
- Associativity, commutativity issues
Implementation
- TIMBER system at Univ. of Michigan
- Find pattern tree matches via
  - index scans
  - full scans
  - twig joins
- Joins implemented on streams
- Pedigree: implemented as position of element within document
- Pedigrees similar to RIDs at the implementation level
Summary & Future Work
Summary
- TAX: extension of RA for handling heterogeneous collections of ordered labeled trees
- Simplicity; few more operators
- Recognize selective importance of order and handle it elegantly
- Bulk algebra for efficient implementation of XML querying
- Stay tuned for TIMBER release(s)
Future
- Arbitrary restructuring: copy-and-paste
- Updates: principled via operators