Introduction to XML Algebra CS561
Data Model
Data model: the core data structures and data types supported by a DBMS.
A relational database uses a table-based (set-oriented) data model.
XML uses a tree-structured, hierarchical data model.

Why Query Algebra (for XML)?
It is common to translate a query language into an algebra.
First, the algebra is used to give a semantics for the query language.
Second, the algebra is used to support query optimization.

XML Algebra History
Lore Algebra (August 1999) -- Stanford University
IBM Algebra (September 1999) -- Oracle; IBM; Microsoft Corp
YAT Algebra (May 2000)
AT&T Algebra (June 2000) -- AT&T; Bell Labs
Niagara Algebra (2001) -- University of Wisconsin-Madison

NIAGARA
Title: Following the paths of XML Data: An algebraic framework for XML query evaluation
By: Leonidas Galanis, Efstratios Viglas, David J. DeWitt, Jeffrey F. Naughton, and David Maier. Univ. of Wisconsin

Outline
Concepts of Niagara Algebra
Operations
Optimization

Goals of Niagara Algebra
Be independent of schema information
Query on both structure and content
Generate simple, flexible, yet powerful algebraic expressions
Allow re-use of traditional optimization techniques

Example: XML Source Documents

Invoice.xml
<Invoice_Document>
  <invoice No="1">
    <account_number>2</account_number>
    <carrier>AT&T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
    <total>$0.75</total>
  </invoice>
</Invoice_Document>

Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>

XML Data Model and Tree Graph
Example: Invoice_Document
<Invoice_Document>
  <invoice>
    <number>2</number>
    <carrier>AT&T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <number>1</number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
  </invoice>
</Invoice_Document>
[Figure: tree graph of Invoice_Document with two Invoice nodes, each with number, carrier, and total children; leaf values 2, AT&T, $0.25 and 1, Sprint, $1.20]
Ordered Tree Graph, Semi-structured Data

XML Data Model (for Querying)
SQL: relations in, relation out.
Relational Algebra: relations in, relation out.
XQuery: XML docs in, XML docs out.
XML Algebra: ??

XML Data Model [GVDNM01]
Collection of bags of vertices.
Vertices in a bag have no order.
Example: [Root "invoice.xml", invoice, invoice.account_number]
  Root "invoice.xml": the document root
  invoice: <invoice> invoice-element-content </invoice>
  invoice.account_number: <account_number> element-content </account_number>

Data Model
Bag elements are reachable by path expressions.
A path expression consists of two parts:
  An entry point
  A relative forward part
Example: account_number:invoice

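A minimal Python sketch of this notation, assuming a bag can be stood in for by a dict keyed by entry-point name. The names `PathExpr` and `parse_path` are illustrative only and do not come from the Niagara paper.

```python
from typing import NamedTuple


class PathExpr(NamedTuple):
    forward: str      # relative forward part, e.g. "account_number"
    entry_point: str  # entry point into the bag, e.g. "invoice"


def parse_path(text: str) -> PathExpr:
    """Split the 'forward:entry_point' notation into its two parts."""
    forward, entry_point = text.split(":", 1)
    return PathExpr(forward.strip(), entry_point.strip())


# Assumption: a bag is unordered, so a dict keyed by entry point is one
# possible stand-in for it.
bag = {"Root": "invoice.xml", "invoice": "<invoice>...</invoice>"}
path = parse_path("account_number:invoice")
```

Here `path.entry_point` names the vertex already in the bag, and `path.forward` is the step to take from it.
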
Outline
Concepts of Niagara Algebra
Operations
Optimization

Operators
Source S, Follow Φ, Select σ, Join ⋈, Expose ε, Vertex ν, Rename ρ, Group γ, Union ∪, Intersection ∩, Difference −, Cartesian Product ×.

Source Operator S
Input: a list of documents
Output: a collection of singleton bags
Examples:
  S (*): all known XML documents
  S (invoice*.xml): all XML documents whose filenames match "invoice*.xml"
  S (*,schema.dtd): all known XML documents that conform to schema.dtd

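The filename-matching behavior can be sketched in Python as follows. This is an illustration under assumptions: the list of "known" documents is passed in explicitly, a singleton bag is modelled as a one-entry dict, and the DTD-conformance variant is left out.

```python
from fnmatch import fnmatchcase


def source(pattern, known_documents):
    """Return one singleton bag per document whose filename matches pattern."""
    return [
        {"Root": name}
        for name in known_documents
        if fnmatchcase(name, pattern)
    ]


# Hypothetical set of known documents.
known = ["invoice1.xml", "invoice2.xml", "customer.xml"]
invoice_bags = source("invoice*.xml", known)  # like S (invoice*.xml)
all_bags = source("*", known)                 # like S (*)
```
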
Follow operator
Input: a path expression in entry point notation
Functionality: extracts the vertices reachable by the path expression
Output: a new bag that consists of the extracted vertex plus all contents of the original bag (in the case of unnesting follow)

Follow operator (Example*)
Input: {[Root invoice.xml, invoice]}
Operator: Φμ(carrier:invoice)
Output: {[Root invoice.xml, invoice, invoice.carrier]}
  invoice: <invoice> invoice-element-content </invoice>
  invoice.carrier: <carrier> carrier-element-content </carrier>
*Unnesting Follow

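A rough Python sketch of an unnesting follow, assuming bags are dicts and vertices are `xml.etree.ElementTree` elements. The function name `follow_unnest` and the `entry.forward` key format are illustrative choices, not the paper's implementation.

```python
import xml.etree.ElementTree as ET


def follow_unnest(bags, forward, entry_point):
    """Emit one new bag per vertex reached from entry_point via forward."""
    result = []
    for bag in bags:
        for vertex in bag[entry_point].findall(forward):
            new_bag = dict(bag)  # keep all contents of the original bag
            new_bag[f"{entry_point}.{forward}"] = vertex
            result.append(new_bag)
    return result


invoice = ET.fromstring(
    "<invoice><carrier>Sprint</carrier><total>$1.20</total></invoice>"
)
bags = [{"Root": "invoice.xml", "invoice": invoice}]
followed = follow_unnest(bags, "carrier", "invoice")
```

In this sketch a bag whose entry point has no matching child produces no output bag, so the follow also acts as a filter.
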
Select operator
Input: a set of bags
Functionality: filters the bags of a collection using a predicate
Output: a set of bags that conform to the predicate
Predicate: logical operators (∧, ∨, ¬) or simple qualifications (>, <, ≥, ≤, =, ≠)

Select operator (Example)
Input: {[Root invoice.xml, invoice], [Root invoice.xml, invoice], …}
Predicate: invoice.carrier = Sprint
Output: {[Root invoice.xml, invoice], …}

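Selection amounts to filtering a collection with a predicate. A small Python sketch, assuming dict-based bags whose values are plain strings (a simplification of real vertices):

```python
def select(bags, predicate):
    """Keep only the bags for which predicate(bag) is true."""
    return [bag for bag in bags if predicate(bag)]


bags = [
    {"Root": "invoice.xml", "invoice.carrier": "AT&T"},
    {"Root": "invoice.xml", "invoice.carrier": "Sprint"},
]
sprint_only = select(bags, lambda bag: bag["invoice.carrier"] == "Sprint")
```

Compound predicates can be built with Python's `and`, `or`, and `not` inside the lambda.
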
Join operator
Input: two collections of bags
Functionality: joins the two collections based on a predicate
Output: the concatenation of pairs of bags that satisfy the predicate

Join operator (Example)
Inputs: {[Root invoice.xml, invoice]} and {[Root customer.xml, customer]}
Predicate: account_number:invoice = number:customer
Output: {[Root invoice.xml, invoice, Root customer.xml, customer]}

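One way to picture the join is as a nested loop that concatenates every qualifying pair of bags. The sketch below assumes dict-based bags with distinct keys on the two sides, and uses the account values from the example documents.

```python
def join(left_bags, right_bags, predicate):
    """Nested-loop join: concatenate each pair of bags that satisfies predicate."""
    return [
        {**left, **right}  # assumes the two bags have no keys in common
        for left in left_bags
        for right in right_bags
        if predicate(left, right)
    ]


invoices = [{"invoice.account_number": "2"}, {"invoice.account_number": "1"}]
customers = [
    {"customer.account": "1", "customer.name": "Tom"},
    {"customer.account": "2", "customer.name": "George"},
]
joined = join(
    invoices,
    customers,
    lambda i, c: i["invoice.account_number"] == c["customer.account"],
)
```

A real engine would likely pick a more efficient join method; the nested loop is used here only because it is the shortest correct form.
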
Expose operator
Input: a list of path expressions of the vertices to be exposed
Output: a set of bags that contain the vertices in the parameter list, in the same order

Expose operator (Example)
Input: {[Root invoice.xml, invoice, invoice.carrier, invoice.bill_period]}
Parameter list: (bill_period, carrier)
Output: {[Root invoice.xml, invoice.bill_period, invoice.carrier]}

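Expose behaves much like an ordered projection. In the Python sketch below, each output is a list so that the parameter order is preserved; the sample values (including the bill period) are made up for illustration.

```python
def expose(bags, entries):
    """Keep only the listed entries of each bag, in the order given."""
    return [[bag[name] for name in entries] for bag in bags]


bags = [
    {
        "invoice": "<invoice>...</invoice>",
        "invoice.carrier": "Sprint",
        "invoice.bill_period": "2001-01",  # hypothetical sample value
    }
]
exposed = expose(bags, ["invoice.bill_period", "invoice.carrier"])
```
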
Vertex operator
Creates the actual XML vertex that will encompass everything created by an expose operator
Example: (Customer_invoice)[((account)[invoice.account_number], (inv_total)[invoice.total])]

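The effect of the Customer_invoice example can be approximated in Python with `xml.etree.ElementTree`. This is a sketch under assumptions: the helper name `vertex` and the use of plain strings for exposed values are illustrative, and the sample values come from the Sprint invoice in the example documents.

```python
import xml.etree.ElementTree as ET


def vertex(tag, children):
    """Wrap (child_tag, text) pairs in a newly created element named tag."""
    element = ET.Element(tag)
    for child_tag, text in children:
        ET.SubElement(element, child_tag).text = text
    return element


bag = {"invoice.account_number": "1", "invoice.total": "$1.20"}
result = vertex(
    "Customer_invoice",
    [
        ("account", bag["invoice.account_number"]),
        ("inv_total", bag["invoice.total"]),
    ],
)
xml_text = ET.tostring(result, encoding="unicode")
```
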
Other operators
Group: used for arbitrary grouping of elements based on their values
  Aggregate functions can be used with the group operator (e.g., average)
Rename: changes the entry point annotation of elements of a bag
  Example: (invoice.bill_period,date)

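Both operators can be sketched in a few lines of Python. The helper names and the numeric sample totals are assumptions chosen for illustration; they are not taken from the example documents.

```python
from collections import defaultdict
from statistics import mean


def group_average(bags, key, value):
    """Group bags by bag[key] and average bag[value] within each group."""
    groups = defaultdict(list)
    for bag in bags:
        groups[bag[key]].append(bag[value])
    return {group: mean(values) for group, values in groups.items()}


def rename(bags, old, new):
    """Change the entry point annotation old to new in every bag."""
    return [
        {(new if name == old else name): item for name, item in bag.items()}
        for bag in bags
    ]


bags = [
    {"invoice.carrier": "Sprint", "invoice.total": 1.0},
    {"invoice.carrier": "Sprint", "invoice.total": 2.0},
    {"invoice.carrier": "AT&T", "invoice.total": 0.25},
]
averages = group_average(bags, "invoice.carrier", "invoice.total")
renamed = rename([{"invoice.bill_period": "Jan"}], "invoice.bill_period", "date")
```
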
Example: XML Source Documents

Invoice.xml
<Invoice_Document>
  <invoice>
    <account_number>2</account_number>
    <carrier>AT&T</carrier>
    <total>$0.25</total>
  </invoice>
  <invoice>
    <account_number>1</account_number>
    <carrier>Sprint</carrier>
    <total>$1.20</total>
    <total>$0.75</total>
  </invoice>
  <auditor>maria</auditor>
</Invoice_Document>

Customer.xml
<Customer_Document>
  <customer>
    <account>1</account>
    <name>Tom</name>
  </customer>
  <customer>
    <account>2</account>
    <name>George</name>
  </customer>
</Customer_Document>

XQuery Example
List account number, customer name, and invoice total for all invoices that have carrier = "Sprint".
FOR $i in (invoices.xml)//invoice,
    $c in (customers.xml)//customer
WHERE $i/carrier = "Sprint" and $i/account_number = $c/account
RETURN
  <Sprint_invoices>
    $i/account_number,
    $c/name,
    $i/total
  </Sprint_invoices>

Example: XQuery output
<Sprint_Invoice>
  <account_number>1</account_number>
  <name>Tom</name>
  <total>$1.20</total>
</Sprint_Invoice>

Algebra Tree Execution
(root of the tree first; each operator is shown with the bags it outputs)
Expose (*.account_number, *.name, *.total)   output: account_number, name, total
  Join (*.invoice.account_number = *.customer.account)   output: invoice(2), customer(1)
    Select (carrier = "Sprint")   output: invoice(2)
      Follow (*.invoice)   output: invoice(1), invoice(2), invoice(3)
        Source (Invoices.xml)
    Follow (*.customer)   output: customer(1), customer(2)
      Source (customers.xml)

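The same plan can be traced bottom-up in Python. This is only a sketch of the operator order, not of the Niagara engine: the documents are simplified inline samples (one total per invoice, two invoices), and each algebra step is approximated with `xml.etree.ElementTree` calls and list comprehensions.

```python
import xml.etree.ElementTree as ET

# Simplified sample documents (assumption: one <total> per invoice).
INVOICES = """
<Invoice_Document>
  <invoice><account_number>2</account_number><carrier>AT&amp;T</carrier><total>$0.25</total></invoice>
  <invoice><account_number>1</account_number><carrier>Sprint</carrier><total>$1.20</total></invoice>
</Invoice_Document>
"""
CUSTOMERS = """
<Customer_Document>
  <customer><account>1</account><name>Tom</name></customer>
  <customer><account>2</account><name>George</name></customer>
</Customer_Document>
"""


def text(element, tag):
    """Text of the first child named tag, without surrounding whitespace."""
    return element.findtext(tag).strip()


# Source + Follow: reach the invoice and customer vertices.
invoices = ET.fromstring(INVOICES).findall(".//invoice")
customers = ET.fromstring(CUSTOMERS).findall(".//customer")

# Select: keep invoices whose carrier is Sprint.
sprint = [i for i in invoices if text(i, "carrier") == "Sprint"]

# Join: match invoices to customers on the account number.
pairs = [
    (i, c)
    for i in sprint
    for c in customers
    if text(i, "account_number") == text(c, "account")
]

# Expose: account number, customer name, invoice total.
rows = [
    (text(i, "account_number"), text(c, "name"), text(i, "total"))
    for i, c in pairs
]
```

Placing the select before the join mirrors the tree: only the Sprint invoice reaches the join.
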
Outline
Concepts of Niagara Algebra
Operations
Optimization

Optimization with Niagara
Optimizer based on the Niagara algebra:
  Use operations more efficiently
  Produce simpler expressions by combining operations

Language Convention
A and B are path expressions
A< B -- path expression A is a prefix of B
AnB -- common prefix of paths A and B
AńB -- greatest common prefix of paths A and B
┴ -- null path expression

Heuristics using Rewrite Rules
Allow optimization based on path selectivity
Applicable when un-nesting with the follow operation Φμ

Interchangeability of Follow operation
Φμ(A)[Φμ(B)] = Φμ(B)[Φμ(A)]
TRUE or FALSE?
TRUE when there exists C such that C < A and C < B and C = AńB,
or when AnB = ┴

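The condition can be checked mechanically. The Python sketch below rests on two assumptions that the slide does not state: paths are written as tuples of steps starting at the entry point, and "<" is read as a strict prefix, so that neither path may be a prefix of the other.

```python
def common_prefix(a, b):
    """Greatest common prefix of two paths given as tuples of steps."""
    prefix = []
    for step_a, step_b in zip(a, b):
        if step_a != step_b:
            break
        prefix.append(step_a)
    return tuple(prefix)


def follows_commute(a, b):
    """True when the two unnesting follows may be interchanged."""
    c = common_prefix(a, b)
    if not c:  # no common prefix at all: the null path
        return True
    # Assumption: C must be a strict prefix of both A and B.
    return len(c) < len(a) and len(c) < len(b)


same_entry = follows_commute(("invoice", "acc_Num"), ("invoice", "carrier"))
disjoint = follows_commute(("invoice", "acc_Num"), ("customer", "acc_Num"))
nested = follows_commute(("invoice",), ("invoice", "carrier"))
```

Under this reading, the only pairs that cannot be swapped are those where one follow starts from the vertex the other one produces.
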
Application of Rule on Invoice
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] = Φμ(carrier:invoice)[Φμ(acc_Num:invoice)] ?
TRUE or FALSE?

Application of Rule on Invoice
Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] = Φμ(carrier:invoice)[Φμ(acc_Num:invoice)]
TRUE, because both paths share the common prefix "invoice".
Case: AńB = invoice

Benefit of Rule Application
NOTE: Assume acc_Num is required for each invoice element, while carrier is not.
THEN: Φμ(acc_Num:invoice)[Φμ(carrier:invoice)] = Φμ(carrier:invoice)[Φμ(acc_Num:invoice)]
Which algebra tree do we prefer?

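Discussion
Reduction of input size on the first sub-operation: Φμ(carrier:invoice) vs Φμ(acc_Num:invoice)
Because carrier is optional, applying Φμ(carrier:invoice) first passes fewer bags on to the second follow.
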
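A toy cost model makes the difference concrete. Everything here is an assumption for illustration: 100 invoices, every one with an acc_Num, one in ten with a carrier, at most one child per field (so an unnesting follow acts as a filter), and cost measured as the number of bags each follow has to read.

```python
def follow(bags, field):
    """Simplified unnesting follow: keep bags whose invoice has the field."""
    return [bag for bag in bags if field in bag["invoice"]]


def plan_cost(bags, order):
    """Total number of bags read by the follow operators, applied in order."""
    cost = 0
    for field in order:
        cost += len(bags)
        bags = follow(bags, field)
    return cost


# Made-up data: acc_Num is always present, carrier only on every tenth invoice.
invoices = []
for n in range(100):
    invoice = {"acc_Num": n}
    if n % 10 == 0:
        invoice["carrier"] = "Sprint"
    invoices.append({"invoice": invoice})

carrier_first = plan_cost(invoices, ["carrier", "acc_Num"])  # 100 + 10
acc_num_first = plan_cost(invoices, ["acc_Num", "carrier"])  # 100 + 100
```

With these numbers the carrier-first order reads 110 bags against 200 for the acc_Num-first order, while both orders yield the same final result.
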
Can we apply the rule below?
Φμ(acc_Num:invoice)[Φμ(acc_Num:customer)]

Example
"acc_Num:invoice" and "acc_Num:customer" are two totally different paths.
Case: AnB = ┴
So yes, the rule is valid.

Summary
XML Algebra
Operations
Optimization