Chapter 10: XML The world of XML. The Data Semistructured data instance = a large graph.

Presentation transcript:

Chapter 10: XML The world of XML

The Data Semistructured data instance = a large graph

The indexing problem
The storage problem
–Store the graph in a relational DBMS
–Develop a new database storage structure
The indexing problem:
–Input: large, irregular data graph
–Output: index structure for evaluating (regular) path expressions, e.g. bib.paper.author.firstname
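As a baseline for what an index has to beat, the sketch below evaluates a simple label path by scanning the whole edge list once per step. The edge-triple representation, the node names, and the function name `eval_path` are assumptions made for illustration; they do not come from the slides.

```python
def eval_path(edges, root, path):
    """Evaluate a dot-separated label path on an edge-labelled graph.

    edges: list of (source, label, target) triples (hypothetical encoding).
    Returns the set of nodes reached from root by following the labels.
    """
    current = {root}
    for label in path.split("."):
        # One full scan of the edge list per step: this is the cost an index avoids.
        current = {dst for (src, lab, dst) in edges
                   if src in current and lab == label}
    return current


# Hypothetical data graph
EDGES = [
    ("r", "bib", "b"),
    ("b", "paper", "p1"),
    ("b", "paper", "p2"),
    ("p1", "author", "a1"),
    ("a1", "firstname", "f1"),
    ("p2", "title", "t1"),
]

result = eval_path(EDGES, "r", "bib.paper.author.firstname")
```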

XSet: a simple index for XML Part of the Ninja project at Berkeley Example XML data:

XSet: a simple index for XML
Each node = a hashtable
Each entry = list of pointers to data nodes (not shown)
SELECT X FROM part.name X -yes
SELECT X FROM part.supplier.name X -yes
SELECT X FROM part.*.subpart.name X -maybe
SELECT X FROM *.supplier.name X -maybe

Region Algebras
structured text = text with tags (like XML)
data = sequence of characters [c_1 c_2 c_3 …]
region = interval in the text
–representation (x,y) = [c_x, c_x+1, …, c_y]
–example: …
region set = a set of regions
–example: all regions (may be nested)
region algebra = operators on region sets, s1 op s2
s1 intersect s2 = {r | r ∈ s1, r ∈ s2}
s1 included s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊆ r'}
s1 including s2 = {r | r ∈ s1, ∃ r' ∈ s2, r ⊇ r'}
s1 parent s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a parent of r'}
s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}

Region Algebras
part.name: name child (part child root)
part.supplier.name: name child (supplier child (part child root))
*.supplier.name: name child supplier
part.*.subpart.name: name child (subpart included (part child root))
Region expressions correspond to simple XPath expressions
s1 child s2 = {r | r ∈ s1, ∃ r' ∈ s2, r is a child of r'}

Efficient computation of Region Algebra Operators
Example: s1 included s2
s1 = {(x1,x1'), (x2,x2'), …}
s2 = {(y1,y1'), (y2,y2'), …}
(i.e. assume each consists of disjoint regions, sorted by start position)
Algorithm:
if xi < yj then i := i + 1
else if xi' > yj' then j := j + 1
otherwise: print (xi,xi'), do i := i + 1
Can be done in sub-linear time when one of the region sets is very small
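The merge for "s1 included s2" can be sketched as follows. This is a minimal illustration that assumes both region sets are lists of (start, end) pairs, each set disjoint and sorted by start position. The function name and the sample regions are made up for the example.

```python
def included(s1, s2):
    """Return the regions of s1 that lie inside some region of s2.

    Both inputs are lists of (start, end) pairs, assumed disjoint
    within each list and sorted by start position.
    """
    i = j = 0
    result = []
    while i < len(s1) and j < len(s2):
        x, x_end = s1[i]
        y, y_end = s2[j]
        if x < y:
            # s1[i] starts before s2[j]; it cannot be inside s2[j] or any later region
            i += 1
        elif x_end > y_end:
            # s1[i] ends after s2[j]; only a later region of s2 could contain it
            j += 1
        else:
            # y <= x and x_end <= y_end: s1[i] is included in s2[j]
            result.append(s1[i])
            i += 1
    return result


# Hypothetical region sets
names = [(2, 3), (5, 9), (12, 13), (20, 25)]
parts = [(1, 4), (10, 15)]
inside = included(names, parts)
```

Each step advances one of the two cursors, so the loop runs in time linear in the combined size of the two sets. The sub-linear variant mentioned on the slide would replace the step-by-step advance with a search when one set is much smaller than the other.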

Storage structures for region algebras
Every node is characterised by an integer pair (x,y)
This means we have a 2-d space
Any 2-d space data structure can be used
If you use a (pre-order, post-order) numbering you get a triangular filling of the 2-d space (to be discussed later)
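A small sketch of the (pre-order, post-order) numbering may help here. With such a numbering, a node a is an ancestor of a node d exactly when pre(a) < pre(d) and post(a) > post(d), so ancestor tests become comparisons in the 2-d space. The nested-tuple tree encoding, the unique node labels, and the function names below are assumptions made for illustration.

```python
def number_tree(tree):
    """Assign (pre, post) ranks to every node.

    tree: (label, [children]) nested tuples; labels are assumed unique.
    Returns {label: (pre_rank, post_rank)}.
    """
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        label, children = node
        pre = counters["pre"]          # rank on first visit
        counters["pre"] += 1
        for child in children:
            visit(child)
        ranks[label] = (pre, counters["post"])  # rank after all children
        counters["post"] += 1

    visit(tree)
    return ranks


def is_ancestor(ranks, a, d):
    """True if a is a proper ancestor of d."""
    (pre_a, post_a), (pre_d, post_d) = ranks[a], ranks[d]
    return pre_a < pre_d and post_a > post_d


# Hypothetical document tree
TREE = ("part", [("name", []), ("supplier", [("sname", [])])])
RANKS = number_tree(TREE)
```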

Alternative mappings
Mapping the structure to the relational world
–The Edge approach
–The Attribute approach
–The Universal Table approach
–The Normalized Universal approach
–The Monet/XML approach
–The Dataguide approach
Mapping values
–Separate value tables
–Inlining
Shredding

Dataguide approach
Developed in the context of Lore, Lorel (Stanford Univ)
Predecessor of the Monet/XML model
Observation:
–queries in the graph representation take a limited form
–they are partial walks from the root to an object of interest
–this behaviour was stressed by the query language Lorel, i.e. an SQL-based query language based on processing regular expressions
SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X

DataGuides
Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:
- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique

Dataguides Example:

DataGuides Multiple DataGuides for the same data:

DataGuides
Definition: Let w, w' be two words (i.e. word queries) and G a graph
w ≡_G w' if w(G) = w'(G)
Definition: G is a strong dataguide for a database DB if ≡_G is the same as ≡_DB
Example:
- G1 is a strong dataguide
- G2 is not strong
person.project ≢_DB dept.project
person.project ≢_G2 dept.project

DataGuides
Constructing the strong DataGuide G:
Nodes(G) = {{root}}
Edges(G) = ∅
while changes do
  choose s in Nodes(G), a in Labels
  add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
  add (s -a-> s') to Edges(G)
Use hash table for Nodes(G)
This is precisely the powerset automaton construction.
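The construction above can be sketched as a worklist version of the powerset construction. This is an illustrative sketch, not the Lore implementation: the edge-triple encoding, the function name, and the sample database are assumptions. Empty target sets are skipped, and a Python set of frozensets plays the role of the hash table for Nodes(G).

```python
from collections import deque


def strong_dataguide(edges, root):
    """Build the strong DataGuide of an edge-labelled data graph.

    edges: iterable of (source, label, target) triples (hypothetical encoding).
    Returns (nodes, guide_edges); every DataGuide node is a frozenset of
    database nodes, i.e. the extent reached by one label path.
    """
    out = {}
    for src, label, dst in edges:
        out.setdefault(src, {}).setdefault(label, set()).add(dst)

    start = frozenset([root])
    nodes = {start}              # hashed set of extents, as the slide suggests
    guide_edges = set()
    queue = deque([start])
    while queue:
        s = queue.popleft()
        # Group the targets of all members of s by edge label.
        by_label = {}
        for x in s:
            for label, targets in out.get(x, {}).items():
                by_label.setdefault(label, set()).update(targets)
        for label, targets in by_label.items():
            t = frozenset(targets)
            guide_edges.add((s, label, t))
            if t not in nodes:
                nodes.add(t)
                queue.append(t)
    return nodes, guide_edges


# Hypothetical database: two persons and one department sharing a project
DB = [
    ("root", "person", "p1"), ("root", "person", "p2"), ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"), ("d1", "project", "j1"),
]
G_NODES, G_EDGES = strong_dataguide(DB, "root")
```

In this sample the paths person.project and dept.project reach different extents ({j1, j2} and {j1}), so the strong DataGuide keeps them as separate nodes.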

DataGuides
How large are the dataguides?
–if DB is a tree, then size(G) <= size(DB). Why? Answer: every node is in exactly one extent of G. Here: dataguide = XSet
–How many nodes does the strong dataguide have for this DB? 20 nodes (least common multiple of 4 and 5)
Dataguides usually fail on data with cyclic schemas, like:

Monet XML approach

Querying the XML world

Querying and Transforming XML Data
Standard XML querying/translation languages
–XPath: simple language consisting of path expressions
–XSLT: simple language designed for translation from XML to XML and XML to HTML
–XQuery: an XML query language with a rich set of features
A wide variety of other languages have been proposed, and some served as the basis for the XQuery standard
–XML-QL, Quilt, XQL, …

Tree Model of XML Data
Query and transformation languages are based on a tree model of XML data
An XML document is modeled as a tree, with nodes corresponding to elements and attributes
–Element nodes have children nodes, which can be attributes or subelements
–Text in an element is modeled as a text node child of the element
–Children of a node are ordered according to their order in the XML document
–Element and attribute nodes (except for the root node) have a single parent, which is an element node
–The root node has a single child, which is the root element of the document
We use the terminology of nodes, children, parent, siblings, ancestor, descendant, etc., which should be interpreted in the above tree model of XML data.

XML data with ID and IDREF attributes Downtown 500 Joe Monroe Madison Mary Erin Newark

XPath
XPath is used to address (select) parts of documents using path expressions
A path expression is a sequence of steps separated by “/”
–Think of file names in a directory hierarchy
Result of path expression: set of values that along with their containing elements/attributes match the specified path
E.g. /bank-2/customer/name evaluated on the bank-2 data we saw earlier returns <name>Joe</name> <name>Mary</name>
E.g. /bank-2/customer/name/text( ) returns the same names, but without the enclosing tags

XPath (Cont.)
The initial “/” denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
–Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
–E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
–/bank-2/account[balance] returns account elements containing a balance subelement
Attributes are accessed using “@”
–E.g. /bank-2/account[balance > 400]/@account-number returns the account numbers of those accounts with balance > 400
–IDREF attributes are not dereferenced automatically (more on this later)
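The path expressions above can be tried out with Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset. The document below is a hypothetical stand-in for the bank-2 data: the element layout, the `account-number` attribute name, and the second account are assumptions. ElementTree has no numeric predicates, so the `balance > 400` test is done in Python rather than in the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical bank-2 style document (layout and attribute names are assumed)
DOC = """
<bank-2>
  <account account-number="A-401">
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <account account-number="A-402">
    <branch-name>Downtown</branch-name>
    <balance>300</balance>
  </account>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name></customer>
</bank-2>
"""

root = ET.fromstring(DOC)  # root is the <bank-2> element

# /bank-2/customer/name/text(): paths are relative to the root element here
names = [n.text for n in root.findall("./customer/name")]

# /bank-2/account[balance]: accounts that have a balance subelement
with_balance = root.findall("./account[balance]")

# /bank-2/account[balance > 400]/@account-number, with the comparison in Python
rich = [a.get("account-number")
        for a in root.findall("./account")
        if float(a.findtext("balance")) > 400]
```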

Functions in XPath
XPath provides several functions
–The function count() at the end of a path counts the number of elements in the set generated by the path
 E.g. /bank-2/account[customer/count() > 2] returns accounts with > 2 customers
–Also function for testing position (1, 2, ..) of node w.r.t. siblings
Boolean connectives and and or and function not() can be used in predicates
IDREFs can be referenced using function id()
–id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
–E.g. … returns all customers referred to from the owners attribute of account elements.

More XPath Features
Operator “|” used to implement union
–E.g. … | … gives customers with either accounts or loans
–However, “|” cannot be nested inside other operators.
“//” can be used to skip multiple levels of nodes
–E.g. /bank-2//name finds any name element anywhere under the /bank-2 element, regardless of the element in which it is contained.
A step in the path can go to (13 variations in the standard): parents, siblings, ancestors and descendants of the nodes generated by the previous step, not just to the children
–“//”, described above, is a short form for specifying “all descendants”
–“..” specifies the parent.

Pathfinder
XPath is essential for the implementation of an XQuery processor. It is strongly tied to the underlying data structures and their primitives. A state-of-the-art implementation is MonetDB/Pathfinder, developed by the University of Konstanz, Twente University, and CWI

Pathfinder Uni Konstanz

Pathfinder

Pathfinder

Pathfinder

Staircase join

Pathfinder

XQuery

XQuery is a general purpose query language for XML data
Currently being standardized by the World Wide Web Consortium (W3C)
–The textbook description is based on a March 2001 draft of the standard. The final version may differ, but major features are likely to stay unchanged.
Alpha versions of XQuery engines
–Galax
–IPSI-IQ
–Xpath visualized
–MonetDB/Pathfinder
–Xhive
XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL

XQuery
XQuery uses a for … let … where … return … syntax
for ⇔ SQL from
where ⇔ SQL where
return ⇔ SQL select
let allows temporary variables, and has no equivalent in SQL
Variables make it possible to keep the state of processing around, which severely complicates optimization

FLWR Syntax in XQuery
The for clause uses XPath expressions, and the variable in the for clause ranges over the values in the set returned by XPath

FLWR Syntax in XQuery
Simple FLWR expression in XQuery
–find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag
for $x in /bank-2/account
let $acctno := $x/@account-number
where $x/balance > 400
return <account-number> $acctno </account-number>
The let clause is not really needed in this query, and the selection can be done in XPath. The query can be written as:
for $x in /bank-2/account[balance>400]
return <account-number> $x/@account-number </account-number>
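For readers more used to imperative code, the four FLWR clauses map onto a loop, an assignment, a filter, and a yield. The Python sketch below mirrors the query on hypothetical account records. The dictionary layout, the sample values, and the `account-number` wrapper tag are assumptions for illustration.

```python
# Hypothetical account records standing in for /bank-2/account
ACCOUNTS = [
    {"account-number": "A-401", "balance": 500},
    {"account-number": "A-402", "balance": 300},
]


def flwr(accounts):
    for x in accounts:                    # for   $x in /bank-2/account
        acctno = x["account-number"]      # let   $acctno := the account number of $x
        if x["balance"] > 400:            # where $x/balance > 400
            # return: wrap each result in an enclosing tag
            yield f"<account-number>{acctno}</account-number>"


results = list(flwr(ACCOUNTS))
```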

Path Expressions and Functions
Path expressions are used to bind variables in the for clause, but can also be used in other places
–E.g. path expressions can be used in the let clause, to bind variables to results of path expressions
The function distinct( ) can be used to remove duplicates in path expression results
The function document(name) returns the root of the named document
–E.g. document(“bank-2.xml”)/bank-2/account
Aggregate functions such as sum( ) and count( ) can be applied to path expression results
XQuery does not support group by, but the same effect can be obtained with nested FLWR expressions within a return clause
–More on nested queries later

Joins
Joins are specified in a manner very similar to SQL
for $a in /bank/account, $c in /bank/customer, $d in /bank/depositor
where $a/account-number = $d/account-number and $c/customer-name = $d/customer-name
return $c $a
The same query can be expressed with the selections specified as XPath selections:
for $a in /bank/account, $c in /bank/customer, $d in /bank/depositor[account-number = $a/account-number and customer-name = $c/customer-name]
return $c $a
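Read literally, the three-variable for clause is a nested loop over account, customer and depositor with the where clause as the join condition. The sketch below spells that out on hypothetical records. All sample values and the function name are assumptions, and a real engine would pick a better join order than this naive triple loop.

```python
# Hypothetical flat bank data
ACCOUNTS = [
    {"account-number": "A-101", "balance": 500},
    {"account-number": "A-102", "balance": 400},
]
CUSTOMERS = [{"customer-name": "Johnson"}, {"customer-name": "Hayes"}]
DEPOSITORS = [
    {"account-number": "A-101", "customer-name": "Johnson"},
    {"account-number": "A-102", "customer-name": "Hayes"},
]


def customer_accounts(accounts, customers, depositors):
    """Naive nested-loop evaluation of the join query."""
    return [
        (c["customer-name"], a["account-number"])
        for a in accounts
        for c in customers
        for d in depositors
        if a["account-number"] == d["account-number"]
        and c["customer-name"] == d["customer-name"]
    ]


pairs = customer_accounts(ACCOUNTS, CUSTOMERS, DEPOSITORS)
```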

Changing Nesting Structure
The following query converts data from the flat structure for bank information into the nested structure used in bank-1
for $c in /bank/customer
return $c/*
  for $d in /bank/depositor[customer-name = $c/customer-name],
      $a in /bank/account[account-number=$d/account-number]
  return $a
$c/* denotes all the children of the node to which $c is bound, without the enclosing top-level tag
Exercise for reader: write a nested query to find sum of account balances, grouped by branch.

XQuery Path Expressions $c/text() gives text content of an element without any subelements/tags XQuery path expressions support the “–>” operator for dereferencing IDREFs –Equivalent to the id( ) function of XPath, but simpler to use –Can be applied to a set of IDREFs to get a set of results

Sorting in XQuery
A sortby clause can be used at the end of any expression. E.g. to return customers sorted by name
for $c in /bank/customer
return $c/* sortby(name)
Can sort at multiple levels of nesting (sort by customer-name, and by account-number within each customer)
for $c in /bank/customer
return $c/*
  for $d in /bank/depositor[customer-name=$c/customer-name],
      $a in /bank/account[account-number=$d/account-number]
  return $a/* sortby(account-number)
sortby(customer-name)

Functions and Other XQuery Features
User defined functions with the type system of XML Schema
function balances(xsd:string $c) returns list(xsd:numeric) {
  for $d in /bank/depositor[customer-name = $c],
      $a in /bank/account[account-number=$d/account-number]
  return $a/balance
}
Types are optional for function parameters and return values
Universal and existential quantification in where clause predicates
–some $e in path satisfies P
–every $e in path satisfies P
XQuery also supports if-then-else clauses

XMark
Used in most experiments on XPath and XQuery evaluation
Old figures on hand-compiled queries for the dataguide approach can be found in eport.html

Xmark

XMark

Monet XML approach

XMark Q1 Return the name of the person with ID ‘personal’ FOR $b IN RETURN $b/name/text()

Do it yourself Or skip

Xmark queries Q2: Return the initial increases of all open auctions. –This query evaluates the cost of array look-ups. Note that this query may actually be harder to evaluate than it looks; especially relational back-ends may have to struggle with rather complex aggregations to select the bidder element with index 1. Q3: Return the IDs of all open auctions whose current increase is at least twice as high as the initial increase. –This is a more complex application of index lookups. In the case of a relational DBMS, the query can take advantage of set-valued aggregates on the index attribute to accelerate the execution.

Xmark queries Q4: List the reserves of those open auctions where a certain person issued a bid before another person –This time, we stress the textual nature of XML documents by querying the tag order in the source document Q5: How many sold items cost more than 40? –Strings are the generic data type in XML documents. Queries that interpret strings will often need to cast strings to another data type that carries more semantics. This query challenges the DBMS in terms of the casting primitives it provides. Especially, if there is no additional schema information or just a DTD at hand, casts are likely to occur frequently.

Xmark queries Q6: How many items are listed on all continents? Regular path expressions are a fundamental building block of virtually every query language for XML or semi-structured data. These queries investigate how well the query processor can optimize path expressions and prune traversals of irrelevant parts of the tree. Q7: How many pieces of prose are in our database? A good evaluation engine should realize that there is no need to traverse the complete document tree to evaluate such expressions. Also note that the COUNT aggregation does not require a complete traversal of the tree. Just the cardinality of the respective relation is queried. Note that the tag does not exist in the database document.

Xmark queries Q8: List the names of persons and the number of items they bought. (joins person, closed_auction) References are an integral part of XML as they allow richer relationships than just hierarchical element structures. This query defines horizontal traversals with increasing complexity. A good query optimizer should take advantage of the cardinalities of the sets to be joined. Q9: List the names of persons and the names of the items they bought in Europe. (joins person, closed_auction, item) References are an integral part of XML as they allow richer relationships than just hierarchical element structures. These queries define horizontal traversals with increasing complexity. A good query optimizer should take advantage of the cardinalities of the sets to be joined.

Xmark queries Q10: List all persons according to their interest; use French markup in the result. Constructing new elements may put the storage engine under stress, especially in the context of creating materialized document views. The following query reverses the structure of person records by grouping them according to the interest profile of a person. Large parts of the person records are repeatedly reconstructed. To avoid simple copying of the original database we translate the mark-up into French. Q11: For each person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income. This query tests the database's ability to handle large (intermediate) results. This time, joins are on the basis of values. The difference between these queries and the reference chasing queries Q8 and Q9 is that references are specified in the DTD and may be optimized with logical OIDs for example. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities.

Xmark queries Q12: For each richer-than-average person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income This query tests the database's ability to handle large (intermediate) results. This time, joins are on the basis of values. The difference between these queries and the reference chasing queries Q8 and Q9 is that references are specified in the DTD and may be optimized with logical OIDs for example. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities. Q13: List the names of items registered in Australia along with their descriptions. A key design for XML->DBMS mappings is to determine the fragmentation criteria. The complementary action is to reconstruct the original document from its broken-down representation. Query 13 tests for the ability of the database to reconstruct portions of the original XML document.

Xmark queries Q14:Return the names of all items whose description contains the word `gold'. We continue to challenge the textual nature of XML documents; this time, we conduct a full-text search in the form of keyword search. Although full-text scanning could be studied in isolation we think that the interaction with structural mark-up is essential as the concepts are considered orthogonal; so query Q14 is restricted to a subset of the document by combining content and structure. Q15: Print the keywords in emphasis in annotations of closed auctions. We now try to quantify the costs of long path traversals that don't include wildcards. We first descend deep into the tree (Query 15) and then return again (Query 16). Both queries only check for the existence of paths rather than selecting paths with predicates.

Xmark queries Q16: Return the IDs of those auctions that have one or more keywords in emphasis. Q17: Which persons don't have a homepage? This is to test how well the query processor knows how to deal with the semi-structured aspect of XML data, especially elements that are declared optional in the DTD. Q18: Convert the currency of the reserve of all open auctions to another currency. This query puts the application of user defined functions (UDF) to the proof. In the XML world, UDFs are of particular importance because they allow the user to assign semantics to generic strings that go beyond type coercion.

Query optimizer challenges
A mapping of XQuery to an RDBMS should be able
–to deal with ordered tables
–to skip sub-documents
–to perform dynamic type casting
–to avoid unnecessary construction of string intermediates
–to recognize join-paths for fast access
–to balance fragmentation and reconstruction costs

Xmark answers Q2: Return the initial increases of all open auctions. –This query evaluates the cost of array look-ups. Note that this query may actually be harder to evaluate than it looks; especially relational back-ends may have to struggle with rather complex aggregations to select the bidder element with index 1. FOR $b IN document("auction.xml")/site/open_auctions/open_auction RETURN $b/bidder[1]/increase/text()

XMark Q3: Return the IDs of all open auctions whose current increase is at least twice as high as the initial increase.
–This is a more complex application of index lookups. In the case of a relational DBMS, the query can take advantage of set-valued aggregates on the index attribute to accelerate the execution.
FOR $b IN document("auction.xml")/site/open_auctions/open_auction
WHERE $b/bidder[1]/increase/text() * 2 <= $b/bidder[last()]/increase/text()
RETURN <increase first=$b/bidder[1]/increase/text() last=$b/bidder[last()]/increase/text()/>

Xmark result Q4: List the reserves of those open auctions where a certain person issued a bid before another person –This time, we stress the textual nature of XML documents by querying the tag order in the source document FOR $b IN document("auction.xml")/site/open_auctions/open_auction WHERE $b/bidder/personref[id="person18829"] BEFORE $b/bidder/personref[id="person10487"] RETURN $b/initial/text()

Xmark answers Q5: How many sold items cost more than 40?
–Strings are the generic data type in XML documents. Queries that interpret strings will often need to cast strings to another data type that carries more semantics. This query challenges the DBMS in terms of the casting primitives it provides. Especially, if there is no additional schema information or just a DTD at hand, casts are likely to occur frequently.
COUNT (FOR $i IN document("auction.xml")/site/closed_auctions/closed_auction
       WHERE $i/price/text() >= 40
       RETURN $i/price)
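The casting issue behind Q5 shows up as soon as prices stored as text are compared. The sketch below uses made-up price strings (not XMark data) to show that a string comparison and a numeric comparison give different counts, which is why the engine has to cast.

```python
# Hypothetical price values as they would appear in untyped XML text nodes
PRICES = ["35.50", "40", "120.00"]

# Comparing as strings is lexicographic: "120.00" sorts before "40"
as_strings = sum(1 for p in PRICES if p >= "40")

# Casting to a number first gives the intended answer
as_numbers = sum(1 for p in PRICES if float(p) >= 40)
```

Here the string comparison counts only "40", while the numeric comparison also counts "120.00".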

XMark results
Q6: How many items are listed on all continents?
– Regular path expressions are a fundamental building block of virtually every query language for XML or semi-structured data. These queries investigate how well the query processor can optimize path expressions and prune traversals of irrelevant parts of the tree.
FOR $b IN document("auction.xml")/site/regions
RETURN COUNT ($b//item)

XMark results
Q7: How many pieces of prose are in our database?
– A good evaluation engine should realize that there is no need to traverse the complete document tree to evaluate such expressions. Also note that the COUNT aggregation does not require a complete traversal of the tree. Just the cardinality of the respective relation is queried. Note that the tag does not exist in the database document.
FOR $p IN document("auction.xml")/site
RETURN count($p//description) + count($p//annotation) + count($p// );

XMark results
Q8: List the names of persons and the number of items they bought. (joins person, closed_auction)
– References are an integral part of XML as they allow richer relationships than just hierarchical element structures. This query defines horizontal traversals with increasing complexity. A good query optimizer should take advantage of the cardinalities of the sets to be joined.
FOR $p IN document("auction.xml")/site/people/person
LET $a := FOR $t IN document("auction.xml")/site/closed_auctions/closed_auction
          WHERE = RETURN $t
RETURN COUNT ($a)
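A reference-chasing join of this kind can be sketched as a hash join. The attribute names used here (`person/@id` and `buyer/@person`) and the sample document are assumptions made for illustration; they are not taken from the query text above.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample data; the reference attributes are assumed names.
DOC = """<site>
<people>
<person id="p1"><name>Ann</name></person>
<person id="p2"><name>Bob</name></person>
</people>
<closed_auctions>
<closed_auction><buyer person="p1"/></closed_auction>
<closed_auction><buyer person="p1"/></closed_auction>
</closed_auctions>
</site>"""


def items_bought(root):
    """Return (name, number of purchases) per person using one pass over each side."""
    # Build side: one scan of the closed auctions, grouped by referenced person.
    bought = Counter(
        buyer.get("person")
        for buyer in root.findall("closed_auctions/closed_auction/buyer")
    )
    # Probe side: one lookup per person; Counter yields 0 for persons without purchases.
    return [
        (person.findtext("name"), bought[person.get("id")])
        for person in root.findall("people/person")
    ]


purchases = items_bought(ET.fromstring(DOC))
```

Hashing the auction side first avoids rescanning all closed auctions once per person, which is what a literal evaluation of the nested FOR would do.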

XMark results
Q9: List the names of persons and the names of the items they bought in Europe. (joins person, closed_auction, item)
– References are an integral part of XML as they allow richer relationships than just hierarchical element structures. These queries define horizontal traversals with increasing complexity. A good query optimizer should take advantage of the cardinalities of the sets to be joined.
FOR $p IN document("auction.xml")/site/people/person
LET $a := FOR $t IN document("auction.xml")/site/closed_auctions/closed_auction
          LET $n := FOR $t2 IN document("auction.xml")/site/regions/europe/item
                    WHERE = RETURN $t2
          WHERE = RETURN $n/name/text()
RETURN $a

XMark results
Q10: List all persons according to their interest; use French markup in the result.
– Constructing new elements may put the storage engine under stress, especially in the context of creating materialized document views. The following query reverses the structure of person records by grouping them according to the interest profile of a person. Large parts of the person records are repeatedly reconstructed. To avoid simple copying of the original database, we translate the mark-up into French.
FOR $i IN DISTINCT
LET $p := FOR $t IN document("auction.xml")/site/people/person
          WHERE = $i
          RETURN $t/gender/text(), $t/age/text(), $t/education/text(), $t/income/text(),
                 $t/name/text(), $t/street/text(), $t/city/text(), $t/country/text(),
                 $t/ /text(), $t/homepage/text(), $t/creditcard/text()
RETURN $i, $p

XMark results
Q11: For each person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income.
– This query tests the database's ability to handle large (intermediate) results. This time, joins are on the basis of values. The difference between these queries and the reference-chasing queries Q8 and Q9 is that references are specified in the DTD and may be optimized, for example with logical OIDs. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities.
FOR $p IN document("auction.xml")/site/people/person
LET $c := FOR $i IN document("auction.xml")/site/open_auctions/open_auction/initial
          WHERE > (5000 * $i/text())
          RETURN $i
RETURN COUNT ($c)
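One way to keep the intermediate result of such a value-based join small is to sort one side and count matches by binary search instead of materializing every (person, item) pair. The sketch below assumes that income is stored in a `profile/@income` attribute and uses a hypothetical sample document; both are illustrative assumptions.

```python
from bisect import bisect_left
import xml.etree.ElementTree as ET

# Hypothetical sample data; the location of the income value is an assumption.
DOC = """<site>
<people>
<person id="p1"><profile income="120000"/></person>
<person id="p2"><profile income="50000"/></person>
<person id="p3"/>
</people>
<open_auctions>
<open_auction><initial>10.00</initial></open_auction>
<open_auction><initial>20.00</initial></open_auction>
<open_auction><initial>50.00</initial></open_auction>
</open_auctions>
</site>"""


def affordable_counts(root):
    """For each person, count initial prices with income > 5000 * price."""
    # Sort the scaled prices once; each person then costs one binary search.
    thresholds = sorted(
        5000 * float(initial.text)
        for initial in root.findall("open_auctions/open_auction/initial")
    )
    counts = []
    for person in root.findall("people/person"):
        profile = person.find("profile")
        income = profile.get("income") if profile is not None else None
        # bisect_left counts thresholds strictly below the income (strict '>').
        matched = bisect_left(thresholds, float(income)) if income is not None else 0
        counts.append((person.get("id"), matched))
    return counts


counts = affordable_counts(ET.fromstring(DOC))
```

A price of 0.02% of the income corresponds to the factor 5000 in the comparison, since 1 / 0.0002 = 5000.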

XMark results
Q12: For each richer-than-average person, list the number of items currently on sale whose price does not exceed 0.02% of the person's income.
– This query tests the database's ability to handle large (intermediate) results. This time, joins are on the basis of values. The difference between these queries and the reference-chasing queries Q8 and Q9 is that references are specified in the DTD and may be optimized, for example with logical OIDs. The two queries Q11 and Q12 cascade in the size of the result set and provide various optimization opportunities.
FOR $p IN document("auction.xml")/site/people/person
LET $l := FOR $i IN document("auction.xml")/site/open_auctions/open_auction/initial
          WHERE > (5000 * $i/text())
          RETURN $i
WHERE >
RETURN COUNT ($l)

XMark results
Q13: List the names of items registered in Australia along with their descriptions.
– A key design decision for XML-to-DBMS mappings is determining the fragmentation criteria. The complementary action is to reconstruct the original document from its broken-down representation. Query 13 tests the ability of the database to reconstruct portions of the original XML document.
FOR $i IN document("auction.xml")/site/regions/australia/item
RETURN $i/description

XMark results
Q14: Return the names of all items whose description contains the word "gold".
– We continue to challenge the textual nature of XML documents; this time, we conduct a full-text search in the form of keyword search. Although full-text scanning could be studied in isolation, we think that the interaction with structural mark-up is essential, as the concepts are considered orthogonal; so query Q14 is restricted to a subset of the document by combining content and structure.
FOR $i IN document("auction.xml")/site//item
WHERE CONTAINS ($i/description,"gold")
RETURN $i/name/text()
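Combining structure and content can be sketched as follows: the structural part restricts the search to item descriptions, and the content part tests the concatenated text of each description. The sample document is hypothetical, and the test is a plain substring match, so a word such as "golden" would also qualify.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the keyword sits inside nested mark-up on purpose.
DOC = """<site><regions>
<europe><item><name>Ring</name><description><text>pure <emph>gold</emph> band</text></description></item></europe>
<asia><item><name>Vase</name><description><text>blue porcelain</text></description></item></asia>
</regions></site>"""


def items_mentioning(root, word):
    """Return names of items whose description text contains word as a substring."""
    names = []
    for item in root.iter("item"):  # structural restriction: any item below the root
        description = item.find("description")
        if description is None:
            continue
        # itertext() flattens nested mark-up, so text inside <emph> is found too.
        if word in "".join(description.itertext()):
            names.append(item.findtext("name"))
    return names


golden_items = items_mentioning(ET.fromstring(DOC), "gold")
```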

XMark results
Q15: Print the keywords in emphasis in annotations of closed auctions.
– We now try to quantify the costs of long path traversals that don't include wildcards. We first descend deep into the tree (Query 15) and then return again (Query 16). Both queries only check for the existence of paths rather than selecting paths with predicates.
FOR $a IN document("auction.xml")/site/closed_auctions/closed_auction/annotation/description/parlist/listitem/parlist/listitem/text/emph/keyword/text()
RETURN $a

XMark results
Q16: Return the IDs of those auctions that have one or more keywords in emphasis.
FOR $a IN document("auction.xml")/site/closed_auctions/closed_auction
WHERE NOT EMPTY ($a/annotation/description/parlist/listitem/parlist/listitem/text/emph/keyword/text())
RETURN

XMark results
Q17: Which persons don't have a homepage?
– This tests how well the query processor deals with the semi-structured aspect of XML data, especially elements that are declared optional in the DTD.
FOR $p IN document("auction.xml")/site/people/person
WHERE EMPTY($p/homepage/text())
RETURN
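Handling an optional element can be sketched as below. The sample document is hypothetical, and returning the person's name is an assumption about the intended result; a missing homepage and an empty homepage element are treated alike, mirroring the emptiness test on the text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: homepage present, absent, and present but empty.
DOC = """<site><people>
<person id="p1"><name>Ann</name><homepage>http://example.org/ann</homepage></person>
<person id="p2"><name>Bob</name></person>
<person id="p3"><name>Cy</name><homepage/></person>
</people></site>"""


def without_homepage(root):
    """Return names of persons that have no homepage text."""
    return [
        person.findtext("name")
        for person in root.findall("people/person")
        # findtext gives None for a missing element and "" for an empty one.
        if not (person.findtext("homepage") or "").strip()
    ]


no_homepage = without_homepage(ET.fromstring(DOC))
```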

XMark results
Q18: Convert the currency of the reserve of all open auctions to another currency.
– This query puts the application of user-defined functions (UDFs) to the test. In the XML world, UDFs are of particular importance because they allow the user to assign semantics to generic strings that go beyond type coercion.
FUNCTION CONVERT ($v) { RETURN * $v -- convert Dfl to Euros }
FOR $i IN document("auction.xml")/site/open_auctions/open_auction/
RETURN CONVERT($i/reserve/text())
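Applying a user-defined function to string-typed values can be sketched as follows. The conversion rate is passed in as a parameter, and the value 0.5 used below is purely hypothetical; it is not the factor from the original query. The sample document is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
DOC = """<site><open_auctions>
<open_auction><reserve>100.00</reserve></open_auction>
<open_auction><reserve>30.00</reserve></open_auction>
<open_auction/>
</open_auctions></site>"""


def convert(value, rate):
    """User-defined function: interpret the string as an amount and rescale it."""
    return float(value) * rate


def converted_reserves(root, rate):
    """Apply convert() to every reserve; auctions without a reserve are skipped."""
    return [
        convert(reserve.text, rate)
        for reserve in root.findall("open_auctions/open_auction/reserve")
    ]


# The rate 0.5 is a hypothetical placeholder, not a real exchange rate.
reserves = converted_reserves(ET.fromstring(DOC), 0.5)
```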


XMark results
Effect of loading a 100 MB document into a DBMS

XMark results

Pathfinder/MonetDB 2004 implementation in seconds