L09: Introduction to XML Data Management  XML and XML Query Languages  Structural Summary and Coding Scheme  Managing XML Data in Relational Systems.


XML and XML Query Languages  XML and XML Query Languages  Structural Summary and Coding Scheme  Managing XML Data in Relational Systems

L09: Introduction to XML Data Management, H. Lu/HKUST

XML
- Extensible Markup Language for data
- A W3C standard to complement HTML: http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, Second Edition, 10/2000)
- Standard for publishing and interchange
- Origins: structured text, SGML
  - A "cleaner" SGML for the Internet
- Motivation:
  - HTML describes presentation
  - XML describes content

L09: Introduction to XML Data Management 4H. Lu/HKUST XML – Describing the Content XML Query Processing & Optimization March 18, 2004 Instructor Lu Hongjun HKUST luhj@cs.ust.hk Jeffrey X. Yu CUHK yu@se.cuhk.edu.hk

L09: Introduction to XML Data Management 5H. Lu/HKUST XML Document/Data  Hierarchical document format for information exchange in WWW  Self describing data (tags)  Nested element structures having a root  Element data can have  Attributes  Sub-elements

Basic XML Structures
- Elements: …, …
  - Open and close tags, or an "empty tag"
  - Ordered, nestable
  - An element can be empty
- Attributes
- PCDATA/CDATA
- An XML document has a single root element
- An XML document is well formed if it has matching tags

Basic XML Structures: Attributes
- Single-valued, unordered
  XML Data Management … 2003-2004
- Special types: ID, IDREF, IDREFS
  James  XML Data Management

Other XML Structures
- Processing instructions: instructions for applications
- CDATA sections: treat content as character data
  <![CDATA[ Whatever!!! ]]>
- Comments: just like HTML
- Entities: external resources and macros
  - &my-entity; (non-parameter entity)
  - %param-entity; (parameter entity for DTD declarations)

L09: Introduction to XML Data Management 9H. Lu/HKUST Data Centric vs. Document centric XML H. Lu luhj@cs.ust.hk Managing XML data using RDBMS 2001 … J.X. Yu Data mining Dr Lu is a professor at HKUST. He worked at NUS> before 1998.

XML Data Model
- Several competing models
- Document Object Model (DOM)
  - A platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents
  - http://www.w3.org/DOM/

DOM Core Interface: Node
- DOM tree: a tree-like structure of Node objects; the root of the tree is a Document object.
- Node object properties: nodeName, nodeValue, nodeType, parentNode, childNodes, firstChild, lastChild, previousSibling, nextSibling, attributes, ownerDocument
- nodeType: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE

L09: Introduction to XML Data Management 12H. Lu/HKUST DOM Interface  Each node of the document tree may have a number of child nodes, contained in a NodeList object.  Two ways of accessing a node object  Based on the location of an object in the document tree  Based on the name of an object
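
The two access styles can be tried with Python's standard `xml.dom.minidom`, which implements the DOM Core Node interface. This is a minimal sketch: the document string is a hypothetical fragment modeled on the project example, written without whitespace so that no extra text nodes appear between elements.

```python
from xml.dom import minidom

# Hypothetical fragment modeled on the project example in these slides.
DOC = "<project><pname>XML</pname><member><name>H. Lu</name></member></project>"

dom = minidom.parseString(DOC)

# 1. Access by location: follow child pointers from the root element.
project = dom.documentElement               # root element <project>
first_child = project.firstChild            # <pname>
member = project.lastChild                  # <member>
name_text = member.firstChild.firstChild    # text node under <name>

# 2. Access by name: ask for every element with a given tag name.
names = dom.getElementsByTagName("name")

location_result = (
    first_child.nodeName,
    name_text.nodeType == name_text.TEXT_NODE,
    name_text.nodeValue,
)
name_result = [n.firstChild.nodeValue for n in names]
print(location_result, name_result)
```

Element nodes carry their tag in `nodeName` and have `nodeValue` set to `None`; the character data lives in a separate child text node.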

A Sample DOM Tree
[Figure: DOM tree of the project document with node identifiers (&1, &2, &3, ...), element nodes project, pname, member, name, email, publication, title, author and year, and leaf values XML, H. Lu, luhj@cs.ust.hk, Managing …, 2001, Data mining and J.X. Yu. Callouts: the project and publication nodes have NodeType = ELEMENT_NODE, tagName = "project" / "publication" and NodeValue = null; the node under name has NodeType = TEXT_NODE and NodeValue = 'H. Lu'.]

Data Graph
- Similar to a DOM tree, but may use different notations to represent an XML document
[Figure: data graph of the project document with node identifiers (&1, &2, &3, ...), edge labels project, pname, member, name, email, age, publication, title, author and year, and leaf values XML, H. Lu, luhj@cs.ust.hk, Managing …, 2001, Data mining, J.X. Yu and 50.]

L09: Introduction to XML Data Management 15H. Lu/HKUST Document Type Definition  Inherited from SGML DTD standard  BNF grammar establishing constraints on element structure and content  Specification of attributes and their types  Definitions of entities

A Sample DTD
[Figure: DTD graph with nodes project, pname, member, publication, name, email, ID, title, author and year; edges are annotated with * and ?.]

XML Query Languages
- There have been a large number of proposals during the past few years:
  - XPath [Clark, DeRose, W3C 1999]
  - XQuery [Boag, Chamberlin et al, W3C 2003]
  - XML-QL [Deutsch, Fernandez et al, QL99]
  - XQL [Robie, Lapp, QL99]
  - XML-GL [Ceri, Comai et al, WWW99]
  - Quilt [Chamberlin, Robie et al, 2000]
- From W3C
  - XQuery 1.0 (W3C Working Draft, 12 November 2003) http://www.w3.org/TR/xquery/
  - XPath 2.0 (W3C Working Draft, 12 November 2003) http://www.w3.org/TR/xpath20/

L09: Introduction to XML Data Management 18H. Lu/HKUST XPath: XML Path Language  The purpose  To address the node of an XML tree using a path notation for navigating through the hierarchical structure of an XML document.  Uses a compact, non-XML syntax  Designed to be embedded in a host language (e.g., XSLT, XQuery)  XPath Expressions  String of characters  Value of an expression is always an ordered collection of zero or more items (atomic value, node)

XPath: Steps
- An XPath expression has the following syntax:
  Path ::= /Step1/Step2/…/Stepn
  where each XPath step is defined as follows:
  Step ::= Axis::Node-test Predicate*
- Axis specifies the "direction" in which the document should be navigated. For example, child::title[position() = 2]
- There are 13 axes: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, preceding, following-sibling, preceding-sibling, attribute, self, namespace

XPath Path Expressions
  project                      matches a project element
  *                            matches any element
  /                            matches the root element
  /project                     matches a project element under root
  project/member               matches a member in project
  project//name                matches a name in project, at any depth
  //title                      matches a title at any depth
  member|publication           matches a member or a publication
  @age                         matches an age attribute
  project/member/@age          matches age attribute in member, in project
  project/member[@age < "45"]  matches a member with age < 45

XPath Query Examples
- /project/member/name: matches a name of member in project
  Result: H. Lu J.X. Yu
- /project/publication/venue
  Result: empty (there was no venue element)
- //pname: matches a pname at any depth
  Result: XML Data mining
- /project/member/name/text(): text of name elements
  Result: H. Lu J.X. Yu

More XPath Queries
- /project/member[publication]
  Result: H. Lu luhj@cs.ust.hk Managing XML data using RDBMS 2001
- /project/member[@age < "45"]
  Result: J.X. Yu Data mining
- /project[member/@age < "25"]
  Result: no element returned
- /project/member[email/text()]
  Result: luhj@cs.ust.hk
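
Several of these queries can be approximated with Python's `xml.etree.ElementTree`, which supports only a subset of XPath. This is a sketch under assumptions: the document below is a reconstruction of the sample project document (its markup did not survive in these notes), and the age of J.X. Yu is made up. Numeric comparisons such as `@age < "45"` are outside the ElementTree subset, so that filter is written in Python.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample document; the age 40 is hypothetical.
DOC = """
<project>
  <pname>XML</pname>
  <member age="50">
    <name>H. Lu</name>
    <email>luhj@cs.ust.hk</email>
    <publication>
      <title>Managing XML data using RDBMS</title>
      <year>2001</year>
    </publication>
  </member>
  <member age="40">
    <name>J.X. Yu</name>
    <project><pname>Data mining</pname></project>
  </member>
</project>
"""
root = ET.fromstring(DOC)  # root is the outer <project> element

# /project/member/name
names = [e.text for e in root.findall("./member/name")]
# //pname
pnames = [e.text for e in root.findall(".//pname")]
# /project/publication/venue
venues = root.findall("./publication/venue")
# /project/member[publication]
with_pub = [m.findtext("name") for m in root.findall("./member[publication]")]
# /project/member[@age < "45"], with the comparison done in Python
under_45 = [m.findtext("name") for m in root.findall("./member")
            if int(m.get("age")) < 45]

print(names, pnames, venues, with_pub, under_45)
```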

XQuery
- XQuery 1.0: An XML Query Language
  - W3C Working Draft, 12 November 2003
  - http://www.w3.org/TR/xquery/
- XPath expressions are still the basic building block

XQuery
- FLWR Expressions: FOR-LET-WHERE-RETURN
- Processing pipeline: FOR/LET clauses -> ordered list of tuples of bound variables -> WHERE clause -> pruned list of tuples of bound variables -> RETURN clause -> instance of the XML Query data model
- FOR $x IN expr
  - binds $x to each value in the list expr
- LET $x := expr
  - binds $x to the entire list expr
  - Useful for common subexpressions and for aggregations

XQuery Examples
  FOR $x IN /project/member/publication
  WHERE $x/year > 2000
  RETURN $x/title

  FOR $m IN distinct(document("project.xml")//member)
  LET $p := document("project.xml")//publication[author = $m]
  WHERE count($p) > 10
  RETURN $m
- distinct: a function that eliminates duplicates
- count: an (aggregate) function that returns the number of elements
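
The difference between FOR (one binding per item) and LET (one binding to the whole list) can be mimicked in plain Python. This is only an illustrative sketch, not an XQuery engine: the document and the second publication are hypothetical, and the second query is a simplified variant that counts each member's own publications with a threshold of 2 instead of 10.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; "An older paper" is invented for the example.
DOC = """<project>
  <member><name>H. Lu</name>
    <publication><title>Managing XML data using RDBMS</title><year>2001</year></publication>
    <publication><title>An older paper</title><year>1998</year></publication>
  </member>
  <member><name>J.X. Yu</name></member>
</project>"""
root = ET.fromstring(DOC)

# FOR $x IN /project/member/publication WHERE $x/year > 2000 RETURN $x/title
titles = [
    x.findtext("title")                              # RETURN
    for x in root.findall("./member/publication")    # FOR: one binding per item
    if int(x.findtext("year")) > 2000                # WHERE: prune bindings
]

# FOR $m IN //member LET $p := $m/publication WHERE count($p) >= 2 RETURN $m/name
prolific = []
for m in root.iter("member"):        # FOR binds $m to each member in turn
    p = m.findall("publication")     # LET binds $p to the entire list
    if len(p) >= 2:                  # WHERE filters on an aggregate over $p
        prolific.append(m.findtext("name"))

print(titles, prolific)
```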

Structural Summary and Coding Scheme
- XML and XML Query Languages
- Structural Summary and Coding Scheme
- Managing XML Data in Relational Systems

Structural Summary
- A structural summary for a data graph $G_D(V_D, E_D)$ is another labeled graph $G_I(V_I, E_I)$.
- Each node $v_i \in G_I$ represents a set of nodes, $extent(v_i)$, with $extent(v_i) \subseteq V_D$.
- An edge $(v_i, v_i') \in G_I$ exists if there is an edge $(v_d, v_d') \in G_D$ with $v_d \in extent(v_i)$ and $v_d' \in extent(v_i')$.
- The summary preserves all the paths in the data graph. A path expression query can be executed on $G_I$ instead of $G_D$, which is most likely more efficient since $G_I$ is much smaller than $G_D$.
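
The definition above can be turned into a few lines of Python. This is a minimal sketch, assuming the simplest grouping criterion (group nodes by label) and a hypothetical data graph in which r has children a1..a3, each a_i has child b_i, and each b_i has child c_i. The helper names `build_summary` and `eval_path` are illustrative.

```python
from collections import defaultdict

def build_summary(nodes, edges, group_of):
    """Return (extent, summary_edges) for a partition given by group_of."""
    extent = defaultdict(set)
    for v in nodes:
        extent[group_of(v)].add(v)
    # A summary edge exists whenever some data edge connects the two extents.
    summary_edges = {(group_of(u), group_of(v)) for (u, v) in edges}
    return dict(extent), summary_edges

def eval_path(labels, start, extent, summary_edges):
    """Evaluate a simple label path on the summary; summary node ids are labels here."""
    current = {start}
    for label in labels:
        current = {c for (p, c) in summary_edges if p in current and c == label}
    return set().union(*(extent[g] for g in current))

# Hypothetical data graph.
nodes = ["r"] + [f"{x}{i}" for x in "abc" for i in (1, 2, 3)]
edges = ({("r", f"a{i}") for i in (1, 2, 3)}
         | {(f"a{i}", f"b{i}") for i in (1, 2, 3)}
         | {(f"b{i}", f"c{i}") for i in (1, 2, 3)})

def by_label(v):
    return v[0].upper()   # assumed grouping: the label is the first letter

extent, summary_edges = build_summary(nodes, edges, by_label)
answer = eval_path(["A", "B", "C"], "R", extent, summary_edges)
print(sorted(summary_edges), sorted(answer))
```

The query touches four summary nodes instead of ten data nodes. Grouping by label alone is safe but in general not precise; the bisimilarity-based summaries refine the grouping to control this.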

Structural Summary
- Basically, nodes in the data graph are grouped based on certain criteria, and each group of nodes is represented by one node in the summary.
- The size of the summary is determined by the grouping criteria.
- Desired properties when evaluating path expression queries on the summary:
  - The results are safe: they contain no false negatives. If not safe, only approximate answers can be obtained.
  - The results are precise: they contain no false positives. If not precise, the results must be validated against the data graph.

Structural Summary
[Figure: a data graph with nodes r, a1, a2, a3, b1, b2, b3, c1, c2, c3 next to its structural summary with nodes R, A, B, C, whose extents are {r}, {a1,a2,a3}, {b1,b2,b3} and {c1,c2,c3}.]

L09: Introduction to XML Data Management 30H. Lu/HKUST Sample Structural Summaries  Query workload independent summaries  Data Guide  1-index [Milo, Suciu, ICDT99]  A(k) index [Kaushik, Shenoy, ICDE02]  Query workload dependent summaries  APEX [Chung, Min et al, SIGMOD02]  D(k)-index [Chen, Lim et al, SIGMOD03]

L09: Introduction to XML Data Management 31H. Lu/HKUST Data Guides  DataGuide: dynamic structural summary of current database  Each label path in database appears once in DataGuide  No extraneous paths in DataGuide  Maintained incrementally as database evolves  Serves role of schema C1 is duplicated to achieve determinism in DataGuides

Bisimilarity and 1-Index
- Most existing structural summaries are based on graph bisimilarity, defined as follows:
  - Two data nodes u and v are bisimilar ($u \approx v$) if u and v have the same label and, whenever u' is a parent of u, there is a parent v' of v such that $u' \approx v'$, and vice versa.
- Intuitively, if two nodes are bisimilar, the set of paths coming into them is the same.
- Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT'99, 277-295, January 1999.

L09: Introduction to XML Data Management 33H. Lu/HKUST 1-Index  1-index: Each index node represents an equivalence class, in which data nodes are mutually bisimilar.  Evaluating path expression query using 1- index  safe: the result always contains the result of evaluating on the data graph;  precise: its result contains no false data node;

K-bisimilarity
- The 1-index can be big
- Formally, based on the notion of k-bisimilarity ($\approx^k$), which is defined inductively:
  - For any two nodes u and v, $u \approx^0 v$ iff u and v have the same label.
  - $u \approx^k v$ iff $u \approx^{k-1} v$ and, for every parent u' of u, there is a parent v' of v such that $u' \approx^{k-1} v'$, and vice versa.
- Intuitively, if two data nodes are k-bisimilar, the set of paths of length $\le k$ coming into them is the same.

A(k)-Index
- A(k)-Index: groups nodes based on their local structure (incoming paths of length up to k) instead of the global path information
  - The data nodes in each index node of the A(k)-index are mutually k-bisimilar.
- Evaluating a path expression query using the A(k)-index:
  - safe: its result always contains the result of evaluating on the data graph;
  - precise only for path expressions of length at most k: in that case its result contains no false data node; longer path expressions need validation against the data graph.
- Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. ICDE'02, 129-140.
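
The inductive definition of k-bisimilarity suggests a simple refinement procedure: start from the label partition and, k times, split blocks by the set of blocks of their parents. The sketch below is a direct, unoptimized reading of the definition and not the construction algorithm from the A(k)-index paper; the data graph and the helper names are hypothetical.

```python
def k_bisim_blocks(labels, parents, k):
    """Map each node to a block id such that equal ids mean k-bisimilar."""
    block = dict(labels)                       # k = 0: same label
    for _ in range(k):
        # Nodes stay together iff they were together before and their
        # parents cover the same set of previous-round blocks.
        block = {
            v: (block[v], frozenset(block[p] for p in parents.get(v, ())))
            for v in labels
        }
    return block

def extents(block):
    """Group nodes by block id and return the groups as sorted lists."""
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, []).append(v)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical data graph: r -> a1 -> c1 -> d1 and r -> b1 -> c2 -> d2.
labels = {"r": "R", "a1": "A", "b1": "B", "c1": "C", "c2": "C", "d1": "D", "d2": "D"}
parents = {"a1": {"r"}, "b1": {"r"}, "c1": {"a1"}, "c2": {"b1"},
           "d1": {"c1"}, "d2": {"c2"}}

for k in (0, 1, 2):
    print(k, extents(k_bisim_blocks(labels, parents, k)))
```

In this graph d1 and d2 stay in one index node for k = 0 and k = 1, because their incoming paths of length at most 1 agree, and are separated at k = 2, which shows how k trades summary size against precision.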

L09: Introduction to XML Data Management 36H. Lu/HKUST A(2)-Index C2 and C3 can be grouped because their length-2 incoming paths are the same

APEX: Adaptive Path Index
- 1-index, A(k)-index and F&B index are all workload independent
- APEX: Adaptive Path index
  - Maintains two types of paths in the summary:
    - All paths of length two, so that all queries can be answered using APEX
    - Full paths for those paths that frequently appear in the query workload, so that frequently asked queries can be answered efficiently
  - A hash table is included in the index so that partial matching queries with the descendant-or-self axis (//) can be processed efficiently
- C-W Chung, J-K Min, K. Shim. APEX: An Adaptive Path Index for XML Data. SIGMOD 02

D(k)-Index
- A generalization of 1-Index and A(k)-Index.
- Assigns different local bisimilarities to index nodes in the summary structure according to the query load, in order to optimize its structure.
  - For any two index nodes $n_i$ and $n_j$, $k(n_i) \ge k(n_j) - 1$ if there is an edge from $n_i$ to $n_j$, where $k(n_i)$ and $k(n_j)$ are the local bisimilarities of $n_i$ and $n_j$, respectively.
- Advantages over 1-Index and A(k)-Index
  - workload-sensitive;
  - can be updated more efficiently
- Qun Chen, Andrew Lim and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. SIGMOD 03, 134-144.

Node (Edge) Encoding
- Structural relationships
  - Is node u an ancestor of node v?
  - Is node u the parent of node v?
- Assign a unique code to each node (edge) in the data graph so that the above questions can be answered by looking at the codes rather than the original data graph.
- Issues:
  - Length of the code.
  - Complexity of computing the structural relationship between two nodes from their codes.
  - Efficient code generation and code maintenance.

XML Data Coding Scheme
- Region-based
  - XML document is ordered
  - Codes are assigned based on the lexicographical location of an element in the original document
- Path-based
  - XML document is nested
  - Codes are assigned based on the nesting structure of the document, or the path that reaches an element from the root.
- There are quite a number of variants for both categories of coding schemes

XML Region Based Coding
- Region code: (start, end, level)
  - u is an ancestor of v iff u.start < v.start < u.end
  - u is the parent of v iff, additionally, u.level = v.level - 1
- Code generation needs only a depth-first traversal
- Property: strict nesting
  - Two regions are either completely disjoint (cases 1, 4) or one contains the other (cases 2, 3)
  - Formally, a.start < b.start < a.end if a is an ancestor of b

L09: Introduction to XML Data Management 42H. Lu/HKUST Sample of Region Codes  The order of start values is also the document order  The region can also be interpreted as an interval
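
A region code can be generated in one depth-first pass by drawing start and end values from a single counter. The sketch below assumes that convention (variants exist, for example word offsets in the document text) and uses a hypothetical contact document; the function names are illustrative.

```python
import xml.etree.ElementTree as ET

def region_codes(root):
    """Assign (start, end, level) to every element in one depth-first traversal."""
    codes = {}
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter                 # taken when the element is entered
        for child in elem:
            visit(child, level + 1)
        counter += 1                    # end is taken when the element is left
        codes[elem.tag] = (start, counter, level)

    visit(root, 0)
    return codes

def is_ancestor(u, v):
    return u[0] < v[0] < u[1]           # u.start < v.start < u.end

def is_parent(u, v):
    return is_ancestor(u, v) and u[2] == v[2] - 1

# Hypothetical document; tags are unique so they can serve as dictionary keys.
root = ET.fromstring("<contact><name/><phone><office/><home/></phone></contact>")
codes = region_codes(root)
for tag, code in codes.items():
    print(tag, code)
```

Both tests need only integer comparisons on the codes, and sorting by start reproduces document order.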

Dewey
[Figure: contact tree labeled with Dewey codes: contact 1; name 1.1 with text "blah" 1.1.1; phone 1.2; office 1.2.1 with text 1234 (1.2.1.1); home 1.2.2 with text 5678 (1.2.2.1); mobile 1.2.3 with text 0000 (1.2.3.1).]
- Ancestor test: a.Dewey is a prefix of d.Dewey
- Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.
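
Dewey codes are a path-based scheme: each node's code is its parent's code extended by the node's position among its siblings. The following sketch labels elements only (text nodes, which the figure also numbers, are skipped) and uses a hypothetical contact document. Codes are kept as integer tuples rather than dotted strings, so that a naive string prefix test does not confuse 1.1 with 1.10.

```python
import xml.etree.ElementTree as ET

def dewey_codes(root):
    """Assign a Dewey code (tuple of sibling positions) to every element."""
    codes = {}

    def visit(elem, code):
        codes[elem.tag] = code
        for i, child in enumerate(elem, start=1):
            visit(child, code + (i,))

    visit(root, (1,))
    return codes

def is_ancestor(a, d):
    """a is an ancestor of d iff a's code is a proper prefix of d's code."""
    return len(a) < len(d) and d[:len(a)] == a

def dotted(code):
    return ".".join(str(c) for c in code)

# Hypothetical document with unique tags.
root = ET.fromstring(
    "<contact><name/><phone><office/><home/><mobile/></phone></contact>")
codes = dewey_codes(root)
print({tag: dotted(code) for tag, code in codes.items()})
```

Unlike region codes, a Dewey code contains the codes of all ancestors, at the price of a length that grows with depth.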

Managing XML Data in Relational Systems
- XML and XML Query Languages
- XML Coding Scheme and Structural Summary
- Managing XML Data in Relational Systems

XML-Enabled DB Systems
- IBM DB2 XML Extender
  - XML column support, XML Collection, files linked from the DBMS, or Character Large Objects (CLOBs).
  - Side tables serve as XML indexes
- Oracle 9i
  - CLOB, OracleText Cartridge, XMLType, and XML SQL Utility
- Microsoft SQL Server
  - CLOBs, generic Edge technique and user-defined decomposition (from XML to tables), XML views.

L09: Introduction to XML Data Management 46H. Lu/HKUST Storing XML Data in RDBMSs  RDBMS: a matured technology  RDBMS widely available  Less investment to adopt the new technology  Easy to be integrated with other existing applications  Impedance mismatch  Two level nature of relational schema (tuples and attributes) vs. arbitrary nesting of XML DTD  Flat structure vs. recursion  Structure-based and content-based query

L09: Introduction to XML Data Management 47H. Lu/HKUST XQuery vs SQL: Different Culture  Data Characteristics  Relational data: regular, homogeneous, flat structure in nature, and no order among tuples.  XML data: irregular, heterogeneous, unpredictable structure, order sensitive.  Query Languages  SQL: Select-from-where With capability to support some fix-point operation  XQuery: FLWOR (pronounced “flower”): For-let-where-order-return Simple/Regular Path expressions

Storing XML Data in RDBMSs: Architecture
[Figure: architecture diagram with the components DTD, Relational Schema, XML Documents, Tuples, XML Query, SQL Query, Relational Result, XML Result, Automatic Schema/Data Mapping and Commercial RDBMS.]

L09: Introduction to XML Data Management 49H. Lu/HKUST Storing XML Data in RDBMSs: Issues  Schema/Data mapping:  Automate storage of XML in RDBMS  Query mapping:  Provide XML views of relational sources  Result construction:  Export existing data as XML

XML-Relational Mapping
- Model mapping
  - Database schemas represent constructs of the XML document model.
  - DTD independent [Florescu & Kossmann 99; Yoshikawa et al., TOIT01]
- Structure mapping
  - Database schemas represent the logical structure of target XML documents.
  - DTD dependent [Shanmugasundaram et al., VLDB 99]

L09: Introduction to XML Data Management 51H. Lu/HKUST A Simple XML Document XML H. Lu luhj@cs.ust.hk Managing XML data using RDBMS 2001 … J.X. Yu Data mining

A Sample DOM Tree
[Figure: DOM tree of the project document with node identifiers (&1, &2, &3, ...), element nodes project, pname, member, name, email, publication, title, author and year, and leaf values XML, H. Lu, luhj@cs.ust.hk, Managing …, 2001, Data mining and J.X. Yu.]

L09: Introduction to XML Data Management 53H. Lu/HKUST Model Mapping: Document Model to Relation  Database schema represents the constructs of XML documents  Fixed database schema for all XML documents  Data graph : tree (may contain cycles)  Relational schema represents a tree  Pros and cons  DTD is not required. Documents may not conform to DTD  Fixed schema: no schema evolution issue  Large collection of documents with various DTDs  Semantics get (totally) lost

L09: Introduction to XML Data Management 54H. Lu/HKUST Model Mapping – Edge/Monet Approach  Edge oriented approach  Single table schema [Florescu & Kossmann 99] Edge (source, ordinal, target, label, flag, value)  Monet [Schmidt et. al. WebDB00] multiple tables, horizontal partitions of edge table on label-path Note: Document ID is omitted here

Querying with Edge
  /DBGroup/Member[Age>20]/Name

  select name.Value
  from Edge dbgroup, Edge member, Edge age, Edge name
  where dbgroup.Label = 'DBGroup'
    and member.Label = 'Member'
    and age.Label = 'Age'
    and name.Label = 'Name'
    and dbgroup.Source = 0
    and dbgroup.Target = member.Source
    and member.Target = age.Source
    and member.Target = name.Source
    and cast(age.Value as int) > 20
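
The Edge approach can be tried end to end with Python's `sqlite3` module. This is a sketch under assumptions: the document and the member names are hypothetical, node ids are assigned in preorder with 0 standing for the document root, and leaf text is stored inline in the value column with flag 'val' (the original scheme has further storage variants).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document; the names and ages are made up.
DOC = """<DBGroup>
  <Member><Name>Alice</Name><Age>19</Age></Member>
  <Member><Name>Bob</Name><Age>35</Age></Member>
</DBGroup>"""

def shred(root):
    """Return Edge rows (source, ordinal, target, label, flag, value)."""
    rows = []
    next_id = 0

    def visit(elem, source, ordinal):
        nonlocal next_id
        next_id += 1
        target = next_id
        if len(elem) == 0:      # leaf element: keep its text inline
            rows.append((source, ordinal, target, elem.tag, "val",
                         (elem.text or "").strip()))
        else:                   # inner element: the edge only references a node
            rows.append((source, ordinal, target, elem.tag, "ref", None))
        for i, child in enumerate(elem, start=1):
            visit(child, target, i)

    visit(root, 0, 1)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Edge (source INTEGER, ordinal INTEGER, "
             "target INTEGER, label TEXT, flag TEXT, value TEXT)")
conn.executemany("INSERT INTO Edge VALUES (?, ?, ?, ?, ?, ?)",
                 shred(ET.fromstring(DOC)))

# /DBGroup/Member[Age>20]/Name as a four-way self-join on Edge.
result = conn.execute("""
    select name.value
    from Edge dbgroup, Edge member, Edge age, Edge name
    where dbgroup.label = 'DBGroup' and member.label = 'Member'
      and age.label = 'Age' and name.label = 'Name'
      and dbgroup.source = 0
      and dbgroup.target = member.source
      and member.target = age.source
      and member.target = name.source
      and cast(age.value as int) > 20
""").fetchall()
print(result)
```

Every step of the path expression costs one self-join, which is the main price of a DTD-independent single-table schema.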

L09: Introduction to XML Data Management 56H. Lu/HKUST Model Mapping – Node Approach XRel [Yoshikawa et. al. TOIT 2001]  Four table schema Element(pathID, start, end, ordinal) Attribute(pathID, start, end, value) Text(pathID, start, end, value) Path(pathID, pathexp)

Querying with XRel
  /DBGroup/Member[Age>20]/Name

  select v2.Value
  from Element e1, Path p1, Path p2, Path p3, Text v1, Text v2
  where p1.Pathexp = '#/DBGroup#/Member'
    and p2.Pathexp = '#/DBGroup#/Member#/Age'
    and p3.Pathexp = '#/DBGroup#/Member#/Name'
    and e1.PathID = p1.PathID
    and v1.PathID = p2.PathID
    and v2.PathID = p3.PathID
    /* containment testing */
    and e1.Start < v1.Start and e1.End > v1.End
    and e1.Start < v2.Start and e1.End > v2.End
    and cast(v1.Value as int) > 20

L09: Introduction to XML Data Management 58H. Lu/HKUST Structural Mapping: Simplifying DTDs  DTD element specifications can be of arbitrary complexity is valid!  Simple DTD for translation purposes:  Key observations: not necessary to regenerate DTD from relational schema  XML queries query the position of an element, relative to its siblings, and the parent/child relationships.

DTD Simplification: Transformations
- Flattening transformations
  (e1, e2)*  →  e1*, e2*
  (e1, e2)?  →  e1?, e2?
  (e1 | e2)  →  e1?, e2?
- Simplification transformations
  e1**  →  e1*
  e1*?  →  e1*
  e1?*  →  e1*
  e1??  →  e1?
- Grouping transformations
  ..., a*, ..., a*, ...  →  a*, ...
  ..., a*, ..., a?, ...  →  a*, ...
  ..., a?, ..., a*, ...  →  a*, ...
  ..., a?, ..., a?, ...  →  a*, ...
  ..., a, ..., a, ...    →  a*, ...
- [Deutsch, Fernandez, and Suciu, SIGMOD99] [Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99]

A Sample DTD
  <!ELEMENT book (booktitle, author)
[Figure: DTD graph with nodes book, article, monograph, booktitle, author, contactauthor, authorid, title, editor, name, firstname, lastname and address; edges are annotated with * and ?.]
[Shanmugasundaram et al., VLDB 99]

L09: Introduction to XML Data Management 61H. Lu/HKUST DTD to Relational Schema: Naïve Approach  Each Element ==> Relation  Each Attribute of Element ==> Column of Relation  Connect elements using foreign keys author (authorID: integer, id: string) name (nameID: integer, authorID: integer) firstname (firstnameID: integer, nameID: integer, value: string) lastname (lastnameID: integer, nameID: integer, value: string) address (addressID: integer, authorID: integer, value: string)

Basic Inlining Technique
- Problem of the naïve approach: fragmentation (too many tables)
  - The previous example results in 5 relations, so even retrieving the first and last names of an author requires several joins
- Intuition:
  - Inline as many sub-elements as possible
  - Do not inline only if it is a set sub-element: RDBMSs do not all support set-valued columns.
  - Connect relations using foreign keys; this can handle recursion
- A document can be rooted at any element
  - Create a separate relation for each root

Basic Inlining Technique: Relation Schemas
  article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
  article.author (article.authorID: integer, article.author.parentID: integer, article.author.name.firstname: string, article.author.name.lastname: string, article.author.address: string, article.author.authorid: string)
[Figure: DTD graph rooted at article with nodes author, contactauthor, authorid, title, name, firstname, lastname and address; edges are annotated with * and ?.]

L09: Introduction to XML Data Management 64H. Lu/HKUST Basic Inlining Technique: Pros & Cons  Reduces number of joins for queries like “get the first and last names of a book author”  Efficient for queries such as “list all authors of books”  Queries like “list all authors with name Ullman”  Union of 5 queries!  Large number of relations:  Unrolling recursive strongly connected components (major)  Separate relational schema for each element as root (minor)

L09: Introduction to XML Data Management 65H. Lu/HKUST Shared Inlining Technique  Intuition:  Inline as many sub-elements as possible.  Do not inline only if it is a shared, recursive or set sub-element.  An element node is represented in exactly one relation.  Technique:  Mapping the following nodes into relations: Shared: In-degree >= 2 in DTD graph Root elements: In-degree = 0

L09: Introduction to XML Data Management 66H. Lu/HKUST Issues with Sharing Elements  Parent of elements not fixed at schema level  Need to store type and ids of parents (or if there are no parents)  parentCODE field (type of parent)  parentID field (id of parent)  Not foreign key relationship

Shared: Relational Schema
  book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle: string)
  article (articleID: integer, article.contactauthor.isroot: boolean, article.contactauthor.authorid: string)
  monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.isroot: boolean, monograph.editor.name: string)
  title (titleID: integer, title.parentID: integer, title.parentCODE: integer, title: string)
  author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.isroot: boolean, author.name.firstname.isroot: boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string)

L09: Introduction to XML Data Management 68H. Lu/HKUST Shared Inlining Techniques: Pros & Cons + Reduces number of joins for queries like “get the first and last names of an author” + Efficient for queries such as “list all authors with name Ullman” - Sharing whenever possible implies extra joins for path expressions “Article with a given title name”

L09: Introduction to XML Data Management 69H. Lu/HKUST Hybrid Inlining Technique  Inlines some elements that are shared in Shared  Elements with in-degree >= 2 that are not set sub- elements or recursive  Handles set and recursive sub-elements as in Shared

L09: Introduction to XML Data Management 70H. Lu/HKUST Hybrid: Relational Schema book (bookID: integer, book.booktitle.isroot: boolean, book.booktitle : string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string) article (articleID: integer, article.contactauthor.isroot: boolean, article.contactauthor.authorid: string, article.title.isroot: boolean, article.title: string) monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, monograph.editor.isroot: boolean, monograph.editor.name: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string) author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.isroot: boolean, author.name.firstname.isroot: boolean, author.name.firstname: string, author.name.lastname.isroot: boolean, author.name.lastname: string, author.address.isroot: boolean, author.address: string, author.authorid: string)

L09: Introduction to XML Data Management 71H. Lu/HKUST Hybrid Inlining Technique: Pros & Cons + Reduces joins through shared elements (that are not set or recursive elements) + Shares some strengths of Shared: Reduces joins for queries like “get first and last names of a book author” - Requires more SQL sub-queries to retrieve all authors with name Ullman. Tradeoff between reducing number of queries and reducing number of joins Shared and Hybrid target query- and join-reduction respectively

L09: Introduction to XML Data Management 72H. Lu/HKUST More on Shared and Hybrid  Shared and Hybrid have pros and cons  In many cases, Shared and Hybrid are nearly identical  Number of joins per SQL query ~ path length  Mainly due to large number of set nodes  Problem as join processing is expensive!

Regular Expressions
- Path expression queries can be represented by regular expressions.
- Consider path expressions of the following form:
  r ::= (r)* | (r)+ | (r)? | r1/r2 | r1|r2 | r1//r2 | name
  - *: 0 or more occurrences
  - +: 1 or more occurrences
  - ?: 0 or 1 occurrences
  - r1/r2: form a path from r1 to r2 (child)
  - r1//r2: form a path from r1 to r2 (descendant)
  - |: disjunction

SPE to SQL
- Query: /member/publication/author/name
  Find the names of the authors of all members' publications
- Relations:
  member (ID, name, email, PARENTID);
  publication (ID, title, author, year, PARENTID);
- SQL:
  select m2.name
  from member m1, publication, member m2
  where publication.PARENTID = m1.ID and publication.author = m2.ID
[Figure: DTD graph with nodes member, publication, author, ID, name and email; edges are annotated with * and ?.]

RPE Expansion: Substitute //
- Query: project//publication (list the title of publications for all projects)
- After substituting //:
  project/member/(project/member)*/publication | project/(member/project)*/publication
[Figure: DTD graph with nodes project, member, publication, ID, name, email, author, title and year; edges are annotated with * and ?.]

RPE Expansion: Expanding *
- Query (list the title of publications for all projects):
  project/member/(project/member)*/publication/title | project/(member/project)*/publication/title
- After expanding *:
  select project.publication.title
  union select project.member.publication.title
  union select project.member.project.publication.title
[Figure: DTD graph with nodes project, member, publication, ID, name, email, author, title and year; edges are annotated with * and ?.]
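
Expanding a starred sub-path means enumerating 0, 1, 2, ... repetitions of it, up to some bound. The helper below is an illustrative sketch: `expand_star` and its parameters are hypothetical names, and the bound `max_iterations` has to come from elsewhere, for example from statistics on the nesting depth of the stored documents.

```python
def expand_star(prefix, loop, suffix, max_iterations):
    """Expand prefix/(loop)*/suffix into simple paths with 0..max_iterations repetitions."""
    return ["/".join(prefix + loop * n + suffix)
            for n in range(max_iterations + 1)]

# The two alternatives obtained by substituting // in project//publication/title.
alt1 = expand_star(["project", "member"], ["project", "member"],
                   ["publication", "title"], 1)
alt2 = expand_star(["project"], ["member", "project"],
                   ["publication", "title"], 1)

# Shortest paths first; each simple path becomes one branch of the SQL union.
paths = sorted(alt1 + alt2, key=lambda p: p.count("/"))
for p in paths:
    print(p)
```

Each resulting simple path is then translated into one select statement, and the statements are combined with union.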

Recursive Path Expression Queries to SQL
- Some DBMSs support least fixed point computation, e.g., the WITH statement in DB2
- Query: project/(member/project)*/publication
  WITH R(PARENTID, ID) AS (
    select m.PARENTID, p1.ID
    from member m, project p1
    where m.ID = p1.PARENTID
    UNION ALL
    select R.PARENTID, p1.ID
    from R, member m, project p1
    where R.ID = m.PARENTID and m.ID = p1.PARENTID)
  select p3.*
  from project p2, R, publication p3
  where p2.ID = R.PARENTID and R.ID = p3.PARENTID;
[Figure: DTD graph fragment with nodes project, member and publication; edges are annotated with *.]
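
The fixed point computation can be reproduced with SQLite's `WITH RECURSIVE`, which plays the role of the DB2 WITH statement here. This is a sketch on hypothetical data, and it is a variant of the query above: the recursion is seeded with the top-level project itself, so that zero repetitions of member/project (a publication directly under the top project) are also returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project     (ID INTEGER, PARENTID INTEGER);
CREATE TABLE member      (ID INTEGER, PARENTID INTEGER);
CREATE TABLE publication (ID INTEGER, PARENTID INTEGER, title TEXT);

-- Hypothetical data: project 1 > member 10 > project 2.
INSERT INTO project VALUES (1, NULL), (2, 10);
INSERT INTO member  VALUES (10, 1);
INSERT INTO publication VALUES (100, 1, 'P-top'), (101, 2, 'P-nested');
""")

rows = conn.execute("""
WITH RECURSIVE R(PARENTID, ID) AS (
    -- seed: zero repetitions, the top-level project reaches itself
    SELECT p.ID, p.ID FROM project p WHERE p.PARENTID IS NULL
    UNION ALL
    -- step: one more member/project hop below an already reached project
    SELECT R.PARENTID, p1.ID
    FROM R, member m, project p1
    WHERE R.ID = m.PARENTID AND m.ID = p1.PARENTID
)
SELECT p3.title
FROM R, publication p3
WHERE R.ID = p3.PARENTID
ORDER BY p3.ID
""").fetchall()
print(rows)
```

The recursion stops by itself when a step produces no new rows, so no bound on the nesting depth has to be known in advance.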

L09: Introduction to XML Data Management 78H. Lu/HKUST Expanding Recursive Path Expression Queries  Expanding wild cards before sending to DBMS  Transitive closure operation is not always supported by RDBMS  Transitive closure with arbitrary nesting seems not supported  Can handle nested recursive queries (though DB2 does not support it)  How many SQL statements are required?  Executing SQL until empty result returned  VXMLR approach: keep statistics [Zhou et. al. VLDB 2001]

L09: Introduction to XML Data Management 79H. Lu/HKUST Query Translation for Structural Mapping  Translating XML-QL into SQL [Shanmugasundaram, et al, VLDB99]  Simple Path Expressions to SQL  Simple Recursive Path Expressions to SQL  Arbitrary Path Expressions to Simple Recursive Path Expressions  Discussion based on Shared approach

Queries with Simple Path Expressions
- XML-QL query:
  WHERE The Selfish Gene $f $l IN * CONFORMING TO pubs.dtd CONSTRUCT $f $l
- SQL translation:
  Select A."author.name.firstname", A."author.name.lastname"
  From author A, book B
  Where B.bookID = A.parentID
    AND A.parentCODE = 0
    AND B."book.booktitle" = 'The Selfish Gene'
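
The role of parentCODE in this translation can be checked with a small SQLite session. This is a sketch under assumptions: the tables keep only the columns the query needs, the encoding 0 = book and 1 = article is assumed, and the article row and its author are invented so that two authors share the same parentID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book    (bookID INTEGER, "book.booktitle" TEXT);
CREATE TABLE article (articleID INTEGER, "article.title" TEXT);
CREATE TABLE author (
    authorID   INTEGER,
    parentID   INTEGER,   -- id of the parent tuple
    parentCODE INTEGER,   -- type of the parent: 0 = book, 1 = article (assumed)
    "author.name.firstname" TEXT,
    "author.name.lastname"  TEXT
);
INSERT INTO book    VALUES (1, 'The Selfish Gene');
INSERT INTO article VALUES (1, 'A hypothetical article');
INSERT INTO author  VALUES (1, 1, 0, 'Richard', 'Dawkins');
INSERT INTO author  VALUES (2, 1, 1, 'Jane', 'Doe');
""")

QUERY = """
SELECT A."author.name.firstname", A."author.name.lastname"
FROM author A, book B
WHERE B.bookID = A.parentID {code_filter}
  AND B."book.booktitle" = 'The Selfish Gene'
ORDER BY A.authorID
"""
with_code = conn.execute(QUERY.format(code_filter="AND A.parentCODE = 0")).fetchall()
without_code = conn.execute(QUERY.format(code_filter="")).fetchall()
print(with_code, without_code)
```

Without the parentCODE test the join also picks up the article author whose parentID happens to equal the book id, which is why a shared relation must record the type of its parent and cannot rely on a plain foreign key.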

L09: Introduction to XML Data Management 81H. Lu/HKUST Queries with Recursive Path Expressions WHERE $n Subclass Cirripedia IN * CONFORMING TO pubs.dtd CONSTRUCT $n With Q1 (monographID, name) AS (Select X.monographID, X.”editor.name” From monograph X Where X.title = “Subclass Cirripedia” UNION ALL Select Z.monographID, Z.”editor.name” From Q1 Y, monograph Z Where Y.monographID = Z.parentID AND Z.parentCODE = 0 ) Select A.name From Q1 A

Queries with Arbitrary Path Expressions
• Split a complex path expression into (possibly many) simple recursive path expressions
• This has the effect of splitting a single XML-QL query into (possibly many) SQL queries
• Can handle nested recursive queries
WHERE $n CONSTRUCT $n

References (1)
[Aboulnaga, Alameldeen et al, VLDB01] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. VLDB 2001.
[Bohannon et al, ICDE 2002] P. Bohannon, J. Freire, P. Roy, and J. Simeon. From XML schema to relations: A cost-based approach to XML storage. In Proceedings of ICDE, 2002.
[Boag, Chamberlin et al, W3C 2003] Scott Boag, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. XQuery 1.0: An XML Query Language. http://www.w3.org/TR/xquery
[Bruno et al, SIGMOD02] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In SIGMOD Int'l Conf. on Management of Data, 310-311, 2002.
[Chen, Jagadish et al, ICDE01] Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In Proceedings of the IEEE International Conference on Data Engineering, pages 595-604, 2001.
[Cohen, Kaplan et al, PODS02] E. Cohen, H. Kaplan, and T. Milo. Labeling dynamic XML trees. In Symposium on Principles of Database Systems (PODS), 271-281, 2002.
[Clark, DeRose, W3C 1999] James Clark and Steven DeRose. XML Path Language (XPath) Version 1.0. http://www.w3.org/TR/xpath

References (2)
[Ceri, Comai et al, WWW99] S. Ceri, S. Comai, E. Damiani, P. Fraternali, S. Paraboschi, and L. Tanca. XMLGL: a graphical language for querying and restructuring WWW data. In International World Wide Web Conference (WWW), Toronto, Canada, May 1999.
[Chamberlin, Robie et al, 2000] Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML query language for heterogeneous data sources. In Proceedings of the Third International Workshop on the Web and Databases, May 2000.
[Chamberlin, Draper et al, 2003] Don Chamberlin, Denise Draper, Mary Fernandez, Michael Kay, Jonathan Robie, Michael Rys, Jerome Simeon, Jim Tivy, and Philip Wadler. Editor: Howard Katz. XQuery from the Experts: A Guide to the W3C XML Query Language. Addison-Wesley Press, 2003.
[Chaudhri, Rashid et al, 2003] Akmal B. Chaudhri, Awais Rashid, and Roberto Zicari. XML Data Management: Native XML and XML-Enabled Database Systems. Addison-Wesley Press, 2003.
[Chen, Lim et al, SIGMOD03] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. In SIGMOD'03, 134-144.
[Chien, Vagena and Zhang et al, VLDB02] S.-Y. Chien, Z. Vagena, D. Zhang, V. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In VLDB02, pages 263-274, 2002.
[Chung, Min et al, SIGMOD02] C-W Chung, J-K Min, and K. Shim. APEX: An Adaptive Path Index for XML Data. In SIGMOD'02, 2002.

References (3)
[Deutsch, Fernandez et al, QL98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. In M. Marchiori, editor, QL'98: The Query Languages Workshop. W3C, Dec. 1998. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/
[Deutsch, Fernandez, and Suciu, SIGMOD99] A. Deutsch, M. Fernandez, and D. Suciu. Storing Semistructured Data with STORED. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1999.
[Dietz STOC 82] Paul F. Dietz. Maintaining order in a linked list. STOC 1982.
[Grust SIGMOD02] Torsten Grust. Accelerating XPath Location Steps. In Proc. of the 21st ACM SIGMOD Conference, pages 109-120, Madison, Wisconsin, USA, June 2002. ACM Press.
[Jiang, Lu, Wang and Ooi, ICDE03] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-Tree: Indexing XML Data for Efficient Structural Joins. The 19th International Conference on Data Engineering (ICDE 2003), pages 253-264, Bangalore, India, March 5-8, 2003.
[Jiang, Wang, Lu and Yu, VLDB03] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic Twig Joins on Indexed XML Documents. The 29th International Conference on Very Large Data Bases (VLDB 2003), pages 273-284, Berlin, Germany, September 9-12, 2003.

References (4)
[Kaushik, Shenoy, ICDE02] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In ICDE'02, 129-140.
[Kha et al, ICDE01] Dao Dinh Kha, Masatoshi Yoshikawa, and Shunsuke Uemura. An XML indexing structure with relative region coordinate. ICDE 2001.
[Krishnamurthy et al, 2003] R. Krishnamurthy, R. Kaushik, and J. Naughton. XML-to-SQL Query Translation Literature: The State of the Art and Open Problems. XML Database Symposium (XSym), Sep 2003.
[Li and Moon, VLDB01] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. VLDB 2001.
[Milo, Suciu, ICDT99] Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT'99, 277-295, January 1999.
[Lee, Srivastava DASFAA04] Dongwon Lee and Divesh Srivastava. Counting relaxed twig matches in a tree. DASFAA 2004.
[Lim, Wang et al, VLDB02] Lipyeow Lim, Min Wang, Sriram Padmanabhan, Jeffrey Scott Vitter, and Ronald Parr. XPathLearner: An on-line self-tuning Markov histogram for XML path selectivity estimation. VLDB 2002.
[Lee, Yoo et al, 1996] Yong Kyu Lee, Seong-Joon Yoo, Kyoungro Yoon, and P. Bruce Berra. Index structures for structured documents. In Proceedings of the ACM Conference on Digital Libraries, 1996.

References (5)
[Manolescu, Florescu et al, 2001] I. Manolescu, D. Florescu, and D. Kossmann. Pushing XML queries inside relational databases. Tech. Report no. 4112, INRIA, 2001.
[Manolescu, Florescu et al, VLDB01] I. Manolescu, D. Florescu, and D. Kossmann. Answering XML queries over heterogeneous data sources. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.
[Meier, 2002] Wolfgang Meier. eXist: An open source native XML database. In Web, Web-Services, and Database Systems 2002, 2002.
[McHugh, Widom, VLDB99] Jason McHugh and Jennifer Widom. Query optimization for XML. VLDB 1999.
[Polyzotis, Garofalakis SIGMOD02] Neoklis Polyzotis and Minos N. Garofalakis. Statistical synopses for graph-structured XML databases. SIGMOD 2002.
[Polyzotis, Garofalakis VLDB02] Neoklis Polyzotis and Minos N. Garofalakis. Structure and value synopses for XML data graphs. VLDB 2002.
[Robie, Lapp, QL98] J. Robie, J. Lapp, and D. Schach. XML query language (XQL). In M. Marchiori, editor, QL'98: The Query Languages Workshop. W3C, Dec. 1998. http://www.w3.org/TandS/QL/QL98/pp/xql.html

References (6)
[Schmidt et. al. WebDB00] A. Schmidt, M. L. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In WebDB (Informal Proceedings), pages 47-52, 2000.
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, and Naughton, VLDB99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB'99), pages 79-90. Morgan Kaufmann, 1999.
[Shanmugasundaram et. al. VLDB 99] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities. VLDB 1999: 302-314.
[Srivastava, Al-Khalifa et al, ICDE02] D. Srivastava, S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and Y. Wu. Structural joins: A primitive for efficient XML query pattern matching. In ICDE, pages 141-152, 2002.
[Tatarinov, Viglas et al, SIGMOD02] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. SIGMOD 2002.
[Wang, Jiang et al, SIGMOD03] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. Containment Join Size Estimation: Models and Methods. The 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD03), San Diego, California, June 9-12, 2003, pages 145-156.

References (7)
[Wang, Jiang et al, ICDE03] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. PBiTree coding and efficient processing of containment joins. ICDE 2003.
[Wu et al., EDBT02] Yuqing Wu, Jignesh Patel, and H. V. Jagadish. Using Histograms to Estimate Answer Size for XML Queries. Information Systems 28 (1-2): 33-59 (2003), Special Issue: Best Papers from EDBT 2002.
[Wu et al, ICDE03] Yuqing Wu, Jignesh Patel, and H. V. Jagadish. Structural Join Order Selection for XML Query Optimization. ICDE 2003.
[Yoshikawa, et. al. TOIT01] Masatoshi Yoshikawa, Toshiyuki Amagasa, Takeyuki Shimura, and Shunsuke Uemura. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Techn. 1(1): 110-141 (2001).
[Zhou et. al. VLDB 2001] Aoying Zhou, Hongjun Lu, Shihui Zheng, Yuqi Liang, Long Zhang, Wenyun Ji, and Zengping Tian. VXMLR: A Visual XML-Relational Database System. VLDB 2001: 719-720.
[Zhang, Naughton SIGMOD01] Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, and Guy M. Lohman. On supporting containment queries in relational database management systems. SIGMOD 2001.
