XML Storage We must upgrade to XML. Everyone is talking about it. Well, that is going to cost us XXX on YYY and earn us WWW on ZZZ.
XML Topics Previous topics: –Motivation for XML –XML Syntax –DTDs –XPath This Week: XML Storage Upcoming Weeks: –Querying XML –XML Search –Advanced Topics (e.g., Web Services)
XML Storage Suppose that we are given some XML documents How should they be stored? Why does it matter? –Type of storage implies which type of use can be efficiently made of the XML –Type of usage determines which type of storage is needed Can’t really discuss using XML, without knowing how it is stored, and whether such usage is possible
3 Basic Strategies Files Relational Database Native XML Database What advantages do you think that each approach has? What disadvantages do you think that each approach has?
XML Files
Idea Store XML “as is”, in a file system –When querying, parse the document and traverse it to find the query answer Obvious Advantage: Simple storage system Obvious Disadvantage: –Must parse the XML document every time it is queried –Does not take advantage of indexes to quickly get to “interesting” elements (in order to reach a given element, must traverse everything appearing beforehand in the document)
Sample Document WEBM GE What must we read to be able to get information about the ticker element?
How is an XML document Parsed? Two basic types of parsers: –DOM parser: Creates a tree out of the document –SAX parser: Does not create any data structures. Notifies program for every element seen Both types of parsers have been standardized and have implementations in virtually every programming language
DOM Parser DOM = Document Object Model Parser creates a tree object out of the document User accesses data by traversing the tree The API allows for constructing, accessing and manipulating the structure and content of XML documents
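Document as Tree [Figure: tree of the sample document. The root transaction has children account, buy (attribute shares = 100) and sell (attribute shares = 30); buy and sell each contain a ticker element with an exch attribute (NASDAQ, NYSE) and text content (WEBM, GE).] Methods like: getRoot, getChildren, getAttributes, etc.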
Advantages and Disadvantages How would you answer a query like: –/transaction/buy –//ticker Advantages: –Natural and relatively easy to use –Can repeatedly query tree without reparsing Disadvantages: –High memory requirements – the whole document is kept in memory –Must parse the whole document and construct many objects before use
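The two queries above can be sketched with Python's standard-library DOM parser (xml.dom.minidom). The document text is an assumed reconstruction of the sample (the tags were lost in these slides), and the account text ACCT-1 is a placeholder.

```python
from xml.dom.minidom import parseString

# Assumed reconstruction of the sample document; "ACCT-1" is a placeholder.
SAMPLE = """<transaction>
  <account>ACCT-1</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

doc = parseString(SAMPLE)      # the whole tree is built in memory here
root = doc.documentElement

# /transaction/buy: element children of the root that are named "buy"
buys = []
if root.tagName == "transaction":
    buys = [n for n in root.childNodes
            if n.nodeType == n.ELEMENT_NODE and n.tagName == "buy"]

# //ticker: every ticker element, at any depth
tickers = doc.getElementsByTagName("ticker")

buy_shares = [b.getAttribute("shares") for b in buys]
ticker_names = [t.firstChild.data for t in tickers]
print(buy_shares, ticker_names)
```

Both queries run against the same in-memory tree, which is the "query repeatedly without reparsing" advantage; the cost is that parseString must finish building every node before the first query can start.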
SAX Parser SAX = Simple API for XML Parser creates “events” (i.e., notifications) while reading the document Goes through the document one time only
Document as Events
Start tag: transaction
Start tag: account
Text: (the account text)
End tag: account
Start tag: buy
Attribute: shares Value: 100
Advantages and Disadvantages How would you answer a query like: –/transaction/buy –find accounts in which something is bought or sold on the NASDAQ Advantages: –Requires less memory –Fast Disadvantages: –Cannot read backwards
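Because a SAX parser cannot go back, the handler has to remember whatever it may need later. A minimal sketch of the NASDAQ query with Python's xml.sax follows; the document text is an assumed reconstruction of the sample, ACCT-1 is a placeholder, and the handler assumes that account appears before buy and sell inside each transaction.

```python
import xml.sax

# Assumed reconstruction of the sample document; "ACCT-1" is a placeholder.
SAMPLE = b"""<transaction>
  <account>ACCT-1</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

class NasdaqAccounts(xml.sax.ContentHandler):
    """Collects accounts whose transaction contains a NASDAQ ticker."""

    def __init__(self):
        super().__init__()
        self.in_account = False
        self.account = ""      # state kept because we cannot look back
        self.matches = []

    def startElement(self, name, attrs):
        if name == "transaction":
            self.account = ""
        elif name == "account":
            self.in_account = True
        elif name == "ticker" and attrs.get("exch") == "NASDAQ":
            if self.account not in self.matches:
                self.matches.append(self.account)

    def characters(self, content):
        if self.in_account:
            self.account += content   # text may arrive in several chunks

    def endElement(self, name):
        if name == "account":
            self.in_account = False

handler = NasdaqAccounts()
xml.sax.parseString(SAMPLE, handler)
print(handler.matches)
```

Memory use is bounded by the handler's own state (one account string and the result list) rather than by the document size.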
Storing XML in a Relational Database
Why? Relational databases have been developed for about 30 years There is extensive knowledge on how to use them efficiently Why not take advantage of this knowledge? Main Challenges: –get XML into database (inserting) –get XML out of database (querying)
Reminder Relational Database simply contains some tables Each table can have any number of columns (also called attributes) Data items in each column are atomic, i.e., single values A schema is a description of a set of tables, i.e., the table name and each table’s column names
Difficulties DTDs can be complex Modeling Mismatch –Conceptually, relational databases, i.e., tables, have 2 levels: tables and attributes –XML documents have arbitrary nesting XML documents can have set-valued attributes and recursion
Difficulties [Figure: an XML Translation Layer on top of a Relational Database System. The layer maps a DTD to a Relational Schema (recording Translation Information), XML Documents to Tuples, an XML Query to an SQL Query, and the Relational Result back to an XML Result.]
Relational Databases: Option 1 The Schema-less Case
Option 1: Store Tree Structure Bart Simpson 02 – – person name tel Bart Simpson 02 – –
Option 1: Store Tree Structure (cont.) 1. Assign each node a unique id 2. For each node, store type and value 3. For each node, store parent information person name tel Bart Simpson 02 – –
Option 1: Store Tree Structure (cont.)
[Figure: the person tree with its name and tel children]
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…     …
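The three steps (unique id, type and value, parent) can be sketched in Python with ElementTree and an in-memory SQLite table. This is only an illustration: the telephone number is a placeholder, ids are assigned in document order (the slide's table appears to number level by level, which works equally well since ids only need to be unique), and attributes and mixed content are ignored.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Placeholder document; the telephone number is invented for illustration.
SAMPLE = "<person><name>Bart Simpson</name><tel>555-0100</tel></person>"

def shred(xml_text):
    """Return rows (node_id, type, value, parent_id) in document order."""
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1
        rows.append((node_id, "element", elem.tag, parent_id))
        if elem.text and elem.text.strip():
            rows.append((len(rows) + 1, "text", elem.text.strip(), node_id))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows

rows = shred(SAMPLE)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, value TEXT, parent INTEGER)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)
node_count = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
for row in rows:
    print(row)
```

The same four-column table accepts any document, which is what makes this the schema-less option.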
How Good Is This? Simple schema, can work with any document Translation from XML to tables is easy What about the translation back? –is this transformation lossless?
Answering XPath Queries Can you answer an XPath query that: –Just uses the Child axis, e.g., /a/b/c/d/e –Uses the Descendant axis at the beginning of the query, e.g., //a/b –Uses the Descendant axis in the middle of the query, e.g., /a/b//e –Uses the Following, Preceding, or Following-Sibling axis?
Solving the Problem With the current modeling, many types of XPath steps cannot be evaluated (e.g., a descendant step would need an unbounded number of parent joins) To solve this problem, we: –number the nodes by DFS ordering –store, for each node, the id of its last descendant
[Figure: the person tree (person, name, phones, tel) with nodes numbered in DFS order]
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…     …
Can you answer these queries, now?
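With DFS ids and a last-descendant column, descendant and following steps become simple range tests on ids. A minimal sketch, assuming a small hypothetical person document and numbering element nodes only (text nodes are skipped for brevity):

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only element nodes are numbered in this sketch.
SAMPLE = "<person><name/><phones><tel/><tel/></phones><email/></person>"

def number_nodes(xml_text):
    """Return {id: (tag, parent_id, last_desc_id)} using DFS (preorder) ids."""
    table = {}
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter
        for child in elem:
            visit(child, node_id)
        # After the subtree is visited, counter is the id of the last descendant.
        table[node_id] = (elem.tag, parent_id, counter)

    visit(ET.fromstring(xml_text), None)
    return table

def descendants(table, node_id):
    """d is a descendant of n exactly when n.id < d.id <= n.last_desc."""
    last = table[node_id][2]
    return [i for i in sorted(table) if node_id < i <= last]

def following(table, node_id):
    """f follows n exactly when f.id > n.last_desc."""
    last = table[node_id][2]
    return [i for i in sorted(table) if i > last]

table = number_nodes(SAMPLE)
print(table[3], descendants(table, 3), following(table, 3))
```

In SQL the same tests turn a step such as a//b into a single join with the condition b.id > a.id AND b.id <= a.LastDesc, independent of how deep b lies below a.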
Summary: Main Problems –No convenient method for creating XML as output –Each element in the path expression requires an additional join, which can become very expensive
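The join cost can be made concrete by generating the SQL for a child-only path over the node table. The sketch below assumes a table node(id, type, value, parent) and a hypothetical helper child_path_sql; it uses one alias of node per step, so a path with k steps needs k - 1 self-joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, value TEXT, parent INTEGER)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", [
    (1, "element", "a", None),
    (2, "element", "b", 1),
    (3, "element", "c", 2),
    (4, "element", "c", 1),   # this is /a/c, so it must not match /a/b/c
])

def child_path_sql(steps):
    """Build a SELECT with one alias of `node` per step of a child-only path."""
    froms = [f"node n{i}" for i in range(len(steps))]
    conds = ["n0.parent IS NULL"]                 # the path starts at the root
    for i in range(len(steps)):
        conds.append(f"n{i}.type = 'element' AND n{i}.value = ?")
        if i > 0:
            conds.append(f"n{i}.parent = n{i - 1}.id")   # one join per step
    return (f"SELECT n{len(steps) - 1}.id FROM " + ", ".join(froms)
            + " WHERE " + " AND ".join(conds))

steps = ["a", "b", "c"]                           # the query /a/b/c
sql = child_path_sql(steps)
result = [row[0] for row in conn.execute(sql, steps)]
print(sql)
print(result)
```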
Relational Databases: Option 2, Taking Advantage of DTDs Based On: Relational Databases for Querying XML Documents: Limitations and Opportunities By: Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton
Example XML The Selfish Gene Richard Dawkins Timbuktu Wouldn’t it be nice to store this as a table with the columns: booktitle author_id firstname lastname city zip
Example XML The Selfish Gene Richard Dawkins Timbuktu We can do this only if all XML documents that we will be considering follow this format. Otherwise, for example, what happens if there are 2 authors?
Considering the DTD If a DTD is given, then it defines what types of XML documents will be of interest Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations
Reducing the Complexity DTDs can be very complex Before translating a DTD to a relational schema, simplify the DTD Property of the Simplification: If D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2 –“almost” means that it conforms if the ordering of sub-elements is ignored
Simplification Rules
–(e1, e2)* → e1*, e2*
–(e1, e2)? → e1?, e2?
–(e1|e2) → e1?, e2?
–e1** → e1*
–e1*? → e1*
–e1?* → e1*
–e1?? → e1?
–e1+ → e1*
–..., a*, ..., a*, ... → a*, ...
–..., a*, ..., a?, ... → a*, ...
–..., a?, ..., a*, ... → a*, ...
–..., a?, ..., a?, ... → a*, ...
–..., a, ..., a, ... → a*, ...
Simplification Rules: Example
Applying the rules step by step to (b|c|e)?,(e?|f+):
(b|c|e)?,(e?|f+)
→ (b?,c?,e?)?,e??,f+?
→ b??,c??,e??,e??,f+?
→ b??,c??,e??,e??,f*?
→ b?,c?,e?,e?,f*
→ b?,c?,e*,f*
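The end result of the rewriting can also be computed in one bottom-up pass. The sketch below is not a literal rule-by-rule rewriter: it tracks, for each element name, whether it ends up plain, optional (?) or repeated (*), which gives the same final form on the worked example. The nested-tuple encoding of content models and the function names are assumptions made for this illustration.

```python
def simplify(expr):
    """Return {name: quantifier}, where quantifier is '', '?' or '*'.

    expr is an element name (str) or a tuple:
    ('seq', e1, e2, ...), ('alt', e1, e2, ...), ('?', e), ('*', e), ('+', e).
    """
    if isinstance(expr, str):
        return {expr: ""}
    op, *args = expr
    if op in ("seq", "alt"):
        merged = {}
        for arg in args:
            for name, q in simplify(arg).items():
                if op == "alt" and q == "":
                    q = "?"                      # (e1|e2) -> e1?, e2?
                # a name that occurs twice in one sequence collapses to name*
                merged[name] = "*" if name in merged else q
        return merged
    inner = simplify(args[0])
    if op == "?":
        # (e1, e2)? -> e1?, e2?   e?? -> e?   e*? -> e*
        return {name: (q or "?") for name, q in inner.items()}
    if op in ("*", "+"):
        # (e1, e2)* -> e1*, e2*   e+ -> e*   e?* -> e*   e** -> e*
        return {name: "*" for name in inner}
    raise ValueError(f"unknown operator: {op}")

def show(simplified):
    return ", ".join(name + q for name, q in simplified.items())

# The worked example: (b|c|e)?,(e?|f+)
example = ("seq",
           ("?", ("alt", "b", "c", "e")),
           ("alt", ("?", "e"), ("+", "f")))
print(show(simplify(example)))
```

Names are listed in order of first occurrence; since the simplification ignores sub-element order anyway, that choice does not affect which documents "almost conform".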
You try it Can you simplify the expression –(b|c|e)?,(e?|(f?,(b,b)*))*
DTD Graphs In order to describe a technique for converting a DTD to a schema, it is convenient to first describe DTDs (or rather simplified DTDs) as graphs The nodes of a DTD graph are the elements, attributes and operators in the DTD Each element appears exactly once in the graph Attributes and operators appear as many times as they appear in the DTD Cycles indicate recursion
DTD Example
Corresponding DTD Graph
Creating the Schema: Shared Inline Technique When creating the schema for a DTD, we create a relation for: –each element with in-degree greater than 1 –each element with in-degree 0 –each element below a * –one element from each set of mutually recursive elements that all have in-degree 1 All other elements are “inlined” into their parent’s relation (i.e., added as columns of their parent’s relation) –Note that the parent may itself be inlined
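The first three rules can be sketched as a small function over an edge list. The graph below is an assumption (the DTD example figure is not reproduced in the text); it was chosen to be consistent with the book, article, monograph, title and author relations of the schema that follows. Each edge records whether the child is reached through a * operator. The fourth rule (mutually recursive elements of in-degree 1) is not implemented; in this graph the monograph/editor cycle already gets a relation through the * rule.

```python
from collections import Counter

# Hypothetical simplified DTD graph: (parent, child, reached_through_star).
# Attribute nodes are omitted.
EDGES = [
    ("book", "booktitle", False), ("book", "author", False),
    ("article", "title", False), ("article", "author", True),
    ("article", "contactauthor", False),
    ("monograph", "title", False), ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False), ("author", "address", False),
    ("name", "firstname", False), ("name", "lastname", False),
]

def relation_elements(edges):
    """Elements that get their own relation under rules 1 to 3."""
    elements = {p for p, _, _ in edges} | {c for _, c, _ in edges}
    in_degree = Counter(c for _, c, _ in edges)
    below_star = {c for _, c, star in edges if star}
    return {e for e in elements
            if in_degree[e] == 0        # roots
            or in_degree[e] > 1         # shared elements
            or e in below_star}         # set-valued elements

relations = relation_elements(EDGES)
inlined = ({p for p, _, _ in EDGES} | {c for _, c, _ in EDGES}) - relations
print(sorted(relations))
print(sorted(inlined))
```

Everything in inlined becomes columns of the nearest ancestor that has a relation, for example author.name.firstname inside author.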
Relations for which elements?
book (bookID: integer, book.booktitle: string)
article (articleID: integer, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer)
author (author.parentID: integer, author.parentCODE: integer, authorID: integer, author.authorid: string, author.address: string, author.name.firstname: string, author.name.lastname: string)
What are these for?
Advantages/Disadvantages Advantages: –Reduces number of joins for queries like “get the first and last names of an author” –Efficient for queries such as “list all authors with name Jack” Disadvantages: –Extra join needed for “Article with a given title name”
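The difference between the two kinds of query can be seen on a cut-down version of the Shared schema in SQLite. Column names are shortened (dots are awkward in SQL identifiers), the rows are invented, and the parentCODE values 1 = article and 2 = monograph are assumptions made for this sketch; it also assumes parentID holds the id of the parent tuple and parentCODE says which relation that id refers to.

```python
import sqlite3

ARTICLE, MONOGRAPH = 1, 2   # hypothetical parentCODE values

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (articleID INTEGER PRIMARY KEY, contactauthor_authorid TEXT);
CREATE TABLE title (titleID INTEGER PRIMARY KEY, title TEXT,
                    parentID INTEGER, parentCODE INTEGER);
CREATE TABLE author (authorID INTEGER PRIMARY KEY, parentID INTEGER,
                     parentCODE INTEGER, firstname TEXT, lastname TEXT);
INSERT INTO article VALUES (1, 'a7');
INSERT INTO title VALUES (10, 'Storing XML', 1, 1);
INSERT INTO title VALUES (11, 'Storing XML', 5, 2);
INSERT INTO author VALUES (20, 1, 1, 'Jack', 'Example');
""")

# First and last names of authors: name is inlined into author, so no join.
names = conn.execute("SELECT firstname, lastname FROM author").fetchall()

# Articles with a given title: title has its own relation, so one join is
# needed, and parentCODE filters out titles that belong to monographs.
article_ids = [row[0] for row in conn.execute(
    """SELECT a.articleID
       FROM article a JOIN title t
         ON t.parentID = a.articleID AND t.parentCODE = ?
       WHERE t.title = ?""",
    (ARTICLE, "Storing XML"))]

print(names, article_ids)
```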
Notes Can/Should we use foreign keys to connect child tuples with their parents, e.g., titles with what they belong to? How can we answer queries, such as: –//title –//article/title –//article//name
Another Option: Hybrid Inlining Technique Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node
What, in addition, will be inline?
book (bookID: integer, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
Why do we still have an author relation?
Advantages/Disadvantages Advantages: –Reduces joins through shared elements (that are not set or recursive elements) –Reduces joins for queries like “get first and last names of a book author” (like Shared) Disadvantages: –Requires more SQL sub-queries to retrieve all authors with first name Jack (i.e., unions) Tradeoff between reducing number of unions and reducing number of joins – Shared and Hybrid target union- and join-reduction, respectively
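The union cost under Hybrid inlining can be sketched in SQLite. Author data now lives in three places (inlined in book, inlined in monograph, and in the author relation for authors reached through *), so "all authors with first name Jack" needs one sub-query per place. Tables are cut down, column names are shortened, and all rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (bookID INTEGER PRIMARY KEY, booktitle TEXT,
                   author_firstname TEXT, author_lastname TEXT);
CREATE TABLE monograph (monographID INTEGER PRIMARY KEY, title TEXT,
                        author_firstname TEXT, author_lastname TEXT);
CREATE TABLE author (authorID INTEGER PRIMARY KEY, parentID INTEGER,
                     firstname TEXT, lastname TEXT);
INSERT INTO book VALUES (1, 'Book One', 'Jack', 'A');
INSERT INTO book VALUES (2, 'Book Two', 'Jill', 'B');
INSERT INTO monograph VALUES (1, 'Monograph One', 'Jack', 'C');
INSERT INTO author VALUES (1, 7, 'Jack', 'D');
INSERT INTO author VALUES (2, 7, 'Ann', 'E');
""")

# One SELECT per place where author data is stored, glued together by UNION ALL.
SQL = """
SELECT author_firstname, author_lastname FROM book
 WHERE author_firstname = :first
UNION ALL
SELECT author_firstname, author_lastname FROM monograph
 WHERE author_firstname = :first
UNION ALL
SELECT firstname, lastname FROM author
 WHERE firstname = :first
"""
jacks = sorted(conn.execute(SQL, {"first": "Jack"}).fetchall())
print(jacks)
```

Under the Shared technique the same query is a single SELECT on author, while "first and last name of a book's author" needs a join there and none here; that is the union versus join tradeoff.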
XML in Major Databases All major databases now have some level of support for XML Example: Oracle –XML data type (can have a column which contains XML documents) –XPath processing of XML values –Some indexing capabilities –XML is a second class citizen in the database (support consists of a bunch of tools – no coherent framework)