XML Storage.

XML Storage
Suppose that we are given some XML documents.
How should they be stored? Why does it matter?
  The choice of storage determines which kinds of use of the XML can be made efficient.
  Usage requirements determine which type of storage is needed.

3 Basic Strategies
  Files
  Relational Database
  Native XML Database
What advantages do you think that each approach has? What disadvantages?

XML Files

Idea
Store XML "as is", in a file system.
When querying, parse the document and traverse it to find the query answer.
Obvious Advantage: Simple storage system
Obvious Disadvantages:
  Must parse the XML document every time it is queried
  Does not take advantage of indexes to quickly get to "interesting" elements (in order to reach a given element, must traverse everything appearing beforehand in the document)

Sample Document
<transaction>
  <account>89-344</account>
  <buy shares="100">
    <ticker exch="NASDAQ">WEBM</ticker>
  </buy>
  <sell shares="30">
    <ticker exch="NYSE">GE</ticker>
  </sell>
</transaction>
What must we read to be able to get information about the ticker element?

How is an XML document Parsed?
Basic types of parsers:
  DOM parser: Creates a tree out of the document.
  SAX parser: Does not create any data structures. Notifies the program for every element seen ("pushes" parsing events to the user).
  Pull parser (StAX): Similar to SAX in memory requirements. Uses an iterator-style interface.
Parser implementations exist for virtually every programming language.

DOM Parser
DOM = Document Object Model
Parser creates a tree object out of the document.
User accesses data by traversing the tree.
The API allows for constructing, accessing and manipulating the structure and content of XML documents.

Document as Tree
Methods like: getRoot, getChildren, getAttributes, etc.
transaction
  account
    89-344
  buy
    shares = 100
    ticker
      exch = NASDAQ
      WEBM
  sell
    shares = 30
    ticker
      exch = NYSE
      GE

Advantages and Disadvantages
How would you answer a query like: /transaction/buy or //ticker?
Advantages:
  Natural and relatively easy to use
  Can repeatedly query the tree without reparsing
Disadvantages:
  High memory requirements: the whole document is kept in memory
  Must parse the whole document and construct many objects before use
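A minimal sketch of how the two queries above could be answered with a DOM parser, here using Python's standard xml.dom.minidom module. The variable names (DOC, buys, tickers) are illustrative only and not part of the original slides.

```python
from xml.dom.minidom import parseString

DOC = """<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

dom = parseString(DOC)        # the whole document is parsed into a tree here
root = dom.documentElement    # the <transaction> element

# /transaction/buy: element children of the root that are named "buy".
buys = [n for n in root.childNodes
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "buy"]

# //ticker: every ticker element anywhere in the tree, in document order.
tickers = dom.getElementsByTagName("ticker")

buy_shares = [b.getAttribute("shares") for b in buys]
ticker_info = [(t.getAttribute("exch"), t.firstChild.data) for t in tickers]
print(buy_shares, ticker_info)
```

Both queries run against the same in-memory tree, which illustrates the "query repeatedly without reparsing" advantage and also the memory cost: nothing can be asked before the full tree exists.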

SAX Parser
SAX = Simple API for XML
Parser creates "events" (i.e., notifications) while reading the document.
Goes through the document one time only.

Document as Events
For the sample transaction document, the parser reports, in document order:
  Start tag: transaction
  Start tag: account
  Text: 89-344
  End tag: account
  Start tag: buy
  Attribute: shares
  Value: 100
  ...

Advantages and Disadvantages
How would you answer a query like:
  /transaction/buy
  find accounts in which something is bought or sold from the NASDAQ
Advantages:
  Requires less memory
  Fast
Disadvantages:
  Cannot read backwards
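A minimal sketch of the second query using Python's standard xml.sax module. Because the parser cannot read backwards, the handler has to remember the account number until a NASDAQ ticker is seen; the class name NasdaqAccounts and its fields are hypothetical, and the sketch assumes that account appears before the buy and sell elements, as in the sample document.

```python
import xml.sax

DOC = """<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

class NasdaqAccounts(xml.sax.ContentHandler):
    """Collects account numbers of transactions that trade on NASDAQ."""

    def __init__(self):
        super().__init__()
        self.in_account = False   # are we inside an <account> element?
        self.account = ""         # account text remembered for later events
        self.result = []

    def startElement(self, name, attrs):
        if name == "account":
            self.in_account = True
            self.account = ""
        elif name == "ticker" and attrs.get("exch") == "NASDAQ":
            if self.account not in self.result:
                self.result.append(self.account)

    def characters(self, content):
        if self.in_account:
            self.account += content   # text may arrive in several chunks

    def endElement(self, name):
        if name == "account":
            self.in_account = False

handler = NasdaqAccounts()
xml.sax.parseString(DOC.encode("utf-8"), handler)
nasdaq_accounts = handler.result
print(nasdaq_accounts)
```

The only state kept is one string and one flag, regardless of document size. If the account element came after the tickers, the handler would instead have to buffer the tickers, which is the practical meaning of "cannot read backwards".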

Compression
Even if XML is stored "as-is", we would like to compress the data.
Important since XML is very verbose!
Types of compression:
  Compression-oriented: Goal is to maximize compression ratio
  Query-oriented: Integrate compression with the XPath processor, so that evaluation can be performed directly on compressed data
Ideas?

Storing XML in a Relational Database

Why?
Relational databases have been developed for about 30 years.
There is extensive knowledge on how to use them efficiently. Why not take advantage of this knowledge?
Main Challenges:
  Get XML into the database (inserting): translating XML into tables
  Get XML out of the database (querying): translating XPath into SQL

Reminder
A relational database simply contains some tables.
Each table can have any number of columns (also called attributes).
Data items in each column are atomic, i.e., single values.
A schema is a description of a set of tables, i.e., each table's name and its column names.

Difficulties
DTDs can be complex.
Modeling mismatch:
  Conceptually, relational databases, i.e., tables, have 2 levels: tables and attributes
  XML documents have arbitrary nesting
  XML documents can have set-valued attributes and recursion

Relational Databases: Option 1
The Schema-less Case

Option 1: Store Tree Structure
<person>
  <name> Bart Simpson </name>
  <tel> 02 – </tel>
  <tel> 051 – </tel>
  < > </ >
</person>
[Figure: the corresponding tree, with person at the root, children name and tel, and text leaves Bart Simpson, 02 – and 051 –]

Option 1: Store Tree Structure (cont.)
[Figure: the person tree with a unique id, 1 to 9, on each node]
1. Assign each node a unique id
2. For each node, store type and value
3. For each node, store parent information

Option 1: Store Tree Structure (cont.)
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…
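A minimal sketch of this storage scheme, using Python's standard xml.etree.ElementTree and sqlite3 modules. The table name node, the column names and the phone numbers are assumptions made for illustration (the slide elides the real numbers), ids are assigned in document order rather than as in the figure, and attributes and mixed content are ignored.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document; the phone numbers are placeholders.
DOC = ("<person><name>Bart Simpson</name>"
       "<tel>555-0100</tel><tel>555-0101</tel></person>")

def shred(xml_text):
    """Return rows (id, type, value, parent_id), one per element or text node."""
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1
        rows.append((node_id, "element", elem.tag, parent_id))
        if elem.text and elem.text.strip():
            rows.append((len(rows) + 1, "text", elem.text.strip(), node_id))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows

rows = shred(DOC)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, "
           "value TEXT, parent_id INTEGER)")
db.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)

# /person/tel/text(): every step of the path becomes one more self-join.
tels = [r[0] for r in db.execute("""
    SELECT t.value
    FROM node p
    JOIN node c ON c.parent_id = p.id
    JOIN node t ON t.parent_id = c.id
    WHERE p.parent_id IS NULL AND p.type = 'element' AND p.value = 'person'
      AND c.type = 'element' AND c.value = 'tel'
      AND t.type = 'text'
    ORDER BY t.id""")]
print(rows)
print(tels)
```

The same single table works for any document, which is the appeal of the schema-less approach; the price shows in the query, where a three-step path already needs three copies of the table.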

How Good Is This?
Simple schema, can work with any document.
Translation from XML to tables is easy.
What about the translation back? Is this transformation lossless?

Answering XPath Queries
Can you answer an XPath query that:
  Just uses the child axis, e.g., /a/b/c/d/e
  Uses the descendant axis at the beginning of the query, e.g., //a/b
  Uses the descendant axis in the middle of the query, e.g., /a/b//e
  Uses the following, preceding or following-sibling axis?

Solving the Problem
With the current modeling, it is not possible to evaluate many different types of steps of XPath queries.
To solve this problem, we:
  number the nodes by DFS ordering
  store, for each node, the id of its last descendant

Can you answer these queries, now?
[Figure: the person tree (person, name, phones, tel, tel) with nodes numbered 1 to 10 in DFS order]
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…
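A minimal sketch of why DFS numbering plus LastDesc is enough for the descendant and following axes. The helper names (number_nodes, descendants, following) and the phone numbers are hypothetical, and the document omits the elided element of the slide, so the root's LastDesc is 8 here rather than 10.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the phone numbers are placeholders.
DOC = ("<person><name>Bart Simpson</name>"
       "<phones><tel>555-0100</tel><tel>555-0101</tel></phones></person>")

def number_nodes(xml_text):
    """Return {id: (type, value, parent_id, last_desc)}, ids in DFS order."""
    rows = {}
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter
        if elem.text and elem.text.strip():
            counter += 1
            rows[counter] = ("text", elem.text.strip(), node_id, counter)
        for child in elem:
            visit(child, node_id)
        # After the whole subtree is numbered, counter is the last descendant.
        rows[node_id] = ("element", elem.tag, parent_id, counter)

    visit(ET.fromstring(xml_text), None)
    return rows

def descendants(rows, anc_id, name):
    """Descendant axis: ids strictly inside the interval (anc_id, last_desc]."""
    last_desc = rows[anc_id][3]
    return sorted(i for i, (typ, value, _, _) in rows.items()
                  if anc_id < i <= last_desc
                  and typ == "element" and value == name)

def following(rows, node_id):
    """Following axis: every node numbered after the node's last descendant."""
    return sorted(i for i in rows if i > rows[node_id][3])

rows = number_nodes(DOC)
print(rows[4], descendants(rows, 1, "tel"), following(rows, 2))
```

In SQL the descendant test becomes a single range condition of the form d.id > a.id AND d.id <= a.lastdesc, so a // step costs one join instead of an unbounded chain of parent-child joins.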

Summary: Main Problems
No convenient method for creating XML as output.
Each element in the path expression requires an additional join, which can become very expensive.

Relational Databases: Option 2, Taking Advantage of DTDs
Based on: "Relational Databases for Querying XML Documents: Limitations and Opportunities", by Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton

Framework
[Figure: framework diagram. Inputs and outputs: DTD, XML documents, XML query, XML result. An XML Translation Layer sits on top of the Relational Database System and exchanges with it: relational schema, translation information, tuples, SQL query, relational result]

Example XML
<book>
  <booktitle> The Selfish Gene </booktitle>
  <author id="dawkins">
    <name>
      <firstname> Richard </firstname>
      <lastname> Dawkins </lastname>
    </name>
    <address>
      <city> Timbuktu </city>
      <zip> </zip>
    </address>
  </author>
</book>
Wouldn't it be nice to store this as a table with the columns:
  booktitle, author_id, firstname, lastname, city, zip
We can do this only if all XML documents that we will be considering follow this format. Otherwise, for example, what happens if there are 2 authors?

Considering the DTD
If a DTD is given, then it defines what types of XML documents will be of interest.
Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations.
<!ELEMENT a ((b|c|e)?,(e?|(f?,(b,b)*))*)>

Reducing the Complexity
DTDs can be very complex.
Before translating a DTD to a relational schema, simplify the DTD.
Property of the simplification: If D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2.
  "Almost" means that it conforms if the ordering of sub-elements is ignored.

Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1|e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
..., a*, ..., a*, ... → a*, ...
..., a*, ..., a?, ... → a*, ...
..., a?, ..., a*, ... → a*, ...
..., a?, ..., a?, ... → a*, ...
..., a, ..., a, ... → a*, ...
..., a?, ..., a, ... → a*, ...
..., a, ..., a?, ... → a*, ...
..., a*, ..., a, ... → a*, ...
..., a, ..., a*, ... → a*, ...

Example: simplifying (b|c|e)?,(e?|f+)
  (b|c|e)?,(e?|f+)
→ (b?,c?,e?)?,e??,f+?      by (e1|e2) → e1?, e2?
→ b??,c??,e??,e??,f+?      by (e1, e2)? → e1?, e2?
→ b??,c??,e??,e??,f*?      by e1+ → e1*
→ b?,c?,e?,e?,f*           by e1?? → e1? and e1*? → e1*
→ b?,c?,e*,f*              by ..., a?, ..., a?, ... → a*, ...
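The rules can be applied mechanically. The sketch below is one possible way to do so, not the procedure from the cited paper: it assumes a content model is given as nested tuples (a hypothetical encoding chosen for this example), pushes the operators down to the element names, and then merges repeated names into a starred one.

```python
def make_optional(q):
    """Quantifier after applying '?' on top of q: e1?? -> e1?, e1*? -> e1*."""
    return "*" if q == "*" else "?"

def flatten(expr):
    """Return a list of (element_name, quantifier), quantifier in '', '?', '*'.

    expr is ("name", n), ("seq", [..]), ("alt", [..]) or (op, sub)
    with op one of "?", "*", "+".
    """
    kind = expr[0]
    if kind == "name":
        return [(expr[1], "")]
    if kind == "seq":
        return [item for part in expr[1] for item in flatten(part)]
    if kind == "alt":                # (e1|e2) -> e1?, e2?
        return [(n, make_optional(q))
                for part in expr[1] for n, q in flatten(part)]
    if kind == "?":                  # (e1, e2)? -> e1?, e2?
        return [(n, make_optional(q)) for n, q in flatten(expr[1])]
    if kind in ("*", "+"):           # e1+ -> e1*, (e1, e2)* -> e1*, e2*
        return [(n, "*") for n, _ in flatten(expr[1])]
    raise ValueError("unknown node kind: %r" % (kind,))

def simplify(expr):
    merged = {}                      # keeps first-occurrence order
    for name, q in flatten(expr):
        # A name seen twice becomes a*, whatever its quantifiers were.
        merged[name] = "*" if name in merged else q
    return ", ".join(n + q for n, q in merged.items())

def name(n):
    return ("name", n)

# (b|c|e)?,(e?|f+)
example = ("seq", [
    ("?", ("alt", [name("b"), name("c"), name("e")])),
    ("alt", [("?", name("e")), ("+", name("f"))]),
])
result = simplify(example)
print(result)
```

On the worked example this prints b?, c?, e*, f*, the same result as the step-by-step derivation. The order of sub-elements is lost in the process, which is exactly the "almost conforms" caveat.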

You try it
Can you simplify the expression (b|c|e)?,(e?|(f?,(b,b)*))* using the simplification rules?

DTD Graphs
In order to describe a technique for converting a DTD to a schema, it is convenient to first describe DTDs (or rather simplified DTDs) as graphs.
  The nodes are the elements, attributes and operators in the DTD
  Each element appears exactly once in the graph
  Attributes and operators appear as many times as they appear in the DTD
  Cycles indicate recursion

DTD Example
[Figure: the example DTD]

Corresponding DTD Graph
[Figure: the DTD graph of the example; legend: attribute]

Very Naïve Storage
Store a table for each element name, with columns:
  ID
  parentID (if it has an incoming edge)
  parentCODE (if it has an incoming edge)
  isRoot
  Textual data, for elements of type PCDATA or attributes of type CDATA

Partial example:
book (bookID: int, isRoot: boolean)
booktitle (titleID: int, data: string, parentID: int, parentCode: int, isRoot: boolean)
article (articleID: int, isRoot: boolean)
contactauthor (contactauthorID: int, parentID: int, parentCode: int, isRoot: boolean)
authorid (authoridID: int, data: string, …
title (titleID: int, data: string, parentID: int, parentCODE: int, isRoot: boolean)
…

Disadvantages?
Many, many joins!
Some relations serve basically no purpose (such as contactauthor).
Solution: Inlining! Store some of the data of the children within the table of the parent.
When? Suggestions?

Creating the Schema: Shared Inline Technique
When creating the schema for a DTD, we create a relation for:
  each element with in-degree greater than 1
  each element with in-degree 0
  each element below a *
  one element from each set of mutually recursive elements, all having in-degree 1
All other elements are "inlined" into their parent's relation (i.e., added into their parent's relation).
Note that the parent may also be inlined.

In the Relations, Store:
  Id of node
  Boolean isRoot column, for each of the inlined fields (omitted in the examples)
  Text content of all leaf nodes that are inlined
  For all nodes with an incoming edge: parentID and parentCODE

Relations for which elements?
[Figure: the DTD graph of the example; legend: attribute]

What are these for?
book (bookID: integer, book.booktitle: string)
article (articleID: integer, article.contactauthor.authorid: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.name: string)
title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.authorid: string, author.address: string, author.name.firstname: string, author.name.lastname: string)

How many isRoot columns would you add to article? To monograph?
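A minimal sketch of the relation-selection step of Shared Inlining. The DTD graph itself is a figure that is not reproduced in this transcript, so the edge list below is an assumption reconstructed from the relations listed above; attribute nodes are left out because they are always inlined. Only the in-degree and star rules are implemented. The rule for mutually recursive elements of in-degree 1 is omitted, since in this assumed graph the editor/monograph cycle is already broken by monograph being below a *.

```python
from collections import Counter

# (parent, child, reached_through_star): assumed edges of the DTD graph.
EDGES = [
    ("book", "booktitle", False),
    ("book", "author", False),
    ("article", "title", False),
    ("article", "author", True),
    ("article", "contactauthor", False),
    ("monograph", "title", False),
    ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False),
    ("author", "address", False),
    ("name", "firstname", False),
    ("name", "lastname", False),
]

def shared_relations(edges):
    """Elements that get their own relation (recursion rule not handled)."""
    elements = {p for p, _, _ in edges} | {c for _, c, _ in edges}
    in_degree = Counter(c for _, c, _ in edges)
    below_star = {c for _, c, starred in edges if starred}
    return sorted(e for e in elements
                  if in_degree[e] == 0      # roots
                  or in_degree[e] > 1       # shared by several parents
                  or e in below_star)       # set-valued

def inlined(edges):
    """Every other element is stored inside an ancestor's relation."""
    elements = {p for p, _, _ in edges} | {c for _, c, _ in edges}
    return sorted(elements - set(shared_relations(edges)))

relations = shared_relations(EDGES)
print(relations)
print(inlined(EDGES))
```

Under these assumptions the sketch selects article, author, book, monograph and title, matching the five relations in the schema, while booktitle, contactauthor, editor, name, address, firstname and lastname end up as columns of their ancestors.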

Advantages/Disadvantages
Advantages:
  Reduces number of joins for queries like "get the first and last names of an author"
  Efficient for queries such as "list all authors with name Jack"
Disadvantages:
  Extra join needed for "Article with a given title name"

Notes
Can/Should we use foreign keys to connect child tuples with their parents, e.g., titles with what they belong to?
How can we answer queries, such as:
  //title
  //article/title
  //article//name

Another Option: Hybrid Inlining Technique
Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node

What, in addition, will be inlined?
[Figure: the DTD graph of the example; legend: attribute]

Why do we still have an author relation?
book (bookID: integer, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)

Advantages/Disadvantages
Advantages:
  Reduces joins through shared elements (that are not set or recursive elements)
  Reduces joins for queries like "get first and last names of a book author" (like Shared)
Disadvantages:
  Requires more SQL sub-queries (i.e., unions) to retrieve all authors with first name Jack
Tradeoff between reducing the number of unions and reducing the number of joins:
  Shared and Hybrid target union- and join-reduction, respectively

XML in Major Databases
All major databases now have some level of support for XML.
Example: Oracle
  XML data type (can have a column which contains XML documents)
  XPath processing of XML values
  Some indexing capabilities
XML is a second-class citizen in the database (support consists of a bunch of tools, with no coherent framework).

Try It
Consider the DTD:
<!DOCTYPE a [
<!ELEMENT a (d+|b)?>
<!ELEMENT b (c,f)>
<!ELEMENT c (a)>
<!ELEMENT e (d)>
<!ELEMENT d (f)>
<!ELEMENT f (g,h)>
<!ELEMENT g (#PCDATA)>
<!ELEMENT h (#PCDATA)>
]>
Simplify the DTD and draw the DTD graph that corresponds to the simplified DTD.
Show the schema that would be created using the Shared-Inline Technique.
Show the schema that would be created using the Hybrid-Inlining Technique.

Native Databases for XML

Basic Idea
Store XML as a tree.
Main Challenge: make querying efficient (recall the difficulties when storing XML as a file)
  appropriate indexing
  efficient query processing
Several native XML database systems have been developed:
  TIMBER (University of Michigan)
  ToX (University of Toronto)
  etc.

Natix
Subtrees are stored in blocks. When a block is full, another block is used.
<bib>
  <book>
    <title>...</title>
    <author>...</author>
  </book>
</bib>
[Figure: the tree bib, book, title, author split across blocks, with pointers to the blocks containing children]

Indexing
In order to do efficient query processing, indexes are used.
Reminder: An index is a structure that "points" directly to nodes satisfying a given constraint.
More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff).

Indexing Strategy
We will discuss different indexing strategies and query processing with these indices:
  Element and value inverted lists
  Rotated paths
  Graph-based indexes

Element and Value Inverted Lists

Basic Indexes
At minimum, the following indexes are usually stored:
  Value indexes: for each value appearing in the tree, there is a list of nodes containing the value
  Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element
Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression
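A minimal sketch of element and value inverted lists for the sample transaction document. It assumes a simplified numbering in which only elements receive ids (in document order) and a value is indexed under the element that carries it as text or as an attribute, so the ids differ from those in the slides' figures. The function name build_indexes is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = """<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

def build_indexes(xml_text):
    """Return (element_index, value_index), each mapping a key to node ids."""
    element_index = defaultdict(list)
    value_index = defaultdict(list)
    # Element.iter() walks the tree in document order, root first.
    for node_id, elem in enumerate(ET.fromstring(xml_text).iter(), start=1):
        element_index[elem.tag].append(node_id)
        if elem.text and elem.text.strip():
            value_index[elem.text.strip()].append(node_id)
        for attr_value in elem.attrib.values():
            value_index[attr_value].append(node_id)
    return element_index, value_index

element_index, value_index = build_indexes(DOC)
print(dict(element_index))
print(dict(value_index))
```

Each list comes out sorted by node id because the nodes are visited in document order, a property the merge-based join algorithms depend on.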

Example: Value Indexes
[Figure: the transaction tree with nodes numbered 1 to 17; value index entries: WEBM → 10, NYSE → 16]

Example: Element Indexes
[Figure: the same tree; element index entries: buy → 4, exch → 8, 15]

Example: Structure Indexes
[Figure: the same tree; structure index entry: //buy//exch → 8]

Query Processing
Suppose that we only have value indexes and element indexes.
How should we process the query //buy//exch?
  Strategy 1: Find buy elements. Then traverse the subtree of these elements to look for exch elements
  Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements
Which is a better strategy?

//buy//exch: Strategy 1
[Figure: the transaction tree with the element index entries buy → 4 and exch → 8, 15; the subtree of the buy node is traversed]

//buy//exch: Strategy 2
[Figure: the same tree and index entries; the ancestors of the exch nodes are traversed]

Both Strategies Are BAD!
Both strategies require traversal of the tree:
  Many disk reads
  Will be inefficient if the tree is large!
GOAL: Answer queries using indices only, without traversing the XML tree

Improving the Execution
Instead of storing a running id for each element, store a triple: (start, end, level)
  Find buy elements
  Find exch elements
  Merge these two lists by finding exch elements that are nested within buy elements
Level is used in case we are interested in finding children, not descendants

//buy//exch: Improved
        (Start, End, Level)
buy     (4,10,2)
exch    (8,9,4) (15,17,4)
Merge the 2 lists by finding descendant elements.
What does this remind you of?
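A minimal sketch of the (start, end, level) encoding and the containment test it enables. It assumes a numbering in which a counter is advanced once when an element opens and once when it closes, and only elements are numbered, so the triples differ from those on the slide. Since exch is an attribute in the sample document, the sketch evaluates //buy//ticker instead. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<transaction>
  <account>89-344</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

def region_index(xml_text):
    """Element index: tag -> list of (start, end, level), sorted by start."""
    index = {}
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1
        index.setdefault(elem.tag, []).append((start, counter, level))

    visit(ET.fromstring(xml_text), 1)
    for triples in index.values():
        triples.sort()
    return index

def is_descendant(anc, desc):
    """desc lies strictly inside anc's interval."""
    return anc[0] < desc[0] and desc[1] < anc[1]

def is_child(anc, desc):
    """A child is a descendant exactly one level deeper."""
    return is_descendant(anc, desc) and desc[2] == anc[2] + 1

index = region_index(DOC)
# //buy//ticker answered from the two index lists only, no tree traversal.
pairs = [(a, d) for a in index["buy"] for d in index["ticker"]
         if is_descendant(a, d)]
print(index["buy"], index["ticker"], pairs)
```

The nested loop used here compares every pair and is only meant to show the test itself; the point of the merge algorithms is to get the same answer in a single coordinated pass over the two sorted lists.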

Merging Lists
What is the complexity of merging the lists?
Is it enough to go through each list once, assuming the lists are sorted by start?
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
[Figure: a tree with nodes labeled a, b, a, b, b]

Merging Lists: Example
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
a (1,7,1)
  b (2,2,2)
  a (3,6,2)
    b (4,4,3)
    b (5,5,3)
a list: (1,7,1) (3,6,2)
b list: (2,2,2) (4,4,3) (5,5,3)
Where should we go on the b list?

We did extra work
Need a method to find the correct place to start in the b list

Minimizing the Work
Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart. See:
  Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, "Efficient Structural Joins on Indexed XML Documents", Proc. of VLDB 2002
  Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, "Structural Joins: A Primitive for Efficient XML Query Pattern Matching", ICDE 2002
  Nicolas Bruno, Nick Koudas, Divesh Srivastava, "Holistic Twig Joins: Optimal XML Pattern Matching", ACM SIGMOD 2002

Goal
Efficiently find all pairs of nodes n, m such that m is a descendant (child) of n, and n and m have the user-specified labels
  E.g., a//b, c//d, e/f
Recall: For any label, we have a sorted list (i.e., an index) of nodes with that label
The sorted list of ids contains both the starting position of a node and its ending position

Stack-Tree Algorithms: Intuition
A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree.
An ancestor-descendant structural relationship is manifested as the ancestor appearing earlier on the stack than the descendant.
Unfortunately, a depth-first traversal requires going over the whole tree.
DON'T GO OVER THE TREE!! ONLY THE INDEX

Stack-Tree Algorithms
We will study the algorithm Stack-Tree-Desc, which returns the result ordered by (desc-start, anc-start).
The paper also discusses the algorithm Stack-Tree-Anc, which returns the result ordered by (anc-start, desc-start).
Why is the ordering of the result of interest?

Stack-Tree-Desc
a = Alist->first node; d = Dlist->first node; OutputList = NULL;
while (lists are not finished or stack is not empty) {
    // e is the next node in document order, taken from either list
    if (a.startPos < d.startPos) then e = a; else e = d;
    // pop ancestors that end before e starts: they cannot contain e
    while (stack not empty and e.startPos > stack.Top().endPos)
        stack.Pop();
    if (e == a) {
        stack.Push(a);
        a = a->nextNode;
    } else {
        // every node still on the stack is an ancestor of d
        for each a' in stack do
            append (a', d) to OutputList;
        d = d->nextNode;
    }
}
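A runnable sketch of the pseudocode above, under the assumption that each list holds (start, end, level) triples sorted by start. The loop condition is made explicit here: the join stops once the descendant list is exhausted, or once the ancestor list is exhausted and the stack is empty, since no further pairs are possible. The function name stack_tree_desc is chosen for this example.

```python
def stack_tree_desc(alist, dlist):
    """Return all (ancestor, descendant) pairs, ordered by descendant start.

    alist, dlist: lists of (start, end, level) triples sorted by start.
    """
    output = []
    stack = []              # ancestors whose interval is still open
    i = j = 0
    while j < len(dlist) and (i < len(alist) or stack):
        # Pick the next node in document order from either list.
        if i < len(alist) and alist[i][0] < dlist[j][0]:
            e, from_alist = alist[i], True
        else:
            e, from_alist = dlist[j], False
        # Pop ancestors that end before e starts: they cannot contain e.
        while stack and e[0] > stack[-1][1]:
            stack.pop()
        if from_alist:
            stack.append(e)
            i += 1
        else:
            # Every node still on the stack contains e.
            for anc in stack:
                output.append((anc, e))
            j += 1
    return output

# The a/b example from the merging slides.
a_list = [(1, 7, 1), (3, 6, 2)]
b_list = [(2, 2, 2), (4, 4, 3), (5, 5, 3)]
pairs = stack_tree_desc(a_list, b_list)
for anc, desc in pairs:
    print(anc, desc)
```

Each input node is pushed or consumed once and popped at most once, so apart from writing the output the work is linear in the two list lengths. The stack never grows beyond the nesting depth of the ancestor label.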

Stack-Tree-Desc: //section//paragraph
article
  section (a1)
    paragraph (d1)
    section (a2)
      paragraph (d2)
      section (a3)
        paragraph (d3)
        paragraph (d4)
      paragraph (d5)
    paragraph (d6)
  paragraph (d7)
Alist: a1, a2, a3 (the section elements)
Dlist: d1, ..., d7 (the paragraph elements)
Note: These lists are not created at the beginning of the algorithm. They are already available!
[Figure: animation of the stack and the output as the nodes a1, d1, a2, d2, a3, d3, d4, d5, d6, d7 are processed in order of start position]

Questions and Disadvantages
Can a similar algorithm be used to compute other axes, e.g., child, following?
How can we use an algorithm for computing a single "step" to compute an entire XPath query? E.g., //a//b[//c/d]//e

Tree Pattern Can Be Computed From Structural Relationships
[Figure: a tree pattern over book, title and author with the values XML and jane, drawn twice; legend: descendant edge, child edge]
The algorithm presented only computes a single-edge query. Results can be combined to answer the entire query.

Graph-Based Indexes: DataGuides

Exploiting Regularity
XML documents tend to have a very repetitive structure.
The structure can be summarized in a (relatively) small graph, called a dataguide.
Nodes in a dataguide point to their corresponding nodes in the XML document.
Strategy: Evaluate the query over the graph. Then find the corresponding nodes in the document.
Very efficient if the dataguide fits into main memory.

Notes
In this work, we will model documents as graphs with the labels on the edges.
We will only consider path queries (no branching).
Our XML documents can be arbitrary graphs.
There are many different types of indexes that exploit the same idea; this was the first (1997).

An Example DataGuide: Intuition
How would you evaluate the queries:
  //Name
  /Restaurant/Owner

DataGuides: Formally
Given a data source (i.e., XML document) X, a graph D is a dataguide for X if:
  every path of labels appearing in X appears exactly once in D (conciseness)
  every path of labels appearing in D appears at least once in X (accuracy)

Example Revisited
Observe that every path in X also appears in D.
Observe that no path (from the root) appears twice in D.
[Figure: Document X next to DataGuide D]

Is this a DataGuide?
[Figure: four candidate graphs over the labels A, B, C and D, each shown next to Document X]

Choosing a DataGuide
[Figure: Document X with two candidate DataGuides, Option 1 and Option 2]
What does D point to?

Strong DataGuide: Formally
Consider a source X and a dataguide D.
Let p, p' be two label paths.
Let p(X) be the set of nodes reached in X by traversing path p.
We define p ≡X p' if p(X) = p'(X). That is, p and p' are indistinguishable on X.
D is a strong DataGuide for a database X if the equivalence relations ≡D and ≡X are the same.

Strong DataGuides Is (b) a strong dataguide for (a)?
Is (c) a strong dataguide for (a)?

Creating a Strong Dataguide
Strong dataguides can be used as indexes since they are unambiguous.
How big might a strong dataguide be?
Can it be created efficiently?
  In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one
  If the XML is a tree, it can be created in linear time

MakeDataGuide(n) {
    dg = NewObject()
    targetHash.Insert({n}, dg)
    RecursiveMake({n}, dg)
}

RecursiveMake(t1, d1) {
    p = set of <label, node-id> children pairs of each object in t1
    foreach (unique label l in p) {
        t2 = set of node-ids paired with l in p
        d2 = targetHash.Lookup(t2)
        if (d2 != nil) {
            add an edge from d1 to d2 with label l
        } else {
            d2 = NewObject()
            targetHash.Insert(t2, d2)
            add an edge from d1 to d2 with label l
            RecursiveMake(t2, d2)
        }
    }
}
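A runnable sketch of the construction above. It assumes the source is given as a dictionary from node id to a list of (label, child id) edges; the small restaurant graph is a hypothetical example, not the one from the slides, and the helper eval_path is added only to show how a path query is answered on the dataguide.

```python
def make_dataguide(children, root):
    """Build a strong dataguide by the subset construction.

    children: {node_id: [(label, child_id), ...]} for the source graph.
    Returns (edges, extents): edges[d][label] is the dataguide node reached
    from d via label, extents[d] is the set of source nodes d points to.
    """
    target_hash = {}    # frozenset of source nodes -> dataguide node id
    edges = {}

    def recursive_make(t1, d1):
        by_label = {}   # label -> set of source nodes reached from t1
        for node in t1:
            for label, child in children.get(node, []):
                by_label.setdefault(label, set()).add(child)
        for label in sorted(by_label):
            t2 = frozenset(by_label[label])
            if t2 in target_hash:
                # Target set seen before: reuse its dataguide node.
                edges[d1][label] = target_hash[t2]
            else:
                d2 = len(target_hash)
                target_hash[t2] = d2
                edges[d2] = {}
                edges[d1][label] = d2
                recursive_make(t2, d2)

    start = frozenset([root])
    target_hash[start] = 0
    edges[0] = {}
    recursive_make(start, 0)
    extents = {d: set(t) for t, d in target_hash.items()}
    return edges, extents

def eval_path(edges, extents, labels):
    """Answer a child-axis path query by walking the dataguide only."""
    d = 0
    for label in labels:
        d = edges[d].get(label)
        if d is None:
            return set()
    return extents[d]

# Hypothetical source: node 1 is the root with two restaurants.
SOURCE = {
    1: [("restaurant", 2), ("restaurant", 3)],
    2: [("name", 4), ("owner", 5)],
    3: [("name", 6)],
}
edges, extents = make_dataguide(SOURCE, 1)
names = eval_path(edges, extents, ["restaurant", "name"])
print(edges, names)
```

The lookup in target_hash is what guarantees termination on cyclic sources: a set of nodes that was already given a dataguide node is never expanded again. On a tree every target set is new, which is why the construction is linear in that case.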

Can you create a Strong DataGuide?
Intuition: If the sets of nodes which are reachable for simple paths are equal, then the simple paths are represented as a single node.
Compute on blackboard
[Figure: a source graph with nodes 1 to 6 and edge labels A, B and C, next to its strong DataGuide, whose nodes include the target sets {2,4} and {3,5}]

Summary
Advantages:
  If the dataguide can fit in memory, evaluation can be performed efficiently for path queries
Disadvantages:
  May be large (why is this worse here than for the rotated lexicon?)
  Only good for simple queries. Which axes?

Try It
Construct a strong dataguide for this document, using the algorithm shown.
Show an example of a database, strong dataguide and XPath query such that evaluating the XPath query on the dataguide (and then finding the corresponding database nodes) yields a different answer than evaluating the query directly on the database.
