XML Storage

Suppose that we are given some XML documents.
How should they be stored?
Why does it matter?
– Storage determines which types of use can be made of the XML efficiently
– Usage requirements determine which type of storage is needed

3 Basic Strategies
– Files
– Relational Database
– Native XML Database
What advantages do you think each approach has?
What disadvantages do you think each approach has?

XML Files

Idea
Store XML “as is”, in a file system
– When querying, parse the document and traverse it to find the query answer
Obvious Advantage: Simple storage system
Obvious Disadvantages:
– Must parse the XML document every time it is queried
– Does not take advantage of indexes to quickly get to “interesting” elements (to reach a given element, we must traverse everything that appears before it in the document)

Sample Document
(XML sample; only the text values WEBM and GE survive in this transcript)
What must we read to be able to get information about the ticker element?

How Is an XML Document Parsed?
Two basic types of parsers:
– DOM parser: Creates a tree out of the document
– SAX parser: Does not create any data structure for the document; notifies the program of every element seen
Both types of parsers have been standardized and have implementations in virtually every programming language

DOM Parser
DOM = Document Object Model
– The parser creates a tree object out of the document
– The user accesses data by traversing the tree
– The API allows for constructing, accessing and manipulating the structure and content of XML documents

Document as Tree
(Figure: the sample document as a tree. The root transaction has the children account, buy and sell; buy and sell carry a shares attribute (100 and 30) and each contains a ticker element (WEBM, GE) with an exch attribute.)
Methods like: getRoot, getChildren, getAttributes, etc.

Advantages and Disadvantages
How would you answer a query like:
– /transaction/buy
– //ticker
Advantages:
– Natural and relatively easy to use
– Can repeatedly query the tree without reparsing
Disadvantages:
– High memory requirements: the whole document is kept in memory
– Must parse the whole document and construct many objects before use
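The two queries above can be tried with a DOM parser. The following is a minimal sketch using Python's standard `xml.dom.minidom`; the sample document is a hypothetical reconstruction (the element names follow the slides, but the account value and the pairing of tickers with exchanges are assumptions).

```python
from xml.dom import minidom

# Hypothetical sample document, modelled on the transaction example.
SAMPLE = """<transaction>
  <account>a-1</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

# Parsing builds the whole tree in memory before any query can run.
doc = minidom.parseString(SAMPLE)
root = doc.documentElement

# /transaction/buy: inspect only the direct children of the root.
buys = [n for n in root.childNodes
        if n.nodeType == n.ELEMENT_NODE and n.tagName == "buy"]

# //ticker: search the entire tree for ticker elements.
tickers = [t.firstChild.data for t in doc.getElementsByTagName("ticker")]

print(buys[0].getAttribute("shares"))
print(tickers)
```

Once `doc` exists, both queries run against the same in-memory tree, which is the "query repeatedly without reparsing" advantage; the cost is that the whole document is held in memory.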

SAX Parser
SAX = Simple API for XML
– The parser creates “events” (i.e., notifications) while reading the document
– Goes through the document one time only

Document as Events
(The sample document is read as a stream of events.)
Start tag: transaction
Start tag: account
Text:
End tag: account
Start tag: buy
Attribute: shares Value: 100

Advantages and Disadvantages
How would you answer a query like:
– /transaction/buy
– find accounts in which something is bought or sold from the NASDAQ
Advantages:
– Requires less memory
– Fast
Disadvantages:
– Cannot read backwards
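The NASDAQ query above can be sketched with Python's standard `xml.sax`. This is an illustration under assumptions: the sample document, the `exch` attribute on `ticker`, and the handler class name `NasdaqAccounts` are all hypothetical. Because SAX cannot read backwards, the handler has to remember the account text until it learns whether the transaction touches NASDAQ.

```python
import xml.sax

# Hypothetical sample document, modelled on the transaction example.
SAMPLE = b"""<transaction>
  <account>a-1</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

class NasdaqAccounts(xml.sax.ContentHandler):
    """Collects the account of every transaction that trades on NASDAQ."""

    def __init__(self):
        super().__init__()
        self.path = []       # names of the currently open elements
        self.account = ""    # account text of the current transaction
        self.hit = False     # True once a NASDAQ ticker has been seen
        self.results = []

    def startElement(self, name, attrs):
        self.path.append(name)
        if name == "transaction":
            self.account, self.hit = "", False
        elif name == "ticker" and attrs.get("exch") == "NASDAQ":
            self.hit = True

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.path and self.path[-1] == "account":
            self.account += content

    def endElement(self, name):
        self.path.pop()
        if name == "transaction" and self.hit:
            self.results.append(self.account.strip())

handler = NasdaqAccounts()
xml.sax.parseString(SAMPLE, handler)
print(handler.results)
```

Only the small amount of state kept in the handler lives in memory; the document itself is never materialized.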

Storing XML in a Relational Database

Why?
– Relational databases have been developed for about 30 years
– There is extensive knowledge on how to use them efficiently
– Why not take advantage of this knowledge?
Main Challenges:
– get XML into the database (inserting): translating XML into tables
– get XML out of the database (querying): translating XPath into SQL

Reminder
– A relational database simply contains some tables
– Each table can have any number of columns (also called attributes)
– Data items in each column are atomic, i.e., single values
– A schema is a description of a set of tables, i.e., each table’s name and its column names

Difficulties
– DTDs can be complex
– Modeling mismatch:
  – Conceptually, relational databases have 2 levels: tables and attributes
  – XML documents have arbitrary nesting
– XML documents can have set-valued attributes and recursion

Relational Databases: Option 1 The Schema-less Case

Option 1: Store Tree Structure
(Example: a person element with name and tel children; name holds the text “Bart Simpson”. The markup and most of the telephone values are not preserved in this transcript.)

Option 1: Store Tree Structure (cont.)
1. Assign each node a unique id
2. For each node, store type and value
3. For each node, store parent information

Option 1: Store Tree Structure (cont.)
Node  Type     Value         ParentID
1     element  person        null
6     text     Bart Simpson  2
…
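The three steps above can be sketched as follows, using Python's standard `xml.etree.ElementTree` and `sqlite3`. This is an illustration under assumptions: the sample document, the table name `node`, and the function name `shred` are hypothetical, ids are assigned in document order (so they need not match the numbers in the table above), and attributes and mixed content are ignored for brevity.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample; the telephone value is a placeholder.
SAMPLE = "<person><name>Bart Simpson</name><tel>555-0100</tel></person>"

def shred(xml_text):
    """Return (id, type, value, parent_id) rows, ids in document order."""
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1
        rows.append((node_id, "element", elem.tag, parent_id))
        if elem.text and elem.text.strip():
            rows.append((len(rows) + 1, "text", elem.text.strip(), node_id))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, "
             "value TEXT, parent_id INTEGER)")
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", shred(SAMPLE))

# /person/name/text(): one self-join per step of the path.
names = conn.execute("""
    SELECT t.value
    FROM node p
    JOIN node n ON n.parent_id = p.id
    JOIN node t ON t.parent_id = n.id
    WHERE p.parent_id IS NULL AND p.type = 'element' AND p.value = 'person'
      AND n.type = 'element' AND n.value = 'name'
      AND t.type = 'text'
""").fetchall()
print(names)
```

The query shows why this schema gets expensive: every additional step in the path expression adds another self-join on the same table.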

How Good Is This?
– Simple schema, can work with any document
– Translation from XML to tables is easy
– What about the translation back? Is this transformation lossless?

Answering XPath Queries
Can you answer an XPath query that:
– Just uses the Child axis, e.g., /a/b/c/d/e
– Uses the Descendant axis at the beginning of the query, e.g., //a/b
– Uses the Descendant axis in the middle of the query, e.g., /a/b//e
– Uses the Following, Preceding, or Following-Sibling axis?

Solving the Problem
With the current modeling, it is not possible to evaluate many types of XPath steps
To solve this problem, we:
– number the nodes by DFS ordering
– store, for each node, the id of its last descendant

(Example: the person tree, now with a phones element grouping the tel elements.)
Node  Type     Value   ParentID  LastDesc
1     element  person  null      10
4     element  phones  1         8
…
Can you answer these queries now?
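The numbering idea can be sketched in a few lines. This is an illustration under assumptions: the sample document and the function names are hypothetical, and only element nodes are numbered (the slides also number text nodes). With DFS ids, the descendants of a node are exactly the ids between the node's own id and its last descendant.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
SAMPLE = ("<person><name>Bart</name>"
          "<phones><tel>1</tel><tel>2</tel></phones></person>")

def number(xml_text):
    """Map DFS id -> (tag, parent_id, last_desc) for element nodes."""
    table = {}
    counter = 0

    def visit(elem, parent_id):
        nonlocal counter
        counter += 1
        node_id = counter
        for child in elem:
            visit(child, node_id)
        # After the subtree is visited, counter is the largest id inside it.
        table[node_id] = (elem.tag, parent_id, counter)

    visit(ET.fromstring(xml_text), None)
    return table

def is_descendant(table, d, a):
    """d is a descendant of a iff a < d <= last_desc(a)."""
    return a < d <= table[a][2]

def is_following(table, f, n):
    """f follows n iff f is numbered after every descendant of n."""
    return f > table[n][2]

table = number(SAMPLE)
for node_id in sorted(table):
    print(node_id, table[node_id])
```

In SQL, the same tests become range conditions on the Node and LastDesc columns, so a descendant step needs a single join instead of one join per level.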

Summary: Main Problems
– No convenient method for creating XML as output
– Each element in the path expression requires an additional join, which can become very expensive

Relational Databases: Option 2, Taking Advantage of DTDs Based On: Relational Databases for Querying XML Documents: Limitations and Opportunities By: Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton

Framework
(Figure: an XML translation layer sits on top of a relational database system. It maps the DTD to a relational schema plus translation information, XML documents to tuples, and XML queries to SQL queries; relational results are converted back into XML results.)

Example XML
(Sample: a book titled “The Selfish Gene” by Richard Dawkins, with city Timbuktu; the markup is not preserved in this transcript.)
Wouldn’t it be nice to store this as a table with the columns: booktitle, author_id, firstname, lastname, city, zip

Example XML (cont.)
We can do this only if all XML documents that we will be considering follow this format. Otherwise, for example, what happens if there are 2 authors?

Considering the DTD
– If a DTD is given, then it defines what types of XML documents will be of interest
– Challenge: Given a DTD, find a relational schema such that ANY document conforming to the DTD can be stored in the relations

Reducing the Complexity
– DTDs can be very complex
– Before translating a DTD to a relational schema, simplify the DTD
– Property of the simplification: If D2 is a simplification of D1, then every document that conforms to D1 also almost conforms to D2
  – “almost” means that it conforms if the ordering of sub-elements is ignored

Simplification Rules
(e1, e2)* → e1*, e2*
(e1, e2)? → e1?, e2?
(e1 | e2) → e1?, e2?
e1** → e1*
e1*? → e1*
e1?* → e1*
e1?? → e1?
e1+ → e1*
..., a*, ..., a*, ... → a*, ...
..., a*, ..., a?, ... → a*, ...
..., a?, ..., a*, ... → a*, ...
..., a?, ..., a?, ... → a*, ...
..., a, ..., a, ... → a*, ...
..., a?, ..., a, ... → a*, ...
..., a, ..., a?, ... → a*, ...
..., a*, ..., a, ... → a*, ...
..., a, ..., a*, ... → a*, ...

Simplification Rules: Worked Example
Applying the simplification rules step by step:
(b|c|e)?,(e?|f+)
→ (b?,c?,e?)?,e??,f+?
→ b??,c??,e??,e??,f+?
→ b??,c??,e??,e??,f*?
→ b?,c?,e?,e?,f*
→ b?,c?,e*,f*

You Try It
Can you simplify the following expression using the simplification rules?
– (b|c|e)?,(e?|(f?,(b,b)*))*

DTD Graphs
To describe a technique for converting a DTD to a schema, it is convenient to first describe DTDs (or rather, simplified DTDs) as graphs
– The nodes are the elements, attributes and operators in the DTD
– Each element appears exactly once in the graph
– Attributes and operators appear as many times as they appear in the DTD
– Cycles indicate recursion

DTD Example

Corresponding DTD Graph
(Figure not preserved.)

Creating the Schema: Shared Inline Technique
When creating the schema for a DTD, we create a relation for:
– each element with in-degree greater than 1
– each element with in-degree 0
– each element below a *
– one element from each set of mutually recursive elements, all having in-degree 1
All other elements are “inlined” into their parent’s relation (i.e., added to their parent’s relation)
– Note that the parent may also be inlined
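The first three rules can be sketched as a small function over the DTD graph. This is an illustration under assumptions: the edge list `EDGES` is a hypothetical graph inferred from the relation names on the schema slides (the original DTD is not preserved here), attributes are left out, and the fourth rule (one element per set of mutually recursive elements with in-degree 1) is deliberately not implemented. In this graph it is not needed, because the monograph/editor cycle already passes through a * edge.

```python
from collections import Counter

# (parent, child, starred): starred means the child sits below a * operator.
# Hypothetical DTD graph; see the assumptions above.
EDGES = [
    ("book", "booktitle", False), ("book", "author", False),
    ("article", "title", False), ("article", "author", True),
    ("article", "contactauthor", False),
    ("monograph", "title", False), ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False), ("author", "address", False),
    ("name", "firstname", False), ("name", "lastname", False),
]

def shared_relations(edges):
    """Elements that get their own relation under rules 1 to 3."""
    elements = {p for p, _, _ in edges} | {c for _, c, _ in edges}
    in_degree = Counter(c for _, c, _ in edges)
    starred = {c for _, c, s in edges if s}
    return {e for e in elements
            if in_degree[e] == 0      # roots
            or in_degree[e] > 1       # shared elements
            or e in starred}          # set-valued elements

relations = shared_relations(EDGES)
inlined = ({p for p, _, _ in EDGES} | {c for _, c, _ in EDGES}) - relations
print(sorted(relations))
print(sorted(inlined))
```

Every element that is not returned is inlined into the relation of its closest ancestor that does have one, which is how column names such as author.name.firstname arise.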

Relations for which elements? attribute

book (bookID: integer, book.booktitle : string) article (articleID: integer, article.contactauthor.authorid: string) monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.editor.name: string) title (titleID: integer, title: string, title.parentID: integer, title.parentCODE: integer) author (author.parentID: integer, author.parentCODE: integer, authorID: integer, author.authorid: string author.address: string, author.name.firstname: string, author.name.lastname: string, ) What are these for?

Advantages/Disadvantages Advantages: –Reduces number of joins for queries like “get the first and last names of an author” –Efficient for queries such as “list all authors with name Jack” Disadvantages: –Extra join needed for “Article with a given title name”

Notes Can/Should we use foreign keys to connect child tuples with their parents, e.g., titles with what they belong to? How can we answer queries, such as: –//title –//article/title –//article//name

Another Option: Hybrid Inlining Technique Same as Shared, except also inline elements with in-degree greater than one for the places in which they are not recursive or reached through a * node

What, in addition, will be inlined?
(DTD graph figure not preserved.)

book (bookID: integer, book.booktitle: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
article (articleID: integer, article.contactauthor.authorid: string, article.title: string)
monograph (monographID: integer, monograph.parentID: integer, monograph.parentCODE: integer, monograph.title: string, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string, monograph.editor.name: string)
author (authorID: integer, author.parentID: integer, author.parentCODE: integer, author.name.firstname: string, author.name.lastname: string, author.address: string, author.authorid: string)
Why do we still have an author relation?

Advantages/Disadvantages
Advantages:
– Reduces joins through shared elements (that are not set or recursive elements)
– Reduces joins for queries like “get first and last names of a book author” (like Shared)
Disadvantages:
– Requires more SQL sub-queries (i.e., unions) to retrieve all authors with first name Jack
Tradeoff between reducing the number of unions and reducing the number of joins
– Shared and Hybrid target union-reduction and join-reduction, respectively

XML in Major Databases All major databases now have some level of support for XML Example: Oracle –XML data type (can have a column which contains XML documents) –XPath processing of XML values –Some indexing capabilities –XML is a second class citizen in the database (support consists of a bunch of tools – no coherent framework)

Homework (Part 1) Consider the DTD: <!DOCTYPE a [ ]>

Homework (Part 1)
– Simplify the DTD and draw the DTD graph that corresponds to the simplified DTD.
– Show the schema that would be created using the Shared-Inline Technique.
– Show the schema that would be created using the Hybrid-Inlining Technique.
NOTE: This example is a bit tricky. Make sure that you follow the rules given in class and that documents can be reconstructed from (1) the data stored in the relations and (2) the knowledge of the DTD structure

Native Databases for XML

Basic Idea
Store XML as a tree
Main Challenge: make querying efficient (recall the difficulties when storing XML as a file)
– appropriate indexing
– efficient query processing
Several native XML database systems have been developed:
– TIMBER (University of Michigan)
– ToX (University of Toronto)
– etc.

Natix
(Figure: a bib tree with book, title and author nodes, split across storage blocks.)
– Subtrees are stored in blocks
– When a block is full, another block is used
– A pointer leads to the block containing the child

Indexing
– In order to do efficient query processing, indexes are used
– Reminder: An index is a structure that “points” directly to nodes satisfying a given constraint
– More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff)

Indexing Strategy
We will discuss different indexing strategies and query processing with these indexes
– Element and value inverted lists
– Rotated paths
– Graph-based indexes

Element and Value Inverted Lists

Basic Indexes
At minimum, the following indexes are usually stored:
– Value indexes: for each value appearing in the tree, there is a list of nodes containing the value
– Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element
Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression
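A minimal sketch of the two basic indexes, built in one pass over the document. The sample document and the function name `build_indexes` are hypothetical, and the sketch simplifies the model of the slides: only elements receive ids, and an attribute or text value points to its enclosing element rather than to a separate node.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical sample document, modelled on the transaction example.
SAMPLE = """<transaction>
  <account>a-1</account>
  <buy shares="100"><ticker exch="NASDAQ">WEBM</ticker></buy>
  <sell shares="30"><ticker exch="NYSE">GE</ticker></sell>
</transaction>"""

def build_indexes(xml_text):
    """Return (element_index, value_index), both mapping a key to node ids."""
    element_index = defaultdict(list)  # element name -> ids of such elements
    value_index = defaultdict(list)    # value -> ids of elements holding it
    next_id = 0

    def visit(elem):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        element_index[elem.tag].append(node_id)
        for value in elem.attrib.values():
            value_index[value].append(node_id)
        if elem.text and elem.text.strip():
            value_index[elem.text.strip()].append(node_id)
        for child in elem:
            visit(child)

    visit(ET.fromstring(xml_text))
    return element_index, value_index

element_index, value_index = build_indexes(SAMPLE)
print(dict(element_index))
print(dict(value_index))
```

Because ids are handed out in document order, every inverted list comes out sorted, which the merge-based join algorithms later in these slides rely on.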

Example: Value Indexes
(Figure: the transaction tree with numbered nodes.)
WEBM → 10
NYSE → 16, 9

Example: Element Indexes
(Figure: the transaction tree with numbered nodes.)
buy → 4
exch → 15, 8

Example: Structure Indexes
(Figure: the transaction tree with numbered nodes.)
//buy//exch → 8

Query Processing
Suppose that we only have value indexes and element indexes
How should we process the query //buy//exch?
– Strategy 1: Find buy elements. Then traverse the subtrees of these elements to look for exch elements
– Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements
Which is a better strategy?

//buy//exch: Strategy 1
(Figure: the transaction tree, with element index entries buy → 4 and exch → 15, 8.)

//buy//exch: Strategy 2
(Figure: the transaction tree, with element index entries buy → 4 and exch → 15, 8.)

Both Strategies Are BAD!
– Both strategies require traversal of the tree
– Many disk reads
– Will be inefficient if the tree is large!
GOAL: Answer queries using indexes only, without traversing the XML tree

Improving the Execution
Instead of storing a running id for each element, store a triple: (start, end, level)
– Find buy elements
– Find exch elements
– Merge these two lists by finding exch elements that are nested within buy elements
Level is used in case we are interested in finding children, not descendants
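The (start, end, level) labeling can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and positions are assigned by counting start and end tags, so a leaf gets two distinct positions (the slides number positions differently, giving leaves start = end). The containment tests work the same way under either numbering.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an a element containing a b and a nested a with two b's.
SAMPLE = "<a><b/><a><b/><b/></a></a>"

def label(xml_text):
    """Return {tag: [(start, end, level), ...]}, each list sorted by start."""
    lists = {}
    pos = 0

    def visit(elem, level):
        nonlocal pos
        pos += 1                      # position of the start tag
        start = pos
        for child in elem:
            visit(child, level + 1)
        pos += 1                      # position of the end tag
        lists.setdefault(elem.tag, []).append((start, pos, level))

    visit(ET.fromstring(xml_text), 1)
    for entries in lists.values():
        entries.sort()
    return lists

def contains(anc, desc):
    """desc is a descendant of anc iff its interval is nested inside anc's."""
    return anc[0] < desc[0] and desc[1] < anc[1]

def is_child(parent, child):
    """A child is a descendant exactly one level deeper."""
    return contains(parent, child) and child[2] == parent[2] + 1

lists = label(SAMPLE)
print(lists)
```

Both tests look only at the two triples, so ancestor and parent relationships can be decided from the index lists alone, without touching the stored tree.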

//buy//exch: Improved
Entries are (Start, End, Level) triples:
buy: (4,10,2)
exch: (15,17,4), (8,9,4)
Merge the 2 lists by finding descendant elements
What does this remind you of?

Merging Lists
What is the complexity of merging the lists?
Is it enough to go through each list once?
– Assuming the lists are sorted by start?
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
(Figure: a tree with two a nodes and three b nodes.)

Merging Lists: Example
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
(Figure: a (1,7,1) has the children b (2,2,2) and a (3,6,2); a (3,6,2) has the children b (4,4,3) and b (5,5,3).)
a list: (1,7,1), (3,6,2)
b list: (2,2,2), (4,4,3), (5,5,3)
Where should we go on the b list?
– We did extra work
– Need a method to find the correct place to start in the b list

Minimizing the Work
Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart
See:
– Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents”, Proc. of VLDB 2002
– Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002
– Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002

Goal
Efficiently find all pairs of nodes n, m such that m is a descendant (or child) of n, and n and m have the user-specified labels
– E.g., a//b, c//d, e/f
Recall:
– For any label, we have a sorted list (i.e., an index) of nodes with that label
– The sorted list of ids contains both the starting position of a node and its ending position

Stack-Tree Algorithms: Intuition
– A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree
– An ancestor-descendant structural relationship is manifested as the descendant appearing above the ancestor on the stack
– Unfortunately, a depth-first traversal requires going over the whole tree
  – DON’T GO OVER THE TREE!! ONLY THE INDEX

Stack-Tree Algorithms
We will study the algorithm
– Stack-Tree-Desc, which returns the result ordered by (desc-start, anc-start)
The paper also discusses the algorithm
– Stack-Tree-Anc, which returns the result ordered by (anc-start, desc-start)
Why is the ordering of the result of interest?

Stack-Tree-Desc
a = Alist->first node; d = Dlist->first node; OutputList = NULL;
while (lists are not finished or stack is not empty) {
    if (a.startPos < d.startPos) then e = a; else e = d;
    while (stack not empty and e.startPos > stack.Top().endPos)
        stack.Pop();
    if (e == a) {
        stack.Push(a);
        a = a->nextNode;
    } else {
        for each a’ in stack do
            append (a’, d) to OutputList;
        d = d->nextNode;
    }
}
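The pseudocode above can be sketched in runnable form. This is an illustration under assumptions: nodes are plain (start, end) tuples, the two lists are sorted by start and hold differently labeled nodes, and the loop simply stops once Dlist is exhausted (no further output is possible after that point), which simplifies the loop condition of the pseudocode. The function name `stack_tree_desc` is hypothetical.

```python
def stack_tree_desc(alist, dlist):
    """Structural join over two lists of (start, end) pairs sorted by start.

    Returns (ancestor, descendant) pairs ordered by descendant start.
    """
    out, stack = [], []
    i = j = 0
    while j < len(dlist):
        # Take whichever list's next node starts first.
        if i < len(alist) and alist[i][0] < dlist[j][0]:
            node, from_alist = alist[i], True
        else:
            node, from_alist = dlist[j], False
        # Pop ancestors whose interval ended before this node starts.
        while stack and node[0] > stack[-1][1]:
            stack.pop()
        if from_alist:
            stack.append(node)
            i += 1
        else:
            # Every node left on the stack contains the current descendant.
            out.extend((anc, node) for anc in stack)
            j += 1
    return out

# The a and b lists from the merging example earlier in the slides.
a_list = [(1, 7), (3, 6)]
b_list = [(2, 2), (4, 4), (5, 5)]
pairs = stack_tree_desc(a_list, b_list)
for pair in pairs:
    print(pair)
```

Each list is scanned once and each ancestor is pushed and popped at most once, so apart from writing the output the work is linear in the sizes of the two lists.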

Stack-Tree-Desc: //section//paragraph
(Figure: an article tree containing nested section and paragraph elements. The section elements a1, a2, a3 form Alist; the paragraph elements d1 to d7 form Dlist.)
Note: These lists are not created at the beginning of the algorithm. They are already available!

Stack-Tree-Desc: Trace
Nodes in order of start position: a1, d1, a2, d2, a3, d3, d4, d5, d6, d7
Stack (as it grows): a1; then a1, a2; then a1, a2, a3
Output: (a1,d1), (a1,d2), (a2,d2), (a1,d3), (a2,d3), (a3,d3), (a1,d4), (a2,d4), (a3,d4), (a1,d5), (a2,d5), (a1,d6)

Analysis of Stack-Tree-Desc
O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant structural relationships
– Each Alist element is pushed once and popped once, so stack operations take O(|Alist|)
– The inner “for loop” outputs a new pair each time, so its total time is O(|OutputList|)

Questions and Disadvantages
Can a similar algorithm be used to compute other axes?
– e.g., child, following
How can we use an algorithm for computing a single “step” to compute an entire XPath query?
– E.g., //a//b[//c/d]//e

Tree Pattern Can Be Computed From Structural Relationships
(Figure: the tree pattern book with title = XML and author = jane, decomposed into its individual child and descendant edges.)
– The algorithm presented computes only a single-edge query
– Results can be combined to answer the entire query

Homework (Part 2)
The underlying assumption behind the algorithm Stack-Tree-Desc is that the XML forms a tree. Can the algorithm easily be adapted to return ancestor-descendant pairs (as it does now) if the XML forms a graph?
– If so, how? If not, explain intuitively why this is difficult.

Graph-Based Indexes: DataGuides

Exploiting Regularity
– XML documents tend to have a very repetitive structure
– The structure can be summarized in a (relatively) small graph, called a dataguide
– Nodes in a dataguide point to their corresponding nodes in the XML document
– Strategy: Evaluate the query over the graph. Then find the corresponding nodes in the document
  – Very efficient if the dataguide fits into main memory

Notes
– In this work, we will model documents as graphs with the labels on the edges
– We will only consider path queries (no branching)
– Our XML documents can be arbitrary graphs
– There are many different types of indexes that exploit the same idea
  – this was the first (1997)

An Example DataGuide: Intuition
(Figure not preserved.)
How would you evaluate the queries:
– //Name
– /Restaurant/Owner

DataGuides: Formally
Given a data source (i.e., XML document) X, a graph D is a dataguide for X if:
– every path of labels appearing in X appears exactly once in D (conciseness)
– every path of labels appearing in D appears at least once in X (accuracy)

Example Revisited
(Figure: document X next to DataGuide D.)
– Observe that every path in X also appears in D
– Observe that no path (from the root) appears twice in D

Is this a DataGuide?
(Four slides, each comparing a candidate graph with the same document X, whose edges are labeled A, B, C and D; figures not preserved.)

Choosing a DataGuide
(Figure: document X with two candidate dataguides, Option 1 and Option 2.)
What does D point to?

Strong DataGuide: Formally
Consider source X and dataguide D
– Let p, p’ be two label paths
– Let p(X) be the set of nodes reached in X by traversing path p
– We define p ≡_X p’ if p(X) = p’(X)
  – That is, p and p’ are indistinguishable on X
– D is a strong DataGuide for a database X if the equivalence relations ≡_D and ≡_X are the same

Strong DataGuides
(Figure with graphs (a), (b) and (c) not preserved.)
Is (b) a strong dataguide for (a)?
Is (c) a strong dataguide for (a)?

Creating a Strong Dataguide
Strong dataguides can be used as indexes since they are unambiguous
How big might a strong dataguide be?
Can it be created efficiently?
– In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one
– If the XML is a tree, it can be created in linear time

MakeDataGuide(n) {
    dg = NewObject()
    targetHash.Insert({n}, dg)
    RecursiveMake({n}, dg)
}
RecursiveMake(t1, d1) {
    p = set of (label, child) pairs of each object in t1
    foreach (unique label l in p) {
        t2 = set of node-ids paired with l in p
        d2 = targetHash.Lookup(t2)
        if (d2 != nil) {
            add an edge from d1 to d2 with label l
        } else {
            d2 = NewObject()
            targetHash.Insert(t2, d2)
            add an edge from d1 to d2 with label l
            RecursiveMake(t2, d2)
        }
    }
}
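The construction above can be sketched in runnable form. This is an illustration under assumptions: the source graph is given as a dictionary from node id to a list of (label, child) pairs, the source data `SOURCE` is a hypothetical example, DataGuide nodes are numbered from 0 in creation order, and the helper `lookup` for evaluating a label path is an addition that does not appear in the pseudocode.

```python
def make_dataguide(edges, root):
    """Build a strong DataGuide for a labeled graph.

    Returns (dg_edges, targets): dg_edges maps a DataGuide node to
    {label: DataGuide node}; targets maps it to its set of source nodes.
    """
    target_hash = {}   # frozenset of source nodes -> DataGuide node
    dg_edges, targets = {}, {}

    def new_object(target_set):
        dg_id = len(targets)
        target_hash[target_set] = dg_id
        targets[dg_id] = target_set
        dg_edges[dg_id] = {}
        return dg_id

    def recursive_make(t1, d1):
        # Group the children of all objects in t1 by edge label.
        by_label = {}
        for node in t1:
            for label, child in edges.get(node, []):
                by_label.setdefault(label, set()).add(child)
        for label in sorted(by_label):
            t2 = frozenset(by_label[label])
            d2 = target_hash.get(t2)
            if d2 is None:
                d2 = new_object(t2)   # registered first, so cycles terminate
                dg_edges[d1][label] = d2
                recursive_make(t2, d2)
            else:
                dg_edges[d1][label] = d2

    start = frozenset([root])
    recursive_make(start, new_object(start))
    return dg_edges, targets

def lookup(dg_edges, targets, path):
    """Evaluate a label path on the DataGuide; return matching source nodes."""
    d = 0
    for label in path:
        d = dg_edges[d].get(label)
        if d is None:
            return frozenset()
    return targets[d]

# Hypothetical source graph: two A children and one B child under the root.
SOURCE = {1: [("A", 2), ("A", 3), ("B", 4)],
          2: [("C", 5)], 3: [("C", 6)], 4: [("C", 6)]}
dg_edges, targets = make_dataguide(SOURCE, 1)
print(dg_edges)
print(sorted(lookup(dg_edges, targets, ["A", "C"])))
```

Each DataGuide node stands for the exact set of source nodes reached by a label path, so a path query is answered by walking the guide and reading off that target set.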

Can You Create a Strong DataGuide?
Intuition: If the sets of nodes reachable by two simple paths are equal, then the two paths are represented by a single node.
(Figure: a source graph with nodes 1 to 6 and edge labels A, B and C, next to its strong DataGuide, whose nodes correspond to the target sets 1; 2,4; 3,5; and 6.)
Compute on blackboard

Summary
Advantages:
– If the dataguide can fit in memory, evaluation can be performed efficiently for path queries
Disadvantages:
– May be large (why is this worse here than for the rotated lexicon?)
– Only good for simple queries. Which axes?

Homework (Part 3)
– Construct a strong dataguide for this document, using the algorithm shown. (Document figure not preserved.)
– Show an example of a database, strong dataguide and XPath query such that evaluating the XPath query on the dataguide (and then finding the corresponding database nodes) yields a different answer than evaluating the query directly on the database.