
Presentation transcript:

1 Native Databases for XML

2 Store XML as a tree Main Challenge: make querying efficient (recall the difficulties when storing XML as a file) –appropriate indexing –efficient query processing Several native XML database systems have been developed: –TIMBER (University of Michigan) –ToX (University of Toronto) –etc. Basic Idea

3 Storing XML in Files: Natix Subtrees are stored in blocks. When a block is full, another block is used. [Figure: tree with nodes bib, book, title, author, split across blocks; each block holds a pointer to the block containing the child]

4 Indexing In order to do efficient query processing, indexes are used Reminder: An index is a structure that “points” directly to nodes satisfying a given constraint More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff)

5 Indexing Strategy We will discuss 3 different indexing strategies and their query processing problems –Element and value inverted lists –Rotated paths –Graph-based indexes

6 Element and Value Inverted Lists

7 Basic Indexes At minimum, the following indexes are usually stored: –Value indexes: for each value appearing in the tree there is a list of nodes containing the value –Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression
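A minimal sketch of how the two basic indexes could be built, assuming node ids are assigned in document order and that non-empty text nodes also receive an id. The function name `build_indexes`, the sample document and the numbering scheme are illustrative assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_indexes(xml_text):
    """Return (element_index, value_index): name or value -> list of node ids.

    Assumption: ids follow document order and count both element nodes
    and non-empty text nodes.
    """
    root = ET.fromstring(xml_text)
    element_index = defaultdict(list)
    value_index = defaultdict(list)
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        element_index[elem.tag].append(counter)   # element inverted list
        text = (elem.text or "").strip()
        if text:
            counter += 1
            value_index[text].append(counter)     # value inverted list
        for child in elem:
            visit(child)

    visit(root)
    return dict(element_index), dict(value_index)

# Hypothetical document loosely modelled on the transaction example.
DOC = """<transaction><account>7</account>
<buy><ticker>WEBM</ticker><exch>NYSE</exch></buy>
<sell><ticker>GE</ticker><exch>NYSE</exch></sell></transaction>"""

element_index, value_index = build_indexes(DOC)
```

Each list comes out sorted by node id, which is what the list-merging slides later rely on.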

8 Example: Value Indexes [Figure: transaction tree with elements transaction, account, buy, sell, ticker, shares, exch and values 100, 30, WEBM, GE, NYSE] Value index: WEBM → 10; NYSE → 16, 9

9 Example: Element Indexes [Figure: transaction tree with elements transaction, account, buy, sell, ticker, shares, exch and values 100, 30, WEBM, GE, NYSE] Element index: buy → 4; exch → 15, 8

10 Example: Structure Indexes [Figure: transaction tree with elements transaction, account, buy, sell, ticker, shares, exch and values 100, 30, WEBM, GE, NYSE] Structure index: //buy//exch → 8

11 Query Processing Suppose that we only have value indexes and element indexes How should we process the query: //buy//exch ? –Strategy 1: Find buy elements. Then traverse the subtree of these elements to look for exch elements –Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements Which is a better strategy?

12 //buy//exch: Strategy 1 [Figure: transaction tree, traversed downward from the buy element] Element index: buy → 4; exch → 15, 8

13 //buy//exch: Strategy 2 [Figure: transaction tree, traversed upward from the exch elements] Element index: buy → 4; exch → 15, 8

14 Improving the Execution Instead of storing a running id for each element, store a triple: (start, end, level). Find buy elements. Find exch elements. Merge these two lists by finding exch elements that are nested within buy elements. Level is used in case we are interested in finding children, not descendants.
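The labelling idea can be sketched as follows. This is an illustration under assumptions: the counter advances once on entering and once on leaving an element, so leaves get end = start + 1 (the slides' own example gives leaves end = start), and the names `label`, `is_descendant` and `is_child` are hypothetical.

```python
import xml.etree.ElementTree as ET

def label(xml_text):
    """Return (tag, start, end, level) for every element, in document order."""
    root = ET.fromstring(xml_text)
    labels = []
    pos = 0

    def visit(elem, level):
        nonlocal pos
        pos += 1                        # position on entering the element
        entry = [elem.tag, pos, None, level]
        labels.append(entry)
        for child in elem:
            visit(child, level + 1)
        pos += 1                        # position on leaving the element
        entry[2] = pos

    visit(root, 1)
    return [tuple(e) for e in labels]

def is_descendant(d, a):
    """d is nested inside a iff a's interval strictly contains d's interval."""
    return a[1] < d[1] and d[2] < a[2]

def is_child(d, a):
    """A child is a descendant exactly one level deeper."""
    return is_descendant(d, a) and d[3] == a[3] + 1

labels = label("<a><b/><a><b/><b/></a></a>")
```

With these triples, both containment tests need only the labels and never touch the tree itself.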

15 //buy//exch: Improved buy: (4,10,2) exch: (15,17,4), (8,9,4) Each triple is (start, end, level). Merge the 2 lists by finding descendant elements. What does this remind you of?

16 Merging Lists What is the complexity of merging the lists? Is it enough to go through each list once? –Assuming the lists are sorted by start? Example: Suppose we want to find all pairs of a and b such that b is a descendant of a [Figure: small tree of a and b nodes]

17 Merging Lists: Example Example: Suppose we want to find all pairs of a and b such that b is a descendant of a. a list: (1,7,1), (3,6,2) b list: (2,2,2), (4,4,3), (5,5,3) [Figure: tree whose nodes carry these labels] Where should we go on the b list?

18 Merging Lists: Example Example: Suppose we want to find all pairs of a and b such that b is a descendant of a. a list: (1,7,1), (3,6,2) b list: (2,2,2), (4,4,3), (5,5,3) [Figure: tree whose nodes carry these labels]

19 Merging Lists: Example We did extra work. We need a method to find the correct place to start in the b list. a list: (1,7,1), (3,6,2) b list: (2,2,2), (4,4,3), (5,5,3) [Figure: tree whose nodes carry these labels]

20 Minimizing the Work Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart. See: –Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents”, Proc. of VLDB 2002 –Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002 –Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002

21 Tree Patterns Can Be Computed from Structural Relationships [Figure: a tree pattern over book, title, author with values XML and jane, broken into single edges; legend: descendant edge, child edge] The algorithm we present only computes a single-edge query. Results can be combined.

22 Stack-Tree Algorithms: Intuition A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree. An ancestor-descendant structural relationship is manifested as the ancestor being on the stack (pushed earlier) when the descendant is visited. Unfortunately, a depth-first traversal requires going over the entire tree. –DON’T GO OVER THE TREE!! ONLY THE INDEX

23 Stack-Tree Algorithms We will study the algorithm –Stack-Tree-Desc that returns the result ordered by (desc-start, anc-start) Paper also discusses the algorithm –Stack-Tree-Anc that returns the result ordered by (anc-start, desc-start) Why is the ordering of the result of interest?

24 Stack-Tree-Desc
a = Alist->first node; d = Dlist->first node; OutputList = NULL;
while (Dlist is not exhausted) {   // no pair can be output once d runs out
    if (Alist is not exhausted and a.startPos < d.startPos) then e = a; else e = d;
    // pop every stacked ancestor that ends before e starts
    while (stack is not empty and e.startPos > stack.Top().endPos)
        stack.Pop();
    if (e == a) {
        stack.Push(a);
        a = a->nextNode;
    } else {
        for each a’ in stack do append (a’, d) to OutputList;
        d = d->nextNode;
    }
}
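The pseudocode above can be sketched in runnable form as follows. It is an illustration, not the paper's code: nodes are reduced to (start, end) pairs, the loop stops once the descendant list is exhausted, and the name `stack_tree_desc` is an assumption. The input is the a/b example from the merging slides.

```python
def stack_tree_desc(alist, dlist):
    """Join ancestor and descendant lists, both sorted by start position.

    Each node is a (start, end) pair. The result is ordered by
    (desc-start, anc-start).
    """
    output = []
    stack = []            # ancestors whose region is still open
    i = j = 0
    while j < len(dlist):
        take_a = i < len(alist) and alist[i][0] < dlist[j][0]
        e = alist[i] if take_a else dlist[j]
        # Drop stacked ancestors that end before e starts.
        while stack and e[0] > stack[-1][1]:
            stack.pop()
        if take_a:
            stack.append(e)
            i += 1
        else:
            # Every node left on the stack contains e.
            for a in stack:
                output.append((a, e))
            j += 1
    return output

# Lists from the merging example: a = (1,7), (3,6); b = (2,2), (4,4), (5,5).
pairs = stack_tree_desc([(1, 7), (3, 6)], [(2, 2), (4, 4), (5, 5)])
```

Each list is scanned once and nothing outside the two index lists is read, which is the point of the preceding slides.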

25 Stack-Tree-Desc: //section//paragraph [Figure: article tree containing nested section and paragraph elements with text]

26 Stack-Tree-Desc: //section//paragraph [Figure: article tree with the section elements (Alist) highlighted]

27 Stack-Tree-Desc: //section//paragraph [Figure: article tree with the paragraph elements (Dlist) highlighted]

28 Stack-Tree-Desc: //section//paragraph [Figure: article tree with section elements labelled a1, a2, a3 and paragraph elements labelled d1 to d7]

29 Stack-Tree-Desc: //section//paragraph [Figure: article tree with section elements labelled a1, a2, a3 and paragraph elements labelled d1 to d7, next to the section list (a1, a2, a3) and the paragraph list] Note: These lists are not created at the beginning of the algorithm. They are already available!

30 Stack-Tree-Desc Document order: a1 d1 a2 d2 a3 d3 d4 d5 d6 d7 Stack: grows to a1, a2, a3 and is then popped. Output: (a1,d1) (a1,d2),(a2,d2) (a1,d3),(a2,d3),(a3,d3) (a1,d4),(a2,d4),(a3,d4) (a1,d5),(a2,d5) (a1,d6)

31 Analysis of Stack-Tree-Desc O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant structural relationships. –Each Alist element is pushed once and popped once, so stack operations take O(|Alist|). –The inner “for loop” outputs a new pair each time, so its total time is O(|OutputList|).

32 Questions and Disadvantages Can a similar algorithm be used to compute other axes? –e.g., child, following Main Disadvantage: Each step of the path expression is computed separately –may find many intermediate results that will be discarded

33 Rotated Paths YAPI: Yet Another Path Index for XML searching Giuseppe Amato, Franca Debole, Fausto Rabitti, Pavel Zezula

34 Remember This?
Term Number   Term
1             abhor
2             bear
4             labor
6             labour
Rotated Form  Address
$abhor        (1,0)
$bear         (2,0)
$labor        (4,0)
$labour       (6,0)
abhor$        (1,1)
abor$l        (4,2)
abour$l       (6,2)
r$abho        (1,5)
r$bea         (2,4)
r$labo        (4,5)
r$labou       (6,6)
Note: We do not actually store the rotated string in the rotated lexicon. The pair of numbers (term number, rotation offset) is enough for binary search

35 Remember This? (Same term list and rotated lexicon as on the previous slide.) How would you find the terms for: lab*, *or, *ab*, l*r, l*b*r

36 Indexing structure: Previous Approaches Inverted index with element names as entries –we discussed this Inverted index with pathnames as entries –similar idea

37 Inverted index with paths as entries Path lexicon:
/people -> {1}
/people/person -> {2,10}
/people/person/name -> {3,11}
/people/person/name/fn -> {4,12}
/people/person/name/ln -> {6,14}
/people/person/address -> {8,16}

38 Inverted index with paths as entries Advantages: –Exact paths are efficiently handled –Paths with wildcard on last element are also efficiently handled. How? Drawbacks: –Problems with prefix or infix wildcards. Examples?
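A rough sketch of a path-keyed inverted index, to make the advantages above concrete. The sample document, the helper names and the id numbering (elements and non-empty text nodes counted in document order) are assumptions chosen so that the posting lists match the path lexicon slide.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(xml_text):
    """Map each root-to-element path to the ids of the elements it reaches."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)
    counter = 0

    def visit(elem, prefix):
        nonlocal counter
        counter += 1
        path = prefix + "/" + elem.tag
        index[path].append(counter)
        if (elem.text or "").strip():
            counter += 1        # a text node uses up an id but is not indexed
        for child in elem:
            visit(child, path)

    visit(root, "")
    return dict(index)

def last_step_wildcard(index, prefix):
    """Answer prefix/* : every indexed path exactly one step below prefix."""
    hits = []
    for path, ids in index.items():
        head, _, _ = path.rpartition("/")
        if head == prefix:
            hits.extend(ids)
    return sorted(hits)

# Hypothetical document; the text values are placeholders.
DOC = ("<people>"
       "<person><name><fn>A</fn><ln>B</ln></name><address>C</address></person>"
       "<person><name><fn>D</fn><ln>E</ln></name><address>F</address></person>"
       "</people>")

path_index = build_path_index(DOC)
```

An exact path is a single dictionary lookup. A trailing wildcard only needs the entries that share the prefix; the scan here is linear for brevity, whereas a lexicon kept in sorted order would find them as one contiguous range. A leading `//` has no such common prefix, which is the drawback the slide asks about.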

39 Rotated Lexicon Technique The following patterns can be processed very efficiently, with no need for a containment join: /people/person/name // //name/fn // /people/person//* // /people//fn // // //name//* Similar patterns for * (i.e., * in the same places as //). Other patterns can be processed as a combination of them, using a containment join

40 Rotated lexicon
Original lexicon: aisle 1, appeal 2, apple 3, employ 4, staple 5
Rotated lexicon: $aisle 1, $appeal 2, $apple 3, $employ 4, $staple 5, aisle$ 1, al$appe 2, aple$st 5, appeal$ 2, apple$ 3, e$aisl 1, e$appl 3, e$stapl 5, eal$app 2, employ$ 4, isle$a 1, l$appea 2, le$ais 1, le$app 3, le$stap 5, loy$emp 4, mploy$e 4, oy$empl 4, peal$ap 2, ple$ap 3, ple$sta 5, ploy$em 4, ppeal$a 2, pple$a 3, sle$ai 1, staple$ 5, taple$s 5, y$emplo 4
Queries and their transformed forms: apple → $apple; *ple → ple$*; *pl* → pl*; app* → $app*; a*le → le$a*
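The query transformation can be sketched as a prefix search over the sorted rotations. This is a simplified illustration: it stores the rotated strings explicitly instead of (term, offset) pairs, supports only the pattern shapes listed on the slide, and the function names are assumptions.

```python
import bisect

def build_rotated_lexicon(terms):
    """Return all rotations of '$' + term, sorted, each paired with its term."""
    rotated = []
    for term in terms:
        s = "$" + term
        for k in range(len(s)):
            rotated.append((s[k:] + s[:k], term))
    rotated.sort()
    return rotated

def to_prefix(pattern):
    """Rotate the pattern so its wildcard is last; return (prefix, exact)."""
    stars = pattern.count("*")
    if stars == 0:
        return "$" + pattern, True                  # apple -> $apple
    if stars == 2 and pattern[0] == "*" and pattern[-1] == "*":
        return pattern[1:-1], False                 # *pl*  -> pl*
    if stars == 1:
        head, tail = pattern.split("*")
        return tail + "$" + head, False             # a*le  -> le$a*
    raise ValueError("pattern shape needs post-filtering: " + pattern)

def search(rotated, pattern):
    """Return the sorted terms matching the pattern."""
    prefix, exact = to_prefix(pattern)
    keys = [rotation for rotation, _ in rotated]
    i = bisect.bisect_left(keys, prefix)            # first candidate rotation
    hits = set()
    while i < len(keys) and keys[i].startswith(prefix):
        if not exact or keys[i] == prefix:
            hits.add(rotated[i][1])
        i += 1
    return sorted(hits)

LEXICON = build_rotated_lexicon(["aisle", "appeal", "apple", "employ", "staple"])
```

For instance, `search(LEXICON, "a*le")` looks up the prefix `le$a`. Patterns with several inner wildcards, such as l*b*r, fall outside this sketch.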

41 Rotated Path lexicon
Element lexicon (term → element number): people 1, person 2, name 3, fn 4, ln 5, address 6
Path lexicon (path, encoded path, path number, posting list):
/people                   /1/0         1 -> {1}
/people/person            /1/2/0       2 -> {2,10}
/people/person/name       /1/2/3/0     3 -> {3,11}
/people/person/name/fn    /1/2/3/4/0   4 -> {4,12}
/people/person/name/ln    /1/2/3/5/0   5 -> {6,14}
/people/person/address    /1/2/6/0     6 -> {8,16}
Rotated path lexicon (rotated encoded path, path number): /0/1 1, /0/1/2 2, /0/1/2/3 3, /0/1/2/3/4 4, /0/1/2/3/5 5, /0/1/2/6 6, /1/0 1, /1/2/0 2, /1/2/3/0 3, /1/2/3/4/0 4, /1/2/3/5/0 5, /1/2/6/0 6, /2/0/1 2, /2/3/0/1 3, /2/3/4/0/1 4, /2/3/5/0/1 5, /2/6/0/1 6, /3/0/1/2 3, /3/4/0/1/2 4, /3/5/0/1/2 5, /4/0/1/2/3 4, /5/0/1/2/3 5, /6/0/1/2 6
Query: //person/name// Encoded: //2/3// Transformed: /2/3//
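The same rotation trick, applied to encoded paths, might look like this in code. It is a sketch under assumptions: paths are encoded as tuples of element numbers ending in the terminator 0, as on the slide, and the helper names are hypothetical.

```python
import bisect

ELEMENTS = {"people": 1, "person": 2, "name": 3, "fn": 4, "ln": 5, "address": 6}
PATHS = ["/people", "/people/person", "/people/person/name",
         "/people/person/name/fn", "/people/person/name/ln",
         "/people/person/address"]

def encode(path):
    """'/people/person' -> (1, 2, 0); 0 marks the end of the path."""
    return tuple(ELEMENTS[step] for step in path.strip("/").split("/")) + (0,)

def build_rotated_path_lexicon(paths):
    """Return every rotation of every encoded path, sorted, with its path number."""
    rotated = []
    for number, path in enumerate(paths, start=1):
        code = encode(path)
        for k in range(len(code)):
            rotated.append((code[k:] + code[:k], number))
    rotated.sort()
    return rotated

def paths_containing(rotated, steps):
    """Answer //steps// : numbers of the paths containing the step sequence."""
    prefix = tuple(ELEMENTS[step] for step in steps)
    i = bisect.bisect_left(rotated, (prefix,))   # first rotation >= prefix
    hits = set()
    while i < len(rotated) and rotated[i][0][:len(prefix)] == prefix:
        hits.add(rotated[i][1])
        i += 1
    return sorted(hits)

ROTATED = build_rotated_path_lexicon(PATHS)
matching = paths_containing(ROTATED, ["person", "name"])
```

The matching path numbers then select posting lists from the path lexicon, so no containment join is needed for this query shape.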

42 Storage space requirements Size of posting lists is directly proportional to the number of elements in the XML database –There is one entry in one posting list for each element The size of the rotated path lexicon is equal to #PL × avg_PL_len where –#PL is the size of the path lexicon –avg_PL_len is the average pathname length The size of the path lexicon, that is, the number of different pathnames, and the average path length are typically small
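As a worked check of the formula, take the six-path people example and assume that a path's length counts its end marker, so the encoded paths have lengths 2, 3, 4, 5, 5 and 4:

```latex
\#PL = 6, \qquad
\mathit{avg\_PL\_len} = \frac{2+3+4+5+5+4}{6} = \frac{23}{6}, \qquad
\#PL \times \mathit{avg\_PL\_len} = 6 \cdot \frac{23}{6} = 23
```

Under that assumption the result agrees with the 23 entries of the rotated path lexicon for the same example.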

43 Question Can other axes be handled similarly?

44 Graph-Based Indexes: DataGuides

45 Exploiting Regularity XML documents tend to have a very repetitive structure Structure can be summarized in a (relatively) small graph, called a dataguide Nodes in a dataguide point to their corresponding node in the XML document Strategy: Evaluate query over graph. Then find corresponding nodes in document –Very efficient if dataguide fits into main memory

46 Notes In this work, we will model documents as graphs with the labels on the edges We will only consider path queries (no branching) Our XML documents can be arbitrary graphs There are many different types of indexes that exploit the same idea –this was the first (1997)

47 An Example DataGuide: Intuition How would you evaluate the queries: //Name /Restaurant/Owner
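One possible way to evaluate such queries over an in-memory dataguide is sketched below. The dataguide itself (`DG_EDGES`, `DG_TARGETS`) is a made-up example using the labels from the two queries, not the one in the figure, and the function names are assumptions.

```python
# Hypothetical dataguide: node -> {edge label: child node}.
DG_EDGES = {0: {"Restaurant": 1}, 1: {"Name": 2, "Owner": 3}, 3: {"Name": 4}}
# Hypothetical pointers from each dataguide node into the document.
DG_TARGETS = {0: {1}, 1: {2, 3}, 2: {4, 6}, 3: {5}, 4: {7}}

def eval_path(labels):
    """Evaluate /l1/l2/...: follow one edge per label from the root."""
    node = 0
    for label in labels:
        node = DG_EDGES.get(node, {}).get(label)
        if node is None:
            return set()                 # the path does not occur in the source
    return DG_TARGETS[node]

def eval_descendant(label):
    """Evaluate //label: every dataguide node entered by an edge with this label."""
    hits = set()
    for out_edges in DG_EDGES.values():
        for edge_label, child in out_edges.items():
            if edge_label == label:
                hits |= DG_TARGETS[child]
    return hits

owners = eval_path(["Restaurant", "Owner"])
names = eval_descendant("Name")
```

Both functions only walk the small summary graph; the document is touched only through the stored target sets.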

48 DataGuides: Formally Given a data source (i.e., XML document) X, a graph D is a dataguide for X if: –every path of labels appearing in X appears exactly once in D (conciseness) –every path of labels appearing in D appears at least once in X (accuracy)

49 Example Revisited Observe that every path in X also appears in D. Observe that no path (from the root) appears twice in D. [Figure: Document X and DataGuide D]

50 Is this a DataGuide? [Figure: document X and a candidate graph, both with labels A, B, C, D]

51 Is this a DataGuide? [Figure: document X and a second candidate graph, both with labels A, B, C, D]

52 Is this a DataGuide? [Figure: document X and a third candidate graph, both with labels A, B, C, D]

53 Is this a DataGuide? [Figure: document X and a fourth candidate graph, both with labels A, B, C, D]

54 Strong DataGuides: The Problem [Figure: document X with labels A, B, C, D and two candidate dataguides shown as options] What does D point to?

55 Strong DataGuide: Formally Consider source X and dataguide D Consider a path l (i.e., a sequence of labels) in X –Let T_X(l) be the set of all nodes reached by the path l The path l also appears in D and leads to a single node –Let T_D(l) be the set containing this single node Let L_X(l) be the set of all label paths in X that lead to the set T_X(l). Similarly, we define L_D(l) If L_X(l) = L_D(l) for all paths l, then D is a strong dataguide

56 Strong DataGuides In the source, T_X(B.C) = {6, 7} and L_X(B.C) = {B.C}. In the DataGuide, T_D(B.C) = {20} and L_D(B.C) = {B.C, A.C}. L_X(B.C) ≠ L_D(B.C), so DataGuide (c) is not strong.

57 Creating a Strong Dataguide Strong dataguides can be used as indexes since they are unambiguous How big might a strong dataguide be? Can it be created efficiently? –In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one –If XML is a tree, can be created in linear time

58 MakeDataGuide(o) {
    dg = NewObject()
    targetHash.Insert({o}, dg)
    RecursiveMake({o}, dg)
}
RecursiveMake(t1, d1) {
    p = set of (label, child) pairs of each object in t1
    foreach (unique label l in p) {
        t2 = set of node-ids paired with l in p
        d2 = targetHash.Lookup(t2)
        if (d2 != nil) {
            add an edge from d1 to d2 with label l
        } else {
            d2 = NewObject()
            targetHash.Insert(t2, d2)
            add an edge from d1 to d2 with label l
            RecursiveMake(t2, d2)
        }
    }
}
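A runnable sketch of the construction above, assuming the source is given as a dictionary from node id to a list of (label, child) pairs. The representation, the function name `make_dataguide` and the small source graph are illustrative assumptions.

```python
def make_dataguide(source, root):
    """Build a strong dataguide for a labelled graph.

    source: dict node -> list of (label, child) pairs.
    Returns (edges, targets): edges[d] maps a label to a dataguide node,
    and targets[d] is the set of source nodes that d stands for.
    """
    target_hash = {}      # frozenset of source nodes -> dataguide node
    edges = {}
    targets = []          # list index is the dataguide node id

    def new_object(target_set):
        d = len(targets)
        targets.append(target_set)
        edges[d] = {}
        target_hash[target_set] = d
        return d

    def recursive_make(t1, d1):
        by_label = {}     # label -> set of children reached from t1
        for node in t1:
            for label, child in source.get(node, []):
                by_label.setdefault(label, set()).add(child)
        for label in sorted(by_label):
            t2 = frozenset(by_label[label])
            d2 = target_hash.get(t2)
            if d2 is None:
                d2 = new_object(t2)
                edges[d1][label] = d2
                recursive_make(t2, d2)   # first time this target set is seen
            else:
                edges[d1][label] = d2    # reuse the node; stops on cycles

    start = frozenset([root])
    recursive_make(start, new_object(start))
    return edges, targets

# Hypothetical source: node 1 has two A-children with different subtrees.
SOURCE = {1: [("A", 2), ("A", 3)], 2: [("B", 4)], 3: [("B", 5), ("C", 6)]}
dg_edges, dg_targets = make_dataguide(SOURCE, 1)
```

Hashing on target sets is the subset construction used to determinize an automaton, which is where the exponential worst case for graph-shaped sources comes from.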

59 Can you create a Strong DataGuide? Intuition: If the sets of nodes which are reachable for simple paths are equal, then the simple paths are represented as a single node. Compute on blackboard. [Figure: a source graph with edge labels A, B, C and its strong DataGuide, whose nodes correspond to the source node sets {1}, {2,4}, {3,5} and {6}]

60 Summary Advantages: –if dataguide can fit in memory, evaluation can be performed efficiently for path queries Disadvantages: –May be large (why is this worse here than for the rotated lexicon?) –Only good for simple queries. Which axes?