Published by Jocelin Caldwell. Modified over 8 years ago.
1
1 Native Databases for XML
2
2 Basic Idea
Store XML as a tree.
Main challenge: make querying efficient (recall the difficulties when storing XML as a file)
– appropriate indexing
– efficient query processing
Several native XML database systems have been developed:
– TIMBER (University of Michigan)
– ToX (University of Toronto)
– etc.
3
3 Storing XML in Files: Natix
[Figure: a tree with nodes bib, book, title, and author, split across storage blocks, with a pointer to the block containing a child.]
Subtrees are stored in blocks. When a block is full, another block is used.
4
4 Indexing In order to do efficient query processing, indexes are used Reminder: An index is a structure that “points” directly to nodes satisfying a given constraint More indexes usually allow query processing to be more efficient, but also take up more space (time/space tradeoff)
5
5 Indexing Strategy
We will discuss 3 different indexing strategies and the query processing problems that come with each:
– Element and value inverted lists
– Rotated paths
– Graph-based indexes
6
6 Element and Value Inverted Lists
7
7 Basic Indexes
At minimum, the following indexes are usually stored:
– Value indexes: for each value appearing in the tree, there is a list of nodes containing the value
– Element indexes: for each element name appearing in the tree, there is a list of nodes with the corresponding element
Sometimes also structure indexes: for certain XPath expressions, there is a list of nodes that satisfy the expression
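The value and element indexes described above can be sketched as two inverted lists built in one pass over the nodes. This is only an illustration: the node encoding, the ids, and the name build_indexes are assumptions made for the sketch and are not taken from the slides' figure.

```python
from collections import defaultdict

# Hypothetical node encoding: (node_id, kind, label, parent_id), where kind is
# "elem" for element nodes and "text" for value nodes. The ids are illustrative.
NODES = [
    (1, "elem", "transaction", None),
    (2, "elem", "buy", 1),
    (3, "elem", "exch", 2),
    (4, "text", "NYSE", 3),
    (5, "elem", "sell", 1),
    (6, "elem", "exch", 5),
    (7, "text", "NYSE", 6),
]


def build_indexes(nodes):
    """Return (element_index, value_index), each mapping a label to node ids."""
    element_index = defaultdict(list)
    value_index = defaultdict(list)
    for node_id, kind, label, _parent in nodes:
        if kind == "elem":
            element_index[label].append(node_id)
        else:
            value_index[label].append(node_id)
    return dict(element_index), dict(value_index)


element_index, value_index = build_indexes(NODES)
```

Because the nodes are visited in id order, each list comes out sorted, which is what the merge-based query processing later in the deck relies on.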
8
8 Example: Value Indexes
[Figure: the example tree with nodes numbered 1 to 17: a transaction document with account, buy, sell, ticker, shares, and exch elements and the values 89-344, 100, WEBM, 30, NYSE, and GE.]
Value index: WEBM -> 10; NYSE -> 16, 9
9
9 Example: Element Indexes
[Figure: the same example tree.]
Element index: buy -> 4; exch -> 15, 8
10
10 Example: Structure Indexes
[Figure: the same example tree.]
Structure index: //buy//exch -> 8
11
11 Query Processing Suppose that we only have value indexes and element indexes How should we process the query: //buy//exch ? –Strategy 1: Find buy elements. Then traverse the subtree of these elements to look for exch elements –Strategy 2: Find exch elements. Then traverse the ancestors of these elements to look for buy elements Which is a better strategy?
12
12 //buy//exch: Strategy 1
[Figure: the example tree, nodes numbered 1 to 17.] Element index: buy -> 4; exch -> 15, 8
13
13 //buy//exch: Strategy 2
[Figure: the example tree, nodes numbered 1 to 17.] Element index: buy -> 4; exch -> 15, 8
14
14 Improving the Execution
Instead of storing a running id for each element, store a triple: (start, end, level)
– Find buy elements
– Find exch elements
– Merge these two lists by finding exch elements that are nested within buy elements
Level is used in case we are interested in finding children, not descendants
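A minimal sketch of the (start, end, level) idea follows. The numbering scheme here (one counter tick on entering an element and one on leaving it, elements only) is an assumption chosen for simplicity; the slides' figure numbers its nodes differently. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def label_regions(root):
    """Assign (tag, start, end, level) to every element in one depth-first pass."""
    labels = []
    counter = 0

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in elem:
            visit(child, level + 1)
        counter += 1
        labels.append((elem.tag, start, counter, level))

    visit(root, 1)
    return sorted(labels, key=lambda t: t[1])


def is_descendant(d, a):
    """d is nested inside a if a's interval strictly contains d's interval."""
    return a[1] < d[1] and d[2] < a[2]


def is_child(d, a):
    """A child is a descendant exactly one level below."""
    return is_descendant(d, a) and d[3] == a[3] + 1


root = ET.fromstring("<t><buy><exch/></buy><sell><exch/></sell></t>")
labels = label_regions(root)
t, buy, exch1, sell, exch2 = labels
```

With these labels, an ancestor test needs only two integer comparisons and no tree traversal, and the level field distinguishes the child axis from the descendant axis.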
15
15 //buy//exch: Improved
Each entry is a triple (start, end, level):
buy: (4,10,2)
exch: (15,17,4), (8,9,4)
Merge the 2 lists by finding descendant elements
What does this remind you of?
16
16 Merging Lists
What is the complexity of merging the lists?
Is it enough to go through each list once?
– Assuming the lists are sorted by start?
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
[Figure: a small tree with two a nodes and three b nodes.]
17
17 Merging Lists: Example
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
a list: (1,7,1), (3,6,2)
b list: (2,2,2), (4,4,3), (5,5,3)
[Figure: a tree in which a (1,7,1) has children b (2,2,2) and a (3,6,2), and a (3,6,2) has children b (4,4,3) and b (5,5,3).]
Where should we go on the b list?
18
18 Merging Lists: Example
Example: Suppose we want to find all pairs of a and b such that b is a descendant of a
[Figure: the same lists and tree, at the next step of the merge.]
19
19 Merging Lists: Example
We did extra work
Need a method to find the correct place to start in the b list
[Figure: the same lists and tree.]
20
20 Minimizing the Work
Several algorithms have been defined to minimize the amount of work required, by identifying exactly where to restart. See:
– Shu-Yao Chien, Zografoula Vagena, Donghui Zhang, Vassilis J. Tsotras, Carlo Zaniolo, “Efficient Structural Joins on Indexed XML Documents”, Proc. of VLDB 2002
– Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Divesh Srivastava, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, ICDE 2002
– Nicolas Bruno, Nick Koudas, Divesh Srivastava, “Holistic Twig Joins: Optimal XML Pattern Matching”, ACM SIGMOD 2002
21
21 Tree Patterns Can Be Computed From Structural Relationships
[Figure: a tree pattern over book, title, author, XML, and jane, drawn with descendant edges and child edges, and its decomposition into single-edge relationships.]
The algorithm we present only computes a single-edge query. Results can be combined.
22
22 Stack-Tree Algorithms: Intuition
A depth-first traversal of a tree can be performed in linear time, using a stack as large as the height of the tree.
An ancestor-descendant structural relationship is manifested as the ancestor appearing higher on the stack than the descendant.
Unfortunately, a depth-first traversal requires visiting the entire tree.
– DON’T GO OVER THE TREE!! ONLY THE INDEX
23
23 Stack-Tree Algorithms
We will study the algorithm
– Stack-Tree-Desc, which returns the result ordered by (desc-start, anc-start)
The paper also discusses the algorithm
– Stack-Tree-Anc, which returns the result ordered by (anc-start, desc-start)
Why is the ordering of the result of interest?
24
24 Stack-Tree-Desc
a = Alist->first node; d = Dlist->first node; OutputList = NULL;
// no further pairs can be produced once Dlist is exhausted
while (Dlist is not exhausted) {
  if (Alist is not exhausted and a.startPos < d.startPos) then e = a; else e = d;
  // discard stacked ancestors whose region ends before e starts
  while (stack is not empty and e.startPos > stack.Top().endPos) stack.Pop();
  if (e == a) {
    stack.Push(a); a = a->nextNode;
  } else {
    for each a’ in stack do append (a’, d) to OutputList;
    d = d->nextNode;
  }
}
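The pseudocode above can be made runnable along the following lines. This is a sketch under stated assumptions: nodes are represented as (start, end) tuples, both lists are already sorted by start, and the function name stack_tree_desc is made up for the illustration. The sample lists are the a and b lists from the earlier merging example, with the level dropped.

```python
def stack_tree_desc(alist, dlist):
    """Return (ancestor, descendant) pairs ordered by descendant start.

    alist and dlist hold (start, end) tuples sorted by start.
    """
    output = []
    stack = []  # ancestors whose regions are still open, outermost first
    i = j = 0
    # Once dlist is exhausted, no further pairs can be produced.
    while j < len(dlist):
        if i < len(alist) and alist[i][0] < dlist[j][0]:
            e, from_alist = alist[i], True
        else:
            e, from_alist = dlist[j], False
        # Pop ancestors whose region ended before e starts.
        while stack and stack[-1][1] < e[0]:
            stack.pop()
        if from_alist:
            stack.append(e)
            i += 1
        else:
            # Every node still on the stack contains e.
            output.extend((a, e) for a in stack)
            j += 1
    return output


a_list = [(1, 7), (3, 6)]
b_list = [(2, 2), (4, 4), (5, 5)]
pairs = stack_tree_desc(a_list, b_list)
```

Each list is read once and nothing is rescanned, which is the point of keeping the open ancestors on a stack instead of restarting the scan of the descendant list.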
25
25 Stack-Tree-Desc: //section//paragraph
[Figure: an article tree containing nested section elements and paragraph elements with text (Bla, ...).]
26
26 Stack-Tree-Desc: //section//paragraph
[Figure: the same tree with the section nodes highlighted; these form Alist.]
27
27 Stack-Tree-Desc: //section//paragraph
[Figure: the same tree with the paragraph nodes highlighted; these form Dlist.]
28
28 Stack-Tree-Desc: //section//paragraph
[Figure: the same tree with the sections labeled a1 to a3 and the paragraphs labeled d1 to d7.]
29
29 Stack-Tree-Desc: //section//paragraph
[Figure: the same labeled tree.]
Alist (section): a1, a2, a3
Dlist (paragraph): d1 to d7
Note: These lists are not created at the beginning of the algorithm. They are already available!
30
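30 Stack-Tree-Desc
Nodes in order of start position: a1, d1, a2, d2, a3, d3, d4, d5, d6, d7
Trace (node read: stack after the step; output):
a1: stack a1
d1: stack a1; output (a1,d1)
a2: stack a1, a2
d2: stack a1, a2; output (a1,d2),(a2,d2)
a3: stack a1, a2, a3
d3: stack a1, a2, a3; output (a1,d3),(a2,d3),(a3,d3)
d4: stack a1, a2, a3; output (a1,d4),(a2,d4),(a3,d4)
d5: stack a1, a2 (a3 popped); output (a1,d5),(a2,d5)
d6: stack a1 (a2 popped); output (a1,d6)
d7: stack empty (a1 popped); no output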
31
31 Analysis of Stack-Tree-Desc
O(|Alist| + |Dlist| + |OutputList|) for ancestor-descendant structural relationships.
– Each Alist element is pushed once and popped once, so stack operations take O(|Alist|).
– The inner “for loop” outputs a new pair each time, so its total time is O(|OutputList|).
32
32 Questions and Disadvantages Can a similar algorithm be used to compute other axes? –e.g., child, following Main Disadvantage: Each step of the path expression is computed separately –may find many intermediate results that will be discarded
33
33 Rotated Paths
“YAPI: Yet Another Path Index for XML Searching”, Giuseppe Amato, Franca Debole, Fausto Rabitti, Pavel Zezula
34
34 Remember This?
Lexicon:
Term Number   Term
1             abhor
2             bear
4             labor
6             labour
Rotated lexicon (excerpt):
Rotated Form   Address
$abhor         (1,0)
$bear          (2,0)
$labor         (4,0)
$labour        (6,0)
abhor$         (1,1)
abor$l         (4,2)
abour$l        (6,2)
r$abho         (1,5)
r$bea          (2,4)
r$labo         (4,5)
r$labou        (6,6)
Note: We do not actually store the rotated string in the rotated lexicon. The pair of numbers (term number, rotation offset) is enough for binary search
35
35 Remember This?
Using the rotated lexicon from the previous slide, how would you find the terms for:
– lab*
– *or
– *ab*
– l*r
– l*b*r
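One way to answer most of these questions is to turn each wildcard pattern into a prefix lookup on the sorted rotations. The sketch below is a simplification made for readability: it stores the rotated strings explicitly, whereas the slide notes that the (term number, offset) pair is enough. The function names are hypothetical, and l*b*r is left out because it needs an extra filtering step after the lookup.

```python
import bisect

TERMS = {1: "abhor", 2: "bear", 4: "labor", 6: "labour"}


def build_rotated_lexicon(terms):
    """Return a sorted list of (rotation of "$term", term number)."""
    entries = []
    for number, term in terms.items():
        marked = "$" + term
        for k in range(len(marked)):
            entries.append((marked[k:] + marked[:k], number))
    entries.sort()
    return entries


def prefix_search(entries, prefix):
    """Binary-search the first rotation >= prefix, then scan the matching run."""
    keys = [key for key, _ in entries]
    i = bisect.bisect_left(keys, prefix)
    hits = set()
    while i < len(keys) and keys[i].startswith(prefix):
        hits.add(entries[i][1])
        i += 1
    return hits


def wildcard_terms(entries, terms, pattern):
    """Support X*Y (including X* and *Y) and *X*; other shapes are rejected."""
    parts = pattern.split("*")
    if len(parts) == 2:
        # X*Y matches rotations that start with Y$X.
        prefix = parts[1] + "$" + parts[0]
    elif len(parts) == 3 and parts[0] == "" and parts[2] == "":
        # *X* matches rotations that start with X.
        prefix = parts[1]
    else:
        raise ValueError("pattern shape not handled in this sketch")
    return sorted(terms[n] for n in prefix_search(entries, prefix))


LEXICON = build_rotated_lexicon(TERMS)
```

For example, wildcard_terms(LEXICON, TERMS, "l*r") looks up the prefix r$l, which is where the rotations r$labo and r$labou sit next to each other.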
36
36 Indexing structure: Previous Approaches Inverted index with element names as entries –we discussed this Inverted index with pathnames as entries –similar idea
37
37 Inverted index with paths as entries
Path lexicon:
/people->{1}
/people/person->{2,10}
/people/person/name->{3,11}
/people/person/name/fn->{4,12}
/people/person/name/ln->{6,14}
/people/person/address->{8,16}
38
38 Inverted index with paths as entries Advantages: –Exact paths are efficiently handled –Paths with wildcard on last element are also efficiently handled. How? Drawbacks: –Problems with prefix or infix wildcards. Examples?
39
39 Rotated Lexicon
The technique can process the following patterns very efficiently, with no need for a containment join:
– /people/person/name
– //name/fn
– /people/person//*
– /people//fn
– //name//*
Similar patterns for * (i.e., * in the same places as //)
Other patterns can be processed as combinations of these, using a containment join
40
40 Rotated lexicon
Original lexicon: aisle 1, appeal 2, apple 3, employ 4, staple 5
Rotated lexicon (rotation, term number):
$aisle 1, $appeal 2, $apple 3, $employ 4, $staple 5, aisle$ 1, al$appe 2, aple$st 5, appeal$ 2, apple$ 3, e$aisl 1, e$appl 3, e$stapl 5, eal$app 2, employ$ 4, isle$a 1, l$appea 2, le$ais 1, le$app 3, le$stap 5, loy$emp 4, mploy$e 4, oy$empl 4, peal$ap 2, ple$ap 3, ple$sta 5, ploy$em 4, ppeal$a 2, pple$a 3, sle$ai 1, staple$ 5, taple$s 5, y$emplo 4
Queries and their transformed forms:
apple -> $apple
*ple -> ple$*
*pl* -> pl*
app* -> $app*
a*le -> le$a*
41
41 Rotated Path lexicon
Element lexicon (element, term number): people 1, person 2, name 3, fn 4, ln 5, address 6
Path lexicon (path, encoded path, path id -> posting list):
/people                  /1/0        1->{1}
/people/person           /1/2/0      2->{2,10}
/people/person/name      /1/2/3/0    3->{3,11}
/people/person/name/fn   /1/2/3/4/0  4->{4,12}
/people/person/name/ln   /1/2/3/5/0  5->{6,14}
/people/person/address   /1/2/6/0    6->{8,16}
Rotated path lexicon (rotated encoded path, path id):
/0/1 1
/0/1/2 2
/0/1/2/3 3
/0/1/2/3/4 4
/0/1/2/3/5 5
/0/1/2/6 6
/1/0 1
/1/2/0 2
/1/2/3/0 3
/1/2/3/4/0 4
/1/2/3/5/0 5
/1/2/6/0 6
/2/0/1 2
/2/3/0/1 3
/2/3/4/0/1 4
/2/3/5/0/1 5
/2/6/0/1 6
/3/0/1/2 3
/3/4/0/1/2 4
/3/5/0/1/2 5
/4/0/1/2/3 4
/5/0/1/2/3 5
/6/0/1/2 6
Query: //person/name//
Encoded: //2/3//
Transformed: /2/3//
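The construction on this slide can be sketched as follows, treating each encoded path as a tuple of element numbers terminated by 0 and storing every rotation in sorted order. This is an illustration only, not the YAPI implementation: the function names are assumptions, and the lookups return path ids, which would then be used to fetch the posting lists.

```python
import bisect

ELEMENTS = {"people": 1, "person": 2, "name": 3, "fn": 4, "ln": 5, "address": 6}
PATHS = [  # path ids 1..6 follow list order, as on the slide
    "/people",
    "/people/person",
    "/people/person/name",
    "/people/person/name/fn",
    "/people/person/name/ln",
    "/people/person/address",
]


def encode(path):
    """Map a slash-separated path to a tuple of element numbers."""
    return tuple(ELEMENTS[name] for name in path.strip("/").split("/"))


def build_rotated(paths):
    """Return a sorted list of (rotation of encoded path + terminator 0, path id)."""
    rotated = []
    for path_id, path in enumerate(paths, start=1):
        code = encode(path) + (0,)
        for k in range(len(code)):
            rotated.append((code[k:] + code[:k], path_id))
    rotated.sort()
    return rotated


def paths_with_prefix(rotated, prefix):
    """Binary-search the first rotation >= prefix, then scan the matching run."""
    keys = [key for key, _ in rotated]
    i = bisect.bisect_left(keys, prefix)
    ids = set()
    while i < len(keys) and keys[i][:len(prefix)] == prefix:
        ids.add(rotated[i][1])
        i += 1
    return sorted(ids)


ROTATED = build_rotated(PATHS)
# //person/name// becomes the prefix /2/3
infix_ids = paths_with_prefix(ROTATED, encode("person/name"))
# /people//fn becomes the prefix /4/0/1 (tail, terminator, head)
split_ids = paths_with_prefix(ROTATED, encode("fn") + (0,) + encode("people"))
```

The terminator 0 plays the role that $ plays in the rotated word lexicon: it marks where the path wraps around, so a pattern with one gap becomes a single contiguous prefix.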
42
42 Storage space requirements
The size of the posting lists is directly proportional to the number of elements in the XML database
– There is one entry in one posting list for each element
The size of the rotated path lexicon is equal to #PL × avg_PL_len, where
– #PL is the size of the path lexicon
– avg_PL_len is the average pathname length
The size of the path lexicon (that is, the number of different pathnames) and the average path length are typically small
43
43 Question Can other axes be handled similarly?
44
44 Graph-Based Indexes: DataGuides
45
45 Exploiting Regularity XML documents tend to have a very repetitive structure Structure can be summarized in a (relatively) small graph, called a dataguide Nodes in a dataguide point to their corresponding node in the XML document Strategy: Evaluate query over graph. Then find corresponding nodes in document –Very efficient if dataguide fits into main memory
46
46 Notes In this work, we will model documents as graphs with the labels on the edges We will only consider path queries (no branching) Our XML documents can be arbitrary graphs There are many different types of indexes that exploit the same idea –this was the first (1997)
47
47 An Example DataGuide: Intuition How would you evaluate the queries: //Name /Restaurant/Owner
48
48 DataGuides: Formally Given a data source (i.e., XML document) X, a graph D is a dataguide for X if: –every path of labels appearing in X appears exactly once in D (conciseness) –every path of labels appearing in D appears at least once in X (accuracy)
49
49 Example Revisited
Observe that every path in X also appears in D
Observe that no path (from the root) appears twice in D
[Figure: the document X beside its DataGuide D.]
50
50 Is this a DataGuide?
[Figure: document X, a graph with edges labeled A, B, C, and D, beside a first candidate graph with edges labeled A, B, C, and D.]
51
51 Is this a DataGuide?
[Figure: document X beside a second candidate graph, which has the same edge labels as X.]
52
52 Is this a DataGuide?
[Figure: document X beside a third candidate graph, in which one B edge is replaced by a C edge.]
53
53 Is this a DataGuide?
[Figure: document X beside a fourth, smaller candidate graph with edges labeled A, B, C, and D.]
54
54 Strong DataGuides: The Problem
[Figure: document X, a graph with edges labeled A, B, C, and D, beside two candidate DataGuides, Option 1 and Option 2.]
What does D point to?
55
55 Strong DataGuide: Formally
Consider source X and dataguide D
Consider a path l (i.e., a sequence of labels) in X
– Let T_X(l) be the set of all nodes reached by the path l
The path l also appears in D and leads to a single node
– Let T_D(l) be the set containing this single node
Let L_X(l) be the set of all label paths in X that lead to exactly the set T_X(l). Similarly, we define L_D(l)
If L_X(l) = L_D(l) for all paths l, then D is a strong dataguide
56
56 Strong DataGuides
In the source, T_X(B.C) = {6, 7} and L_X(B.C) = {B.C}.
In the DataGuide, T_D(B.C) = {20} and L_D(B.C) = {B.C, A.C}.
L_X(B.C) ≠ L_D(B.C), so DataGuide (c) is not strong.
57
57 Creating a Strong Dataguide Strong dataguides can be used as indexes since they are unambiguous How big might a strong dataguide be? Can it be created efficiently? –In general, exponential time. Requires turning a nondeterministic automaton into a deterministic one –If XML is a tree, can be created in linear time
58
58 MakeDataGuide
// o is the root of the source; targetHash maps sets of source nodes to dataguide objects
MakeDataGuide(o) {
  dg = NewObject()
  targetHash.Insert({o}, dg)
  RecursiveMake({o}, dg)
}
RecursiveMake(t1, d1) {
  p = set of (label, child) pairs of each object in t1
  foreach (unique label l in p) {
    t2 = set of node-ids paired with l in p
    d2 = targetHash.Lookup(t2)
    if (d2 != nil) {
      add an edge from d1 to d2 with label l
    } else {
      d2 = NewObject()
      targetHash.Insert(t2, d2)
      add an edge from d1 to d2 with label l
      RecursiveMake(t2, d2)
    }
  }
}
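A runnable sketch of the construction above follows. It assumes the source is given as a dictionary from node id to a list of (label, child id) pairs; that representation, the small sample graph, and the names make_dataguide and new_object are illustrative choices, not part of the slides.

```python
def make_dataguide(source, root):
    """Build a strong DataGuide by grouping source nodes into target sets.

    Returns (edges, target_hash): edges maps a DataGuide node to
    {label: DataGuide node}; target_hash maps each frozenset of source
    node ids to its DataGuide node.
    """
    target_hash = {}
    edges = {}

    def new_object(targets):
        node = len(target_hash)
        target_hash[targets] = node
        edges[node] = {}
        return node

    def recursive_make(t1, d1):
        # Group the children of every source node in t1 by edge label.
        by_label = {}
        for oid in t1:
            for label, child in source.get(oid, []):
                by_label.setdefault(label, set()).add(child)
        for label in sorted(by_label):
            t2 = frozenset(by_label[label])
            d2 = target_hash.get(t2)
            if d2 is None:
                # First time this target set is seen: create a node and recurse.
                d2 = new_object(t2)
                edges[d1][label] = d2
                recursive_make(t2, d2)
            else:
                # Target set already has a node: just link to it.
                edges[d1][label] = d2

    start = frozenset([root])
    recursive_make(start, new_object(start))
    return edges, target_hash


# Hypothetical source graph: node id -> [(label, child id), ...]
SOURCE = {
    1: [("A", 2), ("A", 4), ("B", 6)],
    2: [("C", 3)],
    4: [("C", 5)],
    6: [("C", 5)],
}
edges, target_hash = make_dataguide(SOURCE, 1)
```

The hash lookup on target sets is what keeps the construction finite on cyclic sources, and it is also the step that corresponds to turning a nondeterministic automaton into a deterministic one.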
59
59 Can you create a Strong DataGuide?
Intuition: If the sets of nodes that are reachable by simple paths are equal, then the simple paths are represented as a single node.
Compute on blackboard
[Figure: a source graph with nodes 1 to 6 and edges labeled A, B, and C, beside its strong DataGuide, whose nodes are annotated with the source node sets 1; 2,4; 3,5; 6; and 5.]
60
60 Summary Advantages: –if dataguide can fit in memory, evaluation can be performed efficiently for path queries Disadvantages: –May be large (why is this worse here than for the rotated lexicon?) –Only good for simple queries. Which axes?