Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University.

Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline Motivation Algorithm for unordered tree pattern query evaluation -Tree encoding -Algorithm description On the strict unordered tree matching Experiment results Summary

Efficient method to evaluate XPath expression queries – XML query processing XML documentsa tree pattern query Motivation

Document: dell IBM part#1 Intel Part#2 Houston Winnipeg Y-Chen P SB IILLNN I M MNN IBMPart#1Part#2 Dell HoustonWinnipegY-Chen Intel Motivation

Document:Query – XPath expressions: Q1: /Purchase[Seller[Loc=‘Boston’]]/ Buyer[Loc = ‘New York’ Q2: /Purchase//Item[Manufacturer = ‘Intel’] Purchase Seller Buyer Location ‘Houston’ ‘Winnipeg’ Buyer Item Manufacturer ‘Intel’ d-edge: ancestor- descendant relationship c-edge: parent-child relationship P SB IILLNN I M MNN IBMPart#1Part#2 Dell HoustonWinnipegY-Chen Intel

Motivation - XPath expression a[b[c and.//d]]/b[c and e//d] book[title = ‘Art of Programming’]//author[fn = ‘Donald’ and ln = ‘Knuth’] a bb cdce d title Art of Programming book author fnln Knuth Donald Art of Programming Donald Knuth … XPath evaluation against XML documents

- Evaluation based on unordered tree matching: Definition An embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i)Preserve node label: For each u  Q, label(u) matches label(f(u)). (ii)Preserve parent-child/ancestor-descendant relationships: If u  v in Q, then f(v) is a child of f(u) in T; if u  v in Q, then f(v) is a descendant of f(u) in T. Motivation XPath evaluation against XML documents a b c Q:Q: q3q3 q1q1 q2q2 a b c cbb T:T: v1v1 v2v2 v3v3 v4v4 v5v5 v6v6

- Evaluation based on unordered tree matching: Definition 1 An embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i)Preserve node label: For each u  Q, label(u) matches label(f(u)). (ii)Preserve parent-child/ancestor-descendant relationships: If u  v in Q, then f(v) is a child of f(u) in T; if u  v in Q, then f(v) is a descendant of f(u) in T. Motivation XPath evaluation against XML documents a b c Q:Q: q3q3 q1q1 q2q2 a b c cbb T:T: v1v1 v2v2 v3v3 v4v4 v5v5 v6v6

Algorithm for query evaluation Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document. (1, 1, 11, 1) (1, 10, 10, 2) B v 8 A v 1 (1, 7, 7, 4) (1, 6, 6, 4) T:T: (1, 5, 5, 4) (1, 3, 3, 3) B v 2 v 3 C B v 4 D v 7 v 5 C (1, 4, 8, 3) (1, 2, 9, 2) v 6 C 1 2 3 4 3 5 6 5 6 7 7 8 9 10 11

Tree encoding Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted as  (v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document. (i)ancestor-descendant: a node v 1 associated with (d 1, l 1, r 1, ln 1 ) is an ancestor of another node v 2 with (d 2, l 2, r 2, ln 2 ) iff d 1 = d 2, l 1 r 2. (ii)parent-child: a node v 1 associated with (d 1, l 1, r 1, ln 1 ) is the parent of another node v 2 with (d 2, l 2, r 2, ln 2 ) iff d 1 = d 2, l 1 r 2, and ln 2 = ln 1 + 1. (iii)from left to right: a node v 1 associated with (d 1, l 1, r 1, ln 1 ) is to the left of another node v 2 with (d 2, l 2, r 2, ln 2 ) iff d 1 = d 2, r 1 < l 2.

A: (1, 1, 11, 1) B: (1, 2, 9, 2) (1, 4, 8, 3), (1, 10, 10, 2) C: (1, 3, 3, 3) (1, 5, 5, 4), (1, 6, 6, 4) D: (1, 7, 7, 4) Data streams: Algorithm for query evaluation Tree encoding (1, 1, 11, 1) (1, 10, 10, 2) B v 8 A v 1 (1, 7, 7, 4) (1, 6, 6, 4) T:T: (1, 5, 5, 4) (1, 3, 3, 3) B v 2 v 3 C B v 4 D v 7 v 5 C (1, 4, 8, 3) (1, 2, 9, 2) v 6 C sorted by LeftPos values 1 2 3 4 3 5 6 5 6 7 7 8 9 10 11

Algorithm for query evaluation Algorithm description Our algorithm works bottom-up. Therefore, we need to sort XML streams by (DocID, RightPos) values. Each time a query Q is submitted to the system, we will associate each query node q with a data stream L(q) such that for each v  L(q) label(v) = label(q), in which each query node is attached with a list of matching nodes of the document tree. {v1}{v1} A q 1 q 2 B B q 5 q 3 CC q 4 L(q1)L(q1) Q:Q: L(q2)L(q2)L(q5)L(q5) L(q4)L(q4) L(q3)L(q3) L(q 1 ) = (1, 1, 11, 1) - L(q 2 ) = L(q 5 ) = (1, 4, 8, 3), (1, 2, 9, 2) (1, 10, 10, 2) - L(q 3 ) = L(q 4 ) = (1, 3, 3, 3) (1, 5, 5, 4), (1, 6, 6, 4) - { v 4, v 2, v 8 } { v 3, v 5, v 6 } sorted by RightPos values T:T: Q:Q:

Algorithm for query evaluation Algorithm description 1.S(v) – a set of query nodes q associated with document tree node v such that Q[q] can be embedded in T[v]. 2.(q) – a variable associated with query node q. During the process,  (q) will be dynamically assigned a series of values a 0, a 1,..., a m for some m in sequence, where a 0 =  and a i ’s (i = 1,..., m) are different nodes of T. Initially,  (q) is set to a 0 = .  (q) will be changed from a i-1 to a i = v (i = 1,..., m) when the following conditions are satisfied. i)v is the node currently encountered. ii)q appears in S(u) for some child node u of v. iii)q is a d-child, or q is a c-child, and u is a c-child with label(u) = label(q). v S(v)S(v) T:T: q (q)(q) Q:Q:

Algorithm for query evaluation Algorithm description i)v is the node currently encountered. ii)q appears in S(u) for some child node u of v. iii)q is a d-child, or q is a c-child, and u is a c-child with label(u) = label(q). v u S(u) = {…, q, …} q’ q v u S(u) = {…, q, …} q’ q label(u) = label(q). u must be a direct child of v. (q) is changed from a i-1 to a i = v d-child c-child

v (q 1 ) = (q 2 ) = … = (q l ) = v q’ q1q1 Algorithm for query evaluation Algorithm description 3.Subtree embedding by using (q) qlql label(v) = label(q’} T[v] embeds Q[q’].

Algorithm for query evaluation Algorithm description 4.Construction of S(v) 5 3 A q 1 B q 5 C q 4 q 2 B q 3 C Q : 1 2 4 The nodes of Q are numbered in postorder. So the nodes in Q will be referenced by their postorder numbers. For a leaf node v in T, any encountered leaf node q in Q will be inserted into S(v) if label(u) = label(q). (Since Q is explored bottom-up, the query nodes in S(v) must be increasingly sorted.) For a non-leaf node v with children v 1, …, v k, S(v)  S(v 1 )  …  S(v k )  {all nodes q with T[v] embedding Q[q]}. Since S(v) is sorted, ‘union’ can be implemented using the ‘merge’ operation.

Strict unordered tree matching Definition Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i)same as (i) in Definition 1. (ii)same as (ii) in Definition 1. (iii)For any two nodes v1  Q and v2  Q, if v 1 and v 2 are not related by an ancestor-descendant relationship, then f(v 1 ) and f(v 2 ) in T are not related by an ancestor-descendant relationship. a b c Q:Q: q3q3 q1q1 q2q2 a b c cbb T:T: v1v1 v2v2 v3v3 v4v4 v5v5 v6v6

Strict unordered tree matching Definition Definition 2 A strict embedding of a tree pattern Q into an XML document T is a mapping f: Q  T, from the nodes of Q to the nodes of T, which satisfies the following conditions: (i)same as (i) in Definition 1. (ii)same as (ii) in Definition 1. (iii)For any two nodes v1  Q and v2  Q, if v 1 and v 2 are not related by an ancestor-descendant relationship, then f(v 1 ) and f(v 2 ) in T are not related by an ancestor-descendant relationship. a b c Q:Q: q3q3 q1q1 q2q2 a b c cbb T:T: v1v1 v2v2 v3v3 v4v4 v5v5 v6v6 By Definition 2, this mapping is not allowed.

Strict unordered tree matching Strict unordered tree matching needs the exponential time. Q:Q: a T:T: b dce a x dybce … O(|T||Q|d k ) d – the largest out-degree of the nodes in T. k – the largest out-degree of the nodes in Q. Using the concept of hypergraphs, the time complexity can be slightly improved to O(|T||Q|2 k ).

Algorithm for query evaluation Experiments We conducted our experiments on a DELL desktop PC equipped with Pentium(R) 4 CPU 2.80GHz, 0.99GB RAM and 20GB hard disk. The code was compiled using Microsoft Visual C++ compiler version 6.0, running standalone. Tested methods In the experiments, we have tested four methods: -TwigStack (TS for short), -Twig 2 Stack (T 2 S for short), -Twig-List (TL for short), -One-Phase Holistic (OPH for short), -tree-embedding (discussed in this paper, TE for short).

Tested methods In the experiments, we have tested five methods: -TwigStack (TS for short) [1], -Twig 2 Stack (T 2 S for short) [2], -Twig-List (TL for short) [3], -One-Phase Holistic (OPH for short) [4], -tree-embedding (discussed in this paper, TE for short). Algorithm for query evaluation Experiments [1]N. Bruno, N. Koudas, and D. Srivastava, Holistic Twig Joins: Optimal XML Pattern Matching, in Proc. SIGMOD Int. Conf. on Management of Data, Madison, Wisconsin, June 2002, pp. 310-321. [2]S. Chen, H-G. Li, J. Tatemura, W-P. Hsiung, D. Agrawa, and K.S. Canda, Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents, in Proc. VLDB, Seoul, Korea, Sept. 2006, pp. 283-294. [3]Qin, L., Yu, J. X., and Ding, B., “TwigList: Make Twig Pattern Matching fast,” In Proc. 12th Int’l Conf. on Database Systems for Advanced Applications (DASFAA), pp. 850-862, Apr. 2007. [4]Jiang, Z., Luo, C., Hou, W.-C., Zhu, Q., and Che, D., “Efficient Processing of XML Twig Pattern: A Novel One-Phase Holistic Solution,” In Proc. The 18th Int’l Conf. on Database and Expert Systems Applications (DEXA), pp. 87-97, Sept. 2007.

Algorithm for query evaluation Experiments Theoretical computational complexities methodsQuery timeRuntime space usage TwigStackO(|D| |Q| ) Twig 2 StackO(|D||Q| 2 +|subTwig- Results| O(|D||Q|) TL O(|D||Q| 2 )O(|D||Q|) OPH O(|D||Q| 2 )O(|D||Q|) TEO(|D||Q|) D - a largest data stream associated with a node q of Q such that for each v in the data stream we have label(v) = label(q).

Algorithm for query evaluation Experiments Indexes All the tested methods use XB-trees as their index structure.

Algorithm for query evaluation Experiments Data sets The data sets used for the tests are DBLP data set [30] and a synthetic XMARK data set [35]. The DBLP data set is another real data set with high similarity in structure. It is in fact a wide and shallow document. DBLPXMark Data size127 (MB)113 (MB) Number of nodes3332k1666k Max/Avg. depth 6/2.912/5.5

Algorithm for query evaluation Experiments Queries – altogether 20 queries, divided into 4 groups Q1://inproceedings [author]//year [text() = ‘2004’] Q2://inproceedings [author and title]//year [text() = ‘2004’] Q3://inproceedings [author and title and.//pages]//year[text() = ‘2004’] Q4://inproceedings [author and title and.//pages and.//url] //year [text() = 2004’] Q5://articles [author and title and.//volume and.//pages and //url]//year [text() = ‘2004’] Group I:

Algorithm for query evaluation Experiments Test results For all the experiments, the buffer pool size was fixed at 2000 pages. The page size of 8KB was used. For each data set, all the tag names are stored in a single list and then each tag name is represented by its order number in that list during the evaluation of queries. In our implementation, each DocId occupies 4 bytes while a number in a Prüfer sequence, a LeftPos or a RightPos occupies 2 bytes. A levelNum value takes only 1 byte.

Q1Q2Q3Q4Q5 5 4 3 2 1 execution time (sec.) TS T 2 S OPH TL TM + + + + Algorithm for query evaluation Experiments * * *

Algorithm for query evaluation Experiments Queries – altogether 20 queries, divided into 4 groups Q6://inproceedings[author/* and./*]/year Q7://inproceedings[author/* and title and./*]/year Q8://inproceedings[author/* and title and.//pages and./*]/year Q9://inproceeding[author/* and title and.//pages and.//url and./*]/year Q10://articles[author/* and title and.//volume and.//pages and.//pages and.//url and./*]/year Group II:

Q6Q7Q8Q9Q10 60 50 40 30 20 execution time (sec.) + + Algorithm for query evaluation Experiments TS T 2 S OPH TL TM + * * * * *

Algorithm for query evaluation Experiments Queries – altogether 20 queries, divided into 4 groups Q11:/site//open_auction[.//seller/person]//date[text() = ‘10/23/2006’] Q12:/site//open_auction[.//seller/person and.//bidder]//date[text() = ‘10/23/2006’] Q13:/site//open_auction[.//seller/person and /./bidder/increase]//date [text() = ‘10/23/2006’] Q14:/site//open_auction[.//seller/person and.//bidder/increase and.//initial]//date [text() = ‘10/23/ 2006’] Q15:/site//open_auction[.//seller/person and.//bidder/increase and //initial and.//description]//date [text() = ‘10/23/2006’] Group III:

Q11Q12Q13Q14Q15 5 4 3 2 1 execution time (sec.) + + + TS T 2 S OPH TL TM + * * * Algorithm for query evaluation Experiments * *

Algorithm for query evaluation Experiments Queries – altogether 20 queries, divided into 4 groups Q16:/site//open_auction[.//seller/person/* and./*]/date Q17:/site//open_auction[.//seller/person/* and.//bidder and./*]/date Q19:/site//open_auction[.//seller/person/* and.//bidder/increase and.//initial and./*]/date Q20:/site//open_auction[.//seller/person/* and.//bidder/increase and.//initial and.//description and./*]/ date Group IV: Q18:/site//open_auction[.//seller/person/* and.//bidder/increase]/date

Q16Q17Q18Q19Q20 5 4 3 2 1 execution time (sec.) + TS T 2 S OPH TL TM + * * * * * Algorithm for query evaluation Experiments

Algorithm for query evaluation Experiments Test results Run time space usage In the following figure, we compare the runtime memory usage of all the five tested approaches for the second group of queries. By the memory usage, we mean the intermediate data structures, not including data stream (concretely, path stacks for TwigStack; hierarchical stacks for Twig2Stack, TL and OPH; and QSs for ours.)

Summary An efficient method for evaluating unordered tree pattern queries in XML document databases - parent/child and ancestor/descendant relation -O(|D||Q|) time and O(|D||Q|) space Strict unordered tree matching - exponential time Experiments -TreeBank database, DBLP and XMark documents -I/O time and CPU processing time

Thank you.

Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University.

Similar presentations

Presentation on theme: "Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University.

Similar presentations

Presentation on theme: "Unordered Tree Matching and Strict Unordered Tree Matching: the Evaluation of Tree Pattern Queries Dr. Yangjun Chen Dept. Applied Computer Science, University."— Presentation transcript:

Similar presentations

About project

Feedback