Core Labeling: A New Way to Compress Transitive Closure

Presentation transcript:

Core Labeling: A New Way to Compress Transitive Closure Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline
Motivation
Tree labeling
Main algorithm
- Core tree
- Graph labeling: Core-I
- Graph labeling: Core-II
Conclusion

Motivation
An efficient method to evaluate reachability queries on sparse graphs: given a directed sparse graph G, check whether a node v is reachable from another node u through a path in G.
Applications: XML data processing, gene-regulatory networks, and metabolic networks. XML documents are often represented as trees. However, an XML document may contain IDREF/ID references that turn it into a directed but sparse graph: a tree structure plus a few reference links. In a metabolic network, reachability models whether two genes interact with each other or whether two proteins participate in a common pathway. Many such graphs are sparse.

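Motivation
A simple method: store the transitive closure as a matrix. For a graph G with n nodes, the closure G* is represented by an n × n Boolean matrix M, which takes O(n²) space.
[Figure: an example graph G on nodes a, b, c, d, e, its transitive closure G*, and the corresponding matrix M.]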
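As a rough illustration of the matrix method, the sketch below fills a Boolean reachability matrix with Warshall's algorithm. The five-node graph and the name `transitive_closure` are assumptions made for this example; the edges are not those of the slide's figure.

```python
def transitive_closure(nodes, edges):
    """Return an n x n Boolean matrix m where m[i][j] is True iff nodes[j]
    is reachable from nodes[i] by a path of at least one edge."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    m = [[False] * n for _ in range(n)]
    for u, v in edges:
        m[index[u]][index[v]] = True
    # Warshall's algorithm: allow node k as an intermediate stop.
    for k in range(n):
        for i in range(n):
            if m[i][k]:
                for j in range(n):
                    if m[k][j]:
                        m[i][j] = True
    return m


NODES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]  # hypothetical graph
M = transitive_closure(NODES, EDGES)
```

The matrix always has n × n entries regardless of how few edges the graph has, which is the space cost the labeling schemes in this presentation try to avoid.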

Tree labeling - Tree encoding
Let G be a sparse graph. We first find a spanning tree T of G. Each node v in T is assigned an interval [start, end), where start is v's preorder number and end - 1 is the largest preorder number among all the nodes in T[v]. So another node u labeled [start', end') is a descendant of v (with respect to T) iff start' ∈ [start, end).
Example labels: a [0, 12), b [1, 5), c [2, 4), k [3, 4), d [4, 5), r [5, 9), e [6, 9), f [7, 8), g [8, 9), h [9, 12), i [10, 11), j [11, 12).
Let v and u be two nodes in T, labeled [a, b) and [a', b'), respectively. If a ∈ [a', b'), v is a descendant of u. In this case, we say that [a, b) is subsumed by [a', b'). Also, we must have b ≤ b'. Therefore, if v and u are not on the same path in T, we have either a' ≥ b or a ≥ b'. In the former case, we say that [a, b) is smaller than [a', b'), denoted [a, b) ≺ [a', b'). In the latter case, [a', b') is smaller than [a, b).
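The interval encoding can be sketched in a few lines. The code below assumes the example spanning tree is the one implied by the slide's intervals (a over b, r, h; b over c, d; c over k; r over e; e over f, g; h over i, j); the names `label_tree` and `is_descendant` are hypothetical.

```python
def label_tree(children, root):
    """Assign each node the interval [start, end) from a preorder walk."""
    labels = {}
    counter = 0

    def visit(v):
        nonlocal counter
        start = counter          # preorder number of v
        counter += 1
        for c in children.get(v, []):
            visit(c)
        labels[v] = (start, counter)  # end - 1 is the largest preorder number in T[v]

    visit(root)
    return labels


def is_descendant(labels, u, v):
    """True iff start(u) lies in v's interval, i.e. u is in the subtree T[v]."""
    start_v, end_v = labels[v]
    return start_v <= labels[u][0] < end_v


TREE = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
        "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
LABELS = label_tree(TREE, "a")
```

Note that the test as stated on the slide is reflexive: a node's own start value lies in its own interval.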

Tree labeling - Tree encoding
Interval sequences (label space):
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
k: [3, 4)
g: [2, 4)[8, 9)
c: [2, 4)
i: [10, 11)
j: [11, 12)
b: [1, 5)
r: [2, 4)[5, 9)[6, 9)
[Figure: the spanning tree T with the tree interval of each node.]

Main Algorithm - Core tree (core of G)
Let T be a spanning tree. We denote by E' the set of all the non-tree edges and by V' the set of all the end points of the non-tree edges. Then V' = Vstart ∪ Vend, where Vstart is the set of all the start nodes of the non-tree edges and Vend the set of all their end nodes.
Definition 1. (anti-subsuming subset) A subset S ⊆ Vstart is called an anti-subsuming subset iff |S| > 1 and no two nodes in S are related by the ancestor-descendant relationship with respect to T.
Example: Vstart = {d, f, g, h}, Vend = {c, k, e, d, g}.
Anti-subsuming subsets: {d, f}, {d, g}, {d, h}, {f, g}, {f, h}, {g, h}, {d, f, g}, {d, f, h}, {d, g, h}, {f, g, h}, {d, f, g, h}.

Main Algorithm - Core tree (core of G)
Definition 2. (critical node) A node v in a spanning tree T of G is critical if v ∈ Vstart or there exists an anti-subsuming subset S = {v1, v2, ..., vk} for k ≥ 2 such that v is the lowest common ancestor of v1, v2, ..., vk. We denote by Vcritical the set of all critical nodes.
In the graph, node e is the lowest common ancestor of {f, g}, and node a is the lowest common ancestor of {d, f, g, h}. So e and a are critical nodes. In addition, each v ∈ Vstart is a critical node. So the critical nodes of G with respect to T are {d, f, g, h, e, a}.
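One way to compute the critical nodes without enumerating anti-subsuming subsets is sketched below. It relies on an observation that is not stated on the slide: a node outside Vstart is the lowest common ancestor of some anti-subsuming subset exactly when at least two of its child subtrees contain a node of Vstart. The function name and the tree dictionary are assumptions; the tree shape is the one implied by the example intervals.

```python
def critical_nodes(children, root, v_start):
    """Return Vcritical: the nodes of Vstart plus every node that has
    Vstart nodes in at least two different child subtrees."""
    critical = set()

    def visit(v):
        # Count the child subtrees of v that contain a node of Vstart.
        hits = sum(1 for c in children.get(v, []) if visit(c))
        if v in v_start or hits >= 2:
            critical.add(v)
        # Report whether T[v] contains any node of Vstart.
        return v in v_start or hits >= 1

    visit(root)
    return critical


TREE = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
        "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
V_START = {"d", "f", "g", "h"}
V_CRITICAL = critical_nodes(TREE, "a", V_START)
```

On the example this marks e (children f and g) and a (subtrees of b, r, and h), in addition to the four start nodes.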

Main Algorithm - Core tree (core of G)
Definition 3. (core of G) Let G = (V, E) be a directed graph. Let T be a spanning tree of G. The core of G with respect to T is a tree structure whose node set is Vcritical and in which there is an edge from u to v (u, v ∈ Vcritical) iff there is a path p from u to v in T and p contains no other critical nodes. The core of G with respect to T is denoted Gcore = (Vcore, Ecore).
In the example, Gcore has root a with children d, e, and h, and e has children f and g. The interval sequences attached to its nodes are a: [0, 12); h: [2, 4)[4, 5)[6, 9)[9, 12); e: [2, 4)[4, 5)[6, 9); f: [3, 4)[4, 5)[7, 8); d: [3, 4)[4, 5); g: [2, 4)[8, 9).
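Definition 3 can be read as "connect every critical node to its nearest critical proper ancestor in T". The sketch below builds Ecore that way with a top-down walk. This is a simplification chosen for illustration, not the bottom-up core-generation procedure of the presentation; the function name and tree dictionary are assumptions.

```python
def core_edges(children, root, critical):
    """Return the edges (u, v) of Gcore: v is critical and u is the nearest
    critical proper ancestor of v in the spanning tree."""
    edges = []

    def visit(v, nearest):
        if v in critical:
            if nearest is not None:
                edges.append((nearest, v))
            nearest = v  # v is now the closest critical ancestor for T[v]
        for c in children.get(v, []):
            visit(c, nearest)

    visit(root, None)
    return edges


TREE = {"a": ["b", "r", "h"], "b": ["c", "d"], "c": ["k"],
        "r": ["e"], "e": ["f", "g"], "h": ["i", "j"]}
V_CRITICAL = {"a", "d", "e", "f", "g", "h"}
E_CORE = core_edges(TREE, "a", V_CRITICAL)
```

Non-critical nodes such as b and r are skipped over, so d and e both hang directly under a.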

Main Algorithm - Core generation
Algorithm core-generation(T)
1. Mark every node in T that belongs to Vstart.
2. Let v be the first marked node encountered during the bottom-up search of T. Create the first node for v in Gcore.
3. Let u be the currently encountered node in T. Let u' be the node in T for which a node in Gcore was created just before u is met. Do (4) or (5), depending on whether u is a marked node or not.
4. If u is a marked node, then do the following.
(a) If u' is not a child (descendant) of u, create a link from u to u', called a left-sibling link and denoted left-sibling(u) = u'.

Main Algorithm - Core generation
Algorithm core-generation(T) (continued)
(b) If u' is a child (descendant) of u, we first create a link from u' to u, called a parent link and denoted parent(u') = u. Then, we go along the left-sibling chain starting from u' until we meet a node u'' that is not a child (descendant) of u. For each encountered node w except u'', set parent(w) ← u. Set left-sibling(u) ← u''. Remove left-sibling(w) for each child w of u.
5. If u is a non-marked node, then do the following.
(c) If u' is not a child (descendant) of u, no node is created.
(d) If u' is a child (descendant) of u, we go along the left-sibling chain starting from u' until we meet a node u'' that is not a child (descendant) of u. If the number of nodes encountered during the chain navigation (not including u'') is more than 1, we create a new node in Gcore and do the same operation as in (4.b). Otherwise, no node is created.

Main Algorithm - Core tree (core of G)
[Figure: the left-sibling chain from u' to u'', where u'' is not a child of u, followed by panels (a) to (f) showing the core being built step by step on the example tree.]

Main Algorithm - Graph labeling: Core-I
Definition 4. Let Vcore = {v1, ..., vg} be the node set of Gcore. The core label for G is a set {L(v1), ..., L(vg)}, where each L(vl) (l = 1, ..., g) is an interval sequence associated with vl, satisfying the following two properties:
(1) Let L(vl) = [al1, bl1), ..., [alr, blr) for some r. Then, for any i, j ∈ {1, ..., r}, alj ≥ bli if i < j. That is, [ali, bli) ≺ [alj, blj) for i < j. (In this sense, the intervals in L(vl) are considered to be sorted.)
(2) Let [a, b) be the interval associated with a descendant of vl with respect to G. There exists an interval [ali, bli) (1 ≤ i ≤ r) in L(vl) such that a ∈ [ali, bli).
Definition 5. (link graph) Let G = (V, E) be a directed graph. Let T be a spanning tree of G. The link graph of G with respect to T is a graph, denoted Glink, with the node set V' (the end points of all the non-tree edges) and the edge set E' ∪ E'', where (v, u) ∈ E'' iff v ∈ Vend, u ∈ Vstart, and there exists a path from v to u in T.

Main Algorithm - Graph labeling: Core-I
Gcom = Gcore ∪ Glink.
[Figure: Glink and Gcom for the example, with a reverse topological order of Gcom. Tree intervals of the nodes of Gcom: a [0, 12), h [9, 12), e [6, 9), c [2, 4), d [4, 5), f [7, 8), g [8, 9), k [3, 4). Interval sequences: a: [0, 12); h: [2, 4)[4, 5)[6, 9)[9, 12); e: [2, 4)[4, 5)[6, 9); f: [3, 4)[4, 5)[7, 8); d: [3, 4)[4, 5); k: [3, 4); g: [2, 4)[8, 9); c: [2, 4).]

Main Algorithm - Generation of interval sequences
1. Scan the nodes of Gcom in reverse topological order.
2. For each node v, the interval sequence L(v) is stored in a linked list Av. Initially, Av contains only one interval, which is generated by labeling T.
3. Let v1, ..., vk be the children of v (in Gcom). Merge Av with Avl for each child node vl (l = 1, ..., k) as follows. Assume Av = p1 → p2 → ... → pg and Avl = q1 → q2 → ... → qh. Assume that both Av and Avl are increasingly ordered. (As we will see, any interval sequence generated by the following algorithm has this property: it contains only intervals that are not on the same path in T. Initially, Av contains only one interval, which is trivially sorted.)

Main Algorithm - Generation of interval sequences
4. We step through both Av and Avl from left to right. Let pi = [ai, bi) and qj = [aj, bj) be the intervals encountered. We perform the following checks.
(i) If ai ≥ bj, insert qj into Av after pi-1 and before pi, and move to qj+1.
(ii) If ai ∈ [aj, bj), remove pi from Av and move to pi+1. (*pi is subsumed by qj.*)
(iii) If aj ∈ [ai, bi), ignore qj and move to qj+1. (*qj is subsumed by pi, but it should not be removed from Avl.*)
(iv) If aj ≥ bi, ignore pi and move to pi+1.
(v) If ai = aj and bi = bj, ignore both pi and qj, and move to pi+1 and qj+1.
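The five checks can be sketched as a merge over two Python lists. Unlike the slide, which edits the linked list Av in place, this version builds a new list; the function name `merge` is hypothetical. It assumes both inputs are sorted sequences of tree intervals, so any two intervals are equal, disjoint, or properly nested with different start values.

```python
def merge(av, avl):
    """Merge the interval sequence of a child (avl) into that of v (av).
    Returns a new sorted sequence without subsumed intervals."""
    out = []
    i = j = 0
    while i < len(av) and j < len(avl):
        (ai, bi), (aj, bj) = av[i], avl[j]
        if (ai, bi) == (aj, bj):   # case (v): same interval, keep one copy
            out.append(av[i])
            i += 1
            j += 1
        elif ai >= bj:             # case (i): q lies entirely before p
            out.append(avl[j])
            j += 1
        elif aj >= bi:             # case (iv): p lies entirely before q
            out.append(av[i])
            i += 1
        elif aj <= ai < bj:        # case (ii): p is subsumed by q, drop p
            i += 1
        else:                      # case (iii): q is subsumed by p, skip q
            j += 1
    out.extend(av[i:])             # whatever is left of either list
    out.extend(avl[j:])
    return out


A1 = [(2, 4), (4, 5), (7, 8)]
A2 = [(2, 4), (8, 9)]
MERGED = merge(A1, A2)
```

Starting from e's own tree interval [6, 9) and merging in the sequences of f and g reproduces the sequence listed for e, since [7, 8) and [8, 9) are subsumed by [6, 9).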

Main Algorithm - Generation of interval sequences
Example.
Before merging: A1: [2, 4)[4, 5)[7, 8); A2: [2, 4)[8, 9).
After merging A2 into A1: A1: [2, 4)[4, 5)[7, 8)[8, 9); A2: [2, 4)[8, 9) (unchanged).

Main Algorithm - Core labels
a: [0, 12)
h: [2, 4)[4, 5)[6, 9)[9, 12)
e: [2, 4)[4, 5)[6, 9)
f: [3, 4)[4, 5)[7, 8)
d: [3, 4)[4, 5)
g: [2, 4)[8, 9)

Main Algorithm - Non-tree labeling
Let Vcore = {v1, ..., vj}. We store the core label of G as a list: s1 = L(v1), ..., sj = L(vj). Then, we define a function f: Vcore → {1, ..., j} such that for each v ∈ Vcore, f(v) = i iff si = L(v). Based on the above concepts, we define Core-I below.
s1: L(a) = [0, 12), f(a) = 1
s2: L(h) = [2, 4)[4, 5)[6, 9)[9, 12), f(h) = 2
s3: L(e) = [2, 4)[4, 5)[6, 9), f(e) = 3
s4: L(f) = [3, 4)[4, 5)[7, 8), f(f) = 4
s5: L(d) = [3, 4)[4, 5), f(d) = 5
s6: L(g) = [2, 4)[8, 9), f(g) = 6

Main Algorithm - Non-tree labeling
Each node v in V is associated with two nodes, v- and v*.
v-: the critical node in T[v] that is closest to v.
v*: the lowest ancestor of v (in T) that has a non-tree incoming edge.
Example. r- = e, and r* does not exist. e- = e, and e* = e.

Main Algorithm - Non-tree labeling
Definition (Core-I) Let v be a node in G. The non-tree label of v is a pair <d, t>, where
- d = i if v- exists and f(v-) = i. If v- does not exist, let d be the special symbol "_".
- t = [x, y) if v* exists and [x, y) is the interval of v*. If v* does not exist, let t be "_".
Labels in the example (tree interval, non-tree label):
a: [0, 12), <1, _>
b: [1, 5), <5, _>
c: [2, 4), <_, [2, 4)>
k: [3, 4), <_, [3, 4)>
d: [4, 5), <5, [4, 5)>
r: [5, 9), <3, _>
e: [6, 9), <3, [6, 9)>
f: [7, 8), <4, [6, 9)>
g: [8, 9), <6, [8, 9)>
h: [9, 12), <2, _>
i: [10, 11), <_, _>
j: [11, 12), <_, _>

Main Algorithm - Non-tree labeling
Proposition. Assume that u and v are two nodes in G, labeled ([a1, b1), <x1, y1>) and ([a2, b2), <x2, y2>), respectively. Node v is reachable from u iff one of the following conditions holds:
(i) [a2, b2) is subsumed by [a1, b1), or
(ii) there exists an interval [a, b) in sx1 such that for y2 = [a', b') we have a' ∈ [a, b) (i.e., y2 is subsumed by [a, b)).
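A minimal sketch of the Core-I query follows. The core label list and the node labels are copied from the example slides (nodes i and j are left out); the dictionary layout, the use of `None` for the special symbol, and the name `reachable` are assumptions. A binary search over the sorted sequence stands in for the logarithmic lookup.

```python
from bisect import bisect_right

# Core label list s1..s6 from the example, keyed by f(v).
CORE = {1: [(0, 12)],
        2: [(2, 4), (4, 5), (6, 9), (9, 12)],
        3: [(2, 4), (4, 5), (6, 9)],
        4: [(3, 4), (4, 5), (7, 8)],
        5: [(3, 4), (4, 5)],
        6: [(2, 4), (8, 9)]}

# node -> (tree interval, d, t); None plays the role of the special symbol.
LABELS = {"a": ((0, 12), 1, None), "b": ((1, 5), 5, None),
          "c": ((2, 4), None, (2, 4)), "k": ((3, 4), None, (3, 4)),
          "d": ((4, 5), 5, (4, 5)), "r": ((5, 9), 3, None),
          "e": ((6, 9), 3, (6, 9)), "f": ((7, 8), 4, (6, 9)),
          "g": ((8, 9), 6, (8, 9)), "h": ((9, 12), 2, None)}


def reachable(u, v):
    """Check whether v is reachable from u using conditions (i) and (ii)."""
    (a1, b1), d1, _ = LABELS[u]
    (a2, _), _, t2 = LABELS[v]
    if a1 <= a2 < b1:                 # (i) v lies in the subtree of u
        return True
    if d1 is None or t2 is None:      # u- or v* does not exist
        return False
    seq = CORE[d1]
    # (ii) find the last interval whose start is <= the start of v*'s interval.
    pos = bisect_right(seq, (t2[0], float("inf"))) - 1
    return pos >= 0 and t2[0] < seq[pos][1]
```

For instance, f is reachable from h through condition (ii): f* = e has interval [6, 9), and s2 = L(h) contains [6, 9).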

Main Algorithm - Graph labeling: Core-II
We can store the core label of G as a d × g Boolean matrix M, where d is the number of end nodes of all the non-tree edges and g is the number of nodes in Gcore. Let u1, u2, ..., ud be all the end nodes of the non-tree edges. Let v1, v2, ..., vg be all the nodes in Gcore. Assign each ui an index, denoted index(ui) (i.e., u1, u2, ..., ud are assigned contiguous integers, starting from 0). Assign each vj an index, denoted index'(vj). An entry M[index(ui), index'(vj)] is set to 1 if there exists an interval [a', b') in L(vj) such that for ui's interval [a, b) we have a ∈ [a', b'); otherwise, it is set to 0.
Example indices: index(c) = 0, index(k) = 1, index(d) = 2, index(e) = 3, index(g) = 4; index'(a) = 0, index'(h) = 1, index'(e) = 2, index'(f) = 3, index'(d) = 4, index'(g) = 5.
[Figure: the resulting 5 × 6 matrix M.]
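The matrix can be filled directly from the definition. The sketch below uses the end nodes, tree intervals, and core labels of the example; the variable names and the helper `covers` are hypothetical, and the resulting entries are computed here from the definition rather than read off the slide's figure.

```python
# End nodes of the non-tree edges with their tree intervals, in index order.
END_NODES = {"c": (2, 4), "k": (3, 4), "d": (4, 5), "e": (6, 9), "g": (8, 9)}

# Core labels L(v) for the nodes of Gcore, in index' order.
CORE = {"a": [(0, 12)],
        "h": [(2, 4), (4, 5), (6, 9), (9, 12)],
        "e": [(2, 4), (4, 5), (6, 9)],
        "f": [(3, 4), (4, 5), (7, 8)],
        "d": [(3, 4), (4, 5)],
        "g": [(2, 4), (8, 9)]}


def covers(sequence, a):
    """True iff some interval [lo, hi) in the sequence contains a."""
    return any(lo <= a < hi for lo, hi in sequence)


# Dictionaries keep insertion order, so positions match index and index'.
ROW = {u: i for i, u in enumerate(END_NODES)}
COL = {v: j for j, v in enumerate(CORE)}
M = [[int(covers(CORE[v], END_NODES[u][0])) for v in CORE] for u in END_NODES]
```

A lookup such as `M[ROW["k"]][COL["f"]]` then takes constant time, at the price of storing d × g entries.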

Conclusion
A new algorithm for graph reachability
- Core tree
- Graph labeling: Core-I
  query time: O(log(min{b, s}))
  labeling time: O(n + e + t · min{b, s})
  space overhead: O(n + s · min{b, s})
- Graph labeling: Core-II
  query time: O(1)
  labeling time: O(n + e + t · min{b, s} + d · s log(min{b, s}))
  space overhead: O(n + d · s)

Evaluation of Twig Pattern Queries Based on Ordered Tree matching Yangjun Chen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline Motivation Algorithm for tree pattern query evaluation based on ordered tree matching - Tree encoding - Algorithm description Index-based algorithm Conclusion

Motivation
XPath evaluation against XML documents
- XPath expressions:
  a[b[c and .//d]]/b[c and e//d]
  book[title = 'Art of Programming']//author[fn = 'Donald' and ln = 'Knuth']
- Document fragment: <document> <book> <title> Art of Programming </title> <author> <fn>Donald Knuth</fn> …
[Figure: the two expressions drawn as twig patterns, one over nodes a, b, c, d, e and one over book, title, author, fn, ln.]

Motivation
XPath evaluation against XML documents - Evaluation based on unordered tree matching
Definition. An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: for each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: if (u, v) is a /-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a //-edge in Q, then f(v) is a descendant of f(u) in T.
[Figure: a twig pattern Q and a document tree T.]

Motivation
XPath evaluation against XML documents - Evaluation based on ordered tree matching
XPath expression: a[b[c/following-sibling:: .//d]]/following-sibling::b[c/following-sibling:: e//d]

Motivation
XPath evaluation against XML documents - Evaluation based on ordered tree matching
Definition. An embedding of a twig pattern Q into an XML document T is a mapping f: Q → T, from the nodes of Q to the nodes of T, which satisfies the following conditions:
(i) Preserve node label: for each u ∈ Q, label(u) matches label(f(u)).
(ii) Preserve parent-child/ancestor-descendant relationships: if (u, v) is a /-edge in Q, then f(v) is a child of f(u) in T; if (u, v) is a //-edge in Q, then f(v) is a descendant of f(u) in T.
(iii) Preserve sibling order: for any two nodes v1 ∈ Q and v2 ∈ Q, if v1 is to the left of v2, then f(v1) is to the left of f(v2) in T.
[Figure: Q with nodes q1 (b), q2 (c), q3 (a), and T with nodes v1 (c), v2 (b), v3 (b), v4 (c), v5 (c), v6 (a).]

Algorithm for tree pattern query evaluation - Tree encoding
Let T be a document tree. We associate each node v in T with a quadruple (DocId, LeftPos, RightPos, LevelNum), denoted a(v), where DocId is the document identifier; LeftPos and RightPos are generated by counting word numbers from the beginning of the document until the start and end of the element, respectively; and LevelNum is the nesting depth of the element in the document.
(i) Ancestor-descendant: a node v1 associated with (d1, l1, r1, ln1) is an ancestor of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, and r1 > r2.
(ii) Parent-child: a node v1 associated with (d1, l1, r1, ln1) is the parent of another node v2 with (d2, l2, r2, ln2) iff d1 = d2, l1 < l2, r1 > r2, and ln2 = ln1 + 1.
(iii) From left to right: a node v1 associated with (d1, l1, r1, ln1) is to the left of another node v2 with (d2, l2, r2, ln2) iff d1 = d2 and r1 < l2.
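The three tests translate directly into comparisons on the quadruples. In the sketch below, the sample values are the quadruples of the example tree T in this presentation; the tuple type `Node`, its field names, and the function names are hypothetical.

```python
from collections import namedtuple

# (DocId, LeftPos, RightPos, LevelNum)
Node = namedtuple("Node", "doc_id left right level")


def is_ancestor(v1, v2):
    """v1 is an ancestor of v2: same document and v1 strictly encloses v2."""
    return v1.doc_id == v2.doc_id and v1.left < v2.left and v1.right > v2.right


def is_parent(v1, v2):
    """v1 is the parent of v2: an ancestor exactly one level above."""
    return is_ancestor(v1, v2) and v2.level == v1.level + 1


def is_left_of(v1, v2):
    """v1 is to the left of v2: v1 ends before v2 starts."""
    return v1.doc_id == v2.doc_id and v1.right < v2.left


V1 = Node(1, 3, 3, 3)
V2 = Node(1, 5, 5, 4)
V3 = Node(1, 4, 6, 3)
V4 = Node(1, 2, 7, 2)
V5 = Node(1, 8, 8, 2)
V6 = Node(1, 1, 9, 1)
```

Each test needs only the two quadruples, so no tree traversal is required to relate two nodes.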

Algorithm for tree pattern query evaluation - Tree encoding
Example. Quadruples for the nodes of T:
v6 (a): (1, 1, 9, 1)
v4 (c): (1, 2, 7, 2)
v5 (c): (1, 8, 8, 2)
v1 (c): (1, 3, 3, 3)
v3 (b): (1, 4, 6, 3)
v2 (b): (1, 5, 5, 4)

Algorithm for tree pattern query evaluation - Main algorithm
1. First, we number both T and Q in postorder, so the nodes in both trees are referenced by their postorder numbers.
[Figure: Q with nodes q1 (b), q2 (c), q3 (a) numbered 1 to 3, and T with nodes v1 to v6 numbered 1 to 6.]
2. We access the nodes in T and the nodes in Q in the order of their postorder numbers. Each time we meet a node i in Q, we associate it with an array Ai of length |T|, indexed from 0 to |T| - 1. The arrays Ai are manipulated as follows.

Algorithm for tree pattern query evaluation
(i) We set a virtual node for T, numbered 0, which is considered to be to the left of any node in T.
(ii) If we find that Q[i] can be embedded in T[j], we set Ai[j1], ..., Ai[jk] (0 ≤ k ≤ j - 1) to j, where each jl (0 < l ≤ k) is a node to the left of j, to record the fact that j is the closest node to the right of jl such that T[j] embeds Q[i].
Example for query node 2 (q2, labeled c), with array indices 0 to 5 and "-" for an unset entry:
A2 = 1 - - - - - after the first match, and A2 = 1 5 5 5 5 - at the end.
[Figure: Q and T with the virtual node v0 placed to the left of T.]

Algorithm for tree pattern query evaluation
(iii) If some time later we find another node p such that Q[i] can be embedded in T[p], we set Ai[p1], ..., Ai[pq] to p, where each ps (1 ≤ s ≤ q) is to the left of p but to the right of jk. For all the other nodes j' such that T[j'] embeds Q[i], we set the entries in Ai in the same way as in (ii) and (iii).
3. During the process, when we meet i in Q and j in T, we do the following. Let i1, ..., ik be the child nodes of i in Q. We first check Ai1, starting from Ai1[l], where l = min{desc(j)} - 1 and desc(j) represents the set of all descendants of j. The search begins at min{desc(j)} - 1 because it is the closest node to the left of the descendant of j that has the least postorder number. Let Ai1[l] = j'. If (i, i1) is a /-edge, we check whether (j, j') is a /-edge. Otherwise, we only check whether j' is a descendant of j. If it is not the case, we check Ai1[j']. This process continues until one of the following conditions is satisfied:
(i) Ai1 is exhausted (we cannot find a descendant j'' of j such that T[j''] embeds Q[i1]); or
(ii) we find a j'' satisfying the parent-child or ancestor-descendant relationship, depending on whether (i, i1) is a /-edge or a //-edge. Then, we check Ai2[j''].

Algorithm for tree pattern query evaluation
If Ai1 is exhausted (case (i)), Q[i1] cannot be embedded in any subtree rooted at a child node (for a /-edge) or a descendant (for a //-edge) of j. This means that Q[i1] cannot be embedded into T[j], and thus T[j] cannot embed Q[i]. We continue to check i against the next node in T. In case (ii), we check Ai2, starting from Ai2[j'']. For all the other arrays Ail (l = 3, ..., k), we perform the same checks. If for each il (l = 1, ..., k) we can find a j' such that T[j'] embeds Q[il], then T[j] embeds Q[i], and we set new values in Ai as described in (2).
[Figure: node i of Q with children i1, i2, ..., ik, node j of T, and the chain of lookups l → j' in Ai1 followed by j' → j'' in Ai2.]

Algorithm for tree pattern query evaluation
Example. For the query Q (q1: b, q2: c, q3: a) and the tree T (v1 to v6, plus the virtual node v0), the arrays over indices 0 to 5 take the following values, with "-" for an unset entry:
A1 = 2 2 - - - -
A2 = 1 - - - - - at first, and 1 5 5 5 5 - at the end
A3 = 6 - - - - -
[Figure: panels (a) to (f) showing the arrays at successive steps.]
The time complexity of the algorithm is O(|T||Q|).

Index-based algorithm - XB-tree
An XB-tree is a variant of the B+-tree built over a sequence of quadruples.
Nodes of T sorted by RightPos values:
v1 (c): (1, 3, 3, 3)
v2 (b): (1, 5, 5, 4)
v3 (b): (1, 4, 6, 3)
v4 (c): (1, 2, 7, 2)
v5 (c): (1, 8, 8, 2)
v6 (a): (1, 1, 9, 1)
Pages, with entries written as (LeftPos, RightPos):
P1 (root): (3, 5), (2, 7), (1, 9)
P2: (3, 3), (5, 5)
P3: (4, 6), (2, 7)
P4: (8, 8), (1, 9)
Each non-root page P records P.parent and P.parentIndex.

Index-based algorithm - Searching an XB-tree
- β = (P, i) indicates that the ith entry in the page P is currently accessed.
- advance(β) (going up from a page to its parent): if β = (P, i) does not point to the last entry of P, i ← i + 1. Otherwise, β ← (P.parent, P.parentIndex).
- drilldown(β) (going down from a page to one of its children): if β = (P, i) and P is not a leaf page, β ← (P', 1), where P' is the ith child page of P.
- Initially, β ← (rootPage, 1), pointing to the first entry in the root page. We finish a traversal of the XB-tree when β = (rootPage, last), where last points to the last entry in the root page, and we advance it (in this case, we set β to nil).
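The two cursor operations can be sketched as follows. The `Page` class, its attribute names, and the use of `None` for nil are assumptions made for illustration; entry positions are 1-based to match the notation above, and the sample pages follow the XB-tree example of this presentation.

```python
class Page:
    """A page with a list of entries and, for internal pages, child pages."""

    def __init__(self, entries, children=None):
        self.entries = entries
        self.children = children or []
        self.parent = None
        self.parent_index = None
        # The ith child remembers its parent page and its position i in it.
        for idx, child in enumerate(self.children, start=1):
            child.parent = self
            child.parent_index = idx

    @property
    def is_leaf(self):
        return not self.children


def advance(cursor):
    """Move to the next entry, or up to the parent entry; None means nil."""
    page, i = cursor
    if i < len(page.entries):
        return (page, i + 1)
    if page.parent is None:       # last entry of the root page
        return None
    return (page.parent, page.parent_index)


def drilldown(cursor):
    """Move to the first entry of the ith child page (no-op on a leaf)."""
    page, i = cursor
    if page.is_leaf:
        return cursor
    return (page.children[i - 1], 1)


P2 = Page([(3, 3), (5, 5)])
P3 = Page([(4, 6), (2, 7)])
P4 = Page([(8, 8), (1, 9)])
ROOT = Page([(3, 5), (2, 7), (1, 9)], [P2, P3, P4])
```

Note that leaving the last entry of a child page lands on the parent's entry for that page, so a caller has to advance once more to reach the next sibling entry.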

Index-based algorithm - Searching an XB-tree
Procedure search(XB, i)
Assume that i in Q is the node currently encountered. By searching the XB-tree, we find a node j of T with label(i) = label(j) for which it is possible that T[j] embeds Q[i].
- L(i): the most recently found node such that Q[i] can be embedded into T[L(i)].
1. Let i1, ..., ik be the children of i. Assume that L(ik) = v. Set l ← v.LeftPos and r ← v.RightPos. If i is a leaf node, then l ← ∞ and r ← 0.
2. Assume that β = (P, c). Let j be the entry pointed to by β. We perform the following checks.
(i) If P is a leaf page, label(j) = label(i), j.LeftPos < l, and j.RightPos > r, then β ← advance(β) and return j.
(ii) If P is an internal page, j.LeftPos < l, and j.RightPos > r, then β ← drilldown(β).
(iii) If j.RightPos < r, then β ← advance(β).
3. If β = nil, return nil.
4. Repeat (2) until the whole XB-tree is traversed (i.e., β = nil) or a node j is found (i.e., the condition in (2)-(i) is satisfied).

Conclusion Algorithm for evaluating tree pattern queries based on ordered tree matching time complexity: O(|T||Q|). Space complexity: O(|T||Q|). The algorithm can be integrated into an index environment by using XB-trees.