2004/12/31 Presenter: 邱紹禎
1 Mining Frequent Query Patterns from XML Queries
L.H. Yang, M.L. Lee, W. Hsu, and S. Acharya. Proc. of 8th Int. Conf. on Database Systems for Advanced Applications (DASFAA’03)
2 Motive
As XML becomes prevalent, efficient retrieval of XML data becomes important. Much research focuses on:
1. indexing XML documents
2. processing regular path expressions
3. discovering frequent query patterns
3 Query Pattern Tree
Q1: {resultPattern = {/book/title, /book/price}, predicates = {/book/author/data() = “Buneman”}, documents = {book.xml}}
The wildcard “*” stands for ANY label in the DTD.
The relative path “//” stands for zero or more labels.
4 Query Pattern Tree
Def. Query Pattern Tree: a rooted tree QPT = <V, E>, where V is the vertex set and E is the edge set. Each vertex v has a label whose value is in {“*”, “//”} ∪ tagSet.
Def. Rooted Subtree: a rooted subtree RST = <V’, E’> of a QPT satisfies Root(RST) = Root(QPT), V’ ⊆ V, and E’ ⊆ E.
5 Frequent Query Pattern Trees
D: a database of query pattern trees {QPT_1, ..., QPT_N}
Freq(RST): the total number of occurrences of a rooted subtree RST in D
Supp(RST) = Freq(RST) / |D|
The problem is to find all RSTs in D whose support is at least a given minimum support.
Example: Supp(RST) = 2/3
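The support formula can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: trees are modeled as sets of root-to-node path strings, and `contains_exact` is a toy stand-in for the containment test (exact labels only, no wildcards). The toy database is made up and only happens to give the same value, 2/3, as the slide's example.

```python
from fractions import Fraction

def support(rst, database, contains):
    """Supp(RST) = Freq(RST) / |D|, where Freq counts the QPTs containing RST."""
    freq = sum(1 for qpt in database if contains(rst, qpt))
    return Fraction(freq, len(database))

def contains_exact(rst, qpt):
    """Toy containment (assumption): every path of the RST occurs in the QPT."""
    return rst <= qpt

# Hypothetical database of three query pattern trees.
D = [
    {"book", "book/title", "book/price"},
    {"book", "book/title", "book/author"},
    {"book", "book/section"},
]
rst = {"book", "book/title"}

print(support(rst, D, contains_exact))  # 2/3
```

With a minimum support of 1/2, this RST would count as frequent.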
6 Tree Pattern Matching
Examples: book/section/figure/title and book//title; book/section/figure/image and book/section/*/image
So node labels are ordered: x ≦ * ≦ // for any tag x.
Def. Query Pattern Tree Matching: an RST is contained in a QPT if the following hold:
1. The root nodes of RST and QPT have the same label.
2. If a node w ∈ RST is matched with a node v ∈ QPT, then
(a) w.label ≦ v.label
(b) each subtree of w is contained in some subtree of QPT
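The label order can be written as a small predicate. This is a minimal sketch under one assumption not stated on the slide: two different concrete tags are treated as incomparable. The function name `label_leq` is hypothetical.

```python
def label_leq(a, b):
    """Return True when a ≦ b under the order x ≦ * ≦ // (x is any tag)."""
    if a == b or b == "//":
        return True
    # "*" sits above every concrete tag but below "//".
    return b == "*" and a != "//"

print(label_leq("figure", "*"))     # True
print(label_leq("*", "//"))         # True
print(label_leq("//", "*"))         # False
print(label_leq("title", "image"))  # False
```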
7 Discovering Frequent Rooted Subtrees
Find all frequent 1-edge RSTs by scanning the database once.
RST-Gen generates the candidate set C_{k+1} from the previously found frequent set F_k and prunes unqualified candidates.
Contains determines whether a candidate RST_{k+1} is contained in a pattern tree t.
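The level-wise loop can be sketched in Python as follows. This is not the XQPMiner implementation: trees are modeled as frozensets of root paths (an assumption that merges same-label siblings), containment is plain set inclusion with no wildcard handling, all QPTs are assumed to share one root, and duplicates are removed with a set instead of the schema-guided enumeration. The names `tree` and `mine_frequent_rsts` are hypothetical.

```python
def tree(*paths):
    """Build a tree as a frozenset of root paths, each a tuple of labels."""
    return frozenset(tuple(p.split("/")) for p in paths)

def mine_frequent_rsts(db, min_supp):
    """Grow frequent k-edge RSTs into (k+1)-edge candidates, level by level."""
    universe = frozenset().union(*db)   # every path seen in some QPT
    root = min(universe, key=len)       # assumes a single shared root label

    def supp(rst):
        return sum(1 for t in db if rst <= t) / len(db)

    def grow(rst):
        # Add one path whose parent path is already part of the RST.
        return {rst | {p} for p in universe if p not in rst and p[:-1] in rst}

    frequent = []
    level = {c for c in grow(frozenset([root])) if supp(c) >= min_supp}
    while level:
        frequent.extend(level)
        candidates = set().union(*(grow(f) for f in level))
        level = {c for c in candidates if supp(c) >= min_supp}
    return frequent

db = [
    tree("book", "book/title", "book/price"),
    tree("book", "book/title", "book/price", "book/author"),
    tree("book", "book/title"),
]
for rst in mine_frequent_rsts(db, 0.6):
    print(sorted("/".join(p) for p in rst))
```

Because candidates are only grown from frequent RSTs, any candidate with an infrequent parent is never generated, which is a weaker form of the pruning step named on the slide.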
8 Generation of Candidate RSTs
Use a schema-guided enumeration method to generate candidate RSTs without repetition.
Construct a G-QPT by merging the query pattern trees in the database.
Use the Apriori property to prune candidate RSTs.
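Merging query pattern trees into one global tree might look like the sketch below. It assumes trees are nested dicts keyed by label, so same-label siblings collapse into one node, and it assumes nodes are numbered in pre-order after the merge. Neither detail is given on the slide, and `merge_into` and `number_nodes` are hypothetical names.

```python
def merge_into(gqpt, qpt):
    """Merge the nested-dict tree `qpt` into `gqpt` in place, label by label."""
    for label, children in qpt.items():
        merge_into(gqpt.setdefault(label, {}), children)
    return gqpt

def number_nodes(tree, ids=None, path=()):
    """Assign pre-order integer ids (starting at 1) to every node."""
    ids = {} if ids is None else ids
    for label, children in tree.items():
        node = path + (label,)
        ids[node] = len(ids) + 1
        number_nodes(children, ids, node)
    return ids

q1 = {"book": {"title": {}, "price": {}}}
q2 = {"book": {"title": {}, "author": {"name": {}}}}

gqpt = {}
for q in (q1, q2):
    merge_into(gqpt, q)

print(gqpt)
for node, node_id in number_nodes(gqpt).items():
    print(node_id, "/".join(node))
```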
9 Generation of Candidate RSTs
10 Containment of RST in a Pattern Tree
Containment is used to count the RSTs’ support in the database. Nodes are compared recursively from the root to the leaf nodes.
Algorithm Contains
Case 1: w is a leaf node
a) v.label is not “//”
1) If w.label = “//” or “*”, return the result of the comparison w.label ≦ v.label
2) If w.label appears in the set of labels of node v’s ancestors, return TRUE
b) v.label is “//”: check whether any child node n of v satisfies w.label ≦ n.label
11 Containment of RST in a Pattern Tree
Case 2: w is not a leaf node and v is a leaf node
w cannot be contained in v
Case 3: neither w nor v is a leaf node
1. If w.label ≦ v.label does not hold, return FALSE
2. Check whether every subtree of w is contained in some subtree of v
3. If v is “//”:
a) Check whether w is contained in one of v’s children
b) Check whether the subtrees of w are contained in v
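A loose Python sketch of the three cases is given below. It is not the paper's Algorithm Contains: the ancestor-label check in Case 1(a)(2) is omitted because the slide does not give enough detail, a leaf w against a non-“//” node v is reduced to a plain label comparison, and a leaf labeled “//” is assumed to match a “//” node directly. The `Node` class and the helpers `path`, `label_leq`, and `rst_contained` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def path(*labels):
    """Build a single-branch tree, e.g. path("book", "//", "title")."""
    node = None
    for label in reversed(labels):
        node = Node(label, [node] if node is not None else [])
    return node

def label_leq(a, b):
    """Label order x ≦ * ≦ //; distinct tags are incomparable (assumption)."""
    return a == b or b == "//" or (b == "*" and a != "//")

def contains(w, v):
    if not label_leq(w.label, v.label):
        return False
    if not w.children:                                   # Case 1: w is a leaf
        if v.label == "//" and w.label != "//" and v.children:
            return any(contains(w, n) for n in v.children)   # 1(b)
        return True
    if not v.children:                                   # Case 2
        return False
    # Case 3, step 2: every subtree of w fits under some subtree of v.
    if all(any(contains(cw, cv) for cv in v.children) for cw in w.children):
        return True
    if v.label == "//":
        if any(contains(w, cv) for cv in v.children):    # step 3(a)
            return True
        return all(contains(cw, v) for cw in w.children)  # step 3(b)
    return False

def rst_contained(rst, qpt):
    """Roots must carry the same label before the recursive check starts."""
    return rst.label == qpt.label and contains(rst, qpt)

print(rst_contained(path("book", "section", "figure", "title"),
                    path("book", "//", "title")))
print(rst_contained(path("book", "section", "figure", "title"),
                    path("book", "section", "*", "image")))
```

Step 3(b) is what lets “//” absorb several levels: the children of w are compared against the same “//” node again, so the recursion moves down the RST while staying on v.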
12 Containment of RST in a Pattern Tree - Example
13 Optimizations for XQPMiner
Encoding Query Pattern Trees: a tree is replaced by its string encoding, e.g. “1,2,-1,3,-1,8,-1”
Indexing Frequent RSTs Using Transaction IDs
Divide the enumeration of RST_k into two sets:
1. G_leaf: generated by expanding the rightmost leaf node
2. G_internal: generated by expanding the nodes along the rightmost branch except the leaf node
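One encoding scheme that produces the string shown above is a pre-order walk that emits a node's id on entry and -1 each time the walk returns to the parent. The slide does not spell out the scheme, so treat this as an assumption that merely agrees with the example. Trees are modeled as hypothetical `(id, children)` tuples.

```python
def encode(tree):
    """Pre-order ids, with -1 emitted whenever the walk returns to a parent."""
    node_id, children = tree
    out = [node_id]
    for child in children:
        out += encode(child) + [-1]
    return out

def decode(codes):
    """Inverse of encode: rebuild the (id, children) tuples with a stack."""
    root = (codes[0], [])
    stack = [root]
    for c in codes[1:]:
        if c == -1:
            stack.pop()              # step back up to the parent
        else:
            node = (c, [])
            stack[-1][1].append(node)
            stack.append(node)
    return root

# Root 1 with three leaf children 2, 3 and 8.
t = (1, [(2, []), (3, []), (8, [])])
print(",".join(map(str, encode(t))))  # 1,2,-1,3,-1,8,-1
```

A flat integer sequence of this kind is cheap to compare and store, and `decode` shows that no structure is lost.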
14 Optimizations for XQPMiner - Example
RST_{k+1}.TIDList = RST_k.TIDList ∩ RST^1_k.TIDList
The RSTs in G_internal need not be matched against D.
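The TID-list idea can be sketched with Python sets. The data is made up, and the assumption is that a (k+1)-edge RST is obtained by joining two k-edge RSTs whose transaction-ID lists are already known, so its support follows from the intersection without rescanning D.

```python
# Hypothetical TID lists: ids of the QPTs in D that contain each k-edge RST.
tid_lists = {
    "book(title)": {1, 2, 3, 5},
    "book(price)": {1, 3, 4},
}
db_size = 5  # |D|, assumed for this example

def join_tid_lists(a, b):
    """TIDList of the joined RST is the intersection of its parents' lists."""
    return a & b

joined = join_tid_lists(tid_lists["book(title)"], tid_lists["book(price)"])
print(sorted(joined))         # [1, 3]
print(len(joined) / db_size)  # 0.4
```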
15 Performance Study
P4 2.4GHz with 1GB RAM, running Windows XP
Each dataset consists of QPTs with a Zipfian distribution
Datasets: DBLP, Shakespeare’s Play
G-QPT: Num. of nodes 9867, Max depth 86, Num. of // 130, Max fanout 129
QPT in DB: Ave # of nodes, Max depth 86, Max fanout 129
18 Algorithm Contains
19 Algorithm Contains