XRANK: Ranked Keyword Search over XML Documents
SangJin Lee, sjinlee@snu.ac.kr
Dept. of Industrial Engineering, Seoul National University

Hello. In today's class, I would like to present the conference paper "XRANK". It was published at SIGMOD 2003 (June) by Professor Jayavel Shanmugasundaram and co-authors at Cornell University. XRANK is the name of a search system, or search engine, whose main target is XML documents. It can also be used for HTML documents, but the main target is XML documents, which have a hierarchical structure with many nested sub-elements.
Contents
  Introduction: problem overview
  Data Model and Query Semantics
  Computing ElemRanks
  Efficiently Evaluating XML Keyword Search Queries
  Experimental Evaluation
  Conclusion

The talk follows the structure of the paper. First, the introduction defines the new problem of keyword search over XML documents. Second, the data model and query semantics are defined using graph notation. Then the XML ranking algorithm, ElemRank, and the algorithms for inverted-list indexing and query processing are explained. Finally, the experiments and the conclusion are presented.
Introduction: Problem Overview
01. <workshop date="28 July 2000">
02.   <title> XML and IR: A SIGIR 2000 Workshop </title>
03.   <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
04.   <proceedings>
05.     <paper id="1">
06.       <title> XQL and Proximal Nodes </title>
07.       <author> Ricardo Baeza-Yates </author>
08.       <author> Gonzalo Navarro </author>
09.       <abstract> The recently proposed language …
10.       </abstract>
11.       <body>
12.         <section name="Introduction">
13.           Searching on structured text is more important …
14.         </section>
15.         <section name="Implementing XML Operations">
16.           <subsection name="Path Expressions">
17.             At first sight, the XQL query language looks …
18.           </subsection>
19.           …
20.         </section>
21.         <cite ref="2">Querying XML in Xyleme</cite>
22.         <cite xlink="../paper/xmlql/">A Query … </cite>
23.       </body>
24.     </paper>
25.     <paper id="2">
26.       <title> Querying XML in Xyleme </title>
27.       …
28.     </paper>
29.   </proceedings>
30. </workshop>
  Problem
    Efficiently producing ranked results for keyword search queries over hierarchical XML documents
  Challenges
    Nested elements are results
      Best relevant element
    Hyperlinks and containment links
      Parent/child elements
    Keyword proximity
      Keyword distance
      Ancestor distance
  E.g. XQL language

The problem defined in this paper is to efficiently produce ranked results for keyword search queries over hierarchical XML documents. Keyword search is simple enough that users do not have to learn a complex query language such as XQuery. However, the hierarchical structure of XML raises three challenges. First, in contrast to flat HTML, the results of an XML search are nested elements, so choosing the most relevant element becomes a problem. Second, there are containment links relating parent and child elements, and these are very different from hyperlinks (such as IDREFs and XLinks). Third, proximity among keywords is more complex: for XML, proximity is a two-dimensional metric made up of keyword distance and ancestor distance.
Data Model and Query Semantics (1/4)
11. <body>
12.   <section name="Introduction">
13.     Searching on structured text is more important …
14.   </section>
15.   <section name="Implementing XML Operations">
16.     <subsection name="Path Expressions">
17.       At first sight, the XQL query language looks …
18.     </subsection>
19.     …
20.   </section>
21.   <cite ref="2">Querying XML in Xyleme</cite>
22.   <cite xlink="../paper/xmlql/">A Query … </cite>
23. </body>
  Definitions
    XML document: $G = (N, CE, HE)$
    $N$: the set of nodes, $N = N_E \cup N_V$
      $N_E$: the set of elements
      $N_V$: the set of values
    $CE$: the set of containment edges relating nodes
    $HE$: the set of hyperlink edges relating nodes
  ※ contains*(v, k): the node v directly or indirectly contains keyword k

An XML document is defined as a graph $G = (N, CE, HE)$, where the node set $N$ is the union of the element set $N_E$ and the value set $N_V$. An element $u$ is a sub-element of an element $v$ if $(v,u) \in CE$. An element $u$ is the parent of a node $v$ if $(u,v) \in CE$. The predicate contains*(v, k) is true if the node v directly or indirectly contains the keyword k.
Data Model and Query Semantics (2/4)
  Definitions (continued)
    Keyword query: $Q = \{k_1, \ldots, k_n\}$
      Conjunctive semantics ($k_1 \wedge \cdots \wedge k_n$): contain all of the query keywords
      Disjunctive semantics ($k_1 \vee \cdots \vee k_n$): contain at least one of the query keywords (*)
    Results
      (1) $R_0 = \{\, v \mid v \in N_E \wedge \forall k \in Q\ (\mathit{contains}^*(v,k)) \,\}$
          elements that directly or indirectly contain all of the query keywords
      (2) $\mathit{Result}(Q) = \{\, v \mid \forall k \in Q,\ \exists c \in N\ ((v,c) \in CE \wedge c \notin R_0 \wedge \mathit{contains}^*(c,k)) \,\}$
          only the most specific results are returned
          an element that has multiple independent occurrences of the query keywords is also returned
  ※ CE is considered for the result set; HE is considered for ranking

A keyword query Q consists of multiple keywords. There are two types of query semantics: conjunctive and disjunctive. $R_0$ is the set of elements that directly or indirectly contain all of the query keywords. Result(Q) is the set of elements that contain at least one occurrence of every query keyword, after excluding the occurrences of keywords in sub-elements that already contain all of the query keywords. This definition ensures that only the most specific results are returned, and that an element with multiple independent occurrences of the query keywords is still returned.
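As a rough illustration of the two definitions above, R0 and Result(Q) can be computed on a toy tree. This is only a sketch: the dictionary layout, the element names, and the helper names `contains_star` and `result` are assumptions made for this example, not structures from the paper.

```python
# Sketch of R0 and Result(Q) on a toy tree. The dict layout and names are
# assumptions for illustration, not the paper's data structures.

def contains_star(children, text, v, k):
    """True if element v directly or indirectly contains keyword k."""
    if k in text.get(v, set()):
        return True
    return any(contains_star(children, text, c, k) for c in children.get(v, []))


def result(children, text, elements, query):
    # R0: elements that directly or indirectly contain every query keyword.
    r0 = {v for v in elements
          if all(contains_star(children, text, v, k) for k in query)}

    def has_independent_occurrence(v, k):
        # A keyword held directly by v sits in a value node, which is never in R0.
        if k in text.get(v, set()):
            return True
        # Otherwise k must come from a child element that is not itself in R0.
        return any(c not in r0 and contains_star(children, text, c, k)
                   for c in children.get(v, []))

    return {v for v in elements
            if all(has_independent_occurrence(v, k) for k in query)}


children = {"workshop": ["proceedings"], "proceedings": ["paper"],
            "paper": ["title", "author"]}
text = {"title": {"xql"}, "author": {"ricardo"}}
elements = ["workshop", "proceedings", "paper", "title", "author"]

print(result(children, text, elements, ["xql", "ricardo"]))  # {'paper'}
```

Here workshop and proceedings are in R0 but are not returned, because their only child that contains the keywords is itself in R0. Only the most specific element, paper, survives.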
Data Model and Query Semantics (3/4)
  Ranking function
    Desired properties
      Result specificity: more specific results rank higher than less specific results
      Keyword proximity
        Keyword distance
        Ancestor distance
      Hyperlink awareness
        Hyperlinks
        Containment links
          Parent/child elements
  ※ Search keywords: XQL language
05. <paper id="1">
06.   <title> XQL and Proximal Nodes </title>
07.   <author> Ricardo Baeza-Yates </author>
08.   <author> Gonzalo Navarro </author>
09.   <abstract> The recently proposed language …
10.   </abstract>
11.   <body>
12.     <section name="Introduction">
13.       Searching …
14.     </section>
15.     <section name="Implementing XML Operations">
16.       <subsection name="Path Expressions">
17.         At first sight, the XQL query language looks …
18.       </subsection>
19.       …
20.     </section>
21.     <cite ref="2">Querying XML in Xyleme</cite>
22.     <cite xlink="../paper/xmlql/">A Query … </cite>
23.   </body>
24. </paper>
25. <paper id="2">
26.   <title> Querying XQL language ... </title>
27.   …
28. </paper>

The ranking function should take the proximity of the query keywords into account. The ranking function should also use the hyperlinked structure of XML documents.
Data Model and Query Semantics (4/4)
  Ranking function (continued)
    ElemRank(v)
      Defined at the granularity of an element
      Takes the nested structure of XML into account
    Consider
      Keyword search query $Q = \{k_1, \ldots, k_n\}$
      Results $R = \mathit{Result}(Q)$
      A result element $v_1 \in R$
    With respect to one keyword: $r(v_1, k_i)$
      $(v_1,v_2), (v_2,v_3), \ldots, (v_t,v_{t+1})$, where $v_{t+1}$ directly contains the keyword $k_i$
    Overall ranking
  ※ f = max or f = sum

ElemRank is proposed to address the desired properties of the ranking function. The rank $r(v_1, k_i)$ depends on the containment edges between $v_1$ and the occurrence of $k_i$: less specific results get lower ranks because of the decay factor. For multiple occurrences of $k_i$ in $v_1$, the combined rank is given by an aggregation function $f$. In this paper $f = \max$ by default, but other functions, such as sum, can also be used. The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity $p$. Keyword proximity will be examined in the next section.
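The ranking just described can be sketched in a few lines of Python. This is a hedged illustration, not the paper's code. It assumes that the rank of one occurrence is the ElemRank of the element that directly contains the keyword, multiplied by the decay factor once per containment level below the result element. The function names, the input layout, and all numeric values are made up for the example.

```python
# Sketch of the ranking described above. Input layout and numbers are
# assumptions for illustration only.

def occurrence_rank(elem_rank, levels_below, decay):
    """Rank of one keyword occurrence: ElemRank of the element that directly
    contains the keyword, scaled down once per containment level below v1."""
    return elem_rank * decay ** levels_below


def overall_rank(occurrences, decay, proximity, f=max):
    """occurrences maps each query keyword to a list of
    (elem_rank, levels_below) pairs found under the result element.
    f aggregates multiple occurrences of the same keyword (max or sum)."""
    per_keyword = [
        f(occurrence_rank(rank, levels, decay) for rank, levels in occs)
        for occs in occurrences.values()
    ]
    # Sum over the query keywords, weighted by the keyword proximity measure.
    return sum(per_keyword) * proximity


occurrences = {"xql": [(0.8, 1)], "ricardo": [(0.5, 1), (0.9, 2)]}
score_max = overall_rank(occurrences, decay=0.5, proximity=1.0)
score_sum = overall_rank(occurrences, decay=0.5, proximity=1.0, f=sum)
```

With f = max only the best occurrence of each keyword counts, so repeating a keyword many times inside an element does not inflate its score. With f = sum every occurrence contributes.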
Computing ElemRanks (1/2)
  PageRank
    Sum of 2 probabilities
      Visiting document v at random, e.g. d = 0.85
      Visiting document v by navigating (hyperlink) from document u

The basic idea of PageRank is simple: the more incoming references a page has, the more important it is.
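To make the two probabilities concrete, here is a minimal power-iteration sketch of PageRank. It assumes the usual formulation, in which a document is reached at random with probability 1 - d and by following a hyperlink with probability d. The toy graph and the function name `pagerank` are assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each document to the list of documents it links to.
    Simplifying assumption: every document has at least one outgoing link."""
    docs = list(links)
    rank = {v: 1.0 / len(docs) for v in docs}
    for _ in range(iterations):
        # Random visit: every document gets an equal share of (1 - d).
        new = {v: (1 - d) / len(docs) for v in docs}
        # Navigation: u passes its rank evenly along its outgoing hyperlinks.
        for u, targets in links.items():
            for v in targets:
                new[v] += d * rank[u] / len(targets)
        rank = new
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, c is referenced by two documents and ends up with the highest rank, while b, which nobody references, keeps only the random-visit share.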
Computing ElemRanks (2/2)
  Element-level granularity, sum of 4 probabilities
    $N_d$: the total number of documents
    $N_{de}(v)$: the number of elements in the document containing the element v
    $N_c(u)$: the number of sub-elements of u
    $N_h(u)$: the number of outgoing hyperlinks from element u
    $d_1$: by hyperlink
    $d_2$: by forward containment edge
    $d_3$: by reverse containment edge
  Convergence
    It is said to be proved. But, .... ?
11. <body>
21.   <section>Querying XML ... </section>
22.   <section>Querying HTML .... </section>
23. </body>

The basic idea of ElemRank is the same as that of PageRank, but the granularity of ranking is the element, not the document, and containment edges are distinguished from hyperlink edges. Along containment edges, ElemRank is transferred in both directions; aggregating the forward and reverse flows is similar to HITS.
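The four probabilities can be sketched as a power iteration, in the same spirit as PageRank. This is an assumption-laden sketch rather than the paper's exact formula. It takes the random-jump term to be (1 - d1 - d2 - d3) / (N_d × N_de(v)), the hyperlink term to be d1 × e(u) / N_h(u), the forward containment term to be d2 × e(parent) / N_c(parent), and the reverse containment term to be d3 times the sum of the children's ranks. The parameter values and the toy documents are arbitrary.

```python
from collections import Counter, defaultdict


def elem_rank(elements, parent, doc_of, hyperlinks,
              d1=0.35, d2=0.25, d3=0.25, iterations=50):
    """elements: element ids; parent: child -> parent element;
    doc_of: element -> document id; hyperlinks: (source, target) pairs."""
    n_docs = len(set(doc_of.values()))
    n_de = Counter(doc_of.values())            # elements per document
    n_c = Counter(parent.values())             # sub-elements per element
    n_h = Counter(u for u, _ in hyperlinks)    # outgoing hyperlinks per element
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    incoming = defaultdict(list)
    for u, v in hyperlinks:
        incoming[v].append(u)

    rank = {v: 1.0 / len(elements) for v in elements}
    for _ in range(iterations):
        new = {}
        for v in elements:
            jump = (1 - d1 - d2 - d3) / (n_docs * n_de[doc_of[v]])
            via_link = sum(rank[u] / n_h[u] for u in incoming[v])
            via_parent = rank[parent[v]] / n_c[parent[v]] if v in parent else 0.0
            via_children = sum(rank[c] for c in children[v])
            new[v] = jump + d1 * via_link + d2 * via_parent + d3 * via_children
        rank = new
    return rank


# Two toy documents: r contains a and b, s contains c, and c links to a.
ranks = elem_rank(
    elements=["r", "a", "b", "s", "c"],
    parent={"a": "r", "b": "r", "c": "s"},
    doc_of={"r": 1, "a": 1, "b": 1, "s": 2, "c": 2},
    hyperlinks=[("c", "a")],
)
```

Elements a and b are siblings with identical structure, so the only thing that separates them is the hyperlink from c, which gives a the higher ElemRank.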
Efficiently Evaluating XML Keyword Search Queries (1/8)
  How to produce ranked results efficiently?
    Naive approach
    Dewey Inverted List (DIL)
    Ranked Dewey Inverted List (RDIL)
    Hybrid Dewey Inverted List (HDIL)
  Naive approach
    Treating each element as a document
    Problems
      Space overhead
      Spurious query results
      Inaccurate ranking of results
    E.g. search query keywords
      XQL: 1, 5, 6, 8
      Ricardo: 1, 5, 6, 7
  [Figure: element tree of the workshop document (<workshop>, date, <title>, <editors>, <proceedings>, <paper>, <author>), with elements numbered 1 to 8]

The naive approach treats each element as a document, so the inverted list entry of each element redundantly holds all the keywords of its child elements. This is a large space overhead. The naive approach also ignores ancestor-descendant relationships; in other words, all elements are treated as independent documents, so the results do not reflect the desired semantics for XML keyword search. Finally, the ranking of results is inaccurate because this approach does not take result specificity into account.
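The redundancy and the spurious results are easy to reproduce in a small sketch. The tree below follows the element numbers on the slide, with the assumption that elements 7 and 8 are the author and title under paper 6. The function name `naive_index` is hypothetical.

```python
from collections import defaultdict


def naive_index(parent, text):
    """Index every element as if it were a document: each keyword is posted
    to the element that holds it and to all of that element's ancestors."""
    index = defaultdict(set)
    for elem, words in text.items():
        node = elem
        while node is not None:
            for word in words:
                index[word].add(node)
            node = parent.get(node)   # None once we pass the root
    return {word: sorted(ids) for word, ids in index.items()}


# child -> parent, numbered as in the slide's tree (7 and 8 assumed under 6)
parent = {2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}
text = {8: ["xql"], 7: ["ricardo"]}

index = naive_index(parent, text)
hits = sorted(set(index["xql"]) & set(index["ricardo"]))
print(index["xql"], index["ricardo"], hits)  # [1, 5, 6, 8] [1, 5, 6, 7] [1, 5, 6]
```

Each keyword occurs once in the data but is stored four times, and intersecting the two lists returns elements 1 and 5 in addition to the paper element 6, even though 1 and 5 only contain the keywords through 6.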
Efficiently Evaluating XML Keyword Search Queries (2/8)
  Dewey Inverted List (DIL)
    Dewey ID: the ID of an ancestor is a prefix of the ID of its descendant
    Ancestor-descendant relationships are implicitly captured
    E.g. XQL: 0.3.0.0, Ricardo: 0.3.0.1
  [Figure: element tree of the workshop document labeled with Dewey IDs 0.0, 0.1, 0.2, 0.3, 0.3.0, 0.3.0.0, 0.3.0.1, 0.3.1; XQL occurs in 0.3.0.0 and Ricardo in 0.3.0.1]

The idea of Dewey IDs is not new; it has been used in classification and tree-addressing domains. This paper, however, uses Dewey IDs to support XML keyword search.
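A short sketch shows why Dewey IDs capture ancestor-descendant relationships implicitly: the test is just a prefix comparison. The helper names `parse_dewey` and `is_ancestor` are assumptions for this example, and the IDs are the ones shown on the slide.

```python
def parse_dewey(dewey):
    """Turn a dotted Dewey ID such as '0.3.0.1' into a tuple of integers."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(ancestor, descendant):
    """An element is an ancestor if its Dewey ID is a proper prefix."""
    return (len(ancestor) < len(descendant)
            and descendant[:len(ancestor)] == ancestor)


paper = parse_dewey("0.3.0")
xql = parse_dewey("0.3.0.0")
ricardo = parse_dewey("0.3.0.1")
other_paper = parse_dewey("0.3.1")

print(is_ancestor(paper, xql), is_ancestor(paper, ricardo))  # True True
print(is_ancestor(other_paper, ricardo))                     # False
```

Storing the IDs as integer tuples rather than strings also gives the right sort order for free, since Python compares tuples component by component.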
Efficiently Evaluating XML Keyword Search Queries (3/8)
  Dewey Inverted List (continued)
    DIL data structure
      The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k
      An entry in DIL
        ElemRank
        The list of all positions where the keyword k appears in that element
      Entries are sorted by Dewey ID
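The data structure on this slide could look roughly like the following. This is a minimal in-memory sketch under several assumptions: positions are word offsets within the element's own text, the ElemRank values are made up, and the names `DilEntry` and `build_dil` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DilEntry:
    dewey_id: tuple                 # element that directly contains the keyword
    elem_rank: float                # ElemRank of that element
    positions: list = field(default_factory=list)  # where the keyword appears


def build_dil(elements):
    """elements: iterable of (dewey_id, elem_rank, words directly in the element).
    Returns keyword -> list of DilEntry sorted by Dewey ID."""
    by_keyword = defaultdict(dict)
    for dewey_id, elem_rank, words in elements:
        for position, word in enumerate(words):
            entries = by_keyword[word.lower()]
            if dewey_id not in entries:
                entries[dewey_id] = DilEntry(dewey_id, elem_rank)
            entries[dewey_id].positions.append(position)
    return {keyword: sorted(entries.values(), key=lambda e: e.dewey_id)
            for keyword, entries in by_keyword.items()}


dil = build_dil([
    ((0, 3, 0, 1), 0.5, ["Ricardo", "Baeza-Yates"]),
    ((0, 3, 0, 0), 0.8, ["XQL", "and", "Proximal", "Nodes"]),
    ((0, 1), 0.6, ["XML", "and", "IR"]),
])
```

Unlike the naive index, a keyword is posted only to the element that directly contains it. Ancestors are recovered later from the Dewey ID prefixes.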
Efficiently Evaluating XML Keyword Search Queries (4/8)
  Dewey Inverted List (continued)
    DIL query processing
      Key idea
        Merge the query keyword inverted lists
        Simultaneously compute the longest common prefix of the Dewey IDs in the different lists
  [Figure: element tree of the workshop document labeled with Dewey IDs; Dewey stack before and after a step, with columns Dewey, Rank[1], Rank[2], PosList[1], PosList[2], ContainsAll]

First, a simple merge-join of the query keyword inverted lists cannot be used, because the result IDs have to be inferred from the IDs of descendants. A merge operation is needed because this paper focuses on disjunctive query processing. Second, spurious results must be suppressed. I'll describe an algorithm that merges the query keyword inverted lists.
1. Read the entry with the smallest Dewey ID from the inverted lists: 5.0.3.0.0 [document id 5]
2. Read the next entry: 5.0.3.0.1 [document id 5]
3. Compute the longest common prefix with the current entry
4. If an entry does not contain all the query keywords, its scaled-down ranks are added to its parent (its ContainsAll flag is 0)
Efficiently Evaluating XML Keyword Search Queries (4/8)
  DIL query processing (continued)
  [Figure: Dewey stack before and after processing the next entry (columns Dewey, Rank[1], Rank[2], PosList[1], PosList[2], ContainsAll); the result is compared against the top k; element tree of the workshop document labeled with Dewey IDs]

1. Read the entry with the smallest Dewey ID from the inverted lists: 6.0.3.8.3 [document id 6]
2. Compute the longest common prefix with the current entry
3. Since the longest common prefix with the Dewey stack is empty, all of the stack entries are popped.
4. At this point, since the popped entry does not contain all the query keywords, its scaled-down rank and position lists are copied to its parent (5.0.3.0). Since 5.0.3.0 now contains all the query keywords, its ContainsAll flag is set to true, and it is added to the result heap.
5. The algorithm then pushes 6.0.3.8.3 onto the stack and proceeds as before.
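The first steps of the walkthrough, reading entries in Dewey ID order and computing the longest common prefix, can be sketched as follows. The sketch relies on the observation that the Dewey stack always spells out the previously read Dewey ID, so comparing with the previous entry gives the same prefix length. The helper names are assumptions, and the IDs are the ones used in the example.

```python
import heapq


def common_prefix_len(a, b):
    """Number of leading Dewey components shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def merged_with_lcp(*inverted_lists):
    """Yield (dewey_id, lcp) in global Dewey order, where lcp is the length of
    the longest common prefix with the previously read entry."""
    previous = ()
    for dewey_id in heapq.merge(*inverted_lists):
        yield dewey_id, common_prefix_len(previous, dewey_id)
        previous = dewey_id


xql_list = [(5, 0, 3, 0, 0)]
ricardo_list = [(5, 0, 3, 0, 1), (6, 0, 3, 8, 3)]

steps = list(merged_with_lcp(xql_list, ricardo_list))
```

For the second entry the prefix length is 4, so only the last stack component is popped and the shared ancestor 5.0.3.0 stays on the stack. For 6.0.3.8.3 the prefix is empty, which is why the whole stack is popped in step 3.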
Efficiently Evaluating XML Keyword Search Queries (5/8)
  Ranked Dewey Inverted List (RDIL)
    DIL challenge
      If inverted lists are long (e.g. common keywords or large document collections)
      ⇒ the cost of a single scan of the inverted list can be expensive (users want only the top few results)
    RDIL
      Inverted lists are ordered by ElemRank (cf. DIL: by Dewey ID)
      Each inverted list has a B+-tree index on the Dewey ID field
      Higher-ranked results will appear first in the inverted list
  [Figure: XQL inverted list sorted by ElemRank, with a B+-tree on Dewey ID]
Efficiently Evaluating XML Keyword Search Queries (6/8)
  Ranked Dewey Inverted List (RDIL)
    Threshold Algorithm
  [Figure: the Ricardo inverted list R2 returns P: 9.0.4.2.0; the B+-tree on Dewey ID for the XQL inverted list R1 has leaf entries 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3, and the probe for 9.0.4.2.0 finds 9.0.4.1.2; Rank(9.0.4) goes to the output heap S; threshold = (ElemRank(R1), ElemRank(R2)); threshold < Rank(S): stop!]

Now I'll show an example of RDIL.
1. For the keyword Ricardo, the top-ranked Dewey ID, 9.0.4.2.0, is returned.
2. Look at the leaf nodes of the B+-tree for the XQL inverted list. These nodes are sorted by Dewey ID.
3. Determine the entry containing the keyword XQL that shares the longest common prefix with 9.0.4.2.0 ⇒ 9.0.4.1.2
4. Calculate the rank of the result from the rank values of the two sub-elements.
5. If the rank of the result is greater than the threshold, the RDIL algorithm stops.
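The stopping test in step 5 can be sketched as below. The operator that combines the two ElemRanks into the threshold is lost on the slide. This sketch assumes it is a sum over the current cursor position of each ranked list, which is the usual choice in threshold-style algorithms. The function name and all numbers are hypothetical.

```python
import heapq


def can_stop(result_ranks, cursor_elem_ranks, m):
    """Return True once the m best results found so far all outrank anything
    that could still be produced from unseen inverted list entries.

    result_ranks: overall ranks of the results found so far
    cursor_elem_ranks: ElemRank at the current cursor of each ranked list
    """
    threshold = sum(cursor_elem_ranks)   # assumed aggregation (see lead-in)
    top = heapq.nlargest(m, result_ranks)
    return len(top) == m and top[-1] > threshold


# Two results wanted; the lists have been read down to ElemRanks 0.3 and 0.2.
print(can_stop([0.9, 0.7], [0.3, 0.2], m=2))   # True
print(can_stop([0.9, 0.4], [0.3, 0.2], m=2))   # False
```

Because the lists are sorted by ElemRank, the values at the cursors can only decrease, so the threshold shrinks as the scan proceeds and the test eventually succeeds when enough good results exist.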
Efficiently Evaluating XML Keyword Search Queries (7/8)
  Hybrid Dewey Inverted List (HDIL)
    Motivation
      In many cases, RDIL is likely to perform well.
      It may perform worse than DIL when the query keywords are not correlated
        The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document.
        Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output.
    Combining the benefits of DIL and RDIL
Efficiently Evaluating XML Keyword Search Queries (8/8)
  HDIL (continued)
    An adaptive strategy:
      Calculate the estimated time for RDIL
        Time spent: t
        The number of results above the threshold: r
        Estimated time remaining for RDIL = (m - r) * t / r
        m: desired number of query results
      Estimated time for DIL depends on
        the number of query keywords
        the size of each query keyword inverted list
      If the estimated time for RDIL is more than the expected time for DIL, then switch to DIL.

The adaptive strategy of the Hybrid Dewey Inverted List consists of two steps.
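The switching rule can be written down directly from the formula on the slide. The handling of r = 0 (no results yet, so the estimate is treated as unbounded) and of r >= m is an assumption of this sketch, as are the function names. The DIL estimate is simply passed in, since the slide only says what it depends on.

```python
import math


def rdil_time_remaining(m, r, t):
    """Estimated remaining RDIL time, (m - r) * t / r.

    m: desired number of results, r: results above the threshold so far,
    t: time spent so far."""
    if r >= m:
        return 0.0          # already have enough results
    if r == 0:
        return math.inf     # assumption: no progress yet, no finite estimate
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, estimated_dil_time):
    """Switch when RDIL is expected to take longer than a full DIL scan."""
    return rdil_time_remaining(m, r, t) > estimated_dil_time


# 2 of 10 results found after 4 time units: about 16 more units expected.
print(rdil_time_remaining(10, 2, 4.0))              # 16.0
print(should_switch_to_dil(10, 2, 4.0, 5.0))        # True
```

The estimate extrapolates linearly from the progress so far, so for uncorrelated keywords, where r grows slowly, it quickly exceeds the DIL estimate and triggers the switch.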
Experimental Evaluation
  Setup: data sets
    DBLP (real data, 143MB, depth = 4, many small documents)
    XMARK (synthetic data, 113MB, depth = 10, one large document)
  Quality and Ranking Function
  Space Requirements
  Query Performance: DBLP - High Correlation Keywords

The experimental evaluation shows that XRANK's index structures and query evaluation techniques provide significant space savings and performance gains.
Conclusion
  XRANK is the first system that takes into account
    The hierarchical and hyperlinked structure of XML documents
    A two-dimensional notion of keyword proximity
  Future work
    Open problems:
      Incremental index maintenance
      Integration with structured queries
Thank you
Appendix A. XRANK Architecture
  ElemRank Computation Module
    Computes the ElemRanks of XML elements
    Combined with ancestor info
  HDIL
    Generates an index structure called HDIL
  The Query Evaluator Module
    Evaluates queries using HDIL
    Returns ranked results
Appendix B. DIL Query Processing Algorithm
01. procedure EvaluateQuery (k1, k2, …, kn, m) returns idList
02. // k1 … kn are the query keywords, m is the desired number of query results
03. // invertedList[i] is the inverted list for keyword ki
04. resultHeap = empty; // Initialize the result heap of size m
05. deweyStack = empty; // Initialize the Dewey stack
06. while (eof has not been reached on all inverted lists) {
07.   // Read the next entry from the inverted list having the smallest DeweyID
08.   find ilIndex such that the next entry of invertedList[ilIndex] is the smallest DeweyID
09.   currentEntry = invertedList[ilIndex].nextEntry;
10.   // Find the longest common prefix between deweyStack and currentEntry.deweyId
11.   find largest lcp such that deweyStack[i] = currentEntry.deweyId[i], 1 <= i <= lcp
12.   // Pop non-matching entries in the Dewey stack; add to result heap if appropriate
13.   while (deweyStack.size > lcp) {
14.     stackEntry = deweyStack.pop();
15.     if (stackEntry.posList non-empty for all keywords) {
16.       stackEntry.ContainsAll = true
17.       compute overall rank using formula in Section 2.3.2.2
18.       if overall rank is among top m seen so far, add deweyStack ID to resultHeap
19.     } else if (!stackEntry.ContainsAll) {
20.       deweyStack[deweyStack.size].posList[i] += stackEntry.posList[i] (for all i)
21.       deweyStack[deweyStack.size].rank[i] = rank as in Sec. 2.3.2.1 (for all i)
22.     }
23.     if (stackEntry.ContainsAll) deweyStack[deweyStack.size].containsAll = true
24.   }
25.   // Add non-matching part of currentEntry.deweyId to deweyStack
26.   for (all i such that lcp < i <= currDeweyIdLen) {
27.     deweyStack.push(deweyStackEntry);
28.   }
29.   // Add components to the top entry
30.   deweyStack[currDeweyIdLen].rank[ilIndex] = rank as in Section 2.3.2.1
31.   deweyStack[currDeweyIdLen].posList[ilIndex] += currentEntry.posList;
32. } // End of looping over all inverted lists
33. pop entries of deweyStack and add to result heap if appropriate (similar to lines 12-24)
34. return ids in resultHeap
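For readers who want to run the algorithm, here is a simplified Python sketch of the pseudocode above. It is not the paper's implementation. Position lists and keyword proximity are left out, so the overall rank is just the sum of the per-keyword ranks. Multiple occurrences are aggregated with max, and the decay value is arbitrary. The names `StackEntry` and `evaluate_query` are assumptions.

```python
import heapq
from dataclasses import dataclass


@dataclass
class StackEntry:
    component: int            # one component of the Dewey ID
    ranks: list               # ranks[i]: best rank seen for keyword i, 0.0 if none
    contains_all: bool = False


def evaluate_query(inverted_lists, m, decay=0.5):
    """inverted_lists[i]: (dewey_id, elem_rank) pairs for keyword i, sorted by
    Dewey ID. Returns the top m (score, dewey_id) pairs."""
    n = len(inverted_lists)
    stack, results = [], []

    def pop_entry():
        entry = stack.pop()
        dewey_id = tuple(e.component for e in stack) + (entry.component,)
        if all(rank > 0 for rank in entry.ranks):
            entry.contains_all = True
            results.append((sum(entry.ranks), dewey_id))
        elif not entry.contains_all and stack:
            # Pass scaled-down ranks to the parent (max as aggregation).
            parent = stack[-1]
            for i in range(n):
                parent.ranks[i] = max(parent.ranks[i], entry.ranks[i] * decay)
        if entry.contains_all and stack:
            stack[-1].contains_all = True

    # Tag each entry with its keyword index and merge in Dewey ID order.
    tagged = [[(dewey_id, i, rank) for dewey_id, rank in entries]
              for i, entries in enumerate(inverted_lists)]
    for dewey_id, i, rank in heapq.merge(*tagged):
        lcp = 0
        while (lcp < min(len(stack), len(dewey_id))
               and stack[lcp].component == dewey_id[lcp]):
            lcp += 1
        while len(stack) > lcp:          # pop the non-matching suffix
            pop_entry()
        for component in dewey_id[lcp:]:  # push the new suffix
            stack.append(StackEntry(component, [0.0] * n))
        stack[-1].ranks[i] = max(stack[-1].ranks[i], rank)
    while stack:
        pop_entry()
    return heapq.nlargest(m, results)


xql = [((5, 0, 3, 0, 0), 0.8)]
ricardo = [((5, 0, 3, 0, 1), 0.5), ((6, 0, 3, 8, 3), 0.4)]
top = evaluate_query([xql, ricardo], m=10)
```

On this input the only result is 5.0.3.0, the common parent of the two occurrences in document 5. Its ancestors are marked as containing all keywords but receive no ranks, so they are not reported, and the lone Ricardo occurrence in document 6 never produces a result.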
Appendix C. Experimental Evaluation
  Space Requirements (columns: DBLP, XMARK; sub-columns: Inv. List, Index)
    Naïve-ID    258MB  N/A  872MB
    Naïve-Rank  217MB  527MB
    DIL         144MB  254MB
    RDIL        156MB  209MB
    HDIL        186MB  7MB  307MB  3.2MB
Appendix D. Experimental Evaluation Query Performance: DBLP - Low Correlation Keywords