XRANK: Ranked Keyword Search over XML Documents

Presentation transcript:

XRANK: Ranked Keyword Search over XML Documents
SangJin Lee, sjinlee@snu.ac.kr
Dept. of Industrial Engineering, Seoul National University

Hello. In today's class, I would like to present the conference paper "XRANK". It was published at SIGMOD 2003 in June by Professor Jayavel Shanmugasundaram and co-authors at Cornell University. XRANK is a search engine whose main target is XML documents. It can also be used for HTML documents, but the main target is XML documents, which have a hierarchical structure with many nested sub-elements.

Contents
- Introduction: problem overview
- Data Model and Query Semantics
- Computing ElemRanks
- Efficiently Evaluating XML Keyword Search Queries
- Experimental Evaluation
- Conclusion

The contents of this talk are as follows. First, the introduction defines a new problem: searching over XML documents. Second, the data model and query semantics are defined using graph notation. Then the XML ranking algorithm ElemRank and the algorithms for inverted-list indexing and query processing are explained. Finally, the experiments and conclusion are presented.

Introduction: Problem Overview

<workshop date="28 July 2000">
  <title> XML and IR: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> The recently proposed language … </abstract>
      <body>
        <section name="Introduction">
          Searching on structured text is more important …
        </section>
        <section name="Implementing XML Operations">
          <subsection name="Path Expressions">
            At first sight, the XQL query language looks …
          </subsection>
          …
        </section>
        <cite ref="2">Querying XML in Xyleme</cite>
        <cite xlink="../paper/xmlql/">A Query … </cite>
      </body>
    </paper>
    <paper id="2">
      <title> Querying XML in Xyleme </title>
      …
    </paper>
  </proceedings>
</workshop>

Problem
- Efficiently producing ranked results for keyword search queries over hierarchical XML documents
Challenges
- Nested elements are results: which is the best relevant element?
- Hyperlinks and containment links (parent/child elements)
- Keyword proximity: keyword distance and ancestor distance
E.g. search keywords: XQL language

The problem defined by this paper is efficiently producing ranked results for keyword search queries over hierarchical XML documents. Keyword search is simple enough that users do not have to learn a complex query language such as XQuery. But the hierarchical structure of XML documents raises three challenges. First, in contrast with flat HTML, the results of an XML search are nested elements, so choosing the most relevant element becomes a problem. Second, there are containment links relating parent and child elements, and these are very different from hyperlinks (such as IDREFs and XLinks). Third, proximity among keywords is more complex. For XML there is a two-dimensional proximity metric: keyword distance and ancestor distance.

Data Model and Query Semantics (1/4)

<body>
  <section name="Introduction">
    Searching on structured text is more important …
  </section>
  <section name="Implementing XML Operations">
    <subsection name="Path Expressions">
      At first sight, the XQL query language looks …
    </subsection>
    …
  </section>
  <cite ref="2">Querying XML in Xyleme</cite>
  <cite xlink="../paper/xmlql/">A Query … </cite>
</body>

Definitions
XML document: $G = (N, CE, HE)$
- $N$: the set of nodes, $N = NE \cup NV$
  - $NE$: the set of elements
  - $NV$: the set of values
- $CE$: the set of containment edges relating nodes
- $HE$: the set of hyperlink edges relating nodes
※ contains*(v, k): the node v directly or indirectly contains keyword k

An XML document is defined as a graph G consisting of the node set N, the containment edges CE, and the hyperlink edges HE. The node set N is the union of the element set NE and the value set NV. An element u is a sub-element of an element v if $(v,u) \in CE$. An element u is the parent of node v if $(u,v) \in CE$. The predicate contains*(v, k) is true if the node v directly or indirectly contains the keyword k.

Data Model and Query Semantics (2/4)

Definitions (continued)
Keyword query: $Q = \{k_1, \ldots, k_n\}$
- conjunctive semantics ($k_1 \wedge \ldots \wedge k_n$): contain all of the query keywords
- disjunctive semantics ($k_1 \vee \ldots \vee k_n$): contain at least one of the query keywords (*)
Results
(1) $R_0 = \{ v \mid v \in NE \wedge \forall k \in Q\,(\mathit{contains}^*(v,k)) \}$
    - the elements that directly or indirectly contain all of the query keywords
(2) $\mathit{Result}(Q) = \{ v \mid \forall k \in Q,\ \exists c \in N\ ((v,c) \in CE \wedge c \notin R_0 \wedge \mathit{contains}^*(c,k)) \}$
    - only the most specific results are returned
    - an element that has multiple independent occurrences of the query keywords is returned
※ CE is considered for the result set, HE is considered for ranking

A keyword query Q is composed of multiple keywords. There are two types of query semantics: conjunctive and disjunctive. $R_0$ is the set of elements that directly or indirectly contain all of the query keywords. Result(Q) is the set of elements that contain at least one occurrence of every query keyword after excluding the occurrences in sub-elements that already contain all of the query keywords. This definition ensures that only the most specific results are returned, and that an element with multiple independent occurrences of the query keywords is still returned.
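The two result sets can be sketched on a small tree. This is only an illustration under assumptions: the `Node` class, the helper names, and the whitespace tokenization are hypothetical and not from the paper; each node's `text` stands in for a value node directly under that element.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    text: str = ""  # words of the value node directly under this element
    children: list["Node"] = field(default_factory=list)


def words(node):
    """All keywords the node contains directly or indirectly (contains*)."""
    found = set(node.text.lower().split())
    for child in node.children:
        found |= words(child)
    return found


def all_nodes(root):
    """Pre-order traversal of the element tree."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def result(root, query):
    """Elements in Result(Q): every keyword is reachable through the element's
    own text or through a child that is not itself in R0."""
    q = {k.lower() for k in query}
    r0 = {id(v) for v in all_nodes(root) if q <= words(v)}

    def has(v, k):
        if k in v.text.lower().split():  # value child, never in R0
            return True
        return any(id(c) not in r0 and k in words(c) for c in v.children)

    return [v for v in all_nodes(root) if all(has(v, k) for k in q)]


subsection = Node("subsection", "At first sight the XQL query language looks")
paper = Node("paper", children=[
    Node("title", "XQL and Proximal Nodes"),
    Node("abstract", "The recently proposed language"),
    Node("body", children=[Node("section", children=[subsection])]),
])
print([v.name for v in result(paper, ["XQL", "language"])])
```

Here the subsection is a result because it holds both keywords itself, and the paper is also a result because its title and abstract supply an independent occurrence of each keyword. The section and body are excluded: their only occurrences come from a child that is already in R0.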

Data Model and Query Semantics (3/4)

Ranking function: desired properties
- Result specificity: more specific results rank higher than less specific results
- Keyword proximity: keyword distance and ancestor distance
- Hyperlink awareness: hyperlinks and containment links (parent/child elements)
※ Search keywords: XQL language

<paper id="1">
  <title> XQL and Proximal Nodes </title>
  <author> Ricardo Baeza-Yates </author>
  <author> Gonzalo Navarro </author>
  <abstract> The recently proposed language … </abstract>
  <body>
    <section name="Introduction">
      Searching …
    </section>
    <section name="Implementing XML Operations">
      <subsection name="Path Expressions">
        At first sight, the XQL query language looks …
      </subsection>
      …
    </section>
    <cite ref="2">Querying XML in Xyleme</cite>
    <cite xlink="../paper/xmlql/">A Query … </cite>
  </body>
</paper>
<paper id="2">
  <title> Querying XQL language ... </title>
  …
</paper>

The ranking function should take the proximity of the query keywords into account. The ranking function should also use the hyperlinked structure of XML documents.

Data Model and Query Semantics (4/4)

Ranking function (continued)
ElemRank(v)
- Defined at the granularity of an element
- Takes the nested structure of XML into account
Consider
- Keyword search query $Q = \{k_1, \ldots, k_n\}$
- Results $R = \mathit{Result}(Q)$
- A result element $v_1 \in R$
With respect to one occurrence of keyword $k_i$
- $r(v_1, k_i) = \mathrm{ElemRank}(v_t) \times \mathit{decay}^{\,t-1}$
- $(v_1,v_2), (v_2,v_3), \ldots, (v_t,v_{t+1})$ are containment edges, and $v_{t+1}$ directly contains the keyword $k_i$
With respect to multiple occurrences of $k_i$
- $\hat{r}(v_1, k_i) = f(r_1, r_2, \ldots, r_m)$
- ※ f = max or f = sum
Overall ranking
- $R(v_1, Q) = \left( \sum_{1 \le i \le n} \hat{r}(v_1, k_i) \right) \times p(v_1, k_1, \ldots, k_n)$

ElemRank is proposed to address the desired properties of the ranking function. The rank $r(v_1, k_i)$ depends on the containment edges between the result element and the keyword occurrence, so less specific results get lower ranks through the decay factor. For multiple occurrences of $k_i$ in $v_1$, the combined rank is given by an aggregation function f. In this paper f = max by default, but other functions such as sum can also be used. The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p. Keyword proximity is examined in the next section.
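A small numeric sketch may make the ranking function concrete. It assumes the per-occurrence rank is the ElemRank of the element that directly contains the keyword, scaled by the decay factor once per containment edge up to the result element. The function names, the default decay of 0.75, and the constant proximity of 1.0 are illustrative assumptions, not values from the slides.

```python
def keyword_rank(occurrences, decay=0.75, f=max):
    """Combined rank of one keyword inside a result element.

    occurrences: list of (elem_rank, hops) pairs, where hops is the number of
    containment edges between the result element and the element that
    directly contains the keyword (hops = t - 1 in the slide's notation).
    """
    return f(elem_rank * decay ** hops for elem_rank, hops in occurrences)


def overall_rank(per_keyword, proximity=1.0, decay=0.75, f=max):
    """Sum of the per-keyword ranks, multiplied by the proximity measure."""
    return proximity * sum(keyword_rank(occ, decay, f) for occ in per_keyword)


# Keyword 1 occurs twice (directly, and one level down); keyword 2 once,
# two levels down.
occurrences = [[(0.8, 0), (0.4, 1)], [(0.5, 2)]]
print(overall_rank(occurrences, decay=0.5))          # f = max
print(overall_rank(occurrences, decay=0.5, f=sum))   # f = sum
```

With f = max only the best occurrence of each keyword counts, while f = sum rewards elements that mention a keyword many times.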

Computing ElemRanks (1/2)

PageRank
- $p(v) = \dfrac{1-d}{N_d} + d \times \sum_{(u,v) \in HE} \dfrac{p(u)}{N_h(u)}$
- Sum of 2 probabilities
  - Visiting document v at random (probability 1 - d)
  - Visiting document v by navigating a hyperlink from document u (probability d, e.g. d = 0.85)

The basic idea of PageRank is simple: the more incoming references a page has, the more important the page is.
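As a rough illustration of the two-probability idea, here is a minimal power-iteration sketch of standard PageRank. The graph, the function name, and the choice to simply drop the rank of pages without out-links are assumptions made for brevity.

```python
def pagerank(links, d=0.85, iters=100):
    """links maps each document to the list of documents it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Random-jump term: every document receives (1 - d) / n.
        new = {v: (1.0 - d) / n for v in nodes}
        # Navigation term: u splits d * rank[u] evenly over its out-links.
        for u in nodes:
            targets = links.get(u, [])
            for v in targets:
                new[v] += d * rank[u] / len(targets)
        rank = new
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Document "c" has two incoming links and ends up ranked first, "b" has none and ends up last.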

Computing ElemRanks (2/2)

ElemRank
- Element-level granularity, sum of 4 probabilities
- $e(v) = \dfrac{1-d_1-d_2-d_3}{N_d \times N_{de}(v)} + d_1 \sum_{(u,v) \in HE} \dfrac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \dfrac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u)$
- $N_d$: the total number of documents
- $N_{de}(v)$: the number of elements in the document containing the element v
- $N_c(u)$: the number of sub-elements of u
- $N_h(u)$: the number of outgoing hyperlinks from element u
- $d_1$: probability of navigating by hyperlink
- $d_2$: probability of navigating by forward containment edge
- $d_3$: probability of navigating by reverse containment edge
Convergence
- It is said to be proved. But, .... ?

<body>
  <section>Querying XML ... </section>
  <section>Querying HTML .... </section>
</body>

The basic idea of ElemRank is the same as that of PageRank, but the granularity of ranking is the element, not the document. ElemRank also discriminates between containment edges and hyperlink edges. Along containment edges there is a bi-directional transfer of ElemRanks: the forward and reverse contributions are aggregated, similar to HITS.
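The four-probability idea can be sketched as an iteration over one document. This is a sketch under stated assumptions, not the paper's implementation: it assumes a single document (so the random-jump term is split evenly over its elements), forward containment rank divided among sub-elements, reverse containment rank passed undivided to the parent, and illustrative values for d1, d2 and d3.

```python
def elemrank(children, hyperlinks, d1=0.35, d2=0.25, d3=0.25, iters=100):
    """children: element -> list of sub-elements (containment edges).
    hyperlinks: element -> list of hyperlink targets.
    Single-document sketch: N_d = 1 and N_de = number of elements."""
    elems = set(children) | set(hyperlinks)
    for targets in list(children.values()) + list(hyperlinks.values()):
        elems.update(targets)
    elems = sorted(elems)
    parent = {c: p for p, cs in children.items() for c in cs}
    n = len(elems)
    e = {v: 1.0 / n for v in elems}
    for _ in range(iters):
        new = {v: (1.0 - d1 - d2 - d3) / n for v in elems}  # random jump
        for u in elems:
            for v in hyperlinks.get(u, []):       # hyperlink navigation
                new[v] += d1 * e[u] / len(hyperlinks[u])
            for v in children.get(u, []):         # forward containment
                new[v] += d2 * e[u] / len(children[u])
            if u in parent:                       # reverse containment
                new[parent[u]] += d3 * e[u]
        e = new
    return e


ranks = elemrank(children={"body": ["s1", "s2"]}, hyperlinks={"s2": ["s1"]})
print(ranks)
```

In this toy input both sections receive the same share from their parent, and section s1 additionally receives rank through the hyperlink from s2, so it ends up ranked above s2.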

Efficiently Evaluating XML Keyword Search Queries (1/8)

How to produce ranked results efficiently?
- Naive approach
- Dewey Inverted List (DIL)
- Ranked Dewey Inverted List (RDIL)
- Hybrid Dewey Inverted List (HDIL)
Naive approach
- Treating each element as a document
- Problems
  - Space overhead
  - Spurious query results
  - Inaccurate ranking of results
E.g. search query keywords
- XQL: 1, 5, 6, 8
- Ricardo: 1, 5, 6, 7
[Figure: the workshop element tree with its elements numbered 1 to 8]

The naive approach treats each element as a document, so the inverted lists redundantly record, for each element, all keywords of its child elements. This is a large space overhead. The naive approach also ignores ancestor-descendant relationships. In other words, all elements are treated as independent documents, so the results do not reflect the desired semantics for XML keyword search. Finally, the ranking of results is inaccurate because this approach does not take result specificity into account.
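The redundancy and the spurious results can be reproduced with a few lines. The parent map below is an assumed reading of the slide's numbered tree (6 is the paper, 7 an author, 8 the paper title); the function name is hypothetical.

```python
from collections import defaultdict


def naive_index(parent, text):
    """Index every element as if it were a document: each keyword is posted
    to the element that holds it and to all of that element's ancestors."""
    index = defaultdict(set)
    for elem, content in text.items():
        node = elem
        while node is not None:
            for word in content.lower().split():
                index[word].add(node)
            node = parent.get(node)
    return {word: sorted(ids) for word, ids in index.items()}


# Assumed numbering: 1 workshop, 5 proceedings, 6 paper, 7 author, 8 title.
parent = {2: 1, 3: 1, 4: 1, 5: 1, 6: 5, 7: 6, 8: 6}
index = naive_index(parent, {7: "Ricardo", 8: "XQL and"})
print(index["xql"], index["ricardo"])
print(sorted(set(index["xql"]) & set(index["ricardo"])))
```

Each keyword is stored four times although it occurs once, and intersecting the two lists returns elements 1 and 5 in addition to the paper, even though those ancestors add nothing beyond what the paper element already contains.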

Efficiently Evaluating XML Keyword Search Queries (2/8)

Dewey Inverted List (DIL)
- Dewey ID: the ID of an ancestor is a prefix of the ID of each descendant
- Ancestor-descendant relationships are implicitly captured
E.g.
- XQL: 0.3.0.0
- Ricardo: 0.3.0.1
[Figure: the workshop element tree labeled with Dewey IDs 0.0, 0.1, 0.2, 0.3, 0.3.0, 0.3.0.0 (XQL), 0.3.0.1 (Ricardo), 0.3.1]

The idea of Dewey IDs is not new; it has been used in classification and tree-addressing domains. This paper uses Dewey IDs to support XML keyword search.
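A short sketch of how Dewey IDs capture ancestry. The tuple representation, the `(name, children)` tree encoding, and the function names are assumptions for illustration; the tree mirrors the IDs shown on the slide.

```python
def assign_dewey(tree, prefix=(0,)):
    """tree is (name, [subtrees]); returns {dewey_id_tuple: name}."""
    name, children = tree
    ids = {prefix: name}
    for position, child in enumerate(children):
        ids.update(assign_dewey(child, prefix + (position,)))
    return ids


def is_ancestor(a, d):
    """a is a proper ancestor of d exactly when a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a


def longest_common_prefix(a, b):
    """Dewey ID of the deepest common ancestor of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


workshop = ("workshop", [
    ("date", []), ("title", []), ("editors", []),
    ("proceedings", [("paper", [("title", []), ("author", [])]),
                     ("paper", [])]),
])
ids = assign_dewey(workshop)
print(ids[(0, 3, 0, 1)])
print(longest_common_prefix((0, 3, 0, 0), (0, 3, 0, 1)))
```

The common prefix of the XQL title (0.3.0.0) and the Ricardo author (0.3.0.1) is 0.3.0, the paper element that contains both, and no parent pointers are needed to find it.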

Efficiently Evaluating XML Keyword Search Queries (3/8)

Dewey Inverted List (continued)
DIL data structure
- The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k
- An entry in DIL holds
  - the Dewey ID of the element
  - the ElemRank of the element
  - the list of all positions where the keyword k appears in that element
- Entries are sorted by Dewey ID

Efficiently Evaluating XML Keyword Search Queries (4/8)

Dewey Inverted List (continued)
DIL query processing
- Key idea
  - Merge the query keyword inverted lists
  - Simultaneously compute the longest common prefix of the Dewey IDs in different lists
[Figure: the workshop element tree labeled with Dewey IDs, and two snapshots of the Dewey stack with columns Dewey, Rank[1], Rank[2], PosList[1], PosList[2], ContainsAll]

First, a simple merge-join of the query keyword inverted lists cannot be used, because the result IDs have to be inferred from the IDs of descendants. A merge is still needed because a result has to contain all of the query keywords. Second, spurious results must be suppressed. I'll describe an algorithm that merges the query keyword inverted lists.
1. Read the entry with the smallest Dewey ID from the inverted lists: 5.0.3.0.0 [document id 5].
2. Read the next entry: 5.0.3.0.1 [document id 5].
3. Compute the longest common prefix of the current entry and the Dewey stack.
4. If a popped entry does not contain all the query keywords (its ContainsAll flag is 0), its scaled-down ranks are added to its parent.

Efficiently Evaluating XML Keyword Search Queries (4/8)

DIL query processing (continued)
[Figure: the workshop element tree labeled with Dewey IDs, two snapshots of the Dewey stack with columns Dewey, Rank[1], Rank[2], PosList[1], PosList[2], ContainsAll, and the RESULT heap that is compared against the top k]

1. Read the entry with the smallest Dewey ID from the inverted lists: 6.0.3.8.3 [document id 6].
2. Compute the longest common prefix of the current entry and the Dewey stack.
3. Since the longest common prefix with the Dewey stack is empty, the algorithm pops all of the stack entries.
4. When an entry is popped and does not contain all the query keywords, its scaled-down rank and position lists are copied to its parent (5.0.3.0). Since 5.0.3.0 now contains all the query keywords, its ContainsAll flag is set to true, and it is added to the result heap.
5. The algorithm then pushes 6.0.3.8.3 onto the stack and proceeds as before.

Efficiently Evaluating XML Keyword Search Queries (5/8)

Ranked Dewey Inverted List (RDIL)
DIL challenge
- If inverted lists are long (e.g. common keywords or large document collections), the cost of even a single scan of the inverted lists can be expensive, while users want only the top few results
RDIL
- Inverted lists are ordered by ElemRank (cf. DIL: ordered by Dewey ID)
- Each inverted list has a B+-tree index on the Dewey ID field
- Higher ranked results appear first in the inverted list
[Figure: the XQL inverted list sorted by ElemRank, with a B+-tree on Dewey ID]

Efficiently Evaluating XML Keyword Search Queries (6/8)

Ranked Dewey Inverted List (RDIL)
Threshold Algorithm
[Figure: the Ricardo inverted list with top entry P: 9.0.4.2.0 (R2), the XQL inverted list (R1) with a B+-tree on Dewey ID whose leaves hold 8.2.1.4.2, 9.0.4.1.2, 9.0.5.6, 10.8.3, the candidate Rank(9.0.4), and the output heap S]
- threshold = ElemRank(R1) + ElemRank(R2)
- if threshold < Rank(S), stop

Now I'll show an example of RDIL.
1. For the keyword Ricardo, the top-ranked Dewey ID, 9.0.4.2.0, is returned.
2. Look up the leaf nodes of the B+-tree for the XQL inverted list. These nodes are sorted by Dewey ID.
3. Determine the longest prefix of 9.0.4.2.0 that also contains the keyword XQL. The closest XQL entry is 9.0.4.1.2, which gives the prefix 9.0.4.
4. Calculate the rank of 9.0.4 from the rank values of the two sub-elements.
5. If the rank of the results found so far is greater than the threshold, the RDIL algorithm stops.
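Step 3 of the example can be sketched with a sorted list standing in for the leaf level of the B+-tree (an assumption made to keep the example short; the function names are hypothetical). Because the leaves are in Dewey order, the entry sharing the longest prefix with the probe is always one of the two neighbours of the probe's insertion point.

```python
from bisect import bisect_left


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def deepest_common_ancestor(sorted_ids, probe):
    """Longest prefix of probe shared with any Dewey ID in sorted_ids.

    sorted_ids must be in Dewey (lexicographic) order, like B+-tree leaves.
    """
    i = bisect_left(sorted_ids, probe)
    best = 0
    for j in (i - 1, i):  # only the two neighbours need to be checked
        if 0 <= j < len(sorted_ids):
            best = max(best, common_prefix_len(sorted_ids[j], probe))
    return probe[:best]


xql_leaves = [(8, 2, 1, 4, 2), (9, 0, 4, 1, 2), (9, 0, 5, 6), (10, 8, 3)]
print(deepest_common_ancestor(xql_leaves, (9, 0, 4, 2, 0)))
```

For the probe 9.0.4.2.0 the neighbours are 9.0.4.1.2 and 9.0.5.6, and the deeper shared prefix is 9.0.4, matching the example above.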

Efficiently Evaluating XML Keyword Search Queries (7/8)

Hybrid Dewey Inverted List (HDIL)
Motivation
- In many cases, RDIL is likely to perform well
- It may perform worse than DIL when the query keywords are not correlated
  - The individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document
  - Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output
- HDIL combines the benefits of DIL and RDIL

Efficiently Evaluating XML Keyword Search Queries (8/8)

HDIL (continued)
An adaptive strategy:
- Calculate the estimated time for RDIL
  - Time spent so far: t
  - The number of results above the threshold so far: r
  - m: the desired number of query results
  - Estimated time remaining for RDIL = (m - r) * t / r
- The estimated time for DIL depends on
  - the number of query keywords
  - the size of each query keyword inverted list
- If the estimated remaining time for RDIL is more than the expected time for DIL, then switch to DIL

The adaptive strategy of the Hybrid Dewey Inverted List consists of two steps: estimate the remaining time for RDIL, then compare it with the expected time for DIL.
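The switching rule is simple arithmetic and can be sketched directly. The function names are hypothetical, the DIL estimate is taken as a given number rather than derived from list sizes, and treating r = 0 as an unbounded estimate is an assumption that the slide does not cover.

```python
def rdil_remaining_time(m, r, t):
    """Estimated time left for RDIL: (m - r) * t / r.

    m: desired number of results, r: results found above the threshold so
    far, t: time spent so far. With r == 0 there is no rate to extrapolate
    from, so the estimate is treated as unbounded (an assumption).
    """
    if r == 0:
        return float("inf")
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, dil_estimate):
    """Switch when RDIL is expected to take longer than a full DIL scan."""
    return rdil_remaining_time(m, r, t) > dil_estimate


# After 4 time units RDIL has produced 2 of the 10 desired results.
print(rdil_remaining_time(m=10, r=2, t=4.0))
print(should_switch_to_dil(m=10, r=2, t=4.0, dil_estimate=5.0))
```

If RDIL is producing results quickly (large r for small t) the estimate stays low and RDIL continues; with uncorrelated keywords r grows slowly, the estimate exceeds the DIL cost, and the evaluator falls back to DIL.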

Experimental Evaluation

Setup: data sets
- DBLP (real data, 143MB, depth = 4, many small documents)
- XMARK (synthetic data, 113MB, depth = 10, one large document)
Topics
- Quality of the ranking function
- Space requirements
- Query performance: DBLP - High Correlation Keywords

The experimental evaluation shows that XRANK's index structures and query evaluation techniques provide significant space savings and performance gains.

Conclusion

XRANK is the first system that takes into account
- the hierarchical and hyperlinked structure of XML documents
- a two-dimensional notion of keyword proximity
Future work (open problems)
- Incremental index maintenance
- Integration with structured queries

Thank you

Appendix A. XRANK Architecture

- ElemRank Computation Module
  - Computes the ElemRanks of XML elements
  - The ElemRanks are combined with ancestor information
- HDIL
  - Generates an index structure called HDIL
- Query Evaluator Module
  - Evaluates queries using HDIL
  - Returns ranked results

Appendix B. DIL Query Processing Algorithm

01. procedure EvaluateQuery (k1, k2, …, kn, m) returns idList
02.   // k1 … kn are the query keywords, m is the desired number of query results
03.   // invertedList[i] is the inverted list for keyword ki
04.   resultHeap = empty; // Initialize the result heap of size m
05.   deweyStack = empty; // Initialize the Dewey stack
06.   while (eof has not been reached on all inverted lists) {
07.     // Read the next entry from the inverted list having the smallest DeweyID
08.     find ilIndex such that the next entry of invertedList[ilIndex] is the smallest DeweyID
09.     currentEntry = invertedList[ilIndex].nextEntry;
10.     // Find the longest common prefix between deweyStack and currentEntry.deweyId
11.     find largest lcp such that deweyStack[i] = currentEntry.deweyId[i], 1 <= i <= lcp
12.     // Pop non-matching entries in the Dewey stack; add to result heap if appropriate
13.     while (deweyStack.size > lcp) {
14.       stackEntry = deweyStack.pop();
15.       if (stackEntry.posList non-empty for all keywords) {
16.         stackEntry.ContainsAll = true
17.         compute overall rank using formula in Section 2.3.2.2
18.         if overall rank is among top m seen so far, add deweyStack ID to resultHeap
19.       } else if (!stackEntry.ContainsAll) {
20.         deweyStack[deweyStack.size].posList[i] += stackEntry.posList[i] (for all i)
21.         deweyStack[deweyStack.size].rank[i] = rank as in Sec. 2.3.2.1 (for all i)
22.       }
23.       if (stackEntry.ContainsAll) deweyStack[deweyStack.size].containsAll = true
24.     }
25.     // Add non-matching part of currentEntry.deweyId to deweyStack
26.     for (all i such that lcp < i <= currDeweyIdLen) {
27.       deweyStack.push(deweyStackEntry);
28.     }
29.     // Add components to the top entry
30.     deweyStack[currDeweyIdLen].rank[ilIndex] = rank as in Section 2.3.2.1
31.     deweyStack[currDeweyIdLen].posList[ilIndex] += currentEntry.posList;
32.   } // End of looping over all inverted lists
33.   pop entries of deweyStack and add to result heap if appropriate (similar to lines 12-24)
34.   return ids in resultHeap
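The pseudocode can be turned into a compact runnable sketch. Several things here are assumptions rather than the paper's implementation: Dewey IDs are tuples, each inverted-list entry is `(dewey_id, elem_rank, positions)`, the per-keyword aggregation is max, the keyword proximity factor is fixed at 1.0, and the ranks and decay value in the example are made up.

```python
import heapq


def dil_query(inverted_lists, m, decay=0.75):
    """Single-pass merge of Dewey-ordered inverted lists with a Dewey stack.

    Returns the top-m (overall_rank, dewey_id) pairs, best first.
    """
    n = len(inverted_lists)
    streams = [[(dewey, k, rank, pos) for dewey, rank, pos in entries]
               for k, entries in enumerate(inverted_lists)]
    stack = []    # one frame per Dewey component of the current path
    results = []  # min-heap holding at most m results

    def pop_frame():
        frame = stack.pop()
        dewey = tuple(f["comp"] for f in stack) + (frame["comp"],)
        if all(frame["pos"]):  # a position list exists for every keyword
            frame["all"] = True
            heapq.heappush(results, (sum(frame["rank"]), dewey))
            if len(results) > m:
                heapq.heappop(results)
        elif not frame["all"] and stack:
            parent = stack[-1]  # hand occurrences up, with decayed ranks
            for i in range(n):
                parent["pos"][i] += frame["pos"][i]
                parent["rank"][i] = max(parent["rank"][i],
                                        frame["rank"][i] * decay)
        if frame["all"] and stack:
            stack[-1]["all"] = True

    for dewey, k, rank, pos in heapq.merge(*streams):
        lcp = 0
        while (lcp < min(len(stack), len(dewey))
               and stack[lcp]["comp"] == dewey[lcp]):
            lcp += 1
        while len(stack) > lcp:       # pop the non-matching suffix
            pop_frame()
        for comp in dewey[lcp:]:      # push the new components
            stack.append({"comp": comp, "rank": [0.0] * n,
                          "pos": [[] for _ in range(n)], "all": False})
        stack[-1]["rank"][k] = max(stack[-1]["rank"][k], rank)
        stack[-1]["pos"][k] += pos
    while stack:
        pop_frame()
    return sorted(results, reverse=True)


xql = [((5, 0, 3, 0, 0), 0.8, [32]), ((6, 0, 3, 8, 3), 0.9, [10])]
ricardo = [((5, 0, 3, 0, 1), 0.6, [38])]
print(dil_query([xql, ricardo], m=10, decay=0.5))
```

On this input the only result is 5.0.3.0: it receives the XQL occurrence from 5.0.3.0.0 and the Ricardo occurrence from 5.0.3.0.1, each with a decayed rank. Its ancestors are marked ContainsAll and therefore suppressed, and document 6 never collects both keywords.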

Appendix C. Experimental Evaluation

Space Requirements

              DBLP                    XMARK
              Inv. List    Index      Inv. List    Index
Naïve-ID      258MB        N/A        872MB
Naïve-Rank    217MB                   527MB
DIL           144MB                   254MB
RDIL          156MB                   209MB
HDIL          186MB        7MB        307MB        3.2MB

Appendix D. Experimental Evaluation Query Performance: DBLP - Low Correlation Keywords