CS 430 / INFO 430 Information Retrieval, Lecture 8: Query Refinement and Relevance Feedback

Similar presentations
Relevance Feedback User tells system whether returned/disseminated documents are relevant to query/information need or not Feedback: usually positive sometimes.

Improvements and extras Paul Thomas CSIRO. Overview of the lectures: 1. Introduction to information retrieval (IR) 2. Ranked retrieval 3. Probabilistic retrieval.
Chapter 5: Introduction to Information Retrieval
Introduction to Information Retrieval
Optimizing search engines using clickthrough data
CS 430 / INFO 430 Information Retrieval
UCLA : GSE&IS : Department of Information Studies : JF : 276lec1.ppt : 5/2/2015 : INFS Information Retrieval Systems Week.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Lecture 11 Search, Corpora Characteristics, & Lucene Introduction.
K nearest neighbor and Rocchio algorithm
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement: Relevance Feedback Information Filtering.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Database Management Systems, R. Ramakrishnan. Computing Relevance, Similarity: The Vector Space Model. Chapter 27, Part B. Based on Larson and Hearst's slides.
WMES3103 : INFORMATION RETRIEVAL
1 CS 430 / INFO 430 Information Retrieval Lecture 4 Searching Full Text 4.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Modern Information Retrieval Chapter 5 Query Operations.
1 Query Language Baeza-Yates and Navarro Modern Information Retrieval, 1999 Chapter 4.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.
Query Reformulation: User Relevance Feedback. Introduction Difficulty of formulating user queries –Users have insufficient knowledge of the collection.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
Presented by Zeehasham Rasheed
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Probabilistic Information Retrieval.
Ranking by Odds Ratio A Probability Model Approach: let R be a Boolean random variable: R = 1 if document d is relevant to query q, 0 otherwise. Consider document d as.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Chapter 5: Information Retrieval and Web Search
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
Processing of large document collections Part 2 (Text categorization, term selection) Helena Ahonen-Myka Spring 2005.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Query Operations J. H. Wang Mar. 26, The Retrieval Process User Interface Text Operations Query Operations Indexing Searching Ranking Index Text.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Universität Dortmund, LS VIII
CS 430: Information Discovery
1 Computing Relevance, Similarity: The Vector Space Model.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
CPSC 404, Laks V.S. Lakshmanan. Computing Relevance, Similarity: The Vector Space Model. Chapter 27, Part B. Based on Larson and Hearst's slides at UC-Berkeley.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
IR Theory: Relevance Feedback. Relevance Feedback: Example. Initial Results, Search Engine 2.
For: CS590 Intelligent Systems Related Subject Areas: Artificial Intelligence, Graphs, Epistemology, Knowledge Management and Information Filtering Application.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Chapter 23: Probabilistic Language Models April 13, 2004.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Chapter 8: Evaluating Search Engines. Evaluation is key to building effective and efficient search engines. Measurement usually carried out.
C.Watterscsci64031 Probabilistic Retrieval Model.
Information Retrieval
Text Categorization With Support Vector Machines: Learning With Many Relevant Features By Thorsten Joachims Presented By Meghneel Gore.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
Relevance Feedback Prof. Marti Hearst SIMS 202, Lecture 24.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Query Refinement and Relevance Feedback.
1 CS 430: Information Discovery Lecture 21 Interactive Retrieval.
Information Retrieval: Inverted Files. Document Vectors as Points on a Surface. Normalize all document vectors to be of length 1. Define d' = Then the ends.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Lecture 12: Relevance Feedback & Query Expansion - II
Text Based Information Retrieval
Multimedia Information Retrieval
Relevance Feedback Hongning Wang
Text Categorization Assigning documents to a fixed set of categories
Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement and Relevance Feedback

2 Course Administration
Assignment Reports: A sample report will be posted before the next assignment is due.
Preparation for Discussion Classes: Most of the readings were used last year. You can see the questions that were used on last year's Web site:

3 CS 430 / INFO 430 Information Retrieval Completion of Lecture 7

4 Search for Substring In some information retrieval applications, any substring can be a search term. Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.

5 Tries: Search for Substring
Basic concept:
- The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text and continues to the right until it is unique.
- The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.
- Each sistring can be associated with a location within a document where the sistring occurs.
- The subtree below a node represents all occurrences of the substring represented by that node.
- Suffix trees have a size of the same order of magnitude as the input documents.
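
As a concrete illustration, here is a minimal Python sketch of the sistring idea. It uses a sorted suffix array as a lightweight stand-in for the suffix tree: sorting the suffixes groups all occurrences of a substring into one contiguous block, just as a subtree does. The function names are ours, and the key argument to bisect requires Python 3.10 or later.

```python
import bisect

def build_sistring_index(text):
    """Sort the starting positions of all suffixes (sistrings) by the
    suffix they denote; this is a suffix array, a compact stand-in for
    the leaves of a suffix tree."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_substring(text, index, pattern):
    """Return every position where pattern occurs. All suffixes that
    begin with pattern form one contiguous block of the sorted index,
    analogous to the subtree below the pattern's node in a trie."""
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect.bisect_left(index, pattern, key=key)
    hi = bisect.bisect_right(index, pattern, key=key)
    return sorted(index[lo:hi])

text = "begin beginning between bread break"
index = build_sistring_index(text)
print(find_substring(text, index, "bre"))   # [24, 30]: 'bread' and 'break'
```

Note the trade-off: building the index this way is simple but not linear-time; true suffix trees can be constructed in time proportional to the length of the text.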

6 Tries: Suffix Tree
Example: suffix tree for the words begin, beginning, between, bread, break.
[Figure: tree with root edge b; below it, edges e and rea; below e, edges gin and tween; below gin, null (begin) and ning (beginning); below rea, d (bread) and k (break)]

7 Tries: Sistrings
A binary example.
[Figure: a string of binary digits and the sistrings derived from it]

8 Tries: Lexical Ordering
[Figure: the sistrings in lexical order, with the unique string indicated in blue]

9 Trie: Basic Concept

10 Patricia Tree
Single-descendant nodes are eliminated. Each remaining node stores a bit number: the position of the bit to test when searching.
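
To make path compression concrete, here is a small Python sketch (names are ours). It compresses a character-level trie by collapsing chains of single-descendant nodes into one multi-character edge label; a true Patricia tree goes one step further and stores only the bit number to test at each node, but the structural idea is the same.

```python
def build_trie(words):
    """Plain character trie; '$' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def compress(node):
    """Eliminate single-descendant nodes by merging each chain of
    one-child nodes into a single multi-character edge label."""
    out = {}
    for label, child in node.items():
        while len(child) == 1 and "$" not in child:
            (next_label, child), = child.items()
            label += next_label
        out[label] = compress(child)
    return out

words = ["begin", "beginning", "between", "bread", "break"]
print(compress(build_trie(words)))
# {'b': {'e': {'gin': {'$': {}, 'ning': {'$': {}}}, 'tween': {'$': {}}},
#        'rea': {'d': {'$': {}}, 'k': {'$': {}}}}}
```

The compressed edges (b, e, gin, ning, tween, rea, d, k) match the suffix-tree example on slide 6.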

11 CS 430 / INFO 430 Information Retrieval Lecture 8 Query Refinement and Relevance Feedback

12 Query Refinement
[Figure: flow diagram; query formulation produces a new query, which goes to search; search results go to display of retrieved information; from there the user either EXITs or reformulates the query, sending a reformulated query back to search]

13 Reformulation of Query
Manual:
- Add or remove search terms
- Change Boolean operators
- Change wild cards
Automatic:
- Remove search terms
- Change weighting of search terms
- Add new search terms

14 Manual Reformulation: Vocabulary Tools
Feedback:
- Information about stop lists, stemming, etc.
- Numbers of hits on each term or phrase
Suggestions:
- Thesaurus
- Browse lists of terms in the inverted index
- Controlled vocabulary

15 Manual Reformulation: Document Tools
Feedback to the user consists of document excerpts or surrogates, which show how the system has interpreted the query.
Effective at suggesting how to restrict a search: shows examples of false hits.
Less good at suggesting how to expand a search: gives no examples of missed items.

16 Relevance Feedback: Document Vectors as Points on a Surface
- Normalize all document vectors to be of length 1. The ends of the vectors then all lie on a surface with unit radius.
- For similar documents, we can represent parts of this surface as a flat region.
- Similar documents are represented as points that are close together on this surface.
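
A short NumPy sketch of the normalization step (the function name is ours); once every vector has length 1, cosine similarity reduces to a plain dot product:

```python
import numpy as np

def normalize_rows(doc_vectors):
    """Scale each row (document vector) to unit length so that all
    documents lie on the unit sphere; zero vectors are left unchanged."""
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return doc_vectors / np.where(norms == 0.0, 1.0, norms)

docs = np.array([[2.0, 0.0, 1.0],
                 [0.0, 3.0, 3.0]])
unit_docs = normalize_rows(docs)
print(np.linalg.norm(unit_docs, axis=1))   # [1. 1.]
```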

17 Results of a Search
[Figure: documents plotted as points; x marks documents found by the search, a filled dot marks the query, and an oval encloses the hits from the search]

18 Relevance Feedback (Concept)
[Figure: the hits from the original search are enclosed in an oval; x marks documents identified by the user as non-relevant, o marks documents identified by the user as relevant, a filled dot marks the original query, and an arrow leads to the reformulated query]

19 Theoretically Best Query
[Figure: x marks non-relevant documents, o marks relevant documents, and a filled dot marks the optimal query, placed among the relevant documents]

20 Theoretically Best Query
For a specific query q, let:
- D_R be the set of all relevant documents
- D_N-R be the set of all non-relevant documents
- sim(q, D_R) be the mean similarity between query q and documents in D_R
- sim(q, D_N-R) be the mean similarity between query q and documents in D_N-R
The theoretically best query would maximize:
F = sim(q, D_R) - sim(q, D_N-R)
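
The objective F is straightforward to compute when the two sets are known. A sketch, assuming unit-length NumPy vectors so that the dot product is the cosine similarity (the function name is ours):

```python
import numpy as np

def feedback_objective(q, relevant, non_relevant):
    """F = sim(q, D_R) - sim(q, D_N-R): mean similarity to the relevant
    documents minus mean similarity to the non-relevant documents.
    relevant and non_relevant hold one document vector per row."""
    return float(np.mean(relevant @ q) - np.mean(non_relevant @ q))
```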

21 Estimating the Best Query
In practice, D_R and D_N-R are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(q, D_R) and sim(q, D_N-R).

22 Rocchio's Modified Query
Modified query vector =
original query vector
+ mean of relevant documents found by the original query
- mean of non-relevant documents found by the original query

23 Query Modification
q_1 = q_0 + (1/n_1) Σ_{i=1}^{n_1} r_i - (1/n_2) Σ_{i=1}^{n_2} s_i
where:
- q_0 = vector for the initial query
- q_1 = vector for the modified query
- r_i = vector for relevant document i
- s_i = vector for non-relevant document i
- n_1 = number of relevant documents
- n_2 = number of non-relevant documents
(Rocchio 1971)
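
In matrix form the update is a one-liner. A minimal sketch, assuming relevant and non_relevant are NumPy matrices with one term-weighted document vector per row:

```python
import numpy as np

def rocchio(q0, relevant, non_relevant):
    """Rocchio (1971): add the centroid of the relevant documents found
    by the original query and subtract the centroid of the non-relevant
    ones."""
    return q0 + relevant.mean(axis=0) - non_relevant.mean(axis=0)
```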

24 Difficulties with Relevance Feedback
[Figure: x marks non-relevant documents, o marks relevant documents, a filled dot marks the optimal query; the original query and the reformulated query are also shown]
Hits from the initial query are contained in the gray shaded area.

25 Difficulties with Relevance Feedback
[Figure: the same documents, with a filled dot marking the optimal results set]
What region provides the optimal results set?

26 Effectiveness of Relevance Feedback
Relevance feedback works best when:
- Relevant documents are tightly clustered (similarities among them are large)
- Similarities between relevant and non-relevant documents are small

27 When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching:
- Formulate queries thoughtfully, with many terms
- Review results carefully to provide feedback
- Iterate several times
- Combine automatic query enhancement with studies of thesauruses and other manual enhancements

28 Adjusting Parameters 1: Relevance Feedback
q_1 = α q_0 + (β/n_1) Σ_{i=1}^{n_1} r_i - (γ/n_2) Σ_{i=1}^{n_2} s_i
α, β and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.
If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
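
A sketch of the weighted form. The default weights below (beta = 0.75, gamma = 0.15) are common textbook choices, not values from this lecture; setting gamma = 0 gives positive-only feedback, and setting beta = 0 leaves only the negative-feedback term:

```python
import numpy as np

def rocchio_weighted(q0, relevant, non_relevant,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """Weighted Rocchio update: q1 = alpha*q0 + beta*mean(relevant)
    - gamma*mean(non_relevant), guarding against empty feedback sets."""
    q1 = alpha * q0
    if len(relevant):
        q1 = q1 + beta * relevant.mean(axis=0)
    if len(non_relevant):
        q1 = q1 - gamma * non_relevant.mean(axis=0)
    return q1
```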

29 Adjusting Parameters 2: Filtering Incoming Messages
D_1, D_2, D_3, ... is a stream of incoming documents that are to be divided into two sets:
R: documents judged relevant to an information need
S: documents judged not relevant to the information need
A query is defined as a vector in the term vector space:
q = (w_1, w_2, ..., w_n), where w_i is the weight given to term i
D_j will be assigned to R if similarity(q, D_j) > t, for some threshold t.
What is the optimal query, i.e., what are the optimal values of the w_i?
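
A sketch of the routing rule. The threshold symbol did not survive in the slide text, so the placeholder value 0.5 and the function name below are ours:

```python
import numpy as np

def route_document(q, doc, threshold=0.5):
    """Assign an incoming document to R if its cosine similarity with
    the profile query q exceeds the threshold, otherwise to S."""
    similarity = (q @ doc) / (np.linalg.norm(q) * np.linalg.norm(doc))
    return "R" if similarity > threshold else "S"
```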

30 Seeking Optimal Parameters
Theoretical approach:
- Develop a theoretical model
- Derive parameters
- Test with users
Heuristic approach:
- Develop a heuristic
- Vary parameters
- Test with users
Machine learning approach

31 Seeking Optimal Parameters using Machine Learning
General:
- Input: training examples
- Design space
- Training: automatically find the solution in the design space that works well on the training data
- Prediction: predict well on new examples
Example: Text Retrieval
- Input: queries with relevance judgments
- Design space: parameters of the retrieval function
- Training: find parameters so that many relevant documents are ranked highly
- Prediction: rank relevant documents high also for new queries
Joachims

32 Machine Learning: Tasks and Applications
- Task: Text Routing. Application: Help-Desk Support: Who is an appropriate expert for a particular problem?
- Task: Information Filtering. Application: Information Agents: Which news articles are interesting to a particular person?
- Task: Relevance Feedback. Application: Information Retrieval: What are other documents relevant for a particular query?
- Task: Text Categorization. Application: Knowledge Management: Organizing a document database by semantic categories.

33 Learning to Rank
Assume:
- a distribution of queries P(q)
- a distribution of target rankings for each query, P(r | q)
Given:
- a collection D of documents
- an independent, identically distributed training sample (q_i, r_i)
Design:
- a set of ranking functions F
- a loss function l(r_a, r_b)
- a learning algorithm
Goal: find f ∈ F that minimizes the loss l(f(q), r) integrated across all queries

34 A Loss Function for Rankings
For two orderings r_a and r_b, a pair is:
- concordant, if r_a and r_b agree in their ordering; P = number of concordant pairs
- discordant, if r_a and r_b disagree in their ordering; Q = number of discordant pairs
Loss function: l(r_a, r_b) = Q
Example:
r_a = (a, c, d, b, e, f, g, h)
r_b = (a, b, c, d, e, f, g, h)
The discordant pairs are (c, b) and (d, b), so l(r_a, r_b) = 2.
Joachims
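
The loss is easy to compute directly. A sketch (quadratic in the ranking length, written for clarity rather than speed) that reproduces the slide's example:

```python
from itertools import combinations

def ranking_loss(r_a, r_b):
    """l(r_a, r_b) = Q, the number of pairs that r_a and r_b order
    differently (discordant pairs)."""
    position_in_b = {item: i for i, item in enumerate(r_b)}
    return sum(1 for x, y in combinations(r_a, 2)
               if position_in_b[x] > position_in_b[y])

r_a = list("acdbefgh")
r_b = list("abcdefgh")
print(ranking_loss(r_a, r_b))   # 2: the discordant pairs (c, b) and (d, b)
```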

35 Machine Learning: Algorithms
The choice of algorithm is a subject of active research, covered in several courses, notably CS 478 and CS/INFO 630. Some effective methods include:
- Naive Bayes
- Rocchio algorithm
- C4.5 decision tree
- k-nearest neighbors
- Support vector machines

36 Relevance Feedback: Clickthrough Data
Relevance feedback methods have suffered from the unwillingness of users to provide feedback. Joachims and others have developed methods that use clickthrough data from online searches.
Concept: Suppose that a query delivers a set of hits to a user. If the user skips a link a and clicks on a link b that is ranked lower, then the user's preference reflects rank(b) < rank(a).

37 Clickthrough Example
Ranking presented to user:
1. Kernel Machines
2. Support Vector Machine
3. SVM-Light Support Vector Machine
4. An Introduction to Support Vector Machines
5. Support Vector Machine and Kernel ...
The user clicks on 1, 3 and 4, skipping 2.
Inferred preferences: (3 < 2) and (4 < 2)
Joachims
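
A sketch of extracting these pairwise preferences from a click log (the function name is ours; the titles are abbreviated from the slide). For each clicked result, every higher-ranked result the user skipped yields one preference:

```python
def clickthrough_preferences(ranking, clicked):
    """Return pairs (b, a) meaning rank(b) should be less than rank(a):
    the user clicked the result at position b but skipped the
    higher-ranked result at position a."""
    preferences = []
    for pos, link in enumerate(ranking, start=1):
        if link in clicked:
            preferences.extend(
                (pos, earlier) for earlier in range(1, pos)
                if ranking[earlier - 1] not in clicked)
    return preferences

ranking = ["Kernel Machines", "Support Vector Machine",
           "SVM-Light Support Vector Machine",
           "An Introduction to Support Vector Machines",
           "Support Vector Machine and Kernel ..."]
clicked = {"Kernel Machines", "SVM-Light Support Vector Machine",
           "An Introduction to Support Vector Machines"}
print(clickthrough_preferences(ranking, clicked))   # [(3, 2), (4, 2)]
```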