LANGUAGE MODELS FOR RELEVANCE FEEDBACK (2003.07.14) Lee Won Hee

Presentation transcript:

LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee

2 Abstract
• The language modeling approach to IR
- Query: a random event
- Documents: ranked according to the likelihood of generating the query
• Assumes users have a prototypical document in mind and will choose query terms accordingly
• Inferences about the semantic content of documents do not need to be made, resulting in a conceptually simpler retrieval model

3 1. Introduction
• The language modeling approach to IR
- Developed by Ponte and Croft, 1998
- Query: a random event generated according to a probability distribution
- Document similarity: estimate a model of the term generation probabilities for the query terms for each document, then rank the documents according to the probability of generating the query
• The main advantages of the language modeling approach
- Document boundaries are not predefined; the document-level statistics of tf and idf are used
- Uncertainty is modeled by probabilities, which suits noisy data such as OCR text and automatically recognized speech transcripts
- It extends naturally to relevance feedback and document routing

4 2. The Language Modeling Approach to IR
• The query generation probability
• The probability is estimated starting with the maximum likelihood estimate of the probability of term t in document d:
  P_ml(t|M_d) = tf(t,d) / dl_d
- tf(t,d): the raw term frequency of term t in document d
- dl_d: the total number of tokens in document d
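To make the estimate concrete, here is a minimal sketch (not from the original slides), assuming documents arrive as pre-tokenized lists of terms:

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate: P_ml(t|M_d) = tf(t,d) / dl_d."""
    tf = Counter(doc_tokens)
    return tf[term] / len(doc_tokens)

doc = "the quick brown fox jumps over the lazy dog".split()
print(p_ml("the", doc))  # 2/9, "the" occurs twice in 9 tokens
print(p_ml("cat", doc))  # 0.0, the zero-probability problem of Section 2.1
```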

5 2.1 Insufficient Data
• Two problems with the maximum likelihood estimator
• We do not wish to assign a probability of zero to a document that is missing one or more of the query terms
- If a user included several synonyms in the query, a document missing even one of them would not be retrieved
- A more reasonable distribution for an unseen term is its collection frequency: cf_t / cs
• We only have a document-sized sample from the underlying distribution, so the variation in the raw counts may partially be accounted for by randomness
- cf_t: the raw count of term t in the collection
- cs: the raw collection size, i.e. the total number of tokens in the collection
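A sketch of the collection-frequency fallback under the same tokenized-document assumption; cf_t and cs are accumulated over the whole (toy) collection:

```python
from collections import Counter

collection = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "a fox and a dog".split(),
]

cf = Counter()         # cf_t: raw count of each term across the collection
for doc in collection:
    cf.update(doc)
cs = sum(cf.values())  # cs: total number of tokens in the collection

def p_collection(term):
    """Fallback estimate cf_t / cs for a term unseen in a document."""
    return cf[term] / cs

print(p_collection("fox"))  # 2/13
```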

6 2.2 Averaging
• The mean probability estimate of t in the documents containing it:
  P_avg(t) = ( Σ_{d : tf(t,d) > 0} P_ml(t|M_d) ) / df_t
- circumvents the problem of insufficient data
- carries some risk: if the mean were used by itself, there would be no distinction between documents with different term frequencies
• Combining the two estimates using a risk function based on the geometric distribution (Ghosh et al.: robustness of estimation, minimizing the risk):
  R(t,d) = ( 1 / (1 + f̄_t) ) × ( f̄_t / (1 + f̄_t) )^tf(t,d)
- df_t: the document frequency of t
- f̄_t: the mean term frequency of term t in the documents containing it
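A sketch of the mean estimate and the geometric risk function built from the definitions above; the toy collection and helper names are illustrative:

```python
from collections import Counter

collection = [
    "the quick brown fox".split(),
    "the lazy dog sleeps".split(),
    "the fox and the dog".split(),
]
tfs = [Counter(doc) for doc in collection]

def p_avg(term):
    """Mean of P_ml(t|M_d) over the documents that contain t."""
    probs = [tf[term] / sum(tf.values()) for tf in tfs if tf[term] > 0]
    return sum(probs) / len(probs)

def mean_tf(term):
    """f-bar_t: mean raw frequency of t in the documents containing it."""
    freqs = [tf[term] for tf in tfs if tf[term] > 0]
    return sum(freqs) / len(freqs)

def risk(term, tf_td):
    """R(t,d) = (1/(1+f)) * (f/(1+f))**tf(t,d), with f = f-bar_t."""
    f = mean_tf(term)
    return (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf_td

print(p_avg("fox"), risk("fox", 1))  # 0.225 and 0.25 on this toy data
```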

7 2.3 Combining the Two Estimates
• The combined term estimate, mixed according to the risk:
  P̂(t|M_d) = P_ml(t|M_d)^(1 − R(t,d)) × P_avg(t)^R(t,d)   if tf(t,d) > 0
  P̂(t|M_d) = cf_t / cs   otherwise
• The estimate of the probability of producing the query for a given document model:
  P̂(Q|M_d) = Π_{t ∈ Q} P̂(t|M_d) × Π_{t ∉ Q} (1 − P̂(t|M_d))
- first term: the probability of producing the terms in the query
- second term: the probability of not producing other terms
- together these favor terms that are better discriminators of the document
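Putting the pieces together, a sketch of the full query-likelihood score. Working in log space to avoid underflow is a choice of this sketch, not something the slides specify; unseen query terms fall back to cf_t/cs as above:

```python
import math
from collections import Counter

collection = [
    "the quick brown fox jumps".split(),
    "the lazy dog sleeps all day".split(),
    "the fox chases the dog".split(),
]
tfs = [Counter(d) for d in collection]
dls = [sum(tf.values()) for tf in tfs]
cf = Counter()
for d in collection:
    cf.update(d)
cs = sum(cf.values())
vocab = set(cf)

def p_avg(t):
    ps = [tfs[i][t] / dls[i] for i in range(len(tfs)) if tfs[i][t] > 0]
    return sum(ps) / len(ps)

def mean_tf(t):
    fs = [tf[t] for tf in tfs if tf[t] > 0]
    return sum(fs) / len(fs)

def p_hat(t, i):
    """Combined estimate: ML/average mixture if tf > 0, else cf_t/cs."""
    tf_td = tfs[i][t]
    if tf_td > 0:
        f = mean_tf(t)
        r = (1 / (1 + f)) * (f / (1 + f)) ** tf_td
        return (tf_td / dls[i]) ** (1 - r) * p_avg(t) ** r
    return cf[t] / cs

def log_p_query(query_terms, i):
    """log P(Q|M_d): produce the query terms, fail to produce the rest."""
    q = set(query_terms)
    score = sum(math.log(p_hat(t, i)) for t in q)
    score += sum(math.log(1 - p_hat(t, i)) for t in vocab - q)
    return score

query = "fox dog".split()
ranking = sorted(range(len(collection)), key=lambda i: -log_p_query(query, i))
print(ranking)  # document indices, best match first
```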

8 3. Related Work
1. The Harper and van Rijsbergen model
2. The Rocchio method
3. The INQUERY model
4. Exponential models

9 3.1 The Harper and van Rijsbergen Model (1978)
• Goal: obtain better estimates of the probability of relevance of a document given the query
• An approximation of the dependence between query terms is defined by means of a maximal spanning tree
- each node of the tree: a single query term
- the edges between nodes: weighted by a measure of term dependency
- the tree spans all of the nodes and maximizes the expected mutual information:
  EMIM(x_i, x_j) = Σ P(x_i, x_j) log( P(x_i, x_j) / (P(x_i) P(x_j)) )
- P(x_i, x_j): the probability of terms x_i and x_j occurring together in a relevant document
- P(x_i): the probability of term x_i occurring in a relevant document
- P(x_j): the probability of term x_j occurring in a relevant document
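A sketch of the spanning-tree construction: EMIM is estimated from binary term occurrence in a small set of assumed relevant documents, and Prim's algorithm (always taking the heaviest edge) builds the maximal spanning tree. All data and names here are illustrative:

```python
import math

# Toy "relevant documents", each reduced to its set of terms.
rel_docs = [
    {"language", "model", "retrieval"},
    {"language", "model", "query"},
    {"retrieval", "query", "feedback"},
    {"language", "retrieval", "feedback"},
]
terms = ["language", "model", "retrieval", "query", "feedback"]
N = len(rel_docs)

def p(t):
    return sum(t in d for d in rel_docs) / N

def emim(ti, tj):
    """Expected mutual information, summed over the four occurrence events."""
    total = 0.0
    for ei in (True, False):
        for ej in (True, False):
            pij = sum(((ti in d) == ei) and ((tj in d) == ej)
                      for d in rel_docs) / N
            pi = p(ti) if ei else 1 - p(ti)
            pj = p(tj) if ej else 1 - p(tj)
            if pij > 0:
                total += pij * math.log(pij / (pi * pj))
    return total

# Prim's algorithm, always adding the heaviest edge out of the tree.
in_tree, edges = {terms[0]}, []
while len(in_tree) < len(terms):
    u, v, w = max(
        ((u, v, emim(u, v)) for u in in_tree for v in terms if v not in in_tree),
        key=lambda e: e[2],
    )
    edges.append((u, v, w))
    in_tree.add(v)

for u, v, w in edges:
    print(f"{u} -- {v}  (EMIM {w:.3f})")
```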

10 3.2 The Rocchio Method (1971)
• The Rocchio method
- provides a mechanism for the selection and weighting of expansion terms
- can be used to rank the terms in the judged documents; the top N can then be added to the query and weighted
- a reasonable solution to the problem of relevance feedback that works very well in practice:
  Q_1 = α·Q_0 + β·(1/|R|)·Σ_{d ∈ R} d − γ·(1/|N|)·Σ_{d ∈ N} d
- the optimal values of α, β, γ are determined empirically
- β: the weight assigned to terms occurring in relevant documents
- γ: the weight assigned to terms occurring in non-relevant documents
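A minimal sketch of the Rocchio update over term-frequency vectors; the α, β, γ values are illustrative defaults, not the empirically tuned ones the slide refers to:

```python
from collections import Counter

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Q1 = alpha*Q0 + beta*mean(relevant) - gamma*mean(non-relevant)."""
    q1 = Counter({t: alpha * w for t, w in q0.items()})
    for docs, coef in ((rel_docs, beta), (nonrel_docs, -gamma)):
        for d in docs:
            for t, w in d.items():
                q1[t] += coef * w / len(docs)
    # Negative weights are commonly clipped to zero.
    return {t: w for t, w in q1.items() if w > 0}

q0 = Counter("language model".split())
rel = [Counter("language model retrieval feedback".split())]
nonrel = [Counter("programming language compiler".split())]
print(rocchio(q0, rel, nonrel))
```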

11 3.3 The INQUERY Model (1/2)
• The INQUERY inference network (Turtle, 1991)
- document portion: computed in advance
- query portion: computed at retrieval time
• Document network
- document nodes: d_1 ... d_i
- text nodes: t_1 ... t_j
- concept representation nodes: r_1 ... r_k
• Query network
- query concepts: c_1 ... c_m
- queries: q_1, q_2
- information need: I
• Uncertainty
- due to differences in word sense
[Figure 3.1: Example inference network]
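For flavor, a sketch of the belief operators commonly published for INQUERY-style inference networks; the network structure itself is not modeled here, only how evidence combines at a node:

```python
def bel_and(beliefs):
    """#and: the product of the incoming beliefs."""
    out = 1.0
    for b in beliefs:
        out *= b
    return out

def bel_or(beliefs):
    """#or: 1 minus the product of the complements."""
    out = 1.0
    for b in beliefs:
        out *= 1.0 - b
    return 1.0 - out

def bel_wsum(beliefs, weights):
    """#wsum: weighted mean of the incoming beliefs."""
    return sum(b * w for b, w in zip(beliefs, weights)) / sum(weights)

# Two concept nodes feed a query node with beliefs 0.8 and 0.6:
print(bel_and([0.8, 0.6]))           # 0.48
print(bel_or([0.8, 0.6]))            # 0.92
print(bel_wsum([0.8, 0.6], [2, 1]))  # ~0.733
```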

12 3.3 The INQUERY Model (2/2)
• Relevance feedback
- implementation of theoretical relevance feedback in the inference network was done by Haines (1996)
• Annotated query network
- proposition nodes: k_1, k_2
- observed relevance judgment nodes: j_1, j_2
- AND nodes: require an annotation in order to have an effect on the score
• The drawback of this technique
- it requires inferences of considerable complexity
- each relevance judgment requires two additional layers of inference and several new propositions
[Figure 3.3: Annotated query network]

13 3.4 Exponential Models
• An approach to predicting topic shifts in text using exponential models (Beeferman et al., 1997)
• The model uses ratios of long-range and short-range language models to predict useful terms:
  log( P_l(x) / P_s(x) )
• Topic shift
- occurs when the long-range language model is no longer able to predict the next word better than the short-range language model
- P_l(x): the probability of seeing word x given the context of the last 500 words
- P_s(x): the probability of seeing word x given the two previous words
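A sketch of the long-range vs. short-range comparison; both model functions here are crude stand-ins (a cache over recent words and a flat background), since real long- and short-range models are beyond the slide's scope:

```python
import math

def p_long(word, ctx):
    """Stand-in long-range model: a cache over the recent context."""
    return max(ctx.count(word), 0.5) / len(ctx)

def p_short(word, ctx):
    """Stand-in short-range model: a flat background probability."""
    return 0.01

def log_ratio(word, history):
    """log(P_l(x)/P_s(x)); sustained negative values suggest a topic shift."""
    return math.log(p_long(word, history[-500:]) / p_short(word, history[-2:]))

history = ("language model retrieval " * 50).split()
print(log_ratio("model", history))   # large: the cache predicts it well
print(log_ratio("soccer", history))  # negative: candidate topic shift
```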

14 4. Query Expansion in the Language Modeling Approach
• Assumption of this approach
- users can choose query terms that are likely to occur in documents in which they would be interested
• This assumption has been developed into a ranking formula by means of probabilistic language models

15 Interactive Retrieval with Relevance Feedback
• Relevance feedback
- a small number of documents are judged relevant by the user
- the relevance of all the remaining documents is unknown to the system

16 Document Routing
• Document routing
- the task is to choose terms associated with documents of interest and to avoid terms associated with other documents
- a training collection is available with a large number of relevance judgments, both positive and negative, for each query
• The ratio method
- can utilize this additional information by estimating probabilities for both sets

17 The Ratio Method
• The ratio method predicts useful expansion terms
• Terms are ranked according to how much more probable they are under the relevant document models than in the collection as a whole:
  score(t) = Σ_{d ∈ R} log( P(t|M_d) / (cf_t / cs) )
• The top N terms by this ratio are added to the initial query
- R: the set of judged relevant documents
- P(t|M_d): the probability of term t given the document model for d
- cf_t: the raw count of term t in the collection
- cs: the raw collection size
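A sketch of the ratio method; a plain ML-with-fallback estimate stands in for the full combined estimate P̂(t|M_d) of Section 2 to keep the example short:

```python
import math
from collections import Counter

collection = [
    "the quick brown fox jumps".split(),
    "the lazy dog sleeps".split(),
    "the fox chases the dog".split(),
    "stock markets fell sharply".split(),
]
cf = Counter()
for d in collection:
    cf.update(d)
cs = sum(cf.values())

def p_doc(t, tf, dl):
    """Stand-in for P_hat(t|M_d): ML estimate with collection fallback."""
    return tf[t] / dl if tf[t] > 0 else cf[t] / cs

def ratio_method(relevant_idx, top_n=3):
    """score(t) = sum over relevant d of log( P(t|M_d) / (cf_t/cs) )."""
    scores = Counter()
    for i in relevant_idx:
        tf, dl = Counter(collection[i]), len(collection[i])
        for t in tf:
            scores[t] += math.log(p_doc(t, tf, dl) / (cf[t] / cs))
    return scores.most_common(top_n)

print(ratio_method([0, 2]))  # documents 0 and 2 judged relevant
```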

18 Evaluation
• Results are measured using recall and precision
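For reference, a minimal sketch of set-based precision and recall at a cutoff (the standard definitions; the slide itself gives no formulas):

```python
def precision_recall(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved documents."""
    retrieved = set(ranked_ids[:k])
    hits = len(retrieved & relevant_ids)
    return hits / k, hits / len(relevant_ids)

ranked = [3, 7, 1, 9, 4]  # system output, best first
relevant = {1, 3, 8}      # judged relevant documents
print(precision_recall(ranked, relevant, k=5))  # (0.4, 0.667)
```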

19 Experiments (1/2)
• Comparison of the Rocchio method vs. the language modeling approach
- Language model: expansion terms ranked by the log ratio of the probability in the judged relevant set to the collection probability
- Rocchio: the weighting function was tf.idf, with no negative feedback (γ = 0)
• The language modeling approach works well

20 Experiments (2/2)

21 Information Routing
• Ratio methods with more data
- Ratio 1: the log ratio of the average probability in judged relevant documents vs. the collection probability (the ratio method above)
- Ratio 2: the log ratio of the average probability in judged relevant documents vs. the average probability in judged non-relevant documents
• Result
- the language modeling approach is a good model for retrieval
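A sketch of Ratio 2, comparing average term probabilities across the two judged sets; the ML estimate again stands in for the full document model, and the epsilon smoothing is a choice of this sketch:

```python
import math
from collections import Counter

def avg_prob(t, docs):
    """Average ML probability of t over a set of tokenized documents."""
    return sum(Counter(d)[t] / len(d) for d in docs) / len(docs)

def ratio2(t, rel_docs, nonrel_docs, eps=1e-9):
    """log( avg P(t|relevant) / avg P(t|non-relevant) )."""
    return math.log((avg_prob(t, rel_docs) + eps) /
                    (avg_prob(t, nonrel_docs) + eps))

rel = ["language model retrieval".split(), "language model feedback".split()]
nonrel = ["stock market report".split(), "market prices fell".split()]
print(ratio2("model", rel, nonrel))   # strongly positive: a useful routing term
print(ratio2("market", rel, nonrel))  # strongly negative: a term to avoid
```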

22 5. Query Term Weighting
• Probability estimation
- the maximum likelihood probability
- the average probability (combined via the geometric risk function)
• Risk function
- the current risk function treats all terms equally
- the proposed change is to mix the estimation: a useless term (e.g. a stop word) is assigned an equal probability estimate for every document, so that it has no effect on the ranking
• User-specified language models
- queries: a specific type of text produced by the user
- the term weights are equivalent to the generation probabilities of the query model
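A sketch of the proposed mixing: terms judged useless (here a hand-made stop list stands in for whatever signal the revised risk function would provide) receive the document-independent estimate cf_t/cs in every document, so they cancel out of the ranking:

```python
from collections import Counter

collection = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
]
cf = Counter()
for d in collection:
    cf.update(d)
cs = sum(cf.values())
STOPWORDS = {"the", "a", "of"}  # stand-in for a learned uselessness signal

def p_weighted(t, doc_tokens):
    """Useless terms get cf_t/cs regardless of the document, so they
    contribute the same factor to every document's score."""
    if t in STOPWORDS:
        return cf[t] / cs
    tf = Counter(doc_tokens)
    return tf[t] / len(doc_tokens) if tf[t] > 0 else cf[t] / cs

print(p_weighted("the", collection[0]), p_weighted("the", collection[1]))  # equal
print(p_weighted("fox", collection[0]), p_weighted("fox", collection[1]))  # differ
```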