Relevance Feedback
Hongning Wang
What we have learned so far
[Architecture diagram: a crawler builds the indexed corpus; the doc analyzer and indexer produce the doc representation (index); the query representation and the ranker produce results for the user; evaluation and feedback close the loop. So far the research attention has been on the ranking procedure; this lecture turns to feedback.]
User feedback
An IR system is an interactive system. The user translates an information need into a query; the system infers an information need from that query and returns ranked documents. A gap can open between the true information need and the inferred one, and user feedback helps the system close this gap.
Relevance feedback
[Diagram: the user's query goes to the retrieval engine over the document collection, which returns results d_1, d_2, ..., d_k; the user judges them (d_1 +, d_2 -, d_3 +, ..., d_k -, ...), and these judgments are fed back to produce an updated query.]
Relevance feedback in real systems
Google used to provide such functions – guess why?
[Screenshot: search results with "Relevant" / "Nonrelevant" controls next to each result.]
Relevance feedback in real systems
Widely used in image search systems, where users can judge relevance at a glance.
Pseudo feedback
What if users are reluctant to provide any feedback?
[Diagram: the query goes to the retrieval engine over the document collection, returning results d_1, d_2, ..., d_k; the top k results are simply assumed relevant (judgments d_1 +, d_2 +, d_3 +, ..., d_k -, ...) and fed back to produce an updated query.]
Basic idea in feedback: query expansion
– Feedback documents can help discover related query terms.
– E.g., for the query "information retrieval", relevant or pseudo-relevant docs likely share closely related words such as "search", "search engine", "ranking", "query". Expanding the original query with such words increases recall, and sometimes precision as well (a minimal sketch follows).
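As a rough illustration of this idea (not the specific method of any system discussed here), the sketch below expands a query with the most frequent unseen terms from feedback documents; the frequency-based term scoring and the cutoff k are illustrative choices:

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, k=5):
    """Expand a query with the k most frequent new terms found in
    (pseudo-)relevant feedback documents.

    query_terms: list of query words
    feedback_docs: list of documents, each a list of tokens
    """
    counts = Counter()
    for doc in feedback_docs:
        counts.update(doc)
    # Keep only terms not already in the query, ranked by frequency.
    candidates = [(t, c) for t, c in counts.items() if t not in query_terms]
    candidates.sort(key=lambda tc: tc[1], reverse=True)
    return list(query_terms) + [t for t, _ in candidates[:k]]

# Example from the slide:
# expand_query(["information", "retrieval"],
#              [["search", "engine", "ranking", "search", "query"]])
# -> ["information", "retrieval", "search", "engine", "ranking", "query"]
```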
Basic idea in feedback: learning-based retrieval
– Feedback documents can be treated as supervision for updating the ranking model
– Will be covered in the "learning-to-rank" lecture
Feedback techniques
Feedback as query expansion
– Step 1: term selection
– Step 2: query expansion
– Step 3: query term re-weighting
Feedback as a training signal
– Will be covered later
Relevance feedback in vector space models
General idea: query modification
– Adding new (weighted) terms
– Adjusting weights of old terms
The most well-known and effective approach is Rocchio [Rocchio 1971].
Illustration of Rocchio feedback
[Figure: in the vector space, the original query vector q is moved toward the centroid of the relevant documents (+) and away from the centroid of the non-relevant documents (-), yielding the modified query.]
Standard operation in vector space: formula for Rocchio feedback

\vec{q}_m = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

where \vec{q}_0 is the original query vector, D_r and D_{nr} are the sets of judged relevant and non-relevant documents, \alpha, \beta, \gamma are parameters, and \vec{q}_m is the modified query.
Rocchio in practice
– Negative (non-relevant) examples are not very important (why?)
– Efficiency concern: restrict the vector to a lower dimension, i.e., only keep the highly weighted words in the centroid vector
– Avoid "training bias": keep a relatively high weight on the original query terms
– Can be used for both relevance feedback and pseudo feedback
– Usually robust and effective
A sketch implementing these points follows.
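A minimal sketch of Rocchio with the practical choices above baked in: a small gamma (negative examples matter less), a large alpha (keep the original query dominant), and truncation to the top-weighted dimensions. The default parameter values and the dict-based sparse vectors are illustrative assumptions, not prescribed by the slides:

```python
from collections import defaultdict

def rocchio(query_vec, rel_docs, nonrel_docs,
            alpha=1.0, beta=0.75, gamma=0.15, dims=50):
    """Rocchio: q_m = alpha*q0 + (beta/|Dr|) sum(Dr) - (gamma/|Dnr|) sum(Dnr).
    Vectors are sparse dicts {term: weight}. Small gamma: negative examples
    matter less. High alpha: keep the original query dominant. The result
    is truncated to the `dims` highest-weighted terms for efficiency."""
    q = defaultdict(float)
    for t, w in query_vec.items():
        q[t] += alpha * w
    if rel_docs:
        for d in rel_docs:
            for t, w in d.items():
                q[t] += (beta / len(rel_docs)) * w
    if nonrel_docs:
        for d in nonrel_docs:
            for t, w in d.items():
                q[t] -= (gamma / len(nonrel_docs)) * w
    top = sorted(q.items(), key=lambda tw: tw[1], reverse=True)[:dims]
    return {t: w for t, w in top if w > 0}  # drop non-positive weights
```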
Feedback in probabilistic models
– Classic probabilistic model: rank with a relevant-doc model and a non-relevant-doc model, P(D|Q,R=1) and P(D|Q,R=0)
– Language model: rank with a "relevant query" model, P(Q|D,R=1)
Parameter estimation uses (query, doc, relevance) tuples such as (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0).
Initial retrieval:
– P(D|Q,R=1): treat the query as a relevant doc
– P(Q|D,R=1): treat the doc as a relevant query
Feedback:
– P(D|Q,R=1) can be improved for the current query and future docs
– P(Q|D,R=1) can be improved for the current doc and future queries
Document generation model
For each term A_i (terms occurring vs. not occurring in a doc):

                        relevant (R=1)   nonrelevant (R=0)
term present (A_i = 1)       p_i               u_i
term absent  (A_i = 0)     1 - p_i           1 - u_i

Important trick: assume terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents, i.e., p_t = u_t, so only query terms contribute to the ranking score.
Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
Two parameters for each term A_i (the RSJ model):
– p_i = P(A_i=1|Q,R=1): prob. that term A_i occurs in a relevant doc
– u_i = P(A_i=1|Q,R=0): prob. that term A_i occurs in a non-relevant doc
How do we estimate these parameters? Suppose we have relevance judgments on N docs, of which R are relevant, and term i occurs in n_i of the judged docs, r_i of them relevant. Then

p_i = (r_i + 0.5) / (R + 1),   u_i = (n_i - r_i + 0.5) / (N - R + 1)

The "+0.5" and "+1" can be justified by Bayesian estimation as priors. Note this is per-query estimation: P(D|Q,R=1) can be improved for the current query and future docs. A small computation sketch follows.
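To make the estimation concrete, here is a sketch of the smoothed RSJ estimates and the standard RSJ log-odds term weight (a document's score sums these weights over the query terms it contains); treat the function names and interface as illustrative assumptions:

```python
import math

def rsj_params(r_i, n_i, R, N):
    """Smoothed RSJ estimates from relevance judgments:
    N judged docs, R of them relevant; the term occurs in n_i judged
    docs, r_i of them relevant. The +0.5 / +1 act as a Bayesian prior."""
    p_i = (r_i + 0.5) / (R + 1)            # P(A_i=1 | Q, R=1)
    u_i = (n_i - r_i + 0.5) / (N - R + 1)  # P(A_i=1 | Q, R=0)
    return p_i, u_i

def rsj_weight(r_i, n_i, R, N):
    """Log-odds weight of one query term; sum over matched query terms
    to score a document."""
    p_i, u_i = rsj_params(r_i, n_i, R, N)
    return math.log(p_i * (1 - u_i) / (u_i * (1 - p_i)))
```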
Feedback in language models
In the query-likelihood view we rank documents by p(Q|D), computed from the document language model \theta_D. The query is a fixed observation in this model, so there is no natural place to plug in feedback information.
Feedback in language models
Approach:
– Introduce a probabilistic query model
– Ranking: measure the distance between the query model and the document model
– Feedback: update the query model
Q: Back to the vector space model? A: Kind of, but from a different perspective.
Kullback-Leibler (KL) divergence based retrieval model

score(Q, D) = -D_{KL}(\theta_Q \| \theta_D)
            = \sum_w p(w|\theta_Q) \log p(w|\theta_D) - \sum_w p(w|\theta_Q) \log p(w|\theta_Q)

– The second term is a query-specific quantity (the query model's negative entropy), ignored for ranking
– \theta_Q is the query language model, which needs to be estimated
– \theta_D is the document language model, which we know how to estimate
Background knowledge
The Kullback-Leibler divergence between two distributions p and q is

D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

It is non-negative, equals zero iff p = q, and measures how poorly q approximates p.
Kullback-Leibler (KL) divergence based retrieval model
Use the same smoothing strategy for \theta_D as in the query-likelihood model (e.g., smoothing p(w|\theta_D) with the collection model p(w|C)). When \theta_Q is simply the empirical distribution of the query words, ranking by -D_{KL}(\theta_Q \| \theta_D) is rank-equivalent to query likelihood; a scoring sketch follows.
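A minimal scoring sketch, assuming Dirichlet prior smoothing for the document model (mu = 2000 is just a conventional default) and sparse dict representations; only the cross-entropy term is computed, since the query-entropy term is constant per query:

```python
import math

def kl_score(query_model, doc_counts, doc_len, coll_model, mu=2000.0):
    """Rank-equivalent KL-divergence score:
      score(Q, D) = sum_w p(w|theta_Q) * log p(w|theta_D)
    with the Dirichlet-smoothed document model
      p(w|theta_D) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    score = 0.0
    for w, q_w in query_model.items():
        p_w_d = (doc_counts.get(w, 0) + mu * coll_model.get(w, 1e-9)) \
                / (doc_len + mu)
        score += q_w * math.log(p_w_d)
    return score
```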
Feedback as model interpolation
Given query Q (model \theta_Q), document D, and feedback docs F = {d_1, d_2, ..., d_n}, estimate a feedback model \theta_F from F with a generative model, then interpolate:

\theta_Q' = (1 - \alpha) \theta_Q + \alpha \theta_F

– \alpha = 0: no feedback
– \alpha = 1: full feedback
Q: Is this Rocchio feedback in the vector space model? A: Very similar, but with a different interpretation.
Key: estimating the feedback model \theta_F (interpolation sketch below).
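The interpolation step itself is a one-liner over the two distributions; a minimal sketch, assuming both models are dicts mapping words to probabilities:

```python
def interpolate(query_model, feedback_model, alpha=0.5):
    """theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F.
    alpha=0: no feedback; alpha=1: full feedback."""
    vocab = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0)
            for w in vocab}
```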
Feedback in language models
[Example: feedback documents for the query "airport security" contain topical phrases such as "protect passengers", "accidental/malicious harm", "crime", and "rules".]
Generative mixture model of feedback
Assume each word w in the feedback docs F = {d_1, ..., d_n} is generated either by the topic model p(w|\theta_F) (topic words, with probability 1 - \lambda) or by the background model p(w|C) (background words, with probability \lambda):

p(w) = (1 - \lambda) p(w|\theta_F) + \lambda p(w|C)

\lambda = noise ratio in the feedback documents. Estimate \theta_F by maximum likelihood:

\log p(F|\theta_F) = \sum_{d \in F} \sum_w c(w; d) \log [(1 - \lambda) p(w|\theta_F) + \lambda p(w|C)]
How to estimate \theta_F?
[Figure: feedback doc(s) on "airport security" are generated by mixing the known background p(w|C) (the 0.2, a 0.1, we 0.01, to 0.02, ..., flight, company, ...) with weight \lambda = 0.7 and the unknown query topic model p(w|\theta_F) (accident =?, regulation =?, passenger =?, rules =?) with weight 1 - \lambda = 0.3; \lambda is fixed.]
If we knew the identity (topic vs. background) of each word, the ML estimator would be straightforward; but we don't...
Appeal to the EM algorithm
Hidden variable for each word occurrence: z_i \in {1 (background), 0 (topic)}, e.g., for "the paper presents a text mining algorithm the paper ..."
Suppose the parameters are all known; what is a reasonable guess of z_i?
– It depends on \lambda (why?)
– It depends on p(w|C) and p(w|\theta_F) (how?)

E-step:  p(z=1|w) = \lambda p(w|C) / [\lambda p(w|C) + (1 - \lambda) p^{(t)}(w|\theta_F)]
M-step:  p^{(t+1)}(w|\theta_F) = \sum_{d \in F} c(w; d)(1 - p(z=1|w)) / \sum_{w'} \sum_{d \in F} c(w'; d)(1 - p(z=1|w'))

Why did we not need to distinguish a word's identity in Rocchio?
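A compact sketch of this EM loop, assuming the feedback docs arrive as token lists and the collection model as a dict of probabilities; the empirical-distribution initialization and the fixed iteration count are illustrative choices:

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, coll_model, lam=0.7, iters=30):
    """EM for the two-component mixture: each word of F comes from the
    background p(w|C) with prob. lam (noise) or from the topic model
    p(w|theta_F) with prob. 1 - lam. Returns the ML estimate of theta_F."""
    counts = Counter()
    for d in feedback_docs:
        counts.update(d)
    total = sum(counts.values())
    theta_f = {w: c / total for w, c in counts.items()}  # init: empirical dist.
    for _ in range(iters):
        # E-step: posterior prob. that an occurrence of w is a topic word
        post = {w: (1 - lam) * theta_f[w]
                   / ((1 - lam) * theta_f[w] + lam * coll_model.get(w, 1e-9))
                for w in counts}
        # M-step: re-estimate theta_F from the topic-attributed counts
        norm = sum(counts[w] * post[w] for w in counts)
        theta_f = {w: counts[w] * post[w] / norm for w in counts}
    return theta_f
```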
A toy example of EM computation
Assume \lambda = 0.5.
Expectation step: augment the data by guessing the hidden variables, i.e., compute for each word the posterior probability that it was generated by the topic rather than the background.
Maximization step: with the "augmented data", estimate the parameters by maximum likelihood, i.e., re-estimate p(w|\theta_F) from the topic-attributed word counts.
A worked iteration follows.
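One full E/M iteration worked out with assumed values (the two-word vocabulary, counts, and background probabilities below are made up for illustration, not taken from the slide):

```latex
% Assumptions: \lambda = 0.5; vocabulary \{the, text\};
% p(the|C) = 0.9, p(text|C) = 0.1;
% counts in F: c(the) = 4, c(text) = 2;
% initial topic model: p(the|\theta_F) = p(text|\theta_F) = 0.5.
\begin{align*}
\text{E-step: } p(z{=}0 \mid \text{the})
  &= \frac{0.5 \cdot 0.5}{0.5 \cdot 0.5 + 0.5 \cdot 0.9} \approx 0.357 \\
p(z{=}0 \mid \text{text})
  &= \frac{0.5 \cdot 0.5}{0.5 \cdot 0.5 + 0.5 \cdot 0.1} \approx 0.833 \\
\text{M-step: } p(\text{the} \mid \theta_F)
  &= \frac{4 \cdot 0.357}{4 \cdot 0.357 + 2 \cdot 0.833} \approx 0.462 \\
p(\text{text} \mid \theta_F)
  &= \frac{2 \cdot 0.833}{4 \cdot 0.357 + 2 \cdot 0.833} \approx 0.538
\end{align*}
% The background explains "the" well, so probability mass shifts to "text".
```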
Example of feedback query model
Query: "airport security"
– Pseudo feedback with the top 10 documents
[Table: estimated feedback models p(w|\theta_F) under noise ratio \lambda = 0.9 vs. \lambda = 0.7; a larger \lambda lets the background absorb more of the common words.]
Open question: how do we handle negative feedback?
Keep this in mind, we will come back
[Same estimation setup, now for "text mining": observed doc(s) generated by mixing the known background p(w|C) (the 0.2, a 0.1, we 0.01, to 0.02, ...) with weight \lambda = 0.7 and the unknown query topic model p(w|\theta_F) (text =?, mining =?, association =?, word =?) with weight 1 - \lambda = 0.3; \lambda fixed, \theta_F estimated by ML.]
What you should know
– Purpose of relevance feedback
– Rocchio relevance feedback for vector space models
– Query-model-based feedback for language models