Relevance Feedback Main Idea:


Relevance Feedback: Main Idea
- Modify the existing query based on relevance judgments
- Extract terms from relevant documents and add them to the query, and/or re-weight the terms already in the query
- Two main approaches:
  - Automatic (pseudo-relevance feedback)
  - User-driven: users select relevant documents, and users/system select terms from an automatically generated list

Relevance Feedback
- Usually do both:
  - expand the query with new terms
  - re-weight terms in the query
- There are many variations:
  - usually positive weights for terms from relevant docs
  - sometimes negative weights for terms from non-relevant docs
  - remove terms that appear ONLY in non-relevant documents

Relevance Feedback for the Vector Model
In the "ideal" case, where we know the relevant documents a priori:
- Cr = set of documents that are truly relevant to the query q
- N = total number of documents
The optimal query is the difference between the centroids of the relevant and non-relevant documents:
  q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

Rocchio Method
  q1 = α·q0 + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj
- q0 is the initial query; q1 is the query after one iteration
- Dr is the set of relevant docs; Dn is the set of non-relevant docs
- Typically α = 1, β = 0.75, γ = 0.25
- Other variations are possible, but performance is similar

Rocchio/Vector Illustration (two dimensions, "retrieval" and "information"):
- Q0 = "retrieval of information" = (0.7, 0.3)
- D1 = "information science" = (0.2, 0.8)
- D2 = "retrieval systems" = (0.9, 0.1)
- Q' = ½·Q0 + ½·D1 = (0.45, 0.55)
- Q'' = ½·Q0 + ½·D2 = (0.80, 0.20)
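The update above can be sketched in a few lines of plain Python. The vectors and the ½/½ weights reproduce the illustration; the function itself is a generic Rocchio step (with the common convention of clamping negative weights to zero), not code from the slides:

```python
# Rocchio query reformulation: a minimal sketch over dense term vectors.

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return q1 = alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    dims = len(q0)
    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]
    cr, cn = centroid(relevant), centroid(nonrelevant)
    # Negative components are often clamped to zero rather than kept
    return [max(0.0, alpha * q0[i] + beta * cr[i] - gamma * cn[i])
            for i in range(dims)]

# Reproducing the illustration: Q' = 1/2 * Q0 + 1/2 * D1
q0 = [0.7, 0.3]   # "retrieval of information"
d1 = [0.2, 0.8]   # "information science"
q_prime = rocchio(q0, [d1], [], alpha=0.5, beta=0.5, gamma=0.0)
print([round(x, 2) for x in q_prime])  # [0.45, 0.55]
```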

Example Rocchio Calculation (the slide works through a full example: relevant docs, a non-relevant doc, the original query, the constants, the Rocchio calculation, and the resulting feedback query).

Rocchio Method
- Rocchio automatically:
  - re-weights terms
  - adds in new terms (from relevant docs)
  - one has to be careful when using negative terms
- Rocchio is not a machine learning algorithm
- Most methods perform similarly:
  - results are heavily dependent on the test collection
- Machine learning methods are proving to work better than standard IR approaches like Rocchio

Relevance Feedback in the Probabilistic Model
  sim(dj, q) ∝ Σ_i w_iq · w_ij · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|R̄)) / P(ki|R̄) ] )
How do we obtain the probabilities P(ki|R) and P(ki|R̄)? Initial estimates based on assumptions:
- P(ki|R) = 0.5
- P(ki|R̄) = ni/N, where ni is the number of docs that contain ki
Use this initial guess to retrieve an initial ranking, then improve upon it.

Improving the Initial Ranking
Using the same sim(dj, q) formula, let:
- V: set of docs initially retrieved
- Vi: subset of the retrieved docs that contain ki
Re-evaluate the estimates (V and Vi also denote the set sizes):
- P(ki|R) = Vi / V
- P(ki|R̄) = (ni − Vi) / (N − V)
Repeat recursively.

Improving the Initial Ranking
To avoid problems with V = 1 and Vi = 0, add a small smoothing constant:
- P(ki|R) = (Vi + 0.5) / (V + 1)
- P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)
Alternatively, use ni/N as the adjustment factor:
- P(ki|R) = (Vi + ni/N) / (V + 1)
- P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
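The smoothed estimates and the log-odds term weight from the similarity formula can be sketched as follows (the numbers in the example are made up for illustration, not from the slides):

```python
import math

def term_estimates(V, Vi, N, ni):
    """Smoothed estimates after an initial retrieval of V docs, Vi containing ki."""
    p_r  = (Vi + 0.5) / (V + 1)            # P(ki | R)
    p_nr = (ni - Vi + 0.5) / (N - V + 1)   # P(ki | not R)
    return p_r, p_nr

def term_weight(p_r, p_nr):
    """log-odds contribution of term ki inside sim(dj, q)."""
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

# Hypothetical collection: N=100 docs, ni=20 contain ki;
# 10 docs retrieved, 5 of them contain ki.
p_r, p_nr = term_estimates(V=10, Vi=5, N=100, ni=20)
print(round(p_r, 3), round(term_weight(p_r, p_nr), 3))  # 0.5 1.583
```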

Using Relevance Feedback
- Known to improve results in TREC-like conditions (no user involved)
- What about with a user in the loop? How might you measure this?
  - Precision/recall figures would need to be computed for the unseen documents

Relevance Feedback Summary
- Iterative query modification can improve precision and recall for a standing query
- In at least one study, users were able to make good choices by seeing which terms were suggested for relevance feedback and selecting among them

Query Expansion
Add terms that are closely related to the query terms, to improve precision and recall. Two variants:
- Local: only analyze the closeness among the set of documents that are returned
- Global: consider all the documents in the corpus a priori
How to decide which terms are closely related? THESAURI!!
- Hand-coded thesauri (Roget and his brothers)
- Automatically generated thesauri:
  - correlation based (association, nearness)
  - similarity based (terms as vectors in document space)
  - statistical (clustering techniques)

Correlation/Co-occurrence Analysis
Terms that are related to terms in the original query may be added to the query. Two terms are related if they have high co-occurrence in documents.
Let n be the number of documents, n1 and n2 the number of documents containing terms t1 and t2 respectively, and m the number of documents containing both t1 and t2.
- If t1 and t2 are independent: m/n ≈ (n1/n) · (n2/n)
- If t1 and t2 are correlated: m/n is significantly larger than (n1/n) · (n2/n)
- The degree of correlation can be measured by how far the observed m exceeds its expectation under independence, e.g. (m · n) / (n1 · n2)
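A minimal sketch of this check; the ratio used as the "degree" (observed co-occurrence over the independence expectation) is one common choice, since the slide leaves the exact measure unspecified, and the counts are made up for illustration:

```python
# Co-occurrence correlation between two terms t1, t2.

def correlation_degree(n, n1, n2, m):
    """Ratio of observed co-occurrence count m to the count expected
    if t1 and t2 were independent (n1*n2/n)."""
    expected = n1 * n2 / n
    return m / expected

# 100 docs; t1 in 20 docs, t2 in 30 docs, both in 12 docs:
# twice the expectation under independence -> correlated.
print(correlation_degree(100, 20, 30, 12))  # 2.0
```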

Association Clusters
- Let M be the term-document matrix (Mij = weight of term i in document j):
  - for the full corpus (global), or
  - for the docs in the set of initial results (local)
  - (sometimes stems are used instead of terms)
- Correlation matrix C = M·Mᵀ (term-doc × doc-term = term-term)
  - unnormalized association matrix: the entries Cuv as-is
  - normalized association matrix: Suv = Cuv / (Cuu + Cvv − Cuv)
- The nth association cluster for a term tu is the set of n terms tv whose values Suv are the largest among Su1, Su2, …, Suk

Example
Term-document matrix M (3 terms × 7 documents):
      d1 d2 d3 d4 d5 d6 d7
  K1:  2  1  0  2  1  1  0
  K2:  0  0  1  0  2  2  5
  K3:  1  0  3  0  4  0  0
Correlation matrix C = M·Mᵀ:
  11  4  6
   4 34 11
   6 11 26
Normalized correlation matrix:
  1.0   0.097 0.193
  0.097 1.0   0.224
  0.193 0.224 1.0
The 1st association cluster for K2 is {K3}.
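The example can be reproduced directly; the code below builds C = M·Mᵀ and the normalized matrix from the term-document matrix above and reads off the association cluster:

```python
# Association clusters from the 3-term x 7-document example, plain Python.
# Normalization: S[u][v] = C[u][v] / (C[u][u] + C[v][v] - C[u][v]).

M = [
    [2, 1, 0, 2, 1, 1, 0],   # K1
    [0, 0, 1, 0, 2, 2, 5],   # K2
    [1, 0, 3, 0, 4, 0, 0],   # K3
]

def correlation_matrix(M):
    """C = M * M^T (term-term co-occurrence weights)."""
    k, d = len(M), len(M[0])
    return [[sum(M[u][j] * M[v][j] for j in range(d)) for v in range(k)]
            for u in range(k)]

def normalize(C):
    k = len(C)
    return [[C[u][v] / (C[u][u] + C[v][v] - C[u][v]) for v in range(k)]
            for u in range(k)]

def association_cluster(S, u, n=1):
    """Indices of the n terms most associated with term u (excluding u itself)."""
    others = [v for v in range(len(S)) if v != u]
    return sorted(others, key=lambda v: S[u][v], reverse=True)[:n]

C = correlation_matrix(M)
S = normalize(C)
print(C[0])                          # [11, 4, 6]
print(round(S[1][2], 3))             # 0.224
print(association_cluster(S, u=1))   # [2] -> K3, as in the example
```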

Scalar Clusters
- Even if terms u and v have low direct correlation, they may be transitively correlated (e.g. a term w has high correlation with both u and v)
- Consider the normalized association matrix S; the "association vector" Au of term u is (Su1, Su2, …, Suk)
- To measure this neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v
- The nth scalar cluster for a term tu is the set of n terms tv whose cosine values with tu are the largest

Example
Association vectors (rows of the normalized correlation matrix):
  AK1 = (1.0, 0.097, 0.193)
  AK2 = (0.097, 1.0, 0.224)
  AK3 = (0.193, 0.224, 1.0)
Scalar (neighborhood) cluster matrix (pairwise cosines of the association vectors):
  1.0   0.226 0.383
  0.226 1.0   0.435
  0.383 0.435 1.0
The 1st scalar cluster for K2 is still {K3}.
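These cosines can be checked in a few lines; the association vectors are the exact fractions behind the rounded matrix above (4/41, 6/31, 11/49):

```python
import math

# Normalized association matrix from the earlier example, as exact fractions.
S = [
    [1.0, 4/41, 6/31],
    [4/41, 1.0, 11/49],
    [6/31, 11/49, 1.0],
]

def cosine(a, b):
    """Cosine between two association vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

scalar = [[cosine(S[u], S[v]) for v in range(3)] for u in range(3)]
print(round(scalar[0][1], 5), round(scalar[1][2], 5))  # 0.22647 0.43571
```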

Metric Clusters
- Let r(ti, tj) be the minimum distance (in number of separating words) between ti and tj in any single document (infinity if they never occur together in a document)
- Define the cluster matrix by Suv = 1/r(tu, tv)
- The nth metric cluster for a term tu is the set of n terms tv whose values Suv are the largest among Su1, Su2, …, Suk
- A variant averages 1/r over the occurrence pairs instead of taking the minimum
- r(ti, tj) is also useful for proximity queries and phrase queries
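A minimal sketch of the minimum-distance variant; the two toy documents are made up for illustration:

```python
# Metric clusters: r(ti, tj) = minimum number of word positions separating
# any occurrence of ti from any occurrence of tj within a single document;
# infinity if the terms never co-occur in a document.

def min_distance(docs, ti, tj):
    best = float("inf")
    for doc in docs:                                   # docs = lists of tokens
        pi = [k for k, w in enumerate(doc) if w == ti]
        pj = [k for k, w in enumerate(doc) if w == tj]
        for a in pi:
            for b in pj:
                best = min(best, abs(a - b))
    return best

def metric_weight(docs, ti, tj):
    """S_uv = 1 / r(ti, tj); 0 if the terms never co-occur."""
    r = min_distance(docs, ti, tj)
    return 0.0 if r == float("inf") else 1.0 / r

docs = [["query", "expansion", "improves", "recall"],
        ["thesaurus", "based", "query", "reformulation"]]
print(round(metric_weight(docs, "query", "recall"), 4))  # 0.3333  (r = 3)
```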

Similarity Thesaurus
- Based on term-to-term relationships rather than on a matrix of co-occurrence
- Obtained by considering the terms as concepts in a concept space
- Each term is indexed by the documents in which it appears: terms assume the original role of documents, while documents are interpreted as indexing elements

Motivation (diagram: a query Q surrounded by terms Ki, Kj, Kv, Ka, Kb in the concept space).

Similarity Thesaurus
Terminology:
- t: number of terms in the collection
- N: number of documents in the collection
- f_i,j: frequency of occurrence of term ki in document dj
- t_j: vocabulary of document dj (number of distinct terms)
- itf_j: inverse term frequency of document dj, itf_j = log(t / t_j)
To each term ki is associated a vector over the documents, weighted by f_i,j and itf_j (the dual of tf-idf).
Idea: it is no surprise if the Oxford dictionary mentions the word! (A document with a huge vocabulary tells us little about any one of its terms.)

Similarity Thesaurus
- The relationship between two terms ku and kv is computed as a correlation factor c_u,v, given by the dot product of their term vectors: c_u,v = Σ_{dj} w_u,j · w_v,j
- The global similarity thesaurus is built by computing the correlation factor c_u,v for each pair of indexing terms [ku, kv] in the collection
  - expensive, but it is possible to do incremental updates
- Similar to the scalar-clusters idea, but with the tf/itf weighting defining the term vector

Query Expansion with a Global Similarity Thesaurus
Three steps:
1. Represent the query in the concept space used for the representation of the index terms
2. Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q
3. Expand the query with the top r ranked terms according to sim(q, kv)

Query Expansion - Step One
To the query q is associated a vector q in the term-concept space, given by
  q = Σ_{ki ∈ q} w_i,q · k_i
where w_i,q is the weight associated with the index-query pair [ki, q].

Query Expansion - Step Two
Compute a similarity sim(q, kv) between each term kv and the user query q:
  sim(q, kv) = q · k_v = Σ_{ku ∈ q} w_u,q · c_u,v
where c_u,v is the correlation factor.

Query Expansion - Step Three
- Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q'
- Each expansion term kv in the query q' is assigned a weight w_v,q' derived from sim(q, kv)
- The expanded query q' is then used to retrieve new documents for the user
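The three steps can be sketched as follows. For illustration the term vectors are raw term counts from the sample documents Doc1-Doc4 of the next slide, not the tf-itf weights the slides define, so the absolute similarity values differ from the slide's numbers even though the top-ranked expansion term agrees:

```python
# Query expansion with a global similarity thesaurus: a simplified sketch.

def expand_query(term_vectors, query_weights, r=2):
    """query_weights: {term: w_uq}. Returns the top r terms by sim(q, kv)."""
    def corr(u, v):
        # c_uv = dot product of the two term vectors (the thesaurus entry)
        return sum(a * b for a, b in zip(term_vectors[u], term_vectors[v]))
    # Step 2: sim(q, kv) = sum over query terms ku of w_uq * c_uv
    sims = {kv: sum(w * corr(ku, kv) for ku, w in query_weights.items())
            for kv in term_vectors}
    # Step 3: rank and keep the top r terms
    ranked = sorted(sims, key=sims.get, reverse=True)
    return [(kv, sims[kv]) for kv in ranked[:r]]

# Raw counts of terms A..E in Doc1..Doc4 (instead of tf-itf weights)
vectors = {"A": [2, 2, 2, 1], "B": [2, 0, 3, 0], "C": [2, 1, 2, 0],
           "D": [2, 1, 2, 0], "E": [0, 2, 0, 0]}
# Query q = A E E  ->  weights w(A,q) = 1, w(E,q) = 2
print(expand_query(vectors, {"A": 1.0, "E": 2.0}, r=3))
# [('A', 21.0), ('C', 14.0), ('D', 14.0)]
```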

Query Expansion Sample
Documents:
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
Correlation factors:
- c(A,A) = 10.991
- c(A,C) = 10.781
- c(A,D) = 10.781
- ...
- c(D,E) = 10.398
- c(B,E) = 10.396
- c(E,E) = 10.224

Query Expansion Sample
Query: q = A E E
- sim(q,A) = 24.298
- sim(q,C) = 23.833
- sim(q,D) = 23.833
- sim(q,B) = 23.830
- sim(q,E) = 23.435
New query: q' = A C D E E
- w(A,q') = 6.88
- w(C,q') = 6.75
- w(D,q') = 6.75
- w(E,q') = 6.64

Statistical Thesaurus Formulation
- Expansion terms must be low-frequency terms
- However, it is difficult to cluster low-frequency terms
- Idea: cluster documents into classes instead, and use the low-frequency terms in these documents to define the thesaurus classes
- The clustering algorithm must produce small and tight clusters

A Clustering Algorithm (Complete Link)
This is a document clustering algorithm which produces small and tight clusters:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
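The steps above can be sketched as follows. Two simplifications to note: the stop criterion here is a similarity threshold (the slide leaves it unspecified), and the sketch returns the final flat partition rather than the full merge hierarchy. The pairwise similarities are the ones from the statistical-thesaurus sample later in this deck:

```python
# Complete-link agglomerative clustering: a minimal sketch.
# Inter-cluster similarity = MINIMUM pairwise document similarity.

def complete_link(docs, sim, threshold):
    clusters = [[d] for d in docs]                   # 1. one doc per cluster
    while True:
        best, pair = -1.0, None
        for i in range(len(clusters)):               # 2. all cluster pairs
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)           # 3. most similar pair
        if pair is None or best < threshold:         # 5. stop criterion
            return clusters
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]      # 4. merge Cu and Cv
        del clusters[j]

# Pairwise similarities from the query-expansion sample (docs 1..4)
S = {(1, 3): 0.99, (1, 2): 0.40, (2, 3): 0.29,
     (1, 4): 0.0, (2, 4): 0.0, (3, 4): 0.0}
sim = lambda a, b: S.get((min(a, b), max(a, b)), 0.0)
print(complete_link([1, 2, 3, 4], sim, threshold=0.90))  # [[1, 3], [2], [4]]
```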

Selecting the Terms that Compose Each Class
Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows. Obtain from the user three parameters:
- TC: threshold class
- NDC: number of documents in a class
- MIDF: minimum inverse document frequency

Selecting the Terms that Compose Each Class
- Use the parameter TC as a threshold value for determining which document clusters will be used to generate thesaurus classes
  - this threshold has to be surpassed by sim(Cu, Cv) if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class
- Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered
  - a low value of NDC might restrict the selection to the smaller cluster Cu+v

Selecting the Terms that Compose Each Class
- Consider the set of documents in each document cluster pre-selected above
- Only the lower-frequency terms are used as sources of terms for the thesaurus classes
- The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class

Query Expansion Based on a Statistical Thesaurus
- Use the thesaurus classes for query expansion
- Compute an average term weight wtc for each thesaurus class C (the average of the weights of the terms in the class)

Query Expansion Based on a Statistical Thesaurus
- wtc can then be used to compute a weight wc for the thesaurus class as a whole

Query Expansion Sample
Documents:
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
Query: q = A E E
Pairwise document similarities:
- sim(1,3) = 0.99, sim(1,2) = 0.40, sim(2,3) = 0.29
- sim(4,1) = sim(4,2) = sim(4,3) = 0.00
Inverse document frequencies:
- idf(A) = 0.0, idf(B) = 0.3, idf(C) = 0.12, idf(D) = 0.12, idf(E) = 0.60
Parameters: TC = 0.90, NDC = 2.00, MIDF = 0.2
Expanded query: q' = A B E E

Query Expansion Based on a Statistical Thesaurus
Problems with this approach: initialization of the parameters TC, NDC, and MIDF
- TC depends on the collection
- Inspection of the cluster hierarchy is almost always necessary to assist with the setting of TC
- A high value of TC might yield classes with too few terms

Conclusion
- A thesaurus is an efficient method to expand queries
- The computation is expensive, but it is executed only once
- Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query
- Query expansion based on a statistical thesaurus needs well-defined parameters

Using Correlation for Term Change
- Low frequency to medium frequency: by synonym recognition
- High frequency to medium frequency: by phrase recognition