Presentation on theme: "Relevance Feedback Main Idea:"— Presentation transcript:

1 Relevance Feedback Main Idea:
Modify the existing query based on relevance judgements: extract terms from relevant documents and add them to the query, and/or re-weight the terms already in the query.
Two main approaches:
- Automatic (pseudo-relevance feedback)
- Users select relevant documents
- Users/system select terms from an automatically generated list

2 Relevance Feedback
Usually do both:
- expand the query with new terms
- re-weight terms in the query
There are many variations:
- usually positive weights for terms from relevant docs
- sometimes negative weights for terms from non-relevant docs
- remove terms that appear ONLY in non-relevant documents

3 Relevance Feedback for Vector Model
In the "ideal" case, where we know the relevant documents a priori:
- Cr = set of documents that are truly relevant to Q
- N = total number of documents
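The formula on this slide did not survive transcription; in the standard vector-model formulation, the optimal query is the difference of the two centroids:

```latex
\vec{Q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
\;-\;
\frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```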

4 Rocchio Method Qo is initial query. Q1 is the query after one iteration Dr are the set of relevant docs Dn are the set of irrelevant docs Alpha =1; Beta=.75, Gamma=.25 typically. Other variations possible, but performance similar

5 Rocchio/Vector Illustration
(Figure: query and document vectors plotted on "information" vs. "retrieval" axes, 0 to 1.0.)
Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)
Q' = ½·Q0 + ½·D1 = (0.45, 0.55)
Q'' = ½·Q0 + ½·D2 = (0.80, 0.20)

6 Example Rocchio Calculation
(Worked example: the relevant docs, a non-relevant doc, the original query, and the constants are fed into the Rocchio formula, yielding the resulting feedback query.)
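The worked numbers on this slide did not survive transcription. A minimal sketch of such a calculation, with hypothetical document and query vectors and the typical constants alpha = 1, beta = 0.75, gamma = 0.25 (clipping negative weights to zero is a common practical choice, not necessarily the slide's):

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """One Rocchio iteration: re-weight q0 using the judged documents."""
    q1 = []
    for i in range(len(q0)):
        rel_centroid = sum(d[i] for d in relevant) / len(relevant)
        nonrel_centroid = sum(d[i] for d in nonrelevant) / len(nonrelevant)
        # Negative term weights are clipped to 0 (a common practical choice).
        q1.append(max(0.0, alpha * q0[i] + beta * rel_centroid - gamma * nonrel_centroid))
    return q1

q0 = [0.7, 0.3]                      # hypothetical initial query
relevant = [[0.2, 0.8], [0.4, 0.6]]  # hypothetical judged-relevant docs
nonrelevant = [[0.9, 0.1]]           # hypothetical judged-non-relevant doc
print(rocchio(q0, relevant, nonrelevant))
```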

7 Rocchio Method
Rocchio automatically:
- re-weights terms
- adds in new terms (from relevant docs)
- have to be careful when using negative terms
- Rocchio is not a machine learning algorithm
Most methods perform similarly:
- results are heavily dependent on the test collection
- machine learning methods are proving to work better than standard IR approaches like Rocchio

8 Relevance feedback in Probabilistic Model
sim(dj,q) ∝ Σi wi,q · wi,j · ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|¬R)) / P(ki|¬R) ] )
(¬R denotes the set of non-relevant documents.)
The probabilities P(ki|R) and P(ki|¬R) are unknown. Initial estimates based on assumptions:
- P(ki|R) = 0.5
- P(ki|¬R) = ni / N, where ni is the number of docs that contain ki
Use this initial guess to retrieve an initial ranking, then improve upon it.

9 Improving the Initial Ranking
sim(dj,q) ∝ Σi wi,q · wi,j · ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|¬R)) / P(ki|¬R) ] )
Let V be the set of docs initially retrieved and Vi the subset of retrieved docs that contain ki (V and Vi also denote the set sizes). Re-evaluate the estimates:
- P(ki|R) = Vi / V
- P(ki|¬R) = (ni − Vi) / (N − V)
Repeat recursively: relevance feedback without user intervention.

10 Improving the Initial Ranking
sim(dj,q) ∝ Σi wi,q · wi,j · ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|¬R)) / P(ki|¬R) ] )
To avoid problems with V = 1 and Vi = 0, smooth the estimates:
- P(ki|R) = (Vi + 0.5) / (V + 1)
- P(ki|¬R) = (ni − Vi + 0.5) / (N − V + 1)
Alternatively:
- P(ki|R) = (Vi + ni/N) / (V + 1)
- P(ki|¬R) = (ni − Vi + ni/N) / (N − V + 1)
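A minimal sketch of the smoothed estimates, using hypothetical counts (the 0.5 adjustment follows the common textbook form of this smoothing):

```python
def estimates(V, Vi, N, ni):
    """Smoothed probability estimates for term ki after an initial retrieval.
    V: docs initially retrieved; Vi: retrieved docs containing ki;
    N: collection size; ni: docs in the collection containing ki."""
    p_ki_rel = (Vi + 0.5) / (V + 1)
    p_ki_nonrel = (ni - Vi + 0.5) / (N - V + 1)
    return p_ki_rel, p_ki_nonrel

# Hypothetical numbers: 10 docs retrieved, 4 contain ki; N = 1000, ni = 50.
print(estimates(V=10, Vi=4, N=1000, ni=50))
```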

11 Using Relevance Feedback
Known to improve results in TREC-like conditions (no user involved). What about with a user in the loop? How might you measure this? Precision/recall figures need to be computed for the unseen documents.

12 Relevance Feedback Summary
Iterative query modification can improve precision and recall for a standing query. In at least one study, users were able to make good choices by seeing which terms were suggested for relevance feedback and selecting among them.

13 Query Expansion
Add terms that are closely related to the query terms, to improve precision and recall. Two variants:
- Local: only analyze the closeness among the set of documents that are returned
- Global: consider all the documents in the corpus a priori
How to decide which terms are closely related? Thesauri!
- Hand-coded thesauri (Roget and his brothers)
- Automatically generated thesauri:
  - correlation-based (association, nearness)
  - similarity-based (terms as vectors in doc space)
  - statistical (clustering techniques)

14 Correlation/Co-occurrence analysis
Terms that are related to terms in the original query may be added to the query. Two terms are related if they co-occur frequently in documents.
Let n be the number of documents, n1 and n2 the number of documents containing terms t1 and t2 respectively, and m the number of documents containing both t1 and t2. Consider the cases where t1 and t2 are independent vs. correlated, and measure the degree of correlation.
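The slide's formulas were lost in transcription; the standard co-occurrence analysis with these definitions is (a reconstruction, not verbatim from the slide):

```latex
\text{If } t_1, t_2 \text{ independent:}\quad \frac{m}{n} \approx \frac{n_1}{n} \cdot \frac{n_2}{n}
\qquad
\text{If correlated:}\quad \frac{m}{n} \gg \frac{n_1}{n} \cdot \frac{n_2}{n}
\qquad
\text{Degree of correlation:}\quad \frac{m \cdot n}{n_1 \, n_2}
```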

15 Association Clusters
Let M, with entries Mij, be the term-document matrix:
- for the full corpus (global), or
- for the docs in the set of initial results (local)
(Sometimes stems are used instead of terms.)
The correlation matrix C = M·Mᵀ (term-doc × doc-term = term-term) is the un-normalized association matrix.
Normalized association matrix: Suv = Cuv / (Cuu + Cvv − Cuv).
The nth association cluster for a term tu is the set of the n terms tv with the largest values Suv among Su1, Su2, …, Suk.

16 Example
Correlation matrix (terms K1, K2, K3 over documents d1…d7):
K1: 11  4  6
K2:  4 34 11
K3:  6 11 26
After normalization, the 1st association cluster for K2 is K3.

17 Scalar Clusters
Even if terms u and v have low correlation, they may be transitively correlated (e.g. a term w has high correlation with both u and v).
Consider the normalized association matrix S. The "association vector" Au of term u is (Su1, Su2, …, Suk).
To measure neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v.
The nth scalar cluster for a term tu is the set of the n terms tv with the largest cosine values cos(Au, Av).

18 Example
(Lisp trace of COSINE-METRIC calls between association vectors; most values lost in transcription. For the association vector AK1, the cosines were 1.0, 0.226, 0.383.)
The 1st scalar cluster for K2 is still K3.
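These numbers can be reproduced from the example correlation matrix above. A minimal sketch (variable names are mine, not the slide's) that builds the normalized association matrix and then the scalar (cosine) cluster matrix:

```python
import math

C = [[11, 4, 6],
     [4, 34, 11],
     [6, 11, 26]]  # correlation matrix from the earlier example

k = len(C)
# Normalized association matrix: Suv = Cuv / (Cuu + Cvv - Cuv)
S = [[C[u][v] / (C[u][u] + C[v][v] - C[u][v]) for v in range(k)] for u in range(k)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Scalar (neighborhood) cluster matrix: cosines between association vectors
scalar = [[cosine(S[u], S[v]) for v in range(k)] for u in range(k)]
print([round(x, 3) for x in scalar[0]])  # -> [1.0, 0.226, 0.383]
```

For K2, the cosine with K3 exceeds the cosine with K1, so its 1st scalar cluster is still K3.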

19 Metric Clusters
Let r(ti, tj) be the minimum distance (in number of separating words) between ti and tj in any single document (infinity if they never occur together in a document). Define the cluster matrix by Suv = 1/r(tu, tv). The nth metric cluster for a term tu is the set of the n terms tv with the largest values Suv among Su1, Su2, …, Suk.
r(ti, tj) is also useful for proximity queries and phrase queries.

20 Similarity Thesaurus
The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. It is obtained by considering the terms as concepts in a concept space: each term is indexed by the documents in which it appears, so terms assume the original role of documents while documents are interpreted as indexing elements.

21 Motivation
(Figure: the query Q and terms Ki, Kj, Kv, Ka, Kb plotted in the concept space.)

22 Similarity Thesaurus
Terminology:
- t: number of terms in the collection
- N: number of documents in the collection
- fi,j: frequency of occurrence of the term ki in the document dj
- tj: vocabulary size of document dj
- itfj: inverse term frequency for document dj
To each term ki is associated a vector of its weights over the documents. Idea: it is no surprise if the Oxford dictionary mentions a word!
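The itf formula and term-vector weights were lost in transcription; in the standard Qiu-Frei formulation (a reconstruction from standard IR texts, not verbatim from the slide) they are:

```latex
itf_j = \log\frac{t}{t_j}
\qquad
\vec{k}_i = (w_{i,1}, \ldots, w_{i,N}), \quad
w_{i,j} = \frac{\bigl(0.5 + 0.5\,\tfrac{f_{i,j}}{\max_l f_{i,l}}\bigr)\, itf_j}
               {\sqrt{\sum_{l=1}^{N} \bigl(0.5 + 0.5\,\tfrac{f_{i,l}}{\max_l f_{i,l}}\bigr)^{2} itf_l^{2}}}
```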

23 Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor cu,v. The global similarity thesaurus is built by computing the correlation factor cu,v for each pair of indexing terms [ku, kv] in the collection. This is expensive, but incremental updates are possible. The idea is similar to that of scalar clusters, but with the tf/itf weighting defining the term vectors.
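The correlation factor itself (lost in transcription) is, in the standard formulation, the dot product of the term vectors:

```latex
c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \, w_{v,j}
```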

24 Query expansion with Global Thesaurus
Three steps:
1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, kv).

25 Query Expansion - step one
To the query q is associated a vector q in the term-concept space, given by a weighted sum of the term vectors, where wi,q is the weight associated with the index-query pair [ki, q].
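The query-vector formula (lost in transcription) is, in the standard formulation:

```latex
\vec{q} = \sum_{k_i \in q} w_{i,q} \, \vec{k}_i
```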

26 Query Expansion - step two
Compute a similarity sim(q, kv) between each term kv and the user query q, where cu,v is the correlation factor.
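The similarity formula (lost in transcription) is, in the standard formulation:

```latex
sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \, c_{u,v}
```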

27 Query Expansion - step three
Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q'. To each expansion term kv in the query q' is assigned a weight wv,q'. The expanded query q' is then used to retrieve new documents for the user.
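The weight formula (lost in transcription) is, in the formulation commonly reproduced in IR texts (a reconstruction, not verbatim from the slide):

```latex
w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}
```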

28 Query Expansion Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
c(A,A) = …, c(A,C) = …, c(A,D) = …, …, c(D,E) = …, c(B,E) = …, c(E,E) = …
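The slide's correlation values were lost in transcription. With plain frequency products in place of the tf/itf weighting (so the numbers below are illustrative, not the slide's), the correlation factors can be sketched as:

```python
from collections import Counter

docs = {
    1: "D D A B C A B C".split(),
    2: "E C E A A D".split(),
    3: "D C B B D A B C A".split(),
    4: "A".split(),
}
freq = {j: Counter(terms) for j, terms in docs.items()}

def c(u, v):
    """Correlation factor as a sum of raw frequency products over documents
    (the slides use tf/itf-weighted vectors, so their actual values differ)."""
    return sum(f[u] * f[v] for f in freq.values())

print(c("A", "A"), c("A", "C"), c("A", "D"), c("D", "E"), c("B", "E"), c("E", "E"))
# -> 13 10 10 2 0 4
```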

29 Query Expansion Sample
Query: q = A E E
sim(q,A) = …, sim(q,C) = …, sim(q,D) = …, sim(q,B) = …, sim(q,E) = …
New query: q' = A C D E E
w(A,q') = 6.88, w(C,q') = 6.75, w(D,q') = 6.75, w(E,q') = 6.64

30 Statistical Thesaurus formulation
Expansion terms should be low-frequency terms. However, it is difficult to cluster low-frequency terms directly. Idea: cluster documents into classes instead, and use the low-frequency terms in these documents to define the thesaurus classes. This requires a clustering algorithm that produces small and tight clusters.

31 A clustering algorithm (Complete Link)
This is a document clustering algorithm which produces small and tight clusters:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents (hence "complete link").
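A minimal sketch of these steps over the four example documents, using the pairwise similarities quoted later in the deck (sim(1,3) = 0.99, sim(1,2) = 0.40, sim(2,3) = 0.29, all pairs with Doc4 = 0) and a similarity threshold of 0.90 as the stop criterion:

```python
sim = {frozenset(p): s for p, s in [
    ((1, 3), 0.99), ((1, 2), 0.40), ((2, 3), 0.29),
    ((1, 4), 0.0), ((2, 4), 0.0), ((3, 4), 0.0),
]}

def cluster_sim(cu, cv):
    # Complete link: cluster similarity is the MINIMUM pairwise doc similarity.
    return min(sim[frozenset((a, b))] for a in cu for b in cv)

def complete_link(docs, threshold):
    clusters = [{d} for d in docs]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest inter-cluster similarity.
        pairs = [(cluster_sim(cu, cv), i, j)
                 for i, cu in enumerate(clusters)
                 for j, cv in enumerate(clusters) if i < j]
        best, i, j = max(pairs)
        if best < threshold:        # stop criterion
            break
        clusters[i] |= clusters[j]  # merge Cu and Cv
        del clusters[j]
    return clusters

print(complete_link([1, 2, 3, 4], threshold=0.90))  # -> [{1, 3}, {2}, {4}]
```

With this threshold only Doc1 and Doc3 merge, matching the TC = 0.90 example later in the deck.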

32 Selecting the terms that compose each class
Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows. Obtain from the user three parameters:
- TC: threshold class
- NDC: number of documents in a class
- MIDF: minimum inverse document frequency

33 Selecting the terms that compose each class
Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes: sim(Cu, Cv) must surpass TC for the documents in clusters Cu and Cv to be selected as sources of terms for a thesaurus class.
Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered. A low value of NDC restricts the selection to the smaller cluster Cu+v.

34 Selecting the terms that compose each class
Consider the set of documents in each document cluster pre-selected above. Only the lower-frequency terms in these documents are used as sources of terms for the thesaurus classes. The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class.

35 Query Expansion based on a Statistical Thesaurus
Use the thesaurus classes for query expansion. Compute an average term weight wtc for each thesaurus class C.

36 Query Expansion based on a Statistical Thesaurus
The average term weight wtc can be used to compute a thesaurus class weight wc.

37 Query Expansion Sample
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
q = A E E
sim(1,3) = 0.99, sim(1,2) = 0.40, sim(2,3) = 0.29, sim(4,1) = 0.00, sim(4,2) = 0.00, sim(4,3) = 0.00
idf: A = 0.0, B = 0.3, C = 0.12, D = 0.12, E = 0.60
Parameters: TC = 0.90, NDC = 2.00, MIDF = 0.2
q' = A B E E

38 Query Expansion based on a Statistical Thesaurus
Problems with this approach: initialization of the parameters TC, NDC and MIDF.
- TC depends on the collection.
- Inspection of the cluster hierarchy is almost always necessary to assist with setting TC.
- A high value of TC might yield classes with too few terms.

39 Conclusion
Thesauri are an efficient way to expand queries:
- The computation is expensive, but it is executed only once.
- Query expansion based on a similarity thesaurus can use high-frequency terms to expand the query.
- Query expansion based on a statistical thesaurus needs well-defined parameters.

40 Using correlation for term change
- Low frequency to medium frequency: by synonym recognition
- High frequency to medium frequency: by phrase recognition

