1
Ranking and Relevance Feedback
Ray Larson & Marti Hearst
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
9/21/2000
2
Review
Inverted files
The Vector Space Model
Term weighting
3
tf x idf
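The body of this slide was a formula image that did not survive extraction. As a hedged reconstruction, the standard tf x idf weight it most likely showed (notation assumed here, not taken from the slide) is:

```latex
w_{ik} = tf_{ik} \cdot \log\!\left(\frac{N}{n_k}\right)
```

where tf_ik is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents containing term k.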
4
Inverse Document Frequency
IDF provides high values for rare words and low values for common words.
For a collection of 10,000 documents:
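The worked values on the slide were part of an image; the short sketch below recomputes them under the assumption that the usual idf = log(N/n_k) definition was used (base 10 here; the base only rescales the numbers):

```python
import math

N = 10_000  # collection size from the slide
for n_k in (1, 100, 1_000, 10_000):
    idf = math.log10(N / n_k)
    print(f"term occurring in {n_k:>6} docs -> idf = {idf:.2f}")

# term occurring in      1 docs -> idf = 4.00   (rare word, high weight)
# term occurring in    100 docs -> idf = 2.00
# term occurring in   1000 docs -> idf = 1.00
# term occurring in  10000 docs -> idf = 0.00   (word in every doc, no weight)
```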
5
Similarity Measures
Simple matching (coordination level match)
Dice's Coefficient
Jaccard's Coefficient
Cosine Coefficient
Overlap Coefficient
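The defining formulas were an image on the original slide; the standard set-based (binary) forms these names usually refer to, for a query term set Q and a document term set D, are:

```latex
\text{Simple matching: } |Q \cap D| \\
\text{Dice: } \frac{2\,|Q \cap D|}{|Q| + |D|} \\
\text{Jaccard: } \frac{|Q \cap D|}{|Q \cup D|} \\
\text{Cosine: } \frac{|Q \cap D|}{\sqrt{|Q|}\,\sqrt{|D|}} \\
\text{Overlap: } \frac{|Q \cap D|}{\min(|Q|,\,|D|)}
```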
6
Vector Space Visualization
7
Text Clustering
Clustering is "The art of finding groups in data." -- Kaufman and Rousseeuw
(Figure: documents plotted in a space of two terms, axes Term 1 and Term 2)
9
tf x idf normalization
Normalize the term weights (so longer documents are not unfairly given more weight).
–Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.
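The normalized weighting formula itself was an image; a hedged reconstruction using the standard cosine-normalized tf x idf weight is:

```latex
w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}
              {\sqrt{\sum_{j=1}^{t} \bigl(tf_{ij} \cdot \log(N / n_j)\bigr)^{2}}}
```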
10
Vector space similarity
(use the weights to compare the documents)
11
Vector Space Similarity Measure
Combine tf x idf into a similarity measure:
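The measure shown on the slide was an image; the standard cosine coefficient over weighted vectors, which the example on the next slide uses, is:

```latex
sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj}\, w_{d_{ij}}}
                   {\sqrt{\sum_{j=1}^{t} w_{qj}^{2}}\;\sqrt{\sum_{j=1}^{t} w_{d_{ij}}^{2}}}
```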
12
Vector Space with Term Weights and Cosine Matching
(Figure: query Q and documents D1, D2 plotted in a two-term space, axes Term A and Term B)
D_i = (d_i1, w_di1; d_i2, w_di2; …; d_it, w_dit)
Q = (q_i1, w_qi1; q_i2, w_qi2; …; q_it, w_qit)
Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
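A short sketch that reproduces the slide's numbers with a plain cosine match (the vectors are from the slide; the cosine function is the standard one, not code from the course):

```python
import math

def cosine(x, y):
    # standard cosine coefficient between two weighted term vectors
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(f"cos(Q, D1) = {cosine(Q, D1):.2f}")  # 0.73
print(f"cos(Q, D2) = {cosine(Q, D2):.2f}")  # 0.98 -> D2 is ranked above D1
```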
13
Term Weights in SMART
In SMART, weights are decomposed into three factors:
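The decomposition itself appeared as an image; in the usual SMART formulation (assumed here) the weight of term k in document i is a product of a frequency (local) component, a collection (global) component, and a normalization component:

```latex
w_{ik} = \mathrm{freq}_{ik} \times \mathrm{collection}_{k} \times \mathrm{norm}_{i}
```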
14
SMART Freq Components
Binary
maxnorm
augmented
log
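The formulas for these options were images; the standard SMART local-weight forms these labels usually denote (following Salton & Buckley, assumed rather than copied from the slide) are:

```latex
\text{binary: } 1 \text{ if } tf_{ik} > 0,\ \text{else } 0 \\
\text{maxnorm: } tf_{ik} / \max_{j} tf_{ij} \\
\text{augmented: } 0.5 + 0.5 \cdot tf_{ik} / \max_{j} tf_{ij} \\
\text{log: } \ln(tf_{ik}) + 1
```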
15
Collection Weighting in SMART
Inverse
squared
probabilistic
frequency
16
Term Normalization in SMART
sum
cosine
fourth
max
17
Probabilistic Models: Logistic Regression attributes
Average Absolute Query Frequency
Query Length
Average Absolute Document Frequency
Document Length
Average Inverse Document Frequency
Inverse Document Frequency
Number of Terms in common between query and document -- logged
18
Probabilistic Models: Logistic Regression
Probability of relevance is based on logistic regression from a sample set of documents, used to determine the values of the coefficients.
At retrieval the probability estimate is obtained from the fitted equation, for the 6 X attribute measures shown previously.
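The estimating equation was an image on the slide; a hedged reconstruction of the standard logistic form it takes, with coefficients c_0..c_6 fitted from the training sample, is:

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i
\qquad\Longrightarrow\qquad
P(R \mid Q, D) = \frac{e^{\,c_0 + \sum_i c_i X_i}}{1 + e^{\,c_0 + \sum_i c_i X_i}}
```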
19
Probabilistic Models
Advantages:
–Strong theoretical basis
–In principle should supply the best predictions of relevance given available information
–Can be implemented similarly to Vector
Disadvantages:
–Relevance information is required -- or is "guestimated"
–Important indicators of relevance may not be terms -- though terms only are usually used
–Optimally requires on-going collection of relevance information
20
Vector and Probabilistic Models
Support "natural language" queries
Treat documents and queries the same
Support relevance feedback searching
Support ranked retrieval
Differ primarily in theoretical basis and in how the ranking is calculated
–Vector assumes relevance
–Probabilistic relies on relevance judgments or estimates
21
Current use of Probabilistic Models
Virtually all the major systems in TREC now use the "Okapi BM25 formula", which incorporates the Robertson-Sparck Jones weights…
22
Okapi BM25
Where:
–Q is a query containing terms T
–K is k1((1 - b) + b·dl/avdl)
–k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
–tf is the frequency of the term in a specific document
–qtf is the frequency of the term in a topic from which Q was derived
–dl and avdl are the document length and the average document length, measured in some convenient unit
–w(1) is the Robertson-Sparck Jones weight
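The BM25 formula itself was an image; the standard Okapi BM25 ranking function that matches the definitions above (Robertson et al., TREC) is, as a hedged reconstruction:

```latex
BM25(Q, D) = \sum_{T \in Q} w^{(1)} \cdot
             \frac{(k_1 + 1)\, tf}{K + tf} \cdot
             \frac{(k_3 + 1)\, qtf}{k_3 + qtf},
\qquad K = k_1\bigl((1 - b) + b \cdot dl/avdl\bigr)
```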
23
Today
Logistic regression and Cheshire
Relevance Feedback
24
Logistic Regression and Cheshire II
The Cheshire II system uses Logistic Regression equations estimated from TREC full-text data.
Demo (?)
25
Querying in IR System
(Diagram of an information storage and retrieval system: interest profiles & queries on one side and documents & data on the other. Queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1); documents are indexed (descriptive and subject) and stored as document representations (Store 2). Comparison/matching of the two stores yields potentially relevant documents. The "rules of the game" are the rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language.)
26
Relevance Feedback in an IR System
(Diagram: the same information storage and retrieval system as on the previous slide, with an added feedback loop in which selected relevant documents feed back into the formulation of the query.)
27
Query Modification
Problem: how to reformulate the query?
–Thesaurus expansion: Suggest terms similar to query terms
–Relevance feedback: Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant
28
Relevance Feedback
Main Idea:
–Modify existing query based on relevance judgements
  Extract terms from relevant documents and add them to the query, and/or re-weight the terms already in the query
–Two main approaches:
  Automatic (pseudo-relevance feedback)
  Users select relevant documents
–Users/system select terms from an automatically-generated list
29
Relevance Feedback
Usually do both:
–expand query with new terms
–re-weight terms in query
There are many variations
–usually positive weights for terms from relevant docs
–sometimes negative weights for terms from non-relevant docs
–Remove terms ONLY in non-relevant documents
30
Rocchio Method
31
Rocchio Method
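Both Rocchio slides presented the update formula as images; the standard Rocchio query-modification formula (a hedged reconstruction, notation assumed) is:

```latex
Q_{new} = \alpha\, Q_0
        + \frac{\beta}{|R|} \sum_{D_i \in R} D_i
        - \frac{\gamma}{|S|} \sum_{D_i \in S} D_i
```

where Q_0 is the original query vector, R and S are the sets of known relevant and non-relevant documents, and alpha, beta, gamma are tuning constants.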
32
Rocchio/Vector Illustration
(Figure: Q0, Q', Q", D1 and D2 plotted in a two-term space, axes Retrieval and Information)
Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)
Q' = ½·Q0 + ½·D1 = (0.45, 0.55)
Q" = ½·Q0 + ½·D2 = (0.80, 0.20)
33
Example Rocchio Calculation
(The slide works a numeric example as an image: relevant docs, non-rel doc, original query, constants, the Rocchio calculation, and the resulting feedback query.)
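Since the worked numbers were in an image, the sketch below runs the same kind of calculation on invented vectors; the alpha/beta/gamma constants and all query/document values are hypothetical, not the slide's:

```python
def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """Standard Rocchio update; negative weights are clipped to zero."""
    new_q = [alpha * w for w in q0]
    for d in rel_docs:
        for i, w in enumerate(d):
            new_q[i] += beta * w / len(rel_docs)
    for d in nonrel_docs:
        for i, w in enumerate(d):
            new_q[i] -= gamma * w / len(nonrel_docs)
    return [max(0.0, w) for w in new_q]

# Hypothetical 3-term example: two relevant documents, one non-relevant document.
q0 = [1.0, 0.0, 0.5]
relevant = [[0.8, 0.4, 0.0], [0.6, 0.6, 0.2]]
non_relevant = [[0.0, 0.9, 0.1]]
print(rocchio(q0, relevant, non_relevant))  # [1.525, 0.15, 0.55]
```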
34
Rocchio Method
Rocchio automatically
–re-weights terms
–adds in new terms (from relevant docs)
Have to be careful when using negative terms
Rocchio is not a machine learning algorithm
Most methods perform similarly
–results heavily dependent on test collection
Machine learning methods are proving to work better than standard IR approaches like Rocchio
35
Probabilistic Relevance Feedback (Robertson & Sparck Jones)
Given a query term t, cross-tabulate document indexing (term present +, term absent -) against document relevance:

                        Relevant (+)    Not relevant (-)    Total
  Term present (+)      r               n - r               n
  Term absent  (-)      R - r           N - n - R + r       N - n
  Total                 R               N - R               N

Where N is the number of documents seen.
36
Robertson-Sparck Jones Weights
Retrospective formulation:
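The weight formula was an image; the standard retrospective Robertson-Sparck Jones weight, using the r, n, R, N of the contingency table above (a hedged reconstruction), is:

```latex
w_t = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}
```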
37
Robertson-Sparck Jones Weights
Predictive formulation:
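Likewise, the predictive (0.5-smoothed) form of the weight, as it is usually written and as assumed here, is:

```latex
w^{(1)} = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}
```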
38
Using Relevance Feedback
Known to improve results
–in TREC-like conditions (no user involved)
What about with a user in the loop?
–How might you measure this?
39
Relevance Feedback Summary
Iterative query modification can improve precision and recall for a standing query
In at least one study, users were able to make good choices by seeing which terms were suggested for R.F. and selecting among them
40
Alternative Notions of Relevance Feedback
41
Alternative Notions of Relevance Feedback
Find people whose taste is "similar" to yours. Will you like what they like?
Follow a user's actions in the background. Can this be used to predict what the user will want to see next?
Track what lots of people are doing. Does this implicitly indicate what they think is good and not good?
42
Alternative Notions of Relevance Feedback
Several different criteria to consider:
–Implicit vs. Explicit judgements
–Individual vs. Group judgements
–Standing vs. Dynamic topics
–Similarity of the items being judged vs. similarity of the judges themselves
43
Collaborative Filtering (social filtering)
If Pam liked the paper, I'll like the paper
If you liked Star Wars, you'll like Independence Day
Rating based on ratings of similar people
–Ignores the text, so works on text, sound, pictures, etc.
–But: Initial users can bias ratings of future users
44
Ringo Collaborative Filtering (Shardanand & Maes 95)
Users rate musical artists from like to dislike
–1 = detest, 7 = can't live without, 4 = ambivalent
–There is a normal distribution around 4
–However, what matters are the extremes
Nearest Neighbors Strategy: find similar users and predict a (weighted) average of their ratings
Pearson r algorithm: weight by degree of correlation between user U and user J
–1 means very similar, 0 means no correlation, -1 dissimilar
–Works better to compare against the ambivalent rating (4) rather than the individual's average score
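A minimal sketch of the prediction scheme described above: each neighbor's rating is weighted by its Pearson correlation with the active user, with deviations taken from the ambivalent rating of 4 rather than from per-user means. The users, artists, and ratings below are invented for illustration, not data from Ringo:

```python
import math

AMBIVALENT = 4.0  # Ringo's mid-point rating

def pearson(u, v, common):
    """Correlation of two users over commonly rated artists, centred on the rating 4."""
    du = [u[a] - AMBIVALENT for a in common]
    dv = [v[a] - AMBIVALENT for a in common]
    num = sum(x * y for x, y in zip(du, dv))
    den = math.sqrt(sum(x * x for x in du)) * math.sqrt(sum(y * y for y in dv))
    return num / den if den else 0.0

def predict(active, others, artist):
    """Correlation-weighted average of neighbours' ratings for an unrated artist."""
    num = den = 0.0
    for other in others:
        if artist not in other:
            continue
        common = [a for a in active if a in other and a != artist]
        if not common:
            continue
        w = pearson(active, other, common)
        num += w * (other[artist] - AMBIVALENT)
        den += abs(w)
    return AMBIVALENT + num / den if den else AMBIVALENT

active = {"Beatles": 7, "Nirvana": 2, "Miles Davis": 6}
others = [
    {"Beatles": 6, "Nirvana": 1, "Miles Davis": 7, "Coltrane": 7},  # similar taste
    {"Beatles": 2, "Nirvana": 7, "Miles Davis": 3, "Coltrane": 2},  # opposite taste
]
print(f"predicted rating for Coltrane: {predict(active, others, 'Coltrane'):.1f}")  # ~6.5
```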
45
Social Filtering
Ignores the content, only looks at who judges things similarly
Works well on data relating to "taste"
–something that people are good at predicting about each other too
Does it work for topic?
–GroupLens results suggest otherwise (preliminary)
–Perhaps for quality assessments
–What about for assessing if a document is about a topic?
46
Learning interface agents
Add agents in the UI, delegate tasks to them
Use machine learning to improve performance
–learn user behavior, preferences
Useful when:
–1) past behavior is a useful predictor of the future
–2) wide variety of behaviors amongst users
Examples:
–mail clerk: sort incoming messages into the right mailboxes
–calendar manager: automatically schedule meeting times?
47
Summary
Relevance feedback is an effective means for user-directed query modification.
Modification can be done with either direct or indirect user input.
Modification can be done based on an individual's or a group's past input.