Download presentation
Presentation is loading. Please wait.
1
Evidence from Behavior LBSC 796/INFM 719R Douglas W. Oard Session 7, October 22, 2007
2
Picture of Relevance Feedback x x x x o o o Revised query x non-relevant documents o relevant documents o o o x x x x x x x x x x x x x x Initial query x
3
Rocchio Formula q m = modified query vector; q 0 = original query vector; α,β,γ: weights (hand-chosen or set empirically); D r = set of known relevant doc vectors; D nr = set of known irrelevant doc vectors
4
Rocchio Example 040800 124001 201104 6370-3 040800 248002 8044016 Original query Positive Feedback Negative feedback (+) (-) New query Typically, <
5
Using Positive Information Source: Jon Herlocker, SIGIR 1999
6
Using Negative Information Source: Jon Herlocker, SIGIR 1999
7
Rating-Based Recommendation Use ratings as to describe objects –Personal recommendations, peer review, … Beyond topicality: –Accuracy, coherence, depth, novelty, style, … Has been applied to many modalities –Books, Usenet news, movies, music, jokes, beer, …
8
Motivations to Provide Ratings Self-interest –Use the ratings to improve system’s user model Economic benefit –If a market for ratings is created Altruism
9
Explicit Feedback: Assumptions A1: User has sufficient knowledge for a reasonable initial query A2: Selected examples are representative A3: The user will try giving feedback A4: The user will keep giving feedback
10
A1: Good Initial Query? Two problems: –User may not have sufficient initial knowledge –Few or no relevant documents may be retrieved Examples: –Misspellings (Brittany Speers) –Cross-language information retrieval –Vocabulary mismatch (e.g., cosmonaut/astronaut)
11
A2: Representative Examples? There may be several clusters of relevant documents Examples: –Burma/Myanmar –Contradictory government policies –Opinions
12
A3: Will People Use It? Efficiency –Longer queries require more processing time Understandability –Harder to see why subsequent documents retrieved Risk –Users are reluctant to provide negative feedback
13
A4: Self-Interest Decreases! Number of Ratings Value of ratings Marginal value to rater Marginal value to community Few Lots Marginal cost
14
Solving the Cost vs. Value Problem Maximize the value –Provide for continuous user model adaptation Minimize the costs –Use implicit feedback rather than explicit ratings –Minimize privacy concerns through encryption –Build an efficient scalable architecture –Limit the scope to noncompetitive activities
15
Solution: Reduce the Marginal Cost Number of Ratings Marginal value to rater Marginal value to community Few Lots Marginal cost
16
View Listen Select Print Bookmark Save Purchase Delete Subscribe Copy / paste Quote Forward Reply Link Cite Mark up Tag Publish Organize Type Edit
17
View Listen Select Print Bookmark Save Purchase Delete Subscribe Copy / paste Quote Forward Reply Link Cite Mark up Tag Publish Organize Behavior Category Examine Retain Reference Annotate Create Type Edit
18
Minimum Scope SegmentObjectClass View Listen Select Print Bookmark Save Purchase Delete Subscribe Copy / paste Quote Forward Reply Link Cite Mark up Tag Publish Organize Behavior Category Examine Retain Reference Annotate Create Type Edit
19
Estimating Authority from Links Authority Hub
20
Anchor Text Indexing Sometimes yields unexpected results: –Google ranked Microsoft #1 for “evil empire” www.ibm.com Armonk, NY-based computer giant IBM announced today Joe’s computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter Adapted from: Manning and Raghavan
21
Collecting Click Streams Browsing histories are easily captured –Make all links initially point to a central site Encode the desired URL as a parameter –Build a time-annotated transition graph for each user Cookies identify users (when they use the same machine) –Redirect the browser to the desired page Reading time is correlated with interest –Can be used to build individual profiles –Used to target advertising by doubleclick.com
22
0 20 40 60 80 100 120 140 160 180 No Interest Low Interest Moderate Interest High Interest Rating Reading Time (seconds) Full Text Articles (Telecommunications) 50 32 58 43
23
More Complete Observations User selects an article –Interpretation: Summary was interesting User quickly prints the article –Interpretation: They want to read it User selects a second article –Interpretation: another interesting summary User scrolls around in the article –Interpretation: Parts with high dwell time and/or repeated revisits are interesting User stops scrolling for an extended period –Interpretation: User was interrupted
24
No Interest No Interest Low Interest Moderate Interest High Interest Abstracts (Pharmaceuticals) 42 55 52 51
25
Recommending w/Implicit Feedback Estimate Rating User Model Ratings Server User Ratings Community Ratings Predicted Ratings User Observations User Ratings User Model Estimate Ratings Observations Server Predicted Observations Community Observations Predicted Ratings User Observations
26
Critical Issues Protecting privacy –What absolute assurances can we provide? –How can we make remaining risks understood? Scalable rating servers –Is a fully distributed architecture practical? Non-cooperative users –How can the effect of spamming be limited?
27
Gaining Access to Observations Observe public behavior –Hypertext linking, publication, citing, … Policy protection –EU: Privacy laws –US: Privacy policies + FTC enforcement Statistical assurance of privacy –Distributed architecture –Model and mitigate privacy risks
28
Search Engine Query Logs A: Southeast Asia (Dec 27, 2004) B: Indonesia (Mar 29, 2005) C; Pakistan (Oct 10, 2005) D; Hawaii (Oct 16, 2006) E: Indonesia (Aug 8, 2007) F: Peru (Aug 16, 2007)
29
Search Engine Query Logs http://hannu.biz/aolsearch/
30
AOL User 4417749
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.