1 Information Filtering Rong Jin
2 Outline
  - Brief introduction to information filtering
  - Collaborative filtering
  - Adaptive filtering
3 Short vs. Long Term Info. Need
  Short-term information need (ad hoc retrieval)
    - Temporary need, e.g., info about used cars
    - Information source is relatively static
    - User pulls information
    - Application examples: library search, Web search
  Long-term information need (filtering)
    - Stable need, e.g., new data mining algorithms
    - Information source is dynamic
    - System pushes information to user
    - Application example: news filter
5 Examples of Information Filtering
  - News filtering
  - Movie/book/product recommenders
  - Literature recommenders
  - And many others …
6 Information Filtering
  Basic filtering question: Will user U like item X?
  Two different ways of answering it
    - Look at what U likes => characterize X => content-based filtering
    - Look at who likes X => characterize U => collaborative filtering
  Combine content-based filtering and collaborative filtering => unified filtering (open research topic)
7 Other Names for Information Filtering
  Content-based filtering is also called
    - Adaptive Information Filtering in TREC
    - Selective Dissemination of Information (SDI) in Library & Information Science
  Collaborative filtering is also called
    - Recommender systems
8 Part I: Collaborative Filtering
9 Example: Collaborative Filtering
            Movie 1  Movie 2  Movie 3  Movie 4  Movie 5
  User 1       1        5        3        4        3
  User 2       4        1        5        2        5
  User 3       2        ?        3        5        4
10 Example: Collaborative Filtering
  User 3 is more similar to user 1 than to user 2
  => predict 5 for the movie "15 Minutes" for user 3
11 Collaborative Filtering (CF) vs. Content-based Filtering (CBF)
  - CF does not need the content of items, while CBF relies on the content of items
  - CF is useful when the content of items
      - is not available or is difficult to acquire
      - is difficult to analyze
  - Problems with CF
      - Privacy issues
12 Why Collaborative Filtering?
13 Why Collaborative Filtering? Because it is worth a million dollars!
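14 Collaborative Filtering
  Goal: Making filtering decisions for an individual user based on the judgments of other users
  [Figure: rating matrix with users u_1, u_2, …, u_m (Users: U) as rows and objects o_1, o_2, …, o_j, o_j+1, …, o_n (Objects: O) as columns; an extra row for the test user u_test holds a few known ratings and "?" entries to be predicted]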
16 Collaborative Filtering
  Goal: Making filtering decisions for an individual user based on the judgments of other users
  Memory-based approaches
    - Given a test user u, find similar users {u_1, …, u_l}
    - Predict u's rating based on the ratings of u_1, …, u_l
17 Example: Collaborative Filtering
  User 3: 2 ? 3 5 4
  User 3 is more similar to user 1 than to user 2
  => predict 5 for the movie "15 Minutes" for user 3
18 Important Issues with CF
  - How to determine the similarity between different users?
  - How to aggregate ratings from similar users to form the predictions?
19 Pearson Correlation for CF
  Pearson correlation measures the linear correlation between two vectors
  V1 = (1, 3, 4, 3), va1 = 2.75
  V3 = (2, 3, 5, 4), va3 = 3.5
23 Pearson Correlation for CF
  V2 = (4, 5, 2, 5), va2 = 4
  V3 = (2, 3, 5, 4), va3 = 3.5
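The two Pearson slides can be checked numerically. The sketch below is a minimal illustration, not the lecture's own code; the function name `pearson` and the variable names are assumptions. It uses the vectors from the slides, restricted to the movies that user 3 has rated.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length rating vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance term: sum of products of deviations from each user's mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

v1 = [1, 3, 4, 3]   # user 1, va1 = 2.75
v2 = [4, 5, 2, 5]   # user 2, va2 = 4
v3 = [2, 3, 5, 4]   # user 3, va3 = 3.5

w13 = pearson(v1, v3)   # about 0.92: user 3 agrees with user 1
w23 = pearson(v2, v3)   # about -0.55: user 3 tends to disagree with user 2
```

The positive weight for user 1 and the negative weight for user 2 are what make user 3 "more similar to user 1" in the example.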
26 Aggregate Ratings
  va1 = 2.75, va2 = 4, va3 = 3.5
  Prediction components: estimated relative rating, average rating, roundup rating
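One common way to combine these three components is sketched below. It is an assumption that the slide uses this exact form: the relative rating is taken as the correlation-weighted average of each neighbor's deviation from their own mean, normalized by the sum of absolute weights, and "roundup" is read as rounding and clipping to a 1 to 5 scale. The neighbor ratings for the target movie (5 from user 1, 1 from user 2) are taken from the example table, and the weights are the Pearson correlations of the example vectors.

```python
def predict(test_mean, neighbors, lo=1, hi=5):
    """Predict a rating from (weight, neighbor_rating, neighbor_mean) triples.

    Returns the raw estimate and the estimate rounded and clipped to [lo, hi].
    """
    total_weight = sum(abs(w) for w, _, _ in neighbors)
    # Estimated relative rating: weighted deviation from each neighbor's mean.
    relative = sum(w * (r - m) for w, r, m in neighbors) / total_weight
    raw = test_mean + relative          # add the test user's average rating
    return raw, min(hi, max(lo, round(raw)))

raw, final = predict(
    test_mean=3.5,                      # va3
    neighbors=[(0.9234, 5, 2.75),       # user 1: weight, rating, va1
               (-0.5477, 1, 4.0)],      # user 2: weight, rating, va2
)
```

Under these assumptions the raw estimate is about 6.03, which clips to 5, consistent with the predicted rating of 5 in the earlier example. Note that the negatively correlated user 2 also pushes the estimate up, because a below-average rating from a dissimilar user counts as evidence in favor.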
27 Pearson Correlation for CF
            Movie 1  Movie 2  Movie 3  Movie 4  Movie 5
  User 1       1        5        3        ?        3
  User 2       4        1        5        2        ?
  User 3       2        ?        3        5        4
  V1 = (1, 3, 3), va1 = 2.33
  V3 = (2, 3, 4), va3 = 3
29 Pearson Correlation for CF
  V2 = (4, 5, 2), va2 = 3.67
  V3 = (2, 3, 5), va3 = 3.33
31 Aggregate Ratings
  va1 = 2.33, va2 = 3.67, va3 = 3.3
  Prediction components: estimated relative rating, average rating, roundup rating
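When ratings are missing, the vectors in the second example are built from co-rated items only, which is why V3 and its average change depending on the partner user. A minimal sketch of that restriction follows; the function name `pearson_corated` and the use of `None` for unrated items are assumptions made for illustration.

```python
from math import sqrt

def pearson_corated(a, b):
    """Pearson correlation over the items both users rated (None = unrated)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    xs, ys = zip(*pairs)
    # Means are computed over the co-rated items only.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

user1 = [1, 5, 3, None, 3]
user2 = [4, 1, 5, 2, None]
user3 = [2, None, 3, 5, 4]

w13 = pearson_corated(user1, user3)   # uses (1, 3, 3) vs (2, 3, 4)
w23 = pearson_corated(user2, user3)   # uses (4, 5, 2) vs (2, 3, 5)
```

With only three co-rated items per pair, the correlations (about 0.87 and -0.79) rest on very little evidence, which leads into the sparsity problems discussed next.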
32 Problems with Memory-based Approaches
            Movie 1  Movie 2  Movie 3  Movie 4  Movie 5
  User 1       ?        5        3        4        2
  User 2       4        1        5        ?        5
  User 3       5        ?        4        2        5
  User 4       1        5        3        5        ?
  - Most users only rate a few items
  - Two similar users may not rate the same set of items
  => Clustering users and items
34 Flexible Mixture Model (FMM)
  Cluster both users and items simultaneously
            Movie 1  Movie 2  Movie 3  Movie 4  Movie 5
  User 1       ?        5        3        4        2
  User 2       4        1        5        ?        5
  User 3       5        ?        4        2        5
  User 4       1        5        3        5        ?
  User clustering and item clustering are correlated!
35 Evaluation: Datasets
  EachMovie: (no longer available)
  MovieRating:
  Netflix prize:

                                MovieRating   EachMovie   Netflix
  Number of users               71,567        72,000      480,000
  Number of movies              10, ,000
  Avg. # of rated items/user
  Number of ratings             565
36 Evaluation Metric
  Mean Absolute Error (MAE): average absolute deviation of the predicted ratings from the actual ratings on items
  The smaller the MAE, the better the performance
  MAE = (1/T) * Σ_i |r̂_i − r_i|
    r̂_i: predicted rating
    r_i: true rating
    T: number of predicted items
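As a quick illustration of the metric, MAE can be computed in a few lines. The function name and the sample ratings below are made up for the example.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute deviation between predicted and true ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical predictions and true ratings for T = 3 items.
mae = mean_absolute_error([4.2, 3.0, 1.5], [5, 3, 2])   # (0.8 + 0 + 0.5) / 3
```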
37 Part II: Adaptive Filtering
38 Adaptive Information Filtering
  - Stable & long-term interest, dynamic info source
  - System must make a delivery decision immediately as a document arrives
  [Figure: a stream of documents enters the filtering system, which holds the user's stated interest ("my interest: …")]
39 Example: Adaptive Filtering
  History
    - Description: A homicide detective and a fire marshall must stop a pair of murderers who commit videotaped crimes to become media darlings. Rating:
    - Description: Benjamin Martin is drawn into the American revolutionary war against his will when a brutal British commander kills his son. Rating:
    - Description: A biography of sports legend, Muhammad Ali, from his early days to his days in the ring. Rating:
  What to Recommend?
    - Description: A high-school boy is given the chance to write a story about an up-and-coming rock band as he accompanies it on their concert tour. Recommend: ?
    - Description: A young adventurer named Milo Thatch joins an intrepid group of explorers to find the mysterious lost continent of Atlantis. Recommend: ?
  No / Yes
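40 A Typical AIF System
  [Figure: a doc source feeds a binary classifier built around the user interest profile; accepted docs go to the user; the profile is set up by Initialization from the user profile text; Learning updates it from user Feedback, the accumulated docs, and the utility function]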
41 Evaluation
  Typically evaluated with a utility function
    - Each delivered doc gets a utility value
    - Good doc gets a positive value
    - Bad doc gets a negative value
    - E.g., Utility = 3 * #good - 2 * #bad (linear utility)
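The linear utility from the example is simple enough to state directly in code. This is only an illustration; the function and parameter names are assumptions, with the credit of 3 and penalty of 2 taken from the slide.

```python
def linear_utility(n_good, n_bad, credit=3, penalty=2):
    """Linear utility: reward each good delivered doc, penalize each bad one."""
    return credit * n_good - penalty * n_bad

# Delivering 10 good docs and 4 bad ones scores 3*10 - 2*4 = 22.
score = linear_utility(10, 4)
# Delivering nothing scores 0, so a filter is only useful above zero utility.
baseline = linear_utility(0, 0)
```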
42 Three Basic Problems in AIF
  - Making filtering decision (binary classifier)
      - Doc text, profile text => yes/no
  - Initialization
      - Initialize the filter based on only the profile text or very few examples
  - Learning from
      - Limited relevance judgments (only on "yes" docs)
      - Accumulated documents
  All trying to maximize the utility
44 AIF vs. Retrieval, & Categorization
  Adaptive filtering as information retrieval
    - Rank the incoming documents
    - Return only the top k ranked ones to the user
  Adaptive filtering as a categorization problem
    - Classify documents into the categories "interested" and "not interested"
    - Return only the ones classified as "interested"
45 AIF vs. Retrieval, & Categorization
  - Like retrieval over a dynamic stream of docs, but ranking is impossible
  - Like online binary categorization, but with no initial training data and with limited feedback
46 Major Approaches to AIF
  Extended retrieval systems
    - Reuse retrieval techniques to score documents
    - Use a score threshold for filtering decision
    - Learn to improve scoring with traditional feedback
    - New approaches to threshold setting and learning
  Modified categorization systems (not covered)
    - Adapt to binary, unbalanced categorization
    - New approaches to initialization
    - Train with censored training examples
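47 A General Vector-Space Approach
  [Figure: a doc vector and the profile vector feed Scoring; the score goes to Thresholding, which answers yes or no against the current threshold; Utility Evaluation of the decisions yields Feedback Information, which drives Vector Learning (updating the profile vector) and Threshold Learning (updating the threshold)]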
48 Difficulties in Threshold Learning
  Example scores with judgments (R = relevant, N = non-relevant, ? = unjudged), threshold θ = 30.0:
    36.5 R
    33.4 N
    32.1 R
    29.9 ?
    27.3 ?
    …
  - Little or no labeled data
  - Correlation between threshold and profile vector
  - Exploration vs. exploitation (related to utility function)
49 Threshold Setting in Extended Retrieval Systems
  - Utility-independent approaches (generally not working well, not covered in this lecture)
  - Indirect (linear) utility optimization
      - Logistic regression (score -> prob. of relevance)
  - Direct utility optimization
      - Empirical utility optimization
      - Expected utility optimization given score distributions
  All try to learn the optimal threshold
50 Logistic Regression (Robertson & Walker 00)
  General idea: convert score of D to p(R|D)
    - Fit the model using feedback data
    - Linear utility is optimized with a fixed prob. cutoff
  But,
    - Possibly incorrect parametric assumptions
    - No positive examples initially
    - Limited positive feedback
    - Doesn't address the issue of exploration
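The fixed probability cutoff follows from the linear utility: delivering a doc with relevance probability p has expected utility credit * p - penalty * (1 - p), which is positive exactly when p > penalty / (credit + penalty). For the 3/2 utility used earlier, the cutoff is 0.4. The sketch below illustrates this; the function names and the logistic parameters `a` and `b` are assumptions, and fitting them from feedback data is not shown.

```python
from math import exp

def prob_cutoff(credit=3.0, penalty=2.0):
    """Probability above which delivering has positive expected linear utility."""
    return penalty / (credit + penalty)

def relevance_prob(score, a, b):
    """Logistic map from a retrieval score to p(R|D); a and b are assumed fitted."""
    return 1.0 / (1.0 + exp(-(a + b * score)))

def deliver(score, a, b, credit=3.0, penalty=2.0):
    """Deliver the doc if its estimated relevance probability beats the cutoff."""
    return relevance_prob(score, a, b) > prob_cutoff(credit, penalty)

cutoff = prob_cutoff()                      # 2 / (3 + 2) = 0.4
accept = deliver(0.0, a=0.0, b=1.0)         # p = 0.5, above the cutoff
reject = deliver(-1.0, a=0.0, b=1.0)        # p is about 0.27, below the cutoff
```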
51 Score Distribution Approaches (Arampatzis & van Hameren 01; Zhang & Callan 01)
  - Assume generative model of scores p(s|R), p(s|N)
  - Estimate the model with training data
  - Find the threshold by optimizing the expected utility under the estimated model
  - Specific methods differ in the way of defining and estimating the scoring distributions
52 Gaussian-Exponential Distributions
  p(s|R) ~ N(μ, σ²)
  p(s − s0|N) ~ E(λ)
  (From Zhang & Callan 2001)
53 Score Distribution Approaches (cont.)
  Pros
    - Principled approach
    - Arbitrary utility
    - Empirically effective
  Cons
    - May be sensitive to the scoring function
    - Exploration not addressed
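To make the expected-utility step concrete, here is a minimal sketch under the Gaussian-exponential model with the linear 3/2 utility. Everything specific in it is an assumption for illustration: the prior `p_rel`, the parameter values, treating λ as a rate, and the grid search. In practice the parameters would be estimated from training data, which is not shown.

```python
from math import erfc, exp, sqrt

def expected_utility(theta, p_rel, mu, sigma, s0, lam, credit=3.0, penalty=2.0):
    """Expected utility per incoming doc when delivering docs scoring above theta."""
    # P(s > theta | R) for a Gaussian N(mu, sigma^2).
    tail_rel = 0.5 * erfc((theta - mu) / (sigma * sqrt(2)))
    # P(s > theta | N) for an exponential on s - s0 with rate lam.
    tail_non = exp(-lam * (theta - s0)) if theta > s0 else 1.0
    return credit * p_rel * tail_rel - penalty * (1 - p_rel) * tail_non

def best_threshold(candidates, **model):
    """Pick the candidate threshold with the highest expected utility."""
    return max(candidates, key=lambda t: expected_utility(t, **model))

# Hypothetical model parameters, not taken from the lecture.
model = dict(p_rel=0.1, mu=35.0, sigma=3.0, s0=20.0, lam=0.5)
grid = [20 + 0.1 * i for i in range(301)]    # candidate thresholds from 20 to 50
theta = best_threshold(grid, **model)
```

Because the utility enters only through `credit` and `penalty` here, swapping in a different utility function means changing just the last line of `expected_utility`, which is the sense in which the approach supports arbitrary utilities.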
54 Direct Utility Optimization
  Given
    - A utility function U(C_R+, C_R-, C_N+, C_N-)
    - Training data D (scored documents with their relevance judgments)
  Formulate utility as a function of the threshold θ and the training data: U = F(θ, D)
  Choose the threshold by optimizing F(θ, D), i.e., θ* = argmax_θ F(θ, D)
55 Empirical Utility Optimization
  Basic idea
    - Compute the utility on the training data for each candidate threshold (score of a training doc)
    - Choose the threshold that gives the maximum utility
  Difficulty: Biased training sample!
    - We can only get an upper bound for the true optimal threshold
  Solutions
    - Heuristic adjustment (lowering) of threshold
    - Leads to beta-gamma threshold learning
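The basic idea can be sketched as a scan over training scores. The function name and data layout are assumptions; the three judged scores are borrowed from the earlier threshold-learning example, and the linear 3/2 utility is used for concreteness.

```python
def empirical_threshold(scored_docs, credit=3, penalty=2):
    """Return (threshold, utility) maximizing utility on judged training docs.

    scored_docs: list of (score, is_relevant) pairs.
    """
    best_theta, best_utility = None, float("-inf")
    for theta, _ in scored_docs:             # candidates are the training scores
        delivered = [rel for s, rel in scored_docs if s >= theta]
        good = sum(delivered)
        utility = credit * good - penalty * (len(delivered) - good)
        if utility > best_utility:
            best_theta, best_utility = theta, utility
    return best_theta, best_utility

# Judged docs only: 36.5 R, 33.4 N, 32.1 R.
train = [(36.5, True), (33.4, False), (32.1, True)]
theta, utility = empirical_threshold(train)
```

On this data the best cutoff is 32.1 with utility 4. Since judgments exist only for docs that were delivered, nothing below the current threshold can be evaluated, which is the bias that motivates lowering the result heuristically.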
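56 Illustration of Beta-Gamma Threshold Learning
  [Figure: utility plotted against cutoff position (…, K, …); the learned threshold lies between the cutoff with optimal utility and the cutoff where utility drops to zero; parameters β, γ ∈ [0,1]; N training examples]
  - The more examples, the less exploration (closer to optimal)
  - Encourage exploration up to zero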
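The formula itself did not survive in these notes. A commonly cited form of beta-gamma interpolation, given here as an assumption rather than as the slide's exact content, sets θ = α * θ_zero + (1 − α) * θ_optimal with α = β + (1 − β) * exp(−N * γ). The sketch below uses that form; the default values of `beta` and `gamma` are hypothetical.

```python
from math import exp

def beta_gamma_threshold(theta_opt, theta_zero, n_examples, beta=0.1, gamma=0.05):
    """Interpolate between the zero-utility and optimal-utility thresholds.

    Assumed form: alpha = beta + (1 - beta) * exp(-n_examples * gamma).
    """
    alpha = beta + (1 - beta) * exp(-n_examples * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_opt

# With no examples the threshold sits at the zero-utility point (full exploration).
start = beta_gamma_threshold(theta_opt=32.1, theta_zero=28.0, n_examples=0)
# With many examples it approaches beta * theta_zero + (1 - beta) * theta_opt.
late = beta_gamma_threshold(theta_opt=32.1, theta_zero=28.0, n_examples=1000)
```

Under this form, γ controls how quickly exploration decays as examples accumulate, and β sets how much exploration remains in the limit.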
57 Beta-Gamma Threshold Learning (cont.)
  Pros
    - Explicitly addresses exploration-exploitation tradeoff ("safe" exploration)
    - Arbitrary utility (with appropriate lower bound)
    - Empirically effective
  Cons
    - Purely heuristic
    - Zero utility lower bound often too conservative