Information Filtering LBSC 878 Douglas W. Oard and Dagobert Soergel Week 10: April 12, 1999.


Agenda Information filtering Profile learning Social filtering

Information Access Problems –Retrieval: stable collection, information need different each time –Filtering: stable information need, collection (a document stream) different each time

Information Filtering An abstract problem in which: –The information need is stable Characterized by a “profile” –A stream of documents is arriving Each must either be presented to the user or not Introduced by Luhn in 1958 –As “Selective Dissemination of Information” Named “Filtering” by Denning in 1975

A Simple Filtering Strategy Use any information retrieval system –Boolean, vector space, probabilistic, … Have the user specify a “standing query” –This will be the profile Limit the standing query by date –Each use, show what arrived since the last use
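The strategy above can be sketched in a few lines (a minimal illustration: the documents, dates, and query terms are hypothetical, and matching is plain conjunctive term containment standing in for a real retrieval system):

```python
from datetime import date

# Hypothetical document stream: each item has an arrival date and text.
documents = [
    {"arrived": date(1999, 4, 1), "text": "filtering news about markets"},
    {"arrived": date(1999, 4, 10), "text": "weather report for monday"},
    {"arrived": date(1999, 4, 11), "text": "new results in information filtering"},
]

def standing_query(docs, query_terms, since):
    """Return documents that arrived after `since` and contain every query term."""
    hits = []
    for doc in docs:
        if doc["arrived"] > since and all(t in doc["text"].split() for t in query_terms):
            hits.append(doc)
    return hits

# Show only what arrived since the last use (here, April 5) and matches the profile.
matches = standing_query(documents, ["filtering"], date(1999, 4, 5))
```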

What’s Wrong With That? Unnecessary indexing overhead –Indexing only speeds up retrospective searches Every profile is treated separately –The same work might be done repeatedly Forming effective queries by hand is hard –The computer might be able to help It is OK for text, but what about audio, video, … –Are words the only possible basis for filtering?

The Fast Data Finder Fast Boolean filtering using custom hardware –Up to 10,000 documents per second Words pass through a pipeline architecture –Each element looks for one word [Diagram: pipeline elements matching the words "good", "party", "great", and "aid", combined by OR, AND, and NOT gates]

Profile Indexing in SIFT Build an inverted file of profiles –Postings are profiles that contain each term RAM can hold 5,000 profiles/megabyte –And several machines can be run in parallel Both Boolean and vector space matching –User-selected threshold for each ranked profile Hand-tuned on a web page using today’s news Delivered by e-mail
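A toy version of SIFT-style profile indexing, with made-up profiles and a simple term-overlap count standing in for real Boolean or vector-space matching:

```python
from collections import defaultdict

# Hypothetical standing profiles, each represented as a set of terms.
profiles = {
    "p1": {"nuclear", "fallout"},
    "p2": {"information", "filtering"},
    "p3": {"filtering", "nuclear"},
}

# Inverted file over profiles: each term posts the profiles that contain it.
postings = defaultdict(set)
for pid, terms in profiles.items():
    for term in terms:
        postings[term].add(pid)

def match(document_terms, threshold=1):
    """Score each profile by overlapping terms; return those meeting its threshold."""
    scores = defaultdict(int)
    for term in document_terms:
        for pid in postings.get(term, ()):
            scores[pid] += 1
    return {pid: s for pid, s in scores.items() if s >= threshold}

# One arriving document is matched against all profiles at once.
hits = match({"nuclear", "filtering", "siberia"}, threshold=2)
```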

SIFT Limitations Centralization –Requires some form of cost sharing Privacy –Centrally registered profiles –Each profile associated with an address Usability –Manually specified profiles –Thresholds based on obscure retrieval status value

Profile Learning Automatic learning for stable profiles –Based on observations of user behavior Explicit ratings are easy to use –Binary ratings are simply relevance feedback Unobtrusive observations are easier to get –Reading time, save/delete, … –But we only have a few clues on how to use them

Relevance Feedback [Diagram: the feedback loop. Initial profile terms are used to make a profile vector; new documents are turned into document vectors; similarity between the profile vector and document vectors yields a rank-ordered list; the user selects and examines documents and assigns ratings; each rating and document vector update the user model, which emits revised profile vector(s)]

Machine Learning Given a set of vectors with associated values –e.g., document vector + relevance judgments Predict the values associated with new vectors –i.e., learn a mapping from vectors to values All learning systems share two problems –They need some basis for making predictions This is called an “inductive bias” –They must balance adaptation with generalization

Machine Learning Techniques Hill climbing (Rocchio Feedback) Instance-based learning Rule induction Regression Neural networks Genetic algorithms Statistical classification

Instance-Based Learning Remember examples of desired documents –Dissimilar examples capture range of possibilities Compare new documents to each example –Vector similarity, probabilistic, … Base rank order on the most similar example May be useful for broader interests

Rule Induction Automatically derived Boolean profiles –(Hopefully) effective and easily explained Specificity from the “perfect query” –AND terms in a document, OR the documents Generality from a bias favoring short profiles –e.g., penalize rules with more Boolean operators –Balanced by rewards for precision, recall, …
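The "perfect query" construction can be sketched directly (the training documents are hypothetical; the operator count shows the length bias a learner might penalize):

```python
# Relevant training documents, each a set of terms (hypothetical data).
relevant_docs = [
    {"nuclear", "fallout"},
    {"information", "filtering"},
]

def perfect_query(docs):
    """The 'perfect query': AND the terms within each relevant document,
    then OR those conjunctions together."""
    return [frozenset(d) for d in docs]

def matches(rule, doc_terms):
    """A document matches if any conjunct is fully contained in it."""
    return any(conj <= doc_terms for conj in rule)

def operator_count(rule):
    """Length bias: count Boolean operators (ANDs within conjuncts, ORs between)."""
    return sum(len(c) - 1 for c in rule) + (len(rule) - 1)

rule = perfect_query(relevant_docs)
```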

Regression Fit a simple function to the observations –Straight line, s-curve, … Logistic regression matches binary relevance –S-curves fit better than straight lines –Same calculation as a neural network with a single sigmoid output and no hidden layer But with a different training technique
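A one-feature logistic regression fit by plain stochastic gradient descent (a sketch: the similarity scores and relevance labels are invented, and SGD is just one possible training technique):

```python
import math

def sigmoid(z):
    """The s-curve: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, epochs=2000, lr=0.5):
    """Fit P(relevant | x) = sigmoid(w*x + b) by per-sample gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Gradient of the log loss for one observation.
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Hypothetical data: one similarity feature vs. a binary relevance judgment.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```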

Neural Networks Design inspired by the human brain –Neurons, connected by synapses Organized into layers for simplicity –Synapse strengths are learned from experience Seek to minimize predict-observe differences Can be fast with special purpose hardware –Biological neurons are far slower than computers Work best with application-specific structures –In biology this is achieved through evolution

Genetic Algorithms Design inspired by evolutionary biology –Parents exchange parts of genes (crossover) –Random variations in each generation (mutation) –Fittest genes dominate the population (selection) Genes are vectors –Term weights, source weights, module weights, … Selection is based on a fitness function –Tested repeatedly - must be informative and cheap
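A small genetic algorithm over term-weight genes, with a deliberately cheap fitness function (all data and parameters here are illustrative, chosen so the fitness test is informative and cheap):

```python
import random

random.seed(0)

# Genes are term-weight vectors; fitness rewards weights that score a
# known-relevant document above a known-nonrelevant one (hypothetical data).
relevant = [1.0, 0.0, 1.0]
nonrelevant = [0.0, 1.0, 0.0]

def fitness(gene):
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return dot(gene, relevant) - dot(gene, nonrelevant)

def crossover(a, b):
    """Parents exchange parts of genes at a random cut point."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(gene, rate=0.2):
    """Random variation in each generation."""
    return [w + random.uniform(-0.5, 0.5) if random.random() < rate else w
            for w in gene]

population = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                # selection: fittest genes survive
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
```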

Statistical Classification Represent relevant docs as one random vector –And nonrelevant docs as another Build a statistical model for each –e.g., a normal distribution Find the surface separating the distributions –e.g., a hyperplane Rank documents by distance from that surface –Possibly distorted by the shape of the distributions
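A simplified stand-in for the statistical approach: model each class by its centroid, separate with the perpendicular bisector, and rank by signed distance (the vectors are toy data; fitting full normal distributions would add covariances and can distort the surface as the slide notes):

```python
# Hypothetical training vectors for the two classes.
relevant = [[1.0, 1.0], [0.8, 1.2]]
nonrelevant = [[-1.0, -0.8], [-1.2, -1.0]]

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

mu_r, mu_n = centroid(relevant), centroid(nonrelevant)
w = [a - b for a, b in zip(mu_r, mu_n)]            # normal of the hyperplane
midpoint = [(a + b) / 2 for a, b in zip(mu_r, mu_n)]
bias = -sum(wi * mi for wi, mi in zip(w, midpoint))

def signed_distance(doc):
    """Rank documents by (unnormalized) signed distance from the surface."""
    return sum(wi * xi for wi, xi in zip(w, doc)) + bias

ranking = sorted([[-0.9, -1.0], [0.9, 1.1]], key=signed_distance, reverse=True)
```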

Training Strategies Overtraining can hurt performance –Performance on training data rises and plateaus –Performance on new data rises, then falls One strategy is to learn less each time –But it is hard to guess the right learning rate Splitting the training set is a useful alternative –Part for training, part for finding “new data” peak
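The split-the-training-set strategy reduces to watching for the held-out peak; the score sequence below is invented to show the rise-then-fall shape described above:

```python
# Illustrative held-out scores per epoch: rising, peaking, then falling
# as overtraining sets in (stand-ins for real training and evaluation).
held_out_score_by_epoch = [0.50, 0.62, 0.70, 0.74, 0.73, 0.69, 0.66]

def best_epoch(scores, patience=2):
    """Stop once the held-out score has not improved for `patience` epochs,
    and report the epoch where it peaked."""
    best, best_i, since = float("-inf"), 0, 0
    for i, s in enumerate(scores):
        if s > best:
            best, best_i, since = s, i, 0
        else:
            since += 1
            if since >= patience:
                break
    return best_i

stop_at = best_epoch(held_out_score_by_epoch)
```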

Social Filtering Exploit ratings from other users as features –Like personal recommendations, peer review, … Reaches beyond topicality to: –Accuracy, coherence, depth, novelty, style, … Applies equally well to other modalities –Movies, recorded music, … Sometimes called “collaborative” filtering

Social Filtering Example [Table: ratings by Amy, Norina, Paul, and Skip for the genres Comedy, Drama, Action, and Mystery]
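A minimal user-user prediction over a small ratings matrix like the one above (the raters, items, rating values, and the crude agreement measure are all invented for illustration):

```python
# Hypothetical ratings matrix: rater -> {item: rating on a 1-5 scale}.
ratings = {
    "Amy":    {"comedy": 5, "drama": 2, "action": 4},
    "Norina": {"comedy": 4, "drama": 1, "action": 5, "mystery": 2},
    "Paul":   {"comedy": 1, "drama": 5, "mystery": 5},
}

def agreement(a, b):
    """Crude similarity: +1 for each co-rated item where ratings are within 1."""
    common = set(ratings[a]) & set(ratings[b])
    return sum(1 for i in common if abs(ratings[a][i] - ratings[b][i]) <= 1)

def predict(user, item):
    """Similarity-weighted average of other raters' ratings for `item`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            sim = agreement(user, other)
            num += sim * r[item]
            den += sim
    return num / den if den else None

# Amy agrees with Norina but not Paul, so her mystery prediction follows Norina.
guess = predict("Amy", "mystery")
```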

Some Things We (Sort of) Know Treating each genre separately can be useful –Separate predictions for separate tastes Negative information might be useful –“I hate everything my parents like” People like to know who provided ratings Popularity provides a useful fallback People don’t like to provide ratings –Few experiments have achieved sufficient scale

Implicit Feedback Observe user behavior to infer a set of ratings –Examine (reading time, scrolling behavior, …) –Retain (bookmark, save, save & annotate, print, …) –Refer to (reply, forward, include link, cut & paste, …) Some measurements are directly useful –e.g., use reading time to predict reading time Others require some inference –Should you treat cut & paste as an endorsement?
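One hypothetical way to fold such observations into a single implicit rating is a table of fixed per-behavior weights (the behaviors and weights below are assumptions, not measured values; whether cut & paste really signals endorsement remains the open question the slide raises):

```python
# Assumed weights for inferring a rating from observed behaviors.
BEHAVIOR_WEIGHTS = {
    "long_read": 1.0,      # examine: substantial reading time
    "save": 2.0,           # retain
    "print": 2.0,          # retain
    "forward": 3.0,        # refer to
    "delete_unread": -2.0, # negative evidence
}

def inferred_rating(observed_behaviors):
    """Sum behavior weights into a single implicit rating."""
    return sum(BEHAVIOR_WEIGHTS.get(b, 0.0) for b in observed_behaviors)

r1 = inferred_rating(["long_read", "save", "forward"])
r2 = inferred_rating(["delete_unread"])
```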

Constructing a Rating Matrix [Diagram: an observation matrix (raters Jack, Sam, Tim, Ken, Marty, Joe, Jill, and Mary by documents 1-4), recording examination, retention, reference, and explicit-rating observations, is mapped into a rating matrix over the same raters and documents]

Some Research Questions How readers can pay raters efficiently How to use implicit feedback –No large scale integrated experiments yet How to protect privacy –Pseudonyms can be defeated fairly easily How to merge social and content-based filtering –Social filtering never finds anything new How to evaluate social filtering effectiveness

Combining Content and Ratings [Diagram: a document described by two feature sets: terms (nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval) and raters (Jack, Sam, Tim, Ken, Marty, Joe, Jill, Mary)] Two sources of features –Terms and raters Each is used differently –Find terms in documents –Find informative raters
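A sketch of merging the two feature sources with a single mixing knob (alpha, the scoring functions, and the data are all assumptions made for illustration):

```python
# Content evidence: fraction of the profile's terms found in the document.
def content_score(doc_terms, profile_terms):
    return len(doc_terms & profile_terms) / len(profile_terms)

# Social evidence: fraction of the informative raters who endorsed the document.
def social_score(doc_raters, informative_raters):
    return len(doc_raters & informative_raters) / len(informative_raters)

def combined(doc_terms, doc_raters, profile_terms, informative_raters, alpha=0.5):
    """Weighted mix of content and social evidence; alpha is an assumed knob."""
    return (alpha * content_score(doc_terms, profile_terms)
            + (1 - alpha) * social_score(doc_raters, informative_raters))

score = combined({"nuclear", "fallout"}, {"Jack", "Sam"},
                 {"nuclear", "fallout", "siberia"}, {"Sam", "Tim"})
```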

Variations on Social Filtering Citation indexing –A special case of refer-to implicit feedback Search for people based on their behavior –Discover potential collaborators –Focus advertising on interested consumers Collaborative exploration –Explore a large, fairly stable collection –Discoveries migrate to people with similar interests