Filtering and Recommendation INST 734 Module 9 Doug Oard
Agenda Filtering Recommender ystems Classification
Information Filtering An abstract task in which: –The information need is stable –A stream of documents is arriving –The system must decide which ones to present Introduced by Luhn in 1958 –As Selective Dissemination of Information (SDI) –Named “Filtering” by Denning in 1975 –After 1983, SDI came to have another meaning
Information Filtering User Profile Matching New Documents Recommendation Rating
Information Access Problems Collection Information Need Stable Different Each Time Data Mining Retrieval Filtering Different Each Time
Information Filtering Examples spam filtering Personalized newspaper Children’s Internet Protection Act (CIPA)
Standing Queries Have the user specify a “standing query” –This is the initial “profile” Allow updates based on relevance feedback –Track changing interests –Learn new terms Match each arriving document to the profile –On arrival, on a schedule, or when app is opened
Profile Indexing Build an inverted file of profiles –Postings are profiles that contain each term RAM can hold ~5 million profiles/GB –And several machines could run in parallel Challenges: –New terms (would) have infinite IDF! –No obvious a priori way to do threshold selection –Privacy (with a centralized profile index)
Content-Based Filtering “Fast Data Finder” Boolean filtering using custom hardware –Up to 10,000 documents per second (in 1996!) Words pass through a pipeline architecture –Each element looks for one word goodpartygreataid OR AND NOT
Spam Filtering Adversarial IR yields an adaptation cycle Content signatures –Compression-based techniques Source signatures –Blacklists and whitelists –DomainKey Identified Mail (DKIM) Behavioral signatures
Agenda Filtering Recommender systems Classification