An Overview of Information Retrieval Nov. 10, 2009 Maryam Karimzadehgan Department of Computer Science University of Illinois, Urbana-Champaign.

Presentation transcript:

An Overview of Information Retrieval Nov. 10, 2009 Maryam Karimzadehgan Department of Computer Science University of Illinois, Urbana-Champaign

Outline
Limitations of database systems (motivation for IR systems)
Information Retrieval
  – Indexing
  – Similarity Measures
  – Evaluation
  – Other IR applications
Web Search - PageRank Algorithm
News Recommender system on Facebook
10/27/2015 Introduction to Information Retrieval

A (Simple) Database Example
[Figure: Student, Department, Course, and Enrollment tables]

Databases vs. IR
Format of data:
  – DB: Structured data. Clear semantics based on a formal model.
  – IR: Mostly unstructured. Free text.
Queries:
  – DB: Formal (like SQL)
  – IR: Often expressed in natural language (keyword search)
Result:
  – DB: Exact result
  – IR: Sometimes relevant, often not

Short vs. Long Term Info Need
Short-term information need (ad hoc retrieval)
  – “Temporary need”
  – Information source is relatively static
  – User “pulls” information
  – Application examples: library search, Web search
Long-term information need (filtering)
  – “Stable need”, e.g., new data mining algorithms
  – Information source is dynamic
  – System “pushes” information to user
  – Applications: news filter

What is Information Retrieval?
Goal: Find the documents most relevant to a certain query (information need)
Dealing with notions of:
  – Collection of documents
  – Query (user’s information need)
  – Relevance

What Types of Information?
Text (documents)
XML and structured documents
Images
Audio (sound effects, songs, etc.)
Video
Source code
Applications/Web services

The Information Retrieval Cycle
[Figure: flow diagram with stages Source Selection, Query Formulation, Search, and Selection; intermediate outputs Resource, Query, Ranked List, Documents, and result; feedback loops for query reformulation and relevance feedback]
Slide is from Jimmy Lin’s tutorial

The IR Black Box
[Figure: Documents and Query enter a black box, which outputs Results]
Slide is from Jimmy Lin’s tutorial

Inside The IR Black Box
[Figure: Query and Documents each pass through a Representation step, giving a Query Representation and a Document Representation (Index); a Comparison Function matches the two to produce Results]
Slide is from Jimmy Lin’s tutorial

Typical IR System Architecture
[Figure: components Tokenizer, Indexer, Index, Scorer, and Feedback; docs become Doc Rep (Index), the User's query becomes Query Rep, the Scorer produces results, and the user's judgments go to Feedback]
Slide is from ChengXiang Zhai’s CS410

1) Indexing
Making it easier to match a query with a document
Query and document should be represented using the same units/terms
[Figure: DOCUMENT "This is a document in information retrieval" mapped to INDEX terms: document, information, retrieval, is, this]
Bag of Words Representation

What is a good indexing term?
Specific (phrases) or general (single word)?
Words with middle frequency are most useful
  – Not too specific (low utility, but still useful!)
  – Not too general (lack of discrimination, stop words)
  – Stop word removal is common, but rare words are kept in modern search engines
Stop words are words such as:
  – a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, as, at

Inverted Index
Doc 1: This is a sample document with one sample sentence
Doc 2: This is another sample document

Dictionary
Term      # docs   Total freq
This      2        2
is        2        2
sample    2        3
another   1        1
…         …        …

Postings
Doc id    Freq
…         …
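The dictionary and postings for the two sample documents can be rebuilt with a short sketch. This is only an illustration: the function names and the lowercase, whitespace-based `tokenize` are assumptions, not part of the original slides.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems do more.
    return text.lower().split()


def build_inverted_index(docs):
    """Return (dictionary, postings).

    dictionary: term -> (number of docs containing the term, total frequency)
    postings:   term -> list of (doc_id, frequency of the term in that doc)
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(tokenize(docs[doc_id]))
        for term, freq in counts.items():
            postings[term].append((doc_id, freq))
    dictionary = {
        term: (len(plist), sum(freq for _, freq in plist))
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)


docs = {
    1: "This is a sample document with one sample sentence",
    2: "This is another sample document",
}
dictionary, postings = build_inverted_index(docs)
print(dictionary["sample"])  # (2, 3): two docs, three occurrences in total
print(postings["sample"])    # [(1, 2), (2, 1)]
```

The dictionary entry for "sample" matches the slide's row (2 docs, total frequency 3); the postings list records where those occurrences are.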

2) Tokenization/Stemming
Stemming: Mapping all inflectional forms of words to the same root form, e.g.
  – computer -> compute
  – computation -> compute
  – computing -> compute
Porter’s Stemmer is popular for English

3) Relevance Feedback
[Figure: the User's Query goes to the Retrieval Engine, which searches the Document collection and returns Results d1, d2, …, dk; the user's Judgments (d1 +, d2 -, d3 +, …, dk -) go to Feedback, which produces an Updated query]
Slide is from ChengXiang Zhai’s CS410

4) Scorer/Similarity Methods
1) Boolean model
2) Vector-space model
3) Probabilistic model
4) Language model

Boolean Model
Each index term is either present or absent
Documents are either Relevant or Not Relevant (no ranking)
Advantages
  – Simple
Disadvantages
  – No notion of ranking (exact matching only)
  – All index terms have equal weight
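A minimal sketch of Boolean matching, assuming a term-to-document-set index built by whitespace tokenization. The helper names and the three toy documents are made up for illustration.

```python
def build_term_index(docs):
    # term -> set of ids of the documents that contain it (presence only)
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, terms):
    """Ids of documents containing every term."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, terms):
    """Ids of documents containing at least one term."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


docs = {
    1: "information retrieval models",
    2: "database systems",
    3: "retrieval of database records",
}
index = build_term_index(docs)
and_result = boolean_and(index, ["retrieval", "database"])
or_result = boolean_or(index, ["retrieval", "database"])
print(and_result)  # {3}
print(or_result)   # {1, 2, 3}
```

The result is an unordered set: a document either matches or it does not, which is exactly the "no ranking" limitation listed above.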

Vector Space Model
Query and documents are represented as vectors of index terms
Similarity calculated using COSINE similarity between two vectors
  – Ranked according to similarity
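As a rough sketch of this model, the code below represents the query and each document as raw term-count vectors and ranks documents by cosine similarity. The raw-count weighting and the toy documents are assumptions; a real system would typically use weighted terms.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (similarity, doc_id) pairs, most similar first."""
    q_vec = Counter(query.lower().split())
    scored = [
        (cosine(q_vec, Counter(text.lower().split())), doc_id)
        for doc_id, text in docs.items()
    ]
    return sorted(scored, reverse=True)


docs = {
    "d1": "information retrieval",
    "d2": "database systems",
    "d3": "information retrieval information",
}
ranking = rank("information retrieval", docs)
print([doc_id for _, doc_id in ranking])  # ['d1', 'd3', 'd2']
```

Unlike the Boolean model, every document receives a score, so partial matches can still be ordered.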

TF-IDF in Vector Space Model
[Formula not recovered: term weight made up of a TF part and an IDF part]
IDF: A term is more discriminative if it occurs in fewer documents
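The slide's exact weighting formula did not survive extraction, so the sketch below uses one common variant as an assumption: raw term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vectors, idf) using weight = tf * log(N / df).

    This weighting is one common choice, not necessarily the slide's formula.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # count each term once per document
    idf = {term: math.log(n_docs / df[term]) for term in df}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }
    return vectors, idf


docs = {
    1: "This is a sample document with one sample sentence",
    2: "This is another sample document",
}
vectors, idf = tf_idf_vectors(docs)
print(idf["sample"])   # 0.0: occurs in every document, so it does not discriminate
print(idf["another"])  # log(2): occurs in only one of the two documents
```

With this variant, a term that appears in every document gets weight zero no matter how often it occurs, which matches the intuition that rarer terms are more discriminative.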

Language Models for Retrieval
[Figure: two documents, each with its own language model: a text mining paper (… text ?, mining ?, association ?, clustering ?, … food ? …) and a food nutrition paper (… food ?, nutrition ?, healthy ?, diet ? …)]
Query = “data mining algorithms”
Which model would most likely have generated this query?
Slide is from ChengXiang Zhai’s CS410

Retrieval as Language Model Estimation
Document ranking based on query likelihood
[Formula not recovered: query likelihood under the document language model]
Retrieval problem ≈ estimation of p(w_i | d)
Smoothing is an important issue, and distinguishes different approaches
Slide is from ChengXiang Zhai’s CS410
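Query-likelihood ranking can be sketched as follows. The slide does not say which smoothing method it has in mind, so this example assumes linear interpolation with the collection model (often called Jelinek-Mercer smoothing) with an arbitrary weight `lam`; the two toy documents are also made up.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """log p(query | d) with p(w | d) = (1 - lam) * c(w, d) / |d| + lam * p(w | C)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts[word] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # word never seen anywhere in the collection
        score += math.log(p)
    return score


docs = {
    "text_mining": "text mining data clustering algorithms mining".split(),
    "food": "food nutrition healthy diet food".split(),
}
all_tokens = [tok for tokens in docs.values() for tok in tokens]
coll_counts, coll_len = Counter(all_tokens), len(all_tokens)

query = "data mining algorithms".split()
scores = {
    name: query_log_likelihood(query, tokens, coll_counts, coll_len)
    for name, tokens in docs.items()
}
print(max(scores, key=scores.get))  # text_mining
```

Without smoothing, the food document would assign probability zero to every query word; the collection model keeps its score finite while still ranking it below the text mining document.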

Information Retrieval Evaluation
  – Coverage of information
  – Form of presentation
  – Effort required/ease of use
  – Time and space efficiency
  – Recall: proportion of relevant material actually retrieved
  – Precision: proportion of retrieved material actually relevant

Precision vs. Recall
[Figure: overlapping Relevant and Retrieved sets within All docs]
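The two set-based measures can be computed directly from the retrieved and relevant sets. A small sketch, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision)  # 0.5
print(recall)     # 2/3, about 0.667
```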

Web Search – Google PageRank Algorithm

Characteristics of Web Information
“Infinite” size
  – Static HTML pages
  – Dynamically generated HTML pages (DB)
Semi-structured
  – Structured = HTML tags, hyperlinks, etc.
  – Unstructured = Text
Different formats (pdf, word, ps, …)
Multi-media (textual, audio, images, …)
High variance in quality (much junk)

Exploiting Inter-Document Links
  – Description (“anchor text”): “extra text”/summary for a doc
  – Links indicate the utility of a doc
[Figure: linked pages, with labels Hub and Authority]
Slide is from ChengXiang Zhai’s CS410

PageRank: Capturing Page “Popularity” [Page & Brin 98]
Intuitions
  – Links are like citations in literature
  – A page that is cited often can be expected to be more useful in general
  – Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
  – Smoothing of citations (every page is assumed to have a non-zero citation count)

The PageRank Algorithm (Page et al. 98)
Random surfing model: at any page,
  – With prob. α, randomly jump to a page
  – With prob. (1 - α), randomly pick a link to follow
[Figure: four-page link graph (d1, d2, d3, d4) with its “transition matrix”]
M_ij = probability of going from d_i to d_j
I_ij = 1/N
N = # pages
p(d_i): PageRank score (average probability of visiting page d_i)
p(d_j) = \sum_{i=1}^{N} [ \alpha I_{ij} + (1 - \alpha) M_{ij} ] p(d_i)
Initial value p(d) = 1/N; iterate until convergence
Slide is from ChengXiang Zhai’s CS410
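The random surfing model can be sketched as a power iteration in which each step applies p(d_j) = sum over i of [α/N + (1 - α) M_ij] p(d_i), starting from p(d) = 1/N. The link graph below is hypothetical (the slide's own graph was not recovered), α = 0.15 is an assumed value, and letting pages without out-links jump uniformly is a further assumption.

```python
def pagerank(links, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for the random surfing model.

    links: page -> list of pages it links to (every target must also be a key).
    alpha: probability of a random jump; (1 - alpha) is the probability of
           following a link.
    """
    pages = sorted(links)
    n = len(pages)
    idx = {page: i for i, page in enumerate(pages)}

    # M[i][j] = probability of going from page i to page j by following a link.
    M = [[0.0] * n for _ in range(n)]
    for page, outs in links.items():
        if outs:
            for target in outs:
                M[idx[page]][idx[target]] += 1.0 / len(outs)
        else:
            # Assumption: a page with no out-links jumps to any page uniformly.
            M[idx[page]] = [1.0 / n] * n

    scores = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new = [
            sum((alpha / n + (1 - alpha) * M[i][j]) * scores[i] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:  # iterate until convergence
            break
    return dict(zip(pages, scores))


# Hypothetical four-page graph for illustration only.
links = {"d1": ["d2", "d3"], "d2": ["d3"], "d3": ["d1"], "d4": ["d3"]}
scores = pagerank(links)
print(max(scores, key=scores.get))  # d3, the page with the most in-links
```

Because nothing links to d4 in this graph, its score comes only from random jumps and settles at α/N, which shows the "smoothing of citations" intuition: every page keeps a non-zero score.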

PageRank: Example
[Figure: four-page link graph (d1, d2, d3, d4) with a worked iteration; matrix values not recovered]
Initial value p(d) = 1/N; iterate until convergence

Beyond Just Search – Information Retrieval Applications

Examples of Text Management Applications
Search
  – Web search engines (Google, Yahoo, …)
  – Library systems
  – …
Filtering
  – News filter
  – Spam filter
  – Literature/movie recommender
Categorization
  – Automatically sorting emails
  – …
Mining/Extraction
  – Discovering major complaints from email in customer service
  – Business intelligence
  – Bioinformatics
  – …

Sample Applications
1) Text Categorization
2) Document/Term Clustering
3) Text Summarization
4) Filtering

1) Text Categorization
Pre-given categories and labeled document examples (categories may form a hierarchy)
Classify new documents
A standard supervised learning problem
[Figure: labeled example documents (Sports, Business, Education, …) feed a Categorization System, which assigns new documents to categories (Sports, Business, Education, Science, …)]
Slide is from ChengXiang Zhai’s CS410

K-Nearest Neighbor Classifier
Keep all training examples
Find k examples that are most similar to the new document (“neighbor” documents)
Assign the category that is most common in these neighbor documents (neighbors vote for the category)
Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
Technical elements (“retrieval techniques”)
  – Document representation
  – Document distance measure
Slide is from ChengXiang Zhai’s CS410
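A minimal sketch of the classifier described above, assuming raw term-count vectors as the document representation and cosine similarity as the (inverse) distance measure. Votes are weighted by similarity so that closer neighbors have more influence. The training examples are invented for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(new_text, training, k=3):
    """training: list of (text, category) pairs; returns the winning category."""
    new_vec = Counter(new_text.lower().split())
    scored = sorted(
        ((cosine(new_vec, Counter(text.lower().split())), label)
         for text, label in training),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity  # a closer neighbor has more influence
    return votes.most_common(1)[0][0]


training = [
    ("football match score goal", "Sports"),
    ("basketball team wins game", "Sports"),
    ("stock market shares fall", "Business"),
    ("company profit market report", "Business"),
    ("school students exam results", "Education"),
]
print(knn_classify("team wins football match", training))  # Sports
print(knn_classify("market shares report", training))      # Business
```

All training examples are kept and compared at classification time, which is why the two "technical elements" are the same ones retrieval relies on: representation and a similarity measure.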

Example of K-NN Classifier
[Figure: example classifications with k=1 and k=4]
Slide is from ChengXiang Zhai’s CS410

Examples of Text Categorization
News article classification
Meta-data annotation
Automatic email sorting
Web page classification

2) The Clustering Problem
Group similar objects together
Objects can be documents, terms, passages
Example
[Figure not recovered]
Slide is from ChengXiang Zhai’s CS410

Similarity-based Clustering
Define a similarity function to measure similarity between two objects
Gradually group similar objects together in a bottom-up fashion
Stop when some stopping criterion is met
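The three steps above can be sketched as a simple bottom-up (agglomerative) procedure. Several details are assumptions made for the example: Jaccard overlap of word sets as the similarity function, single-link similarity between clusters, and a similarity threshold as the stopping criterion.

```python
def jaccard(a, b):
    """Similarity function: overlap of the two documents' word sets."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def agglomerative(items, similarity, threshold):
    """Merge the two most similar clusters until none reach the threshold."""
    clusters = [[item] for item in items]  # start with one cluster per object
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link: similarity of the closest pair across clusters.
                s = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < threshold:  # stopping criterion
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = [
    "text mining clustering",
    "text mining algorithms",
    "food nutrition diet",
    "healthy food diet",
]
clusters = agglomerative(docs, jaccard, threshold=0.3)
print(clusters)
# [['text mining clustering', 'text mining algorithms'],
#  ['food nutrition diet', 'healthy food diet']]
```

Changing the threshold changes the granularity: a lower value keeps merging until fewer, broader clusters remain.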

Examples of Doc/Term Clustering
Clustering of retrieval results
Clustering of documents in the whole collection
Term clustering to define “concept” or “theme”

3) Summarization - Simple Discourse Analysis
[Figure: the document as a sequence of vectors (vector 1, vector 2, vector 3, …, vector n-1, vector n), with similarity computed between vectors]
Slide is from ChengXiang Zhai’s CS410

A Simple Summarization Method
[Figure: the document is divided into segments of sentences (sentence 1, sentence 2, sentence 3, …); in each segment, the sentence most similar to the doc vector goes into the summary]
Slide is from ChengXiang Zhai’s CS410

Examples of Summarization
News summary
Summarize retrieval results
  – Single doc summary
  – Multi-doc summary

4) Filtering
Content-based filtering (adaptive filtering)
Collaborative filtering (recommender systems)

Examples of Information Filtering
News filtering
Email filtering
Movie/book recommenders such as Amazon.com
Literature recommenders

Content-based Filtering vs. Collaborative Filtering
Basic filtering question: Will user U like item X?
Two different ways of answering it
  – Look at what U likes => characterize X => content-based filtering
  – Look at who likes X => characterize U => collaborative filtering
Can be combined
Collaborative filtering is also called “Recommender Systems”
Slide is from ChengXiang Zhai’s CS410

Adaptive Information Filtering
Stable & long term interest
System must make a delivery decision immediately as a document “arrives”
[Figure: a stream of documents (…) enters the Filtering System, which is configured with “my interest”]
Slide is from ChengXiang Zhai’s CS410

Collaborative Filtering
Making filtering decisions for an individual user based on the judgments of other users
Inferring an individual’s interests/preferences from those of other similar users
General idea
  – Given a user u, find similar users {u_1, …, u_m}
  – Predict u’s preferences based on the preferences of u_1, …, u_m
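The general idea can be sketched as a user-based predictor: measure how similar each other user is to u, then take a similarity-weighted average of their ratings for the item. Cosine similarity over co-rated items and the rating data are assumptions chosen for illustration; the slides do not prescribe a particular similarity measure.

```python
import math


def user_similarity(ratings_a, ratings_b):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[o] * ratings_b[o] for o in common)
    norm_a = math.sqrt(sum(ratings_a[o] ** 2 for o in common))
    norm_b = math.sqrt(sum(ratings_b[o] ** 2 for o in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        s = user_similarity(ratings[user], other_ratings)
        numerator += s * other_ratings[item]
        denominator += abs(s)
    return numerator / denominator if denominator else None


# Hypothetical ratings on a 1-5 scale: user -> {object: rating}.
ratings = {
    "u1": {"a": 5, "b": 1},
    "u2": {"a": 5, "b": 1, "c": 5},
    "u3": {"a": 1, "b": 5, "c": 1},
}
prediction = predict(ratings, "u1", "c")
print(round(prediction, 2))  # 3.89
```

The prediction lands closer to u2's rating than to u3's because u1's known ratings agree with u2's.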

Collaborative Filtering: Assumptions
Users with a common interest will have similar preferences
Users with similar preferences probably share the same interest
Examples
  – “interest is IR” => “favor SIGIR papers”
  – “favor SIGIR papers” => “interest is IR”
A sufficiently large number of user preferences is available

Collaborative Filtering: Intuitions
User similarity (user X and Y)
  – If X liked the movie, Y will like the movie
Item similarity
  – Since 90% of those who liked Star Wars also liked Independence Day, and you liked Star Wars
  – You may also like Independence Day

A Formal Framework for Rating
Users: U = {u_1, u_2, …, u_i, …, u_m}
Objects: O = {o_1, o_2, …, o_j, …, o_n}
[Figure: user-by-object rating matrix with entries X_ij = f(u_i, o_j); unknown entries marked “?”]
The task
  – Unknown function f: U x O → R
  – Assume known f values for some (u,o)’s
  – Predict f values for other (u,o)’s
  – Essentially function approximation, like other learning problems
Slide is from ChengXiang Zhai’s CS410

News Recommendation on Facebook

Motivation
[Figure labels: Newsletter, Organizer, www]

Facebook as a medium for recommendations
Provides a great platform with built-in social networks.
More than 120 million users log on to Facebook at least once each day.
More than 95% of the users have used at least one application built on the Facebook Platform.
Possible to make applications that deeply integrate into a user's Facebook experience.
  – FBML (Facebook Markup Language)
  – FBJS (Facebook JavaScript)
  – FQL (Facebook Query Language)
  – Facebook API

System Architecture
[Figure: components Crawler, Indexer, Meta Data (RDBMS), Date-wise Index, Query Index, Clustering, Newsletter (RDBMS), Register Community, and Facebook Application]

Application Main Page
[Figure: screenshot of the application's main page]

Collaborative User Feedback
Three kinds of user feedback captured
  – Clickthroughs
  – Explicit ratings
  – Inter-person recommendations
They are linearly combined as follows:
[Formula not recovered]
where F_ij aggregates all kinds of feedback for article a_j from user u_i

Demo
News Recommender on Facebook

Application Information
For more information about the application:
  –

We are looking for motivated students to work on this application.
Requirements:
  – Database knowledge
  – PHP
  – Perl
Contact me if you are interested:

Thanks