Statistical Models for Information Retrieval and Text Mining
ChengXiang Zhai (翟成祥)
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
China-US-France Summer School, Lotus Hill Institute, 2008. © 2008 ChengXiang Zhai

Course Overview
[Diagram: scope of the course, relating Statistics, Machine Learning, Computer Vision, Natural Language Processing, and Information Retrieval to multimedia data and text data.]

Goal of the Course
Overview of techniques for information retrieval (IR)
Detailed explanation of a few statistical models for IR and text mining
– Probabilistic retrieval models (for search)
– Probabilistic topic models (for text mining)
Potential benefits for you:
– Some ideas that work well for text retrieval may also work for computer vision
– Techniques from computer vision may be applicable to IR
– IR and text mining raise new challenges as well as opportunities for machine learning

Course Plan
Lecture 1: Overview of information retrieval
Lecture 2: Statistical language models for IR: Part 1
Lecture 3: Statistical language models for IR: Part 2
Lecture 4: Formal retrieval frameworks
Lecture 5: Probabilistic topic models for text mining

Lecture 1: Overview of IR
Basic concepts in text retrieval (TR)
Evaluation of TR
Common components of a TR system
Overview of retrieval models

Basic Concepts in TR

What is Text Retrieval (TR)?
There exists a collection of text documents
The user gives a query to express an information need
The retrieval system returns relevant documents to the user
Known as "search technology" in industry

History of TR on One Slide
Birth of TR
– 1945: V. Bush's article "As We May Think"
– 1957: H. P. Luhn's idea of word counting and matching
Indexing & evaluation methodology (1960s)
– SMART system (G. Salton's group)
– Cranfield test collection (C. Cleverdon's group)
– Indexing: automatic indexing can be as good as manual indexing (controlled vocabulary)
TR models (1970s & 1980s) …
Large-scale evaluation & applications (1990s to present)
– TREC (D. Harman & E. Voorhees, NIST)
– Web search, PubMed, …
– Boundaries with related areas are disappearing

Short-Term vs. Long-Term Information Need
Short-term information need (ad hoc retrieval)
– "Temporary need", e.g., info about used cars
– The information source is relatively static
– The user "pulls" information
– Application examples: library search, Web search
Long-term information need (filtering)
– "Stable need", e.g., new data mining algorithms
– The information source is dynamic
– The system "pushes" information to the user
– Application example: news filtering

Importance of Ad Hoc Retrieval
Directly manages any existing large collection of information
There are many, many "ad hoc" information needs
A long-term information need can be satisfied through frequent ad hoc retrieval
Basic techniques of ad hoc retrieval can be used for filtering and other "non-retrieval" tasks, such as automatic summarization

Formal Formulation of TR
Vocabulary: V = {w1, w2, …, wN} of a language
Query: q = q1, …, qm, where qi ∈ V
Document: di = di1, …, di,mi, where dij ∈ V
Collection: C = {d1, …, dk}
Set of relevant documents: R(q) ⊆ C
– Generally unknown and user-dependent
– The query is a "hint" about which docs are in R(q)
Task: compute R'(q), an approximation of R(q)

Computing R(q)
Strategy 1: document selection
– R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function or classifier
– The system must decide whether a doc is relevant or not ("absolute relevance")
Strategy 2: document ranking
– R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) ∈ ℝ is a relevance measure function and θ is a cutoff
– The system must decide whether one doc is more likely to be relevant than another ("relative relevance")
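
To make the two strategies concrete, here is a minimal Python sketch; the classify/score functions and the document type are placeholders of my own, not something defined on the slides.

```python
def select_documents(docs, query, classify):
    """Strategy 1 (absolute relevance): classify(d, q) returns 1 for relevant, 0 otherwise."""
    return {d for d in docs if classify(d, query) == 1}

def rank_documents(docs, query, score):
    """Strategy 2 (relative relevance): score(d, q) is a real-valued relevance measure.
    The cutoff is effectively chosen by the user while browsing the ranked list."""
    return sorted(docs, key=lambda d: score(d, query), reverse=True)
```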

Document Selection vs. Ranking
[Figure: a set of documents labeled + (relevant) and – (non-relevant). Document selection returns an unordered set R'(q) that approximates the true R(q); document ranking orders all documents by f(d,q) and lets the user decide how far down the list to go.]

Problems of Document Selection
The classifier is unlikely to be accurate
– "Over-constrained" query (terms are too specific): no relevant documents are found
– "Under-constrained" query (terms are too general): over-delivery
– It is extremely hard to find the right position between these two extremes
Even if it is accurate, not all relevant documents are equally relevant
Relevance is a matter of degree!

Ranking Is Often Preferred
Relevance is a matter of degree
A user can stop browsing anywhere, so the boundary is controlled by the user
– High-recall users view more items
– High-precision users view only a few
Theoretical justification: the Probability Ranking Principle [Robertson 77]

Probability Ranking Principle [Robertson 77]
As stated by Cooper: "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."
Robertson provides two formal justifications
Assumptions: independent relevance and sequential browsing (not necessarily all hold in reality)

According to the PRP, all we need is a relevance measure function f which satisfies:
For all q, d1, d2: f(q,d1) > f(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)
Most IR research has focused on finding a good function f

Evaluation in Information Retrieval

Evaluation Criteria
Effectiveness/accuracy
– Precision, recall
Efficiency
– Space and time complexity
Usability
– How useful for real user tasks?

Methodology: Cranfield Tradition
Laboratory testing of system components
– Precision, recall
– Comparative testing
Test collections
– Set of documents
– Set of questions
– Relevance judgments

The Contingency Table
Doc \ Action     Retrieved               Not Retrieved
Relevant         Relevant Retrieved      Relevant Rejected
Not relevant     Irrelevant Retrieved    Irrelevant Rejected
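
Precision and recall follow directly from these four cells; a small Python helper (names are mine) makes the definitions explicit:

```python
def precision_recall(rel_retrieved, irrel_retrieved, rel_rejected):
    """Precision = relevant retrieved / all retrieved;
    recall = relevant retrieved / all relevant."""
    retrieved = rel_retrieved + irrel_retrieved
    relevant = rel_retrieved + rel_rejected
    precision = rel_retrieved / retrieved if retrieved else 0.0
    recall = rel_retrieved / relevant if relevant else 0.0
    return precision, recall
```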

How to Measure a Ranking?
Compute the precision at every recall point
Plot a precision-recall (PR) curve
[Figure: two precision-recall curves that cross each other. Which is better?]

Summarize a Ranking: MAP
Given that n docs are retrieved
– Compute the precision (at rank) where each (new) relevant document is retrieved => p(1), …, p(k), if we have k relevant docs
– E.g., if the first relevant doc is at rank 2, then p(1) = 1/2
– If a relevant document is never retrieved, we assume the precision corresponding to that relevant doc to be zero
Compute the average over all the relevant documents
– Average precision = (p(1) + … + p(k)) / k
This gives (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document
Mean Average Precision (MAP)
– MAP = arithmetic mean of average precision over a set of topics
– gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
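
A minimal Python sketch of the average-precision and MAP computation described above (function and argument names are mine):

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one topic.
    Relevant docs that are never retrieved contribute precision 0, as on the slide."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)      # precision at this rank
    return sum(precisions) / len(relevant_ids)  # divide by k = number of relevant docs

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

On the worked example later in the lecture (D1+, D2+, D3–, D4–, D5+, D6– with 4 relevant docs in total), this returns (1/1 + 2/2 + 3/5 + 0)/4 = 0.65.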

Summarize a Ranking: NDCG
What if relevance judgments are on a graded scale of [1, r], with r > 2?
Cumulative Gain (CG) at rank n
– Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
– CG = r1 + r2 + … + rn
Discounted Cumulative Gain (DCG) at rank n
– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
– We may use any base b for the logarithm; rank positions below b are not discounted
Normalized DCG (NDCG) at rank n
– Normalize the DCG at rank n by the DCG value at rank n of the ideal ranking
– The ideal ranking first returns the documents with the highest relevance level, then the next highest relevance level, and so on
NDCG is now quite popular in evaluating Web search
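
A sketch of DCG/NDCG in Python along these lines (the ideal gains should normally come from all judged documents for the topic, so they are an explicit argument here):

```python
import math

def dcg(gains, base=2):
    """Discounted cumulative gain; rank positions below the log base are not discounted."""
    total = 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i < base else g / math.log(i, base)
    return total

def ndcg(gains, ideal_gains, base=2):
    """Normalize by the DCG of the ideal ranking (all judged gains, sorted descending)."""
    ideal = sorted(ideal_gains, reverse=True)[: len(gains)]
    denom = dcg(ideal, base)
    return dcg(gains, base) / denom if denom > 0 else 0.0
```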

When There Is Only One Relevant Document
Scenarios:
– Known-item search
– Navigational queries
Search length = rank of the answer
– Measures a user's effort
Mean Reciprocal Rank (MRR)
– Reciprocal rank = 1 / rank-of-the-answer
– Take the average over all queries
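
The corresponding computation, as a small Python sketch:

```python
def reciprocal_rank(ranked_ids, answer_id):
    """1/rank of the single answer document; 0 if it is not retrieved at all."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == answer_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, answer_id) pairs, one per query."""
    return sum(reciprocal_rank(r, a) for r, a in runs) / len(runs)
```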

Precision-Recall Curve: Example
[Figure: a sample evaluation output annotated with the precision-recall curve, mean average precision (MAP), recall = 3212/4728 (out of 4728 relevant docs, 3212 were retrieved), the breakeven point (precision = recall), and precision at 10 (on average about 5.5 of the top 10 docs are relevant).]
Worked example: the system returns 6 docs judged D1+, D2+, D3–, D4–, D5+, D6–, and the total number of relevant docs is 4, so average precision = (1/1 + 2/2 + 3/5 + 0)/4 = 0.65

What Query Averaging Hides
[Figure: per-topic precision-recall curves, showing how much performance varies across topics. Slide from Doug Oard's presentation, originally from Ellen Voorhees' presentation.]

The Pooling Strategy
When the test collection is very large, it is impossible to judge all the documents completely
TREC's strategy: pooling
– Appropriate for relative comparison of different systems
– Given N systems, take the top K results from each and combine them to form a "pool"
– Assessors judge all the documents in the pool; unjudged documents are assumed to be non-relevant
Advantage: less human effort
Potential problems:
– Bias due to incomplete judgments (okay for relative comparison)
– Favors systems that contributed to the pool; when the collection is reused, a new system's performance may be underestimated
Reuse the data set with caution!
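
Pool construction itself is just a top-K union across the submitted runs; a minimal sketch:

```python
def build_pool(runs, k):
    """Form the judging pool as the union of the top-k documents from each system's run.
    runs: list of ranked lists of document ids, one per participating system."""
    pool = set()
    for run in runs:
        pool.update(run[:k])
    return pool  # documents outside the pool are treated as non-relevant
```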

User Studies
Limitations of the Cranfield evaluation strategy:
– How do we evaluate a technique for improving the interface of a search engine?
– How do we evaluate the overall utility of a system?
User studies are needed
General user-study procedure:
– Experimental systems are developed
– Subjects are recruited as users
– Variation can be in the system or in the users
– Users use the system and their behavior is logged
– User information is collected (before: background; after: experience with the system)
Clickthrough-based real-time user studies:
– Assume clicked documents to be relevant
– Mix results from multiple methods and compare their clickthroughs

Common Components in a TR System

Typical TR System Architecture
[Figure: documents flow through a tokenizer and indexer into a document representation (the index); the user's query is turned into a query representation; a scorer matches the two representations to produce results; user judgments feed a feedback component that updates the query representation.]

Text Representation / Indexing
Making it easier to match a query with a document
The query and documents should be represented using the same units/terms
Controlled vocabulary vs. full-text indexing
Full-text indexing is more practically useful and has proven to be as effective as manual indexing with a controlled vocabulary

What Is a Good Indexing Term?
Specific (phrases) or general (single words)?
Luhn found that words with middle frequency are most useful
– Not too specific (low utility, but still useful!)
– Not too general (lack of discrimination; stop words)
– Stop-word removal is common, but rare words are kept
All words or a (controlled) subset?
When term weighting is used, it is a matter of weighting rather than selecting indexing terms

Tokenization
Word segmentation is needed for some languages
– Is it really needed?
Normalize lexical units: words with similar meanings should be mapped to the same indexing term
– Stemming: map all inflectional forms of a word to the same root form, e.g., computer → compute, computation → compute, computing → compute (but king → k?)
– Are we losing finer-granularity discrimination?
Stop-word removal
– What is a stop word? What about a query like "to be or not to be"?
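
A self-contained Python sketch of such a normalization pipeline; the stop-word list and the crude suffix stripping are illustrative stand-ins (a real system would use a proper stemmer such as Porter's):

```python
import re

STOP_WORDS = {"a", "an", "and", "or", "the", "to", "of", "in", "is", "be", "not"}  # tiny illustrative list

def crude_stem(word):
    """Very rough suffix stripping, just to illustrate mapping inflections to one root."""
    for suffix in ("ations", "ation", "ings", "ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # short words like "king" are left alone

def tokenize(text, remove_stop_words=True):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

# tokenize("Computers and computing computation") -> ['comput', 'comput', 'comput']
```

Note that with stop-word removal enabled, a query like "to be or not to be" would be emptied out, which is exactly the concern raised on the slide.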

Relevance Feedback
[Figure: the query goes to the retrieval engine over the document collection, producing results d1, d2, …, dk; the user judges them (d1+, d2–, d3+, …, dk–); the feedback component uses these judgments to produce an updated query.]

Pseudo/Blind/Automatic Feedback
[Figure: the same loop as relevance feedback, except that the top 10 results are simply assumed to be relevant (d1+, d2+, d3+, …) instead of being judged by the user; these pseudo-judgments drive the query update.]

Implicit Feedback
[Figure: the same loop again, but the judgments are inferred from user activities (e.g., clickthroughs) rather than given explicitly.]

Important Points to Remember
The PRP provides a justification for ranking, which is generally preferred to document selection
How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, breakeven precision, NDCG, MRR)
What pooling is
What tokenization is (word segmentation, stemming, stop-word removal)
What relevance feedback, pseudo relevance feedback, and implicit feedback are

Overview of Retrieval Models
[Figure: a map of retrieval models, organized by how "relevance" is formalized.]
Relevance as similarity Δ(Rep(q), Rep(d)): different representations & similarity measures
– Vector space model (Salton et al., 75); probabilistic distribution model (Wong & Yao, 89); …
Relevance as probability of relevance P(r=1|q,d), r ∈ {0,1}
– Regression model (Fox 83); learning to rank (Joachims 02; Burges et al. 05)
– Generative models: doc generation – classical probabilistic model (Robertson & Sparck Jones, 76); query generation – LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
Relevance as probabilistic inference P(d→q) or P(q→d): different inference systems
– Probabilistic concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)

Retrieval Models: Vector Space

Relevance = Similarity
Assumptions
– Query and document are represented in the same way
– A query can be regarded as a "document"
– Relevance(d,q) ∝ similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = Δ(Rep(q), Rep(d))
Key issues
– How to represent the query and documents?
– How to define the similarity measure Δ?

Vector Space Model
Represent a doc/query by a term vector
– Term: a basic concept, e.g., a word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Each element of the vector corresponds to a term weight
– E.g., d = (x1, …, xN), where xi is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in the vector space

VS Model: Illustration
[Figure: documents D1 through D11 and a query plotted in a three-dimensional term space with axes "Java", "Microsoft", and "Starbucks"; documents whose vectors are close to the query vector are presumed relevant.]

What the VS Model Doesn't Say
How to define/select the "basic concepts"
– Concepts are assumed to be orthogonal
How to assign weights
– Weight in the query indicates the importance of a term
– Weight in the doc indicates how well the term characterizes the doc
How to define the similarity/distance measure

What's a Good "Basic Concept"?
Orthogonal
– Linearly independent basis vectors
– "Non-overlapping" in meaning
No ambiguity
Weights can be assigned automatically and, hopefully, accurately
Many possibilities: words, stemmed words, phrases, "latent concepts", …

How to Assign Weights?
Very, very important!
Why weighting?
– Query side: not all terms are equally important
– Doc side: some terms carry more information about the content
How?
– Two basic heuristics: TF (term frequency) = within-doc frequency, and IDF (inverse document frequency)
– TF normalization

TF Weighting
Idea: a term is more important if it occurs more frequently in a document
Some formulas, letting f(t,d) be the frequency count of term t in doc d:
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d) = log f(t,d)
– Maximum frequency normalization: TF(t,d) = 0.5 + 0.5·f(t,d)/MaxFreq(d)
– "Okapi/BM25 TF": TF(t,d) = k·f(t,d) / (f(t,d) + k·(1 − b + b·doclen/avgdoclen))
Normalization of TF is very important!
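
These variants as Python functions (a sketch; the k = 1.2 and b = 0.75 defaults are common choices of mine, not values given on the slide):

```python
import math

def raw_tf(f):
    return float(f)

def log_tf(f):
    # The slide's log TF; many systems use 1 + log f so a single occurrence is not zeroed out.
    return math.log(f) if f > 0 else 0.0

def max_norm_tf(f, max_f):
    # Maximum frequency normalization: 0.5 + 0.5 * f / MaxFreq(d).
    return 0.5 + 0.5 * f / max_f if f > 0 else 0.0

def bm25_tf(f, doclen, avgdoclen, k=1.2, b=0.75):
    # Okapi/BM25 TF: saturating in f, with pivoted length normalization in the denominator.
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen)) if f > 0 else 0.0
```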

TF Normalization
Why?
– Document length variation
– "Repeated occurrences" are less informative than the "first occurrence"
Two views of document length
– A doc is long because it uses more words
– A doc is long because it has more content
Generally penalize long docs, but avoid over-penalizing (pivoted normalization)

TF Normalization: How?
[Figure: candidate curves of normalized TF as a function of raw TF.]
Which curve is more reasonable? Should normalized TF be upper-bounded?
Normalization interacts with the similarity measure

Regularized / "Pivoted" Length Normalization
[Figure: normalized TF vs. raw TF for the pivoted normalizer.]
"Pivoted normalization": use the average document length to regularize normalization:
1 − b + b·doclen/avgdoclen (b varies from 0 to 1)
What would happen if doclen is >, <, or = avgdoclen?
Advantage: stabilizes parameter settings

IDF Weighting
Idea: a term is more discriminative if it occurs only in a few documents
Formula: IDF(t) = 1 + log(n/k)
– n: total number of docs
– k: number of docs containing term t (document frequency)

TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) · IDF(t)
– Common in the doc → high TF → high weight
– Rare in the collection → high IDF → high weight
Imagine a word-count profile: what kinds of terms would have high weights?
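
Putting the two heuristics together, a sketch of a TF-IDF document vector as a sparse term -> weight dictionary (the log-TF variant and the data-structure choices are mine):

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, doc_freq, num_docs):
    """TF-IDF vector for one document.
    doc_tokens: the document's (already tokenized) terms.
    doc_freq:   dict term -> number of documents containing the term.
    num_docs:   total number of documents in the collection."""
    counts = Counter(doc_tokens)
    vector = {}
    for term, f in counts.items():
        tf = 1 + math.log(f)                                   # a simple log-TF variant
        idf = 1 + math.log(num_docs / doc_freq.get(term, 1))   # the slide's IDF(t) = 1 + log(n/k)
        vector[term] = tf * idf
    return vector
```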

How to Measure Similarity?
[Formula: the similarity between the query and document weight vectors (a dot-product style measure).]
How about Euclidean distance?
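
For concreteness, here is a sketch of two standard options over sparse term -> weight dictionaries, the plain dot product and its length-normalized (cosine) variant; the sparse-dict representation is an assumption of mine:

```python
import math

def dot_product(q_vec, d_vec):
    """Inner product of two sparse vectors (dicts mapping term -> weight)."""
    return sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())

def cosine(q_vec, d_vec):
    """Dot product divided by the vector lengths."""
    norm_q = math.sqrt(sum(w * w for w in q_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot_product(q_vec, d_vec) / (norm_q * norm_d)
```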

What Works the Best? (Singhal 2001)
[Table from Singhal 2001 comparing the contribution of using single words, statistical phrases, stop-word removal, stemming, and other techniques (numbers omitted).]

"Extensions" of the VS Model
Alternative similarity measures
– Many other choices (tend not to be very effective)
– P-norm (extended Boolean): matching a Boolean query against a TF-IDF document vector
Alternative representations
– Many choices (performance varies a lot)
– Latent Semantic Indexing (LSI) [TREC performance tends to be average]
– Generalized vector space model: theoretically interesting, not seriously evaluated

Relevance Feedback in VS
Basic setting: learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from these to improve performance?
General method: query modification
– Adding new (weighted) terms
– Adjusting the weights of old terms
– Doing both
The most well-known and effective approach is Rocchio [Rocchio 1971]

Rocchio Feedback: Illustration
[Figure: in the vector space, the query vector q is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, yielding the modified query.]

Rocchio Feedback: Formula
q_new = α·q + (β/|D_r|)·Σ_{d ∈ D_r} d − (γ/|D_n|)·Σ_{d ∈ D_n} d
– q: the original query vector
– D_r / D_n: the sets of relevant / non-relevant feedback document vectors
– α, β, γ: parameters
– q_new: the new query vector
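
A sketch of this update over sparse term -> weight dictionaries (the α, β, γ defaults are illustrative, not values from the lecture):

```python
from collections import defaultdict

def rocchio(query_vec, rel_vecs, nonrel_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio query modification: move the query toward the relevant centroid
    and away from the non-relevant centroid."""
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for vec in rel_vecs:
        for t, w in vec.items():
            new_q[t] += beta * w / len(rel_vecs)
    for vec in nonrel_vecs:
        for t, w in vec.items():
            new_q[t] -= gamma * w / len(nonrel_vecs)
    # Keep only positive weights; in practice the vector is also truncated to the
    # highest-weighted terms for efficiency (see the next slide).
    return {t: w for t, w in new_q.items() if w > 0}
```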

Rocchio in Practice
Negative (non-relevant) examples are not very important (why?)
Often truncate the centroid vector to a lower dimension, i.e., keep only a small number of words with high weights (an efficiency concern)
Avoid overfitting by keeping a relatively high weight on the original query terms (why?)
Can be used for both relevance feedback and pseudo feedback
Usually robust and effective

Advantages of the VS Model
Empirically effective! (top TREC performance)
Intuitive
Easy to implement
Well studied / most evaluated
The SMART system
– Developed at Cornell
– Still available
Warning: there are many variants of TF-IDF!

Disadvantages of the VS Model
Assumes term independence
Assumes the query and documents are represented in the same way
Lacks "predictive adequacy"
– Arbitrary term weighting
– Arbitrary similarity measure
Lots of parameter tuning!

Probabilistic Retrieval Models

Overview of Retrieval Models (repeated)
[Figure: the same map of retrieval models shown earlier, repeated here as a roadmap before the probabilistic models.]

Probability of Relevance
Three random variables
– Query Q
– Document D
– Relevance R ∈ {0,1}
Goal: rank D based on P(R=1|Q,D)
– Evaluate P(R=1|Q,D)
– Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents
Several different ways to refine P(R=1|Q,D)

Refining P(R=1|Q,D): Method 1, Conditional Models
Basic idea: relevance depends on how well a query matches a document
– Define features on Q × D, e.g., number of matched terms, highest IDF of a matched term, document length, …
– P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ)
– Use training data (known relevance judgments) to estimate the parameters θ
– Apply the model to rank new documents
Early work (e.g., logistic regression [Cooper 92, Gey 94])
– Attempted to compete with other models
Recent work (e.g., Ranking SVM [Joachims 02], RankNet [Burges et al. 05])
– Attempts to leverage other models
– More features (notably PageRank, anchor text)
– More sophisticated learning (Ranking SVM, RankNet, …)

Logistic Regression (Cooper 92, Gey 94)
Logit function: log [ P(R=1|Q,D) / (1 − P(R=1|Q,D)) ] = β0 + β1·X1 + … + β6·X6
Logistic (sigmoid) function: P(R=1|Q,D) = 1 / (1 + e^−(β0 + β1·X1 + … + β6·X6))
[Figure: the sigmoid curve of P(R=1|Q,D) rising toward 1.0 as X increases.]
Uses 6 features X1, …, X6 (listed on the next slide)
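
Once the coefficients have been estimated from relevance judgments, scoring a query-document pair is just a sigmoid of a weighted feature sum; a minimal sketch (feature extraction itself is omitted):

```python
import math

def logistic_relevance(features, betas, beta0):
    """P(R=1|Q,D) from query-document features.
    features: list of feature values for this (Q, D) pair (e.g., the six on the next slide).
    betas:    learned coefficients, one per feature; beta0 is the intercept."""
    z = beta0 + sum(b * x for b, x in zip(betas, features))
    return 1.0 / (1.0 + math.exp(-z))
```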

Features/Attributes
– Average absolute query frequency
– Query length
– Average absolute document frequency
– Document length
– Average inverse document frequency
– Inverse document frequency
– Number of terms in common between query and document (logged)

Learning to Rank
Advantages
– Can combine multiple features (helps improve accuracy and combat web spam)
– Can re-use all past relevance judgments (self-improving)
Problems
– Does not learn the semantic associations between query words and document words
– Not much guidance on feature generation (relies on traditional retrieval models)
All current Web search engines use some kind of learning algorithm to combine many features, such as PageRank and many different representations of a page

The PageRank Algorithm (Page et al. 98)
Random surfing model: at any page,
– with probability α, randomly jump to a page
– with probability (1 − α), randomly pick a link on the current page to follow
[Figure: a small web graph d1 through d4 and its "transition matrix" M.]
Iterate until convergence: p(dj) = Σi [ α·Iij + (1 − α)·Mij ]·p(di), where N = # pages, Iij = 1/N, and the initial value is p(d) = 1/N
The result is the stationary ("stable") distribution, so we ignore time
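
A power-iteration sketch of this random-surfer computation (the graph representation, iteration count, and the handling of pages without outlinks are my choices):

```python
def pagerank(links, alpha=0.15, iterations=50):
    """links: dict page -> list of pages it links to; every linked page must appear as a key."""
    pages = list(links)
    n = len(pages)
    p = {d: 1.0 / n for d in pages}                 # initial value p(d) = 1/N
    for _ in range(iterations):
        new_p = {d: alpha / n for d in pages}       # random-jump mass: alpha * 1/N
        for d, outlinks in links.items():
            if outlinks:
                share = (1 - alpha) * p[d] / len(outlinks)
                for target in outlinks:
                    new_p[target] += share
            else:
                # Page with no outlinks: spread its mass uniformly (one common fix; see next slide).
                for target in pages:
                    new_p[target] += (1 - alpha) * p[d] / n
        p = new_p
    return p
```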

PageRank in Practice
Interpretation of the damping factor α (≈ 0.15):
– The probability of a random jump
– Smoothing the transition matrix (avoiding zeros)
Normalization doesn't affect the ranking, which leads to some variants
The zero-outlink problem: the p(di) values don't sum to 1
– One possible solution: a page-specific damping factor (α = 1.0 for a page with no outlinks)

HITS: Capturing Authorities & Hubs [Kleinberg 98]
Intuitions
– Pages that are widely cited are good authorities
– Pages that cite many other pages are good hubs
The key idea of HITS
– Good authorities are cited by good hubs
– Good hubs point to good authorities
– Iterative reinforcement …

The HITS Algorithm [Kleinberg 98]
[Figure: a small web graph d1 through d4 and its "adjacency matrix" A.]
Initial values: a(di) = h(di) = 1
Iterate: a ← A^T·h, h ← A·a
Normalize so that Σi a(di)² = Σi h(di)² = 1
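
The same iteration written out in Python (graph representation and iteration count are my choices):

```python
import math

def hits(adj, iterations=50):
    """adj: dict page -> list of pages it links to. Returns (authority, hub) score dicts."""
    pages = list(adj)
    auth = {d: 1.0 for d in pages}
    hub = {d: 1.0 for d in pages}
    for _ in range(iterations):
        # Good authorities are cited by good hubs: a <- A^T h
        new_auth = {d: 0.0 for d in pages}
        for d, outlinks in adj.items():
            for target in outlinks:
                new_auth[target] += hub[d]
        # Good hubs point to good authorities: h <- A a
        new_hub = {d: sum(new_auth[t] for t in adj[d]) for d in pages}
        # Normalize so the squared scores sum to one.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {d: v / a_norm for d, v in new_auth.items()}
        hub = {d: v / h_norm for d, v in new_hub.items()}
    return auth, hub
```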

Refining P(R=1|Q,D): Method 2, Generative Models
Basic idea
– Define P(Q,D|R)
– Compute the odds O(R=1|Q,D) using Bayes' rule
Special cases
– Document "generation": P(Q,D|R) = P(D|Q,R)·P(Q|R), where P(Q|R) can be ignored for ranking D
– Query "generation": P(Q,D|R) = P(Q|D,R)·P(D|R)

Document Generation
P(D|Q,R=1): a model of relevant docs for Q; P(D|Q,R=0): a model of non-relevant docs for Q
Assume independent attributes A1 … Ak (why?)
Let D = d1 … dk, where dk ∈ {0,1} is the value of attribute Ak (similarly, Q = q1 … qk)
Assume non-query terms are equally likely to appear in relevant and non-relevant docs

Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
Two parameters for each term Ai:
– pi = P(Ai=1|Q,R=1): probability that term Ai occurs in a relevant doc
– qi = P(Ai=1|Q,R=0): probability that term Ai occurs in a non-relevant doc
RSJ model: rank by Σ_{i: di=qi=1} log [ pi·(1 − qi) / (qi·(1 − pi)) ]
How to estimate the parameters? Given relevance judgments, p̂i = (ri + 0.5)/(R + 1) and q̂i = (ni − ri + 0.5)/(N − R + 1), where N docs have been judged, R of them are relevant, ni of the N contain Ai, and ri of the R relevant docs contain Ai; the "+0.5" and "+1" can be justified by Bayesian estimation

RSJ Model: No Relevance Info (Croft & Harper 79)
How to estimate the parameters when we have no relevance judgments?
– Assume pi to be a constant
– Estimate qi by assuming all documents to be non-relevant
The term weight then becomes approximately log [ (N − ni + 0.5) / (ni + 0.5) ], an IDF-like weight
– N: # documents in the collection
– ni: # documents in which term Ai occurs

Improving RSJ: Adding TF
Let D = d1 … dk, where dk is the frequency count of term Ak
Basic doc generation model: a 2-Poisson mixture model for each term frequency
Many more parameters to estimate! (How many exactly?)

BM25/Okapi Approximation (Robertson et al. 94)
Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
Observations:
– log O(R=1|Q,D) is a sum of term weights Wi
– Wi = 0 if TFi = 0
– Wi increases monotonically with TFi
– Wi has an asymptotic limit
The simple function used is the saturating "Okapi/BM25 TF" transformation (see the BM25 formula below)

Adding Doc Length & Query TF
Incorporating document length
– Motivation: the 2-Poisson model assumes equal document length
– Implementation: "carefully" penalize long documents
Incorporating query TF
– Motivation: appears not to be well justified
– Implementation: a similar TF transformation
The final formula is called BM25, achieving top TREC performance

The BM25 Formula
[Formula: the full BM25 retrieval function, which combines an RSJ-style term weight with the "Okapi TF/BM25 TF" transformation of the document term frequency (and a similar transformation of the query term frequency).]
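
A sketch of a commonly used form of BM25 (the parameter defaults and the "+1 inside the log" IDF variant are typical choices of mine, not necessarily the exact formula on the slide):

```python
import math

def bm25_score(query_terms, doc_counts, doc_len, avg_doc_len, doc_freq, num_docs,
               k1=1.2, b=0.75, k3=1000.0):
    """query_terms: list of query terms (repeats allowed).
    doc_counts: dict term -> frequency in the document.
    doc_freq:   dict term -> number of documents containing the term."""
    score = 0.0
    for term in set(query_terms):
        tf = doc_counts.get(term, 0)
        if tf == 0:
            continue
        qtf = query_terms.count(term)
        n = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))          # RSJ-style IDF, kept positive
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        qtf_part = qtf * (k3 + 1) / (qtf + k3)                        # query-TF transformation
        score += idf * tf_part * qtf_part
    return score
```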

Extensions of "Doc Generation" Models
– Capture term dependence (van Rijsbergen & Harper 78)
– Alternative ways to incorporate TF (Croft 83, Kalt 96)
– Feature/term selection for feedback (Okapi's TREC reports)
– Other possibilities (machine learning, …)

Query Generation
The ranking score decomposes into the query likelihood p(q|θd) and a document prior; assuming a uniform prior, we rank documents by the query likelihood p(q|θd)
Now, the question is how to compute p(q|θd); this generally involves two steps:
(1) estimate a language model θd based on D
(2) compute the query likelihood according to the estimated model
This leads to the so-called "Language Modeling Approach" …
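
A sketch of steps (1) and (2) with a smoothed unigram language model; the choice of Dirichlet-prior smoothing and the value mu = 2000 are common defaults of mine, not something prescribed by this slide (smoothing is covered in the later language-model lectures):

```python
import math
from collections import Counter

def query_log_likelihood(query_tokens, doc_tokens, collection_counts, collection_len, mu=2000.0):
    """log p(q | theta_d) with Dirichlet-prior smoothing.
    collection_counts: dict term -> count in the whole collection;
    collection_len:    total number of tokens in the collection."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    log_like = 0.0
    for w in query_tokens:
        p_w_coll = collection_counts.get(w, 0) / collection_len       # background (collection) model
        p_w_doc = (doc_counts.get(w, 0) + mu * p_w_coll) / (doc_len + mu)
        if p_w_doc <= 0:
            return float("-inf")   # term unseen even in the collection: zero likelihood
        log_like += math.log(p_w_doc)
    return log_like
```

Documents are then ranked by this log-likelihood for the given query.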

Lecture 1: Key Points
The vector space model is a family of models, not a single model
There are many variants of TF-IDF weighting, and some are more effective than others
State-of-the-art retrieval performance is achieved through
– Bag-of-words representation
– TF-IDF weighting (BM25) + length normalization
– Pseudo relevance feedback (mostly for recall)
– For web search: add PageRank, anchor text, …, plus learning to rank
Principled approaches didn't lead to good performance directly (before the "language modeling approach" was proposed); heuristic modifications have been necessary

Readings
– Amit Singhal's overview
– My review of IR models

Discussion
Text retrieval and image retrieval
– Query language: keywords vs. image features, query by example
– Content representation: bag of words vs. a "bag of image features"? phrase indexing vs. units from image parsing? sentiment analysis
– Retrieval heuristics: TF-IDF weighting vs. ? passage retrieval vs. image region retrieval? proximity
Text retrieval and video retrieval
Multimedia retrieval?