Information Retrieval Models: Vector Space Models


1 Information Retrieval Models: Vector Space Models
ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

2 Empirical IR vs. Model-based IR
Empirical IR: heuristic approaches; rely solely on empirical evaluation; assumptions not always clearly stated; findings are empirical observations that may or may not generalize well.
Model-based IR: theoretical approaches; rely more on mathematics; assumptions are explicitly stated; findings are principles and models that may or may not work well, but generalize better.
The boundary may not be clear, and a combination is generally necessary.

3 History of Research on IR Models
1960: First probabilistic model [Maron & Kuhns 60]
1970s: Active research on retrieval models started
- Vector space model [Salton et al. 75]
- Classic probabilistic model [Robertson & Sparck Jones 76]
- Probability Ranking Principle [Robertson 77]
1980s: Further development of different models
- Non-classic logic model [Rijsbergen 86]
- Extended Boolean [Salton et al. 83]
- Early work on learning to rank [Fuhr 89]

4 History of Research on IR Models (cont.)
1990s: retrieval model research driven by TREC
- Inference network [Turtle & Croft 91]
- BM25/Okapi [Robertson et al. 94]
- Pivoted length normalization [Singhal et al. 96]
- Language model [Ponte & Croft 98]
2000s-present: retrieval models influenced by machine learning and Web search
- Further development of language models [Zhai & Lafferty 01, Lavrenko & Croft 01]
- Divergence from randomness [Amati et al. 02]
- Axiomatic model [Fang et al. 04]
- Markov Random Field [Metzler & Croft 05]
- Further development of learning to rank [Joachims 02, Burges et al. 05]

5 Modeling Relevance: Roadmap for Retrieval Models
[Diagram: a roadmap of retrieval models organized by how relevance is modeled. Similarity-based, Δ(Rep(q), Rep(d)): vector space model [Salton et al. 75], probabilistic distribution [Wong & Yao 89]. Probability of relevance, P(r=1|q,d), r ∈ {0,1}: regression model [Fuhr 89]; learning to rank [Joachims 02, Burges et al. 05]; generative models, split into doc generation (classical probabilistic model [Robertson & Sparck Jones 76]) and query generation (LM approach [Ponte & Croft 98; Lafferty & Zhai 01a]). Probabilistic inference, P(d→q) or P(q→d): different inference systems such as the inference network model [Turtle & Croft 91] and the probabilistic concept space model [Wong & Yao 95]; divergence from randomness [Amati & Rijsbergen 02]. Relevance constraints [Fang et al. 04].]

6 1. Vector Space Models

7 Given a query, how do we know if document A is more relevant than B?
The Basic Question: Given a query, how do we know if document A is more relevant than document B?
One Possible Answer: If document A uses more query words than document B (word usage in document A is more similar to that in the query).

8 Relevance = Similarity
Assumptions:
- Query and document are represented similarly; a query can be regarded as a "document"
- Relevance(d,q) ≈ similarity(d,q)
- R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = Δ(Rep(q), Rep(d))
Key issues:
- How to represent the query/document?
- How to define the similarity measure Δ?

9 Vector Space Model Represent a doc/query by a term vector
Term: basic concept, e.g., word or phrase
- Each term defines one dimension
- N terms define a high-dimensional space
Element of vector corresponds to term weight
- E.g., d = (x_1, …, x_N), where x_i is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in the vector space
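To make the representation concrete, here is a minimal sketch that maps a text to a term-count vector over a fixed vocabulary (the three-term vocabulary and raw-count weighting are illustrative assumptions, not part of the model itself):

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Represent a doc/query as a vector of raw term counts,
    one dimension per term in a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]

# Hypothetical 3-term space matching the next slide's illustration.
vocabulary = ["java", "microsoft", "starbucks"]
d = term_vector("java java microsoft developer conference", vocabulary)
q = term_vector("java starbucks", vocabulary)
print(d, q)  # -> [2, 1, 0] [1, 0, 1]
```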

10 VS Model: illustration
[Illustration: documents D1-D11 and a query plotted in a three-dimensional term space with axes "Java", "Microsoft", and "Starbucks"; the documents whose vectors lie closest to the query vector are the candidates for relevance.]

11 What the VS model doesn’t say
How to define/select the "basic concept"
- Concepts are assumed to be orthogonal
How to assign weights
- Weight in query indicates importance of the term
- Weight in doc indicates how well the term characterizes the doc
How to define the similarity/distance measure

12 What’s a good “basic concept”?
Orthogonal
- Linearly independent basis vectors
- "Non-overlapping" in meaning
- No ambiguity
Weights can be assigned automatically and hopefully accurately
Many possibilities: words, stemmed words, phrases, "latent concepts", …
"Bag of words" representation works "surprisingly" well!

13 How to Assign Weights?
Very, very important!
Why weighting?
- Query side: not all terms are equally important
- Doc side: some terms carry more information about contents
How? Two basic heuristics:
- TF (Term Frequency) = within-doc frequency
- IDF (Inverse Document Frequency)
Document length normalization

14 TF Weighting
Idea: A term is more important if it occurs more frequently in a document.
Formulas (let f(t,d) be the frequency count of term t in doc d):
- Raw TF: TF(t,d) = f(t,d)
- Log TF: TF(t,d) = log(f(t,d) + 1)
- Maximum frequency normalization: TF(t,d) = α + (1 - α) * f(t,d) / MaxFreq(d)
- "Okapi/BM25 TF": TF(t,d) = k * f(t,d) / (f(t,d) + k * (1 - b + b * doclen/avgdoclen))
Normalization of TF is very important!
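A sketch of the four TF variants listed above (the parameter values alpha=0.5, k=1.2, and b=0.75 are common defaults, assumed here rather than given on the slide):

```python
import math

def tf_raw(f):
    # Raw TF
    return f

def tf_log(f):
    # Log TF: log(f + 1)
    return math.log(f + 1)

def tf_maxfreq(f, max_freq, alpha=0.5):
    # Maximum frequency normalization: alpha + (1 - alpha) * f / MaxFreq(d)
    return alpha + (1 - alpha) * f / max_freq

def tf_bm25(f, doclen, avgdoclen, k=1.2, b=0.75):
    # "Okapi/BM25 TF": k * f / (f + k * (1 - b + b * doclen / avgdoclen))
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))
```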

15 TF Normalization
Why?
- Document length variation
- "Repeated occurrences" are less informative than the "first occurrence"
Two views of document length:
- A doc is long because it uses more words
- A doc is long because it has more contents
Generally penalize long docs, but avoid over-penalizing (e.g., pivoted normalization)

16 TF Normalization (cont.)
[Plot: normalized TF as a function of raw TF for different values of b.]
"Pivoted normalization": use the average doc length to regularize normalization: 1 - b + b * doclen/avgdoclen, where b varies from 0 to 1.
Normalization interacts with the similarity measure.

17 IDF Weighting
Idea: A term is more discriminative/important if it occurs in fewer documents.
Formula: IDF(t) = 1 + log(n/k)
- n: total number of docs
- k: # docs containing term t (doc freq)
Other variants: IDF(t) = log((n+1)/k); IDF(t) = log((n+1)/(k+0.5))
What are the maximum and minimum values of IDF?
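The same formulas as a sketch (n = total number of docs, k = document frequency of the term):

```python
import math

def idf(n, k):
    # IDF(t) = 1 + log(n/k): maximum 1 + log(n) when k = 1, minimum 1 when k = n
    return 1 + math.log(n / k)

def idf_variant_a(n, k):
    # IDF(t) = log((n + 1) / k)
    return math.log((n + 1) / k)

def idf_variant_b(n, k):
    # IDF(t) = log((n + 1) / (k + 0.5))
    return math.log((n + 1) / (k + 0.5))
```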

18 Non-Linear Transformation in IDF
[Plot: IDF(t) = 1 + log(n/k) as a function of document frequency k, contrasted with a linear penalization; IDF reaches its maximum of 1 + log(n) at k = 1 and its minimum of 1 at k = n, where n is the total number of docs in the collection.]
Is this transformation optimal?

19 TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
- Common in doc → high TF → high weight
- Rare in collection → high IDF → high weight
Imagine a word count profile; what kind of terms would have high weights?
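Combining the two heuristics as a sketch (the pairing of log TF with the 1 + log(n/k) IDF is one possible choice, not mandated by the slide):

```python
import math

def tfidf_weight(f_td, n_docs, doc_freq):
    """weight(t,d) = TF(t,d) * IDF(t), using log TF and IDF(t) = 1 + log(n/k)."""
    tf = math.log(f_td + 1)
    idf = 1 + math.log(n_docs / doc_freq)
    return tf * idf

# e.g. a term appearing 3 times in a doc and in 100 of 10000 docs:
# tfidf_weight(3, 10000, 100) ≈ 1.386 * 5.605 ≈ 7.77
```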

20 Empirical distribution of words
There are stable, language-independent patterns in how people use natural languages:
- A few words occur very frequently; most occur rarely. E.g., in news articles, the top 4 words account for 10~15% of word occurrences and the top 50 words for 35~40%.
- The most frequent word in one corpus may be rare in another

21 Zipf's Law
rank * frequency ≈ constant
[Plot: word frequency vs. word rank (by frequency), showing the characteristic Zipf curve, with a region annotated as "high entropy words".]
Generalized Zipf's law: applicable in many domains
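A tiny sketch for checking the rank * frequency ≈ constant pattern on any text (the whitespace tokenization is a deliberately crude assumption):

```python
from collections import Counter

def zipf_check(text, top=10):
    """Return (rank, word, freq, rank*freq) for the top-ranked words;
    for Zipf-like text, rank * freq stays roughly constant."""
    ranked = Counter(text.lower().split()).most_common(top)
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]
```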

22 How to Measure Similarity?
How about Euclidean?
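A sketch contrasting Euclidean distance with the dot product and cosine similarity commonly used in vector-space retrieval (the transcript omits the slide's formulas, so the measures shown here are standard choices rather than a reproduction of the slide; vectors are assumed to be TF-IDF weighted):

```python
import math

def euclidean_distance(q, d):
    # Sensitive to vector length: long documents look "far" from short queries.
    return math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))

def dot_product(q, d):
    return sum(qi * di for qi, di in zip(q, d))

def cosine_similarity(q, d):
    # Dot product of length-normalized vectors.
    nq = math.sqrt(sum(x * x for x in q))
    nd = math.sqrt(sum(x * x for x in d))
    return dot_product(q, d) / (nq * nd) if nq and nd else 0.0
```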

23 What Works the Best?
[Table (Singhal 2001; Singhal et al. 1996): error rates for different representation choices — use single words, use statistical phrases, remove stop words, stemming, others(?).]

24 Relevance Feedback in VS
Basic setting: learn from examples
- Positive examples: docs known to be relevant
- Negative examples: docs known to be non-relevant
- How do you learn from these to improve performance?
General method: query modification
- Adding new (weighted) terms
- Adjusting weights of old terms
- Doing both
The most well-known and effective approach is Rocchio [Rocchio 1971]

25 Rocchio Feedback: Illustration
[Illustration: relevant (+) and non-relevant (-) documents scattered in the vector space; the original query q is moved toward the centroid of the relevant documents and away from the centroid of the non-relevant documents, producing the modified query qm.]

26 Rocchio Feedback: Formula
q_m = α·q + (β/|D_r|) Σ_{d ∈ D_r} d - (γ/|D_nr|) Σ_{d ∈ D_nr} d
- New query: q_m; original query: q
- Rel docs: D_r; non-rel docs: D_nr
- Parameters: α, β, γ
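A minimal sketch of the formula, with vectors as plain Python lists (the default alpha/beta/gamma values and the clipping of negative weights are common practice, not specified on the slide):

```python
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + beta*centroid(rel_docs) - gamma*centroid(nonrel_docs)."""
    dim = len(query)

    def centroid(docs):
        if not docs:
            return [0.0] * dim
        return [sum(doc[i] for doc in docs) / len(docs) for i in range(dim)]

    c_rel = centroid(rel_docs)
    c_non = centroid(nonrel_docs)
    q_m = [alpha * query[i] + beta * c_rel[i] - gamma * c_non[i] for i in range(dim)]
    return [max(0.0, w) for w in q_m]  # negative weights are usually dropped
```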

27 Rocchio in Practice How can we optimize the parameters?
Can it be used for both relevance feedback and pseudo feedback? How does Rocchio feedback affect the efficiency of scoring documents? How can we improve the efficiency?

28 Advantages of VS Model Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well-studied / most evaluated
The SMART system
- Developed at Cornell; still widely used
Warning: many variants of TF-IDF!

29 Disadvantages of VS Model
Assumes term independence
Assumes query and document are represented in the same way
Lack of "predictive adequacy"
Arbitrary term weighting
Arbitrary similarity measure
Lots of parameter tuning!

30 What You Should Know Basic idea of the vector space model
TF-IDF weighting
Pivoted length normalization (read [Singhal et al. 1996] to know more)
BM25/Okapi retrieval function (particularly TF weighting)
How Rocchio feedback works

