1 Under The Hood [Part I] Web-Based Information Architectures MSEC 20-760 Mini II Jaime Carbonell

2 Topics Covered
- The Vector Space Model for IR (VSM)
- Evaluation Metrics for IR
- Query Expansion (the Rocchio Method)
- Inverted Indexing for Efficiency
- A Glimpse into Harder Problems

3 The Vector Space Model (1)
Let Σ = [w_1, w_2, ..., w_n] (the vocabulary)
Let D_j = [c(w_1, D_j), c(w_2, D_j), ..., c(w_n, D_j)]
Let Q = [c(w_1, Q), c(w_2, Q), ..., c(w_n, Q)]
where c(w, V) is the count of word w in V.

4 The Vector Space Model (2)
Initial definition of similarity: S_I(Q, D_j) = Q . D_j
Normalized definition of similarity: S_N(Q, D_j) = (Q . D_j) / (|Q| x |D_j|) = cos(Q, D_j)

5 The Vector Space Model (3) Relevance Ranking
If S_N(Q, D_i) > S_N(Q, D_j), then D_i is more relevant than D_j to Q
Retrieve(k, Q, {D_j}) = Argmax_k [cos(Q, D_j)], D_j in {D_j}
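The two definitions above can be sketched in a few lines of Python. This is a minimal illustration over dense count vectors, not an efficient implementation; the sparse, inverted-index version comes later in the deck.

```python
import math
from heapq import nlargest

def cosine(q, d):
    """Normalized similarity S_N(Q, D_j) = (Q . D_j) / (|Q| x |D_j|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norms if norms else 0.0

def retrieve(k, query, docs):
    """Argmax_k over cos(Q, D_j): the k document ids with highest cosine to the query."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in docs.items())
    return [doc_id for _, doc_id in nlargest(k, scored)]
```

Note that cosine is invariant to document length, which is exactly why the normalized definition is preferred over the raw dot product.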

6 Refinements to VSM (1) Word Normalization
Reduce words to morphological root form: countries => country, interesting => interest
Stemming as a fast approximation: countries, country => countr; moped => mop (an over-stemming error)
- Reduces vocabulary (always good)
- Generalizes matching (usually good)
- More useful for non-English IR (Arabic has > 100 variants per verb)
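A crude suffix-stripping stemmer shows the idea (and the failure mode). This toy sketch is not Porter's algorithm; the suffix list and minimum-stem length are illustrative choices only.

```python
# Toy suffix stripper: check longer suffixes first, keep a stem of at least 3 letters.
SUFFIXES = ["ies", "ing", "ed", "es", "s", "y"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

It conflates "countries" and "country" into "countr" as intended, but also over-stems "moped" (the vehicle) into "mop", matching the slide's cautionary example.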

7 Refinements to VSM (2) Stop-Word Elimination
- Discard articles, auxiliaries, prepositions, ... (typically the 100-300 most frequent small words)
- Reduces document length by 30-40%
- Retrieval accuracy improves slightly (5-10%)
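Stop-word elimination is a simple filter pass. The word list below is a tiny illustrative sample; as the slide notes, real systems use the 100-300 most frequent small words of the language.

```python
# Toy stop-word list; production lists are 100-300 words long.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "are", "and", "for", "on"}

def remove_stopwords(tokens):
    """Drop stop-words, keeping the remaining tokens in order."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```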

8 Refinements to VSM (3) Proximity Phrases
E.g.: "air force" => airforce
Found by high mutual information:
  p(w_1 w_2) >> p(w_1) p(w_2)
  p(w_1 & w_2 in k-window) >> p(w_1 in k-window) p(w_2 in same k-window)
- Retrieval accuracy improves slightly (5-10%)
- Too many phrases => inefficiency
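The "high mutual information" test can be made concrete as pointwise mutual information over adjacent bigrams. This sketch uses a window of k = 2 (adjacent words) for simplicity; the corpus below is a made-up example.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information log2( p(w1 w2) / (p(w1) p(w2)) ) over adjacent bigrams."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p1 = unigrams[w1] / n
    p2 = unigrams[w2] / n
    p12 = bigrams[(w1, w2)] / (n - 1)
    return math.log2(p12 / (p1 * p2))
```

Pairs like "air force" that co-occur far more often than chance predicts get a high PMI and are promoted to phrase terms.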

9 Refinements to VSM (4) Words => Terms
term = word | stemmed word | phrase
Use exactly the same VSM method on terms (instead of words)

10 Evaluating Information Retrieval (1)
Contingency table:
                 relevant   not relevant
retrieved           a            b
not retrieved       c            d

11 Evaluating Information Retrieval (2)
P = a/(a+b)    R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R (false-negative rate)
F/A = b/(a+b+c+d) (false-alarm rate, false positives)
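The metrics on this slide translate directly from the contingency table. A small helper makes the definitions executable; the dictionary keys are naming choices of this sketch.

```python
def ir_metrics(a, b, c, d):
    """a = relevant & retrieved, b = irrelevant & retrieved,
    c = relevant & missed, d = irrelevant & not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (a + d) / (a + b + c + d),
        "F1": 2 * precision * recall / (precision + recall),
        "miss": c / (a + c),                    # = 1 - recall
        "false_alarm": b / (a + b + c + d),
    }
```

Note that accuracy is a poor summary in IR because d (correctly ignored documents) usually dwarfs the other cells; that is why precision, recall and F1 dominate in practice.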

12 Evaluating Information Retrieval (3) 11-Point Precision Curves
- The IR system generates a total ranking of the collection
- Plot precision at 10%, 20%, 30%, ... recall
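One common way to build the 11-point curve is interpolated precision: at each recall level, take the best precision achieved at that recall or beyond. This is one standard convention, assumed here rather than stated on the slide.

```python
def eleven_point_precision(ranked_rel, total_relevant):
    """ranked_rel: 0/1 relevance judgements in rank order.
    Returns interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(ranked_rel, start=1):
        hits += rel
        precisions.append(hits / rank)
        recalls.append(hits / total_relevant)
    curve = []
    for level in (i / 10 for i in range(11)):
        # Interpolation: max precision at any recall >= this level.
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve
```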

13 Query Expansion (1)
Observations:
- Longer queries often yield better results
- The user's vocabulary may differ from the document vocabulary
  Q: how to avoid heart disease
  D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
- Longer queries give recall more chances to succeed

14 Query Expansion (2) Bridging the Gap
- Human query expansion (user or expert)
- Thesaurus-based expansion: seldom works in practice (unfocused)
- Relevance feedback: widens a thin bridge over the vocabulary gap by adding words from the document space to the query
- Pseudo-relevance feedback
- Local context analysis

15 Relevance Feedback (1) Rocchio Formula
Q' = F[Q, D_ret]
F = weighted vector sum, such as:
W(t, Q') = αW(t, Q) + βW(t, D_rel) - γW(t, D_irr)

16 Relevance Feedback (2)
For example, if:
Q = (heart attack medicine), with W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1
D_rel = (cardiac arrest prevention medicine nitroglycerine heart disease ...), with W(nitroglycerine,D_rel) = 2, W(medicine,D_rel) = 1
D_irr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...), with W(attack,D_irr) = 1, W(nitroglycerine,D_irr) = 2, W(explosive,D_irr) = 1
and α = 1, β = 2, γ = 0.5

17 Relevance Feedback (3)
Then:
W(attack,Q') = 1*1 + 2*0 - 0.5*1 = 0.5
W(nitroglycerine,Q') = 1*0 + 2*2 - 0.5*2 = 3
W(medicine,Q') = 1*1 + 2*1 - 0.5*0 = 3
W(explosive,Q') = 1*0 + 2*0 - 0.5*1 = -0.5
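The Rocchio update is a few lines once the term weights are given as dictionaries. This sketch takes the weights directly (as the slide does) rather than computing them from raw text.

```python
def rocchio(wq, w_rel, w_irr, alpha=1.0, beta=2.0, gamma=0.5):
    """Expanded query weights: W(t,Q') = alpha*W(t,Q) + beta*W(t,D_rel) - gamma*W(t,D_irr)."""
    terms = set(wq) | set(w_rel) | set(w_irr)
    return {t: alpha * wq.get(t, 0) + beta * w_rel.get(t, 0) - gamma * w_irr.get(t, 0)
            for t in terms}
```

Running it on the slide's example reproduces the values above: "nitroglycerine" and "medicine" are boosted to 3, "attack" is damped to 0.5, and "explosive" goes negative.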

18 Term Weighting Methods (1) Salton's Tf*IDf
Tf = term frequency in a document
Df = document frequency of the term = # documents in the collection containing the term
IDf = Df^(-1) (inverse document frequency)

19 Term Weighting Methods (2) Salton's Tf*IDf
TfIDf = f_1(Tf) * f_2(IDf)
E.g. f_1(Tf) = Tf * ave(|D_j|) / |D|
E.g. f_2(IDf) = log_2(IDf)
f_1 and f_2 can differ for Q and D
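As a concrete instance of the scheme above, a widely used choice sets f_1(Tf) = Tf and f_2 = log_2 of N/Df (the collection size over the document frequency), which keeps weights non-negative. This particular combination is an assumption of the sketch, not the only option the slide allows.

```python
import math

def tf_idf(tf, df, n_docs):
    """One common Tf*IDf variant: raw term frequency times log2(N / Df).
    Rare terms (small Df) get large weights; terms in every document get weight 0."""
    return tf * math.log2(n_docs / df)
```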

20 Vector Space Model: A Toy Example
Suppose your document collection contains only 2 documents:
Q  = (heart attack medicine)
D1 = (cardiac arrest prevention medicine nitroglycerine heart disease)
D2 = (terrorist attack explosive semtex attack nitroglycerine proximity fuse)

21 Vector Space Model: A Toy Example
The dictionary is (alphabetically sorted):
{arrest, attack, cardiac, disease, explosive, fuse, heart, medicine, nitroglycerine, prevention, proximity, semtex, terrorist}
The vectors of Q, D1 and D2 are:
Q  = [0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
D1 = [1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0]
D2 = [0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
Each component of a vector is TW(t, V); here, for simplicity, we use raw Tf as the term weight of term t in vector V.
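The toy example can be checked end to end in a few lines: build the sorted dictionary, vectorize with raw Tf weights, and rank by cosine.

```python
import math
from collections import Counter

Q  = "heart attack medicine".split()
D1 = "cardiac arrest prevention medicine nitroglycerine heart disease".split()
D2 = "terrorist attack explosive semtex attack nitroglycerine proximity fuse".split()

# Alphabetically sorted dictionary over all three texts.
vocab = sorted(set(Q) | set(D1) | set(D2))

def vec(tokens):
    """Raw-Tf vector over the sorted vocabulary."""
    tf = Counter(tokens)
    return [tf[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
```

Q shares "heart" and "medicine" with D1 but only "attack" with D2, and the cosine scores rank D1 above D2 accordingly, despite the query having no word in common with "cardiac arrest".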

22 Efficient Implementations of VSM (1) Exploit Sparseness
- Only compute the non-zero multiplies in dot products
- Do not even look at zero elements (how?)
  => Use non-stop terms to index documents

23 Efficient Implementations of VSM (2) Inverted Indexing
- Find all unique [stemmed] terms in the document collection
- Remove stopwords from the term list
- If the collection is large (over 100,000 documents), [optionally] remove singletons (usually spelling errors or obscure names)
- Alphabetize or use a hash table to store the list
- For each term, create a data structure like:

24 Efficient Implementations of VSM (3)
[term_i IDf(term_i), <doc_i, freq(term, doc_i); doc_j, freq(term, doc_j); ...>]
or, with positions:
[term_i IDf(term_i), <doc_i, freq(term, doc_i), [pos_{1,i}, pos_{2,i}, ...]; doc_j, freq(term, doc_j), [pos_{1,j}, pos_{2,j}, ...]; ...>]
where pos_{1,j} is the first position of the term in document j, and so on.
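The positional variant of the postings structure maps naturally onto nested dictionaries. In this sketch the term frequency is implicit as the length of each position list; IDf can be derived afterwards from the number of documents per term.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [tokens]} -> {term: {doc_id: [positions]}}.
    freq(term, doc) is len(index[term][doc]); df(term) is len(index[term])."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index
```

With this structure, a query only touches the postings lists of its own terms, which is exactly the "never look at zero elements" trick of the previous slide.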

25 Open Research Problems in IR (1) Beyond VSM
- Vectors in different spaces: Generalized VSM, Latent Semantic Indexing, ...
- Probabilistic IR (language modeling): P(D|Q) = P(Q|D)P(D)/P(Q)

26 Open Research Problems in IR (2) Beyond Relevance
- Appropriateness of a document to the user (comprehension level, etc.)
- Novelty of the information in a document to the user (anti-redundancy as an approximation to novelty)

27 Open Research Problems in IR (3) Beyond One Language
- Translingual IR
- Transmedia IR

28 Open Research Problems in IR (4) Beyond Content Queries
- "What's new today?"
- "What sort of things do you know about?"
- "Build me a Yahoo-style index for X"
- "Track the event in this news story"

