15-381 Artificial Intelligence Information Retrieval (How to Power a Search Engine) Jaime Carbonell 20 September 2001 Topics Covered: “Bag of Words” Hypothesis.


Artificial Intelligence Information Retrieval (How to Power a Search Engine) Jaime Carbonell 20 September 2001 Topics Covered: “Bag of Words” Hypothesis Vector Space Model & Cosine Similarity Query Expansion Methods

Information Retrieval: The Challenge (1) Text DB includes: (1) Rainfall measurements in the Sahara continue to show a steady decline starting from the first measurements. In 1996 only 12mm of rain were recorded in upper Sudan, and 1mm in Southern Algiers... (2) Dan Marino states that professional football risks losing the number one position in the hearts of fans across this land. Declines in TV audience ratings are cited... (3) Alarming reductions in precipitation in desert regions are blamed for desert encroachment of previously fertile farmland in Northern Africa. Scientists measured both yearly precipitation and groundwater levels...

Information Retrieval: The Challenge (2) User query states: "Decline in rainfall and impact on farms near Sahara" Challenges How to retrieve (1) and (3) and not (2)? How to rank (3) as best? How to cope with no shared words?

Information Retrieval Assumptions (1) Basic IR task There exists a document collection {D j } User enters an ad hoc query Q Q correctly states user’s interest User wants the subset {D i } ⊆ {D j } most relevant to Q

Information Retrieval Assumption (2) "Shared Bag of Words" assumption Every query = {w i } Every document = {w k }...where w i & w k are in the same vocabulary Σ All syntax is irrelevant (e.g. word order) All document structure is irrelevant All meta-information is irrelevant (e.g. author, source, genre) => Words suffice for relevance assessment

Information Retrieval Assumption (3) Retrieval by shared words If Q and D j share some w i, then Relevant(Q, D j ) If Q and D j share all w i, then Relevant(Q, D j ) If Q and D j share over K% of w i, then Relevant(Q, D j )

Boolean Queries (1) Industrial use of Silver Q: silver R: "The Count’s silver anniversary..." "Even the crash of ’87 had a silver lining..." "The Lone Ranger lived on in syndication..." "Silver dropped to a new low in London..."... Q: silver AND photography R: "Posters of Tonto and the Lone Ranger..." "The Queen’s Silver Anniversary photos..."...

Boolean Queries (2) Q: (silver AND (NOT anniversary) AND (NOT lining) AND emulsion) OR (AgI AND crystal AND photography) R: "Silver Iodide Crystals in Photography..." "The emulsion was worth its weight in silver..."...

Boolean Queries (3) Boolean queries are: a) easy to implement b) confusing to compose c) seldom used (except by librarians) d) prone to low recall e) all of the above

Beyond the Boolean Boondoggle (1) Desiderata (1) Query must be natural for all users Sentence, phrase, or word(s) No AND’s, OR’s, NOT’s,... No parentheses (no structure) System focus on important words Q: I want laser printers now

Beyond the Boolean Boondoggle (2) Desiderata (2) Find what I mean, not just what I say Q: cheap car insurance (pAND (pOR "cheap" [1.0] "inexpensive" [0.9] "discount" [0.5]) (pOR "car" [1.0] "auto" [0.8] "automobile" [0.9] "vehicle" [0.5]) (pOR "insurance" [1.0] "policy" [0.3]))

The Vector Space Model (1) Let Σ = [w 1, w 2,... w n ] Let D j = [c(w 1, D j ), c(w 2, D j ),... c(w n, D j )] Let Q = [c(w 1, Q), c(w 2, Q),... c(w n, Q)]

The Vector Space Model (2) Initial Definition of Similarity: S I (Q, D j ) = Q. D j Normalized Definition of Similarity: S N (Q, D j ) = (Q. D j )/(|Q| x |D j |) = cos(Q, D j )
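The normalized similarity above is ordinary cosine similarity between term-count vectors. A minimal sketch in Python, representing each document and query as a dictionary of term counts (the query and document strings below are toy examples, not from any real collection):

```python
from collections import Counter
from math import sqrt

def cosine(q, d):
    """Cosine similarity between two term-count dicts:
    S_N(Q, D) = (Q . D) / (|Q| * |D|)."""
    dot = sum(q[t] * d.get(t, 0) for t in q)       # Q . D over shared terms
    norm_q = sqrt(sum(c * c for c in q.values()))  # |Q|
    norm_d = sqrt(sum(c * c for c in d.values()))  # |D|
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q = Counter("decline in rainfall near sahara".split())
d = Counter("rainfall decline in the sahara continues".split())
```

Here cosine(q, q) is 1.0 and cosine(q, d) falls between 0 and 1, so documents can be ranked by score.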

The Vector Space Model (3) Relevance Ranking If S N (Q, D i ) > S N (Q, D j ) Then D i is more relevant than D j to Q Retrieve(k, Q, {D j }) = Arg max k [cos(Q, D j )] over D j in {D j }

Refinements to VSM (2) Stop-Word Elimination Discard articles, auxiliaries, prepositions,... typically the most frequent small words Reduces document length by 30-40% Retrieval accuracy improves slightly (5-10%)

Refinements to VSM (3) Proximity Phrases E.g.: "air force" => airforce Found by high-mutual information p(w 1 w 2 ) >> p(w 1 )p(w 2 ) p(w 1 & w 2 in k-window) >> p(w 1 in k-window) p(w 2 in same k-window) Retrieval accuracy improves slightly (5-10%) Too many phrases => inefficiency
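The "high mutual information" test on the slide is pointwise mutual information over co-occurrence probabilities. A small sketch, with hypothetical probability estimates standing in for corpus counts:

```python
from math import log2

def pmi(p_w1w2, p_w1, p_w2):
    """Pointwise mutual information: log2( p(w1 w2) / (p(w1) * p(w2)) ).
    A large positive value means the pair co-occurs far more often than
    chance, flagging it as a candidate proximity phrase like "air force"."""
    return log2(p_w1w2 / (p_w1 * p_w2))
```

For example, if p("air force") = 0.001 while p("air") = p("force") = 0.002, the PMI is log2(250) ≈ 8 bits, well above the independence baseline of 0.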

Refinements to VSM (4) Words => Terms term = word | stemmed word | phrase Use exactly the same VSM method on terms (vs words)

Evaluating Information Retrieval (1) Contingency table:

                   relevant    not-relevant
  retrieved           a              b
  not retrieved       c              d

Evaluating Information Retrieval (2)
P = a/(a+b)    R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R (false negative rate)
F/A = b/(a+b+c+d) (false alarm / false positive rate)
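These measures follow directly from the four contingency-table cells. A small sketch computing all of them at once:

```python
def ir_metrics(a, b, c, d):
    """Standard IR measures from the contingency table:
    a = relevant & retrieved, b = not-relevant & retrieved,
    c = relevant & not retrieved, d = not-relevant & not retrieved."""
    P = a / (a + b)                    # precision
    R = a / (a + c)                    # recall
    return {
        "precision": P,
        "recall": R,
        "accuracy": (a + d) / (a + b + c + d),
        "F1": 2 * P * R / (P + R),     # harmonic mean of P and R
        "miss": c / (a + c),           # false-negative rate = 1 - R
        "false_alarm": b / (a + b + c + d),
    }
```

For instance, with a=40, b=10, c=20, d=30 this gives precision 0.8 and recall 2/3.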

Query Expansion (1) Observations: Longer queries often yield better results User’s vocabulary may differ from document vocabulary Q: how to avoid heart disease D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens" Longer queries offer more chances to match document vocabulary, which helps recall.

Query Expansion (2) Bridging the Gap Human query expansion (user or expert) Thesaurus-based expansion Seldom works in practice (unfocused) Relevance feedback –Widen a thin bridge over vocabulary gap –Adds words from document space to query Pseudo-Relevance feedback Local Context analysis

Relevance Feedback Rocchio Formula Q’ = F[Q, D ret ] F = weighted vector sum, such as: W(t,Q’) = αW(t,Q) + βW(t,D rel ) - γW(t,D irr )
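The Rocchio formula above can be sketched as a per-term weighted update, with vectors as term-to-weight dicts. The alpha/beta/gamma defaults below are illustrative choices, not values from the slide:

```python
def rocchio(query, rel_docs, irr_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: W(t, Q') = alpha*W(t, Q)
    + beta*avg W(t, D_rel) - gamma*avg W(t, D_irr).
    Terms whose updated weight is non-positive are dropped."""
    terms = set(query)
    for doc in rel_docs + irr_docs:
        terms |= set(doc)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if rel_docs:   # pull toward the centroid of relevant docs
            w += beta * sum(d.get(t, 0.0) for d in rel_docs) / len(rel_docs)
        if irr_docs:   # push away from the centroid of irrelevant docs
            w -= gamma * sum(d.get(t, 0.0) for d in irr_docs) / len(irr_docs)
        if w > 0:
            new_q[t] = w
    return new_q
```

Note how terms from relevant documents (e.g. "cardiac") enter the new query even though the user never typed them, which is exactly the vocabulary-gap bridging described above.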

Term Weighting Methods (1) Salton’s Tf*IDf Tf = term frequency in a document Df = document frequency of term = # documents in collection with this term IDf = 1/Df (inverse document frequency)

Term Weighting Methods (2) Salton’s Tf*IDf TfIDf = f 1 (Tf)*f 2 (IDf) E.g. f 1 (Tf) = Tf*ave(|D j |)/|D| E.g. f 2 (IDf) = log 2 (IDf) f 1 and f 2 can differ for Q and D
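Since f1 and f2 are tunable, the sketch below picks one common instantiation (raw Tf, and log2 of the inverse document fraction); it is one choice among many, not the slide's prescribed formula:

```python
from math import log2

def tfidf_weight(tf, df, n_docs):
    """One Tf*IDf variant: f1(Tf) = raw term frequency,
    f2(IDf) = log2(N / Df), the log inverse document frequency.
    A term appearing in every document gets weight 0."""
    if tf == 0 or df == 0:
        return 0.0
    return tf * log2(n_docs / df)
```

For example, a term with Tf = 3 occurring in 1 of 8 documents scores 3 * log2(8) = 9, while the same Tf for a term in all 8 documents scores 0.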

Efficient Implementations of VSM (1) Build an Inverted Index (next slide) Filter all 0-product terms Precompute IDF, per-document TF …but remove stopwords first.

Efficient Implementations of VSM (3) Postings format: [term i , IDF(term i ), <doc i , freq(term i , doc i ); doc j , freq(term i , doc j );...>] or, with positions: [term i , IDF(term i ), <doc i , freq(term i , doc i ), [pos 1,i , pos 2,i ,...]; doc j , freq(term i , doc j ), [pos 1,j , pos 2,j ,...];...>] where pos 1,j is the first position at which the term occurs in document j, and so on.
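A minimal positional inverted index in this layout can be built in one pass over the collection; the two-document collection below is a toy example echoing the earlier silver queries:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, freq, positions),
    mirroring the positional postings layout described above."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)          # record every occurrence
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

index = build_inverted_index([
    "silver iodide crystals in photography",
    "the silver lining of the crash",
])
```

At query time, only the postings lists of the query terms are touched, which is how the zero-product terms get filtered for free.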

Generalized Vector Space Model (1) Principles Define terms by their occurrence patterns in documents Define query terms in the same way Compute similarity by document-pattern overlap for terms in D and Q Use standard Cos similarity and either binary or TfIDf weights

Generalized Vector Space Model (2) Advantages Automatically calculates partial similarity If "heart disease" and "stroke" and "ventricular" co-occur in many documents, then if the query contains only one of these terms, documents containing the other will receive partial credit proportional to their document co-occurrence ratio. No need to do query expansion or relevance feedback

GVSM, How it Works (1) Represent the collection as vector of documents: Let C = [D 1, D 2,..., D m ] Represent each term by its distributional frequency: Let t i = [Tf(t i, D 1 ), Tf(t i, D 2 ),..., Tf(t i, D m )] Term-to-term similarity is computed as: Sim(t i, t j ) = cos(vec(t i ), vec(t j )) Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval
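The term-to-term similarity step can be sketched directly from a term-by-document Tf matrix; the three-term, four-document counts below are made up for illustration:

```python
from math import sqrt

def term_sim(ti, tj):
    """Cosine between two term distribution vectors, where each term is
    represented by its Tf across the documents of the collection."""
    dot = sum(a * b for a, b in zip(ti, tj))
    ni = sqrt(sum(a * a for a in ti))
    nj = sqrt(sum(b * b for b in tj))
    return dot / (ni * nj) if ni and nj else 0.0

# Toy term-by-document Tf counts over a 4-document collection.
tf = {
    "arafat": [3, 2, 0, 0],
    "plo":    [2, 3, 0, 0],
    "silver": [0, 0, 4, 1],
}
```

Here "arafat" and "plo" co-occur heavily, so term_sim treats them as near-synonyms (similarity 12/13), while either one against "silver" scores 0.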

GVSM, How it Works (2) And query-document similarity is computed as before: Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the direct dot product we use a function of the term-to-term similarity computation above. For instance: Sim(Q, D) = Σ i [Max j (sim(q i , d j ))] or, normalizing for document & query length: Sim norm (Q, D) = Σ i [Max j (sim(q i , d j ))] / (|Q| × |D|)

A Critique of Pure Relevance (1) IR Maximizes Relevance Precision and recall are relevance measures Quality of documents retrieved is ignored

A Critique of Pure Relevance (2) Other Important Factors What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium,...?? In IR, we really want to maximize: P(U(f i,..., f n ) | Q & {C} & U & H) where Q = query, {C} = collection set, U = user profile, H = interaction history...but we don’t yet know how. Darn.

Maximal Marginal Relevance (1) A crude first approximation: novelty => minimal-redundancy Weighted linear combination: (redundancy = cost, relevance = benefit) Free parameters: k and λ

Maximal Marginal Relevance (2) MMR(Q, C, R) = Argmax over d i in C of [λ S(Q, d i ) - (1-λ) Max over d j in R of S(d i , d j )]

Maximal Marginal Relevance (MMR) (3) COMPUTATION OF MMR RERANKING 1. Standard IR retrieval of top-N docs: Let D r = IR(D, Q, N) 2. Rank the d i in D r with max sim(d i , Q) as the top doc, i.e. Let Ranked = {d i } 3. Let D r = D r \{d i } 4. While D r is not empty, do: a. Find d i with max MMR(D r , Q, Ranked) b. Let Ranked = Ranked.d i c. Let D r = D r \{d i }
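The steps above reduce to a greedy loop. A sketch using cosine over term-count dicts for S (the lam value and toy documents are illustrative, not from the slide):

```python
from math import sqrt

def cos_sim(u, v):
    """Cosine similarity between term-count dicts (plays the role of S)."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_rerank(query, docs, lam=0.3, k=10):
    """Greedy MMR reranking (steps 1-4 above): repeatedly pick the
    remaining doc maximizing lam*S(Q,d) - (1-lam)*max over Ranked of S(d,r)."""
    remaining = list(docs)
    ranked = []
    while remaining and len(ranked) < k:
        def mmr(d):
            redundancy = max((cos_sim(d, r) for r in ranked), default=0.0)
            return lam * cos_sim(query, d) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

With a near-duplicate pair in the candidate set and a small lam, the duplicate is demoted below a less relevant but novel document, which is the intended relevance/novelty trade-off.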

Maximal Marginal Relevance (MMR) (4) Applications: Ranking retrieved documents from IR Engine Ranking passages for inclusion in Summaries

Document Summarization in a Nutshell (1) Types of Summaries

  Task                                      Query-relevant (focused)           Query-free (generic)
  INDICATIVE, for filtering                 To filter search engine results    Short abstracts
  (Do I read further?)
  CONTENTFUL, for reading in lieu           To solve problems for busy         Executive summaries
  of the full doc.                          professionals

Summarization as Passage Retrieval (1) For Query-Driven Summaries 1. Divide document into passages, e.g. sentences, paragraphs, FAQ-pairs,... 2. Use query to retrieve most relevant passages, or better, use MMR to avoid redundancy. 3. Assemble retrieved passages into a summary.