Under The Hood [Part II]
Web-Based Information Architectures, MSEC 20-760, Mini II
Jaime Carbonell

Today's Topics
- Term weighting in detail
- Generalized Vector Space Model (GVSM)
- Maximal Marginal Relevance
- Summarization as Passage Retrieval

Term Weighting Revisited (1): Definitions
- w_i, the "i-th term": a word, stemmed word, or indexed phrase.
- D_j, the "j-th document": a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.

Term Weighting Revisited (2): Definitions
- C, "the collection": the full set of indexed documents (e.g. the New York Times archive, the Web, ...).
- Tf(w_i, D_j), "term frequency": the number of times w_i occurs in document D_j. Tf is sometimes normalized by dividing by the frequency of the most frequent non-stop term in the document: Tf_norm = Tf / Tf_max.

Term Weighting Revisited (3): Definitions
- Df(w_i, C), "document frequency": the number of documents in C in which w_i occurs. Df may be normalized by dividing it by the total number of documents in C.
- IDf(w_i, C), "inverse document frequency": [Df(w_i, C) / size(C)]^-1. Most often log_2(IDf) is used rather than IDf directly.

Term Weighting Revisited (4): TfIDf Term Weights
- In general: TfIDf(w_i, D_j, C) = F_1(Tf(w_i, D_j)) * F_2(IDf(w_i, C)).
- Usually F_1 = log_2(Tf) or Tf/Tf_max; usually F_2 = log_2(IDf).
- In the SMART IR system: TfIDf(w_i, D_j, C) = [Tf(w_i, D_j) / Tf_max(D_j)] * log_2(IDf(w_i, C)).
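To make the definitions concrete, here is a minimal Python sketch of the SMART-style weight above, computed over a toy in-memory collection. The collection, tokenization, and function names are illustrative assumptions, not part of the SMART system itself.

```python
import math
from collections import Counter

# Toy collection: each document is a list of already stemmed, stop-filtered terms.
collection = {
    "D1": ["heart", "disease", "stroke", "treatment"],
    "D2": ["stroke", "ventricular", "heart", "heart"],
    "D3": ["web", "search", "engine", "ranking"],
}

def idf(term, collection):
    """Inverse document frequency, taken as log_2(size(C) / Df) per the slide."""
    df = sum(1 for terms in collection.values() if term in terms)
    return math.log2(len(collection) / df) if df else 0.0

def tfidf_smart(term, doc_terms, collection):
    """SMART-style weight: (Tf / Tf_max of the document) * log_2(IDf)."""
    tf = Counter(doc_terms)                      # Tf(w_i, D_j) for every term in D_j
    return (tf[term] / max(tf.values())) * idf(term, collection)

print(tfidf_smart("heart", collection["D2"], collection))   # frequent in D2, fairly rare in C
print(tfidf_smart("stroke", collection["D2"], collection))  # lower weight
```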

Term Weighting beyond TfIDf (1): Probabilistic Models
- Old-style probabilistic models (see textbooks): improve precision-recall slightly.
- Full statistical language modeling (CMU): improves precision-recall more significantly, but is difficult to compute efficiently.

Term Weighting beyond TfIDf (2)
- Neural networks: theoretically attractive, but unfortunately do not scale up at all.
- Fuzzy sets: not deeply researched; scaling difficulties.

Term Weighting beyond TfIDf (3): Natural Language Analysis
- Analyze and understand the documents and the query first; in theory, the ultimate IR method.
- In general, NL understanding is an unsolved problem, and there are scale-up challenges even if we could do it.
- But it has been shown to improve IR for very limited domains.

Generalized Vector Space Model (1): Principles
- Define terms by their occurrence patterns in documents.
- Define query terms in the same way.
- Compute similarity by document-pattern overlap for terms in D and Q.
- Use standard cosine similarity and either binary or TfIDf weights.

Generalized Vector Space Model (2): Advantages
- Automatically calculates partial similarity. If "heart disease", "stroke", and "ventricular" co-occur in many documents, then when the query contains only one of these terms, documents containing the others receive partial credit proportional to their document co-occurrence ratio.
- No need to do query expansion or relevance feedback.

Generalized Vector Space Model (3): Disadvantages
- Computationally expensive.
- Performance = vector space model + query expansion.

GVSM, How it Works (1)
- Represent the collection as a vector of documents: let C = [D_1, D_2, ..., D_m].
- Represent each term by its distributional frequency: let t_i = [Tf(t_i, D_1), Tf(t_i, D_2), ..., Tf(t_i, D_m)].
- Compute term-to-term similarity as: Sim(t_i, t_j) = cos(vec(t_i), vec(t_j)).
- Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval.

GVSM, How it Works (2)
- Query-document similarity is computed as before, Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation we use a function of the term-to-term similarities above.
- For instance: Sim(Q, D) = Σ_i [max_j sim(q_i, d_j)], or, normalizing for document and query length, Sim_norm(Q, D) = (normalized formula not captured in the transcript).
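A minimal sketch of the GVSM scoring just described, assuming a small term-by-document frequency matrix; the data and helper names are illustrative, and the aggregation follows the Σ_i max_j formula on this slide rather than any particular system's implementation.

```python
import math

# Toy term-by-document Tf matrix: term -> [Tf in D1, Tf in D2, ..., Tf in Dm]
term_vectors = {
    "heart":       [3, 2, 0, 0],
    "stroke":      [1, 2, 0, 0],
    "ventricular": [0, 1, 0, 0],
    "search":      [0, 0, 2, 1],
}

def cosine(u, v):
    """Cosine similarity between two term distribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def term_sim(t_i, t_j):
    """Sim(t_i, t_j) = cos(vec(t_i), vec(t_j)) over document occurrence patterns."""
    return cosine(term_vectors[t_i], term_vectors[t_j])

def gvsm_sim(query_terms, doc_terms):
    """Sim(Q, D) = sum over query terms of the best-matching document term."""
    return sum(max(term_sim(q, d) for d in doc_terms) for q in query_terms)

# "stroke" gets partial credit against a document that only mentions "heart"
# and "ventricular", because these terms co-occur elsewhere in the collection.
print(gvsm_sim(["stroke"], ["heart", "ventricular"]))
```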

GVSM, How it Works (3)
- Primary problem: more computation (the sparse term-document representation becomes dense).
- Primary benefit: automatic term expansion driven by the corpus.

A Critique of Pure Relevance (1): IR Maximizes Relevance
- Precision and recall are relevance measures.
- The quality of the documents retrieved is ignored.

A Critique of Pure Relevance (2): Other Important Factors
- What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
- In IR, we really want to maximize P(U(f_1, ..., f_n) | Q & {C} & U & H), where Q = query, {C} = collection set, U = user profile, H = interaction history.
- ...but we don't yet know how. Darn.

Maximal Marginal Relevance (1)
- A crude first approximation: novelty => minimal redundancy.
- Weighted linear combination: redundancy = cost, relevance = benefit.
- Free parameters: k and λ.

Maximal Marginal Relevance (2)
MMR(Q, C, R) = Argmax_k (d_i in C) [ λ * S(Q, d_i) - (1 - λ) * max_{d_j in R} S(d_i, d_j) ]
where Q is the query, C the candidate documents, R the documents already ranked, S a similarity function, and Argmax_k selects the top k incrementally.

Maximal Marginal Relevance (MMR) (3): Computation of MMR Reranking
1. Standard IR retrieval of the top-N docs: let D_r = IR(D, Q, N).
2. Rank the d_i in D_r with maximal sim(d_i, Q) as the top document, i.e. let Ranked = {d_i}.
3. Let D_r = D_r \ {d_i}.
4. While D_r is not empty, do:
   a. Find the d_i with maximal MMR(D_r, Q, Ranked).
   b. Let Ranked = Ranked . d_i (append).
   c. Let D_r = D_r \ {d_i}.
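Below is a minimal Python sketch of the reranking loop above. The similarity inputs, candidate list, and parameter defaults are illustrative assumptions; in practice S(Q, d) and S(d_i, d_j) would be TfIDf/cosine scores from the retrieval engine.

```python
def mmr_rerank(query_sim, doc_sim, candidates, lam=0.7, k=10):
    """Greedy MMR reranking of IR results.

    query_sim[d]      : S(Q, d), relevance of candidate d to the query
    doc_sim[(d1, d2)] : S(d1, d2), similarity between two candidates
    candidates        : top-N docs from standard IR retrieval (D_r)
    lam               : lambda, relevance vs. redundancy trade-off
    k                 : number of documents to return
    """
    def s(d1, d2):
        # Pairwise similarity lookup, order-insensitive.
        return doc_sim.get((d1, d2), doc_sim.get((d2, d1), 0.0))

    remaining = list(candidates)
    # Step 2: the document most similar to the query goes first.
    first = max(remaining, key=lambda d: query_sim[d])
    ranked = [first]
    remaining.remove(first)

    # Step 4: repeatedly add the candidate with the best marginal relevance.
    while remaining and len(ranked) < k:
        best = max(
            remaining,
            key=lambda d: lam * query_sim[d]
                          - (1 - lam) * max(s(d, r) for r in ranked),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy example with hypothetical scores: d2 is relevant but nearly duplicates d1.
q = {"d1": 0.9, "d2": 0.85, "d3": 0.6}
pairs = {("d1", "d2"): 0.95, ("d1", "d3"): 0.1, ("d2", "d3"): 0.1}
print(mmr_rerank(q, pairs, ["d1", "d2", "d3"], lam=0.7, k=2))  # ['d1', 'd3']
```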

MMR Ranking vs. Standard IR (figure omitted): a query point surrounded by candidate documents, with the MMR ranking path and the standard IR ranking path drawn over them; λ controls the curl of the MMR "spiral".

Maximal Marginal Relevance (MMR) (4): Applications
- Ranking documents retrieved by an IR engine.
- Ranking passages for inclusion in summaries.

Document Summarization in a Nutshell (1): Types of Summaries

Task                                             | Query-relevant (focused)                  | Query-free (generic)
INDICATIVE, for filtering ("Do I read further?") | To filter search engine results           | Short abstracts
CONTENTFUL, for reading in lieu of the full doc  | To solve problems for busy professionals  | Executive summaries

Document Summarization in a Nutshell (2): Other Dimensions
- Single- vs. multi-document summarization
- Genre-adaptive vs. one-size-fits-all
- Single-language vs. translingual
- Flat summary vs. hyperlinked pyramid
- Text-only vs. multi-media
- ...

Summarization as Passage Retrieval (1): For Query-Driven Summaries
1. Divide the document into passages, e.g. sentences, paragraphs, FAQ-pairs.
2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy.
3. Assemble the retrieved passages into a summary.
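A minimal sketch of the query-driven pipeline above, assuming sentence-level passages and a simple word-overlap relevance score standing in for TfIDf/MMR scoring; all names, the splitting heuristic, and the scoring choice are illustrative.

```python
def split_passages(document):
    """Step 1: crude sentence-level passage splitting (illustrative only)."""
    return [s.strip() for s in document.replace("?", ".").split(".") if s.strip()]

def relevance(query, passage):
    """Stand-in relevance score: fraction of query words appearing in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def summarize(document, query, k=2):
    """Steps 2-3: retrieve the k most query-relevant passages and assemble them."""
    passages = split_passages(document)
    top = sorted(passages, key=lambda p: relevance(query, p), reverse=True)[:k]
    # Reassemble in original document order so the summary reads coherently.
    return ". ".join(p for p in passages if p in top) + "."

doc = ("Heart disease is a leading cause of death. "
       "Stroke risk rises with untreated heart disease. "
       "Exercise and diet reduce heart disease risk. "
       "The weather was pleasant on Tuesday.")
print(summarize(doc, "heart disease risk", k=2))
```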

Summarization as Passage Retrieval (2): For Generic Summaries
1. Use the title or the top-k TfIDf terms as the query.
2. Proceed as for query-driven summarization.
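As an illustration of step 1, a hedged sketch of building a pseudo-query from a document's top-k TfIDf terms; the document-frequency table, collection size, and k are assumed values, and the resulting string would then be fed to the query-driven summarizer sketched above.

```python
import math
from collections import Counter

def top_tfidf_terms(doc_terms, doc_freq, num_docs, k=5):
    """Pick the k terms with the highest Tf * log_2(N / Df) weight."""
    counts = Counter(doc_terms)
    def weight(term):
        df = doc_freq.get(term, 1)          # assume unseen terms occur once in C
        return counts[term] * math.log2(num_docs / df)
    return sorted(counts, key=weight, reverse=True)[:k]

# Hypothetical document-frequency table over a 1,000-document collection.
doc_freq = {"heart": 40, "disease": 60, "the": 1000, "risk": 80, "of": 990}
terms = "the risk of heart disease the heart the".split()
pseudo_query = " ".join(top_tfidf_terms(terms, doc_freq, num_docs=1000, k=3))
print(pseudo_query)   # e.g. "heart disease risk" (order depends on weights)
```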

Summarization as Passage Retrieval (3): For Multidocument Summaries
1. Cluster the documents into topically related groups.
2. For each group, divide each document into passages and keep track of the source of each passage.
3. Use MMR to retrieve the most relevant non-redundant passages (MMR is necessary when summarizing multiple documents).
4. Assemble a summary for each cluster.