Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea

Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q  = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the three-dimensional term space with axes T1, T2, T3]

Is D1 or D2 more similar to Q? How do we measure the degree of similarity? Distance? Angle? Projection?
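One common way to answer this question in the vector space model is to compare the angle between each document vector and the query vector. The sketch below is a minimal Python illustration of the "angle" option via cosine similarity; the vectors are taken from the example above, and cosine is only one of the possible measures listed on the slide.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Vectors from the example above (components along T1, T2, T3)
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

print(cosine_similarity(D1, Q))  # ~0.81
print(cosine_similarity(D2, Q))  # ~0.13 -> D1 is more similar to Q
```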

Document Collection Representation

A collection of n documents can be represented in the vector space model by a term-document matrix. An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

      T1    T2   ...  Tt
D1    w11   w21  ...  wt1
D2    w12   w22  ...  wt2
:     :     :         :
Dn    w1n   w2n  ...  wtn
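A minimal sketch of building such a matrix from raw term counts, assuming a toy two-document corpus; the raw counts used as "weights" here are only illustrative, since the following slides replace them with tf-idf weights.

```python
from collections import Counter

# Toy corpus; in practice documents would be tokenized and normalized first
docs = {
    "D1": "information retrieval and web search".split(),
    "D2": "web search engines index the web".split(),
}

# Vocabulary = all distinct terms across the collection
vocab = sorted({t for tokens in docs.values() for t in tokens})

# Term-document matrix of raw counts (rows = documents, columns = terms)
matrix = {
    doc_id: [Counter(tokens)[t] for t in vocab]
    for doc_id, tokens in docs.items()
}

print(vocab)
for doc_id, row in matrix.items():
    print(doc_id, row)
```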

Term Frequency tf

The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d. More frequent terms in a document are more important, i.e., more indicative of its topic.

We may want to normalize the term frequency (tf) by the frequency of the most frequent term in the document:
tf_t,d = f_t,d / max_t' f_t',d
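A minimal sketch of this normalization, assuming the document is already tokenized into a list of terms.

```python
from collections import Counter

def normalized_tf(tokens):
    """tf_t,d = f_t,d divided by the frequency of the most frequent term in d."""
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}

print(normalized_tf("a a a b b c".split()))
# {'a': 1.0, 'b': 0.666..., 'c': 0.333...}
```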

Document Frequency

Rare terms are more informative than frequent terms – recall stop words.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query arachnocentric.
→ We want a high weight for rare terms like arachnocentric.

Idf Weight

df_t is the document frequency of t: the number of documents that contain t.
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N, the number of documents in the collection
We define the idf (inverse document frequency) of t by
idf_t = log(N / df_t)
– We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
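A minimal sketch of the idf computation. The log base is not specified on the slide; natural log is assumed here, which is also consistent with the worked example two slides below.

```python
import math

def idf(N, df_t):
    """Inverse document frequency of a term that appears in df_t of N documents."""
    return math.log(N / df_t)

# Rare terms get a high idf, frequent terms a low idf
print(idf(10_000, 50))    # ~5.3
print(idf(10_000, 1300))  # ~2.0
```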

tf-idf Weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_t,d = tf_t,d × log(N / df_t)
Best known weighting scheme in information retrieval.
– Theoretically proven to work well (Papineni, NAACL 2001)
– Note: the "-" in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf × idf
The weight increases with the number of occurrences of the term within a document, and with the rarity of the term in the collection.

Computing tf-idf: An Example

Given a document containing terms with the following frequencies: A(3), B(2), C(1).
Assume the collection contains 10,000 documents, and the document frequencies of these terms are: A(50), B(1300), C(250).
Then (using the natural logarithm):
A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
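A minimal sketch that reproduces these numbers by combining the normalized tf and idf definitions from the previous slides (natural log assumed, as above).

```python
import math

N = 10_000  # number of documents in the collection

# term -> (frequency in the document, document frequency in the collection)
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}

max_f = max(f for f, _ in terms.values())  # frequency of the most frequent term

for term, (f, df) in terms.items():
    tf = f / max_f
    idf = math.log(N / df)
    print(f"{term}: tf = {tf:.2f}; idf = {idf:.1f}; tf-idf = {tf * idf:.1f}")
# A: tf = 1.00; idf = 5.3; tf-idf = 5.3
# B: tf = 0.67; idf = 2.0; tf-idf = 1.4  (the slide's 1.3 rounds idf to 2.0 before multiplying)
# C: tf = 0.33; idf = 3.7; tf-idf = 1.2
```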

Salton & Buckley: Experiments with Term Weighting Approaches

Six collections, 1,800 term weighting approaches
– 287 combinations found to be distinct
Comparative evaluation using rankings:
– The lower the rank, the better; the best weighting scheme has a rank of 1
Average search precision:
– Average of the precisions at the recall points 0.25, 0.50, and 0.75
Average across queries:
– Macro-average: the average of the per-query average search precisions
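A minimal sketch of the evaluation measure just described, using hypothetical per-query precision values; the precision at each recall point is assumed to be computed already.

```python
def average_search_precision(prec_at_recall):
    """Average of the precisions at recall points 0.25, 0.50, 0.75 for one query."""
    return sum(prec_at_recall[r] for r in (0.25, 0.50, 0.75)) / 3

# Hypothetical per-query precision values at the three recall points
queries = [
    {0.25: 0.80, 0.50: 0.60, 0.75: 0.40},
    {0.25: 0.70, 0.50: 0.50, 0.75: 0.30},
]

# Macro-average: the average of the per-query average search precisions
macro_avg = sum(average_search_precision(q) for q in queries) / len(queries)
print(macro_avg)  # 0.55
```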

IR Test Collections

CACM (articles from the Communications of the ACM journal)
CISI (articles about information sciences)
CRAN (abstracts from aeronautics articles)
INSPEC (articles in computer engineering)
MED (medical articles)
NPL (articles about electrical engineering)

IR Collections

Term Weighting Components

Sample Weighting Schemes

Performance Results

Lessons Learned

Term weighting DOES matter.
Query vector:
– Term frequency:
  Use n for short queries
  Use t for longer queries that require better discrimination among terms
– Document frequency:
  Use f
– Do not normalize by query length

Lessons Learned

Document vectors (the component codes n, t, b, f, p, x, c are illustrated in the sketch below):
– Term frequency:
  For technical vocabulary (e.g., CRAN), use n
  For more varied vocabulary, use t
  For short document vectors, use b
– Document frequency:
  Inverse document frequency f is similar to the probabilistic term weight p: typically use f
  For dynamic collections with many changes in the collection makeup, use x
– Normalization:
  Typically use c (in particular when there is high deviation in vector lengths)
  For short documents of homogeneous length, use x
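The component codes above follow the Salton & Buckley triplet notation. Their exact formulas do not appear in this transcript, so the sketch below assumes the standard definitions from Salton & Buckley (1988): n = augmented normalized tf, t = raw tf, b = binary; f = idf, p = probabilistic weight, x = none; c = cosine normalization, x = none.

```python
import math

def term_frequency(code, f, max_f):
    """Term-frequency component: b = binary, t = raw tf, n = augmented normalized tf."""
    if code == "b":
        return 1.0 if f > 0 else 0.0
    if code == "t":
        return float(f)
    if code == "n":
        return 0.5 + 0.5 * f / max_f
    raise ValueError(code)

def collection_frequency(code, N, df):
    """Collection-frequency component: x = none, f = idf, p = probabilistic weight."""
    if code == "x":
        return 1.0
    if code == "f":
        return math.log(N / df)
    if code == "p":
        return math.log((N - df) / df)
    raise ValueError(code)

def weight_vector(tf_code, cf_code, norm_code, doc_counts, doc_freqs, N):
    """Weight a document's term counts with a (tf, cf, normalization) triple."""
    max_f = max(doc_counts.values())
    w = {
        t: term_frequency(tf_code, f, max_f) * collection_frequency(cf_code, N, doc_freqs[t])
        for t, f in doc_counts.items()
    }
    if norm_code == "c":  # cosine normalization; x = no normalization
        length = math.sqrt(sum(v * v for v in w.values()))
        w = {t: v / length for t, v in w.items()}
    return w

# Hypothetical document: term counts, collection document frequencies, collection size
doc_counts = {"A": 3, "B": 2, "C": 1}
doc_freqs = {"A": 50, "B": 1300, "C": 250}
print(weight_vector("n", "f", "c", doc_counts, doc_freqs, 10_000))
```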