Hinrich Schütze and Christina Lioma

Slides:



Advertisements
Similar presentations
Lecture 6: Scoring, Term Weighting and the Vector Space Model
Advertisements

Introduction to Information Retrieval Introduction to Information Retrieval Lecture 7: Scoring and results assembly.
Beyond Boolean Queries Ranked retrieval  Thus far, our queries have all been Boolean.  Documents either match or don’t.  Good for expert users with.
Text Similarity David Kauchak CS457 Fall 2011.
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 4.
Hinrich Schütze and Christina Lioma
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
TF-IDF David Kauchak cs160 Fall 2009 adapted from:
IR Models: Overview, Boolean, and Vector
Hinrich Schütze and Christina Lioma
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
CpSc 881: Information Retrieval
Indexing The essential step in searching. Review a bit We have seen so far – Crawling In the abstract and as implemented Your own code and Nutch If you.
Information Retrieval Lecture 6. Recap of lecture 4 Query expansion Index construction.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 11: Probabilistic Information Retrieval.
CS276 Information Retrieval and Web Mining
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 18: Latent Semantic Indexing 1.
Information Retrieval IR 6. Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support.
The Vector Space Model …and applications in Information Retrieval.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Vector Space Model : TF - IDF
CES 514 Data Mining March 11, 2010 Lecture 5: scoring, term weighting, vector space model (Ch 6)
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002.
Documents as vectors Each doc j can be viewed as a vector of tf.idf values, one component for each term So we have a vector space terms are axes docs live.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
Boolean and Vector Space Models
Automated Essay Grading Resources: Introduction to Information Retrieval, Manning, Raghavan, Schutze (Chapter 06 and 18) Automated Essay Scoring with e-rater.
Scoring, Term Weighting, and Vector Space Model Lecture 7: Scoring, Term Weighting and the Vector Space Model Web Search and Mining 1.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1.
Information Retrieval continued Many slides in this presentation are from Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction.
TF-IDF David Kauchak cs458 Fall 2012 adapted from:
Information Retrieval Lecture 2 Introduction to Information Retrieval (Manning et al. 2007) Chapter 6 & 7 For the MSc Computer Science Programme Dell Zhang.
Basic ranking Models Boolean and Vector Space Models.
Advanced topics in Computer Science Jiaheng Lu Department of Computer Science Renmin University of China
Information Retrieval and Data Mining (AT71. 07) Comp. Sc. and Inf
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
Lecture 5 Term and Document Frequency CS 6633 資訊檢索 Information Retrieval and Web Search 資電 125 Based on ppt files by Hinrich Schütze.
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1.
Introduction to Information Retrieval Introduction to Information Retrieval COMP4210: Information Retrieval and Search Engines Lecture 5: Scoring, Term.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 5 9/6/2011.
Vector Space Models.
Introduction to Information Retrieval CSE 538 MRS BOOK – CHAPTER VI SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL 1.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Lecture 6: Scoring, Term Weighting and the Vector Space Model
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
Information Retrieval and Web Search IR models: Vector Space Model Instructor: Rada Mihalcea [Note: Some slides in this set were adapted from an IR course.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 9: Scoring, Term Weighting and the Vector Space Model.
Web Information Retrieval
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
Introduction to Information Retrieval Probabilistic Information Retrieval Chapter 11 1.
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
IR 6 Scoring, term weighting and the vector space model.
Introduction to Information Retrieval Information Retrieval and Data Mining (AT71.07) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor:
The Vector Space Models (VSM)
Plan for Today’s Lecture(s)
7CCSMWAL Algorithmic Issues in the WWW
Ch 6 Term Weighting and Vector Space Model
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Representation of documents and queries
CS276: Information Retrieval and Web Search
Term Frequency–Inverse Document Frequency
Presentation transcript:

Hinrich Schütze and Christina Lioma Lecture 6: Scoring, Term Weighting, The Vector Space Model

Overview Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model

Outline Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model

Problem with Boolean search Thus far, our queries have all been Boolean. Documents either match or don’t. Good for expert users with precise understanding of their needs and of the collection. Not good for the majority of users Most users are not capable of writing Boolean (or they are, but they think it’s too much work.) Most users don’t want to wade through 1000s of results. 4

Problem with Boolean search: Feast or famine Boolean queries often result in either too few or too many (1000s) results. Feast Query : “standard user dlink 650”  200,000 hits Famine Query : “standard user dlink 650 no card found”  2 hits In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits. AND gives too few, and OR gives too many.

Feast or famine: No problem in ranked retrieval With ranking, large result sets are not an issue. Just show the top 10 results Doesn’t overwhelm the user Premise: the ranking algorithm works: More relevant results are ranked higher than less relevant results. 6

Query-document matching scores How do we compute the score of a query-document pair? Let’s start with a one-term query. If the query term does not occur in the document: score should be 0. The more frequent the query term in the document, the higher the score (should be) We will look at a number of alternatives for doing this. 7

Outline Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model 8

Parametric and zone indexes Metadata- we mean specific forms of data about a document, such as its author(s), title and data of publication. The metadata would generally include fields such as the data of creation and the format of the document, as well the author and possibly the title of the document. Zones are similar to fields, except the contents of a zone can be arbitrary free text. 9

Parametric index Parametric search

Zone index Basic zone index Zone index in which the zone is encoded steven

Weighted zone scoring Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval Consider a set of documents contributes a Boolean value.

Weighted zone scoring-Example Consider the query steven in a collection in which each document has three zones: abstract, title and author. The Boolean score function for a zone takes on the value 1 if the query term steven is present in the zone, and 0 otherwise. Weighted zone scoring in such a collection would require three weights g1,g2 and g3, respectively corresponding to abstract, title and author zones. Suppose we set g1=0.2, g2=0.3, and g3=0.5 steven

Algorithm for computing the weighted zone score from two postings lists

Learning weights Given a query q and a document d, we use the given Boolean match function to compute Boolean variables ST(d,q) and SB(d,q). Train examples, each of which is a triple of the form

Learning weights

The optimal weight g

The optimal weight g Total error=0.75 g= =0.25

Outline Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model

Bag of words model The exact ordering of the terms in a document is ignored but the number of occurrence of each term is material. We only retain information on the number of occurrences of each term. “John is quicker than Mary” and “Mary is quicker than John” are represented the same way. This is called a bag of words model. For now: bag of words model 20

Term frequency tf The term frequency tft,d of term t in document d is defined as the number of occurrences of term t in document d. We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want because: A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term. But not 10 times more relevant. Relevance does not increase proportionally with term frequency. 21

Log-frequency weighting The log frequency weight of term t in d is tft,d → wft,d : 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. Score for a document-query pair: sum over terms t in both q and d: Score The score is 0 if none of the query terms is present in the document. 22

Outline Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model

Collection frequency vs. Document frequency Collection frequency(cft) The total number of occurrences of a term t in the collection. Document frequency(dft) The number of documents in the collection that contain a term t. 24

Desired weight for rare terms Rare terms are more informative than frequent terms. Consider a term in the query that is rare in the collection (e.g., ARACHNOCENTRIC). A document containing this term is very likely to be relevant. → We want high weights for rare terms like ARACHNOCENTRIC. 25

Desired weight for frequent terms Frequent terms are less informative than rare terms. Consider a term in the query that is frequent in the collection (e.g., GOOD, INCREASE, LINE). A document containing this term is more likely to be relevant than a document that doesn’t . . . . . . but words like GOOD, INCREASE and LINE are not sure indicators of relevance. → For frequent terms like GOOD, INCREASE and LINE, we want positive weights . . . . . . but lower weights than for rare terms. 26

Inverse Document Frequency We want high weights for rare terms like ARACHNOCENTRIC. We want low (positive) weights for frequent words like GOOD, INCREASE and LINE. We will use document frequency to factor this into computing the matching score. 27

idf weight dft is an inverse measure of the informativeness of term t. We define the idft weight of term t as follows: (N is the number of documents in the collection.) idft is a measure of the informativeness of the term. 28

Examples for idf The Reuters collection of 806,791 documents Compute idft using the formula: 29

tf-idf weighting The tf-idf weighting scheme assigns to term t a weight in document d given by Highest when t occurs many times within a small number of documents. Lower when the term occurs fewer times in a document, or occurs in many documents. Lowest when the term occurs in virtually all documents. 30

Examples for tf-idf 31

Variant tf-idf functions wf-idft,d weighting 32

Outline Why ranked retrieval? Weighted zone scoring Term frequency tf-idf weighting The vector space model

Binary incidence matrix Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth . . . ANTHONY BRUTUS CAESAR CALPURNIA CLEOPATRA MERCY WORSER . . . 1 Each document is represented as a binary vector ∈ {0, 1}|V|. 34

Count matrix Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth . . . ANTHONY BRUTUS CAESAR CALPURNIA CLEOPATRA MERCY WORSER . . . 157 4 232 57 2 73 227 10 3 1 8 5 Each document is now represented as a count vector ∈ N|V|. 35

Binary → count → weight matrix Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth . . . ANTHONY BRUTUS CAESAR CALPURNIA CLEOPATRA MERCY WORSER . . . 5.25 1.21 8.59 0.0 2.85 1.51 1.37 3.18 6.10 2.54 1.54 1.90 0.11 1.0 0.12 4.15 0.25 0.35 0.88 1.95 Each document is now represented as a real-valued vector of wf-idf weights ∈ R|V|. 36

Documents as vectors Each document is now represented as a real-valued vector of tf-idf weights ∈ R|V|. So we have a |V|-dimensional real-valued vector space. Terms are axes of the space. Documents are points or vectors in this space. Very high-dimensional: tens of millions of dimensions when you apply this to web search engines Each vector is very sparse - most entries are zero. 37

Queries as vectors Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space. Key idea 2: Rank documents according to their proximity to the query. proximity = similarity. proximity ≈ negative distance. Rank relevant documents higher than non-relevant documents. 38

How do we formalize vector space similarity? First cut: distance between two points ( = distance between the end points of the two vectors) Euclidean distance? Euclidean distance is a bad idea because Euclidean distance is large for vectors of different lengths. 39

Use angle instead of distance Rank documents according to angle with query Thought experiment: take a document d and append it to itself. Call this document d′. d′ is twice as long as d. d and d′ have the same content. The angle between the two documents is 0, corresponding to maximal similarity even though the Euclidean distance between the two documents can be quite large. The following two notions are equivalent. Rank documents according to the angle between query and document in decreasing order Rank documents according to cosine(query,document) in increasing order 40

Length normalization How do we compute the cosine? A vector can be (length-) normalized by dividing each of its components by its length – here we use the L2 norm: This maps vectors onto the unit sphere . . . . . . since after normalization: As a result, longer documents and shorter documents have weights of the same order of magnitude. Effect on the two documents d and d′ (d appended to itself) from earlier slide: they have identical vectors after length- normalization. 41

Cosine similarity between query and document qi is the tf-idf weight of term i in the query. di is the tf-idf weight of term i in the document. | | and | | are the lengths of and This is the cosine similarity of and . . . . . . or, equivalently, the cosine of the angle between and 42

Cosine for normalized vectors For normalized vectors, the cosine is equivalent to the dot product or scalar product. (if and are length-normalized). 43

Cosine similarity illustrated 44

Cosine: Example1 v(Doc1)•v(Doc2)=0.897*0.076+0.126*0.787=0.167 45

Cosine: Example2(N=1,000,000) v(q,Doc1)=0*2.3+0.34*0+0.52*2+0.78*6=5.72 Length[Doc1]=6.73, scores[Doc1]=0.85 v(q,Doc2)=0*4.6+0.34*1.3+0.52*4.0+0.78*0=2.52 Length[Doc2]=6.23, scores[Doc2]=0.40 v(q,Doc3)=0*2.3+0.34*2.6+0.52*10.0+0.78*6.0=10.76 Length[Doc3]=12.17, scores[Doc3]=0.88 Ranking=Doc3, Doc1, Doc2. 46

Computing the cosine score 47

Components of tf-idf weighting 48