Published by Anabel Thompson. Modified over 8 years ago.
1
IR 6 Scoring, term weighting and the vector space model
3
Term frequency and weighting
● Assign to each term in a document a weight that depends on the number of occurrences of the term in the document.
● TERM FREQUENCY (tf_{t,d}): the number of occurrences of term t in document d.
● BAG OF WORDS: only the number of occurrences of each term matters; the order of terms is not considered.
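A minimal sketch of raw term frequency under the bag-of-words view. The function name `term_frequencies` and the tokenization (lowercasing, splitting on whitespace) are assumptions made for illustration, not part of the slides.

```python
from collections import Counter


def term_frequencies(text):
    """Return tf_{t,d}: the raw count of each term t in the document text d."""
    # Assumed tokenization: lowercase, then split on whitespace.
    tokens = text.lower().split()
    return Counter(tokens)


tf = term_frequencies("the cat sat on the mat")
# Same words in a different order: the bag-of-words view ignores ordering.
tf_reordered = term_frequencies("mat the on sat cat the")
```

Because only counts are kept, `tf` and `tf_reordered` compare equal, and a term that never occurs has a count of zero.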
4
Inverse document frequency
● Collection frequency (cf_t): the total number of occurrences of term t in the collection.
● Document frequency (df_t): the number of documents in the collection that contain term t.
● N: the total number of documents in the collection.
● Inverse document frequency (idf_t): idf_t = log(N / df_t), so a rare term gets a high idf and a frequent term a low one.
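The three collection statistics can be sketched as follows, assuming the common definition idf_t = log(N / df_t). The base-10 logarithm, the whitespace tokenization, and the function name `collection_statistics` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def collection_statistics(docs):
    """Return (cf, df, idf) dictionaries for a list of document strings."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)  # N: total number of documents
    # cf_t: every occurrence counts, including repeats inside one document.
    cf = Counter(term for tokens in tokenized for term in tokens)
    # df_t: each document counts at most once per term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # idf_t = log10(N / df_t); the log base is an assumption.
    idf = {term: math.log10(n / count) for term, count in df.items()}
    return cf, df, idf


docs = ["the cat saw the cat", "the dog barked", "a bird sang", "a fish swam"]
cf, df, idf = collection_statistics(docs)
```

Here "cat" occurs twice in the collection but in only one document, so its collection frequency and document frequency differ, and its idf is higher than that of "the", which appears in two documents.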
5
Tf-idf weighting
● tf-idf_{t,d} = tf_{t,d} × idf_t
● This assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents;
2. lower when the term occurs fewer times in a document, or occurs in many documents;
3. lowest when the term occurs in virtually all documents.
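A small sketch of the product tf_{t,d} × idf_t that illustrates the three cases above. It assumes raw counts for tf, idf_t = log10(N / df_t), and whitespace tokenization; the function name `tf_idf` and the toy documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # tf-idf_{t,d} = tf_{t,d} * log10(N / df_t)
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights


docs = ["cat cat cat dog", "dog bird", "dog fish"]
weights = tf_idf(docs)
```

In this toy collection "cat" occurs three times in a single document and gets the highest weight, "bird" occurs once in a single document and gets a lower one, and "dog" occurs in every document, so its weight is zero everywhere.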
6
DOCUMENT VECTOR
● View each document as a vector with one component for each term in the dictionary.
● The weight of each component is given by tf-idf.
● For dictionary terms that do not occur in a document, this weight is zero.
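One way to turn per-document weights into vectors over a shared dictionary, as a sketch. The weight values below are made-up placeholders rather than computed results, and the helper name `to_vector` is hypothetical.

```python
def to_vector(weights, dictionary):
    """Lay out a {term: weight} mapping along the dictionary's term order."""
    # Terms absent from the document get a weight of zero.
    return [weights.get(term, 0.0) for term in dictionary]


# Placeholder weights for two documents (illustrative values only).
doc_weights = [{"cat": 1.43, "dog": 0.0}, {"bird": 0.48, "dog": 0.0}]

# The dictionary is the sorted set of all terms seen in the collection.
dictionary = sorted({term for w in doc_weights for term in w})
vectors = [to_vector(w, dictionary) for w in doc_weights]
```

Every vector has the same length as the dictionary, so components at the same position always refer to the same term.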
7
● overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d.
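The overlap score can be sketched in a few lines. Treating the query as a set of distinct terms and tokenizing on whitespace are assumptions; the function name `overlap_score` is chosen for illustration.

```python
from collections import Counter


def overlap_score(query, doc):
    """Sum, over the distinct query terms, of their occurrence counts in doc."""
    tf = Counter(doc.lower().split())
    # Assumption: a term repeated in the query is counted once.
    query_terms = set(query.lower().split())
    return sum(tf[term] for term in query_terms)


score = overlap_score("cat mat", "the cat sat on the mat with another cat")
no_match = overlap_score("dog", "the cat sat on the mat")
```

Note that this measure counts every occurrence equally; it does not discount terms that are common across the collection.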
8
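VECTOR SPACE MODEL
● The representation of a set of documents as vectors in a common vector space; used for operations ranging from
– scoring documents on a query
– document classification
– document clustering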
9
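COSINE SIMILARITY
● sim(d1, d2) = (V(d1) · V(d2)) / (|V(d1)| |V(d2)|)
● Dot product: V(d1) · V(d2) = Σ_i V_i(d1) V_i(d2)
● Euclidean length: |V(d)| = sqrt(Σ_i V_i(d)^2)
● Normalizing by length: dividing by the Euclidean lengths turns V(d1) and V(d2) into unit vectors, which compensates for differences in document length.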
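A sketch of the three ingredients named on this slide, assuming the standard definition of cosine similarity as the dot product divided by the product of the Euclidean lengths. The function names, and the choice to return 0.0 for a zero-length vector, are assumptions for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean_length(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(v, v))


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    length = euclidean_length(v)
    return [x / length for x in v] if length else list(v)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either has zero length)."""
    denominator = euclidean_length(u) * euclidean_length(v)
    return dot(u, v) / denominator if denominator else 0.0
```

Two vectors that point in the same direction, such as `[1, 2, 3]` and `[2, 4, 6]`, have similarity 1 regardless of their lengths, which is why a long document and a short one with the same term proportions are treated as equally similar.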
10
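Queries as vectors
● View the query as a vector in the same space as the documents.
● Assign to each document d a score equal to the dot product of the unit query vector and the unit document vector: score(q, d) = v(q) · v(d).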
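Scoring and ranking documents against a query might look like the sketch below, assuming both the query and the documents are length-normalized before the dot product is taken. The helper names `unit` and `rank_documents` and the toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale v to unit Euclidean length; a zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)


def rank_documents(query_vector, doc_vectors):
    """Return (doc_index, score) pairs sorted by descending score."""
    q = unit(query_vector)
    scored = []
    for index, doc in enumerate(doc_vectors):
        d = unit(doc)
        # Score = dot product of the unit query and unit document vectors.
        scored.append((index, sum(a * b for a, b in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors over a three-term dictionary (illustrative values only).
doc_vectors = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
ranked = rank_documents([1, 1, 0], doc_vectors)
```

The document whose vector points in the same direction as the query ranks first, the one sharing a single term ranks second, and the one sharing no terms scores zero.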