Representation of documents and queries
- Why do this?
  - Want to compare documents
  - Want to compare documents with queries
  - Want to retrieve and rank documents with regard to a specific query
- A document representation permits this in a consistent way (a type of conceptualization)
Boolean queries
- A document is relevant to a query if the query is in the document.
- A document is either relevant or not relevant to the query.
- What about relevance ranking, i.e. partial relevance? The vector model deals with this.
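
A minimal sketch of Boolean matching. It borrows the toy documents from the worked example later in these slides and assumes AND semantics over whitespace-separated words; the name `boolean_match` is illustrative only.

```python
# Toy collection, taken from the worked example later in these slides.
docs = {
    "D1": "to be or not to be",
    "D2": "to be here or to be there",
    "D3": "not to be",
    "D4": "to be forever here",
}


def boolean_match(query, text):
    """Return True when every query term occurs in the text.

    AND semantics over whitespace-separated words is an assumption made here.
    """
    return set(query.split()) <= set(text.split())


# Each document is simply in or out of the result set; there is no ranking.
relevant = [name for name, text in docs.items() if boolean_match("here", text)]
print(relevant)
```

The result is an unordered yes/no decision per document, which is the limitation that motivates the vector model.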
Matching - similarity
- Define methods of similarity and matching for documents and queries
- Use similarity and matching for ranking
Measures of similarity
- Retrieve the documents most similar to a query
- Equate similarity to relevance: the most similar documents are the most relevant
- This measure is one of “lexical similarity”: the matching of text or words
Document space
- Documents are organized in some manner: they exist as points in a document space
- Documents are treated as text, etc.
- Match the query with the documents
  - The query is similar to the document space, or
  - The query is not similar to the document space and becomes a characteristic function on the document space
- The documents most similar to the query are the ones we retrieve
- Reduce this to a computable measure of similarity
Query similar to document space
- The query is a point in document space
- Documents “near” the query are the ones we want
- Near:
  - Distance
  - Lying in a similar direction as other documents
  - Others
Document Clustering
[Figure: documents plotted against the axes Term 1 and Term 2.]
Documents in 3D Space
- Assumption: documents that are “close together” in space are similar in meaning.
Representation of Documents
- Consider now only text documents
- Words are the tokens (primitives)
  - Why not letters?
  - Stop words?
- How do we represent words?
- Even for video, audio, etc. documents, we often use words as part of the representation
Documents as Vectors
- Documents are represented as “bags of words”. Example?
- They are represented as vectors when used computationally
  - A vector is like an array of floating-point values
  - It has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
Vector Space Model
- Documents and queries are represented as vectors in term space
  - Terms are usually stems
  - Documents are represented by binary vectors of terms
- Queries are represented the same way as documents
- Query and document weights are based on the length and direction of their vectors
- A vector distance measure between the query and the documents is used to rank retrieved documents
The Vector-Space Model
- Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
- These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|
- Each term i in a document or query j is given a real-valued weight, $w_{ij}$.
- Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
The Vector-Space Model
- 3 terms, t1, t2, t3, for all documents
- Vectors can be written in different ways:
  - d1 = (weight of t1, weight of t2, weight of t3)
  - $d_1 = (w_1, w_2, w_3)$
  - $d_1 = w_1, w_2, w_3$
  - $d_1 = w_1 t_1 + w_2 t_2 + w_3 t_3$
Example - documents and queries
D1: to be or not to be
D2: to be here or to be there
D3: not to be
D4: to be forever here
Q1: here
Q2: to be
Definitions
- Documents vs. terms
- Treat documents and queries as the same: 4 docs and 2 queries => 6 rows
- Vocabulary in alphabetical order, dimension 7: be, forever, here, not, or, there, to => 7 columns
- 6 x 7 doc-term matrix
- 4 x 4 doc-doc matrix (exclude queries)
- 7 x 7 term-term matrix (exclude queries)
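
The definitions above can be sketched in a few lines. Splitting on whitespace with no stemming or stop-word removal is an assumption made for this toy example, and the names `texts`, `vocab`, `tf` and `binary` are illustrative only.

```python
from collections import Counter

texts = {
    "D1": "to be or not to be",
    "D2": "to be here or to be there",
    "D3": "not to be",
    "D4": "to be forever here",
    "Q1": "here",
    "Q2": "to be",
}

# Vocabulary: every distinct word, in alphabetical order (one column per term).
vocab = sorted({word for text in texts.values() for word in text.split()})


def tf_vector(text):
    """Raw term counts of one text, laid out in vocabulary order."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]


# One row per document or query: queries are treated just like documents.
tf = {name: tf_vector(text) for name, text in texts.items()}
binary = {name: [1 if f > 0 else 0 for f in row] for name, row in tf.items()}

print(vocab)
for name in texts:
    print(name, binary[name], tf[name])
```

Both matrices have 6 rows and 7 columns, matching the dimensions listed on the slide.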
Document Collection
- A collection of n documents can be represented in the vector space model by a term-document matrix.
- An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

        T1    T2    ...   Tt
D1      w11   w21   ...   wt1
D2      w12   w22   ...   wt2
:       :     :           :
Dn      w1n   w2n   ...   wtn

- Queries are treated just like documents!
Assigning Weights to Terms
- wij is the weight of term j in document i
- Binary weights
- Raw term frequency
- tf x idf
  - Deals with the Zipf distribution
  - Want to weight terms highly if they are frequent in relevant documents ... BUT infrequent in the collection as a whole
doc-terms matrix, binary wts
         be   forever  here  not  or   there  to
Doc 1    1    0        0     1    1    0      1
Doc 2    1    0        1     0    1    1      1
Doc 3    1    0        0     1    0    0      1
Doc 4    1    1        1     0    0    0      1
Q 1      0    0        1     0    0    0      0
Q 2      1    0        0     0    0    0      1
doc-doc matrix, wts = number of terms that overlap
         Doc 1  Doc 2  Doc 3  Doc 4
Doc 1    -      3      3      2
Doc 2    3      -      2      3
Doc 3    3      2      -      2
Doc 4    2      3      2      -
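term-term matrix, wts = number of documents containing both terms
         be   forever  here  not  or   there  to
be       -    1        2     2    2    1      4
forever  1    -        1     0    0    0      1
here     2    1        -     0    1    1      2
not      2    0        0     -    1    0      2
or       2    0        1     1    -    1      2
there    1    0        1     0    1    -      1
to       4    1        2     2    2    1      -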
doc-terms matrix, tf wts
         be   forever  here  not  or   there  to
Doc 1    2    0        0     1    1    0      2
Doc 2    2    0        1     0    1    1      2
Doc 3    1    0        0     1    0    0      1
Doc 4    1    1        1     0    0    0      1
Q 1      0    0        1     0    0    0      0
Q 2      1    0        0     0    0    0      1
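doc-doc matrix, wts = term frequency overlap (shared term occurrences)
         Doc 1  Doc 2  Doc 3  Doc 4
Doc 1    -      5      3      2
Doc 2    5      -      2      3
Doc 3    3      2      -      2
Doc 4    2      3      2      -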
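
A sketch of how the doc-doc and term-term matrices can be derived from the doc-term matrix. It assumes the doc-doc weight is the overlap, i.e. the sum over terms of the smaller of the two weights, which reproduces the values 3 and 2 (binary) and 5, 3, 2 (term frequency) that appear on the slides; an inner product would be another reasonable choice. The function name `overlap` is illustrative only.

```python
vocab = ["be", "forever", "here", "not", "or", "there", "to"]
tf = {
    "D1": [2, 0, 0, 1, 1, 0, 2],
    "D2": [2, 0, 1, 0, 1, 1, 2],
    "D3": [1, 0, 0, 1, 0, 0, 1],
    "D4": [1, 1, 1, 0, 0, 0, 1],
}
names = list(tf)
binary = {name: [min(f, 1) for f in row] for name, row in tf.items()}


def overlap(u, v):
    """Sum over terms of the smaller of the two weights (assumed definition)."""
    return sum(min(a, b) for a, b in zip(u, v))


# 4 x 4 doc-doc matrices; queries are excluded.
doc_doc_binary = [[overlap(binary[a], binary[b]) for b in names] for a in names]
doc_doc_tf = [[overlap(tf[a], tf[b]) for b in names] for a in names]

# 7 x 7 term-term matrix: number of documents that contain both terms.
size = len(vocab)
term_term = [
    [sum(binary[d][i] * binary[d][j] for d in names) for j in range(size)]
    for i in range(size)
]

for row in doc_doc_binary:
    print(row)
```

Unlike the tables, the code also fills the diagonal: a document compared with itself gives its number of distinct terms (binary) or its length in words (term frequency).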
Term Weights: Term Frequency
- More frequent terms in a document are more important, i.e. more indicative of the topic.
- $f_{ij}$ = frequency of term i in document j
- May want to normalize term frequency (tf) across the entire corpus: $tf_{ij} = f_{ij} / \max\{f_{ij}\}$
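
A sketch of the normalization, reading the slide literally so that the maximum is taken over the whole corpus. Dividing by the largest frequency within each document is a common alternative and is shown as a second, explicitly assumed variant.

```python
# Raw term frequencies for D1..D4 over (be, forever, here, not, or, there, to).
f = {
    "D1": [2, 0, 0, 1, 1, 0, 2],
    "D2": [2, 0, 1, 0, 1, 1, 2],
    "D3": [1, 0, 0, 1, 0, 0, 1],
    "D4": [1, 1, 1, 0, 0, 0, 1],
}

# Variant 1 (as stated on the slide): divide by the largest count in the corpus.
corpus_max = max(max(row) for row in f.values())
tf_corpus = {name: [x / corpus_max for x in row] for name, row in f.items()}

# Variant 2 (assumption, not from the slide): divide by each document's own maximum.
tf_per_doc = {name: [x / max(row) for x in row] for name, row in f.items()}

print(tf_corpus["D3"])
print(tf_per_doc["D3"])
```

The two variants differ for D3 and D4, whose largest count is 1 while the corpus maximum is 2.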
Assigning Weights
- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf): a way to deal with the problems of the Zipf distribution
- Goal: assign a tf x idf weight to each term in each document
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
- Many other ways of determining term weights have been proposed.
- Experimentally, tf-idf has been found to work well.
TF x IDF (term frequency-inverse document frequency)
$w_{ij} = tf_{ij} \, [\log_2(N/n_j) + 1]$
- $w_{ij}$ = weight of term $T_j$ in document $D_i$
- $tf_{ij}$ = frequency of term $T_j$ in document $D_i$
- $N$ = number of documents in the collection
- $n_j$ = number of documents where term $T_j$ occurs at least once
- The bracketed factor is the inverse document frequency measure $idf_j$
IDF (inverse document frequency)
- $idf_j = \log_2(N/n_j) + 1$
- $\log_2(N/n_j) + 1 = \log_2 N - \log_2 n_j + 1$
- Recall that $n_j$ is 1 or greater
- Moderates the effect of document size
Inverse Document Frequency (idf)
- idf provides high values for rare words and low values for common words
- Double the number of documents: what happens?
- For a collection of 10000 documents
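
A small sketch for exploring these questions with the formula $idf_j = \log_2(N/n_j) + 1$. The document frequencies 1, 100 and 10000 are hypothetical values chosen for illustration; they are not taken from the slides.

```python
import math


def idf(N, n):
    """Inverse document frequency: log2(N / n) + 1, with n >= 1."""
    return math.log2(N / n) + 1


# Rare versus common terms in a collection of 10000 documents.
for n in (1, 100, 10000):
    print(n, round(idf(10000, n), 2))

# Doubling N while a term's document frequency stays fixed adds exactly 1.
print(idf(20000, 100) - idf(10000, 100))

# Doubling N and the document frequency together leaves idf unchanged.
print(idf(20000, 200) - idf(10000, 100))
```

A term that occurs in every document gets idf 1, so under this formula its weight is simply its term frequency.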
Inverse Document Frequency
- $idf_j$ modifies only the columns, not the rows!
- $\log_2(N/n_j) + 1 = \log_2 N - \log_2 n_j + 1$
- Consider only the documents, not the queries! N = 4
tf-idf wt calculation, N = 4
         be   forever  here  not  or   there  to
Doc 1    2    0        0     1    1    0      2
Doc 2    2    0        1     0    1    1      2
Doc 3    1    0        0     1    0    0      1
Doc 4    1    1        1     0    0    0      1
n_j      4    1        2     2    2    1      4
N/n_j    1    4        2     2    2    4      1
idf_j    1    3        2     2    2    3      1
(Doc + queries) terms, tf-idf wts
         be   forever  here  not  or   there  to
Doc 1    2    0        0     2    2    0      2
Doc 2    2    0        2     0    2    3      2
Doc 3    1    0        0     2    0    0      1
Doc 4    1    3        2     0    0    0      1
Q 1      0    0        2     0    0    0      0
Q 2      1    0        0     0    0    0      1
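
The calculation above can be sketched as follows, using $w_{ij} = tf_{ij} \, [\log_2(N/n_j) + 1]$ with document frequencies counted over the four documents only. The variable names are illustrative; the sketch assumes every vocabulary term occurs in at least one document, otherwise the division by $n_j$ fails.

```python
import math

vocab = ["be", "forever", "here", "not", "or", "there", "to"]
tf = {
    "D1": [2, 0, 0, 1, 1, 0, 2],
    "D2": [2, 0, 1, 0, 1, 1, 2],
    "D3": [1, 0, 0, 1, 0, 0, 1],
    "D4": [1, 1, 1, 0, 0, 0, 1],
    "Q1": [0, 0, 1, 0, 0, 0, 0],
    "Q2": [1, 0, 0, 0, 0, 0, 1],
}
doc_names = ["D1", "D2", "D3", "D4"]  # queries do not count towards N or n_j

N = len(doc_names)
# n[j]: number of documents in which term j occurs at least once.
n = [sum(1 for d in doc_names if tf[d][j] > 0) for j in range(len(vocab))]
idf = [math.log2(N / n_j) + 1 for n_j in n]

# idf rescales each column; the same factors are applied to the query rows.
tfidf = {name: [f * w for f, w in zip(row, idf)] for name, row in tf.items()}

print(n)
print(idf)
for name, row in tfidf.items():
    print(name, row)
```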
Document Similarity
- With a query, what do we want to retrieve?
  - Relevant documents
  - Similar documents
  - Should the query be similar to the document?
- Innate concept: would you want a document without your query terms?
Similarity Measures
- Queries are treated like documents
- Documents are ranked by some measure of closeness to the query
- Closeness is determined by a similarity measure s
- Ranking is usually s(1) > s(2) > s(3)
Document Similarity
- Types of similarity:
  - Text
  - Content
  - Authors
  - Date of creation
  - Images
  - Etc.
Similarity Measure - Inner Product
- Similarity between the vectors for document $d_j$ and query $q$ can be computed as the vector inner product:
  $s = sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$
  where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in the query
- For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection).
- For weighted term vectors, it is the sum of the products of the weights of the matched terms.
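
A sketch of the inner product under the three weightings used in these slides. The idf factors are hard-coded from $\log_2(4/n_j) + 1$ for the example collection; the helper names `inner_product` and `scores` are illustrative only.

```python
tf = {
    "D1": [2, 0, 0, 1, 1, 0, 2],
    "D2": [2, 0, 1, 0, 1, 1, 2],
    "D3": [1, 0, 0, 1, 0, 0, 1],
    "D4": [1, 1, 1, 0, 0, 0, 1],
    "Q1": [0, 0, 1, 0, 0, 0, 0],
    "Q2": [1, 0, 0, 0, 0, 0, 1],
}
idf = [1, 3, 2, 2, 2, 3, 1]  # log2(4 / n_j) + 1 for be, forever, here, not, or, there, to

binary = {name: [min(f, 1) for f in row] for name, row in tf.items()}
tfidf = {name: [f * w for f, w in zip(row, idf)] for name, row in tf.items()}


def inner_product(u, v):
    """Sum of the products of matching components."""
    return sum(a * b for a, b in zip(u, v))


def scores(weights, query):
    """Inner product of one query with each of the four documents."""
    return [inner_product(weights[query], weights[d]) for d in ("D1", "D2", "D3", "D4")]


for label, weights in (("binary", binary), ("tf", tf), ("tfidf", tfidf)):
    print(label, "Q1", scores(weights, "Q1"), "Q2", scores(weights, "Q2"))
```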
Inner Product
For binary, tf and tfidf models
binary  doc1  doc2  doc3  doc4
Q1      0     1     0     1
Q2      2     2     2     2
Inner Product
For binary, tf and tfidf models
tf      doc1  doc2  doc3  doc4
Q1      0     1     0     1
Q2      4     4     2     2
Inner Product
For binary, tf and tfidf models
tfidf   doc1  doc2  doc3  doc4
Q1      0     4     0     4
Q2      4     4     2     2
Inner Product (summary)
D1: to be or not to be
D2: to be here or to be there
D3: not to be
D4: to be forever here
Q1: here
Q2: to be

binary  doc1  doc2  doc3  doc4
Q1      0     1     0     1
Q2      2     2     2     2

tf      doc1  doc2  doc3  doc4
Q1      0     1     0     1
Q2      4     4     2     2

tfidf   doc1  doc2  doc3  doc4
Q1      0     4     0     4
Q2      4     4     2     2
Properties of Inner Product
- The inner product is unbounded.
- It favors long documents with a large number of unique terms.
- It measures how many terms matched but not how many terms are not matched.
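
These properties can be seen on the toy example with raw term frequencies. The vector `d3_twice` is a hypothetical document made by repeating D3 twice; it is not part of the slides.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


# Term-frequency vectors over (be, forever, here, not, or, there, to).
q2 = [1, 0, 0, 0, 0, 0, 1]  # "to be"
d1 = [2, 0, 0, 1, 1, 0, 2]  # "to be or not to be"
d2 = [2, 0, 1, 0, 1, 1, 2]  # "to be here or to be there"
d3 = [1, 0, 0, 1, 0, 0, 1]  # "not to be"

# Hypothetical document: D3 repeated twice. Same content, twice the length.
d3_twice = [2 * f for f in d3]

# Unbounded and length-sensitive: repeating the text doubles the score.
print(inner_product(q2, d3), inner_product(q2, d3_twice))

# Unmatched terms are ignored: D2 has more non-query terms than D1, same score.
print(inner_product(q2, d1), inner_product(q2, d2))
```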
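Cosine Similarity Measure
[Figure: vectors D1, D2 and Q drawn in the term space with axes t1, t2, t3.]
- Cosine similarity measures the cosine of the angle between two vectors.
- It is the inner product normalized by the vector lengths.
$CosSim(d_j, q) = \frac{d_j \cdot q}{|d_j| \, |q|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2} \, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$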
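
A sketch of the cosine measure, i.e. the inner product divided by the product of the two vector lengths. Returning 0.0 for an all-zero vector is a convention assumed here; the example uses the raw term-frequency vectors of the toy collection.

```python
import math


def cosine(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_u * norm_v)


# Term-frequency vectors over (be, forever, here, not, or, there, to).
docs = {
    "D1": [2, 0, 0, 1, 1, 0, 2],
    "D2": [2, 0, 1, 0, 1, 1, 2],
    "D3": [1, 0, 0, 1, 0, 0, 1],
    "D4": [1, 1, 1, 0, 0, 0, 1],
}
q1 = [0, 0, 1, 0, 0, 0, 0]  # "here"
q2 = [1, 0, 0, 0, 0, 0, 1]  # "to be"

for name, vec in docs.items():
    print(name, round(cosine(q1, vec), 2), round(cosine(q2, vec), 2))
```

Scaling a vector by a positive constant leaves its cosine with any other vector unchanged, which removes the bias towards long documents.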
Binary wts, normalized
         be   forever  here  not  or   there  to
Doc 1    .5   0        0     .5   .5   0      .5
Doc 2    .45  0        .45   0    .45  .45    .45
Doc 3    .58  0        0     .58  0    0      .58
Doc 4    .5   .5       .5    0    0    0      .5
Q 1      0    0        1     0    0    0      0
Q 2      .71  0        0     0    0    0      .71
Cosine Measure
For binary, tf and tfidf models
binary  doc1  doc2  doc3  doc4
Q1      0     .45   0     .5
Q2      .71   .63   .82   .71
doc-terms tf wts, normalized
         be   forever  here  not  or   there  to
Doc 1    .63  0        0     .32  .32  0      .63
Doc 2    .60  0        .30   0    .30  .30    .60
Doc 3    .58  0        0     .58  0    0      .58
Doc 4    .5   .5       .5    0    0    0      .5
Q 1      0    0        1     0    0    0      0
Q 2      .71  0        0     0    0    0      .71
Cosine Measure
For binary, tf and tfidf models
tf      doc1  doc2  doc3  doc4
Q1      0     .30   0     .5
Q2      .89   .85   .82   .71
(Doc + queries) terms, tf-idf wts, normalized
         be   forever  here  not  or   there  to
Doc 1    .5   0        0     .5   .5   0      .5
Doc 2    .4   0        .4    0    .4   .6     .4
Doc 3    .41  0        0     .82  0    0      .41
Doc 4    .26  .77      .52   0    0    0      .26
Q 1      0    0        1     0    0    0      0
Q 2      .71  0        0     0    0    0      .71
Cosine Measure
For binary, tf and tfidf models
tfidf   doc1  doc2  doc3  doc4
Q1      0     .40   0     .52
Q2      .71   .57   .58   .37
Cosine Measure (summary)
D1: to be or not to be
D2: to be here or to be there
D3: not to be
D4: to be forever here
Q1: here
Q2: to be

bnry    d1    d2    d3    d4
Q1      0     .45   0     .5
Q2      .71   .63   .82   .71

tf      d1    d2    d3    d4
Q1      0     .30   0     .5
Q2      .89   .85   .82   .71

tfidf   d1    d2    d3    d4
Q1      0     .40   0     .52
Q2      .71   .57   .58   .37
Similarity Measures
- Simple matching (coordination level match)
- Dice’s Coefficient
- Jaccard’s Coefficient
- Cosine Coefficient
- Overlap Coefficient
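
A sketch of the five measures in their usual set-based (binary) forms, where X and Y are the sets of terms in two texts. The slides do not give these formulas, so the textbook definitions are assumed here.

```python
import math


def simple_matching(x, y):
    """Number of shared terms: |X & Y|."""
    return len(x & y)


def dice(x, y):
    """2 |X & Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    """|X & Y| / |X | Y|."""
    return len(x & y) / len(x | y)


def cosine_coefficient(x, y):
    """|X & Y| / sqrt(|X| * |Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))


def overlap_coefficient(x, y):
    """|X & Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))


q2 = set("to be".split())
d2 = set("to be here or to be there".split())

for measure in (simple_matching, dice, jaccard, cosine_coefficient, overlap_coefficient):
    print(measure.__name__, round(measure(q2, d2), 2))
```

All of these forms are symmetric in X and Y, and all except simple matching fall between 0 and 1.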
Properties of similarity or matching metrics
- s is the similarity measure
- Symmetric: s(Di, Dk) = s(Dk, Di)
- s is close to 1 if similar
- s is close to 0 if different
- Others?
Similarity Measures
- A similarity measure is a function which computes the degree of similarity between a pair of vectors or documents
  - Since queries and documents are both vectors, a similarity measure can represent the similarity between two documents, two queries, or one document and one query
- There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)
- With a similarity measure between query and documents:
  - it is possible to rank the retrieved documents in the order of presumed importance
  - it is possible to enforce a certain threshold so that the size of the retrieved set can be controlled
  - the results can be used to reformulate the original query in relevance feedback (e.g., combining a document vector with the query vector)
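
A sketch of ranking with a threshold, using cosine similarity over the tf-idf weights of the toy collection. The function name `rank` and the threshold value 0.1 are hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


# tf-idf weights over (be, forever, here, not, or, there, to).
docs = {
    "D1": [2, 0, 0, 2, 2, 0, 2],
    "D2": [2, 0, 2, 0, 2, 3, 2],
    "D3": [1, 0, 0, 2, 0, 0, 1],
    "D4": [1, 3, 2, 0, 0, 0, 1],
}
q1 = [0, 0, 2, 0, 0, 0, 0]  # "here"
q2 = [1, 0, 0, 0, 0, 0, 1]  # "to be"


def rank(query, docs, threshold=0.0):
    """Return (name, score) pairs with score >= threshold, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in docs.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print([name for name, _ in rank(q2, docs)])
print([(name, round(score, 2)) for name, score in rank(q1, docs, threshold=0.1)])
```

Raising the threshold shrinks the retrieved set; for Q1 a threshold of 0.1 drops the two documents that do not contain "here".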