6. Implementation of Vector-Space Retrieval
These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.
Vector Space Retrieval (Naïve version)
Steps:
1. Convert all documents in collection D to tf-idf-weighted vectors, dj, over term vocabulary V.
2. Convert the query to a tf-idf-weighted vector q.
3. For each dj in D, compute the score sj = cosSim(dj, q).
4. Sort the documents by decreasing score.
5. Present the top-ranked documents to the user.
Time complexity: O(|V|·|D|), which is bad for large V and D!
Example: |V| = 10,000 and |D| = 100,000 gives |V|·|D| = 1,000,000,000.
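The naïve pipeline above can be sketched as follows. The class name, the toy vectors, and their weights are illustrative assumptions, not part of the slides; the point is the O(|V|·|D|) loop that scores every document against every vocabulary dimension.

```java
import java.util.*;

public class NaiveRetrieval {
    // Cosine similarity between two dense term-weight vectors over the same vocabulary V.
    static double cosSim(double[] d, double[] q) {
        double dot = 0, dLen = 0, qLen = 0;
        for (int i = 0; i < d.length; i++) {
            dot  += d[i] * q[i];
            dLen += d[i] * d[i];
            qLen += q[i] * q[i];
        }
        return (dLen == 0 || qLen == 0) ? 0 : dot / (Math.sqrt(dLen) * Math.sqrt(qLen));
    }

    public static void main(String[] args) {
        // Toy tf-idf-weighted document vectors over a 3-term vocabulary (assumed values).
        double[][] docs = { {1.0, 0.5, 0.0}, {0.0, 0.2, 0.9} };
        double[] query  = {1.0, 0.0, 0.0};
        // Score every document in D: the O(|V| * |D|) loop the slide warns about.
        double[] scores = new double[docs.length];
        for (int j = 0; j < docs.length; j++) scores[j] = cosSim(docs[j], query);
        System.out.println(scores[0] > scores[1]); // doc 0 shares the query's term
    }
}
```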
Implementation Based on Inverted Files
In practice, document vectors are not stored directly; an inverted file provides much better efficiency. The dictionary part can be implemented as a hash table, a sorted array, or a tree-based data structure (trie, B-tree). The critical issue is logarithmic or constant-time access to token information.
Inverted Index
[Figure: an inverted index. The index file (dictionary) lists the index terms system, computer, database, and science, each with its document frequency (df); each dictionary entry points to a postings list of (Dj, tfj) pairs, e.g. computer (df 3) → (D7, 4) and database → (D1, 3); the other postings shown are (D2, 4) and (D5, 2).]
Step 1: Indexing Documents
Skipping the algorithm itself, these are the data structures that store the index. Assume the dictionary (for all terms in the vocabulary) is stored in a HashMap, which maps a term to its inverse document frequency (IDF); it is called 'H' in the algorithm. Assume the posting lists (term-frequency information) are stored in a vector of HashMaps, each of which maps a documentID to the (raw) TF of the term in that document.
Note on Document Length
We also compute the length of every document and store these in a HashMap, which maps a documentID to the length; it is called 'DL' in the algorithm. Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens, and the weight of a token is TF * IDF. Therefore, we must wait until the IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
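A minimal sketch of the indexing step and the deferred length computation. The names H and DL come from the slides; the idf formula log(|D|/df), the helper names, and the toy documents are assumptions for illustration.

```java
import java.util.*;

public class Indexer {
    // H: term -> IDF (the dictionary of the slides); filled in only after all docs are seen.
    static Map<String, Double> H = new HashMap<>();
    // postings: term -> (documentID -> raw TF of the term in that document).
    static Map<String, Map<String, Integer>> postings = new HashMap<>();
    // DL: documentID -> document vector length; computable only once every IDF is known.
    static Map<String, Double> DL = new HashMap<>();

    static void indexDoc(String docId, String[] tokens) {
        for (String t : tokens)
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
    }

    static void finish(int numDocs) {
        // IDF = log(|D| / df): one common formulation, assumed here.
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet())
            H.put(e.getKey(), Math.log((double) numDocs / e.getValue().size()));
        // Length = sqrt(sum of squared TF*IDF weights); must run after the IDFs exist.
        Map<String, Double> sumSq = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            double idf = H.get(e.getKey());
            for (Map.Entry<String, Integer> p : e.getValue().entrySet()) {
                double w = p.getValue() * idf;
                sumSq.merge(p.getKey(), w * w, Double::sum);
            }
        }
        for (Map.Entry<String, Double> e : sumSq.entrySet())
            DL.put(e.getKey(), Math.sqrt(e.getValue()));
    }
}
```

Note that a term appearing in every document (df = |D|) gets IDF 0 under this formula, so it contributes nothing to any document length.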
Retrieval with an Inverted Index
Tokens that are not in both the query and the document do not affect cosine similarity: the product of their token weights is zero and does not contribute to the dot product. The query is usually fairly short, so its vector is extremely sparse. Use the inverted index to find the limited set of documents that contain at least one of the query words.
Inverted Query Retrieval Efficiency
Assume that, on average, a query word appears in B documents. Then retrieval time is O(|Q| * B), which is typically much better than naïve retrieval's examination of all N documents, O(|V| * N), because |Q| << |V| and B << N.
Q = q1 q2 … qn, where each query term qi retrieves its postings Di1 … DiB.
Step 2: Query as a Vector
Create a HashMapVector, Q, for the query: a sparse vector backed by a HashMap that maps each term in the query to its TF in the query.
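This step can be sketched as follows; a plain HashMap stands in for the HashMapVector class the slides assume, and the method name is illustrative.

```java
import java.util.*;

public class QueryVector {
    // Build Q: term -> raw TF of the term in the query.
    static Map<String, Integer> toVector(String[] queryTokens) {
        Map<String, Integer> q = new HashMap<>();
        for (String t : queryTokens) q.merge(t, 1, Integer::sum);
        return q;
    }

    public static void main(String[] args) {
        Map<String, Integer> q = toVector(new String[]{"best", "car", "insurance", "car"});
        System.out.println(q.get("car")); // 2
    }
}
```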
Step 3: Compute Cosine
Incrementally compute the cosine similarity of each indexed document as the query terms are processed one by one. To accumulate a total score for each retrieved document, store the retrieved documents in a HashMap (called 'R' in the algorithm), which maps documentIDs to computed cosine scores.
Inverted-Index Retrieval Algorithm
Create an empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let I be the IDF of T, and K be the frequency of T in Q;
    Set the weight of T in Q: W = K * I; (tf*idf for T in Q)
    Let L be the posting list of T from H;
    For each entry, O, in L:
        Let D be the document of O, and C be the frequency of O (tf of T in D);
        If D is not already in R (D was not previously retrieved),
            then add D to R and initialize its score to 0.0;
        Increment D's score by W * C * I; (tf*idf for T in Q times tf*idf for T in D)
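The accumulation loop above can be sketched as follows. The variable names (Q, H, R, T, W, C, I) follow the slides; the method signature and the toy data in the tests are assumptions.

```java
import java.util.*;

public class Accumulate {
    // One pass of the inverted-index retrieval loop: for each query term T,
    // walk T's postings list and add W * C * I to each document's running score.
    static Map<String, Double> retrieve(Map<String, Integer> Q,
                                        Map<String, Double> H,
                                        Map<String, Map<String, Integer>> postings) {
        Map<String, Double> R = new HashMap<>(); // documentID -> accumulated dot product
        for (Map.Entry<String, Integer> qt : Q.entrySet()) {
            String T = qt.getKey();
            Double I = H.get(T);
            if (I == null) continue;           // term does not occur in the collection
            double W = qt.getValue() * I;      // tf*idf of T in Q
            for (Map.Entry<String, Integer> o : postings.getOrDefault(T, Map.of()).entrySet()) {
                double C = o.getValue();       // tf of T in document D
                R.merge(o.getKey(), W * C * I, Double::sum); // tf*idf(Q) x tf*idf(D)
            }
        }
        return R;
    }
}
```

Only documents sharing at least one term with the query ever appear in R, which is exactly where the O(|Q| * B) cost comes from.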
Inverted-Index Retrieval Algorithm (cont.)
Compute the length, L, of the query vector Q (the square root of the sum of the squares of its weights).
For each retrieved document D in R:
    Let S be the current accumulated score of D; (S is the dot product of D and Q)
    Let Y be the length of D from DL;
    Set D's final score to S / (L * Y); (the cosine)
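The normalization step can be sketched as below; the method name and the numbers in the usage are illustrative assumptions.

```java
import java.util.*;

public class Normalize {
    // Final step: turn each accumulated dot product S into a cosine, S / (L * Y),
    // where L is the query length and Y = DL.get(documentID).
    static void normalize(Map<String, Double> R, double queryLength, Map<String, Double> DL) {
        for (Map.Entry<String, Double> e : R.entrySet())
            e.setValue(e.getValue() / (queryLength * DL.get(e.getKey())));
    }

    public static void main(String[] args) {
        Map<String, Double> R = new HashMap<>(Map.of("D1", 3.0));
        normalize(R, 1.5, Map.of("D1", 2.0)); // cosine = 3.0 / (1.5 * 2.0)
        System.out.println(R.get("D1"));      // 1.0
    }
}
```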
Steps 4 and 5
Sort the retrieved documents in R by final score and return them in an array. (ranked results)
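These last two steps amount to sorting R's entries by decreasing score; the sketch below uses an assumed method name.

```java
import java.util.*;

public class RankResults {
    // Steps 4 and 5: sort retrieved documentIDs by decreasing cosine score.
    static String[] rank(Map<String, Double> R) {
        return R.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] ranked = rank(Map.of("D1", 0.2, "D2", 0.9, "D3", 0.5));
        System.out.println(Arrays.toString(ranked)); // [D2, D3, D1]
    }
}
```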
Exercise
Apply the Inverted-Index Retrieval Algorithm to the following, and show the ranked results with cosine values for the query "best car insurance".