Slide 1: Vector Space Model for IR: Implementation Notes
CSC 575 Intelligent Information Retrieval
These notes are based, in part, on notes by Dr. Raymond J. Mooney at the University of Texas at Austin.

Slide 2: Implementation Issues for Ranking Systems
Recall the use of inverted files for the storage of index terms:
– split into a dictionary and a postings file
– the dictionary contains the terms, but will now also contain statistics about each term (e.g., the number of postings) and its IDF value
– the postings file contains the record ids and the weights for all occurrences of each term
[Figure: a dictionary entry with its document frequency (DocFreq) pointing to a postings list of (Doc, Freq) pairs.]

Slide 3: Implementation Issues for Ranking Systems
Four major options for storing weights in the postings file (summarized in the sketch below):
– Store the raw frequency: slow, but flexible (term weights can be changed without changing the index)
– Store the normalized frequency: the normalization is done during the creation of the final dictionary and postings files, and the normalized frequency is inserted into the postings file instead of the raw term frequency
– Store the completely weighted term: allows simple addition of term weights during the search, rather than first multiplying by the IDF of the term; very fast, but changing the index requires changing all postings because the IDF changes, and relevance feedback reweighting is difficult
– No within-record frequency: postings records do not store weights, so all processing is done at search time
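As a minimal summary sketch (this enum and its names are mine, not part of the course's VSR code), the four options amount to choosing what value a posting carries for a term t in document d:

public enum PostingsWeightOption {
    RAW_TF,         // store tf(t,d): slow but flexible; weighting is computed at search time
    NORMALIZED_TF,  // store normalized tf: normalization is folded into index creation
    FULL_WEIGHT,    // store tf(t,d) * idf(t): fastest search, but IDF changes force reindexing
    NO_WEIGHT       // store only record ids: all weighting is done at search time
}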

Slide 4: Implementation Issues for Ranking Systems
Searching the inverted file (the diagram illustrates the basic process):
– if a stem is found in the dictionary, the address of its postings list is returned along with the IDF and the number of postings
– in each posting, the record ids are read, and the total term weight for each record id is added to a unique accumulator for that id (set up as a hash table)
– depending on which option for storing weights was used, additional normalization or IDF computation may be necessary
– as each query term is processed, its postings cause further additions to the accumulators
[Figure: query parser → dictionary lookup (binary search) → get weights from postings file → accumulator → sort by weight; the stages pass along terms, dictionary entries, record numbers for the terms, and record numbers with total weights.]

Slide 5: VSR Implementation: Inverted Index
[Figure: the inverted index. The index file maps each index term (e.g., system, computer, database, science) and its document frequency (df) to a postings list of (Dj, tfj) pairs, such as (D2, 4), (D5, 2), (D1, 3), and (D7, 4).]

Slide 6: TokenInfo Class

import java.util.ArrayList;

public class TokenInfo {
    // Inverse document frequency of this token.
    public double idf;

    // A list of TokenOccurrences giving the documents where this token occurs.
    public ArrayList<TokenOccurrence> occList;

    // Create an initially empty data structure.
    public TokenInfo() {
        occList = new ArrayList<TokenOccurrence>();
        idf = 0.0;
    }
}

Slide 7: TokenOccurrence Class

// A lightweight object for storing information about an occurrence
// of a token (a.k.a. word, term) in a Document.
public class TokenOccurrence {
    // A reference to the Document where it occurs.
    public DocumentReference docRef = null;

    // The number of times it occurs in the Document.
    public int count = 0;

    // Create an occurrence with these values.
    public TokenOccurrence(DocumentReference docRef, int count) {
        this.docRef = docRef;
        this.count = count;
    }
}

Slide 8: DocumentReference Class

import java.io.File;

// A simple data structure for storing a reference to a document file
// that includes information on the length of its document vector.
// This is a lightweight object to store in the inverted index without
// storing an entire Document object.
public class DocumentReference {
    // The file where the referenced document is stored.
    public File file = null;

    // The length of the corresponding document vector.
    public double length = 0.0;

    public DocumentReference(File file, double length) {
        this.file = file;
        this.length = length;
    }
    ...
}

Slide 9: Representing a Document Vector
HashMapVector:
– A data structure for the term vector of a document, stored as a HashMap that maps tokens to the weight of that token in the document
– Needed as an efficient, indexed representation of sparse document vectors
Possible operations associated with a HashMapVector (a sketch follows):
– Increment the weight of a given token
– Multiply the vector by a constant factor
– Return the maximum weight of any token in the vector
– Others: copy, dot product, add, etc.
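A minimal sketch of such a class, based only on the operations listed above (the field and method names are assumptions, not necessarily those of the actual VSR source):

import java.util.HashMap;

public class HashMapVector {
    // Maps each token to its weight in this document.
    public HashMap<String, Double> hashMap = new HashMap<String, Double>();

    // Increment the weight of token by amount, inserting it if absent.
    public double increment(String token, double amount) {
        double newWeight = hashMap.getOrDefault(token, 0.0) + amount;
        hashMap.put(token, newWeight);
        return newWeight;
    }

    // Multiply every weight in the vector by a constant factor.
    public void multiply(double factor) {
        for (String token : hashMap.keySet())
            hashMap.put(token, hashMap.get(token) * factor);
    }

    // Return the maximum weight of any token (0.0 for an empty vector).
    public double maxWeight() {
        double max = 0.0;
        for (double w : hashMap.values())
            max = Math.max(max, w);
        return max;
    }
}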

Slide 10: Creating an Inverted Index
Create an empty HashMap, H (it will hold the dictionary);
For each document, D (i.e., file in an input directory):
    Create a HashMapVector, V, for D;
    For each token, T, in V:
        If T is not already in H, create an empty TokenInfo for T and insert it into H;
        Create a TokenOccurrence for T in D and add it to the occList in the TokenInfo for T;
Compute IDF for all tokens in H;
Compute vector lengths for all documents in H;
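A hedged Java sketch of this loop, reusing the classes above; the tokenizing helper hashMapVector(File) is hypothetical, and computeIDF and computeLengths are sketched on the following slides:

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public HashMap<String, TokenInfo> indexDirectory(File dir) {
    HashMap<String, TokenInfo> H = new HashMap<String, TokenInfo>();
    int N = 0;                                        // total number of documents
    for (File file : dir.listFiles()) {               // each document D
        N++;
        HashMapVector v = hashMapVector(file);        // hypothetical tokenizer: builds V for D
        DocumentReference docRef = new DocumentReference(file, 0.0);
        for (Map.Entry<String, Double> entry : v.hashMap.entrySet()) {
            String token = entry.getKey();            // token T
            int count = entry.getValue().intValue();  // raw tf of T in D
            TokenInfo info = H.get(token);
            if (info == null) {                       // T not yet in H
                info = new TokenInfo();
                H.put(token, info);
            }
            info.occList.add(new TokenOccurrence(docRef, count));
        }
    }
    computeIDF(H, N);       // second pass over the tokens (slide 11)
    computeLengths(H);      // then the document vector lengths (slide 13)
    return H;
}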

Slide 11: Computing IDF
Let N be the total number of documents;
For each token, T, in H:
    Determine the total number of documents, M, in which T occurs (the length of T's occList);
    Set the IDF for T to log(N/M);
Note that this requires a second pass through all the tokens after all documents have been indexed.
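In Java, this pass might look like the following (a sketch consistent with the classes above; the method name is mine, and it is assumed to live in the same class as the indexing sketch):

public void computeIDF(HashMap<String, TokenInfo> H, int N) {
    for (TokenInfo info : H.values()) {
        int M = info.occList.size();          // number of documents containing the token
        info.idf = Math.log((double) N / M);
    }
}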

Slide 12: Document Vector Length
Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens.
Remember that the weight of a token is TF * IDF.
Therefore, we must wait until the IDFs are known (and therefore until all documents are indexed) before document lengths can be determined.
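Written out, for a document D this says:

    |D| = sqrt( Σ_T ( TF(T,D) * IDF(T) )² )

where the sum runs over the tokens T occurring in D. Since IDF(T) = log(N/M) depends on the whole collection, |D| can only be computed after all documents have been indexed.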

Slide 13: Computing Document Lengths
Assume the lengths of all document vectors are initialized to 0.0;
For each token, T, in H:
    Let I be the IDF weight of T;
    For each TokenOccurrence of T in document D:
        Let C be the count of T in D;
        Increment the length of D by (I*C)²;
For each document D in H:
    Set the length of D to the square root of the currently stored length;
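A matching Java sketch (again, the naming is mine); because documents are reached only through occurrences, a set collects each DocumentReference so the square root is taken exactly once per document:

import java.util.HashSet;
import java.util.Set;

public void computeLengths(HashMap<String, TokenInfo> H) {
    Set<DocumentReference> docs = new HashSet<DocumentReference>();
    for (TokenInfo info : H.values()) {
        double I = info.idf;                         // IDF weight of T
        for (TokenOccurrence occ : info.occList) {
            double C = occ.count;                    // count of T in D
            occ.docRef.length += (I * C) * (I * C);  // accumulate squared weights
            docs.add(occ.docRef);
        }
    }
    for (DocumentReference docRef : docs)
        docRef.length = Math.sqrt(docRef.length);    // finalize |D|
}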

Slide 14: Time Complexity of Indexing
The complexity of creating the vector and indexing a document of n tokens is O(n), so indexing m such documents is O(m n).
Computing token IDFs for a vocabulary V is O(|V|).
Computing vector lengths is also O(m n).
Since |V| ≤ m n, the complete process is O(m n), which is also the complexity of just reading in the corpus.

Slide 15: Retrieval with an Inverted Index
Tokens that are not in both the query and the document do not affect the dot product in the computation of cosine similarity.
Usually the query is fairly short, and therefore its vector is extremely sparse.
Use the inverted index to find the limited set of documents that contain at least one of the query words.

Slide 16: Inverted Query Retrieval Efficiency
Assume that, on average, a query word appears in B documents. Then retrieval time is O(|Q| B), which is typically much better than naïve retrieval that examines all N documents, O(|V| N), because |Q| << |V| and B << N.
[Figure: each query word qi in Q = q1 q2 … qn points to its B postings Di1 … DiB.]
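As a purely illustrative calculation (the numbers are made up): with |Q| = 3 query words and B = 1,000 postings per word, only about 3,000 postings are touched, even if the collection holds N = 1,000,000 documents.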

Slide 17: Processing the Query
Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one.
To accumulate a total score for each retrieved document:
– store the retrieved documents in a hashtable
– the DocumentReference is the key and the partial accumulated score is the value

Slide 18: Inverted-Index Retrieval Algorithm
Create a HashMapVector, Q, for the query.
Create an empty HashMap, R, to store retrieved documents with scores.
For each token, T, in Q:
    Let I be the IDF of T, and K be the count of T in Q;
    Set the weight of T in Q: W = K * I;
    Let L be the list of TokenOccurrences of T from H;
    For each TokenOccurrence, O, in L:
        Let D be the document of O, and C be the count of O (the tf of T in D);
        If D is not already in R (D was not previously retrieved),
            then add D to R and initialize its score to 0.0;
        Increment D's score by W * I * C; (the product of T's weight in Q and in D)

Slide 19: Retrieval Algorithm (cont.)
Compute the length, L, of the vector Q (the square root of the sum of the squares of its weights).
For each retrieved document D in R:
    Let S be the current accumulated score of D; (S is the dot product of D and Q)
    Let Y be the length of D as stored in its DocumentReference;
    Normalize D's final score to S/(L * Y);
Sort the retrieved documents in R by final score and return the results in an array.
A Java sketch combining slides 18 and 19 follows.
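This sketch reuses the HashMapVector, TokenInfo, TokenOccurrence, and DocumentReference sketches above; the method and variable names are mine, and query tokens that occur in no document are simply skipped (their IDF is undefined):

import java.util.HashMap;
import java.util.Map;

public HashMap<DocumentReference, Double> retrieve(HashMapVector Q, HashMap<String, TokenInfo> H) {
    HashMap<DocumentReference, Double> R = new HashMap<DocumentReference, Double>();
    double qLengthSq = 0.0;                       // accumulates squared query weights
    for (Map.Entry<String, Double> e : Q.hashMap.entrySet()) {
        TokenInfo info = H.get(e.getKey());       // token T
        if (info == null) continue;               // T occurs in no document
        double K = e.getValue();                  // count of T in Q
        double W = K * info.idf;                  // weight of T in Q
        qLengthSq += W * W;
        for (TokenOccurrence occ : info.occList) {
            double C = occ.count;                 // tf of T in D
            Double s = R.get(occ.docRef);         // D's accumulated score so far
            R.put(occ.docRef, (s == null ? 0.0 : s) + W * info.idf * C);
        }
    }
    double L = Math.sqrt(qLengthSq);              // length of the query vector
    for (Map.Entry<DocumentReference, Double> e : R.entrySet()) {
        double Y = e.getKey().length;             // stored document vector length
        e.setValue(e.getValue() / (L * Y));       // cosine normalization
    }
    return R;                                     // the caller sorts entries by score
}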