The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN
An index is a data structure designed to make search (finding things) fast and efficient. Text search often requires an inverted index, a term that covers a class of similar data structures. The index is called inverted because it associates documents with words, rather than identifying words within or as part of documents.
Each index term or document feature is obtained during text transformation. A document feature is some property of the document expressed numerically. For example, a topical feature estimates the degree to which the document is about a particular topic. Example quality features include the inlink count and the number of days since the page was last updated.
Regardless of the ranking function, the model below provides a roadmap to implementation
Each index term is associated with an inverted list that may contain: a list of documents; a list of word occurrences in documents; word counts; positional information for each word; metadata identifying fields (title, author, etc.); and so on.
Each entry in an inverted list is called a posting. The part of the posting that refers to a specific document or location is called a pointer. Each document in the collection is given a unique number, and lists are usually document-ordered, that is, sorted by document number.
Assume each sentence is a separate document.
Inverted index for documents S1, S2, S3, and S4. It deduplicates word occurrences: each document appears at most once in a word's list. What does this data structure tell us?
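A minimal sketch of how such a document-only index might be built. The three-sentence collection and the name `build_inverted_index` are hypothetical illustrations, not the S1 to S4 sentences from the slides, and tokenization is assumed to be simple lowercasing and whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a document-ordered list of document numbers."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)  # a set drops repeated occurrences
    # Sort each list by document number (document-ordered).
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection; each sentence is treated as one document.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
    3: "tropical fish are popular aquarium fish",
}
index = build_inverted_index(docs)
print(index["fish"])      # documents containing "fish"
print(index["tropical"])  # documents containing "tropical"
```

An index of this kind can only answer whether a word occurs in a document; it records neither how often nor where.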
Inverted index with counts for documents S1, S2, S3, and S4. What does this data structure tell us?
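As a rough sketch, an index with counts can store each posting as a (document number, count) pair. The collection and the name `build_count_index` below are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def build_count_index(docs):
    """Map each term to a document-ordered list of (doc number, count) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting documents in order keeps lists sorted
        counts = Counter(docs[doc_id].lower().split())
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)

# Hypothetical toy collection; each sentence is treated as one document.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
    3: "tropical fish are popular aquarium fish",
}
count_index = build_count_index(docs)
print(count_index["fish"])
```

The counts let a ranking function distinguish a document that mentions a word once from one that mentions it repeatedly.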
Inverted index with word positions. What does this data structure tell us?
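One possible way to sketch a positional index is to emit one (document number, word position) posting per occurrence. Positions are assumed to start at 1, and the collection and the name `build_positional_index` are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to a list of (doc number, word position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()
        for pos, term in enumerate(tokens, start=1):  # 1-based positions
            index[term].append((doc_id, pos))
    return dict(index)

# Hypothetical toy collection; each sentence is treated as one document.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
    3: "tropical fish are popular aquarium fish",
}
pos_index = build_positional_index(docs)
print(pos_index["fish"])
```

Because every occurrence is recorded, the counts can still be recovered by counting postings per document, and the positions make phrase and window queries possible.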
Proximity matching is a technique used to match multiword phrases. It is also used to match words within a window of size n; e.g., a search for words within five words of “fish” (n = 5) matches “tropical fish”.
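The two uses of proximity matching can be sketched over positional postings as follows. The postings are hypothetical, the function names are illustrative, and the pairwise comparison is deliberately simple rather than efficient.

```python
from collections import defaultdict

def positions_by_doc(postings):
    """Group (doc number, position) postings into {doc number: [positions]}."""
    by_doc = defaultdict(list)
    for doc_id, pos in postings:
        by_doc[doc_id].append(pos)
    return by_doc

def within_window(postings_a, postings_b, n):
    """Documents where some occurrences of the two terms are at most n positions apart."""
    a, b = positions_by_doc(postings_a), positions_by_doc(postings_b)
    return sorted(d for d in a.keys() & b.keys()
                  if any(abs(pa - pb) <= n for pa in a[d] for pb in b[d]))

def phrase_match(postings_a, postings_b):
    """Documents where term b occurs immediately after term a."""
    a, b = positions_by_doc(postings_a), positions_by_doc(postings_b)
    return sorted(d for d in a.keys() & b.keys()
                  if any(pa + 1 in b[d] for pa in a[d]))

# Hypothetical positional postings: (doc number, word position).
tropical = [(1, 1), (1, 7), (3, 1)]
fish = [(1, 2), (1, 4), (2, 1), (3, 2), (3, 6)]
water = [(2, 4)]

phrase_docs = phrase_match(tropical, fish)    # "tropical fish" as a phrase
near_docs = within_window(fish, water, 5)     # "water" within 5 words of "fish"
far_docs = within_window(fish, water, 2)      # a tighter window of 2
```

A phrase match is the special case where the second word must appear exactly one position after the first; the window query relaxes that to any distance up to n in either direction.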
A document field is a section of a document with additional semantic meaning, e.g., date, from:, to:, or title, author, copyright, publisher, ISBN. Implementation options: use separate inverted lists for each field; add extra information about fields to postings; or use extent lists.
An extent is simply a contiguous region of a document (typically with special meaning). We can represent extents using word positions: an inverted list, called an extent list, records all extents for a given field. For example, “fish” occurs at word position 2 in document S1, and the title occurs at word positions 1 and 2, so that occurrence of “fish” falls within the title.
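A small sketch of how an extent list might be combined with positional postings to restrict a match to a field. It assumes each extent is a (document number, start, end) triple with both ends inclusive; that encoding, the data, and the name `term_in_field` are illustrative choices rather than a prescribed format.

```python
from collections import defaultdict

def term_in_field(term_postings, extents):
    """Keep only the (doc number, position) postings that fall inside a field extent."""
    spans = defaultdict(list)
    for doc_id, start, end in extents:
        spans[doc_id].append((start, end))
    return [(d, p) for d, p in term_postings
            if any(s <= p <= e for s, e in spans.get(d, []))]

# Hypothetical positional postings for "fish": (doc number, word position).
fish = [(1, 2), (1, 4), (2, 1), (3, 2)]
# Hypothetical title extents: (doc number, first word, last word), inclusive.
title_extents = [(1, 1, 2), (2, 5, 6)]

fish_in_title = term_in_field(fish, title_extents)
print(fish_in_title)
```

Here only the occurrence at position 2 of document 1 lies inside a title extent, so a query for “fish” in the title would match document 1 alone.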
Read and study Chapter 5. Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8.