Indexing and Searching. Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8
Outline: Inverted Files; Other Indices for Text; Sequential Searching; Pattern Matching; Compression
Inverted Files: An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. Structure: vocabulary and occurrences. Block addressing: the text is divided into blocks, and the occurrences point to the blocks. Full inverted indices: the occurrences point to exact positions.
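Block addressing can be illustrated with a small sketch. The function name `build_block_index`, the fixed block size in characters, and the rule that a word belongs to the block holding its first character are assumptions made for this example only.

```python
import re

def build_block_index(text, block_size):
    """Map each lowercased word to the sorted list of block numbers containing it."""
    index = {}
    for match in re.finditer(r"\w+", text.lower()):
        # Assumption: a word is assigned to the block of its first character.
        block = match.start() // block_size
        blocks = index.setdefault(match.group(), [])
        if not blocks or blocks[-1] != block:  # store each block only once per word
            blocks.append(block)
    return index

sample = "This is a text. A text has many words. Words are made from letters."
block_index = build_block_index(sample, block_size=16)
print(block_index["text"])   # [0, 1]
print(block_index["words"])  # [2]
```

Because the occurrences only name blocks, a query still has to scan the reported blocks sequentially to find the exact positions; that is the price paid for the smaller index.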
Inverted Files: The search algorithm on an inverted index has three steps: vocabulary search, retrieval of occurrences, and manipulation of occurrences. Construction splits the index into two files. Posting file: the lists of occurrences are stored contiguously. Vocabulary file: the vocabulary is stored in lexicographical order, and each word points to its list.
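A minimal sketch of a full inverted index, assuming the whole text fits in memory and that words are maximal runs of `\w` characters. The names `build_inverted_index` and `search` are illustrative, and positions are 0-based character offsets.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each lowercased word to the sorted list of offsets where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        occurrences[match.group()].append(match.start())
    # Keep the vocabulary in lexicographical order, each word pointing to its list.
    return dict(sorted(occurrences.items()))

def search(index, word):
    """Vocabulary search followed by retrieval of the word's occurrences."""
    return index.get(word.lower(), [])

text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)
print(search(index, "text"))   # [10, 18]
print(search(index, "Words"))  # [32, 39]
```

In an on-disk layout the dictionary keys would form the vocabulary file and the lists would be stored contiguously in the posting file; the dictionary here only stands in for that pair.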
Inverted Files: For large texts, partial indices are built and then merged. Merging two indices consists of merging the sorted vocabularies.
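The merge step can be sketched as a standard two-way merge of sorted lists. This example assumes each partial index is a list of `(word, occurrences)` pairs sorted by word, and that the second index covers a later part of the text, so concatenating the lists of a shared word keeps them sorted. The name `merge_indices` is hypothetical.

```python
def merge_indices(left, right):
    """Merge two partial indices given as lists of (word, occurrences) sorted by word."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_word, left_occ = left[i]
        right_word, right_occ = right[j]
        if left_word == right_word:
            # Same word in both indices: join the two occurrence lists.
            merged.append((left_word, left_occ + right_occ))
            i += 1
            j += 1
        elif left_word < right_word:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # leftover words from either side
    merged.extend(right[j:])
    return merged

first = [("a", [1]), ("text", [11, 19])]
second = [("many", [28]), ("text", [40])]
print(merge_indices(first, second))
# [('a', [1]), ('many', [28]), ('text', [11, 19, 40])]
```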
Other Indices for Text: Suffix Trees; Suffix Arrays; Signature Files
Suffix Trees and Suffix Arrays: Each position in the text is considered as a text suffix. Index points are selected from the text; they point to the beginning of the text positions that will be retrievable.
Suffix Arrays: The main drawback of suffix arrays is their costly construction process. They allow binary searches, done by comparing the contents of each pointer. Supra-indices are used for large suffix arrays.
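The binary search over a suffix array can be sketched as follows. The construction shown is the naive one (sorting all suffixes), chosen only for brevity, and every character position is treated as an index point; both choices are assumptions of this example, as are the function names.

```python
def build_suffix_array(text):
    """Naive construction: sort every suffix start by the suffix it points to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern using two binary searches."""
    m = len(pattern)

    def prefix(k):
        # The first m characters of the suffix that pointer k refers to.
        start = suffix_array[k]
        return text[start:start + m]

    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

word = "abracadabra"
sa = build_suffix_array(word)
print(find_occurrences(word, sa, "abra"))  # [0, 7]
```

All suffixes that start with the pattern are contiguous in the array, which is why two boundary searches are enough.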
Construction of Suffix Arrays for Large Texts
Signature Files: Word-oriented index structures based on hashing. A hash function maps words to bit masks of B bits. The text is divided into blocks of b words each. The mask of a block is obtained by bitwise ORing the signatures of all the words in the text block. To search, hash the query word to a bit mask W; if W & Bi = W, text block i may contain the word.
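The test W & Bi = W can be sketched in a few lines. The hash construction below (three bits per word derived from SHA-256), the default mask width, and all function names are assumptions for illustration; any hash that sets a few bits per word would serve.

```python
import hashlib

def word_signature(word, mask_bits=64, bits_set=3):
    """Hash a word to a mask of mask_bits bits with up to bits_set bits on."""
    signature = 0
    for k in range(bits_set):
        digest = hashlib.sha256(f"{k}:{word}".encode()).digest()
        signature |= 1 << (int.from_bytes(digest[:4], "big") % mask_bits)
    return signature

def build_signature_file(words, block_words, mask_bits=64):
    """OR together the signatures of the words in each block of block_words words."""
    masks = []
    for start in range(0, len(words), block_words):
        mask = 0
        for word in words[start:start + block_words]:
            mask |= word_signature(word, mask_bits)
        masks.append(mask)
    return masks

def candidate_blocks(masks, word, mask_bits=64):
    """Blocks whose mask contains every bit of the query signature."""
    w = word_signature(word, mask_bits)
    return [i for i, block_mask in enumerate(masks) if w & block_mask == w]

words = "this is a text a text has many words words are made from letters".split()
masks = build_signature_file(words, block_words=4)
print(candidate_blocks(masks, "text"))
```

A block that contains the word is always reported, but a reported block may be a false drop, so each candidate block still has to be checked by scanning its text.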
Sequential Searching: Brute Force; Knuth-Morris-Pratt; Boyer-Moore Family; Shift-Or; Suffix Automaton; Backward DAWG Matching (BDM); BNDM
Knuth-Morris-Pratt
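A minimal sketch of Knuth-Morris-Pratt, assuming the usual failure-function formulation: for each pattern prefix, store the length of its longest proper prefix that is also a suffix, and use it to avoid re-reading text characters after a mismatch. The function names are illustrative.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all (possibly overlapping) occurrences."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # fall back without moving backwards in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(failure_function("abracadabra"))    # [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
print(kmp_search("abracadabra", "abra"))  # [0, 7]
```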
Boyer-Moore Family
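As one member of the family, the Horspool simplification is short enough to sketch here. It is not the full Boyer-Moore algorithm: it compares the window right to left and then shifts using only the text character aligned with the last pattern position. The name `horspool_search` is illustrative.

```python
def horspool_search(text, pattern):
    """Boyer-Moore-Horspool: return the start positions of all occurrences."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the last occurrence of each character (excluding the final
    # pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:  # compare right to left
            j -= 1
        if j < 0:
            matches.append(pos)
        # Characters absent from the table allow a shift of the full pattern length.
        pos += shift.get(text[pos + m - 1], m)
    return matches

print(horspool_search("abracadabra", "abra"))  # [0, 7]
```

The longer the pattern and the larger the alphabet, the more text characters this kind of algorithm can skip without reading them.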
Shift-Or
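A minimal sketch of Shift-Or using bit-parallelism. Bit i of the state is 0 exactly when the first i+1 pattern characters match the text ending at the current position. Python integers are unbounded, so the machine-word limit on pattern length does not show up here; that simplification, and the function name, are assumptions of this example.

```python
def shift_or_search(text, pattern):
    """Shift-Or: return the start positions of all occurrences."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[ch] has a 0 in bit i exactly where pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for pos, ch in enumerate(text):
        # Shift in a 0 (a fresh match attempt), then OR in the mismatches for ch.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:   # highest bit clear: full match ends here
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))  # [0, 7]
```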
Suffix Automaton
Pattern Matching: Searching allowing errors: dynamic programming, automaton. Regular expressions and extended patterns. Pattern matching using indices: inverted files, suffix trees and suffix arrays.
Dynamic Programming
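Searching allowing errors by dynamic programming can be sketched with a single column of the edit-distance matrix. The sketch assumes unit-cost insertions, deletions, and substitutions, and a top row of zeros so that a match may start at any text position. It reports the text positions where an occurrence with at most k errors ends; the function name is illustrative.

```python
def approximate_search(text, pattern, k):
    """Return the 0-based text positions where a match with at most k errors ends."""
    m = len(pattern)
    # column[i] = fewest errors matching pattern[:i] against a text suffix ending here.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        diagonal = column[0]   # value of row i-1 in the previous column
        column[0] = 0          # a match may begin at any text position
        for i in range(1, m + 1):
            previous = column[i]            # row i in the previous column
            if pattern[i - 1] == ch:
                column[i] = diagonal
            else:
                column[i] = 1 + min(diagonal, previous, column[i - 1])
            diagonal = previous
        if column[m] <= k:
            ends.append(j)
    return ends

print(approximate_search("surgery", "survey", 2))  # [4, 5, 6]
```

The cost is proportional to the pattern length times the text length, while only one column of the matrix is kept in memory.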
Automaton
Regular Expressions
Pattern Matching Using Indices: Inverted Files. Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search. The restriction is that the index cannot efficiently find approximate matches or regular expressions that span many words.
Pattern Matching Using Indices: Suffix Trees. Suffix trees are able to perform complex searches: word, prefix, suffix, substring, and range queries; regular expressions; unrestricted approximate string matching. They are also useful in specific areas: finding the longest substring, and finding the most common substring of a fixed size.
Pattern Matching Using Indices: Suffix Arrays. Some patterns can be searched directly in the suffix array without simulating the suffix tree: word, prefix, suffix, and subword search, and range search.
Compression: Compressed text: Huffman coding, taking words as symbols and using an alphabet of bytes instead of bits. Compressed indices: inverted files, suffix trees and suffix arrays, signature files.
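Huffman coding with words as symbols can be sketched as follows. For brevity the sketch emits a binary (bit-oriented) code rather than the byte-oriented code the slide mentions, and it ignores separators between words; both are simplifications, and the function names are illustrative.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # tie-breaker so the heap never has to compare dictionaries
    heap = [(f, next(tie), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode(words, codes):
    return "".join(codes[w] for w in words)

def decode(bits, codes):
    """Decode by reading bits until they form a complete codeword (the code is prefix-free)."""
    inverse = {c: w for w, c in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    return decoded

sentence = "for each rose a rose is a rose".split()
codes = huffman_codes(sentence)
bits = encode(sentence, codes)
print(decode(bits, codes) == sentence)  # True
```

Frequent words receive short codewords, so the most frequent word in the sample ("rose") never gets a longer code than any other word.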