1
Indexing and Searching
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Chapter 8
2
Outline
  Inverted Files
  Other Indices for Text
  Sequential Searching
  Pattern Matching
  Compression
3
Inverted Files
An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
Structure: vocabulary and occurrences
Block addressing: the text is divided into blocks, and the occurrences point to the blocks.
Full inverted indices: the occurrences point to exact positions.
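The two addressing schemes can be contrasted with a minimal sketch. The function name `build_inverted_index`, the tokenizer (lowercased `\w+` runs), and the choice of character offsets and fixed-size character blocks are all assumptions made for illustration, not details from the slides.

```python
import re
from collections import defaultdict

def build_inverted_index(text, block_size=None):
    """Map each word (the vocabulary) to its list of occurrences.

    block_size=None -> occurrences are exact character offsets (full index).
    block_size=k    -> occurrences are block numbers, each block holding
                       k characters (block addressing).
    """
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        pos = match.start()
        entry = pos if block_size is None else pos // block_size
        postings = index[match.group()]
        # With block addressing a word is recorded once per block.
        if not postings or postings[-1] != entry:
            postings.append(entry)
    return dict(index)

text = "to be or not to be"
full_index = build_inverted_index(text)                  # exact positions
block_index = build_inverted_index(text, block_size=8)   # block numbers
```

Block addressing produces shorter occurrence lists, at the price of scanning the block to locate the exact position of a match.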
6
Inverted Files
The search algorithm on an inverted index:
  Vocabulary search
  Retrieval of occurrences
  Manipulation of occurrences
Construction (split the index into two files):
  Posting file: the lists of occurrences are stored contiguously.
  Vocabulary file: the vocabulary is stored in lexicographical order, and each word points to its list.
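The three search steps over the two-file layout can be sketched as follows. The toy data (`vocabulary`, `offsets`, `postings`, with document ids as occurrences) and the helper names `lookup` and `intersect` are hypothetical; an AND query is used as one example of manipulating occurrences.

```python
from bisect import bisect_left

# Hypothetical toy index. The vocabulary is kept in lexicographical order;
# word i owns the slice postings[offsets[i]:offsets[i + 1]].
vocabulary = ["be", "not", "or", "to"]
offsets = [0, 2, 3, 4, 6]
postings = [1, 3, 2, 1, 1, 2]   # document ids, sorted within each list

def lookup(word):
    """Steps 1 and 2: binary-search the vocabulary, then fetch the list."""
    i = bisect_left(vocabulary, word)
    if i == len(vocabulary) or vocabulary[i] != word:
        return []
    return postings[offsets[i]:offsets[i + 1]]

def intersect(a, b):
    """Step 3: merge two sorted lists to answer an AND query."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

both = intersect(lookup("to"), lookup("be"))
```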
8
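Inverted Files
For large texts: build partial indices and merge them.
Merging two indices consists of merging the sorted vocabularies and, whenever the same word appears in both, merging its two lists of occurrences.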
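A minimal sketch of such a merge, assuming each partial index is a list of `(word, occurrences)` pairs sorted by word, and that every position in the second index lies after every position in the first (as when the partial indices cover consecutive chunks of text). The name `merge_indices` and the sample data are hypothetical.

```python
def merge_indices(index_a, index_b):
    """Merge two partial indices given as sorted (word, occurrences) lists.

    Assumption: all positions in index_b come after all positions in
    index_a, so lists for a shared word can simply be concatenated.
    """
    merged, i, j = [], 0, 0
    while i < len(index_a) and j < len(index_b):
        word_a, list_a = index_a[i]
        word_b, list_b = index_b[j]
        if word_a == word_b:
            merged.append((word_a, list_a + list_b))
            i += 1
            j += 1
        elif word_a < word_b:
            merged.append((word_a, list_a))
            i += 1
        else:
            merged.append((word_b, list_b))
            j += 1
    merged.extend(index_a[i:])   # leftovers from either side
    merged.extend(index_b[j:])
    return merged

part1 = [("be", [3]), ("to", [0])]
part2 = [("be", [16]), ("not", [9]), ("to", [13])]
merged = merge_indices(part1, part2)
```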
10
Other Indices for Text
  Suffix Trees
  Suffix Arrays
  Signature Files
11
Suffix Trees and Suffix Arrays
Each position in the text is considered as a text suffix.
Index points are selected from the text; they point to the beginnings of the text positions that will be retrievable.
13
Suffix Arrays
The main drawback of suffix arrays is their costly construction process.
They allow binary searches, done by comparing the contents of each pointer.
Supra-indices (for large suffix arrays)
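The binary search over pointers can be illustrated with a small sketch. It assumes every character position is an index point and uses a naive sort-based construction, which is exactly the costly step noted above; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions by suffix content."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary-search the range of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:   # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:   # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "abra")
```

Each comparison follows a pointer into the text, so a search costs O(log n) probes of up to m characters each.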
16
Construction of Suffix Arrays for Large Texts
17
Signature Files
Word-oriented index structures based on hashing
  Maps words to bit masks of B bits
  Divides the text into blocks of b words each
  The mask of a text block is obtained by bitwise ORing the signatures of all the words in the block.
  Hash the query word to a bit mask W.
  If W & Bi = W, where Bi is the mask of block i, the text block may contain the word.
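The scheme can be sketched as below. The concrete values of `B`, `b`, and `BITS_PER_WORD`, and the use of SHA-256 to derive the bits, are assumptions chosen only to make the example runnable; the slides do not specify a hash function.

```python
import hashlib

B = 16              # bits per mask (assumed value)
BITS_PER_WORD = 3   # bits set by each word signature (assumed value)
b = 4               # words per text block (assumed value)

def signature(word):
    """Hash a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for byte in digest[:BITS_PER_WORD]:
        mask |= 1 << (byte % B)
    return mask

def build_signature_file(words):
    """One mask per block: the OR of the signatures of its words."""
    masks = []
    for start in range(0, len(words), b):
        mask = 0
        for word in words[start:start + b]:
            mask |= signature(word)
        masks.append(mask)
    return masks

def candidate_blocks(masks, word):
    """Blocks whose mask contains every bit of the query signature W."""
    W = signature(word)
    return [i for i, Bi in enumerate(masks) if W & Bi == W]

words = "the quick brown fox jumps over the lazy dog".split()
masks = build_signature_file(words)
candidates = candidate_blocks(masks, "lazy")
```

A block that contains the word is always reported, but a reported block may still be a false drop, so each candidate has to be verified against the text.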
19
Sequential Searching
  Brute Force
  Knuth-Morris-Pratt
  Boyer-Moore Family
  Shift-Or
  Suffix Automaton
  Backward DAWG matching (BDM)
  BNDM
20
Knuth-Morris-Pratt
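A minimal sketch of Knuth-Morris-Pratt, assuming the usual failure-function formulation; the function names are hypothetical and the slide itself gives no code.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        # On a mismatch, fall back in the pattern; the text pointer never moves back.
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches
```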
21
Boyer-Moore Family
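As one representative of the family, here is a sketch of the Horspool simplification, which keeps only a shift table indexed by the text character aligned with the last pattern position. Choosing Horspool rather than the full Boyer-Moore algorithm is an assumption made for brevity.

```python
def horspool_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Distance from the rightmost occurrence of each character (excluding
    # the last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            matches.append(pos)
        # Characters absent from the table allow a full shift of m.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```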
22
Shift-Or
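A sketch of Shift-Or using Python integers as bit vectors. In this convention a 0 bit at position i of the state means that the first i + 1 pattern characters match the text ending at the current position; the function name is hypothetical.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[c] has a 0 bit at position i exactly when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    matches = []
    for pos, ch in enumerate(text):
        # The shift brings in a 0 (empty prefix always matches); the OR
        # knocks out every prefix that ch does not extend.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(pos - m + 1)
    return matches
```

In a language with fixed-width integers the same loop handles patterns up to the machine word size with one shift and one OR per text character.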
23
Suffix Automaton
25
Pattern Matching
  Searching allowing errors
    Dynamic Programming
    Automaton
  Regular Expressions and Extended Patterns
  Pattern Matching Using Indices
    Inverted Files
    Suffix Trees and Suffix Arrays
26
Dynamic Programming
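A sketch of the classical dynamic-programming approach to searching allowing errors, assuming edit distance (insertions, deletions, substitutions) as the error model. It keeps a single column of the matrix, and the top cell stays 0 so that a match may begin at any text position. The function name and the sample strings are illustrative.

```python
def approximate_search(text, pattern, k):
    """Return the end positions in text where pattern matches with <= k errors."""
    m = len(pattern)
    # column[i] = fewest errors to match pattern[:i] against some text
    # substring ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        prev_diag = column[0]          # value of column[i - 1] before this step
        for i in range(1, m + 1):
            old = column[i]
            if pattern[i - 1] == ch:
                column[i] = prev_diag
            else:
                column[i] = 1 + min(prev_diag, old, column[i - 1])
            prev_diag = old
        if column[m] <= k:
            ends.append(pos)
    return ends

ends = approximate_search("surgery", "survey", 2)
```

The cost is O(mn) time and O(m) space for a pattern of length m and a text of length n.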
27
Automaton
28
Regular Expressions
29
Pattern Matching Using Indices
Inverted Files
Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary.
The restriction is that approximate matches or regular expressions that span many words are not found, because the index is word-oriented.
30
Pattern Matching Using Indices
Suffix Trees
Suffix trees are able to perform complex searches:
  Word, prefix, suffix, substring, and range queries
  Regular expressions
  Unrestricted approximate string matching
Useful in specific areas:
  Find the longest repeated substring
  Find the most common substring of a fixed size
31
Pattern Matching Using Indices
Suffix Arrays
Some patterns can be searched directly in the suffix array without simulating the suffix tree:
  Word, prefix, suffix, and subword search, and range search
32
Compression
Compressed text: Huffman coding
  Taking words as symbols
  Use an alphabet of bytes instead of bits
Compressed indices
  Inverted Files
  Suffix Trees and Suffix Arrays
  Signature Files
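Word-based Huffman coding can be sketched as follows. For brevity this is the binary (bit-oriented) variant; the byte-oriented variant mentioned above would combine up to 256 nodes at each step instead of two. The function name `huffman_codes` and the sample sentence are illustrative assumptions.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tiebreak = count()   # unique counter so the dicts are never compared
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly join the two least frequent subtrees.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

words = "to be or not to be that is the question".split()
codes = huffman_codes(words)
encoded = "".join(codes[w] for w in words)
```

Frequent words receive codes no longer than rare ones, and because the code is prefix-free the bit string can be decoded without separators.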