Indexing and Searching

Indexing and Searching
Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto, Chapter 8

Outline
Inverted Files
Other Indices for Text
Sequential Searching
Pattern Matching
Compression

Inverted Files
An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
Structure: vocabulary and occurrences.
Block addressing: the text is divided into blocks, and the occurrences point to the blocks.
Full inverted indices: the occurrences record exact positions.

Inverted Files
The search algorithm on an inverted index has three steps: vocabulary search, retrieval of occurrences, and manipulation of occurrences.
Construction splits the index into two files.
Posting file: the lists of occurrences are stored contiguously.
Vocabulary: stored in lexicographical order, with each word pointing to its list.
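The structure and search steps above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not the construction described in the book: the function names `build_inverted_index` and `search`, the `\w+` tokenization, and the use of character offsets as occurrences are all assumptions made for the example.

```python
import re
from bisect import bisect_left

def build_inverted_index(text):
    """Return (vocabulary, postings) for a full inverted index.

    vocabulary: words in lexicographical order.
    postings:   postings[i] is the sorted list of character offsets
                at which vocabulary[i] occurs (the "occurrences").
    """
    occurrences = {}
    for match in re.finditer(r"\w+", text.lower()):
        occurrences.setdefault(match.group(), []).append(match.start())
    vocabulary = sorted(occurrences)
    postings = [occurrences[word] for word in vocabulary]
    return vocabulary, postings

def search(vocabulary, postings, word):
    """Vocabulary search (binary search), then retrieval of occurrences."""
    word = word.lower()
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return postings[i]
    return []

text = "to be or not to be"
vocabulary, postings = build_inverted_index(text)
print(vocabulary)                          # ['be', 'not', 'or', 'to']
print(search(vocabulary, postings, "to"))  # [0, 13]
```

For block addressing, the stored offsets would be replaced by block numbers, which shrinks the posting lists at the cost of scanning the matching blocks afterwards.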

Inverted Files
For large texts, partial indices are built and then merged.
Merging two indices consists of merging the sorted vocabularies.
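A merge of two partial indices might look like the following sketch. It assumes each partial index is a list of `(word, positions)` pairs sorted by word, and that the second index covers a later part of the text, so concatenating the two position lists of a shared word keeps them sorted. The name `merge_indices` and the sample data are hypothetical.

```python
def merge_indices(left, right):
    """Merge two partial indices given as word-sorted (word, positions) lists.

    Assumes every position in `right` is larger than every position in
    `left` (i.e. `right` indexes a later chunk of the text).
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_word, left_positions = left[i]
        right_word, right_positions = right[j]
        if left_word < right_word:
            merged.append((left_word, left_positions))
            i += 1
        elif left_word > right_word:
            merged.append((right_word, right_positions))
            j += 1
        else:
            # Same word in both vocabularies: join the occurrence lists.
            merged.append((left_word, left_positions + right_positions))
            i += 1
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

index_a = [("be", [3]), ("to", [0])]
index_b = [("be", [16]), ("not", [9]), ("to", [13])]
merged = merge_indices(index_a, index_b)
print(merged)  # [('be', [3, 16]), ('not', [9]), ('to', [0, 13])]
```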

Other Indices for Text
Suffix Trees
Suffix Arrays
Signature Files

Suffix Trees and Suffix Arrays
Each position in the text is considered as a text suffix.
Index points are selected from the text; they point to the beginning of the text positions that will be retrievable.

Suffix Arrays
The main drawback of suffix arrays is their costly construction process.
They allow binary searches, done by comparing the text contents at each pointer.
Supra-indices are used for large suffix arrays.
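The binary search over suffix pointers can be illustrated as follows. This is a sketch under simplifying assumptions: every character position is an index point, and the array is built by naively sorting the suffixes, which is far more expensive than the construction methods intended for large texts. The function names are made up for the example.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by suffix content."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of `pattern`, using two binary searches.

    Each comparison looks at the text content behind a pointer,
    truncated to the pattern length.
    """
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
print(suffix_array)                                  # [10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrences(text, suffix_array, "abra"))  # [0, 7]
```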

Construction of Suffix Arrays for Large Texts

Signature Files
Word-oriented index structures based on hashing.
A hash function maps words to bit masks (signatures) of B bits.
The text is divided into blocks of b words each.
The mask Bi of block i is obtained by bitwise ORing the signatures of all the words in the text block.
To search, hash the query word to a bit mask W; if W & Bi = W, the text block may contain the word.
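A small sketch of this scheme follows. The parameter values (`B = 64`, three bits set per word, blocks of four words) and the use of SHA-256 as the hash function are arbitrary assumptions chosen for the illustration, not values from the slides. Note that a matching block is only a candidate: false positives are possible, so the block still has to be checked.

```python
import hashlib

B = 64              # signature width in bits (assumed value)
BITS_PER_WORD = 3   # bits set per word signature (assumed value)
BLOCK_WORDS = 4     # b, words per text block (assumed value)

def word_signature(word):
    """Map a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask

def build_signatures(words):
    """OR together the word signatures of each block of BLOCK_WORDS words."""
    signatures = []
    for start in range(0, len(words), BLOCK_WORDS):
        mask = 0
        for word in words[start:start + BLOCK_WORDS]:
            mask |= word_signature(word)
        signatures.append(mask)
    return signatures

def candidate_blocks(signatures, word):
    """Blocks whose mask Bi satisfies W & Bi == W (may include false positives)."""
    w = word_signature(word)
    return [i for i, block_mask in enumerate(signatures) if (w & block_mask) == w]

words = "this is a text a text has many words words are made from letters".split()
signatures = build_signatures(words)
print(candidate_blocks(signatures, "many"))  # always includes block 1
```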

Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
Backward DAWG matching (BDM)
BNDM

Knuth-Morris-Pratt
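The slide carries only the title, so the following is a textbook-style sketch of Knuth-Morris-Pratt rather than a reproduction of the original figure. The names `failure_function` and `kmp_search` are chosen for the example.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for pos, char in enumerate(text):
        while k > 0 and char != pattern[k]:
            k = fail[k - 1]      # fall back without re-reading the text
        if char == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(pos - k + 1)
            k = fail[k - 1]      # keep going to find overlapping matches
    return matches

print(failure_function("abracadabra"))    # [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
print(kmp_search("abracadabra", "abra"))  # [0, 7]
```

The text pointer never moves backwards, which is what gives the linear worst-case scan.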

Boyer-Moore Family

Shift-Or
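As an illustration of the bit-parallel idea behind Shift-Or, here is a minimal sketch. It relies on Python's arbitrary-precision integers instead of a fixed machine word, so the usual restriction on pattern length does not show up; the function name `shift_or_search` is an assumption for the example.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text.

    Bit i of `state` is 0 exactly when pattern[:i+1] matches the text
    ending at the current position.
    """
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # masks[c] has bit i cleared wherever pattern[i] == c.
    masks = {}
    for i, char in enumerate(pattern):
        masks[char] = masks.get(char, all_ones) & ~(1 << i)

    state = all_ones
    matches = []
    for pos, char in enumerate(text):
        # Shift in a 0 (a new match may start here), then OR in the mask.
        state = ((state << 1) | masks.get(char, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            matches.append(pos - m + 1)
    return matches

print(shift_or_search("abracadabra", "abra"))  # [0, 7]
```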

Suffix Automaton

Pattern Matching
Searching allowing errors: Dynamic Programming, Automaton
Regular Expressions and Extended Patterns
Pattern Matching Using Indices: Inverted Files, Suffix Trees and Suffix Arrays

Dynamic Programming
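The slide carries only the title. A common way to search allowing errors with dynamic programming is to keep one column of the edit-distance matrix per text character, with the top cell fixed at 0 so that a match may start anywhere. The sketch below follows that standard formulation; the function name and the choice to report end positions are assumptions for the example.

```python
def approximate_search(text, pattern, k):
    """Return the text positions where an occurrence of `pattern`
    with at most k errors (insertions, deletions, substitutions) ends."""
    m = len(pattern)
    # column[i] = fewest errors to match pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, char in enumerate(text):
        diagonal = column[0]          # value of column[i-1] before this update
        for i in range(1, m + 1):
            if pattern[i - 1] == char:
                cost = diagonal
            else:
                cost = 1 + min(diagonal, column[i], column[i - 1])
            diagonal = column[i]
            column[i] = cost
        if column[m] <= k:
            ends.append(pos)
    return ends

print(approximate_search("surgery", "survey", 2))  # [4, 5, 6]
```

With `k = 0` the same routine reports the end positions of exact occurrences.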

Automaton

Regular Expressions

Pattern Matching Using Indices
Inverted Files
Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search.
The restriction is that approximate matches or regular expressions that span many words cannot be found.

Pattern Matching Using Indices
Suffix Trees
Suffix trees are able to perform complex searches:
Word, prefix, suffix, substring, and range queries
Regular expressions
Unrestricted approximate string matching
They are also useful in specific areas:
Find the longest substring
Find the most common substring of a fixed size

Pattern Matching Using Indices
Suffix Arrays
Some patterns can be searched directly in the suffix array without simulating the suffix tree:
Word, prefix, suffix, and subword search, and range search

Compression
Compressed text: Huffman coding, taking words as symbols and using an alphabet of bytes instead of bits.
Compressed indices: Inverted Files, Suffix Trees and Suffix Arrays, Signature Files.
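To make "taking words as symbols" concrete, the sketch below builds a Huffman code over the words of a text. It is a simplification: it produces classic binary (bit-oriented) codes, whereas the slide refers to a byte-oriented variant, and the function name `word_huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter

def word_huffman_codes(words):
    """Return {word: bit string}, a binary Huffman code with words as symbols."""
    freq = Counter(words)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tie-breaker, {word: partial code}).
    heap = [(count, i, {word: ""})
            for i, (word, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        count1, _, codes1 = heapq.heappop(heap)
        count2, _, codes2 = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {word: "0" + code for word, code in codes1.items()}
        merged.update({word: "1" + code for word, code in codes2.items()})
        heapq.heappush(heap, (count1 + count2, next_id, merged))
        next_id += 1
    return heap[0][2]

words = "to be or not to be".split()
codes = word_huffman_codes(words)
encoded = "".join(codes[word] for word in words)
print(codes)
print(len(encoded), "bits")
```

Frequent words receive codes no longer than rare ones, and no code is a prefix of another, so the encoded text can be decoded unambiguously.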