Indexing and Searching (File Structures)

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Presentation transcript:

Indexing and Searching (File Structures) Modern Information Retrieval (Chapter 8) With G. Navarro

File Structures: Inverted Files, Signature Files, PAT Trees, Sequential Searching, Compression

Inverted Files. Information Retrieval: Data Structures and Algorithms (Chapter 3), W.B. Frakes and R. Baeza-Yates (Eds.), 1992.

Inverted Files
Characteristics
- A word-oriented mechanism based on a sorted list of keywords, with each keyword having links to the documents containing that keyword.
Preprocessing
- Each document is assigned a list of keywords or attributes.
- Each keyword (attribute) is associated with relevance weights.

Inversion of Word List
1. The input text is parsed into a list of words along with their locations in the text (a time- and storage-consuming operation).
2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order.
3. Term weights are added, or the files are reorganized or compressed.
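
Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: words are runs of ASCII letters, locations are 0-based character offsets, and the function names `parse_words` and `invert` are made up for this sketch. Step 3 (weights, compression) is left out.

```python
import re
from collections import defaultdict

def parse_words(text):
    """Step 1: list of (word, location) pairs in location order."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"[A-Za-z]+", text)]

def invert(word_list):
    """Step 2: regroup by term and return the terms in alphabetical order."""
    postings = defaultdict(list)
    for word, pos in word_list:
        postings[word].append(pos)
    return sorted(postings.items())

text = "This is a text. A text has many words. Words are made from letters."
index = invert(parse_words(text))
for term, locations in index:
    print(term, locations)
```

Each entry of `index` pairs a term with the list of offsets where it occurs, which is the raw material for the vocabulary and posting file.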

Inversion of Word List

Structure and Construction
Structure (the index is split into two files)
- Vocabulary: $O(n^{\beta})$ space, according to Heaps’ Law
- Occurrences: size depends on the addressing granularity
Construction
- The vocabulary is stored in lexicographical order and points to the posting lists.
- Posting file: the lists of occurrences are stored contiguously.

Dictionary and Postings File (document #, frequency)

Vocabulary and Posting File

Structures used in Inverted Files
Vocabulary
- Sorted arrays
- Hashing structures
- Keyword trees: tries (digital search trees)
The Search Procedure
1. Vocabulary search
2. Retrieval of occurrences
3. Manipulation of occurrences
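
The three search steps can be sketched for the sorted-array variant of the vocabulary. The toy data below (`VOCABULARY`, `POSTINGS`, document ids 1 to 3) is hypothetical and only serves the illustration; hashing or a trie could replace the binary search.

```python
from bisect import bisect_left

# Hypothetical toy index: a sorted vocabulary and, in parallel,
# one sorted posting list of document ids per term.
VOCABULARY = ["letters", "made", "many", "text", "words"]
POSTINGS = [[3], [3], [2], [1, 2], [2, 3]]

def lookup(term):
    """Steps 1 and 2: binary search in the vocabulary, then fetch the posting list."""
    i = bisect_left(VOCABULARY, term)
    if i < len(VOCABULARY) and VOCABULARY[i] == term:
        return POSTINGS[i]
    return []

def intersect(a, b):
    """Step 3: merge two sorted posting lists to answer an AND query."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect(lookup("text"), lookup("words")))
```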

Size of an Inverted File
Block addressing: the text is divided into blocks, and the occurrences point to the blocks rather than to exact positions. This contrasts with full inverted indices, where exact occurrences are recorded.
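
A minimal sketch of block addressing, assuming fixed-size blocks measured in characters (the slide does not fix the block unit) and hypothetical helper names `block_index` and `find`. The index stores only block numbers, so exact positions have to be recovered by scanning the candidate blocks.

```python
import re
from collections import defaultdict

def block_index(text, block_size):
    """Map each word to the sorted list of block numbers in which it starts."""
    index = defaultdict(set)
    for m in re.finditer(r"[A-Za-z]+", text):
        index[m.group().lower()].add(m.start() // block_size)
    return {w: sorted(blocks) for w, blocks in index.items()}

def find(text, index, word, block_size):
    """Look up candidate blocks, then scan each one sequentially for exact positions."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    hits = []
    for b in index.get(word, []):
        start = b * block_size
        end = start + block_size
        # Scan slightly past the block end so a word that starts inside
        # the block is seen whole, together with the character after it.
        for m in pattern.finditer(text, start, end + len(word) + 1):
            if m.start() < end:
                hits.append(m.start())
    return hits

text = "This is a text. A text has many words. Words are made from letters."
idx = block_index(text, 16)
print(idx["text"], find(text, idx, "text", 16))
```

The index is smaller because several occurrences inside one block collapse into a single entry; the price is the sequential scan at query time.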

Cost
Advantage: easy to implement
Disadvantage: updating the index is expensive

Signature Files. Information Retrieval: Data Structures and Algorithms (Chapter 4), W.B. Frakes and R. Baeza-Yates (Eds.), Englewood Cliffs, NJ: Prentice Hall, 1992.

Signature Files
Characteristics
- Word-oriented index structures based on hashing
- Low overhead (10% to 20% over the text size), at the cost of forcing a sequential search over the index
- Suitable for texts that are not very large
- Inverted files outperform signature files for most applications

Construction and Search
Construction
- Word-oriented index structure based on hashing
- A hash function maps words to bit masks (signatures) of B bits
- The text is divided into blocks of b words each
- The mask of a block is obtained by bitwise ORing the signatures of all the words in the block
Search
- Hash the query word to a bit mask W
- If W & Bi = W, where Bi is the mask of block i, the text block may contain the word

Example
Text, divided into four blocks: This is a text. | A text has many | words. Words are | made from letters.
Block masks: 000101, 110101, 100100, 101101
Hash(text) = 000101
Hash(many) = 110000
Hash(words) = 100100
Hash(made) = 001100
Hash(letters) = 100001
Block 4: 001100 OR 100001 = 101101
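
The example can be replayed in code. In this sketch the word signatures are copied from the slide rather than computed by a hash function, and the assignment of words to blocks is an assumption inferred from the four block masks. It also shows why candidate blocks must be verified against the text.

```python
# Word signatures copied from the example; a real system would compute
# them with a hash function.
HASH = {"text": 0b000101, "many": 0b110000, "words": 0b100100,
        "made": 0b001100, "letters": 0b100001}

# Indexed words per block (assumed block boundaries, see the lead-in).
BLOCKS = [["text"], ["text", "many"], ["words", "words"], ["made", "letters"]]

def block_mask(words):
    """Bitwise OR of the signatures of all words in one block."""
    mask = 0
    for w in words:
        mask |= HASH[w]
    return mask

MASKS = [block_mask(b) for b in BLOCKS]

def candidate_blocks(word):
    """Blocks (numbered from 1) whose mask covers the word signature W."""
    w = HASH[word]
    return [i + 1 for i, mask in enumerate(MASKS) if (w & mask) == w]

def verified_blocks(word):
    """Drop false drops by checking the candidate blocks themselves."""
    return [i for i in candidate_blocks(word) if word in BLOCKS[i - 1]]

print([format(m, "06b") for m in MASKS])
print(candidate_blocks("text"), verified_blocks("text"))
```

Block 4 is reported as a candidate for "text" even though the word does not occur there: the bits of made and letters together happen to cover 000101. That is a false drop.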

False Drop
Assume that m bits are randomly set in the mask for each word, and let $\alpha = m/B$.
For b words, the probability that a given bit of the mask is set is $1 - (1 - 1/B)^{bm} \approx 1 - e^{-b\alpha}$.
Hence, the probability that the m random bits of a query word are also set (a false alarm) is $F_d = (1 - e^{-b\alpha})^{\alpha B}$.
$F_d$ is minimized for $\alpha = \ln(2)/b$, which gives $F_d = 2^{-m}$ with $m = B \ln 2 / b$.
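
The formula is easy to check numerically. The sketch below uses hypothetical values B = 64 bits and b = 8 words per block; the function names are illustrative.

```python
import math

def false_drop(B, b, m):
    """F_d = (1 - e^(-b*alpha))^(alpha*B), with alpha = m / B."""
    alpha = m / B
    return (1.0 - math.exp(-b * alpha)) ** (alpha * B)

def optimal_m(B, b):
    """Number of bits per word that minimizes F_d: m = B ln 2 / b."""
    return B * math.log(2) / b

B, b = 64, 8  # hypothetical mask size and block size
m = optimal_m(B, b)
print(m, false_drop(B, b, m), 2.0 ** (-m))
```

At the optimum the two printed probabilities agree, and moving m away from it in either direction makes the false-drop probability larger.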

Sequential Signature File (SSF)
Assume documents span exactly one logical block, so that the size of the document signature F equals the size of the block signature B.

Classification of Signature-Based Methods
Horizontal partitioning
- Grouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search.
Vertical partitioning
- Storing the signature matrix column-wise improves the response time at the expense of insertion time.

Classification of Signature-Based Methods
Vertical partitioning
- without compression: bit-sliced signature files (BSSF, B’SSF), frame-sliced (FSSF), generalized frame-sliced (GFSSF)
- with compression: compressed bit slices (CBS), doubly compressed bit slices (DCBS), no-false-drop method (NFD)

Classification of Signature-Based Methods
Sequential storage of the signature matrix
- without compression: sequential signature files (SSF)
- with compression: bit-block compression (BC), variable bit-block compression (VBC)
Horizontal partitioning
- data-independent partitioning: Gustafson’s method, partitioned signature files
- data-dependent partitioning: 2-level signature files, S-trees

Criteria
- The storage overhead
- The response time on single-word queries
- The performance on insertion, and whether insertion maintains the “append-only” property

Vertical Partitioning
Idea
- Avoid bringing useless portions of the document signature into main memory
Methods
- Store the signature file in a bit-sliced form or in a frame-sliced form
- Store the signature matrix column-wise to improve the response time at the expense of insertion time

Bit-Sliced Signature Files (BSSF)
Transposed bit matrix
[Figure: the signature matrix, one row per document signature, and its transpose]

F bit-files
Search:
(1) Retrieve m bit-files. E.g., the word signature of “free” is 001 000 110 010; if a document contains “free”, its 3rd, 7th, 8th, and 11th bits are set, i.e., only the 3rd, 7th, 8th, and 11th bit-files are examined.
(2) AND these vectors. The 1s in the resulting N-bit vector denote the qualifying logical blocks (documents).
(3) Retrieve the text file through the pointer file.
Insertion: requires F disk accesses for a new logical block (document), one for each bit-file, but no rewriting.
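
Search steps (1) and (2) can be sketched with bit strings. The signature of "free" is the one from the slide; the three document signatures are hypothetical, and bit-files are held in memory as strings rather than as files on disk.

```python
def bit_slices(signatures, F):
    """Transpose N document signatures (F-bit strings) into F bit-files of N bits each."""
    return ["".join(sig[j] for sig in signatures) for j in range(F)]

def search(slices, word_signature):
    """AND the bit-files selected by the 1-bits of the word signature."""
    n_docs = len(slices[0])
    result = [True] * n_docs
    for j, bit in enumerate(word_signature):
        if bit == "1":  # only these bit-files are read
            result = [r and s == "1" for r, s in zip(result, slices[j])]
    return [d for d, ok in enumerate(result) if ok]

FREE = "001000110010"              # word signature from the slide
DOCS = ["011000110110",            # hypothetical document signatures
        "110100001001",
        "001000110010"]
SLICES = bit_slices(DOCS, 12)
print(search(SLICES, FREE))
```

Only the bit-files for positions 3, 7, 8 and 11 are touched; the returned document numbers (0-based here) are candidates that step (3) would fetch through the pointer file.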

Frame-Sliced Signature File (FSSF)
Ideas
- Random disk accesses are more expensive than sequential ones.
- Force each word to hash into bit positions that are close to each other in the document signature.
- These bit-files are stored together and can be retrieved with a few random accesses.
Procedures
- The document signature (F bits long) is divided into k frames of s consecutive bits each.
- For each word in the document, one of the k frames is chosen by a hash function.
- Using another hash function, the word sets m bits in that frame.

Frame-Sliced Signature File (Cont.)
[Figure: signature matrix, documents by frames]
Each frame will be kept in consecutive disk blocks.

FSSF (Continued)
Example (n = 2 words, signature length F = 12 bits, s = 6 bits per frame, k = 2 frames, m = 3 bits per word)
Word            Signature
free            000000 110010
text            010110 000000
doc. signature  010110 110010
Search
- Only one frame has to be retrieved for a single-word query, i.e., only one random disk access is required.
- E.g., to search for documents that contain the word “free”: because the word signature of “free” is placed in the 2nd frame, only the 2nd frame has to be examined.
- At most one frame per query word has to be scanned for a multi-word query.
Insertion
- Only k frames have to be accessed instead of F bit-slices.
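
A rough sketch of the frame-sliced procedure follows. The slides do not specify the two hash functions, so the SHA-256-based helper `_h` is purely an assumption, and the resulting signatures will not match the bit patterns in the example above. Because the m hashed positions may collide, a word sets up to m bits.

```python
import hashlib

def _h(word, salt):
    """Hypothetical deterministic hash (not the one used on the slides)."""
    digest = hashlib.sha256((salt + word).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def word_signature(word, k, s, m):
    """Return (frame number, s-bit mask inside that frame) for a word."""
    frame = _h(word, "frame:") % k            # first hash: choose a frame
    mask = 0
    for i in range(m):                        # second hash: set up to m bits
        mask |= 1 << (_h(word, "bit%d:" % i) % s)
    return frame, mask

def document_signature(words, k, s, m):
    """OR every word mask into the frame it hashes to."""
    frames = [0] * k
    for w in words:
        f, mask = word_signature(w, k, s, m)
        frames[f] |= mask
    return frames

def may_contain(frames, word, k, s, m):
    """Single-word query: only the word's own frame is examined."""
    f, mask = word_signature(word, k, s, m)
    return (frames[f] & mask) == mask

sig = document_signature(["free", "text"], k=2, s=6, m=3)
print(sig, may_contain(sig, "free", 2, 6, 3))
```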

Horizontal Partitioning
1. Goal: group the signatures into sets, partitioning the signature matrix horizontally.
2. Grouping criterion
[Figure: signature matrix; the rows are documents]

Partitioned Signature Files
- A portion of a document signature is used as a signature key to partition the signature file.
- All signatures with the same key are grouped into a so-called “module”.
- When a query signature arrives:
  1. examine its signature key and look for the corresponding modules;
  2. scan all the signatures within the modules that have been selected.

Suffix Trees

Suffix Trees and Suffix Arrays
- Each position in the text is considered as a text suffix.
- Index points are selected from the text; they point to the beginnings of the text positions that will be retrievable.

Suffix Arrays
- The main drawback of suffix arrays is their costly construction process.
- They allow binary searches, done by comparing the contents of each pointer.
- Supra-indices (for large suffix arrays)
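
A small sketch of a suffix array with binary search. For simplicity it takes every character position as an index point and builds the array by naive sorting, which is exactly the costly step mentioned above; practical construction algorithms and supra-indices are not shown.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by the suffix they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, pattern):
    """Return the sorted positions where pattern occurs, using two binary searches."""
    n, m = len(sa), len(pattern)
    lo, hi = 0, n
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = n
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "this is a text. a text has many words. words are made from letters."
sa = build_suffix_array(text)
print(search(text, sa, "text"), search(text, sa, "wor"))
```

Every comparison dereferences a pointer into the text, which is what "comparing the contents of each pointer" refers to; prefix queries such as "wor" come for free.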

Construction of Suffix Arrays for Large Texts

Sequential Searching

Algorithms: Brute Force, Knuth-Morris-Pratt, Boyer-Moore Family, Shift-Or, Suffix Automaton

Knuth-Morris-Pratt

Boyer-Moore Family

Shift-Or

Suffix Automaton

Pattern Matching

Algorithms
Searching allowing errors
- Dynamic Programming
- Automaton
Regular Expressions and Extended Patterns
Pattern Matching Using Indices
- Inverted Files
- Suffix Trees and Suffix Arrays
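
As an illustration of searching allowing errors by dynamic programming, here is a sketch of the classical column-by-column edit-distance recurrence, with the top row fixed at 0 so that a match may start anywhere in the text. The function name and the choice to report end positions are assumptions of this sketch.

```python
def approx_search(pattern, text, k):
    """Return the end positions (exclusive) in text where pattern matches with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))      # column for the empty text prefix: col[i] = i
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]        # col[0] stays 0: a match can start at any position
        for i in range(1, m + 1):
            old = col[i]
            if pattern[i - 1] == ch:
                col[i] = prev_diag
            else:
                # substitution, insertion in the text, deletion from the text
                col[i] = 1 + min(prev_diag, col[i], col[i - 1])
            prev_diag = old
        if col[m] <= k:
            ends.append(j + 1)
    return ends

print(approx_search("survey", "surgery", 2))
```

Only one column of the matrix is kept, so the space is proportional to the pattern length while the time is proportional to pattern length times text length.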

Dynamic Programming

Automaton

Regular Expressions

Pattern Matching Using Indices: Inverted Files
- Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary.
- The restriction is that approximate matches or regular expressions that span many words cannot be found this way.

Pattern Matching Using Indices: Suffix Trees
Suffix trees are able to perform complex searches
- Word, prefix, suffix, substring, and range queries
- Regular expressions
- Unrestricted approximate string matching
Useful in specific areas
- Find the longest substring
- Find the most common substring of a fixed size

Pattern Matching Using Indices: Suffix Arrays
- Some patterns can be searched directly in the suffix array without simulating the suffix tree.
- Word, prefix, suffix, subword, and range searches

Compression
Compressed text: Huffman coding
- Taking words as symbols
- Use an alphabet of bytes instead of bits
Compressed indices
- Inverted Files
- Suffix Trees and Suffix Arrays
- Signature Files
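
Huffman coding with words as symbols can be sketched as follows. This is a simplified illustration: it produces a binary (bit-oriented) code rather than the byte-oriented code mentioned above, it ignores separators, and the name `huffman_codes` is made up for the sketch.

```python
import heapq
import re
from collections import Counter

def huffman_codes(symbols):
    """Build a binary Huffman code for a sequence of symbols (here: words)."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

text = "This is a text. A text has many words. Words are made from letters."
words = re.findall(r"[a-z]+", text.lower())
codes = huffman_codes(words)
encoded = "".join(codes[w] for w in words)
print(len(encoded), "bits for", len(words), "words")
```

Frequent words receive codes no longer than those of rarer words, and because the code is prefix-free the bit string can be decoded unambiguously given the code table.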