Information Retrieval CSE 8337 Spring 2005 Indexing and Searching Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates.

Slides:



Advertisements
Similar presentations
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Advertisements

Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Modern information retrieval Chapter 8 – Indexing and Searching.
Modern Information Retrieval Chapter 8 Indexing and Searching.
CSC1016 Coursework Clarification Derek Mortimer March 2010.
Data Structures Hash Tables
Modern Information Retrieval
BTrees & Bitmap Indexes
1 Indexing and Searching (File Structures) Modern Information Retrieval (C hapter 8) With G. Navarro.
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
2010/3/81 Lecture 8 on Physical Database DBMS has a view of the database as a collection of stored records, and that view is supported by the file manager.
1 CS 430: Information Discovery Lecture 4 Data Structures for Information Retrieval.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Quick Review of material covered Apr 8 B+-Tree Overview and some definitions –balanced tree –multi-level –reorganizes itself on insertion and deletion.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Indexing and Searching
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
1 CS 728 Advanced Database Systems Chapter 17 Database File Indexing Techniques, B- Trees, and B + -Trees.
Indexing and Hashing (emphasis on B+ trees) By Huy Nguyen Cs157b TR Lee, Sin-Min.
CHP - 9 File Structures. INTRODUCTION In some of the previous chapters, we have discussed representations of and operations on data structures. These.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
ICS 220 – Data Structures and Algorithms Week 7 Dr. Ken Cosh.
Spring 2006 Copyright (c) All rights reserved Leonard Wesley0 B-Trees CMPE126 Data Structures.
Chapter. 8: Indexing and Searching Sections: 8.1 Introduction, 8.2 Inverted Files 9/13/ Dr. Almetwally Mostafa.
Signature files. Signature Files Important alternative to inverted indexes. Given a document, the signature is calculated as follows. - First, each word.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing 
March 16 & 21, Csci 2111: Data and File Structures Week 9, Lectures 1 & 2 Indexed Sequential File Access and Prefix B+ Trees.
1 Index Structures. 2 Chapter : Objectives Types of Single-level Ordered Indexes Primary Indexes Clustering Indexes Secondary Indexes Multilevel Indexes.
CS 430: Information Discovery
Introduction n How to retrieval information? n A simple alternative is to search the whole text sequentially n Another option is to build data structures.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
CSC 211 Data Structures Lecture 13
© 2010 Pearson Addison-Wesley. All rights reserved. Addison Wesley is an imprint of CHAPTER 12: Multi-way Search Trees Java Software Structures: Designing.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
Chapter 5: Hashing Part I - Hash Tables. Hashing  What is Hashing?  Direct Access Tables  Hash Tables 2.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Spring 2003 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
1 Multi-Level Indexing and B-Trees. 2 Statement of the Problem When indexes grow too large they have to be stored on secondary storage. However, there.
CPSC 252 Hashing Page 1 Hashing We have already seen that we can search for a key item in an array using either linear or binary search. It would be better.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
Spring 2004 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
CS4432: Database Systems II
8/3/2007CMSC 341 BTrees1 CMSC 341 B- Trees D. Frey with apologies to Tom Anastasio.
Chapter 5 Ranking with Indexes. Indexes and Ranking n Indexes are designed to support search  Faster response time, supports updates n Text search engines.
Why indexing? For efficient searching of a document
CHP - 9 File Structures.
Indexing and hashing.
Multiway Search Trees Data may not fit into main memory
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Database System Implementation CSE 507
Database Management Systems (CS 564)
CS 430: Information Discovery
Indexing and Hashing Basic Concepts Ordered Indices
Database Design and Programming
Information Retrieval B
Indexing and Searching
CSE 326: Data Structures Lecture #14
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

Information Retrieval CSE 8337 Spring 2005 Indexing and Searching Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto Prof. Mario Nascimento of University of Alberta

CSE 8337 Spring Indexing and Searching TOC Indexing techniques: Inverted files Suffix arrays Signature files Technique used to search each type of index Other searching techniques

CSE 8337 Spring Motivation Just like in traditional RDBMSs searching for data may be costly In a RDB one can take (a lot of) advantage from the well defined structure of (and constraints on) the data Linear scan of the data is not feasible for non-trivial datasets (real life) Indices are not optional in IR (not meaning that they are in RDBMS)

CSE 8337 Spring Motivation Traditional indices, e.g., B-trees, are not well suited for IR Main approaches: Inverted files (or lists) Suffix arrays Signature files

CSE 8337 Spring Inverted Files There are two main elements: vocabulary – set of unique terms Occurrences – where those terms appear The occurrences can be recorded as terms or byte offsets Using term offset is good to retrieve concepts such as proximity, whereas byte offsets allow direct access VocabularyOccurrences (byte offset) ……

CSE 8337 Spring Inverted Files The number of indexed terms is often several orders of magnitude smaller when compared to the documents size (Mbs vs Gbs) The space consumed by the occurrence list is not trivial. Each time the term appears it must be added to a list in the inverted file That may lead to a quite considerable index overhead

CSE 8337 Spring Example Text: Inverted file That house has a garden. The garden has many flowers. The flowers are beautiful beautiful flowers garden house 70 45, 58 18, 29 6 VocabularyOccurrences

CSE 8337 Spring Inverted Files Coarser addressing may be used All occurrences within a block (perhaps a whole document) are identified by the same block offset Much smaller overhead Some searches will be less efficient, e.g., proximity searches. Linear scan may be needed, though hardly feasible (specially on- line) TermsOccurrences (block offset) ……

CSE 8337 Spring Space Requirements The space required for the vocabulary is rather small. According to Heaps’ law the vocabulary grows as O(n  ), where  is a constant between 0.4 and 0.6 in practice On the other hand, the occurrences demand much more space. Since each word appearing in the text is referenced once in that structure, the extra space is O(n) To reduce space requirements, a technique called block addressing is used

CSE 8337 Spring Block Addressing The text is divided in blocks The occurrences point to the blocks where the word appears Advantages: the number of pointers is smaller than positions all the occurrences of a word inside a single block are collapsed to one reference Disadvantages: online search over the qualifying blocks if exact positions are required

CSE 8337 Spring Example Text: Inverted file: Block 1Block 2 Block 3 Block 4 That house has a garden. The garden has many flowers. The flowers are beautiful beautiful flowers garden house VocabularyOccurrences

CSE 8337 Spring Inverted Files 45% 19% 18% 73% 26% 25% 36% 18% 1.7% 64% 32% 2.4% 35% 26% 0.5% 63% 47% 0.7% Addressing words Addressing documents Addressing 256 blocks Index Small collection (1Mb) Medium collection (200Mb) Large collection (2Gb)

CSE 8337 Spring Inverted Files - construction Building the index in main memory is not feasible (wouldn’t fit, and swapping would be unbearable) Building it entirely in disk is not a good idea either (would take a long time) One idea is to build several partial indices in main memory, one at a time, saving them to disk and then merging all of them to obtain a single index

CSE 8337 Spring Inverted Files - construction The procedure works as follows: Build and save partial indices l1, I2, …, In Merge Ij and Ij+1 into a single partial index Ij,j+1 Merging indices mean that their sorted vocabularies are merged, and if a term appears in both indices then the respective lists should be merged (keeping the document order) Then indices Ij,j+1 and Ij+2,j+3 are merged into partial index Ij+1,j+3, and so on and so forth until a single index is obtained Several partial indices can be merged together at once

CSE 8337 Spring Inverted Files - construction This procedure takes O(n log (n/M)) time plus O(n) to build the partial indices – where n is the document size and M is the amount of main memory available Adding a new document is a matter of merging its (partial) index (indices) to the index already built Deletion can be done in O(n) time – scanning over all lists of terms occurring in the deleted document

CSE 8337 Spring Searching The search algorithm on an inverted index follows three steps: Vocabulary search: the words present in the query are searched in the vocabulary Retrieval occurrences: the lists of the occurrences of all words found are retrieved Manipulation of occurrences: the occurrences are processed to solve the query

CSE 8337 Spring Searching Searching task on an inverted file always starts in the vocabulary ( It is better to store the vocabulary in a separate file ) The structures most used to store the vocabulary are hashing, tries or B-trees An alternative is simply storing the words in lexicographical order ( cheaper in space and very competitive with O(log v) cost )

CSE 8337 Spring Construction All the vocabulary is kept in a suitable data structure storing for each word a list of its occurrences Each word of the text is read and searched in the vocabulary If it is not found, it is added to the vocabulary with a empty list of occurrences and the new position is added to the end of its list of occurrences

CSE 8337 Spring Example I I I I I I I I 1 I 2 I 3 I 4 I 5 I 6 I 7 I final index initial dumps level 1 level 2 level 3

CSE 8337 Spring Conclusion Inverted file is probably the most adequate indexing technique for database text The indices are appropriate when the text collection is large and semi-static Otherwise, if the text collection is volatile online searching is the only option Some techniques combine online and indexed searching

CSE 8337 Spring Inverted Files - searching Searching using an inverted file Vocabulary search The terms used in the query (decoupled in the case of phrase or proximity queries) are searched separately Retrieval of occurrences lists Filtering answer If the query was boolean then the retrieved lists have to be “booleany” processed as well If the inverted file used blocking and the query used proximity (for instance) than the actual byte/term offset has to be obtained from the documents

CSE 8337 Spring Inverted Files - searching Processing the lists of occurrences (filtering the answer set) may be critical For instance, how to process a proximity query (involving two terms) ? The lists are built in increasing order, so they may be traversed in a synchronous way, and each occurrence is checked for the proximity What if blocking is used ? No positional information is kept, so a linear scan of the document is required The traversal and merging of the obtained lists are sensitive operations

CSE 8337 Spring Inverted Files - layout Indexed Terms Number of occurrences Occurrences Lists Vocabulary Posting File This could be a tree like structure !

CSE 8337 Spring Signature Files Unlike the case of inverted files where, most of the time, where is a tree structure underneath, signature files use hash tables The main idea is to divide the document into blocks of fixed size and each block has assigned to it a signature (also fixed size), which is used to search the document for the queried pattern The block signature is obtained by OR’ing the hashed bitstrings of each term in the block

CSE 8337 Spring Signature Files Consider: H(information) = H(text) = H(data) = H(retrieval) = The block signatures of a document D containing the text “textual retrieval and information retrieval” (after removing stopwords and stemming) for a block size of two terms – would be: B1D = and B2D =

CSE 8337 Spring Signature Files - searching To search for a given term we compare whether the term’s bitstring could be “inside” the block signatures Consider we are searching for “text” in document D H(text) = and B1D = H(text) bit-wise-AND B1D = = H(text) Therefore “text” could be in B1D (it is in this particular case) Consider we are now searching for “data” H(data) bit-wise-AND B1D = = H(data) H(data) bit-wise-AND B2D = = H(data) Though “data” is not in either block ! Signature files may yield false hits …

CSE 8337 Spring Signature File - layout Block signature (bitmask) Logical Blocks Pointers to blocks in the documents

CSE 8337 Spring Signature Files – design issues How to keep the probability of a false alarms low ? How to predict how good a signature is ? Consider: B: the size (number of bits) of each term’s bitstring b: number of terms per document block l: the minimum number of bits set (turned on) in B, this depends on the hashing function The value of l which minimizes the false hits (yielding probability of false hits equal to 2 -l ) is l = B ln(2)/b B/b indicates the space overhead to pay

CSE 8337 Spring Signature Files - alternatives There are many strategies which can be used with signature files Signature compression Given that the bitstring is likely to have many 0s a simple run-length encoding technique can be used is encoded as “328” (each digit in the code represents the number of 0s preceding a 1) – remember encoding from last class The main drawback is that comparing bitstrings require decoding at search time

CSE 8337 Spring Signature Files - alternatives Using the original signature file layout and a bitstring (queried term signature) all logical blocks are checked as to whether that block could contain that bitstring This may take a long time if the signature file is long (this depends on the block size) Could we check only at the bits we are interested ? Yes, we can …

CSE 8337 Spring Signature Files - alternatives Original signature file layout: Each row has a bitmask corresponding to a block Each column tells whether that bit is set of not Bit-sliced signature files layout: Each row correspond to a particular bit in the bitmasks Each column is the bitmask for the blocks The signature file is “transposed”

CSE 8337 Spring Signature Files - alternatives Block signature (bitmask) Bits in the bitmask Pointers to Blocks in the documents Each row is a file !

CSE 8337 Spring Signature Files - alternatives Advantages of bit-slicing: Given a term bitstring one can read only the lines (files) corresponding to those set bits and if they are set in a given column, then the whole column may be retrieved. Empty answers are found very fast Insertion is expensive, as all rows must be updated, but in an append-only fashion, this is good for WORM media types

CSE 8337 Spring Signature Files - alternatives It has been proposed that m = 50% (half of the bits to be 1) A typical value of m is 10, that implies 10 accesses for a single term query Some authors have suggested to use a smaller m (thus smaller number of accesses) The trade-off is that to maintain low probability of false-hits, the signature has to be larger, thus the space required for the signature grows larger as well

CSE 8337 Spring (Signature) S-Trees Can we improve on the searching procedure of signature files ? S-tree: it’s a tree of signatures Each leaf has similar block signatures The notion of similarity can use the Hamming (editing) distance OR’ing all signatures on a leaf will generate a signature to the parent node entry pointing to the leaf The procedure goes on and on until the root is reached

CSE 8337 Spring (Signature) S-trees Searching for a term is a matter of checking the signature in each node entry and traversing down the tree following only the pointer with “promising’’ signatures When OR’ing the signature the the number os bits set will increase as we go along up in the tree If a poor signature (too many 1’s) is used then the number of bits set in each entry will increase too fast as we go up in the tree, and the number of false-traversals will increase fast

CSE 8337 Spring (Signature) S-tree Insertion is efficient The tree is kept balanced, similarly to a B+- tree Unlike the previous approaches this one cannot take advantage of append-only This approach has not been much explored. It is not certain whether it can be useful or not (in terms of speediness, etc Many variations of Signature Files exist …

CSE 8337 Spring Suffix Trees/Arrays Suffix trees are a generalization of inverted files For traditional queries, i.e., those based on simple terms, inverted files are the structure of choice This type of index treats the text to be indexed as a finite, but long string. Thus each position in the text represent a suffix of that text, and each suffix is uniquely identified by its position Such position determine what is and what is not indexed and is called index point

CSE 8337 Spring Suffix Trees Note that the choice of index points is crucial to the retrieval capabilities Consider the document: This is a text. A text has many words. Words … And the index points/suffixes: 11: text. A text has many words. Words … 19: text has many words. Words … 28: many words. Words … 33: words. Words … 40: Words …

CSE 8337 Spring Suffix Trees Suffix Trie a m n t ex t ‘ ’. w o r d s.

CSE 8337 Spring Suffix Trees  Suffix Tree m t ‘ ’. w.

CSE 8337 Spring Suffix Trees Even though the example showed only terms, one can use suffix trees to index and search phrases For practical purposes a rather small phrase length (typically much smaller than the text) should be used Suffix arrays are a better implementation of suffix trees

CSE 8337 Spring Suffix Arrays Careful inspection of the leaves in the suffix tree shows that there is a lexicographical order among the index points An equivalent suffix array would be: [28, 19, 11, 40, 33] for (many, text, text., words, words.} This allows the search to be processed as a binary search, but note that one must use the array and the text at query time The number of I/Os may become a problem

CSE 8337 Spring Suffix Arrays One idea is to use a supra-index A supra-index will contain a sample of the indexed terms, e.g., […, text, word, …] [28, 19, 11, 40, 33] This alleviates the binary search, but makes it quite similar to inverted files

CSE 8337 Spring Suffix Arrays This similarity is true if only single terms are indexed, but suffix trees/arrays aim more than that, they aim mostly at phrase queries However, the array (or tree) is unlikely to fit in main memory, whereas for the inverted files the vocabulary could fit in main memory This requires careful implementation to reduce I/Os Good implementations show reduction of over 50% compared to naive ones

CSE 8337 Spring Motivation Thus far, we’ve seen how to pre-process texts (stemming, “stopping” words, compressing, etc) We’ve also seen how to build, process, improve and assess the quality of queries and the returned answer sets In the last classes, we saw how to effectively index the texts Now, how do we search text ? It goes beyond searching the indices, specially in the cases where the indices are not enough to answer queries (e.g., proximity queries)

CSE 8337 Spring Sequential Searching If no index is used this is the only option, however sometimes (e.g., when blocking is used) it is necessary even if an index exists The problem is “simple” Given a pattern P of length n and a text T of length m (m >> n) find all positions in T where P occurs There is much more work on this than we can cover, including many theoretical results. Thus we will discover some well known approaches

CSE 8337 Spring Brute Force The name says it all … Starting from all possible initial positions (i.e., all positions) whether the pattern could start at that position It takes O(mn) time in the worst case, but O(n) in the average case – not that bad The most important thing is that it suggests the use of a sliding window over the text. The idea is to see whether we can see the pattern through the window

CSE 8337 Spring Knuth-Morris-Pratt Consider the following text: TEXTURE OF A TEXT IS NOT TEXTUAL And that the queried term is ‘TEXT ’ (note the blank at the end !) The window is slid TEXTURE OF A TEXT IS NOT TEXTUAL At this point there cannot be a match so the window can be slid to TEXTURE OF A TEXT IS NOT TEXTUAL Is that correct ?

CSE 8337 Spring Knuth-Morris-Pratt Only if we are not interested in the occurrence of pattern within terms, otherwise not For instance, search for ABRACADABRA within ABRACABRACADABRA The previous idea would do: ABRACABRACADABRA and then skip to ABRACABRACADABRA missing the pattern The correct move would be to ABRACABRACADABRA How come ?

CSE 8337 Spring Knuth-Morris-Pratt The processing time of this algorithm is O(n), which is linear, thus better than the Brute Force Algorithm However, in the average case the Brute Force algorithm is O(n) as well Many, many others algorithms exist …