1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, 1999. (Chapter 8)



2 Introduction
• Word-based indexing
  » Inverted indices are good for searching words
  » Queries such as phrases are expensive to solve using inverted files
  » For word-based applications, inverted files perform better
• Suffix trees and suffix arrays
  » Support more complex queries

3 Text Suffixes
Text: This is a text. A text has many words. Words are made from letters.
Suffixes:
  text. A text has many words. Words are made from letters.
  text has many words. Words are made from letters.
  many words. Words are made from letters.
  words. Words are made from letters.
  Words are made from letters.
  made from letters.
  letters.
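A minimal Python sketch of how such suffixes can be enumerated, assuming index points are placed at the beginning of every word (the slide lists only some of them). The helper name `word_suffixes` is hypothetical and positions are 0-based.

```python
import re

def word_suffixes(text):
    """Return (position, suffix) pairs for every suffix starting at a word beginning."""
    return [(m.start(), text[m.start():]) for m in re.finditer(r"\w+", text)]

text = "This is a text. A text has many words. Words are made from letters."
for pos, suffix in word_suffixes(text):
    print(pos, suffix)
```

Each suffix is identified by its start position alone, which is why an index over suffixes only needs to store integers pointing into the text.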

4 The Suffix Trie and Suffix Tree
Text: This is a text. A text has many words. Words are made from letters.
[Figure: suffix trie and suffix tree for this text]

5 PAT Trees and PAT Arrays Information Retrieval: Data Structures and Algorithms by W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992. (Chapter 5)

6 PAT Trees and PAT Arrays
• Problems of traditional IR models
  » Documents and words are assumed.
  » Keywords must be extracted from the text (indexing).
  » Queries are restricted to keywords.
• New indices for text
  » A text is regarded as a long string.
  » Each position corresponds to a semi-infinite string (sistring).
  » No structures and no keywords

7 Semi-infinite Strings
• Example
  Text          Once upon a time, in a far away land …
  sistring 1    Once upon a time …
  sistring 2    nce upon a time …
  sistring 8    on a time, in a …
  sistring 11   a time, in a far …
  sistring 22   a far away land …
• Compare sistrings
  22 < 11 < 2 < 8 < 1
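The ordering on this slide can be checked with a short Python sketch. It assumes 1-based positions, uses only the visible part of the text, and compares case-insensitively; that last point is an assumption made to reproduce the slide's order, since a plain ASCII comparison would sort "Once" first because uppercase letters precede lowercase ones.

```python
def sistring(text, pos):
    """Sistring starting at 1-based position `pos` (the text from there to its end)."""
    return text[pos - 1:]

text = "Once upon a time, in a far away land"
positions = [1, 2, 8, 11, 22]

# Sort positions by the sistring they point to, ignoring case.
ordered = sorted(positions, key=lambda p: sistring(text, p).lower())
print(ordered)  # [22, 11, 2, 8, 1]
```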

8 PAT Tree
• PAT tree
  » A Patricia tree constructed over all the possible sistrings of a text
• Patricia tree
  » A binary digital tree where the individual bits of the keys are used to decide on the branching
    – A zero bit causes a branch to the left subtree
    – A one bit causes a branch to the right subtree
  » Each internal node indicates which bit of the query is used for branching, given either as
    – an absolute bit position, or
    – a count of the number of bits to skip
  » Each external node points to a sistring
    – the integer displacement into the original text
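A minimal Python sketch of this structure, under stated assumptions: the text is a string of '0'/'1' characters, internal nodes store an absolute (1-based) bit position, sistrings are padded with '0' past the end of the text, and all indexed sistrings are distinct. The bit string and the names `External`, `Internal`, `bit_at` and `build` are illustrative, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class External:
    position: int   # 1-based displacement of the sistring in the text

@dataclass
class Internal:
    bit: int        # absolute 1-based bit position inspected at this node
    left: "Node"    # subtree for sistrings whose inspected bit is 0
    right: "Node"   # subtree for sistrings whose inspected bit is 1

Node = Union[External, Internal]

def bit_at(text, position, k):
    """k-th bit (1-based) of the sistring at `position`; '0' past the end of the text."""
    i = position + k - 2
    return text[i] if i < len(text) else "0"

def build(text, positions, k=1):
    """Build a Patricia tree over the sistrings starting at `positions`."""
    if len(positions) == 1:
        return External(positions[0])
    # Skip every bit position on which all remaining sistrings agree.
    while len({bit_at(text, p, k) for p in positions}) == 1:
        k += 1
        if k > len(text):
            raise ValueError("sistrings cannot be told apart")
    zeros = [p for p in positions if bit_at(text, p, k) == "0"]
    ones = [p for p in positions if bit_at(text, p, k) == "1"]
    return Internal(k, build(text, zeros, k + 1), build(text, ones, k + 1))

TEXT = "01100100010111"                  # hypothetical example bit string
root = build(TEXT, list(range(1, 9)))    # index sistrings 1..8
print(root.bit, root.right.left)
```

Because agreeing bits are skipped, every internal node has two non-empty subtrees, so a tree over n sistrings has exactly n - 1 internal nodes.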

9 Example
[Figure: PAT tree over a text and eight of its sistrings; the text and sistring values were lost in extraction.]
Legend:
  external node: sistring (integer displacement)
  internal node: skip counter & pointer (total displacement of the bit to be inspected)

10 Example: search for 00101
[Figure: text and eight sistrings; values lost in extraction.]
Note: sistrings 3 and 6 need 4 bits to be told apart.

11 Indexing Points
• The above example assumes every position in the text is indexed, i.e., n external nodes, one for each indexed position in the text
• For word and phrase searches, only sistrings that start at the beginning of words are necessary
• Trade-off between the size of the index and the search requirements

12 Prefix Searching
• Idea: every subtree of the PAT tree contains all the sistrings with a given prefix.
• Search: time proportional to the query length
  » Descend until the prefix is exhausted or an external node is reached.
• Example: search for the prefix “10100” and its answer
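A self-contained Python sketch of prefix searching on a Patricia tree. It assumes a text of '0'/'1' characters, a compact representation (a leaf is an integer position, an internal node is a tuple `(bit, zero_subtree, one_subtree)`), '0' padding past the end of the text, and distinct sistrings. The bit string and all names are illustrative. Since skipped bits are never inspected on the way down, one candidate is verified against the text at the end.

```python
def bit_at(text, position, k):
    """k-th bit (1-based) of the sistring at 1-based `position`; '0' past the end."""
    i = position + k - 2
    return text[i] if i < len(text) else "0"

def build(text, positions, k=1):
    """Leaf = int position; internal node = (bit_index, zero_subtree, one_subtree)."""
    if len(positions) == 1:
        return positions[0]
    while len({bit_at(text, p, k) for p in positions}) == 1:
        k += 1  # skip bits on which all sistrings agree
        if k > len(text):
            raise ValueError("sistrings cannot be told apart")
    zeros = [p for p in positions if bit_at(text, p, k) == "0"]
    ones = [p for p in positions if bit_at(text, p, k) == "1"]
    return (k, build(text, zeros, k + 1), build(text, ones, k + 1))

def leaves(node):
    """All sistring positions stored below `node`."""
    if isinstance(node, int):
        return [node]
    return leaves(node[1]) + leaves(node[2])

def prefix_search(text, root, prefix):
    """Positions of all indexed sistrings that start with `prefix`."""
    node = root
    # Descend until the prefix is exhausted or an external node is reached.
    while not isinstance(node, int) and node[0] <= len(prefix):
        node = node[1] if prefix[node[0] - 1] == "0" else node[2]
    candidates = leaves(node)
    # Skipped bits were not checked, so verify a single candidate against the text.
    probe = candidates[0]
    matches = all(bit_at(text, probe, k) == prefix[k - 1]
                  for k in range(1, len(prefix) + 1))
    return sorted(candidates) if matches else []

TEXT = "01100100010111"                  # hypothetical example bit string
root = build(TEXT, list(range(1, 9)))
print(prefix_search(TEXT, root, "100"))  # [3, 6]
```

The descent inspects at most one node per prefix bit, so its cost depends on the query length and not on the size of the text; reporting the answer then costs time proportional to the size of the subtree found.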

13 Proximity Searching
• Find all places where s1 is at most a fixed (user-given) number of characters away from s2.
  » e.g., in 4 ation ==> insulation, international, information
• Algorithm
  1. Search for s1 and s2.
  2. Select the smaller answer set from these two sets and sort it by position.
  3. Traverse the unsorted answer set, searching for every position in the sorted set and checking whether the distance between the positions satisfies the proximity condition.
• Sort + traverse time: m1 log m1 + m2 log m1 (assuming m1 < m2)
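The three steps can be sketched in Python as follows. This is an illustration under assumptions: a plain scan (`find_all`) stands in for the index lookup of step 1, and "distance" is taken to be the absolute difference between start positions, which the slide does not define precisely. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def find_all(text, pattern):
    """Step 1 stand-in: all 0-based start positions of `pattern` (simple scan)."""
    positions, i = [], text.find(pattern)
    while i != -1:
        positions.append(i)
        i = text.find(pattern, i + 1)
    return positions

def proximity(pos1, pos2, max_dist):
    """Pairs (p1, p2) with p1 from pos1, p2 from pos2 and |p1 - p2| <= max_dist."""
    swapped = len(pos1) > len(pos2)
    small, large = (pos2, pos1) if swapped else (pos1, pos2)
    small = sorted(small)                     # step 2: sort the smaller answer set
    pairs = []
    for p in large:                           # step 3: traverse the other set
        lo = bisect_left(small, p - max_dist)
        hi = bisect_right(small, p + max_dist)
        for q in small[lo:hi]:
            pairs.append((p, q) if swapped else (q, p))
    return sorted(pairs)

text = "insulation and information"
print(proximity(find_all(text, "in"), find_all(text, "ation"), 6))
```

Sorting the smaller set costs m1 log m1 and each of the m2 lookups is a binary search costing log m1, which matches the total stated on the slide.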

14 Range Searching
• Search for all the strings within a certain lexicographical range.
  » Ex: the range “abc” .. “acc”
    – “abracadabra”, “acacia”: in range (○)
    – “abacus”, “acrimonious”: out of range (X)
• Algorithm
  » Search for each end of the defining interval.
  » Collect all the subtrees between (and including) them.
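A small Python sketch of the same idea, with a sorted list standing in for the PAT tree (an assumption made for brevity): searching each end of the interval becomes two binary searches, and collecting the subtrees between them becomes taking the slice between the two boundaries.

```python
from bisect import bisect_left, bisect_right

def range_search(sorted_keys, low, high):
    """All keys k with low <= k <= high in lexicographic order."""
    start = bisect_left(sorted_keys, low)    # search the lower end of the interval
    end = bisect_right(sorted_keys, high)    # search the upper end of the interval
    return sorted_keys[start:end]            # everything in between

words = sorted(["abacus", "abracadabra", "acacia", "acrimonious"])
print(range_search(words, "abc", "acc"))  # ['abracadabra', 'acacia']
```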

15 Longest Repetition Searching
• The match between two different positions of a text where this match is the longest in the entire text
• e.g., [Figure: text and eight sistrings; values lost in extraction.]
• The tallest internal node gives a pair of sistrings that match for the greatest number of characters
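A hedged Python sketch of longest repetition searching. Instead of locating the tallest internal node of a PAT tree, it uses an equivalent formulation over sorted sistrings: the two sistrings sharing the longest prefix are always neighbours in lexicographic order. The function name is hypothetical, positions are 0-based, and the naive sort is only suitable for small texts.

```python
def longest_repetition(text):
    """Return (repeated substring, [two 0-based positions where it starts])."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best_len, best_pair = 0, []
    for a, b in zip(order, order[1:]):
        # Length of the common prefix of two neighbouring sistrings.
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        if n > best_len:
            best_len, best_pair = n, sorted((a, b))
    if not best_pair:
        return "", []
    return text[best_pair[0]:best_pair[0] + best_len], best_pair

print(longest_repetition("abracadabra"))  # ('abra', [0, 7])
```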

16 “Most Significant” or “Most Frequent” Matching
• The most frequently occurring strings within the text database
  » e.g., the most frequent trigram
• Find the most frequent trigram
  » Find the largest subtree at a distance of 3 characters from the root
• The tallest internal node gives a pair of sistrings that match for the greatest number of characters, i.e., 1, 2, 3 are the same for sistrings … and …
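A short Python sketch of the trigram case. It counts trigrams directly with `collections.Counter` rather than measuring subtree sizes; the two views agree because the size of the subtree reached after 3 characters equals the number of sistrings that start with that trigram. The function name is hypothetical and the text is assumed to have at least `n` characters.

```python
from collections import Counter

def most_frequent_ngram(text, n=3):
    """Return (ngram, count) for the most frequent substring of length n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(1)[0]

print(most_frequent_ngram("banana"))  # ('ana', 2)
```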

17 Building PAT Trees as Patricia Trees (1)
• Bucketing of external nodes
  » Collect more than one external node in a bucket
  » A bucket replaces any subtree with size less than a certain constant (b)
    – Saves a significant number of internal nodes
  » The external nodes inside a bucket do not have any structure associated with them
    – Increases the number of comparisons for each search

18 Building PAT Trees as Patricia Trees (2)
• Mapping the tree onto the disk using super-nodes
  » Advantage: reduces the number of disk accesses and saves space
  » Every disk page has a single entry point, contains as much of the tree as possible, and
    – terminates either in external nodes or in pointers to other disk pages
    – The pointers in internal nodes address either a disk page or another node inside the same page
  » This reduces the storage cost of internal nodes
  » Example
    – Assume a disk page contains on the order of 1,000 internal/external nodes
    – On average, each disk page then contains about 10 steps of a root-to-leaf path
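The "about 10 steps" figure follows from simple arithmetic if one assumes, as a simplification not stated on the slide, that the part of the binary tree stored in a page is roughly balanced. The text size n = 10^9 in the second line is a hypothetical value chosen only to illustrate the effect on disk accesses.

```latex
\[
  \text{steps per page} \approx \log_2(\text{nodes per page}) = \log_2 1000 \approx 9.97 \approx 10
\]
\[
  \text{page accesses per search} \approx \frac{\log_2 n}{\log_2 1000}
  = \frac{\log_2 10^{9}}{\log_2 10^{3}} = 3
  \quad \text{(hypothetical } n = 10^{9} \text{ sistrings, balanced tree)}
\]
```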

19 PAT Trees Represented as Arrays
• External node bucket size: b
• If the external nodes in a bucket are kept in the same relative order as they would have in the tree
  » the bucket can be searched with an indirect binary search instead of a sequential search
[Figure: PAT array and text]

20 Searching PAT Trees as Arrays
• Prefix searching and range searching
  » Do an indirect binary search over the array, with the result of each comparison being less than, equal, or greater than.
• Example: search for the prefix 100 and its answer
• Most frequent, longest repetition
  » Manber and Baeza-Yates (1991)
[Figure: PAT array and text]
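A minimal Python sketch of a PAT array and of prefix searching by indirect binary search. The array holds only text positions, and every comparison goes through the position to the text itself, which is what makes the search indirect. Positions are 0-based here, the bit string is a hypothetical example, the function names are illustrative, and the naive sort is only meant for small texts.

```python
def build_pat_array(text, positions=None):
    """Positions of the indexed sistrings, sorted by the sistring they point to."""
    if positions is None:
        positions = range(len(text))
    return sorted(positions, key=lambda i: text[i:])

def prefix_search(text, pat_array, prefix):
    """0-based positions of all indexed sistrings that start with `prefix`."""
    def key(i):
        # Indirect access: follow the position into the text, truncated to the prefix length.
        return text[i:i + len(prefix)]

    lo, hi = 0, len(pat_array)
    while lo < hi:                      # first entry whose key is >= prefix
        mid = (lo + hi) // 2
        if key(pat_array[mid]) < prefix:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(pat_array)
    while lo < hi:                      # first entry whose key is > prefix
        mid = (lo + hi) // 2
        if key(pat_array[mid]) <= prefix:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pat_array[start:lo])

TEXT = "01100100010111"                 # hypothetical example bit string
pat = build_pat_array(TEXT)
print(prefix_search(TEXT, pat, "100"))  # [2, 5]
```

All matches occupy one contiguous slice of the array, so two binary searches delimit the whole answer; the same two-boundary scheme handles range searching when the two ends are different strings.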

21 Comparisons
• Signature files
  » Use hashing techniques to produce an index
  » Advantage
    – Storage overhead is small (10%-20%)
  » Disadvantages
    – The search time on the index is linear
    – Some answers may not match the query, thus filtering must be done
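To make the filtering remark concrete, here is a hedged Python sketch of one common signature scheme (superimposed coding), which the slide does not specify. Each word sets a few bits chosen by hashing, and a block's signature is the OR of its word masks. The constants `SIG_BITS` and `BITS_PER_WORD` and all function names are illustrative assumptions.

```python
import zlib

SIG_BITS = 64        # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word

def word_mask(word):
    """Bit mask for one word: a few positions chosen by hashing."""
    mask = 0
    for seed in range(BITS_PER_WORD):
        h = zlib.crc32(f"{seed}:{word}".encode())
        mask |= 1 << (h % SIG_BITS)
    return mask

def block_signature(words):
    """Signature of a text block: OR of the masks of its words."""
    sig = 0
    for w in words:
        sig |= word_mask(w)
    return sig

def may_contain(sig, word):
    """True if every bit of the word's mask is set; can be a false positive."""
    m = word_mask(word)
    return (sig & m) == m

block = ["suffix", "tree", "array"]
sig = block_signature(block)
print([may_contain(sig, w) for w in block])  # [True, True, True]
```

A word that is present always passes the test, but an absent word can also pass when other words happen to set the same bits, which is why candidate blocks must be checked against the text. Every block signature has to be tested in turn, which corresponds to the linear search time noted above.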

22 Comparisons (Continued)
• Inverted files
  » Storage overhead (30%-100%)
  » Search time for word searches is logarithmic
• PAT arrays
  » Potential use in other kinds of searches
    – phrases
    – regular expression searching
    – approximate string searching
    – longest repetitions
    – most frequent searching