ETRI: Linear-Time Search in Suffix Arrays, July 14, 2003. Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park.

Presentation transcript:

ETRI: Linear-Time Search in Suffix Arrays, July 14, 2003. Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park.

Suffix arrays
The suffix array of a text T is the lexicographically sorted list of all suffixes of T.

Suffix arrays: example for T = abbabaababbb#
The suffixes of T, abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13), are stored in lexicographical order. # is a special character that is lexicographically smaller than every other character.
Index  Suffix
1      #
2      aababbb#
3      abaababbb#
4      ababbb#
5      abbabaababbb#
6      abbb#
7      b#
8      baababbb#
9      babaababbb#
10     babbb#
11     bb#
12     bbabaababbb#
13     bbb#

Suffix arrays: example for T = abbabaababbb#
The suffixes of T are abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13). An actual suffix array stores only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.
Index  Position  Suffix
1      13        #
2      6         aababbb#
3      4         abaababbb#
4      7         ababbb#
5      1         abbabaababbb#
6      9         abbb#
7      12        b#
8      5         baababbb#
9      3         babaababbb#
10     8         babbb#
11     11        bb#
12     2         bbabaababbb#
13     10        bbb#
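
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the construction method of the talk: the function name build_suffix_array is an assumption made for illustration, and it sorts the suffixes directly, which is far slower than the linear-time constructions the slides refer to.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes of text, in sorted order."""
    # Sorting the suffixes directly is simple but slow (roughly O(n^2 log n));
    # it is only meant for slide-sized examples.
    # '#' sorts before the letters in ASCII, matching the slides' convention
    # that '#' is the smallest character.
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)
for index, start in enumerate(SA, start=1):
    # Prints the index, the start position in T, and the suffix itself.
    print(index, start, T[start - 1:])
```

Each printed row corresponds to one row of the table: the index in the suffix array, the starting position in T, and the suffix.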

Suffix arrays: definition of s-suffixes
Suffixes starting with a string s are called s-suffixes, for example a-suffixes and ba-suffixes. In the suffix array of T = abbabaababbb#, the a-suffixes are at indices 2 to 6 and the ba-suffixes at indices 8 to 10.

Suffix arrays vs. suffix trees
Construction time: suffix array = suffix tree.
Space: suffix array = suffix tree asymptotically; in practice, suffix arrays are more space efficient than suffix trees.
Search time (p = |P|, n = |T|): suffix array O(p + log n); suffix tree O(p log |Σ|).

Contribution
Construction time and space are unchanged: suffix array = suffix tree, with suffix arrays more space efficient in practice.
Search time: a suffix array can now be searched in either O(p + log n) or O(p log |Σ|) time; a suffix tree takes O(p log |Σ|).

The meaning of our contribution
Construction time and space are equal for suffix arrays and suffix trees, and suffix arrays are more space efficient in practice. With our contribution, search time satisfies SA ≤ ST.
Suffix arrays are more powerful than suffix trees.

Our search algorithm

Search in a suffix array
Definition: search in a suffix array.
Input: a pattern P and the suffix array of T.
Output: all P-suffixes of T.

Search in a suffix array: a search example
P = ab, T = abbabaababbb#. Find all ab-suffixes.
All ab-suffixes are neighbors in the suffix array: they occupy indices 3 to 6 (abaababbb#, ababbb#, abbabaababbb#, abbb#).
We therefore only have to find the first and the last ab-suffix, because the other ab-suffixes are stored between them.
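
Because the P-suffixes are contiguous, a search only needs two boundaries. The sketch below finds them with ordinary binary search over the sorted suffixes. It is a baseline for comparison, not the backward search developed in the following slides; the function name p_suffix_range is an assumption made for illustration.

```python
from bisect import bisect_left, bisect_right


def p_suffix_range(text, pattern):
    """Return (first, last), the 1-based index range of pattern-suffixes, or None."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    # Truncating sorted strings to a fixed length keeps them in sorted order,
    # and every pattern-suffix truncates to exactly the pattern.
    prefixes = [s[:len(pattern)] for s in suffixes]
    first = bisect_left(prefixes, pattern) + 1  # 1-based index of the first match
    last = bisect_right(prefixes, pattern)      # 1-based index of the last match
    return (first, last) if first <= last else None


print(p_suffix_range("abbabaababbb#", "ab"))
```

For P = ab this reports the range of indices 3 to 6, the block of neighboring ab-suffixes.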

Related work
In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001), who search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.
Their method searches P from its last character to its first. For P = ababaaabb, for example, the suffix abaaabb of P is matched before the whole pattern.
We adopt this backward pattern searching idea.

Algorithm outline
P = aba, T = abbabaababbb#.
We find all aba-suffixes by searching P backward. The algorithm has p stages (three in this case):
Stage 1: find all a-suffixes (indices 2 to 6).
Stage 2: find all ba-suffixes (indices 8 to 10).
Stage 3: find all aba-suffixes (indices 3 to 4).
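
The stages above can be checked by brute force. The following sketch scans every suffix at every stage, so it only illustrates which index range each stage should produce and does not show how the algorithm computes it; the function name stage_ranges is an assumption made for illustration.

```python
def stage_ranges(text, pattern):
    """For each stage k, return (tail, first, last) for the last k characters of pattern."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    ranges = []
    for k in range(1, len(pattern) + 1):
        tail = pattern[-k:]  # stage k looks for suffixes starting with this tail
        rows = [row for row, s in enumerate(suffixes, start=1) if s.startswith(tail)]
        ranges.append((tail, rows[0], rows[-1]) if rows else (tail, None, None))
    return ranges


for tail, first, last in stage_ranges("abbabaababbb#", "aba"):
    print(f"{tail}-suffixes: indices {first} to {last}")
```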

Elaborating stage 2 (P = aba)
We explain a stage by elaborating stage 2, which finds all ba-suffixes using the a-suffixes found in stage 1.
We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix. We only explain how to find the first ba-suffix; finding the last ba-suffix is similar.
To find the first ba-suffix, we count the number of suffixes that precede the ba-suffixes in the suffix array.

Elaborating stage 2: A-type and B-type suffixes
Suffixes preceding the ba-suffixes are divided into two categories:
- A-type: suffixes starting with a character lexicographically smaller than b (the #-suffixes and a-suffixes, indices 1 to 6).
- B-type: suffixes starting with the same character b and preceding the ba-suffixes (here only b#, index 7).
We count A-type and B-type suffixes in different ways.

Count the number of A-type suffixes
The number of A-type suffixes = the number of #-suffixes and a-suffixes = the index of the last a-suffix in the suffix array (6).
We generate an array M that stores the indices of the last #-suffix, the last a-suffix, and the last b-suffix: M[#] = 1, M[a] = 6, M[b] = 13. With this array, we can count A-type suffixes in O(1) time.
Array M has one entry per alphabet character and is built in O(n) time (one scan of the suffix array).
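
A minimal sketch of the one-scan construction of this array follows. The function names last_positions and count_a_type are assumptions made for illustration. The predecessor lookup in count_a_type scans the alphabet for simplicity; with characters mapped to consecutive integers it becomes a single array access, which is the O(1) step the slide describes.

```python
def last_positions(text, sa):
    """One scan over the suffix array: M[c] = index of the last c-suffix."""
    M = {}
    for index, start in enumerate(sa, start=1):
        M[text[start - 1]] = index  # later indices overwrite earlier ones
    return M


def count_a_type(M, c):
    """Number of suffixes whose first character is smaller than c."""
    smaller = [ch for ch in M if ch < c]
    # All such suffixes end at the last suffix of the largest smaller character.
    return M[max(smaller)] if smaller else 0


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda start: T[start - 1:])  # naive construction
M = last_positions(T, SA)
print(M, count_a_type(M, "b"))
```

For the example text, M holds 1, 6 and 13 for #, a and b, and the number of A-type suffixes for stage 2 is M[a] = 6.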

Count the number of B-type suffixes
B-type suffixes are the b-suffixes preceding the ba-suffixes.
A suffix generated by removing the leftmost b from a B-type suffix appears in the suffix subarray that precedes the a-suffixes found in stage 1.
Hence the number of B-type suffixes is the number of suffixes, in the subarray preceding the a-suffixes, whose previous character is b. We count this with array N.
Let U be the conceptual array of the previous characters of the suffixes, in suffix array order: U = b b b a # b b a b a b a a.
Array N has one column per character (#, a, b). Entry N[i,b] stores the number of suffixes whose previous character is b in the suffix subarray SA[1,i].
In the example, the a-suffixes start at index 2, so the preceding subarray is SA[1,1]; N[1,b] = 1, which accounts for the single B-type suffix b#.
We can count B-type suffixes in O(1) time by accessing an entry of N.
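
The arrays U and N can be built as in the sketch below, which stores the full table N and therefore needs one row per suffix and one column per character. The function names are assumptions made for illustration, and the previous character of the suffix starting at position 1 is taken to be the final #, which matches the U shown on the slide.

```python
def previous_characters(text, sa):
    """U[i] is the character just before the suffix at index i of the suffix array."""
    # text[start - 2] wraps around to the final '#' when start is 1.
    return [text[start - 2] for start in sa]


def build_N(U, alphabet):
    """N[i][c] = number of occurrences of c in U[1..i]; N[0] is all zeros."""
    N = [dict.fromkeys(alphabet, 0)]
    for ch in U:
        row = dict(N[-1])  # copy the previous row, then count the new character
        row[ch] += 1
        N.append(row)
    return N


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda start: T[start - 1:])  # naive construction
U = previous_characters(T, SA)
N = build_N(U, sorted(set(T)))

first_a = 2                    # index of the first a-suffix from stage 1
b_type = N[first_a - 1]["b"]   # B-type suffixes: b's in U[1, first_a - 1]
print("".join(U), b_type)
```

With 6 A-type suffixes and 1 B-type suffix, 7 suffixes precede the ba-suffixes, so the first ba-suffix is at index 8.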

Array N
Space: O(n|Σ|), since N has one row per suffix and one column per character (#, a, b).
An alternative way: O(n) space, at the cost of more than constant time for counting B-type suffixes.

Query for N[i,b]
Two ways of counting B-type suffixes in O(n) space: in O(log n) time, or in O(log |Σ|) time.

Query for N[i,b]: O(log n) time
In the O(log n)-time algorithm, we generate an array whose ith entry stores the location of the ith b in U. For U = b b b a # b b a b a b a a, this array is 1, 2, 3, 6, 7, 9, 11.
Counting the suffixes whose previous character is b in SA[1,8] is the same as counting the b's in U[1,8].
Find the largest value not exceeding 8 in this array, which is 7. We find it by binary search in O(log n) time.
The index of 7 in the array, which is 5, is the number of b's in U[1,8].
Generally, we require such arrays for all characters (#, a, b). Together they take O(n) space.
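
The binary-search query can be sketched with Python's bisect module. The names occurrence_lists and count_up_to are assumptions made for illustration; U is copied from the slide.

```python
from bisect import bisect_right
from collections import defaultdict


def occurrence_lists(U):
    """For each character c, the sorted list of 1-based positions of c in U."""
    occ = defaultdict(list)
    for i, ch in enumerate(U, start=1):
        occ[ch].append(i)
    return occ


def count_up_to(occ, c, i):
    """N[i,c]: the number of c's in U[1..i], found by binary search."""
    # bisect_right returns how many stored positions are <= i, which is the
    # index of the largest value not exceeding i.
    return bisect_right(occ.get(c, []), i)


U = list("bbba#bbababaa")
occ = occurrence_lists(U)
print(occ["b"], count_up_to(occ, "b", 8))
```

The lists for all characters together hold exactly one entry per position of U, which is where the O(n) space bound comes from.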

Query for N[i,b]
Having covered the O(log n)-time method, we turn to the O(log |Σ|)-time method.

Query for N[i,b]: O(log |Σ|) time
Divide U into |Σ|-sized blocks.
For the last character of each block, we compute the entries of N.
For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm. The binary search is then confined to one block and takes O(log |Σ|) time.
The space is still O(n) in total.
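
One way to realize the blocked structure is sketched below. The layout is an assumption made for illustration rather than the exact layout of the talk: it stores the counts accumulated before each block (equivalent to storing the entries of N at the last character of the previous block) and, per block, the positions of each character inside that block. The names build_blocked and query_N are hypothetical.

```python
from bisect import bisect_right


def build_blocked(U, block):
    """Sampled counts at block boundaries plus per-block position lists."""
    samples = []   # samples[k][c] = number of c's in U before block k
    in_block = []  # in_block[k][c] = sorted 1-based positions of c inside block k
    running = {}
    for begin in range(0, len(U), block):
        samples.append(dict(running))
        occ = {}
        for i in range(begin, min(begin + block, len(U))):
            occ.setdefault(U[i], []).append(i + 1)
            running[U[i]] = running.get(U[i], 0) + 1
        in_block.append(occ)
    return samples, in_block


def query_N(samples, in_block, block, i, c):
    """N[i,c]: sampled count plus a binary search over at most `block` positions."""
    if i == 0:
        return 0
    k = (i - 1) // block  # the block that contains position i
    return samples[k].get(c, 0) + bisect_right(in_block[k].get(c, []), i)


U = list("bbba#bbababaa")
block = 3  # |Σ| for the alphabet {#, a, b}
samples, in_block = build_blocked(U, block)
print(query_N(samples, in_block, block, 8, "b"))
```

There are about n/|Σ| blocks with at most |Σ| sampled counts each, and the per-block position lists hold one entry per position of U, so the total space stays linear in n.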

Summary
The search has p stages. Each stage:
- Counts A-type suffixes. Time: O(1). Space: O(n) for array M.
- Counts B-type suffixes. Time: O(log |Σ|). Space: O(n) for computing the value of an entry of N.
In total, O(p log |Σ|) time with O(n) space.
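
Putting the stages together gives the following sketch of the backward search. It makes several assumptions for illustration: the suffix array is built naively, B-type suffixes are counted with the simpler O(log n) binary-search query rather than the blocked O(log |Σ|) one, stage 1 is treated like every other stage by starting from the full index range, and the pattern is assumed not to contain #. The function name backward_search is hypothetical.

```python
from bisect import bisect_left, bisect_right


def backward_search(text, pattern):
    """Return (first, last), the 1-based index range of pattern-suffixes, or None."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda start: text[start - 1:])  # naive construction
    U = [text[start - 2] for start in sa]  # previous characters, wrapping at position 1
    first_chars = sorted(text)             # first characters of the sorted suffixes
    occ = {}
    for i, ch in enumerate(U, start=1):
        occ.setdefault(ch, []).append(i)

    def count(c, i):
        # N[i,c]: the number of c's in U[1..i], by binary search.
        return bisect_right(occ.get(c, []), i)

    first, last = 1, n  # before stage 1, every suffix is a candidate
    for c in reversed(pattern):
        a_type = bisect_left(first_chars, c)      # suffixes starting with a smaller character
        first = a_type + count(c, first - 1) + 1  # skip A-type and B-type suffixes
        last = a_type + count(c, last)
        if first > last:
            return None
    return first, last


print(backward_search("abbabaababbb#", "aba"))
```

For P = aba the three stages yield the index ranges 2 to 6, 8 to 10 and finally 3 to 4, matching the algorithm outline.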

Conclusion
In a suffix array, one can choose the O(p + log n)-time or the O(p log |Σ|)-time search algorithm depending on the alphabet size.
Suffix arrays are more powerful than suffix trees.