1
ETRI
Linear-Time Search in Suffix Arrays
July 14, 2003
Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park
2
Suffix arrays
The suffix array of a text T is the lexicographically sorted list of all suffixes of T.
3
Suffix arrays
Example for T = abbabaababbb#
The suffixes of T,
  abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13),
are stored in lexicographical order:

   1  #
   2  aababbb#
   3  abaababbb#
   4  ababbb#
   5  abbabaababbb#
   6  abbb#
   7  b#
   8  baababbb#
   9  babaababbb#
  10  babbb#
  11  bb#
  12  bbabaababbb#
  13  bbb#

# is a special character that is lexicographically smaller than every other character.
4
Suffix arrays
Example for T = abbabaababbb#
The suffixes of T are
  abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13).
An actual suffix array stores only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.

  Row  Position  Suffix
   1     13      #
   2      6      aababbb#
   3      4      abaababbb#
   4      7      ababbb#
   5      1      abbabaababbb#
   6      9      abbb#
   7     12      b#
   8      5      baababbb#
   9      3      babaababbb#
  10      8      babbb#
  11     11      bb#
  12      2      bbabaababbb#
  13     10      bbb#
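The table above can be reproduced with a short sketch. This is a naive construction for illustration only, not a linear-time one; the function name `build_suffix_array` is hypothetical, and the code assumes the terminator `#` sorts before the letters, which holds for ASCII.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of the suffixes of text, sorted.

    Naive sketch: sorting whole suffixes is far slower than linear time.
    Assumes '#' compares smaller than every other character (true in ASCII
    for lowercase letters).
    """
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)

# Print row, starting position, and suffix, as in the slide's table.
for row, pos in enumerate(SA, start=1):
    print(row, pos, T[pos - 1:])
```

Storing only `SA` (the positions) is what an actual suffix array does; the suffix text is recovered on demand as `T[pos - 1:]`.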
5
Suffix arrays
Definition: s-suffixes are the suffixes that start with string s (a-suffixes, ba-suffixes, …).
6
Suffix arrays vs. suffix trees
Construction time: suffix array = suffix tree
Space: suffix array = suffix tree
  In practice, suffix arrays are more space efficient than suffix trees.
Search time (p = |P|, n = |T|):
  Suffix array: O(p + log n)
  Suffix tree: O(p log |Σ|)
7
Contribution
Construction time: suffix array = suffix tree
Space: suffix array = suffix tree
  In practice, suffix arrays are more space efficient than suffix trees.
Search time:
  Suffix array: O(p + log n), O(p log |Σ|)
  Suffix tree: O(p log |Σ|)
8
The meaning of our contribution
Construction time: suffix array = suffix tree
Space: suffix array = suffix tree
  In practice, suffix arrays are more space efficient than suffix trees.
Search time:
  Suffix array: O(p + log n), O(p log |Σ|)
  Suffix tree: O(p log |Σ|)
Search time: a suffix array is at least as good as a suffix tree.
9
The meaning of our contribution
Construction time: suffix array = suffix tree
Space: suffix array = suffix tree
  In practice, suffix arrays are more space efficient than suffix trees.
Search time:
  Suffix array: O(p + log n), O(p log |Σ|)
  Suffix tree: O(p log |Σ|)
Search time: a suffix array is at least as good as a suffix tree.
Suffix arrays are more powerful than suffix trees.
10
Our search algorithm
11
Search in a suffix array
Definition: search in a suffix array
Input: a pattern P and the suffix array of T
Output: all P-suffixes of T
12
Search in a suffix array
A search example: P = ab, T = abbabaababbb#
Find all ab-suffixes.
All ab-suffixes are neighbors in the suffix array (rows 3 to 6).
13
Search in a suffix array
A search example: P = ab, T = abbabaababbb#
We only have to find the first and the last ab-suffixes, because all the other ab-suffixes are stored between them.
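As a point of comparison, the first and last P-suffixes can be located with ordinary binary search over the sorted suffixes. The sketch below is that classical approach, not the backward search developed in this talk; `p_suffix_range` is a hypothetical name, and the list of truncated suffixes is built eagerly only to keep the example short.

```python
from bisect import bisect_left, bisect_right


def p_suffix_range(text, sa, pattern):
    """Return (first, last) 1-based rows of the pattern-suffixes, or None."""
    # Truncating every suffix to len(pattern) keeps the list sorted and makes
    # each P-suffix compare equal to P. A real implementation would compare
    # lazily during the binary search instead of materializing this list.
    prefixes = [text[i - 1:i - 1 + len(pattern)] for i in sa]
    first = bisect_left(prefixes, pattern)
    last = bisect_right(prefixes, pattern)
    if first == last:
        return None  # no suffix starts with the pattern
    return first + 1, last


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
ab_range = p_suffix_range(T, SA, "ab")
print(ab_range)
```

Every row between the two returned bounds is an occurrence, so reporting all matches is a simple scan of `SA[first - 1:last]`.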
14
Related work
In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001), who search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.
They search P from its last character to its first character. For example, for P = ababaaabb, the suffix abaaabb of P is matched before the remaining two characters.
We adopt this backward pattern searching idea.
15
Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#
We find all aba-suffixes by searching P backward.
Our algorithm has p stages (in this case, there are 3 stages).
16
Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes (rows 2 to 6).
17
Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes (rows 2 to 6).
Stage 2: find all ba-suffixes (rows 8 to 10).
18
Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes (rows 2 to 6).
Stage 2: find all ba-suffixes (rows 8 to 10).
Stage 3: find all aba-suffixes (rows 3 to 4).
19
Elaborate stage 2
P = aba
We explain a stage by elaborating stage 2.
We find all ba-suffixes using the a-suffixes found in stage 1: the first ba-suffix is found from the first a-suffix, and the last ba-suffix from the last a-suffix.
20
Elaborate stage 2
P = aba
We only explain how to find the first ba-suffix from the first a-suffix. Finding the last ba-suffix is similar.
21
Elaborate stage 2
P = aba
To find the first ba-suffix, we count the number of suffixes that precede the ba-suffixes in the suffix array.
22
Elaborate stage 2
Suffixes preceding the ba-suffixes are divided into two categories.
- A-type: suffixes starting with a character lexicographically smaller than b (#-suffixes and a-suffixes).
- B-type: suffixes starting with the same character b and preceding the ba-suffixes.
We count A-type and B-type suffixes in different ways.
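The split into A-type and B-type suffixes can be checked by brute force on the running example. This is only a sanity check under the slide's definitions, not the counting method the talk goes on to describe; all variable names are illustrative.

```python
T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))  # '#' sorts first in ASCII

target = "ba"
# Number of suffixes that precede the first ba-suffix (0-based index of it).
preceding_count = next(
    rank for rank, s in enumerate(suffixes) if s.startswith(target)
)
preceding = suffixes[:preceding_count]

# A-type: first character smaller than 'b'. B-type: first character is 'b'.
a_type = [s for s in preceding if s[0] < target[0]]
b_type = [s for s in preceding if s[0] == target[0]]

first_ba_row = len(a_type) + len(b_type) + 1
print(len(a_type), b_type, first_ba_row)
```

On this text there are six A-type suffixes and a single B-type suffix (`b#`), which places the first ba-suffix in row 8.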
23
Count the number of A-type suffixes
The number of A-type suffixes
= the number of #-suffixes and a-suffixes
= the position of the last a-suffix (row 6).
24
Count the number of A-type suffixes
We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix. With this array, we can count A-type suffixes in O(1) time.

  #   1
  a   6
  b  13
25
Count the number of A-type suffixes
Array M
Space: one entry per character of the alphabet
Time: O(n) (one scan of the suffix array)

  #   1
  a   6
  b  13
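A minimal sketch of the one-scan construction follows. The dictionary `M` mirrors the array on the slide; the helper `a_type_counts`, which turns `M` into the number of A-type suffixes for each character, is a hypothetical convenience and not something the slides define.

```python
def build_last_positions(text, sa):
    """M[c] = row of the last c-suffix (1-based), built in one scan of sa."""
    M = {}
    for row, pos in enumerate(sa, start=1):
        M[text[pos - 1]] = row  # later rows overwrite earlier ones
    return M


def a_type_counts(M):
    """For each character c, the number of suffixes starting with a smaller character."""
    counts, previous_last = {}, 0
    for c in sorted(M):
        counts[c] = previous_last  # equals the row of the last smaller-character suffix
        previous_last = M[c]
    return counts


T = "abbabaababbb#"
SA = [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]
M = build_last_positions(T, SA)
A = a_type_counts(M)
print(M, A)
```

With `A` precomputed, the A-type count for a stage is a single lookup, for example `A['b']` for stage 2 of the running example.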
26
Count the number of B-type suffixes
B-type suffixes are the b-suffixes preceding the ba-suffixes.
27
Count the number of B-type suffixes
B-type suffixes are the b-suffixes preceding the ba-suffixes.
The suffix obtained by removing the leftmost b from a B-type suffix appears in the suffix subarray that precedes the a-suffixes found in stage 1.
28
Count the number of B-type suffixes
The number of B-type suffixes is the number of suffixes in the suffix subarray preceding the a-suffixes whose previous character is b.
We count this with array N.
Let U be the conceptual array of the previous characters of the suffixes:

  U = b b b a # b b a b a b a a
29
Count the number of B-type suffixes
Array N: entry N[i,b] stores the number of suffixes in the suffix subarray SA[1,i] whose previous character is b.

   i  U[i]   #  a  b
   1   b     0  0  1
   2   b     0  0  2
   3   b     0  0  3
   4   a     0  1  3
   5   #     1  1  3
   6   b     1  1  4
   7   b     1  1  5
   8   a     1  2  5
   9   b     1  2  6
  10   a     1  3  6
  11   b     1  3  7
  12   a     1  4  7
  13   a     1  5  7
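Both U and N can be derived from T and the suffix array with a short sketch. The function names are hypothetical. One assumption is made explicit in the code: the suffix starting at position 1 has no previous character, and it is given `#` here because that is what U shows in row 5 on the slide.

```python
def build_U(text, sa):
    """U[i] = character preceding the suffix in row i (returned as a 0-based list)."""
    # For pos == 1, text[pos - 2] is text[-1], the terminator '#',
    # matching the '#' in row 5 of U on the slide.
    return [text[pos - 2] for pos in sa]


def build_N(U, alphabet):
    """N[i][c] = number of c's in U[1..i]; N[0] is an all-zero row for convenience."""
    N = [{c: 0 for c in alphabet}]
    for ch in U:
        row = dict(N[-1])  # copy the running counts
        row[ch] += 1
        N.append(row)
    return N


T = "abbabaababbb#"
SA = [13, 6, 4, 7, 1, 9, 12, 5, 3, 8, 11, 2, 10]
U = build_U(T, SA)
N = build_N(U, sorted(set(T)))
print("".join(U))
print(N[8])
```

The table holds one row per suffix and one column per character, which is why its size grows with both n and the alphabet size.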
30
Count the number of B-type suffixes
We can count B-type suffixes in O(1) time by accessing one entry of N.
In stage 2, the a-suffixes start at row 2, so the number of B-type suffixes is N[1,b] = 1.
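Putting the A-type and B-type counts together gives one stage of the backward search, and repeating it for each character of P from right to left gives the whole search. The following is a sketch under the slides' definitions, using the full table N for O(1) lookups; the names `build_index`, `backward_search`, and `C` (the A-type count per character) are illustrative, and the construction is deliberately naive.

```python
def build_index(text):
    """Build the A-type counts C and the full table N for text (naive sketch)."""
    n = len(text)
    sa = sorted(range(1, n + 1), key=lambda i: text[i - 1:])
    alphabet = sorted(set(text))
    # C[c]: number of suffixes whose first character is smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # N[i][c]: number of c's among the previous characters of rows 1..i.
    N = [{c: 0 for c in alphabet}]
    for pos in sa:
        row = dict(N[-1])
        row[text[pos - 2]] += 1  # pos == 1 wraps to the terminator '#'
        N.append(row)
    return C, N, n


def backward_search(pattern, C, N, n):
    """Return (first, last) rows of the pattern-suffixes, or None if there are none."""
    first, last = 1, n  # before stage 1, every suffix is a candidate
    for c in reversed(pattern):
        if c not in C:
            return None
        # A-type count C[c] plus B-type count N[first - 1][c], plus one.
        new_first = C[c] + N[first - 1][c] + 1
        new_last = C[c] + N[last][c]
        first, last = new_first, new_last
        if first > last:
            return None
    return first, last


C, N, n = build_index("abbabaababbb#")
result = backward_search("aba", C, N, n)
print(result)
```

For P = aba the successive ranges are rows 2 to 6, rows 8 to 10, and rows 3 to 4, matching the three stages in the outline.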
31
Array N
Space: O(n|Σ|) (n rows, one column per character)
An alternative way
Space: O(n)
O(log |Σ|) time for counting B-type suffixes.
32
Query for N[i,b]
Counting B-type suffixes:
- O(log n) time
- O(log |Σ|) time
33
Query for N[i,b]: O(log n) time
In the O(log n)-time algorithm, we generate an array whose ith entry stores the location of the ith b in U.

  U = b b b a # b b a b a b a a

  i         1  2  3  4  5  6  7
  location  1  2  3  6  7  9  11
34
Query for N[i,b]: O(log n) time
To count the suffixes in SA[1,8] whose previous character is b
= to count the b's in U[1,8].

  U = b b b a # b b a b a b a a
  Locations of b in U: 1 2 3 6 7 9 11
35
Query for N[i,b]: O(log n) time
Find the largest value not exceeding 8 in this array.

  Locations of b in U: 1 2 3 6 7 9 11
36
Query for N[i,b]: O(log n) time
The largest value not exceeding 8 is 7. To find 7 in this array, we perform binary search, which takes O(log n) time.

  Locations of b in U: 1 2 3 6 7 9 11
37
Query for N[i,b]: O(log n) time
The index of 7 in this array, which is 5, is the number of b's in U[1,8].

  i         1  2  3  4  5  6  7
  location  1  2  3  6  7  9  11
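The query on the last few slides amounts to a binary search over the sorted locations of one character. A minimal sketch, assuming U is available as a string and using Python's standard `bisect` module; `build_locations` and `query_N` are hypothetical names.

```python
from bisect import bisect_right


def build_locations(U):
    """Map each character to the sorted list of its 1-based locations in U."""
    locations = {}
    for i, ch in enumerate(U, start=1):
        locations.setdefault(ch, []).append(i)
    return locations


def query_N(locations, i, c):
    """Number of c's in U[1..i], found by binary search in O(log n) time."""
    # bisect_right returns how many stored locations are <= i, which is the
    # index of the largest location not exceeding i.
    return bisect_right(locations.get(c, []), i)


U = "bbba#bbababaa"
locations = build_locations(U)
count = query_N(locations, 8, "b")
print(locations["b"], count)
```

Each location is stored exactly once across all lists, so the lists for all characters together take O(n) space.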
38
Query for N[i,b]: O(log n) time
Generally, we require such arrays for all characters. They take O(n) space in total.

  U = b b b a # b b a b a b a a

  #: 5
  a: 4 8 10 12 13
  b: 1 2 3 6 7 9 11
39
Query for N[i,b]
- O(log n) time
- O(log |Σ|) time
40
Query for N[i,b]: O(log |Σ|) time
Divide U into |Σ|-sized blocks.
For the last character of each block, we compute the entries of N.

  U = b b b | a # b | b a b | a b a | a

   i   #  a  b
   3   0  0  3
   6   1  1  4
   9   1  2  6
  12   1  4  7
41
Query for N[i,b]: O(log |Σ|) time
For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm.
Binary search within a block takes O(log |Σ|) time.
The total space is still O(n).

   i   #  a  b
   3   0  0  3
   6   1  1  4
   9   1  2  6
  12   1  4  7
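One way to realize the blocked structure is sketched below: rows of N are sampled at block boundaries, and within each block the locations of each character are kept for a short binary search. This is an illustration under assumptions, not the authors' implementation. The block size is taken to be the alphabet size, as the example suggests, and the per-block dictionaries stand in for whatever compact layout the original uses.

```python
from bisect import bisect_right


def build_blocked(U, alphabet):
    """Sample N at the end of every block and store per-block character offsets."""
    b = len(alphabet)  # block size (assumed equal to the alphabet size)
    samples = [{c: 0 for c in alphabet}]  # samples[k] = counts in U[1..k*b]
    blocks = []  # blocks[k][c] = sorted 1-based offsets of c inside block k
    for start in range(0, len(U), b):
        offsets, row = {}, dict(samples[-1])
        for off, ch in enumerate(U[start:start + b], start=1):
            offsets.setdefault(ch, []).append(off)
            row[ch] += 1
        blocks.append(offsets)
        samples.append(row)
    return b, samples, blocks


def query_N(structure, i, c):
    """Number of c's in U[1..i], using one sampled row plus an in-block search."""
    b, samples, blocks = structure
    k, r = divmod(i, b)  # k full blocks, then r positions into block k
    count = samples[k][c]
    if r:
        # Each per-character list has at most b entries, so this search
        # inspects O(log b) of them.
        count += bisect_right(blocks[k].get(c, []), r)
    return count


U = "bbba#bbababaa"
structure = build_blocked(U, ["#", "a", "b"])
print(query_N(structure, 8, "b"))
```

The sampled rows number about n/|Σ| with |Σ| entries each, and every position of U appears in exactly one offset list, so the total space stays linear in n.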
42
Summary
p stages. Each stage:
- Count A-type suffixes. Time: O(1). Space: O(n) for array M.
- Count B-type suffixes. Time: O(log |Σ|). Space: O(n) for computing the value of an entry of N.
In total, O(p log |Σ|) time with O(n) space.
43
Conclusion
In a suffix array, one can choose an O(p + log n)-time or an O(p log |Σ|)-time search algorithm depending on the alphabet size.
Suffix arrays are more powerful than suffix trees.