ETRI
Linear-Time Search in Suffix Arrays
July 14, 2003
Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park
Suffix arrays
The suffix array of a text T is the lexicographically sorted list of all suffixes of T.
Suffix arrays: example for T = abbabaababbb#
The suffixes of T are abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13), where the number in parentheses is the starting position in T. # is a special character that is lexicographically smaller than every other character. The suffixes are stored in lexicographical order:

Index  Start  Suffix
1      13     #
2      6      aababbb#
3      4      abaababbb#
4      7      ababbb#
5      1      abbabaababbb#
6      9      abbb#
7      12     b#
8      5      baababbb#
9      3      babaababbb#
10     8      babbb#
11     11     bb#
12     2      bbabaababbb#
13     10     bbb#

In actual suffix arrays, only the starting positions of the suffixes in T are stored; for convenience, we assume that the suffixes themselves are stored.
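The table can be reproduced with a minimal sketch. It sorts the suffixes directly, which is a naive construction chosen for brevity and not the construction algorithm the slides refer to; the function name `build_suffix_array` is an illustrative assumption.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of the suffixes in sorted order."""
    # Naive construction: sort the suffixes as plain strings.
    # '#' sorts before the letters in ASCII, as the slides require.
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)
for index, start in enumerate(SA, start=1):
    print(index, start, T[start - 1:])
```

The second printed column is what a real suffix array stores; the third column is shown only for readability.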
Suffix arrays: s-suffixes
Definition: an s-suffix is a suffix that starts with the string s (a-suffixes, ba-suffixes, …). In the example suffix array, the a-suffixes are entries 2 to 6 and the ba-suffixes are entries 8 to 10.
Suffix arrays vs. suffix trees
Construction time: suffix array = suffix tree.
Space: suffix array = suffix tree. In practice, suffix arrays are more space efficient than suffix trees.
Search time (p = |P|, n = |T|): suffix array: [formula missing]; suffix tree: [formula missing].
Contribution
Construction time: suffix array = suffix tree.
Space: suffix array = suffix tree. In practice, suffix arrays are more space efficient than suffix trees.
Search time: suffix array: [formulas missing]; suffix tree: [formula missing].
The meaning of our contribution
Search time: SA [relation missing] ST.
Suffix arrays are more powerful than suffix trees.
Our search algorithm
Search in a suffix array
Definition: search in a suffix array
Input: a pattern P and the suffix array of T.
Output: all P-suffixes of T.
Search in a suffix array: a search example
P = ab, T = abbabaababbb#. Find all ab-suffixes.
All ab-suffixes are neighbors in the suffix array (entries 3 to 6). We therefore only have to find the first and the last ab-suffix, because the other ab-suffixes are stored between them.
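The neighbor property can be checked with a short sketch. It scans every suffix, so it only illustrates the output of a search (a first and a last position) and is not the search algorithm of the slides; `p_suffix_range` is a hypothetical helper name.

```python
T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))


def p_suffix_range(suffixes, pattern):
    """Return the 1-based (first, last) positions of the pattern-suffixes, or None."""
    hits = [k + 1 for k, s in enumerate(suffixes) if s.startswith(pattern)]
    if not hits:
        return None
    # Sorted order guarantees that the hits occupy consecutive positions.
    assert hits == list(range(hits[0], hits[-1] + 1))
    return hits[0], hits[-1]


print(p_suffix_range(suffixes, "ab"))  # (3, 6)
```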
Related work
In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001), who search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm. They search P from its last character to its first character; for P = ababaaabb, for example, the suffix abaaabb of P is matched before the whole pattern. We adopt this backward pattern searching idea.
Algorithm outline
P = aba, T = abbabaababbb#.
We find all aba-suffixes by searching P backward. The algorithm has p stages (3 in this case):
Stage 1: find all a-suffixes (entries 2 to 6).
Stage 2: find all ba-suffixes (entries 8 to 10).
Stage 3: find all aba-suffixes (entries 3 to 4).
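The sequence of stages can be traced with a small sketch. Each stage here is computed by a naive scan over all suffixes, purely to show which range each stage should produce; the counting technique that makes a stage fast is not used. The name `backward_stages` is an assumption.

```python
T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))


def backward_stages(suffixes, pattern):
    """Yield (searched string, first, last) for each stage, with 1-based positions."""
    for k in range(len(pattern) - 1, -1, -1):
        s = pattern[k:]  # stage 1 uses the last character, stage p the whole pattern
        hits = [i + 1 for i, suf in enumerate(suffixes) if suf.startswith(s)]
        if not hits:
            return  # the pattern does not occur in T
        yield s, hits[0], hits[-1]


for stage in backward_stages(suffixes, "aba"):
    print(stage)
```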
Elaborating stage 2 (P = aba)
We explain a stage by elaborating stage 2, which finds all ba-suffixes using the a-suffixes found in stage 1. We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix. We only explain how to find the first ba-suffix; finding the last ba-suffix is similar.
To find the first ba-suffix, we count the suffixes that precede the ba-suffixes in the suffix array. These suffixes are divided into two categories:
- A-type: suffixes starting with a character lexicographically smaller than b (#-suffixes and a-suffixes).
- B-type: suffixes starting with the same character b and preceding the ba-suffixes.
We count A-type and B-type suffixes in different ways.
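As a sanity check of the two categories, the sketch below counts them by brute force on the example and derives the position of the first ba-suffix. The variable names are illustrative, and the brute-force counting stands in for the faster methods described on the following slides.

```python
T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))

# A-type: suffixes whose first character is smaller than 'b'.
a_type = sum(1 for s in suffixes if s[0] < "b")
# B-type: b-suffixes that sort before every ba-suffix.
b_type = sum(1 for s in suffixes if s[0] == "b" and s < "ba")

# Everything counted above precedes the first ba-suffix.
first_ba = a_type + b_type + 1
print(a_type, b_type, first_ba)  # 6 1 8
```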
Counting A-type suffixes
The number of A-type suffixes = the number of #-suffixes and a-suffixes = the position of the last a-suffix (6 in the example).
We generate an array M that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix: M[#] = 1, M[a] = 6, M[b] = 13. With this array, we can count A-type suffixes in O(1) time.
Array M: space: one entry per character; construction time: O(n) (one scan of the suffix array).
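A minimal sketch of the one-scan construction follows. The array is called M as in the summary slide; storing it as a Python dict and the function name `build_last_positions` are assumptions made for illustration.

```python
T = "abbabaababbb#"
suffixes = sorted(T[i:] for i in range(len(T)))


def build_last_positions(suffixes):
    """Map each character c to the 1-based position of the last c-suffix."""
    M = {}
    for pos, suf in enumerate(suffixes, start=1):
        M[suf[0]] = pos  # later positions overwrite earlier ones
    return M


M = build_last_positions(suffixes)
print(M)  # {'#': 1, 'a': 6, 'b': 13}
```

For stage 2 of the example, the A-type count is then simply `M["a"]`, which is 6.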
Counting B-type suffixes
B-type suffixes are the b-suffixes preceding the ba-suffixes. Removing the leftmost b from a B-type suffix gives a suffix that appears in the suffix subarray preceding the a-suffixes found in stage 1. Hence the number of B-type suffixes is the number of suffixes, in the subarray that precedes the a-suffixes, whose previous characters are b.
Let U be the conceptual array of the previous characters of the suffixes, in suffix array order:
U = b b b a # b b a b a b a a
We count with an array N that has one column per character (#, a, b). The entry N[i,b] stores the number of suffixes in the subarray SA[1,i] whose previous characters are b. We can count B-type suffixes in O(1) time by accessing one entry of N.
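Putting the two counts together, one stage maps the range of s-suffixes to the range of cs-suffixes. The sketch below stores N as a full table and assumes that the pattern does not contain #; the helper names `smaller_count`, `stage`, and `search` are hypothetical, and `smaller_count` uses a linear lookup over the alphabet for brevity where a precomputed table would give O(1).

```python
T = "abbabaababbb#"
n = len(T)
SA = sorted(range(1, n + 1), key=lambda i: T[i - 1:])  # 1-based start positions
U = [T[i - 2] for i in SA]  # previous characters; index -1 wraps to '#' for suffix 1
alphabet = sorted(set(T))

# N[i][c] = number of c's in U[1..i]; row 0 is all zeros.
N = [dict.fromkeys(alphabet, 0)]
for ch in U:
    row = dict(N[-1])
    row[ch] += 1
    N.append(row)

# M[c] = position of the last c-suffix.
M = {}
for pos, i in enumerate(SA, start=1):
    M[T[i - 1]] = pos


def smaller_count(c):
    """A-type count: suffixes whose first character is smaller than c."""
    k = alphabet.index(c)
    return M[alphabet[k - 1]] if k > 0 else 0


def stage(first, last, c):
    """From the range of s-suffixes, compute the range of cs-suffixes."""
    a = smaller_count(c)
    return a + N[first - 1][c] + 1, a + N[last][c]


def search(pattern):
    """Return the 1-based (first, last) range of pattern-suffixes, or None."""
    first, last = 1, n  # every suffix starts with the empty string
    for c in reversed(pattern):
        if c not in M:
            return None
        first, last = stage(first, last, c)
        if first > last:
            return None
    return first, last


print(search("aba"))  # (3, 4)
```

In stage 2 of the example the call is `stage(2, 6, "b")`: the A-type count is 6, `N[1]["b"]` is 1 and `N[6]["b"]` is 4, which gives the range 8 to 10.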
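Array N
Space: O(n|Σ|), where |Σ| is the alphabet size (one entry for each position i and each character).
An alternative way: O(n) space, with O(log n) or O(log |Σ|) time for counting B-type suffixes.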
Query for N[i,b]
Counting B-type suffixes in O(n) space: O(log n) time, or O(log |Σ|) time, where |Σ| is the alphabet size.
Query for N[i,b]: O(log n) time
In the O(log n)-time algorithm, we generate an array whose ith entry stores the location of the ith b in U. For U = b b b a # b b a b a b a a, this array is 1, 2, 3, 6, 7, 9, 11.
Example: counting the suffixes in SA[1,8] whose previous characters are b is the same as counting the b's in U[1,8]. We find the largest value not exceeding 8 in the array, which is 7, by binary search in O(log n) time. The index of 7 in the array, which is 5, is the number of b's in U[1,8].
Generally, we require such an array for every character:
#: 5
a: 4, 8, 10, 12, 13
b: 1, 2, 3, 6, 7, 9, 11
Together, the arrays take O(n) space.
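The query can be sketched with Python's standard `bisect` module. The dict `locations` and the function `count_upto` are illustrative names; `bisect_right` returns the number of stored locations that do not exceed i, which is the index of the largest such value.

```python
from bisect import bisect_right

U = "bbba#bbababaa"  # previous-character array from the slides

# For each character, the sorted 1-based locations where it occurs in U.
locations = {}
for pos, ch in enumerate(U, start=1):
    locations.setdefault(ch, []).append(pos)


def count_upto(i, c):
    """Return N[i, c], the number of c's in U[1..i], by binary search."""
    return bisect_right(locations.get(c, []), i)


print(locations["b"])      # [1, 2, 3, 6, 7, 9, 11]
print(count_upto(8, "b"))  # 5
```

Every position of U appears in exactly one list, so the lists hold n entries in total.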
Query for N[i,b]: from O(log n) time to O(log |Σ|) time
Query for N[i,b]: O(log |Σ|) time
Divide U into blocks of size |Σ| (the alphabet size). For the last character of each block, we compute the entries of N (columns #, a, b). For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm, so that binary search inside a block takes O(log |Σ|) time. The space is still O(n) in total.
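The block scheme might be sketched as follows, assuming the block size equals the alphabet size (3 in the example). The names `samples`, `in_block`, and `count_upto` are hypothetical, and the per-block dicts are a convenience of this sketch and not necessarily the layout the authors use.

```python
from bisect import bisect_right

U = "bbba#bbababaa"
alphabet = sorted(set(U))
B = len(alphabet)  # block size, assumed to be the alphabet size

samples = [dict.fromkeys(alphabet, 0)]  # samples[k][c] = number of c's in U[1..k*B]
in_block = []                           # in_block[k][c] = sorted offsets of c in block k
for start in range(0, len(U), B):
    counts = dict(samples[-1])
    offsets = {}
    for off, ch in enumerate(U[start:start + B], start=1):
        counts[ch] += 1
        offsets.setdefault(ch, []).append(off)
    samples.append(counts)
    in_block.append(offsets)


def count_upto(i, c):
    """Return N[i, c], the number of c's in U[1..i], for 0 <= i <= len(U)."""
    k, r = divmod(i, B)  # k full blocks, then r characters into block k
    total = samples[k].get(c, 0)
    if r:
        # Binary search over at most B offsets inside one block.
        total += bisect_right(in_block[k].get(c, []), r)
    return total


print(count_upto(8, "b"))  # 5
```

With blocks of this size there are about n/|Σ| sampled rows of |Σ| counters each, and the in-block lists hold n offsets in total, which is consistent with the O(n) space stated on the slide.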
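Summary
The search has p stages. Each stage:
- Count A-type suffixes. Time: O(1). Space: O(n) for the M array.
- Count B-type suffixes. Time: O(log n) or O(log |Σ|), depending on the query structure. Space: O(n) for computing the value of an entry of N.
In total, O(p log n) or O(p log |Σ|) time with O(n) space.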
Conclusion
In a suffix array, one can choose the O(p log n) or the O(p log |Σ|) search time algorithm depending on the alphabet size |Σ|.
Suffix arrays are more powerful than suffix trees.