ETRI Linear-Time Search in Suffix Arrays, July 14, 2003. Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park.


1 ETRI Linear-Time Search in Suffix Arrays
July 14, 2003
Jeong Seop Sim, Dong Kyue Kim, Heejin Park, Kunsoo Park

2 ETRI Suffix arrays
Suffix array of text T: the lexicographically sorted list of all suffixes of T.

3 ETRI Suffix arrays
Example for T = abbabaababbb#
The suffixes of T
  abbabaababbb# (1)
  bbabaababbb# (2)
  babaababbb# (3)
  …
  b# (12)
  # (13)
are stored in lexicographical order:
   1  #
   2  aababbb#
   3  abaababbb#
   4  ababbb#
   5  abbabaababbb#
   6  abbb#
   7  b#
   8  baababbb#
   9  babaababbb#
  10  babbb#
  11  bb#
  12  bbabaababbb#
  13  bbb#
# is a special character that is lexicographically smaller than every other character.

4 ETRI Suffix arrays
Example for T = abbabaababbb#
The suffixes of T are abbabaababbb# (1), bbabaababbb# (2), babaababbb# (3), …, b# (12), # (13).
In actual suffix arrays, we store only the starting positions of the suffixes in T, but for convenience we assume that the suffixes themselves are stored.
  Rank  Position  Suffix
     1        13  #
     2         6  aababbb#
     3         4  abaababbb#
     4         7  ababbb#
     5         1  abbabaababbb#
     6         9  abbb#
     7        12  b#
     8         5  baababbb#
     9         3  babaababbb#
    10         8  babbb#
    11        11  bb#
    12         2  bbabaababbb#
    13        10  bbb#
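As a minimal sketch of the example on slide 4, the following Python snippet builds the suffix array of T by sorting the suffixes directly. The name build_suffix_array is an assumption made for illustration, and the naive sort is only meant for this toy text; it is not the linear-time construction the talk refers to.

```python
T = "abbabaababbb#"

def build_suffix_array(text):
    """Return the 1-based starting positions of all suffixes in sorted order."""
    # '#' (ASCII 35) sorts before the lowercase letters, so Python's default
    # string ordering matches the convention that '#' is the smallest character.
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])

SA = build_suffix_array(T)

# Print rank, starting position, and the suffix itself.
for rank, pos in enumerate(SA, start=1):
    print(rank, pos, T[pos - 1:])
```

Only the list SA would be stored in practice; the suffix strings are printed here for readability.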

5 ETRI Suffix arrays
Definition: s-suffixes are the suffixes that start with string s (a-suffixes, ba-suffixes, …).

6 ETRI Suffix arrays vs. Suffix trees
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree (in practice, suffix arrays are more space efficient than suffix trees)
Search time (p = |P|, n = |T|): Suffix Array: […]; Suffix Tree: […]

7 ETRI Contribution
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree (in practice, suffix arrays are more space efficient than suffix trees)
Search time: Suffix Array: […]; Suffix Tree: […]

8 ETRI The meaning of our contribution
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree (in practice, suffix arrays are more space efficient than suffix trees)
Search time: Suffix Array: […]; Suffix Tree: […]
Search time: SA […] ST

9 ETRI The meaning of our contribution
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree (in practice, suffix arrays are more space efficient than suffix trees)
Search time: Suffix Array: […]; Suffix Tree: […]
Search time: SA […] ST
Suffix arrays are more powerful than suffix trees.

10 ETRI Our search algorithm

11 ETRI Search in a suffix array
Definition: search in a suffix array
  Input: a pattern P and a suffix array of T
  Output: all P-suffixes of T

12 ETRI Search in a suffix array
A search example: P = ab, T = abbabaababbb#. Find all ab-suffixes.
All ab-suffixes are neighbors in the suffix array (ranks 3 to 6).

13 ETRI Search in a suffix array
A search example: P = ab, T = abbabaababbb#.
We only have to find the first and the last ab-suffixes, because all other ab-suffixes are stored between them.
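The observation on slide 13 can be illustrated with the textbook binary search over the suffix array. This is a hedged sketch of the conventional approach, not the backward-search algorithm the talk goes on to present; the function name p_suffix_range and the naive suffix-array construction are assumptions made for illustration.

```python
T = "abbabaababbb#"
# Naive construction for the toy example: 1-based positions in sorted order.
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])

def p_suffix_range(text, sa, pattern):
    """Return (first, last) 1-based ranks of the pattern-suffixes, or None."""
    def prefix(idx):
        # First len(pattern) characters of the suffix at 0-based index idx.
        start = sa[idx] - 1
        return text[start:start + len(pattern)]

    # Lower bound: first index whose truncated suffix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first index whose truncated suffix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    if first == lo:
        return None          # no suffix starts with the pattern
    return first + 1, lo     # convert to 1-based inclusive ranks

print(p_suffix_range(T, SA, "ab"))
```

Each comparison inspects up to p characters, so this sketch takes O(p log n) time; it only shows that locating the two boundary suffixes is enough.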

14 ETRI Related work
In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001):
- Search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.
- Search P from the last character to the first character of P (example: P = ababaaabb).
We adopt this backward pattern searching idea.

15 ETRI Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#.
We find all aba-suffixes by searching P backward.
Our algorithm has p stages (in this case, there are 3 stages).

16 ETRI Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#.
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes.

17 ETRI Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#.
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes.
Stage 2: find all ba-suffixes.

18 ETRI Algorithm outline
Outline of our search algorithm: P = aba, T = abbabaababbb#.
We find all aba-suffixes by searching P backward.
Stage 1: find all a-suffixes.
Stage 2: find all ba-suffixes.
Stage 3: find all aba-suffixes.
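To make the three stages concrete, the sketch below lists, for each stage, the range of ranks that the stage is supposed to find. It deliberately uses a brute-force scan with startswith, so it shows only what each stage outputs, not how the talk's algorithm computes it; the name stage_ranges is an assumption made for illustration.

```python
T = "abbabaababbb#"
# Naive construction for the toy example: 1-based positions in sorted order.
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])

def stage_ranges(text, sa, pattern):
    """For stage k, report the ranks of the suffixes starting with the last k characters."""
    result = []
    for k in range(1, len(pattern) + 1):
        s = pattern[-k:]                      # stage k looks for s-suffixes
        ranks = [rank for rank, pos in enumerate(sa, start=1)
                 if text[pos - 1:].startswith(s)]
        if ranks:
            result.append((s, ranks[0], ranks[-1]))
        else:
            result.append((s, None, None))
    return result

for s, first, last in stage_ranges(T, SA, "aba"):
    print(f"{s}-suffixes: ranks {first} to {last}")
```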

19 ETRI Elaborate stage 2
We explain a stage by elaborating stage 2 (P = aba).
We find all ba-suffixes using the a-suffixes found in stage 1.
We find the first ba-suffix from the first a-suffix and the last ba-suffix from the last a-suffix.

20 ETRI Elaborate stage 2
We explain a stage by elaborating stage 2 (P = aba).
We only explain how to find the first ba-suffix from the first a-suffix; finding the last ba-suffix is similar.

21 ETRI Elaborate stage 2
P = aba
To find the first ba-suffix, we count the number of suffixes that precede the ba-suffixes in the suffix array.

22 ETRI Elaborate stage 2
Suffixes preceding ba-suffixes are divided into two categories.
- A-type: suffixes starting with characters lexicographically smaller than b (#-suffixes, a-suffixes).
- B-type: suffixes starting with the same character b and preceding ba-suffixes.
We count A-type and B-type suffixes in different ways.

23 ETRI Count the number of A-type suffixes
The number of A-type suffixes = the number of #-suffixes and a-suffixes = the position of the last a-suffix (6 in the example).

24 ETRI Count the number of A-type suffixes
We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix. With this array, we can count A-type suffixes in O(1) time.
  #   1
  a   6
  b  13

25 ETRI Count the number of A-type suffixes
Array:
  #   1
  a   6
  b  13
Space: […]
Time: O(n) (one scan)
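The one-scan construction of this array can be sketched as follows. The array is called M here, following the name used in the summary slide; the helper names build_M and count_A_type are assumptions made for illustration, and a dictionary stands in for an array indexed by character.

```python
T = "abbabaababbb#"
# Naive construction for the toy example: 1-based positions in sorted order.
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])

def build_M(text, sa):
    """M[c] = rank of the last c-suffix, filled in one scan over the suffix array."""
    M = {}
    for rank, pos in enumerate(sa, start=1):
        M[text[pos - 1]] = rank   # a later rank overwrites an earlier one
    return M

def count_A_type(M, c):
    """Number of suffixes whose first character is smaller than c."""
    # This scan over the keys costs O(|alphabet|); indexing M by character
    # code with a precomputed predecessor would make the lookup O(1).
    smaller = [ch for ch in M if ch < c]
    return M[max(smaller)] if smaller else 0

M = build_M(T, SA)
print(M, count_A_type(M, "b"))
```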

26 ETRI Count the number of B-type suffixes
B-type suffixes: b-suffixes preceding ba-suffixes.

27 ETRI Count the number of B-type suffixes
B-type suffixes: b-suffixes preceding ba-suffixes.
A suffix generated by removing the leftmost b from a B-type suffix appears in the suffix subarray preceding the a-suffixes found in stage 1.
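The correspondence on slide 27 can be checked on the example. The sketch below computes the B-type suffixes twice: once directly from the definition, and once by taking the suffixes ranked before the first a-suffix whose preceding character in T is b and putting that b back in front. Variable names are assumptions made for illustration.

```python
T = "abbabaababbb#"
# Naive construction for the toy example: 1-based positions in sorted order.
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])
suffixes = [T[pos - 1:] for pos in SA]

# Rank of the first a-suffix (result of stage 1).
first_a = next(rank for rank, suf in enumerate(suffixes, start=1)
               if suf.startswith("a"))

# Direct definition: b-suffixes that sort before every ba-suffix.
direct = [suf for suf in suffixes if suf.startswith("b") and suf < "ba"]

# Via the previous character: suffixes ranked before the first a-suffix
# that are preceded by 'b' in T; prepending 'b' recovers the B-type suffix.
via_previous = ["b" + suf
                for rank, (pos, suf) in enumerate(zip(SA, suffixes), start=1)
                if rank < first_a and T[pos - 2] == "b"]

print(direct, via_previous)
```

In this example the only B-type suffix is b#, which corresponds to the suffix # at rank 1.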

28 ETRI Count the number of B-type suffixes
The number of B-type suffixes is the number of suffixes, in the suffix subarray preceding the a-suffixes, whose previous characters are b. We count this with array N.
Let U be the conceptual array of the previous characters of the suffixes:
  U = b b b a # b b a b a b a a

29 ETRI Count the number of B-type suffixes
Entry N[i,b] stores the number of suffixes whose previous characters are b in the suffix subarray SA[1,i].
   i  U    #  a  b
   1  b    0  0  1
   2  b    0  0  2
   3  b    0  0  3
   4  a    0  1  3
   5  #    1  1  3
   6  b    1  1  4
   7  b    1  1  5
   8  a    1  2  5
   9  b    1  2  6
  10  a    1  3  6
  11  b    1  3  7
  12  a    1  4  7
  13  a    1  5  7
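A minimal sketch of how U and N could be computed for the example follows. The function names build_U and build_N are assumptions made for illustration, N is stored as a list of per-character dictionaries with an extra all-zero row N[0], and the suffix starting at position 1 is given the previous character # by wrapping around, which matches the U shown on the slide.

```python
T = "abbabaababbb#"
# Naive construction for the toy example: 1-based positions in sorted order.
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])

def build_U(text, sa):
    """U[i] = character preceding the i-th ranked suffix in the text."""
    # For pos == 1 the index pos - 2 is -1, which wraps to the final '#'.
    return [text[pos - 2] for pos in sa]

def build_N(U, alphabet):
    """N[i][c] = number of occurrences of c in U[1..i]; N[0] is all zeros."""
    counts = dict.fromkeys(alphabet, 0)
    N = [dict(counts)]
    for ch in U:
        counts[ch] += 1
        N.append(dict(counts))
    return N

U = build_U(T, SA)
N = build_N(U, sorted(set(T)))
print("".join(U))
print(N[8]["b"])   # number of b's among the previous characters of SA[1,8]
```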

30 ETRI Count the number of B-type suffixes
We can count B-type suffixes in O(1) time by accessing an entry of N.
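Putting the A-type and B-type counts together, one possible reading of a full backward search is sketched below. It is an illustration under stated assumptions rather than the authors' implementation: the name backward_search is hypothetical, M and N are plain Python dictionaries and lists, the full table N is stored, and patterns are assumed not to contain the special character #.

```python
T = "abbabaababbb#"
# Naive construction for the toy example: 1-based positions in sorted order.
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])
ALPHABET = sorted(set(T))

# M[c]: rank of the last c-suffix.
M = {}
for rank, pos in enumerate(SA, start=1):
    M[T[pos - 1]] = rank

# N[i][c]: number of c's among the previous characters of SA[1..i].
counts = dict.fromkeys(ALPHABET, 0)
N = [dict(counts)]
for pos in SA:
    counts[T[pos - 2]] += 1   # index -1 wraps to '#' for the suffix at position 1
    N.append(dict(counts))

def backward_search(pattern):
    """Return (first, last) 1-based ranks of the pattern-suffixes, or None."""
    first, last = 1, len(SA)          # before stage 1, every suffix qualifies
    for c in reversed(pattern):       # one stage per character, right to left
        if c not in M:
            return None
        smaller = [ch for ch in ALPHABET if ch < c]
        a_type = M[smaller[-1]] if smaller else 0     # A-type count
        new_first = a_type + N[first - 1][c] + 1      # A-type + B-type + 1
        new_last = a_type + N[last][c]
        if new_first > new_last:
            return None
        first, last = new_first, new_last
    return first, last

print(backward_search("aba"))
```

For P = aba the stages yield ranks 2 to 6, then 8 to 10, then 3 to 4, in line with the stage 2 walkthrough where 6 A-type suffixes and 1 B-type suffix place the first ba-suffix at rank 8.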

31 ETRI Array N
Array N (one row per suffix, one column per character): Space: […]
An alternative way: Space: O(n), with […] time for counting B-type suffixes.

32 ETRI Query for N[i,b]
Counting B-type suffixes:
- O(log n) time
- O(log |Σ|) time

33 ETRI Query for N[i,b]: O(log n) time
In the O(log n)-time algorithm, we generate an array whose ith entry stores the location of the ith b in U.
  U = b b b a # b b a b a b a a
  i         1  2  3  4  5  6  7
  location  1  2  3  6  7  9  11

34 ETRI Query for N[i,b]: O(log n) time
Counting the suffixes whose previous characters are b in SA[1,8] is the same as counting the b's in U[1,8].
  U = b b b a # b b a b a b a a
  locations of b: 1 2 3 6 7 9 11

35 ETRI Query for N[i,b]: O(log n) time
Find the largest value not exceeding 8 in this array.
  locations of b: 1 2 3 6 7 9 11

36 ETRI Query for N[i,b]: O(log n) time
To find 7 in this array, we perform binary search: O(log n) time.
  locations of b: 1 2 3 6 7 9 11

37 ETRI Query for N[i,b]: O(log n) time
The index of 7 in this array, which is 5, is the number of b's in U[1,8].
  locations of b: 1 2 3 6 7 9 11
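The O(log n) query described on slides 33 to 37 can be sketched with Python's bisect module. The names occurrence_lists and query_N are assumptions made for illustration; bisect_right returns the number of stored locations that do not exceed i, which is the index of the largest such location.

```python
from bisect import bisect_right

U = "bbba#bbababaa"   # previous characters of the ranked suffixes, from the slides

def occurrence_lists(U):
    """For each character, the sorted 1-based locations where it occurs in U."""
    occ = {}
    for i, ch in enumerate(U, start=1):
        occ.setdefault(ch, []).append(i)
    return occ

def query_N(occ, i, c):
    """N[i, c]: number of occurrences of c in U[1..i], by binary search."""
    return bisect_right(occ.get(c, []), i)

occ = occurrence_lists(U)
print(occ["b"])              # locations of b in U
print(query_N(occ, 8, "b"))  # number of b's in U[1,8]
```

Every location of U is stored exactly once across all lists, so the lists together take O(n) space.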

38 ETRI Query for N[i,b]: O(log n) time
Generally, we require arrays for all characters: O(n) space.
  U = b b b a # b b a b a b a a
  locations of #: 5
  locations of a: 4 8 10 12 13
  locations of b: 1 2 3 6 7 9 11

39 ETRI Query for N[i,b]
- O(log n) time
- O(log |Σ|) time

40 ETRI Query for N[i,b]: O(log |Σ|) time
Divide U into |Σ|-sized blocks. For the last character of each block, we compute the entries of N.
  U = b b b a # b b a b a b a a
   i    #  a  b
   3    0  0  3
   6    1  1  4
   9    1  2  6
  12    1  4  7

41 ETRI Query for N[i,b]: O(log |Σ|) time
For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm.
O(log |Σ|) time for binary search. Still O(n) space in total.
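One way to read the block scheme of slides 40 and 41 is sketched below, assuming blocks of |Σ| consecutive entries of U. The names build_blocks and query_N, and the use of per-block dictionaries to find a character's location list, are assumptions made for illustration; the talk does not spell out the exact layout.

```python
from bisect import bisect_right

U = "bbba#bbababaa"   # previous characters of the ranked suffixes, from the slides

def build_blocks(U, alphabet):
    """Sample N at block boundaries and keep per-block occurrence lists."""
    B = len(alphabet)                       # block size = alphabet size
    counts = dict.fromkeys(alphabet, 0)
    samples = [dict(counts)]                # samples[q] = counts over U[1..q*B]
    in_block = []                           # in_block[q][c] = locations of c in block q
    for start in range(0, len(U), B):
        occ = {}
        for i in range(start, min(start + B, len(U))):
            occ.setdefault(U[i], []).append(i + 1)   # store 1-based locations
            counts[U[i]] += 1
        in_block.append(occ)
        samples.append(dict(counts))
    return B, samples, in_block

def query_N(B, samples, in_block, i, c):
    """N[i, c]: sampled count plus a binary search inside one block."""
    q, r = divmod(i, B)
    if r == 0:
        return samples[q][c]                # i falls exactly on a block boundary
    # At most B locations are stored per block, so this search is O(log B).
    return samples[q][c] + bisect_right(in_block[q].get(c, []), i)

B, samples, in_block = build_blocks(U, "#ab")
print(query_N(B, samples, in_block, 8, "b"))
```

The samples take about (n / |Σ|) rows of |Σ| counters and each location of U is stored once inside its block, so the total space stays O(n).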

42 ETRI Summary
p stages. Each stage:
- Count A-type suffixes. Time: O(1). Space: O(n) for the M array.
- Count B-type suffixes. Time: O(log |Σ|). Space: O(n) for computing the value of an entry of N.
In total, O(p log |Σ|) time with O(n) space.

43 ETRI Conclusion
In a suffix array, one can choose a […] or […] search time algorithm depending on the alphabet size.
Suffix arrays are more powerful than suffix trees.

