1 Suffix arrays

2 Suffix array We lose some of the functionality but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 2 0 3 1
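The definition can be checked directly with a minimal sketch. The function name `suffix_array` is an illustrative choice, and sorting whole suffixes is a naive method, not the linear-time construction discussed later.

```python
def suffix_array(s: str) -> list[int]:
    """Return the start indices of the suffixes of s in lexicographic order."""
    # Sorting materialised suffixes is O(n^2 log n) in the worst case;
    # it is only meant to illustrate the definition.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```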

3 How do we build it? Build a suffix tree. Traverse the tree in DFS, picking the edges outgoing from each node in lexicographic order, and fill the suffix array. O(n) time
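The traversal idea can be sketched on an uncompressed suffix trie, which is easier to write down than a suffix tree. This is an assumption made for brevity: the trie takes O(n^2) time and space, so the sketch shows only the lexicographic DFS, not the O(n) bound. The name `suffix_array_via_trie` is hypothetical.

```python
def suffix_array_via_trie(s: str) -> list[int]:
    """Build a suffix trie of s + '$' and read the suffix array off a DFS."""
    text = s + "$"
    root: dict = {}
    for i in range(len(s)):              # the suffix "$" alone is skipped
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["index"] = i                # leaf marker; never clashes with 1-char keys

    order: list[int] = []

    def dfs(node: dict) -> None:
        if "index" in node:
            order.append(node["index"])
        # Visit outgoing edges in lexicographic order ('$' sorts before letters).
        for ch in sorted(k for k in node if k != "index"):
            dfs(node[ch])

    dfs(root)
    return order


print(suffix_array_via_trie("abab"))  # [2, 0, 3, 1]
```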

4 How do we search for a pattern? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. Takes O(m log n) time, where m = |P| and n = |T|

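5 Example Let S = mississippi and let P = issa. L and R mark the ends of the current search interval and M its midpoint. Suffix array and sorted suffixes:
10 i
7 ippi
4 issippi
1 ississippi
0 mississippi
9 pi
8 ppi
6 sippi
3 sissippi
5 ssippi
2 ssissippi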
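A minimal sketch of the plain binary search, run on this example. The function name `find_occurrences` is illustrative, and the suffix array is built naively by sorting. Each probe compares at most m characters, which gives the O(m log n) bound.

```python
def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)

    # First rank whose suffix, truncated to m characters, is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First rank whose truncated suffix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All occurrences are consecutive in the suffix array: ranks start..lo-1.
    return sorted(sa[start:lo])


S = "mississippi"
SA = sorted(range(len(S)), key=lambda i: S[i:])  # naive construction
print(find_occurrences(S, SA, "issa"))  # []
print(find_occurrences(S, SA, "issi"))  # [1, 4]
```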

6 How do we accelerate the search? Maintain ℓ = LCP(P, L). Maintain r = LCP(P, R). Assume ℓ ≥ r

7 If ℓ = r then start comparing M to P at position ℓ + 1

8 Otherwise ℓ > r

9 Someone whispers LCP(L,M). First case: LCP(L,M) > ℓ

10 LCP(L,M) > ℓ: M agrees with L beyond position ℓ, so P is larger than M. Continue in the right half

11 Second case: LCP(L,M) < ℓ

12 LCP(L,M) < ℓ: P agrees with L beyond the point where M departs from L, so P is smaller than M. Continue in the left half

13 LCP(L,M) = ℓ: start comparing M to P at position ℓ + 1

14 Analysis If we do more than a single comparison in an iteration then max(ℓ, r) grows by 1 for each comparison, so the search takes O(m + log n) time
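The bookkeeping of ℓ and r can be sketched in a simplified form. Note the assumption: this variant only skips the first min(ℓ, r) characters at each probe and does not use the whispered LCP(L,M) values, so its worst case is still O(m log n). It illustrates how ℓ and r are maintained, not the full O(m + log n) method. All function names are hypothetical.

```python
def common_prefix(text: str, pos: int, pattern: str, skip: int) -> int:
    """LCP of text[pos:] and pattern, given that the first `skip` chars agree."""
    k = skip
    while k < len(pattern) and pos + k < len(text) and text[pos + k] == pattern[k]:
        k += 1
    return k


def first_rank_not_below(text: str, sa: list[int], pattern: str) -> int:
    """Smallest rank whose suffix is >= pattern (prefix-wise), or len(sa)."""
    m, n = len(pattern), len(sa)
    if n == 0:
        return 0

    def not_below(pos: int, k: int) -> bool:
        # True if pattern <= suffix at pos, where k is their common prefix length.
        return k == m or (pos + k < len(text) and pattern[k] < text[pos + k])

    l = common_prefix(text, sa[0], pattern, 0)       # l = LCP(P, L)
    if not_below(sa[0], l):
        return 0
    r = common_prefix(text, sa[n - 1], pattern, 0)   # r = LCP(P, R)
    if not not_below(sa[n - 1], r):
        return n

    lo, hi = 0, n - 1   # invariant: suffix[lo] < P <= suffix[hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Every suffix between lo and hi shares min(l, r) characters with P.
        k = common_prefix(text, sa[mid], pattern, min(l, r))
        if not_below(sa[mid], k):
            hi, r = mid, k
        else:
            lo, l = mid, k
    return hi


def contains(text: str, sa: list[int], pattern: str) -> bool:
    rank = first_rank_not_below(text, sa, pattern)
    return rank < len(sa) and text.startswith(pattern, sa[rank])


S = "mississippi"
SA = sorted(range(len(S)), key=lambda i: S[i:])
print(contains(S, SA, "issa"), contains(S, SA, "issi"))  # False True
```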

15 Construct the suffix array without the suffix tree

16 Linear time construction Recursively ? Say we want to sort only suffixes that start at even positions ?

17 Change the alphabet You in fact sort suffixes of a string shorter by a factor of 2 ! Every pair of characters is now a character

18 Change the alphabet Each pair of characters gets a new character:
a$ 0
aa 1
ab 2
b$ 3
ba 4
bb 5
Example: a b a a a b $ becomes 2 1 2
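A minimal sketch of the pair renaming, assuming the alphabet {a, b} with end marker $ and the example string abaaab. The function name `rename_pairs` and its `alphabet` parameter are illustrative assumptions.

```python
def rename_pairs(s: str, alphabet: str = "ab") -> list[int]:
    """Replace every pair of characters of s by its rank among all pairs."""
    padded = s + "$" * (len(s) % 2)          # pad odd-length strings with '$'
    # All possible pairs in sorted order: a$, aa, ab, b$, ba, bb for "ab".
    pairs = sorted(x + y for x in alphabet for y in "$" + alphabet)
    code = {pair: k for k, pair in enumerate(pairs)}
    return [code[padded[i:i + 2]] for i in range(0, len(padded), 2)]


print(rename_pairs("abaaab"))  # [2, 1, 2]
```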

19 But we do not gain anything…

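20 Divide into triples T = y a b b a d a b b a d o $ (positions 0 to 12). Triples starting at positions 1, 4, 7, 10: abb ada bba do$

21 Divide into triples Triples starting at positions 1, 4, 7, 10: abb ada bba do$. Triples starting at positions 2, 5, 8, 11: bba dab bad o$$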

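22 Sort recursively 2/3 of the suffixes T = yabbadabbado$ (positions 0 to 12)
Triples: abb ada bba do$ bba dab bad o$$
Triple ranks, forming the new string: 1 2 4 6 4 5 3 7
Suffix array of the new string (indices 0 to 7): 0 1 6 4 2 5 3 7
Ranks of the suffixes at positions 1, 2, 4, 5, 7, 8, 10, 11 of T: 1 4 2 6 5 3 7 8
Positions of T in sorted order: 1 4 8 2 7 5 10 11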
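The naming step can be sketched as follows. The function name `name_triples` is hypothetical, and a naive sort of the new string stands in for the recursive call, so this shows the data flow only, not the linear running time.

```python
def name_triples(text: str) -> tuple[list[int], list[int]]:
    """text ends with '$'. Return the positions 1, 2 (mod 3) and their triple ranks."""
    body = len(text) - 1                      # the final '$' starts no triple
    positions = list(range(1, body, 3)) + list(range(2, body, 3))
    triples = [text[i:i + 3].ljust(3, "$") for i in positions]
    rank = {t: k for k, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [rank[t] for t in triples]


T = "yabbadabbado$"
positions, names = name_triples(T)
print(names)                                  # [1, 2, 4, 6, 4, 5, 3, 7]

# Stand-in for the recursive call: sort the suffixes of the new string naively.
order = sorted(range(len(names)), key=lambda k: names[k:])
print(order)                                  # [0, 1, 6, 4, 2, 5, 3, 7]
print([positions[k] for k in order])          # [1, 4, 8, 2, 7, 5, 10, 11]
```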

23 Sort the remaining third For each position i = 0 (mod 3), form the pair (T[i], rank of suffix i + 1):
position 0: (y, 1)
position 3: (b, 2)
position 6: (a, 5)
position 9: (a, 7)
Sorted pairs: (a, 5) (a, 7) (b, 2) (y, 1), giving positions 6 9 3 0
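A short sketch of this step, with the ranks from the previous slide typed in by hand. A comparison sort is used here for brevity; a radix sort on the pairs would be needed to stay within linear time. The variable names are illustrative.

```python
T = "yabbadabbado$"

# Ranks of the already sorted suffixes at positions 1, 2 (mod 3).
rank = {1: 1, 4: 2, 8: 3, 2: 4, 7: 5, 5: 6, 10: 7, 11: 8}

# Sort positions 0 (mod 3) by (first character, rank of the following suffix).
mod0 = sorted(range(0, len(T) - 1, 3), key=lambda i: (T[i], rank[i + 1]))
print(mod0)  # [6, 9, 3, 0]
```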

24 Merge Merge the sorted positions 1, 2 (mod 3), which are 1 4 8 2 7 5 10 11, with the sorted positions 0 (mod 3), which are 6 9 3 0. Output so far: 1

25 Merge Output so far: 1 6

26 Merge Output so far: 1 6 4

27 Merge Output so far: 1 6 4 9

28 Merge Output so far: 1 6 4 9 3

29 Merge Output so far: 1 6 4 9 3 8

30 Merge Output so far: 1 6 4 9 3 8 2

31 Merge Output so far: 1 6 4 9 3 8 2 7

32 Merge Output so far: 1 6 4 9 3 8 2 7 5

33 Merge Output so far: 1 6 4 9 3 8 2 7 5 (still to place: 10 11 and 0)

34 Merge Output: 1 6 4 9 3 8 2 7 5 10 11 0

35 summary T = yabbadabbado$, suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes.
When comparing to a suffix with index 2 (mod 3) we compare the char, the next char if there is a tie, and finally the ranks of the following suffixes.
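The two comparison rules can be sketched as a merge routine. The function name `merge` and the convention that positions past the end get rank 0 are assumptions made for this illustration.

```python
def merge(text: str, sorted12: list[int], sorted0: list[int]) -> list[int]:
    """Merge sorted positions 1, 2 (mod 3) with sorted positions 0 (mod 3)."""
    rank = {pos: r for r, pos in enumerate(sorted12, start=1)}

    def rank_of(i: int) -> int:
        return rank.get(i, 0)                 # past the end: smallest rank

    def less(i: int, j: int) -> bool:
        """Is suffix i (i = 0 mod 3) smaller than suffix j (j = 1 or 2 mod 3)?"""
        if j % 3 == 1:
            # Compare the char, then the ranks of the following suffixes.
            return (text[i], rank_of(i + 1)) < (text[j], rank_of(j + 1))
        # Compare the char, the next char, then the ranks two positions on.
        return (text[i], text[i + 1], rank_of(i + 2)) < \
               (text[j], text[j + 1], rank_of(j + 2))

    out, a, b = [], 0, 0
    while a < len(sorted12) and b < len(sorted0):
        if less(sorted0[b], sorted12[a]):
            out.append(sorted0[b])
            b += 1
        else:
            out.append(sorted12[a])
            a += 1
    return out + sorted12[a:] + sorted0[b:]


T = "yabbadabbado$"
print(merge(T, [1, 4, 8, 2, 7, 5, 10, 11], [6, 9, 3, 0]))
# [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```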

36 Compute LCPs T = yabbadabbado$
Position: 0 1 2 3 4 5 6 7 8 9 10 11
Rank: 12 1 7 5 3 9 2 8 6 4 10 11
Suffix array and sorted suffixes:
1 abbadabbado$
6 abbado$
4 adabbado$
9 ado$
3 badabbado$
8 bado$
2 bbadabbado$
7 bbado$
5 dabbado$
10 do$
11 o$
0 yabbadabbado$

37 Crucial observation For suffixes at ranks i < j in the suffix array: LCP(i,j) = min {LCP(i,i+1), LCP(i+1,i+2), ..., LCP(j-1,j)}
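The observation can be checked by brute force on the running example. This is only a sanity check with naive helpers; the names `lcp` and `lcp_by_rank` are illustrative.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


T = "yabbadabbado$"
sa = sorted(range(len(T) - 1), key=lambda i: T[i:])   # naive suffix array

# adjacent[k] = LCP of the suffixes at ranks k and k + 1.
adjacent = [lcp(T[sa[k]:], T[sa[k + 1]:]) for k in range(len(sa) - 1)]
print(adjacent)  # [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]


def lcp_by_rank(i: int, j: int) -> int:
    """LCP of the suffixes at ranks i < j, as a minimum over adjacent LCPs."""
    return min(adjacent[i:j])


print(lcp_by_rank(0, 2), lcp(T[sa[0]:], T[sa[2]:]))  # 1 1
```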

38 Find LCPs of consecutive suffixes Go over the positions of T in text order and compare each suffix with its predecessor in the suffix array. LCP(11,0) = 0

39 LCP(8,2) = 1

40 LCP(9,3) = 0

41 LCP(6,4) = 1

42 LCP(7,5) = 0

43 LCP(1,6) = 5

44 LCP(2,7) = 4

45 LCP(3,8) = 3

46 LCP(4,9) = 2

47 LCP(5,10) = 1

48 LCP(10,11) = 0

49 Analysis The starting position of the comparison decreases by at most 1 in every iteration, so it cannot increase more than O(n) times in total. LCPs of consecutive suffixes, in suffix array order: 5 1 2 0 3 1 4 0 1 0 0
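The procedure stepped through above can be sketched as follows. The function name `consecutive_lcps` is an illustrative choice, and the suffix array is built naively for the demonstration.

```python
def consecutive_lcps(text: str, sa: list[int]) -> list[int]:
    """lcp[r] = LCP of the suffixes at ranks r - 1 and r (lcp[0] is unused)."""
    n = len(sa)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r

    lcp = [0] * n
    h = 0                                   # characters known to match already
    for pos in range(n):                    # go over the suffixes in text order
        r = rank[pos]
        if r == 0:                          # no predecessor in the suffix array
            h = 0
            continue
        prev = sa[r - 1]
        while (pos + h < len(text) and prev + h < len(text)
               and text[pos + h] == text[prev + h]):
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                          # the next suffix keeps at least h - 1
    return lcp


T = "yabbadabbado$"
sa = sorted(range(len(T) - 1), key=lambda i: T[i:])
print(consecutive_lcps(T, sa)[1:])  # [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
```

Because h drops by at most 1 per iteration and never exceeds n, the inner loop performs O(n) character comparisons in total.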

50 We need more LCPs for search. Linearly many, calculate them all bottom up
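One way to read "bottom up" is sketched below: every interval that the binary search can visit gets its LCP as the minimum of the LCPs of its two halves. The function name `interval_lcps` and the dictionary layout are assumptions for illustration.

```python
def interval_lcps(adjacent: list[int]) -> dict[tuple[int, int], int]:
    """adjacent[k] = LCP of the suffixes at ranks k and k + 1.

    Returns the LCP of the suffixes at ranks lo and hi for every interval
    (lo, hi) that binary search over ranks 0..len(adjacent) can visit.
    """
    table: dict[tuple[int, int], int] = {}

    def solve(lo: int, hi: int) -> int:
        if hi - lo == 1:
            value = adjacent[lo]            # leaf: two neighbouring ranks
        else:
            mid = (lo + hi) // 2
            value = min(solve(lo, mid), solve(mid, hi))
        table[(lo, hi)] = value
        return value

    if adjacent:
        solve(0, len(adjacent))
    return table


adjacent = [5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
table = interval_lcps(adjacent)
print(len(table))      # 21
print(table[(0, 2)])   # 1
```

The intervals form a binary tree with one leaf per pair of neighbouring ranks, so the table has 2(n - 1) - 1 entries for n suffixes.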

