Suffix arrays
Suffix array
We lose some of the functionality but we save space.
Let s = abab.
Sort the suffixes lexicographically: ab, abab, b, bab.
The suffix array gives the start indices of the suffixes in sorted order: 2 0 3 1
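The definition can be checked directly with a minimal Python sketch that sorts the suffixes by plain string comparison. The function name `naive_suffix_array` is an illustrative assumption, and the approach is much slower than the constructions discussed later; it only shows what the array contains.

```python
def naive_suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    # Comparing whole suffixes as strings is quadratic or worse in the worst
    # case; this is meant only to illustrate the definition.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(naive_suffix_array("abab"))  # [2, 0, 3, 1]
```

For s = abab this reproduces the array 2 0 3 1 from the slide.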
How do we build it?
Build a suffix tree.
Traverse the tree in DFS, picking the edges out of each node in lexicographic order, and fill in the suffix array.
O(n) time
How do we search for a pattern?
If P occurs in T then all its occurrences are consecutive in the suffix array.
Do a binary search on the suffix array.
Takes O(m log n) time, where m = |P| and n = |T|.
Example
Let S = mississippi.
Sorted suffixes: i, ippi, issippi, ississippi, mississippi, pi, ppi, sippi, sissippi, ssippi, ssissippi
L and R mark the ends of the current search interval, M its middle.
Let P = issa.
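The plain binary search can be sketched as follows. This is a minimal illustration, not a tuned implementation: the name `find_occurrences` is an assumption, the suffix array is built by naive sorting, and each step compares P against the first m characters of the middle suffix, which gives the O(m log n) bound.

```python
def find_occurrences(text, sa, p):
    """Return the sorted start positions of p in text, given its suffix array sa."""
    m = len(p)

    # First rank whose suffix, cut to m characters, is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First rank whose suffix, cut to m characters, is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    # All occurrences sit in the consecutive block sa[start:lo].
    return sorted(sa[start:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(find_occurrences(text, sa, "issa"))  # []
print(find_occurrences(text, sa, "issi"))  # [1, 4]
```

For P = issa the block is empty, so the pattern does not occur in mississippi.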
How do we accelerate the search?
Maintain ℓ = LCP(P, L) and r = LCP(P, R), where L and R are the suffixes at the ends of the current interval and M is the middle suffix.
Assume ℓ ≥ r.
If ℓ = r then start comparing M to P at ℓ + 1.
If ℓ > r, someone whispers LCP(L, M):
LCP(L, M) > ℓ: continue in the right half (ℓ is unchanged).
LCP(L, M) < ℓ: continue in the left half (r becomes LCP(L, M)).
LCP(L, M) = ℓ: start comparing M to P at ℓ + 1.
Analysis
If we do more than a single comparison in an iteration then max(ℓ, r) grows by 1 for each additional comparison.
O(m + log n) time
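The case analysis can be sketched in Python as below. Several things here are assumptions made for illustration: the names `lower_bound`, `extend` and `p_leq`, the symmetric handling of the case r > ℓ (using LCP(M, R)), and above all the "whisper", which is computed by a naive character scan. With that stand-in the sketch does not achieve O(m + log n); the bound needs the whispered values to come from a precomputed table.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def lower_bound(text, sa, p):
    """Smallest rank i with text[sa[i]:] >= p, or len(sa) if there is none."""
    m, n = len(p), len(text)
    suf = lambda i: text[sa[i]:]

    def extend(i, k):
        # p and suffix i are known to agree on k characters; match further.
        while k < m and sa[i] + k < n and text[sa[i] + k] == p[k]:
            k += 1
        return k

    def p_leq(i, k):
        # Given LCP(p, suffix i) == k, decide whether p <= suffix i.
        if k == m:
            return True
        if sa[i] + k == n:
            return False
        return p[k] < text[sa[i] + k]

    if not sa:
        return 0
    lo, hi = 0, len(sa) - 1
    l = extend(lo, 0)
    if p_leq(lo, l):
        return 0
    r = extend(hi, 0)
    if not p_leq(hi, r):
        return len(sa)
    # Invariant: suffix lo < p <= suffix hi, l = LCP(p, lo), r = LCP(p, hi).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            h = lcp(suf(lo), suf(mid))      # the whispered LCP(L, M)
            if h > l:
                lo = mid                    # right half, l unchanged
                continue
            if h < l:
                hi, r = mid, h              # left half
                continue
            k = extend(mid, l)              # compare from position l + 1
        else:
            h = lcp(suf(mid), suf(hi))      # mirror image, LCP(M, R)
            if h > r:
                hi = mid                    # left half, r unchanged
                continue
            if h < r:
                lo, l = mid, h              # right half
                continue
            k = extend(mid, r)
        if p_leq(mid, k):
            hi, r = mid, k
        else:
            lo, l = mid, k
    return hi
```

P occurs in the text exactly when the returned rank i is inside the array and text[sa[i]:sa[i] + m] equals P.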
Construct the suffix array without the suffix tree
Linear time construction
Recursively? Say we want to sort only the suffixes that start at even positions.
Change the alphabet
Every pair of characters is now a character.
You in fact sort the suffixes of a string that is shorter by a factor of 2!
Change the alphabet
a$ → 0, aa → 1, ab → 2, b$ → 3, ba → 4, bb → 5
Example: abaaab becomes 2 1 2
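The renaming of pairs can be sketched in a few lines of Python. The table is the one given on the slide; the function name `halve`, the input string abaaab, and the padding of odd-length strings with $ are assumptions made for illustration. The sketch also checks the claim that sorting the suffixes of the shorter string sorts the even-position suffixes of the original.

```python
# Pair-to-symbol table over the alphabet {a, b} with end marker $.
PAIR_NAME = {"a$": 0, "aa": 1, "ab": 2, "b$": 3, "ba": 4, "bb": 5}


def halve(s):
    """Rename every pair of characters that starts at an even position."""
    if len(s) % 2:
        s += "$"  # pad an odd-length string with the end marker
    return [PAIR_NAME[s[i:i + 2]] for i in range(0, len(s), 2)]


text = "abaaab"
reduced = halve(text)
print(reduced)  # [2, 1, 2]

# Sort the suffixes of the half-length string, then map back to even positions.
order = sorted(range(len(reduced)), key=lambda k: reduced[k:])
even_suffixes = [2 * k for k in order]
print(even_suffixes)  # [2, 4, 0]
```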
But we do not gain anything…
Divide into triples
yabbadabbado$
Triples starting at position 1: abb ada bba do$
Triples starting at position 2: bba dab bad o$$
Sort recursively 2/3 of the suffixes
yabbadabbado$
Treat each triple as a single character and sort the suffixes of the concatenation: abb ada bba do$ bba dab bad o$$
Sort the remaining third
yabbadabbado$
Represent each remaining suffix by the pair (first character, rank of the following suffix): (y, 1) (b, 2) (a, 5) (a, 7)
Sorted: (a, 5) (a, 7) (b, 2) (y, 1)
Merge
yabbadabbado$
Summary
yabbadabbado$
When comparing a suffix with index 0 (mod 3) to a suffix with index 1 (mod 3), we compare the first char and break ties by the ranks of the following suffixes.
When comparing to a suffix with index 2 (mod 3), we compare the first char, then the next char if there is a tie, and finally the ranks of the following suffixes.
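The whole construction (divide into triples, sort two thirds recursively, sort the remaining third, merge) can be sketched in Python as below. This is a hedged illustration rather than the linear-time algorithm itself: Python's comparison sort stands in for the radix sorts, so the running time is not O(n). The names `skew` and `suffix_array` are assumptions, symbols are positive integers with 0 reserved as padding, and the dummy position added when n mod 3 = 1 is a standard implementation device that the slides do not mention.

```python
def skew(s):
    """Suffix array of a list of positive integers (0 is reserved as padding)."""
    n = len(s)
    if n < 2:
        return list(range(n))
    t = s + [0, 0, 0]
    # Sample positions: i mod 3 == 1 first, then i mod 3 == 2. When n mod 3 == 1
    # a dummy position n is added so the first block ends in a padded triple.
    end1 = n + 1 if n % 3 == 1 else n
    sample = list(range(1, end1, 3)) + list(range(2, n, 3))
    triple = lambda i: tuple(t[i:i + 3])

    # Name every triple by its rank among the distinct triples.
    names = {tr: k + 1 for k, tr in enumerate(sorted({triple(i) for i in sample}))}
    if len(names) == len(sample):
        sorted_sample = sorted(sample, key=triple)       # names already decide
    else:
        reduced = [names[triple(i)] for i in sample]     # string of length ~2n/3
        sorted_sample = [sample[k] for k in skew(reduced)]

    rank = [0] * (n + 3)                                 # 0 for positions past the end
    for r, i in enumerate(sorted_sample):
        rank[i] = r + 1
    sa12 = [i for i in sorted_sample if i < n]           # drop the dummy position

    # Remaining third: sort by (first char, rank of the following suffix).
    sa0 = sorted(range(0, n, 3), key=lambda i: (t[i], rank[i + 1]))

    def sample_first(j, i):
        """True if sample suffix j precedes the mod-0 suffix i."""
        if j % 3 == 1:
            return (t[j], rank[j + 1]) < (t[i], rank[i + 1])
        return (t[j], t[j + 1], rank[j + 2]) < (t[i], t[i + 1], rank[i + 2])

    out, a, b = [], 0, 0
    while a < len(sa12) and b < len(sa0):
        if sample_first(sa12[a], sa0[b]):
            out.append(sa12[a])
            a += 1
        else:
            out.append(sa0[b])
            b += 1
    return out + sa12[a:] + sa0[b:]


def suffix_array(text):
    """Suffix array of a string that contains no NUL character."""
    return skew([ord(c) for c in text])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```

The function `sample_first` is the comparison rule of the summary slide: one character plus a rank against a suffix with index 1 (mod 3), two characters plus a rank against a suffix with index 2 (mod 3).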
Compute LCPs
yabbadabbado$
Sorted suffixes, with their starting positions:
1 abbadabbado$
6 abbado$
4 adabbado$
9 ado$
3 badabbado$
8 bado$
2 bbadabbado$
7 bbado$
5 dabbado$
10 do$
11 o$
0 yabbadabbado$
Crucial observation
For the suffixes at positions i < j of the sorted order:
LCP(i, j) = min {LCP(i, i+1), LCP(i+1, i+2), ..., LCP(j-1, j)}
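The observation can be checked on the running example with a short Python sketch. The helper names `lcp` and `lcp_by_rank` are assumptions, the suffix array is built by naive sorting, and i and j are ranks in the sorted order.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


text = "yabbadabbado$"
sa = sorted(range(len(text)), key=lambda i: text[i:])

# adjacent[k] is the LCP of the k-th and (k+1)-th suffixes in sorted order.
adjacent = [lcp(text[sa[k]:], text[sa[k + 1]:]) for k in range(len(sa) - 1)]


def lcp_by_rank(i, j):
    """LCP of the i-th and j-th sorted suffixes (i < j), from adjacent values only."""
    return min(adjacent[i:j])
```

So the LCPs of consecutive suffixes are enough to recover the LCP of any pair.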
Find LCPs of consecutive suffixes
yabbadabbado$
Go over the suffixes in text order (positions 0, 1, 2, ...). LCP(x, y) is the LCP of the suffix starting at position y and the suffix starting at position x that precedes it in the sorted order. The suffix at position 1 is first in the sorted order, so it has no predecessor.
LCP(11, 0)
LCP(8, 2)
LCP(9, 3)
LCP(6, 4)
LCP(7, 5)
LCP(1, 6)
LCP(2, 7)
LCP(3, 8) = 3
LCP(4, 9) = 2
LCP(5, 10) = 1
LCP(10, 11) = 0
Analysis
The length from which the comparison starts decreases by at most 1 in every iteration, so it cannot increase more than O(n) times in total.
O(n) time
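The procedure can be sketched in Python as follows; this scheme is usually attributed to Kasai et al. The function name `lcp_array` and the convention that entry 0 is set to 0 are assumptions made for illustration.

```python
def lcp_array(text, sa):
    """lcp[r] = LCP of the suffixes of rank r - 1 and r in sa; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):  # suffixes in text order
        if rank[i] == 0:
            h = 0  # the first suffix in sorted order has no predecessor
            continue
        j = sa[rank[i] - 1]  # predecessor of suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix keeps all but the first matched character
    return lcp


text = "yabbadabbado$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
print(lcp_array(text, sa))
```

The counter h drops by at most 1 per iteration and never exceeds n, which is why the inner loop does O(n) work in total.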
We need more LCPs for search
yabbadabbado$
One for every interval (L, R) that the binary search can visit.
Linearly many; calculate them all bottom up.
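One way to compute these values bottom up is sketched below in Python. It assumes the binary search always splits (L, R) at M = (L + R) // 2, so the intervals form a binary tree whose leaves are pairs of consecutive suffixes. The names `interval_lcps` and `fill` and the dictionary layout are illustrative assumptions.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def interval_lcps(adjacent):
    """Map every binary-search interval (L, R) of ranks to LCP(L, R).

    adjacent[k] is the LCP of the suffixes of rank k and k + 1.
    """
    table = {}

    def fill(lo, hi):
        if hi - lo == 1:
            table[(lo, hi)] = adjacent[lo]  # leaf: consecutive suffixes
        else:
            mid = (lo + hi) // 2
            # An interval's LCP is the minimum over its two halves.
            table[(lo, hi)] = min(fill(lo, mid), fill(mid, hi))
        return table[(lo, hi)]

    if adjacent:
        fill(0, len(adjacent))
    return table


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])
adjacent = [lcp(text[sa[k]:], text[sa[k + 1]:]) for k in range(len(sa) - 1)]
table = interval_lcps(adjacent)
print(len(table))  # 19 entries for n = 11, that is 2n - 3
```

During the search, LCP(L, M) and LCP(M, R) are then simple lookups of the two child intervals of (L, R).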