Suffix Arrays: A New Method for Online String Searches. U. Manber and G. Myers
Introduction - String matching: Let A = a_0 a_1 ... a_{N-1} be a large text of length N. Let W = w_0 w_1 ... w_{P-1} be a word of length P. Is W a substring of A?
Introduction - Suffix Trees: Build time O(N). Search time O(P). Structure space O(N), with a big constant. Dependent on |Σ|.
Suffix Arrays: An array of all the suffixes of A, sorted in lexicographical order. Example: A = aababa has the sorted suffixes a, aababa, aba, ababa, ba, baba.
Suffix Arrays: A_i = a_i a_{i+1} ... a_{N-1} is the suffix of A that starts at position i. Position array (Pos): Pos[k] is the start position of the kth smallest suffix, so A_{Pos[k]} is the suffix that Pos[k] points to, i.e. the kth smallest suffix (k counted from 0). For A = aababa: k = 0 1 2 3 4 5 gives Pos[k] = 5 0 3 1 4 2.
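To make the Pos definition concrete, here is a minimal sketch in Python. The helper name `build_pos` is an assumption for illustration; it sorts start positions by comparing whole suffixes, which is much slower than the construction described later in this deck and is only suitable for small inputs.

```python
def build_pos(a):
    """Return Pos for text a: suffix start positions in sorted suffix order.

    Naive construction (illustration only): every comparison may inspect
    O(N) symbols, so this is not the O(N log N) sort from the slides.
    """
    return sorted(range(len(a)), key=lambda i: a[i:])


text = "aababa"
pos = build_pos(text)
print(pos)  # [5, 0, 3, 1, 4, 2]
for k, i in enumerate(pos):
    # k-th smallest suffix, its start position, and the suffix itself
    print(k, i, text[i:])
```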
Searching: "Is W a substring of A?" W is a substring of A if and only if some suffix A_i starts with W; i is then a location of W. All the instances of W match consecutive suffixes in the array, so the task is to find the array interval that contains those suffixes.
Searching - Definitions: For a string u, u^p = u_0 u_1 ... u_{p-1} is its p-symbol prefix. For strings u and v, u ≤_p v if and only if u^p ≤ v^p; the same goes for ≠_p, =_p, >_p, and so on. For any p, Pos is ordered according to ≤_p.
Searching - Definitions: W = w_0 w_1 … w_{P-1}. L_W = min(k : W ≤_P A_{Pos[k]} or k = N), the position of the first suffix that is ≥_P W. R_W = max(k : A_{Pos[k]} ≤_P W or k = -1), the position of the last suffix that is ≤_P W. For k < L_W, W >_P A_{Pos[k]}; for L_W ≤ k ≤ R_W, W =_P A_{Pos[k]}; for k > R_W, W <_P A_{Pos[k]}.
Search Algorithm: k ∈ [L_W, R_W] if and only if W =_P A_{Pos[k]}. To find the instances of W, find [L_W, R_W]. The number of occurrences of W is R_W - L_W + 1. The matches are A_{Pos[L_W]}, …, A_{Pos[R_W]}. The suffix array is sorted, so use binary search.
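Binary Search: Search interval [L, R] with midpoint M. Compare W to A_{Pos[M]} and decide where to search next. If W ≤_P A_{Pos[M]}, search in the left half (R = M). If W >_P A_{Pos[M]}, search in the right half (L = M). Each comparison costs O(P) and there are O(log N) steps: O(P log N). [Figure: binary search for W = abc over the sorted suffixes, with L, M, and R marked.]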
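The plain O(P log N) search can be sketched as two binary searches, one for each end of the interval. This is a minimal illustration in Python, not the authors' code: the function name `find_interval` is hypothetical, and Pos is passed in as a ready-made list.

```python
def find_interval(a, pos, w):
    """Return (L_W, R_W) for word w in text a with suffix array pos.

    Each step compares w with the P-symbol prefix of one suffix, so the
    cost is O(P log N). If w does not occur, R_W == L_W - 1.
    """
    n, p = len(a), len(w)

    # L_W: first k whose P-symbol prefix is >= w (N if there is none).
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if a[pos[mid]:pos[mid] + p] < w:
            lo = mid + 1
        else:
            hi = mid
    lw = lo

    # R_W: last k whose P-symbol prefix is <= w (-1 if there is none).
    lo, hi = lw, n
    while lo < hi:
        mid = (lo + hi) // 2
        if a[pos[mid]:pos[mid] + p] <= w:
            lo = mid + 1
        else:
            hi = mid
    return lw, lo - 1


a = "aababa"
pos = [5, 0, 3, 1, 4, 2]
lw, rw = find_interval(a, pos, "ab")
print(lw, rw, [pos[k] for k in range(lw, rw + 1)])  # 2 3 [3, 1]
```

The occurrence count is `rw - lw + 1`, and the start positions are the entries of `pos` inside the interval.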
Search Algorithm: Observation: information from one comparison can be used to speed up the following comparisons. The additional information used is the lcp (longest common prefix).
Search Algorithm - lcp: lcp(v, w) is the length of the longest common prefix of v and w. It is obtained by comparing v and w and stopping at the first unequal symbol. Precomputed lcp information reduces the number of symbol comparisons to O(P + log N).
Search Algorithm: Consider all possible midpoints M = 1 … N-2. Every midpoint corresponds to a triplet [L_M, M, R_M]. Suppose we precomputed two arrays: Llcp[M] = lcp(A_{Pos[L_M]}, A_{Pos[M]}) and Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R_M]}).
Search Algorithm: Maintain two more variables: l = lcp(A_{Pos[L]}, W) and r = lcp(W, A_{Pos[R]}). [Figure: W = abcd over the sorted suffixes with L_M, M, R_M marked; l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1.]
Search Algorithm: Assume l ≥ r and compare l with Llcp[M]. If l < Llcp[M]: W >_{l+1} A_{Pos[L_M]} and A_{Pos[L_M]} =_{l+1} A_{Pos[M]}, so W >_{l+1} A_{Pos[M]}. Go right; l remains unchanged. [Figure: W = abcd, l = 2, r = 1, Llcp[M] = 3, Rlcp[M] = 1.]
Search Algorithm: If l > Llcp[M]: A_{Pos[L_M]} <_l A_{Pos[M]} and W =_l A_{Pos[L_M]}, so W <_l A_{Pos[M]}. Go left; r = Llcp[M]. [Figure: W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1.]
Search Algorithm: If l = Llcp[M]: W can be in either half. Start comparing W and A_{Pos[M]} from symbol l+1. The first unequal symbol determines whether to go right or left, and l or r is updated to l+j, at a cost of j+1 comparisons. [Figure: W = abcd, l = 2, r = 1, Llcp[M] = 2, Rlcp[M] = 1.]
Search Algorithm - Complexity: In each iteration, let h = max(l, r). Symbol comparison starts after the first h symbols of W and stops at the first mismatch, after j matches: j+1 symbol comparisons. The next iteration starts after the first h+j symbols, so j of the j+1 compared symbols are never compared again.
Search Algorithm - Complexity: Every symbol in W is successfully matched at most once: O(P) successful comparisons. At most one symbol is unsuccessfully matched in each iteration: O(log N) unsuccessful comparisons. Total: O(P + log N) comparisons.
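The three cases above can be combined into one search loop for L_W. The following is a sketch in Python under stated assumptions: the names `lcp`, `build_lcp_tables`, and `find_lw` are hypothetical, the text is non-empty, and the Llcp/Rlcp tables are filled by direct symbol comparison purely for illustration (the deck later derives them during the sort). String slices are used for readability even though they copy symbols.

```python
def lcp(v, w):
    """Length of the longest common prefix of v and w."""
    n = 0
    while n < len(v) and n < len(w) and v[n] == w[n]:
        n += 1
    return n


def build_lcp_tables(a, pos):
    """Llcp[M], Rlcp[M] for every midpoint M of the binary search over pos."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n
    stack = [(0, n - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= 1:
            continue
        mid = (lo + hi) // 2
        llcp[mid] = lcp(a[pos[lo]:], a[pos[mid]:])
        rlcp[mid] = lcp(a[pos[mid]:], a[pos[hi]:])
        stack.extend([(lo, mid), (mid, hi)])
    return llcp, rlcp


def find_lw(a, pos, llcp, rlcp, w):
    """Return L_W: the first k with W <=_P A_Pos[k], or N if there is none."""
    n, p = len(a), len(w)

    def w_not_greater(i, m):
        # Given m = lcp(w, A_i): is W <=_P A_i? A suffix that ends before
        # the mismatch position is a proper prefix of w, hence smaller.
        return m == p or (i + m < n and w[m] <= a[i + m])

    l = lcp(a[pos[0]:], w)
    r = lcp(a[pos[n - 1]:], w)
    if w_not_greater(pos[0], l):
        return 0
    if not w_not_greater(pos[n - 1], r):
        return n
    lo, hi = 0, n - 1          # invariant: W >_P A_Pos[lo], W <=_P A_Pos[hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            if llcp[mid] >= l:  # first l symbols already known to match
                m = l + lcp(a[pos[mid] + l:], w[l:])
            else:               # l > Llcp[M]: no symbol comparison needed
                m = llcp[mid]
        else:
            if rlcp[mid] >= r:
                m = r + lcp(a[pos[mid] + r:], w[r:])
            else:
                m = rlcp[mid]
        if w_not_greater(pos[mid], m):
            hi, r = mid, m      # go left
        else:
            lo, l = mid, m      # go right
    return hi


a = "aababa"
pos = [5, 0, 3, 1, 4, 2]
llcp, rlcp = build_lcp_tables(a, pos)
print(find_lw(a, pos, llcp, rlcp, "ab"))  # 2
```

R_W can be found with a symmetric loop; it is omitted here to keep the sketch short.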
Build Suffix Array: So far: an O(P + log N) search algorithm, given a sorted suffix array and lcp information (Llcp, Rlcp). Next: sort the suffix array in O(N log N) and compute the lcps while sorting the array.
Sort Algorithm: First stage: sort the suffixes into buckets according to their first symbol. Inductive stage: assume the array is bucket-sorted according to the first H symbols. Every H-bucket holds suffixes with the same first H symbols, and the buckets are ordered according to the ≤_H relation. Sort according to the first 2H symbols.
Sort Algorithm - Intuition: Let A_i, A_j be two suffixes in the same H-bucket, so A_i =_H A_j. The next H symbols of A_i and A_j are the first H symbols of A_{i+H} and A_{j+H}. To determine the ≤_{2H} order of A_i and A_j, look at the ≤_H order of A_{i+H} and A_{j+H}. [Figure: A = aababaa, H = 2, with A_i, A_j, A_{i+H}, A_{j+H} marked in the sorted array.]
Sort Algorithm - Main Idea: Let A_i be a suffix in the first H-bucket. A_i starts with the smallest H-symbol string, so A_{i-H} should come first within its H-bucket, i.e. in that bucket's first 2H-bucket. [Figure: A = aababa, H = 1.]
Sort Algorithm: In stage H, go over all the suffixes in the ≤_H order. For each A_i, move A_{i-H} to the next available place in its H-bucket. The suffixes are now sorted according to the ≤_{2H} order. Go on to stage 2H to produce the ≤_{4H} order.
Sort Algorithm - Example: A = assassin, positions 0 to 7. H = 1 (suffixes bucketed by first symbol, buckets separated by |): A3 assin, A0 assassin | A6 in | A7 n | A1 ssassin, A5 sin, A4 ssin, A2 sassin. [Figure: the H = 1 order being refined into the H = 2 order.]
Sort Algorithm - Example: A = assassin. H = 2 (sorted by ≤_2): A0 assassin, A3 assin, A6 in, A7 n, A2 sassin, A5 sin, A4 ssin, A1 ssassin. H = 4 (sorted by ≤_4): A0 assassin, A3 assin, A6 in, A7 n, A2 sassin, A5 sin, A1 ssassin, A4 ssin.
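One stage of the sort can be sketched in Python as follows. This is an illustration of the "move A_{i-H} to the next free place in its H-bucket" idea, not the authors' implementation: the names `refine` and `sort_suffixes` are hypothetical, bucket identity is taken from string slices for readability (which costs O(H) per suffix rather than the O(1) bookkeeping a real implementation needs), and the last H suffixes are placed first because nothing follows them, so they are smallest within their buckets.

```python
def refine(a, pos, h):
    """Turn an array ordered by <=_h into one ordered by <=_2h."""
    n = len(a)
    start = {}                       # first array index of each h-bucket
    for k, i in enumerate(pos):
        start.setdefault(a[i:i + h], k)
    next_free = dict(start)          # next available place per h-bucket
    out = [None] * n

    def place(i):
        bucket = a[i:i + h]
        out[next_free[bucket]] = i
        next_free[bucket] += 1

    # Suffixes A_i with i + h >= n have an empty continuation A_{i+h}.
    for i in range(max(0, n - h), n):
        place(i)
    # Scan in <=_h order; each A_i pulls A_{i-h} forward in its bucket.
    for i in pos:
        if i >= h:
            place(i - h)
    return out


def sort_suffixes(a):
    """Full sort: bucket by first symbol, then stages H = 1, 2, 4, ..."""
    pos = sorted(range(len(a)), key=lambda i: a[i])
    h = 1
    while h < len(a):
        pos = refine(a, pos, h)
        h *= 2
    return pos


# Starting from one possible first-symbol bucketing of "assassin":
stage1 = refine("assassin", [3, 0, 6, 7, 1, 5, 4, 2], 1)
print(stage1)                          # [0, 3, 6, 7, 2, 5, 4, 1]
print(refine("assassin", stage1, 2))   # [0, 3, 6, 7, 2, 5, 1, 4]
```

The two printed arrays correspond to the orders A0 A3 A6 A7 A2 A5 A4 A1 and A0 A3 A6 A7 A2 A5 A1 A4.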
Sort Algorithm - Complexity: First stage: bucket sort according to first symbol, O(N log N). Inductive stages: O(log N) stages at O(N) per stage. Total: O(N log N). Space: can be implemented using two N-sized integer arrays.
Finding Longest Common Prefixes: The search algorithm uses lcp information: Llcp[M] = lcp(A_{Pos[L_M]}, A_{Pos[M]}) and Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R_M]}). We want to compute this information while we are sorting the array.
Finding Longest Common Prefixes: First, show how to compute lcps for suffixes in adjacent H-buckets during the sort algorithm. Use that to compute the lcps of all the suffixes that are consecutive in the sorted suffix array. Finally, show how to compute lcps for all the necessary suffixes.
Finding LCP for adjacent buckets: After the first sort stage, the lcp of suffixes in adjacent buckets is 0. Assume that after stage H we know the lcps between suffixes in adjacent H-buckets. Suppose A_p and A_q are in the same H-bucket but not in the same 2H-bucket. Then H ≤ lcp(A_p, A_q) < 2H, lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H}), and lcp(A_{p+H}, A_{q+H}) < H.
Finding LCP for adjacent buckets: Let i, j be the positions of A_{p+H}, A_{q+H} in the suffix array, and assume i < j. The array is ordered according to the ≤_H order, so lcp(A_{Pos[i]}, A_{Pos[j]}) = min over k ∈ [i+1, j] of lcp(A_{Pos[k-1]}, A_{Pos[k]}). [Figure: sorted suffixes of aababa, H = 1, with positions i and j marked.]
LCP Data Structures - Hgt[]: We need a data structure that lets us get the lcps of consecutive suffixes and get their minimum over a range. Hgt[] is an array of N-1 entries: Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]}) for i = 1 … N-1.
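The role of Hgt[] can be checked on a small example. The sketch below, in Python, fills Hgt by direct comparison (an assumption made for illustration; the deck computes it during the sort) and then verifies that the lcp of any two suffixes equals the minimum of the Hgt entries between them. The names `lcp`, `build_hgt`, and `range_min_matches` are hypothetical, and index 0 of the list is an unused placeholder so that indices line up with Hgt[1 … N-1].

```python
def lcp(v, w):
    """Length of the longest common prefix of v and w."""
    n = 0
    while n < len(v) and n < len(w) and v[n] == w[n]:
        n += 1
    return n


def build_hgt(a, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i >= 1; hgt[0] is unused."""
    hgt = [0] * len(pos)
    for i in range(1, len(pos)):
        hgt[i] = lcp(a[pos[i - 1]:], a[pos[i]:])
    return hgt


def range_min_matches(a, pos, hgt):
    """Check lcp(A_Pos[i], A_Pos[j]) == min(hgt[i+1 .. j]) for all i < j."""
    n = len(pos)
    return all(
        lcp(a[pos[i]:], a[pos[j]:]) == min(hgt[i + 1:j + 1])
        for i in range(n) for j in range(i + 1, n)
    )


a = "assassin"
pos = [0, 3, 6, 7, 2, 5, 1, 4]
hgt = build_hgt(a, pos)
print(hgt[1:])                        # [3, 0, 0, 0, 1, 1, 2]
print(range_min_matches(a, pos, hgt))  # True
```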
LCP Data Structures - Hgt[]: Hgt is computed inductively throughout the sort. Its entries are initialized to N+1. Hgt[i] is updated in stage 2H if and only if A_{Pos[i]} started a new 2H-bucket. To update Hgt[i]: let a, b be the array positions of A_{Pos[i-1]+H} and A_{Pos[i]+H}, and assume a ≤ b. Then Hgt[i] = H + min over k ∈ [a+1, b] of Hgt[k].
Finding LCP - Example (A = assassin): lcp(sin, ssin) = 1 + lcp(in, sin) = 1 + min(lcp(in, n), lcp(n, sassin), lcp(sassin, sin)) = 1 + 0 = 1. lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1. [Figure: Hgt values next to the suffix orders for H = 1, H = 2, and H = 4; entries not yet set show the initial value 9.]
LCP Data Structures - Interval Tree: We need the following operations on Hgt[]: Set(i, h) sets Hgt[i] to h; Min_height(i, j) determines min over k ∈ [i, j] of Hgt[k]. We also need a way to find the lcps for all the necessary suffixes, not just the ones in consecutive positions.
The interval tree is a full and balanced binary tree with N-1 leaves that correspond to the entries of Hgt[]. It has O(log N) height and N-2 interior vertices. Keep a Hgt value for each interior vertex as well: Hgt[v] = min(Hgt[left(v)], Hgt[right(v)]).
LCP Data Structures - Interval Tree: Implementation of the operations: Set(i, h) sets Hgt[i] to h and updates the Hgt values on the path from leaf i to the root. Min_height(i, j) finds the minimal Hgt value by scanning O(log N) vertices in the tree. Both operations take O(log N).
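The two operations can be sketched with an array-backed min tree in Python. The class name `IntervalTree` and its layout are assumptions for illustration: the leaf count is padded up to a power of two, which differs from the exact N-1 leaf tree on the slides but keeps both operations at O(log N).

```python
class IntervalTree:
    """Min tree over Hgt[]: Set(i, h) and Min_height(i, j) in O(log N)."""

    def __init__(self, size, initial):
        self.leaves = 1
        while self.leaves < size:
            self.leaves *= 2
        # Vertex v has children 2v and 2v+1; leaf i lives at leaves + i.
        self.hgt = [initial] * (2 * self.leaves)

    def set(self, i, h):
        """Set Hgt[i] = h and repair the minima on the path to the root."""
        v = self.leaves + i
        self.hgt[v] = h
        v //= 2
        while v >= 1:
            self.hgt[v] = min(self.hgt[2 * v], self.hgt[2 * v + 1])
            v //= 2

    def min_height(self, i, j):
        """Return min(Hgt[k]) for k in [i, j], assuming i <= j."""
        lo, hi = self.leaves + i, self.leaves + j + 1  # half-open [lo, hi)
        best = self.hgt[lo]
        while lo < hi:
            if lo & 1:
                best = min(best, self.hgt[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.hgt[hi])
            lo //= 2
            hi //= 2
        return best


# Hgt values of "assassin" (N = 8), entries initialized to N + 1 = 9.
tree = IntervalTree(8, 9)
for i, h in enumerate([3, 0, 0, 0, 1, 1, 2], start=1):
    tree.set(i, h)
print(tree.min_height(5, 7))  # 1
print(tree.min_height(1, 7))  # 0
```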
Finding LCP - Interval Tree: [Figure: interval tree with leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7) and the Hgt value stored at each vertex.]
Finding LCP - Complexity: In stage 2H we update Hgt[i] for all the leaves that started new buckets. Each update is one Set operation and one Min_height: O(log N). Throughout the algorithm every leaf is updated exactly once: O(N) updates. Update complexity: O(N log N). In each stage we scan the array to see which suffixes opened new buckets. Scan complexity: O(N log N). Total LCP complexity: O(N log N).
Finding LCP - Llcp[] and Rlcp[]: We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort. Use an interval tree that represents the binary search. Each interior node corresponds to (L_M, R_M) for some M, with Left(L_M, R_M) = (L_M, M) and Right(L_M, R_M) = (M, R_M). There are N-2 interior nodes. Leaves correspond to (i-1, i), with Leaf(i-1, i) = Hgt[i].
Finding LCP - Llcp[] and Rlcp[]: According to the interval tree structure, Hgt[(L, R)] = min over k ∈ [L+1, R] of Hgt[k] = lcp(A_{Pos[L]}, A_{Pos[R]}). Hence Llcp[M] = Hgt[(L_M, M)] and Rlcp[M] = Hgt[(M, R_M)].
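Reading Llcp[] and Rlcp[] off the binary-search-shaped tree can be sketched as a bottom-up recursion in Python. The function name `lcp_tables_from_hgt` is hypothetical, and the input is assumed to be a finished Hgt list whose index 0 is an unused placeholder.

```python
def lcp_tables_from_hgt(hgt):
    """Derive Llcp[] and Rlcp[] from hgt, where hgt[i] = lcp(A_Pos[i-1], A_Pos[i]).

    The recursion mirrors the binary search: node (L, R) has midpoint
    M = (L + R) // 2, children (L, M) and (M, R), and leaves (i-1, i).
    """
    n = len(hgt)
    llcp, rlcp = [0] * n, [0] * n

    def height(lo, hi):
        if hi - lo == 1:
            return hgt[hi]              # Leaf(i-1, i) = Hgt[i]
        mid = (lo + hi) // 2
        llcp[mid] = height(lo, mid)     # Hgt[(L_M, M)]
        rlcp[mid] = height(mid, hi)     # Hgt[(M, R_M)]
        return min(llcp[mid], rlcp[mid])

    if n >= 2:
        height(0, n - 1)
    return llcp, rlcp


# Hgt for "aababa" with Pos = [5, 0, 3, 1, 4, 2]; index 0 is a placeholder.
llcp, rlcp = lcp_tables_from_hgt([0, 1, 1, 3, 0, 2])
print(llcp[1:5])  # [1, 1, 3, 0]
print(rlcp[1:5])  # [1, 0, 0, 2]
```

The recursion depth is O(log N) and each node is visited once, so the tables are filled in O(N) time once Hgt is known.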
Worst Case Complexity: Suffix Array: build time O(N log N); search time O(P + log N); structure space O(N), 2N to 3N integers; independent of |Σ|. Suffix Tree: build time O(N); search time O(P); structure space O(N), big constant; dependent on |Σ|.
Expected Time Improvements: Improve the expected-case time of the search algorithm, the sort algorithm, and the LCP computation. Assumption: all N-symbol strings are equally likely. Under this assumption, the expected length of the longest repeated substring of A is O(log_|Σ| N).
Expected Case Improvements - Main Idea: Let T = ⌊log_|Σ| N⌋. Let Int_T(u) be the integer encoding in base |Σ| of the T-symbol prefix of u. Example: T = 3, Σ = {a, b}, u = abaa gives Int_T(u) = 010 (base 2) = 2. There are |Σ|^T ≤ N possible T-symbol prefixes, so Int_T(u) is a number in [0, N-1]. Map each suffix A_p to Int_T(A_p); this can be done in O(N) time.
Expected Case Improvements - Search Algorithm: Use an additional array Buck[]. Think of the sorted array as buckets based on the Int_T encoding. Buck[k] = min{ i | Int_T(A_{Pos[i]}) = k }, the first position that contains a suffix mapped to k. Compute Buck[] at the end of the sort algorithm in O(N) additional time.
Expected Case Improvements - Search Algorithm: Given a word W, we need to find L_W and R_W. Let k = Int_T(W). L_W and R_W must be in k's bucket, between Buck[k] and Buck[k+1], so we only need to search one bucket.
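The encoding and the bucket array can be sketched in Python as follows. The names `choose_t`, `int_t`, and `build_buck` are hypothetical. Two conventions here are assumptions not stated on the slides: a prefix shorter than T symbols is padded with the smallest digit, and an empty bucket gets the start of the next non-empty one, with a final sentinel entry equal to N, so that bucket k always occupies positions Buck[k] to Buck[k+1]-1.

```python
def choose_t(n, sigma_size):
    """Largest T with sigma_size**T <= n (assumes sigma_size >= 2)."""
    t = 0
    while sigma_size ** (t + 1) <= n:
        t += 1
    return t


def int_t(u, t, sigma):
    """Base-|sigma| integer encoding of the T-symbol prefix of u.

    sigma is the alphabet as a sorted string; short prefixes are padded
    with digit 0 (an assumption made for this sketch).
    """
    code = 0
    for j in range(t):
        digit = sigma.index(u[j]) if j < len(u) else 0
        code = code * len(sigma) + digit
    return code


def build_buck(a, pos, t, sigma):
    """buck[k] = first array position of bucket k; buck[-1] = N sentinel."""
    bucket_count = len(sigma) ** t
    counts = [0] * bucket_count
    for i in pos:
        counts[int_t(a[i:i + t], t, sigma)] += 1
    buck = [0] * (bucket_count + 1)
    for k in range(bucket_count):
        buck[k + 1] = buck[k] + counts[k]
    return buck


print(int_t("abaa", 3, "ab"))                  # 2
a, pos = "aababa", [5, 0, 3, 1, 4, 2]
t = choose_t(len(a), 2)                        # 2, since 2**2 <= 6 < 2**3
buck = build_buck(a, pos, t, "ab")
print(buck)                                    # [0, 2, 4, 6, 6]
k = int_t("ba", t, "ab")
print(buck[k], buck[k + 1] - 1)                # 4 5
```

A word W of at least T symbols then only needs the binary search inside positions `buck[k]` to `buck[k + 1] - 1`.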
Expected Case Improvements - Search Algorithm: Number of buckets = |Σ|^T ≤ N. Average number of elements in a bucket = O(1). In the binary search for W, the expected size of the bucket to search is O(1), so the expected number of search steps is O(1). Expected-case time: O(P).
Expected Case Improvements - Sort Algorithm: The first stage of the sort orders the suffixes by first symbol. Replace it with a sort according to Int_T, which is equivalent to sorting by the first T symbols and can be done in O(N) time. This changes the base case of the sort from H = 1 to H = T.
Expected Case Improvements - Sort Algorithm: Observation: let C be the length of the longest repeated substring of A. The sort is in fact complete once we have reached (C+1)-buckets. Proof: suppose some (C+1)-bucket contains more than one suffix. Then we have two suffixes with lcp > C, and their common prefix is a repeated substring longer than C: a contradiction.
Expected Case Improvements - Sort Algorithm: Expected case: C = O(log_|Σ| N) = O(T). Number of stages: O(1). Expected-case time: O(N).
Expected Case Improvements - LCP Computation: Replace the interval tree with a sort history. The sort history is a tree that models the refinement of buckets during the sort. It has a vertex for each H-bucket, and each vertex holds the stage number at which its bucket was split.
Expected Case Improvements - LCP Computation: Leaves correspond to suffixes and are arranged in an N-element array. Each interior vertex has at least two children, so there are O(N) nodes. The tree can be built with O(N) additional time during the sort.
Expected Case Improvements - LCP Computation: Given the sort history we can compute lcp(A_p, A_q). Find the nca (nearest common ancestor) of A_p and A_q and let H be the nca's stage number. Then lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H}). Recursively compute lcp(A_{p+H}, A_{q+H}), and stop when the nca is the root.
Expected Case Improvements - LCP Computation: Each step is O(1). At each step the stage number of the nca is at least halved. Suppose we stop the recursion when H < T' = ⌊(1/2) log_|Σ| N⌋. The expected length of the longest repeated substring is O(T), so the expected-case lcp is O(T) = O(log_|Σ| N).
Expected Case Improvements - LCP Computation: O(1) recursive steps in the expected case. Expected-case time for one lcp: O(1). Expected-case time for computing Llcp[] and Rlcp[]: O(N).
Expected Case Improvements - LCP Computation: We need a way to find lcps that are known to be less than T'. Build a |Σ|^T' x |Σ|^T' array: Lookup[Int_T'(x), Int_T'(y)] = lcp(x, y) for all T'-symbol strings x, y. It has at most N entries (|Σ|^T' = √N) and can be computed incrementally in O(N). The final recursion steps are replaced by an O(1) lookup.
Expected Time Complexity: Search time O(P). Sort + LCP computation time O(N).
