Suffix Arrays: A New Method for Online String Searches, by U. Manber and G. Myers

Introduction - String matching
- Let $A = a_0 a_1 \ldots a_{N-1}$ be a large text of length $N$.
- Let $W = w_0 w_1 \ldots w_{P-1}$ be a word of length $P$.
- Is $W$ a substring of $A$?

Introduction - Suffix Trees
- Build time: $O(N)$
- Search time: $O(P)$
- Structure space: $O(N)$, with a big constant
- Dependent on $|\Sigma|$

Suffix Arrays
- An array of all the suffixes of $A$
- Sorted in lexicographical order
- Example: $A$ = aababa has the sorted suffixes a, aababa, aba, ababa, ba, baba.

Suffix Arrays
- $A$ = aababa
- $A_i = a_i a_{i+1} \ldots a_{N-1}$ is the suffix of $A$ that starts at position $i$.
- Position array (Pos): $Pos[k]$ is the start position of the $k$th smallest suffix, so $A_{Pos[k]}$ is the $k$th smallest suffix.
- Example: for $k = 0, 1, \ldots, 5$, $Pos = [5, 0, 3, 1, 4, 2]$.
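The definition of Pos can be checked directly with a minimal sketch. This is not the construction algorithm from the talk: it simply sorts start positions by the suffix they denote, which is easy to read but slow for large texts. The name `naive_suffix_array` is an illustrative assumption.

```python
def naive_suffix_array(text):
    """Return Pos: the start positions of the suffixes of text in lexicographic order."""
    # Sorting by the suffix string itself follows the definition, not an efficient method.
    return sorted(range(len(text)), key=lambda i: text[i:])


pos = naive_suffix_array("aababa")
print(pos)                          # [5, 0, 3, 1, 4, 2]
print(["aababa"[i:] for i in pos])  # ['a', 'aababa', 'aba', 'ababa', 'ba', 'baba']
```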

Searching
- "Is $W$ a substring of $A$?"
- $W$ is a substring of $A$ if and only if some suffix $A_i$ starts with $W$; $i$ is then a location of $W$.
- All the instances of $W$ match consecutive suffixes in the array.
- Find the array interval that contains those suffixes.

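Searching - Definitions
- For a string $u$, let $u^p = u_0 u_1 \ldots u_{p-1}$ be its prefix of length $p$.
- For strings $u, v$: $u \le_p v \iff u^p \le v^p$. The relations $\ne_p$, $=_p$, $>_p$, ... are defined the same way.
- For any $p$, Pos is ordered according to $\le_p$.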

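Searching - Definitions
- $W = w_0 w_1 \ldots w_{P-1}$
- $L_W = \min\{k : W \le_P A_{Pos[k]} \text{ or } k = N\}$: the first suffix that is $\ge_P W$.
- $R_W = \max\{k : A_{Pos[k]} \le_P W \text{ or } k = -1\}$: the last suffix that is $\le_P W$.
- The array splits into three regions: $W >_P A_{Pos[k]}$ for $k < L_W$, $W =_P A_{Pos[k]}$ for $L_W \le k \le R_W$, and $W <_P A_{Pos[k]}$ for $k > R_W$.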

Search Algorithm
- $k \in [L_W, R_W] \iff W =_P A_{Pos[k]}$
- To find the instances of $W$, find $[L_W, R_W]$.
- The number of occurrences of $W$ is $R_W - L_W + 1$.
- The matches are $A_{Pos[L_W]}, \ldots, A_{Pos[R_W]}$.
- The suffix array is sorted, so use binary search.

Binary Search
- Search interval $[L, R]$ with midpoint $M$.
- Compare $W$ to $A_{Pos[M]}$ and decide where to search next:
  - $W \le_P A_{Pos[M]}$: search in the left half ($R = M$).
  - $W >_P A_{Pos[M]}$: search in the right half ($L = M$).
- Time: $O(P \log N)$.
- [Figure: binary search for W = abc, with L, M and R marked on the array]
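The plain binary search can be sketched as follows. The function name `find_interval` and the half-open bookkeeping (`lo = mid + 1`) are illustrative choices, not the exact loop on the slide. Each comparison looks at up to P symbols, which gives the O(P log N) bound.

```python
def find_interval(text, pos, w):
    """Return (L_W, R_W); W does not occur in text when L_W > R_W."""
    n, p = len(pos), len(w)

    def prefix(k):
        # First P symbols of the k-th smallest suffix.
        return text[pos[k]:pos[k] + p]

    # L_W: smallest k with W <=_P A_Pos[k], or N if there is none.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if w <= prefix(mid):
            hi = mid
        else:
            lo = mid + 1
    l_w = lo

    # R_W: largest k with A_Pos[k] <=_P W, or -1 if there is none.
    lo, hi = -1, n - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix(mid) <= w:
            lo = mid
        else:
            hi = mid - 1
    return l_w, lo


text = "aababa"
pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
l_w, r_w = find_interval(text, pos, "ab")
print(l_w, r_w)                                      # 2 3
print(sorted(pos[k] for k in range(l_w, r_w + 1)))   # [1, 3]
```

The occurrences are read off as `pos[L_W..R_W]`; here "ab" starts at positions 1 and 3 of the text.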

Search Algorithm
- Observation: we can use information from one comparison to speed up the next comparisons.
- Use additional information: lcp = longest common prefix.

Search Algorithm - lcp
- $lcp(v, w)$ = the length of the longest common prefix of $v$ and $w$.
- Obtained by comparing $v$ and $w$ and stopping at the first unequal symbol.
- Use precomputed lcp information to reduce the number of comparisons to $O(P + \log N)$.

Search Algorithm
- Consider all possible midpoints $M = 1, \ldots, N-2$.
- Every midpoint corresponds to a triplet $(L_M, M, R_M)$.
- Suppose we precomputed two arrays:
  - $Llcp[M] = lcp(A_{Pos[L_M]}, A_{Pos[M]})$
  - $Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R_M]})$

Search Algorithm
- Maintain two more variables:
  - $l = lcp(A_{Pos[L]}, W)$
  - $r = lcp(W, A_{Pos[R]})$
- [Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1]

Search Algorithm
- Assume $l \ge r$ and compare $l$ with $Llcp[M]$.
- If $l < Llcp[M]$:
  - $W >_{l+1} A_{Pos[L_M]}$
  - $A_{Pos[L_M]} =_{l+1} A_{Pos[M]}$
  - Hence $W >_{l+1} A_{Pos[M]}$.
  - Go right; $l$ remains unchanged.
- [Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 3, Rlcp[M] = 1]

Search Algorithm
- If $l > Llcp[M]$:
  - $A_{Pos[L_M]} <_l A_{Pos[M]}$
  - $W =_l A_{Pos[L_M]}$
  - Hence $W <_l A_{Pos[M]}$.
  - Go left; $r = Llcp[M]$.
- [Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 1, Rlcp[M] = 1]

Search Algorithm
- If $l = Llcp[M]$:
  - $W$ can be in either half.
  - Start comparing $W$ and $A_{Pos[M]}$ from symbol $l+1$.
  - The first unequal symbol determines whether to go right or left.
  - Whichever of $l$ or $r$ belongs to the bound that moves is updated to $l + j$, where $j$ is the number of additional matching symbols; this costs $j+1$ comparisons.
- [Figure: example with W = abcd, l = 2, r = 1, Llcp[M] = 2, Rlcp[M] = 1]

Search Algorithm - Complexity
- In each iteration, let $h = \max(l, r)$.
  - The comparison starts after the first $h$ symbols and makes $j+1$ symbol comparisons ($j$ matches and at most one mismatch).
  - The next iteration starts after the first $h+j$ symbols.
  - So $j$ symbols out of the $j+1$ will not be compared again.

Search Algorithm - Complexity
- Every symbol in $W$ is successfully matched at most once: $O(P)$ successful comparisons.
- At most one symbol is unsuccessfully matched in each iteration: $O(\log N)$ unsuccessful comparisons.
- Total: $O(P + \log N)$ comparisons.
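The three cases above can be combined into one loop that finds L_W. The sketch below assumes the Llcp and Rlcp tables are already available; for brevity they are filled here by direct symbol comparison (`build_lcp_tables`, a hypothetical helper that is not an efficient construction). All function names are illustrative.

```python
def extend(text, start, w, k):
    """Length of the common prefix of text[start:] and w, given the first k symbols match."""
    while k < len(w) and start + k < len(text) and text[start + k] == w[k]:
        k += 1
    return k


def build_lcp_tables(text, pos):
    """Llcp[M] and Rlcp[M] for every binary-search midpoint M (naive, for illustration)."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n

    def fill(lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        llcp[mid] = extend(text, pos[lo], text[pos[mid]:], 0)
        rlcp[mid] = extend(text, pos[hi], text[pos[mid]:], 0)
        fill(lo, mid)
        fill(mid, hi)

    fill(0, n - 1)
    return llcp, rlcp


def find_lw(text, pos, llcp, rlcp, w):
    """Smallest k with W <=_P A_Pos[k], or N if there is none."""
    n, p = len(pos), len(w)
    if n == 0:
        return 0

    def w_le(m, k):
        # Decide W <=_P A_Pos[k], where m = lcp(W, A_Pos[k]).
        if m == p:
            return True
        i = pos[k] + m
        return i < n and w[m] <= text[i]  # a suffix that ended early is smaller than W

    l = extend(text, pos[0], w, 0)
    r = extend(text, pos[n - 1], w, 0)
    if w_le(l, 0):
        return 0
    if not w_le(r, n - 1):
        return n
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            # Only Llcp[M] >= l needs symbol comparisons, and they resume at symbol l.
            m = extend(text, pos[mid], w, l) if llcp[mid] >= l else llcp[mid]
        else:
            m = extend(text, pos[mid], w, r) if rlcp[mid] >= r else rlcp[mid]
        if w_le(m, mid):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
llcp, rlcp = build_lcp_tables(text, pos)
print(find_lw(text, pos, llcp, rlcp, "ss"))  # 6
```

R_W is found by the symmetric loop; it is left out to keep the sketch short.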

Build Suffix Array
- So far:
  - An $O(P + \log N)$ search algorithm
  - Given a sorted suffix array
  - Given lcp information (Llcp, Rlcp)
- Next:
  - Sort the suffix array in $O(N \log N)$
  - Compute the lcps while sorting the array

Sort Algorithm
- First stage: sort the suffixes into buckets according to their first symbol.
- Inductive stage:
  - Assume the array is bucket sorted according to the first $H$ symbols.
  - Every H-bucket holds suffixes with the same first $H$ symbols.
  - Buckets are ordered according to the $\le_H$ relation.
  - Sort according to the first $2H$ symbols.

Sort Algorithm - Intuition
- Let $A_i$, $A_j$ be two suffixes in the same H-bucket, so $A_i =_H A_j$.
- The next $H$ symbols of $A_i$ and $A_j$ are the first $H$ symbols of $A_{i+H}$ and $A_{j+H}$.
- To determine the $\le_{2H}$ order of $A_i$ and $A_j$, look at the $\le_H$ order of $A_{i+H}$ and $A_{j+H}$.
- [Figure: example with H = 2 showing A_i, A_j, A_{i+H} and A_{j+H} in the array]

Sort Algorithm - Main Idea
- Let $A_i$ be a suffix in the first H-bucket.
- $A_i$ starts with the smallest H-symbol string.
- So $A_{i-H}$ should be the first in its 2H-bucket.
- [Figure: example with A = aababa, H = 1]

Sort Algorithm
- In stage $H$:
  - Go over all the suffixes in the $\le_H$ order.
  - For each $A_i$, move $A_{i-H}$ to the next available place in its H-bucket.
  - The suffixes are now sorted according to the $\le_{2H}$ order.
- Go on to stage $2H$ to produce the $\le_{4H}$ order.

Sort Algorithm - Example
- $A$ = assassin (positions 0 to 7)
- $\le_1$ order (H = 1): $A_3, A_0, A_6, A_7, A_1, A_5, A_4, A_2$ (assin, assassin, in, n, ssassin, sin, ssin, sassin)
- [Figure: the array before and after the stage that produces the H = 2 order]

Sort Algorithm - Example
- $A$ = assassin
- $\le_2$ order (H = 2): $A_0, A_3, A_6, A_7, A_2, A_5, A_4, A_1$ (assassin, assin, in, n, sassin, sin, ssin, ssassin)
- $\le_4$ order (H = 4): $A_0, A_3, A_6, A_7, A_2, A_5, A_1, A_4$ (assassin, assin, in, n, sassin, sin, ssassin, ssin)

Sort Algorithm - Complexity
- First stage: bucket sort according to first symbol, $O(N \log N)$.
- Inductive stages: $O(\log N)$ stages, $O(N)$ per stage.
- Total: $O(N \log N)$.
- Space: can be implemented using two N-sized integer arrays.
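The doubling idea (order by 2H symbols using the known order by H symbols) can be illustrated with a simplified sketch. It deliberately swaps the linear bucket pass described above for Python's built-in sort on rank pairs, so each stage costs O(N log N) rather than O(N). The name `suffix_array_doubling` and the rank array are illustrative assumptions.

```python
def suffix_array_doubling(text):
    """Build Pos by prefix doubling: after each stage the order is correct on 2H symbols."""
    n = len(text)
    pos = sorted(range(n), key=lambda i: text[i])
    # rank[i] = index of the H-bucket that holds suffix A_i (H = 1 to start with).
    rank = [0] * n
    for k in range(1, n):
        rank[pos[k]] = rank[pos[k - 1]] + (text[pos[k]] != text[pos[k - 1]])
    h = 1
    # Stop once every bucket holds exactly one suffix.
    while n > 0 and rank[pos[-1]] < n - 1:
        def key(i):
            # The <=_2H order of A_i is its H-bucket followed by the H-bucket of A_{i+H};
            # a suffix shorter than H symbols sorts first within its bucket.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)
        new_rank = [0] * n
        for k in range(1, n):
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        h *= 2
    return pos


print(suffix_array_doubling("assassin"))  # [0, 3, 6, 7, 2, 5, 1, 4]
```

The printed order is A0, A3, A6, A7, A2, A5, A1, A4, the fully sorted array for assassin.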

Finding Longest Common Prefixes
- The search algorithm uses lcp information:
  - $Llcp[M] = lcp(A_{Pos[L_M]}, A_{Pos[M]})$
  - $Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R_M]})$
- We want to compute this information while we are sorting the array.

Finding Longest Common Prefixes
- Show how to compute lcps for suffixes in adjacent H-buckets during the sort algorithm.
- Use that to compute the lcps of all the suffixes that are consecutive in the sorted suffix array.
- Show how to compute lcps for all the necessary suffixes.

Finding LCP for adjacent buckets
- After the first sort stage, the lcp of suffixes in adjacent buckets is 0.
- Assume that after stage $H$ we know the lcps between suffixes in adjacent H-buckets.
- Suppose $A_p$ and $A_q$ are in the same H-bucket but not in the same 2H-bucket. Then:
  - $H \le lcp(A_p, A_q) < 2H$
  - $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$
  - $lcp(A_{p+H}, A_{q+H}) < H$

Finding LCP for adjacent buckets
- Let $i, j$ be the positions of $A_{p+H}$, $A_{q+H}$ in the suffix array, and assume $i < j$.
- The array is ordered according to the $\le_H$ order.
- $lcp(A_{Pos[i]}, A_{Pos[j]}) = \min_{k \in [i+1, j]} lcp(A_{Pos[k-1]}, A_{Pos[k]})$
- [Figure: example with A = aababa, H = 1, positions i and j marked]
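The range-minimum identity is easiest to see on a fully sorted array, where it holds exactly for any pair of positions. The sketch below is an illustration only: it computes the adjacent lcps by direct comparison and then recovers the lcp of two non-adjacent suffixes as a minimum. The names `height_array` and `lcp_by_range_min` are assumptions.

```python
def lcp(u, v):
    """Length of the longest common prefix of u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def height_array(text, pos):
    """hgt[i] = lcp(A_Pos[i-1], A_Pos[i]) for i >= 1; hgt[0] is an unused placeholder."""
    return [0] + [lcp(text[pos[i - 1]:], text[pos[i]:]) for i in range(1, len(pos))]


def lcp_by_range_min(hgt, i, j):
    """lcp of the suffixes at array positions i < j, using adjacent lcps only."""
    return min(hgt[i + 1:j + 1])


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
hgt = height_array(text, pos)
print(hgt)                           # [0, 3, 0, 0, 0, 1, 1, 2]
print(lcp_by_range_min(hgt, 4, 7))   # 1, the lcp of "sassin" and "ssin"
```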

LCP Data Structures - Hgt[]
- We need a data structure that will allow us to:
  - get the lcps of consecutive suffixes
  - get their minimum
- Hgt[]: an array of size $N-1$
- $Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]})$

LCP Data Structures - Hgt[]
- Hgt is computed inductively throughout the sort.
  - Initialized to $N+1$.
  - $Hgt[i]$ is updated in stage $2H$ when $A_{Pos[i]}$ starts a new 2H-bucket.
- To update $Hgt[i]$:
  - Let $a, b$ be the array positions of $A_{Pos[i-1]+H}$ and $A_{Pos[i]+H}$, and assume $a \le b$.
  - $Hgt[i] = H + \min_{k \in [a+1, b]} Hgt[k]$

Finding LCP - Example
- $lcp(\text{sin}, \text{ssin}) = 1 + lcp(\text{in}, \text{sin}) = 1 + \min(lcp(\text{in}, \text{n}), lcp(\text{n}, \text{sassin}), lcp(\text{sassin}, \text{sin})) = 1 + 0 = 1$
- $lcp(\text{sassin}, \text{sin}) = 1 + lcp(\text{assin}, \text{in}) = 1$
- [Figure: Hgt values for A = assassin at H = 1, H = 2 and H = 4]

LCP Data Structures - Interval Tree
- We need the following operations for Hgt[]:
  - Set(i, h): sets $Hgt[i]$ to $h$.
  - Min_height(i, j): determines $\min_{k \in [i, j]} Hgt[k]$.
- We need a way to find the lcps for all the necessary suffixes, not just the ones in consecutive positions.

- A full and balanced binary tree
- $N-1$ leaves, corresponding to Hgt[]
- $O(\log N)$ height, $N-2$ interior vertices
- Keep a Hgt value for each interior vertex as well: $Hgt[v] = \min(Hgt[left(v)], Hgt[right(v)])$

LCP Data Structures - Interval Tree
- Operations implementation:
  - Set(i, h): set $Hgt[i]$ to $h$ and update the Hgt values on the path from $i$ to the root.
  - Min_height(i, j): find the minimal Hgt value by scanning $O(\log N)$ vertices in the tree.
- Operations complexity: $O(\log N)$
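A compact way to get both operations in O(log N) is an array-backed minimum tree. The sketch below is one possible layout (leaves stored at indices size..2*size-1, parent of v at v // 2) and is not necessarily the tree shape used in the talk; the class name `MinTree` is an assumption.

```python
class MinTree:
    """Array-backed binary tree: each interior vertex stores the minimum of its two children."""

    def __init__(self, size, initial):
        self.size = size
        self.tree = [initial] * (2 * size)  # index 0 is unused

    def set(self, i, h):
        """Set leaf i to h, then refresh the values on the path from the leaf to the root."""
        v = i + self.size
        self.tree[v] = h
        v //= 2
        while v >= 1:
            self.tree[v] = min(self.tree[2 * v], self.tree[2 * v + 1])
            v //= 2

    def min_height(self, i, j):
        """Minimum over leaves i..j (inclusive), visiting O(log N) vertices."""
        lo, hi = i + self.size, j + self.size + 1
        best = float("inf")
        while lo < hi:
            if lo & 1:               # lo is a right child: take it and step past it
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:               # hi is exclusive: take the left sibling before it
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


tree = MinTree(7, 9)                 # seven leaves, all starting at a large value
for i, h in enumerate([3, 0, 0, 0, 1, 1, 2]):
    tree.set(i, h)
print(tree.min_height(4, 6))         # 1
print(tree.min_height(0, 6))         # 0
```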

Finding LCP - Interval Tree
- [Figure: interval tree over the leaves (0,1), (1,2), ..., (6,7), with a Hgt value at each vertex]

Finding LCP - Complexity
- In stage $2H$ we update $Hgt[i]$ for all the leaves that started new buckets.
  - Each update is one Set operation and one Min_height: $O(\log N)$.
  - Throughout the algorithm every leaf is updated exactly once: $O(N)$ updates.
  - Updates complexity: $O(N \log N)$
- In each stage we scan the array to see which suffixes opened new buckets.
  - Scans complexity: $O(N \log N)$
- Total LCP complexity: $O(N \log N)$

Finding LCP - Llcp[] and Rlcp[]
- We want Llcp[] and Rlcp[] to be available directly from the interval tree at the end of the sort.
- Use an interval tree that represents a binary search:
  - Each interior node corresponds to $(L_M, R_M)$ for some $M$.
  - For each interior node $(L_M, R_M)$: $Left(L_M, R_M) = (L_M, M)$ and $Right(L_M, R_M) = (M, R_M)$.
  - $N-2$ interior nodes
  - Leaves correspond to $(i-1, i)$, with $Leaf(i-1, i) = Hgt[i]$.

Finding LCP - Llcp[] and Rlcp[]
- According to the interval tree structure:
  - $Hgt[(L, R)] = \min_{k \in [L+1, R]} Hgt[k]$
  - $Hgt[(L, R)] = lcp(A_{Pos[L]}, A_{Pos[R]})$
- $Llcp[M] = Hgt[(L_M, M)]$
- $Rlcp[M] = Hgt[(M, R_M)]$
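Given a finished Hgt array, the two tables can be read off by walking the binary-search tree once. The sketch below assumes `hgt[i] = lcp(A_Pos[i-1], A_Pos[i])` with `hgt[0]` unused, and midpoints chosen as `(L + R) // 2`; the function name `llcp_rlcp_from_hgt` is illustrative.

```python
def llcp_rlcp_from_hgt(hgt):
    """Derive Llcp and Rlcp from adjacent lcps by one pass over the binary-search tree."""
    n = len(hgt)
    llcp, rlcp = [0] * n, [0] * n

    def fill(lo, hi):
        # Returns Hgt[(lo, hi)] = min(hgt[lo+1 .. hi]) = lcp(A_Pos[lo], A_Pos[hi]).
        if hi - lo == 1:
            return hgt[hi]                # leaf (hi-1, hi)
        mid = (lo + hi) // 2
        llcp[mid] = fill(lo, mid)         # left child (L_M, M)
        rlcp[mid] = fill(mid, hi)         # right child (M, R_M)
        return min(llcp[mid], rlcp[mid])

    if n > 1:
        fill(0, n - 1)
    return llcp, rlcp


# Adjacent lcps of the sorted suffixes of "assassin".
llcp, rlcp = llcp_rlcp_from_hgt([0, 3, 0, 0, 0, 1, 1, 2])
print(llcp)  # [0, 3, 0, 0, 0, 0, 1, 0]
print(rlcp)  # [0, 0, 0, 0, 1, 1, 2, 0]
```

Entries 0 and N-1 are never midpoints, so they keep their initial value. Each tree vertex is visited once, so the pass takes O(N) time.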

Worst Case Complexity
- Suffix Array
  - Build time: $O(N \log N)$
  - Search time: $O(P + \log N)$
  - Structure space: $O(N)$, $2N$ to $3N$ integers
  - Independent of $|\Sigma|$
- Suffix Tree
  - Build time: $O(N)$
  - Search time: $O(P)$
  - Structure space: $O(N)$, with a big constant
  - Dependent on $|\Sigma|$

Expected Time Improvements
- Improve the expected case time of:
  - Search algorithm
  - Sort algorithm
  - LCP computation
- Assumption: all N-symbol strings are equally likely.
- Under this assumption, the expected length of the longest repeated substring of $A$ is $O(\log_{|\Sigma|} N)$.

Expected Case Improvements - Main Idea
- Let $T = \lfloor \log_{|\Sigma|} N \rfloor$.
- Let $Int_T(u)$ = the integer encoding in base $|\Sigma|$ of the T-symbol prefix of $u$.
- Example: $T = 3$, $\Sigma = \{a, b\}$, $u$ = abaa, $Int_T(u) = 010_2 = 2$.
- There are $|\Sigma|^T \le N$ possible T-symbol prefixes, so $Int_T(u)$ is a number in $[0, N-1]$.
- Map each suffix $A_p$ to $Int_T(A_p)$; this can be done in $O(N)$ time.
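The encoding and the O(N) mapping can be sketched as follows. The digit assignment (symbols numbered in sorted order) and the function names are assumptions. The sketch requires T >= 1 and, as a simplification, only encodes start positions that have T symbols available, leaving the treatment of the last T-1 suffixes open.

```python
def int_t(u, t, alphabet):
    """Base-|alphabet| integer encoding of the first t symbols of u (needs len(u) >= t)."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    value = 0
    for c in u[:t]:
        value = value * len(digit) + digit[c]
    return value


def all_int_t(text, t, alphabet):
    """Int_T for every start position with t symbols available, in O(N) total time."""
    digit = {c: d for d, c in enumerate(sorted(alphabet))}
    base, n = len(digit), len(text)
    if n < t:
        return []
    top = base ** (t - 1)                 # weight of the leading digit
    value = int_t(text, t, alphabet)
    codes = [value]
    for i in range(1, n - t + 1):
        # Slide the window: drop the leading digit, shift, append the next symbol.
        value = (value - digit[text[i - 1]] * top) * base + digit[text[i + t - 1]]
        codes.append(value)
    return codes


print(int_t("abaa", 3, "ab"))         # 2
print(all_int_t("aababa", 2, "ab"))   # [0, 1, 2, 1, 2]
```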

Expected Case Improvements - Search Algorithm
- Use an additional array Buck[].
  - Think of the sorted array as buckets, based on the $Int_T$ encoding.
  - $Buck[k] = \min\{i : Int_T(A_{Pos[i]}) = k\}$: the first position that contains a suffix mapped to $k$.
- Compute Buck[] at the end of the sort algorithm: $O(N)$ additional time.

Expected Case Improvements - Search Algorithm
- Given a word $W$, we need to find $L_W$ and $R_W$.
- Let $k = Int_T(W)$.
- $L_W$ and $R_W$ must be in $k$'s bucket, $(Buck[k], Buck[k+1])$.
- We only need to search one bucket.

Expected Case Improvements - Search Algorithm
- Number of buckets = $|\Sigma|^T \le N$
- Average number of elements in a bucket = $O(1)$
- In the binary search for $W$:
  - Expected size of the bucket to search: $O(1)$
  - Expected number of search steps: $O(1)$
  - Expected case time: $O(P)$

Expected Case Improvements - Sort Algorithm
- The first stage of the sort orders the suffixes by their first symbol.
- Replace the first stage with a sort according to $Int_T$.
  - Equivalent to sorting according to the first $T$ symbols.
  - Can be done in $O(N)$ time.
  - The base case of the sort changes from $H = 1$ to $H = T$.

Expected Case Improvements - Sort Algorithm
- Observation:
  - Let $C$ be the length of the longest repeated substring of $A$.
  - The sort is in fact complete once we have reached $(C+1)$-buckets.
- Proof:
  - Suppose some $(C+1)$-bucket contains more than one suffix.
  - Then we have two suffixes with lcp $> C$.
  - Their common prefix is a repeated substring longer than $C$, a contradiction.

Expected Case Improvements - Sort Algorithm
- Expected case: $C = O(\log_{|\Sigma|} N) = O(T)$
- Number of stages: $O(1)$
- Expected case time: $O(N)$

Expected Case Improvements - LCP Computation
- Replace the interval tree with a sort history:
  - A binary tree
  - Models the refinement of buckets during the sort
  - A vertex for each H-bucket
  - Each vertex holds the stage number at which its bucket was split

Expected Case Improvements - LCP Computation
- Leaves correspond to suffixes and are arranged in an N-element array.
- Each vertex has at least two children.
- $O(N)$ nodes
- Can be built with $O(N)$ additional time during the sort.

Expected Case Improvements - LCP Computation
- Given the sort history we can compute $lcp(A_p, A_q)$:
  - Find the nca (nearest common ancestor) of $A_p$ and $A_q$.
  - Let $H$ be the nca's stage number.
  - $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$
  - Recursively compute $lcp(A_{p+H}, A_{q+H})$.
  - Stop when the nca is the root.

Expected Case Improvements - LCP Computation
- Each step is $O(1)$.
- At each step the stage number of the nca is at least halved.
- Suppose we stop the recursion when $H < T' = \lfloor \tfrac{1}{2} \log_{|\Sigma|} N \rfloor$.
- The expected length of the longest repeated substring is $O(T)$, so the expected case lcp is $O(T) = O(\log_{|\Sigma|} N)$.

Expected Case Improvements - LCP Computation
- $O(1)$ recursive steps in the expected case
- Expected case time for one lcp: $O(1)$
- Expected case time for computing Llcp[] and Rlcp[]: $O(N)$

Expected Case Improvements - LCP Computation
- We need a way to find lcps that are known to be less than $T'$.
- Build a $|\Sigma|^{T'} \times |\Sigma|^{T'}$ array:
  - $Lookup[Int_{T'}(x), Int_{T'}(y)] = lcp(x, y)$ for all $T'$-symbol strings $x, y$
  - At most $N$ entries ($|\Sigma|^{T'} = \sqrt{N}$)
  - Computed incrementally in $O(N)$
- The final recursion steps are replaced by an $O(1)$ lookup.

Expected Time Complexity
- Search time: $O(P)$
- Sort + LCP computation time: $O(N)$
