Presentation is loading. Please wait.

Presentation is loading. Please wait.

Ukkonen's suffix tree construction algorithm

Similar presentations


Presentation on theme: "Ukkonen's suffix tree construction algorithm"— Presentation transcript:

1 Ukkonen's suffix tree construction algorithm
a ba$ a ba$ aba$ aba$ $ $ab $ab 2 $ab 2 2 4 1 1 1 $ 1 $ 3 String Algorithms; Nov

2 Motivation Yet another suffix tree construction algorithm... Why?

3 Motivation Yet another suffix tree construction algorithm... Why?
An online algorithm, i.e. one we can update with new sequences

4 Sketch of algorithm Recall McCreight’s approach:
For i = 1 .. n+1, build compressed trie of {x[j..n]$ | j ≤ i} Ukkonen’s approach: For i = 1 .. n+1, build compressed trie of {x[j..i]$ | j ≤ i} Compressed trie of all suffixes of prefix x[1..i]$ of x$ A suffix tree except for “leaf” property

5 McCreight's algorithm – x=aba
T1 : {x$[j..n] | j ≤ 1 } aba$ 1 T2 : {x$[j..n] | j ≤ 2 } aba$ $ab 2 1 T3 : {x$[j..n] | j ≤ 3 } a ba$ $ab 2 1 $ 3 T4 : {x$[j..n] | j ≤ 4 } a ba$ $ $ab 2 4 1 $ 3

6 Ukkonen's algorithm – x=aba
T1 : {x$[j..1]} 1 T2 : {x$[j..2]} ab b 2 1 T3 : {x$[j..3]} a ba Note: no node for x[3..3] = “a” ab 2 1 a ba$ T4 : {x$[j..n]} $ $ab 2 4 1 $ 3

7 Tasks in iteration i In iteration i we must
Update each x[j..i] to x[j..i+1] Add string x[i+1] (special case of above) Leaf: Inner node: Edge: x[j..i] x[j..i] x[j..i] x[j..i] j x[j..i] x[i+1] x[i+1] x[j..i] x[i+1] x[i+1] x[j..i] x[j..i] x[j..i] x[j..i] j j j x[i+1] x[i+1] x[i+1]

8 First attempt... “Obvious” algorithm: Running time O(n3)
For i=1,...,n+1: for j=1,...,i: find x[j..i] append x[i+1] Running time O(n3) Need lots of tricks to get O(n)!

9 Updating leaves for free
If we label leaves with (k,∞) – denoting “k to the current i”, updating a leaf is automatic Leaf: Inner node: Edge: x[j..i] x[j..i] x[j..i] x[j..i] j x[j..i] x[i+1] x[i+1] x[j..i] x[i+1] x[i+1] x[j..i] x[j..i] x[j..i] x[j..i] j j j x[i+1] x[i+1] x[i+1]

10 Updating existing strings is free
If x[j..i+1] is already in the tree, the update is automatic Leaf: Inner node: Edge: x[j..i] x[j..i] x[j..i] x[j..i] j x[j..i] x[i+1] x[i+1] x[j..i] x[i+1] x[i+1] x[j..i] x[j..i] x[j..i] x[j..i] j j j x[i+1] x[i+1] x[i+1]

11 “Real” operations If we can recognize the free operations, we need only deal with the remaining Leaf: Inner node: Edge: x[j..i] x[j..i] x[j..i] x[j..i] j x[j..i] x[i+1] x[i+1] x[j..i] x[i+1] x[i+1] x[j..i] x[j..i] x[j..i] x[j..i] j j j x[i+1] x[i+1] x[i+1]

12 Lemma 5.2.4 Let j denote suffix x[j..i] of x[1..i]
If j>1 is a leaf node in Ti, then so is j-1 If, from j<i, there is a path in Ti that begins with “a”, then there is a path in Ti from j+1 beginning with “a”

13 Lemma 5.2.4 Let j denote suffix x[j..i] of x[1..i]
If j>1 is a leaf node in Ti, then so is j-1 If, from j<i, there is a path in Ti that begins with “a”, then there is a path in Ti from j+1 beginning with “a” “Once an edge, always an edge”

14 Lemma 5.2.4 Let j denote suffix x[j..i] of x[1..i]
If j>1 is a leaf node in Ti, then so is j-1 If, from j<i, there is a path in Ti that begins with “a”, then there is a path in Ti from j+1 beginning with “a” “Once a leaf, always been a leaf”

15 Proof of lemma 5.2.4 a) If j>1 is a leaf node in Ti, then so is j-1
Assume j-1 is not a leaf. Then there exists k<j-1 such that: j-1 k x[k..i] x[j-1..i] x[k..i] x[j-1..i] Then thus: j j-1 x[j..i] x[j..i] k x[k+1..i] x[k+1..i] k+1

16 Proof of lemma b) If, from j<i, there is a path in Ti that begins with “a”, then there is a path in Ti from j+1 beginning with “a” Assume j is followed by “a”, then there exists k<j such that: “a” j k “a” j+1 thus: k+1 Hence j+1 is followed by “a”.

17 Corollary In iteration i, there exist indices jL and jR such that:
All suffixes j≤jL are leaves All suffixes j≥jR are already in the trie I II III jL jR

18 Consequence of corollary
I and III are free operations x[j..i] x[j..i] x[j..i] x[j..i] j x[j..i] x[i+1] x[i+1] x[j..i] x[i+1] x[i+1] j x[j..i] x[j..i] x[j..i] x[j..i] j j x[i+1] x[i+1] x[i+1] I II III jL jR

19 Updated algorithm Implicitly handling “free” operations:
For i=1,...,n+1: for j=jL,...,jR: find x[j..i] append x[i+1] jL in iteration i is the last leaf inserted in iteration i-1 (all smaller indices are already leaves) jR in iteration i is the first index where x[j..i+1] is already in the trie (all larger indices are already in the trie)

20 Suffixes in II are made into leaves
Whenever jL < j < jR, j is made a leaf: x[j..i] x[j..i] x[j..i] x[i+1] x[j..i] x[i+1] j j

21 Suffixes in II are made into leaves
Whenever jL < j < jR, j is made a leaf: x[j..i] x[j..i] x[j..i] x[i+1] x[j..i] x[i+1] j j Once j is a leaf, it will be in I and never in II again

22 Time to go from II to I We handle j in II or implicitly in III time 2n: Iterations Index j in II 2 n+1 “Path” length is 2n

23 Updated running time Running time is 2n*T(find x[j..i])
For i=1,...,n+1: for j=jL,...,jR: find x[j..i] append x[i+1] Only 2n of these: Constant time Running time is 2n*T(find x[j..i])

24 Updated running time Running time is 2n*T(find x[j..i])
For i=1,...,n+1: for j=jL,...,jR: find x[j..i] append x[i+1] Only 2n of these: Constant time? Constant time Running time is 2n*T(find x[j..i]) We just have to deal with T(find x[j..i]) in O(1) No worries!

25 Using fastscan and s(-)
When searching for x[j..i], it is already in the trie We can use fastscan for the search T(find x[j..i]) in O(d) where d is the (node-)depth of x[j..i] If we keep suffix links, s(-), in the tree we can use these as shortcuts

26 Suffix links Invariant: All inner nodes have suffix links

27 Suffix links Invariant: All inner nodes have suffix links
Ensuring the invariant: We only insert inner nodes x[j..i] when adding leaves j Whenever we insert a new node, x[j..i] for some j<i, we also find or insert x[j+1..i], and can update s(x[j..i]) := x[j+1..i] If we insert x[i..i], then s(x[i..i]) := 

28 Finding x[j+1..i] from x[j..i]
s(parent(j)) parent(j) s x[j+1..i] j Starting from here (initial j is jL and we can keep a pointer to that node between iterations) Using fastscan here

29 Bound on fastscan Time usage by fastscan is bounded by n – for the maximal (node-)depth in the trie – plus total decrease of (node-)depth Decrease in depth: Moving to parent(j): 1 Moving to s(parent(j)): max 1 “Restarting” at jL: ?

30 Bound on fastscan Time usage by fastscan is bounded by n – for the maximal (node-)depth in the trie – plus total decrease of (node-)depth Decrease in depth: Moving to parent(j): 1 Moving to s(parent(j)): max 1 “Restarting” at jL: ? s(parent(j)) x[j+1..i] s parent(j) j Using fastscan here

31 Hacking the suffix link
When searching for x[jL+1..i], update s(x[jL..i]) to point to the nearest ancestor of x[jL+1..i] s(parent(jL)) parent(jL) s x[jL+1..i] jL Nearest ancestor of x[jL+1..i]

32 Hacking the suffix link
When searching for x[jL+1..i], update s(x[jL..i]) to point to the nearest ancestor of x[jL+1..i] s(parent(jL)) parent(jL) s x[jL+1..i] jL Nearest ancestor of x[jL+1..i] “Restart” in constant time

33 Searching time Vertical steps are paid for by the previous horizontal step (free restarting) Horizontal steps are total fastscan bounded by O(n) Runtime O(n) Index j in II 2 n+1 2 Iterations n+1

34 Summary We have seen a new suffix tree construction algorithm
Ukkonen’s algorithm is an “online” algorithm: As long as no suffix is a prefix of another, the intermediate trees are suffix trees Generalized suffix trees can be built one string at a time


Download ppt "Ukkonen's suffix tree construction algorithm"

Similar presentations


Ads by Google