1
Ukkonen's suffix tree construction algorithm
[Figure: example suffix trees with edge labels a, ba$, aba$ and $, and leaves numbered 1 to 4]
String Algorithms; Nov
2
Motivation
Yet another suffix tree construction algorithm... Why?
An online algorithm, i.e. one we can update with new sequences
4
Sketch of algorithm
Recall McCreight’s approach:
For i = 1 .. n+1, build compressed trie of {x[j..n]$ | j ≤ i}
Ukkonen’s approach:
For i = 1 .. n+1, build compressed trie of {x$[j..i] | j ≤ i}
Compressed trie of all suffixes of the prefix x$[1..i] of x$
A suffix tree except for the “leaf” property: a suffix may end inside an edge or at an inner node instead of at a leaf
5
McCreight's algorithm – x=aba
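T1 : {x$[j..n] | j ≤ 1 }: root with edge aba$ to leaf 1
T2 : {x$[j..n] | j ≤ 2 }: root with edges aba$ to leaf 1 and ba$ to leaf 2
T3 : {x$[j..n] | j ≤ 3 }: root with edge a to an inner node (edges ba$ to leaf 1 and $ to leaf 3) and edge ba$ to leaf 2
T4 : {x$[j..n] | j ≤ 4 }: as T3, plus edge $ from the root to leaf 4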
6
Ukkonen's algorithm – x=aba
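T1 : {x$[j..1]}: root with edge a to leaf 1
T2 : {x$[j..2]}: root with edges ab to leaf 1 and b to leaf 2
T3 : {x$[j..3]}: root with edges aba to leaf 1 and ba to leaf 2. Note: no node for x[3..3] = “a”
T4 : {x$[j..n]}: root with edge a to an inner node (edges ba$ to leaf 1 and $ to leaf 3), edge ba$ to leaf 2, and edge $ to leaf 4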
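The string sets behind the trees T1 to T4 in the two x=aba examples can be enumerated directly. The following is a minimal sketch, not part of the original slides; the function names `mccreight_set` and `ukkonen_set` are hypothetical, and indices are 0-based with half-open slices rather than the 1-based x[j..i] of the slides.

```python
def mccreight_set(x, i):
    """Strings stored after iteration i of McCreight: the i longest suffixes of x$."""
    s = x + "$"
    return [s[j:] for j in range(i)]


def ukkonen_set(x, i):
    """Strings stored after iteration i of Ukkonen: all suffixes of the length-i prefix of x$."""
    s = x + "$"
    return [s[j:i] for j in range(i)]


if __name__ == "__main__":
    for i in range(1, 5):
        print(i, mccreight_set("aba", i), ukkonen_set("aba", i))
```

In the last iteration both sets coincide, which is why the final tree is the same for both algorithms.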
7
Tasks in iteration i
In iteration i we must
Update each x[j..i] to x[j..i+1]
Add string x[i+1] (special case of above)
[Figure: the three cases for where x[j..i] ends (leaf, inner node, edge), each shown before and after appending x[i+1]]
8
First attempt...
“Obvious” algorithm:
For i=1,...,n+1:
  for j=1,...,i:
    find x[j..i]
    append x[i+1]
Running time O(n^3)
Need lots of tricks to get O(n)!
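The “obvious” algorithm can be sketched in a few lines. This is an illustration under simplifying assumptions, not the slides' data structure: it uses an uncompressed trie of nested dicts instead of a compressed trie, 0-based indices, and the hypothetical names `naive_suffix_trie` and `contains`. The three nested loops make the cubic running time visible.

```python
def naive_suffix_trie(x):
    """Build an (uncompressed) trie of all suffixes of x$ by extending every suffix in every iteration."""
    s = x + "$"
    root = {}
    for i in range(len(s)):           # iteration: extend the trie with character s[i]
        for j in range(i + 1):        # every suffix s[j:i] of the current prefix (empty when j == i)
            node = root
            for c in s[j:i]:          # "find": walk down from the root, O(i - j) steps
                node = node[c]
            node.setdefault(s[i], {})  # "append": add s[i] unless that child already exists
    return root


def contains(trie, w):
    """Return True if w labels a path starting at the root."""
    node = trie
    for c in w:
        if c not in node:
            return False
        node = node[c]
    return True


if __name__ == "__main__":
    t = naive_suffix_trie("aba")
    print(contains(t, "ba$"), contains(t, "bb"))
```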
9
Updating leaves for free
If we label leaves with (k,∞), denoting “k to the current i”, updating a leaf is automatic
[Figure: the leaf, inner node and edge cases]
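One way to picture the (k,∞) label is an edge object whose end is left open and resolved against the current iteration only when the label is read. This is a sketch with a hypothetical `Edge` class, using 0-based half-open indices; it is not taken from the slides.

```python
import math


class Edge:
    """Edge label stored as indices into the text; an open end means 'up to the current i'."""

    def __init__(self, start, end=math.inf):
        self.start = start
        self.end = end

    def label(self, s, i):
        # A closed end is used as is; an open end is clipped to the current iteration i.
        return s[self.start:min(self.end, i)]


if __name__ == "__main__":
    s = "aba$"
    leaf = Edge(1)              # leaf edge starting at index 1, open-ended
    print(leaf.label(s, 2))     # label while only two characters have been read
    print(leaf.label(s, 4))     # the same object after reading the whole string
```

The leaf object is never touched between the two calls, which is the sense in which the update is free.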
10
Updating existing strings is free
If x[j..i+1] is already in the tree, the update is automatic
[Figure: the leaf, inner node and edge cases]
11
“Real” operations
If we can recognize the free operations, we need only deal with the remaining ones
[Figure: the leaf, inner node and edge cases]
12
Lemma 5.2.4
Let j denote suffix x[j..i] of x[1..i]
a) If j>1 is a leaf node in Ti, then so is j-1
b) If, from j<i, there is a path in Ti that begins with “a”, then there is a path in Ti from j+1 beginning with “a”
In short: “Once an edge, always an edge” and “Once a leaf, always been a leaf”
15
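Proof of lemma 5.2.4
a) If j>1 is a leaf node in Ti, then so is j-1
Assume j-1 is not a leaf. Then there exists k<j-1 such that x[j-1..i] is a proper prefix of x[k..i].
Dropping the first character of both strings, x[j..i] is a proper prefix of x[k+1..i] with k+1<j, thus j is not a leaf either.
[Figure: the path to k passing through j-1, and the path to k+1 passing through j]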
16
Proof of lemma
b) If, from j<i, there is a path in Ti that begins with “a”, then there is a path in Ti from j+1 beginning with “a”
Assume j is followed by “a”, then there exists k<j such that x[j..i] followed by “a” is a prefix of x[k..i].
Dropping the first character of both strings, x[j+1..i] followed by “a” is a prefix of x[k+1..i].
Hence j+1 is followed by “a”.
17
Corollary
In iteration i, there exist indices jL and jR such that:
All suffixes j≤jL are leaves
All suffixes j≥jR are already in the trie
[Figure: the indices split into region I (j ≤ jL), region II (jL < j < jR) and region III (j ≥ jR)]
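The corollary can be checked by brute force on small strings. The sketch below is an illustration only: `classify` and `occurrences` are hypothetical names, indices are 0-based, and substring tests stand in for the trie. A suffix counts as a leaf (I) if it occurs exactly once in the current prefix, as already present (III) if its extension by the next character occurs in the current prefix, and as II otherwise.

```python
def occurrences(w, p):
    """Number of (possibly overlapping) occurrences of w in p."""
    return sum(p.startswith(w, k) for k in range(len(p) - len(w) + 1))


def classify(s, i):
    """Label each suffix of s[:i] (including the empty one) as region I, II or III
    for the iteration that appends s[i]."""
    p, c = s[:i], s[i]
    labels = []
    for j in range(i + 1):
        w = p[j:]
        if w and occurrences(w, p) == 1:
            labels.append("I")      # a leaf: extended for free
        elif w + c in p:
            labels.append("III")    # extension already in the trie: free
        else:
            labels.append("II")     # needs a real operation
    return labels


if __name__ == "__main__":
    for i in range(4):
        print(i, classify("aba$", i))
```

On every input tried here the labels appear in the order I, then II, then III, as the corollary states.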
18
Consequence of corollary
I and III are free operations
[Figure: the leaf, inner node and edge cases, with the indices split into regions I, II and III at jL and jR]
19
Updated algorithm
Implicitly handling “free” operations:
For i=1,...,n+1:
  for j=jL,...,jR:
    find x[j..i]
    append x[i+1]
jL in iteration i is the last leaf inserted in iteration i-1 (all smaller indices are already leaves)
jR in iteration i is the first index where x[j..i+1] is already in the trie (all larger indices are already in the trie)
20
Suffixes in II are made into leaves
Whenever jL < j < jR, j is made a leaf
[Figure: x[j..i] before and after appending x[i+1], with the new leaf j]
Once j is a leaf, it will be in I and never in II again
22
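Time to go from II to I
We handle j in II, or implicitly in III, at most 2n times in total
[Figure: index j in II (2 to n+1) plotted against iterations; each step moves to the next j or to the next iteration, so the “path” length is 2n]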
23
Updated running time
For i=1,...,n+1:
  for j=jL,...,jR:      (only 2n of these)
    find x[j..i]
    append x[i+1]       (constant time)
Running time is 2n*T(find x[j..i])
Is find x[j..i] constant time?
We just have to deal with T(find x[j..i]) in O(1)
No worries!
25
Using fastscan and s(-)
When searching for x[j..i], it is already in the trie
We can use fastscan for the search
T(find x[j..i]) in O(d) where d is the (node-)depth of x[j..i]
If we keep suffix links, s(-), in the tree we can use these as shortcuts
26
Suffix links
Invariant: All inner nodes have suffix links
Ensuring the invariant:
We only insert inner nodes x[j..i] when adding leaves j
Whenever we insert a new node, x[j..i] for some j<i, we also find or insert x[j+1..i], and can update s(x[j..i]) := x[j+1..i]
If we insert x[i..i], then s(x[i..i]) :=
28
Finding x[j+1..i] from x[j..i]
[Figure: from j, move up to parent(j), follow the suffix link s to s(parent(j)), then use fastscan down to x[j+1..i]]
Starting from j (initial j is jL and we can keep a pointer to that node between iterations)
29
Bound on fastscan
Time usage by fastscan is bounded by n (the maximal (node-)depth in the trie) plus the total decrease of (node-)depth
Decrease in depth:
Moving to parent(j): 1
Moving to s(parent(j)): max 1
“Restarting” at jL: ?
31
Hacking the suffix link
When searching for x[jL+1..i], update s(x[jL..i]) to point to the nearest ancestor of x[jL+1..i]
[Figure: jL, parent(jL), the suffix link s to s(parent(jL)), and x[jL+1..i] with its nearest ancestor marked]
“Restart” in constant time
33
Searching time
Vertical steps are paid for by the previous horizontal step (free restarting)
Horizontal steps are, in total, fastscan work bounded by O(n)
Runtime O(n)
[Figure: index j in II (2 to n+1) plotted against iterations (2 to n+1)]
34
Summary
We have seen a new suffix tree construction algorithm
Ukkonen’s algorithm is an “online” algorithm:
As long as no suffix is a prefix of another, the intermediate trees are suffix trees
Generalized suffix trees can be built one string at a time
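For reference, a compact linear-time construction can be sketched as follows. Note the assumptions: this uses the widely taught “active point” bookkeeping (active node, active edge, active length, remainder) rather than the explicit jL/jR indices of these slides, though it relies on the same ingredients (open-ended leaves, stopping at the first suffix already present, suffix links, and skipping whole edges). The names `Node`, `build_suffix_tree` and `leaf_paths` are hypothetical, and indices are 0-based.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # the edge into this node is labelled s[start:end]
        self.end = end        # None marks an open-ended leaf ("up to the current i")
        self.children = {}
        self.link = None      # suffix link (set for inner nodes)


def build_suffix_tree(s):
    root = Node(0, 0)
    active_node, active_edge, active_len = root, 0, 0
    remainder = 0                                # suffixes still waiting to be inserted
    for i, c in enumerate(s):
        last_new = None                          # inner node still waiting for its suffix link
        remainder += 1
        while remainder > 0:
            if active_len == 0:
                active_edge = i
            ch = s[active_edge]
            if ch not in active_node.children:
                active_node.children[ch] = Node(i, None)       # new leaf
                if last_new is not None:
                    last_new.link = active_node
                    last_new = None
            else:
                nxt = active_node.children[ch]
                edge_len = (i + 1 if nxt.end is None else nxt.end) - nxt.start
                if active_len >= edge_len:                     # skip the whole edge
                    active_node = nxt
                    active_edge += edge_len
                    active_len -= edge_len
                    continue
                if s[nxt.start + active_len] == c:             # already in the tree: stop
                    active_len += 1
                    if last_new is not None:
                        last_new.link = active_node
                    break
                split = Node(nxt.start, nxt.start + active_len)  # split the edge
                active_node.children[ch] = split
                split.children[c] = Node(i, None)
                nxt.start += active_len
                split.children[s[nxt.start]] = nxt
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = i - remainder + 1
            elif active_node is not root:
                active_node = active_node.link or root
    return root


def leaf_paths(node, s, prefix=""):
    """Collect the strings spelled out on all root-to-leaf paths."""
    if not node.children:
        return [prefix]
    out = []
    for child in node.children.values():
        end = len(s) if child.end is None else child.end
        out += leaf_paths(child, s, prefix + s[child.start:end])
    return out


if __name__ == "__main__":
    print(sorted(leaf_paths(build_suffix_tree("aba$"), "aba$")))
```

With a unique terminator such as $, every suffix ends at its own leaf, so the leaf paths are exactly the suffixes of the input.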