Published by Ebony Deaner. Modified over 10 years ago.
2
Suffix Trees Construction and Applications João Carreira 2008
3
Outline: Why Suffix Trees?; Definition; Ukkonen's Algorithm (construction); Applications
9
Why Suffix Trees? Asymptotically fast. The basis of state-of-the-art data structures. You don't need a PhD to use them. Challenging. They expose interesting algorithmic ideas.
14
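Definition. Suffix tree for an m-character string S:
- It is a rooted tree with exactly m leaves, numbered 1 to m.
- Edge-label vs node-label: each edge is labeled with a nonempty substring of S; the label of a node is the concatenation of the edge-labels on the path from the root to that node.
- Each internal node other than the root has at least two children.
- The label of leaf j is S[ j..m ].
- No two edges out of the same node can have edge-labels beginning with the same character.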
15
Definition: Example. String: xabxac. Length (m): 6 characters. Number of leaves: 6. Label of leaf 5: S[ 5..6 ] = ac.
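The definition can be checked on the example string with a small program. The sketch below is a naive quadratic construction (insert one suffix at a time, splitting edges where needed), not Ukkonen's algorithm; the names `Node`, `build_suffix_tree`, `leaves` and `internal_fanouts` are illustrative and do not come from the slides. It assumes the last character of the string occurs nowhere else, which holds for xabxac.

```python
class Node:
    def __init__(self, leaf=None):
        self.edges = {}    # first character -> (edge label, child Node)
        self.leaf = leaf   # suffix number for leaves, None for internal nodes


def build_suffix_tree(s):
    """Naive quadratic construction: insert the suffixes S[j..m] one by one."""
    if s[-1] in s[:-1]:
        raise ValueError("last character must be unique (append a terminator)")
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            if rest[0] not in node.edges:
                # No edge starts with this character: hang a new leaf here.
                node.edges[rest[0]] = (rest, Node(leaf=j + 1))
                break
            label, child = node.edges[rest[0]]
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # The whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it with a new internal node.
            mid = Node()
            mid.edges[label[k]] = (label[k:], child)
            mid.edges[rest[k]] = (rest[k:], Node(leaf=j + 1))
            node.edges[rest[0]] = (label[:k], mid)
            break
    return root


def leaves(node, path=""):
    """Yield (leaf number, path-label) for every leaf below node."""
    if node.leaf is not None:
        yield node.leaf, path
    for label, child in node.edges.values():
        yield from leaves(child, path + label)


def internal_fanouts(node):
    """Yield the number of children of every internal node below node."""
    for _, child in node.edges.values():
        if child.leaf is None:
            yield len(child.edges)
            yield from internal_fanouts(child)


tree = build_suffix_tree("xabxac")
labels = dict(leaves(tree))
print(len(labels), labels[5])
```

For xabxac this gives 6 leaves, leaf 5 carries the label ac, and both internal nodes below the root have two children.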
16
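Implicit vs Explicit. What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path does not end at a leaf and the tree is only implicit. Appending a terminal character that occurs nowhere else (such as $) makes every suffix end at its own leaf.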
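Which suffixes lose their leaf can be worked out directly from the string, without building a tree. The helper below is a hypothetical illustration (the name `hidden_suffixes` is not from the slides): it lists the 1-based start positions of the suffixes that are a proper prefix of another suffix.

```python
def hidden_suffixes(s):
    """1-based starts of suffixes that are a proper prefix of another suffix.

    Such suffixes end in the middle of the tree rather than at a leaf,
    so a tree built for s alone is only an implicit suffix tree.
    """
    m = len(s)
    return [
        j + 1
        for j in range(m)
        if any(s[k:].startswith(s[j:]) for k in range(m) if k != j)
    ]


print(hidden_suffixes("axabx"))    # suffix 5 ("x") is a prefix of "xabx"
print(hidden_suffixes("axabx$"))   # with a unique terminator nothing is hidden
```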
17
Ukkonen's Algorithm suffix tree construction
18
Ukkonen's Algorithm: suffix tree construction. Text: S[ 1..m ]. There are m phases, and phase i + 1 is divided into i + 1 extensions. In extension j of phase i + 1: find the end of the path from the root labeled with substring S[ j..i ], then extend the substring by adding the character S(i + 1) to its end (unless it is already there).
19
Extension Rules (β denotes the substring S[ j..i ]). Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
20
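Extension Rules. Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β; if β ends inside an edge, a new internal node is created there first.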
21
Extension Rules Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
27
Ukkonen's Algorithm: suffix tree construction. Complexity of the naive version: m phases; phase i + 1 has i + 1 extensions; finding the end of the path of substring β costs O(|β|) = O(m); applying the extension rule costs O(1) per extension. Total: O(m^3).
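The phase/extension structure and its cubic cost can be sketched on an uncompressed suffix trie, where every edge carries a single character. This is a simplification chosen for brevity, not the structure used in the slides: on such a trie Rules 1 and 2 both reduce to "add a child" and Rule 3 to "do nothing". The function name `naive_build` and the step counter are illustrative.

```python
def naive_build(s):
    """Naive phase/extension loop on an uncompressed trie.

    Returns the trie (nested dicts) and the number of single-character
    steps taken while locating the ends of the paths S[j..i].
    """
    root = {}
    steps = 0
    for i in range(len(s)):          # phase i + 1 adds character s[i]
        for j in range(i + 1):       # extension j + 1 handles the path s[j:i]
            node = root
            for ch in s[j:i]:        # walk from the root: O(|beta|) per extension
                node = node[ch]
                steps += 1
            # Rules 1 and 2 add the character, Rule 3 leaves the trie alone.
            node.setdefault(s[i], {})
    return root, steps


def contains(trie, sub):
    """True if sub can be spelled out from the root of the trie."""
    node = trie
    for ch in sub:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie, steps = naive_build("xabxac")
print(steps)
```

The walk in phase i + 1 costs 0 + 1 + ... + i steps, so the total over all phases grows as m^3 / 6, which is the source of the O(m^3) bound.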
28
“First make it run, then make it run fast.” Brian Kernighan
29
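Suffix Links. Definition: Let xα denote a string in which x is a single character and α is a (possibly empty) substring. For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link. If α is empty, s(v) is the root.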
32
Suffix Links. Lemma: If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in the next extension of the same phase. Proof sketch: v is created only if Rule 2 applies, so S[ j..i ] continues with some character c ≠ S(i + 1). Then S[ j + 1..i ] also continues with c, so in extension j + 1 the path α either already branches or gets a new internal node when S(i + 1) is added.
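For a finished tree the consequence of the lemma can be checked from the string alone: every internal node with path-label xα has a partner with path-label α. The sketch below does not build a tree. It relies on the fact that, when the last character is unique, the internal nodes other than the root are exactly the substrings followed by at least two different characters. The function names are illustrative.

```python
def internal_labels(s):
    """Path-labels of the internal nodes (root excluded) of the suffix tree of s.

    Assumes the last character of s is unique, so a substring labels an
    internal node exactly when two different characters can follow it.
    """
    followers = {}
    for a in range(len(s)):
        for b in range(a + 1, len(s)):
            followers.setdefault(s[a:b], set()).add(s[b])
    return {sub for sub, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal path-label x+alpha to alpha ('' stands for the root)."""
    labels = internal_labels(s)
    links = {}
    for label in labels:
        target = label[1:]
        if target and target not in labels:
            raise AssertionError("missing suffix link target for " + label)
        links[label] = target
    return links


print(suffix_links("xabxac"))
```

For xabxac the internal nodes are xa and a, with links xa to a and a to the root.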
36
Single Extension Algorithm. Extension j of phase i + 1:
1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[ j - 1..i ].
2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the suffix link and walk down from s(v) following the path for string λ.
3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.
4. If a new internal node w was created in extension j - 1 (by Rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).
37
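Node Depth. The node-depth of v is at most one greater than the node-depth of s(v). [Figure: the path through the nodes with path-labels xβ, xα, xλ shown next to the path through their suffix-link targets β, α, λ; the node-depths marked in the example are 4 and 3.]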
38
Skip/count Trick. Let |γ| be the number of characters on an edge. “Directly implemented” edge traversal compares every character: O(|γ|). With skip/count, the string being followed is known to be in the tree, so only the first character of each edge is inspected and the walk “jumps” from node to node by edge length. With K = number of nodes on a path, the time to traverse the path is O(K).
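A minimal sketch of the skip/count walk, assuming edges are stored as index pairs into the text and that the string being followed is known to be in the tree. The tree for xabxac is assembled by hand here so the example stays short; `Node`, `add_edge` and `skip_count_walk` are illustrative names, and indices are 0-based and half-open, unlike the 1-based notation of the slides.

```python
class Node:
    def __init__(self):
        # first character -> (start, end, child); the edge label is text[start:end]
        self.children = {}


def add_edge(parent, text, start, end):
    child = Node()
    parent.children[text[start]] = (start, end, child)
    return child


def skip_count_walk(text, node, lo, hi):
    """Follow text[lo:hi] down from node, one hop per edge.

    Returns (node, first_char, offset, hops): the walk ends offset characters
    into the edge of node that starts with first_char, or exactly at node
    when first_char is None.
    """
    hops = 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        edge_len = end - start
        if edge_len > hi - lo:
            return node, text[lo], hi - lo, hops   # stop inside this edge
        lo += edge_len                             # skip the whole edge
        node = child
        hops += 1
    return node, None, 0, hops


TEXT = "xabxac"
root = Node()
u = add_edge(root, TEXT, 0, 2)    # edge "xa"
add_edge(u, TEXT, 2, 6)           # edge "bxac" to leaf 1
add_edge(u, TEXT, 5, 6)           # edge "c" to leaf 4
w = add_edge(root, TEXT, 1, 2)    # edge "a"
add_edge(w, TEXT, 2, 6)           # edge "bxac" to leaf 2
add_edge(w, TEXT, 5, 6)           # edge "c" to leaf 5
add_edge(root, TEXT, 2, 6)        # edge "bxac" to leaf 3
add_edge(root, TEXT, 5, 6)        # edge "c" to leaf 6

node, first, offset, hops = skip_count_walk(TEXT, root, 0, 4)   # follow "xabx"
print(node is u, first, offset, hops)
```

Following xabx takes one hop over the edge xa and then stops two characters into the edge bxac, without comparing any characters beyond the first one on each edge.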
47
Ukkonen's Algorithm. Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time. Proof: There are i + 1 ≤ m extensions in phase i + 1. In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link. The up-walk decreases the current node-depth by at most one. Each suffix link traversal decreases the node-depth by at most another one. Each down-walk moves to a node of greater depth. Over the entire phase the node-depth is therefore decremented at most 2m times. No node can have depth greater than m, so the total increment to the current node-depth (the down-walks) is bounded by 3m over the entire phase.
49
Ukkonen's Algorithm. m phases; 1 phase: O(m). Total: O(m^2).
50
“First make it run fast, then make it run faster.” João Carreira
52
Edge-Label Compression. A string with m characters has m suffixes. If edge labels are represented with characters, O(m^2) space is needed. To achieve O(m) space, each edge-label is stored as a pair of indices (p, q), meaning the substring S[ p..q ].
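A short illustration of the index-pair representation, using the 1-based, inclusive convention of the slides. The helper names are hypothetical. The second function counts the characters that explicit labels would need for a string whose characters are all distinct, where every suffix hangs directly off the root.

```python
def edge_label(s, p, q):
    """Characters of an edge stored as the 1-based, inclusive pair (p, q)."""
    return s[p - 1:q]


def explicit_label_chars(m):
    """Characters needed when all m suffixes of a string with distinct
    characters are written out on their own leaf edges."""
    return m * (m + 1) // 2


S = "xabxac"
# Edges leaving the root of the suffix tree for xabxac, as index pairs.
root_edges = [(1, 2), (2, 2), (3, 6), (6, 6)]
print([edge_label(S, p, q) for p, q in root_edges])

m = 1000
print(explicit_label_chars(m), "characters versus", 2 * m, "integers")
```

Each edge costs two integers regardless of its length, and a suffix tree has O(m) edges, which gives the O(m) space bound.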
53
Two more tricks...
55
Rule 3 is a show stopper. If Rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase. Why? When Rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and so the path labeled S[ j + 1..i ] does also, and Rule 3 again applies in extensions j + 1, ..., i + 1.
56
Rule 3 is a show stopper. End any phase i + 1 the first time Rule 3 applies. The remaining extensions are said to be done implicitly.
58
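Once a leaf always a leaf. Leaf created => always a leaf in all successive trees. There is no mechanism for extending a leaf edge beyond its current leaf. Once there is a leaf labeled j, extension Rule 1 will always apply to extension j in any successive phase. Leaf edge label: (p, e), where e is a single global index denoting the current end of the string, so updating e once per phase extends every leaf edge at the same time.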
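The shared end index can be sketched with a small mutable object that all leaf edges point to. `GlobalEnd` and `LeafEdge` are illustrative names, and indices follow the 1-based, inclusive convention of the slides.

```python
class GlobalEnd:
    """The single index e shared by every leaf edge."""

    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, p, e):
        self.p = p    # 1-based start of the label
        self.e = e    # shared GlobalEnd, not a copy of its value

    def label(self, s):
        return s[self.p - 1:self.e.value]


S = "xabxac"
e = GlobalEnd()
e.value = 3                                     # state after phase 3
leaf_edges = [LeafEdge(p, e) for p in (1, 2, 3)]
print([edge.label(S) for edge in leaf_edges])   # labels end at position 3

e.value = 4                                     # phase 4: one assignment ...
print([edge.label(S) for edge in leaf_edges])   # ... extends all three edges
```

Setting `e.value` once replaces the explicit Rule 1 extensions for all existing leaves in that phase.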
59
Single Phase Algorithm In each phase i:
60
Single Phase Algorithm During construction:
61
Implicit to Explicit One last phase to add character $: O(m)
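Putting the pieces together, here is a compact linear-time construction. It is a sketch in the widely used "active point" formulation (active node, active edge, active length, and a counter of suffixes still to be inserted), which is an assumption of this example rather than the up-walk formulation on the slides; it uses the same ingredients: suffix links, skip/count walk-down, index-pair edge labels, the shared leaf end, and stopping a phase when Rule 3 applies. The terminator $ is appended up front so the final tree is explicit. All names are illustrative.

```python
class Node:
    def __init__(self, start, end=None):
        self.start = start      # edge label is text[start:end]
        self.end = end          # None means "current end of text" (leaf edge)
        self.children = {}      # first character -> child Node
        self.link = None        # suffix link (None is treated as the root)


def build(text):
    text += "$"                 # assumes "$" does not occur in text
    root = Node(0, 0)
    node, edge, length = root, 0, 0     # active point
    remaining = 0                       # suffixes not yet inserted explicitly
    for i, c in enumerate(text):        # phase i + 1
        remaining += 1
        pending = None                  # internal node still waiting for its link
        while remaining:
            if length == 0:
                edge = i
            child = node.children.get(text[edge])
            if child is None:
                node.children[text[edge]] = Node(i)        # Rule 2: new leaf
                if pending is not None:
                    pending.link = node
                    pending = None
            else:
                end = i + 1 if child.end is None else child.end
                span = end - child.start
                if length >= span:                         # skip/count hop
                    edge += span
                    length -= span
                    node = child
                    continue
                if text[child.start + length] == c:        # Rule 3: end the phase
                    length += 1
                    if pending is not None:
                        pending.link = node
                    break
                split = Node(child.start, child.start + length)   # Rule 2: split
                node.children[text[edge]] = split
                split.children[c] = Node(i)
                child.start += length
                split.children[text[child.start]] = child
                if pending is not None:
                    pending.link = split
                pending = split
            remaining -= 1
            if node is root:
                if length > 0:
                    length -= 1
                    edge = i - remaining + 1
            else:
                node = node.link or root
    return root, text


def leaf_labels(root, text):
    """Sorted path-labels of all leaves; they should be the suffixes of text."""
    labels, stack = [], [(root, "")]
    while stack:
        node, path = stack.pop()
        if not node.children and node is not root:
            labels.append(path)
        for child in node.children.values():
            end = len(text) if child.end is None else child.end
            stack.append((child, path + text[child.start:end]))
    return sorted(labels)


root, text = build("xabxac")
print(leaf_labels(root, text) == sorted(text[j:] for j in range(len(text))))
```

The check at the end compares the leaf path-labels with the suffixes of the terminated string, which is a convenient sanity test for any suffix tree implementation.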
62
Suffix Trees are a Swiss Army Knife
64
Applications. Exact String Matching: [Figure: three occurrences of the string aw.] Preprocessing: O(m) for a text of length m. Search: O(n + k) for a pattern of length n with k occurrences.
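The search procedure is: follow the pattern from the root, then report the numbers of all leaves below the point reached. The sketch below uses an uncompressed suffix trie built in quadratic time as a stand-in for a linear-time suffix tree; only the query logic is the point. The text awyawxawxz is an assumed example (the text from the slide's figure is not available), and the function names are illustrative.

```python
def build_suffix_trie(text):
    """Quadratic stand-in for a suffix tree: one trie path per suffix of text + '$'.

    Assumes '$' does not occur in text. The leaf number is stored under
    the empty-string key, which can never clash with a character.
    """
    root = {}
    for j in range(len(text)):
        node = root
        for ch in text[j:] + "$":
            node = node.setdefault(ch, {})
        node[""] = j + 1
    return root


def find_occurrences(root, pattern):
    """1-based start positions of pattern in the text the trie was built from."""
    node = root
    for ch in pattern:                 # O(n): follow the pattern from the root
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                       # collect every leaf below the match point
        node = stack.pop()
        for key, value in node.items():
            if key == "":
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


trie = build_suffix_trie("awyawxawxz")
print(find_occurrences(trie, "aw"))
```

In a compressed suffix tree the subtree below the match point has O(k) nodes, which gives the O(n + k) search bound; the trie used here does not have that property, and the final sort is only for readable output.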
65
Applications. And much more:
- Longest common substring: O(n)
- Longest repeated substring: O(n)
- Longest palindrome: O(n)
- Most frequently occurring substrings of a minimum length: O(n)
- Shortest substrings occurring only once: O(n)
- Lempel-Ziv decomposition: O(n)
- ...
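As one example from this list, the longest repeated substring is the path-label of the deepest branching node. The sketch below finds it on an uncompressed suffix trie, so it runs in quadratic time and space rather than the O(n) achievable with a real suffix tree; the idea is the same. The function name is illustrative and the input is assumed not to contain $.

```python
def longest_repeated_substring(s):
    """Path-label of the deepest node with at least two children in the
    suffix trie of s + '$' (empty string if nothing repeats)."""
    text = s + "$"
    root = {}
    for j in range(len(text)):
        node = root
        for ch in text[j:]:
            node = node.setdefault(ch, {})
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        # A branching node is reached by at least two different suffixes,
        # so its path-label occurs at least twice in the text.
        if len(node) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.items():
            stack.append((child, path + ch))
    return best


print(longest_repeated_substring("xabxac"))
```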
66
“Biology easily has 500 years of exciting problems to work on.” Donald Knuth
67
web.ist.utl.pt/joao.carreira Questions?