Suffix Trees Construction and Applications João Carreira 2008.


2 Suffix Trees Construction and Applications João Carreira 2008

3 Outline
- Why Suffix Trees?
- Definition
- Ukkonen's Algorithm (construction)
- Applications

9 Why Suffix Trees?
- Asymptotically fast.
- The basis of state-of-the-art data structures.
- You don't need a PhD to use them.
- Challenging.
- Expose interesting algorithmic ideas.

14 Definition
Suffix tree for an m-character string S:
- m leaves, numbered 1 to m
- edge-label vs node-label: an edge-label is the substring written on one edge; the label of a node is the concatenation of the edge-labels on the path from the root to that node
- each internal node (other than the root) has at least two children
- the label of leaf j is S[j..m]
- no two edges out of the same node can have edge-labels beginning with the same character

15 Definition: Example
- String: xabxac
- Length (m): 6 characters
- Number of leaves: 6
- Label of leaf 5: S[5..6] = ac

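16 Implicit vs Explicit
What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path ends inside the tree rather than at a leaf. The tree then has fewer than m leaves and is called implicit. Appending a terminator character $ that occurs nowhere else in the string makes every suffix end at its own leaf (an explicit tree).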
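The definition and the implicit/explicit distinction can be checked with a small sketch. It assumes a naive quadratic-time construction that inserts one suffix at a time; this is not Ukkonen's algorithm, and the names `Node`, `build_naive` and `leaves` are invented for this illustration.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge_label, child node)
        self.leaf = None    # suffix number (1-based) if this node is a leaf


def build_naive(s):
    """Insert every suffix of s, splitting edges where paths diverge."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.leaf = j + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(rest):
                break  # suffix is a prefix of an existing path: no leaf (implicit)
            if k == len(label):
                node, rest = child, rest[k:]
                continue
            # Paths diverge inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def leaves(node, path=""):
    """Yield (leaf number, path label) for every leaf below node."""
    if node.leaf is not None:
        yield node.leaf, path
    for label, child in node.children.values():
        yield from leaves(child, path + label)


print(sorted(leaves(build_naive("xabxac"))))   # six leaves, leaf 5 labeled "ac"
print(sorted(leaves(build_naive("axabx"))))    # suffix 5 ("x") has no leaf
print(len(list(leaves(build_naive("axabx$")))))
```

For xabxac every suffix gets its own leaf because the last character c occurs only once. For axabx the fifth suffix has no leaf until the terminator is added.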

17 Ukkonen's Algorithm suffix tree construction

18 Ukkonen's Algorithm (suffix tree construction)
Text: S[1..m]
- m phases
- phase i + 1 is divided into i + 1 extensions
- in extension j of phase i + 1:
  - find the end of the path from the root labeled with substring S[j..i]
  - extend the substring by adding the character S(i + 1) to its end

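19 Extension Rules (β = S[j..i])
Rule 1: Path β ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.

20 Extension Rules
Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created at the end of β; if β ends inside an edge, a new internal node is created there first.

21 Extension Rules
Rule 3: Some path from the end of β starts with S(i + 1). The string βS(i + 1) is already in the tree, so we do nothing.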

27 Ukkonen's Algorithm (suffix tree construction)
Complexity:
- m phases
- phase j -> j extensions
- find the end of the path of substring β: O(|β|) = O(m)
- apply the extension rule once the end is found: O(1)
- total: O(m^3)
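The cubic bound can be made concrete with a sketch of the phase/extension loop. To stay short it works on an uncompressed trie (one character per edge, nested dictionaries) rather than a compressed suffix tree, which is a simplifying assumption; `naive_phases` and `contains` are hypothetical names.

```python
def naive_phases(s):
    """Run the m phases naively; return the trie and the number of walk steps."""
    root = {}
    steps = 0
    for i in range(len(s)):        # phase i + 1 adds character s[i]
        for j in range(i + 1):     # extension j + 1 handles the suffix starting at j
            node = root
            for ch in s[j:i]:      # find the end of the path beta = s[j:i]
                node = node[ch]
                steps += 1
            node.setdefault(s[i], {})  # make sure beta + s[i] is in the trie
    return root, steps


def contains(root, t):
    """Return True if t labels a path from the root."""
    node = root
    for ch in t:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie, steps = naive_phases("xabxac")
print(steps)                   # walk steps grow as (m - 1) * m * (m + 1) / 6
print(contains(trie, "bxa"))   # every substring of the text is a path
```

The walk from the root in every extension is what the later tricks (suffix links, skip/count, stopping at rule 3) remove.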

28 “First make it run, then make it run fast.” Brian Kernighan

29 Suffix Links Definition: For an internal node v with path-label xα, if there is another node s(v), with path-label α, then a pointer from v to s(v) is called a suffix link.

32 Suffix Links
Lemma: If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in the next extension of the same phase.
If Rule 2 applies:
- S[j..i] continues with some character c ≠ S(i + 1)
- S[j + 1..i] continues with c as well.

36 Single Extension Algorithm
Extension j of phase i + 1:
1. Find the first node v at or above the end of S[j - 1..i] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[j - 1..i].
2. If v is the root, follow the path for S[j..i] (as in the naive algorithm). Otherwise traverse the suffix link and walk down from s(v), following the path for string λ.
3. Using the extension rules, ensure that the string S[j..i]S(i + 1) is in the tree.
4. If a new internal node w was created in extension j - 1 (by rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).

37 Node Depth
The node-depth of v is at most one greater than the node-depth of s(v).
[Figure: two paths with edge labels xα, xβ, xλ and α, β, λ; node depths 4 and 3.]

39 Skip/count Trick
- |γ|: number of characters on an edge labeled γ
- “Directly implemented” edge traversal: O(|γ|)
- Instead, “jump” from node to node.
- K = number of nodes in a path
- Time to traverse a path: O(K)
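A minimal sketch of the skip/count walk follows. It assumes edges store index pairs into the text and that the string being followed is already known to be in the tree; the tree is built with a naive quadratic-time builder for brevity, and `Node`, `build_naive` and `skip_count` are names made up for this example.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end  # incoming edge label is s[start:end]
        self.children = {}                 # first character -> child node


def build_naive(s):
    """Naive suffix tree construction (not Ukkonen); s should end in a unique character."""
    root = Node()
    for j in range(len(s)):
        node, pos = root, j
        while pos < len(s):
            child = node.children.get(s[pos])
            if child is None:
                node.children[s[pos]] = Node(pos, len(s))
                break
            k = child.start
            while k < child.end and pos < len(s) and s[k] == s[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child
                continue
            if pos == len(s):
                break  # suffix ends inside an edge (implicit tree)
            mid = Node(child.start, k)       # split the edge at index k
            mid.children[s[k]] = child
            child.start = k
            node.children[s[mid.start]] = mid
            node = mid
    return root


def skip_count(s, node, start, length):
    """Follow s[start:start+length] down from node, comparing one character per edge.

    Returns (node, child, offset, lookups): the path ends `offset` characters
    down the edge to `child`, or exactly at `node` when child is None.
    """
    lookups = 0
    while length > 0:
        child = node.children[s[start]]  # only the first character is inspected
        lookups += 1
        edge_len = child.end - child.start
        if edge_len > length:
            return node, child, length, lookups
        node, start, length = child, start + edge_len, length - edge_len
    return node, None, 0, lookups


s = "xabxac"
root = build_naive(s)
node, child, offset, lookups = skip_count(s, root, 0, 4)  # follow "xabx"
print(lookups, offset, s[child.start:child.start + offset])
```

Following xabx takes two edge lookups (the edge xa, then two characters into the edge bxac) instead of four character comparisons.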

47 Ukkonen's Algorithm
Using the skip/count trick, any phase of Ukkonen's algorithm takes O(m) time.
Proof:
- There are i + 1 ≤ m extensions in phase i + 1.
- In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
- The up-walk decreases the current node-depth by at most one.
- Each suffix link traversal decreases the node-depth by at most another one.
- Each down-walk moves to a node of greater depth.
- Over the entire phase the node-depth is decremented at most 2m times.
- No node can have depth greater than m, so the total increment to the current node-depth (down-walks) is bounded by 3m over the entire phase.

49 Ukkonen's Algorithm
- m phases
- 1 phase: O(m)
- total: O(m^2)

50 “First make it run fast, then make it run faster.” João Carreira

52 Edge-Label Compression
- A string with m characters has m suffixes.
- If edge labels are represented with characters, O(m^2) space is needed.
- To achieve O(m) space, each edge-label is stored as a pair of indices (p, q), meaning the substring S[p..q].
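A small worked example of the space argument, assuming a text whose characters are all distinct. In that case the suffix tree is just m leaf edges hanging off the root, one per suffix, which makes the totals easy to count. The function name `edge_labels_distinct` is hypothetical.

```python
def edge_labels_distinct(s):
    """Edge labels of the suffix tree of a string with all-distinct characters.

    Returns the labels written out as strings and as 1-based (p, q) pairs.
    """
    assert len(set(s)) == len(s), "sketch only covers all-distinct characters"
    m = len(s)
    explicit = [s[j:] for j in range(m)]           # one leaf edge per suffix
    compressed = [(j + 1, m) for j in range(m)]    # (p, q) means S[p..q]
    return explicit, compressed


def decode(s, pair):
    """Recover the label S[p..q] from a 1-based inclusive index pair."""
    p, q = pair
    return s[p - 1:q]


s = "abcdefgh"
explicit, compressed = edge_labels_distinct(s)
print(sum(len(label) for label in explicit))  # m * (m + 1) / 2 characters
print(2 * len(compressed))                    # 2 * m integers
```

The explicit labels hold m(m + 1)/2 characters in total, while the index pairs need two integers per edge.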

53 Two more tricks...

55 Rule 3 is a show stopper
If rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase.
Why? When rule 3 applies, the path labeled S[j..i] must continue with character S(i + 1), and so the path labeled S[j + 1..i] does also, and rule 3 again applies in extensions j + 1, ..., i + 1.

56 Rule 3 is a show stopper
End any phase i + 1 the first time rule 3 applies. The remaining extensions are said to be done implicitly.

58 Once a leaf always a leaf
- Leaf created => always a leaf in all successive trees.
- No mechanism for extending a leaf edge beyond its current leaf.
- Once there is a leaf labeled j, extension rule 1 will always apply to extension j in any successive phase.
- Leaf edge label: (p, e), where e is a single global index for the current end of the text, shared by all leaf edges.
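The shared end index can be sketched as follows. `GlobalEnd` and `LeafEdge` are hypothetical names, and the example only models the leaf edges of the implicit trees for the first five characters of xabxac, not a full tree.

```python
class GlobalEnd:
    """One mutable end index shared by every leaf edge."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, p, end):
        self.p = p        # 1-based start index of the label
        self.end = end    # shared GlobalEnd, the "e" in (p, e)

    def label(self, s):
        return s[self.p - 1:self.end.value]


s = "xabxac"
e = GlobalEnd()
leaf_edges = []
for i in range(1, 6):          # phases 1..5
    e.value = i                # rule 1 for every existing leaf, in O(1)
    if i <= 3:                 # phases 1-3 each create a leaf at the root
        leaf_edges.append(LeafEdge(i, e))
    # phases 4 and 5 only hit rule 3 (x and xa are already in the tree)

print([edge.label(s) for edge in leaf_edges])
```

After phase 5 the three leaf edges read xabxa, abxa and bxa, although none of them was touched individually after it was created.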

59 Single Phase Algorithm In each phase i:

60 Single Phase Algorithm During construction:

61 Implicit to Explicit
One last phase to add character $: O(m)
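Putting the pieces together, here is a compact sketch of a linear-time construction. It uses the widely used "active point" bookkeeping (active node, active edge, active length, remainder), which is an assumption of this sketch and is not how the slides phrase the algorithm, although it relies on the same ingredients: suffix links, skip/count, stopping at rule 3, and a global end for leaves. All names are invented here, and the input is expected to end with a unique terminator such as $.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # edge label is s[start:end]
        self.end = end        # None marks a leaf: the edge runs to the global end
        self.children = {}
        self.link = None      # suffix link (None is treated as "root")


def build_ukkonen(s):
    root = Node(-1, -1)
    active_node, active_edge, active_len = root, 0, 0
    remainder = 0                      # suffixes still to be inserted explicitly
    for i, ch in enumerate(s):         # phase i + 1
        last_new = None                # internal node waiting for its suffix link
        remainder += 1
        while remainder > 0:
            if active_len == 0:
                active_edge = i
            first = s[active_edge]
            child = active_node.children.get(first)
            if child is None:          # rule 2: new leaf directly under a node
                active_node.children[first] = Node(i, None)
                if last_new is not None:
                    last_new.link = active_node
                    last_new = None
            else:
                end = i + 1 if child.end is None else child.end
                edge_len = end - child.start
                if active_len >= edge_len:      # skip/count: jump over the edge
                    active_node = child
                    active_edge += edge_len
                    active_len -= edge_len
                    continue
                if s[child.start + active_len] == ch:   # rule 3: end the phase
                    if last_new is not None and active_node is not root:
                        last_new.link = active_node
                    active_len += 1
                    break
                # rule 2: split the edge and hang a new leaf off the split point
                split = Node(child.start, child.start + active_len)
                active_node.children[first] = split
                split.children[ch] = Node(i, None)
                child.start += active_len
                split.children[s[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = i - remainder + 1
            elif active_node is not root:
                active_node = active_node.link or root
    return root


def leaf_labels(root, s):
    """Path labels of all leaves; leaf edges end at len(s)."""
    out, stack = [], [(root, "")]
    while stack:
        node, path = stack.pop()
        if not node.children and node is not root:
            out.append(path)
        for child in node.children.values():
            end = len(s) if child.end is None else child.end
            stack.append((child, path + s[child.start:end]))
    return out


text = "xabxac$"
print(sorted(leaf_labels(build_ukkonen(text), text)))
```

With the terminator included in the input, the final phase makes the tree explicit, so the leaf path labels are exactly the suffixes of the text.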

62 Suffix Trees are a Swiss Knife

64 Applications
Exact String Matching:
- Three occurrences of string aw.
- Preprocessing: O(m), for a text of length m
- Search: O(n + k), for a pattern of length n with k occurrences
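The search step can be sketched as: match the pattern from the root, then report every leaf below the point where the match ends. The text awyawxawxz is an assumed example (the transcript does not preserve the slide's text), the tree is built naively in quadratic time rather than with Ukkonen's algorithm, and `Node`, `build_naive` and `find_all` are hypothetical names.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # incoming edge label is s[start:end]
        self.leaf = leaf                   # 1-based suffix number for leaves
        self.children = {}


def build_naive(s):
    """Naive construction; s must end with a unique terminator."""
    root = Node()
    for j in range(len(s)):
        node, pos = root, j
        while pos < len(s):
            child = node.children.get(s[pos])
            if child is None:
                node.children[s[pos]] = Node(pos, len(s), leaf=j + 1)
                break
            k = child.start
            while k < child.end and pos < len(s) and s[k] == s[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child
                continue
            if pos == len(s):
                break  # cannot happen when the terminator is unique
            mid = Node(child.start, k)     # split the edge at index k
            mid.children[s[k]] = child
            child.start = k
            node.children[s[mid.start]] = mid
            node = mid
    return root


def find_all(root, s, pattern):
    """Return the sorted 1-based start positions of pattern in s."""
    node, i = root, 0
    while i < len(pattern):                # O(n): match the pattern
        child = node.children.get(pattern[i])
        if child is None:
            return []
        k = child.start
        while k < child.end and i < len(pattern):
            if s[k] != pattern[i]:
                return []
            k += 1
            i += 1
        node = child
    hits, stack = [], [node]               # O(k): collect the leaves below
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            hits.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(hits)


text = "awyawxawxz$"
tree = build_naive(text)
print(find_all(tree, text, "aw"))   # [1, 4, 7]
```

Each leaf below the match point is a suffix that starts with the pattern, so the leaf numbers are the occurrence positions.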

65 Applications
And much more...
- Longest common substring: O(n)
- Longest repeated substring: O(n)
- Longest palindrome: O(n)
- Most frequently occurring substrings of a minimum length: O(n)
- Shortest substrings occurring only once: O(n)
- Lempel-Ziv decomposition: O(n)
- ...

66 “Biology easily has 500 years of exciting problems to work on.” Donald Knuth

67 web.ist.utl.pt/joao.carreira Questions?

