Published by Godfrey Cox. Modified over 9 years ago.
1
Suffix Trees
Suffix trees, linearized suffix trees, virtual suffix trees, suffix arrays, enhanced suffix arrays, suffix cactus, suffix vectors, …
2
Suffix Trees
String … any sequence of characters.
Substring of string S … string composed of characters i through j of S, i <= j.
S = cater => ate is a substring; car is not a substring.
The empty string is a substring of S.
3
Subsequence
Subsequence of string S … string composed of characters i_1 < i_2 < … < i_k of S.
S = cater => ate is a subsequence; car is a subsequence.
The empty string is a subsequence.
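The difference between the two definitions can be checked with a minimal sketch. The function names `is_substring` and `is_subsequence` are illustrative and not from the slides.

```python
def is_substring(s: str, p: str) -> bool:
    # A substring is a contiguous run of characters i..j of s;
    # Python's `in` operator tests exactly that.
    return p in s


def is_subsequence(s: str, p: str) -> bool:
    # A subsequence picks positions i_1 < i_2 < ... < i_k of s.
    # Scanning s once with a shared iterator forces each character
    # of p to be found strictly after the previous match.
    it = iter(s)
    return all(ch in it for ch in p)
```

For S = cater, `ate` passes both tests, while `car` passes only the subsequence test.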
4
String/Pattern Matching
You are given a source string S. Answer queries of the form: is the string p_i a substring of S?
Knuth-Morris-Pratt (KMP) string matching: O(|S| + |p_i|) time per query, so O(n|S| + Σ_i |p_i|) time for n queries.
Suffix tree solution: O(|S| + Σ_i |p_i|) time for n queries.
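As a point of comparison for the per-query bound, here is a minimal KMP sketch. The names `failure_table` and `kmp_contains` are assumptions made for illustration. Note that the preprocessing is done on the pattern, so S is rescanned for every query.

```python
def failure_table(p: str) -> list[int]:
    # fail[i] = length of the longest proper prefix of p[:i+1]
    # that is also a suffix of p[:i+1].
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    # O(|s| + |p|): build the table for p, then scan s once.
    if not p:
        return True  # the empty string is a substring of every string
    fail = failure_table(p)
    k = 0  # number of pattern characters currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
            if k == len(p):
                return True
    return False
```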
5
String/Pattern Matching
KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S.
An application of string matching: genome project.
Databank of strings (gene sequences). Character set is ATGC.
Determine if a “new” sequence is a substring of a databank sequence.
6
Definition Of Suffix Tree
Compressed trie with edge information.
Keys are the nonempty suffixes of a given string S.
Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.
7
String Matching & Suffixes
p_i is a substring of S iff p_i is a prefix of some suffix of S.
Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.
Which of these are substrings of S? leep, eepe, pe, leap, peel
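The prefix-of-a-suffix characterization can be tested directly with a brute-force sketch. It is quadratic and only illustrates the property that the suffix tree exploits; the helper names are hypothetical.

```python
def nonempty_suffixes(s: str) -> list[str]:
    # All nonempty suffixes of s, longest first.
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(s: str, p: str) -> bool:
    # p is a substring of s iff p is a prefix of some suffix of s.
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))
```

For S = sleeper this reports leep, eepe, and pe as substrings, and leap and peel as non-substrings.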
8
Last Character Of S Repeats
When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
S = creeper: creeper, reeper, eeper, eper, per, er, r
When the last character of S appears more than once in S, use an end of string character # to overcome this problem.
S = creeper#: creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
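A small brute-force check makes the problem and its fix concrete. The function name `suffix_prefix_pairs` is an assumption for illustration, and `#` is assumed not to occur elsewhere in S.

```python
def suffix_prefix_pairs(s: str) -> list[tuple[str, str]]:
    # Return every pair (a, b) of distinct nonempty suffixes of s
    # such that a is a proper prefix of b.
    suffixes = [s[i:] for i in range(len(s))]
    return [
        (a, b)
        for a in suffixes
        for b in suffixes
        if a != b and b.startswith(a)
    ]
```

For creeper the only such pair is (r, reeper); for creeper# there are none, because every suffix now ends in a character that occurs nowhere else.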
9
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, showing the edge labels.]
10
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S indexed 1 through 10 and the nodes annotated.]
11
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, with the characters of S indexed 1 through 10 and further node annotations.]
12
Suffix Tree Construction
See Web write up for algorithm.
Time complexity, with |S| = n and alphabet size = r:
O(nr) using array nodes. This is O(n) for r a constant (or r <= c).
O(n) expected time using a hash table.
O(n) time algorithm for large r in reference cited in Web write up.
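The construction algorithm itself is in the write up and is not reproduced here. As a stand-in for experimenting with the ideas, the following sketch builds an uncompressed suffix trie by inserting every suffix, which takes O(n^2) time and space rather than the bounds listed above. The dict-of-dicts representation and the function names are assumptions.

```python
def build_suffix_trie(s: str) -> dict:
    # Naive construction: insert each suffix character by character.
    # This is NOT a linear-time suffix tree algorithm; it is a simple
    # uncompressed trie that answers the same substring queries.
    root: dict = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def trie_contains(root: dict, p: str) -> bool:
    # Follow p from the root; O(|p|) expected time with dict children.
    node = root
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Using dict children corresponds loosely to the hash table option: lookups cost expected constant time regardless of the alphabet size.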
13
Suffix Array
Array that contains the start position of suffixes in lexicographic order.
S = abbbabbbb#. Assume # < a < b.
# < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
LCP = length of longest common prefix between adjacent entries of SA.
LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
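The SA and LCP values for this example can be reproduced with a naive sketch that sorts the suffixes directly. This is not a linear-time construction. It relies on the ASCII order of the characters, in which `#` < `a` < `b` happens to hold; the function names are illustrative.

```python
def suffix_array(s: str) -> list[int]:
    # 1-based start positions of the suffixes of s, in lexicographic
    # order. Sorting materialized suffixes costs O(n^2 log n) worst case.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    # lcp[k] = length of the longest common prefix of the suffixes
    # starting at sa[k] and sa[k + 1]. The last SA entry has no
    # successor, so the result has len(sa) - 1 entries.
    def common_prefix(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    return [
        common_prefix(s[sa[k] - 1:], s[sa[k + 1] - 1:])
        for k in range(len(sa) - 1)
    ]
```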
14
Suffix Array
Less space than suffix tree.
Linear time construction.
Can be used to solve several of the problems solved by a suffix tree with same asymptotic complexity.
Substring matching: binary search for p using SA, in O(|p| log |S|) time.
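The binary search can be sketched as follows. Each probe compares p with the first |p| characters of one suffix, which gives the O(|p| log |S|) bound once SA is available. Here SA is built by naive sorting for brevity, and the function name `sa_contains` is an assumption.

```python
def sa_contains(s: str, sa: list[int], p: str) -> bool:
    # sa holds 1-based suffix start positions in lexicographic order.
    # Find the first suffix whose length-|p| prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(sa):
        return False
    start = sa[lo] - 1
    return s[start:start + len(p)] == p


# Example setup with a naively built suffix array.
S = "abbbabbbb#"
SA = sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])
```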
15
O(|p_i|) Time Substring Matching
[Figure: suffix tree for S = abbbabbbb#, used to trace example pattern searches (babbabbbababa).]
16
Find All Occurrences Of p_i
Search suffix tree for p_i. Suppose the search for p_i is successful.
When the search terminates at an element node, p_i appears exactly once in the source string S.
17
Search Terminates At Element Node
[Figure: suffix tree for S = abbbabbbb#, with the example abbbb#.]
18
Search Terminates At Branch Node
When the search for p_i terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of p_i.
19
Search Terminates At Branch Node
[Figure: suffix tree for S = abbbabbbb#, with the example ab.]
20
Find All Occurrences Of p_i
To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree:
Link all element nodes into a chain in inorder (left to right).
Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
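The augmentation can be imitated on a naive, uncompressed suffix trie. In this sketch the chain of element nodes is a Python list of suffix start positions in lexicographic order, and each trie node stores the indices of the leftmost and rightmost entries below it. The class and function names are assumptions, and the O(n^2) construction is for illustration only.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.first = None    # index in the leaf chain of the leftmost leaf
        self.last = None     # index in the leaf chain of the rightmost leaf


def build_augmented_trie(s: str):
    # The leaf chain: 1-based suffix start positions in sorted order.
    leaves = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
    root = Node()
    for k, start in enumerate(leaves):
        node = root
        path = [root]
        for ch in s[start - 1:]:
            node = node.children.setdefault(ch, Node())
            path.append(node)
        # Suffixes arrive in sorted order, so the first visit to a node
        # fixes its leftmost leaf and the latest visit its rightmost.
        for v in path:
            if v.first is None:
                v.first = k
            v.last = k
    return root, leaves


def occurrences(root: Node, leaves: list[int], p: str) -> list[int]:
    # O(|p|) to walk down, then one step per reported occurrence.
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return leaves[node.first:node.last + 1]
```

For S = abbbabbbb#, the pattern ab yields start positions 1 and 5, and abbbb yields only 5.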
21
Augmented Suffix Tree
[Figure: augmented suffix tree for S = abbbabbbb#, with the example b.]
22
Longest Repeating Substring
Find the longest substring of S that occurs at least m > 1 times in S.
Label branch nodes with the number of element nodes in their subtree.
Find the branch node with label >= m and max char# field.
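The labeling idea can be sketched on a naive suffix trie: each node counts the suffixes that pass through it, which equals the number of element nodes below it, and the deepest node with count >= m spells the answer. This is an O(n^2) illustration with assumed names, not the linear-time suffix tree procedure.

```python
def longest_repeated_substring(s: str, m: int) -> str:
    # Build an uncompressed suffix trie where node["count"] is the
    # number of suffixes passing through the node.
    root = {"count": 0, "children": {}}
    for i in range(len(s)):
        node = root
        node["count"] += 1
        for ch in s[i:]:
            node = node["children"].setdefault(
                ch, {"count": 0, "children": {}}
            )
            node["count"] += 1

    # Depth-first search restricted to nodes with count >= m;
    # keep the deepest string reached.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node["children"].items():
            if child["count"] >= m:
                stack.append((child, prefix + ch))
    return best
```

For S = abbbabbbb#, m = 2 gives abbb and m = 5 gives bb. Occurrences may overlap.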
23
Longest Repeating Substring
[Figure: suffix tree for S = abbbabbbb#, with branch nodes labeled 2, 3, 5, 7, and 10 by the number of element nodes in their subtrees; examples shown for m = 2 and m = 5.]
24
Longest Common Substring
Given two strings S and T, find the longest common substring.
S = carport, T = airports
Longest common substring = rport
Longest common subsequence = arport
Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
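For the subsequence side of the comparison, the standard O(|S|*|T|) dynamic program can be sketched as follows. It returns only the length, and the function name `lcs_length` is an assumption.

```python
def lcs_length(s: str, t: str) -> int:
    # prev[j] = length of the longest common subsequence of the
    # characters of s processed so far and t[:j].
    prev = [0] * (len(t) + 1)
    for a in s:
        cur = [0]
        for j, b in enumerate(t, 1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For S = carport and T = airports the result is 6, the length of arport.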
25
Longest Common Substring
Let $ be a new symbol. Construct the suffix tree for the string U = S$T#.
U = carport$airports#
No repeating substring includes $.
Find the longest repeating substring that occurs both to the left and to the right of $.
Find the branch node that has max char# and has in its subtree at least one element node representing a suffix that begins in S and at least one representing a suffix that begins in T.
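The U = S$T# idea can be tried out on a naive suffix trie in place of a true suffix tree. Each trie node records whether some suffix beginning in S and some suffix beginning in T passes through it, and the deepest node with both marks spells the answer. This sketch takes O(|U|^2) time and space instead of O(|S|+|T|), assumes `$` and `#` occur in neither input, and uses hypothetical names throughout.

```python
def longest_common_substring(s: str, t: str) -> str:
    u = s + "$" + t + "#"

    def new_node() -> dict:
        return {"children": {}, "in_s": False, "in_t": False}

    root = new_node()
    for i in range(len(u)):
        # Suffixes starting before the $ begin in S; the rest are
        # treated as beginning in T. The suffixes starting at $ or #
        # share no node with any other suffix, so they never matter.
        flag = "in_s" if i < len(s) else "in_t"
        node = root
        for ch in u[i:]:
            node = node["children"].setdefault(ch, new_node())
            node[flag] = True

    # A node whose path contains $ is reached by a single suffix only,
    # so it can never carry both marks.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if len(prefix) > len(best):
            best = prefix
        for ch, child in node["children"].items():
            if child["in_s"] and child["in_t"]:
                stack.append((child, prefix + ch))
    return best
```

For S = carport and T = airports this returns rport.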