1 Suffix Trees © Jeff Parker, 2009

2 Outline An introduction to the Suffix Tree. Some sample applications. How to build a Suffix Tree efficiently.

3 Problems We have a corpus of information: genes, proteins. We want to see what two sequences have in common. We want to be able to find matches for a gene or protein. Model this as a search for a pattern in a text. The problem is hard because strings are very long and the set of possible matches is large. Today we will focus on exact matches.

4 Pattern Matching The basis for the simplest (exact) pattern match follows. Algorithm: line up text and pattern and compare the two. If they match, report the position of the match. Else, slide the pattern to the right and try again. [Figure: a pattern lined up under a text.]

5 Compare pattern at this position
// Does the pattern match the text at this position?
boolean compare(String text, int pos, String pattern) {
    for (int i = 0; i < pattern.length(); i++)
        if (text.charAt(pos + i) != pattern.charAt(i))
            return false;
    return true;
}

6 Simple Pattern Match
// Where is pattern pat in string text?
int findMatch(String text, String pat) {
    int pos = 0;
    while (pos <= text.length() - pat.length()) {
        if (compare(text, pos, pat))
            return pos;
        pos++; // Slide pattern right one space
    }
    return -1;
}

7 Analysis For a pattern of length N and a text of length M: this algorithm behaves well in practice, O(N + M). The worst case is bad: O(NM). We can do better if we preprocess. Preprocess the pattern: Boyer-Moore, Knuth-Morris-Pratt. Preprocess the text: Suffix Tree.
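
To make the gap between the typical and the worst case concrete, here is a minimal sketch that counts character comparisons made by the slide-by-one search. The class and method names are hypothetical, and the counting loop is an instrumented variant of the slides' compare/findMatch rather than the author's code.

```java
// Hypothetical helper: counts comparisons made by the naive search.
class NaiveMatchCost {
    // Returns the number of character comparisons performed until the
    // first match is found or the text is exhausted.
    static long countComparisons(String text, String pat) {
        long comparisons = 0;
        for (int pos = 0; pos + pat.length() <= text.length(); pos++) {
            int i = 0;
            while (i < pat.length()) {
                comparisons++;
                if (text.charAt(pos + i) != pat.charAt(i)) break;
                i++;
            }
            if (i == pat.length()) break; // whole pattern matched at pos
        }
        return comparisons;
    }
}
```

With a text of twenty a's and the pattern aaaab (M = 20, N = 5), every one of the M - N + 1 = 16 alignments fails only on the last character, so the count is 16 * 5 = 80. With a text whose characters never match the first pattern character, each alignment costs a single comparison.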

8 O(|pattern|) Pattern Matching Rather than view the problem as moving the pattern, rephrase

9 Faster Pattern Matching Is our pattern the prefix of a suffix of the text string S? Take all suffixes…

10 Faster Pattern Matching Take all suffixes and slide left

11 Faster Pattern Matching Want to find a string that has pattern as prefix

12 Sort suffixes

13 Build Trie Allows O(N) search for pattern

14 Suffix Trie Multi-way tree. Each branch is labeled with a char. If the trie is ready, a match takes O(|pattern|) time. Example: text S is ababc, with suffixes s1 = ababc, s2 = babc, s3 = abc, s4 = bc, s5 = c. [Figure: suffix trie for ababc.]

15 Suffix Trie The suffix trie takes O(|S|^2) space. Each step of the search for a match takes constant time. If no branch matches the char, we fail. A leaf holds the name of its suffix. We may have multiple matches: the string ab occurs twice, as a prefix of s1 and of s3. [Figure: suffix trie for ababc, with suffixes s1 = ababc, s2 = babc, s3 = abc, s4 = bc, s5 = c.]
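
The suffix trie of the last two slides can be sketched as follows. This is an illustration under stated assumptions, not the author's code: the class name SuffixTrie and the method find are hypothetical, suffixes are named 1-based as on the slides, and for brevity every node (not only the leaves) records the suffixes that pass through it.

```java
import java.util.*;

// Hypothetical class: a quadratic-space suffix trie, not the compressed tree.
class SuffixTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        // 1-based names of the suffixes whose path runs through this node
        List<Integer> suffixes = new ArrayList<>();
    }

    final Node root = new Node();

    // Insert every suffix of the text, one character per branch.
    SuffixTrie(String text) {
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.computeIfAbsent(text.charAt(j), c -> new Node());
                node.suffixes.add(i + 1);
            }
        }
    }

    // Walk one branch per pattern character: O(|pattern|) steps.
    // Returns the suffixes that have the (non-empty) pattern as a prefix.
    List<Integer> find(String pattern) {
        Node node = root;
        for (char c : pattern.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList(); // no branch: fail
        }
        return node.suffixes;
    }
}
```

For the text ababc, find("ab") returns the suffixes 1 and 3, matching the slide's remark that ab is a prefix of s1 and s3.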

16 Suffix Tree Nodes that mark a split are called essential. Remove non-essential nodes, and label edges with strings. [Figure: the suffix trie for ababc beside the resulting suffix tree, whose edges are labeled ab, b, c, abc and c.]

17 Properties The tree has |S| leaves and at most |S|-1 interior nodes, so at most 2|S|-1 nodes and 2|S|-2 edges. The algorithm for search is the same: walk the tree matching edges. While this has fewer nodes, it is not clear that we need any less storage: the sum of the lengths of the edge strings can still be O(N^2). Is there any speedup in building the tree? Storage is easier to address. [Figure: suffix tree for ababc.]

18 Worst Case Storage Here are some trees that need O(N^2) storage when stored as tries: abcdefg. We can get a trie that needs O(N^2) storage with a limited alphabet: a^n b^n a^n b^n c. [Figure: suffix trie for abcdefg, with branches abcdefg, bcdefg, cdefg, defg, efg.]

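19 Efficient Storage We store the whole string once, and keep pointers to that string in nodes. We have constant space per node and O(|S|) nodes, thus linear space. [Figure: the tree for ababc (a = 1, b = 2, a = 3, b = 4, c = 5) with each edge label replaced by a pair of text positions: ab = (1, 2), b = (2, 2), abc = (3, 5), c = (5, 5). Nodes are linked by child and sibling pointers.]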
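
A minimal sketch of this storage scheme, assuming a naive quadratic-time construction (each suffix is inserted by walking from the root; this is not the linear-time algorithm discussed later). All names are hypothetical. Edges are stored as half-open, 0-based index pairs, so the slide's (1, 2) appears here as start = 0, end = 2, and a map of children stands in for the slide's child and sibling pointers.

```java
import java.util.*;

// Hypothetical class: suffix tree whose edges are index pairs into the text.
class IndexedSuffixTree {
    static class Node {
        int start, end; // edge label into this node is text[start, end)
        Map<Character, Node> children = new TreeMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    final String text;
    final Node root = new Node(0, 0);
    int nodeCount = 1; // counts the root

    IndexedSuffixTree(String text) {
        this.text = text;
        for (int i = 0; i < text.length(); i++) insertSuffix(i);
    }

    String label(Node n) { return text.substring(n.start, n.end); }

    private void insertSuffix(int i) {
        Node node = root;
        int pos = i;
        while (pos < text.length()) {
            Node child = node.children.get(text.charAt(pos));
            if (child == null) { // no branch: hang a new leaf here
                node.children.put(text.charAt(pos), new Node(pos, text.length()));
                nodeCount++;
                return;
            }
            int k = child.start; // match along the edge label
            while (k < child.end && pos < text.length()
                    && text.charAt(k) == text.charAt(pos)) { k++; pos++; }
            if (k == child.end) { node = child; continue; }
            if (pos == text.length()) return; // suffix ends inside an edge: stays implicit
            // Mismatch inside the edge: split it (one new node, one new leaf).
            Node split = new Node(child.start, k);
            node.children.put(text.charAt(child.start), split);
            child.start = k;
            split.children.put(text.charAt(k), child);
            split.children.put(text.charAt(pos), new Node(pos, text.length()));
            nodeCount += 2;
            return;
        }
    }
}
```

For ababc this builds 8 nodes (the root, the interior nodes ab and b, and 5 leaves), each holding two integers regardless of how long its edge label is.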

20 Applications: Longest Repeat As well as searching for a string, we can answer questions such as: What is the longest string that is duplicated? What is the longest string that occurs k times? Internal nodes mark repeating substrings. Keep track of the splits, and remember the deepest. In our example, ababc, s1 and s3 share ab. [Figure: suffix tree for ababc.]
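
Both questions can be sketched with a counter per node. This is an illustration only: it uses the quadratic suffix trie rather than the compressed tree, and the names LongestRepeat and longestRepeat are hypothetical. A node reached by k suffixes spells a string that occurs k times (occurrences may overlap), so the deepest such node gives the answer.

```java
import java.util.*;

// Hypothetical helper built on a suffix trie (quadratic space).
class LongestRepeat {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count; // number of suffixes passing through = occurrences of this string
    }

    // Longest substring occurring at least k times ("" if there is none).
    static String longestRepeat(String text, int k) {
        Node root = new Node();
        String best = "";
        for (int i = 0; i < text.length(); i++) {
            Node node = root;
            for (int j = i; j < text.length(); j++) {
                node = node.children.computeIfAbsent(text.charAt(j), c -> new Node());
                node.count++;
                // Remember the deepest node seen so far with enough occurrences.
                if (node.count >= k && j - i + 1 > best.length()) {
                    best = text.substring(i, j + 1);
                }
            }
        }
        return best;
    }
}
```

For ababc with k = 2 this returns ab, the string shared by s1 and s3. For mississippi with k = 2 it returns issi, whose two occurrences overlap.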

21 Longest Common Substring Given two strings S and T, find the longest common substring. Build the suffix tree for the string S$T. Mark leaves of suffixes that begin in S red. Mark leaves of suffixes that begin in T black. Make a bottom-up traversal, looking for the lowest (deepest) split that has leaves in both sets. [Figure: suffix tree for ababc.]
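
The red/black marking can be sketched as below. This is a simplified illustration, not the slide's exact construction: it inserts the suffixes of S and T into one shared suffix trie instead of building the compressed tree for S$T, and it marks every node on the way down, which has the same effect as propagating leaf colors upward. All names are hypothetical.

```java
import java.util.*;

// Hypothetical helper: longest common substring via a shared suffix trie.
class LongestCommonSubstring {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean red;   // some suffix of S passes through this node
        boolean black; // some suffix of T passes through this node
    }

    static void insertSuffixes(Node root, String str, boolean fromS) {
        for (int i = 0; i < str.length(); i++) {
            Node node = root;
            for (int j = i; j < str.length(); j++) {
                node = node.children.computeIfAbsent(str.charAt(j), c -> new Node());
                if (fromS) node.red = true; else node.black = true;
            }
        }
    }

    static String longest(String s, String t) {
        Node root = new Node();
        insertSuffixes(root, s, true);
        insertSuffixes(root, t, false);
        String[] best = {""};
        dfs(root, new StringBuilder(), best);
        return best[0];
    }

    // Visit only nodes that carry both colors and remember the deepest one.
    private static void dfs(Node node, StringBuilder path, String[] best) {
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            Node child = e.getValue();
            if (!(child.red && child.black)) continue; // nothing below is common either
            path.append(e.getKey());
            if (path.length() > best[0].length()) best[0] = path.toString();
            dfs(child, path, best);
            path.setLength(path.length() - 1);
        }
    }
}
```

When several common substrings share the maximum length, which one is returned depends on map iteration order.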

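22 Applications: Longest Palindrome Given a string S, find the longest palindrome in S. Build the suffix tree for the string S$S^-1, where S^-1 is S reversed. Mark leaves of suffixes that begin in S red. Mark leaves of suffixes that begin in S^-1 black. Look for the lowest split that has leaves in both sets. Note that a common substring of S and S^-1 is a palindrome only if its two occurrences cover the same positions of S, so each candidate needs that check. [Figure: suffix tree for ababc.]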

23-25 Linear Time Construction There is a long history of work: Weiner 1973, McCreight 1976, Ukkonen 1992. [Figure: the suffixes of mississippi, from mississippi down to i.]

26 McCreight Add the suffixes from longest to shortest. We add a termination symbol, such as $, that does not appear in the text. This forces each addition to split the existing tree. We can split (add a node and two edges) in constant time. Can we find the place to do the splitting in constant time? Suffix links give amortized linear time. But first, understand the algorithm. [Figure: the tree for ababc after inserting s1, then s2, then s3, which splits the ababc edge into ab followed by abc and c.]

27 Ukkonen Online algorithm: we don't need to know all of the string. Grow all suffixes together. In step k, add S[k] to the end of each suffix. At some point, suffix s_k will split from the tree (s2 breaks loose in step 2). After that, s_k will never split again (though something may split from it). A split for s_k may mean a similar split for s_(k+1). There are 3 splits when adding c: s3 splits from s1, s4 from s2 and s5 from the root. [Figure: the tree after reading a, ab, aba, abab and ababc.]

28 Review Introduce graphical notation for implicit nodes: a highlighted character inside an edge label marks a suffix that ends there, so aba means both suffixes "a" and "aba" are on the edge. The suffixes: ababc$ = s1, babc$ = s2, abc$ = s3, bc$ = s4, c$ = s5, $ = s6. [Figure: the tree after reading a, ab, aba, abab and ababc.]

29 Mississippi The suffixes: mississippi$ = s1, ississippi$ = s2, ssissippi$ = s3, sissippi$ = s4, issippi$ = s5, ssippi$ = s6, sippi$ = s7, ippi$ = s8, ppi$ = s9, pi$ = s10, i$ = s11, $ = s12. [Figure: the tree after reading m, mi, mis and miss.] s4 is an implicit node. s4 is the active path. Def: the first non-leaf suffix remaining.

30 Mississippi s4 is an implicit node (the red s in the s3 edge). s4 is the active path. Def: the first non-leaf suffix remaining. When we add S[5] = i, active path s4 splits and s5 becomes the active path. [Figure: the tree for miss and for missi.]

31 Mississippi s5 is the active path (the first non-leaf suffix remaining). At the end there are 3 non-leaf suffixes (s5, s6, s7). [Figure: the tree after reading missi, missis and mississ.]

32 Mississippi Add i. Add p. We have never seen p, so all 4 (now 5) trailing suffixes split. s10, at the root, becomes the active path. [Figure: the tree for mississi and for mississip.]

33 Mississippi Redraw the last diagram. We are about to add a second p. s10 is the active path, and it is at the root. [Figure: the tree for mississip.]

34 Mississippi The active path is still s10. It is trailing s9. [Figure: the tree for mississipp.]

35 Mississippi Add i. This forces the split of s10 from s9. The active path is now s11. [Figure: the tree for mississippi.]

36 Algorithm We are building a tree, adding character S[k] to every suffix. We traverse the boundary path, the growing edge of the tree. The boundary path includes suffixes that have already become leaves and suffixes that currently end in implicit interior nodes. We add character S[k] to the end of each suffix. In general we have O(N) suffixes on the boundary path, we add each of N characters to each suffix on the boundary path, and we must navigate from suffix to suffix, which may be O(N) steps apart. How can we do this in O(N) time?

37 Algorithm We have O(N) suffixes on the boundary path. We add each of N characters to each suffix on the boundary path. We navigate from suffix to suffix, which may be O(N) steps apart. How can we do this in O(N) time? Answer: we cheat. Here are three big ideas (we will explain each in detail). 1) Once a path has split off, updating it is free, so we ignore it. 2) Rather than "walk" the boundary edge as we add a new character, we only need to watch one representative: the active path, the longest suffix that is not yet a leaf. 3) When we do need to walk the boundary path, there is a cheap way to walk from suffix to suffix, by creating suffix links.

38 Leaves are Cheap 1) Once a path has split off, "updating" it is free. We represent a leaf that splits at character S[k] as the string S[k..whatever]. If some later suffix is following our path, it is up to that suffix to find the point of difference. s5 is following s2, but s2 is a leaf and does not care. We don't even need to know the length of the string (whatever). [Figure: the tree for mississi, where s2 = ississippi$ is a leaf and s5 = issippi$ follows its path.]

39 Active Path 2) We can focus our attention on the longest suffix that has not yet broken free, called the active path. This represents the rest of the boundary path. Assume the active path is the suffix S_i and we have just added char S[k]. Assume that S_i is a prefix of suffix S_j up to this point. Then S_(i+1) is a prefix of suffix S_(j+1), and so on. Proof: S_(i+1) is just S_i without character S[i]. The converse is not true: S_i may leave the tree while S_(i+1) remains in the tree. [Figure: S[i..k] = S_i aligned with S[j..k] = S_j, and S[i+1..k] = S_(i+1) aligned with S[j+1..k] = S_(j+1).] This means that we only need to watch s5. [Figure: the tree for mississi.]

40 Review example Add p. We have never seen p, so s5, s6, s7, s8 and s9 all split. s10, which is currently at the root, becomes the new active path. s5 is a prefix of s2, s6 is a prefix of s3, s7 is a prefix of s4, and s8 is a prefix of s5. [Figure: the tree for mississi and for mississip.]

41 Suffix Links 3) There is a cheap way to walk the boundary path. Once the active path splits, we need to walk the boundary path until splitting stops. To explain the suffix link, return to our view as a trie for ababc. We have inserted S[1] through S[4], and are about to insert S[5] = c. s1 points to s2, which points to s3, which points to s4, which points to the root. I know I will have no problems with leaves s1 and s2: the active path is s3. When I find that s3 needs to split from s1, I need to check s4 as well, and perhaps s5. I follow the suffix pointers from s3. [Figure: suffix trie for abab with suffix links.]
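
The trie view of suffix links can be sketched as an on-line construction: each new character is added by starting at the node for the whole text and following suffix pointers until a suffix that already has the needed branch is reached. This is a sketch of the quadratic suffix-trie version only, not the linear-time tree algorithm, and unlike the tree algorithm it also visits the leaves. All names (OnlineSuffixTrie, append, locate) are hypothetical.

```java
import java.util.*;

// Hypothetical class: on-line suffix trie construction using suffix links.
class OnlineSuffixTrie {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        Node suffixLink; // node spelling this node's string minus its first character
    }

    final Node root = new Node();       // root.suffixLink stays null
    private Node deepest = root;         // node for the whole text read so far (s1)
    int nodeCount = 1;

    void append(char c) {
        Node r = deepest;
        Node previousNew = null;
        // Walk the boundary path via suffix links until a suffix already has a c branch.
        while (r != null && !r.children.containsKey(c)) {
            Node created = new Node();
            nodeCount++;
            r.children.put(c, created);
            if (previousNew != null) previousNew.suffixLink = created;
            previousNew = created;
            r = r.suffixLink;
        }
        if (previousNew != null) {
            // Either we ran past the root, or r already has the branch we need.
            previousNew.suffixLink = (r == null) ? root : r.children.get(c);
        }
        deepest = deepest.children.get(c);
    }

    // Returns the node spelling s, or null if s is not a substring.
    Node locate(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }
}
```

Appending a, b, a, b, c in turn yields the trie for ababc. On the last step the walk visits s1, s2, s3 = ab, s4 = b and finally the root, creating a c branch at each, which mirrors the chain of pointers described on the slide.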

42 Accounting I add one character at a time to one suffix, the active path. This is clearly linear. When the active path splits, I need to start walking the boundary path from the old active path to the new end point (the point where the splitting stops). Any individual character may cause lots of splitting, but each suffix only splits once, so the amortized cost is linear. To walk the boundary path, I update the suffix links. This can also be amortized. [Figure: the suffix trie for abab before and after adding c.]

43 Building Suffix Links When we split, we need to add new nodes. These nodes will need new suffix links. [Figure: a chain of suffix links.]

51 Canonize We represent a suffix as an explicit node and a (growing) string of characters. Start with (n1 (a)). Add characters bbac to get (n1 (abbac)). We canonize this in a sequence of steps to get a better representation: (n2 (bac)), then (n3 (c)). This allows us to use the suffix link at n3 rather than the suffix link at n1. [Figure: a path n1, n2, n3, n4 with edge labels ab, ba, ca.]
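
The three ideas (open-ended leaves, a single active path, suffix links) together with the canonize step can be combined into a compact construction. The sketch below follows a commonly used formulation of Ukkonen's algorithm and is not the author's code; every name is an assumption. Here walkDown plays the role of canonize, OPEN stands in for the slides' "whatever", and remainder counts the suffixes that are not yet leaves.

```java
import java.util.*;

// Hypothetical class: a sketch of Ukkonen-style linear-time construction.
class UkkonenTree {
    static final int OPEN = Integer.MAX_VALUE; // leaf edges run to the end of the text

    static class Node {
        int start, end; // edge into this node is text[start, end)
        Node link;      // suffix link; null means "go to the root"
        Map<Character, Node> next = new HashMap<>();
        Node(int start, int end) { this.start = start; this.end = end; }
    }

    final String text;
    final Node root = new Node(0, 0);
    private Node activeNode = root;
    private int activeEdge = 0, activeLength = 0, remainder = 0, pos = -1;
    private Node needLink;

    UkkonenTree(String text) {
        this.text = text;
        for (int i = 0; i < text.length(); i++) extend(text.charAt(i));
    }

    private int edgeLength(Node n) { return Math.min(n.end, pos + 1) - n.start; }

    private void addLink(Node n) {
        if (needLink != null && needLink != root) needLink.link = n;
        needLink = n;
    }

    // Canonize: move the active point down while it lies beyond an edge.
    private boolean walkDown(Node n) {
        int len = edgeLength(n);
        if (activeLength < len) return false;
        activeEdge += len; activeLength -= len; activeNode = n;
        return true;
    }

    private void extend(char c) {
        pos++; needLink = null; remainder++;
        while (remainder > 0) {
            if (activeLength == 0) activeEdge = pos;
            char edgeChar = text.charAt(activeEdge);
            Node nxt = activeNode.next.get(edgeChar);
            if (nxt == null) {                        // no branch: new open leaf
                activeNode.next.put(edgeChar, new Node(pos, OPEN));
                addLink(activeNode);
            } else {
                if (walkDown(nxt)) continue;
                if (text.charAt(nxt.start + activeLength) == c) {
                    activeLength++;                   // c already present: splitting stops
                    addLink(activeNode);
                    break;
                }
                Node split = new Node(nxt.start, nxt.start + activeLength);
                activeNode.next.put(edgeChar, split); // split the edge
                split.next.put(c, new Node(pos, OPEN));
                nxt.start += activeLength;
                split.next.put(text.charAt(nxt.start), nxt);
                addLink(split);
            }
            remainder--;
            if (activeNode == root && activeLength > 0) {
                activeLength--; activeEdge = pos - remainder + 1;
            } else {
                activeNode = (activeNode.link != null) ? activeNode.link : root;
            }
        }
    }

    boolean contains(String p) {
        Node node = root;
        int i = 0;
        while (i < p.length()) {
            Node child = node.next.get(p.charAt(i));
            if (child == null) return false;
            int end = Math.min(child.end, text.length());
            for (int k = child.start; k < end && i < p.length(); k++, i++)
                if (text.charAt(k) != p.charAt(i)) return false;
            node = child;
        }
        return true;
    }

    int countLeaves(Node n) {
        if (n.next.isEmpty()) return 1;
        int total = 0;
        for (Node child : n.next.values()) total += countLeaves(child);
        return total;
    }
}
```

A quick sanity check is to build the tree for mississippi$ and confirm that it has 12 leaves, one per suffix s1 through s12, and that contains answers substring queries in time proportional to the pattern length.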

52 Post mortem The algorithm to build a Suffix Tree is linear in time and space. We haven't proved this, but perhaps it is now plausible. But is the algorithm practical? There are real issues when dealing with long strings. The human genome has about 3 billion base pairs. Keeping the suffix links updated can cause thrashing as we walk all over the suffix tree representing it. The suffix tree is important enough that people are working on the issue. One idea that is easy to describe: merging suffix trees.

53 References A great reference to the field is Dan Gusfield's Algorithms on Strings, Trees, and Sequences.
P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory.
Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm". Journal of the ACM 23 (2).
E. Ukkonen (1995). "On-line construction of suffix trees". Algorithmica 14 (3).
R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction". Algorithmica 19 (3).
Gusfield, Dan [1997] (1999). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press.