Suffix trees.

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Presentation transcript:

Suffix trees

Trie: A tree representing a set of strings, for example { aeef, ad, bbfe, bbfg, c }. [Figure: trie for this set.]

Trie (Cont): Assume no string is a prefix of another. Each edge is labeled by a letter, and no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. [Figure: the same trie with its leaves marked.]

Compressed Trie: Compress unary nodes and label edges by strings. [Figure: the trie and its compressed form, with edge labels such as eef and bbf.]

Suffix tree: Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

Suffix tree (Example): Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$: { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$.]

Trivial algorithm to build a suffix tree: Insert the suffixes one at a time, longest first. For abab$: put in the longest suffix abab$, then bab$, then ab$ (splitting the edge abab$ after ab), then b$ (splitting the edge bab$ after b), and finally $. [Figure sequence: the tree after each insertion.]

We will also label each leaf with the starting point of the corresponding suffix. Here the leaves are labeled 1 (abab$), 2 (bab$), 3 (ab$), 4 (b$) and 5 ($).

Analysis: The trivial algorithm takes O(n^2) time to build the tree. It can be done in O(n) time. But how come? Does the tree even take O(n) space?

Linear space? Consider the string aaaaaabbbbbb. [Figure: its suffix tree, in which many edges carry long labels such as bbbbbb$.]

To use only O(n) space, encode each edge label as a pair of positions in the string. For aaaaaabbbbbb$, the label a becomes (1,1), b becomes (7,7), $ becomes (13,13), b$ becomes (12,13), bbbbbb$ becomes (7,13) and abbbbbb$ becomes (6,13). [Figure: the same tree with every edge label replaced by its pair.]
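The trivial construction and the index-pair encoding above can be sketched together in Python. This is only an illustration under assumptions: the names `Node`, `build_suffix_tree` and `edges` are made up here, positions are stored 0-based and half-open internally, and the terminator `$` is assumed not to occur in the input. It runs in O(n^2) time, like the trivial algorithm, but each node stores only two integers for its edge label.

```python
class Node:
    def __init__(self, start=0, end=0):
        # The edge entering this node is labeled text[start:end].
        self.start = start
        self.end = end
        self.children = {}      # first character of an edge -> child Node
        self.leaf_label = None  # 1-based start of the suffix (leaves only)


def build_suffix_tree(s, terminator="$"):
    """Insert the suffixes of s + terminator one by one, longest first."""
    text = s + terminator
    root = Node()
    for i in range(len(text)):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node(pos, len(text))
                leaf.leaf_label = i + 1
                node.children[text[pos]] = leaf
                break
            # Walk along the edge while the characters agree.
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child  # whole edge matched, continue below it
                continue
            # Mismatch inside the edge: split it at k and add a leaf.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            leaf = Node(pos, len(text))
            leaf.leaf_label = i + 1
            mid.children[text[pos]] = leaf
            break
    return text, root


def edges(text, node, depth=0):
    """Yield (depth, label, leaf_label) for every edge, in sorted order."""
    for ch in sorted(node.children):
        child = node.children[ch]
        yield depth, text[child.start:child.end], child.leaf_label
        yield from edges(text, child, depth + 1)


for depth, label, leaf in edges(*build_suffix_tree("abab")):
    print("  " * depth + label, "" if leaf is None else f"-> leaf {leaf}")
```

For abab this prints the same tree as the example slides: edges $, ab and b out of the root, with leaves 5, 3, 1, 4 and 2.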

What can we do with it? Exact string matching: Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T. We may also want to find all occurrences of P in T.

Exact string matching: In preprocessing we just build a suffix tree in O(n) time. Given a pattern P = ab we traverse the tree according to the pattern. [Figure: suffix tree of abab$ with leaves labeled 1 to 5.]

If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time.
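The traversal described above can be sketched as follows. To keep the example short it uses an uncompressed suffix trie of nested dictionaries rather than a real suffix tree, so it takes O(n^2) space and the bounds from the slides do not apply. The names `build_suffix_trie` and `find_occurrences` and the `"leaf"` key are assumptions of this sketch.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie of text + '$' as nested dicts.

    The node reached after the final '$' of a suffix stores the 1-based
    starting position of that suffix under the key "leaf".
    """
    text += "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1
    return root


def find_occurrences(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # we got stuck: the pattern does not occur
        node = node[ch]
    # Every leaf below the node we reached is one occurrence.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
print(find_occurrences(trie, "ab"))
```

For T = abab and P = ab the leaves below the reached node are 1 and 3, matching the figure.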

Generalized suffix tree: Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of all s ∈ S. To associate each suffix with a unique string in S, add a different special character to each s.

Generalized suffix tree (Example): Let s1 = abab and s2 = aab. The generalized suffix tree for s1 and s2 is a compressed trie of { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }. [Figure: generalized suffix tree of abab$ and aab#, each leaf labeled with the starting position of its suffix.]

So what can we do with it ? Matching a pattern against a database of strings

Longest common substring (of two strings): Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa. Find such a node with the largest "string depth". [Figure: generalized suffix tree of abab$ and aab#.]

Lowest common ancestors: A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Why? The LCA of two leaves represents the longest common prefix (LCP) of these two suffixes. [Figure: generalized suffix tree of abab$ and aab#.]
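A small sketch of this idea, again on an uncompressed trie for brevity (an assumption of this example, not the linear-time structure from the slides): every node records which of the two strings has a suffix passing through it, and the deepest node reached by both strings spells a longest common substring. The function name and dictionary layout are made up for illustration.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that lies on a suffix of s1 and a suffix of s2."""
    root = {"children": {}, "owners": set()}
    for owner, text in enumerate((s1, s2)):
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "owners": set()}
                )
                node["owners"].add(owner)  # this string passes through here

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if len(child["owners"]) == 2:  # reached from both strings
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))
```

With s1 = abab and s2 = aab the deepest shared node spells ab. When several common substrings have the same maximal length, this sketch returns just one of them.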

Finding maximal palindromes: Examples of palindromes are caabaac and cbaabc. We want to find all maximal palindromes in a string s, for example s = cbaaba. Let sr denote the reverse of s. The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr.

Maximal palindromes algorithm: Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#. For every i, find the LCA of suffix i of s and suffix m-i+1 of sr.

Let s = cbaaba$, then sr = abaabc#. [Figure: generalized suffix tree of cbaaba$ and abaabc#.]

Analysis: O(n) time to identify all maximal palindromes, since after linear preprocessing each of the O(n) LCA queries takes constant time.
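The reduction from palindromes to longest common prefixes can be checked with a short sketch. It replaces the constant-time LCA query by a naive character scan, so it runs in O(n^2) time in the worst case rather than O(n). It uses 0-based Python indices, handles even-length palindromes only (centers between two characters), and the function names are assumptions of this example.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b (naive scan)."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Return (center, palindrome) for every non-empty maximal even palindrome.

    center = i means the gap between s[i-1] and s[i] (0-based).
    """
    n = len(s)
    sr = s[::-1]
    result = []
    for i in range(1, n):
        # s[i:] reads rightwards from the center; sr[n - i:] is s[:i]
        # reversed, so it reads leftwards from the center.
        radius = lcp(s[i:], sr[n - i:])
        if radius:
            result.append((i, s[i - radius:i + radius]))
    return result


print(maximal_even_palindromes("cbaaba"))
```

For s = cbaaba the only even-length palindrome is baab, centered between the two middle a's.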

Can we construct a suffix tree in linear time ?

Ukkonen's linear time construction: [Figure sequence: the tree for ACTAATC is grown one character at a time. The frames cover the prefixes A, AC, ACT, ACTA and ACTAA. Each new character is first appended to every existing leaf edge; for ACTAA the edge ACTAA is split after A, which creates leaf 4.]

Phases & extensions: Phase i is when we add character i. In phase i we have i extensions of suffixes.

[Figure sequence: phase 6 for ACTAATC, prefix ACTAAT. The leaf edges are extended by T, then a new leaf 5 is added below the internal node A.]

Extension rules:
Rule 1: The suffix ends at a leaf. You add a character on the edge entering the leaf.
Rule 2: The suffix ends internally and the extended suffix does not exist. You add a leaf and possibly an internal node.
Rule 3: The suffix ends internally and the extended suffix already exists. You do nothing.

[Figure sequence: phase 7, prefix ACTAATC. The leaf edges are extended by C, then the edge TAATC is split after T, which creates leaf 6.]

Skip forward. [Figure sequence: the same construction continued for ACTAATCAC, ACTAATCACT and ACTAATCACTG, ending with a tree that has leaves 1 to 11.]

Observations: At the first extension of phase i we must end at a leaf because no longer suffix exists (rule 1). At the second extension we are still most likely to end at a leaf. We will not end at a leaf only if the second suffix is a prefix of the first.

Say at some extension we do not end at a leaf. Then this suffix is a prefix of some other suffix (or suffixes). We will not end at a leaf in subsequent extensions. Is there a way to continue using the ith character? (Is it a prefix of a suffix where the next character is the ith character?) If yes, rule 3 applies; if no, rule 2 applies.

If we apply rule 3 then in all subsequent extensions we will apply rule 3. Otherwise we keep applying rule 2 until in some subsequent extension we apply rule 3.

In terms of the rules that we apply, a phase looks like: 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3. We have nothing to do when applying rule 3, so once rule 3 happens we can stop. We don't really do anything significant when we apply rule 1 (the structure of the tree does not change).

Representation: We do not really store a substring with each edge, but rather pointers to the starting position and ending position of the substring in the text. With this representation we do not really have to do anything when rule 1 applies.

How do phases relate to each other? If phase i looks like 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3, then in the next phase we must have 1 1 1 1 1 1 1 1 1 1 1 2/3. So we start the phase with the extension that was the first where we applied rule 3 in the previous phase.
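The 1...1 2...2 3...3 shape of a phase can be observed with a brute-force check. The sketch below is not Ukkonen's algorithm and builds no tree: it decides which rule would apply in each extension directly from substring occurrences, using the fact that a suffix ends at a leaf exactly when it occurs only once in the prefix read so far. The function names are assumptions of this example, and the running time is far from linear.

```python
def count_overlapping(haystack, needle):
    """Number of (possibly overlapping) occurrences of needle in haystack."""
    count, start = 0, haystack.find(needle)
    while start != -1:
        count += 1
        start = haystack.find(needle, start + 1)
    return count


def rule_pattern(s):
    """For each phase, the rules applied in its extensions, as a string."""
    phases = []
    for i in range(len(s)):      # phase adding character s[i] (0-based)
        prefix = s[:i]           # text already represented in the tree
        rules = []
        for j in range(i + 1):   # extension for the suffix s[j:i]
            beta = s[j:i]
            if beta and count_overlapping(prefix, beta) == 1:
                rules.append("1")  # ends at a leaf: extend the leaf edge
            elif s[j:i + 1] in prefix:
                rules.append("3")  # extended suffix already present
            else:
                rules.append("2")  # new leaf (and maybe an internal node)
        phases.append("".join(rules))
    return phases


for phase, rules in enumerate(rule_pattern("abab$"), start=1):
    print(phase, rules)
```

For abab$ the phases are 2, 12, 113, 1133 and 11222: within each phase the rules appear in the order 1, 2, 3, and each phase has at least as many rule 1 extensions as the one before it.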

Suffix Links: [Figure sequence: the suffix tree of ACTAATCACTG with leaves 1 to 11, with suffix links drawn between its internal nodes.]

Suffix Links: A suffix link goes from an internal node that corresponds to the string aβ to the internal node that corresponds to β (if there is such a node).

Is there such a node? Suppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy. So there was a suffix βx…. If there was also a suffix βz…, then a node corresponding to β is there. Otherwise it will be created in the next extension, when we add βy.

Inv: All suffix links are there except (possibly) that of the last internal node added. You are at the (internal) node corresponding to the last extension. Remember: we apply rule 2. You start a phase at the last internal node of the first extension in which you applied rule 3 in the previous iteration.

1) Go up one node (if needed) to find a suffix link.
2) Traverse the suffix link.
3) If you went up in step 1 along an edge that was labeled δ, then go down consuming a string δ.

Create the new internal node if necessary, add the suffix, and install a suffix link if necessary. [Figure: nodes for aβ and β, each followed by a path labeled δ, with the new node v and the new leaf edge y.]

Analysis: Handling all extensions of rule 1 and all extensions of rule 3 takes O(1) time per phase ⇒ O(n) total. How many times do we carry out rule 2 in all phases? O(n). Does each application of rule 2 take constant time? No! (Going up and traversing the suffix link takes constant time, but then we go down, possibly on many edges.)

So why is it a linear time algorithm? How much can the depth change when we traverse a suffix link? It can decrease by at most 1.

Punch line: Each time we go up or traverse a suffix link the depth decreases by at most 1. When starting the depth is 0, and the final depth is at most n. So during all applications of rule 2 together we cannot go down more than 3n times. THM: The running time of Ukkonen's algorithm is O(n).

Another application Suppose we have a pattern P and a text T and we want to find for each position of T the length of the longest substring of P that matches there. How would you do that in O(n) time ?

Drawbacks of suffix trees: Suffix trees consume a lot of space. It is O(n), but the constant is quite big. Notice that if we indeed want to traverse an edge in O(1) time then we need an array of pointers of size |Σ| in each node.

Suffix array: We lose some of the functionality but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 3 1 4 2.

How do we build it? Build a suffix tree. Traverse the tree in DFS, picking the edges outgoing from each node in lexicographic order, and fill the suffix array. O(n) time.

How do we search for a pattern? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. Takes O(m log n) time.
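As a quick way to reproduce the example, the suffix array can also be obtained by sorting the suffixes directly. This is a stand-in for the suffix tree traversal described above, not the O(n) construction: comparing suffixes costs up to O(n) each, so the sketch takes O(n^2 log n) time in the worst case. The function name is an assumption, and positions are reported 1-based as on the slides.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in sorted order."""
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return [i + 1 for i in order]


print(suffix_array("abab"))
print(suffix_array("mississippi"))
```

For abab this gives 3 1 4 2, the array shown above.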
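A minimal sketch of this search, assuming a 1-based suffix array built by plain sorting (the helper `suffix_array` and the function `find_occurrences` are names invented for this example). Each comparison looks at no more than m characters of a suffix, which gives the O(m log n) bound; two binary searches locate the two ends of the consecutive run of matching suffixes.

```python
def suffix_array(s):
    """1-based suffix array by direct sorting (simple, not linear time)."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def find_occurrences(text, sa, pattern):
    """Sorted 1-based positions of pattern in text, via binary search on sa."""
    m = len(pattern)

    def first_index(strictly_greater):
        # First index whose suffix, cut to m characters, is >= pattern
        # (or > pattern when strictly_greater is set).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            window = text[sa[mid] - 1:sa[mid] - 1 + m]
            if window < pattern or (strictly_greater and window == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = first_index(False)
    end = first_index(True)
    return sorted(sa[start:end])  # the occurrences are consecutive in sa


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "ssi"))
print(find_occurrences(text, sa, "issa"))
```

In mississippi the pattern ssi occurs at positions 3 and 6, while issa does not occur, so the second search returns an empty list.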

Example: Let S = mississippi and P = issa. The sorted suffixes with their starting positions, where L, M and R mark the left end, middle and right end of the first binary search step:
11 i (L)
8 ippi
5 issippi
2 ississippi
1 mississippi
10 pi (M)
9 ppi
7 sippi
4 sissippi
6 ssippi
3 ssissippi (R)

How do we accelerate the search? Maintain l = LCP(P,L) and r = LCP(P,R). If l = r then start comparing M to P at l + 1.

How do we accelerate the search? If l > r then suppose we know LCP(L,M).
If LCP(L,M) < l we go left.
If LCP(L,M) > l we go right.
If LCP(L,M) = l we start comparing at l + 1.

Analysis of the acceleration: If we do more than a single comparison in an iteration then max(l, r) grows by 1 for each comparison ⇒ O(log n + m) time.
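The bookkeeping of l and r can be illustrated with a simplified variant. The sketch below does not use precomputed LCP(L,M) values; instead it skips the first min(l, r) characters at every step, which is always safe because every suffix between L and R shares that many characters with P. This simplification does not guarantee the O(log n + m) bound stated above. The names `lcp_from` and `accelerated_search`, the 1-based suffix array and the virtual sentinels at both ends are assumptions of this example.

```python
def lcp_from(text, start, pattern, skip):
    """LCP of text[start:] and pattern, given the first `skip` chars match."""
    k = skip
    while (start + k < len(text) and k < len(pattern)
           and text[start + k] == pattern[k]):
        k += 1
    return k


def accelerated_search(text, sa, pattern):
    """Return the 1-based start of one occurrence of pattern, or None.

    sa is a 1-based suffix array of text. Invariant: the suffix at index
    `left` is smaller than pattern and the suffix at index `right` is not;
    indices -1 and len(sa) act as virtual sentinels with LCP 0.
    """
    m = len(pattern)
    left, right = -1, len(sa)
    l = r = 0  # l = LCP(P, L), r = LCP(P, R)
    while right - left > 1:
        mid = (left + right) // 2
        start = sa[mid] - 1
        k = lcp_from(text, start, pattern, min(l, r))
        if k == m:
            right, r = mid, k  # P is a prefix of M: go left
        elif start + k == len(text) or text[start + k] < pattern[k]:
            left, l = mid, k   # M is smaller than P: go right
        else:
            right, r = mid, k  # M is larger than P: go left
    if right < len(sa) and r == m:
        return sa[right]
    return None


text = "mississippi"
sa = [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]
print(accelerated_search(text, sa, "ssi"))
print(accelerated_search(text, sa, "issa"))
```

The position returned is that of the lexicographically smallest suffix starting with P, which need not be the leftmost occurrence in the text.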