SUFFIX TREES: from exact to approximate string matching. 17 December 2003, Luca Bortolussi.

An introduction to string matching. String matching is an important branch of algorithmics, with applications in many fields, such as: text searching, molecular biology, data compression, and so on.

Exact string matching: a brief history. Naive algorithm; Knuth-Morris-Pratt (1977); Boyer-Moore (1977); suffix trees: Weiner (1973), McCreight (1976), Ukkonen (1995).

Naive Algorithm. Example: text bcadbcddacdbbba, pattern cdda. The pattern is compared against the text at every possible starting position.
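A minimal sketch of the naive algorithm on the slide's example (the function name is mine); it simply tests every alignment of the pattern against the text.

def naive_search(text, pattern):
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        # compare the pattern with the text window starting at i
        if text[i:i + m] == pattern:
            occurrences.append(i)
    return occurrences

print(naive_search("bcadbcddacdbbba", "cdda"))   # -> [5]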

Knuth-Morris-Pratt. Example: text bcabbcaddbcababcdbbba, pattern bcababcd. After a mismatch, the pattern is shifted according to the longest prefix of the matched part that is also its suffix, so that text characters are never re-examined.
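A possible implementation of Knuth-Morris-Pratt in its failure-function formulation (names are mine); it assumes a non-empty pattern.

def kmp_search(text, pattern):
    m = len(pattern)
    # failure function: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # scan the text, never moving backwards in it
    occurrences, k = [], 0
    for j, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == m:
            occurrences.append(j - m + 1)
            k = fail[k - 1]
    return occurrences

print(kmp_search("bcabbcaddbcababcdbbba", "bcababcd"))   # -> [9]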

Boyer-Moore. Example: text babcabaddbabdabcdbbba, pattern babdab. The pattern is compared right to left, and after a mismatch it is shifted by the maximum between the shift suggested by the bad character rule and the one suggested by the good suffix rule.
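A sketch of the simpler Horspool variant, which keeps only the bad-character rule (so it is not the full Boyer-Moore algorithm); the function name is mine.

def horspool_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # bad-character shifts: distance from the rightmost occurrence of a
    # character (excluding the last position) to the end of the pattern
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    occurrences, j = [], 0
    while j <= n - m:
        if text[j:j + m] == pattern:
            occurrences.append(j)
        # shift according to the text character aligned with the pattern's end
        j += shift.get(text[j + m - 1], m)
    return occurrences

print(horspool_search("babcabaddbabdabcdbbba", "babdab"))   # -> [9]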

Suffix Trees. Definition: a suffix tree for a string T of length m is a rooted tree such that: 1. it has exactly m leaves, numbered from 1 to m; 2. every edge has a label, which is a substring of T; 3. every internal node has at least two children; 4. the labels of two edges leaving the same internal node do not start with the same character; 5. the label of the path from the root to the leaf numbered i is the suffix of T starting at position i, i.e. T[i..m].

Suffix Trees - II. [Figure: the suffix tree for the string abbcbab#.]

Suffix Trees – searching a pattern. [Figure: searching the pattern bcb in the suffix tree for abbcbab#.] To search a pattern, follow from the root the unique path whose label spells the pattern; the leaf numbers in the subtree below the end of that path are exactly the starting positions of the occurrences.

Suffix Trees – naive construction. [Figure: building the suffix tree for abbcbab# by inserting the suffixes abbcbab#, bbcbab#, … one at a time.] The naive construction inserts the m suffixes one after the other and takes O(m^2) time.
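A minimal sketch of the idea, using an uncompressed suffix trie instead of a compressed suffix tree (so both space and construction time are quadratic); all names are mine. Suffixes are inserted one at a time, and the search walks down from the root and collects the leaf numbers below.

def build_suffix_trie(s):
    s += "#"                      # unique terminator, as in the slides
    root = {}
    for i in range(len(s)):       # insert suffix s[i:]
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
        node["$leaf"] = i         # record the suffix's starting position
    return root

def collect_leaves(node):
    if "$leaf" in node:
        yield node["$leaf"]
    for c, child in node.items():
        if c != "$leaf":
            yield from collect_leaves(child)

def find_occurrences(root, pattern):
    node = root
    for c in pattern:             # follow the pattern from the root
        if c not in node:
            return []
        node = node[c]
    return sorted(collect_leaves(node))

trie = build_suffix_trie("abbcbab")
print(find_occurrences(trie, "bcb"))   # -> [2]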

Suffix Trees – Ukkonen Algorithm. Ukkonen's algorithm was published in 1995; it is an on-line, well-performing algorithm that builds a suffix tree in linear time. The basic idea is to construct iteratively the implicit suffix trees for S[1..i] in the following way:

Construct tree I_1
for i = 1 to m-1          // phase i+1
    for j = 1 to i+1      // extension j
        find the end of the path from the root labelled S[j..i] in the current tree,
        and extend it with character S(i+1), so that S[j..i+1] is in the tree.

Each extension follows one of the next three rules, where α = S[j..i]:
1. α ends at a leaf: add S(i+1) at the end of the label of the edge entering that leaf.
2. Some path continues from the end of α, but none of them starts with S(i+1): add a node at the end of α (if there is none) and a new edge labelled S(i+1), terminating in a leaf numbered j.
3. Some path continuing from the end of α starts with S(i+1): do nothing.

Suffix Trees – Ukkonen Algorithm - II. The main idea used to speed up the construction of the tree is the concept of suffix link. Suffix links are pointers from a node v with path label xα to the node s(v) with path label α (α is a string and x a character). An interesting feature of suffix trees is that every internal node, except the root, has a suffix link towards another node. [Figure: the suffix tree for abbcbab#, with a suffix link drawn from a node v to s(v).]

Suffix Trees – Ukkonen Algorithm - III. With suffix links we can speed up the construction of the suffix tree: when an extension ends at or below a node with path label xα, the next extension, which deals with α, can start from the node reached through the suffix link instead of walking down from the root again. In addition, every node can be crossed in constant time, just by keeping track of the length of the label of every single edge: since no two edges leaving a node can start with the same character, a single comparison suffices to decide which branch must be taken. Even so, using suffix links alone, the complexity is still quadratic.

Suffix Trees – Ukkonen Algorithm - IV. To complete the speed-up of the algorithm, we need the following observations: Once a leaf is created, it remains a leaf forever. Once rule 3 is used in a phase, all successive extensions of that phase would also use it, hence we can skip them. If in phase i rules 1 and 2 are applied only in the first j_i extensions, then in phase i+1 the first j_i extensions can be made in constant time, simply by adding the character S(i+2) at the end of the paths to the first j_i leaves (a global variable e is used to do this). Hence extensions are computed explicitly only from j_i + 1 onwards, reducing their total number over all phases to 2m. Storing the path labels explicitly would cost quadratic space; however, each edge needs only constant space, i.e. two pointers, one to the beginning and one to the end of the substring it carries as label.
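A small illustrative snippet (not part of a full Ukkonen implementation; the class and variable names are mine) of the constant-space edge representation, with a shared global end so that rule 1 extends every leaf edge with a single assignment.

class Edge:
    def __init__(self, start, end):
        self.start, self.end = start, end   # end may be the shared list below

    def label(self, s):
        end = self.end[0] if isinstance(self.end, list) else self.end
        return s[self.start:end + 1]

S = "abbcbab#"
current_end = [2]                 # global end 'e', shared by every leaf edge
leaf = Edge(1, current_end)       # a leaf edge created in an early phase
print(leaf.label(S))              # -> 'bb'
current_end[0] = 7                # one update extends all leaf edges at once
print(leaf.label(S))              # -> 'bbcbab#'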

Generalized Suffix Trees. A generalized suffix tree is simply a suffix tree for a set of strings, each one ending with a different marker. Each leaf carries two numbers, one identifying the string and the other identifying the position inside that string. Example: S1 = abbc$, S2 = babc#. [Figure: the generalized suffix tree for S1 and S2, with leaves labelled by pairs such as (1,1), (2,3), …]

Longest common substring. Let S1 and S2 be two strings over the same alphabet. The longest common substring problem is to find the longest substring of S1 that is also a substring of S2. Knuth in 1970 conjectured that this problem required Ω(n^2) time. Building a generalized suffix tree for S1 and S2, to solve the problem one has to identify the nodes whose subtrees contain leaves of both S1 and S2, and choose among them the one with greatest string depth (the length of the path label from the root to the node). All these operations cost O(n).
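A minimal sketch of the idea, using an uncompressed generalized suffix trie rather than a linear-size generalized suffix tree (all names are mine): every node records which of the two strings its path label occurs in, and the answer is the deepest node marked by both.

def add_suffixes(root, s, string_id):
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
            node.setdefault("$ids", set()).add(string_id)

def longest_common_substring(s1, s2):
    root = {}
    add_suffixes(root, s1, 1)
    add_suffixes(root, s2, 2)
    best = ""

    def dfs(node, path):
        nonlocal best
        for c, child in node.items():
            # skip the id marker; descend only through nodes common to both strings
            if c == "$ids" or child["$ids"] != {1, 2}:
                continue
            if len(path) + 1 > len(best):
                best = path + c
            dfs(child, path + c)

    dfs(root, "")
    return best

print(longest_common_substring("abbc", "babc"))   # -> 'ab'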

Longest Common Extension. Another problem that can be solved in linear time using suffix trees is the longest common extension problem: for every pair of indices (i, j), find the length of the longest substring of T starting at position i that matches a substring of P starting at position j. It can be solved in O(n+m) time by building a generalized suffix tree for T and P and finding, for every leaf i of T and leaf j of P, their lowest common ancestor in the tree (which can be done in constant time after a linear-time preprocessing of the tree).
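A sketch answering a single query by direct comparison (the function name is mine); the suffix-tree solution answers each query in constant time via the lowest common ancestor of the two leaves, which is what makes the algorithms below fast.

def lce(T, P, i, j):
    """Length of the longest common prefix of T[i:] and P[j:]."""
    k = 0
    while i + k < len(T) and j + k < len(P) and T[i + k] == P[j + k]:
        k += 1
    return k

print(lce("abbcbab", "bcb", 2, 0))   # -> 3: 'bcb' matches at position 2 of T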

Hamming and Edit Distances. Hamming distance: two strings of the same length are aligned, and the distance is the number of mismatches between them. Example: abbcdbaabbc and abbdcbbbaac, H = 6. Edit distance: the minimum number of insertions, deletions and substitutions needed to transform one string into the other. Example: abbcdbaabbc and cbcdbaabc, E = 3.
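A direct computation of the Hamming distance on the slide's first example (the function name is mine); the edit distance is computed by the dynamic program shown a few slides below.

def hamming(s, t):
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("abbcdbaabbc", "abbdcbbbaac"))   # -> 6, as on the slide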

The k-mismatches problem. We have a text T and a pattern P, and we want to find the occurrences of P in T allowing at most k mismatches, i.e. we want to find all substrings T' of T such that H(P, T') ≤ k. We can use suffix trees, but they no longer perform well: the algorithm scans all the paths to the leaves, keeping track of the number of errors, and abandons a path as soon as this number becomes greater than k. The algorithm can be sped up using longest common extensions: for every suffix of T, the pieces of agreement between the suffix and P are matched one after the other until P is exhausted or the number of errors exceeds k; every piece is found in constant time. The complexity of the resulting algorithm is O(k|T|). Example: T = aaacaabaaaaa…, P = aabaab; an occurrence is found at position 2 of T, with one error.
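A minimal sketch without suffix trees (names are mine): slide a window over T and abandon each position as soon as more than k mismatches are seen. The O(k|T|) method described above replaces the inner loop with at most k+1 constant-time longest-common-extension jumps.

def k_mismatch_search(T, P, k):
    occurrences = []
    for i in range(len(T) - len(P) + 1):
        errors = 0
        for a, b in zip(T[i:], P):
            if a != b:
                errors += 1
                if errors > k:
                    break
        else:
            occurrences.append(i)
    return occurrences

# -> [1, 4] (0-based; the slide's "position 2" is index 1)
print(k_mismatch_search("aaacaabaaaaa", "aabaab", 1))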

Inexact Matching. In biology, inexact matching is very important: similarity between DNA sequences often implies that they have the same biological function (the converse is not true); mutations and transcription errors make exact comparison not very useful. There are many algorithms dealing with inexact matching (with respect to edit distance), mainly based on dynamic programming or on automata. Suffix trees are used only as a secondary tool in some of them, because their structure is ill-suited to dealing with insertions and deletions, and even with substitutions. The main effort is spent on speeding up the average-case behaviour of the algorithms, which is justified by the fact that random sequences often fall into these favourable cases (and DNA sequences have a high degree of randomness).

Dynamic Programming. We aim to compute the edit distance (a global alignment) between two strings S and T. The main idea is to compute the edit distance between every prefix of S and every prefix of T. Let D(i,j) be this distance; the edit distance between S and T is then D(n,m), where n = |S| and m = |T|. The following properties hold: 1. D(i,0) = i, D(0,j) = j; 2. D(i,j) = min { D(i,j-1) + 1, D(i-1,j) + 1, D(i-1,j-1) + t(i,j) }, where t(i,j) = 0 if S(i) = T(j) and 1 otherwise. Hence in O(nm) time we can compute a matrix which encodes not only the edit distance, but also the way to transform one string into the other (just by keeping track, by means of pointers, of which elements realize the minimum).
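The recurrence above, filled in bottom-up (the function name is mine); the matrix is computed for the next slide's example, CASE versus ARE.

def edit_distance_matrix(S, T):
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                       # property 1: D(i,0) = i
    for j in range(m + 1):
        D[0][j] = j                       # property 1: D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,      # insertion
                          D[i - 1][j] + 1,      # deletion
                          D[i - 1][j - 1] + t)  # match / substitution
    return D

D = edit_distance_matrix("CASE", "ARE")
print(D[-1][-1])   # -> 2 (e.g. CASE -> ASE -> ARE)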

Dynamic Programming II. [Figure: the dynamic programming matrix D for the strings CASE and ARE; the bottom-right entry gives the edit distance, 2.]

Non-Deterministic Automata. [Figure: a non-deterministic automaton for the pattern CASE allowing errors: one copy of the pattern per allowed number of errors, the rows connected by substitution, insertion and deletion (ε) transitions.] To recognize the approximate occurrences of a pattern P in a text T, we can build a non-deterministic automaton for P and run it with T as input. This leads to faster algorithms for the search, but the problem is building (or efficiently simulating) the automaton.
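A sketch that simulates such an automaton directly with sets of states (pattern position, errors used); all names are mine. A state (|P|, e) with e ≤ k active after reading a text character means an approximate occurrence ends there. Practical algorithms encode the rows of states as bit vectors instead of explicit sets.

def nfa_approx_search(T, P, k):
    m = len(P)

    def closure(states):
        # epsilon transitions: deleting a pattern character costs one error
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if i < m and e < k and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    occurrences = []
    active = closure({(0, 0)})
    for pos, c in enumerate(T):
        nxt = {(0, 0)}                      # searching: restart at any position
        for i, e in active:
            if i < m and P[i] == c:
                nxt.add((i + 1, e))         # match
            if i < m and e < k:
                nxt.add((i + 1, e + 1))     # substitution
            if e < k:
                nxt.add((i, e + 1))         # insertion in the text
        active = closure(nxt)
        if any(i == m for i, e in active):
            occurrences.append(pos)         # an occurrence ends here (0-based)
    return occurrences

print(nfa_approx_search("two cases", "case", 1))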

Longest Common Subsequence. The longest common subsequence of two strings S1 and S2 is the greatest number of characters of S1 that can be aligned, in order, with characters of S2. It is a global alignment problem, obviously connected with edit distance. It is often modelled with a scoring scheme that gives a positive score to matches and a negative one to mismatches, insertions and deletions; the best global alignment is then the one which maximizes the total score. Given such a best global alignment, the number of matches in it is the longest common subsequence solution. Example alignment:
a b b c d a b b a
a b _ c b a b _ a
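The standard dynamic program for the LCS length (the function name is mine), run on the two strings read off the slide's alignment (abbcdabba and abcbaba); the scoring here is simply +1 per match.

def lcs_length(S1, S2):
    n, m = len(S1), len(S2)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if S1[i - 1] == S2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1    # align the two characters
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcs_length("abbcdabba", "abcbaba"))   # -> 6, the matches in the alignment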

The k-differences problem. The problem is to find all the occurrences of a pattern P in a text T allowing at most k insertions, deletions or substitutions. The Landau-Vishkin algorithm solves it in O(k|T|) time with a hybrid dynamic programming technique, which uses suffix trees to solve a subproblem. The algorithm looks for paths in the dynamic programming matrix which start in the upper row, in particular for d-paths, i.e. paths that specify exactly d mismatches and spaces. Some of these paths are computed, for d ≤ k, and the ones that reach the bottom row correspond to approximate occurrences of P in T with exactly d mismatches or spaces.

Landau-Vishkin Algorithm. Each diagonal of the matrix is numbered: the main diagonal is numbered 0, the upper diagonals with increasing positive integers and the lower diagonals with decreasing negative integers. A d-path is farthest-reaching in diagonal i if it ends in diagonal i and the index of its ending column is greater than or equal to that of every other d-path ending in diagonal i.

Landau-Vishkin Algorithm - II. The farthest-reaching d-path ending in diagonal i is one of the following three: 1. the farthest-reaching (d-1)-path of diagonal i+1, plus a vertical edge, plus the maximal extension along diagonal i that corresponds to identical substrings of P and T; 2. the farthest-reaching (d-1)-path of diagonal i-1, plus a horizontal edge, plus the maximal extension along diagonal i; 3. the farthest-reaching (d-1)-path of diagonal i itself, plus a diagonal edge corresponding to a mismatch, plus the maximal extension along diagonal i. Each maximal extension between substrings of P and T can be computed in constant time by means of suffix trees (longest common extension queries).
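A sketch of this recurrence (all names are mine), with the extension step done by direct character comparison instead of the constant-time suffix-tree LCE queries that give the O(k|T|) bound; it reports the text positions (number of text characters consumed) where approximate occurrences end.

def k_differences(P, T, k):
    m, n = len(P), len(T)
    L = {}          # (diagonal, errors) -> farthest row reached
    ends = set()    # text positions where an approximate occurrence ends

    def extend(row, d):
        # walk down diagonal d while pattern and text characters agree
        while row < m and 0 <= row + d < n and P[row] == T[row + d]:
            row += 1
        return row

    for e in range(k + 1):
        for d in range(-e, n + 1):
            if e == 0:
                row = 0                               # 0-paths start in the upper row
            else:
                cand = []
                if (d, e - 1) in L:
                    cand.append(L[d, e - 1] + 1)      # mismatch (diagonal edge)
                if (d - 1, e - 1) in L:
                    cand.append(L[d - 1, e - 1])      # space in P (horizontal edge)
                if (d + 1, e - 1) in L:
                    cand.append(L[d + 1, e - 1] + 1)  # space in T (vertical edge)
                if not cand:
                    continue
                row = min(max(cand), m, n - d)        # stay inside the matrix
            row = extend(row, d)
            L[d, e] = row
            if row == m and row + d >= 0:             # bottom row reached
                ends.add(row + d)
    return sorted(ends)

print(k_differences("aabaab", "aaacaabaaaaa", 1))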

Inexact Matching, a new approach. Suffix trees work very well for exact matching, but they fail when we admit errors in the matching process. This happens because the only way to find approximate occurrences of a pattern, when searching it in a suffix tree, is to walk down every path, keeping track of the errors and discarding the paths that exceed the tolerance level chosen beforehand. A different approach is to define a different data structure, similar to suffix trees, which encodes in some way a notion of distance, in particular the Hamming distance. A possible way is to shift from the alphabet A to the alphabet A^k, encoding the distance in a relation between letters: two letters are said to be "equivalent" if and only if their Hamming distance is below a given threshold.

Equivalence between letters. Let's show an example of this idea of equivalence, with A = {0,1} and k = 3, so the letters of A^3 are the eight binary strings of length 3 (say a = 000, b = 001, c = 010, d = 011, and so on). [The slide shows the resulting table for A^3.] If the distance between two letters is less than or equal to 1, we define them equivalent. For example a ≡ b and b ≡ d, but NOT(a ≡ d).
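A small sketch that rebuilds such a table (names are mine): every pair of letters of A^3 at Hamming distance at most 1 is marked equivalent.

from itertools import product

letters = ["".join(bits) for bits in product("01", repeat=3)]

def equivalent(x, y, threshold=1):
    # two letters of A^3 are equivalent iff their Hamming distance is small
    return sum(a != b for a, b in zip(x, y)) <= threshold

for x in letters:
    print(x, "~", [y for y in letters if equivalent(x, y)])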

Bundled Suffix Trees. Given this equivalence relation (which is not transitive), we want to incorporate it into a tree structure. For simplicity, we require that the tree for a sequence S is the smallest tree which contains, for every substring of S, the exact path and all the equivalent paths that can be found in S. For "historical reasons", we will call it a bundled suffix tree. Definition: a bundled suffix tree B for a string S of length m is a rooted tree such that: it has exactly m leaves, numbered from 1 to m; every edge has a label, which is a substring of S; every node has a set of labels, which is a subset of {1, 2, .., m} possibly together with a special marker; the tree obtained by deleting all nodes that do not carry the special marker is the suffix tree of S; for every substring P of S, the node labels in the subtree of B rooted at the end of the path labelled P give, taking their union (and discarding the marker), all the starting positions of the substrings of S equivalent to P; in every path from the root to a leaf, no two nodes are labelled with the same number.

Bundled Suffix Trees - II. [Figure: the bundled suffix tree for the string abbcda#, with the label sets attached to its nodes.]

Open Problems. 1. Do BSTs work well for the Hamming distance? (They seem to need a distributed notion of distance.) 2. How can BSTs be used to manage approximate searching under the edit distance, and at what price? 3. What is the expected average number of "red" nodes? Is it linear, or does it grow quadratically? 4. Is there a linear-time algorithm for building BSTs? 5. Do BSTs improve on existing algorithms, or is the interest purely theoretical?