Applications of Suffix Trees Dr. Amar Mukherjee CAP 5937 – ST: Bioinformatics University of central Florida.

Slides:



Advertisements
Similar presentations
Boosting Textual Compression in Optimal Linear Time.
Advertisements

CS 336 March 19, 2012 Tandy Warnow.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Longest Common Subsequence
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.
Suffix Trees Construction and Applications João Carreira 2008.
Suffix Tree. Suffix Tree Representation S=xabxac Represent every edge using its start and end text location.
Main Index Contents 11 Main Index Contents Week 6 – Binary Trees.
Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
3 -1 Chapter 3 String Matching String Matching Problem Given a text string T of length n and a pattern string P of length m, the exact string matching.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Suffix Trees and Their Uses.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
1 Huffman Codes. 2 Introduction Huffman codes are a very effective technique for compressing data; savings of 20% to 90% are typical, depending on the.
1 Applications of Suffix Trees Charles Yan Exact String Matching |P|=n, |T|=m P and T are both known at the same time Boyer-Moore, or Suffix.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 2: KMP Algorithm Lecturer:
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Data Structures – LECTURE 10 Huffman coding
Data Structures Using C++ 2E Chapter 11 Binary Trees and B-Trees.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
An O(N 2 ) Algorithm for Discovering Optimal Boolean Pattern Pairs Hideo Bannai, Heikki Hyyro, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai and Satoru.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Huffman Codes. Encoding messages  Encode a message composed of a string of characters  Codes used by computer systems  ASCII uses 8 bits per character.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
Lecture 10 Trees –Definiton of trees –Uses of trees –Operations on a tree.
Section 10.1 Introduction to Trees These class notes are based on material from our textbook, Discrete Mathematics and Its Applications, 6 th ed., by Kenneth.
The Pumping Lemma for Context Free Grammars. Chomsky Normal Form Chomsky Normal Form (CNF) is a simple and useful form of a CFG Every rule of a CNF grammar.
CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.
Binary Trees, Binary Search Trees RIZWAN REHMAN CENTRE FOR COMPUTER STUDIES DIBRUGARH UNIVERSITY.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching.
 Rooted tree and binary tree  Theorem 5.19: A full binary tree with t leaves contains i=t-1 internal vertices.
Trees  Linear access time of linked lists is prohibitive Does there exist any simple data structure for which the running time of most operations (search,
Book: Algorithms on strings, trees and sequences by Dan Gusfield Presented by: Amir Anter and Vladimir Zoubritsky.
Chapter 18: Searching and Sorting Algorithms. Objectives In this chapter, you will: Learn the various search algorithms Implement sequential and binary.
Foundation of Computing Systems
Bijective tree encoding Saverio Caminiti. 2 Talk Outline Domains Prüfer-like codes Prüfer code (1918) Neville codes (1953) Deo and Micikevičius code (2002)
LIMITATIONS OF ALGORITHM POWER
Chapter 8 Properties of Context-free Languages These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata,
Chapter 6 – Trees. Notice that in a tree, there is exactly one path from the root to each node.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Chapter 11 Indexing And Hashing (1) Yonsei University 1 st Semester, 2016 Sanghyun Park.
Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu Lecture 18.
Theory of Computational Complexity Probability and Computing Chapter Hikaru Inada Iwama and Ito lab M1.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
15-853:Algorithms in the Real World
HUFFMAN CODES.
Ariel Rosenfeld Bar-Ilan Uni.
13 Text Processing Hongfei Yan June 1, 2016.
Data Structure and Algorithms
Suffix Trees String … any sequence of characters.
Chap 3 String Matching 3 -.
Important Problem Types and Fundamental Data Structures
Sequences 5/17/ :43 AM Pattern Matching.
Algorithms CSCI 235, Spring 2019 Lecture 30 More Greedy Algorithms
Presentation transcript:

Applications of Suffix Trees Dr. Amar Mukherjee CAP 5937 – ST: Bioinformatics University of central Florida

7.1 Exact String Matching Three important variants:  Both P (|P|=n) and T (|T|=m) are known: Suffix tree method achieves same worst-case bound O(n+m) as KMP or BM.  T is fixed and build suffix tree, then P is input: k: number of occurrences of P Using suffix tree: O(n+k) In contrast ( preprocess P): O(n+m) for any single P  P is fixed, then T is input Select KMP or BM rather than suffix tree.

7.2 Exact Set Matching Both Aho-Corasick and Suffix methods find all occurrences of P in T in O(n+m+k). But have preference by case. Comparison:  AC: build keyword tree: size O(n), time O(n).  When set of patterns is larger than T, suffix tree approach uses less space, but more time to search. E.g. molecular biology, where pattern library is large.

 When total size of patterns is smaller than T, AC method use less space. But suffix tree uses less time.  Neither method is superior in time and space. One case where suffix tree is better, see Application 8.  Time/space trade-off remains, but suffix tree can be used for chosen time/space combinations, whereas no choice for keyword tree.

7.3 Substring Problem for a Database of Patterns The most interesting version:  A set of string, or a database, is known and fixed. A sequence of strings will be presented. For each presented string S, find all the strings in the database containing S as a substring. The total length of all the strings, m, in the database is assumed to be large. In the context of genomic DNA data, the problem of finding substring cannot be solved by exact set matching.

Suffix tree solution:  A generalized suffix tree is built for database in O(m) time and O(m) space.  Any single string S of length n can be found or declared not to be there in O(n) time.  If S matches a path in the tree. A full string is in S iff matching path reaches a leaf when last symbol of S is examined. Find all occurrences containing S as substring in O(n+k) time by traversing subtree below where S is found.

7.4. Longest Common Substring for Two Strings Different from Longest Common Subsequence problem. E.g. S1: superiorcalifornialivers S2: sealiver  Longest common substring: alive Longest common substring of two strings can be found in linear time using a generalized suffix tree.  Find the node with the greatest depth that is marked both 1 and 2.  Linear construction time  Node marking and calculation of string depth can be done by standard linear tree traversal methods. 1970: Don Knuth conjectured that a linear time algorithm would be impossible.

7.5 Recognizing DNA Contamination Given a string S1( the newly isolated and sequenced string of DNA) and a known string S2 ( the combined sources of possible contamination), find all substrings of S2 that occur in S1 and are longer than some given length l. These substrings are candidates of unwanted pieces of S2 that have contaminated the desired DNA string.

Finding common substrings Can be solved in linear time by extending the longest common substring of two strings. Build a generalized suffix tree for S1 and S2. Mark each internal node that has in its subtree a leaf representing a suffix of S1 and also a leaf representing a suffix of S2. Report all marked nodes that have string depth of l or greater.

7.6 Common Substrings of more than two sequences Given a set of strings, find substrings that are common to a large number of those strings. Formal statement: Given: K strings whose lengths sum to n For each k between 2 and K, we define l(k) to be the length of the longest substring common to at least k of the strings. Example: Strings - {sandollar, sandlot, handler, grand, pantry} l(2) – 4, l(3) – 3, l(4) – 3, l(5) – 2

7.6: Linear-time Solution Build generalized suffix tree T for all the input strings. For every internal node v of T, define c(v) to be the number of distinct string identifiers that appear at the leaves in the subtree of v.[ It is easy to compute the number of leaf nodes under v; but computing c(v) is complicated by the fact that more than two leaves may have same identifier.] l(k) is the depth of the deepest node v such that c(v)  k

7.6: Complexity Counting the number of leaves under an internal node v does not give c(v). Therefore, each internal node v maintains a K- length bit vector. Bit i in the vector is set to 1 if there is at least one leaf under v belonging to string i. The bit vector for an internal node v can be obtained ORing the bit-vectors of all the children of v. Since there are O(n) edges in the tree, the time needed will be O(Kn) There is a O(n) solution. See Chapter 9.

7.10 All-pairs Suffix-Prefix Matching Definition: Given two strings S i and S j, any suffix of S i that matches a prefix of S j is called a suffix-prefix match of S i, S j. Given a collection of strings S = S 1, S 2,…S k, the all- pairs suffix-prefix problem is the problem of finding, for each ordered pair S i, S j in S, the longest suffix-prefix match of S i, S j. Motivation: Approximate methods for the shortest superstring problem.

7.10: linear time solution We call an edge terminal edge if it is labeled with only a string termination symbol. Solution:  Build a generalized suffix tree T(S) for the k strings in S.  Build list L(v) for each internal node v L(v) contains index i if a terminal edge labeled i is incident on v  The deepest node v on the path to leaf j such that i  L(v) identifies that longest match between a suffix of S i and a prefix of S i.

7.10 (continued … ) Traverse T(S) in a depth-first manner Maintain k stacks, one for each string When a node v is reached in forward direction, push v on to the ith stack, for each i  L(v). When a leaf j corresponding to the entire string S j is reached, scan the k stacks and record the current top of each stack. When the depth-first traversal backs up past a node v, pop the top of any stack whose index is in L(v) Complexity:O(m+k 2 )

Importance of Repetitive Structures in molecular strings Over 50% of human genome consists of repeats. Complimentary palindromes regulate transcription (by forming hair-pin loops) Clustered genes that code for similar proteins Pseudogenes Restriction enzyme cutting sites Tandem repeats and tandem arrays

Uses of repetitive structures Genetic mapping  Requires the identification of markers that are highly variable between individuals  Tandem arrays can be used as such markers The number of repeats in a tandem array varies from individual to individual  Micro satellite markers – tandem repeats of very short strings

Finding all maximal repetitive structures Defining repeats is crucial  A string consisting of n copies of the same character will have O(n 4 ) pairs of repeats Maximal repeated pair in a given string S: “A pair of identical substrings  and  in S such that extending  and  in either direction would destroy the equality of the two strings”  i.e, occurrences x  y and v  w, x  v and y  w, where x,y,v and w are characters will give a maximal repeated pair  and .  Represented by a triple (p 1, p 2, n’), where p 1 and p 2 give the staring positions and n’ gives the length.  R(S) – the set of all triples describing maximal pairs in S

Maximal repeated pairs Example: Sxabcyiiizabcqabcyrxar Maximal pairs: (2,10,3) – xabcyiiizabcqabcyrxar (2,14,4) – xabcyiiizabcqabcyrxar (10,14,3) – xabcyiiizabcqabcyrxar (6,7,2) – xabcyiiizabcqabcyrxar - Allows overlaps!

More definitions Maximal repeat : “A substring of S that occurs in a maximal pair in S”. Example: abc in S = xabcyiiizabcqabcyrxar Note: There can be numerous maximal repeated pairs, but there can be only a limited number of maximal repeats. Supermaximal repeat “A maximal repeat that never occurs as a substring of any other maximal repeat” Example: abcy in S = xabcyiiizabcqabcyrxar

Using suffix trees to find maximal repeats Lemma : “If a string  is a maximal repeat in S, then  will be the path-label of an internal node v in T(S)” Proof: Gusfield, page 144 Theorem : “There can be at most n maximal repeats in any string of length n” - Why? ~~ root v

Finding maximal repeats: Definitions left character  For each position i in S, S(i-1) is called the left character i.  Left character of a leaf in T(S) is the left character of the suffix position represented by that leaf. left diverse  An internal node v in T(S) is called left-diverse if at least two leaves in v’s subtree have different left characters. Theorem: “The string  labeling the path to a node v of T(S) is a maximal repeat if and only if v is left diverse”

Finding left diverse nodes in linear time For each internal node v, the algorithm either:  Records that v is left diverse, or:  Records the character left(v) that is the left character of every leaf in v’s subtree. Starts by recording the left character of each leaf in T(S) Processes the internal nodes in T(S) bottom-up  If any child of v is left diverse, then v is left diverse  If none of the children are left diverse, then it examines the recorded characters of all the children: If all of the characters are x, then the left character of v is x If all of them are not x, then v is left-diverse

Finding all maximal repeats in linear time Path labels to all internal nodes in T(S) that are left diverse - Simply delete all internal nodes that are not left diverse!

Finding Supermaximal repeats in linear time Near-supermaximal repeat A substring  is near-supermaximal repeat if  is a maximal repeat that occurs at least once in a location where it is not contained in another maximal repeat Example: in a  bx  ya  bx  b,  is neither supermaximal nor near-supermaximal abc in xabcyiiizabcqabcyrxar is near-supermaximal Note: The set of near-supermaximal repeats is not the same as the set of maximal repeats that are not super-maximal

Finding super-maximal repeats Lemma : If v and w are internal nodes in T(S) such that w is a child of v, and if  is the path-label of v, then none of the occurrences of  specified the leaves in the subtree under w witness the near- supermaximality of . Lemma : Let w be a leaf representing a suffix starting at position i, and let w be a child of v. Then, the occurrence of  at position i witnesses the near- supermaximality of  if and only if x is the left character of no other leaf below v. root   v w  xx v w

Finding supermaximal repeats in linear time Theorem “A left diverse internal node v represents a near-supermaximal repeat  if and only if one of v’s children is a leaf, and its left- character is the left character of no other leaf below v.” “A left diverse internal node v represents a supermaximal repeat  if and only if all of v’s children are leaves, and each has a distinct left character” Degree of near-supermaximality: The fraction of occurrence of  that witness its near super-maximality

7.13: Circular string linearization Problem  Cut a circular string S so that the resulting linear string is lexically smallest of all the n possible linear strings created by cutting S. Solution  Cut S at an arbitrary position to give a linear string L.  Build the suffix tree T for the string LL$, where $ is lexically greater than any character in L.  Traverse tree T At every node, take the lexically smallest edge Traverse until the traversed has string-depth of n. Any leaf l at that point can be used to cut the string