Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String Dan Gusfield and Jens Stoye Journal of Computer and System Sciences 69.

Presentation transcript:

Linear Time Algorithms for Finding and Representing all Tandem Repeats in a String Dan Gusfield and Jens Stoye Journal of Computer and System Sciences 69 (2004) Presenter: Yung-Hsing Peng Date:

Abstract

Motivation Recently it was shown that the number of different types of tandem repeats contained in a string of length n is bounded by O(n) [FS98]. Can we find one occurrence of each tandem repeat type in O(n) time? Such a list, containing one occurrence of each repeat type, is called the vocabulary of the string S.

An Example for Vocabulary A pair (i, l) denotes the substring of length l that starts at position i. For example, a vocabulary of tandem repeats of the string abaabaabbaaabaaba$ is given by the set of pairs {(1, 6), (2, 6), (3, 2), (3, 6), (8, 2)}, representing the tandem repeats abaaba, baabaa, aa, aabaab, and bb. In the same example, the set of all occurrences of tandem repeats is {(1, 6), (2, 6), (3, 2), (3, 6), (6, 2), (8, 2), (10, 2), (11, 2), (11, 6), (12, 6), (14, 2)}. In this paper, we present an algorithm that finds the vocabulary of a string S of length n in O(n) time and space, by decorating the suffix tree of S.
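The two sets on this slide can be reproduced with a brute-force check. The sketch below is only an illustration, not the paper's linear-time algorithm; the function names `tandem_repeat_occurrences` and `vocabulary` are made up for this example, and positions are 1-indexed to match the slide.

```python
def tandem_repeat_occurrences(s):
    """Return every pair (i, l) such that s[i..i+l-1] has the form ww.

    i is a 1-indexed start position and l is the total length (twice |w|).
    Brute force: every start and every half-length is tried.
    """
    n = len(s)
    pairs = []
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                pairs.append((i + 1, 2 * half))
    return sorted(pairs)


def vocabulary(s):
    """Keep the leftmost occurrence of each distinct tandem repeat type."""
    seen = set()
    vocab = []
    # Occurrences are sorted by start position, so the first time a
    # type is met is its leftmost occurrence.
    for i, l in tandem_repeat_occurrences(s):
        w = s[i - 1:i - 1 + l]
        if w not in seen:
            seen.add(w)
            vocab.append((i, l))
    return vocab


S = "abaabaabbaaabaaba$"
print(tandem_repeat_occurrences(S))
print(vocabulary(S))
```

This runs in far more than linear time and is useful only for checking small examples by hand.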

Example for Our Goal

Basic Knowledge If a string aw is a tandem repeat, then the string wa is also a tandem repeat, where a is a single character and w is a string. An interval of positions i, i+1, …, j is called a run of l-length tandem repeats if (i, l), (i+1, l), …, (j, l) are each tandem repeat pairs. In this case, we say that (i, l) covers (i+1, l), (i+2, l), …, (j, l). If (i, l) covers (j, l), then the substring S[j..j+l-1] can be obtained from the substring S[i..i+l-1] by a series of successive right-rotations, each of which moves the first character to the end.
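The covering relation can be made concrete on the string itself: (i+1, l) is again a tandem repeat exactly when the character leaving on the left, S[i], equals the character entering on the right, S[i+l]. The sketch below is a string-level illustration under that observation; `covered_pairs` is a hypothetical helper name, and the input pair is assumed to be a tandem repeat already.

```python
def covered_pairs(s, i, l):
    """List the pairs covered by (i, l), using 1-indexed positions.

    Assumes (i, l) is a tandem repeat pair of s. Each loop iteration is
    one right-rotation: it succeeds when S[i] == S[i+l].
    """
    covered = []
    while i + l <= len(s) and s[i - 1] == s[i + l - 1]:
        i += 1
        covered.append((i, l))
    return covered


S = "abaabaabbaaabaaba$"
print(covered_pairs(S, 1, 6))   # pairs reachable from abaaba at position 1
print(covered_pairs(S, 8, 2))   # bb at position 8 cannot be rotated
```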

The Leftmost Covering Set A set of pairs P is a leftmost covering set if the leftmost occurrence of each type of tandem repeat in S is covered by a pair in P. For example, {(1, 6), (8, 2), (11, 2)} is a covering set of abaabaabbaaabaaba$, but it is not a leftmost covering set, since the leftmost occurrence of aa at position 3 is not covered. However, {(1, 6), (3, 2), (8, 2)} is a leftmost covering set. Note that the vocabulary set is {(1, 6), (2, 6), (3, 6), (3, 2), (8, 2)}, and both (3, 2) and (11, 2) represent aa.
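The two candidate sets on this slide can be checked mechanically. The following is a brute-force sketch, not part of the paper's algorithm; the helper names are invented for this example, and a pair is assumed to cover itself as well as the later pairs in its run.

```python
def is_tandem_repeat(s, i, l):
    """True if the length-l substring at 1-indexed position i is ww."""
    a, half = i - 1, l // 2
    return l % 2 == 0 and a + l <= len(s) and s[a:a + half] == s[a + half:a + l]


def leftmost_occurrences(s):
    """Map each tandem repeat type to the pair of its leftmost occurrence."""
    first = {}
    for i in range(1, len(s) + 1):
        for l in range(2, len(s) - i + 2, 2):
            if is_tandem_repeat(s, i, l):
                first.setdefault(s[i - 1:i - 1 + l], (i, l))
    return first


def covers(s, p, q):
    """True if p = (i, l) covers q = (j, l): every (k, l), i <= k <= j, is a tandem repeat."""
    (i, l), (j, m) = p, q
    return l == m and i <= j and all(is_tandem_repeat(s, k, l) for k in range(i, j + 1))


def is_leftmost_covering_set(s, pairs):
    """True if every leftmost occurrence is covered by some pair in `pairs`."""
    return all(any(covers(s, p, q) for p in pairs)
               for q in leftmost_occurrences(s).values())


S = "abaabaabbaaabaaba$"
print(is_leftmost_covering_set(S, {(1, 6), (8, 2), (11, 2)}))
print(is_leftmost_covering_set(S, {(1, 6), (3, 2), (8, 2)}))
```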

Main Idea Our goal can be achieved using a three-phase procedure. Phase I: Find a leftmost covering set. Phase II: Decorate the suffix tree with the pairs of the leftmost covering set. Phase III: Traverse the suffix tree and extend the decoration to the full vocabulary set.

Useful Tools for Phase I Two crucial tools are needed in Phase I. The first is the Lempel-Ziv (LZ) decomposition, and the second is the repeated use of longest common extension queries. Using these two tools, we can find a leftmost covering set of a string S in O(n) time and space.

Longest Common Extension Given two strings S1 and S2 of lengths m and n, the longest common extension of a pair (i, j) is the length of the longest common prefix of S1[i…m] and S2[j…n]. Each such query can be answered in constant time after linear-time and linear-space preprocessing [Gus97]. With this tool, one can easily find all tandem repeats in O(n^2) time by testing every possible repeat length at every position, the so-called brute-force approach. However, we can reduce the time to O(n) by combining it with another tool, the LZ decomposition.
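The brute-force use of longest common extensions can be sketched as follows: (i, 2k) is a tandem repeat exactly when the extension from positions i and i+k has length at least k. In this sketch `lce` is a naive character-by-character scan, which is an assumption made to keep the example short; the constant-time version cited as [Gus97] needs a preprocessed suffix tree and is not shown, so the code below does not achieve the O(n^2) bound.

```python
def lce(s1, s2, i, j):
    """Length of the longest common prefix of s1[i..] and s2[j..] (1-indexed).

    Naive scan; stands in for the constant-time query.
    """
    k = 0
    while (i - 1 + k < len(s1) and j - 1 + k < len(s2)
           and s1[i - 1 + k] == s2[j - 1 + k]):
        k += 1
    return k


def tandem_repeats_via_lce(s):
    """All pairs (i, 2k): one lce query per start i and half-length k."""
    n = len(s)
    return [(i, 2 * k)
            for i in range(1, n + 1)
            for k in range(1, (n - i + 1) // 2 + 1)
            if lce(s, s, i, i + k) >= k]


S = "abaabaabbaaabaaba$"
print(tandem_repeats_via_lce(S))
```

The number of queries is quadratic regardless of how `lce` is implemented, which is why the LZ decomposition is needed to reach linear time.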

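Lempel-Ziv decomposition l_i: the length of the longest prefix of S[i..n] that also occurs starting at some earlier position. s_i: the starting position of such an earlier occurrence. After we compute every l_i and s_i, we can use the formula i_{B+1} = i_B + max(1, l_{i_B}) to decompose the string S into blocks, where i_B is the starting position of block B (the red square represents the l_i discussed). All computation can be done in O(n) time [RPE81].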
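The block formula can be tried out with a naive computation of l_i. This is a sketch under two assumptions: the earlier occurrence may overlap position i, and l_i is found by direct comparison rather than by the linear-time method of [RPE81]. The name `lz_blocks` is hypothetical and indices inside the function are 0-based.

```python
def lz_blocks(s):
    """Split s into Lempel-Ziv blocks using i_{B+1} = i_B + max(1, l_{i_B})."""
    n = len(s)
    blocks = []
    i = 0
    while i < n:
        # l_i: longest prefix of s[i:] that also starts at some j < i.
        l_i = 0
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            l_i = max(l_i, k)
        step = max(1, l_i)          # a new character forms a block of length 1
        blocks.append(s[i:i + step])
        i += step
    return blocks


print(lz_blocks("abaabaabbaaabaaba$"))
```

On the running example this splits the string into seven blocks, and the leftmost abaaba at position 1 spans the first four of them.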

Usefulness of LZ decomposition (1/2) The right half of any tandem repeat occurrence touches at most two blocks of the LZ decomposition; otherwise the decomposition would be wrong (a contradiction). The leftmost occurrence of any tandem repeat touches at least two blocks; otherwise the same tandem repeat would also occur further to the left (any substring lying within a single block also occurs earlier in the string).

Usefulness of LZ decomposition (2/2) There are two conditions to consider: (1) The left half of the tandem repeat touches exactly one block. (2) The left half of the tandem repeat touches more than one block. This implies that the candidate lengths of a tandem repeat depend on the blocks, hence we do not need to try every length from 1 to n at every position by brute force.

Algorithm for Condition 1 The length of the tandem repeat is 2k.

Algorithm for Condition 2

The End of Phase I Now we have found a leftmost covering set. Algorithms 1a and 1b both run in O(n) time overall, since the blocks are non-overlapping and the algorithms process the blocks one by one, spending O(|B|) time on each block B. Both tools (LZ decomposition and longest common extension) also run in O(n) time. Hence we can find a leftmost covering set of a string S of length n in O(n) time.

Phase II After Phase I, we have the leftmost covering set obtained above (note that it was already sorted in Phase I). What we have to do now is to decorate the suffix tree with these pairs. Hint: Attach them to the leaves first, then move them up with a bottom-up strategy. Please read the paper for a detailed illustration.

A Useful Tool in Phase III You can move from ax to xa quickly: follow the suffix link from ax to jump directly to x, then extend x by the character a.

Phase III Use a depth-first search to visit every decorated pair. For every decorated pair, use the suffix link to perform the right-rotation mentioned before (to extend the run). If the right-rotation fails or collides with another decorated pair, the run of this pair has ended. Phase III can also be done in O(n) time, because the suffix links speed up the rotations (every rotation can be done in constant time). Please read the paper for a detailed illustration.
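The effect of Phase III can be imitated directly on the string, without a suffix tree. The sketch below is only a simulation under that simplification: it rotates each pair of a leftmost covering set to the right until the rotation fails or reaches a repeat type that was already recorded. The name `vocabulary_from_cover` is hypothetical, positions are 1-indexed, and the substring lookups used here are not constant time, unlike the suffix-link steps in the paper.

```python
def vocabulary_from_cover(s, cover):
    """Expand a leftmost covering set into one pair per tandem repeat type."""
    seen = {}                       # repeat type -> first pair that produced it
    for i, l in sorted(cover):
        while True:
            w = s[i - 1:i - 1 + l]
            if w in seen:           # collision with an already recorded type
                break
            seen[w] = (i, l)
            # Right-rotation succeeds only if S[i] == S[i+l].
            if i + l > len(s) or s[i - 1] != s[i + l - 1]:
                break
            i += 1
    return sorted(seen.values())


S = "abaabaabbaaabaaba$"
print(vocabulary_from_cover(S, {(1, 6), (3, 2), (8, 2)}))
```

The recorded pairs are leftmost occurrences only when the input really is a leftmost covering set; the function does not verify this.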

Conclusion

Reference