Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

Elementary Linear Algebra Anton & Rorres, 9th Edition
Longest Common Subsequence
Chapter 4 Systems of Linear Equations; Matrices
Divide and Conquer. Subject Series-Parallel Digraphs Planarity testing.
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter : k-difference.
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
SUFFIX TREES From exact to approximate string matching. 17 dicembre 2003 Luca Bortolussi.
296.3: Algorithms in the Real World
JM - 1 Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming Jarek Meller Jarek Meller Division.
Applied Discrete Mathematics Week 12: Trees
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String.
Dynamic Programming Reading Material: Chapter 7..
Dynamic Programming Technique. D.P.2 The term Dynamic Programming comes from Control Theory, not computer science. Programming refers to the use of tables.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11 sections4-7 Lecturer:
Introduction to Bioinformatics Algorithms Dynamic Programming: Edit Distance.
Introduction to Bioinformatics Algorithms Block Alignment and the Four-Russians Speedup Presenter: Yung-Hsing Peng Date:
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
1 Pairwise sequence alignment algorithms Elya Flax & Inbar Matarasso Seminar in Structural Bioinformatics - Pairwise sequence alignment algorithms.
Multiple Sequence alignment Chitta Baral Arizona State University.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
© 2004 Goodrich, Tamassia Dynamic Programming1. © 2004 Goodrich, Tamassia Dynamic Programming2 Matrix Chain-Products (not in book) Dynamic Programming.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Class 2: Basic Sequence Alignment
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Assignment 4. (Due on Dec 2. 2:30 p.m.) This time, Prof. Yao and I can explain the questions, but we will NOT tell you how to solve the problems. Question.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
Sequence Alignment.
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Dynamic Programming. Well known algorithm design techniques:. –Divide-and-conquer algorithms Another strategy for designing algorithms is dynamic programming.
Extending Alignments Υλικό βασισμένο στο κεφάλαιο 13 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Chapter 3 Computational Molecular Biology Michael Smith
Elementary Linear Algebra Anton & Rorres, 9 th Edition Lecture Set – 02 Chapter 2: Determinants.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
1 Network Models Transportation Problem (TP) Distributing any commodity from any group of supply centers, called sources, to any group of receiving.
Comp. Genomics Recitation 10 Clustering and analysis of microarrays.
Strings Basic data type in computational biology A string is an ordered succession of characters or symbols from a finite set called an alphabet Sequence.
Fall 2008Simple Parallel Algorithms1. Fall 2008Simple Parallel Algorithms2 Scalar Product of Two Vectors Let a = (a 1, a 2, …, a n ); b = (b 1, b 2, …,
Chapter 7 Dynamic Programming 7.1 Introduction 7.2 The Longest Common Subsequence Problem 7.3 Matrix Chain Multiplication 7.4 The dynamic Programming Paradigm.
Divide and Conquer Faculty Name: Ruhi Fatima Topics Covered Divide and Conquer Matrix multiplication Recurrence.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Refining Core String Edits and Alignments. How to find the optimal alignment in linear space.
Core String Edits, Alignments, and Dynamic Programming.
Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.
Computing smallest and largest repetition factorization in O(n log n) time Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai,
Divide & Conquer Algorithms
Sequence Alignment 11/24/2018.
Dynamic Programming 1/15/2019 8:22 PM Dynamic Programming.
Greedy Algorithms TOPICS Greedy Strategy Activity Selection
CSE 589 Applied Algorithms Spring 1999
3. Brute Force Selection sort Brute-Force string matching
All pairs shortest path problem
Dynamic Programming-- Longest Common Subsequence
3. Brute Force Selection sort Brute-Force string matching
Bioinformatics Algorithms and Data Structures
Computational Genomics Lecture #3a
CSE 5290: Algorithms for Bioinformatics Fall 2009
3. Brute Force Selection sort Brute-Force string matching
Presentation transcript:

Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press

How to find the optimal alignment in linear space

Definition: For any α, let α r denote the reverse of a string α. Definition: Given strings S1 and S2, define V r (i,j) as the similarity of the string consisting of the first i characters of, and the string consisting of the first j characters of. Equivalently, V r (i,j) is the similarity of the last i characters of S1 and the last j characters of S2. Lemma:

How to find the optimal alignment in linear space Lemma:

How to find the optimal alignment in linear space Definition: Let k* be a position k that maximizes Definition: Let L n/2 be the subpath of L that starts with the last node of L in row n/2-1 and ends with the first node of L in row n/2+1.

k-difference global alignment

Theorem: There is a global alignment of S1 and S2 with at most k differences if and only if the above algorithm assigns a value of k or less to cell(n,m). Hence the k-difference global alignment problem can be solved in O(km) time and O(km) space.

The return of the suffix tree: k-difference inexact matching Definition: As before, the main diagonal of the n by m dynamic programming table consists if cells(i,i) for 0<=i<=n<=m.The diagonals above the main diagonal are numbered 1 through m; the diagonal starting in cell(0,i) is diagonal i. The diagonals below the main diagonal are numbered -1 through –n; the diagonal starting in cell(i,0) is diagonal i.

Hybrid dynamic programming : the high-level idea

Path R1 consists of the farthest-reaching (d-1) –path on diagonal i+1, followed by a vertical edge (a space in text T) to diagonal i, followed by the maximal extension along diagonal i that corresponds to identical substrings in P and T. Since R1 begins with a (d-1)-path and adds one more space for the vertical edge, R1 is a d- path. Path R1 consists of the farthest-reaching (d-1) –path on diagonal i-1, followed by a horizontal edge (a space in pattern P) to diagonal i, followed by the maximal extension along diagonal i that corresponds to identical substrings in P and T. Path R2 is a d-path. Path R3 consists of the farthest-reaching (d-1) –path on diagonal i, followed by a diagonal edge corresponding to a mismatch between a character of P and a character of T, followed by the maximal extension along diagonal i that corresponds to identical substrings in P and T. Path R3 is a d-path.

Hybrid dynamic programming : the high-level idea

How to solve the k-difference primer problem For any position j, the k-difference primer problem becomes: ▫Find the shortest prefix of string α[j..n] (if it exists) that has edit distance at least k from every substring in β. Solution: ▫Run the k-differences algorithm with string α[j..n] playing the role of P and β playing the role of T. Theorem: If α has length n and β has length m, then the k-differences primer selection problem can be solved in O(knm) total time.

Yet more suffix trees and more hybrid dynamic programming Two problems ▫The P-against-all problem Given strings P and T, compute the edit distance between P and every substring T’ of T. ▫The threshold all-against-all problem Given strings P and T and a threshold d, find every pair of substrings P’ of P and T’ of T such that the edit distance between P’ and T’ is less than d.

The P-against-all problem Basic Idea: we compute edit distance columnwise (instead of in the usual rowwise manner), then we can combine the two edit distance computations for the first n' columns, since the first n, characters of T’, and T’’, are the same. It would be redundant to compute the first n by n, subtable separately for the two edit distances.

The P-against-all problem

An O(C + R)-time methοd The method uses a suffix tree T p for string P and a suffix tree T T for string T. The worst-case time for the method will be shown to be O(C+R), where C is the length of T p times the length of T T independent of whatever the output criteria are, and R is the size οf the output. Definition: The dynamic programming table for a pair of nodes (u, v), from T p and T T, respectively, is defined as the dynamic programming table for the edit distance between the string represented by node u and the string represented by node v.

Longest common subsequence reduces to longest increasing subsequence Definition: Given strings S 1 and S 2 (of length m and n, respectively) over an alphabet Σ, let r(i) be the number of times that the jth character of string S 1 appears in string S 2. Definition: Let r denote the sum

The reduction For each alphabet character x that occurs at least once in S1, create a list of the positions where character x occurs in string S 2 ; write this list in decreasing order. Two distinct alphabet characters will have totally disjoint lists. Now create a list called Π(S 1, S 2 ) of length r, in which each character instance in S 1 is replaced with the associated list for that character. That is, for each position l in S 1, insert the list associated with the character S 1 (i).

The reduction Theorem: The longest Common subsequence problem can be solved in O(rlogn)time.

The Four-Russians speedup-t-blocks Definition: A t-block is a t by t square in the dynamic programming table. The rough idea of the Four-Russians method is to partition the dynamic programming table into t-blocks and compute the essential values in the table one t-block at a time, rather than one cell at a time. Lemma: The distance values in a t-block starting in position (i,j) are a function of the values in its first row and column and the substrings S 1 [i…i+t-1] and S 2 [j..j+t-1]. Definition: Given above lemma and the following figure, we define the block function as the function from the five inputs (A,B,C,D,E) to the output F The values in the last row and column of a r-block are also a function of the inputs (A, Β, C, D, Ε). We call the function from those inputs to the values in the last row and column of a r-block, the restricted block function.

The Four-Russians speedup-t-blocks

Computing edit distance with the restricted block function–Block edit distance algorithm Cover the (n+1) by (n+1) dynamic programming table with t-blocks, where the last column of every t-block is shared with the first column of the t-block to its right (if any), and the last row of every t- block is shared with the first row of the t-block below it (if any). In this way, and since n=k(t-1), the table will consist of k rows and k columns of partially overlapping t-blocks. Initialize the values in the first row and column of the full table according to the base conditions of the recurrence. In a rowwise manner, use the restricted block function to successively determine the values in the last row and last column of each block. By the overlapping nature of the blocks, the values in the last column (or row) of a block are the values in the first column (or row) of the block to its right (or below it) The value in cell (n,n) is the edit distance of S1 and S2.

Computing edit distance with the restricted block function–Block edit distance algorithm

Accounting detail block edit distance algorithm ▫the sizes of the input and the output of the restricted block function are both O(t). There are Θ(n 2 /t 2 ) blocks, hence the total time used by the block edit distance algorithm is o(n 2 /t). Setting t to Θ(log n), the time is O(n 2 /lοgn). However in the unit-cost RAM model of computation, each output value can be retrieved in constant time since t=O(logn). =>the method is reduced to O(n 2 /(logn) 2 ). precomputation time ▫The key issue involves the number of input choices to the restricted block function. ▫Every cell has an integer from zero to n=>(n+1) t possible values for any t-length row or column. ▫If the alphabet has size σ, then there are σ t, possible substrings of length t=>#distinct input combinations to the restricted block function: is (n + 1) 2t σ 2t ▫For each input, it takes Θ(t 2 ) time to evaluate the last row and column of the resulting r-block (by running the standard dynamic program). Thus the overall time used in this way to precompute the function outputs to all possible input choices is Θ((n + 1) 2t σ 2t t 2 ).

The trick: offset encoding Lemma: In any row, column, or diagonal of the dynamic programming table for edit distance, two adjacent cells can have a value that differs by at most one. Definition: The offset vector is a t-length vector of values from {- 1,0,1}, where the first entry must be zero. Theorem: Consider a t-block with upper left corner in position (i,j). The two offset vectors for the last row and last column of the block can be determined from the two offset vectors for the first row and column of the block and from substrings S1[1..i] and S2[1..j]. That is, no D value is needed in the input in order to determine the offset vectors in the last row and column of the block. Definition: The function that determines the two offset vectors for the last row and last column from the two offset vectors for the first row and column of a block together with substrings S1[1..i] and S2[1..j] is called the offset function

Time analysis Theorem: The edit distance of two strings of length n can be computed in O(n 2 /logn) time or O(n 2 /(logn) 2 ) time in the unit-cost RAM model