Inexact Matching of Strings


Inexact Matching of Strings
General Problem
–Input: strings S and T
–Questions
  How distant is S from T?
  How similar is S to T?
Solution Technique
–Dynamic programming with a cost/similarity/scoring matrix

Measuring Distance of S and T
Consider strings S and T. We can transform S into T using the following four operations:
–insertion of a character into S
–deletion of a character from S
–substitution (replacement) of a character in S by another character (typically from T)
–matching (no operation)

Example
S = vintner, T = writers
–vintner
–wintner (Replace v with w)
–wrintner (Insert r)
–writner (Delete first n)
–writer (Delete second n)
–writers (Insert s)

Example
Edit Transcript (or just transcript):
–a string that describes the transformation of one string into the other
Example (R = replace, I = insert, M = match, D = delete)
  R I M D M D M M I
  v   i n t n e r
  w r i   t   e r s

Edit Distance
Edit distance of strings S and T
–The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
–Also called Levenshtein distance; Levenshtein appears to have been the first to define this concept
Optimal transcript
–An edit transcript of S and T that has the minimum number of edit operations
–There may be several optimal transcripts; these are called cooptimal transcripts

Alignment
A global alignment of strings S and T is obtained
–by inserting spaces (dashes) into S and T so that both end up with the same number of characters (including dashes)
–then placing the two strings one above the other, matching each character (or dash) in S with a unique character (or dash) in T
–Note: ALL positions in both S and T are involved

Alignments and Edit Transcripts
Example alignment
  v-intner-
  wri-t-ers
Alignments and edit transcripts are interrelated
–edit transcript: emphasizes process (the specific mutational events)
–alignment: emphasizes product (the relationship between the two strings)
–Alignments are often easier to work with and visualize, and they generalize better to more than 2 strings
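The link between a transcript and an alignment can be sketched in a few lines of Python. The function name `transcript_to_alignment` is a hypothetical helper, not something from the slides; it assumes the usual reading of the transcript letters (R = replace, M = match, I = insert a character of T, D = delete a character of S).

```python
def transcript_to_alignment(s, t, transcript):
    """Build the two alignment rows described by an edit transcript.

    Hypothetical helper: R/M consume one character of each string,
    I consumes a character of t, D consumes a character of s.
    """
    top, bottom = [], []
    i = j = 0  # next unread positions in s and t
    for op in transcript:
        if op in "RM":
            top.append(s[i])
            bottom.append(t[j])
            i += 1
            j += 1
        elif op == "I":  # t has a character that s lacks
            top.append("-")
            bottom.append(t[j])
            j += 1
        elif op == "D":  # s has a character that t lacks
            top.append(s[i])
            bottom.append("-")
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s) or j != len(t):
        raise ValueError("transcript does not consume both strings")
    return "".join(top), "".join(bottom)


top, bottom = transcript_to_alignment("vintner", "writers", "RIMDMDMMI")
print(top)     # v-intner-
print(bottom)  # wri-t-ers
```

Reading the alignment column by column gives the transcript back, which is why the two views carry the same information.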

Edit Distance Problem
Input
–2 strings S and T
Task
–Output edit distance of S and T
–Output optimal edit transcript
–Output optimal alignment
Solution method
–Dynamic programming

Definition of D(i,j)
Let D(i,j) be the edit distance of S[1..i] and T[1..j]
–The edit distance of the first i characters of S and the first j characters of T
–Let |S| = n, |T| = m
D(n,m) = edit distance of S and T
We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m

Recurrence Relation
Base Case:
–For 0 <= i <= n, D(i,0) = i
–For 0 <= j <= m, D(0,j) = j
Recursive Case:
–0 < i <= n, 0 < j <= m
–D(i,j) = min of
  D(i-1,j) + 1 (what does this mean?)
  D(i,j-1) + 1 (what does this mean?)
  D(i-1,j-1) + d(i,j) (what does this mean?)
–d(i,j) = 0 if S(i) = T(j) and 1 otherwise

What the various cases mean
D(i,j) = min of
–D(i-1,j) + 1
  Align S[1..i-1] with T[1..j] optimally
  Match S(i) with a dash in T
–D(i,j-1) + 1
  Align S[1..i] with T[1..j-1] optimally
  Match a dash in S with T(j)
–D(i-1,j-1) + d(i,j)
  Align S[1..i-1] with T[1..j-1] optimally
  Match S(i) with T(j)
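The recurrence and its three cases translate almost directly into code. The following is a minimal Python sketch, assuming unit costs as in the recurrence above; the name `edit_distance_table` is illustrative, and Python's 0-based indexing means S(i) is `s[i - 1]`.

```python
def edit_distance_table(s, t):
    """Return the full (n+1) x (m+1) table D for strings s and t."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases: transforming a prefix to or from the empty string.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    # Recursive case, filled row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # S(i) aligned with a dash in T
                D[i][j - 1] + 1,      # a dash in S aligned with T(j)
                D[i - 1][j - 1] + d,  # S(i) aligned with T(j)
            )
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5
```

The value in the bottom-right cell is the edit distance of the two full strings.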

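Computing D(i,j) values
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
       0
   v   1
   i   2
   n   3
   t   4
   n   5
   e   6
   r   7

Initialization: Base Case
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
       0    0  1  2  3  4  5  6  7
   v   1    1
   i   2    2
   n   3    3
   t   4    4
   n   5    5
   e   6    6
   r   7    7

Row i=1
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
       0    0  1  2  3  4  5  6  7
   v   1    1  1  2  3  4  5  6  7
   i   2    2
   n   3    3
   t   4    4
   n   5    5
   e   6    6
   r   7    7

Entry i=2, j=2
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
       0    0  1  2  3  4  5  6  7
   v   1    1  1  2  3  4  5  6  7
   i   2    2  2  ?
   n   3    3
   t   4    4
   n   5    5
   e   6    6
   r   7    7

Entry i=2, j=3
D(i,j)         w  r  i  t  e  r  s
            0  1  2  3  4  5  6  7
       0    0  1  2  3  4  5  6  7
   v   1    1  1  2  3  4  5  6  7
   i   2    2  2  2  ?
   n   3    3
   t   4    4
   n   5    5
   e   6    6
   r   7    7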

Calculation methodologies
Location of edit distance
–D(n,m)
The example calculated the table row by row
–Can also calculate column by column
–Can also use antidiagonals
The key is to build from the upper left corner
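To illustrate that the fill order is flexible, here is a hedged Python sketch that fills the table antidiagonal by antidiagonal (all cells with i + j = k, for increasing k). The function name is an assumption for illustration. Each cell only needs cells from the two previous antidiagonals, so the result matches the row-by-row computation.

```python
def edit_distance_by_antidiagonals(s, t):
    """Unit-cost edit distance, filling cells in order of i + j."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(n + m + 1):  # antidiagonal i + j = k
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            if i == 0:
                D[i][j] = j  # base case, first row
            elif j == 0:
                D[i][j] = i  # base case, first column
            else:
                d = 0 if s[i - 1] == t[j - 1] else 1
                D[i][j] = min(
                    D[i - 1][j] + 1,      # cell on antidiagonal k - 1
                    D[i][j - 1] + 1,      # cell on antidiagonal k - 1
                    D[i - 1][j - 1] + d,  # cell on antidiagonal k - 2
                )
    return D[n][m]


print(edit_distance_by_antidiagonals("vintner", "writers"))  # 5
```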

Traceback
Using the table to construct an optimal transcript
Pointers in cell D(i,j)
–Set a pointer from cell (i,j) to
  cell (i,j-1) if D(i,j) = D(i,j-1) + 1
  cell (i-1,j) if D(i,j) = D(i-1,j) + 1
  cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
–Follow a path of pointers from (n,m) back to (0,0)

What the pointers mean
horizontal pointer: cell (i,j) to cell (i,j-1)
–Align T(j) with a space in S
–Insert T(j) into S
vertical pointer: cell (i,j) to cell (i-1,j)
–Align S(i) with a space in T
–Delete S(i) from S
diagonal pointer: cell (i,j) to cell (i-1,j-1)
–Align S(i) with T(j)
–Replace S(i) with T(j), or match if S(i) = T(j)

Table and transcripts
The pointers represent all optimal transcripts
Theorem:
–Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
–Conversely, any optimal transcript is specified by such a path.
–The correspondence between paths and transcripts is one to one.
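A traceback can be sketched without storing explicit pointers, by re-checking which of the three conditions holds in each cell. The Python below is one possible sketch, not the slides' own code; `optimal_transcript` is a hypothetical name, and when several pointers exist it arbitrarily prefers the diagonal one, so it returns just one of the cooptimal transcripts.

```python
def optimal_transcript(s, t):
    """Return (edit distance, one optimal transcript) for s and t."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and s[i - 1] == t[j - 1]
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
            ops.append("M" if same else "R")  # diagonal pointer
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                   # vertical pointer
            i -= 1
        else:
            ops.append("I")                   # horizontal pointer
            j -= 1
    return D[n][m], "".join(reversed(ops))


distance, transcript = optimal_transcript("vintner", "writers")
print(distance, transcript)
```

The number of non-M letters in the returned transcript equals the edit distance, and the walk visits at most n + m cells.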

Running Time
Initialization of table
–O(n+m)
Calculating table and pointers
–O(nm)
Traceback for one optimal transcript or optimal alignment
–O(n+m)

Operation-Weight Edit Distance
Consider strings S and T. We can assign weights to the various operations:
–insertion/deletion of a character: cost d
–substitution (replacement) of a character: cost r
–matching: cost e
–Previous case: d = r = 1, e = 0

Modified Recurrence Relation
Base Case:
–For 0 <= i <= n, D(i,0) = i * d
–For 0 <= j <= m, D(0,j) = j * d
Recursive Case:
–0 < i <= n, 0 < j <= m
–D(i,j) = min of
  D(i-1,j) + d
  D(i,j-1) + d
  D(i-1,j-1) + d(i,j)
–d(i,j) = e if S(i) = T(j) and r otherwise (d(i,j) is the match/substitution term, distinct from the insertion/deletion cost d)

Alphabet-Weight Edit Distance
Define the weight of each possible substitution
–r(a,b), where a is being replaced by b, for all a,b in the alphabet
–For example, with DNA, maybe r(A,T) > r(A,G)
–Likewise, the insertion/deletion weight I(a) may vary by character
Operation-weight edit distance is a special case of this variation
Weighted edit distance refers to this alphabet-weight setting

Modified Recurrence Relation
Base Case:
–For 0 <= i <= n, D(i,0) = sum_{k=1..i} I(S(k))
–For 0 <= j <= m, D(0,j) = sum_{k=1..j} I(T(k))
Recursive Case:
–0 < i <= n, 0 < j <= m
–D(i,j) = min of
  D(i-1,j) + I(S(i))
  D(i,j-1) + I(T(j))
  D(i-1,j-1) + d(i,j)
–d(i,j) = r(S(i), T(j))
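As a sketch of how the weighted recurrence might be coded, the Python below takes the weights as two functions: `indel(a)` plays the role of I(a) and `sub(a, b)` the role of r(a,b), with `sub(a, a)` acting as the match cost. Both names, and the helper `operation_weights`, are assumptions made for illustration. Operation-weight edit distance falls out as the special case where the functions ignore which characters they are given.

```python
def weighted_edit_distance(s, t, indel, sub):
    """Alphabet-weight edit distance; indel(a) ~ I(a), sub(a, b) ~ r(a, b)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Base cases are running sums of insertion/deletion weights.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel(s[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel(t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + indel(s[i - 1]),
                D[i][j - 1] + indel(t[j - 1]),
                D[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
            )
    return D[n][m]


def operation_weights(d, r, e):
    """Operation-weight special case: costs do not depend on the characters."""
    return (lambda a: d), (lambda a, b: e if a == b else r)


indel, sub = operation_weights(d=1, r=1, e=0)
print(weighted_edit_distance("vintner", "writers", indel, sub))  # 5

indel, sub = operation_weights(d=1, r=2, e=0)
print(weighted_edit_distance("vintner", "writers", indel, sub))  # 6
```

With r = 2d a substitution costs the same as a deletion plus an insertion, which is why the second result is larger than the unit-cost distance.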

Measuring Similarity of S and T
Definitions
–Let Σ be the alphabet for strings S and T
–Let Σ' be the alphabet Σ with the character - added
–For any two characters x,y in Σ', s(x,y) denotes the value (or score) obtained by aligning x with y
–For a given alignment A of S and T, let S' and T' denote the strings after the chosen insertion of spaces, and l their new length
–The value of alignment A is sum_{i=1..l} s(S'(i),T'(i))

Example
Scoring matrix s
  s    a   b   -
  a    1  -2   0
  b        2
  -            0
Alignment
  a b a a - b a b
  a a a a a b - b
Value: 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
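The value of an alignment is just a sum over its columns, which the short Python sketch below makes explicit. The dictionary `SCORES` holds only the scores that can be read from the example (s(b,-) is not needed for this alignment and is left out); the names `s` and `alignment_value` are chosen for illustration, and the matrix is assumed to be symmetric.

```python
# Scores readable from the example; assumed symmetric, s(x, y) = s(y, x).
SCORES = {
    ("a", "a"): 1,
    ("a", "b"): -2,
    ("a", "-"): 0,
    ("b", "b"): 2,
    ("-", "-"): 0,
}


def s(x, y):
    """Look up the score of aligning x with y in either order."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_value(top, bottom):
    """Sum s over the columns of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have the same length")
    return sum(s(x, y) for x, y in zip(top, bottom))


print(alignment_value("abaa-bab", "aaaaab-b"))  # 5
```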

String Similarity Problem
Input
–2 strings S and T
–Scoring matrix s for alphabet Σ'
Task
–Output the optimal alignment value of S and T
  The alignment of S and T with maximal, not minimal, value
–Output this alignment

Modified Recurrence Relation
Base Case:
–For 0 <= i <= n, V(i,0) = sum_{k=1..i} s(S(k),-)
–For 0 <= j <= m, V(0,j) = sum_{k=1..j} s(-,T(k))
Recursive Case:
–0 < i <= n, 0 < j <= m
–V(i,j) = max of
  V(i-1,j) + s(S(i),-)
  V(i,j-1) + s(-,T(j))
  V(i-1,j-1) + s(S(i), T(j))
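The similarity recurrence can be sketched in Python in the same shape as the distance version, with max in place of min. The scoring function `simple_score` (+1 for a match, -1 for a mismatch or a space) is an assumption chosen only to make the example concrete; any function over the extended alphabet could be passed instead.

```python
def similarity(s, t, score):
    """Optimal global alignment value V(n, m) under the scoring function score."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j] + score(s[i - 1], "-"),
                V[i][j - 1] + score("-", t[j - 1]),
                V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
            )
    return V[n][m]


def simple_score(x, y):
    """Hypothetical scoring: +1 match, -1 mismatch, -1 for a space."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


print(similarity("abc", "abd", simple_score))  # 1
```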

Longest Common Subsequence Problem
Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.
The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
–subsequence: characters need not be contiguous
–different from a substring
Can you use dynamic programming to solve the longest common subsequence problem?
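One way to answer the question is to treat lcs as a similarity problem in which a match scores 1 and everything else scores 0. The Python sketch below uses that idea in its usual simplified form and then traces back to recover one lcs; the function name is illustrative, and ties in the traceback are broken arbitrarily.

```python
def longest_common_subsequence(s, t):
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # L[i][j] = length of an lcs of s[:i] and t[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Trace back from (n, m), collecting matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("vintner", "writers"))  # iter
```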

Computing alignments using linear space
Hirschberg [1977]
Suppose we only need the optimal similarity or distance value of S and T, without an alignment or transcript
How can we conserve space?
–Only save row i-1 when computing row i of the table
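The two-row idea can be sketched as follows for unit-cost edit distance. This is an illustrative Python version under that assumption, with a hypothetical function name; it keeps only the previous row and the row being filled, so the space used is proportional to the length of T.

```python
def edit_distance_two_rows(s, t):
    """Unit-cost edit distance using two rows instead of the full table."""
    prev = list(range(len(t) + 1))  # row 0: D(0, j) = j
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)    # D(i, 0) = i
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + d)
        prev = cur                  # row i-1 is no longer needed
    return prev[len(t)]


print(edit_distance_two_rows("vintner", "writers"))  # 5
```

The price is that the pointers are gone, so this version alone cannot produce a transcript or an alignment.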

Illustration
[Figure: table with indices 0, 1, 2, 3, 4, …, n and …, m]

Linear space and an alignment
Assume S has length 2n; S^r and T^r denote the reverses of S and T
Divide and conquer approach
–Compute the value of an optimal alignment of S[1..n] with all prefixes of T
  Store only row n at the end, along with the pointer values of row n
–Compute the value of an optimal alignment of S^r[1..n] with all prefixes of T^r
  Store only the values in row n
–Find k such that V(S[1..n],T[1..k]) + V(S^r[1..n],T^r[1..m-k]) is maximized over 0 <= k <= m

Illustration (n = 6, m = 18)
–k = 0, m-k = 18: V(S[1..6], T[1..0]) and V(S^r[1..6], T^r[1..18])
–k = 1, m-k = 17: V(S[1..6], T[1..1]) and V(S^r[1..6], T^r[1..17])
–k = 2, m-k = 16: V(S[1..6], T[1..2]) and V(S^r[1..6], T^r[1..16])
–k = 9, m-k = 9: V(S[1..6], T[1..9]) and V(S^r[1..6], T^r[1..9])
–k = 18, m-k = 0: V(S[1..6], T[1..18]) and V(S^r[1..6], T^r[1..0])

Recursive Step
Let k* be the k that maximizes
–V(S[1..n],T[1..k]) + V(S^r[1..n],T^r[1..m-k])
Record all steps on row n, including the one from row n-1 and the one to row n+1
Recurse on the two subproblems
–S[1..n-1] with T[1..j] where j <= k*
–S^r[1..n] with T^r[1..q] where q <= m-k*
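The divide and conquer scheme can be sketched in Python as below. This follows the common textbook variant of Hirschberg's method (split S in the middle, pick the best split point k* of T, recurse on both halves) rather than the exact bookkeeping of row n described above. The scoring function `score` (+1 match, -1 mismatch, -1 space) and all function names are assumptions for illustration.

```python
def score(x, y):
    """Hypothetical scoring: +1 match, -1 mismatch, -1 for a space."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def last_row(s, t):
    """Values V(len(s), j) for all j, computed with two rows only."""
    prev = [0] * (len(t) + 1)
    for j in range(1, len(t) + 1):
        prev[j] = prev[j - 1] + score("-", t[j - 1])
    for x in s:
        cur = [prev[0] + score(x, "-")]
        for j in range(1, len(t) + 1):
            cur.append(max(prev[j] + score(x, "-"),
                           cur[j - 1] + score("-", t[j - 1]),
                           prev[j - 1] + score(x, t[j - 1])))
        prev = cur
    return prev


def hirschberg(s, t):
    """Return an optimal global alignment of s and t as two rows."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        gaps = sum(score("-", c) for c in t)
        # Option 1: the single character of s faces a space.
        best_value, best_j = score(s, "-") + gaps, None
        # Option 2: it faces t[j]; every other character of t faces a space.
        for j, c in enumerate(t):
            value = score(s, c) + gaps - score("-", c)
            if value > best_value:
                best_value, best_j = value, j
        if best_j is None:
            return s + "-" * len(t), "-" + t
        return "-" * best_j + s + "-" * (len(t) - best_j - 1), t
    mid = len(s) // 2
    forward = last_row(s[:mid], t)                # prefixes of t
    backward = last_row(s[mid:][::-1], t[::-1])   # reversed strings
    m = len(t)
    k_star = max(range(m + 1), key=lambda k: forward[k] + backward[m - k])
    top_left, bottom_left = hirschberg(s[:mid], t[:k_star])
    top_right, bottom_right = hirschberg(s[mid:], t[k_star:])
    return top_left + top_right, bottom_left + bottom_right


top, bottom = hirschberg("vintner", "writers")
print(top)
print(bottom)
```

The alignment it returns has the same value as the last entry of `last_row(s, t)`, which is the optimal value computed by the full recurrence.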


Time Required
cmn time to get this answer so far
The two subproblems together have at most half the total size of this problem
–At most the same cmn time to get the rest of the solution
  cmn/2 + cmn/4 + cmn/8 + cmn/16 + … <= cmn
Final result
–Linear space with only twice as much time
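Written out as a worked bound, and reading cmn as the cost of the top-level pass (an assumption about the notation above), the argument is a geometric series: each level of recursion works on tables whose total size is at most half that of the level before.

```latex
T \;\le\; cmn + \frac{cmn}{2} + \frac{cmn}{4} + \frac{cmn}{8} + \cdots
  \;=\; cmn \sum_{i \ge 0} 2^{-i}
  \;=\; 2\,cmn
```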