Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.

Presentation transcript:

Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming with cost/similarity/scoring matrix

Biological Motivation Read pages in textbook “First Fact of Biological Sequence Analysis” –In biomolecular sequences (DNA, RNA, amino acid sequences), high sequence similarity usually implies significant functional or structural similarity –sequence similarity implies functional/structural similarity –Converse is NOT true –Evolution reuses, builds upon, duplicates, and modifies “successful” structures

Measuring Distance of S and T Consider S and T We can transform S into T using the following four operations –insertion of a character into S –deletion of a character from S –substitution (replacement) of a character in S by another character (typically in T) –matching (no operation)

Example S = vintner, T = writers
vintner
wintner (Replace v with w)
wrintner (Insert r)
writner (Delete first n)
writer (Delete second n)
writers (Insert s)

Example Edit Transcript (or just transcript): –a string that describes the transformation of one string into the other, using R (replace), I (insert), M (match), D (delete) Example
R I M D M D M M I
v   i n t n e r
w r i   t   e r s

Edit Distance Edit distance of strings S and T –The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T –Also called Levenshtein distance [299]; Levenshtein appears to have been the first to define this concept Optimal transcript –An edit transcript of S and T that has the minimum number of edit operations –Co-optimal transcripts: there may be more than one optimal transcript

Alignment A global alignment of strings S and T is obtained –by inserting spaces (dashes) into S and T so that they have the same number of characters (including dashes) at the end –then placing the two strings one over the other, matching each character (or dash) in S with a unique character (or dash) in T –Note ALL positions in both S and T are involved –Later, we will consider local alignments

Alignments and Edit transcripts Example Alignment
v-intner-
wri-t-ers
Alignments and edit transcripts are interrelated –edit transcript: emphasizes process, the specific mutational events –alignment: emphasizes product, the relationship between the two strings –Alignments are often easier to work with and visualize, and also generalize better to more than 2 strings

Edit Distance Problem Input –2 strings S and T Task –Output edit distance of S and T –Output optimal edit transcript –Output optimal alignment Solution method –Dynamic Programming

Definition of D(i,j) Let D(i,j) be the edit distance of S[1..i] and T[1..j] –The edit distance of the first i characters of S with the first j characters of T –Let |S| = n, |T| = m D(n,m) = edit distance of S and T We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m

Recurrence Relation Base Case: –For 0 <= i <= n, D(i,0) = i –For 0 <= j <= m, D(0,j) = j Recursive Case: –0 < i <= n, 0 < j <= m –D(i,j) = min{ D(i-1,j) + 1, D(i,j-1) + 1, D(i-1,j-1) + d(i,j) } –d(i,j) = 0 if S(i) = T(j) and is 1 otherwise

What the various cases mean D(i,j) = min of –D(i-1,j) + 1: align S[1..i-1] with T[1..j] optimally, then match S(i) with a dash in T –D(i,j-1) + 1: align S[1..i] with T[1..j-1] optimally, then match a dash in S with T(j) –D(i-1,j-1) + d(i,j): align S[1..i-1] with T[1..j-1] optimally, then match S(i) with T(j)
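The recurrence and its three cases can be sketched in Python as follows. This is a minimal illustration, not the presenter's code; the function name `edit_distance_table` is an assumption, and Python's 0-based strings mean S(i) is `s[i - 1]`.

```python
def edit_distance_table(s: str, t: str) -> list[list[int]]:
    """Fill the (n+1) x (m+1) table D row by row using the recurrence."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # base case: delete i characters of S
    for j in range(m + 1):
        D[0][j] = j                      # base case: insert j characters of T
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1      # d(i,j)
            D[i][j] = min(D[i - 1][j] + 1,            # S(i) against a dash
                          D[i][j - 1] + 1,            # dash against T(j)
                          D[i - 1][j - 1] + d)        # S(i) against T(j)
    return D


table = edit_distance_table("vintner", "writers")
print(table[7][7])  # 5, the number of edits in the vintner/writers example
```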

Computing D(i,j) values
D(i,j)          w  r  i  t  e  r  s
        j:   0  1  2  3  4  5  6  7
   i: 0
v     1
i     2
n     3
t     4
n     5
e     6
r     7

Initialization: Base Case
D(i,j)          w  r  i  t  e  r  s
        j:   0  1  2  3  4  5  6  7
   i: 0      0  1  2  3  4  5  6  7
v     1      1
i     2      2
n     3      3
t     4      4
n     5      5
e     6      6
r     7      7

Row i=1
D(i,j)          w  r  i  t  e  r  s
        j:   0  1  2  3  4  5  6  7
   i: 0      0  1  2  3  4  5  6  7
v     1      1  1  2  3  4  5  6  7
i     2      2
n     3      3
t     4      4
n     5      5
e     6      6
r     7      7

Entry i=2, j=3
D(i,j)          w  r  i  t  e  r  s
        j:   0  1  2  3  4  5  6  7
   i: 0      0  1  2  3  4  5  6  7
v     1      1  1  2  3  4  5  6  7
i     2      2  2  2  ?
n     3      3
t     4      4
n     5      5
e     6      6
r     7      7

Calculation methodologies Location of edit distance –D(n,m) Example was to calculate row by row Can also calculate column by column Can also use antidiagonals Key is to build from upper left corner
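As a sketch of the antidiagonal order mentioned above: every cell on antidiagonal k = i + j depends only on cells from antidiagonals k-1 and k-2, so the table can be filled one antidiagonal at a time. The function name is an assumption made for this illustration.

```python
def edit_distance_antidiagonal(s: str, t: str) -> int:
    """Compute D(n,m), visiting cells antidiagonal by antidiagonal."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for k in range(n + m + 1):                       # k = i + j
        for i in range(max(0, k - m), min(n, k) + 1):
            j = k - i
            if i == 0:
                D[i][j] = j                          # base case, row 0
            elif j == 0:
                D[i][j] = i                          # base case, column 0
            else:
                d = 0 if s[i - 1] == t[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,
                              D[i][j - 1] + 1,
                              D[i - 1][j - 1] + d)
    return D[n][m]


print(edit_distance_antidiagonal("vintner", "writers"))  # 5
```

Cells on the same antidiagonal do not depend on each other, which is why this order is the usual starting point for parallel implementations.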

Traceback Using table to construct optimal transcript Pointers in cell D(i,j) –Set a pointer from cell (i,j) to cell (i, j-1) if D(i,j) = D(i, j-1) + 1 cell (i-1,j) if D(i,j) = D(i-1,j) + 1 cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j) –Follow path of pointers from (n,m) back to (0,0) Example: Figure 11.3 on page 222

What the pointers mean horizontal pointer: cell (i,j) to cell (i, j-1) –Align T(j) with a space in S –Insert T(j) into S vertical pointer: cell (i,j) to cell (i-1, j) –Align S(i) with a space in T –Delete S(i) from S diagonal pointer: cell (i,j) to cell (i-1, j-1) –Align S(i) with T(j) –Replace S(i) with T(j), or match if S(i) = T(j)
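The traceback can be sketched without storing pointers, by re-testing the three conditions at each cell. This is one possible illustration; the function name and the tie-breaking order (diagonal, then vertical, then horizontal) are assumptions, so it returns just one of the possibly many co-optimal transcripts.

```python
def optimal_transcript(s: str, t: str) -> str:
    """Return one optimal edit transcript over the letters M, R, I, D."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and s[i - 1] == t[j - 1]
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
            ops.append("M" if same else "R")     # diagonal pointer
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                      # vertical pointer: delete S(i)
            i -= 1
        else:
            ops.append("I")                      # horizontal pointer: insert T(j)
            j -= 1
    return "".join(reversed(ops))                # path was walked from (n,m) to (0,0)


print(optimal_transcript("vintner", "writers"))
```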

Table and transcripts The pointers represent all optimal transcripts Theorem: –Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript. –Conversely, any optimal transcript is specified by such a path. –The correspondence between paths and transcripts is one to one.

Running Time Initialization of table –O(n+m) Calculating table and pointers –O(nm) Traceback for one optimal transcript or optimal alignment –O(n+m)

Operation-Weight Edit Distance Consider S and T We can assign weights to the various operations –insertion/deletion of a character: cost d –substitution (replacement) of a character: cost r –matching: cost e –Previous case: d = r = 1, e = 0

Modified Recurrence Relation Base Case: –For 0 <= i <= n, D(i,0) = i*d –For 0 <= j <= m, D(0,j) = j*d Recursive Case: –0 < i <= n, 0 < j <= m –D(i,j) = min{ D(i-1,j) + d, D(i,j-1) + d, D(i-1,j-1) + d(i,j) } –d(i,j) = e if S(i) = T(j) and is r otherwise (the two-argument d(i,j) is the match/substitution cost, not the insertion/deletion cost d)

Alphabet-Weight Edit Distance Define weight of each possible substitution –r(a,b) where a is being replaced by b for all a,b in the alphabet –For example, with DNA, maybe r(A,T) > r(A,G) –Likewise, the insertion/deletion cost I(a) may vary by character a Operation-weight edit distance is a special case of this variation Weighted edit distance refers to this alphabet-weight setting

Modified Recurrence Relation Base Case: –For 0 <= i <= n, D(i,0) = Σ_{1 <= k <= i} I(S(k)) –For 0 <= j <= m, D(0,j) = Σ_{1 <= k <= j} I(T(k)) Recursive Case: –0 < i <= n, 0 < j <= m –D(i,j) = min{ D(i-1,j) + I(S(i)), D(i,j-1) + I(T(j)), D(i-1,j-1) + d(i,j) } –d(i,j) = r(S(i), T(j))
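A sketch of the alphabet-weight recurrence, with the weights passed in as functions. The names `sub_cost` (for r) and `indel_cost` (for I), and the DNA weights in the usage example, are illustrative assumptions; the slides only suggest that r(A,T) might exceed r(A,G).

```python
def weighted_edit_distance(s, t, sub_cost, indel_cost):
    """Alphabet-weight edit distance D(n,m).

    sub_cost(a, b) plays the role of r(a,b) and should return the match
    cost when a == b; indel_cost(a) plays the role of I(a).
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel_cost(s[i - 1])      # prefix sum of I(S(k))
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel_cost(t[j - 1])      # prefix sum of I(T(k))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + indel_cost(s[i - 1]),
                          D[i][j - 1] + indel_cost(t[j - 1]),
                          D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return D[n][m]


# Hypothetical DNA weights: transitions (A<->G, C<->T) cheaper than transversions.
def dna_sub_cost(a, b):
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 3

print(weighted_edit_distance("A", "G", dna_sub_cost, lambda a: 2))  # 1
print(weighted_edit_distance("A", "T", dna_sub_cost, lambda a: 2))  # 3
```

Passing a substitution cost that is 0 for equal characters and 1 otherwise, together with an indel cost of 1, recovers plain edit distance; constant d, r, e give the operation-weight case.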

Measuring Similarity of S and T Definitions –Let Σ be the alphabet for strings S and T –Let Σ' be the alphabet Σ with the character - (space) added –For any two characters x,y in Σ', s(x,y) denotes the value (or score) obtained by aligning x with y –For a given alignment A of S and T, let S' and T' denote the strings after the chosen insertion of spaces and l their new length –The value of alignment A is Σ_{1 <= i <= l} s(S'(i),T'(i))

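Example
Alignment, with the score of each column underneath:
a   b   a   a   -   b   a   b
a   a   a   a   a   b   -   b
1  -2   1   1   0   2   0   2    = 5
Scoring matrix s (symmetric; upper triangle shown):
s     a    b    -
a     1   -2    0
b          2
-               0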
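The value of the example alignment can be checked with a few lines of Python. This sketch assumes the scoring matrix is symmetric with s(a,a) = 1, s(a,b) = -2, s(a,-) = 0, s(b,b) = 2 and s(-,-) = 0, which is how the garbled matrix on the slide is read here; the entry s(b,-) is not legible in the transcript and is left out because the example never uses it.

```python
# Scores read from the slide (assumed symmetric); s(b,-) is not recoverable.
SCORES = {("a", "a"): 1, ("a", "b"): -2, ("a", "-"): 0,
          ("b", "b"): 2, ("-", "-"): 0}


def s(x, y):
    """Look up s(x,y), trying both orders because the matrix is symmetric."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_value(s_row, t_row, score):
    """Sum score(S'(i), T'(i)) over all l columns of the alignment."""
    if len(s_row) != len(t_row):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(score(x, y) for x, y in zip(s_row, t_row))


print(alignment_value("abaa-bab", "aaaaab-b", s))  # 5
```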

String Similarity Problem Input –2 strings S and T –Scoring matrix s for alphabet Σ' Task –Output optimal alignment value of S and T: the value of the alignment of S and T with maximal, not minimal, value –Output this alignment

Modified Recurrence Relation Base Case: –For 0 <= i <= n, V(i,0) = Σ_{1 <= k <= i} s(S(k),-) –For 0 <= j <= m, V(0,j) = Σ_{1 <= k <= j} s(-,T(k)) Recursive Case: –0 < i <= n, 0 < j <= m –V(i,j) = max{ V(i-1,j) + s(S(i),-), V(i,j-1) + s(-,T(j)), V(i-1,j-1) + s(S(i), T(j)) }
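A sketch of the similarity recurrence, assuming the scoring matrix is supplied as a function `score(x, y)` in which the string "-" stands for a space. Both the function name and that calling convention are assumptions made for illustration.

```python
def global_alignment_value(s, t, score):
    """Optimal global alignment value V(n,m) under scoring function score."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")       # S prefix against spaces
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])       # spaces against T prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]),
                          V[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    return V[n][m]


# Scoring every match 0 and every mismatch or space -1 turns the
# maximization into a negated edit distance.
def negated_unit_cost(x, y):
    return 0 if x == y else -1

print(global_alignment_value("vintner", "writers", negated_unit_cost))  # -5
```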

Longest Common Subsequence Problem Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T. The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T –subsequence: characters need not be contiguous –different from substring O(nm) solution: –Make the scoring matrix 1 for a match, 0 for a mismatch and 0 for a space –The matched characters in an alignment of maximal value form a longest common subsequence
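With matches scoring 1 and everything else scoring 0, the similarity table reduces to the familiar lcs table. The following sketch (function name assumed) fills it and then reads the matched characters off a traceback.

```python
def longest_common_subsequence(s: str, t: str) -> str:
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]            # spaces score 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0     # mismatch scores 0
            V[i][j] = max(V[i - 1][j], V[i][j - 1], V[i - 1][j - 1] + match)

    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1] and V[i][j] == V[i - 1][j - 1] + 1:
            out.append(s[i - 1])                         # a matched column
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(longest_common_subsequence("vintner", "writers"))  # iter
```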

Similarity and Distance If we are focused on aligning both entire strings, maximizing similarity is essentially identical to minimizing distance –Just need to modify scoring matrices appropriately When we consider substrings of uncertain length, maximizing similarity often makes more sense than minimizing distance –Overlapping strings –Local alignment

Overlapping Strings Find best alignment where the two strings overlap without penalizing for the unmatched ends –Application: sequence assembly problem, where strings are likely to overlap without being substrings of each other Solution method –End-space free variant of dynamic programming –Change base conditions so that V(i,0) = V(0,j) = 0 –Need to search over row n and column m for optimal value; the optimal value may not be in entry (n,m) Why is max similarity better than min distance?
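The end-space free variant changes only the base conditions and where the answer is read. A minimal sketch, with the function name and the scoring function in the usage example assumed for illustration ("-" denotes a space):

```python
def end_space_free_value(s, t, score):
    """Best alignment value when spaces at the ends of either string are free."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]            # V(i,0) = V(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]),
                          V[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
    best_in_last_row = max(V[n])                         # row n, any j
    best_in_last_col = max(V[i][m] for i in range(n + 1))  # column m, any i
    return max(best_in_last_row, best_in_last_col)


def simple_score(x, y):
    return 1 if x == y else -1     # hypothetical: match 1, mismatch/space -1

# The suffix "abc" of the first string overlaps the prefix "abc" of the second.
print(end_space_free_value("xxxabc", "abcyyy", simple_score))  # 3
```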

Maximally Similar Substrings Local alignment problem –Input Two strings S and T –Task Find substrings s and t of S and T that have the maximum possible alignment value as well as this value. Let v* denote this value. Why is max similarity better than min distance? Read pages for motivation

Local suffix alignments Define v(i,j) to be the value of the optimal alignment of any of the i+1 suffixes of S[1..i] with any of the j+1 suffixes of T[1..j]. –We bound v(i,j) to be at least 0 by scoring the alignment of two empty suffixes to be 0 Theorem –v* (the value of the optimal local alignment) = max{ v(i,j) | 1 <= i <= n, 1 <= j <= m}

Recurrences for local suffix alignments Base Case: –For 0 <= i <= n, v(i,0) = 0 –For 0 <= j <= m, v(0,j) = 0 Recursive Case: –0 < i <= n, 0 < j <= m –v(i,j) = max{ 0, v(i-1,j) + s(S(i),-), v(i,j-1) + s(-,T(j)), v(i-1,j-1) + s(S(i), T(j)) }
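A sketch of the local suffix alignment recurrence, assuming zero base cases and a scoring function `score(x, y)` with "-" as the space character. It returns v* together with the two substrings found by tracing back from the maximum cell until a 0 is reached; the function name and the scores in the usage example are assumptions.

```python
def local_alignment(s, t, score):
    """Return (v*, substring of s, substring of t) for one optimal local alignment."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]            # v(i,0) = v(0,j) = 0
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(0,
                          v[i - 1][j] + score(s[i - 1], "-"),
                          v[i][j - 1] + score("-", t[j - 1]),
                          v[i - 1][j - 1] + score(s[i - 1], t[j - 1]))
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)

    end_i, end_j = best_cell
    i, j = best_cell
    while v[i][j] > 0:                                   # stop at the first 0
        if v[i][j] == v[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + score(s[i - 1], "-"):
            i -= 1
        else:
            j -= 1
    return best, s[i:end_i], t[j:end_j]


def sw_score(x, y):
    return 2 if x == y else -1     # hypothetical: match 2, mismatch/space -1

print(local_alignment("xxabcxx", "yyabcyy", sw_score))  # (6, 'abc', 'abc')
```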

Comments Traceback –No longer start from cell (n,m) –Search whole table for max value and start from there –Still O(mn) running time Terminology –In the literature, the distinction between problem statements and solution methods is not clear –Global alignment is often referred to as Needleman-Wunsch alignment; their solution method was cubic in terms of m,n –Smith-Waterman is often used to refer to both local alignments and their solution method

Comments continued Scoring schemes –The utility of optimal local alignments is highly dependent on the scoring scheme –Examples: matches 1, mismatches & spaces 0 leads to longest common subsequence; mismatches and spaces big negatives leads to longest common substring –Average score in matrix must be negative, otherwise local alignments tend to be global –There is a theory developed about scoring schemes that we will cover later.

Aligning with Gaps Gaps: Any maximal run of spaces in a single string of a given alignment Example –S = aaabbbcccdddeeefff –T = aaabbbdddeeefffggg –Alignment
aaabbbcccdddeeefff---
aaabbb---dddeeefffggg

Scoring with gaps Example Scoring –aaabbbcccdddeeefff--- –aaabbb---dddeeefffggg – = 13 Why include gaps in scoring schemes? –Read –When an insertion/deletion event occurs, often more than a single character is inserted or deleted. –A single gap cost helps model the fact that a sequence of insertions/deletions is really one mutational event

Constant gap weight model We present a series of possible gap weight models, each of which is a special case of the next one Constant gap weight model –each individual space is free (W_s = 0) –each gap has constant cost W_g –Alignment problem boils down to finding an alignment that maximizes Match scores - mismatch scores - W_g (# of gaps) –Dynamic programming can still solve in O(nm) time

Affine gap weight model Gap opening versus gap extension penalties –each gap has constant cost W_g –each individual space has cost W_s, typically with W_s < W_g Alignment problem boils down to finding an alignment that maximizes Match scores - mismatch scores - W_g (# of gaps) - W_s (# of spaces) Dynamic programming can still solve in O(nm) time Probably most commonly used model because of efficiency and generality of model
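To make the objective concrete, the sketch below scores a given alignment under the affine model by counting matches, mismatches, gaps and spaces. The function names and the weights in the usage example are assumptions. With match 1, W_g = 1 and W_s = 0, the alignment from the "Aligning with Gaps" slide scores 13, the same total that appears on the scoring slide, though whether the slide used these weights is not stated.

```python
import re


def count_gaps(row: str) -> int:
    """Number of maximal runs of spaces ('-') in one row of an alignment."""
    return len(re.findall(r"-+", row))


def affine_alignment_score(s_row, t_row, match, mismatch, w_g, w_s):
    """Match scores - mismatch scores - W_g (# of gaps) - W_s (# of spaces)."""
    total = 0
    for x, y in zip(s_row, t_row):
        if x == "-" or y == "-":
            continue                       # spaces are charged through w_g and w_s
        total += match if x == y else -mismatch
    gaps = count_gaps(s_row) + count_gaps(t_row)
    spaces = s_row.count("-") + t_row.count("-")
    return total - w_g * gaps - w_s * spaces


s_row = "aaabbbcccdddeeefff---"
t_row = "aaabbb---dddeeefffggg"
print(affine_alignment_score(s_row, t_row, match=1, mismatch=1, w_g=1, w_s=0))  # 13
print(affine_alignment_score(s_row, t_row, match=1, mismatch=1, w_g=2, w_s=1))  # 5
```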

Convex gap weight model Extension penalty should not be a constant but rather decrease as length of gap increases –One example: each gap has cost W_g + log q where q is the length of the gap The solution now requires more than O(nm) time –In chapter 13 is an O(nm log m) time solution –Further improvement is possible, but costly

Arbitrary gap weight model Gap cost is an arbitrary function of gap length –each gap has cost w(q) where q is the length of the gap –no properties are assumed on w(q), such as its second derivative being negative Solution time is now O(nm^2 + n^2 m) –cubic cost, similar to original Needleman-Wunsch solution

Recurrences for arbitrary gap weights Base Case: –For 0 <= i <= n, V(i,0) = -w(i) –For 0 <= j <= m, V(0,j) = -w(j) Recursive Case: –0 < i <= n, 0 < j <= m –V(i,j) = max of the following three terms: V(i-1,j-1) + s(S(i),T(j)); max over 0 <= k <= j-1 of [V(i,k) - w(j-k)] –Match S[1..i] with T[1..k], then a gap of length j-k opposite the end of T; max over 0 <= k <= i-1 of [V(k,j) - w(i-k)] –Match S[1..k] with T[1..j], then a gap of length i-k opposite the end of S
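A direct sketch of the cubic recurrence, assuming the inner maxima range over every k from 0 up to j-1 (respectively i-1) and taking V(0,0) = 0. The gap weight `w` and scoring function `score` are supplied by the caller; all names are assumptions for illustration.

```python
def arbitrary_gap_alignment_value(s, t, score, w):
    """Optimal global alignment value with gap cost w(q) for a gap of length q."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]            # V(0,0) = 0
    for i in range(1, n + 1):
        V[i][0] = -w(i)                                  # one gap covering S[1..i]
    for j in range(1, m + 1):
        V[0][j] = -w(j)                                  # one gap covering T[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # alignment ends with T[k+1..j] opposite a gap of length j-k
            E = max(V[i][k] - w(j - k) for k in range(j))
            # alignment ends with S[k+1..i] opposite a gap of length i-k
            F = max(V[k][j] - w(i - k) for k in range(i))
            V[i][j] = max(G, E, F)
    return V[n][m]


def pm_one(x, y):
    return 1 if x == y else -1     # hypothetical: match 1, mismatch -1

# Constant gap weight w(q) = 1: 15 matches minus 2 gaps.
print(arbitrary_gap_alignment_value("aaabbbcccdddeeefff",
                                    "aaabbbdddeeefffggg",
                                    pm_one, lambda q: 1))  # 13
```

Each cell scans its whole row and column, which is where the O(nm^2 + n^2 m) bound comes from.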

Recurrences for affine gap weights Base Case: –For 0 <= i <= n, V(i,0) = E(i,0) = -W_g - i W_s –For 0 <= j <= m, V(0,j) = F(0,j) = -W_g - j W_s Recursive Case: –0 < i <= n, 0 < j <= m –V(i,j) = max [E(i,j), F(i,j), G(i,j)] –G(i,j) = V(i-1,j-1) + s(S(i),T(j)) –E(i,j) = max [E(i,j-1), V(i,j-1) - W_g] - W_s: the max checks if the gap begins at S(i) or if it began earlier –F(i,j) = max [F(i-1,j), V(i-1,j) - W_g] - W_s: the max checks if the gap begins at T(j) or if it began earlier
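A sketch of the O(nm) affine recurrences. It deviates from the base cases above in two places, both of which are assumptions of this sketch rather than the slide's method: V(0,0) is set to 0, and the boundary cells E(i,0) and F(0,j) are set to minus infinity. With finite values there, a gap that runs down column 0 can be "extended" sideways into a gap in the other string without paying W_g a second time. For S = a, T = b with mismatch -10, W_g = 5 and W_s = 1, the slide's base cases would give E(1,1) = -7, although the cheapest alignment with two gaps costs 12.

```python
NEG_INF = float("-inf")


def affine_gap_alignment_value(s, t, score, w_g, w_s):
    """Optimal global alignment value with gap cost W_g + q * W_s per gap of length q."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]      # ends with a space in S
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]      # ends with a space in T
    for i in range(1, n + 1):
        V[i][0] = -w_g - i * w_s
    for j in range(1, m + 1):
        V[0][j] = -w_g - j * w_s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + score(s[i - 1], t[j - 1])
            # extend an existing gap, or open a new one and pay W_g
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - w_g) - w_s
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - w_g) - w_s
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]


def pm_one(x, y):
    return 1 if x == y else -1     # hypothetical: match 1, mismatch -1

print(affine_gap_alignment_value("aaabbbcccdddeeefff",
                                 "aaabbbdddeeefffggg",
                                 pm_one, w_g=1, w_s=0))  # 13
```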