Computational Biology Lecture #6: Matching and Alignment

Slides:

Advertisements

Similar presentations

Global Sequence Alignment by Dynamic Programming.

Advertisements

Longest Common Subsequence

Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU

Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.

Dynamic Programming Optimization Problems Dynamic Programming Paradigm

S. Maarschalkerweerd & A. Tjhang1 Probability Theory and Basic Alignment of String Sequences Chapter

6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.

Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.

Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.

. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.

Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.

Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.

Aligning Alignments Soni Mukherjee 11/11/04. Pairwise Alignment Given two sequences, find their optimal alignment Score = (#matches) * m - (#mismatches)

. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.

Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.

. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at

Sequence analysis of nucleic acids and proteins: part 1 Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000.

Introduction to Bioinformatics Algorithms Sequence Alignment.

. Sequence Alignment Tutorial #3 © Ydo Wexler & Dan Geiger.

PAM250. M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly.

Alignment III PAM Matrices. 2 PAM250 scoring matrix.

Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.

Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.

. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

Sequence Alignment - III Chitta Baral. Scoring Model When comparing sequences –Looking for evidence that they have diverged from a common ancestor by.

1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.

Multiple Sequence Alignment S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- AG-AG- GTTGTT GTGGTG.

Sequence Alignment.

Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.

Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.

CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)

Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.

Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.

Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.

Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.

Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.

Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.

Chapter 3 Computational Molecular Biology Michael Smith

Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.

Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.

Sequence Alignment.

Fall 2008Simple Parallel Algorithms1. Fall 2008Simple Parallel Algorithms2 Scalar Product of Two Vectors Let a = (a 1, a 2, …, a n ); b = (b 1, b 2, …,

Multiple Sequence Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.

Finding Motifs Vasileios Hatzivassiloglou University of Texas at Dallas.

Core String Edits, Alignments, and Dynamic Programming.

Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.

Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.

Sequence comparison: Dynamic programming

Definition of Minimum Edit Distance

Distance Functions for Sequence Data and Time Series

Bioinformatics Algorithms and Data Structures

Computational Genomics Lecture #2b

Sequence Alignment 11/24/2018.

Computational Biology Lecture #7: Local Alignment

Pairwise sequence Alignment.

Pair Hidden Markov Model

Computational Biology Lecture #6: Matching and Alignment

Intro to Alignment Algorithms: Global and Local

Dynamic Programming Computation of Edit Distance

CSE 589 Applied Algorithms Spring 1999

Multiple Sequence Alignment

BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment

Multiple Sequence Alignment (I)

Dynamic Programming-- Longest Common Subsequence

Bioinformatics Algorithms and Data Structures

Computational Genomics Lecture #3a

Presentation transcript:

Computational Biology Lecture #6: Matching and Alignment Bud Mishra Professor of Computer Science and Mathematics 10 ¦ 22 ¦ 2001 12/1/2018 ©Bud Mishra, 2001

Inexact Matching ©Bud Mishra, 2001 Example: Edit Distance Problem: Edit distance between two biological sequences May correspond to: Evolutionary Distance Functional Distance Structural Distance 12/1/2018 ©Bud Mishra, 2001

Edit Distance ©Bud Mishra, 2001 Simplest distance function corresponds to: EDIT DISTANCE Atomic Edit Functions: Insertion AATCGG a AATACGG Deletion AATACGG a AATCGG Substitution AATCGG a AATAGG A composite edit function ' Function Composition of Atomic Edit Functions 12/1/2018 ©Bud Mishra, 2001

Cost of a Composite Edit Function (Based on the cost or distance for Atomic Edit Functions) Given: Two strings S1 and S2 Distance(S1, S2) = min { cost(E) | E(S1) = S2} where E = composite edit function mapping S1 to S2. 12/1/2018 ©Bud Mishra, 2001

Some Properties of Distance Function Assume: (8 e= Atomic Edit Function) cost(e) = cost(e-1) Distance(S1, S2) = Distance(S2, S1) Symmetric Distance(S1, S1) = 0 Distance(S1, S2) + Distance(S2, S3) 5 Distance(S1, S3) Triangle Inequality Simplest Cost Function: Each atomic edit function is of unit cost. 12/1/2018 ©Bud Mishra, 2001

Edit Operations ©Bud Mishra, 2001 I: Insertion of a character into the first string S1 D: Deletion of a character from the first string S1 R: Replacement (or Substitution) of a character in the first string S1 with a character in the second string S2 M: Matching (Identity) 12/1/2018 ©Bud Mishra, 2001

Edit Transcript ©Bud Mishra, 2001 Example: GATT A CA | | | | | The complete edit function is described by an “edit transcript” EDIT TRANSCRIPT = s 2 {D, M, R, I}* Example (in left): Edit transcript = DDMIRIMMII Edit Distance = 1+1+0+1+1+1+0+0+1+1=7 GATT A CA | | | | | TTCGGCATT D 1 2 MM I 3 R 4 5 6 7 12/1/2018 ©Bud Mishra, 2001

Levenshtein Distance or Edit Distance Edit Distance between two strings S1 and S2 is defined as the minimum number of atomic edit operations – insertions, deletions (indels), and substitutions – needed to transform the first string into the second Optimal Transcript = An edit transcript corresponding to the minimum number of atomic edit operations of unit cost. EDP The Edit Distance Problem is to compute the edit distance between two given strings, along with an optimal edit transcript that describes the transformation (alignment) 12/1/2018 ©Bud Mishra, 2001

Dynamic Programming Calculation of Edit Distance Define: D(i,j) ´ Min number of atomic edit operations needed to transform the first i characters of S1 into the first j characters of S2 ´ EditDistance(S1[1..i], S2[1..j]) |S1| = n |S2| = m Distance(S1, S2) = D(n,m) Dynamic Programming: 3 components: Recurrence Relation Tabular Computation Traceback 12/1/2018 ©Bud Mishra, 2001

Recurrence ©Bud Mishra, 2001 D(i-1,j-1) D(i-1,j) D(i,j-1) D(i,j) Base Relation: D(0,0) = 0 EditDistance(l,l) = 0 Recurrence Relations: In 1 coordinate: D(i,0) = D(i-1,0)+1 (S1[i] deleted) D(0,j) = D(0,j-1)+1 (S2[j] inserted) EditDistance(S1[1..i], l) =i (i deletions) EditDistance(l, S2[1..j]) =j (j insertions) In both coordiates: D(i,j) = min {D(i-1,j) +1, (S1[i] deleted) {D(i,j-1)+1, (S2[j] inserted) {D(i-1,j-1)+ D(i,j)} (substn or match) D(i,j) = 1, if S1[i] ¹ S2[j]; 0 otherwise. D(i-1,j-1) D(i-1,j) D(i,j-1) D(i,j) D(i,j) +1 +1 12/1/2018 ©Bud Mishra, 2001

Efficient Tabular Computation of Edit Distance Recursive Implementation a 2O(n+m)-time computation Bottom-up computation (n+1)£ (m+1) distinct values for D(i,j) to be computed Dynamic Programming Table of size (n+1)£ (m+1) String S_1 corresponds to the rows (Vertical Axis) String S_2 corresponds to the columns (Horizontal Axis) Fill out D(i,0) Ã First Column Fill out D(0,j) Ã First Row Fill out rows D(i,j) Ã Left-to-Right (increasing i) 12/1/2018 ©Bud Mishra, 2001

The Algorithm ©Bud Mishra, 2001 Time complexity = O(nm) for i=0 to n do D(i,0) Ã i; for j=0 to m do D(0,j) Ã j; for i=1 to n do for j=1 to m do D(i,j) Ã min[ D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1) + D(i,j)] ¤ Time complexity = O(nm) 12/1/2018 ©Bud Mishra, 2001

Example ©Bud Mishra, 2001 S2 ! S1 ! A T C G Solutions with minimal edit distance =4 (sol.1) T A-GGC AATCGG 1 2 3 4 (Edit Script= RMIRMR) (sol.2) - -TAGGC AATCGG - 1 2 3 4 (Edit Script = IIMRMMD) A T C G 1 2 3 4 5 6 S1 ! 12/1/2018 ©Bud Mishra, 2001

Trace Back ©Bud Mishra, 2001 Extracting Optimal Edit Transcript: Set a pointer from: Cell(i,j) ! Cell(i,j-1), if D(i,j) = D(i,j-1)+1 Horizontal Edge ) I, Insertion Cell(i,j) ! Cell(i-1,j), if D(i,j) = D(i-1,j)+1 Vertical Edge ) D, Deletion Cell(i,j) ! Cell(i-1,j-1), if D(i,j) = D(i-1,j-1)+D(i,j) Diagonal Edge ) R, Substitution, if D(i,j)=1 M. Match, if D(i,j) = 0. Optimal Edit Transcript can be computed in O(n+m) additional time. 12/1/2018 ©Bud Mishra, 2001

GAPS: The Scoring Model Basic operations: Sequencing Errors or Evolutionary processes of Mutations and Selections Substitution: Changes one base to another. Gaps: Insertions or Deletions: Adds or removes a base. Total Score Assigned to an Alignment= Sum of terms for each aligned pair of bases plus terms for each gap. 12/1/2018 ©Bud Mishra, 2001

Total Score of an Alignment with Gaps Total Score Assigned to an Alignment Corresponds to log of the Relative likelihood that the two sequences are related compared to being unrelated. Assumptions: Mutations or Sequencing Errors at different sites in a sequence occur independently. 12/1/2018 ©Bud Mishra, 2001

Substitution Matrices Notation: x and y ´ Pairs of sequences, |x| = n and |y| = m. x, y 2 (A+G+C+T)* xi = ith symbol in x yj = jth symbol in y Random Model, R: P(x,y|R) = Õ qxi Õ qyj qa = probability that the letter “a” occurs independently at a given site. 12/1/2018 ©Bud Mishra, 2001

Random Model vs. Alternative Model Alternative Model, M: P(x,y | M) = Õ pxi, yj pab = Probability that the letters “a” and “b” have each been derived independently from some common letter. Log-Odds Ratio (LOD): s(a,b) = ln (pab/qa qb) P(x,y|M)/P(x,y|R) = Õ (pxi yi/ qxi qyi) = Õ exp[s(xi,yi)] = exp[å s(xi, yi)] 12/1/2018 ©Bud Mishra, 2001

Score ©Bud Mishra, 2001 Score = ln [ P(x,y|M)/P(x,y|R) ] = å s(xi, yi) = s(x,y) Score Matrix or Substitution Matrix: A T C G 2 -1 Blossum 50 PAM (Point Accepted Mutation) 12/1/2018 ©Bud Mishra, 2001

1-PAM Matrix ©Bud Mishra, 2001 Let M be a probability transition matrix. Mab = Pr(a , b), a, b = chracters pa = Pr(“a” occurs in a string) fab = The number of times the mutation a , b was observed to occur. fa = åa ¹ b fab & f = å fa. K = 1-PAM Evolutionary distance “The amount of evolution that will change 1 in K characters on average.” 12/1/2018 ©Bud Mishra, 2001

Maa = 1 – ma, Mab = fab /(Kf pa) = (fab/fa) ma 1-PAM Matrix ma = fa /(Kf pa), Maa = 1 – ma, Mab = fab /(Kf pa) = (fab/fa) ma a-PAM Matrix = Ma M* = lima Ã 1 Ma Scorea(a,b) = 10 log10 Maab/pb Sequence comparison with 40 PAM, 120 PAM & 250 PAM score functions… 12/1/2018 ©Bud Mishra, 2001

Gaps: ©Bud Mishra, 2001 g = length of a gap, exponentially distributed » Exp(l) f(g) = l e-l g P(g) = f(g) Õ qxi ln P(g) = -l g + ln l + sum ln qxi = -d – (g-1) e (Affine Score Model) d = Gap-open Penalty e = Gap-Extension Penalty 12/1/2018 ©Bud Mishra, 2001

Multiple Sequence Alignment Defn:Given strings S1, S2, …Sk a multiple (global) alignment maps them to strings S’1, S’2, …, S’k (by inserting chosen spaces) such that |S’1| = |S’2| = L = |S’k|, and Removal of spaces from S’i contracts it to Si, for 1 5 i 5 k. 12/1/2018 ©Bud Mishra, 2001

Value of a Multiple Global Alignment The sum of pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all Ck,2 pairwise alignments induced by A. Given: Two strings S1 and S2. The expanded strings S’1 and S’2 correspond to a pairwise alignment. d(x, y) = distance between two characters x and y = 1, if x ¹ y and 0, if x = y. d(x,-) = d(-,y) = 1. Distance(S’1, S’2) = åi=1l d(S’1[i], S’2[i]), where l = |S’1|=|S’2|. 12/1/2018 ©Bud Mishra, 2001

Optimal Global Alignment An optimal SP(global) alignment of strings S1, S2, …, Sk is an alignment that has a minimum possible sum-of-pairs value for these strings among all possible multiple sequence alignments. 12/1/2018 ©Bud Mishra, 2001

Generalization of DP ©Bud Mishra, 2001 Assume |S1| = |S2| = L = |Sk| = n. The generalized k-dimensional DP table has (n+1)k entries. Each entry depends on 2k – 1 adjacent entries. D(i1, 0, …, 0) = i1 D(0, i2,…, 0) = i2 M D(0,0,…, ik) = ik D(i1, i2, …ik) = min; ¹ S µ {1..k} [ D[…, ij-1,…]j2 S + ål¹ m 2 S d(il, im) + |S| £ (n-|S|) ] 12/1/2018 ©Bud Mishra, 2001

Complexity ©Bud Mishra, 2001 The time and space complexity of the generalized DP solution of the multiple alignment problem is = O((2n)k) Theorem: The optimal SP alignment problem is NP-complete. In the worst-case, one cannot expect to do much better unless P=NP. 12/1/2018 ©Bud Mishra, 2001

P-Time Heuristics ©Bud Mishra, 2001 A Polynomial Time Approximate Algorithm for Multiple String Alignment: Assumption about the distance function: Triangle Inequality: 8chars, x ,y, z d(x,z) 5 d(x,y) + d(y,z) 8char, x d(x, x) = 0 D(S1. S2) , Value of the min. global alignment of S1 & S2. 12/1/2018 ©Bud Mishra, 2001

Algorithm ©Bud Mishra, 2001 Input: T = {S1, S2, …, Sk} Step 1: Find S1 2 CT that minimizes åS 2 T n {S1} D(S1, S) Time Complexity = O(k2 n2) Ck,2 DP each taking O(n2) time Call the remaining strings S2, …, Sk 12/1/2018 ©Bud Mishra, 2001

= O(k2 n2) + åi=1k O(i n2) = O(k2 n2) The ith Step Step i: Assume S1, …Si-1 have been aligned as S’1, …, S’i-1 Add Si: Run DP to align S’1 & Si a S”i and S’i Adjust S’2, …, S’i-1 by adding spaces where spaces were added in S’1 S1, S2, …, Si )aligned S’1, S’2, …, S’i. Length(S’1) in step i 5 i n. DP(S’1, Si) takes O(i n2) time Total time Complexity = O(k2 n2) + åi=1k O(i n2) = O(k2 n2) 12/1/2018 ©Bud Mishra, 2001

Competitiveness ©Bud Mishra, 2001 M = Alignment induced by the algorithm d(i,j) = Distance M induces on pair Si, Sj M* = Optimal alignment 2 SP(M) = åi=1k åj=1, j ¹ ik d(i,j) 5 åi=1k åj=1, j ¹ ik d(i,1) + d(1,j) (Triangle Inequality) = åi=1k åj=1, j ¹ ik d(1,i) + åi=1k åj=1, j ¹ ik d(1,j) (Symmetry) = åi=2k (k-1) d(1,i) +åj=2k (k-1) d(1,j) = 2(k-1) åi=2k d(1,i) 12/1/2018 ©Bud Mishra, 2001

Competitiveness ©Bud Mishra, 2001 SP(M) 5 2 (1 – 1/k) SP(M*) 2SP(M*) = åi=1k åj=1, j¹ ik D(Si, Sj) = åj=2k D(S1, Sj) + åj=1, j ¹ 2k D(S2, Sj) + L + åj=1k-1 D(Sk, Sj) = k åi=2k d(1,i) SP(M*) = (k/2) åi=2k d(1,i) ¸ (k/2) [SP(M)/(k-1)] SP(M) 5 2 (1 – 1/k) SP(M*) 12/1/2018 ©Bud Mishra, 2001