. Sequence Alignment III Lecture #4 This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by.

Presentation transcript:

Sequence Alignment III, Lecture #4
This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger and Shlomo Moran.
Background Readings: The second chapter (pages 12-45) in the text book, Biological Sequence Analysis, Durbin et al.; Chapter 3 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.

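2 Log Odd-Ratio Test for Alignment
Taking the logarithm of Q yields
$$\log Q = \sum_i \log \frac{p(s[i],t[i])}{p_{s[i]}\, p_{t[i]}},$$
where each term of the sum is Score(s[i],t[i]), p(a,b) is the probability of seeing a aligned with b in related sequences, and p_a is the background frequency of letter a.
If log Q > 0, then s and t are more likely to be related. If log Q < 0, then they are more likely to be unrelated.
How can we relate this quantity to a score function?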

3 Estimating p(·,·) for proteins
Generate a large diverse collection of accepted mutations. An accepted mutation is a mutation observed in an alignment of closely related protein sequences, for example the hemoglobin alpha chain in humans and other organisms (homologous proteins).
Let $p_a = n_a / n$, where $n_a$ is the number of occurrences of letter a and $n = \sum_a n_a$ is the total number of letters in the collection.
Mutation counts: let $f_{ab}$ be the number of mutations $a \leftrightarrow b$, let $f_a = \sum_{b \neq a} f_{ab}$ be the total number of mutations that involve a, and let $f = \sum_a f_a$ be the total number of amino acids involved in a mutation. Note that f is twice the number of mutations.

4 PAM-1 matrices
Define $M_{ab}$ to be the symmetric probability matrix for switching between a and b. We set $M_{aa} = 1 - m_a$, so that $m_a$ is the probability that a is involved in a change.
We define $M_{ab}$ such that only 1% of amino acids change according to this matrix, or 99% don’t. Hence the name, 1-Percent Accepted Mutation (PAM). In other words,
$$\sum_a p_a M_{aa} = 0.99.$$

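5 PAM-1 matrices
We wish that $m_a$ will be proportional to the relative mutability of letter a compared to other letters:
$$m_a = \frac{1}{K} \cdot \frac{f_a}{f\, p_a},$$
where K is a proportionality constant, $f_a$ is the number of mutations that involve a, f is the total number of amino acids involved in a mutation, and $p_a$ is the frequency of a.
We select K to satisfy the PAM-1 definition:
$$\sum_a p_a M_{aa} = \sum_a p_a (1 - m_a) = 1 - \frac{1}{K} \sum_a \frac{f_a}{f} = 1 - \frac{1}{K} = 0.99.$$
So K = 100 for PAM-1 matrices. Note that K = 50 yields 2% change, etc.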
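The construction on the last three slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-letter alphabet and the counts are made up (the slides deal with amino acids), the mutability is taken to be m_a = f_a / (K f p_a), and the off-diagonal rule M_ab = m_a f_ab / f_a is an assumed way of splitting m_a among the other letters, not something the slides spell out.

```python
# Toy PAM-1 style matrix from mutation counts (illustrative data, not real counts).

def pam_matrix(letters, n, f_pair, K=100.0):
    """n[a]: occurrences of letter a; f_pair[(a, b)]: symmetric count of mutations a<->b."""
    total = sum(n.values())
    p = {a: n[a] / total for a in letters}                      # p_a = n_a / n
    f_a = {a: sum(f_pair.get((a, b), 0) for b in letters if b != a)
           for a in letters}                                    # mutations involving a
    f = sum(f_a.values())                                       # twice the number of mutations
    m = {a: f_a[a] / (K * f * p[a]) for a in letters}           # assumed mutability formula
    M = {}
    for a in letters:
        for b in letters:
            if a == b:
                M[a, b] = 1.0 - m[a]
            elif f_a[a]:
                # Assumption: m_a is split among targets in proportion to f_ab.
                M[a, b] = m[a] * f_pair.get((a, b), 0) / f_a[a]
            else:
                M[a, b] = 0.0
    return p, M

letters = "ACGT"
n = {"A": 300, "C": 200, "G": 200, "T": 300}
observed = {("A", "G"): 6, ("C", "T"): 5, ("A", "C"): 2, ("G", "T"): 1}
f_pair = {}
for (a, b), count in observed.items():
    f_pair[a, b] = count
    f_pair[b, a] = count                                        # counts are symmetric

p, M = pam_matrix(letters, n, f_pair)
unchanged = sum(p[a] * M[a, a] for a in letters)                # fraction of letters left unchanged
row_sums = [sum(M[a, b] for b in letters) for a in letters]
```

With K = 100 the value of `unchanged` comes out as 0.99 regardless of the counts, which is the PAM-1 condition; passing K = 50 gives 0.98.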

6 Evolutionary distance
The choice that 1% of amino acids change (and that K = 100) is quite arbitrary. It could fit a specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated. This is a unit of evolutionary change, not time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?

7 Model of Evolution
We make some assumptions:
1. Each position changes independently of the rest.
2. The probability of mutations is the same in each position.
3. Evolution does not “remember”.
[Figure: a sequence shown at times t, t+Δ, t+2Δ, t+3Δ, t+4Δ, with letters changing between time steps.]

8 Model of Evolution
- How do we model such a process?
- This process is called a Markov chain.
A chain is defined by the transition probability $P(X_{t+\Delta} = b \mid X_t = a)$, the probability that the next state is b given that the current state is a.
We often describe these probabilities by a matrix: $M[\Delta]_{ab} = P(X_{t+\Delta} = b \mid X_t = a)$.

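9 Multi-Step Changes
Based on $M_{ab}$, we can compute the probabilities of changes over two time periods. Using conditional independence (no memory):
$$P(X_{t+2\Delta} = b \mid X_t = a) = \sum_c P(X_{t+\Delta} = c \mid X_t = a)\, P(X_{t+2\Delta} = b \mid X_{t+\Delta} = c) = \sum_c M[\Delta]_{ac}\, M[\Delta]_{cb}.$$
- Thus $M[2\Delta] = M[\Delta]\, M[\Delta]$.
- By induction (homework exercise): $M[n\Delta] = M[\Delta]^n$.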
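A small numeric check of the multi-step rule may help. The sketch below uses a made-up two-state transition matrix (an assumption for illustration only) and plain Python lists; it compares the two-step probability obtained by summing over the intermediate state with the corresponding entry of the squared matrix.

```python
# Multi-step transition probabilities as matrix powers (toy two-state chain).

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    """M raised to the n-th power by repeated multiplication (n >= 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

M = [[0.9, 0.1],
     [0.2, 0.8]]            # hypothetical one-step matrix M[delta]

M2 = mat_pow(M, 2)          # two-step matrix M[2 delta]
# Probability of going from state 0 to state 1 in two steps,
# summing over the intermediate state c.
two_step = sum(M[0][c] * M[c][1] for c in range(2))
M5 = mat_pow(M, 5)          # five-step matrix M[5 delta]
```

Each row of any power of M still sums to 1, so every M[nΔ] remains a transition probability table.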

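10 A Markov Model (chain)
[Figure: chain X_1 → X_2 → … → X_{n-1} → X_n]
Every variable $X_i$ has a domain. For example, suppose the domain is the set of letters {a, c, t, g}. Every variable is associated with a local probability table $P(X_i = x_i \mid X_{i-1} = x_{i-1})$, and the first variable with $P(X_1 = x_1)$. The joint distribution is given by
$$P(x_1, \dots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1}).$$
In short, we write
$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}_i),$$
where $\mathrm{Pa}_i$ are the parents of variable/node $X_i$, namely, none or $X_{i-1}$.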
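As a minimal sketch of the chain factorization, the code below multiplies a prior for the first letter by one transition probability per later letter. The two-letter alphabet and all probabilities are invented for the example; the slide's domain is {a, c, t, g}.

```python
# Joint probability of a sequence under a first-order Markov chain (toy tables).
from itertools import product

def chain_probability(seq, prior, trans):
    """P(x_1) * product over i >= 2 of P(x_i | x_{i-1})."""
    prob = prior[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= trans[prev][cur]
    return prob

prior = {"a": 0.5, "c": 0.5}                       # hypothetical P(X_1)
trans = {"a": {"a": 0.9, "c": 0.1},                # hypothetical P(X_i | X_{i-1})
         "c": {"a": 0.2, "c": 0.8}}

p_aac = chain_probability("aac", prior, trans)     # 0.5 * 0.9 * 0.1
# Summing over every sequence of length 3 should give total probability 1.
total = sum(chain_probability("".join(s), prior, trans)
            for s in product("ac", repeat=3))
```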

11 Markov Model of Evolution Revisited
In the evolution model we studied earlier we had $P(x_1) = (p_a, p_c, p_g, p_t)$, which sum to 1 and are called the prior probabilities, and $P(x_i \mid x_{i-1}) = M[\Delta]$, which is a stationary transition probability table, not depending on the index i.
The quantity we computed earlier from this model was the joint probability table.
[Figure: chain X_1 → X_2 → … → X_{n-1} → X_n, with each arrow labeled M]

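12 Longer Term Changes
- Estimate $M[\Delta] = M$ (PAM-1 matrices).
- Use $M[n\Delta] = M^n$ (PAM-n matrices).
- Define the log-odds score
$$Score(a,b) = \log \frac{p_a\, (M^n)_{ab}}{p_a\, p_b} = \log \frac{(M^n)_{ab}}{p_b}.$$
- Use this quantity to define the score for your application of interest.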
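To make the PAM-n scoring step concrete, here is a hedged sketch that raises a one-step matrix to the n-th power and turns it into log-odds scores, assuming the score is log((M^n)_ab / p_b). The two-state matrix and the frequencies are toy values chosen so that p is the stationary distribution of M; real PAM tables use 20 amino acids and scaled, rounded scores.

```python
# Log-odds scores from the n-step matrix M^n (toy two-state example).
import math

def mat_mul(A, B):
    size = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

def log_odds(M, p, n):
    """Score(a, b) = log( (M^n)[a][b] / p[b] ), natural logarithm."""
    Mn = mat_pow(M, n)
    size = len(M)
    return [[math.log(Mn[a][b] / p[b]) for b in range(size)] for a in range(size)]

M = [[0.9, 0.1],
     [0.2, 0.8]]           # hypothetical one-step matrix
p = [2 / 3, 1 / 3]         # its stationary distribution, used as background frequencies

S1 = log_odds(M, p, 1)
S5 = log_odds(M, p, 5)
```

In this toy setting identities score positive and substitutions negative, and the scores move toward zero as n grows, since M^n approaches the background frequencies.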

13 Comments regarding PAM
- Historically researchers use PAM-250. (The only one published in the original paper.)
- Original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
- Used to be the most popular scoring rule, but there are some problems with PAM matrices.

14 Degrees of freedom in PAM definition
With K = 100 we obtain the 1-PAM matrix $M[\Delta]$. With K = 50 the basic matrix is different: it corresponds to 2% change and is denoted $M[2\Delta]$.
Thus we have two different ways to estimate the matrix $M[4\Delta]$:
- Use the 1-PAM matrix to the fourth power: $M[4\Delta] = M[\Delta]^4$, or
- Use the K = 50 matrix to the second power: $M[4\Delta] = M[2\Delta]^2$.
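The following sketch illustrates numerically that the two estimates of M[4Δ] need not agree. It assumes a two-letter alphabet with made-up relative mutabilities r_a, so that the basic matrix for a given K has m_a = r_a / K; these values are hypothetical and only serve to show the effect.

```python
# Two routes to a "4-unit" matrix: (K=100 matrix)^4 versus (K=50 matrix)^2.

def mat_mul(A, B):
    size = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

def basic_matrix(K, r=(0.5 / 0.6, 0.5 / 0.4)):
    """Two-state basic matrix with m_a = r_a / K; r_a are assumed relative mutabilities."""
    m0, m1 = r[0] / K, r[1] / K
    return [[1.0 - m0, m0],
            [m1, 1.0 - m1]]

via_k100 = mat_pow(basic_matrix(100), 4)   # M[delta]^4
via_k50 = mat_pow(basic_matrix(50), 2)     # M[2 delta]^2
gap = max(abs(via_k100[i][j] - via_k50[i][j]) for i in range(2) for j in range(2))
```

The difference is small here because squaring the K = 100 matrix almost reproduces the K = 50 matrix, but it is not zero, which is the degree of freedom the slide points at.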

15 Problems in building distance matrices
- How do we find pairs of aligned sequences?
- How far is the ancestor?
  - earlier divergence → low sequence similarity
  - later divergence → high sequence similarity
  E.g., M[250Δ] is known not to reflect long-period changes well.
- Does one letter mutate to the other, or are they both mutations of a third letter?

16 BLOSUM Outline
- Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. Similar ideas but better statistics and modeling.
- Procedure:
  - Cluster together sequences in a family whenever more than L% identical residues are shared.
  - Count number of substitutions across different clusters (in the same family).
  - Estimate frequencies using the counts.
- Practice: Blosum50 and Blosum62 are widely used. (See page in the text book.) Considered state of the art nowadays.

17 Multiple Sequence Alignment
S_1 = AGGTC
S_2 = GTTCG
S_3 = TGAAC
Possible alignments, listed column by column (each group of three letters is one column, with S_1, S_2, S_3 from top to bottom):
A-T GGG G-- TTA -TA CCC -G-
AG- GTT GTG T-A --A CCA -GC

18 Multiple Sequence Alignment
Aligning more than two sequences.
Definition: Given strings S_1, S_2, …, S_k, a multiple (global) alignment maps them to strings S'_1, S'_2, …, S'_k that may contain blanks, where:
1. |S'_1| = |S'_2| = … = |S'_k|
2. The removal of spaces from S'_i leaves S_i.

19 Multiple alignments
We use a matrix to represent the alignment of k sequences, K = (x_1, ..., x_k). We assume no column consists solely of blanks.
x_1  MQ-ILLL
x_2  MLR-LL-
x_3  MK-ILLL
x_4  MPPVLIL
The common scoring functions give a score to each column, and set score(K) = ∑_i score(column(i)).
For k = 10, a scoring function has 2^k - 1 > 1000 entries to specify. One need not have a general function. For example, the order of arguments need not matter: score(I,-,I,V) = score(-,I,I,V).

20 SUM OF PAIRS
MQ-ILLL
MLR-LL-
MK-ILLL
MPPVLIL
A common scoring function is SP, the sum of scores of the projected pairwise alignments: SPscore(K) = ∑_{i<j} score(x_i, x_j).
In order for this score to be written as ∑_i score(column(i)), we must specify score(-,-) = 0. Why? Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (lines).
Note that we need to specify score(-,-) because a column may have several blanks (as long as not all entries are blanks).

21 SUM OF PAIRS
MQ-ILLL
MLR-LL-
MK-ILLL
MPPVLIL
Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A, where the pairwise alignment function score(x_i, x_j) is additive.

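22 Example
Consider the following alignment:
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance, with cost 0 for a match (or for two aligned spaces) and cost 1 for a mismatch or for a character aligned with a space, this alignment has an SP value of 3 + 4 + 5 = 12.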
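The SP value of this example can be recomputed with a few lines of Python. The sketch assumes unit edit costs: 0 for identical symbols (including two aligned spaces) and 1 for a mismatch or a character against a space.

```python
# Sum-of-pairs value of a multiple alignment under unit edit costs.
from itertools import combinations

def pair_cost(x, y):
    """0 if the symbols are identical (this covers two aligned spaces), else 1."""
    return 0 if x == y else 1

def projected_cost(row1, row2):
    """Cost of the pairwise alignment induced by two rows of the multiple alignment."""
    return sum(pair_cost(x, y) for x, y in zip(row1, row2))

def sp_value(rows):
    """Sum of the projected pairwise costs over all pairs of rows."""
    assert len({len(r) for r in rows}) == 1, "all rows must have equal length"
    return sum(projected_cost(r1, r2) for r1, r2 in combinations(rows, 2))

alignment = ["ac-cdb-",
             "-c-adbd",
             "a-bcdad"]

pairwise = [projected_cost(r1, r2) for r1, r2 in combinations(alignment, 2)]
total = sp_value(alignment)
```

The pairs (1,2), (1,3) and (2,3) contribute 3, 4 and 5, for a total of 12.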

23 Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes SP-score(K) = ∑_{i<j} score(x_i, x_j).
Instead of a 2-dimensional table, we now have a k-dimensional table to fill. For each vector i = (i_1, .., i_k), compute an optimal multiple alignment for the k prefix sequences x_1(1,..,i_1), ..., x_k(1,..,i_k).
The adjacent entries are those that differ in their index by one or zero. Each entry depends on 2^k - 1 adjacent entries.

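24 The idea via k=2
[Figure: four adjacent table cells V[i,j], V[i+1,j], V[i,j+1], V[i+1,j+1]]
Recall the notation $V[i,j] = d(s[1..i], t[1..j])$ and the following recurrence for V:
$$V[i+1,j+1] = \max \{\, V[i,j] + score(s[i+1], t[j+1]),\ V[i,j+1] + score(s[i+1], -),\ V[i+1,j] + score(-, t[j+1]) \,\}$$
Note that the new cell index (i+1,j+1) differs from previous indices by one of 2^k - 1 non-zero binary vectors: (1,1), (1,0), (0,1).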

25 The idea for arbitrary k
Order the vectors i = (i_1, .., i_k) by increasing order of the sum ∑_j i_j. Set s(0,..,0) = 0, and for i > (0,...,0):
$$s(i) = \max_{b} \{\, s(i-b) + score(column(i,b)) \,\}$$
where
- The vector b ranges over all non-zero binary vectors for which i-b is non-negative.
- The vector i-b is the non-negative difference of i and b.
- The j-th entry of column(i,b) equals c_j = x_j(i_j) if b_j = 1, and c_j = ‘-’ otherwise (reflecting that b is 1 at location j if that location changed in the “current comparison”).
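The recurrence for arbitrary k can be sketched directly, keeping the table in a dictionary keyed by the index vector. The pairwise scores below (match 1, mismatch -1, gap -2, two blanks 0) are assumptions for illustration, and the cells are visited in lexicographic order rather than by increasing index sum; both orders guarantee that every i-b is filled before i.

```python
# k-dimensional dynamic programming for the optimal SP score (small inputs only).
from itertools import combinations, product

def pair_score(x, y):
    """Assumed pairwise scores: two blanks 0, gap -2, match 1, mismatch -1."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def column_score(column):
    """SP score of one column: sum of pair_score over all pairs of entries."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def msa_sp_score(seqs):
    """Best SP score over all multiple alignments of seqs; exponential in len(seqs)."""
    k = len(seqs)
    lengths = [len(s) for s in seqs]
    moves = [b for b in product((0, 1), repeat=k) if any(b)]   # non-zero binary vectors
    best = {tuple([0] * k): 0}
    for i in product(*(range(n + 1) for n in lengths)):        # lexicographic order
        if not any(i):
            continue
        candidates = []
        for b in moves:
            prev = tuple(ij - bj for ij, bj in zip(i, b))
            if min(prev) < 0:
                continue                                        # i - b must be non-negative
            column = [seqs[j][i[j] - 1] if b[j] else "-" for j in range(k)]
            candidates.append(best[prev] + column_score(column))
        best[i] = max(candidates)
    return best[tuple(lengths)]

score_three = msa_sp_score(["AGGTC", "GTTCG", "TGAAC"])
```

The table has (n+1)^k cells and each cell inspects up to 2^k - 1 predecessors, so this is only practical for a handful of short sequences.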

26 Complexity of the DP approach
Number of cells: n^k. Number of adjacent cells: O(2^k). Computation of the SP score for each column(i,b) is O(k^2).
Total run time is O(k^2 2^k n^k), which is utterly unacceptable!
Not much hope for a polynomial algorithm because the problem has been shown to be NP-complete. Need heuristics to reduce time.

27 Time saving heuristics: Relevance tests
Heuristic: Avoid computing score(i) for irrelevant vectors.
x_1  MQ-ILLL
x_2  MLR-LL-
x_3  MK-ILLL
x_4  MPPVLIL
Let L be a lower bound on the optimal SP score of a multiple alignment of the k sequences. A lower bound L can be obtained from an arbitrary multiple alignment, computed in any way.
Main idea: Using L, compute lower bounds L_uv for the optimal score for every two sequences s = x_u and t = x_v, 1 ≤ u < v ≤ k. When processing vector i = (.., i_u, .., i_v, ..), the relevant cells are such that in every projection on x_u and x_v, the optimal pairwise score is above L_uv.

28 Recall the Linear Space algorithm
- V[i,j] = d(s[1..i], t[1..j])
- B[i,j] = d(s[i+1..n], t[j+1..m])
- V[i,j] + B[i,j] = score of the best alignment through (i,j)
[Figure: alignment table with s along one axis and t along the other]
These computations are done in linear space.
Build such a table for every two sequences s = x_u and t = x_v, 1 ≤ u, v ≤ k. This entry encodes the optimum through (i_u, i_v).
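A minimal sketch of the forward and backward tables follows. For clarity it stores both tables in full (quadratic space) instead of the linear-space version the slide refers to, and it assumes a simple score (match 1, mismatch -1, gap -2); the backward table is obtained by running the forward computation on the reversed strings.

```python
# Forward table V, backward table B, and V + B = best score through each cell.

def score(x, y):
    """Assumed pairwise scores: gap -2, match 1, mismatch -1."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def forward(s, t):
    """V[i][j] = best global alignment score of s[:i] and t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + score(s[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + score("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(s[i - 1], t[j - 1]),
                          V[i - 1][j] + score(s[i - 1], "-"),
                          V[i][j - 1] + score("-", t[j - 1]))
    return V

def backward(s, t):
    """B[i][j] = best global alignment score of the suffixes s[i:] and t[j:]."""
    n, m = len(s), len(t)
    R = forward(s[::-1], t[::-1])
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]

def through(s, t):
    """T[i][j] = score of the best alignment forced to pass through cell (i, j)."""
    V, B = forward(s, t), backward(s, t)
    return [[V[i][j] + B[i][j] for j in range(len(t) + 1)] for i in range(len(s) + 1)]

s, t = "AGGTC", "GTTCG"
T = through(s, t)
optimum = forward(s, t)[len(s)][len(t)]
```

Every entry of T is at most the optimum, and each row contains at least one cell that attains it, because an optimal alignment path crosses every row; a relevance test keeps only the cells whose T value is above the pairwise bound.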

29 Time saving heuristics: Relevance test
But can we go over all cells to determine if they are relevant or not? No. Start with (0,…,0) and add relevant entries to the list until reaching (n_1,…,n_k).

30 Star Alignments
Rather than summing up all pairwise alignments, select a fixed sequence x_0 as a center, and set Star-score(K) = ∑_{j>0} score(x_0, x_j).
The algorithm to find an optimal alignment: at each step, add another sequence aligned with x_0, keeping old gaps and possibly adding new ones.
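To illustrate the star score, the sketch below evaluates a given multiple alignment by summing only the projected pairwise scores against the center row. The pairwise scores (match 1, mismatch -1, gap -2, two blanks 0) are assumptions, and the alignment is the four-row example used on the earlier slides with the first row taken as center.

```python
# Star score of a multiple alignment: pairwise scores against a fixed center row.

def pair_score(x, y):
    """Assumed pairwise scores: two blanks 0, gap -2, match 1, mismatch -1."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1

def projected_score(row1, row2):
    """Score of the pairwise alignment induced by two rows."""
    return sum(pair_score(x, y) for x, y in zip(row1, row2))

def star_score(rows, center=0):
    """Sum of projected scores between the center row and every other row."""
    return sum(projected_score(rows[center], row)
               for j, row in enumerate(rows) if j != center)

rows = ["MQ-ILLL",
        "MLR-LL-",
        "MK-ILLL",
        "MPPVLIL"]

against_center = [projected_score(rows[0], row) for row in rows[1:]]
star = star_score(rows, center=0)
```

Only k-1 pairwise terms are summed instead of the k(k-1)/2 terms of the SP score.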

31 Tree Alignments
Assume that there is a tree T = (V,E) whose leaves are the sequences. Associate a sequence with each internal node, and set Tree-score(K) = ∑_{(i,j)∈E} score(x_i, x_j).
Finding the optimal assignment of sequences to the internal nodes is NP-hard.
We will meet this problem again in the next topic: phylogenetic trees.

32 Multiple Sequence Alignment: Approximation Algorithm
In tutorial time you will see an O(k^2 n^2) multiple alignment algorithm that errs by a factor of at most 2(1 - 1/k) < 2.