Computational Genomics Lecture #3a (revised 24/3/09). This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir.


Computational Genomics Lecture #3a (revised 24/3/09)
This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
Background Readings: Sections 2.7, 2.8 in Biological Sequence Analysis, Durbin et al.; Sections in Algorithms on Strings, Trees, and Sequences, Gusfield.
1. Intro to ML and Scoring functions
2. PAM and BLOSUM AA scoring matrices

2 Scoring Functions
- So far, we discussed dynamic programming algorithms for
  - global alignment
  - local alignment
  - other, related versions
- All of these assumed a scoring function that determines the value of substitutions, insertions, and deletions.

3 Where does the scoring function come from?
An additive scoring function is defined by specifying a function $\sigma(\cdot,\cdot)$ such that
- $\sigma(x,y)$ is the score of replacing x by y (including x = y)
- $\sigma(x,-)$ is the score of deleting x
- $\sigma(-,x)$ is the score of inserting x
But how do we come up with the “correct” score? Idea: by encoding experience of which sequences are similar for the task at hand. Similarity depends on time, evolutionary trends, and sequence types.

4 Why is a probabilistic setting appropriate for defining and interpreting a scoring function?
Similarity is probabilistic in nature because biological changes such as mutation, recombination, and selection are random events. In a probabilistic setting we can answer questions such as:
- How probable is it for two sequences to be similar?
- Is the similarity found significant or spurious?
- How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?

5 A Probabilistic Model
- For starters, we will focus on alignment without indels (implying that the aligned sequences have equal length).
- For now, we assume each position (nucleotide/amino acid) is independent of other positions. This is not a very realistic assumption, BUT it makes our life a lot easier, and is a good start.
- We consider two options:
  - M: the sequences are Matched (related)
  - R: the sequences are Random (unrelated)

6 Unrelated Sequences (R)
- Our random model R of unrelated sequences is simple:
  - Each position is sampled independently from a fixed distribution $q(\cdot)$ over the alphabet $\Sigma$.
  - We assume there is a distribution $q(\cdot)$ that describes the probability of single letters.
- Then:
$$P(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} q(s[i])\, q(t[i])$$

7 Related Sequences (M)
- We assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor.
- Let p(a,b) be a distribution over pairs of letters.
- p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.
$$P(s[1..n], t[1..n] \mid M) = \prod_{i=1}^{n} p(s[i], t[i])$$
Compare to:
$$P(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} q(s[i])\, q(t[i])$$

8 Odds-Ratio Test for Alignment
$$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \prod_{i=1}^{n} \frac{p(s[i], t[i])}{q(s[i])\, q(t[i])}$$
If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R). If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).
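As a minimal sketch of the ratio Q, assuming gapless sequences of equal length and the independence assumption made earlier, the ratio factors into one term per aligned column. The two-letter alphabet and the probabilities below are hypothetical values chosen only for illustration; `odds_ratio` is an illustrative helper name, not part of the lecture.

```python
from math import prod

def odds_ratio(s, t, p, q):
    """Q = P(s,t|M) / P(s,t|R) for two equal-length, gapless sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length (no indels)")
    # Independence across positions lets the ratio factor per column.
    return prod(p[(a, b)] / (q[a] * q[b]) for a, b in zip(s, t))

# Hypothetical single-letter and pair distributions (illustrative only).
q = {"a": 0.5, "b": 0.5}
p = {("a", "a"): 0.4, ("b", "b"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1}

Q_same = odds_ratio("aab", "aab", p, q)  # every column is a match
Q_diff = odds_ratio("aab", "bba", p, q)  # every column is a mismatch
```

With these made-up numbers, the all-match pair gives Q above 1 (favoring M) and the all-mismatch pair gives Q below 1 (favoring R).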

9 Log Odds-Ratio Test for Alignment
Taking the logarithm of Q yields
$$\log Q = \sum_{i=1}^{n} \underbrace{\log \frac{p(s[i], t[i])}{q(s[i])\, q(t[i])}}_{\text{Score}(s[i],\,t[i])}$$
If log Q > 0, then s and t are more likely to be related. If log Q < 0, then they are more likely to be unrelated. (Usually we want some constant positive threshold to “declare” relatedness.) How can we relate this quantity to a scoring function?

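10 Probabilistic Interpretation of Scores
- We define the scoring function via
$$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$$
- Then the score of an alignment is the log-ratio between the two models:
$$\text{Score}(s,t) = \sum_{i=1}^{n} \sigma(s[i], t[i]) = \log \frac{P(s,t \mid M)}{P(s,t \mid R)}$$
  - Score > 0 ⇒ Model (M) is more likely
  - Score < 0 ⇒ Random (R) is more likely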

11 Probabilistic Interpretation of Scores
- We’ve defined the scoring function via
$$\sigma(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$$
- For example, suppose q(a)=0.1, q(b)=0.1, p(a,b)=0.001, p(a,a)=0.08, p(b,b)=0.07.
- Then score(a,b) = log(0.001/0.01) = log 0.1 < 0; score(a,a) = log(0.08/0.01) = log 8 > 0; score(b,b) = log(0.07/0.01) = log 7 > 0.
- Logarithms are convenient because they give an additive scoring function under the assumption of independence (log of a product = sum of logs).
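The arithmetic in this example can be checked with a short sketch. The slide does not state the base of the logarithm, so base 10 is an assumption here; the signs of the scores do not depend on that choice. The names `pair_prob` and `score` are illustrative.

```python
from math import log10

# Numbers taken from the example on the slide.
q = {"a": 0.1, "b": 0.1}
pair_prob = {("a", "b"): 0.001, ("a", "a"): 0.08, ("b", "b"): 0.07}

def score(x, y):
    """Log-odds score of aligning x with y (base 10 is an assumption)."""
    return log10(pair_prob[(x, y)] / (q[x] * q[y]))

scores = {xy: score(*xy) for xy in pair_prob}

# Additivity: the score of the two-column alignment aa / ab is the
# sum of its column scores, i.e. log10(8 * 0.1) = log10(0.8).
alignment_score = score("a", "a") + score("a", "b")
```

Note that the mismatch (a,b) scores negative and both matches score positive, as stated on the slide.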

12 Modeling Assumptions
- It is important to note that this interpretation depends on our modeling assumptions!
- For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.
- If we assume, for proteins, some joint distribution of letters that are nearby in 3D space after protein folding, then the likelihood ratio will again be different.

13 Estimating Probabilities
- Suppose we are given a long string s[1..n] of letters from $\Sigma$.
- We want to estimate the single-letter distribution $q(\cdot)$ that best corresponds to the sequence.
- How should we go about this? The theory of parameter estimation in statistics deals with this problem in detail, but in our case there is no need for heavy tools.

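14 Estimating $q(\cdot)$
- Suppose we are given a long string s[1..n] of letters from $\Sigma$.
  - s is the concatenation of many sequences in our database.
- We want to estimate the distribution $q(\cdot)$.
  - That is, q is defined per single letter.
Likelihood function:
$$L(s \mid q) = \prod_{i=1}^{n} q(s[i]) = \prod_{a \in \Sigma} q(a)^{N_a}$$
where $N_a$ is the number of times the letter a appears in s.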

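15 Estimating $q(\cdot)$ (cont.)
How do we define q? Likelihood function:
$$L(s \mid q) = \prod_{a \in \Sigma} q(a)^{N_a}$$
where $N_a$ is the number of times a appears in s[1..n].
ML (Maximum Likelihood) parameters:
$$q(a) = \frac{N_a}{n}$$
Namely, q(a) is simply the observed frequency of a in the sequence.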
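The maximum-likelihood estimate amounts to counting letters. The sketch below is illustrative only: the DNA string is a made-up stand-in for the concatenated database, and `estimate_q` and `log_likelihood` are hypothetical helper names.

```python
from collections import Counter
from math import log

def estimate_q(s):
    """ML estimate: q(a) is the observed frequency of a in s."""
    counts = Counter(s)
    n = len(s)
    return {a: c / n for a, c in counts.items()}

def log_likelihood(s, q):
    """log L(s|q) = sum over positions of log q(s[i])."""
    return sum(log(q[a]) for a in s)

s = "ACGTACGGAA"  # hypothetical stand-in for the concatenated database
q_ml = estimate_q(s)
q_uniform = {a: 0.25 for a in "ACGT"}

ll_ml = log_likelihood(s, q_ml)
ll_uniform = log_likelihood(s, q_uniform)
```

On this toy string the estimated frequencies give a higher log-likelihood than the uniform distribution, which is what maximizing the likelihood means.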

16 Estimating p(·,·)
Intuition:
- Find a pair of aligned sequences s[1..n], t[1..n].
- Estimate the probability of pairs:
$$p(a,b) = \frac{N_{a,b}}{n}$$
  where $N_{a,b}$ is the number of times a is aligned with b in (s,t).
- The sequences s and t are the concatenation of many aligned pairs of sequences from the database.
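The pair estimate can be sketched in the same counting style. This sketch counts ordered pairs (letter in s, letter in t), which is an assumption: the slide does not say whether (a,b) and (b,a) should be pooled. The two short sequences are hypothetical.

```python
from collections import Counter

def estimate_p(s, t):
    """Estimate p(a,b) as the fraction of columns where a is aligned with b."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    n = len(s)
    counts = Counter(zip(s, t))  # ordered pairs (letter in s, letter in t)
    return {pair: c / n for pair, c in counts.items()}

# Hypothetical gapless alignment of two short sequences.
s = "AAGC"
t = "AGGC"
p_hat = estimate_p(s, t)
```

Here each of the four columns is distinct, so every observed pair gets probability 1/4; pairs that never occur are simply absent from the dictionary.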

17 Problems in Estimating p(·,·)
- How do we find pairs of aligned sequences?
- How far back is the ancestor of the aligned sequences?
  - earlier divergence ⇒ low sequence similarity
  - recent divergence ⇒ high sequence similarity
- Does one letter mutate to the other, or are they both mutations of a common ancestor having yet another residue/nucleotide?

18 Estimating p(·,·) for amino acids
Definition: An accepted point mutation is an observed substitution in a gapless alignment of closely related (homologous) protein sequences, for example the hemoglobin alpha chain in humans and other mammals. Note that we cannot claim that such a mutation actually occurred: if we observe an A to B substitution, the “evolutionary truth” could be A to C to B (one substitution, two hidden mutations).

19 PAM-1 matrices (M. Dayhoff, 1970)
We take pairs of sequences that are no more than “1 unit of evolutionary distance apart”. This is interpreted as having substitutions in 1 percent (1/100) of the aligned sites (all others are matches, as we assumed no gaps). In this ensemble of alignments, we count the frequencies of single letters, $q(\cdot)$, and the frequencies of substitutions, p(·,·). The (a,b) entry in the PAM1 matrix is then
$$\text{PAM1}(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$$

20 PAM-1 matrices
Values in the matrix are usually multiplied by 10 and rounded.
Example: a portion of the PAM1 matrix (rows and columns A, R, N, D, C, Q, E).
Most off-diagonal entries are negative, but not all. For example, score(B,N) = score(B,D) = 7 (out of our small portion).

21 PAM-n matrices
Generalize PAM1: take pairs of sequences that are no more than “n units of evolutionary distance apart”. Interpret this as having substitutions in n percent (n/100) of the aligned sites (all others are matches, as we assumed no gaps). Nice idea, but it does not quite work: sequences with more than 3 or 4 percent substitutions cannot be aligned without gaps. In addition, we may want evolutionary units that are larger than 100 (e.g. PAM250 exists and is useful).

22 PAM-n matrices
Let M(A,B) be the conditional probability that in 1 evolutionary time unit, an A mutates to a B. In our terminology,
$$M(A,B) = \frac{p(A,B)}{q(A)}$$
What, then, is the probability that in n evolutionary time units an A mutates to a B? We should start with an A (an event whose probability is q(A)), and then go through all possible paths that lead from A to B in n steps.

23 PAM-n matrices
The probability of moving from A to B in n steps, summed over all possible paths, is given exactly by the (A,B) entry of the n-th power of the matrix M. Therefore, the probability of starting with A and moving to B in n steps equals $q(A)\, M^n(A,B)$.

24 PAM-n matrices
Recall the random (unrelated sequences) model, where the probability of having s[i]=A and t[i]=B equals q(A) q(B), whereas the probability of starting with A and moving to B in n steps equals $q(A)\, M^n(A,B)$. So $\text{PAM}_n(A,B)$, which is the log-odds score, equals
$$\text{PAM}_n(A,B) = \log \frac{q(A)\, M^n(A,B)}{q(A)\, q(B)} = \log \frac{M^n(A,B)}{q(B)}$$
Notice that for n=1 this is the same as PAM1.
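The construction of PAM-n from the one-step matrix can be sketched as follows. Everything concrete here is an assumption made for illustration: a two-letter alphabet, a symmetric one-step transition matrix with 1 percent substitutions, uniform letter frequencies, and base-10 logarithms. The helper names `mat_mul`, `mat_pow`, and `pam_n` are hypothetical.

```python
from math import log10

def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(M, n):
    """n-th power of M by repeated multiplication, starting from the identity."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result

def pam_n(M, q, n):
    """Log-odds entries log(M^n(A,B) / q(B)); base 10 is an assumption."""
    Mn = mat_pow(M, n)
    size = len(q)
    return [[log10(Mn[i][j] / q[j]) for j in range(size)] for i in range(size)]

# Hypothetical two-letter model: 1 percent chance of substitution per time unit.
M = [[0.99, 0.01],
     [0.01, 0.99]]
q = [0.5, 0.5]

pam1 = pam_n(M, q, 1)
pam50 = pam_n(M, q, 50)
```

In this toy model the off-diagonal score rises and the diagonal score falls as n grows, which matches the qualitative difference between matrices for small and large n.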

25 PAM-250 matrices
Values in the matrix are usually multiplied by 10 and rounded.
Example: a portion of the PAM250 matrix (rows and columns A, R, N, D, C, Q, E).
Many more off-diagonal entries are positive (compare to PAM1).

26 The BLOSUM substitution matrices
BLOSUM stands for BLOcks SUbstitution Matrix. The basic principles are similar to PAM, but the sequences used to compute substitution frequencies differ. BLOSUM uses a database containing multiple sequence alignments of proteins that are related (by function) but possibly evolutionarily remote. Conserved blocks (local alignments) are detected, and substitution frequencies are computed from them. BLOSUMn reflects more conserved sequences for higher values of n.

27 BLOSUM vs. PAM BLOSUM 62 is the default matrix in BLAST 2.0.