Computational Genomics Lecture #2b


Computational Genomics Lecture #2b
Scoring functions for DNA and AAs; multiple sequence alignment.
Background readings: Chapters 2.5, 2.7 in the textbook Biological Sequence Analysis, Durbin et al., 2001. Chapters 3.5.1-3.5.3, 3.6.2 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997. Chapter 15 in Gusfield’s book.
Much of this class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.

Scoring Functions, Reminder
So far, we discussed dynamic programming algorithms for global alignment and local alignment. All of these assumed a scoring function that determines the value of perfect matches, substitutions, insertions, and deletions.

Where does the scoring function come from?
We have defined an additive scoring function by specifying a function $\sigma(\cdot,\cdot)$ such that $\sigma(x,y)$ is the score of replacing x by y, $\sigma(x,-)$ is the score of deleting x, and $\sigma(-,x)$ is the score of inserting x. But how do we come up with the “correct” score? Answer: by encoding experience of what similar sequences are for the task at hand. Similarity depends on time, evolution trends, and sequence types.

Why is a probabilistic setting appropriate for defining and interpreting a scoring function?
Similarity is probabilistic in nature because biological changes such as mutation, recombination, and selection are random events. A probabilistic setting lets us answer questions such as: How probable is it for two sequences to be similar? Is the similarity found significant or spurious? How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?

A Probabilistic Model
For starters, we will focus on alignment without indels. For now, we assume each position (nucleotide or amino acid) is independent of the other positions. We consider two options. M: the sequences are Matched (related). R: the sequences are Random (unrelated).

Unrelated Sequences
Our random model R of unrelated sequences is simple: each position is sampled independently from a distribution over the alphabet $\Sigma$. We assume there is a distribution $p(\cdot)$ that describes the probability of letters in such positions. Then:
$$\Pr(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} p(s[i])\, p(t[i])$$

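Related Sequences
We assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor. Let $q(a,b)$ be a distribution over pairs of letters: $q(a,b)$ is the probability that some ancestral letter evolved into this particular pair of letters. Then:
$$\Pr(s[1..n], t[1..n] \mid M) = \prod_{i=1}^{n} q(s[i], t[i])$$
Compare to:
$$\Pr(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} p(s[i])\, p(t[i])$$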

Odds-Ratio Test for Alignment
$$Q = \frac{\Pr(s,t \mid M)}{\Pr(s,t \mid R)} = \prod_{i} \frac{q(s[i],t[i])}{p(s[i])\, p(t[i])}$$
If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R). If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

Log Odds-Ratio Test for Alignment
Taking the logarithm of Q yields
$$\log Q = \sum_{i} \log \frac{q(s[i],t[i])}{p(s[i])\, p(t[i])} = \sum_{i} \mathrm{Score}(s[i],t[i])$$
If log Q > 0, then s and t are more likely to be related. If log Q < 0, then they are more likely to be unrelated. How can we relate this quantity to a score function?

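Probabilistic Interpretation of Scores
We define the scoring function via
$$\sigma(a,b) = \log \frac{q(a,b)}{p(a)\, p(b)}$$
Then the score of an alignment is the log-ratio between the two models:
$$\mathrm{Score}(s,t) = \sum_{i} \sigma(s[i],t[i]) = \log \frac{\Pr(s,t \mid M)}{\Pr(s,t \mid R)}$$
Score > 0 ⇒ the related model (M) is more likely. Score < 0 ⇒ the random model (R) is more likely.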
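
A minimal sketch of this log-odds score for an ungapped alignment. The two-letter alphabet and the numbers in `p` and `q` are hypothetical toy values chosen only so that q is a valid pair distribution with marginals p; the function name `log_odds_score` is likewise an illustration.

```python
import math

def log_odds_score(s, t, q, p):
    """Sum over aligned positions of log( q(a,b) / (p(a) * p(b)) )."""
    if len(s) != len(t):
        raise ValueError("an ungapped alignment needs strings of equal length")
    return sum(math.log(q[(a, b)] / (p[a] * p[b])) for a, b in zip(s, t))

# Hypothetical toy distributions over the alphabet {A, B}.
p = {"A": 0.5, "B": 0.5}                       # background (random model R)
q = {("A", "A"): 0.4, ("B", "B"): 0.4,         # pair distribution (model M)
     ("A", "B"): 0.1, ("B", "A"): 0.1}

related = log_odds_score("AABB", "AABB", q, p)    # every position matches
unrelated = log_odds_score("AABB", "BBAA", q, p)  # every position mismatches
print(related > 0, unrelated < 0)
```

With these toy numbers a match contributes log(0.4/0.25) > 0 and a mismatch log(0.1/0.25) < 0, so the sign of the total indicates which model is favoured.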

Modeling Assumptions
It is important to note that this interpretation depends on our simplified modeling assumptions! For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form. If we assume, for proteins, some joint distribution of letters that are nearby in 3D space after protein folding, then the likelihood ratio will again be different.

Estimating Probabilities
Suppose we are given a long string s[1..n] of letters from $\Sigma$. We want to estimate the distribution $q(\cdot)$ that generated the sequence. How should we go about this? We build on the theory of parameter estimation in statistics, using either maximum likelihood estimation or the Bayesian approach (later on). (In these estimation slides, $q(\cdot)$ denotes the single-letter distribution and $p(\cdot,\cdot)$ the pair distribution.)

Estimating q() Suppose we are given a long string s[1..n] of letters from  s can be the concatenation of all sequences in our database We want to estimate the distribution q() That is, q is defined per single letters Likelihood function:

Estimating q() (cont.) How do we define q? Intuitively Likelihood function: ML parameters (Maximum Likelihood)

Estimating $p(\cdot,\cdot)$
Intuition: find a pair of aligned sequences s[1..n], t[1..n], and estimate the probability of pairs:
$$p(a,b) = \frac{N_{a,b}}{n}$$
where $N_{a,b}$ is the number of times a is aligned with b in (s,t). The sequences s and t can be the concatenation of many aligned pairs from the database.
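
The pair estimate can be sketched the same way, by counting aligned columns. The two aligned strings below are invented for illustration, `ml_pair_frequencies` is a hypothetical name, and the count is directional (a in s over b in t); symmetrizing the counts is a common variant that this sketch does not perform.

```python
from collections import Counter

def ml_pair_frequencies(s, t):
    """Estimate the pair distribution as (times a is aligned with b) / n."""
    if len(s) != len(t):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter(zip(s, t))
    n = len(s)
    return {pair: c / n for pair, c in counts.items()}

# Invented ungapped alignment of length 6 with a single substitution (T over A).
pairs = ml_pair_frequencies("ACGTAC", "ACGAAC")
print(pairs)
```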

Problems in Estimating $p(\cdot,\cdot)$
How do we find pairs of aligned sequences? How far is the ancestor? Earlier divergence ⇒ low sequence similarity; recent divergence ⇒ high sequence similarity. Does one letter mutate to the other, or are they both mutations of a common ancestor having yet another residue or nucleotide?

Scoring Matrices
Deal with DNA first (simpler), then AA (not too bad either).

What is it & why?
Let the alphabet contain N letters (N = 4 for nucleotides and N = 20 for amino acids). A scoring matrix is an N x N matrix whose entry (i,j) shows the relationship between the i-th and j-th letters: a positive number if letter i is likely to mutate into letter j, and a negative number otherwise. The magnitude shows the degree of proximity. The matrix is symmetric.

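Scoring Matrices for DNA
Three matrices over A, C, G, T:
Identity: 1 for a match, 0 for a mismatch.
BLAST: 1 for a match, -3 for a mismatch.
Transitions & transversions: 1 for a match, -1 for a transition (A<->G, C<->T), -5 for a transversion.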
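
As an illustration, the three DNA matrices named on this slide can be generated from three numbers: a match score, a transition score, and a transversion score. The values 1, -3, -1 and -5 are those shown on the slide; the 0 mismatch score for the identity matrix and the helper name `dna_matrix` are assumptions of this sketch.

```python
PURINES = {"A", "G"}  # the pyrimidines are C and T

def dna_matrix(match, transition, transversion):
    """Build a symmetric 4x4 scoring matrix as a dict keyed by letter pairs."""
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                matrix[(a, b)] = match
            elif (a in PURINES) == (b in PURINES):
                # purine<->purine or pyrimidine<->pyrimidine: a transition
                matrix[(a, b)] = transition
            else:
                matrix[(a, b)] = transversion
    return matrix

identity = dna_matrix(1, 0, 0)
blast = dna_matrix(1, -3, -3)
ts_tv = dna_matrix(1, -1, -5)
print(ts_tv[("A", "G")], ts_tv[("A", "C")])
```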

The BLOSUM45 Matrix
(Matrix figure; rows and columns are indexed by the amino acids A R N D C Q E G H I L K M F P S T W Y V.)

Scoring Matrices for Amino Acids
Chemical similarities: non-polar, hydrophobic (G, A, V, L, I, M, F, W, P); polar, hydrophilic (S, T, C, Y, N, Q); electrically charged (D, E, K, R, H). Requires expert knowledge.
Genetic code (nucleotide substitutions): E: GAA, GAG; D: GAU, GAC; F: UUU, UUC.
Actual substitutions: PAM, BLOSUM.

Scoring Matrices: Actual Substitutions
Manually align proteins and look for amino acid substitutions. Entry ~ log(freq(observed) / freq(expected)). These are log-odds matrices.

BLOSUM
BLOcks Substitution Matrices (Henikoff & Henikoff, 1992). The next slides are taken from lecture notes by Tamer Kahveci, CISE Department, University of Florida (www.cise.ufl.edu/~tamer/teaching/fall2004/lectures/03-CAP5510-Fall04.ppt).

BLOSUM Matrix
Begin with a set of protein sequences and obtain aligned blocks: ~2000 blocks from 500 families of related proteins. A block is the ungapped alignment of a highly conserved region of a family of proteins. The MOTIF program is used to find blocks. Substitutions in these blocks are used to compute the BLOSUM matrix. Example (on the slide, the blocks are separated by unaligned sequence, shown as …):
block 1   block 2          block 3
WWYIR     CASILRKIYIYGPV   GVSRLRTAYGGRKNRG
WFYVR     CASILRHLYHRSPA   GVGSITKIYGGRKRNG
WYYVR     AAAVARHIYLRKTV   GVGRLRKVHGSTKNRG
WYFIR     AASICRHLYIRSPA   GIGSFEKIYGGRRRRG

Constructing the Matrix
1. Count the frequency of occurrence of each amino acid. This gives the background distribution $p_a$.
2. Count the number of times amino acid a is aligned with amino acid b: $f_{ab}$. A block of width w and depth s contributes $w s (s-1)/2$ pairs. Denote by $n_p$ the total number of pairs.
3. Compute the occurrence probability of each pair: $q_{ab} = f_{ab} / n_p$.
4. Compute the expected probability of occurrence of each pair: $e_{ab} = 2 p_a p_b$ if $a \neq b$, and $e_{aa} = p_a^2$ otherwise.
5. Compute twice the log-likelihood ratios (the factor 2 gives scores in half-bit units), normalize, and round to the nearest integer: $s_{ab} = \mathrm{round}(2 \log_2 (q_{ab} / e_{ab}))$.
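
The construction can be sketched on a single toy block. This is a simplified illustration under stated assumptions: it uses only the four rows WWYIR, WFYVR, WYYVR, WYFIR, takes the background distribution directly from letter counts as described on the slide, and skips the sequence clustering and weighting that the published BLOSUM procedure applies. The name `blosum_like` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like(block):
    """Return (scores, n_pairs) for one ungapped block, given as equal-length rows."""
    depth, width = len(block), len(block[0])
    pair_counts = Counter()    # f_ab, counted over unordered pairs within each column
    letter_counts = Counter()  # used for the background distribution p_a
    for column in zip(*block):
        letter_counts.update(column)
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    n_pairs = sum(pair_counts.values())  # equals width * depth * (depth - 1) / 2
    p = {a: c / (depth * width) for a, c in letter_counts.items()}
    scores = {}
    for (a, b), f_ab in pair_counts.items():
        q_ab = f_ab / n_pairs
        e_ab = p[a] * p[b] * (1 if a == b else 2)
        scores[(a, b)] = round(2 * math.log2(q_ab / e_ab))  # half-bit units
    return scores, n_pairs

block1 = ["WWYIR", "WFYVR", "WYYVR", "WYFIR"]
scores, n_pairs = blosum_like(block1)
print(n_pairs, scores[("W", "W")], scores[("I", "V")])
```

Pairs that never co-occur in a column get no entry here; a real matrix needs far more data so that every pair of amino acids is observed.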

Computation of BLOSUM-X
The amount of similarity in blocks has a great effect on the BLOSUM score. BLOSUM-X is generated from blocks in which sequences that are at least X% identical are clustered and counted as a single sequence. For example, a BLOSUM62 matrix is calculated after clustering sequences with at least 62% identity. So BLOSUM80 represents closer sequences (more recent divergence) than BLOSUM62. On the web, BLAST uses BLOSUM80, BLOSUM62 (the default), or BLOSUM45.

BLOSUM 62 Matrix
Check scores for: M I L V (small hydrophobic); N D E Q (acid, hydrophilic); H R K (basic); F Y W (aromatic); S T P A G (small hydrophilic); C (sulphydryl).

PAM vs. BLOSUM
Equivalent PAM and BLOSUM matrices: PAM100 = BLOSUM90. BLOSUM62 is the default matrix to use.

And now, ladies and gentlemen, boys and girls, the holy grail: Multiple Sequence Alignment

Multiple Sequence Alignment
S1 = AGGTC, S2 = GTTCG, S3 = TGAAC. (Figure: two possible alignments of these three strings.)

Multiple Sequence Alignment
Aligning more than two sequences. Definition: Given strings S1, S2, …, Sk, a multiple (global) alignment maps them to strings S’1, S’2, …, S’k that may contain blanks, where: (1) |S’1| = |S’2| = … = |S’k|, and (2) the removal of blanks from S’i leaves Si.
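
The two conditions of the definition translate directly into a check. The aligned rows below are one possible alignment of AGGTC, GTTCG, TGAAC written out for illustration (an assumption, not necessarily the alignment drawn on the slides), and `is_multiple_alignment` is a hypothetical helper name.

```python
def is_multiple_alignment(originals, aligned, blank="-"):
    """Check equal row lengths and that removing blanks restores each original."""
    if len(originals) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) == 1
    restores = all(row.replace(blank, "") == s
                   for row, s in zip(aligned, originals))
    return same_length and restores

originals = ["AGGTC", "GTTCG", "TGAAC"]
aligned = ["AGGT-C-",
           "-G-TTCG",
           "TG-AAC-"]
print(is_multiple_alignment(originals, aligned))
```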

Multiple alignments
We use a matrix to represent the alignment of k sequences, K = (x1, ..., xk). We assume no column consists solely of blanks. The common scoring functions give a score to each column, and set score(K) = $\sum_i$ score(column(i)). (Figure: an example alignment matrix of four sequences x1, ..., x4 over amino-acid letters and blanks.) For k = 10, a scoring function has $2^k - 1 > 1000$ entries to specify. The scoring function is symmetric, that is, the order of arguments does not matter: score(I,_,I,V) = score(_,I,I,V).

SUM OF PAIRS
A common scoring function is SP, the sum of scores of the projected pairwise alignments: SPscore(K) = $\sum_{i<j}$ score(xi, xj). Note that we need to specify score(-,-), because a column may have several blanks (as long as not all entries are blanks). In order for this score to be written as $\sum_i$ score(column(i)), we set score(-,-) = 0. Why? Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (lines).

SUM OF PAIRS
Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A, where the pairwise alignment function score(xi, xj) is additive.

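Example
Consider the following alignment:
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance and $\delta(-,-) = 0$, the induced pairwise alignments have values 3 (rows 1 and 2), 4 (rows 1 and 3), and 5 (rows 2 and 3), so this alignment has an SP value of 3 + 4 + 5 = 12.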
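
The SP value of this example can be checked with a few lines. The sketch assumes unit-cost edit distance (0 for equal symbols, including two blanks, and 1 otherwise); `delta` and `sp_value` are hypothetical names.

```python
from itertools import combinations

def delta(a, b):
    """Unit-cost distance between two symbols; two blanks cost 0."""
    return 0 if a == b else 1

def sp_value(rows):
    """Sum, over all pairs of rows, of the cost of the induced pairwise alignment."""
    return sum(delta(a, b)
               for x, y in combinations(rows, 2)
               for a, b in zip(x, y))

rows = ["ac-cdb-",
        "-c-adbd",
        "a-bcdad"]
print(sp_value(rows))  # 3 + 4 + 5 = 12
```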

Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes SP-score(K) = $\sum_{i<j}$ score(xi, xj). Instead of a 2-dimensional table, we now have a k-dimensional table to fill. For each vector i = (i1, ..., ik), compute an optimal multiple alignment for the k prefix sequences x1(1..i1), ..., xk(1..ik). The adjacent entries are those that differ in each index by one or zero. Each entry depends on $2^k - 1$ adjacent entries.

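The idea via K=2
Recall the notation: V[i,j] is the value of an optimal alignment of the prefixes s[1..i] and t[1..j], and the following recurrence for V:
$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1], t[j+1]) \\ V[i,j+1] + \sigma(s[i+1], -) \\ V[i+1,j] + \sigma(-, t[j+1]) \end{cases}$$
Note that the new cell index (i+1, j+1) differs from the previous indices by one of the $2^k - 1$ non-zero binary vectors (1,1), (1,0), (0,1).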
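
A minimal sketch of the k = 2 case: the table V is filled so that each cell is computed from its three neighbours, one per non-zero binary vector. The scoring values (match +1, mismatch -1, constant gap score -2) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(s, t, sigma, gap):
    """Fill V, where V[i][j] is the best score of aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        V[i + 1][0] = V[i][0] + gap          # prefix of s against blanks
    for j in range(m):
        V[0][j + 1] = V[0][j] + gap          # prefix of t against blanks
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),  # step (1,1): align two letters
                V[i][j + 1] + gap,            # step (1,0): letter of s over a blank
                V[i + 1][j] + gap,            # step (0,1): blank over a letter of t
            )
    return V[n][m]

def match_mismatch(a, b):
    return 1 if a == b else -1

print(global_alignment_score("ACGT", "AGT", match_mismatch, -2))
```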

Multiple Sequence Alignment
Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment. Computational cost: instead of a 2-dimensional table we now have a k-dimensional table to fill. Each dimension’s size is n+1. Each entry depends on $2^k - 1$ adjacent entries. Number of evaluations of the scoring function: $O(2^k n^k)$.
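
The k-dimensional table can be sketched with memoized recursion over index vectors. This illustration minimizes the SP value under unit-cost edit distance with a blank-blank cost of 0 (a distance formulation, assumed here for simplicity), and it is only practical for a few short strings because of the exponential cost discussed above. The names `delta` and `sp_optimal` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations, product

def delta(a, b):
    """Unit-cost distance between two symbols; two blanks cost 0."""
    return 0 if a == b else 1

def sp_optimal(seqs):
    """Minimum SP value over all global multiple alignments of seqs."""
    k = len(seqs)
    # The 2^k - 1 non-zero binary vectors: which sequences contribute a letter.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def V(idx):
        if not any(idx):
            return 0  # all prefixes empty
        best = float("inf")
        for m in moves:
            prev = tuple(i - d for i, d in zip(idx, m))
            if min(prev) < 0:
                continue  # this move would step outside the table
            column = [seqs[r][idx[r] - 1] if m[r] else "-" for r in range(k)]
            cost = sum(delta(a, b) for a, b in combinations(column, 2))
            best = min(best, V(prev) + cost)
        return best

    return V(tuple(len(s) for s in seqs))

print(sp_optimal(("ab", "b", "ab")))
```

For k = 2 this reduces to the ordinary edit distance, which gives a convenient sanity check.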

Complexity of the DP approach
Number of cells: $n^k$. Number of adjacent cells: $O(2^k)$. Computation of the SP score for each column(i,b) is $O(k^2)$. Total run time is $O(k^2 2^k n^k)$, which is totally unacceptable! Maybe one can do better?

But MSA is Intractable
There is not much hope for a polynomial algorithm, because the problem has been shown to be NP-complete (the proof is quite tricky and recent; some previous proofs were bogus). Look at Isaac Elias’s presentation of the NP-completeness proof. We need heuristics or approximations to reduce the running time.

Multiple Sequence Alignment: Approximation Algorithm
Now we will see an $O(k^2 n^2)$ multiple alignment algorithm for the SP-score that approximates the optimal solution’s score to within a factor of at most $2(1 - 1/k) < 2$.

Star Alignments
Rather than summing up all pairwise alignments, select a fixed sequence S1 as a center, and set Star-score(K) = $\sum_{j>1}$ score(S1, Sj). The algorithm to find an optimal alignment: at each step, add another sequence aligned with S1, keeping old gaps and possibly adding new ones.

Multiple Sequence Alignment: Approximation Algorithm
Polynomial time algorithm. Assumption: the function δ is a distance function: $\delta(x,x) = 0$, and $\delta(x,z) \le \delta(x,y) + \delta(y,z)$ (triangle inequality). Let D(S,T) be the value of the minimum global alignment between S and T.

Multiple Sequence Alignment: Approximation Algorithm (cont.)
Polynomial time algorithm. The input is a set Γ of k strings Si.
1. Find the “center string” S1 that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$.
2. Call the remaining strings S2, …, Sk.
3. Add one string at a time to the multiple alignment, which initially contains only S1, as follows. Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1. Add Si by running the dynamic programming algorithm on S’1 and Si to produce S’’1 and S’i. Adjust S’2, …, S’i-1 by adding spaces to those columns where spaces were added to get S’’1 from S’1. Replace S’1 by S’’1.
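
The three steps can be sketched as follows, assuming unit-cost edit distance with a blank-blank cost of 0. All function names are hypothetical, ties are broken arbitrarily, and the aligned rows are returned in the input order rather than with the center first.

```python
def cost(a, b):
    """Unit-cost distance between two symbols; two blanks cost 0."""
    return 0 if a == b else 1

def dp_table(c, t):
    """D[i][j] = minimum cost of aligning c[:i] (may contain blanks) with t[:j]."""
    n, m = len(c), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(c[i - 1], "-")
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(c[i - 1], t[j - 1]),
                          D[i - 1][j] + cost(c[i - 1], "-"),
                          D[i][j - 1] + 1)
    return D

def align_columns(c, t):
    """Trace back; per column return (symbol for t, whether the column is new in c)."""
    D = dp_table(c, t)
    i, j, cols = len(c), len(t), []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(c[i - 1], t[j - 1]):
            cols.append((t[j - 1], False)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(c[i - 1], "-"):
            cols.append(("-", False)); i -= 1
        else:
            cols.append((t[j - 1], True)); j -= 1   # new blank column in the center
    cols.reverse()
    return cols

def center_star(seqs):
    k = len(seqs)
    # Step 1: the center minimizes the sum of pairwise distances to the others.
    center = min(range(k),
                 key=lambda i: sum(dp_table(seqs[i], seqs[j])[-1][-1] for j in range(k)))
    order = [center] + [i for i in range(k) if i != center]   # step 2
    rows = [seqs[center]]
    for idx in order[1:]:                                     # step 3
        cols = align_columns(rows[0], seqs[idx])
        widened = []
        for row in rows:   # add blanks to existing rows where the center gained them
            chars = iter(row)
            widened.append("".join("-" if new else next(chars) for _, new in cols))
        widened.append("".join(symbol for symbol, _ in cols))
        rows = widened
    aligned = [None] * k
    for pos, idx in enumerate(order):
        aligned[idx] = rows[pos]
    return aligned

seqs = ["AGGTC", "GTTCG", "TGAAC"]
msa = center_star(seqs)
print("\n".join(msa))
```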

Multiple Sequence Alignment: Approximation Algorithm (cont.)
Time analysis: Choosing S1 requires running the dynamic programming algorithm $\binom{k}{2}$ times, for a total of $O(k^2 n^2)$. When Si is added to the multiple alignment, the length of S’1 is at most i·n, so the time to add all k strings is
$$\sum_{i=2}^{k} O(i n \cdot n) = O(k^2 n^2)$$

Multiple Sequence Alignment: Approximation Algorithm (cont.)
Performance analysis: M is the alignment produced by this algorithm; d(i,j) is the distance M induces on the pair Si, Sj; M* is an optimal alignment. For all i, d(1,i) = D(S1,Si) (we performed an optimal alignment between S’1 and Si, and $\delta(-,-) = 0$).

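Multiple Sequence Alignment: Approximation Algorithm (cont.)
Performance analysis. Let v(M) denote the SP value of M and d*(i,j) the distance M* induces on the pair Si, Sj. Counting each pair twice:
$$2 v(M) = \sum_{i=1}^{k} \sum_{j \ne i} d(i,j) \le \sum_{i=1}^{k} \sum_{j \ne i} \left[ d(1,i) + d(1,j) \right] = 2(k-1) \sum_{l=2}^{k} D(S_1,S_l) \quad \text{(triangle inequality)}$$
$$2 v(M^*) = \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j \ne i} D(S_i,S_j) \ge k \sum_{l=2}^{k} D(S_1,S_l) \quad \text{(definition of } S_1 \text{)}$$
Hence $v(M) / v(M^*) \le 2(k-1)/k = 2(1 - 1/k)$.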

Multiple Sequence Alignment: Approximation Algorithm
The algorithm relies heavily on the scoring function being a distance. It produces an alignment whose SP score is at most twice the minimum. What if the scoring function were a similarity? Can we get an efficient algorithm whose score is at least half the maximum? A third of the maximum? … We dunno!

Tree Alignments
Assume that there is a tree T = (V,E) whose leaves are the sequences. Associate a sequence with each internal node. Tree-score(K) = $\sum_{(i,j) \in E}$ score(xi, xj). Finding the optimal assignment of sequences to the internal nodes is NP-hard. We will meet this problem again in the study of phylogenetic trees (it is related to the parsimony problem).

Multiple Sequence Alignment Heuristics
Example: 4 sequences A, B, C, D.
A. Perform all 6 pairwise alignments. Find scores. Build a “similarity tree” (figure: a tree over B, D, A, C drawn on an axis from distant to similar).
B. Multiple alignment following the tree from A: align the most similar pair (B, D), allowing gaps to optimize the alignment. Align the next most similar pair (A, C). Now, “align the alignments”, introducing gaps if necessary to optimize the alignment of (BD) with (AC).

The tree-based progressive method for multiple sequence alignment, used in practice (Clustal): (a) a tree (dendrogram) obtained by cluster analysis; (b) pairwise alignment of 2 sequences’ alignments. (Figure: dendrogram over the sequences DEHUG3, DEPGG3, DEBYG3, DEZYG3, DEBSGF, with the aligned rows LWRDGRGALQ, LWRGGRGAAQ, DWR-GRTASG, LRR-ARTASA, L-RGARAAAE.) (Modified from Speed’s ppt presentation; see p. 81 in Kanehisa’s book.)

Visualization of Alignment Helps