Lecture 6: Sequence Alignment Statistics


CS 5263 Bioinformatics
Lecture 6: Sequence Alignment Statistics

Review of last lecture
How to model gaps more accurately?

GACGCCGAACG
|||||   |||
GACGC---ACG

GACGCCGAACG
|||| | | ||
GACG-C-A-CG

Under the linear gap model both alignments get the same score: Score = 8 x m - 3 x d.
Gaps usually occur in bunches:
During evolution, chunks of DNA may be lost or inserted entirely.
When aligning genomic sequences against cDNAs, the cDNAs are spliced versions of the genomic sequences.

Model gaps more accurately
Previous model: a gap of length n incurs penalty n x d.
General: a convex gap function γ(n), e.g. γ(n) = c * sqrt(n).

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ \max_{k=0 \ldots i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i,k) - \gamma(j-k) \end{cases}$$

Running time: O((M+N)MN) (cubic)
Space: O(NM)

Compromise: affine gaps
γ(n) = d + (n - 1)e, where d is the gap open penalty and e is the gap extension penalty.
Example with match: 2, gap open: -5, gap extension: -1

GACGCCGAACG
|||||   |||
GACGC---ACG
Score: 8x2 - 5 - 2 = 9

GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score: 8x2 - 3x5 = 1

We want to find the optimal alignment with affine gap penalty in O(MN) time and O(MN), or better O(M+N), memory.
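The two scores on this slide can be checked with a short sketch that scores a given gapped alignment under the affine model. The function name and the mismatch value (-2, borrowed from the later worked example) are assumptions for illustration; the slide itself only fixes match, gap open and gap extension.

```python
def affine_alignment_score(row_x, row_y, match=2, mismatch=-2,
                           gap_open=-5, gap_extend=-1):
    """Score an alignment given as two equal-length rows with '-' for gaps."""
    assert len(row_x) == len(row_y)
    score = 0
    prev_gap = None  # row that carried a gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gap_row = "x" if a == "-" else "y"
            # First column of a gap pays d; every further column pays e.
            score += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # one gap of length 3
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # three gaps of length 1
```

The single long gap is penalized once for opening and twice for extension, while the three scattered gaps each pay the full opening penalty, which is why the first alignment wins.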

Dynamic programming
Consider three sub-problems when aligning x1..xi and y1..yj:
F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to a gap
Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to a gap

The three sub-problems form a finite-state machine with states F (start state), Ix and Iy. Each transition reads one alignment column (input) and emits a score (output):

Current state | Input    | Output    | Next state
F             | (xi, yj) | σ(xi, yj) | F
F             | (-, yj)  | d         | Ix
F             | (xi, -)  | d         | Iy
Ix            | (xi, yj) | σ(xi, yj) | F
Ix            | (-, yj)  | e         | Ix
Iy            | (xi, yj) | σ(xi, yj) | F
Iy            | (xi, -)  | e         | Iy

Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM. Three alignments of AAC and ACT with their state paths (F is the start state):

AAC
|||
ACT
State path: F-F-F-F

AAC-
 ||
-ACT
State path: F-Iy-F-F-Ix

AAC-
| |
A-CT
State path: F-F-Iy-F-Ix

Optimal alignment: a state path that reads the two sequences such that the total output score is the highest.

Entering state F (xi aligns to yj):

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i,y_j) \\ I_x(i-1,j-1) + \sigma(x_i,y_j) \\ I_y(i-1,j-1) + \sigma(x_i,y_j) \end{cases}$$

Entering state Ix (yj aligns to a gap):

$$I_x(i,j) = \max \begin{cases} F(i,j-1) + d \\ I_x(i,j-1) + e \end{cases}$$

Entering state Iy (xi aligns to a gap):

$$I_y(i,j) = \max \begin{cases} F(i-1,j) + d \\ I_y(i-1,j) + e \end{cases}$$

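The three recurrences together:

$$F(i,j) = \sigma(x_i,y_j) + \max \begin{cases} F(i-1,j-1) & \text{continuing alignment} \\ I_x(i-1,j-1) & \text{closing gaps in } x \\ I_y(i-1,j-1) & \text{closing gaps in } y \end{cases}$$

$$I_x(i,j) = \max \begin{cases} F(i,j-1) + d & \text{opening a gap in } x \\ I_x(i,j-1) + e & \text{gap extension in } x \end{cases}$$

$$I_y(i,j) = \max \begin{cases} F(i-1,j) + d & \text{opening a gap in } y \\ I_y(i-1,j) + e & \text{gap extension in } y \end{cases}$$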

Worked example: x = GCAC, y = GCC, with m = 2, s = -2, d = -5, e = -1.
Three tables are filled together, with rows indexed by x and columns by y:
F: aligned on both
Ix: insertion on x (yj aligned to a gap)
Iy: insertion on y (xi aligned to a gap)

Initialization: F(0,0) = 0; Ix(0,j) = -5, -6, -7 for j = 1, 2, 3; Iy(i,0) = -5, -6, -7, -8 for i = 1, 2, 3, 4.

Steps, in filling order:
F(1,1) = σ(G,G) + F(0,0) = 2
F(1,2) = σ(G,C) + Ix(0,1) = -2 - 5 = -7
F(1,3) = σ(G,C) + Ix(0,2) = -2 - 6 = -8
Ix(1,2) = F(1,1) + d = -3
Ix(1,3) = max{F(1,2) + d, Ix(1,2) + e} = max{-12, -4} = -4
F(2,1) = σ(C,G) + Iy(1,0) = -2 - 5 = -7
F(2,2) = σ(C,C) + F(1,1) = 2 + 2 = 4
F(2,3) = σ(C,C) + Ix(1,2) = 2 - 3 = -1
Ix(2,2) = F(2,1) + d = -12
Ix(2,3) = max{F(2,2) + d, Ix(2,2) + e} = max{-1, -13} = -1
Iy(2,1) = F(1,1) + d = -3
Iy(2,2) = F(1,2) + d = -12
Iy(2,3) = F(1,3) + d = -13
The remaining cells are filled the same way, ending with
Iy(3,2) = max{F(2,2) + d, Iy(2,2) + e} = max{-1, -13} = -1
F(4,3) = σ(C,C) + Iy(3,2) = 2 - 1 = 1

Tracing back from F(4,3) gives the optimal alignment, with score 3 x 2 - 5 = 1:

x  GCAC
   || |
y  GC-C
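The three-table recurrences and the GCAC / GCC example can be reproduced with a minimal sketch like the following. The function name, the use of negative infinity for unreachable cells, and the boundary values (F(0,0) = 0, first row of Ix and first column of Iy filled with d + (n - 1)e) are assumptions consistent with the tables on the slides, not a definitive implementation.

```python
NEG_INF = float("-inf")


def affine_gap_tables(x, y, m=2, s=-2, d=-5, e=-1):
    """Fill F, Ix, Iy for global alignment with affine gaps; return score and tables."""
    n, k = len(x), len(y)
    F = [[NEG_INF] * (k + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (k + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (k + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for j in range(1, k + 1):          # leading gap in x: y1..yj aligned to gaps
        Ix[0][j] = d + (j - 1) * e
    for i in range(1, n + 1):          # leading gap in y: x1..xi aligned to gaps
        Iy[i][0] = d + (i - 1) * e
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            sigma = m if x[i - 1] == y[j - 1] else s
            F[i][j] = sigma + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    best = max(F[n][k], Ix[n][k], Iy[n][k])
    return best, F, Ix, Iy


score, F, Ix, Iy = affine_gap_tables("GCAC", "GCC")
print(score)
```

Each cell is computed from a constant number of neighbors, so the running time is O(MN), as required on the affine-gap slide. A traceback would additionally need to remember which of the three tables supplied each maximum.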

Today: statistics of alignment
Where does σ(xi, yj) come from?
Are two aligned sequences actually related?

Probabilistic model of alignments
We'll first focus on protein alignments without gaps.
Given an alignment, we can consider two possible models:
R: the sequences are related by evolution
U: the sequences are unrelated
How can we distinguish these two models?
How is this view related to the amino-acid substitution matrix?

Model for unrelated sequences
Assume each position of the alignment is independently sampled from some distribution of amino acids.
ps: probability of amino acid s in the sequences
The probability of seeing an amino acid s aligned to an amino acid t by chance is Pr(s, t | U) = ps * pt.
The probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn by chance is

$$\Pr(x, y \mid U) = \prod_{i} p_{x_i} \prod_{i} p_{y_i}$$

Model for related sequences
Assume each pair of aligned amino acids evolved from a common ancestor.
Let qst be the probability that amino acid s in one sequence is related to t in another sequence.
The probability of an alignment of x and y is given by

$$\Pr(x, y \mid R) = \prod_{i} q_{x_i y_i}$$

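Probabilistic model of alignments
How can we decide which model (U or R) is more likely?
One principled way is to consider the relative likelihood of the two models (the odds ratio):

$$\frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \frac{\prod_i q_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

A higher ratio means that R is more likely than U.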

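Log odds ratio
Taking the logarithm, we get

$$\log \frac{\Pr(x, y \mid R)}{\Pr(x, y \mid U)} = \sum_i \log \frac{q_{x_i y_i}}{p_{x_i} p_{y_i}}$$

Recall that the score of an alignment is given by

$$S = \sum_i \sigma(x_i, y_i)$$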

Therefore, if we define

$$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$

we are actually defining the alignment score as the log odds ratio between the two models R and U.

How to get the probabilities?
ps can be counted from the available protein sequences.
But how do we get qst (the probability that s and t have a common ancestor)?
It is counted from trusted alignments of related sequences.

Protein Substitution Matrices
Two popular sets of matrices for protein sequences:
PAM matrices [Dayhoff et al, 1978]: better for aligning closely related sequences
BLOSUM matrices [Henikoff & Henikoff, 1992]: for both closely and remotely related sequences

BLOSUM-N matrices
Constructed from a database called BLOCKS.
BLOCKS contains many closely related sequences, so conserved amino acids may be over-counted.
N = 62: the probabilities qst were computed using trusted alignments with no more than 62% identity (identity: % of matched columns).
Using this matrix, the Smith-Waterman algorithm is most effective in detecting real alignments with a similar identity level (i.e. ~62%).

$$\sigma(s, t) = \frac{1}{\lambda} \ln \frac{q_{st}}{p_s p_t}$$

λ: scaling factor to convert the score to an integer.
Important: when you are told that a scoring matrix is in half-bits, λ = ½ ln 2.
Scores are positive for chemically similar substitutions.
Common amino acids get low weights.
Rare amino acids get high weights.

BLOSUM-N matrices
If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75.
On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50.
BLOSUM-62: good for most purposes.
Scale: BLOSUM45 (weak homology), BLOSUM62, BLOSUM90 (strong homology).

For DNAs
There is no database of trusted alignments to start with.
Specify the percentage identity you would like to detect.
You can then get the substitution matrix by some calculation.

For example
Suppose pA = pC = pT = pG = 0.25.
We want 88% identity: qAA = qCC = qTT = qGG = 0.22, the rest = 0.12/12 = 0.01.
σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log (0.22 / (0.25*0.25)) = 1.26
σ(s, t) = log (0.01 / (0.25*0.25)) = -1.83 for s ≠ t.
(log is the natural logarithm here.)

Substitution matrix

       A      C      G      T
A    1.26  -1.83  -1.83  -1.83
C   -1.83   1.26  -1.83  -1.83
G   -1.83  -1.83   1.26  -1.83
T   -1.83  -1.83  -1.83   1.26

Multiply by 4 and then round off to get integers:

     A   C   G   T
A    5  -7  -7  -7
C   -7   5  -7  -7
G   -7  -7   5  -7
T   -7  -7  -7   5

Scaling won't change the alignment.
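The calculation from a target identity to a DNA substitution matrix can be sketched as follows. The function name and the `scale` parameter are hypothetical; the sketch assumes uniform base frequencies (0.25 each) and mismatch probability spread evenly over the 12 off-diagonal pairs, as in the example above.

```python
import math


def dna_log_odds(identity, scale=1.0):
    """Return (match, mismatch) log-odds scores for a target fraction of identity."""
    p = 0.25                           # background frequency of each base
    q_match = identity / 4             # 4 identical pairs share the identity mass
    q_mismatch = (1 - identity) / 12   # 12 mismatched pairs share the rest
    match = scale * math.log(q_match / (p * p))
    mismatch = scale * math.log(q_mismatch / (p * p))
    return match, mismatch


raw = dna_log_odds(0.88)
scaled = dna_log_odds(0.88, scale=4)
print([round(v, 2) for v in raw])      # unscaled natural-log scores
print([round(v) for v in scaled])      # integer scores after multiplying by 4
```

Multiplying every entry by the same positive constant leaves the ranking of alignments unchanged, which is why rounding a scaled matrix is harmless apart from rounding error.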

Arbitrary substitution matrix
Say you have a substitution matrix provided by someone.
It's important to know what you are actually looking for when you use the matrix.

Two default DNA matrices (diagonal = match, off-diagonal = mismatch):
NCBI-BLAST: match = 1, mismatch = -2
WU-BLAST: match = 5, mismatch = -4
What's the difference? Which one should I use for my sequences?

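We had

$$\sigma(s, t) = \log \frac{q_{st}}{p_s p_t}$$

Scale it, so that

$$\sigma(s, t) = \frac{1}{\lambda} \ln \frac{q_{st}}{p_s p_t}$$

Reorganize:

$$q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$$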

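Since all probabilities must sum to 1,

$$\sum_{s,t} q_{st} = 1$$

We have

$$\sum_{s,t} p_s p_t e^{\lambda \sigma(s, t)} = 1$$

Suppose again ps = 0.25 for any s.
We know σ(s, t) from the substitution matrix.
We can solve the equation for λ.
Plug λ into $q_{st} = p_s p_t e^{\lambda \sigma(s, t)}$ to get qst.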

Translating the two matrices (with ps = 0.25):

Program    | Match | Mismatch | λ    | qst for s = t | qst for s ≠ t | Translates to
NCBI-BLAST | 1     | -2       | 1.33 | 0.24          | 0.004         | 95% identity
WU-BLAST   | 5     | -4       | 0.19 | 0.16          | 0.03          | 65% identity

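Details for solving λ (NCBI-BLAST matrix: match = 1, mismatch = -2)
Known: σ(s,t) = 1 for s = t, and σ(s,t) = -2 for s ≠ t.
Since $q_{st} = p_s p_t e^{\lambda \sigma(s,t)}$ and $\sum_{s,t} q_{st} = 1$, we have

$$12 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{-2\lambda} + 4 \cdot \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot e^{\lambda} = 1$$

Let $e^{\lambda} = x$; we have $\tfrac{3}{4} x^{-2} + \tfrac{1}{4} x = 1$.
Hence $x^3 - 4x^2 + 3 = 0$, which has three solutions: 3.8, 1, -0.8.
Only the first leads to a positive λ: λ = ln (3.8) = 1.33.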
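For matrices where the cubic is less convenient, λ can be found numerically. The following is a minimal sketch using bisection, assuming uniform base frequencies and a matrix with a single match and a single mismatch score whose expected value is negative; the function names and the search interval are assumptions for illustration.

```python
import math


def solve_lambda(match, mismatch, p=0.25, lo=1e-6, hi=10.0, iters=100):
    """Find the positive root of sum_{s,t} p_s p_t exp(lambda * sigma(s,t)) = 1."""
    def f(lam):
        return (12 * p * p * math.exp(lam * mismatch)
                + 4 * p * p * math.exp(lam * match) - 1.0)
    # f(0) = 0 is the trivial root. With a negative expected score, f dips below
    # zero just after 0 and then grows, so there is exactly one positive root.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def target_identity(match, mismatch, p=0.25):
    """Fraction of identical columns implied by the matrix: 4 * q_ss."""
    lam = solve_lambda(match, mismatch, p)
    return 4 * p * p * math.exp(lam * match)


for name, m, s in [("NCBI-BLAST", 1, -2), ("WU-BLAST", 5, -4)]:
    print(name, round(solve_lambda(m, s), 2), round(target_identity(m, s), 2))
```

Bisection is used here only because it needs no derivative and no external library; any one-dimensional root finder would do.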

Today: statistics of alignment
Where does σ(xi, yj) come from?
Are two aligned sequences actually related?

Statistics of Alignment Scores
Q: How do we assess whether an alignment provides good evidence for homology (i.e., the two sequences are evolutionarily related)?
Is a score of 82 good? What about 180?
A: Determine how likely it is that such an alignment score would result from chance.

P-value of alignment
p-value: the probability that the alignment score can be obtained from aligning random sequences.
A small p-value means the score is unlikely to happen by chance.
The most common thresholds are 0.01 and 0.05.
The threshold also depends on the purpose of the comparison and the cost of a false claim.

Statistics of global seq alignment
Theory only applies to local alignment.
For global alignment, your best bet is to do Monte-Carlo simulation: what's the chance you can get a score as high as the real alignment by aligning two random sequences?
Procedure, given sequences X and Y:
1. Compute a global alignment (score = S).
2. Randomly shuffle sequence X (or Y) N times to obtain X1, X2, …, XN.
3. Align each Xi with Y (score = Ri).
4. P-value: the fraction of Ri >= S.
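The shuffling procedure can be sketched in a few lines. This is an illustration under stated assumptions: the scoring scheme (match 1, mismatch -1, linear gap -2), the function names and the fixed random seed are all chosen for the example and are not taken from the lecture.

```python
import random


def global_score(x, y, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty, using two rows of memory."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Return (S, fraction of shuffled-x alignments scoring at least S)."""
    rng = random.Random(seed)
    observed = global_score(x, y)
    letters = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # shuffling keeps the composition of x unchanged
        if global_score("".join(letters), y) >= observed:
            hits += 1
    return observed, hits / n_shuffles


s, p = shuffle_p_value("ACGTTGCAAGCTTACGGATC", "ACGTTGCAAGCTTACGGATC")
print(s, p)
```

With N shuffles the smallest p-value that can be resolved is about 1/N, so a result of 0 should be read as "less than 1/N" rather than as exactly zero.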

Example: global alignment of human HEXA and fly HEXO1. Score = -74

[Figure: distribution of the alignment scores between fly HEXO1 and 200 randomly shuffled human HEXA sequences, with the observed score -74 marked]
There are 88 random sequences with alignment score >= -74.
So: p-value = 88 / 200 = 0.44 => the alignment is not significant.

Example: global alignment of mouse HEXA and human HEXA. Score = 732

[Figure: distribution of the alignment scores between mouse HEXA and 200 randomly shuffled human HEXA sequences, with the observed score 732 marked]
No random sequences have alignment score >= 732.
So: the p-value is less than 1 / 200 = 0.005.
To get a smaller p-value, we have to align more random sequences, which is very slow,
unless we can fit a distribution (e.g. a normal distribution). Such a distribution may not be generalizable.
No theory exists for the global alignment score distribution.

Statistics for local alignment
Elegant theory exists.
The score for ungapped local alignment follows an extreme value distribution (Gumbel distribution).
[Figure: normal distribution compared with extreme value distribution]
An example extreme value distribution:
Randomly sample 100 numbers from a normal distribution, and compute the max.
Repeat 100 times.
The max values will follow an extreme value distribution.
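The sampling experiment described on this slide can be run directly. The sketch below uses the standard library only; the function name and the seed are arbitrary choices for the example.

```python
import random
import statistics


def sample_maxima(n_values=100, n_repeats=100, seed=0):
    """Draw n_values standard normals, keep the max; repeat n_repeats times."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n_values))
            for _ in range(n_repeats)]


maxima = sample_maxima()
print("mean of maxima:  ", round(statistics.mean(maxima), 2))
print("median of maxima:", round(statistics.median(maxima), 2))
print("smallest maximum:", round(min(maxima), 2))
```

The maxima cluster well above zero and have a longer tail toward large values, which is the shape a histogram of local alignment scores between random sequences also shows.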

Statistics for local alignment
Given two unrelated sequences of lengths M and N, the expected number of ungapped local alignments with score at least S can be calculated by
E(S) = K M N exp[-λS]
This is known as the E-value.
λ: scaling factor of the substitution matrix, as computed above
K: empirical parameter, ~0.1. Depends on sequence composition and substitution matrix.

P-value for local alignment score
P-value for a local alignment with score S:

$$P(S) = 1 - \exp(-E(S)) \approx E(S)$$

when P is small.

Example
You are aligning two sequences, each with 1000 bases.
m = 1, s = -1, d = -inf (ungapped alignment)
You obtain a score of 20.
Is this score significant?

λ = ln 3 = 1.1 (computed as in the details for solving λ above)
E(S) = K M N exp{-λS}
E(20) = 0.1 * 1000 * 1000 * 3^-20 = 3 x 10^-5
P-value = 3 x 10^-5 << 0.05
The alignment is significant.
[Figure: score distribution of 1000 random sequence pairs, with the observed score 20 marked]

Multiple-testing problem
Searching a 1000-base sequence against a database of 10^6 sequences (each of length 1000).
How significant is a score of 20 now?
You are essentially comparing 1000 bases with 1000 x 10^6 = 10^9 bases (ignoring edge effects).
E(20) = 0.1 * 1000 * 10^9 * 3^-20 = 30
By chance we would expect to see 30 matches.
The P-value (probability of seeing at least one match with score >= 20) is 1 - e^-30 ≈ 1.
The alignment is not significant.
Caution: it does NOT mean that the two sequences are unrelated. Rather, it simply means that you have NO confidence to say whether the two sequences are related.
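Both the pairwise example and the database-search example can be recomputed with a small sketch of the E-value and p-value formulas. The function names are hypothetical, and K = 0.1 is the rough empirical value quoted earlier, not a fitted parameter.

```python
import math


def e_value(score, m, n, lam, K=0.1):
    """Expected number of chance local alignments with score >= `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """P(at least one chance hit) = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e)


lam = math.log(3)  # match = 1, mismatch = -1, uniform base frequencies

e_pair = e_value(20, 1000, 1000, lam)        # 1000 bp against 1000 bp
e_db = e_value(20, 1000, 1000 * 10**6, lam)  # 1000 bp against 10^6 x 1000 bp

print(e_pair, p_value(e_pair))
print(e_db, p_value(e_db))
```

For small E the p-value and the E-value are practically equal, while for large E the p-value saturates at 1 and the E-value remains informative as a count of expected chance hits.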

Score threshold to determine significance
You want a p-value that is very small (even after taking multiple testing into consideration).
What S will guarantee you a significant p-value?
E(S) ≈ P(S) << 1
=> K M N exp[-λS] << 1
=> log(KMN) - λS < 0
=> S > T + log(MN) / λ
(T = log(K) / λ, usually small)

Score threshold to determine significance
In the previous example: m = 1, s = -1, d = -inf => λ = 1.1
Aligning 1000bp vs 1000bp: S > log(10^6) / 1.1 = 13. So 20 is significant.
Searching 1000bp against 10^6 x 1000bp: S > log(10^12) / 1.1 = 25. So 20 is not significant.
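The threshold rule can be written as a one-line function. In this sketch the function name is hypothetical, and K is optional: leaving it out reproduces the slide's simplification of dropping the small term T = log(K) / λ.

```python
import math


def score_threshold(m, n, lam, K=None):
    """Score above which E(S) = K*m*n*exp(-lam*S) drops below 1.

    With K=None the term log(K)/lam is ignored, as on the slide.
    """
    t = 0.0 if K is None else math.log(K) / lam
    return t + math.log(m * n) / lam


print(round(score_threshold(1000, 1000, 1.1)))            # pairwise comparison
print(round(score_threshold(1000, 1000 * 10**6, 1.1)))    # database search
print(round(score_threshold(1000, 1000, 1.1, K=0.1), 1))  # including K = 0.1
```

Because the search space enters only through log(MN), multiplying the database size by a million raises the threshold by a constant amount rather than a constant factor.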

Statistics for gapped local alignment
Theory is not well developed.
The extreme value distribution works well empirically.
K and λ need to be estimated empirically:
Given the database and substitution matrix, generate some random sequence pairs.
Do local alignment.
Fit an extreme value distribution to obtain K and λ.

In summary
How to obtain a substitution matrix?
Obtain qst and ps from established alignments (for DNA: from your knowledge).
Computing score:

$$\sigma(s, t) = \frac{1}{\lambda} \ln \frac{q_{st}}{p_s p_t}$$

How to understand an arbitrary substitution matrix?
Solve

$$\sum_{s,t} p_s p_t e^{\lambda \sigma(s, t)} = 1$$

to obtain λ and the target qst, which tells you what percent identity you are expecting.
How to understand an alignment score?
The p-value: the probability that a score can be expected from chance.
Global alignment: Monte-Carlo simulation.
Local alignment: extreme value distribution. Estimate a p-value from a score, or determine a score threshold without computing a p-value.