The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.



Issues in scoring pairwise alignments
– How do we determine the substitution and gap scores for alignment?
– How do we determine whether the score of the best alignment is indicative of truly related sequences?
– These issues are related, and both are addressed via statistical models.

Probabilistic Model of Alignments
– We’ll focus on protein alignments without gaps.
– Given an alignment, we can consider two possibilities:
  R: the sequences are related by evolution
  U: the sequences are unrelated
– How can we distinguish these possibilities?
– How is this view related to amino-acid substitution matrices?

Model for Unrelated Sequences
– We’ll assume that each position in the alignment is sampled randomly from some distribution of amino acids.
– We’ll assume that the amino acids at each position are independent of each other.
– Let $q_a$ be the probability of amino acid a.
– The probability of an n-character alignment of x and y is given by
  $P(x, y \mid U) = \prod_{i=1}^{n} q_{x_i} \prod_{i=1}^{n} q_{y_i}$

Model for Related Sequences
– We’ll assume that each pair of aligned amino acids evolved from a common ancestor.
– We’ll assume that each pair is independent of the other pairs.
– Let $p_{ab}$ be the probability that evolution gave rise to amino acid a in one sequence and b in another sequence.
– The probability of an n-character alignment of x and y is given by
  $P(x, y \mid R) = \prod_{i=1}^{n} p_{x_i y_i}$

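Probabilistic Model of Alignments
– How can we decide which possibility (U or R) is more likely?
– One principled way is to consider the relative likelihood of the two possibilities:
  $\frac{P(x, y \mid R)}{P(x, y \mid U)} = \frac{\prod_{i} p_{x_i y_i}}{\prod_{i} q_{x_i} \prod_{i} q_{y_i}} = \prod_{i} \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
– Taking the log, we get
  $\log \frac{P(x, y \mid R)}{P(x, y \mid U)} = \sum_{i} \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$
– This is the log-odds ratio (or log likelihood ratio).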

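Probabilistic Model of Alignments
– If we let the substitution matrix score for the pair a, b be
  $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
– then the score of an ungapped alignment is the log likelihood ratio:
  $S = \sum_{i} s(x_i, y_i) = \log \frac{P(x, y \mid R)}{P(x, y \mid U)}$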
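As a rough check of the identity above, the following sketch scores an ungapped alignment two ways: by summing per-column scores, and by taking the log likelihood ratio of the two models directly. The two-letter alphabet and the values of `q` and `p` are hypothetical toy numbers chosen for illustration; they are not from the slides.

```python
import math

# Hypothetical toy model over a two-letter alphabet (illustrative values only).
q = {"A": 0.5, "B": 0.5}                      # background frequencies q_a
p = {("A", "A"): 0.4, ("B", "B"): 0.4,
     ("A", "B"): 0.1, ("B", "A"): 0.1}        # joint pair probabilities p_ab


def substitution_score(a, b):
    """s(a, b) = log(p_ab / (q_a * q_b)), using natural logarithms."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(x, y):
    """Sum of per-column scores for an ungapped alignment of x and y."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(substitution_score(a, b) for a, b in zip(x, y))


def log_likelihood_ratio(x, y):
    """log P(x, y | R) - log P(x, y | U), computed from the two models."""
    log_related = sum(math.log(p[(a, b)]) for a, b in zip(x, y))
    log_unrelated = sum(math.log(q[a]) + math.log(q[b]) for a, b in zip(x, y))
    return log_related - log_unrelated
```

For any pair of equal-length sequences over this alphabet, `alignment_score(x, y)` and `log_likelihood_ratio(x, y)` should agree up to floating-point error.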

Substitution Matrices
– Two popular sets of matrices for protein sequences:
  –PAM matrices [Dayhoff et al., 1978]
  –BLOSUM matrices [Henikoff & Henikoff, 1992]
– Both try to capture the relative substitutability of amino acid pairs in the context of evolution.

BLOSUM-62 Matrix

Substitution Matrices
– But how do we get values for $p_{ab}$ (the probability that a and b arose from a common ancestor)?
– It depends on how long ago the sequences diverged:
  diverged recently: $p_{ab} \approx 0$ for $a \neq b$
  diverged long ago: $p_{ab} \approx q_a q_b$
– The substitution matrix score for the pair a, b is given by
  $s(a, b) = \log \frac{p_{ab}}{q_a q_b}$
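The effect of divergence time on the scores can be worked through in two idealized limits. This is a sketch under assumptions: $q_a$ is the background probability of amino acid a, $p_{ab}$ is the joint probability of the aligned pair, and the limiting forms of $p_{ab}$ are approximations rather than values taken from any real matrix.

```latex
\begin{align*}
s(a,b) &= \log \frac{p_{ab}}{q_a q_b} \\[4pt]
\text{diverged recently:}\quad
  & p_{aa} \approx q_a,\quad p_{ab} \approx 0 \ (a \neq b) \\
  & \Rightarrow\ s(a,a) \approx \log \frac{1}{q_a} > 0,\qquad
    s(a,b) \to -\infty \ (a \neq b) \\[4pt]
\text{diverged long ago:}\quad
  & p_{ab} \approx q_a q_b \\
  & \Rightarrow\ s(a,b) \approx \log 1 = 0
\end{align*}
```

In the first limit, identities are strongly rewarded (most strongly for rare amino acids) and mismatches are heavily penalized; in the second, the two models become indistinguishable and every score approaches zero.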

Substitution Matrices
– Key idea: trusted alignments of related sequences provide information about biologically permissible mutations.
– Protein structure similarity provides the gold standard for which alignments are trusted.

BLOSUM Matrices [Henikoff & Henikoff, PNAS 1992]
– Probabilities are estimated from “blocks” of sequence fragments that represent structurally conserved regions in proteins.
– Transition frequencies are observed directly by counting pairs of characters between clusters in the blocks.
– Sequences within blocks are clustered at various levels:
  –45% identical (BLOSUM-45)
  –50% identical (BLOSUM-50)
  –62% identical (BLOSUM-62)
  –etc.

BLOSUM Matrices
– Given: a set of sequences in a block.
– Fill in matrix A with the number of observed substitutions, where $A_{ab}$ is the number of times a is paired with b (we won’t worry about details of some normalization that happens here).
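One plausible way to turn such a count matrix into scores is sketched below. It assumes A is symmetric and stores ordered pairs, estimates $p_{ab}$ as the fraction of all counted pairs, takes $q_a$ as the marginal of $p_{ab}$, and uses base-2 log-odds. The cluster weighting, scaling, and rounding used in the published BLOSUM procedure are deliberately omitted, and the function name and counts are hypothetical.

```python
import math


def log_odds_from_counts(A):
    """Estimate p_ab, q_a and log-odds scores from pair counts.

    A maps ordered pairs (a, b) to counts and is assumed symmetric,
    i.e. A[(a, b)] == A[(b, a)].
    """
    total = sum(A.values())
    # p_ab: fraction of all counted pairs that are (a, b).
    p = {pair: count / total for pair, count in A.items()}
    # q_a: marginal probability of a, obtained by summing p_ab over b.
    q = {}
    for (a, _b), prob in p.items():
        q[a] = q.get(a, 0.0) + prob
    # s(a, b) = log2(p_ab / (q_a q_b)); pairs never observed are skipped.
    s = {(a, b): math.log2(prob / (q[a] * q[b]))
         for (a, b), prob in p.items() if prob > 0}
    return p, q, s


# Hypothetical counts for a two-letter alphabet (illustration only).
A = {("A", "A"): 6, ("A", "B"): 1, ("B", "A"): 1, ("B", "B"): 2}
p, q, s = log_odds_from_counts(A)
```

With these toy counts, identical pairs receive positive scores and the mismatched pair receives a negative score, which is the qualitative pattern seen in real substitution matrices.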

Assessing significance of the alignment score
– There are two ways to do this:
  –Bayesian approach
  –Classical approach

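Bayesian approach
– Compute the probability of the Related model using Bayes’ rule:
  $P(R \mid x, y) = \frac{P(x, y \mid R)\, P(R)}{P(x, y \mid R)\, P(R) + P(x, y \mid U)\, P(U)}$
– This requires prior probabilities for R and U.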
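Dividing the numerator and denominator of Bayes' rule by $P(x, y \mid U)\,P(U)$ shows that the posterior is a logistic function of the alignment score plus the log prior odds. The sketch below assumes the score is a natural-log likelihood ratio and that $P(U) = 1 - P(R)$; the function name is hypothetical.

```python
import math


def posterior_related(score, prior_related):
    """P(R | x, y) from a natural-log odds score S and a prior P(R).

    Uses P(R | x, y) = sigmoid(S + log(P(R) / P(U))) with P(U) = 1 - P(R).
    """
    if not 0.0 < prior_related < 1.0:
        raise ValueError("prior_related must lie strictly between 0 and 1")
    logit = score + math.log(prior_related / (1.0 - prior_related))
    # Evaluate the logistic function without overflowing for large |logit|.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

A score of zero leaves the prior unchanged, and a small prior (as in a database search, where most sequences are unrelated to the query) requires a correspondingly large score before the posterior favors R.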

Classical approach
– Determine how likely it is that such an alignment score would result from chance.
– There are 3 ways to calculate chance; look at alignment scores for:
  –real but non-homologous sequences
  –real sequences shuffled to preserve compositional properties
  –sequences generated randomly based upon a DNA/protein sequence model

Scores from Random Alignments
– Suppose we assume:
  –sequence lengths m and n
  –a particular substitution matrix and amino-acid frequencies
– We then consider generating random sequences of lengths m and n and finding the best alignment of these sequences.
– This will give us a distribution over alignment scores for random pairs of sequences.
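The experiment described here can be simulated directly. The sketch below is a simplified illustration, not the procedure behind any particular tool: it assumes ungapped local alignment, a four-letter alphabet with uniform frequencies, and a +1/-1 match/mismatch scheme standing in for a real substitution matrix. All function names and parameter values are hypothetical.

```python
import random


def best_ungapped_local_score(x, y, score):
    """Best ungapped local alignment score of x and y.

    Dynamic programming over diagonals: H[i][j] = max(0, H[i-1][j-1] + s).
    """
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            curr[j] = max(0, prev[j - 1] + score(a, b))
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best


def random_sequence(length, freqs, rng):
    """Draw an i.i.d. sequence with the given letter frequencies."""
    letters = list(freqs)
    weights = [freqs[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


def null_score_sample(m, n, freqs, score, trials, seed=0):
    """Best-alignment scores for `trials` random pairs of lengths m and n."""
    rng = random.Random(seed)
    return [
        best_ungapped_local_score(
            random_sequence(m, freqs, rng), random_sequence(n, freqs, rng), score
        )
        for _ in range(trials)
    ]


# Hypothetical toy setup: uniform four-letter alphabet, +1 match, -1 mismatch.
freqs = {c: 0.25 for c in "ACGT"}


def match_mismatch(a, b):
    return 1 if a == b else -1


scores = null_score_sample(30, 30, freqs, match_mismatch, trials=200)
```

A histogram of `scores` gives an empirical null distribution. An observed score can then be compared against it, for example by counting the fraction of random pairs that score at least as high.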

Statistics of Alignment Scores: The Extreme Value Distribution
– In particular, the best-alignment scores for random pairs of sequences follow an extreme value distribution.

Distribution of Scores
– S is a given score threshold.
– m and n are the lengths of the sequences under consideration.
– K and $\lambda$ are constants that can be calculated from:
  –the substitution matrix
  –the frequencies of the individual amino acids
– The expected number of alignments, E, with score at least S is given by:
  $E = K m n e^{-\lambda S}$
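A minimal sketch of this calculation follows. The values of `K` and `lam` used in the example are placeholders, not constants derived from any real substitution matrix. The conversion from E to a p-value, $P = 1 - e^{-E}$, is the usual Poisson approximation from Karlin-Altschul statistics and is an addition here rather than something stated on the slide.

```python
import math


def expected_alignments(S, m, n, K, lam):
    """E = K * m * n * exp(-lam * S): expected number of chance alignments
    with score at least S for sequences of lengths m and n."""
    return K * m * n * math.exp(-lam * S)


def p_value(S, m, n, K, lam):
    """P(at least one chance alignment scores >= S) = 1 - exp(-E).

    Assumes the number of such alignments is approximately Poisson.
    """
    return -math.expm1(-expected_alignments(S, m, n, K, lam))


# Placeholder constants, for illustration only.
K, lam = 0.1, 0.3
e_low = expected_alignments(20, 300, 300, K, lam)
e_high = expected_alignments(60, 300, 300, K, lam)
```

Because E falls off exponentially in S, raising the score threshold from 20 to 60 in this toy setting reduces the expected number of chance hits by a factor of $e^{0.3 \times 40} = e^{12}$. For small E, the p-value and E are nearly equal.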

Statistics of Alignment Scores
– To generalize this to searching a database, have n represent the summed length of the sequences in the DB (adjusting for edge effects).
– The NCBI BLAST server does just this.
– The theory for gapped alignments is not as well developed.
– Computational experiments suggest this analysis holds for gapped alignments (but K and $\lambda$ must be estimated from data).