Substitution Numbers and Scoring Matrices


Substitution Numbers
The number of substitutions K between two sequences is an important quantity in molecular evolutionary analysis.
A simple count of observed differences may be misleading, because several substitutions at the same site are seen as at most one difference, so statistical models are used to estimate the number of substitutions:
- Jukes-Cantor model
- Kimura model
Both models are for nucleotides, but the ideas can be extended to amino acids.

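Jukes-Cantor Model
Assumes that each nucleotide is equally likely to change into any other nucleotide, with probability α per time step.
[Figure: state diagram over A, T, C, G; each nucleotide changes to each of the other three with probability α and stays unchanged with probability φ = 1 - 3α]
What is the probability that, if we start with C, we end up with C after 2 time steps? Sum over the four possible two-step paths:
$P_{CC}(2) = P(C \to C \to C) + P(C \to A \to C) + P(C \to T \to C) + P(C \to G \to C) = \varphi^2 + 3\alpha^2$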

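Jukes-Cantor Model
The entry M(a,b) in the matrix $M_1$ represents the probability of substitution from nucleotide a to nucleotide b in one time step (rows and columns in the order A, T, C, G):
$$M_1 = \begin{pmatrix} \varphi & \alpha & \alpha & \alpha \\ \alpha & \varphi & \alpha & \alpha \\ \alpha & \alpha & \varphi & \alpha \\ \alpha & \alpha & \alpha & \varphi \end{pmatrix}, \qquad \varphi = 1 - 3\alpha$$
What is the matrix $M_2$, whose entries M(a,b) represent the probability of substitution from a to b in two time steps? It is essentially what we did on the previous slide, but for all pairs of bases, where X is the intermediate base:
A->X->A A->X->T A->X->C A->X->G
T->X->A T->X->T T->X->C T->X->G
C->X->A C->X->T C->X->C C->X->G
G->X->A G->X->T G->X->C G->X->G
For example:
C->X->C = α∙α + α∙α + φ∙φ + α∙α (previous slide)
C->X->A = α∙φ + α∙α + φ∙α + α∙α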
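The two-step probabilities can be checked numerically by squaring the one-step matrix. The sketch below is only an illustration: the value of `alpha` and the helper names `jc_matrix` and `mat_mul` are assumptions, not part of the slides.

```python
def jc_matrix(alpha):
    """One-step Jukes-Cantor matrix: phi on the diagonal, alpha elsewhere."""
    phi = 1.0 - 3.0 * alpha
    return [[phi if i == j else alpha for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


BASES = "ATCG"           # row/column order used on the slide
alpha = 0.01             # illustrative value (assumption)
phi = 1.0 - 3.0 * alpha

m1 = jc_matrix(alpha)
m2 = mat_mul(m1, m1)     # two time steps: M1 multiplied by itself

c, a = BASES.index("C"), BASES.index("A")
p_cc_2 = m2[c][c]        # path sum phi*phi + 3*alpha*alpha
p_ca_2 = m2[c][a]        # path sum 2*alpha*phi + 2*alpha*alpha
print(p_cc_2, p_ca_2)
```

Each row of `m2` still sums to 1, which is a quick sanity check that the product is again a matrix of probabilities.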

Jukes-Cantor Model
It turns out that $M_n = (M_1)^n$, where the entries M(a,b) of $M_n$ represent the probability of substitution from a to b in n time steps.
In general, under the Jukes-Cantor model the probability that a site which starts as C will contain a C after t time steps is given by:
$P_C(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
This model can be used to derive an estimate of the number of substitutions per site that have occurred between two sequences:
$K = -\frac{3}{4} \ln\left[1 - \frac{4}{3} p\right]$
where p is the fraction of aligned nucleotides that mismatch.
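As a small worked example of the distance estimate, the following sketch computes p from two equal-length sequences and then applies the formula for K. The function names and the example value p = 0.3 are assumptions made for illustration.

```python
import math


def mismatch_fraction(seq1, seq2):
    """Fraction p of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    mismatches = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return mismatches / len(seq1)


def jukes_cantor_distance(p):
    """K = -3/4 * ln(1 - 4/3 * p); only defined for 0 <= p < 3/4."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


p = mismatch_fraction("ACGTACGTAC", "ACGTTCGAAA")   # 3 mismatches out of 10
k = jukes_cantor_distance(p)
print(p, k)
```

The estimate K is larger than the observed fraction p because the model accounts for sites that changed more than once.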

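Kimura Model
Addresses the unrealistic assumption in the Jukes-Cantor model that all substitutions are equally likely.
Two types of substitutions:
- transitions: purine <=> purine or pyrimidine <=> pyrimidine exchange (probability α)
- transversions: purine <=> pyrimidine exchange (probability β)
[Figure: state diagram over A, T, C, G; transitions have probability α, transversions have probability β, and a nucleotide stays unchanged with probability φ = 1 - α - 2β]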

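Kimura Model
What is the probability that, if we start with C, we end up with C after 2 time steps?
$P_{CC}(2) = P(C \to C \to C) + P(C \to A \to C) + P(C \to T \to C) + P(C \to G \to C)$
In general, under the Kimura model the probability that a site which starts as C will contain a C after t time steps is given by:
$P_C(t) = \frac{1}{4} + \frac{1}{4} e^{-4\beta t} + \frac{1}{2} e^{-2(\alpha+\beta) t}$
Estimated number of substitutions (TR: fraction of sites that differ by a transition, TV: fraction of sites that differ by a transversion):
$K = \frac{1}{2} \ln\left[\frac{1}{1 - 2\,TR - TV}\right] + \frac{1}{4} \ln\left[\frac{1}{1 - 2\,TV}\right]$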
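A minimal sketch of the Kimura estimate follows. It assumes that TR and TV are the fractions of aligned sites showing a transition or a transversion, and that the sequences contain only A, C, G and T; the function names and example sequences are illustrative.

```python
import math

PURINES = {"A", "G"}   # C and T are the pyrimidines


def transition_transversion_fractions(seq1, seq2):
    """Return (TR, TV): fractions of sites differing by a transition / transversion."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # Same chemical class (both purines or both pyrimidines) means transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


def kimura_distance(tr, tv):
    """K = 1/2 ln[1 / (1 - 2*TR - TV)] + 1/4 ln[1 / (1 - 2*TV)]."""
    if 1.0 - 2.0 * tr - tv <= 0.0 or 1.0 - 2.0 * tv <= 0.0:
        raise ValueError("sequences are too divergent for this estimate")
    return (0.5 * math.log(1.0 / (1.0 - 2.0 * tr - tv))
            + 0.25 * math.log(1.0 / (1.0 - 2.0 * tv)))


tr, tv = transition_transversion_fractions("ACGTACGTAC", "GCGTACGTAA")
k = kimura_distance(tr, tv)
print(tr, tv, k)
```

In this example the first site differs by a transition (A/G) and the last by a transversion (C/A), so TR = TV = 0.1.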

Scoring Matrices

Alignment Score
The alignment score attempts to measure the likelihood of a common evolutionary ancestor.
Two possible ways to explain a given pairwise alignment:
- random model: the alignment could be produced purely by chance
- evolutionary (non-random) model: there is high correlation between aligned pairs
Under the random model, each position is independent of the others, and the probability of amino acid a occurring at each position is $p_a$.
Under the non-random model, the probability of amino acid a depends on the matched residue b; the probability of the aligned pair is $q_{ab}$.

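Substitution Matrices
Given a (non-gapped) pairwise alignment of sequences
A = a1 a2 a3 a4 … an
B = b1 b2 b3 b4 … bn
the probability of the alignment under the non-random model is
$P_{\text{non-random}} = q_{a_1 b_1}\, q_{a_2 b_2}\, q_{a_3 b_3}\, q_{a_4 b_4} \cdots q_{a_n b_n}$
and under the random model it is
$P_{\text{random}} = p_{a_1} p_{a_2} p_{a_3} p_{a_4} \cdots p_{a_n} \cdot p_{b_1} p_{b_2} p_{b_3} p_{b_4} \cdots p_{b_n} = p_{a_1} p_{b_1}\, p_{a_2} p_{b_2}\, p_{a_3} p_{b_3}\, p_{a_4} p_{b_4} \cdots p_{a_n} p_{b_n}$
Use the ratio of probabilities (odds ratio) to compare the models:
$r = \frac{P_{\text{non-random}}}{P_{\text{random}}}$
If r > 1, the non-random model is more likely.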

Substitution Matrices
Ratio of probabilities (odds ratio):
$$r = \frac{P_{\text{non-random}}}{P_{\text{random}}} = \frac{q_{a_1 b_1}\, q_{a_2 b_2}\, q_{a_3 b_3}\, q_{a_4 b_4} \cdots q_{a_n b_n}}{p_{a_1} p_{b_1}\, p_{a_2} p_{b_2}\, p_{a_3} p_{b_3}\, p_{a_4} p_{b_4} \cdots p_{a_n} p_{b_n}} = \frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}} \cdot \frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}} \cdot \frac{q_{a_3 b_3}}{p_{a_3} p_{b_3}} \cdots \frac{q_{a_n b_n}}{p_{a_n} p_{b_n}}$$
Typically the log-odds ratio is used:
$$\log(r) = \log\left(\frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}}\right) + \log\left(\frac{q_{a_2 b_2}}{p_{a_2} p_{b_2}}\right) + \log\left(\frac{q_{a_3 b_3}}{p_{a_3} p_{b_3}}\right) + \dots + \log\left(\frac{q_{a_n b_n}}{p_{a_n} p_{b_n}}\right)$$
The first term, $\log\left(\frac{q_{a_1 b_1}}{p_{a_1} p_{b_1}}\right)$, is entry $(a_1, b_1)$ in the substitution matrix.
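The equivalence between the log of the odds ratio and the sum of per-position log-odds terms can be seen in a short sketch. The two-letter alphabet and the values in `p` and `q` are hypothetical, chosen only so that the numbers are easy to follow; base-2 logarithms are likewise an assumption.

```python
import math

# Hypothetical background and pair probabilities for a two-letter alphabet.
p = {"A": 0.5, "B": 0.5}
q = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def odds_ratio(seq_a, seq_b):
    """r = P_non-random / P_random as a product over aligned positions."""
    r = 1.0
    for a, b in zip(seq_a, seq_b):
        r *= q[(a, b)] / (p[a] * p[b])
    return r


def log_odds_score(seq_a, seq_b):
    """log(r) as a sum of per-position substitution scores."""
    return sum(math.log2(q[(a, b)] / (p[a] * p[b]))
               for a, b in zip(seq_a, seq_b))


seq_a, seq_b = "AABAB", "AABBB"
r = odds_ratio(seq_a, seq_b)
score = log_odds_score(seq_a, seq_b)
print(r, score)
```

Summing logarithms instead of multiplying ratios avoids very small or very large products on long alignments, which is one practical reason the log form is used.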

Substitution Matrices
Provide the “likelihood” that two amino acids (or nucleotides) will occur as an aligned pair.
Common substitution matrices for protein alignment:
- PAM family: derived from alignments of high sequence identity (Dayhoff, Schwartz, and Orcutt. “A model of evolutionary change in proteins”. In Atlas of Protein Sequence and Structure, Volume 5. 1978: 345-352.)
- BLOSUM family: derived from alignments of low sequence identity (Henikoff and Henikoff. “Amino acid substitution matrices from protein blocks”. Proc. Natl. Acad. Sci. 1992. 89(22): 10915–10919.)

BLOSUM62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
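To show how such a matrix is used, the sketch below scores a short ungapped alignment by adding up matrix entries. Only the handful of BLOSUM62 entries needed for the example are copied from the table above; the dictionary layout and the function names are assumptions for illustration.

```python
# A few BLOSUM62 entries copied from the table above (not the full matrix).
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("S", "T"): 1,
    ("C", "C"): 9,
    ("K", "K"): 5,
    ("N", "Q"): 0,
}


def pair_score(a, b):
    """Look up a symmetric matrix entry, whichever order the pair is stored in."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def alignment_score(seq1, seq2):
    """Score of an ungapped alignment: sum of the per-position entries."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


score = alignment_score("ATCKQ", "ASCKN")   # 4 + 1 + 9 + 5 + 0
print(score)
```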

BLOSUM Matrices
Based on ungapped multiple local alignments of conserved regions of proteins with low sequence identity.
These alignments are used to derive $q_{ab}$, $p_a$ and $p_b$, which give the substitution score for amino acids a and b:
$\text{score}(a, b) = \log\left(\frac{q_{ab}}{p_a p_b}\right)$
Procedure:
- obtain known ungapped multiple local alignments
- split the sequences into clusters, so that every sequence in a cluster has ≥ C% identity with at least one other sequence in that cluster
- for each pair of amino acids a and b, calculate $q_{ab}$ = frequency of the a,b pair / total number of pairs, counting pairs between clusters (sequences within a cluster are given weight 1 / size_of_cluster)

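BLOSUM Matrices
Calculating $q_{QN}$ for BLOSUM62: within a cluster, each sequence has at least one other sequence with ≥ 62% identity.
ATCKQ
ATCRN
ASCKN
SSCRN
SDCEQ
SECEN
TECRQ
No two of these sequences share more than 3 of 5 positions (60%), so each sequence is its own cluster: 7 clusters, 21 pairs of clusters.
5 positions × 21 pairs of clusters = 105 aligned pairs in total.
Q and N are aligned in 12 pairs of clusters.
$q_{QN}$ = frequency of QN pair / total number of aligned pairs = 12 / 105 = 0.114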
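The count for this example can be reproduced by enumerating every pair of sequences and every column. This is a sketch for the case where each sequence is its own cluster; the function name `pair_frequency` is hypothetical.

```python
from itertools import combinations

sequences = ["ATCKQ", "ATCRN", "ASCKN", "SSCRN", "SDCEQ", "SECEN", "TECRQ"]


def pair_frequency(seqs, a, b):
    """Return (hits, total): aligned a/b pairs and all aligned pairs."""
    hits = total = 0
    for s, t in combinations(seqs, 2):      # 21 pairs of sequences
        for x, y in zip(s, t):              # 5 columns each
            total += 1
            if {x, y} == {a, b}:            # unordered pair, Q/N or N/Q
                hits += 1
    return hits, total


hits, total = pair_frequency(sequences, "Q", "N")
q_qn = hits / total
print(hits, total, round(q_qn, 3))
```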

BLOSUM Matrices
Calculating $q_{QN}$ for BLOSUM50: within a cluster, each sequence has at least one other sequence with ≥ 50% identity.
top cluster: ATCKQ ATCRN ASCKN SSCRN
mid cluster: SDCEQ SECEN
bot cluster: TECRQ
3 clusters, 3 pairs of clusters.
5 positions × 3 pairs of clusters = 15 aligned pairs in total.
QN match frequency between clusters (each sequence weighted by 1 / size of its cluster):
top, mid: ¼*½ + ¾*½
top, bot: ¾*1
mid, bot: ½*1
total: 1/8 + 3/8 + 3/4 + 1/2 = 14/8
$q_{QN}$ = frequency of QN pair / total number of aligned pairs = (14/8) / 15 = 0.1167
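The weighted count can be sketched as follows. The three clusters are taken as given (first four sequences, next two, last one, which is the grouping implied by the weights ¼, ¾, ½ and 1 in the example); the clustering step itself is not implemented, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Clusters as implied by the weights in the example (an assumption).
clusters = [
    ["ATCKQ", "ATCRN", "ASCKN", "SSCRN"],   # top
    ["SDCEQ", "SECEN"],                     # mid
    ["TECRQ"],                              # bot
]


def weighted_pair_frequency(clusters, a, b):
    """Return (weighted hits, total aligned pairs) counted between clusters."""
    hits = Fraction(0)
    total = 0
    for c1, c2 in combinations(clusters, 2):
        weight = Fraction(1, len(c1) * len(c2))   # (1/size) * (1/size)
        for col in range(len(c1[0])):
            total += 1                            # one aligned pair per column
            for s in c1:
                for t in c2:
                    if {s[col], t[col]} == {a, b}:
                        hits += weight
    return hits, total


hits, total = weighted_pair_frequency(clusters, "Q", "N")
q_qn = hits / total
print(hits, total, float(q_qn))
```

Exact fractions are used so that the intermediate total can be compared directly with the 14/8 on the slide.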

BLOSUM Matrices
So far we have calculated $q_{ab}$ (i.e. the probability that a and b will be paired up under the non-random model).
To compute the substitution score we also need $p_a$ and $p_b$ (i.e. the probabilities that a and b occur by chance):
$p_a = q_{aa} + \frac{1}{2} \sum_{b \neq a} q_{ab}$ ≈ fraction of all amino acids that are of type a
The entry computed in the substitution matrix is:
$\text{score}(a, b) = \log\left(\frac{q_{ab}}{p_a p_b}\right)$
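A short sketch of this last step, deriving the background probabilities from the pair frequencies and then the score, is given below. The two-letter alphabet and the values in `q` are hypothetical, the helper names are illustrative, and the base-2 logarithm is an assumption since the slide does not fix a base.

```python
import math

# Hypothetical pair frequencies over unordered pairs; they sum to 1.
q = {("A", "A"): 0.5, ("A", "B"): 0.2, ("B", "B"): 0.3}
ALPHABET = "AB"


def q_lookup(a, b):
    """Pair frequency for an unordered pair, stored under either order."""
    return q[(a, b)] if (a, b) in q else q[(b, a)]


def background(a):
    """p_a = q_aa + 1/2 * sum over b != a of q_ab."""
    return q_lookup(a, a) + 0.5 * sum(q_lookup(a, b) for b in ALPHABET if b != a)


def score(a, b):
    """score(a, b) = log(q_ab / (p_a * p_b))."""
    return math.log2(q_lookup(a, b) / (background(a) * background(b)))


p_a, p_b = background("A"), background("B")
print(p_a, p_b, score("A", "A"), score("A", "B"))
```

With these numbers the identical pair A/A is more frequent than chance predicts and scores positive, while the mixed pair A/B is less frequent and scores negative.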

PAM Matrices
Based on ungapped multiple local alignments of conserved regions of proteins with high sequence identity (> 85%).
Uses phylogenetic trees to compute the entries in the substitution matrix.
Procedure:
- build a phylogenetic tree for sequences of high identity
- compute the relative mutability $m_a$ of each amino acid (frequency of substitutions involving a in the phylogenetic tree)
- compute $F_{ab}$ (number of substitutions of a with b)
- compute $M_{ab}$ (mutation probability that a will be replaced by b): $M_{ab} = m_b F_{ab} / \sum_c F_{cb}$
- compute the entry in the scoring matrix: $\text{score}(a, b) = \log(M_{ab} / \text{frequency of } a)$

PAM Matrices
Constructing a PAM matrix
Sequences: ACGCTAFKI GCGCTAFKI ACGCTAFKL GCGCTGFKI GCGCTLFKI ASGCTAFKL ACACTAFKL
[Figure: phylogenetic tree over these sequences, with branches labeled by the substitutions A->G, I->L, A->G, A->L, C->S, G->A]
Compute score(G, A): need $m_A$, $F_{GA}$ and $\sum_c F_{cA}$
$m_A = 4 / (2 \cdot 6)$
$F_{GA} = 3$
$\sum_c F_{cA} = 4$
$M_{GA} = m_A F_{GA} / \sum_c F_{cA}$
$\text{score}(G, A) = \log(M_{GA} / \text{frequency\_of\_G}) = \log(M_{GA} / (10/63))$
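The arithmetic of this example can be traced in a few lines. The sketch rests on two assumptions that the slide does not spell out: the mutability "4 / 2*6" is read as (substitutions involving A) / (2 × total substitutions), and the logarithm is taken to base 10. Variable names are illustrative.

```python
import math
from fractions import Fraction

sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
# Substitutions read off the tree in the example.
substitutions = [("A", "G"), ("I", "L"), ("A", "G"),
                 ("A", "L"), ("C", "S"), ("G", "A")]

involving_a = sum(1 for s in substitutions if "A" in s)          # 4
m_a = Fraction(involving_a, 2 * len(substitutions))              # 4 / (2*6), assumed reading
f_ga = sum(1 for s in substitutions if set(s) == {"G", "A"})     # 3, direction ignored
sum_f_ca = involving_a                                           # 4, all substitutions with A

m_ga = m_a * f_ga / sum_f_ca                                     # M_GA

total_residues = sum(len(seq) for seq in sequences)              # 63
freq_g = Fraction(sum(seq.count("G") for seq in sequences), total_residues)

score_ga = math.log10(float(m_ga / freq_g))                      # base 10 is an assumption
print(m_ga, freq_g, score_ga)
```

Under these assumptions M_GA comes out as 1/4, and the frequency of G counted from the seven sequences matches the 10/63 used on the slide.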