Substitution matrices

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Bayesian Evolutionary Distance P. Agarwal and D.J. States. Bayesian evolutionary distance. Journal of Computational Biology 3(1):1— 17, 1996.
Sources Page & Holmes Vladimir Likic presentation: 20show.pdf
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
OUTLINE Scoring Matrices Probability of matching runs Quality of a database match.
Measuring the degree of similarity: PAM and blosum Matrix
DNA sequences alignment measurement
Last lecture summary.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Introduction to Bioinformatics
Optimatization of a New Score Function for the Detection of Remote Homologs Kann et al.
Heuristic alignment algorithms and cost matrices
Sequence analysis course
Scoring Matrices June 19, 2008 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Scoring Matrices June 22, 2006 Learning objectives- Understand how scoring matrices are constructed. Workshop-Use different BLOSUM matrices in the Dotter.
Introduction to bioinformatics
Sequence similarity.
Similar Sequence Similar Function Charles Yan Spring 2006.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
Scoring matrices Identity PAM BLOSUM.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
PAM250. M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence Alignments Revisited
Alignment IV BLOSUM Matrices. 2 BLOSUM matrices Blocks Substitution Matrix. Scores for each position are obtained frequencies of substitutions in blocks.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Substitution matrices
Dayhoff’s Markov Model of Evolution. Brands of Soup Revisited Brand A Brand B P(B|A) = 2/7 P(A|B) = 2/7.
Roadmap The topics:  basic concepts of molecular biology  more on Perl  overview of the field  biological databases and database searching  sequence.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
An Introduction to Bioinformatics
Substitution Numbers and Scoring Matrices
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
BLAST Workshop Maya Schushan June 2009.
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
Pairwise Sequence Analysis-III
Phylogeny Ch. 7 & 8.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Sequence Alignment.
Burkhard Morgenstern Institut für Mikrobiologie und Genetik Molekulare Evolution und Rekonstruktion von phylogenetischen Bäumen WS 2006/2007.
Construction of Substitution matrices
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
©CMBI 2009 Transfer of information The main topic of this course is transfer of information. In the protein world that leads to the questions: 1)From which.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Pairwise Sequence Alignment and Database Searching
Sequence similarity, BLAST alignments & multiple sequence alignments
Tutorial 3 – Protein Scoring Matrices PAM & BLOSUM
Pairwise Sequence Alignment (cont.)
Alignment IV BLOSUM Matrices
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Substitution matrices

Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance calculations

What is a Sequence Alignment? Given two nucleotide or amino acid sequences, determine if they are related (descended from a common ancestor) Technically, we can align any two sequences, but not always in a meaningful way

What is a Sequence Alignment? (cont.) HIGHLY RELATED: HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL RELATED: HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L L+++H+ K LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG SPURIOUS ALIGNMENT: HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE How to filter out the last one & pick up the second?

Why Should We Care? Fragment assembly in DNA sequencing – Experimental determination of nucleotide sequences is only reliable up to about 500-800 base pairs (bp) at a time – But a genome can be millions of bp long! – If fragments overlap, they can be assembled: ...AAGTACAATCA CAATTACTCGGA... – Need to align to detect overlap

Why Should We Care? (cont.) Finding homologous proteins and genes – i.e. evolutionarily related (common ancestor) – Thus want to avoid the spurious alignment

What do we do? Model evolutionary process -mutational changes to DNA Can you name some?

Mutations Nothing happens Substitutions Insertions Deletions Transitions (A-G; C-T) Transversions (A-C, A-T, C-T, T-G) Insertions Deletions

How do we do it? Choose a scoring scheme Choose an algorithm to find optimal alignment with respect to a scoring scheme Statistically validate alignment

Scoring Schemes Since goal is to find related sequences, want evolution-based scoring scheme Mutations occur often at the genomic level, but their rates of acceptance by natural selection vary depending on the mutation I.e. changing an AA to one with similar properties is more likely to be accepted Assume that all changes occur independently of each other and are Markovian (makes working with probabilities easier)

If AA ai is aligned with aj, then aj was substituted for ai ...KALM... ...KVLM... Was this due to an accepted mutation or simply by chance? If A or V is likely in general, then there is less evidence that this is a mutation Want the score sij to be higher if mutations more likely Take ratio of mutation prob. to prob. of AA appearing at random Generally, if aj is similar to ai in property, then accepted mutation more likely and sij higher

Mutations Only consider immediate mutations ai -> aj, not ai -> ak -> aj Mutations are undirected => scoring matrix is symmetric Markov assumptions

Nucleic acid substitution matrices Uniform mutation rates Higher transitions (a) than transversions (b) Pyrimidines C T A G a b b b b a Purines

Mutation matrices Uniform mutation rates A G T C A 0.99 G 0.00333 0.99 T 0.00333 0.00333 0.99 C 0.00333 0.00333 0.00333 0.99 3 fold higher transitions (a) than transversions (b) A G T C A 0.99 G 0.006 0.99 T 0.002 0.002 0.99 C 0.002 0.002 0.006 0.99 1% probability of change at each position (Mij values)

log-odds ratio Alignment: ATC AGT Probability under match hypothesis: P(alignment | match) = P(A,A | match)P(T,G | match)P(C,T | match) Probability under random hypothesis: P(alignment | random ) = P(A,A | random )P(T,G | random)P(C,T | random) Relative probability (odds ratio) R(alignment) = P(alignment | match) / P(alignment | random ) = R(A,A)R(T,C)R(C,T) For computational reasons we take logarithms log R(alignment) = log R(A,A) + log R(T,G) + log R(C,T)

log-odds ratio (cont.) Use mutation matrix to derive log odds matrix The probability (sij) of obtaining a match between nucleotides i and j, divided by the random probability of aligning i and j is given by sij = log (piMij / pipj) = log (Mij / pj) where Mij is the mutation matrix value and pi and pj are the fractional compositions of each nucleotide.

Substitution matrices Uniform mutation rates A G T C A 2 G -6 2 T -6 -6 2 C -6 -6 -6 2 3 fold higher transitions (a) than transversions (b) A G T C A 2 G -5 2 T -7 -7 2 C -7 -7 -5 2 1% probability of change at each position, log base 2

Gap penalties Why?

Gap penalties Modeling biology Insertions / deletions: Single nucleotide Unequal crossover Introns, transposons etc. Gap score wx for gap of length x consists of Gap opening (g) and gap extension (r) penalties. wx = g + rx

Protein matrices PAM BLOSUM

The PAM Transition Matrices Dayhoff et al. started with several hundred alignments between very closely related proteins Compute the frequency with which each AA is changed into each other AA changes over a short evolutionary distance (short enough where only 1% AAs change) 1 PAM = 1% point accepted mutation (accepted by natural selection, i. e. not altering the function of the protein)

Phylogenetic tree ACGH DBGH ADIJ CBIJ B-C A-D B-D A–C ABGH ABIJ I-G J-H ACGH DBGH ADIJ CBIJ

PAMs (cont.) Estimate pi with the frequency of AA ai over both sequences, i.e. # of ai’s/number of AAs Let fij = fji = # of ai <-> aj changes in data set, fi = Sj¹i fij and f = Si fi Define the scale to be the amount of evolution to change 1 in 100 AAs (on average) [1 PAM dist] Relative mutability of ai is the ratio of number of mutations to total exposure to mutation: mi = fi / (100fpi)

PAMs (cont.) If mi is probability of a mutation for ai, then Mii = 1 - mi is prob. of no change ai -> aj if and only if ai changes and ai -> aj given that ai changes, so Mij = Pr(ai -> aj) = Pr(ai -> aj | ai changed)Pr(ai changed) = (fij/fi)mi = fij/(100fpi) The 1 PAM transition matrix consists of the Mii’s and gives the probabilities of mutations from ai to aj

What About 2 PAM? How about the probability that ai -> aj in two evolutionary steps? It’s the prob. that ai -> ak (for any k) in step 1, and ak -> aj in step 2. This is SkMikMkj = Mij2

k PAM Transition Matrix In general, the probability that ai -> aj in k evolutionary steps is Mijk As k -> ¥, the rows of Mk tend to be identical with the ith entry of each row equal to pi A result of our Markovian assumption of mutation

Building a Scoring Matrix When aligning different AAs in two sequences, we want to differentiate mutations and random events We are thus interested in the ratio of transition probability to prob. of randomly seeing new AA Ratio > 1 if and only if mutation more likely than random event

Building a Scoring Matrix (cont’d) When aligning multiple AAs, ratio of probs for multiple alignment = product of ratios: Taking logs will let us use sums rather than products

Building a Scoring Matrix (cont’d) Final step: computation faster with integers than with reals, so scale up (to increase precision) and round: sij = C log2(Mij/pj) C is a scaling constant For k PAM, use Mijk

PAM Scoring Matrix Miscellany Pairs of AAs with similar properties (e.g. hydrophobicity) have high pairwise scores, since similar AAs are more likely to be accepted mutations. In general, low PAM numbers find short, strong local similarities and high PAM numbers find long, weak ones. Often multiple searches will be run, using e.g. 40 PAM, 120 PAM, 250 PAM.

BLOSUM Scoring Matrices Based on multiple alignments, not pairwise. Direct derivation of scores for more distantly related proteins. Only possible because of new data: multiple alignments of known related proteins.

BLOCKS database entry

BLOSUM Scoring Matrices (cont’d) Started with ungapped alignments from BLOCKS database Sequences clustered at L% sequence identity This time fij = # of ai <-> aj changes between pairs of sequences from different clusters, normalizing by dividing by (n1n2) = product of sizes of clusters 1 and 2 fi = Sj fij and f = Si fi (different from PAM i=j) Then the scoring matrix entry is

BLOSUM 50 Scoring Matrix

Substitution matrices: summary Nucleic acid matrices are model derived Amino acid matrices are derived from observed data PAM: closely related sequences BLOSUM: BLOCKS domain database ratio of odds: pr. of real / pr. of random match log-odds: additive scoring system