Position-Specific Substitution Matrices

PSSM

A regular substitution matrix uses the same score for any given pair of amino acids regardless of where in the protein they occur. This is only an approximation: within a family of related proteins, some residues are very important for function and hardly change at all, while others can vary quite a bit. Position-specific substitution matrices (PSSMs) address this problem by developing a different substitution matrix for each position in a set of aligned proteins.

A PSSM requires a set of aligned, related proteins.

Gaps can be a problem: do you use a separate gap opening and extension penalty for each position, or the same values for all positions? Most PSSMs use a single set of values. Hidden Markov models address this question specifically.

PSI-BLAST is the primary general-purpose use of PSSMs.

Some Aligned Sequences

gi|154350476|gb|ABS72555.1|   AVPLMQPEAPIVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE
gi|225184649|emb|CAB11883.2|  AVPLMQPEAPFVGTGMEYVSGKDSGAAVICKHPGIVERVEAKNVWVRRYE
gi|157679649|gb|ABV60793.1|   AVPLMQPESPIVGTGMEYVSGKDSGAAVICRYPGVVERVEAKNIWVRRYE
gi|52346465|gb|AAU39099.1|    AVPLMQPESPIVGTGMEYVSAKDSGAAVICRHPGIVERVEAKNIWVRRYE
BMQ_0128                      AVPLLNPEAPIVGTGMEYVSGKDSGAAVICKYPGVVERVEAKQIIVRRYE
gi|42735098|gb|AAS39038.1|    AVPLMNPESPIVGTGMEYVSAKDSGAAVICKHPGIVERVEAREVWVRRYV
gi|10172738|dbj|BAB03845.1|   AVPLLVPEAPIVGTGMEHVSAKDSGAAIVSKHRGIVERVTAKEIWVRRLE
gi|56908158|dbj|BAD62685.1|   AVPLLVPEAPLVGTGMEHVSAKDSGAAVVSKYAGIVERVTAKEIWVRRIE
                              ****: **:*:******:**.******::.:: *:**** *::: ***

Making a PSSM

There are several variations on the theme, but the best way is analogous to how substitution matrices are made, using the log-odds method.

Start with a set of aligned sequences. For each position, count the number of each type of amino acid that occurs. The frequency of amino acid $a$ in column $u$ is $q_{u,a}$. Note that we aren't counting substitutions here, since in a multiple alignment we don't know how the different sequences are related.

We also need the frequency of amino acid $a$ among sequences in general, $p_a$.

The odds ratio is the frequency of amino acid $a$ given real-world evolution divided by the frequency expected if amino acids occurred at random with their general frequencies: $q_{u,a} / p_a$.

Finally, take the logarithm so that scores can be added. $m_{u,a}$ is the score used for amino acid $a$ in column $u$:

$$m_{u,a} = \log \frac{q_{u,a}}{p_a}$$

This needs to be done for all amino acids in all columns. It is possible to weight the scores to compensate for bias in the original sequence selection.
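The counting and log-odds steps can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the three-column alignment is invented toy data, the background frequencies are assumed to be uniform (1/20 each), and base-2 logarithms are an arbitrary choice, since the slide does not fix a base. The function name `log_odds_pssm` is made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy alignment (invented for illustration): four sequences, three columns.
alignment = ["AKV", "AKI", "ARV", "AKV"]

# Assumed uniform background p_a = 1/20; a real analysis would use
# amino acid frequencies measured over a large database.
background = {a: 1 / 20 for a in AMINO_ACIDS}

def log_odds_pssm(seqs, p):
    """Return one dict per column mapping each observed residue to log2(q / p)."""
    n = len(seqs)
    pssm = []
    for column in zip(*seqs):  # walk the alignment column by column
        counts = Counter(column)
        pssm.append({a: math.log2((c / n) / p[a]) for a, c in counts.items()})
    return pssm

pssm = log_odds_pssm(alignment, background)
for u, scores in enumerate(pssm, start=1):
    print(u, {a: round(s, 2) for a, s in scores.items()})
```

Only residues that were actually observed in a column receive a score here; residues that never occur are left out because their log-odds score is undefined.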

The Missing Data Problem

You are trying to determine the frequency of all 20 amino acids at each position in the sequence. Inevitably, some amino acids never occur at certain positions. If one of them does turn up at that position in a new sequence, its score $m_{u,a} = \log (q_{u,a} / p_a)$ would be negative infinity, the logarithm of 0. This is not a useful score.

The simplest solution is to start counting at 1 instead of 0. The added counts are referred to as pseudocounts. Normally the frequency of amino acid $a$ in column $u$ is $q_{u,a} = n_{u,a} / N$, where $n_{u,a}$ is the number of times $a$ occurs in column $u$ and $N$ is the number of sequences being examined. With pseudocounts:

$$q_{u,a} = \frac{n_{u,a} + 1}{N + 20}$$

The $N + 20$ term is there because there are 20 amino acids.

Slightly more sophisticated is using the proportion of each amino acid in the database, $p_a$. The sum of all 20 $p_a$ is 1, so:

$$q_{u,a} = \frac{n_{u,a} + p_a}{N + 1}$$

Related to this is using data from a substitution matrix as the source of the proportions. By adding constants, you can vary the proportions of pseudocounts and real counts, depending on how much real data you have. More sophisticated methods also exist.
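Both pseudocount formulas are easy to check numerically. The sketch below applies them to column 18 of the eight aligned sequences shown earlier (six Y and two H). The uniform background and the function names are assumptions made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def add_one_frequencies(column):
    """q = (n + 1) / (N + 20): every amino acid starts with a count of 1."""
    counts = Counter(column)
    n_seqs = len(column)
    return {a: (counts[a] + 1) / (n_seqs + 20) for a in AMINO_ACIDS}

def background_frequencies(column, p):
    """q = (n + p_a) / (N + 1): one pseudocount shared out according to p_a."""
    counts = Counter(column)
    n_seqs = len(column)
    return {a: (counts[a] + p[a]) / (n_seqs + 1) for a in AMINO_ACIDS}

column = "YYYYYYHH"                          # column 18 of the alignment
uniform = {a: 1 / 20 for a in AMINO_ACIDS}   # assumed background, not measured

q1 = add_one_frequencies(column)
q2 = background_frequencies(column, uniform)

# An amino acid never seen in this column now gets a finite, negative score.
score_unseen = math.log2(q1["W"] / uniform["W"])
print(round(q1["Y"], 3), round(q2["Y"], 3), round(score_unseen, 3))
```

In both variants the 20 frequencies still sum to 1, which is why the denominators are $N + 20$ and $N + 1$ respectively.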

Information and Entropy

The modern theory of information was developed by Claude Shannon in 1948. It is the basis for most modern communication. A common application is ZIP files, which compress information.

Entropy is a measure of the uncertainty of the results of an event. Entropy is the number of bits (binary, yes/no decisions) needed to store or communicate the results. The result of a coin flip, with 2 equally likely outcomes, needs 1 bit to describe. Rolling a die, with 6 equally likely outcomes, needs somewhat more than 2 bits to describe. The concept is related to entropy in thermodynamics.

The entropy of an event ($H$) is -1 times the sum, over all possible outcomes, of the probability of each outcome ($p_x$) times the base 2 logarithm of that probability:

$$H = -\sum_x p_x \log_2 p_x$$

Units are bits. By convention, $p \log_2 p = 0$ when $p = 0$.

Thus for a coin flip, $p_H = p_T = 1/2$. The base 2 log of 1/2 is -1, so $H = -(1/2 \times -1 + 1/2 \times -1) = 1$, or 1 bit of information.

For a 6-sided die, each possible outcome has a 1/6 probability. $\log_2(1/6) = -2.58$, so rolling a die has $H = -6 \times 1/6 \times -2.58 = 2.58$ bits of information.
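The coin and die values can be reproduced with a short function. This is a minimal sketch; the name `entropy` is chosen for this example, and zero-probability outcomes are skipped to honor the convention that they contribute nothing.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = entropy([0.5, 0.5])      # two equally likely outcomes
die = entropy([1 / 6] * 6)      # six equally likely outcomes
certain = entropy([1.0, 0.0])   # only one outcome can happen

print(coin, round(die, 2), certain)
```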

Information Content

Outcomes with different probabilities affect the entropy. Entropy is maximal when all outcomes are equally likely. Entropy is 0 when there is only 1 possible outcome.

Imagine a loaded die, where the probability of a 6 is 1/2 and the probability of any other number is 1/10:

$$H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + 5 \cdot \tfrac{1}{10}\log_2\tfrac{1}{10}\right) = -\left(-0.5 + 5 \cdot \tfrac{1}{10} \cdot (-3.322)\right) = 0.5 + 1.661 = 2.161 \text{ bits}$$

Compare this with a fair die, which has an entropy of 2.58 bits. The fair die's outcome is more uncertain than the loaded die's.

The information of an event is the loss of uncertainty concerning an outcome. It is the difference between the maximum possible entropy (with all outcomes equally likely) and the actual entropy calculated with different outcomes having different probabilities.
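The loaded-die arithmetic is easy to get wrong by hand, so here is a small numerical check. The 6 contributes 0.5 bits and the other five faces together contribute about 1.661 bits, giving roughly 2.161 bits in total. The helper names are illustrative, and taking the maximum entropy as the log of the number of outcomes follows the definition in the text.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information(probs):
    """Maximum entropy for this many outcomes minus the actual entropy."""
    return math.log2(len(probs)) - entropy(probs)

loaded = [0.1] * 5 + [0.5]       # faces 1-5 at 1/10 each, face 6 at 1/2
h_loaded = entropy(loaded)       # about 2.161 bits
i_loaded = information(loaded)   # about 0.424 bits
i_fair = information([1 / 6] * 6)

print(round(h_loaded, 3), round(i_loaded, 3), round(i_fair, 3))
```

A fair die carries essentially no information in this sense, because its entropy is already at the maximum.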

Sequence Logos

A sequence logo is a visual representation of a PSSM, showing the relative importance of different positions and which residues contribute the most. It is based on Shannon information theory.

Consider a single position in a set of aligned protein sequences. If all 20 amino acids are equally likely, the entropy of that position is

$$H_{max} = -20 \cdot \tfrac{1}{20}\log_2\tfrac{1}{20} = -\log_2\tfrac{1}{20} = \log_2 20 = 4.32$$

The information $I$ of position $u$ is $I = H_{max} - H_u$. $I = 0$ when all amino acids are equally likely. If there is only 1 amino acid ever found at a position (completely conserved), there is no uncertainty about it, so its entropy is 0 and the information content is 4.32 bits.

A more complicated example: say that this position has a 1/3 chance of being R and a 2/3 chance of being K.

$$H_u = -\left(\tfrac{1}{3}\log_2\tfrac{1}{3} + \tfrac{2}{3}\log_2\tfrac{2}{3}\right) = -(-0.528 + -0.390) = 0.918$$

$I = 4.32 - 0.918 \approx 3.40$ bits.

For a sequence logo, the relative frequency of each amino acid is multiplied by the position's information content, which is then converted into a height.
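The letter heights for the R/K example can be computed directly from this definition. The following is a sketch under the assumptions already stated in the text (20 amino acids, no small-sample correction); `logo_heights` is a name invented for this example, and real logo software may apply further corrections.

```python
import math

H_MAX = math.log2(20)  # about 4.32 bits for 20 equally likely amino acids

def logo_heights(freqs):
    """Map each residue to its frequency times the column's information content."""
    h = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    info = H_MAX - h
    return {a: p * info for a, p in freqs.items()}

heights = logo_heights({"R": 1 / 3, "K": 2 / 3})
print({a: round(x, 2) for a, x in heights.items()})
```

The heights in a column add up to that column's information content, so a fully conserved position produces a single letter about 4.32 bits tall.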

PSI-BLAST

PSI-BLAST is part of the BLAST programs available at NCBI. Its purpose is finding new family members that don't hit the original query.

It is an iterative process:

First, the database (usually nr) is searched with an initial query sequence, and all hits with E-values better than some cutoff (default = 0.005) are taken.
These aligned sequences are used to construct a PSSM.
The PSSM is then used to search the database again.
If new sequences better than the E-value cutoff are found, the PSSM is updated to include them, and the search is run again.
Eventually, no new sequences are found and the PSI-BLAST search is complete.

PSI-BLAST is considerably slower than regular BLAST. Each iteration has to be started manually, from the top of the Descriptions area.

After 3 iterations with ORF00135 we get no more new hits. With "conserved hypothetical protein" BMQ_0196 (next slide) we get new hits for at least 4 iterations, and also extensions of the match length for many hits. Most are hypothetical genes, but some mention possible functions.

Unfortunately, you can't download the PSSM, but you can save it and re-use it if you like.
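The iteration itself is a simple loop that stops when a round adds nothing new. The sketch below shows only that control flow and does not call BLAST. `search` and `build_profile` are hypothetical callables standing in for the database search and the PSSM construction, and the tiny `neighbors` table is invented toy data in which each entry lists the sequences it can detect directly.

```python
def iterate_until_converged(query, search, build_profile, max_rounds=10):
    """Repeat search and profile building until a round adds no new hits.

    `search` and `build_profile` are hypothetical callables supplied by the
    caller. Returns the final hit set and the number of searches that
    contributed hits.
    """
    hits = set(search(query))
    rounds = 1
    while rounds < max_rounds:
        profile = build_profile(hits)
        new_hits = set(search(profile)) - hits
        if not new_hits:
            break  # converged: the profile finds nothing it has not seen
        hits |= new_hits
        rounds += 1
    return hits, rounds

# Toy stand-ins (not real sequence data): who can detect whom directly.
neighbors = {"q": {"a"}, "a": {"b"}, "b": {"c"}, "c": set(), "z": set()}

def toy_search(profile):
    """Return everything detectable from a single query or from any profile member."""
    members = [profile] if isinstance(profile, str) else profile
    return set().union(*(neighbors[m] for m in members))

hits, rounds = iterate_until_converged("q", toy_search, frozenset)
print(sorted(hits), rounds)
```

In the toy data the query finds only `a` directly, yet the profile built from each round's hits reaches `b` and then `c`, which mirrors how a PSSM can pick up family members that the original query misses.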

Another Sequence for PSI-BLAST

>BMQ_0196 | QMB1551_chromosome:164387-165967 | conserved hypothetical protein
MDKLMNRSWVMKIIALLLAFMLYLSVNLDDGASSSNKILNRSSSANTGVETLTDVPVQVS
YNEKNRIVRGVPDTVIMTLEGPKNILAQTKLQKDYQAYIDLDNLSLGQHRVKVQYRNISD
NLNVVVKPDIVNVTIEERDSKQFSVEASYDKNKVKNGYEAGEATVSPRAVTVTGASSQLD
QVAYVKAIIDLDNASKTVTKQATVVALDKNLNKLNVTVQPETVNVTIPVRNISKKVPIDV
IQEGTPGDGVNITKLEPKTDTVKIIGPSDSLEKIDKIDNIPVDVTGITKSKDIKVNVPVP
DGIDSVSPKQITVHVEVDKQGDEKDAEETDASAAETKSFKNLPVSLTGQSSKYTYELLSP
TSVDADVKGPKSDLDKLTKSGISLSANVGNLSAGEHTVPIIINSPDSVTSTLSTKQAKVR
VTAKKQSGTNDEQTDDKETSGSTSDKETSGSTSDKETKPDTGTGSGTNPGTGNSGDSADK
PSEETDTPEDNTDTPTDSTETGDDSSNQSDENSTPVDGQTDNTSGN