Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.


Local alignments
The preferred method to compute regions of local similarity for two sequences of amino acids is to consider the sequences over their entire length and find the segments that optimize a similarity score computed from a substitution matrix. PAM and BLOSUM both provide a number of different matrices constructed to model similarity between amino acid sequences at different evolutionary distances. Here, we follow Altschul (1991) to investigate PAM matrices from an information theoretic perspective.

Caveats and assumptions
The following theory applies only to locally aligned segments that lack gaps. Why is this assumption easier to tolerate in local alignment than in global alignment? Why is it still restrictive for local alignments?

Notation and definitions
Amino acids: a_i
Substitution score of aligned amino acids a_i and a_j: s_ij
A Maximal Segment Pair (MSP) is a pair of equal-length segments from two amino acid sequences that, when aligned, have the maximum score.
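
The MSP definition above can be sketched in code. This is a minimal illustration, not the method used by any particular tool: because the segments are ungapped, every segment pair lies on one diagonal of the comparison, so a maximum-sum scan along each diagonal finds the best one. The function names and the +1/-1 scoring function are assumptions made for the example; a real search would use PAM or BLOSUM scores for s_ij.

```python
def match_mismatch(a, b):
    """Toy substitution score s_ij (an assumption): +1 match, -1 mismatch."""
    return 1 if a == b else -1


def maximal_segment_pair(x, y, score):
    """Return (best_score, start_in_x, start_in_y, length) of an ungapped MSP.

    Ties are resolved in favour of the first segment pair found.
    """
    best_score, best_i, best_j, best_len = None, 0, 0, 0
    # Diagonal d aligns x[i] with y[i - d].
    for d in range(-(len(y) - 1), len(x)):
        i = max(d, 0)
        j = i - d
        run_score, run_start = 0, i
        while i < len(x) and j < len(y):
            # A non-positive running score can never help: restart here.
            if run_score <= 0:
                run_score, run_start = 0, i
            run_score += score(x[i], y[j])
            if best_score is None or run_score > best_score:
                best_score = run_score
                best_i, best_j = run_start, run_start - d
                best_len = i - run_start + 1
            i += 1
            j += 1
    return best_score, best_i, best_j, best_len


x = "TTAGCGCTACGG"
y = "CCAGCGCTACAA"
s, i, j, n = maximal_segment_pair(x, y, match_mismatch)
print(s, x[i:i + n], y[j:j + n])
```

The scan visits each aligned pair of positions once, so the cost is proportional to the product of the two sequence lengths.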

Random model
For any two amino acid sequences, there exists at least one MSP. It is convenient to compute what MSP scores look like for random sequences to serve as a basis for comparison. We will consider a very simple model: each amino acid a_i appears independently at random with probability p_i, reflecting the actual frequencies of amino acids in protein sequences. What could be a more biologically accurate (yet mathematically less feasible) method for generating amino acid sequences?
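
As a rough sketch of this random model, the snippet below draws each residue independently with probability p_i. The four-letter alphabet and its frequencies are made-up placeholders for illustration, not measured amino acid frequencies, and `random_sequence` is a hypothetical helper name.

```python
import random


def random_sequence(length, freqs, rng=None):
    """Draw `length` residues independently, residue a_i with probability p_i.

    freqs maps each residue to its background probability p_i.
    """
    rng = rng or random.Random()
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


# Illustrative background frequencies (assumed values, not real data).
background = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
print(random_sequence(30, background, random.Random(0)))
```

Scoring many such random pairs gives an empirical picture of the MSP scores that arise by chance alone.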

More assumptions...

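[Figure: two substitution matrices over the alphabet {A, C, G, T, -}, one with 1 on the diagonal and one with c on the diagonal. For the segment ...AGCGCTAC..., the MSP score is 8 under the first matrix and c*8 under the second.]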

Random Model

Substitution matrices

Local alignment and information theory

Relative entropy and substitution matrices

Relative entropy
Relative entropy (KL divergence) is a measure of how closely related two probability distributions are. Given two probability distributions Q and P, relative entropy can be informally stated in several ways: as the number of additional bits required to code samples from P when using a code built for Q, or as the amount of information lost when Q is used and P is the true distribution of the data.
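
A minimal sketch of the standard definition, D(P || Q) = sum over x of P(x) * log2(P(x) / Q(x)), measured in bits. The two coin-like distributions are invented for illustration, and the function assumes Q(x) > 0 wherever P(x) > 0.

```python
import math


def relative_entropy(p, q):
    """KL divergence D(P || Q) in bits; p and q map outcomes to probabilities."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)


p = {"H": 0.5, "T": 0.5}
q = {"H": 0.25, "T": 0.75}
print(relative_entropy(p, q))  # extra bits per sample when coding P with Q's code
print(relative_entropy(q, p))  # a different value: the measure is not symmetric
print(relative_entropy(p, p))  # 0.0 for identical distributions
```

Note that relative entropy is not symmetric, so it matters which distribution is treated as the true one.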

Relative entropy and substitution matrices
But how does this relate to substitution matrices? If the target and background frequency distributions are closely related, then the relative entropy is low and it is very difficult to distinguish between the target and background frequencies. We would therefore require a much longer alignment to tell a true alignment from chance. On the other hand, if the target and background frequency distributions are very different, the relative entropy is high and much shorter alignments suffice.
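
The trade-off between relative entropy and alignment length can be sketched numerically. The snippet assumes the log-odds framing of Altschul's paper, which the surviving slide text does not spell out: target frequencies q_ij for aligned pairs, background frequencies p_i * p_j, and H = sum over i, j of q_ij * log2(q_ij / (p_i * p_j)) bits per aligned position. The two-letter alphabet, the frequency tables, and the 30-bit threshold are illustrative assumptions only.

```python
import math


def matrix_relative_entropy(q, p):
    """H in bits per aligned pair: relative entropy of target vs background."""
    return sum(
        qij * math.log2(qij / (p[a] * p[b]))
        for (a, b), qij in q.items()
        if qij > 0
    )


def approx_min_length(bits_needed, h):
    """Rough number of aligned positions needed to accumulate bits_needed."""
    return bits_needed / h


p = {"A": 0.5, "B": 0.5}  # toy background frequencies (assumed)
# Target far from background: aligned residues mostly identical.
q_far = {("A", "A"): 0.45, ("B", "B"): 0.45, ("A", "B"): 0.05, ("B", "A"): 0.05}
# Target close to background: aligned residues barely correlated.
q_near = {("A", "A"): 0.30, ("B", "B"): 0.30, ("A", "B"): 0.20, ("B", "A"): 0.20}

for name, q in (("far", q_far), ("near", q_near)):
    h = matrix_relative_entropy(q, p)
    print(name, round(h, 3), round(approx_min_length(30, h)))
```

When the target distribution sits close to the background, each aligned position contributes very little information, so far more positions are needed to reach the same threshold.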

Example 1 – cystic fibrosis
Variants in a transport protein have been associated with cystic fibrosis. A search of this gene in the PIR protein sequence database yields the table on the following slide.

Example 1 – cystic fibrosis
Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic Perspective”, Journal of Molecular Biology, 219.

Example 1 – cystic fibrosis
Of note, the best PAM-250 score is not higher than the highest score of a random alignment given the background frequencies. On the other hand, PAM-120 gives alignments in the same region with scores higher than the highest chance alignment. Why do you think PAM-120 is a better fit here?

References
Explains the connection between information theory and substitution matrices:
Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic Perspective”, Journal of Molecular Biology, 219.
Provide much of the theory for the above article:
Karlin, S., Dembo, A., Kawabata, T. “Statistical Composition of High-Scoring Segments from Molecular Sequences.” The Annals of Statistics 18 (1990), (2).
Karlin, S. and Altschul, S.F. “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.” PNAS (6).