A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification Christopher Hemmerich Advisor: Dr. Sun Kim.

Outline
• Background & Motivation
• Data
• Methods
• Experiments
• Conclusions
• Acknowledgements
• Bibliography

Background & Motivation
• Prior work has studied correlation between positions in protein families (Cline et al., 2002; Martin et al., 2005)
• These studies used multiple sequence alignments (MSAs) to detect correlation and to link it to protein structure and residue co-evolution
• Correlation across an MSA has been tied to co-evolution and to contact points in the protein structure
• Less work has addressed correlation between residues within a single sequence, and its significance is less clear

Background & Motivation
• “protein sequences can be regarded as slightly edited random strings” (Weiss et al., 2000)
• Can we detect increased correlation in protein sequences compared with random sequences?
• Is there correlation between distant residues?
• Is correlation characteristic of the protein structure?
• Can we measure correlation for hydropathy or other interactions that are not residue-specific?

The Protein Families Database (Pfam)
• We use the Pfam-A subset, consisting of around 8000 curated families
• Pfam-A families vary widely in sequence length and in number of sequences
• Pfam-A contains multiple sequence alignments for its families
• Experiments are limited to sequences containing 100 or more residues to reduce sampling effects

Methods: Measuring Correlation
• How can we predict the next residue?
• Pick the most frequently observed residue
• Of the two example sequences on the slide, we feel more certain about our guess for the second, as it seems less random
• Let's look towards Information Theory

Methods: Measuring Correlation
• We can quantify the uncertainty in a sequence with Shannon Entropy
• Entropy is maximal when P_i is uniform for all i
• Entropy is 0 when P_i = 1 for some i
• The lower the Entropy, the better our prediction should be
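The entropy properties listed above can be checked with a minimal sketch. It assumes base-2 logarithms (bits) and uses observed residue frequencies as the probabilities P_i; the function name `shannon_entropy` is hypothetical and is not taken from the presentation.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy, in bits, of the residue frequencies in `sequence`.

    Assumption: log base 2; the slides do not state which base was used.
    """
    counts = Counter(sequence)
    total = len(sequence)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # observed frequency stands in for P_i
        entropy -= p * math.log2(p)
    return entropy
```

A sequence made of a single repeated residue has entropy 0, and a sequence whose residues are all equally frequent reaches the maximum for its alphabet size.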

• Should we guess 'N'? Is there a correlation between 'V' and 'K'? Between 'N' and 'N'?
• We can measure the correlation with Mutual Information for the sequence
• Substitute frequencies for probabilities
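One way to compute mutual information for a single sequence is sketched below. It is an illustration under stated assumptions, not necessarily the formula used in the study: pair frequencies are counted over residue pairs at a fixed separation, the marginals are taken from the first and second members of those pairs, and logarithms are base 2. The name `mutual_information` and its `gap` parameter are hypothetical.

```python
import math
from collections import Counter


def mutual_information(sequence: str, gap: int = 0) -> float:
    """MI, in bits, between residues separated by `gap` intervening positions.

    Assumptions: frequencies replace probabilities, marginals come from the
    observed pairs, and gap=0 means directly neighboring residues.
    """
    step = gap + 1
    pairs = [(sequence[i], sequence[i + step])
             for i in range(len(sequence) - step)]
    if not pairs:
        return 0.0  # sequence too short to form any pair at this separation
    n = len(pairs)
    joint = Counter(pairs)                  # counts of (first, second) pairs
    first = Counter(a for a, _ in pairs)    # marginal counts, first residue
    second = Counter(b for _, b in pairs)   # marginal counts, second residue
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((first[a] / n) * (second[b] / n)))
    return mi
```

Because the score depends only on the pattern of co-occurrence, consistently relabeling the residues of a sequence leaves it unchanged.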

Mutual Information Example
• Sequence: AANANK
• MI(AANANK) = MI(JJCJCL): relabeling the residues (A→J, N→C, K→L) does not change the score

Experiment: Measuring Correlation
• Sample 100 sequences from Pfam
• Shuffle each sequence 100 times
    • use the shuffle command from the HMMER package
    • preserves the length and residue frequencies of the sequence
    • randomly re-orders residues
• Compare the MI score of each sequence to the MI scores of its shuffles
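The shuffle comparison can be sketched as follows. This is only an illustration: Python's `random.shuffle` stands in for the HMMER shuffle command, the MI definition is the pair-based one assumed here (neighboring residues, base-2 logarithm), and the names `adjacent_mi`, `shuffle_scores`, and `fraction_below` are hypothetical.

```python
import math
import random
from collections import Counter


def adjacent_mi(sequence: str) -> float:
    """MI (bits) between neighboring residues; an assumed definition."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((first[a] / n) * (second[b] / n)))
               for (a, b), c in joint.items())


def shuffle_scores(sequence: str, n_shuffles: int = 100, seed: int = 0) -> list:
    """MI scores of random re-orderings that keep length and composition."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    residues = list(sequence)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # stand-in for the HMMER shuffle command
        scores.append(adjacent_mi("".join(residues)))
    return scores


def fraction_below(sequence: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """Fraction of shuffles whose MI is lower than the original sequence's."""
    observed = adjacent_mi(sequence)
    scores = shuffle_scores(sequence, n_shuffles, seed)
    return sum(score < observed for score in scores) / n_shuffles
```

A value of `fraction_below` near 1.0 would indicate that the original ordering carries more neighbor correlation than its shuffles do.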

Results: Correlation

Results: Normalized Correlation

Methods: Correlation Classification
• Nearest Neighbor classification algorithm
• Plot each N-dimensional vector as a point in space
[Figure: 3 training classes and a test vector]

Methods: Correlation Classification
• Measure the distance from the new point to each existing point
• Assign the family of the nearest training point to the test vector
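The two steps above amount to a 1-nearest-neighbor classifier. The sketch below assumes Euclidean distance, which the slides do not specify, and uses the hypothetical name `nearest_neighbor`.

```python
import math


def nearest_neighbor(training, test_vector):
    """Return the family label of the training vector closest to `test_vector`.

    `training` is a list of (vector, family) pairs.
    Assumption: Euclidean distance; the slides do not name the metric.
    """
    best_family = None
    best_distance = math.inf
    for vector, family in training:
        distance = math.dist(vector, test_vector)
        if distance < best_distance:
            best_family, best_distance = family, distance
    return best_family
```

For example, with training points `([0, 0], "A")`, `([1, 1], "B")`, and `([5, 5], "C")`, the test vector `[0.9, 1.2]` is assigned to family "B".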

Methods: NCBI BLAST Classification
• Build a BLAST database from the training sequences with formatdb
• BLAST the test sequence against the database with default parameters
• Classify the test sequence according to the highest-scoring match (high-scoring segment pair, HSP)
• If no sequence match is found, classification fails

Methods: Experimental Method
• Randomly select 10 families from the Pfam database
• Evaluate classification techniques on each possible combination of 3 families from the 10
• The results of all sub-experiments are summed
• Accuracy is measured by: (# of correct classifications) / (# of classification attempts)
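Choosing 3 families out of 10 gives 120 sub-experiments. The sketch below enumerates them and pools the counts into a single accuracy figure, on the assumption that "summed" means adding correct and attempted classifications across sub-experiments. The family names and the helper names `family_triples` and `pooled_accuracy` are hypothetical placeholders.

```python
from itertools import combinations


def family_triples(families):
    """Every unordered combination of 3 families from the given list."""
    return list(combinations(families, 3))


def pooled_accuracy(results):
    """Accuracy over summed sub-experiment results.

    `results` is a list of (correct, attempts) pairs, one per sub-experiment.
    """
    correct = sum(c for c, _ in results)
    attempts = sum(a for _, a in results)
    return correct / attempts if attempts else 0.0


# Hypothetical placeholder names standing in for 10 randomly chosen families.
families = [f"family_{i}" for i in range(10)]
triples = family_triples(families)
```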

Methods: Leave-one-out Validation
• Comprehensive Validation
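Leave-one-out validation holds out each sample in turn, classifies it using all remaining samples, and reports the fraction classified correctly. The sketch below pairs it with a Euclidean nearest-neighbor rule as an assumption; the slides do not say exactly how the procedure was combined with each classifier, and the name `leave_one_out_accuracy` is hypothetical.

```python
import math


def leave_one_out_accuracy(samples):
    """Fraction of samples whose nearest other sample shares their family.

    `samples` is a list of (vector, family) pairs and needs at least two
    entries. Assumption: Euclidean 1-nearest-neighbor as the classifier.
    """
    correct = 0
    for i, (vector, family) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # hold out sample i
        nearest = min(rest, key=lambda item: math.dist(item[0], vector))
        correct += nearest[1] == family
    return correct / len(samples)
```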

Results: Neighbor Correlation

Experiment: Long Range Correlation
• Extend the correlation measure beyond neighboring residues
• gap: the number of residues between the residues we are comparing
• We consider the pairing of all residues within 20 positions of each other
• MI Vector = [MI(0), MI(1), …, MI(19)]
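A 20-element MI vector of this kind could be built as sketched below, where gap g compares residue i with residue i + g + 1, so MI(0) covers neighboring residues. The pair-based MI definition, the base-2 logarithm, and the names `gapped_mi` and `mi_vector` are assumptions made for illustration.

```python
import math
from collections import Counter


def gapped_mi(sequence: str, gap: int) -> float:
    """MI (bits) between residues with `gap` residues between them."""
    step = gap + 1
    pairs = list(zip(sequence, sequence[step:]))
    if not pairs:
        return 0.0  # sequence shorter than the requested separation
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((first[a] / n) * (second[b] / n)))
               for (a, b), c in joint.items())


def mi_vector(sequence: str, max_gap: int = 20) -> list:
    """[MI(0), MI(1), ..., MI(max_gap - 1)] for one sequence."""
    return [gapped_mi(sequence, gap) for gap in range(max_gap)]
```

Each sequence is then represented by one fixed-length vector regardless of its own length, which is what allows a distance-based classifier to compare sequences.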

Results: 20D-Correlation Vector

Experiment: Physical Properties
• Not all intra-protein interactions are residue-specific
• Cline et al. (2002) explore the information attributed to hydropathy, charge, disulfide bonding, and burial
• Hydropathy was found to contain half as much information as the 20-element amino acid alphabet, and its 2-element alphabet is more resistant to finite-sample-size effects

Hydropathy Alphabet
• Hydrophobic: C, I, M, F, W, Y, V, L
• Hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G
• This partitioning is from Weiss et al. (2000)
• Every residue in a sequence is converted to a ‘+’ or ‘-’
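The conversion to the 2-element alphabet can be sketched as follows. The slide does not say which class maps to which symbol, so the choice of '+' for hydrophobic and '-' for hydrophilic is an assumption, as is the name `to_hydropathy`.

```python
HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")


def to_hydropathy(sequence: str) -> str:
    """Rewrite a protein sequence in the 2-element hydropathy alphabet.

    Assumption: '+' marks hydrophobic and '-' marks hydrophilic residues.
    """
    symbols = []
    for residue in sequence.upper():
        if residue in HYDROPHOBIC:
            symbols.append("+")
        elif residue in HYDROPHILIC:
            symbols.append("-")
        else:
            # Non-standard or ambiguous codes are not covered by the partition.
            raise ValueError(f"unexpected residue: {residue!r}")
    return "".join(symbols)
```

The converted string can be scored with the same mutual information measure as the original sequence, since that measure works on any alphabet.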

Results: Hydropathy Correlation

Experiment: Combined Vectors
• Combine the residue and hydropathy correlation vectors
• A single 40-dimensional vector per sequence

Results: Combined Vectors

Conclusions
• Correlation was strong enough to build sequence classifiers without using the sequence itself
• There is significant long-range correlation between protein sequence residues
• Correlation exists in terms of both residues and physical properties

Future Work
• More comprehensive study of long-range interactions
    • how much distance should we consider?
    • analyze gap distances individually and compare
    • look for the combination of distances and methods that most improves classification power
• Explore other physical properties
• Measure correlation of residue groups
• Investigate normalization or correction techniques to reduce sampling effects

Acknowledgements
• Dr. Sun Kim
• Dr. Mehmet Dalkilic
• The Center for Genomics and Bioinformatics
    • Computing resources
    • Support throughout this process

References
• Aha D, Kibler D, Albert M. Machine Learning.
• Bateman A, Coin L, et al. Nucleic Acids Research.
• Cline M, Karplus K, Lathrop R, et al. PROTEINS: Structure, Function, and Genetics.
• Kohavi R. International Joint Conference on AI, 1995.
• Martin L, Gloor G, Dunn S, Wahl L. Bioinformatics 21(22), 2005.
• Shannon CE. The Bell System Technical Journal.
• Weiss O, Jiménez-Montaño M, Herzel H. J. Theor. Biol.