Support Vector Machine-based Transmembrane Protein Topology Prediction Tim Nugent.



Alpha-helical Transmembrane Proteins Transmembrane proteins fulfil many critical cellular functions. Comprise about 30% of the human proteome. Composed of hydrophobic, membrane-spanning alpha-helices, connected with loop regions. Poorly represented in structural databases. Predicting their structure and topology is therefore an important challenge for bioinformatics.

Transmembrane Protein Topology The topology of a transmembrane protein describes which regions are membrane-spanning and which are 'inside' or 'outside' (e.g. cytoplasmic/extracellular or cytoplasmic/lumenal). Number and position of TM helices. Position of the N-terminus.

Early Hydrophobicity-based Approaches To generate data for a plot, the protein sequence is scanned with a moving window of a fixed number of residues. At each position, the mean hydrophobic index of the amino acids within the window is calculated and that value is plotted at the midpoint of the window. Aquaporin KGVWTQAFWKA V TAEFLAMLIFVLLSVGSTINWGGSEN
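The moving-window calculation described on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the slide does not name a hydrophobicity scale or a window size, so the Kyte-Doolittle scale and a 19-residue window are assumptions, as is joining the aquaporin fragment into one string by removing its spaces.

```python
# Kyte-Doolittle hydropathy index (assumed scale; the slide does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return (midpoint_index, mean_hydropathy) for every full window.

    The window size of 19 is an assumed default, not a value from the slide.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start + half, mean))  # plot the mean at the window midpoint
    return profile


# Aquaporin fragment from the slide, with the spaces removed (an assumption).
fragment = "KGVWTQAFWKAVTAEFLAMLIFVLLSVGSTINWGGSEN"
for midpoint, mean in hydropathy_profile(fragment, window=19):
    print(midpoint, fragment[midpoint], round(mean, 2))
```

Peaks in the printed profile mark candidate membrane-spanning stretches; choosing a cut-off for calling a helix is a separate decision that this sketch does not make.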

Discriminating between Inside and Outside Loops Hydrophobic: Val, Phe, Ile, Leu, Met. Positive: Lys, Arg, His. Cytoplasmic loops are enriched in positively charged residues: the 'positive-inside rule' of von Heijne.
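As a rough illustration of the positive-inside rule, the sketch below counts positively charged residues in alternating loops and assigns the cytoplasmic side to whichever set of loops carries more of them. The function names are hypothetical, and counting only Lys and Arg by default is an assumption; the slide also lists His, which can be included by passing `positive="KRH"`.

```python
def count_positive(segment, positive="KR"):
    """Number of positively charged residues in a loop segment."""
    return sum(1 for aa in segment if aa in positive)


def predict_n_terminus(loops, positive="KR"):
    """Guess where the N-terminus lies from loop charge.

    `loops` holds the loop sequences in N-to-C order. Successive loops lie on
    alternating sides of the membrane, so loops 0, 2, 4, ... share a side with
    the N-terminus. Returns 'in', 'out' or 'undetermined' (tie).
    """
    same_side = sum(count_positive(loop, positive) for loop in loops[0::2])
    far_side = sum(count_positive(loop, positive) for loop in loops[1::2])
    if same_side > far_side:
        return "in"
    if far_side > same_side:
        return "out"
    return "undetermined"


# Toy loops (made up for illustration, not taken from a real protein).
print(predict_n_terminus(["MKRKL", "GSD", "RRKA", "TNE"]))
```

This simple count ignores loop length and the helix boundaries themselves, both of which a real predictor has to deal with.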

Machine Learning-based Approaches

Using Support Vector Machines for Topology Prediction Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, Phobius) and neural networks (MEMSAT3) have been developed. They have achieved significant improvements in prediction accuracy (~80%). However, none of the top-scoring methods use SVMs. While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. To deal with TM topology prediction, multiple SVMs have to be combined, e.g. TM helix / Loop; Inside Loop / Outside Loop; Signal Peptide / ¬Signal Peptide; Re-entrant Loop / ¬Re-entrant Loop.

Assembling a Novel Data Set of Transmembrane Proteins To study and predict features of transmembrane (TM) proteins, a high-quality data set containing sequences with experimentally confirmed TM regions is essential for both training and validation. Based on the Möller set and the MPTOPO database. Novel TM sequences parsed from SWISS-PROT and searched against the PDB with BLAST. Fragments, chain breaks, colicins, venoms etc. removed. Homology-reduced at 40% sequence identity. Topologies determined by OPM or PDB_TM. Since PDB structures of TM proteins contain no lipid, theoretical approaches are used to predict the position of the membrane relative to the structure, and thus the TM helix boundaries. OPM uses water-lipid transfer energy minimisation. PDB_TM uses hydrophobicity/structural feature analysis.
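One common way to homology-reduce a set at a fixed identity cut-off is a greedy filter, sketched below. The slide does not say which procedure or tool was used, so the greedy strategy, the function names and the handling of a pair at exactly 40% are all assumptions. The `positional_identity` helper is a toy stand-in for an alignment-based identity and is only meaningful for the equal-length demo strings.

```python
def positional_identity(a, b):
    """Toy identity: fraction of matching positions, no alignment performed."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def homology_reduce(sequences, identity, threshold=0.40):
    """Greedily keep sequences that are not too similar to any kept sequence.

    `sequences` is a list of (name, sequence) pairs, visited in order, so the
    result depends on that order. A pair at exactly `threshold` is kept
    (an assumption).
    """
    kept = []
    for name, seq in sequences:
        if all(identity(seq, kept_seq) <= threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


# Made-up demo sequences.
demo = [("a", "AAAAAAAAAA"), ("b", "AAAAAAAAAC"), ("c", "CCCCCCCCCC")]
print([name for name, _ in homology_reduce(demo, positional_identity)])
```

In practice `identity` would wrap a pairwise alignment or the output of a sequence-clustering tool.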

Assembling a Novel Data Set of Transmembrane Proteins Theoretical membrane placement onto the crystal structure of the mechanosensitive channel protein MscS (PDB code 2oau) by OPM (left) and PDB_TM (right). The membrane region is between the red and blue bars.

Re-entrant Helices Re-entrant helices in Aquaporin Z (left) from Escherichia coli (PDB code 1rc2) and Potassium channel (right) from Bacillus cereus (PDB code 2ahy) marked with black arrows.

MySQL Table Schema

Data Set Composition

Support Vector Machine Training Data set of 131 non-redundant protein sequences. Jackknife cross-validation: sequences with >25% sequence identity removed from training sets. Signal peptide SVM: 10-fold cross-validation plus additional data from the Phobius set and SWISS-PROT. PSI-BLAST profiles vs UniRef90, with an E-value threshold for inclusion. Normalise by Z-score. Residue sliding window. Transduction. Optimise window size, kernel choice and parameters using the Matthews correlation coefficient (MCC).
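The feature preparation and scoring steps named on this slide can be illustrated as follows. This sketch assumes that Z-score normalisation is applied per profile column, that windows are zero-padded at the sequence ends, and that the window size is odd; none of these details are stated on the slide. The MCC itself is the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

```python
import math


def zscore_columns(profile):
    """Z-score each column of a profile (a list of per-residue score rows)."""
    n = len(profile)
    columns = list(zip(*profile))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in profile]


def window_features(profile, centre, window):
    """Concatenate the rows in a window centred on `centre`, zero-padded."""
    half = window // 2
    width = len(profile[0])
    features = []
    for i in range(centre - half, centre + half + 1):
        features.extend(profile[i] if 0 <= i < len(profile) else [0.0] * width)
    return features


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Tiny made-up profile with two columns per residue.
toy = zscore_columns([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])
print(window_features(toy, centre=0, window=3))
print(mcc(tp=40, tn=45, fp=5, fn=10))
```

MCC is a natural choice for per-residue tuning because helix and loop residues are not equally frequent, and plain accuracy would reward always predicting the majority class.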

Per Residue SVM Prediction Accuracy

Dynamic Programming Simplified version of the original MEMSAT algorithm, treating TM helices as discrete units rather than separating them into inside, outside and middle components. Re-entrant helix and signal peptide states were added. Residues were therefore predicted to lie in one of five different topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide. For evaluating signal peptide preference, positive signal peptide scores for residues up to position 30 in a target sequence were added to the outside loop score and subtracted from the inside loop score, in order to direct prediction towards a non-cytoplasmic amino terminus. The value was also scaled by a factor of 10 and subtracted from the TM helix SVM score to prevent TM helix prediction. For the same reason, positive re-entrant helix scores were scaled by a factor of 10 and subtracted from the TM helix SVM score.
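The score adjustments and the helix-as-a-unit idea can be sketched as below. This is not the MEMSAT or TMSVM implementation: it models only three labels (inside loop `i`, outside loop `o`, TM helix `M`), forces the sequence to start and end in a loop, and uses assumed helix length limits. The function names and the way per-residue scores are passed in are hypothetical.

```python
NEG = float("-inf")


def adjust_scores(helix, inside, outside, signal, reentrant,
                  sp_limit=30, scale=10.0):
    """Apply the signal peptide and re-entrant adjustments from the slide."""
    helix, inside, outside = list(helix), list(inside), list(outside)
    for i in range(len(helix)):
        if i < sp_limit and signal[i] > 0:
            outside[i] += signal[i]
            inside[i] -= signal[i]
            helix[i] -= scale * signal[i]
        if reentrant[i] > 0:
            helix[i] -= scale * reentrant[i]
    return helix, inside, outside


def best_topology(helix, inside, outside, min_len=15, max_len=30):
    """Highest-scoring labelling; each helix is one segment that flips sides."""
    n = len(helix)
    loop = {"i": inside, "o": outside}
    other = {"i": "o", "o": "i"}
    prefix = [0.0]
    for h in helix:
        prefix.append(prefix[-1] + h)
    # L[s][t]: best score for residues [0, t) with residue t-1 in a loop on side s.
    # E[s][t]: best score for residues [0, t) with residue t-1 ending a helix
    #          that exits to side s.
    L = {s: [NEG] * (n + 1) for s in "io"}
    E = {s: [NEG] * (n + 1) for s in "io"}
    back_L = {s: [None] * (n + 1) for s in "io"}
    back_E = {s: [0] * (n + 1) for s in "io"}
    for t in range(1, n + 1):
        for s in "io":
            for length in range(min_len, max_len + 1):
                start = t - length
                if start < 1:  # keep at least one loop residue before a helix
                    break
                cand = L[other[s]][start] + prefix[t] - prefix[start]
                if cand > E[s][t]:
                    E[s][t], back_E[s][t] = cand, length
        for s in "io":
            if t == 1:
                prev, kind = 0.0, "start"
            else:
                prev, kind = max((L[s][t - 1], "loop"), (E[s][t - 1], "helix"))
            L[s][t], back_L[s][t] = prev + loop[s][t - 1], kind
    # Trace back from the better of the two final loop states.
    side = max("io", key=lambda s: L[s][n])
    labels, t, in_loop = [], n, True
    while t > 0:
        if in_loop:
            labels.append(side)
            in_loop = back_L[side][t] != "helix"
            t -= 1
        else:
            length = back_E[side][t]
            labels.extend("M" * length)
            t -= length
            side, in_loop = other[side], True
    return "".join(reversed(labels))


# Made-up scores for a 12-residue toy sequence with one helix-like stretch.
helix = [-2] * 3 + [1] * 6 + [-2] * 3
inside = [1] * 3 + [-1] * 6 + [0] * 3
outside = [0] * 3 + [-1] * 6 + [1] * 3
print(best_topology(helix, inside, outside, min_len=4, max_len=8))
```

In the toy example, adding a strong positive signal peptide score at the first residue and re-running `best_topology` on the adjusted scores moves the N-terminus to the outside, which is the effect the slide describes.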

Overall Prediction Accuracy Benchmark results for the SVM-based method ('TMSVM') against a selection of leading topology predictors. 'Correct signal peptide' and 'correct re-entrant helix' refer to correct topology prediction for proteins containing these features. TMSVM was able to detect signal peptides with 92% accuracy, and re-entrant helices with 39% accuracy. No false positives of either class were predicted. OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between test and training sets. Tested against the Möller (low-resolution) data set, TMSVM scores 77%, the same as MEMSAT3.

Formate Dehydrogenase

Ubiquinol Oxidase

Glycerol uptake facilitator

ABC transporter BtuCD

Photosystem I

Discriminating between TM and Globular Proteins For SVM training, we used 416 randomly chosen proteins from the MEMSAT3 [11] set, which consists of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins. The remaining 2269 sequences were used as test cases. PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance, again removing sequences from the training fold with greater than 25% sequence identity to any sequence in the test fold. Window size = 33, Kernel = RBF, MCC = 0.78
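The cross-validation protocol described here, in which training sequences too similar to any test sequence are dropped, can be sketched as a small generator. The `identity` callable is a hypothetical placeholder for whatever pairwise identity measure is available, and the fold assignment below is a simple deterministic stride rather than a random split.

```python
def folds_with_identity_filter(names, identity, k=10, max_identity=0.25):
    """Yield (train, test) name lists for k-fold cross-validation.

    A training sequence is dropped from a fold if its identity to any test
    sequence in that fold exceeds `max_identity`.
    """
    folds = [names[i::k] for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [
            name for name in names
            if name not in held_out
            and all(identity(name, t) <= max_identity for t in test)
        ]
        yield train, test


# Made-up identities: only sequences 'a' and 'b' are close homologues.
CLOSE_PAIRS = {frozenset("ab"): 0.9}


def toy_identity(x, y):
    return CLOSE_PAIRS.get(frozenset((x, y)), 0.1)


for train, test in folds_with_identity_filter(list("abcdef"), toy_identity, k=3):
    print("test:", test, "train:", train)
```

Filtering only the training side keeps every sequence in exactly one test fold, so per-fold results can still be pooled over the whole data set.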

Whole Genome Analysis

Conclusions Novel SVM-based approach predicts correct topology with 88% accuracy, 9% higher than the next best method, OCTOPUS. Incorporates signal peptide and re-entrant helix prediction. Signal peptide containing proteins correctly predicted with 92% accuracy. Re-entrant helix containing proteins correctly predicted with 55% accuracy, leaving room for improvement. Good TM/globular protein discrimination: combined with SP prediction, highly suited to whole genome analysis. Further work: SVMs to predict amphipathic/pore-forming helices.