TMpro: Transmembrane Helix Prediction using Amino Acid Properties and Latent Semantic Analysis Madhavi Ganapathiraju, N. Balakrishnan, Raj Reddy and Judith.

Presentation transcript:

TMpro: Transmembrane Helix Prediction using Amino Acid Properties and Latent Semantic Analysis
Madhavi Ganapathiraju, N. Balakrishnan, Raj Reddy and Judith Klein-Seetharaman
Carnegie Mellon University
6th International Conference on Bioinformatics, Hong Kong, PR China, August 29th, 2007

2 Outline
Introduction; Membrane proteins; Transmembrane helix prediction; Previous methods; Drawbacks; Amino acid properties; Approach; Algorithm; Features and models; Evaluations; Web server.
Sections: Introduction, Properties, Approach, Algorithm, Web Server, Previous Methods

3 Membrane Proteins
An important class of proteins that carry out many important functions.
They provide access to the cell for drug targeting.
They are embedded in the cell or organelle membrane.
[Figure labels: cell membrane, membrane protein, soluble protein]

4 Transmembrane Segment Characteristics
The transmembrane region is a 30 Å hydrophobic core between the cytoplasm and the extracellular side, both of which are aqueous media.
A helix has to be 19 residues long to go from one side to the other.
Questions to be addressed by a prediction algorithm:
How many transmembrane segments are there?
Where are the transmembrane locations in the primary sequence?
[Figure: side view]

5 Transmembrane Helix Prediction
Important for: protein family; structure and function; regions accessible from the extracellular side.
Challenges: little available training data; overtraining; difficulty in discovering novel architectures.

6 Hydrophobicity scale
Average hydrophobicity over a 9-residue window.
Scales: KD scale, GES scale, WW scale…
Limitations: segment boundaries are unclear and accuracy is low.
[Figure: Kyte-Doolittle hydrophobicity profile]
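
The window-averaging idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slide's figure: the function names are hypothetical, and the threshold of 1.6 is an assumption chosen only for the example. The residue values are the published Kyte-Doolittle hydropathy scale.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean KD hydropathy of every full window in the sequence."""
    values = [KD[residue] for residue in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def hydrophobic_windows(sequence, window=9, threshold=1.6):
    """Return start indices of windows whose mean exceeds the threshold.

    The default threshold is an illustrative assumption, not a value
    taken from the presentation.
    """
    profile = hydropathy_profile(sequence, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]


if __name__ == "__main__":
    # Example sequence taken from a later slide of this presentation.
    seq = "VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    for start, mean in enumerate(hydropathy_profile(seq)):
        print(start, seq[start:start + 9], round(mean, 2))
```

Because every window is scored independently, neighbouring windows near a helix end tend to score similarly, which is one way to see why the segment boundary is hard to place with this approach.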

7 Hidden Markov Model Methods (TMHMM)
Current best methods use HMMs.
Limitations: too many parameters and a restrictive topology.
[Figure: actual vs. predicted segments for a potassium channel]

TMpro: property-based algorithm for transmembrane helix prediction

9 Opportunities for Improvement
Previous methods do not employ all possible property distributions; they find average occurrences of amino acids.
[Figure panels: nonpolar residues, charged residues, aromatic residues; caption: amino acid properties]

10 Properties We Studied

11 Modified Representation of Primary Sequence
Amino acid property sequences: charge, polarity, aromaticity, size, electronic properties.
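
As a rough illustration of a property sequence, the sketch below rewrites a primary sequence as a string of charge labels. The residue-to-label table is an assumption made for this example (in particular, treating histidine as positively charged); the presentation's own property table is not reproduced in the transcript. The function names and the label characters `p`, `n` and `-` are likewise hypothetical.

```python
# Illustrative charge classes. Treating histidine (H) as positive is an
# assumption for this sketch, not a table taken from the presentation.
CHARGE = {"K": "p", "R": "p", "H": "p", "D": "n", "E": "n"}


def property_sequence(sequence, mapping, default="-"):
    """Replace each residue by its property label, or `default` if unlisted."""
    return "".join(mapping.get(residue, default) for residue in sequence)


def charge_sequence(sequence):
    """Return the charge-label string for a protein sequence."""
    return property_sequence(sequence, CHARGE)


if __name__ == "__main__":
    seq = "VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    print(seq)
    print(charge_sequence(seq))
```

The same `property_sequence` helper could be reused with a different mapping for polarity, aromaticity, size or electronic properties, giving one label string per property.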

12 Predictive Capability of Each Property
Adjust the parameters of TMHMM (v 1.0) so that it emits one of the property values.
Properties considered:
Polarity: polar, non-polar
Aromaticity: aromatic, aliphatic, neutral
Electronic properties: strong donor, weak donor, neutral, weak acceptor, strong acceptor
3-valued property observations achieve 91% of the accuracy obtained with 20-valued amino acid observations.

13 Approach: Biology-Language Analogy
Biology: multiple genome sequences. Language: raw text stored in databases, libraries, websites.
Biology: expression, folding, structure, function and activity of proteins. Language: meaning of words, sentences, phrases, paragraphs.
Biology: understand complex biological systems. Language: knowledge about a topic.
Mapping: retrieval, summarization, translation, extraction, decoding.
Ganapathiraju, et al (2004) LNCS 3345

14 Text Domain Equivalent: Documents and Words
Documents: 15-residue windows
Words: property values
Example sequence: VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK
Property strings shown for the example: ----ppn-n-n---- -p--pp-p----p RRR OOO.OOO.O.OOoOO
W1: positively charged
W2: polar
W3: nonpolar
W4: aromatic
W5: aliphatic
W6: strong electron acceptor
W7: strong electron donor
W8: weak electron acceptor
W9: weak electron donor
W10: medium sized
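
The document and word analogy can be made concrete with a small sketch that slides a 15-residue window along a sequence and counts property words in each window. Everything named here is hypothetical: `RESIDUE_WORDS` is a partial, illustrative vocabulary of three words rather than the ten words W1 to W10 above, whose residue assignments are not given in the transcript, and the step of one residue between windows is also an assumption.

```python
from collections import Counter

# Hypothetical, partial vocabulary for illustration only. The presentation
# lists ten words (W1..W10); their residue assignments are not given here.
RESIDUE_WORDS = {
    "K": ["positive"], "R": ["positive"], "H": ["positive"],
    "D": ["negative"], "E": ["negative"],
    "F": ["aromatic"], "W": ["aromatic"], "Y": ["aromatic"],
}
VOCABULARY = ["positive", "negative", "aromatic"]


def windows(sequence, size=15, step=1):
    """Yield every full window ("document") of the given size."""
    for start in range(0, len(sequence) - size + 1, step):
        yield sequence[start:start + size]


def word_document_matrix(sequence, size=15, step=1):
    """Return word counts with one row per word and one column per window."""
    columns = []
    for window in windows(sequence, size, step):
        counts = Counter(
            word for residue in window
            for word in RESIDUE_WORDS.get(residue, [])
        )
        columns.append([counts[word] for word in VOCABULARY])
    # Transpose so that rows are words and columns are documents.
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    seq = "VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    for word, row in zip(VOCABULARY, word_document_matrix(seq)):
        print(f"{word:>8}", row)
```

A residue may contribute several words at once (one per property), which is why the mapping stores a list of words per residue rather than a single label.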

15 Latent Semantic Analysis
Build the word-document matrix W (rows: words; columns: documents).
Decompose it as W = U S V^T.
For classification, the feature vectors S V^T can be used.
Reduced dimensions: 4
Distinct features of TM and non-TM segments are achieved.
[Figure axes: Dimension 1, Dimension 2]
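
The decomposition on this slide can be written out with explicit dimensions. The block below is a sketch under stated assumptions: the symbols m (number of words), n (number of documents), k, the truncated factors U_k, S_k, V_k and the projected vector d-hat are notation introduced here for illustration. The last line, projecting a new window's word-count vector d into the reduced space, is standard LSA practice and is not stated on the slide.

```latex
\begin{align}
  W &= U S V^{T}, & W &\in \mathbb{R}^{m \times n} \\
  W &\approx U_k S_k V_k^{T}, & k &= 4 \\
  S_k V_k^{T} &= U_k^{T} W, & S_k V_k^{T} &\in \mathbb{R}^{k \times n} \\
  \hat{d} &= U_k^{T} d, & d &\in \mathbb{R}^{m}
\end{align}
```

The third line follows because the columns of U are orthonormal, so multiplying W by U_k^T keeps only the first k rows of S V^T. Each column of S_k V_k^T is then a 4-dimensional feature vector for one 15-residue window.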

16 Different Classifiers/Models
Support vector machines; neural networks; linear classifier; hidden Markov modeling; decision trees.
The neural network with LSA features is called TMpro.

17 Evaluations
Benchmark server results
[Annotation on the results: uses evolutionary information and many more model parameters]
Evaluation on larger datasets

18 TMpro Web Interface
Novel features for manual annotation

19 Acknowledgements
Co-authors: Judith Klein-Seetharaman, Raj Reddy, N. Balakrishnan
Web-site development: Christopher Jon Jursa, Hassan A. Karimi

Thank you!

21 Larger training data does not improve TMHMM
STMHMM is TMHMM trained with 145 recent TM proteins.

22 Performance on Recent Large Dataset
Table columns: Method; Qok; F score; Qhtm %obs; Qhtm %prd; Q2; Confusion with soluble
PDBTM (191 proteins, 789 TM segments): TMHMM, SOSUI, DAS TMfilter, TMpro NN
MPtopo (101 proteins, 443 TM segments): TMHMM, SOSUI, DAS TMfilter, TMpro NN
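
For readers unfamiliar with the column names, the sketch below shows one way the per-residue and per-segment measures could be computed. It is an illustration under assumptions, not the benchmark's scoring code: the label characters (`M` for a transmembrane residue, `-` otherwise) are hypothetical, and counting a segment as correct when it overlaps a counterpart by at least one residue is a simplification; benchmark definitions may require a minimum overlap.

```python
def q2(observed, predicted):
    """Percentage of residues whose two-state label is predicted correctly."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


def segments(labels, tm="M"):
    """Return (start, end) pairs, end exclusive, for each run of TM labels."""
    runs, start = [], None
    for i, label in enumerate(labels):
        if label == tm and start is None:
            start = i
        elif label != tm and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(labels)))
    return runs


def overlaps(a, b):
    """True if two half-open intervals share at least one residue."""
    return a[0] < b[1] and b[0] < a[1]


def segment_scores(observed, predicted):
    """Return (%obs, %prd) under the any-overlap assumption.

    %obs: share of observed segments overlapped by some predicted segment.
    %prd: share of predicted segments overlapping some observed segment.
    """
    obs, prd = segments(observed), segments(predicted)
    hit_obs = sum(any(overlaps(o, p) for p in prd) for o in obs)
    hit_prd = sum(any(overlaps(p, o) for o in obs) for p in prd)
    pct_obs = 100.0 * hit_obs / len(obs) if obs else 0.0
    pct_prd = 100.0 * hit_prd / len(prd) if prd else 0.0
    return pct_obs, pct_prd


if __name__ == "__main__":
    observed = "--MMMMM----MMMM--"
    predicted = "---MMMM----------"
    print(q2(observed, predicted), segment_scores(observed, predicted))
```

Q2 rewards residue-level agreement, while the segment measures ask whether each helix was found at all, so a method can score well on one and poorly on the other.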