TMpro: Transmembrane Helix Prediction using Amino Acid Properties and Latent Semantic Analysis. Madhavi Ganapathiraju, N. Balakrishnan, Raj Reddy and Judith Klein-Seetharaman. Carnegie Mellon University. 6th International Conference on Bioinformatics, Hong Kong, PR China, August 29th, 2007.
2 Outline. Introduction: membrane proteins, transmembrane helix prediction. Previous methods: drawbacks. Amino acid properties. Approach. Algorithm: features and models, evaluations. Web server.
3 Membrane Proteins. An important class of proteins that carries out many important functions. They provide access to the cell for drug targeting and are embedded in the cell or organelle membrane. [Figure: cell membrane with a membrane protein and a soluble protein]
4 Transmembrane Segment Characteristics. The transmembrane region is a 30 Å hydrophobic core lying between the cytoplasm and the extracellular side (both aqueous media). A helix has to be 19 residues long to go from one side to the other. Questions to be addressed by a prediction algorithm: How many transmembrane segments are there? Where are the transmembrane locations in the primary sequence? [Figure: side view of the membrane]
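The length requirement follows from helix geometry. The arithmetic can be sketched as below, assuming an ideal alpha-helix rise of about 1.5 Å per residue (a textbook value, not stated on the slide). Under that assumption a 30 Å core needs roughly 20 residues, in line with the slide's figure of 19. The function names are illustrative.

```python
# Assumption: ideal alpha-helix rise per residue (textbook value, not from the slide).
RISE_PER_RESIDUE_ANGSTROM = 1.5
# Hydrophobic core thickness quoted on the slide.
CORE_THICKNESS_ANGSTROM = 30.0


def residues_to_span(thickness, rise=RISE_PER_RESIDUE_ANGSTROM):
    """Number of helical residues needed to cover a given thickness (in Å)."""
    return thickness / rise


def span_of(n_residues, rise=RISE_PER_RESIDUE_ANGSTROM):
    """Length in Å covered by a helix of n_residues."""
    return n_residues * rise


print(residues_to_span(CORE_THICKNESS_ANGSTROM))  # residues for a 30 Å core
print(span_of(19))                                # Å covered by 19 residues
```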
5 Transmembrane Helix Prediction. Importance: an important protein family; structure and function; regions accessible from the extracellular side. Challenges: little available training data; overtraining; difficulty in discovering novel architectures.
6 Hydrophobicity Scale Methods. Average hydrophobicity over a 9-residue window, using scales such as KD, GES and WW. Limitations: segment boundaries are unclear and accuracy is low. [Figure: Kyte-Doolittle hydrophobicity profile]
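A minimal sketch of the windowed-hydrophobicity idea, using the published Kyte-Doolittle hydropathy values and the 9-residue window from the slide. The threshold of 1.6 and the function names are illustrative assumptions, not parameters taken from the talk; the example sequence is the one shown later in the slides.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Average KD hydropathy of every full window; one value per window start."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_windows(seq, window=9, threshold=1.6):
    """Start indices of windows whose average reaches the (assumed) threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


seq = "VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"  # example sequence from the slides
profile = hydropathy_profile(seq)
print(candidate_windows(seq))
```

The profile only flags hydrophobic stretches; it does not say where a segment starts or ends, which is the boundary limitation noted on the slide.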
7 Hidden Markov Model Methods (TMHMM). Current best methods use HMMs. Limitations: too many parameters and a restrictive topology. [Figure: actual vs. predicted transmembrane segments for a potassium channel]
TMpro: a property-based algorithm for transmembrane helix prediction
9 Opportunities for Improvement: Amino Acid Properties. Previous methods do not employ all possible property distributions; they find average occurrences of amino acids. [Figure panels: nonpolar residues, charged residues, aromatic residues]
10 Properties We Studied
11 Modified Representation of Primary Sequence. The amino acid sequence is rewritten as property sequences: charge, polarity, aromaticity, size and electronic properties.
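Rewriting a primary sequence as property sequences can be sketched as follows. The residue groupings below are simplified, hypothetical assignments chosen for illustration; the slides do not list which residue maps to which property value, so the sets and the one-letter codes (`p`, `n`, `-`) are assumptions.

```python
# Hypothetical, simplified groupings for illustration only (not from the slides).
POSITIVE = set("KRH")
NEGATIVE = set("DE")
NONPOLAR = set("AVLIMFWPGC")


def charge_sequence(seq):
    """Map each residue to 'p' (positive), 'n' (negative) or '-' (uncharged)."""
    return "".join(
        "p" if aa in POSITIVE else "n" if aa in NEGATIVE else "-"
        for aa in seq
    )


def polarity_sequence(seq):
    """Map each residue to 'n' (nonpolar) or 'p' (polar)."""
    return "".join("n" if aa in NONPOLAR else "p" for aa in seq)


seq = "VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"  # example sequence from the slides
print(seq)
print(charge_sequence(seq))
print(polarity_sequence(seq))
```

Each property sequence has the same length as the primary sequence, so the strings can be stacked and read column by column.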
12 Predictive Capability of Each Property. The parameters of TMHMM (v 1.0) were adjusted so that it emits one of the property values instead of an amino acid. Properties considered: polarity (polar, non-polar); aromaticity (aromatic, aliphatic, neutral); electronic properties (strong donor, weak donor, neutral, weak acceptor, strong acceptor). With 3-valued property observations the model reaches 91% of the accuracy obtained with 20-valued amino acid observations.
13 Approach: Biology-Language Analogy (Ganapathiraju et al. (2004), LNCS 3345).
Biology: multiple genome sequences. Language: raw text stored in databases, libraries, websites.
Biology: expression, folding, structure, function and activity of proteins. Language: meaning of words, sentences, phrases, paragraphs.
Biology: understand complex biological systems. Language: knowledge about a topic.
Mapping tasks: retrieval, summarization, translation, extraction, decoding.
14 Text Domain Equivalent: Documents and Words.
Words: property values. Documents: 15-residue windows.
Example sequence: VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK [Figure: the example sequence with its property-value strings aligned beneath it]
W1: positively charged; W2: polar; W3: nonpolar; W4: aromatic; W5: aliphatic; W6: strong electron acceptor; W7: strong electron donor; W8: weak electron acceptor; W9: weak electron donor; W10: medium sized.
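The words-and-documents view can be sketched as a count matrix with one row per word and one column per 15-residue window. Only four of the ten words are used here, the residue sets behind them are hypothetical simplifications, and the window step of 1 is an assumption (the slide does not say whether windows overlap).

```python
# Hypothetical residue sets for four of the slide's words (illustration only).
WORDS = {
    "positively_charged": set("KR"),
    "polar": set("DEKRHNQSTY"),
    "nonpolar": set("AVLIMFWPGC"),
    "aromatic": set("FWY"),
}


def windows(seq, size=15, step=1):
    """Split a sequence into fixed-size windows ("documents")."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def word_document_matrix(seq, size=15, step=1):
    """Return (word names, matrix) where matrix[i][j] counts word i in window j."""
    docs = windows(seq, size, step)
    names = list(WORDS)
    matrix = [
        [sum(aa in WORDS[name] for aa in doc) for doc in docs]
        for name in names
    ]
    return names, matrix


seq = "VQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"  # example sequence from the slides
names, W = word_document_matrix(seq)
for name, row in zip(names, W):
    print(name, row)
```

A residue can contribute to several words at once (for example, both positively charged and polar), which is why the rows are counted independently.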
15 Latent Semantic Analysis. Build the word-document matrix $W$ (words as rows, documents as columns) and decompose it as $W = USV^{T}$. For classification, the feature vectors $SV^{T}$ can be used, with the dimensions reduced to 4. [Figure: documents plotted on Dimension 1 vs. Dimension 2, showing distinct features for TM and non-TM windows]
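The decomposition and the reduced features can be sketched with NumPy's SVD. The matrix below is random toy data standing in for a real word-document matrix, and `lsa_features` is an illustrative name, not a function from the TMpro software.

```python
import numpy as np


def lsa_features(W, k=4):
    """Return the k-dimensional document features S_k V_k^T (one column per document)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = min(k, len(s))
    return np.diag(s[:k]) @ Vt[:k, :]


# Toy stand-in for a word-document matrix: 10 words x 50 windows of counts.
rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(10, 50)).astype(float)

features = lsa_features(W, k=4)
print(features.shape)
```

Because $U$ has orthonormal columns, the same features can be obtained as $U_k^{T} W$, so a new window's count vector $d$ can be mapped into the reduced space as $U_k^{T} d$ without recomputing the decomposition.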
16 Different Classifiers/Models. Support vector machines, neural networks, linear classifier, hidden Markov modeling, decision trees. The neural network with LSA features is called TMpro.
17 Evaluations. Benchmark server results; evaluation on larger datasets. [Table annotation on a compared method: uses evolutionary information and many more model parameters]
18 TMpro Web Interface. Novel features for manual annotation.
19 Acknowledgements. Co-authors: Judith Klein-Seetharaman, Raj Reddy, N. Balakrishnan. Web-site development: Christopher Jon Jursa, Hassan A. Karimi.
Thank you!
21 Larger training data does not improve TMHMM. STMHMM is TMHMM retrained on a recent set of 145 TM proteins.
22 Performance on Recent Large Dataset.
Columns: Method | Qok | F score | Qhtm %obs | Qhtm %prd | Q2 | Confusion with soluble
PDBTM (191 proteins, 789 TM segments): TMHMM, SOSUI, DAS-TMfilter, TMpro NN
MPtopo (101 proteins, 443 TM segments): TMHMM, SOSUI, DAS-TMfilter, TMpro NN
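As a rough illustration of the per-residue measure, Q2 is commonly defined as the percentage of residues whose two-state label (TM vs. non-TM) is predicted correctly. The sketch below assumes that definition and a hypothetical labeling convention in which `M` marks a transmembrane residue and `-` marks anything else. It also extracts segments from a label string, which answers the "how many segments, and where" questions. The segment-level scores (Qok, Qhtm) depend on an overlap criterion that the slides do not state, so they are not attempted here.

```python
def q2(observed, predicted):
    """Percentage of residues whose two-state label matches (assumed Q2 definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


def segments(labels, tm="M"):
    """(start, end) index pairs, end exclusive, for each run of TM labels."""
    runs, start = [], None
    for i, c in enumerate(labels):
        if c == tm and start is None:
            start = i                  # a TM run begins
        elif c != tm and start is not None:
            runs.append((start, i))    # the run ended at the previous residue
            start = None
    if start is not None:              # run extends to the end of the sequence
        runs.append((start, len(labels)))
    return runs


observed = "--MMMM--"    # toy labels, not real data
predicted = "--MMM---"
print(q2(observed, predicted))
print(segments(observed), segments(predicted))
```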