Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.

Slides:



Advertisements
Similar presentations
Transmembrane Protein Topology Prediction Using Support Vector Machines Tim Nugent and David Jones Bioinformatics Group, Department of Computer Science,
Advertisements

11/9/99ICTAI-99, Chicago1 Protein Secondary Structure Prediction Using Data Mining Tool C5 Meiliu Lu †, Du Zhang †, Hongjun Xu †, Ken Tse-yau Lau ‡, and.
Comparison of Data Mining Algorithms on Bioinformatics Dataset Melissa K. Carroll Advisor: Sung-Hyuk Cha March 4, 2003.
Protein Backbone Angle Prediction with Machine Learning Approaches by R Kang, C Leslie, & A Yang in Bioinformatics, 1 July 2004, vol 20 nbr 10 pp
Protein Threading Zhanggroup Overview Background protein structure protein folding and designability Protein threading Current limitations.
Relational Learning with Gaussian Processes By Wei Chu, Vikas Sindhwani, Zoubin Ghahramani, S.Sathiya Keerthi (Columbia, Chicago, Cambridge, Yahoo!) Presented.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
CISC667, F05, Lec20, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction Protein Secondary Structure.
BLOSUM Information Resources Algorithms in Computational Biology Spring 2006 Created by Itai Sharon.
Face Processing System Presented by: Harvest Jang Group meeting Fall 2002.
Aula 4 Radial Basis Function Networks
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Structures.
Radial Basis Function Networks
Protein Tertiary Structure Prediction
Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization Man-Wai MAK and Wei WANG The Hong Kong Polytechnic.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
CRB Journal Club February 13, 2006 Jenny Gu. Selected for a Reason Residues selected by evolution for a reason, but conservation is not distinguished.
Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines Blaise Gassend, Charles W. O'Donnell, William Thies,
Sequence analysis: Macromolecular motif recognition Sylvia Nagl.
Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method Wen-Lian Hsu Institute of Information Science Academia Sinica, Taiwan.
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Comparison of Bayesian Neural Networks with TMVA classifiers Richa Sharma, Vipin Bhatnagar Panjab University, Chandigarh India-CMS March, 2009 Meeting,
Overview of Supervised Learning Overview of Supervised Learning2 Outline Linear Regression and Nearest Neighbors method Statistical Decision.
Jun-Won Suh Intelligent Electronic Systems Human and Systems Engineering Department of Electrical and Computer Engineering Speaker Verification System.
BING: Binarized Normed Gradients for Objectness Estimation at 300fps
Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by a grant from the National.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
Protein Secondary Structure Prediction G P S Raghava.
Data Classification with the Radial Basis Function Network Based on a Novel Kernel Density Estimation Algorithm Yen-Jen Oyang Department of Computer Science.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence-
Introduction to Protein Structure Prediction BMI/CS 576 Colin Dewey Fall 2008.
Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis Kei Hashimoto, Yoshihiko Nankaku, and Keiichi.
1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng.
LOGO iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance- Pairs and Reduced Alphabet Profile into the General Pseudo Amino.
MaskIt: Privately Releasing User Context Streams for Personalized Mobile Applications SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Applications of Supervised Learning in Bioinformatics Yen-Jen Oyang Dept. of Computer Science and Information Engineering.
A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction Li Lihong (Anna Lee) Cumputer science 22th,Apr.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
Prediction of Protein Binding Sites in Protein Structures Using Hidden Markov Support Vector Machine.
Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda Nagoya Institute.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine 朱林娇 14S
Comparative methods Basic logics: The 3D structure of the protein is deduced from: 1.Similarities between the protein and other proteins 2.Statistical.
Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.
Convolutional Restricted Boltzmann Machines for Feature Learning Mohammad Norouzi Advisor: Dr. Greg Mori Simon Fraser University 27 Nov
Ubiquitination Sites Prediction Dah Mee Ko Advisor: Dr.Predrag Radivojac School of Informatics Indiana University May 22, 2009.
Machine Learning: A Brief Introduction Fu Chang Institute of Information Science Academia Sinica ext. 1819
Hierarchical Mixture of Experts Presented by Qi An Machine learning reading group Duke University 07/15/2005.
Graphical Models for Segmenting and Labeling Sequence Data Manoj Kumar Chinnakotla NLP-AI Seminar.
Predicting Structural Features Chapter 12. Structural Features Phosphorylation sites Transmembrane helices Protein flexibility.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
BIOINFORMATION A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation - - 王红刚 14S
Chapter 14 Protein Structure Classification
Feature Extraction Introduction Features Algorithms Methods
Introduction Feature Extraction Discussions Conclusions Results
Prediction of RNA Binding Protein Using Machine Learning Technique
Extra Tree Classifier-WS3 Bagging Classifier-WS3
חיזוי ואפיון אתרי קישור של חלבון לדנ"א מתוך הרצף
Support Vector Machine (SVM)
Sequence Based Analysis Tutorial
Protein Structures.
Papers 15/08.
Protein Structural Classification
Support Vector Machines 2
Presentation transcript:

Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins

Outlines Background Methods The QUICKRBF algorithm The RVKDE algorithm The hybrid predictor The training data set The testing data set Performance metrics Results comparison with Boden’s predictor comparison with Kuznetsov’s predictor Conclusion

Background (1/2) Proteins are dynamic macromolecules which may undergo conformational transitions upon changes in environment, and protein flexibility is correlated to essential biological functions the GTPase Ras protein the U1 snRNP A from Homo sapiens the prion protein (PrP) There were methods which identify conformationally flexible regions in proteins folding of protein tertiary structures analysis of protein sequences

Background (2/2) In this atricle, we propose a novel approach to design a sequence-based predictor, which incorporates two classifiers based on two distinctive supervised learning algorithms, for identifying polypeptide segments that may fold to form different secondary structure elements in different environments the QUICKRBF algorithm the relaxed variable kernel density estimation (RVKDE) algorithm

Methods: the QUICKRBF algorithm (1/2)

Methods: the QUICKRBF algorithm (2/2) Structure of a RBF network The input layer broadcasts the feature vector of the input query sample Each node in the hidden layer generates its associated radial basis function Each node in the output layer computes a linear combination of radial basis functions associated to the hidden nodes

Methods: the RVKDE algorithm (1/2) The RVKDE algorithm Randomly and independently select a set of sampling instances from the training set For each sampling instance, which is accompanied with its k’ nearest training instances, calculate the corresponding approximate probability density function

Methods: the RVKDE algorithm (2/2) For data classification applications, one kernel density estimator is created to approximate the distribution of each class of training instances A query instance located at v is predicted to belong to the class that gives the maximum likelihood value

Methods: the hybrid predictor (1/2)

Methods: the hybrid predictor (2/2) For each amino acid residue Both the QUICKRBF and the RVKDE classifiers predict this residue as conformationally ambivalent or rigid If the RVKDE classifier predicts it as conformationally ambivalent but the QUICKRBF classifier predicts it as rigid, check whether 3 of the 4 predictions by QUICKRBF for adjacent residues are ambivalent or rigid If the QUICKRBF classifier predicts it as conformationally ambivalent but the RVKDE classifier predicts it as rigid, check whether 3 of the 4 predictions by RVKDE for adjacent residues are ambivalent or rigid

Methods: the testing data set The experiments reported in this article have been conducted with an independent testing data set which contains 170 protein chains from MolMovDB

Methods: the training data set (1/3) All the protein chains in the PDB (released on 01-April-2008) that have the same entry name and primary accession number in SwissProt (release 55.1) are grouped (11084 groups) The BLAST package is invoked to check the redundancy among the groups of protein chains (3496 groups) The CLUSTALW package and the DSSP package are invoked to label each residue in the protein chains with secondary structure types A conformationally ambivalent region is defined to be a segment of 3 or more consecutive residues within which each residue and the aligned residues have discrepant types of secondary structures

Methods: the training data set (2/3) The training data set contains 3404 representative protein chains with a feature vector The feature vectors are derived from the position specific scoring matrices (PSSM) computed by the PSI-BLAST package with sliding window size set to 7

Methods: the training data set (3/3)

Results: performance metrics

Results: comparison with Boden’s predictor

Results: comparison with Kuznetsov’s predictor (1/2) By Kuznetsov’s definition, a residue in a protein chain is said to be flexible, if its phi degree varies more than 62 or its psi degree varies more than 68 in two different conformations we labelled the residues in our collection of training protein chains based on Kuznetsov’s definition and then trained our hybrid predictor with this separately-generated training data set The independent testing data set contains 137 protein chains; none of them is homologous to Kuznetsov’s or our training protein chains with sequence identity higher than 20%

Results: comparison with Kuznetsov’s predictor (2/2)

Conclusion Experimental results show that the overall performance delivered by the hybrid predictor proposed in this article is superior to the performance delivered by the existing predictors Room for improvement Investigate how physiochemical properties of polypeptide segments can be more effectively exploited Investigate the effects of incorporating the lately-developed machine learning approaches, e.g. the random forest design and the multi-stage design

Q & AQ & A

Thank you for your attention!