
Methods for Improving Protein Disorder Prediction
Slobodan Vucetic1, Predrag Radivojac3, Zoran Obradovic3, Celeste J. Brown2, Keith Dunker2
1 School of Electrical Engineering and Computer Science, 2 Department of Biochemistry and Biophysics, Washington State University, Pullman, WA; 3 Center for Information Science and Technology, Temple University, Philadelphia, PA 19122

ABSTRACT Attribute construction, choice of classifier and post-processing were explored for improving prediction of protein disorder. While ensembles of neural networks achieved the highest accuracy, the difference as compared to logistic regression classifiers was smaller than 1%. Bagging of neural networks, where moving averages over windows of length 61 were used for attribute construction, combined with post-processing by averaging predictions over windows of length 81, resulted in 82.6% accuracy for a larger set of ordered and disordered proteins than used previously. This result was a significant improvement over the previous methodology, which gave an accuracy of 70.2%. Moreover, unlike the previous methodology, the modified attribute construction allowed prediction at protein ends.

Motivation
Standard ``Lock and Key'' Paradigm for Protein Structure/Function Relationships (Fischer, Ber. Dt. Chem. Ges., 1894)
Amino Acid Sequence → 3-D Structure → Protein Function

Protein Disorder – Part of a Protein without a Unique 3D Structure
Example: Calcineurin Protein (Kissinger et al., Nature, 1995)

Overall Objective: Better Understand Protein Disorder
Hypothesis: Since amino acid sequence determines structure, sequence should determine lack of structure (disorder) as well.
Test:
–Construct a protein disorder predictor
–Check its accuracy
–Apply it to large protein sequence databases

Objective of this Study
Previous results showed that disorder can be predicted from sequence with ~70% accuracy (based on 32 disordered proteins). Our goals are to increase accuracy by:
–Increasing the database of disordered proteins
–Improving knowledge representation and attribute selection
–Examining predictor types and post-processing
–Performing extensive cross-validation using different accuracy measures

Data Sets
Searching for disordered proteins (DIFFICULT):
–Keyword search of PubMed for disorders identified by NMR, circular dichroism, or protease digestion
–Search over the Protein Data Bank (PDB) for disorders identified by X-ray crystallography
Searching for ordered proteins (EASY):
–Most proteins in the Protein Data Bank (PDB) are ordered

Data Sets
Set of protein disorders (D_145):
–The search revealed 145 nonredundant proteins (≥ 40 amino acids) with 16,705 disordered residues
Set of ordered proteins (O_130):
–130 nonredundant, completely ordered proteins with 32,506 residues were chosen to represent examples of protein order

Data Representation – Background
Conformation is mostly influenced by locally surrounding amino acids
Higher-order statistics are not very useful in proteins [Nevill-Manning, Witten, DCC 1999]
Domain knowledge is a source of potentially discriminative features

Attribute Selection (including protein ends)
[Figure: a sliding window of size Win moves along the amino acid sequence; the residue at the window center is labeled 1 (disordered) or 0 (ordered).]
Attributes calculated over the window: 20 amino acid compositions, K2 entropy, 14Å contact number, hydropathy, flexibility, coordination number, bulkiness, CFYW volume, net charge

Attribute Selection (including protein ends)
Attribute construction resembles low-pass filtering. Consequence:
–effective data size of D_145 is ~ 2 × 16,705 / Win
–effective data size of O_130 is ~ 2 × 32,506 / Win
K2 entropy – low-complexity proteins are likely disordered
Flexibility, hydropathy, etc. – correlated with disorder
20 AA compositions – occurrence or lack of some amino acids in the window is correlated with disorder incidence
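The window-based attribute construction can be sketched as follows. This is a minimal illustration, not the authors' code: the flexibility scale below is a made-up placeholder (the paper uses published per-residue scales), and truncating the window at the ends is the assumed mechanism that allows prediction at protein termini.

```python
# Sketch of window-based attribute construction (illustrative, not the
# authors' code). For each residue, attributes are moving averages of
# per-residue properties over a window of length WIN centered on that
# residue; near protein ends the window is truncated rather than
# dropped, which is what permits prediction at the ends.
WIN = 61
AAS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder flexibility values -- NOT the published scale.
FLEXIBILITY = {aa: 0.4 for aa in AAS}

def window_attributes(seq, win=WIN):
    """Return, for each position, the window-averaged flexibility and
    the window amino acid composition (fraction of each residue type)."""
    half = win // 2
    rows = []
    for i in range(len(seq)):
        chunk = seq[max(0, i - half): i + half + 1]  # truncated at ends
        flex = sum(FLEXIBILITY[aa] for aa in chunk) / len(chunk)
        comp = {aa: chunk.count(aa) / len(chunk) for aa in AAS}
        rows.append((flex, comp))
    return rows

rows = window_attributes("WCYLAAMAHQFAGAGKLKCTSALSCT", win=9)
print(len(rows))  # one attribute vector per residue
```

Because each attribute is an average over Win neighboring residues, adjacent attribute vectors are highly correlated, which is why the effective data size shrinks roughly in proportion to 1/Win.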

Disorder Predictor Models
We examine:
–Logistic Regression (LR): classification model, stable, linear
–Neural Networks: slow training, unstable, powerful, need a lot of data
–Ensembles of Neural Networks (Bagging, Boosting): very slow, stable, powerful

Post-processing
We examine LONG disordered regions:
–neighboring residues likely belong to the same ordered/disordered region
Predictions can be improved:
–perform moving averaging of predictions over a window of length Wout
Pipeline: Data → Disorder Predictor → Wout Filter → Prediction
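The post-processing filter can be sketched in a few lines. This is an assumed implementation of the moving average described above; the truncation behavior at protein ends is an assumption on our part.

```python
# Sketch of the post-processing step: smooth per-residue disorder
# predictions with a moving average of length WOUT, truncating the
# window at protein ends (assumed behavior).
WOUT = 81

def smooth(preds, wout=WOUT):
    half = wout // 2
    out = []
    for i in range(len(preds)):
        chunk = preds[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# An isolated spike inside an ordered stretch is flattened toward 0,
# reflecting the assumption that long regions share one label:
print(smooth([0, 0, 1, 0, 0], wout=3))
```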

Accuracy Measures
The length of disordered regions in different proteins varies from 40 to 1,800 AA
We measure two types of accuracy:
–per-residue (averaged over residues)
–per-protein (averaged over proteins)
ROC curve – plots the True Positive (TP) rate against the False Positive (FP) rate
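The distinction between the two measures can be made concrete. Below is a sketch with assumed definitions: per-residue accuracy pools all residues, so long disordered regions dominate; per-protein accuracy averages within each protein first, weighting every protein equally.

```python
# Sketch of the two accuracy measures (assumed definitions).
# Each inner list holds 1 (residue predicted correctly) or 0 per residue.

def per_residue_accuracy(proteins):
    correct = sum(r for prot in proteins for r in prot)
    total = sum(len(prot) for prot in proteins)
    return correct / total

def per_protein_accuracy(proteins):
    return sum(sum(prot) / len(prot) for prot in proteins) / len(proteins)

proteins = [[1, 1, 1, 1], [0, 1]]
print(per_residue_accuracy(proteins))  # 5/6: the long protein dominates
print(per_protein_accuracy(proteins))  # (1.0 + 0.5) / 2 = 0.75
```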

Experimental Methodology
Balanced data sets of order/disorder examples
Cross-validation:
–145 disordered proteins divided into 15 subsets (15-fold cross-validation for TP accuracy)
–130 ordered proteins divided into 13 subsets (13-fold CV for TN accuracy)
To prevent collinearity and overfitting, 20 attributes are selected (18 AA compositions, flexibility, and K2 entropy)

Experimental Methodology
2,000 examples randomly selected for training
Feedforward neural networks with one hidden layer and 5 hidden nodes; 100 epochs of resilient backpropagation
Bagging and boosting ensembles with 30 neural networks
Examined Win, Wout ∈ {1, 9, 21, 41, 61, 81, 121}
For each pair (Win, Wout), CV repeated 10 times for neural networks and once for logistic regression, bagging, and boosting
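The bagging scheme used for the ensembles can be sketched generically. This is not the authors' code: the base learner below is a trivial one-dimensional threshold stump (which simply splits at the bootstrap sample's mean, ignoring labels) standing in for the paper's neural networks; only the bootstrap-resample-and-average structure is the point.

```python
import random

# Minimal sketch of bagging (illustrative): train each base model on a
# bootstrap resample of the training set, then average the ensemble's
# predictions. The paper used 30 neural networks as base models; here a
# toy threshold stump stands in for them.

def train_stump(data):
    """data: list of (x, label) pairs. Returns a threshold classifier
    that splits at the mean x of its training sample (labels unused;
    purely a placeholder base learner)."""
    thr = sum(x for x, _ in data) / len(data)
    return lambda x: 1 if x >= thr else 0

def bagging(data, n_models=30, seed=0):
    rng = random.Random(seed)
    models = [
        train_stump([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_models)
    ]
    def predict(x):
        # Average the base models' votes -> score in [0, 1].
        return sum(m(x) for m in models) / n_models
    return predict

data = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
predict = bagging(data)
print(predict(0.05), predict(0.95))  # 0.0 1.0
```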

Results – Model Comparison
Per-protein accuracy, (Win, Wout) = (41,1)
Neural networks are slightly more accurate than linear predictors
An ensemble of NNs is slightly better than an individual NN
Boosting and bagging result in similar accuracy

TN rate is significantly higher than TP rate (~ 10%)
Indication that the attribute space coverage of disorder is larger than the coverage of order → disorder is more diverse than order

Results – Influence of Filter Size
Per-protein accuracy with bagging (curves shown for Win = 9, 21, 61)
Different pairs of (Win, Wout) can result in similar accuracy
Wout = 81 seems to be the optimal choice

Results – Optimal (Win, Wout)
Per-protein and per-residue accuracy of bagging
Per-residue accuracy gives higher values
For a wide range of Win, the optimal Wout = 81
The best result was achieved with (Win, Wout) = (61,81)

Results – ROC Curve
Compare (Win, Wout) = (21,1) and (61,81)
(Win, Wout) = (61,81) is superior: ~10% improvement in per-protein accuracy
(Win, Wout) = (21,1) corresponds to our previous predictor

Results – Accuracy at Protein Ends: Comparison on O_130 Proteins
Comparison of accuracies at the first 20 (Region I) and last 20 (Region II) positions of O_130 proteins
Solid: (Win = 61, Wout = 81); Dashed: (Win = 21, Wout = 1)

Results – Accuracy at Protein Ends: Comparison on D_145 Proteins
Averaged accuracies of the first 20 positions of 91 disordered regions that start at the beginning of the protein sequence (Region I) and 54 disordered regions that do not (Region II)
Averaged accuracies of the last 20 positions of 76 disordered regions that do not end at the end of the protein sequence (Region III) and 69 disordered regions that do (Region IV)
Solid: (Win = 61, Wout = 81); Dashed: (Win = 21, Wout = 1)

Conclusions
Modifications in data representation, attribute selection, and prediction post-processing were proposed
Predictors of different complexity were compared
Achieved a 10% accuracy improvement over our previous predictors
The difference in accuracy between linear models and ensembles of neural networks is fairly small

Acknowledgements
Support from NSF-CSE-IIS and NSF-IIS to Z.O. and A.K.D., and from N.I.H. 1R01 LM06916 to A.K.D. and Z.O., is gratefully acknowledged.