Combining Predictors for Short and Long Protein Disorder

Presentation transcript:

Combining Predictors for Short and Long Protein Disorder

Zoran Obradovic, Slobodan Vucetic and Kang Peng
Information Science and Technology Center, Temple University, PA 19122

A. Keith Dunker and Predrag Radivojac
Center for Computational Biology and Bioinformatics, Indiana University, IN 46202

NIH grant R01 LM007688-01A1 to A.K. Dunker and Z. Obradovic is gratefully acknowledged.

Introduction

Protein structure: under physiological conditions, the amino acid sequence of a protein folds spontaneously into a specific (native) three-dimensional (3-D) structure, or conformation.

[Figure: the 4 levels of protein structure, with hydrogen bonds and a β-strand labeled]

Importance of Protein Structure

The “central dogma”: amino acid sequence determines protein structure, and protein structure determines biological function.

Amino Acid Sequence -> 3-D Structure -> Biological Function

> 1NLG:_ NADP-LINKED GLYCERALDEHYDE-3-PHOSPHATE
EKKIRVAINGFGRIGRNFLRCWHGRQNTLLDVVAINDSGGVKQASHLLKYDSTLGTFAAD
VKIVDDSHISVDGKQIKIVSSRDPLQLPWKEMNIDLVIEGTGVFIDKVGAGKHIQAGASK
VLITAPAKDKDIPTFVVGVNEGDYKHEYPIISNASCTTNCLAPFVKVLEQKFGIVKGTMT
TTHSYTGDQRLLDASHRDLRRARAAALNIVPTTTGAAKAVSLVLPSLKGKLNGIALRVPT
PTVSVVDLVVQVEKKTFAEEVNAAFREAANGPMKGVLHVEDAPLVSIDFKCTDQSTSIDA
SLTMVMGDDMVKVVAWYDNEWGYSQRVVDLAEVTAKKWVA

Function: Gene Transfer

Thus, it is important to know a protein’s structure in order to understand its function and other biological properties.

Protein Structure Prediction

The sequence-structure gap
- Current experimental structure determination techniques, e.g. X-ray diffraction and NMR spectroscopy, are still slow and expensive, and have their limitations
- As a result, there are fewer than 30,000 experimental protein structures, compared to more than 1.6 million known protein sequences

Protein structure prediction: predicting protein structures from amino acid sequences using computational methods

Aspects of protein structure prediction
- 1D: secondary structures, solvent accessibility, transmembrane helices, signal peptides/cleavage sites, coiled coils, disordered regions
- 2D: inter-residue contacts, inter-strand contacts
- 3D: individual atom coordinates in the tertiary structure (the ultimate goal)

The CASP Experiments

Critical Assessment of Techniques for Protein Structure Prediction

The primary goal
- To obtain an in-depth and objective assessment of current methods for predicting protein structure from amino acid sequence

The procedure
- Proteins with “soon to be solved” structures are selected as prediction targets, and their amino acid sequences are made available
- Prediction teams submit their prediction models before the experimental structures are released
- Prediction models are compared to experimental structures for detailed evaluation by independent assessors

                # of targets   # of participating groups   # of submitted models
CASP6 (2004)    76             208                         41283
CASP5 (2002)    67             215                         28728
CASP4 (2000)    43             163                         11136
CASP3 (1998)                   98                          3807
CASP2 (1996)    42             72                          947
CASP1 (1994)    33             35                          135

CASP Website: http://predictioncenter.llnl.gov/

Prediction Categories in CASP6

- Tertiary structure (3-D coordinates for individual atoms) prediction
  - Comparative/Homology modeling
  - Fold recognition
  - New fold modeling
- Disordered region prediction (since CASP5)
- Domain boundary prediction (new)
- Residue-residue contact prediction (new)

Secondary structure prediction was excluded in CASP6.

In CASP6, 20 groups participated in disordered region prediction, compared with only 6 groups in CASP5.

Disordered Region (DR)

Part of a protein, or a whole protein, that does NOT have a stable 3D structure in its native state. Disordered regions:
- Perform important biological functions
- Have distinct sequence properties
- Evolve faster than ordered regions
- Are common in nature

[Figure: Kissinger et al, 1995]

Other definitions of disordered region
- Missing coordinates (used by CASP)
- High B-factors
- Random coils
- NOn-Regular Secondary Structure (NORS)

Prediction of Disordered Regions

- One example for each sequence position (residue)
- Class label 0/1: disordered / ordered
- Input: attributes derived from a local window of length Win

Amino acid sequence: K Q L L W C Y L A A M A H Q F G A G K L K C T S A T T W Q G

Attributes derived from the local window
- 20 AA frequencies
- K2-entropy (sequence complexity)
- Flexibility
- Hydropathy
- more …
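The window-based attribute construction can be illustrated with a minimal Python sketch. It computes only the 20 amino acid frequencies and a complexity value for one window. Two things here are assumptions, not taken from the slides: Shannon entropy is used as a stand-in for K2-entropy, and windows are simply truncated at the sequence ends. The function name `window_attributes` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def window_attributes(sequence, center, win):
    """Return (frequencies, entropy) for the window of length `win` centered at `center`.

    Assumption: the window is truncated at the sequence ends; the slides
    do not say how terminal positions are handled.
    """
    half = win // 2
    window = sequence[max(0, center - half): center + half + 1]
    counts = Counter(window)
    freqs = [counts[aa] / len(window) for aa in AMINO_ACIDS]
    # Assumption: Shannon entropy (in bits) stands in for K2-entropy here.
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    return freqs, entropy


# One attribute vector per residue of the example sequence from the slide.
sequence = "KQLLWCYLAAMAHQFGAGKLKCTSATTWQG"
attributes = [window_attributes(sequence, i, win=15) for i in range(len(sequence))]
```

Each residue gets its own attribute vector, which matches the "one example for each sequence position" setup; flexibility and hydropathy would be added as further window averages.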

Long DR Predictors on Short DR

Disordered regions can be divided into 2 groups according to their lengths
- short DRs: 30 consecutive residues or shorter
- long DRs: longer than 30 consecutive residues

Our previous disorder predictors were specific to long DRs
- Predictors: VL-XT, VL2, VL3, VL3H, VL3P, VL3B
- Accuracies: from 70% (VL-XT) to 85% (VL3P)

They were less successful on short DRs, as shown in CASP5
- 25-66% per-residue accuracy on short DRs
- 75-95% per-residue accuracy on long DRs

Possible reasons
- The window lengths for attribute construction and post-filtering were optimized for long DRs
- Training data did NOT include any short DRs
- Short DRs differ from long DRs in amino acid composition, flexibility index, hydropathy and net charge
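The short/long split by length can be sketched in a few lines of Python. The sketch assumes per-residue labels are given as a string of `D` (disordered) and `o` (ordered), the encoding used in the meta predictor example of this transcript; the function name `disordered_regions` is hypothetical.

```python
from itertools import groupby

LONG_THRESHOLD = 30  # from the slides: short DRs are 30 residues or shorter


def disordered_regions(labels):
    """Yield (start, length, kind) for each run of 'D' in a per-residue label string."""
    position = 0
    for symbol, run in groupby(labels):
        length = len(list(run))
        if symbol == "D":
            kind = "long" if length > LONG_THRESHOLD else "short"
            yield position, length, kind
        position += length


# An 8-residue region is short; a 31-residue region is long.
labels = "oo" + "D" * 8 + "ooo" + "D" * 31
regions = list(disordered_regions(labels))
```

A region of exactly 30 residues counts as short under this definition, since only regions longer than 30 residues are long.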

Amino Acid Compositions of Short DRs

[Figure: amino acid frequency difference from Globular-3D (Radivojac et al., Protein Science, 2004)]

Consequence: a predictor specialized for short disordered regions is necessary

Our Approach in CASP6

Idea: two specialized predictors, one for long and one for short disordered regions, plus a meta predictor that estimates which specialized predictor is more suitable for the current input.

[Diagram: the input feeds the Long Disorder Predictor (>30aa), the Short Disorder Predictor (≤30aa) and the Meta Predictor; the outputs OL and OS are weighted by wL and wS and combined into the final prediction]

In CASP5, we used only the Long Disorder Predictor component.

The Training Dataset

Dataset        Number of chains   Number of long DRs   Number of short DRs
LONG (a)       153                163                  24
SHORT (c)      511                43                   630
ORDER (a, b)   290
XRAY (d)       381                                     329
TOTAL (e)      1335               230                  983

a) LONG and ORDER: training data for VL3 predictors (Z. Obradovic, K. Peng, S. Vucetic, P. Radivojac, C. J. Brown, A. K. Dunker, Proteins, 53 (S6): 566-572, 2003; K. Peng, S. Vucetic, P. Radivojac, C. J. Brown, A. K. Dunker, Z. Obradovic, Journal of Bioinformatics and Computational Biology, in press)
b) ORDER: training data for a B-factor predictor, also used in a study of flexibility index (P. Radivojac, Z. Obradovic, D. K. Smith, G. Zhu, S. Vucetic, C. J. Brown, J. D. Lawson, A. K. Dunker, Protein Science, 13 (1):71-80, 2004; D. K. Smith, P. Radivojac, Z. Obradovic, A. K. Dunker, G. Zhu, Protein Science, 12 (5):1060-1072, 2003)
c) SHORT: training data for a short disorder predictor (Radivojac et al., Protein Science, 13 (1):71-80, 2004)
d) XRAY: a non-redundant set of PDB chains released between June 2003 and May 2004
e) TOTAL: the merged sequences are non-redundant, with less than 50% identity

Specialized Disorder Predictors

Optimized for long and short disordered regions, respectively.

Long Disorder Predictor (>30aa)
- Attributes: amino acid frequencies, K2-entropy, flexibility index, hydropathy/net charge ratio
- Window lengths: Win (a) = 41, Wout (b) = 31
- Accuracy (c): 50.1±3.6% on short DRs, 76.5±4.2% on long DRs, 85.1±0.9% on order

Short Disorder Predictor (≤30aa)
- Attributes: those above, plus PSI-BLAST profile, secondary structure prediction (PSIPred) and an indicator of terminal regions
- Window lengths: Win (a) = 15, Wout (b) = 5
- Accuracy (c): 81.5±2.1% on short DRs, 66.7±3.5% on long DRs, 82.4±0.5% on order

a) Length of the input window for attribute construction
b) Length of the output window for post-filtering
c) Out-of-sample per-chain accuracies were estimated by 1) randomly splitting the 1335 sequences 75%:25%, 2) using the first part for training and the second for testing, and 3) repeating steps 1 and 2 30 times and averaging the accuracies
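The evaluation procedure in footnote c (repeated random 75%:25% splits, averaged over 30 runs) can be sketched as follows. This is an illustration under stated assumptions: `train_and_score` is a hypothetical callback standing in for training a predictor and measuring its per-chain accuracy, and the spread is reported as a standard deviation, which the slides do not confirm as the meaning of the ± values.

```python
import random
import statistics


def repeated_holdout(chains, train_and_score, repeats=30, train_fraction=0.75, seed=0):
    """Average a score over repeated random train/test splits of `chains`.

    `train_and_score(train, test)` is a hypothetical callback that fits a
    predictor on `train` and returns its accuracy on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(chains)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    # Assumption: spread is reported as the sample standard deviation.
    return statistics.mean(scores), statistics.stdev(scores)
```

Splitting by chain rather than by residue keeps all residues of one protein on the same side of the split, which is consistent with the "per-chain" wording of the footnote.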

The Prediction Process

For each sequence position (residue):
- The three predictors construct attributes and output OL, OS and OG
- The final output is calculated as O = OL * OG + OS * (1 - OG)
- If O > 0.5, predict disorder; otherwise, predict order

[Diagram: the input feeds the Long Disorder Predictor (>30aa), the Short Disorder Predictor (≤30aa) and the Meta Predictor; OL is weighted by OG and OS by 1 - OG to give the final output O]
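The combination rule O = OL * OG + OS * (1 - OG) with the 0.5 threshold translates directly into code. In this minimal sketch the function and parameter names are hypothetical, and all three outputs are assumed to lie in [0, 1].

```python
def combine(o_long, o_short, o_meta, threshold=0.5):
    """Blend the two specialized outputs using the meta predictor's output as the weight.

    Assumption: all three inputs are in [0, 1].
    """
    o = o_long * o_meta + o_short * (1.0 - o_meta)
    return o, ("disorder" if o > threshold else "order")


# When the meta predictor is certain the residue is in a long DR (OG = 1),
# the long disorder predictor alone decides; with OG = 0 the short one does.
print(combine(0.9, 0.2, 1.0))
print(combine(0.9, 0.2, 0.0))
```

Because OG acts as a soft weight rather than a hard switch, intermediate values of OG give a weighted average of the two specialized predictions.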

Training the Meta Predictor

The meta predictor was trained as a 2-class classifier (short disorder vs. long disorder).

Constructing the labeled dataset for training the meta predictor
- The same attributes were used as for the short disorder predictor
- Residues from long DRs and their flanking regions were labeled as class 1
- Residues from short DRs (3aa) and their flanking regions were labeled as class 0
- The remaining residues were discarded (u)
- The input window (of length Win = 61) centered at the current residue must overlap with more than half of a disordered region

Example: a short disordered region (8aa) within an ordered region

Sequence:        GKKGAVAEDGDELRTEPEAKKSKTAAKKNDKEAAGEGPALYEDPPDHKTS
Disorder labels: ooooooooooooooooooooDDDDDDDDoooooooooooooooooooooo
Class labels:    uuuuuuuuuuuuuu00000000000000000000uuuuuuuuuuuuuuuu
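The labeling rule for the meta predictor's training set can be sketched in Python as below. It makes several assumptions: disorder labels are a string of `D` and `o`; a residue is labeled when the window centered on it covers more than half of some disordered region; and if several regions qualify, the first one decides, a tie-breaking choice the slides do not specify. The function name `meta_labels` is hypothetical, and the toy call uses a hypothetical window of 21 rather than the 61 stated on the slide.

```python
from itertools import groupby


def meta_labels(disorder, win, long_threshold=30):
    """Return a string of '1' (long DR), '0' (short DR) or 'u' (discarded) per residue."""
    # Collect (start, end_exclusive) for every run of 'D'.
    regions, position = [], 0
    for symbol, run in groupby(disorder):
        length = len(list(run))
        if symbol == "D":
            regions.append((position, position + length))
        position += length

    half = win // 2
    labels = []
    for center in range(len(disorder)):
        label = "u"
        for start, end in regions:
            # Number of region residues inside the window [center - half, center + half].
            overlap = min(center + half + 1, end) - max(center - half, start)
            if 2 * overlap > end - start:  # window covers more than half of the region
                label = "1" if end - start > long_threshold else "0"
                break  # assumption: the first qualifying region decides the label
        labels.append(label)
    return "".join(labels)


# Toy input: an 8-residue disordered region inside a 50-residue chain.
toy = "o" * 20 + "D" * 8 + "o" * 22
print(meta_labels(toy, win=21))
```

With this rule the flanking region grows with the window length, so a wider window labels more ordered residues around each disordered region.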

CASP6 Targets

63 targets with 3-D coordinate information available, with 90 disordered regions and 90 ordered regions

                     Length range   Number of regions   Number of residues
Disordered regions   1-3            35                  58
                     4-15           41                  304
                     16-30          9                   201
                     31-100         4                   266
                     >100           1                   102
                     Total          90                  931
Ordered regions                                         12,520

Prediction Accuracy

[Figure: (a) per-region accuracy, (b) per-residue accuracy]

- VL2 (CASP6 model-3): a previously developed long disorder predictor (S. Vucetic, C.J. Brown, A.K. Dunker and Z. Obradovic, Proteins: Structure, Function and Genetics, 52:573-584, 2003)
- VL3E (CASP6 model-2): a previously developed long disorder predictor (Z. Obradovic, K. Peng, S. Vucetic, P. Radivojac, C. J. Brown, A. K. Dunker, Proteins, 53 (S6): 566-572, 2003; K. Peng, S. Vucetic, P. Radivojac, C. J. Brown, A. K. Dunker, Z. Obradovic, Journal of Bioinformatics and Computational Biology, in press)
- NEW (CASP6 model-1): the combined predictor
- NEW/short: the specialized predictor for short disordered regions (≤30aa)
- NEW/long: the specialized predictor for long disordered regions (>30aa)

Prediction on Long Disordered Regions

[Figure: (a) prediction by component predictors, (b) comparison to previous predictors]

Notes: (1) red segments indicate disordered regions (of missing coordinates), (2) the threshold for predicting disorder is 0.5

Prediction on Short Disordered Regions

Notes: (1) red segments indicate disordered regions (of missing coordinates), (2) the threshold for predicting disorder is 0.5

In both targets, all short DRs were identified, but with a considerable number of false positives. More detailed analysis shows that the new predictor tends to over-predict at the N- and C-termini.

Correlation with High B-factor Regions

Notes: (1) red segments indicate disordered regions (of missing coordinates), (2) the threshold for predicting disorder is 0.5, (3) no B-factor data for disordered regions

Conclusion by CASP6 Assessor

“Group 193 is best on all measures, on both no-density segments and B-factors, and is significantly better than next 3 groups, 096, 003, 347 on no-density segments, who are about the same as each other. Groups 3, 347, and 472 are good at B-factors”

Group IDs:
- 193 ISTZORAN (Zoran Obradovic, Temple University)
- 096 CaspIta (Tosatto et al., Univ. of Padova)
- 003 Jones UCL (David Jones, University College London)
- 347 DRIP PRED (server from Bob MacCallum, Stockholm)
- 472 Softberry (good at B-factor correlation)

Assessor’s report is available at the CASP6 website: http://predictioncenter.llnl.gov/casp6/meeting/presentations/DR_assessment_RD.pdf

Future Directions

- The length threshold of 30 for dividing DRs into long and short is artificial and may not be the best choice
  - Find a better method for partitioning the DRs into more homogeneous length groups (maybe more than 2)
- The new predictor produced a considerable number of false positives, especially at the N- and C-termini
  - Build predictors specific to terminal and internal regions, and combine them (an approach similar to VL-XT)
- The dataset contains noise, i.e. mislabeling, since not all missing-coordinate regions are necessarily due to disorder

The End Thank You!!