Predicting protein disorder Peter Tompa Institute of Enzymology Hungarian Academy of Sciences Budapest, Hungary.

Slides:



Advertisements
Similar presentations
(SubLoc) Support vector machine approach for protein subcelluar localization prediction (SubLoc) Kim Hye Jin Intelligent Multimedia Lab
Advertisements

Protein Backbone Angle Prediction with Machine Learning Approaches by R Kang, C Leslie, & A Yang in Bioinformatics, 1 July 2004, vol 20 nbr 10 pp
Protein Threading Zhanggroup Overview Background protein structure protein folding and designability Protein threading Current limitations.
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
Biological sequence analysis and information processing by artificial neural networks.
Prediction of protein localization and membrane protein topology Gunnar von Heijne Department of Biochemistry and Biophysics Stockholm Bioinformatics Center.
PROTEIN SECONDARY STRUCTURE PREDICTION WITH NEURAL NETWORKS.
Structural bioinformatics
Chapter 9 Structure Prediction. Motivation Given a protein, can you predict molecular structure Want to avoid repeated x-ray crystallography, but want.
Structure Prediction. Tertiary protein structure: protein folding Three main approaches: [1] experimental determination (X-ray crystallography, NMR) [2]
CISC667, F05, Lec20, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Protein Structure Prediction Protein Secondary Structure.
Methods for Improving Protein Disorder Prediction Slobodan Vucetic1, Predrag Radivojac3, Zoran Obradovic3, Celeste J. Brown2, Keith Dunker2 1 School of.
Biological sequence analysis and information processing by artificial neural networks.
Comparing Database Search Methods & Improving the Performance of PSI-BLAST Stephen Altschul.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Structures.
Artificial Neural Networks for Secondary Structure Prediction CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (slides by J. Burg)
Protein Bioinformatics Course
CRB Journal Club February 13, 2006 Jenny Gu. Selected for a Reason Residues selected by evolution for a reason, but conservation is not distinguished.
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Makromolekulak_2010_12_07 Simon István. Prion protein.
Representations of Molecular Structure: Bonds Only.
RNA Secondary Structure Prediction Spring Objectives  Can we predict the structure of an RNA?  Can we predict the structure of a protein?
Prediction of protein disorder Zsuzsanna Dosztányi MTA-ELTE Momentum Bioinformatics Group Department of Biochemistry Eotvos Lorand University, Budapest,
PART II. Prediction of functional regions within disordered proteins Zsuzsanna Dosztányi MTA-ELTE Momentum Bioinformatics Group Department of Biochemistry.
Protein Secondary Structure Prediction. Input: protein sequence Output: for each residue its associated Secondary structure (SS): alpha-helix, beta-strand,
Prediction of protein disorder Zsuzsanna Dosztányi Institute of Enzymology, Budapest, Hungary
Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Secondary structure prediction
2 o structure, TM regions, and solvent accessibility Topic 13 Chapter 29, Du and Bourne “Structural Bioinformatics”
A Study of Residue Correlation within Protein Sequences and its Application to Sequence Classification Christopher Hemmerich Advisor: Dr. Sun Kim.
Web Servers for Predicting Protein Secondary Structure (Regular and Irregular) Dr. G.P.S. Raghava, F.N.A. Sc. Bioinformatics Centre Institute of Microbial.
Protein structure prediction May 26, 2011 HW #8 due today Quiz #3 on Tuesday, May 31 Learning objectives-Understand the biochemical basis of secondary.
Intrinsically disordered proteins: drug development and the most interesting examples Peter Tompa Institute of Enzymology Hungarian Academy of Sciences.
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Multiple Mapping Method with Multiple Templates (M4T): optimizing sequence-to-structure alignments and combining unique information from multiple templates.
Protein secondary structure Prediction Why 2 nd Structure prediction? The problem Seq: RPLQGLVLDTQLYGFPGAFDDWERFMRE Pred:CCCCCHHHHHCCCCEEEECCHHHHHHCC.
Protein Secondary Structure Prediction G P S Raghava.
Meng-Han Yang September 9, 2009 A sequence-based hybrid predictor for identifying conformationally ambivalent regions in proteins.
Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence-
Protein Structure Prediction ● Why ? ● Type of protein structure predictions – Sec Str. Pred – Homology Modelling – Fold Recognition – Ab Initio ● Secondary.
Introduction to Protein Structure Prediction BMI/CS 576 Colin Dewey Fall 2008.
Sequence Based Analysis Tutorial March 26, 2004 NIH Proteomics Workshop Lai-Su L. Yeh, Ph.D. Protein Science Team Lead Protein Information Resource at.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
Prediction of Protein Binding Sites in Protein Structures Using Hidden Markov Support Vector Machine.
Fehérjék 3. Simon István. p27 Kip1 IA 3 FnBP Tcf3 Bound IUP structures.
Comparative methods Basic logics: The 3D structure of the protein is deduced from: 1.Similarities between the protein and other proteins 2.Statistical.
Combining Evolutionary Information Extracted From Frequency Profiles With Sequence-based Kernels For Protein Remote Homology Detection Name: ZhuFangzhi.
Final Report (30% final score) Bin Liu, PhD, Associate Professor.
Stephen Altschul National Center for Biotechnology Information
Ubiquitination Sites Prediction Dah Mee Ko Advisor: Dr.Predrag Radivojac School of Informatics Indiana University May 22, 2009.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Predicting Structural Features Chapter 12. Structural Features Phosphorylation sites Transmembrane helices Protein flexibility.
Improved Protein Secondary Structure Prediction. Secondary Structure Prediction Given a protein sequence a 1 a 2 …a N, secondary structure prediction.
Intrinsically disordered proteins Zsuzsanna Dosztányi EMBO course Budapest, 3 June 2016.
Using the Fisher kernel method to detect remote protein homologies Tommi Jaakkola, Mark Diekhams, David Haussler ISMB’ 99 Talk by O, Jangmin (2001/01/16)
Structural Bioinformatics Elodie Laine Master BIM-BMC Semestre 3, Laboratoire de Biologie Computationnelle et Quantitative (LCQB) e-documents:
Madhavi Ganapathiraju Graduate student Carnegie Mellon University
Extra Tree Classifier-WS3 Bagging Classifier-WS3
Combining HMMs with SVMs
Sequence Based Analysis Tutorial
Protein Structures.
Protein structure prediction.
Protein Disorder Prediction
Combining Predictors for Short and Long Protein Disorder
Discussion of Protein Disorder Prediction
Neural Networks for Protein Structure Prediction Dr. B Bhunia.
Presentation transcript:

Predicting protein disorder Peter Tompa Institute of Enzymology Hungarian Academy of Sciences Budapest, Hungary

Why do we want to predict disorder ? 1)Gap in knowledge (460 vs. thousands of IUPs) 2)Structural genomics initiatives 3)Bioinformatics studies 4)Single protein studies

How do we know disorder is prevalent ? 1)Heat-resistance of proteins (20% in Jurkat cells) 2)Low-complexity regions (25% in SwissProt) 3)Bioinformatics studies (12% in humans)

Prediction: classification problem InputMethod 1.sequence 2.propensity vector 3.alignment (profile) 4.interaction energies 1. statistical methods 2. machine learning 3. structural approach Output (property) 1. DisProt 2. PDB 1. binary 2. score Assessment We look for the relationship of input data and a property, which is a function with free parameters

Three basic approaches 1)Simple statistics 2)Machine learning 3)Structural approach

Dunker et al. (2001) J. Mol. Graph. Model. 19, 26 The unusual AA composition of IDPs order-promoting disorder-promoting

AA feature space: AAindex database A number is associated with every amino acid, which quantitatively describes how characteristic the given feature is to the AA (has 517 different scales at present)

Two things you might not want to do… 1) SEG: low complexity regions

Sequence databases contain a lot of regions in which only a few amino acids occur (simple), or some dominate (biased), this can be described by an entropy function - Wootton (1993) L > 20; window size, N alphabet size (20), n i : az i. aminosav száma Sequence complexity (SEG) Shannon`s entropy

Low-complexity regions in proteins Wootton (1994) Comp. Chem. 18, 269

> SRY_MOUSE TRANSCRIPTIONAL ACTIVATOR MEGHVKRPMNAFMVWSRGERHKLAQQNPSM QNTEISKQLGCRWKSLTEAEKRPFFQEAQR LKILHREKYPNYKYQPHRRAKVSQRSGILQ PAVASTKLYNLLQWDRNPHAITYRQDWSRA AHLYSKNQQSFYWQPVDIPTGHL qqqqqqqqqqqfhnhhqqqqqfydhhqqqq qqqqqqqqfhdhhqqkqqfhdhhqqqqqfh dhhhhhqeqqfhdhhqqqqqfhdhqqqqqq qqqqqfhdhhqqkqqfhdhhhhqqqqqfhd hqqqqqqfhdhqqqqhqfhdhpqqkqqfhd hpqqqqqfhdhhhqqqqkqqfhdhhqqkqq fhdhhqqkqqfhdhhqqqqqfhdhhqqqqq qqqqqqqqfhdqq LTYLLTADITGEHTPYQEHLSTALWLAVS …may correspond to disorder

The relationship of low complexity and disorder

Two things you might not want to do… 2) NORSp: regions w/o secondary structure

NORSp seems to capture disorder… NORSp: 1) uses PSI-BLAST to generate a sequence profile 2) predicts secondary structure and solvent accessibility by PROFphd 3) predicts transmembrane helices by PHDhtm 4) predicts coiled coils by COILS 5) combines info and merges overlapping regions LDR (40 < ) protein, % Domain of life B A E

1tbi …but there are globular proteins without and IDPs with secondary structure Radhakrishnan (1997) Cell 91, 741 Radhakrishnan (1998) FEBS Lett. 430, 317 CREB KID

Tumor suppressor p53 Uversky et al. (2005) J. Mol. Recogn. 18, 343

1) Simple statistics

DisEMBL (single parameter) p53 hot loops remark 465 coil remark 465: missing from PDB

Uversky plot: charge-hydrophobicity (two parameters) Mean hydrophobicity Mean net charge Uversky (2002) Eur. J. Biochem. 269, 2

Uversky plot: a little improved Oldfield (2005) Biochemistry 44, 1989

Making it position specific: FoldIndex p53 Prilusky (2005) Bioinformatics 21, 3435

3) Input vector of features, machine learning… 2) Machine learning

input score Artificial neural network (NN)

Basic unit: a neuron Hidden layer training Artificial neural networks ordered disordered

Input 18 amino acids Hydrophobicity Sequence complexity Predictor of naturally disordered regions (PONDR ® ) VL2, VL3 VLXT VSL2

Dunker, 1998

Tumor suppressor p53 Uversky et al. (2005) J. Mol. Recogn. 18, 343

f(x,w,b) = sign (w,x – b) How would you classify this data?But which is best? The maximum margin linear classifier is considered best Support vectors against which the margin pushes up This is the simplest SVM, the linear SVM (LSVM) Support vector machine (SUV)

p53 DISOPRED 2 (input: missing X-ray, alignment) 45

3) Structural approach (interaction potential)

The protein non-folding problem  Protein folding problem How does amino acid sequence determine protein structure ?  Protein non-folding problem How does amino acid sequence determine the lack of protein structure ?

Globular proteins have special sequences that enable the formation of a large number of favorable interactions Globular proteins have special sequences that enable the formation of a large number of favorable interactions IDPs contain (disorder-promoting) amino acids, which tend to avoid interacting with each other. IDPs contain (disorder-promoting) amino acids, which tend to avoid interacting with each other. An IDP thus cannot fold into a low-energy conformation. An IDP thus cannot fold into a low-energy conformation. The protein non-folding porblem

A simple implementation, FoldUnfold p53 Calculates the contact number of amino acids

Estimating the total pairwise interresidue interaction energy of a sequence: IUPred 1) Calculate interresidue interaction energies from structure 2) Try to estimate the energy without knowing the structure 3) Apply the estimation to sequences w/o structure (e.g. to IDPs, which have no structure)

Aminosav kölcsönhatási energiák származtatása globuláris fehérjékből Statistical potentials Interactions that occur more frequently are probably more favorable Take logarithm of frequency It is an energy-like feature

Interresidue interaction energy calculated for known structures Dosztanyi (2005) J. Mol. Biol. 347, 827

How to estimate the interresidue interaction energy of a protein of unknown structure or w/o structure? MKVPPHSIEA EQSVLGGLML DNERWDDVAE RVVADDFYTR PHRHIFTEMA RLQESGSPID LITLAESLER QGQLDSVGGF AYLAELSKNT PSAANISAYA DIVRERAVVR EMIS Sequence MODEL 1 ATOM 1 N MET A ATOM 2 CA MET A ATOM 3 C MET A ATOM 4 O MET A Structure Calculated energy per residue Estimated energy per residue ?

Estimation of pairwise interresidue interaction energy from sequence The contribution of individual AAs depends on its potential partners, i.e. its neighborhood. A quadratic formula is needed to take this into consideration. The relationship between AA composition and energy is given by an optimized 20x20 energy predictor matrix, P ij

Correlation of calculated and estimated interaction energies Corr. coeff Dosztanyi (2005) J. Mol. Biol. 347, 827

IDPIDP GLOB Dosztanyi (2005) J. Mol. Biol. 347, 827 The estimated energy for globular proteins and IDPs

IUPred: Making it position-specific: the IUPred algorithm

IUPred, p53

4) A new development: distinguishing long from short disorder

Radivojac (2004) Prot. Sci. 13, 71 PONDR VSL2: short vs. long disorder Peng (2006) BMC Bioinfo. 7, 208

Which is the best predictor ? The CASP experiment Cannot be answered in general, because binary or continuous different kinds of disorder short and long disorder problems with disordered protein data number of glob. vs IUP residues not clear how to evaluate Means two things at least: how user friendly it is ( return, graphic output, downloadable, speed, scriptable, can be parametrized) how accurately it predicts disorder

Evaluation criteria in CASP6 Receiver operating characteristic (ROC) curve and its integral (S roc ) Sensitivity, specifity and their ratio (Q2) Weighted score of sensitivity an specifity (S w )

ROC curves in CASP6 PONDR DRIPpred DISOPRED IUPred

Overall evaluation in CASP6 PONDR IUPred DISOPRED DRIPpred DISpro

Why do we want to predict disorder ? 1)Gap in knowledge (460 vs. thousands of IUPs) 2)Structural genomics initiatives 3)Bioinformatics studies 4)Single protein studies

LDR (40 < ) protein, % kingdom B A E Tompa et al. (2006) J. Proteome Res. 5, ) Protein disorder: an evo. success-story

2) Structural genomics:filtering for disorder helps Center for Eukaryotic Structural Genomics (CESG) targets retrospective prioritization by PONDR (71 targets) evaluation by considering HSQC fingerprinting modest increase of viable targets (36% to 41%)

Iakoucheva et al. (2002) J. Mol. Biol. 323, 573 regulatory signaling biosynthetic metabolic protein (%) 30< 40< 50< 60 < length of disordered region 3) Bioinformatics studies: disorder prevails in regulatory proteins

3) Bioinformatic studies: disorder correlates with complex size Hegyi et al. (2007) BMC Struct. Biol. 7, 65 single

4) Single protein studies