Prediction of Protein Structure and Function on a Proteomic Scale

Prediction of Protein Structure and Function on a Proteomic Scale. Jeff Skolnick, Director, Center of Excellence in Bioinformatics

General Approach

Prediction of Protein Structure

Overview of CASP5 Results:

Comparative Modeling (CM) Results

T0153 CM COORDINATE SUPERPOSITION RMSD = 1.74 Å ( 129 / 134 aa ) NATIVE (discontinuous line) : PREDICTED (continuous line) : 1mq7 A rank #1
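The RMSD and coverage figures quoted on the CASP5 slides can be illustrated with a minimal sketch. It assumes the two coordinate sets are already optimally superposed and paired residue by residue; the superposition step itself is not shown, and the toy coordinates are made up for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the two sets are already superposed and paired atom by atom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

def coverage(n_aligned, n_total):
    """Fraction of residues included in the superposition."""
    return n_aligned / n_total

# Toy coordinates (made up): every atom of the model is displaced by 1 Å.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(native, model))
# Coverage for T0153 as quoted on the slide: 129 of 134 residues.
print(round(coverage(129, 134), 3))
```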

Fold Recognition (FR) results

T0135 FR(A) GLOBAL COORDINATE SUPERPOSITION RMSD = 4.80 Å ( 106 / 106 aa ) NATIVE (discontinuous line) : PREDICTED (continuous line) : rank #1 Yellow line: region originally aligned to the template (1h6kX )

New Fold (NF) results

T0181 (NF) PREDICTED: rank #2

How representative is the set of solved PDB structures?

The PDB is a covering set of protein structures at low resolution. Results from a new structure alignment program, SAL (Kihara & Skolnick, J. Mol. Biol. 2003, 334:793-802)

Structural alignments to proteins of different secondary structure: different CATH IDs, 100-residue proteins

Use of best structural alignments: can we build good models starting from protein templates with an average sequence identity of 7%?

TASSER: Threading/ASSEmbly/Refinement

Very large scale structure prediction benchmark

Comprehensive benchmark set of PDB structures. Length range: 41-200 residues. Sequence identity cut-off: 35%. In total: 1489 proteins.

Summary of Results

Summary of Overall Folding Results
Column headings: SAL, TASSER, MODELLER; sub-headings: Best(a), Align(b), Top-5(c), Top-1(d)
<RMSD>(e): 2.510, 1.877, 2.246, 2.352, 2.708, 3.740, 4.318
<COV>(f): 82%, 100%
Rows: number of targets (N) with RMSD below 6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0
Counts, one column per line, in slide order:
1489, 1485, 1472, 1440, 1369, 1255, 1064, 776, 498, 218, 46
1488, 1476, 1422, 1250, 922, 411, 83
1487, 1481, 1468, 1447, 1396, 1259, 987, 623, 253, 52
1475, 1464, 1450, 1423, 1359, 1206, 928, 582, 241, 49
1462, 1431, 1395, 1336, 1141, 1008, 750, 520, 244, 37
1326, 1266, 1195, 1116, 984, 834, 647, 475, 300, 124, 20
1202, 1138, 1060, 962, 841, 697, 551, 397, 85, 15
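Each row of the results table counts how many benchmark targets have a model RMSD below a given cutoff. A minimal sketch of that tally follows; the per-target RMSD values are made-up toy numbers, not benchmark data, and the function name is an assumption.

```python
def counts_below(rmsds, cutoffs):
    """For each cutoff, count the targets whose RMSD is strictly below it."""
    return {c: sum(1 for r in rmsds if r < c) for c in cutoffs}

# Cutoffs as listed in the table rows (6.0 down to 1.0 in steps of 0.5).
cutoffs = [6.0, 5.5, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.0]

# Toy per-target RMSD values (made up for illustration).
toy_rmsds = [0.8, 1.7, 2.4, 3.9, 6.5]

table = counts_below(toy_rmsds, cutoffs)
for cutoff in cutoffs:
    print(f"N(RMSD<{cutoff}) = {table[cutoff]}")
```

Because the counts are cumulative, each column of the table can only decrease as the cutoff tightens.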

Some Examples:

Summary: At low resolution, the PDB is most likely complete for single domain proteins. Acceptable full length models can be built in the majority of cases. The initial structures can be refined to move closer to native, even starting from the best structural alignment.

Results from threading/refinement: the “real life” situation

TASSER: Threading/ASSEmbly/Refinement

“Easy” cases: at least two threading templates identified with a significant consensus region, or one template with a highly significant Z-score.

“Medium” cases: at least two threading templates identified without any significant consensus region, or one template with a Z-score above the threshold for correct fold assignment.
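The Easy/Medium rules above can be read as a small decision procedure. The sketch below is one such reading, not the authors' implementation: the threshold values `Z_FOLD` and `Z_HIGH`, the `Hit` record, the template names, and the "hard" fallback for everything else are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical Z-score thresholds, chosen only for illustration.
Z_FOLD = 7.0    # assumed threshold for a correct fold assignment
Z_HIGH = 15.0   # assumed threshold for a "highly significant" hit

@dataclass
class Hit:
    template: str   # template identifier (placeholder names below)
    z_score: float  # threading Z-score of this template

def classify_target(hits, has_consensus_region):
    """Label a target 'easy', 'medium' or 'hard' from its threading hits."""
    confident = [h for h in hits if h.z_score >= Z_FOLD]
    # One template with a highly significant Z-score is enough for "easy".
    if any(h.z_score >= Z_HIGH for h in confident):
        return "easy"
    # Two or more templates that agree on a significant consensus region.
    if len(confident) >= 2 and has_consensus_region:
        return "easy"
    # Templates found, but no consensus (or only one above threshold).
    if confident:
        return "medium"
    return "hard"

print(classify_target([Hit("tmpl_a", 9.1), Hit("tmpl_b", 8.4)], True))
print(classify_target([Hit("tmpl_a", 9.1), Hit("tmpl_b", 8.4)], False))
print(classify_target([Hit("tmpl_c", 3.0)], False))
```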

Composite Threading Results: We can identify the correct global fold in 92% of the entire representative set of small PDB structures. Good template alignments can be generated in 59% of the cases, and good substructures in 67% of the cases.

Summary of Results

Examples of alignment improvement (Medium and Easy cases; template vs. final model). Thin lines: native; thick lines: template/model. Two factors mainly contribute to the improvement: geometric connectivity, and better packing of local structure and side groups because of the force field.

Comparison to Ensemble of NMR Structures (Predicted Structure to Centroid/Farthest NMR Structure to Centroid). Thick line is the predicted structure.

Benchmark set of larger proteins (201-300 residues): 487 single-domain proteins, 236 two-domain proteins, 22 three- and four-domain proteins; 745 in total.

Successful Predictions of Transmembrane Proteins

Application to ORFs <201 residues in E. coli: 61% Easy (829/1360), 38% Medium (521/1360), 10 Hard. TASSER: 68% (920/1360) good models.

Summary: Acceptable models in about 2/3 of the cases (969/1489). Application to E. coli yields similar results: ~2/3 of proteins should have a good model. Almost all (90%) have a good template.
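The percentages on this slide follow directly from the quoted counts. A short check, using only the numbers given on the slide (the variable names are arbitrary):

```python
total_orfs = 1360
by_class = {"easy": 829, "medium": 521, "hard": 10}
good_models = 920

# The three classes should account for every ORF in the set.
assert sum(by_class.values()) == total_orfs

percent = {name: round(100 * n / total_orfs, 1) for name, n in by_class.items()}
good_percent = round(100 * good_models / total_orfs, 1)

print(percent)
print(good_percent)
```

The slide rounds these to whole percentages (61%, 38%, 68%).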

Development of Active Site Descriptors

Representation of an Automated Functional Template (AFT). Types of functional sites from SwissProt: METAL, BINDING, ACT_SITE, SITE. An AFT is a set of distances between: Cα atoms and centers of mass of the side chains (cm SC) corresponding to 3 to 5 functional residues (i, j, k); Cα atoms corresponding to the adjacent residues (i±1, j±1, k±1).
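Since an AFT is described as a set of distances between selected atoms and side-chain centers of mass, matching it against a structure can be sketched as a distance-RMSD (DRMSD) comparison. The code below uses the common DRMSD definition over all point pairs, which is an assumption about how the comparison is done; the point coordinates are made up, and the ordering of points is assumed to correspond between template and candidate.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """All inter-point distances, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def drmsd(points_a, points_b):
    """Distance RMSD between two corresponding point sets.

    Compares internal distances only, so no superposition is needed.
    """
    da = pairwise_distances(points_a)
    db = pairwise_distances(points_b)
    if len(da) != len(db) or not da:
        raise ValueError("point sets must have the same size (at least 2)")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))

# Toy template points (made up): e.g. C-alpha atoms and side-chain centers.
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 3.8, 0.0), (0.0, 0.0, 3.8)]
# The same geometry translated in space: internal distances are unchanged.
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in template]
# A distorted copy: the last point is moved by 1 Å.
distorted = template[:3] + [(0.0, 0.0, 4.8)]

print(drmsd(template, shifted))
print(drmsd(template, distorted))
```

Working with internal distances makes the match independent of where the candidate site sits in the structure, which is convenient when scanning many models.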

Specificity parameters of AFTs. Restrictive cutoff: average value of DRMSDMaxPos and DRMSDMinNeg. Permissive cutoff: expected number of false positive matches is less than 0.005 in a random structure. [Figure: number of positive and negative hits in the subset of PDB vs. DRMSD (Å, 0.0 to 2.5), marking DRMSDMaxPos, DRMSDMinNeg, and the high- and low-confidence DRMSD intervals.]

Fraction of decoys correctly annotated vs. ranking of the best true positive hit. [Figure labels: global Cα cRMSD from the native structure; local Cα dRMSD from the native structure; 73%, 56%, 48%, 35%.] The recognition by an AFT matching the first three components of the true EC number is considered a true positive hit.
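The restrictive cutoff is defined on the slide as the average of DRMSDMaxPos and DRMSDMinNeg. A minimal sketch, assuming DRMSDMaxPos is the largest DRMSD among positive hits and DRMSDMinNeg the smallest among negative hits; the hit values are made up, and the permissive cutoff is not covered because it depends on a random-structure model not described here.

```python
def restrictive_cutoff(positive_drmsds, negative_drmsds):
    """Midpoint between the worst positive hit and the best negative hit."""
    drmsd_max_pos = max(positive_drmsds)
    drmsd_min_neg = min(negative_drmsds)
    return (drmsd_max_pos + drmsd_min_neg) / 2

# Toy DRMSD values in Å (made up for illustration).
positives = [0.25, 0.5, 0.75]
negatives = [1.25, 1.5, 2.0]

cutoff = restrictive_cutoff(positives, negatives)
print(cutoff)
# A candidate site is accepted under this cutoff if its DRMSD is at or below it.
print(0.9 <= cutoff, 1.3 <= cutoff)
```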
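The true-positive criterion, agreement on the first three components of the EC number, can be written as a short comparison. The function name and the example EC numbers are chosen for illustration only.

```python
def is_true_positive(predicted_ec, true_ec, levels=3):
    """True if the first `levels` EC components of both numbers agree."""
    predicted = predicted_ec.split(".")
    true = true_ec.split(".")
    # Both numbers must be specified at least down to the compared level.
    if len(predicted) < levels or len(true) < levels:
        return False
    return predicted[:levels] == true[:levels]

print(is_true_positive("3.4.21.4", "3.4.21.62"))   # same first three components
print(is_true_positive("3.4.22.1", "3.4.21.4"))    # third component differs
```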

Threading of Entire Genomes

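Summary of Fold Assignments (counts, with percentages in parentheses)
M. genitalium: total protein ORFs 484; ORF coverage 387 (80.0); amino acid coverage 48.1%; FASTA 231 (47.7); PSI-BLAST – PDB 205 (42.4); PSI-BLAST – PDBseq 259 (53.5); GTOP, Pedant, Gerstein (two values given): 273 (56.4), 214 (44.2)
E. coli: total protein ORFs 4289; ORF coverage 3356 (78.2); amino acid coverage 50.2%; FASTA 1660 (38.7); PSI-BLAST – PDB 1516 (35.3); PSI-BLAST – PDBseq 1906 (44.4); GTOP 2032 (58.5); Pedant 1954 (45.6); Gerstein 1191 (27.8)
B. subtilis: total protein ORFs 4106; ORF coverage 2988 (72.8); amino acid coverage 47.2%; FASTA 1465 (35.7); PSI-BLAST – PDB 1314 (32.0); PSI-BLAST – PDBseq 1732 (42.4); GTOP 1947 (60.2); Pedant 1963 (47.7); Gerstein 1121 (27.3)
A. aeolicus: total protein ORFs 1522; ORF coverage 1297 (85.2); amino acid coverage 48.0%; FASTA 646 (42.4); PSI-BLAST – PDB 592 (38.9); PSI-BLAST – PDBseq 771 (50.7); GTOP 827 (53.1); Pedant 800 (52.6); Gerstein 527 (34.6)
S. cerevisiae: total protein ORFs 6343; ORF coverage 4610 (72.7); amino acid coverage 30.0%; FASTA 1962 (30.9); PSI-BLAST – PDB 1804 (28.4); PSI-BLAST – PDBseq 2422 (38.2); GTOP 2694 (42.5); Pedant 2766 (42.9); Gerstein 1699 (27.3)

Comments on fold distribution: Protein folds can be assigned to 72-85% of the genes in each genome. 30-50% of the total amino acids in a genome are covered by the assigned folds. Generally, the distribution of folds is similar in the 5 organisms. Folds of the α/β type are abundant. Multi-functional folds are abundant in a genome. The kinase fold shows up in the top 5 only in S. cerevisiae.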

MULTIPROSPECTOR: Prediction of Protein-Protein Interactions L. Lu, H. Lu, J. Skolnick. Proteins, 2002, 49, 350-364.

Overall Idea of Multimer Threading. [Figure: sequences X and Y go through monomer threading (templates A and B) and then multimer threading against a multimer structure library (complex A-B); the fold is assigned on the basis of Z score and interface energy.]
X: GELPIAPIGRIIKNA GAERVSDDARIALAK VLEEMGEEIASEAVK LAKHAGRKTIKAEDI KLARKMFK
Y: GEVPIAPLGRIIKNA VLEEMGEEIASEAIR LAKHAGRKTIKAEDV KLAKKMFK

Preliminary test on Known Dimers and Monomers. Homodimers (58): 54 predicted to be dimers, 4 predicted to be monomers. Heterodimers (20): 20 predicted to be dimers. Monomers (96): 5 predicted to be dimers, 91 predicted to be monomers.
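The slide states that the assignment rests on the Z score and the interface energy. One way to read that as a decision rule is sketched below; the cutoffs `Z_MIN` and `E_MAX`, the function name, and the exact way the two criteria are combined are assumptions for illustration, not the published MULTIPROSPECTOR parameters.

```python
# Hypothetical cutoffs, chosen only for illustration.
Z_MIN = 2.0     # assumed minimum threading Z-score for each chain
E_MAX = -15.0   # assumed maximum (least favorable) interface energy

def predict_interaction(z_x, z_y, interface_energy, z_min=Z_MIN, e_max=E_MAX):
    """Predict that X and Y interact on a given complex template.

    Both chains must thread onto their partner structures with a
    sufficient Z-score, and the modeled interface must be favorable.
    """
    both_fit = z_x >= z_min and z_y >= z_min
    return both_fit and interface_energy <= e_max

print(predict_interaction(5.1, 4.3, -22.0))   # both chains fit, favorable interface
print(predict_interaction(5.1, 4.3, -3.0))    # interface too weak
print(predict_interaction(5.1, 0.7, -22.0))   # chain Y does not fit its template
```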
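The test counts can be summarized as recognition rates. The sketch below assumes the slide's numbers read as follows: of 58 homodimers, 54 are predicted dimers and 4 monomers; all 20 heterodimers are predicted dimers; of 96 monomers, 5 are predicted dimers and 91 monomers.

```python
# (predicted dimer, predicted monomer) per true class, as read from the slide.
results = {
    "homodimer": (54, 4),
    "heterodimer": (20, 0),
    "monomer": (5, 91),
}

true_dimers = sum(sum(results[k]) for k in ("homodimer", "heterodimer"))
dimers_recognized = results["homodimer"][0] + results["heterodimer"][0]
monomers_recognized = results["monomer"][1]
total = sum(sum(pair) for pair in results.values())

dimer_recall = dimers_recognized / true_dimers
monomer_recall = monomers_recognized / sum(results["monomer"])
accuracy = (dimers_recognized + monomers_recognized) / total

print(round(dimer_recall, 3), round(monomer_recall, 3), round(accuracy, 3))
```

Under this reading, roughly 95% of both the dimers and the monomers are assigned to the correct class.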

Procedure for genomic scale prediction of protein-protein interactions by MULTIPROSPECTOR

Comparison of colocalization index for different methods

Distribution of predicted interactions in functional categories

Conclusions

Completeness of the PDB: The PDB is a covering set of single domain proteins at low to moderate resolution. The protein structure prediction problem can be solved with more powerful threading algorithms!

TASSER, for single domain proteins: In almost all cases, for all ranges of initial RMSD, even when starting from the “best” structural alignment, the final results are better than the initial template; the models move closer to native. Based on a comprehensive folding benchmark, we expect low resolution structures for ~2/3 of proteins with low sequence identity to PDB structures. Weak dependence on secondary structure type.

Structure to Function: Low resolution structures can be used to identify active sites. Genome scale threading: greater than 70% of ORFs can be assigned to known folds. Extension to protein-protein interactions: comparable accuracy to the agreement between two experimental methods.

Acknowledgements. Center of Excellence in Bioinformatics (http://bioinformatics.buffalo.edu/): Yang Zhang, Adrian Arakaki. Purdue University: Daisuke Kihara. Yale University: Long Lu. University of Illinois: Hui Lu. Funding: NIH, NSF & The Oishei Foundation.