7. (Predicted) residue pair contacts guide ab initio modeling


7. (Predicted) residue pair contacts guide ab initio modeling, and homolog refinement too.
Acknowledgments: the slides in this lecture are courtesy of Sergey Ovchinnikov.

Restraint function: contact prediction via correlated mutations
Recent breakthrough: significantly longer proteins can be modeled without a template (ab initio).
Ab initio modeling on its own is restricted to small (100 aa), single-domain proteins.
Adding contact information, predicted from co-evolution, dramatically increases the scope (up to 500 aa).

What is co-evolution?
Important contacts in proteins are evolutionarily conserved and, due to co-evolution, are encoded in a multiple sequence alignment (MSA).
Contact types labeled in the figure: within, between, mediated by ligand, conformational change.
By measuring co-evolution, we can infer important contacts in proteins!

Contacting residues can be represented as a contact map!
Contact: residue-residue interaction.
[Figure: contact map, axes running from the N- to the C-terminus. Grey = structural contact; blue = predicted contact; intensity = strength of prediction.]
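
A contact map of the kind described above can be sketched in a few lines of Python. This is only an illustration: the 5 Å heavy-atom cutoff is the one quoted later in these slides, while the input layout (one list of heavy-atom coordinates per residue) and the minimum sequence separation `min_sep` are assumptions made for the example.

```python
import math

def contact_map(residues, cutoff=5.0, min_sep=3):
    """Return a symmetric boolean contact map.

    residues: list with one entry per residue; each entry is a list of
              (x, y, z) heavy-atom coordinates in Angstrom (assumed layout).
    cutoff:   two residues are in contact if any pair of their heavy atoms
              is closer than this distance.
    min_sep:  ignore pairs closer than this in sequence (assumption).
    """
    n = len(residues)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):
            # Smallest heavy-atom distance between residue i and residue j.
            d = min(math.dist(a, b) for a in residues[i] for b in residues[j])
            if d < cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

# Toy chain with one atom per residue; residue 3 folds back onto residue 0.
toy = [[(0.0, 0.0, 0.0)], [(3.8, 0.0, 0.0)], [(7.6, 0.0, 0.0)], [(3.0, 3.0, 0.0)]]
toy_map = contact_map(toy)
```

Plotting `toy_map` as an image gives the grey "structural contact" layer of the figure; predicted contacts would be overlaid on the same axes.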

GREMLIN is used to measure co-evolution
Global statistical model (Lapedes et al. 1990s; Balakrishnan et al. 2010)
x = position
vi = one-body energy (conservation)
wij = two-body energy (coupling)
[Figure: positions X1, X2, X3, X4 with one-body terms V1, V2, V3, V4 and the coupling W14 between positions 1 and 4.]
Hetunandan Kamisetty (Facebook)

GREMLIN is used to measure co-evolution
Global statistical model (Lapedes et al. 1990s; Balakrishnan et al. 2010)
x = position
fi = one-body energy (conservation)
ψij = two-body energy (coupling)
Learn a pseudo-likelihood model:
Connectivity (sparse: few significant correlations, i.e. the contacts)
Parameters (optimize the model of X, the MSA)
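
For reference, the global model and its pseudo-likelihood can be written out as follows. This is the standard form of a pairwise Markov random field, given in the v_i, w_ij notation of the neighboring slides (this slide writes the same terms as f_i and ψ_ij). The sign convention inside the exponential and the penalty term R(v, w) are assumptions; the slides only say that a penalty promotes sparsity.

```latex
% Pairwise Markov random field over an aligned sequence x = (x_1, ..., x_L)
P(x) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} v_i(x_i) + \sum_{i<j} w_{ij}(x_i, x_j) \Big)

% Pseudo-likelihood: each position is conditioned on all the others,
% which avoids computing the global partition function Z
P(x_i \mid x_{-i}) =
  \frac{\exp\big( v_i(x_i) + \sum_{j \neq i} w_{ij}(x_i, x_j) \big)}
       {\sum_{a} \exp\big( v_i(a) + \sum_{j \neq i} w_{ij}(a, x_j) \big)}

% Parameters are fit over the N sequences of the MSA, with a penalty R(v, w)
(\hat{v}, \hat{w}) = \arg\max_{v, w}
  \sum_{n=1}^{N} \sum_{i=1}^{L} \log P\big(x_i^{n} \mid x_{-i}^{n}\big) - R(v, w)
```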

GREMLIN is used to measure co-evolution
Global statistical model (Lapedes et al. 1990s; Balakrishnan et al. 2010)
x = position
vi = one-body energy (conservation)
wij = two-body energy (coupling)
Learning: pseudo-likelihood procedure, with a penalty to promote a sparse network.
GREMLIN contact score: APC(L2norm(wij))
L2 norm of the matrix wij: square root of the sum of all its squared entries.
APC: average product correction.
Example shown: 50S ribosomal protein L6.
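
The scoring step APC(L2norm(wij)) can be sketched as below. This is a minimal illustration and not the GREMLIN code: the nested-list layout of the couplings and the choice to take the averages over off-diagonal entries only are assumptions made for the example.

```python
import math

def l2_norm(block):
    """Square root of the sum of all squared entries of one coupling matrix."""
    return math.sqrt(sum(v * v for row in block for v in row))

def apc_scores(W):
    """Contact scores from couplings via L2 norm + average product correction.

    W[i][j] is the coupling matrix between positions i and j, given as a
    list of rows (assumed layout); W[i][i] is ignored.
    """
    L = len(W)
    raw = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            if i != j:
                raw[i][j] = l2_norm(W[i][j])
    # Mean raw score of each position, and the mean over all pairs.
    row_mean = [sum(raw[i]) / (L - 1) for i in range(L)]
    total_mean = sum(row_mean) / L
    if total_mean == 0.0:
        return raw  # no signal at all, nothing to correct
    # APC: subtract the score expected from the two position averages alone.
    return [
        [raw[i][j] - row_mean[i] * row_mean[j] / total_mean if i != j else 0.0
         for j in range(L)]
        for i in range(L)
    ]

# Toy example: three positions with 1x1 "matrices"; pair (1, 2) is strongest.
strengths = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 4.0}
toy_W = [[[[strengths.get((min(i, j), max(i, j)), 0.0)]] for j in range(3)]
         for i in range(3)]
toy_scores = apc_scores(toy_W)
```

The correction removes the background that comes from positions being strongly coupled to everything, so that only pair-specific signal is ranked.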

GREMLIN (Generative REgularized ModeLs of proteINs)
Based on a pseudo-likelihood framework: a Markov random field (more complex than an HMM, which is a chain).
Optimized for the maximum number of correct contact predictions.
Includes predicted context information: secondary structure (PSIPRED) and contacts (SVMcon).
Informative MSA: number of sequences S (filtered at <90% identity) > 4-5 x L (protein length).
At a sequence depth of 4-5L, the top 1.5L contacts are ~50% correct.
Reliable modeling: at least one reliable non-local contact per 12 aa on average, which allows prediction of longer proteins.
Original paper: Balakrishnan … Langmead, Proteins 2010.
Baker lab: Kamisetty et al., PNAS 2013; Kim et al., Proteins 2013; Ovchinnikov et al., eLife 2015 and Science 2017.

GREMLIN is used to measure co-evolution: when is it useful?
It needs many sequences, but then a structural template is often available and there is no need for contact predictions.
Model discrimination?
GREMLINΔ: difference between the native and model scores (CAMEO dataset, n=329).
For 10% (34/329 proteins), GREMLIN discriminates the native from the rest.
Figure caption (Kamisetty et al. 2013): Utility of contact prediction for structure modeling. (A) Ranking of alternate models by GREMLINΔ. Three scenarios are illustrated; each represents a distinct protein target, black dots indicate alternate models, red dots indicate native structures. (Left) GREMLINΔ is not useful in selecting the closest model and does not correctly discriminate between native (target pdb:4hwnA) and homology models; (Middle) GREMLINΔ ranks homology models correctly (top five models within 0.05 of best five on average; R² between GREMLIN score and fraction of native contacts > 0.8) but adds no additional information (target pdb:4fn4D); (Right) GREMLINΔ discriminates between best model and native structure (target pdb:4hxtA). In an additional 6% of the targets, GREMLINΔ correctly discriminated the native from the homology models but there were not enough models to reliably establish accuracy of ranking.

GREMLIN is used to measure co-evolution: when is it useful?
It needs many sequences, but then a structural template is often available and there is no need for contact predictions.
Better information than templates?
HHΔ (closeness of the template, from the difference in HHpred scores): 0 = HHpred query and template alignments are identical; 1 = no homolog with known structure (CAMEO dataset, n=339).
HHΔ > 0.5 -> GREMLIN is useful for model discrimination (GREMLINΔ > 0).
Figure caption (Kamisetty et al. 2013): Utility of contact prediction for structure modeling. (B) HHΔ predicts GREMLINΔ: GREMLINΔ versus structural similarity of homolog to native structure computed by TM-align (14) (for homologs of all targets with high-resolution crystal structures < 2.1 Å). When HHΔ ≤ 0.5 (blue bars), GREMLINΔ is rarely better than random (green bars, constructed by pooling 100 permutations of predicted scores for each target). When HHΔ > 0.5 (red bars), GREMLINΔ is significantly positive and contact scores successfully discriminate between native and homology model even when the homolog is likely to be from the same fold (similarity ∈ [0.5, 0.8)). Error bars show mean and SD of distributions in all cases.

GREMLIN is used to measure co-evolution: when is it useful?
It needs many sequences, but then a structural template is often available and there is no need for contact predictions.
Analysis of Pfam: GREMLIN could be useful for 14% (422/12,452) of the families.
Estimated from: the number of cases with a distant template (HHΔ > 0.5) and the number of cases with enough sequences (sequences/length > 4).
Figure caption (Kamisetty et al. 2013): Frequency of utility of contact prediction. The protein families in the Pfam database were divided into three groups based on the HHsearch P value of the closest protein of known structure (Left, HHsearch P value > 10^-6.5; Middle, HHsearch P value between 10^-40 and 10^-6.5; Right, HHsearch P value < 10^-40). Within each group, the number of families with sequences/length less than 1, between 1 and 5, and greater than 5 are shown in blue, red, and green, respectively (Upper bars). For families with > 5 sequences per position (Upper green bars), distribution of HHΔ to the closest protein of known structure is shown in the lower panel. In cases where the difference in profiles is large (HHΔ > 0.5: right bar in each group, Lower), these predictions are likely to improve on comparative models.
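
The two criteria used for the Pfam estimate can be combined into a simple check. The sketch below only restates the cutoffs from the slide (HHΔ > 0.5 and sequences/length > 4); the function name and the HHΔ values in the example are hypothetical, and computing HHΔ itself is not shown.

```python
def contacts_likely_useful(hh_delta, n_sequences, length,
                           hh_cutoff=0.5, depth_cutoff=4.0):
    """True if the closest template is distant AND the alignment is deep enough.

    hh_delta:    HHΔ to the closest protein of known structure
                 (0 = identical alignment, 1 = no homolog of known structure).
    n_sequences: number of sequences in the redundancy-filtered MSA.
    length:      protein length in residues.
    """
    depth = n_sequences / length  # sequences per position
    return hh_delta > hh_cutoff and depth > depth_cutoff

# Alignment size taken from the T0806 example (1208 sequences, 258 residues);
# the HHΔ values 0.8 and 0.2 are made up for illustration.
deep_and_distant = contacts_likely_useful(0.8, 1208, 258)
deep_but_close = contacts_likely_useful(0.2, 1208, 258)
```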

Example: CASP target T0806 (YAAA_ECOLI), predicted contacts
Sequences: 1208; length: 258.
Top 1.5L contacts are shown.
HHsearch result for the top hit: probability = 12.4%, E-value = 20.
Confidence is improved by combining the hit with GREMLIN contacts.

Not all contacts should be made!
A predicted contact may belong to the monomer, to a homo-dimer interface, to a ligand-mediated interaction, or to one of multiple states.

Functional form to “de-noise”
[Figure: starting conformation; sigmoid versus harmonic restraint functions.]
Sigmoidal restraints prevent “false” contacts from distorting the structure and maximize the number of self-consistent contacts, although this requires a lot of sampling.
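
The difference between the two functional forms can be made concrete with a small sketch. These are generic textbook forms, not the exact Rosetta restraint functions; the cutoff of 8 Å, the slope and the weights are illustrative assumptions.

```python
import math

def harmonic_penalty(d, d0, k=1.0):
    """Flat-bottom harmonic: zero below d0, then grows without bound."""
    return k * max(0.0, d - d0) ** 2

def sigmoid_penalty(d, d0, slope=1.0, weight=1.0):
    """Sigmoid: rises around d0 but saturates at `weight` for large d."""
    return weight / (1.0 + math.exp(-slope * (d - d0)))

# A "false" predicted contact whose residues sit 30 A apart in the model.
far = 30.0
harmonic_cost = harmonic_penalty(far, d0=8.0)  # grows quadratically with distance
sigmoid_cost = sigmoid_penalty(far, d0=8.0)    # bounded by the weight
```

Because the sigmoid saturates, a contact that cannot be satisfied costs at most a fixed amount and cannot pull the rest of the structure out of shape, whereas the harmonic term keeps growing with distance. The flat tail also gives almost no gradient toward far-away contacts, which fits the slide's remark that a lot of sampling is needed.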

Residue-pair-specific Cβ-Cβ distance
[Figure: distance scale from 2.9 to 9.0.]
The restraint uses the maximum Cβ-Cβ distance that still allows a contact (< 5 Å between any heavy atoms).
Bring residues close enough to form contacts, then let the Rosetta energy function decide whether the contact should be formed.
Can be used in centroid mode.

CASP target T0806: each model made or missed a different subset of contacts.
[Figure: contact maps of the top 4 models, showing structure contacts (5 Å) and predicted contacts.]

Pipeline
Steps shown: Hybridize (using RosettaCM); fragment insertion (20 trials); ab initio (using RosettaAB).
Repeat until the CASP deadline or convergence.
Contact prediction is essential for convergence.
Iterative refinement is essential for improved model quality.
References:
Song Y, DiMaio F, Wang RYR, Kim D, Miles C, Brunette TJ, Thompson J, Baker D. High-resolution comparative modeling with RosettaCM.
Kim DE, DiMaio F, Wang RYR, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling.

Transition from ab initio to template-based modeling
Contact-assisted ab initio prediction using Rosetta; contacts refine the template topology.
Determination of topology: ab initio folding with constraints; find fragment pairs.
Refinement of topology: refine the structure by imposing constraints.
Reference: Kim DE, DiMaio F, Wang RYR, Song Y, Baker D. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling.

Modeling with contact predictions: CASP12 results for Rosetta
[Figure: examples showing predicted contacts, model and X-ray structure. Legend: <5 Å; <10 Å; >10 Å.]
Baker lab: Kamisetty et al., PNAS 2013; Kim et al., Proteins 2013; Ovchinnikov et al., eLife 2015 and Science 2017.

Modeling with contact predictions: new models for uncharacterized families
Calibrate on 27 proteins with known structure using subsampled alignments.
Approach:
Generate a GREMLIN matrix based on alignments of increasing size.
Generate 20K models with constrained RosettaCM and select the top-scoring model (de novo).
Hybridize-refine the top 20 models (refinement).
Measure: Nf, the number of effective sequences: Nf = (number of sequence clusters at <80% sequence identity) / √Length.
Nf > 64: accurate model; Nf > 16: same fold.
Nf correlates well with accuracy (TM score) and is length-independent.
Fig. 2. Metagenome data greatly increased fraction of structures that can be accurately modeled. (A) Dependence of coevolution-guided Rosetta structure-prediction accuracy on the effective number of sequences Nf in the protein family. For each of 27 proteins of known structure, the multiple sequence alignment was subsampled, and residue-residue contacts were predicted by using GREMLIN. Rosetta structure-prediction calculations were then used to generate ~20,000 models, and a single model was selected on the basis of the Rosetta energy and the fit to the coevolution constraints; the average TM score of these selected models over all 27 cases is shown on the y axis (dashed line). Hybridization-based refinement of the top 20 models together with the top 10 map_align-based models for each case increases the average accuracy (solid line); models with fold-level accuracy (TM score of >0.5) are obtained for Nf ≥ 16, and models with accuracy typical of comparative modeling, for Nf of 64. (B) Fraction of protein families of unknown structure with at least 64 Nf. Dashed line: including only sequences in UniRef100 database; solid line: including sequences in UniRef100 database together with metagenome sequence data from the Joint Genome Institute (37). (C) Distribution of Nf values for 5211 Pfam families with currently unknown structure, after the addition of metagenomic sequences; 25% of the protein families have Nf > 64, 34% have Nf > 32, and 45% have Nf > 16.
Ovchinnikov et al., eLife 2015 (1) and Science 2017 (2)
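
The Nf measure and the two thresholds quoted above can be sketched as follows. The clustering at 80% sequence identity is not implemented here; the number of clusters is taken as an input, and the function names and labels are assumptions made for the example.

```python
import math

def effective_sequences(n_clusters, length):
    """Nf = number of sequence clusters (<80% identity) / sqrt(protein length)."""
    return n_clusters / math.sqrt(length)

def expected_model_quality(nf):
    """Map Nf onto the accuracy levels quoted on the slide."""
    if nf >= 64:
        return "accurate model"  # accuracy typical of comparative modeling
    if nf >= 16:
        return "same fold"       # fold-level accuracy (TM score > 0.5)
    return "too few sequences"

# Hypothetical family: 256 residues, 256 sequence clusters -> Nf = 16.
nf_example = effective_sequences(256, 256)
quality_example = expected_model_quality(nf_example)
```

Dividing by √Length is what makes the thresholds comparable between short and long proteins.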

Modeling with contact predictions: new models for uncharacterized families (Nf > 64)
Large-scale modeling of prokaryotic proteomes: 58/121 prokaryotic protein families with no structural template -> templates for ~400K prokaryotic proteins.
Large scale + metagenomic data: 921/1297 with enough long-range contacts; 1024 domains; 614/1024 with no current structure -> 137 new folds -> templates for ~500K UniProt and 3M metagenomic proteins.
Nf > 64: accurate model; Nf > 16: accurate fold.
Ovchinnikov et al., eLife 2015 (1) and Science 2017 (2)

Summary: structure prediction with correlated contacts
Correlated evolution identifies neighboring residue pairs in protein structure.
An informative MSA is critical; enough sequences are available today.
Contacts are used to guide structure prediction, in particular when no template is identified.
Significant increase in the number of proteins with reliable structural models, in particular for transmembrane proteins; helped by metagenomic data.