Modeling the Structures of Proteins and Macromolecular Assemblies


1 Modeling the Structures of Proteins and Macromolecular Assemblies
Andrej Šali
Depts. of Biopharmaceutical Sciences and Pharmaceutical Chemistry
California Institute for Quantitative Biomedical Research
University of California at San Francisco (UCSF)
1/14/2019
Notes: Get the 3 Aims slide; that is the direction/spirit for the talk. Do I quote lots of our papers? Ask people. Improve "overall model accuracy" with better backbone plots; list error types on it. Get the latest Chothia & Lesk plot from Eashwar. Get the latest numbers for the TrEMBL run, pie chart, density plot. Latest ModBase slide, perhaps from the ModView paper. Something better on LigBase, with more info on the slide. Have 3 slides on the BRCA1 BRCT domain. Get the NPC slides from Mike & Frank. Write the bullet points for each slide. Think globally. Maybe have a text-based slide with steps in ModPipe, with references, indicating a lot of our methods work.

2 Topics
Large-scale comparative protein structure modeling
Target selection for NYSGXRC
Questions 2, 7, 5
Comments

3 MODPIPE: Large-Scale Comparative Modeling
Flowchart steps, in slide order: START; Get profile for sequence (SP/TrEMBL); Align sequence profile with multiple structure profile using local dynamic programming; Select templates using permissive E-value cutoff; Build models for target segment by satisfaction of spatial restraints; Evaluate models; END.
Loop labels on the flowchart: For each target sequence; For each template profile. Two boxes are labeled MODELLER.
References: R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998. Eswar et al. Nucl. Acids Res. 31, 3375–3380, 2003. Pieper et al., Nucl. Acids Res. 32, 2004.
Contributors: N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, B. Webb, M.-Y. Shen, A. Šali.
Notes: PSI-BLAST run as optimized by C. Chothia. Just mention the permissive E-value filter for now. Point out we are using sequence-structure alignment (new). Mention MODELLER for model building (old method). Point out we are using optimized model assessment (new). Point out the permissiveness in E-value, to find nuggets in the mud. Point out similarity/difference with threading.
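The control flow on this slide can be sketched roughly as follows. This is only an illustration of the loop structure (permissive template selection followed by model building and evaluation), not ModPipe or MODELLER code: `Hit`, `select_templates`, `run_pipeline`, the stub callables, and the cutoff value of 1.0 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A candidate template found for a target (hypothetical structure)."""
    template_id: str
    e_value: float


def select_templates(hits, e_value_cutoff=1.0):
    """Keep hits that pass a permissive E-value cutoff.

    The cutoff of 1.0 is an assumed placeholder, not a value from the slide.
    """
    return [h for h in hits if h.e_value <= e_value_cutoff]


def run_pipeline(targets, find_hits, build_model, evaluate_model,
                 e_value_cutoff=1.0):
    """For each target: find hits, keep permissive ones, build and score models.

    find_hits, build_model and evaluate_model are caller-supplied stand-ins
    for the profile alignment, restraint-based modeling and assessment steps.
    """
    results = {}
    for target in targets:
        hits = find_hits(target)
        scored = []
        for hit in select_templates(hits, e_value_cutoff):
            model = build_model(target, hit)
            scored.append((hit.template_id, evaluate_model(model)))
        results[target] = scored
    return results


# Toy usage with stubbed steps.
toy_hits = {"t1": [Hit("a", 1e-10), Hit("b", 0.5), Hit("c", 50.0)]}
toy_results = run_pipeline(
    ["t1"],
    find_hits=lambda t: toy_hits[t],
    build_model=lambda t, h: (t, h.template_id),
    evaluate_model=lambda m: len(m[1]),
)
```

The point of the design is that the selection step is deliberately loose, and the decision about which models to trust is deferred to the evaluation step.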

4 Comparative modeling of the TrEMBL database
Unique sequences processed: 1,182,126
Sequences with fold assignments or models: 659,495 (56%)
70% of models based on <30% sequence identity to template.
On average, only a domain per protein is modeled (an "average" protein has 2.5 domains of 175 aa).
9/15/ ~4 weeks on 660 Pentium CPUs
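The numbers on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the average protein length and the per-protein modeled fraction are simple consequences of the slide's "2.5 domains of 175 aa" and "a domain per protein" statements, not figures quoted in the talk.

```python
# Figures taken from the slide.
unique_sequences = 1_182_126
with_fold_or_model = 659_495
domains_per_protein = 2.5
domain_length_aa = 175

# Fraction of sequences with at least a fold assignment or a model.
coverage = with_fold_or_model / unique_sequences

# Implied length of an "average" protein.
avg_protein_length_aa = domains_per_protein * domain_length_aa

# If one domain per protein is modeled, this is the modeled share of domains.
modeled_domain_fraction = 1 / domains_per_protein

print(f"coverage: {coverage:.1%}")
print(f"average protein length: {avg_protein_length_aa} aa")
print(f"modeled domains per covered protein: {modeled_domain_fraction:.0%}")
```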

5 Statistics from ModBase
Panels: Modeling coverage of SP/TR; Distribution of sequence identities; Distribution of model length; Coverage of SCOP domains.

6 http://salilab.org/modbase Pieper et al., Nucl. Acids Res. 32, 2004.

7 Major bidirectional resources involving ModBase
Daniel Greenblatt, Conrad C. Huang, Thomas E. Ferrin
UCSC Human Gene Family Browser

8 MODBASE and associated resources http://salilab.org/
Eswar et al. Nucl. Acids Res. 31, 3375–3380, 2003. Pieper et al., Nucl. Acids Res. 32, 2004.

9 Synergy of crystallography and comparative modeling in structural genomics
Pieper et al., Nucl. Acids Res. 32, 2004.
Table: NYSGXRC X-ray structures and the MODBASE models based on them.
Columns: PDB Code | Database Accession Number | Annotation | MODBASE model counts (Total Sequences, Fold & Model, then further counts whose column headers did not survive; values are listed as they appear on the slide)
1b54 | P38197 | Hypothetical UPF0001 protein YBL036C | 151 132 2 17
1f89 | P49954 | Hypothetical 32.5 kDa protein YLR351C | 553 488 55 10
1njr | Q04299 | Hypothetical 32.1 kDa protein in ADH3-RCA1 intergenic region | 4 1 3
1nkq | P53889 | Hypothetical 28.8 kDa protein in PSD1-SKO1 intergenic region | 379 207 172
1jzt | P40165 | Hypothetical 27.5 kDa protein in SPX19-GCR2 intergenic region | 1058 39 1006 13
1jr7 | P76621 | Hypothetical protein ygaT | 11
1ku9 | YF63_METJA | Hypothetical protein MJ1563 | 598 131 214 253

10 Two important cutoffs: Fold assignment. 30% sequence identity.
For estimating the scope, not necessarily applied to actual target selection.
Notes: It is convenient to divide comparative models into three classes, based on their predicted overall accuracy. Applications depend on accuracy. Applications may or may not succeed. Accuracy of a model needs to be predicted and considered before the model is used. Even low-resolution models can be used for some questions. And even the highest-resolution models, even X-ray structures, are not accurate enough for some questions (e.g., catalysis).
D. Baker & A. Sali. Science 294, 93, 2001.

11 Typical errors in comparative models
Figure panels (structures shown: MODEL, X-RAY, TEMPLATE): Incorrect template; Misalignment; Region without a template; Distortion/shifts in aligned regions; Sidechain packing.
Notes: Application of comparative modeling to proteins of known structure identifies the following five types of errors in comparative models. Template selection (<25% sequence identity). Misalignments (<35% sequence identity). Loop modeling, shifts, sidechain modeling (whole range).
Marti-Renom et al. Annu.Rev.Biophys.Biomol.Struct. 29, , 2000.

12 Errors in the context of structural genomics
Currently, errors in fold assignment and alignment are the most frequent sources of error, and they result in the largest errors. However, if the 30% sequence identity cutoff in PSI is satisfied, fold assignment and alignment errors will eventually be infrequent. The most relevant problems will then be the modeling of loops, rigid body shifts/distortions, and sidechains. But, …
Notes: The individual errors integrate into overall errors. It is good to be able to assess the overall accuracy of a model. Errors in a model are not so bad, as long as one knows about them and takes them into account when the model is used. A useful indicator of the overall structural error is a measure of sequence similarity between the modeled protein sequence and the sequence of the known template structure. The reason is … A simple sequence similarity measure is sequence identity. This plot shows … It was obtained by automatically calculating approx. 10,000 models for proteins of known structure and comparing these models with the actual structures. Describe the lower curve. Describe the upper curve. Mention the criticism of CM that it does not improve on the template; this is what it refers to. But it is not a fair criticism, because we do not know the correct target-template alignment in the absence of the target structure, and in any case even a model with a worse RMSD than the template can be more useful than the template for learning about the function of a protein, as I will demonstrate later.
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.

13 Benefits of improved comparative modeling for PSI
Reduce the effort, keep accuracy:
Example 1: If we could afford to model on 25% instead of 30% sequence identity, PSI would need many fewer of the expensive X-ray/NMR structures.
Example 2: If average alignment accuracy increases by 10%, ~100,000 more domain sequences will have their folds assigned correctly (John and Sali, Nucl. Acids Res. 31, 3982, 2003).
Keep the effort, increase accuracy:
More accurate models, for the same templates.
(Plot; axis label: FREQUENCY)

14 Target selection for NYSGXRC
Define a set of proteins of biological interest.
Determine which sequences can be modeled (ModBase: >30% sequence identity, >100 residues).
Characterize and prioritize the remaining sequences (secondary structure, membrane, size, domains, Met content, …).
Establish redundancy/clusters of sequences (PSI-BLAST).
Display the sequences in IceDB for relatively efficient visual inspection and final selection.
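The second step above (set aside what can already be modeled, keep the rest as candidate targets) could look something like the sketch below. The thresholds come from the slide; the `Candidate` record, its field names, and the helper functions are hypothetical and do not reflect the ModBase or IceDB schemas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A protein of interest plus its best available model (hypothetical fields)."""
    name: str
    length: int
    best_model_identity: Optional[float] = None  # % identity to template, None if no model
    best_model_length: int = 0                   # residues covered by that model


def is_modelable(c, min_identity=30.0, min_length=100):
    """True if a model exists with >30% sequence identity and >100 residues."""
    return (
        c.best_model_identity is not None
        and c.best_model_identity > min_identity
        and c.best_model_length > min_length
    )


def remaining_targets(candidates):
    """Sequences that cannot yet be modeled and so stay on the target list."""
    return [c for c in candidates if not is_modelable(c)]


pool = [
    Candidate("a", 300, 45.0, 250),  # modelable
    Candidate("b", 200, 25.0, 180),  # identity too low
    Candidate("c", 150),             # no model at all
    Candidate("d", 120, 60.0, 80),   # model too short
]
targets = [c.name for c in remaining_targets(pool)]
```

The remaining sequences would then go on to the characterization, clustering, and visual inspection steps, which are not sketched here.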

15 Target selections for NYSGXRC
Focus on groups of proteins of biological interest, while conforming to the general requirements of structural genomics.
S. cerevisiae
Cancer-related
Cholesterol biosynthesis pathway
Enzymes
C. elegans
Extracellular proteins
Anti-bacterial drug targets
H. influenzae
D. radiodurans
Nuclear pore proteins

16 A bottleneck
An efficient, accurate and automated method for defining families of appropriate "segments" of sequences in SP/TrEMBL.

17 Scope of structural genomics (target selection)
Cluster PfamA.
Find out how many clusters can be modeled; the remaining ones are targets.
Extrapolate to whole SP/TrEMBL.
Extrapolate to all conceivable protein sequences.
There are ~16,000 30% seq. id. families that contain 90% of all sequences (Vitkup et al. Nat. Struct. Biol. 8, 559, 2001).

18 Structure coverage of residues in SP/TrEMBL (1,288,388 sequences, 411,917,387 residues)
Start with unique sequences.
Low-complexity regions were identified using SEG (Wootton & Federhen).
Coiled-coil regions were identified using COILS (Lupas et al.).
TM segments were identified using PHD (Rost et al.).
Models (E-value ≤ 10^-4 or GA341 ≥ 0.7) were extracted from ModBase (Pieper et al.).
Segments shorter than 30 aa were eliminated after each filter.
Notes: In this pie chart, it would be good to have another cut for residues in models >80 aa AND >30% seq. id. Show the number of residues in <30 aa segments (but not low-complexity or coiled-coil) as another cut in the red piece. Indicate transmembrane residues in the remaining part also, as well as the total number of residues in TM. Modeled segments were not split into models only, fold assignments only, etc., since including models with >30% seq. id. and >80 aa would have led to too many combinations. The >30% seq. id. and >80 aa cut is not included in the pie chart since it would not be accurate, given the manner in which the filters have been applied. The fold-only and model-only cuts were also removed since the distinction does not matter in this case.
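The sequential filtering described on this slide, with short leftover segments discarded after every filter, can be illustrated with per-residue labels. This is a sketch under stated assumptions: the regions passed in are made-up coordinates standing in for the output of SEG, COILS, PHD, or ModBase (none of those tools is called here), and `free_segments` and `apply_filter` are hypothetical helper names.

```python
from collections import Counter


def free_segments(labels):
    """Yield (start, end) half-open runs of residues that are still unlabeled."""
    start = None
    for i, label in enumerate(labels):
        if label is None and start is None:
            start = i
        elif label is not None and start is not None:
            yield (start, i)
            start = None
    if start is not None:
        yield (start, len(labels))


def apply_filter(labels, name, regions, min_len=30):
    """Label residues in `regions`, then retire leftover segments < min_len."""
    for start, end in regions:
        for i in range(start, end):
            if labels[i] is None:  # earlier filters keep priority
                labels[i] = name
    # Materialize the segments first, since labels is modified in the loop.
    for start, end in list(free_segments(labels)):
        if end - start < min_len:
            for i in range(start, end):
                labels[i] = "short"
    return labels


# Toy 200-residue sequence with made-up filter hits.
labels = [None] * 200
apply_filter(labels, "low_complexity", [(50, 70)])
apply_filter(labels, "coiled_coil", [(90, 120)])
counts = Counter(labels)
```

In this toy case the 20-residue stretch between the two hits is retired as "short" after the second filter, which shows why the order of the filters affects the final residue counts, as the notes above point out.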

19 Simple clustering of sequences within each PfamA family
PfamA (Version 10) contains 733,829 sequences (61% of SP/TrEMBL).
Sequences are sorted by length.
Longest sequence becomes the first representative.
All sequences are compared with the representative, and those with sequence identities higher than the threshold are included in the cluster and removed from the original stack.
The longest of the remaining sequences becomes the next representative, and so on.
(Plot: Number of clusters of the PfamA sequences as a function of the threshold; x-axis: threshold)
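The greedy procedure described above can be written in a few lines. This is a minimal sketch, not the code used for the talk: in particular, `toy_identity` compares ungapped sequences position by position and is an assumption made to keep the example self-contained, whereas a real run would compute identity from an alignment.

```python
def toy_identity(a, b):
    """Percent identity over the shorter sequence, without alignment (assumption)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(sequences, threshold, identity_fn=toy_identity):
    """Cluster sequences around successive longest representatives.

    Returns a list of (representative, members) pairs.
    """
    stack = sorted(sequences, key=len, reverse=True)  # longest first
    clusters = []
    while stack:
        representative = stack[0]
        members = [representative]
        remaining = []
        for seq in stack[1:]:
            if identity_fn(representative, seq) > threshold:
                members.append(seq)    # joins this cluster, leaves the stack
            else:
                remaining.append(seq)  # stays for a later representative
        clusters.append((representative, members))
        stack = remaining
    return clusters


toy_sequences = ["AAAAAAAA", "AAAAAAC", "CCCCCC", "CCCCC"]
clusters_30 = greedy_cluster(toy_sequences, threshold=30)
clusters_90 = greedy_cluster(toy_sequences, threshold=90)
```

Raising the threshold can only split clusters in this toy case, which mirrors the plotted dependence of the number of clusters on the threshold.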

20 For the representatives of the “30%” PfamA clusters
Total number of clusters: 71,422
Non-modeled representatives longer than 80 aa: 25,101
Representatives longer than 80 aa without models based on at least 30% seq. id.: ~35,000
Legend: Less than 80 non-modeled aa; Good Model: GA341 ≥ 0.7; Good Fold: E-value ≤ 10^-4
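The two quality criteria in the legend can be combined into a simple classifier. The cutoffs are the ones given on the slide; the function name and the four category labels are illustrative choices, not ModBase terminology.

```python
def classify(e_value, ga341, e_cutoff=1e-4, ga341_cutoff=0.7):
    """Classify a ModBase-style entry by the slide's two criteria.

    Good fold:  E-value <= 10^-4
    Good model: GA341   >= 0.7
    """
    good_fold = e_value <= e_cutoff
    good_model = ga341 >= ga341_cutoff
    if good_fold and good_model:
        return "fold & model"
    if good_fold:
        return "fold only"
    if good_model:
        return "model only"
    return "neither"


examples = [
    classify(1e-10, 0.9),
    classify(1e-10, 0.2),
    classify(0.5, 0.8),
    classify(0.5, 0.1),
]
```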

21 Summary
Question 2: How many structures will be needed to cover a high fraction of eukaryotic protein families and the combined prokaryotic and eukaryotic protein families?
~4,000 new structures to cover all 6,190 PfamA families at any level of similarity.
25,101 structures to cover all structurally unknown PfamA families longer than 80 aa at 30% sequence identity.
~35,000 structures to cover all PfamA families longer than 80 aa at 30% sequence identity.
~8,000 structures to cover 80% of sequences longer than 80 aa in largest PfamA families at 30% sequence identity.
~13,000 structures to cover 90% of sequences longer than 80 aa in largest PfamA families at 30% sequence identity.
~11,000 structures to cover 80% of sequences longer than 80 aa in largest families in SP/TR at 30% sequence identity.
But, there are a number of assumptions/approximations!

22 Question 7: Does sampling of multiple family members from the same or different organisms adequately improve success in obtaining a representative structure?
Attempting to determine structures of multiple members of a family to increase per-family success rate is a key to the success of structural genomics.
The approach is probably just fine (or at least it is not refuted by the present data):
Single proteins have ~17% chance to crystallize (once in the pipeline).
A family with an average of 2 members has ~35% chance that at least one of them will crystallize (once in the pipeline).
Thus, the probabilities to get crystals for different members of the same family are roughly independent of each other (binomial statistics …).
(Plot: NYSGXRC data)
Andras Fiser, Guiping Xu (AECOM), analysis of NYSGXRC data
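The independence argument can be checked numerically: if each member crystallizes independently with probability p, the chance that at least one of n members crystallizes is 1 - (1 - p)^n. The sketch below uses the slide's ~17% figure and treats a family as having exactly two members, which is a simplification of "an average of 2 members"; the function name is illustrative.

```python
def p_at_least_one(p_single, n_members):
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** n_members


p_single = 0.17                              # per-protein rate from the slide
expected_pair = p_at_least_one(p_single, 2)  # 1 - 0.83**2
observed_family = 0.35                       # per-family rate from the slide

print(f"expected under independence (n=2): {expected_pair:.1%}")
print(f"observed per family:               {observed_family:.0%}")
```

Under these assumptions the expected value is about 31%, close to the observed ~35%, which is the sense in which the data do not refute independence. Family sizes vary around the average, so the comparison is only approximate.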

23 Question 5: Should we designate a certain percentage of targets to membrane proteins and/or protein complexes?

24 From domains to assemblies
Sali. NIH Workshop on Structural Proteomics of Biological Complexes. Structure 11, 1043, 2003. Organized by W. Chiu and S. Almo.
Figure labels: domains; proteins; assemblies.
~2.5 domains in a protein
A few domain partners per domain
Notes: After positive notes, qualify structural genomics. Docking is needed because of the high-throughput nature of structural genomics (concentrating on domains, rather than whole proteins) and because of the modular nature of proteins.
Sali, Earnest, Glaeser, Baumeister. From words to literature in structural proteomics. Nature 422, , 2003.

25 Target selection schemes for structural proteomics
The proteins organized into core biological processes (D. Drubin);
100 complex structures of some importance (M. Gerstein);
select subcomplexes, such as those determined by large-scale affinity purification experiments (S. Almo);
the same complex from a number of species to facilitate discovery of evolutionary relationships and functional mechanisms (S. Almo);
a comprehensive set of representative binary domain interfaces that may allow us to model higher order complexes with useful accuracy and efficiency (A. Sali).

26 Comments
Biological and medical criteria: a sensitive function of time and people, in addition to Nature.
Structural GENOMICS implies comprehensive uniform coverage.
True believers in structural biology might argue that determining the structures of uncharacterized proteins is a good way to start characterizing their biology. Therefore, uniform coverage may be optimal for maximizing biological and medical impact (in the long run); there may not be a frustration between uniform coverage and medical relevance after all.
Bottom line: A massive investment in structural biology is a good thing, details notwithstanding.

27 A proposal to all of us
Goal: To provide the best possible estimate of the scope of structural genomics, by comprehensively considering and contrasting a variety of target selection strategies.
Format: Write a definitive exhaustive review contributed by all people with some results. Perhaps associate it with separate in-depth papers by individuals, published in the same journal issue.

28 Acknowledgments
Comparative Modeling: Narayanan Eswar, Ursula Pieper, Ben Webb, Marc A. Martí-Renom, Roberto Sánchez (MSSM), Bino John, András Fiser (AECOM), Francisco Melo (CU, Chile), M. S. Madhusudhan, Ash Stuart, Nebojša Mirkovic, Valentin Ilyin (NE), Eric Feyfant (GI), Min-Yi Shen
Structural Genomics: Stephen Burley (SGX), Andras Fiser (AECOM), Guiping Xu (AECOM)
NY-SGXRC, NIH, NSF, Sinsheimer Foundation, A. P. Sloan Foundation, Burroughs-Wellcome Fund, Merck Genome Res. Inst., Mathers Foundation, I.T. Hirschl Foundation, The Sandler Family Foundation, Sun, IBM, Intel, Structural Genomix

