Lessons from CASP targets ShuoYong Shi, Lisa Kinch, Jimin Pei, Ruslan Sadreyev, and Nick V. Grishin Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas
1. New folds: 397_1, 496_1; 2. A few known folds are predicted no better than new folds: 460, 407_2; 3. Short motif recognition = success: 465; 4. Short motif recognition = failure: 467; 5. Structural changes not predicted: 510; 6. Inspect your alignments carefully: 480
NF (new fold): a historic category in CASP. New folds: were there any? 2008: where did the new folds go? Of 176 domains, 2 are possibly new folds (~1%).
N-domain of T0397: 3d4r chain A residues New fold #1: N-domain of T0397
First server models for T0397_1: Gaussian kernel density estimation of the GDT-TS scores of the first server models, plotted at various bandwidths (= standard deviations). The GDT-TS scores are shown as a spectrum along the horizontal axis: each bar represents one first server model. The bars are colored green for the top 10 servers, gray for the bottom 25%, and black for the rest. A family of curves with varying bandwidth is shown: the bandwidth varies from 0.3 to 8.2 GDT-TS % units in steps of 0.1, which corresponds to the color ramp from magenta through blue to cyan. The thicker curves (red, yellow-framed brown, and black) correspond to bandwidths 1, 2, and 4, respectively.
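The plot described in this caption can be approximated with a few lines of code. The following is a minimal sketch, not the authors' plotting script: it assumes a plain Gaussian kernel with no boundary correction, and the score list, the evaluation grid, and the function names `gaussian_kde`, `bandwidth_family`, and `color_bars` are all hypothetical, introduced only for illustration. How ties at the top-10 or bottom-25% cutoffs were handled is not stated in the slides; here they are broken by sort order.

```python
import math

def gaussian_kde(scores, bandwidth, grid):
    """Gaussian kernel density estimate on `grid`; bandwidth = kernel std. dev."""
    norm = 1.0 / (len(scores) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in scores)
        for x in grid
    ]

def bandwidth_family(start=0.3, stop=8.2, step=0.1):
    """Bandwidths from start to stop inclusive, indexed by integers to avoid float drift."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 1) for i in range(count)]

def color_bars(scores, top_n=10, bottom_fraction=0.25):
    """Color each score: green (top_n highest), gray (lowest fraction), black (rest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_bottom = int(len(scores) * bottom_fraction)
    colors = ["black"] * len(scores)
    for rank, i in enumerate(order):
        if rank < top_n:
            colors[i] = "green"
        elif rank >= len(scores) - n_bottom:
            colors[i] = "gray"
    return colors

# Hypothetical GDT-TS scores of first server models (made-up illustration data).
scores = [12.5, 14.0, 15.2, 16.8, 17.1, 18.3, 18.9, 19.4, 20.0, 20.6,
          21.3, 22.7, 23.5, 24.1, 25.8, 27.0, 28.4, 30.2, 33.6, 38.9]

# Evaluate every curve on a 0..100 grid in steps of 0.5 GDT-TS units.
grid = [i * 0.5 for i in range(201)]
curves = {h: gaussian_kde(scores, h, grid) for h in bandwidth_family()}

# The three curves drawn thicker in the slides (bandwidths 1, 2 and 4).
thick_curves = [curves[h] for h in (1.0, 2.0, 4.0)]
bar_colors = color_bars(scores)
```

Small bandwidths follow individual server scores closely, while large bandwidths merge them into a single broad peak, which is why the slides show the whole family of curves rather than one fixed bandwidth.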
Most similar: ferredoxin-like fold. Structure and topology diagrams of the ferredoxin-like fold, the fold closest to T0397_1.
N-domain of T0496: 3do9 chain A, residues New fold #2: N-domain of T0496
First server models for T0496_1: Gaussian kernel density estimation of the GDT-TS scores of the first server models, plotted at bandwidths from 0.3 to 8.2 GDT-TS % units, with the same bar colors and curve styles as in the T0397_1 plot.
Most similar: RNase H fold. Structure and topology diagrams of the RNase H fold, the fold closest to T0496_1.
E.g. #1: T0460. Known fold: some are predicted no better than new folds! First server models for T0460: Gaussian kernel density estimation of the GDT-TS scores of the first server models, plotted at bandwidths from 0.3 to 8.2 GDT-TS % units, with the same bar colors and curve styles as in the T0397_1 plot.
T0460: very difficult target Cartoon diagram of 460: 2k4n model 1 residues 1-52,67-10 Jumping through 20 NMR models of 2k4n
Cartoon diagram of 460: 2k4n model 1 residues 1-52,67-10 Cartoon diagram of NADH-quinone oxidoreductase: 2fug chain 5 residues T0460 is homologous to Nqo5. This homologous template was NOT FOUND BY ANY SERVER! Why? Singleton sequence!
E.g. #2: C-domain of T0407. Known fold: some are predicted no better than new folds! First server models for T0407_2: Gaussian kernel density estimation of the GDT-TS scores of the first server models, plotted at bandwidths from 0.3 to 8.2 GDT-TS % units, with the same bar colors and curve styles as in the T0397_1 plot.
Date: Mon, 2 Jun :56: (CDT) From: Nick Grishin To: David Baker Cc: Ruslan Sadreyev, Robert M Vernon Subject: Re: C-terminus of T0407 I liked IG because of 1) length; 2) ~7 strands; 3) many IG are interaction domains in enzymes. These are very compelling reasons.
Cartoon diagram of 407, C-domain: 3e38 chain A residues Cartoon diagram of VAP-A MSP Homology Domain: 3z9l T0407_2 has Immunoglobulin fold
IG-based Baker model Top GDT server model: Phyre_de_novo TS1 No server predicted IG fold for T0407_2 Cartoon diagram of 407, C-domain: 3e38 chain A residues
T0465: who found the template? HHpred!
T0465 is a diverged FYSH domain FYSH domain of hypothetical protein AF0491: 1t95 chain A residues Cartoon diagram of T0465: 3dfd chain A residues
T0465 fold is predicted by HHpred HHpred2 TS1 Cartoon diagram of T0465: 3dfd chain A residues Falcon TS1
T0467: most interesting target! Bioinfo.pl provides these predictions:
T0467: is bioinfo.pl correct?
T0467 OB-fold C-terminal fragment: 2k5q model 1 residues Sso7d SH3-fold C-terminal fragment: 2bf4 chain A residues You can say so (if you want)
However, only the local prediction is correct: extending it to cover the whole domain results in a wrong fold prediction! T0467 OB-fold: 2k5q model 1 residues 7-97 Sso7d SH3-fold: 2bf4 chain A
T0510: “server only” target with a twist Cartoon diagram of 510 domains: 3doa, N-, middle and C-domains are shown in blue, green and red, respectively. Cartoon diagram of MutM domains: 1ee8_A, N-, middle and C-domains are shown in blue, green and red, respectively.
A closer look at the N-domains reveals large topological differences. N-domain of 510: 3doa residues N-domain of MutM: 1ee8 chain A residues The insertion close to the N-terminus is red; the insertion in the middle of the domain is blue.
N-domains are nevertheless homologous
T0480: easy alignment with templates
NADH pyrophosphatase intervening domain 1vk6: residues Ribbon diagram of 480: 2k4x model 1 residues Zinc ion is shown in magenta and side chains of its ligands (four Cys) are displayed. T0480: most predictions had an error 480 MULTICOM-CLUSTER TS1
Jumping through 20 NMR models of 2k4x Ribbon diagram of 480: 2k4x model 1 residues Zinc ion is shown in magenta and side chains of its ligands (four Cys) are displayed. T0480: unusual bulge
T0480: bulge could have been predicted
Summary: 1. New folds constitute less than 2% of newly solved non-redundant structures. 2. Many known folds cannot be predicted because templates are impossible to find. 3. Globalization of correct local alignment may or may not yield correct fold prediction. 4. Large structural changes happen in protein cores. 5. Careful inspection of alignments may solve some modeling problems.
Acknowledgement
Our group: Shuoyong Shi, Jing Tong, Ruslan Sadreyev, Lisa Kinch, Jimin Pei, Ming Tang, Sasha Safronova, Yuan Qi, Hua Cheng, Jamie Wrabl, Indraneel Majumdar, Erik Nelson, Yong Wang, S. Sri Krishna, Bong-Hyun Kim, Dorothee Staber
Collaborators: David Baker (U. Washington), Kimmen Sjölander (UC Berkeley), William Noble (U. Washington)
HHMI, NIH, UTSW, The Welch Foundation