Combining Sequence and Structure Information Topic 17.


Presentation on theme: "Combining Sequence and Structure Information Topic 17."— Presentation transcript:

1 Combining Sequence and Structure Information Topic 17

2 Problem: Identify the most important region(s) What is a functional site? This is actually a very difficult question to answer robustly. Of course, “catalytic residues” are functional sites. Generally, we also assume that other sites directly interacting with the substrate, or with other proteins in a complex, are functional. However, what about sites far removed from the “active site region”? If a mutation at one of these sites is deleterious, is the site functional?

3 Problem: Identify the most important region(s)
Catalog of Important Sites (KC and Livesay)
Catalytic Sites: Sites that are identified as catalytic sites in the Catalytic Site Atlas (CSA).
Active Sites: Union of CSA catalytic residues and all residues contacting the catalytic residues using HBPLUS.
Ligand-Binding Sites: Sites identified by characterizing all enzyme-ligand interactions using HBPLUS.
What about allosteric sites? Structural sites? Etc.?

4 Note that the two things are not the same

5 The devil is in the details

6 Methods Typical approach: combine sequence and structural information. Alignment Content: Sequence conservation and phylogeny-based. Typically also use structural information. Machine-Learning Methods: Computational “black boxes,” but give good results. Structure Features: Graph-theoretic methods, protein surface shape, protein surface physicochemical properties, etc. [Figure: triosephosphate isomerase color-coded by conservation]

7 Catalytic Site Atlas

8 Catalytic Propensity

9 Multiple Sequence Alignment

10 The Sum of Pairs (SP) score of column $m_i$ is $SP(m_i) = \sum_{k<l} s(m_k, m_l)$, where $s(m_k, m_l)$ is the scoring matrix substitution value and the sum is enumerated over all possible pairs within a single alignment column. The Shannon entropy (S) score is $S = -\sum_i p_i \log p_i$, where $p_i$ is the probability of each residue i in that column. The Williamson Property Entropy (WPE) is very similar: it sums over groups of chemically similar residues (k=9), and the probability within the logarithm is normalized by the average column probability. Rate4site (R4S) constructs a mathematical description of the underlying phylogeny in order to improve determination of the rate of evolution at each site. The rate of evolution at each site is then estimated using the maximum likelihood principle, which considers both phylogenetic tree branch lengths and the stochastic nature of evolution. And many others.
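The two simplest column scores on this slide can be sketched in a few lines. This is a minimal illustration, not the scoring code used in the cited work: the `identity_score` function is a toy stand-in for a real substitution matrix, the logarithm base (2) is an assumption, and gaps are treated as ordinary characters.

```python
import math
from collections import Counter
from itertools import combinations


def shannon_entropy(column):
    """S = -sum_i p_i * log2(p_i) over the residues observed in one column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sum_of_pairs(column, score):
    """Sum score(a, b) over every unordered pair of residues in one column."""
    return sum(score(a, b) for a, b in combinations(column, 2))


def identity_score(a, b):
    # Toy stand-in for a substitution matrix (assumption, for illustration only).
    return 1 if a == b else -1


conserved_column = "CCCC"  # fully conserved column
mixed_column = "AAZZ"      # two residues, evenly split

print(shannon_entropy(conserved_column), shannon_entropy(mixed_column))
print(sum_of_pairs(conserved_column, identity_score),
      sum_of_pairs(mixed_column, identity_score))
```

Low entropy and a high SP score both point to a conserved column; swapping `identity_score` for a real matrix lookup would let chemically conservative substitutions score higher than radical ones.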

11 Relative predictive power
Method     Catalytic  Active  Ligand-binding
R4S        0.83       0.75    0.74
SP-score   0.77       0.66    0.70
JSD        0.78       0.72    0.67
WPE        0.75       0.70    0.66

12 ConSurf is a web-implementation of R4S

13 Throw everything at it, including the kitchen sink…

14 Gutteridge et al., JMB (2003) 330:719-734. Relative importance of input variables

15 Gutteridge et al., JMB (2003) 330:719-734. Three different NNs. Using structural clustering to filter out FPs. Unfortunately, the method tends to over-predict catalytic residues. Structural clustering improves results.

16 Going beyond conservation
HKAMMKLQWBBMVRERCUGDYAD
HRAFGSGFFBYTUJGGCADFYDD
EFZHRDADFD-EGHDGCVRRSER
ADZDFDAADFDEHGRRCADDSDD
DFZBBDMJJJ-EDAFDCRRVSHT
ADHADFDEBGJEVEEECADDSDD
NTHLJDJDDGUEKJFJCLDLSEI
OOHMCVDUEGTEDDEDC--DSEI
JDILKJADFFIFEVEECLDKSVV
JBIOUDFFVFCFLKEICKDKSEE
Of course, well conserved positions make very good functional site predictions. But what defines differences between sub-families in the overall phylogeny?

17 Evolutionary trace (aka tree-determinant) residues
..A........B....C...Y..
..Z..D.....E....C...S..
..H......G.E....C...S..
J.I......F.F....C...S..
Analyze to detect those residues with a tendency to be conserved within a subfamily of proteins, but which differ between subfamilies (tree-dependent positions), and regard them as a result of the evolutionary scenario in which conservation and specificity are present in a delicate balance.
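The idea of "conserved within a subfamily, different between subfamilies" can be sketched on the toy alignment from slide 16. This is a simplified illustration under stated assumptions, not the published evolutionary trace procedure: the subfamily row ranges in `SUBFAMILIES` are inferred from the four consensus rows on this slide and are hypothetical, and a single fixed partition is used where the real method works from a phylogenetic tree.

```python
# Toy alignment from slide 16.
ALIGNMENT = [
    "HKAMMKLQWBBMVRERCUGDYAD",
    "HRAFGSGFFBYTUJGGCADFYDD",
    "EFZHRDADFD-EGHDGCVRRSER",
    "ADZDFDAADFDEHGRRCADDSDD",
    "DFZBBDMJJJ-EDAFDCRRVSHT",
    "ADHADFDEBGJEVEEECADDSDD",
    "NTHLJDJDDGUEKJFJCLDLSEI",
    "OOHMCVDUEGTEDDEDC--DSEI",
    "JDILKJADFFIFEVEECLDKSVV",
    "JBIOUDFFVFCFLKEICKDKSEE",
]

# Hypothetical subfamily split as (start, stop) row ranges (assumption).
SUBFAMILIES = [(0, 2), (2, 5), (5, 8), (8, 10)]


def classify_columns(alignment, subfamilies):
    """Return (invariant, class_specific) lists of 0-based column indices.

    A column qualifies only if every subfamily has a single, non-gap residue
    there. It is invariant if that residue is the same in all subfamilies and
    class-specific if at least two subfamilies disagree.
    """
    invariant, class_specific = [], []
    for col in range(len(alignment[0])):
        consensus = []
        for start, stop in subfamilies:
            residues = {seq[col] for seq in alignment[start:stop]}
            if len(residues) != 1 or "-" in residues:
                consensus = None
                break
            consensus.append(residues.pop())
        if consensus is None:
            continue
        if len(set(consensus)) == 1:
            invariant.append(col)
        else:
            class_specific.append(col)
    return invariant, class_specific


invariant, class_specific = classify_columns(ALIGNMENT, SUBFAMILIES)
print("invariant columns:", invariant)
print("class-specific columns:", class_specific)
```

The strict "one residue per subfamily" rule keeps the sketch short; a practical version would tolerate some within-family variation and repeat the analysis at several tree depths.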

18 Evolutionary trace (aka tree-determinant) residues The goal is to identify and understand the role of the essential sites that determine the structure and proper functioning of the molecule. A thorough evaluation of the importance of all sequence sites involves extremely time-consuming and laborious biochemical experimental methods. All methods presented here rely on some sort of co-evolutionary theme. Put otherwise, Nature has allowed some plasticity within (some) functional positions, assuming the appropriate conditions are met elsewhere. Starting from the groundbreaking Lichtarge et al. paper in 1996, several approaches have been presented that use this intra-family co-evolution principle to predict functional sites. The methods, called evolutionary trace, tree-determinant residues, phylogenetic motifs, ConSurf, and strong motifs, are conceptually similar and provide somewhat consistent results.

19 Evolutionary trace (aka tree-determinant) residues The ET process

20 Structural clusters of ET overlap ligand binding sites Active site Trace residues 97% of the time (37 of 38 examples), the largest cluster of trace residues contacts the ligand (Madabushi et al, Journal of Molecular Biology, 2002).

21 Livesay et al. (2003). Biochemistry 42:3464-73. What leads to conservation of CuZnSOD surface electrostatics?

22 Livesay et al. (2003). Biochemistry 42:3464-73. Structural and structure variability

23 Stephen Jay Gould said, “The proof of evolution lies in those adaptations that arise from improbable foundations.” Livesay et al. (2003). Biochemistry 42:3464-73. An improbable result

24 Triosephosphate isomerase window width = 5 PSZ threshold = -1.5 TIM Prosite definition La, Sutch, Livesay (2005). Proteins 58:309-320. Phylogenetic motifs Notice structural clustering despite little overall sequence proximity

25 La, Sutch, Livesay (2005). Proteins 58:309-320.

26 Copper, zinc-superoxide dismutase TATA-box binding protein Inorganic pyrophosphatase Cytochrome P450 Myoglobin

27 Glutamate dehydrogenase Enolase Alcohol dehydrogenase Glyceraldehyde-3-phosphate dehydrogenase

28 Trace residues that correspond to PMs are colored red. Trace residues that do not correspond to PMs are colored blue. PMs identify sequence clusters of ET residues

29 La, Sutch, Livesay (2005). Proteins 58:309-320. PMs also correspond to traditional motif definitions

30 That is, PMs represent a subset of motif space La, Sutch, Livesay (2005). Proteins 58:309-320.

31 APSRKFFVGGNWKMNGRKQSLGELIGTL NAAKVPADTEVVCAPPTAYIDFARQKLD PKIAVAAQNCYKVTGAFTGEISPGMIKD CGATWVVLGHSERRHVFGESDELIGQKV AHALAEGLGVIACIGEKLDEREAGITEV FEQTKVIADNVKDWSKVVLAYEPVWAIG TGKTATPQQAQEVHEKLRGWLKSNVSDA VAQSTRIIYGGVTGATCKELASQPDVDG FLVGGASLKPEFVDIINAKQ

32 Livesay, La (2005), Protein Science 14:1158-1170.

33 Figure caption: Ligand-binding positions of tyrosine aminotransferase of Trypanosoma cruzi. One chain of the crystal structure of tyrosine aminotransferase from Trypanosoma cruzi (PDB code 1BW0). Results of a conservation-based measure (Williamson, in blue) are shown compared to the phylogeny-based SMERFS (in red). Positions predicted by both techniques are shown in green, the PLP cofactor in orange. Protein regions in stick representation and labelled are those important for cofactor binding, as described in the text. Manning et al. BMC Bioinformatics 2008 9:51
The SMERFS algorithm is intermediate in philosophy to those of TreeDet [21] and MINER [18] and compares local to global similarity matrices over windows on an alignment. The work presented here has shown that SMERFS produces sets of putative functional positions in multiple sequence alignments fundamentally different from those of conservation measures. For this reason conservation measures and phylogeny-aware methods such as SMERFS should be considered as complementary tools. The data suggest that if alignment positions involved in the core function of a protein, for example catalysis, are the target of a study, relatively simple conservation measures remain the most useful tool. If less critical positions, perhaps responsible for defining sequence subfamily specificity, are the target, then methods such as SMERFS may be of use. Finally, SMERFS has been shown to predict many more surface positions than conservation, reducing the possibility of confusing signals from positions of core structural rather than functional significance.

34 N_p: total number of vertices. L_ij: shortest path between i and j. Vertex: Cα. Edge: if distance is within 8.5 Å.

35 Centrality metrics Degree (aka connectivity or valency): Simply the integer count of the number of edges a vertex shares. Closeness: The closeness centrality, CC, for a vertex v is the reciprocal of the sum of geodesic distances to all other vertices in the graph. Geodesic distance (aka shortest path): The number of edges in the shortest path connecting two vertices. Note: this assumes constant edge weights. [Figure: example graph with vertices labelled 1 to 6]
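The graph construction from slide 34 and the metrics defined here can be sketched together. This is a minimal illustration under stated assumptions: the coordinates are made up, "within 8.5 Å" is taken as inclusive, unreachable vertices are simply ignored, and closeness is the plain reciprocal stated above (slide 34 defines N_p, so the original formula may well include a normalization by the number of vertices that is not reproduced here).

```python
import math
from collections import deque

CUTOFF = 8.5  # Å, Cα contact distance from slide 34


def contact_graph(ca_coords, cutoff=CUTOFF):
    """Adjacency sets: one vertex per Cα, edge if distance <= cutoff (assumed inclusive)."""
    n = len(ca_coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def geodesic_distances(adj, source):
    """Breadth-first search: number of edges on the shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def degree(adj, v):
    """Integer count of edges that vertex v shares."""
    return len(adj[v])


def closeness(adj, v):
    """Reciprocal of the summed geodesic distances from v to reachable vertices."""
    total = sum(d for u, d in geodesic_distances(adj, v).items() if u != v)
    return 1.0 / total if total else 0.0


# Made-up Cα positions 5 Å apart on a line, so only neighbours are in contact.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
graph = contact_graph(coords)
for v in graph:
    print(v, degree(graph, v), closeness(graph, v))
```

In this chain the two interior vertices have the highest closeness, which mirrors the observation on the following slides that central residues tend to be buried.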

36 Protein networks The networks are usually highly clustered, with few links connecting any two random vertices. A key feature of many complex systems (including protein networks) is robustness, meaning that the system can continue to function despite perturbations. On the other hand, robustness is coupled with fragility toward non-trivial rearrangements of the connections between the system’s key internal parts. Proteins are no exception: they have evolved toward a robust design, yet they are vulnerable to mutation at certain residues, which suggests that central residues are of special importance. Recently, various centrality scores have been used to predict folding nuclei and catalytic sites.

37 del Sol et al., Mol Sys Biol (2006). Protein networks conserve “hubs”

38 Close residues are typically buried… However, there is a clear distinction between buried noncatalytic and catalytic sites. [Figure legend: catalytic residues; one third most buried residues; middle third; one third least buried residues] Chea and Livesay (2007), BMC Bioinformatics, 8:153.

39 Catalytic site prediction power Closeness centrality... [Figure legend: non-catalytic residues; catalytic residues] ROC curve for CC predictions. Chea and Livesay (2007), BMC Bioinformatics, 8:153.

40 Simple steps to improve accuracy: raw predictions, accessibility filter, residue identity filter. Computed p-values on the null hypothesis that CC does not predict catalytic sites better than random.

