Exploiting Structural and Comparative Genomics to Reveal Protein Functions: How many domain families can we find in the genomes and can we predict the.


1 Exploiting Structural and Comparative Genomics to Reveal Protein Functions
- How many domain families can we find in the genomes, and can we predict the functions of relatives?
- Exploiting protein structure to predict protein functions
- Using correlated phylogenetic profiles based on CATH domains to reveal functional associations
CATH: domain families of known structure
Gene3D: protein families and domain annotations for completed genomes

2 CATHEDRAL (Oliver Redfern and Andrew Harrison)
CATH version 3.0: 1100 fold groups, 2100 homologous superfamilies, 86,000 domains
- Combines a rapid graph-theory secondary structure filter with dynamic programming for accurate residue alignment
- An SVM is used to combine scores and assess the significance of a match

3 Fold Recognition Performance (plot axes: % correct fold, rank; labels shown: DDP, SSAP)

4 Gene3D: Domain annotations in genome sequences
- >2 million protein sequences from 300 completed genomes and UniProt
- Scan against a library of HMM models (~2000 CATH, ~9000 Pfam)
- Assign domains to CATH and Pfam superfamilies
Benchmarking against structural data shows that 76% of remote homologues can be identified using the HMMs

5 Gene3D: Domain annotations in genome sequences
DomainFinder: structural domains from CATH take precedence
(Diagram: a protein chain, N to C terminus, annotated with CATH-1, Pfam-1, NewFam and Pfam-2 domains)
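The precedence rule on this slide can be illustrated with a minimal sketch. The `Domain` tuple, the function name `resolve_assignments`, and the rule that any residue overlap discards the lower-priority hit are all assumptions made for illustration; the slide does not describe how DomainFinder actually resolves conflicts.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    source: str   # "CATH" or "Pfam" (hypothetical labels)
    family: str
    start: int    # first residue, inclusive
    end: int      # last residue, inclusive


def overlaps(a: Domain, b: Domain) -> bool:
    """True if the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def resolve_assignments(hits: List[Domain]) -> List[Domain]:
    """Greedy resolution: accept hits in priority order (CATH first),
    skipping any hit that overlaps one already accepted."""
    ordered = sorted(hits, key=lambda d: (d.source != "CATH", d.start))
    accepted: List[Domain] = []
    for hit in ordered:
        if not any(overlaps(hit, kept) for kept in accepted):
            accepted.append(hit)
    # Report the surviving domains in chain order.
    return sorted(accepted, key=lambda d: d.start)


if __name__ == "__main__":
    hits = [
        Domain("Pfam", "Pfam-1", 1, 120),
        Domain("CATH", "CATH-1", 100, 250),
        Domain("Pfam", "Pfam-2", 260, 400),
    ]
    for d in resolve_assignments(hits):
        print(d.family, d.start, d.end)
```

In this toy input the Pfam-1 hit overlaps the CATH-1 hit and is dropped, while the non-overlapping Pfam-2 hit is kept.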

6 Domain families ranked by size (number of domain sequences)
(Plot axes: percentage of all domain family sequences, rank by family size; series: CATH superfamilies of known structure, Pfam families of unknown structure, NewFam of unknown structure)
~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families

7 Structural superfamily (CATH)
- Only ~3% of diverse sequences in large CATH domain families have known structures
- <100 families account for 50% of domain sequences of known fold
(Diagram: a superfamily divided into subfamilies of relatives, F1 to F5; relatives within a subfamily are likely to have similar functions)

8 Gene3D: Domain mappings for 300 Completed Genomes
http://www.biochem.ucl.ac.uk:8080/Gene3D
Iterative Profile Search Methodology
- 300 genomes, >2 million sequences including UniProt and RefSeq
- structural domain assignments from CATH
- functional domain assignments from Pfam
- Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct
Russell Marsden, Corin Yeats, Michael Maibaum, David Lee
Yeats et al., Nucleic Acids Res. 2006

9 Conservation of enzyme function in homologous domains with same multidomain architecture (MDA) in Gene3D
(Diagram: Protein 1 and Protein 2, each with the architecture CATH-1, Pfam-1, NewFam, Pfam-2)
(Plot axes: conservation of EC number to 3 levels (%), sequence identity)

10 Sequence identity thresholds for 95% conservation of enzyme function (to 3 EC levels)
(Two histograms, each with axes: sequence identity threshold, number of sequences, number of families)
332 highly conserved families; 60 highly variable families

11 Exploiting Structural and Comparative Genomics to Reveal Protein Functions
- How many domain families can we find in the genomes, and can we predict the functions of relatives?
- Exploiting protein structure to predict protein functions
- Using correlated phylogenetic profiles based on CATH domains to reveal functional associations
CATH: domain families of known structure
Gene3D: protein families and domain annotations for completed genomes

12 Conservation of Enzyme Function in CATH Domain Families (plot axes: pairwise sequence identity, structural similarity (SSAP) score; points marked as same functions or different functions)

13 Correlation of structural variability with number of different functional groups (plot axes: number of diverse structural clusters within family, number of COG functional groups)

14 Some families show great structural diversity
- Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments (2DSEC algorithm)
- In 117 superfamilies, relatives have expanded by 2-fold or more
- These families represent more than half the genome sequences of known fold
Gabrielle Reeves

15 Structural embellishments can modify the active site Galectin binding superfamily

16 Structural embellishments can modulate domain interactions
(Images: Glucose 6-phosphate dehydrogenase; side orientation; face orientation; Dihydrodipicolinate reductase)
Additional secondary structures shown at (a) are involved in subunit interactions

17 Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
(Images, ATP Grasp superfamily: Biotin carboxylase; D-alanine-D-alanine ligase; dimer of biotin carboxylase)

18 Secondary structure insertions are distributed along the chain but aggregate in 3D
- 60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments
- In 80% of domains, 1 or more embellishments contact other domains or subunits
- 85% of residue insertions comprise only 1 or 2 secondary structures
(Histogram axes: frequency (%), size of indel (number of secondary structures, 1 to 12); bars with indel frequency < 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06%, 0.02%)

19 ~80% of variable families adopt regular layered architectures
(Panels: 2 Layer Beta Sandwich; 2 Layer Alpha Beta Sandwich; Alpha / Beta Barrel; 3 Layer Alpha Beta Sandwich)

20 2 Layer Beta Sandwich 2 Layer Alpha Beta Sandwich Alpha / Beta Barrel 3 Layer Alpha Beta Sandwich

21 Function Prediction to Guide Target Selection for Structural Genomics
Structural superfamily (CATH)
- Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures
(Diagram: a superfamily divided into groups of close relatives with the same MDA, F1 to F5; relatives within a group are likely to have similar functions)

22 Conservation of Enzyme Function in Homologous Domains (plot axes: structure similarity (SSAP) score, conservation of EC levels (%))

23 FLORA: structural templates for assigning structures to functional subgroups in CATH
- Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily
- Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seeds)
- Explore the local structural environment of seed residues to find conserved structural motifs
Dataset of 84 enzyme superfamilies in CATH, of which 21 are functionally very diverse

24 Finding conserved residue positions (seeds) with Scorecons
- Build a multiple sequence alignment of relatives from a functional family, guided by structure alignment
- Identify the most highly conserved residue positions (seed positions) using Scorecons (Valdar and Thornton, 2001)
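As a rough illustration of picking seed positions from an alignment, the sketch below scores each column by one minus its normalised Shannon entropy. This is a deliberately simple stand-in, not the Scorecons scoring scheme; the function names, the 0.9 threshold and the toy alignment are assumptions.

```python
import math
from collections import Counter
from typing import List


def column_conservation(column: str) -> float:
    """Return 1 - normalised Shannon entropy of a column (1.0 = invariant).
    Gap characters ('-') are ignored; an all-gap column scores 0.0."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(20)  # 20 standard amino acids
    return 1.0 - entropy / max_entropy


def seed_positions(alignment: List[str], threshold: float = 0.9) -> List[int]:
    """Return 0-based column indices whose conservation meets the threshold."""
    length = len(alignment[0])
    columns = ["".join(seq[i] for seq in alignment) for i in range(length)]
    return [i for i, col in enumerate(columns)
            if column_conservation(col) >= threshold]


if __name__ == "__main__":
    toy_alignment = ["HDKC", "HEKC", "HDRC", "H-KC"]  # hypothetical sequences
    print(seed_positions(toy_alignment))
```

A production scorer would also weight sequences by redundancy and account for residue similarity, which this sketch leaves out.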

25 FLORA Algorithm for Identifying Structural Homologues with Similar Functions
- Assign conserved sequence seeds
- Expand to local environment of 12Å
- Identify structurally conserved residue cliques and generate template
- New structures are scanned against a library of FLORA templates, and SVMs are used to assess significance of matches
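The "expand to local environment of 12Å" step can be sketched as a simple distance filter. The representation of a structure as one coordinate per residue (for example the Cα atom), the function name `local_environment`, and the toy coordinates are assumptions; the slide does not say which atoms FLORA measures between.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def local_environment(coords: Dict[int, Coord], seeds: List[int],
                      radius: float = 12.0) -> List[int]:
    """Return residue numbers lying within `radius` angstroms of any seed."""
    selected = set()
    for residue, xyz in coords.items():
        for seed in seeds:
            if math.dist(xyz, coords[seed]) <= radius:
                selected.add(residue)
                break  # one nearby seed is enough
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical coordinates along a line, in angstroms.
    coords = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0),
              3: (13.0, 0.0, 0.0), 4: (30.0, 0.0, 0.0)}
    print(local_environment(coords, seeds=[1]))
    print(local_environment(coords, seeds=[1, 2]))
```

The residues returned here would be the candidates from which conserved cliques are then sought.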

26 Performance of FLORA vs Global Structure Comparison (SSAP) (plot axes: error rate, coverage)

27 Exploiting Structural and Comparative Genomics to Reveal Protein Functions
- How many domain families can we find in the genomes, and can we predict the functions of relatives?
- Exploiting protein structure to predict protein functions
- Using correlated phylogenetic profiles based on CATH domains to reveal functional associations
CATH: domain families of known structure
Gene3D: protein families and domain annotations for completed genomes

28 Eisenberg Phylogenetic Profiles for Detecting Functional Associations: presence or absence of a superfamily in an organism
Gene3D Phylogenetic Occurrence Profiles: number of relatives from a superfamily in an organism
(Two tables: CATH domain superfamilies 1 to 3 against organisms sp1 to sp4; superfamilies with matching profiles are marked as functionally linked)
Presence/absence values shown: 1 0 1 0 0 0 1 1
Occurrence counts shown: 35 0 12 60 12 13 14 11 6 0 0 0

29 Phylogenetic Occurrence Profiles Based on Domain Superfamily and Subfamilies in Gene3D (diagram: a superfamily subdivided into 30%, 40% and 50% sequence identity clusters)

30 Phylogenetic Profiles for Families and Subfamilies
Domains clustered at different levels of sequence similarity: superfamily, 30%, 40%, 50%, 60%, …, 100%
Phylogenetic occurrence profile matrix:
           Sp1  Sp2  Sp3  Sp4  …  Sp n
Cluster 1    3    3    5    7  …    5
Cluster 2    0    2    4    5  …    4
Cluster 3    1    0    1    0  …    1
Cluster 4    0    2    0    0  …    6
Cluster 5    1    0    2    1  …    0
Cluster 6    0    3    1    2  …    1
Cluster 7    0    0    0    1  …    2
…
Cluster n    0    1    0    1  …    0
Juan Ranea and Corin Yeats

31 Comparison of Pairs of Phylogenetic Profiles
           Sp1  Sp2  Sp3  Sp4  Sp5  …  Spn
Cluster 1    6    9    6    9    5  …    9
Cluster 2    4    3    7    5    3  …    5
Cluster 3    1    0    1    0    2  …    1
Cluster 4    0    2    0    0    1  …    6
Cluster 5    1    4    1    4    1  …    4
Cluster 6    0    3    1    2    0  …    1
Cluster 7    4    8    4    8    4  …    8
…
Cluster n    0    1    0    1    1  …    0
(Three plots of profile values across Sp1 to Spn, comparing Cluster 1 with Cluster 2, Cluster 1 with Cluster 5, and Cluster 1 with Cluster 7)
Euclidean distance: E1 >> E2
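A pairwise profile comparison of this kind can be sketched as follows. The rows use only the first five species columns visible on the slide; the elided columns are not reproduced. The slide does not say whether profiles are normalised before the Euclidean distance is taken, so the sketch computes both the raw Euclidean distance and the Pearson correlation (the measure named on the following slide). The function names are illustrative.

```python
import math
from typing import Sequence

# First five species columns of four rows from the slide's matrix;
# the columns hidden behind "…" are not reproduced here.
profiles = {
    "Cluster 1": [6, 9, 6, 9, 5],
    "Cluster 2": [4, 3, 7, 5, 3],
    "Cluster 5": [1, 4, 1, 4, 1],
    "Cluster 7": [4, 8, 4, 8, 4],
}


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Raw Euclidean distance between two occurrence profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    ref = profiles["Cluster 1"]
    for name in ("Cluster 2", "Cluster 5", "Cluster 7"):
        other = profiles[name]
        print(name, round(euclidean(ref, other), 2), round(pearson(ref, other), 2))
```

On these five columns, Clusters 5 and 7 track the rise and fall of Cluster 1 closely under Pearson correlation, while Cluster 2 does not. The raw Euclidean distance also reflects overall family size, so Cluster 5 sits further from Cluster 1 than Cluster 2 does despite its matching shape; some form of scaling would be needed for the distance to capture co-variation alone.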

32 Statistical Significance of Correlated Pairs (comparison against 3 randomised models) (histogram axes: frequency, Pearson correlation coefficient; series: real matrix, random matrices I, II and III)
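The idea of judging a correlation against a randomised matrix can be sketched with a permutation test. The slide does not describe its three randomised models, so the null model used here (independently shuffling each row's values across species) is an assumption, as are the function names and the number of randomisations.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(matrix: List[List[float]]) -> List[float]:
    """Pearson correlation for every pair of rows (profiles)."""
    return [pearson(a, b) for a, b in combinations(matrix, 2)]


def shuffled_rows(matrix: List[List[float]], rng: random.Random) -> List[List[float]]:
    """Assumed null model: permute each row's values across species."""
    return [rng.sample(row, len(row)) for row in matrix]


def empirical_p_value(observed: float, matrix: List[List[float]],
                      n_random: int = 200, seed: int = 0) -> float:
    """Fraction of correlations from randomised matrices that are >= observed."""
    rng = random.Random(seed)
    null: List[float] = []
    for _ in range(n_random):
        null.extend(pairwise_correlations(shuffled_rows(matrix, rng)))
    return sum(r >= observed for r in null) / len(null)


if __name__ == "__main__":
    # Hypothetical occurrence matrix: rows are clusters, columns are species.
    matrix = [[6, 9, 6, 9, 5], [4, 3, 7, 5, 3], [1, 4, 1, 4, 1], [4, 8, 4, 8, 4]]
    observed = pearson(matrix[0], matrix[3])
    print(round(observed, 3), empirical_p_value(observed, matrix))
```

With a real matrix of hundreds of genomes, the same comparison would be made between the full distribution of real correlations and each randomised distribution.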

33 Domain Associations Network from 13 Eukaryotes (network clusters labelled: Actin & VCP-like ATPases; DNA replication and repair; Chaperones and Cytoskeleton; DNA Topoisomerase & Elongation factor G)

34 DNA topoisomerase & Elongation Factor G (plot axes: number of domain relatives, species)

35 Highly correlated profiles correspond to pairs of families with significant similarity in GO functions (biological processes)
(Plot axes: distances of correlated profile scores, frequency of significant GO semantic similarity scores)

36 Summary
- On average 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam
- Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain-based homologies
- Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved
- Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families

37 Acknowledgements
CATH: Lesley Greene, Alison Cuff, Ian Sillitoe, Tony Lewis, Mark Dibley, Oliver Redfern, Tim Dallman
Gene3D: Corin Yeats, Sarah Addou, Russell Marsden, David Lee, Alastair Grant, Ilhem Diboun, Juan Garcia Ranea
Medical Research Council, Wellcome Trust, NIH, EU funded Biosapiens, EU funded Embrace, BBSRC
http://www.biochem.ucl.ac.uk/bsm/cath_new

