1
Exploiting Structural and Comparative Genomics to Reveal Protein Functions
– How many domain families can we find in the genomes and can we predict the functions of relatives?
– Exploiting protein structure to predict protein functions
– Using correlated phylogenetic profiles based on CATH domains to reveal functional associations
CATH: Domain families of known structure
Gene3D: Protein families and domain annotations for completed genomes
2
CATHEDRAL
Oliver Redfern and Andrew Harrison
CATH version 3.0: 1100 fold groups, 2100 homologous superfamilies, 86,000 domains
– Combines a rapid graph-theory secondary structure filter with dynamic programming for accurate residue alignment
– An SVM is used to combine scores and assess the significance of a match
3
Fold Recognition Performance
[Figure: % correct fold against rank; series labels: DDP, SSAP]
4
Gene3D: Domain annotations in genome sequences
– >2 million protein sequences from 300 completed genomes and UniProt are scanned against a library of HMM models (~2000 CATH, ~9000 Pfam)
– Domains are assigned to CATH and Pfam superfamilies
– Benchmarking against structural data shows that 76% of remote homologues can be identified using the HMMs
5
Gene3D: Domain annotations in genome sequences
DomainFinder: structural domains from CATH take precedence
[Figure: example multidomain architectures; labels: N, C, CATH-1, Pfam-1, Pfam-2, NewFam]
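The precedence rule on this slide can be sketched as follows. This is a minimal illustration and not DomainFinder itself: the `Domain` tuple, the `PRIORITY` ranking and the strict no-overlap policy are all assumptions made for the example.

```python
from typing import NamedTuple


class Domain(NamedTuple):
    source: str  # "CATH", "Pfam" or "NewFam"
    family: str  # family label, e.g. "CATH-1"
    start: int   # first residue, inclusive
    end: int     # last residue, inclusive


# Assumed ranking: lower number wins, so CATH structural domains come first.
PRIORITY = {"CATH": 0, "Pfam": 1, "NewFam": 2}


def overlaps(a: Domain, b: Domain) -> bool:
    """True if the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def resolve(domains):
    """Keep higher-priority domains; drop any later domain that overlaps one already kept."""
    kept = []
    for d in sorted(domains, key=lambda d: (PRIORITY[d.source], d.start)):
        if not any(overlaps(d, k) for k in kept):
            kept.append(d)
    return sorted(kept, key=lambda d: d.start)


# Hypothetical hits on one protein: the Pfam-1 hit overlaps the CATH-1 hit.
hits = [
    Domain("Pfam", "Pfam-1", 10, 120),
    Domain("CATH", "CATH-1", 5, 110),
    Domain("Pfam", "Pfam-2", 130, 220),
]
architecture = [d.family for d in resolve(hits)]
print(architecture)
```

In this toy case the overlapping Pfam hit is discarded in favour of the CATH domain, and the non-overlapping Pfam hit is retained in the multidomain architecture.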
6
Domain families ranked by size (number of domain sequences)
[Figure: percentage of all domain family sequences against rank by family size; series: CATH superfamilies of known structure, Pfam families of unknown structure, NewFam of unknown structure]
~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families
7
Structural superfamily (CATH)
– Only ~3% of diverse sequences in large CATH domain families have known structures
– <100 families account for 50% of domain sequences of known fold
[Figure: superfamily divided into subfamilies of relatives, labelled F1 to F5; relatives likely to have similar functions]
8
Gene3D: Domain mappings for 300 Completed Genomes
Iterative Profile Search Methodology
– 300 genomes, >2 million sequences including UniProt and RefSeq
– Structural domain assignments from CATH
– Functional domain assignments from Pfam
– Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct
http://www.biochem.ucl.ac.uk:8080/Gene3D
Russell Marsden, Corin Yeats, Michael Maibaum, David Lee
Yeats et al., Nucleic Acids Res. 2006
9
Conservation of enzyme function in homologous domains with the same multidomain architecture (MDA) in Gene3D
[Figure: conservation of EC number to 3 levels (%) against sequence identity; Protein 1 and Protein 2 share the MDA CATH-1, Pfam-1, NewFam, Pfam-2]
10
Sequence identity thresholds for 95% conservation of enzyme function (to 3 EC levels)
[Figure: number of sequences and number of families against sequence identity threshold]
332 highly conserved families
60 highly variable families
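One way to derive such a threshold for a single family is sketched below. The pair data are invented toy values, and the stopping rule (scan from high identity downwards and stop at the first cutoff that fails) is an assumption; the slide does not say how the thresholds were actually computed.

```python
def ec3(ec: str) -> str:
    """Truncate an EC number to its first three levels, e.g. '1.1.1.5' -> '1.1.1'."""
    return ".".join(ec.split(".")[:3])


def identity_threshold(pairs, target=0.95):
    """Lowest identity cutoff at which at least `target` of the pairs at or above
    the cutoff share an EC number to 3 levels.

    pairs: list of (percent_identity, ec_a, ec_b) tuples.
    Returns None if even the highest-identity pairs fail the target.
    """
    best = None
    for cutoff in sorted({identity for identity, _, _ in pairs}, reverse=True):
        selected = [p for p in pairs if p[0] >= cutoff]
        conserved = sum(ec3(a) == ec3(b) for _, a, b in selected)
        if conserved / len(selected) >= target:
            best = cutoff
        else:
            break  # assumed rule: stop at the first cutoff that fails
    return best


# Toy pairs for one hypothetical family (not data from the talk).
toy_pairs = [
    (90, "1.1.1.1", "1.1.1.2"),
    (70, "1.1.1.1", "1.1.1.1"),
    (50, "1.1.1.1", "1.1.1.5"),
    (35, "1.1.1.1", "1.1.2.1"),
    (20, "1.1.1.1", "2.3.1.1"),
]
threshold = identity_threshold(toy_pairs)
print(threshold)
```

A family whose threshold comes out low would count as highly conserved in function, and one that needs a high identity cutoff as highly variable.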
12
Conservation of Enzyme Function in CATH Domain Families
[Figure: pairwise sequence identity against structural similarity (SSAP) score; points marked as same functions or different functions]
13
Correlation of structural variability with number of different functional groups
[Figure: number of diverse structural clusters within family against number of COG functional groups]
14
Some families show great structural diversity
– Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments (2DSEC algorithm)
– In 117 superfamilies, relatives have expanded more than 2-fold
– These families represent more than half the genome sequences of known fold
Gabrielle Reeves
15
Structural embellishments can modify the active site Galectin binding superfamily
16
Structural embellishments can modulate domain interactions
[Figure: Glucose 6-phosphate dehydrogenase and dihydrodipicolinate reductase; views labelled side orientation and face orientation]
Additional secondary structures shown at (a) are involved in subunit interactions
17
Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
ATP-grasp superfamily
[Figure: biotin carboxylase, D-alanine-D-alanine ligase, dimer of biotin carboxylase]
18
Secondary structure insertions are distributed along the chain but aggregate in 3D
– 60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments
– In 80% of domains, 1 or more embellishments contact other domains or subunits
– 85% of residue insertions comprise only 1 or 2 secondary structures
[Figure: frequency (%) against size of indel (number of secondary structures, 1 to 12); bars with indel frequency < 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06%, 0.02%]
19
~80% of variable families adopt regular layered architectures
[Figure: 2 Layer Beta Sandwich, 2 Layer Alpha Beta Sandwich, Alpha / Beta Barrel, 3 Layer Alpha Beta Sandwich]
20
2 Layer Beta Sandwich 2 Layer Alpha Beta Sandwich Alpha / Beta Barrel 3 Layer Alpha Beta Sandwich
21
Function prediction to Guide Target Selection for Structural Genomics
Structural superfamily (CATH)
– Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures
[Figure: superfamily divided into groups of close relatives with same MDA, labelled F1 to F5; relatives likely to have similar functions]
22
Conservation of Enzyme Function in Homologous Domains
[Figure: conservation of EC levels (%) against structure similarity (SSAP) score]
23
FLORA – structural templates for assigning structures to functional subgroups in CATH
– Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily
– Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seeds)
– Explore the local structural environment of seed residues to find conserved structural motifs
Dataset of 84 enzyme superfamilies in CATH, of which 21 are functionally very diverse
24
Finding conserved residue positions (seeds) with Scorecons
– Build a multiple sequence alignment of relatives from the functional family, guided by the structure alignment
– Identify the most highly conserved residue positions (seed positions) using Scorecons (Valdar and Thornton, 2001)
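The idea of picking seed positions from alignment columns can be illustrated with a simple entropy-based score. This is a stand-in chosen for brevity, not the Scorecons measure; the function names, the 0.9 cutoff and the toy alignment are all assumptions.

```python
import math
from collections import Counter


def column_conservation(column: str) -> float:
    """Score one alignment column: 1.0 for an invariant column, lower for more
    diverse columns. Gaps ('-') are ignored. Uses Shannon entropy normalised by
    log(20), which is NOT the Scorecons scoring scheme."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * math.log(k / n) for k in Counter(residues).values())
    return 1.0 - entropy / math.log(20)


def seed_positions(alignment, cutoff=0.9):
    """Return 0-based column indices whose conservation score meets the cutoff."""
    return [
        i
        for i, col in enumerate(zip(*alignment))
        if column_conservation("".join(col)) >= cutoff
    ]


# Toy alignment of four hypothetical relatives from one functional family.
alignment = ["GDSHA", "GDSHV", "GESHL", "GDSHI"]
seeds = seed_positions(alignment)
print(seeds)
```

Here the three invariant columns are selected as seeds, while the columns with substitutions fall below the cutoff.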
25
FLORA Algorithm for Identifying Structural Homologues with Similar Functions
– Assign conserved sequence seeds
– Expand to the local environment of 12 Å
– Identify structurally conserved residue cliques and generate a template
– New structures are scanned against a library of FLORA templates, and SVMs are used to assess the significance of matches
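The expansion from seed residues to a 12 Å local environment can be sketched as a plain distance filter. This is only an illustration of that one step: representing each residue by a single coordinate (for example its C-alpha atom) and the `local_environment` helper are assumptions, not details given for FLORA.

```python
import math


def local_environment(coords, seeds, radius=12.0):
    """Return residue numbers lying within `radius` angstroms of any seed residue.

    coords: dict mapping residue number -> (x, y, z), one point per residue
            (assumed here to be the C-alpha position).
    seeds:  iterable of residue numbers chosen as conserved seeds.
    """
    environment = set()
    for seed in seeds:
        for residue, xyz in coords.items():
            if math.dist(coords[seed], xyz) <= radius:
                environment.add(residue)
    return sorted(environment)


# Toy coordinates along a line, in angstroms (not from a real structure).
toy_coords = {
    1: (0.0, 0.0, 0.0),
    2: (3.8, 0.0, 0.0),
    3: (11.9, 0.0, 0.0),
    4: (12.5, 0.0, 0.0),
    5: (30.0, 0.0, 0.0),
}
env = local_environment(toy_coords, seeds=[1])
print(env)
```

The residues returned would then be the candidates examined for structurally conserved cliques when building a template.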
26
Performance of FLORA vs Global Structure Comparison (SSAP)
[Figure: error rate against coverage]
28
Phylogenetic Profiles for Detecting Functional Associations
Eisenberg phylogenetic profiles: presence or absence of superfamily in organism (example values shown: 1 0 1 0 0 0 1 1); superfamilies with matching profiles are marked as functionally linked
Gene3D phylogenetic occurrence profiles: number of relatives from superfamily in organism
CATH domain superfamily   sp1  sp2  sp3  sp4
Superfamily 1              35    0   12   60
Superfamily 2              12   13   14   11
Superfamily 3               6    0    0    0
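The difference between the two profile types can be shown in a few lines. The counts below are the values on the slide, arranged one row per superfamily, which is an assumption about the original table layout; `to_binary` is a hypothetical helper.

```python
# Occurrence profiles: number of relatives per organism (sp1..sp4).
counts = {
    "Superfamily 1": [35, 0, 12, 60],
    "Superfamily 2": [12, 13, 14, 11],
    "Superfamily 3": [6, 0, 0, 0],
}


def to_binary(profile):
    """Collapse an occurrence profile to a presence/absence profile."""
    return [1 if n > 0 else 0 for n in profile]


binary = {name: to_binary(profile) for name, profile in counts.items()}

for name in counts:
    print(name, counts[name], binary[name])
```

The binary form discards how many relatives each organism has, which is exactly the information the occurrence profiles keep.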
29
Phylogenetic Occurrence Profiles Based on Domain Superfamily and Subfamilies in Gene3D
[Figure: superfamily subdivided into 30%, 40% and 50% sequence identity clusters]
30
Phylogenetic Profiles for Families and Subfamilies
Domains clustered at different levels of sequence similarity: Superfam., 30%, 40%, 50%, 60%, …, 100%
Phylogenetic occurrence profile matrix:
            Sp1  Sp2  Sp3  Sp4  …  Sp n
Cluster 1     3    3    5    7  …    5
Cluster 2     0    2    4    5  …    4
Cluster 3     1    0    1    0  …    1
Cluster 4     0    2    0    0  …    6
Cluster 5     1    0    2    1  …    0
Cluster 6     0    3    1    2  …    1
Cluster 7     0    0    0    1  …    2
…             …    …    …    …  …    …
Cluster n     0    1    0    1  …    0
Juan Ranea and Corin Yeats
31
Comparison of Pairs of Phylogenetic Profiles
            Sp1  Sp2  Sp3  Sp4  Sp5  …  Spn
Cluster 1     6    9    6    9    5  …    9
Cluster 2     4    3    7    5    3  …    5
Cluster 3     1    0    1    0    2  …    1
Cluster 4     0    2    0    0    1  …    6
Cluster 5     1    4    1    4    1  …    4
Cluster 6     0    3    1    2    0  …    1
Cluster 7     4    8    4    8    4  …    8
…             …    …    …    …    …  …    …
Cluster n     0    1    0    1    1  …    0
[Figure: profiles plotted across Sp1 to Spn for the pairs Cluster 1 and Cluster 2, Cluster 1 and Cluster 5, Cluster 1 and Cluster 7]
Euclidean distance: E1 >> E2
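The pairwise comparison can be reproduced on the values visible in the matrix. The sketch below uses only the six species columns shown on the slide (the elided columns are left out) and a plain, unnormalised Euclidean distance; whether the original analysis scaled the profiles first is not stated, so treat that as an assumption.

```python
import math

# Rows copied from the slide, restricted to the six columns that are shown.
profiles = {
    "Cluster 1": [6, 9, 6, 9, 5, 9],
    "Cluster 2": [4, 3, 7, 5, 3, 5],
    "Cluster 5": [1, 4, 1, 4, 1, 4],
    "Cluster 7": [4, 8, 4, 8, 4, 8],
}


def euclidean(p, q):
    """Unnormalised Euclidean distance between two equal-length profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same species")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


query = "Cluster 1"
distances = {
    name: euclidean(profiles[query], profile)
    for name, profile in profiles.items()
    if name != query
}
closest = min(distances, key=distances.get)

for name, d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{query} vs {name}: {d:.2f}")
```

On these columns Cluster 7 tracks Cluster 1 most closely, so a small distance flags the pair as candidates for a functional association.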
32
Statistical Significance of Correlated Pairs (comparison against 3 randomised models)
[Figure: frequency against Pearson correlation coefficient; series: real matrix, random matrix I, random matrix II, random matrix III]
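A comparison of a real correlation against a randomised background might look like the sketch below. The three randomised models used in the talk are not described, so the shuffle used here (permuting one profile across species) is an assumed null model for illustration only, and `shuffle_p_value` is a hypothetical helper.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def shuffle_p_value(x, y, trials=1000, seed=0):
    """Fraction of shuffled profiles correlating with x at least as well as y does.

    Assumed null model: permute y across species. The +1 terms keep the
    estimate strictly positive.
    """
    rng = random.Random(seed)
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if pearson(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Toy profiles over six species (invented values).
profile_a = [6, 9, 6, 9, 5, 9]
profile_b = [4, 8, 4, 8, 4, 8]
r = pearson(profile_a, profile_b)
p = shuffle_p_value(profile_a, profile_b)
print(f"r = {r:.3f}, empirical p = {p:.3f}")
```

A pair would be treated as significantly correlated only if its coefficient is rarely matched under the randomised background.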
33
Domain Associations Network from 13 Eukaryotes
[Figure: network clusters labelled Actin & VCP-like ATPases; DNA replication and repair; Chaperones and Cytoskeleton; DNA Topoisomerase & Elongation factor G]
34
DNA topoisomerase & Elongation Factor G
[Figure: number of domain relatives against species]
35
Highly correlated profiles correspond to pairs of families with significant similarity in GO functions (biological processes)
[Figure: frequency of significant GO semantic similarity scores against distances of correlated profile scores]
36
Summary
– On average 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam
– Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain-based homologies
– Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved
– Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families
37
Acknowledgements
CATH: Lesley Greene, Alison Cuff, Ian Sillitoe, Tony Lewis, Mark Dibley, Oliver Redfern, Tim Dallman
Gene3D: Corin Yeats, Sarah Addou, Russell Marsden, David Lee, Alastair Grant, Ilhem Diboun, Juan Garcia Ranea
Medical Research Council, Wellcome Trust, NIH, EU funded Biosapiens, EU funded Embrace, BBSRC
http://www.biochem.ucl.ac.uk/bsm/cath_new