Published by Mitchell Butler. Modified over 9 years ago.
1
Classification of protein and domain families. Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. [Diagram labels: sequence to function; domain structures; domain structure predictions; structure to function]
2
CATH: Fold Group (1100); Homologous Superfamily (2100); Sequence Family; 40,000 domain entries. ~100,000 domains of known structure in CATH. Gene3D: ~2 million sequences from genomes assigned to CATH superfamilies in Gene3D and functionally annotated.
3
Gene3D: domain structure annotations in genome sequences. ~5 million protein sequences from 560 completed genomes and UniProt are scanned against a library of HMM models and sequences for CATH, Pfam and NewFam superfamilies. ~2 million domain sequences are assigned to CATH superfamilies.
4
Gene3D: (1) Cluster ~5 million sequences into protein superfamilies: >200,000 protein superfamilies. (2) Map domains onto the sequences using HMM technology (CATH & Pfam domains): ~10,000 domain superfamilies (2100 of known structure).
5
Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP. [Figure: series are HMM prediction and threading prediction]
6
Annotation levels for an average genome. [Figure: scale from 0 to 100%; labels: predicted to belong to structural superfamilies using HMM or threading techniques; many predicted to be transmembrane; many belonging to small species-specific families]
7
Target selection strategy for PSI-2. [Figure: x-axis, families ordered by size; y-axis, percentage of domain sequences; series: known structure (CATH - MEGA), unknown structure (BIG - Pfam)] Adam Godzik (JCSG), Andras Fiser (NYSGC), Burkhard Rost (NESG)
8
Correlation of sequence and structural variability of CATH families with the number of different functional groups. [Figure: axes are population in genomes (x 1000) and structural diversity]
9
Structural diversity in the CATH Domain Superfamily P-loop hydrolases Cutinase Cocaine esterase Acetylcholinesterase
10
Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. [Diagram labels: sequence to function; domain structures; domain structure predictions]
11
Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC levels). [Figure: axes are sequence identity threshold for 90% conservation, number of families, and number of sequences; annotation: highly variable families]
12
N-fold increase in functional annotation for sequences in Gene3D. [Figure: axis, N-fold increase in coverage; series: general thresholds, family-specific thresholds]
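The contrast between general and family-specific thresholds can be illustrated with a minimal Python sketch. It is not the Gene3D protocol: the function name `can_inherit`, the family names, the hits and the family-specific cutoffs are all hypothetical. The 60% general cutoff is borrowed from the "inherit functions at 60% seq. id." strategy mentioned later in the talk.

```python
# Hypothetical general cutoff (% sequence identity) for inheriting a function.
GENERAL_THRESHOLD = 60.0

# Hypothetical family-specific cutoffs (% identity); not values from the talk.
FAMILY_THRESHOLDS = {"fam_conserved": 30.0, "fam_variable": 75.0}


def can_inherit(identity, family, family_thresholds=None, general=GENERAL_THRESHOLD):
    """Return True if a query may inherit the annotation of its matched relative.

    With no family table, the general cutoff applies to every family.
    With a family table, families it lists use their own cutoff and all
    other families fall back to the general cutoff.
    """
    threshold = general
    if family_thresholds is not None:
        threshold = family_thresholds.get(family, general)
    return identity >= threshold


# (query id, family of the annotated relative, % identity to that relative)
hits = [
    ("q1", "fam_conserved", 35.0),
    ("q2", "fam_conserved", 62.0),
    ("q3", "fam_variable", 65.0),
    ("q4", "fam_other", 61.0),
]

general_cov = [q for q, fam, ident in hits if can_inherit(ident, fam)]
specific_cov = [q for q, fam, ident in hits
                if can_inherit(ident, fam, FAMILY_THRESHOLDS)]

print("general thresholds:", general_cov)
print("family-specific thresholds:", specific_cov)
```

In this toy data the family-specific table admits q1 (a functionally conserved family tolerates lower identity) and rejects q3 (a highly variable family needs a stricter cutoff), which is the trade-off the two slides describe.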
13
Gene3D: functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB. [Page callouts: link to UniProt; links to GO; links to different levels in the Gene3D protein family; link to InterPro; links to CATH/Pfam; links to KEGG; “S” indicates you can search the term against Gene3D; get an XML version of this page]
14
Functional annotation of structures using EC, GO, KEGG, FunCat resources. [Figure: series are non-PSI PDBs and PSI PDBs; categories: 0 terms, 1 term, 2 terms, 3 terms, 4 terms]
15
Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins. Methods: Tree Determinants (Valencia); Evolutionary Trace (Lichtarge); Funshift (Sonnhammer); SCI-PHY (Sjolander)
16
Methods exploiting information on sequence conserved residue positions: Scorecons (Thornton); Protein Keys (Sander). Score conservation for each position in the multiple sequence alignment of relatives from a functional group using an entropy measure (1 = highly conserved, 0 = unconserved). [Figure labels: putative functional site; structural model]
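An entropy-based conservation score of this kind can be sketched in a few lines of Python. This is a generic normalised Shannon entropy, not the Scorecons or Protein Keys scoring scheme; the function name `conservation_scores`, the choice to ignore gaps, and the toy alignment are assumptions made for illustration.

```python
import math
from collections import Counter


def conservation_scores(alignment, alphabet_size=20):
    """Return one score per alignment column.

    1.0 means a single residue type occupies the column; 0.0 means all
    residue types of the alphabet are equally frequent. Gaps ("-") are
    ignored, and an all-gap column scores 0.0 (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(alphabet_size)
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        # Shannon entropy of the residue distribution in this column (bits).
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Toy alignment of four relatives from one hypothetical functional group.
alignment = ["GKSTL", "GKTTV", "GKSAI", "GKT-L"]
scores = conservation_scores(alignment)
for position, score in enumerate(scores, start=1):
    print(position, round(score, 3))
```

Columns with scores near 1 across a functional group are the candidates for a putative functional site; mapping them onto a structural model is a separate step not shown here.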
17
GeMMA: compares sequence profiles (HMMs) between sequence subfamilies (80% seq. id) within a superfamily of known structure (CATH). It clusters sequence relatives predicted to have similar structures/functions, even at low levels of sequence identity, into putative structure-function groups.
18
GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark. [Figure panels: purity (high is best); edit distance (low is best); VI distance (low is best); deviation from no. of singletons (low is best)]
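Two of these measures, purity and VI (variation of information) distance, have standard textbook definitions that can be sketched in Python. The benchmark's exact formulas are not given on the slide, so this is a sketch under the usual definitions; the function names and the toy labels are hypothetical, and edit distance and singleton deviation are not covered.

```python
import math
from collections import Counter


def purity(predicted, gold):
    """Fraction of items that carry the majority gold label of their cluster."""
    clusters = {}
    for cluster_id, label in zip(predicted, gold):
        clusters.setdefault(cluster_id, []).append(label)
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters.values())
    return majority / len(gold)


def variation_of_information(predicted, gold):
    """VI = H(P | G) + H(G | P) in bits; 0 means the partitions are identical."""
    n = len(gold)
    joint = Counter(zip(predicted, gold))
    p_counts = Counter(predicted)
    g_counts = Counter(gold)
    vi = 0.0
    for (p, g), n_pg in joint.items():
        # n_pg / p_counts[p] estimates P(gold label | cluster);
        # n_pg / g_counts[g] estimates P(cluster | gold label).
        vi -= (n_pg / n) * (math.log2(n_pg / p_counts[p])
                            + math.log2(n_pg / g_counts[g]))
    return vi


# Toy data: six sequences, two gold functional groups, three predicted clusters.
gold = ["A", "A", "A", "B", "B", "B"]
predicted = [0, 0, 1, 1, 2, 2]

print("purity:", round(purity(predicted, gold), 3))
print("VI distance:", round(variation_of_information(predicted, gold), 3))
```

Purity alone rewards over-splitting (every singleton cluster is pure), which is presumably why it is reported alongside distance measures that penalise fragmentation.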
19
Functional annotation coverage using different strategies. [Figure: axis, coverage of superfamily (%); strategies: experimental annotations; inherit functions at 60% seq. id.; inherit functions by GeMMA]
20
Gene3D Biominer Methods: protein interactions and gene networks. Phylotuner: correlation of domain occurrence profiles. GOSS: semantic similarity calculation between protein pairs. CODA: domain fusion analysis. HiPPI: homology inheritance of protein-protein physical interaction data. GECO: correlation of gene expression data.
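The idea behind correlating domain occurrence profiles can be sketched with a plain Pearson correlation in Python. This is not Phylotuner's implementation, whose details are not given in the talk; the `pearson` helper, the domain names and the per-genome counts are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical occurrence profiles: copies of each domain family in 5 genomes.
profiles = {
    "domain_A": [3, 0, 5, 1, 0],
    "domain_B": [6, 0, 10, 2, 0],
    "domain_C": [0, 4, 0, 1, 3],
}

r_ab = pearson(profiles["domain_A"], profiles["domain_B"])
r_ac = pearson(profiles["domain_A"], profiles["domain_C"])
print("A vs B:", round(r_ab, 3))
print("A vs C:", round(r_ac, 3))
```

Families whose occurrence rises and falls together across genomes (A and B here) are candidates for a functional link; anti-correlated families (A and C) are not.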
21
Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. [Diagram labels: domain structures; domain structure predictions; structure to function]
22
Methods for Assessing Structural Novelty. CATHEDRAL: structure comparison (Redfern et al., PLoS Comput. Biol. 2007)
23
Structural clusters in the Aminoacyl tRNA synthetases-like family. [Figure: clusters labelled Aminoacyl tRNA synthetases; DNA-binding, stress-related; Argininosuccinate lyases; Gln-hydrolyzing synthases; Nucleotidyl-transferases; scale: structure similarity score]
24
1bkzA00 2.60.120.200 1dypA00 Galectin binding superfamily
25
Identifying functional groups in domain superfamilies: Aminoacyl tRNA synthetases-like. 1dnpA00: Deoxyribodipyrimidine photo-lyases. 1ej2A00: Nucleotidylyl-transferases. 1n3lA01: AA tRNA synthetase, Class I. 1o97D01: Electron transfer flavoprotein.
26
Exploiting 3D Templates to Represent Functional Relatives: JESS (Thornton); GASP (Babbitt); SPASM (Kleywegt); PINTS (Russell); DRESPAT (Sarawagi); pvSOAR (Joachimiak)
27
SITESEER (Laskowski and Thornton): match 3-residue templates and assess relevance of hits by looking at residues within the local environment. [Figure key: green and purple, identical residues; orange and white, similar residues]
28
FLORA: 3D templates for functional groups. From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.
29
FLORA: 3D templates for functional groups. [Figure labels: localFLORA; globalFLORA; single site; multiple sites]
30
FLORA: Performance in recognising functionally related homologues. Benchmark of 36 diverse enzyme groups (from 12 families)
31
Performance of FLORA Benchmarked on 36 large enzyme families
32
FLORA: 3D Templates for Structure-Function Groups in Domain Families. 1dnpA01: Deoxyribodipyrimidine photo-lyases. 1ej2A00: Nucleotidylyl-transferases. 1q77A00: Unknown function (MCSG). 1n3lA01: AA tRNA synthetases. 1o97D01: Electron transfer flavoprotein.
33
ProFunc: http://www.ebi.ac.uk/thornton-srv/databases/ProFunc/
Fold and structural motifs: SSM fold search; surface clefts; residue conservation; DNA-binding HTH motifs; nest analysis.
Sequence scans: sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc); sequence search vs PDB; sequence search vs UniProt; superfamily HMM library; gene neighbours.
n-residue templates: enzyme active sites; ligand binding sites; DNA binding sites; reverse templates.
34
Function Prediction for Proteins of ‘Putative’ or Unknown Function
Class            Sequence Evidence   Structure Evidence   Sequence + Structure   Neither Successful
Putative (57)    53                  44                   41                     1
Unknown (132)    95*                 69*                  57*                    25
* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’.
Structural data provides relatively more information for proteins about which there is less knowledge.
These predictions need to be experimentally validated.