
Exploiting Gene Clusters to Curate Annotations October, 2003 Ross Overbeek, Fellowship for Interpretation of Genomes (FIG)


1 Exploiting Gene Clusters to Curate Annotations October, 2003 Ross Overbeek, Fellowship for Interpretation of Genomes (FIG)

2 Outline of the Talk
The Emerging Opportunity
The Use of Clusters to Find “Missing Genes”
Experiences with a Single Pathway
“The Project”
Tools Needed to Support the Project

3 Three “Laws”
The amount of available DNA sequence data will double every 18 months.
The number of available genomes will double every 18 months.
The cost of sequencing will drop by a factor of 2 every 18 months.

4 Basic Facts
We have about 230-250 publicly available, more-or-less complete genomes.
We will have about 1000 complete genomes within 3 years.
This will lead to better annotations, not worse.
The majority of annotations will need to be automated, and the process must accurately follow the steps that a human expert would take.
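The "about 1000 genomes within 3 years" figure is consistent with the 18-month doubling "law" applied to today's 230-250 genomes. A minimal sketch of that arithmetic, assuming clean exponential growth (the function name and parameters are illustrative, not from the talk):

```python
def projected_genomes(current: float, years: float, doubling_months: float = 18.0) -> float:
    """Project a count forward, assuming it doubles every `doubling_months` months."""
    doublings = (years * 12.0) / doubling_months
    return current * 2.0 ** doublings


# Three years is two doubling periods, i.e. a factor of 4.
low = projected_genomes(230, 3)
high = projected_genomes(250, 3)
print(f"{low:.0f}-{high:.0f} genomes after 3 years")
```

Two doublings turn 230-250 genomes into roughly 920-1000.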

5 The Use of Clusters to Find Missing Genes

6 Central Machinery of Life: Horizons of gene discovery
3,000-4,000 functional roles (300-3,000 per organism)
Largely conserved across the three kingdoms (sequences; functions; pathways)
“Missing genes” are still there

7 Missing genes in metabolic pathways: making a case
Globally missing gene (never identified in any species)
[Figure: pathway diagram with metabolites A to F, enzymes E1, E2, E3, and genes gene A, gene B (marked “?”), gene C; the missing gene is labeled.]

8 Missing genes in metabolic pathways: making a case
Locally missing gene (non-orthologous gene displacement)
[Figure: same pathway diagram as the previous slide.]
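The distinction between globally and locally missing genes can be sketched as a simple set comparison. This is only an illustration: it assumes every listed genome actually runs the pathway, and the role names, genome names, and function are hypothetical.

```python
from typing import Dict, Set


def classify_missing_roles(
    pathway_roles: Set[str],
    annotations: Dict[str, Set[str]],
) -> Dict[str, Dict[str, str]]:
    """For each genome, label every unassigned pathway role.

    A role is "locally missing" if some other genome has a gene for it,
    and "globally missing" if no genome in the collection does.
    """
    # Roles that have an assigned gene in at least one genome.
    assigned_anywhere: Set[str] = set().union(*annotations.values())
    report: Dict[str, Dict[str, str]] = {}
    for genome, roles in annotations.items():
        gaps: Dict[str, str] = {}
        for role in sorted(pathway_roles - roles):
            if role in assigned_anywhere:
                gaps[role] = "locally missing"
            else:
                gaps[role] = "globally missing"
        report[genome] = gaps
    return report


pathway = {"E1", "E2", "E3"}

# E2 has a known gene in genome_2 but none in genome_1.
local_case = classify_missing_roles(
    pathway, {"genome_1": {"E1", "E3"}, "genome_2": {"E1", "E2", "E3"}}
)
# E2 has no known gene anywhere.
global_case = classify_missing_roles(
    pathway, {"genome_1": {"E1", "E3"}, "genome_3": {"E1", "E3"}}
)
print(local_case)
print(global_case)
```

A locally missing role points to a candidate for non-orthologous gene displacement; a globally missing one has no sequence to search with, so context evidence is the only lead.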

9 Techniques of genome context analysis (I): checking neighbors
GENE CLUSTERING ON THE CHROMOSOME (OPERONS)
[Figure: gene neighborhoods in three genomes. GENOME 1: genes A1, C1, R1, T1, G1, X1. GENOME 2: genes A2, M2, X2, N2, C2, Y2. GENOME 3: genes A3, S3, U3, X3, Y3, Q3. The gene order within each genome is not recoverable from the transcript.]
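Checking neighbors amounts to asking which gene families keep turning up near an anchor gene across genomes. A minimal sketch, assuming each genome is given as an ordered list of gene-family labels; the gene orders below are hypothetical, and strand and chromosome circularity are ignored:

```python
from collections import Counter
from typing import Dict, List


def neighbor_support(genomes: Dict[str, List[str]], anchor: str, window: int = 3) -> Counter:
    """Count, per gene family, the genomes in which it lies within
    `window` positions of the anchor family on the chromosome."""
    support: Counter = Counter()
    for order in genomes.values():
        nearby = set()
        for i, family in enumerate(order):
            if family != anchor:
                continue
            lo = max(0, i - window)
            # Genes up to `window` positions before and after the anchor.
            nearby.update(order[lo:i] + order[i + 1:i + 1 + window])
        nearby.discard(anchor)
        support.update(nearby)  # each genome contributes at most 1 per family
    return support


# Hypothetical gene orders, using family letters like those on the slide.
genomes = {
    "GENOME 1": ["A", "C", "R", "T", "G", "X"],
    "GENOME 2": ["N", "C", "A", "M", "X", "Y"],
    "GENOME 3": ["A", "S", "U", "X", "Y", "Q"],
}
support = neighbor_support(genomes, anchor="A")
print(support.most_common())
```

Families with support from several genomes (here C and X, each near A in two of three genomes) are the candidates for functional coupling; families seen once are likely chance neighbors.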

10 Techniques of genome context analysis (II): checking connections
PROTEIN FUSION EVENTS
[Figure: GENOME 1: gene A1, gene C1. GENOME 3: gene C3 / Z3, gene A3. GENOME 4: gene A4 / X4, gene C4. GENOME 5: gene C5 / A5.]
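A fusion event links two families when they are separate genes in one genome and a single fused gene in another. The sketch below assumes that "gene C 5 / A 5" on the slide denotes a fused C/A protein; the data encoding and function name are illustrative only.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Gene = Tuple[str, ...]  # component families of one gene; length > 1 means fused


def fusion_links(genomes: Dict[str, List[Gene]]) -> Dict[FrozenSet[str], List[str]]:
    """Return family pairs that are fused in some genome and also occur
    as two separate genes in at least one genome."""
    fused_in: Dict[FrozenSet[str], List[str]] = {}
    for genome, genes in genomes.items():
        for gene in genes:
            for a, b in combinations(sorted(set(gene)), 2):
                fused_in.setdefault(frozenset((a, b)), []).append(genome)

    def separate_somewhere(pair: FrozenSet[str]) -> bool:
        for genes in genomes.values():
            singles = {g[0] for g in genes if len(g) == 1}
            if pair <= singles:
                return True
        return False

    return {pair: where for pair, where in fused_in.items() if separate_somewhere(pair)}


# Encoding of the slide's labels (the reading of "/" as fusion is an assumption).
genomes = {
    "GENOME 1": [("A",), ("C",)],
    "GENOME 3": [("C", "Z"), ("A",)],
    "GENOME 4": [("A", "X"), ("C",)],
    "GENOME 5": [("C", "A")],
}
links = fusion_links(genomes)
print(links)
```

Only the A-C pair survives: it is fused in GENOME 5 and present as two separate genes in GENOME 1, while Z and X never appear as standalone genes in this toy data.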

11 Techniques of genome context analysis (III): co-regulation
SHARED REGULATORY SITES (REGULONS)
[Figure: GENOME 1: genes A1, C1, R1, T1, X1. GENOME 5: genes C5 / A5, R5, X5. GENOME 2: genes A2, W2, C2.]

12 Techniques of genome context analysis (IV): co-evolution
OCCURRENCE PROFILES
[Figure: presence of genes across ten genomes; "-" marks an absent gene. Each out-group row also carries a "---" marker in the source.]
IN-GROUP
GENOME 1: A C I X H G W Y Z
GENOME 2: A C I X H G W Y -
GENOME 3: A C I X H G W Y Z
GENOME 4: A C I X H - W - -
GENOME 5: A C I X H G W - -
OUT-GROUP
GENOME 6: I H - W Y Z
GENOME 7: I H G W - Z
GENOME 8: I H - W Y -
GENOME 9: I H - W Y Z
GENOME 10: I - - W Y Z
Score: 10 5 68443
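The slide does not say how its score is computed. One plausible scoring, assumed here purely for illustration, rewards a gene for being present in in-group genomes and absent from out-group genomes. The genome names and data are hypothetical.

```python
from typing import Dict, List, Set


def profile_scores(
    presence: Dict[str, Set[str]],
    in_group: List[str],
    out_group: List[str],
) -> Dict[str, int]:
    """Score each gene by how well its occurrence profile matches the
    in-group/out-group split: +1 per in-group genome that has the gene,
    +1 per out-group genome that lacks it."""
    scores: Dict[str, int] = {}
    for gene, genomes_with_gene in presence.items():
        present_in = sum(1 for g in in_group if g in genomes_with_gene)
        absent_out = sum(1 for g in out_group if g not in genomes_with_gene)
        scores[gene] = present_in + absent_out
    return scores


in_group = ["g1", "g2", "g3"]
out_group = ["g4", "g5"]
presence = {
    "A": {"g1", "g2", "g3"},              # exactly the in-group
    "I": {"g1", "g2", "g3", "g4", "g5"},  # present everywhere
    "Z": {"g1", "g4"},                    # scattered
}
scores = profile_scores(presence, in_group, out_group)
print(scores)
```

A gene whose profile tracks the in-group perfectly gets the maximum score (the number of genomes), which makes it the strongest candidate for a function specific to that group; ubiquitous genes score no better than the in-group size.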

13 Missing gene case primary suspects

14 Example I: Chorismate Pathway
Missing gene in all archaea
[Figure: pathway from D-Erythrose 4-P + Phosphoenolpyruvate to Chorismate. Intermediates: 7P-2-Dehydro-3-deoxy-D-arabino-heptulosonate; 3-Dehydro-Quinate; 3-Dehydro-Shikimate; Shikimate; Shikimate-5-P; O5-(1-Carboxyvinyl)-3-P-Shikimate; Chorismate. Step labels: 1 aroH, aroF, aroG; 2 aroB; 3 aroD; 4 ydiB, aroD; 5 aroK, aroL; 6 aroA; 7 aroC. Highlighted label: Shikimate Kinase (EC 1.1.1.25). Downstream of Chorismate: Chorismate catabolism; Isochorismate anabolism; Trp, Phe, Tyr syntheses.]

15 Chromosomal Clustering: Prediction
[Figure labels: “??”; Fusion Protein]

16 Functional coupling in chorismate pathway
Clustering, Fusion, Occurrence

17 Example II: “Missing Drug Target” in S. pneumoniae
[Figure: fatty-acid biosynthesis genes acpP, fabD, accA, accD, accB, accC, fabH, fabF, fabG, fabZ, fabI]
The gene fabI, encoding enoyl-ACP reductase (EC 1.3.1.9), is missing in a number of streptococci.

18 Clustering of FAB Genes: Prediction
[Figure: chromosomal neighborhoods of the fab/acc genes (fabH, acpP, fabD, fabG, fabF, fabZ, accA, accB, accC, accD) in Escherichia coli, Clostridium acetobutylicum, Streptococcus pyogenes, Genome X and Genome Y. Other labels in the figure: fabI, TR?, hyp, “?”, FRNS, PLSX, L32Pg30k, MAF, and EC numbers 6.3.4.15, 3.5.1.?, 2.1.1.79, 5.99.1.2, 2.7.4.9, 2.7.7.7, 4….]
A conserved hypothetical FMN-binding protein “?” is the best candidate for the missing gene fabI in Gram-positive cocci.

19 Independent Experimental Verification
Nature 406, 145-146 (13 July 2000). © Macmillan Publishers Ltd.
Microbiology: A triclosan-resistant bacterial enzyme
Richard J. Heath and Charles O. Rock
Triclosan is an antimicrobial agent that is widely used in a variety of consumer products and acts by inhibiting one of the highly conserved enzymes (enoyl-ACP reductase, or FabI) of bacterial fatty-acid biosynthesis. But several key pathogenic bacteria do not possess FabI, and here we describe a unique triclosan-resistant flavoprotein, FabK, that can also catalyse this reaction in Streptococcus pneumoniae. Our finding has implications for the development of FabI-specific inhibitors as antibacterial agents.

20 Missing genes: examples in cofactor pathways (prediction and experimental verification)

21 The Leucine Degradation Cluster: Origin of a New Perspective on Uses of Clusters

22 Context-based enrichment of initial functional assignments: example from Brucella melitensis genome analysis
Pathway: Leu → (deamination, oxidation) → Isovaleryl-CoA → Methylcrotonoyl-CoA → Methylglutaconyl-CoA → HMG-CoA → Acetyl-CoA + Acetoacetate
Enzymes on the diagram: Isovaleryl-CoA dehydrogenase (EC 1.3.99.10); Methylcrotonoyl-CoA carboxylase (EC 6.4.1.4; carboxylase subunit and biotin-containing subunit); Methylglutaconyl-CoA hydratase (EC 4.2.1.18); 4.1.3.4; 6.2.1.16

E.C. No.   Functional role                                            Gene ID   No. in cluster
1.3.99.10  ISOVALERYL-COA DEHYDROGENASE                               BR0020    1
6.4.1.4    METHYLCROTONYL-COA CARBOXYLASE, biotin-containing subunit  BR0018    3
6.4.1.4    METHYLCROTONYL-COA CARBOXYLASE, carboxylase subunit        BR0019    4
4.2.1.18   METHYLGLUTACONYL-COA HYDRATASE                             BR0016    2
---------------------------------------------------------------------------------------------
4.1.3.4    HYDROXYMETHYLGLUTARYL-COA LYASE                            BR0017*   6
6.2.1.16   ACETOACETATE-COA LIGASE                                    BR0021    5

Genes shown in the cluster diagram: BR0017*, BR0021, BR0016, BR0018, BR0019, BR0020
Legend: TIGR assignments marked as specific or non-specific; * frameshift
TIGR family-level labels: Biotin carboxylase; Carboxyl transferase family subunit; Enoyl-CoA hydratase/isomerase family

23 Leucine degradation in Bacilli
No gene assigned in any organism in KEGG, NCBI, TIGR
Gene assigned in B. melitensis 2003 (IG)
Gene assignment propagated over 26 organisms using gene clustering

24 158 new assignments
[Table columns: Organism; Gene anchor; Clustered genes]

25 Gene cluster in B. subtilis

26 Leucine degradation in Bacilli

E.C. No.   Functional role                                            No. in cluster
1.3.99.10  ISOVALERYL-COA DEHYDROGENASE                               2
6.4.1.4    METHYLCROTONYL-COA CARBOXYLASE, BIOTIN CONTAINING SUBUNIT  3
6.4.1.4    METHYLCROTONYL-COA CARBOXYLASE, CARBOXYLASE SUBUNIT        1
           BIOTIN CARBOXYL CARRIER                                    7
4.2.1.18   METHYLGLUTACONYL-COA HYDRATASE                             4
-------------------------------------------------------------------------------------
4.1.3.4    HYDROXYMETHYLGLUTARYL-COA LYASE                            5
6.2.1.16   ACETOACETATE-COA LIGASE                                    6
6.2.1.16*  ACETOACETATE-COA LIGASE*                                   14
?

29 Listeria, Clostridia, Ralstonia, Shew., Xylella
1 Cell division protein mraZ
3 S-adenosyl-methyltransferase mraW (EC 2.1.1.-)
4 Cell division protein ftsI
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
5 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)

30 Brevibacter, Enterococcus, Brucella, Geobacter
1 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
6 Cell division protein ftsW
5 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
2 UDP-N-acetylmuramate--alanine ligase (EC 6.3.2.8)
9 Cell division protein ftsZ
11 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
2 D-alanine--D-alanine ligase (EC 6.3.2.4)

31 Bacteroides thetaiotaomicron, Bacillus cereus, Geobacter metallireducens, Buchnera
5 Cell division protein ftsW
1 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
8 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
9 Cell division protein ftsQ
2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
3 Cell division protein ftsA
6 Cell division protein ftsZ

32 Oceanobacillus iheyensis, Enterococcus faecium DO, Escherichia coli K12, Wigglesworthia brevipalpis
2 Cell division protein ftsA
1 Cell division protein ftsZ
8 Hypothetical protein
10 Hypothetical protein
12 RNA binding protein
7 UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase (EC 3.5.1.-)
13 Protein translocase subunit secA

33 The Project: Annotate 1000 Genomes in Three Years
By making the task concrete, we force engineering decisions.
It will be easier to annotate 1000 genomes well than to annotate 50 well (comparative analysis is the key).
Analysis by subsystem (rather than by genome) is clearly the key.
The use of clusters is the key to precise annotation of subsystems.

34 Annotation by Subsystem
Requires knowledge of known variants
Evolution of clusters plays a major role
There are three components of the task:
– Building tools to support analysis
– Actually doing the analysis on 30-50 subsystems
– Coordinating with groups doing a limited set of wet lab confirmations

35 FIG: Building the Initial Annotation Tools
Releasing the browser/curation tool with approximately 220-230 genomes within a few months
Peer-to-peer updates/synchronization
Open source and free (initially for Macs and Linux systems)

