Exploiting Gene Clusters to Curate Annotations October, 2003 Ross Overbeek, Fellowship for Interpretation of Genomes (FIG)
Outline of the Talk The Emerging Opportunity The Use of Clusters to Find “Missing Genes” Experiences with a Single Pathway “The Project” Tools Needed to Support the Project
Three “Laws” The amount of available DNA sequence data will double every 18 months The number of available genomes will double every 18 months The cost of sequence will drop by a factor of 2 every 18 months.
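As a quick arithmetic check on what an 18-month doubling time implies, the sketch below computes the growth factor over a chosen horizon. The function name `growth_factor` and its parameters are illustrative assumptions, not part of the talk.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months` for a quantity that doubles
    every `doubling_months` (hypothetical helper for illustration)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Three years is two doubling periods, so the factor is 2 ** 2.
    print(growth_factor(36))
```

Under this assumption, a three-year horizon corresponds to a fourfold increase in data, in genome count, and in sequencing affordability.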
Basic Facts We have about publicly available more-or-less complete genomes We will have about 1000 complete genomes within 3 years This will lead to better annotations, not worse The majority of annotations will need to be automated, and the process must accurately follow the steps that a human expert would take
The Use of Clusters to Find Missing Genes
3, ,000 functional roles (300 – 3,000 per organism) Largely conserved across the three kingdoms (sequences; functions; pathways) “Missing genes” are still there Central Machinery of Life: Horizons of gene discovery
Missing genes in metabolic pathways: making a case
[Figure: a metabolic pathway catalyzed by enzymes E1, E2 and E3, encoded by gene A, gene B (marked "?", the missing gene) and gene C.]
Two cases are shown:
Globally Missing Gene (never identified in any species)
Locally Missing Gene (non-orthologous gene displacement)
Techniques of genome context analysis (I): checking neighbors
Gene clustering on the chromosome (operons)
[Figure: gene order along the chromosome in Genomes 1, 2 and 3, showing genes A, C, X and Y together with their differing neighbors.]
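The neighbor-checking idea can be sketched as follows: represent each genome as an ordered list of gene-family labels and report pairs of families that sit close together in several genomes. This is a minimal sketch under stated assumptions; the function name, the `max_gap` and `min_genomes` thresholds, and the toy gene orders are all hypothetical and are not taken from the FIG tools.

```python
from collections import defaultdict


def conserved_neighbor_pairs(genomes, max_gap=2, min_genomes=2):
    """Find gene-family pairs that are chromosomal neighbors in many genomes.

    genomes: dict mapping genome name -> list of family labels in
             chromosomal order (hypothetical input format).
    max_gap: two genes count as neighbors if at most this many positions apart.
    Returns {frozenset({a, b}): set of genome names supporting the pair}.
    """
    support = defaultdict(set)
    for name, order in genomes.items():
        for i, a in enumerate(order):
            # Look only downstream; the pair is stored unordered.
            for b in order[i + 1 : i + 1 + max_gap]:
                if a != b:
                    support[frozenset((a, b))].add(name)
    return {p: g for p, g in support.items() if len(g) >= min_genomes}


if __name__ == "__main__":
    toy = {  # invented gene orders, for illustration only
        "genome1": ["A", "C", "R", "T", "X"],
        "genome2": ["M", "A", "C", "N"],
        "genome3": ["A", "S", "C", "U"],
    }
    for pair, names in conserved_neighbor_pairs(toy, min_genomes=3).items():
        print(sorted(pair), sorted(names))
```

A pair that stays adjacent while the surrounding genes change is the kind of evidence the slide describes; real analyses would also need orthology calls and strand information, which this sketch omits.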
Techniques of genome context analysis (II): checking connections
Protein fusion events
[Figure: genes A and C are separate in Genome 1; fused proteins appear as C/Z in Genome 3, A/X in Genome 4, and C/A in Genome 5.]
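Fusion evidence can be sketched in the same spirit: two families are linked when they occur as one fused protein in some genome and as separate proteins in another. The input format (each protein as a tuple of family labels) and the function name are assumptions made for illustration; the toy data mirrors the fusion figure.

```python
def fusion_links(genomes):
    """Link families that are fused somewhere and separate elsewhere.

    genomes: dict genome name -> list of proteins; each protein is a tuple
             of family labels, with more than one label meaning a fusion
             (hypothetical input format).
    Returns {frozenset({a, b}): {"fused_in": set, "separate_in": set}}.
    """
    fused = {}      # family pair -> genomes where the pair is fused
    separate = {}   # family -> genomes where it is a stand-alone protein
    for name, proteins in genomes.items():
        for fams in proteins:
            if len(fams) == 1:
                separate.setdefault(fams[0], set()).add(name)
                continue
            for i in range(len(fams)):
                for j in range(i + 1, len(fams)):
                    if fams[i] != fams[j]:
                        pair = frozenset((fams[i], fams[j]))
                        fused.setdefault(pair, set()).add(name)
    links = {}
    for pair, fused_in in fused.items():
        a, b = tuple(pair)
        both_separate = separate.get(a, set()) & separate.get(b, set())
        if both_separate:
            links[pair] = {"fused_in": fused_in, "separate_in": both_separate}
    return links


if __name__ == "__main__":
    toy = {
        "genome1": [("A",), ("C",)],
        "genome3": [("C", "Z"), ("A",)],
        "genome4": [("A", "X"), ("C",)],
        "genome5": [("C", "A")],
    }
    print(fusion_links(toy))
```

Only the A/C pair is reported for this data, because Z and X never appear as stand-alone proteins.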
Techniques of genome context analysis (III): co-regulation
Shared regulatory sites (regulons)
[Figure: genes in Genomes 1, 2 and 5 that share regulatory sites.]
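A very rough sketch of the shared-regulatory-site idea: scan the upstream region of each gene for a candidate site and group the genes that carry it. The site pattern, the sequences, and the function name below are invented for illustration; real regulon prediction would use position weight matrices and cross-genome conservation rather than a single regular expression.

```python
import re


def genes_with_site(upstream, site_pattern):
    """Return the sorted gene names whose upstream region matches the site.

    upstream: dict gene name -> upstream DNA sequence (hypothetical input).
    site_pattern: regular expression describing the candidate site.
    """
    site = re.compile(site_pattern)
    return sorted(g for g, seq in upstream.items() if site.search(seq.upper()))


if __name__ == "__main__":
    # Invented sequences and an invented palindromic site, illustration only.
    toy = {
        "geneA": "AATGTGACCGGTATCACATT",
        "geneC": "GGTGTGATTTAAATCACAGG",
        "geneW": "ACGTACGTACGTACGTACGT",
    }
    print(genes_with_site(toy, r"TGTGA[ACGT]{6}TCACA"))
```

Genes that share the site become candidates for the same regulon, which can then be compared with the clustering and fusion evidence.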
Techniques of genome context analysis (IV): co-evolution
Occurrence profiles
[Figure: presence or absence of genes A, C, I, X, H, G, W, Y and Z across Genomes 1 to 10, divided into an in-group and an out-group, with a score.]
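Occurrence-profile comparison can be sketched as ranking genes by how closely their presence/absence pattern follows that of a query gene. The scoring rule here (fraction of genomes in which the two genes agree) is an assumption chosen for simplicity; the slide does not state which score was used, and the profiles below are invented.

```python
def profile_agreement(profile_a, profile_b, genomes):
    """Fraction of genomes where both genes are present or both are absent."""
    matches = sum((g in profile_a) == (g in profile_b) for g in genomes)
    return matches / len(genomes)


def rank_by_profile(query, profiles, genomes):
    """Rank all other genes by agreement with the query gene's profile.

    profiles: dict gene -> set of genomes in which the gene occurs
              (hypothetical input format).
    Returns a list of (gene, score), best first, ties broken by name.
    """
    scores = {
        gene: profile_agreement(profiles[query], prof, genomes)
        for gene, prof in profiles.items()
        if gene != query
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    genomes = list(range(1, 11))
    toy = {  # invented profiles, illustration only
        "A": {1, 2, 3, 4, 5},
        "X": {1, 2, 3, 4, 5},
        "I": set(range(1, 11)),
        "Z": {1, 3, 6, 7, 9, 10},
    }
    print(rank_by_profile("A", toy, genomes))
```

A gene present in every genome, such as I in the toy data, carries little signal, which is why the comparison between an in-group and an out-group matters.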
Missing gene case: primary suspects
Example I: Chorismate Pathway
Missing gene in all archaea
[Figure: pathway from D-erythrose 4-P + phosphoenolpyruvate through 7P-2-dehydro-3-deoxy-D-arabino-heptulosonate, 3-dehydroquinate, 3-dehydroshikimate, shikimate, shikimate P and O5-(1-carboxyvinyl)-3-P-shikimate to chorismate. Genes by step: aroH, aroF, aroG (1); aroB (2); aroD (3); ydiB (4); aroK, aroL (5, shikimate kinase, EC ); aroA (6); aroC (7). Chorismate feeds chorismate catabolism, isochorismate anabolism, and Trp, Phe and Tyr syntheses.]
?? Fusion Protein Chromosomal Clustering: Prediction
Functional coupling in chorismate pathway: Clustering, Fusion, Occurrence
Example II: “Missing Drug Target” in S. pneumoniae
[Figure: fatty-acid biosynthesis genes acpP, fabD, accA, accD, accB, accC, fabH, fabF, fabG, fabZ and fabI.]
The gene fabI, encoding enoyl-ACP reductase (EC ), is missing in a number of streptococci
Clustering of FAB Genes: Prediction
[Figure: fab and acc gene clusters (fabH, acpP, fabD, fabG, fabF, accB, accC, accD, accA, fabZ) compared across Escherichia coli, Streptococcus pyogenes, Clostridium acetobutylicum and two further genomes (Genome X, Genome Y), with neighboring genes labeled "hyp", "TR?" and "?", and fabI present in only some clusters.]
A conserved hypothetical FMN-binding protein “?” is the best candidate for the missing gene fabI in Gram-positive cocci
Independent Experimental Verification
Nature 406, (2000), 13 July 2000. © Macmillan Publishers Ltd.
Microbiology: A triclosan-resistant bacterial enzyme
Richard J. Heath and Charles O. Rock
Triclosan is an antimicrobial agent that is widely used in a variety of consumer products and acts by inhibiting one of the highly conserved enzymes (enoyl-ACP reductase, or FabI) of bacterial fatty-acid biosynthesis. But several key pathogenic bacteria do not possess FabI, and here we describe a unique triclosan-resistant flavoprotein, FabK, that can also catalyse this reaction in Streptococcus pneumoniae. Our finding has implications for the development of FabI-specific inhibitors as antibacterial agents.
Missing genes: examples in cofactor pathways (prediction and experimental verification)
The Leucine Degradation Cluster: Origin of a New Perspective on Uses of Clusters
Context-based enrichment of initial functional assignments: example from Brucella melitensis genome analysis
[Figure: leucine degradation pathway. Leu (deamination, oxidation), isovaleryl-CoA, methylcrotonoyl-CoA, methylglutaconyl-CoA, HMG-CoA, then acetyl-CoA and acetoacetate. Enzymes shown: isovaleryl-CoA dehydrogenase (EC ), methylcrotonoyl-CoA carboxylase (EC ; carboxylase subunit and biotin-containing subunit), methylglutaconyl-CoA hydratase (EC ).]
Table columns: E.C. No., Functional role, Gene ID, No. in cluster
ISOVALERYL-COA DEHYDROGENASE BR
METHYLCROTONYL-COA CARBOXYLASE
- Biotin-containing subunit BR
- Carboxylase subunit BR
METHYLGLUTACONYL-COA HYDRATASE BR
HYDROXYMETHYLGLUTARYL-COA LYASE BR0017*
ACETOACETATE-COA LIGASE BR00215
Gene IDs in the cluster diagram: BR0017*, BR0021, BR0016, BR0018, BR0019, BR0020
Legend: TIGR assignments (specific / non-specific); * frameshift. Biotin carboxylase; Carboxyl transferase family subunit; Enoyl-CoA hydratase/isomerase family
Leucine degradation in Bacilli
No gene assigned in any organism in KEGG, NCBI, TIGR
Gene assigned in B. melitensis 2003 (IG)
Gene assignment propagated over 26 organisms using gene clustering
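Propagating an assignment by gene clustering can be sketched as follows: once a family has been given a role in one organism, propose the same role for unannotated members of that family that sit near enough genes of the same pathway in other organisms. This is a minimal sketch and not the FIG procedure; the data layout, thresholds, gene IDs, family names, and the placeholder role are all hypothetical.

```python
def propagate_by_cluster(genomes, candidate_family, new_role, anchor_roles,
                         window=5, min_anchors=2):
    """Propose `new_role` for unannotated genes clustered with pathway genes.

    genomes: dict organism -> list of gene dicts in chromosomal order, each
             with keys "id", "family", "role" (role is None if unassigned).
    anchor_roles: set of roles already assigned to the pathway's other genes.
    Returns a list of (organism, gene_id, new_role) proposals.
    """
    proposals = []
    for org, genes in genomes.items():
        for i, gene in enumerate(genes):
            if gene["family"] != candidate_family or gene["role"] is not None:
                continue
            # Roles of genes within `window` positions on either side.
            nearby = genes[max(0, i - window): i] + genes[i + 1: i + 1 + window]
            nearby_roles = {g["role"] for g in nearby}
            if len(nearby_roles & anchor_roles) >= min_anchors:
                proposals.append((org, gene["id"], new_role))
    return proposals


if __name__ == "__main__":
    anchors = {"Isovaleryl-CoA dehydrogenase",
               "Hydroxymethylglutaryl-CoA lyase"}
    toy = {  # invented genes, illustration only
        "org1": [
            {"id": "g1", "family": "F1", "role": "Isovaleryl-CoA dehydrogenase"},
            {"id": "g2", "family": "FQ", "role": None},
            {"id": "g3", "family": "F3", "role": "Hydroxymethylglutaryl-CoA lyase"},
        ],
        "org2": [
            {"id": "h1", "family": "FQ", "role": None},
            {"id": "h2", "family": "F9", "role": "Cell division protein ftsZ"},
        ],
    }
    print(propagate_by_cluster(toy, "FQ", "role under study", anchors))
```

Requiring at least two anchor genes nearby is one way to keep a lone family match from propagating; proposals of this kind would still need review by a curator.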
158 new assignments
Table columns: Organism, Gene anchor, Clustered genes
Gene cluster in B. subtilis
Leucine degradation in Bacilli
Table columns: E.C. No., Functional role, No. in cluster
ISOVALERYL-COA DEHYDROGENASE METHYLCROTONYL-COA CARBOXYLASE - BIOTIN CONTAINING SUBUNIT 3 - CARBOXYLASE SUBUNIT 1 BIOTIN CARBOXYL CARRIER METHYLGLUTACONYL-COA HYDRATASE HYDROXYMETHYLGLUTARYL-COA LYASE ACETOACETATE-COA LIGASE *ACETOACETATE-COA LIGASE* 14 ?
Organisms: Listeria, Clostridia, Ralstonia, Shew., Xylella
1 Cell division protein mraZ
3 S-adenosyl-methyltransferase mraW (EC )
4 Cell division protein ftsI
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC )
2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC )
5 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC )
Organisms: Brevibacter, Enterococcus, Brucella, Geobacter
1 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC )
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC )
6 Cell division protein ftsW
5 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC )
2 UDP-N-acetylmuramate--alanine ligase (EC )
9 Cell division protein ftsZ
11 UDP-N-acetylenolpyruvoylglucosamine reductase (EC )
2 D-alanine--D-alanine ligase (EC )
Organisms: Bacteroides thetaiotaomicron, Bacillus cereus, Geobacter metallireducens, Buchnera
5 Cell division protein ftsW
1 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC )
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC )
8 UDP-N-acetylenolpyruvoylglucosamine reductase (EC )
9 Cell division protein ftsQ
2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC )
3 Cell division protein ftsA
6 Cell division protein ftsZ
Organisms: Oceanobacillus iheyensis, Enterococcus faecium DO, Escherichia coli K12, Wigglesworthia brevipalpis
2 Cell division protein ftsA
1 Cell division protein ftsZ
8 Hypothetical protein
10 Hypothetical protein
12 RNA binding protein
7 UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase (EC )
13 Protein translocase subunit secA
The Project: Annotate 1000 Genomes in Three Years By making the task concrete, we force engineering decisions It will be easier to annotate 1000 genomes well than to annotate 50 well (comparative analysis is the key) Analysis by subsystem (rather than by genome) is clearly the key The use of clusters is the key to precise annotation of subsystems
Annotation by Subsystem
Requires knowledge of known variants
Evolution of clusters plays a major role
There are three components of the task:
– Building tools to support analysis
– Actually doing the analysis on subsystems
– Coordinating with groups doing a limited set of wet lab confirmations
FIG: Building the Initial Annotation Tools Releasing the browser/curation tool with approximately genomes within a few months Peer-to-peer updates/synchronization Open source and free (initially for Macs and Linux systems)