Annotations: a subsystems-based approach
Rob Edwards
Argonne National Laboratory
San Diego State University
The Problem (I)
The 1,000th genome will be sequenced soon.
[Figure: complete genomes versus year, 1996 to 2008; genome axis ticks up to 5,000]
www.nmpdr.org www.theseed.org
The Problem (II)
Growth of sequencing versus annotation.
The SEED and NMPDR
The SEED: 504 complete genomes
  Archaea 30
  Bacteria 445
  Eukarya 29
NMPDR: 62 complete genomes
  Campylobacter 10
  Listeria 16
  Staphylococcus 11
  Streptococcus 14
  Vibrio
The questions:
How do you generate consistent annotations for 1,000 genomes?
How do you measure the consistency of annotations for 1,000 genomes?
Basic biology
[Figure: the genes lacI, lacZ, lacY, lacA]
Different types of clustering
[Figure: clustering examples, each labeled < 80%]
Occurrence of clustering in different genomes
[Figure: fraction of genes in clusters (0 to 1) and number of genomes (0 to 120) for each group. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Groups: Aquificae, Chlamydiae, Chloroflexi, Deinococcus-Thermus, Firmicutes, Actinobacteria, Bacteroidetes, Cyanobacteria, Spirochaetes, Thermotogae, Proteobacteria, Average]
The Subsystems Approach to Annotation
A subsystem is a generalization of “pathway”: a collection of functional roles jointly involved in a biological process or complex.
A functional role is the abstract biological function of a gene product. It may be atomic or user-defined. Examples:
  6-phosphofructokinase (EC 2.7.1.11)
  LSU ribosomal protein L31p
  Streptococcal virulence factors
A functional role does not (usually) contain qualifiers such as “putative” or “thermostable”.
A populated subsystem is the complete spreadsheet of functions and roles across organisms.
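One way to picture a populated subsystem is as a small table in which rows are genomes, columns are functional roles, and each cell holds the genes connected to that role. The sketch below is an illustration only and is not the SEED's actual data model: the class name, the subsystem name, the genome names, and the gene identifiers are all hypothetical, and the second role is included only to give the toy table a second column.

```python
from collections import defaultdict


class PopulatedSubsystem:
    """Toy spreadsheet: rows are genomes, columns are functional roles."""

    def __init__(self, name, roles):
        self.name = name
        self.roles = list(roles)  # column order of the spreadsheet
        # cells[genome][role] -> list of gene ids (hypothetical identifiers)
        self.cells = defaultdict(lambda: defaultdict(list))

    def assign(self, genome, role, gene_id):
        """Connect one gene in one genome to a functional role."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of {self.name}")
        self.cells[genome][role].append(gene_id)

    def row(self, genome):
        """Return one spreadsheet row: the genes for each role, in column order."""
        return [list(self.cells[genome][role]) for role in self.roles]


# Hypothetical example data, for illustration only.
toy = PopulatedSubsystem(
    "Toy subsystem",
    ["6-phosphofructokinase (EC 2.7.1.11)", "Pyruvate kinase (EC 2.7.1.40)"],
)
toy.assign("Genome A", "6-phosphofructokinase (EC 2.7.1.11)", "geneA_0012")
toy.assign("Genome B", "Pyruvate kinase (EC 2.7.1.40)", "geneB_0345")

for genome in ("Genome A", "Genome B"):
    print(genome, toy.row(genome))
```

An empty cell means no gene in that genome has been connected to the role, which is the kind of gap a curator would inspect.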
Thermotoga: ~40% of genes covered in 304 different subsystems
Subsystems are developed based on:
  Wet lab
  Chromosomal context
  Metabolic context
  Phylogenetic context
  Microarray data
  Proteomics data
  …
How do we measure annotations?
Natural Metrics
  Number of subsystems defined
  Number of functional roles defined
  Number of genes connected to functional roles
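The three counts above can be sketched from a flat list of gene-to-role connections. This is a minimal illustration under stated assumptions: the tuple layout `(gene, role, subsystem)`, the function name, and the example records are hypothetical, and a subsystem or role with no connected gene would not be counted by this approach.

```python
def natural_metrics(connections):
    """Count distinct subsystems, functional roles, and connected genes.

    connections: iterable of (gene_id, functional_role, subsystem) tuples.
    """
    connections = list(connections)
    return {
        "subsystems": len({subsystem for _, _, subsystem in connections}),
        "functional_roles": len({role for _, role, _ in connections}),
        "connected_genes": len({gene for gene, _, _ in connections}),
    }


# Hypothetical records, for illustration only.
example = [
    ("gene_1", "6-phosphofructokinase (EC 2.7.1.11)", "Subsystem X"),
    ("gene_2", "LSU ribosomal protein L31p", "Subsystem Y"),
    ("gene_3", "LSU ribosomal protein L31p", "Subsystem Y"),
]
print(natural_metrics(example))
```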
Annotations for NMPDR Genomes
Applied Metrics
Number of solid connections of gene to functional role, where “solid” is:
  supported by experimental data
  connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
  only gene in genome connected to a functional role in an active variant of a subsystem
Cross-references to reactions, GO terms, articles, and other databases (number and diversity)
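The “solid” test can be read as a simple rule over three kinds of evidence. The sketch below assumes that any one of the three criteria is enough to call a connection solid, which the slide does not state explicitly; the class, field, and function names are hypothetical, and the evidence flags are assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """One gene-to-functional-role connection with its evidence flags."""
    gene_id: str
    role: str
    experimental_support: bool = False           # supported by experimental data
    clustered_with_subsystem: bool = False       # in a chromosomal cluster with genes from the same subsystem
    sole_gene_in_active_variant: bool = False    # only gene in the genome with this role in an active variant


def is_solid(conn):
    # Assumption: one satisfied criterion is sufficient.
    return (
        conn.experimental_support
        or conn.clustered_with_subsystem
        or conn.sole_gene_in_active_variant
    )


def count_solid(connections):
    """Number of connections that meet at least one criterion."""
    return sum(1 for conn in connections if is_solid(conn))


# Hypothetical connections, for illustration only.
conns = [
    Connection("gene_1", "role A", experimental_support=True),
    Connection("gene_2", "role B", clustered_with_subsystem=True),
    Connection("gene_3", "role C"),
]
print(count_solid(conns), "of", len(conns), "connections are solid")
```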
Applied Metrics
The Importance of Consistency
Consistency: same genes connected to same functional role
  Enables communication
  Required for most comparative genomics assays
hisA
FIG function:
  Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)
Other functions in RefSeq:
  phosphoribosylformimino-5-aminoimidazole carboxamide
  phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
  phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
  1-(5-phosphoribosyl)-5-[(5- phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
  N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5- phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'- phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
  N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4- imidazolecarboxamide isomerase
  Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]
Measuring Consistency
Define a set of protein families such that each family contains genes performing the same function.
Attach functional roles to protein families.
Measure the consistency of the annotations made to genes within each family.
  “Consistency” is the odds that two proteins from the same family have the same function.
Evaluate both families and functions.
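One way to make the definition above concrete is to read “odds” as the probability that two distinct proteins drawn at random from a family carry the same annotation string. For a family of N proteins in which annotation c occurs n_c times, that is the sum over c of n_c(n_c - 1), divided by N(N - 1). The sketch below assumes exact string comparison of annotations; the function name and the example labels are hypothetical.

```python
from collections import Counter


def family_consistency(functions):
    """Probability that two distinct proteins in a family share an annotation.

    functions: list of annotation strings, one per protein in the family.
    Returns None for families with fewer than two proteins (no pairs exist).
    """
    n = len(functions)
    if n < 2:
        return None
    counts = Counter(functions)
    # Ordered pairs of distinct proteins that carry the same annotation.
    same_pairs = sum(c * (c - 1) for c in counts.values())
    return same_pairs / (n * (n - 1))


# Hypothetical family: three proteins share one label, one differs.
family = ["role A", "role A", "role A", "role A variant spelling"]
print(family_consistency(family))  # 6 matching ordered pairs out of 12 -> 0.5
```

Because the comparison is on exact strings, spelling or spacing variants of the same function count as disagreements.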
Consistency among databases
Number of RefSeq proteins in families
How to measure accuracy
If everything were called “hypothetical protein”, the database would be 100% consistent.
Need to measure accuracy (specificity) as well as consistency:
  Sample 100 proteins at random from the “curated” set (i.e. those believed to be correct)
  Manually inspect annotations to score correctness
TIGR/SEED joint project for an annotation consistency server
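The sampling step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual procedure: the function names and protein identifiers are hypothetical, and the manual scores shown are made-up placeholders standing in for a curator's judgments.

```python
import random


def sample_for_review(curated_ids, k=100, seed=None):
    """Draw up to k protein ids at random, without replacement."""
    ids = list(curated_ids)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(ids, min(k, len(ids)))


def accuracy(scores):
    """Fraction of reviewed proteins judged correct.

    scores: dict mapping protein id -> True (correct) or False (incorrect).
    """
    if not scores:
        raise ValueError("no reviewed proteins")
    return sum(scores.values()) / len(scores)


# Hypothetical curated set and placeholder review outcomes.
curated = [f"protein_{i}" for i in range(1000)]
picked = sample_for_review(curated, k=100, seed=0)
placeholder_scores = {pid: (i < 95) for i, pid in enumerate(picked)}
print(len(picked), "proteins sampled; placeholder accuracy:", accuracy(placeholder_scores))
```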
Acknowledgements
SEED: http://www.theseed.org/
NMPDR: http://www.nmpdr.org/
RAST: http://www.nmpdr.org/anno-server
metaRAST: http://metagenomics.theseed.org/
ANL: Rick Stevens, Bob Olsen, Folker Meyer, Daniela Bartels, Tobi Paczian, Daniel Paarmann, Terry Disz
FIG: Veronika Vonstein, Ross Overbeek, Annotators