Annotations: a subsystems-based approach. Rob Edwards, Argonne National Labs, San Diego State University
The Problem (I) [Figure: number of complete genomes by year.] The 1,000th genome will be sequenced soon.
The Problem (II) Growth of sequencing versus annotation.
The SEED and NMPDR
The SEED: 504 complete genomes (Archaea 30, Bacteria 445, Eukarya 29)
NMPDR: 62 complete genomes (Campylobacter 10, Listeria 16, Staphylococcus 11, Streptococcus 14, Vibrio 11)
The questions: How do you generate consistent annotations for 1,000 genomes? How do you measure the consistency of annotations for 1,000 genomes?
Basic biology: lacZ, lacI, lacY, lacA
Different types of clustering < 80 %
Occurrence of clustering in different genomes [Figure: per-phylum chart covering Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Spirochaetes, Thermotogae, and Proteobacteria. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Axes: fraction of genes in clusters; number of genomes. An average is also shown.]
The Subsystems Approach to Annotation
Subsystem is a generalization of “pathway”
–collection of functional roles jointly involved in a biological process or complex
Functional Role is the abstract biological function of a gene product
–atomic, or user-defined, examples:
   6-phosphofructokinase (EC )
   LSU ribosomal protein L31p
   Streptococcal virulence factors
–does not (usually) contain “putative”, “thermostable”, etc.
Populated subsystem is the complete spreadsheet of functions and roles across organisms
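One way to picture a populated subsystem is as a table with one row per organism and one column per functional role, where each cell holds the genes assigned to that role. The sketch below is illustrative only: the class, the subsystem name, the genome names, and the gene identifiers are all hypothetical, and it is not the SEED's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subsystem:
    """Toy model of a populated subsystem (not the SEED's real schema)."""
    name: str
    roles: List[str]  # functional roles = spreadsheet columns
    # cells[genome][role] -> gene IDs assigned to that role in that genome
    cells: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def assign(self, genome: str, role: str, gene_id: str) -> None:
        """Connect a gene to one of the subsystem's functional roles."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of {self.name!r}")
        self.cells.setdefault(genome, {}).setdefault(role, []).append(gene_id)

    def row(self, genome: str) -> List[List[str]]:
        """One spreadsheet row: the genes for each role, in column order."""
        by_role = self.cells.get(genome, {})
        return [by_role.get(role, []) for role in self.roles]

    def missing_roles(self, genome: str) -> List[str]:
        """Roles with no gene assigned in this genome."""
        by_role = self.cells.get(genome, {})
        return [role for role in self.roles if not by_role.get(role)]


# Hypothetical subsystem, genomes, and gene IDs, for illustration only.
toy = Subsystem("Toy glycolysis", ["6-phosphofructokinase", "Pyruvate kinase"])
toy.assign("genome_A", "6-phosphofructokinase", "gene_A_0001")
toy.assign("genome_A", "Pyruvate kinase", "gene_A_0002")
toy.assign("genome_B", "Pyruvate kinase", "gene_B_0107")

for genome in ("genome_A", "genome_B"):
    print(genome, toy.row(genome), "missing:", toy.missing_roles(genome))
```

Storing a list of genes per cell, rather than a single gene, leaves room for genomes that carry more than one gene for the same role.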
Thermotoga: ~40% of genes covered in 304 different subsystems
Subsystems developed based on: wet lab, chromosomal context, metabolic context, phylogenetic context, microarray data, proteomics data, …
How do we measure annotations?
Natural Metrics
Number of subsystems defined
Number of functional roles defined
Number of genes connected to functional roles
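The three natural metrics are simple counts, and could be computed along the following lines. This is a minimal sketch that assumes the connections are available as flat (gene, functional role, subsystem) tuples; that layout, the function name, and the example identifiers are assumptions made for illustration.

```python
def natural_metrics(connections):
    """Count subsystems, functional roles, and connected genes.

    `connections` is an iterable of (gene_id, functional_role, subsystem)
    tuples; this flat layout is an assumption made for the example.
    """
    connections = list(connections)  # allow generators; we iterate three times
    return {
        "subsystems": len({subsystem for _, _, subsystem in connections}),
        "functional_roles": len({role for _, role, _ in connections}),
        "connected_genes": len({gene for gene, _, _ in connections}),
    }


# Hypothetical gene IDs and subsystem names.
example = [
    ("gene_1", "6-phosphofructokinase", "Toy glycolysis"),
    ("gene_2", "6-phosphofructokinase", "Toy glycolysis"),
    ("gene_3", "LSU ribosomal protein L31p", "Toy ribosome"),
]
print(natural_metrics(example))
```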
Annotations for NMPDR Genomes
Applied Metrics
Number of solid connections of gene to functional role, where “solid” is
1. supported by experimental data
2. connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
3. only gene in genome connected to a functional role in an active variant of a subsystem
Reactions, GO terms, articles, cross-references to other databases (number and diversity)
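The "solid" test can be sketched as a filter over gene-to-role connections. The sketch assumes that meeting any one of the three criteria is enough, that experimental support and chromosomal clustering have already been determined and recorded as flags, and that each record notes whether the genome has an active variant of the subsystem. The record layout and all names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One gene-to-functional-role connection (hypothetical record layout)."""
    gene_id: str
    genome: str
    role: str
    subsystem: str
    experimental: bool = False    # criterion 1: supported by experimental data
    clustered: bool = False       # criterion 2: clustered with same-subsystem genes
    active_variant: bool = True   # genome has an active variant of the subsystem


def solid_connections(connections):
    """Return the connections that meet at least one 'solid' criterion."""
    connections = list(connections)
    # How many genes in each genome are connected to each role of each subsystem.
    genes_per_role = Counter((c.genome, c.subsystem, c.role) for c in connections)
    solid = []
    for c in connections:
        # Criterion 3: the only gene in the genome connected to this role,
        # in a genome with an active variant of the subsystem.
        only_gene = c.active_variant and genes_per_role[(c.genome, c.subsystem, c.role)] == 1
        if c.experimental or c.clustered or only_gene:
            solid.append(c)
    return solid


# Hypothetical connections.
demo = [
    Connection("g1", "A", "role1", "ss", experimental=True),
    Connection("g2", "A", "role2", "ss"),   # two genes share role2 in genome A
    Connection("g3", "A", "role2", "ss"),
    Connection("g4", "A", "role3", "ss", clustered=True),
    Connection("g5", "B", "role1", "ss"),   # only gene for role1 in genome B
    Connection("g6", "C", "role1", "ss", active_variant=False),
]
print([c.gene_id for c in solid_connections(demo)])
```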
Applied Metrics
The Importance of Consistency
Consistency: same genes connected to same functional role
Enables communication
Required for most comparative genomics assays
hisA
FIG function: Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC )
Other functions in RefSeq:
   phosphoribosylformimino-5-aminoimidazole carboxamide
   phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
   phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
   1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
   N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5-phosphoribosyl)-4-imidazolecarboxamide isomerase
   N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
   Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]
Measuring Consistency
Define a set of protein families such that each family contains genes performing the same function
Attach functional roles to protein families
Measure the consistency of the annotations made to genes within each family
1. "consistency" is the odds that two proteins from the same family have the same function
2. Evaluate both families and functions.
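One way to make this measure concrete is to read "odds" as the fraction of protein pairs within a family that carry the same function string. The sketch below takes that reading, which is an interpretation rather than a confirmed definition, and compares annotations by exact string match; the function names and the example families are hypothetical.

```python
from collections import Counter


def pair_counts(functions):
    """Return (pairs with identical annotation, all pairs) for one family."""
    n = len(functions)
    total = n * (n - 1) // 2
    same = sum(k * (k - 1) // 2 for k in Counter(functions).values())
    return same, total


def family_consistency(functions):
    """Fraction of protein pairs in a family with the same annotation.

    Returns None for families with fewer than two proteins (no pairs).
    """
    same, total = pair_counts(functions)
    return same / total if total else None


def pooled_consistency(families):
    """Consistency over all within-family pairs, pooled across families."""
    same = total = 0
    for functions in families.values():
        s, t = pair_counts(functions)
        same += s
        total += t
    return same / total if total else None


# Hypothetical families; annotations are compared as exact strings.
families = {
    "fam1": ["isomerase", "isomerase", "Phosphoribosyl isomerase A"],
    "fam2": ["hypothetical protein", "hypothetical protein"],
}
print(family_consistency(families["fam1"]))
print(pooled_consistency(families))
```

With exact string matching, spelling variants of the same function count as disagreements, so any normalization of names before comparison is a separate design decision.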
Consistency among databases
Number of RefSeq proteins in families
How to measure accuracy
If everything were called “hypothetical protein”, the database would be 100% consistent
Need to measure accuracy (specificity) as well as consistency
Sample 100 proteins at random from the “curated” set (i.e. those believed to be correct)
Manually inspect annotations to score correctness
TIGR/SEED joint project for annotation consistency server
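The sampling step could look like the following minimal sketch. It assumes the curated set is available as a collection of protein identifiers and that each manual inspection is recorded as a correct/incorrect flag; the function names, identifiers, and fixed seed are assumptions made for illustration.

```python
import random


def sample_for_review(curated_ids, k=100, seed=0):
    """Draw up to k protein IDs uniformly at random, without replacement.

    Sorting first and using a seeded generator makes the draw reproducible.
    """
    ids = sorted(curated_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def accuracy(review):
    """Fraction of reviewed annotations judged correct.

    `review` maps protein ID -> True (correct) or False (incorrect).
    """
    if not review:
        raise ValueError("no reviewed proteins")
    return sum(1 for ok in review.values() if ok) / len(review)


# Hypothetical curated set of 1,000 protein IDs.
curated = {f"protein_{i:04d}" for i in range(1000)}
to_review = sample_for_review(curated)
print(len(to_review))

# After manual inspection, record the verdicts (made-up values here).
verdicts = {"protein_0001": True, "protein_0002": True, "protein_0003": False}
print(accuracy(verdicts))
```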
Acknowledgements
FIG: Veronika Vonstein, Ross Overbeek, Annotators
ANL: Rick Stevens, Bob Olsen, Folker Meyer, Daniela Bartels, Tobi Paczian, Daniel Paarmann, Terry Disz
SEED: NMPDR: RAST: metaRAST: